Why the concept is not good enough
- by Gerd Stolpmann,
2014-07-04
In the upcoming release 4.02 of the OCaml programming language, the type
string
can be made immutable by a compiler
switch. Although this won't be the default yet, this should be seen as
the announcement of a quite disruptive change in the
language. Eventually this will be the default in a future version. In
this article I explain why I disagree with this particular plan, and
which modifications would be better.
Of course, the fact that string
is mutable doesn't fit
well into a functional language. Nevertheless, it has been seen as
acceptable for a long time, probably because the developers of OCaml
did not pay much attention to strings, and felt that the benefits of a
somewhat cleaner concept wouldn't outweigh the practical disadvantages
of immutable strings. Apparently, this attitude changed, and we will
see a new bytes
type in OCaml-4.02. This type is
accompanied by a Bytes
module with library functions
supporting it. The compiler was also extended so
that string
and bytes
can be used
interchangably by default. If, however, the -safe-strings
switch is set on the command-line, the compiler
sees string
and bytes
as two completely
separate types.
This is a disruptive change (if enabled): Almost all code bases will
need modifications in order to be compatible with the new
concept. Although this will often be trivial, there are also harder
cases where strings are frequently used as buffers. Before discussing
that a bit more in detail, let me point out why such disruptive
changes are so problematic. So far there was an implicit guarantee
that your code will be compatible to new compiler versions if you
stick to the well-established parts of the language and avoid
experimental additions. I have in deed code that was developed for
OCaml-1.03 (the first version I checked out), and that code still
runs. Especially in a commercial context this is a highly appreciated
feature, because this protects the investment in the code base. As I'm
trying to sell OCaml to companies in my carreer this is a point that
bothers me. Giving up this history of excellent backward compatibility
is something we shouldn't do easily, and if so, only if we get something
highly valuable back. (Of course, if you only look at the open source
and academic use of OCaml, you'll put less emphasis on the compatibility
point, but it's also not completely unimportant there.)
The problem
I'm fully aware that immutable strings fix some problems (the
worst probably: so far even string literals can be mutated, which can be
very surprising). However, creating a completely new type bytes
comes also with some disadvantages:
- Lack of generic accessor functions: There is
String.get
and
there is Bytes.get
. The shorthand s.[k]
is now
restricted to strings. This is mostly a stylistic problem.
- The conversion of string to bytes and vice versa requires a copy:
Bytes.of_string
, and Bytes.to_string
. You have
to pay a performance penalty.
- In practical programming, there is sometimes no clear conceptual
distinction between string data that are read-only and those that require
mutation. For example, if you add data to a buffer, the data may come from
a string or from another buffer. So how do you type such an
add
function?
This latter point is, in my opinion, the biggest problem. Let's assume
we wanted to reimplement the
Lexing
module of the
standard library in pure OCaml without resorting to unsafe coding
(currently it's done in C). This module implements the lexing buffer
that backs the lexers generated with ocamllex. We now have to
use
bytes
for the core of this buffer. There are three
functions in
Lexing
for creating new buffers:
val from_channel : in_channel -> lexbuf
val from_string : string -> lexbuf
val from_function : (string -> int -> int) -> lexbuf
The first observation is that we'll better offer two more constructors
to the users of this module:
val from_bytes : bytes -> lexbuf
val from_bytes_function : (bytes -> int -> int) -> lexbuf
So why do we need the ability to read from
bytes
,
i.e. copy from one buffer to the other? We could just be a bad host
and don't offer these functions to the users of the module. However,
it's unavoidable anyway for
from_channel
, because I/O
buffers are of course
bytes
:
let from_channel ch =
from_bytes_function (Pervasives.input ch)
So whenever we implement buffers that also include I/O capabilities,
it is likely that we need to handle both the
bytes
and
the
string
case. This is not only a problem for the
interface design. Because
string
and
bytes
are completely separated, we need two different
implementations:
from_string
and
from_bytes
cannot share much code.
This is the ironical part of the new concept: Although it tries to
make the handling of strings more sound and safe, the immediate
consequence in reality is that code needs to be duplicated because of
missing polymorphisms. Any half-way intelligent programmer will of
course fall back to unsafe functions for casting bytes to strings and
vice versa (Bytes.unsafe_to_string
and Bytes.unsafe_of_string
), and this only means
that the new -safe-strings
option will be a driving force
for using unsafe language features.
Let's look at three modifications of the concept. Is there some easy
fix?
Idea 1: string
as a supertype of bytes
We just allow that bytes
can officially be
coerced to string
:
let s = (b : bytes :> string)
Of course, this weakens the immutability property: string
may now be a read-only interface for a bytes
buffer, and
this buffer can be mutated, and this mutation can be observed through
the string
type:
let mutable_string() =
let b = Bytes.make 1 'X' in
let s = (b :> string) in
(s, Bytes.set 0)
let (s, set) = mutable_string()
(* s is now "X" *)
let () = set 'Y'
(* s is now "Y" *)
Nevertheless, this concept is not meaningless. In particular, if a
function takes a string argument, it is guaranteed that the string
isn't modified. Also, string literals are immutable. Only when a
function returns a string, we cannot be sure that the string isn't
modified by a side effect.
This variation of the concept also solves the polymorphism problem we
explained at the example of the Lexing
module: It is now
sufficient when we implement Lexing.from_string
, because
bytes
can always be coerced to string
:
let from_bytes s =
from_string (s :> string)
Idea 2: Add a read-only type stringlike
Some people may feel uncomfortable with the implication of Idea 1 that
the immutability of string
can be easily circumvented.
This can be avoided with a variation: Add a third type
stringlike
as the common supertype of both
string
and bytes
. So we allow:
let sl1 = (s : string :> stringlike)
let sl2 = (b : bytes :> stringlike)
Of course,
stringlike
doesn't implement mutators (like
string
). It is nevertheless different from
string
:
string
is considered as absolutely immutable (there is no
way to coerce bytes
to string
)
stringlike
is seen as the read-only API for either
string
or bytes
, and it is allowed to mutate
a stringlike
behind the back of this API
stringlike
is especially interesting for interfaces that
need to be compatible to both string
and bytes
.
In the Lexing
example, we would just define
val from_stringlike : stringlike -> lexbuf
val from_stringlike_function : (stringlike -> int -> int) -> lexbuf
and then reduce the other constructors to just these two, e.g.
let from_string s =
from_stringlike (s :> stringlike)
let from_bytes b =
from_stringlike (b :> bytes)
These other constructors are now only defined for the convenience
of the user.
Idea 3: Base bytes
on bigarrays
This idea doesn't fix any of the mentioned problems. Instead, the
thinking is: If we already accept the incompatibility
between string
and bytes
, let's at least do
in a way so that we get the maximum out of it. Especially for I/O
buffers, bigarrays are way better suited than strings:
- I/O primitives can directly pass the bigarrays to the operating
system (no need for an intermediate buffer as it is currently the case
for
Unix.read
and Unix.write
)
- Bigarrays support the slicing of buffers (i.e. you can reference
subbuffers directly)
- Bigarrays can be aligned to page boundaries (which is accelerated
for some operating systems when used for I/O)
So let's define:
type bytes =
(char,Bigarray.int8_unsigned_elt,Bigarray.c_layout) Bigarray.Array1.t
Sure, there is now no way to unsafely cast strings to bytes and vice
versa anymore, but arguably we shouldn't prefer a design over the other
only for it's unsafety.
Regarding stringlike
, it is in deed possible to define it,
but there is some runtime cost. As string
and bytes
have now different representations, any accessor function for
stringlike
would have to check at runtime whether it is
backed by a string
or by bytes
. At least, this
check is very cheap.
Conclusion
I hope it has become clear that the current plan is not far reaching
enough, as the programmer would have to choose between bad alternatives:
either pay a runtime penalty for additional copying and accept that
some code needs to be duplicated, or use unsafe coercion
between
string
and
bytes
. The latter is not
desirable, of course, but it is surely the task of the language
(designer) to make sound and safe string handling an attractive option.
I've presented three ideas that would all improve the concept in
some respect. In particular, the combination of the ideas 2 and 3
seems to be very attractive: back
bytes
by bigarrays,
and provide an
stringlike
supertype for easing the
programming of application buffers.