Plasma GitLab Archive
Projects Blog Knowledge

BLOG ON CAMLCITY.ORG: GS

Immutable strings in OCaml-4.02

Why the concept is not good enough - by Gerd Stolpmann, 2014-07-04

In the upcoming release 4.02 of the OCaml programming language, the type string can be made immutable by a compiler switch. Although this won't be the default yet, this should be seen as the announcement of a quite disruptive change in the language. Eventually this will be the default in a future version. In this article I explain why I disagree with this particular plan, and which modifications would be better.

Of course, the fact that string is mutable doesn't fit well into a functional language. Nevertheless, it has been seen as acceptable for a long time, probably because the developers of OCaml did not pay much attention to strings, and felt that the benefits of a somewhat cleaner concept wouldn't outweigh the practical disadvantages of immutable strings. Apparently, this attitude changed, and we will see a new bytes type in OCaml-4.02. This type is accompanied by a Bytes module with library functions supporting it. The compiler was also extended so that string and bytes can be used interchangably by default. If, however, the -safe-strings switch is set on the command-line, the compiler sees string and bytes as two completely separate types.

This is a disruptive change (if enabled): Almost all code bases will need modifications in order to be compatible with the new concept. Although this will often be trivial, there are also harder cases where strings are frequently used as buffers. Before discussing that a bit more in detail, let me point out why such disruptive changes are so problematic. So far there was an implicit guarantee that your code will be compatible to new compiler versions if you stick to the well-established parts of the language and avoid experimental additions. I have in deed code that was developed for OCaml-1.03 (the first version I checked out), and that code still runs. Especially in a commercial context this is a highly appreciated feature, because this protects the investment in the code base. As I'm trying to sell OCaml to companies in my carreer this is a point that bothers me. Giving up this history of excellent backward compatibility is something we shouldn't do easily, and if so, only if we get something highly valuable back. (Of course, if you only look at the open source and academic use of OCaml, you'll put less emphasis on the compatibility point, but it's also not completely unimportant there.)

The problem

I'm fully aware that immutable strings fix some problems (the worst probably: so far even string literals can be mutated, which can be very surprising). However, creating a completely new type bytes comes also with some disadvantages:

  • Lack of generic accessor functions: There is String.get and there is Bytes.get. The shorthand s.[k] is now restricted to strings. This is mostly a stylistic problem.
  • The conversion of string to bytes and vice versa requires a copy: Bytes.of_string, and Bytes.to_string. You have to pay a performance penalty.
  • In practical programming, there is sometimes no clear conceptual distinction between string data that are read-only and those that require mutation. For example, if you add data to a buffer, the data may come from a string or from another buffer. So how do you type such an add function?
This latter point is, in my opinion, the biggest problem. Let's assume we wanted to reimplement the Lexing module of the standard library in pure OCaml without resorting to unsafe coding (currently it's done in C). This module implements the lexing buffer that backs the lexers generated with ocamllex. We now have to use bytes for the core of this buffer. There are three functions in Lexing for creating new buffers:
val from_channel : in_channel -> lexbuf
val from_string : string -> lexbuf
val from_function : (string -> int -> int) -> lexbuf
The first observation is that we'll better offer two more constructors to the users of this module:
val from_bytes : bytes -> lexbuf
val from_bytes_function : (bytes -> int -> int) -> lexbuf
So why do we need the ability to read from bytes, i.e. copy from one buffer to the other? We could just be a bad host and don't offer these functions to the users of the module. However, it's unavoidable anyway for from_channel, because I/O buffers are of course bytes:
let from_channel ch =
  from_bytes_function (Pervasives.input ch)
So whenever we implement buffers that also include I/O capabilities, it is likely that we need to handle both the bytes and the string case. This is not only a problem for the interface design. Because string and bytes are completely separated, we need two different implementations: from_string and from_bytes cannot share much code.

This is the ironical part of the new concept: Although it tries to make the handling of strings more sound and safe, the immediate consequence in reality is that code needs to be duplicated because of missing polymorphisms. Any half-way intelligent programmer will of course fall back to unsafe functions for casting bytes to strings and vice versa (Bytes.unsafe_to_string and Bytes.unsafe_of_string), and this only means that the new -safe-strings option will be a driving force for using unsafe language features.

Let's look at three modifications of the concept. Is there some easy fix?

Idea 1: string as a supertype of bytes

We just allow that bytes can officially be coerced to string:

let s = (b : bytes :> string)

Of course, this weakens the immutability property: string may now be a read-only interface for a bytes buffer, and this buffer can be mutated, and this mutation can be observed through the string type:

let mutable_string() =
  let b = Bytes.make 1 'X' in
  let s = (b :> string) in
  (s, Bytes.set 0)

let (s, set) = mutable_string()
(* s is now "X" *)
let () = set 'Y'
(* s is now "Y" *)

Nevertheless, this concept is not meaningless. In particular, if a function takes a string argument, it is guaranteed that the string isn't modified. Also, string literals are immutable. Only when a function returns a string, we cannot be sure that the string isn't modified by a side effect.

This variation of the concept also solves the polymorphism problem we explained at the example of the Lexing module: It is now sufficient when we implement Lexing.from_string, because bytes can always be coerced to string:

let from_bytes s =
  from_string (s :> string)

Idea 2: Add a read-only type stringlike

Some people may feel uncomfortable with the implication of Idea 1 that the immutability of string can be easily circumvented. This can be avoided with a variation: Add a third type stringlike as the common supertype of both string and bytes. So we allow:

let sl1 = (s : string :> stringlike)
let sl2 = (b : bytes :> stringlike)
Of course, stringlike doesn't implement mutators (like string). It is nevertheless different from string:
  • string is considered as absolutely immutable (there is no way to coerce bytes to string)
  • stringlike is seen as the read-only API for either string or bytes, and it is allowed to mutate a stringlike behind the back of this API

stringlike is especially interesting for interfaces that need to be compatible to both string and bytes. In the Lexing example, we would just define

val from_stringlike : stringlike -> lexbuf
val from_stringlike_function : (stringlike -> int -> int) -> lexbuf
and then reduce the other constructors to just these two, e.g.
let from_string s =
  from_stringlike (s :> stringlike)

let from_bytes b =
  from_stringlike (b :> bytes)
These other constructors are now only defined for the convenience of the user.

Idea 3: Base bytes on bigarrays

This idea doesn't fix any of the mentioned problems. Instead, the thinking is: If we already accept the incompatibility between string and bytes, let's at least do in a way so that we get the maximum out of it. Especially for I/O buffers, bigarrays are way better suited than strings:

  • I/O primitives can directly pass the bigarrays to the operating system (no need for an intermediate buffer as it is currently the case for Unix.read and Unix.write)
  • Bigarrays support the slicing of buffers (i.e. you can reference subbuffers directly)
  • Bigarrays can be aligned to page boundaries (which is accelerated for some operating systems when used for I/O)

So let's define:

type bytes =
  (char,Bigarray.int8_unsigned_elt,Bigarray.c_layout) Bigarray.Array1.t
Sure, there is now no way to unsafely cast strings to bytes and vice versa anymore, but arguably we shouldn't prefer a design over the other only for it's unsafety.

Regarding stringlike, it is in deed possible to define it, but there is some runtime cost. As string and bytes have now different representations, any accessor function for stringlike would have to check at runtime whether it is backed by a string or by bytes. At least, this check is very cheap.

Conclusion

I hope it has become clear that the current plan is not far reaching enough, as the programmer would have to choose between bad alternatives: either pay a runtime penalty for additional copying and accept that some code needs to be duplicated, or use unsafe coercion between string and bytes. The latter is not desirable, of course, but it is surely the task of the language (designer) to make sound and safe string handling an attractive option. I've presented three ideas that would all improve the concept in some respect. In particular, the combination of the ideas 2 and 3 seems to be very attractive: back bytes by bigarrays, and provide an stringlike supertype for easing the programming of application buffers.
Gerd Stolpmann works as O'Caml consultant
This web site is published by Informatikbüro Gerd Stolpmann
Powered by Caml