Blog on camlcity.org

Blog on camlcity.org http://blog.camlcity.org en Articles by Gerd Stolpmann about O'Caml WasiCaml: Translate OCaml Code to WebAssembly http://blog.camlcity.org/blog/wasicaml1.html http://blog.camlcity.org/blog/wasicaml1.html 15 Jul 2021 00:00 GMT <div> <b>The portability story behind WasiCaml</b><br/>  </div> <div> For a recent project we wrote a compiler that translates a domain-specific language (DSL) to some runnable form, and we did that in OCaml. The DSL is now part of an Electron-based integrated development environment (IDE) that will soon be available from <a href="https://remixlabs.com">Remix Labs</a>. Electron runs on a couple of operating systems, but the DSL compiler originally did not. How could we run the DSL compiler on as many operating systems as possible? This was the question we faced when starting the development of WasiCaml, a translator from OCaml bytecode to WebAssembly. </div> <div> <p> Of course, Electron is just an example of a cross-platform environment. You can develop apps for Mac, Windows, and Linux, and it is Javascript-based. We picked Electron for porting the user interface of the IDE to the desktop - originally the IDE was written for the web, and the DSL compiler was running on a server backing the web app. Initially, the Electron version of the IDE simply started a native binary of the DSL compiler as a server process that ran in the background, just as we did for the web. This, however, brings back the cross-build problem that you actually want to avoid by running something in Electron: we would have needed to set up several build pipelines, one for each OS, in order to build the DSL compiler for the targets we wanted to support. </p><p> There are already tools to translate OCaml to Javascript (namely Bucklescript and js_of_ocaml), and we could have used these to fiddle the DSL compiler into the Javascript code base. 
However, this did not feel right: we would have had to reorganize the OCaml code base because you can't link in C libraries, and driving the DSL compiler would have been quite adventurous (it talks via a bidirectional pipeline to its clients). At that time we were already exploring WebAssembly for other parts of the system, and the idea came up to also use WebAssembly for running the DSL compiler. The <a href="https://github.com/remixlabs/wasicaml/">WasiCaml</a> project was born (with the translation to Javascript kept only as plan B, should this turn out to be more difficult than expected). </p><h2>A quick intro to WebAssembly</h2> <p> As the name suggests, WebAssembly provides a fairly low-level virtual machine for running the code. The instructions are comparable to the ones you find in a CPU, e.g. load, store, arithmetic. The code is structured into functions which take a fixed number of parameters and return a single result. The functions can have local variables that can be read and written by the code. The parameters and variables can have one of four numeric types (i32, i64, f32, and f64). </p><p> For example, this is a WebAssembly module with just one function that loads a 32-bit number from a memory location, adds one to it, and returns the result: </p><blockquote> <code style="white-space: pre">
(module
  (import "env" "memory" (memory $memory 1))
  (func $incr (export "incr") (param $x i32) (result i32)
    (local.get $x)
    (i32.load)
    (i32.const 1)
    (i32.add)
    (return)
  )
)
</code> </blockquote> <p> Here, the code is given in the textual format known as WAT. For running it, you first need to convert it to the binary format (WASM), e.g. with a tool like <a href="https://github.com/WebAssembly/wabt">wat2wasm</a>. </p><p> Also note that there is an operand stack: <code>local.get</code> pushes the result on this stack, and <code>i32.load</code> loads the number from the address found on the stack, and also pushes the result on the stack. 
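</p><p> The effect of these stack operations can be imitated with a toy interpreter. The following OCaml sketch is only an illustration, not a real WebAssembly engine: the names <code>instr</code>, <code>run</code> and <code>incr_body</code> are made up for this example, only the four instructions of <code>$incr</code> are covered, and memory is simplified to an array of 32-bit words (real linear memory is byte-addressed). </p>

```ocaml
(* Toy model of the WebAssembly operand stack. Only the instructions used
   by $incr are modelled; this is an illustration, not an engine. *)
type instr =
  | Local_get of int    (* push the value of a local variable *)
  | I32_load            (* pop an address, push the word stored there *)
  | I32_const of int32  (* push a constant *)
  | I32_add             (* pop two values, push their sum *)

(* Simplification: memory is an array of 32-bit words indexed by word
   number, whereas real linear memory is byte-addressed. *)
let run (memory : int32 array) (locals : int32 array) (code : instr list)
    : int32 =
  let step stack instr =
    match instr, stack with
    | Local_get i, _ -> locals.(i) :: stack
    | I32_load, addr :: rest -> memory.(Int32.to_int addr) :: rest
    | I32_const c, _ -> c :: stack
    | I32_add, b :: a :: rest -> Int32.add a b :: rest
    | _ -> failwith "operand stack underflow"
  in
  match List.fold_left step [] code with
  | [result] -> result
  | _ -> failwith "expected exactly one result on the stack"

(* The body of $incr from the WAT example, without the final return *)
let incr_body = [Local_get 0; I32_load; I32_const 1l; I32_add]

let () =
  let memory = [| 10l; 41l; 7l |] in
  (* $x = 1: loads memory.(1) and adds one *)
  Printf.printf "%ld\n" (run memory [| 1l |] incr_body)
```

<p>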
This stack is mainly meant to express the code in a very compact way. The engine running the code normally translates the stack operations into a more efficient form before starting up. </p><p> A WebAssembly VM is equipped with linear memory, i.e. the memory addresses go from 0 to a maximum address, without fragmentation, and without address ranges supporting special semantics like mapped files. The memory is only used for data - the running code is inaccessible (i.e. the VM has a Harvard architecture), and this also includes the call stack and other parts of the VM (e.g. you cannot iterate over the local variables of the functions). In order to also support indirect jumps, there is a way to reference functions by numeric IDs. </p><p> Typically, WebAssembly VMs translate the code to the native instruction set of the host before running it (often as JIT compilers, but there are now also engines doing the translation statically ahead of time, and producing native binaries), and these engines almost reach native speed. All current browsers support WebAssembly now, and it is also present in other Javascript-based environments (like node, or the Electron platform). Although it started as a web technology, WebAssembly is not limited to the web. For example, <a href="https://docs.wasmtime.dev/">wasmtime</a> allows you to embed a WebAssembly engine into almost any environment - e.g. you could embed the engine into an application server written in Go. In this case, there is no Javascript involved at all. </p><h2>WASI</h2> <p> While the WebAssembly standard defines how to express the code and how to run it, there is still the question of how to use it with popular languages like C and Rust. The <a href="https://wasi.dev/">WASI</a> standard is an ABI that answers a lot of the questions. As an ABI it defines calling conventions, but it is not limited to that. 
In particular, there is a version of libc that defines a Unix-like set of base functionality the language-specific runtime can use. Also, WASI defines a set of host functions that play a role comparable to system calls in the WebAssembly world, and that allow access to files, the process environment, and the current time. With the help of WASI you can compile many C or Rust libraries to WebAssembly, and the porting effort is low. </p><p> WASI is a multi-lingual environment, and you can in particular link code written in different languages into the same executable. This is possible because the language-specific runtimes have a common foundation (libc), and e.g. memory allocated from one language also counts as "taken" within the other language. </p><p> WASI is still in an early stage. While developing with it I discovered a couple of bugs, but the functionality is already impressive and usable for many purposes. </p><h2>WasiCaml</h2> <p> So now, what is WasiCaml, and how can I use it? </p><p> Let's assume you have a bytecode executable created by something like </p><blockquote> <code style="white-space: pre"> ocamlc -o myexecutable mycode.ml </code> </blockquote> <p>Now, you can further translate the bytecode executable to WebAssembly: </p><blockquote> <code style="white-space: pre"> wasicaml -o mywasm.wasm myexecutable </code> </blockquote> <p>If you want to run this executable, you need a specially configured WebAssembly engine which can be found in ~/.wasicaml/js after installation: </p><blockquote> <code style="white-space: pre"> node ~/.wasicaml/js/main.js ./mywasm.wasm ./mywasm.wasm arg ... </code> </blockquote> <p>The <code>mywasm.wasm</code> binary is portable and can be run everywhere! 
</p><p>For simplicity, wasicaml can also generate a wrapper that hides the <code>node</code> invocation, and this is triggered by just omitting the .wasm suffix: </p><blockquote> <code style="white-space: pre"> wasicaml -o mywasm myexecutable </code> </blockquote> <p>Now you can run the program simply with <code>./mywasm</code> (but note that the wrapper is not portable). </p><p>Another option is to link in C libraries, e.g. </p><blockquote> <code style="white-space: pre"> wasicaml -o mywasm.wasm myexecutable -cclib ~/.wasicaml/lib/ocaml/libunix.a </code> </blockquote> Of course, the C library must also be WASI-compatible. <p>Note that WasiCaml-produced code cannot yet be run with wasmtime or wasmer, in particular because there is no machinery for exception handling in these engines. Browsers are fully supported, though. </p><h2>The WasiCaml project</h2> <p>WebAssembly is still a very new technology and information about it is scarce. For example, it took a while until I understood that LLVM includes a full-featured assembler for WebAssembly, i.e. you can feed it a <code>code.s</code> file, and you get a <code>code.o</code> file back with partially linked WebAssembly code. This is documented nowhere, and I could only figure out some parts of the assembler syntax by reading the source code of LLVM. </p><p>What I already knew from an earlier WebAssembly project is that there is no exception handling (EH) mechanism yet in the standard (although this will likely change soon). This turned out to be a particular problem for WasiCaml, because the OCaml runtime uses long jumps in external C code to trigger OCaml exceptions. I remembered the way the <a href="https://emscripten.org">Emscripten</a> toolchain (which is another wrapper around LLVM) gets around this difficulty. 
If the host language is Javascript, embedded WebAssembly code is compiled to run in the same VM that is also used to execute Javascript itself, and this means that Javascript exceptions also work perfectly for WebAssembly! Of course, this trick is really limited to Javascript hosts, but at least I could remove the blocker for one of the possible execution environments. </p><p>The very first task was then to get the OCaml bytecode interpreter working in a WASI (plus EH) environment. </p><h2>Milestone: running the bytecode interpreter in the WASI environment</h2> <p>Essentially, this means that I wanted to (1) clone the OCaml source code, (2) <code>configure</code> it, and (3) <code>make</code> the bytecode interpreter (and the whole OCaml bytecode toolchain). The C compiler comes from the <a href="https://github.com/WebAssembly/wasi-sdk">WASI SDK</a>, and it compiles directly to WebAssembly. Now, if you just set the <code>CC</code> variable to this C compiler, <code>configure</code> will consider the target as a cross-compile target. Such targets are still very tricky, and - because we actually <em>can</em> run the code somehow - I thought it is better to avoid cross-compilation altogether, and to add some tooling so that binaries are directly runnable. </p><p>Instead of pointing <code>CC</code> directly to the C compiler of the WASI SDK, there is now a wrapper script <code>wasi_cc</code>. The main purpose of this script is to reshape the WebAssembly executables so that they are directly runnable on the host system. This is accomplished by prepending a <em>starter</em> to the WebAssembly code. The <em>starter</em> runs <code>node</code> with the right driver script, and extracts the WebAssembly code from the executable file. For example, if you do </p><blockquote> <code style="white-space: pre"> wasi_cc -o ex code.c </code> </blockquote> the resulting file <code>ex</code> can be directly run with <code>./ex</code>. 
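<p>How the starter locates the WebAssembly code inside the combined file is not described here. Purely as an illustration, and assuming (hypothetically) that the module is simply appended after the starter text, the payload could be found by searching for the 8-byte header that every WebAssembly binary begins with. The function names <code>find_wasm_offset</code> and <code>split_executable</code> are invented for this sketch and are not part of <code>wasi_cc</code>. </p>

```ocaml
(* Every WebAssembly binary starts with the magic bytes "\000asm", followed
   by the version number 1 as a little-endian 32-bit integer. *)
let wasm_header = "\000asm\001\000\000\000"

(* Offset of the first occurrence of the header in [data], if any *)
let find_wasm_offset (data : string) : int option =
  let hlen = String.length wasm_header in
  let last = String.length data - hlen in
  let rec search i =
    if i > last then None
    else if String.sub data i hlen = wasm_header then Some i
    else search (i + 1)
  in
  search 0

(* Hypothetical helper: split a combined file into (starter, wasm payload).
   Assumes the payload runs to the end of the file and that the header
   bytes do not occur inside the starter itself. *)
let split_executable (data : string) : (string * string) option =
  match find_wasm_offset data with
  | None -> None
  | Some off ->
      Some (String.sub data 0 off,
            String.sub data off (String.length data - off))
```

<p>A real starter would have to be more careful than this naive search, for instance by recording the length of the starter instead of scanning for the header. </p>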
<p>With this trick, <code>configure</code> now "thinks" that the target is a native target of the operating system. <code>configure</code> could also run the tests on the existence of the various libc library functions the OCaml runtime needs, and figured out a lot of that stuff correctly. Nevertheless, not everything was working, and I had to fork the OCaml sources in order to disable functions that are not available (see <a href="https://github.com/gerdstolpmann/ocaml/compare/4.12.0...gerd/wasi-4.12.0">gerd/wasi-4.12.0</a> for the changes). </p><p>In this branch of OCaml I also changed the main function of the bytecode interpreter so that it catches exceptions from Javascript (actually, this function was split into two, and the outer function catches the exception thrown by the inner function). </p><p>A final difficulty was that function pointers in WebAssembly are typed - which is a logical consequence of the fact that functions are typed. OCaml generates a file <code>prims.c</code> that initializes the list of FFI functions, and initially LLVM did not like this file, because it could not infer the types of the function pointers. The solution was <em>not</em> to generate WebAssembly for this single file but to leave it as LLVM IR ("bitcode"). In this format function pointers can remain untyped, and the LLVM linker is smart enough to fix up the problem at link time, and to convert LLVM IR to WebAssembly when the types of the FFI functions are known. </p><p>With this trick, everything worked fine! The bytecode interpreter did not slow down much in WebAssembly, which was very encouraging. </p><h2>Milestone: the direct translator</h2> <p>After the bytecode interpreter was running, the second step was to directly generate WebAssembly code from OCaml. Actually, there were two choices: either to pick up one of the internal formats of OCaml (e.g. "Lambda" or "C--") and to change the OCaml compiler directly, or to take the bytecode as the starting point. 
I preferred the latter because WasiCaml is then an add-on processor that can be easily added to existing OCaml projects, and because some difficulties could be avoided (e.g. incremental compilation, and many, many fixups throughout the whole toolchain). Also, I hoped that the resulting speed would still be "good enough" (at least for the purposes of the DSL compiler we wanted to run with WebAssembly). </p><p>Bytecode also made it a lot easier for me to get started. There were really a lot of unanswered questions: what does the function call mechanism look like? How do we get around the problem that OCaml code typically requires tail calls to be working but there aren't tail calls in WebAssembly (yet)? What does the code look like that allocates a block of memory? How do we emulate exceptions? Picking bytecode meant that I could focus on these questions, while the bytecode instructions could initially be translated in a naive way, e.g. by translating each bytecode instruction separately to a fixed block of WebAssembly instructions (like instantiating a template). (Note that the current WasiCaml compiler is already a lot better than that.) </p><p>Picking bytecode also meant that WasiCaml inherits the bytecode stack. This is actually not a bad thing - because of OCaml's memory management the stack must reside in addressable memory, and the bytecode stack could serve as what the WebAssembly community calls a <em>shadow stack</em>. (Even for the C language there is a shadow stack - and the alternative would have been to also use the shadow stack of the C language.) So we got the shadow stack for OCaml code practically for free. </p><p>The stack is important because the garbage collector must be able to run over all locations where OCaml values are stored. 
As already mentioned, the locations WebAssembly natively supports (like local and global variables) cannot be traversed, and hence it is crucial to put OCaml values into memory whenever there is the chance of a garbage collector run. </p><p>Note that the native OCaml compiler is not much different in this respect - only that the native stack of the operating system can be used for storing values because it resides in memory. The details are different, though. When a value is moved temporarily to the stack, this is usually called "register spilling", and this is done because (1) there is only a limited number of registers, but another register is needed, or (2) you don't know which register remains untouched when you call a function, or (3) you call some code that may run the garbage collector. Now, in WebAssembly, reason (1) is never the case because there can be any number of local variables (which take over the role of registers), and the details of (3) are very different, because in a native environment the registers are global stores, permitting some time-saving tricks that are unavailable in WebAssembly. </p><p>So, for developing the WasiCaml code emitter, this meant that it had to follow constraints so that OCaml values end up on the stack at the right moment. Actually, these constraints mainly shaped the layout of the WasiCaml code. </p><h2>32 bit comes back!</h2> <p>Once WasiCaml was working, we got back to the DSL compiler we originally wanted to make cross-platform. And we actually got it running! There was one remaining problem, though: WebAssembly is a 32-bit environment. As you may know, OCaml suffers from some limitations in this case. Most annoyingly, strings can be at most 16 MB in size. </p><p>Fortunately, this problem occurred only here and there, mostly in the code emitter. 
Here, we could switch to <a href="https://github.com/Chris00/ocaml-rope">ropes</a> as alternate representation - and, lucky as we were, it turned out that this change did not eat much performance. </p><p>The DSL compiler is quite big, and the WebAssembly version takes around 3 seconds to start up. This is longer than usual, but for our application we could hide the startup time, and are now quite happy with the product. </p><hr/> <p>PS. Interested in WebAssembly and you know OCaml (or another functional language like Elm, Scala, Haskell, ...)? <a href="https://www.mixtional.de/recruiting/2021-01/index.html">We might have a job for you (July 2021)</a>. </p> </div> <div> Gerd Stolpmann is the CEO of <a href="https://mixtional.de">Mixtional Code GmbH</a>, currently busy with the last development steps of the <a href="http://remixlabs.com">Remix Labs</a> platform </div> <div> </div> OMake On Steroids (Part 3) http://blog.camlcity.org/blog/omake3.html http://blog.camlcity.org/blog/omake3.html 23 Jun 2015 00:00 GMT <div> <b>Faster builds with omake, part 3: Caches</b><br/>  </div> <div> In this (last) part of the series we have a closer look at how OMake uses caches, and what could be improved in this field. Remember that we saw in total double speed for large OMake projects, and that we also could reduce the time for incremental builds. In particular for the latter, the effect of caching is important. <cc-field name="maintext"> <div style="float:right; width:50%; border: 1px solid black; padding: 10px; margin-left: 1em; margin-bottom: 1em; margin-top: 1em; background-color: #E0E0E0"> This text is part 3/3 of a series about the OMake improvements sponsored by <a href="http://lexifi.com">LexiFi</a>: <ul> <li>Part 1: <a href="/blog/omake1.html">Overview</a> </li><li>Part 2: <a href="/blog/omake2.html">Linux</a> </li><li>Part 3: Caches (this page) </li></ul> The original publishing is on <a href="http://blog.camlcity.org/blog">camlcity.org</a>. 
</div> <p> Caching more is better, right? Unfortunately, this attitude of many application programmers does not hold if you look closer at how caches work. Basically, you trade memory for time, but there are also unwanted effects. As we learned in the last part, bigger process images may also cost time. What we examined there at the example of the fork() system call is also true for any memory that is managed in a fine-grained way. Look at the garbage collector of the OCaml runtime: If more memory blocks are allocated, the collector also needs to cycle through more blocks in order to mark and reclaim memory. Although the runtime includes some clever logic to alleviate this effect (namely by allowing more waste for bigger heaps and by adjusting the collection speed to the allocation speed), the slowdown is still measurable. </p><p> Another problem for large setups is that if processes consume more memory the caches maintained by the OS have less memory to work with. The main competitor on the OS level is the page cache that stores recently used file blocks. After all, memory is limited, and it is the question for what we use it. Often enough, the caches on the OS level are the most effective ones, and user-maintained caches need to be justified. </p><p> In the case of OMake there are mainly two important caches: </p><ul> <li>The target cache answers the question whether a file can be built in a given directory. The cache covers both types of build rules: explicit and implicit rules. For the latter it is very important to have this cache because the applicable implicit rules need to be searched. As OMake normally uses the "-modules" switch of ocamldep, it has to find out on its own in which directory an OCaml module is built. </li><li>The file cache answers the question whether a file is still up to date, or whether it needs to be rebuilt. This is based on three data blobs: first, the Unix.stat() properties of the file (and whether the file exists at all). 
Second, the MD5 digest of the file. Third, the digest of the command that created the file. If any of these blobs changes, the file is out of date. The details are somewhat complicated, though, in particular the computation of the digest costs some time and should only be done if it helps avoiding other expensive actions. Parts of the file cache survive OMake invocations as these are stored in the ".omakedb" file. </li></ul> <p> All in all, I was looking for ways of reducing the size of the caches, and for a cleverer organization that makes the cache operations cheaper. </p><h2>The target cache</h2> The target cache is used for searching the directory where a file can be built, and also the applicable file extensions (e.g. if a file m.ml is generated from m.mly there will be entries for both m.ml and m.mly). As I found it, it was very simple, just a mapping <blockquote> filepath ↦ buildable_flag </blockquote> and if a file f could potentially exist in many directories d there was a separate entry d/f for every d. For a given OCaml module m, there were entries for every potential suffix (i.e. for .cmi, .cmo, .cmx etc.), and also for the casing of m (remember that a module M can be stored in both m.ml and M.ml). In total, the cache had 2 * D * S * M entries (where D = number of build directories, S = number of file suffixes, and M = number of modules). That is a high number of entries. <p> The problem is not only the size, but also the speed: For every test we need to walk the mapping data structure. </p><p> The new layout of the cache compresses the data in the following way: </p><blockquote> filename ↦ (directories_buildable, directories_non_buildable) </blockquote> On the left side, only simple filenames without paths are used. So we now need only 1/D of the entries we needed before. On the right side, we have two sets: the directories where the file can be built, and the directories where the file cannot be built (and if a directory appears in neither set, we don't know yet). 
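<p>A minimal OCaml sketch of such a layout might look as follows. It is not the code of OMake: the names <code>entry</code>, <code>record</code> and <code>status</code> are invented for this illustration, directories are assumed to be numbered from 0, and each of the two sets is held in a plain <code>int</code> used as a bit mask (directory i corresponds to bit i). </p>

```ocaml
(* Simple file name -> two directory sets, each stored as a bit mask.
   Directory number i is bit i, so this toy version supports at most
   Sys.int_size directories. *)
type entry = { buildable : int; non_buildable : int }

type status = Buildable | Not_buildable | Unknown

let cache : (string, entry) Hashtbl.t = Hashtbl.create 16

let empty = { buildable = 0; non_buildable = 0 }

(* Record whether [file] can be built in directory number [dir] *)
let record (file : string) (dir : int) (can_build : bool) : unit =
  let e = Option.value (Hashtbl.find_opt cache file) ~default:empty in
  let bit = 1 lsl dir in
  let e =
    if can_build then { e with buildable = e.buildable lor bit }
    else { e with non_buildable = e.non_buildable lor bit }
  in
  Hashtbl.replace cache file e

(* One table lookup per file; the per-directory test is a bit operation.
   A directory in neither set means "not known yet". *)
let status (file : string) (dir : int) : status =
  match Hashtbl.find_opt cache file with
  | None -> Unknown
  | Some e ->
      let bit = 1 lsl dir in
      if e.buildable land bit <> 0 then Buildable
      else if e.non_buildable land bit <> 0 then Not_buildable
      else Unknown
```
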
As the number of directories is very limited, these sets can be represented as bitsets. <p> Note that if we were to program a lame build system, we could even simplify this to </p><blockquote> filename ↦ directory_buildable option </blockquote> but we want to take into account that files can potentially be built in several directories, and that it depends on the include paths currently in scope which directory is finally picked. <p> It's not only that the same information is now stored in a compressed way. Also, the main user of the target cache picks a single file and searches the directory where it can be built. Because the data structure is now aligned with this style of accessing it, only one walk over the mapping is needed per file (instead of one walk per combination of directory and file). Inside the loop over the directories we only need to look into the bitsets, which is very cheap. </p><h2>The file cache</h2> Compared to the target cache, the file cache is really complicated. For every file we have three meta data blobs (stat, file digest, command digest). Also, there are two versions of the cache: the persistent version, as stored in the .omakedb file, and the live version. <p> Many simpler build systems (like "make") only use the file stats for deciding whether a file is out of date. This is somewhat imprecise, in particular when the filesystem stores the timestamps of the files with only low granularity (e.g. in units of seconds). Another problem occurs when the timestamps are not synchronous with the system clock, as it happens with remote filesystems. </p><div style="float:right; width:50%; border: 1px solid black; padding: 10px; margin-left: 1em; margin-top: 1em; background-color: #E0E0E0"> There is now a <a href="https://github.com/gerdstolpmann/omake-fork/tags">pre-release omake-0.10.0-test1</a> that can be bootstrapped! It contains all of the described improvements, plus a number of bugfixes. 
</div> <p> OMake is programmed so that it only uses the timestamps between invocations. This means that if OMake is started another time, and the timestamp of a file changed compared with the previous invocation of OMake, it is assumed that the file has changed. OMake does not use timestamps during its runs. Instead it relies on the file cache as the instance that decides which files need to be created again. For doing so, it only uses digests (i.e. a rule fires when the digests of the input files change, or when the digest of the command changes). </p><p> The role of the .omakedb file is now that a subset of the file cache is made persistent between invocations. This file stores the timestamps of the files and the digests. OMake simply assumes that the saved digest is still the current one if the timestamp of the file remains the same. Otherwise it recomputes the digest. This is the only purpose of the timestamps. Inaccuracies do not play a big role when we can assume that users typically do not start omake instances so quickly after each other that clock deviations would matter. </p><p> The complexity of the file cache is better understood if you look at key operations: </p><ul> <li>Load the .omakedb file and interpret it in the right way </li><li>Decide whether the cached file digest can be trusted or not (and in the latter case the digest is recomputed from the existing file) </li><li>Decide whether a rule is out of date or not. This check needs to take the cache contents for the inputs and the outputs of the rule into account. </li><li>Sometimes, we want to avoid expensive checks, and e.g. only know whether a digest might be out of date from the available information without having to recompute the digest. </li></ul> <p> After finding a couple of imprecise checks in the existing code, I actually went through the whole Omake_cache module and examined all data cases. 
After that I'm now sure that it is perfect in the sense that only those digests are recomputed that are really needed for deciding whether a rule is out of date. </p><p> There are also some compressions: </p><ul> <li>The cache no longer stores the complete Unix.stat records, but only the subset of the fields that are really meaningful (timestamps, inode), and represent these fields as a single string. </li><li>There is a separate data structure for the question whether a file exists. This is one of the cases where OS level caches already do a good job. Now, only for the n most recently accessed files this information is available (where n=100). On Linux with its fast system calls this cache is probably unnecessary, but on Windows I actually saw some speedup. </li></ul> <p> All taken together, this gives another little boost. This is mostly observable on Windows as this OS does not profit from the improvements described in the previous article of the series. <img src="/files/img/blog/omake3_bug.gif" width="1" height="1"/> </p></cc-field> </div> <div> </div> <div> Gerd Stolpmann works as OCaml consultant. </div> <div> </div> OMake On Steroids (Part 2) http://blog.camlcity.org/blog/omake2.html http://blog.camlcity.org/blog/omake2.html 19 Jun 2015 12:00 GMT <div> <b>Faster builds with omake, part 2: Linux</b><br/>  </div> <div> The Linux version of OMake suffered from specific problems, and it is worth looking at these in detail. </div> <div> <div style="float:right; width:50%; border: 1px solid black; padding: 10px; margin-left: 1em; margin-bottom: 1em; background-color: #E0E0E0"> This text is part 2/3 of a series about the OMake improvements sponsored by <a href="http://lexifi.com">LexiFi</a>: <ul> <li>Part 1: <a href="/blog/omake1.html">Overview</a> </li><li>Part 2: Linux (this page) </li><li>Part 3: Caches (will be released on Tuesday, 6/23) </li></ul> The original publishing is on <a href="http://blog.camlcity.org/blog">camlcity.org</a>. 
</div> <p>While analyzing the performance characteristics of OMake, I found that the features of the OS were used in a non-optimal way. In particular, the fork() system call can be very expensive, and by avoiding it the speed of OMake could be dramatically improved. This is the biggest contribution to the performance optimizations allowing OMake to run roughly twice as fast on Linux (see <a href="/blog/omake1.html">part 1</a> for numbers). </p><h2>The fork/exec problem</h2> <p> The traditional way of starting commands is to use the fork/exec combination: The fork() system call creates an almost identical copy of the process, and in this copy the exec() call starts the command. This has a number of logical advantages, namely that you can run code between fork() and exec() that modifies the environment for the new command. Often, the file descriptors 0, 1, and 2 are assigned as it is required for creating pipelines. You can also do other things, e.g. change the working directory. </p><p> The whole problem with this is that it is slow. Even for a modern OS like Linux, fork() includes a number of expensive operations. Although it can be avoided to actually copy memory, the new address space must be set up by duplicating the page table. This is the more expensive the bigger the address space is. Also, memory must be set aside even if it is not immediately used. The entries for all file mappings must be duplicated (and every linked-in shared library needs such mappings). The point is now that all these actions are not really needed because at exec() time the whole process image is replaced by a different one. </p><p> In my performance tests I could measure that forking a 450 MB process image needs around 10 ms. In the n=8 test for compiling each of the 4096 modules two commands are needed (ocamldep.opt and ocamlopt.opt). The time for this fork alone sums up to 80 seconds. 
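</p><p> This estimate follows from a short calculation with the figures just given: 4096 modules times 2 commands times about 10 ms per fork is roughly 82 seconds, which is rounded to 80 seconds above. As a small OCaml sketch (the variable names are chosen only for this illustration): </p>

```ocaml
let modules = 4096
let commands_per_module = 2   (* ocamldep.opt and ocamlopt.opt *)
let fork_cost_s = 0.010       (* about 10 ms to fork a 450 MB process image *)

(* Total time spent in fork() alone, in seconds *)
let total_fork_time_s =
  float_of_int (modules * commands_per_module) *. fork_cost_s

let () = Printf.printf "%.2f s\n" total_fork_time_s
```

<p>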
Even worse, this dramatically limits the benefit of parallelizing the build, because this time is always spent in the main process. </p><p> The POSIX standard includes an alternate way of starting commands, the posix_spawn() call. It was originally developed for small systems without virtual memory where it is difficult to implement fork() efficiently. However, because of the mentioned problems of the fork/exec combination it was quickly picked up by all current POSIX systems. The posix_spawn() call takes a potentially long list of parameters that describes all the actions needed to be done between fork() and exec(). This gives the implementer all freedom to exploit low-level features of the OS for speeding the call up. Some operating systems, e.g. Mac OS X, even implement posix_spawn directly as system call. </p><p> On Linux, posix_spawn is a library function of glibc. By default, however, it is no real help because it uses fork/exec (being very conservative). If you pass the flag POSIX_SPAWN_USEVFORK, though, it switches to a fast alternate implementation. I was pointed (by Török Edwin) to a few emails showing that the quality in glibc is not yet optimal. In particular, there are weaknesses in signal handling and in thread cancellation. Fortunately, these weaknesses do not matter for this application (signals are not actively used, and on Linux OMake is single-threaded). </p><p> Note that I developed the wrapper for posix_spawn already years ago for OCamlnet where it is still used. So, if you want to test the speed advantage yourself, just use OCamlnet's Shell library for starting commands. </p><h2>Pipelines and fork()</h2> <p>It turned out that there is another application of fork() in OMake. When creating pipelines, it is sometimes required that the OMake process forks itself, namely when one of the commands of the pipeline is implemented in the OMake language. This is somewhat expected, as the parts of a pipeline need to run concurrently. 
However, this feature turned out to be a little bit in the way because the default build rules used it. In particular, there is the pipeline </p><blockquote> <code><small> $(OCAMLFIND) $(OCAMLDEP) ... -modules $(src_file) | ocamldep-postproc </small></code> </blockquote> which is started for scanning OCaml modules. While the first command, $(OCAMLFIND), is a normal external command, the second command, ocamldep-postproc, is written in the OMake language. <p>Forking for creating pipelines is even more expensive than the fork/exec combination discussed above, because memory really needs to be copied. I could finally avoid this fork() by some trickery in the command starter. When used for scanning, and the command is the last one in the pipeline (as in the above pipeline), a workaround is activated that writes the data to a temporary file, as if the pipeline read </p><blockquote> <code><small> $(OCAMLFIND) $(OCAMLDEP) ... -modules $(src_file) >$(tmpfile);<br/> ocamldep-postproc <$(tmpfile) </small></code> </blockquote> <p>(NB. You can actually also program this in the OMake language. However, this does not solve the problem, because for sequences of commands $(cmd1);$(cmd2) it is also required to fork the process. Hence, I had to find a solution deeper in the OMake internals.) </p><div style="float:right; width:50%; border: 1px solid black; padding: 10px; margin-left: 1em; margin-top: 1em; background-color: #E0E0E0"> There is now a <a href="https://github.com/gerdstolpmann/omake-fork/tags">pre-release omake-0.10.0-test1</a> that can be bootstrapped! It contains all of the described improvements, plus a number of bugfixes. </div> <p>There is one drawback of this, though: The latency of the pipeline is increased when the commands are run sequentially rather than in parallel. The effect is that OMake takes longer for a j=1 build even if fewer CPU resources are consumed.
A number of further improvements compensate for this: </p><ul> <li>Most importantly, ocamldep-postproc can now use a builtin function, speeding this part up by switching the implementation language (now OCaml, previously the OMake language). </li><li>Because ocamldep-postproc mainly accesses the target cache, speeding up this cache also helped (see the next part of this article series). </li><li>Finally, there is now a way for functions like ocamldep-postproc to propagate updates of the target cache to the main environment. The background here is that functions implementing commands run in a sub environment simulating some isolation from the parent environment. This isolation prevented updates of the target cache found by one invocation of ocamldep-postproc from being used by the next invocation. Removing this restriction also speeds up the function. </li></ul> <h2>Windows is not affected</h2> <p>The Windows port of OMake is not affected by the fork problems. For starting commands, an optimized technique similar to posix_spawn() is used anyway. For pipelines and other internal uses of fork() the Windows port uses threads. (Side note: You may ask why we don't use threads on Linux. There are a couple of reasons: First, the emulation of the process environment with threads is probably not quite as stable as the original using real processes. Second, there are difficult interoperability problems between threads and signals (something that does not exist in Windows). Finally, this would not save us from maintaining the code branch using real processes and fork() because OCaml does not support multi-threading for all POSIX systems. Of course, this does not mean we cannot implement it as an optional feature, and probably this will be done at some point in the future.) </p><p>The trick of using temporary files for speeding up pipelines is not enabled on Windows. Here, it is more important to get the benefits of parallelization that the real pipeline allows.
</p><div style="border: 1px solid black; padding: 10px; margin-left: 1em; margin-bottom: 1em; background-color: #E0E0E0"> The next part will be published on Tuesday, 6/23. </div> <img src="/files/img/blog/omake2_bug.gif" width="1" height="1"/> </div> <div> Gerd Stolpmann works as OCaml consultant. </div> <div> </div> OMake On Steroids (Part 1) http://blog.camlcity.org/blog/omake1.html http://blog.camlcity.org/blog/omake1.html 16 Jun 2015 00:00 GMT <div> <b>Faster builds with omake, part 1: Overview</b><br/>  </div> <div> In the <a href="https://sympa.inria.fr/sympa/arc/caml-list/2014-09/msg00090.html">2014 edition</a> of the "which is the best build system for OCaml" debate the <a href="http://omake.metaprl.org">OMake</a> utility was heavily criticized for not being scalable enough. Some quick tests showed that there was indeed a problem. At <a href="http://lexifi.com">LexiFi</a>, the size of the source tree obviously already exceeded the critical point, and LexiFi was interested in an improvement. LexiFi develops for both Linux and Windows, and OMake is their preferred build system because of its excellent support for Windows. The author of these lines got some funding from LexiFi for analyzing and fixing the problem. </div> <div> <div style="float:right; width:50%; border: 1px solid black; padding: 10px; margin-left: 1em; margin-bottom: 1em; background-color: #E0E0E0"> This text is part 1/3 of a series about the OMake improvements sponsored by <a href="http://lexifi.com">LexiFi</a>: <ul> <li>Part 1: Overview (this page) </li><li>Part 2: Linux (will be released on Friday, 6/19) </li><li>Part 3: Caches (will be released on Tuesday, 6/23) </li></ul> The original publishing is on <a href="http://blog.camlcity.org/blog">camlcity.org</a>. </div> <p> OMake is not only a build system (like ocamlbuild, for example), but it also includes extensions that are important for controlling and customizing builds.
There is an interpreter for a simple dynamically typed functional language. There is a command shell implementing utilities like "rm" or "cp", which is particularly important on non-Unix systems. There are system interfaces for watching files and restarting the build whenever source code is saved in the editor. In short, OMake is very feature-rich, but, and this is the downside, it is also quite complex: around 130 modules and 80k lines of code. Obviously, it is easy to overlook performance problems when so much code is involved. For me, as a developer seeing the sources for the first time, the size was also a challenge, namely for identifying possible problems and for finding solutions. </p><h2>Quantifying the performance problem</h2> My very first activity was to develop a synthetic benchmark for OMake (and actually, for any type of OCaml build system). Compared with a real build, a synthetic benchmark has the big advantage that you can simulate builds of any size. The benchmark has these characteristics: The task is to build n^2 libraries with n^2 modules each (for a given small number n), and the dependencies between the modules are created in a way so that we can stress both the dependency analyzer of the build utility and the ability to run commands in parallel. In particular, every library would allow n parallel build flows of the n^2 modules, and you can build n of the n^2 libraries in parallel. (For details see the <a href="https://github.com/gerdstolpmann/omake-fork/blob/perf-test/performance/generate.ml">source code</a>.)
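<p> To make the size of the benchmark concrete, here is a small OCaml sketch that computes the number of libraries and modules for a given n. It is only an illustration of the arithmetic described above; the function names are mine and do not come from the benchmark generator. </p><pre>
(* Size of the synthetic benchmark for a given n:
   n^2 libraries with n^2 modules each, i.e. n^4 modules in total.
   The names below are illustrative only. *)
let libraries n = n * n
let modules_per_library n = n * n
let total_modules n = libraries n * modules_per_library n

let () =
  List.iter
    (fun n ->
       Printf.printf "n=%d: %d libraries, %d modules\n"
         n (libraries n) (total_modules n))
    [7; 8]
</pre> <p> For n=7 this gives 49 libraries and 2401 modules, and for n=8 it gives 64 libraries and 4096 modules, so a small step in n already means a much bigger build. </p>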
<p> This is what I got for omake-0.9.8.6 (note that a different computer was used for Windows, so you cannot compare Linux with Windows): </p><p> </p><table border="1"> <tr> <th>Size n</th> <th>Parallelism j</th> <th>Number of modules (n^4)</th> <th>Runtime Linux</th> <th>Runtime Windows</th> </tr> <tr> <td align="right">n=7</td> <td align="right">j=1</td> <td align="right">2401</td> <td align="right">645</td> <td align="right">353</td> </tr> <tr> <td align="right">n=7</td> <td align="right">j=4</td> <td align="right">2401</td> <td align="right">213</td> <td align="right">179</td> </tr> <tr> <td align="right">n=8</td> <td align="right">j=1</td> <td align="right">4096</td> <td align="right">1906</td> <td align="right">877</td> </tr> <tr> <td align="right">n=8</td> <td align="right">j=4</td> <td align="right">4096</td> <td align="right">607</td> <td align="right">341</td> </tr> </table> <p>This clearly shows that there is something wrong, in particular for Linux as OS: For the 4096 modules of n=8, which is around 1.7 times the 2401 modules of n=7, omake needs around three times longer (for a j=1 build). For Windows, the numbers are slightly better: the n=8 build takes 2.5 times as long as the n=7 build. Nevertheless, this is quite far away from the optimum. </p><p>Note that this is not good, but it is also not a catastrophe. The latter shows up if you try to use ocamlbuild. I couldn't manage to build the n=7 test case at all: after 30 minutes ocamlbuild slowed down to a crawl, and progressed only with a speed of around one module per second. Apparently, there are much worse problems than with OMake. (Btw, it would be nice to hear how other build systems compare.)
</p><h2>After improving OMake</h2> The version from today (2015-05-18) at <a href="https://github.com/gerdstolpmann/omake-fork">Github</a> behaves much better: <p> </p><table border="1"> <tr> <th>Size n</th> <th>Parallelism j</th> <th>Number of modules (n^4)</th> <th>Runtime Linux<br/>(Speedup factor)</th> <th>Runtime Windows<br/>(Speedup factor)</th> </tr> <tr> <td align="right">n=7</td> <td align="right">j=1</td> <td align="right">2401</td> <td align="right">169 (3.8)</td> <td align="right">317 (1.1)</td> </tr> <tr> <td align="right">n=7</td> <td align="right">j=4</td> <td align="right">2401</td> <td align="right">59 (3.6)</td> <td align="right">163 (1.1)</td> </tr> <tr> <td align="right">n=8</td> <td align="right">j=1</td> <td align="right">4096</td> <td align="right">363 (5.3)</td> <td align="right">661 (1.3)</td> </tr> <tr> <td align="right">n=8</td> <td align="right">j=4</td> <td align="right">4096</td> <td align="right">144 (4.2)</td> <td align="right">330 (1.0)</td> </tr> </table> <div style="float:right; width:50%; border: 1px solid black; padding: 10px; margin-left: 1em; margin-top: 1em; background-color: #E0E0E0"> There is now a <a href="https://github.com/gerdstolpmann/omake-fork/tags">pre-release omake-0.10.0-test1</a> that can be bootstrapped! It contains all of the described improvements, plus a number of bugfixes. </div> <p>As you can see, there is a huge improvement for Linux and a slight one for Windows. It turns out that the Linux version ran into a Unix-specific issue of starting commands from a big process (the OMake main process reaches around 450MB). OMake used the conventional fork/exec combination for doing so, but it is a known problem that this does not work well for big process images. We'll come to the details of this later. The Windows version never suffered from this problem. </p><p>The scalability is now somewhat better, but still not great.
For both Windows and Linux, the n=8 runs take now around 2.1 times longer than the n=7 runs. </p><p>Another aspect of the performance impression is how long a typical incremental build takes after changing a single file. At least for OMake, a good measure for this is the zero rebuild time: how long OMake takes to figure out that nothing has changed, i.e. the time for the second omake run in "omake ; omake": </p><table border="1"> <tr> <th>Parameters</th> <th>Runtime Linux omake-0.9.8.6</th> <th>Runtime Linux 2015-05-18<br/>(Speedup Factor)</th> </tr> <tr> <td align="right">n=7, j=1</td> <td align="right">16.8</td> <td align="right">8.4 (2.0)</td> </tr> <tr> <td align="right">n=8, j=1</td> <td align="right">39.2</td> <td align="right">15.6 (2.5)</td> </tr> </table> <p>The time roughly halves. Note that you get a similar effect under Windows as OMake doesn't start any commands for a zero rebuild. Actually, most time is spent for constructing the internal data structures and for computing digests (not only for files but also for commands, which turns out to be the more expensive action). </p><h2>How to tackle the analysis</h2> I started it the old-fashioned way by manually instrumenting interesting functions. This means that counts and (wall-clock) runtimes are measured. Functions that (subjectively) "take too long" are further analyzed by also instrumenting called functions. This way I could quickly find out the interesting parts (while learning how OMake works as you go through the code and instrument it). The helper module I used: <a href="https://github.com/gerdstolpmann/omake-fork/blob/master/src/libmojave/lm_instrument.mli">Lm_instrument</a>. (Note that I did all the actual instrumentation in the "perf-test" branch.) <p>As OCaml supports gprof instrumentation I also tried this but without success. The problem is simply that gprof looks at the wrong metrics, namely only at the runtimes of the two innermost function invocations in the call stack. 
In OCaml this is usually something like <code>List.map</code> calling <code>String.sub</code>, i.e. at both levels there are general-purpose functions. This is useless information. We need more context for the analysis (i.e. more levels in the call stack), because it depends very much on where the function is called from. </p><p>Another problem of gprof was that you do not see kernel time. For analyzing a utility like OMake whose purpose is to start external commands this is crucial information, though. </p><p>For measuring the size of OCaml values I used <a href="http://forge.ocamlcore.org/projects/objsize/">objsize</a>. </p><h2>The main points of the improvement</h2> <p>In summary, the following improvements were made: </p><ul> <li>For Linux, I switched to posix_spawn instead of fork/exec for starting commands. </li><li>For Linux, it was also important to avoid a self-fork of omake for postprocessing ocamldep output. Now temporary files are used. </li><li>I rewrote the target cache that stores whether a file can be built or not. The new data structure for this cache highly compresses the data, and is better aligned to the main user, namely the function figuring out which implicit rules are needed to build a file. This way I could save processing time in this cache, and the memory footprint also got substantially smaller. </li><li>I also rewrote the file cache that connects file names with file stats and digests. The new cache makes it possible to skip the computation of digests in more cases. Also, less data is cached (saving memory). </li><li>I tweaked when the file digests are computed. This is no longer done immediately but delayed until after the next command has been started, and then done in parallel to the command. This is particularly advantageous when there are some CPU resources left that could be utilized for this purpose. </li><li>There are also simplified scanner rules in OMake.om, reducing the time needed for computing scanner dependencies.
There is a drawback of the new rules, namely that when a file is moved to a new directory OMake does not rescan the file the next time it is run. I guess this is acceptable, because it normally does not matter where a file is stored. Nevertheless, there is an option to get the old behavior back (by setting EXTENDED_DIGESTS). </li><li>Not regarding speed: OMake can now be built with the mingw port of OCaml </li></ul> <h2>One major problem remains</h2> <p> There is still one problem I could not yet address, and this problem is mainly responsible for the long startup time of OMake for large builds. Unlike other build systems, OMake creates a dependency from the rule to the command of the rule, as if every rule looked like: </p><blockquote> <code> target: source1 ... sourceN :value: $(command)<br/>     $(command) </code> </blockquote> i.e. when the command changes the rule "fires" and is executed. This is an automatic addition, and it is very useful: When you start a build after changing parameters (e.g. include paths) OMake automatically detects which commands have changed because of this, and reruns these. <p> However, there is a price to pay. For checking whether a rule is out of date it is required to expand the command and compute the digest. For a full build the time for this is negligible (and you need the commands anyway for starting them), but for a "zero rebuild" the commands are finally not needed, and OMake expands them only for the out-of-date check. As you might guess, this is the main reason why a zero rebuild is so slow. </p><p> It is probably possible to speed up the out-of-date check by doing a static analysis of the command expansions. Most expansions just depend on a small number of variables, and only if these variables change the command can expand to something different. With that knowledge it is possible to compile a quick check whether the expansion is actually needed. 
As any expression of the OMake language can be used for the commands, developing such a compiler is non-trivial, and so far it was not possible to do this within my time budget. </p><div style="border: 1px solid black; padding: 10px; margin-left: 1em; margin-bottom: 1em; background-color: #E0E0E0"> The next part will be published on Friday, 6/19. </div> <img src="/files/img/blog/omake1_bug.gif" width="1" height="1"/> </div> <div> Gerd Stolpmann works as OCaml consultant. </div> <div> </div> Immutable strings in OCaml-4.02 http://blog.camlcity.org/blog/bytes1.html http://blog.camlcity.org/blog/bytes1.html 04 Jul 2014 00:00 GMT <div> <b>Why the concept is not good enough</b><br/>  </div> <div> In the upcoming release 4.02 of the OCaml programming language, the type <code>string</code> can be made immutable by a compiler switch. Although this won't be the default yet, this should be seen as the announcement of a quite disruptive change in the language. Eventually this will be the default in a future version. In this article I explain why I disagree with this particular plan, and which modifications would be better. </div> <div> <p> Of course, the fact that <code>string</code> is mutable doesn't fit well into a functional language. Nevertheless, it has been seen as acceptable for a long time, probably because the developers of OCaml did not pay much attention to strings, and felt that the benefits of a somewhat cleaner concept wouldn't outweigh the practical disadvantages of immutable strings. Apparently, this attitude changed, and we will see a new <code>bytes</code> type in OCaml-4.02. This type is accompanied by a <code>Bytes</code> module with library functions supporting it. The compiler was also extended so that <code>string</code> and <code>bytes</code> can be used interchangeably by default. If, however, the <code>-safe-string</code> switch is set on the command-line, the compiler sees <code>string</code> and <code>bytes</code> as two completely separate types.
</p> <p> This is a disruptive change (if enabled): Almost all code bases will need modifications in order to be compatible with the new concept. Although this will often be trivial, there are also harder cases where strings are frequently used as buffers. Before discussing that in a bit more detail, let me point out why such disruptive changes are so problematic. So far there was an implicit guarantee that your code will be compatible with new compiler versions if you stick to the well-established parts of the language and avoid experimental additions. I indeed have code that was developed for OCaml-1.03 (the first version I checked out), and that code still runs. Especially in a commercial context this is a highly appreciated feature, because this protects the investment in the code base. As I'm trying to sell OCaml to companies in my career this is a point that bothers me. Giving up this history of excellent backward compatibility is something we shouldn't do easily, and if so, only if we get something highly valuable back. (Of course, if you only look at the open source and academic use of OCaml, you'll put less emphasis on the compatibility point, but it's also not completely unimportant there.) </p> <h2>The problem</h2> <p> I'm fully aware that immutable strings fix some problems (the worst probably: so far even string literals can be mutated, which can be very surprising). However, creating a completely new type <code>bytes</code> also comes with some disadvantages: </p><ul> <li>Lack of generic accessor functions: There is <code>String.get</code> and there is <code>Bytes.get</code>. The shorthand <code>s.[k]</code> is now restricted to strings. This is mostly a stylistic problem. </li><li>The conversion of string to bytes and vice versa requires a copy: <code>Bytes.of_string</code>, and <code>Bytes.to_string</code>. You have to pay a performance penalty.
</li><li>In practical programming, there is sometimes no clear conceptual distinction between string data that are read-only and those that require mutation. For example, if you add data to a buffer, the data may come from a string or from another buffer. So how do you type such an <code>add</code> function? </li></ul> This latter point is, in my opinion, the biggest problem. Let's assume we wanted to reimplement the <code>Lexing</code> module of the standard library in pure OCaml without resorting to unsafe coding (currently it's done in C). This module implements the lexing buffer that backs the lexers generated with ocamllex. We now have to use <code>bytes</code> for the core of this buffer. There are three functions in <code>Lexing</code> for creating new buffers: <pre>
val from_channel : in_channel -> lexbuf
val from_string : string -> lexbuf
val from_function : (string -> int -> int) -> lexbuf
</pre> The first observation is that we had better offer two more constructors to the users of this module: <pre>
val from_bytes : bytes -> lexbuf
val from_bytes_function : (bytes -> int -> int) -> lexbuf
</pre> So why do we need the ability to read from <code>bytes</code>, i.e. copy from one buffer to the other? We could just be a bad host and not offer these functions to the users of the module. However, it's unavoidable anyway for <code>from_channel</code>, because I/O buffers are of course <code>bytes</code>: <pre>
let from_channel ch =
  from_bytes_function (fun buf n -> Pervasives.input ch buf 0 n)
</pre> So whenever we implement buffers that also include I/O capabilities, it is likely that we need to handle both the <code>bytes</code> and the <code>string</code> case. This is not only a problem for the interface design. Because <code>string</code> and <code>bytes</code> are completely separated, we need two different implementations: <code>from_string</code> and <code>from_bytes</code> cannot share much code.
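<p> The duplication can be illustrated with a toy buffer. The following is only a sketch under my own assumptions: the type <code>buf</code> and the functions <code>create</code>, <code>add_string</code>, <code>add_bytes</code> and <code>contents</code> are made up for this example and have nothing to do with the real <code>Lexing</code> implementation. Only the <code>Bytes</code> functions are taken from the standard library. </p><pre>
(* A toy append-only buffer with a fixed capacity (kept simple on purpose).
   All names here are illustrative. *)
type buf = { data : bytes; len : int ref }

let create n = { data = Bytes.create n; len = ref 0 }

(* Append from an immutable string ... *)
let add_string b s =
  let n = String.length s in
  if !(b.len) + n > Bytes.length b.data then invalid_arg "add_string";
  Bytes.blit_string s 0 b.data !(b.len) n;
  b.len := !(b.len) + n

(* ... and from a mutable byte sequence. The logic is the same, but it has
   to be written a second time because the source type differs. *)
let add_bytes b src =
  let n = Bytes.length src in
  if !(b.len) + n > Bytes.length b.data then invalid_arg "add_bytes";
  Bytes.blit src 0 b.data !(b.len) n;
  b.len := !(b.len) + n

let contents b = Bytes.sub_string b.data 0 !(b.len)

let () =
  let b = create 16 in
  add_string b "abc";
  add_bytes b (Bytes.of_string "def");
  print_endline (contents b)   (* prints "abcdef" *)
</pre> <p> The two <code>add</code> functions differ only in the blit primitive. The safe way to get rid of one of them is to convert the argument first, which costs a copy. </p>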
<p> This is the ironic part of the new concept: Although it tries to make the handling of strings more sound and safe, the immediate consequence in reality is that code needs to be duplicated because of missing polymorphism. Any half-way intelligent programmer will of course fall back to unsafe functions for casting bytes to strings and vice versa (<code>Bytes.unsafe_to_string</code> and <code>Bytes.unsafe_of_string</code>), and this only means that the new <code>-safe-string</code> option will be a driving force for using unsafe language features. </p> <p> Let's look at three modifications of the concept. Is there some easy fix? </p> <h2>Idea 1: <code>string</code> as a supertype of <code>bytes</code></h2> <p> We just allow that <code>bytes</code> can officially be coerced to <code>string</code>: </p> <pre>
let s = (b : bytes :> string)
</pre> <p> Of course, this weakens the immutability property: <code>string</code> may now be a read-only interface for a <code>bytes</code> buffer, and this buffer can be mutated, and this mutation can be observed through the <code>string</code> type: </p> <pre>
let mutable_string() =
  let b = Bytes.make 1 'X' in
  let s = (b :> string) in
  (s, Bytes.set b 0)

let (s, set) = mutable_string()
  (* s is now "X" *)
let () = set 'Y'
  (* s is now "Y" *)
</pre> <p> Nevertheless, this concept is not meaningless. In particular, if a function takes a string argument, it is guaranteed that the function doesn't modify the string. Also, string literals are immutable. Only when a function returns a string can we not be sure that the string isn't modified by a side effect.
</p> <p> This variation of the concept also solves the polymorphism problem we explained using the example of the <code>Lexing</code> module: It is now sufficient to implement <code>Lexing.from_string</code>, because <code>bytes</code> can always be coerced to <code>string</code>: </p><pre>
let from_bytes s = from_string (s :> string)
</pre> <h2>Idea 2: Add a read-only type <code>stringlike</code></h2> <p> Some people may feel uncomfortable with the implication of Idea 1 that the immutability of <code>string</code> can be easily circumvented. This can be avoided with a variation: Add a third type <code>stringlike</code> as the common supertype of both <code>string</code> and <code>bytes</code>. So we allow: </p><pre>
let sl1 = (s : string :> stringlike)
let sl2 = (b : bytes :> stringlike)
</pre> Of course, <code>stringlike</code> doesn't implement mutators (like <code>string</code>). It is nevertheless different from <code>string</code>: <ul> <li><code>string</code> is considered as absolutely immutable (there is no way to coerce <code>bytes</code> to <code>string</code>) </li><li><code>stringlike</code> is seen as the read-only API for either <code>string</code> or <code>bytes</code>, and it is allowed to mutate a <code>stringlike</code> behind the back of this API </li></ul> <p> <code>stringlike</code> is especially interesting for interfaces that need to be compatible with both <code>string</code> and <code>bytes</code>. In the <code>Lexing</code> example, we would just define </p><pre>
val from_stringlike : stringlike -> lexbuf
val from_stringlike_function : (stringlike -> int -> int) -> lexbuf
</pre> and then reduce the other constructors to just these two, e.g. <pre>
let from_string s = from_stringlike (s :> stringlike)
let from_bytes b = from_stringlike (b :> stringlike)
</pre> These other constructors are now only defined for the convenience of the user.
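<p> Since <code>stringlike</code> does not exist in the compiler, the closest thing one can write today is an emulation. The following sketch is only an approximation under my own assumptions: it models <code>stringlike</code> as a variant type, and the names <code>Str</code>, <code>Byt</code>, <code>of_string</code>, <code>of_bytes</code> and <code>count_char</code> are made up for the example. Unlike the proposed coercion, this emulation allocates a wrapper and inspects a tag on every access. </p><pre>
(* Hypothetical stand-in for the proposed supertype: a read-only view
   that is backed by either a string or a bytes value. *)
type stringlike = Str of string | Byt of bytes

let of_string s = Str s
let of_bytes b = Byt b   (* no copy: later mutations of b remain visible *)

let length = function
  | Str s -> String.length s
  | Byt b -> Bytes.length b

let get sl k =
  match sl with
  | Str s -> String.get s k
  | Byt b -> Bytes.get b k

(* One implementation now serves both kinds of input. *)
let count_char sl c =
  let n = ref 0 in
  for k = 0 to length sl - 1 do
    if get sl k = c then incr n
  done;
  !n

let () =
  let b = Bytes.of_string "banana" in
  let view = of_bytes b in
  Printf.printf "%d\n" (count_char (of_string "banana") 'a');  (* 3 *)
  Bytes.set b 0 'a';
  Printf.printf "%d\n" (count_char view 'a')                   (* 4 *)
</pre> <p> The last two lines show the property described above: the view itself offers no mutators, but a change made to the underlying <code>bytes</code> is observable through it. </p>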
<h2>Idea 3: Base <code>bytes</code> on bigarrays</h2> <p> This idea doesn't fix any of the mentioned problems. Instead, the thinking is: If we already accept the incompatibility between <code>string</code> and <code>bytes</code>, let's at least do it in a way so that we get the maximum out of it. Especially for I/O buffers, bigarrays are way better suited than strings: </p><ul> <li>I/O primitives can directly pass the bigarrays to the operating system (no need for an intermediate buffer as it is currently the case for <code>Unix.read</code> and <code>Unix.write</code>) </li><li>Bigarrays support the slicing of buffers (i.e. you can reference subbuffers directly) </li><li>Bigarrays can be aligned to page boundaries (which speeds up I/O on some operating systems) </li></ul> <p> So let's define: </p><pre>
type bytes = (char,Bigarray.int8_unsigned_elt,Bigarray.c_layout) Bigarray.Array1.t
</pre> Sure, there is now no way to unsafely cast strings to bytes and vice versa anymore, but arguably we shouldn't prefer one design over the other only for its unsafety. <p> Regarding <code>stringlike</code>, it is indeed possible to define it, but there is some runtime cost. As <code>string</code> and <code>bytes</code> now have different representations, any accessor function for <code>stringlike</code> would have to check at runtime whether it is backed by a <code>string</code> or by <code>bytes</code>. At least, this check is very cheap. </p> <h2>Conclusion</h2> I hope it has become clear that the current plan is not far-reaching enough, as the programmer would have to choose between bad alternatives: either pay a runtime penalty for additional copying and accept that some code needs to be duplicated, or use unsafe coercion between <code>string</code> and <code>bytes</code>. The latter is not desirable, of course, but it is surely the task of the language (designer) to make sound and safe string handling an attractive option.
I've presented three ideas that would all improve the concept in some respect. In particular, the combination of ideas 2 and 3 seems to be very attractive: back <code>bytes</code> by bigarrays, and provide a <code>stringlike</code> supertype for easing the programming of application buffers. <img src="/files/img/blog/bytes1_bug.gif" width="1" height="1"/> </div> <div> Gerd Stolpmann works as O'Caml consultant </div> <div> </div> Welcome IPv6 http://blog.camlcity.org/blog/ipv6.html http://blog.camlcity.org/blog/ipv6.html 21 Jun 2013 00:00 GMT <div> <b>camlcity.org now connected</b><br/>  </div> <div> For two weeks now, the camlcity.org website has been fully connected to IPv6. </div> <div> <p> Actually, the raw connectivity has existed for more than two years, but I hadn't found time to put the IP addresses into DNS. This is now done, making the site visible. </p> <p> Around 1% of the traffic is now via IPv6. This is way more than I was expecting. Here in Germany, only a few Internet providers have already rolled out IPv6, but the major players are planning it for 2014. It turns out that at home I already have IPv6, although only via DSLite. (NB. In the default DNS configuration a client connected with DSLite or other 6-in-4 technologies will pick the IPv4 address if both "Internets" are available, so such clients will not show up in my web server logs as IPv6.) </p> <p> The IPv6 world is different: no NAT anymore, and every computer has a globally routable address. This is something you need to get used to - the Internet appears again as a real peer-to-peer network as in the first years, and the distinction between client and datacenter connectivity is gone. Let's hope this drives innovation - like user-controlled social networks, for instance.
</p> </div> <div> Gerd Stolpmann works as O'Caml consultant </div> <div> </div> GODI is shutting down http://blog.camlcity.org/blog/godi_shutdown.html http://blog.camlcity.org/blog/godi_shutdown.html 22 Jul 2013 00:00 GMT <div> <b>Sorry!</b><br/>  </div> <div> Unfortunately, it is no longer possible for me to run the GODI distribution. GODI will not upgrade to OCaml 4.01 once it is out, and it will shut down the public service in the course of September 2013. </div> <div> <p>This website, camlcity.org, will remain up, but with reduced content. Existing GODI installations can continue to be used, but upgrades or bugfixes will not be available when GODI is off. </p><p> Although there are still a lot of GODI users, it is unavoidable to shut GODI down due to lack of supporters, especially package developers. I was more or less alone in the past months, and my time budget will not allow me to do the upgrade to OCaml 4.01 alone (when it is released). </p><p> Also, there was a lot of noise about a competing packaging system for OCaml in the past weeks: OPAM. Apparently, it got a lot of attention both from individuals and from organizations. As I see it, the OCaml community is too small to support two systems, and so in some sense GODI is displaced by OPAM. </p><p> The sad part is that OPAM is only clearly better in one point, namely in interacting with the community (via Github). In times where social networks are worth billions this is probably the decisive point. It doesn't matter that OPAM lacks some features GODI has. So there is some loss of functionality for the community (partly difficult to replace, like GODI's support for Windows). </p><p> If somebody wants to take over GODI, please do so. The <a href="https://godirepo.camlcity.org/svn/godi-bootstrap/">source code</a> is still available as well as the <a href="https://godirepo.camlcity.org/svn/godi-build/">package directories</a>.
Maybe it is sufficient to move the repository to a public place and to redesign the package release process to give GODI a restart. </p><p> Hoorn (NL), the 22nd July 2013, </p><p> Gerd Stolpmann </p> </div> <div> Gerd Stolpmann works as O'Caml consultant </div> <div> </div> Plasma Map/Reduce Slightly Faster Than Hadoop http://blog.camlcity.org/blog/plasma6.html http://blog.camlcity.org/blog/plasma6.html 01 Feb 2012 00:00 GMT <div> <b>A performance test</b><br/>  </div> <div> Last week I spent some time running map/reduce jobs on Amazon EC2. In particular, I compared the performance of Plasma, my own map/reduce implementation, with Hadoop. I just wanted to know how far my implementation was behind the most popular map/reduce framework. However, the surprise was that Plasma turned out to be slightly faster in this setup. </div> <div> <div style="float:right; width: 50ex; font-size:small; color:grey; border: 1px solid grey; padding: 1ex; margin-left: 2ex"> This article is also available in other languages: <dl> <dt><a href="http://science.webhostinggeeks.com/plasma-map-reduce">[Serbo-Croatian]</a> </dt><dd>translation by Anja Skrba from <a href="http://webhostinggeeks.com/">Webhostinggeeks.com</a> </dd></dl> </div> <p> I would not call this test a "benchmark". Amazon EC2 is not a controlled environment, as you always only get partial machines, and you don't know how many resources are consumed by other users on the same machines. Also, you cannot be sure how far apart the nodes are in the network. Finally, there are some special effects coming from the virtualization technology, especially the first write of a disk block is slower (roughly half the normal speed) than following writes. However, EC2 is good enough to get an impression of the speed, and one can hope that all the test runs get the same handicap on average. </p><p> The task was to sort 100G of data, given in 10 files.
Each line has 100 bytes, divided into a key of 8 bytes, a TAB character, 90 random bytes as value, and an LF character. The key was randomly chosen from 65536 possible values. This means that there were lots of lines with the same key - a scenario that I think is more typical of map/reduce than having unique keys. The output is partitioned into 80 sets. </p><p> I allocated 1 larger node (m1-xlarge) with 4 virtual cores and 15G of RAM acting as combined name- and datanode, and 9 smaller nodes (m1-large) with 2 virtual cores and 7.5G of RAM for the other datanodes. Each node had access to two virtual disks that were configured as a RAID-0 array. The speed for sequential reading or writing was around 160 MB/s for the array (but only 80 MB/s the first time blocks were written). Apparently, the nodes had Gigabit network cards (the maximum transfer speed was around 119MB/s). </p><p> During the tests, I monitored the system activity with the sar utility. I observed significant cycle stealing (meaning that a virtual core is blocked because there is no free real core), often reaching values of 25%. This could be interpreted as overdriving the available resources, but another explanation is that the hypervisor needed this time for itself. Anyway, this effect also calls the reliability of this test into question. </p><h2>The contenders</h2> <p> Hadoop is the top dog in the map/reduce scene. In this test, the version from Cloudera 0.20.2-cdh3u2 was used, which contains more than 1000 patches against the vanilla 0.20.2 version. Written in Java, it needs a JVM at runtime, which here was IcedTea 1.9.10 distributing OpenJDK 1.6.0_20. I did not do any tuning, hoping that the configuration would be ok for a small job. The HDFS block size was 64M, without replication. </p><p> The contender is Plasma Map/Reduce. I started this project two years ago in my spare time. It is not a clone of the Hadoop architecture, but includes many new ideas. 
In particular, a lot of work went into the distributed filesystem PlasmaFS, which features an almost complete set of file operations, and controls the disk layout directly. The map/reduce algorithm uses a slightly different scheme which tries to delay the partitioning of the data to get larger intermediate files. Plasma is implemented in OCaml, which isn't VM-based but compiles the code directly to assembly language. In this test, the blocksize was 1M (Plasma is designed for smaller-sized blocks). The software version of Plasma is roughly 0.6 (a few svn revisions before the release of 0.6). </p><h2>Results</h2> <p>The runtimes: </p><p> </p><table> <tr> <td><b>Hadoop:</b></td> <td><b>2265 seconds</b> (37 min, 45 s)</td> </tr> <tr> <td><b>Plasma:</b></td> <td><b>1975 seconds</b> (32 min, 55 s)</td> </tr> </table> <p> Given the uncertainty of the environment, this is no big difference. But let's have a closer look at the system activity to get an idea why Plasma is a bit faster. </p><h2>CPU</h2> For the following, I simply took one of the datanodes, and created diagrams (with kSar): <p> <img src="/files/img/blog/edited_hadoop_cpu_all.png" width="799" height="472"/> </p><p> <img src="/files/img/blog/edited_plasma_cpu_all.png" width="800" height="471"/> </p><p> Note that kSar does not draw graphs for %iowait and %steal, although these data are recorded by sar. This explains why the sum of user, system and idle is not 100%. </p><p> What we see here is that Hadoop consumes all CPU cycles, whereas Plasma leaves around 1/3 of the CPU capacity unused. Given the fact that this kind of job is normally I/O-bound, it just means that Hadoop is more CPU-hungry, and would have benefited from getting more cores in this test. </p><h2>Network</h2> In this diagram, reads are blue and red, whereas writes are green and black. 
The first curve shows packets per second, and the second bytes per second: <p> <img src="/files/img/blog/edited_hadoop_eth0.png" width="800" height="333"/> </p><p> <img src="/files/img/blog/edited_plasma_eth0.png" width="800" height="319"/> Summing reads and writes up, Hadoop uses only around 7MB/s on average whereas Plasma transmits around 25MB/s, more than three times as much. There could be two explanations: </p><ul> <li>Because Hadoop is CPU-underpowered, it stays below its potential </li><li>The Hadoop scheme is more optimized for keeping the network bandwidth as low as possible </li></ul> The background for the second point is the following: Because Hadoop partitions the data immediately after mapping and sorting, the data (ideally) has to cross the network only once. This is different in Plasma, which generally partitions the data iteratively. In this setup, after mapping and sorting only 4 partitions are created, which are further refined in the following split-and-merge rounds. As we have 80 partitions in total here, there is at least one further step in which the data partitioning is refined, meaning that the data has to cross the network roughly twice. This already explains 2/3 of the observed difference. (As a side note, one can configure how many partitions are initially created after mapping and sorting, and it would have been possible to mimic Hadoop's scheme by setting this value to 80.) 
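<p> To make the difference between the two partitioning schemes a bit more tangible, here is a minimal Python sketch. It is not taken from Plasma or Hadoop: the hash-based partitioning function, the helper names, and the assumption of exactly one refinement round (4 coarse partitions, each covering 20 of the 80 final ones) are purely illustrative. It only counts how often a record is moved in each scheme. </p>

```python
# Illustrative sketch only: neither Plasma's nor Hadoop's actual code.
# The hash function, helper names and the single refinement round are assumptions.
import zlib

FINAL_PARTITIONS = 80    # number of output sets in the test
INITIAL_PARTITIONS = 4   # partitions created right after mapping and sorting


def final_partition(key: bytes) -> int:
    """Assumed partitioning function: a stable hash modulo the partition count."""
    return zlib.crc32(key) % FINAL_PARTITIONS


def coarse_partition(key: bytes) -> int:
    """The coarse partition that contains the key's final partition."""
    per_coarse = FINAL_PARTITIONS // INITIAL_PARTITIONS  # 20 final partitions each
    return final_partition(key) // per_coarse


def route_direct(keys):
    """Direct scheme: every record moves once, straight to its final partition."""
    placement, moves = {}, 0
    for key in keys:
        placement[key] = final_partition(key)
        moves += 1
    return placement, moves


def route_iterative(keys):
    """Iterative scheme: coarse partitioning first, then one refinement round."""
    coarse, moves = {}, 0
    for key in keys:                       # round 1: into one of 4 coarse partitions
        coarse.setdefault(coarse_partition(key), []).append(key)
        moves += 1
    placement = {}
    for bucket in coarse.values():         # round 2: refine to the final partition
        for key in bucket:
            placement[key] = final_partition(key)
            moves += 1
    return placement, moves
```

<p> Under these assumptions both functions end up with the same final placement, but the iterative variant moves every record twice, which corresponds to the "roughly twice" network crossing described above. In a real cluster, some of these moves would stay on the local node, so the sketch gives an upper bound rather than a measurement. </p>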
<h2>Disks</h2> These diagrams depict the disk reads and writes in KB/second: <p> <img src="/files/img/blog/edited_hadoop_md0.png" width="800" height="332"/> </p><p> <img src="/files/img/blog/edited_plasma_md0.png" width="800" height="332"/> The average numbers are (directly taken from sar): </p><p> </p><table> <tr> <td> </td> <th>Hadoop</th> <th>Plasma</th> </tr> <tr> <td>Read/s:</td> <td>17.6 MB/s</td> <td>31.2 MB/s</td> </tr> <tr> <td>Write/s:</td> <td>30.8 MB/s</td> <td>33.9 MB/s</td> </tr> </table> <p> Obviously, Plasma reads nearly twice as much data from disk as Hadoop, whereas the write speed is about the same. Apart from this, it is interesting that the shapes of the curves are quite different: Hadoop has a period of high disk activity at the end of the job (when it is busy merging data), whereas Plasma utilizes the disks better during the first third of the job. </p><h2>Plausibility</h2> <p> Neither of the contenders made the best use of the I/O resources at all times. Part of the difficulty of developing a map/reduce scheme is to balance the load put onto the disks and onto the network. It is not good when e.g. the disks are used to 100% at a certain point and the network is underutilized, but during the next period the network is at 100% and the disk not fully used. A balanced distribution of the load reaches higher throughput in total. </p><p> Let's analyze the Plasma scheme in a bit more detail. The data set of 100G (which does not change in volume during the processing) is copied four times in total: once in the map-and-sort phase, and three times in the reduce phase (for this volume Plasma needs three merging rounds). This means we have to transfer 4 * 100G of data in total, or 40G of data per node (remember we have 10 nodes). We ran 22 cores for 1975 seconds, which gives a capacity of 43450 CPU seconds. 
Plasma tells us in its reports that it used 3822 CPU seconds for in-RAM sorting, which we should subtract for analyzing the I/O throughput. Per core this is about 173 seconds (3822 / 22). This means each node had 1975-173 = 1802 seconds for handling the 40G of data. This makes around 22 MB per second on each node. </p><p> The Hadoop scheme differs mostly in that the data is only copied twice in the merge phase (because Hadoop by default merges more files in one round than Plasma). However, because of its design there is an extra copy at the end of the reduce phase (from disk to HDFS). This means Hadoop also solves the same job by transferring 4 * 100G of data. There is no counter for measuring the time spent for in-RAM sorting. Let's assume this time is also around 3800 seconds, or roughly 175 seconds per core. This means each node had 2265 - 175 = 2090 seconds for handling 40G of data, or 19 MB per second on each node. </p><h2>Conclusion</h2> <p> It looks very much as if both implementations are slowed down by specifics of the EC2 environment. Especially the disk I/O, probably the essential bottleneck here, is far below what one can expect. Plasma probably won because it uses the CPU more efficiently, whereas other aspects like network utilization are better handled by Hadoop. </p><p> For my project this result just means that it is on the right track. In particular, this small setup (only 10 nodes) is easily handled, which suggests that Plasma can scale at least to a small multiple of this. The bottleneck here would be the namenode, but there is still a lot of headroom. </p><h2>Where to get Plasma</h2> <p>Plasma Map/Reduce and PlasmaFS are bundled together in one download. Here is the <a href="http://projects.camlcity.org/projects/plasma.html">project page</a>. 
</p> </div> <div> Gerd Stolpmann works as O'Caml consultant </div> <div> </div> After NoSQL there will be NoServer http://blog.camlcity.org/blog/plasma5.html http://blog.camlcity.org/blog/plasma5.html 04 Nov 2011 00:00 GMT <div> <b>An experiment, and a vision</b><br/>  </div> <div> <p> The recent success of NoSQL technologies is due not only to the fact that they take advantage of distribution and replication, but even more to the "middleware effect": these features have become relatively easy to use. It is no longer necessary to be an expert in these cluster techniques in order to profit from them. Let's think a bit ahead: what could a platform look like that makes distributed programming even easier, and that integrates several styles of storing data and managing computations? <cc-field name="maintext"> <p> The starting point for this exploration is a recent experience I had with my own attempt in the NoSQL arena, the <a href="http://plasma.camlcity.org">Plasma project</a>. Two weeks ago, it was "only" a distributed, replicating, and failure-resilient filesystem, PlasmaFS, with its own map/reduce implementation on top of it. Then I had an idea: is it possible to develop a key/value database on top of this filesystem? Which features, and relative advantages/disadvantages would it have? In other words, I was examining whether the existing platform makes it simpler to develop a database with a reasonable feature set. </p><p> When we talk about clusters, I especially have Internet applications in mind that are bombarded by the users with requests, but that also have to do a lot of background processing. </p><h2>The key/value database needed less than 2000 lines of code</h2> <p> Now, PlasmaFS does not follow the simple pattern of HDFS, but is based on a transactional core, and it even allows the users to manage the transactions. 
For example, it is possible to rename a bunch of files atomically by just wrapping the rename operations into a single transaction. The transactional support goes even further: When reading from a file one can activate a special snapshot mode, which just means that the reader's view of the file is isolated from any writes happening at the same time. </p><p> These are clearly advanced features, and the question was whether they would help in writing a key/value database library. And yes, they were extremely helpful - in less than 2000 lines of code this library provides data distribution and replication, a high degree of data safety, almost unlimited scalability for database reads, and reasonable performance for writes. Of course, most of these features are just "inherited" from PlasmaFS, and the library just had to implement the file format (i.e. a B-tree, see <a href="http://projects.camlcity.org/projects/dl/plasma-0.5/doc/html/Plasmakv_intro.html"> this page for details</a>). This is not cheating, but exactly the point: the platform makes it easy to provide features that would otherwise be extremely complicated to provide. </p><h2>NoServer</h2> <p> This key/value database is just a library, and one can use it only on machines where PlasmaFS is deployed. Of course it is possible to access the same database file from several machines - PlasmaFS handles all the networking involved with it. The point is that during the implementation of the library this never had to be taken into account. There is no networking code in this library, and this is why it is the first example of the new NoServer paradigm - not only server. </p><p> The genuine advantage of this paradigm is that it enables developers to write code they would never be able to create without the help of the platform. This is a bit comparable to the current situation for SQL databases: Everybody can store data in them, even over the network, without needing to have any clue how this works in detail. 
In the NoServer paradigm, we just go one step further, because the services provided by the platform are a lot more low-level, and the developer has a lot more freedom. Instead of a query language, the shared resources are accessed with normal file operations, extended by transactional directives. The hope is that this makes a lot of server programming superfluous, especially the difficult parts of it (e.g. what to do when a machine crashes). </p><p> A simple key/value database is obviously not difficult to create with these programming means. The interesting question is what else can be done with it in a cluster environment. Obviously, having a common filesystem on all machines of the cluster makes a lot of the file copying superfluous that a normal cluster would do with rsync and/or ssh. PlasmaFS can even be directly mounted (although the transactional features are unavailable then), so even applications that have not been specially ported to it can access PlasmaFS files. An example would be a read-only Lucene search index residing in PlasmaFS. Replacing the index by an updated one would be done by simply moving the new index into the right directory, and signalling Lucene that it has to re-open the index. </p><p> This much of Plasma is implemented, and works well (I just released version 0.5, which is now beta quality). The vision goes of course beyond that. </p><h2>What the platform also needs</h2> <p> There are a number of further datastructures that can obviously be well represented in files, such as hashtables or queues. Let's explore the latter in a bit more detail: What would a queue manager look like? There are a few data representation options. For example, every queue element could be a file in a directory, or a container format could be established to which the elements can be appended. 
PlasmaFS also allows cutting arbitrary holes into files, so it is even possible to physically remove elements from the beginning of the queue file by just removing the data blocks storing the elements from the file. As we don't want to run the queue manager as a server, but just as a library inside any program accessing the queue, the question is how event notifications are handled (which would be obvious in a server context). Usually, one has to notify some followup processor when new elements have been added to the queue. Plasma currently does not include a method for doing this, so the platform needs to be extended by a notification framework (which should not be too difficult). </p><p> An important question is also how programs running on different nodes are activated. In my vision there would be a central task execution manager. Of course, this manager is normal client/server middleware. Again, the point here is that the application developer needs no special skills for triggering remote activation, but just uses libraries. I don't have a completely clear picture of this part yet, but it seems to be necessary to have the option of invoking programs in the inetd style as well as directly, as if started via ssh. Also, a central directory would be maintained that includes important data such as which program can be run on which node. </p><h2>We won't live totally without servers, only with fewer ones</h2> <p> My vision does not ban servers completely. We will still need them for special features or data access patterns, and of course for interaction with other systems. For example, PlasmaFS is bad at coordinating concurrent write accesses to the same file. Also, PlasmaFS employs a central namenode with a limited capacity only. So, if you are doing OLTP processing, a normal SQL database will still do better. If you need extraordinary write performance, but can pay the price of weakened consistency guarantees, a system like Cassandra will work better. 
</p><p> Nevertheless, there is the big field of "average deployments" where the number of nodes is not too big and the performance requirements are not too special, but the ACID guarantees PlasmaFS gives are essential. For this field, the NoServer paradigm could be the ideal choice to reduce the development overhead dramatically. </p><h2>Check Plasma out</h2> The <a href="http://plasma.camlcity.org">Plasma homepage</a> provides a lot of documentation, and especially downloads. Also take a look at the <a href="http://plasma.camlcity.org/plasma/perf.html">performance page</a>, describing a few tests I recently ran. </cc-field> </p> </div> <div> </div> <div> Gerd Stolpmann works as O'Caml consultant. <a href="search1.html">Currently looking for new jobs as consultant!</a> </div> <div> </div> PlasmaFS http://blog.camlcity.org/blog/plasma4.html http://blog.camlcity.org/blog/plasma4.html 18 Oct 2011 00:00 GMT <div> <b>A serious distributed filesystem</b><br/>  </div> <div> <p> A few days ago, I released <a href="http://plasma.camlcity.org">Plasma-0.4.1</a>. This article gives an overview of its filesystem subsystem, which is actually the more important part. PlasmaFS differs in many points from popular distributed filesystems like HDFS. This starts right at the beginning, with the requirements analysis. <cc-field name="maintext"> <p> A distributed filesystem (DFS) makes it possible to store giant amounts of data. A high number of data nodes (computers with hard disks) can be attached to a DFS cluster, and usually a second kind of node, called name node, is used to store metadata, i.e. which files are stored and where. The point is that the volume of metadata can be very low compared to the payload data (the ratios are somewhere between 1:10,000 and 1:1,000,000), so a single name node can manage a quite large cluster. 
Also, the clients can contact the data nodes directly to access payload data - the traffic is not routed via the name node like in "normal" network filesystems. This allows enormous bandwidths. </p><p> The motivation for developing another DFS was that existing implementations, and especially the popular HDFS, make (in my opinion) unfortunate compromises to gain speed: </p><ul> <li>The metadata is not well protected. Although the metadata is saved to disk and usually also replicated to another computer, these "safety copies" lag behind. In the case of an outage, data loss is common (HDFS even fails fatally when the disk fills up). Given the amount of data, this is not acceptable. It's like a local filesystem without journaling.<br/>  </li><li>The name node protocol is too simplistic, and because of this, DFS implementations need ultra-high-speed name node implementations (at least several tens of thousands of operations per second) to manage larger clusters. Another consequence is that only large block sizes (several megabytes) promise decent access speeds, because this is the only implemented strategy to reduce the frequency of name node operations.<br/>  </li><li>Unless you can physically separate the cluster from the rest of the network, security is a requirement. It is difficult to provide, however, mainly because the data nodes are independently accessed, and you want to avoid having the data nodes continuously check for access permissions. So the compromise is to leave this out in the DFS, and rely on complicated and error-prone configurations in network hardware (routers and gateways). </li></ul> <p> I'm not saying that HDFS is a bad implementation. My point is only that there is an alternative where safety and security are taken more seriously, and that there are other ways to get high speed than those that are implemented in HDFS. </p><h2>Using SSDs for transacted metadata stores</h2> PlasmaFS starts at a different point. 
It uses a data store with full transactional support (right now this is PostgreSQL, just for development simplicity, but other, more light-weight systems could also fill this role). This includes: <ul> <li>Data are made persistent in a way so that full ACID support is guaranteed (remember, the ACID properties are atomicity, consistency, isolation, and durability). </li><li>For keeping replicas synchronized, we demand support for two-phase commit, i.e. that transactions can be prepared before the actual commit with the guarantee that the commit is fail-safe after preparation. (Essentially, two-phase commit is a protocol between two database systems keeping them always consistent.) </li></ul> This is, by the way, the established standard way of ensuring data safety for databases. It comes with its own problems, and the most challenging is that commits are relatively slow. The reason for this is the storage hardware - for normal hard disks the maximum frequency of commits is a function of the rotation speed. Fortunately, there is now an alternative: SSDs at present allow several tens of thousands of syncs per second, which is two orders of magnitude more than classic hard disks provide. Good SSDs are still expensive, but luckily moderate disk sizes are already sufficient (with only a 100G database you can already manage a really giant filesystem). <p>Still, writing each modification directly to the SSD limits the speed compared to what systems like HDFS can do (because HDFS keeps the data in RAM, and only writes a copy to disk now and then). We need more techniques to address the name node as a potential bottleneck: </p><ul> <li>PlasmaFS provides a transactional view to users. This works very much like the transactions in SQL. The performance advantage here is that several write operations can be carried out with only one commit. 
PlasmaFS takes this so far that an unlimited number of metadata operations can be put into a transaction, such as creating and deleting files, allocating blocks for the files, and retrieving block lists. It is possible to write terabytes of data to files with <i>only a single commit</i>! Applications accessing large files sequentially (as, e.g., in the map/reduce framework) can especially profit from this scheme.<br/>  </li><li>PlasmaFS addresses blocks linearly: for each data node the blocks are identified by numbers from 0 to n-1. This is safe, because we manage the consistency globally (basically, there is a kind of join between the table managing which blocks are used or free, and the table managing the block lists per file, and our safety measures allow us to keep this join consistent). In contrast, other DFS implementations use GUIDs to identify blocks. The linear scheme, however, allows block lists to be transmitted and stored in a compressed way (extent-based). For example, if a file uses the blocks 10 to 14 on a data node, this is stored as "10-14", and not as "10,11,12,13,14". Also, block allocations are always done for ranges of blocks. This greatly reduces the number of name node operations while only moderately increasing their complexity.<br/>  </li><li>A version number is maintained per file that is increased whenever data or metadata are modified. This makes it possible to keep external caches up to date with only low overhead: A quick check whether the version number has changed is sufficient to decide whether the cache needs to be refreshed. This is reliable, in contrast to cache consistency schemes that are based only on the last modification time. Currently this is used to keep the caches of the NFS bridge synchronized. Applications accessing only a few files randomly especially profit from such caching. </li></ul> <p> I consider the map/reduce part of Plasma an especially good test case for PlasmaFS. 
Of course, this map/reduce implementation is perfectly adapted to PlasmaFS, and uses all possibilities to reduce the frequency of name node operations. It turns out that a typical running map/reduce task contacts the name node only every 3-4 seconds, usually to refill a buffer that has run empty, or to flush a full buffer to disk. The point here is that a buffer can be larger than a data block, and that a single name node transaction is sufficient to handle all blocks in the buffer in one go. The buffers are typically way larger than a single block, so this reduces the number of name node operations quite dramatically. (Important note: This number (3-4) is only correct for Plasma's map/reduce implementation, which uses a modified and more complex algorithm scheme, but it is not applicable to the scheme used by Hadoop.) </p><h2>Speed</h2> <p> I have done some tests with the latest development version of Plasma. The peak number of commits per second seems to be around 500 (here, a "commit" is a transaction writing data that can include several data update operations). This test used a recently bought SSD, and ran on a quad-core server machine. It was not evident that the SSD was the bottleneck (one indication is that the test ran only slightly faster when syncs were turned off), so there is probably still a lot of room for optimization. </p><p> Given that a map/reduce task needs the name node only every 3-4 seconds (see above), this "commit speed" would theoretically be sufficient for around 1600 tasks running in parallel (500 commits per second times roughly 3.2 seconds between contacts). It is likely that other limits are hit first (e.g. the switching capacity). Anyway, these are encouraging numbers showing that this young project is not on the wrong track. </p><p> The above techniques are already implemented in PlasmaFS. 
More advanced options that could be worth an implementation include: </p><ul> <li>As we can maintain exact replicas of the primary name node (via two-phase commit), it becomes possible to also use the replicas for read accesses. For certain types of read operations this is non-trivial, though, because they have an effect on the block allocation map (essentially we would need to synchronize a certain buffer in both the primary and secondary servers that controls delayed block deallocation). Nevertheless, this is certainly a viable option. Even writes could be handled by the secondary nodes, but this tends to become very complicated, and is probably not worth it.<br/>  </li><li>An easier option to increase the capacity is to split the file space, so that each name node takes care of a partition only. A user transaction would still need a uniform view on the filesystem, though. If a name node receives a request for an operation it cannot do itself, it automatically extends the scope of the transaction to the name node that is responsible for the right partition. This scheme would also use the two-phase commit protocol for keeping the partitions consistent. I think this option is viable, but only at the price of a complex development effort. </li></ul> <p> Given that these two improvements are very complicated to implement, it is unlikely that they will be done soon. There is still a lot of fruit hanging at lower branches of the tree. </p><h2>Delegated access control checks</h2> <p> Let's quickly discuss another problem, namely how to secure accesses to data nodes. It is easy to accept that the name nodes can be secured with classic authentication and authorization schemes in the same style as they are used for other server software. For data nodes, however, we face the problem that we need to supervise every access to a data block individually, but want to avoid any extra overhead, especially that each data access needs to be checked with the name node. 
</p><p> PlasmaFS uses a special cryptographic ticket system to avoid this. Essentially, the name node creates random keys at periodic intervals, and broadcasts these to the data nodes. These keys are secrets shared by the name and data nodes. The accessing clients get only HMAC-based tickets generated from the keys and from the block ID the clients are granted access to. These tickets can be checked by the data nodes because these nodes know the keys. When the client loses the right to access the blocks (i.e. when the client transaction ends), the corresponding key is revoked. </p><p> With some additional tricks it can be achieved that the only communication between the name node and the data node is a periodic maintenance call that hands out the new keys and revokes the expired keys. That's an acceptable overhead. </p><h2>Other quality-assuring features</h2> <p> PlasmaFS implements the POSIX file semantics almost completely. This includes the possibility of modifying data (or better, replacing blocks by newer versions, which is not possible in other DFS implementations), the handling of deleted files, and the exclusive creation of new files. There are a few exceptions, though, namely neither the link count nor the last access time of files is maintained. Also, lockf-style locks are not yet available. </p><p> For supporting map/reduce and other distributed algorithm schemes, PlasmaFS offers locality functions. In particular, one can find out on which nodes a data block is actually stored, and one can also request that a new data block be stored on a certain node (if possible). </p><p> The PlasmaFS client protocol is based on SunRPC. This protocol has quite good support on the system level, and it supports strong authentication and encryption via the GSS-API extension (which is actually used by PlasmaFS, together with the SCRAM-SHA1 mechanism). 
I know that younger developers consider it outdated, but even the Facebook generation must accept that it can keep up with the requirements of today, and that it includes features that more modern protocols do not provide (like UDP transport and GSS-API). For the quality of the code it is important that modifying the SunRPC layer is easy (e.g. adding a new procedure or changing an existing one), and does not imply much coding. Because of this, the PlasmaFS protocol could be kept quite clean on the one hand, while still being expressive enough on the other hand to support complex transactions. </p><p> PlasmaFS is accessible from many environments. Applications can access it via the mentioned SunRPC protocol (with all features), but also via NFS, and via a command-line client. In the future, WebDAV support will also be provided (which is an extension of HTTP, and which will ensure easy access from many programming environments). </p><h2>Check Plasma out</h2> The <a href="http://plasma.camlcity.org">Plasma homepage</a> provides a lot of documentation, and especially downloads. Also take a look at the <a href="http://plasma.camlcity.org/plasma/perf.html">performance page</a>, describing a few tests I recently ran. </cc-field> </p> </div> <div> </div> <div> Gerd Stolpmann works as O'Caml consultant. <a href="search1.html">Currently looking for new jobs as consultant!</a> </div> <div> </div>