<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Blog on camlcity.org</title>
    <link>http://blog.camlcity.org</link>
    <language>en</language>
    <description>Articles by Gerd Stolpmann about O'Caml</description>

    
        <item>
          <title>WasiCaml: Translate OCaml Code to WebAssembly</title>
          <guid>http://blog.camlcity.org/blog/wasicaml1.html</guid>
          <link>http://blog.camlcity.org/blog/wasicaml1.html</link>
          <pubDate>15 Jul 2021 00:00 GMT</pubDate>
          <description>

&#60;div&#62;
  &#60;b&#62;The portability story behind WasiCaml&#60;/b&#62;&#60;br/&#62;&#38;#160;
&#60;/div&#62;

&#60;div&#62;
  
  For a recent project we wrote a compiler that translates a
  domain-specific language (DSL) to some runnable form, and we did
  that in OCaml. The DSL is now part of an Electron-based integrated
  development environment (IDE) that will soon be available
  from &#60;a href=&#34;https://remixlabs.com&#34;&#62;Remix Labs&#60;/a&#62;. Electron runs
  on several operating systems, but the DSL compiler originally did
  not.  How could we get the DSL compiler to run on as many
  operating systems as possible?  This was the question we faced when
  starting the development of WasiCaml, a translator from OCaml
  bytecode to WebAssembly.

&#60;/div&#62;

&#60;div&#62;
  
&#60;p&#62;
  Of course, Electron is just an example of a cross-platform
  environment.  You can develop apps for Mac, Windows, and Linux, and
  it is Javascript-based.  We picked Electron for porting the user
  interface of the IDE to the desktop - originally the IDE was written
  for the web, and the DSL compiler was running on a server backing
  the web app. Initially, the Electron version of the IDE simply started
  a native binary of the DSL compiler as a background server process,
  just as we did for the web. But this brings back the very cross-build
  problem you want to avoid by running something in Electron: we would
  have needed to set up several build pipelines, one for each OS, in
  order to build the DSL compiler for the targets we wanted to support.

&#60;/p&#62;&#60;p&#62;
  There are already tools to translate OCaml to Javascript (namely
  Bucklescript and js_of_ocaml), and we could have used these to fiddle
  the DSL compiler into the Javascript code base. However, this did
  not feel right: we would have had to reorganize the OCaml code base
  because you can&#38;#39;t link in C libraries, and driving the DSL compiler
  would have been quite adventurous (it talks via a bidirectional
  pipeline to its clients). At that time we were already exploring
  WebAssembly for other parts of the system, and the idea came up
  to also use WebAssembly for running the DSL compiler. The
  &#60;a href=&#34;https://github.com/remixlabs/wasicaml/&#34;&#62;WasiCaml&#60;/a&#62;
  project was born (with the translation to Javascript kept as plan B,
  should this turn out to be more difficult than expected).

  &#60;/p&#62;&#60;h2&#62;A quick intro to WebAssembly&#60;/h2&#62;

&#60;p&#62;
  As the name suggests, WebAssembly provides a fairly low-level virtual
  machine for running the code. The instructions are comparable to the
  ones you find in a CPU, e.g. load, store, arithmetic. The code is
  structured into functions which take a fixed number of parameters
  and return a single result. The functions can have local variables
  that can be read and written by the code. The parameters and variables
  can have one of four numeric types (i32, i64, f32, and f64).

&#60;/p&#62;&#60;p&#62;
  For example, this is a WebAssembly module with just one function that
  loads a 32-bit number from a memory location, adds one, and returns
  the result (the memory itself is not modified):
&#60;/p&#62;&#60;blockquote&#62;
&#60;code style=&#34;white-space: pre&#34;&#62;
(module
  (import &#38;#34;env&#38;#34; &#38;#34;memory&#38;#34; (memory $memory 1))
  (func $incr (export &#38;#34;incr&#38;#34;) (param $x i32) (result i32)
    (local.get $x)
    (i32.load)
    (i32.const 1)
    (i32.add)
    (return)
  )
)
&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;
  Here, the code is given in the textual format known as WAT. For running
  it, you first need to convert it to the binary format (WASM), e.g.
  with a tool like &#60;a href=&#34;https://github.com/WebAssembly/wabt&#34;&#62;wat2wasm&#60;/a&#62;.

&#60;/p&#62;&#60;p&#62;
  Also note that there is an operand stack: &#60;code&#62;local.get&#60;/code&#62; pushes
  the value of the variable onto this stack, and &#60;code&#62;i32.load&#60;/code&#62; pops
  an address from the stack, loads the number found at that address, and
  pushes it onto the stack. This stack is mainly meant to express the code
  in a very compact way. The engine running the code normally translates
  the stack operations into a more efficient form before starting up.

&#60;/p&#62;&#60;p&#62;
  A WebAssembly VM is equipped with linear memory, i.e. the memory
  addresses go from 0 to a maximum address, without fragmentation, and
  without address ranges supporting special semantics like mapped files. The
  memory is only used for data - the running code is inaccessible
  (i.e. the VM has a Harvard architecture), and this also includes the
  call stack and other parts of the VM (e.g. you cannot iterate over
  the local variables of the functions). In order to also support
  indirect jumps, there is a way to reference functions by numeric
  IDs.

&#60;/p&#62;&#60;p&#62;
  Typically, WebAssembly VMs translate the code to the native
  instruction set of the host before running it
  (often as JIT compilers, but there are now also engines doing the
  translation statically ahead of time, and producing native binaries),
  and these engines almost reach native speed.  All current browsers support
  WebAssembly now, and it is also present in other Javascript-based
  environments (like node, or the Electron platform).  Although it
  started as a web technology, WebAssembly is not limited to the web.
  For example, &#60;a href=&#34;https://docs.wasmtime.dev/&#34;&#62;wasmtime&#60;/a&#62;
  allows you to embed a WebAssembly engine into almost any environment
  - e.g. you could embed the engine into an application server written
  in Go. In this case, there is no Javascript involved at all.
    
  &#60;/p&#62;&#60;h2&#62;WASI&#60;/h2&#62;

&#60;p&#62;
  While the WebAssembly standard defines how to express the code and
  how to run it, there is still the question of how to use it from
  popular languages like C and
  Rust. The &#60;a href=&#34;https://wasi.dev/&#34;&#62;WASI&#60;/a&#62; standard is an ABI
  that answers a lot of the questions. As an ABI it defines calling
  conventions, but it is not limited to that. In particular, there is
  a version of libc that defines a Unix-like set of base functionality
  the language-specific runtime can use. Also, WASI defines a set of host
  functions that play a role comparable to system calls in the WebAssembly
  world, and that allow access to files, the process environment, and
  the current time. With the help of WASI you can compile many C
  or Rust libraries to WebAssembly, and the porting effort is low.

&#60;/p&#62;&#60;p&#62;
  WASI is a multi-lingual environment, and in particular you can link
  code written in different languages into the same executable. This is
  possible because the language-specific runtimes have a common foundation
  (libc), and e.g. memory allocated from one language also counts as
  &#38;#34;taken&#38;#34; within the other language.

&#60;/p&#62;&#60;p&#62;
  WASI is still in an early stage. While developing with it I discovered
  a couple of bugs, but the functionality is already impressive and
  usable for many purposes.
  
  &#60;/p&#62;&#60;h2&#62;WasiCaml&#60;/h2&#62;

&#60;p&#62;
  So now, what is WasiCaml, and how can I use it?

&#60;/p&#62;&#60;p&#62;
  Let&#38;#39;s assume you have a bytecode executable created by something like
&#60;/p&#62;&#60;blockquote&#62;
&#60;code style=&#34;white-space: pre&#34;&#62;
  ocamlc -o myexecutable mycode.ml
&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;Now, you can further translate the bytecode executable to WebAssembly:

&#60;/p&#62;&#60;blockquote&#62;
&#60;code style=&#34;white-space: pre&#34;&#62;
  wasicaml -o mywasm.wasm myexecutable
&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;If you want to run this executable, you need a specially configured
WebAssembly engine which can be found in ~/.wasicaml/js after installation:

&#60;/p&#62;&#60;blockquote&#62;
&#60;code style=&#34;white-space: pre&#34;&#62;
  node ~/.wasicaml/js/main.js ./mywasm.wasm ./mywasm.wasm arg ...
&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;The &#60;code&#62;mywasm.wasm&#60;/code&#62; binary is portable and can be run
  everywhere!

&#60;/p&#62;&#60;p&#62;For simplicity, wasicaml can also generate a wrapper that hides the
&#60;code&#62;node&#60;/code&#62; invocation, and this is triggered by just omitting
the .wasm suffix:

&#60;/p&#62;&#60;blockquote&#62;
&#60;code style=&#34;white-space: pre&#34;&#62;
  wasicaml -o mywasm myexecutable
&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;Now you can run the program simply with &#60;code&#62;./mywasm&#60;/code&#62; (but note
that the wrapper is not portable).

&#60;/p&#62;&#60;p&#62;Another option is to link in C libraries, e.g.
&#60;/p&#62;&#60;blockquote&#62;
&#60;code style=&#34;white-space: pre&#34;&#62;
  wasicaml -o mywasm.wasm myexecutable -cclib ~/.wasicaml/lib/ocaml/libunix.a
&#60;/code&#62;
&#60;/blockquote&#62;

Of course, the C library must also be WASI-compatible.

&#60;p&#62;Note that WasiCaml-produced code cannot yet be run with
  wasmtime or wasmer, in particular because there is no machinery
  for exception handling in these engines. Browsers are fully
  supported, though.

  &#60;/p&#62;&#60;h2&#62;The WasiCaml project&#60;/h2&#62;

&#60;p&#62;WebAssembly is still a very new technology and information about it
  is scarce. For example, it took a while until I understood that LLVM
  includes a full-featured assembler for WebAssembly, i.e. you can feed
  it a &#60;code&#62;code.s&#60;/code&#62; file, and you get a &#60;code&#62;code.o&#60;/code&#62;
  file back with partially linked WebAssembly code. This is documented
  nowhere, and I could only figure out some parts of the assembler syntax
  by reading the source code of LLVM.

&#60;/p&#62;&#60;p&#62;What I already knew from an earlier WebAssembly project is that
  there is no exception handling (EH) mechanism yet in the standard
  (although this will likely change soon). This turned out to be a
  particular problem for WasiCaml, because the OCaml runtime uses long
  jumps in external C code to trigger OCaml exceptions. I remembered
  the way the &#60;a href=&#34;https://emscripten.org&#34;&#62;Emscripten&#60;/a&#62;
  toolchain (which is another wrapper around LLVM) gets around this
  difficulty.  If the host language is Javascript, embedded
  WebAssembly code is compiled to run in the same VM that is also used
  to execute Javascript itself, and this means that Javascript exceptions
  also work perfectly for WebAssembly! Of course, this trick is
  really limited to Javascript hosts, but at least I could remove
  the blocker for one of the possible execution environments.

&#60;/p&#62;&#60;p&#62;The very first task was then to get the OCaml bytecode interpreter
  working in a WASI (plus EH) environment.
  
  &#60;/p&#62;&#60;h2&#62;Milestone: running the bytecode interpreter in the WASI environment&#60;/h2&#62;

&#60;p&#62;Essentially, this means that I wanted to (1) clone the OCaml source
  code, (2) &#60;code&#62;configure&#60;/code&#62; it, and (3) &#60;code&#62;make&#60;/code&#62; the
  bytecode interpreter (and the whole OCaml bytecode toolchain).  The
  C compiler comes from the
  &#60;a href=&#34;https://github.com/WebAssembly/wasi-sdk&#34;&#62;WASI SDK&#60;/a&#62;,
  and it compiles directly to WebAssembly. Now, if you just set the
  &#60;code&#62;CC&#60;/code&#62; variable to this C compiler, &#60;code&#62;configure&#60;/code&#62;
  will consider the target as a cross-compile target. Such targets
  are still very tricky, and - because we actually &#60;em&#62;can&#60;/em&#62; run
  the code somehow - I thought it is better to avoid cross-compilation
  altogether, and to add some tooling so that binaries are
  directly runnable.

&#60;/p&#62;&#60;p&#62;Instead of pointing &#60;code&#62;CC&#60;/code&#62; directly to the C compiler of
  the WASI SDK, there is now a wrapper script &#60;code&#62;wasi_cc&#60;/code&#62;.
  The main purpose of this script is to reshape the WebAssembly
  executables so that they are directly runnable on the host
  system. This is accomplished by prepending a &#60;em&#62;starter&#60;/em&#62; to the
  WebAssembly code. The &#60;em&#62;starter&#60;/em&#62; runs &#60;code&#62;node&#60;/code&#62; with
  the right driver script, and extracts the WebAssembly code from the
  executable file. For example, if you do

&#60;/p&#62;&#60;blockquote&#62;
&#60;code style=&#34;white-space: pre&#34;&#62;
  wasi_cc -o ex code.c
&#60;/code&#62;
&#60;/blockquote&#62;

the resulting file &#60;code&#62;ex&#60;/code&#62; can be directly run with
&#60;code&#62;./ex&#60;/code&#62;.

&#60;p&#62;With this trick, &#60;code&#62;configure&#60;/code&#62; now &#38;#34;thinks&#38;#34; that the
  target is a native target of the operating
  system. &#60;code&#62;configure&#60;/code&#62; could also run the tests on the
  existence of the various libc library functions the OCaml runtime
  needs, and figured out a lot of that stuff correctly. Nevertheless,
  not everything was working, and I had to fork the OCaml sources in
  order to disable functions that are not available
  (see &#60;a href=&#34;https://github.com/gerdstolpmann/ocaml/compare/4.12.0...gerd/wasi-4.12.0&#34;&#62;gerd/wasi-4.12.0&#60;/a&#62;
  for the changes).

&#60;/p&#62;&#60;p&#62;In this branch of OCaml I also changed the main function of the
  bytecode interpreter so that it catches exceptions from Javascript
  (actually, this function was split into two, and the outer function
  catches the exception thrown by the inner function).

&#60;/p&#62;&#60;p&#62;A final difficulty was that function pointers in WebAssembly are typed
  - which is a logical consequence of the fact that functions are typed.
  OCaml generates a file &#60;code&#62;prims.c&#60;/code&#62; that initializes the list
  of FFI functions, and initially LLVM did not like this file, because
  it could not infer the types of the function pointers. The solution
  was &#60;em&#62;not&#60;/em&#62; to generate WebAssembly for this single file but
  to leave it as LLVM IR (&#38;#34;bitcode&#38;#34;). In this format function pointers
  can remain untyped, and the LLVM linker is smart enough to fix up
  the problem at link time, and to convert LLVM IR to WebAssembly when
  the types of the FFI functions are known.

&#60;/p&#62;&#60;p&#62;With this trick, everything worked fine! The bytecode
  interpreter did not slow down much in WebAssembly, which was very
  encouraging.

  &#60;/p&#62;&#60;h2&#62;Milestone: the direct translator&#60;/h2&#62;

&#60;p&#62;After the bytecode interpreter was running, the second step was to
  directly generate WebAssembly code from OCaml. Actually, there were
  two choices: either to pick up one of the internal formats of OCaml
  (e.g. &#38;#34;Lambda&#38;#34; or &#38;#34;C--&#38;#34;) and to change the OCaml compiler directly,
  or to take the bytecode as the starting point. I preferred the
  latter because WasiCaml is then an add-on processor that can be
  easily added to existing OCaml projects, and because some
  difficulties could be avoided (e.g. incremental compilation, and
  many many fixups through the whole toolchain). Also, I hoped that
  the resulting speed would still be &#38;#34;good enough&#38;#34; (at least for the
  purposes of the DSL compiler we wanted to run with WebAssembly).

&#60;/p&#62;&#60;p&#62;Bytecode also made it a lot easier for me to get started.
  There were really a lot of unanswered questions: what does the
  function call mechanism look like? How do we get around the problem
  that OCaml code typically requires tail calls to be working but
  there aren&#38;#39;t tail calls in WebAssembly (yet)? What does the code
  for allocating a block of memory look like? How do we emulate exceptions?
  Picking bytecode meant that I could focus on these questions, while
  the bytecode instructions could initially be translated in a naive
  way, e.g. by translating each bytecode instruction separately to a
  fixed block of WebAssembly instructions (like instantiating a
  template). (Note that the current WasiCaml compiler is already
  a lot better than that.)
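&#60;/p&#62;&#60;p&#62;To illustrate what such a naive, template-style translation could
  look like, here is a minimal OCaml sketch. Everything in it is an
  assumption made for illustration: the four-instruction bytecode
  subset, the names &#60;code&#62;$accu&#60;/code&#62; and &#60;code&#62;$sp&#60;/code&#62;, and the
  emitted WAT text are hypothetical and not taken from the WasiCaml
  sources, and the tagging of OCaml integers is ignored.
&#60;/p&#62;&#60;blockquote&#62;
&#60;code style=&#34;white-space: pre&#34;&#62;
(* Hypothetical mini bytecode: an accumulator plus a stack in linear memory. *)
type instr =
  | Constint of int   (* accu := n *)
  | Push              (* push accu onto the stack *)
  | Acc of int        (* accu := n-th stack slot, counted from the top *)
  | Addint            (* accu := accu + popped stack top (untagged) *)

(* Each instruction maps to a fixed block of WAT text. $accu is assumed
   to be a local, $sp a global holding the address of the stack top. *)
let translate = function
  | Constint n -&#62;
      Printf.sprintf &#38;#34;(local.set $accu (i32.const %d))&#38;#34; n
  | Push -&#62;
      &#38;#34;(global.set $sp (i32.sub (global.get $sp) (i32.const 4)))\n&#38;#34;
      ^ &#38;#34;(i32.store (global.get $sp) (local.get $accu))&#38;#34;
  | Acc n -&#62;
      Printf.sprintf
        &#38;#34;(local.set $accu (i32.load offset=%d (global.get $sp)))&#38;#34; (4 * n)
  | Addint -&#62;
      &#38;#34;(local.set $accu (i32.add (local.get $accu) (i32.load (global.get $sp))))\n&#38;#34;
      ^ &#38;#34;(global.set $sp (i32.add (global.get $sp) (i32.const 4)))&#38;#34;

(* Translate a whole instruction sequence, one template after the other. *)
let translate_all code = String.concat &#38;#34;\n&#38;#34; (List.map translate code)

let () = print_endline (translate_all [Constint 1; Push; Constint 2; Addint])
&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;In this sketch every pushed value lands in linear memory, so a
  garbage collector could find it there. The price is that no
  optimization happens across instruction boundaries.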

&#60;/p&#62;&#60;p&#62;Picking bytecode also meant that WasiCaml inherits the bytecode
  stack. This is actually not a bad thing - because of OCaml&#38;#39;s memory
  management the stack must reside in addressable memory, and the
  bytecode stack could serve as what the WebAssembly community calls
  a &#60;em&#62;shadow stack&#60;/em&#62;. (Even for the C language there is a
  shadow stack - and the alternative would have been to also use the
  shadow stack of the C language.) So we got the shadow stack for
  OCaml code practically for free.

&#60;/p&#62;&#60;p&#62;The stack is important because the garbage collector must be
  able to run over all locations where OCaml values are stored.
  As already mentioned, the locations WebAssembly natively supports
  (like local and global variables) cannot be traversed, and
  hence it is crucial to put OCaml values into memory whenever there
  is a chance of a garbage collector run.

&#60;/p&#62;&#60;p&#62;Note that the native OCaml compiler is not much different in this
  respect - only that the native stack of the operating system can be
  used for storing values because it resides in memory. The details
  are different, though. When a value is moved temporarily to the
  stack, this is usually called &#38;#34;register spilling&#38;#34;, and this is done
  because (1) there is only a limited amount of registers, but another
  register is needed, or (2) you don&#38;#39;t know which register remains
  untouched when you call a function, or (3) you call some code that
  may run the garbage collector. Now, in WebAssembly, reason (1) is
  never the case because there can be any number of local variables
  (which take over the role of registers), and the details of (3) are
  very different, because in a native environment the registers are
  global stores, permitting some time-saving tricks that are
  unavailable in WebAssembly.

&#60;/p&#62;&#60;p&#62;So, for developing the WasiCaml code emitter, this meant that it
  had to follow constraints so that OCaml values end up on the stack
  at the right moment. Actually, these constraints mainly shaped the
  layout of the WasiCaml code.
  
  &#60;/p&#62;&#60;h2&#62;32 bit comes back!&#60;/h2&#62;

&#60;p&#62;Once WasiCaml was working, we got back to the DSL compiler we
  originally wanted to make cross-platform. And we actually got it
  running! There was one remaining problem, though: WebAssembly is a
  32-bit environment. As you may know, OCaml suffers from some
  limitations in this case. Most annoyingly, strings are limited to
  just under 16 MB in size.
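&#60;/p&#62;&#60;p&#62;The limit follows from the block header of the OCaml runtime,
  which reserves only part of a word for the block size. As a rough
  sketch (the function name and its parameters are made up for this
  illustration; the 22-bit and 54-bit size fields are those of the
  standard 32-bit and 64-bit runtimes):
&#60;/p&#62;&#60;blockquote&#62;
&#60;code style=&#34;white-space: pre&#34;&#62;
(* The header stores the block size in words in a fixed number of bits;
   a string additionally loses one byte to its final padding byte. *)
let max_string_length ~word_bytes ~wosize_bits =
  let max_words = (1 lsl wosize_bits) - 1 in
  word_bytes * max_words - 1

let () =
  Printf.printf &#38;#34;32 bit: %d bytes\n&#38;#34;
    (max_string_length ~word_bytes:4 ~wosize_bits:22);
  Printf.printf &#38;#34;64 bit: %d bytes\n&#38;#34;
    (max_string_length ~word_bytes:8 ~wosize_bits:54)
&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;For the 32-bit case this gives 16777211 bytes. On a real system the
  same number is available as &#60;code&#62;Sys.max_string_length&#60;/code&#62;.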

&#60;/p&#62;&#60;p&#62;Fortunately, this problem occurred only here and there, mostly
  in the code emitter. Here, we could switch to
  &#60;a href=&#34;https://github.com/Chris00/ocaml-rope&#34;&#62;ropes&#60;/a&#62;
  as alternate representation - and, lucky as we were, it turned
  out that this change did not eat much performance.

&#60;/p&#62;&#60;p&#62;The DSL compiler is quite big, and the WebAssembly version takes
  around 3 seconds to start up. This is longer than usual, but for
  our application we could hide the startup time, and are now quite
  happy with the product.

  &#60;/p&#62;&#60;hr/&#62;

&#60;p&#62;PS. Interested in WebAssembly and you know OCaml (or another
  functional language like Elm, Scala, Haskell, ...)?
  &#60;a href=&#34;https://www.mixtional.de/recruiting/2021-01/index.html&#34;&#62;We might have
    a job for you (July 2021)&#60;/a&#62;.
&#60;/p&#62;
&#60;/div&#62;

&#60;div&#62;
  Gerd Stolpmann is the CEO of &#60;a href=&#34;https://mixtional.de&#34;&#62;Mixtional Code GmbH&#60;/a&#62;, currently busy with the last development steps of the &#60;a href=&#34;http://remixlabs.com&#34;&#62;Remix Labs&#60;/a&#62; platform

&#60;/div&#62;

&#60;div&#62;
  
&#60;/div&#62;


          </description>
        </item>
      
        <item>
          <title>OMake On Steroids (Part 3)</title>
          <guid>http://blog.camlcity.org/blog/omake3.html</guid>
          <link>http://blog.camlcity.org/blog/omake3.html</link>
          <pubDate>23 Jun 2015 00:00 GMT</pubDate>
          <description>

&#60;div&#62;
  &#60;b&#62;Faster builds with omake, part 3: Caches&#60;/b&#62;&#60;br/&#62;&#38;#160;
&#60;/div&#62;

&#60;div&#62;
  
In this (last) part of the series we have a closer look at how OMake uses
caches, and what could be improved in this field. Remember that we saw
roughly doubled overall speed for large OMake projects, and that we could
also reduce the time for incremental builds. Caching is particularly
important for the latter.

&#60;cc-field name=&#34;maintext&#34;&#62;
&#60;div style=&#34;float:right; width:50%; border: 1px solid black; padding: 10px; margin-left: 1em; margin-bottom: 1em; margin-top: 1em; background-color: #E0E0E0&#34;&#62;
This text is part 3/3 of a series about the OMake improvements
sponsored by &#60;a href=&#34;http://lexifi.com&#34;&#62;LexiFi&#60;/a&#62;:
&#60;ul&#62;
  &#60;li&#62;Part 1: &#60;a href=&#34;/blog/omake1.html&#34;&#62;Overview&#60;/a&#62;
  &#60;/li&#62;&#60;li&#62;Part 2: &#60;a href=&#34;/blog/omake2.html&#34;&#62;Linux&#60;/a&#62;
  &#60;/li&#62;&#60;li&#62;Part 3: Caches (this page)
&#60;/li&#62;&#60;/ul&#62;
The original publishing is on &#60;a href=&#34;http://blog.camlcity.org/blog&#34;&#62;camlcity.org&#60;/a&#62;.
&#60;/div&#62;
&#60;p&#62;
Caching more is better, right? Unfortunately, this attitude of many
application programmers does not hold if you look closer at how caches
work. Basically, you trade memory for time, but there are also unwanted
effects. As we learned in the last part, bigger process images may also
cost time. What we examined there using the example of the fork() system
call is also true for any memory that is managed in a fine-grained
way. Look at the garbage collector of the OCaml runtime: If more memory
blocks are allocated, the collector also needs to cycle through more
blocks in order to mark and reclaim memory. Although the runtime includes
some clever logic to alleviate this effect (namely by allowing more waste
for bigger heaps and by adjusting the collection speed to the allocation
speed), the slowdown is still measurable.

&#60;/p&#62;&#60;p&#62;
Another problem for large setups is that if processes consume more
memory the caches maintained by the OS have less memory to work with.
The main competitor on the OS level is the page cache that stores
recently used file blocks. After all, memory is limited, and the
question is what we use it for. Often enough, the caches on the OS
level are the most effective ones, and user-maintained caches need
to be justified.

&#60;/p&#62;&#60;p&#62;
In the case of OMake there are mainly two important caches:

&#60;/p&#62;&#60;ul&#62;
&#60;li&#62;The target cache answers the question whether a file can be built in
    a given directory. The cache covers both types of build rules: explicit
    and implicit rules. For the latter it is very important to have this
    cache because the applicable implicit rules need to be searched.
    As OMake normally uses the &#38;#34;-modules&#38;#34; switch of ocamldep, it has to
    find out on its own in which directory an OCaml module is built.
&#60;/li&#62;&#60;li&#62;The file cache answers the question whether a file is still up to date,
    or whether it needs to be rebuilt. This is based on three data blobs:
    first, the Unix.stat() properties of the file (and whether the file
    exists at all). Second, the MD5 digest of the file. Third, the digest
    of the command that created the file. If any of these blobs change
    the file is out of date. The details are somewhat complicated, though,
    in particular the computation of the digest costs some time and should
    only be done if it helps avoiding other expensive actions. Parts of the file
    cache survive OMake invocations as these are stored in the &#38;#34;.omakedb&#38;#34;
    file.
&#60;/li&#62;&#60;/ul&#62;

&#60;p&#62;
All in all, I was looking for ways of reducing the size of the caches, and
for a cleverer organization that makes the cache operations cheaper.

&#60;/p&#62;&#60;h2&#62;The target cache&#60;/h2&#62;

The target cache is used for searching the directory where a file can be
built, and also the applicable file extensions (e.g. if a file m.ml
is generated from m.mly there will be entries for both m.ml and m.mly).
As I found it, it was very simple, just a mapping

&#60;blockquote&#62;
filepath &#38;#8614; buildable_flag
&#60;/blockquote&#62;

and if a file f could potentially exist in many directories d there
was a separate entry d/f for every d. For a given OCaml module m,
there were entries for every potential suffix (i.e. for .cmi, .cmo, .cmx
etc.), and also for the casing of m (remember that a module M can be
stored in both m.ml and M.ml). In total, the cache had 2 * D * S * M
entries (where D = number of build directories, S = number of file
suffixes, and M = number of modules). That is a lot of entries.

&#60;p&#62;
The problem is not only the size, but also the speed: For every test
we need to walk the mapping data structure.

&#60;/p&#62;&#60;p&#62;
The new layout of the cache compresses the data in the following way:

&#60;/p&#62;&#60;blockquote&#62;
filename &#38;#8614; (directories_buildable, directories_non_buildable)
&#60;/blockquote&#62;

On the left side, only simple filenames without paths are used. So
we now need only 1/D as many entries as before. On the right side, we have
two sets: the directories where the file can be built, and the directories
where the file cannot be built (and if a directory appears in neither
set, we don&#38;#39;t know yet). As the number of directories is very limited,
these sets can be represented as bitsets.

&#60;p&#62;
Note that if we were to program a lame build system, we could even
simplify this to

&#60;/p&#62;&#60;blockquote&#62;
filename &#38;#8614; directory_buildable option
&#60;/blockquote&#62;

but we want to take into account that files can potentially be built in
several directories, and that it depends on the include paths currently
in scope which directory is finally picked.

&#60;p&#62;
It&#38;#39;s not only that the same information is now stored in a compressed
way. Also, the main user of the target cache picks a single file and
searches the directory where it can be built. Because the data structure
is now aligned with this style of accessing it, only one walk over the
mapping is needed per file (instead of one walk per combination of directory
and file). Inside the loop over the directories we only need to look into
the bitsets, which is very cheap.
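&#60;/p&#62;&#60;p&#62;
The compressed layout and its lookup can be sketched in OCaml as follows.
This is only an illustration under stated assumptions, not the actual
OMake code: the names &#60;code&#62;entry&#60;/code&#62;, &#60;code&#62;record&#60;/code&#62;,
&#60;code&#62;is_known&#60;/code&#62; and &#60;code&#62;find_dir&#60;/code&#62; are made up, directories
are assumed to be numbered from 0, and a plain integer serves as bitset.
&#60;/p&#62;&#60;blockquote&#62;
&#60;code style=&#34;white-space: pre&#34;&#62;
(* Hypothetical sketch; names and types are not those of OMake.
   A set of directories is an int used as a bitset, so this version
   handles at most 62 directories on a 64-bit system. *)
type entry = { buildable : int; non_buildable : int }

let cache : (string, entry) Hashtbl.t = Hashtbl.create 64

(* Test whether directory number dir is a member of the bitset. *)
let mem set dir = (set lsr dir) land 1 = 1

(* Remember the outcome of one buildability test. *)
let record name dir can_build =
  let e =
    match Hashtbl.find_opt cache name with
    | Some e -&#62; e
    | None -&#62; { buildable = 0; non_buildable = 0 }
  in
  let bit = 1 lsl dir in
  Hashtbl.replace cache name
    (if can_build then { e with buildable = e.buildable lor bit }
     else { e with non_buildable = e.non_buildable lor bit })

(* A directory in neither set means that we do not know yet. *)
let is_known name dir =
  match Hashtbl.find_opt cache name with
  | None -&#62; false
  | Some e -&#62; mem e.buildable dir || mem e.non_buildable dir

(* One walk over the mapping per file; the loop over the directories in
   scope only tests bits. Returns the first directory known to be able
   to build the file; directories of unknown status are skipped here. *)
let find_dir name dirs_in_scope =
  match Hashtbl.find_opt cache name with
  | None -&#62; None
  | Some e -&#62; List.find_opt (mem e.buildable) dirs_in_scope
&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;
A real implementation would fall back to searching the build rules for
directories of unknown status and then record the result; the sketch
leaves this part out.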



&#60;/p&#62;&#60;h2&#62;The file cache&#60;/h2&#62;

Compared to the target cache, the file cache is really complicated. For
every file we have three meta data blobs (stat, file digest, command
digest). Also, there are two versions of the cache: the persistent
version, as stored in the .omakedb file, and the live version.

&#60;p&#62;
Many simpler build systems (like &#38;#34;make&#38;#34;) only use the file stats for
deciding whether a file is out of date. This is somewhat imprecise,
in particular when the filesystem stores the timestamps of the files
with only low granularity (e.g. in units of seconds). Another problem
occurs when the timestamps are not synchronous with the system clock,
as it happens with remote filesystems.

&#60;/p&#62;&#60;div style=&#34;float:right; width:50%; border: 1px solid black; padding: 10px; margin-left: 1em; margin-top: 1em; background-color: #E0E0E0&#34;&#62;
There is now a &#60;a href=&#34;https://github.com/gerdstolpmann/omake-fork/tags&#34;&#62;pre-release omake-0.10.0-test1&#60;/a&#62; that can be bootstrapped! It contains all
of the described improvements, plus a number of bugfixes.
&#60;/div&#62;

&#60;p&#62;
OMake is programmed so that it only uses the timestamps between
invocations. This means that if OMake is started another time, and the
timestamp of a file changed compared with the previous invocation of
OMake, it is assumed that the file has changed. OMake does not use
timestamps during its runs. Instead it relies on the file cache as the
instance that decides which files need to be created again. For doing
so, it only uses digests (i.e. a rule fires when the digests of the
input files change, or when the digest of the command changes).

&#60;/p&#62;&#60;p&#62;
The role of the .omakedb file is now that a subset of the file cache
is made persistent between invocations. This file stores the timestamps
of the files and the digests. OMake simply assumes that the saved digest
is still the current one if the timestamp of the file remains the same.
Otherwise it recomputes the digest. This is the only purpose of the
timestamps. Inaccuracies do not play a big role when we can assume that
users typically do not start omake instances so quickly after each other
that clock deviations would matter.
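&#60;/p&#62;&#60;p&#62;
The rule for trusting a saved digest can be sketched like this. It is a
simplified illustration and not the Omake_cache code: the record
&#60;code&#62;saved&#60;/code&#62; and the function &#60;code&#62;refresh&#60;/code&#62; are hypothetical
names, and the functions that read the timestamp information and compute
the digest are passed in as parameters.
&#60;/p&#62;&#60;blockquote&#62;
&#60;code style=&#34;white-space: pre&#34;&#62;
(* Hypothetical sketch. stat_key stands for the saved timestamp
   information of a file, digest for its saved digest. *)
type saved = { stat_key : string; digest : Digest.t }

(* Returns the up-to-date record and whether the digest had to be
   recomputed. stat_key_of and digest_of are parameters so that the
   sketch does not depend on the Unix library; digest_of could be
   Digest.file. *)
let refresh ~stat_key_of ~digest_of saved path =
  let key = stat_key_of path in
  match saved with
  | Some s when String.equal s.stat_key key -&#62; (s, false)
  | _ -&#62; ({ stat_key = key; digest = digest_of path }, true)
&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;
The expensive digest computation only happens in the second case, i.e.
when there is no saved record or the timestamp information differs.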

&#60;/p&#62;&#60;p&#62;
The complexity of the file cache is better understood if you look at
key operations:

&#60;/p&#62;&#60;ul&#62;
  &#60;li&#62;Load the .omakedb file and interpret it in the right way
  &#60;/li&#62;&#60;li&#62;Decide whether the cached file digest can be trusted or not
      (and in the latter case the digest is recomputed from the existing
      file)
  &#60;/li&#62;&#60;li&#62;Decide whether a rule is out of date or not. This check needs
      to take the cache contents for the inputs and the outputs of
      the rule into account.
  &#60;/li&#62;&#60;li&#62;Sometimes, we want to avoid expensive checks, and e.g. only know
      whether a digest might be out of date from the available information
      without having to recompute the digest.
&#60;/li&#62;&#60;/ul&#62;

&#60;p&#62;
After finding a couple of imprecise checks in the existing code, I
actually went through the whole Omake_cache module, and went through
all data cases. After that I&#38;#39;m now sure that it is perfect in the sense
that only those digests are recomputed that are really needed for
deciding whether a rule is out of date.

&#60;/p&#62;&#60;p&#62;
There are also some compressions:

&#60;/p&#62;&#60;ul&#62;
  &#60;li&#62;The cache no longer stores the complete Unix.stat records, but only
      the subset of the fields that are really meaningful (timestamps, inode),
      and represents these fields as a single string.
  &#60;/li&#62;&#60;li&#62;There is a separate data structure for the question whether a file
      exists. This is one of the cases where OS level caches already do a
      good job. Now, only for the n most recently accessed files this
      information is available (where n=100). On Linux with its fast system
      calls this cache is probably unnecessary, but on Windows I actually saw some
      speedup.
&#60;/li&#62;&#60;/ul&#62;
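
&#60;p&#62;
A bounded cache of this kind could look like the following sketch. It is
not the OMake implementation: the list-based representation, the names
&#60;code&#62;recent&#60;/code&#62; and &#60;code&#62;exists_cached&#60;/code&#62;, and the policy of
dropping the least recently accessed entry are assumptions, and the
invalidation needed when files are created or deleted is left out.
&#60;/p&#62;&#60;blockquote&#62;
&#60;code style=&#34;white-space: pre&#34;&#62;
(* Hypothetical sketch: remember the existence test for the 100 most
   recently accessed files only. A list with linear scans is good
   enough for such a small n. *)
let capacity = 100

let recent : (string * bool) list ref = ref []

(* Keep the first n elements of a list. *)
let rec take n l =
  match n, l with
  | 0, _ | _, [] -&#62; []
  | n, x :: tl -&#62; x :: take (n - 1) tl

(* probe is the real (expensive) test; it is only called on a miss. *)
let exists_cached ?(probe = Sys.file_exists) path =
  let answer =
    match List.assoc_opt path !recent with
    | Some b -&#62; b
    | None -&#62; probe path
  in
  (* Move the entry to the front and drop the oldest one if needed. *)
  recent := take capacity ((path, answer) :: List.remove_assoc path !recent);
  answer
&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;
With this layout the memory used for existence information stays
constant no matter how many files the project has.
&#60;/p&#62;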

&#60;p&#62;
All taken together, this gives another little boost. This is mostly observable
on Windows as this OS does not profit from the improvements described in the
previous article of the series.

&#60;/p&#62;&#60;/cc-field&#62;
&#60;/div&#62;

&#60;div&#62;
  
&#60;/div&#62;

&#60;div&#62;
  Gerd Stolpmann works as OCaml consultant.

&#60;/div&#62;

&#60;div&#62;
  
&#60;/div&#62;


          </description>
        </item>
      
        <item>
          <title>OMake On Steroids (Part 2)</title>
          <guid>http://blog.camlcity.org/blog/omake2.html</guid>
          <link>http://blog.camlcity.org/blog/omake2.html</link>
          <pubDate>19 Jun 2015 12:00 GMT</pubDate>
          <description>

&#60;div&#62;
  &#60;b&#62;Faster builds with omake, part 2: Linux&#60;/b&#62;&#60;br/&#62;&#38;#160;
&#60;/div&#62;

&#60;div&#62;
  
The Linux version of OMake suffered from specific problems, and it is
worth looking at these in detail.

&#60;/div&#62;

&#60;div&#62;
  
&#60;div style=&#34;float:right; width:50%; border: 1px solid black; padding: 10px; margin-left: 1em; margin-bottom: 1em; background-color: #E0E0E0&#34;&#62;
This text is part 2/3 of a series about the OMake improvements
sponsored by &#60;a href=&#34;http://lexifi.com&#34;&#62;LexiFi&#60;/a&#62;:
&#60;ul&#62;
  &#60;li&#62;Part 1: &#60;a href=&#34;/blog/omake1.html&#34;&#62;Overview&#60;/a&#62;
  &#60;/li&#62;&#60;li&#62;Part 2: Linux (this page)
  &#60;/li&#62;&#60;li&#62;Part 3: Caches (will be released on Tuesday, 6/23)
&#60;/li&#62;&#60;/ul&#62;
The original publishing is on &#60;a href=&#34;http://blog.camlcity.org/blog&#34;&#62;camlcity.org&#60;/a&#62;.
&#60;/div&#62;
&#60;p&#62;While analyzing the performance characteristics of OMake, I found
that the features of the OS were used in a non-optimal way. In
particular, the fork() system call can be very expensive, and by
avoiding it the speed of OMake could be dramatically improved. This is
the biggest contribution to the performance optimizations allowing
OMake to run roughly twice as fast on Linux
(see &#60;a href=&#34;/blog/omake1.html&#34;&#62;part 1&#60;/a&#62; for numbers).

&#60;/p&#62;&#60;h2&#62;The fork/exec problem&#60;/h2&#62;
&#60;p&#62;
The traditional way of starting commands is to use the fork/exec
combination: The fork() system call creates an almost identical copy
of the process, and in this copy the exec() call starts the
command. This has a number of logical advantages, namely that you can
run code between fork() and exec() that modifies the environment for
the new command. Often, the file descriptors 0, 1, and 2 are reassigned,
as is required for creating pipelines. You can also do other
things, e.g. change the working directory.

&#60;/p&#62;&#60;p&#62;
The whole problem with this is that it is slow. Even for a modern OS
like Linux, fork() includes a number of expensive operations. Although
actually copying memory can be avoided (copy-on-write), the new address
space must be set up by duplicating the page table. This gets more
expensive the bigger the address space is. Also, memory must be set aside
even if it is not immediately used. The entries for all file mappings must
be duplicated (and every linked-in shared library needs such mappings).
The point is that all these actions are not really needed because
at exec() time the whole process image is replaced by a different one.

&#60;/p&#62;&#60;p&#62;
In my performance tests I measured that forking a 450 MB process
image takes around 10 ms. In the n=8 test, two commands are needed for
compiling each of the 4096 modules (ocamldep.opt and ocamlopt.opt).
The time for these forks alone adds up to around 80 seconds
(4096 * 2 * 10 ms). Even worse, this
dramatically limits the benefit of parallelizing the build, because
this time is always spent in the main process.
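&#60;/p&#62;&#60;p&#62;
The estimate can be reproduced with a few lines of OCaml. The numbers are the
ones quoted above; the variable names are only chosen for this illustration.
&#60;/p&#62;
&#60;pre&#62;
```ocaml
(* Back-of-the-envelope estimate using the figures quoted in the text. *)
let modules = 4096            (* n = 8, i.e. 8^4 modules *)
let commands_per_module = 2   (* ocamldep.opt and ocamlopt.opt *)
let fork_cost_ms = 10.0       (* measured for a 450 MB process image *)

(* Total time the main process spends in fork(), in seconds. *)
let total_fork_seconds =
  float_of_int (modules * commands_per_module) *. fork_cost_ms /. 1000.0

let () = Printf.printf "time spent in fork(): %.1f s\n" total_fork_seconds
```
&#60;/pre&#62;
&#60;p&#62;
The result is a little above 80 seconds. Since every fork happens in the main
process, this time cannot be spread over several CPU cores.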

&#60;/p&#62;&#60;p&#62;
The POSIX standard includes an alternate way of starting commands, the
posix_spawn() call. It was originally developed for small systems
without virtual memory where it is difficult to implement fork()
efficiently. However, because of the mentioned problems of the
fork/exec combination it was quickly picked up by all current POSIX
systems.  The posix_spawn() call takes a potentially long list of
parameters that describes all the actions that need to be done between
fork() and exec().  This gives the implementer full freedom to exploit
low-level features of the OS for speeding the call up. Some operating
systems, e.g. Mac OS X, even implement posix_spawn directly as a system call.

&#60;/p&#62;&#60;p&#62;
On Linux, posix_spawn is a library function of glibc. By default,
however, it is no real help because it uses fork/exec (being very
conservative).  If you pass the flag POSIX_SPAWN_USEVFORK, though, it
switches to a fast alternate implementation. I was pointed (by T&#38;#246;r&#38;#246;k
Edwin) to a few emails showing that the quality in glibc is not yet
optimal. In particular, there are weaknesses in signal handling and in
thread cancellation. Fortunately, these weaknesses do not matter for
this application (signals are not actively used, and on Linux OMake is
single-threaded).

&#60;/p&#62;&#60;p&#62;
Note that I developed the wrapper for posix_spawn years ago
for OCamlnet, where it is still used. So, if you want to try out the speed
advantage yourself, just use OCamlnet&#38;#39;s Shell library for
starting commands.

&#60;/p&#62;&#60;h2&#62;Pipelines and fork()&#60;/h2&#62;

&#60;p&#62;It turned out that there is another application of fork() in OMake. When
creating pipelines, it is sometimes required that the OMake process
forks itself, namely when one of the commands of the pipeline is
implemented in the OMake language. This is somewhat expected, as the
parts of a pipeline need to run concurrently. However, this feature
turned out to be a little bit in the way because the default build
rules used it. In particular, there is the pipeline

&#60;/p&#62;&#60;blockquote&#62;
&#60;code&#62;&#60;small&#62;
$(OCAMLFIND) $(OCAMLDEP) ... -modules $(src_file) | ocamldep-postproc
&#60;/small&#62;&#60;/code&#62;
&#60;/blockquote&#62;

which is started for scanning OCaml modules. While the first command,
$(OCAMLFIND), is a normal external command, the second command,
ocamldep-postprocess, is written in the OMake language.

&#60;p&#62;Forking for creating pipelines is even more expensive than the
fork/exec combination discussed above, because memory really needs to
be copied. I could finally avoid this fork() by some trickery in the
command starter. When used for scanning, and the command is the last one
in the pipeline (as in the above pipeline), a workaround is activated
that writes the data to a temporary file, as if the pipeline read

&#60;/p&#62;&#60;blockquote&#62;
&#60;code&#62;&#60;small&#62;
$(OCAMLFIND) $(OCAMLDEP) ... -modules $(src_file) &#38;#62;$(tmpfile);&#60;br/&#62;
ocamldep-postproc &#38;#60;$(tmpfile)
&#60;/small&#62;&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;(NB. You actually can also program this in the OMake language. However,
this does not solve the problem, because for sequences of commands
$(cmd1);$(cmd2) it is also required to fork the process. Hence, I had to
find a solution deeper in the OMake internals.)

&#60;/p&#62;&#60;div style=&#34;float:right; width:50%; border: 1px solid black; padding: 10px; margin-left: 1em; margin-top: 1em; background-color: #E0E0E0&#34;&#62;
There is now a &#60;a href=&#34;https://github.com/gerdstolpmann/omake-fork/tags&#34;&#62;pre-release omake-0.10.0-test1&#60;/a&#62; that can be bootstrapped! It contains all
of the described improvements, plus a number of bugfixes.
&#60;/div&#62;

&#60;p&#62;There is one drawback of this, though: The latency of the pipeline is
increased when the commands are run sequentially rather than in parallel.
The effect is that OMake takes longer for a j=1 build even if fewer CPU
resources are consumed. A number of further improvements compensate for
this:

&#60;/p&#62;&#60;ul&#62;
  &#60;li&#62;Most importantly, ocamldep-postprocess can now use a builtin function,
      speeding this part up by switching the implementation language (now
      OCaml, previously the OMake language).
  &#60;/li&#62;&#60;li&#62;Because ocamldep-postprocess mainly accesses the target cache,
      speeding up this cache also helped (see the next part of this
      article series).
  &#60;/li&#62;&#60;li&#62;Finally, there is now a way for functions like ocamldep-postprocess
      to propagate updates of the target cache to the main environment.
      The background here is that functions implementing commands run in
      a sub environment simulating some isolation from the parent
      environment. This isolation prevented updates of the target
      cache found by one invocation of ocamldep-postprocess from being used
      by the next invocation. This also speeds up this function.
&#60;/li&#62;&#60;/ul&#62;

&#60;h2&#62;Windows is not affected&#60;/h2&#62;

&#60;p&#62;The Windows port of OMake is not affected by the fork problems. For
starting commands, an optimized technique similar to posix_spawn() is
used anyway. For pipelines and other internal uses of fork() the
Windows port uses threads. (Side note: You may ask why we don&#38;#39;t use
threads on Linux. There are a couple of reasons: First, the emulation
of the process environment with threads is probably not quite as
stable as the original using real processes. Second, there are
difficult interoperability problems between threads and signals
(something that does not exist in Windows).  Finally, this would not
save us from maintaining the code branch using real processes and fork()
because OCaml does not support multi-threading for all POSIX systems.
Of course, this does not mean we cannot implement it as an optional
feature, and probably this will be done at some point in the future.)

&#60;/p&#62;&#60;p&#62;The trick of using temporary files for speeding up pipelines is not
enabled on Windows. Here, it is more important to get the benefits of
parallelization that the real pipeline allows.

&#60;/p&#62;&#60;div style=&#34;border: 1px solid black; padding: 10px; margin-left: 1em; margin-bottom: 1em; background-color: #E0E0E0&#34;&#62;
The next part will be published on Tuesday, 6/23.
&#60;/div&#62;



&#60;/div&#62;

&#60;div&#62;
  Gerd Stolpmann works as OCaml consultant.

&#60;/div&#62;

&#60;div&#62;
  
&#60;/div&#62;


          </description>
        </item>
      
        <item>
          <title>OMake On Steroids (Part 1)</title>
          <guid>http://blog.camlcity.org/blog/omake1.html</guid>
          <link>http://blog.camlcity.org/blog/omake1.html</link>
          <pubDate>16 Jun 2015 00:00 GMT</pubDate>
          <description>

&#60;div&#62;
  &#60;b&#62;Faster builds with omake, part 1: Overview&#60;/b&#62;&#60;br/&#62;&#38;#160;
&#60;/div&#62;

&#60;div&#62;
  
In
the &#60;a href=&#34;https://sympa.inria.fr/sympa/arc/caml-list/2014-09/msg00090.html&#34;&#62;2014
edition&#60;/a&#62; of the &#38;#34;which is the best build system for OCaml&#38;#34; debate
the &#60;a href=&#34;http://omake.metaprl.org&#34;&#62;OMake&#60;/a&#62; utility was heavily
criticized for not being scalable enough. Some quick tests showed that
there was indeed a problem. At
&#60;a href=&#34;http://lexifi.com&#34;&#62;LexiFi&#60;/a&#62;, the size of the source tree obviously
already exceeded the critical point, and LexiFi was interested in an
improvement. LexiFi develops for both Linux and Windows, and
OMake is their preferred build system because of its excellent support
for Windows. The author of these lines got some funding from LexiFi
for analyzing and fixing the problem.

&#60;/div&#62;

&#60;div&#62;
  
&#60;div style=&#34;float:right; width:50%; border: 1px solid black; padding: 10px; margin-left: 1em; margin-bottom: 1em; background-color: #E0E0E0&#34;&#62;
This text is part 1/3 of a series about the OMake improvements
sponsored by &#60;a href=&#34;http://lexifi.com&#34;&#62;LexiFi&#60;/a&#62;:
&#60;ul&#62;
  &#60;li&#62;Part 1: Overview (this page)
  &#60;/li&#62;&#60;li&#62;Part 2: Linux (will be released on Friday, 6/19)
  &#60;/li&#62;&#60;li&#62;Part 3: Caches (will be released on Tuesday, 6/23)
&#60;/li&#62;&#60;/ul&#62;
The original publishing is on &#60;a href=&#34;http://blog.camlcity.org/blog&#34;&#62;camlcity.org&#60;/a&#62;.
&#60;/div&#62;

&#60;p&#62;
OMake is not only a build system (like e.g. ocamlbuild), but it also
includes extensions that are important for controlling and customizing
builds. There is an interpreter for a simple dynamically typed
functional language. There is a command shell implementing utilities
like &#38;#34;rm&#38;#34; or &#38;#34;cp&#38;#34; which is in particular important on non-Unix
systems. There are system interfaces for watching files and restarting
the build whenever source code is saved in the editor. In short, OMake
is very feature-rich, but, and this is the downside, it is also
quite complex: around 130 modules and 80k lines of code. Obviously, it
is easy to overlook performance problems when so much code is
involved. For me as the developer seeing the sources for the first
time the size was also a challenge, namely for identifying possible
problems and for finding solutions.

&#60;/p&#62;&#60;h2&#62;Quantifying the performance problem&#60;/h2&#62;

My very first activity was to develop a synthetic benchmark for OMake
(and actually, for any type of OCaml build system). Compared with a
real build, a synthetic benchmark has the big advantage that you can
simulate builds of any size. The benchmark has these characteristics:
The task is to build n^2 libraries with n^2 modules each (for a given
small number n), and the dependencies between the modules are created
in a way so that we can stress both the dependency analyzer of the
build utility and the ability to run commands in parallel. In
particular, every library would allow n parallel build flows of the
n^2 modules, and you can build n of the n^2 libraries in
parallel. (For details see the &#60;a href=&#34;https://github.com/gerdstolpmann/omake-fork/blob/perf-test/performance/generate.ml&#34;&#62;source code&#60;/a&#62;.)

&#60;p&#62;
This is what I got for omake-0.9.8.6 (note that a different computer
was used for Windows, so you cannot compare Linux with Windows):

&#60;/p&#62;&#60;p&#62;
&#60;/p&#62;&#60;table border=&#34;1&#34;&#62;
&#60;tr&#62;
&#60;th&#62;Size n&#60;/th&#62;
&#60;th&#62;Parallelism j&#60;/th&#62;
&#60;th&#62;Number of modules (n^4)&#60;/th&#62;
&#60;th&#62;Runtime Linux&#60;/th&#62;
&#60;th&#62;Runtime Windows&#60;/th&#62;
&#60;/tr&#62;
&#60;tr&#62;
&#60;td align=&#34;right&#34;&#62;n=7&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;j=1&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;2401&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;645&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;353&#60;/td&#62;
&#60;/tr&#62;
&#60;tr&#62;
&#60;td align=&#34;right&#34;&#62;n=7&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;j=4&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;2401&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;213&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;179&#60;/td&#62;
&#60;/tr&#62;
&#60;tr&#62;
&#60;td align=&#34;right&#34;&#62;n=8&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;j=1&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;4096&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;1906&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;877&#60;/td&#62;
&#60;/tr&#62;
&#60;tr&#62;
&#60;td align=&#34;right&#34;&#62;n=8&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;j=4&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;4096&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;607&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;341&#60;/td&#62;
&#60;/tr&#62;
&#60;/table&#62;

&#60;p&#62;This clearly shows that there is something wrong, in particular for
Linux as OS: For n=8 there are 4096 modules, which is around 1.7
times the 2401 modules for n=7, but omake needs around three times
longer (for a j=1 build). For Windows, the numbers are
slightly better: the n=8 build takes 2.5 times as long as the n=7
build. Nevertheless, this is quite far away from the optimum.
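&#60;/p&#62;&#60;p&#62;
These ratios follow directly from the j=1 rows of the table above. A small
sketch that recomputes them (the helper &#60;code&#62;ratio&#60;/code&#62; and the
variable names are only for illustration):
&#60;/p&#62;
&#60;pre&#62;
```ocaml
(* Runtimes for omake-0.9.8.6 at j=1, copied from the table above. *)
let ratio a b = float_of_int a /. float_of_int b

let module_growth  = ratio 4096 2401   (* module count, n=8 vs n=7 *)
let linux_growth   = ratio 1906 645    (* Linux runtime, n=8 vs n=7 *)
let windows_growth = ratio 877 353     (* Windows runtime, n=8 vs n=7 *)

let () =
  Printf.printf "modules x%.2f, Linux x%.2f, Windows x%.2f\n"
    module_growth linux_growth windows_growth
```
&#60;/pre&#62;
&#60;p&#62;
If the runtime grew linearly with the number of modules, the two runtime
ratios would be close to the module ratio of roughly 1.7.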

&#60;/p&#62;&#60;p&#62;Note that this is not good, but it is also not a catastrophe. The
latter shows up if you try to use ocamlbuild. I couldn&#38;#39;t manage to
build the n=7 test case at all: after 30 minutes ocamlbuild slowed
down to a crawl, and progressed only with a speed of around one module
per second. Apparently, there are much worse problems than with
OMake. (Btw, it would be nice to hear how other build systems
compare.)

&#60;/p&#62;&#60;h2&#62;After improving OMake&#60;/h2&#62;

The version from today (2015-05-18)
at &#60;a href=&#34;https://github.com/gerdstolpmann/omake-fork&#34;&#62;Github&#60;/a&#62;
behaves much better:

&#60;p&#62;
&#60;/p&#62;&#60;table border=&#34;1&#34;&#62;
&#60;tr&#62;
&#60;th&#62;Size n&#60;/th&#62;
&#60;th&#62;Parallelism j&#60;/th&#62;
&#60;th&#62;Number of modules (n^4)&#60;/th&#62;
&#60;th&#62;Runtime Linux&#60;br/&#62;(Speedup factor)&#60;/th&#62;
&#60;th&#62;Runtime Windows&#60;br/&#62;(Speedup factor)&#60;/th&#62;
&#60;/tr&#62;
&#60;tr&#62;
&#60;td align=&#34;right&#34;&#62;n=7&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;j=1&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;2401&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;169 (3.8)&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;317 (1.1)&#60;/td&#62;
&#60;/tr&#62;
&#60;tr&#62;
&#60;td align=&#34;right&#34;&#62;n=7&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;j=4&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;2401&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;59 (3.6)&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;163 (1.1)&#60;/td&#62;
&#60;/tr&#62;
&#60;tr&#62;
&#60;td align=&#34;right&#34;&#62;n=8&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;j=1&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;4096&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;363 (5.3)&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;661 (1.3)&#60;/td&#62;
&#60;/tr&#62;
&#60;tr&#62;
&#60;td align=&#34;right&#34;&#62;n=8&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;j=4&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;4096&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;144 (4.2)&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;330 (1.0)&#60;/td&#62;
&#60;/tr&#62;
&#60;/table&#62;

&#60;div style=&#34;float:right; width:50%; border: 1px solid black; padding: 10px; margin-left: 1em; margin-top: 1em; background-color: #E0E0E0&#34;&#62;
There is now a &#60;a href=&#34;https://github.com/gerdstolpmann/omake-fork/tags&#34;&#62;pre-release omake-0.10.0-test1&#60;/a&#62; that can be bootstrapped! It contains all
of the described improvements, plus a number of bugfixes.
&#60;/div&#62;

&#60;p&#62;As you can see, there is a huge improvement for Linux and a slight
one for Windows. It turns out that the Linux version ran into a
Unix-specific issue of starting commands from a big process (the OMake
main process reaches around 450MB). OMake used the conventional
fork/exec combination for doing so, but it is a known problem that
this does not work well for big process images. We&#38;#39;ll come to the
details of this later. The Windows version never suffered from this
problem.

&#60;/p&#62;&#60;p&#62;The scalability is now somewhat better, but still not great. For both
Windows and Linux, the n=8 runs now take around 2.1 times longer than the
n=7 runs.

&#60;/p&#62;&#60;p&#62;Another aspect of the performance impression is how long a typical
incremental build takes after changing a single file. At least for
OMake, a good measure for this is the zero rebuild time: how long
OMake takes to figure out that nothing has changed, i.e. the time for
the second omake run in &#38;#34;omake ; omake&#38;#34;:

&#60;/p&#62;&#60;table border=&#34;1&#34;&#62;
&#60;tr&#62;
&#60;th&#62;Parameters&#60;/th&#62;
&#60;th&#62;Runtime Linux omake-0.9.8.6&#60;/th&#62;
&#60;th&#62;Runtime Linux 2015-05-18&#60;br/&#62;(Speedup Factor)&#60;/th&#62;
&#60;/tr&#62;
&#60;tr&#62;
&#60;td align=&#34;right&#34;&#62;n=7, j=1&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;16.8&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;8.4 (2.0)&#60;/td&#62;
&#60;/tr&#62;
&#60;tr&#62;
&#60;td align=&#34;right&#34;&#62;n=8, j=1&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;39.2&#60;/td&#62;
&#60;td align=&#34;right&#34;&#62;15.6 (2.5)&#60;/td&#62;
&#60;/tr&#62;
&#60;/table&#62;

&#60;p&#62;The time roughly halves. Note that you get a similar effect under
Windows as OMake doesn&#38;#39;t start any commands for a zero
rebuild. Actually, most time is spent for constructing the internal
data structures and for computing digests (not only for files but also
for commands, which turns out to be the more expensive action).


&#60;/p&#62;&#60;h2&#62;How to tackle the analysis&#60;/h2&#62;

I started it the old-fashioned way by manually instrumenting
interesting functions. This means that counts and (wall-clock)
runtimes are measured. Functions that (subjectively) &#38;#34;take too long&#38;#34;
are further analyzed by also instrumenting called functions. This way
I could quickly find out the interesting parts (while learning how
OMake works as you go through the code and instrument it). The
helper module I used: &#60;a href=&#34;https://github.com/gerdstolpmann/omake-fork/blob/master/src/libmojave/lm_instrument.mli&#34;&#62;Lm_instrument&#60;/a&#62;. (Note that
I did all the actual instrumentation in the &#38;#34;perf-test&#38;#34; branch.)
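&#60;p&#62;
To give an idea of what such manual instrumentation looks like, here is a
minimal sketch. It is not the Lm_instrument interface; the
type &#60;code&#62;probe&#60;/code&#62; and the functions &#60;code&#62;make_probe&#60;/code&#62;,
&#60;code&#62;measure&#60;/code&#62; and &#60;code&#62;report&#60;/code&#62; are hypothetical names
invented for this example.
&#60;/p&#62;
&#60;pre&#62;
```ocaml
(* Hypothetical helper, not the Lm_instrument API: count calls and
   accumulate elapsed time per probe. *)
type probe = { name : string; count : int ref; total : float ref }

let make_probe name = { name; count = ref 0; total = ref 0.0 }

(* [clock] defaults to Sys.time (CPU seconds) so that the sketch needs
   no extra library; pass Unix.gettimeofday for wall-clock time. *)
let measure ?(clock = Sys.time) probe f x =
  let t0 = clock () in
  let finish () =
    incr probe.count;
    probe.total := !(probe.total) +. (clock () -. t0)
  in
  (* Record the measurement even if [f] raises an exception. *)
  match f x with
  | result -> finish (); result
  | exception e -> finish (); raise e

let report probe =
  Printf.printf "%s: %d calls, %.3f s\n"
    probe.name !(probe.count) !(probe.total)

(* Usage: wrap an interesting call and print the statistics at the end. *)
let () =
  let p = make_probe "List.sort" in
  let _sorted = measure p (List.sort compare) [3; 1; 2] in
  report p
```
&#60;/pre&#62;
&#60;p&#62;
A function that turns out to be slow is then analyzed further by wrapping the
functions it calls in probes of their own.
&#60;/p&#62;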

&#60;p&#62;As OCaml supports gprof instrumentation I also tried this but
without success. The problem is simply that gprof looks at the wrong
metrics, namely only at the runtimes of the two innermost function
invocations in the call stack. In OCaml this is usually something like
&#60;code&#62;List.map&#60;/code&#62; calling &#60;code&#62;String.sub&#60;/code&#62;, i.e. at both
levels there are general-purpose functions. This is useless
information. We need more context for the analysis (i.e. more levels
in the call stack), but how much depends very much on where the function
is called from.

&#60;/p&#62;&#60;p&#62;Another problem of gprof was that you do not see kernel time. For
analyzing a utility like OMake whose purpose is to start external
commands this is crucial information, though.

&#60;/p&#62;&#60;p&#62;For measuring the size of OCaml values I used &#60;a href=&#34;http://forge.ocamlcore.org/projects/objsize/&#34;&#62;objsize&#60;/a&#62;.


&#60;/p&#62;&#60;h2&#62;The main points of the improvement&#60;/h2&#62;

&#60;p&#62;Summarized, the following improvements were made:

&#60;/p&#62;&#60;ul&#62;
&#60;li&#62;For Linux, I switched to posix_spawn instead of fork/exec for
    starting commands.
&#60;/li&#62;&#60;li&#62;For Linux, it was also important to avoid a self-fork of omake for
    postprocessing ocamldep output. Now temporary files are used.
&#60;/li&#62;&#60;li&#62;I rewrote the target cache that stores whether a file can be built
    or not. The new data structure for this cache highly compresses
    the data, and is better aligned to the main user, namely the
    function figuring out which implicit rules are needed to build
    a file. This way I could save processing time in this cache,
    and the memory footprint also got substantially smaller.
&#60;/li&#62;&#60;li&#62;I also rewrote the file cache that connects file names with file stats
    and digests. The new cache makes it possible to skip the computation of
    digests in more cases. Also, less data is cached (saving memory).
&#60;/li&#62;&#60;li&#62;I tweaked when the file digests are computed. This is no longer done
    immediately but delayed after the next command has been started,
    and in parallel to the command. This is in particular advantageous
    when there are some CPU resources left that could be utilized for
    this purpose.
&#60;/li&#62;&#60;li&#62;There are also simplified scanner rules in OMake.om, reducing the
    time needed for computing scanner dependencies. There is a drawback
    of the new rules, namely that when a file is moved to a new directory
    OMake does not rescan the file the next time it is run. I guess this is
    acceptable, because it normally does not matter where a file is
    stored. Nevertheless, there is an option to get the old behavior
    back (by setting EXTENDED_DIGESTS).
&#60;/li&#62;&#60;li&#62;Not regarding speed: OMake can now be built with the mingw port of OCaml.
&#60;/li&#62;&#60;/ul&#62;


&#60;h2&#62;One major problem remains&#60;/h2&#62;

&#60;p&#62;
There is still one problem I could not yet address, and this problem is
mainly responsible for the long startup time of OMake for large builds.
Unlike other build systems, OMake creates a dependency from the rule
to the command of the rule, as if every rule looked like:

&#60;/p&#62;&#60;blockquote&#62;
&#60;code&#62;
target: source1 ... sourceN :value: $(command)&#60;br/&#62;
&#38;#160;&#38;#160;&#38;#160;&#38;#160;$(command)
&#60;/code&#62;
&#60;/blockquote&#62;

i.e. when the command changes the rule &#38;#34;fires&#38;#34; and is executed. This is
an automatic addition, and it is very useful: When you start a build after
changing parameters (e.g. include paths) OMake automatically
detects which commands have changed because of this, and reruns these.
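&#60;p&#62;
The idea of this implicit dependency can be sketched as follows. This is an
illustration under assumptions and not how OMake stores its rules: the
type &#60;code&#62;rule_state&#60;/code&#62; and the functions &#60;code&#62;command_changed&#60;/code&#62;
and &#60;code&#62;record&#60;/code&#62; are made-up names. Only &#60;code&#62;Digest&#60;/code&#62; is
the module from the OCaml standard library.
&#60;/p&#62;
&#60;pre&#62;
```ocaml
(* Illustrative only: a rule remembers the digest of the command text
   it was last run with. All names here are made up. *)
type rule_state = {
  target : string;
  last_command_digest : Digest.t option;
}

(* The rule fires if it never ran or if the expanded command changed. *)
let command_changed state ~expanded_command =
  match state.last_command_digest with
  | None -> true
  | Some old -> not (Digest.equal old (Digest.string expanded_command))

(* Remember the command after the rule has been executed. *)
let record state ~expanded_command =
  { state with last_command_digest = Some (Digest.string expanded_command) }

let () =
  let fresh = { target = "a.cmx"; last_command_digest = None } in
  let s = record fresh ~expanded_command:"ocamlopt -I lib -c a.ml" in
  (* Same command: nothing to do. *)
  assert (not (command_changed s ~expanded_command:"ocamlopt -I lib -c a.ml"));
  (* A new include path changes the command, so the rule fires. *)
  assert (command_changed s ~expanded_command:"ocamlopt -I lib -I extra -c a.ml")
```
&#60;/pre&#62;
&#60;p&#62;
Note that the check needs the fully expanded command text as input, even
when the answer turns out to be that nothing has changed.
&#60;/p&#62;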

&#60;p&#62;
However, there is a price to pay. For checking whether a rule is out of date
it is required to expand the command and compute the digest. For a full
build the time for this is negligible (and you need the commands anyway
for starting them), but for a &#38;#34;zero rebuild&#38;#34; the commands are finally
not needed, and OMake expands them only for the out-of-date check. As you
might guess, this is the main reason why a zero rebuild is so slow.

&#60;/p&#62;&#60;p&#62;
It is probably possible to speed up the out-of-date check by doing a
static analysis of the command expansions. Most expansions just depend
on a small number of variables, and only if these variables change the
command can expand to something different. With that knowledge it is 
possible to compile a quick check whether the expansion is actually needed.
As any expression of the OMake language can be used for the commands,
developing such a compiler is non-trivial, and it was so far not possible
to do in my time budget.

&#60;/p&#62;&#60;div style=&#34;border: 1px solid black; padding: 10px; margin-left: 1em; margin-bottom: 1em; background-color: #E0E0E0&#34;&#62;
The next part will be published on Friday, 6/19.
&#60;/div&#62;



&#60;/div&#62;

&#60;div&#62;
  Gerd Stolpmann works as OCaml consultant.

&#60;/div&#62;

&#60;div&#62;
  
&#60;/div&#62;


          </description>
        </item>
      
        <item>
          <title>Immutable strings in OCaml-4.02</title>
          <guid>http://blog.camlcity.org/blog/bytes1.html</guid>
          <link>http://blog.camlcity.org/blog/bytes1.html</link>
          <pubDate>04 Jul 2014 00:00 GMT</pubDate>
          <description>

&#60;div&#62;
  &#60;b&#62;Why the concept is not good enough&#60;/b&#62;&#60;br/&#62;&#38;#160;
&#60;/div&#62;

&#60;div&#62;
  
In the upcoming release 4.02 of the OCaml programming language, the type
&#60;code&#62;string&#60;/code&#62; can be made immutable by a compiler
switch. Although this won&#38;#39;t be the default yet, this should be seen as
the announcement of a quite disruptive change in the
language. Eventually this will be the default in a future version. In
this article I explain why I disagree with this particular plan, and
which modifications would be better.

&#60;/div&#62;

&#60;div&#62;
  
&#60;p&#62;
Of course, the fact that &#60;code&#62;string&#60;/code&#62; is mutable doesn&#38;#39;t fit
well into a functional language. Nevertheless, it has been seen as
acceptable for a long time, probably because the developers of OCaml
did not pay much attention to strings, and felt that the benefits of a
somewhat cleaner concept wouldn&#38;#39;t outweigh the practical disadvantages
of immutable strings. Apparently, this attitude changed, and we will
see a new &#60;code&#62;bytes&#60;/code&#62; type in OCaml-4.02. This type is
accompanied by a &#60;code&#62;Bytes&#60;/code&#62; module with library functions
supporting it. The compiler was also extended so
that &#60;code&#62;string&#60;/code&#62; and &#60;code&#62;bytes&#60;/code&#62; can be used
interchangeably by default. If, however, the &#60;code&#62;-safe-string&#60;/code&#62;
switch is set on the command-line, the compiler
sees &#60;code&#62;string&#60;/code&#62; and &#60;code&#62;bytes&#60;/code&#62; as two completely
separate types.
&#60;/p&#62;

&#60;p&#62;
This is a disruptive change (if enabled): Almost all code bases will
need modifications in order to be compatible with the new
concept. Although this will often be trivial, there are also harder
cases where strings are frequently used as buffers. Before discussing
that a bit more in detail, let me point out why such disruptive
changes are so problematic. So far there was an implicit guarantee
that your code will be compatible with new compiler versions if you
stick to the well-established parts of the language and avoid
experimental additions.  I indeed have code that was developed for
OCaml-1.03 (the first version I checked out), and that code still
runs. Especially in a commercial context this is a highly appreciated
feature, because this protects the investment in the code base. As I&#38;#39;m
trying to sell OCaml to companies in my career this is a point that
bothers me. Giving up this history of excellent backward compatibility
is something we shouldn&#38;#39;t do easily, and if so, only if we get something
highly valuable back. (Of course, if you only look at the open source
and academic use of OCaml, you&#38;#39;ll put less emphasis on the compatibility
point, but it&#38;#39;s also not completely unimportant there.)
&#60;/p&#62;


&#60;h2&#62;The problem&#60;/h2&#62;
&#60;p&#62;
I&#38;#39;m fully aware that immutable strings fix some problems (the
worst probably: so far even string literals can be mutated, which can be
very surprising). However, creating a completely new type &#60;code&#62;bytes&#60;/code&#62;
comes also with some disadvantages:

&#60;/p&#62;&#60;ul&#62;
&#60;li&#62;Lack of generic accessor functions: There is &#60;code&#62;String.get&#60;/code&#62; and
there is &#60;code&#62;Bytes.get&#60;/code&#62;. The shorthand &#60;code&#62;s.[k]&#60;/code&#62; is now
restricted to strings. This is mostly a stylistic problem.

&#60;/li&#62;&#60;li&#62;The conversion of string to bytes and vice versa requires a copy:
&#60;code&#62;Bytes.of_string&#60;/code&#62;, and &#60;code&#62;Bytes.to_string&#60;/code&#62;. You have
to pay a performance penalty.

&#60;/li&#62;&#60;li&#62;In practical programming, there is sometimes no clear conceptual 
distinction between string data that are read-only and those that require
mutation. For example, if you add data to a buffer, the data may come from
a string or from another buffer. So how do you type such an &#60;code&#62;add&#60;/code&#62;
function?
&#60;/li&#62;&#60;/ul&#62;

This latter point is, in my opinion, the biggest problem. Let&#38;#39;s assume
we wanted to reimplement the &#60;code&#62;Lexing&#60;/code&#62; module of the
standard library in pure OCaml without resorting to unsafe coding
(currently it&#38;#39;s done in C). This module implements the lexing buffer
that backs the lexers generated with ocamllex. We now have to
use &#60;code&#62;bytes&#60;/code&#62; for the core of this buffer. There are three
functions in &#60;code&#62;Lexing&#60;/code&#62; for creating new buffers:

&#60;pre&#62;
val from_channel : in_channel -&#38;#62; lexbuf
val from_string : string -&#38;#62; lexbuf
val from_function : (string -&#38;#62; int -&#38;#62; int) -&#38;#62; lexbuf
&#60;/pre&#62;

The first observation is that we had better offer two more constructors
to the users of this module:

&#60;pre&#62;
val from_bytes : bytes -&#38;#62; lexbuf
val from_bytes_function : (bytes -&#38;#62; int -&#38;#62; int) -&#38;#62; lexbuf
&#60;/pre&#62;

So why do we need the ability to read from &#60;code&#62;bytes&#60;/code&#62;,
i.e. copy from one buffer to the other? We could just be a bad host
and not offer these functions to the users of the module. However,
it&#38;#39;s unavoidable anyway for &#60;code&#62;from_channel&#60;/code&#62;, because I/O
buffers are of course &#60;code&#62;bytes&#60;/code&#62;:

&#60;pre&#62;
let from_channel ch =
  (* input takes buffer, offset and length, so fix the offset at 0 *)
  from_bytes_function (fun buf n -&#38;#62; Pervasives.input ch buf 0 n)
&#60;/pre&#62;

So whenever we implement buffers that also include I/O capabilities,
it is likely that we need to handle both the &#60;code&#62;bytes&#60;/code&#62; and
the &#60;code&#62;string&#60;/code&#62; case. This is not only a problem for the
interface design. Because &#60;code&#62;string&#60;/code&#62; and &#60;code&#62;bytes&#60;/code&#62;
are completely separated, we need two different
implementations: &#60;code&#62;from_string&#60;/code&#62; and
&#60;code&#62;from_bytes&#60;/code&#62; cannot share much code.


&#60;p&#62;
This is the ironic part of the new concept: Although it tries to
make the handling of strings more sound and safe, the immediate
consequence in reality is that code needs to be duplicated because of
missing polymorphism. Any half-way intelligent programmer will of
course fall back to unsafe functions for casting bytes to strings and
vice versa (&#60;code&#62;Bytes.unsafe_to_string&#60;/code&#62;
and &#60;code&#62;Bytes.unsafe_of_string&#60;/code&#62;), and this only means
that the new &#60;code&#62;-safe-string&#60;/code&#62; option will be a driving force
for using unsafe language features.
&#60;/p&#62;
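
&#60;p&#62;
To make the trade-off concrete, here is a minimal sketch. The
helper &#60;code&#62;count_newlines&#60;/code&#62; and its two wrappers are hypothetical
names invented for this illustration; only the &#60;code&#62;Bytes&#60;/code&#62; and
&#60;code&#62;String&#60;/code&#62; functions come from the standard library. The safe
wrapper pays for a full copy, whereas the unsafe one avoids the copy at
the price of handing out a string that aliases a mutable buffer:
&#60;/p&#62;

&#60;pre&#62;
(* Hypothetical helper that only reads its argument *)
let count_newlines (s : string) : int =
  let n = ref 0 in
  String.iter (fun c -&#38;#62; if c = &#38;#39;\n&#38;#39; then incr n) s;
  !n

(* Safe: Bytes.to_string copies the whole buffer first *)
let count_newlines_safe (b : bytes) : int =
  count_newlines (Bytes.to_string b)

(* Unsafe: no copy, but the string shares memory with b, so the
   caller must not mutate b while the string is still in use *)
let count_newlines_unsafe (b : bytes) : int =
  count_newlines (Bytes.unsafe_to_string b)
&#60;/pre&#62;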

&#60;p&#62;
Let&#38;#39;s look at three modifications of the concept. Is there some easy
fix?
&#60;/p&#62;

&#60;h2&#62;Idea 1: &#60;code&#62;string&#60;/code&#62; as a supertype of &#60;code&#62;bytes&#60;/code&#62;&#60;/h2&#62;
&#60;p&#62;
We just allow that &#60;code&#62;bytes&#60;/code&#62; can officially be
coerced to &#60;code&#62;string&#60;/code&#62;:
&#60;/p&#62;

&#60;pre&#62;
let s = (b : bytes :&#38;#62; string)
&#60;/pre&#62;

&#60;p&#62;
Of course, this weakens the immutability property: &#60;code&#62;string&#60;/code&#62;
may now be a read-only interface for a &#60;code&#62;bytes&#60;/code&#62; buffer, and
this buffer can be mutated, and this mutation can be observed through
the &#60;code&#62;string&#60;/code&#62; type:
&#60;/p&#62;

&#60;pre&#62;
let mutable_string() =
  let b = Bytes.make 1 &#38;#39;X&#38;#39; in
  let s = (b :&#38;#62; string) in
  (s, Bytes.set b 0)

let (s, set) = mutable_string()
(* s is now &#38;#34;X&#38;#34; *)
let () = set &#38;#39;Y&#38;#39;
(* s is now &#38;#34;Y&#38;#34; *)
&#60;/pre&#62;

&#60;p&#62;
Nevertheless, this concept is not meaningless. In particular, if a
function takes a string argument, it is guaranteed that the string
isn&#38;#39;t modified. Also, string literals are immutable. Only when a
function returns a string are we unable to rule out that it is later
modified by a side effect.
&#60;/p&#62;

&#60;p&#62;
This variation of the concept also solves the polymorphism problem we
illustrated with the &#60;code&#62;Lexing&#60;/code&#62; module: It is now
sufficient to implement &#60;code&#62;Lexing.from_string&#60;/code&#62;, because
&#60;code&#62;bytes&#60;/code&#62; can always be coerced to &#60;code&#62;string&#60;/code&#62;:

&#60;/p&#62;&#60;pre&#62;
let from_bytes s =
  from_string (s :&#38;#62; string)
&#60;/pre&#62;


&#60;h2&#62;Idea 2: Add a read-only type &#60;code&#62;stringlike&#60;/code&#62;&#60;/h2&#62;
&#60;p&#62;
Some people may feel uncomfortable with the implication of Idea 1 that
the immutability of &#60;code&#62;string&#60;/code&#62; can be easily circumvented.
This can be avoided with a variation: Add a third type
&#60;code&#62;stringlike&#60;/code&#62; as the common supertype of both
&#60;code&#62;string&#60;/code&#62; and &#60;code&#62;bytes&#60;/code&#62;. So we allow:

&#60;/p&#62;&#60;pre&#62;
let sl1 = (s : string :&#38;#62; stringlike)
let sl2 = (b : bytes :&#38;#62; stringlike)
&#60;/pre&#62;

Of course, &#60;code&#62;stringlike&#60;/code&#62; doesn&#38;#39;t implement mutators (just like
&#60;code&#62;string&#60;/code&#62;). It is nevertheless different from &#60;code&#62;string&#60;/code&#62;:

&#60;ul&#62;
&#60;li&#62;&#60;code&#62;string&#60;/code&#62; is considered absolutely immutable (there is no
way to coerce &#60;code&#62;bytes&#60;/code&#62; to &#60;code&#62;string&#60;/code&#62;)
&#60;/li&#62;&#60;li&#62;&#60;code&#62;stringlike&#60;/code&#62; is seen as the read-only API for either
&#60;code&#62;string&#60;/code&#62; or &#60;code&#62;bytes&#60;/code&#62;, and it is allowed to mutate
a &#60;code&#62;stringlike&#60;/code&#62; behind the back of this API
&#60;/li&#62;&#60;/ul&#62;

&#60;p&#62;
&#60;code&#62;stringlike&#60;/code&#62; is especially interesting for interfaces that
need to be compatible with both &#60;code&#62;string&#60;/code&#62; and &#60;code&#62;bytes&#60;/code&#62;.
In the &#60;code&#62;Lexing&#60;/code&#62; example, we would just define

&#60;/p&#62;&#60;pre&#62;
val from_stringlike : stringlike -&#38;#62; lexbuf
val from_stringlike_function : (stringlike -&#38;#62; int -&#38;#62; int) -&#38;#62; lexbuf
&#60;/pre&#62;

and then reduce the other constructors to just these two, e.g.

&#60;pre&#62;
let from_string s =
  from_stringlike (s :&#38;#62; stringlike)

let from_bytes b =
  from_stringlike (b :&#38;#62; stringlike)
&#60;/pre&#62;

These other constructors are now only defined for the convenience
of the user.

&#60;h2&#62;Idea 3: Base &#60;code&#62;bytes&#60;/code&#62; on bigarrays&#60;/h2&#62;

&#60;p&#62;
This idea doesn&#38;#39;t fix any of the mentioned problems. Instead, the
thinking is: If we already accept the incompatibility
between &#60;code&#62;string&#60;/code&#62; and &#60;code&#62;bytes&#60;/code&#62;, let&#38;#39;s at least do
it in a way that gets the maximum out of it. Especially for I/O
buffers, bigarrays are way better suited than strings:

&#60;/p&#62;&#60;ul&#62;
&#60;li&#62;I/O primitives can directly pass the bigarrays to the operating
system (no need for an intermediate buffer, as is currently the case
for &#60;code&#62;Unix.read&#60;/code&#62; and &#60;code&#62;Unix.write&#60;/code&#62;)

&#60;/li&#62;&#60;li&#62;Bigarrays support the slicing of buffers (i.e. you can reference
subbuffers directly)

&#60;/li&#62;&#60;li&#62;Bigarrays can be aligned to page boundaries (which some
operating systems accelerate when such buffers are used for I/O)
&#60;/li&#62;&#60;/ul&#62;

&#60;p&#62;
So let&#38;#39;s define:

&#60;/p&#62;&#60;pre&#62;
type bytes =
  (char,Bigarray.int8_unsigned_elt,Bigarray.c_layout) Bigarray.Array1.t
&#60;/pre&#62;

Sure, there is now no way to unsafely cast strings to bytes and vice
versa anymore, but arguably we shouldn&#38;#39;t prefer one design over the other
only for its unsafety.


&#60;p&#62;
Regarding &#60;code&#62;stringlike&#60;/code&#62;, it is indeed possible to define it,
but there is some runtime cost. As &#60;code&#62;string&#60;/code&#62; and &#60;code&#62;bytes&#60;/code&#62;
now have different representations, any accessor function for
&#60;code&#62;stringlike&#60;/code&#62; would have to check at runtime whether it is
backed by a &#60;code&#62;string&#60;/code&#62; or by &#60;code&#62;bytes&#60;/code&#62;. At least, this
check is very cheap.
&#60;/p&#62;
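
&#60;p&#62;
One way to picture this runtime check is the following sketch. It
models &#60;code&#62;stringlike&#60;/code&#62; as an ordinary variant type, which is
only an assumption made for illustration: a built-in type could
dispatch on the runtime representation instead. The
names &#60;code&#62;bigbytes&#60;/code&#62;, &#60;code&#62;Of_string&#60;/code&#62;
and &#60;code&#62;Of_bytes&#60;/code&#62; are hypothetical:
&#60;/p&#62;

&#60;pre&#62;
(* Assumption: bytes is backed by a bigarray, as proposed in Idea 3 *)
type bigbytes =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

(* Hypothetical encoding of stringlike as a tagged union *)
type stringlike =
  | Of_string of string
  | Of_bytes of bigbytes

let length = function
  | Of_string s -&#38;#62; String.length s
  | Of_bytes b -&#38;#62; Bigarray.Array1.dim b

(* Every access pays for one tag check before the actual read *)
let get sl k =
  match sl with
  | Of_string s -&#38;#62; String.get s k
  | Of_bytes b -&#38;#62; Bigarray.Array1.get b k
&#60;/pre&#62;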


&#60;h2&#62;Conclusion&#60;/h2&#62;

I hope it has become clear that the current plan is not far-reaching
enough, as the programmer would have to choose between bad alternatives:
either pay a runtime penalty for additional copying and accept that
some code needs to be duplicated, or use unsafe coercion
between &#60;code&#62;string&#60;/code&#62; and &#60;code&#62;bytes&#60;/code&#62;. The latter is not
desirable, of course, but it is surely the task of the language
(designer) to make sound and safe string handling an attractive option.
I&#38;#39;ve presented three ideas that would all improve the concept in
some respect. In particular, the combination of ideas 2 and 3
seems to be very attractive: back &#60;code&#62;bytes&#60;/code&#62; by bigarrays,
and provide a &#60;code&#62;stringlike&#60;/code&#62; supertype for easing the
programming of application buffers.

&#60;img src=&#34;/files/img/blog/bytes1_bug.gif&#34; width=&#34;1&#34; height=&#34;1&#34;/&#62;


&#60;/div&#62;

&#60;div&#62;
  Gerd Stolpmann works as O&#38;#39;Caml consultant

&#60;/div&#62;

&#60;div&#62;
  
&#60;/div&#62;


          </description>
        </item>
      
        <item>
          <title>Welcome IPv6</title>
          <guid>http://blog.camlcity.org/blog/ipv6.html</guid>
          <link>http://blog.camlcity.org/blog/ipv6.html</link>
          <pubDate>21 Jun 2013 00:00 GMT</pubDate>
          <description>

&#60;div&#62;
  &#60;b&#62;camlcity.org now connected&#60;/b&#62;&#60;br/&#62;&#38;#160;
&#60;/div&#62;

&#60;div&#62;
  
For two weeks now, the camlcity.org website has been fully connected to IPv6.

&#60;/div&#62;

&#60;div&#62;
  
&#60;p&#62;
Actually, the raw connectivity has existed for more than two years,
but I had not found the time to put the IP addresses into DNS. This is now
done, making the site visible.
&#60;/p&#62;

&#60;p&#62;
Around 1% of the traffic is now via IPv6. This is way more than I was
expecting. Here in Germany, only a few Internet providers have already
rolled out IPv6, but the major players are planning it for 2014. It
turns out that at home I already have IPv6, although only via
DSLite. (NB. In the default DNS configuration a client connected with
DSLite or other 6-in-4 technologies will pick the IPv4 address if both
&#38;#34;Internets&#38;#34; are available, so such clients will not show up in my web
server logs as IPv6.)
&#60;/p&#62;

&#60;p&#62;
The IPv6 world is different: no NAT anymore, and every computer
has a globally routable address. This is something you need to get
used to - the Internet appears again as a real peer-to-peer
network as in the first years, and the distinction between client
and datacenter connectivity is gone. Let&#38;#39;s hope this drives
innovation - like user-controlled social networks, for instance.
&#60;/p&#62;


&#60;/div&#62;

&#60;div&#62;
  Gerd Stolpmann works as O&#38;#39;Caml consultant

&#60;/div&#62;

&#60;div&#62;
  
&#60;/div&#62;


          </description>
        </item>
      
        <item>
          <title>GODI is shutting down</title>
          <guid>http://blog.camlcity.org/blog/godi_shutdown.html</guid>
          <link>http://blog.camlcity.org/blog/godi_shutdown.html</link>
          <pubDate>22 Jul 2013 00:00 GMT</pubDate>
          <description>

&#60;div&#62;
  &#60;b&#62;Sorry!&#60;/b&#62;&#60;br/&#62;&#38;#160;
&#60;/div&#62;

&#60;div&#62;
  
Unfortunately, it is no longer possible for me to run the GODI
distribution. GODI will not upgrade to OCaml 4.01 once it is out,
and it will shut down the public service in the course of September 2013.

&#60;/div&#62;

&#60;div&#62;
  
&#60;p&#62;This website, camlcity.org, will remain up, but with reduced
content. Existing GODI installations can still be used,
but upgrades or bugfixes will not be available once GODI is shut down.

&#60;/p&#62;&#60;p&#62;
Although there are still a lot of GODI users, it is unavoidable
to shut GODI down due to lack of supporters, especially package
developers. I was more or less alone in the past months, and my
time budget will not allow me to do the upgrade to OCaml 4.01
alone (when it is released).

&#60;/p&#62;&#60;p&#62;
Also, there was a lot of noise about a competing packaging system
for OCaml in the past weeks: OPAM. Apparently, it got a lot of
attention both from individuals and from organizations. As I see
it, the OCaml community is too small to support two systems, and
so in some sense GODI is displaced by OPAM.

&#60;/p&#62;&#60;p&#62;
The sad part is that OPAM is clearly better in only one point,
namely in interacting with the community (via Github). In times
where social networks are worth billions this is probably the
decisive point. It doesn&#38;#39;t matter that OPAM lacks
some features GODI has.
So there is some loss of functionality for the community
(partly difficult to replace, like GODI&#38;#39;s support for Windows).

&#60;/p&#62;&#60;p&#62;
If somebody wants to take over GODI, please do so. The 
&#60;a href=&#34;https://godirepo.camlcity.org/svn/godi-bootstrap/&#34;&#62;source code&#60;/a&#62;
is still available as well as the 
&#60;a href=&#34;https://godirepo.camlcity.org/svn/godi-build/&#34;&#62;package directories&#60;/a&#62;.
Maybe it is sufficient to move the repository to a public place and to
redesign the package release process to give GODI a restart.

&#60;/p&#62;&#60;p&#62;
Hoorn (NL), the 22nd July 2013,

&#60;/p&#62;&#60;p&#62;
Gerd Stolpmann
&#60;/p&#62;
&#60;/div&#62;

&#60;div&#62;
  Gerd Stolpmann works as O&#38;#39;Caml consultant

&#60;/div&#62;

&#60;div&#62;
  
&#60;/div&#62;


          </description>
        </item>
      
        <item>
          <title>Plasma Map/Reduce Slightly Faster Than Hadoop</title>
          <guid>http://blog.camlcity.org/blog/plasma6.html</guid>
          <link>http://blog.camlcity.org/blog/plasma6.html</link>
          <pubDate>01 Feb 2012 00:00 GMT</pubDate>
          <description>

&#60;div&#62;
  &#60;b&#62;A performance test&#60;/b&#62;&#60;br/&#62;&#38;#160;
&#60;/div&#62;

&#60;div&#62;
  
Last week I spent some time running map/reduce jobs on Amazon EC2.
In particular, I compared the performance of Plasma, my own map/reduce
implementation, with Hadoop. I just wanted to know how much my implementation
was behind the most popular map/reduce framework. However, the surprise was
that Plasma turned out to be slightly faster in this setup.

&#60;/div&#62;

&#60;div&#62;
  
&#60;div style=&#34;float:right; width: 50ex; font-size:small; color:grey; border: 1px solid grey; padding: 1ex; margin-left: 2ex&#34;&#62;
This article is also available in other languages:
&#60;dl&#62;
&#60;dt&#62;&#60;a href=&#34;http://science.webhostinggeeks.com/plasma-map-reduce&#34;&#62;[Serbo-Croatian]&#60;/a&#62;
&#60;/dt&#62;&#60;dd&#62;translation by Anja Skrba from 
&#60;a href=&#34;http://webhostinggeeks.com/&#34;&#62;Webhostinggeeks.com&#60;/a&#62;
&#60;/dd&#62;&#60;/dl&#62;
&#60;/div&#62;
&#60;p&#62;
I would not call this test a &#38;#34;benchmark&#38;#34;. Amazon EC2 is not a
controlled environment, as you always only get partial machines, and
you don&#38;#39;t know how many resources are consumed by other users on the
same machines.  Also, you cannot be sure how far the nodes are off
from each other in the network. Finally, there are some special
effects coming from the virtualization technology, especially the
first write of a disk block is slower (roughly half the normal speed)
than following writes.  However, EC2 is good enough to get an
impression of the speed, and one can hope that all the test runs
get the same handicap on average.

&#60;/p&#62;&#60;p&#62;
The task was to sort 100G of data, given in 10 files. Each line has
100 bytes, divided into a key of 8 bytes, a TAB character, 90 random
bytes as value, and an LF character. The key was randomly chosen from
65536 possible values. This means that there were lots of lines with
the same key - a scenario that I think is more typical of map/reduce
than having unique keys. The output is partitioned into 80 sets.

&#60;/p&#62;&#60;p&#62;
I allocated 1 larger node (m1-xlarge) with 4 virtual cores and 15G of
RAM acting as combined name- and datanode, and 9 smaller nodes
(m1-large) with 2 virtual cores and 7.5G of RAM for the other
datanodes. Each node had access to two virtual disks that were
configured as a RAID-0 array. The speed for sequential reading or
writing was around 160 MB/s for the array (but only 80 MB/s for the
first time blocks were written). Apparently, the nodes had Gigabit
network cards (the maximum transfer speed was around 119MB/s).

&#60;/p&#62;&#60;p&#62;
During the tests, I monitored the system activity with the sar utility.
I observed significant cycle stealing (meaning that a virtual core is
blocked because there is no free real core), often reaching values of
25%. This could be interpreted as overdriving the available resources,
but another explanation is that the hypervisor needed this time for
itself. Anyway, this effect also questions the reliability of this
test.

&#60;/p&#62;&#60;h2&#62;The contenders&#60;/h2&#62;

&#60;p&#62;
Hadoop is the top dog in the map/reduce scene. In this test, the
version from Cloudera 0.20.2-cdh3u2 was used, which contains more than
1000 patches against the vanilla 0.20.2 version. Written in Java, it
needs a JVM at runtime, which was here IcedTea 1.9.10 distributing
OpenJDK 1.6.0_20. I did not do any tuning, hoping that the configuration
would be ok for a small job. The HDFS block size was 64M, without
replication.

&#60;/p&#62;&#60;p&#62;
The contender is Plasma Map/Reduce. I started this project two years
ago in my spare time. It is not a clone of the Hadoop architecture,
but includes many new ideas. In particular, a lot of work went into
the distributed filesystem PlasmaFS which features an almost complete
set of file operations, and controls the disk layout directly. The
map/reduce algorithm uses a slightly different scheme which tries
to delay the partitioning of the data to get larger intermediate files.
Plasma is implemented in OCaml, which isn&#38;#39;t VM-based but compiles
the code directly to assembly language. In this test, the blocksize
was 1M (Plasma is designed for smaller-sized blocks). The software
version of Plasma is roughly 0.6 (a few svn revisions before the release
of 0.6).

&#60;/p&#62;&#60;h2&#62;Results&#60;/h2&#62;

&#60;p&#62;The runtimes:

&#60;/p&#62;&#60;p&#62;
&#60;/p&#62;&#60;table&#62;
  &#60;tr&#62;
    &#60;td&#62;&#60;b&#62;Hadoop:&#60;/b&#62;&#60;/td&#62;     &#60;td&#62;&#60;b&#62;2265 seconds&#60;/b&#62; (37 min, 45 s)&#60;/td&#62;
  &#60;/tr&#62;
  &#60;tr&#62;
    &#60;td&#62;&#60;b&#62;Plasma:&#60;/b&#62;&#60;/td&#62;     &#60;td&#62;&#60;b&#62;1975 seconds&#60;/b&#62; (32 min, 55 s)&#60;/td&#62;
  &#60;/tr&#62;
&#60;/table&#62;

&#60;p&#62;
Given the uncertainty of the environment, this is no big difference.
But let&#38;#39;s have a closer look at the system activity to get an idea
why Plasma is a bit faster.

&#60;/p&#62;&#60;h2&#62;CPU&#60;/h2&#62;

In the following I took simply one of the datanodes, and created
diagrams (with kSar):

&#60;p&#62;
&#60;img src=&#34;/files/img/blog/edited_hadoop_cpu_all.png&#34; width=&#34;799&#34; height=&#34;472&#34;/&#62;

&#60;/p&#62;&#60;p&#62;
&#60;img src=&#34;/files/img/blog/edited_plasma_cpu_all.png&#34; width=&#34;800&#34; height=&#34;471&#34;/&#62;

&#60;/p&#62;&#60;p&#62;
Note that kSar does not draw graphs for %iowait and %steal, although
these data are recorded by sar. This explains why the sum of
user, system and idle is not 100%.

&#60;/p&#62;&#60;p&#62;
What we see here is that Hadoop consumes all CPU cycles, whereas
Plasma leaves around 1/3 of the CPU capacity unused. Given the fact
that this kind of job is normally I/O-bound, it just means that Hadoop
is more CPU-hungry, and would have benefited from getting more cores
in this test.

&#60;/p&#62;&#60;h2&#62;Network&#60;/h2&#62;

In this diagram, reads are blue and red, whereas writes are green and
black. The first curve shows packets per second, and the second bytes
per second:

&#60;p&#62;
&#60;img src=&#34;/files/img/blog/edited_hadoop_eth0.png&#34; width=&#34;800&#34; height=&#34;333&#34;/&#62;

&#60;/p&#62;&#60;p&#62;
&#60;img src=&#34;/files/img/blog/edited_plasma_eth0.png&#34; width=&#34;800&#34; height=&#34;319&#34;/&#62;

Summing reads and writes up, Hadoop uses only around 7MB/s on average
whereas Plasma transmits around 25MB/s, more than three times as
much. There could be two explanations:

&#60;/p&#62;&#60;ul&#62;
  &#60;li&#62;Because Hadoop is CPU-underpowered, it remains below its
      possibilities
  &#60;/li&#62;&#60;li&#62;The Hadoop scheme is more optimized for keeping the network
      bandwidth as low as possible
&#60;/li&#62;&#60;/ul&#62;

The background for the second point is the following: Because Hadoop
partitions the data immediately after mapping and sorting, the data
has (ideally) only to cross the network once.  This is different in
Plasma - which generally partitions the data iteratively. In this
setup, after mapping and sorting only 4 partitions are created, which
are further refined in the following split-and-merge rounds.  As we
have here 80 partitions in total, there is at least one further step
in which data partitioning is refined, meaning that the data has to
cross the network roughly twice. This already explains 2/3 of the
observed difference.  (As a side note, one can configure how many
partitions are initially created after mapping and sorting, and it
would have been possible to mimic Hadoop&#38;#39;s scheme by setting this
value to 80.)

&#60;h2&#62;Disks&#60;/h2&#62;

These diagrams depict the disk reads and writes in KB/second:

&#60;p&#62;
&#60;img src=&#34;/files/img/blog/edited_hadoop_md0.png&#34; width=&#34;800&#34; height=&#34;332&#34;/&#62;

&#60;/p&#62;&#60;p&#62;
&#60;img src=&#34;/files/img/blog/edited_plasma_md0.png&#34; width=&#34;800&#34; height=&#34;332&#34;/&#62;

The average numbers are (directly taken from sar):

&#60;/p&#62;&#60;p&#62;
&#60;/p&#62;&#60;table&#62;
  &#60;tr&#62;
    &#60;td&#62;&#38;#160;&#60;/td&#62;
    &#60;th&#62;Hadoop&#60;/th&#62;
    &#60;th&#62;Plasma&#60;/th&#62;
  &#60;/tr&#62;
  &#60;tr&#62;
    &#60;td&#62;Read/s:&#60;/td&#62;
    &#60;td&#62;17.6 MB/s&#60;/td&#62;
    &#60;td&#62;31.2 MB/s&#60;/td&#62;
  &#60;/tr&#62;
  &#60;tr&#62;
    &#60;td&#62;Write/s:&#60;/td&#62;
    &#60;td&#62;30.8 MB/s&#60;/td&#62;
    &#60;td&#62;33.9 MB/s&#60;/td&#62;
  &#60;/tr&#62;
&#60;/table&#62;

&#60;p&#62;
Obviously, Plasma reads data from disk at around twice the rate of
Hadoop, whereas the write speed is about the same. Apart from this, it
is interesting that the shapes of the curves are quite different:
Hadoop has a period of high disk activity at the end of the job (when
it is busy merging data), whereas Plasma utilizes the disks better
during the first third of the job.

&#60;/p&#62;&#60;h2&#62;Plausibility&#60;/h2&#62;

&#60;p&#62;
Neither of the contenders made the best use of the I/O resources at all
times. Part of the difficulty of developing a map/reduce scheme is to
ensure that the load put onto the disks and onto the network is
balanced. It is not good when e.g. the disks are used to 100% at a
certain point and the network is underutilized, but during the next
period the network is at 100% and the disk not fully used. A balanced
distribution of the load reaches higher throughput in total.

&#60;/p&#62;&#60;p&#62;
Let&#38;#39;s analyze the Plasma scheme a bit more in detail. The data set of
100G (which does not change in volume during the processing) is copied
four times in total: once in the map-and-sort phase, and three times
in the reduce phase (for this volume Plasma needs three merging
rounds). This means we have to transfer 4 * 100G of data in total, or
40G of data per node (remember we have 10 nodes). We ran 22 cores for
1975 seconds, which gives a capacity of 43450 CPU seconds. Plasma
tells us in its reports that it used 3822 CPU seconds for in-RAM
sorting, which we should subtract for analyzing the I/O
throughput. Per core these are 173 seconds. This means each node had
1975-173 = 1802 seconds for handling the 40G of data. This makes
around 22 MB per second on each node.

&#60;/p&#62;&#60;p&#62;
The Hadoop scheme differs mostly in that the data is only copied twice
in the merge phase (because Hadoop by default merges more files in
one round than Plasma). However, because of its design there is an
extra copy at the end of the reduce phase (from disk to HDFS).  This
means Hadoop also solves the same job by transferring 4 * 100G of data.
There is no counter for measuring the time spent for in-RAM sorting.
Let&#38;#39;s assume this time is also around 3800 seconds, i.e. roughly
175 seconds per core. This means each
node had 2265 - 175 = 2090 seconds for handling 40G of data, or
19 MB per second on each node.
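&#60;/p&#62;&#60;p&#62;
The two estimates can be reproduced with a few lines of OCaml. This is
only a sketch of the arithmetic above, under two assumptions: G is read
as 10^9 bytes, and Hadoop is charged the same in-RAM sorting time that
Plasma reported (the text rounds it to 3800 seconds). The
names &#60;code&#62;cores&#60;/code&#62;, &#60;code&#62;per_node_bytes&#60;/code&#62;
and &#60;code&#62;rate&#60;/code&#62; are made up for this illustration:
&#60;/p&#62;&#60;pre&#62;
(* 1 m1-xlarge with 4 cores plus 9 m1-large with 2 cores each *)
let cores = 4 + 9 * 2

(* 4 copies of 100G, spread over 10 nodes; G taken as 10^9 bytes *)
let per_node_bytes = 4.0 *. 100e9 /. 10.0

(* In-RAM sorting time per core, from Plasma&#38;#39;s report *)
let sort_per_core = 3822.0 /. float_of_int cores

(* Effective I/O rate in MB/s for a given wall-clock runtime in seconds *)
let rate runtime = per_node_bytes /. (runtime -. sort_per_core) /. 1e6

let () =
  Printf.printf &#38;#34;Plasma: %.1f MB/s\n&#38;#34; (rate 1975.0);
  Printf.printf &#38;#34;Hadoop: %.1f MB/s\n&#38;#34; (rate 2265.0)
&#60;/pre&#62;&#60;p&#62;
The printed values should be close to the 22 and 19 MB per second
quoted above.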

&#60;/p&#62;&#60;h2&#62;Conclusion&#60;/h2&#62;

&#60;p&#62;
It looks very much as if both implementations are slowed down by
specifics of the EC2 environment. Especially the disk I/O, probably
the essential bottleneck here, is far below what one can expect.
Plasma probably won because it uses the CPU more efficiently, whereas
other aspects like network utilization are better handled by Hadoop.

&#60;/p&#62;&#60;p&#62;
For my project this result just means that it is on the right track.
In particular, this small setup (only 10 nodes) is easily handled, which
suggests that Plasma is scalable to at least a small multiple of
this. The bottleneck here would be the namenode, but there is still a
lot of headroom.

&#60;/p&#62;&#60;h2&#62;Where to get Plasma&#60;/h2&#62;

&#60;p&#62;Plasma Map/Reduce and PlasmaFS are bundled together in one download. Here is the
&#60;a href=&#34;http://projects.camlcity.org/projects/plasma.html&#34;&#62;project page&#60;/a&#62;.

&#60;/p&#62;&#60;p&#62;

&#60;img src=&#34;/files/img/blog/plasma6_bug.gif&#34; width=&#34;1&#34; height=&#34;1&#34;/&#62;

&#60;/p&#62;
&#60;/div&#62;

&#60;div&#62;
  Gerd Stolpmann works as O&#38;#39;Caml consultant

&#60;/div&#62;

&#60;div&#62;
  
&#60;/div&#62;


          </description>
        </item>
      
        <item>
          <title>After NoSQL there will be NoServer</title>
          <guid>http://blog.camlcity.org/blog/plasma5.html</guid>
          <link>http://blog.camlcity.org/blog/plasma5.html</link>
          <pubDate>04 Nov 2011 00:00 GMT</pubDate>
          <description>

&#60;div&#62;
  &#60;b&#62;An experiment, and a vision&#60;/b&#62;&#60;br/&#62;&#38;#160;
&#60;/div&#62;

&#60;div&#62;
  
&#60;p&#62;
The recent success of NoSQL technologies has not only to do with the
fact that they take advantage of distribution and replication, but
even more with the &#38;#34;middleware effect&#38;#34; that these features became
relatively easy to use.  Now it is no longer required to be an expert
in these cluster techniques in order to profit from them. Let&#38;#39;s think
a bit ahead: what could a platform look like that makes distributed
programming even easier, and that integrates several styles of storing
data and managing computations?

&#60;cc-field name=&#34;maintext&#34;&#62;
&#60;p&#62;
The starting point for this exploration is a recent experience I made
with my own attempt in the NoSQL arena,
the &#60;a href=&#34;http://plasma.camlcity.org&#34;&#62;Plasma project&#60;/a&#62;. Two weeks
ago, it was &#38;#34;only&#38;#34; a distributed, replicating, and failure-resilient
filesystem PlasmaFS, with its own map/reduce implementation on top of
it. Then I had an idea: is it possible to develop a key/value database
on top of this filesystem? Which features, and relative
advantages/disadvantages would it have? In other words, I was
examining whether the existing platform makes it simpler to develop
a database with a reasonable feature set.

&#60;/p&#62;&#60;p&#62;
When we talk about clusters, I have especially Internet applications
in mind that are bombarded by the users with requests, but that have
also to do a lot of background processing.


&#60;/p&#62;&#60;h2&#62;The key/value database needed less than 2000 lines of code&#60;/h2&#62;

&#60;p&#62;
Now, PlasmaFS does not follow the simple pattern of HDFS, but is based
on a transactional core, and it even allows the users to manage the
transactions. For example, it is possible to rename a bunch of files
atomically by just wrapping the rename operations into a single
transaction.  The transactional support goes even further: When
reading from a file one can activate a special snapshot mode, which
just means that the reader&#38;#39;s view of the file is isolated from any
writes happening at the same time.

&#60;/p&#62;&#60;p&#62;
These are clearly advanced features, and the question was whether they
helped for writing a key/value database library. And yes, it was
extremely helpful - in less than 2000 lines of code this library
provides data distribution and replication, a high degree of data
safety, almost unlimited scalability for database reads, and
reasonable performance for writes. Of course, most of these features
are just &#38;#34;inherited&#38;#34; from PlasmaFS, and the library just had to
implement the file format (i.e. a B tree,
see &#60;a href=&#34;http://projects.camlcity.org/projects/dl/plasma-0.5/doc/html/Plasmakv_intro.html&#34;&#62;
this page for details&#60;/a&#62;). This is not cheating, but exactly the
point: the platform makes it easy to provide features that would
otherwise be extremely complicated to provide.

&#60;/p&#62;&#60;h2&#62;NoServer&#60;/h2&#62;

&#60;p&#62;
This key/value database is just a library, and one can use it only
on machines where PlasmaFS is deployed. Of course it is possible to
access the same database file from several machines - PlasmaFS handles
all the networking involved with it. The point is that during the
implementation of the library this never had to be taken into account.
There is no networking code in this library, and this is why it is
the first example of the new NoServer paradigm - not only server.

&#60;/p&#62;&#60;p&#62;
The genuine advantage of this paradigm is that it enables developers
to write code they never would be able to create without the help of
the platform. This is a bit comparable to the current situation for
SQL databases: Everybody can store data in them, even over the
network, without needing to have any clue how this works in detail.
In the NoServer paradigm, we just go one step further, because the
services provided by the platform are a lot more low-level, and the
developer has a lot more freedom. Instead of using a query language,
the shared resources are accessed with normal file operations,
extended by transactional directives. The hope is that this makes
a lot of server programming superfluous, especially the difficult
parts of it (e.g. what to do when a machine crashes).

&#60;/p&#62;&#60;p&#62;
A simple key/value database is obviously not difficult to create with
these programming means. The interesting question is what else can be
done with it in a cluster environment. Obviously, having a common
filesystem on all machines of the cluster makes a lot of file copying
superfluous that a normal cluster would do with rsync and/or
ssh. PlasmaFS can even be directly mounted (although the transactional
features are unavailable then), so even applications that have not been
specially ported to it can access PlasmaFS files.  An example
would be a read-only Lucene search index residing in PlasmaFS.
Replacing the index by an updated one would be done by simply moving
the new index into the right directory, and signalling Lucene that it
has to re-open the index.

&#60;/p&#62;&#60;p&#62;
This much of Plasma is implemented, and works well (I just released
version 0.5, which is now beta quality). The vision goes of course
beyond that.

&#60;/p&#62;&#60;h2&#62;What the platform also needs&#60;/h2&#62;

&#60;p&#62;
There are a number of further data structures that can obviously be
well represented in files, such as hashtables or queues. Let&#38;#39;s explore
the latter in a bit more detail: What would a queue manager look like?
There are a few data representation options. For example, every queue
element could be a file in a directory, or a container format could be
established to which the elements can be appended. PlasmaFS also
allows cutting arbitrary holes into files, so it is even possible to
physically remove elements from the beginning of the queue file by
just removing the data blocks storing the elements from the file.  As
we don&#38;#39;t want to run the queue manager as a server, but just as a library
inside any program accessing the queue, the question is how event
notifications are handled (which would be obvious in a server context).
Usually, one has to notify some followup processor when new elements
have been added to the queue. Plasma currently does not include a
method for doing this, so the platform needs to be extended by a
notification framework (which should not be too difficult).
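&#60;/p&#62;&#60;p&#62;
The first representation option, one file per queue element, can be
sketched in a few lines. The following Python code is only a minimal
single-process illustration that uses a local directory as a stand-in
for a PlasmaFS directory. The class DirQueue, its method names, and the
file naming scheme are hypothetical and not part of Plasma, and the
sketch has neither locking nor notifications.
&#60;/p&#62;&#60;pre&#62;
```python
import os
import tempfile


class DirQueue:
    """FIFO queue in which every element is one file in a directory.

    Hypothetical sketch: a local directory stands in for a PlasmaFS
    directory, and only one process is assumed to access the queue.
    """

    def __init__(self, path: str) -> None:
        self.path = path
        os.makedirs(path, exist_ok=True)

    def _names(self):
        # Zero-padded sequence numbers sort in insertion order.
        return sorted(n for n in os.listdir(self.path) if n.endswith(".item"))

    def put(self, data: bytes) -> None:
        names = self._names()
        seq = int(names[-1].split(".")[0]) + 1 if names else 0
        tmp = os.path.join(self.path, f"{seq:012d}.tmp")
        with open(tmp, "wb") as f:
            f.write(data)
        # The rename makes the complete element visible in one step.
        os.rename(tmp, os.path.join(self.path, f"{seq:012d}.item"))

    def get(self):
        names = self._names()
        if not names:
            return None  # queue is empty
        full = os.path.join(self.path, names[0])
        with open(full, "rb") as f:
            data = f.read()
        os.remove(full)
        return data


queue = DirQueue(tempfile.mkdtemp())
queue.put(b"first")
queue.put(b"second")
oldest = queue.get()  # the element that was added first
```
&#60;/pre&#62;&#60;p&#62;
A consumer in this sketch would have to poll the directory for new
files, which is exactly the gap that a notification framework would
close.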

&#60;/p&#62;&#60;p&#62;
An important question is also how programs running on
different nodes are activated. In my vision there would be a central task execution
manager. Of course, this manager is normal client/server middleware.
Again, the point here is that the application developer needs no 
special skills for triggering remote activation, but just uses
libraries. I do not have an absolutely clear picture of this part yet, but
it seems to be necessary to have the option of invoking programs
in the inetd style as well as directly as if started via ssh.
Also, a central directory would be maintained that includes
important data such as which program can be run on which node.

&#60;/p&#62;&#60;h2&#62;We won&#38;#39;t live totally without servers, only with fewer ones&#60;/h2&#62;

&#60;p&#62;
My vision does not mean that servers are banned completely. We will
still need them for special features or data access patterns, and of
course for interaction with other systems.  For example, PlasmaFS is
bad at coordinating concurrent write accesses to the same file. Also,
PlasmaFS employs a central namenode with a limited capacity only. So,
if you are doing OLTP processing, a normal SQL database will still do
better. If you need extraordinary write performance, but can pay the
price of weakened consistency guarantees, a system like Cassandra will
work better.

&#60;/p&#62;&#60;p&#62;
Nevertheless, there is the big field of &#38;#34;average deployments&#38;#34; where
the number of nodes is not too big and the performance requirements
are not too special, but the ACID guarantees PlasmaFS gives are
essential. For this field, the NoServer paradigm could be the ideal
choice to reduce the development overhead dramatically.

&#60;/p&#62;&#60;h2&#62;Check Plasma out&#60;/h2&#62;

The &#60;a href=&#34;http://plasma.camlcity.org&#34;&#62;Plasma homepage&#60;/a&#62; provides
a lot of documentation, and especially downloads. Also take a look at
the &#60;a href=&#34;http://plasma.camlcity.org/plasma/perf.html&#34;&#62;performance
page&#60;/a&#62;, describing a few tests I recently ran.




&#60;/cc-field&#62;
&#60;/p&#62;
&#60;/div&#62;

&#60;div&#62;
  
&#60;/div&#62;

&#60;div&#62;
  Gerd Stolpmann works as O&#38;#39;Caml consultant.
&#60;a href=&#34;search1.html&#34;&#62;Currently looking for new jobs as consultant!&#60;/a&#62;

&#60;/div&#62;

&#60;div&#62;
  
&#60;/div&#62;


          </description>
        </item>
      
        <item>
          <title>PlasmaFS</title>
          <guid>http://blog.camlcity.org/blog/plasma4.html</guid>
          <link>http://blog.camlcity.org/blog/plasma4.html</link>
          <pubDate>18 Oct 2011 00:00 GMT</pubDate>
          <description>

&#60;div&#62;
  &#60;b&#62;A serious distributed filesystem&#60;/b&#62;&#60;br/&#62;&#38;#160;
&#60;/div&#62;

&#60;div&#62;
  
&#60;p&#62;
A few days ago, I
released &#60;a href=&#34;http://plasma.camlcity.org&#34;&#62;Plasma-0.4.1&#60;/a&#62;.  This
article gives an overview of its filesystem subsystem, which
is actually the more important part. PlasmaFS differs in many points
from popular distributed filesystems like HDFS. The differences start
right at the beginning, with the requirements analysis.

&#60;cc-field name=&#34;maintext&#34;&#62;
&#60;p&#62;
A distributed filesystem (DFS) makes it possible to store giant amounts of
data.  A high number of data nodes (computers with hard disks) can be
attached to a DFS cluster, and usually a second kind of node, called
name node, is used to store metadata, i.e. which files are stored and
where. The point is that the volume of metadata can be very low
compared to the payload data (the ratios are somewhere between
1:10,000 and 1:1,000,000), so a single name node can manage quite a
large cluster. Also, the clients can contact the data nodes
directly to access payload data - the traffic is not routed via
the name node like in &#38;#34;normal&#38;#34; network filesystems. This allows
enormous bandwidths.

&#60;/p&#62;&#60;p&#62;
The motivation for developing another DFS was that existing
implementations, and especially the popular HDFS, make (in my opinion)
unfortunate compromises to gain speed:

&#60;/p&#62;&#60;ul&#62;
  &#60;li&#62;The metadata is not well protected. Although the metadata is
   saved to disk and usually also replicated to another computer, these 
   &#38;#34;safety copies&#38;#34; lag behind. In the case of an outage, data loss
   is common (HDFS even fails fatally when the disk fills up).
   Given the amount of data, this is not acceptable. It&#38;#39;s like a
   local filesystem without journaling.&#60;br/&#62;&#38;#160;
  &#60;/li&#62;&#60;li&#62;The name node protocol is too simplistic, and because of this,
   DFS implementations need ultra-high-speed name node implementations
   (at least several tens of thousands of operations per second) to manage larger clusters.
   Another consequence is that only large block sizes (several megabytes)
   promise decent access speeds, because this is the only implemented
   strategy to reduce the frequency of name node operations.&#60;br/&#62;&#38;#160;
  &#60;/li&#62;&#60;li&#62;Unless you can physically separate the cluster from the rest
    of the network, security is a requirement. It is difficult to provide,
    however, mainly because the data nodes are independently accessed, and you
    want to avoid that data nodes have to continuously check for
    access permissions. So the compromise is to leave this out in the
    DFS, and rely on complicated and error-prone configurations in
    network hardware (routers and gateways).
&#60;/li&#62;&#60;/ul&#62;

&#60;p&#62;
I&#38;#39;m not saying that HDFS is a bad implementation. My point is only that
there is an alternative where safety and security are taken more
seriously, and that there are other ways to get high speed than those
that are implemented in HDFS.

&#60;/p&#62;&#60;h2&#62;Using SSDs for transacted metadata stores&#60;/h2&#62;

PlasmaFS starts at a different point. It uses a data store with full
transactional support (right now this is PostgreSQL, just for
development simplicity, but other, more lightweight systems could
also fill this role). This includes:

&#60;ul&#62;
  &#60;li&#62;Data are made persistent in a way so that full ACID support
    is guaranteed (remember, the ACID properties are atomicity,
    consistency, isolation, and durability).
  &#60;/li&#62;&#60;li&#62;For keeping replicas synchronized, we demand support for
    two-phase commit, i.e. that transactions can be prepared before
    the actual commit with the guarantee that the commit is fail-safe
    after preparation. (Essentially, two-phase commit is a protocol
    between two database systems keeping them always consistent.)
&#60;/li&#62;&#60;/ul&#62;

This is, by the way, the established gold-standard way of ensuring
data safety for databases.  It comes with its own problems, and the
most challenging is that commits are relatively slow. The reason for this
is the storage hardware - for normal hard disks the maximum frequency
of commits is a function of the rotation speed. Fortunately, there is
now an alternative: SSDs at present allow several tens of thousands of syncs per
second, which is two orders of magnitude more than classic hard disks
provide. Good SSDs are still expensive, but luckily moderate disk
sizes are already sufficient (with only a 100 GB database you can
already manage a really giant filesystem).

&#60;p&#62;Still, writing each modification directly to the SSD limits the
speed compared to what systems like HDFS can do (because HDFS keeps
the data in RAM, and only writes a copy to disk now and then).  We need
more techniques to keep the name node from becoming a bottleneck:

&#60;/p&#62;&#60;ul&#62;
  &#60;li&#62;PlasmaFS provides a transactional view to users. This works
    very much like the transactions in SQL. The performance advantage here is that
    several write operations can be carried out with only one commit.
    PlasmaFS takes this so far that an unlimited number of metadata
    operations can be put into a transaction, such as creating and
    deleting files, allocating blocks for the files, and retrieving
    block lists. It is possible to write terabytes of data to files with
    &#60;i&#62;only a single commit&#60;/i&#62;! Applications accessing large files
    sequentially (as, e.g., in the map/reduce framework) can especially
    profit from this scheme.&#60;br/&#62;&#38;#160;
  &#60;/li&#62;&#60;li&#62;PlasmaFS addresses blocks linearly: for each data node the blocks
    are identified by numbers from 0 to n-1. This is safe, because we
    manage the consistency globally (basically, there is a kind of
    join between the table managing which blocks are used or free, and
    the table managing the block lists per file, and our safety
    measures make it possible to keep this join consistent). In contrast,
    other DFS use GUIDs to identify blocks. The linear scheme,
    however, allows block lists to be transmitted and stored in a
    compressed way (extent-based). For example, if a file uses the
    blocks 10 to 14 on a data node, this is stored as &#38;#34;10-14&#38;#34;, and not
    as &#38;#34;10,11,12,13,14&#38;#34;. Also, block allocations are always done
    for ranges of blocks. This greatly reduces the number
    of name node operations while only moderately increasing their
    complexity.&#60;br/&#62;&#38;#160;
  &#60;/li&#62;&#60;li&#62;A version number is maintained per file that is
    increased whenever data or metadata are modified. This makes it possible
    to keep external caches up to date with only low overhead: A quick
    check whether the version number has changed is sufficient to
    decide whether the cache needs to be refreshed. This is reliable,
    in contrast to cache consistency schemes that rely only on the
    last modification time. Currently this is used to keep the
    caches of the NFS bridge synchronized. Applications that randomly access
    only a few files profit especially from such caching.
&#60;/li&#62;&#60;/ul&#62;
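
&#60;p&#62;
The extent-based representation of block lists can be illustrated with
a short Python sketch. It only shows the idea of collapsing consecutive
block numbers into ranges such as &#38;#34;10-14&#38;#34;. The function names and
the textual format are assumptions made for this example and are not
taken from the PlasmaFS protocol.
&#60;/p&#62;&#60;pre&#62;
```python
from typing import Iterable, List, Tuple

Extent = Tuple[int, int]  # inclusive range: (first_block, last_block)


def to_extents(blocks: Iterable[int]) -> List[Extent]:
    """Collapse block numbers into sorted, inclusive ranges."""
    extents: List[Extent] = []
    for block in sorted(set(blocks)):
        if extents and extents[-1][1] + 1 == block:
            # The block directly follows the last range: extend it.
            extents[-1] = (extents[-1][0], block)
        else:
            extents.append((block, block))
    return extents


def format_extents(extents: List[Extent]) -> str:
    """Render ranges as text, e.g. blocks 10 to 14 as 10-14."""
    return ",".join(
        str(first) if first == last else f"{first}-{last}"
        for first, last in extents
    )


def from_extents(extents: List[Extent]) -> List[int]:
    """Expand ranges back into the individual block numbers."""
    return [b for first, last in extents for b in range(first, last + 1)]


compact = format_extents(to_extents([10, 11, 12, 13, 14]))
```
&#60;/pre&#62;&#60;p&#62;
The longer the runs of consecutive blocks, the bigger the saving, which
is one reason why allocating blocks in ranges pays off.
&#60;/p&#62;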

&#60;p&#62;
I consider the map/reduce part of Plasma an especially good test
case for PlasmaFS. Of course, this map/reduce implementation is
perfectly adapted to PlasmaFS, and uses all possibilities to reduce
the frequency of name node operations. It turns out that a typical
running map/reduce task contacts the name node only every 3-4 seconds,
usually to refill a buffer that got empty, or to flush a full buffer
to disk. The point here is that a buffer can be larger than a data
block, and that only a single name node transaction is sufficient to
handle all blocks in the buffer in one go. The buffers are typically
way larger than only a single block, so this reduces the number of
name node operations quite dramatically.  (Important note: This number
(3-4) is only correct for Plasma&#38;#39;s map/reduce implementation which
uses a modified and more complex algorithm scheme, but it is not
applicable to the scheme used by Hadoop.)

&#60;/p&#62;&#60;h2&#62;Speed&#60;/h2&#62;

&#60;p&#62;
I have done some tests with the latest development version of
Plasma. The peak number of commits per second seems to be around 500
(here, a &#38;#34;commit&#38;#34; is a transaction writing data that can include
several data update operations). This test used a recently bought SSD,
and ran on a quad-core server machine. It was not evident that the SSD
was the bottleneck (one indication is that the test ran only slightly
faster when syncs were turned off), so there is probably still a lot
of room for optimization.

&#60;/p&#62;&#60;p&#62;
Given that a map/reduce task needs the name node only every 3-4 seconds,
this &#38;#34;commit speed&#38;#34; would theoretically be sufficient for around
1600 tasks running in parallel. It is likely that other limits are
hit first (e.g. the switching capacity). Anyway, these are encouraging
numbers showing that this young project is not on the wrong track.
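&#60;/p&#62;&#60;p&#62;
The estimate is simple arithmetic: the peak commit rate multiplied by
the interval between two name node contacts of a single task. The
following Python lines spell this out with the measured 500 commits per
second and the 3-4 second interval mentioned for the map/reduce tasks.
The function name is made up for this illustration.
&#60;/p&#62;&#60;pre&#62;
```python
def max_parallel_tasks(commits_per_second: float,
                       seconds_between_commits: float) -> float:
    """Tasks the name node can serve if each task commits once per interval."""
    return commits_per_second * seconds_between_commits


low = max_parallel_tasks(500, 3)   # each task commits every 3 seconds
high = max_parallel_tasks(500, 4)  # each task commits every 4 seconds
```
&#60;/pre&#62;&#60;p&#62;
This gives a range of 1500 to 2000 tasks. The figure of around 1600
corresponds to an interval of about 3.2 seconds.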

&#60;/p&#62;&#60;p&#62;
The above techniques are already implemented in PlasmaFS. More advanced
options that could be worth an implementation include:

&#60;/p&#62;&#60;ul&#62;
  &#60;li&#62;As we can maintain exact replicas of the primary name node (via
    two-phase commit), it becomes possible to also use the replicas
    for read accesses. For certain types of read operations this is
    non-trivial, though, because they have an effect on the block
    allocation map (essentially we would need to synchronize a certain
    buffer in both the primary and secondary servers that controls
    delayed block deallocation). Nevertheless, this is certainly a viable option.
    Even writes could be handled by
    the secondary nodes, but this tends to become very complicated,
    and is probably not worth it.&#60;br/&#62;&#38;#160;
  &#60;/li&#62;&#60;li&#62;An easier option to increase the capacity is to split the file
    space, so that each name node takes care of a partition only. A
    user transaction would still need a uniform view on the filesystem,
    though. If a name node receives a request for an operation it
    cannot do itself, it automatically extends the scope of the
    transaction to the name node that is responsible for the right
    partition. This scheme would also use the two-phase commit protocol
    for keeping the partitions consistent. I think this option is viable,
    but only at the price of a complex development effort.
&#60;/li&#62;&#60;/ul&#62;

&#60;p&#62;
Given that these two improvements are very complicated to implement,
it is unlikely that they will be done soon. There is still a lot of fruit
hanging on the lower branches of the tree.


&#60;/p&#62;&#60;h2&#62;Delegated access control checks&#60;/h2&#62;

&#60;p&#62;
Let&#38;#39;s quickly discuss another problem, namely how to secure accesses
to data nodes. It is easy to accept that the name nodes can be secured
with classic authentication and authorization schemes in the same
style as they are used for other server software, too. For data nodes,
however, we face the problem that we need to supervise every access to a
data block individually, but want to avoid any extra overhead, especially
having to check each data access with the name node.

&#60;/p&#62;&#60;p&#62;
PlasmaFS uses a special cryptographic ticket system to avoid
this. Essentially, the name node creates random keys at periodic
intervals, and broadcasts these to the data nodes. These keys are
secrets shared by the name and data nodes. The accessing clients get
only HMAC-based tickets generated from the keys and from the block ID
the clients are granted access to.  These tickets can be checked by
the data nodes because these nodes know the keys. When the client
loses the right to access the blocks (i.e. when the client transaction
ends), the corresponding key is revoked.

&#60;/p&#62;&#60;p&#62;
With some additional tricks, the only
communication between the name node and the data node can be reduced to a periodic
maintenance call that hands out the new keys and revokes the expired
keys. That&#38;#39;s an acceptable overhead.
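&#60;/p&#62;&#60;p&#62;
The ticket check can be illustrated with a small Python sketch. It
assumes SHA-256 as the hash function, a ticket computed over the
decimal block ID only, and numeric key IDs. These details, and all
names in the code, are assumptions made for this illustration and do
not describe the actual PlasmaFS wire format.
&#60;/p&#62;&#60;pre&#62;
```python
import hashlib
import hmac
import secrets


def new_key() -> bytes:
    """Name node side: create a fresh random key."""
    return secrets.token_bytes(32)


def make_ticket(key: bytes, block_id: int) -> bytes:
    """Name node side: derive the ticket a client gets for one block."""
    message = str(block_id).encode("ascii")
    return hmac.new(key, message, hashlib.sha256).digest()


class DataNode:
    """Data node side: knows the shared keys and verifies tickets locally."""

    def __init__(self) -> None:
        self.keys = {}  # key ID -> key bytes, filled by the name node

    def install_key(self, key_id: int, key: bytes) -> None:
        self.keys[key_id] = key

    def revoke_key(self, key_id: int) -> None:
        self.keys.pop(key_id, None)

    def check(self, key_id: int, block_id: int, ticket: bytes) -> bool:
        key = self.keys.get(key_id)
        if key is None:
            return False  # unknown or revoked key
        # Recompute the ticket and compare in constant time.
        return hmac.compare_digest(make_ticket(key, block_id), ticket)


key = new_key()
node = DataNode()
node.install_key(1, key)
ticket = make_ticket(key, 42)  # the client never sees the key itself
granted = node.check(1, 42, ticket)
```
&#60;/pre&#62;&#60;p&#62;
The data node needs no call to the name node for the check. A ticket
for another block ID, or one derived from a revoked key, is simply
rejected.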


&#60;/p&#62;&#60;h2&#62;Other quality-assuring features&#60;/h2&#62;

&#60;p&#62;
PlasmaFS implements the POSIX file semantics almost completely. This
includes the possibility of modifying data (or better, replacing
blocks by newer versions, which is not possible in other DFS
implementations), the handling of deleted files, and the exclusive
creation of new files. There are a few exceptions, though: neither
the link count nor the last access time of files is maintained.
Also, lockf-style locks are not yet available.

&#60;/p&#62;&#60;p&#62;
For supporting map/reduce and other distributed algorithm schemes,
PlasmaFS offers locality functions. In particular, one can find out
on which nodes a data block is actually stored, and one can also
request that a new data block be stored on a certain node (if possible).

&#60;/p&#62;&#60;p&#62;
The PlasmaFS client protocol is based on SunRPC. This protocol has quite
good support on the system level, and it supports strong
authentication and encryption via the GSS-API extension (which is
actually used by PlasmaFS, together with the SCRAM-SHA1 mechanism). I
know that younger developers consider it outdated, but even the
Facebook generation must accept that it can keep up with the
requirements of today, and that it includes features that more modern
protocols do not provide (like UDP transport and GSS-API). For the
quality of the code it is important that modifying the SunRPC layer is
easy (e.g. adding a new procedure or changing an existing one), and does not imply
much coding. Because of this, the PlasmaFS
protocol could be kept quite clean on the one hand, while still being adequately
expressive on the other hand to support complex transactions.

&#60;/p&#62;&#60;p&#62;
PlasmaFS is accessible from many environments. Applications can access
it via the mentioned SunRPC protocol (with all features), but also
via NFS, and via a command-line client. In the future, WebDAV support
will also be provided (which is an extension of HTTP, and which will
ensure easy access from many programming environments).

&#60;/p&#62;&#60;h2&#62;Check Plasma out&#60;/h2&#62;

The &#60;a href=&#34;http://plasma.camlcity.org&#34;&#62;Plasma homepage&#60;/a&#62; provides
a lot of documentation, and especially downloads. Also take a look at
the &#60;a href=&#34;http://plasma.camlcity.org/plasma/perf.html&#34;&#62;performance
page&#60;/a&#62;, describing a few tests I recently ran.




&#60;/cc-field&#62;
&#60;/p&#62;
&#60;/div&#62;

&#60;div&#62;
  
&#60;/div&#62;

&#60;div&#62;
  Gerd Stolpmann works as O&#38;#39;Caml consultant.
&#60;a href=&#34;search1.html&#34;&#62;Currently looking for new jobs as consultant!&#60;/a&#62;

&#60;/div&#62;

&#60;div&#62;
  
&#60;/div&#62;


          </description>
        </item>
      
  </channel>
</rss>
