BLOG ON CAMLCITY.ORG: OMake
Faster builds with omake, part 2: Linux - by Gerd Stolpmann, 2015-06-19
While analyzing the performance characteristics of OMake, I found that it was not using the features of the OS in an optimal way. In particular, the fork() system call can be very expensive, and by avoiding it the speed of OMake could be improved dramatically. This is the biggest contribution to the performance optimizations that allow OMake to run roughly twice as fast on Linux (see part 1 for numbers).
The traditional way of starting commands is the fork/exec combination: the fork() system call creates an almost identical copy of the process, and in this copy the exec() call starts the command. This has a number of logical advantages, namely that you can run code between fork() and exec() that modifies the environment for the new command. Most commonly, the file descriptors 0, 1, and 2 are reassigned, as is required for creating pipelines. You can also do other things, e.g. change the working directory.
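To make this concrete, here is a minimal sketch in C (an illustration of the general pattern, not OMake's actual code) of how a two-command pipeline such as ls | wc -l is traditionally set up: each child rewires its standard file descriptors with dup2() between fork() and exec().

/* classic fork/exec: run "ls | wc -l" */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int fd[2];
    if (pipe(fd) == -1) { perror("pipe"); return 1; }

    pid_t p1 = fork();
    if (p1 == 0) {                       /* first child: "ls" */
        dup2(fd[1], 1);                  /* stdout -> write end of the pipe */
        close(fd[0]); close(fd[1]);
        /* other adjustments could happen here, e.g. chdir("/tmp"); */
        execlp("ls", "ls", (char *) NULL);
        _exit(127);                      /* only reached if exec fails */
    }

    pid_t p2 = fork();
    if (p2 == 0) {                       /* second child: "wc -l" */
        dup2(fd[0], 0);                  /* stdin <- read end of the pipe */
        close(fd[0]); close(fd[1]);
        execlp("wc", "wc", "-l", (char *) NULL);
        _exit(127);
    }

    close(fd[0]); close(fd[1]);          /* parent keeps neither end open */
    waitpid(p1, NULL, 0);
    waitpid(p2, NULL, 0);
    return 0;
}

Everything between fork() and exec() runs in the copied process image, which is exactly what makes this pattern both flexible and expensive.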
The whole problem with this is that it is slow. Even for a modern OS like Linux, fork() includes a number of expensive operations. Although actually copying the memory can be avoided, the new address space must be set up by duplicating the page table, and this gets more expensive the bigger the address space is. Also, memory must be set aside even if it is not immediately used. The entries for all file mappings must be duplicated (and every linked-in shared library needs such mappings). The point is that none of these actions are really needed, because at exec() time the whole process image is replaced by a different one anyway.
In my performance tests I measured that forking a 450 MB process image takes around 10 ms. In the n=8 test, two commands are needed for compiling each of the 4096 modules (ocamldep.opt and ocamlopt.opt), i.e. 8192 forks in total. At 10 ms per fork, the time for forking alone sums up to roughly 80 seconds. Even worse, this dramatically limits the benefit of parallelizing the build, because this time is always spent in the main process.
The POSIX standard includes an alternative way of starting commands, the posix_spawn() call. It was originally developed for small systems without virtual memory, where it is difficult to implement fork() efficiently. However, because of the mentioned problems of the fork/exec combination it was quickly picked up by all current POSIX systems. The posix_spawn() call takes a potentially long list of parameters that describes all the actions to be done between fork() and exec(). This gives the implementer all freedom to exploit low-level features of the OS for speeding up the call. Some OSes, e.g. Mac OS X, even implement posix_spawn directly as a system call.
On Linux, posix_spawn is a library function of glibc. By default, however, it is of no real help because it uses fork/exec internally (being very conservative). If you pass the flag POSIX_SPAWN_USEVFORK, though, it switches to a fast alternative implementation. I was pointed (by Török Edwin) to a few emails showing that the quality of the glibc implementation is not yet optimal. In particular, there are weaknesses in signal handling and in thread cancellation. Fortunately, these weaknesses do not matter for this application (signals are not actively used, and on Linux OMake is single-threaded).
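For comparison, here is a hedged sketch of the same style of redirection done with posix_spawn(): the file actions describe what would otherwise be coded between fork() and exec(), and on glibc the (non-portable) POSIX_SPAWN_USEVFORK flag selects the fast implementation. The output file name is just a placeholder for the example.

/* posix_spawn with file actions; _GNU_SOURCE is needed for POSIX_SPAWN_USEVFORK */
#define _GNU_SOURCE
#include <spawn.h>
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <sys/wait.h>

extern char **environ;

int main(void)
{
    posix_spawn_file_actions_t fa;
    posix_spawnattr_t attr;
    pid_t pid;
    char *argv[] = { "ls", "-l", NULL };

    posix_spawn_file_actions_init(&fa);
    /* redirect the child's stdout to a file (placeholder path) */
    posix_spawn_file_actions_addopen(&fa, 1, "/tmp/ls.out",
                                     O_WRONLY | O_CREAT | O_TRUNC, 0644);

    posix_spawnattr_init(&attr);
    posix_spawnattr_setflags(&attr, POSIX_SPAWN_USEVFORK);  /* glibc extension */

    int rc = posix_spawn(&pid, "/bin/ls", &fa, &attr, argv, environ);
    if (rc != 0) {
        /* posix_spawn reports errors via its return value, not errno */
        fprintf(stderr, "posix_spawn: %s\n", strerror(rc));
        return 1;
    }
    waitpid(pid, NULL, 0);

    posix_spawn_file_actions_destroy(&fa);
    posix_spawnattr_destroy(&attr);
    return 0;
}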
Note that I developed the wrapper for posix_spawn years ago for OCamlnet, where it is still used. So, if you want to try out the speed advantage yourself, just use OCamlnet's Shell library for starting commands.
It turned out that there is another application of fork() in OMake. When creating pipelines, it is sometimes required that the OMake process forks itself, namely when one of the commands of the pipeline is implemented in the OMake language. This is somewhat expected, as the parts of a pipeline need to run concurrently. However, this feature turned out to be a little bit in the way, because the default build rules used it. In particular, there is the pipeline
$(OCAMLFIND) $(OCAMLDEP) ... -modules $(src_file) | ocamldep-postproc
which is started for scanning OCaml modules. While the first command, $(OCAMLFIND), is a normal external command, the second command, ocamldep-postproc, is written in the OMake language.
Forking for creating pipelines is even more expensive than the fork/exec combination discussed above, because here memory really has to be copied. I could finally avoid this fork() with some trickery in the command starter. When the pipeline is used for scanning, and the command in question is the last one in the pipeline (as in the pipeline above), a workaround is activated that writes the data to a temporary file, as if the pipeline read
$(OCAMLFIND) $(OCAMLDEP) ... -modules $(src_file) >$(tmpfile);
ocamldep-postproc <$(tmpfile)
(NB. You can actually program this in the OMake language, too. However, that does not solve the problem, because sequences of commands $(cmd1); $(cmd2) also require forking the process. Hence, I had to find a solution deeper in the OMake internals.)
There is one drawback, though: the latency of the pipeline increases because the commands are now run sequentially rather than in parallel. The effect is that OMake takes longer for a j=1 build even though it consumes less CPU. A number of further improvements compensate for this.
The Windows port of OMake is not affected by the fork problems. For starting commands, an optimized technique similar to posix_spawn() is used anyway. For pipelines and other internal uses of fork(), the Windows port uses threads. (Side note: you may ask why we don't use threads on Linux, too. There are a couple of reasons. First, the emulation of the process environment with threads is probably not quite as stable as the original using real processes. Second, there are difficult interoperability problems between threads and signals (something that does not exist on Windows). Finally, it would not save us from maintaining the code branch using real processes and fork(), because OCaml does not support multi-threading on all POSIX systems. Of course, this does not mean we cannot implement it as an optional feature, and probably this will be done at some point in the future.)
The trick of using temporary files for speeding up pipelines is not enabled on Windows. Here, it is more important to get the benefits of parallelization that the real pipeline allows.