<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Blog on camlcity.org</title>
    <link>http://blog.camlcity.org</link>
    <language>en</language>
    <description>Articles by Gerd Stolpmann about O'Caml</description>

    
        <item>
          <title>Plasma Map/Reduce Slightly Faster Than Hadoop</title>
          <guid>http://blog.camlcity.org/blog/plasma6.html</guid>
          <link>http://blog.camlcity.org/blog/plasma6.html</link>
          <description>

&#60;div&#62;
  &#60;b&#62;A performance test&#60;/b&#62;&#60;br/&#62;&#38;#160;
&#60;/div&#62;

&#60;div&#62;
  
Last week I spent some time running map/reduce jobs on Amazon EC2.
In particular, I compared the performance of Plasma, my own map/reduce
implementation, with Hadoop. I just wanted to know how far my implementation
was behind the most popular map/reduce framework. The surprise, however, was
that Plasma turned out to be slightly faster in this setup.

&#60;/div&#62;

&#60;div&#62;
  
&#60;div style=&#34;float:right; width: 50ex; font-size:small; color:grey; border: 1px solid grey; padding: 1ex; margin-left: 2ex&#34;&#62;
This article is also available in other languages:
&#60;dl&#62;
&#60;dt&#62;&#60;a href=&#34;http://science.webhostinggeeks.com/plasma-map-reduce&#34;&#62;[Serbo-Croatian]&#60;/a&#62;
&#60;/dt&#62;&#60;dd&#62;translation by Anja Skrba from 
&#60;a href=&#34;http://webhostinggeeks.com/&#34;&#62;Webhostinggeeks.com&#60;/a&#62;
&#60;/dd&#62;&#60;/dl&#62;
&#60;/div&#62;
&#60;p&#62;
I would not call this test a &#38;#34;benchmark&#38;#34;. Amazon EC2 is not a
controlled environment, as you always only get partial machines, and
you don&#38;#39;t know how many resources are consumed by other users on the
same machines.  Also, you cannot be sure how far apart the nodes are
in the network. Finally, there are some special
effects coming from the virtualization technology; in particular, the
first write of a disk block is slower (roughly half the normal speed)
than subsequent writes.  However, EC2 is good enough to get an
impression of the speed, and one can hope that all the test runs
get the same handicap on average.

&#60;/p&#62;&#60;p&#62;
The task was to sort 100G of data, given in 10 files. Each line has
100 bytes, divided into a key of 8 bytes, a TAB character, 90 random
bytes as value, and an LF character. The key was randomly chosen from
65536 possible values. This means that there were lots of lines with
the same key - a scenario that I think is more typical of map/reduce
than unique keys. The output is partitioned into 80 sets.
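&#60;/p&#62;&#60;p&#62;
To make the record layout concrete, here is a minimal Python sketch that produces one such line. How the 8-byte key was encoded and which byte values the 90-byte value contained are not stated above, so the zero-padded decimal key and the alphanumeric value are assumptions, as is the helper name make_record.
&#60;/p&#62;

```python
import random
import string

KEY_SPACE = 65536  # number of distinct keys, as stated in the text
# Assumption: the value uses only letters and digits, so it never contains TAB or LF.
ALPHABET = string.ascii_letters + string.digits


def make_record(rng):
    """Return one 100-byte line: 8-byte key, TAB, 90-byte value, LF."""
    # Assumption: the key is written as a zero-padded decimal number.
    key = f"{rng.randrange(KEY_SPACE):08d}"
    value = "".join(rng.choice(ALPHABET) for _ in range(90))
    return f"{key}\t{value}\n".encode("ascii")


record = make_record(random.Random(0))
print(len(record))  # 8 + 1 + 90 + 1 bytes
```

&#60;p&#62;
Calling make_record repeatedly and writing the results to ten files would yield input of the shape described; the original test data was not necessarily generated this way.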

&#60;/p&#62;&#60;p&#62;
I allocated 1 larger node (m1-xlarge) with 4 virtual cores and 15G of
RAM acting as combined name- and datanode, and 9 smaller nodes
(m1-large) with 2 virtual cores and 7.5G of RAM for the other
datanodes. Each node had access to two virtual disks that were
configured as a RAID-0 array. The speed for sequential reading or
writing was around 160 MB/s for the array (but only 80 MB/s the
first time blocks were written). Apparently, the nodes had Gigabit
network cards (the maximum transfer speed was around 119 MB/s).

&#60;/p&#62;&#60;p&#62;
During the tests, I monitored the system activity with the sar utility.
I observed significant cycle stealing (meaning that a virtual core is
blocked because there is no free real core), often reaching values of
25%. This could be interpreted as overdriving the available resources,
but another explanation is that the hypervisor needed this time for
itself. In any case, this effect also calls the reliability of this
test into question.

&#60;/p&#62;&#60;h2&#62;The contenders&#60;/h2&#62;

&#60;p&#62;
Hadoop is the top dog in the map/reduce scene. In this test, the
version from Cloudera 0.20.2-cdh3u2 was used, which contains more than
1000 patches against the vanilla 0.20.2 version. Written in Java, it
needs a JVM at runtime, which was here IcedTea 1.9.10 distributing
OpenJDK 1.6.0_20. I did not do any tuning, hoping that the configuration
would be ok for a small job. The HDFS block size was 64M, without
replication.

&#60;/p&#62;&#60;p&#62;
The contender is Plasma Map/Reduce. I started this project two years
ago in my spare time. It is not a clone of the Hadoop architecture,
but includes many new ideas. In particular, a lot of work went into
the distributed filesystem PlasmaFS which features an almost complete
set of file operations, and controls the disk layout directly. The
map/reduce algorithm uses a slightly different scheme which tries
to delay the partitioning of the data to get larger intermediate files.
Plasma is implemented in OCaml, which isn&#38;#39;t VM-based but compiles
the code directly to assembly language. In this test, the blocksize
was 1M (Plasma is designed for smaller-sized blocks). The software
version of Plasma is roughly 0.6 (a few svn revisions before the release
of 0.6).

&#60;/p&#62;&#60;h2&#62;Results&#60;/h2&#62;

&#60;p&#62;The runtimes:

&#60;/p&#62;&#60;p&#62;
&#60;/p&#62;&#60;table&#62;
  &#60;tr&#62;
    &#60;td&#62;&#60;b&#62;Hadoop:&#60;/b&#62;&#60;/td&#62;     &#60;td&#62;&#60;b&#62;2265 seconds&#60;/b&#62; (37 min, 45 s)&#60;/td&#62;
  &#60;/tr&#62;
  &#60;tr&#62;
    &#60;td&#62;&#60;b&#62;Plasma:&#60;/b&#62;&#60;/td&#62;     &#60;td&#62;&#60;b&#62;1975 seconds&#60;/b&#62; (32 min, 55 s)&#60;/td&#62;
  &#60;/tr&#62;
&#60;/table&#62;

&#60;p&#62;
Given the uncertainty of the environment, this is no big difference.
But let&#38;#39;s have a closer look at the system activity to get an idea
why Plasma is a bit faster.

&#60;/p&#62;&#60;h2&#62;CPU&#60;/h2&#62;

For the following, I simply picked one of the datanodes and created
diagrams (with kSar):

&#60;p&#62;
&#60;img src=&#34;/files/img/blog/edited_hadoop_cpu_all.png&#34; width=&#34;799&#34; height=&#34;472&#34;/&#62;

&#60;/p&#62;&#60;p&#62;
&#60;img src=&#34;/files/img/blog/edited_plasma_cpu_all.png&#34; width=&#34;800&#34; height=&#34;471&#34;/&#62;

&#60;/p&#62;&#60;p&#62;
Note that kSar does not draw graphs for %iowait and %steal, although
sar records these data. This explains why the sum of
user, system and idle is not 100%.

&#60;/p&#62;&#60;p&#62;
What we see here is that Hadoop consumes all CPU cycles, whereas
Plasma leaves around 1/3 of the CPU capacity unused. Given that
this kind of job is normally I/O-bound, it just means that Hadoop
is more CPU-hungry and would have benefited from more cores
in this test.

&#60;/p&#62;&#60;h2&#62;Network&#60;/h2&#62;

In this diagram, reads are blue and red, whereas writes are green and
black. The first curve shows packets per second, and the second bytes
per second:

&#60;p&#62;
&#60;img src=&#34;/files/img/blog/edited_hadoop_eth0.png&#34; width=&#34;800&#34; height=&#34;333&#34;/&#62;

&#60;/p&#62;&#60;p&#62;
&#60;img src=&#34;/files/img/blog/edited_plasma_eth0.png&#34; width=&#34;800&#34; height=&#34;319&#34;/&#62;

Summing reads and writes up, Hadoop uses only around 7MB/s on average
whereas Plasma transmits around 25MB/s, more than three times as
much. There could be two explanations:

&#60;/p&#62;&#60;ul&#62;
  &#60;li&#62;Because Hadoop is short of CPU power here, it stays below its
      potential
  &#60;/li&#62;&#60;li&#62;The Hadoop scheme is more optimized for keeping the network
      bandwidth as low as possible
&#60;/li&#62;&#60;/ul&#62;

The background for the second point is the following: Because Hadoop
partitions the data immediately after mapping and sorting, the data
has (ideally) only to cross the network once.  This is different in
Plasma - which generally partitions the data iteratively. In this
setup, after mapping and sorting only 4 partitions are created, which
are further refined in the following split-and-merge rounds.  As we
have here 80 partitions in total, there is at least one further step
in which data partitioning is refined, meaning that the data has to
cross the network roughly twice. This already explains 2/3 of the
observed difference.  (As a side note, one can configure how many
partitions are initially created after mapping and sorting, and it
would have been possible to mimic Hadoop&#38;#39;s scheme by setting this
value to 80.)

&#60;h2&#62;Disks&#60;/h2&#62;

These diagrams depict the disk reads and writes in KB/second:

&#60;p&#62;
&#60;img src=&#34;/files/img/blog/edited_hadoop_md0.png&#34; width=&#34;800&#34; height=&#34;332&#34;/&#62;

&#60;/p&#62;&#60;p&#62;
&#60;img src=&#34;/files/img/blog/edited_plasma_md0.png&#34; width=&#34;800&#34; height=&#34;332&#34;/&#62;

The average numbers are (directly taken from sar):

&#60;/p&#62;&#60;p&#62;
&#60;/p&#62;&#60;table&#62;
  &#60;tr&#62;
    &#60;td&#62;&#38;#160;&#60;/td&#62;
    &#60;th&#62;Hadoop&#60;/th&#62;
    &#60;th&#62;Plasma&#60;/th&#62;
  &#60;/tr&#62;
  &#60;tr&#62;
    &#60;td&#62;Read/s:&#60;/td&#62;
    &#60;td&#62;17.6 MB/s&#60;/td&#62;
    &#60;td&#62;31.2 MB/s&#60;/td&#62;
  &#60;/tr&#62;
  &#60;tr&#62;
    &#60;td&#62;Write/s:&#60;/td&#62;
    &#60;td&#62;30.8 MB/s&#60;/td&#62;
    &#60;td&#62;33.9 MB/s&#60;/td&#62;
  &#60;/tr&#62;
&#60;/table&#62;

&#60;p&#62;
Plasma evidently reads around twice as much data from disk as
Hadoop, whereas the write rate is about the same. Apart from this, it
is interesting that the shapes of the curves are quite different:
Hadoop has a period of high disk activity at the end of the job (when
it is busy merging data), whereas Plasma utilizes the disks better
during the first third of the job.

&#60;/p&#62;&#60;h2&#62;Plausibility&#60;/h2&#62;

&#60;p&#62;
Neither contender made the best use of the I/O resources at all
times. Part of the difficulty of developing a map/reduce scheme is
balancing the load put onto the disks and onto the network.
It is not good when, e.g., the disks are used to 100% at a
certain point while the network is underutilized, but during the next
period the network is at 100% and the disks are not fully used. A balanced
distribution of the load achieves higher throughput in total.

&#60;/p&#62;&#60;p&#62;
Let&#38;#39;s analyze the Plasma scheme in a bit more detail. The data set of
100G (which does not change in volume during the processing) is copied
four times in total: once in the map-and-sort phase, and three times
in the reduce phase (for this volume Plasma needs three merging
rounds). This means we have to transfer 4 * 100G of data in total, or
40G of data per node (remember we have 10 nodes). We ran 22 cores for
1975 seconds, which gives a capacity of 43450 CPU seconds. Plasma
tells us in its reports that it used 3822 CPU seconds for in-RAM
sorting, which we should subtract for analyzing the I/O
throughput. Per core these are 173 seconds. This means each node had
1975-173 = 1802 seconds for handling the 40G of data. This makes
around 22 MB per second on each node.

&#60;/p&#62;&#60;p&#62;
The Hadoop scheme differs mostly in that the data is only copied twice
in the merge phase (because Hadoop by default merges more files in
one round than Plasma). However, because of its design there is an
extra copy at the end of the reduce phase (from disk to HDFS).  This
means Hadoop also solves the same job by transferring 4 * 100G of data.
There is no counter for measuring the time spent for in-RAM sorting.
Let&#38;#39;s assume this time is also around 3800 seconds, i.e. roughly
175 seconds per core. This means each
node had 2265 - 175 = 2090 seconds for handling 40G of data, or
19 MB per second on each node.
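&#60;/p&#62;&#60;p&#62;
The two estimates above follow the same arithmetic, which the following Python sketch reproduces. It assumes decimal units (1G = 1000 MB) and spreads the in-RAM sorting time evenly over all 22 cores; the function name and its parameters are illustrative only.
&#60;/p&#62;

```python
def per_node_throughput_mb_s(data_gb, copies, nodes, runtime_s, sort_cpu_s, cores):
    """Estimate the average I/O throughput per node in MB/s."""
    # Volume each node has to move: total volume times number of copies, split over nodes.
    per_node_mb = data_gb * copies / nodes * 1000  # assumption: 1G = 1000 MB
    # Wall-clock time lost to in-RAM sorting, assuming it is spread evenly over all cores.
    sort_s_per_core = sort_cpu_s / cores
    io_seconds = runtime_s - sort_s_per_core
    return per_node_mb / io_seconds


CORES = 4 + 9 * 2  # one m1-xlarge with 4 cores, nine m1-large with 2 cores each

plasma = per_node_throughput_mb_s(100, 4, 10, 1975, 3822, CORES)
# The 3800 CPU seconds for Hadoop are the assumption made in the text, not a measurement.
hadoop = per_node_throughput_mb_s(100, 4, 10, 2265, 3800, CORES)

print(f"Plasma: {plasma:.1f} MB/s per node")
print(f"Hadoop: {hadoop:.1f} MB/s per node")
```

&#60;p&#62;
With the figures from the text this gives roughly 22 MB/s per node for Plasma and 19 MB/s for Hadoop, both far below the 160 MB/s the disk array can deliver sequentially.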

&#60;/p&#62;&#60;h2&#62;Conclusion&#60;/h2&#62;

&#60;p&#62;
It looks very much as if both implementations are slowed down by
specifics of the EC2 environment. Especially the disk I/O, probably
the essential bottleneck here, is far below what one can expect.
Plasma probably won because it uses the CPU more efficiently, whereas
other aspects like network utilization are better handled by Hadoop.

&#60;/p&#62;&#60;p&#62;
For my project this result just means that it is on the right track.
In particular, this small setup (only 10 nodes) is easily handled, which
suggests that Plasma scales at least to a small multiple of
this size. The bottleneck would then be the namenode, but there is still a
lot of headroom.

&#60;/p&#62;&#60;h2&#62;Where to get Plasma&#60;/h2&#62;

&#60;p&#62;Plasma Map/Reduce and PlasmaFS are bundled together in one download. Here is the
&#60;a href=&#34;http://projects.camlcity.org/projects/plasma.html&#34;&#62;project page&#60;/a&#62;.

&#60;/p&#62;&#60;p&#62;


&#60;/p&#62;
&#60;/div&#62;

&#60;div&#62;
  Gerd Stolpmann works as an O&#38;#39;Caml consultant

&#60;/div&#62;

&#60;div&#62;
  
&#60;/div&#62;


          </description>
        </item>
      
        <item>
          <title>After NoSQL there will be NoServer</title>
          <guid>http://blog.camlcity.org/blog/plasma5.html</guid>
          <link>http://blog.camlcity.org/blog/plasma5.html</link>
          <description>

&#60;div&#62;
  &#60;b&#62;An experiment, and a vision&#60;/b&#62;&#60;br/&#62;&#38;#160;
&#60;/div&#62;

&#60;div&#62;
  
&#60;p&#62;
The recent success of NoSQL technologies is due not only to the
fact that they take advantage of distribution and replication, but
even more to the &#38;#34;middleware effect&#38;#34;: these features have become
relatively easy to use.  It is no longer necessary to be an expert
in these cluster techniques in order to profit from them. Let&#38;#39;s think
a bit ahead: what could a platform look like that makes distributed
programming even easier, and that integrates several styles of storing
data and managing computations?

&#60;cc-field name=&#34;maintext&#34;&#62;
&#60;p&#62;
The starting point for this exploration is a recent experience I had
with my own attempt in the NoSQL arena,
the &#60;a href=&#34;http://plasma.camlcity.org&#34;&#62;Plasma project&#60;/a&#62;. Two weeks
ago, it was &#38;#34;only&#38;#34; a distributed, replicating, and failure-resilient
filesystem PlasmaFS, with its own map/reduce implementation on top of
it. Then I had an idea: is it possible to develop a key/value database
on top of this filesystem? Which features, and relative
advantages/disadvantages would it have? In other words, I was
examining whether the existing platform makes it simpler to develop
a database with a reasonable feature set.

&#60;/p&#62;&#60;p&#62;
When we talk about clusters, I especially have Internet applications
in mind that are bombarded with user requests, but that also
have to do a lot of background processing.


&#60;/p&#62;&#60;h2&#62;The key/value database needed less than 2000 lines of code&#60;/h2&#62;

&#60;p&#62;
Now, PlasmaFS does not follow the simple pattern of HDFS, but is based
on a transactional core, and it even allows users to manage the
transactions. For example, it is possible to rename a bunch of files
atomically by just wrapping the rename operations into a single
transaction.  The transactional support goes even further: When
reading from a file one can activate a special snapshot mode, which
just means that the reader&#38;#39;s view of the file is isolated from any
writes happening at the same time.

&#60;/p&#62;&#60;p&#62;
These are clearly advanced features, and the question was whether they
would help in writing a key/value database library. And yes, they were
extremely helpful - in less than 2000 lines of code this library
provides data distribution and replication, a high degree of data
safety, almost unlimited scalability for database reads, and
reasonable performance for writes. Of course, most of these features
are just &#38;#34;inherited&#38;#34; from PlasmaFS, and the library just had to
implement the file format (i.e. a B tree,
see &#60;a href=&#34;http://projects.camlcity.org/projects/dl/plasma-0.5/doc/html/Plasmakv_intro.html&#34;&#62;
this page for details&#60;/a&#62;). This is not cheating, but exactly the
point: the platform makes it easy to provide features that would
otherwise be extremely complicated to provide.

&#60;/p&#62;&#60;h2&#62;NoServer&#60;/h2&#62;

&#60;p&#62;
This key/value database is just a library, and one can use it only
on machines where PlasmaFS is deployed. Of course it is possible to
access the same database file from several machines - PlasmaFS handles
all the networking involved with it. The point is that during the
implementation of the library this never had to be taken into account.
There is no networking code in this library, and this is why it is
the first example of the new NoServer paradigm - not only server.

&#60;/p&#62;&#60;p&#62;
The genuine advantage of this paradigm is that it enables developers
to write code they never would be able to create without the help of
the platform. This is a bit comparable to the current situation for
SQL databases: Everybody can store data in them, even over the
network, without needing to have any clue how this works in detail.
In the NoServer paradigm, we just go one step further, because the
services provided by the platform are a lot more low-level, and the
developer has a lot more freedom. Instead of using a query language,
the developer accesses shared resources with normal file operations,
extended by transactional directives. The hope is that this makes
a lot of server programming superfluous, especially the difficult
parts of it (e.g. what to do when a machine crashes).

&#60;/p&#62;&#60;p&#62;
A simple key/value database is obviously not difficult to create with
these programming means. The interesting question is what else can be
done with it in a cluster environment. Obviously, having a common
filesystem on all machines of the cluster makes a lot of the file copying
superfluous that a normal cluster would do with rsync and/or
ssh. PlasmaFS can even be directly mounted (although the transactional
features are unavailable then), so even applications that have not
been specially ported to it can access PlasmaFS files.  An example
would be a read-only Lucene search index residing in PlasmaFS.
Replacing the index by an updated one would be done by simply moving
the new index into the right directory, and signalling Lucene that it
has to re-open the index.

&#60;/p&#62;&#60;p&#62;
Up to this point, Plasma is implemented and works well (I just released
version 0.5, which is now of beta quality). The vision, of course, goes
beyond that.

&#60;/p&#62;&#60;h2&#62;What the platform also needs&#60;/h2&#62;

&#60;p&#62;
There are a number of further data structures that can obviously be
represented well in files, such as hash tables or queues. Let&#38;#39;s explore
the latter in a bit more detail: what would a queue manager look like?
There are a few data representation options. For example, every queue
element could be a file in a directory, or a container format could be
established to which elements can be appended. PlasmaFS also
allows cutting arbitrary holes into files, so it is even possible to
physically remove elements from the beginning of the queue file by
just removing the data blocks storing those elements from the file.  As
we don&#38;#39;t want to run the queue manager as a server, but just as a library
inside any program accessing the queue, the question is how event
notifications are handled (which would be obvious in a server context).
Usually, one has to notify some follow-up processor when new elements
have been added to the queue. Plasma currently does not include a
method for doing this, so the platform needs to be extended with a
notification framework (which should not be too difficult).

&#60;/p&#62;&#60;p&#62;
An important question is also how programs running on
different nodes are activated. In my vision there would be a central task execution
manager. Of course, this manager is normal client/server middleware.
Again, the point here is that the application developer needs no
special skills for triggering remote activation, but just uses
libraries. I do not have a completely clear picture of this part yet, but
it seems to be necessary to have the option of invoking programs
in the inetd style as well as directly as if started via ssh.
Also, a central directory would be maintained that includes
important data such as which program can be run on which node.

&#60;/p&#62;&#60;h2&#62;We won&#38;#39;t live totally without servers, only with fewer ones&#60;/h2&#62;

&#60;p&#62;
My vision does not mean that servers are banned completely. We will
still need them for special features or data access patterns, and of
course for interaction with other systems.  For example, PlasmaFS is
bad at coordinating concurrent write accesses to the same file. Also,
PlasmaFS employs a central namenode with only limited capacity. So,
if you are doing OLTP processing, a normal SQL database will still do
better. If you need extraordinary write performance, but can pay the
price of weakened consistency guarantees, a system like Cassandra will
work better.

&#60;/p&#62;&#60;p&#62;
Nevertheless, there is the big field of &#38;#34;average deployments&#38;#34; where
the number of nodes is not too big and the performance requirements
are not too special, but the ACID guarantees PlasmaFS gives are
essential. For this field, the NoServer paradigm could be the ideal
choice to reduce the development overhead dramatically.

&#60;/p&#62;&#60;h2&#62;Check Plasma out&#60;/h2&#62;

The &#60;a href=&#34;http://plasma.camlcity.org&#34;&#62;Plasma homepage&#60;/a&#62; provides
a lot of documentation, and especially downloads. Also take a look at
the &#60;a href=&#34;http://plasma.camlcity.org/plasma/perf.html&#34;&#62;performance
page&#60;/a&#62;, describing a few tests I recently ran.




&#60;/cc-field&#62;
&#60;/p&#62;
&#60;/div&#62;

&#60;div&#62;
  
&#60;/div&#62;

&#60;div&#62;
  Gerd Stolpmann works as an O&#38;#39;Caml consultant.
&#60;a href=&#34;search1.html&#34;&#62;Currently looking for new jobs as consultant!&#60;/a&#62;

&#60;/div&#62;

&#60;div&#62;
  
&#60;/div&#62;


          </description>
        </item>
      
        <item>
          <title>PlasmaFS</title>
          <guid>http://blog.camlcity.org/blog/plasma4.html</guid>
          <link>http://blog.camlcity.org/blog/plasma4.html</link>
          <description>

&#60;div&#62;
  &#60;b&#62;A serious distributed filesystem&#60;/b&#62;&#60;br/&#62;&#38;#160;
&#60;/div&#62;

&#60;div&#62;
  
&#60;p&#62;
A few days ago, I
released &#60;a href=&#34;http://plasma.camlcity.org&#34;&#62;Plasma-0.4.1&#60;/a&#62;.  This
article gives an overview of its filesystem subsystem, which
is actually the more important part. PlasmaFS differs in many respects
from popular distributed filesystems like HDFS, starting right at
the beginning with the requirements analysis.

&#60;cc-field name=&#34;maintext&#34;&#62;
&#60;p&#62;
A distributed filesystem (DFS) makes it possible to store giant amounts of
data.  A high number of data nodes (computers with hard disks) can be
attached to a DFS cluster, and usually a second kind of node, called
name node, is used to store metadata, i.e. which files are stored and
where. The key point is that the volume of metadata can be very low
compared to the payload data (the ratios are somewhere between
1:10,000 and 1:1,000,000), so a single name node can manage quite a
large cluster. Also, the clients can contact the data nodes
directly to access payload data - the traffic is not routed via
the name node as in &#38;#34;normal&#38;#34; network filesystems. This allows
enormous bandwidths.

&#60;/p&#62;&#60;p&#62;
The motivation for developing another DFS was that existing
implementations, and especially the popular HDFS, make (in my opinion)
unfortunate compromises to gain speed:

&#60;/p&#62;&#60;ul&#62;
  &#60;li&#62;The metadata is not well protected. Although the metadata is
   saved to disk and usually also replicated to another computer, these 
   &#38;#34;safety copies&#38;#34; lag behind. In the case of an outage, data loss
   is common (HDFS even fails fatally when the disk fills up).
   Given the amount of data, this is not acceptable. It&#38;#39;s like a
   local filesystem without journaling.&#60;br/&#62;&#38;#160;
  &#60;/li&#62;&#60;li&#62;The name node protocol is too simplistic, and because of this,
   DFS implementations need ultra-high-speed name node implementations
   (at least several tens of thousands of operations per second) to manage larger clusters.
   Another consequence is that only large block sizes (several megabytes)
   promise decent access speeds, because this is the only implemented
   strategy to reduce the frequency of name node operations.&#60;br/&#62;&#38;#160;
  &#60;/li&#62;&#60;li&#62;Unless you can physically separate the cluster from the rest
    of the network, security is a requirement. It is difficult to provide,
    however, mainly because the data nodes are accessed independently, and you
    want to avoid having the data nodes continuously check
    access permissions. So the compromise is to leave this out in the
    DFS, and rely on complicated and error-prone configurations in
    network hardware (routers and gateways).
&#60;/li&#62;&#60;/ul&#62;

&#60;p&#62;
I&#38;#39;m not saying that HDFS is a bad implementation. My point is only that
there is an alternative where safety and security are taken more
seriously, and that there are other ways to get high speed than those
that are implemented in HDFS.

&#60;/p&#62;&#60;h2&#62;Using SSDs for transacted metadata stores&#60;/h2&#62;

PlasmaFS starts at a different point. It uses a data store with full
transactional support (right now this is PostgreSQL, just for
development simplicity, but other, more lightweight systems could
also fill this role). This includes:

&#60;ul&#62;
  &#60;li&#62;Data are made persistent in a way so that full ACID support
    is guaranteed (remember, the ACID properties are atomicity,
    consistency, isolation, and durability).
  &#60;/li&#62;&#60;li&#62;For keeping replicas synchronized, we demand support for
    two-phase commit, i.e. that transactions can be prepared before
    the actual commit with the guarantee that the commit is fail-safe
    after preparation. (Essentially, two-phase commit is a protocol
    between two database systems keeping them always consistent.)
&#60;/li&#62;&#60;/ul&#62;

This is, by the way, the established standard way of ensuring
data safety for databases.  It comes with its own problems, and the
most challenging is that commits are relatively slow. The reason for this
is the storage hardware - for normal hard disks the maximum frequency
of commits is a function of the rotation speed. Fortunately, there is
now an alternative: SSDs presently allow several tens of thousands of syncs per
second, which is two orders of magnitude more than classic hard disks
provide. Good SSDs are still expensive, but luckily moderate disk
sizes are already sufficient (with only a 100G database you can
already manage a really giant filesystem).
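&#60;p&#62;
The two-phase commit requirement from the list above can be sketched in a few lines of Python. This shows only the general shape of the protocol under simplifying assumptions (no timeouts, no crash recovery, no persistent log); the Participant class and its methods are hypothetical and are not the interface of PostgreSQL or of PlasmaFS.
&#60;/p&#62;

```python
class Participant:
    """Hypothetical stand-in for one database taking part in the protocol."""

    def __init__(self, name, can_prepare=True):
        self.name = name
        self.can_prepare = can_prepare
        self.state = "idle"

    def prepare(self):
        # Phase 1: a real system would persist everything needed here,
        # so that a later commit can no longer fail.
        self.state = "prepared" if self.can_prepare else "aborted"
        return self.can_prepare

    def commit(self):
        # Phase 2: only reached after every participant has prepared.
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Commit on all participants, or on none of them."""
    # Build a list first so that every participant is asked to prepare.
    votes = [p.prepare() for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False


primary, replica = Participant("primary"), Participant("replica")
print(two_phase_commit([primary, replica]), primary.state, replica.state)
```

&#60;p&#62;
If any participant cannot prepare, all of them roll back, which is what keeps the replicas identical at every commit boundary.
&#60;/p&#62;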

&#60;p&#62;Still, writing each modification directly to the SSD limits the
speed compared to what systems like HDFS can do (because HDFS keeps
the data in RAM, and only writes a copy to disk now and then).  We need
more techniques to address the name node as a potential bottleneck:

&#60;/p&#62;&#60;ul&#62;
  &#60;li&#62;PlasmaFS provides a transactional view to users. This works
    very much like the transactions in SQL. The performance advantage here is that
    several write operations can be carried out with only one commit.
    PlasmaFS takes this so far that an unlimited number of metadata
    operations can be put into a transaction, such as creating and
    deleting files, allocating blocks for the files, and retrieving
    block lists. It is possible to write terabytes of data to files with
    &#60;i&#62;only a single commit&#60;/i&#62;! Applications accessing large files
    sequentially (as, e.g., in the map/reduce framework) can especially
    profit from this scheme.&#60;br/&#62;&#38;#160;
  &#60;/li&#62;&#60;li&#62;PlasmaFS addresses blocks linearly: for each data node the blocks
    are identified by numbers from 0 to n-1. This is safe, because we
    manage the consistency globally (basically, there is a kind of
    join between the table managing which blocks are used or free, and
    the table managing the block lists per file, and our safety
    measures allow us to keep this join consistent). In contrast,
    other DFS use GUIDs to identify blocks. The linear scheme,
    however, allows block lists to be transmitted and stored in a
    compressed way (extent-based). For example, if a file uses the
    blocks 10 to 14 on a data node, this is stored as &#38;#34;10-14&#38;#34;, and not
    as &#38;#34;10,11,12,13,14&#38;#34;. Also, block allocations are always done
    for ranges of blocks. This greatly reduces the number
    of name node operations while only moderately increasing their
    complexity.&#60;br/&#62;&#38;#160;
  &#60;/li&#62;&#60;li&#62;A version number is maintained per file that is
    increased whenever data or metadata are modified. This makes it possible
    to keep external caches up to date with only low overhead: A quick
    check whether the version number has changed is sufficient to
    decide whether the cache needs to be refreshed. This is reliable,
    in contrast to cache consistency schemes that are based only on the
    last modification time. Currently this is used to keep the
    caches of the NFS bridge synchronized. Applications that access
    only a few files randomly profit especially from such caching.
&#60;/li&#62;&#60;/ul&#62;
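&#60;p&#62;
The extent-based encoding of block lists described above can be illustrated with a few lines of Python. This is only a sketch of the idea, not the representation PlasmaFS actually uses; the function names to_extents and format_extents and the textual output format are assumptions.
&#60;/p&#62;

```python
def to_extents(blocks):
    """Collapse block numbers into inclusive (first, last) ranges."""
    extents = []
    for block in sorted(set(blocks)):
        if extents and block == extents[-1][1] + 1:
            extents[-1][1] = block          # extend the current range
        else:
            extents.append([block, block])  # start a new range
    return [tuple(e) for e in extents]


def format_extents(extents):
    """Render ranges as text; single blocks stay plain numbers."""
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in extents)


print(format_extents(to_extents([10, 11, 12, 13, 14])))  # prints 10-14
```

&#60;p&#62;
A file occupying blocks 10 to 14 is thus described by one extent rather than five block numbers, which keeps block lists short for sequentially allocated files.
&#60;/p&#62;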

&#60;p&#62;
I consider the map/reduce part of Plasma an especially good test
case for PlasmaFS. Of course, this map/reduce implementation is
perfectly adapted to PlasmaFS, and uses all possibilities to reduce
the frequency of name node operations. It turns out that a typical
running map/reduce task contacts the name node only every 3-4 seconds,
usually to refill a buffer that has run empty, or to flush a full buffer
to disk. The point here is that a buffer can be larger than a data
block, and that a single name node transaction is sufficient to
handle all blocks in the buffer in one go. The buffers are typically
much larger than a single block, so this reduces the number of
name node operations quite dramatically.  (Important note: This number
(3-4) is only correct for Plasma&#38;#39;s map/reduce implementation, which
uses a modified and more complex algorithm, but it is not
applicable to the scheme used by Hadoop.)

&#60;/p&#62;&#60;h2&#62;Speed&#60;/h2&#62;

&#60;p&#62;
I have done some tests with the latest development version of
Plasma. The peak number of commits per second seems to be around 500
(here, a &#38;#34;commit&#38;#34; is a transaction writing data that can include
several data update operations). This test used a recently bought SSD,
and ran on a quad-core server machine. It was not evident that the SSD
was the bottleneck (one indication is that the test ran only slightly
faster when syncs were turned off), so there is probably still a lot
of room for optimization.

&#60;/p&#62;&#60;p&#62;
Given that a map/reduce task needs the name node only &#38;#8776;0.3 times per second
(once every 3-4 seconds), this &#38;#34;commit speed&#38;#34; would theoretically be sufficient for around
1600 tasks running in parallel. It is likely that other limits are
hit first (e.g. the switching capacity). Anyway, these are encouraging
numbers showing that this young project is not on the wrong track.
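&#60;/p&#62;&#60;p&#62;
The estimate can be checked with a small Python calculation. It assumes that each name node request costs exactly one commit and that a task issues one request every 3 to 4 seconds, as observed for the map/reduce tasks; the function name is illustrative.
&#60;/p&#62;

```python
def max_parallel_tasks(commits_per_second, seconds_between_requests):
    """Number of tasks the name node can serve if each request is one commit."""
    return commits_per_second * seconds_between_requests


PEAK_COMMITS_PER_SECOND = 500  # measured peak from the test above

# One request every 3 to 4 seconds per task gives a lower and an upper bound.
low = max_parallel_tasks(PEAK_COMMITS_PER_SECOND, 3)
high = max_parallel_tasks(PEAK_COMMITS_PER_SECOND, 4)
print(f"between {low} and {high} parallel tasks")
```

&#60;p&#62;
The figure of around 1600 tasks lies inside this range of 1500 to 2000.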

&#60;/p&#62;&#60;p&#62;
The above techniques are already implemented in PlasmaFS. More advanced
options that could be worth an implementation include:

&#60;/p&#62;&#60;ul&#62;
  &#60;li&#62;As we can maintain exact replicas of the primary name node (via
    two-phase commit), it becomes possible to also use the replicas
    for read accesses. For certain types of read operations this is
    non-trivial, though, because they have an effect on the block
    allocation map (essentially we would need to synchronize a certain
    buffer in both the primary and secondary servers that controls
    delayed block deallocation). Nevertheless, this is certainly a
    viable option. Even writes could be handled by the secondary nodes,
    but this tends to become very complicated, and is probably not
    worth it.&#60;br/&#62;&#38;#160;
  &#60;/li&#62;&#60;li&#62;An easier option to increase the capacity is to split the file
    space, so that each name node takes care of a partition only. A
    user transaction would still need a uniform view on the filesystem,
    though. If a name node receives a request for an operation it
    cannot do itself, it automatically extends the scope of the
    transaction to the name node that is responsible for the right
    partition. This scheme would also use the two-phase commit protocol
    for keeping the partitions consistent. I think this option is viable,
    but only at the price of a complex development effort.
&#60;/li&#62;&#60;/ul&#62;

&#60;p&#62;
Given that these two improvements are very complicated to implement,
it is unlikely that they will be done soon. There is still a lot of fruit
hanging on the lower branches of the tree.


&#60;/p&#62;&#60;h2&#62;Delegated access control checks&#60;/h2&#62;

&#60;p&#62;
Let&#38;#39;s quickly discuss another problem, namely how to secure accesses
to data nodes. It is easy to accept that the name nodes can be secured
with classic authentication and authorization schemes in the same
style as they are used for other server software, too. For data nodes,
however, we face the problem that we need to supervise every access to a
data block individually, but want to avoid extra overhead. In particular,
we do not want to check each data access with the name node.

&#60;/p&#62;&#60;p&#62;
PlasmaFS uses a special cryptographic ticket system to avoid
this. Essentially, the name node creates random keys at regular
intervals, and broadcasts these to the data nodes. These keys are
secrets shared by the name and data nodes. The accessing clients get
only HMAC-based tickets generated from the keys and from the block ID
the clients are granted access to.  These tickets can be checked by
the data nodes because these nodes know the keys. When the client
loses the right to access the blocks (i.e. when the client transaction
ends), the corresponding key is revoked.
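
&#60;/p&#62;&#60;p&#62;
The ticket idea can be sketched in a few lines. The code below is an
illustration only and not the PlasmaFS implementation: it builds HMAC
(RFC 2104) on top of MD5 simply because the Ocaml standard library
ships MD5 as &#60;code&#62;Digest&#60;/code&#62;, it assumes that a block ID is a plain
integer, and the names &#60;code&#62;make_ticket&#60;/code&#62; and
&#60;code&#62;check_ticket&#60;/code&#62; are made up. The real hash function and
ticket layout may well differ.

```ocaml
(* HMAC per RFC 2104, instantiated with MD5 as a stand-in hash. *)
let block_size = 64

let xor_pad key byte =
  String.map (fun c -> Char.chr (Char.code c lxor byte)) key

let hmac_md5 ~key msg =
  (* Keys longer than one hash block are hashed first, then the key
     is padded with zero bytes up to the block size. *)
  let key =
    if String.length key > block_size then Digest.string key else key in
  let key = key ^ String.make (block_size - String.length key) '\000' in
  let inner = Digest.string (xor_pad key 0x36 ^ msg) in
  Digest.string (xor_pad key 0x5c ^ inner)

(* Name node side: derive a ticket for one block from the shared key. *)
let make_ticket ~key ~block_id =
  Digest.to_hex (hmac_md5 ~key (string_of_int block_id))

(* Data node side: recompute the ticket and compare. The comparison is
   not constant-time, which is acceptable for a sketch only. *)
let check_ticket ~key ~block_id ticket =
  String.equal (make_ticket ~key ~block_id) ticket
```

A data node that knows the key accepts the ticket for exactly this
block and nothing else. Once the key is dropped, every ticket derived
from it stops verifying, which is all that revocation needs.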

&#60;/p&#62;&#60;p&#62;
With some additional tricks, the only communication needed between the
name node and the data nodes is a periodic maintenance call that hands
out the new keys and revokes the expired keys. That&#38;#39;s an acceptable
overhead.


&#60;/p&#62;&#60;h2&#62;Other quality-assuring features&#60;/h2&#62;

&#60;p&#62;
PlasmaFS implements the POSIX file semantics almost completely. This
includes the possibility of modifying data (or better, replacing
blocks by newer versions, which is not possible in other DFS
implementations), the handling of deleted files, and the exclusive
creation of new files. There are a few exceptions, though: neither
the link count nor the last access time of files is maintained.
Also, lockf-style locks are not yet available.

&#60;/p&#62;&#60;p&#62;
For supporting map/reduce and other distributed algorithm schemes,
PlasmaFS offers locality functions. In particular, one can find out
on which nodes a data block is actually stored, and one can also
request that a new data block be stored on a certain node (if possible).

&#60;/p&#62;&#60;p&#62;
The PlasmaFS client protocol is based on SunRPC. This protocol has quite
good support on the system level, and it supports strong
authentication and encryption via the GSS-API extension (which is
actually used by PlasmaFS, together with the SCRAM-SHA1 mechanism). I
know that younger developers consider it outdated, but even the
Facebook generation must accept that it can keep up with the
requirements of today, and that it includes features that more modern
protocols do not provide (like UDP transport and GSS-API). For the
quality of the code it is important that modifying the SunRPC layer is
easy (e.g. adding a new procedure or changing an existing one), and does
not imply much coding. As a result, the PlasmaFS protocol is quite clean
on the one hand, but still expressive enough on the other hand to
support complex transactions.

&#60;/p&#62;&#60;p&#62;
PlasmaFS is accessible from many environments. Applications can access
it via the mentioned SunRPC protocol (with all features), but also
via NFS, and via a command-line client. In the future, WebDAV support
will also be provided (which is an extension of HTTP, and which will
ensure easy access from many programming environments).

&#60;/p&#62;&#60;h2&#62;Check Plasma out&#60;/h2&#62;

The &#60;a href=&#34;http://plasma.camlcity.org&#34;&#62;Plasma homepage&#60;/a&#62; provides
a lot of documentation, and especially downloads. Also take a look at
the &#60;a href=&#34;http://plasma.camlcity.org/plasma/perf.html&#34;&#62;performance
page&#60;/a&#62;, describing a few tests I recently ran.




&#60;/cc-field&#62;
&#60;/p&#62;
&#60;/div&#62;

&#60;div&#62;
  
&#60;/div&#62;

&#60;div&#62;
  Gerd Stolpmann works as O&#38;#39;Caml consultant.
&#60;a href=&#34;search1.html&#34;&#62;Currently looking for new jobs as consultant!&#60;/a&#62;

&#60;/div&#62;

&#60;div&#62;
  
&#60;/div&#62;


          </description>
        </item>
      
        <item>
          <title>Opportunity</title>
          <guid>http://blog.camlcity.org/blog/search1.html</guid>
          <link>http://blog.camlcity.org/blog/search1.html</link>
          <description>

&#60;div&#62;
  &#60;b&#62;Gerd Stolpmann is looking for new Ocaml projects&#60;/b&#62;&#60;br/&#62;&#38;#160;
&#60;/div&#62;

&#60;div&#62;
  
Finally my job at Mylife ended. After all, it was a great success, and
it used Ocaml as implementation language. The question is now: What
is the next challenge?

&#60;/div&#62;

&#60;div&#62;
  
&#60;p&#62;
Of course, this is a message to possible employers. There is now the
opportunity to hire me, an Ocaml enthusiast of the first hour, creator
of GODI, camlcity.org, and a number of Ocaml libraries. Also, I think
I&#38;#39;m a quite good system programmer ;-)
&#60;/p&#62;

&#60;p&#62;
Find more information in my &#60;a href=&#34;http://www.gerd-stolpmann.de/buero/Company-profile.pdf&#34;&#62;Company Profile&#60;/a&#62;. I am looking for contract work only.
Both small and large jobs are possible now. Don&#38;#39;t miss this
opportunity, and talk to me &#60;b&#62;now&#60;/b&#62; (gerd@gerd-stolpmann.de).
&#60;/p&#62;

&#60;/div&#62;

&#60;div&#62;
  Gerd Stolpmann works as O&#38;#39;Caml consultant

&#60;/div&#62;

&#60;div&#62;
  
&#60;/div&#62;


          </description>
        </item>
      
        <item>
          <title>GODI upgrades to Ocaml-3.12.1</title>
          <guid>http://blog.camlcity.org/blog/godi_3_12_1.html</guid>
          <link>http://blog.camlcity.org/blog/godi_3_12_1.html</link>
          <description>

&#60;div&#62;
  &#60;b&#62;Web site is also redesigned&#60;/b&#62;&#60;br/&#62;&#38;#160;
&#60;/div&#62;

&#60;div&#62;
  
The GODI project just updated the release line for Ocaml-3.12, which is
now based on version 3.12.1. We have done extensive tests, and found only
some minor incompatibilities for ocamlbuild (which is a bit pickier
now and sometimes chokes when it sees extra files it does not know
about). These could be resolved. At the same time, the 3.12 release
loses its beta status (which it had for quite a long time), and is now
the recommended Ocaml version.

&#60;/div&#62;

&#60;div&#62;
  
&#60;p&#62;
GODI also gets a newly designed web site: http://godi.camlcity.org is
now much more illustrative and informative. As a special bonus, a news
feed is included with the newest package releases (right now only as
HTML, RSS will follow later). Also the list of packages has been
reworked and is now fully dynamic (instead of generated).

&#60;/p&#62;&#60;p&#62;
The new bootstrap for GODI is now available as
&#60;a href=&#34;http://www.camlcity.org:81/download/godi-rocketboost-20110717.tar.gz&#34;&#62;
godi-rocketboost-20110717.tar.gz&#60;/a&#62;.  It also got a bit of
developers&#38;#39; attention, and is now more intuitive. (For example, the
stage 2 of the bootstrap is now automatically started. The bootstrap
is now interactive by default.)  If you do not want to run another
bootstrap, you can also upgrade your existing GODI installation:

&#60;/p&#62;&#60;ul&#62;
&#60;li&#62;If you already have a 3.12 installation, just follow the normal
upgrade path for packages in godi_console. Ocaml 3.12.1 is a normal
package upgrade here.
&#60;/li&#62;&#60;li&#62;If you still use a 3.11 installation, just edit godi.conf and
change the line setting GODI_SECTION so that this variable is set to
3.12. Then perform a package upgrade using godi_console (as above).
&#60;/li&#62;&#60;/ul&#62;

&#60;p&#62;
There are a few remaining issues with the new version:

&#60;/p&#62;&#60;ul&#62;
&#60;li&#62;Batteries: It is reported that there is an incompatibility with the
  new Hashtbl signature. This is being worked on. Currently the
  package is broken. It is recommended that users wait until the
  problems have been resolved.
&#60;/li&#62;&#60;li&#62;Ocamlduce is still unavailable for 3.12.1. This also affects
  dependencies like godi-tyxml.
&#60;/li&#62;&#60;li&#62;After installing GODI on a newly set up system, I found problems
  for godi-mlgmp, godi-fftw, and godi-lablgtk(1), because they
  are out of sync with current C libraries, or the C libraries have
  become unavailable (like gtk1).
&#60;/li&#62;&#60;li&#62;There are a few applications which are still broken:
  apps-felix, apps-nurpawiki, apps-pkglab, apps-regstab. The
  package maintainers are notified.
&#60;/li&#62;&#60;/ul&#62;

&#60;p&#62;The documentation
archive &#60;a href=&#34;http://docs.camlcity.org&#34;&#62;docs.camlcity.org&#60;/a&#62; is
still switching to 3.12 as default source for documentation. This
means not all package docs are available yet (but an impressive subset
is already). I hope this will be fixed by tomorrow.

&#60;/p&#62;&#60;p&#62;
A final word about GODI and OASIS. Sylvain is very interested in
providing the OASIS packages in a format that GODI understands. There
is now the plan that GODI extracts the required information from its
package db, and OASIS uses this to wrap its packages so these can be
included as package source into godi_console. We are now working on
this.
&#60;/p&#62;
&#60;/div&#62;

&#60;div&#62;
  Gerd Stolpmann works as O&#38;#39;Caml consultant

&#60;/div&#62;

&#60;div&#62;
  
&#60;/div&#62;


          </description>
        </item>
      
        <item>
          <title>Camlcity.org gets a shared cache</title>
          <guid>http://blog.camlcity.org/blog/multicore4.html</guid>
          <link>http://blog.camlcity.org/blog/multicore4.html</link>
          <description>

&#60;div&#62;
  &#60;b&#62;Ocaml and multicore programming&#60;/b&#62;&#60;br/&#62;&#38;#160;
&#60;/div&#62;

&#60;div&#62;
  
The website camlcity.org (where this blog is published) is running
a special server software written in Ocaml. Recently I made it faster
by introducing a cache that is directly shared by several worker
processes. The cache module consists only of a few lines of code,
and makes use of the new &#60;a href=&#34;http://blog.camlcity.org/blog/multicore2.html&#34;&#62;Netmulticore&#60;/a&#62; library.

&#60;/div&#62;

&#60;div&#62;
  
&#60;p&#62;
Netmulticore is a part of Ocamlnet, and because the camlcity.org software
is also developed with Ocamlnet, it was quite natural and simple to use
this shared memory interface.

&#60;/p&#62;&#60;p&#62;
But let&#38;#39;s step back first and look at the special problem that was solved.
Camlcity.org is not a simple web server - it is a cascade of a front-end
server and several back-end servers. The front-end, more or less,
mixes the data coming from the back-ends, and transforms it into
a presentable form using a template engine. The back-ends are also
HTTP servers.  This is shown in this picture:


&#60;img src=&#34;/files/img/blog/multicore4_fig1.png&#34; width=&#34;700&#34; height=&#34;247&#34;/&#62;

&#60;/p&#62;&#60;p&#62;
The nice aspect about this architecture is that the back-ends can be
individually deployed, and can run on different machines than the
front-end.

&#60;/p&#62;&#60;p&#62;
So, for example, if you view this blog post on camlcity.org, the text
of the blog article comes from a back-end server, and the front-end
creates the frame with the navigation elements. If you view this
blog with an RSS reader, the front-end just wraps the back-end text
differently so an RSS file is generated instead of a web page.

&#60;/p&#62;&#60;p&#62;
This architecture creates a little performance problem, though: For
processing a user request quite a number of accesses to the back-end
servers are required. Not only the article text needs to be fetched,
but also the required templates, and for generating the navigation
elements, also some neighbor texts (parents and siblings) need to be
requested from the back-ends. This can add up to a dozen or more 
requests, and was the reason why using camlcity.org often felt a bit
sluggish.

&#60;/p&#62;&#60;p&#62;
Note that only multi-processing is used: There is a master
process starting as many worker processes as needed (the workers
can be of different types, here front-ends or back-ends). The workers
can run in parallel, and even take advantage of several processor
cores. Also, the workers are fully separated from each other, so that
a malfunction (including a crash) of one worker does not affect other
workers. This is a good feature for software running 24/7.

&#60;/p&#62;&#60;p&#62;
However, multi-processing makes it difficult to share data between
workers. The workers have only their own process-local memory, and
cannot normally make data available to others. Well, Netmulticore
changes the game at this point.

&#60;/p&#62;&#60;p&#62;
The improved architecture introduces a cache on the front-end
side. This cache stores all back-end responses where it is suspected
they could be requested soon again:


&#60;img src=&#34;/files/img/blog/multicore4_fig2.png&#34; width=&#34;716&#34; height=&#34;327&#34;/&#62;

&#60;/p&#62;&#60;p&#62;
This cache resides in explicitly allocated shared memory (using the
POSIX interface &#60;code&#62;shm_open&#60;/code&#62;). The Netmulticore library is
used to manage this block of shared memory. Besides other data
structures there is also &#60;code&#62;Netmcore_hashtbl&#60;/code&#62;, an adaptation of
the well-known &#60;code&#62;Hashtbl&#60;/code&#62; module of the standard library for
use in shared memory.

&#60;/p&#62;&#60;p&#62;
As this code is really short and nice, I just show it here:

&#60;/p&#62;&#60;pre style=&#34;font-size:smaller&#34;&#62;
type cache_obj =
    [ `Fields of Json_type.t * string option 
    | `Refs of Json_type.t
    | `Html of Nethtml.document list
    ]

type cache_hdr =
    { mutable lock : Netmcore_mutex.mutex;
      mutable next_gc : float
    }

type cache =
  (string, float * string * string * cache_obj, cache_hdr) Netmcore_hashtbl.t
&#60;/pre&#62;

The values &#60;code&#62;cache_obj&#60;/code&#62; are stored in the cache (payload data).
As you see, we cannot only store strings, but structured Ocaml values
(with limitations, though). The shared hashtable features a so-called
header which exists once per hashtable. Here, &#60;code&#62;cache_hdr&#60;/code&#62;
includes a mutex (to ensure that only one process can write at a time),
and the field &#60;code&#62;next_gc&#60;/code&#62; which is the point in time when the
hashtable will be checked next for elements exceeding their lifetime.
The &#60;code&#62;cache&#60;/code&#62;, finally, maps URLs (given as strings) to tuples
&#60;code&#62;(timeout, path, url,
element)&#60;/code&#62;. Here, &#60;code&#62;timeout&#60;/code&#62; is the point in time when
the element needs to be evicted from the cache, and the path and URL
are further metadata.

&#60;p&#62;
The cache lookup is as easy as:

&#60;/p&#62;&#60;pre style=&#34;font-size:smaller&#34;&#62;
let cache_lookup path =
  let cache = get_cache() in
  let (t_out, real_path, real_url, obj) = Netmcore_hashtbl.find_c cache path in
  if Unix.time() &#38;#62;= t_out then raise Not_found;
  (real_path, real_url, obj)
&#60;/pre&#62;

Note that we omit &#60;code&#62;get_cache&#60;/code&#62; here because it involves a
bit of application-specific management code.

&#60;p&#62;
The function &#60;code&#62;Netmcore_hashtbl.find_c&#60;/code&#62; creates a copy of the
values found in the shared cache. This is required because we cannot allow
that pointers to shared values escape the scope of this module - such
pointers need special treatment (there are some programming rules to be
followed). The copy is put into normal process-local memory, so these
rules no longer apply then.

&#60;/p&#62;&#60;p&#62;
For storing a value in the cache we have:

&#60;/p&#62;&#60;pre style=&#34;font-size:smaller&#34;&#62;
let cache_store path real_path real_url obj =
  try
    let cache = get_cache() in
    let now = Unix.time() in
    let t_out = now +. float !cache_default_timeout  in
    let hdr = Netmcore_hashtbl.header cache in
    Netmcore_mutex.lock hdr.lock;
    ( try
        if now &#38;#62;= hdr.next_gc then (
          let l = ref [] in
          Netmcore_hashtbl.iter
            (fun p (t,_,_,_) -&#38;#62;
               (* Warning: p, t are in shared mem *)
               if now &#38;#62;= t then l := p :: !l
            )
            cache;
          List.iter
            (fun p -&#38;#62;
               (* Warning: p is in shared mem *)
               Netmcore_hashtbl.remove cache p
            )
            !l;
          (* Floats are boxed! *)
          Netmcore_heap.modify
            (Netmcore_hashtbl.heap cache)
            (fun mut -&#38;#62;
               hdr.next_gc &#38;#60;- Netmcore_heap.add mut t_out
            );
        );
        if Netmcore_hashtbl.length cache &#38;#60; !cache_limit then
          Netmcore_hashtbl.replace
            cache path (t_out, real_path, real_url, obj);
        Netmcore_mutex.unlock hdr.lock;
      with
        | error -&#38;#62; 
            Netmcore_mutex.unlock hdr.lock;
            raise error
    )
  with
    | Netmcore_mempool.Out_of_pool_memory -&#38;#62;
        Netlog.logf `Warning &#38;#34;Shared cache: Out of pool memory&#38;#34;
&#60;/pre&#62;

&#60;p&#62;
We use a lock to ensure that only one process can write at a time.
This lock is managed with Netmulticore&#38;#39;s &#60;code&#62;Netmcore_mutex&#60;/code&#62;
module. Essentially, the lock guarantees that all modifications done
at write time are done atomically, and thus consistency is preserved.

&#60;/p&#62;&#60;p&#62;
As you see we now and then throw out all elements exceeding their
lifetime. This is done (for simplicity) by iterating over the whole
hashtable, and checking each element. The keys of the found elements
are gathered up in &#60;code&#62;l&#60;/code&#62;, and are removed in a second step.
Note that the iteration gives us direct pointers to shared memory, e.g.
&#60;code&#62;p&#60;/code&#62; is a string residing in shared memory. One has to be
very careful with such values, because Netmulticore provides fewer
guarantees about how long such values exist than Ocaml programmers are
used to. For example, once a key &#60;code&#62;p&#60;/code&#62; is removed from the table,
the string counts as no longer referenced, and can be deleted by
Netmulticore&#38;#39;s internal memory manager - even if we still have
the &#60;code&#62;p&#60;/code&#62; variable (because Netmulticore cannot cooperate
with Ocaml&#38;#39;s memory manager for this purpose).
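
&#60;/p&#62;&#60;p&#62;
The two-step eviction (collect the expired keys first, remove them
afterwards) can be tried out without shared memory. The following
sketch uses the ordinary &#60;code&#62;Hashtbl&#60;/code&#62; instead of
&#60;code&#62;Netmcore_hashtbl&#60;/code&#62;, passes the current time as an argument,
and stores plain strings as payload. All names are made up for
illustration, and none of the shared-memory rules are modelled.

```ocaml
(* Process-local stand-in for the shared cache:
   key -> (expiry time, payload). *)
let cache : (string, float * string) Hashtbl.t = Hashtbl.create 16

let store ~now ~ttl key payload =
  Hashtbl.replace cache key (now +. ttl, payload)

(* An entry whose expiry time has been reached counts as missing. *)
let lookup ~now key =
  match Hashtbl.find_opt cache key with
  | Some (t_out, payload) when t_out > now -> Some payload
  | _ -> None

(* Step 1: gather the expired keys. Step 2: remove them.
   Returns the number of evicted entries. *)
let evict_expired ~now =
  let expired =
    Hashtbl.fold
      (fun key (t_out, _) acc -> if now >= t_out then key :: acc else acc)
      cache [] in
  List.iter (Hashtbl.remove cache) expired;
  List.length expired
```

Splitting the work into two steps is also advisable for the standard
&#60;code&#62;Hashtbl&#60;/code&#62;, because its behaviour is unspecified when the
table is modified during an iteration.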

&#60;/p&#62;&#60;p&#62;
Another strange thing is the &#60;code&#62;Netmcore_heap.modify&#60;/code&#62;
function.  It is required for modifying shared values in-place,
here &#60;code&#62;next_gc&#60;/code&#62;. The value &#60;code&#62;t_out&#60;/code&#62; is a float
stored in normal process-local memory. Assigning it directly to
&#60;code&#62;next_gc&#60;/code&#62; would create an illegal pointer from shared
memory to local memory (resulting in a crash). By using the
&#38;#34;write protocol&#38;#34; as shown here, the float is copied to shared
memory before doing the assignment.

&#60;/p&#62;&#60;p&#62;
The solution is surprisingly short. It has never been so simple to profit
from shared memory in Ocaml programs. The reason is that we do not need
to deal with serialization formats to translate values to strings.  We
just store values directly! It should also be noted that there are
additional dangers resulting from shared memory. The worker processes
are no longer completely isolated from each other - we made an
exception by sharing memory for the cache. If a worker fails to comply
with all programming rules required for accessing shared memory, not
only this worker will crash, but all workers.  Another risk is the
shared locks. Imagine what happens when a worker is terminated in the
middle of the &#60;code&#62;cache_store&#60;/code&#62; function (e.g. by sending
a signal from outside). The lock will never be released again, and
the other workers will wait forever for the lock.

&#60;/p&#62;&#60;p&#62;
Anyway, these risks are manageable, and are roughly equivalent in
severity to what multi-threaded programming is also exposed to.  In
summary, Netmulticore solves some of the problems arising from using
multi-processing, and is definitely worth considering.


&#60;/p&#62;
&#60;/div&#62;

&#60;div&#62;
  Gerd Stolpmann works as O&#38;#39;Caml consultant

&#60;/div&#62;

&#60;div&#62;
  
&#60;/div&#62;


          </description>
        </item>
      
        <item>
          <title>Netmulticore and the n-queens puzzle</title>
          <guid>http://blog.camlcity.org/blog/multicore3.html</guid>
          <link>http://blog.camlcity.org/blog/multicore3.html</link>
          <description>

&#60;div&#62;
  &#60;b&#62;Ocaml and multicore programming&#60;/b&#62;&#60;br/&#62;&#38;#160;
&#60;/div&#62;

&#60;div&#62;
  
The &#60;a href=&#34;http://en.wikipedia.org/wiki/Nqueens&#34;&#62;n-queens puzzle&#60;/a&#62;
is a well-known toy problem in computer science that can serve as a
benchmark for a class of problems where possible solutions are
systematically enumerated and then tested whether they fulfill all
required properties. Here we show how to speed the algorithm up for
multi-core CPUs using the brand-new
&#60;a href=&#34;http://blog.camlcity.org/blog/multicore2.html&#34;&#62;Netmulticore&#60;/a&#62;
library for Ocaml.

&#60;/div&#62;

&#60;div&#62;
  
&#60;p&#62;
The exact task is to find all &#60;i&#62;fundamental&#60;/i&#62; solutions for an
n&#38;#215;n chessboard, then to print the solutions to stdout, and
finally to print the number of such solutions. Because of the symmetry
of the board we do not count solutions as new when they can be derived
from existing solutions by rotation and/or reflection. There are usually
eight such equivalent variants, and it does not matter which of the
variants representing such a fundamental solution is output by the program.

&#60;/p&#62;&#60;p&#62;
As we&#38;#39;ll see, the restriction to the fundamental solutions makes the
algorithm harder to parallelize although the search space becomes
smaller. In particular, the algorithm is no longer &#60;i&#62;embarrassingly
parallel&#60;/i&#62;. The provided solutions would even count as examples of
fine-grained parallelism.

&#60;/p&#62;&#60;p&#62;
Our general design is to first emit all solutions using a brute-force
algorithm &#60;i&#62;solve&#60;/i&#62; and then to filter out all duplicates that are
symmetric to an already found solution. The goal is to look into ways
of speeding this design up on multi-core computers, but it is not
intended to optimize the core &#60;i&#62;solve&#60;/i&#62; algorithm directly. We
then get numbers comparing the performance of a parallel algorithm with
the basic sequential algorithm, and hopefully an idea of how well the
parallelization approach worked.

&#60;/p&#62;&#60;p&#62;
The complete source code is available as example in the newest
Ocamlnet test release, which is also available in the Subversion
repository:
&#60;a href=&#34;https://godirepo.camlcity.org/svn/lib-ocamlnet2/trunk/code/examples/multicore/nqueens.ml&#34;&#62;nqueens.ml&#60;/a&#62;.


&#60;/p&#62;&#60;h2&#62;Basic algorithms&#60;/h2&#62;

&#60;p&#62;
The basic data structure is

&#60;/p&#62;&#60;pre&#62;
type board = int array
&#60;/pre&#62;

For such a &#60;code&#62;board&#60;/code&#62; array the number &#60;code&#62;board.(col)&#60;/code&#62;
is the row where the queen is placed in the column &#60;code&#62;col&#60;/code&#62;.
Rows and columns are numbered from 0 to &#60;code&#62;N-1&#60;/code&#62;. For this 
representation the functions

&#60;pre&#62;
val x_mirror : board -&#38;#62; board
val rot_90 : board -&#38;#62; board
&#60;/pre&#62;

for reflection at the &#60;code&#62;x&#60;/code&#62; (columns) axis and for rotation
by 90 degrees are simple array operations (similar to
transposition). With their help it is possible to write a function

&#60;pre&#62;
val transformations : board -&#38;#62; board list
&#60;/pre&#62;

that returns all variants of a solution that can be determined by
rotation and reflection (we allow that a variant is returned twice).
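
&#60;p&#62;
These three functions could look as follows. This is a sketch and not
the code from nqueens.ml: it assumes that the reflection flips the rows
and that the rotation moves the queen at (col, row) to (row, n-1-col).
&#60;/p&#62;

```ocaml
type board = int array

(* Reflection: flip the row of every queen, keep its column. *)
let x_mirror (b : board) : board =
  let n = Array.length b in
  Array.map (fun row -> n - 1 - row) b

(* Rotation by 90 degrees: the queen at (col, row) moves to
   (row, n-1-col). Only correct if every row holds exactly one queen,
   which is the case for solutions. *)
let rot_90 (b : board) : board =
  let n = Array.length b in
  let b' = Array.make n 0 in
  Array.iteri (fun col row -> Array.set b' row (n - 1 - col)) b;
  b'

(* The four rotations plus the mirror image of each: eight variants,
   with repetitions if the board is symmetric. *)
let transformations (b : board) : board list =
  let r1 = rot_90 b in
  let r2 = rot_90 r1 in
  let r3 = rot_90 r2 in
  let rotations = [b; r1; r2; r3] in
  rotations @ List.map x_mirror rotations
```

The 4-queens solution &#60;code&#62;[|1;3;0;2|]&#60;/code&#62; shows why repetitions
occur: the rotation leaves this board unchanged, so the eight variants
collapse to only two distinct boards.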

&#60;p&#62;
Also, we assume we have the core algorithm as

&#60;/p&#62;&#60;pre&#62;
val solve : int -&#38;#62; int -&#38;#62; (board -&#38;#62; unit) -&#38;#62; unit
&#60;/pre&#62;

so that &#60;code&#62;solve q0 N emit&#60;/code&#62; generates all solutions where
the first queen (in column 0) is put into row &#60;code&#62;q0&#60;/code&#62;. For
each solution &#60;code&#62;b&#60;/code&#62; the function &#60;code&#62;emit b&#60;/code&#62; is
called to further process it. As already mentioned, this algorithm
is expected to find all solutions no matter whether a symmetrical
solution is already known or not.
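
&#60;p&#62;
One way to write such a function is plain backtracking, column by
column. The following sketch is not the code from nqueens.ml. The
helper &#60;code&#62;safe&#60;/code&#62; is made up for illustration, and the board
handed to &#60;code&#62;emit&#60;/code&#62; is reused between calls, so a caller that
wants to keep a solution has to copy it.
&#60;/p&#62;

```ocaml
type board = int array

(* Holds if a queen at (col, row) is not attacked by any of the
   queens already placed in the columns 0 .. col-1. *)
let safe (b : board) col row =
  let rec check c =
    if c = col then true
    else
      let r = b.(c) in
      if r = row || abs (r - row) = col - c then false
      else check (c + 1) in
  check 0

(* Enumerates all solutions for an n*n board whose queen in column 0
   sits in row q0, and calls emit for each of them. *)
let solve q0 n emit =
  let b = Array.make n 0 in
  let rec place col =
    if col = n then emit b
    else
      for row = 0 to n - 1 do
        if safe b col row then begin
          Array.set b col row;
          place (col + 1)
        end
      done in
  Array.set b 0 q0;
  place 1
```

Calling &#60;code&#62;solve&#60;/code&#62; once for every possible row of the first
queen enumerates the complete search space, symmetric solutions
included.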

&#60;p&#62;
Finally, we assume we can print a board with

&#60;/p&#62;&#60;pre&#62;
val print : board -&#38;#62; unit
&#60;/pre&#62;


&#60;h2&#62;SEQ: The sequential algorithm&#60;/h2&#62;

With the given functions, the sequential solver is simply

&#60;pre&#62;
  let run n =
    let t0 = Unix.gettimeofday() in
    let ht = Hashtbl.create 91 in
    for k = 0 to n-1 do
      solve k n
        (fun b -&#38;#62;
           if not (Hashtbl.mem ht b) then (
             let b = Array.copy b in
             List.iter
               (fun b&#38;#39; -&#38;#62;
                  Hashtbl.add ht b&#38;#39; ()
               )
               (transformations b);
             print b
           )
        )
    done;
    let t1 = Unix.gettimeofday() in
    printf &#38;#34;Number solutions: %n\n%!&#38;#34; (Hashtbl.length ht / 8);
    printf &#38;#34;Time: %.3f\n%!&#38;#34; (t1-.t0)
&#60;/pre&#62;

We use here a hash table &#60;code&#62;ht&#60;/code&#62; to collect all solutions. The
hash table is filled with the symmetrical variants, too, so that we
can recognize a new fundamental solution by just checking whether it
is already a member of the table or not.

&#60;p&#62;
Runtimes are measured on a single-CPU quad-core Opteron machine
(Barcelona) with 8 GB RAM:

&#60;/p&#62;&#60;p&#62;
&#60;/p&#62;&#60;table&#62;
&#60;tr&#62;&#60;td&#62;Size&#38;#160;&#38;#160;&#38;#160;&#38;#160;&#38;#160;&#60;/td&#62; &#60;td&#62;Runtime&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=8:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.001 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=9:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.003 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=10:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.009 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=11:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.039 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=12:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.283 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=13:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;1.730 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=14:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;10.037 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=15:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;66.420 s&#60;/td&#62;&#60;/tr&#62;
&#60;/table&#62;


&#60;h2&#62;The parallel algorithms&#60;/h2&#62;

There are four parallel versions of this algorithm:

&#60;ul&#62;
  &#60;li&#62;&#60;code&#62;SHT&#60;/code&#62; starts several solvers so that their search spaces
do not overlap. The hash table &#60;code&#62;ht&#60;/code&#62; is put into shared memory
using Netmulticore&#38;#39;s &#60;code&#62;Netmcore_hashtbl&#60;/code&#62; module. (&#60;code&#62;SHT&#60;/code&#62;
= shared hash table.)
  &#60;/li&#62;&#60;li&#62;&#60;code&#62;SHT2&#60;/code&#62; is an enhanced version trying to address the
issue that in &#60;code&#62;SHT&#60;/code&#62; several workers struggle for the same
lock. In &#60;code&#62;SHT2&#60;/code&#62; we partition the filter space into two distinct
sets so that there can be a separate hash table for each set.
  &#60;/li&#62;&#60;li&#62;&#60;code&#62;MP&#60;/code&#62; does not put the hash table into shared memory but
sends all data to a special process that filters duplicates out. This
process uses a normal &#60;code&#62;Hashtbl&#60;/code&#62; from Ocaml&#38;#39;s standard library.
(&#60;code&#62;MP&#60;/code&#62; = message passing.)
  &#60;/li&#62;&#60;li&#62;&#60;code&#62;MP2&#60;/code&#62; is an improved version also partitioning the
filter space so that two independent processes perform the filtering.
&#60;/li&#62;&#60;/ul&#62;

&#60;p&#62;
The performance of the parallel solutions and the sequential original
&#60;code&#62;SEQ&#60;/code&#62; is depicted in the following diagram. The X axis shows
the problem size (N = number of rows/columns of the board), whereas the
Y axis is the runtime in seconds (with logarithmic scale).

&#60;/p&#62;&#60;p&#62;
&#60;img width=&#34;442&#34; height=&#34;435&#34; src=&#34;/files/img/blog/multicore3_diagram.jpg&#34;/&#62;

&#60;/p&#62;&#60;p&#62;
As you can already see, the message passing variants are faster than
the versions using a shared hash table. Note
that &#60;code&#62;Netmcore_hashtbl&#60;/code&#62; is an adaptation
of &#60;code&#62;Hashtbl&#60;/code&#62; using the same hashing technique with even the
same hash function. Of course, the shared version has to spend some
additional runtime for copying the data to the shared memory block.
However, the main problem seems to be that the worker processes lock
each other out, because they need exclusive access for adding new
elements to the table. The message passing algorithms avoid this
problem - Netmulticore&#38;#39;s &#60;code&#62;Netmcore_camlbox&#60;/code&#62; implementation
of message passing allows several writers to send in parallel.



&#60;/p&#62;&#60;h2&#62;Netmulticore&#60;/h2&#62;

&#60;p&#62;
Ocamlnet has recently been extended by a library for managing multiple
processes which can keep/exchange data via shared memory. The library
is very new and for sure neither fully optimized nor even bug-free.

&#60;/p&#62;&#60;p&#62;
The basic design of a Netmulticore program is that several independent
Ocaml processes are started so that each process has local memory with
its own Ocaml heap, and the usual garbage collector provided by the
Ocaml runtime. The processes run at the same speed as in the
&#38;#34;uni-core&#38;#34; case, and cannot accidentally step on each other&#38;#39;s
feet. In addition to this, a pool of shared memory is allocated so
that all processes map this memory at the same address. This pool is
managed so that additional
&#60;i&#62;shared heaps&#60;/i&#62; can be kept there. Normal Ocaml data can be moved
to a shared heap and accessed like process-local data. Shared heaps
must be self-contained so that no pointer references memory outside
the same shared heap. Due to these constraints, special coding
rules must be followed when shared data is altered. For each shared
heap there is a separate garbage collector.

&#60;/p&#62;&#60;p&#62;
The advantage of this design is that there is no &#38;#34;single point of
congestion&#38;#34; where several processes would necessarily compete for the
same resource (like a single shared heap) and would run into the
danger of locking each other out. Although there is a lock for each
shared heap granting exclusive write access, there is no limit on the
number of such heaps.  The disadvantage is the self-containment
restriction: before data can be accessed by several processes it must
be explicitly copied to a shared heap.

&#60;/p&#62;&#60;h2&#62;SHT: Shared hash tables&#60;/h2&#62;

&#60;p&#62;
The &#60;code&#62;solve&#60;/code&#62; function allows the first queen to be placed
explicitly. We use this feature to split the search space into N
partitions: every parallel worker k puts the first queen into a
different row, and the remaining queens are systematically tried out.
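&#60;/p&#62;&#60;p&#62;
The partitioning can be illustrated with a minimal, self-contained
sketch. It is not the &#60;code&#62;solve&#60;/code&#62; function of this article:
&#60;code&#62;solve_sketch&#60;/code&#62; and &#60;code&#62;count_all&#60;/code&#62; are hypothetical
names, the board layout (an int array holding one row number per column)
is an assumption, and the partitions are processed one after the other
instead of by parallel workers.

&#60;/p&#62;&#60;pre&#62;
  (* Sketch only: b.(col) is the row of the queen in column col.
     first_queen is the row of the queen in column 0. *)
  let solve_sketch first_queen n emit =
    let b = Array.make n 0 in
    (* true if a queen at (col, row) is not attacked from columns 0..col-1 *)
    let safe col row =
      let ok = ref true in
      for c = 0 to col - 1 do
        if b.(c) = row || abs (b.(c) - row) = col - c then ok := false
      done;
      !ok in
    let rec place col =
      if col = n then emit b
      else
        for row = 0 to n - 1 do
          if safe col row then begin
            b.(col) &#38;#60;- row;
            place (col + 1)
          end
        done in
    b.(0) &#38;#60;- first_queen;
    place 1

  (* One partition per row of the first queen, run sequentially here. *)
  let count_all n =
    let total = ref 0 in
    for k = 0 to n - 1 do
      solve_sketch k n (fun _ -&#38;#62; incr total)
    done;
    !total
&#60;/pre&#62;

&#60;p&#62;
Under these assumptions &#60;code&#62;count_all 8&#60;/code&#62; evaluates to 92, the
known number of solutions for the 8x8 board. The callback sees the same
mutable array on every call, so it has to copy the board before keeping it.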

&#60;/p&#62;&#60;p&#62;
There is a trick to reduce the number of write accesses to the hash
table by a factor of eight. Instead of adding all solutions to the
table only a representative of each fundamental solution is
entered. The representative is simply the smallest board according to
Ocaml&#38;#39;s generic comparison function.
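&#60;/p&#62;&#60;p&#62;
Choosing the representative can be sketched as follows.
The &#60;code&#62;transformations&#60;/code&#62; function below is an assumption made
for illustration and may differ from the one used in this article; it
assumes a board is an int array holding one row number per column.
&#60;code&#62;representative&#60;/code&#62; is a hypothetical helper name.

&#60;/p&#62;&#60;pre&#62;
  (* Sketch only: the eight symmetries of a square board. *)
  let transformations b =
    let n = Array.length b in
    (* mirror left-right: column c becomes column n-1-c *)
    let mirror a = Array.init n (fun c -&#38;#62; a.(n - 1 - c)) in
    (* rotate by 90 degrees: the queen at (c, r) moves to (r, n-1-c) *)
    let rotate a =
      let t = Array.make n 0 in
      Array.iteri (fun c r -&#38;#62; t.(r) &#38;#60;- n - 1 - c) a;
      t in
    let r1 = rotate b in
    let r2 = rotate r1 in
    let r3 = rotate r2 in
    [ b; r1; r2; r3; mirror b; mirror r1; mirror r2; mirror r3 ]

  (* smallest of the eight variants under the generic comparison *)
  let representative b =
    match transformations b with
    | [] -&#38;#62; b
    | first :: rest -&#38;#62; List.fold_left min first rest
&#60;/pre&#62;

&#60;p&#62;
All eight variants of a solution map to the same representative, so
only one table entry is written per fundamental solution. Symmetric
boards produce fewer than eight distinct variants; the duplicates in the
list do not affect the minimum.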

&#60;/p&#62;&#60;p&#62;
This leads to this worker implementation:

&#60;/p&#62;&#60;pre&#62;
  let worker (pool, ht_descr, first_queen, n) =
    let ht = Netmcore_hashtbl.hashtbl_of_descr pool ht_descr in
    solve first_queen n
      (fun b -&#38;#62;
	 let b = Array.copy b in
	 let b_list = transformations b in
	 let b_min =
	   List.fold_left
	     (fun acc b1 -&#38;#62; min acc b1)
	     (List.hd b_list)
	     (List.tl b_list) in
	 let header = Netmcore_hashtbl.header ht in
	 Netmcore_mutex.lock header.lock;
	 try
	   if not (Netmcore_hashtbl.mem ht b_min) then (
	     Netmcore_hashtbl.add ht b_min ();
	     print b_min
	   );
	   Netmcore_mutex.unlock header.lock;
	 with
	   | error -&#38;#62;
	       Netmcore_mutex.unlock header.lock;
	       raise error
      )
&#60;/pre&#62;

&#60;p&#62;Note that we have to use a lock to make the whole read/write access
atomic (the &#60;code&#62;Netmcore_hashtbl&#60;/code&#62; module already uses locks to
protect each access operation individually, but this is not sufficient
here, because another worker could add the same board between
the &#60;code&#62;mem&#60;/code&#62; check and the &#60;code&#62;add&#60;/code&#62;). The call
of &#60;code&#62;Netmcore_hashtbl.add&#60;/code&#62; differs from
&#60;code&#62;Hashtbl.add&#60;/code&#62; in that the keys and values to add are first
copied to the shared heap.
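&#60;/p&#62;&#60;p&#62;
The lock/check/add/unlock pattern of the worker can be factored out, as
in the following minimal sketch. It uses only the standard library: the
lock is passed in as two plain functions, and a
process-local &#60;code&#62;Hashtbl&#60;/code&#62; stands in for the shared table.
&#60;code&#62;with_lock&#60;/code&#62; and &#60;code&#62;add_if_absent&#60;/code&#62; are hypothetical
names, not part of Netmulticore.

&#60;/p&#62;&#60;pre&#62;
  (* Sketch only: run f while holding the lock, and release the lock
     on normal return as well as when f raises. *)
  let with_lock lock unlock f =
    lock ();
    match f () with
    | result -&#38;#62; unlock (); result
    | exception e -&#38;#62; unlock (); raise e

  (* check-then-add as one critical section; true if the key was new *)
  let add_if_absent lock unlock ht key =
    with_lock lock unlock (fun () -&#38;#62;
      if Hashtbl.mem ht key then false
      else begin
        Hashtbl.add ht key ();
        true
      end)
&#60;/pre&#62;

&#60;p&#62;
In the worker shown above, &#60;code&#62;lock&#60;/code&#62; and &#60;code&#62;unlock&#60;/code&#62;
would correspond to the calls
of &#60;code&#62;Netmcore_mutex.lock&#60;/code&#62; and &#60;code&#62;Netmcore_mutex.unlock&#60;/code&#62;
on &#60;code&#62;header.lock&#60;/code&#62;.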

&#60;/p&#62;&#60;p&#62;
The full implementation also has a controller process which starts the
workers and waits until all workers are finished. Finally, the number
of solutions is the length of the shared hash table.

&#60;/p&#62;&#60;p&#62;
The problem with this solution is that only one of the worker processes
can hold the lock at a time. The runtime required for adding an
element to the shared hash table is substantially higher than in the
sequential case because of the additional copy done
by &#60;code&#62;Netmcore_hashtbl.add&#60;/code&#62;, making the locking issue even
more problematic. However, in the end &#60;code&#62;SHT&#60;/code&#62; is faster
than &#60;code&#62;SEQ&#60;/code&#62; for N &#38;#62;= 13. (For smaller N the time for setting
up the processes and the shared memory pool, which needs a few RPC calls
in the current Netmulticore implementation, lets the sequential
version win.)

&#60;/p&#62;&#60;p&#62;
&#60;/p&#62;&#60;table&#62;
&#60;tr&#62;&#60;td&#62;Size&#38;#160;&#38;#160;&#38;#160;&#38;#160;&#38;#160;&#60;/td&#62; &#60;td&#62;Runtime&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=8:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.083 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=9:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.090 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=10:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.098 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=11:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.129 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=12:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.312 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=13:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;1.387 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=14:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;6.890 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=15:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;43.739 s&#60;/td&#62;&#60;/tr&#62;
&#60;/table&#62;




&#60;h2&#62;SHT2: Two shared hash tables&#60;/h2&#62;

&#60;p&#62;
As we have a representative for each fundamental solution, it is
easy to provide more than one hash table, and to use
each table for a subset of the solutions. Here, we check this idea
with two hash tables.
&#60;/p&#62;&#60;p&#62;
The worker now looks as follows:

&#60;/p&#62;&#60;pre&#62;
  let worker (pool, ht1_descr, ht2_descr, first_queen, n) =
    let ht1 = Netmcore_hashtbl.hashtbl_of_descr pool ht1_descr in
    let ht2 = Netmcore_hashtbl.hashtbl_of_descr pool ht2_descr in
    solve first_queen n
      (fun b -&#38;#62;
	 (* Because this is a read-modify-update operation we have to lock
	    the hash table
	  *)
	 let b = Array.copy b in
	 let b_list = transformations b in
	 let b_min =
	   List.fold_left
	     (fun acc b1 -&#38;#62; min acc b1)
	     (List.hd b_list)
	     (List.tl b_list) in
	 let ht = if b_min.(0) &#38;#60; n/2 then ht1 else ht2 in
	 let header = Netmcore_hashtbl.header ht in
	 Netmcore_mutex.lock header.lock;
	 try
	   if not (Netmcore_hashtbl.mem ht b_min) then (
	     Netmcore_hashtbl.add ht b_min ()
	   );
	   Netmcore_mutex.unlock header.lock;
	 with
	   | error -&#38;#62;
	       Netmcore_mutex.unlock header.lock;
	       raise error
      )
&#60;/pre&#62;

&#60;p&#62;
Note that we exploit here the fact that each hash table lives in a
shared heap of its own. Because of this, there is no hidden common
lock where the two table implementations could run into a congestion
issue.

&#60;/p&#62;&#60;p&#62;The results show that this idea works. The &#60;code&#62;SHT2&#60;/code&#62;
version is quite a bit faster although not twice as fast:

&#60;/p&#62;&#60;p&#62;
&#60;/p&#62;&#60;table&#62;
&#60;tr&#62;&#60;td&#62;Size&#38;#160;&#38;#160;&#38;#160;&#38;#160;&#38;#160;&#60;/td&#62; &#60;td&#62;Runtime&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=8:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.145 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=9:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.162 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=10:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.157 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=11:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.169 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=12:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.286 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=13:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.999 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=14:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;4.503 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=15:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;28.027 s&#60;/td&#62;&#60;/tr&#62;
&#60;/table&#62;

&#60;p&#62;
I&#38;#39;ve not examined whether the idea can be generalized to more than
two hash tables. This is probably the case.
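&#60;/p&#62;&#60;p&#62;
One way such a generalization might look is sketched below. This is
untested with Netmulticore: plain &#60;code&#62;Hashtbl&#60;/code&#62; values stand in
for the shared tables, no locking is shown, and
&#60;code&#62;table_index&#60;/code&#62;, &#60;code&#62;add_representative&#60;/code&#62;
and &#60;code&#62;total&#60;/code&#62; are hypothetical names.

&#60;/p&#62;&#60;pre&#62;
  (* Sketch only: extends the test b_min.(0) &#38;#60; n/2 to k tables by
     mapping the first entry (0..n-1) to a table number (0..k-1). *)
  let table_index k n b_min = b_min.(0) * k / n

  let add_representative tables n b_min =
    let k = Array.length tables in
    let ht = tables.(table_index k n b_min) in
    if not (Hashtbl.mem ht b_min) then Hashtbl.add ht b_min ()

  (* the key sets are disjoint, so the total is the sum of the lengths *)
  let total tables =
    Array.fold_left (fun acc ht -&#38;#62; acc + Hashtbl.length ht) 0 tables
&#60;/pre&#62;

&#60;p&#62;
For two tables and even N this selects the same table as the worker
above. Because the representatives are minima, small first entries are
probably over-represented, so the tables may fill unevenly.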


&#60;/p&#62;&#60;h2&#62;MP: Workers passing messages to a filter process&#60;/h2&#62;

The shared hash tables have the disadvantage that the accesses to them
are practically serialized. Even worse, the CPU cores have to wait on
each other, requiring an expensive form of synchronization.  By
changing the design, we try to address this problem differently.

&#60;p&#62;
In this version of the algorithm the worker processes do not filter
out the duplicates, but just send all result candidates to a special
collector process. The collector then uses a normal process-local
hash table to do the filtering.

&#60;/p&#62;&#60;p&#62;
This program uses &#60;code&#62;Netmcore_camlbox&#60;/code&#62;
and &#60;code&#62;Netcamlbox&#60;/code&#62; for message passing. The (single) receiver
of the messages has to create the box, and the (multiple) senders can
put messages into the slots of the box. Because there are several slots
for messages, the senders can normally avoid competing for the same
locks.

&#60;/p&#62;&#60;p&#62;
Of course, the workers put only the representatives of the fundamental
solutions into the box, leading to this piece of code:

&#60;/p&#62;&#60;pre&#62;
  let worker (camlbox_id, first_queen, n) =
    let cbox = 
      (Netmcore_camlbox.lookup_camlbox_sender camlbox_id : camlbox_sender) in
    let current = ref [] in
    let count = ref 0 in

    let send() =
      Netcamlbox.camlbox_send cbox (ref (Boards !current));
      current := [];
      count := 0 in
    
    solve first_queen n
      (fun b -&#38;#62;
	 let b = Array.copy b in
	 let b_list = transformations b in
	 let b_min =
	   List.fold_left
	     (fun acc b1 -&#38;#62; min acc b1)
	     (List.hd b_list)
	     (List.tl b_list) in
	 current := b_min :: !current;
	 incr count;
	 if !count = n_max then send()
      );
    if !count &#38;#62; 0 then send();
    Netcamlbox.camlbox_send cbox (ref End)
&#60;/pre&#62;

&#60;p&#62;
As a further optimization, a message carries a list of up
to &#60;code&#62;n_max&#60;/code&#62; boards rather than a single board, reducing the
overhead for message passing even more. The last message of each worker
is &#60;code&#62;End&#60;/code&#62;, signalling to the collector that this worker has
sent all its solutions.
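&#60;/p&#62;&#60;p&#62;
The batching on its own can be sketched with the standard library only.
Here &#60;code&#62;send&#60;/code&#62; is a hypothetical callback standing in for the
real message passing call, &#60;code&#62;batch_size&#60;/code&#62; plays the role
of &#60;code&#62;n_max&#60;/code&#62;, and &#60;code&#62;make_batcher&#60;/code&#62; is an invented
name. Unlike the worker above, this sketch reverses the list so that the
items arrive in insertion order.

&#60;/p&#62;&#60;pre&#62;
  (* Sketch only: collect items and hand them to send in batches. *)
  let make_batcher batch_size send =
    let current = ref [] in
    let count = ref 0 in
    (* send the pending items, if any, as one message *)
    let flush () =
      if !count &#38;#62; 0 then begin
        send (List.rev !current);
        current := [];
        count := 0
      end in
    let add x =
      current := x :: !current;
      incr count;
      if !count = batch_size then flush () in
    (add, flush)
&#60;/pre&#62;

&#60;p&#62;
Adding five items with a batch size of two and then
calling &#60;code&#62;flush&#60;/code&#62; results in three messages instead of five.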

&#60;/p&#62;&#60;p&#62;
The collector now looks as follows (some parts omitted for brevity):

&#60;/p&#62;&#60;pre&#62;
  let collector n =
    let msg_max_size =
      ((n+1) * 3 * n_max + 500) * Sys.word_size / 8 in

    let ((cbox : camlbox), camlbox_id) =
      Netmcore_camlbox.create_camlbox &#38;#34;nqueens&#38;#34; (4*n) msg_max_size in

    ... (* start workers *)

    let ht = Hashtbl.create 91 in
    let w = ref n in
    while !w &#38;#62; 0 do
      let slots = Netcamlbox.camlbox_wait cbox in
      List.iter
	(fun slot -&#38;#62;
	   ( match !(Netcamlbox.camlbox_get cbox slot) with
	       | Boards b_list -&#38;#62;
		   List.iter
		     (fun b -&#38;#62;
			if not (Hashtbl.mem ht b) then (
			  let b = Array.copy b in
			  Hashtbl.add ht b ();
			  print b
			)
		     )
		     b_list
	       | End -&#38;#62;
		   decr w
	   );
	   Netcamlbox.camlbox_delete cbox slot
	)
	slots
    done;

    ... (* wait for termination of workers *)

    printf &#38;#34;Number of solutions: %d\n%!&#38;#34; (Hashtbl.length ht)
&#60;/pre&#62;

&#60;p&#62;
The &#60;code&#62;MP&#60;/code&#62; version of the program still has a bottleneck,
namely the single collector process. Nevertheless, it already performs
better than the versions using shared hash tables:

&#60;/p&#62;&#60;p&#62;
&#60;/p&#62;&#60;table&#62;
&#60;tr&#62;&#60;td&#62;Size&#38;#160;&#38;#160;&#38;#160;&#38;#160;&#38;#160;&#60;/td&#62; &#60;td&#62;Runtime&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=8:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.030 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=9:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.039 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=10:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.072 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=11:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.072 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=12:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.166 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=13:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.501 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=14:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;3.298 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=15:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;18.806 s&#60;/td&#62;&#60;/tr&#62;
&#60;/table&#62;


&#60;h2&#62;MP2: Two filter processes&#60;/h2&#62;

&#60;p&#62;
Of course, the same idea as in &#60;code&#62;SHT2&#60;/code&#62; can also be applied here.
This leads to &#60;code&#62;MP2&#60;/code&#62;: two filter processes are started, and
each process is in charge of filtering one half of the search space.

&#60;/p&#62;&#60;p&#62;
There is a difficulty, though: to get the exact count of the
results we have to merge the result sets of both collector
processes. We do this here (in a somewhat inefficient way) by sending all
results to a master collector where they can be counted. This design
leads to the next difficulty: the message boxes for the two collectors
cannot be created before starting the workers.  Because of this, the
workers have to wait until the message boxes are created. We use
condition variables for doing so.

&#60;/p&#62;&#60;p&#62;
Although the algorithm gets complicated by this design, there is nothing
really new, and we omit it here in the article. The results are
impressive. We can even observe a super-linear speedup for N &#38;#62;= 14:

&#60;/p&#62;&#60;p&#62;
&#60;/p&#62;&#60;table&#62;
&#60;tr&#62;&#60;td&#62;Size&#38;#160;&#38;#160;&#38;#160;&#38;#160;&#38;#160;&#60;/td&#62; &#60;td&#62;Runtime&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=8:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.033 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=9:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.040 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=10:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.043 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=11:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.067 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=12:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.169 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=13:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;0.568 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=14:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;2.455 s&#60;/td&#62;&#60;/tr&#62;
&#60;tr&#62;&#60;td&#62;N=15:&#60;/td&#62; &#60;td align=&#34;right&#34;&#62;15.215 s&#60;/td&#62;&#60;/tr&#62;
&#60;/table&#62;

&#60;p&#62;
The highly interesting point is that the design of the message passing
algorithm wins although the data is even sent twice between processes
(and copied twice).  It is not the runtime of the individual operation
that counts but whether the operations executed in parallel lock each
other out or not.  The message passing algorithm avoids lock
contention between CPU cores.  Also, it is a bit more coarse-grained
because several solutions can be bundled into a single message.
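&#60;/p&#62;&#60;p&#62;
To put numbers on this comparison, the following small sketch computes
how much faster each variant is than &#60;code&#62;SHT&#60;/code&#62; for N=15. The
runtimes are copied from the tables in this article;
&#60;code&#62;runtimes_n15&#60;/code&#62; and &#60;code&#62;relative_to&#60;/code&#62; are names made
up for this illustration.

&#60;/p&#62;&#60;pre&#62;
  (* runtimes in seconds for N=15, taken from the tables above *)
  let runtimes_n15 =
    [ &#38;#34;SHT&#38;#34;, 43.739; &#38;#34;SHT2&#38;#34;, 28.027; &#38;#34;MP&#38;#34;, 18.806; &#38;#34;MP2&#38;#34;, 15.215 ]

  (* how many times faster than the baseline each variant runs *)
  let relative_to baseline =
    let t0 = List.assoc baseline runtimes_n15 in
    List.map (fun (name, t) -&#38;#62; (name, t0 /. t)) runtimes_n15

  let () =
    List.iter
      (fun (name, factor) -&#38;#62; Printf.printf &#38;#34;%-5s %.2fx\n&#38;#34; name factor)
      (relative_to &#38;#34;SHT&#38;#34;)
&#60;/pre&#62;

&#60;p&#62;
Relative to &#60;code&#62;SHT&#60;/code&#62;, the factors are roughly 1.6
for &#60;code&#62;SHT2&#60;/code&#62;, 2.3 for &#60;code&#62;MP&#60;/code&#62; and 2.9
for &#60;code&#62;MP2&#60;/code&#62;. Note that these factors compare the parallel
variants with each other, not with &#60;code&#62;SEQ&#60;/code&#62;.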

&#60;/p&#62;&#60;p&#62;
Note that this does not mean that message passing is always better.
This is just an example where it leads to a smoother way of execution,
mainly because we can organize the data stream without feedback loops.

&#60;/p&#62;&#60;p&#62;
Of course, this article is also meant to demonstrate that programming
with Netmulticore is not that complicated and relatively high-level.
The programmer does not need to deal with representations of data, for
instance (i.e. no need for a data marshalling format). Also, the
shared data structures like &#60;code&#62;Netmcore_hashtbl&#60;/code&#62; work well
enough so that we see a speedup even for a fine-grained
parallelization design like &#60;code&#62;SHT&#60;/code&#62;. This may also work for
other problems, especially scientific ones, where (easier)
coarse-grained designs are often not possible.



&#60;/p&#62;
&#60;/div&#62;

&#60;div&#62;
  Gerd Stolpmann works as an O&#38;#39;Caml consultant

&#60;/div&#62;

&#60;div&#62;
  
&#60;/div&#62;


          </description>
        </item>
      
        <item>
          <title>Test release of Netmulticore</title>
          <guid>http://blog.camlcity.org/blog/multicore2.html</guid>
          <link>http://blog.camlcity.org/blog/multicore2.html</link>
          <description>

&#60;div&#62;
  &#60;b&#62;Ocaml and multicore programming&#60;/b&#62;&#60;br/&#62;&#38;#160;
&#60;/div&#62;

&#60;div&#62;
  
Anyone who wants to check out Netmulticore can do so now: there is a test
release of Ocamlnet containing Netmulticore. There is also a new
extensive tutorial explaining everything in detail.

&#60;/div&#62;

&#60;div&#62;
  
&#60;p&#62;The Ocamlnet release (3.3.0test1) can be downloaded from the
&#60;a href=&#34;http://projects.camlcity.org/projects/ocamlnet.html&#34;&#62;project page&#60;/a&#62;.
The reference manual contains the &#60;a href=&#34;http://projects.camlcity.org/projects/dl/ocamlnet-3.3.0test1/doc/html-main/Netmcore_tut.html&#34;&#62;tutorial&#60;/a&#62;.


&#60;/p&#62;
&#60;/div&#62;

&#60;div&#62;
  Gerd Stolpmann works as an O&#38;#39;Caml consultant

&#60;/div&#62;

&#60;div&#62;
  
&#60;/div&#62;


          </description>
        </item>
      
        <item>
          <title>Netmulticore works!</title>
          <guid>http://blog.camlcity.org/blog/multicore1.html</guid>
          <link>http://blog.camlcity.org/blog/multicore1.html</link>
          <description>

&#60;div&#62;
  &#60;b&#62;Ocaml and multicore programming&#60;/b&#62;&#60;br/&#62;&#38;#160;
&#60;/div&#62;

&#60;div&#62;
  
Netmulticore is my attempt at solving the multicore puzzle for
Ocaml. It has now reached a development stage where I can run test
programs, and I see real speedups. Although not everything is perfect
yet, the API has stabilized a bit, and it is close to being ready for
broader testing. Expect a test release in the next few days. I hope to
finish it before the Ocaml Meeting on Friday, so you guys have
something to talk about. (Unfortunately, I cannot attend.)

&#60;/div&#62;

&#60;div&#62;
  
&#60;p&#62;The approach of Netmulticore is unusual, but AFAIK the only one that
works without modifying the Ocaml runtime and/or compiler.
Instead of using kernel threads and implicitly sharing memory,
Netmulticore forks full-fledged processes as thread replacements, and
allocates explicit shared memory that is accessible by all worker
processes. By doing this allocation directly at program startup, it is
ensured that all processes see the shared memory block at the same
address (which would not be the case if the processes mapped the block
individually). The shared block is now treated in a very special way
- shared memory has to cope with some difficulties that normally do
not exist. All in all, this creates a setup where each process has its
normal Ocaml heap (which is not shared with other processes) plus
access to the shared block. This is not a bad situation provided we
can make the access to the shared block as convenient as possible, and
this is what Netmulticore is really about. If we can achieve a nice
programming API for dealing with shared data, the multicore issue is
solved, and probably even in a more scalable way than in other
runtimes where all threads have to share a single heap, and constantly
step on each other&#38;#39;s toes.

&#60;/p&#62;&#60;h2&#62;Self-contained shared heaps&#60;/h2&#62;

&#60;p&#62;So what we want to have is that we can store normal Ocaml values
into the shared block - including all regular data representations
such as tuples, variants, records. Also, we want to allow
modifications of values - immutable shared memory is not worth the
fun. Netmulticore solves both issues - it provides direct read access
to Ocaml values stored in the shared block, and it allows
modifications (but the programmer has to follow a special API).

&#60;/p&#62;&#60;p&#62;The initially allocated shared block is broken down into smaller
units called &#60;i&#62;shared heaps&#60;/i&#62;. The heaps do not have a fixed size
but can grow and shrink (just like the regular Ocaml heap).  The heaps
are now the containers for the Ocaml values: When creating a heap, one
can copy an initial Ocaml value into it, and by following the special
rules for mutation it is possible to put more values into it (or
remove values). The shared heaps are structured similarly to the
normal Ocaml heap, and contain the value blocks densely packed one
after the other. Free areas are managed with a free list. When the
shared heap fills up, a specially implemented garbage collector tries
to find unreachable values and reclaims the space (using the
mark-and-sweep design). Each shared heap has a lock which synchronizes
accesses to it, and, unfortunately, limits the degree of
concurrency. In particular, only one process can write to a heap at a
time. The programmer can, if necessary, work around this limitation by
using several heaps: each heap has its own lock, so clever
application designs can avoid lock contention.

&#60;/p&#62;&#60;p&#62;This sounds like a nice idea, but there are some traps. In the next
two paragraphs I&#38;#39;ll try to give an impression of the difficulties. The
essence is, after all, that managing shared heaps requires a
disciplined programmer, or the memory gets corrupted. This is the
downside of the Netmulticore approach - it is quite easy to crash the
program by not following one of the programming rules. (This is,
however, also true for &#38;#34;normal&#38;#34; multi-threaded programming.)

&#60;/p&#62;&#60;p&#62;The first difficulty is that this design requires that the shared
heaps are self-contained. This means that no pointer must ever exist
that points from a shared heap to the normal process-local Ocaml heap,
or from a shared heap to a different shared heap. The first kind of
pointer would cause invalid memory accesses if a second process
dereferenced such a pointer. The second kind of pointer confuses the
garbage collector. What is still allowed, of course, are pointers from
process-local memory to the shared heaps. (The garbage collector built
into the Ocaml runtime fortunately does not follow such pointers.)
However, one should be careful: the garbage collector cleaning the
shared heap from time to time will not see that such &#38;#34;external&#38;#34;
pointers exist, and will not keep the referenced data alive. It is
left to the programmer to do something about it.

&#60;/p&#62;&#60;p&#62;The second difficulty is how to actually do mutation in shared
heaps. If you have ever written a memory manager, you&#38;#39;ll probably
know the problem: Each allocation can cause a GC run, and this can
invalidate what you&#38;#39;ve just put into the heap but is not yet
considered as reachable by the GC. The solution is to keep a set of
further roots, i.e. pointers the GC must also consider although they
are not in the memory region the GC manages. I omit here the details -
the point is that mutation requires a special procedure so that such
additional roots can be managed. This is a bit like declaring
arguments of wrapper functions with the CAMLparam macros. A similar
convention exists for Netmulticore, only that it has to be done on the
Ocaml level.

&#60;/p&#62;&#60;h2&#62;Higher-level data structures&#60;/h2&#62;

Fortunately, the programmer does not need to remember all this
low-level stuff most of the time, because there are a number of
ready-to-use data containers. These are already developed on top of
the raw shared heap structure, and are a lot safer to use:

&#60;ul&#62;
  &#60;li&#62;Netmcore_array: Keeps data in an array, and provides synchronization
      for accessing array elements
  &#60;/li&#62;&#60;li&#62;Netmcore_matrix: a two-dimensional array
  &#60;/li&#62;&#60;li&#62;Netmcore_buffer: A shared string buffer where one can add strings
      at the end and remove data from the beginning
  &#60;/li&#62;&#60;li&#62;Netmcore_queue: A shared queue very much like the Queue module of
      the standard library, but again with additional synchronization
  &#60;/li&#62;&#60;li&#62;Netmcore_hashtbl: A shared hash table very much like the Hashtbl module of
      the standard library, but again with additional synchronization
  &#60;/li&#62;&#60;li&#62;Netmcore_ref: A single shared variable (like a shared &#38;#34;ref&#38;#34; variable)
&#60;/li&#62;&#60;/ul&#62;

&#60;p&#62;
In some sense, these modules are &#38;#34;ports&#38;#34; of the corresponding data
structures that are provided by the standard library. During porting I
especially had to change the way the data is mutated so that the mentioned
programming rules are followed. (This is the main reason why these
ports exist - you cannot put a normal Hashtbl into shared memory,
because the normal mutation breaks the rules for shared heaps.
There is no such problem when read-only data structures are copied
to shared heaps.)

&#60;/p&#62;&#60;h2&#62;Synchronization primitives&#60;/h2&#62;

The mentioned data types have - to some degree - built-in protection
against uncontrolled parallel access. Sometimes, however, it is useful
to have additional ways of managing synchronization:

&#60;ul&#62;
  &#60;li&#62;Netmcore_sem: Semaphores
  &#60;/li&#62;&#60;li&#62;Netmcore_mutex: Mutexes (including normal and recursive ones)
  &#60;/li&#62;&#60;li&#62;Netmcore_condition: Condition variables
&#60;/li&#62;&#60;/ul&#62;

The condition variables have a somewhat unpleasant API, because it is the
task of the caller to allocate a special block of memory for each
process that can be suspended. In system-level implementations of
condition variables this block can be hidden from the user (it can be
put into the thread control block). As we don&#38;#39;t have access to
something like a process-local but nevertheless shared place, the only
solution I see right now is to delegate this obligation to the
caller. But anyway, the important message is that Netmulticore provides
condition variables, and that it is thus easy to signal (or &#38;#34;broadcast&#38;#34;)
suspended processes.

&#60;h2&#62;Message passing&#60;/h2&#62;

Netmulticore also integrates nicely with Camlbox, the message passing
API that has existed since Ocamlnet-3. Camlboxes allow Ocaml values to
be sent from a number of sender processes to a single receiver process.
The implementation of Camlboxes also uses shared memory (and not
sockets), and is very fast.

&#60;h2&#62;The code, please!&#60;/h2&#62;

What I&#38;#39;ve described in this article already works. Netmulticore
is distributed as part of Ocamlnet, and is right now only available
in the svn repository:

&#60;ul&#62;
  &#60;li&#62;&#60;a href=&#34;https://godirepo.camlcity.org/svn/lib-ocamlnet2/trunk/code/src/netmulticore/&#34;&#62;Netmulticore source directory&#60;/a&#62;
  &#60;/li&#62;&#60;li&#62;&#60;a href=&#34;https://godirepo.camlcity.org/svn/lib-ocamlnet2/trunk/code/examples/multicore/&#34;&#62;Examples for Netmulticore&#60;/a&#62;
&#60;/li&#62;&#60;/ul&#62;

&#60;p&#62;You&#38;#39;ll probably have to check out the whole tree to build it.

&#60;/p&#62;&#60;p&#62;The skeptical reader is strongly encouraged to look into the examples,
just to see how nice the code using Netmulticore looks. It is the first
time you can use shared memory from Ocaml without having to deal with
data marshalling issues (because we don&#38;#39;t use this technique). Actually,
the code looks very much like multi-threaded programming, only that
here and there different primitives need to be used.

&#60;/p&#62;&#60;h2&#62;What next?&#60;/h2&#62;

&#60;p&#62;
It is very important to create sample programs that use a library like
Netmulticore, because problems only show up in practice, and are hard
to predict. There are already three non-trivial examples, and I
plan to write a few more. Expect also blog articles about this and
about the performance (the numbers I got so far are promising).

&#60;/p&#62;&#60;p&#62;
The Netmulticore implementation has reached some &#38;#34;beta quality&#38;#34;, at
least. We need a few improvements here and there, but generally the
code exists and works.

&#60;/p&#62;&#60;p&#62;
It is generally a good idea to look for ways to make Netmulticore
programming safer. As pointed out before, it is right now easy to
crash the program (with a segfault) when missing one of the special
programming rules. The OCaml compiler could perhaps help here and
prevent some mistakes. What I especially have in mind here is a typing
annotation recording whether a value is in a shared heap. This annotation
would normally be invisible to the user (like the polarity annotation), but
kick in at compile time when a reference to a non-shared value is
stored into a shared value. However, this annotation would probably
require a modification of the compiler.

&#60;/p&#62;&#60;p&#62;
If anybody is interested in testing Netmulticore, please write to me.
Optimizing programs for multicore is tricky, and getting more
experience here would allow us to make a good step forward.


&#60;/p&#62;
&#60;/div&#62;

&#60;div&#62;
  Gerd Stolpmann works as an O&#38;#39;Caml consultant

&#60;/div&#62;

&#60;div&#62;
  
&#60;/div&#62;


          </description>
        </item>
      
        <item>
          <title>Why Map/Reduce matters</title>
          <guid>http://blog.camlcity.org/blog/plasma3.html</guid>
          <link>http://blog.camlcity.org/blog/plasma3.html</link>
          <description>

&#60;div&#62;
  &#60;b&#62;It is time for functional programming&#60;/b&#62;&#60;br/&#62;&#38;#160;
&#60;/div&#62;

&#60;div&#62;
  
&#60;p&#62;
Recently, Map/Reduce got a lot of attention in the media. Of course,
this has nothing to do with the fact that it is functional
programming, but more with the company that has invented it
(Google). However, there is hope that some of the curiosity of the
public can be diverted to functional programming as such. The author
explains this by the way memory is accessed, and why an implementation
of Map/Reduce in an FP language is important.

&#60;cc-field name=&#34;maintext&#34;&#62;
&#60;p&#62;
Functional programming (FP) is still a niche paradigm. It is widely used
neither in the academic world nor in industry, although it is
gaining acceptance. I think this is in some way anachronistic, not
only because FP adoption could lead to more correct programs, but also
because the hardware is not used optimally when it is ignored. I hope
the reader stumbles here a bit: isn&#38;#39;t FP more memory-intensive than
imperative programming? In some naive way this is true, but one also
has to take into account that we now live in a world of cheap memory.
Nowadays it does not matter much for the efficiency of a program
whether data copies are avoided as much as possible. Other criteria
also play a role, for instance whether a program can easily be
rewritten in a distributed way so it can run on a compute cluster.  FP
shines here, because functional blocks can more easily be refactored
in a distributed way.

&#60;/p&#62;&#60;h2&#62;The world of cheap memory&#60;/h2&#62;

The cost of RAM and disks drops constantly, and the available
bandwidth for accessing memory increases. There is no sign that this
development stops or slows down; Moore&#38;#39;s law is still applicable. This
has, over time, led to a situation where 1G of RAM does not cost more
than an hour of developer time. A disk of 1T is the equivalent of 1-2
work days. Nevertheless, most programmers prefer coding styles that
seem to ignore reality. In particular, it is still considered good
practice to avoid data copying at all costs, e.g. by using dangerous
techniques like passing around pointers to data structures, which
becomes no better when one encapsulates these structures into objects.
My theory is that the current generation of developers is still used
to these bad habits because this group grew up in a different world
where RAM was tight and the disk always full. In the old times it was
better to save memory, even at the cost of increased instability; the
24/7 requirement was not yet born. Nowadays, however, doing so is a
bad strategy: a lot of time has to go into bug fixing,
and the maintenance of such software is extremely costly.  The
old habits no longer fit.

&#60;h2&#62;FP and RAM&#60;/h2&#62;

The FP scene mostly considers FP as a calculus, as a mathematically
founded way of programming. This is cathedral thinking, and it
overlooks a lot of real-world phenomena. Maybe we should start
explaining FP in a different way that works better for the bazaar. FP
also changes the way memory is accessed. Here are
two examples: first, Ocaml&#38;#39;s functional record updates, and second,
Map/Reduce.

&#60;p&#62;Ocaml&#38;#39;s record type is special because it allows one to mix
imperative and functional styles. Let&#38;#39;s look quickly at the latter.
After defining a record type, as in

&#60;/p&#62;&#60;pre&#62;
type order =
  { id : int;
    customer : customer;
    items : item list;
    price : float
  }
&#60;/pre&#62;

one can create new &#60;code&#62;order&#60;/code&#62; values with the syntax

&#60;pre&#62;
let my_order =
  { id = 5;
    customer = my_customer;
    items = my_items;
    price = compute_price my_items
  }
&#60;/pre&#62;

However, by default there is no way of changing the fields of the
record - the fields are immutable! Although one can add the keyword
&#60;code&#62;mutable&#60;/code&#62; to the type declaration and change that, the
intention is to encourage the programmer to use records
differently. Instead of overwriting fields, the programmer can
create a flat (shallow) copy of the record in which some fields are changed:

&#60;pre&#62;
let my_updated_order =
  { my_order with
      items = my_updated_items;
      price = compute_price my_updated_items
  }
&#60;/pre&#62;

This is called a functional update. The original order is still
unmodified, and the updated order shares the &#60;code&#62;customer&#60;/code&#62;
field with the original one (i.e. it points to the same memory). This
means that we create a copy, but the copy is limited to one level, and
it is quite cheap in terms of memory and CPU consumption (compared to,
say, a C++ copy constructor). If the programmer also declares
the &#60;code&#62;customer&#60;/code&#62; type as immutable, there is no problem
with the fact that the &#60;code&#62;customer&#60;/code&#62; field is shared. It
cannot be changed anyway, and so a modification in one record instance
cannot cause unwanted effects for the second instance.  A good way of
picturing this way of managing data is to see such a record as a
version of a value. Changing the value means creating a second instance
as a second version. Both versions continue to live independently of each
other.
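
&#60;p&#62;
The sharing can be checked directly. The following is only a minimal
sketch: the &#60;code&#62;customer&#60;/code&#62; and &#60;code&#62;item&#60;/code&#62; types and the
body of &#60;code&#62;compute_price&#60;/code&#62; are assumptions made up for
illustration, because the snippets above leave them undefined.
&#60;/p&#62;&#60;pre&#62;
(* Hypothetical helper types; the article does not define them. *)
type customer = { customer_id : int }
type item = { unit_price : float; quantity : int }

type order =
  { id : int;
    customer : customer;
    items : item list;
    price : float
  }

(* Assumed pricing rule: sum of unit_price times quantity. *)
let add_item acc it = acc +. it.unit_price *. float_of_int it.quantity
let compute_price items = List.fold_left add_item 0.0 items

let my_customer = { customer_id = 42 }
let my_items = [ { unit_price = 2.5; quantity = 2 } ]

let my_order =
  { id = 5;
    customer = my_customer;
    items = my_items;
    price = compute_price my_items
  }

let my_updated_items = { unit_price = 1.0; quantity = 3 } :: my_items

let my_updated_order =
  { my_order with
      items = my_updated_items;
      price = compute_price my_updated_items
  }

let () =
  (* The first version is untouched by the update. *)
  assert (my_order.price = 5.0);
  (* The second version is consistent with its own items. *)
  assert (my_updated_order.price = 8.0);
  (* Physical equality: both versions point to the same customer. *)
  assert (my_updated_order.customer == my_order.customer)
&#60;/pre&#62;
&#60;p&#62;
The last assertion uses physical equality (&#60;code&#62;==&#60;/code&#62;), so it
only holds because the copy reuses the existing &#60;code&#62;customer&#60;/code&#62;
value instead of duplicating it.
&#60;/p&#62;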

&#60;p&#62;
In imperative code, the record fields are usually changed
directly.  Of course, this still saves a bit more memory than the
sharing trick Ocaml implements. However, it is also more dangerous in
two respects: First, the programmer might lose track of where in the
program references to the record are held (which happens easily in
development teams), so the overwritten field would also be visible in
parts of the program where this is unexpected. Second, it is more
difficult to ensure consistency. For example, in the &#60;code&#62;order&#60;/code&#62; example
the &#60;code&#62;price&#60;/code&#62; field is a dependent field - it is computed
from the &#60;code&#62;items&#60;/code&#62;. When the fields are set in two
consecutive assignment statements, it is possible that this goes wrong
(e.g. when an exception is raised between the assignments).

&#60;/p&#62;&#60;p&#62;
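The consistency problem can be reproduced in a few lines. This is a
sketch under stated assumptions: the two cart types, the
&#60;code&#62;Empty_order&#60;/code&#62; exception and the rule that an empty order
cannot be priced are all invented for the illustration, and ref cells
stand in for fields declared &#60;code&#62;mutable&#60;/code&#62;.
&#60;/p&#62;&#60;pre&#62;
exception Empty_order

(* Assumed stand-in for a price computation that can fail. *)
let compute_total prices =
  if prices = [] then raise Empty_order
  else List.fold_left ( +. ) 0.0 prices

(* Imperative variant; ref cells play the role of mutable fields. *)
type cart_imp = { prices_ref : float list ref; total_ref : float ref }

let update_in_place c new_prices =
  c.prices_ref := new_prices;
  (* If this raises, prices_ref is already overwritten, total_ref is stale. *)
  c.total_ref := compute_total new_prices

(* Functional variant with immutable fields. *)
type cart_fun = { id : int; prices : float list; total : float }

let update_functional c new_prices =
  { c with prices = new_prices; total = compute_total new_prices }

let () =
  let c = { prices_ref = ref [ 1.0; 2.0 ]; total_ref = ref 3.0 } in
  (try update_in_place c [] with Empty_order -> ());
  (* Half-updated: the prices changed, the total did not. *)
  assert (!(c.prices_ref) = []);
  assert (!(c.total_ref) = 3.0);
  let v1 = { id = 1; prices = [ 1.0; 2.0 ]; total = 3.0 } in
  let v2 = try update_functional v1 [] with Empty_order -> v1 in
  (* No new version was created, and the old one is still consistent. *)
  assert (v2 == v1);
  assert (v1.prices = [ 1.0; 2.0 ]);
  assert (v1.total = 3.0)
&#60;/pre&#62;&#60;p&#62;
In the functional variant the new record is built in a single
expression, so a failure means that no new version exists at all.
&#60;/p&#62;&#60;p&#62;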
Thinking of data modifications as a sequence of versions is a typical
design pattern of functional programming. We have seen that it
consumes slightly more RAM, but also that it gives the programmer a
lot more control over the coordination of data changes and that it
helps to ensure consistency.

&#60;/p&#62;&#60;h2&#62;FP and disks&#60;/h2&#62;

Since Google published its paper about Map/Reduce, this framework
has been considered an attractive alternative for data warehousing, i.e.
for managing giant amounts of data. Interestingly, it hasn&#38;#39;t received
much attention that Map/Reduce follows a functional paradigm -
although &#38;#34;map&#38;#34; and &#38;#34;reduce&#38;#34; are FP lingo, and one must be blind not
to see it. It is viewed more as an alternative to relational databases
(&#38;#34;No-SQL&#38;#34; is the keyword here).

&#60;p&#62;
Map/Reduce processes data files strictly sequentially, and creates a
sequence of versions of a set of input files, and finally stores the
last generation of versions into the output files. In particular,
Map/Reduce never overwrites existing files, neither as a whole nor in
parts. This is clearly a functional way of manipulating data,
in the same way as the functional record updates I&#38;#39;ve
explained above. The only difference is that the data units are bigger
and stored on disk. Interestingly, the reason why this is advantageous
also has to do with the characteristics of the underlying memory:
Disks allow higher data transfer rates when files are read or written
sequentially, whereas random accesses are quite expensive. By
sticking to sequential accesses only, Map/Reduce scales by design
into regions where the speed of even big relational databases is
limited by unavoidable random disk seeks.  Another advantage is that
Map/Reduce can be easily distributed - every component function of the
scheme can be run on its own machine. As these functions do not have
side effects, they can run independently of each other,
provided the machines have access to the shared file system.
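&#60;/p&#62;&#60;p&#62;
The scheme can be sketched with plain lists. This is a toy, purely
in-memory illustration and not the API of Plasma or of any other
framework: the inner lists stand in for input files, and all function
names as well as the choice of key (the remainder modulo 3) are
assumptions made up for the example.
&#60;/p&#62;&#60;pre&#62;
(* Toy in-memory sketch: each inner list stands in for one input file. *)
let chunks = [ [ 1; 2; 3 ]; [ 4; 5; 6 ]; [ 7; 8 ] ]

(* Map step: turn one record into a (key, value) pair. No side effects. *)
let map_record n = (n mod 3, n)
let map_chunk chunk = List.map map_record chunk

(* Reduce step: sum up all values that carry the given key. *)
let reduce_key pairs key =
  let add acc (k, v) = if k = key then acc + v else acc in
  (key, List.fold_left add 0 pairs)

let map_reduce input =
  (* Every map_chunk call is independent, so each could run elsewhere. *)
  let mapped = List.concat (List.map map_chunk input) in
  let keys = List.sort_uniq compare (List.map fst mapped) in
  (* The result is a new list; nothing in the input is overwritten. *)
  List.map (reduce_key mapped) keys

let () =
  assert (map_reduce chunks = [ (0, 9); (1, 12); (2, 15) ]);
  (* The input version still exists unchanged next to the output. *)
  assert (chunks = [ [ 1; 2; 3 ]; [ 4; 5; 6 ]; [ 7; 8 ] ])
&#60;/pre&#62;&#60;p&#62;
A real framework would read and write the chunks as files and sort the
intermediate pairs on disk, but the data flow is the same: each step
produces a new version of the data from the previous one.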

&#60;/p&#62;&#60;h2&#62;Explaining FP differently&#60;/h2&#62;

It is time for FP because we finally have enough memory for
FP to run well, and for its advantages to clearly outweigh its
costs.  FP means accessing memory differently: data is no longer
overwritten, but sequences of versions are created to represent the
data modification. This works for both RAM-based and disk-based
data. We have seen a lot of advantages:

&#60;ul&#62;
&#60;li&#62;Better control over the scope of data modifications
&#60;/li&#62;&#60;li&#62;Stronger guarantees of data consistency
&#60;/li&#62;&#60;li&#62;For disks: Higher data access rates when big chunks of data are
transformed rather than modified in place
&#60;/li&#62;&#60;li&#62;For clusters: Functional decomposition schemes are easier to
distribute in compute clusters
&#60;/li&#62;&#60;/ul&#62;

In the case of RAM, the costs are moderate, and can be neglected for
current computers with multiple gigabytes of RAM. For disks, a strict
functional data scheme may even reduce random seeks, but of course
this is only viable for &#38;#34;offline&#38;#34; data preparation jobs (where no
ad-hoc data modifications are required).

&#60;h2&#62;Plasma&#60;/h2&#62;

As Map/Reduce is currently the hot topic, it is important to get the
ball back into the field of FP. Right now, the widely used open source
implementation of Map/Reduce is written in Java, a language that only
verbosely manages in-place memory modifications instead of going one
step further, namely offering true FP constructs to the
programmers. However, an FP scheme like Map/Reduce becomes even more
useful when the framework is also written in an FP language.

&#60;p&#62;&#60;a href=&#34;http://plasma.camlcity.org&#34;&#62;Plasma&#60;/a&#62;
is my project that will offer an alternative. Plasma is written in
Ocaml, and the developer can seamlessly use Map/Reduce to run
functional programs in distributed ways. Although it is still in an
early development stage, it is already usable and can demonstrate why
an FP implementation is better here. Without going too much into detail
(which I happily postpone to future articles), one can imagine that it
is easier to introduce Map/Reduce to programs that are globally
structured by functional decomposition anyway.

&#60;/p&#62;&#60;p&#62;With some luck, FP thinking as a whole will finally get the attention,
rather than a single technique like Map/Reduce. It deserves it.
&#60;/p&#62;&#60;/cc-field&#62;
&#60;/p&#62;
&#60;/div&#62;

&#60;div&#62;
  
&#60;/div&#62;

&#60;div&#62;
  Gerd Stolpmann works as an O&#38;#39;Caml consultant

&#60;/div&#62;

&#60;div&#62;
  
&#60;/div&#62;


          </description>
        </item>
      
  </channel>
</rss>
