BLOG ON CAMLCITY.ORG: Plasma

Why Map/Reduce matters

It is time for functional programming - by Gerd Stolpmann, 2010-11-08

Recently, Map/Reduce got a lot of attention in the media. Of course, this has nothing to do with the fact that it is functional programming, but more with the company that has invented it (Google). However, there is hope that some of the curiosity of the public can be diverted to functional programming as such. The author explains this by the way memory is accessed, and why an implementation of Map/Reduce in an FP language is important.

Functional programming (FP) is still a niche paradigm. Neither in the academic world nor in the industry it is widely used, although it is gaining acceptance. I think this is in some way anachronistic - not only because FP adoption could lead to more correct programs, but also because the hardware is not used optimally when it is ignored. I hope the reader stumbles here a bit - isn't FP more memory-intensive than imperative programming? In some naive way this is true, but one has also take into account that we live now in a world of cheap memory. Nowadays it does not matter much for the efficiency of a program whether data copies are avoided as much as possible. Other criterions also play a role, for instance whether a program can easily be rewritten in a distributed way so it can run on a compute cluster. FP shines here, because functional blocks can more easily be refactored in a distributed way.

The world of cheap memory

The cost for RAM and disks drop constantly, and the available bandwidth for accessing memory increases. There is no sign that this development stops or slows down, Moore's law is still applicable. This has, over time, lead to a situation where 1G of RAM does not cost more than an hour of developer time. A disk of 1T is the equivalence of 1-2 work days. Nevertheless, most programmers prefer coding styles that seem to ignore reality. Especially, it is still considered good practice to avoid data copying at all costs, e.g. by using dangerous techniques like passing around pointers to data structures - which becomes no better when one encapsulates these structures into objects. My theory is that the current generation of developers is still used to these bad habits because this group grew up in a different world where RAM was tight and the disk always full. In the old times it was better to save memory, even at the cost of increased instability - the 24/7 requirement was not yet born. Nowadays, however, doing so is a bad strategy, and causes that a lot of time has to go into bug fixing, and that the maintenance of such software is extremely costly. The old habits fit no more.

FP and RAM

The FP scene mostly considers FP as a calculus, as a mathematically founded way of programming. This is cathedral thinking, and it overlooks a lot of real-world phenomenons, and maybe we should start explaining FP in a different way that works better for the bazaar. FP also means that the way changes how memory is accessed. We have here two examples: First, Ocaml's functional record updates, and second Map/Reduce.

Ocaml's record type is special because it allows one to mix imperative and functional styles. Let's look quickly at the latter. After defining a record type, as in

type order =
  { id : int;
    customer : customer;
    items : item list;
    price : float
  }

one can create new order values with the syntax

let my_order =
  { id = 5;
    customer = my_customer;
    items = my_items;
    price = compute_price my_items
  }

However, there is by default no way of changing the fields of the record - the fields are immutable! Although one can add the keyword mutable to the type declaration and change that, the intention is to encourage the programmer to use records differently. Instead of overwriting fields, the programmer can also create a flat copy of the record where fields are changed:

let my_updated_order =
  { my_order with
      items = my_updated_items;
      price = compute_price my_updated_items
  }

This is called a functional update. The original order is still unmodified, and the updated order shares the customer field with the original one (i.e. it points to the same memory). This means that we create a copy, but it is only limited to one level, and it is quite cheap in terms of memory and CPU consumption (compared to, say, a C++ copy constructor). If the programmer declares the customer type also as immutable, there is no problem with the fact that the customer field is shared. It cannot be changed anyway, and so a modification in one record instance cannot cause unwanted effects for the second instance. A good way of imagining this way of managing data is to see such a record as a version of a value. Changing means to create a second instance as a second version. Both versions continue to live independently from each other.

In imperative code, the record fields are usually directly changed. Of course, this is still a bit more memory-saving than the sharing trick Ocaml implements. However, it is also more dangerous in two respects: First, the programmer might lose control where in the program the records are stored (and this is easy in development teams), and the overwritten field would also be visible in parts of the program where it is unexpected. Second, it is more difficult to ensure consistency. For example, in the order example the price field is a dependent field - it is computed from the items. When the fields are set in two consecutive assignment statements, it is possible that this goes wrong (e.g. when an exception is raised between the assignments).

Thinking data modifications as a sequence of versions is a typical design pattern of functional programming. We have seen that it consumes slightly more RAM, but also that it gives the programmer a lot more control about the coordination of data changes and that it helps to ensure consistency.

FP and disks

Since Google has published the paper about Map/Reduce, this framework is considered as an attractive alternative for data warehousing, i.e. for managing giant amounts of data. Interestingly, it hasn't found much attention that Map/Reduce follows a functional paradigm - although "map" and "reduce" are FP lingo, and one must be blind not to see it. It is more viewed as an alternative to relational databases ("No-SQL" is the keyword here).

Map/Reduce processes data files strictly sequentially, and creates a sequence of versions of a set of input files, and finally stores the last generation of versions into the output files. Especially, Map/Reduce never overwrites existing files, neither as a whole nor in parts. Of course, this is clearly a functional way of doing data manipulation, in the same way as the functional record updates I've explained above. The only difference is that the data units are bigger and stored on disk. Interestingly, the reason why this is advantageous has also to do with the characteristics of the underlying memory: Disks allow higher data transfer rates when files are sequentially read or written, whereas random accesses are quite expensive. By sticking to sequential accesses only, Map/Reduce is by design scalable into regions where the speed of even big relational databases is limited by unavoidable random disk seeks. Another advantage is that Map/Reduce can be easily distributed - every part function of the scheme can be run on its own machine. As these functions do not have side effects, the part functions can run independently of each other, provided the machines have access to the shared file system.

Explaining FP differently

It is time for FP because we finally have the amount of memory so that FP can run well, and so that the advantages clearly outweigh the costs. FP means accessing memory differently: data is no longer overwritten, but sequences of versions are created to represent the data modification. This works for both RAM-based and disk-based data. We have seen a lot of advantages:

Better control about the scope of data modifications
Higher guarantees about data consistency
For disks: Higher data access rates when big chunks of data are transformed rather than modified in place
For clusters: Functional decomposition schemes are easier to distribute in compute clusters

In case of RAM, the costs are moderate, and can be neglected for current computers with multiple gigabytes of RAM. For disks, a strict functional data scheme may even reduce random seeks, but of course this is only viable for "offline" data preparation jobs (no ad-hoc data modifications are required).

Plasma

As Map/Reduce is currently the hot topic, it is important to get the ball back into the field of FP. Right now, the widely used open source implementation of Map/Reduce is written in Java, a language that only verbosely manages in-place memory modifications instead of going one step further, namely offering true FP constructs to the programmers. However, an FP scheme like Map/Reduce becomes even more useful when the framework is also written in an FP language.

Plasma is my project that will offer an alternative. Plasma is written in Ocaml, and the developer can seamlessly use Map/Reduce to run functional programs in distributed ways. Although it is still in an early development stage, it is already usable and can demonstrate why an FP implementation is better here. Without going to much into detail (what I happily postpone to future articles), one can imagine that it is easier to introduce Map/Reduce to programs that are globally structured by functional decomposition anyway.

With some luck FP thinking will finally get the attention instead of a single technique like Map/Reduce. It finally deserves it.

Gerd Stolpmann works as O'Caml consultant

This web site is published by Informatikbüro Gerd Stolpmann

Plasma	GitLab	Archive
Projects	Blog	Knowledge