Recently, Map/Reduce has received a lot of attention in the media. Of course,
this has less to do with the fact that it is functional programming
than with the company that invented it (Google). However, there is
hope that some of the public's curiosity can be diverted to functional
programming as such. The author explains FP through the way memory is
accessed, and why an implementation of Map/Reduce in an FP language is
important.
Functional programming (FP) is still a niche paradigm. It is not
widely used in either the academic world or the industry, although it
is gaining acceptance. I think this is in some ways anachronistic - not
only because FP adoption could lead to more correct programs, but also
because the hardware is not used optimally when FP is ignored. I hope
the reader stumbles here a bit - isn't FP more memory-intensive than
imperative programming? In some naive way this is true, but one also
has to take into account that we now live in a world of cheap memory.
Nowadays it does not matter much for the efficiency of a program
whether data copies are avoided as much as possible. Other criteria
also play a role, for instance whether a program can easily be
rewritten in a distributed way so it can run on a compute cluster. FP
shines here, because functional blocks can more easily be refactored
in a distributed way.

The world of cheap memory

The costs for RAM and disks drop constantly, and the available
bandwidth for accessing memory increases. There is no sign that this
development is stopping or slowing down; Moore's law still applies.
Over time, this has led to a situation where 1G of RAM does not cost
more than an hour of developer time, and a 1T disk is the equivalent
of 1-2 work days. Nevertheless, most programmers prefer coding styles
that seem to ignore this reality. In particular, it is still
considered good practice to avoid data copying at all costs, e.g. by
using dangerous techniques like passing around pointers to data
structures - which becomes no better when one encapsulates these
structures in objects. My theory is that the current generation of
developers is still used to these bad habits because it grew up in a
different world, where RAM was tight and the disk always full. In the
old times it was better to save memory, even at the cost of increased
instability - the 24/7 requirement was not yet born. Nowadays,
however, this is a bad strategy: it forces a lot of time to go into
bug fixing, and it makes the maintenance of such software extremely
costly. The old habits no longer fit.

FP and RAM

The FP scene mostly considers FP as a calculus, as a mathematically
founded way of programming. This is cathedral thinking; it overlooks
a lot of real-world phenomena, and maybe we should start explaining FP
in a different way that works better for the bazaar. FP also changes
the way memory is accessed. We look at two examples here: first,
Ocaml's functional record updates, and second, Map/Reduce.
Ocaml's record type is special because it allows one to mix
imperative and functional styles. Let's look quickly at the latter.
After defining a record type, as in

type order =
  { id : int;
    customer : customer;
    items : item list;
    price : float
  }

one can create new order values with the syntax

let my_order =
  { id = 5;
    customer = my_customer;
    items = my_items;
    price = compute_price my_items
  }

However, there is by default no way of changing the fields of the
record - the fields are immutable! Although one can add the keyword
mutable to the type declaration and change that, the intention is to
encourage the programmer to use records differently. Instead of
overwriting fields, the programmer can create a flat copy of the
record in which some fields are replaced:

let my_updated_order =
  { my_order with
    items = my_updated_items;
    price = compute_price my_updated_items
  }

This is called a functional update. The original order is still
unmodified, and the updated order shares the customer field with the
original one (i.e. it points to the same memory). This means that we
create a copy, but it is limited to one level, and it is quite cheap
in terms of memory and CPU consumption (compared to, say, a C++ copy
constructor). If the programmer declares the customer type as
immutable as well, there is no problem with the fact that the customer
field is shared: it cannot be changed anyway, so a modification in one
record instance cannot cause unwanted effects in the other instance. A
good way of imagining this way of managing data is to see such a
record as a version of a value. Changing it means creating a second
instance as a second version. Both versions continue to live
independently of each other.
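The sharing can even be observed directly with Ocaml's physical
equality operator (==). Here is a minimal, self-contained sketch; the
stub definitions of customer, item and compute_price are only
assumptions for illustration:

type customer = { name : string }             (* stub type, for illustration only *)
type item = { article : string; qty : int }   (* stub type, for illustration only *)

type order =
  { id : int;
    customer : customer;
    items : item list;
    price : float
  }

(* dummy pricing function, assumed for the example *)
let compute_price items = 10.0 *. float_of_int (List.length items)

let () =
  let my_customer = { name = "N.N." } in
  let my_items = [ { article = "keyboard"; qty = 1 } ] in
  let my_order =
    { id = 5; customer = my_customer;
      items = my_items; price = compute_price my_items } in
  let my_updated_items = { article = "mouse"; qty = 2 } :: my_items in
  let my_updated_order =
    { my_order with
      items = my_updated_items;
      price = compute_price my_updated_items } in
  (* both versions physically share the customer field ... *)
  assert (my_updated_order.customer == my_order.customer);
  (* ... while the original version remains unmodified *)
  assert (my_order.price <> my_updated_order.price)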
In imperative code, the record fields are usually changed
directly. Of course, this is still a bit more memory-saving than the
sharing trick Ocaml implements. However, it is also more dangerous in
two respects. First, the programmer might lose track of where in the
program the records are stored (and this happens easily in development
teams), and the overwritten field would also become visible in parts
of the program where it is unexpected. Second, it is more difficult to
ensure consistency. For example, in the order example the price field
is a dependent field - it is computed from the items. When these two
fields are set in two consecutive assignment statements, it is
possible that this goes wrong (e.g. when an exception is raised
between the assignments).
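For comparison, here is a sketch of this hazard with mutable fields;
the stub type and the failing pricing function are again only
assumptions for illustration:

type item = { article : string; qty : int }   (* stub type, as above *)

exception Pricing_error

(* a pricing function that may fail, e.g. for an empty order *)
let compute_price items =
  match items with
  | [] -> raise Pricing_error
  | _  -> 10.0 *. float_of_int (List.length items)

type morder =
  { mutable items : item list;
    mutable price : float
  }

let set_items o new_items =
  o.items <- new_items;
  (* if compute_price raises here, price still reflects the old
     items - the record is now inconsistent, and every holder of a
     pointer to o sees the half-done update *)
  o.price <- compute_price new_items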
Thinking of data modifications as a sequence of versions is a typical
design pattern of functional programming. We have seen that it
consumes slightly more RAM, but also that it gives the programmer a
lot more control over the coordination of data changes, and that it
helps to ensure consistency.
FP and disks

Since Google published its paper about Map/Reduce, this framework has
been considered an attractive alternative for data warehousing, i.e.
for managing giant amounts of data. Interestingly, it has not
attracted much attention that Map/Reduce follows a functional
paradigm - although "map" and "reduce" are FP lingo, and one must be
blind not to see it. It is mostly viewed as an alternative to
relational databases ("NoSQL" is the keyword here).

Map/Reduce processes data files strictly sequentially: it creates a
sequence of versions of a set of input files, and finally stores the
last generation of versions in the output files. In particular,
Map/Reduce never overwrites existing files, neither as a whole nor in
parts. This is clearly a functional way of doing data manipulation, in
the same way as the functional record updates explained above. The
only difference is that the data units are bigger and stored on
disk. Interestingly, the reason why this is advantageous also has to
do with the characteristics of the underlying memory: disks allow
higher data transfer rates when files are read or written
sequentially, whereas random accesses are quite expensive. By sticking
to sequential accesses only, Map/Reduce is by design scalable into
regions where the speed of even big relational databases is limited by
unavoidable random disk seeks. Another advantage is that Map/Reduce
can easily be distributed - every part function of the scheme can run
on its own machine. As these functions do not have side effects, they
can run independently of each other, provided the machines have access
to the shared file system.
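To see how directly the scheme translates into FP, consider this
minimal, purely functional sketch of a word count. The in-memory lists
stand in for files, and all names are made up for illustration - this
is not Plasma's API:

(* map phase: turn one input "file" (a list of lines) into key/value pairs *)
let map_file lines =
  List.concat_map
    (fun line -> List.map (fun w -> (w, 1)) (String.split_on_char ' ' line))
    lines

(* shuffle: group all values belonging to the same key *)
let group pairs =
  let tbl = Hashtbl.create 16 in
  List.iter
    (fun (k, v) ->
       let old = try Hashtbl.find tbl k with Not_found -> [] in
       Hashtbl.replace tbl k (v :: old))
    pairs;
  Hashtbl.fold (fun k vs acc -> (k, vs) :: acc) tbl []

(* reduce phase: fold the values of one key into a single result *)
let reduce key values = (key, List.fold_left (+) 0 values)

(* none of the part functions has visible side effects, so the map and
   reduce steps could run on different machines, per file and per key *)
let word_count files =
  files
  |> List.concat_map map_file
  |> group
  |> List.map (fun (k, vs) -> reduce k vs)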
Explaining FP differently

It is time for FP because we finally have enough memory that FP can
run well, and the advantages clearly outweigh the costs. FP means
accessing memory differently: data is no longer overwritten; instead,
sequences of versions are created to represent the data
modifications. This works for both RAM-based and disk-based data, and
we have seen a number of advantages. In the case of RAM, the costs are
moderate and can be neglected for current computers with multiple
gigabytes of RAM. For disks, a strictly functional data scheme may
even reduce random seeks, although of course this is only viable for
"offline" data preparation jobs where no ad-hoc data modifications are
required.

Plasma

As Map/Reduce is currently a hot topic, it is important to get the
ball back into the field of FP. Right now, the widely used open source
implementation of Map/Reduce is written in Java, a language that only
verbosely manages in-place memory modifications instead of going one
step further, namely offering true FP constructs to the
programmer. However, an FP scheme like Map/Reduce becomes even more
useful when the framework itself is also written in an FP language.

Plasma is my project that will offer such an alternative. Plasma is
written in Ocaml, and the developer can seamlessly use Map/Reduce to
run functional programs in a distributed way. Although it is still at
an early stage of development, it is already usable and can
demonstrate why an FP implementation is better here. Without going too
much into detail (which I happily postpone to future articles), one
can imagine that it is easier to introduce Map/Reduce to programs that
are globally structured by functional decomposition anyway.
With some luck, FP thinking as a whole will finally get the attention,
and not just a single technique like Map/Reduce. It finally deserves it.