A few days ago, I released Plasma-0.4.1. This article gives an overview
of its filesystem subsystem, which is actually the more important part.
PlasmaFS differs in many respects from popular distributed filesystems
like HDFS, and the differences begin right at the requirements analysis.
A distributed filesystem (DFS) makes it possible to store giant amounts of
data. A high number of data nodes (computers with hard disks) can be
attached to a DFS cluster, and usually a second kind of node, called the
name node, stores the metadata, i.e. which files exist and where their
blocks are placed. The point is that the volume of metadata is very small
compared to the payload data (the ratios are somewhere between
1:10,000 and 1:1,000,000), so a single name node can manage quite a
large cluster. Also, clients contact the data nodes
directly to access payload data - the traffic is not routed via
the name node as in "normal" network filesystems. This allows for
enormous bandwidth.
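To make this division of labour concrete, here is a minimal OCaml sketch of the two-step read path such a DFS uses. Everything in it (the `block_location` type, `lookup_blocks`, `read_from_datanode`, the host names) is made up for illustration; it is not the PlasmaFS client API.

```ocaml
(* Hypothetical sketch of the two-step DFS read path:
   1. ask the name node where the blocks of a file live (metadata only),
   2. fetch the payload directly from the data nodes. *)

type block_location = {
  block_id : int64;          (* which block of the file *)
  data_nodes : string list;  (* hosts storing a replica of this block *)
}

(* Step 1: metadata lookup - a small request answered by the name node. *)
let lookup_blocks (_name_node : string) (path : string) : block_location list =
  ignore path;
  (* in a real client this would be an RPC; here the reply is faked *)
  [ { block_id = 0L; data_nodes = ["data03"; "data17"] };
    { block_id = 1L; data_nodes = ["data05"; "data03"] } ]

(* Step 2: bulk data transfer - goes straight to a data node;
   the name node never sees the payload bytes. *)
let read_from_datanode (node : string) (block_id : int64) : string =
  Printf.sprintf "<contents of block %Ld fetched from %s>" block_id node

let read_file name_node path =
  lookup_blocks name_node path
  |> List.map (fun loc -> read_from_datanode (List.hd loc.data_nodes) loc.block_id)
  |> String.concat ""

let () = print_endline (read_file "namenode01" "/some/example/file")
```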
The motivation for developing another DFS was that existing
implementations, and especially the popular HDFS, make (in my opinion)
unfortunate compromises to gain speed:
I'm not saying that HDFS is a bad implementation. My point is only that
there is an alternative where safety and security are taken more
seriously, and that there are other ways to get high speed than those
that are implemented in HDFS.
Using SSDs for transacted metadata stores
PlasmaFS starts at a different point. It uses a data store with full
transactional support (right now this is PostgreSQL, chosen for
development simplicity, but other, more light-weight systems could
also fill this role). This includes:
This is, by the way, the established standard way of ensuring
data safety in databases. It comes with its own problems, the most
challenging being that commits are relatively slow. The reason for this
is the storage hardware - for normal hard disks the maximum frequency
of commits is a function of the rotation speed. Fortunately, there is
now an alternative: current SSDs allow several tens of thousands of
syncs per second, which is two orders of magnitude more than classic
hard disks provide. Good SSDs are still expensive, but luckily moderate
disk sizes are already sufficient (with only a 100G database you can
already manage a really giant filesystem).
Speed
Still, writing each modification directly to the SSD limits the
speed compared to what systems like HDFS can do (because HDFS keeps
its metadata in RAM, and only now and then writes a copy to disk). We need
more techniques to address the name node as a potential bottleneck:
I consider the map/reduce part of Plasma in particular a good test
case for PlasmaFS. Of course, this map/reduce implementation is
perfectly adapted to PlasmaFS, and uses every possibility to reduce
the frequency of name node operations. It turns out that a typical
running map/reduce task contacts the name node only every 3-4 seconds,
usually to refill an empty buffer, or to flush a full buffer
to disk. The point here is that a buffer can be larger than a data
block, and that a single name node transaction is sufficient to
handle all blocks of the buffer in one go. The buffers are typically
much larger than a single block, so this reduces the number of
name node operations quite dramatically. (Important note: this number
of 3-4 seconds is only correct for Plasma's map/reduce implementation,
which uses a modified and more complex algorithm scheme; it is not
applicable to the scheme used by Hadoop.)
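The effect of such large buffers can be illustrated with a small sketch. The names and sizes are invented (`alloc_blocks_in_one_transaction` stands for a single name node transaction that covers all blocks of the buffer); this is a model of the access pattern, not Plasma's actual client code.

```ocaml
(* Hypothetical sketch: flushing a large client buffer with a single
   name node transaction, even though the buffer spans many data blocks. *)

let block_size = 1 lsl 20              (* assume 1 MiB blocks *)
let buffer_size = 8 * block_size       (* buffer much larger than one block *)

(* One name node round trip: allocate all blocks of the buffer at once. *)
let alloc_blocks_in_one_transaction (n_blocks : int) : int64 list =
  List.init n_blocks Int64.of_int

(* Direct writes to the data nodes; the name node is not involved here. *)
let write_block_to_datanode (block_id : int64) (chunk : string) : unit =
  Printf.printf "writing %d bytes to block %Ld\n" (String.length chunk) block_id

let flush_buffer (buf : string) =
  let n_blocks = (String.length buf + block_size - 1) / block_size in
  (* exactly one name node operation per flush ... *)
  let blocks = alloc_blocks_in_one_transaction n_blocks in
  (* ... followed by one direct data node write per block *)
  List.iteri
    (fun i block_id ->
       let off = i * block_size in
       let len = min block_size (String.length buf - off) in
       write_block_to_datanode block_id (String.sub buf off len))
    blocks

let () = flush_buffer (String.make buffer_size 'x')
```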
I have done some tests with the latest development version of
Plasma. The peak number of commits per second seems to be around 500
(here, a "commit" is a writing transaction that can include
several data update operations). This test used a recently bought SSD,
and ran on a quad-core server machine. There was no clear evidence that
the SSD was the bottleneck (one indication is that the test ran only
slightly faster when syncs were turned off), so there is probably still
a lot of room for optimization.
Given that a map/reduce task needs the name node only every 3-4 seconds,
this "commit speed" would theoretically be sufficient for around
1600 tasks running in parallel. It is likely that other limits are
hit first (e.g. the switching capacity of the network). Anyway, these
are encouraging numbers showing that this young project is not on the
wrong track.
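As a quick plausibility check, the estimate follows directly from the two numbers quoted above:

```ocaml
(* Rough capacity estimate from the figures above: ~500 commits/s at the
   name node, and one name node contact per task every 3-4 seconds. *)
let commits_per_second = 500.
let seconds_between_contacts = 3.2      (* middle of the 3-4 s range *)
let () =
  Printf.printf "~%.0f parallel tasks\n"
    (commits_per_second *. seconds_between_contacts)   (* prints ~1600 *)
```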
The above techniques are already implemented in PlasmaFS. More advanced
options that could be worth implementing include:
Given that these two improvements are very complicated to implement,
it is unlikely that this will happen soon. There is still a lot of fruit
hanging on the lower branches of the tree.
Delegated access control checks
Let's quickly discuss another problem, namely how to secure accesses
to the data nodes. It is easy to accept that the name nodes can be secured
with classic authentication and authorization schemes in the same
style as they are used for other server software. For the data nodes,
however, we face the problem that every access to a data block must be
supervised individually, but without incurring extra overhead - in
particular, we want to avoid that each data access has to be checked
with the name node.
PlasmaFS uses a special cryptographic ticket system to avoid
this. Essentially, the name node creates random keys at periodic
intervals, and broadcasts them to the data nodes. These keys are
secrets shared by the name and data nodes. The accessing clients only
get HMAC-based tickets generated from the keys and from the block IDs
the clients are granted access to. These tickets can be checked by
the data nodes because these nodes know the keys. When the client
loses the right to access the blocks (i.e. when the client transaction
ends), the corresponding key is revoked.
With some additional tricks it can be arranged that the only
communication between the name node and the data nodes is a periodic
maintenance call that hands out the new keys and revokes the expired
ones. That is an acceptable overhead.
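The core of the ticket scheme fits into a few lines. The following sketch uses HMAC over MD5 (via the OCaml standard library's Digest module) purely as a dependency-free stand-in for whatever hash the real implementation uses; all names, keys and block IDs are invented and do not match the real PlasmaFS protocol.

```ocaml
(* Sketch of the ticket scheme: name node and data nodes share a
   periodically rotated random key; clients only ever see HMAC tickets
   derived from that key and a block ID.  HMAC over MD5 (stdlib Digest)
   is used here only to keep the sketch dependency-free. *)

let hmac key msg =
  let block = 64 in
  let key = if String.length key > block then Digest.string key else key in
  let key = key ^ String.make (block - String.length key) '\000' in
  let xor_key c = String.map (fun k -> Char.chr (Char.code k lxor c)) key in
  Digest.to_hex (Digest.string (xor_key 0x5c ^ Digest.string (xor_key 0x36 ^ msg)))

(* Name node side: grant a client access to a block by handing out a ticket.
   [shared_key] has previously been broadcast to all data nodes. *)
let make_ticket ~shared_key ~block_id =
  hmac shared_key (Int64.to_string block_id)

(* Data node side: verify the ticket locally, without asking the name node. *)
let ticket_ok ~shared_key ~block_id ~ticket =
  String.equal ticket (make_ticket ~shared_key ~block_id)

let () =
  let shared_key = "random-key-of-this-period" in    (* rotated periodically *)
  let ticket = make_ticket ~shared_key ~block_id:42L in
  assert (ticket_ok ~shared_key ~block_id:42L ~ticket);
  (* once the key is rotated/revoked, old tickets no longer verify *)
  assert (not (ticket_ok ~shared_key:"key-of-the-next-period" ~block_id:42L ~ticket))
```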
Other quality-assuring features
PlasmaFS implements POSIX file semantics almost completely. This
includes the possibility of modifying data (or rather, of replacing
blocks with newer versions, which is not possible in other DFS
implementations), the handling of deleted files, and the exclusive
creation of new files. There are a few exceptions, though: neither
the link count nor the last access time of files is maintained.
Also, lockf-style locks are not yet available.
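As a reminder of what "exclusive creation" buys you, this is what the corresponding POSIX call looks like, e.g. against an NFS-mounted PlasmaFS volume (the mount point is made up for this sketch):

```ocaml
(* Exclusive creation with plain POSIX calls, e.g. against an NFS-mounted
   PlasmaFS volume.  The mount point is made up; compile with the unix
   library (ocamlfind ocamlopt -package unix -linkpkg ...). *)
let () =
  let path = "/mnt/plasma/jobs/lockfile" in
  match Unix.openfile path [ Unix.O_WRONLY; Unix.O_CREAT; Unix.O_EXCL ] 0o644 with
  | fd ->
      (* we won the race: no other client created the file before us *)
      Unix.close fd
  | exception Unix.Unix_error (Unix.EEXIST, _, _) ->
      (* another client created it first - exclusive creation detected this *)
      prerr_endline "file already exists"
```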
To support map/reduce and other distributed algorithm schemes,
PlasmaFS offers locality functions. In particular, one can find out
on which nodes a data block is actually stored, and one can also
request that a new data block be stored on a certain node (if possible).
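A tiny sketch of why the locality query is useful, e.g. for a scheduler that wants to run work next to its data. `hosts_of_block` is a placeholder for the "on which nodes is this block stored?" lookup; the trivial policy shown is not Plasma's scheduler, it only illustrates the idea.

```ocaml
(* Hypothetical use of the locality query: run work on a node that already
   stores the block it reads. *)

let hosts_of_block (block_id : int64) : string list =
  ignore block_id;
  [ "data03"; "data17" ]            (* faked reply for the sketch *)

let pick_node ~idle_nodes ~block_id =
  match List.filter (fun h -> List.mem h idle_nodes) (hosts_of_block block_id) with
  | h :: _ -> h                     (* the read stays node-local *)
  | [] -> List.hd idle_nodes        (* fall back to a remote read *)

let () =
  print_endline (pick_node ~idle_nodes:[ "data17"; "data42" ] ~block_id:7L)
```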
The PlasmaFS client protocol is based on SunRPC. This protocol has quite
good support on the system level, and it supports strong
authentication and encryption via the GSS-API extension (which is
actually used by PlasmaFS, together with the SCRAM-SHA1 mechanism). I
know that younger developers consider it outdated, but even the
Facebook generation must accept that it can keep up with today's
requirements, and that it includes features that more modern
protocols do not provide (like UDP transport and GSS-API). For the
quality of the code it is important that modifying the SunRPC layer is
easy (e.g. adding a new procedure or changing an existing one), and
does not require much coding. Because of this, the PlasmaFS protocol
could be kept quite clean on the one hand, while still being adequately
expressive on the other hand to support complex transactions.
PlasmaFS is accessible from many environments. Applications can access
it via the SunRPC protocol mentioned above (with all features), but also
via NFS, and via a command-line client. In the future, WebDAV support
will also be provided (WebDAV is an extension of HTTP, and will
ensure easy access from many programming environments).
Check Plasma out
The Plasma homepage provides
a lot of documentation, and especially the downloads. Also take a look at
the performance
page, which describes a few tests I recently ran.