<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Blog on camlcity.org</title>
    <link>http://blog.camlcity.org</link>
    <language>en</language>
    <description>Articles by Gerd Stolpmann about O'Caml</description>

    
        <item>
          <title>About LambdaRank</title>
          <guid>http://blog.camlcity.org/blog/lambdarank.html</guid>
          <link>http://blog.camlcity.org/blog/lambdarank.html</link>
          <description>

&#60;div&#62;
  &#60;b&#62;The ranking algorithm behind GODI Search&#60;/b&#62;&#60;br&#62;&#38;#160;
&#60;/div&#62;

&#60;div&#62;
  
The question has always been: Are the results better if a search
engine really understands the text it indexes? You can view my latest
project, &#60;a href=&#34;http://docs.camlcity.org&#34;&#62;GODI Search&#60;/a&#62;, as an
attempt to answer this question for a very limited set of documents,
namely the code and documentation of &#60;a href=&#34;http://godi.camlcity.org&#34;&#62;GODI&#60;/a&#62;, the source code O&#38;#39;Caml
distribution. Understanding the text allows the engine to rank the
results better, so that the more important occurrences of the query
words appear nearer the top of the result list. Because the ranking
is really based on interpreting the text, and the text here is a
variant of lambda code, I call this method LambdaRank.
&#60;/div&#62;

&#60;div&#62;
  
&#60;p&#62;Like every search engine, GODI Search consists roughly of two main
components, namely the indexer and the searcher. The indexer iterates
over the set of documents, and populates a database with the words it
extracts. The searcher is the simpler part of the game, as it finds
everything well-prepared in this database, and when a query comes in,
it only has to pull the matching documents out of the database, sort
them according to the ranking score, and show them to the user. So the
tricky part is the indexing.

&#60;/p&#62;&#60;p&#62;If you don&#38;#39;t understand the text at all, you cannot do much about
ranking but count the occurrences of a word in a document. The idea
is that a text that speaks about a certain subject also mentions the
subject more often than other texts, and thus the number of occurrences
is a good measure for ranking.
In a document set where the texts are connected with hyperlinks one
can furthermore look at the relationships between the documents.
Google&#38;#39;s PageRank is based on this approach (but, as rumors say,
is now heavily modified from its original design).

&#60;/p&#62;&#60;p&#62;Fortunately, we know a lot about the document set GODI Search 
analyzes. Most documents are like this:

&#60;/p&#62;&#60;ul&#62;
  &#60;li&#62;&#60;a href=&#34;http://docs.camlcity.org/docs/godipkg/3.10/godi-ocaml/lib/ocaml/std-lib/list.mli&#34;&#62;O&#38;#39;Caml module interfaces&#60;/a&#62;
  &#60;/li&#62;&#60;li&#62;&#60;a href=&#34;http://docs.camlcity.org/docs/godipkg/3.10/godi-ocaml/lib/ocaml/std-lib/list.ml&#34;&#62;O&#38;#39;Caml module implementations&#60;/a&#62;
  &#60;/li&#62;&#60;li&#62;&#60;a href=&#34;http://docs.camlcity.org/docs/godipkg/3.10/godi-omake/doc/godi-omake/html/omake-doc.html&#34;&#62;Technical manuals&#60;/a&#62;
  &#60;/li&#62;&#60;li&#62;&#60;a href=&#34;http://docs.camlcity.org/docs/godipkg/3.10/apps-coq/share/emacs/site-lisp/coq.el&#34;&#62;Code in programming languages other than O&#38;#39;Caml&#60;/a&#62;
  &#60;/li&#62;&#60;li&#62;&#60;a href=&#34;http://docs.camlcity.org/docs/godipkg/3.10/apps-coq/lib/coq/ide/utf8.v&#34;&#62;Code in exotic languages&#60;/a&#62;
&#60;/li&#62;&#60;/ul&#62;

O&#38;#39;Caml code files and closely corresponding manuals dominate the corpus.
GODI Search tries to make the best of this by analyzing O&#38;#39;Caml code in
detail.

&#60;p&#62;After looking at many examples I had some ideas about which
occurrences should be ranked higher than others.

&#60;/p&#62;&#60;p&#62;&#60;b&#62;Idea 1. Definitions of identifiers are more important than uses.&#60;/b&#62;
This sounds natural, but actually not every definition is important. GODI
Search also looks at the scope of the definition: A local definition is
restricted to a surrounding function, and scores the least. Then follow
definitions on module level, and in top-level modules. The highest score
is given to exported identifiers that occur in top-level module interfaces.

&#60;/p&#62;&#60;p&#62;Only let-bound identifiers are ranked this way. Function arguments,
variables in pattern matchings, and fun-bound identifiers are ignored.
This is a bit arbitrary, but my feeling is that these identifiers are
usually not important in the coding styles I&#38;#39;m aware of.

&#60;/p&#62;&#60;p&#62;For types, exceptions, and module names similar scoring techniques
exist.
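&#60;/p&#62;&#60;p&#62;The scope ordering of Idea 1 can be sketched as a small scoring
function. This is only an illustration: the type names and the numeric
weights below are assumptions made for this sketch, since only the order
of the scores is described here, not their values or how GODI Search
stores them.

```ocaml
(* Hypothetical scopes; only their relative order comes from the text. *)
type scope =
  | Local              (* let-bound inside a function *)
  | Module_level       (* defined inside a nested module *)
  | Toplevel_module    (* defined in a top-level module *)
  | Exported           (* exported by a top-level module interface *)

type occurrence =
  | Definition of scope
  | Use

(* Assumed weights: any definition beats a plain use. *)
let score = function
  | Use -> 1
  | Definition Local -> 2
  | Definition Module_level -> 4
  | Definition Toplevel_module -> 6
  | Definition Exported -> 10

(* Sort (document, occurrence) hits so that the best-scoring come first. *)
let rank hits =
  List.sort (fun (_, a) (_, b) -> compare (score b) (score a)) hits
```

Ranking a use, a local definition and an exported definition of the same
word with this function puts the exported definition first.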

&#60;/p&#62;&#60;p&#62;&#60;b&#62;Idea 2. Values and types are rated separately.&#60;/b&#62;
The namespace of all identifiers can be roughly divided into two big
zones: Values and types. Of course, there are more kinds of
identifiers (modules, classes, labels, file names,...), but my
impression is that the typical programmer has a mind set that is
dominated by only these two classes of symbols. In this sense, a
&#38;#34;value&#38;#34; names an executable thing, and a &#38;#34;type&#38;#34; names meta data. Both
have little to do with each other, and thus a document in which
&#38;#34;list&#38;#34; occurs many times as a type has little importance in a search
for &#38;#34;list&#38;#34; as a value.

&#60;/p&#62;&#60;p&#62;&#60;b&#62;Idea 3. Keywords are stopwords.&#60;/b&#62; Keywords occur in
large numbers in practically every code file, and thus say nothing
about it. GODI Search simply ignores keywords.

&#60;/p&#62;&#60;p&#62;&#60;b&#62;Idea 4. Qualified identifiers are hyperlinks.&#60;/b&#62; If you
search for &#38;#34;mem&#38;#34; then you will get a list of top-level definitions of
this function. The question is which occurrence is shown first.  GODI
Search implements the PageRank idea of scoring hyperlinks pointing to
a document by looking at qualified identifiers. So if there are more
&#38;#34;Hashtbl.find&#38;#34; than &#38;#34;List.find&#38;#34; in code anywhere in the corpus, the
module &#38;#34;Hashtbl&#38;#34; scores higher than &#38;#34;List&#38;#34;. (Actually, it is the other
way round if you also take other references into account.)
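&#60;/p&#62;&#60;p&#62;A minimal sketch of this counting, assuming the qualified
identifiers have already been extracted from the corpus as plain strings.
The function names are hypothetical, and a real score would also weigh
other kinds of references.

```ocaml
(* Count how often each module is referenced through a qualified
   identifier such as "Hashtbl.find". *)
let module_references idents =
  let counts = Hashtbl.create 16 in
  List.iter
    (fun ident ->
       match String.index_opt ident '.' with
       | Some i ->
           let m = String.sub ident 0 i in
           let n = try Hashtbl.find counts m with Not_found -> 0 in
           Hashtbl.replace counts m (n + 1)
       | None -> ())   (* unqualified identifier: no link to a module *)
    idents;
  counts

(* Number of references recorded for module m (0 if never seen). *)
let references counts m =
  try Hashtbl.find counts m with Not_found -> 0
```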

&#60;/p&#62;&#60;p&#62;&#60;b&#62;Idea 5. Code and non-code are rated separately.&#60;/b&#62; 
Of course, the above applies only to text sections that are O&#38;#39;Caml code.
Other languages and non-code cannot be rated this way. For this reason,
GODI Search puts a lot of effort into separating both types of text.
Currently, this is done on a per-paragraph basis, i.e. every paragraph
is first analyzed in order to know whether it is O&#38;#39;Caml code or not.
Also, comments and string literals in code files are considered as
non-code.

&#60;/p&#62;&#60;p&#62;So much for the ideas behind LambdaRank. The results of the
implementation look promising: If the user types in &#38;#34;fold_left&#38;#34; he or
she will be taken to really relevant occurrences. And user experience
is what counts in the end.

&#60;/p&#62;
&#60;/div&#62;

&#60;div&#62;
  Gerd Stolpmann works as an O&#38;#39;Caml consultant

&#60;/div&#62;

&#60;div&#62;
  
      &#60;p&#62;&#60;b&#62;Links:&#60;/b&#62;
      &#60;ul&#62;
	
	    &#60;li&#62;&#60;a href=&#34;http://docs.camlcity.org&#34;&#62;GODI Search&#60;/a&#62;:
	    Try out GODI Search here
	  &#60;/li&#62;
	    &#60;li&#62;&#60;a href=&#34;http://godi.camlcity.org&#34;&#62;GODI Homepage&#60;/a&#62;:
	    About GODI in general
	  &#60;/li&#62;
      &#60;/ul&#62;
    &#60;/p&#62;
&#60;/div&#62;


          </description>
        </item>
      
        <item>
          <title>Cross-Language Cluster Computing</title>
          <guid>http://blog.camlcity.org/blog/hydrostory.html</guid>
          <link>http://blog.camlcity.org/blog/hydrostory.html</link>
          <description>

&#60;div&#62;
  &#60;b&#62;The Story Behind Hydro&#60;/b&#62;&#60;br&#62;&#38;#160;
&#60;/div&#62;

&#60;div&#62;
  
On the surface a search engine looks like a very simple web site, but
actually most things happen in the backend, and are hidden from the
user. A search engine consists of 10 to 20 different types of servers,
and many of them are instantiated several times in a cluster
configuration.  No single programming language is well suited for the
entire implementation. In addition, some language environments may
have compelling libraries that are lacking in other languages.  It is,
however, still difficult to let servers written in different languages
communicate with each other. At Wink, we decided to go with ICE, and
to develop the missing O&#38;#39;Caml implementation ourselves.  
&#60;/div&#62;

&#60;div&#62;
  
&#60;p&#62;For two years now I have done consulting work for &#60;a href=&#34;http://wink.com&#34;&#62;Wink Technologies&#60;/a&#62;, which started out as a
user-powered search engine, but later switched to people
search. Actually, people search began as an experiment, and was
originally developed in O&#38;#39;Caml with some Java standard components. It
turned out that O&#38;#39;Caml was very well suited for crawling and parsing,
and that our solution was convincing enough so we could go on with
O&#38;#39;Caml as implementation language. It was also a big plus that it was
already possible to develop server clusters in O&#38;#39;Caml with Ocamlnet&#38;#39;s
SunRPC implementation - we only had to add a highly available
directory and configuration service. However, the problem with SunRPC is
that there is no good C++ implementation (interestingly, there is an
acceptable one for Java, RemoteTea). Too bad, since SunRPC is simple
and robust, and its type system matches that of O&#38;#39;Caml quite well
(there are records, arrays, variants, and option types).

&#60;/p&#62;&#60;h2&#62;ICE: An RPC Middleware&#60;/h2&#62;

&#60;p&#62;Looking for an alternative, somebody came up with ICE (&#38;#34;Internet
Communications Engine&#38;#34;). This is a commercial product by &#60;a href=&#34;http://zeroc.com&#34;&#62;ZeroC&#60;/a&#62; which is dual-licensed under the GPL
(like e.g. MySQL). There are implementations for a number of
languages, including C++, Java, Python, and PHP, but unfortunately not
for O&#38;#39;Caml. Well, this is no surprise, but at least the other
languages used in the company are covered. So we looked closer at ICE.
Does it match our needs?

&#60;/p&#62;&#60;p&#62;ICE follows the object-oriented paradigm. For an RPC middleware
this means that a remote call is seen as sending a message to a remote
object. Of course, it is possible to have several such objects of the
same type, and creating instances is made possible by a class
construct. Such a design is very acceptable, but unfortunately object
orientation often meant in the past that the rest of the type system
was crippled. To some degree, this also happened to ICE - especially,
there are no variants and no option types. Well, not optimal, but
there are at least clean &#38;#34;design patterns&#38;#34; for emulating these
missing features with classes, and for pure OO languages like Java
the ICE approach simplifies the language mapping.

&#60;/p&#62;&#60;p&#62;Unlike CORBA, there is a fixed protocol the components have to use
to talk to each other. That means a client in language X can directly
contact a server in language Y, and there is no need for an
intermediate instance to translate the protocol. Basically, this means
you can use ICE without any infrastructure - no central server you
are dependent upon. For developing massively parallel cluster services
this is an essential requirement, because such central servers don&#38;#39;t 
scale well enough, and are single points of failure.

&#60;/p&#62;&#60;p&#62;For using ICE in a cluster context, there is the IceGrid add-on.
Basically, this is a highly available directory and configuration
service, and it serves a purpose similar to that of the service we had
developed for SunRPC before. Clients ask IceGrid where to find their
servers in the network, and IceGrid replies with a suggestion of TCP
ports. This can be used for load-balancing and for high availability.

&#60;/p&#62;&#60;h2&#62;Hydro: Implementing ICE for O&#38;#39;Caml&#60;/h2&#62;

&#60;p&#62;After ICE was found to be good enough, we needed an implementation
for O&#38;#39;Caml. Well, this was my field - I had already developed the SunRPC
support in Ocamlnet years before, and this made me an expert in this
type of work. It took only about 3 weeks until it was possible to
generate client code, and about another week until server support
was ready. However, it was still challenging work, because the ICE
type system needed to be mapped to O&#38;#39;Caml&#38;#39;s type system. Furthermore,
the ICE reference manual was full of errors, and everything had to
be checked against ZeroC&#38;#39;s implementation.

&#60;/p&#62;&#60;p&#62;The difficulty of the type mapping is that ICE demands that objects
and exceptions can be downcast. O&#38;#39;Caml, however, does not support
this operation, because there is no efficient implementation of
downcasting for a type system like O&#38;#39;Caml&#38;#39;s that includes structural
subtyping. Nevertheless, downcasting is a
reasonable operation in the context of RPC, and it is hardly possible
to get around it.

&#60;/p&#62;&#60;p&#62;Maybe an example demonstrates this best. In Slice, the IDL for
ICE, one can easily define a hierarchy of classes (the syntax resembles
Java&#38;#39;s):

&#60;/p&#62;&#60;blockquote lang=&#34;x-slice&#34;&#62;
&#60;code&#62;&#60;pre&#62;
class SearchResult {
    string url;
    string title;
}

class PeopleSearchResult extends SearchResult {
    string name;
}

class BandSearchResult extends SearchResult {
    string bandName;
    stringSeq bandMembers;
}
&#60;/pre&#62;&#60;/code&#62;&#60;/blockquote&#62;

&#60;p&#62;When the search engine returns a &#60;code&#62;SearchResult&#60;/code&#62; item, it
can also be one of the descendants of this class. Of course, a client
of the search engine that simply wants to display the result needs to
know all the details, and thus downcasts &#60;code&#62;SearchResult&#60;/code&#62; to the
real subclass.

&#60;/p&#62;&#60;p&#62;In a normal OO program one can get rid of this downcast by adding
an operation for displaying the result:

&#60;/p&#62;&#60;blockquote lang=&#34;x-slice&#34;&#62;
&#60;code&#62;&#60;pre&#62;
class SearchResult {
    string url;
    string title;
    string display();
}
&#60;/pre&#62;&#60;/code&#62;&#60;/blockquote&#62;

In an RPC context such an addition might be difficult, however, or may
break some other principle of the RPC design. Basically, RPC is about
marshalling data, and that means getting data out of the context of
one server and forcing them into the context of another server. The
&#38;#34;unity of data and operations&#38;#34;, one of the OO principles, is
intentionally given up.

&#60;p&#62;Note that ICE allows operations to be defined for classes, and
operations are always executed in the context of the data. In this
example, it would indeed be possible to define &#60;code&#62;display&#60;/code&#62;
in a reasonable way, and to avoid the downcast. However,
&#60;code&#62;display&#60;/code&#62; then becomes part of the protocol, although it is
rather a detail of the client. Anyway, one quickly faces the situation
where downcasting is unavoidable.

&#60;/p&#62;&#60;p&#62;In the O&#38;#39;Caml mapping generated by Hydro, these three classes would
look like this:

&#60;/p&#62;&#60;blockquote lang=&#34;x-ocaml&#34;&#62;
&#60;code&#62;&#60;pre&#62;
class type o_SearchResult = 
  object
    inherit o_Ice_Object
    method url : string ref
    method title : string ref
  end

class type o_PeopleSearchResult =
  object
    inherit o_SearchResult
    method name : string ref
  end

class type o_BandSearchResult =
  object
    inherit o_SearchResult
    method bandName : string ref
    method bandMembers : string array ref
  end

val as_SearchResult : 
      #Hydro_lm.object_base -&#38;#62; o_SearchResult

val as_PeopleSearchResult : 
      #Hydro_lm.object_base -&#38;#62; o_PeopleSearchResult

val as_BandSearchResult : 
      #Hydro_lm.object_base -&#38;#62; o_BandSearchResult
&#60;/pre&#62;&#60;/code&#62;&#60;/blockquote&#62;

This is a bit simplified, but shows the idea. The ICE classes are
mapped to O&#38;#39;Caml classes with some hidden machinery. The data members
appear as O&#38;#39;Caml methods returning references - the most direct 
translation of this concept. The class hierarchy corresponds to the
hierarchy in ICE, so the O&#38;#39;Caml operator for upcasting, :&#38;#62;, can
be directly used. The hidden machinery comes into play by inheriting
from &#60;code&#62;o_Ice_Object&#60;/code&#62;, the root of the ICE hierarchy, and
by using &#60;code&#62;object_base&#60;/code&#62;, an even smaller ancestor that
defines the marshalling core.

&#60;p&#62;The downcast operation is emulated by defining conversion functions
for every class type: &#60;code&#62;as_PeopleSearchResult&#60;/code&#62; checks whether
the argument is a &#60;code&#62;PeopleSearchResult&#60;/code&#62; in reality, and if so,
casts it to this class type. If not, an exception is raised.
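&#60;/p&#62;&#60;p&#62;The check-then-convert behaviour of such a conversion function
can be illustrated without Hydro. The sketch below is not the generated
code: it swaps the class hierarchy for a closed variant, and the names
&#60;code&#62;details&#60;/code&#62;, &#60;code&#62;as_people&#60;/code&#62; and
&#60;code&#62;Invalid_coercion&#60;/code&#62; are made up for the example.

```ocaml
(* Closed-world stand-in for the class hierarchy; NOT Hydro output. *)
exception Invalid_coercion of string   (* hypothetical exception name *)

type details =
  | Plain
  | People of string                   (* name *)
  | Band of string * string array      (* bandName, bandMembers *)

type search_result = { url : string; title : string; details : details }

(* In the spirit of as_PeopleSearchResult: check what the value really
   is, hand out the more specific data, or raise an exception. *)
let as_people r =
  match r.details with
  | People name -> name
  | Plain | Band _ -> raise (Invalid_coercion "PeopleSearchResult")
```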

&#60;/p&#62;&#60;p&#62;Of course, this emulation is a bit inconvenient, but this is mostly
a problem of generating good code. From a user&#38;#39;s perspective, there is
not much difference between calling a generated conversion function,
or using a built-in language operation. It makes, however, the whole
generated code a lot more difficult to understand.

&#60;/p&#62;&#60;h2&#62;The Story Continues&#60;/h2&#62;

Missing support for RPC middleware is one of the biggest concerns when
using a new language in enterprises. In a startup company like Wink it
is possible to address these concerns, because such companies are open
to unconventional solutions (and ICE is unconventional - the industry
standards are CORBA and DCOM in the LAN, and HTTP-based protocols like
SOAP on the Internet). Finally, by integrating several languages into
the system it was possible to deliver some components quicker and with
better quality because we could choose the best language for every
component.

&#60;p&#62;In the people searcher context we now use both SunRPC and ICE. 
The former is arguably better when only O&#38;#39;Caml components have to
talk with each other, and the latter is for crossing the language
boundaries.
&#60;/p&#62;
&#60;/div&#62;

&#60;div&#62;
  Gerd Stolpmann works as an O&#38;#39;Caml consultant

&#60;/div&#62;

&#60;div&#62;
  
      &#60;p&#62;&#60;b&#62;Links:&#60;/b&#62;
      &#60;ul&#62;
	
	    &#60;li&#62;&#60;a href=&#34;http://wink.com&#34;&#62;Wink Technologies&#60;/a&#62;:
	    Company homepage
	  &#60;/li&#62;
	    &#60;li&#62;&#60;a href=&#34;http://www.zeroc.com&#34;&#62;ZeroC&#60;/a&#62;:
	    Company homepage, ICE documentation, ICE implementation for a number of languages
	  &#60;/li&#62;
	    &#60;li&#62;&#60;a href=&#34;http://remotetea.sourceforge.net&#34;&#62;RemoteTea&#60;/a&#62;:
	    SunRPC for Java
	  &#60;/li&#62;
	    &#60;li&#62;&#60;a href=&#34;http://oss.wink.com/hydro&#34;&#62;Hydro&#60;/a&#62;:
	    ICE for O&#38;#39;Caml
	  &#60;/li&#62;
	    &#60;li&#62;&#60;a href=&#34;http://projects.camlcity.org/projects/ocamlnet.html&#34;&#62;Ocamlnet&#60;/a&#62;:
	    Internet library for O&#38;#39;Caml with SunRPC support
	  &#60;/li&#62;
      &#60;/ul&#62;
    &#60;/p&#62;
&#60;/div&#62;


          </description>
        </item>
      
        <item>
          <title>Mixing Apples And Pears</title>
          <guid>http://blog.camlcity.org/blog/polyvariants.html</guid>
          <link>http://blog.camlcity.org/blog/polyvariants.html</link>
          <description>

&#60;div&#62;
  &#60;b&#62;Using Polymorphic Variants&#60;/b&#62;&#60;br&#62;&#38;#160;
&#60;/div&#62;

&#60;div&#62;
  
It is one of the coolest language constructs, but its concept sometimes
leads to confusion. O&#38;#39;Caml allows you to form ad-hoc unions of
tagged values, the so-called polymorphic variants. They are the free-style
counterpart of the &#38;#34;normal&#38;#34; variant types. We want to shed some light
on this construct in this article, and encourage programmers to
try it out.

&#60;/div&#62;

&#60;div&#62;
  
&#60;p&#62;
The most baffling property of the polyvariants is that one can mix
tags that come from different pieces of code. We&#38;#39;ll give an example of
that later in the text, but first let&#38;#39;s explain some foundations.
Syntactically, the tags are written with a leading backquote,
so for example &#60;code&#62;`Apple&#60;/code&#62; is a (value-less) tag. Like the
normal variants the tags can have attached values, so for instance
&#60;code&#62;`Pear &#38;#34;it&#38;#39;s sweet man&#38;#34;&#60;/code&#62; is a tag with a string value.  It
is not required to declare polyvariant types; one can simply start
creating such tagged values in the code.
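&#60;/p&#62;&#60;p&#62;A small sketch of this, using the two tags from above (the
names &#60;code&#62;basket&#60;/code&#62; and &#60;code&#62;describe&#60;/code&#62; are invented for the
example):

```ocaml
(* No type declaration is needed before using the tags. *)
let basket = [ `Apple; `Pear "it's sweet man"; `Apple ]

(* Pattern matching works on the tags directly. *)
let describe fruit =
  match fruit with
  | `Apple -> "an apple"
  | `Pear comment -> "a pear (" ^ comment ^ ")"

let descriptions = List.map describe basket
```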

&#60;/p&#62;&#60;h2&#62;Data Analysis Step-By-Step&#60;/h2&#62;

&#60;p&#62;Imagine we want to analyze a string. In a first step, we would like
to classify every character of the string, and determine whether it is
a letter, a digit, or something else. Using polyvariants, this function
does the job:

&#60;/p&#62;&#60;blockquote lang=&#34;x-ocaml&#34;&#62;
&#60;code&#62;&#60;pre&#62;
let classify_chars s =
  let rec classify_chars_at p =
    if p &#38;#60; String.length s then
      let c = s.[p] in
      let cls =
        match c with
          | &#38;#39;0&#38;#39;..&#38;#39;9&#38;#39; -&#38;#62; `Digit c
          | &#38;#39;A&#38;#39;..&#38;#39;Z&#38;#39; | &#38;#39;a&#38;#39;..&#38;#39;z&#38;#39; -&#38;#62; `Letter c
          | _ -&#38;#62; `Other c in
      cls :: classify_chars_at (p+1)
    else
      []
  in
    classify_chars_at 0
&#60;/pre&#62;&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;So this function would return this list for the input string
&#38;#34;a56*&#38;#34;:

&#60;/p&#62;&#60;blockquote lang=&#34;x-ocaml&#34;&#62;
&#60;code&#62;&#60;pre&#62;[ `Letter &#38;#39;a&#38;#39;; `Digit &#38;#39;5&#38;#39;; `Digit &#38;#39;6&#38;#39;; `Other &#38;#39;*&#38;#39; ]
&#60;/pre&#62;&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;Note that there is no type declaration! If you enter this function into
the O&#38;#39;Caml toploop, you see that its type is inferred like this:

&#60;/p&#62;&#60;blockquote lang=&#34;x-ocaml&#34;&#62;
&#60;code&#62;&#60;pre&#62;
val classify_chars :
  string -&#38;#62; 
    [&#38;#62; `Digit of char | `Letter of char | `Other of char ] list =
  &#38;#60;fun&#38;#62;
&#60;/pre&#62;&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;Read this as: The function returns tagged values with tags
&#60;code&#62;`Digit&#60;/code&#62;, &#60;code&#62;`Letter&#60;/code&#62;, or &#60;code&#62;`Other&#60;/code&#62;,
and every tag has an attached character. Note the &#38;#34;greater than&#38;#34;
sign at the beginning of the tag list. It means that this tag list
is compatible with being mixed with completely unrelated tags.
There are also polyvariant types where this sign is reversed
(like in &#60;code&#62;[&#38;#60;...]&#60;/code&#62;), or completely missing. We&#38;#39;ll
come back to that later.

&#60;/p&#62;&#60;p&#62;Back to our string analysis example. We are now interested in
recognizing integer numbers in the list of classified characters.
We assume our input is a list that contains &#60;code&#62;`Digit&#60;/code&#62;
tags, but also other tags. In the output list, sequences of
&#60;code&#62;`Digit&#60;/code&#62; are replaced by &#60;code&#62;`Number&#60;/code&#62;:

&#60;/p&#62;&#60;blockquote lang=&#34;x-ocaml&#34;&#62;
&#60;code&#62;&#60;pre&#62;
let recognize_numbers l =
  let rec recognize_at m acc =
    match m with
      | `Digit d :: m&#38;#39; -&#38;#62;
          let d_v = Char.code d - Char.code &#38;#39;0&#38;#39; in
          let acc&#38;#39; =
            match acc with
              | Some v -&#38;#62; Some(10*v + d_v)
              | None -&#38;#62; Some d_v in
          recognize_at m&#38;#39; acc&#38;#39;
      | x :: m&#38;#39; -&#38;#62;
          ( match acc with
              | None -&#38;#62; x :: recognize_at m&#38;#39; None
              | Some v -&#38;#62; (`Number v) :: x :: recognize_at m&#38;#39; None
          )
      | [] -&#38;#62;
          ( match acc with
              | None -&#38;#62; []
              | Some v -&#38;#62; (`Number v) :: []
          )
  in
  recognize_at l None
&#60;/pre&#62;&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;The inferred type of this function is now really strange:

&#60;/p&#62;&#60;blockquote lang=&#34;x-ocaml&#34;&#62;
&#60;code&#62;&#60;pre&#62;
val recognize_numbers :
  ([&#38;#62; `Digit of char | `Number of int ] as &#38;#39;a) list -&#38;#62; &#38;#39;a list 
  = &#38;#60;fun&#38;#62;
&#60;/pre&#62;&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;Basically, it says that there is a tagged list as input, and
that the output list has the same tags. Furthermore, the &#38;#34;&#38;#62;&#38;#34;
sign again signals extensibility, so we can not only use the tags
mentioned in the function, but any other tag as well. Especially,
we are free to pass &#60;code&#62;`Letter&#60;/code&#62; and &#60;code&#62;`Other&#60;/code&#62;
tags in:

&#60;/p&#62;&#60;blockquote lang=&#34;x-ocaml&#34;&#62;
&#60;code&#62;&#60;pre&#62;
recognize_numbers
   [ `Digit &#38;#39;1&#38;#39; ; `Digit &#38;#39;3&#38;#39;; `Letter &#38;#39;a&#38;#39;; `Digit &#38;#39;2&#38;#39;; `Other &#38;#39;*&#38;#39; ]
&#60;/pre&#62;&#60;/code&#62;
yields
&#60;code&#62;&#60;pre&#62;
[ `Number 13; `Letter &#38;#39;a&#38;#39;; `Number 2; `Other &#38;#39;*&#38;#39;]
&#60;/pre&#62;&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;Note that the type of the &#60;code&#62;recognize_numbers&#60;/code&#62; function
does not reflect everything we know about the function. We can be
sure that the function will never return a &#60;code&#62;`Digit&#60;/code&#62; tag,
but this is not expressed in the function type. We have run into one
of the cases where the O&#38;#39;Caml type system is not powerful enough to
find this out, or even to write this knowledge down. In practice, this
is no real limitation - the types are usually a bit weaker than
necessary, but it is unlikely that weaker types cause problems.

&#60;/p&#62;&#60;p&#62;The really great thing about the polymorphic variants is that 
it is possible to mix tags that come from different contexts. So
in our example we can combine the two functions, and apply one
after the other:

&#60;/p&#62;&#60;blockquote lang=&#34;x-ocaml&#34;&#62;
&#60;code&#62;&#60;pre&#62;
let analyze s =
  recognize_numbers (classify_chars s)
&#60;/pre&#62;&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;It is no problem that &#60;code&#62;classify_chars&#60;/code&#62; emits tags that
are completely unknown to &#60;code&#62;recognize_numbers&#60;/code&#62;. And both
functions can use the same tag, &#60;code&#62;`Digit&#60;/code&#62;, without having
to declare in some way that they mean the same thing. It is sufficient
that the tag is the same, and that the attached value has the same
type.

&#60;/p&#62;&#60;p&#62;This may have big advantages for structuring programs. Of course,
our example of string analysis already benefits from the loose type
correspondence the polyvariants make possible. The problem can now be
divided into several steps, and every step needs only to know the tags
it operates on. There is no global type all steps have to agree upon,
rather every step sees only the fraction of type information that is 
needed for the task.
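&#60;/p&#62;&#60;p&#62;To illustrate, here is a sketch of a pipeline with a third step
added. The first two functions are compact rewrites of the ones above
(so that the block is self-contained), while &#60;code&#62;drop_others&#60;/code&#62; and
&#60;code&#62;analyze_clean&#60;/code&#62; are hypothetical additions. The extra step only
knows the &#60;code&#62;`Other&#60;/code&#62; tag.

```ocaml
(* Step 1: classify every character of the string. *)
let classify_chars s =
  List.map
    (fun c ->
       match c with
       | '0'..'9' -> `Digit c
       | 'A'..'Z' | 'a'..'z' -> `Letter c
       | _ -> `Other c)
    (List.init (String.length s) (String.get s))

(* Step 2: replace runs of `Digit by a single `Number. *)
let recognize_numbers l =
  let flush acc out =
    match acc with
    | None -> out
    | Some v -> `Number v :: out
  in
  let rec go acc out = function
    | `Digit d :: rest ->
        let d_v = Char.code d - Char.code '0' in
        let v = match acc with Some v -> 10 * v + d_v | None -> d_v in
        go (Some v) out rest
    | x :: rest -> go None (x :: flush acc out) rest
    | [] -> List.rev (flush acc out)
  in
  go None [] l

(* Step 3 (hypothetical): drop `Other; every other tag passes through. *)
let drop_others l =
  List.filter (function `Other _ -> false | _ -> true) l

let analyze_clean s =
  drop_others (recognize_numbers (classify_chars s))
```

None of the three steps mentions a tag it does not itself inspect or
produce, yet they compose without any shared type declaration.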


&#60;/p&#62;&#60;h2&#62;Limiting Tags&#60;/h2&#62;

&#60;p&#62;Compare these two functions:

&#60;/p&#62;&#60;blockquote lang=&#34;x-ocaml&#34;&#62;
&#60;code&#62;&#60;pre&#62;
let number_value1 t =
  match t with
   | `Number n -&#38;#62; n
   | `Digit d -&#38;#62; Char.code d - Char.code &#38;#39;0&#38;#39;

let number_value2 t =
  match t with
   | `Number n -&#38;#62; n
   | `Digit d -&#38;#62; Char.code d - Char.code &#38;#39;0&#38;#39;
   | _ -&#38;#62; failwith &#38;#34;This is not a number&#38;#34;
&#60;/pre&#62;&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;The difference is that the second version explicitly catches the case
of &#38;#34;any other tag&#38;#34; whereas the first version leaves this unspecified.
This leads to different typings:

&#60;/p&#62;&#60;blockquote lang=&#34;x-ocaml&#34;&#62;
&#60;code&#62;&#60;pre&#62;
val number_value1 : [&#38;#60; `Digit of char | `Number of int ] -&#38;#62; int
val number_value2 : [&#38;#62; `Digit of char | `Number of int ] -&#38;#62; int
&#60;/pre&#62;&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;So in the first version we have a &#38;#34;&#38;#60;&#38;#34; sign! This sign usually only
appears for input arguments, and means that the function can only
process these tags (or fewer tags), but no other tags. The type checker
prevents any other tag from being passed in:

&#60;/p&#62;&#60;blockquote lang=&#34;x-ocaml&#34;&#62;
&#60;code&#62;&#60;pre&#62;
# number_value1 (`Letter &#38;#39;a&#38;#39;);;
This expression has type [&#38;#62; `Letter of char ] but is here used with type
  [&#38;#60; `Digit of char | `Number of int ]
&#60;/pre&#62;&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;In the second version of the function, this case is handled at
runtime.  From a typing perspective, the &#38;#34;&#38;#62;&#38;#34; sign signals that the
function can also accept other tags than the mentioned ones. However,
this function actually raises an exception.

&#60;/p&#62;&#60;blockquote lang=&#34;x-ocaml&#34;&#62;
&#60;code&#62;&#60;pre&#62;
# number_value2 (`Letter &#38;#39;a&#38;#39;);;
Exception: Failure &#38;#34;This is not a number&#38;#34;.
&#60;/pre&#62;&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;So the &#38;#34;&#38;#60;&#38;#34; sign is a way to limit the number of tags a function
can process. The programmer gets it by not adding a &#38;#34;catch all&#38;#34; case
to pattern matchings. This kind of polyvariant is useful to enforce
some strictness in programming, and the type checker catches the cases
that would otherwise only be handled at runtime.
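&#60;/p&#62;&#60;p&#62;A function with such a limited input type also accepts values
whose type allows only some of the listed tags. A short sketch (the names
&#60;code&#62;only_numbers&#60;/code&#62; and &#60;code&#62;total&#60;/code&#62; are invented for the
example):

```ocaml
let number_value1 t =
  match t with
  | `Number n -> n
  | `Digit d -> Char.code d - Char.code '0'

(* The elements of this list can only ever carry the Number tag;
   passing them to number_value1 is still accepted. *)
let only_numbers : [ `Number of int ] list = [ `Number 4; `Number 2 ]

let total =
  List.fold_left (fun acc t -> acc + number_value1 t) 0 only_numbers

(* Rejected by the type checker, because the Letter tag is not among
   the accepted ones:
   let _ = number_value1 (`Letter 'a') *)
```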


&#60;/p&#62;&#60;h2&#62;Giving Polyvariants Names&#60;/h2&#62;

&#60;p&#62;
Although the mantra of this article is that we don&#38;#39;t need declarations,
it is of course possible to define named polymorphic variants. For
example, we could introduce these named types:

&#60;/p&#62;&#60;blockquote lang=&#34;x-ocaml&#34;&#62;
&#60;code&#62;&#60;pre&#62;
type classified_char =
  [ `Digit of char | `Letter of char | `Other of char ]

type number_token =
  [ `Digit of char | `Number of int ]
&#60;/pre&#62;&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;Note that there is no &#38;#34;&#38;#62;&#38;#34; or &#38;#34;&#38;#60;&#38;#34; sign in such definitions - it
would not make sense to say something about whether more or fewer tags
are possible than given, because the context is missing.

&#60;/p&#62;&#60;p&#62;Using these names, one could simplify the typings of our functions:

&#60;/p&#62;&#60;blockquote lang=&#34;x-ocaml&#34;&#62;
&#60;code&#62;&#60;pre&#62;
val classify_chars : string -&#38;#62; [&#38;#62; classified_char ]
val recognize_numbers : ( [&#38;#62; number_token ] as &#38;#39;a) list -&#38;#62; &#38;#39;a list 
val number_value1 : [&#38;#60; number_token ] -&#38;#62; int
val number_value2 : [&#38;#62; number_token ] -&#38;#62; int
&#60;/pre&#62;&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;As you see, the &#38;#34;&#38;#62;&#38;#34; or &#38;#34;&#38;#60;&#38;#34; sign can still appear in function
types using these names. Actually, there is a special syntax behind
this notation. If you just say &#60;code&#62;number_token&#60;/code&#62; in a
type expression, exactly the definition applies (without sign).
But you can also construct new polyvariants from existing ones.
For example, 

&#60;/p&#62;&#60;blockquote lang=&#34;x-ocaml&#34;&#62;
&#60;code&#62;&#60;pre&#62;[ classified_char | number_token ]
&#60;/pre&#62;&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;would mean a type that combines the tags of both types, i.e. this is 
the same as if all four tags were enumerated. The syntax
&#60;code&#62;[&#38;#60;number_token ]&#60;/code&#62; is only a special case of this
type constructor, where the new type also gets a sign.
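
&#60;/p&#62;&#60;p&#62;As a minimal sketch of such a combination, the following
repeats the two named types and joins them. The names
&#60;code&#62;any_token&#60;/code&#62;, &#60;code&#62;describe&#60;/code&#62; and
&#60;code&#62;widen&#60;/code&#62; are hypothetical and only serve as illustration:

&#60;/p&#62;&#60;blockquote lang=&#34;x-ocaml&#34;&#62;
&#60;code&#62;&#60;pre&#62;
```ocaml
type classified_char =
  [ `Digit of char | `Letter of char | `Other of char ]

type number_token =
  [ `Digit of char | `Number of int ]

(* Hypothetical name: the union of both named types, i.e. the four tags
   `Digit, `Letter, `Other and `Number. *)
type any_token = [ classified_char | number_token ]

(* Hypothetical helper: handles every tag of the combined type. *)
let describe (t : any_token) : string =
  match t with
  | `Digit c -> Printf.sprintf "digit %c" c
  | `Letter c -> Printf.sprintf "letter %c" c
  | `Other c -> Printf.sprintf "other %c" c
  | `Number n -> Printf.sprintf "number %d" n

(* A value of one of the smaller types can be coerced to the combined type. *)
let widen (t : number_token) : any_token = (t :> any_token)
```
&#60;/pre&#62;&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;Note that the combination only works here because
&#60;code&#62;`Digit&#60;/code&#62; carries a &#60;code&#62;char&#60;/code&#62; in both definitions.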


&#60;/p&#62;&#60;h2&#62;Matching Variants&#60;/h2&#62;

&#60;p&#62;Try to compile this piece of code that ought to sum up all 
&#60;code&#62;`Digit&#60;/code&#62; and &#60;code&#62;`Number&#60;/code&#62; tags of a list of
any tags:

&#60;/p&#62;&#60;blockquote lang=&#34;x-ocaml&#34;&#62;
&#60;code&#62;&#60;pre&#62;
let rec sum l =
  match l with
    | x :: l&#38;#39; -&#38;#62;
      ( match x with
         | `Digit _ | `Number _ -&#38;#62;
              number_value1 x + sum l&#38;#39;
         | _ -&#38;#62;
              sum l&#38;#39;
      )
    | [] -&#38;#62; 
      0
&#60;/pre&#62;&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;It compiles, but there is a little surprise. The compiler emits 
a warning that the &#60;code&#62;_ -&#38;#62; sum l&#38;#39;&#60;/code&#62; case of the matching
is unused, and the inferred type is just

&#60;/p&#62;&#60;blockquote lang=&#34;x-ocaml&#34;&#62;
&#60;code&#62;&#60;pre&#62;
val sum : [ `Digit of char | `Number of int ] list -&#38;#62; int 
&#60;/pre&#62;&#60;/code&#62;
&#60;/blockquote&#62;

i.e. the &#38;#34;&#38;#62;&#38;#34; sign is missing that would allow us to pass any tags
in. What&#38;#39;s wrong? 

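&#60;p&#62;This is a pitfall one quickly runs into when using polyvariants.
The type checker assumes that the &#60;code&#62;x&#60;/code&#62; in 
&#60;code&#62;number_value1 x&#60;/code&#62; has the same type as the &#60;code&#62;x&#60;/code&#62;
that is matched against. Because &#60;code&#62;number_value1&#60;/code&#62; accepts only
&#60;code&#62;`Digit&#60;/code&#62; and &#60;code&#62;`Number&#60;/code&#62;, the type of &#60;code&#62;x&#60;/code&#62;
is restricted to these two tags everywhere, and the wildcard case can
never match. Merely using a matched variable in the expression of a case
does not narrow its type for that case alone.
Fortunately, there is a special syntax for that (look at the bold &#38;#34;as&#38;#34;
clause):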

&#60;/p&#62;&#60;blockquote lang=&#34;x-ocaml&#34;&#62;
&#60;code&#62;&#60;pre&#62;
let rec sum l =
  match l with
    | x :: l&#38;#39; -&#38;#62;
      ( match x with
         | (`Digit _ | `Number _) &#60;b&#62;as y&#60;/b&#62; -&#38;#62;
              number_value1 &#60;b&#62;y&#60;/b&#62; + sum l&#38;#39;
         | _ -&#38;#62;
              sum l&#38;#39;
      )
    | [] -&#38;#62; 
      0
&#60;/pre&#62;&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;Here, &#60;code&#62;y&#60;/code&#62; is the value &#60;code&#62;x&#60;/code&#62; for the case that
the match applies, so &#60;code&#62;y&#60;/code&#62; can have a stricter type than
&#60;code&#62;x&#60;/code&#62;.

&#60;/p&#62;&#60;p&#62;Alternatively, the match condition could also have been written as:

&#60;/p&#62;&#60;blockquote lang=&#34;x-ocaml&#34;&#62;
&#60;code&#62;&#60;pre&#62;
match x with
  | #number_token as y -&#38;#62; ...
&#60;/pre&#62;&#60;/code&#62;
&#60;/blockquote&#62;

i.e. one can use named polyvariants in matchings. This is just a 
convenience notation for the former.
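
&#60;p&#62;Put together, a self-contained sketch of the corrected function could
look as follows. The body of &#60;code&#62;number_value1&#60;/code&#62; is an assumed
stand-in for this example (only its type matters here), and it is given
the exact type &#60;code&#62;number_token&#60;/code&#62; for simplicity:

&#60;/p&#62;&#60;blockquote lang=&#34;x-ocaml&#34;&#62;
&#60;code&#62;&#60;pre&#62;
```ocaml
type number_token =
  [ `Digit of char | `Number of int ]

(* Assumed stand-in for number_value1: it accepts only the two
   number_token tags and returns their numeric value. *)
let number_value1 (t : number_token) : int =
  match t with
  | `Digit c -> Char.code c - Char.code '0'
  | `Number n -> n

(* #number_token stands for (`Digit _ | `Number _). The alias y gets the
   narrow type, while x stays open, so other tags reach the wildcard. *)
let rec sum l =
  match l with
  | x :: l' ->
    ( match x with
      | #number_token as y -> number_value1 y + sum l'
      | _ -> sum l'
    )
  | [] -> 0
```
&#60;/pre&#62;&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;With this version, a list that also contains other tags, such as
&#60;code&#62;`Letter&#60;/code&#62;, is accepted, and those elements are simply skipped.&#60;/p&#62;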


&#60;h2&#62;Behind the Scenes&#60;/h2&#62;

&#60;p&#62;Polyvariants might be cool, but many programmers suspect that
the performance of their programs suffers when they use them. Well,
although there is some runtime cost, it is very small, and not
noticeable for many programs.

&#60;/p&#62;&#60;p&#62;Internally, the tags are represented by hash values of the names
of the tags. So the tags are simply reduced to integers at runtime.
Compared with the normal variant types, there is some additional
overhead for tags with values. In particular, for storing
&#60;code&#62;`X&#38;#160;value&#60;/code&#62; one extra word is needed in comparison with
the normal variant &#60;code&#62;X&#38;#160;value&#60;/code&#62;.

&#60;/p&#62;&#60;p&#62;It is possible that the hash values of variants collide, e.g.

&#60;/p&#62;&#60;blockquote lang=&#34;x-ocaml&#34;&#62;
&#60;code&#62;&#60;pre&#62;
# type x = [ `jagJhn | `oZshTt ];;
Variant tags `jagJhn and `oZshTt have same hash value. Change one of them.
&#60;/pre&#62;&#60;/code&#62;
&#60;/blockquote&#62;

As you see, the compiler checks for this rare case. I&#38;#39;ve never seen
it in practice.
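
&#60;p&#62;To make the hashing more concrete, here is a sketch of the hash
function as far as I know it from the compiler sources. Treat it as an
assumption rather than a specification: the name
&#60;code&#62;hash_variant&#60;/code&#62; and the details are not taken from this
article, and the sketch assumes a 64-bit platform.

&#60;/p&#62;&#60;blockquote lang=&#34;x-ocaml&#34;&#62;
&#60;code&#62;&#60;pre&#62;
```ocaml
(* Sketch of the tag hash, assumed to mirror the function used by the
   compiler: a base-223 polynomial over the characters of the tag name,
   reduced to 31 bits. *)
let hash_variant (name : string) : int =
  let accu = ref 0 in
  String.iter (fun c -> accu := 223 * !accu + Char.code c) name;
  (* keep only the low 31 bits *)
  let h = !accu land ((1 lsl 31) - 1) in
  (* reinterpret the result as a signed 31-bit value *)
  if h > 0x3FFFFFFF then h - (1 lsl 31) else h
```
&#60;/pre&#62;&#60;/code&#62;
&#60;/blockquote&#62;

&#60;p&#62;Under this sketch, &#60;code&#62;hash_variant &#34;jagJhn&#34;&#60;/code&#62; and
&#60;code&#62;hash_variant &#34;oZshTt&#34;&#60;/code&#62; yield the same integer, which is
consistent with the error message shown above.&#60;/p&#62;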

&#60;p&#62;Despite rumours, nothing special is done at link time. The
tags have already been reduced to integers by that stage of compilation.


&#60;/p&#62;&#60;h2&#62;Conclusion&#60;/h2&#62;

&#60;p&#62;I hope I&#38;#39;ve shown how elegant code looks that uses polyvariants to
represent data cases. But there is some more to say, especially if you
look at other programming languages.

&#60;/p&#62;&#60;p&#62;I think that polymorphic variants are one
of the features that make O&#38;#39;Caml so different from
mainstream languages like Java. There are competing
approaches for representing data cases, and one of the radical ideas
of object orientation has always been that combining data and program
cases into a single class construct is the best way to deal with the
problem. However, I think something has been overlooked: data and
algorithms do not always walk hand in hand, and classes are inflexible
if only a loose correlation between both is needed. In contrast, 
polyvariants give the programmer maximum freedom in this respect.
I therefore believe that polyvariants are a serious alternative for
representing data cases.

&#60;/p&#62;
&#60;/div&#62;

&#60;div&#62;
  Gerd Stolpmann works as an O&#38;#39;Caml consultant

&#60;/div&#62;

&#60;div&#62;
  
      &#60;p&#62;&#60;b&#62;Links:&#60;/b&#62;
      &#60;ul&#62;
	
	    &#60;li&#62;&#60;a href=&#34;http://caml.inria.fr/pub/docs/manual-ocaml/manual006.html#htoc41&#34;&#62;Polymorphic variants&#60;/a&#62;:
	    Chapter in the O&#38;#39;Caml manual
	  &#60;/li&#62;
      &#60;/ul&#62;
    &#60;/p&#62;
&#60;/div&#62;


          </description>
        </item>
      
  </channel>
</rss>
