BLOG ON CAMLCITY.ORG: GODI

About LambdaRank

The ranking algorithm behind GODI Search - by Gerd Stolpmann, 2008-04-22

The question has always been: Are the results better if a search engine really understands the text it indexes? You can view my latest project, GODI Search, as an attempt to answer this question for a very limited set of documents, namely the code and its documentation of GODI, the source code O'Caml distribution. Actually, understanding text allows it to rank the results better, so that the more important occurrences of the query words are shown more at the beginning of the result list. Because ranking really bases on text interpretation, and the text is here a variant of lambda code, I call this method LambdaRank.

As every search engine, GODI Search consists roughly of two main components, namely the indexer and the searcher. The indexer iterates over the set of documents, and populates a database with the words it extracts. The searcher is the simpler part of the game, as it finds everything well-prepared in this database, and when a query comes in, it only has to pull the matching documents out of the database, sort it according to the ranking score, and show it to the user. So the tricky part is the indexing.

If you don't understand the text at all, you cannot do much about ranking but count the occurrences of a word in a document. The idea is that a text that speaks about a certain subject also mentions the subject more often than other texts, and thus the number of occurrences is a good measure for ranking. In a document set where the texts are connected with hyperlinks one can furthermore look at the relationships between the documents. Google's PageRank is based on this approach (but, as rumors say, is now heavily modified from its original design).

Fortunately, we know a lot about the document set GODI Search analyzes. Most documents are like this:

O'Caml code files and closely corresponding manuals dominate the corpus. GODI Search tries to make the best of this by analyzing O'Caml code in detail.

After looking at many examples I had some ideas which occurrences must be ranked higher than others.

Idea 1. Definitions of identifiers are more important than uses. This sounds natural, but actually not every definition is important. GODI Search also looks at the scope of the definition: A local definition is restricted to a surrounding function, and scores the least. Then follow definitions on module level, and in top-level modules. The highest score is given to exported identifiers that occur in top-level module interfaces.

Only let-bound identifiers are ranked this way. Function arguments, variables in pattern matchings, and fun-bound identifiers are ignored. This is a bit arbitrary, but my feeling is that these identifiers are usually not important in the coding styles I'm aware of.

For types, exceptions, and module names similar scoring techniques exist.

Idea 2. Values and types are rated separately. The namespace of all identifiers can be roughly divided into two big zones: Values and types. Of course, there are more kinds of identifiers (modules, classes, labels, file names,...), but my impression is that the typical programmer has a mind set that is dominated by only these two classes of symbols. In this sense, a "value" names an executable thing, and a "type" names meta data. Both have little to do with each other, and thus a document that contains many words "list" as type has little importance in a search for "list" as value.

Idea 3. Keywords are stopwords. Keywords occur in practically every code file in big number, and thus say nothing about it. GODI Search simply ignores keywords.

Idea 4. Qualified identifiers are hyperlinks. If you search for "mem" then you will get a list of top-level definitions of this function. The question is which occurrence is shown first. GODI Search implements the PageRank idea of scoring hyperlinks pointing to a document by looking at qualified identifiers. So if there are more "Hashtbl.find" than "List.find" in code anywhere in the corpus, the module "Hashtbl" scores higher than "List". (Actually, it is the other way round if you also take other references into account.)

Idea 5. Code and non-code are rated separately. Of course, the above applies only to text sections that are O'Caml code. Other languages and non-code cannot be rated this way. For this reason, GODI Search puts a lot of effort into separating both types of text. Currently, this is done on a per-paragraph basis, i.e. every paragraph is first analyzed in order to know whether it is O'Caml code or not. Also, comments and string literals in code files are considered as non-code.

So far about the ideas behind LambdaRank. The results of the implementation look promising: If the user types in "fold_left" he or she will be taken to really relevant occurrences. And user experience is what counts, finally.

Gerd Stolpmann works as O'Caml consultant

Links:

GODI Search: Try out GODI Search here
GODI Homepage: About GODI in general

This web site is published by Informatikbüro Gerd Stolpmann

Plasma	GitLab	Archive
Projects	Blog	Knowledge