BLOG ON CAMLCITY.ORG: GODI
The ranking algorithm behind GODI Search - by Gerd Stolpmann, 2008-04-22
As every search engine, GODI Search consists roughly of two main components, namely the indexer and the searcher. The indexer iterates over the set of documents, and populates a database with the words it extracts. The searcher is the simpler part of the game, as it finds everything well-prepared in this database, and when a query comes in, it only has to pull the matching documents out of the database, sort it according to the ranking score, and show it to the user. So the tricky part is the indexing.
If you don't understand the text at all, you cannot do much about ranking but count the occurrences of a word in a document. The idea is that a text that speaks about a certain subject also mentions the subject more often than other texts, and thus the number of occurrences is a good measure for ranking. In a document set where the texts are connected with hyperlinks one can furthermore look at the relationships between the documents. Google's PageRank is based on this approach (but, as rumors say, is now heavily modified from its original design).
Fortunately, we know a lot about the document set GODI Search analyzes. Most documents are like this:
After looking at many examples I had some ideas which occurrences must be ranked higher than others.
Idea 1. Definitions of identifiers are more important than uses. This sounds natural, but actually not every definition is important. GODI Search also looks at the scope of the definition: A local definition is restricted to a surrounding function, and scores the least. Then follow definitions on module level, and in top-level modules. The highest score is given to exported identifiers that occur in top-level module interfaces.
Only let-bound identifiers are ranked this way. Function arguments, variables in pattern matchings, and fun-bound identifiers are ignored. This is a bit arbitrary, but my feeling is that these identifiers are usually not important in the coding styles I'm aware of.
For types, exceptions, and module names similar scoring techniques exist.
Idea 2. Values and types are rated separately. The namespace of all identifiers can be roughly divided into two big zones: Values and types. Of course, there are more kinds of identifiers (modules, classes, labels, file names,...), but my impression is that the typical programmer has a mind set that is dominated by only these two classes of symbols. In this sense, a "value" names an executable thing, and a "type" names meta data. Both have little to do with each other, and thus a document that contains many words "list" as type has little importance in a search for "list" as value.
Idea 3. Keywords are stopwords. Keywords occur in practically every code file in big number, and thus say nothing about it. GODI Search simply ignores keywords.
Idea 4. Qualified identifiers are hyperlinks. If you search for "mem" then you will get a list of top-level definitions of this function. The question is which occurrence is shown first. GODI Search implements the PageRank idea of scoring hyperlinks pointing to a document by looking at qualified identifiers. So if there are more "Hashtbl.find" than "List.find" in code anywhere in the corpus, the module "Hashtbl" scores higher than "List". (Actually, it is the other way round if you also take other references into account.)
Idea 5. Code and non-code are rated separately. Of course, the above applies only to text sections that are O'Caml code. Other languages and non-code cannot be rated this way. For this reason, GODI Search puts a lot of effort into separating both types of text. Currently, this is done on a per-paragraph basis, i.e. every paragraph is first analyzed in order to know whether it is O'Caml code or not. Also, comments and string literals in code files are considered as non-code.
So far about the ideas behind LambdaRank. The results of the implementation look promising: If the user types in "fold_left" he or she will be taken to really relevant occurrences. And user experience is what counts, finally.