BLOG ON CAMLCITY.ORG: GODI

Documentation fun

PXP-1.2.1 with a new reference manual - by Gerd Stolpmann, 2009-02-03

The last stable PXP release happened five years ago. That's a long time. Actually, a lot of development has taken place since then; it was just difficult to bring PXP into a releasable shape. Now the last missing piece has been added, namely extensive documentation. So I can proudly announce the best XML parser that has ever been available for programming in O'Caml: not only fast and feature-rich, but also easy to understand.

Writing documentation is something programmers do not like very much, and in the case of PXP, too, the code was far ahead of any description of it. There was a "User's Guide", but it took an oldish approach to explaining things that I no longer like. It was also very incomplete. Last year, I got some funding from a company to improve the PXP documentation, so I faced the task of reorganizing it completely and adding everything that was missing.

If you would like to take a look at the result, here it is: The PXP Reference.

Switching from docbook to ocamldoc

The old "User's Guide" was written as a docbook document. This is a good general-purpose text format that allows one to structure a large text into chapters, sections, etc., and to generate viewable and printable output from it (in particular, one can convert it into a set of HTML pages, and into PDF). However, there is one difficulty: it does not integrate well with ocamldoc-style interface references.

The "User's Guide" predates ocamldoc: when I wrote the first version of the PXP documentation, I had no choice but to use some third-party tool to process it. Now what to do? Stick with docbook, and somehow include ocamldoc in the processing chain? The docbook format clearly has more features for formatting text, e.g. one can easily include pictures. However, ocamldoc cannot produce output that could be converted to docbook with little effort, and this made that route infeasible.

I decided to switch completely to ocamldoc. Not only the module interfaces should be documented with it, but also the various introductory chapters explaining concepts that span several modules. Since O'Caml 3.09, ocamldoc understands the file suffix *.txt and treats such input files as pure documentation. One can still use all formatting directives like {2 headings} or {!Hyperlinks} pointing to code elements. However, there was still the difficulty of the features that docbook has and ocamldoc lacks, such as pictures.
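
As a minimal sketch of the two directives just mentioned, here they are inside ordinary ocamldoc comments in a source file. The function token_count and the section title are made up for illustration and have nothing to do with PXP. In a *.txt input file the same markup would presumably be written directly, without the comment delimiters.

```ocaml
(** {2 Tokens}

    This comment starts a section (level-2 heading) followed by
    running text. It links to a code element: see {!token_count}. *)

(** [token_count s] counts the space-separated words in [s].
    (Hypothetical helper, only here so the link has a target.) *)
let token_count s =
  String.split_on_char ' ' s
  |> List.filter (fun w -> w <> "")
  |> List.length
```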

So I looked at developing a custom HTML generator (I am mostly interested in outputting HTML). It is possible to load an add-on into ocamldoc that modifies its behaviour. One just has to write a class that inherits from Odoc_html.html, and overrides its methods:

class chtml =
  object(self)
    inherit Odoc_html.html as super

    method private html_of_<foo> ... = ...
  end

let chtml = new chtml
let _ = 
  Odoc_args.set_doc_generator (Some chtml :> Odoc_args.doc_generator option)

Of course, the question remained whether my features could be added this way (without rewriting half of the generator class). Yes, they can, and it needed only about 160 lines of code. I must admit it took quite a long time to develop this code, since I had to dig into the internals of ocamldoc to understand it better. But anyway, ocamldoc turns out to be a customizable utility.
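
The pattern of inheriting from a generator class and overriding single methods can be tried out without ocamldoc. The following is only a sketch of the technique: base_html and custom_html are hypothetical stand-ins, not ocamldoc classes, and the method names are invented for this example.

```ocaml
(* Hypothetical stand-in for a generator class such as Odoc_html.html.
   None of these names belong to ocamldoc. *)
class base_html =
  object (self)
    (* Render a heading of the given level. *)
    method html_of_heading (level : int) (text : string) : string =
      Printf.sprintf "<h%d>%s</h%d>" level text level

    (* Render a paragraph. *)
    method html_of_paragraph (text : string) : string =
      "<p>" ^ text ^ "</p>"

    (* Render a page by calling the methods above through [self],
       so that overrides in subclasses take effect. *)
    method html_of_page (title : string) (body : string list) : string =
      String.concat "\n"
        (self#html_of_heading 1 title
         :: List.map (fun p -> self#html_of_paragraph p) body)
  end

(* The customized generator: inherit everything, override one method. *)
class custom_html =
  object
    inherit base_html as super

    (* Delegate to the inherited method and wrap its result. *)
    method! html_of_paragraph text =
      "<div class=\"note\">" ^ super#html_of_paragraph text ^ "</div>"
  end

let page = (new custom_html)#html_of_page "Title" ["one"; "two"]
```

Because html_of_page calls the paragraph method through self, the override is picked up without touching html_of_page itself. This is what keeps such a customization small compared to rewriting the generator.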

What I added in particular:

A {picture} tag for including pictures
The possibility to change the output for include Module in interfaces so that the included interface is shown directly instead of only the include statement itself. For clarity, the included interface is indented and has a grey background. This change can be turned on and off with a {directinclude} tag.
The include change requires another feature to look really good. All references (hyperlinks and plain occurrences) pointing to the included module should be rewritten so that they point to the including module instead. That means if module N uses include M, we want all references M.x to be changed into N.x. The intention is that M is no longer referenced, and that the duplication of definitions in two modules cannot confuse readers (especially those who are unfamiliar with the module system). I added that feature for my specific case, and the ocamldoc tag {fixpxpcoretypes} enables the rewriting. (It changes Pxp_core_types.[S|I] into Pxp_types.)
A last change also has to do with the include feature. With {knowntype} and {knownclass} one can add identifiers to the lists of known types and classes, so that the generator emits hyperlinks to them although no such definition exists in reality. It turned out that many identifiers were already pointing to the including module, but because there is no definition in the mli file, ocamldoc does not make these identifiers clickable. With {knowntype} and {knownclass} one can change that on a case-by-case basis.
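
The reference rewriting described above (M.x becomes N.x) can be illustrated with a few lines of string handling. This is a sketch under assumptions, not the code from chtml.ml: the function names are invented here, and the identifier some_type in the usage note is hypothetical. Only the module prefixes are taken from the text.

```ocaml
(* Prefixes of the included module, and the prefix of the including
   module that replaces them (both taken from the remark in the text). *)
let included_prefixes = ["Pxp_core_types.S."; "Pxp_core_types.I."]
let including_prefix = "Pxp_types."

(* [starts_with ~prefix s] is true when [s] begins with [prefix]. *)
let starts_with ~prefix s =
  let n = String.length prefix in
  String.length s >= n && String.sub s 0 n = prefix

(* Rewrite one qualified name. Names that do not start with one of
   the included prefixes are returned unchanged. *)
let rewrite_ref name =
  let rec try_prefixes = function
    | [] -> name
    | p :: rest ->
        if starts_with ~prefix:p name then
          let n = String.length p in
          including_prefix ^ String.sub name n (String.length name - n)
        else try_prefixes rest
  in
  try_prefixes included_prefixes
```

With these definitions, rewrite_ref "Pxp_core_types.S.some_type" yields "Pxp_types.some_type", while a name from any other module passes through untouched.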

The full source code of the custom generator class can be studied here: chtml.ml. The module with the mentioned include directive is Pxp_types. Look here to see how nice the generated page is.

The need for conceptual introductions

XML is a cute and simple text format, right? Many people think so, but given that many XML parsers are either feature-rich and slow, or poor and fast, there must be some complexity in the XML definition. Recently, I read the article "XML fever" (by Erik Wilde and Robert J. Glushko, Communications of the ACM, issue 7, 2008), where the authors point out a number of deficiencies in the definition of XML that can lead to delusions about XML, and finally to "fever". After years of maintaining this XML parser, I can only agree with the authors. Clearly, there are problems even in the fundamental XML specification.

I do not want to complain about this - XML is widely used, and many of the standards are practically unfixable without breaking large numbers of programs. For me, the problem was how to explain all that. For example, there is the question of what is to be considered the root node of an XML tree. This is a conceptual question, and the explanation should not be hidden in an interface description of a PXP module. For that reason, I had to add a number of chapters to the manual that explain concepts and give a general introduction to the PXP world. All the Intro_* chapters are like this.

The nice thing now is that I can add direct links from introductory chapters to interface references and vice versa, since all documentation is processed with the same utility, ocamldoc. When some complicated issue arises in a function description, it is now possible to point to the section in the introduction where this issue is explained in detail, and conversely, I can point to the definition in the interface when a function or type is used in an intro chapter.

What next?

I must admit that my interest in XML has not grown in recent years, to put it politely. XML is most often used as a base technology for HTML, or as a data exchange format. Many of the advanced XML standards like XSLT or XQuery have not found their way into the daily life of us programmers. The hype is over.

Nevertheless, I promise that I will still maintain PXP, and now and then add another feature. For example, there is a nice XPath evaluator in the development pipeline - again, I do not find time to finish it, but hey, there are still many years for doing it. (By the way, if you want to accelerate that and have some money, we will find a way to quickly finish XPath.)

In August 2009, PXP turns 10 years old (counted from its first mention on the O'Caml mailing list). This is already a long time for a software library and an open source project. I am quite confident it will now also reach its 20th birthday!

Gerd Stolpmann works as an O'Caml consultant

Links:

PXP homepage: Find here links to downloads and documentation

This web site is published by Informatikbüro Gerd Stolpmann
