more blog than sense

Graham decides that he should start a blog for no good reason at all.

Friday, February 19, 2010

Making a null language with Ocaml

I wanted to learn how to use Ocaml to write a little language for a hobby project. Ocaml has an powerful set of language-development tools — from lexer and parser generators to native-code compilers — but the documentation is not always geared toward the novice. I've found some very thorough examples, but most of them are overly worked, making it hard to understand how the various parts fit together.

Today, I'm going to implement a null parser: that is, a parser that completely ignores its input. Feed it whatever you want: a Java program, your grocery list, whatever — regardless, it will always emit the same, static code. Since there's no actual parsing going on, what remains ought to be the minimum scaffolding required to host a real parser.

The input to the parser (which we're going to ignore) is a Stream of characters. The output is an Ocaml abstract syntax tree, which can be pretty-printed for human consumption or fed directly into the Ocaml compiler. The compiler takes the AST and generates an executable.

(Sidebar: The murky lineage of camlp5 and camlp4)

Skip this paragraph if you're easily confused. Ocaml comes bundled with a tool called the "Caml preprocessor and pretty printer", or camlp4 for short. The primary task of camlp4 is to parse Ocaml source code and emit Ocaml AST into the compiler. To make a long story short, camlp4 underwent a significant design change a while back; but some people preferred the earlier version, so they forked the old camlp4 and packaged it up as an alternative package called camlp5. So, camlp5 is older tech than camlp4.

I'm using camlp5 here, for a number of reasons (including the superior documentation available for it: most camlp4 documents online refer to the old camlp4, not the new one, and so are actually compatible with camlp5. I know, you couldn't make this stuff up.).

The null parser

Let's jump into the implementation. Here's the parser, pa_null.ml:

(* define the parser *)
let null_parser stream =
  let loc = Ploc.dummy 
  and directive_found = false 
  in
    ( [ (<:str_item< print_endline "|your code          |"; >>, loc);
        (<:str_item< print_endline "|has been replaced  |"; >>, loc);
        (<:str_item< print_endline "|by the null parser.|"; >>, loc);
      ], directive_found)

(* install it *)
let () = Pcaml.parse_implem := null_parser

We're providing an alternate implementation for Pcaml.parse_implem, which is camlp5's toplevel parsing rule: it's the maestro that reads the source code, and calls upon a suite of smaller parsers to break the code down into an AST. Pcaml.parse_implem's output is a pair of values: a list of structure items (toplevel expressions, of type str_item) and their corresponding locations in the source code; and a boolean flag (directive_found) which we're totally going to ignore.

Since we're not actually lexing any code, there aren't any source-code locations to speak of; so we use Ploc.dummy to create a dummy-location, just to keep camlp5 happy. (In a real parser, these locations could be used to indicate the exact locations of errors in the source code.)

The toplevel expressions (structure items) are expressed using a quotation syntax which expands the quoted expression into an AST node. (Strictly, these quotations are optional: our null-parser could have emitted a program with no instructions in it. But it's bad enough that our parser reads nothing: it ought to at least spit something out.) So,

<:str_item< print_endline "|by the null parser.|"; >>

…is expanded into an AST node for a toplevel expression, which in this case is an instruction to print a message to the screen. (In a Lisp language, this would be the 'backtick' in a macro definition. In Ocaml's case, the syntax expressions are statically typed, hence the :str_item declaration.)

Note that while the stream argument is passed into our parser, we never refer to it. The original Pcaml.parse_implem rule (which we replaced) would have lexed the stream into Ocaml-syntax tokens, and then parsed those tokens into an AST.

Building and using the parser

We'll use a Makefile to manage our building. The first rule builds the parser, and the second rule uses it to compile a hello-world program.

pa_null.cmo: pa_null.ml
    ocamlc -I +camlp5 -pp 'camlp5o q_ast.cmo' -c pa_null.ml

hello: hello.ml pa_null.cmo
    ocamlc -pp 'camlp5o ./pa_null.cmo' hello.ml -o hello

The -pp statements are where we invoke camlp5 as a preprocessor: first, using the q_ast (quotations) module to enable the quotations-syntax we used in our parser; and second, using our own pa_null parser to "parse" our hello-world program.

Put whatever you want in hello.ml, and run make hello. The resulting executable, hello, should produce the following output:

$ ./hello
|your code          |
|has been replaced  |
|by the null parser.|

That's that.

While this isn't an introduction to Ocaml, if you want to follow along at home, both Ocaml and the camlp5 processor are available in Debian/Ubuntu's package systems (and are probably available in Fink or MacPorts for you Mac users).

Labels:

0 Comments:

Post a Comment

Links to this post:

Create a Link

<< Home