Making a null language with Ocaml
I wanted to learn how to use Ocaml to write a little language for a hobby project. Ocaml has an powerful set of language-development tools — from lexer and parser generators to native-code compilers — but the documentation is not always geared toward the novice. I've found some very thorough examples, but most of them are overly worked, making it hard to understand how the various parts fit together.
Today, I'm going to implement a null parser: that is, a parser that completely ignores its input. Feed it whatever you want: a Java program, your grocery list, whatever — regardless, it will always emit the same, static code. Since there's no actual parsing going on, what remains ought to be the minimum scaffolding required to host a real parser.
The input to the parser (which we're going to ignore) is a Stream of
characters. The output is an Ocaml abstract syntax tree, which can
be pretty-printed for human consumption or fed directly into the Ocaml
compiler. The compiler takes the AST and generates an executable.
(Sidebar: The murky lineage of camlp5 and camlp4)
Skip this paragraph if you're easily confused. Ocaml comes bundled
with a tool called the "Caml preprocessor and pretty printer", or
camlp4 for short. The primary task of camlp4 is to parse Ocaml
source code and emit Ocaml AST into the compiler. To make a long story
short, camlp4 underwent a significant design change a while
back; but some people preferred the earlier version, so they forked the
old camlp4 and packaged it up as an alternative package called
camlp5. So, camlp5 is older tech than camlp4.
I'm using camlp5
here, for a number of reasons (including the superior documentation
available for it: most camlp4 documents online refer to the old
camlp4, not the new one, and so are actually compatible with
camlp5. I know, you couldn't make this stuff up.).
The null parser
Let's jump into the implementation. Here's the parser, pa_null.ml:
(* define the parser *) let null_parser stream = let loc = Ploc.dummy and directive_found = false in ( [ (<:str_item< print_endline "|your code |"; >>, loc); (<:str_item< print_endline "|has been replaced |"; >>, loc); (<:str_item< print_endline "|by the null parser.|"; >>, loc); ], directive_found) (* install it *) let () = Pcaml.parse_implem := null_parser
We're providing an alternate implementation for Pcaml.parse_implem,
which is camlp5's toplevel parsing rule: it's the maestro that reads
the source code, and calls upon a suite of smaller parsers to break
the code down into an AST. Pcaml.parse_implem's output is a pair of
values: a list of structure items (toplevel expressions, of type
str_item) and their corresponding locations in the source code; and
a boolean flag (directive_found) which we're totally going to
ignore.
Since we're not actually lexing any code, there aren't any source-code
locations to speak of; so we use Ploc.dummy to create a
dummy-location, just to keep camlp5 happy. (In a real parser, these
locations could be used to indicate the exact locations of errors in
the source code.)
The toplevel expressions (structure items) are expressed using a quotation syntax which expands the quoted expression into an AST node. (Strictly, these quotations are optional: our null-parser could have emitted a program with no instructions in it. But it's bad enough that our parser reads nothing: it ought to at least spit something out.) So,
<:str_item< print_endline "|by the null parser.|"; >>
…is expanded into an AST node for a toplevel expression, which in
this case is an instruction to print a message to the screen. (In a
Lisp language, this would be the 'backtick' in a macro definition. In
Ocaml's case, the syntax expressions are statically typed, hence the
:str_item declaration.)
Note that while the stream argument is passed into our parser, we
never refer to it. The original Pcaml.parse_implem rule (which we
replaced) would have lexed the stream into Ocaml-syntax tokens, and
then parsed those tokens into an AST.
Building and using the parser
We'll use a Makefile to manage our building. The first rule builds the parser, and the second rule uses it to compile a hello-world program.
pa_null.cmo: pa_null.ml ocamlc -I +camlp5 -pp 'camlp5o q_ast.cmo' -c pa_null.ml hello: hello.ml pa_null.cmo ocamlc -pp 'camlp5o ./pa_null.cmo' hello.ml -o hello
The -pp statements are where we invoke camlp5 as a preprocessor:
first, using the q_ast (quotations) module to enable the quotations-syntax we used in our parser; and second, using our own pa_null
parser to "parse" our hello-world program.
Put whatever you want in hello.ml, and run make hello. The
resulting executable, hello, should produce the following output:
$ ./hello
|your code |
|has been replaced |
|by the null parser.|
That's that.
While this isn't an introduction to Ocaml, if you want to follow along
at home, both Ocaml and the camlp5 processor are available in
Debian/Ubuntu's package systems (and are probably available in Fink or
MacPorts for you Mac users).
Labels: ocaml
0 Comments:
Post a Comment
Links to this post:
Create a Link
<< Home