One-upping the null parser: the amazing Echo language
In my previous article, I offered what I think is the simplest
parser you can write for Ocaml with camlp5 that will actually
generate compilable code. It was simple because it ignored its input:
every source program got compiled into the same static executable.
This time we're taking a big leap upward, into the world where parsers
actually consume their inputs. In this little language (let's call it
"echo"), each word in the source is translated into an Ocaml "print
statement" that outputs that word on its own line. (For clarity, I'm
also going to stick the word EMIT: in front of every one of the
printed lines.) So, the input program hello.src:
My fancy
hello-world program.
…will be compiled to an executable that outputs:
EMIT: My
EMIT: fancy
EMIT: hello-world
EMIT: program.
The "echo" language is more complex than the null language, and we have to organize things a bit differently in order to get a good result. Our basic strategy here is to:
- define a simple lexer, using the Ocamllex tool
-
replace
camlp5's Ocaml-oriented lexer with our own -
replace
camlp5's toplevel parser with our echo-parser - use our echo-parser to compile echo-programs into echo-executables.
I'm not sure that I need to introduce all of the extra machinery that you'll find below. But it seems clear that if I don't, I'm left with the burden of writing more scaffolding of my own. I'd rather not do that, so I'm willing to digest a bit of boilerplate in order to avoid make-work.
The lexer
We don't want to use Ocaml's own lexer, since Ocaml's notion of a valid token is very different from our simple language's. In "echo", a token is any sequence of non-whitespace characters. We don't need classify our tokens (e.g. into identifiers, numbers, punctuation).
Ocaml comes with a tool very similar to the classic lex for
generating lexers. It's called Ocamllex, and camlp5 has support
for integrating with an Ocammlex lexer. (You don't have to use
Ocamllex; camlp5 also has a roll-your-own-lexer alternative, but I
don't see the benefit.) Here's the definition of our simple lexer,
simplelex.mll:
{ open Lexing } let nonspace = [^ ' ' '\t' '\n']+ let otherwise = _ rule token = parse (* non-space sequences are tokens. *) nonspace { ("TOK", lexeme lexbuf) } (* 'eof' is special: always include a rule for it. *) | eof { ("EOF", lexeme lexbuf) } (* everything else: in this case, whitespace. *) | otherwise { token lexbuf } (* "ignore" *)
(This is an .mll file: Ocamllex will generate the related .ml file
for us.) We have to be careful to include the eof rule, which is
specified by Ocamllex itself and expected by our parser as and
end-of-input marker. The ordering of the rules is important: since
we're using an "anything-else" clause to represent whitespace, it must
appear last.
Integrating the lexer
In the previous article, our parser was a function that accepted a
stream of characters as input, which we ignored; the parser was
expected to translate the stream into some top-level code. I could
have written "echo" that way, and avoided the need for a formal lexer.
But there's a drawback to that approach. If you read the stream
yourself, directly, you're also responsible for reporting the
locations of your tokens into the parser. In the null-language, we
used a "dummy location" to sidestep this requirement; but that's not a
great strategy for a larger language. Those locations will turn out to
be very helpful when your language grows an actual syntax, and you
want to start pinpointing errors in your source code. Today, let's pay
the boilerplate tax and have camlp5 do the housekeeping for us.
We begin our parser, pa_echo.ml, with the lexer-integration
boilerplate:
open Pcaml let replace_lexer lxrule = let wrapped_lexer = (* for details, see http://is.gd/960EQ *) { Plexing.tok_func = (Token.lexer_func_of_ocamllex lxrule); Plexing.tok_using = (fun _ -> ()); tok_removing = (fun _ -> ()); tok_match = Plexing.default_match; tok_comm = None; tok_text = Plexing.lexer_text } in Grammar.Unsafe.gram_reinit Pcaml.gram wrapped_lexer let () = replace_lexer Simplelex.token
The last line is the money shot. Note that Simplelex is the
module-name associated with our lexer, defined in simplelex.ml, and
token is the name of the lexing rule we defined there.
EXTEND-ing the Ocaml grammar
The easiest way that I could find to get camlp5 to both recognize a
custom lexer and let you define your own toplevel parser was to use
a special syntax called "EXTEND", which provides a pattern-matching
language for overloading the camlp5 parsers. There's a camlp5
plugin called pa_extend which defines the EXTEND syntax: see the
Makefile, below.
Let's continue writing pa_echo.ml, and define our parser:
let echo token loc = (<:str_item< print_endline ("EMIT: " ^ $str:token$) >>, loc);; EXTEND implem: [ [ st = LIST0 [ s = TOK -> echo s loc ]; EOF -> (st, false) ] ]; END
The echo function takes a token and its location from the lexer, and
returns a pair of values: a structure item (toplevel expression), and
the location. If you take a look back at the last article, you might
recognize a similar data structure in the null-parser's definition.
The <:str_item< . >> thing is a quotation that builds an AST
node for our toplevel expression; the $str:token$ bit inside tells
the quotation-expander to treat the token as a literal string in the
generated code.
In the previous article, we learned that implem is camlp5's
top-level rule for parsing an Ocaml program. Here we overload implem
using the EXTEND syntax. The EXTEND section reads: implem
takes a list of zero or more tokens, which are turned into toplevel
expressions via the echo function, until the end-of-input is
encountered. Then, the list of toplevel expressions (and their
locations in the source) are returned to camlp5 for further
processing (along with the false flag which we'll avoid discussing
today).
(Being an astute reader, you probably noticed that loc is used in
the definition of implem, but it's not clear where loc's value
is coming from. Recall that EXTEND is a syntax-extension: the code
into which the EXTEND statement is expanded will bind a value to the
loc identifier.)
Build and test
This article isn't about best-practices for building Ocaml code: you
have several options there. To make the steps clear, I'm using a
Makefile with three rules: one to generate our lexer with Ocamllex;
one to build our echo parser; and one to use "echo" to compile the
hello.src program at the top of the article.
simplelex.ml: simplelex.mll ocamllex simplelex.mll pa_echo.cma: pa_echo.ml simplelex.ml ocamlc -I +camlp5 -pp 'camlp5o pa_extend.cmo q_MLast.cmo' \ -a -o pa_echo.cma \ simplelex.ml pa_echo.ml hello: hello.src pa_echo.cma ocamlopt -pp 'camlp5 ./pa_echo.cma pr_dump.cmo -impl' \ -impl hello.src -o hello
Run make hello and then ./hello and you should get the expected result.
Seeing the generated code
If you want, you can ask camlp5 to show you the Ocaml code it
generates through our "echo" parser. Try this:
camlp5 ./pa_echo.cma pr_r.cmo -impl hello.src | sed "s/; /;\n/g"
Note the similarities and differences with the hello rule in the
Makefile. The file pr_r.cmo is the "pretty printer" that outputs the
Ocaml source code. The sed pipe is optional, but will add some
newlines for easier reading. Running this on my hello.src, I get:
print_endline ("EMIT: " ^ "This");
print_endline ("EMIT: " ^ "is");
print_endline ("EMIT: " ^ "my");
print_endline ("EMIT: " ^ "hello-world");
print_endline ("EMIT: " ^ "program.");
…where ^ is Ocaml's string-concatenation operator. If this were a
real language, I'd probably put that command into a shell-script,
e.g., echodump, and rework the Makefile's hello rule into a
compiler-script, e.g. echoc.
In the generated code, I see that we're concatenating the strings at runtime. It would be possible to concatenate them at compile-time (making the program run a billionth of a second faster) by moving the concatenation to a different spot in the quotation expression. Can you figure out how?
Labels: ocaml
