more blog than sense

Graham decides that he should start a blog for no good reason at all.

Wednesday, February 24, 2010

One-upping the null parser: the amazing Echo language

In my previous article, I offered what I think is the simplest parser you can write for Ocaml with camlp5 that will actually generate compilable code. It was simple because it ignored its input: every source program got compiled into the same static executable. This time we're taking a big leap upward, into the world where parsers actually consume their inputs. In this little language (let's call it "echo"), each word in the source is translated into an Ocaml "print statement" that outputs that word on its own line. (For clarity, I'm also going to stick the word EMIT: in front of every one of the printed lines.) So, the input program hello.src:

My fancy
hello-world program.

…will be compiled to an executable that outputs:

EMIT: My
EMIT: fancy
EMIT: hello-world
EMIT: program.

The "echo" language is more complex than the null language, and we have to organize things a bit differently in order to get a good result. Our basic strategy here is to:

  • define a simple lexer, using the Ocamllex tool
  • replace camlp5's Ocaml-oriented lexer with our own
  • replace camlp5's toplevel parser with our echo-parser
  • use our echo-parser to compile echo-programs into echo-executables.

I'm not sure that I need to introduce all of the extra machinery that you'll find below. But it seems clear that if I don't, I'm left with the burden of writing more scaffolding of my own. I'd rather not do that, so I'm willing to digest a bit of boilerplate in order to avoid make-work.

The lexer

We don't want to use Ocaml's own lexer, since Ocaml's notion of a valid token is very different from our simple language's. In "echo", a token is any sequence of non-whitespace characters. We don't need classify our tokens (e.g. into identifiers, numbers, punctuation).

Ocaml comes with a tool very similar to the classic lex for generating lexers. It's called Ocamllex, and camlp5 has support for integrating with an Ocammlex lexer. (You don't have to use Ocamllex; camlp5 also has a roll-your-own-lexer alternative, but I don't see the benefit.) Here's the definition of our simple lexer, simplelex.mll:

{ open Lexing }

let nonspace = [^ ' ' '\t' '\n']+
let otherwise = _

rule token = parse
              (* non-space sequences are tokens. *)
    nonspace  { ("TOK", lexeme lexbuf) }
              (* 'eof' is special: always include a rule for it. *)
  | eof       { ("EOF", lexeme lexbuf) } 
              (* everything else: in this case, whitespace. *)
  | otherwise { token lexbuf }      (* "ignore" *)

(This is an .mll file: Ocamllex will generate the related .ml file for us.) We have to be careful to include the eof rule, which is specified by Ocamllex itself and expected by our parser as and end-of-input marker. The ordering of the rules is important: since we're using an "anything-else" clause to represent whitespace, it must appear last.

Integrating the lexer

In the previous article, our parser was a function that accepted a stream of characters as input, which we ignored; the parser was expected to translate the stream into some top-level code. I could have written "echo" that way, and avoided the need for a formal lexer. But there's a drawback to that approach. If you read the stream yourself, directly, you're also responsible for reporting the locations of your tokens into the parser. In the null-language, we used a "dummy location" to sidestep this requirement; but that's not a great strategy for a larger language. Those locations will turn out to be very helpful when your language grows an actual syntax, and you want to start pinpointing errors in your source code. Today, let's pay the boilerplate tax and have camlp5 do the housekeeping for us.

We begin our parser, pa_echo.ml, with the lexer-integration boilerplate:

open Pcaml

let replace_lexer lxrule = 
  let wrapped_lexer =     (* for details, see http://is.gd/960EQ *)
    { Plexing.tok_func  = (Token.lexer_func_of_ocamllex lxrule);
      Plexing.tok_using = (fun _ -> ());
      tok_removing      = (fun _ -> ());
      tok_match         = Plexing.default_match;
      tok_comm          = None;
      tok_text          = Plexing.lexer_text }
  in 
    Grammar.Unsafe.gram_reinit Pcaml.gram wrapped_lexer

let () = replace_lexer Simplelex.token

The last line is the money shot. Note that Simplelex is the module-name associated with our lexer, defined in simplelex.ml, and token is the name of the lexing rule we defined there.

EXTEND-ing the Ocaml grammar

The easiest way that I could find to get camlp5 to both recognize a custom lexer and let you define your own toplevel parser was to use a special syntax called "EXTEND", which provides a pattern-matching language for overloading the camlp5 parsers. There's a camlp5 plugin called pa_extend which defines the EXTEND syntax: see the Makefile, below.

Let's continue writing pa_echo.ml, and define our parser:

let echo token loc =
  (<:str_item< print_endline ("EMIT: " ^ $str:token$) >>, loc);;

EXTEND
implem:
[ [ st = LIST0 [ s = TOK -> echo s loc ]; EOF -> (st, false) ] ];
END

The echo function takes a token and its location from the lexer, and returns a pair of values: a structure item (toplevel expression), and the location. If you take a look back at the last article, you might recognize a similar data structure in the null-parser's definition. The <:str_item< . >> thing is a quotation that builds an AST node for our toplevel expression; the $str:token$ bit inside tells the quotation-expander to treat the token as a literal string in the generated code.

In the previous article, we learned that implem is camlp5's top-level rule for parsing an Ocaml program. Here we overload implem using the EXTEND syntax. The EXTEND section reads: implem takes a list of zero or more tokens, which are turned into toplevel expressions via the echo function, until the end-of-input is encountered. Then, the list of toplevel expressions (and their locations in the source) are returned to camlp5 for further processing (along with the false flag which we'll avoid discussing today).

(Being an astute reader, you probably noticed that loc is used in the definition of implem, but it's not clear where loc's value is coming from. Recall that EXTEND is a syntax-extension: the code into which the EXTEND statement is expanded will bind a value to the loc identifier.)

Build and test

This article isn't about best-practices for building Ocaml code: you have several options there. To make the steps clear, I'm using a Makefile with three rules: one to generate our lexer with Ocamllex; one to build our echo parser; and one to use "echo" to compile the hello.src program at the top of the article.

simplelex.ml: simplelex.mll
        ocamllex simplelex.mll

pa_echo.cma: pa_echo.ml simplelex.ml
        ocamlc -I +camlp5 -pp 'camlp5o pa_extend.cmo q_MLast.cmo' \
                -a -o pa_echo.cma \
                simplelex.ml pa_echo.ml

hello: hello.src pa_echo.cma
        ocamlopt -pp 'camlp5 ./pa_echo.cma pr_dump.cmo -impl' \
        -impl hello.src -o hello

Run make hello and then ./hello and you should get the expected result.

Seeing the generated code

If you want, you can ask camlp5 to show you the Ocaml code it generates through our "echo" parser. Try this:

camlp5 ./pa_echo.cma pr_r.cmo -impl hello.src | sed "s/; /;\n/g"

Note the similarities and differences with the hello rule in the Makefile. The file pr_r.cmo is the "pretty printer" that outputs the Ocaml source code. The sed pipe is optional, but will add some newlines for easier reading. Running this on my hello.src, I get:

print_endline ("EMIT: " ^ "This");
print_endline ("EMIT: " ^ "is");
print_endline ("EMIT: " ^ "my");
print_endline ("EMIT: " ^ "hello-world");
print_endline ("EMIT: " ^ "program.");

…where ^ is Ocaml's string-concatenation operator. If this were a real language, I'd probably put that command into a shell-script, e.g., echodump, and rework the Makefile's hello rule into a compiler-script, e.g. echoc.

In the generated code, I see that we're concatenating the strings at runtime. It would be possible to concatenate them at compile-time (making the program run a billionth of a second faster) by moving the concatenation to a different spot in the quotation expression. Can you figure out how?

Labels:

Friday, February 19, 2010

Making a null language with Ocaml

I wanted to learn how to use Ocaml to write a little language for a hobby project. Ocaml has an powerful set of language-development tools — from lexer and parser generators to native-code compilers — but the documentation is not always geared toward the novice. I've found some very thorough examples, but most of them are overly worked, making it hard to understand how the various parts fit together.

Today, I'm going to implement a null parser: that is, a parser that completely ignores its input. Feed it whatever you want: a Java program, your grocery list, whatever — regardless, it will always emit the same, static code. Since there's no actual parsing going on, what remains ought to be the minimum scaffolding required to host a real parser.

The input to the parser (which we're going to ignore) is a Stream of characters. The output is an Ocaml abstract syntax tree, which can be pretty-printed for human consumption or fed directly into the Ocaml compiler. The compiler takes the AST and generates an executable.

(Sidebar: The murky lineage of camlp5 and camlp4)

Skip this paragraph if you're easily confused. Ocaml comes bundled with a tool called the "Caml preprocessor and pretty printer", or camlp4 for short. The primary task of camlp4 is to parse Ocaml source code and emit Ocaml AST into the compiler. To make a long story short, camlp4 underwent a significant design change a while back; but some people preferred the earlier version, so they forked the old camlp4 and packaged it up as an alternative package called camlp5. So, camlp5 is older tech than camlp4.

I'm using camlp5 here, for a number of reasons (including the superior documentation available for it: most camlp4 documents online refer to the old camlp4, not the new one, and so are actually compatible with camlp5. I know, you couldn't make this stuff up.).

The null parser

Let's jump into the implementation. Here's the parser, pa_null.ml:

(* define the parser *)
let null_parser stream =
  let loc = Ploc.dummy 
  and directive_found = false 
  in
    ( [ (<:str_item< print_endline "|your code          |"; >>, loc);
        (<:str_item< print_endline "|has been replaced  |"; >>, loc);
        (<:str_item< print_endline "|by the null parser.|"; >>, loc);
      ], directive_found)

(* install it *)
let () = Pcaml.parse_implem := null_parser

We're providing an alternate implementation for Pcaml.parse_implem, which is camlp5's toplevel parsing rule: it's the maestro that reads the source code, and calls upon a suite of smaller parsers to break the code down into an AST. Pcaml.parse_implem's output is a pair of values: a list of structure items (toplevel expressions, of type str_item) and their corresponding locations in the source code; and a boolean flag (directive_found) which we're totally going to ignore.

Since we're not actually lexing any code, there aren't any source-code locations to speak of; so we use Ploc.dummy to create a dummy-location, just to keep camlp5 happy. (In a real parser, these locations could be used to indicate the exact locations of errors in the source code.)

The toplevel expressions (structure items) are expressed using a quotation syntax which expands the quoted expression into an AST node. (Strictly, these quotations are optional: our null-parser could have emitted a program with no instructions in it. But it's bad enough that our parser reads nothing: it ought to at least spit something out.) So,

<:str_item< print_endline "|by the null parser.|"; >>

…is expanded into an AST node for a toplevel expression, which in this case is an instruction to print a message to the screen. (In a Lisp language, this would be the 'backtick' in a macro definition. In Ocaml's case, the syntax expressions are statically typed, hence the :str_item declaration.)

Note that while the stream argument is passed into our parser, we never refer to it. The original Pcaml.parse_implem rule (which we replaced) would have lexed the stream into Ocaml-syntax tokens, and then parsed those tokens into an AST.

Building and using the parser

We'll use a Makefile to manage our building. The first rule builds the parser, and the second rule uses it to compile a hello-world program.

pa_null.cmo: pa_null.ml
    ocamlc -I +camlp5 -pp 'camlp5o q_ast.cmo' -c pa_null.ml

hello: hello.ml pa_null.cmo
    ocamlc -pp 'camlp5o ./pa_null.cmo' hello.ml -o hello

The -pp statements are where we invoke camlp5 as a preprocessor: first, using the q_ast (quotations) module to enable the quotations-syntax we used in our parser; and second, using our own pa_null parser to "parse" our hello-world program.

Put whatever you want in hello.ml, and run make hello. The resulting executable, hello, should produce the following output:

$ ./hello
|your code          |
|has been replaced  |
|by the null parser.|

That's that.

While this isn't an introduction to Ocaml, if you want to follow along at home, both Ocaml and the camlp5 processor are available in Debian/Ubuntu's package systems (and are probably available in Fink or MacPorts for you Mac users).

Labels:

Friday, February 15, 2008

PostgreSQL Love: deleting from views

I know that MySQL gets most of the attention in the relational-database field, as an easy-to-install, general-purpose relational database. But I've used PostgreSQL for years, and sometimes I have a moment like I did today that reminds me why.

In a quizzing application I run, I have an 'attempts' table with a lot of columns, and many of them are very wide. So, I created a view of the 'attempts' table that contains the key columns, and most of the time this view is what I use when I'm looking something up.

The problem is, you cannot delete records from a view (and rightly so). So, I often end up with a scenario like this:

# select * from attempts_brief where quiz=1479;
id    | quiz | person | score | out_of
-------+------+--------+-------+----
14981 | 1479 | admin  |    20 |     20
14982 | 1479 | admin  |       |  
(2 rows)
I want to delete those two dummy rows, so I alter the query, changing SELECT * to DELETE, as I would on any other cautious deletion attempt. But I get this:
# delete from attempts_brief where quiz=1479;
ERROR:  cannot delete from a view
HINT:  You need an unconditional ON DELETE DO INSTEAD rule.
Of course, the deletion fails, and I'm reminded that I'm not querying a table directly: silly programmer. But that hint, what a beaut! Using PostgreSQL's Rules system, I can tell the database what I really mean to do:
# create rule attempts_brief_del as on delete to attempts_brief
# do instead delete from attempts where id=OLD.id;
CREATE RULE

Now that the rule is in place, I can delete from the view, and the database will do the Right Thing:

# delete from attempts_brief where quiz=1479;
DELETE 2

Perfect!

It's not a huge thing, I know. But it just reminds me how much time the Postgres designers spent, not only creating a system that is fast, consistent and powerful, but that is helpful, flexible and user-friendly to boot. I love ya, Postgres.

Thursday, August 16, 2007

Roman Numerals in Haskell

Here's a toy Haskell program, for converting integers to Roman numerals. It's inspired directly by Doug Coleman's Factor version; I'm using the same algorithm he describes.

(Haskell's an interesting language for writing little things like this. I haven't used a statically-typed language in a long time; it's kind of like taking a hike in someone else's boots. The monad thing hasn't been as hard to wrap my head around as I initially thought it might be; I particularly like the way that exception handling can be generalized using monads.)

-- inspired by
-- http://code-factor.blogspot.com/2007/06/roman-numeral-conversion-in-factor.html

module RomanNumerals (toRoman) where

-- constants

romanValues = [1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1]
romanDigits = words "m cm d cd c xc l xl x ix v iv i"
theRomans   = zip romanValues romanDigits


-- internals

-- To do the actual conversion to Roman numerals, we left-fold over
-- theRomans, starting with a seed value of (accumulator,
-- remainder). The initial remainder is the value we want to
-- convert. Using theRomans as a resource, romFold breaks down the
-- remainder by modular arithmetic, adding Roman digits to the
-- accumulator as it goes. Finally we take the accumulator, and we're
-- done.

toRoman' :: Int -> String
romFold  :: ([a], Int) -> (Int, [a]) -> ([a], Int)

toRoman' n = fst $ foldl romFold ("", n) theRomans

romFold (acc, n) (romValue, romDigit) =
let (result, remain) = divMod n romValue
    newDigits = concat $ replicate result romDigit
in (acc ++ newDigits, remain)


romanRangeCheck :: (Monad m) => Int -> m Int
romanRangeCheck n
| 0 < n && n < 4000 = return n
| otherwise = fail "out of bounds [0 < n < 4000]"


-- The public, range-checking version.

toRoman :: Monad m => Int -> m String
toRoman n = romanRangeCheck n >>= (return . toRoman')

Wednesday, May 02, 2007

The kindness of dictators, and an octet-full of eggs

chicken-shirt I had a real treat today; I received a t-shirt from Felix Winkelmann, the benevolent dictator of the Chicken Scheme project. My prize was for submitting the 256th “egg” (extension module) to the language, a simple library for accessing Oracle databases. (Of course, it took 255 contributions from others to make mine shirt-worthy: the giants, and their shoulders, did the real work.)

I've been reflecting upon Scheme recently, not only because of the shirt (which, by the way, pleases me more than an article of clothing ought to), but also because I've been using Scheme in earnest these past few months, building a small-but-serious set of Web projects at work. Before working with Scheme, I would have chosen Python for these projects, without a second thought. A dynamic language is the only sensible tool for the bob-and-weave, roll-with-the-punches development style that gets projects like these off of the napkin and into production in short time.

Scheme fills the same dynamic niche, but makes different tradeoffs than Python, and its cousin Ruby, do. It really does feel more like a language-construction kit than a language in itself. You can use Scheme without defining any new syntax, but I think that syntactic freedom is where one of the sweet-spots is. I've done my share of metaprogramming in Python, trying to push the syntax-envelope in order to write concise and representative solutions. In Scheme, there really isn't an envelope to push: you can write anything you want, any way you want it, and sure enough, there's a way to treat it as valid Scheme code. Problems have a way of growing more complicated over time; the lack of a syntax barrier carries the promise (the siren song?) of solutions that can evolve in step with their problems.

Scheme communities are much smaller than those of mainstream languages, so you end up writing a lot of stuff that you take for granted in other languages. That's a strength, of sorts: the devil you wrote is sometimes easier to understand, and often much simpler, than the general library that someone else did. (On the other hand, a programmer who uses his own libraries may have a fool for a client.) The fragmentation of communities doesn't help in the library department; although Scheme-the-language is standardized, there are several dialects, each not fully compatible with the others. (My Oracle library, for example, is a problem that is already (partially) solved in PLT Scheme, but in a PLT-specific manner.) Fragmentation aside, those communities are smart as hell, and if you ask nicely they can help a great deal.

Anyway, back to those Web projects of mine. I can't say that they any shorter than they would have been in Python, and probably not much more readable. At least half of what I've written has been reusable library code, which is good; but I still write Scheme like a greenhorn, and I can see many areas I'd like to fix. So, it's comparable in complexity to the Python I would have written — though in different ways — but I'm optimistic that it will mature more naturally. I also like the performance; Web apps don't always need a lot of computational power, but it's hard to ignore just how well these apps perform: the hotspots just aren't that hot.

In short, I earned a shirt while conspiring with giants, in a wild-west, open-source frontier (or backwoods?), using a shockingly-powerful language that my grandpa would have recognized. Man, I love this stuff.


Friday, February 16, 2007

Rethinking OpenURL

Dan Chudnov says a lot of great things in Rethinking OpenURL, but this caught my eye:
"This, my friends, is dynamic service linking, in a nutshell. (And, yes, I do mean 'in a nutshell' as in 'ooh, ooh, help, I'm trapped in a nutshell, get me out of here.')"
And now, I'm off to read the rest before the weekend starts. :-)

Hello planet code4lib!

Hello to the readers of planet code4lib! I just found out my blog is being syndicated [t]here. In future, I'll try to keep the noise down, and finish my snacks before entering the blog...

Thursday, February 15, 2007

group-by: like itertools.group in Scheme

Just a little post. I wanted to share a Scheme procedure that worked something like Python's itertools.groupby function. Emphasis on "something like": for one thing, it works on lists only, rather than any "iterable" type, which is part of the magic and appeal of itertools.
(use srfi-1 srfi-26)

(define (group-by keyproc lst #!optional (is-equal equal?))
  ;; examples:
  ;; (group-by identity '(1 2 3 3 4 4 4)) --> ((1) (2) (3 3) (4 4 4))                                         
  ;; (group-by car '((a 1) (a 2) (b 1))) --> '(((a 1) (a 2)) ((b 1)))                                                   
  (let loop ((lst lst) (acc '()))
   (if (null? lst)
       (reverse acc)
       (let ((key (keyproc (car lst))))
         (receive (grouped rest)
             (span (lambda (item) (is-equal key (keyproc item))) lst)
           (loop rest (cons grouped acc)))))))
Also unlike the Python version, there's no default "key" function (Python uses an identity function as the default). To get the Pythonic behaviour, use identity (or values if your Scheme doesn't have an identity procedure). The optional is-equal? function lets you specify a different key-comparison function; by default, the comparison function is Scheme's equal? which is pretty liberal: strings are compared by value, for example, whereas Scheme's eq? compares by identity only. You might want to use a custom comparison function anyway --- for example, to group by strings, but case-insensitively:
(group-by first '(("fred" 25) ("Fred" 33) ("mary" 22)))
--> ((("fred" 25)) (("Fred" 33)) (("mary" 22)))
(group-by first '(("fred" 25) ("Fred" 33) ("mary" 22)) string-ci=?)
--> ((("fred" 25) ("Fred" 33)) (("mary" 22)))
That's all.