Recognising a language with Antlr and Python

A while ago I started writing an IDL for D-Bus, which was envisioned as afairly complex language. A large number of libraries and programs wereavailable to help but I settled on Antlr. I had tried flex / bison beforeand never had a good experience. Besides ruling out the ‘C’ version, thisalso discouraged me from using language implementations such as ply.

Antlr is a parser generator, similar to flex / bison, and written in Java.It has the advantage of having many ‘target’ languages. This means it ispossible to create parsers in many languages including Python, C, and Java.As a parser generator Antlr has its own language for describing languages.This is supposed to make the grammar of the language clear, and act as aspecification for the language as well as generating an implementation of aparser for it.

In Antlr the lexer and parser are described in the same file. The differencebetween a lexer and parser is technical but Antlr makes this distinction,parsing a language in two separate stages. The lexer is simpler and servesto split a sequence of characters into a sequence of words or tokens. Theparser can be more complicated and is used to ‘recognise’ a sequence ofsymbols that conform to the grammar of the language.

Descriptions of languages are made up of a sequence of ‘rules’ each of theserules what is permissible within the language. Lexer rules start with anupper-case character, and parser rules start with a lower-case character.Below is a grammar file that parses a very simple css-like language.

grammar simplecss;options {    language = Python;}/* Parser rules */cssdocument : cssrule+ ;cssrule : IDENTIFIER '{' declarationlist '}' ;declarationlist : declaration+ ;declaration : IDENTIFIER ':' IDENTIFIER ';' ;/* Lexer rules */fragment LETTER : 'A'..'Z'                | 'a'..'z'                | '_'                ;fragment DIGIT : '0'..'9' ;IDENTIFIER : LETTER (LETTER | DIGIT)* ;COMMENT : '/*' .* '*/' {$channel = HIDDEN} ;WHITESPACE : ( 't' | ' ' | 'r' | 'n'| 'u000C' )+             {$channel = HIDDEN} ;

Starting with the big-picture the first rule in this file is simplecss.This is the top-level rule and matches a sequence of one or more css rules.Antlr syntax is somewhat consistent with regexp and feels moderatelynatural. The symbol indicating one-or-more is +.

cssrule : IDENTIFIER '{' declarationlist '}' ;

The css rule itself is an IDENTIFIER token followed by a declarationlistinside braces. The '{' is a literal token within a parser rule.

In the parser rules it is acceptable to use sub-rules, such as whendeclaring a declarationlist as one or more of the rule declaration. Thisis not true for lexer rules. The lexer is used to process a stream ofcharacters in to a stream of tokens. Each lexer rule becomes a token. To usea lexer rule within others you must declare that it as a fragment. Thismeans that it will not become a token and is used just for composing otherlexer rules.

fragment DIGIT : '0'..'9' ;IDENTIFIER : LETTER (LETTER | DIGIT)* ;

From the above rule you it can be seen that an IDENTIFIER token is verysimple, just a word composed of letters and digits. As the IDENTIFER tokenis used in place of the selector, this language only allows very basicselectors.

COMMENT : '/*' .* '*/' {$channel = HIDDEN} ;WHITESPACE : ( 't' | ' ' | 'r' | 'n'| 'u000C' )+             {$channel = HIDDEN} ;

The above rules are taken from a page describing Antlr grammars. They serve to removecomments and extraneous whitespace from the token stream. This is veryuseful as comments and whitespace can appear almost anywhere within thelanguage. If these were processed in to tokens the parser rules would becomeexcessively complicated.

The documentation for Antlr grammars is excellent, as well as more involvedtutorials available on the Antlr website there isalso a bookavailable that does a decent job in this area. The list of operators used tocompose parser and lexer rules is found on the Antlr cheat sheet.

Generating Python code

Once the grammar of the language has been declared, Antlr is used togenerate programs that recognise this language. To indicate that a programshould be created in the Python language language = Python; is placed inthe grammar options. The default language is Java.

Antlr version 3 is available in the Ubuntu repositories in the antlr3package. I did not use this as it is a substantially older version. Thelatest can be downloaded from the project website. Some Python dependencies will also berequired: the Antlr runtime module and the Stringtemplate module. The Antlrruntime library needs to be exactly the same version as the Antlr tool. Thismay mean that it is not possible to use the very latest version of Antlrwith Python. I used version 3.1.2 of the Antlr tool and the Python runtimelibrary. Iused version 3.2.1 of the Stringtemplate module.

Once you have downloaded the Antlr tool as a jar you can run it directly.

> java -classpath antlr-3.1.2.jar org.antlr.Tool simplecss.g

This will generate two python files Make sure that the python dependencies are in thePYTHONPATH. It is now possible to write a script that checks if a fileconforms to the language grammar.

import sysimport antlr3from simplecssLexer import simplecssLexerfrom simplecssParser import simplecssParserdef main (argv):    filename = argv[1]    input = antlr3.FileStream (filename)    lexer = simplecssLexer (input)    tokens = antlr3.CommonTokenStream (lexer)    parser = simplecssParser (tokens)    res = parser.cssdocument ()if __name__ == "__main__":    sys.exit (main (sys.argv))

This tool is not very useful. All it can do is find out if a document iscorrect, and report errors if it is not. Antlr also has support forattaching actions to parser rules and translating a sequence of tokens in toan abstract syntax tree. A data structure that is easy to manipulate withinPython. In a future post I will show how to use actions and abstract syntaxtrees to create something a little more exciting.