cdiggins.com

June 20, 2007

Parsing Expression Grammars are Good … but Packrat Parsing is not Neccessary

Filed under: Everything — cdiggins @ 5:12 pm

Parsing expression grammars (PEG) are becoming increasingly popular as a useful tool for implementing flexible and efficient parsers. A new language Katahdin was recently announced with runtime customizable syntax that leverages PEG’s and packrat parsing techniques to achieve this.

Packrat parsing uses memoization to guarantee parsing PEG languages with O(1) complexity. I am not convinced that using memoization is really neccessary. As far as I can see it just adds complexity. I have implemented several PEG based parsers (C++, C#, and even in Heron) which all appear to be O(1) but I haven’t provent it. Perhaps someone will look into this in more detail an with some rigour in the future.

6 Comments »

  1. Are you writing your grammars so that that parser doesn’t have to backtrack in the first place? For such grammars memoizing doesn’t increase performance. With Katahdin I wanted to allow grammar writers to just express the syntax as the use it, and not to worry about eliminating backtracking and making their grammar fast - the packrat algorithm then restores reasonable runtime.

    For example, for C grammars people often define function decleration and definitions in a single rule with an alternative:

    TYPE NAME “(” … “)” (”;” | STATEMENT)

    That doesn’t require any backtracking to decide between decleration and definition, but it isn’t as nice as it could be. Declerations and definitions are to different language constructs so why are they expressed with a single rule? With memoizing you can express the two separate constructs in two separate rules, which I think is nicer for the grammar writer:

    TYPE NAME “(” … “)” “;”
    TYPE NAME “(” … “)” STATEMENT

    With packrat parsing, this exampe example can be parsed as fast as the previous.

    Comment by chrisseaton — June 20, 2007 @ 5:44 pm

  2. Thanks for the clarification Chris. I didn’t realize how memoization allows for more clarity in the grammars definition.

    The only counter-argument I have (albeit a weak one) is that grammar production is generally quite simple given a good PEG parsing library. The Cat grammar for example in expressed in onely a few lines ( http://cat-language.googlecode.com/svn/trunk/cpp/cat_to_cpp/cat_grammar.hpp ). However, the grammar clearly lacks some elegance.

    I think re-implementing my YARD parsing framework to use memoization would possibly require a lot of work simply because it is in C++, but I could be wrong about that.

    So to answer your first question, some back-tracking is done, but the grammar construction is such that back-tracking is minimized, and as far as I can tell doesn’t affect the overall big O complexity.

    Thanks for posting to my blog, and great job with your language!

    Comment by cdiggins — June 20, 2007 @ 6:38 pm

  3. I haven’t had a chance to look at the paper but this reminded me I had bookmarked it a while ago with the intention of doing so…might be of interest if you haven’t seen it yet:

    http://home.swipnet.se/redz/roman/papers/Parsing.pdf

    Comment by abeyer — June 21, 2007 @ 1:42 am

  4. Thanks for the link to the paper. Here is the abstract for those interested:

    Parsing Expression Grammar as a Primitive
    Recursive-Descent Parser with Backtracking

    Roman R. Redziejowski
    February 2, 2007

    Abstract

    Two recent developments in the field of formal languages are Parsing Expression Grammar (PEG) and packrat parsing. The PEG formalism is similar to BNF, but defines syntax in terms of recognizing strings, rather than constructing them. It is, in fact, precise specification of a backtracking recursive-descent parser. Packrat parsing is a general method to handle backtracking in recursive-descent parsers. It ensures linear working time, at a huge memory cost. This paper reports an experiment that consisted of defining the syntax of Java 1.5 in PEG formalism, and literally transcribing the PEG definitions into parsing procedures (accidentally, also in Java). The resulting primitive parser shows an acceptable behavior, indicating that packrat parsing might be an overkill for practical languages.

    The exercise with defining the Java syntax suggests that more work is needed on PEG as a language specification tool.

    Comment by cdiggins — June 21, 2007 @ 3:02 pm

  5. Hi,

    before looking at the performance of static or dynamic parsers, what are the motivations for making the syntax of a language changeable at runtime? I have thought about something similar for Mog, but fear that the readability, and therefore the usability, of code, would suffer significantly.

    Then, one very simple extension for Yard that I wanted to add is a counter that tells me exactly how often every input character was looked at; this should be a reasonable performance metric for the scanner/parser.

    Ciao, Colin

    Comment by Colin — June 22, 2007 @ 12:50 pm

  6. Being able to modify the language you are using is only one advantage of a mutable grammar. It’s even better if you can do that at runtime because it then fits the same model as functions. You define functions as a normal part of your program, so you should also be able to define new language constructs. I can see that a user modifable grammar has problems with language corruption and readability, but I’m looking at the technology, not the human aspects of this.

    Another advantage of a runtime mutable grammar is composition. If you want to use Python and FORTRAN in the same program, you need to compose those grammars and make yourself a new compiler. What if you know also want to use Java? You have to modify the original compiler, recompile it and install it again. With a runtime mutable grammar you can compose whatever languages you want at runtime. A single binary can run a FORTRAN/Python program, and a Java/C one.

    Comment by chrisseaton — June 22, 2007 @ 4:06 pm

RSS feed for comments on this post. TrackBack URL

Leave a comment

You must be logged in to post a comment.

Powered by WordPress