Parsing list comprehensions is hard

I have a toy problem that I like to test on anyone who thinks they've “solved” parsing: Haskell list comprehensions. These are given by the grammar:

LISTCOMP ::= "[" EXPR "|" CLAUSE ["," CLAUSE]* "]"
CLAUSE   ::= EXPR | PAT "<-" EXPR

The problem is that when parsing a CLAUSE, until you see a “<-”, you don't know whether you've been parsing a pattern PAT or an expression EXPR. This is extra hard because patterns and expressions overlap: (x, Left 2) could be either, but (x, Left (2+3)) is definitely an expression. You can get arbitrarily deep into parsing a pattern before you realize it's actually an expression!

Neither LL nor LR(1) parsers can handle this.1 In fact, GHC's parser uses a hack: it parses patterns as expressions, and only later checks that the expression it parsed was a valid pattern! This works so long as patterns are a syntactic subset of expressions. But if some patterns aren't valid expressions (e.g. OCaml's or-patterns p|q), then you need to get clever.2,3

Naïve recursive descent parsers—which are basically LL(1)—can't handle this, but they can resort to a classic trick: backtracking! First, try parsing EXPR; if that fails, try parsing PAT "<-" EXPR. Parsec permits this via the try combinator, and PEGs do it by default. One worry here is that backtracking can lead to exponential explosion. I think this isn't a problem for list comprehensions, because expressions can't nest inside patterns.4 (PEGs duck the exponential blowup by memoising, anyway.)
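As a concrete sketch of the backtracking trick, here is a hand-rolled recursive descent parser for CLAUSE in Python. The pattern/expression sublanguage is invented for illustration (both are single atoms), and the ordered choice tries the generator form PAT "<-" EXPR first — the reverse of the order above — so that a successful parse consumes the whole clause:

```python
# A minimal backtracking recursive-descent parser for the CLAUSE rule.
# The pattern/expression sublanguage is deliberately tiny (single
# atoms), so this sketches the backtracking trick, not Haskell's
# real grammar.

class ParseError(Exception):
    pass

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, tok):
        if self.peek() != tok:
            raise ParseError(f"expected {tok!r}, got {self.peek()!r}")
        self.pos += 1

    def attempt(self, parse_fn):
        # The trick: save the input position, and restore it if the
        # sub-parse fails -- like Parsec's `try` combinator.
        saved = self.pos
        try:
            return parse_fn()
        except ParseError:
            self.pos = saved
            raise

    def atom(self):
        tok = self.peek()
        if tok is None or tok in ("<-", ",", "|", "]"):
            raise ParseError(f"unexpected {tok!r}")
        self.pos += 1
        return tok

    # In this toy sublanguage, patterns and expressions coincide.
    pat = atom
    expr = atom

    def generator(self):
        p = self.pat()
        self.expect("<-")
        return ("gen", p, self.expr())

    def clause(self):
        # Ordered choice with backtracking: try PAT "<-" EXPR, and fall
        # back to a bare EXPR guard if that fails.
        try:
            return self.attempt(self.generator)
        except ParseError:
            return ("guard", self.expr())

print(Parser(["x", "<-", "xs"]).clause())  # ('gen', 'x', 'xs')
print(Parser(["even"]).clause())           # ('guard', 'even')
```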

Even so, it's tricky to reason about the behavior of backtracking. Backtracking in the wrong place can over-eagerly commit to a wrong parse. And because a backtracking parser “forgets” the branches it abandons, it can give sub-par error messages.5 Still, it's probably the best solution if you're hand-rolling a parser.

Python also has list comprehensions, but neatly sidesteps this issue:

LISTCOMP ::= "[" EXPR STMT* "]"
STMT     ::= "if" EXPR | "for" PAT "in" EXPR

The unique prefixes if and for prevent ambiguity; this is LL(1) and easily parsed with recursive descent. This is both pragmatic and ergonomic, at least for list comprehensions; I'd find writing monadic do-notation in this style a bit tedious.
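To see why this is LL(1), here is a sketch of the STMT rule, in Python as the implementation language too; parse_atom is a stand-in for full EXPR and PAT parsing. One token of lookahead ("if" vs "for") picks the production outright, with no backtracking anywhere:

```python
# LL(1) dispatch for STMT ::= "if" EXPR | "for" PAT "in" EXPR.
# parse_atom stands in for real EXPR/PAT parsing; a single identifier
# token suffices for this sketch.

def parse_atom(tokens, pos):
    return tokens[pos], pos + 1

def parse_stmt(tokens, pos):
    # One token of lookahead decides the production.
    if tokens[pos] == "if":
        expr, pos = parse_atom(tokens, pos + 1)
        return ("if", expr), pos
    if tokens[pos] == "for":
        pat, pos = parse_atom(tokens, pos + 1)
        if tokens[pos] != "in":
            raise SyntaxError(f"expected 'in', got {tokens[pos]!r}")
        expr, pos = parse_atom(tokens, pos + 1)
        return ("for", pat, expr), pos
    raise SyntaxError(f"expected 'if' or 'for', got {tokens[pos]!r}")

print(parse_stmt(["for", "x", "in", "xs"], 0))  # (('for', 'x', 'xs'), 4)
print(parse_stmt(["if", "cond"], 0))            # (('if', 'cond'), 2)
```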

Of course, the original Haskell-style grammar can be parsed by all-purpose parsing algorithms like GLR, GLL, parsing with derivatives, Earley, and CYK. I'm not sure how efficiently they do it, but Jeffrey Kegler claims that his Marpa parser (a modernized Earley variant) handles practically any unambiguous grammar in linear time. Sounds great! Maybe in the future we won't need fiddly hacks just to parse Haskell-style list comprehensions.

Footnotes

  1. I haven't proven that this grammar isn't LR(k) for any k, but I strongly suspect it. I do know that Menhir, an LR(1) parser-generator, can't handle it.

  2. In one of my Datafun prototypes, I solved this by taking the recursive union of my pattern and expression grammars, throwing in all production rules that apply to either. The associated parse actions produce a (Maybe Pat, Maybe Expr) pair of the pattern and expression parsed; the pattern or expression half might be None if we apply a rule that's only valid for expressions or patterns respectively.

    This leads to some odd situations. For example, for me an underscore _ is not a valid expression, so (_, 2+3) is neither pattern nor expression. But it is in the recursive union of their grammars, so it parses, yielding (None, None). At this point I bail out, manually throwing a parse error.
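    The parse actions described above can be sketched like so — in Python rather than the prototype's actual language, with the names and toy rules invented for illustration:

```python
# Parse actions that build both readings at once. None marks "not
# valid as a pattern" / "not valid as an expression"; the trees are
# just strings here, for brevity.
from typing import Optional, Tuple

PatExpr = Tuple[Optional[str], Optional[str]]

def lit(x: str) -> PatExpr:
    return (x, x)       # literals/variables are both pattern and expression

def wildcard() -> PatExpr:
    return ("_", None)  # `_` is a pattern but not an expression

def add(a: PatExpr, b: PatExpr) -> PatExpr:
    # addition exists only on the expression side
    if a[1] is None or b[1] is None:
        return (None, None)
    return (None, f"({a[1]} + {b[1]})")

def pair(a: PatExpr, b: PatExpr) -> PatExpr:
    p = f"({a[0]}, {b[0]})" if a[0] is not None and b[0] is not None else None
    e = f"({a[1]}, {b[1]})" if a[1] is not None and b[1] is not None else None
    return (p, e)

# (x, 2) is both a pattern and an expression...
print(pair(lit("x"), lit("2")))                    # ('(x, 2)', '(x, 2)')
# ...but (_, 2+3) parses to (None, None): time to bail out manually.
print(pair(wildcard(), add(lit("2"), lit("3"))))   # (None, None)
```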

  3. GHC is even cleverer than I am. GHC's view patterns have the form EXPR "->" PAT, which isn't a valid expression. Despite this, GHC includes view-patterns in its datatype and grammar for expressions, with a post-processing check to ensure you can't use a view pattern where an expression is expected.

  4. As Alexis King pointed out to me, view patterns also let expressions nest inside of patterns (inside of expressions inside of patterns inside of...). So perhaps backtracking could blow up with those enabled. I'm not really sure.

  5. I'm being a bit misleading here—not all backtracking parsers are created equal. I'm mostly considering what I call “shallow” backtracking, where if the first branch succeeds, you don't bother with the second branch; indeed, you forget that you ever backtracked at all. This is easy to implement in a hand-written recursive descent parser. Parsec's try combinator does this.

    But the original and famous formulation of monadic parsers as “a function from strings to lists of things and strings”, i.e. String -> [(a, String)], is a more elephantine beast: it never forgets! If you demand the list of all possible parses, every backtracking branch will eventually be visited. Naturally, this is even more prone to combinatorial explosion. But it's easier to reason about: you'll never “cut off” a branch from consideration. I call this “deep” backtracking. Prolog uses this search strategy (so long as you avoid the “cut” operator, !).
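    A minimal transliteration of this “deep” formulation into Python, where a parser is a function from an input string to the list of all (result, remaining input) pairs:

```python
# "List of successes": a parser maps an input string to ALL of its
# parses, each paired with the remaining input. Alternation keeps the
# successes of both branches, so nothing is ever forgotten.

def item(s):
    # consume one character, if there is one
    return [(s[0], s[1:])] if s else []

def pure(a):
    return lambda s: [(a, s)]

def bind(p, f):
    return lambda s: [r for (a, rest) in p(s) for r in f(a)(rest)]

def alt(p, q):
    # deep backtracking: both branches survive
    return lambda s: p(s) + q(s)

# A deliberately ambiguous parser: one item, or two items glued together.
one = item
two = bind(item, lambda c: bind(item, lambda d: pure(c + d)))

print(alt(one, two)("ab"))  # [('a', 'b'), ('ab', '')]
```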

    Finally, you could explore both branches “in parallel”. I've not thought very hard about either the semantics or performance of this option. It seems harder to implement, but not that hard: miniKanren's default search strategy does this by interleaving exploration of each branch. I'm told that uu-parsinglib runs branches in parallel, too.