28 June 2018

LR parsers, what are they really

I recently went back to school on parsers. I always had an involuntary mental resistance to the whole field, because the theory, while well-established, is inelegant. Oh, it starts out pretty enough, with the regular languages, but from there it gets gradually worse and worse until you’re reading things like “and if there are any slots with multiple entries in table X, then the grammar isn’t LR(1).” The algorithm as definition.

Plus, I was totally unable to map the workings of LR parsers to anything else in my experience. So, apropos of that, here’s a journal entry from my wanderings in the wilderness:

The stack in an LR parser is a lot like the C stack.

When you reach the end of a production, you return, popping the stack—much like how returning from a C function pops its stack frame. Of course, in C, the return address is on the stack; whereas in an LR parser, you instead use the corresponding stack value to look up the next state in a goto table... rather like returning to a switch statement.

Like the C stack, the LR stack is heterogenous. A typical stack data structure contains elements that are all the same type. While you can certainly write an LR parser where the stack is just a stack of grammar symbols, you can also assign a different type to every terminal and nonterminal in your grammar, even types of different sizes, and the whole scheme still works. In fact, real parsers do use the LR stack to store partially built parse trees, although I don’t know that they support nonterminals of many different types. They probably should.

I think of each state in an LR parser as a quantum superposition of possible parses of the input so far. The next token might rule out some of these possibilities, fully or partially collapsing the waveform. What makes it LR is that at any given point, all the possible parses happen to assign exactly the same types to everything on the stack, by construction. I imagine a plain old C program (maybe a recursive descent parser) throwing up its hands and saying, “well, it's not clear which branch to take, we’d better just take them all” and splitting off half a dozen universes. The situation remains miraculously executable on normal hardware, because of this curious stack discipline: all the possible universes end up doing exactly the same thing with the same data at each step.

LR parsers, and automata generally, feel like debuggers to me. It’s hard to put my finger on it: something about the notion of “stepping”, and considering all the program code and state as mere data. The table generation algorithm is abstract interpretation of… something; but again, I’m not sure what. An Earley parser, maybe? The grammar itself, somehow considered as a program?