Top-down and Bottom-up Parsing
Two strategies, one tree
Parsing is the inverse of derivation: given a string, find a parse tree (or
prove none exists). There are two main strategies.
Top-down parsing (Definition 7.3.11, notes p. 142): start from the start
symbol and work downwards to terminals. Try each rule that has on the
left; for each, recursively try to derive the next nonterminal. Predictive:
you guess what each nonterminal should expand to based on the input.
Bottom-up parsing (Definition 7.3.12, notes p. 143): start from the input
string and work upwards to . Find substrings matching the right-hand side of
some rule, and replace them with the left-hand nonterminal. This is called
shift-reduce parsing.
A subtle distinction (notes p. 172):
- LL parsers are top-down (Left-to-right scan, Leftmost derivation).
- LL(1): one token of lookahead.
- LL(k): tokens of lookahead.
- LR parsers are bottom-up (Left-to-right scan, Rightmost derivation in
reverse).
- LR(0), SLR, LALR, LR(1): increasing power, increasing table size.
In practice:
- Top-down is conceptually simpler, fits recursive-descent parsers well.
- Bottom-up is more powerful (handles a wider class of grammars) and tends
to be more efficient. Yacc / Bison are bottom-up LR parser generators.
Theorem 7.3.14 (notes p. 173, Lee 2002): context-free parsing is equivalent
to Boolean matrix multiplication. This is why efficient CFG parsers run in
— and why we are unlikely to find much faster practical algorithms.
- 0.0sPredictive parsing: LL(1)
- 1.2sGrammar E → T + E | T appears at the top
- 3.0sParse table builds — rows for non-terminals, columns for terminals
- 5.5sEach cell holds the production to apply for that lookahead
- 6.5sInput string + n is queued; cursor lands on +
- 7.5sCursor + is looked up in row E, column + → production E → T + E
- 9.0sCursor moves to n; lookup in row T, column n → T → n
- 11.5sLookahead + table entry + production = next derivation step