User:Quale/Recursive-descent parsing

In computer science, recursive descent is a top-down parsing technique that employs mutually recursive procedures to analyze sentences in a context-free grammar. Each procedure processes one nonterminal (grammatical category) and has a clause for each production rule of that nonterminal. The procedures call each other recursively to process the nonterminals in the grammar, so the structure of the resulting recursive-descent parser mirrors the organization of the grammar. Recursive descent is an LL parsing technique, since it processes the input top-down and left-to-right, producing the leftmost derivation.

A predictive parser is a recursive descent parser that does not require backtracking. Predictive parsing is possible only for the class of LL(k) grammars, which are the context-free grammars for which there exists some positive integer k that allows a recursive descent parser to decide which production to use by examining only the next k tokens of input. The LL(k) grammars exclude all ambiguous grammars, as well as all grammars that contain left recursion. Any context-free grammar can be transformed into an equivalent grammar that has no left recursion, but removal of left recursion does not always yield an LL(k) grammar. A predictive parser runs in linear time.

Recursive descent with backtracking is a more general technique that determines which production to use by trying each production in turn. It can parse some grammars that are not LL(k), including some ambiguous grammars, so it handles a larger class than the LL(k) grammars. Left recursion can still cause the parser to enter an infinite loop, however, so termination is guaranteed only if the grammar has no left recursion. Backtracking may require exponential time and can introduce a great deal of overhead, as semantic actions such as adding entries to the symbol table may need to be undone. Backtracking also makes meaningful error reporting difficult, because a failed parse leaves little information to indicate where the error occurred. As a consequence, recursive descent with backtracking is usually considered impractical for use in programming language compilers, and the term "recursive descent" is often used to mean a predictive parser.
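The "try each production in turn" strategy can be sketched in Python as follows. This is a minimal illustration, not a practical parser: the toy grammar (S → A c, A → a | a b), the dictionary representation, and the names `derive`, `derive_sequence`, and `accepts` are all assumptions made for this example. Generators are used so that a later failure can resume an earlier choice.

```python
from typing import Dict, Iterator, List, Sequence

# Hypothetical toy grammar: each nonterminal maps to its alternatives.
# S -> A c        A -> a | a b
GRAMMAR: Dict[str, List[List[str]]] = {
    "S": [["A", "c"]],
    "A": [["a"], ["a", "b"]],
}


def derive(symbol: str, tokens: Sequence[str], pos: int) -> Iterator[int]:
    """Yield every position at which `symbol` can end when it starts at `pos`."""
    if symbol not in GRAMMAR:
        # Terminal: succeeds only if it matches the next token.
        if pos < len(tokens) and tokens[pos] == symbol:
            yield pos + 1
        return
    # Nonterminal: try each production in turn. A left-recursive
    # production would make this recursion loop forever.
    for alternative in GRAMMAR[symbol]:
        yield from derive_sequence(alternative, tokens, pos)


def derive_sequence(symbols: List[str], tokens: Sequence[str], pos: int) -> Iterator[int]:
    """Yield every end position for a sequence of symbols starting at `pos`."""
    if not symbols:
        yield pos
        return
    for middle in derive(symbols[0], tokens, pos):
        # If the rest fails, the loop resumes `derive`, which backtracks
        # to the next alternative for the first symbol.
        yield from derive_sequence(symbols[1:], tokens, middle)


def accepts(tokens: Sequence[str], start: str = "S") -> bool:
    """True if some derivation of `start` consumes all of `tokens`."""
    return any(end == len(tokens) for end in derive(start, tokens, 0))
```

On the input a b c the parser first tries A → a, fails to match c against b, and only then backtracks to A → a b. A predictive parser with one token of lookahead could not make that choice without first left factoring the grammar.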

Recursive descent is a simple parsing technique, but it is sufficient to parse most common computer programming languages. Given a suitable grammar or set of syntax diagrams, the parsers are easy to write by hand. Although it is possible to use a tool to generate recursive-descent parsers from a grammar description, most automatically generated parsers use more powerful techniques. Despite this, recursive-descent parsers are popular because they allow convenient expression of semantic routines that derive meaning from the input, and they permit high-quality error diagnostics for syntactically invalid input.

History and applications

 * early history, Irons, Foster
 * Algol 60
 * most programming languages in the early 1970s used recursive descent or operator precedence or both
 * Pascal and subsequent Wirth 1-pass compilers
 * BCPL
 * Cfront used yacc but Stroustrup thinks recursive descent would have been better
 * lcc, gcc, clang

Technique

 * Wirth, Algorithms + Data Structures = Programs - from EBNF or syntax diagrams
 * Burge, Recursive Programming Techniques - backtracking parser written in a functional language
 * I have at least one more source detailing an explicit algorithm to write an r-d parser from an LL(1) grammar
 * possible to use explicit stack handling if implementing in a language without recursion such as FORTRAN before Fortran 90, usually not required today

Theory and practice

 * LL(k), LL(1)
 * factoring to remove left recursion, may change associativity (EBNF usually gives an easy fix)
 * substitution
 * Fraser and Hanson technique to parse expressions that reduces the number of procedures required for languages with many precedence levels such as C

Left recursion
Left recursion can be removed from any grammar, although the transformation may be involved and can introduce ε-productions. Most cases of left recursion encountered in programming languages are immediate (have the form A → Aα).
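The immediate case can be handled mechanically: productions A → Aα | β are replaced by A → βA′ and A′ → αA′ | ε. The following Python sketch illustrates this under some assumptions: productions are represented as lists of symbol names, the empty list stands for ε, every α is non-empty, and the new nonterminal is named by appending an apostrophe. The function name `remove_immediate_left_recursion` is made up for this example.

```python
from typing import Dict, List

Production = List[str]  # a right-hand side; the empty list stands for ε


def remove_immediate_left_recursion(
    nonterminal: str, alternatives: List[Production]
) -> Dict[str, List[Production]]:
    """Rewrite A -> A α | β as A -> β A' and A' -> α A' | ε.

    Assumes every α is non-empty and at least one β exists.
    """
    # The α parts: what follows the leading A in each left-recursive alternative.
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nonterminal]
    # The β parts: alternatives that do not begin with A.
    others = [alt for alt in alternatives if not alt or alt[0] != nonterminal]
    if not recursive:
        return {nonterminal: alternatives}  # nothing to rewrite

    tail = nonterminal + "'"  # hypothetical naming convention for A'
    return {
        nonterminal: [beta + [tail] for beta in others],
        tail: [alpha + [tail] for alpha in recursive] + [[]],
    }
```

For example, `remove_immediate_left_recursion("E", [["E", "+", "T"], ["T"]])` yields the productions E → T E' and E' → + T E' | ε. Indirect left recursion needs the more involved general algorithm, which this sketch does not attempt.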

A left-recursive grammar for simple expressions:
 * E → E + T | T
 * T → T * F | F
 * F → ( E ) | number

Replace the E and T productions as follows:
 * E → T E′
 * E′ → + T E′ | ε
 * T → F T′
 * T′ → * F T′ | ε
 * F → ( E ) | number

This changes the associativity of the grammar, naturally producing right-associative parse trees. If the grammar is written in EBNF, it is easy to maintain left associativity and to write the parse functions using loops rather than recursion:


 * E → T { + T }*
 * T → F { * F }*

{α}* is zero or more repetitions of α, defined as
 * astar → astar α | ε
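A hand-written parser for the EBNF form E → T { + T }*, T → F { * F }*, F → ( E ) | number might look like the following Python sketch. Each nonterminal becomes one method, and each { ... }* repetition becomes a `while` loop. The class name `Parser`, the helper `tokenize`, and the choice to return nested tuples as the parse tree are assumptions made for illustration.

```python
import re
from typing import List, Optional, Tuple, Union

Tree = Union[int, Tuple[str, "Tree", "Tree"]]


def tokenize(text: str) -> List[str]:
    # Numbers, the four punctuation tokens, or any other non-space
    # character (which the parser will then reject as a syntax error).
    return re.findall(r"[0-9]+|[()+*]|\S", text)


class Parser:
    def __init__(self, text: str) -> None:
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self) -> Optional[str]:
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, token: str) -> None:
        if self.peek() != token:
            raise SyntaxError(f"expected {token!r}, found {self.peek()!r}")
        self.pos += 1

    def parse_expression(self) -> Tree:  # E -> T { + T }*
        node = self.parse_term()
        while self.peek() == "+":
            self.pos += 1
            # Folding into the left operand keeps '+' left-associative.
            node = ("+", node, self.parse_term())
        return node

    def parse_term(self) -> Tree:  # T -> F { * F }*
        node = self.parse_factor()
        while self.peek() == "*":
            self.pos += 1
            node = ("*", node, self.parse_factor())
        return node

    def parse_factor(self) -> Tree:  # F -> ( E ) | number
        token = self.peek()
        if token == "(":
            self.pos += 1
            node = self.parse_expression()
            self.expect(")")
            return node
        if token is not None and token[0] in "0123456789":
            self.pos += 1
            return int(token)
        raise SyntaxError(f"expected a number or '(', found {token!r}")


def parse(text: str) -> Tree:
    parser = Parser(text)
    tree = parser.parse_expression()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree
```

With this sketch, `parse("1+2+3")` returns `("+", ("+", 1, 2), 3)`, a left-associative tree, and `parse("1+2*3")` returns `("+", 1, ("*", 2, 3))`. The parser needs only one token of lookahead and never backtracks.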

Left factoring

 * A → αβ | αγ

If the input begins with a non-empty string derived from α, a predictive parser has no way to choose between the alternatives αβ and αγ by inspecting the start of the input. Factor out the common prefix:
 * A → αA′
 * A′ → β | γ

Programming language example. A common grammar for the conditional statement is
 * if_statement → if condition then statement else statement
 * | if condition then statement

Upon encountering the token "if", a top-down parser cannot predict which expansion to use. Factor out the common prefix:
 * if_statement → if condition then statement if_tail
 * if_tail → else statement | ε

where ε is the empty string. (Recursive-descent parsers choose an epsilon production only if the input does not match any non-empty productions.)
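The factored grammar translates directly into procedures. The Python sketch below is a deliberately simplified illustration: conditions and non-conditional statements are each assumed to be a single token, and the names `IfParser`, `parse_if_tail`, and the tuple-based tree are hypothetical. The point of interest is `parse_if_tail`, which takes the ε alternative only when the next token is not "else".

```python
from typing import List, Optional, Tuple, Union

Node = Union[Tuple[str, str], Tuple[str, str, "Node", Optional["Node"]]]


class IfParser:
    def __init__(self, tokens: List[str]) -> None:
        self.tokens = tokens
        self.pos = 0

    def peek(self) -> Optional[str]:
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self) -> str:
        token = self.peek()
        if token is None:
            raise SyntaxError("unexpected end of input")
        self.pos += 1
        return token

    def expect(self, token: str) -> None:
        if self.peek() != token:
            raise SyntaxError(f"expected {token!r}, found {self.peek()!r}")
        self.pos += 1

    def parse_statement(self) -> Node:
        # Simplifying assumption: any other statement is a single token.
        if self.peek() == "if":
            return self.parse_if_statement()
        return ("stmt", self.advance())

    def parse_if_statement(self) -> Node:
        # if_statement -> if condition then statement if_tail
        self.expect("if")
        condition = self.advance()  # assumption: a one-token condition
        self.expect("then")
        then_branch = self.parse_statement()
        else_branch = self.parse_if_tail()
        return ("if", condition, then_branch, else_branch)

    def parse_if_tail(self) -> Optional[Node]:
        # if_tail -> else statement | ε
        if self.peek() == "else":
            self.pos += 1
            return self.parse_statement()
        return None  # ε: chosen only because no non-empty production matched
```

Because the non-empty alternative is preferred, an "else" is attached to the nearest unmatched "if": parsing the tokens of `if c1 then if c2 then s1 else s2` gives the inner conditional the else branch and leaves the outer one without it.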

Error recovery
Most methods delete input symbols until they reach a synchronization point.
 * Wirth
 * Turner
 * lcc, Fraser and Hanson
 * good discussion of error recovery in Crafting a Compiler, Fischer and LeBlanc, if only I could find my copy
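The general idea of deleting symbols up to a synchronization point can be sketched in Python as below. This is a generic panic-mode illustration and does not reproduce any of the specific methods listed above. The toy statement form `name = number ;`, the choice of ";" as the only synchronization token, and the names `ParseError`, `parse_assignment`, and `parse_statements` are all assumptions made for this example.

```python
from typing import Callable, List, Tuple

SYNC_TOKENS = {";"}  # assumed synchronization points for this toy language


class ParseError(Exception):
    def __init__(self, message: str, pos: int) -> None:
        super().__init__(message)
        self.pos = pos  # index of the offending token


def expect(tokens: List[str], pos: int,
           matches: Callable[[str], bool], description: str) -> str:
    if pos < len(tokens) and matches(tokens[pos]):
        return tokens[pos]
    found = tokens[pos] if pos < len(tokens) else "end of input"
    raise ParseError(f"token {pos}: expected {description}, found {found!r}", pos)


def parse_assignment(tokens: List[str], pos: int) -> Tuple[str, int, int]:
    """Parse `name = number ;` at `pos`; return (name, value, next position)."""
    name = expect(tokens, pos, str.isidentifier, "a name")
    expect(tokens, pos + 1, lambda t: t == "=", "'='")
    value = expect(tokens, pos + 2, str.isdecimal, "a number")
    expect(tokens, pos + 3, lambda t: t == ";", "';'")
    return name, int(value), pos + 4


def parse_statements(tokens: List[str]) -> Tuple[List[Tuple[str, int]], List[str]]:
    """Parse a sequence of assignments, recovering after each syntax error."""
    statements: List[Tuple[str, int]] = []
    errors: List[str] = []
    pos = 0
    while pos < len(tokens):
        try:
            name, value, pos = parse_assignment(tokens, pos)
            statements.append((name, value))
        except ParseError as error:
            errors.append(str(error))
            pos = error.pos
            # Panic mode: discard tokens up to the next synchronization point.
            while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
                pos += 1
            pos += 1  # step past the ';' so parsing resumes at the next statement
    return statements, errors
```

Given the tokens of `x = 1 ; y = = 2 ; z = 3 ;`, the sketch reports one error for the second statement, skips to the following ";", and still parses the first and third statements. Resuming after an error lets a single run report several independent mistakes.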

Pros

 * simple, can be written by hand
 * fast if no backtracking needed
 * good error messages
 * easy to attach semantic routines
 * can handle ambiguity w/ ad hoc code

Cons

 * some grammars are most naturally expressed using left recursion and must be rewritten to be LL
 * many context-free grammars are not LL(k) for any finite k; every LL(k) grammar is also an LR(k) grammar, and LR parsing is strictly more powerful
 * if using a parser generator, LR table-driven techniques can parse more languages

Although predictive parsers are widely used and are frequently chosen when writing a parser by hand, programmers often prefer a table-based parser produced by a parser generator, either a table-driven LL(k) parser or an alternative such as LALR or LR. This is particularly the case if a grammar is not in LL(k) form, because transforming the grammar to make it suitable for predictive parsing can be involved. Predictive parsers can also be generated automatically, using tools like ANTLR.

Predictive parsers can be depicted using a transition diagram for each nonterminal symbol, in which the edges between the initial and final states are labelled with the symbols (terminals and nonterminals) of the right-hand side of the production rule.