User:Ilgeco1995/sandbox

The packrat parser is a type of parser that resembles a recursive descent parser in its construction. It differs in that it takes a parsing expression grammar (PEG) as input rather than an LL grammar.

In 1970, A. Birman laid the groundwork for packrat parsing by introducing the TMG recognition schema (TS). His work was later refined by Aho and Ullman and renamed the Generalized Top-Down Parsing Language (GTDPL). This algorithm was the first of its kind to employ deterministic top-down parsing with backtracking.

Bryan Ford developed PEGs as an expansion of GTDPL and TS. Unlike CFGs, PEGs are unambiguous and map well onto machine-oriented languages. Like GTDPL and TS, PEGs can express all LL(k) and LR(k) languages. Ford also introduced packrat parsing, which adds memoization to a simple PEG parser. He did so because PEGs allow unlimited lookahead, which gives a plain backtracking parser exponential running time in the worst case.

Packrat keeps track of the intermediate results for all mutually recursive parsing functions, so each parsing function is called at most once at a given input position. In some packrat implementations, if memory is insufficient, certain parsing functions may need to be called several times at the same input position, and the parser then takes longer than linear time.

Syntax
A packrat parser takes as input the same syntax as a PEG.

A simple PEG is composed of terminals and nonterminals, possibly interleaved with operators, that make up one or more derivation rules.

Symbols:

 * Nonterminals are indicated with capital letters, e.g. $$\{S, E, F, D\}$$
 * Terminal symbols are indicated with lowercase letters, e.g. $$\{a,b,z,e,g \}$$
 * Expressions are indicated with lowercase Greek letters, e.g. $$\{\alpha,\beta,\gamma,\omega,\tau\}$$
 * An expression can be a mix of terminals, nonterminals, and operators

Rules:
A derivation rule is composed of a nonterminal and an expression: $$S \rightarrow \alpha$$

A special expression $$\alpha_s$$ is the starting point of the grammar. If no $$\alpha_s$$ is specified, the first expression of the first rule is used.

An input string is considered accepted by the parser if $$ \alpha_s $$ is recognized. As a side effect, a string $$ x $$ can be recognized by the parser even if it was not fully consumed.

An extreme case of this rule is that the grammar $$ S \rightarrow x* $$ matches any string, because the expression succeeds after consuming zero or more characters and ignores the rest.

This can be avoided by rewriting the grammar as $$ S \rightarrow x*!. $$, where $$ !. $$ succeeds only if no further input character is available.
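The difference between the two grammars can be made concrete with a minimal Python sketch. The function names `match_x_star` and `match_x_star_eof` are hypothetical, and the convention of returning the end position (or `None` on failure) is an assumption of this sketch rather than part of the PEG formalism.

```python
from typing import Optional


def match_x_star(text: str, pos: int = 0) -> int:
    """S -> x* : consume as many 'x' characters as possible.

    The expression never fails; it simply returns where it stopped.
    """
    while pos < len(text) and text[pos] == "x":
        pos += 1
    return pos


def match_x_star_eof(text: str, pos: int = 0) -> Optional[int]:
    """S -> x* !. : as above, then require that no character follows."""
    end = match_x_star(text, pos)
    # "!." is a negative lookahead on "any character": it succeeds only at end of input.
    return end if end == len(text) else None


print(match_x_star("abc"))      # succeeds after consuming nothing
print(match_x_star_eof("abc"))  # fails: input remains
print(match_x_star_eof("xxx"))  # succeeds and consumes everything
```

With the first grammar the string `abc` is accepted even though no character is consumed; with the second it is rejected.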

Example:
$$\begin{cases} S \rightarrow A/B \\ A \rightarrow \textbf{'a'}\ A \ \textbf{'a'} \ /\ \textbf{'b'}\  B \ \textbf{'b'} \ /\ \textbf{'a'} \ D \ \textbf{'a'}  \\ B \rightarrow \textbf{'b'}\ B \ \textbf{'b'} \ /\ \textbf{'a'}\  A \ \textbf{'a'} \ /\ \textbf{'b'} \ D \ \textbf{'b'}  \\ D \rightarrow (\textbf{'0'}-\textbf{'9'}) \end{cases}$$

The strings this grammar recognizes are palindromes over the alphabet $$ \{ a,b \} $$ with a single digit in the middle, for example $$ aa1aa $$.

A possible derivation is:

[Figure: Parsing of a palindrome string with packrat.jpg]

Left recursion:
Left recursion happens when a grammar production refers to itself as its left-most element, either directly or indirectly. Since packrat is a recursive descent parser, it cannot handle left recursion directly. Early in the development of packrat parsing, it was found that a left-recursive production can be transformed into a right-recursive one, which significantly simplifies the task of a packrat parser. If indirect left recursion is involved, however, the rewriting can be complex and challenging. If the time complexity requirement is loosened from linear to superlinear, the memoization table of a packrat parser can be modified to permit left recursion without altering the input grammar.
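The following Python sketch illustrates why a direct transcription of a left-recursive rule fails and how a right-recursive rewrite avoids the problem. The grammar $$E \rightarrow E\ \textbf{'-'}\ D\ /\ D$$ and the function names are hypothetical, chosen only for illustration. Note that the rewrite accepts the same strings but groups the operands to the right instead of to the left.

```python
from typing import Optional


def digit(text: str, pos: int) -> Optional[int]:
    """D <- '0'-'9'; returns the new position, or None on failure."""
    if pos < len(text) and "0" <= text[pos] <= "9":
        return pos + 1
    return None


def expr_left(text: str, pos: int) -> Optional[int]:
    """E <- E '-' D / D, transcribed literally.

    The first step calls E again at the same position without consuming
    any input, so the recursion never bottoms out.
    """
    end = expr_left(text, pos)
    return end  # never reached


def expr_right(text: str, pos: int) -> Optional[int]:
    """E <- D '-' E / D, a right-recursive rewrite of the rule above."""
    end = digit(text, pos)
    if end is not None and text[end:end + 1] == "-":
        rest = expr_right(text, end + 1)
        if rest is not None:
            return rest          # first alternative matched
    return end                   # second alternative: D alone (or failure)


try:
    expr_left("1-2-3", 0)
except RecursionError:
    print("left-recursive version exhausted the call stack")

print(expr_right("1-2-3", 0))
```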

Iterative combinator:
The iterative combinators $$\alpha +$$ and $$\alpha *$$ need special attention when translated into a packrat parser. Their use introduces a "secret" recursion that does not record intermediate results in the outcome matrix, which can lead to the parser running in superlinear time. This problem can be resolved by rewriting each iteration as an explicit recursive rule: $$S \rightarrow \alpha +$$ becomes $$S \rightarrow \alpha\ S\ /\ \alpha$$, and $$S \rightarrow \alpha *$$ becomes $$S \rightarrow \alpha\ S\ /\ \varepsilon$$. With this transformation the intermediate results can be properly memoized.
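The effect on the memo table can be seen in a small Python sketch. It assumes the rewrite of $$\alpha *$$ as the right-recursive rule $$S \rightarrow \alpha\ S\ /\ \varepsilon$$ with $$\alpha = \textbf{'a'}$$; the function names and the use of a plain dictionary keyed by position are assumptions of this sketch.

```python
from typing import Dict


def star_loop(text: str, pos: int, memo: Dict[int, int]) -> int:
    """S <- 'a'* evaluated with a loop: only the starting position is recorded."""
    if pos in memo:
        return memo[pos]
    end = pos
    while end < len(text) and text[end] == "a":
        end += 1
    memo[pos] = end
    return end


def star_rec(text: str, pos: int, memo: Dict[int, int]) -> int:
    """S <- 'a' S / ε: every position visited gets its own memo entry."""
    if pos in memo:
        return memo[pos]
    if pos < len(text) and text[pos] == "a":
        end = star_rec(text, pos + 1, memo)   # first alternative: 'a' S
    else:
        end = pos                             # second alternative: ε
    memo[pos] = end
    return end


loop_memo: Dict[int, int] = {}
rec_memo: Dict[int, int] = {}
star_loop("aaaa", 0, loop_memo)
star_rec("aaaa", 0, rec_memo)
print(loop_memo)
print(rec_memo)
```

After the recursive version has run, a later request for `S` at any intermediate position is a table lookup, whereas the loop version would have to rescan the input. The recursive form uses one Python stack frame per repetition, so very long runs would need an iterative way of filling the table.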

Memoization technique
Memoization is an optimization technique in computing that aims to speed up programs by storing the results of expensive function calls. The results are cached so that when the same inputs occur again, the cached result is returned and the time-consuming recomputation is avoided.

In packrat parsing, the result of the parsing function for each nonterminal depends solely on the input string and the position at which it is applied. It does not depend on any information gathered during the parsing process, so memo table entries neither affect nor rely on the parser's state at any given time.

Packrat parsing stores results in a matrix or similar data structure that allows quick look-ups and insertions. When a production is encountered, the matrix is checked to see whether that production has already been evaluated at the current input position. If it has, the result is retrieved from the matrix. If not, the production is evaluated, and the result is inserted into the matrix and then returned. Evaluating the entire $$m \times n$$ matrix in a tabular approach requires $$\Theta(mn)$$ space, where $$m$$ is the number of nonterminals and $$n$$ is the input string size.

In a naïve implementation, the full table can be computed from the input string, starting from the end of the string.

The packrat parser can be improved to update only the necessary cells of the matrix through a depth-first visit of each subexpression. Consequently, allocating a full $$m \times n$$ matrix is often wasteful, as most entries will remain empty. The cells that are filled are linked to the input string, not to the nonterminals of the grammar. This means that increasing the input string size always increases memory consumption, while the number of parsing rules changes only the worst-case space complexity.
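A minimal Python sketch of the look-up-or-evaluate step, using a dictionary keyed by (nonterminal, position) so that only the cells actually visited are stored. The grammar $$A \rightarrow \textbf{'a'}\ A\ \textbf{'b'}\ /\ \textbf{'a'}\ A\ \textbf{'c'}\ /\ \varepsilon$$ is hypothetical; it was chosen because both of its first two alternatives re-evaluate $$A$$ at the same position, which makes the effect of memoization easy to count. The name `count_calls` is likewise an assumption of this sketch.

```python
from typing import Dict, Tuple


def count_calls(text: str, use_memo: bool) -> int:
    """Parse `text` with A <- 'a' A 'b' / 'a' A 'c' / ε and count evaluations of A."""
    memo: Dict[Tuple[str, int], int] = {}
    calls = 0

    def parse_a(pos: int) -> int:
        nonlocal calls
        key = ("A", pos)
        if use_memo and key in memo:
            return memo[key]              # table hit: no re-evaluation
        calls += 1
        result = pos                      # third alternative: ε always succeeds
        if text[pos:pos + 1] == "a":
            for closer in ("b", "c"):     # first and second alternatives, in order
                end = parse_a(pos + 1)
                if text[end:end + 1] == closer:
                    result = end + 1
                    break
        memo[key] = result                # stored either way, consulted only if use_memo
        return result

    parse_a(0)
    return calls


print(count_calls("a" * 10, use_memo=False))  # grows exponentially with the input
print(count_calls("a" * 10, use_memo=True))   # one evaluation per input position
```

On an input of ten `a` characters the unmemoized version evaluates $$A$$ 2047 times, while the memoized version evaluates it 11 times, once for each position including the end of the input.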

Cut operator
Another operator, called "cut", has been introduced to packrat parsing to reduce its average space complexity even further. This operator exploits the formal structure of many programming languages to eliminate impossible derivations. For instance, in a typical programming language the alternatives for a control statement are mutually exclusive once the first token has been recognized, e.g. $$\{if, do, while, switch\} $$. When a packrat parser uses cut operators, it effectively clears its backtracking stack, because a cut operator reduces the number of possible alternatives in an ordered choice. By adding cut operators in the right places in a grammar's definition, the resulting packrat parser needs only a nearly constant amount of space for memoization.
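One way the space saving might be realized is sketched below in Python. This is an illustration under a stated assumption, not a description of any particular implementation: it assumes the cut is reached when no pending alternative can backtrack to an earlier input position, so memo entries for earlier positions can never be read again. The class `MemoTable` and its methods are hypothetical names.

```python
from typing import Dict, Optional, Tuple


class MemoTable:
    """Memo table keyed by (nonterminal, position); values are end positions."""

    def __init__(self) -> None:
        self.entries: Dict[Tuple[str, int], Optional[int]] = {}

    def store(self, name: str, pos: int, end: Optional[int]) -> None:
        self.entries[(name, pos)] = end

    def cut(self, pos: int) -> int:
        """Drop entries for positions before `pos`; return how many were dropped.

        Assumption: nothing can backtrack past `pos` any more, so the
        discarded entries would never be consulted.
        """
        before = len(self.entries)
        self.entries = {k: v for k, v in self.entries.items() if k[1] >= pos}
        return before - len(self.entries)


table = MemoTable()
table.store("Keyword", 0, 2)      # e.g. the token "if" recognized at position 0
table.store("Spacing", 2, 3)
table.store("Expr", 3, 8)
dropped = table.cut(3)            # cut placed after the keyword and spacing
print(dropped, sorted(table.entries))
```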

The algorithm
Sketch of an implementation of a packrat algorithm in Lua-like pseudocode.
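The same idea can also be sketched in runnable Python. The tuple encoding of expressions, the class name `Packrat`, and the method names are assumptions made for this sketch. The grammar is the single-digit arithmetic grammar used in this article's worked example, and the end of the input plays the role of the line terminator.

```python
from typing import Dict, Optional, Tuple

# Hypothetical encoding of PEG expressions as nested tuples:
#   ("lit", "+")  ("range", "0", "9")  ("ref", "A")  ("seq", e1, ...)  ("alt", e1, ...)
GRAMMAR = {
    "S": ("ref", "A"),
    "A": ("alt", ("seq", ("ref", "M"), ("lit", "+"), ("ref", "A")), ("ref", "M")),
    "M": ("alt", ("seq", ("ref", "P"), ("lit", "*"), ("ref", "M")), ("ref", "P")),
    "P": ("alt", ("seq", ("lit", "("), ("ref", "A"), ("lit", ")")), ("ref", "D")),
    "D": ("range", "0", "9"),
}


class Packrat:
    def __init__(self, grammar, text: str) -> None:
        self.grammar = grammar
        self.text = text
        # memo[(nonterminal, position)] = end position, or None for a failed match
        self.memo: Dict[Tuple[str, int], Optional[int]] = {}
        self.hits = 0

    def apply(self, name: str, pos: int) -> Optional[int]:
        """Evaluate nonterminal `name` at `pos`, consulting the memo table first."""
        key = (name, pos)
        if key in self.memo:
            self.hits += 1
            return self.memo[key]
        result = self.eval(self.grammar[name], pos)
        self.memo[key] = result
        return result

    def eval(self, expr, pos: int) -> Optional[int]:
        kind = expr[0]
        if kind == "lit":
            return pos + len(expr[1]) if self.text.startswith(expr[1], pos) else None
        if kind == "range":
            ok = pos < len(self.text) and expr[1] <= self.text[pos] <= expr[2]
            return pos + 1 if ok else None
        if kind == "ref":
            return self.apply(expr[1], pos)
        if kind == "seq":                 # every part must match, left to right
            for sub in expr[1:]:
                pos = self.eval(sub, pos)
                if pos is None:
                    return None
            return pos
        if kind == "alt":                 # ordered choice: first success wins
            for sub in expr[1:]:
                end = self.eval(sub, pos)
                if end is not None:
                    return end
            return None
        raise ValueError(f"unknown expression kind: {kind}")


parser = Packrat(GRAMMAR, "2*(3+4)")
end = parser.apply("S", 0)
# Report successful entries as NAME(1-based position) = number of characters consumed
table = {f"{n}({p + 1})": e - p for (n, p), e in parser.memo.items() if e is not None}
print(end, parser.hits, table)
```

The `table` dictionary uses 1-based positions and consumed lengths so that its entries can be compared with a hand trace, for instance `P(3) = 5` for the parenthesized subexpression and `S(1) = 7` for the whole input.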

Example
Consider the following context-free grammar, which recognizes simple arithmetic expressions composed of single digits interleaved with sums, multiplications, and parentheses.

$$\begin{cases} S \rightarrow A \\ A \rightarrow M\ \textbf{'+'}\  A \ / \  M \\ M \rightarrow P\ \textbf{'*'}\  M \ / \ P \\ P \rightarrow \textbf{'('}\ A\ \textbf{')'}\ / \ D \\ D \rightarrow (\textbf{'0'}-\textbf{'9'}) \end{cases}$$

Denoting the line terminator with $$\dashv$$, we can apply the packrat algorithm as follows.

{| class="wikitable"
|+ Derivation of $$2*(3+4)\dashv$$
! Syntax tree !! Action !! Packrat Table
|-
| Derivation of a context free grammar with packrat.svg || No update because no terminal was recognized ||
|-
| Second_step_in_Parsing_a_CFG_with_packrat.svg || Update: D(1) = 1; P(1) = 1 ||
|-
| Third_step_of_recognizing_CFG_with_packrat.svg || No update because no nonterminal was fully recognized ||
|-
| Fourth_step_in_recognizing_CFG_grammar_with_Packrat.svg || No update because no terminal was recognized ||
|-
| 5th step of recognizing CFG with packrat.svg || Update: D(4) = 1; P(4) = 1 ||
|-
| Sixth step of recognizing CFG with packrat.svg || Hit on P(4). Update M(4) = 1 as M was recognized ||
|-
| Seventh step of recognizing CFG with packrat.svg || No update because no terminal was recognized ||
|-
| Eighth step of recognizing CFG with packrat.svg || Update: D(6) = 1; P(6) = 1 ||
|-
| Ninth step of recognizing CFG with packrat.svg || Hit on P(6). Update M(6) = 1 as M was recognized ||
|-
| Tenth step of recognizing CFG with packrat.svg || Hit on M(6). Update A(4) = 3 as A was recognized. Update P(3) = 5 as P was recognized ||
|-
| Eleventh step of recognizing CFG with packrat.svg || No update because no terminal was recognized ||
|-
| Twelfth step of recognizing CFG with packrat.svg || Hit on P(3). Update M(1) = 7 as M was recognized ||
|-
| 13th step of recognizing CFG with packrat.svg || No update because no terminal was recognized ||
|-
| 14th step of recognizing CFG with packrat.svg || Hit on M(1). Update A(1) = 7 as A was recognized. Update S(1) = 7 as S was recognized ||
|}