1.1. Parsing generative grammars
Since the earliest days of computer science, researchers have sought to identify robust algorithms for parsing formal languages (Aho and Ullman, 1972; Aho et al., 2006). Careful work to establish the theory of computation provided further insight into the expressive power of different classes of languages, the relationships between different classes of languages, and the relative computational power of algorithms that can parse different classes of languages (Sipser, 2012).
Generative systems of grammars, particularly Chomsky’s context-free grammars (CFGs) (Chomsky, 1956) and Kleene’s regular expressions (Kleene, 1951), became the workhorses of parsing for many data formats, and have been used nearly ubiquitously over several decades. However, inherent ambiguities and nondeterminism in allowable grammars increase the implementation complexity or decrease the runtime efficiency of parsers for these classes of language (Lang, 1974; Tomita, 2013), because of parsing conflicts and/or exponential blowup in parsing time or space caused by nondeterminism (Kleene, 1951; McNaughton and Yamada, 1960).
1.2. Bottom-up parsers
Much early work in language recognition focused on bottom-up parsing, in which, as the input is processed in order, a forest of parse trees is built upwards from terminals towards the root. These parse trees are joined together into successively larger trees as higher-level productions or reductions are applied. This parsing pattern is termed shift-reduce (Aho et al., 2006). Examples of shift-reduce parsers include early work on precedence parsers and operator precedence parsers (Samelson and Bauer, 1960; Knuth, 1962; Gray, 1969; Aho et al., 1975; Aho and Ullman, 1977), Floyd’s operator precedence parsers (which are a proper subset of operator precedence parsers) (Floyd, 1963; Ružička, 1979), the later development of LR (Knuth, 1965) and LALR (lookahead) (DeRemer, 1969) parsers, the more complex GLR parser (Lang, 1974) designed to handle ambiguous and nondeterministic grammars, and “recursive ascent” parsers (Thomas, 1986). Some parsers in this class run into issues with shift-reduce conflicts and/or reduce-reduce conflicts caused by ambiguous grammars; the GLR parser requires careful and complex data structure maintenance to avoid exponential blowup in memory usage, and to keep parsing time complexity to $O(n^3)$ in the length of the input. Optimal error recovery in shift-reduce parsers can be complex and difficult to implement.
1.3. Top-down parsers
More recently, recursive descent parsing or top-down parsing has increased in popularity, due to its simplicity of implementation, and due to the helpful property that the structure of mutually-recursive functions of the parser directly parallels the structure of the corresponding grammar.
The Pratt parser (Pratt, 1973; Crockford, 2007) is an improved recursive descent parser that combines the best properties of recursive descent and Floyd’s Operator Precedence parsing, and has good performance characteristics. However, the association of semantic actions with tokens rather than rules in a Pratt parser makes the parser more amenable to parsing of dynamic languages than static languages (Crockford, 2007).
Recursive descent parsers have suffered from three significant problems:
Problem: In the worst case, the time complexity of a naïve recursive descent parser scales exponentially in the length of the input, due to unlimited lookahead, and exponentially in the size of the grammar, due to the lack of termination of recursion at already-visited nodes during depth-first search over the Directed Acyclic Graph (DAG) of the grammar.
Solution: The discovery of packrat parsing reduced the worst case time complexity for recursive descent parsing from exponential to linear in the length of the input (Norvig, 1991). Packrat parsers employ memoization to avoid duplication of work, effectively breaking edges in the recursive descent call graph so that they traverse only a minimum spanning tree of the DAG-structured call graph of the parse, terminating recursion at the second and subsequent incoming edge to any recursion frame.
Problem: Without special handling (Medeiros et al., 2014), left-recursive grammars cause naïve recursive descent parsers to get stuck in infinite recursion, resulting in stack overflow.
Prior progress: It was found that direct or indirect left recursive rules could be rewritten into right recursive form (with only a weak equivalence to the left recursive form) (Ford, 2002), although the rewriting rules can be complex, especially when semantic actions are involved, and the resulting parse tree can diverge quite significantly from the structure of the original grammar, making it more difficult to create an abstract syntax tree from the parse tree. With some extensions, packrat parsing can be made to handle left recursion (Ford, 2002; Frost and Hafiz, 2006; Frost et al., 2007; Warth et al., 2008), although usually with loss of linear-time performance, which is one of the primary reasons packrat parsing is chosen over other parsing algorithms (Warth et al., 2008). Some workarounds to supporting left recursion in recursive descent parsers only handle indirect left recursion, not direct left recursion.
Problem: If there is a syntax error partway through a parse, it is difficult or impossible to optimally recover the parsing state to allow the parser to continue parsing beyond the error. This makes recursive descent parsing particularly problematic for use in Integrated Development Environments (IDEs), where syntax highlighting and code assist need to continue to work after syntax errors are encountered during parsing. Recursive descent parsers are also not ideal for use in compilers, since compilers ideally need to be able to show more than one error message per compilation unit.
Prior progress: Many efforts have been made to improve the error recovery capabilities of recursive descent parsers, for example by identifying synchronization points from which parsing can recover with some reliability (de Medeiros and Mascarenhas, 2018), or by employing heuristic algorithms to attempt to recover after an error, yielding an “acceptable recovery rate” from 41% to 76% in recent work (de Medeiros et al., 2020). Up to and including this most recent state of the art, optimal error recovery has resisted solution for recursive descent parsers.
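As a rough illustration of the first problem above and its packrat solution, the following Python sketch hand-codes a recursive descent recognizer and counts rule invocations with and without memoization. The two-rule grammar and the names `build_recognizer`, `rule`, `term`, and `expr` are hypothetical and chosen only for this example; they are not taken from the cited work.

```python
def build_recognizer(text, memoize):
    """Return (expr, counts) for the hypothetical grammar
         Expr <- Term '+' Expr / Term
         Term <- '(' Expr ')' / 'n'
    Each rule function takes a start position and returns the end position
    of the match, or None if the rule does not match there."""
    counts = {"calls": 0}

    def rule(fn):
        cache = {}                        # per-rule memo table, keyed by start position

        def wrapper(pos):
            if memoize and pos in cache:
                return cache[pos]         # packrat: reuse the earlier result
            counts["calls"] += 1          # count only real evaluations
            result = fn(pos)
            if memoize:
                cache[pos] = result
            return result
        return wrapper

    @rule
    def term(pos):
        if text.startswith("(", pos):
            end = expr(pos + 1)
            if end is not None and text.startswith(")", end):
                return end + 1
        if text.startswith("n", pos):
            return pos + 1
        return None

    @rule
    def expr(pos):
        end = term(pos)
        if end is not None and text.startswith("+", end):
            rest = expr(end + 1)
            if rest is not None:
                return rest
        # The second alternative re-parses Term from scratch unless memoized.
        return term(pos)

    return expr, counts
```

On deeply nested input such as `"(" * 12 + "n" + ")" * 12`, the unmemoized variant evaluates `term` twice at every nesting level, so the number of evaluations doubles with each level, whereas the memoized variant evaluates each rule at most once per input position.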
1.4. Dynamic programming parsers (chart parsers)
A chart parser is a type of parser suited to parsing ambiguous grammars (Kay, 1980). Chart parsers avoid exponential blowup in parsing time arising from the nondeterminism of a grammar by reducing duplication of work through the use of memoization. Top-down chart parsers (such as packrat parsers) use memoized recursion, whereas bottom-up chart parsers more specifically use dynamic programming (Section 1.7). Chart parsers are comparable to the Viterbi algorithm (Viterbi, 1967), finding an optimal tree of paths through a directed acyclic graph (DAG) of all possible paths representing all possible parses of the input with the grammar. They may work unidirectionally or bidirectionally to build the parse tree.
The Earley parser is a top-down chart parser, and is mainly used for parsing natural language in computational linguistics (Earley, 1970). It can parse any context-free grammar, including left-recursive grammars. The Earley parser executes in cubic time in the general case, quadratic time for unambiguous grammars, and linear time for all LR(k) grammars – however for most computer languages, Earley parsers are less efficient than alternatives, so they do not see much use in compilers. There is also some complexity involved in building a parse tree using an Earley parser, and Earley parsers may have problems with some nullable grammars. To increase efficiency, modern Earley parsing variants use a forward pass, to keep track of all active rules or symbols in the grammar at the current position, followed by a reverse pass, following all back-references in order to assemble the actual parse tree. Earley parsing algorithms are not directly extensible to handle PEG (Section 1.6), as a result of PEG supporting arbitrary lookahead. The Earley parser may be converted from top-down memoized recursive form into bottom-up dynamic programming form (Voisin, 1988).
“Parsing with pictures” is a chart parsing algorithm that provides an alternative approach to parsing context-free languages. The authors claim that this method is simpler and easier to understand than standard parsers using derivations or pushdown automata (Pingali and Bilardi, 2012). This parsing method unifies Earley, SLL, LL, SLR, and LR parsers, and demonstrates that Earley parsing is the most fundamental Chomskyan context-free parsing algorithm, from which all others derive.
A particularly interesting though inefficient bottom-up chart parser is the CYK algorithm for context-free grammars (Cocke et al., 1970; Daniel, 1967; Kasami, 1966). This parsing algorithm employs dynamic programming to recursively subdivide the input string into all substrings, and determines whether each substring can be generated from each grammar rule. This results in $O(n^3 |G|)$ performance in the length of the input $n$ and the size of the grammar $|G|$. The algorithm is robust and simple, and has optimal error recovery capabilities since all rules are matched against all substrings of the input. However, the CYK algorithm is inefficient for very large inputs due to scaling as the cube of the input length $n$, and additionally, the algorithm requires the grammar to be reduced to Chomsky Normal Form (CNF) (Chomsky, 1959) as a preprocessing step, which can result in a large or even an exponential blowup in grammar size in the worst case; therefore in practice, $|G|$ may also be large.
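The cubic scaling described above is visible directly in the loop structure of a CYK recognizer. The following is a minimal Python sketch, assuming the grammar has already been converted to CNF and is supplied as two lists of rules; the function name `cyk_recognize`, the rule encoding, and the balanced-parentheses grammar are illustrative assumptions, not part of any cited implementation.

```python
def cyk_recognize(tokens, unary_rules, binary_rules, start):
    """Return True if `tokens` can be derived from `start`.

    unary_rules:  list of (lhs, terminal) pairs, for rules A -> a
    binary_rules: list of (lhs, (b, c)) pairs, for rules A -> B C
    """
    n = len(tokens)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that derive tokens[i : i + j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][0] = {lhs for lhs, terminal in unary_rules if terminal == tok}
    for span in range(2, n + 1):                 # substring length
        for i in range(n - span + 1):            # substring start
            for split in range(1, span):         # split point within the substring
                left = table[i][split - 1]
                right = table[i + split][span - split - 1]
                for lhs, (b, c) in binary_rules:
                    if b in left and c in right:
                        table[i][span - 1].add(lhs)
    return start in table[0][n - 1]


# Hypothetical CNF grammar for balanced parentheses:
#   S -> S S | L R | L X,   X -> S R,   L -> '(',   R -> ')'
UNARY = [("L", "("), ("R", ")")]
BINARY = [("S", ("S", "S")), ("S", ("L", "R")), ("S", ("L", "X")), ("X", ("S", "R"))]
```

The three nested loops over `span`, `i`, and `split` account for the $n^3$ factor, and the inner loop over `binary_rules` accounts for the grammar-size factor.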
1.5. Left-to-right vs. right-to-left parsing
In prior work, with the exception of left/right orderless parsers such as the CYK algorithm, and parsers that perform a right-to-left “fixup pass” as a postprocessing step such as modern Earley parsers, almost all parsers – and certainly all practical parsers – have consumed input in left-to-right order only.
In fact, reversing the direction in which input is consumed from left-to-right to right-to-left changes the recognized language for most parsers applied to most grammars. This is principally because most parsing algorithms are only defined to operate either top-down or bottom-up, and even if the left-to-right parsing order is reversed, the top-down vs. bottom-up order cannot be easily reversed for most parsing algorithms, because this order is fundamental to the design of the algorithm. Without reversing both parsing axes, the semantics of a “half-reversed” parsing algorithm will not match the original algorithm (Section 2.4).
1.6. Parsing Expression Grammars (PEG)
A significant refinement of recursive descent parsing known as Parsing Expression Grammars (PEGs, or more commonly, but redundantly, “PEG grammars”) was proposed by Ford in his 2002 PhD thesis (Ford, 2002). In contrast to most other parsers, which represent grammar rules as generative productions or derivations, PEG grammar rules represent greedily-recognized patterns, replacing ambiguity with deterministic choice (i.e. all PEG grammars are unambiguous). There are only a handful of PEG rule types, and they can be implemented in a straightforward way. As a subset of all possible recursive rule types, all non-left-recursive PEG grammars can be parsed directly by a recursive descent parser, and can be parsed efficiently with a packrat parser. PEG grammars are more powerful than regular expressions, but it is conjectured (though as yet unproved) that there exist context-free languages that cannot be recognized by a PEG parser (Ford, 2004). Top-down PEG parsers suffer from the same limitations as other packrat parsers: they cannot parse left-recursive grammars without special handling, and they have poor error recovery properties.
A PEG grammar consists of a finite set of rules. Each rule has the form $A \leftarrow e$, where $A$ is a unique rule name, and $e$ is a parsing expression, also referred to as a clause. One rule is designated as the starting rule, which becomes the root for top-down recursive parsing. PEG clauses are constructed from subclauses, terminals, and rule references. The subclauses within a rule’s clause may be composed into a tree, with PEG operators forming the non-leaf nodes, and terminals or rule references forming the leaf nodes.
Terminals can match individual characters, strings, or (in a more complex parser) even regular expressions. One special type of terminal is the empty string $\epsilon$ (with ASCII notation ()), which always matches at any position, even beyond the end of the input string, and consumes zero characters.
The PEG operators (Table 1) are defined as follows:
A Seq clause matches the input at a given start position if all of its subclauses match the input in order, with the first subclause match starting at the initial start position, and each subsequent subclause match starting immediately after the previous subclause match. Matching stops if a single subclause fails to match the input at its start position.
A First clause (ordered choice) matches if any of its subclauses match, iterating through subclauses in left to right order, with all match attempts starting at the current parsing position. After the first match is found, no other subclauses are matched, and the First clause matches. If no subclauses match, the First clause does not match. The First operator gives a top-down PEG parser limited backtracking capability.
A OneOrMore clause matches if its subclause matches at least once, greedily consuming as many matches as possible until the subclause no longer matches.
A FollowedBy clause matches if its subclause matches at the current parsing position, but no characters are consumed even if the subclause does match. This operator provides lookahead.
A NotFollowedBy clause matches if its subclause does not match at the current parsing position; like FollowedBy, it never consumes any characters. This operator provides lookahead, logically negated.
Two additional PEG operators may be offered for convenience, and can be defined in terms of the basic PEG operators:
An Optional clause matches if its subclause matches. If the subclause does not match, the Optional clause still matches, consuming zero characters (i.e. this operator always matches). An Optional clause $(e?)$ can be automatically transformed into First form $(e / \epsilon)$, reducing the number of PEG operators that need to be implemented directly.
A ZeroOrMore clause matches if its subclause matches zero or more times, greedily consuming as many matches as possible. If its subclause does not match (in other words if the subclause matches zero times), then the ZeroOrMore still matches the input, consuming zero characters (i.e. this operator always matches). A ZeroOrMore clause $(e*)$ can be automatically transformed into Optional form using a OneOrMore operator, $(e+)?$, which can be further transformed into First form using a OneOrMore operator, $(e+ / \epsilon)$, again reducing the number of operators that need to be implemented.
|Name||Num subclauses||Notation||Equivalent to||Example rule in ASCII notation|
|Seq||2+||$e_1\ e_2\ e_3$||-||Sum <- Prod '+' Prod;|
|First||2+||$e_1 / e_2 / e_3$||-||ArithOp <- '+' / '-';|
|OneOrMore||1||$e+$||-||Whitespace <- [ \t\n\r]+;|
|FollowedBy||1||$\&e$||-||Exclamation <- Word &'!';|
|NotFollowedBy||1||$!e$||-||VarName <- !ReservedWord [a-z]+;|
|Optional||1||$e?$||$e / \epsilon$||Value <- Name ('[' Num ']')?;|
|ZeroOrMore||1||$e*$||$(e+)?$ [or] $(e+ / \epsilon)$||Program <- Statement*;|
Table 1: PEG operators, defined in terms of subclauses $e$ and $e_i$, and the empty string $\epsilon$.
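The operator definitions above translate almost line for line into a top-down recognizer. The following Python sketch is one possible encoding, not the reference implementation: clauses are nested tuples, the tuple tags (`"seq"`, `"first"`, and so on) and the helper names `match`, `optional`, and `zero_or_more` are assumptions made for this example, and left-recursive rules are not supported.

```python
def optional(sub):
    """e?  is rewritten as  (e / empty)."""
    return ("first", sub, ("empty",))


def zero_or_more(sub):
    """e*  is rewritten as  (e+)?."""
    return optional(("one_or_more", sub))


def match(clause, rules, text, pos):
    """Return the number of characters matched at `pos`, or None on failure."""
    kind, *args = clause
    if kind == "empty":                          # always matches, even past the end
        return 0
    if kind == "char":                           # terminal: one character from a set
        return 1 if pos < len(text) and text[pos] in args[0] else None
    if kind == "ref":                            # reference to a named rule
        return match(rules[args[0]], rules, text, pos)
    if kind == "seq":
        cur = pos
        for sub in args:
            length = match(sub, rules, text, cur)
            if length is None:
                return None                      # stop at the first failing subclause
            cur += length
        return cur - pos
    if kind == "first":
        for sub in args:
            length = match(sub, rules, text, pos)
            if length is not None:
                return length                    # later subclauses are never tried
        return None
    if kind == "one_or_more":
        length = match(args[0], rules, text, pos)
        if length is None:
            return None
        total = length
        while length:                            # stop on failure or a zero-length match
            length = match(args[0], rules, text, pos + total)
            total += length or 0
        return total
    if kind == "followed_by":
        return 0 if match(args[0], rules, text, pos) is not None else None
    if kind == "not_followed_by":
        return 0 if match(args[0], rules, text, pos) is None else None
    raise ValueError(f"unknown clause type: {kind}")


# Hypothetical grammar:  Sum <- Num ('+' Num)*;   Num <- [0-9]+;
RULES = {
    "Num": ("one_or_more", ("char", "0123456789")),
    "Sum": ("seq", ("ref", "Num"),
            zero_or_more(("seq", ("char", "+"), ("ref", "Num")))),
}
```

Only five operators are implemented directly; Optional and ZeroOrMore are desugared by the two helper functions, following the transformations given in the text.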
1.7. Dynamic Programming: the Levenshtein distance
Dynamic programming (DP) is a method used to solve recurrence relations efficiently when direct recursive evaluation of a recurrence expression would recompute the same terms many times (often an exponential number of times). DP is analogous to an inductive proof, working bottom-up from base cases, with the addition of memoization, or the storing of recurrence parameters and results in a table so that no specific invocation of a recurrence needs to be recursively computed more than once. One of the most commonly taught DP algorithms is the Levenshtein distance or string edit distance algorithm (Levenshtein, 1966; Kruskal, 1983). This algorithm measures the number of edit operations (deletions, insertions, and modifications of one character for another) that are required to convert one string of length $m$ into another string of length $n$.
The Levenshtein distance algorithm is depicted in Fig. 1. For two strings of lengths $m$ and $n$, the DP table or memo table for the algorithm consists of $m+1$ rows and $n+1$ columns, initialized by filling the first row and the first column with ascending integers starting with $0$ (these are the base cases for induction or for the recurrence, marked with a gray background). The remaining entries in the table are populated top to bottom and left to right, until the toplevel recurrence evaluation is reached (marked “TOP” in the bottom right corner). The current cell value (marked with a circled asterisk) is calculated by applying the Levenshtein distance recurrence (Fig. 1a), which is a function of the values in three earlier-computed table cells (shown with arrows): the cell in the same row but previous column, the cell in the same column but previous row, and the cell in the previous column and previous row. After the value of the current cell is calculated, the value is stored in the DP table or memoized for future reference.
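The table-filling procedure described above can be sketched in a few lines of Python. This is a minimal illustration using the standard form of the recurrence, with unit cost for each deletion, insertion, and substitution; the function name `levenshtein_bottom_up` is chosen for this example only.

```python
def levenshtein_bottom_up(a, b):
    """Edit distance between strings a and b, filling the DP table row by row."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):                 # base cases: first column
        table[i][0] = i
    for j in range(cols):                 # base cases: first row
        table[0][j] = j
    for i in range(1, rows):              # top to bottom
        for j in range(1, cols):          # left to right
            substitution = 0 if a[i - 1] == b[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,                  # same column, previous row
                table[i][j - 1] + 1,                  # same row, previous column
                table[i - 1][j - 1] + substitution,   # previous row and column
            )
    return table[rows - 1][cols - 1]      # the toplevel recurrence evaluation
```

Every cell read by the `min` expression lies above and/or to the left of the cell being written, which is why the row-by-row, left-to-right loop order is safe.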
The DP table must be populated in a reasonable order, such that a recurrence value is always stored in a table cell before any other cell that depends upon it is evaluated. For the Levenshtein recurrence, if an attempt were made to populate the DP table from right to left and/or bottom to top, then the recurrence would depend upon values that are not yet calculated or stored in the table, since all dependencies are to the left and/or above the current cell, therefore the algorithm would fail. In fact, all DP algorithms fail if the DP table is not populated in an order that respects the data dependency structure.
All Levenshtein recurrence dependencies are static or fixed: the recurrence always refers to three previous table entries using the same exact dependency structure. This allows a correct population order to be determined simply and reliably by examination of the dependency structure. For this algorithm, many population orders are possible, including: row first then column (Fig. 1b), column first then row (Fig. 1c), and sweeping a “wavefront” from top left to bottom right, aligned with the trailing diagonal, over which cell values can be populated in any order, or even in parallel (Fig. 1d). In fact many wavefront angles would work, as long as the wavefront moves from top left to bottom right (e.g. a wavefront with a gradient of 0.5 or 2 relative to the trailing diagonal will also populate the table in the correct order).
Note that as described, the Levenshtein distance algorithm works bottom-up, in the sense that “bottom” refers to the base cases of recursion (i.e. the top row and the leftmost column of the table) and “up” refers to moving towards the toplevel recurrence evaluation (marked “TOP” in the bottom-right corner of the table): moving from top left to bottom right in the DP table corresponds to moving upwards in the recursion hierarchy. However, this same recurrence could have been solved just as easily using memoized recursive descent instead: recursing down the tree of recurrence frames from the topmost recurrence evaluation to the base cases, checking whether an entry in the memo table has already been computed and memoized before recursing to it, and memoizing values as they are computed. The computed result of this top-down memoized variant of the Levenshtein distance algorithm would be the same as the bottom-up dynamic programming algorithm. In both variants, the time complexity would be $O(mn)$, where $m$ and $n$ are the lengths of the two strings.
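The top-down memoized variant described above might look as follows in Python. This sketch uses `functools.lru_cache` from the standard library as the memo table; the names `levenshtein_top_down` and `dist` are illustrative. Because it recurses, very long strings could exceed Python's default recursion limit, a limitation that the bottom-up form does not share.

```python
from functools import lru_cache


def levenshtein_top_down(a, b):
    """Edit distance between a and b via memoized recursive descent."""

    @lru_cache(maxsize=None)              # memo table keyed by (i, j)
    def dist(i, j):
        # dist(i, j) is the edit distance between a[:i] and b[:j].
        if i == 0:                        # base cases of the recurrence
            return j
        if j == 0:
            return i
        substitution = 0 if a[i - 1] == b[j - 1] else 1
        return min(
            dist(i - 1, j) + 1,
            dist(i, j - 1) + 1,
            dist(i - 1, j - 1) + substitution,
        )

    return dist(len(a), len(b))           # start from the toplevel evaluation
```

The recurrence is identical to the bottom-up form; only the order in which table entries are first requested differs.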
Similarly, memoized recursive descent parsing, or packrat parsing, can be inverted to form a bottom-up dynamic programming equivalent, newly introduced here as pika parsing.
2. Pika parsing: A DP algorithm for bottom-up, right to left parsing
2.1. Overview of pika parsing
The pika parser [1] is a new memoizing parsing algorithm that uses dynamic programming (DP) to apply PEG operators in reverse order relative to packrat parsing, i.e. bottom-up (from terminals towards the root of the parse tree) and from right to left (from the end of the input back towards the beginning). This is the reverse of the top-down, left-to-right order of standard packrat parsing, in the same sense that the Levenshtein distance recurrence can be optimally computed as either the original bottom-up DP algorithm or as a top-down memoized recursive descent algorithm (Section 1.7).
[1] A pika (“PIE-kah”) is a small, mountain-dwelling, rabbit-like mammal found in Asia and North America, typically inhabiting moraine fields near the treeline. Like a packrat, it stores food in caches or “haystacks” during the warmer months to survive the winter.
As with packrat parsing, pika parsing does not need a lexer or lex preprocessing step to parse an input: the parser can perform both the lexing and parsing steps of a traditional parsing algorithm.
Pika parsing scales linearly in the length of the input, which is rare among parsing algorithms, and has favorable performance characteristics for smaller grammars and long inputs, though it may incur a large constant cost per input character for large and complex grammars (Section 4). In particular, the pika parsing algorithm has essentially the same parsing and error recovery capabilities as the CYK parsing algorithm, but scales linearly rather than cubically in the length of the input.
Pika parsers and packrat parsers are not semantically equivalent, since pika parsers support left-recursive grammars directly (Section 3.1), whereas standard packrat parsers do not. In this sense, pika parsers are strictly more powerful than standard packrat parsers, because pika parsers can recognize languages that standard packrat parsers cannot. Pika parsers also support optimal error recovery (Section 3.2), which so far has not proved tractable for packrat parsers.
2.2. The choice of the PEG formalism
The pika parser is defined in this paper to work with PEG grammars (Section 1.6), which are unambiguous and deterministic, eliminating the ambiguity and nondeterminism that complicate other bottom-up parsers and increase parsing time complexity. However, other unambiguous and deterministic grammar types could also be used, as long as they depend only upon matches to the right of or below a given entry in the memo table (Section 2.4).
2.3. PEG grammars are recurrence relations, so can be solved with DP
Recursive descent PEG parsing hierarchically applies matching rules, which may reference other rules in a graph structure (possibly containing cycles). Recursive descent parsing works exactly like the hierarchical evaluation of a recurrence relation. However, for recursive descent parsing, the dependencies between recurrence frames are dynamic: both the Seq and OneOrMore operators can match subclause(s) more than once, and the start position of each subsequent subclause match depends upon how many characters were consumed by the previous matches in the sequence.
In spite of the need for dynamic dependencies when using recursive descent parsing with PEG grammars, it is still possible to formulate PEG parsing as a DP algorithm (Fig. 2). A DP parser creates a memo table with a row for each clause or subclause in the grammar, sorted topologically so that terminals are represented by rows at the bottom of the table (these are base cases for the recurrence), and the toplevel clause is represented by the first row (Section 2.5). The memo table has one column per character in the input, and the empty string can be considered to match beyond the end of the string (which creates a column of base cases for the recurrence to the right of the table). The top-leftmost cell (marked TOP in Fig. 2) is the toplevel recurrence frame, corresponding to the entry point for recursive descent parsing. As the table is populated, parse tree fragments are grown or extended upwards from leaves consisting of terminal matches. Parse tree fragments are extended upwards when a clause in the grammar is successfully matched at a given input position, meaning that a match was found in the memo table for each of the requisite subclauses of the clause. Thus each matching subclause represents a node in the generated parse tree (indicated with small white boxes in Fig. 2). The current memo entry is marked with a circled asterisk in Fig. 2, and memo entries that are fully parsed and finalized (or that cannot represent matches of grammar rules against the input) are shown with a hatched background.
2.4. Correct memo table population order for a DP PEG parser: bottom-up, right to left
When populating any entry in the memo table, the clause given by the row can depend upon subclause matches represented by memo entries either below the current cell (corresponding to a top-down recursive call from a higher-level/lower-precedence clause to a lower-level/higher-precedence clause in the grammar), or upon cells in any row of any column to the right of the current parsing position (corresponding to the second or subsequent subclause matches of a Seq or OneOrMore clause, after the first subclause match consumed at least one character).
By examining this dependency relationship, it is immediately clear that there is only one correct order in which to populate the memo table so that dependencies are always memoized before they are needed: bottom-up, then right to left. In other words, any parse nodes in the current column should be extended upwards as far as possible before the parser moves to the previous column. Without populating the memo table in this order, parsing will fail to complete, because the parser will attempt to read memo table entries for matches that have not yet been performed.
Bottom-up parsing is not unusual, but parsing from right to left is highly unusual among parsing algorithms (Section 1.5). Nevertheless, this is the only DP table population order that can successfully convert a memoized recursive descent parser into a DP or chart parser.
Note that when parsing from right to left, rules are still applied left-to-right. Therefore, there is no wasted work triggered by a spurious match of the right-most subclause of a Seq clause, for example by matching the closing parenthesis in a clause like ('(' E ')'): a Seq clause is not matched until its leftmost subclause is found to match, at which point the second and subsequent subclause matches are simply looked up in the memo table.
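The population order described in this section can be illustrated with a deliberately simplified Python sketch. It assumes a dense memo table, a grammar with no left recursion and no clause that matches zero characters, and clauses supplied as a list already sorted in bottom-up topological order, with subclauses referred to by list index. The names `dp_parse` and `evaluate` and the tuple encoding are assumptions for this example; this is not the pika algorithm itself, which additionally handles cycles and uses a sparse table (Sections 2.6 to 2.8).

```python
def evaluate(clause, idx, memo, text, pos):
    """Match one clause at `pos` using only table lookups for subclauses."""
    kind, *subs = clause
    if kind == "char":                               # terminal (base case)
        return 1 if pos < len(text) and text[pos] in subs[0] else None
    if kind == "seq":
        cur = pos
        for sub in subs:
            length = memo[sub][cur]                  # below, or in a column to the right
            if length is None:
                return None
            cur += length
        return cur - pos
    if kind == "first":
        for sub in subs:
            if memo[sub][pos] is not None:
                return memo[sub][pos]
        return None
    if kind == "one_or_more":
        first = memo[subs[0]][pos]
        if first is None:
            return None
        rest = memo[idx][pos + first]                # same row, column to the right
        return first + (rest or 0)
    raise ValueError(f"unknown clause type: {kind}")


def dp_parse(clauses, text):
    """Fill memo[clause index][start position] with match lengths (or None)."""
    n = len(text)
    memo = [[None] * (n + 1) for _ in clauses]       # extra column past the end
    for pos in range(n, -1, -1):                     # right to left
        for idx, clause in enumerate(clauses):       # bottom-up within the column
            memo[idx][pos] = evaluate(clause, idx, memo, text, pos)
    return memo


# Hypothetical grammar:  E <- '(' E ')' / 'x';
CLAUSES = [
    ("char", "("),       # 0
    ("char", ")"),       # 1
    ("char", "x"),       # 2
    ("seq", 0, 4, 1),    # 3: '(' E ')'
    ("first", 3, 2),     # 4: E (toplevel)
]
```

Note that clause 3 refers to clause 4, which is higher in the table, but only ever in a column to the right of the current one, because `'('` always consumes one character. Swapping either loop direction would make `evaluate` read entries that have not yet been written.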
2.5. Topological sorting of clauses into bottom-up order
In order to be able to invert the top-down call order of a recursive descent parser so that rules can be parsed bottom-up, the grammar graph must be sorted into bottom-up topological order, from leaves (terminals) to root (the toplevel rule).
Topological order is straightforward to obtain for an arbitrary tree or DAG: simply perform a depth-first search (DFS) of the graph, writing the node to the end of the bottom-up topological order (or to the beginning of the top-down topological order) right before exiting a recursion frame (i.e. the topological sort order is the deduplicated postorder traversal order of the nodes in a DAG). If there is more than one greatest upper bound or toplevel node in the DAG, the DFS used to produce the topological sort order needs to be applied to each toplevel node in turn, appending to the end of the same topological sort output for each DFS call, and reusing the same set of visited nodes for each call (so that descendant nodes are not visited twice across different DFS calls).
However, the topological sort order is not defined for a graph containing cycles. To produce a reasonable topological sort order for a grammar graph containing cycles, the cycles must be broken in appropriate places to form a tree or DAG. The appropriate way to break cycles for a grammar graph is to remove the edge that causes a path from a toplevel node to double back to a node already in the path, since the node that would otherwise appear twice in the path must be the entry point for the cycle (and/or it must be the lowest precedence clause in the cycle).
The topological order for a set of grammar rules is obtained by running DFS starting from each node in the following order, building a single topological order across all DFS invocations, and sharing the set of visited nodes across invocations:
All toplevel nodes, i.e. any rules that are not referred to by any other rule. (There will typically be only one of these in a grammar.)
The lowest precedence clause in each precedence hierarchy. (The lowest precedence clause is the entry point into a precedence hierarchy.)
“Head nodes” of any other cycles in the grammar: this is the set of all nodes that can be reached twice in any path that can be traversed through the grammar starting from a toplevel node (halting further recursion when a head node is reached). A cycle detection variant of DFS is used to find these head nodes.
This topological sort algorithm labels each clause with an index (stored in the clauseIdx field of the Clause class – Section 5.3) representing the position of the clause in the topological order of all clauses in the grammar, in increasing order from terminals up to the toplevel clause.
The implementation of this topological sort algorithm can be seen in the grammar initialization stage of the reference parser (Section 9).
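As a rough illustration of the procedure in this section (and not a transcription of the reference parser), the following Python sketch computes a bottom-up order for a grammar graph given as a dictionary from clause names to the names they refer to. It covers toplevel nodes and cycle head nodes but omits the precedence-hierarchy step, and it adds a final pass over all nodes as a safety net for components not reachable from any toplevel node. The names `topological_order`, `find_heads`, and `visit` and the example graph are assumptions for this example.

```python
def topological_order(graph):
    """Order clause names bottom-up (terminals first, toplevel rule last).

    `graph` maps each clause name to the list of clause names it refers to.
    """
    referenced = {child for children in graph.values() for child in children}
    toplevel = [node for node in graph if node not in referenced]

    # Cycle-detecting DFS: a node reached again while it is still on the
    # current path is recorded as the head node of a cycle.
    heads, discovered = [], set()

    def find_heads(node, path):
        if node in path:
            if node not in heads:
                heads.append(node)
            return
        if node in discovered:
            return
        discovered.add(node)
        path.add(node)
        for child in graph[node]:
            find_heads(child, path)
        path.remove(node)

    for root in toplevel + list(graph):
        find_heads(root, set())

    # Postorder DFS with one visited set shared across all roots. Skipping
    # already-visited nodes drops the edge that doubles back into a cycle.
    order, visited = [], set()

    def visit(node):
        if node in visited:
            return
        visited.add(node)
        for child in graph[node]:
            visit(child)
        order.append(node)               # written just before leaving the frame

    for root in toplevel + heads + list(graph):
        visit(root)
    return order


# Hypothetical grammar:  Program <- Expr;  Expr <- Paren / Num;
#                        Paren <- Open Expr Close;
GRAMMAR_GRAPH = {
    "Program": ["Expr"],
    "Expr": ["Paren", "Num"],
    "Paren": ["Open", "Expr", "Close"],
    "Num": [],
    "Open": [],
    "Close": [],
}
```

The position of each name in the returned list plays the role that the text assigns to clauseIdx: terminals receive the lowest indices and the toplevel rule the highest.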
2.6. Scheduling parent clauses for evaluation
To extend the memoized parse tree fragment upwards from the current memo entry, each time a clause X successfully matches the input, the parser must identify all parent clauses for which X is the first subclause – or, in the case of a Seq operator, for which X is not the first subclause, but X would be matched in the same start position as the parent clause (as a result of all previous sibling subclauses matching but consuming zero characters). These parent clauses are termed the seed parent clauses of the subclause, and since the grammar does not change during parsing, seed parent clauses can be determined for each clause during initialization. Seed parent clauses for X would include (W / X / Y), (X*), (X?), (X Y Z) and (W? X Z), but would not include (W X Z) if a match of W always consumes at least one character, since X would then match at a start position strictly to the right of the start position of a match of its parent clause. As the parent clause (W X Z) will always be triggered by a match of W in the same start position before the memo table is even checked for a match of X, X does not need to trigger the parent clause.
Whenever a new match is found for the clause corresponding to the row of the current memo entry, all seed parent clauses of the matching clause are scheduled for evaluation at the same start position (i.e. in the same column as the current entry). This expands the “DP wavefront” upwards.
Even though all memo table entries to the right of and below the current cell are finalized (shown as hatched in Fig. 2), not every memo table entry needs to be evaluated, since no typical grammar will give rise to a parse tree consisting of one node for every entry in the memo table – therefore, the memo table can and should be stored sparsely. The seed parent clauses mechanism improves memory consumption and performance of the pika parser, since it allows a pika parsing implementation to use a sparse map for the memo table, and evaluate only the clauses that can possibly match at a given position.
2.7. Adjustment of memo table population order to handle cycles
When traversing down any path in a recursive descent parse tree from the root, precedence increases downwards towards the terminals: the start rule or toplevel node at the top of the parse tree has the lowest precedence, and leaves or terminal matches at the bottom of the parse tree have the highest precedence. It is therefore common to write recursive descent grammars in precedence climbing form (Section 3.3), where if a rule fails to match, it fails over to a higher level of precedence. Since pika parsers work bottom-up, they are actually precedence descent parsers, even though the grammar rules are still defined in precedence climbing form, i.e. as if designed to be parsed by a top-down parser. (Actually, although the DP table population order is bottom-up for pika parsing, the memo table lookup step where each clause is evaluated to see if it matches technically applies the match top-down, just as the Levenshtein distance DP algorithm is bottom-up, but applies the recurrence top-down in each DP table lookup step.)
When a memo entry depends upon another entry to the right of the current column and above the current row (as shown for two of the three dependency arrows in Fig. 2), this indicates a decrease in precedence, which typically only occurs through the use of a precedence override pattern, such as a rule that matches the lowest precedence clause of a precedence hierarchy surrounded by parentheses. Any dependency upon a topologically-higher clause also indicates that a grammar cycle has been encountered, since the clauses in the rows of the memo table were sorted topologically, breaking cycles only at the point where a path doubles back to a lower-precedence clause (Section 2.5).
When a grammar contains cycles, to replicate the left-recursive semantics of any top-down parser that can handle left recursion, a bottom-up parser must maximally expand the parse tree upwards for nodes that form part of a cycle before moving further upwards to populate the memo table entries higher in the same memo table column. This is accomplished by using a priority queue to schedule parent clauses for matching. The priority queue sorts in increasing order of bottom-up topological sort index of the clause (with the lowest topological sort indexes being assigned to the terminals, and the highest topological sort index being assigned to the start rule or toplevel rule). If a parent clause in a grammar cycle has lower precedence, then another match of clauses within the cycle will be attempted before moving further up the column, as a direct consequence of the use of a priority queue to schedule clauses for matching. This behavior enables left recursion (and right recursion) in grammar clauses to be handled directly.
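The scheduling order can be illustrated with a small sketch built on `java.util.PriorityQueue`. The `ClauseScheduler` class and the nested `Clause` record are hypothetical names used only here; the one property taken from the text is that clauses are removed from the queue in increasing order of bottom-up topological sort index.

```java
import java.util.PriorityQueue;

class ClauseScheduler {
    // Hypothetical stand-in for a grammar clause: only its topological sort index matters here.
    static final class Clause {
        final String name;
        final int clauseIdx; // 0 = terminals at the bottom; highest index = toplevel rule

        Clause(String name, int clauseIdx) {
            this.name = name;
            this.clauseIdx = clauseIdx;
        }
    }

    // Clauses with the lowest topological sort index are removed first.
    private final PriorityQueue<Clause> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a.clauseIdx, b.clauseIdx));

    void schedule(Clause clause) {
        queue.add(clause);
    }

    boolean hasNext() {
        return !queue.isEmpty();
    }

    Clause next() {
        return queue.remove();
    }

    public static void main(String[] args) {
        ClauseScheduler scheduler = new ClauseScheduler();
        scheduler.schedule(new Clause("Program", 4));
        scheduler.schedule(new Clause("V", 1));
        System.out.println(scheduler.next().name);
        // A clause lower in the table, scheduled later, still runs before "Program".
        scheduler.schedule(new Clause("P", 2));
        while (scheduler.hasNext()) {
            System.out.println(scheduler.next().name);
        }
    }
}
```

Because a clause inside a cycle has a lower index than the clauses above the cycle, it is rescheduled and re-evaluated before the queue moves on to anything higher in the column.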
2.8. Ensuring termination when grammars contain cycles
Recursive descent parsers get stuck in infinite recursion if grammar rules are left-recursive. To ensure termination of a pika parser, whenever a new match is found for a memo entry that already contains a match (which only happens if a cycle has been traversed in the grammar), the newer match must be longer than the older match for the memo entry to be updated, and for seed parent clauses to be scheduled for evaluation.
In other words, looping around a cycle in the grammar must consume at least one additional character per loop. This has two effects:
Left recursion is guaranteed to terminate at some point, because the input string is of finite length.
Left recursion is able to terminate early (before consuming the entire input) whenever looping around a cycle in the grammar results in a match of the same length as the previous match, or a mismatch, indicating that no higher-level matches were able to be found.
There is one exception to this: for First clauses, even before the length of the new match is checked against the length of the existing match as described above, the index of the matching subclause of each match must be compared, since the semantics of the First PEG operator require that an earlier matching subclause take priority over a later matching subclause (Section 5.4).
If an entry in the memo table already contains a match, and a new and better match is found for the same memo entry due to a cycle in the grammar, then the older match is linked into the parse tree as a descendant node of the new match, so the older match is not lost due to being overwritten by the newer match in the memo entry.
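The update rule of this section can be sketched as a single comparison method. The `MatchSketch` class and its fields are hypothetical simplifications (the Match class of Section 5.4 stores more information), and this sketch gives the subclause index of a First clause strict priority over match length, which is one reading of the rule described above.

```java
class MatchSketch {
    final boolean isFirstClause;         // whether the matched clause is a First clause
    final int firstMatchingSubClauseIdx; // index of the matching subclause (First clauses only)
    final int len;                       // number of characters consumed

    MatchSketch(boolean isFirstClause, int firstMatchingSubClauseIdx, int len) {
        this.isFirstClause = isFirstClause;
        this.firstMatchingSubClauseIdx = firstMatchingSubClauseIdx;
        this.len = len;
    }

    // Returns true if this (newer) match should replace the older match in the memo entry.
    boolean isBetterThan(MatchSketch older) {
        if (older == this) {
            return false;
        }
        // For First clauses, an earlier matching subclause wins before lengths are compared.
        if (isFirstClause && firstMatchingSubClauseIdx != older.firstMatchingSubClauseIdx) {
            return firstMatchingSubClauseIdx < older.firstMatchingSubClauseIdx;
        }
        // Otherwise a loop around a grammar cycle must consume at least one more character.
        return len > older.len;
    }

    public static void main(String[] args) {
        MatchSketch shorter = new MatchSketch(false, 0, 3);
        MatchSketch longer = new MatchSketch(false, 0, 5);
        System.out.println(longer.isBetterThan(shorter));
        System.out.println(shorter.isBetterThan(longer));
    }
}
```

Since a replacement is only accepted when the subclause index decreases or the length increases, and both quantities are bounded, the number of updates to any one memo entry is finite.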
2.9. Step-through of pika parsing
Fig. 3 illustrates how pika parsing works for a concrete grammar and input, broken down into the 14 separate steps needed to parse the input. In step 1, the parsing position is the last character position, i.e. 5, and the terminal [a-z] matches in this position, consuming one character. There is only one seed parent clause of the terminal clause [a-z], which is the rule V, so V is scheduled to be matched at position 5 in step 2. In step 2, rule V matches its first subclause [a-z], so the whole rule V is found to match at position 5, also consuming one character.
There is one seed parent clause of V, which is P, and since V matches at position 5, (P <- V+) is scheduled to be matched at position 5, and is also found to match at position 5, consuming one character. However, this match of P is not depicted in this diagram, because the match is not picked up as a subclause match by higher clauses, so it represents a spurious match, and is not linked into the final parse tree when parsing is complete.
Once no more upwards expansion is possible in input position 5, the parser moves to the left, to input position 4, and again tries matching all terminals in the bottom rows of the table, in order to trigger the bottom-up growth of the parse tree, then moves to successively higher clauses by following their seed parent clauses.
This process continues until, in steps 10 and 11, the upwards growth in the parse tree results in a loop around the single cycle in the grammar rules: rule V matches the string "(bc)", but V also matches each of "b" and "c". There is only one row in the memo table for each grammar clause, so to allow the whole structure of the tree to be visualized simply, if a parse tree fragment loops around a cycle in the grammar, then clauses involved in the loop are shown duplicated on different rows.
A second loop around the grammar cycle can be observed in the final step: rule P matches the entire input, but also matches "bc".
3. Properties of pika parsers
3.1. Support for left and right recursion
With the conditions described in Section 2.8 in place, it is possible to handle both left and right recursion, either direct or indirect, using a single simple parsing mechanism. (It should be pointed out that reversing the parsing direction from left-to-right to right-to-left does not simply replace the left recursion problem with a corresponding right recursion problem.) No additional special handling is required to handle grammar cycles of any degree of nested complexity.
An example of applying pika parsing to a grammar that contains both left-recursive rules (addition or Sum, and multiplication or Prod) and a right-recursive rule (exponentiation or Exp) is given in Fig. 4. Note that all precedence and associativity rules are correctly respected in the final parse tree: the run of equal-precedence addition terms is built into a left-associative parse subtree; the run of equal-precedence exponentiations is built into a right-associative parse subtree; exponentiation takes precedence over multiplication, which takes precedence over addition (except in the case of the use of parentheses to override the precedence order for the inner summation, (e+f)). The parse terminates after several different loops around cycles in the grammar reach the toplevel rule, Assign, consuming the entire input.
3.2. Error recovery in pika parsers
PEG rules only depend upon the span of input from their start position to their end position; in other words, they do not depend upon any characters before their start position or after their end position. Because the input is parsed from right to left, meaning the memo table is always fully populated to the right of the current parse position, the span of input to the right of a syntax error is always fully parsed (not including any long-range dependencies, e.g. the final closing curly brace for a method definition, when the syntax error occurs in a statement in the middle of the method). By the time right-to-left parsing reaches a syntax error, everything to the right of the error has already been parsed without the error having been encountered at all. To the left of the syntax error, rules whose matches end before the start of the syntax error will still match their corresponding input; and just as with recursive descent parsing, rules whose matches start before the syntax error but overlap with the syntax error will fail to match. Consequently, pika parsers have optimal error recovery characteristics without any further modification.
Syntax errors can be defined as regions of the input that are not spanned by matches of rules of interest. Recovering after a syntax error involves finding the next match in the memo table after the end of the syntax error for any grammar rule of interest: for example, a parser could skip over a syntax error to find the next complete function, statement, or expression in the input. This lookup requires $O(\log n)$ time in the length of the input $n$ if a skip list or balanced tree is used to store each row of the memo table.
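As a sketch of the recovery lookup, one memo table row can be kept in a `java.util.TreeMap` (a balanced red-black tree), so that the next match at or after a given position is found with `ceilingKey`. The `RecoveryLookup` class, the row representation (start position mapped to match length) and the positions used in `main` are assumptions made for this illustration.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

class RecoveryLookup {
    // Returns the start position of the first match in this memo table row at or after
    // the given position, or null if the row holds no further match.
    static Integer nextMatchStart(NavigableMap<Integer, Integer> row, int fromPos) {
        return row.ceilingKey(fromPos);
    }

    public static void main(String[] args) {
        // Hypothetical row for one rule of interest: start position -> match length.
        NavigableMap<Integer, Integer> row = new TreeMap<>();
        row.put(0, 5);
        row.put(5, 6);
        row.put(17, 8);
        // Recover at the first match at or after position 12.
        System.out.println(nextMatchStart(row, 12));
    }
}
```

A skip list such as `java.util.concurrent.ConcurrentSkipListMap` exposes the same `NavigableMap` interface and could be substituted without changing the lookup.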
An example parse of an input containing a syntax error (a missing closing parenthesis after input position 15) is given in Fig. 5. For this example, the beginning and end position of the syntax error are determined by finding regions of the input that are not spanned by a match of either the Program or Assign rule (any rules could be chosen for the purpose of identifying complete parses of rules of interest, and whatever is not spanned by those rules is treated as a syntax error). Using this method, the syntax error is found to be located between input positions 11 and 16 inclusive: the memo table contains terminal matches at these input positions, but no matches of either Program or Assign. Assuming that the intent is to recover at the next sequence of Assign rule matches after the syntax error, the parser can recover by finding the next match of (Assign+) in the memo table after the last character of the syntax error.
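Locating the syntax error regions themselves amounts to finding the gaps between the spans covered by matches of the rules of interest. The following sketch assumes that those matches have already been collected as half-open `[start, end)` spans; the `ErrorRegions` class and the spans used in `main` are hypothetical and are not taken from Fig. 5.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ErrorRegions {
    // Returns the half-open [start, end) regions of the input not covered by any span.
    static List<int[]> uncovered(List<int[]> spans, int inputLen) {
        List<int[]> sorted = new ArrayList<>(spans);
        sorted.sort((a, b) -> Integer.compare(a[0], b[0]));
        List<int[]> gaps = new ArrayList<>();
        int coveredUpTo = 0;
        for (int[] span : sorted) {
            if (span[0] > coveredUpTo) {
                gaps.add(new int[] { coveredUpTo, span[0] });
            }
            coveredUpTo = Math.max(coveredUpTo, span[1]);
        }
        if (coveredUpTo < inputLen) {
            gaps.add(new int[] { coveredUpTo, inputLen });
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Hypothetical spans of matches of the rules of interest.
        List<int[]> spans = Arrays.asList(new int[] { 17, 25 }, new int[] { 0, 11 });
        for (int[] gap : uncovered(spans, 25)) {
            System.out.println("syntax error in [" + gap[0] + ", " + gap[1] + ")");
        }
    }
}
```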
3.3. Precedence parsing with PEG grammars
A precedence-climbing grammar tries to match a grammar rule at a given level of precedence, and, if that rule fails to match, defers or “fails over” to the next highest level of precedence (using a First clause, in the case of a PEG grammar). The highest precedence rule within a precedence hierarchy uses a precedence override pattern, such as parentheses, to support multiple nested loops around the precedence hierarchy cycle.
A simple precedence-climbing grammar is shown in Listing 1. The rule for each level of precedence in this example grammar has subclauses (corresponding to operands) that are references to rules at the next highest level of precedence. If the operands and/or the operator don’t match, then the entire rule fails over to the next highest level of precedence. The root of the parse subtree for any fully-parsed expression will be a match of the lowest-precedence clause, in this case E0.
This simple example grammar can match nested unequal levels of precedence, e.g. "1*2+3*4", but cannot match runs of equal precedence, e.g. "1+2+3", since these cannot be parsed unambiguously without specifying the associativity of the addition operator. This grammar also cannot handle direct nesting of equal levels of precedence, e.g. "--4".
Since pika parsers can handle left recursion directly, these issues can be fixed by modifying the grammar into the form shown in Listing 2, allowing unary minus and parentheses to self-nest, and resolving ambiguity in runs of equal precedence by rewriting rules into left-recursive form, which generates a parse tree with left-associative structure. Similarly, a right-associative rule Y1 employing a binary operator OP would move the self-recursive subclause to the right side, yielding the right-recursive form Y1 <- (Y2 OP Y1) / Y2, which would generate a parse tree with right-associative structure.
4. Performance and benchmarking
Ignoring cycles in the grammar, packrat parsing would appear to take $O(n|G|)$ time in the worst case for input length $n$ and grammar size $|G|$, since there are $n$ columns and $|G|$ rows in the memo table. However, a very deep imbalanced parse tree tends towards the structure of a linked list, therefore both the worst case depth of the parse tree and the number of nodes in the parse tree are linearly correlated with $n$, and the worst case performance of packrat parsing is actually $O(n)$ no matter the structure or size of the grammar, assuming full memoization of recursive calls.
For pika parsing, a certain amount of work is wasted due to spurious matches (i.e. matches that will not end up in the final parse tree) caused by working bottom-up from terminals without taking parsing context into account. Initialization of bottom-up parsing requires matching $|T|$ terminals against $n$ input positions, requiring $O(n|T|)$ time. Spurious matches of terminals against the input (e.g. matching a word inside a comment as an identifier rather than a sequence of comment characters) may subsequently trigger higher-level spurious matches, up to some worst case height at which no more higher-level structure can be spuriously matched. (Under most circumstances, with the exception of special cases like commented-out code, spurious matches usually do not cascade far up the grammar hierarchy, because random input does not typically match large parts of the grammar with high likelihood.) The problem of spurious matches is exacerbated when grammars reuse subclauses between many rules: this can lead to the shared subclauses having multiple seed parent clauses, which can trigger an upwards-directed tree of spurious matches, effectively further increasing the overhead of spurious matches. The work required to find the correct matches and build the parse tree is $O(n)$, as for top-down parsing. The overall runtime is therefore $O(n|T|)$ for terminal matching, plus the cost of the spurious higher-level matches, plus $O(n)$ for the correct matches. The number of terminals $|T|$ is constant for a given grammar, and it is a reasonable simplifying assumption that the worst-case ratio of spurious matches to correct matches is constant for the grammar. This reduces both the time and the memory complexity of pika parsing for a specific grammar to $O(kn)$, where $k$ is some constant overhead factor of pika parsing relative to packrat parsing for the grammar, due to spurious matches.
The factor $k$ is a complex function of not only the size and complexity of the grammar, but also of the distribution of all inputs that may be reasonably parsed; however, for a fixed grammar and a distribution of inputs, $k$ can be assumed to be constant, i.e. pika parsing is $O(n)$ in the length of the input.
This linear scaling property of pika parsing was measured and demonstrated to hold for simple and complex grammars.
4.1. Benchmarking setup
Parboiled2 is a Scala-based (partially memoizing) packrat parsing library for PEG grammars, and does not handle left recursion. ANTLR4 is a Java-based memoizing ALL(*) parser generator, and handles direct left recursion but does not handle indirect left recursion (ANTLR3 handled neither form of left recursion). Both Parboiled2 and ANTLR4 are considered to have best-in-class performance, although both require careful grammar design to avoid superlinear scaling behaviors in the worst case.
Two grammars were chosen for benchmarking:
The expression grammar shown in Listing 1.
The Java grammar, an implementation of the Java language specification. For the pika parser and Parboiled2, a PEG grammar implementing the Java 6 language specification was obtained from the Parboiled2 software distribution. ANTLR4 is not a PEG parser, so this same grammar could not be used for ANTLR4, but a comparable grammar, implementing the Java 8 language specification and provided in the ANTLR4 software distribution, was used instead.
Note that any grammar for the Java language specification is complex, widely employing deep precedence climbing hierarchies throughout the grammar, which can result in parse trees dozens to hundreds of nodes deep. This grammar is arguably within the same order of magnitude of complexity as any of the most complex grammars in use for real-world programming languages. The Parboiled2 PEG grammar consists of 669 clauses, of which 143 are terminals. Of the 669 clauses in the grammar, 100 clauses had at least 2 seed parent clauses; 50 clauses had at least 4 seed parent clauses; 4 clauses had at least 10 seed parent clauses; and the clause with the highest number of seed parent clauses, the clause matching an identifier, (!Keyword Letter LetterOrDigit* Spacing), had 15 seed parent clauses. (The empty string matching clause, Nothing, had 88 seed parent clauses, but this is handled differently by the pika parser to reduce the number of memoized zero-length matches – Section 6.1.)
For each grammar type, a corresponding input dataset was generated.
The expression grammar is a simple precedence-climbing grammar that does not include direct or indirect left recursion (Parboiled2 and ANTLR4 cannot handle one or both forms of left recursion). This requires inputs to be fully parenthesized, so that runs of two or more operators of equal precedence do not occur in the input. In total, 2500 random arithmetic expressions were generated, consisting of integers, binary operators, unary negation and parentheses, nested up to 25 levels deep. This resulted in arithmetic expression strings with a minimum length of 5 characters, an average length of 140k characters, and a maximum length of 4.8M characters.
For the Java grammar, 5372 Java 6 source files were sourced from an old Java 6 version of the Spring framework, with an average size of 3.8kiB and a maximum size of 76kiB. Comments were stripped from the input files (to focus on the performance of parsing the structural elements of each source file), but string constants (which can also cause spurious matches for the pika parser) were left in place. (A more highly optimized pika parser would reduce spurious matches with a lex preprocessing step, see Section 6.3 – although this was not attempted for this benchmark.)
Each input from each dataset was provided to each of the three parsers, and parsed with the appropriate grammar. All benchmarks were run single-threaded with a Ryzen-1800X CPU using JRE version 13.0.2.
4.2. The scaling characteristics of pika parsing
As depicted in Fig. 6 and summarized in Table 2, pika parsing was measured as scaling linearly in the length of the input. Note that a straight line in a log-log plot corresponds to the power law $y = a x^{b}$, so an apparent linear correlation in a log-log plot does not imply a linear correlation if the same data were plotted with linear axes unless $b = 1$.
Parsing cannot be sub-linear, since all input must be consumed, so the amount by which the smaller of the two scaling powers of the regression lines (0.963) fell below 1.0 gives a measure of the regression error. The larger scaling power (1.05) lies a comparable distance above 1.0, which indicates that both of these grammars are within reasonable error bounds of linear scaling. The $R^2$ values for the regression lines are both very close to 1.0, indicating that the linear correlation is strong.
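To make the log-log relationship concrete, the following sketch fits a power law $y = a x^{b}$ by ordinary least squares on $(\log x, \log y)$, so that the slope of the fitted line is the scaling power $b$. This is an illustrative assumption about how such a regression can be computed, not a description of the tooling used for the benchmarks; the `PowerLawFit` class and the data in `main` are hypothetical.

```java
class PowerLawFit {
    // Fits y = a * x^b by least squares in log-log space; returns {a, b}.
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double sumLx = 0, sumLy = 0, sumLxLx = 0, sumLxLy = 0;
        for (int i = 0; i < n; i++) {
            double lx = Math.log(x[i]);
            double ly = Math.log(y[i]);
            sumLx += lx;
            sumLy += ly;
            sumLxLx += lx * lx;
            sumLxLy += lx * ly;
        }
        // Slope of the regression line in log-log space is the scaling power b.
        double b = (n * sumLxLy - sumLx * sumLy) / (n * sumLxLx - sumLx * sumLx);
        // Intercept of the regression line is log(a).
        double logA = (sumLy - b * sumLx) / n;
        return new double[] { Math.exp(logA), b };
    }

    public static void main(String[] args) {
        // Hypothetical timings that grow exactly linearly with input length.
        double[] inputLength = { 10, 100, 1000, 10000 };
        double[] parseTime = { 30, 300, 3000, 30000 };
        double[] ab = fit(inputLength, parseTime);
        System.out.println("a = " + ab[0] + ", b = " + ab[1]);
    }
}
```

A fitted power close to 1 corresponds to linear scaling, whereas a power close to 2 corresponds to quadratic scaling, even though both appear as straight lines on log-log axes.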
Notably, this linear relationship between input length and parsing time held for both the simple expression grammar and the significantly more complex Java grammar, across several orders of magnitude of input length. This is a very favorable trait for a parsing algorithm.
The pika parsing speed for the expression grammar (in characters of input processed per second) is approximately 8.9 times higher than the parsing speed for the Java grammar. For reference, the Java grammar consists of over 61 times as many clauses as the expression grammar, 6.85 times greater than the difference in parsing speed, so the impact of increasing the size of the grammar on the parsing speed appears to be sublinear. This is further reinforced by the fact that the larger Java grammar is also significantly more structurally complex than the expression grammar, and shares subclauses widely, leading to more seed parent clauses per clause, which will trigger more spurious matches than simply increasing the size of the grammar alone.
| Figure | Grammar | Dependent variable | Independent variable | $R^2$ |
|---|---|---|---|---|
| 5(a) | expression | Pika parsing time | input length | 0.974 |
| 5(b) | Java | Pika parsing time | input length | 0.975 |
Benchmarks of pika parsing time vs. input length
4.3. Parsing time for ANTLR4 and Parboiled2 vs. the reference pika parser: expression grammar
For the expression grammar, the pika parser was roughly the same speed as Parboiled2 for small expressions, but the pika parser was 1000 times faster than Parboiled2 for larger expressions. In fact, Parboiled2 degraded nearly quadratically as a function of the length of the input for this grammar, as seen in the power term of the regression line fitted against the parsing time for the pika parser (which is itself a linear function of the length of the input).
The pika parser was also roughly the same speed as ANTLR4 for small expressions, but the pika parser was 10 times faster than ANTLR4 for large expressions, and ANTLR4 also degraded superlinearly in performance as a function of input length.
4.4. Parsing time for ANTLR4 and Parboiled2 vs. the reference pika parser: Java grammar
For the Java grammar, Parboiled2 was estimated by regression to scale sublinearly in the length of the input; however, this is impossible, since all input must be consumed by any parser. The regression power being below 1.0 merely demonstrates that Parboiled2 was superlinearly inefficient for small inputs, and became more efficient as the input size increased: for very long inputs, the performance of Parboiled2 must trend to at least linear. It can be seen then that the pika parser is approximately 100 times slower than Parboiled2 for moderate to long input lengths, whereas for shorter inputs, the pika parser was 10 times slower than Parboiled2. This demonstrates the overhead of spurious matches incurred by bottom-up matching with a large and complex grammar.
ANTLR4 is harder to analyze, because it exhibited an inconsistent (and multimodal) distribution of parsing times relative to the pika parser. For shorter inputs, the pika parser was up to 1000, 100 or 10 times slower than ANTLR4, whereas for the longest inputs, the pika parser and the ANTLR4 parser ran at roughly the same speed. Despite the slower speed for shorter inputs, the pika parser has a significantly more predictable parsing time as a function of input length than ANTLR4, and also has better scaling characteristics: parsing time for ANTLR4 scales between quadratically and cubically in the length of the input, as can be seen from the power of the regression line (note however that the performance characteristics of ANTLR4 are quite unpredictable, with $R^2 = 0.600$, so the regression line is only a rough estimate of the scaling properties of ANTLR4 relative to the pika parser).
| Figure | Grammar | Dependent variable | Independent variable | $R^2$ |
|---|---|---|---|---|
| 6(a) | expression | Parboiled2 parsing time | Pika parsing time | 0.897 |
| 6(b) | expression | ANTLR4 parsing time | Pika parsing time | 0.957 |
| 6(c) | Java | Parboiled2 parsing time | Pika parsing time | 0.614 |
| 6(d) | Java | ANTLR4 parsing time | Pika parsing time | 0.600 |
Benchmarks of parsing time for the Parboiled2 and ANTLR4 parsers vs. the pika parser
4.5. Performance discussion
The reference pika parser was zero to three orders of magnitude faster than Parboiled2 and zero to one order of magnitude faster than ANTLR4 for inputs conforming to the expression grammar, with every indication that the performance gap would widen further for even larger inputs. However, the pika parser was one to two orders of magnitude slower than Parboiled2 and zero to three orders of magnitude slower than ANTLR4 for inputs conforming to the Java grammar, with the performance gap narrowing as input length increased for the ANTLR4 case (i.e. the large constant performance overhead of pika parsing would eventually be dominated by the super-quadratic scaling behavior of ANTLR4). Parboiled2 appeared to maintain linear performance for large inputs. This indicates that the pika parser may not be the right tool for every parsing task, depending on the grammar, the expected range of input sizes, and the degree to which parsing performance is important for a given application.
However, the parse time for the pika parser consistently demonstrated a strongly linear relationship to the length of the input, for both the simple expression grammar and the complex Java grammar, in spite of large differences in size and structure between the two grammars. The strongly linear performance characteristics measured for the pika parser increase the attractiveness of this parsing algorithm for real-world usage, because the performance is very predictable. The widely-used Parboiled2 and ANTLR4 libraries both claim linear scaling under typical usage; however, each of these libraries exhibited quadratic or even super-quadratic performance degradation for one of the two grammars, and ANTLR4 in particular appears to suffer from multi-modal performance degradation behavior (Fig. 6(d)).
Some effort was made to understand cherry-picked cases where Parboiled2 and ANTLR4 were both able to parse a Java source file one or more orders of magnitude more quickly than the pika parser. In each of these cases, the source files were long, with very shallow syntactic structure (e.g. the files consisted of mostly constant static fields with literal initializer values). Deep grammatical structures slow down ANTLR4 in particular, so it is significantly more efficient when parsing only shallow grammatical structures. Parboiled2 used numerous tricks to speed up parsing, for example not memoizing the bottom few tiers of the parse tree (e.g. not memoizing identifiers character-by-character). These sorts of optimizations were not applied to the reference pika parser, further separating the performance of the pika parser from the other parsers.
The most significant mechanism for speeding up the pika parser is to reduce the number of spurious matches through the use of a lex preprocessing step (Section 6.3), in particular representing quoted strings and comments as a single token. This was not attempted for the Java grammar in these benchmarks, because complex work would be required for such a large grammar to ensure that lexing the input did not change the recognized language.
Many grammars are nowhere near as complex as the grammar for the Java language specification, and the pika parser appears to perform efficiently for grammars that are not as large or complex. Even for large or complex grammars, the overhead of pika parsing relative to packrat parsing may be acceptable, particularly where the benefits of direct handling of left recursion and optimal error recovery are worth the constant performance penalty, and/or where linear scaling in the input length must be guaranteed.
However, whether or not the pika parsing algorithm is usable for any particular real-world parsing use case, the algorithm should still be of interest from a theoretical perspective, since the inversion of recursive descent parsing into dynamic programming form raises interesting possibilities for other traditionally top-down recursive algorithms.
5. Implementation details
The key classes of the pika parser are as follows. The code shown is simplified to illustrate only the most important aspects of the parsing algorithm and data structures (e.g. field initialization steps and trivial assignment constructors are usually omitted; optimizations are not shown; etc.). Listings are given in valid Java notation for precise semantic clarity compared to an invented pseudocode (see footnote 2); however, Java-specific features and classes were avoided wherever possible, so that the code is mostly generic, and should be readable to anyone with an understanding of C++-derived object oriented languages. See the reference parser for the full implementation (Section 9).
Footnote 2: Java is a reasonable choice for code snippets in published papers due to its popularity, its low degree of implicitness, and its general readability even by most non-Java programmers. The use of ad hoc invented pseudocodes in the majority of computer science papers over many years has contributed to a crisis of reproducibility, due to ambiguities in pseudocode notation, important omitted details in pseudocode snippets, and outright errors accidentally introduced when rewriting a working implementation into pseudocode. Often these issues only become apparent when a reader attempts to convert the pseudocode in a paper into an implementation in a specific language.
5.1. The MemoKey class
The MemoKey class (Listing 4) has two fields, clause and startPos, respectively corresponding to the row and column of an entry within the memo table (see footnote 3).
Footnote 3: The hashCode and equals methods, required by Java to implement the key equality contract so that MemoKey can be used as a hashtable key, and the trivial assignment constructor, are not shown.
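For readers who want to see the key equality contract spelled out, a minimal sketch of such a key class follows. It is not the paper's Listing 4: the enclosing `MemoKeySketch` class and the placeholder `Clause` class are hypothetical, and the sketch assumes that each clause is represented by a single unique object, so that identity comparison of clauses is sufficient.

```java
import java.util.HashMap;
import java.util.Map;

class MemoKeySketch {
    // Hypothetical placeholder for a grammar clause.
    static final class Clause {
        final String label;

        Clause(String label) {
            this.label = label;
        }
    }

    static final class MemoKey {
        final Clause clause; // row of the memo table
        final int startPos;  // column of the memo table

        MemoKey(Clause clause, int startPos) {
            this.clause = clause;
            this.startPos = startPos;
        }

        @Override
        public int hashCode() {
            // Assumes one unique object per clause, so identity hashing is consistent with equals.
            return System.identityHashCode(clause) * 31 + startPos;
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj) {
                return true;
            }
            if (!(obj instanceof MemoKey)) {
                return false;
            }
            MemoKey other = (MemoKey) obj;
            return clause == other.clause && startPos == other.startPos;
        }
    }

    public static void main(String[] args) {
        Clause v = new Clause("V");
        Map<MemoKey, String> memoTable = new HashMap<>();
        memoTable.put(new MemoKey(v, 5), "match of V at position 5");
        // A second key with the same clause and start position finds the same entry.
        System.out.println(memoTable.get(new MemoKey(v, 5)));
    }
}
```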
5.2. The Grammar class
The Grammar class (Listing 5) has a field allClauses, containing all unique clauses and subclauses in the grammar in bottom-up topological order (Section 2.5), and one primary method, parse, which contains the main parsing loop that matches all relevant clauses for each start position using a priority queue (Section 2.7), memoizing any matches that are found. Surprisingly, this parse method is the entire pika parsing algorithm, at the highest level. However, Clause.match (Section 5.3) and MemoTable.addMatch (Section 5.5) still need to be detailed.
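To show how the pieces fit together, here is a heavily reduced, self-contained sketch of a right-to-left, bottom-up parsing loop in the style described above. It is not the paper's Listing 5: the `MiniPika` class and everything inside it are hypothetical, only the Seq and First operators and a character-set terminal are modeled, and the sketch assumes that no clause can match zero characters (so only the first subclause of a Seq needs a seed parent link). Linking older matches into the parse tree is omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Reduced pika-style parser sketch; assumes no clause can match zero characters. */
class MiniPika {
    static final class Match {
        final int len, subIdx; // characters consumed; matching subclause index (First only)
        Match(int len, int subIdx) { this.len = len; this.subIdx = subIdx; }
    }

    abstract static class Clause {
        int idx; // bottom-up topological sort index
        final List<Clause> subs = new ArrayList<>();
        final List<Clause> seedParents = new ArrayList<>();
        abstract Match match(MiniPika p, int pos);
        void addAsSeedParent() { }
    }

    static final class Char extends Clause { // terminal: matches any one character in the set
        final String chars;
        Char(String chars) { this.chars = chars; }
        Match match(MiniPika p, int pos) {
            return chars.indexOf(p.input.charAt(pos)) >= 0 ? new Match(1, 0) : null;
        }
    }

    static final class Seq extends Clause {
        Match match(MiniPika p, int pos) {
            int cur = pos;
            for (Clause sub : subs) {
                Match m = p.lookUp(sub, cur);
                if (m == null) return null;
                cur += m.len;
            }
            return new Match(cur - pos, 0);
        }
        void addAsSeedParent() { subs.get(0).seedParents.add(this); }
    }

    static final class First extends Clause {
        Match match(MiniPika p, int pos) {
            for (int i = 0; i < subs.size(); i++) {
                Match m = p.lookUp(subs.get(i), pos);
                if (m != null) return new Match(m.len, i);
            }
            return null;
        }
        void addAsSeedParent() { for (Clause sub : subs) sub.seedParents.add(this); }
    }

    final String input;
    final Map<Long, Match> memo = new HashMap<>(); // sparse memo table
    MiniPika(String input) { this.input = input; }

    static long key(Clause c, int pos) { return ((long) c.idx << 32) | pos; }
    Match lookUp(Clause c, int pos) { return memo.get(key(c, pos)); }

    static boolean isBetter(Clause c, Match m, Match old) {
        if (c instanceof First && m.subIdx != old.subIdx) return m.subIdx < old.subIdx;
        return m.len > old.len;
    }

    /** allClauses must be listed in bottom-up topological order. */
    void parse(List<Clause> allClauses) {
        for (int i = 0; i < allClauses.size(); i++) allClauses.get(i).idx = i;
        for (Clause c : allClauses) c.addAsSeedParent();
        PriorityQueue<Clause> queue = new PriorityQueue<>((a, b) -> Integer.compare(a.idx, b.idx));
        for (int pos = input.length() - 1; pos >= 0; pos--) {
            for (Clause c : allClauses) if (c instanceof Char) queue.add(c);
            while (!queue.isEmpty()) {
                Clause c = queue.remove();
                Match m = c.match(this, pos);
                Match old = lookUp(c, pos);
                if (m != null && (old == null || isBetter(c, m, old))) {
                    memo.put(key(c, pos), m);
                    queue.addAll(c.seedParents); // expand upwards in the same column
                }
            }
        }
    }

    /** Parses with the left-recursive grammar E <- (E '+' D) / D; returns the match length of E at 0. */
    static int demo(String input) {
        Char d = new Char("0123456789"), plus = new Char("+");
        Seq seq = new Seq();
        First e = new First();
        seq.subs.addAll(Arrays.asList(e, plus, d));
        e.subs.addAll(Arrays.asList(seq, d));
        MiniPika p = new MiniPika(input);
        p.parse(Arrays.asList(d, plus, seq, e));
        Match m = p.lookUp(e, 0);
        return m == null ? -1 : m.len;
    }

    public static void main(String[] args) {
        System.out.println(demo("1+2+3"));
    }
}
```

In this sketch, the directly left-recursive rule E grows one loop at a time at position 0, because each re-evaluation of the Seq clause finds a longer match of E already in the memo table.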
5.3. The Clause class and its subclasses
The Clause class (Listing 6) represents a PEG operator or clause, and is linked to any subclauses via its subClauses field. This class has a subclass Terminal, which is the superclass of all terminal clause implementations, whereas nonterminals extend Clause directly.
Each subclass of Clause must implement the following three methods:
The determineWhetherCanMatchZeroChars method sets the canMatchZeroChars field to true if the clause can match while consuming zero characters (see Section 6.1). This value may depend upon the canMatchZeroChars field of each of the subclauses, therefore this method must be called in bottom-up topological clause order.
The addAsSeedParentClause method adds a clause to its subclauses’ seedParentClauses list, if the clause and the subclause could be matched at the same starting position.
The match method checks whether all the necessary subclauses of the clause match in the memo table, and if so, returns a new Match object, otherwise returns null.
The Char class (Listing 7), which extends Terminal, which extends Clause, is a simple example of a terminal clause that can match a single specific character in the input. Terminals directly check whether they match the input string, rather than looking in the memo table as with nonterminal clauses, so they ignore the memoTable parameter. (Other methods are not shown for simplicity.)
The Seq class (Listing 8), which extends Clause, implements the Seq PEG operator. In order to reduce the number of zero-length matches added to the memo table, the determineWhetherCanMatchZeroChars method marks the Seq clause as able to match zero characters if all subclauses are able to match zero characters, and the addAsSeedParentClause method adds the Seq clause to the seedParentClauses field in each of its subclauses, up to and including the first subclause that consumes at least one character in any successful match. The match method returns a new Match object if all subclauses of the Seq clause are found to match in consecutive order, otherwise the method returns null. Any returned Match object may form one node in the final parse tree.
The other subclasses of Clause (implementing the other PEG operator types) can be found in the reference parser (Section 9).
5.4. The Match class
The Match class (Listing 9) is used to store matches (i.e. parse tree nodes) and their lengths (i.e. the number of characters consumed by the match) in memo entries, and to link matches to their subclause matches (i.e. to link parse tree nodes to their child nodes). For First clauses, the index of the matching subclause is also stored in the firstMatchingSubClauseIdx field of the Match object, so that a match of an earlier subclause can take priority over any match of later subclauses, as required by the semantics of the First PEG operator.
Two matches can be compared using the isBetterThan method to determine whether a newer match improves upon an earlier match in the same memo entry, as defined in Section 2.8.
5.5. The MemoTable class
The MemoTable class (Listing 10) is a wrapper for the memo table hashmap, stored in the field memoTable.
The lookUpBestMatch method looks up a memo entry by memo key, i.e. by clause and start position, returning any Match object stored in the memo entry corresponding to this memo key.
However, one tweak is necessary for NotFollowedBy subclauses: if there is no memo table entry for a NotFollowedBy subclause, then it is impossible to determine whether the NotFollowedBy subclause’s own subclause matched, or was not previously evaluated for a match (because of the inversion logic of the NotFollowedBy PEG operator). Therefore, when there is no memo table entry for a NotFollowedBy subclause, the subclause must also be evaluated for a match in terms of whether or not its own subclause matches.
There is also one optimization made to lookUpBestMatch with the goal of reducing the number of memoized zero-length matches (Section 6.1): if there is no memoized match for a given memo key, but the clause in the memo key can match zero characters (e.g. (X?)), then a zero-length match must be returned. (This is required since Nothing is excluded from the list of terminals that are matched bottom-up, for efficiency, since this clause matches at all input positions.)
The addMatch method checks whether the provided match parameter is non-null (a null match indicates a mismatch), and if so, looks up the current best match for the memo key. If there is no current best match, or the current best match is not as good a match as the newer match, then the memo entry is updated with the new match. If the memo entry was updated, the seed parent clause(s) of the matching clause are added to the priority queue to be scheduled for matching at the same start position.
As an optimization, again for reducing the number of memoized zero-length matches, even if the memo entry was not updated or if the match was null, and if the parent clause can match zero characters, then the seed parent clause is still scheduled for matching (since the ability of the parent clause to match zero characters means the parent will always match, whether or not the subclause matches).
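The lookup and update logic described above can be sketched as follows. This is a simplified illustration and not the code of Listing 10: the class `MemoTableSketch`, its nested types, and the `newQueue` helper are hypothetical, matches are compared by length only, the NotFollowedBy special case is omitted, and the ordering of the priority queue by a bottom-up clause index is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical, simplified stand-in for the MemoTable class (not the code of Listing 10). */
final class MemoTableSketch {
    static final class Clause {
        final String name;
        final int clauseIdx;             // Assumed: position in bottom-up topological order
        final boolean canMatchZeroChars;
        final List<Clause> seedParentClauses = new ArrayList<>();

        Clause(String name, int clauseIdx, boolean canMatchZeroChars) {
            this.name = name;
            this.clauseIdx = clauseIdx;
            this.canMatchZeroChars = canMatchZeroChars;
        }
    }

    record MemoKey(Clause clause, int startPos) {}

    record Match(MemoKey key, int len) {
        // Simplification: only match length is compared in this sketch
        boolean isBetterThan(Match other) {
            return len > other.len;
        }
    }

    final Map<MemoKey, Match> memoTable = new HashMap<>();

    /** Assumed queue ordering: lowest clauses in the grammar are processed first. */
    static PriorityQueue<Clause> newQueue() {
        return new PriorityQueue<>(Comparator.comparingInt((Clause c) -> c.clauseIdx));
    }

    Match lookUpBestMatch(MemoKey key) {
        Match best = memoTable.get(key);
        if (best != null) {
            return best;
        }
        // No memoized match: a clause that can match zero characters still matches
        return key.clause().canMatchZeroChars ? new Match(key, 0) : null;
    }

    void addMatch(MemoKey key, Match newMatch, PriorityQueue<Clause> queue) {
        boolean updated = false;
        if (newMatch != null) { // A null match indicates a mismatch
            Match oldMatch = memoTable.get(key);
            if (oldMatch == null || newMatch.isBetterThan(oldMatch)) {
                memoTable.put(key, newMatch);
                updated = true;
            }
        }
        for (Clause parent : key.clause().seedParentClauses) {
            // Parents that can match zero characters are scheduled even on a mismatch
            if (updated || parent.canMatchZeroChars) {
                queue.add(parent);
            }
        }
    }
}
```

In this sketch a parent clause may be queued more than once for the same start position; a fuller implementation would presumably deduplicate the queue.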
6.1. Reducing the size of the memo table by avoiding memoizing zero-length matches
Seeding bottom-up matching from all terminals, including Nothing (which matches the empty string in every character position, consuming zero characters) is inefficient, because many entries will be created in the memo table for zero-length matches that will not be linked into the final parse tree.
Nothing is the simplest example of a clause that can match zero characters, but there are numerous other examples of PEG clauses that can match zero characters, for example any instance of Optional, such as (X?); any instance of ZeroOrMore, such as (X*); instances of First where the last subclause can match zero characters, such as (X / Y / Z?) (if any earlier subclause of a First clause could match zero characters, that subclause would always match, so all subsequent subclauses of the First clause would be ignored); and instances of Seq where all subclauses can match zero characters (causing the whole Seq clause to match zero characters), such as (X? Y* Z?). Zero-length matches can cascade up the grammar from these clauses towards the root, adding match entries to the memo table at every input position for multiple levels of the grammar. This can be mitigated by not seeding upwards propagation of the DP wavefront from the Nothing match in every character position; however, clauses that can match zero characters must then be handled specially.
It is safe to assume that no clause will ever have Nothing as its first subclause, since this would be useless for any type of parent clause. If Nothing is disallowed in the first subclause position, then it is unnecessary to trigger upwards expansion of the DP wavefront by seeding the memo table with zero-length matches at every input position during initialization, since earlier non-Nothing subclauses of a Seq clause can fill the role of triggering the parent clauses.
However, when Nothing is never matched as a terminal when seeding upwards expansion of the DP wavefront, other subclause types that can match zero characters must be treated specially, in particular in the case of Seq clauses, which must be triggered as seed parent clauses by all subclauses that may match in the same start position, which means all subclauses up to and including the first subclause that must match one character or more (Listings 8, 10). This mechanism for minimizing the memoization of zero-length matches requires the Clause.canMatchZeroChars field to be set for each clause during initialization, in bottom-up topological order.
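The initialization steps described above can be sketched as follows. This is a minimal illustration rather than the reference implementation: the class `ZeroCharsSketch`, the `Kind` enum, and the `initialize` helper are hypothetical, only a few operator types are modeled, and the clause list passed to `initialize` is assumed to already be in bottom-up topological order (subclauses before their parents, with no cycles).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the zero-length-match bookkeeping done at initialization. */
final class ZeroCharsSketch {
    enum Kind { CHAR, NOTHING, SEQ, FIRST, OPTIONAL, ZERO_OR_MORE, ONE_OR_MORE }

    static final class Clause {
        final Kind kind;
        final List<Clause> subClauses;
        boolean canMatchZeroChars;
        final List<Clause> seedParentClauses = new ArrayList<>();

        Clause(Kind kind, Clause... subClauses) {
            this.kind = kind;
            this.subClauses = Arrays.asList(subClauses);
        }

        /** Requires canMatchZeroChars to be already set on all subclauses. */
        void determineWhetherCanMatchZeroChars() {
            switch (kind) {
                case NOTHING: case OPTIONAL: case ZERO_OR_MORE:
                    canMatchZeroChars = true;
                    break;
                case FIRST:
                    canMatchZeroChars = subClauses.stream().anyMatch(s -> s.canMatchZeroChars);
                    break;
                case SEQ:
                    canMatchZeroChars = subClauses.stream().allMatch(s -> s.canMatchZeroChars);
                    break;
                case ONE_OR_MORE:
                    canMatchZeroChars = subClauses.get(0).canMatchZeroChars;
                    break;
                default: // CHAR always consumes exactly one character
                    canMatchZeroChars = false;
            }
        }

        /** Registers this clause as a seed parent of the subclauses that can trigger it. */
        void addAsSeedParentClause() {
            for (Clause sub : subClauses) {
                if (!sub.seedParentClauses.contains(this)) {
                    sub.seedParentClauses.add(this);
                }
                // For Seq, stop after the first subclause that must consume a character
                if (kind == Kind.SEQ && !sub.canMatchZeroChars) {
                    break;
                }
            }
        }
    }

    /** Assumes the clauses are listed in bottom-up topological order. */
    static void initialize(List<Clause> bottomUpOrder) {
        for (Clause clause : bottomUpOrder) {
            clause.determineWhetherCanMatchZeroChars();
        }
        for (Clause clause : bottomUpOrder) {
            clause.addAsSeedParentClause();
        }
    }
}
```

For a clause such as (X? Y Z), this registers the Seq as a seed parent of both (X?) and Y, but not of Z, because Z can never match at the start position of the Seq.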
With each of these changes in place, parsing is able to complete successfully and correctly while greatly reducing the number of memoized zero-length matches. The full implementation of these optimization steps can be seen in the reference parser (Section 9).
6.2. Reducing the size of the memo table by rewriting OneOrMore into right-recursive form
A naïve implementation of the OneOrMore PEG operator is iterative, matching as many instances of the operator’s single subclause as possible, left-to-right, and assembling all subclause matches at each match position into an array, which is stored in the subClauseMatches field of the resulting Match object. However, when parsing the input from right to left, a rule like (Ident <- [a-z]+) adds $O(N^2)$ match nodes to the memo table, where $N$ is the maximum number of subclause matches of a OneOrMore clause. For example, given the grammar clause ([a-z]+) and the input string "hello", this rule stores the following matches in the memo table: [ [h,e,l,l,o], [e,l,l,o], [l,l,o], [l,o], [o] ].
This can be resolved by rewriting OneOrMore clauses into right-recursive form, which turns the parse tree fragment that matches the OneOrMore clause and its subclauses into a linked list. For example, the OneOrMore rule (X <- Y+) can be rewritten as (X <- Y X?). Correspondingly, a ZeroOrMore rule (X <- Y*) can be rewritten as (X <- (Y X?)?). The effect of this rewriting pattern is to store successively shorter suffix matches as list tails, rather than duplicating each suffix of subclause matches for each match position. Each match of X in the rewritten form consists of either one or two subclause matches: a single match of subclause Y, optionally followed by a nested match of X representing the tail of the list, composed of successive matches of Y. With this modification, the number of subclause matches created and added to the memo table becomes linear (i.e. $O(N)$) in the maximum number of subclause matches of a OneOrMore match: [h, [e, [l, [l, [o] ] ] ] ].
Rewriting OneOrMore clauses into right-recursive form changes the structure of the parse tree into right-associative form. This transformation can be easily and automatically reversed once parsing is complete, by flattening each right-recursive linked list of OneOrMore subclause matches into an array of subclause matches of a single OneOrMore node. This is implemented in the reference parser by adding a Match.getSubClauseMatches method that flattens the right-recursive list for OneOrMore matches, simply returning the subclause matches as an array.
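The right-recursive structure and its flattening can be sketched as follows. This is an illustrative sketch, not the reference parser's Match.getSubClauseMatches method: the class `RightRecursiveSketch` and its helpers are hypothetical, Y is assumed to match any single character, and each match of X is assumed to hold the Y match followed by an optional tail match of X.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of flattening a right-recursive OneOrMore match (X <- Y X?). */
final class RightRecursiveSketch {
    static final class Match {
        final String text;                  // Input text spanned by this match
        final List<Match> subClauseMatches; // For X: [Y] or [Y, tail X]; for Y: empty

        Match(String text, List<Match> subClauseMatches) {
            this.text = text;
            this.subClauseMatches = subClauseMatches;
        }
    }

    /** Builds the linked-list match of X right to left; returns null for empty input. */
    static Match buildRightRecursive(String input) {
        Match tail = null;
        for (int i = input.length() - 1; i >= 0; i--) {
            Match y = new Match(input.substring(i, i + 1), new ArrayList<>());
            List<Match> subs = new ArrayList<>();
            subs.add(y);
            if (tail != null) {
                subs.add(tail); // Reuse the existing suffix match rather than copying it
            }
            tail = new Match(input.substring(i), subs);
        }
        return tail;
    }

    /** Flattens the linked list of X matches into the array of Y matches. */
    static List<Match> flatten(Match x) {
        List<Match> flat = new ArrayList<>();
        Match curr = x;
        while (curr != null) {
            flat.add(curr.subClauseMatches.get(0));
            curr = curr.subClauseMatches.size() > 1 ? curr.subClauseMatches.get(1) : null;
        }
        return flat;
    }
}
```

For the input "hello", `buildRightRecursive` creates one X match per character, each holding at most two subclause matches, and `flatten` recovers the five single-character matches in left-to-right order.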
6.3. Reducing the number of spurious matches using a lex preprocessing pass
When processing input bottom-up, spurious matches may be found that would not be found in a standard recursive descent parse, because a bottom-up parser is unaware of the higher-level context surrounding a given parsing position. For example, a bottom-up parse might spuriously match words inside of a comment or quoted string as identifiers. These matches will not be connected to the final parse tree, so they do not affect the structure of the final parse tree, but they do take up unnecessary space in the memo table, and creating and memoizing these spurious matches also wastes work.
Fortunately, spurious matches rarely cascade far up the grammar hierarchy, due to lacking structure recognizable by higher levels of the grammar; therefore, the impact on memory consumption and parsing time of these spurious matches tends to be limited. Spurious matches will therefore typically incur only a moderately small constant factor of overhead to parsing performance, and these spurious matches do not change the big-Oh time or space complexity of the parser (Section 4).
The overhead of spurious matches can be ameliorated through the use of a lexing or tokenization step, which greedily consumes tokens from left to right until all the input is consumed. Lexing is used as a preprocessing step by many parsers. A lexical token could be an identifier or symbol, or could be a longer match of a structured sequence, such as a complete quoted string or comment. These tokenized matches are then used to seed the memo table by the bottom-up parser, rather than seeding based on matches of all terminals at all input positions.
Note, however, that lexing will only work for simple grammars, and may cause recognition problems, since the limited pattern recognition capabilities of a lexer may change the recognized language. In fact it is impossible in the general case to create a lexer that is aware of all hierarchical context – that is what a parser accomplishes. To allow a lexer to recognize all possible reasonable tokens, all matches of any lex token pattern at any position must be found. If multiple tokens of different length match at the same start position, then the lexer must look for the next token starting from each of the possible token end positions. In the worst case, this can devolve into having to match each lex token at each start position.
For these reasons, lexing is generally not advisable unless the grammar is simple and lexing can be executed unambiguously.
Nevertheless, a “lightweight lex” preprocessing pass may still be advisable for performance reasons, for example to remove comments and to mark the input position ranges for all quoted strings before parsing, since these two cases in particular cause a performance impact for pika parsing.
7. Automatic conversion of parse tree into an Abstract Syntax Tree (AST)
The structure of the parse tree resulting from a parse is directly induced by the structure of the grammar, i.e. there is one node in the parse tree for each clause and subclause of the grammar that matched the input. Many of these nodes can be ignored (e.g. nodes that match semantically-irrelevant whitespace or comments in the input), and could be suppressed in the output tree. Mapping the parse tree into a simpler structure that contains only the nodes of interest results in the Abstract Syntax Tree (AST).
The reference parser supports the syntax (ASTNodeLabel:Clause) for labeling any subclause in the grammar. After parsing is complete, all matches in the parse tree that do not have an AST node label are simply elided from the parse tree to create the AST (Fig. 8).
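The elision of unlabeled parse tree nodes can be sketched as follows. This is a minimal sketch of the idea, not the reference parser's implementation: the classes `AstSketch`, `Match`, and `ASTNode` and the `toAST` method are hypothetical, and an unlabeled match is represented here by a null label.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of deriving an AST by eliding unlabeled parse tree nodes. */
final class AstSketch {
    static final class Match {
        final String astNodeLabel; // Null if the clause has no AST node label
        final String text;
        final List<Match> subClauseMatches;

        Match(String astNodeLabel, String text, Match... subClauseMatches) {
            this.astNodeLabel = astNodeLabel;
            this.text = text;
            this.subClauseMatches = Arrays.asList(subClauseMatches);
        }
    }

    static final class ASTNode {
        final String label;
        final String text;
        final List<ASTNode> children = new ArrayList<>();

        ASTNode(String label, String text) {
            this.label = label;
            this.text = text;
        }
    }

    /** Returns the top-level AST nodes found at or below the given parse tree node. */
    static List<ASTNode> toAST(Match root) {
        List<ASTNode> topLevel = new ArrayList<>();
        collect(root, topLevel);
        return topLevel;
    }

    private static void collect(Match match, List<ASTNode> parentChildren) {
        List<ASTNode> target = parentChildren;
        if (match.astNodeLabel != null) {
            // Labeled match: becomes an AST node, and adopts labeled descendants
            ASTNode node = new ASTNode(match.astNodeLabel, match.text);
            parentChildren.add(node);
            target = node.children;
        }
        // Unlabeled match: skipped, but its labeled descendants are hoisted upwards
        for (Match sub : match.subClauseMatches) {
            collect(sub, target);
        }
    }
}
```

Labeled descendants of an elided node are attached to the nearest labeled ancestor, so the nesting of the remaining nodes is preserved.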
8. Future work
The pika parser algorithm could be extended to enable incremental parsing (Dubroy and Warth, 2017), in which when a document is parsed and then subsequently edited, only the minimum necessary subset of parsing work is performed, and as much work as possible is reused from the prior parsing operation.
To enable incremental parsing in a pika parser when a span of the input string has changed, it is necessary to remove all matches from the memo table that span the changed region of the input. Bottom-up parsing is then seeded from terminal matches within the changed region, and the main parsing loop is run again in the usual bottom-up, right to left order.
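The invalidation step might be sketched as follows. This is a speculative illustration of the proposal only, since incremental parsing is described here as future work: the class `IncrementalSketch` and the `invalidate` method are hypothetical, the memo table is reduced to a map from memo key to match length, and the changed region is assumed to be the half-open range [changeStart, changeEnd). Matches whose validity depends on lookahead into the changed region without spanning it are not handled.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of invalidating memo entries that overlap an edited region. */
final class IncrementalSketch {
    /** Memo key: a clause (identified here by name, for brevity) plus a start position. */
    record MemoKey(String clauseName, int startPos) {}

    /**
     * Removes every memoized match whose span [startPos, startPos + len) overlaps the
     * changed region [changeStart, changeEnd), and returns the number of entries removed.
     * The map values are match lengths, standing in for full Match objects.
     */
    static int invalidate(Map<MemoKey, Integer> memoTable, int changeStart, int changeEnd) {
        int sizeBefore = memoTable.size();
        memoTable.entrySet().removeIf(entry -> {
            int start = entry.getKey().startPos();
            int end = start + entry.getValue();
            return start < changeEnd && end > changeStart;
        });
        return sizeBefore - memoTable.size();
    }
}
```

Entries that lie entirely to the left or right of the changed region are retained, which is the source of the work saved relative to a full reparse.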
Some care will also need to be taken to ensure that the memo table is indexed not based on start position of a match (which is an absolute position, therefore affected by insertions and deletions), but rather based on cumulative length of matches. Otherwise, insertions or deletions in the middle of the input string will require the start positions of all parse tree nodes to the right of the change to be updated with the new start position, and memo table entries will all have to be moved too. Dealing with this efficiently requires changing MemoKey so that it no longer includes an absolute startPos field, and instead memo entries would be indexed by column, from a linked list of columns corresponding to each input character.
Because the memo table tends to accumulate spurious matches that are not linked into the final parse tree, to keep memory usage low in an incremental pika parser that is built into an editor or IDE, the memo table should be periodically garbage collected by running a complete (non-incremental) parse from scratch.
9. Reference implementation
An MIT-licensed reference implementation of the pika parser is available at: http://github.com/lukehutch/pikaparser
The reference parser contains several optimizations not shown here, for example all clauses are interned, so that rules that share a subclause do not result in duplicate rows in the memo table. Additionally, rule references are replaced with direct references to the rule, to save on lookup time while traversing the grammar during parsing. The optimizations for reducing the number of memoized zero-length matches described in Section 6.1 are also implemented in the reference parser. These optimizations make preprocessing the grammar more complicated, but result in significant performance gains.
The reference implementation includes a meta-grammar or runtime parser generator that is able to parse a PEG grammar written in ASCII notation, making it easy to write new grammars.
10. Conclusion
The pika parser is a new type of PEG parser that employs dynamic programming to parse a document in reverse (from the end to the beginning of the input), and bottom-up (from individual characters up to the root of the parse tree). This parsing order supports parsing of directly left-recursive grammars, making grammar writing simpler, and also enables almost perfect error recovery after syntax errors, making pika parsers useful for implementing IDEs and compilers. Pika parsers take time linearly proportional to the length of the input, and are very efficient for smaller grammars or long inputs. For large grammars, bottom-up parsing can incur a significant constant overhead per character of input, which may make pika parsers inappropriate for some performance-sensitive uses. Mechanisms for implementing precedence climbing and left or right associativity were demonstrated for PEG grammars, and several new insights were provided into precedence, associativity, and left recursion. A reference implementation of the pika parser is available under an MIT license.
References
- Deterministic parsing of ambiguous grammars. Communications of the ACM 18 (8), pp. 441–452. Cited by: §1.2.
- The theory of parsing, translation, and compiling. Vol. 1, Prentice-Hall Englewood Cliffs, NJ. Cited by: §1.1.
- Compilers: principles, techniques, and tools. 2nd edition, Addison Wesley. Cited by: §1.1, §1.2.
- Principles of compiler design. Addison-Wesley. Cited by: §1.2.
- Three models for the description of language. IRE Transactions on information theory 2 (3), pp. 113–124. Cited by: §1.1.
- On certain formal properties of grammars. Information and Control 2 (2), pp. 137–167. Cited by: §1.4.
- Programming languages and their compilers. Courant Institute of Mathematical Sciences. Cited by: §1.4.
- Top down operator precedence. Beautiful Code: Leading Programmers Explain How They Think, pp. 129–145. Cited by: §1.3.
- Younger. Recognition and parsing of context-free languages in time $n^3$. Information and Control 10 (2), pp. 189–208. Cited by: §1.4.
- Automatic syntax error reporting and recovery in parsing expression grammars. Science of Computer Programming 187, pp. 102373. Cited by: item 3.
- Syntax error recovery in parsing expression grammars. In Proceedings of the 33rd Annual ACM Symposium on Applied Computing, pp. 1195–1202. Cited by: item 3.
- Practical translators for LR(k) languages. Ph.D. Thesis, Massachusetts Institute of Technology. Cited by: §1.2.
- Incremental packrat parsing. In Proceedings of the 10th ACM SIGPLAN International Conference on Software Language Engineering, pp. 14–25. Cited by: §8.
- An efficient context-free parsing algorithm. Communications of the ACM 13 (2), pp. 94–102. Cited by: §1.4.
- Syntactic analysis and operator precedence. Journal of the ACM (JACM) 10 (3), pp. 316–333. Cited by: §1.2.
- Packrat parsing: a practical linear-time algorithm with backtracking. Ph.D. Thesis, Massachusetts Institute of Technology. Cited by: item 2, §1.6.
- Parsing expression grammars: a recognition-based syntactic foundation. In Proceedings of the 31st ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pp. 111–122. Cited by: §1.6.
- A new top-down parsing algorithm to accommodate ambiguity and left recursion in polynomial time. ACM SIGPLAN Notices 41 (5), pp. 46–54. Cited by: item 2.
- Modular and efficient top-down parsing for ambiguous left-recursive grammars. In Proceedings of the 10th International Conference on Parsing Technologies, IWPT ’07, USA, pp. 109–120. Cited by: item 2.
- Precedence parsers for programming languages. Ph.D. Thesis, University of California Berkeley, Calif.. Cited by: §1.2.
- An efficient recognition and syntax-analysis algorithm for context-free languages. Coordinated Science Laboratory Report no. R-257. Cited by: §1.4.
- Algorithm schemata and data structures in syntactic processing. Technical Report. Cited by: §1.4.
- Representation of events in nerve nets and finite automata. RAND Research Memorandum RM-704. Cited by: §1.1.
- History of writing compilers. In Proceedings of the 1962 ACM National Conference on Digest of Technical Papers, pp. 43. Cited by: §1.2.
- On the translation of languages from left to right. Information and Control 8 (6), pp. 607–639. Cited by: §1.2.
- An overview of sequence comparison: time warps, string edits, and macromolecules. SIAM review 25 (2), pp. 201–237. Cited by: §1.7.
- Deterministic techniques for efficient non-deterministic parsers. In International Colloquium on Automata, Languages, and Programming, pp. 255–269. Cited by: §1.1, §1.2.
- Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, Vol. 10, pp. 707–710. Cited by: §1.7.
- Regular expressions and state graphs for automata. IRE transactions on Electronic Computers (1), pp. 39–47. Cited by: §1.1.
- Left recursion in parsing expression grammars. Science of Computer Programming 96, pp. 177–190. Cited by: item 2.
- Parboiled2: a macro-based PEG parser generator for Scala 2.12+. Cited by: §4.1.
- Techniques for automatic memoization with applications to context-free parsing. Computational Linguistics 17 (1), pp. 91–98. Cited by: item 1.
- The definitive ANTLR 4 reference. Pragmatic Bookshelf. Cited by: §4.1.
- Parsing with pictures. UTCS tech report TR-2012. Cited by: §1.4.
- Top down operator precedence. In Proceedings of the 1st Annual ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, pp. 41–51. Cited by: §1.3.
- Validity test for Floyd’s operator-precedence parsing algorithms. In International Symposium on Mathematical Foundations of Computer Science, pp. 415–424. Cited by: §1.2.
- Sequential formula translation. Communications of the ACM 3 (2), pp. 76–83. Cited by: §1.2.
- Introduction to the theory of computation. Cengage Learning. Cited by: §1.1.
- Very fast LR parsing. ACM SIGPLAN Notices 21 (7), pp. 145–151. Cited by: §1.2.
- Efficient parsing for natural language: a fast algorithm for practical systems. Vol. 8, Springer Science & Business Media. Cited by: §1.1.
- Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE transactions on Information Theory 13 (2), pp. 260–269. Cited by: §1.4.
- A bottom-up adaptation of Earley’s parsing algorithm. In International Workshop on Programming Language Implementation and Logic Programming, pp. 146–160. Cited by: §1.4.
- Packrat parsers can support left recursion. In Proceedings of the 2008 ACM SIGPLAN symposium on Partial evaluation and semantics-based program manipulation, pp. 103–110. Cited by: item 2.