The problem of grammar recognition is a decision problem of determining whether a string belongs to the language induced by a grammar. For context-free grammars (CFGs), recognition can be done using parsing algorithms such as the CKY algorithm (Kasami, 1965; Younger, 1967; Cocke and Schwartz, 1970) or Earley's algorithm (Earley, 1970). The asymptotic complexity of these chart parsing algorithms is cubic in the length of the sentence.
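The cubic behavior of chart parsing can be seen in a minimal CKY recognizer. The sketch below is illustrative only (the grammar encoding as sets of tuples is an assumption, and the grammar is required to be in Chomsky normal form):

```python
# Minimal CKY recognizer for a CFG in Chomsky normal form.
# Grammar encoding (an illustrative assumption):
#   binary:  set of (A, B, C) for rules A -> B C
#   lexical: set of (A, a) for rules A -> a
def cky_recognize(tokens, binary, lexical, start="S"):
    n = len(tokens)
    # chart[i][j] = set of nonterminals spanning tokens[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, tok in enumerate(tokens):
        chart[i][i + 1] = {A for (A, a) in lexical if a == tok}
    for width in range(2, n + 1):          # O(n) span widths
        for i in range(n - width + 1):     # O(n) left endpoints
            j = i + width
            for k in range(i + 1, j):      # O(n) split points -> O(n^3) overall
                for (A, B, C) in binary:
                    if B in chart[i][k] and C in chart[k][j]:
                        chart[i][j].add(A)
    return start in chart[0][n]
```

The three nested loops over span widths, left endpoints, and split points give the cubic dependence on sentence length.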
In a major breakthrough, Valiant (1975) showed that context-free grammar recognition is no more complex than Boolean matrix multiplication for matrices of size $N \times N$, where $N$ is linear in the length of the sentence, $n$. With current state-of-the-art results in matrix multiplication, this means that CFG recognition can be done with an asymptotic complexity of $O(n^{2.38})$.
In this paper, we show that the problem of linear context-free rewriting system (LCFRS) recognition can also be reduced to Boolean matrix multiplication. Current chart parsing algorithms for binary LCFRS have an asymptotic complexity of $O(n^{3f})$, where $f$ is the maximal fan-out of the grammar. (Without placing a bound on $f$, the problem of recognition of LCFRS languages is NP-hard; Satta, 1992.) Our algorithm takes time $O(n^{\omega d})$, for a constant $d$ which is a function of the grammar (and not the input string), and where the complexity of $n \times n$ Boolean matrix multiplication is $O(n^{\omega})$. The parameter $d$ can be as small as $f$, meaning that we reduce parsing complexity from $O(n^{3f})$ to $O(n^{\omega f})$, and that, in general, the savings in the exponent is larger for more complex grammars.
LCFRS is a broad family of grammars. As such, we are able to support the findings of Rajasekaran and Yooseph (1998), who showed that tree-adjoining grammar recognition can be done in time $O(n^{2\omega})$ (TAG can be reduced to LCFRS with $d = 2$). As a result, combinatory categorial grammars, head grammars, and linear indexed grammars can be recognized in time $O(n^{2\omega})$. In addition, we show that inversion transduction grammars (Wu, 1997; ITGs) can be parsed in time $O(n^{2\omega + 1})$, improving the best asymptotic complexity previously known for ITGs.
Matrix Multiplication State of the Art
Our algorithm reduces the problem of LCFRS parsing to Boolean matrix multiplication. Let $M(n)$ be the complexity of multiplying two such $n \times n$ matrices. These matrices can be naïvely multiplied in $O(n^3)$ time by computing for each output cell the dot product between the corresponding row and column in the input matrices (each such dot product is an $O(n)$ operation). Strassen (1969) discovered a way to do the same multiplication in $O(n^{2.8074})$ time; his algorithm is a divide-and-conquer algorithm that eventually uses only seven operations (instead of eight) to multiply $2 \times 2$ matrices.
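For concreteness, the naïve cubic-time procedure can be sketched as follows (matrices represented as lists of lists of booleans; an illustration, not part of the algorithm presented in this paper):

```python
def bool_matmul(A, B):
    """Naive Boolean product of two n x n matrices: O(n^3) time.

    Output cell (i, j) is the Boolean dot product (OR of ANDs) of
    row i of A with column j of B.
    """
    n = len(A)
    C = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Each dot product is an O(n) operation; n^2 cells overall.
            C[i][j] = any(A[i][k] and B[k][j] for k in range(n))
    return C
```

Sub-cubic algorithms such as Strassen's avoid computing these $n^2$ dot products independently.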
With this discovery, there have been many attempts to further reduce the complexity of matrix multiplication, relying on principles similar to Strassen's method: a reduction in the number of operations it takes to multiply sub-matrices of the original matrices to be multiplied. Coppersmith and Winograd (1987) discovered an algorithm with an asymptotic complexity of $O(n^{2.376})$. Others have slightly improved their algorithm, and currently the best known algorithm for matrix multiplication runs in time $O(n^{\omega})$ with $\omega < 2.373$ (Le Gall, 2014). It is known that $M(n) = \Omega(n^2 \log n)$ (Raz, 2002).
While the asymptotically best matrix multiplication algorithms have large constant factors lurking in the $O$-notation, Strassen's algorithm does not, and it is widely used in practice. Benedí and Sánchez (2007) show a speed improvement when parsing natural language sentences using Strassen's algorithm as the matrix multiplication subroutine in Valiant's algorithm for CFG parsing. This indicates that similar speed-ups may be possible in practice using our algorithm for LCFRS parsing.
Our main result is a matrix multiplication algorithm for unbalanced, single-initial binary LCFRS with asymptotic complexity $O(n^{\omega d})$, where $d$ is the maximal number of combination points over all grammar rules. The constant $d$ can be easily determined from the grammar at hand:
$$d = \max_{A \to B\,C \,\in\, R} \left( \varphi(B) + \varphi(C) - \varphi(A) \right),$$
where $A \to B\,C$ ranges over rules in the grammar and $\varphi(X)$ is the fan-out of nonterminal $X$. Single-initial grammars are defined in §2, and include common formalisms such as tree-adjoining grammars. Any LCFRS can be converted to single-initial form by increasing its fan-out by at most one. The notion of unbalanced grammars is introduced in §4.4; it is a condition on the set of LCFRS grammar rules that is satisfied by many practical grammars. In cases where the grammar is balanced, our algorithm can be used as a subroutine so that it parses the binary LCFRS in time $O(n^{\omega d + 1})$. A similar procedure was applied by Nakanishi et al. (1998) for multiple component context-free grammars. See more discussion of this in §7.5.
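The number of combination points of a binary rule can be computed directly from fan-outs: when the spans of $B$ and $C$ are assembled into the spans of $A$, the two righthand side nonterminals contribute $2\varphi(B) + 2\varphi(C)$ span endpoints, $A$ keeps $2\varphi(A)$ of them, and each point where a span of $B$ meets a span of $C$ merges two endpoints into one. A sketch of this computation (the rule and fan-out encodings are illustrative assumptions):

```python
def combination_points(phi, rule):
    """Number of points where spans of B and C meet when forming A.

    phi:  dict mapping nonterminal -> fan-out
    rule: (A, B, C), the skeleton of a binary rule A -> B C
    Each combination point merges two of the 2*phi[B] + 2*phi[C]
    endpoints, leaving 2*phi[A] endpoints for A.
    """
    A, B, C = rule
    return phi[B] + phi[C] - phi[A]

def max_combination_points(phi, rules):
    """The grammar-level constant: the maximum over all binary rules."""
    return max(combination_points(phi, r) for r in rules)
```

For a CFG rule (all fan-outs 1) this gives one combination point; for a TAG-style rule (all fan-outs 2) it gives two, matching the values of $d$ quoted above.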
Our results focus on the asymptotic complexity as a function of string length; we do not give explicit grammar constants. For other work that focuses on reducing the grammar constant in parsing, see, for example, Eisner and Satta (1999), Dunlop et al. (2010), and Cohen et al. (2013). For a discussion of the optimality of the grammar constants in Valiant's algorithm, see, for example, Abboud et al. (2015).
2 Background and Notation
This section provides background on LCFRS, and establishes notation used in the remainder of the paper. A reference table of notation is also provided in Appendix A.
For an integer $n$, let $[n]$ denote the set of integers $\{1, \ldots, n\}$. Let $[n]_0 = [n] \cup \{0\}$. For a set $S$, we denote by $S^+$ the set of all sequences of length 1 or more of elements from $S$.
A span is a pair of integers $(i, j)$ denoting the left and right endpoints of a substring in a larger string. The endpoints are placed in the "spaces" between the symbols in the string. For example, the span $(0, 3)$ spans the first three symbols in the string. For a string of length $n$, the set of potential endpoints is $[n]_0$.
We turn now to give a succinct definition of binary LCFRS. For more details about LCFRS and their relationship to other grammar formalisms, see Kallmeyer (2010). A binary LCFRS is a tuple $(\mathcal{N}, \mathcal{T}, \varphi, R, S)$ such that:
$\mathcal{N}$ is the set of nonterminal symbols in the grammar.
$\mathcal{T}$ is the set of terminal symbols in the grammar. We assume $\mathcal{N} \cap \mathcal{T} = \emptyset$.
$\varphi \colon \mathcal{N} \to \mathbb{N}$ is a function specifying a fixed fan-out $\varphi(A)$ for each nonterminal $A$.
$R$ is a set of productions. Each production has the form $A \to g[B, C]$, where $A, B, C \in \mathcal{N}$ and $g$ is a composition function $g \colon (\mathcal{T}^*)^{\varphi(B)} \times (\mathcal{T}^*)^{\varphi(C)} \to (\mathcal{T}^*)^{\varphi(A)}$, which specifies how to assemble the spans of the righthand side nonterminals into the spans of the lefthand side nonterminal. We use square brackets as part of the syntax for writing productions, and parentheses to denote the application of the function $g$. The function $g$ must be linear and non-erasing, which means that if $g$ is applied on a pair of tuples of strings, then each input string appears exactly once in the output, possibly as a substring of one of the strings in the output tuple. Rules may also take the form $A \to g[\,]$, where $g$ returns a constant tuple of one string from $\mathcal{T}^+$.
$S \in \mathcal{N}$ is a start symbol. Without loss of generality, we assume $\varphi(S) = 1$.
The language of an LCFRS $G = (\mathcal{N}, \mathcal{T}, \varphi, R, S)$ is defined as follows:
We define first the set $\mathrm{yield}(A)$ for every $A \in \mathcal{N}$:
For every rule $A \to g[\,] \in R$, $g() \in \mathrm{yield}(A)$.
For every rule $A \to g[B, C] \in R$ and all tuples $\beta \in \mathrm{yield}(B)$ and $\gamma \in \mathrm{yield}(C)$, $g(\beta, \gamma) \in \mathrm{yield}(A)$.
Nothing else is in $\mathrm{yield}(A)$.
The string language of $G$ is $L(G) = \{\, w \mid (w) \in \mathrm{yield}(S) \,\}$.
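The bottom-up definition of the yield sets can be read as a fixed-point computation. The following sketch implements it for tiny grammars; the encoding of composition functions as Python callables over tuples of strings, and the length bound that makes iteration terminate, are illustrative assumptions rather than part of the formal definition:

```python
def compute_yields(nullary, binary, max_len=5):
    """Fixed-point computation of yield(A) for a tiny binary LCFRS.

    nullary: list of (A, g) where g() returns a tuple of strings
    binary:  list of (A, B, C, g) where g(beta, gamma) returns a tuple
    max_len: bound on total yield length so that iteration terminates
    """
    yields = {}
    for A, g in nullary:
        yields.setdefault(A, set()).add(g())
    changed = True
    while changed:
        changed = False
        for A, B, C, g in binary:
            for beta in list(yields.get(B, set())):
                for gamma in list(yields.get(C, set())):
                    t = g(beta, gamma)
                    if sum(map(len, t)) <= max_len and t not in yields.setdefault(A, set()):
                        yields[A].add(t)
                        changed = True
    return yields
```

For a CFG-like grammar with rules $S \to g_1[A, B]$, $A \to g_2[\,]$, $B \to g_3[\,]$, the computation places the 1-tuple containing the concatenated string in $\mathrm{yield}(S)$.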
Intuitively, the process of generating a string from an LCFRS grammar consists of first choosing, top-down, a production to expand each nonterminal, and then, bottom-up, applying the composition functions associated with each production to build the string. As an example, the following context-free grammar:
$$S \to A\ B, \qquad A \to a, \qquad B \to b$$
corresponds to the following (binary) LCFRS:
$$S \to g_1[A, B], \quad g_1(\langle x_1 \rangle, \langle y_1 \rangle) = \langle x_1 y_1 \rangle$$
$$A \to g_2[\,], \quad g_2() = \langle a \rangle \qquad B \to g_3[\,], \quad g_3() = \langle b \rangle$$
The only derivation possible under this grammar consists of the function application $g_1(g_2(), g_3()) = \langle ab \rangle$.
The following notation will be used to precisely represent the linear non-erasing composition functions used in a specific grammar. For each production rule $A \to g[B, C]$ that operates on nonterminals $A$, $B$, and $C$, we define variables $x_1, \ldots, x_{\varphi(B)}$ for the spans of $B$ and variables $y_1, \ldots, y_{\varphi(C)}$ for the spans of $C$, each taking values from $\mathcal{T}^*$. We write an LCFRS function as:
$$g(\langle x_1, \ldots, x_{\varphi(B)} \rangle, \langle y_1, \ldots, y_{\varphi(C)} \rangle) = \langle \alpha_1, \ldots, \alpha_{\varphi(A)} \rangle$$
where each $\alpha_i$ specifies the parameter strings that are combined to form the $i$th string of the function's result tuple. For example, for the first rule of the LCFRS above, $\varphi(S) = 1$ and $\alpha_1 = x_1 y_1$.
We adopt the following notational shorthand for LCFRS rules in the remainder of the paper. We write the rule $A \to g[B, C]$ as:
$$A[\alpha_1, \ldots, \alpha_{\varphi(A)}] \to B[x_1, \ldots, x_{\varphi(B)}]\ C[y_1, \ldots, y_{\varphi(C)}]$$
where $(\alpha_1, \ldots, \alpha_{\varphi(A)})$ consists of a tuple of strings from the alphabet $\{x_1, \ldots, x_{\varphi(B)}, y_1, \ldots, y_{\varphi(C)}\}$. In this notation, the argument of $B$ is always the tuple $(x_1, \ldots, x_{\varphi(B)})$, and the argument of $C$ is always $(y_1, \ldots, y_{\varphi(C)})$. We include these arguments in the rule notation merely to remind the reader of the meaning of the symbols in $\alpha_1, \ldots, \alpha_{\varphi(A)}$.
For example, with context-free grammars, rules have the form:
$$A[x_1 y_1] \to B[x_1]\ C[y_1]$$
indicating that $B$ and $C$ each have one span, and that these spans are concatenated in order to form the single span of $A$.
A binary tree-adjoining grammar can also be represented as a binary LCFRS (Vijay-Shanker and Weir, 1994). Figure 1 demonstrates how the adjunction operation is done with binary LCFRS. Each gray block denotes a span; the adjunction operator takes the first span of nonterminal $B$ and concatenates it to the first span of nonterminal $C$ (to get the first span of $A$), and then takes the second span of $C$ and concatenates it with the second span of $B$ (to get the second span of $A$). For tree-adjoining grammars, rules therefore have the form:
$$A[x_1 y_1,\ y_2 x_2] \to B[x_1, x_2]\ C[y_1, y_2]$$
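Representing span contents as Python tuples of strings, the wrapping behavior of such a rule can be sketched as a composition function (an illustrative encoding, not the paper's formalism):

```python
def tag_wrap(b, c):
    """Composition for a TAG-style rule A[x1 y1, y2 x2] -> B[x1, x2] C[y1, y2].

    b = (x1, x2) are the two spans of B; c = (y1, y2) are the two spans
    of C. The first span of A is x1 y1 and the second is y2 x2, so the
    material of B wraps around the material of C.
    """
    x1, x2 = b
    y1, y2 = c
    return (x1 + y1, y2 + x2)
```

Each input string appears exactly once in the output, so the function is linear and non-erasing as required.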
The fan-out of a nonterminal is the number of spans in the input sentence that it covers. The fan-out of CFG rules is one, and the fan-out of TAG rules is two. The fan-out of the grammar, $f$, is the maximum fan-out of its nonterminals:
$$f = \max_{A \in \mathcal{N}} \varphi(A).$$
We sometimes refer to the skeleton of a grammar rule $A[\alpha] \to B[\beta]\ C[\gamma]$, which is just the context-free rule $A \to B\ C$, omitting the variables. In that context, a logical statement such as $A \to B\ C \in R$ is true if there is any rule with this skeleton, for some $\alpha$, $\beta$, and $\gamma$.
For our parsing algorithm, we assume that the grammar is in a normal form such that the variables $x_1, \ldots, x_{\varphi(B)}$ appear in order in $\alpha_1 \cdots \alpha_{\varphi(A)}$, that is, that the spans of $B$ are not re-ordered by the rule, and similarly we assume that $y_1, \ldots, y_{\varphi(C)}$ appear in order. If this is not the case in some rule, the grammar can be transformed by introducing a new nonterminal for each permutation of a nonterminal that can be produced by the grammar. We further assume that $\alpha_1$ begins with $x_1$, that is, that the first span of $A$ begins with material produced by $B$ rather than by $C$. If this is not the case for some rule, $B$ and $C$ can be exchanged to satisfy this condition.
We refer to an LCFRS rule as single-initial if the leftmost endpoint of $C$ is internal to a span of $A$, and dual-initial if the leftmost endpoint of $C$ is the beginning of a span of $A$. Our algorithm will require the input LCFRS to be in single-initial form, meaning that all rules are single-initial. We note that grammars for common formalisms, including TAG and synchronous context-free grammar (SCFG), are in this form. If a grammar is not in single-initial form, dual-initial rules can be converted to single-initial form by adding an empty span to $B$ which combines with the first span of $C$ immediately to its left, as shown in Figure 2. Specifically, for each dual-initial rule $A \to B\ C$, if the first span of $C$ appears between spans $i$ and $i+1$ of $B$, create a new nonterminal $B'$ with $\varphi(B') = \varphi(B) + 1$, and add a rule $B' \to B$ which produces $B$ along with a span of length zero between spans $i$ and $i+1$ of $B$. We then replace the rule $A \to B\ C$ with $A \to B'\ C$, where the new span of $B'$ combines with $C$ immediately to the left of $C$'s first span. Because the new nonterminal $B'$ has fan-out one greater than $B$, this grammar transformation can increase a grammar's fan-out by at most one.
By limiting ourselves to binary LCFRS grammars, we do not necessarily restrict the power of our results. Any LCFRS with arbitrary rank (i.e., with an arbitrary number of nonterminals on the right-hand side) can be converted to a binary LCFRS (with a potentially larger fan-out). See the discussion in §7.6.
Consider the phenomenon of cross-serial dependencies that exists in certain languages. It has been used in the past (Shieber, 1985) to argue that Swiss-German is not context-free. One can show that there is a homomorphism between Swiss-German and the alphabet $\{a, b, c, d\}$ such that the image of the homomorphism, intersected with the regular language $a^* b^* c^* d^*$, gives the language $L = \{ a^n b^m c^n d^m \mid n, m \geq 1 \}$. Since $L$ is not context-free, this implies that Swiss-German is not context-free, because context-free languages are closed under intersection with regular languages.
Tree-adjoining grammars, on the other hand, are mildly context-sensitive formalisms that can handle such cross-serial dependencies in languages (where the $a$s are aligned with the $c$s and the $b$s are aligned with the $d$s). For example, a tree-adjoining grammar for generating $\{ a^n b^m c^n d^m \mid n, m \geq 1 \}$ includes an initial tree together with auxiliary trees that introduce matching $a$–$c$ and $b$–$d$ pairs, with adjunction disallowed at marked nodes. This TAG corresponds to an LCFRS constructed as follows.
Here we have one unary LCFRS rule for the initial tree, one unary rule for each adjunction tree, and one null-ary rule for each nonterminal, producing a tuple of empty strings, in order to represent TAG tree nodes at which no adjunction occurs. The LCFRS given above does not satisfy our normal form, which requires each rule to have either two nonterminals on the righthand side with no terminals in the composition function, or zero nonterminals with a composition function returning fixed strings of terminals. However, it can be converted to such a form through a process analogous to converting a CFG to Chomsky normal form. For adjunction trees, the two strings returned by the composition function correspond to the material to the left and right of the foot node. The composition function merges terminals at the leaves of the adjunction tree with material produced by internal nodes of the tree at which adjunction may occur.
In general, binary LCFRS are more expressive than TAGs because they can have nonterminals with fan-out greater than two, and because they can interleave the arguments of the composition function in any order.
3 A Sketch of the Algorithm
Our algorithm for LCFRS string recognition is inspired by the algorithm of Valiant (1975). It introduces a few important novelties that make it possible to use matrix multiplication for the goal of LCFRS recognition.
The algorithm relies on the observation that it is possible to construct a matrix $M$ with specific non-associative multiplication and addition operators such that multiplying $M$ by itself $k$ times on the left or on the right yields $k$-step derivations for a given string. The row and column indices of the matrix together assemble a set of spans in the string (the fan-out of the grammar determines the number of spans). Each cell in the matrix keeps track of the nonterminals that can dominate these spans. Therefore, computing the transitive closure of this matrix yields in each matrix cell the set of nonterminals that can dominate the assembled indices' spans for the specific string at hand.
There are several key differences between Valiant's algorithm and ours. Valiant's algorithm has a rather simple indexing scheme for the matrix: the rows correspond to the left endpoints of spans and the columns to their right endpoints. Our indexing scheme can mix left endpoints and right endpoints in both the rows and the columns. This is necessary because with LCFRS, the spans on the right-hand side of a rule can combine in various ways into a new set of spans for the left-hand side.
In addition, our indexing scheme is “over-complete.” This means that different cells in the matrix (or its matrix powers) are equivalent and should consist of the same nonterminals. The reason we need such an over-complete scheme is again because of the possible ways spans of a right-hand side can combine in an LCFRS. To address this over-completeness, we introduce into the multiplication operator a “copy operation” that copies nonterminals between cells in order to maintain the same set of nonterminals in equivalent cells.
To give a preliminary example, consider the tree-adjoining grammar rule shown in Figure 1. We consider an application of the rule with the endpoints of each span instantiated as shown in Figure 3. With our algorithm, this operation translates into the following sequence of matrix transformations. We start with two matrices, one containing an entry for $B$ and one containing an entry for $C$.
In the first matrix, for example, the fact that $B$ appears with a given pair of addresses (one for the row and one for the column) denotes that $B$ spans the corresponding pair of constituents in the string (this is assumed to be true; in practice, it is the result of a previous step of matrix multiplication). Similarly, the second matrix records the two spans of $C$.
Note that there are two positions in the string where $B$ and $C$ meet, and that because $B$ and $C$ share these two endpoints, they can combine to form $A$. In the matrix representation, the shared endpoints appear in the column address of $B$ and in the row address of $C$, meaning that $B$ and $C$ appear in cells that are combined during matrix multiplication. Multiplying the first matrix by the second yields the following:
Now $A$ appears in the cell that corresponds to its two spans. This is the result of merging the left span of $B$ with the left span of $C$ into the first span of $A$, and of merging the right span of $B$ with the right span of $C$ into the second span of $A$. Finally, an additional copying operation leads to the following matrix:
Here, we copy the nonterminal $A$ from one cell into a second cell whose row and column addresses are equivalent. Both of these addresses correspond to the same two spans of $A$. Note that matrix row and column addresses can mix both starting points of spans and ending points of spans.
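The multiplications in this example are instances of matrix multiplication over set-valued cells, where scalar multiplication is replaced by a grammar-driven cell combination and scalar addition by set union. A generic sketch of this pattern follows; the `combine` operator shown in the test simply looks up a binary rule, and stands in for the full operator defined in §4.2:

```python
def set_matmul(X, Y, combine):
    """Multiply two n x n matrices whose cells are sets.

    combine(bx, cy) returns a set of results of combining element bx
    from a cell of X with element cy from a cell of Y; set union plays
    the role of addition when summing over the inner index k.
    """
    n = len(X)
    Z = [[set() for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                for bx in X[i][k]:
                    for cy in Y[k][j]:
                        Z[i][j] |= combine(bx, cy)
    return Z
```

Because `combine` need not be associative, repeated products of such matrices depend on the order of multiplication, which is why a specialized transitive closure computation is needed.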
4 A Matrix Multiplication Algorithm for LCFRS
We turn next to give a description of the algorithm. Our description is constructed as follows:
In §4.1 we describe the basic matrix structure used for LCFRS recognition. This construction depends on a parameter $d$, the contact rank, which is a function of the underlying LCFRS grammar we parse with. We also describe how to create a seed matrix, for which we need to compute the transitive closure.
In §4.2 we define the multiplication operator between cells of the matrices we use. This multiplication operator is distributive but not associative, and as such, we use Valiant's specialized transitive closure algorithm to compute the transitive closure of the seed matrix given a string.
In §4.3 we define the contact rank parameter $d$. The smaller $d$ is, the more efficient it is to parse with the specific grammar.
In §4.4 we define when a binary LCFRS is "balanced." This is an end case that increases the final complexity of our algorithm by a factor of $n$. Nevertheless, it is an important end case that appears in applications, such as inversion transduction grammars.
4.1 Matrix Structure
The algorithm will seek to compute the transitive closure of a seed matrix $M$ indexed by sequences of $d$ endpoints, where $d$ is a constant determined by the grammar (see §4.3). The matrix rows and columns are indexed by the set $N$ defined as:
$$N = \left( [n]_0 \times \{0, 1\} \right)^d$$
where $n$ denotes the length of the sentence, and the exponent $d$ denotes a repeated Cartesian product. Thus each element of $N$ is a sequence of $d$ indices into the string, where each index is annotated with a bit (an element of the set $\{0, 1\}$) indicating whether it is marked or unmarked. Marked indices will be used in the copy operator defined later. Indices are unmarked unless specified as marked; we use $\bar{\imath}$ to denote a marked index with value $i$.
In the following, it will be safe to assume that sequences from $N$ are monotonically increasing in their indices. For an $i \in N$, we overload notation and often use $i$ to refer to the set of all elements in the first coordinate of each element of the sequence (ignoring the additional bits). As such:
The set union $i \cup j$ is defined for $i, j \in N$.
If we state that $i$ is in $N$ and includes a set of endpoints, it means that $i$ is the sequence of these integers (ordered lexicographically) with the bit part determined as explained in the context (for example, all unmarked).
The quantity $|i|$ denotes the length of the sequence $i$.
The quantity $\min(i)$ denotes the smallest index among the first coordinates of all elements in the sequence $i$ (ignoring the additional bits).
We emphasize that throughout this paper the variables $i$, $j$, and $k$ are mostly elements of $N$ as overloaded above, not integers; we choose the symbols $i$, $j$, and $k$ by analogy to the variables in the CKY parsing algorithm, and also because we use the sequences as addresses for matrix rows and columns. For $i, j \in N$, we define $\mathrm{pairs}(i, j)$ to be the set of pairs $(k_{2l-1}, k_{2l})$ for $1 \leq l \leq (|i| + |j|)/2$, where $k_1 \leq k_2 \leq \cdots \leq k_{|i|+|j|}$ is the sorted merge of the sequences $i$ and $j$. This means that $\mathrm{pairs}$ takes as input the two sequences of matrix indices, merges them, sorts them, then divides this sorted list into a set of consecutive pairs. Whenever $\min(j) < \min(i)$, $\mathrm{pairs}(i, j)$ is undefined. The interpretation of this is that the smallest endpoint should always belong to $i$ and not to $j$. See more details in §4.2. In addition, if any element of $i$ or $j$ is marked, $\mathrm{pairs}(i, j)$ is undefined.
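A direct implementation of the pairs operation under the encoding above can be sketched as follows (indices as plain integers, with marked indices not handled; `None` stands in for "undefined"):

```python
def pairs(i, j):
    """Merge two endpoint sequences, sort, and split into consecutive pairs.

    i, j: tuples of integer endpoints (marks are ignored in this sketch).
    Returns None (undefined) when min(j) < min(i), since the smallest
    endpoint must always belong to i.
    """
    if min(j) < min(i):
        return None
    merged = sorted(list(i) + list(j))
    if len(merged) % 2 != 0:
        return None  # cannot split an odd number of endpoints into pairs
    return {(merged[t], merged[t + 1]) for t in range(0, len(merged), 2)}
```

For example, merging the addresses $(0, 3)$ and $(1, 4)$ produces the spans $(0, 1)$ and $(3, 4)$.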
We define an order $<$ on elements $i$ and $j$ of $N$ by first sorting the sequences $i$ and $j$ and then comparing them lexicographically (ignoring the bits). This ensures that $i < j$ if $\min(i) < \min(j)$. We assume that the rows and columns of our matrices are arranged in this order. For the rest of the discussion, we assume that $d$ is a constant.
We also define a set $\Sigma$ of triples as a Cartesian product in which the first coordinate ranges over the nonterminals together with six special pre-defined symbols, and the second and third coordinates range over $N$. The six symbols are used for "copying commands": (1) "from row"; (2) "from column"; (3) "to row"; (4) "to column"; (5) "unmark row"; (6) "unmark column." Each cell in the matrix $M$ is a set $m$ such that $m \subseteq \Sigma$.
The intuition behind matrices of the type of $M$ (meaning $M$ and, as we see later, products of $M$ with itself, or its transitive closure) is that each cell indexed by $(i, j)$ in such a matrix consists of all nonterminals that can be generated by the grammar when parsing a sentence such that these nonterminals span the constituents $\mathrm{pairs}(i, j)$ (whenever $\mathrm{pairs}(i, j)$ is defined). Our normal form for LCFRS ensures that the spans of a nonterminal are never re-ordered, meaning that it is not necessary to retain information about which indices demarcate which components of the nonterminal: one can sort the indices and take the first two indices as delimiting the first span, the second two indices as delimiting the second span, and so on. The two additional elements in each triplet in a cell are actually just copies of the row and column indices of that cell. As such, they are identical for all triplets in that cell. The additional special symbols indicate to the matrix multiplication operator that a "copying operation" should happen between equivalent cells (§4.2).
Figure 4 gives an algorithm to seed the initial matrix $M$. Entries added in step 2 of the algorithm correspond to entries in the LCFRS parsing chart that can be derived immediately from terminals in the string. Entries added in step 3 of the algorithm do not depend on the input string or input grammar, but rather initialize elements used in the copy operation described in detail in §4.2. Because the algorithm only initializes entries in cells $(i, j)$ with $i < j$, the matrix is guaranteed to be upper triangular, a fact that we will take advantage of in §4.2.