Clique-Based Lower Bounds for Parsing Tree-Adjoining Grammars

03/02/2018 ∙ by Karl Bringmann, et al. ∙ Max Planck Society, Universität Saarland

Tree-adjoining grammars are a generalization of context-free grammars that are well suited to model human languages and are thus popular in computational linguistics. In the tree-adjoining grammar recognition problem, given a grammar Γ and a string s of length n, the task is to decide whether s can be obtained from Γ. Rajasekaran and Yooseph's parser (JCSS'98) solves this problem in time O(n^{2ω}), where ω < 2.373 is the matrix multiplication exponent. The best algorithms avoiding fast matrix multiplication take time O(n^6). The first evidence for hardness was given by Satta (J. Comp. Linguist.'94): For a more general parsing problem, any algorithm that avoids fast matrix multiplication and is significantly faster than O(|Γ| n^6) in the case of |Γ| = Θ(n^12) would imply a breakthrough for Boolean matrix multiplication. Following an approach by Abboud et al. (FOCS'15) for context-free grammar recognition, in this paper we resolve many of the disadvantages of the previous lower bound. We show that, even on constant-size grammars, any improvement on Rajasekaran and Yooseph's parser would imply a breakthrough for the k-Clique problem. This establishes tree-adjoining grammar parsing as a practically relevant problem with the unusual running time of n^{2ω}, up to lower order factors.

1 Introduction

Introduced in [14, 15], tree-adjoining grammars (TAGs) are a system to manipulate certain trees to arrive at strings; see Section 2 for a definition. TAGs are more powerful than context-free grammars, capturing various phenomena of human languages which require more formal power; in particular, TAGs have an “extended domain of locality” as they allow “long-distance dependencies” [16]. These properties, and the fact that TAGs are efficiently parsable [29], make them highly desirable in the field of computational linguistics. This is illustrated by the large literature on variants of TAGs (see, e.g., [9, 21, 24, 30]), their formal language properties (see, e.g., [16, 29]), as well as practical applications (see, e.g., [2, 13, 25, 26]), including the XTAG project, which developed a tree-adjoining grammar for (a large fraction of) the English language [10]. In fact, TAGs are so fundamental to computational linguistics that there is a biannual meeting called “International Workshop on Tree-Adjoining Grammars and Related Formalisms” [7], and they are part of the undergraduate curriculum (at least at Saarland University).

The prime algorithmic problems on TAGs are parsing and recognition. In the recognition problem, given a TAG Γ and a string s of length n, the task is to decide whether Γ can generate s. The parsing problem is an extended variant where, in case Γ can generate s, we should also output a sequence of derivations generating s. The first TAG parsers ran in time O(n^6) [23, 29] (in most running time bounds we ignore the dependence on the grammar size, as we are mostly interested in constant-size grammars in this paper), which was improved by Rajasekaran and Yooseph [20] to O(n^{2ω}), where ω < 2.373 is the exponent of (Boolean) matrix multiplication.

A limited explanation for the complexity of TAG parsing was given by Satta [22], who designed a reduction from Boolean matrix multiplication to TAG parsing, showing that any TAG parser running faster than O(|Γ| n^6) on grammars of size |Γ| = Θ(n^12) yields a Boolean matrix multiplication algorithm running faster than . This result has several shortcomings: (1) It holds only for a more general parsing problem, where we need to determine for each substring of the given string whether it can be generated from Γ. (2) It gives a matching lower bound only in the unusual case of |Γ| = Θ(n^12), so that it cannot exclude time, e.g., . (3) It gives matching bounds only restricted to combinatorial algorithms, i.e., algorithms that avoid fast matrix multiplication. (The notion of “combinatorial algorithms” is informal, intuitively meaning that we forbid impractical algorithms such as fast matrix multiplication. It is an open research problem to find a reasonable formal definition.) Thus, so far there is no satisfying explanation of the complexity of TAG parsing.

Context-free grammars

The classic problem of parsing context-free grammars, which has important applications in programming languages, was in a very similar situation as TAG parsing until very recently. Parsers running in time O(n^3) have been known since the 1960s [8, 11, 17, 31]. In a breakthrough, Valiant [27] improved this to O(n^ω). Finally, a reduction from Boolean matrix multiplication due to Lee [18] showed a matching lower bound for combinatorial algorithms for a more general parsing problem in the case that the grammar size is .

Abboud et al. [1] gave the first satisfying explanation for the complexity of context-free parsing, by designing a reduction from the classic k-Clique problem, which asks whether there are k pairwise adjacent vertices in a given graph G. For this problem, for any fixed k the trivial running time of O(n^k) can be improved to O(n^{ωk/3}) for any k divisible by 3 [19] (see [12] for the case of k not divisible by 3). The fastest combinatorial algorithm runs in time O(n^k / log^k n) [28]. The k-Clique hypothesis states that both running times are essentially optimal, specifically that k-Clique has no O(n^{(ω/3−ε)k}) algorithm and no combinatorial O(n^{(1−ε)k}) algorithm for any ε > 0. The main result of Abboud et al. [1] is a reduction from the k-Clique problem to context-free grammar recognition on a specific, constant-size grammar, showing that any O(n^{ω−ε}) algorithm or any combinatorial O(n^{3−ε}) algorithm for context-free grammar recognition would break the k-Clique hypothesis. This matching conditional lower bound removes all disadvantages of Lee’s lower bound at the cost of introducing the k-Clique hypothesis; see [1] for further discussion.
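For orientation, the trivial enumeration algorithm for k-Clique mentioned above can be sketched as follows. This is only an illustration of the brute-force baseline, not one of the fast algorithms cited; the function name `has_k_clique`, the adjacency-set representation, and the example graph are assumptions made for this sketch.

```python
from itertools import combinations


def has_k_clique(adj, k):
    """Return True if the graph contains k pairwise adjacent vertices.

    adj maps each vertex to the set of its neighbours (undirected graph).
    This is the trivial enumeration over all k-subsets of vertices, which
    takes O(n^k) time for fixed k; it is not a fast algorithm.
    """
    for candidate in combinations(list(adj), k):
        # A candidate is a clique if every pair inside it is adjacent.
        if all(v in adj[u] for u, v in combinations(candidate, 2)):
            return True
    return False


# Hypothetical example graph: a triangle 1-2-3 with a pendant vertex 4.
example = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
```

On the example graph, `has_k_clique(example, 3)` finds the triangle, while no 4-clique exists.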

Our contribution

We extend the approach by Abboud et al. to the more complex setting of TAGs. Specifically, we design a reduction from the 6k-Clique problem to TAG recognition:

Theorem 1.

There is a tree-adjoining grammar Γ of constant size such that if we can decide in time T(n) whether a given string of length n can be generated from Γ, then 6k-Clique can be solved in time O(T(n^{k+1} log n)), for any fixed k. This reduction is combinatorial.

Via this reduction, any O(n^{2ω−ε}) algorithm for TAG recognition would prove that 6k-Clique is in time , for sufficiently large k (for this and the next statement it suffices to set ). Furthermore, any combinatorial O(n^{6−ε}) algorithm for TAG recognition would yield a combinatorial algorithm for 6k-Clique in time , for sufficiently large k. As both implications would violate the k-Clique hypothesis, we obtain tight conditional lower bounds for TAG recognition. As our result (1) works directly for TAG recognition instead of a more general parsing problem, (2) holds for constant-size grammars, and (3) does not need the restriction to combinatorial algorithms, it overcomes all shortcomings of the previous lower bound based on Boolean matrix multiplication, at the cost of using the well-established k-Clique hypothesis, which has also been used in [1, 3, 4, 5, 6].
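To make the "sufficiently large k" step concrete, the following worked calculation sketches why a faster recognizer would contradict the hypothesis. It assumes that the string produced by the reduction has length N = O(n^{k+1} log n), as suggested by the length analysis in Section 3. The threshold k ≥ 4ω/ε is an illustrative choice made for this sketch and need not coincide with the value used in the paper.

```latex
% Assumption: the reduction outputs a string of length N = O(n^{k+1} \log n).
% Suppose TAG recognition runs in time O(N^{2\omega-\varepsilon}) for some \varepsilon > 0.
\begin{align*}
N^{2\omega-\varepsilon}
  &= O\!\left(n^{(2\omega-\varepsilon)(k+1)} \log^{2\omega-\varepsilon} n\right), \\
(2\omega-\varepsilon)(k+1)
  &= 2\omega k - \varepsilon k + 2\omega - \varepsilon
   \;\le\; 2\omega k - \tfrac{\varepsilon k}{2} - \varepsilon
   \qquad \text{if } k \ge 4\omega/\varepsilon, \\
N^{2\omega-\varepsilon}
  &= O\!\left(n^{2\omega k - \varepsilon k/2}\right)
   = O\!\left(n^{(\omega/3 - \varepsilon/12)\cdot 6k}\right).
\end{align*}
% The leftover factor n^{-\varepsilon} absorbs the polylogarithmic term.
```

Under these assumptions, 6k-Clique would be solved with exponent (ω/3 − ε/12)·6k, below the bound stated by the k-Clique hypothesis. Replacing ω by 3 throughout (so 2ω becomes 6, ω/3 becomes 1, and the threshold becomes k ≥ 12/ε) gives the analogous calculation for combinatorial algorithms.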

We thus establish TAG parsing as a practically relevant problem with the quite unusual running time of n^{2ω}, up to lower order factors. This is surprising, as the authors are aware of only one other problem with a (conjectured or conditional) optimal running time of n^{2ω}, namely 6-Clique.

Techniques

The essential difference between tree-adjoining and context-free grammars is that the former can grow strings at four positions, see Figure 3(a). Writing a vertex in one position of the string, and writing the neighborhoods of vertices at other positions in the string, a simple tree-adjoining grammar can test whether is adjacent to , and . Extending this construction, for k-cliques we can test whether , and form 2k-cliques. Using two permutations of this test, we ensure that forms an almost-4k-clique, i.e., only the edges might be missing (in Figure 2(b) below this situation is depicted for cliques instead of ). Finally, we use that a 6k-clique can be decomposed into 3 almost-4k-cliques, see Figure 2(a).
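The counting behind this decomposition can be checked mechanically. The sketch below numbers the six k-cliques 1 to 6 and builds three almost-4k-cliques, each from two claws. The particular assignment of parts to claws is a hypothetical arrangement chosen for this illustration (the paper's own arrangement is given in its Figure 2), and the helper names `claw` and `almost_clique` are likewise assumptions.

```python
from itertools import combinations


def claw(center, leaves):
    """Pairs tested by a claw: the center against each of its three leaves."""
    return {frozenset((center, leaf)) for leaf in leaves}


def almost_clique(c1, c2, others):
    """Two claws over the same four parts, centered at c1 and at c2.

    Together they test every pair among the four parts except the pair
    formed by the two parts in `others`.
    """
    return claw(c1, [c2, *others]) | claw(c2, [c1, *others])


# Hypothetical arrangement of the six parts into three almost-cliques.
groups = [
    almost_clique(1, 2, (3, 4)),
    almost_clique(3, 4, (5, 6)),
    almost_clique(5, 6, (1, 2)),
]
covered = set().union(*groups)
all_pairs = {frozenset(p) for p in combinations(range(1, 7), 2)}
```

Each group tests 5 of the 6 pairs among its four parts, and the three groups together cover all 15 pairs among the six parts, so passing all three tests certifies that the six k-cliques form a 6k-clique.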

In the constructed string we essentially just enumerate 6 times all k-cliques of the given graph G, as well as their neighborhoods, with appropriate padding symbols (see Section 3). We try to make the constructed tree-adjoining grammar as easily accessible as possible by defining a certain programming language realized by these grammars, and phrasing our grammar in this language, which yields subroutines with an intuitive meaning (see Section 4).

2 Preliminaries on tree-adjoining grammars

In this section we define tree-adjoining grammars and give examples. Fix a set of terminals and a set of non-terminals. In the following, conceptually we partition the nodes of any tree into its leaves, the root, and the remaining inner nodes. An initial tree is a rooted tree where

  • the root and each inner node is labeled with a non-terminal,

  • each leaf is labeled with a terminal, and

  • each inner node can be marked for adjunction.

See Figure 1(a) for an example; nodes marked for adjunction are annotated by a rectangle. An auxiliary tree is a rooted tree where

  • the root and each inner node is labeled with a non-terminal,

  • exactly one leaf, called the foot node, is labeled with the same non-terminal as the root,

  • each remaining leaf is labeled with a terminal, and

  • each inner node can be marked for adjunction.

Initial trees are the starting points for derivations of the tree-adjoining grammar. These trees are then extended by repeatedly replacing nodes marked for adjunction by auxiliary trees. Formally, given an initial or auxiliary tree that contains at least one inner node marked for adjunction, and given an auxiliary tree whose root has the same label as that marked node, we can combine these trees with the following operation called adjunction; see Figure 1 for an example.

  1. Replace the auxiliary tree's foot node by the subtree rooted at the marked node.

  2. Replace the marked node with the tree obtained from the last step.

Note that these steps make sense, since the marked node and the root of the auxiliary tree have the same label. Note that adjunction does not change the number of leaves labeled with a non-terminal symbol, i.e., an initial tree will stay an initial tree and an auxiliary tree will stay an auxiliary tree.
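The two steps of adjunction can be sketched in code as follows. This is a minimal illustration, not the authors' formalization: the `Node` class, the function names, and the example trees are hypothetical, and the sketch assumes that the marking of the adjoined-at node is consumed by the adjunction, which the text above does not spell out.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    marked: bool = False   # marked for adjunction
    is_foot: bool = False  # foot node of an auxiliary tree


def find_foot(node: Node) -> Optional[Node]:
    """Return the foot node of the tree rooted at node, if any."""
    if node.is_foot:
        return node
    for child in node.children:
        found = find_foot(child)
        if found is not None:
            return found
    return None


def adjoin(target: Node, aux_root: Node) -> None:
    """Adjoin a copy of the auxiliary tree at the marked node target (in place)."""
    if not target.marked or target.label != aux_root.label:
        raise ValueError("target must be marked and carry the root's label")
    aux = copy.deepcopy(aux_root)  # keep the grammar's tree reusable
    foot = find_foot(aux)
    if foot is None:
        raise ValueError("auxiliary tree has no foot node")
    # Step 1: the foot node takes over the subtree below the marked node.
    # Assumption: the mark is consumed, so the former foot is unmarked.
    foot.children = target.children
    foot.is_foot = False
    # Step 2: the marked node is replaced by the extended auxiliary tree.
    target.children = aux.children
    target.marked = aux.marked


def tree_yield(node: Node) -> str:
    """Concatenate the leaf labels from left to right."""
    if not node.children:
        return node.label
    return "".join(tree_yield(child) for child in node.children)


# Hypothetical trees: S(a, X*(b, c)) and auxiliary X(b, X_foot, c).
initial = Node("S", [Node("a"), Node("X", [Node("b"), Node("c")], marked=True)])
auxiliary = Node("X", [Node("b"), Node("X", is_foot=True), Node("c")])
```

Adjoining `auxiliary` at the marked node of `initial` turns the yield `abc` into `abbcc`: the auxiliary tree's terminals are inserted to the left and to the right of the material below the marked node.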

(a) An initial tree (left) and an auxiliary tree (right), each with an internal node marked for adjunction.
(b) Resulting tree after adjoining the auxiliary tree into the initial tree.
Figure 1: The basic building blocks and operation of tree-adjoining grammars.

A tree-adjoining grammar Γ is now defined as a tuple consisting of

  • a finite set of initial trees and

  • a finite set of auxiliary trees,

using the same terminals and non-terminals as labels. The set of derived trees of Γ consists of all trees that can be generated by starting with an initial tree of Γ and repeatedly adjoining auxiliary trees of Γ. (Note that each derived tree is also an initial tree, but not necessarily one of the initial trees of Γ.) Finally, a string s over the terminal alphabet can be generated by Γ if there is a derived tree of Γ such that

  • the derived tree contains no nodes marked for adjunction and

  • s is obtained by concatenating the labels of the leaves of the derived tree from left to right.

The language of Γ is then the set of all strings that can be generated by Γ.

3 Encoding a graph in a string

Given a graph G, in this section we construct a string (the graph gadget) that encodes its k-cliques, over the terminal alphabet of size 19. In the next section we then design a tree-adjoining grammar Γ that generates this string if and only if G contains a 6k-clique. We assume that , and we denote the binary representation of any by and the neighborhood of by . For two strings and , we use to denote their concatenation and to denote the reverse of .

We start with node and list gadgets, encoding a vertex and its neighborhood, respectively:

Note that and are adjacent iff is a substring of .
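The idea of testing adjacency by a substring check can be illustrated with a simplified encoding. The sketch below is not the paper's gadget: the delimiter `$`, the fixed-width binary representation, and the function names are assumptions chosen so that the substring test is easy to verify.

```python
def node_gadget(v, width):
    """Hypothetical node gadget: fixed-width binary of v between delimiters."""
    return "$" + format(v, "0{}b".format(width)) + "$"


def list_gadget(v, adj, width):
    """Hypothetical list gadget: node gadgets of all neighbours of v, concatenated."""
    return "".join(node_gadget(u, width) for u in sorted(adj[v]))


def adjacent_via_substring(u, v, adj, width):
    """u and v are adjacent iff u's node gadget occurs in v's list gadget."""
    return node_gadget(u, width) in list_gadget(v, adj, width)


# Hypothetical example graph on vertices 1..4.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
WIDTH = 3  # enough bits for vertex numbers up to 7
```

Because every entry has the same width and is enclosed in delimiters, a match can only start at the beginning of an entry, so the substring test agrees with the adjacency relation.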

Next, we build clique versions of these gadgets, that encode a -clique and its neighborhood, respectively:

Note that two k-cliques and form a 2k-clique if and only if the substring of between the -th and -th symbol is a substring of the substring of between the -th and -th symbol , for all . Indeed, every pair of a vertex in and a vertex in is tested for adjacency.

Conceptually, we split any 6k-clique into six k-cliques. Thus, let be the set of all k-cliques in G. Our final encoding of the graph is:

As we will show, there is a tree-adjoining grammar Γ of constant size that generates the string iff G contains a 6k-clique. The structure of this test is depicted in Figure 2(a). The clique gadgets of the same highlighting style together allow us to test for an almost-4k-clique, as depicted in Figure 2(a). The two gadgets of the same highlighting style then test for two claws of cliques, as depicted in Figure 2(b).

(a) Each is a k-clique and there is an edge between two k-cliques of some highlighting style if the clique gadgets of that style ensure that these two cliques together form a 2k-clique.

(b) We will generate an almost-4k-clique as in (a) by generating two claws. (This tests the edges , , and in (a) twice.)
Figure 2: Structure of our test for 6k-cliques.

As the graph has n nodes, for any node the node and list gadgets have a length of O(n log n), and for a k-clique (with k fixed) the clique neighborhood gadgets thus have a length of O(n log n). As our encoding of the graph consists of O(n^k) clique neighborhood gadgets, the resulting string length is O(n^{k+1} log n). It is easy to see that it is also possible to construct all gadgets, and in particular the encoding of a graph, in linear time with respect to their length.

4 Programming with trees

It remains to design a clique-detecting tree-adjoining grammar. To make our reduction more accessible, we will think of tree-adjoining grammars as a certain programming language. In the end, we will then present a “program” that generates (a suitable superset of) the set of all strings that represent a graph containing a 6k-clique. We start by defining programs.

A normal tree with input and output is an auxiliary tree where:

  • the root is labeled with ,

  • exactly one node is marked for adjunction, and

  • this node lies on the path from the root to the foot node and is labeled .

See Figure 3(a) for an illustration. The special structure of a normal tree allows us to split its nodes into four categories (excluding the path from the tree's root to its foot node): subtrees of left children of the path from the root to the marked node, subtrees of left children of the path from the marked node to the foot node, subtrees of right children of the path from the marked node to the foot node, and the remaining nodes (i.e., subtrees of right children of the path from the root to the marked node). The concatenation of all terminal symbols in the tree's leaves from left to right can then be split into four parts