Constituent Parsing as Sequence Labeling

10/21/2018 ∙ by Carlos Gómez-Rodríguez, et al. ∙ Universidade da Coruña

We introduce a method to reduce constituent parsing to sequence labeling. For each word $w_t$, it generates a label that encodes: (1) the number of ancestors in the tree that the words $w_t$ and $w_{t+1}$ have in common, and (2) the nonterminal symbol at the lowest common ancestor. We first prove that the proposed encoding function is injective for any tree without unary branches. In practice, the approach is made extensible to all constituency trees by collapsing unary branches. We then use the PTB and CTB treebanks as testbeds and propose a set of fast baselines. We achieve an F-score of 90 on the PTB test set, outperforming the Vinyals et al. (2015) sequence-to-sequence parser. In addition, sacrificing some accuracy, our approach achieves the fastest constituent parsing speeds reported to date on PTB by a wide margin.




1 Introduction

Constituent parsing is a core problem in NLP where the goal is to obtain the syntactic structure of sentences, expressed as a phrase structure tree.

Traditionally, constituent-based parsers have been built relying on chart-based, statistical models (Collins, 1997; Charniak, 2000; Petrov et al., 2006), which are accurate but slow, with typical speeds well below 10 sentences per second on modern CPUs (Kummerfeld et al., 2012).

Several authors have proposed more efficient approaches that gain speed while preserving (or even improving) accuracy. Sagae and Lavie (2005) present a classifier for constituency parsing that runs in linear time by relying on a shift-reduce stack-based algorithm, instead of a grammar. It is essentially an extension of transition-based dependency parsing (Nivre, 2003). This line of research has been polished through the years (Wang et al., 2006; Zhu et al., 2013; Dyer et al., 2016; Liu and Zhang, 2017; Fernández-González and Gómez-Rodríguez, 2018).

With an aim more related to our work, other authors have reduced constituency parsing to tasks that can be solved faster or in a more generic way. Fernández-González and Martins (2015) reduce phrase structure parsing to dependency parsing. They propose an intermediate representation where dependency labels from a head to its dependents encode the nonterminal symbol and an attachment order that is used to arrange nodes into constituents. Their approach makes it possible to use off-the-shelf dependency parsers for constituency parsing. In a different line, Vinyals et al. (2015) address the problem by relying on a sequence-to-sequence model where trees are linearized in a depth-first traversal order. Their solution can be seen as a machine translation model that maps a sequence of words into a parenthesized version of the tree. Choe and Charniak (2016) recast parsing as language modeling. They train a generative parser that obtains the phrasal structure of sentences by relying on the intuition of Vinyals et al. (2015) and on the model of Zaremba et al. (2014) to build the basic language modeling architecture.

More recently, Shen et al. (2018) propose an architecture to speed up the current state-of-the-art chart parsers trained with deep neural networks (Stern et al., 2017; Kitaev and Klein, 2018). They introduce the concept of syntactic distances, which specify the order in which the splitting points of a sentence will be selected. The model learns to predict such distances, to then recursively partition the input in a top-down fashion.


We propose a method to transform constituent parsing into sequence labeling. This reduces it to the complexity of tasks such as part-of-speech (PoS) tagging, chunking or named-entity recognition. The contribution is two-fold.

First, we describe a method to linearize a tree into a sequence of labels (§2) whose length is that of the sentence minus one (a last dummy label is generated to fulfill the properties of sequence labeling tasks). The label generated for each word encodes the number of common ancestors in the constituent tree between that word and the next, and the nonterminal symbol associated with the lowest common ancestor. We prove that the encoding function is injective for any tree without unary branchings. After applying collapsing techniques, the method can parse unary chains.

Second, we use this encoding to present different baselines that can effectively predict the structure of sentences (§3). To do so, we rely on a recurrent sequence labeling model based on BiLSTMs (Hochreiter and Schmidhuber, 1997; Yang and Zhang, 2018). We also test other models inspired by classic approaches for other tagging tasks (Schmid, 1994; Sha and Pereira, 2003). We use the Penn Treebank (PTB) and the Penn Chinese Treebank (CTB) as testbeds.

The comparison against Vinyals et al. (2015), the closest work to ours, shows that our method is able to train more accurate parsers. This is in spite of the fact that our approach addresses constituent parsing as a sequence labeling problem, which is simpler than a sequence-to-sequence problem, where the output sequence has variable/unknown length. Despite being the first sequence labeling method for constituent parsing, our baselines achieve decent accuracy results in comparison to models coming from mature lines of research, and their speeds are the fastest reported to our knowledge.

2 Linearization of n-ary trees

Notation and Preliminaries

In what follows, we use bold style to refer to vectors and matrices (e.g. $\mathbf{x}$ and $\mathbf{W}$). Let $\mathbf{w} = [w_1, w_2, \ldots, w_{|w|}]$ be an input sequence of words, where $w_i \in V$. Let $T_{|w|}$ be the set of constituent trees with $|w|$ leaf nodes that have no unary branches. For now, we will assume that the constituent parsing problem consists in mapping each sentence $\mathbf{w}$ to a tree in $T_{|w|}$, i.e., we assume that correct parses have no unary branches. We will deal with unary branches later.

To reduce the problem to a sequence labeling task, we define a set of labels $L$ that allows us to encode each tree in $T_{|w|}$ as a unique sequence of labels in $L^{(|w|-1)}$, via an encoding function $\Phi_{|w|}: T_{|w|} \to L^{(|w|-1)}$. Then, we can reduce the constituent parsing problem to a sequence labeling task where the goal is to predict a function $F_{|w|,\theta}: V^{|w|} \to L^{(|w|-1)}$, where $\theta$ are the parameters to be learned. To parse a sentence, we label it and then decode the resulting label sequence into a constituent tree, i.e., we apply $F_{|w|,\theta}$ followed by $\Phi^{-1}_{|w|}$.

For the method to be correct, we need the encoding of trees to be complete (every tree in $T_{|w|}$ must be expressible as a label sequence, i.e., $\Phi_{|w|}$ must be a function, so we have full coverage of constituent trees) and injective (so that the inverse function $\Phi^{-1}_{|w|}$ is well-defined). Surjectivity is also desirable, so that the inverse is a function on $L^{(|w|-1)}$, and the parser outputs a tree for any sequence of labels that the classifier can generate.

We now define our $\Phi_{|w|}$ and show that it is total and injective. Our encoding is not surjective per se. We handle ill-formed label sequences in §2.3.

2.1 The Encoding

Let $w_i$ be a word located at position $i$ in the sentence, for $1 \le i \le |w|-1$. We will assign it a 2-tuple label $l_i = (n_i, c_i)$, where: $n_i$ is an integer that encodes the number of common ancestors between $w_i$ and $w_{i+1}$, and $c_i$ is the nonterminal symbol at the lowest common ancestor.

Basic encodings

The number of common ancestors may be encoded in several ways.

  1. Absolute scale: The simplest encoding is to make $n_i$ directly equal to the number of ancestors in common between $w_i$ and $w_{i+1}$.

  2. Relative scale: A second and better variant consists in making $n_i$ represent the difference with respect to the number of ancestors encoded in $n_{i-1}$. Its main advantage is that the size of the label set is reduced considerably.

Figure 1 shows an example of a tree linearized according to both absolute and relative scales.


Figure 1: An example of a constituency tree linearized applying both absolute and relative scales.
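The two scales can be made concrete with a short sketch. It is illustrative only and makes several assumptions that are not in the paper: trees are nested Python tuples, PoS-tag preterminals are omitted, the first relative value is taken with respect to zero, and the example sentence is invented rather than being the tree of Figure 1.

```python
from itertools import count


def encode(tree):
    """Linearize a tree given as nested tuples: (label, child, child, ...).

    Leaves are plain strings (words). PoS-tag preterminals are left out
    to keep the sketch short; that is a simplification of this example.
    Returns (words, absolute_labels, relative_labels).
    """
    words, paths = [], []
    fresh_id = count()  # unique ids, so equal-looking subtrees stay distinct

    def walk(node, ancestors):
        if isinstance(node, str):  # a word: remember its ancestor path
            words.append(node)
            paths.append(ancestors)
            return
        label, *children = node
        here = ancestors + [(next(fresh_id), label)]
        for child in children:
            walk(child, here)

    walk(tree, [])

    absolute, relative, previous = [], [], 0
    for left, right in zip(paths, paths[1:]):
        # Count the shared prefix of the two root-to-leaf ancestor paths.
        common = 0
        for a, b in zip(left, right):
            if a[0] != b[0]:
                break
            common += 1
        nonterminal = left[common - 1][1]  # label of the lowest common ancestor
        absolute.append((common, nonterminal))
        relative.append((common - previous, nonterminal))
        previous = common
    return words, absolute, relative


# Hypothetical example tree (not the one in Figure 1).
tree = ("S",
        ("NP", "the", "red", "toy"),
        ("VP", "fell", ("PP", "on", ("NP", "the", "floor"))),
        ".")
words, absolute, relative = encode(tree)
print(absolute)  # [(2, 'NP'), (2, 'NP'), (1, 'S'), (2, 'VP'), (3, 'PP'), (4, 'NP'), (1, 'S')]
print(relative)  # [(2, 'NP'), (0, 'NP'), (-1, 'S'), (1, 'VP'), (1, 'PP'), (1, 'NP'), (-3, 'S')]
```

In this toy case the absolute scale uses four distinct integers and the relative scale five, so the reduction in label-set size mentioned above only shows up on a full treebank, where deep trees make absolute depths vary much more than differences between neighbours.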

Encoding for trees with exactly $k$ children

For trees where all branchings have exactly $k$ children, it is possible to obtain an even more efficient linearization in terms of number of labels. To do so, we take the relative scale encoding as our starting point. If we build the tree incrementally in a left-to-right manner from the labels and we find a negative $n_i$, we will need to attach the word $w_{i+1}$ (or a new subtree with that word as its leftmost leaf) to the $(-n_i+1)$th node in the path going from $w_i$ to the root. If every node must have exactly $k$ children, there is only one valid negative value of $n_i$: the one pointing to the first node in said path that has not received its $k$th child yet. Any smaller value would leave this node without enough children (which cannot be fixed later due to the left-to-right order in which we build the tree), and any larger value would create a node with too many children. Thus, we can map negative values to a single label. Figure 2 shows an example for the case of binarized trees ($k = 2$).


Figure 2: An example of a binarized constituency tree, linearized both applying absolute and relative scales.

Links to root

Another variant emerged from the empirical observation that some tokens that are usually linked to the root node (such as the final punctuation in Figure 1) were particularly difficult to learn for the simpler baselines. To successfully deal with these cases in practice, it makes sense to consider a simplified annotation scheme where a node is assigned a special tag (root, $c_i$) when it is directly linked to the root of the tree.

From now on, unless otherwise specified, we use the relative scale without the simplification for exactly $k$ children. This will be the encoding used in the experiments (§4), because the size of the label set is significantly lower than the one obtained by relying on the absolute one. Also, it works directly with non-binarized trees, in contrast to the encoding that we introduce for trees with exactly $k$ children, which is described only for completeness and possible interest for future work. For the experiments (§4), we also use the special tag (root, $c_i$) to further reduce the size of the label set and to simplify the classification of tokens connected to the root, where $|n_i|$ is expected to be large.

2.2 Theoretical correctness

We now prove that $\Phi_{|w|}$ is a total function and injective for any tree in $T_{|w|}$. Recall that trees in this set have no unary branches; we describe how we deal with unary branches later (§2.3). To prove correctness, we use the relative scale. Correctness for the other scales follows trivially.


Every pair of nodes in a rooted tree has at least one common ancestor, and a unique lowest common ancestor. Hence, for any tree in $T_{|w|}$, the label $l_i = (n_i, c_i)$ defined in Section 2.1 is well-defined and unique for each word $w_i$, $1 \le i \le |w|-1$; and thus $\Phi_{|w|}$ is a total function from $T_{|w|}$ to $L^{(|w|-1)}$.


The encoding method must ensure that any given sequence of labels corresponds to exactly one tree. Otherwise, we have to deal with ambiguity, which is not desirable.

For simplicity, we will prove injectivity in two steps. First, we will show that the encoding is injective if we ignore nonterminals (i.e., equivalently, that the encoding is injective for the set of trees resulting from replacing all the nonterminals in trees in $T_{|w|}$ with a generic nonterminal $X$). Then, we will show that it remains injective when we take nonterminals into account.

For the first part, let $\tau \in T_{|w|}$ be a tree where nonterminals take a generic value $X$. We represent the label of the $i$th leaf node as $\bullet_i$. Consider the representation of $\tau$ as a bracketed string, where a single-node tree with a node labeled $A$ is represented by $(A)$, and a tree rooted at $R$ with child subtrees $C_1 \ldots C_n$ is represented as $(R\ C_1 \ldots C_n)$.

Each leaf node will appear in this string as a substring $(\bullet_i)$. Thus, the parenthesized string has the form $\alpha_0 (\bullet_1) \alpha_1 (\bullet_2) \ldots \alpha_{|w|-1} (\bullet_{|w|}) \alpha_{|w|}$, where the $\alpha_i$s are strings that can only contain brackets and nonterminals, as by construction there can be no leaf nodes between $(\bullet_i)$ and $(\bullet_{i+1})$.

We now observe some properties of this parenthesized string. First, note that each of the substrings $\alpha_i$ must necessarily be composed of zero or more closing parentheses followed by zero or more opening parentheses with their corresponding nonterminal, i.e., it must be of the form $[)]^*[(X]^*$. This is because an opening parenthesis followed by a closing parenthesis would represent a leaf node, and there are no leaf nodes between $(\bullet_i)$ and $(\bullet_{i+1})$ in the tree.

Thus, we can write $\alpha_i$ as $\alpha_{i)}\alpha_{i(}$, where $\alpha_{i)}$ is a string matching the expression $[)]^*$ and $\alpha_{i(}$ a string matching the expression $[(X]^*$. With this, we can write the parenthesized string for $\tau$ as

$$\alpha_{0)}\alpha_{0(}(\bullet_1)\alpha_{1)}\alpha_{1(}(\bullet_2)\ldots\alpha_{|w|-1)}\alpha_{|w|-1(}(\bullet_{|w|})\alpha_{|w|)}\alpha_{|w|(}$$

Let us now denote by $\beta_i$ the string $\alpha_{i-1(}(\bullet_i)\alpha_{i)}$. Then, taking into account that $\alpha_{0)}$ and $\alpha_{|w|(}$ are trivially empty in the previous expression due to bracket balancing, the expression for the tree becomes simply $\beta_1 \beta_2 \ldots \beta_{|w|}$, where we know, by construction, that each $\beta_i$ is of the form $[(X]^*(\bullet_i)[)]^*$.

Since we have shown that each tree in $T_{|w|}$ uniquely corresponds to a string $\beta_1 \beta_2 \ldots \beta_{|w|}$, to show injectivity of the encoding, it suffices to show that different values for a $\beta_i$ generate different label sequences.

To show this, we can say more about the form of $\beta_i$: it must be either of the form $[(X]^*(\bullet_i)$ or of the form $(\bullet_i)[)]^*$, i.e., it is not possible that $\beta_i$ contains both opening parentheses before the leaf node and closing parentheses after the leaf node. This could only happen if the tree had a subtree of the form $(X\ (\bullet_i))$, but this is not possible since we are forbidding unary branches.

Hence, we can identify each $\beta_i$ with an integer number $\delta(\beta_i)$: $0$ if $\beta_i$ has neither opening nor closing parentheses outside the leaf node, $+k$ if it has $k$ opening parentheses, and $-k$ if it has $k$ closing parentheses. It is easy to see that $\delta(\beta_1)\,\delta(\beta_2) \ldots \delta(\beta_{|w|-1})$ corresponds to the values $n_i$ in the relative-scale label encoding of the tree $\tau$. To see this, note that the number of unclosed parentheses at the point right after $\beta_i$ in the string exactly corresponds to the number of common ancestors between the $i$th and $(i+1)$th leaf nodes. A positive $\delta(\beta_i) = k$ corresponds to opening $k$ parentheses before $(\bullet_i)$, so the number of common ancestors of $w_i$ and $w_{i+1}$ will be $k$ more than that of $w_{i-1}$ and $w_i$. A negative $\delta(\beta_i) = -k$ corresponds to closing $k$ parentheses after $(\bullet_i)$, so the number of common ancestors will conversely decrease by $k$. A value of zero means no opening or closing parentheses, and no change in the number of common ancestors.

Thus, different parenthesized strings $\beta_1 \beta_2 \ldots \beta_{|w|}$ generate different label sequences, which proves injectivity ignoring nonterminals (note that $\delta(\beta_{|w|})$ does not affect injectivity as it is uniquely determined by the other values: it corresponds to closing all the parentheses that remain unclosed at that point).

It remains to show that injectivity still holds when nonterminals are taken into account. Since we have already proven that trees with different structure produce different values of $n_i$ in the labels, it suffices to show that trees with the same structure, but different nonterminals, produce different values of $c_i$. Essentially, this reduces to showing that every nonterminal in the tree is mapped into a concrete $c_i$. That said, consider a tree $\tau \in T_{|w|}$, and some nonterminal $X$ in $\tau$. Since trees in $T_{|w|}$ do not have unary branches, $X$ has at least two children. Consider the rightmost word in the first child subtree, and call it $w_i$. Then, $w_{i+1}$ is the leftmost word in the second child subtree, and $X$ is the lowest common ancestor of $w_i$ and $w_{i+1}$. Thus, $c_i = X$, and a tree with identical structure but a different nonterminal at that position will generate a label sequence with a different value of $c_i$. This concludes the proof of injectivity.

2.3 Limitations

We have shown that our proposed encoding is a total, injective function from trees without unary branches with yield of length $|w|$ to sequences of $|w|-1$ labels. This will serve as the basis for our reduction of constituent parsing to sequence labeling. However, to go from theory to practice, we need to overcome two limitations of the theoretical encoding: non-surjectivity and the inability to encode unary branches. Fortunately, both can be overcome with simple techniques.

Handling of unary branches

The encoding function $\Phi_{|w|}$ cannot directly assign the nonterminal symbols of unary branches, as there is no pair of adjacent words that has such a node as its lowest common ancestor. Figure 3 illustrates this with an example.

It is worth remarking that this is not a limitation of our encoding, but of any encoding that would facilitate constituent parsing as sequence labeling, as the number of nonterminal nodes in a tree with unary branches is not bounded by any function of $|w|$. The fact that our encoding works for trees without unary branches owes to the fact that such a tree cannot have more than $|w|-1$ non-leaf nodes, and therefore it is always possible to encode all of them in $|w|-1$ labels associated with leaf nodes.


Figure 3: An example of a tree that cannot be directly linearized with our approach. and abstract over words and PoS tags. Dotted lines represent incorrect branches after applying and inverting our encoding naively without any adaptation for unaries. The nonterminal symbol of the second ancestor of (x) cannot be decoded, as no pair of words have x as their lowest common ancestor. A similar situation can be observed for the closest ancestor of (z).

To overcome this issue, we follow a collapsing approach, as is common in parsers that need special treatment of unary chains (Finkel et al., 2008; Narayan and Cohen, 2016; Shen et al., 2018). For clarity, we use the name intermediate unary chains to refer to unary chains that end in a nonterminal symbol (an example appears in Figure 3) and leaf unary chains to name those that yield a PoS tag. Intermediate unary chains are collapsed into a single chained symbol, which can be encoded by $\Phi_{|w|}$ as any other nonterminal symbol. On the other hand, leaf unary chains are collapsed together with the PoS tag, but these cannot be encoded and decoded by relying on $\Phi_{|w|}$, as our encoding assumes a fixed sequence of leaf nodes and does not encode them explicitly. To overcome this, we propose two methods:

  1. To use an extra function to enrich the PoS tags before applying our main sequence labeling function. This function is of the form $\Psi_{|w|}: V^{|w|} \to U^{|w|}$, where $U$ is the set of labels of the leaf unary chains (without including the PoS tags) plus a dummy label $\emptyset$. $\Psi_{|w|}$ maps $w_i$ to $\emptyset$ if there is no leaf unary chain at $w_i$, or to the collapsed label otherwise.

  2. To extend our encoding function to predict them as a part of our labels $l_i$, by transforming them into 3-tuples $(n_i, c_i, u_i)$ where $u_i$ encodes the leaf unary chain collapsed label for $w_i$, if there is any, or none otherwise. We call this extended encoding function $\Phi'_{|w|}$.

The former requires running two passes of sequence labeling to deal with leaf unary chains. The latter avoids this, but the number of labels is larger and sparser. In §4 we discuss how these two approaches behave in terms of accuracy and speed.
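The collapsing step can be sketched as follows. This is a minimal illustration under assumptions of this example, not the paper's implementation: trees are nested tuples with explicit `(tag, word)` preterminals, chained symbols are joined with `+`, and `-` stands in for the dummy label.

```python
def collapse_unaries(node, joiner="+"):
    """Collapse unary chains in a tree of nested tuples.

    Nonterminals are (label, child, ...) and preterminals are (tag, word).
    An intermediate chain S -> VP becomes the single symbol "S+VP";
    a leaf chain NP -> NN becomes the enriched preterminal "NP+NN".
    The "+" joiner is an assumption of this sketch.
    """
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return node  # plain preterminal, nothing to collapse
    if len(children) == 1:
        child = collapse_unaries(children[0], joiner)
        return (label + joiner + child[0],) + tuple(child[1:])
    return (label,) + tuple(collapse_unaries(c, joiner) for c in children)


def leaf_unary_labels(collapsed, joiner="+", dummy="-"):
    """Per-word leaf unary chain label, or a dummy when there is none."""
    labels = []

    def walk(node):
        label, *children = node
        if len(children) == 1 and isinstance(children[0], str):
            chain, _, _tag = label.rpartition(joiner)
            labels.append(chain or dummy)
        else:
            for child in children:
                walk(child)

    walk(collapsed)
    return labels


tree = ("S", ("VP", ("VBZ", "stop"), ("NP", ("NN", "it"))))
collapsed = collapse_unaries(tree)
print(collapsed)                     # ('S+VP', ('VBZ', 'stop'), ('NP+NN', 'it'))
print(leaf_unary_labels(collapsed))  # ['-', 'NP']
```

The second function produces the kind of per-word target that a separate tagging pass would predict in the first method; in the second method the same value would instead become a third component of each label.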


Non-surjectivity

Our encoding, as defined formally in Section 2.1, is injective but not surjective, i.e., not every sequence of $|w|-1$ labels of the form $(n_i, c_i)$ corresponds to a tree in $T_{|w|}$. In particular, there are two situations where a label sequence formally has no tree, and thus $\Phi^{-1}_{|w|}$ is not formally defined and we have to use extra heuristics or processing to define it:

  • Sequences with conflicting nonterminals. A nonterminal can be the lowest common ancestor of more than two pairs of contiguous words when branches are non-binary. For example, in the tree in Figure 1, the lowest common ancestor of both “the” and “red” and of “red” and “toy” is the same node. This translates into two consecutive labels that carry the same nonterminal. If we take that sequence and change the nonterminal in only one of those two labels, we obtain a label sequence that does not strictly correspond to the encoding of any tree, as it contains a contradiction: two elements referencing the same node indicate different nonterminal labels. In practice, this problem is trivial to solve: when a label sequence encodes several conflicting nonterminals at a given position in the tree, we compute $\Phi^{-1}_{|w|}$ using the first such nonterminal and ignoring the rest.

  • Sequences that produce unary structures. There are sequences of $|w|-1$ values $n_i$ that do not correspond to a tree in $T_{|w|}$ because the only tree structure satisfying the common ancestor conditions of their values (the one built by generating the string of $\beta_i$s in the injectivity proof) contains unary branchings, causing the problem described above where we do not have a specification for every nonterminal. An example of this is the absolute-scale sequence of the tree introduced in Figure 3. In practice, as unary chains have been previously collapsed, any generated unary node is considered as not valid and removed.
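Both heuristics fit naturally into a left-to-right decoder. The sketch below is one possible way to write it, not the authors' code. It assumes relative-scale labels, at least two words, and nested tuples as output, and it adds one assumption that is not in the text: a level that would fall below the root is clamped to 1.

```python
def decode(words, labels):
    """Rebuild a nested-tuple tree from relative-scale (n_i, c_i) labels."""
    holder = [None, []]   # artificial node at depth 0, removed at the end
    stack = [holder]      # stack[d] is the currently open node at depth d
    level = 0
    for word, (n, nonterminal) in zip(words, labels):
        target = max(1, level + n)       # assumption: never go above the root
        if target > level:
            while len(stack) - 1 < target:   # open the missing nodes
                node = [None, []]
                stack[-1][1].append(node)
                stack.append(node)
            stack[-1][1].append(word)
        else:
            stack[-1][1].append(word)
            del stack[target + 1:]           # close nodes deeper than target
        if stack[-1][0] is None:             # first nonterminal wins
            stack[-1][0] = nonterminal
        level = target
    stack[-1][1].append(words[-1])

    def prune(node):
        if isinstance(node, str):
            return node
        label, children = node
        children = [prune(child) for child in children]
        if len(children) == 1:               # generated unary node: drop it
            return children[0]
        return (label, *children)

    return prune(holder)


words = "the red toy fell on the floor .".split()
labels = [(2, "NP"), (0, "VP"), (-1, "S"), (1, "VP"),
          (1, "PP"), (1, "NP"), (-3, "S")]   # second label conflicts with the first
print(decode(words, labels))
# ('S', ('NP', 'the', 'red', 'toy'), ('VP', 'fell', ('PP', 'on', ('NP', 'the', 'floor'))), '.')
```

A node only receives a nonterminal when it is the lowest common ancestor of two adjacent words, so in this sketch the nodes that end up without a label are exactly the unary ones, and pruning them leaves a fully labeled tree.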

3 Sequence Labeling

Sequence labeling is a structured prediction task that generates an output label for every token in an input sequence (Rei and Søgaard, 2018). Examples of practical tasks that can be formulated under this framework in natural language processing are PoS tagging, chunking or named-entity recognition, which are in general fast. However, to our knowledge, there is no previous work on sequence labeling methods for constituent parsing, as an encoding allowing it was lacking so far.

In this work, we consider methods ranging from traditional models to state-of-the-art neural models for sequence labeling, to test whether they are valid to train constituency-based parsers following our approach. We give the essential details needed to comprehend the core of each approach, but will mainly treat them as black boxes, referring the reader to the references for a careful and detailed mathematical analysis of each method. Appendix A specifies additional hyper-parameters for the tested models.


We add to every sentence both beginning and end tokens.

3.1 Traditional Sequence Labeling Methods

We consider two baselines to train our prediction function $F_{|w|,\theta}$, based on popular sequence labeling methods used in NLP problems, such as PoS tagging or shallow parsing (Schmid, 1994; Sha and Pereira, 2003).

Conditional Random Fields

(Lafferty et al., 2001). Let $\mathrm{CRF}_\theta(\mathbf{w})$ be its prediction function; a CRF model computes conditional probability distributions of the form $p(\mathbf{l} \mid \mathbf{w})$ such that $\mathrm{CRF}_\theta(\mathbf{w}) = \hat{\mathbf{l}} = \arg\max_{\mathbf{l}'} p(\mathbf{l}' \mid \mathbf{w})$. In our work, the inputs to the CRF are words and PoS tags. To represent a word $w_i$, we use information of the word itself and also contextual information from $w_{[i-1:i+1]}$ (we tried contextual information beyond the immediate previous and next word, but the performance was similar). In particular:

  • We extract the word form (lowercased), the PoS tag and its prefix of length 2, from $w_{[i-1:i+1]}$. For these words we also include binary features: whether it is the first word, the last word, a number, and whether the word is capitalized or uppercased.

  • Additionally, for $w_i$ we look at the suffixes of both length 3 and 2 (i.e. $w_{i[-3:]}$ and $w_{i[-2:]}$).

To build our CRF models, we relied on the sklearn-crfsuite library.
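The feature templates listed above could be written as a per-token dictionary, which is the general shape of input that CRF toolkits of this kind accept. The sketch below is an approximation: the feature names, the reading of "prefix" as the first two characters of the word form, and the number test are assumptions of this example, not the authors' exact templates.

```python
def token_features(words, tags, i):
    """Feature dictionary for the word at position i (0-based).

    Covers the previous, current and next word; the key prefixes
    "-1:", "+0:" and "+1:" are naming choices of this sketch.
    """
    features = {}
    for offset in (-1, 0, 1):
        j = i + offset
        if not 0 <= j < len(words):
            continue  # no neighbour on this side of the sentence
        word, key = words[j], f"{offset:+d}:"
        features.update({
            key + "lower": word.lower(),
            key + "pos": tags[j],
            key + "prefix2": word[:2],
            key + "is_first": j == 0,
            key + "is_last": j == len(words) - 1,
            # Assumed number test: digits with optional "," and one ".".
            key + "is_number": word.replace(",", "").replace(".", "", 1).isdigit(),
            key + "is_capitalized": word[:1].isupper(),
            key + "is_upper": word.isupper(),
        })
    # Suffixes are only taken from the current word.
    features["suffix3"] = words[i][-3:]
    features["suffix2"] = words[i][-2:]
    return features


words, tags = ["The", "dog", "barks"], ["DT", "NN", "VBZ"]
sentence_features = [token_features(words, tags, i) for i in range(len(words))]
print(sentence_features[1]["+0:lower"], sentence_features[1]["suffix2"])  # dog og
```

A full sentence is then a list of such dictionaries, paired with the list of gold labels for training.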

MultiLayer Perceptron

(Rosenblatt, 1958). We use one hidden layer. Let $\mathrm{MLP}_\theta(\mathbf{w})$ be its prediction function; it treats sequence labeling as a set of independent predictions, one per word. The prediction for a word is computed as $\mathrm{softmax}(W_2 \cdot \mathrm{relu}(W_1 \cdot \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2)$, where $\mathbf{x}$ is the input vector and $W_i$ and $\mathbf{b}_i$ are the weights and biases to be learned at layer $i$. We consider both a discrete ($\mathrm{MLP}_d$) and an embedded ($\mathrm{MLP}_e$) perceptron. For the former, we use as inputs the same set of features as for the CRF. For the latter, the vector $\mathbf{x}$ for $w_i$ is defined as a concatenation of word and PoS tag embeddings from a context window around $w_i$ (in contrast to the discrete input, larger contextual information was useful).

To build our MLPs, we relied on Keras.
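As a worked illustration of a one-hidden-layer perceptron that labels each word independently, the sketch below computes a forward pass in plain Python. The ReLU hidden activation, the softmax output, and the toy weights are assumptions of this example; it does not reproduce the Keras model used in the experiments.

```python
import math


def affine(weights, vector, bias):
    """Compute W . x + b with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def relu(vector):
    return [max(0.0, value) for value in vector]


def softmax(vector):
    shift = max(vector)  # subtract the maximum for numerical stability
    exps = [math.exp(value - shift) for value in vector]
    total = sum(exps)
    return [value / total for value in exps]


def mlp_predict(x, w1, b1, w2, b2):
    """Label distribution for one word: softmax(W2 . relu(W1 . x + b1) + b2)."""
    return softmax(affine(w2, relu(affine(w1, x, b1)), b2))


# Toy parameters: 2 input features, 2 hidden units, 3 candidate labels.
x = [1.0, -1.0]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]], [0.0, 0.0, 0.0]
probabilities = mlp_predict(x, w1, b1, w2, b2)
best_label = probabilities.index(max(probabilities))
print(best_label)  # 0
```

Because each word is classified on its own, nothing in this model prevents neighbouring labels from being mutually inconsistent, which is one reason the decoding heuristics of §2.3 matter in practice.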

3.2 Sequence Labeling Neural Models

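We use NCRF++ (with PyTorch), a sequence labeling framework based on recurrent neural networks (RNN) (Yang and Zhang, 2018), and more specifically on bidirectional long short-term memory networks (Hochreiter and Schmidhuber, 1997), which have been successfully applied to problems such as PoS tagging or dependency parsing (Plank et al., 2016; Kiperwasser and Goldberg, 2016). Let $\mathrm{LSTM}_\theta(\mathbf{x})$ be an abstraction of a standard long short-term memory network that processes the sequence $\mathbf{x} = [\mathbf{x}_1, \ldots, \mathbf{x}_{|\mathbf{x}|}]$; then a BiLSTM encoding of its $i$th element, $\mathrm{BiLSTM}_\theta(\mathbf{x}, i)$, is defined as:

$$\mathrm{BiLSTM}_\theta(\mathbf{x}, i) = \mathbf{h}_i = \mathbf{h}^l_i \circ \mathbf{h}^r_i = \mathrm{LSTM}^l(\mathbf{x}_{[1:i]}) \circ \mathrm{LSTM}^r(\mathbf{x}_{[|\mathbf{x}|:i]})$$

In the case of multilayer BiLSTMs, the time-step outputs of the BiLSTM at layer $m$ are fed as input to the BiLSTM at layer $m+1$. The output label for each $w_i$ is finally predicted as $\mathrm{softmax}(W \cdot \mathbf{h}_i + \mathbf{b})$.

Given a sentence $[w_1, w_2, \ldots, w_{|w|}]$, the input to the sequence model is a sequence of embeddings $[\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_{|w|}]$ where each $\mathbf{w}_i = w_i \circ p_i \circ ch_i$, such that $w_i$ and $p_i$ are a word and a PoS tag embedding, and $ch_i$ is a word embedding obtained from an initial character embedding layer, also based on a BiLSTM. Figure 4 shows the architecture of the network.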


Figure 4: Architecture of the neural model

4 Experiments

We report results on models trained using the relative scale encoding and the special tag (root, $c_i$). As a reminder, to deal also with leaf unary chains, we proposed two methods in §2.3: to predict them relying on both the encoding functions $\Phi_{|w|}$ and $\Psi_{|w|}$, or to predict them as a part of an enriched label predicted by the function $\Phi'_{|w|}$. For clarity, we name these models with the superscripts $\Psi,\Phi$ and $\Phi'$, respectively.


We use the Penn Treebank (Marcus et al., 1994) and its official splits: Sections 2 to 21 for training, 22 for development and 23 for testing. For the Chinese Penn Treebank (Xue et al., 2005), articles 001-270 and 440-1151 are used for training, articles 301-325 for development, and articles 271-300 for testing. We use the version of the corpus with the predicted PoS tags of Dyer et al. (2016). We train the $\Phi$ models on the output predicted by the corresponding $\Psi$ model.


We use the F-score from the evalb script. Speed is measured in sentences per second. As the problem is reduced to sequence labeling, we briefly comment on the accuracy (percentage of correctly predicted labels) of our baselines.
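To make the difference between label accuracy and bracketing F-score concrete, the sketch below computes a simplified labeled-bracket F-score over nested-tuple trees. It is an approximation for illustration only: it counts every nonterminal span and applies none of the normalizations of the evalb script (such as its treatment of punctuation), and the tree format is an assumption of this example.

```python
from collections import Counter


def brackets(tree):
    """Multiset of (label, start, end) spans of a nested-tuple tree."""
    spans = Counter()

    def walk(node, start):
        if isinstance(node, str):  # a word occupies one position
            return start + 1
        label, *children = node
        end = start
        for child in children:
            end = walk(child, end)
        spans[(label, start, end)] += 1
        return end

    walk(tree, 0)
    return spans


def bracket_f1(gold, predicted):
    """Harmonic mean of labeled bracket precision and recall."""
    gold_spans, predicted_spans = brackets(gold), brackets(predicted)
    matched = sum((gold_spans & predicted_spans).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(predicted_spans.values())
    recall = matched / sum(gold_spans.values())
    return 2 * precision * recall / (precision + recall)


gold = ("S", ("NP", "the", "dog"), ("VP", "chased", ("NP", "the", "cat")))
predicted = ("S", ("NP", "the", "dog"), ("VP", "chased", "the", "cat"))
print(round(bracket_f1(gold, predicted), 3))  # 0.857
```

In this example a single missing bracket lowers recall to 3/4 while precision stays at 1, which gives an F-score of 6/7.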

Source code


The models are run on a single thread of a CPU (an Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz) and on a consumer-grade GPU (a GeForce GTX 1080). In their sequence-to-sequence work, Vinyals et al. (2015) use a multi-core CPU (the number of threads was not specified), while we provide results on a single core for easier comparability. Parsing sentences on a CPU can be framed as an “embarrassingly parallel” problem (Hall et al., 2014), so speed can be made to scale linearly with the number of cores. We use the same batch size as Vinyals et al. (2015) for testing (128). A larger batch will likely result in faster parsing when executing the model on a GPU, but not necessarily on a CPU.

4.1 Results

Table 1 shows the performance of our baselines on the PTB development set. It is worth noting that since we are using different libraries to train the models, these might show some differences in terms of performance/speed beyond those expected in theory. For the BiLSTM model we test:

  • BiLSTM$_{m=1}$: It uses neither pretrained word embeddings nor character embeddings. The number of layers $m$ is set to 1.

  • BiLSTM$_{m=1,e}$: It adds pretrained word embeddings from GloVe (Pennington et al., 2014) for English and from the Gigaword corpus for Chinese (Liu and Zhang, 2017).

  • BiLSTM$_{m=1,e,ch}$: It includes character embeddings processed through a BiLSTM.

  • BiLSTM$_{m=2,e}$: $m$ is set to 2. No character embeddings.

  • BiLSTM$_{m=2,e,ch}$: $m$ is set to 2.

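| Model | F-score | Acc. | Sent/s (CPU) | Sent/s (GPU) |
|---|---|---|---|---|
| CRF$^{\Psi,\Phi}$ | 60.4 | 63.9 | 83 | - |
| MLP$_d^{\Psi,\Phi}$ | 72.6 | 78.1 | 16 | 49 |
| MLP$_e^{\Psi,\Phi}$ | 74.8 | 79.3 | 503 | 666 |
| CRF$^{\Phi'}$ | 60.3 | 65.4 | 6 | - |
| MLP$_d^{\Phi'}$ | 71.9 | 78.0 | 31 | 95 |
| MLP$_e^{\Phi'}$ | 75.4 | 79.7 | 342 | 890 |
| BiLSTM$_{m=1}^{\Psi,\Phi}$ | 87.2 | 88.9 | 144 | 541 |
| BiLSTM$_{m=1,e}^{\Psi,\Phi}$ | 88.3 | 89.8 | 144 | 543 |
| BiLSTM$_{m=1,e,ch}^{\Psi,\Phi}$ | 88.5 | 90.0 | 120 | 456 |
| BiLSTM$_{m=2,e}^{\Psi,\Phi}$ | 89.7 | 90.7 | 72 | 476 |
| BiLSTM$_{m=2,e,ch}^{\Psi,\Phi}$ | 89.9 | 90.9 | 65 | 405 |
| BiLSTM$_{m=1}^{\Phi'}$ | 87.3 | 89.3 | 206 | 941 |
| BiLSTM$_{m=1,e}^{\Phi'}$ | 88.5 | 90.1 | 209 | 957 |
| BiLSTM$_{m=1,e,ch}^{\Phi'}$ | 88.0 | 90.0 | 180 | 808 |
| BiLSTM$_{m=2,e}^{\Phi'}$ | 89.8 | 90.9 | 119 | 842 |
| BiLSTM$_{m=2,e,ch}^{\Phi'}$ | 89.7 | 90.9 | 109 | 716 |

Table 1: Performance of the proposed sequence labeling methods on the development set of the PTB. Superscript $\Psi,\Phi$ marks models that predict leaf unary chains in a separate pass, and $\Phi'$ marks models that predict them as part of an enriched label. Subscripts $d$ and $e$ on the MLP denote discrete and embedded inputs; on the BiLSTM, $m$ is the number of layers, $e$ denotes pretrained word embeddings and $ch$ character embeddings. For the CRF models the complexity is quadratic with respect to the number of labels, which causes CRF$^{\Phi'}$ to be particularly slow.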

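| Model | Testbed | CPU #Cores | CPU Sents/s | #GPU | GPU Sents/s | F-score |
|---|---|---|---|---|---|---|
| Sequence labeling | | | | | | |
| MLP$_e^{\Psi,\Phi}$ | WSJ23 | 1 | 501 | 1 | 669 | 74.1 |
| MLP$_e^{\Phi'}$ | WSJ23 | 1 | 349 | 1 | 929 | 74.8 |
| BiLSTM$_{m=1,e}^{\Psi,\Phi}$ | WSJ23 | 1 | 148 | 1 | 581 | 88.1 |
| BiLSTM$_{m=1,e}^{\Phi'}$ | WSJ23 | 1 | 221 | 1 | 1016 | 88.3 |
| BiLSTM$_{m=2,e,ch}^{\Psi,\Phi}$ | WSJ23 | 1 | 66 | 1 | 434 | 89.9 |
| BiLSTM$_{m=2,e,ch}^{\Phi'}$ | WSJ23 | 1 | 115 | 1 | 780 | 90.0 |
| BiLSTM$_{m=2,e}^{\Psi,\Phi}$ | WSJ23 | 1 | 74 | 1 | 506 | 90.0 |
| BiLSTM$_{m=2,e}^{\Phi'}$ | WSJ23 | 1 | 126 | 1 | 898 | 90.0 |
| Sequence-to-sequence (Vinyals et al., 2015) | | | | | | |
| 3-layer LSTM | WSJ23 | | | | | 70 |
| 3-layer LSTM + Attention | WSJ23 | Multi-core (number not specified) | 120 | | | 88.3 |
| Constituency parsing as dependency parsing | | | | | | |
| Fernández-González and Martins (2015) | WSJ23 | 1 | 41 | | | 90.2 |
| Chart-based parsers | | | | | | |
| Charniak (2000) | WSJ23 | 1 | 6 | | | 89.5 |
| Petrov and Klein (2007) | WSJ23 | 1 | 6 | | | 90.1 |
| Stern et al. (2017) | WSJ23 | 16* | 20 | | | 91.8 |
| Kitaev and Klein (2018) + ELMo (Peters et al., 2018) | WSJ23 | 2 | 70 | | | 95.1 |
| Chart-based parsers with GPU-specific implementation | | | | | | |
| Canny et al. (2013) | WSJ(30) | | | 1 | 250 | |
| Hall et al. (2014) | WSJ(40) | | | 1 | 404 | |
| Transition-based and other greedy constituent parsers | | | | | | |
| Zhu et al. (2013) | WSJ23 | 1 | 101 | | | 89.9 |
| Zhu et al. (2013)+P | WSJ23 | 1 | 90 | | | 90.4 |
| Dyer et al. (2016) | WSJ23 | 1 | 17 | | | 91.2 |
| Fernández-González and Gómez-Rodríguez (2018) | WSJ23 | 1 | 18 | | | 91.7 |
| Stern et al. (2017) | WSJ23 | 16* | 76 | | | 91.8 |
| Liu and Zhang (2017) | WSJ23 | | | | | 91.8 |
| Shen et al. (2018) | WSJ23 | 1 | 111 | | | 91.8 |

Table 2: Comparison against the state of the art. *Stern et al. (2017) report that they use a 16-core machine, but sentences are processed one at a time. Hence, they do not exploit inter-sentence parallelism, but they may gain some speed from intra-sentence parallelism. Speeds for other systems were either reported in the corresponding paper itself or extracted from Zhu et al. (2013) and Fernández-González and Gómez-Rodríguez (2018).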
| Model | F-score |
|---|---|
| MLP | 64.4 |
| BiLSTM | 84.4 |
| BiLSTM | 84.1 |
| BiLSTM | 84.4 |
| BiLSTM | 83.1 |
| Zhu et al. (2013)+P | 83.2 |
| Dyer et al. (2016) | 84.6 |
| Liu and Zhang (2017) | 86.1 |
| Shen et al. (2018) | 86.5 |
| Fernández-González and Gómez-Rodríguez (2018) | 86.8 |

Table 3: Performance on the CTB test set

The $\Psi,\Phi$ and the $\Phi'$ models obtain similar F-scores. When it comes to speed, the BiLSTMs$^{\Phi'}$ are notably faster than the BiLSTMs$^{\Psi,\Phi}$. $\Phi'$ models are expected to be more efficient, as leaf unary chains are handled implicitly. In practice, $\Phi'$ is a more expensive function to compute than the original $\Phi$, since the number of output labels is significantly larger, which reduces the expected gains with respect to the $\Psi,\Phi$ models. It is worth noting that our encoding is useful to train an MLP$_e$ with a decent sense of phrase structure, while being very fast. Paying attention to the differences between F-score and accuracy for each baseline, we notice the gap between them is larger for CRFs and MLPs. This shows the difficulties that these methods have, in comparison to the BiLSTM approaches, in predicting the correct label when a word $w_i$ has few common ancestors with $w_{i+1}$. For example, let -10x be the right (relative scale) label between $w_i$ and $w_{i+1}$, and let $l_1$ = -1x and $l_2$ = -9x be two possible wrong labels. In terms of accuracy it makes no difference whether a model predicts $l_1$ or $l_2$, but in terms of constituent F-score, the first will be much worse, as many closed parentheses will remain unmatched.

Tables 2 and 3 compare our best models against the state of the art on the PTB and CTB test sets. The performance corresponds to models without reranking strategies, unless otherwise specified.

5 Discussion

We are not aware of previous work that reduces constituency parsing to sequence labeling. The work that can be considered as the closest to ours is that of Vinyals et al. (2015), who address it as a sequence-to-sequence problem, where the output sequence has variable/unknown length. In this context, even a one-hidden-layer perceptron outperforms their 3-layer LSTM model without attention, while parsing hundreds of sentences per second. Our best models also outperform their 3-layer LSTM model with attention, and even a simple BiLSTM model with pre-trained GloVe embeddings obtains a similar performance. In terms of F-score, the proposed sequence labeling baselines still lag behind mature shift-reduce and chart parsers. In terms of speed, they are clearly faster than both CPU and GPU chart parsers and are at least on par with the fastest shift-reduce ones. If phrase-structure representations are needed in large-scale tasks where the speed of current systems makes parsing infeasible (Gómez-Rodríguez, 2017; Gómez-Rodríguez et al., 2017), we can use the simpler, less accurate models to get speeds well above any parser reported to date, although with a significant loss of accuracy.

It is also worth noting that in their recent work, published while this manuscript was under review, Shen et al. (2018) developed a mapping of binary trees with $n$ leaves to sequences of $n-1$ integers (Shen et al., 2018, Algorithm 1). This encoding is different from the ones presented here, as it is based on the height of lowest common ancestors in the tree, rather than their depth. While their purpose is also different from ours, as they use this mapping to generate training data for a parsing algorithm based on recursive partitioning using real-valued distances, their encoding could also be applied with our sequence labeling approach. However, it has the drawback that it only supports binarized trees, and some of its theoretical properties are worse for our goal, as the way to define the inverse of an arbitrary label sequence can be highly ambiguous: for example, a sequence of $n-1$ equal labels in this encoding can represent any binary tree with $n$ leaves.

6 Conclusion

We presented a new parsing paradigm, based on a reduction of constituency parsing to sequence labeling. We first described a linearization function that transforms a constituent tree (with n leaves) into a sequence of n − 1 labels that encodes it. We proved that this encoding function is total and injective for any tree without unary branches. We also discussed its limitations, namely the handling of unary branches and non-surjectivity, and showed how they can be addressed. Finally, we proposed a set of fast and strong baselines.


Acknowledgments

This work has received funding from the European Research Council (ERC), under the European Union’s Horizon 2020 research and innovation programme (FASTPARSE, grant agreement No 714150), from the TELEPARES-UDC project (FFI2014-51978-C2-2-R) and the ANSWER-ASAP project (TIN2017-85160-C2-1-R) from MINECO, and from Xunta de Galicia (ED431B 2017/01). We gratefully acknowledge NVIDIA Corporation for the donation of a GTX Titan X GPU.


References

  • Canny et al. (2013) John Canny, David Hall, and Dan Klein. 2013. A multi-teraflop constituency parser using GPUs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1898–1907.
  • Charniak (2000) Eugene Charniak. 2000. A maximum-entropy-inspired parser. In Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference, pages 132–139. Association for Computational Linguistics.
  • Choe and Charniak (2016) Do Kook Choe and Eugene Charniak. 2016. Parsing as language modeling. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2331–2336, Austin, Texas. Association for Computational Linguistics.
  • Collins (1997) Michael Collins. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of the eighth conference on European chapter of the Association for Computational Linguistics, pages 16–23. Association for Computational Linguistics.
  • Dyer et al. (2016) Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209. Association for Computational Linguistics.
  • Fernández-González and Gómez-Rodríguez (2018) Daniel Fernández-González and Carlos Gómez-Rodríguez. 2018. Faster Shift-Reduce Constituent Parsing with a Non-Binary, Bottom-Up Strategy. ArXiv e-prints.
  • Fernández-González and Martins (2015) Daniel Fernández-González and André F. T. Martins. 2015. Parsing as reduction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1523–1533. Association for Computational Linguistics.
  • Finkel et al. (2008) Jenny Rose Finkel, Alex Kleeman, and Christopher D. Manning. 2008. Efficient, feature-based, conditional random field parsing. In Proceedings of ACL-08: HLT, pages 959–967.
  • Gómez-Rodríguez (2017) Carlos Gómez-Rodríguez. 2017. Towards fast natural language parsing: FASTPARSE ERC Starting Grant. Procesamiento del Lenguaje Natural, 59.
  • Gómez-Rodríguez et al. (2017) Carlos Gómez-Rodríguez, Iago Alonso-Alonso, and David Vilares. 2017. How important is syntactic parsing accuracy? An empirical evaluation on rule-based sentiment analysis. Artificial Intelligence Review.
  • Hall et al. (2014) David Hall, Taylor Berg-Kirkpatrick, and Dan Klein. 2014. Sparser, better, faster GPU parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 208–217.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
  • Kiperwasser and Goldberg (2016) Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics, 4:313–327.
  • Kitaev and Klein (2018) Nikita Kitaev and Dan Klein. 2018. Constituency parsing with a self-attentive encoder. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia. Association for Computational Linguistics.
  • Kummerfeld et al. (2012) Jonathan K. Kummerfeld, David Hall, James R. Curran, and Dan Klein. 2012. Parser showdown at the Wall Street corral: An empirical investigation of error types in parser output. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1048–1059, Jeju Island, Korea. Association for Computational Linguistics.
  • Lafferty et al. (2001) John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pages 282–289, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
  • Liu and Zhang (2017) Jiangming Liu and Yue Zhang. 2017. In-order transition-based constituent parsing. Transactions of the Association for Computational Linguistics, 5:413–424.
  • Marcus et al. (1994) Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. 1994. The Penn Treebank: Annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology, HLT ’94, pages 114–119, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Narayan and Cohen (2016) Shashi Narayan and Shay B. Cohen. 2016. Optimizing spectral learning for parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1546–1556, Berlin, Germany. Association for Computational Linguistics.
  • Nivre (2003) Joakim Nivre. 2003. An efficient algorithm for projective dependency parsing. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT), pages 149–160.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
  • Peters et al. (2018) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237. Association for Computational Linguistics.
  • Petrov et al. (2006) Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pages 433–440. Association for Computational Linguistics.
  • Petrov and Klein (2007) Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 404–411.
  • Plank et al. (2016) Barbara Plank, Anders Søgaard, and Yoav Goldberg. 2016. Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 412–418, Berlin, Germany. Association for Computational Linguistics.
  • Rei and Søgaard (2018) Marek Rei and Anders Søgaard. 2018. Zero-shot sequence labeling: Transferring knowledge from sentences to tokens. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 293–302. Association for Computational Linguistics.
  • Rosenblatt (1958) Frank Rosenblatt. 1958. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386.
  • Sagae and Lavie (2005) Kenji Sagae and Alon Lavie. 2005. A classifier-based parser with linear run-time complexity. In Proceedings of the Ninth International Workshop on Parsing Technology, pages 125–132. Association for Computational Linguistics.
  • Schmid (1994) Helmut Schmid. 1994. Part-of-speech tagging with neural networks. In Proceedings of the 15th Conference on Computational Linguistics - Volume 1, COLING ’94, pages 172–176, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Sha and Pereira (2003) Fei Sha and Fernando Pereira. 2003. Shallow parsing with conditional random fields. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, pages 134–141. Association for Computational Linguistics.
  • Shen et al. (2018) Yikang Shen, Zhouhan Lin, Athul Paul Jacob, Alessandro Sordoni, Aaron Courville, and Yoshua Bengio. 2018. Straight to the tree: Constituency parsing with neural syntactic distance. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1171–1180. Association for Computational Linguistics.
  • Stern et al. (2017) Mitchell Stern, Jacob Andreas, and Dan Klein. 2017. A minimal span-based neural constituency parser. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 818–827, Vancouver, Canada. Association for Computational Linguistics.
  • Vinyals et al. (2015) Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems, pages 2773–2781.
  • Wang et al. (2006) Mengqiu Wang, Kenji Sagae, and Teruko Mitamura. 2006. A fast, accurate deterministic parser for Chinese. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, ACL-44, pages 425–432, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Xue et al. (2005) Naiwen Xue, Fei Xia, Fu-Dong Chiou, and Marta Palmer. 2005. The Penn Chinese Treebank: Phrase structure annotation of a large corpus. Natural language engineering, 11(2):207–238.
  • Yang and Zhang (2018) Jie Yang and Yue Zhang. 2018. NCRF++: An open-source neural sequence labeling toolkit. In Proceedings of ACL 2018, System Demonstrations, pages 74–79, Melbourne, Australia. Association for Computational Linguistics.
  • Zaremba et al. (2014) Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.
  • Zhu et al. (2013) Muhua Zhu, Yue Zhang, Wenliang Chen, Min Zhang, and Jingbo Zhu. 2013. Fast and accurate shift-reduce constituent parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 434–443.

Appendix A Setup configuration used to train our sequence labeling methods

Conditional Random Fields

We use the default configuration provided together with the sklearn-crfsuite library.

MultiLayer Perceptron

Both the discrete and the distributed perceptrons are implemented in Keras.

  • Training hyperparameters. The model is trained for up to 30 epochs, with early stopping (patience=4). We use stochastic gradient descent (SGD) to optimize the objective function. The initial learning rate is set to 0.1.

  • Layer and embedding sizes. The dimension of the hidden layer is set to 100. For the perceptron fed with embeddings, we use 100 and 20 dimensions to represent a word and its PoS tag, respectively.
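
As an illustration of the stopping criterion above, the following minimal sketch shows one way to combine a 30-epoch cap with a patience of 4. It is not the Keras implementation: the function name `stopping_epoch` and the example scores are hypothetical, and we assume that the monitored quantity is a development score where higher is better.

```python
def stopping_epoch(dev_scores, max_epochs=30, patience=4):
    """Return (last_epoch_run, best_epoch), both 0-indexed.

    Training stops once `patience` consecutive epochs fail to improve on the
    best development score seen so far, or when `max_epochs` is reached.
    """
    best_score = float("-inf")
    best_epoch = -1
    epochs_without_gain = 0
    last_epoch = -1
    for epoch, score in enumerate(dev_scores[:max_epochs]):
        last_epoch = epoch
        if score > best_score:
            best_score, best_epoch = score, epoch
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break
    return last_epoch, best_epoch


# Hypothetical development scores, one per epoch.
scores = [0.50, 0.60, 0.70, 0.69, 0.68, 0.70, 0.65, 0.90]
print(stopping_epoch(scores))  # (6, 2): the late 0.90 is never reached
```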

Bidirectional Long Short-Term Memory

We relied on the NCRF++ framework Yang and Zhang (2018).

  • Training hyperparameters. We use mini-batching, with a batch size of 8 during training. As optimizer, we use SGD with an initial learning rate of 0.2, a momentum of 0.9 and a linear decay of 0.05. We train the model for up to 100 epochs and keep the model that performs best on the development set.

  • Layer and embedding sizes. We use 100, 30 and 20 dimensions to represent a word, a PoS tag and a character embedding, respectively. The output hidden layer of the character embedding layer has size 50. The left-to-right and right-to-left LSTMs each generate a hidden vector of size 400.
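
For reference, the sketch below shows how the learning rate would evolve if the decay of 0.05 is applied as an inverse-time schedule, lr_t = lr_0 / (1 + decay · t), with t the epoch index. This form of the schedule is an assumption about how the toolkit applies the decay and is not stated above; the function name `decayed_lr` is hypothetical.

```python
def decayed_lr(epoch, init_lr=0.2, decay=0.05):
    """Learning rate at a given 0-indexed epoch (assumed inverse-time decay)."""
    return init_lr / (1.0 + decay * epoch)


# Print the assumed learning rate at a few selected epochs.
for epoch in (0, 1, 20, 99):
    print(epoch, round(decayed_lr(epoch), 4))
```

Under this assumption, the learning rate halves after 20 epochs, going from 0.2 to 0.1.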