Arc-Standard Spinal Parsing with Stack-LSTMs

09/01/2017 · Miguel Ballesteros, et al. · IBM, NAVER LABS Corp.

We present a neural transition-based parser for spinal trees, a dependency representation of constituent trees. The parser uses Stack-LSTMs that compose constituent nodes with dependency-based derivations. In experiments, we show that this model adapts to different styles of dependency relations, but this choice has little effect for predicting constituent structure, suggesting that LSTMs induce useful states by themselves.




1 Introduction

There is a clear trend in neural transition systems for parsing sentences into dependency trees Titov and Henderson (2007); Chen and Manning (2014); Dyer et al. (2015); Andor et al. (2016) and constituent trees Henderson (2004); Vinyals et al. (2014); Watanabe and Sumita (2015); Dyer et al. (2016); Cross and Huang (2016b). These transition systems use a relatively simple set of operations to parse in linear time, and rely on the ability of neural networks to infer and propagate hidden structure through the derivation. This contrasts with state-of-the-art factored linear models, which make explicit use of higher-order information to capture non-local phenomena in a derivation.

In this paper, we present a transition system for parsing sentences into spinal trees, a type of syntactic tree that explicitly represents dependency and constituency structure together. This representation is inherent in head-driven models Collins (1997) and was used by Carreras et al. (2008) with a higher-order factored model. We extend the Stack-LSTMs of Dyer et al. (2015) from dependency to spinal parsing by augmenting the composition operations to include constituent information in the form of spines. To parse sentences, we use the extension by Cross and Huang (2016a) of the arc-standard system for dependency parsing Nivre (2004). This parsing system generalizes shift-reduce methods Henderson (2003); Sagae and Lavie (2005); Zhu et al. (2013); Watanabe and Sumita (2015) to be sensitive to constituent heads, as opposed to, for example, parsing a constituent from left to right.

In experiments on the Penn Treebank, we look at how sensitive our method is to different styles of dependency relations, and show that spinal models based on leftmost or rightmost heads are as good as or better than models using linguistic dependency relations such as Stanford Dependencies De Marneffe et al. (2006) or those by Yamada and Matsumoto (2003). This suggests that Stack-LSTMs figure out effective ways of modeling non-local phenomena within constituents. We also show that turning a dependency Stack-LSTM into a spinal one yields some improvements.

2 Spinal Trees

In a spinal tree each token is associated with a spine. The spine of a token is a (possibly empty) vertical sequence of non-terminal nodes for which the token is the head word. A spinal dependency is a binary directed relation from a node of the head spine to a dependent spine. In this paper we consider projective spinal trees. Figure 1 shows a constituency tree from the Penn Treebank together with two spinal trees that use alternative head identities: the spinal tree in Figure 1(b) uses Stanford Dependencies De Marneffe et al. (2006), while the spinal tree in Figure 1(c) takes the leftmost word of any constituent as the head. It is straightforward to map a constituency tree with head annotations to a spinal tree, and to map a spinal tree to a constituency or a dependency tree.

[.S And [.NP [.NP their suspicions ] [.PP of [.NP each other ] ] ] [.VP run [.ADVP deep ] ] ]

(a) A constituency tree from the Penn Treebank.


[Diagram omitted: spinal tree drawn over "And their suspicions of each other run deep .", with spines NP+NP on "suspicions", PP on "of", NP on "other", VP+S on "run", and ADVP on "deep".]

(b) The spinal tree of (a) using Stanford Dependency heads.


[Diagram omitted: spinal tree drawn over the same sentence, with spines S on "And", NP+NP on "their", PP on "of", NP on "each", VP on "run", and ADVP on "deep".]

(c) The spinal tree of (a) using leftmost heads.
Figure 1: A constituency tree and two spinal trees.
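The mapping from spinal trees to plain dependency trees mentioned above can be made concrete with a small sketch; the encoding below (illustrative names, not taken from the paper) stores a spine per token plus a set of spinal dependencies, and drops the constituent nodes to recover a dependency tree:

```python
from dataclasses import dataclass, field

# A hypothetical minimal encoding of a spinal tree: each token carries a
# spine (the bottom-up list of non-terminals it heads), and each spinal
# dependency connects a node of a head spine to a dependent spine.
@dataclass
class SpinalToken:
    index: int                                   # position in the sentence
    word: str
    spine: list = field(default_factory=list)    # e.g. ["NP", "NP"], bottom-up

@dataclass
class SpinalDep:
    head: int     # index of the head token
    level: int    # which node of the head spine (0 = bottom)
    dep: int      # index of the dependent token

def to_dependency_tree(tokens, deps):
    """Dropping the constituent nodes yields a plain dependency tree:
    each spinal dependency simply becomes head -> dependent."""
    heads = {t.index: None for t in tokens}      # None marks the root
    for d in deps:
        heads[d.dep] = d.head
    return heads

# "their suspicions": the token "suspicions" heads an NP over "their"
tokens = [SpinalToken(0, "their"), SpinalToken(1, "suspicions", ["NP"])]
deps = [SpinalDep(head=1, level=0, dep=0)]
assert to_dependency_tree(tokens, deps) == {0: 1, 1: None}
```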

3 Arc-Standard Spinal Parsing

We use the transition system by Cross and Huang (2016a), which extends the arc-standard system by Nivre (2004) to constituency parsing in a head-driven way, i.e. spinal parsing. We describe it here for completeness. The parsing state is a tuple (B, S, A), where B is a buffer of input tokens to be processed; S is a stack of tokens with partial spines; and A is a set of spinal dependencies. The operations are the following:

  • shift:

    Moves the first token of the buffer onto the stack, where it becomes a base spine consisting of that single token.

  • node(X):

    Adds a non-terminal node X on top of the spine at the top of the stack. At this point, X can receive left and right children (by the operations below) until it is closed (by adding a node above it, or by reducing the spine with an arc operation that takes this spine as dependent). This single operation extends the arc-standard system to spinal parsing.

  • left-arc:

    The stack must have two elements: the top one is a spine whose top node is X, and the second one can be a token or a spine. The operation adds a spinal dependency from the node X to the second element, which is reduced from the stack. The dependent becomes the leftmost child of the constituent X.

  • right-arc:

    This operation is symmetric to left-arc: it adds a spinal dependency from the top node X of the second spine in the stack to the top element, which is reduced from the stack and becomes the rightmost child of X.

Transition | Buffer                 | Stack                         | New Arc
(init)     | [And, their, …]        | []                            |
shift      | [their, suspicions, …] | [And]                         |
shift      | [suspicions, of, …]    | [And, their]                  |
shift      | [of, each, …]          | […, their, susp.]             |
node(NP)   | [of, each, …]          | […, their, susp. + NP]        |
left-arc   | [of, each, …]          | [And, susp. + NP]             | (NP, their)
node(NP)   | [of, each, …]          | [And, susp. + NP + NP]        |
shift      | [each, other, …]       | […, susp. + NP + NP, of]      |
node(PP)   | [each, other, …]       | […, susp. + NP + NP, of + PP] |
shift      | [other, run, …]        | […, of + PP, each]            |
shift      | [run, deep, …]         | […, each, other]              |
node(NP)   | [run, deep, …]         | […, each, other + NP]         |
left-arc   | [run, deep, …]         | […, of + PP, other + NP]      | (NP, each)
right-arc  | [run, deep, …]         | […, susp. + NP + NP, of + PP] | (PP, NP)
right-arc  | [run, deep, …]         | [And, susp. + NP + NP]        | (NP, PP)

Figure 2: Initial steps of the arc-standard derivation for the spinal tree in Figure 1(b), until the tree headed at “suspicions” is fully built. A node added to a spine is noted by appending its non-terminal to the head token, e.g. “susp. + NP”.
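The transition sequence of Figure 2 can be replayed with a small simulator; the sketch below is an illustrative reimplementation of the four operations, not the authors' code (stack elements are (token index, spine) pairs):

```python
# A minimal sketch of the arc-standard spinal transition system of Section 3.
# Token indices are 0-based; a stack element is (token_index, spine).

def parse(tokens, actions):
    buffer = list(range(len(tokens)))          # indices of unprocessed tokens
    stack, arcs = [], []
    for act in actions:
        if act == "shift":
            stack.append((buffer.pop(0), []))  # base spine: just the token
        elif act.startswith("node("):
            i, spine = stack[-1]
            stack[-1] = (i, spine + [act[5:-1]])       # add non-terminal on top
        elif act == "left-arc":
            dep = stack.pop(-2)                # second element is the dependent
            head = stack[-1]
            arcs.append((head[0], head[1][-1], dep[0]))  # (head, node, dep)
        elif act == "right-arc":
            dep = stack.pop()                  # top element is the dependent
            head = stack[-1]
            arcs.append((head[0], head[1][-1], dep[0]))
    return stack, arcs

words = ["And", "their", "suspicions", "of", "each", "other", "run", "deep"]
acts = ["shift", "shift", "shift", "node(NP)", "left-arc", "node(NP)",
        "shift", "node(PP)", "shift", "shift", "node(NP)", "left-arc",
        "right-arc", "right-arc"]
stack, arcs = parse(words, acts)
# After these steps the NP headed at "suspicions" (index 2) is fully built:
assert arcs == [(2, "NP", 1), (5, "NP", 4), (3, "PP", 5), (2, "NP", 3)]
```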

At a high level, the system builds the spinal tree headed at a token w by:

  1. Shifting w to the top of the stack. By induction, the left children of w are already in the stack and are complete.

  2. Adding a constituency node X to w's spine.

  3. Adding left children to X in head-outwards order with left-arc; these are removed from the stack.

  4. Adding right children to X in head-outwards order with right-arc; these are built recursively.

  5. Repeating steps 2–4 for as many nodes as there are in the spine of w.

Figure 2 shows an example of a derivation. The process is initialized with all sentence tokens in the buffer, an empty stack, and an empty set of dependencies. Termination is always attainable and occurs when the buffer is empty and the stack holds a single element, namely the spine of the full sentence head. This transition system is sound and complete with respect to the class of projective spinal trees, in the same way as the arc-standard system is for projective dependency trees Nivre (2008). A derivation has 2n + c − 1 steps, where n is the sentence length and c is the number of constituents in the derivation: n shifts, c node operations, and n − 1 arcs.

We note that the system naturally handles constituents of arbitrary arity. In particular, unary productions add one node in the spine without any children. In practice we put a hard bound (set to 10 in our experiments) on the number of consecutive unary productions in a spine, to ensure that in the early training steps the model does not generate unreasonably long spines. We also note that there is a certain degree of non-determinism: left and right arcs (steps 3 and 4) can be mixed as long as the children of a node are added in head-outwards order. At training time, our oracle derivations impose the order above (first left arcs, then right arcs), but the parsing system runs freely. Finally, the system can easily be extended with dependency labels, but we do not use them.

4 Spinal Stack-LSTMs

Dyer et al. (2015) presented an arc-standard parser that uses Stack-LSTMs, an extension of LSTMs Hochreiter and Schmidhuber (1997) for transition-based systems that maintains an embedding for each element in the stack (we refer interested readers to Dyer et al. (2015) and Ballesteros et al. (2017)). Our model is based on the same architecture, with the addition of the node() action. The state of the algorithm presented in Section 3 is represented by the contents of the stack, the buffer, and the history of actions, each encoded with a Stack-LSTM. This state representation is then used to predict the next action to take.
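The defining trick of a Stack-LSTM — push and pop in constant time while always having a fixed-size summary of the whole stack — can be sketched as follows, with a plain tanh-RNN cell standing in for the LSTM and made-up dimensions:

```python
import numpy as np

# A toy sketch of the Stack-LSTM idea (Dyer et al., 2015): keep one recurrent
# state per stack position, so popping just reverts to the previous state.
rng = np.random.default_rng(0)
D = 8
W_h = rng.normal(size=(D, D)) * 0.1   # recurrent weights (randomly initialized)
W_x = rng.normal(size=(D, D)) * 0.1   # input weights

class StackRNN:
    def __init__(self):
        self.states = [np.zeros(D)]   # states[i] summarizes the first i pushes

    def push(self, x):
        h = np.tanh(W_h @ self.states[-1] + W_x @ x)
        self.states.append(h)

    def pop(self):
        self.states.pop()             # revert to the previous summary in O(1)

    def summary(self):
        return self.states[-1]        # fixed-size embedding of the whole stack

s = StackRNN()
x = rng.normal(size=D)
before = s.summary().copy()
s.push(x)
s.pop()
# popping restores the exact pre-push summary
assert np.allclose(s.summary(), before)
```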


When the parser predicts a left-arc or right-arc action, we compose the vector representations of the head and dependent elements; this is equivalent to the composition presented by Dyer et al. (2015). The representation is obtained recursively as follows:

    c = tanh(W [h; d] + e)

where W is a learned parameter matrix, h represents the head spine, d represents the dependent spine (or token, if the dependent is a single token), and e is a bias term.

Similarly, when the parser predicts a node(X) action, we compose the embedding n of the non-terminal symbol that is added with the representation s of the element at the top of the stack, which may represent a spine or a single terminal symbol. The representation is obtained recursively as follows:

    s' = tanh(U [s; n] + b)    (1)

where U is a learned parameter matrix, s represents the token at the top of the stack (and its partial spine, if non-terminals have been added to it), n represents the non-terminal symbol that we are adding to s, and b is a bias term.
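Both composition steps can be sketched in a few lines. The exact parameterization is an assumption reconstructed from the text (a tanh over a learned linear map of the concatenated vectors); dimensions and initialization are made up:

```python
import numpy as np

# Sketch of the two composition steps: arc composition c = tanh(W[h; d] + e)
# and node composition s' = tanh(U[s; n] + b). Shapes are illustrative.
rng = np.random.default_rng(1)
D = 4
W, e = rng.normal(size=(D, 2 * D)), rng.normal(size=D)
U, b = rng.normal(size=(D, 2 * D)), rng.normal(size=D)

def compose_arc(h, d):
    """New representation for a head spine h after attaching dependent d."""
    return np.tanh(W @ np.concatenate([h, d]) + e)

def compose_node(s, n):
    """New representation for spine s after adding non-terminal embedding n."""
    return np.tanh(U @ np.concatenate([s, n]) + b)

h, d, n = (rng.normal(size=D) for _ in range(3))
s = compose_node(h, n)      # node(X): extend the spine
c = compose_arc(s, d)       # left-/right-arc: fold in the dependent
assert c.shape == (D,) and np.all(np.abs(c) <= 1.0)
```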

As shown by Kuncoro et al. (2017), composition is an essential component in this kind of parsing model.

5 Related Work

Collins (1997) first proposed head-driven derivations for constituent parsing, which is the key idea behind spinal parsing, and Carreras et al. (2008) later developed a higher-order graph-based parser for this representation. Transition systems for spinal parsing are not new. Ballesteros and Carreras (2015) presented an arc-eager system that labels dependencies with constituent nodes and builds the spinal tree in post-processing. Hayashi et al. (2016) and Hayashi and Nagata (2016) presented a bottom-up arc-standard system that assigns a full spine with the shift operation, while ours builds spines incrementally and does not depend on a fixed set of full spines. Our method differs from shift-reduce constituent parsers Henderson (2003); Sagae and Lavie (2005); Zhu et al. (2013); Watanabe and Sumita (2015) in that it is head-driven. Cross and Huang (2016a) extended the arc-standard system to constituency parsing, which in fact corresponds to spinal parsing. The main difference from that work lies in the neural model: they use sequential BiLSTMs, while we use Stack-LSTMs and composition functions. Finally, dependency parsers have been extended to constituency parsing by encoding the additional structure in the dependency labels, in different ways (Hall et al., 2007; Hall and Nivre, 2008; Fernández-González and Martins, 2015).

6 Experiments

Model                   | LR    | LP    | F1    | UAS
Leftmost heads          | 91.18 | 90.93 | 91.05 | -
Leftmost h., no n-comp  | 90.20 | 90.76 | 90.48 | -
Rightmost heads         | 91.03 | 91.20 | 91.11 | -
Rightmost h., no n-comp | 90.64 | 91.24 | 90.04 | -
SD heads                | 90.75 | 91.11 | 90.93 | 93.49
SD heads, no n-comp     | 90.38 | 90.58 | 90.48 | 93.16
SD heads, dummy spines  | -     | -     | -     | 93.30
YM heads                | 90.82 | 90.84 | 90.83 | -

Table 1: Development results for spinal models, in terms of labeled recall (LR), labeled precision (LP), and F1 for constituents, and unlabeled attachment score (UAS) against Stanford dependencies. Spinal models are trained using different head annotations (see text). Models labeled with “no n-comp” do not use node compositions. The model labeled with “dummy spines” corresponds to a standard dependency model.

We experiment with Stack-LSTM spinal models trained with different types of head rules. Our goal is to check how the head identities, which define the derivation sequence, interact with the ability of Stack-LSTMs to propagate latent information beyond the local scope of each action. We use the Penn Treebank Marcus et al. (1993) with standard splits, and the same POS tags as Dyer et al. (2015).

We start by training four spinal models, varying the head rules that define the spinal derivations (it is simple to obtain a spinal tree given a constituency tree and a corresponding dependency tree; we assume that the dependency tree is projective and nested within the constituency tree, which holds for the head rules we use):

  • Leftmost heads, as in Figure 1(c).

  • Rightmost heads.

  • Stanford Dependencies (SD) De Marneffe et al. (2006), as in Figure 1(b).

  • Yamada and Matsumoto heads (YM) Yamada and Matsumoto (2003).
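As an illustration of how head rules induce spines, the leftmost-head rule can be implemented with a single traversal: each constituent is headed by the head of its leftmost child, so every non-terminal lands on the spine of the first token it dominates. The tuple encoding of trees below is illustrative:

```python
# A sketch of spine extraction under the leftmost-head rule. Trees are encoded
# as (label, [children]) for non-terminals and plain strings for leaves.

def leftmost_spines(tree):
    """Returns ({token_index: [non-terminals, bottom-up]}, head_index)."""
    spines, counter = {}, [0]

    def walk(node):
        if isinstance(node, str):              # leaf: assign the next position
            i = counter[0]
            counter[0] += 1
            spines[i] = spines.get(i, [])
            return i
        label, children = node
        heads = [walk(c) for c in children]
        spines[heads[0]].append(label)         # leftmost child's head heads us
        return heads[0]

    head = walk(tree)
    return spines, head

# (S And (NP (NP their suspicions) (PP of (NP each other))) (VP run deep))
t = ("S", ["And",
           ("NP", [("NP", ["their", "suspicions"]),
                   ("PP", ["of", ("NP", ["each", "other"])])]),
           ("VP", ["run", "deep"])])
spines, head = leftmost_spines(t)
assert head == 0 and spines[0] == ["S"]        # "And" heads the sentence
assert spines[1] == ["NP", "NP"]               # "their" heads both NPs
```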

Table 1 presents constituency and dependency metrics on the development set. The model using rightmost heads works best at 91.11 F1, followed by the one using leftmost heads. It is worth noting that the two models using structural head identities (rightmost or leftmost) work better than those using linguistic ones. This suggests that the Stack-LSTM model already finds useful head-child relations within a constituent by parsing it from the left (or right), even in the presence of non-local interactions; in this setting, linguistic head rules are not useful.

Table 1 also shows two ablation studies. First, we turn off the composition of constituent nodes into the latent derivations (Eq. 1). The ablated models, tagged “no n-comp”, perform 0.5 to 1 F1 points worse, showing the benefit of adding constituent structure. Second, we check whether constituent structure is useful for dependency parsing metrics. To this end, we emulate a dependency parser with a spinal model by taking standard Stanford dependency trees and adding a dummy constituent for every head together with all its children. This model, tagged “SD heads, dummy spines”, is slightly outperformed by the “SD heads” model using true spines, though the margin is small.
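The dummy-spine emulation can be sketched as follows; the label "X" is an arbitrary placeholder and the encoding is illustrative, not the authors' code:

```python
# Sketch of the "dummy spines" emulation: each head with at least one child
# receives a single dummy constituent ("X") covering the head and its children.

def dummy_spinal(heads):
    """heads: {token_index: head_index, or None for the root}.
    Returns ({token: spine}, [(head, node, dependent), ...])."""
    spines = {t: [] for t in heads}
    deps = []
    for t, h in sorted(heads.items()):
        if h is not None:
            if not spines[h]:
                spines[h] = ["X"]      # one dummy node per head with children
            deps.append((h, "X", t))
    return spines, deps

# "their suspicions matter": their -> suspicions -> matter (illustrative)
heads = {0: 1, 1: 2, 2: None}
spines, deps = dummy_spinal(heads)
assert spines == {0: [], 1: ["X"], 2: ["X"]}
assert deps == [(1, "X", 0), (2, "X", 1)]
```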

Tables 2 and 3 present results on the test set, for constituent and dependency parsing respectively. As Table 2 shows, our model is competitive with the best parsers: the generative parsers of Choe and Charniak (2016b), Dyer et al. (2016) and Kuncoro et al. (2017) are ahead, but our parser is at the same level as or better than the rest. The most similar system is that of Ballesteros and Carreras (2015), whose performance our parser improves significantly. For dependency parsing, our model is worse than those trained with exploration, such as Kiperwasser and Goldberg (2016) and Ballesteros et al. (2016), but it slightly improves over the parser of Dyer et al. (2015) with static training. The systems that obtain dependencies by transforming phrase structures with conversion rules and that use generative training lead the rest.

Parser                               | LR    | LP    | F1
Spinal (leftmost)                    | 90.30 | 90.54 | 90.42
Spinal (rightmost)                   | 90.23 | 90.77 | 90.50
Ballesteros and Carreras (2015)      | 88.7  | 89.2  | 89.0
Vinyals et al. (2014) (PTB-only)     |       |       | 88.3
Cross and Huang (2016a)              |       |       | 89.9
Choe and Charniak (2016b) (PTB-only) |       |       | 91.2
Choe and Charniak (2016b) (Semi-sup) |       |       | 93.8
Dyer et al. (2016) (Discr.)          |       |       | 91.2
Dyer et al. (2016) (Gen.)            |       |       | 93.3
Kuncoro et al. (2017) (Gen.)         |       |       | 93.5
Liu and Zhang (2017)                 | 91.3  | 92.1  | 91.7

Table 2: Constituency results on the PTB test set.

Parser                                       | UAS (test)
Spinal, PTB spines + SD (TB-greedy)          | 93.15
Spinal, dummy spines + SD (TB-greedy)        | 93.10
Dyer et al. (2015) (TB-greedy)               | 93.1
Cross and Huang (2016a)                      | 93.4
Ballesteros et al. (2016) (TB-dynamic)       | 93.6
Kiperwasser and Goldberg (2016) (TB-dynamic) | 93.9
Andor et al. (2016) (TB-beam)                | 94.6
Kuncoro et al. (2016) (Graph-ensemble)       | 94.5
Choe and Charniak (2016b)* (Semi-sup)        | 95.9
Kuncoro et al. (2017)* (Generative)          | 95.8

Table 3: Stanford Dependency results (UAS) on PTB test set. Parsers marked with * calculate dependencies by transforming phrase-structures with conversion rules.

7 Conclusions

We have presented a neural model based on Stack-LSTMs for spinal parsing, using a simple extension of arc-standard transition parsing that adds constituent nodes to the dependency derivation. Our experiments suggest that Stack-LSTMs can figure out useful internal structure within constituents, and that the parser might work better without providing linguistically-derived head words. Overall, our spinal neural method is simple, efficient, and very accurate, and might prove useful to model constituent trees with dependency relations.