Dependency Parsing with LSTMs: An Empirical Evaluation

04/22/2016 ∙ Adhiguna Kuncoro et al. ∙ Johns Hopkins University, Carnegie Mellon University

We propose a transition-based dependency parser using Recurrent Neural Networks with Long Short-Term Memory (LSTM) units. This extends the feedforward neural network parser of Chen and Manning (2014) and enables modelling of entire sequences of shift/reduce transition decisions. On the Google Web Treebank, our LSTM parser is competitive with the best feedforward parser on overall accuracy and notably achieves an improvement of more than 3% on long-range dependencies, which have proved difficult for previous transition-based parsers due to error propagation and limited context information. Our findings additionally suggest that dropout regularisation on the embedding layer is crucial to improve the LSTM's generalisation.


1 Introduction

Two complementary approaches to transition-based dependency parsers have emerged recently. The feature engineering approach relies on hand-crafted feature templates to model interactions between sparse lexical features. While manually crafting these feature templates requires substantial expertise and extensive trial-and-error, this approach has led to state-of-the-art parsers in many languages [Buchholz and Marsi2006, Zhang and Nivre2011].

In contrast, the neural network approach enables automatic learning of feature combinations through non-linear hidden layers and mitigates sparsity issues by sharing similar low-dimensional distributed representations for related words [Bengio et al.2003].

In this work, we explore new model architectures under the neural network approach. In particular, we address the issue that the feedforward architecture of the Chen and Manning parser trains on each oracle configuration independently of the others, disregarding the fact that the oracles for each training sentence form a whole sequence of intertwined decisions. Our proposed extension uses a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) units [Hochreiter and Schmidhuber1997]. At each time step of the transition system, the LSTM has theoretical access to the entire history of past decisions (i.e. shift or reduce). LSTMs are naturally suited for modelling sequences and have shown promising results in e.g. machine translation [Sutskever et al.2014] and text-vision modelling [Venugopalan et al.2014].

We particularly focus on the LSTM’s performance in identifying long-range dependencies. Such dependencies have proved difficult for most greedy transition-based parsers [McDonald and Nivre2007], including our feedforward baselines, which train on each oracle independently. This difficulty can be attributed to two main reasons: 1) most long-range dependencies are ambiguous, while the classifiers only have access to a limited context window, and 2) longer arcs are constructed after shorter arcs in transition-based parsing, increasing the chance of error propagation. In contrast, our LSTM has the key abilities of modelling whole sequences of training oracles and memorising all past context information, both of which are likely beneficial for longer dependencies.

Despite the LSTM’s theoretical advantages, in practice it is more prone to overfitting than the feedforward architecture, even with the same number of parameters. An additional contribution of this work is an empirical investigation that suggests that dropout [Srivastava et al.2014], particularly when applied to the embedding layer, substantially improves the LSTM’s generalisation ability regardless of hidden layer size.

2 LSTM Parsing Model

Figure 1: Left: Our LSTM architecture. Right: the feedforward architecture of Chen and Manning (2014). Connections with dropout are denoted by dashed lines.

2.1 Baseline Model

Our model extends that of Chen and Manning (2014), which uses a feedforward neural network to predict the next transition of an arc-standard system. In arc-standard, a configuration consists of a buffer b (holding the input words), a stack s (holding the partial parse trees), and a set of dependency arcs A. The parse tree is built by successively applying one of the following transitions (a minimal code sketch of this system is given after the list):

  • SHIFT: move the next word from the buffer b onto the stack s

  • LEFT-ARC(l) / RIGHT-ARC(l): add a left/right arc with label l between the top two words on s
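As a concrete illustration, here is a minimal sketch of the arc-standard system in Python (the class and method names are ours, not the authors' implementation):

```python
# Minimal arc-standard transition system (illustrative sketch only).
# A configuration is (stack, buffer, arcs).

class ArcStandard:
    def __init__(self, words):
        self.stack = []                          # s: indices of partially built subtrees
        self.buffer = list(range(len(words)))    # b: indices of remaining input words
        self.arcs = set()                        # A: (head, label, dependent) triples

    def shift(self):
        # SHIFT: move the next word from the buffer b onto the stack s
        self.stack.append(self.buffer.pop(0))

    def left_arc(self, label):
        # LEFT-ARC(l): add arc s[-1] --l--> s[-2] and remove s[-2] from the stack
        head, dep = self.stack[-1], self.stack[-2]
        self.arcs.add((head, label, dep))
        del self.stack[-2]

    def right_arc(self, label):
        # RIGHT-ARC(l): add arc s[-2] --l--> s[-1] and remove s[-1] from the stack
        head, dep = self.stack[-2], self.stack[-1]
        self.arcs.add((head, label, dep))
        self.stack.pop()

    def is_terminal(self):
        # Parsing ends when the buffer is empty and only the root remains on the stack
        return not self.buffer and len(self.stack) == 1
```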

The feature vector x_t is a concatenation of embeddings of the top 3 words on s and b, the first and second left-/right-most children of the top two words on s, and the leftmost-of-leftmost / rightmost-of-rightmost children of the top two words on s. At each configuration at time t, the neural network first computes the hidden layer h_t from the input x_t (applying a non-linear activation function), then calculates the probability of each transition in the output vector p_t. The dimension of x_t is 2400 in our experiments, and the dimension of p_t is 2N_l + 1, the number of possible transitions, where N_l is the number of dependency label types. D(·) denotes a dropout operator, which randomly sets elements to 0 with probability p_drop.
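To make the baseline's scoring step concrete, a small sketch follows (the weight names W1, b1, W2 and the dropout placement are illustrative assumptions, not taken from the paper; the actual architecture is the one shown in Figure 1):

```python
import numpy as np

def feedforward_scores(x, W1, b1, W2, p_drop=0.5, train=True, activation=np.tanh):
    """Sketch of the feedforward baseline's scoring of one configuration.

    x  : concatenation of the feature embeddings (dimension ~2400)
    W1 : embedding-to-hidden weights, b1 : hidden bias
    W2 : hidden-to-output weights, one row per possible transition (2*N_l + 1)
    """
    if train:
        # Dropout operator D(.): zero out elements with probability p_drop
        x = x * (np.random.rand(*x.shape) > p_drop)
    h = activation(W1 @ x + b1)                     # hidden layer (Tanh or Cubic)
    scores = W2 @ h
    scores = scores - scores.max()                  # numerical stability
    return np.exp(scores) / np.exp(scores).sum()    # p_t over all transitions
```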

2.2 Our LSTM Model

Our LSTM model (shown in Figure 1) uses the same features x_t as Chen and Manning (2014), but importantly adds new inputs based on past information, such as the previous hidden layer h_{t-1} and memory cell c_{t-1}. The addition of the previous states leads to recurrence and enables modelling and training of the entire sequence of transitions.

While recurrence may cause the “vanishing gradient” problem [Bengio et al.1994], the LSTM architecture addresses this by introducing memory cells that can store information over long time intervals and keep gradients from diminishing. Input gates i_t control what is stored in the memory cell c_t, and output gates o_t control whether the stored information is used in further computations. This allows information from the beginning of the sentence to influence transition actions at the end of the sentence. Forget gates f_t can erase the information in the current memory cell.

The following equations describe our LSTM model with peephole connections [Gers et al.2002], following the formulation of Graves (2013), with dropout applied as in Zaremba et al. (2014).
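A standard way to write these (the conventional peephole formulation of Graves (2013), with the dropout operator D(·) applied to the input; the weight names below are the usual ones and are not necessarily the paper's own notation):

```latex
\begin{align*}
i_t &= \sigma\!\left(W_{xi}\, D(x_t) + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i\right) \\
f_t &= \sigma\!\left(W_{xf}\, D(x_t) + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\!\left(W_{xc}\, D(x_t) + W_{hc} h_{t-1} + b_c\right) \\
o_t &= \sigma\!\left(W_{xo}\, D(x_t) + W_{ho} h_{t-1} + W_{co} c_t + b_o\right) \\
h_t &= o_t \odot \tanh(c_t) \\
p_t &= \mathrm{softmax}\!\left(W_{hp} h_t + b_p\right)
\end{align*}
```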

Crucially, the LSTM not only uses the input x_t in its prediction p_t, but also exploits the values in the previous memory cell c_{t-1} and hidden layer h_{t-1} through the gates i_t, f_t, and o_t. The values of these gates are bounded between 0 and 1 due to the sigmoid σ, so multiplication with other components modulates what information is passed through.

Given training sentences with gold parse trees, our training data is a set of sequences of configurations and oracle transition actions, one pair for each time step t of each sentence i. We maximise the log-likelihood of the oracle transition actions given by Equation (1), where θ denotes the set of parameters, including the word, POS, and label embeddings, p_θ(y_t^{(i)}) is the probability that the parser takes the oracle transition action y_t^{(i)} at time t of sentence i, and T_i is the number of transitions for sentence i:

$\mathcal{L}(\theta) = \sum_{i} \sum_{t=1}^{T_i} \log p_\theta\!\left(y_t^{(i)}\right)$   (1)

We optimise θ by gradient backpropagation through time (BPTT) for each sentence, feeding the parser with the gold sequence of configurations. When the parser reaches the final configuration, the gradients are backpropagated from each prediction at time t down to the first time step.
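As a sketch of this training procedure (using PyTorch-style autograd purely for illustration; extract_features and gold_oracle are assumed helpers, and torch.nn.LSTMCell omits the peephole connections used in the paper):

```python
import torch.nn.functional as F

def train_sentence(extract_features, lstm_cell, out_layer, optimizer,
                   sentence, gold_oracle):
    """One BPTT update over the gold (configuration, action) sequence of a sentence.

    extract_features : maps a configuration to its feature embeddings x_t (1 x dim)
    lstm_cell        : e.g. torch.nn.LSTMCell (no peepholes, unlike the paper)
    out_layer        : linear map from the hidden state h_t to transition scores
    gold_oracle      : yields (configuration, gold_action_id) pairs in order
    """
    optimizer.zero_grad()
    state, loss = None, 0.0
    for config, gold_action in gold_oracle(sentence):
        x = extract_features(config)
        h, c = lstm_cell(x, state)                    # recurrence over time steps
        state = (h, c)
        log_p = F.log_softmax(out_layer(h), dim=-1)   # log p_t over transitions
        loss = loss - log_p[0, gold_action]           # negative log-likelihood
    loss.backward()       # gradients flow from every prediction back to t = 1 (BPTT)
    optimizer.step()
```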

3 Experiment

3.1 Experimental Settings

We conducted the experiments on the Google Web Treebank [Petrov and McDonald2012], consisting of the WSJ portion of the OntoNotes corpus and five additional web domains, with 48 dependency types. The models were trained only on the training set of the WSJ corpus, while the hyper-parameters were tuned on the WSJ dev set (i.e. no tuning on any of the web domains’ dev sets).

As baselines, we re-implemented the Chen and Manning parser with the same settings, reporting results for both the feedforward model with the Tanh activation function (the same activation as the LSTM) and its better-performing Cubic counterpart. Training was run for a maximum of 400 epochs and stopped early if no better dev UAS was found after 30 consecutive epochs.

The LSTM was trained with the Adadelta optimiser [Zeiler2012], using a decay rate of 0.95 and a small ε constant. The embeddings were initialised in the same way as for the feedforward baselines, while the weight connections were initialised using the same mechanism as Glorot and Bengio (2010). We used automatic POS tags from the Stanford bi-directional tagger [Toutanova et al.2003], with tagging accuracies of 97% for the WSJ and 87-92% for the web domains.
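The training schedule described above can be sketched as follows (train_epoch, evaluate_uas, and snapshot are assumed helpers, not part of the paper's code):

```python
def train_with_early_stopping(model, train_data, dev_data,
                              max_epochs=400, patience=30):
    """Train for up to max_epochs, stopping if dev UAS stalls for `patience` epochs."""
    best_uas, best_state, since_best = 0.0, None, 0
    for epoch in range(max_epochs):
        train_epoch(model, train_data)        # one pass over the WSJ training set
        uas = evaluate_uas(model, dev_data)   # unlabelled attachment score on WSJ dev
        if uas > best_uas:
            best_uas, best_state, since_best = uas, snapshot(model), 0
        else:
            since_best += 1
            if since_best >= patience:        # no improvement for 30 consecutive epochs
                break
    return best_state, best_uas
```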

3.2 Main Result and Analysis

Figure 2: Precision by Dependency Length
Figure 3: Recall by Dependency Length

The LAS results on the Google Web Treebank are summarised in Table 1, where F-T and F-C denote the feedforward baselines with Tanh and Cubic activations, respectively. Our LSTM model outperforms the feedforward baseline with the same Tanh activation function (87.5 vs 86.4 on the WSJ test set), while achieving accuracy competitive with the Cubic baseline.

We further investigate the models’ performance on long-range dependencies, reporting labelled precision and recall broken down by dependency length on the WSJ test set in Table 2; the results are also plotted in Figures 2 and 3. Despite the models’ similar overall accuracy, our LSTM model outperforms the Cubic baseline by more than 3% in both precision and recall for dependencies of length 7 or more, and its performance degrades more slowly as dependency length increases.
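The breakdown in Table 2 can be computed along these lines (a sketch; arcs are assumed to be (head, label, dependent) triples over word indices, with length |head - dependent|):

```python
from collections import defaultdict

def pr_by_length(gold_arcs, pred_arcs,
                 buckets=((1, 1), (2, 2), (3, 6), (7, 49))):
    """Labelled precision/recall bucketed by dependency length, as in Table 2.

    gold_arcs, pred_arcs: one set of (head, label, dependent) triples per sentence.
    """
    def bucket(length):
        return next(((lo, hi) for lo, hi in buckets if lo <= length <= hi), None)

    tp, n_pred, n_gold = defaultdict(int), defaultdict(int), defaultdict(int)
    for gold, pred in zip(gold_arcs, pred_arcs):
        for arc in pred:
            b = bucket(abs(arc[0] - arc[2]))
            n_pred[b] += 1
            tp[b] += arc in gold              # correct head, label and dependent
        for arc in gold:
            n_gold[bucket(abs(arc[0] - arc[2]))] += 1
    return {b: (tp[b] / max(n_pred[b], 1),    # labelled precision
                tp[b] / max(n_gold[b], 1))    # labelled recall
            for b in buckets}
```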

WSJ          F-T    F-C    LSTM
Dev          86.0   87.2   87.8
Test         86.4   87.5   87.5

Web Test     F-T    F-C    LSTM
Answers      74.1   74.9   74.6
Emails       74.6   75.6   74.4
Newsgroups   79.3   79.9   80.2
Reviews      76.5   77.2   77.0
Weblogs      80.7   81.1   81.2
Table 1: Google Web Treebank LAS Results
Dep. Length   Metric      F-T    F-C    LSTM
1             Precision   91.4   92.2   91.7
              Recall      93.1   93.6   93.0
2             Precision   87.9   88.9   89.3
              Recall      90.3   91.0   90.1
3-6           Precision   81.7   83.4   82.6
              Recall      79.2   81.3   81.4
7-49          Precision   68.1   70.3   73.5
              Recall      62.6   65.6   69.5
Table 2: Long-range Arcs, Precision and Recall

3.3 Regularisation Experiments

We find that regularisation is important for the LSTM parser, more so than for the feedforward architecture. Table 3 compares the relative improvement due to dropout for the feedforward and LSTM models when both are constrained to have the same number of parameters (500,000, corresponding to 50 hidden units for the LSTM). Observe that the LSTM becomes competitive only with dropout.

Model     no dropout   with dropout   Δ
F-Cubic   89.1         89.5           +0.4
LSTM      87.4         89.5           +2.1
Table 3: Effect of Dropout on UAS Accuracy

To investigate what kind of dropout is beneficial, we conducted further experiments on a subset of the training data (the first 80,000 tokens of the WSJ training set), using the same experimental settings as in Subsection 3.1 and evaluating UAS on the full WSJ dev and test sets, with the hidden layer size fixed at 60. The results for dropout and L2 regularisation are given in Table 4, along with the epoch at which the best dev UAS was found. E-H and H-O indicate dropout on the embedding-hidden and hidden-output connections, respectively.

Reg       Setting    Dev    Test   Epoch
L2        0          80.2   80.0   42
L2                   80.7   80.8   25
L2                   79.9   80.3   43
L2                   79.8   80.1   43
L2                   80.5   80.4   46
L2                   83.4   82.9   206
L2                   81.6   81.6   159
Dropout   E-H 0.2    84.4   84.3   97
Dropout   E-H 0.4    85.8   85.7   257
Dropout   E-H 0.6    86.2   85.5   273
Dropout   H-O 0.2    81.8   81.6   52
Dropout   H-O 0.4    82.3   82.1   93
Dropout   H-O 0.6    81.9   81.7   69
Dropout   Both 0.2   85.4   85.0   122
Dropout   Both 0.4   86.1   85.9   315
Dropout   Both 0.6   85.3   85.3   500
Table 4: UAS Accuracy with Various Regularisation Settings

While dropout generally results in slower convergence, it outperforms L2 regularisation and improves the model’s accuracy by more than 6%. Most importantly, we find dropout on the embedding-hidden connections (E-H) to be more crucial than hidden-output (H-O) dropout: it achieves roughly the same accuracy as applying dropout to both, suggesting that our model can reach good accuracy with input dropout alone. Dropout rates between 0.4 and 0.6 are effective. Furthermore, dropout improves the LSTM regardless of model size: Figure 4 shows how a dropout rate of 0.5 on the E-H and H-O connections improves results for various hidden layer sizes.
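For illustration, a sketch of where the two dropout sites sit in the forward pass (PyTorch-style; the module and parameter names are ours, and the standard nn.LSTMCell omits the peephole connections used in the paper):

```python
import torch.nn as nn
import torch.nn.functional as F

class ParserStep(nn.Module):
    """One transition-prediction step with the two dropout placements studied above."""
    def __init__(self, feat_dim, hidden_dim, n_transitions, p_eh=0.5, p_ho=0.0):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, n_transitions)
        self.drop_eh = nn.Dropout(p_eh)   # E-H: on the embedding-hidden connections
        self.drop_ho = nn.Dropout(p_ho)   # H-O: on the hidden-output connections

    def forward(self, x, state=None):
        h, c = self.cell(self.drop_eh(x), state)   # dropout applied to the embeddings
        scores = self.out(self.drop_ho(h))
        return F.log_softmax(scores, dim=-1), (h, c)
```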

Figure 4: UAS Accuracy vs Hidden Layer Size

4 Related work

Recently, various neural network models have achieved state-of-the-art results across many parsing tasks and languages, including on the Google Web Treebank dataset used in this paper. Vinyals et al. (2015) used LSTMs for sequence-to-sequence constituency parsing that makes few prior assumptions about the parsing problem. For dependency parsing, Stenetorp (2013) presented a compositional RNN model, similar to the RNN constituency parser of Socher et al. (2013).

More recently, Dyer et al. (2015) and Kiperwasser and Goldberg (2016) proposed transition-based LSTM models that automatically extract real-valued feature vectors from the parser configuration. The transition-based parser of Dyer et al. (2015) used a “stack LSTM” architecture and composition functions to obtain a continuous, low-dimensional representation of the stack, representing the partial trees, along with the buffer and the history of actions. Both our work and the stack LSTM model use greedy decoding; one primary difference is that we use the LSTM to form temporal recurrence over the hidden states, which we define as the penultimate layer right before the softmax. We used the same feature extraction template as Chen and Manning (2014) and replaced the feedforward connections with an LSTM network, while Dyer et al. (2015) instead used the stack LSTM as a means to extract dense features from the parser configuration without explicit temporal recurrence.

Neural network models have also been used for structured training in transition-based parsing, achieving state-of-the-art results on various datasets. Weiss et al. (2015) used a structured perceptron model on top of a feedforward transition-based dependency parser; when augmented with a tri-training method on unlabelled data, their model achieved an impressive 87% LAS on the web domains of the Google Web Treebank also used in this work. Zhou et al. (2015) used beam search and contrastive learning to maximise the probability of the entire gold sequence with respect to all other sequences in the beam. Andor et al. (2016) similarly proposed a globally normalised model using beam search and a Conditional Random Field (CRF) loss [Lafferty et al.2001] that achieved state-of-the-art results on the benchmark English PTB dataset.

Our RNN parsing model is most similar to that of Xu et al. (2016), who used temporal recurrence over the hidden states for CCG parsing, although we use LSTMs instead of Elman RNNs. Our work additionally investigates the effect of dropout on model performance and demonstrates the efficacy of temporal recurrence for better capturing long-range dependencies.

5 Conclusions

We present a transition-based dependency parser using recurrent LSTM units. The motivation is to exploit the entire history of shift/reduce transitions when making predictions. The LSTM parser is competitive with the feedforward neural network parser of Chen and Manning (2014) on overall LAS and notably improves the accuracy of long-range dependencies. We also show the importance of dropout, particularly on the embedding layer, in improving the model’s accuracy.

Acknowledgments

We thank Graham Neubig and Hiroyuki Shindo for the useful feedback and comments.

References