Two complementary approaches to transition-based dependency parsers have emerged recently. The feature engineering approach relies on hand-crafted feature templates to model interactions between sparse lexical features. While manually crafting these feature templates requires substantial expertise and extensive trial-and-error, this approach has led to state-of-the-art parsers in many languages [Buchholz and Marsi, 2006; Zhang and Nivre, 2011].
In contrast, the neural network
approach enables automatic learning of feature combinations through non-linear hidden layers and mitigates sparsity issues by sharing similar low-dimensional distributed representations for related words [Bengio et al., 2003].
In this work, we explore new model architectures under the neural network approach. In particular, we address the issue that the feedforward architecture of the Chen and Manning (2014) parser is trained on each oracle configuration independently of the others, disregarding the fact that the oracles for each training sentence represent a whole sequence of intertwined decisions. Our proposed extension uses a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) units [Hochreiter and Schmidhuber, 1997]. At each time step of the transition system, the LSTM has access in principle to the entire history of past decisions (i.e. shift or reduce). LSTMs are naturally suited to modelling sequences and have shown promising results in, e.g., machine translation [Sutskever et al., 2014] and text-vision modelling [Venugopalan et al., 2014].
We particularly focus on the LSTM’s performance in identifying long-range dependencies. Such dependencies have proved difficult for most greedy transition-based parsers [McDonald and Nivre, 2007], including our feedforward baselines, which train on each oracle independently. This difficulty can be attributed to two main reasons: 1) most long-range dependencies are ambiguous, while the classifiers only have access to a limited context window, and 2) longer arcs are constructed after shorter arcs in transition-based parsing, increasing the chance of error propagation. In contrast, our LSTM has the key abilities of modelling whole sequences of training oracles and memorising all past context information, both of which are likely beneficial for longer dependencies.
Despite the LSTM’s theoretical advantages, in practice it is more prone to overfitting than the feedforward architecture, even with the same number of parameters. An additional contribution of this work is an empirical investigation suggesting that dropout [Srivastava et al., 2014], particularly when applied to the embedding layer, substantially improves the LSTM’s generalisation ability regardless of hidden layer size.
2 LSTM Parsing Model
2.1 Baseline Model
Our model is an extension of the parser of Chen and Manning (2014), which uses a feedforward neural network to predict the next transition of an arc-standard system. In arc-standard, a configuration consists of a buffer b (holding the input words), a stack s (holding the partial parse trees), and a set of dependency arcs A. The parse tree is built by successively making one of these transitions:

SHIFT: move the next word on b to s

LEFT/RIGHT-ARC(l): add a left/right arc with label l between the top two words on s
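The arc-standard transitions above can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation; the names (Config, shift, left_arc, right_arc) are our own.

```python
class Config:
    """Arc-standard configuration: stack s, buffer b, and arc set A."""
    def __init__(self, words):
        self.stack = []                         # s: partially processed word indices
        self.buffer = list(range(len(words)))   # b: remaining input word indices
        self.arcs = set()                       # A: (head, label, dependent) triples

def shift(c):
    # SHIFT: move the next word on b to s
    c.stack.append(c.buffer.pop(0))

def left_arc(c, label):
    # LEFT-ARC(l): second-top word of s becomes a dependent of the top word
    dep = c.stack.pop(-2)
    c.arcs.add((c.stack[-1], label, dep))

def right_arc(c, label):
    # RIGHT-ARC(l): top word of s becomes a dependent of the second-top word
    dep = c.stack.pop()
    c.arcs.add((c.stack[-1], label, dep))
```

For the two-word sentence "He ate", the oracle sequence SHIFT, SHIFT, LEFT-ARC(nsubj) leaves the root word on the stack and one arc in A.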
The feature input x_t is a concatenation of embeddings of the top 3 words on s and b, the first and second left-/right-most children of the top two words on s, and the leftmost of leftmost / rightmost of rightmost children of the top two words on s. At each configuration at time t, the neural network first computes the hidden layer h_t from the input x_t by applying a non-linear activation function f, then predicts the next transition with a softmax output layer:

h_t = f(W_1 D(x_t) + b_1),  p(y_t | c_t) = softmax(W_2 h_t)

The dimension of x_t is 2400 in our experiments, and the dimension of the output is 2N_l + 1, the number of possible transitions, where N_l = number of dependency label types. D is a dropout operator, which randomly sets elements to 0 with probability p.
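As a concrete sketch of this feature layer, the snippet below concatenates the template positions' embeddings into one 2400-dimensional input (48 positions x 50-dimensional embeddings, matching the stated input size) and applies the Tanh variant of the hidden layer. All weights, sizes, and names here are illustrative placeholders, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_emb, n_feats, d_hidden = 100, 50, 48, 200
E = rng.normal(size=(vocab, d_emb))                  # embedding matrix (word/POS/label)
W1 = rng.normal(scale=0.01, size=(d_hidden, n_feats * d_emb))
b1 = np.zeros(d_hidden)

def hidden_layer(feature_ids, p_drop=0.5, train=True):
    # Look up and concatenate the 48 template embeddings: 48 * 50 = 2400 dims
    x = E[feature_ids].reshape(-1)
    if train:
        # dropout operator D: zero each element with probability p_drop
        x = x * (rng.random(x.shape) >= p_drop)
    return np.tanh(W1 @ x + b1)                      # Tanh activation variant
```

A softmax layer over h_t (omitted here) would then score the 2N_l + 1 possible transitions.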
2.2 Our LSTM Model
Our LSTM model (shown in Figure 1) uses the same features as Chen and Manning (2014), but importantly adds new inputs based on past information (such as the previous hidden state h_{t-1}). The addition of the previous state leads to recurrence and enables modelling and training of the entire sequence of transitions.
While recurrence may cause the “vanishing gradient” problem [Bengio et al., 1994], the LSTM architecture addresses this by introducing memory cells that can store information over long time intervals and keep gradients from diminishing. Input gates i_t control what is stored in a memory cell c_t, and output gates o_t control whether the stored information is used in further computations. This allows information from the beginning of the sentence to influence transition actions at the end of the sentence. Forget gates f_t are used to erase information in the current memory cell.
The following equations describe our LSTM model with peephole connections [Gers et al., 2002], as set forth by Graves (2013), with dropout applied similarly to Zaremba et al. (2014):

i_t = σ(W_xi D(x_t) + W_hi h_{t-1} + W_ci c_{t-1} + b_i)
f_t = σ(W_xf D(x_t) + W_hf h_{t-1} + W_cf c_{t-1} + b_f)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_xc D(x_t) + W_hc h_{t-1} + b_c)
o_t = σ(W_xo D(x_t) + W_ho h_{t-1} + W_co c_t + b_o)
h_t = o_t ⊙ tanh(c_t)
Crucially, the LSTM not only uses the input x_t in its predictions for time t, but also exploits the values in the previous memory cell c_{t-1} and hidden layer h_{t-1} through the gates i_t, f_t, and o_t. The values of these gates are bounded between 0 and 1 by the sigmoid σ, so multiplication with other components modulates what information is passed through.
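A single peephole-LSTM step can be sketched directly in NumPy. This is a minimal illustration of the gating mechanics described above; the weights are random placeholders and the names (lstm_step, W, U, p, b) are our own, not the paper's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_in, d_h = 8, 4
rng = np.random.default_rng(1)
# Input-to-hidden (W), hidden-to-hidden (U), peephole (p), and bias (b) weights
W = {g: rng.normal(scale=0.1, size=(d_h, d_in)) for g in "ifco"}
U = {g: rng.normal(scale=0.1, size=(d_h, d_h)) for g in "ifco"}
p = {g: rng.normal(scale=0.1, size=d_h) for g in "ifo"}    # peepholes see the cell
b = {g: np.zeros(d_h) for g in "ifco"}

def lstm_step(x, h_prev, c_prev):
    # Input and forget gates peek at the previous cell state c_{t-1}
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + p["i"] * c_prev + b["i"])
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + p["f"] * c_prev + b["f"])
    # New cell state mixes the retained old cell with the candidate update
    c = f * c_prev + i * np.tanh(W["c"] @ x + U["c"] @ h_prev + b["c"])
    # Output gate peeks at the freshly computed cell state c_t
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + p["o"] * c + b["o"])
    h = o * np.tanh(c)
    return h, c
```

Because h and c are fed back at the next step, information from the start of the sentence can survive in c and influence transitions at the end.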
Given training sentences with gold parse trees, our training data is a set of sequences of configurations c_t and oracle transition actions y_t at each time t for each sentence. We maximise the log-likelihood of the oracle transition actions given by Equation (1):

L(θ) = Σ_i Σ_t log p(y_t^(i) | c_t^(i); θ)    (1)

where θ is the set of parameters including word, POS, and label embeddings, and p(y_t | c_t; θ) is the probability that the parser takes transition action y_t at time t.
We optimise θ by gradient backpropagation through time (BPTT) for each sentence, feeding the parser with the gold sequence of configurations c_1, …, c_T. When the parser reaches the final configuration, the gradients are backpropagated from each prediction at time t down to time 1.
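The sentence-level objective can be sketched as a forward pass over the gold configuration sequence, summing the log-probabilities of the oracle actions; BPTT would then differentiate this sum. For brevity the sketch uses a plain Tanh recurrence rather than the full LSTM cell, and all weights are random placeholders.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(2)
d_x, d_h, n_actions = 6, 5, 3
Wx = rng.normal(scale=0.1, size=(d_h, d_x))
Wh = rng.normal(scale=0.1, size=(d_h, d_h))
Wo = rng.normal(scale=0.1, size=(n_actions, d_h))

def sentence_log_likelihood(xs, ys):
    """Sum of log p(y_t | c_t) over one sentence's gold sequence."""
    h, ll = np.zeros(d_h), 0.0
    for x, y in zip(xs, ys):            # gold configurations and oracle actions
        h = np.tanh(Wx @ x + Wh @ h)    # recurrence: h_t depends on all past steps
        ll += np.log(softmax(Wo @ h)[y])
    return ll                           # maximise this; BPTT supplies the gradients
```

Because each h_t depends on h_{t-1}, the gradient of the final loss reaches every earlier time step, which is exactly what the feedforward baseline's per-oracle training lacks.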
3 Experiments

3.1 Experimental Settings
We conducted the experiments on the Google Web Treebank [Petrov and McDonald, 2012], consisting of the WSJ portion of the OntoNotes corpus and five additional web domains, with 48 dependency types. The models were trained only on the training set of the WSJ corpus, while the hyper-parameters were tuned on the WSJ dev set (i.e. no tuning on any of the web domains’ dev sets).
As baselines, we re-implemented the Chen and Manning parser with the same settings, including results from both the feedforward model with the Tanh activation function (the same activation as the LSTM) and its better-performing Cubic counterpart. Training ran for a maximum of 400 epochs, stopped early if no better dev UAS was found after 30 consecutive epochs. The LSTM was trained with the Adadelta optimiser [Zeiler, 2012], using a decay rate of 0.95. The embeddings were initialised in the same way as in the feedforward baselines, while the weight connections were initialised using the mechanism of Glorot and Bengio (2010). We used automatic POS tags from the Stanford bi-directional tagger [Toutanova et al., 2003], with tagging accuracies of 97% on the WSJ and 87-92% on the web domains.
3.2 Main Result and Analysis
The LAS results on the Google Web Treebank are summarised in Table 1, where F-T and F-C represent the feedforward baselines with Tanh and Cubic activations, respectively. Our LSTM model outperforms the feedforward baseline with the same Tanh activation function (87.5 vs 86.4 on WSJ Test), while achieving competitive accuracy with the Cubic baseline.
We furthermore investigate the models’ performance on long-range dependencies, reporting labelled precision and recall broken down by dependency length on the WSJ test set in Table 2. These results are also plotted in Figures 2 and 3. Despite the models’ similar overall accuracy, our LSTM model outperforms the Cubic baseline by more than 3% in both precision and recall for dependency lengths greater than 7, and the LSTM’s performance degrades more slowly as dependency length increases.
[Table 1: LAS on the Google Web Treebank (WSJ and Web Test), comparing F-T, F-C, and LSTM.]
3.3 Regularisation Experiments
We discover that regularisation is important for the LSTM parser, more so than for the feedforward architectures. Table 3 compares the relative improvement due to dropout for the feedforward and LSTM models, constraining both to the same number of parameters (500,000, corresponding to 50 hidden units for the LSTM). Observe that the LSTM becomes competitive only with dropout.
[Table 3: Accuracy of the feedforward and LSTM models with and without dropout, at 500,000 parameters each.]
To investigate which kind of dropout is beneficial, we conducted further experiments on a subset of the training data (the first 80,000 tokens of the WSJ training set), using the same experimental settings as in Subsection 3.1 and evaluating UAS on the full WSJ dev and test sets, with the hidden layer size fixed at 60. The results for dropout and L2 regularisation are shown in Table 4, along with the epoch at which the best dev UAS is found. E-H and H-O indicate dropout on the embedding-hidden and hidden-output connections, respectively.
While dropout generally results in slower convergence, the technique outperforms L2 regularisation and improves the model’s accuracy by more than 6%. Most importantly, we found embedding-layer (E-H) dropout to be more crucial than hidden-output (H-O) dropout: it achieves the same accuracy as applying dropout to both layers, suggesting that our model can achieve good accuracy with input dropout alone. We found dropout rates between 0.4 and 0.6 to be effective. Further, dropout generally improves LSTMs regardless of model size: Figure 4 shows how a dropout rate of 0.5 on the E-H and H-O connections improves results for various hidden layer sizes.
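The two dropout sites compared above can be made concrete with a small sketch: E-H dropout zeroes elements of the concatenated embedding input, while H-O dropout zeroes hidden activations before the output layer. The rates, shapes, and names here are illustrative placeholders; the inverted-dropout rescaling is one common formulation, not necessarily the paper's exact choice.

```python
import numpy as np

rng = np.random.default_rng(3)

def dropout(v, p, train=True):
    """Zero each element of v with probability p; rescale survivors at train time."""
    if not train or p == 0.0:
        return v
    mask = rng.random(v.shape) >= p
    return v * mask / (1.0 - p)     # inverted dropout: no rescaling needed at test

x = rng.normal(size=2400)            # concatenated embedding input
x = dropout(x, 0.5)                  # E-H dropout (the crucial one in our runs)
h = np.tanh(rng.normal(scale=0.01, size=(60, 2400)) @ x)   # hidden layer, size 60
h = dropout(h, 0.5)                  # H-O dropout (less important empirically)
```

At test time the same functions are called with train=False, so the full, unscaled activations flow through.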
4 Related Work
Recently, various neural network models have achieved state-of-the-art results across many parsing tasks and languages, including on the Google Web Treebank dataset used in this paper. Vinyals et al. (2014) used LSTMs for sequence-to-sequence constituency parsing, making no prior assumptions about the parsing problem. For dependency parsing, Stenetorp (2013) presented a recursive neural network compositional model, similar to the recursive constituency parser of Socher et al. (2013).
More recently, Dyer et al. (2015) and Kiperwasser and Goldberg (2016) proposed transition-based LSTM models that automatically extract real-valued feature vectors from the parser configuration. The transition-based parser of Dyer et al. (2015) used a “stack LSTM” architecture and composition functions to obtain a continuous, low-dimensional representation of the stack (representing the partial trees), along with the buffer and the history of actions. Both our work and the stack LSTM model use greedy decoding, although one primary difference is that we use the LSTM to form temporal recurrence over the hidden states (which we define as the penultimate layer right before the softmax). We used the same feature extraction template as Chen and Manning (2014) and replaced the feedforward connections with an LSTM network, while Dyer et al. (2015) instead used the stack LSTM as a means to extract dense features from the parser configuration, without explicit temporal recurrence.
Neural network models have also been used for structured training in transition-based parsing, achieving state-of-the-art results on various datasets. Weiss et al. (2015) used a structured perceptron model on top of a feedforward transition-based dependency parser. When augmented with a tri-training method on unlabelled data, their model achieved an impressive 87% LAS on the web domain data of the Google Web Treebank used in this work. Zhou et al. (2015) used beam search and contrastive learning to maximise the probability of the entire gold sequence with respect to all other sequences in the beam. Andor et al. (2016) similarly proposed a globally normalised model using beam search and a Conditional Random Field (CRF) loss [Lafferty, 2001] that achieved state-of-the-art results on the benchmark English PTB dataset.
Our RNN parsing model is most similar to that of Xu et al. (2016), which used temporal recurrence over the hidden states for CCG parsing, although we use LSTMs instead of Elman RNNs. Our work additionally investigates the effect of dropout on model performance and demonstrates the efficacy of temporal recurrence for better capturing long-range dependencies.
5 Conclusion

We present a transition-based dependency parser using recurrent LSTM units, motivated by exploiting the entire history of shift/reduce transitions when making predictions. The LSTM parser is competitive with the feedforward neural network parser of Chen and Manning (2014) on overall LAS, and notably improves accuracy on long-range dependencies. We also show the importance of dropout, particularly on the embedding layer, in improving the model’s accuracy.
We thank Graham Neubig and Hiroyuki Shindo for the useful feedback and comments.
- [Andor et al.2016] Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. CoRR, abs/1603.06042.
- [Bansal et al.2014] Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring continuous word representations for dependency parsing. In Proceedings of ACL.
- [Bastien et al.2012] Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. 2012. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.
- [Bengio et al.1994] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. Trans. Neur. Netw., 5(2):157–166, March.
- [Bengio et al.2003] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.
- [Bergstra et al.2010] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June. Oral Presentation.
- [Buchholz and Marsi2006] Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of CoNLL, pages 149–164.
- [Chen and Manning2014] Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP).
- [Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. 2011. Natural language processing (almost) from scratch. CoRR, abs/1103.0398.
- [Duchi et al.2011] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July.
- [Dyer et al.2015] Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 334–343, Beijing, China, July. Association for Computational Linguistics.
- [Fonseca et al.2015] Erick R. Fonseca and Sandra M. Aluísio. 2015. A deep architecture for non-projective dependency parsing. In Proceedings of NAACL-HLT, pages 56–61.
- [Gers et al.2002] Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber. 2002. Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, 3:115–143.
- [Glorot and Bengio2010] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS’10). Society for Artificial Intelligence and Statistics.
- [Graves2013] Alex Graves. 2013. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850.
- [Greff et al.2015] Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. 2015. LSTM: A search space odyssey. CoRR, abs/1503.04069.
- [Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput., 9(8):1735–1780, November.
- [Kiperwasser and Goldberg2016] Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. CoRR, abs/1603.04351.
- [Lafferty2001] John Lafferty. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML, pages 282–289. Morgan Kaufmann.
- [McDonald and Nivre2007] Ryan McDonald and Joakim Nivre. 2007. Characterizing the errors of data-driven dependency parsing models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and Natural Language Learning.
- [Petrov and McDonald2012] Slav Petrov and Ryan McDonald. 2012. Overview of the 2012 shared task on parsing the web. Notes of the First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL).
- [Socher et al.2013] Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013. Parsing with compositional vector grammars. In Proceedings of the ACL conference.
- [Srivastava et al.2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958.
- [Stenetorp2013] Pontus Stenetorp. 2013. Transition-based dependency parsing using recursive neural networks. In Deep Learning Workshop at the 2013 Conference on Neural Information Processing Systems (NIPS), Lake Tahoe, Nevada, USA, December.
- [Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. CoRR, abs/1409.3215.
- [Toutanova et al.2003] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL ’03, pages 173–180, Stroudsburg, PA, USA. Association for Computational Linguistics.
- [Venugopalan et al.2014] Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond J. Mooney, and Kate Saenko. 2014. Translating videos to natural language using deep recurrent neural networks. CoRR, abs/1412.4729.
- [Vinyals et al.2014] Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey E. Hinton. 2014. Grammar as a foreign language. CoRR, abs/1412.7449.
- [Weiss et al.2015] David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. 2015. Structured training for neural network transition-based parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 323–333, Beijing, China, July. Association for Computational Linguistics.
- [Xu et al.2016] Wenduan Xu, Michael Auli, and Stephen Clark. 2016. Expected f-measure training for shift-reduce parsing with recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 210–220, San Diego, California, June. Association for Computational Linguistics.
- [Yamada and Matsumoto2003] Hiroyasu Yamada and Yuji Matsumoto. 2003. Statistical dependency analysis with support vector machines. In Proceedings of the International Workshop on Parsing Technologies (IWPT).
- [Zaremba et al.2014] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. CoRR, abs/1409.2329.
- [Zeiler2012] Matthew D. Zeiler. 2012. ADADELTA: an adaptive learning rate method. CoRR, abs/1212.5701.
- [Zhang and Nivre2011] Yue Zhang and Joakim Nivre. 2011. Transition-based dependency parsing with rich non-local features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, HLT ’11, pages 188–193, Stroudsburg, PA, USA. Association for Computational Linguistics.
- [Zhou et al.2015] Hao Zhou, Yue Zhang, Shujian Huang, and Jiajun Chen. 2015. A neural probabilistic structured-prediction model for transition-based dependency parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1213–1222, Beijing, China, July. Association for Computational Linguistics.