network produces a fixed-length vector representation of an input, while the decoder network produces a linearization of the target output structure as a sequence of output symbols. The encoder/decoder approach is state of the art for several key tasks in natural language processing, such as machine translation Wu et al. (2016).
However, fixed-size encodings become less competitive when the input structure can be explicitly mapped to the output. In the simple case of predicting tags for individual tokens in a sentence, state-of-the-art taggers learn vector representations for each input token and predict output tags from those Ling et al. (2015); Huang et al. (2015); Andor et al. (2016). When the input or output is a syntactic parse tree, networks that explicitly operate over the compositional structure of the network typically outperform generic representations Dyer et al. (2015); Li et al. (2015); Bowman et al. (2016). Implicitly learned mappings via attention mechanisms can significantly improve the performance of sequence-to-sequence models Bahdanau et al. (2015); Vinyals et al. (2015), but require runtime that is quadratic in the input size.
In this work, we propose a modular neural architecture that generalizes the encoder/decoder concept to include explicit structure. Our framework can represent sequence-to-sequence learning as well as models with explicit structure like bi-directional tagging models and compositional, tree-structured models. Our core idea is to define any given architecture as a series of modular units, where connections between modules are unfolded dynamically as a function of the intermediate activations produced by the network. These dynamic connections represent the explicit input and output structure produced by the network for a given task.
We build on the idea of transition systems from the parsing literature Nivre (2006), which linearize structured outputs as a sequence of (state, decision) pairs. Transition-based neural networks have recently been applied to a wide variety of NLP problems Dyer et al. (2015); Lample et al. (2016); Kiperwasser and Goldberg (2016); Zhang et al. (2016); Andor et al. (2016), among others. We generalize these approaches with a new basic module, the Transition-Based Recurrent Unit (TBRU), which produces a vector representation for every transition state in the output linearization (Figure 1). These representations also serve as the encoding of the explicit structure defined by the states. For example, a TBRU that attaches two sub-trees while building a syntactic parse tree will also produce the hidden layer activations to serve as an encoding for the newly constructed phrase. Multiple TBRUs can be connected and learned jointly to add explicit structure to multi-task learning setups and share representations between tasks with different input or output spaces (Figure 2).
This inference procedure will construct an acyclic compute graph representing the network architecture, where recurrent connections are dynamically added as the network unfolds. We therefore call our approach Dynamic Recurrent Acyclic Graphical Neural Networks, or DRAGNN.
DRAGNN has several distinct modeling advantages over traditional fixed neural architectures. Unlike generic seq2seq, DRAGNN supports variable sized input representations that may contain explicit structure. Unlike purely sequential RNNs, the dynamic connections in a DRAGNN can span arbitrary distances in the input space. Crucially, inference remains linear in the size of the input, in contrast to quadratic-time attention mechanisms. Dynamic connections thus establish a compromise between pure seq2seq and pure attention architectures by providing a finite set of long-range inputs that ‘attend’ to relevant portions of the input space. Unlike recursive neural networks Socher et al. (2010, 2011), DRAGNN can both predict intermediate structures (such as parse trees) and utilize those structures in a single deep model, backpropagating downstream task errors through the intermediate structures. Compared to models such as Stack-LSTM Dyer et al. (2015) and SPINN Bowman et al. (2016), TBRUs are a more general formulation that allows incorporating dynamically structured multi-task learning Zhang and Weiss (2016) and more varied network architectures.
In sum, DRAGNN is not a particular neural architecture, but rather a formulation for describing neural architectures compactly. The key to this compact description is a new recurrent unit—the TBRU—which allows connections between nodes in an unrolled compute graph to be specified dynamically in a generic fashion. We utilize transition systems to provide succinct, discrete representations via linearizations of both the input and the output for structured prediction. We provide a straightforward way of re-using representations across NLP tasks that operate on different structures.
We demonstrate the effectiveness of DRAGNN on two NLP tasks that benefit from explicit structure: dependency parsing and extractive sentence summarization Filippova and Altun (2013). First, we show how to use TBRUs to incrementally add structure to the input and output of a “vanilla” seq2seq dependency parsing model, dramatically boosting accuracy over seq2seq with no additional computational cost. Second, we demonstrate how the same TBRUs can be used to provide structured intermediate syntactic representations for extractive sentence summarization. This yields better accuracy than is possible with the generic multi-task seq2seq Dong et al. (2015); Luong et al. (2016) approach. Finally, we show how multiple TBRUs for the same dependency parsing task can be stacked together to produce a single state-of-the-art dependency parsing model.
2 Transition Systems
We use transition systems to map inputs $x$ into a sequence of output symbols, $d_1 \ldots d_n$. For the purposes of implementing DRAGNN, transition systems make explicit two desirable properties. First, we stipulate that the output symbols represent modifications of a persistent, discrete state, which makes the book-keeping needed to construct the dynamic recurrent connections easier to express. Second, transition systems make it easy to enforce arbitrary constraints on the output, e.g. that the output should form a valid tree.
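Formally, we use the same setup as Andor et al. (2016), and define a transition system $\mathcal{T} = \{\mathcal{S}, \mathcal{A}, t\}$ as:
- A set of states $\mathcal{S}(x)$.
- A special start state $s^\dagger \in \mathcal{S}(x)$.
- A set of allowed decisions $\mathcal{A}(s, x)$ for all $s \in \mathcal{S}$.
- A transition function $t(s, d, x)$ returning a new state $s'$ for any decision $d \in \mathcal{A}(s, x)$.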
For brevity, we will drop the dependence on $x$ in the functions given above. Throughout this work we will use transition systems in which all complete structures for the same input $x$ have the same number of decisions $n(x)$ (or $n$ for brevity), although this is not necessary.
A complete structure is then a sequence of decision/state pairs $(s_1, d_1) \ldots (s_n, d_n)$ such that $s_1 = s^\dagger$, $d_i \in \mathcal{A}(s_i)$ for $i = 1 \ldots n$, and $s_{i+1} = t(s_i, d_i)$. We will now define recurrent network architectures that operate over these linearizations of input and output structure.
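As an illustration of the definition above, the following minimal Python sketch encodes a transition system as four callables and unrolls it into a complete structure. The names `TransitionSystem`, `make_tagger`, and `unroll` are hypothetical, and the `is_final` predicate is an added convenience that the formal definition does not include (the text instead assumes a fixed number of decisions per input).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = Tuple[int, ...]  # hypothetical state encoding: the tags chosen so far


@dataclass
class TransitionSystem:
    """Hypothetical container for the ingredients of a transition system."""
    start: Callable[[List[str]], State]               # start state for input x
    allowed: Callable[[State, List[str]], List[int]]  # allowed decisions in state s
    step: Callable[[State, int, List[str]], State]    # transition function
    is_final: Callable[[State, List[str]], bool]      # assumption: explicit stop test


def make_tagger(num_tags: int) -> TransitionSystem:
    """A tagging system: exactly one decision (a tag id) per input token."""
    return TransitionSystem(
        start=lambda x: (),
        allowed=lambda s, x: list(range(num_tags)),
        step=lambda s, d, x: s + (d,),
        is_final=lambda s, x: len(s) == len(x),
    )


def unroll(system: TransitionSystem, x: List[str], choose) -> List[Tuple[State, int]]:
    """Return the complete structure as a list of (state, decision) pairs."""
    s, pairs = system.start(x), []
    while not system.is_final(s, x):
        options = system.allowed(s, x)
        d = choose(s, options)
        if d not in options:
            raise ValueError("decision is not allowed in this state")
        pairs.append((s, d))
        s = system.step(s, d, x)
    return pairs


tokens = ["the", "cat", "sat"]
# A dummy policy that cycles through the tag ids; a real model would score them.
structure = unroll(make_tagger(num_tags=3), tokens,
                   choose=lambda s, options: options[len(s) % 3])
```

Because every decision is checked against the allowed set, output constraints are enforced by construction rather than by post-processing.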
3 Transition Based Recurrent Networks
We now formally define how to combine transition systems with recurrent networks into what we call a transition based recurrent unit (TBRU). A TBRU consists of the following:
- A transition system $\mathcal{T}$,
- An input function $\mathbf{m}(s)$ that maps states to fixed-size vector representations, for example, an embedding lookup operation for features from the discrete state,
- A recurrence function $\mathbf{r}(s)$ that maps states to a set of previous time steps: $\mathbf{r}(s) : \mathcal{S} \mapsto \mathbb{P}\{1, \ldots, i-1\}$, where $\mathbb{P}$ is the power set. Note that in general $|\mathbf{r}(s)|$ is not necessarily fixed and can vary with $s$. We use $\mathbf{r}$ to specify state-dependent recurrent links in the unrolled computation graph.
- An RNN cell that computes a new hidden representation from the fixed and recurrent inputs: $\mathbf{h}_s \leftarrow \mathbf{RNN}(\mathbf{m}(s), \{\mathbf{h}_i \mid i \in \mathbf{r}(s)\})$.
Example 1. Sequential tagging RNN.
Let the input $x = \{x_1, \ldots, x_n\}$ be a sequence of word embeddings, and the output be a sequence of tags $d_1, \ldots, d_n$. Then we can model a simple LSTM tagger as follows:
- $\mathcal{T}$ sequentially tags each input token, where $s_i = \{d_1, \ldots, d_{i-1}\}$, and $\mathcal{A}$ is the set of possible tags. We call this the tagger transition system.
- $\mathbf{m}(s_i) = x_i$, the word embedding for the next token to be tagged.
- $\mathbf{r}(s_i) = \{i-1\}$ to connect the network to the previous state.
- $\mathbf{RNN}$ is a single instance of the LSTM cell.
Example 2. Parsey McParseface.
The open-source syntactic parsing model of Andor et al. (2016) can be defined in our framework as follows:
- $\mathcal{T}$ is the arc-standard transition system (Figure 3), so the state contains all words and partially built trees on the stack as well as unseen words on the buffer.
- $\mathbf{m}(s_i)$ is the concatenation of 52 feature embeddings extracted from tokens based on their positions in the stack and the buffer.
- $\mathbf{r}(s_i) = \{\}$ is empty, as this is a feed-forward network.
- $\mathbf{RNN}$ is a feed-forward multi-layer perceptron (MLP).
Inference with TBRUs.
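Given the above, inference in the TBRU proceeds as follows:
1. Initialize $s_1 = s^\dagger$.
2. For $i = 1, \ldots, n$:
(a) Update the hidden state: $\mathbf{h}_i \leftarrow \mathbf{RNN}(\mathbf{m}(s_i), \{\mathbf{h}_j \mid j \in \mathbf{r}(s_i)\})$.
(b) Update the transition state: $d_i \leftarrow \arg\max_{d \in \mathcal{A}(s_i)} \mathbf{w}_d^\top \mathbf{h}_i$, $s_{i+1} \leftarrow t(s_i, d_i)$.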
A schematic overview of a single TBRU is presented in Figure 3. By adjusting these components, TBRUs can represent a wide variety of neural architectures.
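The inference procedure for a single TBRU can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: `toy_cell` is a hypothetical stand-in for an LSTM cell, the decision score is simply one coordinate of the hidden vector, and the "word embeddings" are made-up numbers. The demo instantiates the sequential tagger of Example 1.

```python
import math
from typing import Dict, List

Vector = List[float]


def toy_cell(fixed: Vector, recurrent: List[Vector]) -> Vector:
    """Hypothetical stand-in for an RNN cell: tanh of the input plus summed links."""
    out = list(fixed)
    for h in recurrent:
        out = [a + b for a, b in zip(out, h)]
    return [math.tanh(v) for v in out]


def run_tbru(start, allowed, transition, is_final,
             input_fn, recurrence_fn, cell, score):
    """Greedy TBRU inference; returns hidden states and decisions keyed by step."""
    hidden: Dict[int, Vector] = {}
    decisions: Dict[int, int] = {}
    s, i = start, 1
    while not is_final(s):
        links = sorted(recurrence_fn(s, i))          # state-dependent recurrent links
        if any(j >= i for j in links):
            raise ValueError("recurrences must point to earlier steps")
        hidden[i] = cell(input_fn(s), [hidden[j] for j in links])
        decisions[i] = max(allowed(s), key=lambda d: score(hidden[i], d))
        s = transition(s, decisions[i])              # advance the discrete state
        i += 1
    return hidden, decisions


# Example 1 as a toy: one tag per token, linked to the previous step only.
x = [[1.0, 0.0], [0.0, 2.0], [-3.0, 0.0]]  # made-up 2-d "word embeddings"
hidden, decisions = run_tbru(
    start=(),
    allowed=lambda s: [0, 1],
    transition=lambda s, d: s + (d,),
    is_final=lambda s: len(s) == len(x),
    input_fn=lambda s: x[len(s)],
    recurrence_fn=lambda s, i: {i - 1} if i > 1 else set(),
    cell=toy_cell,
    score=lambda h, d: h[d],
)
```

Swapping `input_fn`, `recurrence_fn`, or `cell` changes the architecture without touching the loop, which is the point of the abstraction.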
3.1 Connecting multiple TBRUs to learn shared representations
While TBRUs are a useful abstraction for describing recurrent models, the primary motivation for this framework is to allow new architectures by combining representations across tasks and compositional structures. We do this by connecting multiple TBRUs with different transition systems via the recurrence function. We formally augment the above definition as follows:
We execute a list of TBRU components, one at a time, so that each TBRU advances a global step counter. Note that for simplicity, we assume an earlier TBRU finishes all of its steps before the next one starts execution.
Each transition state from a given component has access to the terminal states from every prior transition system, and the recurrence function for any given component can pull hidden activations from every prior one as well.
Example 3. “Input” transducer TBRUs via no-op decisions.
We find it useful to define TBRUs even when the transition system decisions do not correspond to any output. These TBRUs, which we call no-op TBRUs, transduce the input according to some linearization. The simplest is the shift-only transition system, in which the state is just an input pointer $s_i = \{i\}$, and there is only one transition which advances it: $t(s_i, \cdot) = \{i+1\}$. Executing this transition system will produce a hidden representation for every input token.
Example 4. Encoder/decoder networks with TBRUs.
We can reproduce the encoder/decoder framework for sequence tagging by using two TBRUs: one using the shift-only transition system to encode the input, and the other using the tagger transition system. For input $x = \{x_1, \ldots, x_n\}$, we connect them as follows:
- For shift-only TBRU: $\mathbf{m}(s_i) = x_i$, $\mathbf{r}(s_i) = \{i-1\}$.
- For tagger TBRU: $\mathbf{m}(s_{n+i}) = \mathbf{y}_{d_{n+i-1}}$, $\mathbf{r}(s_{n+i}) = \{n, n+i-1\}$.
We observe that the tagger TBRU starts at step $n$ after the shift-only TBRU finishes, that $\mathbf{y}_j$ is a fixed embedding vector for the output tag $j$, and that the tagger TBRU has access to both the final encoding vector $\mathbf{h}_n$ as well as its own previous time step $\mathbf{h}_{n+i-1}$.
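To make the wiring of this encoder/decoder example concrete, the sketch below lists, for every global step, which earlier steps supply hidden states. It assumes 1-indexed steps and a global step counter shared by both TBRUs; the function name `encoder_decoder_links` is hypothetical, and the choice to give the very first step no recurrent link is an assumption.

```python
from typing import Dict, Set


def encoder_decoder_links(n: int) -> Dict[int, Set[int]]:
    """Map each global step to the earlier steps it reads hidden states from."""
    links: Dict[int, Set[int]] = {}
    for i in range(1, n + 1):
        # Shift-only encoder, steps 1..n: link to the previous step if there is one.
        links[i] = {i - 1} if i > 1 else set()
    for i in range(1, n + 1):
        # Tagger decoder, steps n+1..2n: the final encoding (step n) plus its
        # own previous step; at the first decoder step the two coincide.
        step = n + i
        links[step] = {n, step - 1}
    return links


links = encoder_decoder_links(3)
```

Every decoder step has at most two incoming links regardless of sentence length, which is why the whole unrolled graph stays linear in the input size.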
Example 4. Bi-directional LSTM tagger.
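With three TBRUs, we can implement a simple bi-directional tagger. The first two run the shift-only transition system, but in opposite directions. The final TBRU runs the tagger transition system and concatenates the two representations:
- Left to right: shift-only transition system; the input is the embedding of the current token, and the recurrence links to the previous step.
- Right to left: shift-only transition system over the reversed input; the input is the embedding of the current token, and the recurrence links to the previous step.
- Tagger: tagger transition system; the recurrence links to the left-to-right and right-to-left hidden states of the token being tagged.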
We observe that the network cell in the tagger TBRU takes recurrences only from the bi-directional representations, and so is not recurrent in the traditional sense. See Fig. 1 for an unrolled example.
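The index arithmetic behind the bi-directional tagger is easy to get wrong, so here is a small sketch that computes the links explicitly. It assumes 1-indexed tokens and steps, with the left-to-right TBRU on steps 1..n, the right-to-left TBRU on steps n+1..2n, and the tagger on steps 2n+1..3n. The function name is hypothetical, and the exact offsets are an assumption of this sketch rather than a quotation of the paper's formulas.

```python
from typing import Dict, Set


def bidirectional_tagger_links(n: int) -> Dict[int, Set[int]]:
    """Recurrent links for a three-TBRU bi-directional tagger over n tokens."""
    links: Dict[int, Set[int]] = {}
    for i in range(1, n + 1):
        # Left to right: step i reads token i.
        links[i] = {i - 1} if i > 1 else set()
    for i in range(1, n + 1):
        # Right to left: step n+i reads token n-i+1, so token j is read
        # at step 2n-j+1.
        links[n + i] = {n + i - 1} if i > 1 else set()
    for i in range(1, n + 1):
        # Tagger: step 2n+i tags token i using both directional encodings
        # of that token and no link to its own previous step.
        links[2 * n + i] = {i, 2 * n - i + 1}
    return links


links_bi = bidirectional_tagger_links(3)
```

For three tokens, the tagger step for token 1 reads steps 1 and 6, since the right-to-left pass reaches token 1 last.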
|Parsing TBRU recurrence,||Parsing Accuracy (%)|
|Input links||Recurrent edges||News||Questions||Runtime|
Example 5. Multi-task bi-directional tagging.
Here we observe that it is possible to add additional annotation tasks to the bi-directional TBRU stack from Example 4 simply by adding more instances of the tagger TBRUs that produce outputs from different tag sets, e.g. parts-of-speech vs. morphological tags. Most important, however, is that any additional TBRUs have access to all three earlier TBRUs. This means that we can support the “stack-propagation” Zhang and Weiss (2016) style of multi-task learning simply by changing the recurrence function of the last TBRU so that, in addition to the two directional encodings, it links to the hidden state of the earlier tagger TBRU for the same token.
Remark: the raison d’être of DRAGNN.
This example highlights the primary advantage of our formulation: a TBRU can serve as both an encoder for downstream tasks and as a decoder for its own task simultaneously. This idea will prove particularly powerful when we consider syntactic parsing, which involves compositional structure over the input. For example, consider a no-op TBRU that traverses an input sequence in the order determined by a binary parse tree: this transducer can implement a recursive tree-structured network in the style of Tai et al. (2015), which computes representations for sub-phrases in the tree. In contrast, with DRAGNN, we can use the arc-standard parser directly to produce the parse tree as well as encode sub-phrases into representations.
Example 6. Compositional representations from arc-standard dependency parsing.
We use the arc-standard transition system Nivre (2006) to model dependency trees. The system maintains two data structures as part of the state $s$: an input pointer and a stack (Figure 3). Trees are built bottom up via three possible attachment decisions. Assume that the stack consists of $S = \{A, B\}$, with the next token being $C$. We use $S_0$ and $S_1$ to refer to the top two tokens on the stack. Then the decisions are defined as:
- Shift: Push the next token on to the stack: $S = \{A, B, C\}$, and advance the input pointer.
- Left arc + label: Add an arc $A \leftarrow_{label} B$, and remove $A$ from the stack: $S = \{B\}$.
- Right arc + label: Add an arc $A \rightarrow_{label} B$, and remove $B$ from the stack: $S = \{A\}$.
For a given parser state $s_i$, we compute two types of recurrences:
- $\mathbf{r}_{\mathrm{INPUT}}(s_i) = \{\mathrm{Input}(s_i)\}$, where Input returns the index of the next input token.
- $\mathbf{r}_{\mathrm{STACK}}(s_i) = \{\mathrm{Subtree}(s_i, S_0), \mathrm{Subtree}(s_i, S_1)\}$, where Subtree(s,i) is a function returning the index of the last decision that modified the $i$'th token:
$$\mathrm{Subtree}(s, i) = \arg\max_j \{\, j : d_j \text{ shifts or adds a new child to token } i \,\}$$
We show an example of the links constructed by these recurrences in Figure 4, and we investigate variants of this model in Section 4. This model is recursively compositional according to the decision taken by the network: when the TBRU at step $i$ decides to add an arc, the activations $\mathbf{h}_i$ will be used to represent that new subtree in future decisions.
Footnote 1: This composition function is similar to that in the constituent parsing SPINN model Bowman et al. (2016), but with several key differences. Since we use TBRUs, we compose new representations for “Shift” actions as well as reductions, we take inputs from other recurrent models, and we can utilize subtree representations in downstream tasks.
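The arc-standard decisions and the Subtree bookkeeping described in this example can be sketched as a small replay function. This is a simplified, unlabeled illustration: tokens are numbered from 0, arc labels are omitted, and names such as `replay_arc_standard` and `last_modified` are hypothetical. For each step it records the step indices that the stack recurrence would link to (top of stack first), computed from the state before the decision is applied.

```python
from typing import Dict, List, Tuple

SHIFT, LEFT_ARC, RIGHT_ARC = "shift", "left_arc", "right_arc"


def replay_arc_standard(decisions: List[str]) -> Tuple[List[List[int]], Dict[int, int]]:
    """Replay unlabeled arc-standard decisions over tokens 0, 1, 2, ...

    Returns the per-step Subtree links for the top two stack tokens (S0 first)
    and the head assigned to each attached token.
    """
    stack: List[int] = []
    next_token = 0                      # input pointer
    last_modified: Dict[int, int] = {}  # token -> last step that shifted it or gave it a child
    heads: Dict[int, int] = {}
    stack_links: List[List[int]] = []
    for step, decision in enumerate(decisions, start=1):
        # Links are read from the state *before* this step's decision.
        stack_links.append([last_modified[tok] for tok in reversed(stack[-2:])])
        if decision == SHIFT:
            stack.append(next_token)
            last_modified[next_token] = step
            next_token += 1
        elif decision == LEFT_ARC:      # S0 becomes the head of S1; S1 leaves the stack
            s0, s1 = stack.pop(), stack.pop()
            heads[s1] = s0
            stack.append(s0)
            last_modified[s0] = step
        elif decision == RIGHT_ARC:     # S1 becomes the head of S0; S0 leaves the stack
            s0 = stack.pop()
            heads[s0] = stack[-1]
            last_modified[stack[-1]] = step
        else:
            raise ValueError(f"unknown decision: {decision}")
    return stack_links, heads


# "the cat sat": attach "the" under "cat", then "cat" under "sat".
stack_links, heads = replay_arc_standard([SHIFT, SHIFT, LEFT_ARC, SHIFT, LEFT_ARC])
```

At step 4 the only stack token is "cat", and its link points to step 3, the step whose hidden state encodes the newly built subtree "the cat".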
Example 7. Extractive summarization pipeline with parse representations.
To model extractive summarization, we follow Andor et al. (2016) and use a tagger transition system with two tags: Keep and Drop. However, whereas Andor et al. (2016) use discrete features of the parse tree, we can utilize the subtree recurrence function to pull compositional, phrase-based representations of tokens as constructed by the dependency parser. This model is outlined in Fig. 2. A full specification is given in the Appendix.
|Input representation||Multi-task style||A (%)||F1 (%)||LAS (%)|
|Parse sub-trees (Figure 2)||Zhang and Weiss (2016)||30.56||80.74||89.13|
3.2 How to train a DRAGNN
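Given a list of TBRUs, we propose the following learning procedure. We assume training data consists of examples $x$ along with gold decision sequences for one of the TBRUs in the DRAGNN. At a minimum, we need such data for the final TBRU. Given decisions $d_1 \ldots d_N$ from prior components $1 \ldots T-1$, we define a log-likelihood objective to train the $T$'th TBRU along its gold decision sequence $d^\star_{N+1}, \ldots, d^\star_{N+n}$:
$$L(x, d^\star_{N+1:N+n}; \theta) = \sum_{i=1}^{n} \log P(d^\star_{N+i} \mid d_{1:N}, d^\star_{N+1:N+i-1}; \theta),$$
where $\theta$ denotes the combined parameters across all TBRUs. This objective is locally normalized, since we optimize the probabilities of the individual decisions in the gold sequence.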
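As a worked illustration of a locally normalized objective, the sketch below sums per-step log-probabilities of the gold decisions, with each step normalized by its own softmax. It assumes each step exposes one unnormalized score per allowed decision; the function name `gold_log_likelihood` and the score layout are hypothetical.

```python
import math
from typing import List


def gold_log_likelihood(step_scores: List[List[float]], gold: List[int]) -> float:
    """Sum over steps of log softmax(scores)[gold decision]."""
    total = 0.0
    for scores, d in zip(step_scores, gold):
        peak = max(scores)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(v - peak) for v in scores))
        total += scores[d] - log_z
    return total


# Two steps with two equally scored decisions each: log(0.5) per step.
ll = gold_log_likelihood([[0.0, 0.0], [0.0, 0.0]], gold=[0, 1])
```

Each term depends only on the scores at its own step, which is what distinguishes this from a globally normalized objective over whole decision sequences.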
The remaining question is where the decisions from prior components come from. There are two options here: either 1) they come as part of the gold annotation (e.g. if we have joint tagging and parsing data), or 2) they are predicted by unrolling the previous components. When training the stacked extractive summarization model, the parse trees will be predicted by the previously trained parser TBRU.
When training a given TBRU, we unroll an entire input sequence and then use backpropagation through structure Goller and Kuchler (1996) to optimize this objective. To train the whole system on a set of datasets, we use a strategy similar to Dong et al. (2015); Luong et al. (2016). We sample a target task from a pre-defined distribution and take a stochastic optimization step on the objective of that task’s TBRU. In practice, task sampling is usually preceded by a deterministic number of pre-training steps, which allows, for example, running a certain number of tagger training steps before running any parser training steps.
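The schedule described above (deterministic pre-training followed by task sampling) might be organized as in the following sketch. The function name `task_schedule`, the task names, and the weights are all illustrative assumptions; each returned entry stands for one stochastic optimization step on that task's objective.

```python
import random
from typing import Dict, List, Tuple


def task_schedule(pretrain: List[Tuple[str, int]],
                  sampling_weights: Dict[str, float],
                  total_steps: int,
                  seed: int = 0) -> List[str]:
    """Return the task to update at each step: pre-training first, then sampling."""
    rng = random.Random(seed)
    schedule: List[str] = []
    for task, steps in pretrain:            # deterministic pre-training phase
        schedule.extend([task] * steps)
    names = list(sampling_weights)
    weights = [sampling_weights[t] for t in names]
    while len(schedule) < total_steps:      # then sample tasks from the distribution
        schedule.append(rng.choices(names, weights=weights, k=1)[0])
    return schedule[:total_steps]


schedule = task_schedule(pretrain=[("tagger", 3)],
                         sampling_weights={"tagger": 1.0, "parser": 3.0},
                         total_steps=10)
```

The ratio between the sampling weights plays the role of the parser/tagger update ratio, which can be tuned on a development set.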
|Andor et al. (2016)||94.44||92.93||97.77||90.17||87.54||94.80||95.40||93.64||96.86|
|Deep stacked parsing||94.66||93.23||98.09||90.22||87.67||95.06||96.05||94.51||97.25|
4 Experiments
In this section, we evaluate three aspects of our approach on two NLP tasks: English dependency parsing and extractive sentence summarization. For English dependency parsing, we primarily use the Union Treebank setup from Andor et al. (2016). By evaluating on both news and questions domains, we can separately evaluate how the model handles naturally longer and shorter form text. On the Union Treebank setup there are 93 possible actions considering all arc-label combinations. For extractive sentence summarization, we use the dataset of Filippova and Altun (2013), where a large news collection is used to heuristically generate compression instances. The final corpus contains about 2.3M compression instances, but since we evaluated multiple tasks using this data, we sub-sampled the training set to be comparably sized to the parsing data (60K training sentences). The test set contains 160K examples. We implement our method in TensorFlow, using mini-batches of size 4 and following the averaged momentum training and hyperparameter tuning procedure of Weiss et al. (2015).
Using explicit structure improves encoder/decoder
We explore the impact of different types of recurrences on dependency parsing in Table 1. In this setup, we used relatively small models: single-layer LSTMs with 256 hidden units, taking 32-dimensional word or output symbol embeddings as input to each cell. In each case, the parsing TBRU takes input from a right-to-left shift-only TBRU. Under these settings, the pure encoder/decoder seq2seq model simply does not have the capacity to parse newswire text with any degree of accuracy, but the TBRU-based approach is nearly state-of-the-art at exactly the same computational cost. As a point of comparison and an alternative to using input pointers, we also implemented an attention mechanism within DRAGNN. We used the dot-product formulation from Parikh et al. (2016), where the recurrence function in the parser takes in all of the shift-only TBRU’s hidden states and aggregates over them.
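For readers unfamiliar with dot-product attention, the sketch below shows the aggregation step in its simplest form: a query vector is scored against every encoder hidden state, the scores are normalized with a softmax, and the states are averaged with those weights. This is a simplified, assumed variant for illustration only; it omits any learned projections, and the function name is hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot_product_attention(query: Vector, states: List[Vector]) -> Tuple[List[float], Vector]:
    """Return softmax attention weights over `states` and their weighted average."""
    scores = [sum(q * k for q, k in zip(query, h)) for h in states]
    peak = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(states[0])
    context = [sum(w * h[k] for w, h in zip(weights, states)) for k in range(dim)]
    return weights, context


# A zero query scores every state equally, so the context is a plain average.
weights, context = dot_product_attention([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Each call touches every encoder state, so running it at every parser step costs time quadratic in the sentence length, whereas input pointers contribute a constant number of links per step.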
Utilizing parse representations improves summarization
We evaluate our approach on the summarization task in Table 2. We compare two single-task LSTM tagging baselines against two multi-task approaches: an adaptation of Luong et al. (2016) and the stack-propagation idea of Zhang and Weiss (2016). In both multi-task setups, we use a right-to-left shift-only TBRU to encode the input, and connect it to both our compositional arc-standard dependency parser and the Keep/Drop summarization tagging model.
In both setups we do not follow seq2seq, but utilize the Input function to connect output decisions directly to input token representations. However, in the stack-prop case, we use the Subtree function to connect the tagging TBRU to the parser TBRU’s phrase representations directly (Figure 2). We find that allowing the compressor to directly use the parser’s phrase representations significantly improves the outcome of the multi-task learning setup. In both setups, we pretrained the parsing model for 400K steps and tuned the subsequent ratio of parser/tagger update steps using a development set.
Deep stacked bi-directional parsing
Here we propose a continuous version of the bi-directional parsing model of Attardi and Dell’Orletta (2009): first, the sentence is parsed in the left-to-right order as usual; then a right-to-left transition system analyzes the sentence in reverse order using additional features extracted from the left-to-right parser. In our version, we connect the right-to-left parsing TBRU directly to the phrase representations of the left-to-right parsing TBRU, again using the Subtree function. Our parser has the significant advantage that the two directions of parsing can affect each other during training. During each training step the right-to-left parser uses representations obtained using the predictions of the left-to-right parser. Thus, the right-to-left parser can backpropagate error signals through the left-to-right parser and reduce cascading errors caused by the pipeline.
Our final model uses 5 TBRUs. Inspired by Zhang and Weiss (2016), a left-to-right POS tagging TBRU provides the first layer of representations. Next, two shift-only TBRUs, one in each direction, provide representations to the parsers. Finally, we connect the left-to-right parser to the right-to-left parser using links defined via the Subtree function. The result (Table 3) is a state-of-the-art dependency parser, yielding the highest published accuracy on the Treebank Union setup for both part of speech tagging and parsing.
We presented a compact, modular framework for describing recurrent neural architectures. We evaluated our dynamically structured model and found it to be significantly more efficient and accurate than attention mechanisms for dependency parsing and extractive sentence summarization in both single- and multi-task setups. While we focused primarily on syntactic parsing, the framework provides a general means of sharing representations between tasks. There remains low-hanging fruit still to be explored: in particular, our approach can be globally normalized with multiple hypotheses in the intermediate structure. We also plan to push the limits of multi-task learning by combining many different NLP tasks, such as translation, summarization, tagging problems, and reasoning tasks, into a single model.
- Andor et al. (2016) Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. pages 2442–2452.
- Attardi and Dell’Orletta (2009) Giuseppe Attardi and Felice Dell’Orletta. 2009. Reverse revision and linear tree combination for dependency parsing. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers. Association for Computational Linguistics, pages 261–264.
- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. ICLR .
- Bowman et al. (2016) Samuel R Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D Manning, and Christopher Potts. 2016. A fast unified model for parsing and sentence understanding. ACL .
- Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. EMNLP .
- Dong et al. (2015) Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the ACL and the 7th International Joint Conference on Natural Language Processing. pages 1723–1732.
- Dyer et al. (2015) Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. pages 334–343.
- Filippova and Altun (2013) Katja Filippova and Yasemin Altun. 2013. Overcoming the lack of parallel data in sentence compression. In EMNLP. Citeseer, pages 1481–1491.
- Goller and Kuchler (1996) Christoph Goller and Andreas Kuchler. 1996. Learning task-dependent distributed representations by backpropagation through structure. In Neural Networks, 1996., IEEE International Conference on. IEEE, volume 1, pages 347–352.
- Huang et al. (2015) Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional lstm-crf models for sequence tagging. ACL .
- Kalchbrenner and Blunsom (2013) Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. EMNLP .
- Kiperwasser and Goldberg (2016) Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional lstm feature representations. ACL .
- Lample et al. (2016) Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. NAACL-HLT .
- Li et al. (2015) Jiwei Li, Minh-Thang Luong, Dan Jurafsky, and Eduard Hovy. 2015. When are tree structures necessary for deep learning of representations? EMNLP .
- Ling et al. (2015) Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W Black, and Isabel Trancoso. 2015. Finding function in form: Compositional character models for open vocabulary word representation. EMNLP .
- Luong et al. (2016) Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task sequence to sequence learning. ICLR .
- Nivre (2006) Joakim Nivre. 2006. Inductive dependency parsing. Springer.
- Parikh et al. (2016) Ankur P Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. EMNLP .
- Socher et al. (2011) Richard Socher, Eric H Huang, Jeffrey Pennington, Christopher D Manning, and Andrew Y Ng. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems. pages 801–809.
- Socher et al. (2010) Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2010. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In NIPS-2010 Deep Learning and Unsupervised Feature Learning Workshop.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems. pages 3104–3112.
- Tai et al. (2015) Kai Sheng Tai, Richard Socher, and Christopher D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. ACL .
- Vinyals et al. (2015) Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems. pages 2773–2781.
- Weiss et al. (2015) David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. 2015. Structured training for neural network transition-based parsing. ACL .
- Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 .
- Zhang et al. (2016) Meishan Zhang, Yue Zhang, and Guohong Fu. 2016. Transition-based neural word segmentation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.
- Zhang and Weiss (2016) Yuan Zhang and David Weiss. 2016. Stack-propagation: Improved representation learning for syntax. In Proc. ACL.