1 Introduction
Many efforts to exploit linguistic hierarchy in NLP tasks make use of the output of a self-contained parser system trained from a human-annotated treebank (Huang et al., 2006). An alternative approach aims to jointly learn the task at hand and relevant aspects of linguistic hierarchy, inducing from an unannotated training dataset parse trees that may or may not correspond to treebank annotation practices (Wu, 1997; Chiang, 2005).
Most deep learning models for NLP that aim to make use of linguistic hierarchy integrate an external parser, either to prescribe the recursive structure of the neural network (Pollack, 1990; Goller and Küchler, 1996; Socher et al., 2013) or to provide a supervision signal or training data for a network that predicts its own structure (Socher et al., 2010; Bowman et al., 2016; Dyer et al., 2016b). But some recently described neural network models take the second approach and treat hierarchical structure as a latent variable, applying inference over graph-based conditional random fields (Kim et al., 2017), the straight-through estimator (Chung et al., 2017), or policy gradient reinforcement learning (Yogatama et al., 2017) to work around the inapplicability of gradient-based learning to problems with discrete latent states.
For the task of machine translation, syntactically informed models have shown promise both inside and outside the deep learning context, with hierarchical phrase-based models frequently outperforming traditional ones (Chiang, 2005) and neural MT models augmented with morphosyntactic input features (Sennrich and Haddow, 2016; Nadejde et al., 2017), a tree-structured encoder (Eriguchi et al., 2016; Hashimoto and Tsuruoka, 2017), and a jointly trained parser (Eriguchi et al., 2017) each outperforming purely sequential baselines.
Drawing on many of these precedents, we introduce an attentional neural machine translation model whose encoder and decoder components are both tree-structured neural networks that predict their own constituency structure as they consume or emit text. The encoder and decoder networks are variants of the RNNG model introduced by Dyer et al. (2016b), allowing tree structures of unconstrained arity, while text is ingested at the character level, allowing the model to discover and make use of structure within words.
The parsing decisions of the encoder and decoder RNNGs are parameterized by a stochastic policy trained using a weighted sum of two objectives: a language model loss term that rewards predicting the next character with high likelihood, and a tree attention term that rewards one-to-one attentional correspondence between constituents in the encoder and decoder.
We evaluate this model on the German-English language pair of the flickr30k dataset, where it obtains similar performance to a strong character-level baseline. Analysis of the latent trees produced by the encoder and decoder shows that the model learns a reasonable segmentation and shallow parse, and most phrase-level constituents constructed while ingesting the German input sentence correspond meaningfully to constituents built while generating the English output.
2 Model
2.1 Encoder/Decoder Architecture
The model consists of a coupled encoder and decoder, where the encoder is a modified stack-only recurrent neural network grammar (Kuncoro et al., 2017) and the decoder is a stack-only RNNG augmented with constituent-level attention. An RNNG is a top-down transition-based model that jointly builds a sentence representation and parse tree, representing the parser state with a StackLSTM and using a bidirectional LSTM as a constituent composition function.
Our implementation is detailed in Figure 1, and differs from Dyer et al. (2016b) in that it lacks separate new-nonterminal tokens for different phrase types, and thus does not include the phrase type as an input to the composition function. Instead, the new-nonterminal input for the encoder is fixed to a constant, while the corresponding input for the decoder is determined through an attention procedure (Section 2.2).
As originally described, the RNNG predicts parser transitions using a one-layer perceptron with three concatenated inputs: the last state of a unidirectional LSTM over the stack contents, the last state of a unidirectional LSTM over the reversed buffer of unparsed tokens, and the result of an LSTM over the past transitions. All three of these states can be computed with at most one LSTM step per parser transition using the StackLSTM algorithm (Dyer et al., 2016a). But such a baseline RNNG is actually outperformed by one which conditions the parser transitions only on the stack representation (Kuncoro et al., 2017). Restricting our model to this stack-only case allows both the encoder and decoder to be supervised using a language model loss, whereas allowing the model access to the buffer representation would give it a trivial way to predict the next character and obtain zero loss.
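To make the transition system concrete, the following is a minimal sketch of how a sequence of nt, gen, and reduce actions builds an untyped tree of unconstrained arity over characters. It contains no neural components, and the names `Node`, `OPEN`, and `run_transitions` are illustrative assumptions rather than anything taken from the implementation described here.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Union

# Marker pushed by the nt action. There is a single untyped new-nonterminal
# token, mirroring the absence of phrase types described above.
OPEN = object()


@dataclass
class Node:
    """A constituent with any number of children (characters or sub-constituents)."""
    children: List[Union["Node", str]] = field(default_factory=list)

    def bracketed(self) -> str:
        parts = [c if isinstance(c, str) else c.bracketed() for c in self.children]
        return "(" + " ".join(parts) + ")"


def run_transitions(actions: Iterable[str], chars: Iterable[str]) -> list:
    """Replay nt/gen/reduce actions and return the final stack contents."""
    stack: list = []
    char_iter = iter(chars)
    for action in actions:
        if action == "nt":
            stack.append(OPEN)             # open a new constituent
        elif action == "gen":
            stack.append(next(char_iter))  # consume or emit one character
        elif action == "reduce":
            children = []
            while stack and stack[-1] is not OPEN:
                children.append(stack.pop())
            if not stack:
                raise ValueError("reduce without an open nonterminal")
            if not children:
                raise ValueError("reduce with zero children is forbidden")
            stack.pop()                    # discard the OPEN marker
            stack.append(Node(children[::-1]))
        else:
            raise ValueError(f"unknown action: {action}")
    return stack


actions = ["nt", "nt", "gen", "gen", "reduce", "nt", "gen", "reduce", "reduce"]
tree = run_transitions(actions, "abc")[0]
print(tree.bracketed())
```

In the actual model each stack element would carry a vector representation, and the reduce step would apply the composition function to the popped children. The sketch only tracks the symbolic tree.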
2.2 Attention
With the exception of the attention mechanism, the encoder and decoder are identical. While the encoder uses a single token to represent a new nonterminal, the decoder represents a new nonterminal on the stack as a sum weighted by structural attention of the phrase representations of all nonterminal tree nodes produced by the encoder. In particular, we use the normalized dot products between the decoder stack representation and the stack representation at each encoder node (that is, the hidden state of the StackLSTM up to and including but not ) as coefficients in a weighted sum of the phrase embeddings corresponding to the encoder nodes:
(1) 
The dot products between encoder and decoder stack representations are a measure of structural similarity between the (left context of the) current decoder state and the encoder state. Within a particular decoder nonterminal, the model reduces to ordinary sequence-to-sequence transduction. Starting from the encoder’s representation of the corresponding nonterminal or a weighted combination of such representations, the decoder will emit a translated sequence of child constituents (both nonterminal and terminal) one by one, applying attention only when emitting nonterminal children.
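The attention computation described in this section can be sketched as follows. This is an illustration under stated assumptions: "normalized" is taken to mean a softmax over the dot products, vectors are plain Python lists, and the function and argument names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def softmax(scores: Sequence[float]) -> List[float]:
    # Assumption: the normalization is a softmax (shifted by the max for stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tree_attention(
    decoder_stack_state: Sequence[float],
    encoder_stack_states: Sequence[Sequence[float]],
    encoder_phrase_embeddings: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Return (attention weights, weighted sum of encoder phrase embeddings)."""
    # One score per encoder nonterminal node: the dot product of stack representations.
    scores = [dot(decoder_stack_state, s) for s in encoder_stack_states]
    weights = softmax(scores)
    dim = len(encoder_phrase_embeddings[0])
    # The weights act as coefficients over the encoder nodes' phrase embeddings.
    context = [
        sum(w * emb[k] for w, emb in zip(weights, encoder_phrase_embeddings))
        for k in range(dim)
    ]
    return weights, context


weights, context = tree_attention(
    decoder_stack_state=[1.0, 0.0],
    encoder_stack_states=[[10.0, 0.0], [0.0, 10.0]],
    encoder_phrase_embeddings=[[1.0, 2.0, 3.0], [-1.0, -2.0, -3.0]],
)
print(weights, context)
```

The resulting `context` vector plays the role of the decoder's new-nonterminal representation: it is dominated by the phrase embedding of whichever encoder node has the most similar stack representation.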
2.3 Training
We formulate our model as a stochastic computation graph (Schulman et al., 2015), leading to a training paradigm that combines backpropagation (which provides the exact gradient through deterministic nodes) and vanilla policy gradient (which provides a Monte Carlo estimator for the gradient through stochastic nodes).
There are several kinds of training signals in our model. First, when the encoder or decoder chooses the gen action it passes the current stack state through a one-layer softmax perceptron, giving the probability that the next token is each of the characters in the vocabulary. The language model loss for each generated token is the negative log probability assigned to the ground-truth next token. The other differentiable training signal is the coverage loss, which is a measure of how much the attention weights diverge from the ideal of a one-to-one mapping. This penalty is computed as a sum of three MSE terms:
(2)
Backpropagation using the differentiable losses affects only the weights of the output softmax perceptron. The overall loss function for these weights is a weighted sum of all language model loss terms and the coverage loss:
(3)
There are additionally non-differentiable rewards that bias the model towards or away from certain kinds of tree structures. Here, negative numbers correspond to penalties. We assign a negative tree reward when the model predicts a reduce with only one child constituent (reduce with zero child constituents is forbidden) or predicts two reduce or nt transitions in a row. This biases the model against unary branching and reduces its likelihood of producing an exclusively left- or right-branching tree structure. In addition, for all constituents except the root, we assign a tree reward based on the size and type of its children: the reward is computed from the numbers of nonterminal and terminal children, taking one form if all children are terminal and another otherwise. A reward structure like this biases the model against freely mixing terminals and nonterminals within the same constituent and provides incentive to build substantial tree structures early on in training so the model doesn’t get stuck in trivial local minima.
Within both the encoder and decoder, each stochastic action node has a corresponding tree reward if the action was reduce (otherwise zero) and a corresponding language model loss if the action was gen (otherwise zero). We subtract an exponential moving average baseline from each tree reward and additional exponential moving average baselines—computed independently for each character in the vocabulary, because we want to reduce the effect of character frequency—from the language model losses. If is the number of gen transitions among actions one through , and is a decay constant, the final reward for action with is:
(4) 
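The baseline subtraction described above can be sketched as follows. This covers only the exponential moving average baselines, not the full reward of Equation (4). The class name `EMABaseline`, the choice to initialize each baseline from its first observation, and the update order (subtract first, then update) are assumptions made for illustration. The default decay of 0.95 is the value reported in Section 3.

```python
from collections import defaultdict


class EMABaseline:
    """Exponential moving average baseline for variance reduction."""

    def __init__(self, decay: float = 0.95):
        self.decay = decay
        self.value = 0.0
        self.initialized = False

    def advantage(self, x: float) -> float:
        """Return x minus the current baseline, then fold x into the baseline."""
        if not self.initialized:
            # Assumption: start the average at the first observed value.
            self.value, self.initialized = x, True
        centred = x - self.value
        self.value = self.decay * self.value + (1.0 - self.decay) * x
        return centred


# One shared baseline for tree rewards, and one baseline per character for
# language model losses so that frequent and rare characters are centred separately.
tree_baseline = EMABaseline()
lm_baselines = defaultdict(EMABaseline)

print(tree_baseline.advantage(-1.0))    # first observation: centred value is 0.0
print(lm_baselines["e"].advantage(0.5))
print(lm_baselines["q"].advantage(4.0))
print(lm_baselines["e"].advantage(0.7))
```

Keeping a separate average per character means that a high loss on a rare character is compared against that character's own typical loss, not against a global average dominated by frequent characters.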
These rewards define the gradient that each stochastic node (with normalized action probabilities and chosen action) produces during backpropagation according to the standard multinomial score function estimator (REINFORCE):
(5) 
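As an illustration of the score function estimator for a single categorical action node, the sketch below computes the gradient of `reward * log p[action]` with respect to the logits. It assumes the action probabilities come from a softmax over logits and uses the textbook closed form, so its sign convention and notation may differ from Equation (5). The function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_function_grad(logits: Sequence[float], action: int, reward: float) -> List[float]:
    """Gradient of reward * log softmax(logits)[action] with respect to the logits.

    For a softmax-parameterized categorical distribution this equals
    reward * (one_hot(action) - probs).
    """
    probs = softmax(logits)
    return [
        reward * ((1.0 if i == action else 0.0) - p)
        for i, p in enumerate(probs)
    ]


grad = score_function_grad([0.2, -1.0, 0.5], action=2, reward=1.5)
print(grad)
```

A positive reward raises the logit of the chosen action and lowers the others. A negative reward (for example, a tree penalty or an above-baseline language model loss) does the opposite.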
3 Results
We evaluated our model on the German-English language pair of the flickr30k data, the textual component of the WMT Multimodal Translation shared task (Specia et al., 2016). An attentional sequence-to-sequence model with two layers and 384 hidden units from the OpenNMT project (Klein et al., 2017) was run at the character level as a baseline, obtaining 32.0 test BLEU with greedy inference. Our model with the same hidden size and greedy inference achieves test BLEU of 28.5 after removing repeated bigrams.
We implemented the model in PyTorch, benefiting from its strong support for dynamic and stochastic computation graphs, and trained with batch size 10 and the Adam optimizer (Kingma and Ba, 2015) with early stopping after 12 epochs. Character embeddings and the encoder’s new-nonterminal embedding were initialized to random 384-dimensional vectors. The value of and the decay constant for the baselines’ exponential moving average were both set to 0.95.
A random selection of translations is included in the supplemental material, while two attention plots are shown in Figure 2. Figure 2b demonstrates a common pathology of the model, where a phrasal encoder constituent would be attended to during decoding of the head word of the corresponding decoder constituent, while the head word of the encoder constituent would be attended to during decoding of the decoder constituent corresponding to the whole phrase. Another common pathology is repeated sentence fragments in the translation, which are likely generated because the model cannot condition future attention directly on past attention weights (the “input feeding” approach introduced by Luong et al. (2015)).
Translation quality also suffers because of our use of a stack-only RNNG, which we chose because an RNNG with both stack and buffer inputs is incompatible with a language model loss. During encoding, the model must decide at the very beginning of the sentence how deeply to embed the first character. But with a stack-only RNNG, it must make this decision randomly, since it isn’t able to use the buffer representation, which contains the entire sentence.
4 Conclusion
We introduce a new approach to leveraging unsupervised tree structures in NLP tasks like machine translation. Our experiments demonstrate that a small-scale MT dataset contains sufficient training signal to infer latent linguistic structure, and we are excited to learn what models like the one presented here can discover in full-size translation corpora. One particularly promising avenue of research is to leverage the inherently compositional phrase representations produced by the encoder for other NLP tasks.
There are also many possible directions for improving the model itself and the training process. Value function baselines can replace exponential moving averages, pure reinforcement learning can replace teacher forcing, and beam search can be used in place of greedy inference. Solutions to the translation pathologies presented in Section 3 are likely more complex, although one possible approach would leverage variational inference using a teacher model that can see the buffer and helps train a stack-only student model.
References
Bowman et al. (2016) Samuel Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher Manning, and Christopher Potts. 2016. A fast unified model for parsing and sentence understanding. In ACL.
Chiang (2005) David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In ACL.
Chung et al. (2017) Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. 2017. Hierarchical multiscale recurrent neural networks. In ICLR.
Dyer et al. (2016a) Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A Smith. 2016a. Transition-based dependency parsing with stack long short-term memory. In EMNLP.
Dyer et al. (2016b) Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah Smith. 2016b. Recurrent neural network grammars. In NAACL.
Eriguchi et al. (2016) Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-sequence attentional neural machine translation. In ACL.
Eriguchi et al. (2017) Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. 2017. Learning to parse and translate improves neural machine translation. In ACL.
Goller and Küchler (1996) Christoph Goller and Andreas Küchler. 1996. Learning task-dependent distributed representations by backpropagation through structure. In IEEE International Conference on Neural Networks. IEEE, volume 1, pages 347–352.
Hashimoto and Tsuruoka (2017) Kazuma Hashimoto and Yoshimasa Tsuruoka. 2017. Neural machine translation with source-side latent graph parsing. In EMNLP.
Huang et al. (2006) Liang Huang, Kevin Knight, and Aravind Joshi. 2006. A syntax-directed translator with extended domain of locality. In CHPJINLP. Association for Computational Linguistics.
Kim et al. (2017) Yoon Kim, Carl Denton, Luong Hoang, and Alexander Rush. 2017. Structured attention networks. In ICLR.
Kingma and Ba (2015) Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
Klein et al. (2017) G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810.
Kuncoro et al. (2017) Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng Kong, Chris Dyer, Graham Neubig, and Noah Smith. 2017. What do recurrent neural network grammars learn about syntax? In EACL.
Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP.
Nadejde et al. (2017) Maria Nadejde, Siva Reddy, Rico Sennrich, Tomasz Dwojak, Marcin Junczys-Dowmunt, Philipp Koehn, and Alexandra Birch. 2017. Syntax-aware neural machine translation using CCG. arXiv preprint arXiv:1702.01147.
Pollack (1990) Jordan B Pollack. 1990. Recursive distributed representations. Artificial Intelligence 46(1):77–105.
Schulman et al. (2015) John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. 2015. Gradient estimation using stochastic computation graphs. In NIPS.
Sennrich and Haddow (2016) Rico Sennrich and Barry Haddow. 2016. Linguistic input features improve neural machine translation. In WMT.
Socher et al. (2010) Richard Socher, Christopher Manning, and Andrew Ng. 2010. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.
Specia et al. (2016) Lucia Specia, Stella Frank, Khalil Sima’an, and Desmond Elliott. 2016. A shared task on multimodal machine translation and cross-lingual image description. In WMT.
Wu (1997) Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics 23(3):377–403.
Yogatama et al. (2017) Dani Yogatama, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Wang Ling. 2017. Learning to compose words into sentences with reinforcement learning. In ICLR.