A Tree-based Decoder for Neural Machine Translation

08/28/2018 ∙ by Xinyi Wang et al.

Recent advances in Neural Machine Translation (NMT) show that adding syntactic information to NMT systems can improve the quality of their translations. Most existing work utilizes specific types of linguistically-inspired tree structures, such as constituency or dependency parse trees. This is often done via a standard RNN decoder that operates on a linearized target tree structure. However, it remains an open question which specific linguistic formalism, if any, is the best structural representation for NMT. In this paper, we (1) propose an NMT model that can naturally generate the topology of an arbitrary tree structure on the target side, and (2) experiment with various target tree structures. Our experiments show the surprising result that our model delivers the best improvements with balanced binary trees constructed without any linguistic knowledge; this model outperforms standard seq2seq models by up to 2.1 BLEU points, and other methods for incorporating target-side syntax by up to 0.7 BLEU.


1 Introduction

Most NMT methods use sequence-to-sequence (seq2seq) models, taking in a sequence of source words and generating a sequence of target words (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015). While seq2seq models can implicitly discover syntactic properties of the source language (Shi et al., 2016), they do not explicitly model and leverage such information. Motivated by the success of adding syntactic information to Statistical Machine Translation (SMT) (Galley et al., 2004, 2006; Menezes and Quirk, 2007), recent work has established that explicitly leveraging syntactic information can improve NMT quality, either through syntactic encoders (Li et al., 2017; Eriguchi et al., 2016), multi-task learning objectives (Chen et al., 2017; Eriguchi et al., 2017), or the direct addition of syntactic tokens to the target sequence (Nadejde et al., 2017; Aharoni and Goldberg, 2017). However, these syntax-aware models only employ the standard decoding process of seq2seq models, i.e. generating one target word at a time. One exception is Wu et al. (2017), which utilizes two RNNs to generate target dependency trees. Nevertheless, the model of Wu et al. (2017) is specifically designed for dependency tree structures and is not trivially applicable to other varieties of trees, such as phrase-structure trees, which have been used more widely in other work on syntax-based machine translation. One potential reason for the dearth of work on syntactic decoders is that such parse tree structures are not a natural fit for recurrent neural networks (RNNs).

In this paper, we propose TrDec, a method for incorporating tree structures in NMT. TrDec simultaneously generates a target-side tree topology and a translation, using the partially-generated tree to guide the translation process (§ 2). TrDec employs two RNNs: a rule RNN, which tracks the topology of the tree based on rules defined by a Context Free Grammar (CFG), and a word RNN, which tracks words at the leaves of the tree (§ 3). This model is similar to neural models of tree-structured data from syntactic and semantic parsing (Dyer et al., 2016; Alvarez-Melis and Jaakkola, 2017; Yin and Neubig, 2017), but with the addition of the word RNN, which is especially important for MT where fluency of transitions over the words is critical.

Figure 1: An example generation process of TrDec. Left: A target parse tree; the green squares represent preterminal nodes. Right: How our RNNs generate the parse tree on the left; the blue cells represent the activities of the rule RNN, while the grey cells represent the activities of the word RNN. Special end-of-phrase and end-of-sentence tokens mark the end of each phrase and of the whole derivation. Best viewed in color.

TrDec can generate any tree structure that can be represented by a CFG. These structures include linguistically-motivated syntactic tree representations, e.g. constituent parse trees, as well as syntax-free tree representations, e.g. balanced binary trees (§ 4). This flexibility of TrDec allows us to compare and contrast different structural representations for NMT.

In our experiments (§ 5), we evaluate TrDec using both syntax-driven and syntax-free tree representations. We benchmark TrDec on three tasks: Japanese-English and German-English translation with medium-sized datasets, and Oromo-English translation with an extremely small dataset. Our findings are surprising: TrDec performs well overall, but it performs best with balanced binary trees constructed without any linguistic guidance.

2 Generation Process

TrDec simultaneously generates the target sequence and its corresponding tree structure. We first discuss the high-level generation process using an example, before describing the prediction model (§ 3) and the types of trees used by TrDec (§ 4).

Fig. 1 illustrates the generation process of the sentence “_The _cat _eat s _fi sh _.”, where the sentence is split into subword units, delimited by the underscore “_” (Sennrich et al., 2016). The example uses a syntactic parse tree as the intermediate tree representation, but the process of generating with other tree representations, e.g. syntax-free trees, follows the same procedure.

Trees used in TrDec have two types of nodes: terminal nodes, i.e. the leaf nodes that represent subword units; and nonterminal nodes, i.e. the non-leaf nodes that represent a span of subwords. Additionally, we define a preterminal node to be a nonterminal node whose children are all terminal nodes. In Fig. 1 Left, the green squares represent preterminal nodes.
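
To make these node types concrete, the following minimal sketch shows one way such trees could be represented; the TreeNode class and its fields are illustrative and not taken from the paper's implementation.

from dataclasses import dataclass, field
from typing import List

@dataclass
class TreeNode:
    # For terminal nodes the label is the subword unit itself; otherwise it is
    # the nonterminal tag (or a null tag for syntax-free trees).
    label: str
    children: List["TreeNode"] = field(default_factory=list)

    @property
    def is_terminal(self) -> bool:
        return not self.children

    @property
    def is_preterminal(self) -> bool:
        # A nonterminal whose children are all terminal (subword) nodes.
        return bool(self.children) and all(c.is_terminal for c in self.children)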

TrDec generates a tree in a top-down, left-to-right order. The generation process is guided by a CFG over target trees, which is constructed by taking all production rules extracted from the trees of all sentences in the training corpus. Specifically, a rule RNN first generates the top of the tree structure, and continues until a preterminal is reached. Then, a word RNN fills out the words under the preterminal. The model switches back to the rule RNN after the word RNN finishes. This process is illustrated in Fig. 1 Right. Details are as follows:

Step 1. The source sentence is encoded by a sequential RNN encoder, producing the hidden states.

Step 2. The generation starts with a derivation tree containing only a ROOT node. The rule RNN, initialized with the last encoder hidden state, computes a probability distribution over all CFG rules whose left-hand side (LHS) is ROOT, and selects a rule to apply to the derivation. In our example, the rule RNN selects ROOT → S.

Step 3. The rule RNN applies production rules to the derivation in a top-down, left-to-right order, expanding the current opening nonterminal using a CFG rule whose LHS is that opening nonterminal. In the next two steps, TrDec applies the rules S → NP VP PUNC and NP → pre to the opening nonterminals S and NP, respectively. Note that after these two steps a preterminal node pre is created.

Step 4a. Upon seeing a preterminal node as the current opening nonterminal, TrDec switches to the word RNN, initialized with the last state of the encoder, to populate this empty preterminal with phrase tokens, similar to a seq2seq decoder. In our example, the subword units _The and _cat are generated by the word RNN, which then emits a special end-of-phrase token.

Step 4b. While the word RNN generates subword units, the rule RNN also updates its own hidden states, as illustrated by the blue cells in Fig. 1 Right.

Step 5. After the word RNN generates the end-of-phrase token, TrDec switches back to the rule RNN to continue the derivation from where the tree left off. In our example, the next opening nonterminal is VP, and TrDec chooses the rule VP → pre NP.

TrDec repeats the process above, alternating between the rule RNN and the word RNN as described, and halts when the rule RNN generates the end-of-sentence token, completing the derivation.
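
The alternation between the two RNNs can be summarized by the sketch below. Here predict_rule and predict_words are hypothetical stand-ins for the rule RNN and the word RNN of § 3, the token strings are placeholders, and the stack-based bookkeeping is only one plausible way to implement the top-down, left-to-right order.

def trdec_decode(predict_rule, predict_words, root="ROOT", preterminal="pre",
                 eos="</s>"):
    """Top-down, left-to-right generation of a derivation and its translation.

    predict_rule(lhs, history)  -> a CFG rule (lhs, rhs_symbols) or eos
    predict_words(history)      -> subword units filling one preterminal
    """
    stack = [root]        # opening nonterminals, leftmost on top
    history = []          # rules and subwords generated so far
    translation = []      # leaves of the tree, i.e. the output sentence

    while stack:
        symbol = stack.pop()
        if symbol == preterminal:
            # Step 4: the word RNN populates the preterminal with a phrase.
            phrase = predict_words(history)
            history.extend(phrase)
            translation.extend(phrase)
            continue
        # Steps 2, 3 and 5: the rule RNN expands the current opening nonterminal.
        rule = predict_rule(symbol, history)
        if rule == eos:
            break         # end-of-sentence: the derivation is complete
        history.append(rule)
        _, rhs = rule
        stack.extend(reversed(rhs))   # leftmost child is expanded next
    return translation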

3 Model

We now describe the computations performed during the generation process discussed in § 2. First, a source sentence, split into subwords, is encoded using a standard bi-directional Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997). The bi-directional LSTM outputs a set of hidden states, which TrDec references through an attention function (Bahdanau et al., 2015).
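
As a rough illustration of this encoder side, the snippet below sketches a bi-directional LSTM encoder and a dot-product attention function; the sizes, the projection layer, and the particular attention variant are assumptions made for the example, not details from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

EMB, HID, SRC_VOCAB = 256, 256, 8000          # placeholder sizes
embed = nn.Embedding(SRC_VOCAB, EMB)          # source subword embeddings
encoder = nn.LSTM(EMB, HID, bidirectional=True, batch_first=True)
proj = nn.Linear(2 * HID, HID)                # fold the two directions together

def encode(src_ids):                          # src_ids: (batch, src_len) LongTensor
    states, _ = encoder(embed(src_ids))       # (batch, src_len, 2 * HID)
    return proj(states)                       # (batch, src_len, HID)

def attention(query, enc_states):             # query: (batch, HID)
    """Dot-product attention over the encoder states; returns a context vector."""
    scores = torch.einsum("bd,bsd->bs", query, enc_states)
    weights = F.softmax(scores, dim=-1)
    return torch.einsum("bs,bsd->bd", weights, enc_states)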

As discussed, TrDec uses two RNNs to generate a target parse tree. In our work, both of these RNNs use LSTMs, but with different parameters.

Rule RNN.

At any time step $t$ in the rule RNN, there are two possible actions. If at the previous time step $t-1$ TrDec generated a CFG rule, then the hidden state $h^{\text{rule}}_t$ is computed by:

$h^{\text{rule}}_t = \text{LSTM}\big( [ e^{\text{rule}}_{t-1} ; c_{t-1} ; h^{\text{rule}}_{\text{par}} ; h^{\text{word}} ],\; h^{\text{rule}}_{t-1} \big)$

where $e^{\text{rule}}_{t-1}$ is the embedding of the CFG rule generated at time step $t-1$; $c_{t-1}$ is the context vector computed by attention at $t-1$, i.e. input feeding (Luong et al., 2015); $h^{\text{rule}}_{\text{par}}$ is the hidden state of the time step that generated the parent of the current node in the partial tree; $h^{\text{word}}$ is the hidden state of the most recent time step before $t$ that generated a subword (note that $h^{\text{word}}$ comes from the word RNN, discussed below); and $[\cdot\,;\cdot]$ denotes concatenation.

Meanwhile, if at the previous time step $t-1$ TrDec did not generate a CFG rule, then the update at time step $t$ must come from a subword generated by the word RNN. In that case, the rule RNN is updated in the same way, except that the embedding of the CFG rule is replaced by the embedding of that subword.
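
A single rule-RNN update consistent with this description might look as follows; the hidden sizes and the use of a single LSTMCell are assumptions of the sketch.

import torch
import torch.nn as nn

EMB, HID = 256, 256                           # placeholder sizes
# Input: previous rule (or subword) embedding, previous attention context,
# the parent node's hidden state, and the latest word-RNN hidden state.
rule_cell = nn.LSTMCell(EMB + 3 * HID, HID)

def rule_step(prev_emb, prev_ctx, parent_state, word_state, hc_prev):
    """One rule-RNN update; inputs are (batch, dim) tensors, hc_prev = (h, c)."""
    inp = torch.cat([prev_emb, prev_ctx, parent_state, word_state], dim=-1)
    return rule_cell(inp, hc_prev)            # new (h, c) of the rule RNN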

Word RNN.

At any time step $t$ at which the word RNN is invoked, its hidden state is:

$h^{\text{word}}_t = \text{LSTM}\big( [ h^{\text{rule}} ; e^{\text{word}}_{t-1} ; c_{t-1} ],\; h^{\text{word}}_{t-1} \big)$

where $h^{\text{rule}}$ is the hidden state of the rule RNN step that generated the CFG rule above the current terminal; $e^{\text{word}}_{t-1}$ is the embedding of the subword generated at time step $t-1$; and $c_{t-1}$ is the attention context computed at the previous word RNN time step $t-1$.
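
Analogously, a word-RNN update under the same assumptions:

import torch
import torch.nn as nn

EMB, HID = 256, 256                           # placeholder sizes
# Input: the rule-RNN state that opened the current preterminal, the previous
# subword embedding, and the previous attention context.
word_cell = nn.LSTMCell(HID + EMB + HID, HID)

def word_step(rule_state, prev_word_emb, prev_ctx, hc_prev):
    inp = torch.cat([rule_state, prev_word_emb, prev_ctx], dim=-1)
    return word_cell(inp, hc_prev)            # new (h, c) of the word RNN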

Softmax.

At any step $t$, the softmax logits are computed from the current decoder hidden state and the attention context vector, using an output projection that differs depending on whether a CFG rule or a subword unit needs to be generated.
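
One plausible parameterization of these two output layers is sketched below; the concatenation of the decoder hidden state and the context vector is one common choice and is assumed here.

import torch
import torch.nn as nn

HID, N_RULES, N_SUBWORDS = 256, 5000, 8000    # placeholder sizes
rule_out = nn.Linear(2 * HID, N_RULES)        # projection over CFG rules
word_out = nn.Linear(2 * HID, N_SUBWORDS)     # projection over subword units

def output_logits(hidden, ctx, need_rule):
    feats = torch.cat([hidden, ctx], dim=-1)  # [hidden state ; attention context]
    return rule_out(feats) if need_rule else word_out(feats)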

Figure 2: An example of four tree structures (details of preterminals and subword units are omitted for clarity).
Figure 3: Conversion of a dependency tree for TrDec. Left: original dependency tree. Right: after conversion.

4 Tree Structures

Unlike prior work on syntactic decoders designed for utilizing a specific type of syntactic information (Wu et al., 2017), TrDec is a flexible NMT model that can utilize any tree structure. Here we consider two categories of tree structures:

Syntactic Trees are generated using a third-party parser, such as the Berkeley parser (Petrov et al., 2006; Petrov and Klein, 2007). Fig. 2 Top Left illustrates an example constituency parse tree. We also consider a variation of standard constituency parse trees in which all nonterminal tags are replaced by a null tag, visualized in Fig. 2 Top Right.
In addition to constituency parse trees, TrDec can also utilize dependency parse trees via a simple procedure that converts a dependency tree into a constituency tree. Specifically, this procedure creates a parent node with null tag for each word, and then attaches each word to the parent node of its head word while preserving the word order. An example of this procedure is provided in Fig. 3.
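
A small sketch of this conversion, assuming the dependency parse is given as tokens plus 1-indexed head positions (0 for the root); the dictionary-based tree representation is only illustrative.

def dep_to_constituency(tokens, heads):
    """Convert a dependency parse into a null-tagged constituency-style tree."""
    # One null-tagged parent node per word; each initially contains its own word.
    parents = [{"tag": "null", "children": [(i, tokens[i])]}
               for i in range(len(tokens))]
    root = None
    for i, h in enumerate(heads):
        if h == 0:
            root = parents[i]                  # the syntactic root
        else:
            parents[h - 1]["children"].append((i, parents[i]))
    # Keep the original word order among each node's children.
    for p in parents:
        p["children"] = [child for _, child in sorted(p["children"],
                                                      key=lambda c: c[0])]
    return root

# Example: dep_to_constituency(["the", "cat", "sleeps"], [2, 3, 0]) yields a
# tree rooted at the node for "sleeps", with "the" attached under "cat".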

Balanced Binary Trees are syntax-free trees constructed without any linguistic guidance. We use two slightly different versions of binary trees. Version 1 (Fig. 2 Bottom Left) is constructed by recursively splitting the target sentence in half and creating left and right subtrees from the left and right halves of the sentence, respectively. Version 2 (Fig. 2 Bottom Right) is constructed by applying Version 1 to a list of nodes in which consecutive words have been combined. All tree nodes in both versions have the null tag. We discuss these construction processes in more detail in Appendix A.1.

In the experiments detailed later (§ 5), we evaluate TrDec with four different tree-structure settings: 1) fully syntactic constituency parse trees; 2) constituency parse trees with null tags; 3) dependency parse trees; and 4) a concatenation of version 1 and version 2 of the binary trees (which effectively doubles the amount of training data and leads to a slight increase in accuracy).

5 Experiments

Datasets.

We evaluate TrDec on three datasets: 1) the KFTT (ja-en) dataset (Neubig, 2011), which consists of Japanese-English Wikipedia articles; 2) the IWSLT 2016 German-English (de-en) dataset (Cettolo et al., 2016), which consists of TED talk transcriptions; and 3) the LORELEI Oromo-English (or-en) dataset (LDC2017E29), which largely consists of texts from the Bible. Details are in Tab. 1. English sentences are parsed using Ckylark (Oda et al., 2015) for the constituency parse trees, and the Stanford Parser (de Marneffe et al., 2006; Chen and Manning, 2014) for the dependency parse trees. We use byte-pair encoding (Sennrich et al., 2016) with 8K merge operations on ja-en, 4K merge operations on or-en, and 24K merge operations on de-en.

Dataset Train Dev Test
ja-en 405K 1166 1160
de-en 200K 1024 1333
or-en 6.5K 358 359
Table 1: # sentences in each dataset.

Baselines.

We compare TrDec against four baselines: 1) seq2seq: the standard seq2seq model with attention; 2) CCG: a syntax-aware translation model that interleaves Combinatory Categorial Grammar (CCG) tags with words on the target side of a seq2seq model (Nadejde et al., 2017); 3) CCG-null: the same model as CCG, but with all syntactic tags replaced by a null tag; and 4) LIN: a standard seq2seq model that generates linearized parse trees on the target side (Aharoni and Goldberg, 2017).

Results.

Tab. 2 presents the performance of our model and the four baselines. For our model, we report the performance of TrDec-con, TrDec-con-null, TrDec-dep, and TrDec-binary (settings 1, 2, 3, and 4 in § 4). On the low-resource or-en dataset, we observe a large variance across random seeds, so we run each model with several different seeds and report the mean and standard deviation of these runs. TrDec-con-null and TrDec-con achieve comparable results, indicating that the syntactic labels have neither a large positive nor a large negative impact on TrDec. On ja-en and or-en, syntax-free TrDec outperforms all baselines. On de-en, TrDec loses to CCG-null, but the difference is not statistically significant.

Table 2: BLEU scores of TrDec-con, TrDec-con-null, TrDec-dep, and TrDec-binary and of the baselines (seq2seq, CCG, CCG-null, LIN) on ja-en, de-en, and or-en; or-en results are reported as the mean and standard deviation over multiple random seeds. Statistical significance relative to the best baseline is marked.

Length Analysis.

We performed a variety of analyses to elucidate the differences between the translations of different models, and the most conclusive results came from an analysis based on the length of the translations. First, we categorize the ja-en test set into buckets by length of the reference sentences, and compare the models for each length category. Fig. 4 shows the gains in BLEU score over seq2seq for the tree-based models. Since TrDec-con outperforms TrDec-dep on all datasets, we focus only on TrDec-con when analyzing TrDec's performance with syntactic trees. The relative performance of CCG decreases on long sentences. However, TrDec, with both parse trees and syntax-free binary trees, delivers larger improvements on longer sentences. This indicates that TrDec is better at capturing long-term dependencies during decoding. Surprisingly, TrDec-binary, which does not utilize any linguistic information, outperforms TrDec-con for all sentence length categories.

Second, Fig. 5 shows a histogram of translations bucketed by the length difference between the generated output and the reference. This provides an explanation for the difficulty of using parse trees. Ideally, this distribution should be concentrated around zero, indicating that the MT system generates translations of about the same length as the reference. However, the distribution of TrDec-con is more spread out than that of TrDec-binary, which indicates that it is more difficult for TrDec-con to generate sentences of appropriate target length. This is probably because constituency parse trees of sentences with a similar number of words can have very different depths, and thus a larger variance in the number of generation steps, likely making it difficult for the MT model to plan the sentence structure a priori, before actually generating the words.

Figure 4: Gains in BLEU score over seq2seq for each reference-length bucket.
Figure 5: Distribution of length difference from the reference.
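
For reference, the two analyses above can be reproduced along the following lines; this is a sketch, with sacrebleu as one possible corpus-level BLEU scorer and an arbitrary bucket width.

from collections import Counter
import sacrebleu   # pip install sacrebleu

def bleu_by_length_bucket(hyps, refs, width=10):
    """Corpus BLEU computed separately per reference-length bucket (cf. Fig. 4)."""
    buckets = {}
    for hyp, ref in zip(hyps, refs):
        b = len(ref.split()) // width
        buckets.setdefault(b, ([], []))[0].append(hyp)
        buckets[b][1].append(ref)
    return {b * width: sacrebleu.corpus_bleu(hs, [rs]).score
            for b, (hs, rs) in sorted(buckets.items())}

def length_difference_histogram(hyps, refs):
    """Histogram of output length minus reference length (cf. Fig. 5)."""
    return Counter(len(h.split()) - len(r.split()) for h, r in zip(hyps, refs))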

6 Conclusion

We propose TrDec, a novel tree-based decoder for NMT that generates a translation along with its target-side tree topology. We evaluate TrDec with both linguistically-inspired parse trees and synthetic, syntax-free binary trees. Our model, when used with synthetic balanced binary trees, outperforms CCG, the existing state of the art in incorporating target-side syntax into NMT models.

The interesting result that syntax-free trees outperform their syntax-driven counterparts raises a natural question for future work: how can we better model syntactic structure in these models? It would also be interesting to study the effect of using source-side syntax together with the target-side syntax supported by TrDec.

Acknowledgements

This material is based upon work supported in part by the Defense Advanced Research Projects Agency Information Innovation Office (I2O) Low Resource Languages for Emergent Incidents (LORELEI) program under Contract No. HR0011-15-C0114, and the National Science Foundation under Grant No. 1815287. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.

References

  • Aharoni and Goldberg (2017) Roee Aharoni and Yoav Goldberg. 2017. Towards string-to-tree neural machine translation. In ACL.
  • Alvarez-Melis and Jaakkola (2017) David Alvarez-Melis and Tommi S. Jaakkola. 2017. Tree-structured decoding with doubly recurrent neural network. In ICLR.
  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
  • Cettolo et al. (2016) M. Cettolo, J. Niehues, S. Stuker, L. Bentivogli, R. Cattoni, and M. Federico. 2016. The IWSLT 2016 evaluation campaign. In IWSLT.
  • Chen and Manning (2014) Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP.
  • Chen et al. (2017) Huadong Chen, Shujian Huang, David Chiang, and Jiajun Chen. 2017. Improved neural machine translation with a syntax-aware encoder and decoder. In ACL.
  • Dyer et al. (2016) Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In ACL.
  • Eriguchi et al. (2016) Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-sequence attentional neural machine translation. In ACL.
  • Eriguchi et al. (2017) Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. 2017. Learning to parse and translate improves neural machine translation. In ACL.
  • Galley et al. (2006) Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. 2006. Scalable inference and training of context-rich syntactic translation models. In ACL.
  • Galley et al. (2004) Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What’s in a translation rule? In NAACL.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.
  • Kalchbrenner and Blunsom (2013) Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In EMNLP.
  • Li et al. (2017) Junhui Li, Deyi Xiong, Zhaopeng Tu, Muhua Zhu, Min Zhang, and Guodong Zhou. 2017. Modeling source syntax for neural machine translation. In ACL.
  • Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP.
  • de Marneffe et al. (2006) Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In LREC.
  • Menezes and Quirk (2007) Arul Menezes and Chris Quirk. 2007. Using dependency order templates to improve generality in translation. In WMT.
  • Nadejde et al. (2017) Maria Nadejde, Siva Reddy, Rico Sennrich, Tomasz Dwojak, Marcin Junczys-Dowmunt, Philipp Koehn, and Alexandra Birch. 2017. Predicting target language CCG supertags improves neural machine translation. In WMT.
  • Neubig (2011) Graham Neubig. 2011. The Kyoto free translation task. http://www.phontron.com/kftt.
  • Oda et al. (2015) Yusuke Oda, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2015. Ckylark: A more robust PCFG-LA parser. In NAACL Software Demonstration.
  • Petrov et al. (2006) Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In COLING-ACL.
  • Petrov and Klein (2007) Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In HLT-NAACL.
  • Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL.
  • Shi et al. (2016) Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does string-based neural MT learn source syntax? In EMNLP.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.
  • Wu et al. (2017) Shuangzhi Wu, Dongdong Zhang, Nan Yang, Mu Li, and Ming Zhou. 2017. Sequence-to-dependency neural machine translation. In ACL.
  • Yin and Neubig (2017) Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. In ACL.

Appendix A Appendix

A.1 Algorithm

Here we list two simple algorithms for constructing balanced binary trees over the target sentence. For our experiments with TrDec on binary trees, we use both algorithms to produce two versions of the binary tree for each training sentence, and concatenate them as a data augmentation strategy.

Input: w: the list of words in a sentence, l: start index, r: end index (inclusive)
Output: a balanced binary tree over the words from l to r in w
Function make_tree_v1(w, l, r):
    if l == r then
        return leaf(w[l])
    end if
    m ← index of the split point, i.e. ⌊(l + r) / 2⌋
    return node(make_tree_v1(w, l, m), make_tree_v1(w, m + 1, r))
Algorithm 1: The first method of making a balanced binary tree
Input: w: the list of words in a sentence, l: start index, r: end index (inclusive)
Output: a balanced binary tree over the words from l to r in w
Function make_tree_v2(w, l, r):
    n ← an empty list of nodes
    while l < r do
        append node(leaf(w[l]), leaf(w[l + 1])) to n      (combine consecutive words)
        l ← l + 2
    end while
    if l == r then
        append leaf(w[l]) to n      (one word is left over)
    end if
    return make_tree_v1(n, 0, |n| - 1)
Algorithm 2: The second method of making a balanced binary tree
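
For concreteness, a Python sketch of the two constructions follows; the nested-list tree representation and the exact split point are illustrative choices consistent with the descriptions above.

def make_tree_v1(words, l, r):
    """Version 1: recursively split words[l..r] (inclusive) in half."""
    if l == r:
        return words[l]
    m = (l + r) // 2                            # split point
    return [make_tree_v1(words, l, m), make_tree_v1(words, m + 1, r)]

def make_tree_v2(words):
    """Version 2: pair up consecutive words, then apply Version 1 to the pairs."""
    nodes, i = [], 0
    while i + 1 < len(words):
        nodes.append([words[i], words[i + 1]])  # combine consecutive words
        i += 2
    if i < len(words):
        nodes.append(words[i])                  # an odd word is left over
    return make_tree_v1(nodes, 0, len(nodes) - 1)

# Example:
# make_tree_v1("the cat eats fish .".split(), 0, 4)
#   -> [[['the', 'cat'], 'eats'], ['fish', '.']]
# make_tree_v2("the cat eats fish .".split())
#   -> [[['the', 'cat'], ['eats', 'fish']], '.']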