The past several years have witnessed promising progress in Neural Machine Translation (NMT) [Cho et al.2014, Sutskever et al.2014], in which the attention model plays an increasingly important role [Bahdanau et al.2015, Luong et al.2015, Vaswani et al.2017]. A typical attention-based NMT model first encodes the source sentence into a sequence of annotations with a bidirectional RNN [Schuster and Paliwal1997], and then generates a variable-length target sentence with another RNN and an attention mechanism. The attention mechanism plays a crucial role in NMT, as it indicates which source word(s) the decoder should focus on in order to predict the next target word. However, there is no mechanism to effectively keep track of attention history in conventional attention-based NMT. The decoder therefore tends to ignore past attention information, which leads to the issues of repeating or dropping translations [Tu et al.2016]: for example, conventional attention-based NMT may repeatedly translate some source words while mistakenly ignoring other words.
A number of recent efforts have explored ways to alleviate this inadequate translation problem. For example, [Tu et al.2016] employ a coverage vector as a lexical-level indicator of whether a source word has been translated. [Meng et al.2016] and [Zheng et al.2018] take the idea one step further, and directly model translated and untranslated source contents by operating on the attention context (i.e., the partial source content being translated) instead of on the attention probability (i.e., the chance that the corresponding source word has been translated). Specifically, [Meng et al.2016] capture the translation status with an interactive attention augmented with an NTM [Graves et al.2014] memory. [Zheng et al.2018] separate the modeling of translated (Past) and untranslated (Future) source content from the decoder states by introducing two additional decoder adaptive layers.
[Meng et al.2016] propose a generic framework of memory-augmented attention, which is independent of the specific architecture of the NMT model. However, the original mechanism takes only a single memory to both represent the source sentence and track the attention history. Such overloaded usage of memory representations makes training the model difficult [Rocktäschel et al.2017]. In contrast, [Zheng et al.2018] ease the difficulty of representation learning by separating the Past and Future functions from the decoder states; however, their approach is designed specifically for a particular NMT architecture.
In this work, we combine the advantages of both models by leveraging the generic memory-augmented attention framework, while easing memory training by maintaining separate representations for the two expected functions. Partially inspired by [Miller et al.2016], we split the memory into two parts: a dynamic key-memory, updated along the update-chain of the decoder state to keep track of attention history, and a fixed value-memory to store the representation of the source sentence throughout the whole translation process. In each decoding step, we conduct multiple rounds of memory operations layer by layer, which gives the decoder a chance of re-attention by considering the “intermediate” attention results achieved in earlier stages. This structure allows the model to leverage possibly complex transformations and interactions between 1) the key-value memory pair in the same layer, as well as 2) the key (and value) memories across different layers.
Experimental results on the Chinese→English translation task show that the attention model augmented with a single-layer key-value memory improves both translation and attention performance not only over a standard attention model, but also over the existing NTM-augmented attention model [Meng et al.2016]. Its multi-layer counterpart further improves model performance consistently. We also validate our model on bidirectional German↔English translation tasks, which demonstrates the effectiveness and generalizability of our approach.
Given a source sentence $\mathbf{x} = \{x_1, \dots, x_n\}$ and a target sentence $\mathbf{y} = \{y_1, \dots, y_m\}$, NMT models the translation probability word by word:

$$P(y_t \mid y_{<t}, \mathbf{x}) = g(y_{t-1}, s_t, c_t) \quad (1)$$

where $g(\cdot)$ is a non-linear function, and $s_t$ is the hidden state of the decoder RNN at time step $t$:

$$s_t = f(s_{t-1}, y_{t-1}, c_t) \quad (2)$$

$c_t$ is a distinct source representation for time $t$, calculated as a weighted sum of the source annotations:

$$c_t = \sum_{j=1}^{n} \alpha_{t,j}\, h_j \quad (3)$$

where $h_j$ is the encoder annotation of the source word $x_j$, and the weight $\alpha_{t,j}$ is computed by

$$\alpha_{t,j} = \frac{\exp(e_{t,j})}{\sum_{k=1}^{n} \exp(e_{t,k})} \quad (4)$$

where $e_{t,j} = a(\tilde{s}_{t-1}, h_j)$ scores how much $\tilde{s}_{t-1}$ attends to $h_j$, and $\tilde{s}_{t-1} = f(s_{t-1}, y_{t-1})$ is an intermediate state tailored for computing the attention score with the information of $y_{t-1}$.

The training objective is to maximize the log-likelihood of the training instances $\{(\mathbf{x}^{(z)}, \mathbf{y}^{(z)})\}_{z=1}^{Z}$:

$$\theta^{*} = \arg\max_{\theta} \sum_{z=1}^{Z} \log P(\mathbf{y}^{(z)} \mid \mathbf{x}^{(z)}; \theta) \quad (5)$$
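The attention computation above can be sketched in a few lines of numpy. This is a minimal illustration of the soft-alignment weights (cf. Eq. 4) and the weighted-sum context with additive scoring; the parameter names (`W_q`, `W_h`, `v`) and toy dimensions are illustrative, not the paper's.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(query, annotations, W_q, W_h, v):
    """Additive (Bahdanau-style) attention: score each source annotation h_j
    against the decoder query state, normalize with softmax, and return the
    weighted-sum context c_t together with the weights alpha_{t,j}."""
    scores = np.array([v @ np.tanh(W_q @ query + W_h @ h) for h in annotations])
    alpha = softmax(scores)        # attention weights alpha_{t,j}
    context = alpha @ annotations  # c_t = sum_j alpha_{t,j} * h_j
    return context, alpha

# Toy dimensions; all parameters are random stand-ins for trained weights.
rng = np.random.default_rng(0)
d, n = 4, 5
annotations = rng.normal(size=(n, d))  # encoder annotations h_1..h_n
query = rng.normal(size=d)             # intermediate decoder state
W_q, W_h = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)
context, alpha = attention_context(query, annotations, W_q, W_h, v)
```

The weights form a distribution over the source positions, and the context vector lives in the same space as the annotations.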
[Meng et al.2016] propose to augment the attention model with a memory in the form of an NTM, which aims at tracking the attention history during the decoding process, as shown in Figure 1. At each decoding step, the NTM employs a network-based reader to read from the encoder annotations and output a distinct memory representation, which is subsequently used to update the decoder state. After predicting the target word, the updated decoder state is written back to the memory, controlled by a network-based writer. As seen, the interactive read-write operations can timely update the representation of the source sentence along with the update-chain of the decoder state, and let the decoder keep track of the attention history.
However, this mechanism takes only a single memory to both represent the source sentence and track attention history. Such overloaded usage of memory representations makes training the model difficult [Rocktäschel et al.2017].
3 Model Description
Figure 2 shows the architecture of the proposed Key-Value Memory-augmented Attention model (KVMemAtt). It consists of three components: 1) an encoder (on the left), which encodes the entire source sentence and outputs its annotations as the initialization of the Key-Memory and the Value-Memory; 2) the key-value memory-augmented attention model (in the middle), which generates the context representation of source sentence appropriate for predicting the next target word with iterative memory access operations conducted on the Key-Memory and the Value-Memory; and 3) the decoder (on the right), which predicts the next target word step by step.
Specifically, the Key-Memory and the Value-Memory consist of $n$ slots, which are initialized with the annotations $\{h_1, \dots, h_n\}$ of the source sentence. KVMemAtt-based NMT maintains these two memories throughout the whole decoding process, with the Key-Memory being updated to track the attention history, and the Value-Memory kept fixed to store the representation of the source sentence. For example, the $j$-th slot ($v_j$) in the Value-Memory stores the representation of the $j$-th source word (fixed after being generated), and the $j$-th slot ($k_j$) in the Key-Memory stores the attention (or translation) status (updated as translation goes) corresponding to the $j$-th source word. At step $t$, the decoder state $s_{t-1}$ first meets the previous prediction $y_{t-1}$ to form a “query” state $\tilde{s}_t$, which can be calculated as follows:

$$\tilde{s}_t = f(s_{t-1}, e_{y_{t-1}})$$

where $e_{y_{t-1}}$ is the word-embedding of the previous word $y_{t-1}$. The decoder uses the “query” state $\tilde{s}_t$ to address the Key-Memory, looking for an accurate attention vector, and reads from the Value-Memory with the guidance of this attention vector to generate the source context representation. After that, the Key-Memory is updated accordingly.
The memory access operations (i.e. address, read and update) in one decoding step can be conducted repeatedly, which gives the decoder a chance of re-attention (with new information added) before making the final prediction. Suppose there are $R$ rounds of memory access in each decoding step. The detailed operations from round $r{-}1$ to round $r$ are as follows:
First, we use the “query” state $\tilde{s}_t$ to address from $\mathbf{K}_t^{r-1}$ to generate the “intermediate” attention vector:

$$\hat{\alpha}_t^r = \mathrm{Address}(\tilde{s}_t, \mathbf{K}_t^{r-1})$$

which is subsequently used as the guidance for reading from the Value-Memory $\mathbf{V}$ to get the “intermediate” context representation of the source sentence:

$$\hat{c}_t^r = \mathrm{Read}(\hat{\alpha}_t^r, \mathbf{V})$$

which works together with the “query” state $\tilde{s}_t$ to get the “intermediate” hidden state:

$$\hat{s}_t^r = f(\tilde{s}_t, \hat{c}_t^r)$$

Finally, we use the “intermediate” hidden state $\hat{s}_t^r$ to update $\mathbf{K}_t^{r-1}$ to $\mathbf{K}_t^r$, recording the “intermediate” attention status, to finish one round of operations:

$$\mathbf{K}_t^r = \mathrm{Update}(\mathbf{K}_t^{r-1}, \hat{s}_t^r)$$
After the last round ($r = R$) of the operations, we use $\hat{s}_t^R$ as the resulting state to compute a final prediction via Eq. 1. Then the Key-Memory is transited to the next decoding step $t{+}1$, with $\mathbf{K}_{t+1}^{0} = \mathbf{K}_t^{R}$. The details of the $\mathrm{Address}$, $\mathrm{Read}$ and $\mathrm{Update}$ operations will be described in the next section.
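One decoding step with multiple rounds of Address, Read and Update can be sketched as follows. This is a simplified numpy stand-in, assuming dot-product addressing, a tanh state update, and a plain additive memory write in place of the paper's exact parameterization; all names and dimensions are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(query, key_mem, value_mem, W_u, rounds=2):
    """One decoding step with `rounds` rounds of Address -> Read -> Update.
    The Update here is a simple weighted additive write standing in for the
    Forget/Add operations of the full model."""
    for _ in range(rounds):
        alpha = softmax(key_mem @ query)   # Address: intermediate attention vector
        context = alpha @ value_mem        # Read: intermediate source context
        hidden = np.tanh(W_u @ np.concatenate([query, context]))  # interm. state
        # Update: write the intermediate state into each slot, gated by alpha
        key_mem = key_mem + np.outer(alpha, hidden)
    return hidden, alpha, key_mem

rng = np.random.default_rng(2)
n, d = 5, 4
K = rng.normal(size=(n, d))     # Key-Memory (updated across rounds)
V = rng.normal(size=(n, d))     # Value-Memory (fixed)
q = rng.normal(size=d)          # "query" state
W_u = rng.normal(size=(d, 2 * d))
hidden, alpha, K_new = decode_step(q, K, V, W_u, rounds=2)
```

Because the Key-Memory changes between rounds while the Value-Memory stays fixed, the second round addresses with an updated view of which slots have already been attended.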
As seen, the KVMemAtt mechanism updates the Key-Memory along with the update-chain of the decoder state to keep track of the attention status, and also maintains a fixed Value-Memory to store the representation of the source sentence. At each decoding step, KVMemAtt generates the context representation of the source sentence via nontrivial transformations between the Key-Memory and the Value-Memory, and records the attention status via interactions between the two memories. This structure allows the model to leverage possibly complex transformations and interactions between the two memories, and lets the decoder choose a more appropriate source context for the word prediction at each step. Clearly, KVMemAtt can subsume the coverage models [Tu et al.2016, Mi et al.2016] and the interactive attention model [Meng et al.2016] as special cases, while being more generic and powerful, as empirically verified in the experiment section.
3.1 Memory Access Operations
In this section, we detail the memory access operations (i.e., address, read and update) from round $r{-}1$ to round $r$ at decoding time step $t$.
Formally, $\mathbf{K}_t^{r-1} \in \mathbb{R}^{n \times d}$ is the Key-Memory in round $r{-}1$ at decoding time step $t$ before the decoder RNN state update, where $n$ is the number of memory slots and $d$ is the dimension of the vector in each slot. The addressed attention vector is given by

$$\hat{\alpha}_t^r = \mathrm{Address}(\tilde{s}_t, \mathbf{K}_t^{r-1})$$

where $\hat{\alpha}_t^r \in \mathbb{R}^n$ specifies the normalized weights assigned to the slots in $\mathbf{K}_t^{r-1}$, with the $j$-th slot being $k_j^{r-1}$. We can use content-based addressing to determine $\hat{\alpha}_t^r$ as described in [Graves et al.2014] or (quite similarly) use the soft-alignment as in Eq. 4. In this paper, for convenience, we adopt the latter. The $j$-th cell of $\hat{\alpha}_t^r$ is

$$\hat{\alpha}_{t,j}^r = \frac{\exp(e_{t,j}^r)}{\sum_{k=1}^{n} \exp(e_{t,k}^r)}, \qquad e_{t,j}^r = a(\tilde{s}_t, k_j^{r-1})$$
Formally, $\mathbf{V} \in \mathbb{R}^{n \times d}$ is the Value-Memory, where $n$ is the number of memory slots and $d$ is the dimension of the vector in each slot. Before the decoder state update at time $t$, the output of reading at round $r$ is given by

$$\hat{c}_t^r = \mathrm{Read}(\hat{\alpha}_t^r, \mathbf{V}) = \sum_{j=1}^{n} \hat{\alpha}_{t,j}^r\, v_j$$

where $\hat{\alpha}_t^r \in \mathbb{R}^n$ specifies the normalized weights assigned to the slots in $\mathbf{V}$.
Inspired by the attentive writing operation of Neural Turing Machines [Graves et al.2014], we define two types of operations for updating the Key-Memory: Forget and Add.
Forget determines the content to be removed from memory slots. More specifically, the vector $F_t^r \in \mathbb{R}^d$ specifies the values to be forgotten or removed on each dimension of the memory slots, which is then assigned to each slot through normalized weights $w_t^r \in \mathbb{R}^n$. Formally, the “intermediate” memory after the Forget operation is given by

$$\tilde{k}_j^r = k_j^{r-1} \left( \mathbf{1} - w_{t,j}^r \cdot F_t^r \right), \quad j = 1, \dots, n$$

where $F_t^r = \sigma(W_F \hat{s}_t^r)$ is parameterized with $W_F \in \mathbb{R}^{d \times d}$, and $\sigma$ stands for the sigmoid activation function; $w_t^r \in \mathbb{R}^n$ specifies the normalized weights assigned to the slots in $\mathbf{K}_t^{r-1}$, and $w_{t,j}^r$ specifies the weight associated with the $j$-th slot. $w_t^r$ is determined in the same way as the addressing weights $\hat{\alpha}_t^r$.

Add decides how much current information should be written to the memory as the added content:

$$k_j^r = \tilde{k}_j^r + w_{t,j}^r \cdot A_t^r, \quad j = 1, \dots, n$$

where $A_t^r = \sigma(W_A \hat{s}_t^r)$ is parameterized with $W_A \in \mathbb{R}^{d \times d}$. Clearly, with the Forget and Add operations, KVMemAtt can potentially modify and add to the Key-Memory more than just the history of attention.
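Assuming the NTM-style erase/add scheme above, the Key-Memory update can be sketched as follows; the weights `w` and matrices `W_F`, `W_A` are illustrative stand-ins for the learned quantities.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forget_add_update(key_mem, weights, hidden, W_F, W_A):
    """Attentive key-memory update: each slot first forgets part of its
    content (erase), then receives new content (add), both scaled by the
    addressing weight of that slot."""
    F = sigmoid(W_F @ hidden)                          # forget vector, shape (d,)
    A = sigmoid(W_A @ hidden)                          # add vector, shape (d,)
    key_mem = key_mem * (1.0 - np.outer(weights, F))   # Forget
    key_mem = key_mem + np.outer(weights, A)           # Add
    return key_mem

rng = np.random.default_rng(1)
n, d = 5, 4
K = rng.normal(size=(n, d))
w = np.full(n, 1.0 / n)          # uniform addressing weights for illustration
h = rng.normal(size=d)           # "intermediate" hidden state
W_F, W_A = rng.normal(size=(d, d)), rng.normal(size=(d, d))
K_new = forget_add_update(K, w, h, W_F, W_A)
```

Since the forget gate lies in (0, 1) and is scaled by the slot weight, slots that receive little attention are left nearly unchanged.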
The translation process finishes when the decoder generates $\langle eos \rangle$ (a special token that stands for the end of the target sentence). Therefore, accurately generating $\langle eos \rangle$ is crucial for producing correct translations. Intuitively, accurate attention for the last word $x_n$ of the source sentence helps the decoder accurately predict $\langle eos \rangle$. When predicting $\langle eos \rangle$ (i.e. $y_m$), the decoder should focus strongly on $x_n$ (i.e. $h_n$); that is, the attention probability on $x_n$ should be close to 1.0. And when generating other target words, the decoder should not focus on $x_n$ too much. Therefore, we define an EOS-attention objective that encourages the attention probability on $x_n$ to approach 1.0 at the final decoding step and to stay small at earlier steps.
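The exact equation of the EOS-attention objective is not reproduced here; the following numpy sketch shows one plausible squared-error realization of the idea (an assumption, not the paper's formula): push the attention on the last source word toward 1.0 at the EOS step and toward 0.0 at earlier steps.

```python
import numpy as np

def eos_attention_penalty(alphas, lam=1.0):
    """One plausible (assumed) form of the EOS-attention objective.
    alphas: (m, n) attention matrix; rows are decoding steps, the last row
    corresponds to generating <eos>, the last column to the last source word."""
    last_col = alphas[:, -1]               # attention on the last source word
    penalty = (1.0 - last_col[-1]) ** 2    # EOS step: should be close to 1.0
    penalty += np.sum(last_col[:-1] ** 2)  # earlier steps: should be small
    return lam * penalty

# Toy 3x3 attention matrix: mostly diagonal, mild leakage onto the last word.
alphas = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.0, 0.1, 0.9]])
loss = eos_attention_penalty(alphas)
```

A perfectly diagonal attention matrix would incur zero penalty under this form.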
|System|Architecture|# Para.|Speed|MT03|MT04|MT05|MT06|Ave.|
|Existing end-to-end NMT systems|
|[Tu et al.2016]|Coverage|–|–|33.69|38.05|35.01|34.83|35.40|
|[Meng et al.2016]|MemAtt|–|–|35.69|39.24|35.74|35.10|36.44|
|[Wang et al.2016]|MemDec|–|–|36.16|39.81|35.91|35.98|36.95|
|[Zhang et al.2017]|Distortion|–|–|37.93|40.40|36.81|35.77|37.73|
|Our end-to-end NMT systems|
We carry out experiments on Chinese→English (Zh→En) and German↔English (De↔En) translation tasks. For Zh→En, the training data consist of 1.25M sentence pairs extracted from LDC corpora. We choose the NIST 2002 (MT02) dataset as our validation set, and the NIST 2003-2006 (MT03-06) datasets as our test sets. For De↔En, we perform our experiments on the corpus provided by WMT17, which contains 5.6M sentence pairs. We use newstest2016 as the development set, and newstest2017 as the test set. We measure translation quality with BLEU scores [Papineni et al.2002]. For the Zh→En task, we apply case-insensitive NIST BLEU (mteval-v11b.pl); for the De↔En tasks, we tokenize the references and evaluate with case-sensitive multi-bleu.pl. These metrics are exactly the same as in previous work.
In training the neural networks, we limit the source and target vocabularies to the most frequent 30K words on both sides for the Zh→En task, covering approximately 97.7% and 99.3% of the two corpora, respectively. For De↔En, sentences are encoded using byte-pair encoding [Sennrich et al.2016], with a shared source-target vocabulary of about 36000 tokens. The parameters are updated by mini-batch SGD (batch size 80) with the learning rate controlled by AdaDelta [Zeiler2012]. We limit the length of sentences in training to 80 words for Zh→En and 100 sub-words for De↔En. The dimension of the word embeddings and hidden layers is 512, and the beam size in testing is 10. We apply dropout on the output layer to avoid over-fitting [Hinton et al.2012], with the dropout rate being 0.5. The hyper-parameter in Eq. 19 is set to 1.0. The parameters of our KVMemAtt model (i.e., encoder and decoder, except for those related to KVMemAtt) are initialized with the pre-trained baseline model.
4.2 Results on Chinese-English
We compare our KVMemAtt with two strong baselines: 1) RNNSearch, which is our in-house implementation of the attention-based NMT as described in Section 2; and 2) RNNSearch+MemAtt, which is our implementation of interactive attention [Meng et al.2016] on top of our RNNSearch. Table 1 shows the translation performance.
KVMemAtt brings in only a small parameter increase: compared with RNNsearch, KVMemAtt-R1 and KVMemAtt-R2 bring in only 1.95% and 7.78% more parameters, respectively. Additionally, introducing KVMemAtt slows down the training speed to a certain extent (by 18.39% to 39.56%). When running on a single Tesla P40 GPU, the RNNsearch model processes 2773 target words per second, while the proposed models process 1676 to 2263 target words per second.
KVMemAtt with one round of memory access (KVMemAtt-R1) achieves significant improvements over RNNsearch by 1.92 BLEU and over RNNsearch+MemAtt by 0.65 BLEU on average over the four test sets. This indicates that our key-value memory-augmented attention mechanism makes the attention more effective via nontrivial transformations and interactions between the Key-Memory and the Value-Memory. The two-round counterpart (KVMemAtt-R2) further improves the performance by 0.48 BLEU on average. This confirms our hypothesis that the decoder can benefit from the re-attention process, which considers the “intermediate” attention results achieved in the early stage and then makes a more accurate decision. However, we find that adding more than two rounds of memory access operations to KVMemAtt does not lead to better translation performance (not shown in the table). One possible reason is that more rounds of memory access lead to more updating operations (i.e. attentive writing) on the Key-Memory, which may be difficult to optimize within our current architecture design. We leave this as future work.
The EOS-attention objective assists the learning of attention and guides the parameter training. It consistently improves translation performance over the different variants of KVMemAtt, giving a further improvement of 0.40 BLEU points over KVMemAtt-R2, which is 2.80 BLEU points better than RNNSearch.
Intuitively, our KVMemAtt can enhance the attention and therefore improve the word alignment quality. To confirm our hypothesis, we carry out experiments of the alignment task on the evaluation dataset from [Liu and Sun2015], which contains 900 manually aligned Chinese-English sentence pairs. We use the alignment error rate (AER) [Och and Ney2003]
as the evaluation metric for the alignment task. Table 2 lists the BLEU and AER scores. As expected, our KVMemAtt achieves better BLEU and AER scores (the lower the AER score, the better the alignment quality) than the strong baseline systems. Additionally, the results also indicate that the EOS-attention objective can assist the learning of attention-based NMT, since adding this objective yields better alignment performance. By visualizing the attention matrices, we find that the attention quality improves from the first round to the second round as expected, as shown in Figure 3.
|System|Architecture|De→En|En→De|
|Existing end-to-end NMT systems|
|[Rikters et al.2017]|cGRU + dropout + named entity forcing + synthetic data|29.00|22.70|
|[Escolano et al.2017]|Char2Char + rescoring with inverse model + synthetic data|28.10|21.20|
|[Sennrich et al.2017]|cGRU + synthetic data|32.00|26.10|
|[Tu et al.2016]|RNNSearch + Coverage|28.70|23.60|
|[Zheng et al.2018]|RNNSearch + Past-Future-Layers|29.70|24.30|
|Our end-to-end NMT systems|
| |+ KVMemAtt-R2 + AttEosObj|30.98|25.39|
We conducted a subjective evaluation to investigate the benefit of incorporating KVMemAtt into NMT, especially in alleviating the issues of over- and under-translation. Table 3 lists the translation adequacy of the RNNSearch baseline and our KVMemAtt-R2 on 100 sentences randomly selected from the test sets. From Table 3 we can see that, compared with the baseline system, our approach reduces the percentage of source words that are under-translated from 13.1% to 9.7%, and of those that are over-translated from 2.7% to 1.3%. The main reason is that our KVMemAtt can keep track of the attention status and generate a more appropriate source context for predicting the next target word at each decoding step.
4.3 Results on German-English
We also evaluate our model on the WMT17 benchmarks for bidirectional German↔English translation, as listed in Table 4. Our baseline achieves BLEU scores comparable to or higher than the state-of-the-art NMT systems of WMT17 that do not use additional synthetic data. ([Sennrich et al.2017] obtains better BLEU scores than our model, since they use large-scale synthetic data (about 10M sentence pairs); it may be unfair to compare our model to theirs directly.) Our proposed model consistently outperforms two strong baselines (i.e., the standard and memory-augmented attention models) on both De→En and En→De translation tasks. These results demonstrate that our model works well across different language pairs.
5 Related Work
Our work is inspired by the key-value memory networks [Miller et al.2016], which were originally proposed for question answering and have since been successfully applied to machine translation [Gu et al.2017, Gehring et al.2017, Vaswani et al.2017, Tu et al.2018]. In those works, both the key-memory and the value-memory are fixed during translation. Different from these works, we update the Key-Memory along with the update-chain of the decoder state via attentive writing operations (i.e. Forget and Add).
Our work is related to recent studies that focus on designing better attention models [Luong et al.2015, Cohn et al.2016, Feng et al.2016, Tu et al.2016, Mi et al.2016, Zhang et al.2017]. [Luong et al.2015] proposed to use a global attention to attend to all source words and a local attention model to look at a subset of source words. [Cohn et al.2016] extended the attention-based NMT to include structural biases from word-based alignment models. [Feng et al.2016] added implicit distortion and fertility models to attention-based NMT. [Zhang et al.2017] incorporated distortion knowledge into the attention-based NMT. [Tu et al.2016, Mi et al.2016] proposed coverage mechanisms to encourage the decoder to consider more untranslated source words during translation. These works are different from our KVMemAtt, since we use a rather generic key-value memory-augmented framework with memory access (i.e. address, read and update).
Our work is also related to recent efforts on attaching a memory to neural networks [Graves et al.2014] and exploiting memory [Tang et al.2016, Wang et al.2016, Feng et al.2017, Meng et al.2016, Wang et al.2017] during translation. [Tang et al.2016] exploited a phrase memory stored in symbolic form for NMT. [Wang et al.2016] extended the NMT decoder by maintaining an external memory, which is operated by reading and writing operations. [Feng et al.2017] proposed a neural-symbolic architecture, which exploits a memory to provide knowledge for infrequently used words. Our work differs in that we augment attention with a specially designed interactive key-value memory, which allows the model to leverage possibly complex transformations and interactions between the two memories via single or multiple rounds of memory access in each decoding step.
6 Conclusion and Future Work
We propose an effective KVMemAtt model for NMT, which maintains a timely updated key-memory to track attention history and a fixed value-memory to store the representation of the source sentence during translation. Via nontrivial transformations and iterative interactions between the two memories, our KVMemAtt can focus on a more appropriate source context for predicting the next target word at each decoding step. Additionally, to further enhance the attention, we propose a simple yet effective attention-oriented objective in a weakly supervised manner. Our empirical study on Chinese→English, German→English and English→German translation tasks shows that KVMemAtt can significantly improve the performance of NMT.
For future work, we will explore more rounds of memory access with more powerful operations on the key-value memories to further enhance the attention. Another interesting direction is to apply the proposed approach to the Transformer [Vaswani et al.2017], in which the attention model plays an even more important role.
- [Bahdanau et al.2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
- [Cho et al.2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. In EMNLP, 2014.
- [Cohn et al.2016] Trevor Cohn, Cong Duy Vu Hoang, Ekaterina Vymolova, Kaisheng Yao, Chris Dyer, and Gholamreza Haffari. Incorporating structural alignment biases into an attentional neural translation model. In NAACL, 2016.
- [Escolano et al.2017] Carlos Escolano, Marta R. Costa-jussà, and José A. R. Fonollosa. The talp-upc neural machine translation system for german/finnish-english using the inverse direction model in rescoring. In WMT, 2017.
- [Feng et al.2016] Shi Feng, Shujie Liu, Mu Li, and Ming Zhou. Implicit distortion and fertility models for attention-based encoder-decoder NMT model. In COLING, 2016.
- [Feng et al.2017] Yang Feng, Shiyue Zhang, Andi Zhang, Dong Wang, and Andrew Abel. Memory-augmented neural machine translation. arXiv, 2017.
- [Gehring et al.2017] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. arXiv, 2017.
- [Graves et al.2014] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv, 2014.
- [Gu et al.2017] Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor OK Li. Search engine guided non-parametric neural machine translation. arXiv, 2017.
- [Hinton et al.2012] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv, 2012.
- [Liu and Sun2015] Yang Liu and Maosong Sun. Contrastive unsupervised word alignment with non-local features. In AAAI, 2015.
- [Luong et al.2015] Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In EMNLP, 2015.
- [Meng et al.2016] Fandong Meng, Zhengdong Lu, Hang Li, and Qun Liu. Interactive attention for neural machine translation. In COLING, 2016.
- [Mi et al.2016] Haitao Mi, Baskaran Sankaran, Zhiguo Wang, and Abe Ittycheriah. Coverage embedding models for neural machine translation. In EMNLP, 2016.
- [Miller et al.2016] Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. Key-value memory networks for directly reading documents. In EMNLP, 2016.
- [Och and Ney2003] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. CL, 2003.
- [Papineni et al.2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.
- [Rikters et al.2017] Matīss Rikters, Chantal Amrhein, Maksym Del, and Mark Fishel. C-3ma: Tartu-riga-zurich translation systems for wmt17. In WMT, 2017.
- [Rocktäschel et al.2017] Tim Rocktäschel, Johannes Welbl, and Sebastian Riedel. Frustratingly short attention spans in neural language modeling. In ICLR, 2017.
- [Schuster and Paliwal1997] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. TSP, 1997.
- [Sennrich et al.2016] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, 2016.
- [Sennrich et al.2017] Rico Sennrich, Alexandra Birch, Anna Currey, Ulrich Germann, Barry Haddow, Kenneth Heafield, Antonio Valerio Miceli Barone, and Philip Williams. The university of edinburgh’s neural MT systems for WMT17. arXiv, 2017.
- [Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
- [Tang et al.2016] Yaohua Tang, Fandong Meng, Zhengdong Lu, Hang Li, and Philip L. H. Yu. Neural machine translation with external phrase memory. arXiv, 2016.
- [Tu et al.2016] Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. Modeling coverage for neural machine translation. In ACL, 2016.
- [Tu et al.2018] Zhaopeng Tu, Yang Liu, Shuming Shi, and Tong Zhang. Learning to remember translation history with a continuous cache. TACL, 2018.
- [Vaswani et al.2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
- [Wang et al.2016] Mingxuan Wang, Zhengdong Lu, Hang Li, and Qun Liu. Memory-enhanced decoder for neural machine translation. In EMNLP, 2016.
- [Wang et al.2017] Xing Wang, Zhengdong Lu, Zhaopeng Tu, Hang Li, Deyi Xiong, and Min Zhang. Neural machine translation advised by statistical machine translation. In AAAI, 2017.
- [Zeiler2012] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv, 2012.
- [Zhang et al.2017] Jinchao Zhang, Mingxuan Wang, Qun Liu, and Jie Zhou. Incorporating word reordering knowledge into attention-based neural machine translation. In ACL, 2017.
- [Zheng et al.2018] Zaixiang Zheng, Hao Zhou, Shujian Huang, Lili Mou, Dai Xinyu, Jiajun Chen, and Zhaopeng Tu. Modeling past and future for neural machine translation. TACL, 2018.