1 Related Work
In the literature, several cache-based translation models have been proposed for conventional statistical machine translation, in addition to cache-based extensions of traditional n-gram language models and neural language models. In this section, we first introduce related work on cache-based language models and then on translation models.
For traditional n-gram language models, Kuhn and De Mori [1990] propose a cache-based language model for speech recognition, which mixes a large global language model with a small local model estimated from recent items in the history of the input stream. Della Pietra et al. [1992] introduce a MaxEnt-based cache model by integrating a cache into a smoothed trigram language model, reporting reductions in both perplexity and word error rate. Chueh and Chien [2010] present a new topic cache model for speech recognition based on a latent Dirichlet language model, incorporating a large-span topic cache into the generation of topic mixtures.
For neural language models, Huang et al. [2014] propose a cache-based RNN inference scheme, which avoids repeated computation of identical LM calls by caching previously computed scores and useful intermediate results, and thus reduces the computational expense of RNNLMs. Grave et al. [2016] extend the neural network language model with a neural cache model, which stores recent hidden activations to be used as contextual representations. Our caches differ significantly from these two caches in that we store linguistic items in the cache rather than scores or activations.
For neural machine translation, Wang et al. [2017] propose a cross-sentence context-aware approach and employ a hierarchy of Recurrent Neural Networks (RNNs) to summarize the cross-sentence context from previous source-side sentences. Jean et al. [2017] propose a novel larger-context neural machine translation model based on recent work on larger-context language modelling [Wang and Cho2016] and employ the method to model the surrounding text in addition to the source sentence.
For cache-based translation models, Nepveu et al. [2004] propose a dynamic adaptive translation model using a cache-based implementation for interactive machine translation, and develop a monolingual dynamic adaptive model and a bilingual dynamic adaptive model. Tiedemann [2010] proposes a cache-based translation model, filling the cache with bilingual phrase pairs from the best translation hypotheses of previous sentences in a document. Gong et al. [2011] further propose a cache-based approach to document-level translation, which includes three caches (a dynamic cache, a static cache and a topic cache) to capture various document-level information. Bertoldi et al. [2013] describe a cache mechanism to implement online learning in phrase-based SMT and use a repetition rate measure to predict the utility of cached items expected to be useful for the current translation.
Our caches are similar to those used by Gong et al. [2011], who incorporate these caches into statistical machine translation. We adapt them to neural machine translation with a neural cache model. It is worth emphasizing that such adaptation is nontrivial, as shown below, because the two translation philosophies and frameworks are significantly different.
2 Attention-based NMT
In this section, we briefly describe the NMT model taken as a baseline. Without loss of generality, we adopt the NMT architecture proposed by Bahdanau et al. [2015], with an encoder-decoder neural network.
2.1 Encoder
The encoder uses bidirectional recurrent neural networks (Bi-RNN) to encode a source sentence with a forward and a backward RNN. The forward RNN takes as input a source sentence $x = (x_1, x_2, ..., x_T)$ from left to right and outputs a hidden state sequence $(\overrightarrow{h_1}, \overrightarrow{h_2}, ..., \overrightarrow{h_T})$, while the backward RNN reads the sentence in the inverse direction and outputs a backward hidden state sequence $(\overleftarrow{h_1}, \overleftarrow{h_2}, ..., \overleftarrow{h_T})$. The context-dependent word representations of the source sentence (also known as word annotation vectors) are the concatenation of the hidden states $\overrightarrow{h_j}$ and $\overleftarrow{h_j}$ in the two directions.
2.2 Decoder
The decoder is an RNN that predicts target words $y_t$ via a multi-layer perceptron (MLP) neural network. The prediction is based on the decoder RNN hidden state $s_t$, the previously predicted word $y_{t-1}$ and a source-side context vector $c_t$. The hidden state $s_t$ of the decoder at time $t$ and the conditional probability of the next word $y_t$ are computed as follows:
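In the standard formulation of Bahdanau et al. [2015], on which this baseline is built, the two quantities can be sketched as below. The symbol names are assumptions of this sketch: $s_t$ is the decoder hidden state, $y_{t-1}$ the previously predicted word, $c_t$ the context vector, $f$ the recurrent unit, and $g$ the MLP with a softmax output.

```latex
\begin{align}
s_t &= f(s_{t-1},\, y_{t-1},\, c_t) \\
p(y_t \mid y_{<t}, x) &= g(y_{t-1},\, s_t,\, c_t)
\end{align}
```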
2.3 Attention Model
In the attention model, the context vector $c_t$ is calculated as a weighted sum over the source annotation vectors $h_j$. Each weight $\alpha_{tj}$ is the attention weight of the hidden state $h_j$ computed by the attention model, whose scoring function is a feed-forward neural network with a single hidden layer.
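Following Bahdanau et al. [2015], the attention computation can be sketched as below. The names $e_{tj}$ (alignment score), $a$ (the single-hidden-layer feed-forward network) and $T$ (source sentence length) are assumptions of this sketch, as is the choice of scoring against the previous decoder state $s_{t-1}$.

```latex
\begin{align}
c_t &= \sum_{j=1}^{T} \alpha_{tj}\, h_j \\
\alpha_{tj} &= \frac{\exp(e_{tj})}{\sum_{k=1}^{T} \exp(e_{tk})} \\
e_{tj} &= a(s_{t-1},\, h_j)
\end{align}
```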
The dl4mt tutorial (https://github.com/nyu-dl/dl4mt-tutorial/tree/master/session2) presents an improved implementation of the attention-based NMT system, which feeds the previous word to the attention model. We use the dl4mt tutorial implementation as our baseline, which we will refer to as RNNSearch*.
The proposed cache-based neural approach is implemented on top of the RNNSearch* system, where the encoder-decoder NMT framework is trained, as usual, to maximize the sum of the conditional log probabilities of the correct translations of all source sentences in a parallel corpus.
3 The Cache-based Neural Model
In this section, we elaborate on the proposed cache-based neural model and how we integrate it into neural machine translation. Figure 1 shows the entire architecture of our NMT system with the cache-based neural model.
3.1 Dynamic Cache and Topic Cache
The aim of the cache is to incorporate document-level constraints and thereby improve the consistency and coherence of document translations. In this section, we introduce our proposed dynamic cache and topic cache in detail.
3.1.1 Dynamic Cache
In order to build the dynamic cache, we dynamically extract words from recently translated sentences and from the partial translation of the sentence currently being translated. We apply the following rules to build the dynamic cache.
- The maximum size of the dynamic cache is fixed (100 in our experiments).
- According to the first-in-first-out rule, when the dynamic cache is full and a new word is inserted into the cache, the oldest word in the cache is removed.
- Duplicate entries are not allowed: a word that is already in the cache is not inserted again.
It is worth noting that we also maintain a stop word list, to which we add English punctuation marks and “UNK”. Words in the stop word list are not inserted into the dynamic cache, so common words like “a” and “the” cannot appear in the cache. All words in the dynamic cache can be found in the target-side vocabulary of RNNSearch*.
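The rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `DynamicCache`, its method names, and the choice to leave an already-cached word in its current position (rather than refreshing it) are assumptions.

```python
from collections import deque


class DynamicCache:
    """FIFO word cache with a size cap, no duplicates, and a stop-word filter."""

    def __init__(self, max_size, stop_words):
        self.max_size = max_size
        self.stop_words = set(stop_words)
        self.words = deque()

    def add(self, word):
        # Stop words (including punctuation and "UNK") never enter the cache.
        if word in self.stop_words:
            return
        # Assumption: a word already in the cache is skipped and keeps its position.
        if word in self.words:
            return
        # First-in-first-out eviction when the cache is full.
        if len(self.words) >= self.max_size:
            self.words.popleft()
        self.words.append(word)

    def clear(self):
        self.words.clear()


cache = DynamicCache(max_size=3, stop_words={"the", ".", "UNK"})
for token in ["the", "military", "actions", "military", "meeting", "side", "."]:
    cache.add(token)
print(list(cache.words))  # ['actions', 'meeting', 'side']
```

In the example, "military" is evicted when "side" arrives because it is the oldest entry, while "the" and "." are filtered out by the stop word list.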
3.1.2 Topic Cache
In order to build the topic cache, we first use an off-the-shelf LDA topic tool (http://www.arbylon.net/projects/) to learn topic distributions of source- and target-side documents separately. Then we estimate a topic projection distribution $p(z_t|z_s)$ over all target-side topics $z_t$ for each source topic $z_s$ by collecting events $(z_s, z_t)$ and accumulating their counts from aligned document pairs. Notice that $z_s$/$z_t$ is the topic with the highest topic probability on the source/target side. Then we can use the topic cache as follows:
During the training process of NMT, the learned target-side topic model is used to infer the topic distribution for each target document. For a target document $d$ in the training data, we select the topic $z$ with the highest probability as the topic for the document. The most probable topical words in topic $z$ are extracted to fill the topic cache for the document $d$.
In the NMT testing process, we first infer the topic distribution for a source document in question with the learned source-side topic model. From the topic distribution, we choose the topic with the highest probability as the topic for the source document. Then we use the learned topic projection function to map the source topic onto a target topic with the highest projection probability, as illustrated in Figure 2. After that, we use the most probable topical words in the projected target topic to fill the topic cache.
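The topic projection and the test-time cache filling described above can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, topic distributions are plain lists of probabilities as an LDA tool might return them, and the projection is estimated by relative frequency of the counted topic pairs, which the text does not spell out.

```python
from collections import Counter, defaultdict


def top_topic(topic_dist):
    """Index of the most probable topic in a document's topic distribution."""
    return max(range(len(topic_dist)), key=lambda k: topic_dist[k])


def estimate_projection(aligned_topic_dists):
    """Count (source topic, target topic) events over aligned document pairs
    and normalize the counts per source topic (relative frequency, assumed)."""
    counts = defaultdict(Counter)
    for src_dist, tgt_dist in aligned_topic_dists:
        counts[top_topic(src_dist)][top_topic(tgt_dist)] += 1
    projection = {}
    for z_s, tgt_counts in counts.items():
        total = sum(tgt_counts.values())
        projection[z_s] = {z_t: n / total for z_t, n in tgt_counts.items()}
    return projection


def fill_topic_cache(src_dist, projection, topic_words, size):
    """Map the top source topic to its most probable target topic and
    return that topic's most probable words (lists assumed sorted)."""
    z_s = top_topic(src_dist)
    z_t = max(projection[z_s], key=projection[z_s].get)
    return topic_words[z_t][:size]


# Toy data: two topics per side, four aligned document pairs.
pairs = [([0.9, 0.1], [0.2, 0.8]), ([0.7, 0.3], [0.1, 0.9]),
         ([0.6, 0.4], [0.6, 0.4]), ([0.2, 0.8], [0.7, 0.3])]
projection = estimate_projection(pairs)
topic_words = {0: ["law", "court", "bill"], 1: ["military", "army", "attack"]}
topic_cache = fill_topic_cache([0.8, 0.2], projection, topic_words, size=2)
print(topic_cache)  # ['military', 'army']
```

A source topic that never occurs as the top topic of any training document has no entry in `projection`; a real system would need a fallback for that case, which this sketch leaves out.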
The words in the topic cache and the dynamic cache together form the final cache model. In practice, the cache stores word embeddings, as shown in Figure 3. As we do not want to introduce extra embedding parameters, we let the cache share the target word embedding matrix of the NMT model. Consequently, if a word is not in the target-side vocabulary of the NMT model, we discard it from the cache.
3.2 The Model
The cache-based neural model evaluates the probabilities of words occurring in the cache and provides the evaluation results to the decoder via a gating mechanism.
3.2.1 Evaluating Word Entries in the Cache
When the decoder generates the next target word $y_t$, we hope that the cache can provide helpful information to judge whether $y_t$ is appropriate from the perspective of the document-level cache if $y_t$ occurs in the cache. To achieve this goal, we should appropriately evaluate the word entries in the cache.
In this paper, we build a new neural network layer as the scorer for the cache. At each decoding step $t$, we use the scorer to score $y_t$ if $y_t$ is in the cache. The inputs to the scorer are the current hidden state $s_t$ of the decoder, the previous word $y_{t-1}$, the context vector $c_t$, and the word $y_t$ from the cache. The score of $y_t$ is calculated as follows:
Here the scoring function is a non-linear function (a feed-forward network in our experiments; see Section 4.1).
This score is further used to estimate the cache probability of $y_t$ as follows:
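One plausible form of these two steps is sketched below. The function name $g_{cache}$, the notation $p_{cache}$, and in particular the softmax normalization over the words $w$ currently in the cache are assumptions of this sketch; the text only states that the score is a non-linear function of $s_t$, $c_t$, $y_{t-1}$ and $y_t$ and that it is turned into a cache probability.

```latex
\begin{align}
\mathrm{score}(y_t \mid y_{<t}, x) &= g_{cache}(s_t,\, c_t,\, y_{t-1},\, y_t) \\
p_{cache}(y_t \mid y_{<t}, x) &=
  \frac{\exp\big(\mathrm{score}(y_t \mid y_{<t}, x)\big)}
       {\sum_{w \in \mathrm{cache}} \exp\big(\mathrm{score}(w \mid y_{<t}, x)\big)}
\end{align}
```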
3.2.2 Integrating the Cache-based Neural Model into NMT
Since we have two prediction probabilities for the next target word $y_t$, one from the cache-based neural model and the other originally estimated by the NMT decoder, how do we integrate these two probabilities? Here, we introduce a gating mechanism to combine them: word prediction probabilities over the NMT vocabulary are updated by linear interpolation between the NMT probability and the cache-based neural model probability. The final word prediction probability for $y_t$ is calculated as follows:
Notice that if $y_t$ is not in the cache, we set its cache probability to 0. The interpolation weight is the gate, which is computed by applying a sigmoid function to the output of a non-linear function of the contextual elements.
We use the contextual elements $s_t$, $c_t$ and $y_{t-1}$ to score the current target word occurring in the cache (Eq. (6)) and to estimate the gate (Eq. (9)). If the target word is consistent with the context and in the cache at the same time, the probability of the target word will be high.
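The gated interpolation can be illustrated with a small Python sketch. It rests on assumptions: the function names are hypothetical, the gate is taken to weight the NMT probability (the text does not say which side the gate multiplies), and the gate input is a plain number standing in for the output of the learned non-linear function.

```python
import math


def sigmoid(z):
    """Logistic function used to squash the gate input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def combine(p_nmt, p_cache, gate):
    """Mix NMT and cache probabilities word by word.

    p_nmt   : dict word -> probability over the NMT vocabulary
    p_cache : dict word -> probability over the words in the cache
    gate    : value in (0, 1); here it weights the NMT probability (assumption)
    Words that are not in the cache get a cache probability of 0.
    """
    return {
        word: (1.0 - gate) * p_cache.get(word, 0.0) + gate * p
        for word, p in p_nmt.items()
    }


p_nmt = {"actions": 0.4, "operations": 0.5, "the": 0.1}
p_cache = {"actions": 1.0}
gate = sigmoid(0.0)  # 0.5; in the model the gate input comes from a learned network
p_final = combine(p_nmt, p_cache, gate)
print(p_final)
```

In this toy case the NMT decoder alone prefers "operations", but the cached word "actions" wins after interpolation, and the mixed probabilities still sum to one because every cache word is in the vocabulary.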
Finally, we train the proposed cache model jointly with the NMT model towards minimizing the negative log-likelihood on the training corpus. The cost function is computed as follows:
where $\theta$ denotes all parameters in the cache-based NMT model.
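Written out, a negative log-likelihood objective of this kind can be sketched as below, assuming a training corpus of $N$ sentence pairs $(x^{(n)}, y^{(n)})$ with target lengths $T_n$; the index names and the symbol $\mathcal{L}$ are assumptions of this sketch, and $p$ is the final (interpolated) word prediction probability.

```latex
\begin{equation}
\mathcal{L}(\theta) = -\sum_{n=1}^{N} \sum_{t=1}^{T_n}
  \log p\big(y_t^{(n)} \mid y_{<t}^{(n)},\, x^{(n)};\, \theta\big)
\end{equation}
```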
3.3 Decoding Process
Our cache-based NMT system works as follows:
- When the decoder shifts to a new test document, clear the topic cache and the dynamic cache.
- Obtain target topical words for the new test document as described in Section 3.1 and fill them into the topic cache.
- Clear the dynamic cache when translating the first sentence of the test document.
- For each sentence in the new test document, translate it with the proposed cache-based NMT system and continuously expand the dynamic cache with newly generated target words and with target words obtained from the best translation hypotheses of previous sentences.
In this way, the topic cache can provide useful global information at the beginning of the translation process while the dynamic cache is growing with the progress of translation.
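The document-level loop can be sketched in Python as follows. The function and parameter names are hypothetical, `translate_sentence` stands in for the cache-based NMT decoder, and the sketch makes one simplification that should be kept in mind: it updates the dynamic cache after each translated sentence, whereas the system described above also adds words of the partial translation while a sentence is being decoded.

```python
def translate_documents(documents, topic_words_for, translate_sentence,
                        max_dynamic=100, stop_words=frozenset()):
    """Translate documents sentence by sentence with per-document caches."""
    translations = []
    for document in documents:
        # New document: refill the topic cache and start with an empty dynamic cache.
        topic_cache = list(topic_words_for(document))
        dynamic_cache = []
        doc_translation = []
        for sentence in document:
            hypothesis = translate_sentence(sentence, topic_cache, dynamic_cache)
            # Simplification: the cache is updated once per sentence, not per word.
            for word in hypothesis:
                if word in stop_words or word in dynamic_cache:
                    continue
                if len(dynamic_cache) >= max_dynamic:
                    dynamic_cache.pop(0)  # first in, first out
                dynamic_cache.append(word)
            doc_translation.append(hypothesis)
        translations.append(doc_translation)
    return translations


# Stub decoder for illustration: records the caches it sees and echoes the input.
seen = []


def fake_translate(sentence, topic_cache, dynamic_cache):
    seen.append((list(topic_cache), list(dynamic_cache)))
    return sentence.split()


out = translate_documents([["a b", "b c"], ["d"]], lambda doc: ["t"], fake_translate)
print(seen)
```

The recorded snapshots show the intended behaviour: the dynamic cache is empty for the first sentence of each document, grows within a document, and is reset when the next document starts.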
4 Experiments
We evaluated the effectiveness of the proposed cache-based neural model for neural machine translation on NIST Chinese-English translation tasks.
4.1 Experimental Setting
We selected the corpora LDC2003E14, LDC2004T07, LDC2005T06, LDC2005T10 and a portion of the data from the corpus LDC2004T08 (Hong Kong Hansards/Laws/News) as our bilingual training data, in which document boundaries are explicitly kept. In total, our training data contain 103,236 documents and 2.80M sentences. On average, each document consists of 28.4 sentences. We chose the NIST05 dataset (1082 sentence pairs) as our development set, and NIST02, NIST04 and NIST06 (878, 1788 and 1664 sentence pairs, respectively) as our test sets. We compared our proposed model against the following two systems:
- Moses [Koehn et al.2007]: an off-the-shelf phrase-based translation system with its default setting.
- RNNSearch*: our in-house attention-based NMT system, which adopts the feedback attention described in Section 2.
For Moses, we used the full training data to train the model. We ran GIZA++ [Och and Ney2000] on the training data in both directions, and merged the alignments in the two directions with the “grow-diag-final” refinement rule [Koehn et al.2005] to obtain the final word alignments. We trained a 5-gram language model on the Xinhua portion of the Gigaword corpus using the SRILM toolkit with modified Kneser-Ney smoothing.
For RNNSearch, we used the parallel corpus to train the attention-based NMT model. The encoder of RNNSearch consists of a forward and a backward recurrent neural network. The word embedding dimension is 620 and the size of a hidden layer is 1000. The maximum length of the sentences that we used to train RNNSearch in our experiments was set to 50 on both the Chinese and English sides. We used the most frequent 30K words for both Chinese and English, and replaced rare words with a special token “UNK”. Dropout was applied only on the output layer and the dropout rate was set to 0.5. All other settings were the same as those in [Bahdanau et al.2015]. Once the NMT model was trained, we adopted beam search to find possible translations with high probabilities. We set the beam width to 10.
We implemented the proposed cache-based NMT model on top of RNNSearch*. We set the sizes of the dynamic and topic caches to 100 and 200, respectively. For the dynamic cache, we only kept the most recently visited items. For the LDA tool, we set the number of topics to 100 and the number of topic words used to fill the topic cache to 200. The hyperparameters $\alpha$ and $\beta$ of LDA were set to 0.5 and 0.1, respectively. We used feed-forward neural networks with two hidden layers to define the scoring function (Equation (6)) and the gate function (Equation (9)). For the former, the numbers of units in the two hidden layers were set to 500 and 200, respectively. For the latter, they were set to 1000 and 500, respectively. We used a pre-training strategy that has been widely used in the literature to train our proposed model: we first trained the regular attention-based NMT model using our implementation of RNNSearch*, and then used its parameters to initialize the parameters of the proposed model, except for those related to the operations of the proposed cache model.
We used mini-batch stochastic gradient descent with Adadelta to train the NMT models. The mini-batch size was set to 80 sentences, and the Adadelta decay rate $\rho$ was set to 0.95.
4.2 Experimental Results
Table 1 shows the results of different models measured in terms of BLEU score. (As our model requires document boundaries so as to guarantee that cache words are from the same document, we use all LDC corpora that provide document boundaries. Most training sentences are from the Hong Kong Hansards/Laws Parallel Text, accounting for 57.82% of our training data, which is in the law domain rather than the news domain of our test/dev sets. This is the reason that our baseline is lower than other published results obtained using more news-domain training data without document boundaries.) From the table, we can find that our implementation RNNSearch*, using the feedback attention and dropout, outperforms Moses by 3.23 BLEU points. The proposed model achieves an average gain of 1.01 BLEU points over RNNSearch* on all test sets. Further, the model achieves an average gain of 1.60 BLEU points over RNNSearch*, and it outperforms Moses by 4.83 BLEU points. These results strongly suggest that the dynamic and topic caches are very helpful and able to improve translation quality in document translation.
Effect of the Gating Mechanism
In order to validate the effectiveness of the gating mechanism used in the cache-based neural model, we replace the learned gate with a fixed value; in other words, we use a mixture of probabilities with fixed proportions in place of the gating mechanism that automatically learns the mixture weights.
Table 2 displays the result. When we set the gate to a fixed value of 0.3, the performance declines markedly in terms of BLEU score compared with that of the model with the learned gate. The performance is even worse than that of RNNSearch* by 10.11 BLEU points. Therefore, without a good integration mechanism, the cache-based neural model cannot be appropriately integrated into NMT. This shows that the gating mechanism plays an important role in the proposed model.
Effect of the Topic Cache
When the NMT decoder translates the first sentence of a document, the dynamic cache is empty. In this case, we hope that the topic cache will provide document-level information for translating the first sentence. We therefore further investigate how the topic cache influences the translation of the first sentence in a document. We count and calculate the average number of words that appear in both the translations of the beginning sentences of documents and the topic cache.
The statistical results are shown in Table 3. Without using the cache model, RNNSearch* generates translations that contain words from the topic cache as these topic words are tightly related to documents being translated. With the topic cache, our neural cache model enables the translations of the first sentences to be more relevant to the global topics of documents being translated as these translations contain more words from the topic cache that describes these documents. As the dynamic cache is empty when the decoder translates the beginning sentences, the topic cache is complementary to such a cold cache at the start. Comparing the numbers of translations generated by our model and human translations (Reference in Table 3), we can find that with the help of the topic cache, translations of the first sentences of documents are becoming closer to human translations.
Analysis on the Cache-based Neural Model
As shown above, the topic cache is able to influence both the translations of beginning sentences and those of subsequent sentences, while the dynamic cache built from translations of preceding sentences has an impact on the translations of subsequent sentences. We further study what roles the dynamic and topic caches play in the translation process. To this end, we calculate the average number of words in the translations generated by our model that are also in the caches. During the counting process, stop words and “UNK” are removed from sentence and document translations.
Table 4 shows the results. If only the topic cache is used, the cache can still provide useful information to help NMT translate sentences and documents: 28.3 words per document and 2.39 words per sentence are from the topic cache. When both the dynamic and topic caches are used, the numbers of words that occur both in sentence/document translations and in the two caches sharply increase from 2.61/30.27 to 6.73/81.16. The reason for this is that words that appear in preceding sentences have a large probability of appearing in subsequent sentences. This shows that the dynamic cache plays an important role in keeping document translations consistent by reusing words from preceding sentences.
Table 5: Translation examples.
|System|Sentences|
|---|---|
|SRC|(1) 并 将 计划 中 的 一 系列 军事 行动 提前 付诸 实施 。|
| |(2) 会议 决定 加大 对 巴方 的 军事 打击 行动 。|
|REF|(1) and to implement ahead of schedule a series of military actions still being planned .|
| |(2) the meeting decided to increase military actions against palestinian .|
|RNNSearch*|(1) and to implement a series of military operations .|
| |(2) the meeting decided to increase military actions against the palestinian side .|
|Our model|(1) and to implement a series of military actions plans.|
| |(2) the meeting decided to increase military actions against the palestinian side .|
We also provide two translation examples in Table 5. We can find that RNNSearch* generates different translations, “operations” and “actions”, for the same Chinese word “行动 (xingdong)”, while our proposed model produces the same translation, “actions”.
4.3 Analysis on Translation Coherence
Table 6: The average cosine similarity of adjacent sentences (coherence) on all test sets.
We want to further study how the proposed cache-based neural model influences coherence in document translation. For this, we follow Lapata and Barzilay [2005] and measure coherence as sentence similarity. First, each sentence is represented by the mean of the distributed vectors of its words. Second, the similarity between two sentences is determined by the cosine of their means:
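$$sim(S_1, S_2) = \cos\big(\mu(\vec{S_1}),\, \mu(\vec{S_2})\big)$$
where $\mu(\vec{S_i}) = \frac{1}{|S_i|} \sum_{\vec{w} \in S_i} \vec{w}$, and $\vec{w}$ is the vector for word $w$.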
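The coherence measure described above can be sketched in plain Python. The function names are hypothetical, and two details are assumptions of this sketch rather than part of the description: words without an embedding are skipped, and sentence pairs in which one sentence has no known words are left out of the average.

```python
import math


def sentence_vector(words, embeddings):
    """Mean of the word vectors of a sentence; None if no word is known."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def document_coherence(sentences, embeddings):
    """Average cosine similarity of adjacent sentences in one document."""
    sims = []
    for first, second in zip(sentences, sentences[1:]):
        u = sentence_vector(first, embeddings)
        v = sentence_vector(second, embeddings)
        if u is not None and v is not None:
            sims.append(cosine(u, v))
    return sum(sims) / len(sims) if sims else 0.0


# Toy two-dimensional embeddings, for illustration only.
toy_embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
coherence = document_coherence([["a"], ["a"], ["b"]], toy_embeddings)
print(coherence)  # 0.5
```

In the toy document the first two sentences are identical (similarity 1) and the last two are orthogonal (similarity 0), so the average over adjacent pairs is 0.5.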
We use Word2Vec (http://word2vec.googlecode.com/svn/trunk/) to obtain the distributed vectors of words, with English Gigaword Fourth Edition (https://catalog.ldc.upenn.edu/LDC2009T13) as training data. We consider that embeddings from word2vec trained on a large monolingual corpus can encode the semantic information of words well. We set the dimensionality of word embeddings to 200. Table 6 shows the average cosine similarity of adjacent sentences on all test sets. From the table, we can find that the proposed model produces better coherence in document translation than RNNSearch* in terms of cosine similarity.
5 Conclusion
In this paper, we have presented a novel cache-based neural model for NMT to capture global topic information and inter-sentence cohesion dependencies. We use a gating mechanism to integrate both the topic and dynamic caches into the proposed neural cache model. Experimental results show that the cache-based neural model achieves consistent and significant improvements in translation quality over several state-of-the-art NMT and SMT baselines. Further analysis reveals that the topic cache and the dynamic cache are complementary to each other and that both are able to guide the NMT decoder to use topical words and to reuse words from recently translated sentences as next word predictions.
Acknowledgments
The present research was supported by the National Natural Science Foundation of China (Grant No. 61622209). We would like to thank three anonymous reviewers for their insightful comments.
References
- [Bahdanau et al.2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.
- [Bertoldi et al.2013] Nicola Bertoldi, Mauro Cettolo, and Marcello Federico. 2013. Cache-based online adaptation for machine translation enhanced computer assisted translation. Proceedings of the XIV Machine Translation Summit, pages 35–42.
- [Chueh and Chien2010] Chuang-Hua Chueh and Jen-Tzung Chien. 2010. Topic cache language model for speech recognition. In Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, pages 5194–5197. IEEE.
- [Della Pietra et al.1992] Stephen Della Pietra, Vincent Della Pietra, Robert L Mercer, and Salim Roukos. 1992. Adaptive language modeling using minimum discriminant estimation. In Proceedings of the workshop on Speech and Natural Language, pages 103–106. Association for Computational Linguistics.
- [Gong et al.2011] Zhengxian Gong, Min Zhang, and Guodong Zhou. 2011. Cache-based document-level statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 909–919. Association for Computational Linguistics.
- [Grave et al.2016] Edouard Grave, Armand Joulin, and Nicolas Usunier. 2016. Improving neural language models with a continuous cache.
- [Huang et al.2014] Zhiheng Huang, Geoffrey Zweig, and Benoit Dumoulin. 2014. Cache based recurrent neural network language model inference for first pass speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 6354–6358. IEEE.
- [Jean et al.2015] Sebastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 1–10.
- [Jean et al.2017] Sebastien Jean, Stanislas Lauly, Orhan Firat, and Kyunghyun Cho. 2017. Does neural machine translation benefit from larger context? arXiv preprint arXiv:1704.05135.
- [Koehn et al.2005] Philipp Koehn, Amittai Axelrod, Alexandra Birch, Chris Callison-Burch, Miles Osborne, David Talbot, and Michael White. 2005. Edinburgh system description for the 2005 IWSLT speech translation evaluation. In IWSLT, pages 68–75.
- [Koehn et al.2007] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, pages 177–180. Association for Computational Linguistics.
- [Kuhn and De Mori1990] R. Kuhn and Renato De Mori. 1990. A cache-based natural language model for speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(6):570–583.
- [Lapata and Barzilay2005] Mirella Lapata and Regina Barzilay. 2005. Automatic evaluation of text coherence: models and representations. In International Joint Conference on Artificial Intelligence, pages 1085–1090.
- [Luong et al.2015a] Minh Thang Luong, Hieu Pham, and Christopher D. Manning. 2015a. Effective approaches to attention-based neural machine translation. Computer Science.
- [Luong et al.2015b] Minh-Thang Luong, Ilya Sutskever, Quoc V Le, Oriol Vinyals, and Wojciech Zaremba. 2015b. Addressing the rare word problem in neural machine translation. Association for Computational Linguistics.
- [Nepveu et al.2004] Laurent Nepveu, Guy Lapalme, Philippe Langlais, and George F. Foster. 2004. Adaptive language and translation models for interactive machine translation. In Conference on Empirical Methods in Natural Language Processing , EMNLP 2004, A Meeting of Sigdat, A Special Interest Group of the Acl, Held in Conjunction with ACL 2004, 25-26 July 2004, Barcelona, Spain, pages 190–197.
- [Och and Ney2000] Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, pages 440–447. Association for Computational Linguistics.
- [Shen et al.2015] Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2015. Minimum risk training for neural machine translation. Computer Science.
- [Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
- [Tam et al.2007] Yik Cheung Tam, Ian Lane, and Tanja Schultz. 2007. Bilingual LSA-based adaptation for statistical machine translation. Machine Translation, 21(4):187–207.
- [Tiedemann2010] Jörg Tiedemann. 2010. Context adaptation in statistical machine translation using models with exponentially decaying cache. In Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing, pages 8–15. Association for Computational Linguistics.
- [Wang and Cho2016] Tian Wang and Kyunghyun Cho. 2016. Larger-context language modelling with recurrent neural network. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1319–1329.
- [Wang et al.2017] Longyue Wang, Zhaopeng Tu, Andy Way, and Qun Liu. 2017. Exploiting cross-sentence context for neural machine translation. Proceedings of the Conference on Empirical Methods in Natural Language Processing.
- [Xiong and Zhang2013] Deyi Xiong and Min Zhang. 2013. A topic-based coherence model for statistical machine translation. In AAAI.