The past several years have witnessed significant progress in Neural Machine Translation (NMT) [Kalchbrenner and Blunsom2013, Cho et al.2014, Sutskever, Vinyals, and Le2014, Bahdanau, Cho, and Bengio2015]. In particular, NMT has significantly improved translation between language pairs that involve rich morphology prediction and/or significant word reordering [Luong and Manning2015, Bentivogli et al.2016]. Long short-term memory [Hochreiter and Schmidhuber1997] enables NMT to conduct long-distance reordering, which is a significant challenge for Statistical Machine Translation (SMT) [Brown et al.1993, Koehn, Och, and Marcu2003].
Unlike SMT, which employs a number of components, NMT adopts an end-to-end encoder-decoder framework to model the entire translation process. The role of the encoder is to summarize the source sentence into a sequence of latent vectors, and the decoder acts as a language model that generates a target sentence word by word, selectively leveraging the information from the latent vectors at each step. In learning, NMT essentially estimates the likelihood of a target sentence given a source sentence.
However, conventional NMT faces two main problems:
Translations generated by NMT systems often lack adequacy. When generating target words, the decoder often repeatedly selects some parts of the source sentence while ignoring other parts, which leads to over-translation and under-translation [Tu et al.2016b]. This is mainly because NMT has no mechanism to ensure that the information on the source side is completely transferred to the target side.
The likelihood objective is suboptimal in decoding. NMT uses beam search to find a translation that maximizes the likelihood. However, we observe that likelihood favors short translations, and thus fails to distinguish good translation candidates from bad ones in a large decoding space (e.g., with a large beam size). The main reason is that likelihood only captures the unidirectional dependency from source to target, which does not correlate well with translation adequacy [Li and Jurafsky2016, Shen et al.2016].
While previous work partially solves the above problems, in this work we propose a novel encoder-decoder-reconstructor model for NMT, aiming at alleviating these problems in a unified framework. As shown in Figure 1, given a Chinese sentence “duoge jichang beipo guanbi .”, the standard encoder-decoder translates it into an English sentence and assigns a likelihood score. Then, the newly added reconstructor reconstructs the translation back to the source sentence and calculates the corresponding reconstruction score. Linear interpolation of the two scores produces an overall score of the translation.
As seen, the added reconstructor imposes a constraint that an NMT model should be able to reconstruct the input source sentence from the target-side hidden layers, which encourages the decoder to embed complete information from the source side. The reconstruction score serves as an auxiliary objective to measure the adequacy of translation. The combined objective, consisting of likelihood and reconstruction, measures both fluency and adequacy of translations and is used in both training and testing.
Experimental results show that the proposed approach consistently improves the translation performance when increasing the decoding space. Our model achieves a significant improvement of 2.3 BLEU points over a strong attention-based NMT system, and of 4.5 BLEU points over a state-of-the-art SMT system, trained on the same data.
Encoder-Decoder based NMT
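Given a source sentence $\mathbf{x} = x_1, \ldots, x_J$ and a target sentence $\mathbf{y} = y_1, \ldots, y_I$, end-to-end NMT directly models the translation probability word by word:
$$P(\mathbf{y} \mid \mathbf{x}; \theta) = \prod_{i=1}^{I} P(y_i \mid y_{<i}, \mathbf{x}; \theta)$$
where $\theta$ is the set of model parameters and $y_{<i} = y_1, \ldots, y_{i-1}$ is the partial translation. Prediction of the $i$-th target word is generally made in an encoder-decoder framework:
$$P(y_i \mid y_{<i}, \mathbf{x}; \theta) = \mathrm{softmax}\big(f(y_{i-1}, s_i, c_i)\big)$$
where $s_i$ is the $i$-th hidden target state computed by the decoder Recurrent Neural Network (RNN), $c_i$ is the $i$-th source representation for generating the $i$-th target word, and $f(\cdot)$ is an activation function in the decoder. Current NMT models differ in how they calculate $c_i$ from the hidden states of the encoder. Please refer to [Sutskever, Vinyals, and Le2014, Bahdanau, Cho, and Bengio2015] for more details. The parameters of the NMT model are trained to maximize the likelihood of a set of training examples $\{[\mathbf{x}^n, \mathbf{y}^n]\}_{n=1}^{N}$:
$$\hat{\theta} = \arg\max_{\theta} \sum_{n=1}^{N} \log P(\mathbf{y}^n \mid \mathbf{x}^n; \theta)$$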
When generating each target word, the decoder adaptively selects partial information (i.e., the source representation $c_i$) from the encoder. This amounts to greedily selecting the most useful information for each generated word. There is, however, no mechanism to guarantee that the decoder conveys complete information from the source sentence to the target sentence.
In addition, we find that the performance of NMT decreases as the decoding space increases, as shown in Table 1. This is because likelihood favors short but inadequate translation candidates, which are newly added together with good candidates in larger decoding spaces. Normalizing likelihood by translation length faces the same problem.
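The length bias described above can be illustrated with a minimal sketch. The per-word probabilities below are hypothetical values chosen for illustration, and the function names are assumptions rather than part of the original system. A shorter candidate receives a higher log-likelihood simply because it multiplies fewer probabilities, and dividing by length removes that penalty without adding any signal about which candidate covers the source.

```python
import math

def log_likelihood(word_probs):
    """Sentence log-likelihood: sum of per-word log-probabilities."""
    return sum(math.log(p) for p in word_probs)

def length_normalized(word_probs):
    """Log-likelihood divided by the number of target words."""
    return log_likelihood(word_probs) / len(word_probs)

# Hypothetical per-word probabilities for two candidates of one source sentence.
short_candidate = [0.6, 0.6, 0.6]                  # assumed to drop part of the source
full_candidate = [0.6, 0.6, 0.6, 0.6, 0.6, 0.6]    # assumed to cover the whole source

short_ll = log_likelihood(short_candidate)
full_ll = log_likelihood(full_candidate)

# Raw likelihood prefers the short candidate; after normalization the two tie,
# so neither score tells us which candidate is adequate.
print(short_ll > full_ll)
print(length_normalized(short_candidate), length_normalized(full_candidate))
```

In this toy setting neither score can separate the two candidates by adequacy, which is the gap an auxiliary objective is meant to fill.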
It is important to introduce an auxiliary objective to measure the adequacy of translation, which complements likelihood.
Reconstruction in Auto-Encoder
Reconstruction is a standard concept in auto-encoders, and is usually realized by a feed-forward network [Bourlard and Kamp1988, Vincent et al.2010, Socher et al.2011]. The model consists of an encoding function that computes a representation from an input, and a decoding function that reconstructs the input from the representation. The parameters of the two functions are trained to maximize the reconstruction score, which measures the similarity between the original input and the reconstructed input.
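As a rough illustration of the encoding function, decoding function and reconstruction score, the sketch below uses a tiny linear auto-encoder. The tied weights (decoding with the transpose of the encoding matrix), the negative squared error as the score, and all names and values are assumptions made for brevity, not details taken from the cited work.

```python
def encode(x, W):
    """Representation h = W x, where W has shape (hidden, input)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def decode(h, W):
    """Tied-weight reconstruction x' = W^T h (an assumption of this sketch)."""
    n_in = len(W[0])
    return [sum(W[k][j] * h[k] for k in range(len(W))) for j in range(n_in)]

def reconstruction_score(x, x_rec):
    """Negative squared error: higher means a more faithful reconstruction."""
    return -sum((a - b) ** 2 for a, b in zip(x, x_rec))

# A 2-dimensional code that keeps only the first two input dimensions.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
x_kept = [2.0, -1.0, 0.0]   # fully representable by the code
x_lost = [2.0, -1.0, 3.0]   # the third dimension is dropped by the encoder

score_kept = reconstruction_score(x_kept, decode(encode(x_kept, W), W))
score_lost = reconstruction_score(x_lost, decode(encode(x_lost, W), W))
print(score_kept, score_lost)
```

The input whose information is dropped by the representation gets the lower score, which is the property that makes reconstruction a plausible proxy for adequacy.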
Reconstruction examines whether the reconstructed input is faithful to the original input, which is essentially similar to the consideration of adequacy in translation. It is therefore natural to integrate reconstruction into NMT to enhance the adequacy of translation. The basic idea of our approach is to reconstruct the source sentence from the latent representations of the decoder, and to use the reconstruction score as the adequacy measure. Analogous to an auto-encoder, our approach also learns a latent representation of the source sentence on the target side. Our approach can be viewed as a supervised auto-encoder in the sense that the latent representation is used not only to reconstruct the source sentence, but also to generate the target sentence.
We propose a novel encoder-decoder-reconstructor framework. More specifically, we build our approach on top of attention-based NMT [Bahdanau, Cho, and Bengio2015, Luong, Pham, and Manning2015], which is used as the baseline in the experiments later. We note that the proposed approach is generally applicable to other types of NMT architecture, such as the sequence-to-sequence model [Sutskever, Vinyals, and Le2014]. The model architecture, shown in Figure 2, consists of two components:
Standard encoder-decoder reads the input sentence and outputs its translation along with the likelihood score, as shown in the background section.
Added reconstructor reads the hidden state sequence from the decoder and outputs a score of exactly reconstructing the input sentence, which we will describe below.
As shown in Figure 2, the reconstructor reconstructs the input sentence from the hidden layer at the target side. We use this hidden layer as the representation of the translation, since it plays a key role in generating the translation. We aim to encourage it to embed complete source information, while at the same time reducing the complexity of the model and keeping the training easy.
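Specifically, the reconstructor reconstructs the source sentence word by word, conditioned on an inverse context vector $\hat{c}_j$ for each input word $x_j$. The inverse context vector is computed as a weighted sum of the hidden layers $s_1, \ldots, s_I$ at the target side:
$$\hat{c}_j = \sum_{i=1}^{I} \hat{\alpha}_{j,i} \cdot s_i$$
The weight $\hat{\alpha}_{j,i}$ of each hidden layer $s_i$ is computed by an added inverse attention model, which has its own parameters independent of the original attention model. Writing $\mathbf{s}$ for the sequence of decoder hidden states and $\gamma$ for the reconstructor parameters, the reconstruction probability is calculated by
$$R(\mathbf{x} \mid \mathbf{s}; \gamma) = \prod_{j=1}^{J} R(x_j \mid x_{<j}, \mathbf{s}) = \prod_{j=1}^{J} g_r(x_{j-1}, \hat{h}_j, \hat{c}_j)$$
where $\hat{h}_j$ is the hidden state in the reconstructor, computed by
$$\hat{h}_j = f_r(x_{j-1}, \hat{h}_{j-1}, \hat{c}_j)$$
Here $g_r(\cdot)$ and $f_r(\cdot)$ are the softmax function and the activation function of the reconstructor, respectively. The source words share the same word embeddings with the encoder.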
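The weighted sum that produces the inverse context vector can be sketched as follows. Dot-product scoring against a query vector (standing in for the previous reconstructor state) is an assumption made for brevity; the model described here uses a separate learned inverse attention network, and all names and toy values below are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def inverse_context(query, decoder_states):
    """Weighted sum of decoder hidden states for one source position.

    Dot-product scoring is an assumption of this sketch, not the learned
    inverse attention model described in the text.
    """
    scores = [sum(q * s for q, s in zip(query, state)) for state in decoder_states]
    weights = softmax(scores)
    dim = len(decoder_states[0])
    context = [sum(w * state[d] for w, state in zip(weights, decoder_states))
               for d in range(dim)]
    return weights, context

# Toy decoder hidden states for a three-word translation.
decoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# A zero query scores every state equally, so the weights are uniform.
weights, context = inverse_context([0.0, 0.0], decoder_states)
print(weights, context)
```

With a non-zero query the weights concentrate on the decoder states most similar to it, so each source word can draw on different parts of the translation.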
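Formally, we train both the encoder-decoder and the reconstructor on a set of training examples $\{[\mathbf{x}^n, \mathbf{y}^n]\}_{n=1}^{N}$, where $\mathbf{s}^n$ is the state sequence in the decoder after generating $\mathbf{y}^n$, and $\theta$ and $\gamma$ are the model parameters of the encoder-decoder and the reconstructor, respectively. The new training objective is:
$$(\hat{\theta}, \hat{\gamma}) = \arg\max_{\theta, \gamma} \sum_{n=1}^{N} \Big\{ \log P(\mathbf{y}^n \mid \mathbf{x}^n; \theta) + \lambda \log R(\mathbf{x}^n \mid \mathbf{s}^n; \gamma) \Big\}$$
where $\lambda$ is a hyper-parameter that balances the preference between likelihood and reconstruction.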
Note that the objective consists of two parts: likelihood measures translation fluency, and reconstruction measures translation adequacy. It is clear that the combined objective is more consistent with the goal of enhancing overall translation quality, and can more effectively guide the parameter training for making better translation.
Once a model is trained, we use a beam search to find a translation that approximately maximizes both the likelihood and reconstruction score. As shown in Figure 3, given an input sentence, a two-phase scheme is used:
The standard encoder-decoder produces a set of translation candidates, each of which is a triple consisting of a translation candidate $\mathbf{y}$, its corresponding hidden layer at the target side $\mathbf{s}$, and its likelihood score $P(\mathbf{y} \mid \mathbf{x})$.
For each translation candidate, the reconstructor reads its corresponding hidden layer at the target side and outputs an auxiliary reconstruction score $R(\mathbf{x} \mid \mathbf{s})$. Linear interpolation of the likelihood and reconstruction scores produces an overall score, which is used to select the final translation. (The interpolation weight in testing is the same as in training.)
In testing, reconstruction works as a reranking technique to select a better translation from the n-best candidates generated by the decoder.
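The second phase of this scheme can be sketched as a simple reranking step. The candidate strings, scores and the `Candidate` and `rerank` names are hypothetical, and the scores are treated as log-probabilities so that linear interpolation is a weighted sum; this is a minimal sketch under those assumptions rather than the system's actual decoder.

```python
from typing import List, NamedTuple

class Candidate(NamedTuple):
    translation: str
    log_likelihood: float       # score from the encoder-decoder
    log_reconstruction: float   # score from the reconstructor

def rerank(candidates: List[Candidate], lam: float) -> Candidate:
    """Pick the candidate with the best interpolated score."""
    return max(candidates,
               key=lambda c: c.log_likelihood + lam * c.log_reconstruction)

# Hypothetical beam output for one source sentence.
candidates = [
    Candidate("airports were closed", -2.0, -6.0),
    Candidate("many airports were forced to close", -3.5, -2.5),
]

best = rerank(candidates, lam=1.0)            # likelihood plus reconstruction
likelihood_only = rerank(candidates, lam=0.0)  # likelihood alone
print(best.translation)
print(likelihood_only.translation)
```

With the weight set to zero the shorter candidate wins on likelihood alone, while a positive weight lets the reconstruction score promote the candidate that preserves more of the source.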
We carry out experiments on Chinese-English translation. The training dataset consists of 1.25M sentence pairs extracted from LDC corpora, with 27.9M Chinese words and 34.5M English words respectively. (The corpora include LDC2002E18, LDC2003E07, LDC2003E14, LDC2004T07, LDC2004T08 and LDC2005T06.) We choose the NIST 2002 (MT02) dataset as the validation set, and the NIST 2005 (MT05), 2006 (MT06) and 2008 (MT08) datasets as test sets. We use the case-insensitive 4-gram NIST BLEU score [Papineni et al.2002] as the evaluation metric, and the sign-test [Collins, Koehn, and Kučerová2005] for statistical significance testing.
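For readers unfamiliar with the sign test, the sketch below computes a two-sided exact p-value from counts of sentences on which one system beats the other. How wins and losses are counted per sentence, and the example counts, are assumptions of this illustration and not the exact protocol of the cited work.

```python
from math import comb

def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact sign test; ties are assumed to be discarded beforehand."""
    n = wins + losses
    k = min(wins, losses)
    # Probability of seeing k or fewer of the rarer outcome under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts: system A is better on 70 sentences, worse on 30.
p_value = sign_test_p_value(wins=70, losses=30)
print(p_value)
```

A small p-value indicates that the observed imbalance between wins and losses is unlikely if the two systems were equally good.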
We compare our method with state-of-the-art SMT and NMT models:
Moses [Koehn et al.2007]: an open source phrase-based translation system with default configuration and a 4-gram language model trained on the target portion of training data.
RNNSearch: our re-implemented attention-based NMT system, which incorporates dropout [Hinton et al.2012] on the output layer and improves the attention model by feeding the most recently generated word.
For training RNNSearch, we limit the source and target vocabularies to the most frequent 30K words in Chinese and English. We train each model on the sentences of length up to 80 words in the training data. We shuffle mini-batches as we proceed, and the mini-batch size is 80. The word embedding dimension is 620 and the hidden layer dimension is 1000. We train for 15 epochs using Adadelta [Zeiler2012].
For our model, we use the same setting as RNNSearch if applicable. We set the hyper-parameter . The parameters of our model (i.e., encoder and decoder, except those related to reconstructor) are initialized by the RNNSearch model trained on a parallel corpus. We further train all the parameters of our model for another 10 epochs.
Correlation between Reconstruction and Adequacy
In the first experiment, we investigate the validity of the underlying assumption of our approach, namely that the reconstruction score correlates well with translation adequacy. We conduct a subjective evaluation: two human evaluators are asked to evaluate the translations of 200 source sentences randomly sampled from the test sets. We calculate the Pearson correlation between the reconstruction scores and the corresponding adequacy and fluency scores on the samples, as shown in Table 2. The two evaluators produce similar results: the reconstruction score is more strongly related to translation adequacy than to fluency.
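The correlation analysis can be reproduced in outline with a plain implementation of the Pearson coefficient. The score lists below are made-up placeholders, not the human judgments from the evaluation, and the function name is an assumption of this sketch.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Placeholder data: reconstruction log-scores and human adequacy ratings.
reconstruction_scores = [-1.0, -2.0, -3.0, -4.0]
adequacy_ratings = [5.0, 4.0, 3.0, 2.0]

r = pearson(reconstruction_scores, adequacy_ratings)
print(r)
```

A coefficient near 1 means that higher reconstruction scores go with higher human ratings; comparing the coefficient for adequacy with the one for fluency gives the kind of contrast reported in Table 2.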
Effect of Reconstruction on Translation
In this experiment, we investigate the effect of reconstruction on translation performance over time, which is measured in BLEU scores on the validation set. For reconstruction, we use the reconstructor to stochastically generate a source sentence for each translation (note that this differs from the standard procedure, which calculates the probability of exactly reconstructing the original input), and calculate the BLEU score of the reconstructed input with the original input as reference. Generally, as shown in Figure 4, the BLEU score of translation goes up with the improvement of reconstruction over time. The translation performance reaches a peak at iteration 110K, when the model achieves a balance between likelihood and reconstruction score. Therefore, we use the trained model at iteration 110K in the following experiments.
Effect of Reconstruction in Large Decoding Space
Can our approach cope with the limitation of likelihood in large decoding spaces? To answer this question, we investigate the effect of reconstruction on different beam sizes , as shown in Table 3. Our approach can indeed solve the problem: increasing the size of decoding space generally leads to improving the BLEU score. We attribute this to the ability of the combined objective to measure both fluency and adequacy of translation candidates. There is a significant gap between and . However, keeping increasing does not result in significant improvements of translation accuracy but greatly decreases decoding efficiency. Therefore, in the following experiments we set the max value of to , and use normalized likelihood for if we don’t use reconstruction in testing.
Table 4 shows the translation performances on test sets measured in BLEU score. RNNSearch significantly outperforms Moses by 2.2 BLEU points on average, indicating that it is a strong baseline NMT system. This is mainly due to the introduction of two advanced techniques. Increasing beam size leads to decreasing translation performances on test sets, which is consistent with the result on the validation set. We compare our methods with “RNNSearch (Beam=10)” in the following analysis, since it yields the best performance in the baseline systems.
First, the introduction of reconstruction significantly improves the performance over baseline by 1.1 BLEU points with beam size . Most importantly, we obtain a further improvement of 1.2 BLEU points when expanding the decoding space. Second, our approach also consistently improves the quality (in terms of Oracle score, see the last column) of -best translation candidates over the baseline system on various beam sizes. This confirms our claim that the combined objective contributes to parameter training for generating better translation candidates.
We conduct extensive analyses to better understand our model in terms of efficiency of the added reconstruction, contribution of reconstruction from training and testing, alleviating typical translation problems, and building the ability of handling long sentences.
Introducing reconstruction significantly slows down training, while it only slightly slows down decoding. For training, when running on a single Tesla K80 GPU, the speed of the baseline model is 960 target words per second, while the speed of the proposed model is 500 target words per second. For decoding with beam=10, the speed of the baseline model is 2.28 seconds per sentence, while that of the proposed approach is 2.60 seconds per sentence. (For decoding with beam=100, the speeds are 22.97 and 25.29 seconds per sentence, respectively.) We attribute the efficiency of decoding to the avoidance of beam search for reconstruction and the benefit of batch computation on GPU.
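The relative costs implied by these figures can be checked with a short calculation. It uses only the numbers quoted above; the variable names are illustrative.

```python
# Training throughput in target words per second.
baseline_train_wps = 960
proposed_train_wps = 500
train_slowdown = baseline_train_wps / proposed_train_wps

# Decoding time in seconds per sentence.
decode_overhead_beam10 = 2.60 / 2.28 - 1
decode_overhead_beam100 = 25.29 / 22.97 - 1

print(f"training is {train_slowdown:.2f}x slower")
print(f"decoding overhead at beam=10: {decode_overhead_beam10:.1%}")
print(f"decoding overhead at beam=100: {decode_overhead_beam100:.1%}")
```

Training throughput roughly halves, whereas the decoding overhead stays in the range of about 10 to 14 percent at both beam sizes.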
The contribution of reconstruction is two-fold: (1) enabling parameter training for generating better translation candidates, and (2) enabling better reranking of generated candidates in testing. Table 5 lists the improvements from the two contribution sources. When applied only in training, reconstruction improves translation performance by generating fluent and adequate translation candidates. On top of that, reconstruction-based reranking further improves the performance. The improvements are more significant when the decoding space increases.
We then conduct a subjective evaluation to investigate the benefit of incorporating reconstruction on the randomly selected 200 sentences. Table 6 shows the results of subjective evaluation on translation. RNNSearch suffers from serious under-translation and over-translation problems, which is consistent with the finding in other work [Tu et al.2016b]. Incorporating reconstruction significantly alleviates these problems, and reduces 11.0% and 38.5% of under-translation and over-translation errors respectively. The main reason is that both under-translation and over-translation lead to lower reconstruction scores, and thus are penalized by the reconstruction objective. As a result, the corresponding candidate is less likely to be selected as the final translation.
Following Bahdanau, Cho, and Bengio [2015], we group sentences of similar lengths together and compute the BLEU score for each group, as shown in Figure 5. The proposed approach outperforms all the other systems in all length segments. Specifically, RNNSearch outperforms Moses on all sentence segments, while its performance degrades faster than that of its competitors, which is consistent with the finding in [Bentivogli et al.2016]. This is mainly because RNNSearch suffers seriously from inadequate translations on long sentences [Tu et al.2016b]. Our model explicitly encourages the decoder to incorporate source information as much as possible, and thus the improvements are more significant on long sentences.
Comparison with Previous Work
We re-implement the methods of Tu et al. [2016b, 2016a] on top of RNNSearch. For the coverage mechanism [Tu et al.2016b], we use the neural network based coverage, and the coverage dimension is 100. For the context gates [Tu et al.2016a], we apply them on both source and target sides. Table 7 lists the comparison results. The coverage mechanism and context gates each significantly improve translation performance individually, and combining them achieves a further improvement. This is consistent with the results in [Tu et al.2016b, Tu et al.2016a]. Our model consistently improves the translation performance when further combined with these models.
Our work is inspired by research on improving NMT by:
Enhancing Translation Adequacy
Recently, several studies have shown that NMT favors fluent but inadequate translations [Tu et al.2016b, Tu et al.2016a]. While all of this work aims at enhancing the adequacy of NMT, our approach is complementary: the above work stays within the standard encoder-decoder framework, while we propose a novel encoder-decoder-reconstructor framework. Experiments show that combining those models together can further improve the translation performance.
Improving Beam Search
Standard NMT models exploit a simple beam search algorithm to generate the translation word by word. Several researchers rescore word candidates with additional features, such as language model probability [Gulcehre et al.2015] and SMT features [He et al.2016, Stahlberg et al.2016]. In contrast, Li and Jurafsky [2016] rescore translation candidates at the sentence level with the mutual information between source and target sides. In the above work, NMT is treated as a black box and its weighted outputs are combined with other features only in testing. In this work, we go a step further by incorporating the reconstruction score into the training objective, which leads to better translation candidates.
Capturing Bidirectional Dependency
Standard NMT models only capture the unidirectional dependency from source to target with the likelihood objective. It has been shown that a combination of two directional models outperforms each model alone [Liang, Taskar, and Klein2006, Cheng et al.2016a, Cheng et al.2016b]. Among them, Cheng et al. [2016b] reconstruct monolingual corpora with two separate source-to-target and target-to-source NMT models. Although closely related to Cheng et al. [2016b], our approach aims at enhancing the adequacy of unidirectional (i.e., source-to-target) NMT via an auxiliary target-to-source objective on parallel corpora, while theirs focuses on learning bidirectional NMT models via auto-encoders on monolingual corpora. Therefore, we use the decoder states as the input of the reconstructor, to encourage the target representation to contain the complete source information needed to reconstruct the source sentence.
We propose a novel encoder-decoder-reconstructor framework for NMT, in which the newly added reconstructor introduces an auxiliary score to measure the adequacy of translation candidates. The advantage of the proposed approach is two-fold. First, it improves parameter training for producing better translation candidates. Second, it consistently improves translation performance when the decoding space increases, while conventional NMT fails to do so. Experimental results show that these two advantages indeed help our approach to consistently improve translation performance.
There is still a significant gap between the actual translation and the oracle of the n-best translation candidates, especially when the decoding space increases. We plan to narrow the gap with rich features, which can better measure the quality of translation candidates. It is also necessary to validate the effectiveness of our approach on more language pairs and other NMT architectures.
This work is supported by China National 973 project 2014CB340301. Yang Liu is supported by the National Natural Science Foundation of China (No. 61522204) and the 863 Program (2015AA015407).
- [Bahdanau, Cho, and Bengio2015] Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In ICLR 2015.
- [Bentivogli et al.2016] Bentivogli, L.; Bisazza, A.; Cettolo, M.; and Federico, M. 2016. Neural versus Phrase-Based Machine Translation Quality: a Case Study. In EMNLP 2016.
- [Bourlard and Kamp1988] Bourlard, H., and Kamp, Y. 1988. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics 59(4-5):291–294.
- [Brown et al.1993] Brown, P. F.; Pietra, S. A. D.; Pietra, V. J. D.; and Mercer, R. L. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics 19(2):263–311.
- [Cheng et al.2016a] Cheng, Y.; Shen, S.; He, Z.; He, W.; Wu, H.; Sun, M.; and Liu, Y. 2016a. Agreement-based joint training for bidirectional attention-based neural machine translation. In IJCAI 2016.
- [Cheng et al.2016b] Cheng, Y.; Xu, W.; He, Z.; He, W.; Wu, H.; Sun, M.; and Liu, Y. 2016b. Semi-Supervised Learning for Neural Machine Translation. In ACL 2016.
- [Cho et al.2014] Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. In EMNLP 2014.
- [Collins, Koehn, and Kučerová2005] Collins, M.; Koehn, P.; and Kučerová, I. 2005. Clause restructuring for statistical machine translation. In ACL 2005.
- [Gulcehre et al.2015] Gulcehre, C.; Firat, O.; Xu, K.; Cho, K.; Barrault, L.; Lin, H.-C.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2015. On Using Monolingual Corpora in Neural Machine Translation. arXiv.
- [He et al.2016] He, W.; He, Z.; Wu, H.; and Wang, H. 2016. Improved neural machine translation with smt features. In AAAI 2016.
- [Hinton et al.2012] Hinton, G. E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. R. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv.
- [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation.
- [Kalchbrenner and Blunsom2013] Kalchbrenner, N., and Blunsom, P. 2013. Recurrent continuous translation models. In EMNLP 2013.
- [Koehn et al.2007] Koehn, P.; Hoang, H.; Birch, A.; Callison-Burch, C.; Federico, M.; Bertoldi, N.; Cowan, B.; Shen, W.; Moran, C.; Zens, R.; Dyer, C.; Bojar, O.; Constantin, A.; and Herbst, E. 2007. Moses: open source toolkit for statistical machine translation. In ACL 2007.
- [Koehn, Och, and Marcu2003] Koehn, P.; Och, F. J.; and Marcu, D. 2003. Statistical phrase-based translation. In NAACL 2003.
- [Li and Jurafsky2016] Li, J., and Jurafsky, D. 2016. Mutual information and diverse decoding improve neural machine translation. In NAACL 2016.
- [Liang, Taskar, and Klein2006] Liang, P.; Taskar, B.; and Klein, D. 2006. Alignment by agreement. In NAACL 2006.
- [Luong and Manning2015] Luong, M.-T., and Manning, C. D. 2015. Stanford neural machine translation systems for spoken language domains. In IWSLT 2015.
- [Luong, Pham, and Manning2015] Luong, M.-T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. In EMNLP 2015.
- [Papineni et al.2002] Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL 2002.
- [Shen et al.2016] Shen, S.; Cheng, Y.; He, Z.; He, W.; Wu, H.; Sun, M.; and Liu, Y. 2016. Minimum Risk Training for Neural Machine Translation. In ACL 2016.
- [Socher et al.2011] Socher, R.; Pennington, J.; Huang, E. H.; Ng, A. Y.; and Manning, C. D. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. In EMNLP 2011.
- [Stahlberg et al.2016] Stahlberg, F.; Hasler, E.; Waite, A.; and Byrne, B. 2016. Syntactically Guided Neural Machine Translation. arXiv.
- [Sutskever, Vinyals, and Le2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In NIPS 2014.
- [Tu et al.2016a] Tu, Z.; Liu, Y.; Lu, Z.; Liu, X.; and Li, H. 2016a. Context Gates for Neural Machine Translation. arXiv.
- [Tu et al.2016b] Tu, Z.; Lu, Z.; Liu, Y.; Liu, X.; and Li, H. 2016b. Modeling Coverage for Neural Machine Translation. In ACL 2016.
- [Vincent et al.2010] Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; and Manzagol, P.-A. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11(Dec):3371–3408.
- [Zeiler2012] Zeiler, M. D. 2012. ADADELTA: an adaptive learning rate method. arXiv.