Recently, end-to-end neural machine translation (NMT) [Kalchbrenner and Blunsom2013, Sutskever et al.2014, Bahdanau et al.2015] has attracted increasing attention from the community. Providing a new paradigm for machine translation, NMT aims at training a single, large neural network that directly transforms a source-language sentence to a target-language sentence without explicitly modeling latent structures (e.g., word alignment, phrase segmentation, phrase reordering, and SCFG derivation) that are vital in conventional statistical machine translation (SMT) [Brown et al.1993, Koehn et al.2003, Chiang2005].
Most NMT models follow an encoder-decoder framework, with an encoder that reads and encodes a source-language sentence into a vector, from which a decoder generates a target-language sentence. While early efforts encode the input into a fixed-length vector, Bahdanau et al. (2015) advocate the attention mechanism to dynamically generate a context vector for each target word being generated.
Although NMT models have achieved results on par with or better than conventional SMT, they still suffer from a major drawback: the models are optimized to maximize the likelihood of training data instead of evaluation metrics that actually quantify translation quality. Ranzato et al. (2015) indicate two drawbacks of maximum likelihood estimation (MLE) for NMT. First, the models are only exposed to the training distribution instead of model predictions. Second, the loss function is defined at the word level instead of the sentence level.
In this work, we introduce minimum risk training (MRT) for neural machine translation. The new training objective is to minimize the expected loss (i.e., risk) on the training data. MRT has the following advantages over MLE:
Direct optimization with respect to evaluation metrics: MRT introduces evaluation metrics as loss functions and aims to minimize expected loss on the training data.
Applicable to arbitrary loss functions: our approach allows arbitrary sentence-level loss functions, which are not necessarily differentiable.
Transparent to architectures: MRT does not assume the specific architectures of NMT and can be applied to any end-to-end NMT systems.
While MRT has been widely used in conventional SMT [Och2003, Smith and Eisner2006, He and Deng2012] and deep learning based MT [Gao et al.2014], to the best of our knowledge, this work is the first effort to introduce MRT into end-to-end NMT. Experiments on a variety of language pairs (Chinese-English, English-French, and English-German) show that MRT leads to significant improvements over MLE on a state-of-the-art NMT system [Bahdanau et al.2015].
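Given a source sentence $\mathbf{x} = x_1, \dots, x_m, \dots, x_M$ and a target sentence $\mathbf{y} = y_1, \dots, y_n, \dots, y_N$, end-to-end NMT directly models the translation probability:

$$P(\mathbf{y}|\mathbf{x};\theta) = \prod_{n=1}^{N} P(y_n|\mathbf{x},\mathbf{y}_{<n};\theta) \tag{1}$$

where $\theta$ is a set of model parameters and $\mathbf{y}_{<n} = y_1, \dots, y_{n-1}$ is a partial translation.

Predicting the $n$-th target word can be modeled by using a recurrent neural network:

$$P(y_n|\mathbf{x},\mathbf{y}_{<n};\theta) \propto \exp\big\{ q(y_{n-1}, \mathbf{z}_n, \mathbf{c}_n, \theta) \big\} \tag{2}$$

where $\mathbf{z}_n$ is the $n$-th hidden state on the target side, $\mathbf{c}_n$ is the context for generating the $n$-th target word, and $q(\cdot)$ is a non-linear function. Current NMT approaches differ in calculating $\mathbf{z}_n$ and $\mathbf{c}_n$ and defining $q(\cdot)$. Please refer to [Sutskever et al.2014, Bahdanau et al.2015] for more details.

Given a set of training examples $D = \{\langle \mathbf{x}^{(s)}, \mathbf{y}^{(s)} \rangle\}_{s=1}^{S}$, the standard training objective is to maximize the log-likelihood of the training data:

$$\hat{\theta}_{\mathrm{MLE}} = \mathop{\mathrm{argmax}}_{\theta} \big\{ \mathcal{L}(\theta) \big\} \tag{3}$$

where

\begin{align}
\mathcal{L}(\theta) &= \sum_{s=1}^{S} \log P(\mathbf{y}^{(s)}|\mathbf{x}^{(s)};\theta) \tag{4} \\
&= \sum_{s=1}^{S} \sum_{n=1}^{N^{(s)}} \log P(y_n^{(s)}|\mathbf{x}^{(s)},\mathbf{y}^{(s)}_{<n};\theta) \tag{5}
\end{align}

We use $N^{(s)}$ to denote the length of the $s$-th target sentence $\mathbf{y}^{(s)}$.

The partial derivative with respect to a model parameter $\theta_i$ is calculated as

$$\frac{\partial \mathcal{L}(\theta)}{\partial \theta_i} = \sum_{s=1}^{S} \sum_{n=1}^{N^{(s)}} \frac{\partial P(y_n^{(s)}|\mathbf{x}^{(s)},\mathbf{y}^{(s)}_{<n};\theta) / \partial \theta_i}{P(y_n^{(s)}|\mathbf{x}^{(s)},\mathbf{y}^{(s)}_{<n};\theta)} \tag{6}$$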
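As a small illustration of the chain-rule factorization in Eq. (1) and the log-likelihood objective in Eqs. (4)-(5), the following Python sketch sums per-word log-probabilities. The per-word probabilities are hypothetical numbers chosen for illustration; in a real system they would come from the decoder's softmax output.

```python
import math

def sentence_log_prob(step_probs):
    """log P(y | x) as the sum of per-word log-probabilities log P(y_n | x, y_<n)."""
    return sum(math.log(p) for p in step_probs)

def log_likelihood(corpus_step_probs):
    """MLE objective: sum of sentence log-probabilities over the training set."""
    return sum(sentence_log_prob(steps) for steps in corpus_step_probs)

# Hypothetical per-word probabilities for two gold-standard target sentences.
corpus = [
    [0.5, 0.25],       # sentence 1: P(y | x) = 0.5 * 0.25 = 0.125
    [0.8, 0.5, 0.5],   # sentence 2: P(y | x) = 0.8 * 0.5 * 0.5 = 0.2
]
total = log_likelihood(corpus)  # log(0.125) + log(0.2)
```

Working in log space keeps the product of many small probabilities numerically stable, which is why the objective is usually written as a sum of logs.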
Ranzato et al. (2015) point out that MLE for end-to-end NMT suffers from two drawbacks. First, while the models are trained only on the training data distribution, at test time they generate target words conditioned on their own previous predictions, which can be erroneous. This is referred to as exposure bias [Ranzato et al.2015]. Second, MLE usually uses the cross-entropy loss, which focuses on word-level errors to maximize the probability of the next correct word and may therefore correlate poorly with corpus-level and sentence-level evaluation metrics such as BLEU [Papineni et al.2002] and TER [Snover et al.2006].
As a result, it is important to introduce new training algorithms for end-to-end NMT to include model predictions during training and optimize model parameters directly with respect to evaluation metrics.
3 Minimum Risk Training for Neural Machine Translation
Minimum risk training (MRT), which aims to minimize the expected loss on the training data, has been widely used in conventional SMT [Och2003, Smith and Eisner2006, He and Deng2012] and deep learning based MT [Gao et al.2014]. The basic idea is to introduce evaluation metrics as loss functions and assume that the optimal set of model parameters should minimize the expected loss on the training data.
Let $\langle \mathbf{x}^{(s)}, \mathbf{y}^{(s)} \rangle$ be the $s$-th sentence pair in the training data and $\mathbf{y}$ be a model prediction. We use a loss function $\Delta(\mathbf{y}, \mathbf{y}^{(s)})$ to measure the discrepancy between the model prediction $\mathbf{y}$ and the gold-standard translation $\mathbf{y}^{(s)}$. Such a loss function can be negative smoothed sentence-level evaluation metrics such as BLEU [Papineni et al.2002], NIST [Doddington2002], TER [Snover et al.2006], or METEOR [Lavie and Denkowski2009] that have been widely used in machine translation evaluation. Note that a loss function is not parameterized and thus not differentiable with respect to the model parameters.
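In MRT, the risk is defined as the expected loss with respect to the posterior distribution:

\begin{align}
\mathcal{R}(\theta) &= \sum_{s=1}^{S} \mathbb{E}_{\mathbf{y}|\mathbf{x}^{(s)};\theta} \big[ \Delta(\mathbf{y}, \mathbf{y}^{(s)}) \big] \tag{7} \\
&= \sum_{s=1}^{S} \sum_{\mathbf{y} \in \mathcal{Y}(\mathbf{x}^{(s)})} P(\mathbf{y}|\mathbf{x}^{(s)};\theta) \, \Delta(\mathbf{y}, \mathbf{y}^{(s)}) \tag{8}
\end{align}

where $\mathcal{Y}(\mathbf{x}^{(s)})$ is a set of all possible candidate translations for $\mathbf{x}^{(s)}$.

The training objective of MRT is to minimize the risk on the training data:

$$\hat{\theta}_{\mathrm{MRT}} = \mathop{\mathrm{argmin}}_{\theta} \big\{ \mathcal{R}(\theta) \big\} \tag{9}$$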
Intuitively, while MLE aims to maximize the likelihood of training data, our training objective is to discriminate between candidates. For example, in Table 1, suppose the candidate set $\mathcal{Y}(\mathbf{x}^{(s)})$ contains only three candidates: $\mathbf{y}_1$, $\mathbf{y}_2$, and $\mathbf{y}_3$. The losses calculated by comparing each candidate with the gold-standard translation $\mathbf{y}^{(s)}$ rank the three candidates from best to worst. The right half of Table 1 shows four models. As model 1 (column 3) ranks the candidates in the reverse order of the gold-standard ranking, it obtains the highest risk. Achieving a better correlation with the gold-standard ranking than model 1, model 2 (column 4) reduces the risk. As model 3 (column 5) ranks the candidates in the same order as the gold standard, the risk goes down further. The risk can be reduced even more by concentrating the probability mass on the best candidate (column 6). As a result, by minimizing the risk on the training data, we expect to obtain a model that correlates well with the gold standard.
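The intuition above can be checked with a few lines of Python. This is a minimal sketch: the three losses and the four probability distributions are hypothetical values assumed for illustration (losses are negative, so a lower loss means a better candidate), and the model names are labels invented for this example.

```python
def risk(probs, losses):
    """Expected loss of one sentence: sum over candidates of P(y | x) * Delta(y, gold)."""
    return sum(p * d for p, d in zip(probs, losses))

# Hypothetical losses for candidates y1, y2, y3 (y1 best, y3 second, y2 worst).
losses = [-1.0, -0.3, -0.5]

# Hypothetical model distributions over (y1, y2, y3).
models = {
    "reverse ranking": [0.2, 0.5, 0.3],   # prefers the worst candidate
    "partly correct":  [0.3, 0.2, 0.5],   # ranks y3 > y1 > y2
    "correct ranking": [0.5, 0.2, 0.3],   # same order as the losses
    "concentrated":    [0.7, 0.1, 0.2],   # more mass on the best candidate
}

risks = {name: risk(probs, losses) for name, probs in models.items()}
for name, value in risks.items():
    print(f"{name:16s} risk = {value:.2f}")
```

Under these assumed numbers the risk falls monotonically as the model's ranking moves toward the loss-based ranking and as probability mass concentrates on the best candidate.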
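In MRT, the partial derivative with respect to a model parameter $\theta_i$ is given by

$$\frac{\partial \mathcal{R}(\theta)}{\partial \theta_i} = \sum_{s=1}^{S} \mathbb{E}_{\mathbf{y}|\mathbf{x}^{(s)};\theta} \left[ \Delta(\mathbf{y}, \mathbf{y}^{(s)}) \times \frac{\partial P(\mathbf{y}|\mathbf{x}^{(s)};\theta) / \partial \theta_i}{P(\mathbf{y}|\mathbf{x}^{(s)};\theta)} \right] \tag{10}$$

Since Eq. (10) suggests there is no need to differentiate $\Delta(\mathbf{y}, \mathbf{y}^{(s)})$, MRT allows arbitrary non-differentiable loss functions. In addition, our approach is transparent to architectures and can be applied to arbitrary end-to-end NMT models.

Despite these advantages, MRT faces a major challenge: the expectations in Eq. (10) are usually intractable to calculate due to the exponential search space of $\mathcal{Y}(\mathbf{x}^{(s)})$, the non-decomposability of the loss function $\Delta(\mathbf{y}, \mathbf{y}^{(s)})$, and the context sensitiveness of NMT.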
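To alleviate this problem, we propose to only use a subset of the full search space to approximate the posterior distribution and introduce a new training objective:

\begin{align}
\tilde{\mathcal{R}}(\theta) &= \sum_{s=1}^{S} \mathbb{E}_{\mathbf{y}|\mathbf{x}^{(s)};\theta,\alpha} \big[ \Delta(\mathbf{y}, \mathbf{y}^{(s)}) \big] \tag{11} \\
&= \sum_{s=1}^{S} \sum_{\mathbf{y} \in \mathcal{S}(\mathbf{x}^{(s)})} Q(\mathbf{y}|\mathbf{x}^{(s)};\theta,\alpha) \, \Delta(\mathbf{y}, \mathbf{y}^{(s)}) \tag{12}
\end{align}

where $\mathcal{S}(\mathbf{x}^{(s)}) \subset \mathcal{Y}(\mathbf{x}^{(s)})$ is a sampled subset of the full search space, and $Q(\mathbf{y}|\mathbf{x}^{(s)};\theta,\alpha)$ is a distribution defined on the subspace $\mathcal{S}(\mathbf{x}^{(s)})$:

$$Q(\mathbf{y}|\mathbf{x}^{(s)};\theta,\alpha) = \frac{P(\mathbf{y}|\mathbf{x}^{(s)};\theta)^{\alpha}}{\sum_{\mathbf{y}' \in \mathcal{S}(\mathbf{x}^{(s)})} P(\mathbf{y}'|\mathbf{x}^{(s)};\theta)^{\alpha}} \tag{13}$$

Note that $\alpha$ is a hyper-parameter that controls the sharpness of the $Q$ distribution [Och2003].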
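To make the role of the sharpness hyper-parameter concrete, here is a minimal Python sketch of the renormalized distribution over a sampled subset. The function name `q_distribution`, the argument name `alpha`, and the example probabilities are assumptions made for this illustration; computing in log space with a max-shift is a standard numerical-stability choice, not something the text prescribes.

```python
import math

def q_distribution(log_probs, alpha):
    """Renormalize candidate probabilities raised to the power alpha.

    log_probs: log P(y | x) for each candidate in the sampled subset.
    alpha:     sharpness; 0 gives a uniform distribution, larger values
               concentrate mass on the most probable candidates.
    """
    scaled = [alpha * lp for lp in log_probs]
    shift = max(scaled)                      # subtract the max to avoid overflow
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical candidate probabilities under the model.
log_probs = [math.log(p) for p in (0.6, 0.3, 0.1)]

q_sharp = q_distribution(log_probs, 5.0)    # close to one-hot on the first candidate
q_same = q_distribution(log_probs, 1.0)     # recovers (0.6, 0.3, 0.1)
q_flat = q_distribution(log_probs, 0.005)   # close to uniform
```

A small value of `alpha` flattens the distribution, so every sampled candidate contributes to the expected loss, while a large value lets the currently most probable candidate dominate.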
Algorithm 1 shows how to build $\mathcal{S}(\mathbf{x}^{(s)})$ by sampling the full search space. The sampled subset is initialized with the gold-standard translation (line 1). Then, the algorithm keeps sampling a target word given the source sentence and the partial translation until reaching the end of sentence (lines 3-16). Note that sampling might produce duplicate candidates, which are removed when building the subspace. We find that it is inefficient to force the algorithm to generate exactly $k$ distinct candidates because high-probability candidates can be sampled repeatedly, especially when the probability mass highly concentrates on a few candidates. In practice, we take advantage of the GPU's parallel architecture to speed up the sampling. (Footnote 1: To build the subset, an alternative to sampling is computing top-$k$ translations. We prefer sampling to computing top-$k$ translations for efficiency: sampling is more efficient and easier to implement than calculating $k$-best lists, especially given the extremely parallel architectures of GPUs.)
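The sampling procedure described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the paper's algorithm listing: `next_word_dist` stands in for the NMT model's next-word distribution, `toy_dist` is a made-up model, and `EOS`, `max_len`, and the function names are hypothetical.

```python
import random

EOS = "</s>"  # assumed end-of-sentence symbol

def sample_candidate(next_word_dist, source, rng, max_len=50):
    """Sample one translation word by word until EOS (or max_len) is reached."""
    prefix = []
    while len(prefix) < max_len:
        dist = next_word_dist(source, prefix)   # dict: word -> probability
        words = list(dist)
        word = rng.choices(words, weights=[dist[w] for w in words])[0]
        prefix.append(word)
        if word == EOS:
            break
    return tuple(prefix)

def build_sampled_subset(next_word_dist, source, gold, k, rng):
    """Start from the gold translation, draw k samples, and drop duplicates."""
    subset = {tuple(gold)}
    for _ in range(k):
        subset.add(sample_candidate(next_word_dist, source, rng))
    return subset

def toy_dist(source, prefix):
    """Made-up next-word distribution that forces EOS after two words."""
    if len(prefix) >= 2:
        return {EOS: 1.0}
    return {"a": 0.5, "b": 0.3, EOS: 0.2}

subset = build_sampled_subset(
    toy_dist, ["x"], ["a", "b", EOS], k=20, rng=random.Random(0)
)
```

Because duplicates are simply dropped by the set, the subset may contain fewer than `k + 1` candidates, which matches the observation that forcing exactly `k` distinct samples would be inefficient.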
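Given the sampled space, the partial derivative with respect to a model parameter $\theta_i$ of $\tilde{\mathcal{R}}(\theta)$ is given by

$$\frac{\partial \tilde{\mathcal{R}}(\theta)}{\partial \theta_i} = \alpha \sum_{s=1}^{S} \mathbb{E}_{\mathbf{y}|\mathbf{x}^{(s)};\theta,\alpha} \left[ \frac{\partial P(\mathbf{y}|\mathbf{x}^{(s)};\theta) / \partial \theta_i}{P(\mathbf{y}|\mathbf{x}^{(s)};\theta)} \times \Big( \Delta(\mathbf{y}, \mathbf{y}^{(s)}) - \mathbb{E}_{\mathbf{y}'|\mathbf{x}^{(s)};\theta,\alpha} \big[ \Delta(\mathbf{y}', \mathbf{y}^{(s)}) \big] \Big) \right] \tag{14}$$

Since $|\mathcal{S}(\mathbf{x}^{(s)})| \ll |\mathcal{Y}(\mathbf{x}^{(s)})|$, the expectations in Eq. (14) can be efficiently calculated by explicitly enumerating all candidates in $\mathcal{S}(\mathbf{x}^{(s)})$. In our experiments, we find that approximating the full space with 100 samples (i.e., $k = 100$) works very well and further increasing the sample size does not lead to significant improvements (see Section 4.3).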
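The structure of the gradient in Eq. (14) can be sanity-checked numerically. Treating the candidate log-probabilities as the free variables, the gradient of the sampled risk with respect to log P(y | x) should equal the sharpness times Q(y) times the loss centered by its expectation. The sketch below assumes that reading; the function names and example numbers are hypothetical.

```python
import math

def q_dist(log_probs, alpha):
    """Renormalized distribution over the sampled subset (sharpened by alpha)."""
    scaled = [alpha * lp for lp in log_probs]
    shift = max(scaled)
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def sampled_risk(log_probs, losses, alpha):
    """Expected loss under the renormalized distribution."""
    return sum(q * d for q, d in zip(q_dist(log_probs, alpha), losses))

def risk_grad_wrt_log_probs(log_probs, losses, alpha):
    """Analytic gradient: alpha * Q(y) * (Delta(y) - E_Q[Delta]) per candidate."""
    q = q_dist(log_probs, alpha)
    expected = sum(qi * d for qi, d in zip(q, losses))
    return [alpha * qi * (d - expected) for qi, d in zip(q, losses)]

# Hypothetical candidate log-probabilities and losses.
log_probs = [math.log(0.5), math.log(0.3), math.log(0.2)]
losses = [-1.0, -0.3, -0.5]
grad = risk_grad_wrt_log_probs(log_probs, losses, alpha=0.5)
```

Candidates whose loss is below the expected loss receive a negative gradient, so a gradient-descent step on the risk raises their log-probability, while worse-than-average candidates are pushed down.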
We evaluated our approach on three translation tasks: Chinese-English, English-French, and English-German. The evaluation metric is BLEU [Papineni et al.2002].
For Chinese-English, the training data consists of 2.56M pairs of sentences with 67.5M Chinese words and 74.8M English words, respectively. We used the NIST 2006 dataset as the validation set (hyper-parameter optimization and model selection) and the NIST 2002, 2003, 2004, 2005, and 2008 datasets as test sets.
For English-French, to compare with the results reported by previous work on end-to-end NMT [Sutskever et al.2014, Bahdanau et al.2015, Jean et al.2015, Luong et al.2015b], we used the same subset of the WMT 2014 training corpus that contains 12M sentence pairs with 304M English words and 348M French words. The concatenation of news-test 2012 and news-test 2013 serves as the validation set and news-test 2014 as the test set.
For English-German, to compare with the results reported by previous work [Jean et al.2015, Luong et al.2015a], we used the same subset of the WMT 2014 training corpus that contains 4M sentence pairs with 91M English words and 87M German words. The concatenation of news-test 2012 and news-test 2013 is used as the validation set and news-test 2014 as the test set.
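We compare our approach with two state-of-the-art SMT and NMT systems:
Moses [Koehn and Hoang2007]: a phrase-based SMT system using minimum error rate training [Och2003].
RNNsearch [Bahdanau et al.2015]: an attention-based NMT system using maximum likelihood estimation.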
Moses uses the parallel corpus to train a phrase-based translation model and the target part to train a 4-gram language model using the SRILM toolkit [Stolcke2002]. The log-linear model Moses uses is trained by the minimum error rate training (MERT) algorithm [Och2003] that directly optimizes model parameters with respect to evaluation metrics. (Footnote 2: It is possible to exploit larger monolingual corpora for both Moses and RNNsearch [Gulcehre et al.2015, Sennrich et al.2015]. We leave this for future work.)
RNNsearch uses the parallel corpus to train an attention-based neural translation model using the maximum likelihood criterion.
On top of RNNsearch, our approach replaces MLE with MRT. We initialize our model with the RNNsearch50 model [Bahdanau et al.2015]. We set the vocabulary size to 30K for Chinese-English and English-French and 50K for English-German. The beam size for decoding is 10. The default loss function is negative smoothed sentence-level BLEU.
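The text does not specify which smoothing is applied to sentence-level BLEU, so the following Python sketch assumes simple add-one smoothing of the n-gram precisions, purely for illustration. The function names and the single-reference setting are likewise assumptions; this is not the scoring code used in the experiments.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def smoothed_sentence_bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with add-one smoothing (an assumed smoothing scheme)."""
    if not candidate:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        matches = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        # Add-one smoothing keeps the precision non-zero when nothing matches.
        log_precision += math.log((matches + 1) / (total + 1))
    brevity_penalty = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity_penalty * math.exp(log_precision / max_n)

def loss(candidate, reference):
    """Negative smoothed sentence-level BLEU, usable as a sentence-level loss."""
    return -smoothed_sentence_bleu(candidate, reference)

reference = "the cat sat on the mat".split()
perfect = loss(reference, reference)
partial = loss("a dog ran off the mat".split(), reference)
```

Because the loss is only evaluated on complete candidates and never differentiated, any other sentence-level metric could be substituted for `smoothed_sentence_bleu` without changing the training procedure.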
4.2 Effect of $\alpha$
The hyper-parameter $\alpha$ controls the smoothness of the $Q$ distribution (see Eq. (13)). As shown in Figure 1, we find that $\alpha$ has a critical effect on BLEU scores on the Chinese-English validation set. While a large value of $\alpha$ decreases BLEU scores dramatically, a smaller value improves translation quality significantly and consistently. Reducing $\alpha$ still further, however, results in lower BLEU scores. Therefore, we use the best-performing intermediate value of $\alpha$ in the following experiments.
4.3 Effect of Sample Size
For efficiency, we sample $k$ candidate translations from the full search space $\mathcal{Y}(\mathbf{x}^{(s)})$ to build an approximate posterior distribution $Q$ (Section 3). Figure 2 shows the effect of the sample size $k$ on the Chinese-English validation set. It is clear that BLEU scores consistently rise with the increase of $k$. However, we find that a sample size larger than 100 usually does not lead to significant improvements and increases the GPU memory requirement. Therefore, we set $k = 100$ in the following experiments.
4.4 Effect of Loss Function
As our approach is capable of incorporating evaluation metrics as loss functions, we investigate the effect of different loss functions on BLEU, TER and NIST scores on the Chinese-English validation set. As shown in Table 2, negative smoothed sentence-level BLEU (i.e., sBLEU) leads to statistically significant improvements over MLE. Note that the loss functions are all defined at the sentence level while evaluation metrics are calculated at the corpus level. This discrepancy might explain why optimizing with respect to sTER does not result in the lowest TER on the validation set. As sBLEU consistently improves all evaluation metrics, we use it as the default loss function in our experiments.
4.5 Comparison of Training Time
We used a cluster with 20 Tesla K40 GPUs to train the NMT model. For MLE, it takes the cluster about one hour to train 20,000 mini-batches, each of which contains 80 sentences. The training time for MRT is longer than for MLE: 13,000 mini-batches can be processed in one hour on the same cluster.
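The throughput figures above can be turned into sentences per hour with simple arithmetic. The sketch below assumes that MRT mini-batches also contain 80 sentences, which the text states explicitly only for MLE; the variable names are chosen for this example.

```python
SENTENCES_PER_BATCH = 80        # stated for MLE; assumed to hold for MRT as well
MLE_BATCHES_PER_HOUR = 20_000
MRT_BATCHES_PER_HOUR = 13_000

mle_sentences_per_hour = MLE_BATCHES_PER_HOUR * SENTENCES_PER_BATCH   # 1,600,000
mrt_sentences_per_hour = MRT_BATCHES_PER_HOUR * SENTENCES_PER_BATCH   # 1,040,000

# Fraction of MLE throughput that MRT retains on the same cluster.
relative_throughput = MRT_BATCHES_PER_HOUR / MLE_BATCHES_PER_HOUR     # 0.65
```

Under that assumption, MRT processes about 65% as many training sentences per hour as MLE.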
Figure 3 shows the learning curves of MLE and MRT on the validation set. For MLE, the BLEU score reaches its peak after about 20 hours and then keeps going down dramatically. Initializing with the best MLE model, MRT increases BLEU scores dramatically within about 30 hours. Afterwards, the BLEU score keeps improving gradually but there are slight oscillations. (Footnote 3: Although it is possible to initialize with a randomized model, it takes much longer to converge.)
4.6 Results on Chinese-English Translation
4.6.1 Comparison of BLEU Scores
Table 3 shows BLEU scores on Chinese-English datasets. For RNNsearch, we follow Luong et al. (2015b) to handle rare words. We find that introducing minimum risk training into neural MT leads to surprisingly substantial improvements over Moses and RNNsearch with MLE as the training criterion (up to +8.61 and +7.20 BLEU points, respectively) across all test sets. All the improvements are statistically significant.
4.6.2 Comparison of TER Scores
Table 4 gives TER scores on Chinese-English datasets. The loss function used in MRT is sBLEU. MRT still obtains dramatic improvements over Moses and RNNsearch with MLE as the training criterion (up to -10.27 and -8.32 TER points, respectively) across all test sets. All the improvements are statistically significant.
4.6.3 BLEU Scores over Sentence Lengths
Figure 4 shows the BLEU scores of translations generated by Moses, RNNsearch with MLE, and RNNsearch with MRT on the Chinese-English test set with respect to input sentence lengths. While MRT consistently improves over MLE for all lengths, it achieves worse translation performance for sentences longer than 60 words.
One reason is that RNNsearch tends to produce short translations for long sentences. As shown in Figure 5, both MLE and MRT generate much shorter translations than Moses. This results from the length limit imposed by RNNsearch for efficiency reasons: a sentence in the training set is no longer than 50 words. This limit deteriorates translation performance because the sentences in the test set are usually longer than 50 words.
Table 6: Example translations.

| System | Translation |
|---|---|
| Source | meiguo daibiao tuan baokuo laizi shidanfu daxue de yi wei zhongguo zhuanjia , liang ming canyuan waijiao zhengce zhuli yiji yi wei fuze yu pingrang dangju da jiaodao de qian guowuyuan guanyuan . |
| Reference | the us delegation consists of a chinese expert from the stanford university , two senate foreign affairs policy assistants and a former state department official who was in charge of dealing with pyongyang authority . |
| Moses | the united states to members of the delegation include representatives from the stanford university , a chinese expert , two assistant senate foreign policy and a responsible for dealing with pyongyang before the officials of the state council . |
| RNNsearch-MLE | the us delegation comprises a chinese expert from stanford university , a chinese foreign office assistant policy assistant and a former official who is responsible for dealing with the pyongyang authorities . |
| RNNsearch-MRT | the us delegation included a chinese expert from the stanford university , two senate foreign policy assistants , and a former state department official who had dealings with the pyongyang authorities . |
Table 7: Results on English-French translation.

| System | Architecture | Training | Vocab | BLEU |
|---|---|---|---|---|
| Existing end-to-end NMT systems | | | | |
| Bahdanau et al. (2015) | gated RNN with search | MLE | 30K | 28.45 |
| Jean et al. (2015) | gated RNN with search | MLE | 30K | 29.97 |
| Jean et al. (2015) | gated RNN with search + PosUnk | MLE | 30K | 33.08 |
| Luong et al. (2015b) | LSTM with 4 layers | MLE | 40K | 29.50 |
| Luong et al. (2015b) | LSTM with 4 layers + PosUnk | MLE | 40K | 31.80 |
| Luong et al. (2015b) | LSTM with 6 layers | MLE | 40K | 30.40 |
| Luong et al. (2015b) | LSTM with 6 layers + PosUnk | MLE | 40K | 32.70 |
| Sutskever et al. (2014) | LSTM with 4 layers | MLE | 80K | 30.59 |
| Our end-to-end NMT systems | | | | |
| this work | gated RNN with search | MLE | 30K | 29.88 |
| this work | gated RNN with search | MRT | 30K | 31.30 |
| this work | gated RNN with search + PosUnk | MRT | 30K | 34.23 |
Table 8: Results on English-German translation.

| System | Architecture | Training | BLEU |
|---|---|---|---|
| Existing end-to-end NMT systems | | | |
| Jean et al. (2015) | gated RNN with search | MLE | 16.46 |
| Jean et al. (2015) | gated RNN with search + PosUnk | MLE | 18.97 |
| Jean et al. (2015) | gated RNN with search + LV + PosUnk | MLE | 19.40 |
| Luong et al. (2015a) | LSTM with 4 layers + dropout + local att. + PosUnk | MLE | 20.90 |
| Our end-to-end NMT systems | | | |
| this work | gated RNN with search | MLE | 16.45 |
| this work | gated RNN with search | MRT | 18.02 |
| this work | gated RNN with search + PosUnk | MRT | 20.45 |
4.6.4 Subjective Evaluation
We also conducted a subjective evaluation to validate the benefit of replacing MLE with MRT. Two human evaluators were asked to compare MLE and MRT translations of 100 source sentences randomly sampled from the test sets without knowing from which system a candidate translation was generated.
Table 5 shows the results of the subjective evaluation. The two human evaluators made close judgements: around 54% of MLE translations are worse than MRT translations, 23% are equal, and 23% are better.
4.6.5 Example Translations
Table 6 shows some example translations. We find that Moses translates a Chinese string “yi wei fuze yu pingrang dangju da jiaodao de qian guowuyuan guanyuan” that requires long-distance reordering in a wrong way, which is a notorious challenge for statistical machine translation. In contrast, RNNsearch-MLE seems to overcome this problem in this example thanks to the capability of gated RNNs to capture long-distance dependencies. However, as MLE uses a loss function defined only at the word level, its translation lacks sentence-level consistency: “chinese” occurs twice while “two senate” is missing. By optimizing model parameters directly with respect to sentence-level BLEU, RNNsearch-MRT seems to be able to generate translations more consistently at the sentence level.
4.7 Results on English-French Translation
Table 7 shows the results on English-French translation. We list existing end-to-end NMT systems that are comparable to our system. All these systems use the same subset of the WMT 2014 training corpus and adopt MLE as the training criterion. They differ in network architectures and vocabulary sizes. Our RNNsearch-MLE system achieves a BLEU score comparable to that of Jean et al. (2015). RNNsearch-MRT achieves the highest BLEU score in this setting even with a vocabulary size smaller than those of Luong et al. (2015b) and Sutskever et al. (2014). Note that our approach does not assume specific architectures and can in principle be applied to any NMT system.
4.8 Results on English-German Translation
Table 8 shows the results on English-German translation. Our approach still significantly outperforms MLE and achieves results comparable to state-of-the-art systems even though Luong et al. (2015a) used a much deeper neural network. We believe that our work can be applied to their architecture easily.
Despite these significant improvements, the margins on the English-German and English-French datasets are much smaller than on Chinese-English. We conjecture that there are two possible reasons. First, the Chinese-English datasets contain four reference translations for each sentence while both the English-French and English-German datasets only have single references. Second, Chinese and English are more distantly related than English, French and German and thus benefit more from MRT, which incorporates evaluation metrics into optimization to capture structural divergence.
5 Related Work
Our work originated from the minimum risk training algorithms in conventional statistical machine translation [Och2003, Smith and Eisner2006, He and Deng2012]. Och (2003) describes a smoothed error count to allow calculating gradients, which directly inspires us to use a parameter $\alpha$ to adjust the smoothness of the objective function. As neural networks are non-linear, our approach has to minimize the expected loss on the sentence level rather than the loss of 1-best translations on the corpus level. Smith and Eisner (2006) introduce minimum risk annealing for training log-linear models that is capable of gradually annealing to focus on the 1-best hypothesis. He and Deng (2012) apply minimum risk training to learning phrase translation probabilities. Gao et al. (2014) leverage MRT for learning continuous phrase representations for statistical machine translation. The difference is that they use MRT to optimize a sub-model of SMT while we are interested in directly optimizing end-to-end neural translation models.
The Mixed Incremental Cross-Entropy Reinforce (MIXER) algorithm [Ranzato et al.2015] is in spirit closest to our work. Building on the REINFORCE algorithm proposed by Williams (1992), MIXER allows incremental learning and the use of a hybrid loss function that combines both REINFORCE and cross-entropy. The major difference is that Ranzato et al. (2015) leverage reinforcement learning while our work resorts to minimum risk training. In addition, MIXER only samples one candidate to calculate the reinforcement reward while MRT generates multiple samples to calculate the expected risk. Figure 2 indicates that using multiple samples potentially increases MRT's capability of discriminating between diverse candidates and thus benefits translation quality. Our experiments confirm the finding of Ranzato et al. (2015) that taking evaluation metrics into account when optimizing model parameters does help to improve sentence-level text generation.
In this paper, we have presented a framework for minimum risk training in end-to-end neural machine translation. The basic idea is to minimize the expected loss in terms of evaluation metrics on the training data. We sample the full search space to approximate the posterior distribution to improve efficiency. Experiments show that MRT leads to significant improvements over maximum likelihood estimation for neural machine translation, especially for distantly-related languages such as Chinese and English.
In the future, we plan to test our approach on more language pairs and more end-to-end neural MT systems. It is also interesting to extend minimum risk training to minimum risk annealing following Smith and Eisner (2006). As our approach is transparent to loss functions and architectures, we believe that it will also benefit more end-to-end neural architectures for other NLP tasks.
This work was done while Shiqi Shen and Yong Cheng were visiting Baidu. Maosong Sun and Hua Wu are supported by the 973 Program (2014CB340501 & 2014CB34505). Yang Liu is supported by the National Natural Science Foundation of China (No.61522204 and No.61432013) and the 863 Program (2015AA011808). This research is also supported by the Singapore National Research Foundation under its International Research Centre@Singapore Funding Initiative and administered by the IDM Programme.
- [Ayana et al.2016] Ayana, Shiqi Shen, Zhiyuan Liu, and Maosong Sun. 2016. Neural headline generation with minimum risk training. arXiv:1604.01904.
- [Bahdanau et al.2015] Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.
- [Brown et al.1993] Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics.
- [Chiang2005] David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of ACL.
- [Cho et al.2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. In Proceedings of EMNLP.
- [Doddington2002] George Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of HLT.
- [Gao et al.2014] Jianfeng Gao, Xiaodong He, Wen tao Yih, and Li Deng. 2014. Learning continuous phrase representations for translation modeling. In Proceedings of ACL.
- [Gulcehre et al.2015] Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. arXiv:1503.03535.
- [He and Deng2012] Xiaodong He and Li Deng. 2012. Maximum expected bleu training of phrase and lexicon translation models. In Proceedings of ACL.
- [Jean et al.2015] Sebastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of ACL.
- [Kalchbrenner and Blunsom2013] Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of EMNLP.
- [Koehn and Hoang2007] Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In Proceedings of EMNLP.
- [Koehn et al.2003] Philipp Koehn, Franz J. Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of HLT-NAACL.
- [Lavie and Denkowski2009] Alon Lavie and Michael Denkowski. 2009. The meteor metric for automatic evaluation of machine translation. Machine Translation.
- [Lin2004] Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Proceedings of ACL.
- [Luong et al.2015a] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015a. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP.
- [Luong et al.2015b] Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015b. Addressing the rare word problem in neural machine translation. In Proceedings of ACL.
- [Och2003] Franz J. Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of ACL.
- [Papineni et al.2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of ACL.
- [Ranzato et al.2015] Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv:1511.06732v1.
- [Sennrich et al.2015] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Improving neural machine translation models with monolingual data. arXiv:1511.06709.
- [Smith and Eisner2006] David A. Smith and Jason Eisner. 2006. Minimum risk annealing for training log-linear models. In Proceedings of ACL.
- [Snover et al.2006] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of AMTA.
- [Stolcke2002] Andreas Stolcke. 2002. Srilm - an extensible language modeling toolkit. In Proceedings of ICSLP.
- [Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of NIPS.
- [Williams1992] Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning.