1 Introduction
Neural machine translation (NMT) Kalchbrenner and Blunsom (2013); Cho et al. (2014); Sutskever et al. (2014); Bahdanau et al. (2014) has achieved impressive performance Wu et al. (2016); Gehring et al. (2017); Vaswani et al. (2017); Hassan et al. (2018); Chen et al. (2018); Lample et al. (2018) and is drawing increasing attention. NMT models are built on the encoder-decoder framework, where the encoder network encodes the source sentence into distributed representations and the decoder network reconstructs the target sentence from these representations word by word.
Currently, NMT models are usually trained with a word-level loss (i.e., cross-entropy) under the teacher forcing algorithm Williams and Zipser (1989), which forces the model to generate a translation that strictly matches the ground truth at the word level. In practice, however, it is impossible to generate translations identical to the ground truth. Once different target words are generated, the word-level loss cannot evaluate the translation properly and usually underestimates it. In addition, teacher forcing suffers from exposure bias Ranzato et al. (2015) because it uses different inputs at training and inference: ground-truth words during training and previously predicted words during inference. Kim and Rush (2016) proposed sequence-level knowledge distillation, which uses teacher outputs to direct the training of the student model, but the student model still has no access to its own predicted words. Scheduled sampling (SS) Bengio et al. (2015); Venkatraman et al. (2015) attempts to alleviate exposure bias by mixing ground-truth words and previously predicted words as inputs during training. However, the sequence generated by SS may not be aligned with the target sequence, which is inconsistent with the word-level loss.
In contrast, sequence-level objectives, such as BLEU Papineni et al. (2002), GLEU Wu et al. (2016), TER Snover et al. (2006), and NIST Doddington (2002), evaluate translation at the sentence or n-gram level and allow for greater flexibility, and thus can mitigate the above problems of the word-level loss. However, because sequence-level objectives are non-differentiable, previous works on sequence-level training Ranzato et al. (2015); Shen et al. (2016); Bahdanau et al. (2016); Wu et al. (2016); He et al. (2016); Wu et al. (2017); Yang et al. (2017) mainly rely on reinforcement learning algorithms Williams (1992); Sutton et al. (2000) to find an unbiased gradient estimator for the gradient update. Sparse rewards in this setting often cause high variance in the gradient estimate, which leads to unstable training and limited improvements. Lamb et al. (2016); Gu et al. (2017); Ma et al. (2018) respectively use a discriminator, a critic and a bag-of-words target as sequence-level training objectives, all of which are directly connected to the generation model and hence enable direct gradient updates. However, these methods do not allow for direct optimization with respect to evaluation metrics.
In this paper, we propose a method that combines the strengths of word-level and sequence-level training: the direct gradient update without gradient estimation from word-level training, and the greater flexibility of sequence-level training. Our method introduces probabilistic n-gram matching, which makes sequence-level objectives (e.g., BLEU, GLEU) differentiable. During training, it abandons teacher forcing and performs greedy search instead, so that the predicted words are taken into consideration. Experimental results show that our method significantly outperforms word-level training with the cross-entropy loss and sequence-level training under the reinforcement framework. The experiments also indicate that the greedy search strategy is indeed superior to teacher forcing.
2 Background
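NMT is based on an end-to-end framework which directly models the translation probability from the source sentence $x$ to the target sentence $y$:
$$P(y \mid x) = \prod_{j=1}^{T} p(y_j \mid y_{<j}, x, \theta) \tag{1}$$
where $T$ is the target length and $\theta$ denotes the model parameters. Given a training set $D = \{(x^m, \hat{y}^m)\}_{m=1}^{M}$ with $M$ sentence pairs, the training objective is to maximize the log-likelihood of the training data:
$$\theta^{*} = \arg\max_{\theta} \sum_{m=1}^{M} \sum_{j=1}^{l_m} \log p(\hat{y}_j^m \mid \hat{y}_{<j}^m, x^m, \theta) \tag{2}$$
where the superscript $m$ indicates the $m$-th sentence in the dataset and $l_m$ is the length of the $m$-th target sentence.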
In the above model, the probability of each target word is conditioned on the previous target words. During training, the teacher forcing algorithm is employed and the ground-truth words of the target sentence are fed as context. During inference, the ground-truth words are not available and the previously predicted words are fed as context instead. This discrepancy is called exposure bias.
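The discrepancy between the two regimes can be seen in a small sketch. The bigram table `toy_model` below is a hypothetical stand-in for the decoder (it ignores the source sentence entirely), and the function names are illustrative rather than taken from the paper's code. Under teacher forcing every step is conditioned on the ground-truth prefix, whereas greedy decoding conditions on the model's own previous choices.

```python
import math

def toy_model(prefix):
    """Hypothetical stand-in for p(. | prefix, x, theta): a fixed bigram table."""
    table = {
        "<s>": {"the": 0.6, "a": 0.4},
        "the": {"cat": 0.7, "dog": 0.3},
        "a": {"dog": 0.9, "cat": 0.1},
        "cat": {"</s>": 1.0},
        "dog": {"</s>": 1.0},
    }
    return table[prefix[-1]]

def teacher_forcing_log_likelihood(model, reference):
    """Sum of log p(ref_j | ref_<j): the context is always the ground truth."""
    context = ["<s>"]
    total = 0.0
    for word in reference:
        total += math.log(model(context)[word])
        context.append(word)  # ground-truth word is fed as the next context
    return total

def greedy_decode(model, max_len=10):
    """Pick the most probable word at each step and feed it back as context."""
    context = ["<s>"]
    output = []
    for _ in range(max_len):
        dist = model(context)
        word = max(dist, key=dist.get)
        output.append(word)
        context.append(word)  # the model's own prediction is fed as context
        if word == "</s>":
            break
    return output

reference = ["a", "dog", "</s>"]
print(teacher_forcing_log_likelihood(toy_model, reference))
print(greedy_decode(toy_model))
```

In this toy setting the model is trained on the prefix "a" but at inference time it follows its own first choice "the", so it visits contexts it was never scored on during training.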
3 Model
3.1 Sequence-Level Objectives
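Many automatic evaluation metrics for machine translation, such as BLEU, GLEU and NIST, are based on n-gram matching. Assuming that $y$ and $\hat{y}$ are the output sentence and the ground-truth sentence with lengths $T$ and $T'$ respectively, the count of an n-gram $g = (g_1, \dots, g_n)$ in sentence $y$ is calculated as
$$C_y(g) = \sum_{t=0}^{T-n} \prod_{i=1}^{n} \mathbb{1}\{g_i = y_{t+i}\} \tag{3}$$
where $\mathbb{1}\{\cdot\}$ is the indicator function. The matching count of the n-gram $g$ between $y$ and $\hat{y}$ is given by
$$C_y^{\hat{y}}(g) = \min\left(C_y(g), C_{\hat{y}}(g)\right) \tag{4}$$
Then the precision $p_n$ and the recall $r_n$ of the predicted n-grams are calculated as follows
$$p_n = \frac{\sum_{g \in y} C_y^{\hat{y}}(g)}{\sum_{g \in y} C_y(g)} \tag{5}$$
$$r_n = \frac{\sum_{g \in y} C_y^{\hat{y}}(g)}{\sum_{g \in \hat{y}} C_{\hat{y}}(g)} \tag{6}$$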
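The n-gram counts, clipped matching counts, precision and recall described above can be sketched in a few lines of Python. The function names are illustrative, and sentences are assumed to be lists of word tokens.

```python
from collections import Counter

def ngram_counts(sentence, n):
    """Count every n-gram of a tokenized sentence."""
    return Counter(tuple(sentence[t:t + n]) for t in range(len(sentence) - n + 1))

def precision_recall(output, reference, n):
    """n-gram precision and recall of `output` against `reference`."""
    out_counts = ngram_counts(output, n)
    ref_counts = ngram_counts(reference, n)
    # Matching count: each n-gram is clipped by its count in the reference.
    matched = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
    out_total = sum(out_counts.values())
    ref_total = sum(ref_counts.values())
    precision = matched / out_total if out_total else 0.0
    recall = matched / ref_total if ref_total else 0.0
    return precision, recall

output = "the cat the cat sat".split()
reference = "the cat sat down".split()
print(precision_recall(output, reference, 1))
print(precision_recall(output, reference, 2))
```

The repeated "the cat" in the output is counted twice in the denominator of the precision but matched only once, which is the effect of the minimum in the matching count.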
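BLEU, the most widely used metric for machine translation evaluation, is defined based on the n-gram precision as follows
$$\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) \tag{7}$$
where BP stands for the brevity penalty and $w_n$ is the weight for the n-gram. In contrast, GLEU is the minimum of recall and precision of 1-4 grams, where 1-4 grams are counted together:
$$\text{GLEU} = \min\left(p_{1\text{-}4}, r_{1\text{-}4}\right) \tag{8}$$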
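As a rough illustration of the two metrics, the sketch below computes sentence-level BLEU and GLEU for tokenized sentences. It makes several assumptions that the text does not fix: uniform weights $w_n = 1/N$, the standard brevity penalty of Papineni et al. (2002), no smoothing (the score is zero when any n-gram order has no match), and a single reference. Function names are illustrative.

```python
import math
from collections import Counter

def ngram_counts(sentence, n):
    return Counter(tuple(sentence[t:t + n]) for t in range(len(sentence) - n + 1))

def clipped_matches(output, reference, n):
    """Number of output n-grams matched in the reference, with clipping."""
    out_counts, ref_counts = ngram_counts(output, n), ngram_counts(reference, n)
    return sum(min(c, ref_counts[g]) for g, c in out_counts.items())

def sentence_bleu(output, reference, max_n=4):
    """BP * exp(sum_n w_n log p_n) with uniform weights (an assumption)."""
    if not output:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        total = len(output) - n + 1
        matched = clipped_matches(output, reference, n)
        if total <= 0 or matched == 0:
            return 0.0  # unsmoothed: log p_n would be undefined
        log_sum += math.log(matched / total) / max_n
    # Brevity penalty: 1 if the output is longer than the reference.
    bp = min(1.0, math.exp(1.0 - len(reference) / len(output)))
    return bp * math.exp(log_sum)

def sentence_gleu(output, reference, max_n=4):
    """Minimum of precision and recall with 1..max_n grams pooled together."""
    orders = range(1, max_n + 1)
    matched = sum(clipped_matches(output, reference, n) for n in orders)
    out_total = sum(max(len(output) - n + 1, 0) for n in orders)
    ref_total = sum(max(len(reference) - n + 1, 0) for n in orders)
    if out_total == 0 or ref_total == 0:
        return 0.0
    return min(matched / out_total, matched / ref_total)

hyp = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print(sentence_bleu(hyp, ref, max_n=2), sentence_gleu(hyp, ref))
```

Because GLEU pools all n-gram orders before dividing, it stays non-zero for this sentence pair even though no 4-gram matches, whereas unsmoothed 4-gram BLEU drops to zero.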
3.2 Probabilistic Sequence-Level Objectives
In the output sentence $y$, the prediction probability varies among words. Some words are translated by the model with high confidence while others are translated with high uncertainty. However, when calculating the count of n-grams in Eq.(3), all the words in the output sentence are treated equally, regardless of their respective prediction probabilities.
To give a more precise description of n-gram counts that accounts for the varying prediction probabilities, we use the prediction probability $p(y_j \mid y_{<j}, x, \theta)$ as the count of word $y_j$. Correspondingly, the count of an n-gram occurrence is no longer one but the product of the probabilistic counts of all the words in the n-gram. The probabilistic count of $g = (g_1, \dots, g_n)$ is then calculated by summing over the output sentence $y$ as
$$\tilde{C}_y(g) = \sum_{t=0}^{T-n} \prod_{i=1}^{n} \mathbb{1}\{g_i = y_{t+i}\} \cdot p(y_{t+i} \mid y_{<t+i}, x, \theta) \tag{9}$$
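A minimal sketch of the probabilistic count follows, assuming the output sentence is given as a list of words together with a parallel list holding the prediction probability of each word. The function name and the example probabilities are illustrative.

```python
import math
from collections import defaultdict

def probabilistic_ngram_counts(words, probs, n):
    """Probabilistic count of each n-gram in the output sentence.

    words[t] is the t-th output word and probs[t] its prediction probability.
    Each occurrence contributes the product of its word probabilities
    instead of contributing 1.
    """
    counts = defaultdict(float)
    for t in range(len(words) - n + 1):
        counts[tuple(words[t:t + n])] += math.prod(probs[t:t + n])
    return dict(counts)

words = ["the", "cat", "the", "cat"]
probs = [0.9, 0.8, 0.5, 0.4]
print(probabilistic_ngram_counts(words, probs, 1))
print(probabilistic_ngram_counts(words, probs, 2))
```

When every probability equals 1, the function reduces to the ordinary n-gram count, so the probabilistic count can be read as a confidence-weighted version of it.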
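The probabilistic sequence-level objective is now obtained by replacing $C_y(g)$ with $\tilde{C}_y(g)$ (the tilde indicates the probabilistic version) and keeping the rest unchanged. Here, we take BLEU as an example and show how the probabilistic BLEU (denoted as PBLEU) is defined. For this purpose, the matching count of the n-gram $g$ in Eq.(4) is modified as follows
$$\tilde{C}_y^{\hat{y}}(g) = \min\left(\tilde{C}_y(g), C_{\hat{y}}(g)\right) \tag{10}$$
and the predicted precision of n-grams becomes
$$\tilde{p}_n = \frac{\sum_{g \in y} \tilde{C}_y^{\hat{y}}(g)}{\sum_{g \in y} \tilde{C}_y(g)} \tag{11}$$
Finally, the probabilistic BLEU (PBLEU) is defined as
$$\text{PBLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log \tilde{p}_n\right) \tag{12}$$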
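The probabilistic precision and PBLEU can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the weights are uniform, the brevity penalty is computed from the sentence lengths in the standard way, and the score is zero when any order has no probabilistic match. Function names and the example probabilities are hypothetical.

```python
import math
from collections import Counter, defaultdict

def probabilistic_precision(words, probs, reference, n):
    """Probabilistic n-gram precision of an output sentence."""
    # Probabilistic counts of the output n-grams.
    out_counts = defaultdict(float)
    for t in range(len(words) - n + 1):
        out_counts[tuple(words[t:t + n])] += math.prod(probs[t:t + n])
    # Ordinary counts of the reference n-grams.
    ref_counts = Counter(
        tuple(reference[t:t + n]) for t in range(len(reference) - n + 1)
    )
    matched = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
    total = sum(out_counts.values())
    return matched / total if total > 0 else 0.0

def pbleu(words, probs, reference, max_n=4):
    """BP * exp(sum_n w_n log p~_n) with uniform weights (an assumption)."""
    if not words:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        p_n = probabilistic_precision(words, probs, reference, n)
        if p_n == 0.0:
            return 0.0
        log_sum += math.log(p_n) / max_n
    bp = min(1.0, math.exp(1.0 - len(reference) / len(words)))
    return bp * math.exp(log_sum)

words = ["the", "cat", "sat"]
probs = [0.9, 0.8, 0.5]
reference = ["the", "cat", "slept"]
print(probabilistic_precision(words, probs, reference, 1))
print(probabilistic_precision(words, probs, reference, 2))
print(pbleu(words, probs, reference, max_n=2))
```

In this example the wrong word "sat" was predicted with probability 0.5, so it adds only 0.5 to the denominator of the unigram precision. The same output scored with all probabilities set to 1 would receive a lower precision of 2/3.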
Probabilistic GLEU (PGLEU) can be defined in a similar way. Specifically, we denote the probabilistic precision of n-grams as PPn. The probabilistic precision is more reasonable than the recall, since the denominator in Eq.(11) plays a normalization role, so we modify the definition in Eq.(8) and define PGLEU as simply the probabilistic precision of 1-4 grams.
The general probabilistic loss function is:
$$\mathcal{L}(\theta) = -\sum_{m=1}^{M} \mathcal{P}(y^m, \hat{y}^m) \tag{13}$$
where $\mathcal{P}$ represents the probabilistic sequence-level objective, and $y^m$ and $\hat{y}^m$ are the predicted translation and the ground truth for the $m$-th sentence respectively. The calculation of the probabilistic objective is illustrated in Figure 1. This probabilistic loss can work with decoding strategies such as greedy search and teacher forcing. In this paper we employ greedy search rather than teacher forcing, so as to use the previously predicted words as context and alleviate the exposure bias problem.
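To make the loss concrete, the sketch below sums the negated probabilistic 2-gram precision over a small batch and checks how the loss reacts when individual word probabilities change. It uses plain Python floats for readability; the batch layout (output words, their probabilities, reference words) and all names are assumptions made for this illustration. In an automatic-differentiation framework, the same sums, products, minima and divisions applied to the model's output probabilities would presumably yield the gradient directly.

```python
import math
from collections import Counter, defaultdict

def probabilistic_precision(words, probs, reference, n=2):
    """Probabilistic n-gram precision of one output sentence."""
    counts = defaultdict(float)
    for t in range(len(words) - n + 1):
        counts[tuple(words[t:t + n])] += math.prod(probs[t:t + n])
    ref_counts = Counter(
        tuple(reference[t:t + n]) for t in range(len(reference) - n + 1)
    )
    total = sum(counts.values())
    matched = sum(min(c, ref_counts[g]) for g, c in counts.items())
    return matched / total if total > 0 else 0.0

def probabilistic_loss(batch, n=2):
    """Negative sum of the probabilistic objective over the batch.

    batch is a list of (output_words, output_probs, reference_words).
    """
    return -sum(probabilistic_precision(w, p, r, n) for w, p, r in batch)

reference = ["the", "cat", "slept"]
base = [(["the", "cat", "sat"], [0.9, 0.8, 0.5], reference)]
# More confidence in the correct word "the" should lower the loss.
better = [(["the", "cat", "sat"], [0.95, 0.8, 0.5], reference)]
# More confidence in the wrong word "sat" should raise it.
worse = [(["the", "cat", "sat"], [0.9, 0.8, 0.6], reference)]
print(probabilistic_loss(base), probabilistic_loss(better), probabilistic_loss(worse))
```

The loss varies continuously with each word probability, which is what allows a direct gradient update without a gradient estimator.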
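Table 1: Translation performance (BLEU) on the validation and test sets.
| System | Dev (MT02) | MT03 | MT04 | MT05 | MT06 | AVG |
|---|---|---|---|---|---|---|
| BaseNMT | 36.72 | 33.95 | 37.44 | 33.96 | 33.09 | 34.61 |
| MRT | 37.17 | 34.89 | 37.90 | 34.62 | 33.78 | 35.30 |
| RF | 37.13 | 34.66 | 37.69 | 34.55 | 33.74 | 35.16 |
| PBLEU | 37.26 | 34.54 | 38.05 | 34.30 | 34.11 | 35.25 |
| PGLEU | 37.44 | 34.67 | 38.11 | 34.24 | 34.58 | 35.40 |
| PP2 | 38.03 | 35.45 | 39.30 | 35.10 | 34.59 | 36.11 |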
4 Experiment
4.1 Settings
We carry out experiments on Chinese-to-English translation (experiment code: https://github.com/ictnlp/GS4NMT). The training data consists of 1.25M sentence pairs extracted from LDC corpora (LDC2002E18, LDC2003E07, LDC2003E14, the Hansards portion of LDC2004T07, LDC2004T08 and LDC2005T06). Sentence pairs with either side longer than 50 were dropped. We use NIST 2002 (MT 02) as the validation set and NIST 2003-2006 (MT 03-06) as the test sets. We use the case-insensitive 4-gram NIST BLEU score Papineni et al. (2002) as the evaluation metric for the translation task.
We apply our method to an attention-based NMT system Bahdanau et al. (2014) implemented in PyTorch. Both source and target vocabularies are limited to 30K. All word embedding sizes are set to 512, and the sizes of hidden units in both encoder and decoder RNNs are also set to 512. All parameters are initialized from a uniform distribution. The model is trained with mini-batch stochastic gradient descent (SGD) with a batch size of 40, and the learning rate is adjusted by the Adadelta optimizer Zeiler (2012). Dropout is applied on the output layer with a dropout rate of 0.5. The beam size is set to 10.
4.2 Performance
Systems. We first pretrain the baseline model by maximum likelihood estimation (MLE) and then refine the model using probabilistic sequence-level objectives, including PBLEU, PGLEU and PP2 (probabilistic 2-gram precision). In addition, we reproduce previous works which train the NMT model through minimum risk training (MRT) Shen et al. (2016) and the REINFORCE algorithm (RF) Ranzato et al. (2015). When reproducing these works, we set BLEU, GLEU and 2-gram precision as the training objective in turn and find that GLEU yields the best performance. In the following, we only report the results obtained with GLEU as the training objective.
Performance. Table 1 shows the translation performance on the test sets measured in BLEU score. Simply training the NMT model with the probabilistic 2-gram precision achieves an average improvement of 1.5 BLEU points over the baseline, which significantly outperforms the reinforcement-based algorithms. We also test the precision of other n-grams and their combinations, but do not observe significant improvements over PP2. Note that our method only changes the loss function, without any modification to the model structure or the training data.
4.3 Why Pretraining
We use the probabilistic loss to finetune the baseline model rather than training from scratch. This is in line with our motivation: to alleviate the exposure bias by exposing the model to its own output during training. At the very beginning of training, the model's translation capability is nearly zero, and the generated sentences are often meaningless and contain no useful information for training, so it is unreasonable to apply the greedy search strategy directly. Therefore, we first apply the teacher forcing algorithm to pretrain the model, and then let the model generate sentences itself and learn from its own outputs.
Another reason favoring pretraining is that it lowers the training cost. The training cost of the probabilistic loss is about three times that of the cross-entropy loss, so training from scratch with it would take much longer than usual. If the probabilistic loss is only used for finetuning, the training cost is acceptable.
4.4 Effect of Decoding Strategy
The probabilistic loss, defined in Eq.(13), is computed from the model output $y$ and the reference $\hat{y}$. In this section, we apply two different decoding strategies to generate $y$: (1) teacher forcing, which uses the ground truth as decoder input, and (2) greedy search, which feeds the word with maximum probability. With this experiment, we attempt to determine where the improvements come from: the modification of the loss or the mitigation of exposure bias.
Figure 2 shows the learning curves of the two decoding strategies with training objective PP2. Teacher forcing yields an improvement of about 0.5 BLEU, and greedy search outperforms teacher forcing by nearly 1 BLEU point. We conclude that the probabilistic loss has its own advantage even when trained with the teacher forcing algorithm, and that greedy search is effective in alleviating the exposure bias.
Note that the greedy search strategy relies heavily on the probabilistic loss and cannot be used independently. Greedy search together with the word-level loss is very similar to scheduled sampling (SS). However, SS is inconsistent with the word-level loss, since the word-level loss requires strict alignment between hypothesis and reference, which can only be guaranteed by the teacher forcing algorithm.
4.5 Correlation with Evaluation Metrics
In this section, we explore how the probabilistic objective correlates with the real evaluation metric. We randomly sample 100 pairs of sentences from the training set and compute their PGLEU and GLEU scores (Wu et al. (2016) indicate that GLEU performs better than BLEU in sentence-level evaluation).
Directly computing the correlation between GLEU and PGLEU gives a correlation coefficient of 0.86, which indicates a strong correlation. In addition, Figure 3 shows a scatter plot of the 100 sentence pairs with GLEU on the x-axis and PGLEU on the y-axis. Figure 3 shows that PGLEU correlates well with GLEU, suggesting that it is reasonable to train the NMT model directly with PGLEU.
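The text does not state which correlation coefficient is used. Assuming it is the Pearson coefficient, the computation can be sketched as follows. The score lists are made-up placeholders for illustration only, not the 100 sampled sentence pairs.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical sentence-level scores, for illustration only.
gleu_scores = [0.1, 0.3, 0.5, 0.7]
pgleu_scores = [0.15, 0.25, 0.55, 0.65]
print(pearson_correlation(gleu_scores, pgleu_scores))
```

A coefficient close to 1 means that sentences ranked highly by one metric are also ranked highly by the other, which is the property needed for the probabilistic objective to serve as a training proxy.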
5 Conclusion
The word-level loss cannot evaluate the translation properly and suffers from exposure bias, while sequence-level objectives are usually non-differentiable and require gradient estimation. We propose probabilistic sequence-level objectives based on n-gram matching, which remove the dependence on gradient estimation and can be used to train the NMT model directly. Experimental results show that our method significantly outperforms previous sequence-level training methods and successfully alleviates the exposure bias by performing greedy search.
6 Acknowledgments
We thank the anonymous reviewers for their insightful comments. This work was supported by the National Natural Science Foundation of China (NSFC) under projects No. 61472428 and No. 61662077.
References
 Bahdanau et al. (2016) Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2016. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086.
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

 Bengio et al. (2015) Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179.
 Chen et al. (2018) Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, et al. 2018. The best of both worlds: Combining recent advances in neural machine translation. arXiv preprint arXiv:1804.09849.
 Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
 Doddington (2002) George Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the second international conference on Human Language Technology Research, pages 138–145. Morgan Kaufmann Publishers Inc.
 Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122.

 Gu et al. (2017) Jiatao Gu, Kyunghyun Cho, and Victor OK Li. 2017. Trainable greedy decoding for neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1968–1978.
 Hassan et al. (2018) Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. 2018. Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567.
 He et al. (2016) Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. 2016. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820–828.
 Kalchbrenner and Blunsom (2013) Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709.
 Kim and Rush (2016) Yoon Kim and Alexander M Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327.
 Lamb et al. (2016) Alex M Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. 2016. Professor forcing: A new algorithm for training recurrent networks. In Advances in Neural Information Processing Systems, pages 4601–4609.
 Lample et al. (2018) Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018. Phrase-based & neural unsupervised machine translation. arXiv preprint arXiv:1804.07755.
 Ma et al. (2018) Shuming Ma, Xu Sun, Yizhong Wang, and Junyang Lin. 2018. Bag-of-words as target for neural machine translation. arXiv preprint arXiv:1805.04871.
 Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics.
 Ranzato et al. (2015) Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.
 Shen et al. (2016) Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum risk training for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1683–1692.
 Snover et al. (2006) Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of association for machine translation in the Americas, volume 200.
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
 Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.
 Venkatraman et al. (2015) Arun Venkatraman, Martial Hebert, and J Andrew Bagnell. 2015. Improving multi-step prediction of learned time series models. In AAAI, pages 3024–3030.
 Williams (1992) Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5–32. Springer.
 Williams and Zipser (1989) Ronald J Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280.
 Wu et al. (2017) Lijun Wu, Yingce Xia, Li Zhao, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. 2017. Adversarial neural machine translation. arXiv preprint arXiv:1704.06933.
 Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
 Yang et al. (2017) Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. 2017. Improving neural machine translation with conditional sequence generative adversarial nets. arXiv preprint arXiv:1703.04887.
 Zeiler (2012) Matthew D Zeiler. 2012. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.