Introduction
Neural Machine Translation (NMT) [Cho et al. 2014; Sutskever, Vinyals, and Le 2014; Bahdanau, Cho, and Bengio 2014] has seen rapid development in the past several years, from catching up with Statistical Machine Translation (SMT) [Koehn, Och, and Marcu 2003; Chiang 2007] to outperforming it by significant margins on many languages [Sennrich, Haddow, and Birch 2016b; Wu et al. 2016; Tu et al. 2016; Eriguchi, Hashimoto, and Tsuruoka 2016; Wang et al. 2017a; Vaswani et al. 2017]. In a conventional NMT model, an encoder first transforms the source sequence into a sequence of intermediate hidden vector representations, based on which a decoder generates the target sequence word by word.
Due to their autoregressive structure, current NMT systems usually suffer from the so-called exposure bias problem [Bengio et al. 2015]: during inference, true previous target tokens are unavailable and are replaced by tokens generated by the model itself, so mistakes made early can mislead subsequent translation, yielding unsatisfactory translations with good prefixes but bad suffixes (shown in Table 1). This issue becomes more severe as sequence length increases.
Table 1: An example with the input sentence, the reference, and the translations from the L2R and R2L models, illustrating a good prefix with a bad suffix (L2R) and a bad prefix with a good suffix (R2L).
To address this problem, one line of research attempts to reduce the inconsistency between training and inference so as to improve robustness when the model is given incorrect previous predictions, for example by designing sequence-level objectives or adopting reinforcement learning approaches [Ranzato et al. 2015; Shen et al. 2016; Wiseman and Rush 2016]. Another line tries to leverage a complementary NMT model that generates target words from right to left (R2L) to distinguish unsatisfactory translation results in an n-best list generated by the L2R model [Liu et al. 2016; Wang et al. 2017b]. In their work, the R2L NMT model is only used to rerank the translation candidates generated by the L2R model, while the candidates in the n-best list still suffer from the exposure bias problem, which limits the room for improvement. Another problem is that the complementary R2L model tends to generate translation results with good suffixes and bad prefixes, due to the same exposure bias problem, as shown in Table 1. Just as the R2L model can augment the L2R model, the L2R model can also be leveraged to improve the R2L model.
Instead of reranking the n-best list, we incorporate the agreement between the L2R and R2L models into both of their training objectives, hoping that the agreement information can help learn better models that integrate their advantages to generate translations with good prefixes and good suffixes. To this end, we introduce two Kullback-Leibler (KL) divergences between the probability distributions defined by the L2R and R2L models into the NMT training objective as regularization terms. Thus, we not only maximize the likelihood of the training data but also minimize the divergence between the L2R and R2L models, where the latter serves as a measure of the exposure bias of the currently evaluated model. With this method, the L2R model can be enhanced using the R2L model as a helper system, and vice versa. We integrate the optimization of the R2L and L2R models into a joint training framework, in which they act as helper systems for each other, and both models achieve further improvements through an interactive update process.
Our experiments are conducted on Chinese-English and English-German translation tasks, and demonstrate that our proposed method significantly outperforms state-of-the-art baselines.
Neural Machine Translation
Neural Machine Translation (NMT) is an end-to-end framework that directly models the conditional probability $P(\mathbf{y}|\mathbf{x})$ of the target translation $\mathbf{y}$ given the source sentence $\mathbf{x}$. In practice, NMT systems are usually implemented with an attention-based encoder-decoder architecture. The encoder reads the source sentence $\mathbf{x}$ and transforms it into a sequence of intermediate hidden vectors using a neural network. Given the hidden states, the decoder generates the target translation $\mathbf{y}$ with another neural network that jointly learns language and alignment models. Early NMT architectures employed recurrent neural networks (RNNs) or their variants, such as Gated Recurrent Unit [Cho et al. 2014] and Long Short-Term Memory [Hochreiter and Schmidhuber 1997] networks. Recently, two additional architectures have been proposed, improving not only parallelization but also the state of the art: the fully convolutional model [Gehring et al. 2017] and the self-attentional Transformer [Vaswani et al. 2017].

For model training, given a parallel corpus $D$, the standard training objective in NMT is to maximize the likelihood of the training data:
$$\mathcal{L}(\theta) = \sum_{(\mathbf{x},\mathbf{y}) \in D} \log P(\mathbf{y}|\mathbf{x};\theta) \qquad (1)$$

where $P(\mathbf{y}|\mathbf{x};\theta)$ is the neural translation model and $\theta$ is the model parameter.
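To make the objective concrete, here is a minimal sketch of Equation 1 for a toy autoregressive model; the conditional probabilities live in a hypothetical lookup table (`COND_PROB`, invented for illustration) and the source-sentence conditioning is omitted for brevity:

```python
import math

# Toy L2R "model": P(token | prefix) stored as a lookup table.
# This is an illustrative stand-in for a neural decoder, not the paper's model.
COND_PROB = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.7, "dog": 0.3},
    ("the", "cat"): {"<eos>": 1.0},
}

def log_likelihood(sentence):
    """Sum of log P(y_t | y_<t), i.e., the inner term of the MLE objective."""
    total = 0.0
    prefix = ()
    for tok in sentence:
        total += math.log(COND_PROB[prefix][tok])
        prefix = prefix + (tok,)
    return total

def mle_objective(corpus):
    """L(theta) = sum over training sentences of log P(y; theta)."""
    return sum(log_likelihood(y) for y in corpus)

print(mle_objective([("the", "cat", "<eos>")]))  # log(0.6) + log(0.7) + log(1.0)
```

In a real system the table lookup is replaced by a decoder network conditioned on the encoder states, but the objective keeps exactly this shape.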
One big problem with this training objective is that the history of any target word during training is always correct, as it is observed in the training data; during inference, however, all target words are predicted by the model and may contain mistakes, which are fed back as inputs so that errors quickly accumulate along the generated sequence. This is called the exposure bias problem [Bengio et al. 2015].
Our Approach
To deal with the exposure bias problem, we try to maximize the agreement between translations from the L2R and R2L NMT models, and divide the NMT training objective into two parts: the standard maximum likelihood of the training data, and regularization terms that measure the divergence between the L2R and R2L models under the current model parameters. In this section, we start with basic model notations, followed by discussions of the model regularization terms and efficient gradient approximation methods. In the last part, we show that the L2R and R2L NMT models can be jointly improved to achieve even better results.
Notations
Given a source sentence $\mathbf{x}$ and its target translation $\mathbf{y} = \{y_1, y_2, \ldots, y_n\}$, let $P(\mathbf{y}|\mathbf{x};\overrightarrow{\theta})$ and $P(\mathbf{y}|\mathbf{x};\overleftarrow{\theta})$ be the L2R and R2L translation models, in which $\overrightarrow{\theta}$ and $\overleftarrow{\theta}$ are the corresponding model parameters. Specifically, the L2R translation model can be decomposed as $P(\mathbf{y}|\mathbf{x};\overrightarrow{\theta}) = \prod_{i=1}^{n} P(y_i|\mathbf{y}_{<i},\mathbf{x};\overrightarrow{\theta})$, which means the L2R model adopts the previous targets $\mathbf{y}_{<i}$ as history to predict the current target $y_i$ at each step $i$, while the R2L translation model can similarly be decomposed as $P(\mathbf{y}|\mathbf{x};\overleftarrow{\theta}) = \prod_{i=1}^{n} P(y_i|\mathbf{y}_{>i},\mathbf{x};\overleftarrow{\theta})$ and employs the later targets $\mathbf{y}_{>i}$ as history to predict the current target $y_i$ at each step $i$.
NMT Model Regularization
Since L2R and R2L models are different chain decompositions of the same translation probability, output probabilities of the two models should be identical:
$$P(\mathbf{y}|\mathbf{x};\overrightarrow{\theta}) = P(\mathbf{y}|\mathbf{x};\overleftarrow{\theta}) \qquad (2)$$
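This identity can be checked numerically: for any joint distribution over target sequences, deriving each conditional by marginalization and multiplying the conditionals back together, in either order, recovers the same probability. A small sketch with a toy distribution (the `JOINT` table is invented for the demonstration):

```python
# Toy joint distribution over length-3 target sequences (source fixed).
JOINT = {
    ("a", "b", "c"): 0.4,
    ("a", "c", "b"): 0.1,
    ("b", "a", "c"): 0.3,
    ("b", "c", "a"): 0.2,
}

def chain_prob(seq, reverse=False):
    """Compute P(seq) via the chain rule, conditioning left-to-right or
    right-to-left, with every conditional derived from the joint by
    marginalization."""
    order = range(len(seq)) if not reverse else range(len(seq) - 1, -1, -1)
    prob = 1.0
    seen = []  # positions already conditioned on
    for i in order:
        # P(y_i | fixed positions) = mass matching all / mass matching context
        num = sum(p for s, p in JOINT.items()
                  if s[i] == seq[i] and all(s[j] == seq[j] for j in seen))
        den = sum(p for s, p in JOINT.items()
                  if all(s[j] == seq[j] for j in seen))
        prob *= num / den
        seen.append(i)
    return prob

for seq in JOINT:
    l2r, r2l = chain_prob(seq), chain_prob(seq, reverse=True)
    assert abs(l2r - r2l) < 1e-12 and abs(l2r - JOINT[seq]) < 1e-12
print("L2R and R2L chain products both recover the joint")
```

Two separately trained neural models only approximate these exact conditionals, which is precisely why Equation 2 can fail to hold in practice.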
However, if these two models are optimized separately by maximum likelihood estimation (MLE), there is no guarantee that the above equation will hold. To satisfy this constraint, we introduce two Kullback-Leibler (KL) divergence regularization terms into the MLE training objective (Equation 1). For the L2R model, the new training objective is:

$$\mathcal{L}(\overrightarrow{\theta}) = \sum_{(\mathbf{x},\mathbf{y}) \in D} \Big\{ \log P(\mathbf{y}|\mathbf{x};\overrightarrow{\theta}) - \lambda \, \mathrm{KL}\big(P(\mathbf{y}|\mathbf{x};\overleftarrow{\theta}) \,\|\, P(\mathbf{y}|\mathbf{x};\overrightarrow{\theta})\big) - \lambda \, \mathrm{KL}\big(P(\mathbf{y}|\mathbf{x};\overrightarrow{\theta}) \,\|\, P(\mathbf{y}|\mathbf{x};\overleftarrow{\theta})\big) \Big\} \qquad (3)$$

where $\lambda$ is a hyperparameter for the regularization terms. The regularization terms are 0 when Equation 2 holds; otherwise they guide the training process to reduce the disagreement between the L2R and R2L models.
Unfortunately, it is intractable to calculate the full gradients of this objective function, since the KL divergences require summing over all translation candidates in an exponential search space. To alleviate this problem, we follow Shen et al. (2016) to approximate the full search space with a sampled subspace, and then design an efficient KL divergence approximation algorithm. Specifically, we derive the gradient calculation from the definition of KL divergence, and then design proper sampling methods for the two different KL divergence regularization terms.
For $\mathrm{KL}\big(P(\mathbf{y}|\mathbf{x};\overleftarrow{\theta}) \,\|\, P(\mathbf{y}|\mathbf{x};\overrightarrow{\theta})\big)$, according to the definition of KL divergence, we have

$$\mathrm{KL}\big(P(\mathbf{y}|\mathbf{x};\overleftarrow{\theta}) \,\|\, P(\mathbf{y}|\mathbf{x};\overrightarrow{\theta})\big) = \sum_{\mathbf{y}' \in \mathcal{Y}(\mathbf{x})} P(\mathbf{y}'|\mathbf{x};\overleftarrow{\theta}) \log \frac{P(\mathbf{y}'|\mathbf{x};\overleftarrow{\theta})}{P(\mathbf{y}'|\mathbf{x};\overrightarrow{\theta})} \qquad (4)$$

where $\mathcal{Y}(\mathbf{x})$ is the set of all possible candidate translations for the source sentence $\mathbf{x}$. Since $P(\mathbf{y}'|\mathbf{x};\overleftarrow{\theta})$ is irrelevant to the parameter $\overrightarrow{\theta}$, the partial derivative of this KL divergence with respect to $\overrightarrow{\theta}$ can be written as

$$\frac{\partial \, \mathrm{KL}}{\partial \overrightarrow{\theta}} = - \sum_{\mathbf{y}' \in \mathcal{Y}(\mathbf{x})} P(\mathbf{y}'|\mathbf{x};\overleftarrow{\theta}) \, \frac{\partial \log P(\mathbf{y}'|\mathbf{x};\overrightarrow{\theta})}{\partial \overrightarrow{\theta}} = - \mathbb{E}_{\mathbf{y}' \sim P(\mathbf{y}'|\mathbf{x};\overleftarrow{\theta})} \left[ \frac{\partial \log P(\mathbf{y}'|\mathbf{x};\overrightarrow{\theta})}{\partial \overrightarrow{\theta}} \right] \qquad (5)$$

in which $\partial \log P(\mathbf{y}'|\mathbf{x};\overrightarrow{\theta}) / \partial \overrightarrow{\theta}$ are the gradients specified by a standard sequence-to-sequence NMT network. The expectation can be approximated with samples from the R2L model $P(\mathbf{y}|\mathbf{x};\overleftarrow{\theta})$. Therefore, minimizing this regularization term is equivalent to maximizing the log-likelihood of the pseudo sentence pairs sampled from the R2L model.
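The sampled-subspace approximation can be illustrated on a tiny candidate space: the exact KL in Equation 4 sums over all candidates, while the Monte-Carlo version averages log-ratios over samples drawn from the R2L model. The two distributions below are invented stand-ins for the two models:

```python
import math
import random

random.seed(0)

# Toy distributions over a three-candidate "translation space", standing in
# for the R2L model (sampling distribution) and the L2R model (illustrative
# numbers, not real models).
q = {"y1": 0.5, "y2": 0.3, "y3": 0.2}   # R2L model
p = {"y1": 0.4, "y2": 0.4, "y3": 0.2}   # L2R model being regularized

def kl_exact(q, p):
    """Full sum over the candidate space, as in Equation 4."""
    return sum(q[y] * math.log(q[y] / p[y]) for y in q)

def kl_sampled(q, p, n=20000):
    """Monte-Carlo estimate: KL(q||p) = E_{y~q}[log q(y) - log p(y)],
    mirroring the sampled-subspace approximation."""
    ys = random.choices(list(q), weights=list(q.values()), k=n)
    return sum(math.log(q[y]) - math.log(p[y]) for y in ys) / n

exact = kl_exact(q, p)
approx = kl_sampled(q, p)
print(exact, approx)
```

For real NMT models the candidate space is exponential, so only the sampled form is computable; differentiating its log-probability terms yields the pseudo-pair training interpretation described above.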
For $\mathrm{KL}\big(P(\mathbf{y}|\mathbf{x};\overrightarrow{\theta}) \,\|\, P(\mathbf{y}|\mathbf{x};\overleftarrow{\theta})\big)$, we similarly have

$$\mathrm{KL}\big(P(\mathbf{y}|\mathbf{x};\overrightarrow{\theta}) \,\|\, P(\mathbf{y}|\mathbf{x};\overleftarrow{\theta})\big) = \sum_{\mathbf{y}' \in \mathcal{Y}(\mathbf{x})} P(\mathbf{y}'|\mathbf{x};\overrightarrow{\theta}) \log \frac{P(\mathbf{y}'|\mathbf{x};\overrightarrow{\theta})}{P(\mathbf{y}'|\mathbf{x};\overleftarrow{\theta})} \qquad (6)$$

The partial derivative of this KL divergence with respect to $\overrightarrow{\theta}$ is calculated as follows:

$$\frac{\partial \, \mathrm{KL}}{\partial \overrightarrow{\theta}} = \mathbb{E}_{\mathbf{y}' \sim P(\mathbf{y}'|\mathbf{x};\overrightarrow{\theta})} \left[ \left( 1 + \log \frac{P(\mathbf{y}'|\mathbf{x};\overrightarrow{\theta})}{P(\mathbf{y}'|\mathbf{x};\overleftarrow{\theta})} \right) \frac{\partial \log P(\mathbf{y}'|\mathbf{x};\overrightarrow{\theta})}{\partial \overrightarrow{\theta}} \right] \qquad (7)$$

Similarly, we use sampling to calculate this expectation. There are two differences in Equation 7 compared with Equation 5: 1) pseudo sentence pairs are sampled not from the R2L model $P(\mathbf{y}|\mathbf{x};\overleftarrow{\theta})$, but from the L2R model $P(\mathbf{y}|\mathbf{x};\overrightarrow{\theta})$ itself; 2) the term $1 + \log \big( P(\mathbf{y}'|\mathbf{x};\overrightarrow{\theta}) / P(\mathbf{y}'|\mathbf{x};\overleftarrow{\theta}) \big)$ is used as a weight to penalize incorrect pseudo pairs.
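The weighted form of this gradient can be sanity-checked on a two-outcome toy distribution by comparing it against a finite-difference derivative of the KL term; the parameterization (a sigmoid over one scalar) and the numbers are purely illustrative, not the paper's model:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

q = {"A": 0.7, "B": 0.3}  # fixed "R2L" distribution (toy numbers)

def p(theta):
    """Toy "L2R" model: a sigmoid-parameterized two-outcome distribution."""
    s = sigmoid(theta)
    return {"A": s, "B": 1.0 - s}

def kl(theta):
    pt = p(theta)
    return sum(pt[y] * math.log(pt[y] / q[y]) for y in pt)

def grad_formula(theta):
    """Gradient via the weighted expectation form of Equation 7:
    sum_y p(y) * (1 + log p(y)/q(y)) * d log p(y) / d theta."""
    s = sigmoid(theta)
    pt = p(theta)
    dlog = {"A": 1.0 - s, "B": -s}  # d log p(y) / d theta per outcome
    return sum(pt[y] * (1.0 + math.log(pt[y] / q[y])) * dlog[y] for y in pt)

# Finite-difference check that the weighted form matches the true gradient.
theta, eps = 0.3, 1e-6
fd = (kl(theta + eps) - kl(theta - eps)) / (2 * eps)
print(grad_formula(theta), fd)
```

In training, the analogous expectation is estimated with sampled translations, so each sampled pair's gradient is simply scaled by its log-ratio weight.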
To sum up, the partial derivative of the objective function with respect to $\overrightarrow{\theta}$ can be approximately written as follows:

$$\frac{\partial \mathcal{L}(\overrightarrow{\theta})}{\partial \overrightarrow{\theta}} \approx \sum_{(\mathbf{x},\mathbf{y}) \in D} \left\{ \frac{\partial \log P(\mathbf{y}|\mathbf{x};\overrightarrow{\theta})}{\partial \overrightarrow{\theta}} + \lambda \, \mathbb{E}_{\mathbf{y}' \sim P(\mathbf{y}'|\mathbf{x};\overleftarrow{\theta})} \left[ \frac{\partial \log P(\mathbf{y}'|\mathbf{x};\overrightarrow{\theta})}{\partial \overrightarrow{\theta}} \right] - \lambda \, \mathbb{E}_{\mathbf{y}' \sim P(\mathbf{y}'|\mathbf{x};\overrightarrow{\theta})} \left[ \left( 1 + \log \frac{P(\mathbf{y}'|\mathbf{x};\overrightarrow{\theta})}{P(\mathbf{y}'|\mathbf{x};\overleftarrow{\theta})} \right) \frac{\partial \log P(\mathbf{y}'|\mathbf{x};\overrightarrow{\theta})}{\partial \overrightarrow{\theta}} \right] \right\} \qquad (8)$$
The overall training procedure is shown in Algorithm 1.
Joint Training for Paired NMT Models
In practice, due to the imperfection of the R2L model, the agreement between the L2R and R2L models may sometimes mislead L2R model training. On the other hand, due to the symmetry of the L2R and R2L models, the L2R model can also serve as a discriminator to punish bad translation candidates generated by the R2L model. Similarly, the objective function of the R2L model can be defined as follows:

$$\mathcal{L}(\overleftarrow{\theta}) = \sum_{(\mathbf{x},\mathbf{y}) \in D} \Big\{ \log P(\mathbf{y}|\mathbf{x};\overleftarrow{\theta}) - \lambda \, \mathrm{KL}\big(P(\mathbf{y}|\mathbf{x};\overrightarrow{\theta}) \,\|\, P(\mathbf{y}|\mathbf{x};\overleftarrow{\theta})\big) - \lambda \, \mathrm{KL}\big(P(\mathbf{y}|\mathbf{x};\overleftarrow{\theta}) \,\|\, P(\mathbf{y}|\mathbf{x};\overrightarrow{\theta})\big) \Big\} \qquad (9)$$
The corresponding training procedure is similar to Algorithm 1.
Based on the above, the L2R and R2L models can act as helper systems for each other in a joint training process: the L2R model is used as an auxiliary system to regularize the R2L model, and the R2L model is used as an auxiliary system to regularize the L2R model. This training process can be carried out iteratively to obtain further improvements, because after each iteration both the L2R and R2L models are expected to improve under the regularization.
To simultaneously optimize these two models, we design a training algorithm whose overall objective is the sum of the objectives in both directions:

$$\mathcal{L}(\overrightarrow{\theta}, \overleftarrow{\theta}) = \mathcal{L}(\overrightarrow{\theta}) + \mathcal{L}(\overleftarrow{\theta}) \qquad (10)$$
As illustrated in Figure 1, the whole training process contains two major steps: pre-training and joint training. First, given the parallel corpus $D$, we pre-train both the L2R and R2L models with the MLE principle. Next, based on the pre-trained models, we jointly optimize the L2R and R2L models with an iterative process. In each iteration, we fix the R2L model and use it as a helper to optimize the L2R model with Equation 3, and at the same time, we fix the L2R model and use it as a helper to optimize the R2L model with Equation 9. The iterative training continues until performance on the development set no longer increases.
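This pre-train-then-iterate procedure can be sketched structurally. The model objects, the update rule, and the dev-BLEU proxy below are toy stand-ins (not the paper's implementation), meant only to show the control flow of fixing each model as the other's helper and stopping when the validation score plateaus:

```python
# Structural sketch of the joint training loop (toy quantities throughout).

def pretrain(name):
    """Stand-in for MLE pre-training; returns a toy 'model'."""
    return {"name": name, "quality": 1.0}

def regularized_update(model, helper):
    """Stand-in for one round of regularized training (Eq. 3 / Eq. 9):
    the fixed helper model supplies pseudo pairs and KL weights. Here the
    gain is a saturating toy function of both models' quality."""
    model = dict(model)
    model["quality"] += 0.5 * helper["quality"] * (2.0 - model["quality"])
    return model

def dev_bleu(model):
    """Toy proxy for validation BLEU."""
    return round(model["quality"], 2)

l2r, r2l = pretrain("L2R"), pretrain("R2L")
best = (dev_bleu(l2r), dev_bleu(r2l))
for it in range(10):
    # Fix each model and use it as the helper for the other direction.
    l2r_new = regularized_update(l2r, r2l)
    r2l_new = regularized_update(r2l, l2r)
    l2r, r2l = l2r_new, r2l_new
    score = (dev_bleu(l2r), dev_bleu(r2l))
    if score <= best:  # stop once the dev scores no longer improve
        break
    best = score
print("iterations:", it + 1, "final dev scores:", best)
```

Note that both updates in an iteration read the previous iteration's models, matching the "fix one, optimize the other, at the same time" schedule described above.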
Experiments
Setup
To examine the effectiveness of our proposed approach, we conduct experiments on three datasets: NIST OpenMT for Chinese-English, and WMT17 for English-German and Chinese-English. In all experiments, we use BLEU [Papineni et al. 2002] as the automatic metric for translation evaluation.
Datasets.
For the NIST OpenMT Chinese-English translation task, we select our training data from LDC corpora (LDC2002E17, LDC2002E18, LDC2003E07, LDC2003E14, LDC2005E83, LDC2005T06, LDC2005T10, LDC2006E17, LDC2006E26, LDC2006E34, LDC2006E85, LDC2006E92, LDC2006T06, LDC2004T08), which consist of 2.6M sentence pairs with 65.1M Chinese words and 67.1M English words respectively. Any sentence longer than 80 words is removed from the training data. The NIST OpenMT 2006 evaluation set is used as the validation set, and the NIST 2003, 2005, 2008, and 2012 datasets as test sets. We limit the vocabulary to the 50K most frequent words on both the source and target sides, and convert the remaining words into <unk> tokens. During decoding, we follow Luong et al. (2015) to handle <unk> replacement.
For the WMT17 English-German translation task, we use the preprocessed training data provided by the task organizers (http://data.statmt.org/wmt17/translation-task/preprocessed/de-en/). The training data consists of 5.8M sentence pairs with 141M English words and 134M German words respectively. We use newstest2016 as the validation set and newstest2017 as the test set. The maximal sentence length is set to 128. For the vocabulary, we use 37K subword tokens based on Byte Pair Encoding (BPE) [Sennrich, Haddow, and Birch 2016b].
Table 2: Case-insensitive BLEU scores (%) on the NIST Chinese-English datasets.

System          | NIST2006 | NIST2003 | NIST2005 | NIST2008 | NIST2012 | Average
Transformer     | 44.33    | 45.69    | 43.94    | 34.80    | 32.63    | 40.28
Transformer+MRT | 45.21    | 46.60    | 45.11    | 36.77    | 34.78    | 41.69
Transformer+JS  | 45.04    | 46.32    | 44.58    | 36.81    | 35.02    | 41.51
Transformer+RT  | 46.14    | 48.28    | 46.24    | 38.07    | 36.31    | 43.01
For the WMT17 Chinese-English translation task, we use all the available parallel data, which consists of 24M sentence pairs, including the News Commentary, UN Parallel, and CWMT corpora (http://www.statmt.org/wmt17/translation-task.html). Newsdev2017 is used as the validation set and newstest2017 as the test set. We also limit the maximal sentence length to 128. For data preprocessing, we segment Chinese sentences with our in-house Chinese word segmentation tool and tokenize English sentences with the script provided in Moses (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl). Then we learn a BPE model on the preprocessed sentences with 32K merge operations, resulting in 44K and 33K subword tokens as the source and target vocabularies respectively.
Experimental Details.
The Transformer model [Vaswani et al. 2017] is adopted as our baseline. For all translation tasks, we follow the transformer_base_v2 hyperparameter setting (https://github.com/tensorflow/tensor2tensor/blob/v1.3.0/tensor2tensor/models/transformer.py), which corresponds to a 6-layer Transformer with a model size of 512. The parameters are initialized with a zero-mean normal distribution whose variance depends on the number of rows and columns of each weight matrix [Glorot and Bengio 2010]. All models are trained on 4 Tesla M40 GPUs for a total of 100K steps using the Adam algorithm [Kingma and Ba 2014]. The initial learning rate is set to 0.2 and decayed according to the schedule in Vaswani et al. (2017). During training, the batch size is set to approximately 4096 words per batch and checkpoints are created every 60 minutes. At test time, we use a beam of 8 and a length penalty of 1.0.

To build the synthetic data in Algorithm 1, we adopt beam search to generate translation candidates with beam size 4, and the best sample is used for the estimation of the KL divergence. In practice, to speed up decoding, we sort all source sentences by length, and 32 sentences are translated simultaneously with a parallel decoding implementation. For the remaining hyperparameters of our approach, we try different settings and choose the values that achieve the best BLEU on the validation set; larger settings bring no further improvement but increase training time due to more pseudo sentence pairs. In addition, we use sentence-level BLEU to filter out wrong translations whose BLEU score is not greater than 30%. Note that the R2L model obtains results comparable to the L2R model, so only the results of the L2R model are reported in our experiments.
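The sentence-level BLEU filter mentioned above can be sketched as follows. This is a simple smoothed reimplementation for illustration (the paper does not specify its exact BLEU smoothing variant), applied to hypothetical (source, sampled translation, reference) triples:

```python
import math
from collections import Counter

def sentence_bleu(hyp, ref, max_n=4):
    """Sentence-level BLEU with add-one smoothing on the n-gram precisions
    and the standard brevity penalty. An illustrative reimplementation."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        match = sum(min(c, r[g]) for g, c in h.items())
        total = sum(h.values())
        # add-one smoothing keeps the score non-zero for short hypotheses
        log_prec += math.log((match + 1) / (total + 1))
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))  # brevity penalty
    return bp * math.exp(log_prec / max_n)

def filter_pseudo_pairs(pairs, threshold=0.30):
    """Drop sampled translations whose sentence BLEU against the reference
    is not greater than the threshold, as in the filtering step above."""
    return [(src, hyp) for src, hyp, ref in pairs
            if sentence_bleu(hyp, ref) > threshold]

pairs = [("src1", ["the", "cat", "sat"], ["the", "cat", "sat"]),
         ("src2", ["dog"], ["the", "cat", "sat"])]
print(filter_pseudo_pairs(pairs))  # → [('src1', ['the', 'cat', 'sat'])]
```

In production one would use an established implementation (e.g. the Moses or sacreBLEU scorers); the point here is only the shape of the filtering step.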
Evaluation on NIST Corpora
Table 2 shows the evaluation results of different models on the NIST datasets. MRT denotes Shen et al. (2016)'s method, RT denotes our regularization approach, and JS denotes Liu et al. (2016)'s method, which modifies the inference strategy by reranking the n-best results with the joint probability of the bidirectional models. All results are reported as case-insensitive BLEU computed with the Moses multi-bleu.perl script.
We observe that by taking agreement information into consideration, Transformer+JS and Transformer+RT both bring improvements across the different test sets, and our approach achieves a 2.73 BLEU point improvement over Transformer on average. These results confirm that introducing agreement between the L2R and R2L models helps handle the exposure bias problem and improves translation quality.
Besides, we see that Transformer+RT performs better than Transformer+JS across the different test sets, with a 1.5 BLEU point improvement on average. Since Transformer+JS only leverages the agreement restriction at the inference stage, the L2R and R2L models still suffer from the exposure bias problem and generate bad translation candidates, which limits the room for improvement in the reranking process. Instead of combining with the R2L model during inference, our approach utilizes the intrinsic probabilistic connection between the L2R and R2L models to guide the learning process. The two NMT models are expected to adjust in disagreement cases, thereby mitigating their exposure bias problem.
Table 3: Case-sensitive BLEU scores (%) on the WMT17 English-German and Chinese-English tasks.

                                  |     English-German          |     Chinese-English
System                            | newstest2016 | newstest2017 | newsdev2017 | newstest2017
Transformer                       | 32.58        | 25.48        | 20.87       | 23.01
Transformer+MRT                   | 33.27        | 25.87        | 21.66       | 24.24
Transformer+JS                    | 32.91        | 25.93        | 21.25       | 23.59
Transformer+RT                    | 34.56        | 27.18        | 22.50       | 25.38
Transformer-big                   | 33.58        | 27.13        | 21.91       | 24.03
Transformer-big+BT                | 35.06        | 28.34        | 23.59       | 25.53
Transformer-big+BT+RT             | 36.78        | 29.46        | 24.84       | 27.21
Edinburgh's NMT System (ensemble) | 36.20        | 28.30        | 24.00       | 25.70
Sogou's NMT System (ensemble)     | --           | --           | 22.90       | 26.40
Longer source sentences imply longer translations, which more easily suffer from the exposure bias problem. To further verify our approach, we group source sentences of similar length together and calculate the BLEU score for each group. As shown in Figure 2, our method achieves the best performance in all groups. The gap between our method and the other three methods is small when the length is below 10 tokens, and becomes bigger as the sentences grow longer. This further confirms the effectiveness of our proposed method in dealing with the exposure bias problem.
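The per-length analysis above can be reproduced with a simple bucketing helper; the bucket width of 10 tokens is an assumption (the exact bins used for Figure 2 are not stated here), and the triples are hypothetical:

```python
from collections import defaultdict

def bucket_by_source_length(examples, width=10):
    """Group (source, hypothesis, reference) triples by source length into
    buckets [0, width), [width, 2*width), ... so that BLEU can then be
    computed per bucket. The width is an assumed value for illustration."""
    buckets = defaultdict(list)
    for src, hyp, ref in examples:
        buckets[len(src) // width * width].append((hyp, ref))
    return dict(buckets)

ex = [(["w"] * 5, ["h"], ["r"]),
      (["w"] * 12, ["h"], ["r"]),
      (["w"] * 15, ["h"], ["r"])]
groups = bucket_by_source_length(ex)
print(sorted((k, len(v)) for k, v in groups.items()))  # → [(0, 1), (10, 2)]
```

Each bucket's (hypothesis, reference) pairs are then scored with corpus BLEU to produce one point per length group.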
Evaluation on WMT17 Corpora
For the WMT17 corpora, we verify the effectiveness of our approach on the English-German and Chinese-English translation tasks from two angles: 1) we compare our approach with baseline systems when only parallel corpora are used; 2) we investigate the impact of combining the back-translation technique [Sennrich, Haddow, and Birch 2016a] with our approach. Since the back-translation method brings in more synthetic data, we choose the Transformer-big setting defined in Vaswani et al. (2017) for this case. Experimental results are shown in Table 3, in which BT denotes the back-translation method. To be comparable with the NMT systems reported in WMT17, all results are reported as case-sensitive BLEU computed with the official tool SacreBLEU (https://github.com/awslabs/sockeye/tree/master/contrib/sacrebleu).
Only Parallel Corpora.
As shown in Table 3, Transformer+JS and Transformer+RT both yield improvements on the English-German and Chinese-English test sets, confirming the effectiveness of leveraging agreement between the L2R and R2L models. Additionally, our approach significantly outperforms Transformer+JS, yielding the best BLEU scores on the English-German and Chinese-English test sets when merely using bilingual data. These results further prove the effectiveness of our method.
Combining with the Back-Translation Method.
To verify the effect of our approach when monolingual data is available, we combine it with the back-translation method. We first randomly select 5M German sentences and 12M English sentences from "News Crawl: articles from 2016". Then German-English and English-Chinese NMT systems trained on the parallel corpora are used to translate the monolingual target sentences.
From Table 3, we find that Transformer-big performs better than Transformer due to having more model parameters. When the back-translation method is employed, Transformer-big+BT achieves 1.21 and 1.5 BLEU point improvements over Transformer-big on the English-German and Chinese-English test sets respectively. Our method gains further remarkable improvements on top of back-translation, yielding the best results on all translation tasks. These results show that the NMT model still benefits from our proposed approach in semi-supervised learning scenarios. In addition, our single model Transformer-big+BT+RT even achieves the best performance on the WMT17 English-German and Chinese-English translation tasks among all reported results, including ensemble systems.
Effect of Joint Training
Table 4: BLEU scores on the WMT validation sets after each training iteration.

Iteration   | English-German | Chinese-English
Iteration 0 | 32.58          | 20.87
Iteration 1 | 33.86          | 21.92
Iteration 2 | 34.56          | 22.50
Iteration 3 | 34.58          | 22.47
We further investigate the impact of our joint training algorithm over the whole training process. Table 4 shows the BLEU scores on the WMT validation sets at each iteration. In each iteration, we train the NMT models until performance on the development set no longer increases. We find that more iterations consistently lead to better evaluation results, and 2 iterations are enough to reach convergence in our experiments; further iterations bring no noticeable improvement in translation accuracy but more training time. As for training cost, since our method builds on pre-trained models, the entire training time is almost twice that of the original MLE training.
Table 5: A Chinese-English translation example from newstest2017, showing the source, the reference, and the outputs of Transformer, Transformer (R2L), and Transformer+RT.
Example
In this section, we give a case study to analyze our method. Table 5 provides a Chinese-English translation example from newstest2017. We find that Transformer produces a translation with a good prefix but a bad suffix, while Transformer (R2L) generates a translation with a desirable suffix but an incorrect prefix. In contrast, Transformer+RT produces a high-quality translation, much better than both Transformer and Transformer (R2L). The reason is that leveraging the agreement between the L2R and R2L models at the training stage better penalizes the bad suffixes generated by Transformer and encourages the desirable suffixes from Transformer (R2L).
Related Work
Target-bidirectional transduction techniques have been explored in statistical machine translation, under the IBM framework [Watanabe and Sumita 2002] and feature-driven linear models [Finch and Sumita 2009; Zhang et al. 2013]. Recently, Liu et al. (2016) and Zhang et al. (2018) migrated this method from SMT to NMT by modifying the inference strategy and the decoder architecture of NMT. Liu et al. (2016) propose to generate n-best translation candidates from the L2R and R2L NMT models and use the joint probability of the two models to find the best candidate from the combined n-best list. Zhang et al. (2018) design a two-stage decoder architecture for NMT, which generates translation candidates in a right-to-left manner in the first stage and then produces the final translation based on the source sentence and the previously generated R2L translation. Different from their methods, our approach directly exploits the target-bidirectional agreement at the training stage by introducing regularization terms. Without changing the neural network architecture or the inference strategy, our method keeps the same inference speed as the original model.
To handle the exposure bias problem, many methods have been proposed, including designing new training objectives [Shen et al. 2016; Wiseman and Rush 2016] and adopting reinforcement learning approaches [Ranzato et al. 2015; Bahdanau et al. 2016]. Shen et al. (2016) attempt to directly minimize expected loss (i.e., maximize the expected BLEU) with Minimum Risk Training (MRT). Wiseman and Rush (2016) adopt a beam-search optimization algorithm to reduce the inconsistency between training and inference. Besides, Ranzato et al. (2015) propose a mixed training method that performs a gradual transition from MLE training to BLEU score optimization using reinforcement learning. Bahdanau et al. (2016) design an actor-critic algorithm for sequence prediction, in which the NMT system is the actor and a critic network predicts the value of output tokens. Instead of designing task-specific objective functions or complex training strategies, our approach only adds regularization terms to the standard training objective, which is simple to implement yet effective.
Conclusion
In this paper, we have presented a simple and efficient regularization approach to neural machine translation that relies on the agreement between L2R and R2L NMT models. In our method, two Kullback-Leibler divergences based on the probability distributions of the L2R and R2L models are added to the standard training objective as regularization terms. An efficient approximation algorithm is designed to enable fast training with the regularized objective, and a training strategy is proposed to jointly optimize the L2R and R2L models. Empirical evaluations on Chinese-English and English-German translation tasks demonstrate that our approach leads to significant improvements over strong baseline systems.
In future work, we plan to test our method on other sequence-to-sequence tasks, such as summarization and dialogue generation. Besides the back-translation method, it is also worth integrating our approach with other semi-supervised methods to better leverage unlabeled data.
Acknowledgments
This research was partially supported by grants from the National Key Research and Development Program of China (Grant No. 2018YFB1004300), the National Natural Science Foundation of China (Grant No. 61703386), the Anhui Provincial Natural Science Foundation (Grant No. 1708085QF140), and the Fundamental Research Funds for the Central Universities (Grant No. WK2150110006).
Besides, we appreciate Dongdong Zhang and Ren Shuo for the fruitful discussions. We also thank the anonymous reviewers for their careful reading of our paper and insightful comments.
References
 [Bahdanau et al.2016] Bahdanau, D.; Brakel, P.; Xu, K.; Goyal, A.; Lowe, R.; Pineau, J.; Courville, A. C.; and Bengio, Y. 2016. An actorcritic algorithm for sequence prediction. CoRR abs/1607.07086.
 [Bahdanau, Cho, and Bengio2014] Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.
 [Bengio et al.2015] Bengio, S.; Vinyals, O.; Jaitly, N.; and Shazeer, N. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS.
 [Chiang2007] Chiang, D. 2007. Hierarchical phrase-based translation. Computational Linguistics.
 [Cho et al.2014] Cho, K.; van Merrienboer, B.; Gulçehre, Ç.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.
 [Eriguchi, Hashimoto, and Tsuruoka2016] Eriguchi, A.; Hashimoto, K.; and Tsuruoka, Y. 2016. Tree-to-sequence attentional neural machine translation. In ACL.
 [Finch and Sumita2009] Finch, A. M., and Sumita, E. 2009. Bidirectional phrase-based statistical machine translation. In EMNLP.
 [Gehring et al.2017] Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; and Dauphin, Y. 2017. Convolutional sequence to sequence learning. In ICML.
 [Glorot and Bengio2010] Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In AISTATS.
 [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation.
 [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980.
 [Koehn, Och, and Marcu2003] Koehn, P.; Och, F. J.; and Marcu, D. 2003. Statistical phrase-based translation. In HLT-NAACL.
 [Liu et al.2016] Liu, L.; Utiyama, M.; Finch, A. M.; and Sumita, E. 2016. Agreement on target-bidirectional neural machine translation. In HLT-NAACL.
 [Luong et al.2015] Luong, T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W. 2015. Addressing the rare word problem in neural machine translation. In ACL.
 [Papineni et al.2002] Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.
 [Ranzato et al.2015] Ranzato, M.; Chopra, S.; Auli, M.; and Zaremba, W. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.
 [Sennrich et al.2017] Sennrich, R.; Birch, A.; Currey, A.; Germann, U.; Haddow, B.; Heafield, K.; Barone, A. V. M.; and Williams, P. G. 2017. The University of Edinburgh's neural MT systems for WMT17. In WMT.
 [Sennrich, Haddow, and Birch2016a] Sennrich, R.; Haddow, B.; and Birch, A. 2016a. Improving neural machine translation models with monolingual data. In ACL.
 [Sennrich, Haddow, and Birch2016b] Sennrich, R.; Haddow, B.; and Birch, A. 2016b. Neural machine translation of rare words with subword units. In ACL.
 [Shen et al.2016] Shen, S.; Cheng, Y.; He, Z.; He, W.; Wu, H.; Sun, M.; and Liu, Y. 2016. Minimum risk training for neural machine translation. In ACL.
 [Sutskever, Vinyals, and Le2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In NIPS.
 [Tu et al.2016] Tu, Z.; Lu, Z.; Liu, Y.; Liu, X.; and Li, H. 2016. Modeling coverage for neural machine translation. In ACL.
 [Vaswani et al.2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is all you need. In NIPS.
 [Wang et al.2017a] Wang, M.; Lu, Z.; Zhou, J.; and Liu, Q. 2017a. Deep neural machine translation with linear associative unit. In ACL.
 [Wang et al.2017b] Wang, Y.; Cheng, S.; Jiang, L.; Yang, J.; Chen, W.; Li, M.; Shi, L.; Wang, Y.; and Yang, H. 2017b. Sogou neural machine translation systems for wmt17. In WMT.
 [Watanabe and Sumita2002] Watanabe, T., and Sumita, E. 2002. Bidirectional decoding for statistical machine translation. In COLING.
 [Wiseman and Rush2016] Wiseman, S., and Rush, A. M. 2016. Sequence-to-sequence learning as beam-search optimization. In EMNLP.
 [Wu et al.2016] Wu, Y.; Schuster, M.; Chen, Z.; Le, Q. V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; Klingner, J.; Shah, A.; Johnson, M.; Liu, X.; Kaiser, L.; Gouws, S.; Kato, Y.; Kudo, T.; Kazawa, H.; Stevens, K.; Kurian, G.; Patil, N.; Wang, W.; Young, C.; Smith, J.; Riesa, J.; Rudnick, A.; Vinyals, O.; Corrado, G. S.; Hughes, M.; and Dean, J. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144.
 [Zhang et al.2013] Zhang, H.; Toutanova, K.; Quirk, C.; and Gao, J. 2013. Beyond left-to-right: Multiple decomposition structures for SMT. In HLT-NAACL.
 [Zhang et al.2018] Zhang, X.; Su, J.; Qin, Y.; Liu, Y.; Ji, R.; and Wang, H. 2018. Asynchronous bidirectional decoding for neural machine translation. In AAAI.