Regularizing Neural Machine Translation by Target-bidirectional Agreement

08/13/2018
by   Zhirui Zhang, et al.
Microsoft

Although Neural Machine Translation (NMT) has achieved remarkable progress in the past several years, most NMT systems still suffer from a fundamental shortcoming shared with other sequence generation tasks: errors made early in the generation process are fed as inputs to the model and can be quickly amplified, harming subsequent sequence generation. To address this issue, we propose a novel model regularization method for NMT training, which aims to improve the agreement between translations generated by left-to-right (L2R) and right-to-left (R2L) NMT decoders. This goal is achieved by introducing two Kullback-Leibler divergence regularization terms into the NMT training objective to reduce the mismatch between output probabilities of L2R and R2L models. In addition, we also employ a joint training strategy to allow L2R and R2L models to improve each other in an interactive update process. Experimental results show that our proposed method significantly outperforms state-of-the-art baselines on Chinese-English and English-German translation tasks.



Introduction

Neural Machine Translation (NMT) [Cho et al.2014, Sutskever, Vinyals, and Le2014, Bahdanau, Cho, and Bengio2014] has seen rapid development over the past several years, from catching up with Statistical Machine Translation (SMT) [Koehn, Och, and Marcu2003, Chiang2007] to outperforming it by significant margins on many languages [Sennrich, Haddow, and Birch2016b, Wu et al.2016, Tu et al.2016, Eriguchi, Hashimoto, and Tsuruoka2016, Wang et al.2017a, Vaswani et al.2017]. In a conventional NMT model, an encoder first transforms the source sequence into a sequence of intermediate hidden vector representations, based on which a decoder generates the target sequence word by word.

Due to their autoregressive structure, current NMT systems usually suffer from the so-called exposure bias problem [Bengio et al.2015]: during inference, true previous target tokens are unavailable and are replaced by tokens generated by the model itself, so mistakes made early can mislead subsequent translation, yielding unsatisfactory translations with good prefixes but bad suffixes (as shown in Table 1). This issue becomes more severe as sequence length increases.

Input: (Chinese source sentence)
Ref.: Supporters say the two tunnels will benefit the environment and help California ensure the water supply is safer.
L2R: Supporters say these two tunnels will benefit the environment and to help a secure water supply in California.
R2L: Supporter say the tunnel will benefit the environment and help California ensure the water supply is more secure.
Table 1: Example of unsatisfactory translations generated by a left-to-right (L2R) decoder and a right-to-left (R2L) decoder.

To address this problem, one line of research attempts to reduce the inconsistency between training and inference so as to improve robustness to incorrect previous predictions, for example by designing sequence-level objectives or adopting reinforcement learning approaches [Ranzato et al.2015, Shen et al.2016, Wiseman and Rush2016]. Another line tries to leverage a complementary NMT model that generates target words from right to left (R2L) to distinguish unsatisfactory translation results in an n-best list generated by the L2R model [Liu et al.2016, Wang et al.2017b].

In their work, however, the R2L NMT model is only used to re-rank the translation candidates generated by the L2R model, while the candidates in the n-best list still suffer from the exposure bias problem, which limits the room for improvement. Another problem is that the complementary R2L model tends to generate translation results with good suffixes and bad prefixes, due to the same exposure bias problem, as shown in Table 1. Just as the R2L model can be used to augment the L2R model, the L2R model can also be leveraged to improve the R2L model.

Instead of re-ranking the n-best list, we incorporate the agreement between the L2R and R2L models into both of their training objectives, hoping that the agreement information can help to learn better models that integrate their advantages and generate translations with both good prefixes and good suffixes. To this end, we introduce two Kullback-Leibler (KL) divergences between the probability distributions defined by the L2R and R2L models into the NMT training objective as regularization terms. Thus, we not only maximize the likelihood of the training data but also minimize the divergence between the L2R and R2L models at the same time, the latter serving as a measure of the exposure bias of the currently evaluated model. With this method, the L2R model can be enhanced using the R2L model as a helper system, and the R2L model can likewise be improved with the help of the L2R model. We integrate the optimization of the R2L and L2R models into a joint training framework, in which they act as helper systems for each other, and both models achieve further improvements through an interactive update process.

Our experiments are conducted on Chinese-English and English-German translation tasks, and demonstrate that our proposed method significantly outperforms state-of-the-art baselines.

Neural Machine Translation

Neural Machine Translation (NMT) is an end-to-end framework that directly models the conditional probability $P(y\mid x)$ of the target translation $y$ given the source sentence $x$. In practice, NMT systems are usually implemented with an attention-based encoder-decoder architecture. The encoder reads the source sentence $x$ and transforms it into a sequence of intermediate hidden vectors $h$ using a neural network. Given the hidden states $h$, the decoder generates the target translation $y$ with another neural network that jointly learns language and alignment models.

Early NMT systems employ recurrent neural networks (RNNs) or their variants, Gated Recurrent Units [Cho et al.2014] and Long Short-Term Memory networks [Hochreiter and Schmidhuber1997]. Recently, two additional architectures have been proposed that improve not only parallelization but also the state of the art: the fully convolutional model [Gehring et al.2017] and the self-attentional Transformer [Vaswani et al.2017].

For model training, given a parallel corpus $D=\{(x^{(n)},y^{(n)})\}_{n=1}^{N}$, the standard training objective in NMT is to maximize the likelihood of the training data:

$$\mathcal{L}(\theta)=\sum_{(x,y)\in D}\log P(y\mid x;\theta) \qquad (1)$$

where $P(y\mid x;\theta)$ is the neural translation model and $\theta$ is the model parameter.

One big problem with this training scheme is that, during training, the history of any target word is correct and has been observed in the training data, whereas during inference all target words are predicted by the model and may contain mistakes, which are fed back as inputs and accumulate quickly along the sequence generation. This is called the exposure bias problem [Bengio et al.2015].
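To make the gap between training and inference concrete, the toy PyTorch sketch below (an illustration, not the paper's implementation; all sizes and names are made up) contrasts the teacher-forced likelihood of Equation 1, where the gold prefix is fed to the decoder, with free-running greedy decoding, where the model consumes its own predictions and early mistakes can propagate.

```python
# Minimal sketch (not the paper's code): teacher forcing vs. free-running decoding
# with a toy GRU encoder-decoder, illustrating Equation 1 and the exposure bias gap.
import torch
import torch.nn as nn

VOCAB, EMB, HID, BOS = 1000, 32, 64, 1  # illustrative sizes / special token id

class TinyNMT(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(VOCAB, EMB)
        self.tgt_emb = nn.Embedding(VOCAB, EMB)
        self.encoder = nn.GRU(EMB, HID, batch_first=True)
        self.decoder = nn.GRU(EMB, HID, batch_first=True)
        self.proj = nn.Linear(HID, VOCAB)

    def forward(self, src, tgt_in):
        _, h = self.encoder(self.src_emb(src))          # source summary as initial state
        out, _ = self.decoder(self.tgt_emb(tgt_in), h)  # teacher forcing: gold prefix as input
        return self.proj(out)                           # logits for every target position

def mle_loss(model, src, tgt):
    """Equation 1: negative log-likelihood with the gold history (teacher forcing)."""
    bos = torch.full((tgt.size(0), 1), BOS, dtype=torch.long)
    logits = model(src, torch.cat([bos, tgt[:, :-1]], dim=1))
    return nn.functional.cross_entropy(logits.reshape(-1, VOCAB), tgt.reshape(-1))

@torch.no_grad()
def greedy_decode(model, src, max_len=20):
    """Inference: the model consumes its OWN predictions, so early errors propagate."""
    _, h = model.encoder(model.src_emb(src))
    tok = torch.full((src.size(0), 1), BOS, dtype=torch.long)
    outputs = []
    for _ in range(max_len):
        out, h = model.decoder(model.tgt_emb(tok), h)
        tok = model.proj(out).argmax(-1)   # feed back the predicted token
        outputs.append(tok)
    return torch.cat(outputs, dim=1)

model = TinyNMT()
src = torch.randint(2, VOCAB, (4, 7))
tgt = torch.randint(2, VOCAB, (4, 9))
print(mle_loss(model, src, tgt).item(), greedy_decode(model, src).shape)
```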

Our Approach

To deal with the exposure bias problem, we try to maximize the agreement between translations from the L2R and R2L NMT models, and divide the NMT training objective into two parts: the standard maximum likelihood of the training data, and regularization terms that measure the divergence between the L2R and R2L models under the current model parameters. In this section, we start with basic model notations, followed by a discussion of the model regularization terms and efficient gradient approximation methods. In the last part, we show that the L2R and R2L NMT models can be jointly improved to achieve even better results.

Notations

Given a source sentence $x$ and its target translation $y=\{y_1,\dots,y_T\}$, let $\overrightarrow{P}(y\mid x;\overrightarrow{\theta})$ and $\overleftarrow{P}(y\mid x;\overleftarrow{\theta})$ denote the L2R and R2L translation models, in which $\overrightarrow{\theta}$ and $\overleftarrow{\theta}$ are the corresponding model parameters. Specifically, the L2R translation model can be decomposed as $\overrightarrow{P}(y\mid x;\overrightarrow{\theta})=\prod_{t=1}^{T}\overrightarrow{P}(y_t\mid y_{<t},x;\overrightarrow{\theta})$, which means the L2R model adopts previous targets $y_{<t}$ as history to predict the current target $y_t$ at each step $t$, while the R2L translation model can similarly be decomposed as $\overleftarrow{P}(y\mid x;\overleftarrow{\theta})=\prod_{t=1}^{T}\overleftarrow{P}(y_t\mid y_{>t},x;\overleftarrow{\theta})$ and employs later targets $y_{>t}$ as history to predict the current target $y_t$ at each step $t$.
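As a small numeric illustration of these two decompositions (toy probabilities, not model outputs), the sketch below scores the same target sequence with an L2R chain and with an R2L chain, where the R2L chain is simply an L2R-style chain applied to the reversed target.

```python
# Minimal sketch: the L2R and R2L chain decompositions of log P(y|x).
# step_logprob callables are illustrative stand-ins for the two NMT decoders.
import math
from typing import Callable, Sequence

def l2r_logprob(step: Callable[[Sequence[str], str], float], y: Sequence[str]) -> float:
    """log P(y|x) = sum_t log P(y_t | y_<t, x): previous targets as history."""
    return sum(step(y[:t], y[t]) for t in range(len(y)))

def r2l_logprob(step: Callable[[Sequence[str], str], float], y: Sequence[str]) -> float:
    """log P(y|x) = sum_t log P(y_t | y_>t, x): later targets as history,
    equivalently an L2R-style chain over the reversed target."""
    rev = list(reversed(y))
    return sum(step(rev[:t], rev[t]) for t in range(len(rev)))

# Toy uniform "decoder" over a 10-word vocabulary: every conditional is 1/10,
# so both chains assign the same total probability, as Equation 2 requires.
uniform = lambda history, token: math.log(0.1)
y = ["the", "water", "supply", "is", "safer"]
print(l2r_logprob(uniform, y), r2l_logprob(uniform, y))  # identical values
```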

NMT Model Regularization

Since the L2R and R2L models are different chain decompositions of the same translation probability, the output probabilities of the two models should be identical:

$$\overrightarrow{P}(y\mid x;\overrightarrow{\theta})=\overleftarrow{P}(y\mid x;\overleftarrow{\theta}) \qquad (2)$$

However, if these two models are optimized separately by maximum likelihood estimation (MLE), there is no guarantee that the above equation will hold. To satisfy this constraint, we introduce two Kullback-Leibler (KL) divergence regularization terms into the MLE training objective (Equation 1). For the L2R model, the new training objective is:

$$\mathcal{L}(\overrightarrow{\theta})=\sum_{(x,y)\in D}\Big\{\log\overrightarrow{P}(y\mid x;\overrightarrow{\theta})-\lambda\,\mathrm{KL}\big(\overleftarrow{P}(y\mid x;\overleftarrow{\theta})\,\big\|\,\overrightarrow{P}(y\mid x;\overrightarrow{\theta})\big)-\lambda\,\mathrm{KL}\big(\overrightarrow{P}(y\mid x;\overrightarrow{\theta})\,\big\|\,\overleftarrow{P}(y\mid x;\overleftarrow{\theta})\big)\Big\} \qquad (3)$$

where $\lambda$ is a hyper-parameter for the regularization terms. These regularization terms are 0 when Equation 2 holds; otherwise they guide the training process to reduce the disagreement between the L2R and R2L models.
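The sketch below computes the two KL terms of Equation 3 exactly over a tiny, enumerable candidate set; the candidate probabilities and the value of the regularization weight are illustrative, not taken from the paper.

```python
# Minimal sketch: the two KL regularization terms of Equation 3, computed exactly
# over a tiny, enumerable candidate set (illustrative numbers, not model outputs).
import torch

# Probabilities that the L2R and R2L models assign to four candidate translations.
p_l2r = torch.tensor([0.70, 0.15, 0.10, 0.05])
p_r2l = torch.tensor([0.40, 0.35, 0.15, 0.10])

def kl(p, q):
    """KL(p || q) = sum_y p(y) * log(p(y) / q(y))."""
    return torch.sum(p * (p.log() - q.log()))

lam = 0.5                                # illustrative regularization weight
log_likelihood = p_l2r[0].log()          # stand-in for log P_l2r(y|x) on the gold pair
regularizer = kl(p_r2l, p_l2r) + kl(p_l2r, p_r2l)
objective = log_likelihood - lam * regularizer
print(float(regularizer), float(objective))  # regularizer is 0 only if the models agree
```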

Unfortunately, it is intractable to compute the exact gradients of this objective function, since the KL divergences require summing over all translation candidates in an exponential search space. To alleviate this problem, we follow [Shen et al.2016] to approximate the full search space with a sampled sub-space and then design an efficient KL divergence approximation algorithm. Specifically, we derive the gradient calculation equation based on the definition of KL divergence, and then design proper sampling methods for the two different KL divergence regularization terms.

For $\mathrm{KL}\big(\overleftarrow{P}(y\mid x;\overleftarrow{\theta})\,\|\,\overrightarrow{P}(y\mid x;\overrightarrow{\theta})\big)$, according to the definition of KL divergence, we have

$$\mathrm{KL}\big(\overleftarrow{P}(y\mid x;\overleftarrow{\theta})\,\big\|\,\overrightarrow{P}(y\mid x;\overrightarrow{\theta})\big)=\sum_{y\in\mathcal{Y}(x)}\overleftarrow{P}(y\mid x;\overleftarrow{\theta})\log\frac{\overleftarrow{P}(y\mid x;\overleftarrow{\theta})}{\overrightarrow{P}(y\mid x;\overrightarrow{\theta})} \qquad (4)$$

where $\mathcal{Y}(x)$ is the set of all possible candidate translations for the source sentence $x$. Since $\overleftarrow{P}(y\mid x;\overleftarrow{\theta})$ is irrelevant to parameter $\overrightarrow{\theta}$, the partial derivative of this KL divergence with respect to $\overrightarrow{\theta}$ can be written as

$$\frac{\partial\,\mathrm{KL}\big(\overleftarrow{P}\,\|\,\overrightarrow{P}\big)}{\partial\overrightarrow{\theta}}=-\mathbb{E}_{y'\sim\overleftarrow{P}(y\mid x;\overleftarrow{\theta})}\bigg[\frac{\partial\log\overrightarrow{P}(y'\mid x;\overrightarrow{\theta})}{\partial\overrightarrow{\theta}}\bigg] \qquad (5)$$

in which $\frac{\partial\log\overrightarrow{P}(y'\mid x;\overrightarrow{\theta})}{\partial\overrightarrow{\theta}}$ are the gradients specified by a standard sequence-to-sequence NMT network. The expectation can be approximated with samples $y'$ drawn from the R2L model $\overleftarrow{P}(y\mid x;\overleftarrow{\theta})$. Therefore, minimizing this regularization term is equivalent to maximizing the log-likelihood of the pseudo sentence pairs $(x,y')$ sampled from the R2L model.

For $\mathrm{KL}\big(\overrightarrow{P}(y\mid x;\overrightarrow{\theta})\,\|\,\overleftarrow{P}(y\mid x;\overleftarrow{\theta})\big)$, similarly we have

$$\mathrm{KL}\big(\overrightarrow{P}(y\mid x;\overrightarrow{\theta})\,\big\|\,\overleftarrow{P}(y\mid x;\overleftarrow{\theta})\big)=\sum_{y\in\mathcal{Y}(x)}\overrightarrow{P}(y\mid x;\overrightarrow{\theta})\log\frac{\overrightarrow{P}(y\mid x;\overrightarrow{\theta})}{\overleftarrow{P}(y\mid x;\overleftarrow{\theta})} \qquad (6)$$

The partial derivative of this KL divergence with respect to $\overrightarrow{\theta}$ is calculated as follows:

$$\frac{\partial\,\mathrm{KL}\big(\overrightarrow{P}\,\|\,\overleftarrow{P}\big)}{\partial\overrightarrow{\theta}}=\mathbb{E}_{y''\sim\overrightarrow{P}(y\mid x;\overrightarrow{\theta})}\bigg[\bigg(1+\log\frac{\overrightarrow{P}(y''\mid x;\overrightarrow{\theta})}{\overleftarrow{P}(y''\mid x;\overleftarrow{\theta})}\bigg)\frac{\partial\log\overrightarrow{P}(y''\mid x;\overrightarrow{\theta})}{\partial\overrightarrow{\theta}}\bigg] \qquad (7)$$

Similarly, we use sampling to approximate the expectation over $\overrightarrow{P}(y\mid x;\overrightarrow{\theta})$. There are two differences in Equation 7 compared with Equation 5: 1) pseudo sentence pairs are not sampled from the R2L model $\overleftarrow{P}(y\mid x;\overleftarrow{\theta})$ but from the L2R model itself $\overrightarrow{P}(y\mid x;\overrightarrow{\theta})$; 2) $1+\log\frac{\overrightarrow{P}(y''\mid x;\overrightarrow{\theta})}{\overleftarrow{P}(y''\mid x;\overleftarrow{\theta})}$ is used as a weight to penalize incorrect pseudo pairs.
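The sketch below turns the two expectations into Monte Carlo surrogate losses over sampled translations, so that automatic differentiation yields gradients matching Equations 5 and 7; the log-probability values are made up, and the weight in the second term is detached so that it acts as a constant, as the derivation requires.

```python
# Minimal sketch: Monte Carlo surrogate losses whose gradients match Equations 5 and 7.
# logp_* are per-sample sequence log-probabilities; only logp_l2r_* carry gradients
# with respect to the L2R parameters. Shapes, names, and values are illustrative.
import torch

def kl_r2l_to_l2r_term(logp_l2r_of_r2l_samples: torch.Tensor) -> torch.Tensor:
    """Equation 5: minimizing KL(P_r2l || P_l2r) w.r.t. the L2R parameters reduces to
    maximizing the L2R log-likelihood of pseudo pairs sampled from the R2L model."""
    return -logp_l2r_of_r2l_samples.mean()

def kl_l2r_to_r2l_term(logp_l2r_of_own_samples: torch.Tensor,
                       logp_r2l_of_own_samples: torch.Tensor) -> torch.Tensor:
    """Equation 7: samples come from the L2R model itself; each sample's log-likelihood
    gradient is weighted by 1 + log(P_l2r / P_r2l), detached so it acts as a constant.
    Samples the R2L model dislikes (large ratio) are penalized the most."""
    weight = (1.0 + logp_l2r_of_own_samples - logp_r2l_of_own_samples).detach()
    return (weight * logp_l2r_of_own_samples).mean()

# Toy usage: pretend log-probabilities for 3 sampled translations of one source sentence.
logp_l2r_r2l = torch.tensor([-4.2, -5.1, -3.8], requires_grad=True)
logp_l2r_own = torch.tensor([-2.9, -3.4, -4.0], requires_grad=True)
logp_r2l_own = torch.tensor([-3.5, -6.0, -4.1])
lam = 0.5  # illustrative regularization weight
loss = lam * (kl_r2l_to_l2r_term(logp_l2r_r2l) + kl_l2r_to_r2l_term(logp_l2r_own, logp_r2l_own))
loss.backward()
print(logp_l2r_r2l.grad, logp_l2r_own.grad)
```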

Input: Bilingual data $D$; R2L model $\overleftarrow{P}(y\mid x;\overleftarrow{\theta})$
Output: L2R model $\overrightarrow{P}(y\mid x;\overrightarrow{\theta})$

1: procedure TrainingProcess
2:     while not converged do
3:         Sample sentence pairs $(x,y)$ from the bilingual data $D$;
4:         Generate translation candidates $y'$ for $x$ with the R2L model $\overleftarrow{P}(y\mid x;\overleftarrow{\theta})$ and build pseudo sentence pairs $(x,y')$;
5:         Generate translation candidates $y''$ for $x$ with the L2R model $\overrightarrow{P}(y\mid x;\overrightarrow{\theta})$ and build pseudo sentence pairs $(x,y'')$ weighted with $1+\log\frac{\overrightarrow{P}(y''\mid x;\overrightarrow{\theta})}{\overleftarrow{P}(y''\mid x;\overleftarrow{\theta})}$;
6:         Update $\overrightarrow{\theta}$ with Equation 8 given the original data $(x,y)$ and the two kinds of synthetic data $(x,y')$ and $(x,y'')$.
7:     end while
8: end procedure
Algorithm 1: Training Algorithm for the L2R Model

To sum up, the partial derivative of the objective function with respect to $\overrightarrow{\theta}$ can be approximately written as follows:

$$\frac{\partial\mathcal{L}(\overrightarrow{\theta})}{\partial\overrightarrow{\theta}}\approx\sum_{(x,y)\in D}\Bigg\{\frac{\partial\log\overrightarrow{P}(y\mid x;\overrightarrow{\theta})}{\partial\overrightarrow{\theta}}+\lambda\,\frac{\partial\log\overrightarrow{P}(y'\mid x;\overrightarrow{\theta})}{\partial\overrightarrow{\theta}}-\lambda\,\bigg(1+\log\frac{\overrightarrow{P}(y''\mid x;\overrightarrow{\theta})}{\overleftarrow{P}(y''\mid x;\overleftarrow{\theta})}\bigg)\frac{\partial\log\overrightarrow{P}(y''\mid x;\overrightarrow{\theta})}{\partial\overrightarrow{\theta}}\Bigg\} \qquad (8)$$

where $y'$ is a pseudo translation sampled from the R2L model and $y''$ is a pseudo translation sampled from the L2R model itself.

The overall training is shown in Algorithm 1.
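A skeleton of one Algorithm-1 update is sketched below; the `sample_*` and `seq_logprob_*` callables are hypothetical stand-ins for an NMT toolkit's beam search and sequence scoring, so this is an outline of the procedure under those assumptions rather than the paper's actual code.

```python
# Skeleton of one Algorithm-1 update for the L2R model (a sketch, not the paper's code).
from typing import Callable, List, Tuple
import torch

def l2r_update(batch: List[Tuple[str, str]],
               sample_r2l: Callable[[str], str],      # hypothetical: beam search with the R2L model
               sample_l2r: Callable[[str], str],      # hypothetical: beam search with the L2R model
               seq_logprob_l2r: Callable[[str, str], torch.Tensor],
               seq_logprob_r2l: Callable[[str, str], torch.Tensor],
               optimizer: torch.optim.Optimizer,
               lam: float = 0.5) -> float:
    """One update following Equation 8: real pairs + pseudo pairs from the R2L model
    + weighted pseudo pairs from the L2R model itself."""
    loss = torch.zeros(())
    for x, y in batch:
        # (1) MLE term on the original bilingual pair.
        loss = loss - seq_logprob_l2r(x, y)
        # (2) Pseudo pair (x, y') sampled from the R2L helper model (Equation 5).
        y_r2l = sample_r2l(x)
        loss = loss - lam * seq_logprob_l2r(x, y_r2l)
        # (3) Pseudo pair (x, y'') sampled from the L2R model itself, weighted by
        #     1 + log(P_l2r / P_r2l), detached so it acts as a constant (Equation 7).
        y_own = sample_l2r(x)
        lp_l2r = seq_logprob_l2r(x, y_own)
        lp_r2l = seq_logprob_r2l(x, y_own)
        weight = (1.0 + lp_l2r - lp_r2l).detach()
        loss = loss + lam * weight * lp_l2r
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```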

Joint Training for Paired NMT Models

In practice, due to the imperfection of the R2L model, the agreement between the L2R and R2L models may sometimes mislead L2R model training. On the other hand, due to the symmetry of the L2R and R2L models, the L2R model can also serve as the discriminator to punish bad translation candidates generated by the R2L model. Similarly, the objective function of the R2L model can be defined as follows:

$$\mathcal{L}(\overleftarrow{\theta})=\sum_{(x,y)\in D}\Big\{\log\overleftarrow{P}(y\mid x;\overleftarrow{\theta})-\lambda\,\mathrm{KL}\big(\overrightarrow{P}(y\mid x;\overrightarrow{\theta})\,\big\|\,\overleftarrow{P}(y\mid x;\overleftarrow{\theta})\big)-\lambda\,\mathrm{KL}\big(\overleftarrow{P}(y\mid x;\overleftarrow{\theta})\,\big\|\,\overrightarrow{P}(y\mid x;\overrightarrow{\theta})\big)\Big\} \qquad (9)$$

The corresponding training procedure is similar to Algorithm 1.

Figure 1: Illustration of the joint training of NMT models in two directions (the L2R model $\overrightarrow{P}(y\mid x;\overrightarrow{\theta})$ and the R2L model $\overleftarrow{P}(y\mid x;\overleftarrow{\theta})$).

Based on the above, the L2R and R2L models can act as helper systems for each other in a joint training process: the L2R model is used as an auxiliary system to regularize the R2L model, and the R2L model is used as an auxiliary system to regularize the L2R model. This training process can be carried out iteratively to obtain further improvements, because after each iteration both the L2R and R2L models are expected to improve under the regularization.

To simultaneously optimize these two models, we design a novel training algorithm with the overall training objective defined as the sum of the objectives in both directions:

$$\mathcal{L}(\overrightarrow{\theta},\overleftarrow{\theta})=\mathcal{L}(\overrightarrow{\theta})+\mathcal{L}(\overleftarrow{\theta}) \qquad (10)$$

As illustrated in Figure 1, the whole training process contains two major steps: pre-training and joint training. First, given the parallel corpus $D$, we pre-train both the L2R and R2L models with the MLE principle. Next, based on the pre-trained models, we jointly optimize the L2R and R2L models with an iterative process. In each iteration, we fix the R2L model and use it as a helper to optimize the L2R model with Equation 3, and at the same time we fix the L2R model and use it as a helper to optimize the R2L model with Equation 9. The iterative training continues until performance on the development set no longer increases.
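The overall procedure can be summarized in the short driver below; every callable is a hypothetical stand-in for the corresponding training or evaluation routine, and the early-stopping criterion follows the development-set BLEU rule described above.

```python
# Sketch of the joint training procedure (pre-training + iterative mutual regularization).
# Every callable is a hypothetical stand-in; stopping follows dev-set BLEU as in the paper.
from typing import Callable

def joint_training(pretrain_l2r: Callable[[], None],
                   pretrain_r2l: Callable[[], None],
                   optimize_l2r_with_r2l_helper: Callable[[], None],  # Equation 3
                   optimize_r2l_with_l2r_helper: Callable[[], None],  # Equation 9
                   dev_bleu_l2r: Callable[[], float],
                   max_iterations: int = 5) -> None:
    # Step 1: pre-train both directions with plain MLE on the parallel corpus.
    pretrain_l2r()
    pretrain_r2l()
    best = dev_bleu_l2r()
    # Step 2: alternately use each model as a fixed helper to regularize the other.
    for _ in range(max_iterations):
        optimize_l2r_with_r2l_helper()   # R2L fixed, update L2R
        optimize_r2l_with_l2r_helper()   # L2R fixed, update R2L
        bleu = dev_bleu_l2r()
        if bleu <= best:                 # stop when dev BLEU no longer improves
            break
        best = bleu
```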

Experiments

Setup

To examine the effectiveness of our proposed approach, we conduct experiments on three translation tasks: NIST OpenMT Chinese-English, WMT17 English-German, and WMT17 Chinese-English. In all experiments, we use BLEU [Papineni et al.2002] as the automatic metric for translation evaluation.

Datasets.

For the NIST OpenMT Chinese-English translation task, we select our training data from LDC corpora,[1] which consists of 2.6M sentence pairs with 65.1M Chinese words and 67.1M English words respectively. Any sentence longer than 80 words is removed from the training data. The NIST OpenMT 2006 evaluation set is used as the validation set, and the NIST 2003, 2005, 2008, and 2012 datasets as test sets. We limit the vocabulary to the 50K most frequent words on both the source and target sides, and convert the remaining words into the <unk> token. During decoding, we follow [Luong et al.2015] to handle <unk> replacement.

[1] The corpora include LDC2002E17, LDC2002E18, LDC2003E07, LDC2003E14, LDC2005E83, LDC2005T06, LDC2005T10, LDC2006E17, LDC2006E26, LDC2006E34, LDC2006E85, LDC2006E92, LDC2006T06, LDC2004T08, LDC2005T10.

For the WMT17 English-German translation task, we use the pre-processed training data provided by the task organizers.[2] The training data consists of 5.8M sentence pairs with 141M English words and 134M German words respectively. We use newstest2016 as the validation set and newstest2017 as the test set. The maximal sentence length is set to 128. For the vocabulary, we use 37K sub-word tokens based on Byte Pair Encoding (BPE) [Sennrich, Haddow, and Birch2016b].

[2] http://data.statmt.org/wmt17/translation-task/preprocessed/de-en/

System NIST2006 NIST2003 NIST2005 NIST2008 NIST2012 Average
Transformer 44.33 45.69 43.94 34.80 32.63 40.28
Transformer+MRT 45.21 46.60 45.11 36.77 34.78 41.69
Transformer+JS 45.04 46.32 44.58 36.81 35.02 41.51
Transformer+RT 46.14 48.28 46.24 38.07 36.31 43.01
Table 2: Case-insensitive BLEU scores (%) for Chinese-English translation on NIST datasets. The “Average” denotes the average BLEU score of all datasets in the same setting.

For the WMT17 Chinese-English translation task, we use all the available parallel data, which consists of 24M sentence pairs, including the News Commentary, UN Parallel, and CWMT corpora.[3] The newsdev2017 set is used as the validation set and newstest2017 as the test set. We also limit the maximal sentence length to 128. For data pre-processing, we segment Chinese sentences with our in-house Chinese word segmentation tool and tokenize English sentences with the scripts provided in Moses.[4] We then learn a BPE model on the pre-processed sentences with 32K merge operations, from which 44K and 33K sub-word tokens are adopted as the source and target vocabularies respectively.

[3] http://www.statmt.org/wmt17/translation-task.html
[4] https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl

Experimental Details.

The Transformer model [Vaswani et al.2017] is adopted as our baseline. For all translation tasks, we follow the transformer_base_v2 hyper-parameter setting,[5] which corresponds to a 6-layer Transformer with a model size of 512. The parameters are initialized with Glorot initialization [Glorot and Bengio2010], i.e., drawn from a zero-mean normal distribution whose variance depends on the number of rows and columns of each parameter matrix. All models are trained on 4 Tesla M40 GPUs for a total of 100K steps using the Adam [Kingma and Ba2014] algorithm. The initial learning rate is set to 0.2 and decayed according to the schedule in [Vaswani et al.2017]. During training, the batch size is set to approximately 4096 words per batch and checkpoints are created every 60 minutes. At test time, we use a beam of 8 and a length penalty of 1.0.

[5] https://github.com/tensorflow/tensor2tensor/blob/v1.3.0/tensor2tensor/models/transformer.py

The regularization weight $\lambda$ and the other hyper-parameters of our approach are tuned on the validation set. To build the synthetic data in Algorithm 1, we adopt beam search to generate translation candidates with beam size 4, and the best sample is used for the estimation of the KL divergence. In practice, to speed up the decoding process, we sort all source sentences according to sentence length and then translate 32 sentences simultaneously with a parallel decoding implementation. In our experiments, we try different values of $\lambda$ and choose the one that achieves the best BLEU result on the validation set. We also test using more sampled candidates, but find no further improvement on the validation set, while training time increases due to the additional pseudo sentence pairs. In addition, we use sentence-level BLEU to filter out wrong translations whose BLEU score is not greater than 30%. Note that the R2L model achieves results comparable to the L2R model, so only the results of the L2R model are reported in our experiments.
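For the sentence-level BLEU filter mentioned above, a minimal sketch is given below; it assumes the sacrebleu Python package is available, and the example pair is invented for illustration.

```python
# Minimal sketch of the pseudo-pair filter: drop sampled translations whose
# sentence-level BLEU against the reference is not greater than 30 percent.
# Assumes the `sacrebleu` package; the demo pair below is illustrative.
import sacrebleu

def filter_pseudo_pairs(pairs, threshold=30.0):
    """pairs: iterable of (source, sampled_translation, reference) strings."""
    kept = []
    for src, hyp, ref in pairs:
        if sacrebleu.sentence_bleu(hyp, [ref]).score > threshold:
            kept.append((src, hyp))
    return kept

demo = [("(src)", "the water supply is safer", "the water supply is safer")]
print(filter_pseudo_pairs(demo))   # kept, since the sample matches the reference
```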

Evaluation on NIST Corpora

Table 2 shows the evaluation results of different models on the NIST datasets. MRT represents the method of [Shen et al.2016], RT denotes our regularization approach, and JS represents the method of [Liu et al.2016], which modifies the inference strategy by re-ranking the n-best results with the joint probability of the bidirectional models. All results are reported as case-insensitive BLEU computed with the Moses multi-bleu.perl script.

We observe that by taking agreement information into consideration, Transformer+JS and Transformer+RT both bring improvements across the different test sets, with our approach achieving a 2.73 BLEU point improvement over the Transformer on average. These results confirm that introducing agreement between the L2R and R2L models helps handle the exposure bias problem and improves translation quality.

Besides, Transformer+RT performs better than Transformer+JS across the different test sets, with a 1.5 BLEU point improvement on average. Since Transformer+JS only leverages the agreement constraint in the inference stage, the L2R and R2L models still suffer from the exposure bias problem and generate bad translation candidates, which limits the room for improvement in the re-ranking process. Instead of combining with the R2L model during inference, our approach utilizes the intrinsic probabilistic connection between the L2R and R2L models to guide the learning process: the two NMT models are expected to adjust to each other in cases of disagreement, so that their exposure bias problem can be alleviated.

Figure 2: Performance of the generated translations with respect to the length of source sentences on NIST datasets.
English-German Chinese-English
System newstest2016 newstest2017 newsdev2017 newstest2017
Transformer 32.58 25.48 20.87 23.01
Transformer+MRT 33.27 25.87 21.66 24.24
Transformer+JS 32.91 25.93 21.25 23.59
Transformer+RT 34.56 27.18 22.50 25.38
Transformer-big 33.58 27.13 21.91 24.03
Transformer-big+BT 35.06 28.34 23.59 25.53
Transformer-big+BT+RT 36.78 29.46 24.84 27.21
Edinburgh’s NMT System (ensemble) 36.20 28.30 24.00 25.70
Sogou’s NMT System (ensemble) - - 22.90 26.40
Table 3: Case-sensitive BLEU scores (%) for English-German and Chinese-English translation on WMT test sets. The Edinburgh [Sennrich et al.2017] and Sogou [Wang et al.2017b] NMT systems are the top-ranked systems on the WMT 2017 leaderboards for the English-German and Chinese-English translation tasks respectively.

Longer source sentences imply longer translations, which suffer more easily from the exposure bias problem. To further verify our approach, we group source sentences of similar length together and calculate the BLEU score for each group. As shown in Figure 2, our method achieves the best performance in all groups. The gap between our method and the other three methods is small when the length is smaller than 10, and becomes larger as the sentences grow longer. This further confirms the effectiveness of our proposed method in dealing with the exposure bias problem.
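The length-bucket analysis behind Figure 2 can be reproduced with a few lines such as the sketch below, which assumes the sacrebleu package; the bucket width and variable names are illustrative.

```python
# Sketch of the length-bucket analysis behind Figure 2: group test sentences by source
# length and compute corpus BLEU per bucket. Assumes the `sacrebleu` package.
from collections import defaultdict
import sacrebleu

def bleu_by_source_length(sources, hypotheses, references, bucket_size=10):
    """sources/hypotheses/references: parallel lists of plain-text sentences."""
    buckets = defaultdict(lambda: ([], []))
    for src, hyp, ref in zip(sources, hypotheses, references):
        lo = (len(src.split()) // bucket_size) * bucket_size
        buckets[lo][0].append(hyp)
        buckets[lo][1].append(ref)
    return {f"[{lo},{lo + bucket_size})": sacrebleu.corpus_bleu(hyps, [refs]).score
            for lo, (hyps, refs) in sorted(buckets.items())}
```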

Evaluation on WMT17 Corpora

For the WMT17 corpora, we verify the effectiveness of our approach on the English-German and Chinese-English translation tasks from two angles: 1) we compare our approach with baseline systems when only parallel corpora are used; 2) we investigate the impact of combining the back-translation technique [Sennrich, Haddow, and Birch2016a] with our approach. Since the back-translation method brings in more synthetic data, we choose the Transformer-big setting defined in [Vaswani et al.2017] for this experiment. Experimental results are shown in Table 3, in which BT denotes the back-translation method. In order to be comparable with the NMT systems reported in WMT17, all results are reported as case-sensitive BLEU computed with the official tool SacreBLEU.[6]

[6] https://github.com/awslabs/sockeye/tree/master/contrib/sacrebleu

Only Parallel Corpora.

As shown in Table 3, Transformer+JS and Transformer+RT both yield improvements on the test sets of the English-German and Chinese-English translation tasks, which confirms the effectiveness of leveraging the agreement between the L2R and R2L models. Additionally, our approach significantly outperforms Transformer+JS, yielding the best BLEU scores on the English-German and Chinese-English test sets when merely using bilingual data. These results further demonstrate the effectiveness of our method.

Combining with Back-Translation Method.

To verify the effect of our approach when monolingual data is available, we combine it with the back-translation method. We first randomly select 5M German sentences and 12M English sentences from "News Crawl: articles from 2016". Then the German-English and English-Chinese NMT systems trained on the parallel corpora are used to translate these monolingual target sentences.

From Table 3, we find that Transformer-big performs better than Transformer due to its larger number of model parameters. When the back-translation method is employed, Transformer-big+BT achieves 1.21 and 1.5 BLEU point improvements over Transformer-big on the English-German and Chinese-English sets respectively. Our method gains further remarkable improvements on top of back-translation, resulting in the best results on all translation tasks. These results show that the NMT model can still benefit from our proposed approach in semi-supervised learning scenarios. In addition, our single model Transformer-big+BT+RT even achieves the best performance on WMT17's English-German and Chinese-English translation tasks among all reported results, including ensemble systems.

Effect of Joint Training

English-German Chinese-English
Iteration 0 32.58 20.87
Iteration 1 33.86 21.92
Iteration 2 34.56 22.50
Iteration 3 34.58 22.47
Table 4: Translation performance of our method on WMT validation sets during training process. “Iteration 0” denotes baseline Transformer model.

We further investigate the impact of our joint training algorithm during the whole training process. Table 4 shows the BLEU scores on the WMT validation sets in each iteration. In each iteration, we train the NMT models until performance on the development set no longer increases. We find that more iterations consistently lead to better evaluation results, and 2 iterations are enough to reach convergence in our experiments; further iterations bring no noticeable translation accuracy improvements but more training time. As for training cost, since our method is based on pre-trained models, the entire training time is almost twice that of the original MLE training.

Source: (Chinese source sentence)
Reference: The victim’s brother, Louis Galicia, told ABC station KGO in San Francisco that Frank, previously a line cook in Boston, had landed his dream job as line chef at San Francisco’s Sons & Daughters restaurant six months ago.
Transformer: Louis Galicia, the victim’s brother, told ABC radio station KGO in San Francisco that Frank, who used to work as an assembly line cook in Boston, had found an ideal job in the Sons & Daughters restaurant in San Francisco six months ago.
Transformer (R2L): The victim’s brother, Louis Galia, told ABC’s station KGO, in San Francisco, Frank had found an ideal job as a pipeline chef six months ago at Sons & Daughters restaurant in San Francisco.
Transformer+RT: The victim’s brother, Louis Galicia, told ABC radio station KGO in San Francisco that Frank, who previously worked as an assembly line cook in Boston, found an ideal job as an assembly line cook six months ago at Sons & Daughters restaurant in San Francisco.
Table 5: Translation examples of different systems. In the original paper, incorrectly translated text is highlighted with wavy underlines.

Example

In this section, we give a case study to analyze our method. Table 5 provides a Chinese-English translation example from newstest2017. We find that Transformer produces a translation with a good prefix but a bad suffix, while Transformer (R2L) generates a translation with a desirable suffix but an incorrect prefix. In contrast, Transformer+RT produces a high-quality translation in this case, much better than both Transformer and Transformer (R2L). The reason is that leveraging the agreement between the L2R and R2L models in the training stage better punishes the bad suffixes generated by Transformer and encourages the desirable suffixes from Transformer (R2L).

Related Work

Target-bidirectional transduction techniques have been explored in statistical machine translation, under the IBM framework [Watanabe and Sumita2002] and with feature-driven linear models [Finch and Sumita2009, Zhang et al.2013]. Recently, [Liu et al.2016] and [Zhang et al.2018] migrated this method from SMT to NMT by modifying the inference strategy and the decoder architecture of NMT. [Liu et al.2016] propose to generate n-best translation candidates with the L2R and R2L NMT models and leverage the joint probability of the two models to find the best candidate in the combined n-best list. [Zhang et al.2018] design a two-stage decoder architecture for NMT, which generates translation candidates in a right-to-left manner in the first stage and then produces the final translation based on the source sentence and the previously generated R2L translation. Different from their methods, our approach directly exploits the target-bidirectional agreement in the training stage by introducing regularization terms. Without changing the neural network architecture or the inference strategy, our method keeps the same inference speed as the original model.

To handle the exposure bias problem, many methods have been proposed, including designing new training objectives [Shen et al.2016, Wiseman and Rush2016] and adopting reinforcement learning approaches [Ranzato et al.2015, Bahdanau et al.2016]. [Shen et al.2016] attempt to directly minimize the expected loss (i.e., maximize the expected BLEU) with Minimum Risk Training (MRT). [Wiseman and Rush2016] adopt a beam-search optimization algorithm to reduce the inconsistency between training and inference. Besides, [Ranzato et al.2015] propose a mixed training method that performs a gradual transition from MLE training to BLEU score optimization using reinforcement learning. [Bahdanau et al.2016] design an actor-critic algorithm for sequence prediction, in which the NMT system is the actor and a critic network is proposed to predict the value of output tokens. Instead of designing task-specific objective functions or complex training strategies, our approach only adds regularization terms to the standard training objective, which is simple to implement yet effective.

Conclusion

In this paper, we have presented a simple and efficient regularization approach for neural machine translation, which relies on the agreement between L2R and R2L NMT models. In our method, two Kullback-Leibler divergences based on the probability distributions of the L2R and R2L models are added to the standard training objective as regularization terms. An efficient approximation algorithm is designed to enable fast training with the regularized objective, and a joint training strategy is proposed to optimize the L2R and R2L models together. Empirical evaluations on Chinese-English and English-German translation tasks demonstrate that our approach leads to significant improvements over strong baseline systems.

In our future work, we plan to test our method on other sequence-to-sequence tasks, such as summarization and dialogue generation. Besides the back-translation method, it is also worth trying to integrate our approach with other semi-supervised methods to better leverage unlabeled data.

Acknowledgments

This research was partially supported by grants from the National Key Research and Development Program of China (Grant No. 2018YFB1004300), the National Natural Science Foundation of China (Grant No. 61703386), the Anhui Provincial Natural Science Foundation (Grant No. 1708085QF140), and the Fundamental Research Funds for the Central Universities (Grant No. WK2150110006).

Besides, we thank Dongdong Zhang and Ren Shuo for the fruitful discussions. We also thank the anonymous reviewers for their careful reading of our paper and insightful comments.

References