Joint Training for Neural Machine Translation Models with Monolingual Data

by Zhirui Zhang, et al.

Monolingual data have been demonstrated to be helpful in improving the translation quality of both statistical machine translation (SMT) systems and neural machine translation (NMT) systems, especially in resource-poor or domain adaptation tasks where parallel data are not rich enough. In this paper, we propose a novel approach to better leveraging monolingual data for neural machine translation by jointly learning source-to-target and target-to-source NMT models for a language pair with a joint EM optimization method. The training process starts with two initial NMT models pre-trained on parallel data for each direction, and these two models are iteratively updated by incrementally decreasing translation losses on training data. In each iteration step, both NMT models are first used to translate monolingual data from one language to the other, forming pseudo-training data for the other NMT model. Then two new NMT models are learnt from parallel data together with the pseudo-training data. Both NMT models are expected to improve, so that better pseudo-training data can be generated in the next step. Experimental results on Chinese-English and English-German translation tasks show that our approach can simultaneously improve the translation quality of source-to-target and target-to-source models, significantly outperforming strong baseline systems that are enhanced with monolingual data for model training, including back-translation.





Neural machine translation (NMT) performs end-to-end translation based on an encoder-decoder framework [Kalchbrenner and Blunsom 2013; Cho et al. 2014; Sutskever, Vinyals, and Le 2014; Bahdanau, Cho, and Bengio 2014] and has obtained state-of-the-art performance on many language pairs [Luong, Pham, and Manning 2015; Sennrich, Haddow, and Birch 2016b; Tu et al. 2016; Wu et al. 2016]. In the encoder-decoder framework, an encoder first transforms the source sequence into vector representations, based on which a decoder generates the target sequence. Such a framework brings appealing properties over traditional phrase-based statistical machine translation (SMT) systems [Koehn, Och, and Marcu 2003; Chiang 2007], such as requiring little human feature engineering or prior domain knowledge. On the other hand, to train the large number of parameters in the encoder and decoder networks, most NMT systems rely heavily on high-quality parallel data and perform poorly in resource-poor or domain-specific tasks. Unlike bilingual data, monolingual data are usually much easier to collect and more diverse, and have been attractive resources for improving machine translation models since the 1990s, when data-driven machine translation systems were first built.

Monolingual data play a key role in training SMT systems. Additional target monolingual data are usually required to train a powerful language model, which is an important feature of an SMT system’s log-linear model. The use of source-side monolingual data in SMT has also been explored. Ueffing et al. (2007) introduced a transductive semi-supervised learning method, in which source monolingual sentences are translated and filtered to build pseudo bilingual data, which are added to the original bilingual data to re-train the SMT model.

For NMT systems, Gulcehre et al. (2015) first tried both shallow and deep fusion methods to integrate an external RNN language model into the encoder-decoder framework. The shallow fusion method simply linearly combines the translation probability and the language model probability, while the deep fusion method connects the RNN language model with the decoder to form a new tightly coupled network. Instead of introducing an explicit language model, Cheng et al. (2016) proposed an auto-encoder-based method which encodes and reconstructs monolingual sentences, in which source-to-target and target-to-source NMT models serve as the encoder and decoder respectively.
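As a rough illustration of shallow fusion, the sketch below rescores candidate next words by adding a weighted language-model log-probability to the translation-model log-probability. Combining in log space, the function name `shallow_fusion_scores`, the weight `beta` and the toy distributions are all assumptions made for illustration, not details taken from the cited work.

```python
import math

def shallow_fusion_scores(tm_probs, lm_probs, beta=0.2):
    """Log-linear combination of translation-model and language-model scores.

    tm_probs, lm_probs: dicts mapping candidate words to probabilities.
    beta: hypothetical interpolation weight for the language model.
    """
    return {
        word: math.log(tm_probs[word]) + beta * math.log(lm_probs[word])
        for word in tm_probs
    }

# Toy next-word distributions (made-up numbers).
tm = {"house": 0.5, "home": 0.4, "hose": 0.1}
lm = {"house": 0.2, "home": 0.7, "hose": 0.1}

scores = shallow_fusion_scores(tm, lm, beta=1.0)
best = max(scores, key=scores.get)  # the language model tips the choice to "home"
```

In this toy case the translation model alone would pick "house", while the fused score prefers "home" because the language model strongly favours it.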

Sennrich, Haddow, and Birch (2016a) proposed back-translation for data augmentation as another way to leverage the target monolingual data. In this method, both the NMT model and the training algorithm are kept unchanged; instead, a new approach to constructing training data is employed. That is, target monolingual sentences are translated into the source language with a pre-constructed machine translation system, and the resulting sentence pairs are used as additional parallel data to re-train the source-to-target NMT model. Although back-translation has been proven to be robust and effective, one major obstacle to further improvement is the quality of the training data automatically generated from monolingual sentences. Because the machine translation system is imperfect, some of the incorrect translations are very likely to hurt the performance of the source-to-target model.
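The data construction step of back-translation can be sketched as follows. This is only a minimal illustration: `back_translate` and `toy_t2s` are hypothetical names, and the word-by-word lexicon stands in for a real target-to-source translation system.

```python
def back_translate(target_sentences, translate_t2s):
    """Pair each target sentence with a machine-generated source sentence."""
    return [(translate_t2s(y), y) for y in target_sentences]

def toy_t2s(sentence):
    # Stand-in for a real target-to-source system: a word-by-word lookup.
    lexicon = {"hallo": "hello", "welt": "world"}
    return " ".join(lexicon.get(word, "<unk>") for word in sentence.split())

bilingual = [("good morning", "guten morgen")]   # real (source, target) pairs
monolingual_target = ["hallo welt", "hallo"]     # target-side monolingual data

synthetic = back_translate(monolingual_target, toy_t2s)
training_data = bilingual + synthetic            # used to re-train source-to-target
```

The target side of every synthetic pair is genuine text, so any translation errors end up only on the source side of the training data.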

In this paper, we present a novel method for making extended usage of monolingual data from both the source side and the target side by jointly optimizing a source-to-target NMT model $M_{x \to y}$ and a target-to-source NMT model $M_{y \to x}$ through an iterative process. In each iteration, these two models serve as helper machine translation systems for each other as in back-translation: $M_{y \to x}$ is used to generate pseudo-training data for model $M_{x \to y}$ with target-side monolingual data, and $M_{x \to y}$ is used to generate pseudo-training data for model $M_{y \to x}$ with source-side monolingual data. The key advantage of our new approach compared with existing work is that the training process can be repeated to obtain further improvements, because after each iteration both $M_{x \to y}$ and $M_{y \to x}$ are expected to be improved with additional pseudo-training data. Therefore, in the next iteration, better pseudo-training data can be generated with these two improved models, resulting in an even better $M_{x \to y}$ and $M_{y \to x}$, and so on.

To jointly optimize the two models in both directions, we design a new semi-supervised training objective, with which the generated training sentence pairs are weighted so that the negative impact of noisy translations can be minimized. Original bilingual sentence pairs are all weighted as 1, while the synthetic sentence pairs are weighted by the normalized model output probability. Similar to the post-processing step described in Ueffing et al. (2007), our weighting mechanism also plays an important role in improving the final translation performance. As we will show in the paper, the overall iterative training process essentially adds a joint EM estimation over the monolingual data to the MLE estimation over bilingual data: the E-step tries to estimate the expectations of translations of the monolingual data, while the M-step updates model parameters with the smoothed translation probability estimation.

Our experiments are conducted on NIST OpenMT’s Chinese-English translation task and WMT’s English-German translation task. Experimental results demonstrate that our joint training method can significantly improve translation quality of both source-to-target and target-to-source models, compared with back-translation and other strong baselines.

Neural Machine Translation

In this section, we will first briefly introduce the NMT model used in our work. The NMT model follows the attention-based architecture proposed by Bahdanau, Cho, and Bengio (2014), and it is implemented as an encoder-decoder framework with recurrent neural networks (RNN). RNNs are usually implemented as Gated Recurrent Unit (GRU) networks [Cho et al. 2014] (adopted in our work) or Long Short-Term Memory (LSTM) networks [Hochreiter and Schmidhuber 1997]. The whole architecture can be divided into three components: encoder, decoder and attention mechanism.


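The encoder reads the source sentence $x = (x_1, \ldots, x_{T_x})$ and transforms it into a sequence of hidden states $h = (h_1, \ldots, h_{T_x})$, using a bi-directional RNN. At each time stamp $t$, the hidden state $h_t$ is defined as the concatenation of the forward and backward RNN hidden states $[\overrightarrow{h}_t; \overleftarrow{h}_t]$, where $\overrightarrow{h}_t = \mathrm{RNN}(x_t, \overrightarrow{h}_{t-1})$ and $\overleftarrow{h}_t = \mathrm{RNN}(x_t, \overleftarrow{h}_{t+1})$.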


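The decoder uses another RNN to generate the translation $y$ based on the hidden states $h$ generated by the encoder. At each time stamp $i$, the conditional probability of each word $y_i$ from a target vocabulary $V_y$ is computed by

$$p(y_i \mid y_{<i}, h) = \mathrm{softmax}(g(y_{i-1}, z_i, c_i)) \tag{1}$$

where $g$ denotes the output layer of the decoder and $z_i$ is the $i$-th hidden state of the decoder, which is calculated conditioned on the previous hidden state $z_{i-1}$, the previous word $y_{i-1}$ and the source context vector $c_i$:

$$z_i = \mathrm{RNN}(z_{i-1}, y_{i-1}, c_i) \tag{2}$$

where the source context vector $c_i$ is computed by the attention mechanism.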

Attention Mechanism

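The context vector $c_i$ is a weighted sum of the encoder hidden states $(h_1, \ldots, h_{T_x})$, where $T_x$ is the source length, with the coefficients $\alpha_{ij}$ computed from the previous decoder hidden state $z_{i-1}$ by

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j, \qquad \alpha_{ij} = \frac{\exp(a(z_{i-1}, h_j))}{\sum_{k=1}^{T_x} \exp(a(z_{i-1}, h_k))} \tag{3}$$

where $a$ is a feed-forward neural network with a single hidden layer.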
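The attention computation can be illustrated with a small pure-Python sketch. The alignment scores are passed in directly and stand in for the outputs of the feed-forward network; the function name `attention` and the toy vectors are assumptions for illustration only.

```python
import math

def attention(scores, hidden_states):
    """Softmax the alignment scores and return (weights, context vector)."""
    top = max(scores)                               # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, context

# Two encoder states with equal alignment scores give uniform weights.
hidden = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attention([0.0, 0.0], hidden)
```

With equal scores the weights are uniform, so the context vector is simply the mean of the two encoder states.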

MLE Training

NMT systems are usually trained to maximize the conditional log-probability of the correct translation given a source sentence with respect to the parameters $\theta$ of the model:

$$\theta^{*} = \arg\max_{\theta} \sum_{n=1}^{N} \sum_{i=1}^{|y^{(n)}|} \log p(y_i^{(n)} \mid y_{<i}^{(n)}, x^{(n)}) \tag{4}$$

where $N$ is the size of the training corpus, and $|y^{(n)}|$ is the length of the target sentence $y^{(n)}$.

As with most deep learning models, the model parameters $\theta$ have to be learnt with fully labeled data, which in the machine translation task means parallel sentence pairs $(x^{(n)}, y^{(n)})$, while monolingual data cannot be directly applied to model training.
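To make the MLE criterion concrete, the following sketch evaluates the corpus log-likelihood from per-token probabilities that a model is assumed to have already assigned to the reference words. The function name `mle_objective` and the toy probabilities are hypothetical.

```python
import math

def mle_objective(corpus_token_probs):
    """Sum of token log-probabilities over all sentences in the corpus.

    corpus_token_probs: one list per sentence pair, holding the model
    probability of each reference target word given its history and source.
    """
    return sum(
        math.log(p)
        for token_probs in corpus_token_probs
        for p in token_probs
    )

# Two toy sentence pairs with made-up token probabilities.
corpus = [[0.5, 0.5], [0.25]]
value = mle_objective(corpus)   # log(0.25) + log(0.25)
```

Training adjusts the parameters so that this quantity increases, which is only possible when every target sentence comes with a paired source sentence.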

Joint Training for Paired NMT Models

Figure 1: Illustration of joint-EM training of NMT models in two directions ($M_{x \to y}$ and $M_{y \to x}$) using both source ($X$) and target ($Y$) monolingual corpora, combined with bilingual data $D$. $X'$ is the synthetic data with probability $p(x' \mid y)$ generated by translating $Y$ using $M_{y \to x}$, and $Y'$ is the synthetic data with probability $p(y' \mid x)$ generated by translating $X$ using $M_{x \to y}$.

Back-translation fills the gap between the requirement for parallel data and the availability of monolingual data in NMT model training with the help of machine translation systems. Specifically, given a set of sentences $\{y^{(t)}\}$ in the target language $Y$, a pre-constructed target-to-source machine translation system $M_{y \to x}$ is used to automatically generate their translations $\{x^{(t)}\}$ in the source language $X$. Then the synthetic sentence pairs $\{(x^{(t)}, y^{(t)})\}$ are used as additional parallel data to train the source-to-target NMT model $M_{x \to y}$, together with the original bilingual data.

Our work follows this parallel data synthesis approach, but extends the task setting from solely improving the source-to-target NMT model with target monolingual data to a paired one: we aim to jointly optimize a source-to-target NMT model $M_{x \to y}$ and a target-to-source NMT model $M_{y \to x}$ with the aid of monolingual data from both the source language $X$ and the target language $Y$. Unlike back-translation, in which both automatic translation and NMT model training are performed only once, our method runs the machine translation of monolingual data and updates the NMT models $M_{x \to y}$ and $M_{y \to x}$ through several iterations. At each iteration step, $M_{x \to y}$ and $M_{y \to x}$ serve as each other’s pseudo-training data generators: $M_{y \to x}$ is used to translate $Y$ into $X'$ for $M_{x \to y}$, while $M_{x \to y}$ is used to translate $X$ into $Y'$ for $M_{y \to x}$.

Algorithm 1: Joint Training Algorithm for NMT

procedure Pre-training
    Initialize $M_{x \to y}$ and $M_{y \to x}$ with random weights $\theta_{x \to y}$ and $\theta_{y \to x}$;
    Pre-train $M_{x \to y}$ and $M_{y \to x}$ on bilingual data $D$ with Equation 4;
end procedure
procedure Joint-training
    while not converged do
        Use $M_{y \to x}$ to generate back-translations for $Y$ and build the pseudo-parallel corpus $X'$;    (E-step for $M_{x \to y}$)
        Use $M_{x \to y}$ to generate back-translations for $X$ and build the pseudo-parallel corpus $Y'$;    (E-step for $M_{y \to x}$)
        Train $M_{x \to y}$ with Equation 10 given the weighted bilingual corpora $D \cup X'$;    (M-step for $M_{x \to y}$)
        Train $M_{y \to x}$ with Equation 12 given the weighted bilingual corpora $D \cup Y'$;    (M-step for $M_{y \to x}$)
    end while
end procedure
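The control flow of the joint-training procedure can be sketched in Python as below. This is an illustrative skeleton under several assumptions: the `translate` and `train` callables, the dictionary "models" and the fixed iteration count (used in place of a convergence test) are all hypothetical stand-ins rather than the actual NMT components.

```python
def joint_training(bilingual, mono_src, mono_tgt, s2t, t2s,
                   translate, train, iterations=4):
    """Alternate E-steps (back-translation) and M-steps (weighted training).

    translate(model, sentence) -> list of (translation, weight)
    train(model, triples)      -> updated model; triples are (input, output, weight)
    """
    for _ in range(iterations):
        # E-steps: both models label data before either one is updated.
        pseudo_s2t = [(x, y, w) for y in mono_tgt for x, w in translate(t2s, y)]
        pseudo_t2s = [(y, x, w) for x in mono_src for y, w in translate(s2t, x)]
        # M-steps: real pairs get weight 1, synthetic pairs keep their weight.
        real_s2t = [(x, y, 1.0) for x, y in bilingual]
        real_t2s = [(y, x, 1.0) for x, y in bilingual]
        s2t = train(s2t, real_s2t + pseudo_s2t)
        t2s = train(t2s, real_t2s + pseudo_t2s)
    return s2t, t2s

def toy_translate(model, sentence):
    # Stand-in decoder: one hypothesis (the reversed string) with weight 0.5.
    return [(sentence[::-1], 0.5)]

def toy_train(model, triples):
    # Stand-in trainer: just records how much weighted data it saw.
    return {"updates": model["updates"] + 1,
            "weight_sum": sum(w for _, _, w in triples)}

s2t, t2s = joint_training(
    bilingual=[("a b", "c d")], mono_src=["e f"], mono_tgt=["g h"],
    s2t={"updates": 0, "weight_sum": 0.0},
    t2s={"updates": 0, "weight_sum": 0.0},
    translate=toy_translate, train=toy_train, iterations=2)
```

Generating both pseudo corpora before either training call mirrors the ordering of the E-steps and M-steps in the listing.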

The joint training process is illustrated in Figure 1, in which the first two iterations are shown. Before the first iteration starts, two initial translation models $M^{0}_{x \to y}$ and $M^{0}_{y \to x}$ are pre-trained with parallel data $D$. This step is denoted as iteration 0 for the sake of consistency.

In iteration 1, two NMT systems based on $M^{0}_{x \to y}$ and $M^{0}_{y \to x}$ are first used to translate the monolingual data $X$ and $Y$, which forms two synthetic training data sets $Y'_{0}$ and $X'_{0}$. Models $M^{1}_{x \to y}$ and $M^{1}_{y \to x}$ are then trained on the updated training data obtained by combining $X'_{0}$ and $Y'_{0}$ with the parallel data $D$. It is worth noting that we use n-best translations from an NMT system, and the selected translations are weighted with the translation probabilities from the NMT model.
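One plausible way to turn n-best translation scores into pair weights is sketched below. Normalizing over the n-best list is an assumption about what the normalization involves, and `nbest_weights` and the toy hypotheses are hypothetical.

```python
import math

def nbest_weights(log_probs):
    """Normalize n-best log-probabilities into weights that sum to 1."""
    top = max(log_probs)                       # subtract max for stability
    exps = [math.exp(lp - top) for lp in log_probs]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 3-best list of (hypothesis, model log-probability).
nbest = [
    ("the cat sat", math.log(0.06)),
    ("a cat sat", math.log(0.03)),
    ("cat sat", math.log(0.01)),
]
weights = nbest_weights([lp for _, lp in nbest])
weighted_pairs = [(hyp, w) for (hyp, _), w in zip(nbest, weights)]
```

Under this scheme a hypothesis that the model itself finds unlikely contributes proportionally less to the next round of training.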

In iteration 2, the above process is repeated, but the synthetic training data are re-generated with the updated NMT models $M^{1}_{x \to y}$ and $M^{1}_{y \to x}$, which are presumably more accurate. In turn, the learnt NMT models $M^{2}_{x \to y}$ and $M^{2}_{y \to x}$ are also expected to improve over the first iteration.

The formal algorithm is listed in Algorithm 1, which is divided into two major steps: pre-training and joint training. As we will show in the next section, the joint training step essentially adds an EM (Expectation-Maximization) process over the monolingual data in both source and target languages. (Note that the training criterion on parallel data is still MLE, i.e. maximum likelihood estimation.)

Training Objective

Next we will show how to derive our new learning objective for joint training, starting with the case that only one NMT model is involved.

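Given the parallel corpus $D = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$ and the monolingual corpus $Y = \{y^{(t)}\}_{t=1}^{T}$ in the target language, the semi-supervised training objective is to maximize the likelihood of both bilingual data and monolingual data:

$$L^{*}(\theta_{x \to y}) = \sum_{n=1}^{N} \log p(y^{(n)} \mid x^{(n)}) + \sum_{t=1}^{T} \log p(y^{(t)}) \tag{5}$$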
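where the first term on the right side denotes the likelihood of bilingual data and the second term represents the likelihood of target-side monolingual data. Next we introduce the source translations as hidden states for the target sentences $y^{(t)}$ and decompose $\log p(y^{(t)})$ as

$$\log p(y^{(t)}) = \log \sum_{x} p(x, y^{(t)}) = \log \sum_{x} Q(x) \frac{p(x, y^{(t)})}{Q(x)} \geq \sum_{x} Q(x) \log \frac{p(x, y^{(t)})}{Q(x)} = \sum_{x} Q(x) \log p(y^{(t)} \mid x) - \mathrm{KL}(Q(x) \,\|\, p(x)) \tag{6}$$

where $x$ is a latent variable representing the source translation of the target sentence $y^{(t)}$, $Q(x)$ is the approximated probability distribution of $x$, $p(x)$ represents the marginal distribution of sentence $x$, and $\mathrm{KL}(Q(x) \,\|\, p(x))$ is the Kullback-Leibler divergence between the two probability distributions. For the inequality in Equation 6 to hold with equality, $Q(x)$ must satisfy the following condition

$$\frac{p(x, y^{(t)})}{Q(x)} = c \tag{7}$$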
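where $c$ is a constant and does not depend on $x$. Given $\sum_{x} Q(x) = 1$, $Q(x)$ can be calculated as

$$Q(x) = \frac{p(x, y^{(t)})}{c} = \frac{p(x, y^{(t)})}{\sum_{x'} p(x', y^{(t)})} = p^{*}(x \mid y^{(t)}) \tag{8}$$

where $p^{*}(x \mid y^{(t)})$ denotes the true target-to-source translation probability. Since it is usually not possible to calculate $p^{*}(x \mid y^{(t)})$ in practice, we use the translation probability $p(x \mid y^{(t)})$ given by a target-to-source NMT model as $Q(x)$. Combining Equation 5 and 6, we have

$$L^{*}(\theta_{x \to y}) \geq L(\theta_{x \to y}) = \sum_{n=1}^{N} \log p(y^{(n)} \mid x^{(n)}) + \sum_{t=1}^{T} \Big[ \sum_{x} p(x \mid y^{(t)}) \log p(y^{(t)} \mid x) - \mathrm{KL}(p(x \mid y^{(t)}) \,\|\, p(x)) \Big] \tag{9}$$

This means $L(\theta_{x \to y})$ is a lower bound of the true likelihood function $L^{*}(\theta_{x \to y})$. Since $\mathrm{KL}(p(x \mid y^{(t)}) \,\|\, p(x))$ is irrelevant to the parameters $\theta_{x \to y}$, $L(\theta_{x \to y})$ can be simplified as

$$L(\theta_{x \to y}) = \sum_{n=1}^{N} \log p(y^{(n)} \mid x^{(n)}) + \sum_{t=1}^{T} \sum_{x} p(x \mid y^{(t)}) \log p(y^{(t)} \mid x) \tag{10}$$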
The first part of $L(\theta_{x \to y})$ is the same as the MLE training, while the second part can be optimized with the EM algorithm. We can estimate the expectation of the source translation probability $p(x \mid y^{(t)})$ in the E-step, and maximize the second part in the M-step. The E-step uses the target-to-source translation model $M_{y \to x}$ to generate the source translations as hidden variables, which are paired with the target sentences to build a new distribution of training data together with the true parallel data $D$. Therefore maximizing $L(\theta_{x \to y})$ can be approximated by maximizing the log-likelihood on the new training data. The translation probability $p(x \mid y^{(t)})$ is used as the weight of the pseudo sentence pairs, which helps with filtering out bad translations.
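The weighted log-likelihood maximized in the M-step can be evaluated as in the sketch below, where real pairs count fully and each pseudo pair is scaled by its translation probability. The name `weighted_log_likelihood` and the constant toy model are hypothetical and serve only to show the arithmetic.

```python
import math

def weighted_log_likelihood(real_pairs, pseudo_pairs, model_prob):
    """Log-likelihood with weight 1 on real pairs and model weights on pseudo pairs.

    real_pairs:   list of (source, target)
    pseudo_pairs: list of (synthetic_source, target, weight)
    model_prob:   callable returning p(target | source) under the current model
    """
    real = sum(math.log(model_prob(x, y)) for x, y in real_pairs)
    pseudo = sum(w * math.log(model_prob(x, y)) for x, y, w in pseudo_pairs)
    return real + pseudo

def constant_model(source, target):
    # Toy model that assigns probability 0.5 to every pair.
    return 0.5

real = [("x1", "y1")]
pseudo = [("x2a", "y2", 0.75), ("x2b", "y2", 0.25)]
objective = weighted_log_likelihood(real, pseudo, constant_model)
```

Because the two pseudo weights sum to one, the monolingual sentence `y2` contributes exactly as much as one real pair in this toy case.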

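It is easy to verify that the back-translation approach [Sennrich, Haddow, and Birch 2016a] is a special case of this formulation of $L(\theta_{x \to y})$, in which $p(x \mid y^{(t)}) = 1$ because only the best translation $\hat{x}^{(t)}$ from the NMT model $M_{y \to x}$ is used:

$$L(\theta_{x \to y}) = \sum_{n=1}^{N} \log p(y^{(n)} \mid x^{(n)}) + \sum_{t=1}^{T} \log p(y^{(t)} \mid \hat{x}^{(t)}) \tag{11}$$
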
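Similarly, the likelihood of the NMT model $M_{y \to x}$ can be derived as

$$L(\theta_{y \to x}) = \sum_{n=1}^{N} \log p(x^{(n)} \mid y^{(n)}) + \sum_{s=1}^{S} \sum_{y} p(y \mid x^{(s)}) \log p(x^{(s)} \mid y) \tag{12}$$

where $y$ is a target translation (hidden state) of the source sentence $x^{(s)}$ from the source monolingual corpus $X = \{x^{(s)}\}_{s=1}^{S}$. The overall training objective is the sum of the likelihoods in both directions:

$$L(\theta) = L(\theta_{x \to y}) + L(\theta_{y \to x}) \tag{13}$$

During the derivation of $L(\theta_{x \to y})$, we use the translation probability $p(x \mid y^{(t)})$ from $M_{y \to x}$ as the approximation of the true distribution $p^{*}(x \mid y^{(t)})$. When $p(x \mid y^{(t)})$ gets closer to $p^{*}(x \mid y^{(t)})$, we can get a tighter lower bound of $L^{*}(\theta_{x \to y})$, gaining more opportunities to improve $M_{x \to y}$. Joint training of paired NMT models is designed to solve this problem if source monolingual data are also available.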
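The effect of the approximating distribution on the lower bound can be checked numerically on a toy example. The sketch below uses a made-up joint distribution over two candidate source sentences for one fixed target sentence; `lower_bound` and all numbers are illustrative assumptions.

```python
import math

def lower_bound(joint, q):
    """Jensen lower bound: sum over x of Q(x) * log(p(x, y) / Q(x))."""
    return sum(q[x] * math.log(joint[x] / q[x]) for x in joint if q[x] > 0)

# Toy joint probabilities p(x, y) for one fixed target sentence y.
joint = {"x1": 0.3, "x2": 0.1}
marginal = sum(joint.values())
log_py = math.log(marginal)                     # true log p(y)

true_posterior = {x: p / marginal for x, p in joint.items()}
rough_guess = {"x1": 0.5, "x2": 0.5}

tight = lower_bound(joint, true_posterior)      # equals log p(y)
loose = lower_bound(joint, rough_guess)         # strictly below log p(y)
```

The bound is exact when the approximating distribution equals the true posterior and degrades as the approximation gets worse, which is the motivation for improving the target-to-source model as well.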



We evaluate our proposed approach on two language pairs: Chinese↔English and English↔German. In all experiments, we use BLEU [Papineni et al. 2002] as the evaluation metric for translation quality.

| Direction | System | NIST2006 | NIST2003 | NIST2005 | NIST2008 | NIST2012 | Average |
|---|---|---|---|---|---|---|---|
| C→E | RNNSearch | 38.61 | 39.39 | 38.31 | 30.04 | 28.48 | 34.97 |
| C→E | RNNSearch+M | 40.66 | 43.26 | 41.61 | 32.48 | 31.16 | 37.83 |
| C→E | SS-NMT | 41.53 | 44.03 | 42.24 | 33.40 | 31.58 | 38.56 |
| C→E | JT-NMT | 42.56 | 45.10 | 44.36 | 34.10 | 32.26 | 39.67 |
| E→C | RNNSearch | 17.75 | 18.37 | 17.10 | 13.14 | 12.85 | 15.84 |
| E→C | RNNSearch+M | 21.28 | 21.19 | 19.53 | 16.47 | 15.86 | 18.87 |
| E→C | SS-NMT | 21.62 | 22.00 | 19.70 | 17.06 | 16.48 | 19.37 |
| E→C | JT-NMT | 22.56 | 22.98 | 20.95 | 17.62 | 17.39 | 20.30 |

Table 1: Case-insensitive BLEU scores (%) on Chinese↔English translation. “Average” denotes the average BLEU score of all datasets in the same setting. “C” and “E” denote Chinese and English respectively.


For Chinese↔English translation, we select our training data from LDC corpora (the corpora include LDC2002E17, LDC2002E18, LDC2003E07, LDC2003E14, LDC2005E83, LDC2005T06, LDC2005T10, LDC2006E17, LDC2006E26, LDC2006E34, LDC2006E85, LDC2006E92, LDC2006T06, LDC2004T08, LDC2005T10), which consist of 2.6M sentence pairs with 65.1M Chinese words and 67.1M English words. We use 8M Chinese sentences and 8M English sentences randomly extracted from the Xinhua portion of the Gigaword corpus as the monolingual data sets. Any sentence longer than 60 words is removed from the training data (both the bilingual data and the pseudo bilingual data). For Chinese-English, the NIST OpenMT 2006 evaluation set is used as the validation set, and the NIST 2003, NIST 2005, NIST 2008 and NIST 2012 datasets as test sets. In both validation and test data sets, each Chinese sentence has four reference translations. For English-Chinese, we use the NIST datasets in the reverse direction, treating the first English sentence of the four reference translations as the source sentence and the Chinese sentence as the single reference. We limit the vocabulary to the 50K most frequent words on both the source and target side, and convert the remaining words into the <unk> token.
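The length filter and vocabulary truncation can be sketched as follows. Dropping a pair when either side exceeds the limit, the helper names `build_vocab` and `preprocess`, and the tiny vocabulary size are assumptions chosen to keep the example small.

```python
from collections import Counter

def build_vocab(sentences, max_size):
    """Keep the max_size most frequent whitespace-separated words."""
    counts = Counter(word for s in sentences for word in s.split())
    return {word for word, _ in counts.most_common(max_size)}

def preprocess(pairs, src_vocab, tgt_vocab, max_len=60):
    """Drop over-long pairs and map out-of-vocabulary words to <unk>."""
    def mask(sentence, vocab):
        return " ".join(w if w in vocab else "<unk>" for w in sentence.split())

    kept = []
    for src, tgt in pairs:
        if len(src.split()) > max_len or len(tgt.split()) > max_len:
            continue  # assumption: a pair is removed if either side is too long
        kept.append((mask(src, src_vocab), mask(tgt, tgt_vocab)))
    return kept

pairs = [("a a b", "c c d"), ("a " * 61, "c")]
src_vocab = build_vocab([s for s, _ in pairs], max_size=1)   # {"a"}
tgt_vocab = build_vocab([t for _, t in pairs], max_size=1)   # {"c"}
cleaned = preprocess(pairs, src_vocab, tgt_vocab, max_len=60)
```

In the experiments described here the corresponding settings would be a vocabulary size of 50K and a maximum length of 60 words.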

For English↔German translation, we choose the WMT’14 training corpus used in Jean et al. (2015). This training corpus contains 4.5M sentence pairs with 116M English words and 110M German words. For monolingual data, we randomly select 8M English sentences and 8M German sentences from “News Crawl: articles from 2012” provided by WMT’14. The concatenation of news-test 2012 and news-test 2013 is used as the validation set and news-test 2014 as the test set. The maximal sentence length is also set to 60. We use 50K sub-word tokens as the vocabulary, based on Byte Pair Encoding [Sennrich, Haddow, and Birch 2016b].

Implementation Details

The RNNSearch model proposed by Bahdanau, Cho, and Bengio (2014) is adopted as our baseline, which uses a single-layer GRU-RNN for the encoder and another for the decoder. The size of the word embedding (for both source and target words) is 256 and the size of the hidden layer is set to 1024. The parameters are initialized using a normal distribution with a mean of 0 and a variance that depends on the number of rows and columns in the structure [Glorot and Bengio 2010]. Our models are optimized with the Adadelta [Zeiler 2012] algorithm with mini-batch size 128. We re-normalize the gradient if its norm is larger than 2.0 [Pascanu, Mikolov, and Bengio 2013]. At test time, beam search with size 8 is employed to find the best translation, and translation probabilities are normalized by the length of the translation sentences. In the post-processing step, we follow the work of Luong et al. (2015) to handle <unk> replacement for Chinese↔English translation.
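Gradient re-normalization by norm can be written in a few lines. This is a generic sketch of norm-based clipping on a flat list of gradient values, with the hypothetical name `renormalize_gradient`; it is not tied to any particular training framework.

```python
import math

def renormalize_gradient(grad, max_norm=2.0):
    """Rescale the gradient so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)          # small gradients are left unchanged
    scale = max_norm / norm
    return [g * scale for g in grad]

clipped = renormalize_gradient([3.0, 4.0])      # norm 5.0 is rescaled to 2.0
unchanged = renormalize_gradient([0.3, 0.4])    # norm 0.5 is kept as is
```

Rescaling preserves the direction of the update while bounding its size, which helps against exploding gradients in recurrent networks.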

For building the synthetic bilingual data in our approach, beam size is set to 4 to speed up the decoding process. In practice, we first sort all monolingual data according to the sentence length and then 64 sentences are simultaneously translated with parallel decoding implementation. As for model training, we found that 4-5 EM iterations are enough to converge. The best model is selected according to the BLEU scores on the validation set during EM process.
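Sorting by length before batching keeps sentences of similar length together, so that little computation is wasted on padding during parallel decoding. A minimal sketch, with the hypothetical helper `length_sorted_batches` and whitespace tokenization assumed:

```python
def length_sorted_batches(sentences, batch_size=64):
    """Sort sentences by word count, then cut them into fixed-size batches."""
    ordered = sorted(sentences, key=lambda s: len(s.split()))
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

monolingual = ["a b c", "a", "a b", "a b c d"]
batches = length_sorted_batches(monolingual, batch_size=2)
```

With the setting described above, `batch_size` would be 64.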


Our proposed joint-training approach is compared with three NMT baselines for all translation tasks:

  • RNNSearch: Attention-based NMT system [Bahdanau, Cho, and Bengio 2014]. Only bilingual corpora are used to train a standard attention-based NMT model.

  • RNNSearch+M: Bilingual and target-side monolingual corpora are used to train RNNSearch. We follow Sennrich, Haddow, and Birch (2016b) to construct pseudo-parallel corpora by generating source-language sentences with back-translation of target-side monolingual data.

  • SS-NMT: Semi-supervised NMT training proposed by Cheng et al. (2016). To be fair in all experiments, their method adopts the same settings as our approach, including the same source and target monolingual data.

| System | Architecture | E→D | D→E |
|---|---|---|---|
| Jean et al. (2015) | Gated RNN with search + PosUnk | 18.97 | - |
| Jean et al. (2015) | Gated RNN with search + PosUnk + 500K vocabs | 19.40 | - |
| Shen et al. (2016) | Gated RNN with search + PosUnk + MRT | 20.45 | - |
| Luong, Pham, and Manning (2015) | LSTM with 4 layers + dropout + local att. + PosUnk | 20.90 | - |
| RNNSearch | Gated RNN with search + BPE | 19.78 | 24.91 |
| RNNSearch+M | Gated RNN with search + BPE + monolingual data | 21.89 | 26.81 |
| SS-NMT | Gated RNN with search + BPE + monolingual data | 22.64 | 27.30 |
| JT-NMT | Gated RNN with search + BPE + monolingual data | 23.60 | 27.98 |

Table 2: Case-sensitive BLEU scores (%) on English↔German translation. “PosUnk” denotes the technique of Luong et al. (2015) for handling rare words. “MRT” denotes minimum risk training proposed in Shen et al. (2016). “BPE” denotes Byte Pair Encoding proposed by Sennrich, Haddow, and Birch (2016b) for word segmentation. “D” and “E” denote German and English respectively.

Chinese↔English Translation Result

Table 1 shows the evaluation results of different models on NIST datasets, in which JT-NMT represents our joint training for NMT using monolingual data. All the results are reported based on case-insensitive BLEU.

Compared with RNNSearch, RNNSearch+M, SS-NMT and JT-NMT all bring significant improvements across different test sets. Our approach achieves the best result, with average improvements of 4.70 and 4.46 BLEU points over RNNSearch for Chinese-to-English and English-to-Chinese respectively. These results confirm that exploiting massive monolingual corpora improves translation performance.
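The reported average gains can be reproduced directly from the per-dataset scores in Table 1. The short check below copies the RNNSearch and JT-NMT rows; the dictionary layout and key names are just an illustrative choice. The computation also suggests that the “Average” column is taken over all five datasets, including the NIST2006 validation set.

```python
# Scores in the order NIST2006, NIST2003, NIST2005, NIST2008, NIST2012.
table1 = {
    ("C2E", "RNNSearch"): [38.61, 39.39, 38.31, 30.04, 28.48],
    ("C2E", "JT-NMT"): [42.56, 45.10, 44.36, 34.10, 32.26],
    ("E2C", "RNNSearch"): [17.75, 18.37, 17.10, 13.14, 12.85],
    ("E2C", "JT-NMT"): [22.56, 22.98, 20.95, 17.62, 17.39],
}

def average(scores):
    return sum(scores) / len(scores)

gain_c2e = average(table1[("C2E", "JT-NMT")]) - average(table1[("C2E", "RNNSearch")])
gain_e2c = average(table1[("E2C", "JT-NMT")]) - average(table1[("E2C", "RNNSearch")])
```

Both gains agree with the reported figures up to rounding of the per-system averages.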

Table 1 also shows that JT-NMT performs better than RNNSearch+M across different test sets, with average improvements of 1.84 and 1.43 BLEU points in the Chinese-to-English and English-to-Chinese directions respectively. Compared with RNNSearch+M, our joint training approach introduces data weights to better handle poor pseudo-training data, and the joint interactive training lets the models of the two directions boost each other, instead of only using the target-to-source model to help the source-to-target model. Our approach also yields better translations than SS-NMT, with an average improvement of at least 0.93 BLEU points. This result shows that our method makes better use of both source and target monolingual corpora than the approach of Cheng et al. (2016).

English↔German Translation Result

For the English↔German translation task, in addition to the baseline system, we also include results of other existing NMT systems, including Jean et al. (2015), Shen et al. (2016) and Luong, Pham, and Manning (2015). In order to be comparable with other work, all the results are reported based on case-sensitive BLEU. Experimental results are shown in Table 2.

We can observe that the baseline RNNSearch with the BPE method achieves better results than Jean et al. (2015), even better than their result using a larger vocabulary of size 500K. Compared with RNNSearch, RNNSearch+M, SS-NMT and JT-NMT bring significant improvements in both the English-to-German and German-to-English directions, which confirms the effectiveness of leveraging monolingual corpora. Our approach outperforms RNNSearch+M and SS-NMT by a notable margin and obtains the best BLEU scores of 23.60 and 27.98 on the English-to-German and German-to-English test sets respectively. These experimental results further confirm the effectiveness of our joint training mechanism, similar to what is shown in the Chinese↔English translation tasks.

Effect of Joint Training

Figure 2: BLEU scores (%) on Chinese↔English and English↔German validation and test sets for JT-NMT during the training process, with panels (a) Chinese-English, (b) English-Chinese, (c) German-English and (d) English-German translation. “Dev” denotes the results on validation datasets, while “Test” denotes the results on test datasets.

We further investigate the impact of our joint training approach JT-NMT during the whole training process. Figure 2 shows the BLEU scores on the Chinese↔English and English↔German validation and test sets in each iteration. More iterations generally lead to better evaluation results, which indicates that the joint training of NMT models in two directions can boost their translation performance.

In Figure 2, “Iteration 0” gives the BLEU scores of the baseline RNNSearch, and the first few iterations gain the most, especially “Iteration 1”. After three iterations, we no longer obtain significant improvements. As discussed previously, as the target-to-source model approaches the ideal translation probability, the lower bound of the loss gets closer to the true loss. The closer the lower bound is to the true loss during training, the smaller the potential gain. Since there is a lot of uncertainty during training, the performance sometimes drops a little.

JT-NMT (Iteration 1) can be considered a generalization of RNNSearch+M, in which every pseudo sentence pair is weighted as 1. From Table 3, we can see that JT-NMT (Iteration 1) slightly surpasses RNNSearch+M in all four translation directions, which indicates that the weights introduced in our algorithm can reduce the influence of poor synthetic data and lead to better performance. Our approach assigns low weights to synthetic sentence pairs with poor translations, so as to limit their effect on the model update. The translations are refined and improved in subsequent iterations, as shown in Table 4, which lists translation results of a Chinese sentence in different iterations.

| System | C→E | E→C | D→E | E→D |
|---|---|---|---|---|
| RNNSearch+M | 37.83 | 18.87 | 26.81 | 21.89 |
| JT-NMT (Iteration 1) | 38.23 | 19.10 | 27.07 | 22.20 |

Table 3: BLEU scores (%) on Chinese↔English and English↔German translation tasks. For Chinese↔English translation, we list the average results of all test sets. For English↔German translation, we list the results on news-test2014.
Source: 当 终场 哨声 响 起 , 意大利 首都 罗马 沸腾 了 。
Pinyin: dang zhongchang shaosheng xiang qi , yidali shoudu luoma feiteng le .
Reference: when the final whistle sounded , the italian capital of rome boiled .
[Iteration 0]: the italian capital of rome was boiling with the rome .
[Iteration 1]: the italian capital of rome was boiling with the sound of the end of the door .
[Iteration 4]: when the final whistle sounded , the italian capital of rome was boiling .
Table 4: Example translations of a Chinese sentence in different iterations.

Related Work

Neural machine translation has drawn more and more attention in recent years [Bahdanau, Cho, and Bengio 2014; Luong, Pham, and Manning 2015; Jean et al. 2015; Tu et al. 2016; Wu et al. 2016]. For the original NMT system, only parallel corpora can be used for model training with the MLE method, therefore much research in the literature attempts to exploit massive monolingual corpora. Gulcehre et al. (2015) first investigate the integration of monolingual data for neural machine translation. They train monolingual language models independently, which are integrated into the NMT system with the proposed shallow and deep fusion methods. Sennrich, Haddow, and Birch (2016a) propose to generate synthetic bilingual data by translating the target monolingual sentences into source language sentences, and the mixture of the original bilingual data and the synthetic parallel data is used to retrain the NMT system. As an extension of their approach, our approach introduces translation probabilities from the target-to-source model as weights of synthetic parallel sentences to punish poor pseudo parallel sentences, and further uses interactive training of NMT models in two directions to refine them.

Recently, Zhang and Zong (2016) propose a multi-task learning framework to exploit source-side monolingual data, in which they jointly perform machine translation on synthetic bilingual data and sentence reordering with source-side monolingual data. Cheng et al. (2016) reconstruct monolingual data with an auto-encoder, in which the source-to-target and target-to-source translation models form a closed loop and are jointly updated. Different from their method, our approach extends Sennrich, Haddow, and Birch (2016a) by directly introducing source-side monolingual data to improve reverse NMT models and adopts the EM algorithm to iteratively update bidirectional NMT models. Our approach can better exploit both target and source monolingual data, while they show no improvement when using both target and source monolingual data compared with using just target monolingual data. He et al. (2016) treat the source-to-target and target-to-source models as the primal and dual tasks respectively. Similar to the work of Cheng et al. (2016), they also employ round-trip translations for each monolingual sentence to obtain feedback signals. Ramachandran, Liu, and Le adopt pre-trained weights of two language models to initialize the encoder and decoder of a seq2seq model, and then fine-tune it with labeled data. Their approach is complementary to our mechanism: pre-trained language models could be used to initialize the bidirectional NMT models, which may lead to additional gains.


In this paper, we propose a new semi-supervised training approach that integrates the training of a pair of translation models in a unified learning process with the help of monolingual data from both source and target sides. In our method, a joint-EM training algorithm is employed to optimize two translation models cooperatively, in which the two models are able to mutually boost their translation performance. The translation probability of the other model is used as the weight to estimate translation accuracy and punish bad translations. Empirical evaluations are conducted on Chinese↔English and English↔German translation tasks, and demonstrate that our approach leads to significant improvements compared with strong baseline systems. In future work, we plan to extend this method to jointly train multiple NMT systems for three or more languages using massive monolingual data.


This research was partially supported by grants from the National Natural Science Foundation of China (Grants No. 61727809, 61325010 and U1605251). We appreciate Dongdong Zhang, Shuangzhi Wu, Wenhu Chen, Guanlin Li for the fruitful discussions. We also thank the anonymous reviewers for their careful reading of our paper and insightful comments.