Microsoft Research Asia's Systems for WMT19

11/07/2019 · by Yingce Xia, et al.

We, Microsoft Research Asia, made submissions to 11 language directions in the WMT19 news translation tasks. We won first place in 8 of the 11 directions and second place in the other three. Our basic systems are built on Transformer, back translation, and knowledge distillation. We integrate several of our recent techniques to enhance the baseline systems: multi-agent dual learning (MADL), masked sequence-to-sequence pre-training (MASS), neural architecture optimization (NAO), and soft contextual data augmentation (SCA).







1 Introduction

We participated in the WMT19 shared news translation task in 11 translation directions. We achieved first place in 8 directions: German↔English, German↔French, Chinese→English, English→Lithuanian, English→Finnish, and Russian→English; and second place (ranked by teams) in the other three directions: Lithuanian→English, Finnish→English, and English→Kazakh.

Our basic systems are based on Transformer, back translation, and knowledge distillation. We experimented with several techniques we proposed recently. In brief, the innovations we introduced are:

Multi-agent dual learning (MADL)

The core idea of dual learning is to leverage the duality between the primal task (mapping from domain $\mathcal{X}$ to domain $\mathcal{Y}$) and the dual task (mapping from domain $\mathcal{Y}$ to $\mathcal{X}$) to boost the performance of both tasks. MADL (Wang et al., 2019) extends the dual learning framework (He et al., 2016; Xia et al., 2017a) by introducing multiple primal and dual models. It was integrated into our submitted systems for German↔English and German↔French translations.

Masked sequence-to-sequence pretraining (MASS)

Pre-training and fine-tuning have achieved great success in language understanding. MASS (Song et al., 2019), a pre-training method designed for language generation, adopts the encoder-decoder framework to reconstruct a sentence fragment given the remaining part of the sentence: its encoder takes a sentence with a randomly masked fragment (several consecutive tokens) as input, and its decoder tries to predict this masked fragment. It was integrated into our submitted systems for Chinese→English and English↔Lithuanian translations.

Neural architecture optimization (NAO)

As is well known, the evolution of neural network architectures plays a key role in advancing neural machine translation. Neural architecture optimization (NAO), our recently proposed method (Luo et al., 2018), leverages a gradient-based method to optimize and guide the creation of better neural architectures in a continuous and more compact space, given the historically observed architectures and their performances. It was applied to English↔Finnish translations in our submitted systems.

Soft contextual data augmentation (SCA)

While data augmentation is an important trick for boosting the accuracy of deep learning methods in computer vision tasks, its study in natural language tasks is relatively limited. SCA (Zhu et al., 2019) softly augments a randomly chosen word in a sentence with a contextual mixture of multiple related words, i.e., it replaces the one-hot representation of a word with a distribution over the vocabulary provided by a language model. It was applied to Russian→English translation in our submitted systems.

2 Our Techniques

2.1 Multi-agent dual learning (MADL)

MADL is an enhanced version of dual learning (He et al., 2016; Wang et al., 2018). It leverages $N$ primal translation models $f_1,\dots,f_N$ and $N$ dual translation models $g_1,\dots,g_N$ for training, and eventually outputs one $f_1$ and one $g_1$ for inference, where $f_i\colon\mathcal{X}\mapsto\mathcal{Y}$ and $g_j\colon\mathcal{Y}\mapsto\mathcal{X}$. All these models are pre-trained on bilingual data. The $i$-th primal model has a non-negative weight $\alpha_i$ and the $j$-th dual model has a non-negative weight $\beta_j$, with $\sum_i\alpha_i=1$ and $\sum_j\beta_j=1$. All the $\alpha$'s and $\beta$'s are hyper-parameters. Let $F_\alpha$ denote a combined translation model from $\mathcal{X}$ to $\mathcal{Y}$, and $G_\beta$ a combined translation model from $\mathcal{Y}$ to $\mathcal{X}$. $F_\alpha$ and $G_\beta$ work as follows: for any $x\in\mathcal{X}$ and $y\in\mathcal{Y}$,

$$F_\alpha(x)=\sum_{i=1}^{N}\alpha_i f_i(x),\qquad G_\beta(y)=\sum_{j=1}^{N}\beta_j g_j(y). \tag{1}$$

Let $\mathcal{B}$ denote the bilingual dataset, and let $\mathcal{M}_{\mathcal{X}}$ and $\mathcal{M}_{\mathcal{Y}}$ denote the monolingual data of $\mathcal{X}$ and $\mathcal{Y}$. The training objective function of MADL can be written as follows, where $\Delta(\cdot,\cdot)$ is a reconstruction loss penalizing the mismatch between a monolingual sentence and its round-trip translation:

$$\min_{f_1,g_1}\;-\!\!\sum_{(x,y)\in\mathcal{B}}\!\big[\log f_1(y|x)+\log g_1(x|y)\big]\;+\;\sum_{x\in\mathcal{M}_{\mathcal{X}}}\Delta\big(x,\,G_\beta(F_\alpha(x))\big)\;+\;\sum_{y\in\mathcal{M}_{\mathcal{Y}}}\Delta\big(y,\,F_\alpha(G_\beta(y))\big). \tag{2}$$

Note that $f_2,\dots,f_N$ and $g_2,\dots,g_N$ will not be optimized during training, and we eventually output $f_1$ and $g_1$ for translation. More details can be found in Wang et al. (2019).
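The weighted model combination and round-trip reconstruction loss described above can be sketched numerically. Here the "models" are toy numeric functions standing in for real translation models, and all weights and values are illustrative:

```python
def combine(models, weights):
    """Combined model: F_alpha(x) = sum_i alpha_i * f_i(x), elementwise."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    def combined(x):
        outs = [m(x) for m in models]
        return [sum(w * o[k] for w, o in zip(weights, outs)) for k in range(len(x))]
    return combined

# Two toy primal models (X -> Y) and two toy dual models (Y -> X);
# in MADL only f1/g1 would be updated, the others stay frozen.
f1 = lambda x: [2.0 * t for t in x]
f2 = lambda x: [2.2 * t for t in x]
g1 = lambda y: [t / 2.0 for t in y]
g2 = lambda y: [t / 2.2 for t in y]

F = combine([f1, f2], [0.5, 0.5])  # combined X -> Y model
G = combine([g1, g2], [0.5, 0.5])  # combined Y -> X model

# Round-trip reconstruction loss on a monolingual "sentence" x: this is
# the quantity the dual terms of the MADL objective drive toward zero.
x = [1.0, -2.0, 3.0]
recon_loss = sum((a - b) ** 2 for a, b in zip(x, G(F(x))))
print(recon_loss < 1e-3)  # the combined round trip is nearly identity here
```

In the real systems, $f_i$ and $g_j$ are full NMT models producing distributions over sentences rather than numeric maps, but the weighted-combination structure is the same.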

2.2 Masked sequence-to-sequence pre-training (MASS)

MASS is a pre-training method for language generation. For machine translation, it can leverage monolingual data in two languages to pre-train a translation model. Given a sentence $x$, we denote by $x^{\setminus u:v}$ a modified version of $x$ in which its fragment from position $u$ to $v$ is masked, where $m$ is the number of tokens of sentence $x$ and $0<u<v<m$. We denote by $k=v-u+1$ the number of tokens masked from position $u$ to $v$. We replace each masked token with a special symbol $[\mathbb{M}]$, so the length of the masked sentence is not changed. $x^{u:v}$ denotes the sentence fragment of $x$ from $u$ to $v$.

MASS pre-trains a sequence-to-sequence model by predicting the sentence fragment $x^{u:v}$ taking the masked sequence $x^{\setminus u:v}$ as input. We use the log likelihood as the objective function:

$$L(\theta;\mathcal{X},\mathcal{Y})=\sum_{x\in\mathcal{X}}\log P\big(x^{u:v}\,\big|\,x^{\setminus u:v};\theta\big)+\sum_{y\in\mathcal{Y}}\log P\big(y^{u:v}\,\big|\,y^{\setminus u:v};\theta\big), \tag{3}$$

where $\mathcal{X}$, $\mathcal{Y}$ denote the source and target domains. In addition to the zero/low-resource setting (Leng et al., 2019), we also extend MASS to the supervised setting, where bilingual sentence pairs $(x,y)\in\mathcal{B}$ can be leveraged for pre-training. The log likelihood in the supervised setting is as follows:

$$L(\theta;\mathcal{B})=\sum_{(x,y)\in\mathcal{B}}\Big[\log P\big(y\,\big|\,x^{\setminus u:v};\theta\big)+\log P\big(x\,\big|\,y^{\setminus u:v};\theta\big)+\log P\big(x^{u:v}\,\big|\,[x^{\setminus u:v};y^{\setminus u:v}];\theta\big)+\log P\big(y^{u:v}\,\big|\,[x^{\setminus u:v};y^{\setminus u:v}];\theta\big)+\log P\big(x^{u:v}\,\big|\,y^{\setminus u:v};\theta\big)+\log P\big(y^{u:v}\,\big|\,x^{\setminus u:v};\theta\big)\Big], \tag{4}$$

where $[\cdot\,;\cdot]$ represents the concatenation operation. $P(y|x^{\setminus u:v};\theta)$ and $P(x|y^{\setminus u:v};\theta)$ denote the probability of translating a masked sequence to the other language, which encourages the encoder to extract meaningful representations of the unmasked input tokens in order to predict the masked output sequence. $P(x^{u:v}|[x^{\setminus u:v};y^{\setminus u:v}];\theta)$ and $P(y^{u:v}|[x^{\setminus u:v};y^{\setminus u:v}];\theta)$ denote the probability of generating the masked source/target segment given both the masked source and target sequences, which encourages the model to extract cross-lingual information. $P(x^{u:v}|y^{\setminus u:v};\theta)$ and $P(y^{u:v}|x^{\setminus u:v};\theta)$ denote the probability of generating the masked fragment given only the masked sequence in the other language. More details about MASS can be found in Song et al. (2019).
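To make the masking concrete, here is a toy sketch of how a MASS training example could be constructed from a monolingual sentence. The mask symbol "[M]" and the fragment positions are illustrative; the real method samples the fragment boundaries and operates on subword tokens:

```python
MASK = "[M]"  # illustrative mask symbol

def mass_example(tokens, u, v):
    """Build (encoder_input, decoder_target) for fragment positions u..v
    (0-based, inclusive): the encoder sees the sentence with the fragment
    replaced by mask symbols, and the decoder must predict the fragment."""
    assert 0 <= u <= v < len(tokens)
    enc_input = tokens[:u] + [MASK] * (v - u + 1) + tokens[v + 1:]
    dec_target = tokens[u:v + 1]
    return enc_input, dec_target

sentence = "we participated in the news translation task".split()
enc, dec = mass_example(sentence, 2, 4)
print(enc)  # ['we', 'participated', '[M]', '[M]', '[M]', 'translation', 'task']
print(dec)  # ['in', 'the', 'news']
```

Note that the masked sentence keeps its original length, matching the description above.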

2.3 Neural architecture optimization (NAO)

Figure 1: Visualization of different levels of the search space, from the network, to the layer, to the node. For each of the different layers, we search its unique layer space. The lines in the middle part denote all possible connections between the three nodes (constituting the layer space) as specified via each architecture, while among them the deep black lines indicate the particular connection in Transformer. The right part similarly contains the two branches used in Node2 of Transformer.

NAO (Luo et al., 2018) is a gradient-based neural architecture search (NAS) method. It contains three key components: an encoder, an accuracy predictor, and a decoder, and it optimizes a network architecture as follows. (1) The encoder maps a network architecture $x$ to an embedding vector $e_x$ in a continuous space $\mathcal{E}$. (2) The predictor, a function $f\colon\mathcal{E}\mapsto\mathbb{R}$, takes $e_x$ as input and predicts the dev set accuracy of the architecture $x$. We perform a gradient ascent step, i.e., moving $e_x$ along the direction specified by the gradient $\nabla_{e_x} f(e_x)$, and obtain a new embedding vector $e_{x'}$:

$$e_{x'}=e_x+\eta\,\nabla_{e_x} f(e_x),$$

where $\eta$ is the step size. (3) The decoder maps $e_{x'}$ back to the corresponding architecture $x'$. The new architecture $x'$ is assumed to have better performance than the original $x$ due to the property of gradient ascent. NAO repeats the above three steps to sequentially generate better and better architectures.

To learn a high-quality encoder, decoder, and performance prediction function, it is essential to have a large quantity of paired training data in the form of $(x, s_x)$, where $s_x$ is the dev set accuracy of the architecture $x$. To reduce computational cost, we share weights among different architectures (Pham et al., 2018) to aid the generation of such paired training data.
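The gradient-ascent step in the continuous space can be sketched with a hypothetical accuracy predictor. Here the predictor is a simple quadratic peaking at a made-up optimum; the real predictor is a learned network, and the real embeddings come from the architecture encoder:

```python
E_STAR = [0.3, -0.1, 0.8]  # made-up optimum of the toy predictor

def f(e):
    """Hypothetical accuracy predictor: highest at e == E_STAR."""
    return 1.0 - sum((a - b) ** 2 for a, b in zip(e, E_STAR))

def grad_f(e):
    return [-2.0 * (a - b) for a, b in zip(e, E_STAR)]

e_x = [0.0, 0.0, 0.0]  # embedding of the current architecture
eta = 0.1              # step size
e_new = [a + eta * g for a, g in zip(e_x, grad_f(e_x))]  # gradient ascent

# The decoder (not shown) would map e_new back to a discrete architecture;
# here we just check that the predicted accuracy improved.
assert f(e_new) > f(e_x)
print(round(f(e_new) - f(e_x), 4))  # 0.2664
```

The step improves the predicted accuracy by construction; whether the decoded architecture actually performs better is what the paired training data $(x, s_x)$ is needed for.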

We use NAO to search for powerful neural sequence-to-sequence architectures. The search space is illustrated in Fig. 1. Specifically, each network is composed of $N$ encoder layers and $N$ decoder layers; we set $N=6$ in our experiments. Each encoder layer contains two nodes and each decoder layer contains three nodes. Each node has two branches, each taking the output of another node as input and applying a particular operator (OP), for example identity, self-attention, or convolution, to generate its output. The outputs of the two branches are added together as the output of the node. For each layer, we search: 1) the operator at each branch of every node (for a comprehensive list of the different OPs, please refer to the Appendix of this paper); 2) the topology of connections between nodes within each layer. In the middle part of Fig. 1, we plot the possible connections within the nodes of a layer specified by all candidate architectures, with a particular highlight of Transformer (Vaswani et al., 2017).

To construct the final network, we do not adopt the typical practice of stacking the same layer multiple times. Instead, we assume that the layers in the encoder/decoder can have different architectures, and we directly search for such a personalized architecture for each layer. We found that this design significantly improves performance thanks to the added flexibility.

2.4 Soft contextual data augmentation (SCA)

SCA (Zhu et al., 2019) is a data augmentation technology for NMT, which replaces a randomly chosen word in a sentence with its soft version. For any word $w$, its soft version is a distribution over the vocabulary $V$: $P(w)=\big(p_1(w),p_2(w),\dots,p_{|V|}(w)\big)$, where $p_j(w)\ge 0$ and $\sum_{j=1}^{|V|}p_j(w)=1$.

Given the distribution $P(w)$, one may simply sample a word from it to replace the original word $w$. Differently, we directly use this distribution vector to replace the randomly chosen word $w$ in the original sentence. Suppose $E\in\mathbb{R}^{|V|\times d}$ is the embedding matrix of all the $|V|$ words. The embedding of the soft version of $w$ is

$$e_w=P(w)E=\sum_{j=1}^{|V|}p_j(w)E_j,$$

which is the expectation of the word embeddings over the distribution $P(w)$.

In our systems, we leverage a pre-trained language model to compute $P(w)$, conditioned on all the words preceding $w$. That is, for the $t$-th word $x_t$ in a sentence, we have

$$p_j(x_t)=\mathrm{LM}(w_j\,|\,x_{<t}),$$

where $\mathrm{LM}(w_j|x_{<t})$ denotes the probability of the $j$-th word in the vocabulary appearing after the sequence $x_1,x_2,\dots,x_{t-1}$. The language model is pre-trained using the monolingual data.
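A minimal numeric sketch of the soft-word replacement follows. The vocabulary, 2-d embeddings, and the language-model distribution below are all made up for illustration:

```python
vocab = ["cat", "dog", "car", "tree"]          # made-up vocabulary
E = {"cat":  [1.0, 0.0],                        # made-up 2-d embeddings
     "dog":  [0.8, 0.2],
     "car":  [0.0, 1.0],
     "tree": [0.5, 0.5]}

# Hypothetical LM distribution P(w) for the chosen position: most mass
# on "cat", some on the related word "dog".
P = {"cat": 0.7, "dog": 0.3, "car": 0.0, "tree": 0.0}
assert abs(sum(P.values()) - 1.0) < 1e-9

# Soft-word embedding: the expectation of word embeddings under P.
soft = [round(sum(P[w] * E[w][k] for w in vocab), 6) for k in range(2)]
print(soft)  # [0.94, 0.06]
```

The resulting vector lies between the embeddings of "cat" and "dog", which is exactly the "contextual mixture of multiple related words" described above.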

3 Submitted Systems

3.1 English↔German

We submit constrained systems to both English→German and German→English translation, using the same techniques for both.


We concatenate “Europarl v9”, “News Commentary v14”, “Common Crawl corpus” and “Document-split Rapid corpus” as the basic bilingual dataset. Since the “Paracrawl” data is noisy, we select 20M bilingual sentence pairs from this corpus using the script filter_interactive.py. The two parts of bilingual data are concatenated together. We clean the data by normalizing the sentences, removing non-printable characters, and tokenizing. We share a vocabulary for the two languages and apply BPE for word segmentation (we tried different numbers of BPE merge operations but found no significant differences). For monolingual data, we use English and German sentences from News Crawl and preprocess them in the same way as the bilingual data. We use newstest2016 as the validation set and newstest2018 as the test set.

Model Configuration

We use the PyTorch implementation of Transformer. We choose the Transformer_big setting, in which both the encoder and decoder have six layers. We fix the dropout rate, set the batch size and the --update-freq parameter, and apply the Adam optimizer (Kingma and Ba, 2015).

Training Pipeline

The pipeline consists of three steps:

1. Pre-train two English→German translation models and two German→English translation models on the basic bilingual dataset; pre-train another English→German model and another German→English model on the concatenated bilingual dataset.

2. Apply back translation following Sennrich et al. (2016a) and Edunov et al. (2018). We back-translate the English and German monolingual data using the pre-trained models with beam search, add noise to the translated sentences (Edunov et al., 2018), merge the synthetic data with the bilingual data, and train one English→German model and one German→English model for seven days on eight V100 GPUs.

3. Apply MADL to the models from step 2. That is, the $F_\alpha$ in Eqn. (2) is specified as the combination of the English→German models with equal weights, and $G_\beta$ consists of the German→English models. During training, we only update the two primary models. To speed up training, we randomly select subsets of the monolingual English and German sentences instead of using all of them. This step runs on four P40 GPUs.
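The noise added to back-translated sentences in step 2 can be sketched as follows: word dropout plus a small local shuffle, in the spirit of Edunov et al. (2018). The rates and window size here are illustrative, not the ones used in our systems:

```python
import random

def add_noise(tokens, drop_prob=0.1, shuffle_dist=3, rng=None):
    """Word dropout plus a local shuffle where each kept token can move
    at most `shuffle_dist` positions (cf. Edunov et al., 2018)."""
    rng = rng or random.Random(0)
    # word dropout
    kept = [t for t in tokens if rng.random() > drop_prob]
    if not kept:                       # never return an empty sentence
        kept = list(tokens)
    # local shuffle: perturb each position key by at most shuffle_dist
    keys = [i + rng.uniform(0, shuffle_dist) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept), key=lambda p: p[0])]

src = "this is a back translated sentence".split()
noisy = add_noise(src, rng=random.Random(42))
print(noisy)  # a locally shuffled, possibly shortened copy of src
```

With drop_prob=0 and shuffle_dist=0 the function is the identity, so the noise level is easy to control.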


En→De De→En
news16 news18 news16 news18
Table 1: Results of English↔German by sacreBLEU.

The results are summarized in Table 1, evaluated by sacreBLEU. The baseline is the average accuracy of the models trained using only bitext (the pre-trained models of step 1, for each direction), and BT is the accuracy of the model after back-translation training. As can be seen, back translation improves accuracy; for example, it clearly boosts the BLEU score on news18 English→German translation. MADL boosts BLEU further, demonstrating the effectiveness of our method.

For the final submission, we accumulate many translation models (trained using bitext, back translation, and MADL, with different random seeds) and perform knowledge distillation on the source sentences from the WMT14 to WMT19 test sets. Take English→German translation as an example. Denote the English inputs as $\{x_i\}_{i=1}^{n}$, where $n$ is the size of the combined input set from WMT14 to WMT19. We translate each $x_i$ with the accumulated English→German models and eventually obtain a distilled dataset of (source, translation) pairs. We then randomly select bitext pairs from this dataset and fine-tune the model on them. We stop tuning when the BLEU score on WMT16 (i.e., the validation set) drops.

Our final systems are ranked first for both English→German and German→English on the WMT19 test sets.

3.2 German↔French

For German↔French translation, we follow a process similar to the one used for the English↔German tasks in Section 3.1. We merge “commoncrawl”, “europarl-v7”, and the part of “de-fr.bicleaner07” selected by filter_interactive.py as the bilingual data. We collect monolingual sentences for French and German from News Crawl. The data pre-processing rules and training procedure are the same as those in Section 3.1. We split sentences from “dev08_14” as the validation set and use the remaining ones as the test set.

The results of German↔French translation on the test set are summarized in Table 2.

De→Fr Fr→De
Table 2: Results of German↔French by sacreBLEU.

Again, our method achieves significant improvements over the baselines. Specifically, MADL boosts the baselines of both German→French and French→German.

Our submitted German→French system is a single model trained with MADL, while the French→German system is an ensemble of three independently trained models. Our systems are ranked first for both German→French and French→German on the leaderboard.

3.3 Chinese→English


For Chinese→English translation, we use all the bilingual and monolingual data provided on the WMT official website, as well as extra bilingual and monolingual data crawled from the web. We filter the 24M bilingual pairs from WMT using the script described in Section 3.1 and get 18M sentence pairs. We use Chinese monolingual data from the XMU monolingual corpus and English monolingual data from News Crawl, as well as the English sentences from all English-XX language pairs in WMT. We start from 100M additional parallel sentences drawn from UN data, Open Subtitles, and web-crawled data, which are filtered using the same rule described above as well as fast_align and in/out-domain filters; after filtering we get 38M bilingual pairs. We also crawled 80M additional Chinese monolingual sentences from Sougou, China News, Xinhua News, Sina News, and Ifeng News, and 2M English monolingual sentences from China News and Reuters. We use newstest2017 and newstest2018 of Chinese-English as development datasets.

We normalize Chinese sentences from SBC case to DBC case, remove non-printable characters, and tokenize with both Jieba and PKUSeg to increase diversity. For English sentences, we remove non-printable characters and tokenize with the Moses tokenizer (scripts/tokenizer/tokenizer.perl). We follow previous practice (Hassan et al., 2018) and apply Byte-Pair Encoding (BPE) (Sennrich et al., 2016b) separately to Chinese and English, each with a 40K vocabulary.

MASS Pre-training

We pre-train MASS (Transformer_big) with both monolingual and bilingual data. We use 100M Chinese and 300M English monolingual sentences for the unsupervised setting (Equation 3), and a total of 18M and 56M bilingual sentence pairs for the supervised setting (Equation 4). We share the encoder and decoder for all the losses in Equations 3 and 4. We then fine-tune the MASS pre-trained model on the 18M and 56M bilingual sentence pairs to get baseline translation models for both Chinese→English and English→Chinese.

Back Translation and Knowledge Distillation

We randomly choose 40M monolingual sentences each for Chinese and English for back translation (Sennrich et al., 2016a; He et al., 2016) and knowledge distillation (Kim and Rush, 2016; Tan et al., 2019). We iterate back translation and knowledge distillation multiple times to gradually boost the performance of the model.
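The iterative back translation / knowledge distillation loop can be sketched structurally as follows. Training and decoding are replaced by trivial stand-ins here; the real pipeline trains full NMT models at each step:

```python
def iterate_bt_kd(bitext, mono_src, mono_tgt, rounds, train, translate):
    """Alternate back translation (BT) and knowledge distillation (KD):
    each round regenerates synthetic pairs with the latest models and
    retrains both directions on real + synthetic data."""
    fwd = train(bitext)                            # source -> target
    bwd = train([(y, x) for x, y in bitext])       # target -> source
    for _ in range(rounds):
        bt = [(translate(bwd, y), y) for y in mono_tgt]  # BT pairs
        kd = [(x, translate(fwd, x)) for x in mono_src]  # KD pairs
        data = bitext + bt + kd
        fwd = train(data)
        bwd = train([(y, x) for x, y in data])
    return fwd, bwd

# Toy stand-ins: "training" memorizes pairs, "translation" is lookup.
def train(pairs):
    return dict(pairs)

def translate(model, sentence):
    return model.get(sentence, sentence)

bitext = [("guten tag", "good day")]
fwd, bwd = iterate_bt_kd(bitext, ["guten tag"], ["good day"],
                         rounds=2, train=train, translate=translate)
print(translate(fwd, "guten tag"))  # good day
```

Each round the synthetic data is regenerated with the improved models, which is what allows the iteration to keep helping.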


The results on newstest2017 and newstest2018 are shown in Table 3. We list two baseline Transformer_big systems, which use 18M bilingual pairs (constrained) and 56M bilingual pairs (unconstrained) respectively. The pre-trained model achieves about 1 BLEU point of improvement after fine-tuning on both the 18M and 56M bilingual data. After iterative back translation (BT) and knowledge distillation (KD), as well as re-ranking, our system achieves 30.8 and 30.9 BLEU points on newstest2017 and newstest2018 respectively.

System newstest17 newstest18
Baseline (18M) 24.2 24.5
+ MASS (18M) 25.2 25.4
Baseline (56M) 26.9 27.0
+ MASS (56M) 28.0 27.8
+ Iterative BT/KD 30.4 30.5
+ Reranking 30.8 30.9
Table 3: BLEU scores on Chinese→English test sets.

WMT19 Submission

For the WMT19 submission, we conduct fine-tuning and speculation to further boost accuracy using the source sentences of the WMT19 test set. We first filter the bilingual as well as the pseudo-generated data according to their relevance to the source sentences, using the filter method in Deng et al. (2018), and continue to train the model on the filtered data. Second, we conduct speculation on the test source sentences following the practice in Deng et al. (2018). The final BLEU score of our submission is 39.3, ranked first on the leaderboard.

3.4 English↔Lithuanian

For English↔Lithuanian translation, we follow a process similar to that for the Chinese→English task introduced in Section 3.3. We use all the WMT bilingual data, which amounts to 2.24M pairs after filtering. We use the same English monolingual data as for Chinese→English. We select 100M Lithuanian monolingual sentences from the official Common Crawl and use all the wiki and news Lithuanian monolingual data provided by WMT. In addition, we crawl 5M Lithuanian news sentences from the LRT website. We share the BPE vocabulary between English and Lithuanian, with a vocabulary size of 65K.

All the bilingual and monolingual data are used for MASS pre-training, and all the bilingual data are used for fine-tuning. For iterative back translation and knowledge distillation, we split 24M English monolingual sentences and 12M Lithuanian monolingual sentences into 5 parts through sampling with replacement, and train different models independently to increase diversity for re-ranking/ensembling. Each model uses 8M English monolingual sentences and 6M Lithuanian monolingual sentences. For our WMT19 submission, unlike for Chinese→English, the speculation technique is not used.

The BLEU scores on newsdev19 are shown in Table 4. Our final submissions for WMT19 achieve 20.1 BLEU points for English→Lithuanian translation (ranked first) and 35.6 for Lithuanian→English translation (ranked second).

System En→Lt Lt→En
Baseline 20.7 28.2
MASS + Fine-tune 21.5 28.7
+ Iterative BT/KD 28.3 33.6
+ Reranking 29.1 34.2
Table 4: BLEU scores for English↔Lithuanian on the newsdev19 set.

3.5 English↔Finnish


We use the official English-Finnish data from WMT19, including both bilingual and monolingual data. We de-duplicate the bilingual sentence pairs before use, and share the vocabulary for English and Finnish using BPE units. We use the WMT17 and WMT18 English-Finnish test sets as two development datasets and tune hyper-parameters on their concatenation.

Architecture search

We use NAO to search sequence-to-sequence architectures for the English-Finnish translation task, as introduced in subsection 2.3. We use PyTorch for our implementation. Due to time limitations, we do not aim to find a better neural architecture than Transformer; instead, we target models with performance comparable to Transformer that provide diversity for the re-ranking process. The whole search process runs on P40 GPU cards, and the discovered neural architecture, named NAONet, is visualized in the Appendix.

Train single models

The final system for English-Finnish is obtained through re-ranking of three strong model checkpoints: the Transformer model decoding from left to right (L2R Transformer), the Transformer model decoding from right to left (R2L Transformer), and NAONet decoding from left to right. All the models have 6-6 layers in the encoder/decoder and are obtained using the same process, detailed below.

Step 1: Base models. Train two base models on all the bilingual data, one for English→Finnish and one for Finnish→English translation.

Step 2: Back translation. Perform standard back translation (Sennrich et al., 2016a; He et al., 2016) using the two base models. Specifically, we choose a monolingual English corpus, use a base model to generate pseudo bitext with beam search, and mix it with the bilingual data to continue training the English→Finnish model; the mixing ratio is set through up-sampling. The same process is applied in the opposite direction to obtain the corresponding Finnish→English model.

Step 3: Back translation + knowledge distillation. In this step we generate more pseudo bitext by sequence-level knowledge distillation (Kim and Rush, 2016) in addition to back translation. Concretely, similar to Step 2, we first choose monolingual English and Finnish corpora and generate translations using the two models obtained in Step 2. Then we concatenate all the bilingual data and the pseudo bitext from both directions, and use the whole corpus to train a new English-Finnish model from scratch.

Step 4: Finetune. In this step we try a very simple data selection method to handle the domain mismatch problem in WMT. We remove all the bilingual corpus from Paracrawl, which is generally assumed to be quite noisy (Junczys-Dowmunt, 2018), and use the remaining bilingual corpus to finetune the model from Step 3 for one epoch. The resulting model is set as the final model checkpoint.

newstest17 newstest18
Baseline 26.09 16.07
+BT 28.84 18.54
+BT & KD 29.76 19.13
+Finetune 30.19 19.46
Table 5: BLEU scores of the L2R Transformer on English→Finnish test sets.
newstest17 newstest18
L2R Transformer 30.19 19.46
R2L Transformer 30.40 19.73
NAONet 30.54 19.58
Table 6: The final BLEU scores on English→Finnish test sets for the three models (L2R Transformer, R2L Transformer, and NAONet) after the four steps of training.

To investigate the effects of the four steps, we record the resulting BLEU scores on the WMT17 and WMT18 test sets in Table 5, taking the L2R Transformer model as an example. Furthermore, we report the final BLEU scores of the three models after the four steps in Table 6. All the results are obtained via beam search with a fixed beam size and length penalty. The corresponding results for Finnish-English translation are shown in Table 7.

newstest17 newstest18
L2R Transformer 35.66 25.56
R2L Transformer 35.31 25.56
NAONet 36.18 26.38
Table 7: The final BLEU scores on Finnish→English test sets for the three models (L2R Transformer, R2L Transformer, and NAONet) after the four steps of training.


We use n-best re-ranking to deliver the final translation results, using the three model checkpoints introduced in the last subsection. The weights of the three models, as well as the length penalty in generation, are tuned on the WMT18 test sets. The results are shown in the second row of Table 8.
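A minimal sketch of weighted n-best re-ranking follows. The model names match the three checkpoints above, but the scores and weights are made up for illustration; in practice the weights are tuned on a held-out test set:

```python
def rerank(candidates, model_scores, weights):
    """Pick the candidate with the highest weighted sum of model scores.
    model_scores: model name -> list of log-scores aligned with candidates."""
    def total(i):
        return sum(weights[m] * scores[i] for m, scores in model_scores.items())
    return max(range(len(candidates)), key=total)

cands = ["hypo A", "hypo B", "hypo C"]
scores = {                       # made-up log-scores per model
    "L2R": [-1.0, -0.8, -1.2],
    "R2L": [-0.9, -1.1, -1.0],
    "NAONet": [-1.1, -0.7, -1.3],
}
weights = {"L2R": 0.4, "R2L": 0.3, "NAONet": 0.3}  # tuned on a dev set
best = rerank(cands, scores, weights)
print(cands[best])  # hypo B
```

Because the three models decode differently (left-to-right, right-to-left, and a different architecture), their scores disagree in useful ways, which is the diversity the re-ranking exploits.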

news17 news18 news19
w/ NAONet
31.48 21.21 27.4
w/o NAONet
30.82 20.79 /
Table 8: English→Finnish BLEU scores of re-ranking using the three models. “news” is short for “newstest”.
news17 news18 news19
w/ NAONet
37.54 27.51 31.9
w/o NAONet
36.83 26.99 /
Table 9: Finnish→English BLEU scores of re-ranking using the three models.

We also investigate the influence of NAONet on the re-ranking results. To do so, we replace NAONet in re-ranking with another L2R Transformer model, trained with the same process as in subsection 3.5 and differing only in the random seed, while keeping the other two models unchanged. The results are shown in the last row of Table 8. Comparing the two rows of Table 8, we can see that NAONet, the new architecture discovered via NAO, brings more diversity into re-ranking and thus leads to better results. We report the corresponding results for the Finnish-English task in Table 9.

Our systems achieve 27.4 BLEU for English→Finnish and 31.9 BLEU for Finnish→English on newstest19, ranked first and second (by teams), respectively.

3.6 Russian→English


We use bitext data from several corpora: ParaCrawl, Common Crawl, News Commentary, Yandex Corpus, and the UN Parallel Corpus. We also use the News Crawl corpora as monolingual data. The data is filtered by rules such as sentence length and language identification, resulting in a training dataset with 16M bilingual pairs and 40M monolingual sentences (20M for English and 20M for Russian). We use the WMT17 and WMT18 test sets as development data. The two languages use separate vocabularies, each with 50K BPE merge operations.

Our system

Our final system for Russian→English translation is a combination of the Transformer network (Vaswani et al., 2017), back translation (Sennrich et al., 2016a), knowledge distillation (Kim and Rush, 2016), soft contextual data augmentation (Zhu et al., 2019), and model ensemble. We use Transformer_big as the network architecture. We first train two models, English→Russian and Russian→English, on the bilingual pairs as baseline models. Based on these two models, we perform back translation and knowledge distillation on the monolingual data, generating 40M synthetic sentence pairs. Combining the bilingual and synthetic data, we get a large training corpus with 56M pairs in total. We up-sample the bilingual pairs and shuffle the combined corpus to ensure the balance between bilingual and synthetic data. Finally, we train the Russian→English model from scratch. During training, we also use soft contextual data augmentation to further enhance training. Following the above procedure, 5 different models are trained and ensembled for the final submission.


Our final submission achieves a 40.1 BLEU score, ranked first on the leaderboard. Table 10 reports the results of our system on the development sets.

newstest17 newstest18
Baseline 36.5 32.6
+BT & KD 40.9 35.2
+SCA 41.7 35.6
Table 10: Russian→English BLEU scores.

3.7 English→Kazakh


We notice that most of the parallel data are out of domain. Therefore, we crawl some external data:

(1) We crawl all news articles from a Kazakh-English news website. We match an English news article to a Kazakh one by matching their images with image hashing; in this way, we find 10K pairs of bilingual news articles and use their titles as additional parallel data. These data are in-domain and useful for training.

(2) We crawl 140K parallel sentence pairs from Glosbe. Although most of these sentences are out of domain, they significantly extend the size of our parallel dataset and lead to better results.

Because most of our parallel training data are noisy, we filter them with some rules: (1) for the KazakhTV dataset, we remove any sentence pair with an alignment score less than 0.05; (2) for the Wiki Titles dataset, we remove any sentence pair that starts with "User" or "NGC"; (3) for all datasets, we remove any sentence pair in which the English sentence contains no lowercase letters; (4) for all datasets, we remove any sentence pair whose length ratio is greater than 2.5:1.
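The filtering rules above can be sketched as a simple predicate. The corpus tags, the assumption that the target side is English, and the example pairs are illustrative:

```python
def keep_pair(src, tgt, corpus, align_score=None):
    """Apply the rule-based filters; `tgt` is assumed to be the English side."""
    if corpus == "KazakhTV" and (align_score is None or align_score < 0.05):
        return False                               # rule (1): low alignment score
    if corpus == "WikiTitles" and (src.startswith("User") or src.startswith("NGC")):
        return False                               # rule (2): junk title prefixes
    if not any(c.islower() for c in tgt):          # rule (3): no lowercase letters
        return False
    n_src, n_tgt = len(src.split()), len(tgt.split())
    ratio = max(n_src, n_tgt) / max(1, min(n_src, n_tgt))
    return ratio <= 2.5                            # rule (4): length ratio

print(keep_pair("User talk: foo", "User talk: foo", "WikiTitles"))  # False
print(keep_pair("bir eki ush", "one two three", "Glosbe"))          # True
```

Running such a predicate over each corpus keeps the rules auditable: each rejected pair maps to exactly one numbered rule.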

We tokenize all our data using the Moses Decoder. We learn a shared BPE (Sennrich et al., 2016b) from all our data, including all WMT19 parallel data, WMT19 monolingual data (when learning BPE, the English monolingual data is down-sampled so that the number of English sentences is roughly the same as the number of Kazakh sentences), Glosbe, news titles, and news contents, and get a shared vocabulary of 49,152 tokens. Finally, our dataset consists of 300K bilingual sentence pairs, 700K Kazakh monolingual sentences, and many English monolingual sentences.

Our system

Our model is based on the Transformer (Vaswani et al., 2017). We vary the hyper-parameters to increase the diversity of our models. Our models usually have 6 encoder layers, 6/7 decoder layers, a ReLU/GELU (Hendrycks and Gimpel, 2016) activation function, and an embedding dimension of 640.

We train 4 English→Kazakh models and 4 Kazakh→English models with different random seeds and hyper-parameters. Then we apply back translation (Edunov et al., 2018) and knowledge distillation (Kim and Rush, 2016) for 6 rounds. In each round, we:

1. Sample 4M sentences from the English monolingual data and back-translate them to Kazakh with the best EN→KK model (on the dev set) from the previous round.

2. Back-translate all the Kazakh monolingual data to English with the best KK→EN model from the previous round.

3. Sample 200K sentences from the English monolingual data and translate them to Kazakh using the ensemble of all EN→KK models from the previous round.

4. Train 4 English→Kazakh models with the BT data from step 2 and the KD data from step 3. We up-sample the bilingual sentence pairs 2x.

5. Train 4 Kazakh→English models with the BT data from step 1. We up-sample the bilingual sentence pairs 3x.


Our final submission achieves a 10.6 BLEU score, ranked second (by teams) on the leaderboard.

4 Conclusions

This paper describes Microsoft Research Asia's neural machine translation systems for the WMT19 shared news translation tasks. Our systems are built on Transformer, back translation, and knowledge distillation, enhanced with our recently proposed techniques: multi-agent dual learning (MADL), masked sequence-to-sequence pre-training (MASS), neural architecture optimization (NAO), and soft contextual data augmentation (SCA). Due to time and GPU limitations, we only applied each technique to a subset of translation tasks. We believe that combining them will further improve translation accuracy, and we will conduct such experiments in the future. Furthermore, other techniques such as deliberation learning (Xia et al., 2017b), adversarial learning (Wu et al., 2018b), and reinforcement learning (He et al., 2017; Wu et al., 2018a) could also help and are worth exploring.


This work was supported by the Microsoft Machine Translation team.


  • Y. Deng, S. Cheng, J. Lu, K. Song, J. Wang, S. Wu, L. Yao, G. Zhang, H. Zhang, P. Zhang, et al. (2018) Alibaba’s neural machine translation systems for WMT18. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pp. 368–376. Cited by: §3.3.
  • S. Edunov, M. Ott, M. Auli, and D. Grangier (2018) Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Cited by: §3.1, §3.7.
  • H. Hassan, A. Aue, C. Chen, V. Chowdhary, J. Clark, C. Federmann, X. Huang, M. Junczys-Dowmunt, W. Lewis, M. Li, et al. (2018) Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567. Cited by: §3.3.
  • D. He, H. Lu, Y. Xia, T. Qin, L. Wang, and T. Liu (2017) Decoding with value networks for neural machine translation. In NIPS, pp. 178–187. Cited by: §4.
  • D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W. Ma (2016) Dual learning for machine translation. In NIPS, pp. 820–828. Cited by: §1, §2.1, §3.3, §3.5.
  • D. Hendrycks and K. Gimpel (2016) Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR abs/1606.08415. External Links: Link, 1606.08415 Cited by: §3.7.
  • M. Junczys-Dowmunt (2018) Microsoft’s submission to the WMT2018 news translation task: how I learned to stop worrying and love the data. In Proceedings of the Third Conference on Machine Translation (WMT), Volume 2: Shared Task Papers, pp. 429–434. Cited by: §3.5.
  • Y. Kim and A. M. Rush (2016) Sequence-level knowledge distillation. In EMNLP 2016, Austin, Texas, pp. 1317–1327. External Links: Link, Document Cited by: §3.3, §3.5, §3.6, §3.7.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. International Conference on Learning Representation (ICLR). Cited by: §3.1.
  • Y. Leng, X. Tan, T. Qin, X. Li, and T. Liu (2019) Unsupervised pivot translation for distant languages. arXiv preprint arXiv:1906.02461. Cited by: §2.2.
  • R. Luo, F. Tian, T. Qin, E. Chen, and T. Liu (2018) Neural architecture optimization. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 7816–7827. External Links: Link Cited by: §1, §2.3.
  • H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean (2018) Efficient neural architecture search via parameters sharing. ICML, pp. 4092–4101. Cited by: §2.3.
  • R. Sennrich, B. Haddow, and A. Birch (2016a) Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 86–96. Cited by: §3.1, §3.3, §3.5, §3.6.
  • R. Sennrich, B. Haddow, and A. Birch (2016b) Neural machine translation of rare words with subword units. 54th Annual Meeting of the Association for Computational Linguistics (ACL). Cited by: §3.3, §3.7.
  • K. Song, X. Tan, T. Qin, J. Lu, and T. Liu (2019) MASS: masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450. Cited by: §1, §2.2.
  • X. Tan, Y. Ren, D. He, T. Qin, and T. Liu (2019) Multilingual neural machine translation with knowledge distillation. In ICLR, External Links: Link Cited by: §3.3.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, pp. 5998–6008. Cited by: §2.3, §3.6, §3.7.
  • Y. Wang, Y. Xia, L. Zhao, J. Bian, T. Qin, G. Liu, and T. Liu (2018) Dual transfer learning for neural machine translation with marginal distribution regularization. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2.1.
  • Y. Wang, Y. Xia, T. He, F. Tian, T. Qin, C. Zhai, and T. Liu (2019) Multi-agent dual learning. In ICLR, External Links: Link Cited by: §1, §2.1.
  • L. Wu, F. Tian, T. Qin, J. Lai, and T. Liu (2018a) A study of reinforcement learning for neural machine translation. In EMNLP 2018, pp. 3612–3621. Cited by: §4.
  • L. Wu, Y. Xia, F. Tian, L. Zhao, T. Qin, J. Lai, and T. Liu (2018b) Adversarial neural machine translation. In Asian Conference on Machine Learning, pp. 534–549. Cited by: §4.
  • Y. Xia, T. Qin, W. Chen, J. Bian, N. Yu, and T. Liu (2017a) Dual supervised learning. In ICML, pp. 3789–3798. Cited by: §1.
  • Y. Xia, F. Tian, L. Wu, J. Lin, T. Qin, N. Yu, and T. Liu (2017b) Deliberation networks: sequence generation beyond one-pass decoding. In NIPS, pp. 1784–1794. Cited by: §4.
  • J. Zhu, F. Gao, L. Wu, Y. Xia, T. Qin, W. Zhou, X. Cheng, and T. Liu (2019) Soft contextual data augmentation for neural machine translation. In ACL 2019, Cited by: §1, §2.4, §3.6.