Multilingual Denoising Pre-training for Neural Machine Translation

by Yinhan Liu, et al.

This paper demonstrates that multilingual denoising pre-training produces significant performance gains across a wide variety of machine translation (MT) tasks. We present mBART – a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective. mBART is the first method for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages; previous MT pre-training has focused only on the encoder, decoder, or reconstructing parts of the text. Pre-training a complete model allows it to be directly fine-tuned for supervised (both sentence-level and document-level) and unsupervised machine translation, with no task-specific modifications. We demonstrate that adding mBART initialization produces performance gains in all but the highest-resource settings, including up to 12 BLEU points for low resource MT and over 5 BLEU points for many document-level and unsupervised models. We also show that it enables new types of transfer to language pairs with no bi-text or that were not in the pre-training corpus, and present extensive analysis of which factors contribute the most to effective pre-training.





1 Introduction

Despite its wide adoption for other NLP tasks devlin2018bert; liu2019roberta; yang2019xlnet; lewis2019bart; raffel2019exploring, self-supervised pre-training is not yet common practice in machine translation (MT). Existing MT approaches only pre-train parts of the model, including the encoder lample2019cross and the decoder edunov2019pre, use pre-training objectives that only reconstruct parts of the text song2019mass, or focus only on English corpora lewis2019bart; raffel2019exploring. In this paper, we show that significant performance gains are possible by pre-training a complete autoregressive model with an objective that noises and reconstructs full texts across many languages.

In this work, we present mBART – a multilingual sequence-to-sequence (Seq2Seq) denoising auto-encoder. mBART is trained by applying the BART objective lewis2019bart to large-scale monolingual corpora across many languages. The input texts are noised by masking phrases and permuting sentences, and a single Transformer vaswani2017attention model is learned to recover the texts. Different from other pre-training approaches for MT lample2019cross; song2019mass, mBART pre-trains a complete autoregressive Seq2Seq model. mBART is trained once for all languages, providing a set of parameters that can be fine-tuned for any of the language pairs in both supervised and unsupervised settings, without any task-specific or language-specific modifications or initialization schemes.

Extensive experiments demonstrate that this simple approach works remarkably well. We first focus on existing MT benchmarks. For supervised sentence-level MT, mBART initialization leads to significant gains (up to 12 BLEU points) across low/medium-resource pairs (<10M bi-text pairs), without sacrificing performance in high-resource settings. These results further improve with back-translation (BT), setting a new state of the art on WMT16 English-Romanian and the FloRes test sets. For document-level MT, our document-level pre-training improves results by up to 5.5 BLEU points. For the unsupervised case, we see consistent gains and produce the first non-degenerate results for less related language pairs (e.g., a 9.5 BLEU gain on Nepali-English). Previous pre-training schemes have only considered subsets of these tasks, but we compare performance where possible and demonstrate that mBART consistently performs the best.

We also show that mBART enables new types of transfer across language pairs. For example, fine-tuning on bi-text in one language pair (e.g., Korean-English) creates a model that can translate from all other languages in the monolingual pre-training set (e.g., Italian-English), with no further training. We also show that languages not in pre-training corpora can benefit from mBART, strongly suggesting that the initialization is at least partially language universal. Finally, we present a detailed analysis of which factors contribute the most to effective pre-training, including the number of languages and their overall similarity.

Figure 1: Framework for our Multilingual Denoising Pre-training (left) and fine-tuning on downstream MT tasks (right), where we use (1) sentence permutation (2) word-span masking as the injected noise. A special language id token is added at both the encoder and decoder. One multilingual pre-trained model is used for all tasks.

2 Multilingual Denoising Pre-training

We use a large-scale Common Crawl (CC) corpus (section 2.1) to pre-train BART models (section 2.2). Our experiments in the later sections involve fine-tuning a range of models pre-trained on different subsets of the CC languages (section 2.3).

2.1 Data: CC25 corpus


We pre-train on a subset of 25 languages – CC25 – extracted from the Common Crawl (CC) (wenzek2019ccnet; conneau2019unsupervised). CC25 includes languages from different families and with varied amounts of text (Table 1). Following lample2019cross, we re-balanced the corpus by up/down-sampling text from each language i with a ratio λ_i:

λ_i = (1/p_i) · (p_i^α / Σ_j p_j^α)

where p_i is the percentage of each language in CC25. We use the smoothing parameter α = 0.7.
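The re-balancing ratio λ_i = (1/p_i) · (p_i^α / Σ_j p_j^α) with α = 0.7 can be sketched in a few lines of Python. This is a minimal illustration: the GB sizes from Table 1 stand in for the true token counts, which is a simplification.

```python
# Illustrative computation of the language up/down-sampling ratio
# lambda_i = (1/p_i) * (p_i^alpha / sum_j p_j^alpha), with alpha = 0.7.
# Corpus sizes (GB) from Table 1 are used as a proxy for token counts.

def sampling_ratios(sizes_gb, alpha=0.7):
    total = sum(sizes_gb.values())
    p = {lang: s / total for lang, s in sizes_gb.items()}  # p_i: language share
    z = sum(q ** alpha for q in p.values())                # normalizer sum_j p_j^alpha
    return {lang: (1.0 / q) * (q ** alpha / z) for lang, q in p.items()}

ratios = sampling_ratios({"En": 300.8, "Ru": 278.0, "My": 1.6})
```

With α < 1, frequent languages such as English are down-sampled (λ < 1) and rare languages such as Burmese are up-sampled (λ > 1), flattening the language distribution seen during pre-training.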

Code Language Tokens/M Size/GB
En English 55608 300.8
Ru Russian 23408 278.0
Vi Vietnamese 24757 137.3
Ja Japanese 530 (*) 69.3
De German 10297 66.6
Ro Romanian 10354 61.4
Fr French 9780 56.8
Fi Finnish 6730 54.3
Ko Korean 5644 54.2
Es Spanish 9374 53.3
Zh Chinese (Sim) 259 (*) 46.9
It Italian 4983 30.2
Nl Dutch 5025 29.3
Ar Arabic 2869 28.0
Tr Turkish 2736 20.9
Hi Hindi 1715 20.2
Cs Czech 2498 16.3
Lt Lithuanian 1835 13.7
Lv Latvian 1198 8.8
Kk Kazakh 476 6.4
Et Estonian 843 6.1
Ne Nepali 237 3.8
Si Sinhala 243 3.6
Gu Gujarati 140 1.9
My Burmese 56 1.6
Table 1: Languages and statistics of the CC25 corpus: a list of 25 languages ranked by monolingual corpus size. Throughout this paper, we replace the language names with their ISO codes for simplicity. (*) The Chinese and Japanese corpora are not segmented, so the token counts here are sentence counts.


We tokenize with a sentence-piece model (SPM, kudo-richardson-2018-sentencepiece) learned on the full CC data that includes 250,000 subword tokens. While not all of these languages are used for pre-training, this tokenization supports fine-tuning on additional languages. We do not apply additional preprocessing, such as true-casing or normalizing punctuation/characters.

2.2 Model: mBART

Our models follow the BART lewis2019bart sequence-to-sequence pre-training scheme, as reviewed in this section. While BART was pre-trained only on English, we systematically study the effects of pre-training on different sets of languages.


We use a standard sequence-to-sequence Transformer architecture vaswani2017attention, with 12 layers of encoder and 12 layers of decoder, a model dimension of 1024 on 16 heads (~680M parameters). We include an additional layer-normalization layer on top of both the encoder and decoder, which we found stabilized training at FP16 precision.


Our training data covers K languages: D = {D_1, ..., D_K}, where each D_i is a collection of monolingual documents in language i. We (1) assume access to a noising function g, defined below, that corrupts text, and (2) train the model to predict the original text X given g(X). More formally, we aim to maximize L_θ:

L_θ = Σ_{D_i ∈ D} Σ_{X ∈ D_i} log P(X | g(X); θ),

where X is an instance in language i and the distribution P is defined by the Seq2Seq model.

Noise function

Following lewis2019bart, we use two types of noise in g. We first remove spans of text and replace them with a mask token. We mask 35% of the words in each instance, with span lengths sampled from a Poisson distribution (λ = 3.5). We also permute the order of sentences within each instance. The decoder input is the original text with one position offset. A language id symbol <LID> is used as the initial token to predict the sentence. It is also possible to use other noise types, such as those in lample2018phrase, but we leave the exploration of the optimal noising strategy to future work.
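The two noise types can be sketched as follows. This is a simplified, word-level illustration (a real implementation operates on subword tokens and handles span sampling exactly as in BART); the `<mask>` token name and the span handling here are assumptions for the sketch.

```python
import math
import random

def sample_poisson(rng, lam):
    # Knuth's inversion sampler for a Poisson-distributed span length.
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def add_noise(sentences, mask_ratio=0.35, lam=3.5, seed=0):
    """Simplified sketch of the noise function g: permute sentence order,
    then mask ~35% of the words with span lengths drawn from Poisson(3.5).
    Each masked span is replaced by a single <mask> token, as in BART."""
    rng = random.Random(seed)
    order = sentences[:]
    rng.shuffle(order)                          # (1) sentence permutation
    tokens = " ".join(order).split()
    budget = int(mask_ratio * len(tokens))
    masked, out = 0, tokens[:]
    while masked < budget:
        span = max(1, sample_poisson(rng, lam))
        start = rng.randrange(len(out))
        end = min(start + span, len(out))
        masked += sum(t != "<mask>" for t in out[start:end])
        out[start:end] = ["<mask>"]             # (2) word-span masking
    return " ".join(out)
```

The model is then trained to reconstruct the original, un-noised instance from this corrupted input.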

Instance format

For each instance of a batch, we sample a language id symbol <LID>, and we pack as many consecutive sentences as possible sampled from the corresponding corpus of <LID>, until either it hits the document boundary or reaches the max token length. Sentences in the instance are separated by the end of sentence (</S>) token. Then, we append the selected <LID> token to represent the end of this instance. Pre-training at “multi-sentence” level enables us to work on both sentence and document translation.
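The packing procedure above can be sketched as follows; the whitespace joining and the exact length accounting are simplifications (a real implementation counts subword tokens), but the </S> separators and the trailing <LID> follow the format described.

```python
def pack_instances(document, lang_id, max_tokens=512):
    """Sketch of the instance format: pack consecutive sentences from one
    document, separating them with </S>, until the instance would exceed
    the maximum length; each instance ends with the language id token."""
    instances, current, length = [], [], 0
    for sentence in document:
        n = len(sentence.split()) + 1                 # +1 for the </S> separator
        if current and length + n > max_tokens - 1:   # keep one slot for <LID>
            instances.append(" ".join(current + [lang_id]))
            current, length = [], 0
        current.append(sentence + " </S>")
        length += n
    if current:
        instances.append(" ".join(current + [lang_id]))
    return instances
```

Because instances span multiple sentences of a document (up to the length limit), the same pre-trained model can later be fine-tuned for either sentence-level or document-level translation.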


Our full model (including all 25 languages) is trained on 256 Nvidia V100 GPUs (32GB) for 500K steps. The total batch size is around 128K tokens per GPU, matching the BART lewis2019bart configuration. We use the Adam optimizer (ε = 1e-6, β₂ = 0.98) and linear learning rate decay scheduling. The total training time was approximately 2.5 weeks. We started the training with dropout 0.1 and reduced it to 0.05 at 250K steps and 0 at 400K steps. All experiments are done with Fairseq ott2019fairseq.

        En-Gu      En-Kk      En-Vi        En-Tr        En-Ja        En-Ko
Size    10K        91K        133K         207K         223K         230K
Random  0.0 / 0.0  0.8 / 0.2  23.6 / 24.8  12.2 / 9.5   10.4 / 12.3  15.3 / 16.3
mBART25 0.3 / 0.1  7.4 / 2.5  36.1 / 35.4  22.5 / 17.8  19.1 / 19.4  24.6 / 22.6

        En-Nl        En-Ar        En-It        En-My        En-Ne       En-Ro
Size    237K         250K         250K         259K         564K        608K
Random  34.6 / 29.3  27.5 / 16.9  31.7 / 28.0  23.3 / 34.9  7.6 / 4.3   34.0 / 34.3
mBART25 43.3 / 34.8  37.6 / 21.6  39.8 / 34.0  28.3 / 36.9  14.5 / 7.4  37.8 / 37.7

        En-Si       En-Hi        En-Et        En-Lt        En-Fi        En-Lv
Size    647K        1.56M        1.94M        2.11M        2.66M        4.50M
Random  7.2 / 1.2   10.9 / 14.2  22.6 / 17.9  18.1 / 12.1  21.8 / 20.2  15.6 / 12.9
mBART25 13.7 / 3.3  23.5 / 20.8  27.8 / 21.4  22.4 / 15.3  28.5 / 22.4  19.3 / 15.9

Table 2: Low/Medium Resource Machine Translation. Pre-training consistently improves over a randomly initialized baseline, with particularly large gains on low resource language pairs (e.g., Vi-En). Each cell shows X→En / En→X BLEU.
Languages Cs Es Zh De Ru Fr
Size 11M 15M 25M 28M 29M 41M
Random 16.5 33.2 35.0 30.9 31.5 41.4
mBART25 18.0 34.0 33.3 30.5 31.3 41.0

Table 3: High Resource Machine Translation where all the datasets are from their latest WMT competitions. We only evaluate our models on En-X translation.

2.3 Pre-trained Models

To better measure the effects of different levels of multilinguality during pre-training, we built a range of models as follows:


  • mBART25  We pre-train a model on all 25 languages, using the setting described in section 2.2.

  • mBART06  To explore the effect of pre-training on related languages, we pre-train a model on a subset of six European languages: Ro, It, Cs, Fr, Es and En. For a fair comparison, we use 1/4 of the mBART25 batch size, which allows our model to have the same number of updates per language during pre-training.

  • mBART02   We pre-train bilingual models, using English and one other language, for four language pairs: En-De, En-Ro, En-It and En-My. We use a batch size of 1/12 of that in mBART25.

  • BART-En/Ro  To help establish baseline performance levels, we also train monolingual BART models on the same En and Ro corpus only.

  • Random  As additional baselines, we will also include a comparison with a model randomly initialized without pre-training for each translation task. Since the sizes of different downstream datasets vary, we always grid-search the hyper-parameters (architecture, dropout, etc.) to find the best non-pretrained configuration.

All models use the same vocabulary (section 2.1). Not all tokens will frequently occur in all pre-training corpora, but later experiments show that this large vocabulary can improve generalization in multilingual settings even for unseen languages.

3 Sentence-level Machine Translation

This section shows that mBART pre-training provides consistent performance gains in low to medium resource sentence-level MT settings, including bi-text only and with back translation, and outperforms other existing pre-training schemes (section 3.2). We also present a detailed analysis to understand better which factors contribute the most to these gains (section 3.3), and show that pre-training can even improve performance for languages not present in the pre-training data at all (section 3.4).

3.1 Experimental Settings


We gather pairs of publicly available parallel corpora that cover all the languages in CC25 (Table 1). Most pairs are from previous WMT (Gu, Kk, Tr, Ro, Et, Lt, Fi, Lv, Cs, Es, Zh, De, Ru, Fr ↔ En) and IWSLT (Vi, Ja, Ko, Nl, Ar, It ↔ En) competitions. We also use FLoRes pairs (guzman-etal-2019-flores, En-Ne and En-Si), En-Hi from IITB DBLP:journals/corr/abs-1710-02855, and En-My from WAT19 ding2018nova; ding2019towards. We divide the datasets into three categories – low resource (<1M sentence pairs), medium resource (>1M and <10M), and high resource (>10M).

Fine-tuning & Decoding

We fine-tune our multilingual pre-trained models on a single pair of bi-text data, feeding the source language into the encoder and decoding the target language. As shown in Figure 1, we load the pre-trained weights and train the MT model on bi-texts with teacher forcing. For all directions, we train with 0.3 dropout, 0.2 label smoothing, 2500 warm-up steps, and a 3e-5 maximum learning rate. We use a maximum of 40K training updates for all low and medium resource pairs and 100K for high resource pairs. The final models are selected based on validation likelihood. For decoding, we use beam search with beam size 5 for all directions. The final results are reported in BLEU papineni2002bleu with language-specific settings; see appendix A.

Figure 2: Pre-training + Back Translation on FLoRes with two iterations of BT.
Pre-training Fine-tuning
Model Data En→Ro Ro→En +BT
Random None 34.3 34.0 36.8
XLM lample2019cross En Ro - 35.6 38.5
MASS song2019mass En Ro - - 39.1
BART lewis2019bart En - - 38.0
XLM-R conneau2019unsupervised CC100 35.6 35.8 -
BART-En En 36.0 35.8 37.4
BART-Ro Ro 37.6 36.8 38.1
mBART02 En Ro 38.5 38.5 39.9
mBART25 CC25 37.7 37.8 38.8
Table 4: Comparison with Other Pre-training Approaches on WMT16 Ro-En.

3.2 Main Results

As shown in Table 2, initializing with the pre-trained mBART25 weights leads to gains on all the low and medium resource pairs when compared with randomly initialized baselines. We observe substantial gains (up to 12 BLEU points) on low resource pairs such as En-Vi and En-Tr, and on noisily aligned pairs like En-Hi. Fine-tuning fails in extremely low-resource settings such as En-Gu, which has only roughly 10K examples for tuning. In these settings, unsupervised translation is more appropriate; see section 5.2.

For high resource cases (Table 3), we do not observe consistent gains, and pre-training slightly hurts performance when more than 25M parallel sentences are available. When a significant amount of bi-text data is given, we suspect that supervised training washes out the pre-trained weights completely.

+ Back Translation

Back-translation (BT, sennrich-etal-2016-improving) is a standard approach to augment bi-text with target side monolingual data. We combine our pre-training with BT and test it on low resource language pairs – En-Si and En-Ne – using the FLoRes dataset (guzman-etal-2019-flores). For a fair comparison, we use the same monolingual data as guzman-etal-2019-flores to generate BT data. Figure 2 shows that initializing the model with our mBART25 pre-trained parameters improves BLEU scores at each iteration of back translation, resulting in new state-of-the-art results in all four translation directions.
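The iterative BT procedure can be sketched as follows. Here `train` and `translate` are stand-ins for an MT toolkit (hypothetical callables, not a real API); in our setting `train` would start from the mBART initialization rather than random weights.

```python
def iterative_back_translation(train, translate, bitext, mono_src, mono_tgt, rounds=2):
    """Sketch of iterative BT. `bitext` is a list of (src, tgt) pairs; each
    round, target-side monolingual text is back-translated with the reverse
    model to make synthetic pairs for the forward model (and vice versa),
    and both models are retrained on real + synthetic data."""
    fwd = train(bitext)                                   # src -> tgt
    bwd = train([(t, s) for s, t in bitext])              # tgt -> src
    for _ in range(rounds):
        synth_fwd = [(translate(bwd, t), t) for t in mono_tgt]
        synth_bwd = [(translate(fwd, s), s) for s in mono_src]
        fwd = train(bitext + synth_fwd)
        bwd = train([(t, s) for s, t in bitext] + synth_bwd)
    return fwd, bwd
```

Each round improves the quality of the synthetic source sentences, which is why Figure 2 reports scores per BT iteration.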

vs. Other Pre-training Approaches

We also compare our pre-trained models with recent self-supervised pre-training methods, as shown in Table 4. We consider En-Ro translation, the only pair with established results. Our mBART model outperforms all the other pre-trained models, both with and without BT augmentation. We also show comparisons with the conventional BART model trained on the same En and Ro data only. Both improve over the baselines but are worse than the mBART results, indicating that pre-training in a multilingual setting is essential. Moreover, combining BT leads to additional gains, resulting in a new state of the art for Ro-En translation.

3.3 Analysis

We also present additional analysis, to better quantify when our pre-training helps.

How many languages should you pre-train on?

We investigate when it is helpful for pre-training to include languages other than the targeted language pair that will be used during fine-tuning. Table 5 shows performance on four X-En pairs. Pre-training on more languages helps most when the target language monolingual data is limited (e.g., En-My, where the size of the My data is around 0.5% of En).

In contrast, when monolingual data is plentiful (De, Ro), pre-training on multiple languages slightly hurts the final results (<1 BLEU). In these cases, additional languages may reduce the capacity available for each test language. Additionally, the fact that mBART06 performs similarly to mBART02 on Ro-En suggests that pre-training with similar languages is particularly helpful.

How many pre-training steps are needed?

We plot the Ro-En BLEU score vs. pre-training steps in Figure 3, where we take the saved checkpoints (every 25K steps) and apply the same fine-tuning process described in section 3.1. Without any pre-training, our model overfits and performs much worse than the baseline. However, after just 25K steps (5% of training), both models outperform the best baseline. The models keep improving over the remaining steps and have not fully converged after 500K steps. mBART25 is consistently slightly worse than mBART02.

Languages De Ro It My En
Size/GB 66.6 61.4 30.2 1.6 300.8
mBART02 31.3 38.5 39.7 36.5
mBART06 - 38.5 39.3 -
mBART25 30.5 37.7 39.8 36.9
Table 5: Pretraining Languages on En-X translation. The size refers to the size of monolingual data for X. The size of En is shown as reference. All the pretrained models were controlled to see the same number of English instances during training.
Models                   My→En  En→My  Training Cost (GPU hours)
Random chen2019facebook  23.3   34.9   5
    + BT                 32.0   37.7   5 + 300 + 350
mBART02                  29.1   37.8   3000 + 40
    + BT                 34.9   39.2   -

Table 6: Comparison with Back-Translation on My-En translation using the same monolingual data. We also estimate the computational costs of both pre-training and back-translation in Nvidia V100 GPU hours.

Monolingual Nl-En En-Nl Ar-En En-Ar Nl-De De-Nl
Random None 34.6 (-8.7) 29.3 (-5.5) 27.5 (-10.1) 16.9 (-4.7) 21.3 (-6.4) 20.9 (-5.2)
mBART02 En Ro 41.4 (-2.9) 34.5 (-0.3) 34.9 (-2.7) 21.2 (-0.4) 26.1 (-1.6) 25.4 (-0.7)
mBART06 En Ro Cs It Fr Es 43.1 (-0.2) 34.6 (-0.2) 37.3 (-0.3) 21.1 (-0.5) 26.4 (-1.3) 25.3 (-0.8)
mBART25 All 43.3 34.8 37.6 21.6 27.7 26.1

Table 7: Generalization to Unseen Languages. Language transfer results, fine-tuning on language pairs without pre-training on them. mBART25 uses all languages during pre-training, while the other settings each contain at least one unseen language. For each model, we also show the gap to the mBART25 results.
Figure 3: Fine-tuning curves for Ro-En along with Pre-training steps. Both mBART25 and mBART02 outperform the best baseline system after 25K steps.
Figure 4: Fine-tuning curves for En-De along with size of bitext. The x-axis is on a log scale.

How does the size of bitexts influence the gain from pre-training?

Tables 2 and 3 show that pre-training consistently helps for low and medium resource language pairs. To verify this trend, we plot performance for differing sized subsets of the En-De dataset. More precisely, we take the full En-De corpus (28M pairs) and randomly sample 10K, 50K, 100K, 500K, 1M, 5M and 10M subsets. We compare performance without pre-training to the mBART02 results, as shown in Figure 4. The pre-trained model is able to achieve over 20 BLEU with only 10K training examples, while the baseline system scores 0. Unsurprisingly, increasing the size of the bi-text corpus improves both models. Our pre-trained model consistently outperforms the baseline, but the gap narrows with increasing amounts of bi-text, especially after 10M sentence pairs. This result confirms our observation in section 3.2 that our pre-training does not help translation in high-resource pairs.

Is pre-training complementary to BT?  

Figure 2 shows that our pre-trained models can be combined with iterative back-translation (BT) on additional data; however, it is still not a fair comparison. Table 6 shows the results when using the same En and My monolingual sentences, following chen2019facebook.

With the same amount of monolingual data, mBART pre-training achieves the same performance on En→My as BT, while remaining about 3 BLEU worse on My→En. We suspect BT benefits from the larger monolingual data on the En side. Moreover, combining the mBART02 model with BT, we see further gains even with the same monolingual data. We also provide estimated training costs: BT has a longer pipeline involving training a baseline system (5h), translating monolingual data (300h) and final training (350h). In contrast, most of the training cost of mBART lies in the pre-training stage and can be easily adjusted to be more efficient.

3.4 Generalization to Languages NOT in Pre-training

In this section, we show that mBART can improve performance even when fine-tuning on language pairs that did not appear in the pre-training corpora, suggesting that the pre-training has language-universal aspects, especially within the parameters learned in the Transformer layers.

Experimental Settings

We analyze the results of three pairs: Nl-En, Ar-En and De-Nl using the pre-trained mBART25, mBART06 and mBART02 (En-Ro) models. During pre-training, mBART06 and the En-Ro bilingual model do not contain Arabic (Ar), German (De) or Dutch (Nl) data, but all languages are in mBART25. Both De and Nl are European languages and are related to En, Ro and the other languages in the mBART06 pre-training data.


mBART25 uses all languages during pre-training, but the other settings contain at least one unseen language. We find large gains from pre-training on English-Romanian, even when translating a distantly related unseen language (Arabic) and two closely related unseen languages (German and Dutch). The best results are achieved when pre-training includes both test languages; however, pre-training on other languages is surprisingly competitive.

Unseen Vocabularies

Arabic is distantly related to the languages in mBART02 and mBART06, and its use of a disjoint character set means that its word embeddings will be largely untrained. However, we obtain similar improvements on Ar-En pairs to those on Nl-En. This result suggests that the pre-trained Transformer layers learn universal properties of language that generalize well even with minimal lexical overlap.

Unseen Source or Target Languages

Table 7 shows different performance when the unseen languages are on the source side, the target side, or both sides. If both sides are unseen, the performance (in terms of the difference from mBART25) is worse than when at least one language is seen during pre-training. Furthermore, although the En-X pairs perform similarly, mBART06 outperforms mBART02 by a margin on X-En pairs. Fine-tuning with unseen languages on the source side appears more difficult and deserves more extensive future study.

4 Document-level Machine Translation

We evaluate mBART on document-level machine translation tasks, where the goal is to translate segments of text that contain more than one sentence (up to an entire document). During pre-training, we use document fragments of up to 512 tokens, allowing the models to learn dependencies between sentences. We show that this pre-training significantly improves document-level translation.

4.1 Experimental Settings

Datasets # Docs # Insts # Sents
WMT19 En-De 77K 171K 3.7M
TED15 Zh-En 1.7K 6.5K 0.2M
Table 8: Statistics for the Document-level Corpus of WMT19 En-De and TED15 Zh-En. The number of instances is the number of training examples in the document-level model.
(a) Sentence- and document-level BLEU scores on En-De
Model    Random (s-BLEU / d-BLEU)    mBART25 (s-BLEU / d-BLEU)
Sent-MT  34.5 / 35.9                 36.4 / 38.0
Doc-MT   × / 7.7                     37.1 / 38.5

(b) Document-level BLEU scores on Zh-En
Model    Random    mBART25    HAN miculicich-etal-2018-document
Sent-MT  22.0      28.4       -
Doc-MT   3.2       29.6       24.0

Table 9: Document-Level Machine Translation on En-De and Zh-En. (×) The randomly initialized Doc-MT model cannot produce translations aligned to the original sentences, so only document-level evaluation is possible.


We evaluate performance on two common document-level MT datasets: WMT19 En-De and TED15 Zh-En (statistics in Table 8). For En-De, we use the document data from WMT19 to train our model, without any additional sentence-level data; Zh-En dataset is from the IWSLT 2014 and 2015 evaluation campaigns cettolo2012wit3; cettolo2015iwslt. Following miculicich-etal-2018-document, we use 2010-2013 TED as the test set.


We use the same pre-processing as in pre-training. For each block, sentences are separated by end-of-sentence symbols (</S>) and the entire instance is ended with the specific language id (<LID>). The numbers of segmented instances are also shown in Table 8; on average, every document is split into 2–4 instances.

Fine-tuning & Decoding

We use the same fine-tuning scheme as for sentence-level translation (section 3.1), without using any task-specific techniques developed by previous work miculicich-etal-2018-document; li2019pretrained, such as constrained contexts or restricted attention. For decoding, we simply pack the source sentences into blocks, and translate each instance block autoregressively. The model does not know how many sentences to generate in advance and decoding stops when <LID> is predicted. We use beam size 5 by default.
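The decoding step, splitting the generated block back into sentences once <LID> is predicted, can be sketched as simple post-processing (the concrete token strings here are illustrative):

```python
def split_document_translation(tokens, lang_id):
    """Sketch of Doc-MT post-processing: generation stops when the language
    id is predicted, and individual sentences are recovered by splitting on
    the </S> separator (e.g., for sentence-level s-BLEU scoring)."""
    if lang_id in tokens:
        tokens = tokens[:tokens.index(lang_id)]   # drop <LID> and anything after
    sentences, current = [], []
    for tok in tokens:
        if tok == "</S>":
            if current:
                sentences.append(" ".join(current))
            current = []
        else:
            current.append(tok)
    if current:
        sentences.append(" ".join(current))
    return sentences
```

Because the number of </S> symbols the model emits is unconstrained, the recovered sentence count may differ from the reference, which is why sentence-level scoring is only possible when the output happens to align.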

Baselines & Evaluation

We train 4 models: a document-level (Doc-) MT model (section 4.1) and a corresponding sentence-level (Sent-) MT model (section 3.1) as the baseline, both with and without pre-training. We use mBART25 as the common pre-trained model for En-De and Zh-En. For En-De, even though our mBART25 Doc-MT model decodes multiple sentences together, the translated sentences can be aligned to the source sentences, which allows us to evaluate BLEU both at sentence level (s-BLEU) and document level (d-BLEU). Standard BLEU scores match n-grams at sentence level; for document level, we instead match n-grams over the whole document, resulting in a slightly higher score. For Zh-En, however, we cannot produce the same number of translated sentences as the reference due to alignment errors in the test data, so we only provide d-BLEU scores for this direction.

We also compare our models with Hierarchical Attention Networks (HAN, miculicich-etal-2018-document) on Zh-En, which is the state-of-the-art non-pretraining approach for document-level translation for this pair. They combine two layers of attention – first within and then across sentences.

4.2 Main Results

The main results for both En-De and Zh-En are presented in Table 9.

Random vs. Pre-trained

The MT models initialized with pre-trained weights outperform randomly initialized models by large margins, for both sentence-level and document-level training. Our mBART25 models (both Sent-MT and Doc-MT) also outperform HAN miculicich-etal-2018-document (d-BLEU is recomputed from the provided system output), despite the fact that they are not customized for document-level MT in any way.

Sent-MT vs. Doc-MT

For both datasets (En-De and Zh-En), the mBART25 Doc-MT models outperform their sentence-level fine-tuned counterparts by a margin, which is the complete opposite for models without pre-training. For both datasets, the randomly initialized Doc-MT models fail to work, producing much worse results than the sentence-level models. Such large performance gaps indicate that pre-training is critical for document-level performance. It is in general difficult to collect high-quality document-level data in large quantities, suggesting that pre-training may be a strong strategy for future work. We also include a sampled example in appendix B.

Figure 5: Illustrated frameworks for unsupervised machine translation via (a) back-translation (b) language transfer where Ne-En is used as an example. For both cases, we initialize from multilingual pre-training (e.g. mBART25).

5 Unsupervised Machine Translation

In addition to supervised machine translation, we also evaluate our model on tasks where no bi-text is available for the target language pair. We define three types of unsupervised translation:

  1. No bi-text of any kind is given. A common solution is to learn from back-translation (BT) (artetxe2017unsupervised; lample2018phrase). We show that mBART provides a simple and effective initialization scheme for these methods.

  2. No bi-text for the target pair is available, but the target languages both appear in bi-text corpora for other language pairs. Previous work has shown that zero-shot transfer is possible via massively multilingual MT (johnson2017google; gu2019improved) or distillation through pivoting (chen2017teacher). We limit our focus to building MT models for single language pairs, and leave multilingual pre-training for multilingual MT to future work.

  3. No bi-text for the target pair is available, but there is bi-text for translating from some other language into the target language. This is a new evaluation regime, where we will show that mBART supports effective transfer, even if the source language has no bi-text of any form.

In this section, we demonstrate the effectiveness of multilingual pre-training in unsupervised machine translation via (1) back-translation (section 5.1) and (3) language transfer (section 5.2). An illustration of both approaches is presented in Figure 5.

5.1 Unsupervised Machine Translation via Back-Translation


We evaluate our pre-trained models on both similar (En-De, En-Ro) and dissimilar pairs (En-Ne, En-Si), determined by measuring the subword units shared between the source and target languages. We use the same test sets as the supervised benchmarks (section 3.1), and directly use the pre-training data (CC25) for back-translation to avoid introducing new information.


Following the same procedure described in lample2018phrase and lample2019cross, we first initialize the translation model with the pre-trained weights, and then learn to predict the monolingual sentences conditioned on source sentences generated by on-the-fly back-translation (BT). lample2019cross only pre-train an encoder, so they perform additional de-noising training to learn a seq2seq model – a step which is unnecessary for mBART's pre-trained seq2seq model. However, we do constrain mBART to only generating tokens in the target language for the first 1000 steps of on-the-fly BT, to avoid it simply copying the source text. (We mask out the output probability of predicting tokens which appear less than 1% of the time in the target monolingual corpus.)
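This generation constraint amounts to an additive logit mask over the vocabulary, which might be sketched as follows. This is a hedged illustration: the frequency threshold and the function interface are assumptions, not the paper's exact implementation.

```python
import math

def allowed_token_mask(vocab, target_token_counts, min_frac=0.01):
    """Hypothetical sketch of the early on-the-fly BT constraint: tokens
    that are rare in the target-language monolingual corpus receive a -inf
    additive logit mask, so the model cannot simply copy source-language
    tokens. The exact threshold here is an illustrative assumption."""
    total = sum(target_token_counts.values())
    return [0.0 if target_token_counts.get(tok, 0) / total >= min_frac
            else -math.inf
            for tok in vocab]
```

Adding this mask to the decoder logits before the softmax zeroes out the probability of out-of-target-language tokens while leaving the relative scores of allowed tokens unchanged.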

                            Similar Pairs                    Dissimilar Pairs
                        En-De           En-Ro           En-Ne           En-Si
Model                 →En   En→       →En   En→       →En   En→       →En   En→
Random                21.0  17.2     19.4  21.2      0.0   0.0      0.0   0.0
XLM lample2019cross   34.3  26.4     31.8  33.3      0.5   0.1      0.1   0.1
MASS song2019mass     35.2  28.3     33.1  35.2       -     -        -     -
mBART                 34.0  29.8     30.5  35.0     10.0   4.4      8.2   3.9

Table 10: Unsupervised MT via Back-Translation. En-De, En-Ro are initialized by mBART02, while En-Ne, En-Si are initialized by mBART25. Our models are trained on monolingual data used in pre-training.
Testing     Fine-tuning Languages
Languages   Zh     Ja     Ko     Cs     Ro     Nl     It     Ar     Hi     Ne     Si     Gu
(Domain)    News   TED    TED    News   News   TED    TED    TED    News   Wiki   Wiki   Wiki
Zh          23.7    8.8    9.2    2.8    7.8    7.0    6.8    6.2    7.2    4.2    5.9    0.0
Ja           9.9   19.1   12.2    0.9    4.8    6.4    5.1    5.6    4.7    4.2    6.5    0.0
Ko           5.8   16.9   24.6    5.7    8.5    9.5    9.1    8.7    9.6    8.8   11.1    0.0
Cs           9.3   15.1   17.2   21.6   19.5   17.0   16.7   16.9   13.2   15.1   16.4    0.0
Ro          16.2   18.7   17.9   23.0   37.8   22.3   21.6   22.6   16.4   18.5   22.1    0.0
Nl          14.4   30.4   32.3   21.2   27.0   43.3   34.1   31.0   24.6   23.3   27.3    0.0
It          16.9   25.8   27.8   17.1   23.4   30.2   39.8   30.6   20.1   18.5   23.2    0.0
Ar           5.8   15.5   12.8   12.7   12.0   14.7   14.7   37.6   11.6   13.0   16.7    0.0
Hi           3.2   10.1    9.9    5.8    6.7    6.1    5.0    7.6   23.5   14.5   13.0    0.0
Ne           2.1    6.7    6.5    5.0    4.3    3.0    2.2    5.2   17.9   14.5   10.8    0.0
Si           5.0    5.7    3.8    3.8    1.3    0.9    0.5    3.5    8.1    8.9   13.7    0.0
Gu           8.2    8.5    4.7    5.4    3.5    2.1    0.0    6.2   13.8   13.5   12.8    0.3
Table 11: Unsupervised MT via Language Transfer on X→En translations. The model fine-tuned on one language pair is directly tested on another. We use gray color to show the direct fine-tuning results (the diagonal), and lightgray color to show language transfer within similar language groups. We bold the highest transfer score for each pair.


Table 10 shows the unsupervised translation results compared with non-pretrained models, as well as models with existing pre-training methods. Our models achieve large gains over non-pretrained models for all directions, and outperform XLM significantly for dissimilar pairs (En-Ne, En-Si) where the existing approaches completely fail. For similar pairs, our model also performs well against XLM and MASS, with the best numbers for En-X pairs.

Pair     BT      Transferred from   Transfer   Combined
Ro→En    30.5    Cs→En              23.0       33.9
Ne→En    10.0    Hi→En              18.9       22.1
Zh→En    11.3    Ko→En               9.2       15.0
Nl→En    28.5    It→En              34.1       35.4
Table 12: Back-Translation vs. Language Transfer for Unsupervised MT. We present the best transfer scores together with the pairs transferred from.

5.2 Unsupervised Machine Translation via Language Transfer

The second case of unsupervised machine translation assumes the target language appears in a bi-text corpus with some other source language.


We only consider X→En translation, and choose the bi-texts of 12 language pairs from section 3.1, covering Indic languages (Ne, Hi, Si, Gu), European languages (Ro, It, Cs, Nl), East Asian languages (Zh, Ja, Ko) and Arabic (Ar).


As illustrated in Figure 5 (b), we take the pre-trained mBART25 model, fine-tune it on each language pair, and then directly apply the resulting models to the remaining pairs, as seen in Table 11. We also present the direct fine-tuning performance (section 3) on the diagonal, for reference. We obtain reasonable transfer scores for all pairs across the different fine-tuned models, except Gu-En, where the supervised model itself fails (0.3 BLEU). In some cases, we achieve similar (Cs-En) or even much better (Ne-En, Gu-En) results than the supervised models.

As a comparison, we also apply the same procedure to randomly initialized models without pre-training, which always end up with scores near 0 BLEU. This indicates that multilingual pre-training is essential and produces universal representations across languages, so that once the model learns to translate one language into En, it learns to translate all languages with similar representations. We also present three examples of language transfer among Zh, Ja and Ko in appendix B.
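The universal-representation intuition can be illustrated with a deliberately tiny mock; the token tables below are invented for illustration, whereas the real model learns continuous shared embeddings:

```python
# Toy interlingua: multilingual pre-training is assumed to map words of
# different languages onto shared representations. The "decoder" below is
# trained only on Ko->En bi-text, yet it also decodes Zh input -- the
# mechanism behind the zero-shot transfer in Table 11.
SHARED = {"학교": "SCHOOL", "学校": "SCHOOL", "가다": "GO", "去": "GO"}
KO_EN_DECODER = {"SCHOOL": "school", "GO": "go"}  # learned from Ko-En only

def translate(tokens):
    return [KO_EN_DECODER[SHARED[t]] for t in tokens]

print(translate(["학교", "가다"]))  # Korean: seen during fine-tuning
print(translate(["学校", "去"]))    # Chinese: zero-shot language transfer
```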

When is language transfer useful?

Table 11 also shows mixed results across pairs. First, for most pairs, language transfer works better when fine-tuning is conducted on a language in the same family, especially among the Indic languages (Hi, Ne, Gu). However, significant vocabulary sharing is not required for effective transfer. For instance, Zh-En and It-En achieve the best transfer results on Ko-En and Ar-En, respectively, even though the vocabulary overlap (even character overlap) between Zh and Ko, and between It and Ar, is low.

w/ Back-Translation

We also compare unsupervised MT with back-translation (BT) vs. language transfer on 4 pairs in Table 12. The results are again mixed. When high-quality bi-text from a similar language exists, or when translating between dissimilar pairs, language transfer is able to beat the conventional BT approach. Furthermore, we show promising results for combining the two techniques: we start from the best transferred model and apply (iterative) BT on the same monolingual corpus used in pre-training. Table 12 presents the results after 1 iteration of BT. For all pairs, combining both techniques yields improvements.
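The combination step can be sketched as one back-translation round on top of a transferred model. The callables below are stand-ins for real seq2seq models, and "fine-tuning" is mocked as memorization (an illustrative assumption; real training updates the model weights on the synthetic pairs):

```python
def bt_iteration(fwd_model, bwd_model, target_monolingual):
    """One round of back-translation: build synthetic (source, target)
    pairs with the backward (target->source) model, then 'fine-tune'
    the forward model on them."""
    synthetic = {bwd_model(t): t for t in target_monolingual}
    return lambda src: synthetic.get(src, fwd_model(src))

# Stand-ins: the transferred model knows nothing yet; the backward
# model is a toy word-reverser.
transferred = lambda s: "<unk>"
backward = lambda t: " ".join(reversed(t.split()))

improved = bt_iteration(transferred, backward, ["guten tag"])
print(improved("tag guten"))  # learned from the synthetic pair
```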

6 Related Work

Pre-training for Text Generation

This work builds on the recent success of self-supervised pre-training for NLP applications peters2018deep; radford2018gpt; devlin2018bert; yang2019xlnet; liu2019roberta, especially for text generation tasks radford2019language; song2019mass; dong2019unified; raffel2019exploring; lewis2019bart, where different self-supervised objectives are designed for training big neural models on enormous unlabeled text corpora. The pre-trained models are usually used as initialization for fine-tuning on various downstream tasks such as controllable language modeling shirish2019ctrl, machine translation song2019mass, summarization liu2019text and dialogue generation zhang2019dialogpt. In contrast to most prior work, we focus on a deep exploration of applying denoising pre-training to various translation applications.

Multilinguality in NLP tasks

This work is also related to the continual trend of multilingual language learning, including aligning multilingual word embeddings DBLP:journals/corr/MikolovLS13; chen-cardie-2018-unsupervised; lample2018word into a universal space, and learning cross-lingual models DBLP:wada; lample2019cross; conneau2019unsupervised to exploit shared representations across languages.

For machine translation, the most relevant field is multilingual translation firat2016multi; viegas2016google; aharoni-etal-2019-massively; DBLP:journals/corr/abs-1907-05019, where the ultimate goal is to jointly train one translation model that translates multiple language directions at the same time and shares representations to improve translation performance on low-resource languages gu-etal-2018-universal. In this paper, we focus on multilingualism in the pre-training stage and fine-tune the learned model in the standard bilingual scenario. Compared to multilingual translation, we do not require parallel data across multiple languages, only for the targeted direction, which potentially improves scalability to low-resource languages and specific domains. Moreover, multilingual pre-training is unlikely to suffer from the interference problems between dissimilar languages that are typical for regular multilingual translation models.

Document Translation

As one of the key applications, this work also links to previous efforts to incorporate document-level context into neural machine translation wang-etal-2017-exploiting-cross; DBLP:journals/corr/JeanLFC17; tiedemann-scherrer-2017-neural; miculicich-etal-2018-document; doi:10.1162/tacla00029. li2019pretrained is the most relevant work, which also utilized a pre-trained encoder (BERT) for handling longer context. However, none of these works showed positive results on pure Seq2Seq models at the document level; they relied on task-specific techniques and usually only worked on sentence-level translation with a constrained range of context. To the best of our knowledge, our multilingual pre-trained model is the first to show improved results on document-level translation with standard Seq2Seq learning.

Unsupervised Translation

This work also summarizes previous efforts to learn translation between languages without a direct parallel corpus, and re-defines them as unsupervised machine translation with three categories; in this work, we focus only on applications of the first and the third kinds (section 5). When no parallel corpus of any kind is available, artetxe2017unsupervised,lample2018unsupervised,lample2018phrase proposed to jointly learn a denoising auto-encoder and back-translation from both directions, which, however, requires good initialization and only works well on similar language pairs; wu2019extract replaced back-translation with similar sentences retrieved from target monolingual data; wu2019machine solve the problem by mining sentences from Wikipedia and using them as weakly supervised translation pairs. Similar to lample2019cross,song2019mass, we follow the first approach and treat our pre-trained model as the initialization step. In addition, we investigate unsupervised translation using language transfer, which is similar to Pourdamghani2019TranslatingTA, where the authors generate translationese of the source language and train a system on high-resource languages to correct these intermediate utterances. It is also closely related to conneau2018xnli,artetxe2019crosslingual on cross-lingual representation learning.

7 Conclusion

We demonstrate that multilingual de-noising pre-training is able to significantly improve both supervised and unsupervised machine translation at both the sentence level and document level. We analyze when and how pre-training is most effective and can be combined with other approaches such as back-translation. Our results also show the transfer learning ability of the learned representations from multilingual pre-training.

In future work, we will scale up the current pre-training to more languages, e.g., an mBART100 model. The size of our model makes it expensive to deploy in production; future work will also explore pre-training more efficient models.

8 Acknowledgements

We thank Marc’Aurelio Ranzato, Guillaume Lample, Alexis Conneau, and Michael Auli for sharing their expertise on low-resource and unsupervised machine translation, and Peng-Jen Chen and Jiajun Shen for details about the FloRes and WAT datasets. We also thank our colleagues at FAIR and FAIAR for valuable feedback.


Appendix A Evaluation Details

For all our tasks, we use BLEU scores papineni2002bleu as the automatic metric to evaluate translation performance. We compute BLEU scores over tokenized text for both system outputs and references, applying language-wise tokenization to the translations before scoring. Note that, since we work directly on raw texts, we automatically obtain de-tokenized output after recovering the sentence-piece subwords. Following the literature, the language-wise tokenization instructions are as follows:

For other languages that are not listed above, we compute BLEU scores with sacreBLEU with DEFAULT tokenization.
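For reference, the BLEU metric used throughout can be sketched as below. This is a toy single-reference re-implementation with uniform n-gram weights and the standard brevity penalty; actual evaluation should use sacreBLEU, which also standardizes tokenization:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Minimal corpus-of-one BLEU: geometric mean of modified n-gram
    precisions, scaled by the brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((h & r).values())      # clipped n-gram matches
        precisions.append(overlap / max(sum(h.values()), 1))
    if min(precisions) == 0:                 # no smoothing in this sketch
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # brevity penalty
    return 100 * bp * math.exp(log_avg)

print(round(bleu("the cat sat on the mat", "the cat sat on the mat"), 1))  # 100.0
```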

Appendix B Translation Examples

Figure 6: An example of document-level translation from mBART25 Sent-MT and Doc-MT, held out from the test set of TED15 Zh-En. The Doc-MT system produces a much more fluent and coherent translation that is closer to the reference. For instance, the Doc-MT model inserts several instances of “And” to connect sentences so the text reads better, while the Sent-MT model has no global knowledge and produces sentences independently. Moreover, both systems produce much better translations than models without pre-training, where the non-pretrained Doc-MT model completely fails to produce readable translation output.
Figure 7: Examples of Unsupervised MT via Language Transfer between Ja, Ko, Zh → En. We mark the supervised settings in red. All three languages have quite different character sets (Ja and Zh share some Chinese characters) and syntactic structures. However, they are culturally and historically related, which we assume can be captured through pre-training. In all cases, if we fine-tune the mBART25 model on any one pair, the resulting model directly translates well on the other two pairs without seeing any corresponding parallel sentences. We also observe failure cases. For instance (3rd example), only the supervised model translates “자석” into “magnets” correctly, while the Ja-En and Zh-En models guess with the irrelevant words “cushions” and “jellyfish”, respectively. Also, in the 2nd example, the Ko-En model fails to translate “developed” and copies the source tokens. We suspect this is because the pre-training stage biases the output distribution.