Multilingual Translation with Extensible Multilingual Pretraining and Finetuning

08/02/2020 · Yuqing Tang, et al. · Facebook

Recent work demonstrates the potential of multilingual pretraining for creating one model that can be used for various tasks in different languages. Previous work in multilingual pretraining has demonstrated that machine translation systems can be created by finetuning on bitext. In this work, we show that multilingual translation models can be created through multilingual finetuning: instead of finetuning on one direction, a pretrained model is finetuned on many directions at the same time. Compared to multilingual models trained from scratch, starting from pretrained models incorporates the benefits of large quantities of unlabeled monolingual data, which is particularly important for low resource languages where bitext is not available. We demonstrate that pretrained models can be extended to incorporate additional languages without loss of performance, and we double the number of languages in mBART to support multilingual machine translation models of 50 languages. Finally, we create the ML50 benchmark, covering low, mid, and high resource languages, to facilitate reproducible research by standardizing training and evaluation data. On ML50, we demonstrate that multilingual finetuning improves on average 1 BLEU over the strongest baselines (either multilingual from scratch or bilingual finetuning) while improving 9.3 BLEU on average over bilingual baselines from scratch.

1 Introduction

A multitude of datasets and models have been developed in natural language processing for a wide variety of tasks and applications. However, a large proportion of these have focused on English. While many works have contributed resources for other languages, developing specialized models for each language of interest is not scalable, not to mention difficult for low resource languages where labeled data is exceptionally scarce.

Recent work in multilingual NLP shows promise for incorporating many languages into one architecture. For example, the mBART Liu et al. (2020) model is trained on twenty-five languages and can be finetuned for various downstream tasks. For translation, mBART was finetuned on bitext (bilingual finetuning). However, while mBART was pretrained on a variety of languages, the multilingual nature of the pretraining is not used during finetuning: finetuning on bitext to translate from one language to another does not leverage the full capacity of the multilingual pretraining. Instead, we propose multilingual finetuning of pretrained models, and we demonstrate large improvements compared to bilingual finetuning.

Previous work Aharoni et al. (2019); Arivazhagan et al. (2019b); Zhang et al. (2020) has explored multilingual translation by training multiple directions within the same model from scratch, but this approach faces challenges for mid to low resource languages. In lower resource scenarios, bitext data is usually unavailable in large quantities, making it challenging to train from scratch. In contrast, monolingual data exists even for low resource languages, particularly in resources such as Wikipedia or Commoncrawl, a snapshot of the web. Thus, leveraging this monolingual data through pretraining can provide a much stronger starting point for low resource machine translation tasks.

However, unlike training a multilingual model from scratch, pretrained models are limited by the choices made during pretraining. For example, mBART was only trained on 25 languages, so finetuning to translate a language that is not part of these 25 is not possible. Thus, users are restricted to the languages selected when training the initial model, as retraining from scratch is incredibly computationally intensive. In this work, we show that existing pretrained models, such as mBART Liu et al. (2020), can be extended to additional languages. We demonstrate this by doubling the number of languages supported by mBART to 50, without loss of performance on the original 25 languages and without starting from scratch. This allows languages to be added flexibly, while preserving the broader utility of the pretrained model, as it can be used for tasks beyond translation.

Further, working in a multilingual setting remains challenging, as various datasets, evaluation settings, and preprocessing schemes such as tokenization are used. Benchmarks exist for sentence embeddings Hu et al. (2020), natural language inference Conneau et al. (2018), and question answering Lewis et al. (2019b), but there is not yet a standard setting for machine translation. To this end, we contribute the ML50 benchmark, a dataset of 50 languages with publicly available training and evaluation sets, including high, mid, and extremely low resource directions. We will open source this benchmark for the community.

We make three main contributions:

  • An effective and novel approach for creating multilingual translation models with multilingual pretraining (on monolingual data) followed by multilingual finetuning (on parallel data). In the Many-to-English setting, multilingual finetuning achieves a 3.6 BLEU improvement over bilingual finetuning, and a 2.6 BLEU improvement over multilingual models trained from scratch. On average, combining Many-to-English and English-to-Many, multilingual finetuning improves 1 BLEU over the strongest baseline.

  • We show that existing pretrained models, such as mBART, can be extended to incorporate additional languages without training from scratch and without performance loss on the original languages. We release mBART50 for the community to use, which has double the number of languages of the original mBART.

  • To facilitate reproducible research on multilingual translation with representative challenges of the real world, we create the ML50 benchmark covering high, mid, and low resource languages and consisting of 230M bitext pairs.

2 Related work

2.1 Multilingual Denoising Pretraining

This work is related to recent progress in pretraining techniques for NLP applications Peters et al. (2018); Radford et al. (2018); Devlin et al. (2019); Liu et al. (2019); Song et al. (2019); Lewis et al. (2019a). In particular, recent works explored pretraining on multilingual unlabeled corpora Lample and Conneau (2019); Conneau et al. (2019); Liu et al. (2020); Tran et al. (2020), and significantly improved the performance of finetuning on machine translation between two languages. We extend Liu et al. (2020) by allowing finetuning in multilingual settings.

2.2 Multilingual Neural Machine Translation

Training a universal translation system between multiple languages Firat et al. (2016); Johnson et al. (2017) has shown enormous improvements for translating low-resource languages Gu et al. (2018), and even enables zero-shot translation Gu et al. (2019); Arivazhagan et al. (2019a). Arivazhagan et al. (2019b) indicate that it is essential to train gigantic models with enough capacity to fully leverage massive multilingual corpora.

A closely related concurrent work, Siddhant et al. (2020), shows it is possible to train a multilingual system jointly with monolingual datasets based on MASS Song et al. (2019). It naturally enables translation for languages without parallel data. In contrast, this work focuses on fine-tuning multilingual translation systems from a pre-trained model.

3 Multilingual Translation from Denoising Pretraining

We briefly describe the pretrained multilingual BART model and present multilingual finetuning, a technique to convert pretrained models into multilingual machine translation systems.

mBART

multilingual BART (mBART) Liu et al. (2020) is a sequence-to-sequence generative pretraining scheme. The model incorporates $K$ languages by concatenating data: $\mathcal{D} = \{\mathcal{D}_1, \ldots, \mathcal{D}_K\}$, where each $\mathcal{D}_i$ is a collection of monolingual documents in language $i$. mBART is trained as a denoising autoencoder, training to predict the original text $X$ given $g(X)$, where $g$ is a noising function that corrupts text. We maximize $\mathcal{L}_\theta$:

$$\mathcal{L}_\theta = \sum_{\mathcal{D}_i \in \mathcal{D}} \; \sum_{X \in \mathcal{D}_i} \log P(X \mid g(X); \theta) \qquad (1)$$

where $X$ is an instance in language $i$ and the distribution $P$ is defined by the seq-to-seq model. This model is pretrained using two types of noise in $g$, random span masking and order permutation, as described in Liu et al. (2020).
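To make the noising function $g$ concrete, below is a minimal sketch (not the authors' released implementation) of the two corruption operations; the mask ratio and Poisson span-length distribution follow Liu et al. (2020), while the tokenization and the <mask> symbol are simplifying assumptions.

```python
# Illustrative sketch of mBART-style noising: sentence order permutation plus
# span masking, with span lengths drawn from a Poisson distribution.
# Mask ratio (0.35) and lambda (3.5) follow Liu et al. (2020); everything else
# (tokenization, the "<mask>" symbol) is a simplifying assumption.
import random
import numpy as np

MASK = "<mask>"

def mask_spans(tokens, mask_ratio=0.35, poisson_lambda=3.5):
    """Replace random spans with a single <mask> until ~mask_ratio of tokens are removed."""
    tokens = list(tokens)
    num_to_mask = int(round(len(tokens) * mask_ratio))
    masked = 0
    while masked < num_to_mask and len(tokens) > 1:
        span = min(max(1, np.random.poisson(poisson_lambda)),
                   num_to_mask - masked, len(tokens) - 1)
        start = random.randrange(len(tokens) - span + 1)
        tokens[start:start + span] = [MASK]
        masked += span
    return tokens

def permute_sentences(sentences):
    """Shuffle the order of sentences within a document."""
    shuffled = list(sentences)
    random.shuffle(shuffled)
    return shuffled

def noise_document(sentences):
    """g(X): permute sentence order, then mask spans inside each sentence."""
    return [mask_spans(s) for s in permute_sentences(sentences)]

# Toy two-sentence "document" in one language.
doc = [["the", "cat", "sat", "on", "the", "mat", "."],
       ["it", "was", "a", "sunny", "day", "."]]
print(noise_document(doc))
```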

3.1 Multilingual Finetuning

To leverage multilingual pretraining to create translation systems, previous work Liu et al. (2020) used mBART as a starting point and then performed bilingual finetuning. Concretely, the seq-to-seq model was finetuned on translation from a single language $i$ to a single language $j$. However, bilingual finetuning does not leverage the full capacity of multilingual pretraining. Recent work on multilingual translation Aharoni et al. (2019); Arivazhagan et al. (2019b) shows that strong translation models can be created through multilingual training rather than bilingual training. Instead of training a model from language $i$ to language $j$, a model is trained to translate $N$ languages to $N$ other languages.

Thus, we propose to do multilingual finetuning (ML-FT) to adapt pretrained models to become multilingual models. This procedure creates one model capable of translating many languages to many other languages, which has efficiency and storage maintenance benefits. Further, multilingual finetuning retains several benefits of multilingual translation models in general, for example allowing languages of similar family to benefit each other.

To perform multilingual finetuning, we collect bitexts of different language pairs into a collection $\{B_{i \rightarrow j}\}$, with one bitext $B_{i \rightarrow j}$ for each direction $i \rightarrow j$. Following mBART Liu et al. (2020), we augment each bitext pair $(x_i, y_j)$ by adding a source language token and a target language token at the beginning of $x_i$ and $y_j$ respectively, forming a language token augmented pair. We then initialize a transformer based seq-to-seq model with the pretrained mBART, and provide the multilingual bitexts to finetune the pretrained model.
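As a concrete illustration, the following is a minimal sketch (an assumption about the preprocessing, not the released code) of prepending language tokens and combining bitexts from several directions into one finetuning corpus; the bracketed token format and language codes are placeholders.

```python
# Minimal sketch of language-token augmentation for multilingual finetuning.
# The "[lang]" token format and language codes stand in for mBART's language
# ID tokens and are illustrative only.
from typing import Dict, List, Tuple

def augment_pair(src: str, tgt: str, src_lang: str, tgt_lang: str) -> Tuple[str, str]:
    """Prefix source and target sentences with their language tokens."""
    return f"[{src_lang}] {src}", f"[{tgt_lang}] {tgt}"

def build_multilingual_corpus(
    bitexts: Dict[Tuple[str, str], List[Tuple[str, str]]]
) -> List[Tuple[str, str]]:
    """Collect bitexts of all directions into a single finetuning corpus."""
    corpus = []
    for (src_lang, tgt_lang), pairs in bitexts.items():
        for src, tgt in pairs:
            corpus.append(augment_pair(src, tgt, src_lang, tgt_lang))
    return corpus

# Example: two directions sharing one model.
bitexts = {
    ("de_DE", "en_XX"): [("Guten Morgen", "Good morning")],
    ("en_XX", "ro_RO"): [("Good morning", "Bună dimineața")],
}
print(build_multilingual_corpus(bitexts))
```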

Multilingual Translation Model Variants

We explore configurations to create different versions of multilingual translation models: Many-to-one (N1), one-to-Many (1N), and Many-to-Many (NN) via a pivot language. Concretely, the Many-to-one model encodes N languages and decodes to English, while the one-to-Many model encodes English and decodes into N languages. Finally, the Many-to-Many model encodes and decodes N languages. We follow Arivazhagan et al. (2019b) and use pivot data through English to create Many-to-Many models.

Temperature Sampling

When training multilingual models with many languages, the training dataset sizes are imbalanced, as different languages have different quantities of bitext. Thus, we train with temperature upsampling, which upsamples lower resource pairs so that the high resource languages do not dominate the training data. We follow Arivazhagan et al. (2019b) and use a temperature based sampling function with temperature $T$, where direction $i$ with bitext $B_i$ is sampled with probability

$$p_i \propto \left( \frac{|B_i|}{\sum_j |B_j|} \right)^{\frac{1}{T}}.$$
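A minimal sketch of this sampling scheme is shown below; the direction names, dataset sizes, and temperature value are illustrative only, not values taken from the paper's sweep.

```python
# Temperature-based sampling over translation directions: higher T flattens
# the distribution, upsampling low resource directions. Sizes and T are toy
# values for illustration.
def temperature_sampling_probs(sizes, T=5.0):
    """Return sampling probabilities p_i proportional to (|B_i| / sum_j |B_j|)^(1/T)."""
    total = sum(sizes.values())
    weights = {d: (n / total) ** (1.0 / T) for d, n in sizes.items()}
    z = sum(weights.values())
    return {d: w / z for d, w in weights.items()}

sizes = {"de-en": 45_000_000, "hi-en": 1_300_000, "gu-en": 7_500}
print(temperature_sampling_probs(sizes, T=1.0))  # proportional sampling
print(temperature_sampling_probs(sizes, T=5.0))  # low resource pairs upsampled
```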

4 Results from Multilingual Finetuning on 25 Languages

We first examine the impact of multilingual finetuning directly on existing pretrained models. We present results on the 25 languages included in mBART, using the existing mBART model. First, we describe three strong baselines: bilingual finetuning, bilingual translation models from scratch, and multilingual translation models from scratch. Then, we describe our experimental setting. Finally, we present results on 25 languages, showing that on average, multilingual finetuning improves over the strongest baselines: a 1.0 BLEU improvement over the strongest to-English baseline, while remaining within 0.6 BLEU of the strongest from-English baseline (Table 2).

Data Translation to English Translation from English
BL-FT ML-Scratch ML-FT BL-FT ML-Scratch ML-FT
en N1 NN N1 NN en 1N NN 1N NN
10M 2.60 3.99 2.51 4.37 3.19 1.67 0.64 -0.6 2.20 -0.90
1M-10M 3.70 5.70 5.06 6.40 4.62 3.40 2.40 1.7 1.76 1.40
100k-1M 5.49 7.28 7.04 8.13 6.47 4.17 4.31 4.97 2.9 2.00
7k-30k 10.80 14.63 13.77 18.03 14.57 7.27 8.07 7.90 7.6 0.90
All 4.94 6.91 6.15 7.91 6.14 3.67 3.31 2.66 3.0 1.81
Table 1: Multilingual Finetuning on 25 languages compared to bilingual models. Numbers are the improvement in BLEU compared to bilingual training from scratch.
Data Translation to English Translation from English
ML-FT vs BL-FT ML-FT vs ML-SC ML-FT vs BL-FT ML-FT vs ML-SC
N1 NN N1 NN 1N NN 1N NN
10M 1.77 0.59 0.39 -0.80 0.53 -0.64 1.56 1.61
1M-10M 2.70 0.92 0.70 -1.08 -1.64 -2.68 -0.64 -1.00
100k-1M 2.64 0.98 0.86 -0.81 -1.29 -2.64 -1.43 -2.44
7k-30k 7.23 3.77 3.40 -0.07 0.33 -0.93 -0.47 -1.57
All 2.98 1.20 1.00 -0.77 -0.63 -1.85 -0.28 -0.85
Table 2: Multilingual Finetuning on 25 languages compared to bilingual finetuning (BL-FT) and multilingual training from scratch (ML-SC). Numbers are the improvement in BLEU of multilingual finetuning (ML-FT) over each baseline; we perform multilingual finetuning on the existing mBART model. On average, ML-FT improves 1.0 BLEU over the strongest Many-to-one (N1) baseline (ML-SC), and is comparable to or slightly below the strongest baselines in the one-to-Many (1N) and Many-to-Many (NN) settings (BL-FT, and the combination of ML-SC and BL-FT, respectively).

4.1 Baselines

We compare our proposed multilingual finetuning to three strong baselines: bilingual training from scratch, bilingual finetuning, and multilingual models trained from scratch.

Bilingual Trained from Scratch (BL-Scratch)

We train bilingual translation models for translation into and out of English with standard Transformer Vaswani et al. (2017) models (5 layers with 512 embedding dimension, 2048 FFN dimension, and 8 attention heads for both encoder and decoder). For directions with more than 1 million bitext training pairs (de, cs, fr, ja, es, ru, pl, zh, fi, lv, lt, and hi), we train Transformer Big models (6 layers with 1024 embedding dimension, 4096 FFN dimension, and 16 heads for both encoder and decoder), as there is more data to benefit from additional model capacity. For directions with more than 10 million bitext training pairs (de, cs, fr, ja, es, ru, pl, and zh), we train Transformer Large models (12 layers with 1024 embedding dimension, 4096 FFN dimension, and 16 heads for both encoder and decoder), as there is even more data to benefit from additional model capacity. The best performing bilingual model is selected as the Bilingual Trained from Scratch baseline.
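For reference, here is a sketch of the three architecture sizes expressed with PyTorch's generic nn.Transformer; the paper's models were trained with fairseq, so the exact module and construction shown here are an illustrative assumption, not the authors' code.

```python
# Sketch of the three Transformer configurations used for the bilingual
# from-scratch baselines, expressed with PyTorch's generic nn.Transformer.
# Illustrative only; the paper's models were built in fairseq.
import torch.nn as nn

CONFIGS = {
    "standard": dict(layers=5,  d_model=512,  ffn=2048, heads=8),
    "big":      dict(layers=6,  d_model=1024, ffn=4096, heads=16),
    "large":    dict(layers=12, d_model=1024, ffn=4096, heads=16),
}

def build_transformer(name: str) -> nn.Transformer:
    cfg = CONFIGS[name]
    return nn.Transformer(
        d_model=cfg["d_model"],
        nhead=cfg["heads"],
        num_encoder_layers=cfg["layers"],
        num_decoder_layers=cfg["layers"],
        dim_feedforward=cfg["ffn"],
    )

model = build_transformer("big")
print(sum(p.numel() for p in model.parameters()))  # rough parameter count (excludes embeddings)
```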

Bilingual Finetuning (BL-FT)

Bilingual finetuning adapts the mBART model into bilingual machine translation models by continuing training on translation bitext. For each language direction, we follow the finetuning procedure and schedule of Liu et al. (2020) to obtain the Bilingual Finetuning baseline.

Multilingual Trained from Scratch (ML-SC)

We train different multilingual models from scratch: Many-to-one (N1), one-to-Many (1N), and Many-to-Many (NN) with English as the pivot. We train for a fixed budget of updates and sweep through different batch sizes, learning rates, and upsampling temperatures to select the best performing multilingual model on validation, using multiple GPUs for each training run. Following Arivazhagan et al. (2019b), we train with temperature upsampling.

4.2 Evaluation and Generation

We evaluate performance with tokenized BLEU, following the tokenization in mBART Liu et al. (2020). To generate, we decode using beam search with a fixed beam size and length penalty on the validation set. We do not perform checkpoint averaging. To select the best performing model in a sweep, we compare BLEU on the validation set.

4.3 Performance on 25 Languages

We first evaluate our proposed multilingual finetuning technique on 25 languages using the existing mBART model. We compare bilingual finetuning from mBART (BL-FT), multilingual training from scratch (ML-SC), and multilingual finetuning (ML-FT) by quantifying the BLEU improvement over the bilingual training from scratch baseline. Results are displayed in Table 1, separated into three settings: Many-to-one (N1), one-to-Many (1N), and Many-to-Many (NN).

Performance of Multilingual Finetuning

Compared to the BL-FT and ML-SC baselines, multilingual finetuning has consistently stronger results in the Many-to-one setting, translating from 25 different languages into English. It is 7.9 BLEU points stronger than the bilingual from scratch baseline, and 1.0 BLEU points stronger than the strongest baseline, ML-SC.

However, in the one-to-Many setting, the improvement of all multilingual methods over bilingual baselines is lower across the board. We hypothesize this is due to the challenge of needing to decode into many different languages (additional analysis is presented in Section 6.1). Multilingual finetuning is 3.0 BLEU points stronger than the bilingual from scratch baseline; it is also comparable to the strongest baseline, bilingual finetuning, with a difference of about 0.6 BLEU on average.

Finally, in the Many-to-Many setting, the improvement of all many-to-many multilingual methods over bilingual baselines is lower across the board. Again, we hypothesize this is due to the challenge of decoding into many different languages, including English (additional analysis is presented in Section 6.1). Multilingual finetuning is stronger than the bilingual from scratch baseline for translation from and into English combined, but it falls slightly below the strongest from-English and into-English baselines combined on average (Table 2).

Performance by Resource Level

Comparing the languages by resource level, we see that the improvement from multilingual training is more significant as the quantity of training bitext decreases. For example, in the multilingual finetuning (ML-FT) Many-to-one setting, improvement over bilingual from scratch is 4.4 BLEU points for languages with more than 10M bitext, but is 18.0 BLEU points for languages with 7K-30K available bitext. The trend is less consistent in the one-to-Many setting, but low resource languages still see improvements. For example, with multilingual finetuning (ML-FT), improvement over bilingual from scratch is 2.2 BLEU for languages with more than 10M bitext, but 7.6 BLEU for languages with 7K-30K available bitext.

Data size Languages
10M+ German, Czech, French, Japanese, Spanish, Russian, Polish, Chinese
1M - 10M Finnish, Latvian, Lithuanian, Hindi, Estonian
100k to 1M Tamil, Romanian, Pashto, Sinhala, Malayalam, Dutch, Nepali, Italian, Arabic, Korean, Hebrew, Turkish, Khmer, Farsi, Vietnamese, Croatian, Ukrainian
10K to 100K Thai, Indonesian, Swedish, Portuguese, Xhosa, Afrikaans, Kazakh, Urdu, Macedonian, Telugu, Slovenian, Burmese, Georgian
Less than 10K Marathi, Gujarati, Mongolian, Azerbaijani, Bengali
Table 3: Languages in ML50 Benchmark. We display the languages included in the ML50 Benchmark and the quantity of training data in bitext pairs. Full breakdown is provided in Appendix Table 6.

5 Results from Multilingual Finetuning on 50 Languages

Multilingual finetuning showed strong improvements on 25 languages in the Many-to-one setting, and we subsequently extend it to incorporate a greater number of languages: 50 instead of 25. However, the number of languages possible is limited by the initial selection of languages in mBART. To remedy this, we first show that the number of languages in mBART can be easily extended with additional pretraining. Second, we build the ML50 benchmark to standardize training data, evaluation data, and evaluation procedure across 50 different languages. Finally, we display results of multilingual finetuning from mBART on 50 languages and show strong improvements over the baselines.

5.1 Doubling the Languages in mBART

We describe how we extend existing pretrained models to incorporate a greater number of languages. This technique allows existing models to be used on new languages, rather than needing to restart a computationally intensive pretraining method from scratch.

Creating mBART50

While multilingual pretrained models have shown strong performance on a variety of tasks Liu et al. (2020); Conneau et al. (2019), they remain limited as they are trained on a fixed set of languages. For example, mBART was trained on 25 languages, all fairly high resource. Pretraining fully from scratch is computationally intensive: mBART trained for 2.5 weeks on 256 Nvidia V100 GPUs Liu et al. (2020). However, there are hundreds of different languages in the world, so restarting pretraining from scratch to add any of them to mBART would be difficult. Instead, we take the existing mBART model, trained on 25 languages, and show that it can be extended to cover twice as many languages. We take the publicly available pretrained mBART model (https://github.com/pytorch/fairseq/tree/master/examples/mbart), which was pretrained on 25 languages, and extend its embedding layers with randomly initialized vectors for an extra set of 25 language tokens. We then combine the monolingual data of the original languages and the new languages to continue pretraining this extended mBART model. We will release the resulting mBART50 model as a general purpose multilingual pretrained model, which will be useful for a variety of generation tasks beyond machine translation.
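A minimal sketch of this embedding extension is shown below (an illustration under simplified assumptions, not the released fairseq code); the vocabulary size, embedding dimension, and initialization scale are toy or assumed values.

```python
# Extend a pretrained embedding table with randomly initialized rows for new
# language tokens, keeping all pretrained rows intact so pretraining can be
# continued rather than restarted. Sizes below are toy values.
import torch
import torch.nn as nn

def extend_embeddings(old_emb: nn.Embedding, num_new_tokens: int) -> nn.Embedding:
    old_vocab, dim = old_emb.weight.shape
    new_emb = nn.Embedding(old_vocab + num_new_tokens, dim,
                           padding_idx=old_emb.padding_idx)
    with torch.no_grad():
        new_emb.weight[:old_vocab] = old_emb.weight                   # keep pretrained rows
        nn.init.normal_(new_emb.weight[old_vocab:], std=dim ** -0.5)  # new language tokens
    return new_emb

pretrained = nn.Embedding(1000, 64, padding_idx=1)  # stand-in for the mBART embedding table
extended = extend_embeddings(pretrained, num_new_tokens=25)
print(extended.weight.shape)  # torch.Size([1025, 64])
```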

Data and Training Details

We use the mBART.cc25 checkpoint Liu et al. (2020) available in the fairseq library Ott et al. (2019) to continue the pretraining process. We use the monolingual data from XLMR Conneau et al. (2019) to extend the pretraining to a set of 25 new languages in addition to the 25 languages of the original mBART model. To be consistent with mBART, we reuse its 250K sentencepiece Kudo and Richardson (2018) model, which was trained using monolingual data for 100 languages from XLMR and thus already supports languages beyond the original 25 mBART was trained on. We then continue pretraining mBART50 for additional updates with a large token-based batch size. The sizes of the monolingual data for the additional languages are provided in the appendix.

Data Translation to English Translation from English
BL-FT ML-SC ML-FT BL-FT ML-SC ML-FT
en N1 NN N1 NN en 1N NN 1N NN
10M 2.7 2.8 1.9 3.8 1.4 1.9 -0.6 -1.7 -0.3 -1.7
1M-10M 3.9 4.8 4.1 6.2 4.4 3.3 1.5 1.0 1.7 0.6
100k-1M 5.7 6.9 7.0 8.2 7.4 4.4 4.0 3.4 4.0 3.2
10K-100K 16.8 17.9 18.3 22.3 20.6 13.4 13.6 13.9 13.5 13.6
4k-10k 11.6 13.1 14.1 18.9 15.0 8.7 10.6 10.9 9.9 9.7
All 8.7 9.7 9.8 12.3 10.6 6.8 6.4 6.0 6.3 5.7
Table 4: Multilingual Finetuning on 50 languages compared to bilingual models. Improvement in BLEU compared to bilingual training from scratch is shown.
Data Translation to English Translation from English
ML-FT vs BL-FT ML-FT vs ML-SC ML-FT vs BL-FT ML-FT vs ML-SC
N1 NN N1 NN 1N NN 1N NN
10M 1.05 -1.34 0.95 -0.50 -2.15 -3.53 0.31 -0.01
1M-10M 2.34 0.54 1.36 0.30 -1.60 -2.74 0.18 -0.44
100k-1M 2.43 1.68 1.28 0.36 -0.36 -1.21 0.01 -0.25
10K-100K 5.49 3.82 4.37 2.30 0.06 0.21 -0.13 -0.25
4k-10k 7.33 3.42 5.83 0.87 1.27 1.00 -0.65 -1.20
All 3.61 1.85 2.61 -0.15 -0.47 -1.10 -0.04 -0.35
Table 5: Multilingual Finetuning on 50 languages compared to bilingual finetuning (BL-FT) and multilingual translation from scratch (ML-SC). On average, multilingual finetuning (ML-FT) improves 2.6 BLEU over the strongest Many-to-one (N1) baseline (ML-SC), and is comparable to or slightly below the strongest baselines in the one-to-Many (1N) and Many-to-Many (NN) settings (BL-FT, and the combination of ML-SC and BL-FT, respectively).

5.2 ML50 Benchmark

To demonstrate the impact of multilingual finetuning on additional languages, we create the ML50 Benchmark. ML50 standardizes the training and evaluation schemes across 50 different languages, from extremely low resource languages like Xhosa and Gujarati to high resource languages like French and German. The full list of languages is shown in Table 3. We group the languages into five categories based on the amount of available training data: more than 10M pairs (8 languages), 1M to 10M pairs (5 languages), 100k to 1M pairs (17 languages), 10K to 100K pairs (13 languages), and finally, less than 10K pairs of training data (5 languages). ML50 includes languages from a wide range of language families, from Germanic and Romance languages to Indic and African ones. Many of the additional languages we contribute are lower resource compared to the languages in the original mBART.

Training Data

We gather parallel data between English and 49 other languages to form ML50, to enable the training of machine translation models. We select these 49 languages based on the amount of parallel and monolingual data, to cover languages with different amounts of resources and from different language families. The quantity of available monolingual data is relevant for pretraining, so we want to ensure there is a sufficient amount. All of the data is publicly available, from sources such as WMT, IWSLT, WAT, TED, and other published research works. For training data, each language pair can include multiple sources; we simply concatenate them and remove duplicated source-target sentence pairs for each language pair. We use fasttext Joulin et al. (2017) to perform language identification on both source and target sentences, and remove sentence pairs if either the source or target sentence is not predicted as the expected language. We further filter out training data that matches any source or target side sentence in the evaluation datasets. Compared to other datasets such as opus100, the ML50 benchmark contains around 4 times more training data. The full list of languages, data sources, and amount of resulting data can be found in Table 6 in the Appendix.
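As an illustration of the filtering steps above, here is a minimal sketch assuming the public lid.176.bin fastText language identification model and simple exact-match deduplication and evaluation-overlap removal; the file name and helper functions are assumptions, not the pipeline released with the benchmark.

```python
# Minimal sketch of the bitext cleaning steps: deduplication, language
# identification with fastText, and removal of pairs overlapping evaluation
# sets. The lid.176.bin model is the public fastText LID model; exact-match
# filtering is a simplifying assumption.
import fasttext

lid = fasttext.load_model("lid.176.bin")

def predicted_lang(sentence: str) -> str:
    labels, _ = lid.predict(sentence.replace("\n", " "))
    return labels[0].replace("__label__", "")

def clean_bitext(pairs, src_lang, tgt_lang, eval_sentences):
    """Deduplicate, LID-filter, and drop pairs that overlap evaluation data."""
    seen, kept = set(), []
    for src, tgt in pairs:
        if (src, tgt) in seen:                       # remove duplicated pairs
            continue
        seen.add((src, tgt))
        if predicted_lang(src) != src_lang or predicted_lang(tgt) != tgt_lang:
            continue                                 # wrong-language pair
        if src in eval_sentences or tgt in eval_sentences:
            continue                                 # overlaps valid/test sets
        kept.append((src, tgt))
    return kept
```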

Evaluation Data

To ensure high quality evaluation of the languages covered in ML50, we include publicly available, widely used evaluation sets. We source these evaluation datasets from translation workshops such as WMT, IWSLT, and WAT, and from other published research works. We follow the evaluation protocol, including tokenization, used for each of these evaluation sets, to ensure our results are comparable with existing work, and we release these scripts to make reproduction easier for others. Compared to other datasets such as opus100, we choose to use high quality existing evaluation datasets rather than holding out part of the training data for evaluation, because training data, particularly for low resource languages, is often very noisy and unreliable.

Figure 1: Multilingual Finetuning Improvement over Bilingual Finetuning on 50 Languages: average BLEU improvement for translation into English (left) and average BLEU difference for translation from English (right).

5.3 Performance on 50 Languages

We evaluate the performance of mBART50 on the ML50 Benchmark. We compare to the same baselines — bilingual finetuning, bilingual training from scratch, and multilingual training from scratch. Results are displayed in Table 4.

In the Many-to-One setting averaged across all languages, multilingual finetuning improves over the strongest baseline, multilingual many-to-many from scratch, by 2.5 BLEU points. For lower resource language pairs, the improvement is much more significant. For example, the improvement for languages with 4K-10K training data is 4.8 BLEU points over the strongest baseline, and the improvement for languages with 10K-100K training data is 4+ BLEU over the strongest baseline.

For One-to-Many, the performance of all methods — bilingual finetuning, multilingual from scratch, and multilingual finetuning — is similar. On average, all models have around 5.7 to 7 BLEU points improvement over bilingual baselines.

Finally, in Many-to-Many, multilingual finetuning achieves a 0.8 BLEU improvement in the to-English direction over the strongest baseline. In the from-English direction, the performance of Many-to-Many multilingual finetuning is similar to multilingual from scratch, both around 5.5 to 6 BLEU above the bilingual baselines.

5.4 Comparison to Bilingual Finetuning

We examine the performance of our proposed multilingual finetuning method compared to bilingual finetuning. Previous work shows that strong translation models can be created by finetuning pretrained models into bilingual translation models. However, this means a separate model must be created for each translation direction of interest, leading to a large number of models to finetune and maintain. In contrast, multilingual finetuning captures a multitude of directions within one model.

However, multilingual finetuning means that the same model capacity must cover many directions rather than just one, which could decrease performance. In Figure 1, we analyze the improvement of multilingual finetuning over bilingual finetuning. On the left, we compare the Many-to-one setting translating into English, and on the right we compare the one-to-Many setting translating out of English into many different languages.

In the Many-to-one setting, every language pair except one is improved by multilingual finetuning. Some low resource languages see substantial improvements of 10+ BLEU points, with the largest gain exceeding 15 BLEU. On average, multilingual finetuning improves 3.6 BLEU across all directions into English. In the one-to-Many setting, performance is about the same between multilingual finetuning and bilingual finetuning, with an average difference of less than 0.5 BLEU across all directions out of English.

6 Discussion

6.1 Challenges of one-to-Many

In the Many-to-one setting, where models must encode various different languages and decode into English, large improvements are seen when doing multilingual modeling. Previous work has similarly observed this improvement Arivazhagan et al. (2019b) in multilingual training from scratch, as multilingual modeling increases the quantity of target-side English data seen by the model. For example, compared to bilingual finetuning, our multilingual finetuning model is exposed to English target side data from 50 different language pairs.

However, in the one-to-Many setting and the Many-to-Many setting, models must decode into 50 different languages. This is a difficult decoding challenge, as a strong conditional language model must be learned for each language. While pretraining exposes the model to monolingual data, the quantity of monolingual data varies for each language. For lower resource languages, such as Gujarati or Xhosa, the quantity of monolingual data available, even through online resources such as Commoncrawl, remains limited. Other work Arivazhagan et al. (2019b) observes similar trends in the performance of one-to-Many models.

Overall, we find that multilingual finetuning performs better than any of our assessed baselines — bilingual training from scratch, bilingual finetuning, and multilingual training from scratch — when averaged across the Many-to-one and one-to-Many directions. It is important to note that this effect mainly comes from the strong improvement of the Many-to-one setting, and all approaches have similar performance in the one-to-Many setting.

6.2 Comparison of mBART50 on 25 Languages

We show that the mBART model can be extended from 25 languages to 50 languages without starting from scratch. In this section, we evaluate whether adding additional languages is harmful to performance on the original 25 languages. As the model remains the same size but must model more languages, it could have reduced capacity for the original 25 languages; however, we do not see any reduction in performance. Results are shown in Figure 2. For each language, we plot the performance when doing bilingual finetuning with mBART25 and mBART50. Performance is almost exactly the same with both models, indicating that the number of languages can be doubled without loss of performance.

Figure 2: Continuing Pretraining with Additional Languages – No Performance Degradation on the Original Languages

7 Conclusion

We demonstrate that multilingual neural machine translation models can be created from pretrained models such as mBART. Previous work using pretrained models focused only on bilingual finetuning, and work in multilingual translation trained only from scratch. While using pretrained models could limit the number of languages possible, we show that mBART can be extended to double the number of original languages, without loss of performance on the original languages. We release mBART50 for the community as a strong generative denoising pretrained model covering 50 different languages. Further, to train and evaluate on 50 languages, we develop and release the ML50 benchmark. In conclusion, we show that by performing multilingual finetuning, strong improvements of over 2 BLEU points can be achieved in the Many-to-one setting. Overall, averaging across the Many-to-one and one-to-Many directions, our proposed multilingual finetuning strategy outperforms all baselines.

References

  • R. Aharoni, M. Johnson, and O. Firat (2019) Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 3874–3884. External Links: Link, Document Cited by: §1, §3.1.
  • N. Arivazhagan, A. Bapna, O. Firat, R. Aharoni, M. Johnson, and W. Macherey (2019a) The missing ingredient in zero-shot neural machine translation. arXiv preprint arXiv:1903.07091. Cited by: §2.2.
  • N. Arivazhagan, A. Bapna, O. Firat, D. Lepikhin, M. Johnson, M. Krikun, M. X. Chen, Y. Cao, G. Foster, C. Cherry, et al. (2019b) Massively multilingual neural machine translation in the wild: findings and challenges. arXiv preprint arXiv:1907.05019. Cited by: §1, §3.1, §3.1, §3.1, §4.1, §6.1, §6.1.
  • A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov (2019) Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116. Cited by: §2.1, §5.1, §5.1.
  • A. Conneau, G. Lample, R. Rinott, A. Williams, S. R. Bowman, H. Schwenk, and V. Stoyanov (2018) XNLI: evaluating cross-lingual sentence representations. arXiv preprint arXiv:1809.05053. Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL), Cited by: §2.1.
  • O. Firat, K. Cho, and Y. Bengio (2016) Multi-way, multilingual neural machine translation with a shared attention mechanism. In NAACL, Cited by: §2.2.
  • J. Gu, H. Hassan, J. Devlin, and V. O.K. Li (2018) Universal neural machine translation for extremely low resource languages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 344–354. External Links: Link, Document Cited by: §2.2.
  • J. Gu, Y. Wang, K. Cho, and V. O. Li (2019) Improved zero-shot neural machine translation via ignoring spurious correlations. arXiv preprint arXiv:1906.01181. Cited by: §2.2.
  • J. Hu, S. Ruder, A. Siddhant, G. Neubig, O. Firat, and M. Johnson (2020) Xtreme: a massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080. Cited by: §1.
  • M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, et al. (2017) Google’s multilingual neural machine translation system: enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5, pp. 339–351. Cited by: §2.2.
  • A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov (2017) Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 427–431. Cited by: §5.2.
  • T. Kudo and J. Richardson (2018) SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, pp. 66–71. External Links: Link, Document Cited by: §5.1.
  • G. Lample and A. Conneau (2019) Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291. Cited by: §2.1.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2019a) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461. Cited by: §2.1.
  • P. Lewis, B. Oğuz, R. Rinott, S. Riedel, and H. Schwenk (2019b) MLQA: evaluating cross-lingual extractive question answering. arXiv preprint arXiv:1910.07475. Cited by: §1.
  • Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer (2020) Multilingual denoising pre-training for neural machine translation. arXiv preprint arXiv:2001.08210. Cited by: §1, §1, §2.1, §3, §3.1, §3.1, §4.1, §4.2, §5.1, §5.1.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §2.1.
  • M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli (2019) fairseq: a fast, extensible toolkit for sequence modeling. In North American Association for Computational Linguistics (NAACL): System Demonstrations, Cited by: §5.1.
  • M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In North American Association for Computational Linguistics (NAACL), Cited by: §2.1.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding with unsupervised learning. Technical report, OpenAI. Cited by: §2.1.
  • K. Song, X. Tan, T. Qin, J. Lu, and T. Liu (2019) MASS: masked sequence to sequence pre-training for language generation. In International Conference on Machine Learning (ICML). Cited by: §2.1.
  • C. Tran, Y. Tang, X. Li, and J. Gu (2020) Cross-lingual retrieval for iterative self-supervised training. arXiv preprint arXiv:2006.09526. Cited by: §2.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, Cited by: §4.1.
  • B. Zhang, P. Williams, I. Titov, and R. Sennrich (2020) Improving massively multilingual neural machine translation and zero-shot translation. arXiv preprint arXiv:2004.11867. Cited by: §1.

Appendix A Appendices

Language   # Train Sentences   Train Source   Eval Source   # Valid Sentences   # Test Sentences
af 45967 Opus LauraMartinus 1500 2686
ar 226073 IWSLT17 IWSLT17 1158 1460
az 5680 TED58 TED58 671 903
bn 4487 TED58 TED58 896 216
cs 42587802 WMT20 WMT19 2983 1997
de 45828203 WMT20 WMT19 2998 2000
es * 14524187 WMT13 WMT13 3003 3000
et 1052003 WMT18 WMT18 2000 2000
fa 144895 TED58 TED58 3930 4490
fi * 2353313 WMT17 WMT17 3000 3002
fr 36797950 WMT14 WMT14 3000 3003
gl 9504 TED58 TED58 682 1007
gu 7471 WMT19 WMT19 1998 1016
he 204380 TED58 TED58 4515 5508
hi 1327206 ITB ITB 520 2507
hr 116792 TED58 TED58 3333 4881
id 83944 TED58 TED58 2677 3179
it 226457 IWSLT17.mltlng IWSLT17.mltlng 1566 1147
ja * 16167141 WMT20 WMT20 dev-split 999 999
ka 12364 TED58 TED58 654 943
kk 29186 WMT19 WMT19 2066 1000
km 191967 WMT’20 Flores devtest 2378 2309
ko 224612 IWSLT17 IWSLT17 1143 1429
lt * 1395010 WMT19 WMT19 2000 1000
lv * 1808291 WMT17 WMT17 2003 2001
mk 24037 TED58 TED58 640 438
ml 358916 lotus lotus 500 1000
mn 7168 TED58 TED58 372 414
mr 9397 TED58 TED58 767 1090
my 18073 WAT19 WAT19 1000 1018
ne 227387 Flores Flores 2559 2924
nl 232572 IWSLT17.mltlng IWSLT17.mltlng 1777 1181
pl 10332683 WMT20 WMT20 dev-split 1000 1000
ps 579346 WMT’20 Flores devtest 3162 2698
pt 49446 TED58 TED58 1193 1803
ro 592594 WMT16 WMT17 1999 1999
ru * 13922899 WMT20 WMT19 3000 2000
si 565661 Flores Flores 2898 2905
sl 18751 TED58 TED59 1068 1251
sv 53596 TED58 TED58 1729 2283
ta 609767 WMT’20 WMT20 dev-split 995 994
te 22042 lotus lotus 500 1000
th 93723 TED58 TED58 2989 3713
tr 204200 WMT17 WMT17 3000 3007
uk 104193 TED58 TED58 3060 3751
ur 26302 lotus lotus 500 1000
vi 127069 IWSLT 15 IWSLT15 1268 1080
xh 48981 Opus LauraMartinus 1500 2717
zh * 10082367 WMT20 WMT19 3981 2000
Table 6: ML50 Benchmark dataset statistics. For each language, we list the size of the training data after the filtering steps, the sources of the training and evaluation data, and the sizes of the evaluation sets. We note that part of the available data is missing due to human error for a few language pairs; we mark these languages with an asterisk and will release a new version of the ML50 benchmark data that includes the missing data.
Lang de cs fr ja es ru pl zh fi lv lt hi
BL-Scratch to en 39.7 29.0 35.2 18.4 27 37.7 28.4 25.1 24.1 17.9 27.8 20.1
BL-FT to en 41.0 32.0 37.4 19.5 30.2 38.5 31.0 25.4 28.8 20.8 30.7 23.8
BL-Scratch from en 40 24.8 39 22.2 29 28.5 24.3 33.6 19.7 16.6 13.3 17.5
BL-FT from en 41.9 26.5 40.8 24.5 30.3 30.5 26.7 35.1 23.7 19.0 16.1 20.4
Lang et ta ro ps si ml nl ne it ar ko he
BL-Scratch to en 23.2 14.2 32.6 8.9 6.1 12.5 32.5 2.8 36.9 33.5 16.4 38.6
BL-FT to en 28.3 18.2 37.1 15.0 12.6 18.2 36.5 13.3 42.1 37.5 19.9 42.7
BL-Scratch from en 17.5 28.7 32.9 7.3 1.5 17.5 29.3 1.3 33.7 19.7 16.1 27.0
BL-FT from en 22.0 34.0 37.4 9.3 4.7 25.5 33.3 6.9 38.1 22.0 20.0 29.7
Lang tr km fa vi hr uk th id sv pt xh af
BL-Scratch to en 16.5 4.0 27.6 26.0 33.6 24.5 20.9 28.0 30.8 30.7 0.4 1.0
BL-FT to en 22.5 8.3 33.2 31.9 42.0 33.5 28.2 36.9 44.9 46.0 12.1 26.5
BL-Scratch from en 16.3 4.3 15.1 28.5 26.0 17.8 30.7 27.2 27.0 27.1 0.2 1.0
BL-FT from en 22.7 5.9 18.4 32.9 32.2 24.3 36.5 35.6 38.5 41.6 11.2 18.3
Lang kk ur mk te sl my ka gl mr gu mn az
BL-Scratch to en 1.4 7.8 14.1 10.9 7.9 3.9 6.1 6.6 2.8 0.0 3.5 2.8
BL-FT to en 11.0 28.0 35.8 35.8 28.5 25.1 23.8 34.3 11.6 0.5 11.2 15.5
BL-Scratch from en 0.6 8.3 8.2 15.0 4.9 19.8 3.7 4.2 5.2 0.0 3.3 1.9
BL-FT from en 5.9 23.7 27.2 38.8 21.9 35.8 13.0 26.7 11.5 0.6 8.5 7.4
Table 7: Bilingual from scratch (BL-Scratch) and bilingual finetuning (BL-FT) baselines over ML50 languages
Lang de cs fr ja es ru pl zh fi lv lt hi
ML-Scratch N1 39.6 32.3 38.0 19.2 31.6 38.6 30.6 25.9 29.3 22.1 30.5 26.3
ML-Scratch NN 38.3 31.2 37.0 17.5 31.6 38.0 29.9 24.8 28.4 21.1 30.5 25.3
ML-Scratch 1N 39.1 23.9 38.5 20.9 29.3 28.6 24.6 31.7 21.2 17.6 14.5 19.8
ML-Scratch NN 37.2 23.1 37.8 20.0 29.1 27.4 23.1 30.5 20.3 16.5 14.6 19.7
Lang et ta ro ps si ml nl ne it ar ko he
ML-Scratch N1 29.1 20.5 36.3 16.0 15.4 19.5 34.5 17.7 40.1 51.0 29.2 39.7
ML-Scratch NN 28.3 19.9 36.6 15.7 16.2 19.2 37.6 20.3 41.9 44.5 24.1 40.5
ML-Scratch 1N 19.2 33.3 36.1 8.4 4.2 25.0 32.6 9.4 36.5 21.7 19.3 29.6
ML-Scratch NN 18.6 32.1 35.2 8.3 3.9 23.8 31.9 9.1 36.6 20.9 18.1 28.1
Lang tr km fa vi hr uk th id sv pt xh af
ML-Scratch N1 23.1 8.9 31.9 28.0 40.6 31.7 26.4 36.3 41.5 43.9 14.5 35.7
ML-Scratch NN 23.6 10.5 32.6 30.6 40.6 32.4 27.3 35.7 42.2 44.5 13.5 35.1
ML-Scratch 1N 22.1 5.0 18.5 32.5 32.5 24.4 36.5 34.7 38.2 41.9 4.9 20.3
ML-Scratch NN 21.7 5.0 18.3 31.9 31.6 24.5 36.7 35.4 38.4 42.0 8.9 17.6
Lang kk ur mk te sl my ka gl mr gu mn az
ML-Scratch N1 12.5 28.6 36.7 37.8 32.4 27.9 23.0 35.8 14.9 3.1 10.8 14.1
ML-Scratch NN 13.6 30.2 37.6 40.1 30.8 27.6 24.2 36.0 14.9 3.5 12.5 16.0
ML-Scratch 1N 7.9 24.6 28.3 41.2 23.4 35.5 13.5 28.9 13.9 3.0 9.2 8.5
ML-Scratch NN 7.9 24.3 29.5 41.2 22.6 36.3 13.2 28.8 13.8 3.9 9.1 7.9
Table 8: Multilingual from scratch (ML-Scratch) baselines over ML50 languages
Lang de cs fr ja es ru pl zh fi lv lt hi
ML-FT N1 41.5 34.2 39.8 20.5 28.6 39.1 32.9 26.8 31.3 23.1 31.6 27.2
ML-FT NN 37.9 31.7 37.3 17.4 27.3 37.9 30.0 24.8 29.0 21.8 30.4 25.5
ML-FT 1N 38.6 24.5 38.9 21.8 29.5 28.7 24.7 32.4 21.0 17.9 14.7 20.0
ML-FT NN 36.8 23.3 37.4 20.5 28.6 27.3 23.1 31.1 19.7 16.2 14.4 18.7
Lang et ta ro ps si ml nl ne it ar ko he
ML-FT N1 30.9 20.9 38.6 16.2 17.5 19.9 38.1 21.1 43.9 39.1 21.7 43.5
ML-FT NN 28.4 19.8 37.0 15.2 16.1 18.7 37.7 19.4 43.3 41.9 23.3 42.0
ML-FT 1N 19.6 33.4 36.4 8.4 4.1 24.8 32.6 9.0 37.5 21.2 19.4 29.0
ML-FT NN 18.5 32.5 35.5 8.2 3.3 23.6 31.1 8.5 35.9 20.0 18.5 27.4
Lang tr km fa vi hr uk th id sv pt xh af
ML-FT N1 24.8 11.2 35.7 33.1 44.3 36.2 30.3 39.1 46.9 49.3 14.2 42.4
ML-FT NN 24.3 10.7 34.0 32.7 42.7 34.2 29.1 37.9 45.1 47.1 16.6 42.2
ML-FT 1N 22.1 6.2 18.3 32.5 31.9 24.4 36.0 34.8 37.8 41.0 8.9 20.7
ML-FT NN 21.4 5.7 18.2 32.0 30.8 24.1 35.7 35.1 38.0 40.8 11.6 19.6
Lang kk ur mk te sl my ka gl mr gu mn az
ML-FT N1 19.3 31.4 42.5 44.0 33.9 32.1 28.6 40.6 17.4 15.8 13.6 19.9
ML-FT NN 15.6 31.7 39.4 41.8 31.6 29.7 24.5 36.9 15.4 5.4 12.8 17.4
ML-FT 1N 6.5 24.6 27.0 41.0 22.8 35.4 12.3 28.0 13.4 1.9 8.5 8.1
ML-FT NN 6.9 22.2 29.0 39.6 23.1 36.8 12.3 28.0 13.1 1.9 7.7 8.0
Table 9: Multilingual finetuning (ML-FT) results over ML50 languages