We participate in the WMT2021 shared task on news translation and submit a multilingual translation system. In recent years, multilingual translation has gained significant interest as an alternative to developing separate, specialized systems for different language directions (firat2016multi; tan2019multilingual; aharoni2019massively; zhang2020improving; tang2020multilingual; arivazhagan2019massively). Multilingual systems have great potential for simplicity and consolidation, making them attractive options for the development and maintenance of commercial translation technologies. From a research standpoint, studies of transfer learning between related languages and the development of methods that incorporate low-resource languages are strong motivators for grouping languages together in one system (dabre2019exploiting; fan2021beyond).
Despite such motivations, existing multilingual translation systems have been unable to show that their translation quality surpasses that of bilingual systems. Several works compare against bilingual baselines, but these baselines do not incorporate standard techniques used across the field, such as backtranslation or dense model scaling. Further, multilingual translation systems are often developed on non-standard training datasets and use different evaluation datasets. These factors make it difficult to assess the performance of multilingual translation, particularly when compared to the most competitive bilingual models.
In this work, our aim is to demonstrate, against the winning WMT2020 models and our bilingual WMT2021 systems, that multilingual translation models have stronger performance than bilingual ones. We focus on 14 language directions: English to and from Czech, German, Hausa, Icelandic, Japanese, Russian, and Chinese. We create an unconstrained system that utilizes both WMT-distributed and publicly available training data, apply large-scale backtranslation, and explore dense and mixture-of-experts architectures. We compare the impact of various techniques on bilingual and multilingual systems, demonstrating where multilingual systems have an advantage. Our final multilingual submission improves translation quality by +2.0 BLEU on average compared to the WMT2020 winning models, and ranks first in 7 directions based on automatic evaluation on the WMT2021 leaderboard.
We participate in translation of English to and from Czech (cs), German (de), Hausa (ha), Icelandic (is), Japanese (ja), Russian (ru), and Chinese (zh). We describe our bitext and monolingual data sources, including additional mined data created for Hausa, and our preprocessing pipeline.
2.1 Bitext Data
For all directions, we use all available bitext data from the shared task. For language directions such as English to German or English to Russian, this provides millions of high-quality bitext sentence pairs. However, for low- to mid-resource languages such as Hausa and Icelandic, we incorporate additional data from freely available online sources such as ccMatrix (schwenk2019ccmatrix), ccAligned (elkishky2020ccaligned), and OPUS (tiedemann2012opus). We utilize all available data sources to develop the best quality translation model possible.
For English-Hausa (and Hausa-English), we also mined extra parallel data from the provided monolingual data. We use LaBSE (feng2020language) to embed Hausa and English sentences into the same embedding space. We then use the margin function formulation (artetxe2019margin) based on k-nearest neighbors (kNN) to score and rank pairs of sentences from the two languages. Using the mining strategy from tran2020cross, we mined an additional one million pairs of parallel sentences for English-Hausa.
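The margin criterion can be sketched as follows; this is an illustrative implementation (function and variable names are ours), assuming pre-normalized LaBSE embeddings and the ratio-margin variant:

```python
import numpy as np

def margin_scores(x_emb, y_emb, k=4):
    """Ratio-margin scores between all pairs of L2-normalized sentence
    embeddings x_emb (n, d) and y_emb (m, d): the cosine similarity of a
    pair divided by the average similarity of each side to its k nearest
    neighbors. High scores indicate likely translation pairs."""
    sim = x_emb @ y_emb.T                       # cosine similarities
    # mean similarity of each sentence to its k nearest neighbors
    knn_x = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # (n,)
    knn_y = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # (m,)
    denom = (knn_x[:, None] + knn_y[None, :]) / 2.0
    return sim / denom
```

In practice, mining at scale uses approximate kNN search (e.g. FAISS) rather than the dense similarity matrix shown here.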
The majority of available bitext represents noisy alignments rather than the output of human translation. We apply several preprocessing steps to filter noisy data. First, we apply language identification using fastText (joulin2017bag) and retain sentences predicted to be in the desired language (for Hausa, the language identification system was unreliable, so we did not apply this filter). We then normalize punctuation with Moses. Finally, we remove sentences longer than 250 words or with a source/target length ratio exceeding 3.
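The length and ratio filters can be sketched as a simple predicate (a hypothetical helper, not the actual pipeline code):

```python
def keep_pair(src, tgt, max_len=250, max_ratio=3.0):
    """Return True if a sentence pair passes the length and
    source/target length-ratio filters described above."""
    s, t = src.split(), tgt.split()
    if len(s) == 0 or len(t) == 0:
        return False
    if len(s) > max_len or len(t) > max_len:
        return False
    ratio = max(len(s), len(t)) / min(len(s), len(t))
    return ratio <= max_ratio
```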
2.2 Monolingual Data
Previous work (ng-etal-2019-facebook) shows that using in-domain monolingual data provides the greatest quality improvement when used for large-scale backtranslation. For high-resource languages such as English and German, sufficiently large quantities of in-domain data are available in Newscrawl, and we do not utilize additional monolingual data. For the remaining languages, the data available in Newscrawl is not sufficient, and we follow the strategy of moore2010intelligent and ng-etal-2019-facebook to examine large quantities of general-domain monolingual data from Commoncrawl (http://data.statmt.org/cc-100/)
and identify a subset that is most similar to the available in-domain news data. For each language, we train an n-gram language model (kenneth2011kenlm) on all available news-domain data (Newscrawl) and another n-gram language model on a similarly sized sample of general-domain data (Commoncrawl). For each sentence s in Commoncrawl, we compute word-normalized cross-entropy scores H_in(s) and H_gen(s) using the in-domain and general-domain language models, respectively. We retain sentences for which H_in(s) - H_gen(s) falls below a threshold. This selects around 5% of the total number of sentences in the original Commoncrawl.
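A minimal sketch of this Moore-Lewis-style selection (names and the threshold value are illustrative; the language models are abstracted as functions returning a sentence's total log-probability):

```python
def word_norm_xent(sentence, logprob_fn):
    """Word-normalized cross-entropy of a sentence under a language
    model exposed as logprob_fn(sentence) -> total log-probability."""
    n = max(len(sentence.split()), 1)
    return -logprob_fn(sentence) / n

def select_in_domain(sentences, in_lm, gen_lm, threshold=0.0):
    """Keep sentences whose in-domain cross-entropy is lower than the
    general-domain one by at least `threshold` (Moore-Lewis selection):
    H_in(s) - H_gen(s) < threshold."""
    kept = []
    for s in sentences:
        score = word_norm_xent(s, in_lm) - word_norm_xent(s, gen_lm)
        if score < threshold:
            kept.append(s)
    return kept
```

In our setting the log-probability functions would be KenLM n-gram models; the threshold is tuned so that roughly 5% of Commoncrawl is retained.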
To create our multilingual vocabulary, we first learn a multilingual subword tokenizer on our combined training data across all languages. We use SentencePiece (kudo2018sentencepiece), which learns subword units from untokenized text. We train our SPM model with temperature upsampling (T=5), similar to conneau2020unsupervised, so that low-resource languages are well represented. Subsequently, we convert the learned SPM units into our final vocabulary.
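Temperature upsampling with T=5 flattens the language distribution seen by the tokenizer trainer: each language's sampling probability is proportional to its data fraction raised to 1/T. A small sketch of the sampling weights (our own helper, not the actual training code):

```python
def sampling_weights(counts, T=5.0):
    """Temperature-based sampling probabilities: p_l ∝ (n_l / N) ** (1/T).
    T=1 recovers the raw data proportions; larger T upsamples
    low-resource languages."""
    total = sum(counts.values())
    raw = {lang: (n / total) ** (1.0 / T) for lang, n in counts.items()}
    z = sum(raw.values())
    return {lang: w / z for lang, w in raw.items()}
```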
3 System Overview
We describe step-by-step how we created our final multilingual submission for WMT2021. We detail our bilingual and multilingual model architectures, as well as how we incorporate strategies such as backtranslation, news-domain finetuning, ensembling, and noisy channel reranking.
3.1 Baseline Bilingual Models
A prerequisite to creating state-of-the-art multilingual translation systems is establishing strong, competitive bilingual baselines. Our goal is to apply the same set of data augmentation and model scaling techniques to both bilingual and multilingual models, and demonstrate that multilingual models have stronger translation quality.
To create baseline bilingual systems, we train a separate Transformer model (vaswani2017attention) for each language direction. For every language pair except Hausa, we use the Transformer 12/12 configuration in Table 2. For Hausa-English (and English-Hausa), since the amount of bitext data is smaller, we use the Transformer-Base architecture similar to vaswani2017attention. We train all our models using fairseq (ott2019fairseq) on 32 Volta 32GB GPUs. We use a learning rate of 0.001 with the Adam optimizer and a batch size of 768,000 tokens (6,000 tokens per GPU × 32 GPUs × update frequency 4), and tune the dropout rate for each language direction independently.
3.2 Backtranslation
Backtranslation (sennrich2015improving) is a widely used data augmentation technique to improve the quality of machine translation systems. To perform backtranslation for a forward language direction (e.g. English to German), we use a system in the backward direction (e.g. German to English) to translate the target-side German monolingual data into English. We then use these synthetic English-to-German sentence pairs in conjunction with the original parallel data to train an improved forward translation model.
We use all available filtered monolingual data for each language (up to 500 million sentences per language) for backtranslation. Using our baseline bilingual models (described in Section 3.1), we first finetune on in-domain news data (described in Section 3.5), and use an ensemble of 3 models trained with different seeds to generate backtranslation data using beam search. For Hausa-English and English-Hausa, we applied a round of iterative backtranslation (hoang2018iterative; chen2019facebook), as the quality improvement was significant.
3.3 Data Sharding and Sampling
Table 1 displays the amount of data for all languages after preprocessing. We divide the data into multiple shards, with each training epoch using one shard. We downsample data from both high-resource directions and synthetic backtranslated data by dividing them into a greater number of shards than the real bitext data from low-resource directions. We find that downsampling high-resource languages works better than upsampling low-resource languages, as upsampling contributes more strongly to overfitting.
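The sharding scheme can be sketched as follows; the shard size here is a hypothetical value, not the one used in our system:

```python
import math

def num_shards(sentence_counts, shard_size):
    """Split each corpus into ceil(n / shard_size) shards. High-resource
    corpora are spread over more shards; since one shard is consumed per
    epoch, this effectively downsamples high-resource data relative to
    low-resource data within each epoch."""
    return {name: max(1, math.ceil(n / shard_size))
            for name, n in sentence_counts.items()}
```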
3.4 Model Architectures
We describe several model architectures that we compared using the final dataset with both bitext and backtranslated data.
Scaling Bilingual Models.
Based on the baseline architectures described in Section 3.1, we further improve our bilingual models. The two main improvements are: adding backtranslated data, and adding deeper and wider Transformer configurations to take advantage of the increase in data.
Dense Multilingual Models.
For the multilingual systems, we train two separate models: Many to English, or one system encompassing every language translated into English, and English to Many, or one for English into every language. The challenge of multilingual models is often one of capacity — given a fixed number of parameters, a model needs to learn representations of numerous languages rather than just one. To understand the needed capacity and optimal architectural configuration, we experiment with different Transformer architectures, ranging from 480M parameters to 4.7B parameters (see Table 2).
Sparsely Gated MoE Multilingual Models.
In multilingual models, languages necessarily compete for capacity and must balance sharing parameters with specialization for different languages. A straightforward way to add capacity to neural architectures is to simply scale the model size in a dense manner: increasing the number of layers, the width of the layers, or the size of the hidden dimension. However, this has a significant computational cost, as each forward pass activates all parameters — at the limit, models become incredibly slow to train and produce translations fan2021beyond.
In this work, we instead focus on sparse model scaling, motivated by the desire to increase capacity without a proportional increase in computational cost. We train Sparsely Gated Mixture-of-Experts (MoE) models (lepikhin2020gshard) for English to Many and Many to English. These models aim to strike a balance between allowing high-resource directions to benefit from increased expert capacity, while also allowing transfer to low-resource directions via shared capacity. In each Sparsely Gated MoE layer, each token is routed to the top-k expert FFN blocks based on a learned gating function, so only a subset of the model's parameters is used for each input token.
We use a Transformer architecture in which the feed-forward block in every alternate Transformer layer is replaced with a Sparsely Gated Mixture-of-Experts layer with top-2 gating in both the encoder and decoder. As in lepikhin2020gshard, we also add a gate loss term to balance expert assignment across tokens, with a gate loss weight of 0.01. We use an expert capacity factor of 2.0, a learning rate of 0.001 with the Adam optimizer and 4000 warmup updates, and a batch size of 1 million tokens (MoE model with 64 experts) or 1.5 million tokens (MoE model with 128 experts).
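A toy, dense-computation sketch of a top-k gated MoE layer (for illustration only; real implementations dispatch tokens to experts in parallel, enforce the capacity factor, and add the gate loss term):

```python
import numpy as np

def moe_layer(x, gate_w, experts, k=2):
    """Toy top-k gated MoE feed-forward layer.
    x: (tokens, d) activations; gate_w: (d, n_experts) gating weights;
    experts: list of callables, each mapping a (d,) vector to (d,)."""
    logits = x @ gate_w                               # (tokens, n_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)             # softmax gate
    topk = np.argsort(probs, axis=-1)[:, -k:]         # top-k expert indices
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = topk[t]
        gates = probs[t, sel] / probs[t, sel].sum()   # renormalize top-k
        for e, g in zip(sel, gates):
            out[t] += g * experts[e](x[t])            # weighted expert mix
    return out
```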
3.5 In-Domain Finetuning
Finetuning with domain-specific data is an effective method of improving translation quality for the desired domain, so we curated news-domain data for finetuning. For languages such as German and Russian, we finetune on evaluation datasets from previous years of WMT. For Hausa and Icelandic, as no such data from previous years exists, we use mined data filtered to the subset identified as most likely news domain. Subsequently, we finetune our models on the in-domain data for a maximum of ten epochs, selecting the best model by validation loss on the newstest2020 dev set. For our submission, we use the settings tuned on newstest2020 and include the newstest2021 dev set in the final finetuning.
3.6 Checkpoint Averaging
To combat bias toward recent training data, it is common to average parameters across multiple checkpoints of a model vaswani2017attention. We apply this technique to all models and average the last five checkpoints. To address rapid overfitting during finetuning, we also average the finetuned model with the model after the initial training is complete and select this averaged set of parameters if it performs better on the validation data.
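Checkpoint averaging itself is a simple element-wise mean over saved parameter dictionaries; a minimal sketch:

```python
def average_checkpoints(state_dicts):
    """Element-wise average of parameter dictionaries
    (parameter name -> tensor/array/scalar), as used for averaging the
    last few checkpoints or a finetuned model with its starting point."""
    n = len(state_dicts)
    keys = state_dicts[0].keys()
    return {k: sum(sd[k] for sd in state_dicts) / n for k in keys}
```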
3.7 Noisy Channel Re-ranking
We apply noisy channel re-ranking to select the best candidate translation from the n-best hypotheses generated with beam search. We follow yee2019simple and bhosale2020language and utilize scores from the direct model P(t|s), the channel model P(s|t), and the language model P(t). To combine these scores for reranking, for each of our n-best hypotheses we calculate:

log P(t|s) + λ1 * log P(s|t) + λ2 * log P(t)

The weights λ1 and λ2 are determined by a random search over 1000 trials on a validation set, selecting the weights that give the best performance. In addition, we also tune a length penalty. The search bounds for the weights and the length penalty are [0, 2].
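The reranking step can be sketched as follows, assuming the three log-scores and target length have already been computed for each hypothesis (the tuple layout is ours):

```python
def rerank(hyps, lam1, lam2, len_penalty):
    """Pick the best hypothesis under noisy-channel scoring:
    log P(t|s) + lam1 * log P(s|t) + lam2 * log P(t) + len_penalty * |t|.
    Each hyp is (target, log_p_direct, log_p_channel, log_p_lm, tgt_len)."""
    def score(h):
        _, direct, channel, lm, n = h
        return direct + lam1 * channel + lam2 * lm + len_penalty * n
    return max(hyps, key=score)[0]
```

In our system, lam1, lam2, and the length penalty are the quantities tuned by random search on the validation set.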
We trained Transformer-based language models for all languages on the same monolingual data as used for backtranslation. The exception is English, where we trained on the CC100 English data and the RoBERTa training data (conneau2020unsupervised; wenzek2019ccnet; liu2019roberta). For the high-resource languages, the language models have 12 decoder layers and embedding dimension 4096. For Hausa and Icelandic, we trained smaller language models with 6 decoder layers to prevent overfitting.
As a final step, we apply post-processing to the translation outputs for Czech, German, Icelandic, Japanese, and Chinese. For Czech, German, and Icelandic, we convert quotation marks to German double-quote style (https://en.wikipedia.org/wiki/Quotation_mark#German). For Chinese and Japanese, we convert punctuation marks to the language-specific punctuation characters.
| Model | cs | de | ha | is | ja | ru | zh | Avg |
|---|---|---|---|---|---|---|---|---|
| Bilingual Dense 12/12 | 28.3 | 38.0 | 28.3 | 34.5 | 21.1 | 38.0 | 30.8 | 31.3 |
| Dense 24/24 Wide | 29.0 | 37.9 | 24.5 | 36.8 | 21.2 | 36.9 | 30.4 | 31.0 |
| Bilingual Dense 12/12, BL-FT | 30.4 | 42.8 | 30.3 | 35.5 | 24.6 | 39.5 | 36.2 | 34.2 |
| Dense 12/12, ML-FT | 30.3 | 42.4 | 32.7 | 37.5 | 23.9 | 39.5 | 34.2 | 34.4 |
| MoE-64 12/12, ML-FT | 31.6 | 43.5 | 33.4 | 38.8 | 25.7 | 39.8 | 36.0 | 35.5 |
| Dense 24/24, ML-FT | 31.8 | 43.4 | 36.0 | 38.8 | 25.6 | 40.3 | 36.3 | 36.0 |
| MoE-128 24/24, ML-FT | 31.9 | 43.6 | 34.9 | 39.7 | 26.5 | 40.4 | 37.2 | 36.3 |
| Dense 24/24 Wide, ML-FT | 32.1 | 43.8 | 36.1 | 39.4 | 26.7 | 40.6 | 36.9 | 36.5 |
| Model | cs | de | ha | is | ja | ru | zh | Avg |
|---|---|---|---|---|---|---|---|---|
| Bilingual Dense 12/12 | 33.1 | 39.6 | 23.1 | 29.4 | 26.1 | 25.7 | 42.4 | 31.3 |
| Dense 24/24 Wide | 33.4 | 39.7 | 23.4 | 32.0 | 28.0 | 26.6 | 42.2 | 32.2 |
| Bilingual Dense 12/12, BL-FT | 35.7 | 39.5 | 23.3 | 29.4 | 27.7 | 26.0 | 43.0 | 32.1 |
| Dense 12/12, ML-FT | 35.0 | 39.1 | 22.9 | 30.5 | 26.9 | 25.6 | 41.5 | 31.6 |
| MoE-64 12/12, ML-FT | 35.9 | 40.4 | 24.1 | 29.6 | 28.8 | 26.4 | 43.0 | 32.6 |
| Dense 24/24, ML-FT | 35.8 | 40.1 | 24.1 | 31.6 | 28.7 | 26.8 | 42.5 | 32.8 |
| MoE-128 24/24, ML-FT | 36.4 | 40.8 | 24.6 | 31.2 | 29.7 | 26.8 | 43.6 | 33.3 |
| Dense 24/24 Wide, ML-FT | 36.7 | 40.6 | 24.6 | 32.0 | 29.3 | 26.7 | 43.0 | 33.3 |
| Model | en-de |
|---|---|
| Bilingual 24/24 Wide | 40.3 |
| Bilingual 12/12 + FT | 40.4 |
| Bilingual 24/24 + FT | 40.5 |
| Bilingual 24/24 Wide + FT | 40.4 |
4 Experiments and Results
We conduct experiments to quantify the impact of each component of our system. All experiments are evaluated on newstest2020 (barrault-etal-2020-findings) using SacreBLEU (DBLP:journals/corr/abs-1804-08771).
4.1 Creating State-of-the-Art Multilingual Translation Models
We investigate the effectiveness of multilinguality in translation. Compared to bilingual models, which can dedicate their capacity to specializing in specific source and target languages, multilingual systems must learn to effectively share available capacity across all languages while balancing languages of different resource levels. Despite rising research interest, previous WMT submissions have not demonstrated quality improvement of multilingual models over bilingual models. We discuss various choices and comparisons that build our state-of-the-art multilingual translation system. Overall, the best multilingual systems outperform the best bilingual ones in 11 out of 14 directions, with an average improvement of +0.8 BLEU.
4.1.1 Building a Multilingual Vocabulary.
Similar to how multilingual systems must share model capacity, multilingual translation models must also share vocabulary capacity. Instead of learning specialized subword units for a specific language (often 32k), multilingual models group all languages together to learn a shared vocabulary much smaller than 32k times the number of languages. We first examine the impact of this multilingual vocabulary by taking a bilingual system and training it with the multilingual vocabulary. Any performance difference then comes not from the architecture, but from the vocabulary itself. Table 3 indicates that across all directions, using a specialized bilingual vocabulary is usually superior, meaning multilingual systems must bridge the performance gap of a potentially subpar vocabulary. However, for some directions such as en-is and en-ja, no difference is observed.
4.1.2 Comparing Model Architectures.
Dense Transformer Models.
Overall, we find that dense multilingual models are fairly competitive with dense bilingual models (see Table 4). Importantly, we find that multilingual models benefit greatly from additional model capacity. In Table 5, we show comparable dense scaling applied to a bilingual model translating from English to German. While the multilingual model improves by up to 1 BLEU point, the bilingual model only improves by +0.3 BLEU, indicating diminishing returns and possible overfitting in bilingual models. Scaling multilingual translation models thus has stronger potential for performance improvement.
Sparsely Gated Mixture of Expert Models.
If multilingual models benefit from greater capacity, what is the best way to add that capacity? In Table 4, we compare the performance of Dense and MoE multilingual models while keeping the FLOPs per update approximately the same for fair comparison. Due to the conditional compute capacity of MoE layers, MoE models have a greater number of total parameters, but a comparable computational cost with the corresponding dense model.
For Many to English and English to Many, the MoE model with 64 experts per MoE layer gives an average boost of +0.7 BLEU on the dev set. In comparison, increasing the dense model size from 12/12 to 24/24 does not yield significant improvement for Many to English, though it brings around +1 BLEU for English to Many. We also see a slight decline or no improvement in the performance of MoE models (MoE-64 12/12 vs. MoE-128 24/24) when increasing model dimensionality and increasing the number of experts from 64 to 128. One possible hypothesis is that 128 experts are largely unnecessary for only 7 languages. Compared to 64 experts, training convergence per expert is slower, as each expert is exposed to fewer tokens during training on average.
After finetuning on in-domain data, we observe a significant improvement in performance across the board. There is a larger improvement from finetuning in MoE models compared to the associated dense baselines. Furthermore, the MoE model with 128 experts, which previously lagged behind the MoE model with 64 experts, now gives the best results for all but two directions. A possible hypothesis is that expert capacity in MoE models can retain specialized direction-specific finetuning better than dense models, where all language directions must share all model capacity while finetuning.
4.1.3 Effects of In-Domain Finetuning
Finetuning Improves Multilingual More than Bilingual.
Table 6 compares the impact of finetuning across a variety of models. Multilingual systems benefit more from in-domain finetuning. As a result, the best multilingual system always outperforms the best bilingual system.
Multilingual Finetuning is better than Bilingual Finetuning.
For multilingual models, there are two possible finetuning schemes tang2020multilingual. The multilingual model could be finetuned to specialize to the news domain in a multilingual fashion, concatenating the news data for all languages, or could be finetuned for each direction separately by training on bilingual news domain data. We compare multilingual in-domain finetuning with bilingual in-domain finetuning in Table 6. We find that multilingual finetuning is almost always better than bilingual finetuning, indicating that it is not necessary to take a multilingual system and specialize it to be bilingual via bilingual finetuning — a completely multilingual system still has the strongest performance.
4.1.4 Human Evaluation.
While a number of studies have been conducted on bilingual models to understand how BLEU correlates with human-perceived quality, few studies have investigated multilingual ones. Given a bilingual system and a multilingual system with the same BLEU scores, we want to understand if there is anything intrinsically different in the multilingual system output that would impact human evaluation.
We study two directions: English to German and English to Russian. We ask human annotators who are fluent in the source language and native in the target language to evaluate translation quality between a bilingual system output and a multilingual system output. Both systems have similar BLEU scores, within a few tenths of a point, and the translations are generated from the same English source sentences. We find no statistically significant difference between human evaluations of the two systems, indicating that human evaluators have no innate preference for bilingual or multilingual systems.
4.2 Impact of Large-scale Backtranslation
Large-scale backtranslation has contributed to improvements in performance in machine translation models edunov2018understanding, even when measured in human evaluation studies edunov2019evaluation; bogoychev2019domain — it is a component integrated into most modern translation systems. However, backtranslation also has downsides. Research has indicated that systems trained with large scale backtranslation data tend to overfit to the synthetically generated source sentences, producing lower quality translations when translating original source sentences marie2020tagged. Further, backtranslation is fundamentally a form of data augmentation, which could have increasingly marginal effect when large-scale mined bitext is directly incorporated into training datasets. Beyond mining, multilingual translation can also be seen as an inherent form of data augmentation, as language directions can benefit from the training data of other directions. Thus, we analyze further in this section the continued importance of backtranslation, even in multilingual systems.
Backtranslation in Bilingual Systems.
First, we investigate if backtranslated data is still helpful, even after we augment the training dataset with mined and publicly available training data, beyond what is distributed in the WMT Shared Task. Our results in Table 7 show that backtranslation is helpful for 10 out of 14 directions, especially for low resource directions such as ha-en and is-en. However, for high resource directions such as de-en, ru-en, zh-en, bilingual systems trained with backtranslation had slightly lower validation BLEU compared to those trained without backtranslation.
Finetuning Corrects Overfitting to Translationese.
We further investigate the anomaly that high-resource directions can suffer from adding backtranslated data. Figure 1 shows that the minor BLEU degradation from adding backtranslation mostly disappears after applying in-domain finetuning. For zh-en and cs-en after in-domain finetuning, the system trained with backtranslation has stronger performance (+0.4 BLEU) compared to the system trained without backtranslation. Previous studies of this effect have indicated that backtranslation produces translationese, which has distinct qualities compared to original training data marie2020tagged; zhang2019effect; graham2020translationese. We hypothesize that in-domain finetuning, which trains the model on non-backtranslated data, can have a corrective effect that counteracts overfitting on translationese.
Backtranslation in Multilingual Systems.
Table 8 summarizes the performance improvement from adding backtranslation to multilingual models in an ablation study. Overall, despite creating a fully unconstrained system with substantially greater training data and leveraging the data-sharing potential of multilingual translation, we find that backtranslation still improves performance. We believe this is because backtranslation fully utilizes the available monolingual data: while data mining techniques can identify potentially parallel sentences, they naturally recover only a subset of the monolingual data they mine from.
4.3 Ablation on Components of Final Submission
Finally, we analyze each component of our final submission and their cumulative effect. The effect of each component is shown in Table 9.
We find that our bilingual baselines have high BLEU scores, particularly for ru-en where our bilingual baseline is already stronger than the WMT20 winner. Overall, we observe that only en-ha and ha-en are significantly lower than 20 BLEU, indicating that curating a large amount of high quality bitext data is likely the most important basis of a strong system.
Subsequently, we add backtranslated data. We observe that ha, is, and ja in particular see large improvements in BLEU after adding backtranslated data, while other directions can actually decrease slightly in quality, possibly an effect of translationese.
We next evaluate the impact of in-domain finetuning and find an almost 3 BLEU improvement across directions for translation into English and 0.7 BLEU improvement for translation out of English. Across all language directions, finetuning is almost universally helpful.
Compared to bilingual models, multilingual models have stronger performance in every direction. Multilingual models benefit much more from scaling model size, as our largest architecture (MoE-128 24/24) has the best performance.
The effect of ensembling on average is fairly minor, but specific directions can see large improvements (such as +1 BLEU on zh-en).
We then apply noisy channel reranking to the outputs of our final system. It is helpful across almost all directions, but does not have a huge effect on BLEU. On average, performance improves around 0.3 to 0.5 BLEU.
Finally, we observe that postprocessing translated output to use standardized punctuation in each language is very important for BLEU scores when translating out of English. For example, Chinese has language-specific full-width punctuation characters, and using them properly yields almost +5 BLEU. However, we note that these techniques likely only improve BLEU score; the effect on human evaluation is not well understood.
In this paper, we describe Facebook’s multilingual model submission to the WMT2021 shared task on news translation. We employed techniques such as large scale backtranslation, bitext mining, large scale dense and sparse multilingual models, in-domain finetuning, ensembling, and noisy channel reranking. We provide extensive experiment results to quantify the impact of each technique, as well as how well they cumulatively stack to produce the final system. Our results demonstrate that multilingual translation can achieve state-of-the-art performance on both low resource and high resource languages, beating our strong bilingual baselines and previous years’ winning submissions.
We would like to thank Michael Auli and Halil Akin for helping to get this project started and for their help and advice along the way. We would also like to thank Holger Schwenk and Vishrav Chaudhary for their help in obtaining training data.