Bilingual neural machine translation (NMT) systems have achieved strong performance with the help of the Transformer (Vaswani et al., 2017). One of the most exciting recent trends in NMT is training a single system on multiple languages at once (Johnson et al., 2017; Aharoni et al., 2019; Zhang et al., 2020; Fan et al., 2020). This paradigm is powerful for two reasons: it simplifies system development and deployment, and it improves translation quality on low-resource language pairs by transferring knowledge from similar high-resource languages.
This paper describes our experiments on the large-scale multilingual machine translation task of WMT-21. We focus primarily on the small tasks, especially Small Task 2, which has a small amount of training data. Small Task 1 covers five Central/East European languages and English, yielding 30 translation directions. Similarly, Small Task 2 covers five South East Asian languages and English, also yielding 30 translation directions.
In this work, we mainly concentrate on different back-translation methods (Sennrich et al., 2016; Edunov et al., 2018; Graça et al., 2019) for multilingual machine translation, including beam search and other sampling methods. In addition, we explore the effect of different vocabulary sizes and of various amounts of synthetic data. On this large-scale multilingual machine translation task, we placed second in both small tasks, obtaining average spBLEU scores (Goyal et al., 2021) of 34.96 and 33.34 on the hidden test set for Small Task 1 and 2, respectively.
2 Related Work
Multilingual neural machine translation has received increasing attention recently. Since Dong et al. (2015) extended traditional bilingual NMT to one-to-many translation, there has been a massive increase in work on MT systems involving more than two languages (Dabre et al., 2017; Choi et al., 2018; DBLP:journals/corr/abs-1906-07978). Recent research on multilingual NMT can be divided into two directions: developing language-specific components (Kim et al., 2019; DBLP:journals/corr/abs-2006-01594) and training a single model with extensive training data, including parallel and monolingual data (Fan et al., 2020). Here, we continue to explore the second direction, aiming to build a single multilingual NMT model for simple industrial deployment.
Back-translation (Sennrich et al., 2016) has proven to be a powerful technique for leveraging monolingual data to improve low-resource language pairs. Edunov et al. (2018) and Graça et al. (2019) explore different sampling methods for bilingual back-translation, including beam search as well as constrained and unconstrained sampling. Constrained sampling randomly picks the next word from a small set of candidates with the highest prediction probabilities, whereas unconstrained sampling draws the next word from the model's full output distribution over the whole vocabulary. In this paper, we extend their exploration to the multilingual setting, where similar languages affect the results.
3 Experimental Setup
The organizers provide parallel and monolingual data for Small Task 1 and 2. Table 1 shows the size of the data in terms of the number of sentences for each language. There are five additional sets for evaluation: the dev, devtest, hidden dev, hidden devtest, and test sets. The dev set, with 997 parallel sentences across all language pairs, and the devtest set, with 1,012 parallel sentences, are public. In contrast, the hidden dev and hidden devtest sets are invisible to participants and are used for the first submission period. The hidden test set is also invisible and is used for the final ranking.
Pre-processing is done with a standard Moses toolkit (Koehn et al., 2007) pipeline involving tokenization, byte pair encoding, and the removal of overly long sentences. We borrow the 256K vocabulary from the organizer's pretrained model and the 128K vocabulary from M2M_100 (Fan et al., 2021); each is a single vocabulary shared among all languages. Our submissions use only the 256K vocabulary, while the 128K vocabulary is used for ablation experiments.
We also perform back-translation on the monolingual data and only accept a synthetic sentence pair if its length is less than 250 words and the length ratio between the source and target sentence is less than 1.8. To balance the volume across languages, we apply temperature sampling over the datasets, choosing language l with probability proportional to D_l^(1/T), where D_l is the number of sentences in language l and T is the sampling temperature.
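As a concrete illustration, the filtering rule and temperature sampling above can be sketched as follows (the helper names and the default temperature T=5 are illustrative assumptions, not the paper's exact settings):

```python
# Hedged sketch of the synthetic-pair filter: keep a back-translated pair
# only if both sides are shorter than 250 words and the longer/shorter
# length ratio stays below 1.8.
def keep_pair(src: str, tgt: str, max_len: int = 250, max_ratio: float = 1.8) -> bool:
    src_len, tgt_len = len(src.split()), len(tgt.split())
    if src_len == 0 or tgt_len == 0 or max(src_len, tgt_len) >= max_len:
        return False
    return max(src_len, tgt_len) / min(src_len, tgt_len) < max_ratio

# Temperature sampling over language datasets: language l is drawn with
# probability proportional to D_l ** (1 / T), which flattens the
# distribution toward low-resource languages as T grows.
def sampling_probs(sizes: dict, T: float = 5.0) -> dict:
    weights = {lang: n ** (1.0 / T) for lang, n in sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}
```

With T > 1 the ratio between high- and low-resource sampling probabilities is much smaller than the raw data-size ratio, so low-resource languages are seen more often during training.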
Table 1: Number of sentences per language for Small Task 1 and Small Task 2.

| | Trans_small | Trans_base | Trans_big |
| --- | --- | --- | --- |
| Word representation size | 512 | 1,024 | 1,024 |
| Feed-forward layer dimension | 2,048 | 4,096 | 8,192 |
| # pre-normed encoder/decoder layers | 6 | 12 | 24 |
| Layer dropout rate | 0.05 | 0.05 | 0.05 |

Table 2: Settings of the three Transformer architectures.
All our models are built with the fairseq implementation (Ott et al., 2019) of the Transformer architecture (Vaswani et al., 2017). Multilingual models are built using the same technique as Johnson et al. (2017) and Aharoni et al. (2019), namely adding a language label to the target sentence.
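The language-label technique can be illustrated with a minimal sketch (the `__id__` token format and the helper name are assumptions for illustration; the actual tag format depends on the fairseq configuration):

```python
# Minimal sketch of the language-label trick: a token identifying the
# language is prepended to the target sentence, so a single model can
# serve all 30 translation directions.
def add_lang_tag(sentence: str, lang: str) -> str:
    return f"__{lang}__ {sentence}"

# Example training pair for the en-id direction (Indonesian target):
pair = ("good morning", add_lang_tag("selamat pagi", "id"))
```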
We apply three architectures, i.e., Trans_small, Trans_base, and Trans_big. Their detailed settings are shown in Table 2. The parameters of all architectures are stored in the half-precision floating-point format.
All our submissions on the shared task leaderboard use Trans_base due to the memory and time limits of the evaluation system. Trans_small is mainly used for the ablation experiments, and the pretrained Trans_big from M2M_100 (Fan et al., 2021) is finetuned on the parallel corpus to generate high-quality synthetic sentences.
3.3 Optimization and Evaluation
We use the following hyper-parameter configuration: the Adam optimizer, a weight decay of 0.0001, the label-smoothed cross-entropy criterion with a label smoothing of 0.1, and an initial learning rate of 0.0003 with the inverse-square-root learning rate scheduler and 2,500 warmup steps. The batch size (the number of tokens) is set per architecture, with one value for Trans_small and another shared by Trans_base and Trans_big.
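The schedule above can be sketched as follows (a simplified version of fairseq's `inverse_sqrt` scheduler; its warmup-initialization detail is omitted here):

```python
# Inverse-square-root learning rate schedule with linear warmup:
# the rate ramps up linearly for the first `warmup` steps, then decays
# proportionally to 1/sqrt(step).
def inverse_sqrt_lr(step: int, base_lr: float = 3e-4, warmup: int = 2500) -> float:
    if step < warmup:
        return base_lr * step / warmup              # linear warmup
    return base_lr * (warmup ** 0.5) * (step ** -0.5)  # 1/sqrt(step) decay
```

At step 2,500 the two branches meet at the peak rate of 0.0003, after which the rate decays smoothly.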
For the ablation experiments, we continue to train the pretrained Trans_small offered by the organizers on the given parallel dataset for one epoch. When combining parallel and synthetic data, we further train the model already finetuned on the parallel data for another epoch. For the final submissions, we train a pretrained Trans_base for two epochs instead of one. The pretrained Trans_big from M2M_100 is further trained on parallel data only, for two epochs, to generate high-quality synthetic data. Even though we train these models for only a few epochs, they appear to converge well according to the validation spBLEU curve.
The model is validated every 3,000 steps on the dev set and saved. We use beam search with a beam size of five and stop translation once the target length reaches a limit determined by the source sentence length. The evaluation metric is BLEU based on SentencePiece tokenization (spBLEU; Goyal et al., 2021). We submit the average of the last 15 checkpoints to the evaluation system, whereas for the ablation experiments we use the model that performs best on the dev set.
4.1 The Role of Vocabularies
There are two pretrained vocabularies: one of size 256K from the organizers and one of size 128K from M2M_100 (Fan et al., 2021). To evaluate which vocabulary is better, we train two Trans_small models with these vocabularies from scratch on the parallel data of Small Task 2 for five epochs. To make the parameter counts of the two models comparable, we use the following hyper-parameters for the model with the 128K vocabulary: 5 pre-normed encoder and decoder layers with a word representation size of 768 and a feed-forward layer dimension of 3,072, resulting in 181M parameters. The other settings remain the same as Trans_small (with the 256K vocabulary).
Table 3 shows the performance with the different vocabularies. The 128K vocabulary clearly outperforms the 256K vocabulary when training from scratch, 23.14 vs. 21.65 spBLEU. However, finetuning the pretrained Trans_small with the 256K vocabulary yields a 0.58-point improvement over the 128K Trans_small. In short, the 128K vocabulary is the better choice for training from scratch, while a pretrained model offers more gain.
| Model | spBLEU |
| --- | --- |
| 128K Trans_small (scratch) | 23.14 |
| 256K Trans_small (scratch) | 21.65 |
| 256K Trans_small (pretrained) | 23.72 |

Table 3: spBLEU with different vocabularies on Small Task 2.
| Finetuning round (Small Task 2) | spBLEU |
| --- | --- |
| 1st, finetuned on parallel data | 28.27 |
| 2nd, finetuned on synthetic data | 32.16 |
| 3rd, finetuned on synthetic data | 33.01 |

Table 4: Iterative finetuning results on the devtest set of Small Task 2.

| Finetuning round (Small Task 1) | spBLEU |
| --- | --- |
| 1st, finetuned on parallel data | 32.46 |
| 2nd, finetuned on synthetic data | 34.73 |

Table 5: Finetuning results for Small Task 1.
(Table: final spBLEU scores per Small Task on the devtest, hidden dev, hidden devtest, and hidden test sets.)
4.2 Different Back-translation Methods
Similar to Edunov et al. (2018), we explore three back-translation methods: beam search with a beam size of five (Sennrich et al., 2016), unconstrained sampling (Edunov et al., 2018), and sampling constrained to the 10 most likely words (Graves, 2013; Ott et al., 2018; Fan et al., 2018). Unconstrained sampling draws the next word from the model's full output distribution over the whole vocabulary. In contrast, constrained sampling draws the next word only from the candidates with the highest prediction probabilities. Both can be viewed as adding randomness to greedy search.
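The three decoding strategies can be illustrated on a toy next-token distribution (a simplified sketch, not fairseq's implementation; beam search is reduced to its beam-size-1 special case, i.e., the argmax):

```python
import random

# Toy sketch of one decoding step, given a model's next-token probability
# distribution `probs` over a small vocabulary of integer token ids.
def next_token(probs, mode="constrained", k=10, rng=random):
    vocab = list(range(len(probs)))
    if mode == "beam":
        # beam search keeps high-probability continuations; with beam
        # size 1 this reduces to picking the most likely token
        return max(vocab, key=lambda i: probs[i])
    if mode == "constrained":
        # top-k sampling: sample among the k most likely tokens,
        # weighted by their (renormalized) probabilities
        top = sorted(vocab, key=lambda i: -probs[i])[:k]
        return rng.choices(top, weights=[probs[i] for i in top], k=1)[0]
    if mode == "unconstrained":
        # sample from the full model distribution
        return rng.choices(vocab, weights=probs, k=1)[0]
    raise ValueError(mode)
```

Constrained sampling keeps the diversity benefit of sampling while excluding the long tail of unlikely tokens, which, as argued below, matters when similar languages share one vocabulary.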
Figure 1 shows the back-translation results on the devtest set of Small Task 2. We combine three different amounts of synthetic data with the parallel data to further train our Trans_small models after they have been finetuned on the parallel data. The 80M synthetic sentences cover only 6M of the monolingual English data but all monolingual data in the other languages. Beyond these 80M synthetic sentences, we further increase the amount of monolingual English data to measure how performance depends on the amount of synthetic English data on the target side. We do this because there is far more monolingual English data than data in any other language, and we want to check whether using all of it is necessary.
As seen in Figure 1, little improvement is obtained by increasing the monolingual English data beyond 6M sentences. Moreover, in contrast to the results of Edunov et al. (2018), where unconstrained sampling performs best among the three methods, constrained sampling gives us the best score.
Beam search is the worst of the three methods. We hypothesize that this is because beam search focuses only on high-probability words, while both constrained and unconstrained sampling produce more diverse translations on the source side. Trained on this more diverse synthetic data, the model generalizes better.
In contrast to the bilingual (English-German) setting of Edunov et al. (2018), where unconstrained sampling outperforms constrained sampling, the multilingual setting of Small Task 2 contains similar languages. We argue that unconstrained sampling may generate synthetic sentences that mix similar languages, which damages the quality of the synthetic data, while constrained sampling restricts the candidate words and thus largely avoids mixing languages.
The small effect of additional synthetic English data beyond 6M sentences might be explained by English being dissimilar to the other five South East Asian languages: little similar knowledge can be transferred from this additional data to the other languages.
4.3 Final Submissions
Section 4.1 suggests employing a pretrained model with the 128K vocabulary. M2M_100 (Fan et al., 2021) offers several pretrained models with the 128K vocabulary (https://github.com/pytorch/fairseq/tree/master/examples/m2m_100), with 418M, 1.2B, and 12B parameters, respectively. Given our limited GPU budget, we finetune the 1.2B model, i.e., Trans_big, on the parallel data of Small Task 2, obtaining 28.78 spBLEU on the devtest set. In comparison, training a Trans_base on the same data only reaches 28.23 spBLEU. Even though Trans_big outperforms Trans_base, we use it only to generate high-quality synthetic data, since it is too large for the evaluation system.
Section 4.2 advises using the constrained sampling method on part of the monolingual English data. With constrained sampling, we generate synthetic sentences with the Trans_big first finetuned on the parallel data. Instead of using all monolingual English data, we synthesize en-id, en-jv, en-ms, en-ta, and en-tl with all, 15M, 60M, 10M, and 60M monolingual English sentences, respectively, keeping a roughly fixed ratio between the number of parallel and synthetic sentences whenever enough monolingual data are available.
Table 4 shows the results of iterative finetuning. Beyond finetuning Trans_base on the combination of the parallel data and the synthetic data generated by Trans_big, we use the finetuned Trans_base to generate synthetic data a second time and finetune it again. This finally yields 33.01 spBLEU on the devtest set for Small Task 2.
Due to time and resource limits, we conduct only one trial on Small Task 1. We first finetune the pretrained Trans_base on the parallel data. Then we use this Trans_base to generate synthetic data with only 20M monolingual English sentences and all other monolingual sentences. Table 5 shows the corresponding results. Unlike Small Task 2, a large amount of monolingual English data might help Small Task 1, since Central/East European languages are more similar to English than the South East Asian languages are. We leave this exploration to future work.
We demonstrate that a pretrained model with a smaller vocabulary is the better choice. Because of the memory and time limits of the evaluation system, we can only apply the 1.2B model with the smaller vocabulary to generate high-quality synthetic data. Furthermore, we observe the opposite of previous research on bilingual back-translation: the constrained sampling method performs best among the three back-translation methods, ahead of beam search and unconstrained sampling. Finally, we show that extensive monolingual English data offers only a modest improvement. Combining these three findings, we iteratively train our models on partial high-quality synthetic data, placing second in both small tasks.