This paper describes the machine translation system submitted by the joint team of Samsung Research Philippines and Konvergen AI for the WMT’21 Large-Scale Multilingual Translation Task. Our team participated in Small Track #2, where the task is to produce a multilingual machine translation system for five Southeast Asian languages (Javanese, Indonesian, Malay, Tagalog, and Tamil, the last of which is an official language of Singapore, a Southeast Asian country), plus English, in all 30 directions.
We first describe the filtering heuristics used to preprocess the data, then outline the steps we took to train and evaluate our models. Specific hyperparameters, preprocessing decisions, and other training parameters are listed in their corresponding sections. Finally, we report our results on the FLORES-101 (Goyal et al., 2021) devtest set, as well as on the competition’s hidden test set.
2 Parallel Text Preprocessing Heuristics
The contest dataset comprises various bitext sources, including: bible-uedin (Christodoulopoulos and Steedman, 2015), CCAligned (El-Kishky et al., 2020), ELRC 2922 (https://elrc-share.eu/), MultiCCAligned (El-Kishky et al., 2020), ParaCrawl (https://www.paracrawl.eu/), TED2020 (Reimers and Gurevych, 2020), WikiMatrix (Schwenk et al., 2019), TICO-19, Ubuntu, OpenSubtitles, QED, Tanzil, Tatoeba, GlobalVoices, GNOME, KDE4, and WikiMedia (Tiedemann, 2012).
We preprocess the datasets before training in order to minimize spurious relations that originate from incorrect text pairs. Our preprocessing removes samples based on a few heuristics that we developed from our observations of the datasets. Each bitext file receives a different set of preprocessing steps depending on its characteristics: for example, we filter by number content for datasets such as CCAligned, while TED2020 is not subjected to that same filter.
In this section, we cover the decisions made during preprocessing. We observe a score increase of 1.91 BLEU on our submission model when the preprocessing is applied. We report the total number of lines filtered from the bitext for all language pairs in Table 1.
| ISO | Language Pair | Before Preprocessing | After Preprocessing | Reduction |
| --- | --- | --- | --- | --- |
| en-id | English - Indonesian | 54,075,891 | 27,186,074 | 49.73% |
| en-ms | English - Malay | 13,437,727 | 7,674,956 | 42.89% |
| en-tl | English - Tagalog | 13,612,403 | 5,302,768 | 61.04% |
| en-jv | English - Javanese | 3,044,920 | 388,766 | 87.23% |
| en-ta | English - Tamil | 2,115,925 | 1,420,827 | 32.85% |
| id-ms | Indonesian - Malay | 4,857,321 | 3,371,777 | 30.58% |
| id-tl | Indonesian - Tagalog | 2,743,305 | 1,823,140 | 33.54% |
| id-jv | Indonesian - Javanese | 780,119 | 432,734 | 44.53% |
| id-ta | Indonesian - Tamil | 500,898 | 393,336 | 21.47% |
| ms-tl | Malay - Tagalog | 1,358,486 | 985,493 | 27.46% |
| ms-jv | Malay - Javanese | 434,710 | 250,070 | 42.47% |
| ms-ta | Malay - Tamil | 372,623 | 351,416 | 5.69% |
| tl-jv | Tagalog - Javanese | 817,146 | 544,233 | 33.40% |
| tl-ta | Tagalog - Tamil | 563,337 | 482,618 | 14.33% |
| jv-ta | Javanese - Tamil | 65,997 | 48,806 | 26.05% |
2.1 Filtering by Duplicates
Duplication is present throughout the dataset. Table 2 shows examples of three distinct types:

Duplicates within the same language: Within a subset file of a designated language, multiple lines contain the same string, while their counterparts may feature different translations.

Partial duplication: The whole string of text in one language is present inside its counterpart translation.

Duplication among parallel text: Both the source and target text lines feature exactly the same string. While this may be correct for named entities, most of these duplicates are short and non-informative.
Duplicates within the same file

| Source | Target |
| --- | --- |
| Error reading from file: %s | Error sa pagbasa ng talaksang ’%s’: %s |
| Error seeking in file: %s | Error sa pagbasa ng talaksang ’%s’: %s |
| Error closing file: %s | Error sa pagbasa ng talaksang ’%s’: %s |

Partial duplication

| Source | Target |
| --- | --- |
| CJ E&M Corporation. | Drama iki diprodhuksi déning CJ E&M Corporation. |
| New Orleans, Louisiana. | Lair ing New Orleans, Louisiana. |
| Edward Thomas Hardy. | Jeneng dawané ya iku Edward Thomas Hardy. |

Duplication among parallel text

| Source | Target |
| --- | --- |
| Those who are invited will find the way. | Those who are invited will find the way. |
| Gazelle, whose face the full moon forms: | Gazelle, whose face the full moon forms: |
| Time has warned us never to approach her. | Time has warned us never to approach her. |
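The three duplicate checks above can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual pipeline; the `filter_duplicates` helper and its policy of dropping all copies of a repeated string are our own assumptions.

```python
from collections import Counter

def filter_duplicates(pairs):
    """Drop sentence pairs exhibiting the three duplication types:
    repeated strings within one side, partial duplication, and
    identical source/target lines."""
    # Count how often each source and target string occurs in the file.
    src_counts = Counter(src for src, _ in pairs)
    tgt_counts = Counter(tgt for _, tgt in pairs)
    kept = []
    for src, tgt in pairs:
        if src_counts[src] > 1 or tgt_counts[tgt] > 1:
            continue                    # duplicates within the same language
        if src == tgt:
            continue                    # duplication among parallel text
        if src in tgt or tgt in src:
            continue                    # partial duplication
        kept.append((src, tgt))
    return kept
```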
2.2 Filtering by Language and Letters
In algorithmically aligned datasets such as CCAligned, some training examples are not in the list of contest languages. We find full text lines in Azerbaijani, Turkish, Arabic, and Japanese. To identify these languages, we use langdetect (https://pypi.org/project/langdetect/). This filter works for sentences that are fully foreign. The dataset also contains foreign characters that refer to named entities. We consider these allowable so long as the foreign character string is present in both the source and target text lines. To filter these, we use alphabet-detector (https://pypi.org/project/alphabet-detector/) and check whether detected foreign letters are present in both text lines.
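To make the letter-based check concrete, here is a stdlib-only sketch that approximates the alphabet-detector step by deriving a script label from each letter's Unicode name. The `scripts` and `allow_pair` helper names are ours, and the actual system uses langdetect plus alphabet-detector rather than this approximation.

```python
import unicodedata

def scripts(text):
    """Approximate the set of scripts used by the letters in `text`,
    taking the first word of each character's Unicode name
    (e.g. 'LATIN SMALL LETTER A' -> 'LATIN')."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }

def allow_pair(src, tgt, expected=frozenset({"LATIN"})):
    """Keep a pair if any script outside the pair's expected scripts
    (e.g. a foreign named entity) appears on *both* sides; drop it if
    it appears on only one side."""
    return scripts(src) - expected == scripts(tgt) - expected
```

Note that `expected` must list the native scripts of the language pair at hand; for directions involving Tamil it would include `"TAMIL"` as well.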
2.3 Filtering by Specific Keywords and Symbols
There are a number of cases where the translations are generally correct but also feature extra keywords that have no relation to the parallel text. These keywords are generally in English and are consistently present in a number of bitext datasets such as KDE4, GNOME, and Ubuntu.
| Source | Target |
| --- | --- |
| Task Scheduler | Penjadwal TugasComment |
| Configure and schedule tasks | Atur dan jadwal tugasName |
Bitexts such as OpenSubtitles feature secondary information that relates to a particular scene (for example, "(loud music playing)"). This secondary information may appear in parentheses to denote an action being performed or to signify a song being played, and it is not always available for each language. We opt to remove all lines that contain these specific symbols.
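As an illustration, the two symptom classes above can be handled with simple string rules. This sketch is our own: the keyword list is hypothetical (derived from the examples above), and the parenthesis test mirrors the decision to drop any line carrying such symbols.

```python
import re

# Hypothetical spurious localization keywords, based on the examples above.
SPURIOUS_KEYWORDS = ("GenericName", "Comment", "Name")

def strip_ui_keywords(line):
    """Remove a localization keyword fused onto the end of a translation,
    e.g. 'Penjadwal TugasComment' -> 'Penjadwal Tugas'."""
    for kw in SPURIOUS_KEYWORDS:
        if line.endswith(kw):
            return line[: -len(kw)].rstrip()
    return line

def has_scene_info(line):
    """True if the line carries parenthesized secondary information,
    e.g. '(loud music playing)'; such lines are removed."""
    return re.search(r"\([^)]*\)", line) is not None
```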
2.4 Filtering Number Content
We apply a filter that removes incorrect text lines from the bitext by checking whether the source and target lines feature the same numeric values, such as dates and quantities. Table 4 shows that filtering by number can remove text lines that do not relate to one another, as numeric values tend to be identical across translations. Due to the limited time allotted for the shared task, we opt to entirely remove parallel sentences that do not have matching numbers. We implement this filter using regular expressions.
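A minimal regular-expression sketch of this check, using a `numbers_match` helper of our own devising; separators are normalized so that locale variants such as "1,2835" and "1.2835" compare equal, as in the "Kept" rows of Table 4.

```python
import re

# Matches integers plus date/time/decimal tokens like 24/03/2020 or 13:00.
NUMBER_RE = re.compile(r"\d+(?:[.,:/]\d+)*")

def numbers(text):
    """Extract numeric tokens, normalizing ',' to '.' so that
    locale-specific decimal separators still match."""
    return sorted(tok.replace(",", ".") for tok in NUMBER_RE.findall(text))

def numbers_match(src, tgt):
    """Keep the pair only if both sides carry the same numeric values."""
    return numbers(src) == numbers(tgt)
```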
2.5 Filtering by Length
Text lines with very long lengths are generally not informative; we find that most of these consist of lists of names such as would normally be found in a bibliography. We set an arbitrary maximum length of 500 characters for both source and target sentences.
| | Source | Target |
| --- | --- | --- |
| Removed | Di. 13:00 - 17:30 | Mo. 13:00 - 18:00 |
| Removed | Di 24 nov. 10h – 18h | Sa 23 nov. 10h – 18h |
| Kept | (Terakhir diperbarui saat: 24/03/2020) | (Huling nai-update Sa: 24/03/2020) |
| Kept | Harga / $: 1,2835 | presyo / $: 1.2835 |
3.1 Model Architecture
For our submission, we wish to measure how much performance can be boosted by heuristics-based data preprocessing alone. Given that we anticipate most, if not all, submissions to the shared task will be Transformer-based models, we opt to use the standard “vanilla” sequence-to-sequence Transformer (Vaswani et al., 2017) with little to no changes. This lets us more clearly compare the performance boost of our filtering heuristics against the boost provided by the architecture augmentations and training tricks that other submissions might employ.
In addition to using a standard Transformer model, we train only on our filtered bitext and do not use backtranslation (Sennrich et al., 2015) for data augmentation. We also train from scratch with weights initialized using Glorot uniform initialization (Glorot and Bengio, 2010), opting not to use massively pretrained translation models such as M2M-100 (Fan et al., 2021) as our starting checkpoint.
Following Vaswani et al. (2017), we produce two models: a base model and a large model. For simplicity, for the rest of the paper, we refer to the models trained on our filtered data as the filtered base and filtered large models.
The hyperparameters used for our models are presented in Table 5.
3.2 Data Preformatting and Tokenization
Our models employ a single shared vocabulary for all languages and directions. We train our tokenizer using the SentencePiece library (https://github.com/google/sentencepiece), limiting our vocabulary to 37,000 BPE (Sennrich et al., 2015) tokens and setting SentencePiece’s character coverage accordingly.
Before training the tokenizer, we first preformat the dataset into the format used for training later on. We prepend the source and target languages’ ISO 639-1 codes, enclosed in square brackets, to the beginning of each source sentence. For example:
|[en] [tl] Today is a sunny day.|
is the preformatted version of "Today is a sunny day." when translating from English to Tagalog.
This preformatting is only done for the source sentences in the training dataset, while the target sentences are untouched.
For the purpose of training the tokenizer, the six language tokens ([en], [id], [jv], [ms], [ta], and [tl]) are treated as special tokens to ensure that they will not be segmented later on.
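The preformatting step amounts to prepending two language tokens to the source side only; a minimal sketch follows (the `preformat` helper name is ours).

```python
# ISO 639-1 codes of the six contest languages.
LANG_TOKENS = {"en", "id", "jv", "ms", "ta", "tl"}

def preformat(src_sentence, src_lang, tgt_lang):
    """Prepend bracketed source- and target-language tokens to the
    source sentence; target sentences are left untouched."""
    assert src_lang in LANG_TOKENS and tgt_lang in LANG_TOKENS
    return f"[{src_lang}] [{tgt_lang}] {src_sentence}"
```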
3.3 Training Setup
We then compile our filtered, preformatted bitext and train our base and large models. During training, we limit all source and target sentences to a maximum sequence length of 150 subword tokens; longer sentences are truncated.
Our models are trained using the Adam optimizer (Kingma and Ba, 2014). Following Vaswani et al. (2017), we use the “Noam” learning rate scheduler, linearly increasing the learning rate for the first 8,000 steps and decaying it afterward. We also follow the same work in setting Adam’s β parameters and the label smoothing factor.
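The “Noam” schedule has a closed form given by Vaswani et al. (2017); it is sketched below with the 8,000-step warmup from the text and, as an assumption of ours, the base-model dimension d_model = 512 (the exact peak learning rate is not stated here).

```python
def noam_lr(step, d_model=512, warmup=8000):
    """Noam schedule: lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5).
    Rises linearly for `warmup` steps, then decays as 1/sqrt(step)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The two branches of the `min` intersect exactly at `step == warmup`, which is where the learning rate peaks.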
For batching, we accumulate tokens until we reach a maximum size of approximately 32,000 tokens per batch, an increase over the 25,000 tokens used by Vaswani et al. (2017). We then train the base model and the large model for 100,000 steps and 300,000 steps, respectively. All our models are trained on 8 NVIDIA Tesla P100 GPUs in parallel using the OpenNMT-py toolkit (Klein et al., 2017).
To generate translations with the model, we use beam search with a fixed beam size and an average length penalty. During generation, we limit all outputs to a maximum sequence length of 100, preemptively terminating generation if it begins to exceed this limit. We do not use sampling during translation, nor do we increase the temperature parameter, as this induces randomness (Lopez, 2020).
We test our experimental models on the FLORES-101 devtest set. We report BLEU scores using the SPM-BLEU variant of SacreBLEU (signature: BLEU+case.mixed+numrefs.1+smooth.exp+tok.spm).
After training our models and producing sample translations from the FLORES-101 devtest set, we compare the results of our two models with a number of baselines:
Transformers with No Heuristics – These models are essentially identical to our Transformer models in terms of architecture, hyperparameters, and training setup, except that the bitext they are trained on is the raw training corpus given in the competition (i.e., the filtering heuristics were not applied). We train these models as an ablation experiment to identify how much of the final performance is attributable to the filtering heuristics.
M2M-100 615M – This is the baseline given for the WMT’21 Large-Scale Multilingual Translation Task Small Track 2 competition. This M2M-100 (Fan et al., 2021) model was trained on CCMatrix and CCAligned with no further finetuning on the contest dataset.
DeltaLM+ZCode – This is the best performing model for Small Track 2. The model is a finetuned version of the DeltaLM (Ma et al., 2021) pretrained encoder-decoder model.
All analyses and results within this section are based on the public devtest set and not the contest’s hidden test set, unless specified. A summary of the BLEU scores for all models and baselines is available in Table 6.
4.1 Transformer + Heuristics vs. Baselines
We report the results of our filtered base and filtered large models against the M2M-100 615M baseline as well as the best performing model for the shared task.
The filtered base model scored an average BLEU of 20.78 across all 30 directions, while the filtered large model scored an average of 22.92, 2.14 BLEU points higher. Both models outperformed the M2M-100 615M baseline, with the base model giving a 5.32 BLEU improvement and the large model a 7.46 BLEU improvement.
It is worth noting that, while the filtered base model outperforms the baseline on average, it fails to outperform it on four specific translation directions: en↔id and en↔ms. Note that these two language pairs have the largest number of training sentences in the training corpus.
The language pairs that benefit significantly from training on the contest dataset are those with less data volume than en-id and en-ms. This is likely because these pairs are less represented in M2M-100’s training dataset, and were thus not learned as well by the model compared to pairs with a higher volume of training data.
The same observations hold when comparing the performance of the filtered large model against the baseline. The filtered large model only marginally outperformed the baseline in one direction (id→en, +0.07 BLEU), and marginally underperformed against it in another (ms→en, -0.47 BLEU). M2M-100’s strength here is likely due to its training method in addition to the size of the training corpora used. While M2M-100 is advantageous in these translation directions, the difference is only marginal, most likely owing to the filtered large model’s size, which gives it higher capacity.
Both our Transformer models and the baseline model are significantly outperformed by DeltaLM+ZCode, the best performing model in the competition. It outperforms our best model (the filtered large model) by a significant 11.02 average BLEU, and the baseline model by 18.48 average BLEU.
While DeltaLM+ZCode outperforms our model in terms of average performance, it is worth noting that our model – a standard Transformer without any augmentations and training tricks – managed to outperform DeltaLM+ZCode in one translation direction: id→jv.
The filtered large model scored 23.91 BLEU, while DeltaLM+ZCode scored 23.35 BLEU. While the difference is marginal (+0.56 BLEU), our model still outperforms the best model in this direction, which we attribute to the quality of our data preprocessing and filtering heuristics.
4.2 Heuristics vs. No Heuristics
To quantify how much our filtering heuristics contributed to the final performance of our models, we trained two additional models, each identical to our base and large Transformer variants except that the training corpus was not processed using our filtering heuristics. For these ablation experiments, we use the same BPE tokenizer used for our main Transformer models (trained on the filtered data) to ensure full model equivalency. To prevent confusion, we refer to these ablation models simply as Base and Large, to differentiate them from our filtered base and filtered large contest models.
On average, both model sizes performed worse when trained without the filtering heuristics. Base scored 19.28 average BLEU on the devtest set, 1.5 points lower than the filtered base model. Large scored 21.01 average BLEU, 1.91 points lower than the filtered large model.
It is interesting, however, that Base outperformed the filtered base model in two translation directions: en→ms and ms→id. This may indicate that the filtering heuristics work better for a certain subset of languages. In future work, we plan to explore how filtering methods such as ours affect multilingual translation datasets in terms of balance and informativeness.
On the other hand, Large performed worse than the filtered large model in all 30 directions. This may be due to the increase in total trainable parameters, as larger models need more data of higher quality to be trained effectively.
4.3 The Case of Tamil
We observe that our models, along with the other models on the shared task leaderboard, struggled with Tamil. X→ta translation is, on average, much worse in terms of BLEU than translation directions that do not involve Tamil.
We hypothesize that this is due to two things.
First, Tamil is the most underrepresented language in the shared task dataset, with the X-ta pairs having the least amount of parallel text for every language X in the training set. This causes the model, to a certain extent, to underfit on directions that translate to or from Tamil.
Second, Tamil is the only language in the shared task dataset that does not use the Latin alphabet. Combined with the fact that it is the most underrepresented language in the dataset, it is possible that the model treated Tamil as noise during training. The observation that X→ta performs worse on average than its inverse direction ta→X lends more credence to this hypothesis: the model does not learn to represent sentences in Tamil well, and thus struggles when generating Tamil translations.
Part of our planned future work includes identifying methods to improve translation in multilingual datasets that span more than one alphabet, in order to improve translation into non-Latin-alphabet languages.
4.4 Hidden Test Set Performance
We also report the performance of our models on the shared task’s hidden test set. We once more compare our results against the baseline M2M-100 model as well as the best performing DeltaLM+ZCode model.
Our final submission for the shared task was the filtered large model, which achieved an average BLEU of 22.97 on the shared task’s hidden test set. This is a marginal difference from its devtest set score (+0.05 average BLEU).
Unsurprisingly, the filtered large model still outperformed the filtered base model (20.73 average BLEU, +2.24 improvement) and the baseline M2M-100 model (14.02 average BLEU, +8.95 improvement) on the hidden test set. The shared task’s best performing model, DeltaLM+ZCode, still outperforms all other models on the hidden test set, scoring 33.89 average BLEU, a 10.92 improvement over our best model.
On the hidden test set, the filtered large model still ranked first in the id→jv translation direction, scoring 24.05 BLEU. This outperforms DeltaLM+ZCode’s 23.79 BLEU (+0.26) and M2M-100’s 15.33 BLEU (+8.72).
A summary of our models’ performance on the hidden test set, alongside the baseline and best performing models, can be found in Table 7.
In this paper, we described the translation systems submitted by the joint Samsung Research Philippines-Konvergen AI team for the WMT’21 Large-Scale Multilingual Translation Small Track 2 shared task. We outlined the filtering heuristics used to preprocess our data, then trained two models on bitext preprocessed with these heuristics, with our best model reaching an average BLEU score of 22.92 on the devtest set and outperforming the baseline model by 7.46 BLEU points. In addition, we ranked sixth overall on the contest leaderboard, scoring 22.97 BLEU on the hidden test set.
We also reached first place in the id→jv translation direction, beating all other, more complex models despite using only a standard Transformer without any special augmentations or training tricks. This provides empirical evidence that data quality and preprocessing decisions weigh just as much as, if not more than, cutting-edge model architectures and training techniques.