Integrating Unsupervised Data Generation into Self-Supervised Neural Machine Translation for Low-Resource Languages

07/19/2021 ∙ by Dana Ruiter, et al. ∙ DFKI GmbH ∙ Universität Saarland

For most language combinations, parallel data is either scarce or simply unavailable. To address this, unsupervised machine translation (UMT) exploits large amounts of monolingual data by using synthetic data generation techniques such as back-translation and noising, while self-supervised NMT (SSNMT) identifies parallel sentences in smaller comparable data and trains on them. To date, the inclusion of UMT data generation techniques in SSNMT has not been investigated. We show that integrating UMT techniques into SSNMT significantly outperforms SSNMT and UMT on all tested language pairs, with improvements of up to +4.3 BLEU, +50.8 BLEU and +51.5 BLEU over SSNMT, statistical UMT and hybrid UMT, respectively, on Afrikaans to English. We further show that the combination of multilingual denoising autoencoding, SSNMT with back-translation and bilingual finetuning enables us to learn machine translation even for distant language pairs for which only small amounts of monolingual data are available, e.g. yielding BLEU scores of 11.6 (English to Swahili).




1 Introduction

Neural machine translation (NMT) achieves high quality translations when large amounts of parallel data are available (WMT:2020). Unfortunately, for most language combinations, parallel data is non-existent, scarce or low-quality. To overcome this, unsupervised MT (UMT) (lampleEtAl:EMNLP:2018; ren2019unsupervised; artetxe2019effective) focuses on exploiting large amounts of monolingual data, which are used to generate synthetic bitext training data via various techniques such as back-translation or denoising. Self-supervised NMT (SSNMT) (ruiter-etal-2019-self) learns from smaller amounts of comparable data, i.e. topic-aligned data such as Wikipedia articles, by learning to discover and exploit similar sentence pairs. However, both UMT and SSNMT approaches often do not scale to low-resource languages, for which neither monolingual nor comparable data are available in sufficient quantity (guzman-etal-2019-flores; espanaEtAl:WMT:2019; marchisio2020does). To date, UMT data augmentation techniques have not been explored in SSNMT. Yet the two approaches can benefit from each other: SSNMT has strong internal quality checks on the data it admits for training, which can be used to filter low-quality synthetic data, while UMT data augmentation makes monolingual data available to SSNMT.

In this paper we explore and test the effect of combining UMT data augmentation with SSNMT on different data sizes, ranging from very low-resource ( non-parallel sentences) to high-resource ( sentences). We do this using a common high-resource language pair (), which we downsample while keeping all other parameters identical. We then proceed to evaluate the augmentation techniques on different truly low-resource similar and distant language pairs, i.e. English ()–{Afrikaans (), Kannada (), Burmese (), Nepali (), Swahili (), Yorùbá ()}, chosen based on their differences in typology (analytic, fusional, agglutinative), word order (SVO, SOV) and writing system (Latin, Brahmic). We also explore the effect of different initialization techniques for SSNMT in combination with finetuning.

2 Related Work

Substantial effort has been devoted to muster training data for low-resource NMT, e.g. by identifying parallel sentences in monolingual or noisy corpora in a pre-processing step (artetxe2018margin; chaudhary-EtAl:2019:WMT; schwenk2019wikimatrix) and also by leveraging monolingual data into supervised NMT e.g. by including autoencoding (currey-etal-2017-copied) or language modeling tasks (gulcehre2015using; ramachandran-etal-2017-unsupervised). Low-resource NMT models can benefit from high-resource languages through transfer learning (zoph-etal-2016-transfer), e.g. in a zero-shot setting (johnson2016google), by using pre-trained language models (lample2019cross; kuwanto2021low), or finding an optimal path for pivoting through related languages (leng2019unsupervised).

Back-translation often works well in high-resource settings (bojar-tamchyna-2011-improving; sennrich-etal-2016-improving; Karakanta2018). NMT training and back-translation have been used in an incremental fashion in both unidirectional (hoang-etal-2018-iterative) and bidirectional systems (zhang2018joint; niu2018bidirectional).

Unsupervised NMT (lampleEtAl:ICLR:2018; artetxeEtAl:ICLR:2018; yangEtAl:2018) applies bi-directional back-translation in combination with denoising and multilingual shared encoders to learn MT on very large monolingual data. This can be done multilingually across several languages by using language-specific decoders (sen-etal-2019-multilingual), or by using additional parallel data for a related pivot language pair (li-etal-2020-reference). Further combining unsupervised neural MT with phrase tables from statistical MT leads to top results (lampleEtAl:EMNLP:2018; ren2019unsupervised; artetxe2019effective). However, unsupervised systems fail to learn when trained on small amounts of monolingual data (guzman-etal-2019-flores), when there is a domain mismatch between the two datasets (kim-etal-2020-unsupervised) or when the languages in a pair are distant (koneru-etal-2021-unsupervised). Unfortunately, all of this is the case for most truly low-resource language pairs.

Self-supervised NMT (ruiter-etal-2019-self) jointly learns to extract data and translate from comparable data and works best on hundreds of thousands of documents per language, well beyond what is available in true low-resource settings.

3 UMT-Enhanced SSNMT

Figure 1: UMT-Enhanced SSNMT architecture (Section 3).

SSNMT jointly learns to translate and to extract similar sentences for training from comparable corpora in an online loop. Sentence pairs from documents in the two languages are fed as input to a bidirectional NMT system, which filters out non-similar sentences after scoring them with a similarity measure calculated from the internal embeddings.

Sentence Pair Extraction (SPE):

Input sentences from the two languages are represented both by the sum of their word embeddings and by the sum of the encoder outputs, and scored using the margin-based measure introduced by artetxe2018margin. If a pair is top scoring for both language directions and for both sentence representations, it is accepted for training; otherwise it is filtered out. This is a strong quality check and equivalent to system P in ruiter-etal-2019-self. An SSNMT model with SPE is our baseline (B) model.
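The acceptance rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the ratio variant of the margin measure from artetxe2018margin, a hypothetical neighbourhood size `k`, and that the two sentence representations are already available as NumPy matrices with one row per sentence. The function names are illustrative only.

```python
import numpy as np

def cosine_matrix(src, tgt):
    """Pairwise cosine similarities between rows of src (n, d) and tgt (m, d)."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return src @ tgt.T

def margin_scores(src, tgt, k=4):
    """Ratio margin: cos(x, y) divided by the mean of the average cosine
    of x and of y to their k nearest neighbours in the other language.
    k=4 is an assumed value, not taken from the paper."""
    cos = cosine_matrix(src, tgt)
    k_src = min(k, cos.shape[1])
    k_tgt = min(k, cos.shape[0])
    nn_src = np.sort(cos, axis=1)[:, -k_src:].mean(axis=1)  # per source sentence
    nn_tgt = np.sort(cos, axis=0)[-k_tgt:, :].mean(axis=0)  # per target sentence
    return cos / ((nn_src[:, None] + nn_tgt[None, :]) / 2.0)

def mutual_best(scores):
    """Pairs (i, j) that are top scoring in both language directions."""
    best_tgt = scores.argmax(axis=1)
    best_src = scores.argmax(axis=0)
    return {(i, int(j)) for i, j in enumerate(best_tgt) if best_src[j] == i}

def accept_pairs(src_we, tgt_we, src_enc, tgt_enc, k=4):
    """Accept a pair only if it is mutually best under both representations
    (sum of word embeddings and sum of encoder outputs)."""
    by_embeddings = mutual_best(margin_scores(src_we, tgt_we, k))
    by_encoder = mutual_best(margin_scores(src_enc, tgt_enc, k))
    return sorted(by_embeddings & by_encoder)
```

Requiring agreement between both representations and both directions is what makes the filter strict: a pair that looks good under only one view is rejected. The sketch does not guard against a zero or negative denominator, which a real system would need to handle.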

Since most possible sentence pairs from comparable corpora are non-similar, they are simply discarded. In a low-resource setting, this potentially constitutes a major loss of usable monolingual information. To exploit sentences that have been rejected by the SSNMT filtering process, we integrate the following UMT synthetic data creation techniques on-line (Figure 1):

Back-translation (BT):

Given a rejected sentence in one language, we use the current state of the SSNMT system to back-translate it into the other language. The synthetic pair in the opposite direction, i.e. with the back-translation as source and the original sentence as target, is added to the batch for further training. We perform the same filtering process as for SPE so that only good quality back-translations are added. We apply the same procedure to rejected sentences in the other language.

Word-translation (WT):

For synthetic sentence pairs rejected by BT filtering, we perform word-by-word translation. Given a rejected sentence, we replace each of its tokens with its nearest neighbor in the bilingual word embedding layer of the model to obtain a word-translated sentence. We then train on the synthetic pair in the opposite direction, i.e. with the word translation as source and the original sentence as target. As with BT, this is applied to both language directions. To ensure sufficient volume of synthetic data (Figure 2, right), WT data is trained on without filtering.
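Word-by-word translation via nearest neighbours can be sketched as below. The sketch assumes a shared embedding matrix with one row per vocabulary entry and a hypothetical list `target_ids` marking which rows count as target-language tokens. How the actual system restricts candidates to the other language is not stated here, so that restriction is an assumption of the example.

```python
import numpy as np

def word_translate(tokens, vocab, embeddings, target_ids):
    """Replace each token by its nearest target-language neighbour.

    vocab:      dict mapping token -> row index in `embeddings`
    embeddings: (V, d) matrix standing in for the bilingual embedding layer
    target_ids: row indices treated as target-language tokens (assumption)
    """
    inv_vocab = {i: t for t, i in vocab.items()}
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    target_ids = np.asarray(target_ids)
    target_matrix = normed[target_ids]
    translated = []
    for tok in tokens:
        if tok not in vocab:
            translated.append(tok)  # copy unknown tokens unchanged
            continue
        sims = target_matrix @ normed[vocab[tok]]  # cosine similarities
        translated.append(inv_vocab[int(target_ids[sims.argmax()])])
    return translated

# Toy usage with made-up two-dimensional embeddings.
vocab = {"cat": 0, "dog": 1, "kat": 2, "hond": 3}
emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]])
print(word_translate(["cat", "dog"], vocab, emb, target_ids=[2, 3]))
```

The resulting word translation is used as the source side and the original sentence as the target side, so the decoder always learns to produce clean text.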

Noise (N):

To increase robustness and variance in the training data, we add noise, i.e. token deletion, substitution and permutation, to copies of source sentences (edunov-etal-2018-understanding) in parallel pairs identified via SPE, back-translations and word-translated sentences and, as with WT, we use these without additional filtering.
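The three noise operations can be illustrated with a short sketch. The probabilities, the filler symbol and the maximum shift below are hypothetical defaults chosen for the example, not values reported in this paper. The permutation uses the common trick of sorting positions after adding a bounded random offset, which keeps every token within `max_shift` positions of where it started.

```python
import random

def add_noise(tokens, p_delete=0.1, p_substitute=0.1, max_shift=3,
              filler="<unk>", rng=None):
    """Return a noisy copy of `tokens` (all parameter values are assumptions)."""
    rng = rng or random.Random()
    noisy = []
    for tok in tokens:
        r = rng.random()
        if r < p_delete:
            continue                      # token deletion
        if r < p_delete + p_substitute:
            noisy.append(filler)          # token substitution
        else:
            noisy.append(tok)
    # Local permutation: sort by original position plus an offset in [0, max_shift].
    keys = [i + rng.uniform(0, max_shift) for i in range(len(noisy))]
    return [tok for _, tok in sorted(zip(keys, noisy), key=lambda kt: kt[0])]

# The noisy copy replaces the source side; the target side stays clean.
source = "the quick brown fox jumps over the lazy dog".split()
print(add_noise(source, rng=random.Random(0)))
```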


When languages are related and large amounts of training data are available, the initialization of SSNMT is not important. However, similarly to UMT, initialization becomes crucial in the low-resource setting (edman-etal-2020-low). We explore four different initialization techniques: no initialization (none), i.e. random initialization of all model parameters; initialization of the tied source- and target-side word embedding layers only via pre-trained cross-lingual word embeddings (WE), with all other layers initialized randomly; and initialization of all layers via denoising autoencoding in a bilingual (DAE) or multilingual (MDAE) setting.

Finetuning (F):

Only when MDAE initialization is used is the subsequent SSNMT training multilingual; otherwise it is bilingual. Due to the multilingual nature of SSNMT with MDAE initialization, the performance on individual languages can be limited by the curse of multilinguality (conneau-etal-2020-unsupervised), where multilingual training leads to improvements on low-resource languages up to a certain point, after which performance decays. To alleviate this, we finetune converged multilingual SSNMT models bilingually on a given language pair.

Comparable                                              Monolingual
# Art ()  VO (%)  # Sent ()       # Tok ()              # Sent ()  # Tok ()
73        7.1     4,589   780     189,990   27,640      1,034      34,759   31,858
18        1.4     1,739   764     95,481    30,003      1,058      47,136   35,534
19        2.1     1,505   477     82,537    15,313      997        43,752   24,094
20        0.6     1,526   207     83,524    7,518       296        13,149   9,229
34        6.5     2,375   244     122,593   8,774       329        13,957   9,937
19        5.7     1,314   34      82,674    1,536       547        17,953   19,370
Table 1: Number of sentences (Sent) and tokens (Tok) in the comparable and monolingual datasets. For comparable datasets, we report the number of articles (Art) and percentage of vocabulary overlap (VO) between the two languages in a pair. # Sent of monolingual data (/) is the same for  and its corresponding due to downsampling of  to match .

4 Experimental Setting

4.1 Data

MT Training

For training, we use Wikipedia (WP) as a comparable corpus. We download the dumps (in February 2021) and extract comparable articles per language pair (Comparable in Table 1). For validation and testing, we use the test and development data from mckellar2020dataset (), (), WAT2020 () (yi2018myanmar), FLoRes () (guzman-etal-2019-flores), surafel2020low (), and MENYO-20k () (adelani2021menyo20k). For  we use newstest2012 for development and newstest2014 for testing. As the  data does not have a development split, we additionally sample 1  sentences from CCAligned (elkishky_ccaligned_2020) to use as  development data. The  test set is divided into several sub-domains, and we only evaluate on the TED talks domain, since the other domains are noisy, e.g. localization or religious corpora.

MT Initialization

We use the monolingual Wikipedias to initialize SSNMT. As the monolingual Wikipedia for Yorùbá is especially small (65  sentences), we use the Yorùbá side of JW300 (agic-vulic-2019-jw300) as additional monolingual initialization data. For each monolingual data pair –{,…,}, the large English monolingual corpus is downsampled to its low(er)-resource counterpart before using the data (Monolingual in Table 1).

For the word-embedding-based initialization, we learn CBOW word embeddings using word2vec (mikolov2013distributed), which are then projected into a common multilingual space via vecmap (artetxe2017acl) to attain bilingual embeddings between –{,…,}. For the weak supervision of the bilingual mapping process, we use a list of numbers ( only) which is augmented with 200 Swadesh list entries for the low-resource experiments.
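To illustrate how a small seed dictionary can align two monolingual embedding spaces, the sketch below solves the orthogonal Procrustes problem on the seed pairs. This is a simplified stand-in, not vecmap itself, which additionally normalizes the embeddings and refines the dictionary iteratively. The function name and the way seed pairs are passed are assumptions of the example.

```python
import numpy as np

def procrustes_map(src_emb, tgt_emb, seed_pairs):
    """Orthogonal matrix W minimising ||X W - Y||_F over the seed dictionary.

    seed_pairs: list of (source_row, target_row) index pairs, e.g. built from
    numerals or Swadesh list entries (hypothetical input format).
    """
    src_idx, tgt_idx = zip(*seed_pairs)
    x = src_emb[list(src_idx)]
    y = tgt_emb[list(tgt_idx)]
    u, _, vt = np.linalg.svd(x.T @ y)
    return u @ vt

# Toy check: recover a known rotation from ten seed pairs.
rng = np.random.default_rng(0)
source = rng.normal(size=(20, 5))
rotation, _ = np.linalg.qr(rng.normal(size=(5, 5)))
target = source @ rotation
w = procrustes_map(source, target, [(i, i) for i in range(10)])
print(np.allclose(source @ w, target))
```

Because the mapping is constrained to be orthogonal, distances within each monolingual space are preserved, so a handful of reliable seed pairs such as numerals can be enough to fix the rotation.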

For DAE initialization, we do not use external, highly-multilingual pre-trained language models, since in practical terms these may not cover the language combination of interest. This is the case here: MBart-50 (tang2020multilingual) does not cover Kannada, Swahili and Yorùbá. We therefore use the monolingual data to train a bilingual (+{,…}) DAE using BART-style noise (liu2020mbart). We set aside 5  sentences for testing and development each. For the noise, we apply word sequence masking (, ), add one random mask insertion per sequence and perform a sequence permutation. For the multilingual DAE (MDAE) setting, we train a single denoising autoencoder on the monolingual data of all languages, where  is downsampled to match the largest non-English monolingual dataset ().
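A rough sketch of the masking part of this noise is given below. It assumes the usual BART-style text infilling, where spans with Poisson-distributed lengths are each replaced by a single mask token, followed by one extra mask insertion. The masking ratio and Poisson parameter are left as required arguments, and the values in the usage lines are arbitrary. The permutation step is not included in the sketch.

```python
import numpy as np

MASK = "<mask>"  # hypothetical mask symbol

def mask_spans(tokens, mask_ratio, poisson_lambda, rng):
    """Mask about `mask_ratio` of the tokens in spans of Poisson-distributed
    length; each run of masked tokens collapses into a single MASK."""
    n = len(tokens)
    budget = int(round(n * mask_ratio))
    masked = [False] * n
    while sum(masked) < budget:
        span = max(1, int(rng.poisson(poisson_lambda)))
        start = int(rng.integers(0, n))
        for i in range(start, min(start + span, n)):
            if sum(masked) < budget:
                masked[i] = True
    out = []
    for tok, is_masked in zip(tokens, masked):
        if not is_masked:
            out.append(tok)
        elif not out or out[-1] != MASK:
            out.append(MASK)
    return out

def insert_mask(tokens, rng):
    """Insert one additional MASK at a random position."""
    out = list(tokens)
    out.insert(int(rng.integers(0, len(out) + 1)), MASK)
    return out

rng = np.random.default_rng(0)
sentence = "we train a denoising autoencoder on monolingual data only".split()
print(insert_mask(mask_spans(sentence, 0.3, 3.0, rng), rng))
```

The autoencoder is then trained to reconstruct the original sentence from the corrupted one, which is how it learns both fluent generation and robustness to noisy input.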

In all cases SSNMT training is bidirectional between two languages –{,…,}, except for MDAE, where SSNMT is trained multilingually between all language combinations in {,,…,}.

4.2 Preprocessing

On the Wikipedia corpora, we perform sentence tokenization using NLTK (bird-2006-nltk). For languages using Latin scripts (,,,) we perform punctuation normalization and truecasing using standard Moses (koehn-etal-2007-moses) scripts on all datasets. For Yorùbá only, we follow adelani2021 and perform automatic diacritic restoration. Lastly, we perform language identification on all Wikipedia corpora using polyglot. After exploring different byte-pair encoding (BPE) (sennrich-etal-2016-neural) vocabulary sizes of 2 , 4 , 8 , 16  and 32 , we choose 2  (), 4  (–{,,,}) and 16  () merge operations using sentence-piece (kudo-richardson-2018-sentencepiece). We prepend a source and a target language token to each sentence. For the  experiments only, we use the data processing by ruiter2020selfinduced in order to minimize experimental differences for later comparison.

4.3 Model Specifications and Evaluation

Systems are either not initialized, initialized via bilingual word embeddings, or via pre-training using (M)DAE. Our implementation of SSNMT is a transformer base with default parameters. We use a batch size of 50 sentences and a maximum sequence length of 100 tokens. For evaluation, we use BLEU (papineni2002BLEU) calculated using SacreBLEU (post-2018-call) and all confidence intervals () are calculated using bootstrap resampling (koehn-2004-statistical) as implemented in multeval (clark-etal-2011-better).
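The idea behind bootstrap resampling for confidence intervals can be sketched as follows. This is a generic illustration rather than the multeval procedure: the number of resamples, the confidence level `alpha` and the toy `exact_match` metric are assumptions of the example, and any corpus-level metric with the same call shape could be passed in its place.

```python
import random

def bootstrap_interval(hyps, refs, metric, n_samples=1000, alpha=0.05, seed=0):
    """Percentile confidence interval of a corpus-level metric.

    The test set is resampled with replacement n_samples times and the
    metric is recomputed on each resample (all defaults are assumptions).
    """
    rng = random.Random(seed)
    n = len(hyps)
    scores = []
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([hyps[i] for i in idx], [refs[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_samples)]
    upper = scores[min(n_samples - 1, int((1 - alpha / 2) * n_samples))]
    return lower, upper

def exact_match(hyps, refs):
    """Toy stand-in for BLEU: percentage of hypotheses equal to their reference."""
    return 100.0 * sum(h == r for h, r in zip(hyps, refs)) / len(hyps)

hypotheses = ["a b", "c d", "e f", "g h"]
references = ["a b", "c d", "x y", "g h"]
print(bootstrap_interval(hypotheses, references, exact_match, n_samples=200))
```

Resampling whole sentences keeps each hypothesis paired with its own reference, so the interval reflects how much the score depends on the particular test sentences drawn.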

5 Exploration of Corpus Sizes ()

To explore which technique works best with varying data sizes, and to compare with the high-resource SSNMT setting in ruiter2020selfinduced, we train SSNMT on , with different combinations of techniques (+BT, +WT, +N) over progressively smaller corpus sizes. The base (B) model is a simple SSNMT model with SPE.

Figure 2: Left: BLEU scores (2) of different techniques (+BT,+WT,+N) added to the base (B) SSNMT model when trained on increasingly large numbers of WP articles (# Articles).
Right: Number of extracted (SPE) or generated (BT,WT) sentence pairs () per technique of the B+BT+WT model trained on 4  comparable WP articles. Number of extracted sentence pairs by the base model is shown for comparison as a dotted line.

Figure 2 (left) shows that translation quality as measured by BLEU is very low in the low-resource setting. For experiments with only 4  comparable articles (similar to the corpus size available for ), BLEU is close to zero with base (B) and B+BT models. Only when WT is applied to rejected back-translated pairs does training become possible, and it is further improved by adding noise, yielding BLEUs of 3.38 (2) and 3.58 (2). Note that such low BLEU scores should be taken with a grain of salt: while there is an automatically measurable improvement in translation quality, a human judge would not see a meaningful improvement between different systems with low BLEU scores. The maximum gain in performance obtained by WT is at 31  comparable articles, where it adds BLEU over the B+BT performance. While the additional supervisory signal provided by WT is useful in the low and medium resource settings, up until articles, its benefits are outweighed by the noise it introduces in the high-resource scenario, leading to a drop in translation quality. Similarly, the utility of adding noise varies with corpus size. Only BT consistently adds a slight gain in performance of 1–2 BLEU over all base models where training is possible. In the high resource case, the difference between B and B+BT is not significant, with BLEU (2) and (2) for B+BT, which also leads to a small, yet statistically insignificant gain over the  SSNMT model in ruiter2020selfinduced, i.e. (2) and (2) BLEU.

At the beginning of training, the number of extracted sentence pairs (SPE) of the B+BT+WT+N model trained on the most extreme low-resource setting (4  articles) is low (Figure 2, right), with 4  sentence pairs extracted in the first epoch. This number drops further to 2  extracted pairs in the second epoch, but then continually rises up to 13  extracted pairs in the final epoch. This is not the case for the base (B) model, which starts with a similar amount of extracted parallel data but then continually extracts less as training progresses. The difference between the two models is due to the added BT and WT techniques. At the beginning of training B+BT+WT is not able to generate backtranslations of decent quality, with only few (196) backtranslations accepted for training. Rejected backtranslations are passed into WT, which leads to large numbers of WT sentence pairs up to the second epoch (56 ). These make all the difference: through WT, the system is able to gain noisy supervisory signals from the data, which makes the internal representations more informative for SPE and thus leads to more and better extractions. Then, BT and SPE enhance each other: SPE ensures that original (clean) parallel sentences are extracted, which improves translation accuracy, and hence more and better backtranslations (e.g. up to 20  around epoch 15) are accepted.

6 Exploration of Language Distance

          English   Afrikaans  Nepali    Kannada        Yorùbá    Swahili        Burmese
Typology  fusional  fusional   fusional  agglutinative  analytic  agglutinative  analytic
Script    Latin     Latin      Brahmic   Brahmic        Latin     Latin          Brahmic
sim       1.000     0.822      0.605     0.602          0.599     0.456          0.419
Table 2: Classification (typology, word order, script) of the languages together with their cosine similarity (sim) to English based on lexical and syntactic URIEL features.

BT, WT and N data augmentation techniques are especially useful for the low- and mid-resource settings of related language pairs such as English and French (both Indo-European). To apply the approach to truly low-resource language pairs, and to verify which language-specific characteristics impact the effectiveness of the different augmentation techniques, we train and test our model on a selected number of languages (Table 2) based on their typological and graphemic distance from English (fusional/analytic, SVO, Latin script). English and Afrikaans are traditionally categorized as fusional languages; however, due to their small morpheme-word ratio, both are nowadays often categorized as analytic languages. Focusing on similarities on the lexical and syntactic level (corresponding to the lang2vec features syntax_average and inventory_average), we retrieve the URIEL (littell-etal-2017-uriel) representations of the languages using lang2vec and calculate their cosine similarity to English. Afrikaans is the most similar language to English, with a similarity of 0.822 (Table 2) and a pre-BPE vocabulary (token) overlap of 7.1% (Table 1), which is due to its similar typology (fusional/analytic) and comparatively large vocabulary overlap (both languages belong to the West-Germanic language branch). The most distant language is Burmese (sim 0.419, vocabulary overlap 2.1%), which belongs to the Sino-Tibetan language family and uses its own (Brahmic) script.

We train SSNMT with combinations of BT, WT, N on the language combinations –{,,,,,} using the four different types of model initialization (none, WE, DAE, MDAE).

Intrinsic Parameter Analysis

We focus on the intrinsic initialization and data augmentation technique parameters. The difference between no (none) and word-embedding (WE) initialization is barely significant across all language pairs and techniques (Figure 3). For all language pairs, except , MDAE initialization tends to be the best choice, with major gains of BLEU (2, B+BT) and BLEU (2, B+BT) over their WE-initialized counterparts. This is natural, since pre-training on (M)DAE allows the SSNMT model to learn how to generate fluent sentences. By performing (M)DAE, the model also learns to denoise noisy inputs, resulting in a big improvement in translation performance (e.g. BLEU, 2 DAE) on the  and  B+BT+WT models in comparison to their WE-initialized counterparts. Without (M)DAE pre-training, the noisy word-translations lead to very low BLEU scores. Adding a further denoising task, either via (M)DAE initialization or via the +N data augmentation technique, lets the model also learn from noisy word-translations with improved results. For  only, the WE initialization generally performs best, with BLEU scores of (2) and (2). For language pairs using different scripts, i.e. Latin–Brahmic (–{,,}), the gain by performing bilingual DAE pre-training is negligible, as results are generally low. These languages also have a different word order (SOV) than English (SVO), which may further increase the difficulty of the translation task (banerjee2019ordering; kim-etal-2020-unsupervised). However, once the pre-training and MT learning are multilingual (MDAE), the different language directions benefit from one another and an internal mapping of the languages into a shared space is achieved. This leads to BLEU scores of (2), (2) and (2) using the B+BT technique. The method is also beneficial when translating into the low-resource languages, with 2 reaching BLEU (B).

Figure 3: BLEU scores of SSNMT Base (B) with added techniques (+BT,+WT,+N) on low-resource language combinations 2 and 2, with .

B+BT+WT seems to be the best data augmentation technique when the amount of data is very small, as is the case for , with gains of BLEU on 2 over the baseline B. This underlines the findings in Section 5, that WT serves as a crutch to start the extraction and training of SSNMT. Further adding noise (+N) tends to adversely affect results on this language pair. On languages with more data available (–{,,,,}), +BT tends to be the best choice, with top BLEUs on  of (2, MDAE) and (2, MDAE). This is due to these models being able to sufficiently learn on B (+BT) only (Figure 4), thus not needing +WT as a crutch to start the extraction and MT learning process. Adding +WT to the system only adds noise and thus makes results worse.

Extrinsic Parameter Analysis

We focus on the extrinsic parameters linguistic distance and data size. Our model is able to learn MT even on distant language pairs such as  (sim ), with top BLEUs of (2, B+BT+WT+N) and (2, B+BT). Despite being typologically closer, training SSNMT on  (sim ) only yields BLEUs above 1 in the multilingual setting (BLEU 2). This is the case for all languages using a different script than English (,,), underlining the fact that achieving a cross-lingual representation, i.e. via multilingual (pre-)training or a decent overlap in the (BPE) vocabulary (as in –{,,}) of the two languages, is vital for identifying good similar sentence pairs at the beginning of training and thus makes training possible. For   the MDAE approach was only beneficial in the 2 direction, but had no effect on 2, which may be due to the fact that  is the most distant language from  (sim 0.419) and, contrary to the other low-resource languages we explore, does not have any related language in our experimental setup, which makes it difficult to leverage supervisory signals from a related language. (Both Nepali and Kannada share influences from Sanskrit; Swahili and Yorùbá are both Niger-Congo languages, while English and Afrikaans are both Indo-European.)

When the amount of data is small (), the model does not achieve BLEUs above without the WT technique or without (M)DAE initialization, since the extraction recall of a simple SSNMT system is low at the beginning of training (ruiter2020selfinduced) and thus SPE fails to identify sufficient parallel sentences to improve the internal representations, which would then improve SPE recall. This is analogous to the observations on the - base model B trained on WP articles (Figure 2). Interestingly, the differences between no/WE and DAE initialization are minimized when using WT as a data augmentation technique, showing that it is an effective method that makes pre-training unnecessary when only small amounts of data are available. For larger data sizes (–{,}), the opposite is the case: the models sufficiently learn SPE and MT without WT, and thus WT just adds additional noise.

Extraction and Generation

Figure 4: Number of extracted (SPE) or generated (BT,WT) sentence pairs () per technique of the best performing SSNMT model (2) per language . Number of extracted sentence pairs by the base model (B) are shown for comparison as a dotted line.

The SPE extraction and BT/WT generation curves (Figure 4) for  (B+BT, WE) are similar to those on  (Figure 2, right). At the beginning of training, not many pairs (32 ) are extracted, but as training progresses, the model internal representations are improved and it starts extracting more and more parallel data, up to 252  in the last epoch. Simultaneously, translation quality improves and the number of backtranslations generated increases drastically from 2  up to 156  per epoch. However, as the amount of data for  is large, the base model B has a similar extraction curve. Nevertheless, translation quality is improved by the additional backtranslations ( BLEU). For  (B+BT+WT+N, WE), the curves are similar to those of , where the added word-translations serve as a crutch to make SPE and BT possible, thus showing a gap between the number of extracted sentences (SPE) () of the best model and those of the baseline (B) (1–2 ). For  (B+BT+WT, WE), the amount of extracted data is very small () for both the baseline and the best model. Here, WT fails to serve as a crutch as the number of extractions does not increase, but instead is overwhelmed by the number of word translations. For –{,} (MDAE), the extraction and BT curves also rise over time. For , where all training setups show similar translation performance in the 2 direction, we show the extraction and BT curves for B+BT with WE initialization. We observe that, as opposed to all other models, both lines are flat, underlining the fact that due to the lack of sufficiently cross-lingual model-internal representations, the model does not enter the self-supervisory cycle common to SSNMT.

Bilingual Finetuning

Best* 51.2 52.2 0.3 0.9 0.1 0.7 0.3 0.5 7.7 6.8 2.9 3.1
MDAE 42.5 42.5 3.1 5.3 0.1 1.7 1.0 3.3 7.4 7.9 1.5 4.7
MDAE+F 46.3 50.2 5.0 9.0 0.2 2.8 2.3 5.7 11.6 11.2 2.9 5.8
Table 3: BLEU scores on the 2 () and 2 () directions of top performing SSNMT model without finetuning and without MDAE (Best*) and SSNMT using MDAE initialization and B+BT technique with (MDAE+F) and without finetuning (MDAE).

The overall trend shows that MDAE pre-training with multilingual SSNMT training in combination with back-translation (B+BT) leads to top results for low-resource similar and distant language combinations. For only, which has more comparable data available for training and is a very similar language pair, the multilingual setup is less beneficial. The model attains enough supervisory signals when training bilingually on , thus the additional languages in the multilingual setup are simply noise for the system. While the MDAE setup with multilingual MT training makes it possible to map distant languages into a shared space and learn MT, we suspect that the final MT performance on the individual language directions is ultimately being held back due to the multilingual noise of other language combinations. To verify this, we use the converged MDAE B+BT model and fine-tune it using the B+BT approach on the different –{,…,} combinations individually (Table 3).

In all cases, bilingual finetuning improves the multilingual model, with a major increase of +4.2 BLEU for English to Swahili (from 7.4 to 11.6, Table 3). Finetuning almost always produces the best-performing model. This shows that three steps together are key to learning low-resource MT even on distant languages without the need of any parallel data: multilingual pre-training (MDAE) to achieve a cross-lingual representation, SSNMT online data extraction (SPE) with online back-translation (B+BT) to obtain increasing quantities of supervisory signals from the data, and focused bilingual fine-tuning to remove multilingual noise.

7 Comparison to other NMT Architectures

Pair  Init.  Config.  Best      Base      UMT       UMT+NMT   Laser      TSS   #P ()
2     WE     B+BT     51.2±.9   48.1±.9   27.9±.8   44.2±.9   52.1±1.0   35.3  37
2     WE     B+BT     52.2±.9   47.9±.9   1.4±.1    0.7±.1    52.9±.9    -     -
2     MDAE   B+BT+F   5.0±.2    0.0±.0    0.0±.0    0.0±.0    -          21.3  397
2     MDAE   B+BT+F   9.0±.2    0.0±.0    0.0±.0    0.0±.0    -          40.3  397
2     MDAE   B+BT+F   0.2±.0    0.0±.0    0.1±.0    0.0±.0    0.0±.0     39.3  223
2     MDAE   B+BT+F   2.8±.1    0.0±.0    0.0±.0    0.0±.0    0.1±.0     38.6  223
2     MDAE   B+BT+F   2.3±.1    0.0±.0    0.1±.0    0.0±.0    0.5±.1     8.8   -
2     MDAE   B+BT+F   5.7±.2    0.0±.0    0.0±.0    0.0±.0    0.2±.0     21.5  -
2     MDAE   B+BT+F   11.6±.3   4.2±.2    3.6±.2    0.2±.0    10.0±.3    14.8  995
2     MDAE   B+BT+F   11.2±.3   3.6±.2    0.3±.0    0.0±.0    8.4±.3     19.7  995
2     MDAE   B+BT+F   2.9±.1    0.3±.1    1.0±.1    0.3±.1    -          12.3  501
2     MDAE   B+BT+F   5.8±.1    0.5±.1    0.6±.0    0.0±.0    -          22.4  501
Table 4: BLEU scores of the best SSNMT configuration (columns 2-4) compared with SSNMT base, USMT(+UNMT) and a supervised NMT system trained on Laser extractions (columns 5-8). Top scoring systems (TSS) per test set and the amount of parallel training sentences (#P) available for reference (columns 9-10).

We compare the best SSNMT model configuration per language pair with the SSNMT baseline system, and with Monoses (artetxe2019effective), an unsupervised machine translation model in its statistical (USMT) and hybrid (USMT+UNMT) version (Table 4). Over all languages, SSNMT with data augmentation outperforms both the SSNMT baseline and UMT models.

We also compare our results with a supervised NMT system trained on WP parallel sentences extracted by Laser (artetxe2018massively) (–{,}) in a preprocessing data extraction step with the recommended extraction threshold of . We use the pre-extracted and similarity-ranked WikiMatrix (schwenk2019wikimatrix) corpus, which uses Laser to extract parallel sentences, for –{,}. Laser is not trained on  and , thus these languages are not included in the analysis. For , our model and the supervised model trained on Laser extractions perform equally well. In all other cases, our model statistically significantly outperforms the supervised Laser model. This is surprising, given that the underlying Laser model was trained on parallel data in a highly multilingual setup (93 languages), while our MDAE setup does not use any parallel data and was trained on the monolingual data of far fewer languages (7) only. This again underlines the effectiveness of joining SSNMT with BT, multilingual pre-training and bilingual finetuning.

For reference, we also report the top-scoring system (TSS) per language direction based on top results reported on the relevant test sets together with the amount of parallel training data available to TSS systems. In case of language pairs whose test set is part of ongoing shared tasks (–{,}), we report the most recent results reported on the shared task web-pages (Section 4). The amount of parallel data available for these TSS varies greatly across languages, from 37  () to 995  (often noisy) sentences. In general, TSS systems perform much better than any of the SSNMT configurations or unsupervised models. This is natural, as TSS systems are mostly supervised (martinus2019focus; adelani2021menyo20k), semi-supervised (surafel2020low) or multilingual models with parallel pivot language pairs (guzman-etal-2019-flores), none of which is used in the UMT and SSNMT models. For 2 only, our best configuration and the supervised NMT model trained on Laser extractions outperform the current TSS, with a gain in BLEU of (B+BT), which may be due to the small amount of parallel data the TSS was trained on (37  parallel sentences).

8 Discussion and Conclusion

Across all tested low-resource language pairs, joining SSNMT-style online sentence pair extraction with UMT-style online back-translation significantly outperforms the SSNMT baseline and unsupervised MT models, indicating that the small amount of available supervisory signals in the data is exploited more efficiently. Our models also outperform supervised NMT systems trained on Laser extractions, which is remarkable given that our systems are trained on non-parallel data only, while Laser has been trained on massive amounts of parallel data.

While SSNMT with data augmentation and MDAE pre-training is able to learn MT even on a low-resource distant language pair such as , it can fail when a language does not have any relation to other languages included in the multilingual pre-training, which was the case for  in our setup. This can be overcome by being conscious of the importance of language distance and including related languages during MDAE pre-training and SSNMT training. We make our code and data publicly available.


We thank David Adelani and Jesujoba Alabi for their insights on Yorùbá. Part of this research was made possible through a research award from Facebook AI. Partially funded by the German Federal Ministry of Education and Research under the funding code 01IW20010 (Cora4NLP). The authors are responsible for the content of this publication.