
MMTAfrica: Multilingual Machine Translation for African Languages

by Chris C. Emezue, et al.

In this paper, we focus on the task of multilingual machine translation for African languages and describe our contribution in the 2021 WMT Shared Task: Large-Scale Multilingual Machine Translation. We introduce MMTAfrica, the first many-to-many multilingual translation system for six African languages: Fon (fon), Igbo (ibo), Kinyarwanda (kin), Swahili/Kiswahili (swa), Xhosa (xho), and Yoruba (yor) and two non-African languages: English (eng) and French (fra). For multilingual translation concerning African languages, we introduce a novel backtranslation and reconstruction objective, BT&REC, inspired by the random online backtranslation and T5 modeling frameworks respectively, to effectively leverage monolingual data. Additionally, we report improvements from MMTAfrica over the FLORES 101 benchmarks (spBLEU gains ranging from +0.58 in Swahili to French to +19.46 in French to Xhosa). We release our dataset and source code at





1 Introduction

Despite the progress of multilingual machine translation (MMT) and the many efforts towards improving its performance for low-resource languages, African languages suffer from under-representation. For example, of the known African languages Eberhard et al. (2020), only

of them are available in the FLORES 101 Large-Scale Multilingual Translation Task as of the time of this research. Furthermore, most research that looks into transfer learning of multilingual models from high-resource to low-resource languages rarely works with ALs in the low-resource scenario. While the consensus is that the outcome of research made using low-resource non-African languages should be scalable to African languages, this cross-lingual generalization is not guaranteed Orife et al. (2020), and the extent to which it actually works remains largely under-studied. Transfer learning between African languages sharing the same language sub-class has been shown to give better translation quality than transfer from high-resource Anglo-centric languages Nyoni and Bassett (2021), calling for the need to investigate AL↔AL multilingual translation.

We take a step towards addressing the under-representation of African languages in MMT, and towards improving experimentation on it, by participating in the 2021 WMT Shared Task: Large-Scale Multilingual Machine Translation and focusing solely on ALs. We focused on six African languages and two non-African languages (English and French). Table 1 gives an overview of our focus African languages in terms of their language family, number of speakers and the regions in Africa where they are spoken Adelani et al. (2021b). We chose these languages in an effort to create some language diversity: the six African languages span the most widely and least spoken languages in Africa.

Our main contributions are summarized below:

  1. MMTAfrica – a many-to-many AL↔AL multilingual model for African languages.

  2. Our novel reconstruction objective (described in section 4) and the BT&REC finetuning setting, together with our proposals in section 6.1, offer a comprehensive strategy for effectively exploiting monolingual data of African languages in AL↔AL multilingual machine translation.

  3. Evaluation of MMTAfrica on the FLORES Test Set reports significant gains in spBLEU over the M2M MMT Fan et al. (2020) benchmark model provided by Goyal et al. (2021).

  4. We further created a unique highly representative test set – MMTAfrica Test Set – and reported benchmark results and insights using MMTAfrica.

Language  Lang ID (ISO 639-3)  Family  Speakers  Region
Igbo ibo Niger-Congo-Volta-Niger 27M West
Fon (Fongbe) fon Niger-Congo-Volta-Congo-Gbe 1.7M West
Kinyarwanda kin Niger-Congo-Bantu 12M East
Swahili swa Niger-Congo-Bantu 98M Southern, Central & East
Xhosa xho Niger-Congo-Nguni Bantu 19.2M Southern
Yorùbá yor Niger-Congo-Volta-Niger 42M West
Table 1: Language, family, number of speakers Eberhard et al. (2020), and regions in Africa. Adapted from Adelani et al. (2021b)

2 Related Work

2.1 Multilingual Machine Translation (MMT)

The current state of multilingual NMT, where a single NMT model is optimized for the translation of multiple language pairs Firat et al. (2016); Johnson et al. (2017); Lu et al. (2018); Aharoni et al. (2019); Arivazhagan et al. (2019b), has become very appealing for a number of reasons. It is scalable and easy to deploy or maintain (the ability of a single model to effectively handle all translation directions from languages, if properly trained and designed, surpasses the scalability of individually trained models using the traditional bilingual framework). Multilingual NMT can encourage knowledge transfer among related language pairs Lakew et al. (2018); Tan et al. (2019) as well as positive transfer from higher-resource languages Zoph et al. (2016); Neubig and Hu (2018); Arivazhagan et al. (2019a); Aharoni et al. (2019); Johnson et al. (2017) due to its shared representation, improve low-resource translation Ha et al. (2016); Johnson et al. (2017); Arivazhagan et al. (2019b); Xue et al. (2021) and enable zero-shot translation (i.e. direct translation between a language pair never seen during training) Firat et al. (2016); Johnson et al. (2017).

Despite the many advantages of multilingual NMT, it suffers from certain disadvantages. Firstly, the output vocabulary size is typically fixed regardless of the number of languages in the corpus, and increasing the vocabulary size is costly in terms of computational resources because the training and inference time scale linearly with the size of the decoder’s output layer.

Another pitfall of massively multilingual NMT is its poor zero-shot performance Firat et al. (2016); Arivazhagan et al. (2019a); Johnson et al. (2017); Aharoni et al. (2019), particularly compared to pivot-based models (two bilingual models that translate from source to target language through an intermediate language), the spurious correlation issue Gu et al. (2019) and off-target translation Johnson et al. (2017) where the model ignores the given target information and translates into a wrong language.

Our work is inspired by some research to improve the performance (including zero-shot translation) of multilingual models via back-translation and leveraging monolingual data. Zhang et al. (2020) proposed random online backtranslation to enhance multilingual translation of unseen training language pairs. Siddhant et al. (2020) leveraged monolingual data in a semi-supervised fashion and reported three major results:

  1. Using monolingual data significantly boosts the translation quality of low resource languages in multilingual models.

  2. Self-supervision improves zero-shot translation quality in multilingual models.

  3. Leveraging monolingual data with self-supervision provides a viable path towards adding new languages to multilingual models.

3 Data Methodology

Table 2 presents the size of the gathered and cleaned parallel sentences for each language direction.

Table 2: Number of parallel samples for each language direction. We highlight the largest and smallest parallel samples. We see for example that much more research on machine translation and data collation has been carried out on swa↔eng than fon↔fra, attesting to the under-representation of some African languages.

We devised preprocessing guidelines for each of our focus languages, taking their linguistic properties into consideration. We used a maximum sequence length of (due to computational resources) and a minimum of . In the following sections we describe the data sources for the parallel and monolingual corpora.
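The length-based filtering step above can be sketched as follows. The bounds here are hypothetical (the paper's actual maximum and minimum sequence lengths are not given in this excerpt), as is the whitespace-token length measure:

```python
# Sketch of the sequence-length filter; MAX_LEN and MIN_LEN are
# hypothetical values, not the paper's actual preprocessing settings.
MAX_LEN = 50  # hypothetical maximum length (tokens)
MIN_LEN = 2   # hypothetical minimum length (tokens)

def keep_pair(src: str, tgt: str) -> bool:
    """Keep a parallel pair only if both sides fall within the length bounds."""
    return all(MIN_LEN <= len(side.split()) <= MAX_LEN for side in (src, tgt))

pairs = [
    ("one two three", "un deux trois"),
    ("a", "b"),  # below MIN_LEN on both sides -> dropped
]
filtered = [p for p in pairs if keep_pair(*p)]
```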

Parallel Corpora:

As NMT models are very reliant on parallel data, we sought to gather more parallel sentences for each language direction in an effort to increase the size and domain of each language direction. To this end, our first source was JW300 Agić and Vulić (2019), a parallel corpus of over 300 languages with around 100 thousand biblical-domain parallel sentences per language pair on average. Using OpusTools Aulamo et al. (2020) we were able to keep only very trustworthy translations by setting ( is a threshold which indicates the confidence of the translations). We collected more parallel sentences from Tatoeba, kde4 Tiedemann (2012), and some English-based bilingual samples from MultiParaCrawl.

Finally, following pointers from native speakers of our focus languages in the Masakhane community ∀ et al. (2020) to existing research on machine translation for African languages which open-sourced its parallel data, we assembled more parallel sentences, mostly in the AL direction.

From all this we created the MMTAfrica Test Set (explained in more detail in section 3.1), got total training samples for all language directions (a breakdown of data size for each language direction is provided in Table 2) and for dev.

Monolingual Corpora:

Despite our efforts to gather parallel data from various domains, we were faced with some problems: 1) there was a huge imbalance in parallel samples across the language directions (in Table 2 we see that the fon directions have the least parallel sentences while swa and yor have relatively more); 2) the parallel sentences, particularly for AL↔AL, span a very small domain (mostly biblical and internet).

We therefore set out to gather monolingual data from diverse sources. As our focus is on African languages, we collated monolingual data in only these languages.

The monolingual sources and volume are summarized in Table 3.

Language(ID) Monolingual source Size
Xhosa (xho) The CC100-Xhosa Dataset created by Conneau et al. (2019), and OpenSLR van Niekerk et al. (2017)
Yoruba (yor) Yoruba Embeddings Corpus Alabi et al. (2020) and MENYO20k Adelani et al. (2021a)
Fon/Fongbe (fon) FFR Dataset Dossou and Emezue (2020), and Fon French Daily Dialogues Parallel Data Dossou and Emezue (2021)
Swahili/Kiswahili (swa) Shikali and Refuoe (2019)
Kinyarwanda (kin) KINNEWS-and-KIRNEWS Niyongabo et al. (2020)
Igbo (ibo) Ezeani et al. (2020)
Table 3: Monolingual data sources and sizes (number of samples).

3.1 Data Set Types in our Work

Here we elaborate on the different categories of data set that we (generated and) used in our work for training and evaluation.

  • FLORES Test Set: This refers to the dev test set of parallel sentences in all language directions provided by Goyal et al. (2021). We performed evaluation on this test set for all language directions except fon and kin.

  • MMTAfrica Test Set: This is a test set we created by taking out a small but equal number of sentences from each parallel source domain. As a result, we have a set from a wide range of domains, while encompassing samples from many existing test sets from previous research. Although this set is too small to be fully considered a test set, we open-source it because it contains sentences from many domains (making it useful for evaluation) and we hope that it can be built upon, perhaps by merging it with other benchmark test sets Abate et al. (2018); Abbott and Martinus (2019); Reid et al. (2021).

  • Baseline Train/Test Set: We first conducted baseline experiments with Fon, Igbo, English and French as explained in section 5.0.1. For this we created a special data set by carefully selecting a small subset of the FFR Dataset (which already contained parallel sentences in French and Fon), first automatically translating the sentences to English and Igbo using the Google Translate API, and finally re-translating with the help of Igbo (7) and English (7) native speakers (we recognized that it was easier for native speakers to edit/tweak an existing translation rather than writing the whole translation from scratch). In so doing, we created a data set of translations in all language directions. We split the data set into for training Baseline Train Set, for dev and for test Baseline Test Set.
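The equal-per-domain construction of the MMTAfrica Test Set described above can be sketched roughly as below; the per-domain count and the domain names are hypothetical stand-ins, since the actual figures are not given in this excerpt:

```python
import random

def build_test_set(pairs_by_domain: dict, n_per_domain: int, seed: int = 0) -> list:
    """Take out a small but equal number of sentence pairs from each
    parallel source domain. n_per_domain and the domain names used
    below are hypothetical."""
    rng = random.Random(seed)
    test_set = []
    for domain in sorted(pairs_by_domain):  # deterministic domain order
        test_set.extend(rng.sample(pairs_by_domain[domain], n_per_domain))
    return test_set

corpus = {
    "religious": [(f"src-r{i}", f"tgt-r{i}") for i in range(10)],
    "news":      [(f"src-n{i}", f"tgt-n{i}") for i in range(10)],
}
test_set = build_test_set(corpus, n_per_domain=3)  # 3 pairs from each domain
```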

4 Model and Training Setup

For each language direction s→t we have its set of parallel sentences {(x_i, y_i)}, where x_i is the i-th source sentence in language s and y_i is its translation in the target language t.

Following the approach of Johnson et al. (2017) and Xue et al. (2021), we model translation in a text-to-text format. More specifically, we create the input for the model by prepending the target-language tag to the source sentence. Therefore for each source sentence x_i the input to the model is <t> x_i and the target is y_i. Taking a real example, let’s say we wish to translate the Igbo sentence Daalụ maka ikwu eziokwu nke Chineke to English. The input to the model becomes <eng> Daalụ maka ikwu eziokwu nke Chineke.
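A minimal sketch of this input construction (the helper name is ours, not from the paper's released code):

```python
def to_model_input(src_sentence: str, tgt_lang: str) -> str:
    """Prepend the target-language tag to the source sentence,
    yielding the text-to-text model input."""
    return f"<{tgt_lang}> {src_sentence}"

model_input = to_model_input("Daalụ maka ikwu eziokwu nke Chineke", "eng")
# model_input == "<eng> Daalụ maka ikwu eziokwu nke Chineke"
```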

4.1 Model Setup

For all our experiments, we used the mT5 model Xue et al. (2021), a multilingual variant of the encoder-decoder, transformer-based Vaswani et al. (2017) “Text-to-Text Transfer Transformer” (T5) model Raffel et al. (2019). In T5 pre-training, the NLP tasks (including machine translation) were cast into a “text-to-text” format – that is, a task where the model is fed some text prefix for context or conditioning and is then asked to produce some output text. This framework makes it straightforward to design a number of NLP tasks like machine translation, summarization, text classification, etc. Also, it provides a consistent training objective both for pre-training and finetuning. The mT5 model was pre-trained with a maximum likelihood objective using “teacher forcing” Williams and Zipser (1989). The mT5 model was also pretrained with a modification of the masked language modelling objective Devlin et al. (2018).

We finetuned the mt5-base model on our many-to-many machine translation task. While Xue et al. (2021) suggest that larger variants of the mT5 model (Large, XL or XXL) give better performance on downstream multilingual translation tasks, we were constrained by computational resources to mt5-base, which has parameters.

4.2 Training Setup

We have a set of language tags for the languages we are working with in our multilingual many-to-many translation; the sets used in our baseline setup (section 5.0.1) and in our final experiment (section 5.0.2) differ. We carried out many-to-many translation using all possible directions from this set except eng↔fra, which we skipped for this fundamental reason:

  • Our main focus is on African↔African translation or translation involving African languages. Due to the high-resource nature of English and French, adding the eng↔fra training set would overshadow the learning of the other language directions and greatly impede our analyses. Our intuition draws from the observation of Xue et al. (2021) on the reason for off-target translation in the mT5 model: as English-based finetuning proceeds, the model’s assigned likelihood of non-English tokens presumably decreases. Therefore, since the mt5-base training set contained predominantly English (and then other European-language) tokens and our research is about AL↔AL translation, removing the eng↔fra direction was our way of ensuring the model assigned more likelihood to AL tokens.

4.2.1 Our Contributions

In addition to the parallel data between the African languages, we leveraged monolingual data to improve translation quality in two ways:

  1. our backtranslation (BT): We designed a modified form of random online backtranslation Zhang et al. (2020) where, instead of randomly selecting a subset of languages to backtranslate, we selected for each language sentences at random from the monolingual data set. This means that the model gets to backtranslate different (monolingual) sentences at each backtranslation step and, in so doing, we believe, improves its domain adaptation because it learns from varied samples drawn from the whole monolingual data set. We initially tested different values of to find a compromise between backtranslation computation time and translation quality. Following research which has shown the effectiveness of random beam-search over greedy decoding for generating backtranslations Lample et al. (2017); Edunov et al. (2018); Hoang et al. (2018); Zhang et al. (2020), we generated prediction sentences from the model and randomly selected (with equal probability) one as our backtranslated sentence. Naturally the value of further affects the computation time (because the model has to produce different output sentences for each input sentence), so we finally settled with .

  2. our reconstruction: Given a monolingual sentence x from language s, we applied random swapping ( times) and deletion (with a probability of ) to get a noisy version x̃. Taking inspiration from Raffel et al. (2019), we integrated the reconstruction objective into our model finetuning by prepending the language tag <s> to x̃ and setting its target output to x.
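The noising step of the reconstruction objective can be sketched as below. The swap count and deletion probability are hypothetical, since the paper's exact values are elided in this excerpt:

```python
import random

def noisy_copy(sentence: str, n_swaps: int = 2, p_del: float = 0.1,
               seed: int = 0) -> str:
    """Build a noisy version of a monolingual sentence via random token
    swaps and random token deletion. n_swaps and p_del are hypothetical
    values, not the paper's actual settings."""
    rng = random.Random(seed)
    tokens = sentence.split()
    for _ in range(n_swaps):  # random swapping
        i, j = rng.randrange(len(tokens)), rng.randrange(len(tokens))
        tokens[i], tokens[j] = tokens[j], tokens[i]
    kept = [t for t in tokens if rng.random() > p_del]  # random deletion
    return " ".join(kept or tokens[:1])  # never return an empty sentence

x = "a b c d e f"
x_noisy = noisy_copy(x)
# training example: input = language tag + " " + x_noisy, target = x
```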


5 Experiments

In all our experiments we initialized the pretrained mT5-base model using Hugging Face’s AutoModelForSeq2SeqLM and tracked the training process with Weights&Biases Biewald (2020). We used the AdamW optimizer Loshchilov and Hutter (2017) with a learning rate (lr) of and transformers’ linear scheduler with warmup (where the learning rate increases linearly from 0 to the initial lr set in the optimizer over a warmup period, and then decreases linearly back to 0).
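The scheduler's behavior can be sketched as a pure function (the step counts and base lr below are hypothetical; in practice this corresponds to transformers' get_linear_schedule_with_warmup):

```python
def linear_warmup_lr(step: int, total_steps: int, warmup_steps: int,
                     base_lr: float) -> float:
    """Linear schedule with warmup: the learning rate rises linearly
    from 0 to base_lr over the warmup period, then decays linearly
    back to 0 by the end of training."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, total_steps - step) / max(1, total_steps - warmup_steps)

# hypothetical values: 100 total steps, 10 warmup steps, base lr 3e-4
lrs = [linear_warmup_lr(s, 100, 10, 3e-4) for s in range(101)]
```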

5.0.1 Baseline

The goal of our baseline was to understand the effect of jointly finetuning with backtranslation and reconstruction on African↔African translation quality in two scenarios: when the AL was included in the model's pretraining and when it was not. Using Fon (which was not initially included in the pretraining) and Igbo (which was initially included in the pretraining) as the African languages for our baseline training, we finetuned our model on many-to-many translation in all directions of , amounting to directions. We used the Baseline Train Set for training and the Baseline Test Set for evaluation. We trained the model for only 3 epochs in three settings:

  1. BASE: in this setup we finetuned the model on only the many-to-many translation task: neither backtranslation nor reconstruction.

  2. BT: refers to finetuning with our backtranslation objective described in section 4. For our baseline, where we backtranslate using monolingual data in , we set . For our final experiments, we first tried with but finally reduced to due to the great deal of computation required. For our baseline experiment, we ran one epoch normally and the remaining two with backtranslation. For our final experiments, we first finetuned the model for epochs before continuing with backtranslation.

  3. BT&REC: refers to joint backtranslation and reconstruction (explained in section 4) while finetuning. Two important questions were addressed: 1) the backtranslation : reconstruction ratio of monolingual sentences to use, and 2) whether to use the same or different sentences for backtranslation and reconstruction. Bearing computation time in mind, we resolved to go with for our baseline and for our final experiments. We leave ablation studies on the effect of the ratio on translation quality to future work. For the second question we decided to randomly sample (with replacement) different sentences each for backtranslation and reconstruction.
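Sampling different monolingual sentences (with replacement) for the two objectives, as decided above, might look like the sketch below; the sample sizes are hypothetical since the actual counts are elided in this excerpt:

```python
import random

def sample_bt_rec(monolingual: list, k_bt: int, k_rec: int, seed: int = 0):
    """Independently sample, with replacement, k_bt sentences for
    backtranslation and k_rec sentences for reconstruction, so each
    objective sees its own random draw from the monolingual data.
    k_bt and k_rec are hypothetical sizes."""
    rng = random.Random(seed)
    bt = rng.choices(monolingual, k=k_bt)    # draw for backtranslation
    rec = rng.choices(monolingual, k=k_rec)  # independent draw for reconstruction
    return bt, rec

mono = [f"sentence-{i}" for i in range(1000)]
bt_batch, rec_batch = sample_bt_rec(mono, k_bt=100, k_rec=100)
```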

For our baseline, we used a learning rate of , a batch size of 32 sentences, with gradient accumulation up to a batch of 256 sentences and an early-stopping patience of 100 evaluation steps. To further analyse the performance of our baseline setups we ran compare-mt Neubig et al. (2019) on the model’s predictions.

5.0.2 MMTAfrica

MMTAfrica refers to our final experimental setup where we finetuned our model on all language directions involving all eight languages except eng↔fra. Taking inspiration from our baseline results, we ran our experiment with our proposed BT&REC setting and made some adjustments along the way.

The long computation time for backtranslating (with just 100 sentences per language the model was required to generate around translations at each backtranslation step) was a drawback. To mitigate the issue we parallelized the process using the multiprocessing package in Python. We further gradually reduced the number of sentences for backtranslation (to , and finally ).

Gradient descent in large multilingual models has been shown to be more stable when updates are performed over large batch sizes Xue et al. (2021). To cope with our computational resources, we used gradient accumulation to increase updates from an initial batch size of sentences up to a batch gradient computation size of sentences. We further utilized PyTorch’s DataParallel package to parallelize the training across the GPUs. We used a learning rate (lr) of .
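Gradient accumulation as used above can be illustrated with a toy loop, in which scalar "gradients" stand in for gradient tensors and the accumulation factor is hypothetical:

```python
def train_with_accumulation(micro_grads: list, accum_factor: int) -> list:
    """Apply one optimizer update per `accum_factor` micro-batches,
    using the mean of the accumulated micro-batch gradients, which is
    equivalent to the gradient of one large batch. Scalars stand in
    for gradient tensors; leftover micro-batches that do not fill a
    full accumulation window are dropped for simplicity."""
    updates, buffer = [], []
    for g in micro_grads:
        buffer.append(g)
        if len(buffer) == accum_factor:
            updates.append(sum(buffer) / len(buffer))
            buffer = []
    return updates

# e.g. 8 micro-batches accumulated 4 at a time -> 2 optimizer updates
updates = train_with_accumulation([1, 2, 3, 4, 5, 6, 7, 8], accum_factor=4)
# updates == [2.5, 6.5]
```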

6 Results and Insights

All evaluations were made using spBLEU (sentencepiece Kudo and Richardson (2018) + sacreBLEU Post (2018)) as described in Goyal et al. (2021). We further evaluated on the chrF Popović (2015) and TER metrics.

6.1 Baseline Results and Insights

Figure 1 compares the spBLEU scores for the three setups used in our baseline experiments. As a reminder, we make use of the symbol to refer to any language in the set .

BT gives a strong improvement over BASE (except in eng→ibo, where it is relatively the same, and fra→ibo, where it performs worse).

Figure 1: spBLEU scores of the 3 setups explained in section 5.0.1

When the target language is fon, we observe a considerable boost in the spBLEU of the BT setting, which also significantly outperformed BASE and BT&REC. BT&REC contributed very little when compared with BT and sometimes even performed poorly (in eng→fon). We attribute this poor performance of the reconstruction objective to the fact that the mt5-base model was not originally pretrained on Fon. Therefore, with only 3 epochs of finetuning (and 1 epoch before introducing the reconstruction and backtranslation objectives) the model was not able to meaningfully utilize both objectives.

Conversely, when the target language is ibo, BT&REC gives the best results – even in scenarios where BT underperforms BASE (as is the case for fra→ibo and eng→ibo). We believe that the decoder of the model, being originally pretrained on corpora containing Igbo, was able to better use our reconstruction to improve translation quality in the ibo direction.

Drawing insights from fon↔ibo, we offer the following propositions concerning AL↔AL multilingual translation:

  • Our backtranslation (section 4) from monolingual data improves the cross-lingual mapping of the model for low-resource African languages. While it is computationally expensive, our parallelization and the decay of the number of backtranslated sentences are potential solutions towards effectively adopting backtranslation with monolingual data.

  • Denoising objectives have typically been known to improve machine translation quality Zhang and Zong (2016); Cheng et al. (2016); Gu et al. (2019); Zhang et al. (2020); Xue et al. (2021) because they imbue the model with more generalizable knowledge (about that language) which is used by the decoder to predict better token likelihoods for that language during translation. This is a reasonable explanation for the improved quality of BT&REC over BT in the ibo directions. As we learned from fon, using reconstruction can perform unsatisfactorily if not handled well. Some methods we propose are:

    1. For African languages that were included in the original model pretraining (as was the case of Igbo, Swahili, Xhosa, and Yorùbá in the mT5 model), using the BT&REC setting for finetuning produces best results. While we did not perform ablation studies on the data size ratio for backtranslation and reconstruction, we believe that our ratio of (in our final experiments) gives a good compromise between computation time and translation quality.

    2. For African languages that were not included in the original model pretraining (as was the case of Kinyarwanda and Fon in the mT5 model), reconstruction together with backtranslation (especially at an early stage) only introduces more noise, which can harm cross-lingual learning. For these languages we propose:

      1. first finetuning the model on only our reconstruction (described in section 4) for a fairly long number of training steps before using BT&REC. This way, the initial reconstruction helps the model learn the language's representation space and increases the likelihood of its tokens.

6.2 MMTAfrica Results and Insights

In Table 4, we compared MMTAfrica with the M2M MMT Fan et al. (2020) benchmark results of Goyal et al. (2021) using the same test set they used – the FLORES Test Set. On all language pairs except swa→eng (which has a comparable spBLEU difference), we report an improvement from MMTAfrica (spBLEU gains ranging from +0.58 in swa→fra to +19.46 in fra→xho). The lower score of swa→eng presents an intriguing anomaly, especially given the large availability of parallel corpora in our training set for this pair. We plan to investigate this in further work.

Source Target spBLEU (FLORES) spBLEU (Ours*) spCHRF (Ours*)
ibo swa 4.38 21.84 — 11.63 37.38 — 35.66
ibo xho 2.44 13.97 — 7.65 31.95 — 29.47
ibo yor 1.54 10.72 — 7.72 26.55 — 16.84
ibo eng 7.37 13.62 — 15.44 38.90 — 37.99
ibo fra 6.02 16.46 — 12.89 35.10 — 31.71
swa ibo 1.97 19.80 — 16.73 33.95 — 28.07
swa xho 2.71 21.71 — 11.74 39.86 — 35.67
swa yor 1.29 11.68 — 8.28 27.44 — 17.18
swa eng 30.43 27.67 — 28.41 56.12 — 53.65
swa fra 26.69 27.27 — 19.85 46.20 — 41.41
xho ibo 3.80 17.02 — 15.28 31.30 — 26.67
xho swa 6.14 29.47 — 15.73 44.68 — 40.78
xho yor 1.92 10.42 — 7.82 26.77 — 17.10
xho eng 10.86 20.77 — 21.75 48.69 — 46.34
xho fra 8.28 21.48 — 15.97 40.65 — 36.28
yor ibo 1.85 11.45 — 11.44 25.26 — 21.70
yor swa 1.93 14.99 — 6.61 30.49 — 28.21
yor xho 1.94 9.31 — 4.99 26.34 — 24.27
yor eng 4.18 8.15 — 9.02 30.65 — 28.85
yor fra 3.57 10.59 — 7.91 27.60 — 23.93
eng ibo 3.53 21.49 — 19.52 37.24 — 32.46
eng swa 26.95 40.11 — 27.06 53.13 — 51.90
eng xho 4.47 27.15 — 14.85 44.93 — 39.88
eng yor 2.17 12.09 — 9.43 28.34 — 18.39
fra ibo 1.69 19.48 — 17.25 34.47 — 29.49
fra swa 17.17 34.21 — 19.49 48.95 — 45.44
fra xho 2.27 21.73 — 11.37 40.06 — 35.41
fra yor 1.16 11.42 — 8.54 27.67 — 17.53
Table 4: Evaluation scores of the FLORES M2M MMT model and MMTAfrica on the FLORES Test Set. We use a — b to denote spBLEU with — without BT&REC.

In Table 5 we introduce benchmark results of MMTAfrica on the MMTAfrica Test Set with and without BT&REC, along with the test size of each language pair. The spBLEU scores demonstrate the effectiveness of our new objective, as it led to improvements in the majority of directions.

Interesting analysis about Fon (fon) and Yorùbá (yor):

For each language, the lowest spBLEU scores in both tables come from translation into yor, except fon→yor (from Table 5), which interestingly has the highest spBLEU score compared to the other fon directions. We do not know the reason for the very low performance in the yor directions, but we offer below a plausible explanation for fon→yor.

The oral linguistic history of Fon ties it to the ancient Yorùbá kingdom Barnes (1997). Furthermore, in present-day Benin, where Fon is largely spoken as a native language, Yorùbá is one of the indigenous languages commonly spoken. Therefore Fon and Yorùbá share some linguistic characteristics, and we believe this is one reason behind fon→yor surpassing the other fon directions.

This explanation could inspire transfer learning from Yorùbá, which has received comparably more research and has more resources for machine translation, to Fon. We leave this for future work.

7 Conclusion and Future Work

In this paper, we introduced MMTAfrica, a multilingual machine translation model for six African languages. Our results and analyses, including a new reconstruction objective, give insights on MMT for African languages for future research.

Source Target Test size spBLEU spCHRF spTER
ibo swa 60 34.89 (12.27) 47.38 (36.65) 68.28 (124.01)
ibo xho 30 36.69 (21.92) 50.66 (41.40) 59.65 (76.36)
ibo yor 30 11.77 (10.19) 29.54 (22.10) 129.84 (130.39)
ibo kin 30 33.92 (16.07) 46.53 (36.95) 67.73 (96.5)
ibo fon 30 35.96 (11.47) 43.14 (21.75) 63.21 (91.91)
ibo eng 90 37.28 (11.70) 60.42 (38.11) 62.05 (110.67)
ibo fra 60 30.86 (6.02) 44.09 (28.13) 69.53 (121.43)
swa ibo 60 33.71 (23.12) 43.02 (33.91) 60.01 (85.18)
swa xho 30 37.28 (20.55) 52.53 (40.84) 55.86 (72.71)
swa yor 30 14.09 (15.49) 27.50 (23.50) 113.63 (106.22)
swa kin 30 23.86 (13.53) 42.59 (36.88) 94.67 (118.0)
swa fon 30 23.29 (8.94) 33.52 (16.97) 65.11 (84.12)
swa eng 60 35.55 (43.11) 60.47 (66.52) 47.32 (40.0)
swa fra 60 30.11 (21.99) 48.33 (43.84) 63.38 (71.17)
xho ibo 30 33.25 (24.33) 45.36 (36.42) 62.83 (70.63)
xho swa 30 39.26 (23.42) 53.75 (46.22) 53.72 (67.13)
xho yor 30 22.00 (16.86) 38.06 (27.20) 70.45 (74.36)
xho kin 30 30.66 (14.09) 46.19 (37.26) 74.70 (112.4)
xho fon 30 25.80 (10.73) 34.87 (18.70) 65.96 (85.51)
xho eng 90 30.25 (21.36) 55.12 (48.96) 62.11 (69.61)
xho fra 30 29.45 (16.25) 45.72 (35.99) 61.03 (70.91)
yor ibo 30 25.11 (15.00) 34.19 (26.75) 74.80 (97.44)
yor swa 30 17.62 (4.81) 34.71 (28.23) 85.18 (130.22)
yor xho 30 29.31 (14.43) 43.13 (32.34) 66.82 (87.2)
yor kin 30 25.16 (14.65) 38.02 (32.22) 72.67 (86.99)
yor fon 30 31.81 (10.28) 37.45 (17.52) 63.39 (88.57)
yor eng 90 17.81 (2.11) 41.73 (22.90) 93.00 (93.49)
yor fra 30 15.44 (5.62) 30.97 (22.81) 90.57 (136.96)
kin ibo 30 31.25 (27.30) 42.36 (37.44) 66.73 (76.1)
kin swa 30 33.65 (13.00) 46.34 (38.62) 72.70 (100.84)
kin xho 30 20.40 (9.96) 39.71 (33.27) 89.97 (108.05)
kin yor 30 18.34 (17.64) 33.53 (27.48) 70.43 (72.4)
kin fon 30 22.43 (10.84) 32.49 (18.40) 67.26 (82.65)
kin eng 60 15.82 (9.28) 43.10 (35.88) 96.55 (102.82)
kin fra 30 16.23 (12.24) 33.51 (29.41) 91.82 (100.75)
fon ibo 30 32.36 (16.24) 46.44 (31.18) 61.82 (83.54)
fon swa 30 29.84 (17.08) 42.96 (35.26) 72.28 (88.37)
fon xho 30 28.82 (13.59) 43.74 (31.80) 66.98 (93.27)
fon yor 30 30.45 (22.17) 42.63 (30.52) 60.72 (70.91)
fon kin 30 23.88 (10.08) 39.59 (28.96) 78.06 (91.81)
fon eng 30 16.63 (13.67) 41.63 (30.36) 69.03 (83.57)
fon fra 60 24.79 (17.31) 43.39 (33.39) 82.15 (82.97)
eng ibo 90 44.24 (25.18) 54.89 (35.84) 63.92 (87.47)
eng swa 60 49.94 (33.53) 61.45 (55.58) 47.83 (65.22)
eng xho 120 31.97 (22.57) 49.74 (46.01) 72.89 (68.47)
eng yor 90 23.93 (11.01) 36.19 (19.25) 84.05 (90.29)
eng kin 90 40.98 (11.47) 56.00 (30.05) 76.37 (101.42)
eng fon 30 27.19 (6.40) 36.86 (14.91) 62.54 (91.08)
fra ibo 60 36.47 (18.26) 46.93 (28.72) 59.91 (86.05)
fra swa 60 36.53 (20.72) 51.42 (46.35) 55.94 (66.86)
fra xho 30 34.35 (21.49) 49.39 (43.36) 60.30 (72.39)
fra yor 30 7.26 (7.88) 25.54 (19.59) 124.53 (121.17)
fra kin 30 31.07 (17.24) 42.26 (37.63) 81.06 (95.45)
fra fon 60 31.07 (10.82) 38.72 (21.10) 75.74 (93.33)
Table 5: Benchmark Evaluation Scores on MMTAfrica Test Set with (without) BT&REC


  • ∀, W. Nekoto, V. Marivate, T. Matsila, T. Fasubaa, T. Kolawole, T. Fagbohungbe, S. O. Akinola, S. H. Muhammad, S. Kabongo, S. Osei, et al. (2020) Participatory research for low-resourced machine translation: a case study in African languages. Findings of EMNLP. Cited by: §3.
  • S. T. Abate, M. Melese, M. Y. Tachbelie, M. Meshesha, S. Atinafu, W. Mulugeta, Y. Assabie, H. Abera, B. Ephrem, T. Abebe, W. Tsegaye, A. Lemma, T. Andargie, and S. Shifaw (2018) Parallel corpora for bi-lingual English-Ethiopian languages statistical machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 3102–3111. External Links: Link Cited by: 2nd item.
  • J. Abbott and L. Martinus (2019) Benchmarking neural machine translation for Southern African languages. In Proceedings of the 2019 Workshop on Widening NLP, Florence, Italy, pp. 98–101. External Links: Link Cited by: 2nd item.
  • D. I. Adelani, D. Ruiter, J. O. Alabi, D. Adebonojo, A. Ayeni, M. Adeyemi, A. Awokoya, and C. España-Bonet (2021a) The effect of domain and diacritics in Yorùbá-English neural machine translation. External Links: arXiv:2103.08647 Cited by: Table 3.
  • D. I. Adelani, J. Z. Abbott, G. Neubig, D. D’souza, J. Kreutzer, C. Lignos, C. Palen-Michel, H. Buzaaba, S. Rijhwani, S. Ruder, S. Mayhew, I. A. Azime, S. H. Muhammad, C. C. Emezue, J. Nakatumba-Nabende, P. Ogayo, A. Anuoluwapo, C. Gitau, D. Mbaye, J. O. Alabi, S. M. Yimam, T. Gwadabe, I. Ezeani, R. A. Niyongabo, J. Mukiibi, V. Otiende, I. Orife, D. David, S. Ngom, T. P. Adewumi, P. Rayson, M. Adeyemi, G. Muriuki, E. Anebi, C. Chukwuneke, N. Odu, E. P. Wairagala, S. Oyerinde, C. Siro, T. S. Bateesa, T. Oloyede, Y. Wambui, V. Akinode, D. Nabagereka, M. Katusiime, A. Awokoya, M. Mboup, D. Gebreyohannes, H. Tilaye, K. Nwaike, D. Wolde, A. Faye, B. Sibanda, O. Ahia, B. F. P. Dossou, K. Ogueji, T. I. Diop, A. Diallo, A. Akinfaderin, T. Marengereke, and S. Osei (2021b) MasakhaNER: named entity recognition for African languages. CoRR abs/2103.11811. External Links: Link, 2103.11811 Cited by: Table 1, §1.
  • Ž. Agić and I. Vulić (2019) JW300: a wide-coverage parallel corpus for low-resource languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3204–3210. External Links: Link, Document Cited by: §3.
  • R. Aharoni, M. Johnson, and O. Firat (2019) Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 3874–3884. External Links: Link, Document Cited by: §2.1, §2.1.
  • J. Alabi, K. Amponsah-Kaakyire, D. Adelani, and C. España-Bonet (2020) Massive vs. curated embeddings for low-resourced languages: the case of Yorùbá and Twi. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, pp. 2754–2762 (English). External Links: Link, ISBN 979-10-95546-34-4 Cited by: Table 3.
  • N. Arivazhagan, A. Bapna, O. Firat, R. Aharoni, M. Johnson, and W. Macherey (2019a) The missing ingredient in zero-shot neural machine translation. CoRR abs/1903.07091. External Links: Link, 1903.07091 Cited by: §2.1, §2.1.
  • N. Arivazhagan, A. Bapna, O. Firat, D. Lepikhin, M. Johnson, M. Krikun, M. X. Chen, Y. Cao, G. F. Foster, C. Cherry, W. Macherey, Z. Chen, and Y. Wu (2019b) Massively multilingual neural machine translation in the wild: findings and challenges. CoRR abs/1907.05019. External Links: Link, 1907.05019 Cited by: §2.1.
  • M. Aulamo, U. Sulubacak, S. Virpioja, and J. Tiedemann (2020) OpusTools and parallel corpus diagnostics. In Proceedings of The 12th Language Resources and Evaluation Conference, pp. 3782–3789. External Links: Link, ISBN 979-10-95546-34-4 Cited by: §3.
  • S.T. Barnes (1997) Africa’s ogun: old world and new. African systems of thought, Indiana University Press. External Links: ISBN 9780253210838, LCCN lc96043166, Link Cited by: §6.2.
  • L. Biewald (2020) Experiment tracking with weights and biases. Note: Software available from External Links: Link Cited by: §5.
  • Y. Cheng, W. Xu, Z. He, W. He, H. Wu, M. Sun, and Y. Liu (2016) Semi-supervised learning for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1965–1974. External Links: Link, Document Cited by: 2nd item.
  • A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov (2019) Unsupervised cross-lingual representation learning at scale. External Links: arXiv:1911.02116 Cited by: Table 3.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805. External Links: Link, 1810.04805 Cited by: §4.1.
  • B. F. P. Dossou and C. C. Emezue (2020) FFR V1.0: fon-french neural machine translation. CoRR abs/2003.12111. External Links: Link, 2003.12111 Cited by: Table 3.
  • B. F. P. Dossou and C. C. Emezue (2021) Crowdsourced phrase-based tokenization for low-resourced neural machine translation: the case of fon language. CoRR abs/2103.08052. External Links: Link, 2103.08052 Cited by: Table 3.
  • D. M. Eberhard, G. F. Simons, and C. D. Fennig (Eds.) (2020) Ethnologue: languages of the world. Twenty-third edition. SIL International. External Links: Link Cited by: Table 1, §1.
  • S. Edunov, M. Ott, M. Auli, and D. Grangier (2018) Understanding back-translation at scale. CoRR abs/1808.09381. External Links: Link, 1808.09381 Cited by: item 1.
  • I. Ezeani, P. Rayson, I. E. Onyenwe, C. Uchechukwu, and M. Hepple (2020) Igbo-English machine translation: an evaluation benchmark. CoRR abs/2004.00648. External Links: Link, 2004.00648 Cited by: Table 3.
  • A. Fan, S. Bhosale, H. Schwenk, Z. Ma, A. El-Kishky, S. Goyal, M. Baines, O. Celebi, G. Wenzek, V. Chaudhary, N. Goyal, T. Birch, V. Liptchinsky, S. Edunov, E. Grave, M. Auli, and A. Joulin (2020) Beyond english-centric multilingual machine translation. CoRR abs/2010.11125. External Links: Link, 2010.11125 Cited by: item 3, §6.2.
  • O. Firat, K. Cho, and Y. Bengio (2016) Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 866–875. External Links: Link, Document Cited by: §2.1.
  • O. Firat, B. Sankaran, Y. Al-onaizan, F. T. Yarman Vural, and K. Cho (2016) Zero-resource translation with multi-lingual neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 268–277. External Links: Link, Document Cited by: §2.1, §2.1.
  • N. Goyal, C. Gao, V. Chaudhary, P. Chen, G. Wenzek, D. Ju, S. Krishnan, M. Ranzato, F. Guzman, and A. Fan (2021) The flores-101 evaluation benchmark for low-resource and multilingual machine translation. External Links: 2106.03193 Cited by: item 3, 1st item, §6.2, §6.
  • J. Gu, Y. Wang, K. Cho, and V. O.K. Li (2019) Improved zero-shot neural machine translation via ignoring spurious correlations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1258–1268. External Links: Link, Document Cited by: §2.1, 2nd item.
  • T. Ha, J. Niehues, and A. H. Waibel (2016) Toward multilingual neural machine translation with universal encoder and decoder. CoRR abs/1611.04798. External Links: Link, 1611.04798 Cited by: §2.1.
  • V. C. D. Hoang, P. Koehn, G. Haffari, and T. Cohn (2018) Iterative back-translation for neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, Melbourne, Australia, pp. 18–24. External Links: Link, Document Cited by: item 1.
  • M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean (2017) Google’s multilingual neural machine translation system: enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5, pp. 339–351. External Links: Link, Document Cited by: §2.1, §2.1, §4.
  • T. Kudo and J. Richardson (2018) SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, pp. 66–71. External Links: Link, Document Cited by: §6.
  • S. M. Lakew, M. Cettolo, and M. Federico (2018) A comparison of transformer and recurrent neural networks on multilingual neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 641–652. External Links: Link Cited by: §2.1.
  • G. Lample, L. Denoyer, and M. Ranzato (2017) Unsupervised machine translation using monolingual corpora only. CoRR abs/1711.00043. External Links: Link, 1711.00043 Cited by: item 1.
  • I. Loshchilov and F. Hutter (2017) Fixing weight decay regularization in Adam. CoRR abs/1711.05101. External Links: Link, 1711.05101 Cited by: §5.
  • Y. Lu, P. Keung, F. Ladhak, V. Bhardwaj, S. Zhang, and J. Sun (2018) A neural interlingua for multilingual machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium, pp. 84–92. External Links: Link, Document Cited by: §2.1.
  • G. Neubig, Z. Dou, J. Hu, P. Michel, D. Pruthi, X. Wang, and J. Wieting (2019) Compare-mt: A tool for holistic comparison of language generation systems. CoRR abs/1903.07926. External Links: Link Cited by: §5.0.1.
  • G. Neubig and J. Hu (2018) Rapid adaptation of neural machine translation to new languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 875–880. External Links: Link, Document Cited by: §2.1.
  • R. A. Niyongabo, Q. Hong, J. Kreutzer, and L. Huang (2020) KINNEWS and KIRNEWS: benchmarking cross-lingual text classification for Kinyarwanda and Kirundi. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), pp. 5507–5521. External Links: Link, Document Cited by: Table 3.
  • E. Nyoni and B. A. Bassett (2021) Low-resource neural machine translation for Southern African languages. CoRR abs/2104.00366. External Links: Link, 2104.00366 Cited by: §1.
  • I. Orife, J. Kreutzer, B. Sibanda, D. Whitenack, K. Siminyu, L. Martinus, J. T. Ali, J. Z. Abbott, V. Marivate, S. Kabongo, M. Meressa, E. Murhabazi, O. Ahia, E. V. Biljon, A. Ramkilowan, A. Akinfaderin, A. Öktem, W. Akin, G. Kioko, K. Degila, H. Kamper, B. Dossou, C. Emezue, K. Ogueji, and A. Bashir (2020) Masakhane - machine translation for africa. CoRR abs/2003.11529. External Links: Link, 2003.11529 Cited by: §1.
  • M. Popović (2015) ChrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisbon, Portugal, pp. 392–395. External Links: Link, Document Cited by: §6.
  • M. Post (2018) A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium, pp. 186–191. External Links: Link, Document Cited by: §6.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR abs/1910.10683. External Links: Link, 1910.10683 Cited by: item 2, §4.1.
  • M. Reid, J. Hu, G. Neubig, and Y. Matsuo (2021) AfroMT: pretraining strategies and reproducible benchmarks for translation of 8 african languages. External Links: 2109.04715 Cited by: 2nd item.
  • S. C. Shikali and M. Refuoe (2019) Language modeling data for swahili. Zenodo. External Links: Document, Link Cited by: Table 3.
  • A. Siddhant, A. Bapna, Y. Cao, O. Firat, M. X. Chen, S. R. Kudugunta, N. Arivazhagan, and Y. Wu (2020) Leveraging monolingual data with self-supervision for multilingual neural machine translation. CoRR abs/2005.04816. External Links: Link, 2005.04816 Cited by: §2.1.
  • X. Tan, J. Chen, D. He, Y. Xia, T. Qin, and T. Liu (2019) Multilingual neural machine translation with language clustering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 963–973. External Links: Link, Document Cited by: §2.1.
  • J. Tiedemann (2012) Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), N. Calzolari (Conference Chair), K. Choukri, T. Declerck, M. U. Dogan, B. Maegaard, J. Mariani, J. Odijk, and S. Piperidis (Eds.), Istanbul, Turkey (English). External Links: ISBN 978-2-9517408-7-7 Cited by: §3.
  • D. van Niekerk, C. van Heerden, M. Davel, N. Kleynhans, O. Kjartansson, M. Jansche, and L. Ha (2017) Rapid development of TTS corpora for four South African languages. In Proc. Interspeech 2017, Stockholm, Sweden, pp. 2178–2182. External Links: Link Cited by: Table 3.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. CoRR abs/1706.03762. External Links: Link, 1706.03762 Cited by: §4.1.
  • R. J. Williams and D. Zipser (1989) A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1 (2), pp. 270–280. External Links: Document Cited by: §4.1.
  • L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel (2021) mT5: a massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 483–498. External Links: Link, Document Cited by: §2.1, 1st item, §4.1, §4.1, §4, §5.0.2, 2nd item.
  • B. Zhang, P. Williams, I. Titov, and R. Sennrich (2020) Improving massively multilingual neural machine translation and zero-shot translation. CoRR abs/2004.11867. External Links: Link, 2004.11867 Cited by: §2.1, item 1, 2nd item.
  • J. Zhang and C. Zong (2016) Exploiting source-side monolingual data in neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 1535–1545. External Links: Link, Document Cited by: 2nd item.
  • B. Zoph, D. Yuret, J. May, and K. Knight (2016) Transfer learning for low-resource neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 1568–1575. External Links: Link, Document Cited by: §2.1.