Low Resource Neural Machine Translation: A Benchmark for Five African Languages

03/31/2020 · by Surafel M. Lakew, et al. · Fondazione Bruno Kessler, Università di Trento

Recent advances in Neural Machine Translation (NMT) have shown improvements in low-resource language (LRL) translation tasks. In this work, we benchmark NMT between English and five African LRL pairs (Swahili, Amharic, Tigrigna, Oromo, Somali [SATOS]). We collected the available resources on the SATOS languages to evaluate the current state of NMT for LRLs. Our evaluation, comparing a baseline single language pair NMT model against semi-supervised learning, transfer learning, and multilingual modeling, shows significant performance improvements in both the En-LRL and LRL-En directions. In terms of averaged BLEU score, the multilingual approach shows the largest gains, up to +5 points, in six out of ten translation directions. To demonstrate the generalization capability of each model, we also report results on multi-domain test sets. We release the standardized experimental data and the test sets for future work addressing the challenges of NMT in under-resourced settings, in particular for the SATOS languages.







1 Introduction

Since the introduction of recurrent neural networks (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2014) and, more recently, with the wide use of the Transformer model (Vaswani et al., 2017), NMT has shown increasingly better translation quality (Bentivogli et al., 2018). Despite this progress, the NMT paradigm is data demanding and still shows limitations when models are trained on small parallel corpora (Koehn and Knowles, 2017). Unfortunately, most of the 7,000+ languages and language varieties currently spoken around the world fall under this low-resource condition. For these languages, the absence of usable NMT models creates a barrier in the increasingly digitized social, economic, and political dimensions. To overcome this barrier, building effective NMT models from small-sized training sets has become a primary challenge in MT research.

Recent progress, however, has shown the possibility of learning better models even in this challenging scenario. Promising approaches rely on semi-supervised learning (Sennrich et al., 2016), transfer learning (Zoph et al., 2016), and multilingual NMT solutions (Johnson et al., 2017; Ha et al., 2016). However, due to resource availability issues and to the priority given to well-established evaluation benchmarks, the language and geographical coverage of NMT remains limited. This calls for further investigation of the current state of NMT on new languages, especially at a time when an effective and equitable exchange of information across borders and cultures is a pressing need for many.

Therefore, the objectives of this work are: i) to benchmark and expand the current boundary of NMT to five prominent languages used in the East African region (i.e., Swahili, Amharic, Tigrigna, Oromo, and Somali), which have not yet been extensively studied within the NMT paradigm; ii) to investigate the strengths and weaknesses of NMT applied to LRLs and define open problems; and iii) to release a standardized training dataset for the five LRLs, as well as multi-domain test sets for all ten translation directions, to encourage future research in zero-shot and unsupervised translation between LRLs.

2 Experimental Settings

2.1 Dataset and Preprocessing

For the five languages aligned to English, we collect all available parallel data from the Opus corpus (Tiedemann, 2012), including JW300 (Agić and Vulić, 2019), Bible (Christodouloupoulos and Steedman, 2015), Tanzil, and Ted talks (Cettolo et al., 2012). For pre-training a massive multilingual model for our transfer learning (TL) experiments, we utilize the Ted talks corpus by Qi et al. (2018), which contains English-aligned parallel data for 116 languages but does not include the SATOS ones. Monolingual data for each SATOS language are extracted from Wikipedia dumps (https://dumps.wikimedia.org/) and the Habit corpus (Rychlỳ and Suchomel, 2016; http://habit-project.eu/wiki/SetOfEthiopianWebCorpora). Note that, given the data scarcity conditions characterizing the SATOS languages, our goal is to collect all the corpora available for these languages, so as to ultimately provide a standardized dataset with multi-domain test benchmarks that facilitates future research. Tables 3 and 4 show the amount of parallel and monolingual data collected. (Considering the geographical location, there are many more languages in the Horn of Africa, which we plan to investigate in future work.)

At preprocessing time, the data is split into train, dev, and test sets. To avoid bias towards specific domains, balanced dev and test sets are built by randomly selecting up to a fixed number of segments per domain. The remaining material is left as training data, after filtering out segments similar to those contained in the dev and test sets in order to avoid potential overlap. The standardized data is then segmented into subword units using SentencePiece (Kudo and Richardson, 2018; https://github.com/google/sentencepiece). The number of segmentation rules is kept fixed for all models, except for the multilingual models, which use a different subword vocabulary size. When required, particularly for evaluation, the Moses toolkit (Koehn et al., 2007) is used to tokenize/detokenize segments. Unless otherwise specified, we use the same preprocessing stages for all the models.
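The split-and-filter procedure above can be sketched as follows. This is a minimal stdlib-only sketch: the paper's actual similarity-based filtering and split sizes are not specified here, so this version uses exact source-side matching and illustrative sizes.

```python
import random

def split_corpus(pairs, dev_size, test_size, seed=0):
    """Randomly split parallel (source, target) segments into train/dev/test."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    dev = pairs[:dev_size]
    test = pairs[dev_size:dev_size + test_size]
    # Drop training segments whose source side also appears in dev/test,
    # to avoid train/evaluation overlap (the paper uses a similarity check).
    held_out = {src for src, _ in dev} | {src for src, _ in test}
    train = [p for p in pairs[dev_size + test_size:] if p[0] not in held_out]
    return train, dev, test
```

In practice the same held-out sets would be reused across all model types, so that S-NMT, SS-NMT, TL, and M-NMT are compared on identical data.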

2.2 Model Types And Evaluation

To evaluate different NMT approaches on LRLs, we train the following model types:

  1. S-NMT: single language pair models, one trained for each SATOS↔En translation direction.

  2. SS-NMT: semi-supervised models trained, for each language pair, on the original parallel data of the S-NMT model plus synthetic data generated with back-translation.

  3. TL: a child model for each language pair, adapted from the massively multilingual parent model (M-NMT116) using the pair's parallel data.

  4. M-NMT: a single multilingual model trained on the aggregation of all the SATOS↔En data.

These NMT models are evaluated on multi-domain test sets when available; otherwise, only the in-domain test set is used. BLEU (Papineni et al., 2002) is used to measure systems' performance (Moses toolkit: http://www.statmt.org/moses). When En is the target language, BLEU scores are computed on detokenized (hypothesis, reference) pairs. When the target is an LRL, we report tokenized BLEU. Further details about the NMT model types considered in our evaluation are given in Appendix A.3.
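For illustration, a minimal corpus-level BLEU (modified n-gram precisions up to 4-grams, combined with a brevity penalty) can be sketched as follows. This is a simplified stand-in for intuition only, not the Moses scoring script used in the paper:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on whitespace-tokenized, single-reference data."""
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # hypothesis n-gram counts per order
    hyp_len = ref_len = 0   # lengths for the brevity penalty
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngrams(h, n), ngrams(r, n)
            matches[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(matches) == 0:
        return 0.0  # some n-gram order has zero matches
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return 100 * bp * math.exp(log_prec)
```

Real evaluations should rely on a standard, shared implementation so that tokenized and detokenized scores remain comparable across papers.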

2.3 Model Settings

All the models are trained using the OpenNMT implementation (Klein et al., 2017; http://opennmt.net/) of the Transformer (Vaswani et al., 2017). The model parameters are set to the same number of hidden units and embedding dimensions, with a fixed number of self-attentional encoder-decoder layers and attention heads. At training time, we use a token-level batch size with a fixed maximum sentence length. For inference, we keep an example-level batch size with a fixed beam-search width. LazyAdam (Kingma and Ba, 2014) is applied throughout all the strategies with a constant initial learning rate. Given the sparsity of the data, dropout (Srivastava et al., 2014) is applied. The multilingual models (M-NMT and M-NMT116) are run for a fixed number of steps, while the S-NMT, SS-NMT, and the adaptation steps of the TL approach vary based on the amount of data used. In all runs, model convergence is checked based on the validation loss.

3 Results and Discussion

Table 1 shows the performance of the different LRL modeling approaches on the multi-domain test sets. Looking at the single pair NMT models (S-NMT), we observe that in all the test domains they underperform the SS-NMT, TL, and M-NMT models in terms of averaged (AVG) BLEU scores. Within each domain, the S-NMT models perform reasonably well on the in-domain test sets, while on the out-of-domain Ted test set we often observe rather large degradations. For instance, on the Sw/Am/So-En Ted test sets, there is a consistent performance drop in both the LRL→En and En→LRL translation directions. The performance drop on test sets featuring a domain shift with respect to the training data shows the susceptibility of NMT in a low-resource training condition. We expect that the S-NMT model performance can be improved with the more robust models described in Sec. A.3.

| Model | Domain | Sw→En | En→Sw | Am→En | En→Am | Ti→En | En→Ti | Om→En | En→Om | So→En | En→So |
|-------|--------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| S-NMT | Jw300 | 48.71 | 47.58 | 32.86 | 25.72 | 29.89 | 25.54 | 26.92 | 23.38 | – | – |
| | Bible | – | – | 30.35 | 23.36 | – | – | – | – | 29.87 | 24.64 |
| | Tanzil | 18.83 | 31.67 | 11.71 | 5.71 | – | – | – | – | 8.91 | 2.46 |
| | Ted | 16.63 | 11.92 | 4.26 | 1.32 | – | – | – | – | 1.35 | 0.39 |
| | AVG | 28.06 | 30.39 | 19.80 | 14.03 | 29.89 | 25.54 | 26.92 | 23.38 | 13.38 | 9.16 |
| SS-NMT | Jw300 | 48.90 | 47.45 | 32.76 | 26.54 | 29.84 | 25.99 | 26.45 | 23.47 | – | – |
| | Bible | – | – | 30.53 | 24.21 | – | – | – | – | 27.68 | 22.89 |
| | Tanzil | 19.44 | 32.17 | 12.55 | 7.29 | – | – | – | – | 6.75 | 2.25 |
| | Ted | 18.62 | 14.72 | 6.92 | 1.41 | – | – | – | – | 1.21 | 0.52 |
| | AVG | 28.99 | **31.45** | 20.69 | **14.86** | 29.84 | 25.99 | 26.45 | 23.47 | 11.88 | 8.55 |
| TL | Jw300 | 48.74 | 47.39 | 32.95 | 26.49 | 29.81 | 26.47 | 27.77 | 24.54 | – | – |
| | Bible | – | – | 30.36 | 24.26 | – | – | – | – | 32.07 | 27.67 |
| | Tanzil | 19.90 | 31.78 | 12.28 | 7.34 | – | – | – | – | 10.14 | 3.34 |
| | Ted | 19.74 | 14.81 | 7.42 | 1.31 | – | – | – | – | 1.97 | 0.56 |
| | AVG | **29.46** | 31.33 | 20.75 | 14.85 | 29.81 | **26.47** | 27.77 | 24.54 | 14.73 | 10.52 |
| M-NMT | Jw300 | 46.62 | 44.47 | 33.21 | 24.39 | 32.21 | 26.40 | 32.24 | 24.96 | – | – |
| | Bible | – | – | 29.78 | 20.01 | – | – | – | – | 34.99 | 28.76 |
| | Tanzil | 18.75 | 24.22 | 13.68 | 10.95 | – | – | – | – | 12.68 | 3.73 |
| | Ted | 17.54 | 14.65 | 6.78 | 1.32 | – | – | – | – | 3.09 | 1.01 |
| | AVG | 27.64 | 27.78 | **20.86** | 14.17 | **32.21** | 26.40 | **32.24** | **24.96** | **16.92** | **11.17** |

Table 1: BLEU scores for the SATOS↔En directions. The best domain-specific result is highlighted for each direction, while bold marks the overall best in terms of the AVG score.

Indeed, the AVG BLEU scores of the SS-NMT, TL, and M-NMT models are better than those of the S-NMT model in most of the cases. Specifically, M-NMT achieves the highest results in six out of ten directions. Interestingly, except for En→Om/So, all the other improvements of the M-NMT occur when translating into En, i.e., in the Am/Ti/Om/So→En directions. These improvements are highly related to the fact that all the LRLs are paired with En, maximizing the distribution of the En data on both the encoder and decoder side. M-NMT also shows the largest drops when compared to all the other models. These drops occur particularly for the Sw-En pair, with a −1.82 and −3.67 BLEU decrease respectively in comparison with the best performing approaches (TL for Sw→En, SS-NMT for En→Sw). Similarly, a slight degradation is observed in the En→Am/Ti directions. Our observation is that Sw-En can exploit the largest amount of parallel data, followed by Am-En and Ti-En. This indicates that the least resourced pairs (Om-En and So-En) benefit most from M-NMT modeling. Moreover, it is easy to notice that most of the performance degradation occurs when translating into the LRL.

For the SS-NMT and TL approaches, our experiments show comparable performance in most of the translation directions. Both approaches outperform the M-NMT in a total of four directions: SS-NMT in En→Sw/Am, and TL in Sw→En and En→Ti. Contrasting SS-NMT and TL, the latter shows either comparable or better performance. In particular, for the less-resourced Om/So-En pairs, the TL approach improves over SS-NMT. Indeed, these comparisons highly depend on several factors: the type of training data (monolingual for SS-NMT, parallel corpora for TL), size and data distribution, and domain mismatch between the monolingual and parallel data. For instance, for the So-En pair, SS-NMT shows a drop even though there is more training data from the back-translation stage. In addition to the poor quality of the back-translations, the drop can likely be attributed to the dissimilarity of the target monolingual data from the original parallel corpora.

Moreover, the domain-level performance of the SS-NMT, TL, and M-NMT models shows a similar pattern to that of the S-NMT. An interesting aspect is that the performance on the out-of-domain test set (Ted) shows a larger improvement margin than on the in-domain test sets, where the best results come from the TL and M-NMT models. For instance, TL improves Sw→En to 19.74 BLEU from the baseline S-NMT at 16.63, and En→Sw to 14.81 from 11.92. Note that these improvements can be attributed to the domain similarity with the M-NMT116 model, which is trained with Ted talks data and used in the TL stages. Moreover, using all the SATOS pairs, the M-NMT model improves all the out-of-domain test cases, with large gains for the extremely low-resourced (Om/So-En) pairs. Overall, utilizing all the data at our disposal, we show consistent improvements over the baseline S-NMT models. A summary of open problems for LRL NMT based on the findings of this work is presented in Section A.4.

4 Conclusions

In this work, we analyzed the state of NMT approaches on five low-resource languages. Our investigation shows that the baseline single-pair model can be significantly improved by the more robust semi-supervised, transfer-learning, and multilingual modeling approaches. However, tests on out-of-domain data show that all the approaches still perform poorly. This work will hopefully set the stage for further research on low-resource NMT modeling. Data, models, and scripts are available at https://github.com/surafelml/Afro-NMT. For open problems observed in this work, see Section A.4.


  • Ž. Agić and I. Vulić (2019) JW300: a wide-coverage parallel corpus for low-resource languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3204–3210. External Links: Link Cited by: §2.1.
  • N. Arivazhagan, A. Bapna, O. Firat, D. Lepikhin, M. Johnson, M. Krikun, M. X. Chen, Y. Cao, G. Foster, C. Cherry, et al. (2019) Massively multilingual neural machine translation in the wild: findings and challenges. arXiv preprint arXiv:1907.05019. Cited by: §A.4.
  • M. Artetxe, G. Labaka, and E. Agirre (2018) Unsupervised statistical machine translation. arXiv preprint arXiv:1809.01272. Cited by: §A.4.
  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §A.3.1, §1.
  • L. Bentivogli, A. Bisazza, M. Cettolo, and M. Federico (2018) Neural versus phrase-based mt quality: an in-depth analysis on english-german and english-french. Computer Speech & Language 49, pp. 52–70. Cited by: §1.
  • N. Bertoldi and M. Federico (2009) Domain adaptation for statistical machine translation with monolingual resources. In Proceedings of the fourth workshop on statistical machine translation, pp. 182–189. Cited by: §A.3.2.
  • I. Caswell, C. Chelba, and D. Grangier (2019) Tagged back-translation. arXiv preprint arXiv:1906.06442. Cited by: §A.3.2.
  • M. Cettolo, C. Girardi, and M. Federico (2012) Wit3: web inventory of transcribed and translated talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), Vol. 261, pp. 268. Cited by: §2.1.
  • K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §A.3.1, §1.
  • C. Christodouloupoulos and M. Steedman (2015) A massively parallel corpus: the bible in 100 languages. Language resources and evaluation 49 (2), pp. 375–395. Cited by: §2.1.
  • D. Dong, H. Wu, W. He, D. Yu, and H. Wang (2015) Multi-task learning for multiple language translation.. In ACL (1), pp. 1723–1732. Cited by: §A.3.4.
  • S. Edunov, M. Ott, M. Auli, and D. Grangier (2018) Understanding back-translation at scale. arXiv preprint arXiv:1808.09381. Cited by: §A.3.2.
  • O. Firat, K. Cho, and Y. Bengio (2016) Multi-way, multilingual neural machine translation with a shared attention mechanism. arXiv preprint arXiv:1601.01073. Cited by: §A.3.4.
  • J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin (2017) Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1243–1252. Cited by: §A.3.1.
  • J. Gu, Y. Wang, K. Cho, and V. O. Li (2019) Improved zero-shot neural machine translation via ignoring spurious correlations. arXiv preprint arXiv:1906.01181. Cited by: §A.4.
  • F. Guzmán, P. Chen, M. Ott, J. Pino, G. Lample, P. Koehn, V. Chaudhary, and M. Ranzato (2019) Two new evaluation datasets for low-resource machine translation: nepali-english and sinhala-english. arXiv preprint arXiv:1902.01382. Cited by: §A.4.
  • T. Ha, J. Niehues, and A. Waibel (2016) Toward multilingual neural machine translation with universal encoder and decoder. arXiv preprint arXiv:1611.04798. Cited by: §A.3.4, §A.4, §1.
  • M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean (2017) Google’s multilingual neural machine translation system: enabling zero-shot translation. Transactions of the Association for Computational Linguistic 5, pp. 339–351. External Links: Link Cited by: §A.3.4, §A.4, §1.
  • N. Kalchbrenner and P. Blunsom (2013) Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1700–1709. Cited by: §A.3.1.
  • D. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §2.3.
  • G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush (2017) OpenNMT: open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810. Cited by: §2.3.
  • T. Kocmi and O. Bojar (2018) Trivial transfer learning for low-resource neural machine translation. arXiv preprint arXiv:1809.00357. Cited by: §A.3.3.
  • P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, et al. (2007) Moses: open source toolkit for statistical machine translation. In Proc. of ACL, Cited by: §2.1.
  • P. Koehn and R. Knowles (2017) Six challenges for neural machine translation. arXiv preprint arXiv:1706.03872. Cited by: §1.
  • T. Kudo and J. Richardson (2018) Sentencepiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226. Cited by: §A.4, §2.1.
  • S. M. Lakew, Q. F. Lotito, N. Matteo, T. Marco, and F. Marcello (2017a) Improving zero-shot translation of low-resource languages. In 14th International Workshop on Spoken Language Translation, Cited by: §A.4.
  • S. M. Lakew, M. A. Di Gangi, and M. Federico (2017b) “Multilingual Neural Machine Translation for Low Resource Languages”. In Proceedings of the 4th Italian Conference on Computational Linguistics (CLiC-IT), Rome, Italy. Cited by: §A.3.3.
  • S. M. Lakew, A. Erofeeva, M. Negri, M. Federico, and M. Turchi (2018) “Transfer Learning in Multilingual Neural Machine Translation with Dynamic Vocabulary”. In 15th International Workshop on Spoken Language Translation (IWSLT), Bruges, Belgium. Cited by: §A.3.3.
  • G. Lample, L. Denoyer, and M. Ranzato (2018) Unsupervised machine translation using monolingual corpora only. In Proceedings of the 6th International Conference on Learning Representations, Cited by: §A.4.
  • G. Neubig and J. Hu (2018) Rapid adaptation of neural machine translation to new languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 875–880. External Links: Link Cited by: §A.3.3, §A.4.
  • T. Q. Nguyen and D. Chiang (2017) Transfer learning across low-resource, related languages for neural machine translation. arXiv preprint arXiv:1708.09803. Cited by: §A.3.3.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §2.2.
  • Y. Qi, D. S. Sachan, M. Felix, S. J. Padmanabhan, and G. Neubig (2018) When and why are pre-trained word embeddings useful for neural machine translation?. In Proceedings of NAACL-HLT 2018, pp. 529–535. External Links: Link Cited by: §2.1.
  • P. Rychlỳ and V. Suchomel (2016) Annotated amharic corpora. In International Conference on Text, Speech, and Dialogue, pp. 295–302. Cited by: Table 4, §2.1.
  • R. Sennrich, B. Haddow, and A. Birch (2015) Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709. Cited by: §A.3.2, §A.3.2.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 1715–1725 (English). External Links: Document, ISBN 978-1-945626-03-6 Cited by: §1.
  • N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting.. Journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §2.3.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §A.3.1, §1.
  • M. Y. Tachbelie, S. T. Abate, and W. Menzel (2009) Morpheme-based language modeling for amharic speech recognition. In Human Language Technology. Challenges for Computer Science and Linguistics., Cited by: §A.4.
  • J. Tiedemann (2012) Parallel data, tools and interfaces in opus.. In Proceedings of Language Resources and Evaluation (LREC), Cited by: §2.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 6000–6010. Cited by: §A.3.1, §1, §2.3.
  • B. Zoph, D. Yuret, J. May, and K. Knight (2016) Transfer learning for low-resource neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1568–1575. External Links: Link Cited by: §A.3.3, §1.

Appendix A Appendix

A.1 Low-Resource (SATOS) Languages

| Language | Family | Sub-Family | Script | Users/Speakers |
|----------|--------|------------|--------|----------------|
| Swahili (Sw) | Niger-Congo | Bantu | Latin | ~150M |
| Amharic (Am) | Afroasiatic | South-Semitic | Ge'ez/Ethiopic | ~25M |
| Tigrigna (Ti) | Afroasiatic | South-Semitic | Ge'ez/Ethiopic | ~7M |
| Oromo (Om) | Afroasiatic | Cushitic | Latin | ~35M |
| Somali (So) | Afroasiatic | Cushitic | Latin | ~16M |

Table 2: Background on the SATOS languages considered in this work. The number of speakers follows the estimates provided by https://www.ethnologue.com (2015).

A.2 Data and Statistics

| Pair | Split | Jw300 | Bible | Tanzil | Ted | Total |
|------|-------|-------|-------|--------|-----|-------|
| Sw-En | train | 907842 | – | 87645 | – | 995487 |
| Sw-En | dev | 5179 | – | 3505 | 681 | 9365 |
| Sw-En | test | 5315 | – | 3509 | 1364 | 10188 |
| Am-En | train | 538677 | 43172 | 17461 | – | 599310 |
| Am-En | dev | 4514 | 4685 | 4905 | – | 14104 |
| Am-En | test | 4551 | 4685 | 4911 | 567 | 14714 |
| Ti-En | train | 344540 | – | – | – | 344540 |
| Ti-En | dev | 4845 | – | – | – | 4845 |
| Ti-En | test | 4945 | – | – | – | 4945 |
| Om-En | train | 907842 | – | – | – | 907842 |
| Om-En | dev | 5179 | – | – | – | 5179 |
| Om-En | test | 5315 | – | – | – | 5315 |
| So-En | train | – | 44276 | 24592 | – | 68868 |
| So-En | dev | – | 4713 | 4393 | 565 | 9671 |
| So-En | test | – | 4735 | 4450 | 1132 | 10317 |

Table 3: Data statistics (number of examples) for each SATOS language paired with English, across four domains.
| | Sw | Am | Ti | Om | So |
|------|----|----|----|----|----|
| Wiki | 351805 | 114251 | 2560 | 12162 | 69386 |
| Habit | – | 1208947 | 139357 | 250432 | 2643337 |
| Total | 351805 | 1323198 | 141917 | 262594 | 2712723 |

Table 4: Monolingual data size of the SATOS languages, collected from Wikipedia dumps and the Habit project (Rychlỳ and Suchomel, 2016). Note that the English-side monolingual data comes only from Wikipedia and is selected in proportion to each LRL's monolingual data.

A.3 NMT Approaches for Low-Resource Languages

A.3.1 Neural Machine Translation

MT is the task of mapping a source-language sequence $X = (x_1, \dots, x_n)$ into a target-language sequence $Y = (y_1, \dots, y_m)$, where $n$ and $m$ can differ. Several types of architectures have been proposed for modeling NMT: recurrent (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2014), convolutional (Gehring et al., 2017), and, more recently, the Transformer (TNN) by Vaswani et al. (2017), which has shown better performance while processing input tokens in parallel. Though there are different formalizations of NMT for sequence representation, the common underlying principle is to learn the model in an end-to-end fashion. In general, an encoder network reads the input sequence $X$ and creates a latent representation of it, whereas a decoder network learns how to generate the output sequence $Y$. In this work, we utilize the TNN for modeling the NMT systems.

The TNN is built using a mechanism called self-attention, which computes relations between the different positions of a given sequence to generate hidden representations. Both the encoder and the decoder of the TNN consist of a stack of self-attention layers followed by fully-connected feed-forward (FFN) layers. The encoder is composed of $N$ identical layers, each comprising two sub-layers: the first sub-layer is a multi-headed self-attention, while the second is an FFN. The decoder side is similar to the encoder, except that a third multi-head attention sub-layer is added to specifically attend to the encoder representation. At each target-token prediction step, a conditional probability is computed given the previously decoded tokens and the source sequence $X$:

$$p(Y \mid X) = \prod_{t=1}^{m} p(y_t \mid y_{<t}, X)$$

The network is trained end-to-end to find the parameters $\theta$ that maximize the log-likelihood of the training set $D$:

$$\theta^{*} = \operatorname*{arg\,max}_{\theta} \sum_{(X, Y) \in D} \log p(Y \mid X; \theta)$$

Figure 1: NMT modeling, from left to right: single pair, semi-supervised, multilingual, and transfer-learning strategies.

The standard NMT (S-NMT) training requires the availability of a source to target language aligned parallel corpus. Hence, the objective function is simply to learn the mapping from the source and target training examples. Moreover, several training objectives have been suggested for NMT training, as illustrated in Figure 1, which we will discuss in the following sections.
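As a toy illustration of this objective, the log-likelihood of the training set is just the sum of per-token conditional log-probabilities; the probability values in the test below are hypothetical:

```python
import math

def sentence_log_likelihood(token_probs):
    """log p(Y|X): sum of log p(y_t | y_<t, X) over the target tokens."""
    return sum(math.log(p) for p in token_probs)

def corpus_log_likelihood(prob_sequences):
    """The quantity that training maximizes, summed over the training set D."""
    return sum(sentence_log_likelihood(seq) for seq in prob_sequences)
```

In an actual NMT system the per-token probabilities come from the decoder's softmax, and the objective is maximized with gradient-based optimization rather than computed in closed form.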

A.3.2 Semi-Supervised NMT

In semi-supervised NMT (SS-NMT), monolingual data is utilized to improve over the S-NMT model. The primary way of achieving SS-NMT is known as back-translation (Bertoldi and Federico, 2009; Sennrich et al., 2015). To improve a Source→Target model with target-language monolingual data, SS-NMT can be formalized in three stages: i) train a Target→Source model by reversing the parallel data, ii) translate the target monolingual data with the reverse model, and iii) train the Source→Target model on the merger of the original and the newly generated synthetic parallel data.

The expectation is that, with the augmented data, the Source→Target translation performance can be further improved. There are other variants of back-translation-based SS-NMT (Edunov et al., 2018; Caswell et al., 2019); however, in this work we focus on the three stages above, following Sennrich et al. (2015).
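The three back-translation stages can be sketched as below, with `reverse_translate` standing in for the Target→Source model of stage i (a hypothetical placeholder, not an actual NMT system):

```python
def back_translate(target_mono, reverse_translate):
    """Stage ii: translate target-side monolingual sentences with the
    reverse (Target->Source) model, yielding synthetic (source, target) pairs."""
    return [(reverse_translate(t), t) for t in target_mono]

def build_ssnmt_corpus(parallel, target_mono, reverse_translate):
    """Stage iii: merge the original parallel data with the synthetic pairs
    before training the forward (Source->Target) model."""
    return parallel + back_translate(target_mono, reverse_translate)
```

Note that the target side of every synthetic pair is genuine text; only the source side is machine-generated, which is why back-translation tends to help most when translating into the language of the monolingual data.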

A.3.3 Transfer-Learning Based NMT

Zoph et al. (2016) proposed a TL paradigm where a model trained on a high-resource pair (parent) is used to initialize the training of a model for an LRL pair (child). The TL approach was later improved by incorporating related languages in the parent-child transfer setup (Nguyen and Chiang, 2017; Kocmi and Bojar, 2018). The parent can also be trained on large-scale multilingual data (Neubig and Hu, 2018) and adapted to the LRL pair. Moreover, by tailoring the parent vocabulary and the associated model parameters to the child (new LRL) pair, Lakew et al. (2017b) showed better positive transfer, an approach also known as dynamic TL.

Given the diversity of languages and writing scripts, in this work we utilize the dynamic TL mechanism following the experimental setup in Lakew et al. (2018). In other words, assuming a parent model pre-trained on large-scale multilingual data that does not include the SATOS languages, the TL stage must involve customization to the LRL pair. Our goal is to investigate how much the pre-trained model helps to improve a new LRL pair, rather than to compare the different TL approaches.
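A minimal sketch of the dynamic-vocabulary idea follows: parent embeddings are reused for subwords shared with the child vocabulary, while entries for subwords new to the child are freshly initialized. The function and the initialization scheme are illustrative assumptions, not the exact procedure of Lakew et al. (2018):

```python
import random

def adapt_embeddings(parent_vocab, parent_emb, child_vocab, dim, seed=0):
    """Build the child embedding table from the parent's.

    parent_emb: dict mapping token -> embedding vector (list of floats).
    Shared tokens keep their parent vectors (where transfer happens);
    tokens unseen by the parent get small random vectors."""
    rng = random.Random(seed)
    child_emb = {}
    for tok in child_vocab:
        if tok in parent_vocab:
            child_emb[tok] = parent_emb[tok]          # transferred entry
        else:
            child_emb[tok] = [rng.gauss(0.0, 0.02) for _ in range(dim)]
    return child_emb
```

The rest of the parent's parameters (encoder/decoder layers) are carried over unchanged and then fine-tuned on the child pair's parallel data.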

A.3.4 Multilingual NMT

M-NMT can be considered under the umbrella of the TL approach, but within a single (parent) model that aggregates the parallel data of all language pairs. Hence, TL can occur implicitly, based on the assumption that combining the data of all the available pairs brings more diversity to the model's training corpus. Though there are several M-NMT modeling mechanisms (Dong et al., 2015; Firat et al., 2016), we follow the single encoder-decoder approach (Johnson et al., 2017; Ha et al., 2016), which works by prepending a target-language flag to each source-language example. Our goal is to comparatively evaluate the significance of an M-NMT model that leverages the aggregation of all the SATOS language data.
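The target-language-flag mechanism can be sketched as follows; the `<2xx>` token format is an illustrative assumption in the spirit of Johnson et al. (2017):

```python
def tag_for_multilingual(src_sentence, tgt_lang):
    """Prepend a target-language flag (e.g. "<2am>") so a single
    encoder-decoder model knows which language to translate into."""
    return f"<2{tgt_lang}> {src_sentence}"

def tag_corpus(pairs_by_direction):
    """Aggregate direction-specific corpora into one tagged training set.

    pairs_by_direction: dict mapping target-language code ->
    list of (source, target) sentence pairs."""
    tagged = []
    for tgt_lang, pairs in pairs_by_direction.items():
        tagged += [(tag_for_multilingual(s, tgt_lang), t) for s, t in pairs]
    return tagged
```

Because the flag is just another source token, no architectural change is needed; the model learns to condition its output language on it, which is also what makes zero-shot directions possible in principle.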

A.4 Open Problems

The results reported in Table 1 and the discussion above confirm what has been reported in the literature on back-translation, transfer learning, and multilingual modeling for improving LRL translation tasks. However, there are still open problems that require further investigation with respect to the SATOS languages and other languages with small training data:

Language and Data: As shown in Table 2, we explored five languages that are low-resource as well as highly diverse. The varied characteristics of these languages can pose new challenges, especially in the low-resource NMT setting: each language can exhibit characteristics that may require a specialized modeling criterion. For instance, Am and Ti are morphologically rich languages (Tachbelie et al., 2009) that might benefit from alternative input modeling methods beyond the segmentation approach (Kudo and Richardson, 2018) we utilized in this work. More importantly, the availability of model training resources, both parallel and monolingual, is limited. Hence, data generation approaches that can diversify the existing examples can be a key ingredient to further improve current model performance. In this direction, Arivazhagan et al. (2019) indicated the importance of formulating sample-efficient learning algorithms and approaches that can leverage other forms of data, such as speech and images.

Domain Shift: This can be characterized by scenarios such as domain imbalance within the training data or domain mismatch between parallel and monolingual data. The poor performance of each model type on the Ted talks is a good indication for identifying and assessing the weakness of NMT, especially in the low-resource setting. Moreover, the poor performance of SS-NMT shows that back-translation can even harm the initial model (S-NMT) if the monolingual data is too distant from the in-domain data. Thus, in the absence of sufficient training material, learning a better translation model by exploiting all available domains is an important criterion. This direction requires a model that can generalize well across domains while minimizing negative effects such as those observed in the SS-NMT case.

Zero-Resource Language: As noted in Sec. 1, the majority of the world's languages do not have parallel training material. Hence, for language pairs with only monolingual data (i.e., zero-resource languages), alternative modeling strategies are needed. We highlight this aspect because a realistic low-resource NMT effort should aim at enabling and improving translation between the LRL pairs themselves. Indeed, recent progress in zero-shot (Johnson et al., 2017; Ha et al., 2016) and unsupervised (Artetxe et al., 2018; Lample et al., 2018) approaches remains the primary option to explore. However, in light of recent studies (Neubig and Hu, 2018; Guzmán et al., 2019) showing the weakness of zero-resource approaches, further investigation is required for languages such as SATOS. In other words, certain LRLs share few similarities (e.g., Am vs. Sw), and in the absence of comparable and large amounts of monolingual data, zero-resource NMT settings become highly challenging. In such resource-scarce settings, incrementally learning and improving zero-resource directions from monolingual data by leveraging a multilingual model (Lakew et al., 2017a; Gu et al., 2019) could be a promising alternative to investigate.