Since the introduction of recurrent neural networks (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2014) and, more recently, with the wide adoption of the Transformer model (Vaswani et al., 2017), NMT has shown increasingly better translation quality (Bentivogli et al., 2018). Despite this progress, the NMT paradigm is data demanding and still shows limitations when models are trained on small parallel corpora (Koehn and Knowles, 2017). Unfortunately, most of the 7,000+ languages and language varieties currently spoken around the world fall under this low-resource condition. For these languages, the absence of usable NMT models creates a barrier in the increasingly digitized social, economic, and political dimensions. To overcome this barrier, building effective NMT models from small-sized training sets has become a primary challenge in MT research.
Recent progress, however, has shown that better models can be learned even in this challenging scenario. Promising approaches rely on semi-supervised learning (Sennrich et al., 2016), transfer learning (Zoph et al., 2016), and multilingual NMT solutions (Johnson et al., 2017; Ha et al., 2016). However, due to resource availability issues and to the priority given to well-established evaluation benchmarks, the language and geographical coverage of NMT remains limited. This calls for further investigation of the current state of NMT on new languages, especially at a time when an effective and equitable exchange of information across borders and cultures is an essential necessity for many.
Therefore, the objectives of this work are: i) to benchmark and expand the current boundary of NMT to five prominent languages used in the East African region (i.e., Swahili, Amharic, Tigrigna, Oromo, and Somali), which have not yet been extensively studied within the NMT paradigm; ii) to investigate the strengths and weaknesses of NMT applied to LRLs and define open problems; iii) to release a standardized training dataset for the five LRLs, as well as multi-domain test sets for multiple translation directions, to encourage future research in zero-shot and unsupervised translation between LRLs.
2 Experimental Settings
2.1 Dataset and Preprocessing
For the five languages aligned with English, we collect all available parallel data from the OPUS corpus (Tiedemann, 2012), including JW300 (Agić and Vulić, 2019), Bible (Christodouloupoulos and Steedman, 2015), Tanzil, and Ted talks (Cettolo et al., 2012). For pre-training a massively multilingual model for our transfer learning (TL) experiments, we utilize the Ted talks corpus by Qi et al. (2018), which contains English-aligned parallel data for many languages but does not include the SATOS ones. Monolingual data for each SATOS language are extracted from Wikipedia dumps (https://dumps.wikimedia.org/) and the Habit corpus (Rychlỳ and Suchomel, 2016) (http://habit-project.eu/wiki/SetOfEthiopianWebCorpora). Note that, given the data scarcity conditions characterizing the SATOS languages, our goal is to collect all the corpora available for these languages, so as to ultimately provide a standardized dataset comprising multi-domain test benchmarks that facilitate future research. Tables 3 and 4 show the amount of data collected in the parallel and monolingual settings. (Considering the geographical location, there are many more languages in the Horn of Africa, which we plan to investigate in future work.)
At preprocessing time, the data is split into train, dev, and test sets. To avoid bias towards specific domains, balanced dev and test sets are built by randomly selecting up to a fixed number of segments per domain. The remaining material is left as training data, after filtering out segments similar to those contained in the dev and test sets in order to avoid potential overlap. Then, the standardized data is segmented into subword units using SentencePiece (Kudo and Richardson, 2018) (https://github.com/google/sentencepiece). The same segmentation rules are used for all models, except for the multilingual models, which use a different number of subwords. When required, particularly for evaluation, the Moses toolkit (Koehn et al., 2007) is used to tokenize/detokenize segments. Unless otherwise specified, we use the same pre-processing stages to train all the models.
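The dev/test sampling and overlap filtering described above can be sketched as follows. This is a minimal illustration, not our released scripts: the function name, the `cap` parameter, and the exact-match overlap test are simplifying assumptions (the actual per-domain selection sizes and similarity filtering were tuned per language pair).

```python
import random

def build_splits(domain_pairs, cap, seed=42):
    """Build balanced dev/test sets by sampling up to `cap` segment pairs
    per domain, then keep the rest as training data after removing any
    segment whose source side also appears in dev or test.

    domain_pairs: dict mapping a domain name to a list of (src, tgt) pairs.
    """
    rng = random.Random(seed)
    dev, test, remainder = [], [], []
    for pairs in domain_pairs.values():
        shuffled = pairs[:]
        rng.shuffle(shuffled)
        dev.extend(shuffled[:cap])
        test.extend(shuffled[cap:2 * cap])
        remainder.extend(shuffled[2 * cap:])
    # Overlap filtering: drop training segments whose source side is
    # held out in dev/test.
    held_out = {src for src, _ in dev + test}
    train = [(s, t) for s, t in remainder if s not in held_out]
    return train, dev, test
```

In practice a fuzzier similarity check than exact source match is preferable, since near-duplicate segments also leak between splits.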
2.2 Model Types and Evaluation
To evaluate different NMT approaches on LRLs, we train the following model types:
S-NMT: single language-pair models, one trained for each of the SATOS↔En pairs.
SS-NMT: semi-supervised models trained, for each language pair, with the original parallel data of the S-NMT model plus synthetic data generated with back-translation.
TL: a child model for each language pair, adapted with its parallel data from the massively multilingual parent model (M-NMT116).
M-NMT: a single multilingual model trained on the aggregation of all the SATOS↔En data.
These NMT models are evaluated on multi-domain test sets when available; otherwise, only the in-domain test set is used. BLEU (Papineni et al., 2002), as implemented in the Moses toolkit (http://www.statmt.org/moses), is used to measure system performance. When En is the target language, BLEU scores are computed on detokenized (hypothesis, reference) pairs. When the target is an LRL, we report tokenized BLEU. Further details about the NMT model types considered in our evaluation are given in Appendix A.3.
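For reference, the BLEU metric used throughout this evaluation combines clipped n-gram precisions with a brevity penalty. The following is a simplified single-reference sketch, not the Moses multi-bleu script we actually use (which additionally handles multiple references and its own tokenization):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per segment: the geometric
    mean of clipped n-gram precisions times a brevity penalty."""
    hyp_len = ref_len = 0
    clipped = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # candidate n-gram counts, per order
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ngr, r_ngr = ngrams(h, n), ngrams(r, n)
            clipped[n - 1] += sum(min(c, r_ngr[g]) for g, c in h_ngr.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(totals) == 0 or min(clipped) == 0:
        return 0.0
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * bp * math.exp(log_prec)
```

Because the score depends on the token stream, tokenized and detokenized BLEU (as reported for the LRL and En targets, respectively) are not directly comparable.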
2.3 Model Settings
All the models are trained using the OpenNMT implementation (http://opennmt.net/) (Klein et al., 2017) of the Transformer (Vaswani et al., 2017). The model is configured with matching hidden-unit and embedding dimensions, and a multi-layer self-attentive encoder-decoder with multi-head attention. At training time, we use a token-level batch size with a fixed maximum sentence length. For inference, we keep an example-level batch size and apply beam search. LazyAdam (Kingma and Ba, 2014) is applied throughout all the strategies with a constant initial learning rate. Given the sparsity of the data, dropout (Srivastava et al., 2014) is applied. The multilingual models (M-NMT and M-NMT116) are run for a fixed number of steps, while the number of steps of the S-NMT, SS-NMT, and the adaptation stages of the TL approach varies based on the amount of data used. In all runs, model convergence is checked based on the validation loss.
3 Results and Discussion
Table 1 shows the performance of the different LRL modeling approaches on the multi-domain test sets. Looking at the single-pair NMT models (S-NMT), we observe that in all the test domains they underperform with respect to the SS-NMT, TL, or M-NMT models in terms of averaged (AVG) BLEU scores. Within each domain, the S-NMT models perform reasonably well on the in-domain test sets, while for the out-of-domain Ted test set we often observe rather large degradations. For instance, on the Sw/Am/So-En Ted test sets, there is a consistent performance drop in both the LRL↔En translation directions. The performance drop on test sets featuring a domain shift with respect to the training data shows the susceptibility of NMT to low-resource training conditions. We expect that S-NMT performance can be improved with the more robust models described in Sec. A.3.
Indeed, the AVG BLEU scores of the SS-NMT, TL, and M-NMT models show better performance in most of the cases compared to the S-NMT model. Specifically, M-NMT achieves the highest results in six out of ten directions. Interestingly, except for En→Om/So, all the other improvements of the M-NMT occur when translating into En, i.e., in the Am/Ti/Om/So→En directions. These improvements are highly related to the fact that all the LRLs are paired with En, maximizing the amount of En data seen on both the encoder and decoder side. M-NMT also shows the largest drops when compared to all the other models. These drops occur particularly for the Sw-En pair, in both directions, in comparison with the best performing approaches (TL, SS-NMT). Similarly, a slight degradation is observed in the En→Am/Ti directions. Our observation is that Sw-En can exploit the largest amount of parallel data, followed by Am-En and Ti-En. This indicates that the least resourced pairs (Om-En and So-En) benefit most from M-NMT modeling. Moreover, most of the performance degradation occurs when translating into the LRL.
For the SS-NMT and TL approaches, our experiments show comparable performance in most of the translation directions. Both approaches outperform the M-NMT in a total of four directions: SS-NMT in En→Sw/Am, and TL in Sw→En and En→Ti. Contrasting SS-NMT and TL, the latter shows either comparable or better performance. In particular, for the less-resourced Om/So-En pairs, the TL approach improves over SS-NMT. Indeed, these comparisons highly depend on several factors: the type of training data (monolingual for SS-NMT, parallel corpora for TL), its size and distribution, and the domain mismatch between the monolingual and parallel data. For instance, for the So-En pair, SS-NMT shows a drop even though the back-translation stage provides more training data. In addition to the poor quality of the back-translations, the drop can likely be attributed to the dissimilarity of the target monolingual data from the original parallel corpora.
Moreover, the domain-level performance of the SS-NMT, TL, and M-NMT models shows a similar pattern to that of the S-NMT. An interesting aspect is that performance on the out-of-domain test set (Ted) shows a larger improvement margin than on the in-domain test sets, with the best performance coming from the TL and M-NMT models. For instance, TL improves the Sw→En score to 19.74 BLEU and the En→Sw score to 14.81 over the S-NMT baselines. Note that these improvements can be attributed to the domain similarity between the Ted talks data used to train the M-NMT116 model employed in the TL stages and the Ted test set. However, using all the SATOS pairs, the M-NMT model improves all the out-of-domain test cases, with large gains on the extremely low-resourced (Om/So-En) pairs. Overall, utilizing all the data at our disposal, we show consistent improvements over the baseline S-NMT models. A summary of open problems for LRL NMT based on the findings of this work is presented in Section A.4.
In this work, we analyzed the state of NMT approaches on five low-resource languages. Our investigation shows that the baseline single-pair models can be significantly improved by the more robust semi-supervised, transfer-learning, and multilingual modeling approaches. However, tests on out-of-domain data expose the weaknesses of all the approaches. This work will hopefully set the stage for further research on low-resource NMT modeling. Data, models, and scripts are available at https://github.com/surafelml/Afro-NMT. For open problems observed in this work, see Section A.4.
- Željko Agić and Ivan Vulić (2019). JW300: a wide-coverage parallel corpus for low-resource languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3204–3210.
- Naveen Arivazhagan et al. (2019). Massively multilingual neural machine translation in the wild: findings and challenges. arXiv preprint arXiv:1907.05019.
- Mikel Artetxe, Gorka Labaka, and Eneko Agirre (2018). Unsupervised statistical machine translation. arXiv preprint arXiv:1809.01272.
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Luisa Bentivogli et al. (2018). Neural versus phrase-based MT quality: an in-depth analysis on English-German and English-French. Computer Speech & Language 49, pp. 52–70.
- Nicola Bertoldi and Marcello Federico (2009). Domain adaptation for statistical machine translation with monolingual resources. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pp. 182–189.
- Isaac Caswell, Ciprian Chelba, and David Grangier (2019). Tagged back-translation. arXiv preprint arXiv:1906.06442.
- Mauro Cettolo, Christian Girardi, and Marcello Federico (2012). WIT3: web inventory of transcribed and translated talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), pp. 261–268.
- Kyunghyun Cho et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
- Christos Christodouloupoulos and Mark Steedman (2015). A massively parallel corpus: the Bible in 100 languages. Language Resources and Evaluation 49(2), pp. 375–395.
- Daxiang Dong et al. (2015). Multi-task learning for multiple language translation. In Proceedings of ACL, pp. 1723–1732.
- Sergey Edunov, Myle Ott, Michael Auli, and David Grangier (2018). Understanding back-translation at scale. arXiv preprint arXiv:1808.09381.
- Orhan Firat, Kyunghyun Cho, and Yoshua Bengio (2016). Multi-way, multilingual neural machine translation with a shared attention mechanism. arXiv preprint arXiv:1601.01073.
- Jonas Gehring et al. (2017). Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 1243–1252.
- Jiatao Gu et al. (2019). Improved zero-shot neural machine translation via ignoring spurious correlations. arXiv preprint arXiv:1906.01181.
- Francisco Guzmán et al. (2019). Two new evaluation datasets for low-resource machine translation: Nepali-English and Sinhala-English. arXiv preprint arXiv:1902.01382.
- Thanh-Le Ha, Jan Niehues, and Alexander Waibel (2016). Toward multilingual neural machine translation with universal encoder and decoder. arXiv preprint arXiv:1611.04798.
- Melvin Johnson et al. (2017). Google's multilingual neural machine translation system: enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5, pp. 339–351.
- Nal Kalchbrenner and Phil Blunsom (2013). Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1700–1709.
- Diederik P. Kingma and Jimmy Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Guillaume Klein et al. (2017). OpenNMT: open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810.
- Tom Kocmi and Ondřej Bojar (2018). Trivial transfer learning for low-resource neural machine translation. arXiv preprint arXiv:1809.00357.
- Philipp Koehn et al. (2007). Moses: open source toolkit for statistical machine translation. In Proceedings of ACL.
- Philipp Koehn and Rebecca Knowles (2017). Six challenges for neural machine translation. arXiv preprint arXiv:1706.03872.
- Taku Kudo and John Richardson (2018). SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.
- Surafel M. Lakew et al. (2017a). Improving zero-shot translation of low-resource languages. In 14th International Workshop on Spoken Language Translation (IWSLT).
- Surafel M. Lakew et al. (2017b). Multilingual neural machine translation for low resource languages. In Proceedings of the 4th Italian Conference on Computational Linguistics (CLiC-IT), Rome, Italy.
- Surafel M. Lakew et al. (2018). Transfer learning in multilingual neural machine translation with dynamic vocabulary. In 15th International Workshop on Spoken Language Translation (IWSLT), Bruges, Belgium.
- Guillaume Lample et al. (2018). Unsupervised machine translation using monolingual corpora only. In Proceedings of the 6th International Conference on Learning Representations (ICLR).
- Graham Neubig and Junjie Hu (2018). Rapid adaptation of neural machine translation to new languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 875–880.
- Toan Q. Nguyen and David Chiang (2017). Transfer learning across low-resource, related languages for neural machine translation. arXiv preprint arXiv:1708.09803.
- Kishore Papineni et al. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318.
- Ye Qi et al. (2018). When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of NAACL-HLT 2018, pp. 529–535.
- Pavel Rychlý and Vít Suchomel (2016). Annotated Amharic corpora. In International Conference on Text, Speech, and Dialogue, pp. 295–302.
- Rico Sennrich, Barry Haddow, and Alexandra Birch (2015). Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.
- Rico Sennrich, Barry Haddow, and Alexandra Birch (2016). Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 1715–1725.
- Nitish Srivastava et al. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), pp. 1929–1958.
- Ilya Sutskever, Oriol Vinyals, and Quoc V. Le (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112.
- Martha Yifiru Tachbelie et al. (2009). Morpheme-based language modeling for Amharic speech recognition. In Human Language Technology: Challenges for Computer Science and Linguistics.
- Jörg Tiedemann (2012). Parallel data, tools and interfaces in OPUS. In Proceedings of Language Resources and Evaluation (LREC).
- Ashish Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pp. 6000–6010.
- Barret Zoph et al. (2016). Transfer learning for low-resource neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1568–1575.
Appendix A
A.1 Low-Resource (SATOS) Languages
Background on the SATOS languages considered in this work. The number of speakers follows the estimates provided by https://www.ethnologue.com (2015).
A.2 Data and Statistics
A.3 NMT Approaches for Low-Resource Languages
A.3.1 Neural Machine Translation
MT is the task of mapping a source-language sequence x = (x_1, ..., x_n) into a target-language sequence y = (y_1, ..., y_m), where n and m can differ. Several types of architectures have been proposed for modeling NMT: recurrent (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2014), convolutional (Gehring et al., 2017), and recently the Transformer (TNN) by Vaswani et al. (2017), which has shown better performance while processing the input tokens simultaneously. Though there are different formalizations of NMT for sequence representation, the common underlying principle is to learn the model in an end-to-end fashion. In general, an encoder network reads the input sequence (x) and creates a latent representation of it, whereas a decoder network learns how to generate the output sequence (y). In this work, we utilize the TNN for modeling the NMT systems.
The TNN is built around a mechanism called self-attention, which computes relations between the different positions of a given sequence to generate hidden representations. Both the encoder and decoder of the TNN consist of a stack of self-attention layers followed by fully-connected feed-forward (FNN) layers. The encoder is composed of a stack of identical layers, each comprising two sub-layers: the first is a multi-headed self-attention, while the second is an FNN. The decoder side is similar to the encoder, except that a third multi-head attention sub-layer is added to specifically attend to the encoder representation. At each target-token prediction step t, a conditional probability is computed given the previously decoded tokens y_<t and the source sequence x:

p(y_t | y_<t, x)
The network is trained end-to-end to find the parameters θ that maximize the log-likelihood of the training set D:

L(θ) = Σ_{(x, y) ∈ D} log p(y | x; θ)
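Concretely, under teacher forcing the contribution of one training pair to this objective is just the sum of the log-probabilities the model assigns to each reference token given its left context. A toy numeric illustration (the probability list is hypothetical, standing in for the model's per-step softmax outputs):

```python
import math

def sequence_log_likelihood(token_probs):
    """Log-likelihood of one target sequence, given the probability the
    model assigned to each reference token at each decoding step
    (teacher forcing): sum_t log p(y_t | y_<t, x)."""
    return sum(math.log(p) for p in token_probs)
```

Maximizing the corpus objective then amounts to maximizing this quantity summed over all training pairs.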
Standard NMT (S-NMT) training requires the availability of a parallel corpus aligned from the source to the target language. Hence, the objective is simply to learn the mapping from the source and target training examples. Beyond this, several training objectives have been suggested for NMT, as illustrated in Figure 1, which we discuss in the following sections.
A.3.2 Semi-Supervised NMT
In semi-supervised NMT (SS-NMT), monolingual data is utilized to improve over the S-NMT model. The primary way of achieving this is known as back-translation (Bertoldi and Federico, 2009; Sennrich et al., 2015). To improve a Source→Target model with target-language monolingual data, SS-NMT can be formalized in three stages: i) train a Target→Source model by reversing the parallel data; ii) translate the target monolingual data with the reverse model; iii) train the Source→Target model by merging the original and the newly generated synthetic parallel data.
The expectation is that, with the augmented data, the Source→Target translation performance can be further improved. There are other variants of back-translation-based SS-NMT (Edunov et al., 2018; Caswell et al., 2019); however, in this work we focus on the above three stages, following Sennrich et al. (2015).
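The three stages above can be sketched as a data-flow. In this sketch `train_nmt` and `translate` are placeholders for the actual OpenNMT training and inference calls, so the snippet shows only how the corpora move between stages:

```python
def back_translation_pipeline(parallel, target_mono, train_nmt, translate):
    """Semi-supervised NMT via back-translation (Sennrich et al., 2015).

    parallel:    list of (src, tgt) sentence pairs
    target_mono: list of target-language sentences
    train_nmt:   callable(pairs) -> model (stands in for NMT training)
    translate:   callable(model, sentences) -> translations
    """
    # i) train a reverse Target->Source model on the flipped parallel data
    reverse_pairs = [(tgt, src) for src, tgt in parallel]
    reverse_model = train_nmt(reverse_pairs)
    # ii) back-translate the target monolingual data into synthetic sources
    synthetic_sources = translate(reverse_model, target_mono)
    synthetic_pairs = list(zip(synthetic_sources, target_mono))
    # iii) train the forward Source->Target model on original + synthetic data
    return train_nmt(parallel + synthetic_pairs)
```

Note that the reference side of the synthetic pairs is always genuine target-language text; only the source side is machine-generated, which is why back-translation quality matters mostly on the source side.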
A.3.3 Transfer-Learning Based NMT
Zoph et al. (2016) proposed a TL paradigm where a model trained on a high-resource pair (parent) is used to initialize the training of a model for an LRL pair (child). The TL approach was later improved by incorporating related languages in the parent-child transfer setup (Nguyen and Chiang, 2017; Kocmi and Bojar, 2018). The parent can also be trained on large-scale multilingual data (Neubig and Hu, 2018) and adapted to the LRL pair. Moreover, by tailoring the parent vocabulary and the associated model parameters to the child/new LRL pair, Lakew et al. (2017b) showed further positive transfer, an approach also known as dynamic TL.
Given the diversity of languages and writing scripts, in this work we utilize the dynamic TL mechanism, following the experimental setup in Lakew et al. (2018). In other words, assuming a parent model pre-trained on large-scale multilingual data that does not include the SATOS languages, the TL stage must involve customization to the LRL pair. Our goal is to investigate how far the pre-trained model helps to improve a new LRL pair, rather than to compare the different TL approaches.
A.3.4 Multilingual NMT
M-NMT can be considered under the umbrella of TL approaches, however within a single (parent) model that aggregates the parallel data of all language pairs. Hence, TL can occur implicitly, based on the assumption that combining the data of all the available pairs brings more diversity to the model training corpus. Though there are several M-NMT modeling mechanisms (Dong et al., 2015; Firat et al., 2016), we follow the single encoder-decoder approach (Johnson et al., 2017; Ha et al., 2016), which works by appending a target-language flag at the beginning of each source-language example. Our goal is to comparatively evaluate the significance of an M-NMT model that leverages the aggregation of the data of all the SATOS languages.
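The target-language flag mechanism reduces to a simple preprocessing step over the aggregated corpora; a minimal sketch (the `<2xx>` flag spelling is illustrative, following the style of Johnson et al. (2017)):

```python
def add_language_flags(corpora):
    """Merge per-pair corpora into one multilingual training set,
    prepending a target-language flag to every source sentence.

    corpora: dict mapping a target-language code (e.g. 'en', 'sw')
             to a list of (source, target) pairs.
    """
    merged = []
    for tgt_lang, pairs in corpora.items():
        flag = f"<2{tgt_lang}>"
        merged.extend((f"{flag} {src}", tgt) for src, tgt in pairs)
    return merged
```

Because the flag is just another source token, the same encoder-decoder serves every direction, and at inference time the flag alone selects the output language.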
A.4 Open Problems
The results reported in Table 1 and the discussion above confirm what has been reported in the literature on back-translation, transfer learning, and multilingual modeling for improving LRL translation tasks. However, there are still open problems that require further investigation with respect to the SATOS languages and other languages with small training data:
Language and Data: As shown in Table 2, we explored five languages that are low-resource as well as highly diverse. The varied characteristics of these languages can pose new challenges, even more so in the low-resource NMT setting: each language can exhibit characteristics that require a specialized modeling criterion. For instance, Am and Ti are morphologically rich languages (Tachbelie et al., 2009) that might benefit from alternative input modeling methods beyond the segmentation approach (Kudo and Richardson, 2018) utilized in this work. More importantly, the availability of model training resources, both in parallel and monolingual form, is limited. Hence, data generation approaches that can diversify the existing examples can be a key ingredient to further improve current model performance. In this direction, Arivazhagan et al. (2019) indicated the importance of formulating sample-efficient learning algorithms and approaches that can leverage other forms of data, such as speech and images.
Domain Shift: This can be characterized by scenarios such as domain imbalance within the training data or domain mismatch between the parallel and monolingual data. The poor performance of each modeling type on the Ted talks test set is a good indication of this weakness of NMT, even more so in the low-resource setting. Moreover, the poor performance of SS-NMT is another example of how back-translation can harm the initial (S-NMT) model performance if the monolingual data is too distant from the in-domain data. Thus, in the absence of sufficient training material, learning a better translation model by exploiting all available domains is an important criterion. This direction requires a model that can generalize well across domains while minimizing negative effects such as those observed in the SS-NMT case.
Zero-Resource Languages: As noted in Sec. 1, the majority of the world's languages do not have parallel training material. Hence, for language pairs with only monolingual data (i.e., zero-resource languages), alternative modeling strategies are needed. We highlight this aspect considering that truly low-resource NMT modeling should aim at enabling and improving translation between the LRL pairs themselves. Indeed, recent progress in zero-shot (Johnson et al., 2017; Ha et al., 2016) and unsupervised (Artetxe et al., 2018; Lample et al., 2018) approaches remains the primary option to explore. However, in light of recent studies (Neubig and Hu, 2018; Guzmán et al., 2019) that show the weakness of zero-resource approaches, further investigation is required for languages such as SATOS. In other words, certain LRLs share few similarities (e.g., Am vs. Sw), and in the absence of comparable and large amounts of monolingual data, zero-resource NMT settings become highly challenging. In such resource-scarce settings, incrementally learning and improving zero-resource directions from monolingual data by leveraging a multilingual model (Lakew et al., 2017a; Gu et al., 2019) could be a promising alternative to investigate.