Machine translation (MT) is the automation of text (or speech) translation from one language to another by software. MT emerged in the mid-twentieth century Weaver (1955) and has since experienced rapid growth owing to the increasing availability of parallel corpora and computational power Marino et al. (2006). Since the inception of neural machine translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2014), MT has seen substantial advancements in translation quality, with modern MT systems drawing ever closer to human translation quality. Notwithstanding this progress, the data-hungry nature of NMT system development remains a significant obstacle to extending this work to low-resource languages.
Unfortunately, most African languages fall into the low-resource group and, as a result, MT of these languages has seen little progress. Despite the effective strategies (or techniques) that have been developed to alleviate the low-resource problem, African languages have seen little substantial research into, or application of, these strategies Orife et al. (2020). With a considerable number of African languages being endangered Eberhard et al. (2019), these languages are in dire need of MT tools to help save them from disappearing. In other words, this poses a challenge to the African community of NLP practitioners.
This paper examines the application of low-resource learning techniques to African languages of the Bantu family. Of the three Bantu languages under consideration, isiZulu, isiXhosa and Shona, the first two fall under the Nguni language sub-class, indicating a close relationship between them, as shown in Figure 1. Shona is not closely related to the Nguni sub-class Holden and Mace (2003). Comparing MT on these three languages gives us the opportunity to explore the effect of correlations and similarities between languages. We give a comparative analysis of three learning protocols, namely transfer learning, zero-shot learning and multilingual modeling.
Our experiments indicate the tremendous opportunity of leveraging the inter-relations in the Bantu language sub-classes, to build translation systems. We show that with the availability of data, multilingual modeling provides significant improvements on baseline models. In the case of transfer learning, we show that language sub-class inter-relations play a major role in translation quality. Furthermore, we demonstrate that zero-shot learning only manages to surpass transfer learning when we employ a parent model that is distantly related to the task of interest.
The remainder of this paper is organized as follows: Section 2 briefly reviews NMT and the main architecture we use in this work. Section 3 then gives a brief outline of the training protocols employed in this paper, and Section 4 discusses related work. Section 5 describes our data, and Section 6 discusses our experiments on the translation of English-to-Shona (E-S), English-to-isiXhosa (E-X), English-to-isiZulu (E-Z) and isiXhosa-to-isiZulu (X-Z). Finally, the results and conclusion are presented in Sections 7 and 8 respectively.
Modern NMT models have the encoder-decoder mechanism Sutskever et al. (2014) as the vanilla architecture. Given an input sequence $\mathbf{x} = (x_1, \ldots, x_n)$ and a target sequence $\mathbf{y} = (y_1, \ldots, y_m)$, an NMT model decomposes the conditional probability distribution over the output sentences as follows:

$$p(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{m} p(y_t \mid y_{<t}, \mathbf{x}).$$
Jointly trained, the encoder-decoder mechanism learns to maximize the conditional log-likelihood

$$\mathcal{L}(\theta) = \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \log p(\mathbf{y} \mid \mathbf{x}; \theta),$$

where $\mathcal{D}$ denotes the set of training pairs and $\theta$ the set of parameters to be learned.
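As a small illustration of the objective above, the following Python sketch sums the token-level log-probabilities for one training pair; the probabilities are stand-ins for a real decoder's outputs, and `sequence_log_likelihood` is a hypothetical helper, not part of any NMT toolkit.

```python
import math

def sequence_log_likelihood(token_probs):
    """Sum of log p(y_t | y_<t, x) over the target positions of one pair.

    `token_probs` holds the probability the model assigns to each
    reference target token. Training maximizes this quantity summed
    over all (x, y) pairs in the training set D.
    """
    return sum(math.log(p) for p in token_probs)
```

A perfectly confident model (all probabilities 1.0) attains a log-likelihood of 0, the maximum; any uncertainty makes the value negative.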
Several encoder-decoder architectures have been developed, each modeling the probability distribution $p(\mathbf{y} \mid \mathbf{x})$ differently. For example, building on the general encoder-decoder architecture of Sutskever et al. (2014), the introduction of an encoder-decoder mechanism with attention Bahdanau et al. (2014) produced significant improvements in translation quality. The Bahdanau et al. (2014) architecture modifies the factorized conditional probability as follows:

$$p(y_t \mid y_{<t}, \mathbf{x}) = g(y_{t-1}, s_t, c_t),$$

where $g$ denotes the decoder, which generates the target output in auto-regressive fashion, and $s_t$ denotes the decoder recurrent neural network (RNN) hidden state. The widely used RNN variants are the gated recurrent unit (GRU) and the long short-term memory (LSTM) network, as these help mitigate the vanishing gradient problem Pouget-Abadie et al. (2014). The encoder compresses the input sequence into a fixed-length vector, widely known as the context vector. This vector is then relayed to the decoder, where it is taken in as input.
Attention-based models (Bahdanau et al., 2014; Chorowski et al., 2015; Luong et al., 2015b) have become the conventional architectures for most sequence-to-sequence transduction tasks. In the domain of NMT, the transformer model introduced by Vaswani et al. (2017) has become the vanilla architecture and has produced remarkable results on several translation tasks. In this work, we employ the transformer model in all our training protocols. The transformer architecture's fundamental unit is the multi-head attention mechanism, which learns different sequence representations with its multiple attention heads.
The key aspect of the transformer architecture is its ability to jointly attend to, or learn, different sequence representations belonging to different sub-spaces at distinct positions. In other words, the transformer model focuses on learning pairwise relation representations, which in turn are employed to learn the relations amongst each other. The multi-head attention mechanism consists of $h$ attention layers (heads) such that, for each head, the query and key-value pairs are first projected to some subspace with dense layers of sizes $p_q$, $p_k$ and $p_v$ respectively. Suppose the query, key and value have dimensions $d_q$, $d_k$ and $d_v$ respectively; each attention head then maps the query and key-value pairs to an output as follows:

$$\mathbf{o}_i = f\big(\mathbf{W}_i^{(q)} \mathbf{q}, \; \mathbf{W}_i^{(k)} \mathbf{k}, \; \mathbf{W}_i^{(v)} \mathbf{v}\big),$$

where $f$ is a dot-product attention function, $\mathbf{W}_i^{(q)} \in \mathbb{R}^{p_q \times d_q}$, $\mathbf{W}_i^{(k)} \in \mathbb{R}^{p_k \times d_k}$ and $\mathbf{W}_i^{(v)} \in \mathbb{R}^{p_v \times d_v}$. Thereafter, the $h$ outputs, of length $p_v$ each, are concatenated and fed to a dense layer of $d_o$ hidden units to produce the multi-head attention output

$$\mathbf{o} = \mathbf{W}_o \begin{bmatrix} \mathbf{o}_1 \\ \vdots \\ \mathbf{o}_h \end{bmatrix},$$

where $\mathbf{W}_o \in \mathbb{R}^{d_o \times h p_v}$. In all our training protocols (or experiments), we employ the transformer architecture with 6 encoder-decoder blocks, 8 attention heads, 256-dimensional word representations, a dropout rate of 0.1 and position-wise feed-forward layers with an inner dimension of 1024. Both the learning rate schedule and the optimization parameters are adopted from the work of Vaswani et al. (2017).
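The multi-head attention computation above can be sketched in NumPy. This is a minimal illustration of the mechanism, not the paper's implementation: the per-head projections act on row vectors (so weights multiply on the right), and the toy dimensions in the usage below are far smaller than the 256-dimensional, 8-head configuration the paper trains.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dot_product_attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d_k)) v."""
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)   # (n_q, n_k) pairwise similarities
    return softmax(scores) @ v        # (n_q, p_v) weighted values

def multi_head_attention(q, k, v, Wq, Wk, Wv, Wo):
    """Project q/k/v once per head, attend, concatenate, then apply
    the output projection Wo to produce the multi-head output."""
    heads = [dot_product_attention(q @ wq, k @ wk, v @ wv)
             for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

# Toy usage: h = 2 heads, sequence length 5, model dimension 8,
# per-head projection size 4 (so the concatenation has width 8).
rng = np.random.default_rng(0)
h, n, d, p = 2, 5, 8, 4
x = rng.normal(size=(n, d))           # self-attention: q = k = v = x
Wq = [rng.normal(size=(d, p)) for _ in range(h)]
Wk = [rng.normal(size=(d, p)) for _ in range(h)]
Wv = [rng.normal(size=(d, p)) for _ in range(h)]
Wo = rng.normal(size=(h * p, d))
out = multi_head_attention(x, x, x, Wq, Wk, Wv, Wo)   # shape (n, d)
```

Each head sees the same sequence through its own learned projection, which is what lets the heads specialize in different pairwise relations.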
3 Training protocols
All the training protocols employed in this paper are discussed in this section. For all our baseline models, we employ the conventional NMT training protocol as introduced by Sutskever et al. (2014).
3.1 Transfer learning
To formally define transfer learning, we begin by introducing the nomenclature used in the definition. We define the pair $\mathcal{D} = \{\mathcal{X}, P(X)\}$ as the “domain”, where $\mathcal{X}$ denotes the input feature space and $P(X)$ the marginal probability distribution over $X \in \mathcal{X}$. Given a domain $\mathcal{D}$, we define the learning “task” as the pair $\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}$, where $\mathcal{Y}$ denotes the target feature space and $f(\cdot)$ the target predictive function, for example the conditional distribution $P(Y \mid X)$.

Accorded with a source domain $\mathcal{D}_S$ and a target domain $\mathcal{D}_T$, along with their respective tasks $\mathcal{T}_S$ and $\mathcal{T}_T$, and provided that either $\mathcal{D}_S \neq \mathcal{D}_T$ or $\mathcal{T}_S \neq \mathcal{T}_T$, transfer learning intends to better the predictive function $f_T(\cdot)$ by leveraging the representations learned in domain $\mathcal{D}_S$ and task $\mathcal{T}_S$.

The transfer learning definition above is conditioned on one of two scenarios. The first is that the source and target domains are not equal, $\mathcal{D}_S \neq \mathcal{D}_T$, which implies that either the input feature spaces are dissimilar, $\mathcal{X}_S \neq \mathcal{X}_T$, or the marginal probability distributions are dissimilar, $P(X_S) \neq P(X_T)$. The second is that the source and target tasks are not equal, $\mathcal{T}_S \neq \mathcal{T}_T$, which implies that either the target feature spaces are dissimilar, $\mathcal{Y}_S \neq \mathcal{Y}_T$, or the target predictive functions are dissimilar, $P(Y_S \mid X_S) \neq P(Y_T \mid X_T)$. In simple terms, transfer learning is an ingenious performance improvement technique that trains models by transferring knowledge gained on one task to a related task, especially one with low resources Torrey and Shavlik (2010).
3.2 Multilingual learning
Multilingual NMT is a quintessential method of mapping to and from multiple languages and was first suggested by Dong et al. (2015) with a one-to-many model. Thereafter, this technique was extended to many-to-many translation contingent on a task-specific encoder-decoder pair (Luong et al., 2015a; Firat et al., 2016b). Afterwards, a single encoder-decoder architecture for performing many-to-many translations was developed (Johnson et al., 2017; Ha et al., 2017) by adding a target-language-specific token at the beginning of each input sequence. In this paper we employ this many-to-many translation technique.
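The target-language-token technique above can be sketched in a few lines. The `<2xx>` tag format is an illustrative assumption: Johnson et al. prepend a comparable token, but the exact string is only a convention.

```python
def tag_source(tokens, target_lang):
    """Prepend a target-language token (e.g. <2zu> for isiZulu) to a
    tokenized source sentence, so that a single shared encoder-decoder
    can be steered toward any of the trained target languages."""
    return [f"<2{target_lang}>"] + list(tokens)
```

For example, `tag_source(["good", "morning"], "zu")` asks the shared model for an isiZulu translation, while the same sentence tagged with `"xh"` requests isiXhosa; no per-pair encoder or decoder is needed.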
Given a training tuple $(t, \mathbf{x}, \mathbf{y})$, where $t$ denotes the target language ID, the multilingual model’s task is to translate the source sentence $\mathbf{x}$ into the target sentence $\mathbf{y}$. It then follows that the model’s objective is to maximize the log-likelihood over all the training sets $\mathcal{D}_{x,y}$ appertaining to all the accessible language pairs $\mathcal{P}$:

$$\mathcal{L}(\theta) = \sum_{(x,\, y) \in \mathcal{P}} \; \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}_{x,y}} \log p(\mathbf{y} \mid \mathbf{x}, t; \theta).$$
The main advantage of a multilingual model is that it leverages the learned representations from individual translation pairs. In other words, the model learns a universal representation space for all the accessible language pairs, which promotes translation of low-resource pairs (Firat et al., 2016a; Gu et al., 2018).
3.3 Zero-shot learning
In addition to facilitating low-resource translations, the universal representation space in multilingual models also facilitates zero-shot learning, an extreme variant of transfer learning. Zero-shot translation was first proposed by Johnson et al. (2017) and Ha et al. (2017), who demonstrated that multilingual models can translate between untrained language pairs by leveraging the universal representations learned during the training process. Adopting the definitions of a task and domain described in Section 3.1, we give a formal definition of zero-shot learning as follows:
Accorded with a source domain $\mathcal{D}_S$ and its respective task $\mathcal{T}_S$, where $\mathcal{X}_S$ and $\mathcal{Y}_S$ denote the input and label feature spaces respectively, the objective of zero-shot learning is to estimate the predictive function $f_T(\cdot)$, or conditional probability distribution $P(Y_T \mid X_T)$, where $\mathcal{X}_T$ is the target task input feature space on which the source model has not been trained.
4 Related work
Notwithstanding the tremendous achievements in the domain of MT, the few publications on MT of African languages bear testament to the limited growth of this area. The earliest notable work on MT of African languages was done by Wilken et al. (2012), who demonstrated on an English-Setswana language pair that phrase-based SMT was a promising technique for translating African languages. Wolff and Kotzé (2014) extended this technique to an E-Z pair, employing isiZulu syllables as their source tokens; this modification proved efficient, improving the results. van Niekerk (2014) employed unsupervised word segmentation coupled with phrase-based SMT to translate English to Afrikaans, Northern Sotho, Tsonga and isiZulu. In their final analysis, the authors found their approach to be efficient only for the Afrikaans and isiZulu datasets, achieving a state-of-the-art BLEU score for isiZulu translations of non-biblical corpora.
Abbott and Martinus (2018) adopted NMT for the translation of English to Setswana, demonstrating that NMT outperformed the previous SMT model of Wilken et al. (2012) in BLEU score. The findings of Abbott and Martinus (2018) prompted opportunities for extending NMT to other African languages. Martinus and Abbott (2019) proposed a benchmark of NMT for translating English to four of South Africa’s official languages, namely isiZulu, Northern Sotho, Setswana and Afrikaans.
Transfer learning has been widely used on non-African languages. Zoph et al. (2016) investigate the effects of transfer learning in a scenario where the source task (or parent model) is based on a high-resource language; this source model is then used to train the target task (or child model). Closely related to the work of Zoph et al. (2016) is that of Nguyen and Chiang (2017), who perform transfer learning on closely related languages. Furthermore, Nguyen and Chiang (2017) use a low-resource language as the source model, which makes their work closely related to this paper, the main difference being that our work is on African languages.
| | E-S | E-Z | E-X | X-Z |
|---|---|---|---|---|
| Sentence count | 77 500 | 30 253 | 128 342 | 125 098 |
| Source token count | 7 588 | 4 649 | 10 657 | 27 144 |
| Target token count | 16 408 | 9 424 | 33 947 | 25 465 |

| | E-S | E-Z | E-X | X-Z |
|---|---|---|---|---|
| Train | 54 250 | 21 177 | 88 192 | 87 570 |
| Valid | 11 625 | 4 538 | 20 075 | 18 765 |
| Test | 11 625 | 4 538 | 20 075 | 18 763 |
Similarly, zero-shot learning and multilingual modeling have largely been applied to non-African languages. For example, both Johnson et al. (2017) and Ha et al. (2017) applied zero-shot learning to languages that are not from Africa, showing that multilingual systems can translate between language pairs that have not been encountered during training. In this work, we seek to leverage these techniques (multilingual and zero-shot learning) on South-Eastern Bantu languages. The applications of zero-shot and multilingual learning in this work are closely related to the work of Lakew et al. (2018) and Johnson et al. (2017), with the major difference being that this work applies these techniques to low-resource South-Eastern Bantu languages.
5 Our data
As in the case of most machine learning problems, this work included data cleaning, a crucial part of developing machine learning algorithms. The cleaning involved decomposing contractions, shortening the sequences, manually verifying some randomly selected alignments and dropping all duplicate pairs to curb data leakage. We also performed subword-level tokenization. The datasets comprise four language pairs, namely E-S, E-X, E-Z and X-Z. Table 1 gives a summary of the data-set sizes and token counts per language pair. We split our data-sets into train, validation and test sets at a ratio of 70:15:15 respectively, as shown in Table 2.
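The duplicate-dropping and length-filtering steps might look as follows in Python; this is a hedged sketch, since the paper does not specify its exact filters, and the `max_len` threshold of 50 tokens is an illustrative assumption.

```python
def clean_parallel_corpus(pairs, max_len=50):
    """Drop exact duplicate sentence pairs (to curb train/test leakage)
    and overlong sentences; first occurrences are kept in order."""
    seen, cleaned = set(), []
    for src, tgt in pairs:
        if (src, tgt) in seen:
            continue  # duplicate pair: would leak between splits
        if len(src.split()) > max_len or len(tgt.split()) > max_len:
            continue  # overlong sentence: shorten the corpus instead
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned
```

Deduplicating before the train/validation/test split matters most: a pair that appears in both train and test would inflate BLEU without reflecting generalization.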
For all our training pairs, we train ten transformer models, each with unique initialization parameters, and take the mean BLEU score along with its standard deviation as the final result. We examine the effects of language similarity when performing transfer learning by using E-X and E-S source models to perform transfer learning on an E-Z task. In this case the similar languages are isiXhosa and isiZulu, while Shona is the distant language, though still a Bantu language like the other two. We start off by training the base (or source) model on a large data-set, for example the E-X pairs. Thereafter we initialize our target model, which is to be trained on a low-resource pair, with the source model. In other words, instead of starting the target model’s training procedure from scratch, we leverage the source model’s weights to initialize the target model without freezing any of the architecture’s layers.
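A minimal sketch of this warm-start initialization, assuming model parameters are stored as name-to-array dictionaries; the function name and dictionary layout are illustrative, not the paper's implementation.

```python
import numpy as np

def init_from_parent(parent_params, child_params):
    """Warm-start the child (target-task) model: copy every parent
    parameter whose name and shape match into the child; everything
    else keeps its fresh initialization. Nothing is frozen -- all
    parameters remain trainable when child training begins."""
    for name, value in parent_params.items():
        if name in child_params and child_params[name].shape == value.shape:
            child_params[name] = value.copy()
    return child_params
```

Shared-shape parameters (e.g. the English source embeddings in an E-X parent and E-Z child) are carried over, while mismatched ones (e.g. a target vocabulary of a different size) start fresh, which mirrors the transfer setup described above.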
The rationale behind this knowledge transfer technique is that, when faced with few training examples, one can make the most of a prior probability distribution over the model space. The base model (trained on the larger data-set) is taken as the anchor point, i.e. the apex of the prior probability distribution over the model space. In setting the base model as the anchor point, the model adapts all the parameters that are useful in both tasks. For example, in the E-S and E-Z tasks, all the source task input language embeddings are carried over to the target task, while the target language embeddings are modified during training.
| Model type | E-Z | E-X | X-Z | E-S | E-Z Gain |
To compare the performance of transfer learning, multilingual learning and zero-shot learning, we train a many-to-many multilingual model that translates the E-X, X-Z and E-Z language pairs. The zero-shot learning model is trained on the E-X and X-Z language pairs.
Our experimental findings are summarized in Table 3. We obtain the highest baseline BLEU score on the X-Z model, followed by the E-X model, with the E-S and E-Z baseline models scoring lower. As expected, the X-Z language pair had the highest score, mainly owing to the vocabulary overlap between the source and target languages, both of which fall under the Nguni sub-class; all three Nguni languages have a lot of vocabulary in common.
From the multilingual results, we note that, though with some loss in performance, the multilingual model closely matches the E-X and X-Z baseline results. This loss is perhaps due to the complexity of learning shared representations across all language pairs. On the other hand, the E-Z pair obtained a BLEU gain, which is probably a consequence of the isiZulu and isiXhosa target languages having significant overlap in vocabulary. As a result, the model’s decoder utilizes the representations learned from the target language with more training examples (isiXhosa in this case) to improve the translations of the language with fewer training examples (isiZulu in this case).
We train two transfer learning models on the E-Z task, one with the E-X model as the source model and the other with the E-S model as the source. The transfer learning results show that both source models improve the BLEU score on the E-Z target task, with the greatest improvement coming from the source task that is more closely related to the target task; in our experiments, the E-X task is more closely related to the E-Z target task. The E-X and E-S parent (or source) models yielded 6.6 and 0.9 BLEU improvements, respectively. These results indicate that the E-X parent model surpassed the E-S parent model by a significant margin: to be precise, the E-X gain was 5.7 BLEU higher than that of the E-S model.
Compared to the multilingual model, the E-X parent model achieved a lower BLEU score, and the E-S parent model a lower score still. These results suggest that, with the necessary data available, many-to-many multilingual modeling is the favourable translation technique, especially in the case of Southern African languages whose sub-classes share a lot of vocabulary, for example the Nguni, Sotho and Tsonga language sub-classes shown in Figure 1.
To perform zero-shot learning, we train our multilingual model on the E-X and X-Z language pairs. Relative to the corresponding baselines, the E-X zero-shot preliminary results indicated a loss in BLEU, and the model likewise showed a loss in BLEU on the X-Z language pair. On the zero-shot E-Z task, we recorded a score approximately 2.0 BLEU greater than the baseline model. Overall, multilingual learning performs better than transfer learning and zero-shot learning. Second to multilingual learning, transfer learning proved to be an efficient technique only when we employed a parent model whose target language is of the same sub-class as the target language of the task of interest. This is evidenced by the zero-shot learning model obtaining a higher BLEU score than the transfer learning model with the E-S model as the parent.
This work examines the opportunities of leveraging low-resource translation techniques in developing translation models for Southern African languages. We focus on English-to-isiZulu (E-Z) as it is the smallest of our corpora, with just over 30,000 sentence pairs. Using multilingual English-isiXhosa-isiZulu learning, we achieve a BLEU score for English-to-isiZulu of 18.6, more than doubling the previous state of the art and yielding significant gains (9.9 BLEU) over the baseline English-to-isiZulu transformer model. Multilingual learning for this dataset outperforms both transfer learning and zero-shot learning, though both of these techniques beat the baseline model, with BLEU score gains of 6.1 and 2.0 respectively.
We further found that transfer learning is a highly effective technique for training low-resource translation models for closely related South-Eastern Bantu languages. Using the English-to-isiXhosa baseline model, transfer learning to isiZulu gave a BLEU score gain of 6.1, while using the English-to-Shona baseline model for transfer learning yielded no statistically significant gain. Since isiXhosa is similar to isiZulu while Shona is quite different, this illustrates the performance gains that can be achieved by exploiting language inter-relationships with transfer learning, a conclusion further emphasised by the fact that zero-shot learning, in which no English-to-isiZulu training data was available, outperformed transfer learning from the English-to-Shona baseline model.
Our greatest appreciation goes to Jade Abbott and Herman Kamper for discussions and the African Institute for Mathematical Sciences (AIMS) and the National Research Foundation of South Africa, for supporting this work with a research grant. We further express our sincere gratitude to the South African Centre for High-Performance Computing (CHPC) for CPU and GPU cluster access.
- Towards neural machine translation for african languages. arXiv preprint arXiv:1811.05467. Cited by: §4.
- Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §1, §2, §2.
- Attention-based models for speech recognition. In Advances in neural information processing systems, pp. 577–585. Cited by: §2.
- English-isizulu/isizulu-english dictionary. NYU Press. Cited by: Figure 1.
- Multi-task learning for multiple language translation. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1723–1732. Cited by: §3.2.
- Ethnologue: languages of the world. twenty-second edition. Note: https://www.ethnologue.com/guides/how-many-languages-endangered[Accessed: 19-September-2019] Cited by: §1.
- Multi-way, multilingual neural machine translation with a shared attention mechanism. arXiv preprint arXiv:1601.01073. Cited by: §3.2.
- Zero-resource translation with multi-lingual neural machine translation. arXiv preprint arXiv:1606.04164. Cited by: §3.2.
- Universal neural machine translation for extremely low resource languages. arXiv preprint arXiv:1802.05368. Cited by: §3.2.
- Effective strategies in zero-shot neural machine translation. arXiv preprint arXiv:1711.07893. Cited by: §3.2, §3.3, §4.
- Spread of cattle led to the loss of matrilineal descent in africa: a coevolutionary analysis. Proceedings of the Royal Society of London. Series B: Biological Sciences 270 (1532), pp. 2425–2433. Cited by: §1.
- Google’s multilingual neural machine translation system: enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5, pp. 339–351. Cited by: §3.2, §3.3, §4.
- Multilingual neural machine translation for low-resource languages. IJCoL. Italian Journal of Computational Linguistics 4 (4-1), pp. 11–25. Cited by: §4.
- Linguanaut Foreign Language Learning, Copyright © 2013 Linguanaut. Note: http://www.linguanaut.com/index.htm[Accessed: 15-September-2019] Cited by: §5.
- Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114. Cited by: §3.2.
- Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025. Cited by: §2.
- N-gram-based machine translation. Computational linguistics 32 (4), pp. 527–549. Cited by: §1.
- Benchmarking neural machine translation for southern african languages. arXiv preprint arXiv:1906.10511. Cited by: §4.
- Transfer learning across low-resource, related languages for neural machine translation. arXiv preprint arXiv:1708.09803. Cited by: §4.
- Omniglot the online encyclopedia of writing systems and languages. Note: https://omniglot.com/language/phrases/zulu.php[Accessed: 10-May-2020] Cited by: §5.
- Masakhane – machine translation for africa. External Links: Cited by: §1.
- OPUS corpus, the open parallel corpus. Note: http://opus.nlpl.eu/[Accessed: 10-September-2019] Cited by: §5.
- Overcoming the curse of sentence length for neural machine translation using automatic segmentation. arXiv preprint arXiv:1409.1257. Cited by: §2.
- Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §1, §2, §2, §3.
- Transfer learning. In Handbook of research on machine learning applications and trends: algorithms, methods, and techniques, pp. 242–264. Cited by: §3.1.
- Exploring unsupervised word segmentation for machine translation in the south african context. In Proceedings of the 2014 PRASA, RobMech and AfLaT International Joint Symposium, pp. 202–206. Cited by: §4.
- Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2, §2.
- Translation. Machine translation of languages 14 (15-23), pp. 10. Cited by: §1.
- Wild Coast Xhosa phrase book. Note: https://www.wildcoast.co.za/xhosa-phrasebook[Accessed: 10-October-2019] Cited by: §5.
- Developing and improving a statistical machine translation system for english to setswana: a linguistically-motivated approach. Twenty-Third Annual Symposium of the Pattern Recognition Association of South Africa, pp. 114. Cited by: §4.
- Experiments with syllable-based zulu-english machine translation. In Proceedings of the 2014 PRASA, RobMech and AfLaT International Joint Symposium, pp. 217–222. Cited by: §4.
- Transfer learning for low-resource neural machine translation. arXiv preprint arXiv:1604.02201. Cited by: §4.