Towards Supervised and Unsupervised Neural Machine Translation Baselines for Nigerian Pidgin

03/27/2020 · Orevaoghene Ahia, et al. · InstaDeep

Nigerian Pidgin is arguably the most widely spoken language in Nigeria. Variants of this language are also spoken across West and Central Africa, making it a very important language. This work aims to establish supervised and unsupervised neural machine translation (NMT) baselines between English and Nigerian Pidgin. We implement and compare NMT models with different tokenization methods, creating a solid foundation for future work.

1 Introduction

Over 500 languages are spoken in Nigeria, but Nigerian Pidgin is the uniting language in the country. An estimated three to five million people use it as a first language in their daily activities, and it serves as a second language for up to 75 million more, about half of the country's population, according to BBC (2017).

The language is considered an informal lingua franca and offers several benefits to the country. As of 2020, an estimated 65% of Nigeria's population has access to the internet according to Statista (2019). However, over 58.4% of the internet's content is in English, while Nigerian languages, such as Igbo, Yoruba and Hausa, account for less than 0.1% of internet content according to W3Techs (2020). For Nigerians to truly harness the advantages the internet offers, it is imperative that English content can be translated into Nigerian languages, and vice versa.

This work is a first attempt at using contemporary neural machine translation (NMT) techniques to perform machine translation for Nigerian Pidgin, establishing solid baselines that will ease and spur future work. We evaluate the performance of supervised and unsupervised NMT models using both word-level tokenization and the subword-level tokenization of Sennrich et al. (2015).

2 Related Work

Some work has been done on developing neural machine translation baselines for African languages. Abbott and Martinus (2018) implemented a transformer model that significantly outperformed existing statistical machine translation architectures on English-to-Setswana translation. Martinus and Abbott (2019) went further, training neural machine translation models from English to five South African languages with two different architectures: convolutional sequence-to-sequence and transformer. Their results showed that neural machine translation models are very promising for African languages.

The only known natural language processing work on any variant of Pidgin English is by Ogueji and Ahia (2019). The authors provided the largest known Nigerian Pidgin English corpus and trained the first translation models between Nigerian Pidgin and English, using unsupervised neural machine translation due to the absence of parallel training data at the time.

3 Methodology

All baseline models were trained using the Transformer architecture of Vaswani et al. (2017). We experiment with both word-level and Byte Pair Encoding (BPE) subword-level tokenization for the supervised models, learning 4,000 BPE tokens following the findings of Martinus and Abbott (2019). For the unsupervised model, we experiment with word-level tokenization only.
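
As a concrete illustration of this preprocessing step, the Python sketch below learns 4,000 BPE merges with the subword-nmt toolkit released alongside Sennrich et al. (2015) and applies them to a sentence. The file names are hypothetical placeholders; the exact preprocessing used here lives in the released repository.

    # Sketch: learn a 4,000-merge BPE model with subword-nmt (Sennrich
    # et al., 2015) and apply it. File names are hypothetical.
    from subword_nmt.learn_bpe import learn_bpe
    from subword_nmt.apply_bpe import BPE

    # Learn the merge operations from the raw training text.
    with open("train.txt", encoding="utf-8") as infile, \
            open("bpe.codes", "w", encoding="utf-8") as outfile:
        learn_bpe(infile, outfile, num_symbols=4000)

    # Segment a sentence into subword units before training/translation.
    with open("bpe.codes", encoding="utf-8") as codes:
        bpe = BPE(codes)
    print(bpe.process_line("How has holy spirit helped the Governing Body ?"))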

3.1 Dataset

The dataset used for the supervised models was obtained from JW300, a large-scale parallel corpus for Machine Translation (MT) by Agić and Vulić (2019). The training set contained 20,214 sentence pairs, while the validation set contained 1,000. Both the supervised and unsupervised models were evaluated on a test set of 2,101 sentences preprocessed by the Masakhane group (https://github.com/masakhane-io/masakhane/tree/master/jw300_utils/test/). The model with the highest test BLEU score is selected as the best.
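
For orientation, the sketch below reproduces only the split sizes reported above. The actual split (and the Masakhane test set) comes from the released repository; the file names and seed here are illustrative assumptions, not the authors' procedure.

    # Illustrative sketch of the reported split sizes (20,214 train /
    # 1,000 validation). File names and the seed are assumptions; the
    # paper's actual split is distributed with its repository.
    import random

    with open("jw300.en", encoding="utf-8") as f_en, \
            open("jw300.pcm", encoding="utf-8") as f_pcm:
        pairs = list(zip(f_en, f_pcm))

    random.seed(0)  # assumption: fixed seed, for reproducibility only
    random.shuffle(pairs)

    val, train = pairs[:1000], pairs[1000:1000 + 20214]
    print(len(train), len(val))  # expect: 20214 1000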

3.2 Models

Unsupervised model training followed Ogueji and Ahia (2019), using a Transformer with 4 encoder and 4 decoder layers and 10 attention heads; the embedding dimension was set to 300.
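
As a shape-level sketch (not the authors' code), these dimensions map onto a vanilla PyTorch Transformer as follows; the feed-forward size and dropout are left at library defaults because the paper does not report them.

    # Shape-level sketch of the unsupervised model's Transformer:
    # 4 encoder and 4 decoder layers, 10 attention heads, 300-dim
    # embeddings. Feed-forward size and dropout are PyTorch defaults
    # (assumptions).
    import torch
    import torch.nn as nn

    model = nn.Transformer(
        d_model=300,           # embedding dimension; divisible by nhead
        nhead=10,              # attention heads
        num_encoder_layers=4,
        num_decoder_layers=4,
    )

    src = torch.rand(35, 8, 300)  # (source length, batch, d_model)
    tgt = torch.rand(20, 8, 300)  # (target length, batch, d_model)
    print(model(src, tgt).shape)  # torch.Size([20, 8, 300])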

Supervised model training was performed with the open-source machine translation toolkit JoeyNMT by Kreutzer et al. (2019). For the byte pair encoding model, the embedding dimension was set to 256, while for word-level tokenization it was set to 300. The Transformer used for the byte pair encoding model had 6 encoder and 6 decoder layers with 4 attention heads. For the word-level model, the encoder and decoder each had 4 layers with 10 attention heads, for fair comparison with the unsupervised model. Each model was trained for 200 epochs on an Amazon EC2 p3.2xlarge instance.
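
These hyperparameters translate into a JoeyNMT configuration roughly like the sketch below, rendered as a Python dict for illustration (JoeyNMT itself reads YAML). Key names follow JoeyNMT's published example configs; values the paper does not state (feed-forward size, dropout) are marked as assumptions.

    # Sketch of the model section of a JoeyNMT config for the BPE
    # supervised model. Layer counts, heads and embedding sizes come
    # from the paper; ff_size and dropout are assumed, and key names
    # follow JoeyNMT's example configs, not the authors' released files.
    bpe_model = {
        "encoder": {
            "type": "transformer",
            "num_layers": 6,
            "num_heads": 4,
            "embeddings": {"embedding_dim": 256},
            "hidden_size": 256,
            "ff_size": 1024,   # assumption: not stated in the paper
            "dropout": 0.1,    # assumption
        },
        "decoder": {
            "type": "transformer",
            "num_layers": 6,
            "num_heads": 4,
            "embeddings": {"embedding_dim": 256},
            "hidden_size": 256,
            "ff_size": 1024,   # assumption
            "dropout": 0.1,    # assumption
        },
    }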

4 Results

4.1 Quantitative

English to Pidgin:

Model                            Test BLEU
Unsupervised (Word-Level)            5.18
Supervised (Word-Level)             17.73
Supervised (Byte Pair Encoding)     24.29
Table 1: BLEU scores (English to Pidgin)

Pidgin to English:

Model                            Test BLEU
Unsupervised (Word-Level)            7.93
Supervised (Word-Level)             24.67
Supervised (Byte Pair Encoding)     13.00
Table 2: BLEU scores (Pidgin to English)

For the English-to-Pidgin models with word-level tokenization, the supervised model outperforms the unsupervised model, achieving a BLEU score of 17.73 versus 5.18. The supervised model trained with byte pair encoding tokenization outperforms both word-level models, achieving a BLEU score of 24.29.

For the Pidgin-to-English models with word-level tokenization, the supervised model again outperforms the unsupervised model, achieving a BLEU score of 24.67 versus 7.93. The supervised model trained with byte pair encoding tokenization achieved a BLEU score of 13.00. Notably, word-level tokenization seems to perform better on Pidgin-to-English translation than on English-to-Pidgin translation.
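
The paper does not name its BLEU implementation; one reasonable way to reproduce such scores is the sacrebleu library, sketched below on a single sentence pair from the Appendix.

    # Hedged sketch: corpus BLEU with the sacrebleu library. The paper
    # reports BLEU but does not name its scoring tool, so treat this as
    # one plausible way to compute comparable numbers.
    import sacrebleu

    hypotheses = ["How holy spirit take help Governing Body ?"]
    references = [["How holy spirit don take help Governing Body ?"]]

    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    print(f"BLEU = {bleu.score:.2f}")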

4.2 Qualitative

When analyzed by L1 speakers, the translations were rated as high quality. In particular, the unsupervised model produced many translations that did not exactly match the reference translation but conveyed the same meaning. Further analysis and example translations are given in the Appendix.

5 Conclusion

There is an increasing need to apply neural machine translation techniques to African languages. Given the low-resource nature of these languages, such techniques can help build useful translation models that could aid the preservation and discoverability of these languages.

Future work includes establishing qualitative metrics and the use of pre-trained models to bolster these translation models.

Code, data, trained models and result translations are available at https://github.com/orevaoghene/pidgin-baseline.

Acknowledgments

Special thanks to the Masakhane group for catalysing this work.

References

  • J. Z. Abbott and L. Martinus (2018) Towards neural machine translation for African languages. NeurIPS Workshop on Machine Learning for the Developing World.
  • Ž. Agić and I. Vulić (2019) JW300: a wide-coverage parallel corpus for low-resource languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy.
  • BBC (2017) BBC starts Pidgin digital service for West Africa audiences. https://www.bbc.com/news/world-africa-40975399
  • J. Kreutzer, J. Bastings, and S. Riezler (2019) Joey NMT: a minimalist NMT toolkit for novices. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, Hong Kong, China.
  • L. Martinus and J. Z. Abbott (2019) A focus on neural machine translation for African languages. arXiv preprint.
  • K. Ogueji and O. Ahia (2019) PidginUNMT: unsupervised neural machine translation from West African Pidgin to English. arXiv preprint arXiv:1912.03444.
  • R. Sennrich, B. Haddow, and A. Birch (2015) Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
  • Statista (2019) Internet user penetration in Nigeria from 2017 to 2023. https://www.statista.com/statistics/484918/internet-user-reach-nigeria/
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
  • W3Techs (2020) Usage of content languages for websites. https://w3techs.com/technologies/overview/content_language/all

Appendix A Appendix

A.1 English to Pidgin translations

Unsupervised (Word-Level):

Source How has holy spirit helped the Governing Body ?
Reference How holy spirit don take help Governing Body ?
Model Translation ibrahim c how dey do word wey dey guide the governing body .
Source What can we learn from Jesus ’ counsel ?
Reference Wetin we fit learn from this advice ?
Model Translation wetin we don learn from jesus ’ counsel
Source One student began coming to the kingdom hall .
Reference One of my student come start to come kingdom hall .
Model Translation one student wey begin dey come di kingdom hall .
Table 3: Unsupervised (Word-Level) Results from English to Nigerian Pidgin

Supervised (Word-Level):

Source How has holy spirit helped the Governing Body ?
Reference How holy spirit don take help Governing Body ?
Model Translation How holy spirit take help Governing Body ?
Source What can we learn from Jesus ’ counsel ?
Reference Wetin we fit learn from this advice ?
Model Translation Wetin we fit learn from Jesus example ?
Source One student began coming to the kingdom hall .
Reference One of my student come start to come kingdom hall .
Model Translation One day , e start to Kingdom Hall .
Table 4: Supervised (Word-Level) Results from English to Nigerian Pidgin

Supervised (Byte Pair Encoding):

Source How has holy spirit helped the Governing Body ?
Reference How holy spirit don take help Governing Body ?
Model Translation How holy spirit take help Governing Body ?
Source What can we learn from Jesus ’ counsel ?
Reference Wetin we fit learn from this advice ?
Model Translation Wetin we fit learn from Jesus example ?
Source One student began coming to the kingdom hall .
Reference One of my student come start to come kingdom hall .
Model Translation One woman come start to dey go meeting .
Table 5: Supervised (Byte Pair Encoding) Results from English to Nigerian Pidgin
Discussion:

The following insights can be drawn from the example translations shown in the tables above:

  1. The unsupervised model performed poorly on some simple translation examples, such as the first example.

  2. All translation models produce hypotheses that are grammatically and qualitatively correct but do not exactly match the reference translation, as in the second example.

  3. Surprisingly, the unsupervised model performs better than both supervised models on some relatively simple translation examples; the third example is a typical case.

  4. The supervised translation models seem to perform better on longer examples than the unsupervised model.

A.2 Pidgin to English translations

Unsupervised (Word-Level):

Source How holy spirit don take help Governing Body ?
Reference How has holy spirit helped the Governing Body ?
Model Translation how holy spirit is to help governing body .
Source Wetin we fit learn from this advice ?
Reference What can we learn from Jesus ’ counsel ?
Model Translation what should we learn from this advice ?
Table 6: Unsupervised (Word-Level) Results from Nigerian Pidgin to English

Supervised (Word-Level):

Source How holy spirit don take help Governing Body ?
Reference How has holy spirit helped the Governing Body ?
Model Translation how has holy spirit the governing body ?
Source Wetin we fit learn from this advice ?
Reference What can we learn from Jesus ’ counsel ?
Model Translation What can we learn from this ?
Table 7: Supervised (Word-Level) Results from Nigerian Pidgin to English

Supervised (Byte Pair Encoding):

Source How holy spirit don take help Governing Body ?
Reference How has holy spirit helped the Governing Body ?
Model Translation 5 , 6 . ( a ) how did holy spirit help the governing body ?
Source Wetin we fit learn from this advice ?
Reference What can we learn from Jesus ’ counsel ?
Model Translation Wtin can we learn from this advice ?
Table 8: Supervised (Byte Pair Encoding) Results from Nigerian Pidgin to English