HausaMT v1.0: Towards English-Hausa Neural Machine Translation

06/09/2020 · Adewale Akinfaderin

Neural Machine Translation (NMT) for low-resource languages suffers from low performance because of the lack of large amounts of parallel data and language diversity. To help ameliorate this problem, we built a baseline model for English-Hausa machine translation, which is considered a low-resource language task. Hausa is the second largest Afro-Asiatic language in the world after Arabic, and it is the third most widely used trade language across a large swath of West African countries, after English and French. In this paper, we curated different datasets containing a Hausa-English parallel corpus for our translation task. We trained baseline models and evaluated their performance using recurrent and Transformer encoder-decoder architectures with two tokenization approaches: standard word-level tokenization and Byte Pair Encoding (BPE) subword tokenization.


1 Introduction

Hausa is a language spoken in the western part of Africa. It belongs to the Afro-Asiatic phylum and is the second most spoken native language on the continent, after Swahili. The language is spoken by more than 40 million people as a first language, and about 15 million people use it as a second or third language. Most of its speakers are concentrated in Nigeria, Niger and Chad, resulting in both anglophone and francophone influences [Sabiu et al.2018, Eberhard et al.2019]. Our work on curating datasets and creating an evaluation benchmark for English-Hausa Neural Machine Translation (NMT) is inspired by the socio-linguistic facts of the Hausa language. Hausa has been referred to as the largest internal political unit in Africa. There has been extensive linguistic academic research on Hausa, and the language benefits from trans-border communication in the West African Sahel belt and the availability of international radio stations such as BBC Hausa and Voice of America Hausa [Odoje2013].

The exponential growth of social media platforms has eased communication among users. However, these advances in technological adoption have also heightened the need to translate between human languages. In low-resource countries, language inequality can be ameliorated by using machine translation to bridge gaps in technological, political and socio-economic advancement [Odoje2016]. The recent successes of NMT over Phrase-Based Statistical Machine Translation (PBSMT) under high-resource data conditions can be leveraged to explore best practices, data curation and evaluation benchmarks for low-resource NMT tasks [Bentivogli et al.2016, Isabelle et al.2017]. Using the JW300, Tanzil, Tatoeba and Wikimedia public datasets, we trained and evaluated baseline NMT models for the Hausa language.

4th Widening NLP Workshop, Annual Meeting of the Association for Computational Linguistics, ACL 2020

2 Related Work

Hausa Word Embeddings:

Researchers have recently curated datasets and trained word embedding models for the Hausa language. The results from these trained models have been promising, with an approximately 300% increase in prediction accuracy over other baseline models [Abdulmumin & Galadanci2019].

Masakhane:

Due to the linguistic complexity and morphological properties of languages native to the African continent, abstractions borrowed from successful resource-rich cross-lingual machine translation tasks often fail for low-resource NMT tasks. The Masakhane project was created to bridge this gap by facilitating open-source NMT research efforts for African languages [Orife et al.2020].

3 Dataset Description

For the HausaMT task, we used the JW300, Tanzil, Tatoeba and Wikimedia public datasets. The JW300 dataset is a crawl of the parallel data available on the Jehovah's Witnesses' website. Most of the data come from the magazines Awake! and Watchtower, and they cover a diverse range of societal topics in a religious context [Agić & Vulić2019]. The Tatoeba database is a collection of parallel sentences in 330 languages [Raine2018]. The dataset is crowdsourced and published under a Creative Commons Attribution 2.0 license. The Tanzil dataset is a multilingual collection aimed at producing a highly verified multi-text of the Quran [Zarrabi-Zadeh et al.2007]. The Wikimedia dataset consists of parallel sentence pairs extracted and filtered from noisy parallel and comparable Wikipedia corpora [Wolk & Marasek2014]. For this work, we trained on two dataset configurations: 1) JW300 alone, as our baseline, and 2) all the datasets combined. The number of tokens, number of sentences and statistical properties of the datasets are shown in Table 1.

Dataset   English Length (Mean ± Std)   Hausa Length (Mean ± Std)   English Tokens   Hausa Tokens   Sentences
JW300     18.11 ± 10.53                 20.14 ± 11.57               4,051,322        4,506,787      223,723
All*      19.71 ± 24.31                 21.28 ± 24.60               6,919,805        7,471,256      351,024

Table 1: Dataset summary. *All: combination of the JW300, Tanzil, Tatoeba and Wikimedia datasets.
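
As an illustration of how the corpus statistics in Table 1 can be derived, the short sketch below computes the sentence count, token count and sentence-length statistics for one side of a parallel corpus. The file names are hypothetical and simple whitespace tokenization is an assumption; it is not necessarily the tokenizer behind the reported numbers.

    from statistics import mean, stdev

    def corpus_stats(path):
        """Sentence count, token count, and length mean/std for one side of a parallel corpus."""
        lengths = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                lengths.append(len(line.split()))  # whitespace tokenization as a rough proxy
        return {
            "sentences": len(lengths),
            "tokens": sum(lengths),
            "mean_len": round(mean(lengths), 2),
            "std_len": round(stdev(lengths), 2),
        }

    # Hypothetical file names for the two sides of the JW300 parallel corpus.
    print(corpus_stats("jw300.en-ha.en"))
    print(corpus_stats("jw300.en-ha.ha"))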

4 Experiments and Results

For our baseline model, we trained a recurrent model with Long Short-Term Memory (LSTM) networks as the encoder and decoder, using the Luong attention mechanism [Luong et al.2015]. To achieve an improved benchmark, we also trained a Transformer encoder-decoder model. The Transformer is based entirely on attention mechanisms, and its training time is significantly shorter than that of architectures based on convolutional or recurrent networks [Vaswani et al.2017]. For both the recurrent and Transformer-based architectures, we used an embedding size of 256, 256 hidden units, a batch size of 4096, and an encoder and decoder depth of 6.
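
To make the stated hyperparameters concrete, the PyTorch sketch below instantiates a Transformer encoder-decoder with an embedding size of 256 and six encoder and decoder layers. The number of attention heads, the feed-forward size and the vocabulary sizes are assumptions not reported in the paper, and this is not the exact Joey NMT model definition.

    import torch
    import torch.nn as nn

    SRC_VOCAB, TRG_VOCAB = 4000, 4000    # assumed vocabulary sizes (e.g. the BPE vocabulary)
    D_MODEL = 256                        # embedding/hidden size reported in the paper

    src_embed = nn.Embedding(SRC_VOCAB, D_MODEL)
    trg_embed = nn.Embedding(TRG_VOCAB, D_MODEL)
    transformer = nn.Transformer(
        d_model=D_MODEL,
        nhead=8,                  # assumed; not reported in the paper
        num_encoder_layers=6,     # encoder depth of 6
        num_decoder_layers=6,     # decoder depth of 6
        dim_feedforward=1024,     # assumed; not reported in the paper
        batch_first=True,
    )
    generator = nn.Linear(D_MODEL, TRG_VOCAB)

    # Dummy batch of token ids just to check that the shapes flow through
    # (attention masks are omitted here for brevity).
    src = torch.randint(0, SRC_VOCAB, (4, 20))   # (batch, source length)
    trg = torch.randint(0, TRG_VOCAB, (4, 18))   # (batch, target length)
    logits = generator(transformer(src_embed(src), trg_embed(trg)))  # (4, 18, TRG_VOCAB)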

Dataset   Model         BPE (dev)   BPE (test)   Word (dev)   Word (test)
JW300     Recurrent     20.06       19.39        25.36        24.75
JW300     Transformer   21.33       20.38        28.71        28.06
All       Recurrent     31.89       33.48        40.78        42.29
All       Transformer   31.91       32.42        44.42        45.98

Table 2: BLEU scores for BPE and word-level tokenization. Best scores of the Transformer model against the Recurrent are highlighted in bold.

To preprocess the parallel corpus, we used standard word-level tokenization and Byte Pair Encoding (BPE) [Gage1994]. BPE is a subword tokenization approach that has become a successful choice in translation tasks. The model was trained with the 4,000 BPE tokens used in a recent machine translation study on South African languages [Martinus & Abbott2019]. To train our models, we used Joey NMT, an open-source minimalist toolkit based on PyTorch [Kreutzer et al.2019]. The models were trained on a Tesla P100 GPU. Training for the baseline and the repeated tasks (across datasets and tokenization types) took between 5 and 9 hours per run.
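
The snippet below is a minimal sketch of learning and applying a 4,000-token BPE model. It uses the sentencepiece library as an assumption, since the paper does not name the BPE tool, and the file names are hypothetical.

    import sentencepiece as spm

    # Learn a BPE model with a 4,000-token vocabulary on the training text
    # (hypothetical file containing the Hausa training sentences).
    spm.SentencePieceTrainer.train(
        input="train.ha",
        model_prefix="bpe4k",
        vocab_size=4000,
        model_type="bpe",
        character_coverage=1.0,
    )

    # Apply the learned model to segment a sentence into subword units.
    sp = spm.SentencePieceProcessor(model_file="bpe4k.model")
    print(sp.encode("Me ya sa hakan yake da muhimmanci?", out_type=str))

Training itself is then run through Joey NMT's command-line interface with a configuration file describing the data, model and training hyperparameters; the exact configurations used here are available in the repository linked in Section 5.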

5 Conclusion and Future Work

Evaluating the models on the test set, we observed that word-level tokenization outperforms BPE by a factor of approximately 1.27 to 1.42 in BLEU score (Table 2). The quality of the English-to-Hausa translations using both word-level and BPE subword tokenization was rated positively by first-language speakers. Table 3 shows some example translations.

Source:     This is normal, because they themselves have not been anointed.
Reference:  Hakan ba abin mamaki ba ne don ba a shafa su da ruhu mai tsarki ba.
Hypothesis: Wannan ba daidai ba ne, domin ba a shafe su ba.

Source:     A white-haired man in a frock coat appears on screen.
Reference:  Wani mutum mai furfura ya bayyana da dogon kwat a majigin.
Hypothesis: Wani mutum mai suna da wani mutum mai suna da ke cikin mota yana da nisa a cikin kabari

Source:     Why is that of vital importance?
Reference:  Me ya sa hakan yake da muhimmanci?
Hypothesis: Me ya sa wannan yake da muhimmanci?

Table 3: Example translations.
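
Joey NMT reports BLEU during validation; as an independent check, scores like those in Table 2 can be recomputed from hypothesis and reference files with the sacrebleu library. The sketch below reuses the last pair from Table 3 purely for illustration; evaluating a single sentence is of course not meaningful on its own.

    import sacrebleu

    # System outputs and the corresponding references (one reference stream),
    # here just the final example from Table 3.
    hypotheses = ["Me ya sa wannan yake da muhimmanci?"]
    references = [["Me ya sa hakan yake da muhimmanci?"]]

    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    print(f"BLEU = {bleu.score:.2f}")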

A significant portion of both the training and test datasets is from the JW300 parallel data, which consists of religious texts. We acknowledge that to reach a viable level of real-world translation quality, we need to evaluate our model on "general" Hausa data. However, parallel data for other out-of-domain areas does not exist. A high-yield avenue for future work is to evaluate on out-of-domain English texts and crowd-source first-language (L1) speakers to manually assess the quality of the translations by post-editing them. The post-edited translations can then be used as references to compute the evaluation metric. Other future work includes carrying out an empirical study to explore the effect of word-level and subword tokenizations. Methods such as linguistically motivated vocabulary reduction (LMVR) have been shown to perform better for languages in the Afro-Asiatic family [Ataman & Federico2018]. The datasets, pre-trained models, and configurations are available on GitHub: https://github.com/WalePhenomenon/Hausa-NMT.

Acknowledgements

The author would like to thank Gabriel Idakwo for the qualitative analysis of the translations.

References

  • [Eberhard et al.2019] David M. Eberhard, Gary F. Simons, and Charles D. Fennig (eds.). 2019. Ethnologue: Languages of the World, twenty-second edition. URL: http://www.ethnologue.com.
  • [Sabiu et al.2018] Ibrahim T. Sabiu, Fakhrul A. Zainol, and Mohammed S. Abdullahi. 2018. Hausa People of Northern Nigeria and their Development. Asian People Journal (APJ), 1(1), pp. 179-189.
  • [Odoje2013] Clement Odoje. 2013. Language Inequality: Machine Translation as the Bridging Bridge for African Languages. 4, 01.
  • [Odoje2016] Clement Odoje. 2016. The Peculiar Challenges of SMT to African Languages. ICT, Globalisation and the Study of Languages and Linguistics in Africa, pp. 223.
  • [Bentivogli et al.2016] Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo, and Marcello Federico. 2016. Neural versus Phrase-Based Machine Translation Quality: a Case Study. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 257-267, Austin, Texas.
  • [Isabelle et al.2017] Pierre Isabelle, Colin Cherry, and George Foster. 2017. A Challenge Set Approach to Evaluating Machine Translation. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2486-2496, Copenhagen, Denmark.
  • [Abdulmumin & Galadanci2019] Idris Abdulmumin and Bashir S. Galadanci. 2019. hauWE: Hausa Words Embedding for Natural Language Processing. 2019 2nd International Conference of the IEEE Nigeria Computer Chapter, pp. 1-6, Zaria, Nigeria.
  • [Orife et al.2020] Iroro F. O. Orife, Julia Kreutzer, Blessing Sibanda, Kathleen Siminyu, Laura Martinus, Jamiil Toure Ali, Jade Abbott, Vukosi Marivate, Salomon Kabongo, Musie Meressa, Espoir Murhabazi, Orevaoghene Ahia, Elan van Biljon, Arshath Ramkilowan, Adewale Akinfaderin, Alp Öktem, Wole Akin, Ghollah Kioko, Kevin Degila, Herman Kamper, Bonaventure Dossou, Chris Emezue, Kelechi Ogueji, and Abdallah Bashir. 2020. Masakhane – Machine Translation For Africa. To appear in the Proceedings of the AfricaNLP Workshop, International Conference on Learning Representations (ICLR 2020).
  • [Agić & Vulić2019] Željko Agić and Ivan Vulić. 2019. JW300: A Wide-Coverage Parallel Corpus for Low-Resource Languages. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3204-3210, Florence, Italy.
  • [Raine2018] Paul Raine. 2018. Building Sentences with Web 2.0 and the Tatoeba Database. Accents Asia, 10(2), pp. 2-7.
  • [Zarrabi-Zadeh et al.2007] Hamid Zarrabi-Zadeh, Abbas Ahmadi, Morteza Bagheri, Yousef Daneshvar, Mohammad Derakhshani, Mohammad Fakharzadeh, Ehsan Fathi, Yusof Ganji, Mojtaba Haghighi, Nasser Lashgarian, Zahra Mousavian, Mohsen Saboorian, Yaser Shanjani, Mohammad-Reza Nikseresht, and Mahdi Mousavian. 2007. Tanzil Project. URL: http://tanzil.net/docs/home.
  • [Wolk & Marasek2014] Krzysztof Wolk and Krzysztof Marasek. 2014. Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs. Procedia Technology, Volume 18, pp. 126-132.
  • [Vaswani et al.2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
  • [Gage1994] Philip Gage. 1994. A New Algorithm for Data Compression. C Users Journal, 12(2), pp. 23-38.
  • [Martinus & Abbott2019] Laura Martinus and Jade Abbott. 2019. A Focus on Neural Machine Translation for African Languages. CoRR, abs/1906.05685. URL: http://arxiv.org/abs/1906.05685.
  • [Kreutzer et al.2019] Julia Kreutzer, Joost Bastings, and Stefan Riezler. 2019. Joey NMT: A Minimalist NMT Toolkit for Novices. Proceedings of the 2019 EMNLP and the 9th IJCNLP (System Demonstrations), pp. 109-114, Hong Kong, China.
  • [Ataman & Federico2018] Duygu Ataman and Marcello Federico. 2018. An Evaluation of Two Vocabulary Reduction Methods for Neural Machine Translation. Proceedings of AMTA 2018, vol. 1: MT Research Track, pp. 97-110, Boston, MA.
  • [Luong et al.2015] Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412-1421, Lisbon, Portugal.