Masakhane – Machine Translation For Africa

03/13/2020 ∙ by Iroro Orife, et al.

Africa has over 2000 languages. Despite this, African languages account for a small portion of available resources and publications in Natural Language Processing (NLP). This is due to multiple factors, including: a lack of focus from government and funding, discoverability, a lack of community, sheer language complexity, difficulty in reproducing papers and no benchmarks to compare techniques. To begin to address the identified problems, MASAKHANE, an open-source, continent-wide, distributed, online research effort for machine translation for African languages, was founded. In this paper, we discuss our methodology for building the community and spurring research from the African continent, as well as outline the success of the community in terms of addressing the identified problems affecting African NLP.

1 The State of African NLP

2144 of all 7111 (30.15%) living languages today are African languages (Eberhard et al., 2019), yet only a small portion of linguistic resources for NLP research are built for African languages. As a result, there are only a few NLP publications: across all ACL conferences in 2019, only 5 out of 2695 (0.19%) author affiliations were based in Africa (Caines, 2019). This stark contrast between linguistic richness and poor representation of African languages in NLP is caused by multiple factors.

First of all, African societies do not see hope for African languages being accepted as primary means of communication (Alexander, 2009). As a result, few efforts to fund NLP or translation for African languages exist, despite the potential impact. This lack of focus has had a ripple effect.

The few existing resources are not easily discoverable: they are published in closed journals or non-indexed local conferences, or remain undigitized, surviving only in private collections (Mesthrie, 1995). This opaqueness impedes researchers’ ability to reproduce and build upon existing results, and to develop, compete on and advance public benchmarks (Martinus and Abbott, 2019).

African researchers are disproportionately affected by socio-economic factors, and are often hindered by visa issues (Johnson, 2019) and costs of flights from and within Africa (Hattem, 2017). They are distributed and disconnected on the continent, and rarely have the opportunity to commune, collaborate and share.

Furthermore, African languages are of high linguistic complexity and variety, with diverse morphologies and phonologies, including lexical and grammatical tonal patterns, and many are practiced within multilingual societies with frequent code-switching (Ndubuisi-Obi et al., 2019; Bird, 1999; Gibbon et al., 2006). Because of this complexity, cross-lingual generalization from success in languages like English is not guaranteed.

2 Contribution

Founded at the Deep Learning Indaba 2019, Masakhane constitutes an open-source, continent-wide, distributed, online research effort for machine translation for African languages. Its goals are threefold:

  1. For Africa: To build a community of NLP researchers, connect and grow it, spurring and sharing further research, to enable language preservation and increase its global visibility and relevance.

  2. For NLP researchers: To build data sets and tools to facilitate NLP research on African languages, and to pose new research problems to enrich the NLP research landscape.

  3. For the global research community: To discover best practices for distributed research, to be applied by other emerging research communities.

3 Methodology and Results

(a) Origin of Participants & Focus Language.
(b) Highest Level of Education.
(c) Occupation.
Figure 1: Participants with African origin are represented by blue markers in (a), indigenous areas that are covered by the languages of current benchmarks in green, benchmarks in progress in dark grey, and countries where those languages are spoken in light grey. Education (b) and occupation (c) of a subset of 37 participants as indicated in a voluntary survey in February 2020.

Masakhane’s strategy is to offer barrier-free open access to first hands-on NLP experiences with African languages, fighting the above-mentioned opaqueness. With an easy-to-use open source platform, it allows individuals to train neural machine translation (NMT) models on a parallel corpus for a language of their choice, and share the results with an online community. The online community is based on weekly meetings, an active Slack workspace, and a GitHub repository (github.com/masakhane-io), so that members can support each other and connect despite geographical distances. No academic prerequisites are required for participation, since tertiary education enrolments are minimal in sub-Saharan Africa (Jowi et al., 2018).

A Jupyter Notebook features documented data preparation, model configuration, training and evaluation. It runs on Google Colab with a single (free) GPU for a limited number of hours, so that participants do not require expensive hardware. The NMT models are built using Joey NMT (Kreutzer et al., 2019), which comes with beginner-friendly documentation. Participants submit and publish their data, code and results for training on their language to improve reproducibility and discoverability. To lower the barrier of data collection, the JW300 multilingual dataset (Agić and Vulić, 2019) with parallel corpora for English to 101 African languages is integrated into the notebook. With the goal of improving translation quality by transfer learning across languages in the future, global test sets with English sources are extracted from JW300, and excluded from training data for any language pair to avoid potential data leakage for cross-lingual transfer.
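The exclusion of global test sentences from training data can be illustrated with a minimal sketch. It is not the notebook's actual implementation: the function names `normalize` and `remove_test_overlap`, the whitespace and case normalization, and the placeholder sentence pairs are all assumptions made for illustration.

```python
from typing import Iterable, List, Set, Tuple


def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match.

    This normalization is an assumption; exact string matching would also work.
    """
    return " ".join(sentence.lower().split())


def remove_test_overlap(
    train_pairs: Iterable[Tuple[str, str]],
    global_test_sources: Iterable[str],
) -> List[Tuple[str, str]]:
    """Drop every training pair whose English source is in the global test set."""
    held_out: Set[str] = {normalize(s) for s in global_test_sources}
    return [
        (src, trg)
        for src, trg in train_pairs
        if normalize(src) not in held_out
    ]


# Hypothetical toy data: English sources paired with placeholder targets.
train_pairs = [
    ("Good morning.", "trg-1"),
    ("Thank you very much.", "trg-2"),
    ("good   morning.", "trg-3"),  # same source up to case and spacing
]
global_test_sources = ["Good morning."]

filtered = remove_test_overlap(train_pairs, global_test_sources)
print(filtered)
```

Because the held-out set is keyed on the English source alone, the same filter can be applied to every English-to-target language pair, which is what keeps a test sentence from reaching a model indirectly through another language's training data.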

As of February 14, 2020, the Masakhane community consists of 144 participants from 17 African countries with diverse educations and occupations (Figure 1), and 2 countries outside Africa (USA and Germany). So far, 30 translation results for 28 African languages have been published by 25 contributors on GitHub.

4 Future Roadmap

Masakhane aims to continue to grow and facilitate engagement within the community, especially helping inactive users contribute benchmarks and fostering mentoring relations. In the next year, the project will expand to different NLP tasks beyond NMT in order to reach a broader audience. Qualitative analysis on model performance as well as investigations of automatic evaluation metrics will spur healthy competition on results.

Masakhane will also provide notebooks for transfer and un/self-supervised learning to push translation quality. In terms of data collection, the size and domain of global test sets will be expanded.

References

  • Ž. Agić and I. Vulić (2019) JW300: a wide-coverage parallel corpus for low-resource languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy.
  • N. Alexander (2009) Evolving African approaches to the management of linguistic diversity: the ACALAN project. Language Matters 40 (2), pp. 117–132.
  • S. Bird (1999) Strategies for representing tone in African writing systems. Written Language & Literacy 2 (1), pp. 1–44.
  • A. Caines (2019) The geographic diversity of NLP conferences.
  • D. M. Eberhard, G. F. Simons, and C. D. Fennig (2019) Ethnologue: languages of the world. Twenty-second edition. Dallas, Texas: SIL International.
  • D. Gibbon, E. Urua, and M. Ekpenyong (2006) Morphotonology for TTS in Niger-Congo languages. In Speech Prosody.
  • J. Hattem (2017) African air travel is awful. Why?
  • K. Johnson (2019) Canada is denying travel visas to AI researchers headed to NeurIPS - again. VentureBeat.
  • J. Jowi, C. O. Ong’ondo, and M. Nega (2018) Building PhD capacity in sub-Saharan Africa. British Council.
  • J. Kreutzer, J. Bastings, and S. Riezler (2019) Joey NMT: a minimalist NMT toolkit for novices. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, Hong Kong, China.
  • L. Martinus and J. Z. Abbott (2019) A focus on neural machine translation for African languages. arXiv preprint arXiv:1906.05685.
  • R. Mesthrie (1995) Language and social history: studies in South African sociolinguistics. New Africa Books.
  • I. Ndubuisi-Obi, S. Ghosh, and D. Jurgens (2019) Wetin dey with these comments? Modeling sociolinguistic factors affecting code-switching behavior in Nigerian online discussions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 6204–6214.
  • K. Strange (2008) Authorship: why not just toss a coin? American Journal of Physiology-Cell Physiology 295 (3), pp. C567–C575. PMID: 18776156.