FFR V1.0: Fon-French Neural Machine Translation

03/26/2020 ∙ by Bonaventure F. P. Dossou, et al.

Africa has the highest linguistic diversity in the world. On account of the importance of language to communication, and the importance of reliable, powerful and accurate machine translation models in modern inter-cultural communication, there have been (and still are) efforts to create state-of-the-art translation models for the many African languages. However, the low-resource status and the diacritical and tonal complexities of African languages are major issues facing African NLP today. FFR is a major step towards creating a robust translation model from Fon, a very low-resource and tonal language, to French, for research and public use. In this paper, we describe our pilot project: the creation of a large, growing corpus for Fon-to-French translation and our FFR v1.0 model, trained on this dataset. The dataset and model are made publicly available.


1 Introduction

Africa has the highest linguistic diversity in the world (Bendor-Samuel, 2017). On account of the importance of language to communication, and the importance of reliable, powerful and accurate machine translation models in modern inter-cultural communication, there have been (and still are) efforts to create state-of-the-art translation models for the many African languages. However, the low-resource status and the diacritical and tonal complexities of African languages are major issues facing African NLP today. FFR is a major step towards creating a robust translation model from Fon, a very low-resource and tonal language, to French, for research and public use. In this paper, we describe our pilot project: the creation of a large, growing corpus for Fon-to-French translation and our FFR v1.0 model, trained on this dataset. The dataset and model are made publicly available.

2 Motivation

The selection of the Fon language for this pilot project is guided by two facts: not only is Fon one of the major languages spoken by the natives of the Benin Republic, it also belongs to, and shares tonal and analytical similarities with, the Niger-Congo family, the largest group of African languages, comprising widely spoken languages like Igbo, Hausa, Yoruba and Swahili (Greenberg, 1948). French was chosen because it is the European language spoken officially by the natives of Fon, and both languages contain diacritics. Therefore, a machine translation model that succeeds for Fon-French can be trained (with transfer learning) on other pairs of Niger-Congo African languages and European languages.

3 Related works

It is important to note that the move to promote African NLP has been going on for quite a while, with the advent of organizations and online communities like Deep Learning Indaba, Black in AI, and, most inspirational to this paper, Masakhane, an online community focused on connecting and fostering machine translation (MT) researchers working on African languages, who have all made meaningful contributions to MT for several African languages. Kevin Degila, a member of the community, worked on an English-Fon translation model.

4 Project FFR v1.0

4.1 FFR Dataset

We created the FFR Dataset as a project to compile a large, growing corpus of cleaned Fon-French sentences for machine translation and other NLP research projects. As training data is crucial to the high performance of a machine learning model, we hope to facilitate future research on the Fon language by releasing our data for research purposes. The major sources for the creation of the FFR Dataset were:


  1. JW300 - http://opus.nlpl.eu/JW300.php (24.60% of FFR Dataset)

  2. BeninLanguages - https://beninlangues.com/ (75.40% of FFR Dataset)

JW300 is a parallel corpus of over 300 languages with around 100 thousand parallel sentences per language pair on average. BeninLanguages contains (in French and Fon) vocabulary words, short expressions, small sentences, complex sentences, proverbs and bible verses: Genesis 1 - Psalm 79.

The tabular analysis shown in Table 1 below gives an idea of the range of sentence lengths in the FFR Dataset. It shows that the FFR Dataset is mostly made up of very short sentences containing 1-5 words, but at the same time contains some medium to long sentences, thus achieving the intended variety of the dataset.

Sentence length                   Fon     French
Very short sentences (1-5 words)  64301   64255
Short sentences (6-10 words)      13848   17183
Medium sentences (11-30 words)    29113   29857
Long sentences (31+ words)        9767    5734
Table 1: Analysis of sentences in the FFR Dataset

The FFR Dataset and its official documentation, with more information on the dataset that could not be detailed here due to the page limit, can be found at https://github.com/bonaventuredossou/ffr-v1/tree/master/FFR-Dataset. The FFR Dataset currently contains 117,029 parallel Fon-French words and sentences, which we used to train the FFR v1.0 model.
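The length buckets reported in Table 1 amount to a simple word-count pass over each side of the corpus. The sketch below is a minimal illustration of that analysis; the bucket labels, function names and toy sentences are our own, not part of the dataset release.

```python
def length_bucket(sentence):
    """Assign a sentence to one of the four length buckets used in Table 1."""
    n = len(sentence.split())
    if n <= 5:
        return "very short (1-5)"
    if n <= 10:
        return "short (6-10)"
    if n <= 30:
        return "medium (11-30)"
    return "long (31+)"

def bucket_counts(sentences):
    """Count how many sentences fall into each bucket."""
    counts = {}
    for s in sentences:
        b = length_bucket(s)
        counts[b] = counts.get(b, 0) + 1
    return counts

# Toy Fon examples (taken from Table 4); both fall in the very-short bucket.
fon = ["yí bo wa", "sá amasín dŏ wŭ"]
print(bucket_counts(fon))
```

Running the same pass over the Fon and French columns separately produces the two count columns of Table 1.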

4.2 FFR v1.0 Model

4.2.1 Data Preprocessing

Our earlier pre-analysis of Fon sentences revealed that different accents on the same words change their meanings, as shown in sentences #2 and #3 of Table 4. This made it necessary to design our own strategy for encoding the diacritics of the Fon language: we encoded the words with their different accents, instead of the default preprocessing, which removes all accents.
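The paper does not publish its encoding code; the sketch below only contrasts the default accent-stripping normalization with the diacritic-preserving approach, using Python's standard `unicodedata` module. The two Fon word forms are hypothetical stand-ins for an accent-distinguished pair.

```python
import unicodedata

def strip_accents(text):
    """The 'default' normalization the paper avoids: decompose to NFD and
    drop combining marks, which conflates accent-distinguished words."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Two hypothetical Fon forms differing only in tone marks:
w1, w2 = "gbé", "gbè"
assert w1 != w2                                 # distinct if diacritics are kept
assert strip_accents(w1) == strip_accents(w2)   # conflated if they are stripped

# Keeping accents, a vocabulary assigns the two forms separate ids:
vocab = {w: i for i, w in enumerate(sorted({w1, w2}))}
print(vocab)
```

Under accent stripping the vocabulary would collapse both forms to one entry, losing the meaning distinction that sentences #2 and #3 of Table 4 illustrate.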

Split       Sentences
Training    105,326
Validation  5,691
Testing     6,012
Table 2: Number of samples (sentences) for training FFR
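The split sizes in Table 2 could be produced with a shuffle-and-slice like the following. The seed and shuffling strategy are assumptions; the paper does not specify how the split was made.

```python
import random

def split_dataset(pairs, n_val=5691, n_test=6012, seed=0):
    """Shuffle the parallel pairs and carve out validation/test sets
    of the sizes reported in Table 2; the rest is the training set."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    test = pairs[:n_test]
    val = pairs[n_test:n_test + n_val]
    train = pairs[n_test + n_val:]
    return train, val, test

# Toy corpus with the FFR Dataset's total size of 117,029 pairs:
toy = [(f"fon {i}", f"fr {i}") for i in range(117029)]
train, val, test = split_dataset(toy)
print(len(train), len(val), len(test))  # 105326 5691 6012
```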

4.2.2 Model structure

In summary, the FFR model is based on the encoder-decoder configuration (Brownlee, 2017; Core, 2020). Our encoder and decoder are made up of 128-dimensional gated recurrent unit (GRU) layers, with a word embedding layer of dimension 512. The encoder transforms a source sentence into a fixed-length context vector, and the decoder learns to interpret the output from the encoded vector by maximizing the probability of a correct translation given a source sentence. We also applied a 30-dimensional attention model (Sutskever et al., 2014; Bahdanau et al., 2015; Lamba, 2020) in order to help the model learn which words to place attention on and make contextually correct translations. The code for the model has been open-sourced on GitHub.
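The cited attention mechanism can be sketched as additive (Bahdanau-style) scoring: each encoder state is scored against the current decoder state, the scores are softmax-normalized, and a context vector is formed as the weighted sum of encoder states. The FFR model's exact formulation and dimensions are not specified here, so the toy sizes and plain-Python implementation below are illustrative only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def additive_attention(enc_states, dec_state, W, U, v):
    """Additive attention: score_t = v . tanh(W h_t + U s),
    weights = softmax(scores), context = sum_t weights_t * h_t."""
    Us = matvec(U, dec_state)
    scores = []
    for h in enc_states:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W, h), Us)]
        scores.append(sum(v_i * h_i for v_i, h_i in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(enc_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, enc_states))
               for d in range(dim)]
    return weights, context

# Toy dimensions: 3 encoder steps of size 2, attention size 3.
enc = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
dec = [0.2, -0.3]
W = [[0.5, 0.1], [-0.2, 0.3], [0.0, 0.4]]
U = [[0.1, 0.0], [0.2, -0.1], [0.3, 0.2]]
v = [0.7, -0.5, 0.3]

weights, context = additive_attention(enc, dec, W, U, v)
assert abs(sum(weights) - 1.0) < 1e-9  # weights form a probability distribution
assert len(context) == 2               # context has the encoder-state dimension
```

In a full model the context vector is concatenated with the decoder state before the output projection; a framework implementation would batch these operations over tensors rather than loop over lists.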

4.2.3 Initial Results and Findings

Model structure                BLEU    GLEU
without diacritical encoding   24.53   13.00
with diacritical encoding      30.55   18.18
Table 3: Overall BLEU and GLEU scores on FFR v1.0 test samples

We trained the model with and without diacritical encoding. As seen in Table 3, our diacritical encoding greatly improved the performance of the FFR model on both the BLEU and GLEU (a modification of the BLEU metric introduced by Google) metrics. This shows how important diacritics are to the structure and meaning of our African languages, and therefore the need to build models that can interpret them well.

ID   Source            Target                         FFR v1.0 Model       BLEU/CMS score
0    yí bo wa          prends et viens                prends et viens      1.0
1    yi bo wa          va et viens                    va viens             1.0
2    h()́on             porte                          scorpion             0.0
3    hon               fuire                          porte                0.0
4    sá amasín dŏ wŭ   oindre avec un médicament      se masser le remede  0.0 / 0.65
5    gb()́e             pousser de nouvelles feuilles  esprit de la vie     0.25 / 0.9
Table 4: Sentence predictions and scores

Table 4 shows translations of interest from the FFR model, illustrating the difficulty of predicting Fon words which bear different meanings with different accents. While our model predicted well for #0 and #1, it misplaced the meanings for #2 and #3.

CMS (context-meaning similarity): We discovered that the FFR v1.0 model was able to provide predictions that, although different from the target, were similar in context to the target, as seen in sentence #4. This led us to develop the CMS metric: we sent the source and target sentences to five Fon-French natives and asked them to score, from 0 to 1, how similar in contextual meaning the predictions were to the source and target sentences. We then took the average of their reviews as the CMS score for each of the model's predictions, as given for sentence #4 in Table 4.
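As described, CMS is simply the mean of the five native speakers' 0-1 ratings. A minimal sketch follows; the individual ratings shown are invented for illustration and are not the reviewers' actual scores.

```python
def cms_score(ratings):
    """Average the 0-1 contextual-similarity ratings from the reviewers."""
    if not all(0.0 <= r <= 1.0 for r in ratings):
        raise ValueError("CMS ratings must lie in [0, 1]")
    return sum(ratings) / len(ratings)

# Hypothetical five-reviewer ratings averaging (up to rounding) to the
# 0.65 CMS reported for sentence #4:
print(round(cms_score([0.7, 0.6, 0.65, 0.6, 0.7]), 2))  # 0.65
```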

We also discovered that there were cases, for example sentence #5, where the target was wrong, but the model was able to predict a correct translation.

Although the CMS metric is crude at the moment, we believe there is potential in exploring it further: it could shed more light on measuring the performance of a translation model for these tonal African languages, as well as out-of-vocabulary translation performance.

The FFR model is a pilot project and there is headroom to be explored with the tuning of different architectures, learning schemes, and transfer learning for FFR Model v2.0.

References

  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. Cited by: §4.2.2.
  • J. Bendor-Samuel (2017) Niger-Congo languages. Britannica. External Links: Link Cited by: §1.
  • J. Brownlee (2017) Deep learning for natural language processing. Machine Learning Mastery. Cited by: §4.2.2.
  • T. Core (2020) Neural machine translation with attention. External Links: Link Cited by: §4.2.2.
  • J. Greenberg (1948) The classification of African languages. American Anthropologist 50, pp. 24. Cited by: §2.
  • H. Lamba (2020) Intuitive understanding of attention mechanism in deep learning. Medium. Cited by: §4.2.2.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 3104–3112. External Links: Link Cited by: §4.2.2.