Lesan – Machine Translation for Low Resource Languages

12/15/2021
by   Asmelash Teka Hadgu, et al.

Millions of people around the world cannot access content on the Web because most of it is not readily available in their language. Machine translation (MT) systems have the potential to change this for many languages. Current MT systems provide very accurate results for high resource language pairs, e.g., German and English. However, for many low resource languages, MT is still under active research. The key challenge is the lack of datasets to build these systems. We present Lesan, an MT system for low resource languages. Our pipeline solves the key bottleneck of low resource MT by leveraging online and offline sources, a custom OCR system for Ethiopic and an automatic alignment module. The final step in the pipeline is a sequence-to-sequence model that takes a parallel corpus as input and gives us a translation model. Lesan's translation model is based on the Transformer architecture. After constructing a base model, back translation is used to leverage monolingual corpora. Lesan currently supports translation to and from Tigrinya, Amharic and English. We perform extensive human evaluation and show that Lesan outperforms state-of-the-art systems such as Google Translate and Microsoft Translator across all six pairs. Lesan is freely available and has served more than 10 million translations so far. At the moment, there are only 217 Tigrinya and 15,009 Amharic Wikipedia articles. We believe that Lesan will contribute towards democratizing access to the Web through MT for millions of people.



1 Machine Translation Pipeline

Our machine translation pipeline, shown in Figure 1, has three key components: OCR, automatic alignment and MT modeling, each leveraging current advances in deep learning.

Data sources

Machine translation (MT) systems today provide very accurate results for high resource language pairs Barrault et al. (2019). However, these systems rely on large-scale parallel corpora to work. For pairs such as German and English, parallel sentences can be easily obtained from the Web. For the majority of languages, though, there is not enough data online to build these datasets. We leverage offline and online sources to mitigate this bottleneck.

Figure 1: Lesan machine translation system pipeline

Offline sources are gathered through partnerships with media companies, libraries and other archives that hold books, magazines and other linguistic materials. These are typically translated books, magazines or daily news available in both the source and target language, in different file formats as well as hard copies. Online sources include news sites, blogs, and religious and legal sites that feature stories in both the source and target language.

Data preprocessing

After gathering large-scale unstructured data from online and offline sources, the next step is preprocessing. The goal of this module is to generate clean text (a corpus) in machine-readable format. Scanned books and magazines pass through an OCR system that converts them into machine-readable text. We built a custom OCR system for Ethiopic, the script used for Tigrinya, Amharic and several other languages. The output of our OCR system is post-edited to correct character- and word-level errors. Data crawled from the Web is parsed to extract text by removing HTML tags and gathering metadata, such as author and date, that can be helpful in downstream steps. Text from the different sources, whether parsed from web pages or coming from the post-editing module, then passes through several NLP modules such as deduplication, language detection and sentence splitting (a minimal sketch of these steps is given below).
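The paper does not publish its preprocessing code; the following is a minimal sketch of the final cleaning steps (sentence splitting, deduplication, coarse language detection), assuming plain-text input. The Ethiopic full stop "።" (U+1362) is used alongside Latin punctuation as a sentence delimiter; the detect_language stub is an illustrative stand-in, not the model Lesan actually uses.

```python
import hashlib
import re

# Ethiopic full stop (U+1362) ends sentences in Tigrinya and Amharic;
# Latin punctuation handles English and mixed text.
SENTENCE_DELIMS = re.compile(r"(?<=[።.!?])\s+")

def split_sentences(text: str) -> list[str]:
    """Split cleaned text into sentences on Ethiopic and Latin delimiters."""
    return [s.strip() for s in SENTENCE_DELIMS.split(text) if s.strip()]

def deduplicate(sentences: list[str]) -> list[str]:
    """Drop exact duplicates while preserving order, hashing to save memory."""
    seen, unique = set(), []
    for s in sentences:
        h = hashlib.sha1(s.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(s)
    return unique

def detect_language(sentence: str) -> str:
    """Placeholder: a real system would use a trained language-ID model.
    Here we only distinguish Ethiopic script from Latin as an illustration."""
    return "ethiopic" if re.search(r"[\u1200-\u137F]", sentence) else "latin"
```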

Automatic alignment

All source and target sentences that have been cleaned are then passed through an automatic alignment system to construct a parallel corpus. A candidate generation step produces candidate pairs; each pair of source and target sentences is then fed to our automatic matching system, which determines whether the pair is a translation or not. This system is similar to Junczys-Dowmunt (2018), where we compute cross-entropy scores according to two inverse translation models and take the weighted average, aggregated over a set of candidate pairs. A sketch of this scoring is given below.
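As a rough illustration of the dual conditional cross-entropy idea from Junczys-Dowmunt (2018), not Lesan's actual implementation: given word-normalized cross-entropies from a forward (source-to-target) model and an inverse (target-to-source) model, a pair scores well when both entropies are low and agree with each other. The scorer signatures and the threshold value are assumptions.

```python
import math
from typing import Callable

# Each scorer returns the per-token cross-entropy (negative average
# log-probability) of one side of a candidate pair given the other.
# In practice these would be real translation models; here they are
# assumed callables with this signature.
ScoreFn = Callable[[str, str], float]

def dual_xent_score(src: str, tgt: str,
                    fwd_xent: ScoreFn, rev_xent: ScoreFn) -> float:
    """Score a candidate pair in the spirit of Junczys-Dowmunt (2018):
    penalize both high average cross-entropy and disagreement between
    the forward and inverse models. Higher is better."""
    h_fwd = fwd_xent(src, tgt)   # H(tgt | src) under the forward model
    h_rev = rev_xent(tgt, src)   # H(src | tgt) under the inverse model
    agreement_penalty = abs(h_fwd - h_rev)
    avg_xent = 0.5 * (h_fwd + h_rev)
    return math.exp(-(agreement_penalty + avg_xent))

def filter_pairs(pairs, fwd_xent, rev_xent, threshold=0.3):
    """Keep candidate pairs whose score clears a (tunable) threshold."""
    return [(s, t) for s, t in pairs
            if dual_xent_score(s, t, fwd_xent, rev_xent) >= threshold]
```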

Sequence-to-sequence modeling with back translation

The final step in the pipeline is a sequence-to-sequence model that takes the parallel corpus as input and gives us a translation model. Lesan's translation model is based on the Transformer architecture Vaswani et al. (2017). After constructing a base model, back translation Edunov et al. (2018) is used to leverage monolingual corpora; a sketch of this augmentation step follows.
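A minimal sketch of how back translation augments the training data, independent of Lesan's exact setup: a reverse (target-to-source) base model translates monolingual target-side text into synthetic source sentences, and the synthetic pairs are mixed with the genuine parallel corpus before retraining. The reverse_model.translate call is an assumed interface, not a real OpenNMT API.

```python
def back_translate(monolingual_tgt: list[str], reverse_model) -> list[tuple[str, str]]:
    """Create synthetic (source, target) pairs from monolingual target text.

    `reverse_model` is assumed to expose a translate(str) -> str method
    mapping target-language input back into the source language.
    """
    synthetic = []
    for tgt_sentence in monolingual_tgt:
        synthetic_src = reverse_model.translate(tgt_sentence)
        # The machine-generated sentence goes on the *source* side; the
        # human-written monolingual sentence stays on the target side,
        # as in Edunov et al. (2018).
        synthetic.append((synthetic_src, tgt_sentence))
    return synthetic

def build_training_data(parallel, monolingual_tgt, reverse_model):
    """Genuine parallel data plus back-translated synthetic pairs."""
    return list(parallel) + back_translate(monolingual_tgt, reverse_model)
```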

Our machine translation pipeline solves the key bottleneck to low resource MT by leveraging online and offline sources, a custom OCR system for Ethiopic and an automatic alignment module.

2 Evaluation

In this section we describe a human evaluation of Lesan compared against three state-of-the-art commercial systems: Google Translate (https://translate.google.com/), Microsoft Translator (https://www.bing.com/translator) and Yandex Translate (https://translate.yandex.com/). Google Translate uses neural MT (Wu et al., 2016). Yandex Translate uses a hybrid of neural MT and statistical MT (https://tech.yandex.com/translate/doc/dg/concepts/how-works-machine-translation-docpage/). All these services provide APIs to access their systems. Lesan's MT models are implemented using the OpenNMT toolkit Klein et al. (2017).

Human Evaluation

Experts were selected to evaluate the translation outputs of the systems. To avoid bias, we assigned the task of selecting which news sources to translate to the expert evaluators themselves. The main requirements were: identify 20 news stories (1–2 paragraphs each), and select stories across different genres: politics, social, economy, entertainment, health, sport, technology, etc. The outputs of the translation systems, alongside the source texts, are given to the expert evaluators to score. The outputs are shuffled so that one cannot tell which output came from which system (a sketch of this blinding step is given after Table 1). We chose a 5-point Likert scale (https://en.wikipedia.org/wiki/Likert_scale) ranging from completely wrong (0) to accurate and fluent (4). The error categories are adapted from the severity levels for error categories used in the evaluation of translation quality developed by TAUS (https://www.taus.net/qt21-project#harmonized-error-typology). The complete description of the scoring scheme is shown in Table 1. We have released the evaluation datasets (https://zenodo.org/record/5060303) to foster research and progress on evaluating MT systems for low resource languages.


Scale (Value): Description
Wrong translation (0): A completely wrong translation; the output does not make sense given the source text.
Major problem (1): A serious problem in the translation, with parts of the source missing or mistranslated; it would be hard to match the output to the source text without major modifications.
Minor problem (2): The translation has minor problems and requires small changes, e.g., changing a word or two to make it fully describe the source text.
Good translation (3): The translation describes the source text, but there may be some problems with style, such as punctuation, word order or appropriate wording.
Accurate and fluent (4): The output is a correct translation of the source text. It is both accurate and fluent.
Table 1: Human evaluation guideline to evaluate performance of MT systems.
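The paper does not include its evaluation tooling; below is a minimal sketch of how blinded scoring sheets could be assembled, assuming each evaluation item carries the source text and one output per system. System identities are shuffled independently per item so evaluators cannot tell which output came from which system; the item format is an assumption.

```python
import random

def make_blind_sheet(items, seed=13):
    """Build a blinded evaluation sheet and a hidden answer key.

    `items` is assumed to be a list of dicts like
    {"source": str, "outputs": {"Lesan": str, "Google": str, ...}}.
    """
    rng = random.Random(seed)
    sheet, key = [], []
    for item in items:
        systems = list(item["outputs"])
        rng.shuffle(systems)  # a fresh order per item hides system identity
        sheet.append({"source": item["source"],
                      "candidates": [item["outputs"][s] for s in systems]})
        key.append(systems)  # kept separately to unblind scores later
    return sheet, key
```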

We report the normalized mean and standard deviation of the scores. The results are given in Table 2 for Amharic to and from English and in Table 3 for Tigrinya to and from Amharic and English (a sketch of this aggregation is given after the tables). Across all directions, Lesan outperforms these state-of-the-art systems.

Direction   System      Sentence       Story
Am → En     Yandex      0.23 ± 0.30    0.19 ± 0.25
            Microsoft   2.13 ± 0.51    2.06 ± 0.50
            Google      2.58 ± 0.54    2.54 ± 0.48
            Lesan       2.68 ± 0.41    2.71 ± 0.55
En → Am     Yandex      0.28 ± 0.34    0.20 ± 0.29
            Microsoft   2.57 ± 0.43    2.54 ± 0.44
            Google      2.98 ± 0.30    2.88 ± 0.33
            Lesan       3.25 ± 0.38    3.17 ± 0.42

Table 2: Human evaluation comparing Lesan against three commercial MT systems for Amharic to and from English. Scores are mean ± standard deviation on the 0–4 scale, at sentence and story level.

Direction   System      Sentence       Story
Am → Ti     Microsoft   1.92 ± 0.43    1.85 ± 0.54
            Lesan       1.94 ± 0.47    1.86 ± 0.53
Ti → Am     Microsoft   1.60 ± 0.44    1.44 ± 0.57
            Lesan       1.94 ± 0.50    1.77 ± 0.50
En → Ti     Microsoft   2.32 ± 0.60    2.17 ± 0.63
            Lesan       2.33 ± 0.63    2.19 ± 0.58
Ti → En     Microsoft   2.01 ± 0.63    1.89 ± 0.67
            Lesan       2.78 ± 0.31    2.63 ± 0.39

Table 3: Human evaluation comparing Lesan and Microsoft Translator for Tigrinya to and from Amharic and English. Scores are mean ± standard deviation on the 0–4 scale, at sentence and story level.
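As an illustration of the reported statistics, a small sketch of aggregating the Likert scores per system, assuming the "normalized" mean is simply the per-item average on the 0–4 scale (an assumption; the paper does not spell out the normalization):

```python
import statistics

def summarize(scores_by_system):
    """Mean and standard deviation of Likert scores per system.

    `scores_by_system` maps a system name to its list of 0-4 scores,
    e.g. {"Lesan": [3, 4, 2, ...], "Google": [2, 3, ...]}.
    """
    return {system: (statistics.mean(scores), statistics.stdev(scores))
            for system, scores in scores_by_system.items()}
```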

3 Broader Impact

There are several applications of machine translation systems with broader impact. Let's take the case of Wikipedia. The English Wikipedia alone currently has more than six million articles and billions of words. Unfortunately, millions of people cannot access this content because it is not available in their language. For instance, at the moment there are only 217 articles on the Tigrinya Wikipedia and 15,009 articles on the Amharic Wikipedia (https://en.wikipedia.org/wiki/List_of_Wikipedias). In future work, we would like to leverage Lesan's MT system to empower human translators towards our mission of opening up the Web's content to millions of people in their language.

We would like to thank Sergey Edunov from Facebook AI Research for valuable feedback on our machine translation pipeline.

References

  • L. Barrault, O. Bojar, M. R. Costa-jussà, C. Federmann, M. Fishel, Y. Graham, B. Haddow, M. Huck, P. Koehn, S. Malmasi, C. Monz, M. Müller, S. Pal, M. Post, and M. Zampieri (2019) Findings of the 2019 conference on machine translation (WMT19). In WMT (2), pp. 1–61. Cited by: §1.
  • S. Edunov, M. Ott, M. Auli, and D. Grangier (2018) Understanding back-translation at scale. arXiv preprint arXiv:1808.09381. Cited by: §1.
  • M. Junczys-Dowmunt (2018) Dual conditional cross-entropy filtering of noisy parallel corpora. arXiv preprint arXiv:1809.00197. Cited by: §1.
  • G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush (2017) OpenNMT: open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810. Cited by: §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, Ł. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean (2016) Google's neural machine translation system: bridging the gap between human and machine translation. CoRR abs/1609.08144. Cited by: §2.