AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages

09/10/2021
by Machel Reid, et al.

Reproducible benchmarks are crucial in driving progress of machine translation research. However, existing machine translation benchmarks have been mostly limited to high-resource or well-represented languages. Despite an increasing interest in low-resource machine translation, there are no standardized reproducible benchmarks for many African languages, many of which are used by millions of speakers but have less digitized textual data. To tackle these challenges, we propose AfroMT, a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages. We also develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages. Furthermore, we explore the newly considered case of low-resource focused pretraining and develop two novel data augmentation-based strategies, leveraging word-level alignment information and pseudo-monolingual data for pretraining multilingual sequence-to-sequence models. We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines. We also show gains of up to 12 BLEU points over cross-lingual transfer baselines in data-constrained scenarios. All code and pretrained models will be released as further steps towards larger reproducible benchmarks for African languages.

research
08/02/2020

Multilingual Translation with Extensible Multilingual Pretraining and Finetuning

Recent work demonstrates the potential of multilingual pretraining of cr...
research
03/18/2021

Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation

Successful methods for unsupervised neural machine translation (UNMT) em...
research
06/10/2019

Generalized Data Augmentation for Low-Resource Translation

Translation to or from low-resource languages (LRLs) poses challenges for ...
research
05/16/2023

The Interpreter Understands Your Meaning: End-to-end Spoken Language Understanding Aided by Speech Translation

End-to-end spoken language understanding (SLU) remains elusive even with...
research
11/02/2020

Emergent Communication Pretraining for Few-Shot Machine Translation

While state-of-the-art models that rely upon massively multilingual pret...
research
08/04/2021

PARADISE: Exploiting Parallel Data for Multilingual Sequence-to-Sequence Pretraining

Despite the success of multilingual sequence-to-sequence pretraining, mo...
research
05/15/2021

DirectQE: Direct Pretraining for Machine Translation Quality Estimation

Machine Translation Quality Estimation (QE) is a task of predicting the ...
