No Language Left Behind: Scaling Human-Centered Machine Translation

07/11/2022
by NLLB Team, et al.

Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today. However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages. What does it take to break the 200-language barrier while ensuring safe, high-quality results, all while keeping ethical considerations in mind? In No Language Left Behind, we took on this challenge by first contextualizing the need for low-resource language translation support through exploratory interviews with native speakers. Then, we created datasets and models aimed at narrowing the performance gap between low- and high-resource languages. More specifically, we developed a conditional compute model based on a Sparsely Gated Mixture of Experts that is trained on data obtained with novel and effective data mining techniques tailored for low-resource languages. We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. Critically, we evaluated the performance of over 40,000 different translation directions using a human-translated benchmark, Flores-200, and combined human evaluation with a novel toxicity benchmark covering all languages in Flores-200 to assess translation safety. Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art, laying important groundwork towards realizing a universal translation system. Finally, we open source all contributions described in this work, accessible at https://github.com/facebookresearch/fairseq/tree/nllb.
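The conditional compute approach named in the abstract, a Sparsely Gated Mixture of Experts (MoE), replaces a Transformer's dense feed-forward sublayer with a pool of expert feed-forward networks plus a learned router that dispatches each token to only a few of them, so model capacity grows without a matching growth in per-token compute. Below is a minimal PyTorch sketch of a top-2-routed MoE layer; the class name, dimensions, and routing details are illustrative assumptions for exposition, not the released NLLB-200 architecture or the fairseq implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    """Sketch of a sparsely gated Mixture-of-Experts feed-forward layer.

    Each token is dispatched to its top-2 experts, so per-token compute
    scales with k=2 rather than with the total number of experts.
    """

    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)  # learned router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        gate_probs = F.softmax(self.gate(x), dim=-1)           # (tokens, experts)
        weights, chosen = gate_probs.topk(self.top_k, dim=-1)  # (tokens, k)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over top-k
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.top_k):
                mask = chosen[:, slot] == e
                if mask.any():
                    # weight each routed token's expert output by its gate score
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: 8 experts, but each token only pays for 2 expert forward passes.
layer = Top2MoELayer(d_model=512, d_ff=2048, num_experts=8)
tokens = torch.randn(16, 512)
print(layer(tokens).shape)  # torch.Size([16, 512])
```

Production MoE implementations additionally impose expert capacity limits and a load-balancing auxiliary loss so tokens spread evenly across experts; the gather/scatter loop above is written for clarity, not speed.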

