Towards the Next 1000 Languages in Multilingual Machine Translation: Exploring the Synergy Between Supervised and Self-Supervised Learning

01/09/2022
by Aditya Siddhant, et al.

Achieving universal translation between all human language pairs is the holy grail of machine translation (MT) research. While recent progress in massively multilingual MT brings us one step closer to this goal, it is becoming evident that extending a multilingual MT system simply by training on more parallel data does not scale, since the availability of labeled data for low-resource and non-English-centric language pairs is prohibitively limited. To this end, we present a pragmatic approach towards building a multilingual MT model that covers hundreds of languages, using a mixture of supervised and self-supervised objectives, depending on the data availability for different language pairs. We demonstrate that the synergy between these two training paradigms enables the model to produce high-quality translations in the zero-resource setting, even surpassing supervised translation quality for low- and mid-resource languages. We conduct a wide array of experiments to understand the effect of the degree of multilingual supervision, domain mismatches, and amounts of parallel and monolingual data on the quality of our self-supervised multilingual models. To demonstrate the scalability of the approach, we train models with over 200 languages and demonstrate high performance on zero-resource translation for several previously under-studied languages. We hope our findings will serve as a stepping stone towards enabling translation for the next thousand languages.
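The core idea in the abstract is that the training objective is chosen per language pair based on data availability: supervised translation where parallel data exists, and a self-supervised objective (such as MASS-style masked-span reconstruction) where only monolingual data is available. The sketch below is a rough, hypothetical illustration of that routing at the data-pipeline level, not the paper's actual implementation; all names (`make_training_example`, `MASK`, `span_frac`) are invented for this example.

```python
import random

MASK = "<mask>"

def make_training_example(src_tokens, tgt_tokens=None, span_frac=0.5, rng=None):
    """Build one training example for a mixed-objective MT model.

    Parallel data (tgt_tokens provided) -> supervised pair (source, target).
    Monolingual data (tgt_tokens is None) -> MASS-style self-supervised pair:
    mask a contiguous span of the source and use that span as the target.
    """
    if tgt_tokens is not None:
        # Supervised translation objective: predict target from source.
        return src_tokens, tgt_tokens

    # Self-supervised objective on monolingual text.
    rng = rng or random.Random(0)
    n = len(src_tokens)
    span_len = max(1, int(n * span_frac))
    start = rng.randrange(0, n - span_len + 1)
    masked = (src_tokens[:start]
              + [MASK] * span_len
              + src_tokens[start + span_len:])
    target = src_tokens[start:start + span_len]
    return masked, target
```

In a real system the two kinds of examples would be mixed within training batches, so that a single model learns both to translate supervised pairs and to reconstruct masked spans for languages that have no parallel data.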



Related research

- 12/16/2020: Improving Multilingual Neural Machine Translation For Low-Resource Languages: French-, English-Vietnamese
  Prior works have demonstrated that a low-resource language pair can bene...

- 05/09/2022: Building Machine Translation Systems for the Next Thousand Languages
  In this paper we share findings from our effort to build practical machi...

- 02/02/2023: The unreasonable effectiveness of few-shot learning for machine translation
  We demonstrate the potential of few-shot translation systems, trained wi...

- 05/30/2023: Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation
  Many NLP pipelines split text into sentences as one of the crucial prepr...

- 10/20/2018: Improving Multilingual Semantic Textual Similarity with Shared Sentence Encoder for Low-resource Languages
  Measuring the semantic similarity between two sentences (or Semantic Tex...

- 10/07/2020: Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information
  We investigate the following question for machine translation (MT): can ...

- 05/29/2023: BigTrans: Augmenting Large Language Models with Multilingual Translation Capability over 100 Languages
  Large language models (LLMs) demonstrate promising translation performan...
