Bilex Rx: Lexical Data Augmentation for Massively Multilingual Machine Translation

03/27/2023
by Alex Jones, et al.

Neural machine translation (NMT) has progressed rapidly over the past several years, and modern models can achieve relatively high quality using only monolingual text data, an approach dubbed Unsupervised Machine Translation (UNMT). However, these models still struggle in a variety of ways, including with aspects of translation that are easiest for a human, such as correctly translating common nouns. This work explores a cheap and abundant resource to combat this problem: bilingual lexica. We test the efficacy of bilingual lexica in a real-world setup, on 200-language translation models trained on web-crawled text. We present several findings: (1) using lexical data augmentation, we demonstrate sizable performance gains for unsupervised translation; (2) we compare several families of data augmentation, demonstrating that they yield similar improvements and can be combined for even greater improvements; (3) we demonstrate the importance of carefully curated lexica over larger, noisier ones, especially with larger models; and (4) we compare the efficacy of multilingual lexicon data versus human-translated parallel data. Finally, we open-source GATITOS (available at https://github.com/google-research/url-nlp/tree/main/gatitos), a new multilingual lexicon for 26 low-resource languages, which had the highest performance among lexica in our experiments.
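As a rough illustration of what lexical data augmentation with a bilingual lexicon can look like, the sketch below substitutes words in a monolingual sentence with their lexicon translations. The abstract does not spell out the augmentation families it compares, so this word-substitution scheme is an assumption chosen for illustration, not the authors' method. The function name `codeswitch`, its parameters, and the toy English-Spanish lexicon are all hypothetical.

```python
import random


def codeswitch(sentence, lexicon, prob=0.5, seed=0):
    """Replace tokens found in `lexicon` with a translation, each with probability `prob`.

    Hypothetical helper for illustration only: whitespace tokenization,
    lowercase lookup, and a fixed seed for reproducibility.
    """
    rng = random.Random(seed)
    out = []
    for token in sentence.split():
        translations = lexicon.get(token.lower())
        if translations and rng.random() < prob:
            # Pick one of the listed translations at random.
            out.append(rng.choice(translations))
        else:
            # Keep tokens that are missing from the lexicon or not sampled.
            out.append(token)
    return " ".join(out)


# Toy lexicon (assumed data, not taken from GATITOS).
toy_lexicon = {"cat": ["gato"], "house": ["casa"], "water": ["agua"]}

augmented = codeswitch("the cat drinks water in the house", toy_lexicon, prob=1.0)
unchanged = codeswitch("the cat drinks water in the house", toy_lexicon, prob=0.0)
print(augmented)
```

With `prob=1.0` every token that appears in the lexicon is replaced, and with `prob=0.0` the sentence is returned unchanged. A real pipeline would need proper tokenization and handling of morphology, neither of which this sketch attempts.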


research
06/12/2023

Textual Augmentation Techniques Applied to Low Resource Machine Translation: Case of Swahili

In this work we investigate the impact of applying textual data augmenta...
research
05/11/2020

Leveraging Monolingual Data with Self-Supervision for Multilingual Neural Machine Translation

Over the last few years two promising research directions in low-resourc...
research
06/10/2019

Generalized Data Augmentation for Low-Resource Translation

Translation to or from low-resource languages (LRLs) poses challenges for ...
research
10/18/2018

Unsupervised Neural Text Simplification

The paper presents a first attempt towards unsupervised neural text simp...
research
04/28/2022

NMTScore: A Multilingual Analysis of Translation-based Text Similarity Measures

Being able to rank the similarity of short text segments is an interesti...
research
07/11/2022

No Language Left Behind: Scaling Human-Centered Machine Translation

Driven by the goal of eradicating language barriers on a global scale, m...
research
08/13/2018

Rapid Adaptation of Neural Machine Translation to New Languages

This paper examines the problem of adapting neural machine translation s...
