Practical Comparable Data Collection for Low-Resource Languages via Images

04/24/2020
by Aman Madaan, et al.

We propose a method for curating high-quality comparable training data for low-resource languages that does not require bilingual annotators. The method uses a carefully selected set of images as a pivot between the source and target languages: captions for each image are collected in both languages independently. Human evaluations of the English-Hindi comparable corpora created with our method show that 81.1% of the pairs are acceptable translations, and only 2.47% of the pairs are not translations at all. We further establish the potential of the dataset collected through our approach by experimenting on two downstream tasks: machine translation and dictionary extraction. All code and data are made available at <https://github.com/madaan/PML4DC-Comparable-Data-Collection>.
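The image-pivot idea can be illustrated with a minimal sketch. This is not the authors' released code; the function name `pair_captions`, the image IDs, and the sample captions are all hypothetical. It assumes captions were gathered separately per language and keyed by a shared image ID, and it simply pairs captions that describe the same image.

```python
from itertools import product


def pair_captions(src_captions, tgt_captions):
    """Pair captions that describe the same image.

    Both arguments map an image ID to a list of captions written
    independently by monolingual annotators (hypothetical data layout).
    Returns a list of (image_id, source_caption, target_caption) tuples.
    """
    pairs = []
    # Only images captioned in both languages can act as a pivot.
    shared_ids = sorted(src_captions.keys() & tgt_captions.keys())
    for image_id in shared_ids:
        # Every source caption is paired with every target caption
        # of the same image; these are comparable, not guaranteed parallel.
        for src, tgt in product(src_captions[image_id], tgt_captions[image_id]):
            pairs.append((image_id, src, tgt))
    return pairs


# Illustrative, made-up captions.
english = {
    "img1": ["A dog runs on the beach."],
    "img2": ["Two children play chess."],
}
hindi = {
    "img1": ["एक कुत्ता समुद्र तट पर दौड़ रहा है।"],
    "img3": ["एक आदमी साइकिल चला रहा है।"],
}

pairs = pair_captions(english, hindi)
for image_id, src, tgt in pairs:
    print(image_id, src, tgt, sep="\t")
```

Because the two captions of an image are written independently, the resulting pairs are comparable rather than strictly parallel, which is why the human evaluation reported above matters.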
