
Practical Comparable Data Collection for Low-Resource Languages via Images

by   Aman Madaan, et al.

We propose a method of curating high-quality comparable training data for low-resource languages that does not require bilingual annotators. Our method uses a carefully selected set of images as a pivot between the source and target languages: captions for each image are collected in both languages independently. Human evaluations of the English-Hindi comparable corpora created with our method show that 81.1% of the pairs are acceptable translations, and only 2.47% of the pairs are not translations at all. We further establish the potential of the dataset collected through our approach by experimenting on two downstream tasks: machine translation and dictionary extraction. All code and data are made available at <>
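The image-pivot idea can be illustrated with a minimal sketch. This is not the authors' released code: the function name `build_comparable_pairs`, the dictionary layout keyed by image ID, and the example captions are all assumptions made for illustration. It pairs captions written independently in two languages whenever they describe the same image.

```python
from itertools import product


def build_comparable_pairs(source_captions, target_captions):
    """Pair captions from two languages that describe the same image.

    Both arguments map a (hypothetical) image ID to a list of captions
    written independently by monolingual annotators.
    """
    pairs = []
    # Only images captioned in both languages can act as a pivot.
    shared_ids = sorted(source_captions.keys() & target_captions.keys())
    for image_id in shared_ids:
        # Every source caption is comparable to every target caption
        # of the same image, so take the cross product.
        for src, tgt in product(source_captions[image_id], target_captions[image_id]):
            pairs.append((image_id, src, tgt))
    return pairs


# Made-up example captions, not taken from the paper's dataset.
english = {
    "img_001": ["A dog runs on the beach."],
    "img_002": ["Two children fly a kite."],
}
hindi = {
    "img_001": ["एक कुत्ता समुद्र तट पर दौड़ रहा है।"],
    "img_003": ["एक आदमी साइकिल चला रहा है।"],
}

pairs = build_comparable_pairs(english, hindi)
for image_id, src, tgt in pairs:
    print(image_id, src, tgt, sep="\t")
```

Because the annotators never see each other's captions, the resulting pairs are comparable rather than strictly parallel, which is why the abstract reports human evaluation of how many pairs are acceptable translations.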




Refining Low-Resource Unsupervised Translation by Language Disentanglement of Multilingual Model

Numerous recent work on unsupervised machine translation (UMT) implies t...

SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages

In recent years, multilingual machine translation models have achieved p...

Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation

The data scarcity in low-resource languages has become a bottleneck to b...

Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English

The vast majority of language pairs in the world are low-resource becaus...

A Generalized Constraint Approach to Bilingual Dictionary Induction for Low-Resource Language Families

The lack or absence of parallel and comparable corpora makes bilingual l...

No Language Left Behind: Scaling Human-Centered Machine Translation

Driven by the goal of eradicating language barriers on a global scale, m...

Does Corpus Quality Really Matter for Low-Resource Languages?

The vast majority of non-English corpora are derived from automatically ...