Practical Comparable Data Collection for Low-Resource Languages via Images

04/24/2020
by Aman Madaan, et al.

We propose a method for curating high-quality comparable training data for low-resource languages that does not require bilingual annotators. The method uses a carefully selected set of images as a pivot between the source and target languages: captions for each image are collected in both languages independently. Human evaluations of the English-Hindi comparable corpora created with our method show that 81.1% of the pairs are acceptable translations, and only 2.47% of the pairs are not translations at all. We further establish the potential of the dataset collected through our approach by experimenting on two downstream tasks – machine translation and dictionary extraction. All code and data are made available at <https://github.com/madaan/PML4DC-Comparable-Data-Collection>.
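The image-pivot idea can be illustrated with a minimal sketch. This is not the authors' released code. It assumes each annotator's output is available as (image_id, caption) tuples, and the function name `build_comparable_pairs`, the image IDs, and the data layout are hypothetical choices made for illustration. Captions written independently in the two languages are paired whenever they describe the same image.

```python
from collections import defaultdict
from itertools import product


def build_comparable_pairs(source_captions, target_captions):
    """Pair captions that describe the same image (hypothetical helper).

    Each argument is an iterable of (image_id, caption) tuples collected
    independently from monolingual annotators.
    Returns a list of (image_id, source_caption, target_caption) tuples.
    """
    by_image_src = defaultdict(list)
    for image_id, caption in source_captions:
        by_image_src[image_id].append(caption)

    by_image_tgt = defaultdict(list)
    for image_id, caption in target_captions:
        by_image_tgt[image_id].append(caption)

    pairs = []
    # Only images captioned in both languages can yield a comparable pair.
    shared_images = sorted(by_image_src.keys() & by_image_tgt.keys())
    for image_id in shared_images:
        # Every source caption is paired with every target caption of the image.
        for src, tgt in product(by_image_src[image_id], by_image_tgt[image_id]):
            pairs.append((image_id, src, tgt))
    return pairs


if __name__ == "__main__":
    # Made-up example data, not taken from the paper's dataset.
    english = [
        ("img1", "A dog runs on the beach."),
        ("img2", "Two children play cricket."),
    ]
    hindi = [
        ("img1", "एक कुत्ता समुद्र तट पर दौड़ रहा है।"),
        ("img3", "एक आदमी चाय पी रहा है।"),
    ]
    for image_id, src, tgt in build_comparable_pairs(english, hindi):
        print(image_id, src, tgt, sep="\t")
```

Pairing by image ID is what lets each annotator work in a single language. The resulting pairs are comparable rather than guaranteed parallel, which is why the human evaluation of pair quality reported above matters.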


