Separating Grains from the Chaff: Using Data Filtering to Improve Multilingual Translation for Low-Resourced African Languages

10/19/2022
by   Idris Abdulmumin, et al.

We participated in the WMT 2022 Large-Scale Machine Translation Evaluation for the African Languages Shared Task. This work describes our approach, which filters the provided noisy data using a sentence-pair classifier built by fine-tuning a pre-trained language model. To train the classifier, we obtain positive samples (i.e., high-quality parallel sentences) from a gold-standard curated dataset and extract negative samples (i.e., low-quality parallel sentences) from automatically aligned parallel data by choosing sentence pairs with low alignment scores. Our final machine translation model was then trained on the filtered data rather than on the entire noisy dataset. We empirically validate our approach by evaluating on two common datasets and show that data filtering generally improves overall translation quality, in some cases significantly.
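The data flow described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the `max_negative_score` and `min_keep_prob` thresholds, and the `score_pair` callable (a stand-in for the fine-tuned sentence-pair classifier) are all hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]                # (source sentence, target sentence)
ScoredPair = Tuple[str, str, float]   # pair plus an automatic alignment score


def build_classifier_samples(
    gold_pairs: Iterable[Pair],
    aligned_pairs: Iterable[ScoredPair],
    max_negative_score: float,
) -> List[Tuple[str, str, int]]:
    """Label gold pairs as positive (1) and low-scoring aligned pairs as negative (0).

    The threshold is an assumption; the abstract only says that negatives are
    sentences with low alignment scores.
    """
    samples = [(src, tgt, 1) for src, tgt in gold_pairs]
    samples += [
        (src, tgt, 0)
        for src, tgt, score in aligned_pairs
        if score < max_negative_score
    ]
    return samples


def filter_corpus(
    noisy_pairs: Iterable[Pair],
    score_pair: Callable[[str, str], float],
    min_keep_prob: float = 0.5,
) -> List[Pair]:
    """Keep only the pairs that the classifier scores at or above the cutoff.

    `score_pair` stands in for the trained classifier and is assumed to return
    the probability that a pair is a high-quality translation.
    """
    return [
        (src, tgt)
        for src, tgt in noisy_pairs
        if score_pair(src, tgt) >= min_keep_prob
    ]


if __name__ == "__main__":
    # Toy placeholder data; real inputs would be parallel sentences.
    gold = [("src gold", "tgt gold")]
    aligned = [("src a", "tgt a", 0.92), ("src b", "tgt b", 0.11)]
    print(build_classifier_samples(gold, aligned, max_negative_score=0.5))

    # Dummy scorer: treats any pair with an empty target as low quality.
    dummy_scorer = lambda src, tgt: 1.0 if tgt else 0.0
    print(filter_corpus([("src c", "tgt c"), ("src d", "")], dummy_scorer))
```

In practice the translation model would then be trained on the output of `filter_corpus` instead of the full noisy corpus, which is the comparison the abstract reports.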

research
05/24/2018

Filtering and Mining Parallel Data in a Joint Multilingual Space

We learn a joint multilingual sentence embedding and use the distance be...
research
05/27/2023

Parallel Corpus for Indigenous Language Translation: Spanish-Mazatec and Spanish-Mixtec

In this paper, we present a parallel Spanish-Mazatec and Spanish-Mixtec ...
research
05/24/2022

Lack of Fluency is Hurting Your Translation Model

Many machine translation models are trained on bilingual corpus, which c...
research
08/16/2019

Incorporating Word and Subword Units in Unsupervised Machine Translation Using Language Model Rescoring

This paper describes CAiRE's submission to the unsupervised machine tran...
research
06/06/2023

"A Little is Enough": Few-Shot Quality Estimation based Corpus Filtering improves Machine Translation

Quality Estimation (QE) is the task of evaluating the quality of a trans...
research
11/16/2020

Score Combination for Improved Parallel Corpus Filtering for Low Resource Conditions

This paper describes our submission to the WMT20 sentence filtering task...
research
08/27/2021

Translation Error Detection as Rationale Extraction

Recent Quality Estimation (QE) models based on multilingual pre-trained ...
