"A Little is Enough": Few-Shot Quality Estimation based Corpus Filtering improves Machine Translation

06/06/2023
by Akshay Batheja, et al.

Quality Estimation (QE) is the task of evaluating the quality of a translation when no reference translation is available. The goal of QE aligns with that of corpus filtering, where a quality score is assigned to each sentence pair in a pseudo-parallel corpus. We propose a Quality Estimation based filtering approach to extract high-quality parallel data from a pseudo-parallel corpus. To the best of our knowledge, this is a novel adaptation of the QE framework to corpus filtering. Training on the filtered corpus improves the performance of the Machine Translation (MT) system by up to 1.8 BLEU points for the English-Marathi, Chinese-English, and Hindi-Bengali language pairs over the baseline model, which is trained on the whole pseudo-parallel corpus. Our few-shot QE model, transfer-learned from the English-Marathi QE model and fine-tuned on only 500 Hindi-Bengali training instances, yields an improvement of up to 0.6 BLEU points for the Hindi-Bengali language pair compared to the baseline model. This demonstrates the promise of transfer learning in this setting. QE systems typically require on the order of 7K-25K training instances. Our Hindi-Bengali QE model is trained on only 500 instances, which is 1/40th of the normal requirement, and achieves comparable performance. All the scripts and datasets used in this study will be made publicly available.
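
The filtering idea described in the abstract (score every sentence pair, keep the high-scoring ones) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `filter_corpus`, the threshold value, and especially `length_ratio_score`, which is a toy stand-in and not the QE model from the paper. In practice the scoring function would be a trained QE model that returns a quality score for each (source, translation) pair.

```python
from typing import Callable, Iterable, List, Tuple

SentencePair = Tuple[str, str]


def filter_corpus(
    pairs: Iterable[SentencePair],
    score_fn: Callable[[str, str], float],
    threshold: float,
) -> List[SentencePair]:
    """Keep only the sentence pairs whose quality score is at least `threshold`."""
    kept = []
    for src, tgt in pairs:
        if score_fn(src, tgt) >= threshold:
            kept.append((src, tgt))
    return kept


def length_ratio_score(src: str, tgt: str) -> float:
    """Toy placeholder for a QE model (hypothetical, not the paper's method).

    Returns the ratio of the shorter to the longer token count, in [0, 1].
    """
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        return 0.0
    return min(n_src, n_tgt) / max(n_src, n_tgt)


if __name__ == "__main__":
    pseudo_parallel = [
        ("a b c", "x y z"),        # balanced lengths, high toy score
        ("a b c d e f", "x"),      # very unbalanced, low toy score
        ("", "x"),                 # empty source, score 0.0
    ]
    # The threshold of 0.5 is an arbitrary illustrative value.
    filtered = filter_corpus(pseudo_parallel, length_ratio_score, threshold=0.5)
    print(filtered)
```

The filtered list would then serve as MT training data in place of the whole pseudo-parallel corpus. Passing the scorer as a function keeps the filtering step independent of which QE model produces the scores.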

