Improving Machine Translation with Phrase Pair Injection and Corpus Filtering

01/19/2023
by Akshay Batheja, et al.

In this paper, we show that combining Phrase Pair Injection with Corpus Filtering improves the performance of Neural Machine Translation (NMT) systems. We extract parallel phrases and sentences from a pseudo-parallel corpus and add them to the parallel corpus used to train the NMT models. With the proposed approach, we observe improvements of up to 2.7 BLEU points on the FLORES test data for 3 low-resource language pairs (Hindi-Marathi, English-Marathi, and English-Pashto) across 6 translation directions. These improvements are measured against models trained on the whole pseudo-parallel corpus augmented with the parallel corpus.
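
The data-preparation idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how pairs are scored or how phrases are extracted, so everything here is an assumption made for illustration: the function names (`filter_pairs`, `build_training_corpus`), the threshold, and the toy length-ratio score, which only stands in for whatever quality measure the authors actually use. Candidate sentence and phrase pairs are assumed to be already extracted.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source text, target text)


def toy_length_ratio(src: str, tgt: str) -> float:
    """Hypothetical stand-in quality score: ratio of token counts in [0, 1]."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    longest = max(n_src, n_tgt)
    return min(n_src, n_tgt) / longest if longest else 0.0


def filter_pairs(pairs: Iterable[Pair],
                 score: Callable[[str, str], float],
                 threshold: float) -> List[Pair]:
    """Corpus filtering: keep only pairs whose score reaches the threshold."""
    return [(s, t) for s, t in pairs if score(s, t) >= threshold]


def build_training_corpus(parallel: Iterable[Pair],
                          pseudo_sentences: Iterable[Pair],
                          pseudo_phrases: Iterable[Pair],
                          score: Callable[[str, str], float],
                          threshold: float) -> List[Pair]:
    """Augment the parallel corpus with filtered sentence and phrase pairs."""
    kept_sentences = filter_pairs(pseudo_sentences, score, threshold)
    kept_phrases = filter_pairs(pseudo_phrases, score, threshold)

    # Concatenate, dropping exact duplicates while preserving order.
    seen = set()
    corpus: List[Pair] = []
    for pair in list(parallel) + kept_sentences + kept_phrases:
        if pair not in seen:
            seen.add(pair)
            corpus.append(pair)
    return corpus
```

In practice the toy score would be replaced by a real sentence-pair quality measure, and the resulting list would be written out as the training set for the NMT model.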


research
06/06/2023

"A Little is Enough": Few-Shot Quality Estimation based Corpus Filtering improves Machine Translation

Quality Estimation (QE) is the task of evaluating the quality of a trans...
research
09/19/2018

NICT's Corpus Filtering Systems for the WMT18 Parallel Corpus Filtering Task

This paper presents the NICT's participation in the WMT18 shared paralle...
research
10/17/2017

Paying Attention to Multi-Word Expressions in Neural Machine Translation

Processing of multi-word expressions (MWEs) is a known problem for any n...
research
10/17/2020

A Corpus for English-Japanese Multimodal Neural Machine Translation with Comparable Sentences

Multimodal neural machine translation (NMT) has become an increasingly i...
research
06/09/2020

An Augmented Translation Technique for low Resource language pair: Sanskrit to Hindi translation

Neural Machine Translation (NMT) is an ongoing technique for Machine Tra...
research
04/04/2019

Extract and Edit: An Alternative to Back-Translation for Unsupervised Neural Machine Translation

The overreliance on large parallel corpora significantly limits the appl...
research
03/11/2021

Learning Feature Weights using Reward Modeling for Denoising Parallel Corpora

Large web-crawled corpora represent an excellent resource for improving ...
