Selecting Backtranslated Data from Multiple Sources for Improved Neural Machine Translation

05/01/2020
by   Xabier Soto, et al.

Machine translation (MT) has benefited from synthetic training data produced by translating monolingual corpora, a technique known as backtranslation. Combining backtranslated data from different sources has led to better results than using such data in isolation. In this work we analyse the impact that data translated with rule-based, phrase-based statistical and neural MT systems has on new MT systems. We test different backtranslation scenarios on a real-world low-resource use case (Basque-to-Spanish in the clinical domain) and on a high-resource language pair (German-to-English), and we employ data selection to optimise the synthetic corpora. We exploit different data selection strategies to reduce the amount of data used while maintaining high-quality MT systems. We further tune the data selection method by taking into account the quality of the MT systems used for backtranslation and the lexical diversity of the resulting corpora. Our experiments show that incorporating backtranslated data from different sources can be beneficial, and that applying data selection can yield improved performance.
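
The abstract does not name the selection algorithm, so the following is only an illustrative sketch of the general idea: greedily pick backtranslated sentences whose n-grams overlap with a small in-domain seed set, decay the value of n-grams that are already covered (to favour lexical diversity), and scale each score by a per-system weight standing in for the quality of the MT system that produced it. The function names, the weights, and the decay value are all assumptions made for this example, not the authors' method.

```python
from collections import Counter


def ngrams(tokens, max_n=2):
    """Return all n-grams of order 1..max_n as tuples."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def select_backtranslated(candidates, seed_sentences, source_weight,
                          budget, decay=0.5):
    """Greedy selection sketch (hypothetical helper, not from the paper).

    candidates: list of (system_name, sentence) pairs from different
        backtranslation systems, e.g. rule-based, statistical, neural.
    seed_sentences: in-domain sentences whose n-grams we want to cover.
    source_weight: assumed quality weight per system (default 1.0).
    budget: maximum number of sentences to keep.
    decay: factor applied each time an n-gram has already been selected.
    """
    seed_features = set()
    for sentence in seed_sentences:
        seed_features.update(ngrams(sentence.split()))

    times_selected = Counter()
    remaining = list(candidates)
    selected = []

    def score(item):
        source, sentence = item
        tokens = sentence.split()
        if not tokens:
            return 0.0
        # N-grams already covered by earlier picks contribute less.
        overlap = sum(decay ** times_selected[f]
                      for f in ngrams(tokens) if f in seed_features)
        return source_weight.get(source, 1.0) * overlap / len(tokens)

    while remaining and len(selected) < budget:
        best = max(remaining, key=score)
        remaining.remove(best)
        selected.append(best)
        for f in set(ngrams(best[1].split())):
            if f in seed_features:
                times_selected[f] += 1
    return selected
```

With equal text from two systems, the higher-weighted system is preferred first, and the decay then lowers the score of further sentences that repeat the same n-grams. In practice the weights would have to come from some measured quality signal; here they are placeholders.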

