HintedBT: Augmenting Back-Translation with Quality and Transliteration Hints

09/09/2021
by   Sahana Ramnath, et al.

Back-translation (BT) of target monolingual corpora is a widely used data augmentation strategy for neural machine translation (NMT), especially for low-resource language pairs. To improve the effectiveness of the available BT data, we introduce HintedBT, a family of techniques that provide hints (through tags) to the encoder and decoder. First, we propose a novel method of using both high- and low-quality BT data by providing hints (as source tags on the encoder) to the model about the quality of each source-target pair. We do not filter out low-quality data; instead, we show that these hints enable the model to learn effectively from noisy data. Second, we address the problem of predicting whether a source token needs to be translated or transliterated to the target language, which is common in cross-script translation tasks (i.e., tasks where the source and target do not share a written script). For such cases, we propose training the model with additional hints (as target tags on the decoder) that provide information about the operation required on the source (translation only, or both translation and transliteration). We conduct experiments and detailed analyses on standard WMT benchmarks for three cross-script low/medium-resource language pairs: Hindi, Gujarati, and Tamil to English. Our methods compare favorably with five strong and well-established baselines. We show that using these hints, both separately and together, significantly improves translation quality and leads to state-of-the-art performance in all three language pairs in the corresponding bilingual settings.
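The two kinds of hints described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tag strings (`<q1>`..`<q4>`, `<txn>`, `<both>`), the assumption that each pair carries a quality score in [0, 1], the equal-width binning, and the `needs_transliteration` flag are all hypothetical choices made for illustration. The sketch only shows where each tag would be attached: the quality tag on the source (encoder) side, the operation tag on the target (decoder) side.

```python
from dataclasses import dataclass


@dataclass
class BackTranslatedPair:
    source: str                  # synthetic source produced by back-translation
    target: str                  # original monolingual target sentence
    quality: float               # hypothetical quality score in [0, 1]
    needs_transliteration: bool  # hypothetical flag: some source tokens must be transliterated


def quality_tag(score: float, num_bins: int = 4) -> str:
    """Map a score in [0, 1] to one of num_bins equal-width bins (an assumed scheme)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    # A score of exactly 1.0 is clamped into the top bin.
    index = min(int(score * num_bins), num_bins - 1)
    return f"<q{index + 1}>"


def operation_tag(needs_transliteration: bool) -> str:
    """Hypothetical target-side tag: translation only, or translation plus transliteration."""
    return "<both>" if needs_transliteration else "<txn>"


def add_hints(pair: BackTranslatedPair, num_bins: int = 4) -> tuple[str, str]:
    """Prepend the quality tag to the source and the operation tag to the target."""
    tagged_source = f"{quality_tag(pair.quality, num_bins)} {pair.source}"
    tagged_target = f"{operation_tag(pair.needs_transliteration)} {pair.target}"
    return tagged_source, tagged_target
```

In a setup like this, the decoder would be trained to emit the operation tag as its first output token, so the tag would typically be stripped from the model output before scoring; that step, and how the quality score and transliteration flag are computed, are outside the scope of this sketch.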


