Multi-Sentence Resampling: A Simple Approach to Alleviate Dataset Length Bias and Beam-Search Degradation

09/13/2021
by Ivan Provilkov, et al.

Neural Machine Translation (NMT) is known to suffer from a beam-search problem: after a certain point, increasing beam size causes an overall drop in translation quality. This effect is especially pronounced for long sentences. While much work has been done analyzing this phenomenon, primarily for autoregressive NMT models, there is still no consensus on its underlying cause. In this work, we analyze errors that cause major quality degradation with large beams in NMT and Automatic Speech Recognition (ASR). We show that a factor that strongly contributes to this degradation is dataset length bias: NMT datasets are strongly biased towards short sentences. To mitigate this issue, we propose a new data augmentation technique, Multi-Sentence Resampling (MSR). This technique extends the training examples by concatenating several sentences from the original dataset to make a long training example. We demonstrate that MSR significantly reduces degradation with growing beam size and improves final translation quality on the IWSLT15 En-Vi, IWSLT17 En-Fr, and WMT14 En-De datasets.
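
The concatenation step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify how sentences are selected or how many are joined, so the choices here are assumptions made for illustration only: the function name `multi_sentence_resample`, the `max_sentences` and `seed` parameters, a uniform draw of the sentence count, sampling with replacement, and joining with a single space.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def multi_sentence_resample(
    pairs: Sequence[Pair],
    num_examples: int,
    max_sentences: int = 4,
    seed: int = 0,
) -> List[Pair]:
    """Build training examples by concatenating randomly drawn sentence pairs.

    Hypothetical helper: the sampling scheme is an assumption, not the
    procedure from the paper.
    """
    if not pairs:
        raise ValueError("pairs must be non-empty")
    if max_sentences < 1:
        raise ValueError("max_sentences must be at least 1")

    rng = random.Random(seed)
    resampled: List[Pair] = []
    for _ in range(num_examples):
        # Assumption: sentence count is uniform on [1, max_sentences].
        k = rng.randint(1, max_sentences)
        # Assumption: pairs are drawn independently, with replacement.
        chosen = [rng.choice(pairs) for _ in range(k)]
        # Join sources and targets in the same order so they stay aligned.
        source = " ".join(src for src, _ in chosen)
        target = " ".join(tgt for _, tgt in chosen)
        resampled.append((source, target))
    return resampled


if __name__ == "__main__":
    corpus = [("Hello .", "Bonjour ."), ("Thank you .", "Merci .")]
    for src, tgt in multi_sentence_resample(corpus, num_examples=3, max_sentences=2):
        print(src, "|||", tgt)
```

Because source and target sentences are concatenated in the same order, each long example remains a valid translation pair, while the length distribution of the training data shifts towards longer sequences.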

