Understanding and Improving Lexical Choice in Non-Autoregressive Translation

12/29/2020
by Liang Ding, et al.

Knowledge distillation (KD) is essential for training non-autoregressive translation (NAT) models: an autoregressive teacher model reduces the complexity of the raw data. In this study, we empirically show that, as a side effect of this training, the teacher model's lexical choice errors on low-frequency words are propagated to the NAT model. To alleviate this problem, we propose to expose the raw data to NAT models in order to restore the useful information about low-frequency words that is missing from the distilled data. To this end, we introduce an extra Kullback-Leibler divergence term derived by comparing the lexical choices of the NAT model with those embedded in the raw data. Experimental results across language pairs and model architectures demonstrate the effectiveness and universality of the proposed approach. Extensive analyses confirm our claim that our approach improves performance by reducing lexical choice errors on low-frequency words. Encouragingly, our approach pushes the SOTA NAT performance on the WMT14 English-German and WMT16 Romanian-English datasets up to 27.8 and 33.8 BLEU points, respectively. The source code will be released.
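To make the extra Kullback-Leibler term more concrete, below is a minimal sketch in PyTorch of how such a loss could be combined with the usual training objective on distilled targets. It assumes a per-position lexical prior over the vocabulary estimated from the raw (non-distilled) parallel data, e.g. from word-alignment statistics; all names here (lexical_kl_loss, raw_prior, lambda_kl) are illustrative assumptions, not the authors' released code.

```python
# Sketch only: an auxiliary KL term that pulls the NAT model's lexical-choice
# distribution toward a prior estimated from the raw parallel data.
import torch
import torch.nn.functional as F

def lexical_kl_loss(nat_logits, raw_prior, pad_mask=None, eps=1e-8):
    """KL(raw_prior || NAT prediction), averaged over non-pad target positions.

    nat_logits: (batch, tgt_len, vocab) unnormalized scores from the NAT decoder.
    raw_prior:  (batch, tgt_len, vocab) target-word probabilities estimated from
                the raw parallel data (e.g. via word-alignment statistics).
    pad_mask:   (batch, tgt_len) bool, True at padding positions.
    """
    log_q = F.log_softmax(nat_logits, dim=-1)                       # NAT lexical choice (log-probs)
    kl = (raw_prior * (torch.log(raw_prior + eps) - log_q)).sum(-1) # per-position KL divergence
    if pad_mask is not None:
        kl = kl.masked_fill(pad_mask, 0.0)
        return kl.sum() / (~pad_mask).sum().clamp(min=1)
    return kl.mean()

def nat_loss(nat_logits, distilled_targets, raw_prior, pad_mask, lambda_kl=0.5):
    """Hypothetical combined objective: standard KD cross-entropy plus the KL prior."""
    ce = F.cross_entropy(nat_logits.transpose(1, 2), distilled_targets,
                         ignore_index=-100)                         # usual loss on distilled data
    return ce + lambda_kl * lexical_kl_loss(nat_logits, raw_prior, pad_mask)
```

The weight lambda_kl is a placeholder; in practice it would be tuned so that the raw-data prior corrects low-frequency lexical choices without overriding the distilled supervision.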


Related research

06/02/2021
Rejuvenating Low-Frequency Words: Making the Most of Parallel Data in Non-Autoregressive Translation
Knowledge distillation (KD) is commonly used to construct synthetic data...

05/27/2021
How Does Distilled Data Complexity Impact the Quality and Confidence of Non-Autoregressive Machine Translation?
While non-autoregressive (NAR) models are showing great promise for mach...

04/28/2022
Neighbors Are Not Strangers: Improving Non-Autoregressive Translation under Low-Frequency Lexical Constraints
However, current autoregressive approaches suffer from high latency. In ...

03/11/2015
Is language evolution grinding to a halt? The scaling of lexical turbulence in English fiction suggests it is not
Of basic interest is the quantification of the long term growth of a lan...

12/22/2021
Self-Distillation Mixup Training for Non-autoregressive Neural Machine Translation
Recently, non-autoregressive (NAT) models predict outputs in parallel, a...

06/07/2022
DiMS: Distilling Multiple Steps of Iterative Non-Autoregressive Transformers
The computational benefits of iterative non-autoregressive transformers ...
