How Does Distilled Data Complexity Impact the Quality and Confidence of Non-Autoregressive Machine Translation?

05/27/2021
by Weijia Xu, et al.

While non-autoregressive (NAR) models are showing great promise for machine translation, their use is limited by their dependence on knowledge distillation from autoregressive models. To address this issue, we seek to understand why distillation is so effective. Prior work suggests that distilled training data is less complex than manual translations. Based on experiments with the Levenshtein Transformer and Mask-Predict NAR models on the WMT14 German-English task, this paper shows that different types of complexity have different impacts. Reducing lexical diversity and decreasing reordering complexity both help NAR models learn better alignment between source and target, and thus improve translation quality. However, reduced lexical diversity is the main reason why distillation increases model confidence, which affects the calibration of different NAR models differently.
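
The analysis hinges on quantifying two properties of the training data: how varied its target-side vocabulary is (lexical diversity) and how much word reordering it requires between source and target. As a rough illustration only, the sketch below computes a type-token ratio and token entropy as lexical-diversity proxies, and the fraction of crossing alignment links as a reordering proxy; these are common measures, not necessarily the exact metrics used in the paper, and the toy sentences and the fast_align mention are illustrative assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def lexical_diversity(sentences):
    """Corpus-level lexical diversity: type-token ratio and token entropy.
    Distilled data typically scores lower on both, reflecting a smaller,
    more repetitive target-side vocabulary."""
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts.values())
    ttr = len(counts) / total
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return ttr, entropy

def crossing_rate(alignment):
    """Fraction of alignment-link pairs that cross: a simple proxy for
    reordering complexity. `alignment` is a list of (src_idx, tgt_idx)
    pairs, e.g. produced by an external word aligner such as fast_align."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        return 0.0
    crossings = sum(1 for (s1, t1), (s2, t2) in pairs
                    if (s1 - s2) * (t1 - t2) < 0)
    return crossings / len(pairs)

# Toy comparison: reference translations vs. distilled (teacher) outputs.
references = [["the", "cat", "sat"], ["a", "feline", "rested"]]
distilled = [["the", "cat", "sat"], ["the", "cat", "rested"]]
print(lexical_diversity(references))  # higher TTR / entropy
print(lexical_diversity(distilled))   # lower TTR / entropy
print(crossing_rate([(0, 0), (1, 2), (2, 1)]))  # one crossing pair -> 0.33
```

In the paper's setting, such statistics would be computed over the WMT14 German-English training targets before and after replacing them with the autoregressive teacher's outputs; lower diversity and fewer crossings in the distilled corpus are the properties the analysis links to better alignment learning and higher model confidence.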

