Selective Knowledge Distillation for Non-Autoregressive Neural Machine Translation

03/31/2023
by Min Liu, et al.

Benefiting from sequence-level knowledge distillation, the Non-Autoregressive Transformer (NAT) achieves great success in neural machine translation tasks. However, existing knowledge distillation has side effects, such as propagating errors from the teacher to NAT students, which may limit further improvements of NAT models and are rarely discussed in existing research. In this paper, we introduce selective knowledge distillation, which uses an NAT evaluator to select NAT-friendly targets that are of high quality and easy to learn. In addition, we propose a simple yet effective progressive distillation method to boost NAT performance. Experimental results on multiple WMT language directions and several representative NAT models show that our approach can realize a flexible trade-off between the quality and complexity of training data for NAT models, achieving strong performance. Further analysis shows that distilling only 5% of the raw translations can help an NAT model outperform its counterpart trained on raw data by about 2.4 BLEU.
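As a rough illustration of the selection idea sketched in the abstract, the Python snippet below filters training targets with a scoring function standing in for the NAT evaluator. The `Example` fields, `select_targets`, `nat_score`, the sort-and-threshold rule, and the distillation ratio are all illustrative assumptions, not the authors' released implementation; they only show one plausible way to trade raw references against teacher-distilled targets on a per-sentence basis.

```python
# Minimal sketch of selective sequence-level distillation for NAT training data.
# All names, the scoring rule, and the ratio are illustrative assumptions.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Example:
    source: str            # source sentence
    raw_target: str        # original reference translation
    distilled_target: str  # translation produced by the autoregressive teacher


def select_targets(
    data: List[Example],
    nat_score: Callable[[str, str], float],  # higher = more NAT-friendly target
    distill_ratio: float = 0.05,             # e.g. distill only 5% of the corpus
) -> List[Tuple[str, str]]:
    """Pick, per example, either the raw or the distilled target.

    Examples whose raw reference the evaluator scores as hardest for the NAT
    student keep the teacher output; the rest keep the raw reference, trading
    off the quality and complexity of the training data.
    """
    # Score how NAT-friendly each raw reference is, least friendly first.
    scored = sorted(data, key=lambda ex: nat_score(ex.source, ex.raw_target))
    cutoff = int(len(scored) * distill_ratio)
    selected = []
    for i, ex in enumerate(scored):
        # The lowest-scoring fraction is replaced by distilled targets.
        target = ex.distilled_target if i < cutoff else ex.raw_target
        selected.append((ex.source, target))
    return selected


if __name__ == "__main__":
    corpus = [
        Example("ein Haus", "a house", "a house"),
        Example("sehr komplexer Satz", "a rather convoluted rendering", "a very complex sentence"),
    ]
    # Toy stand-in for the NAT evaluator: shorter references count as easier.
    pairs = select_targets(corpus, nat_score=lambda src, tgt: -len(tgt), distill_ratio=0.5)
    for src, tgt in pairs:
        print(src, "->", tgt)
```

A progressive schedule in the spirit of the abstract could be approximated by calling `select_targets` with a gradually changing `distill_ratio` over training stages; again, this is a sketch of the general idea rather than the paper's method.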


Related research

07/29/2021 · Using Perturbed Length-aware Positional Encoding for Non-autoregressive Neural Machine Translation
Non-autoregressive neural machine translation (NAT) usually employs sequ...

11/07/2019 · Understanding Knowledge Distillation in Non-autoregressive Machine Translation
Non-autoregressive machine translation (NAT) systems predict a sequence ...

12/22/2021 · Self-Distillation Mixup Training for Non-autoregressive Neural Machine Translation
Recently, non-autoregressive (NAT) models predict outputs in parallel, a...

05/27/2021 · How Does Distilled Data Complexity Impact the Quality and Confidence of Non-Autoregressive Machine Translation?
While non-autoregressive (NAR) models are showing great promise for mach...

04/19/2023 · An Empirical Study of Leveraging Knowledge Distillation for Compressing Multilingual Neural Machine Translation Models
Knowledge distillation (KD) is a well-known method for compressing neura...

05/28/2022 · One Reference Is Not Enough: Diverse Distillation with Reference Selection for Non-Autoregressive Translation
Non-autoregressive neural machine translation (NAT) suffers from the mul...

09/15/2020 · Autoregressive Knowledge Distillation through Imitation Learning
The performance of autoregressive models on natural language generation ...
