Self-Distillation Mixup Training for Non-autoregressive Neural Machine Translation

12/22/2021
by Jiaxin Guo, et al.

Recently, non-autoregressive (NAT) models, which predict outputs in parallel, have achieved substantial improvements in generation speed compared to autoregressive (AT) models. Because they perform worse when trained on raw data, most NAT models are trained as student models on distilled data generated by an AT teacher model, a procedure known as sequence-level Knowledge Distillation. An effective training strategy for improving AT models is Self-Distillation Mixup (SDM) Training, which pre-trains a model on raw data, generates distilled data with the pre-trained model itself, and finally re-trains the model on the combination of raw and distilled data. In this work, we aim to transfer SDM to NAT models, but find that directly adopting SDM for NAT models yields no improvement in translation quality. Through careful analysis, we observe that this failure is correlated with the Modeling Diversity and Confirmation Bias between the AT teacher model and the NAT student models. Based on these findings, we propose an enhanced strategy named SDMRT, which adds two stages to classic SDM: Pre-Rerank on self-distilled data and Fine-Tune on filtered teacher-distilled data. Our approach outperforms baselines by 0.6 to 1.2 BLEU on multiple NAT models. As an additional benefit, for Iterative Refinement NAT models our method outperforms baselines using half the number of refinement iterations, i.e., a 2X acceleration.
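
To make the pipeline concrete, the following is a minimal sketch of the two training loops described above. It is a sketch under assumptions, not code from the paper: all callables (train, finetune, translate, rerank, keep_pair) are hypothetical placeholders for a real NMT training stack.

```python
# Hypothetical sketch of the SDM and SDMRT training pipelines (not the authors' code).
from typing import Any, Callable, List, Tuple

Pair = Tuple[str, str]   # (source sentence, target sentence)
Corpus = List[Pair]
Model = Any              # stand-in for an actual AT/NAT model object


def sdm(raw: Corpus,
        train: Callable[[Corpus], Model],
        translate: Callable[[Model, List[str]], List[str]]) -> Model:
    """Classic Self-Distillation Mixup: pre-train, self-distill, re-train."""
    pretrained = train(raw)                                        # 1) pre-train on raw data
    srcs = [s for s, _ in raw]
    self_distilled = list(zip(srcs, translate(pretrained, srcs)))  # 2) self-distill
    return train(raw + self_distilled)                             # 3) re-train on the mixture


def sdmrt(raw: Corpus,
          teacher_distilled: Corpus,
          train: Callable[[Corpus], Model],
          finetune: Callable[[Model, Corpus], Model],
          translate: Callable[[Model, List[str]], List[str]],
          rerank: Callable[[Corpus], Corpus],
          keep_pair: Callable[[Pair], bool]) -> Model:
    """SDM plus the two added stages: Pre-Rerank and Fine-Tune on filtered data."""
    pretrained = train(raw)
    srcs = [s for s, _ in raw]
    self_distilled = list(zip(srcs, translate(pretrained, srcs)))
    self_distilled = rerank(self_distilled)                        # added stage: Pre-Rerank
    student = train(raw + self_distilled)                          # SDM-style re-training
    filtered = [p for p in teacher_distilled if keep_pair(p)]      # filter teacher-distilled data
    return finetune(student, filtered)                             # added stage: Fine-Tune
```

Here, rerank reorders or selects among the self-distilled outputs before re-training, and keep_pair decides which teacher-distilled pairs survive for the final fine-tuning stage; these placeholders mirror the Pre-Rerank and Fine-Tune stages named in the abstract.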

Related research

03/31/2023  Selective Knowledge Distillation for Non-Autoregressive Neural Machine Translation
Benefiting from the sequence-level knowledge distillation, the Non-Autor...

06/07/2022  DiMS: Distilling Multiple Steps of Iterative Non-Autoregressive Transformers
The computational benefits of iterative non-autoregressive transformers ...

09/15/2020  Autoregressive Knowledge Distillation through Imitation Learning
The performance of autoregressive models on natural language generation ...

11/07/2017  Non-Autoregressive Neural Machine Translation
Existing approaches to neural machine translation condition each output ...

12/06/2019  Explaining Sequence-Level Knowledge Distillation as Data-Augmentation for Neural Machine Translation
Sequence-level knowledge distillation (SLKD) is a model compression tech...

12/29/2020  Understanding and Improving Lexical Choice in Non-Autoregressive Translation
Knowledge distillation (KD) is essential for training non-autoregressive...

05/27/2021  How Does Distilled Data Complexity Impact the Quality and Confidence of Non-Autoregressive Machine Translation?
While non-autoregressive (NAR) models are showing great promise for mach...
