DiMS: Distilling Multiple Steps of Iterative Non-Autoregressive Transformers

06/07/2022
by Sajad Norouzi, et al.

The computational benefits of iterative non-autoregressive transformers decrease as the number of decoding steps increases. As a remedy, we introduce Distill Multiple Steps (DiMS), a simple yet effective distillation technique to decrease the number of steps required to reach a certain translation quality. The distilled model enjoys the computational benefits of early iterations while preserving the enhancements from several iterative steps. DiMS relies on two models, namely a student and a teacher. The student is optimized to predict the output of the teacher after multiple decoding steps, while the teacher follows the student via a slow-moving average. The moving average keeps the teacher's knowledge up to date and enhances the quality of the labels provided by the teacher. During inference, only the student is used for translation, so no additional computation is introduced. We verify the effectiveness of DiMS on various models, obtaining improvements of up to 7 BLEU points on distilled and 12 BLEU points on raw WMT datasets for single-step translation. We release our code at https://github.com/layer6ai-labs/DiMS.
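The abstract outlines a student-teacher loop: the teacher refines a translation over several decoding steps to produce training targets, the student learns to reach that output in a single step, and the teacher tracks the student through an exponential moving average. The following is a minimal PyTorch sketch of that loop, not the authors' released implementation; the decode_step and decode_logits methods are hypothetical stand-ins for an iterative NAT model's refinement and single-pass interfaces.

import torch
import torch.nn.functional as F

def ema_update(teacher, student, decay=0.999):
    # Teacher follows the student via a slow-moving (exponential) average of parameters.
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(decay).add_(s_p, alpha=1.0 - decay)

def dims_step(student, teacher, src, init_tokens, optimizer, n_teacher_steps=2):
    # One DiMS update: the student is trained to predict, in a single step,
    # what the teacher produces after several iterative decoding steps.
    # teacher is assumed to start as a copy of the student (e.g. copy.deepcopy(student)).
    with torch.no_grad():
        tgt = init_tokens
        for _ in range(n_teacher_steps):
            tgt = teacher.decode_step(src, tgt)   # hypothetical iterative-refinement call

    logits = student.decode_logits(src, init_tokens)  # hypothetical single-pass forward
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tgt.reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    ema_update(teacher, student)
    return loss.item()

At inference time only the student is run, so the cost per sentence is that of a single decoding pass regardless of how many teacher steps were distilled during training.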


