Autoregressive Knowledge Distillation through Imitation Learning

09/15/2020
by Alexander Lin, et al.

The performance of autoregressive models on natural language generation tasks has dramatically improved due to the adoption of deep, self-attentive architectures. However, these gains have come at the cost of slower inference, making state-of-the-art models cumbersome to deploy in real-world, time-sensitive settings. We develop a compression technique for autoregressive models that is driven by an imitation learning perspective on knowledge distillation; the algorithm is designed to address the exposure bias problem. On prototypical language generation tasks such as translation and summarization, our method consistently outperforms other distillation algorithms, such as sequence-level knowledge distillation. Student models trained with our method attain BLEU/ROUGE scores 1.4 to 4.8 points higher than those trained from scratch, while increasing inference speed by up to 14 times relative to the teacher model.
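As an illustration of the general idea (a minimal sketch, not the paper's exact algorithm), the PyTorch snippet below shows one DAgger-style distillation step: the student rolls out part of the decoding prefix itself, so training states reflect its own mistakes rather than only ground-truth prefixes (the exposure bias issue), and the teacher's next-token distribution serves as the imitation target at every position. The model interfaces `teacher(src, prefix)` and `student(src, prefix)`, the mixing ratio `beta`, and `BOS_ID` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

BOS_ID = 0  # placeholder start-of-sequence token id (assumption)

@torch.no_grad()
def build_mixed_prefix(teacher, student, src, max_len, beta):
    """Roll out a decoding prefix, drawing each next token from the teacher
    with probability beta and from the student otherwise, so the student is
    trained on states it would itself visit at inference time."""
    prefix = torch.full((src.size(0), 1), BOS_ID, dtype=torch.long, device=src.device)
    for _ in range(max_len - 1):
        picker = teacher if torch.rand(()).item() < beta else student
        logits = picker(src, prefix)[:, -1, :]           # next-token logits
        next_tok = logits.argmax(dim=-1, keepdim=True)   # greedy for simplicity
        prefix = torch.cat([prefix, next_tok], dim=1)
    return prefix

def distill_step(teacher, student, optimizer, src, max_len=64, beta=0.5):
    """One imitation-learning-style distillation update."""
    prefix = build_mixed_prefix(teacher, student, src, max_len, beta)
    with torch.no_grad():
        teacher_logits = teacher(src, prefix)            # expert "corrections"
    student_logits = student(src, prefix)
    # Match the student's per-position distribution to the teacher's,
    # evaluated on the (partly student-generated) prefix.
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a typical DAgger-style schedule, beta would be annealed from 1 toward 0 so the student gradually takes over its own rollouts as training progresses.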

Related research

11/07/2019 - Understanding Knowledge Distillation in Non-autoregressive Machine Translation
Non-autoregressive machine translation (NAT) systems predict a sequence ...

10/05/2020 - Lifelong Language Knowledge Distillation
It is challenging to perform lifelong language learning (LLL) on a strea...

12/22/2021 - Self-Distillation Mixup Training for Non-autoregressive Neural Machine Translation
Recently, non-autoregressive (NAT) models predict outputs in parallel, a...

03/31/2023 - Selective Knowledge Distillation for Non-Autoregressive Neural Machine Translation
Benefiting from the sequence-level knowledge distillation, the Non-Autor...

03/11/2023 - Knowledge Distillation for Efficient Sequences of Training Runs
In many practical scenarios – like hyperparameter search or continual re...

03/31/2022 - Conditional Autoregressors are Interpretable Classifiers
We explore the use of class-conditional autoregressive (CA) models to pe...

07/27/2023 - f-Divergence Minimization for Sequence-Level Knowledge Distillation
Knowledge distillation (KD) is the process of transferring knowledge fro...
