Autoregressive Knowledge Distillation through Imitation Learning

09/15/2020
by Alexander Lin, et al.

The performance of autoregressive models on natural language generation tasks has dramatically improved due to the adoption of deep, self-attentive architectures. However, these gains have come at the cost of slower inference, making state-of-the-art models cumbersome to deploy in real-world, time-sensitive settings. We develop a compression technique for autoregressive models that is driven by an imitation learning perspective on knowledge distillation. The algorithm is designed to address the exposure bias problem. On prototypical language generation tasks such as translation and summarization, our method consistently outperforms other distillation algorithms, such as sequence-level knowledge distillation. Student models trained with our method score 1.4 to 4.8 BLEU/ROUGE points higher than those trained from scratch, while running up to 14 times faster than the teacher model at inference.
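To make the idea concrete, below is a minimal sketch of imitation-learning-style distillation for an autoregressive model. It is not the authors' released code: the TinyLM architecture, hyperparameters, and pure-sampling rollout are illustrative placeholders. What it demonstrates is the mechanism the abstract describes: the student is trained on prefixes it generated itself, with the fixed teacher's next-token distribution as the supervision target, so training-time inputs match the inputs the student sees at inference time (the exposure bias fix).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

VOCAB, HIDDEN, MAX_LEN, BOS = 100, 32, 12, 0

# Tiny stand-in language model (embedding + GRU + projection).
# A placeholder for illustration, not the paper's architecture.
class TinyLM(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(VOCAB, HIDDEN)
        self.rnn = torch.nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.out = torch.nn.Linear(HIDDEN, VOCAB)

    def forward(self, tokens):           # tokens: (batch, time)
        h, _ = self.rnn(self.emb(tokens))
        return self.out(h)               # logits: (batch, time, vocab)

teacher, student = TinyLM(), TinyLM()
teacher.eval()                           # teacher stays fixed
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

def distill_step(batch_size=8):
    # 1) Roll out the *student* itself, so the prefixes it trains on are
    #    the same kind of prefixes it will produce at inference time.
    tokens = torch.full((batch_size, 1), BOS, dtype=torch.long)
    with torch.no_grad():
        for _ in range(MAX_LEN - 1):
            logits = student(tokens)[:, -1]
            nxt = torch.multinomial(F.softmax(logits, dim=-1), 1)
            tokens = torch.cat([tokens, nxt], dim=1)

    # 2) Relabel every student-generated prefix with the teacher's
    #    next-token distribution and train the student to match it.
    with torch.no_grad():
        teacher_logp = F.log_softmax(teacher(tokens[:, :-1]), dim=-1)
    student_logp = F.log_softmax(student(tokens[:, :-1]), dim=-1)
    loss = F.kl_div(student_logp, teacher_logp,
                    log_target=True, reduction="batchmean")
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

for step in range(3):
    print(f"step {step}: distill loss = {distill_step():.4f}")
```

In practice, imitation-learning recipes of this kind (e.g., DAgger-style schemes) typically mix student rollouts with ground-truth or teacher-generated sequences rather than training on student samples alone; that mixture, along with the conditioning on a source sentence needed for translation or summarization, is omitted here for brevity.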



Related research

11/07/2019 · Understanding Knowledge Distillation in Non-autoregressive Machine Translation
Non-autoregressive machine translation (NAT) systems predict a sequence ...

10/05/2020 · Lifelong Language Knowledge Distillation
It is challenging to perform lifelong language learning (LLL) on a strea...

12/22/2021 · Self-Distillation Mixup Training for Non-autoregressive Neural Machine Translation
Recently, non-autoregressive (NAT) models predict outputs in parallel, a...

06/25/2016 · Sequence-Level Knowledge Distillation
Neural machine translation (NMT) offers a novel alternative formulation ...

12/06/2019 · Explaining Sequence-Level Knowledge Distillation as Data-Augmentation for Neural Machine Translation
Sequence-level knowledge distillation (SLKD) is a model compression tech...

03/26/2021 · Hands-on Guidance for Distilling Object Detectors
Knowledge distillation can lead to deploy-friendly networks against the ...

09/06/2018 · RDPD: Rich Data Helps Poor Data via Imitation
In many situations, we have both rich- and poor- data environments: in a...