Causal Distillation for Language Models

12/05/2021
by Zhengxuan Wu, et al.

Distillation efforts have led to language models that are more compact and efficient without serious drops in performance. The standard approach to distillation trains a student model against two objectives: a task-specific objective (e.g., language modeling) and an imitation objective that encourages the hidden states of the student model to be similar to those of the larger teacher model. In this paper, we show that it is beneficial to augment distillation with a third objective that encourages the student to imitate the causal computation process of the teacher through interchange intervention training (IIT). IIT pushes the student model to become a causal abstraction of the teacher model: a simpler model with the same causal structure. IIT is fully differentiable, easily implemented, and combines flexibly with other objectives. Compared with standard distillation of BERT, distillation via IIT results in lower perplexity on Wikipedia (masked language modeling) and marked improvements on the GLUE benchmark (natural language understanding), SQuAD (question answering), and CoNLL-2003 (named entity recognition).
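
The third objective can be pictured as an interchange intervention: run the teacher on a "source" input, splice the hidden state it produces at a chosen location into a run on a "base" input, apply the same swap at an aligned location in the student, and train the student's counterfactual output distribution toward the teacher's. The sketch below illustrates this idea on toy PyTorch models; the architectures, layer alignment, temperature, and KL-based loss are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of an interchange-intervention distillation objective on toy
# models. Everything here (models, layer pairing, temperature) is assumed for
# illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


def counterfactual_logits(model, layer, base_x, source_x):
    """Run `model` on `base_x` with `layer`'s activation swapped in from a
    run on `source_x` (an interchange intervention on that model)."""
    captured = {}

    # Pass 1: record the activation that `layer` produces on the source input.
    def capture(_module, _inputs, output):
        captured["h"] = output

    handle = layer.register_forward_hook(capture)
    model(source_x)
    handle.remove()

    # Pass 2: replace `layer`'s activation during the base-input run.
    def swap(_module, _inputs, _output):
        return captured["h"]

    handle = layer.register_forward_hook(swap)
    logits = model(base_x)
    handle.remove()
    return logits


def iit_distillation_loss(teacher, teacher_layer, student, student_layer,
                          base_x, source_x, temperature=2.0):
    """KL divergence between the student's and teacher's output distributions
    after the same interchange intervention is applied to each model."""
    with torch.no_grad():  # the teacher stays frozen
        t_logits = counterfactual_logits(teacher, teacher_layer, base_x, source_x)
    s_logits = counterfactual_logits(student, student_layer, base_x, source_x)
    t_probs = F.softmax(t_logits / temperature, dim=-1)
    s_logp = F.log_softmax(s_logits / temperature, dim=-1)
    return F.kl_div(s_logp, t_probs, reduction="batchmean") * temperature ** 2


if __name__ == "__main__":
    torch.manual_seed(0)
    # Toy stand-ins for the larger teacher and the smaller student.
    teacher = nn.Sequential(nn.Linear(16, 64), nn.ReLU(),
                            nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
    student = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))
    base, source = torch.randn(8, 16), torch.randn(8, 16)
    # Align the teacher's second hidden layer with the student's only one.
    loss = iit_distillation_loss(teacher, teacher[3], student, student[1],
                                 base, source)
    loss.backward()
    print(float(loss))
```

In a full training setup this term would be added to the usual task-specific and hidden-state imitation losses; the fixed layer pairing above stands in for whatever alignment is chosen between teacher and student components.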

