Causal Distillation for Language Models

12/05/2021
by Zhengxuan Wu, et al.

Distillation efforts have led to language models that are more compact and efficient without serious drops in performance. The standard approach to distillation trains a student model against two objectives: a task-specific objective (e.g., language modeling) and an imitation objective that encourages the hidden states of the student model to be similar to those of the larger teacher model. In this paper, we show that it is beneficial to augment distillation with a third objective that encourages the student to imitate the causal computation process of the teacher through interchange intervention training (IIT). IIT pushes the student model to become a causal abstraction of the teacher model: a simpler model with the same causal structure. IIT is fully differentiable, easily implemented, and combines flexibly with other objectives. Compared with standard distillation of BERT, distillation via IIT results in lower perplexity on Wikipedia (masked language modeling) and marked improvements on the GLUE benchmark (natural language understanding), SQuAD (question answering), and CoNLL-2003 (named entity recognition).
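
To make the interchange-intervention idea concrete, here is a minimal, hypothetical PyTorch sketch of an IIT-style distillation term: hidden states computed on a "source" input are swapped into the forward pass on a "base" input in both the teacher and the student, and the student is trained so that its counterfactual output matches the teacher's. The toy TinyEncoder model, the single layer chosen for intervention, and the KL objective are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of an interchange-intervention distillation objective.
# Illustrative toy only: model shapes, the intervention site, and the
# loss choice are assumptions, not the paper's released code.

import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoder(nn.Module):
    """A stand-in for a transformer encoder: two blocks with an
    intermediate hidden state we can intervene on."""

    def __init__(self, hidden=64, vocab=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.block1 = nn.Linear(hidden, hidden)
        self.block2 = nn.Linear(hidden, hidden)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, ids, swap_in=None):
        h = torch.relu(self.block1(self.embed(ids)))
        if swap_in is not None:
            # Interchange intervention: replace this run's intermediate
            # representation with one taken from a "source" input.
            h = swap_in
        h2 = torch.relu(self.block2(h))
        return self.head(h2), h


def iit_distillation_loss(teacher, student, base_ids, source_ids):
    """Counterfactual consistency: after swapping the source run's hidden
    state into the base run (in both models), the student's output
    distribution should match the teacher's."""
    with torch.no_grad():
        _, t_src_hidden = teacher(source_ids)
        t_logits, _ = teacher(base_ids, swap_in=t_src_hidden)
    _, s_src_hidden = student(source_ids)
    s_logits, _ = student(base_ids, swap_in=s_src_hidden)
    return F.kl_div(
        F.log_softmax(s_logits, dim=-1),
        F.softmax(t_logits, dim=-1),
        reduction="batchmean",
    )


if __name__ == "__main__":
    teacher, student = TinyEncoder(), TinyEncoder()
    base = torch.randint(0, 1000, (8, 16))
    source = torch.randint(0, 1000, (8, 16))
    loss = iit_distillation_loss(teacher, student, base, source)
    loss.backward()  # gradients flow only into the student
    print(loss.item())
```

In practice this counterfactual term would be added to the task-specific loss and the standard hidden-state imitation loss, rather than used on its own.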

Related research

05/24/2023
Large Language Model Distillation Doesn't Need a Teacher
Knowledge distillation trains a smaller student model to match the outpu...

11/01/2022
Reduce, Reuse, Recycle: Improving Training Efficiency with Distillation
Methods for improving the efficiency of deep network training (i.e. the ...

05/24/2023
How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives
Recently, various intermediate layer distillation (ILD) objectives have ...

01/25/2020
Generation-Distillation for Efficient Natural Language Understanding in Low-Data Settings
Over the past year, the emergence of transfer learning with large-scale ...

10/10/2022
Knowledge Distillation Transfer Sets and their Impact on Downstream NLU Tasks
Teacher-student knowledge distillation is a popular technique for compre...

11/01/2020
MixKD: Towards Efficient Distillation of Large-scale Language Models
Large-scale language models have recently demonstrated impressive empiri...

09/23/2021
Distiller: A Systematic Study of Model Distillation Methods in Natural Language Processing
We aim to identify how different components in the KD pipeline affect th...
