Distiller: A Systematic Study of Model Distillation Methods in Natural Language Processing

09/23/2021
by Haoyu He, et al.

We aim to identify how different components of the knowledge distillation (KD) pipeline, such as the data augmentation policy, the loss function, and the intermediate representation used to transfer knowledge from teacher to student, affect the resulting performance, and how much the optimal KD pipeline varies across datasets and tasks. To tease apart these effects, we propose Distiller, a meta-KD framework that systematically combines a broad range of techniques across the stages of the KD pipeline, enabling us to quantify each component's contribution. Within Distiller, we unify commonly used objectives for distilling intermediate representations under a universal mutual information (MI) objective and propose a class of MI-α objective functions with a better bias/variance trade-off for estimating the MI between the teacher and the student. On a diverse set of NLP datasets, the best Distiller configurations are identified via large-scale hyperparameter optimization. Our experiments reveal that: 1) the approach used to distill the intermediate representations is the most important factor in KD performance; 2) among the objectives for intermediate distillation, MI-α performs best; and 3) data augmentation provides a large boost for small training datasets or small student networks. Moreover, we find that different datasets/tasks prefer different KD algorithms, and thus propose a simple AutoDistiller algorithm that can recommend a good KD pipeline for a new dataset.
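To make the intermediate-representation objective concrete, the sketch below shows one common way to instantiate an MI-based distillation loss: an InfoNCE-style contrastive lower bound on the mutual information between a teacher layer and a student layer, computed with in-batch positives and negatives. The projection heads, temperature, layer choice, and loss weights here are illustrative assumptions for this sketch, not the paper's exact MI-α formulation.

import torch
import torch.nn.functional as F

class IntermediateMIDistiller(torch.nn.Module):
    """InfoNCE-style contrastive bound on teacher/student MI (illustrative sketch)."""

    def __init__(self, student_dim, teacher_dim, proj_dim=128, temperature=0.1):
        super().__init__()
        # Linear heads map teacher and student hidden states into a shared space.
        self.student_proj = torch.nn.Linear(student_dim, proj_dim)
        self.teacher_proj = torch.nn.Linear(teacher_dim, proj_dim)
        self.temperature = temperature

    def forward(self, student_hidden, teacher_hidden):
        # student_hidden: (batch, student_dim); teacher_hidden: (batch, teacher_dim),
        # e.g. the [CLS] vectors of one chosen layer from each model.
        s = F.normalize(self.student_proj(student_hidden), dim=-1)
        t = F.normalize(self.teacher_proj(teacher_hidden), dim=-1)
        # Pairwise similarities: matching (i, i) pairs are positives,
        # every other in-batch pair serves as a negative.
        logits = s @ t.T / self.temperature
        targets = torch.arange(s.size(0), device=s.device)
        # Cross-entropy over in-batch candidates yields the InfoNCE MI lower bound.
        return F.cross_entropy(logits, targets)

# Typical usage (assumed weighting, not the paper's): add the MI term to the usual
# hard-label and soft-target losses, e.g.
# total_loss = ce_loss + lambda_kd * kl_loss + lambda_mi * mi_loss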

