Marginal Utility Diminishes: Exploring the Minimum Knowledge for BERT Knowledge Distillation

06/10/2021
by Yuanxin Liu, et al.

Recently, knowledge distillation (KD) has shown great success in BERT compression. Instead of only learning from the teacher's soft labels as in conventional KD, researchers have found that the rich information contained in BERT's hidden layers is beneficial to the student's performance. To better exploit this hidden knowledge, a common practice is to force the student to deeply mimic the teacher's hidden states for all tokens in a layer-wise manner. In this paper, however, we observe that although distilling the teacher's hidden state knowledge (HSK) is helpful, the performance gain (marginal utility) diminishes quickly as more HSK is distilled. To understand this effect, we conduct a series of analyses. Specifically, we divide the HSK of BERT into three dimensions, namely depth, length, and width. We first investigate a variety of strategies for extracting crucial knowledge along each individual dimension and then jointly compress the three dimensions. In this way, we show that 1) the student's performance can be improved by extracting and distilling the crucial HSK, and 2) using a tiny fraction of HSK can achieve the same performance as extensive HSK distillation. Based on the second finding, we further propose an efficient KD paradigm for compressing BERT that does not require loading the teacher during student training. Across two kinds of student models and computing devices, the proposed KD paradigm yields a training speedup of 2.7x to 3.4x.
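
The abstract describes two mechanisms that lend themselves to a short sketch: slicing the teacher's hidden-state knowledge along the depth, length, and width dimensions, and caching that compressed HSK offline so the teacher model never has to be loaded while the student trains. The sketch below is a minimal PyTorch illustration under assumptions of ours, not the authors' code: the helper names (compress_hsk, hsk_distillation_loss), the particular layers/tokens/dimensions kept, the plain MSE objective, and the linear projection that aligns student and teacher widths are illustrative placeholders for the extraction strategies studied in the paper.

```python
# Minimal sketch (not the authors' code) of the two ideas in the abstract:
#  1) compress the teacher's hidden-state knowledge (HSK) along depth/length/width,
#  2) cache the compressed HSK offline so the teacher is never loaded
#     while the student trains.
# Layer/token/dimension selection strategies here are illustrative placeholders.
import torch
import torch.nn.functional as F


def compress_hsk(teacher_hidden_states, keep_layers, keep_tokens, keep_dims):
    """Keep a small slice of the teacher HSK along depth, length, and width.

    teacher_hidden_states: tuple of (batch, seq_len, teacher_dim) tensors,
        one per teacher layer (e.g. from `output_hidden_states=True`).
    keep_layers: indices of the layers to retain (depth).
    keep_tokens: number of leading tokens to retain (length).
    keep_dims:   number of hidden dimensions to retain (width).
    """
    compressed = []
    for layer_idx in keep_layers:
        h = teacher_hidden_states[layer_idx]          # (batch, seq_len, dim)
        h = h[:, :keep_tokens, :keep_dims]            # slice length and width
        compressed.append(h.detach().cpu())
    return torch.stack(compressed)                    # (n_layers, batch, tokens, dims)


def hsk_distillation_loss(student_hidden_states, cached_hsk, proj):
    """MSE between (projected) student hidden states and the cached teacher HSK."""
    n_layers, _, keep_tokens, keep_dims = cached_hsk.shape
    loss = 0.0
    for i in range(n_layers):
        s = student_hidden_states[i][:, :keep_tokens, :]   # align length
        s = proj(s)[..., :keep_dims]                        # align width
        loss = loss + F.mse_loss(s, cached_hsk[i].to(s.device))
    return loss / n_layers


if __name__ == "__main__":
    # Stand-in tensors so the sketch runs without downloading a real teacher.
    batch, seq_len, t_dim, s_dim = 2, 128, 768, 312
    teacher_hs = tuple(torch.randn(batch, seq_len, t_dim) for _ in range(12))

    # Offline step: extract a tiny fraction of HSK and store it to disk.
    cached = compress_hsk(teacher_hs, keep_layers=[3, 7, 11],
                          keep_tokens=16, keep_dims=256)
    torch.save(cached, "teacher_hsk.pt")

    # Training step: only the cached tensors are needed, not the teacher model.
    student_hs = [torch.randn(batch, seq_len, s_dim) for _ in range(3)]
    proj = torch.nn.Linear(s_dim, t_dim)   # map student width to teacher width
    loss = hsk_distillation_loss(student_hs, torch.load("teacher_hsk.pt"), proj)
    print(float(loss))
```

The point of the caching step is that the student's training loop only reads the small saved tensor rather than running (or even holding) the teacher, which is where the reported training speedup comes from.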

Related research

08/25/2019 · Patient Knowledge Distillation for BERT Model Compression
Pre-trained language models such as BERT have proven to be highly effect...

06/02/2021 · One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers
Pre-trained language models (PLMs) achieve great success in NLP. However...

01/30/2023 · On student-teacher deviations in distillation: does it pay to disobey?
Knowledge distillation has been widely-used to improve the performance o...

05/24/2023 · How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives
Recently, various intermediate layer distillation (ILD) objectives have ...

10/04/2022 · Less is More: Task-aware Layer-wise Distillation for Language Model Compression
Layer-wise distillation is a powerful tool to compress large models (i.e...

09/26/2021 · Partial to Whole Knowledge Distillation: Progressive Distilling Decomposed Knowledge Boosts Student Better
Knowledge distillation field delicately designs various types of knowled...

12/30/2021 · An Efficient Federated Distillation Learning System for Multi-task Time Series Classification
This paper proposes an efficient federated distillation learning system ...
