Less is More: Task-aware Layer-wise Distillation for Language Model Compression

10/04/2022
by Chen Liang, et al.

Layer-wise distillation is a powerful tool for compressing large models (i.e., teacher models) into small ones (i.e., student models). The student distills knowledge from the teacher by mimicking the teacher's hidden representations at every intermediate layer. However, layer-wise distillation is difficult: because the student has a smaller model capacity than the teacher, it often under-fits. Furthermore, the teacher's hidden representations contain redundant information that the student does not need for learning the target task. To address these challenges, we propose Task-aware layEr-wise Distillation (TED). TED designs task-aware filters to align the hidden representations of the student and the teacher at each layer. The filters select, from the hidden representations, the knowledge that is useful for the target task. As such, TED reduces the knowledge gap between the two models and helps the student fit the target task better. We evaluate TED in two scenarios: continual pre-training and fine-tuning. TED demonstrates significant and consistent improvements over existing distillation methods in both scenarios.

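To make the idea above concrete, here is a minimal PyTorch-style sketch of layer-wise distillation with task-aware filters. It assumes the filters are small MLP heads (first fitted with the task loss, then frozen on the teacher side) and that the distillation objective is a mean-squared alignment of the filtered hidden states; the names `TaskAwareFilter` and `ted_layer_loss`, and these specific design choices, are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class TaskAwareFilter(nn.Module):
    """Hypothetical task-aware filter: a small projection head on one hidden layer.

    In the sketched setup, each filter is first trained with the task loss
    (backbone frozen) so it learns to extract the task-relevant part of the
    hidden representation at its layer.
    """

    def __init__(self, hidden_dim: int, filter_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(hidden_dim, filter_dim), nn.GELU())

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden_states)


def ted_layer_loss(student_hiddens, teacher_hiddens, student_filters, teacher_filters):
    """Align filtered student and teacher hidden states, layer pair by layer pair.

    Each argument is a list with one entry per aligned layer pair; hidden
    states are [batch, seq_len, hidden_dim] tensors. Teacher filters are
    assumed to be frozen after the task-aware training stage.
    """
    mse = nn.MSELoss()
    loss = torch.zeros((), device=student_hiddens[0].device)
    for h_s, h_t, f_s, f_t in zip(student_hiddens, teacher_hiddens,
                                  student_filters, teacher_filters):
        # Filter both representations, then match the student to the
        # (detached) teacher so gradients only update the student side.
        loss = loss + mse(f_s(h_s), f_t(h_t).detach())
    return loss
```

In training, this layer-wise term would be added to the usual task loss (and, optionally, a logit-distillation loss) with a weighting coefficient; the filters restrict the alignment to task-relevant directions rather than forcing the student to reproduce every detail of the teacher's hidden states.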