Contrastive Distillation on Intermediate Representations for Language Model Compression

09/29/2020
by Siqi Sun, et al.

Existing language model compression methods mostly use a simple L2 loss to distill knowledge from the intermediate representations of a large BERT model into a smaller one. Although widely used, this objective by design assumes that all dimensions of the hidden representations are independent, failing to capture important structural knowledge in the intermediate layers of the teacher network. To achieve better distillation efficacy, we propose Contrastive Distillation on Intermediate Representations (CoDIR), a principled knowledge distillation framework where the student is trained to distill knowledge through intermediate layers of the teacher via a contrastive objective. By learning to distinguish a positive sample from a large set of negative samples, CoDIR facilitates the student's exploitation of the rich information in the teacher's hidden layers. CoDIR can be readily applied to compress large-scale language models in both the pre-training and fine-tuning stages, and achieves superb performance on the GLUE benchmark, outperforming state-of-the-art compression methods.
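To make the idea concrete, below is a minimal sketch of what a contrastive objective on intermediate representations could look like in PyTorch. The mean-pooling step, the linear projection heads (proj_s, proj_t), the temperature tau, and the InfoNCE-style formulation are illustrative assumptions, not the paper's exact loss.

```python
# Minimal sketch of an InfoNCE-style contrastive loss on intermediate
# representations, in the spirit of CoDIR. All names and the pooling /
# projection choices here are assumptions for illustration only.
import torch
import torch.nn.functional as F

def contrastive_distill_loss(student_hidden, teacher_hidden_pos, teacher_hidden_negs,
                             proj_s, proj_t, tau=0.1):
    """
    student_hidden:      [B, T, d_s]    intermediate layer of the student
    teacher_hidden_pos:  [B, T, d_t]    teacher representation of the same inputs
    teacher_hidden_negs: [B, K, T, d_t] teacher representations of K negative samples
    proj_s, proj_t:      linear heads mapping both sides into a shared space
    tau:                 temperature for the softmax over similarities
    """
    # Mean-pool over tokens, project into the shared space, and L2-normalize.
    z_s   = F.normalize(proj_s(student_hidden.mean(dim=1)), dim=-1)        # [B, d]
    z_pos = F.normalize(proj_t(teacher_hidden_pos.mean(dim=1)), dim=-1)    # [B, d]
    z_neg = F.normalize(proj_t(teacher_hidden_negs.mean(dim=2)), dim=-1)   # [B, K, d]

    # Cosine similarities: one positive per example, K negatives.
    pos = (z_s * z_pos).sum(dim=-1, keepdim=True) / tau                    # [B, 1]
    neg = torch.einsum('bd,bkd->bk', z_s, z_neg) / tau                     # [B, K]

    # Cross-entropy with the positive in slot 0 is the InfoNCE objective.
    logits = torch.cat([pos, neg], dim=1)                                  # [B, 1+K]
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```

In practice, the negative representations would typically be drawn from other training examples in the batch or from a memory bank, consistent with the "large set of negative samples" described in the abstract.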


Related research

08/25/2019 - Patient Knowledge Distillation for BERT Model Compression
Pre-trained language models such as BERT have proven to be highly effect...

10/04/2022 - Less is More: Task-aware Layer-wise Distillation for Language Model Compression
Layer-wise distillation is a powerful tool to compress large models (i.e...

03/14/2023 - A Contrastive Knowledge Transfer Framework for Model Compression and Transfer Learning
Knowledge Transfer (KT) achieves competitive performance and is widely u...

09/23/2021 - Distiller: A Systematic Study of Model Distillation Methods in Natural Language Processing
We aim to identify how different components in the KD pipeline affect th...

06/11/2023 - Are Intermediate Layers and Labels Really Necessary? A General Language Model Distillation Method
The large scale of pre-trained language models poses a challenge for the...

10/02/2018 - LIT: Block-wise Intermediate Representation Training for Model Compression
Knowledge distillation (KD) is a popular method for reducing the computa...

04/05/2021 - Compressing Visual-linguistic Model via Knowledge Distillation
Despite exciting progress in pre-training for visual-linguistic (VL) rep...
