LRC-BERT: Latent-representation Contrastive Knowledge Distillation for Natural Language Understanding

12/14/2020
by Hao Fu, et al.

Pre-trained models such as BERT have achieved strong results on a wide range of natural language processing tasks. However, their large number of parameters requires significant memory and inference time, which makes them difficult to deploy on edge devices. In this work, we propose LRC-BERT, a knowledge distillation method based on contrastive learning that fits the output of the intermediate layers from the angular-distance perspective, an aspect not considered by existing distillation methods. Furthermore, we introduce a gradient-perturbation-based training architecture in the training phase to increase the robustness of LRC-BERT, which is the first such attempt in knowledge distillation. Additionally, to better capture the distribution characteristics of the intermediate layers, we design a two-stage training method for the total distillation loss. Finally, by evaluating on 8 datasets from the General Language Understanding Evaluation (GLUE) benchmark, the proposed LRC-BERT outperforms existing state-of-the-art methods, which demonstrates the effectiveness of our approach.
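As a rough illustration of the intermediate-layer contrastive objective described above, the sketch below computes an InfoNCE-style loss over cosine (angular) similarities between pooled student and teacher hidden states, treating the matching teacher representation in a batch as the positive and the remaining batch entries as negatives. The pooling, the projection to a shared hidden size, the temperature value, and the function names are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def angular_contrastive_loss(student_hidden, teacher_hidden, temperature=0.1):
    """Illustrative (not the paper's exact) contrastive loss on
    intermediate-layer representations.

    student_hidden, teacher_hidden: (batch, hidden) pooled features;
    the teacher output is assumed to be projected to the student's
    hidden size. `temperature` is a hypothetical hyperparameter.
    """
    s = F.normalize(student_hidden, dim=-1)   # unit vectors, so dot product = cosine of the angle
    t = F.normalize(teacher_hidden, dim=-1)
    logits = s @ t.T / temperature            # (batch, batch) angular similarity matrix
    targets = torch.arange(s.size(0), device=s.device)  # positive pairs lie on the diagonal
    return F.cross_entropy(logits, targets)   # InfoNCE-style objective


# Usage sketch with random features standing in for layer outputs.
student = torch.randn(8, 312)   # e.g. a small student hidden size
teacher = torch.randn(8, 312)   # teacher features projected to the same size
loss = angular_contrastive_loss(student, teacher)
```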

Related research

04/15/2022
CILDA: Contrastive Data Augmentation using Intermediate Layer Knowledge Distillation
Knowledge distillation (KD) is an efficient framework for compressing la...

04/08/2020
LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression
BERT is a cutting-edge language representation model pre-trained by a la...

09/21/2021
Knowledge Distillation with Noisy Labels for Natural Language Understanding
Knowledge Distillation (KD) is extensively used to compress and deploy l...

09/13/2021
How to Select One Among All? An Extensive Empirical Study Towards the Robustness of Knowledge Distillation in Natural Language Understanding
Knowledge Distillation (KD) is a model compression algorithm that helps ...

05/27/2020
Syntactic Structure Distillation Pretraining For Bidirectional Encoders
Textual representation learners trained on large amounts of data have ac...

10/15/2021
Kronecker Decomposition for GPT Compression
GPT is an auto-regressive Transformer-based pre-trained language model w...

09/01/2020
Automatic Assignment of Radiology Examination Protocols Using Pre-trained Language Models with Knowledge Distillation
Selecting radiology examination protocol is a repetitive, error-prone, a...
