I Introduction
Recently, the NLP community has witnessed the effectiveness of utilizing large pretrained language models, such as BERT [8], ELECTRA [5], XLNet [38], and openGPT [6]. These large language models have significantly improved the performance of many NLP downstream tasks, e.g., the GLUE benchmark [31]. However, pretrained language models usually contain hundreds of millions of parameters and suffer from high computation cost and latency in real-world applications [19, 36, 39]. Hence, in order to make large pretrained language models applicable to broader applications [10, 40, 14, 15], it is necessary to reduce the computation overhead to accelerate the fine-tuning and inference of these models.
In the literature, several effective techniques, such as quantization [11, 13] and knowledge distillation [27, 25, 16], have been explored to tackle the computation overhead of large pretrained language models. In this paper, we focus on knowledge distillation [12], which transfers the knowledge embedded in a large teacher model to a small student model, because it has proven effective in several pieces of work, e.g., DistilBERT [25] and TinyBERT [16]. However, these methods usually discard the teacher's representations after obtaining a learned student model, which may yield suboptimal performance.
Different from existing work, we propose RefBERT, which utilizes the teacher's representations on reference samples to compress BERT into a small student model while maintaining the model performance. Here, a reference sample of an input sample refers to the most similar sample (but not the same one) in the dataset, evaluated by a certain similarity criterion, e.g., containing the most common keywords or following the most similar structure. Our RefBERT delivers two modifications to the original Transformer layers: (1) The key and the value in the first Transformer layer: the key in the multi-head attention networks of the first Transformer layer of the student model incorporates both the student's embedding of the input sample and the teacher's embedding of the reference sample, while the corresponding value incorporates both the student's embedding of the input sample and the teacher's last-layer representation of the reference sample. By simply concatenating them, we can effectively absorb their information through the self-attention mechanism. (2) Shifting the normalized attention score: we subtract a constant from the attention score (normalized by the softmax function) to amplify the effect of the normalized attention score. By subtracting a constant from the score, we can place negative attention on non-informative tokens and discard the impact of unrelated parts in the next layer. We then follow the setup of TinyBERT to learn the student's parameters by distilling the embedding layer, hidden states, attention weights, and the prediction layer. More importantly, we present a theoretical analysis to justify the selection of the mean-square-error (MSE) loss function while revealing that, by including any related reference sample during the compression procedure, our RefBERT indeed increases the information absorption.
We highlight the contribution of our work as follows:

We propose a novel knowledge distillation method, namely RefBERT, for the Transformer-based architecture to transfer the linguistic knowledge encoded in the teacher BERT to a small student model through the reference mechanism. Hence, the teacher's information can be utilized by the student model during inference.

We modify the query, key, and value of the multi-head attention networks at the first layer of the student model to incorporate the teacher's embeddings and the representations in the last Transformer layer. The attention score is subtracted by a constant to discard the effect of non-informative tokens. More importantly, a theoretical analysis is provided to justify the selection of the MSE loss and the inclusion of reference samples in information absorption.

We conduct experimental evaluations and show that our RefBERT can beat TinyBERT without data augmentation by over 8.1% and attains more than 94% of the teacher's performance on the GLUE benchmark. Meanwhile, RefBERT is 7.4x smaller and 9.5x faster on inference than BERT-base.
II Related Work
We review two main streams of related work in the following:
Pretraining. In natural language processing, researchers have trained models on huge unlabeled text corpora to learn precise word representations. The models range from word embeddings [22, 23] to contextual word representations [24, 20] and recently-developed powerful pretrained language models, such as BERT [8], ELECTRA [5], XLNet [38], and openGPT [6]. After that, researchers usually apply the fine-tuning mechanism to update the large pretrained representations for downstream tasks with a small number of task-specific parameters [8, 21, 17, 18]. However, the high demand for computational resources has hindered their applicability to a broader range of applications, especially resource-limited ones.
Distillation with unsupervised pretraining. Early attempts leverage unsupervised pretraining representations for behavior distillation [24], e.g., label distillation [3, 27] and task-specific knowledge distillation [28]. The distillation procedure has also been extended from single-task learning to multi-task learning, i.e., distilling knowledge from multiple teacher models to a lightweight student model [4, 37]. Knowledge distillation [12] has been frequently applied to compress a larger teacher model into a compact student model [29]. For example, DistilBERT [25] attempts to compress the intermediate layers of BERT into a smaller student model. TinyBERT [16] further explores more distillation losses while facilitating data augmentation to attain good performance on downstream NLP tasks. InfoBERT [32] tries to improve the model robustness based on an information-theoretic guarantee. However, these methods discard the teacher's direct information after obtaining the student model. The student may thus lack sufficient information and yield suboptimal performance during inference.
III Notations and Problem Statement
We first define some notations. Bold capital letters, e.g., X, indicate matrices. Bold small letters, e.g., x, indicate vectors or sequences. Letters in calligraphic or blackboard bold fonts, e.g., 𝒳, ℝ, indicate sets, where ℝ^d denotes the d-dimensional real space. With a little abuse of notation, we use lowercase letters to denote indices or density functions (e.g., p, q), while uppercase letters denote probability measures (e.g., P, Q) or random variables. ⊤ denotes the transpose operator. [·; ·] denotes the concatenation of vectors. MSE(·, ·) defines the mean squared error loss function.

The task of knowledge distillation from a teacher model is to build a compact student model that fits real-world constraints, e.g., memory and latency budgets, in downstream tasks. In this paper, we fix the model architecture to the Transformer because it is a well-known architecture attaining superior performance in a wide range of NLP tasks [30, 8].
To relieve the burden of the model expression, we define the vanilla Transformer layer for the encoder (TL(·)) as follows:

(1) H^l = TL(H^{l-1}) = LN(A^l + FFN(A^l)), with A^l = LN(H^{l-1} + MH(H^{l-1}, H^{l-1}, H^{l-1})),
where the function FFN(·) consists of two linear transformations with an activation, e.g., GeLU or ReLU [30]. The multi-head attention (MH) is computed by

(2) MH(Q, K, V) = Concat(head_1, ..., head_A) W^O,
(3) head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V),
(4) Attention(Q, K, V) = softmax(S) V,
(5) S = Q K^⊤ / √(d/A),
where A is the number of attention heads and d is the teacher's hidden size.
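To make the computation in Eqs. (2)-(5) concrete, the following NumPy sketch implements multi-head attention with identity input projections (the per-head matrices W_i^Q, W_i^K, W_i^V and the output matrix W^O are omitted for brevity; all dimensions are illustrative assumptions, not the teacher's configuration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Q, K, V, num_heads):
    """Multi-head attention of Eqs. (2)-(5) with identity projections.

    Q, K, V: (seq_len, d) arrays; d must be divisible by num_heads.
    Each head attends over its own d/num_heads-dimensional slice, and the
    head outputs are concatenated as in Eq. (2).
    """
    d = Q.shape[-1]
    d_head = d // num_heads
    heads = []
    for i in range(num_heads):
        s = slice(i * d_head, (i + 1) * d_head)
        # unnormalized score S = Q K^T / sqrt(d/A), cf. Eq. (5)
        score = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        heads.append(softmax(score) @ V[:, s])  # Attention(...) of Eq. (4)
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))                 # 6 subwords, hidden size 8
out = multi_head_attention(H, H, H, num_heads=2)
print(out.shape)  # (6, 8)
```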
Hence, for an input x with n subwords, i.e., x = (x_1, ..., x_n), we have H^0 = E^T ∈ ℝ^{n×d}, where the teacher's embedding E^T is computed by

(6) E^T = EMB^T(x),

where EMB^T(·) is the embedding function of the teacher model, which is usually computed as the summation of the token embedding, position embedding, and segment embedding (if there is one). In the compression, we will utilize the teacher's information, E^T and H^N (the output of the last of the teacher's N Transformer layers), to derive the corresponding student model.
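The embedding computation can be illustrated with a minimal sketch; all sizes below are toy assumptions rather than the teacher's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, n_segments, d = 100, 16, 2, 8   # toy sizes
token_emb = rng.normal(size=(vocab_size, d))
position_emb = rng.normal(size=(max_len, d))
segment_emb = rng.normal(size=(n_segments, d))

def embed(token_ids, segment_ids):
    """EMB(x): sum of token, position, and segment embeddings, cf. Eq. (6)."""
    n = len(token_ids)
    return token_emb[token_ids] + position_emb[:n] + segment_emb[segment_ids]

H0 = embed([5, 17, 3], [0, 0, 1])   # a 3-subword input
print(H0.shape)  # (3, 8)
```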
For fast experimentation, we use a single teacher, without making a statement about the best architectural choice, and adopt the pretrained BERT-base [8] as the teacher model. Other models, e.g., XLNet [38], can be easily deployed and tested within our framework. To lessen the variation of the model, we also apply the same number of self-attention heads to both the teacher and the student models.
We are given the following data:

Unlabeled language model data (D_LM): a collection of texts for representation learning. These corpora are usually collected from Wikipedia or other open domains and contain billions of words without strong domain information.

Labeled data (D): a set of training examples {(x_i, y_i)}, where x_i is an input and y_i is the corresponding response. For most NLP downstream tasks, labeled data require a lot of experts' manual effort and are thus restricted in size.
In our work, we apply both D_LM and D to train the student model from the teacher model, as stated in the next section.
IV Methodology
In this section, we present our proposed RefBERT and provide the theoretical justification to support our proposal.
IV-A Model Architecture
Following the setup of TinyBERT [16], we break the distillation into two stages: the general distillation (GD) stage and the task distillation (TD) stage. In the following, we first outline the GD stage. We first construct a new reference dataset D_R to pair a reference example with each sample in D_LM:

(7) D_R = {(x, x_r) | x ∈ D_LM}.

Hence, |D_R| = |D_LM|, and the reference example x_r in D_R is different from x in D_LM, but is the most similar one based on a certain criterion, e.g., containing the most common keywords or following the most similar structure.
Let the hidden size in the student model be d' and the number of Transformer layers be M. Usually, we have d' < d and M < N because we want to learn a smaller student model while preserving the teacher's performance.
As illustrated in Fig. 1, given (x, x_r), we have E_r^T and H_r^N obtained from the teacher model on the reference sample x_r. We modify the first Transformer layer in the student model to absorb the reference information while keeping the remaining layers the same as in the teacher model. More specifically, the first Transformer layer of the student model is computed by
(8) H_1^S = LN(A_1^S + FFN(A_1^S)), with A_1^S = LN(E^S + MH'(Q_1, K_1, V_1)),

where

(9) Q_1 = E^S W^Q,
(10) K_1 = [E^S; E_r^T W^K],
(11) V_1 = [E^S; H_r^N W^V],
(12) S_1 = Q_1 K_1^⊤ / √(d'/A),
(13) Ã_1 = softmax(S_1) − δ,
(14) MH'(Q_1, K_1, V_1) = Ã_1 V_1.

Here, E^S is the student's embedding of x, [·; ·] concatenates along the token dimension, W^Q ∈ ℝ^{d'×d'} and W^K, W^V ∈ ℝ^{d×d'} are projection matrices, δ is a shifting constant, and Eqs. (12)-(14) are applied per head as in Eq. (2).
To guarantee compatibility, the number of attention heads in the student is the same as that in the teacher model in Eq. (2).
Remark 1.
We highlight several differences between the student model and the teacher model in the above computation:

At the student's first Transformer layer, the query, key, and value, Q_1, K_1, and V_1, are mapped to the projected space and are slightly different from the original query, key, and value without mapping in the teacher model; see Eq. (1). This yields a different computation of the self-attention score in Eq. (12) of the student model, in contrast to that in Eq. (5) of the teacher model.

Different from the teacher model, K_1 and V_1 include the information of both x and x_r via concatenation. In K_1, the concatenated component is related to the embedding, E_r^T, while in V_1, the component is related to the output of the last Transformer layer, H_r^N. By this setting and the self-attention mechanism, we can absorb the matched information of the input, E^S and E_r^T, while utilizing the output of H_r^N for the final output.

In Eq. (13), we subtract the normalized attention score by a constant δ, where 0 < δ < 1. This is a key setup borrowed from [33] to make some of the attention weights negative so that the corresponding gradients have different signs during backpropagation. It is noted that the original attention score ranges from 0 to 1, where a higher value implies a higher correlation between the corresponding tokens and a lower value implies uncorrelation. By subtracting a constant from the score, we can place negative attention on non-informative tokens and discard the impact of unrelated parts in the next layer.
Hence, for a pair (x, x_r), the output of the l-th layer of our RefBERT is H_l^S, which is computed by:

(15) H_l^S = TL(H_{l-1}^S), l = 2, ..., M.
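The modified first layer can be sketched as follows. This is a single-head illustration based on the description above; the projection shapes, the helper names, and the value of δ are our assumptions, not the exact implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ref_first_layer_attention(E_s, E_t_ref, H_t_ref, W_q, W_k, W_v, delta=0.1):
    """Single-head sketch of RefBERT's modified first-layer attention.

    E_s:     student embeddings of the input x,            shape (n, d')
    E_t_ref: teacher embeddings of the reference x_r,      shape (m, d)
    H_t_ref: teacher last-layer states of the reference,   shape (m, d)
    W_k, W_v project the teacher-side parts from d down to d' so that
    they can be concatenated with the student's embeddings.
    """
    Q = E_s @ W_q
    # key: student embedding of x concatenated with teacher embedding of x_r
    K = np.concatenate([E_s, E_t_ref @ W_k], axis=0)
    # value: student embedding of x concatenated with teacher last-layer output
    V = np.concatenate([E_s, H_t_ref @ W_v], axis=0)
    score = Q @ K.T / np.sqrt(Q.shape[-1])
    # shifted attention: subtracting delta pushes non-informative tokens negative
    attn = softmax(score) - delta
    return attn @ V

rng = np.random.default_rng(1)
n, m, d_t, d_s = 4, 5, 8, 6
out = ref_first_layer_attention(
    rng.normal(size=(n, d_s)),       # student embeddings
    rng.normal(size=(m, d_t)),       # teacher embeddings of the reference
    rng.normal(size=(m, d_t)),       # teacher last-layer states of the reference
    rng.normal(size=(d_s, d_s)),     # W^Q
    rng.normal(size=(d_t, d_s)),     # W^K
    rng.normal(size=(d_t, d_s)),     # W^V
)
print(out.shape)  # (4, 6)
```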
IV-B Model Training
Following TinyBERT [16], we let g(m) be a mapping function indicating that the m-th layer of the student model learns the information from the g(m)-th layer of the teacher model. Hence, g(0) = 0 for the embedding layer and g(M) = N for the last Transformer layer.
In the training, we follow the setup of TinyBERT and apply the mean-squared-error (MSE) loss to distill the information. That is, the embedding layer and the prediction layer are distilled, and in the Transformer layers, the attention weights and hidden states are distilled. Hence, we derive the same objective function:

(16) L = Σ_{m=0}^{M+1} λ_m L_layer(S_m, T_{g(m)}),

where S_m and T_{g(m)} denote the outputs of the m-th student layer and the g(m)-th teacher layer, respectively, λ_m is the weight of the m-th layer's loss, and

(17) L_layer = L_embd if m = 0; L_hidn + L_attn if 1 ≤ m ≤ M; L_pred if m = M + 1.
In Eq. (17), the distillation losses consist of:

Embedding-layer distillation. MSE is adopted to penalize the difference between the embeddings of the student and the teacher:

(18) L_embd = MSE(E^S W_e, E^T),

where W_e is a weight matrix that linearly maps the embedding of the student into the same space as that of the teacher.

Hidden state distillation. MSE is adopted to penalize the difference between the hidden states of the student and the teacher:

(19) L_hidn = MSE(H_m^S W_h, H^{g(m)}),

where W_h plays the same role as W_e to linearly map the hidden state of the student into the same space as that of the teacher.

Attention distillation. MSE is adopted to penalize the difference between the attention weights of the student and the teacher:

(20) L_attn = (1/A) Σ_{i=1}^{A} MSE(S_i^S, S_i^T),

where A is the number of attention heads, S_i ∈ ℝ^{n×n} is the attention score matrix of the i-th head of the student or the teacher, and n is the length of the input text. It is noted that the attention score is unnormalized and computed as in Eq. (5) or Eq. (12), because a faster convergence was verified in TinyBERT.

Prediction-layer distillation. The soft cross-entropy loss is adopted on the logits of the student and the teacher to mimic the teacher's predictions:

L_pred = CE(z^T / t, z^S / t),

where z^T and z^S are the logits of the teacher and the student, respectively, computed from fully-connected networks, CE denotes the soft cross-entropy (the negative log-likelihood of the student's softened predictions under the teacher's softened distribution), and t is a scalar temperature value, usually set to 1.
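Two of the distillation losses can be sketched in NumPy as follows: the embedding-layer MSE and the prediction-layer soft cross-entropy (shapes and weight matrices are chosen for illustration only and are not the paper's configuration):

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def soft_cross_entropy(teacher_logits, student_logits, t=1.0):
    """Soft cross-entropy between softened teacher and student predictions."""
    p_teacher = np.exp(log_softmax(teacher_logits / t))
    return float(-np.mean((p_teacher * log_softmax(student_logits / t)).sum(axis=-1)))

rng = np.random.default_rng(2)
n, d_s, d_t, n_cls = 4, 6, 8, 3
E_s = rng.normal(size=(n, d_s))          # student embeddings
E_t = rng.normal(size=(n, d_t))          # teacher embeddings
W_e = rng.normal(size=(d_s, d_t))        # maps student space to teacher space

loss_embd = mse(E_s @ W_e, E_t)          # embedding-layer distillation
loss_pred = soft_cross_entropy(rng.normal(size=(n, n_cls)),
                               rng.normal(size=(n, n_cls)))
print(loss_embd > 0.0, loss_pred > 0.0)  # True True
```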
V Theoretical Analysis
We provide some theoretical insight into the distillation procedure by examining the mutual information between the student model and the teacher model. From an information-theoretic perspective, the goal of compressing a large model like BERT is to retain as much information as possible, that is, to maximize the mutual information between the student model and the teacher model.
Let (X, Y) be a pair of random variables with values over the space 𝒳 × 𝒴. The mutual information is defined by

(21) I(X; Y) = H(X) − H(X|Y),

where H(·) denotes the entropy and H(·|·) denotes the conditional entropy.
Justification of the Loss Function. Usually, we can apply X to represent the output from the teacher and Y the output of the student. Since H(X) is fixed, maximizing I(X; Y) is equivalent to minimizing H(X|Y). The conditional entropy quantifies the amount of additional information needed to describe X given Y. Since it is difficult to compute H(X|Y) directly, we derive an upper bound for the conditional entropy during compression.
Theorem 1 (Upper Bound of Conditional Entropy).
Let (X, Y) be a pair of random variables with values over the space 𝒳 × 𝒴. We have

(22) H(X|Y) ≤ (1/2) E[‖X − Y‖²] + c,

where c is a constant.
Proof.
We introduce a variational distribution q(x|y) = N(x; y, I). That is, the conditional probability of X given Y follows a Gaussian distribution with a diagonal covariance matrix. Then, we can derive

(23) H(X|Y) = −E_{p(x,y)}[log p(x|y)] = −E_{p(x,y)}[log q(x|y)] − E_{p(y)}[KL(p(x|y) ‖ q(x|y))] ≤ −E_{p(x,y)}[log q(x|y)]
(24) = (1/2) E[‖X − Y‖²] + (d/2) log(2π) = (1/2) E[‖X − Y‖²] + c.

In the above, the last equality holds because (d/2) log(2π) is a constant, while the inequality holds because the KL divergence is non-negative. ∎
Remark 2.
Theorem 1 justifies that MSE is an effective surrogate to guarantee information absorption.
Justification of the Usage of the Reference Sample. To see the information flow between the layers, we zoom in to examine the mutual information between the student and the teacher at each layer. First, we present the following two theoretical results.
Theorem 2 (Decrease in Mutual Information).
For any mapping function g, which maps a random variable Y to another random variable g(Y) through some learning process, we have the following inequality:

(25) I(X; g(Y)) ≤ I(X; Y).

The result can be derived from the Data Processing Inequality (DPI) [7]. A similar argument is also stated in [26]: in deep learning, the mutual information between the learned representation and the ground-truth signal decreases from layer to layer.
Remark 3.
Theorem 2 indicates that the difference between the teacher and the student is that the teacher discards the useless information and retains the most useful information in learning the final representation. In contrast, the student might discard some useful information in the learning procedure and end up with a less precise representation.
Theorem 3 (Increase of Mutual Information by Reference Samples).
For random variables X, Y, and Z, we have the following inequality:

(26) I(X; Y, Z) ≥ I(X; Y).

Note that I(X; Y, Z) = I(X; Y) when X and Z are conditionally independent given Y.
Remark 4.
By deeming X the teacher's representation, Z the reference sample's representation, and Y the student's representation, with Theorem 3 we conclude that including any related reference sample during the compression procedure increases the information absorption. This verifies the effectiveness of our proposed RefBERT.
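Theorem 3 can be checked numerically: the sketch below builds a small discrete joint distribution, computes I(X; Y) and I(X; (Y, Z)) by direct summation, and confirms the inequality (the supports and random weights are arbitrary illustrative choices):

```python
import itertools
import math
import random

random.seed(0)
# a random joint distribution over (x, y, z) with small supports
support = list(itertools.product(range(2), range(3), range(3)))
weights = [random.random() for _ in support]
total = sum(weights)
p_xyz = {s: w / total for s, w in zip(support, weights)}

def mutual_information(joint):
    """I(A; B) for a joint distribution given as a dict {(a, b): prob}."""
    p_a, p_b = {}, {}
    for (a, b), pr in joint.items():
        p_a[a] = p_a.get(a, 0.0) + pr
        p_b[b] = p_b.get(b, 0.0) + pr
    return sum(pr * math.log(pr / (p_a[a] * p_b[b]))
               for (a, b), pr in joint.items() if pr > 0)

# I(X; Y): marginalize z out; I(X; (Y, Z)): treat the pair (y, z) as one variable
p_xy = {}
for (x, y, z), pr in p_xyz.items():
    p_xy[(x, y)] = p_xy.get((x, y), 0.0) + pr
p_x_yz = {(x, (y, z)): pr for (x, y, z), pr in p_xyz.items()}

i_xy = mutual_information(p_xy)
i_x_yz = mutual_information(p_x_yz)
print(i_x_yz >= i_xy - 1e-12)  # True: adding Z never decreases the information about X
```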
VI Experiments
TABLE I: Summary of the GLUE benchmark.

Corpus | Train | Test | Task | Metrics | Domain
Single-Sentence Tasks
CoLA | 8.5k | 1k | acceptability | Matthews corr. | misc.
SST-2 | 67k | 1.8k | sentiment | acc. | movie reviews
Similarity and Paraphrase Tasks
MRPC | 3.7k | 1.7k | paraphrase | acc./F1 | news
STS-B | 7k | 1.4k | sentence similarity | Pearson/Spearman corr. | misc.
QQP | 364k | 391k | paraphrase | acc./F1 | social QA questions
Inference Tasks
MNLI | 393k | 20k | NLI | matched acc./mismatched acc. | misc.
QNLI | 105k | 5.4k | QA/NLI | acc. | Wikipedia
RTE | 2.5k | 3k | NLI | acc. | news, Wikipedia
WNLI | 634 | 146 | coreference/NLI | acc. | fiction books
In this section, we evaluate RefBERT on the General Language Understanding Evaluation (GLUE) benchmark and show the effectiveness of our RefBERT on utilizing the reference samples.
TABLE II: Comparison results on the GLUE benchmark test sets.

Model | CoLA (Mcc.) | MNLI-m (Acc.) | MNLI-mm (Acc.) | MRPC (F1) | QNLI (Acc.) | QQP (F1) | RTE (Acc.) | SST-2 (Acc.) | STS-B (Pear./Spea.) | Avg.
BERT-base (teacher) | 52.1 | 84.6 | 83.4 | 88.9 | 90.5 | 71.2 | 66.4 | 93.5 | 85.8 | 79.6/77.3
DistilBERT | 32.8 | 78.9 | 78.0 | 82.4 | 85.2 | 68.5 | 54.1 | 91.4 | 76.1 | 71.9/68.9
TinyBERT-DA | 43.3 | 82.5 | 81.8 | 86.4 | 87.7 | 71.3 | 62.9 | 92.6 | 79.9 | 76.5/73.5
TinyBERT | 29.8 | 80.5 | 81.0 | 82.4 | – | – | – | – | – | –/68.4
RefBERT | 47.9 | 80.9 | 80.3 | 86.9 | 87.3 | 61.6 | 61.7 | 92.9 | 75.0 | 75.1/74.0
VI-A Model Settings
The code is written in PyTorch. To provide a fair comparison, we apply the same setting as TinyBERT: the number of layers M = 4, the hidden size d' = 312, the feed-forward/filter size 1200, and the head number 12. This yields a total of 14.8M parameters, where the additional parameters come from the projection matrices W^Q, W^K, and W^V in Eq. (9)-Eq. (11). BERT-base is adopted as the teacher model and consists of 109M parameters by the default setting: the number of layers N = 12, the hidden size d = 768, the feed-forward/filter size 3072, and the head number 12. The same as TinyBERT, we adopt g(m) = 3m as the layer mapping function, so that the student learns from every 3 layers of BERT-base. The loss weight λ_m at each layer in Eq. (16) is set to 1 due to good performance.

General distillation. We use the English Wikipedia (2,500M words) as the dataset D_LM and follow the same preprocessing as in [8]. Each input sentence in D_LM is paired with a reference sentence by BM25 in Elastic Search (https://elasticsearch-py.readthedocs.io/en/v7.10.1/). The model of RefBERT is then trained on 8 16GB V100 GPUs for approximately 100 hours.
Task-specific Distillation. The same as in general distillation, we select the reference samples by BM25 in Elastic Search from the English Wikipedia. We then fine-tune RefBERT on the downstream tasks with the teacher's representations on the reference samples. The learning rate is tuned from 1e-5 to 1e-4 with a step of 1e-5 to seek the best performance on the development sets of the corresponding tasks in GLUE.
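The reference-selection step can be sketched without Elastic Search: the function below is a plain-Python BM25 scorer (the parameter values k1 and b and the toy corpus are illustrative; the actual pipeline uses Elastic Search over Wikipedia):

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score each document in `corpus` against `query` with BM25,
    the same similarity criterion used to pick reference sentences."""
    docs = [doc.lower().split() for doc in corpus]
    avgdl = sum(len(d) for d in docs) / len(docs)
    df = Counter(term for d in docs for term in set(d))  # document frequencies
    n = len(docs)
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

corpus = ["the cat sat on the mat",
          "a cat chased the mouse",
          "stock prices rose sharply today"]
query = "the cat on a mat"
scores = bm25_scores(query, corpus)
best = max(range(len(corpus)), key=lambda i: scores[i])
print(best)  # 0: the most lexically similar sentence becomes the reference
```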
VI-B Dataset
The GLUE benchmark [31] is a collection of 9 natural language understanding tasks as listed in Table I:

CoLA. The Corpus of Linguistic Acceptability is a task to predict whether an English sentence is grammatically acceptable, evaluated by the Matthews correlation coefficient [34].

MNLI. MultiGenre Natural Language Inference is a largescale, crowdsourced entailment classification task and evaluated by the matched and mismatched accuracy [35]. Given a pair of (premise, hypothesis), the task is to predict whether the hypothesis is an entailment, contradiction, or neutral with respect to the premise.

MRPC. Microsoft Research Paraphrase Corpus is a paraphrase identification dataset evaluated by F1 score [9]. The task is to identify if two sentences are paraphrases of each other.

QNLI.
Question Natural Language Inference is a version of the Stanford Question Answering Dataset which has been converted to a binary sentence pair classification task and is evaluated by accuracy. Given a pair of (question, context), the task is to determine whether the context contains the answer to the question.

QQP. Quora Question Pairs is a collection of question pairs from the website Quora. The task is to determine whether two questions are semantically equivalent and is evaluated by the F1 score.

RTE. Recognizing Textual Entailment is a binary entailment task with a small training dataset and is evaluated by the accuracy [1].

SST2. The Stanford Sentiment Treebank is a binary singlesentence classification task, where the goal is to predict the sentiment of movie reviews and is evaluated by the accuracy [23].

STSB. The Semantic Textual Similarity Benchmark is a collection of sentence pairs drawn from news headlines and many other domains. The task aims to evaluate how similar two pieces of texts are by a score from 1 to 5 and is evaluated by the Pearson correlation coefficient [2].
We follow the standard splitting to conduct the experiments and submit the prediction results to the GLUE benchmark system (https://gluebenchmark.com/) to evaluate the performance on the test sets.
TABLE III: Model size and inference time.

Model | Layers | Hidden Size | Feed-forward Size | Model Size | Inference Time (s)
BERT-base | 12 | 768 | 3072 | 109M (1.0x) | 190 (1.0x)
DistilBERT | 4 | 768 | 3072 | 52.2M (2.1x) | 64.1 (3.0x)
TinyBERT | 4 | 312 | 1200 | 14.5M (7.5x) | 20.1 (9.5x)
RefBERT | 4 | 312 | 1200 | 14.8M (7.4x) | 20.1 (9.5x)
VI-C Results
Table II reports the comparison results between our RefBERT and the baselines, i.e., BERT-base, DistilBERT, TinyBERT with Data Augmentation (TinyBERT-DA), and the vanilla TinyBERT. The results of the baselines are copied from those reported in TinyBERT [16] for reference. The results show that RefBERT attains competitive performance on all tasks in the GLUE benchmark:

RefBERT attains better results than DistilBERT on 7 out of the 9 tasks and obtains a prominent improvement of 4.4% on average.

RefBERT significantly outperforms the vanilla TinyBERT on all compared tasks and attains a large improvement of 8.1% on average over the four compared tasks, CoLA, MNLI-m, MNLI-mm, and MRPC.

RefBERT even obtains better performance than TinyBERT-DA on CoLA, MRPC, and SST-2, and yields 98.2% of the performance of TinyBERT-DA on average.

Overall, the average performance of RefBERT is at least 94% of that of BERT-base. RefBERT attains relatively lower performance on the task of QQP. We conjecture that the tokens in the reference samples for QQP may be too similar to those in the evaluated samples and thus confuse the prediction.
In terms of the inference time reported in Table III, we observe that RefBERT has a slightly larger model size than TinyBERT, by around 300K parameters; however, the difference in inference time is negligible. Compared with the teacher BERT-base, RefBERT is 7.4x smaller and 9.5x faster while achieving competitive performance. The results show that RefBERT is a promising surrogate among the recently-developed BERT distillation models.
VII Conclusion
In this paper, we propose a new knowledge distillation method, namely RefBERT, to distill BERT by utilizing the teacher's representations on the reference samples. By including the references' word embeddings and the teacher's final-layer representations in the corresponding key and value, while shifting the normalized attention score to suppress the self-attention effect of irrelevant components in the first layer, we make RefBERT absorb the teacher's knowledge on the reference samples and strengthen the information interaction effectively. More importantly, we provide theoretical justification for selecting the mean-square-error loss function and prove that including reference samples indeed increases the mutual information of distillation. Our experimental evaluation shows that RefBERT beats the vanilla TinyBERT by over 8.1% and achieves more than 94% of the performance of BERT-base on the GLUE benchmark. Meanwhile, RefBERT is 7.4x smaller and 9.5x faster on inference than BERT-base.
Several research problems are worthy of further exploration. First, we would like to explore more ways to reduce the model size of RefBERT while maintaining the same performance. Second, it would be promising to investigate more effective mechanisms to transfer the knowledge from wider and deeper teachers, e.g., BERT-large, to a smaller student via the reference mechanism. Third, other speedup methods, e.g., quantization, pruning, and even hardware acceleration, can be attempted to resolve the computation overhead of large pretrained language models.
References
 [1] (2009) The fifth PASCAL recognizing textual entailment challenge. In TAC, External Links: Link Cited by: 6th item.
 [2] (2017) SemEval-2017 task 1: semantic textual similarity - multilingual and cross-lingual focused evaluation. CoRR abs/1708.00055. External Links: Link, 1708.00055 Cited by: 8th item.
 [3] (2019) Transformer to CNN: label-scarce distillation for efficient text classification. CoRR abs/1909.03508. External Links: Link, 1909.03508 Cited by: §II.
 [4] (2019) BAM! Born-again multi-task networks for natural language understanding. In ACL, pp. 5931–5937. Cited by: §II.
 [5] (2020) ELECTRA: pretraining text encoders as discriminators rather than generators. In ICLR, Cited by: §I, §II.
 [6] (2020) OpenGPT2: open language models and implications of generated text. XRDS 27 (1), pp. 26–30. External Links: Link Cited by: §I, §II.
 [7] (2006) Elements of information theory (Wiley series in telecommunications and signal processing). Wiley-Interscience, USA. External Links: ISBN 0471241954 Cited by: §V.
 [8] (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 4171–4186. External Links: Link Cited by: §I, §II, §III, §VI-A.
 [9] (2005) Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing, IWP@IJCNLP, External Links: Link Cited by: 3rd item.
 [10] (2021) Advances and challenges in conversational recommender systems: a survey. arXiv preprint arXiv:2101.09459. Cited by: §I.

 [11] (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, External Links: Link Cited by: §I.
 [12] (2015) Distilling the knowledge in a neural network. CoRR abs/1503.02531. External Links: Link, 1503.02531 Cited by: §I, §II.
 [13] (2019) Normalization helps training of quantized LSTM. In NeurIPS, pp. 7344–7354. External Links: Link Cited by: §I.
 [14] (2017) A deep learning approach for predicting the quality of online health expert question-answering services. J. Biomed. Informatics 71, pp. 241–253. External Links: Link, Document Cited by: §I.

 [15] (2019) HiGRU: hierarchical gated recurrent units for utterance-level emotion recognition. In NAACL-HLT, J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 397–406. External Links: Link, Document Cited by: §I.
 [16] (2019) TinyBERT: distilling BERT for natural language understanding. CoRR abs/1909.10351. Cited by: §I, §II, §IV-A, §IV-B, §VI-C.

 [17] (2019) ALBERT: a lite BERT for self-supervised learning of language representations. CoRR abs/1909.11942. External Links: Link, 1909.11942 Cited by: §II.
 [18] (2021) Have we solved the hard problem? It's not easy! Contextual lexical contrast as a means to probe neural coherence. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: §II.
 [19] (2017) SWIM: a simple word interaction model for implicit discourse relation recognition. In IJCAI, pp. 4026–4032. Cited by: §I.
 [20] (2020) PiRhDy: learning pitch, rhythm, and dynamics-aware embeddings for symbolic music. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 574–582. Cited by: §II.
 [21] (2019) RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692. External Links: Link, 1907.11692 Cited by: §II.
 [22] (2013) Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119. External Links: Link Cited by: §II.
 [23] (2014) Glove: global vectors for word representation. In EMNLP, pp. 1532–1543. External Links: Link, Document Cited by: §II, 7th item.
 [24] (2018) Deep contextualized word representations. In NAACLHLT, pp. 2227–2237. Cited by: §II, §II.
 [25] (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR abs/1910.01108. Cited by: §I, §II.
 [26] (2017) Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810. Cited by: §V.
 [27] (2019) Patient knowledge distillation for BERT model compression. In EMNLP-IJCNLP, pp. 4322–4331. External Links: Link, Document Cited by: §I, §II.
 [28] (2019) Distilling task-specific knowledge from BERT into simple neural networks. CoRR abs/1903.12136. External Links: Link, 1903.12136 Cited by: §II.
 [29] (2019) Well-read students learn better: the impact of student initialization on knowledge distillation. CoRR abs/1908.08962. Cited by: §II.
 [30] (2017) Attention is all you need. In NIPS, pp. 5998–6008. External Links: Link Cited by: §III, §III.
 [31] (2019) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In ICLR, Cited by: §I, §VI-B.
 [32] (2020) InfoBERT: improving robustness of language models from an information theoretic perspective. CoRR abs/2010.02329. External Links: Link, 2010.02329 Cited by: §II.

 [33] (2020) Neural topic model with attention for supervised learning. In AISTATS, S. Chiappa and R. Calandra (Eds.), Proceedings of Machine Learning Research, Vol. 108, pp. 1147–1156. External Links: Link Cited by: 3rd item.
 [34] (2019) Neural network acceptability judgments. Trans. Assoc. Comput. Linguistics 7, pp. 625–641. External Links: Link Cited by: 1st item.
 [35] (2018) A broad-coverage challenge corpus for sentence understanding through inference. In NAACL-HLT, pp. 1112–1122. External Links: Link, Document Cited by: 2nd item.
 [36] (2021) Emotion dynamics modeling via BERT. In IJCNN, Cited by: §I.
 [37] (2020) Model compression with two-stage multi-teacher knowledge distillation for web question answering system. In WSDM, pp. 690–698. External Links: Link, Document Cited by: §II.
 [38] (2019) XLNet: generalized autoregressive pretraining for language understanding. In NeurIPS, pp. 5754–5764. Cited by: §I, §II, §III.
 [39] (2021) Automatic intent-slot induction for dialogue systems. In WWW, Cited by: §I.
 [40] (2021) Retrieving and reading: a comprehensive survey on opendomain question answering. arXiv preprint arXiv:2101.00774. Cited by: §I.