BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth Mover's Distance

10/13/2020
by   Jianquan Li, et al.

Pre-trained language models (e.g., BERT) have achieved significant success in various natural language processing (NLP) tasks. However, high storage and computational costs prevent pre-trained language models from being effectively deployed on resource-constrained devices. In this paper, we propose a novel BERT distillation method based on many-to-many layer mapping, which allows each intermediate student layer to learn from any intermediate teacher layer. In this way, our model can adaptively learn from different teacher layers for different NLP tasks, exploiting the different levels of linguistic knowledge contained in the intermediate layers of BERT. In addition, we leverage Earth Mover's Distance (EMD) to compute the minimum cumulative cost that must be paid to transform knowledge from the teacher network to the student network. EMD enables effective matching for the many-to-many layer mapping and effectively measures the semantic distance between the teacher and student networks. Furthermore, we propose a cost attention mechanism that automatically learns the layer weights used in EMD, which is expected to further improve the model's performance and accelerate convergence. Extensive experiments on the GLUE benchmark demonstrate that our model achieves competitive performance compared to strong competitors in terms of both accuracy and model compression.
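To make the layer-mapping idea concrete, below is a minimal sketch (not the authors' released implementation) of an EMD-style many-to-many distillation loss. The MSE layer cost, the uniform layer weights, the use of SciPy's linprog as the transport solver, and the function names (layer_cost_matrix, emd_flow, emd_distillation_loss) are all illustrative assumptions; in the paper the layer weights are additionally learned through the cost attention mechanism.

```python
"""Sketch of an EMD-style many-to-many layer-mapping loss for BERT distillation.
Assumptions: MSE as the per-layer transfer cost, uniform layer weights, and
SciPy's linprog standing in for a dedicated EMD solver."""
import numpy as np
import torch
from scipy.optimize import linprog


def layer_cost_matrix(teacher_hiddens, student_hiddens):
    """Pairwise transfer cost: MSE between every teacher/student layer pair.
    teacher_hiddens: list of M tensors, each (batch, seq_len, hidden)
    student_hiddens: list of N tensors with the same shape (projected if needed)
    Returns an (M, N) cost tensor."""
    M, N = len(teacher_hiddens), len(student_hiddens)
    cost = torch.zeros(M, N)
    for i, t in enumerate(teacher_hiddens):
        for j, s in enumerate(student_hiddens):
            cost[i, j] = torch.nn.functional.mse_loss(s, t)
    return cost


def emd_flow(cost, teacher_weights, student_weights):
    """Solve the transport LP: minimise sum_ij f_ij * c_ij subject to
    row sums = teacher_weights, column sums = student_weights, f >= 0."""
    M, N = cost.shape
    c = cost.detach().cpu().numpy().reshape(-1)

    A_eq = np.zeros((M + N, M * N))
    for i in range(M):            # row-sum constraints (teacher layer mass)
        A_eq[i, i * N:(i + 1) * N] = 1.0
    for j in range(N):            # column-sum constraints (student layer mass)
        A_eq[M + j, j::N] = 1.0
    b_eq = np.concatenate([teacher_weights, student_weights])

    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return torch.tensor(res.x.reshape(M, N), dtype=cost.dtype)


def emd_distillation_loss(teacher_hiddens, student_hiddens):
    """Flow-weighted sum of layer costs: the EMD matching decides how much
    each student layer should learn from each teacher layer."""
    M, N = len(teacher_hiddens), len(student_hiddens)
    cost = layer_cost_matrix(teacher_hiddens, student_hiddens)
    # Uniform weights here; BERT-EMD learns these via cost attention.
    flow = emd_flow(cost, np.full(M, 1.0 / M), np.full(N, 1.0 / N))
    return (flow * cost).sum()


if __name__ == "__main__":
    torch.manual_seed(0)
    teacher = [torch.randn(2, 8, 16) for _ in range(6)]   # 6 teacher layers
    student = [torch.randn(2, 8, 16) for _ in range(3)]   # 3 student layers
    print(emd_distillation_loss(teacher, student).item())
```

In this sketch the transport flow is computed on detached costs and then used as fixed matching weights, so gradients reach the student only through the layer costs; replacing the uniform layer weights with learned ones would correspond to the paper's cost attention component.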


research
08/25/2019

Patient Knowledge Distillation for BERT Model Compression

Pre-trained language models such as BERT have proven to be highly effect...
research
09/21/2021

RAIL-KD: RAndom Intermediate Layer Mapping for Knowledge Distillation

Intermediate layer knowledge distillation (KD) can improve the standard ...
research
06/11/2021

RefBERT: Compressing BERT by Referencing to Pre-computed Representations

Recently developed large pre-trained language models, e.g., BERT, have a...
research
04/06/2020

MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices

Natural Language Processing (NLP) has recently achieved great success by...
research
02/09/2021

NewsBERT: Distilling Pre-trained Language Model for Intelligent News Application

Pre-trained language models (PLMs) like BERT have made great progress in...
research
12/22/2022

CAMeMBERT: Cascading Assistant-Mediated Multilingual BERT

Large language models having hundreds of millions, and even billions, of...
research
02/14/2020

TwinBERT: Distilling Knowledge to Twin-Structured BERT Models for Efficient Retrieval

Pre-trained language models like BERT have achieved great success in a w...
