EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation

09/15/2021
by   Chenhe Dong, et al.

Pre-trained language models have shown remarkable results on various NLP tasks. Nevertheless, their bulky size and slow inference speed make them hard to deploy on edge devices. In this paper, we offer a critical insight: improving the feed-forward network (FFN) in BERT yields a higher gain than improving the multi-head attention (MHA), since the computational cost of the FFN is 2∼3 times larger than that of the MHA. Hence, to compact BERT, we focus on designing an efficient FFN, in contrast to previous works that concentrate on the MHA. Since the FFN comprises a multilayer perceptron (MLP) that is essential to BERT optimization, we further design a thorough search space for an advanced MLP and apply a coarse-to-fine mechanism to search for an efficient BERT architecture. Moreover, to accelerate the search and enhance model transferability, we employ a novel warm-up knowledge distillation strategy at each search stage. Extensive experiments show that our searched EfficientBERT is 6.9× smaller and 4.4× faster than BERT_BASE, with competitive performance on the GLUE and SQuAD benchmarks. Concretely, EfficientBERT attains an average score of 77.7 on the GLUE test set, 0.7 higher than MobileBERT_TINY, and achieves an F1 score of 85.3/74.5 on the SQuAD v1.1/v2.0 dev sets, 3.2/2.7 higher than TinyBERT_4 even without data augmentation. The code is released at https://github.com/cheneydon/efficient-bert.
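To make the distillation component of such a pipeline concrete, the sketch below shows a generic BERT-style distillation objective that combines soft-label knowledge distillation on the logits with hidden-state matching between student and teacher. It illustrates the general technique only and is not the paper's exact warm-up KD procedure; the temperature, loss weights, projection layer, and tensor dimensions are assumptions.

```python
# Minimal sketch of a generic BERT distillation loss (soft-label KD + hidden-state
# matching). Illustrative only; NOT the warm-up KD procedure from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillLoss(nn.Module):
    def __init__(self, student_dim: int, teacher_dim: int, temperature: float = 1.0):
        super().__init__()
        # Linear map so a narrower student hidden state can be compared
        # against the wider teacher hidden state (assumed design choice).
        self.proj = nn.Linear(student_dim, teacher_dim)
        self.T = temperature

    def forward(self, s_logits, t_logits, s_hidden, t_hidden):
        # Soft-label loss: KL divergence between temperature-scaled distributions.
        soft = F.kl_div(
            F.log_softmax(s_logits / self.T, dim=-1),
            F.softmax(t_logits / self.T, dim=-1),
            reduction="batchmean",
        ) * (self.T ** 2)
        # Hidden-state loss: MSE between projected student and teacher states.
        hidden = F.mse_loss(self.proj(s_hidden), t_hidden)
        return soft + hidden

# Usage sketch with dummy tensors (batch=8, seq_len=128, vocab=30522; sizes assumed).
loss_fn = DistillLoss(student_dim=312, teacher_dim=768)
s_logits, t_logits = torch.randn(8, 30522), torch.randn(8, 30522)
s_hidden, t_hidden = torch.randn(8, 128, 312), torch.randn(8, 128, 768)
loss = loss_fn(s_logits, t_logits, s_hidden, t_hidden)
```

In practice such a loss would be applied during each search and training stage, with the teacher held fixed; the exact stage-wise schedule used for warm-up distillation is described in the paper itself.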

Related research:

- 04/07/2020  Towards Non-task-specific Distillation of BERT via Sentence Representation Approximation
  Recently, BERT has become an essential ingredient of various NLP deep mo...
- 06/07/2021  RoSearch: Search for Robust Student Architectures When Distilling Pre-trained Language Models
  Pre-trained language models achieve outstanding performance in NLP tasks...
- 10/16/2021  HRKD: Hierarchical Relational Knowledge Distillation for Cross-domain Language Model Compression
  On many natural language processing tasks, large pre-trained language mo...
- 04/06/2020  MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
  Natural Language Processing (NLP) has recently achieved great success by...
- 07/21/2020  Understanding BERT Rankers Under Distillation
  Deep language models such as BERT pre-trained on large corpus have given...
- 09/01/2020  Automatic Assignment of Radiology Examination Protocols Using Pre-trained Language Models with Knowledge Distillation
  Selecting radiology examination protocol is a repetitive, error-prone, a...
- 03/21/2021  ROSITA: Refined BERT cOmpreSsion with InTegrAted techniques
  Pre-trained language models of the BERT family have defined the state-of...
