You Only Compress Once: Towards Effective and Elastic BERT Compression via Exploit-Explore Stochastic Nature Gradient

06/04/2021
by Shaokun Zhang, et al.

Despite superior performance on various natural language processing tasks, pre-trained models such as BERT are challenging to deploy on resource-constrained devices. Most existing model compression approaches require re-compression or fine-tuning across diverse constraints to accommodate each hardware deployment, which practically limits the further application of model compression. Moreover, the inefficient training and search processes of existing elastic compression paradigms [4,27] prevent their direct migration to BERT compression. Motivated by the need for efficient inference across various constraints on BERT, we propose a novel approach, YOCO-BERT, to achieve compress once, deploy everywhere. Specifically, we first construct a huge search space of 10^13 architectures, which covers nearly all configurations of the BERT model. We then propose a novel stochastic nature gradient optimization method that guides the generation of optimal candidate architectures while keeping a balanced trade-off between exploration and exploitation. When a certain resource constraint is given, a lightweight distribution optimization approach is used to obtain the optimal network for the target deployment without fine-tuning. Compared with state-of-the-art algorithms, YOCO-BERT provides more compact models, yet achieves a 2.1% improvement on the GLUE benchmark. YOCO-BERT is also more efficient: e.g., the training complexity is O(1) for N different devices. Code is available at https://github.com/MAC-AutoML/YOCO-BERT.
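The optimization step described above can be pictured as natural-gradient ascent on the parameters of a probability distribution over architecture choices: sample sub-networks (exploration), score them, and shift the distribution toward the better ones (exploitation). Below is a minimal, illustrative sketch of that idea for a single categorical choice. It is not the authors' implementation: the `evaluate` reward is a hypothetical placeholder (in the paper this would be the validation score of a sampled sub-network), and the sample size `lam`, learning rate `lr`, and utility normalization are illustrative assumptions.

```python
# Minimal sketch of a stochastic natural-gradient update over a categorical
# distribution of architecture choices. NOT the authors' implementation;
# `evaluate`, `lam`, and `lr` are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)

num_choices = 4                                   # e.g., candidate widths for one layer
theta = np.full(num_choices, 1.0 / num_choices)   # categorical probabilities
lr = 0.1                                          # natural-gradient step size
lam = 8                                           # architectures sampled per update

def evaluate(arch_idx):
    # Placeholder reward; a real system would evaluate the sampled
    # sub-network under the current super-net weights.
    return -abs(arch_idx - 2) + rng.normal(scale=0.1)

for step in range(100):
    # Exploration: sample architectures from the current distribution.
    samples = rng.choice(num_choices, size=lam, p=theta)
    rewards = np.array([evaluate(s) for s in samples])

    # Normalize rewards into utilities so the update scale stays stable.
    utilities = rewards - rewards.mean()
    if utilities.std() > 0:
        utilities /= utilities.std()

    # Natural-gradient step: for a categorical distribution in expectation
    # parameters, the natural gradient of log p(x) is (onehot(x) - theta).
    nat_grad = np.zeros(num_choices)
    for s, u in zip(samples, utilities):
        nat_grad += u * (np.eye(num_choices)[s] - theta)
    theta += lr * nat_grad / lam

    # Exploitation: keep theta a valid distribution.
    theta = np.clip(theta, 1e-6, None)
    theta /= theta.sum()

print("most probable architecture choice:", int(theta.argmax()))
```

In this toy setting the distribution concentrates on the choice with the highest expected reward; applying one such distribution per compressible dimension of BERT (depth, width, heads) is what yields the very large combinatorial search space mentioned in the abstract.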


Related research

05/03/2022 · Efficient Fine-Tuning of BERT Models on the Edge
Resource-constrained devices are increasingly the deployment targets of ...

02/27/2020 · Compressing Large-Scale Transformer-Based Models: A Case Study on BERT
Transformer-based models pre-trained on large-scale corpora achieve stat...

07/26/2023 · DPBERT: Efficient Inference for BERT based on Dynamic Planning
Large-scale pre-trained language models such as BERT have contributed si...

08/19/2021 · An Information Theory-inspired Strategy for Automatic Network Pruning
Despite superior performance on many computer vision tasks, deep convolu...

09/12/2019 · Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT
Transformer based architectures have become de-facto models used for a r...

08/01/2019 · MSnet: A BERT-based Network for Gendered Pronoun Resolution
The pre-trained BERT model achieves a remarkable state of the art across...

02/07/2020 · BERT-of-Theseus: Compressing BERT by Progressive Module Replacing
In this paper, we propose a novel model compression approach to effectiv...
