An Automatic and Efficient BERT Pruning for Edge AI Systems

06/21/2022
by   Shaoyi Huang, et al.

With the growing push to democratize deep learning, there is increasing demand to deploy Transformer-based natural language processing (NLP) models on resource-constrained devices with low latency and high accuracy. Existing BERT pruning methods require domain experts to heuristically handcraft hyperparameters to strike a balance among model size, latency, and accuracy. In this work, we propose AE-BERT, an automatic and efficient BERT pruning framework with an efficient evaluation procedure that selects a "good" sub-network candidate (with high accuracy) given an overall pruning ratio constraint. Our proposed method requires no human expert experience and achieves better accuracy on many NLP tasks. Experimental results on the General Language Understanding Evaluation (GLUE) benchmark show that AE-BERT outperforms state-of-the-art (SOTA) hand-crafted pruning methods on BERT_BASE. On QNLI and RTE, we obtain 75% and 42.8% higher overall pruning ratios while achieving higher accuracy. On MRPC, we obtain a 4.6-point higher score than the SOTA at the same overall pruning ratio of 0.5. On STS-B, we achieve a 40% higher pruning ratio with only a very small loss in Spearman correlation compared to SOTA hand-crafted pruning methods. Experimental results also show that, after model compression, the inference of a single BERT_BASE encoder on a Xilinx Alveo U200 FPGA board achieves a 1.83× speedup over an Intel(R) Xeon(R) Gold 5218 (2.30 GHz) CPU, demonstrating the feasibility of deploying the BERT_BASE sub-networks generated by the proposed method on computation-restricted devices.
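To make the idea of selecting a sub-network candidate under an overall pruning ratio constraint concrete, the following is a minimal sketch in PyTorch. It is not the AE-BERT implementation: the candidate-sampling strategy, the toy model, and the proxy accuracy evaluation are illustrative assumptions, and only the general pattern (sample per-layer sparsity allocations that satisfy a global ratio, prune by magnitude, evaluate cheaply, keep the best candidate) reflects the approach described in the abstract.

# Hypothetical sketch: search for a pruned sub-network under an overall
# pruning ratio constraint using L1 (magnitude) pruning from
# torch.nn.utils.prune. All names and the sampling heuristic are
# illustrative stand-ins, not the authors' code.
import copy
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def sample_layer_ratios(n_layers, overall_ratio):
    """Draw a random per-layer sparsity allocation whose mean roughly
    matches the overall pruning ratio (simple perturbation heuristic)."""
    noise = torch.rand(n_layers)
    noise = noise - noise.mean()                  # zero-mean perturbation
    ratios = (overall_ratio + 0.1 * noise).clamp(0.05, 0.95)
    ratios = ratios * overall_ratio / ratios.mean()  # restore the constraint
    return ratios.clamp(0.0, 0.99).tolist()

def apply_candidate(model, ratios):
    """Return a pruned copy of the model with one L1 mask per linear layer."""
    pruned = copy.deepcopy(model)
    linears = [m for m in pruned.modules() if isinstance(m, nn.Linear)]
    for layer, amount in zip(linears, ratios):
        prune.l1_unstructured(layer, name="weight", amount=amount)
    return pruned

@torch.no_grad()
def quick_score(model, x, y):
    """Cheap proxy evaluation: accuracy on a small held-out batch."""
    preds = model(x).argmax(dim=-1)
    return (preds == y).float().mean().item()

def search(model, x, y, overall_ratio=0.5, n_candidates=20):
    """Sample candidates and keep the one with the best proxy accuracy."""
    n_layers = sum(isinstance(m, nn.Linear) for m in model.modules())
    best_score, best_model = -1.0, None
    for _ in range(n_candidates):
        ratios = sample_layer_ratios(n_layers, overall_ratio)
        candidate = apply_candidate(model, ratios)
        score = quick_score(candidate, x, y)
        if score > best_score:
            best_score, best_model = score, candidate
    return best_model, best_score

if __name__ == "__main__":
    # Toy stand-in for a Transformer encoder: a small MLP classifier.
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 2))
    x, y = torch.randn(64, 128), torch.randint(0, 2, (64,))
    pruned, acc = search(model, x, y, overall_ratio=0.5)
    print(f"best candidate proxy accuracy: {acc:.3f}")

In practice, the candidate evaluation would be run on a GLUE task's development set with a pretrained BERT_BASE model rather than the toy classifier above; the point of the sketch is only that candidate selection can be automated once the overall pruning ratio is fixed, without hand-tuning per-layer hyperparameters.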

