Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT

09/12/2019
by Sheng Shen, et al.

Transformer-based architectures have become the de facto models for a range of Natural Language Processing tasks. In particular, BERT-based models achieved significant accuracy gains on the GLUE tasks, CoNLL-03, and SQuAD. However, BERT-based models have a prohibitive memory footprint and latency. As a result, deploying them in resource-constrained environments has become a challenging task. In this work, we perform an extensive analysis of fine-tuned BERT models using second-order Hessian information, and we use our results to propose a novel method for quantizing BERT models to ultra-low precision. In particular, we propose a new group-wise quantization scheme, and we use a Hessian-based mixed-precision method to compress the model further. We extensively test our proposed method on the BERT downstream tasks SST-2, MNLI, CoNLL-03, and SQuAD. We achieve performance comparable to the baseline with at most 2.3% degradation, even with ultra-low precision quantization down to 2 bits, corresponding to up to 13× compression of the model parameters and up to 4× compression of the embedding table as well as activations. Among all tasks, we observe the highest performance loss for BERT fine-tuned on SQuAD. By probing into the Hessian-based analysis as well as visualization, we show that this is related to the fact that the current training/fine-tuning strategy of BERT does not converge for SQuAD.
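To make the group-wise quantization idea in the abstract concrete, here is a minimal sketch of symmetric uniform fake-quantization applied per group of weight-matrix rows, where each group gets its own quantization range. This is an illustration under stated assumptions, not the authors' implementation: the function name, the fixed row-block group size, and the bit-width are hypothetical choices, and in Q-BERT the per-layer bit-widths would additionally be selected using Hessian (top-eigenvalue) sensitivity, which is not shown here.

```python
import numpy as np


def quantize_groupwise(weights: np.ndarray, num_bits: int = 2, group_size: int = 128) -> np.ndarray:
    """Group-wise symmetric uniform fake-quantization (illustrative sketch).

    Each block of `group_size` rows is quantized with its own scale derived
    from that block's maximum absolute value, then dequantized back to float.
    """
    q_max = 2 ** (num_bits - 1) - 1            # e.g. 1 for 2-bit symmetric quantization
    out = np.empty_like(weights, dtype=np.float32)
    for start in range(0, weights.shape[0], group_size):
        group = weights[start:start + group_size]
        scale = float(np.abs(group).max()) / q_max if q_max > 0 else 1.0
        if scale == 0.0:                        # guard against an all-zero group
            scale = 1.0
        q = np.clip(np.round(group / scale), -q_max - 1, q_max)
        out[start:start + group_size] = q * scale
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(768, 768)).astype(np.float32)   # BERT-base sized dense layer
    w_q = quantize_groupwise(w, num_bits=4, group_size=128)
    print("mean abs quantization error:", np.abs(w - w_q).mean())
```

The point of grouping is that each sub-range of the matrix gets its own scale, which reduces quantization error at very low bit-widths; in the paper the groups roughly correspond to sub-matrices associated with individual attention heads, whereas the fixed row-block size above is used purely for illustration.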


Related research

I-BERT: Integer-only BERT Quantization (01/05/2021)
Transformer based models, like BERT and RoBERTa, have achieved state-of-...

Revisiting Supertagging for HPSG (09/14/2023)
We present new supertaggers trained on HPSG-based treebanks. These treeb...

SlimFit: Memory-Efficient Fine-Tuning of Transformer-based Models Using Training Dynamics (05/29/2023)
Transformer-based models, such as BERT and ViT, have achieved state-of-t...

Block-wise Bit-Compression of Transformer-based Models (03/16/2023)
With the popularity of the recent Transformer-based models represented b...

Teacher Intervention: Improving Convergence of Quantization Aware Training for Ultra-Low Precision Transformers (02/23/2023)
Pre-trained Transformer models such as BERT have shown great success in ...

You Only Compress Once: Towards Effective and Elastic BERT Compression via Exploit-Explore Stochastic Nature Gradient (06/04/2021)
Despite superior performance on various natural language processing task...

Enjoy the Salience: Towards Better Transformer-based Faithful Explanations with Word Salience (08/31/2021)
Pretrained transformer-based models such as BERT have demonstrated state...
