SQuAT: Sharpness- and Quantization-Aware Training for BERT

10/13/2022
by Zheng Wang, et al.

Quantization is an effective technique to reduce the memory footprint, inference latency, and power consumption of deep learning models. However, existing quantization methods suffer from accuracy degradation compared to full-precision (FP) models due to the errors introduced by coarse gradient estimation through non-differentiable quantization layers. The existence of sharp local minima in the loss landscapes of overparameterized models (e.g., Transformers) tends to aggravate this performance penalty in low-bit (2-, 4-bit) settings. In this work, we propose sharpness- and quantization-aware training (SQuAT), which encourages the model to converge to flatter minima while performing quantization-aware training. Our method alternates training between a sharpness objective and a step-size objective, allowing the model to learn the most suitable parameter update magnitude and converge near flat minima. Extensive experiments show that our method consistently outperforms state-of-the-art quantized BERT models under 2-, 3-, and 4-bit settings on GLUE benchmarks by about 1%, and can sometimes even outperform full-precision (32-bit) models. Our empirical measurements of sharpness also suggest that our method leads to flatter minima than other quantization methods.
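To make the alternating scheme concrete, below is a minimal PyTorch sketch of one possible reading of the abstract, not the authors' implementation: an LSQ-style layer with a learnable quantization step size, and an update routine that alternates a SAM-style sharpness step on the weights with an ordinary step on the step sizes. The layer design, the perturbation radius rho, the bit-width, the learning rates, and the toy data are all illustrative assumptions.

```python
# Hypothetical sketch of alternating sharpness / step-size training (not the SQuAT code).
import torch
import torch.nn as nn


class LSQQuantLinear(nn.Module):
    """Linear layer with a learnable quantization step size (LSQ-style)."""

    def __init__(self, in_features, out_features, n_bits=2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.step = nn.Parameter(torch.tensor(0.05))      # learnable step size
        self.qmax = 2 ** (n_bits - 1) - 1                 # symmetric quantization range

    def forward(self, x):
        # Straight-through estimator: round in the forward pass, identity in the backward pass.
        w_scaled = torch.clamp(self.weight / self.step, -self.qmax - 1, self.qmax)
        w_q = (w_scaled.round() - w_scaled).detach() + w_scaled
        return x @ (w_q * self.step).t()


def alternating_update(model, loss_fn, x, y, opt_w, opt_s, rho=0.05):
    """One alternating update: (1) sharpness-aware step on the weights,
    (2) ordinary gradient step on the quantization step sizes."""
    weights = [p for n, p in model.named_parameters() if "step" not in n]

    # Phase 1: SAM-style sharpness objective on the weights.
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, weights)
    grad_norm = torch.norm(torch.stack([g.norm() for g in grads])) + 1e-12
    with torch.no_grad():
        eps = [rho * g / grad_norm for g in grads]        # ascent direction
        for p, e in zip(weights, eps):
            p.add_(e)                                     # move to the perturbed point
    loss_fn(model(x), y).backward()                       # gradient at the perturbed point
    with torch.no_grad():
        for p, e in zip(weights, eps):
            p.sub_(e)                                     # restore the original weights
    opt_w.step()
    opt_w.zero_grad()
    opt_s.zero_grad()

    # Phase 2: step-size objective (update the quantizer scales only).
    loss_fn(model(x), y).backward()
    opt_s.step()
    opt_w.zero_grad()
    opt_s.zero_grad()


if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Sequential(LSQQuantLinear(16, 32, n_bits=2), nn.ReLU(),
                          LSQQuantLinear(32, 2, n_bits=2))
    loss_fn = nn.CrossEntropyLoss()
    step_params = [p for n, p in model.named_parameters() if "step" in n]
    weight_params = [p for n, p in model.named_parameters() if "step" not in n]
    opt_w = torch.optim.SGD(weight_params, lr=1e-2)
    opt_s = torch.optim.SGD(step_params, lr=1e-3)
    x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
    for _ in range(5):
        alternating_update(model, loss_fn, x, y, opt_w, opt_s)
```

The sketch separates the two parameter groups into their own optimizers so that the sharpness-aware perturbation touches only the weights, while the step sizes are trained on the unperturbed quantization loss; the paper's exact objectives and schedules may differ.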


Related research

11/24/2021
Sharpness-aware Quantization for Deep Neural Networks
Network quantization is an effective compression method to reduce the mo...

06/13/2022
Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training
Data clipping is crucial in reducing noise in quantization operations an...

12/11/2022
Error-aware Quantization through Noise Tempering
Quantization has become a predominant approach for model compression, en...

05/25/2022
Train Flat, Then Compress: Sharpness-Aware Minimization Learns More Compressible Models
Model compression by way of parameter pruning, quantization, or distilla...

11/16/2016
The ZipML Framework for Training Models with End-to-End Low Precision: The Cans, the Cannots, and a Little Bit of Deep Learning
Recently there has been significant interest in training machine-learnin...

08/21/2023
Jumping through Local Minima: Quantization in the Loss Landscape of Vision Transformers
Quantization scale and bit-width are the most important parameters when ...

08/21/2023
QD-BEV: Quantization-aware View-guided Distillation for Multi-view 3D Object Detection
Multi-view 3D detection based on BEV (bird-eye-view) has recently achiev...
