Teacher Intervention: Improving Convergence of Quantization Aware Training for Ultra-Low Precision Transformers

02/23/2023
by Minsoo Kim, et al.

Pre-trained Transformer models such as BERT have shown great success in a wide range of applications, but at the cost of substantial increases in model complexity. Quantization-aware training (QAT) is a promising method to lower the implementation cost and energy consumption. However, aggressive quantization below 2-bit causes considerable accuracy degradation due to unstable convergence, especially when the downstream dataset is not abundant. This work proposes a proactive knowledge distillation method called Teacher Intervention (TI) for fast-converging QAT of ultra-low precision pre-trained Transformers. TI intervenes in layer-wise signal propagation by injecting the intact signal from the teacher, removing the interference of propagated quantization errors, smoothing the loss surface of QAT, and expediting convergence. Furthermore, we propose a gradual intervention mechanism to stabilize the recovery of subsections of Transformer layers from quantization. The proposed schemes enable fast convergence of QAT and improve model accuracy regardless of the diverse characteristics of downstream fine-tuning tasks. We demonstrate that TI consistently achieves superior accuracy with significantly fewer fine-tuning iterations on well-known Transformers for natural language processing as well as computer vision, compared to state-of-the-art QAT methods.
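To make the core idea concrete, the sketch below illustrates one way a layer-wise distillation step with teacher intervention could look: the quantized student sub-layer receives the full-precision teacher's intact hidden state as input, so quantization error accumulated in earlier layers does not contaminate the current distillation target. This is a minimal sketch under assumed names (ti_distill_step, teacher_blocks, student_blocks), not the authors' released implementation.

```python
# Minimal sketch of the Teacher Intervention (TI) idea in a PyTorch-style setup.
# All names here are illustrative assumptions, not the paper's actual code.
import torch
import torch.nn.functional as F

def ti_distill_step(teacher_blocks, student_blocks, x, intervene=True):
    """One layer-wise distillation step for QAT with optional teacher intervention.

    teacher_blocks: full-precision Transformer layers (frozen teacher).
    student_blocks: quantized layers of the student being fine-tuned.
    x: embedded input batch of shape (batch, seq_len, hidden).
    """
    # Teacher forward pass: collect the intact hidden state after every layer.
    with torch.no_grad():
        t_hiddens, h = [], x
        for t_blk in teacher_blocks:
            h = t_blk(h)
            t_hiddens.append(h)

    loss, s_hidden = x.new_zeros(()), x
    for i, s_blk in enumerate(student_blocks):
        if intervene and i > 0:
            # TI: feed the teacher's intact signal instead of the student's own
            # quantized hidden state, blocking propagated quantization error.
            s_in = t_hiddens[i - 1]
        else:
            s_in = s_hidden
        s_out = s_blk(s_in)
        # Match the quantized layer output to the teacher's full-precision output.
        loss = loss + F.mse_loss(s_out, t_hiddens[i])
        s_hidden = s_out
    return loss
```

The gradual intervention mechanism in the paper can be read as a schedule that progressively relaxes `intervene` (per sub-section of each layer) back to the student's own signal path over fine-tuning; the specific schedule is part of the authors' method and is not reproduced here.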

