Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models

09/27/2022
by Xiuying Wei, et al.

The Transformer architecture has become a fundamental building block of widely used natural language processing (NLP) models. As NLP models grow larger, their increasing memory and computation costs hinder efficient deployment on resource-limited devices, so transformer quantization has attracted wide research interest. Recent work recognizes that structured outliers are the critical bottleneck for quantization performance; however, the proposed remedies increase computation overhead and still leave the outliers in place. To address this problem at its root, this paper investigates the inherent cause and importance of the outliers. We find that the scale parameter γ in LayerNorm (LN) acts as a harmful amplifier of the outliers, and that the importance of outliers varies greatly: some outliers, produced by only a few tokens, cover a large range yet can be clipped sharply without negative impact. Motivated by these findings, we propose an outlier suppression framework with two components: Gamma Migration and Token-Wise Clipping. Gamma Migration moves the outlier amplifier into subsequent modules through an equivalent transformation, yielding a more quantization-friendly model without any extra burden. Token-Wise Clipping exploits the large variance of token ranges with a token-wise coarse-to-fine pipeline that efficiently finds a clipping range with minimal final quantization loss. The framework effectively suppresses the outliers and can be used in a plug-and-play mode. Extensive experiments show that our framework surpasses existing works and, for the first time, pushes 6-bit post-training BERT quantization to the full-precision (FP) level. Our code is available at https://github.com/wimh966/outlier_suppression.
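To make the Gamma Migration idea concrete, below is a minimal PyTorch sketch, not the authors' implementation from the linked repository. It folds the LayerNorm scale γ into a following `nn.Linear` so that the activation handed to the quantizer no longer carries the γ amplification; the names `migrate_gamma`, `ln`, and `fc` are illustrative assumptions, and the handling of shortcut branches described in the paper is omitted.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def migrate_gamma(ln: nn.LayerNorm, fc: nn.Linear) -> None:
    """Fold the LayerNorm scale gamma into the following Linear layer.

    After the transform the LayerNorm outputs (x - mu) / sigma + beta / gamma
    (no gamma amplification), while the Linear absorbs gamma into its weight
    columns, so the composed function fc(ln(x)) is unchanged.
    Hypothetical helper; shortcut branches that also consume the LN output
    are not handled here.
    """
    gamma = ln.weight.clone()
    ln.bias.div_(gamma)     # new bias: beta / gamma (gamma assumed nonzero)
    ln.weight.fill_(1.0)    # non-scaling LayerNorm
    fc.weight.mul_(gamma)   # scale each input column: W' = W * diag(gamma)

# Toy equivalence check.
ln, fc = nn.LayerNorm(8), nn.Linear(8, 16)
ln.weight.data.uniform_(0.5, 2.0)   # avoid near-zero gamma in this toy example
ln.bias.data.normal_()
x = torch.randn(4, 8)
ref = fc(ln(x))
migrate_gamma(ln, fc)
assert torch.allclose(ref, fc(ln(x)), atol=1e-5)
```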

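Token-Wise Clipping can likewise be sketched in simplified form. The snippet below is a stand-in under stated assumptions, not the paper's pipeline: it grid-searches candidate clip values taken from quantiles of the per-token maxima (coarse stage) and then refines locally (fine stage), minimizing activation reconstruction error on a calibration tensor rather than the final model loss the paper optimizes; the names `fake_quant` and `token_wise_clipping` are hypothetical.

```python
import torch

def fake_quant(x, clip_max, n_bits=6):
    """Symmetric uniform fake quantization with a given clipping range."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = clip_max / qmax
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

def token_wise_clipping(act, n_bits=6, coarse_grid=20, fine_steps=10):
    """Coarse-to-fine search for a clipping value driven by per-token ranges.

    act: (num_tokens, hidden_dim) calibration activations.
    Coarse stage: candidates come from quantiles of the per-token maxima,
    exploiting the fact that only a few tokens produce the extreme outliers.
    Fine stage: local grid refinement around the best coarse candidate.
    """
    token_max = act.abs().amax(dim=1)          # one range statistic per token
    best_clip, best_err = None, float("inf")

    for q in torch.linspace(0.5, 1.0, coarse_grid):
        clip = torch.quantile(token_max, q.item())
        err = (fake_quant(act, clip, n_bits) - act).pow(2).mean()
        if err < best_err:
            best_clip, best_err = clip, err

    for ratio in torch.linspace(0.8, 1.2, fine_steps):
        clip = best_clip * ratio
        err = (fake_quant(act, clip, n_bits) - act).pow(2).mean()
        if err < best_err:
            best_clip, best_err = clip, err
    return best_clip

# Example: pick a 6-bit clipping range from random calibration activations.
calib = torch.randn(512, 768)
print(token_wise_clipping(calib))
```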

