OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

08/25/2023
by Wenqi Shao, et al.

Large language models (LLMs) have revolutionized natural language processing tasks. However, their practical deployment is hindered by their immense memory and computation requirements. Although recent post-training quantization (PTQ) methods are effective in reducing the memory footprint and improving the computational efficiency of LLMs, they rely on hand-crafted quantization parameters, which leads to low performance and fails in extremely low-bit quantization. To tackle this issue, we introduce an Omnidirectionally calibrated Quantization (OmniQuant) technique for LLMs, which achieves good performance in diverse quantization settings while maintaining the computational efficiency of PTQ by efficiently optimizing various quantization parameters. OmniQuant comprises two innovative components: Learnable Weight Clipping (LWC) and Learnable Equivalent Transformation (LET). LWC modulates the extreme values of weights by optimizing the clipping threshold. Meanwhile, LET tackles activation outliers by shifting the challenge of quantization from activations to weights through a learnable equivalent transformation. Operating within a differentiable framework using block-wise error minimization, OmniQuant can optimize the quantization process efficiently for both weight-only and weight-activation quantization. For instance, the LLaMA-2 model family, ranging from 7B to 70B parameters, can be processed with OmniQuant on a single A100-40G GPU within 1-16 hours using 128 samples. Extensive experiments validate OmniQuant's superior performance across diverse quantization configurations such as W4A4, W6A6, W4A16, W3A16, and W2A16. Additionally, OmniQuant demonstrates effectiveness in instruction-tuned models and delivers notable improvements in inference speed and memory reduction on real devices. Codes and models are available at <https://github.com/OpenGVLab/OmniQuant>.
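As a rough illustration of the Learnable Weight Clipping idea described above, the sketch below fake-quantizes a weight matrix with per-channel clipping factors parameterized through a sigmoid and trained via a straight-through estimator, so the clipping thresholds can be optimized against a block-wise reconstruction loss. The class name, the parameter names `gamma` and `beta`, their initialization, and the per-output-channel granularity are illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch of Learnable Weight Clipping (LWC) in PyTorch.
# Names, initialization, and granularity are assumptions for illustration.
import torch
import torch.nn as nn


def round_ste(x: torch.Tensor) -> torch.Tensor:
    """Round to the nearest integer with a straight-through gradient."""
    return (x.round() - x).detach() + x


class LearnableWeightClipping(nn.Module):
    """Fake-quantizes a weight matrix with learnable clipping thresholds.

    Rather than clipping at the raw per-channel min/max, the range is shrunk
    by sigmoid(gamma) and sigmoid(beta); these factors receive gradients from
    the block-wise reconstruction loss and are the only trained parameters.
    """

    def __init__(self, out_features: int, n_bits: int = 4):
        super().__init__()
        self.n_bits = n_bits
        # Initialized so sigmoid is close to 1, i.e. almost no clipping at
        # the start of calibration (an assumption for this sketch).
        self.gamma = nn.Parameter(torch.full((out_features, 1), 4.0))  # upper
        self.beta = nn.Parameter(torch.full((out_features, 1), 4.0))   # lower

    def forward(self, weight: torch.Tensor) -> torch.Tensor:
        # weight: (out_features, in_features)
        qmax = 2 ** self.n_bits - 1
        # Learnable shrinkage of the per-channel weight range.
        w_max = torch.sigmoid(self.gamma) * weight.amax(dim=1, keepdim=True)
        w_min = torch.sigmoid(self.beta) * weight.amin(dim=1, keepdim=True)
        scale = (w_max - w_min).clamp(min=1e-5) / qmax
        zero_point = round_ste(-w_min / scale)
        # Quantize-dequantize; the straight-through rounding keeps gradients
        # flowing back to gamma and beta through scale and zero_point.
        w_int = torch.clamp(round_ste(weight / scale) + zero_point, 0, qmax)
        return (w_int - zero_point) * scale
```

In a calibration loop, these clipping factors (together with LET's scaling and shifting parameters) would be updated by minimizing each transformer block's reconstruction error on a small calibration set, after which the learned quantization parameters are fixed for inference.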


Related research

11/18/2022
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
Large language models (LLMs) show excellent performance but are compute-...

09/11/2023
Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs
Large Language Models (LLMs) have proven their exceptional capabilities ...

07/19/2023
ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats
In the complex domain of large language models (LLMs), striking a balanc...

11/27/2020
Rethinking Generalization in American Sign Language Prediction for Edge Devices with Extremely Low Memory Footprint
Due to the boom in technical compute in the last few years, the world ha...

06/01/2023
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Large language models (LLMs) have shown excellent performance on various...

03/11/2022
QDrop: Randomly Dropping Quantization for Extremely Low-bit Post-Training Quantization
Recently, post-training quantization (PTQ) has driven much attention to ...

06/02/2022
NIPQ: Noise Injection Pseudo Quantization for Automated DNN Optimization
The optimization of neural networks in terms of computation cost and mem...
