Norm Tweaking: High-performance Low-bit Quantization of Large Language Models

09/06/2023
by Liang Li, et al.

As the size of large language models (LLMs) continues to grow, model compression without sacrificing accuracy has become a crucial challenge for deployment. While some quantization methods, such as GPTQ, have made progress toward acceptable 4-bit weight-only quantization, attempts at lower-bit quantization often result in severe performance degradation. In this paper, we introduce a technique called norm tweaking, which can be used as a plug-in for current post-training quantization (PTQ) methods to achieve high precision while remaining cost-efficient. Our approach is inspired by the observation that rectifying the quantized activation distribution to match its float counterpart can readily restore accuracy for LLMs. To achieve this, we carefully design a tweaking strategy that includes calibration data generation and a channel-wise distance constraint to update the weights of the normalization layers for better generalization. We conduct extensive experiments on various datasets using several open-source LLMs. Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations, surpassing existing PTQ methods. On GLM-130B and OPT-66B, it even reaches the same level of accuracy at 2-bit quantization as the float models. This simple and effective approach makes quantized LLMs more practical for real-world applications.
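
The core idea lends itself to a short illustration. Below is a minimal PyTorch sketch of a norm-tweaking loop, not the authors' released code: the quantized weights stay frozen, only the LayerNorm affine parameters are trained, and a simple per-channel mean/std match against the float block's activations stands in for the paper's channel-wise distance constraint. The names `norm_tweak`, `float_block`, `quant_block`, and `calib_inputs` are hypothetical stand-ins.

```python
import torch
import torch.nn as nn


def norm_tweak(float_block, quant_block, calib_inputs, iters=100, lr=1e-5):
    """Update only the LayerNorm affine parameters of a quantized block so
    its output activation statistics match the float block's, channel by
    channel. `float_block`, `quant_block`, and `calib_inputs` are
    hypothetical stand-ins for one float/quantized block pair and a batch
    of generated calibration activations of shape (batch, seq, hidden)."""
    # Freeze every parameter, then re-enable gradients only for the
    # normalization layers' weights and biases.
    for p in quant_block.parameters():
        p.requires_grad_(False)
    ln_params = []
    for m in quant_block.modules():
        if isinstance(m, nn.LayerNorm) and m.weight is not None:
            ln_params.append(m.weight)
            if m.bias is not None:
                ln_params.append(m.bias)
    for p in ln_params:
        p.requires_grad_(True)

    opt = torch.optim.Adam(ln_params, lr=lr)

    # Float activations serve as the fixed target distribution.
    with torch.no_grad():
        target = float_block(calib_inputs)
    t_mean, t_std = target.mean(dim=(0, 1)), target.std(dim=(0, 1))

    for _ in range(iters):
        out = quant_block(calib_inputs)
        # Channel-wise distance: match per-channel mean and standard
        # deviation of the quantized output to the float reference.
        loss = ((out.mean(dim=(0, 1)) - t_mean) ** 2
                + (out.std(dim=(0, 1)) - t_std) ** 2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return quant_block
```

In the paper this kind of tweaking is applied block by block on self-generated calibration text; the exact distance metric, calibration data generation, and schedule follow the paper rather than this sketch.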

Related research

FPTQ: Fine-grained Post-Training Quantization for Large Language Models (08/30/2023)
A High-Performance Adaptive Quantization Approach for Edge CNN Applications (07/18/2021)
Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study (07/16/2023)
Softmax Bias Correction for Quantized Generative Models (09/04/2023)
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (06/01/2023)
Model Compression for DNN-Based Text-Independent Speaker Verification Using Weight Quantization (10/31/2022)
QD-BEV: Quantization-aware View-guided Distillation for Multi-view 3D Object Detection (08/21/2023)