AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

06/01/2023
by Ji Lin, et al.

Large language models (LLMs) have shown excellent performance on various tasks, but their astronomical model size raises the hardware barrier for serving (memory size) and slows down token generation (memory bandwidth). In this paper, we propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. Our method is based on the observation that weights are not equally important: protecting only 1% of salient weights can greatly reduce quantization error. We then propose to search for the optimal per-channel scaling that protects the salient weights by observing the activations, not the weights. AWQ does not rely on any backpropagation or reconstruction, so it preserves LLMs' generalization ability well across different domains and modalities without overfitting to the calibration set; it also does not rely on any data layout reordering, maintaining hardware efficiency. AWQ outperforms existing work on various language modeling, common sense QA, and domain-specific benchmarks. Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. We also implement efficient tensor core kernels with reorder-free online dequantization to accelerate AWQ, achieving a 1.45x speedup over GPTQ and a 1.85x speedup over the cuBLAS FP16 implementation. Our method provides a turn-key solution to compress LLMs to 3/4 bits for efficient deployment.
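To make the per-channel scaling idea concrete, the snippet below is a minimal NumPy sketch, not the authors' implementation: it scales each input channel by a factor derived from calibration activation magnitudes, simulates low-bit weight-only quantization of the scaled weights, and keeps the scaling exponent that minimizes output reconstruction error. The function names, the alpha search grid, and the 4-bit setting are illustrative assumptions.

```python
import numpy as np

def quantize_int(x, n_bits=4):
    """Simulated uniform asymmetric weight quantization (quantize + dequantize per row)."""
    qmax = 2 ** n_bits - 1
    xmin = x.min(axis=-1, keepdims=True)
    xmax = x.max(axis=-1, keepdims=True)
    scale = (xmax - xmin).clip(1e-8) / qmax
    q = np.clip(np.round((x - xmin) / scale), 0, qmax)
    return q * scale + xmin  # dequantized weights

def awq_scale_and_quantize(W, X, n_bits=4, grid=20):
    """
    W: weight matrix [out_features, in_features]
    X: calibration activations [n_tokens, in_features]
    Searches a per-input-channel scaling s = act_mean**alpha that minimizes
    the output reconstruction error after weight-only quantization.
    (Illustrative sketch; grid search over alpha is an assumption.)
    """
    act_mean = np.abs(X).mean(axis=0) + 1e-8   # per-channel activation magnitude (saliency proxy)
    ref_out = X @ W.T                          # full-precision reference output
    best_err, best_Wq = np.inf, None
    for alpha in np.linspace(0.0, 1.0, grid):  # search the scaling exponent
        s = act_mean ** alpha
        # Scale up salient channels, quantize, then fold the scale back in.
        Wq = quantize_int(W * s, n_bits) / s
        err = ((X @ Wq.T - ref_out) ** 2).mean()
        if err < best_err:
            best_err, best_Wq = err, Wq
    return best_Wq

# Toy usage with uneven channel scales in the calibration activations:
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128))
X = rng.normal(size=(256, 128)) * rng.uniform(0.1, 5.0, size=128)
Wq = awq_scale_and_quantize(W, X)
```

The key design point mirrored here is that channel saliency is read from the activation statistics (the act_mean term) rather than from weight magnitudes, so no backpropagation or reconstruction over the calibration set is needed.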

Related research

11/18/2022 · SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
Large language models (LLMs) show excellent performance but are compute-...

09/06/2023 · Norm Tweaking: High-performance Low-bit Quantization of Large Language Models
As the size of large language models (LLMs) continues to grow, model com...

06/04/2023 · OWQ: Lessons learned from activation outliers for weight quantization in large language models
Large language models (LLMs) with hundreds of billions of parameters sho...

08/25/2023 · OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
Large language models (LLMs) have revolutionized natural language proces...

07/12/2023 · Self-Distilled Quantization: Achieving High Compression Rates in Transformer-Based Language Models
We investigate the effects of post-training quantization and quantizatio...

08/13/2023 · Token-Scaled Logit Distillation for Ternary Weight Generative Language Models
Generative Language Models (GLMs) have shown impressive performance in t...

09/02/2023 · eDKM: An Efficient and Accurate Train-time Weight Clustering for Large Language Models
Since Large Language Models or LLMs have demonstrated high-quality perfo...
