FPTQ: Fine-grained Post-Training Quantization for Large Language Models

08/30/2023
by   Qingyuan Li, et al.

In the era of large-scale language models, the substantial parameter size poses significant challenges for deployment. Quantization has emerged as the mainstream compression practice to tackle this issue, centered mainly on two recipes: W8A8 and W4A16 (i.e., weights and activations at those bit widths). In this study, we propose a novel W4A8 post-training quantization method for available open-source LLMs that combines the advantages of both recipes: the I/O savings of 4-bit weight quantization and the acceleration of 8-bit matrix computation. However, W4A8 is prone to severe performance degradation. As a remedy, we introduce layerwise activation quantization strategies, featuring a novel logarithmic equalization for the most intractable layers, and combine them with fine-grained weight quantization. Without bells and whistles, we eliminate the need for further fine-tuning and obtain state-of-the-art W4A8 quantized performance on BLOOM, LLaMA, and LLaMA-2 on standard benchmarks. We confirm that W4A8 quantization is achievable for the deployment of large language models, fostering their widespread real-world application.
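To make the recipe above concrete, here is a minimal PyTorch sketch of the three building blocks the abstract mentions: per-group (fine-grained) 4-bit weight quantization, per-token 8-bit activation quantization, and a logarithmic channel equalization that tames activation outliers before quantization. Function names, the group size, and the exact equalization formula are illustrative assumptions, not the paper's released implementation.

```python
import torch

def quantize_weights_int4_per_group(w: torch.Tensor, group_size: int = 128):
    """Symmetric int4 quantization with one scale per group of input channels
    (fine-grained weight quantization). Packing into 4-bit storage is omitted."""
    out_f, in_f = w.shape
    assert in_f % group_size == 0, "in_features must be divisible by group_size"
    wg = w.reshape(out_f, in_f // group_size, group_size)
    scale = wg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0  # int4 range [-8, 7]
    q = torch.clamp(torch.round(wg / scale), -8, 7).to(torch.int8)
    return q, scale  # dequantize with (q * scale).reshape(out_f, in_f)

def quantize_activations_int8_per_token(x: torch.Tensor):
    """Symmetric int8 quantization with one scale per token (last dim = hidden)."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def logarithmic_equalization(x: torch.Tensor, w: torch.Tensor):
    """Illustrative log-based equalization (assumed form): divide each activation
    channel by a factor that grows logarithmically with its magnitude and fold the
    inverse factor into the following weight matrix, so x @ w.T is unchanged while
    the activation range becomes friendlier to 8-bit quantization."""
    cmax = x.abs().amax(dim=tuple(range(x.dim() - 1)))    # per-channel activation max
    s = (cmax / torch.log2(2.0 + cmax)).clamp(min=1e-8)   # assumed equalization factor
    return x / s, w * s
```

In a W4A8 linear layer, the equalized activations would be quantized per token to int8 and multiplied against the per-group 4-bit weights (dequantized or recast on the fly); a production kernel would fuse these steps rather than materialize the intermediates as above.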

Related research

04/03/2023
RPTQ: Reorder-based Post-training Quantization for Large Language Models
Large-scale language models (LLMs) have demonstrated outstanding perform...

09/06/2023
Norm Tweaking: High-performance Low-bit Quantization of Large Language Models
As the size of large language models (LLMs) continues to grow, model com...

06/04/2023
OWQ: Lessons learned from activation outliers for weight quantization in large language models
Large language models (LLMs) with hundreds of billions of parameters sho...

07/16/2023
Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study
Despite the superior performance, Large Language Models (LLMs) require s...

08/16/2023
FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs
Large Language Models (LLMs) have achieved state-of-the-art performance ...

09/11/2023
Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs
Large Language Models (LLMs) have proven their exceptional capabilities ...

09/04/2023
Softmax Bias Correction for Quantized Generative Models
Post-training quantization (PTQ) is the go-to compression technique for ...
