LLM-QAT: Data-Free Quantization Aware Training for Large Language Models

05/29/2023
by Zechun Liu, et al.

Several post-training quantization methods have been applied to large language models (LLMs) and have been shown to perform well down to 8 bits. We find that these methods break down at lower bit precision, and investigate quantization aware training for LLMs (LLM-QAT) to push quantization levels even further. We propose a data-free distillation method that leverages generations produced by the pre-trained model, which better preserves the original output distribution and allows quantizing any generative model independent of its training data, similar to post-training quantization methods. In addition to quantizing weights and activations, we also quantize the KV cache, which is critical for increasing throughput and supporting long sequence dependencies at current model sizes. We experiment with LLaMA models of sizes 7B, 13B, and 30B, at quantization levels down to 4 bits. We observe large improvements over training-free methods, especially in the low-bit settings.
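For intuition, below is a minimal, hypothetical sketch of the "fake" quantization commonly used in quantization aware training: tensors are rounded to a low-bit grid in the forward pass while gradients pass through unchanged via a straight-through estimator. The symmetric min-max quantizer, the per-channel/per-token granularity, and all names are illustrative assumptions; the abstract does not specify LLM-QAT's exact quantizer or distillation loss. In LLM-QAT, quantizers of this kind would be applied to weights, activations, and the KV cache, with training data generated by the full-precision model itself.

```python
# Hypothetical sketch: symmetric min-max fake quantization with a
# straight-through estimator (STE), as typically used in QAT.
# The exact quantizer and granularity of LLM-QAT are not given in the
# abstract; everything below is illustrative.
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 4, dim: int = -1) -> torch.Tensor:
    """Quantize-dequantize x symmetrically along `dim` (per-channel or per-token)."""
    qmax = 2 ** (num_bits - 1) - 1                              # e.g. 7 for signed 4-bit
    scale = x.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / qmax
    x_dq = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
    # STE: forward pass returns the quantized values, gradients treat it as identity.
    return x + (x_dq - x).detach()

# Usage: weights per output channel; activations / KV cache entries per token.
w = torch.randn(4096, 4096)                  # weight matrix (out_features, in_features)
a = torch.randn(2, 128, 4096)                # activations (batch, tokens, hidden)
w_q = fake_quantize(w, num_bits=4, dim=1)    # one scale per output channel
a_q = fake_quantize(a, num_bits=8, dim=-1)   # one scale per token
```

In a data-free setup like the one described above, the training batches for such quantizers would be sequences sampled from the full-precision model itself, with its outputs serving as the distillation target.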

