SqueezeLLM: Dense-and-Sparse Quantization

06/13/2023
by Sehoon Kim, et al.

Generative Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks. However, deploying these models for inference has been a significant challenge due to their unprecedented resource requirements. This has forced existing deployment frameworks to use multi-GPU inference pipelines, which are often complex and costly, or to use smaller and less performant models. In this work, we demonstrate that the main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, specifically for single batch inference. While quantization has emerged as a promising solution by representing model weights with reduced precision, previous efforts have often resulted in notable performance degradation. To address this, we introduce SqueezeLLM, a post-training quantization framework that not only enables lossless compression to ultra-low precisions of up to 3-bit, but also achieves higher quantization performance under the same memory constraint. Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format. When applied to the LLaMA models, our 3-bit quantization significantly reduces the perplexity gap from the FP16 baseline by up to 2.1x as compared to the state-of-the-art methods with the same memory requirement. Furthermore, when deployed on an A6000 GPU, our quantized models achieve up to 2.3x speedup compared to the baseline. Our code is open-sourced and available online.
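To make the two ideas in the abstract concrete, below is a minimal sketch, not the authors' implementation: it assumes a simple magnitude-quantile rule (0.5% outlier fraction) for the Dense-and-Sparse split and a sensitivity-weighted 1-D k-means to build the non-uniform 3-bit codebook, with a random array standing in for the second-order (Fisher) sensitivities. All function names and parameters here are illustrative.

```python
# Minimal sketch of Dense-and-Sparse decomposition plus sensitivity-based
# non-uniform quantization (assumptions noted above; not the paper's code).
import numpy as np
from scipy import sparse


def dense_and_sparse_decompose(W: np.ndarray, outlier_fraction: float = 0.005):
    """Split a weight matrix W = D + S: S holds the largest-magnitude outlier
    values in a sparse full-precision format, D is the dense remainder that
    gets quantized."""
    threshold = np.quantile(np.abs(W), 1.0 - outlier_fraction)
    outlier_mask = np.abs(W) >= threshold
    sparse_part = sparse.csr_matrix(W * outlier_mask)   # kept in full precision
    dense_part = W * ~outlier_mask                       # quantized below
    return dense_part, sparse_part


def sensitivity_weighted_codebook(w: np.ndarray, s: np.ndarray,
                                  n_bits: int = 3, n_iter: int = 20):
    """1-D weighted k-means: cluster the flattened weights w using per-weight
    sensitivities s (e.g. a diagonal Fisher estimate) as sample weights,
    yielding a non-uniform 2**n_bits codebook and per-weight codes."""
    k = 2 ** n_bits
    centroids = np.quantile(w, np.linspace(0.0, 1.0, k))  # spread initial centroids
    for _ in range(n_iter):
        assign = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            m = assign == j
            if m.any():
                centroids[j] = np.average(w[m], weights=s[m] + 1e-12)
    assign = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
    return centroids, assign


# Usage sketch: quantize one layer's weights and reconstruct an approximation.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)
fisher = rng.random(W.shape).astype(np.float32)  # stand-in for real sensitivities

dense, outliers = dense_and_sparse_decompose(W)
codebook, codes = sensitivity_weighted_codebook(dense.ravel(), fisher.ravel())
W_hat = codebook[codes].reshape(W.shape) + outliers.toarray()
```

At inference time, the dense part is stored as 3-bit codes plus the small codebook, while the sparse outlier matrix stays in full precision and is applied separately; the sketch above only illustrates the decomposition and codebook construction, not the custom kernels used for deployment.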

