The case for 4-bit precision: k-bit Inference Scaling Laws

12/19/2022
by Tim Dettmers, et al.

Quantization methods reduce the number of bits required to represent each parameter in a model, trading accuracy for smaller memory footprints and lower inference latency. However, the final model size depends on both the number of parameters of the original model and the rate of compression. For example, a 30B 8-bit model and a 60B 4-bit model have the same number of bits but may have very different zero-shot accuracies. In this work, we study this trade-off by developing inference scaling laws of zero-shot performance in Large Language Models (LLMs) to determine the bit-precision and model size that maximize zero-shot performance. We run more than 35,000 experiments with 16-bit inputs and k-bit parameters to examine which zero-shot quantization methods improve scaling for 3 to 8-bit precision at scales of 19M to 176B parameters across the LLM families BLOOM, OPT, NeoX/Pythia, and GPT-2. We find that it is challenging to improve the bit-level scaling trade-off, with the only improvements coming from a small block size (splitting the parameters into small, independently quantized blocks) and the quantization data type being used (e.g., Int vs. Float). Overall, our findings show that 4-bit precision is almost universally optimal for total model bits and zero-shot accuracy.
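The two knobs the abstract highlights are the block size used for block-wise quantization and the quantization data type. As a concrete illustration, below is a minimal sketch of symmetric block-wise Int quantization with absmax scaling in NumPy; the function names, the 64-element block size, and the Int8 storage format are assumptions for demonstration, not the paper's implementation.

```python
# Illustrative sketch (not the paper's code): block-wise k-bit integer
# quantization with one absmax scale per block.
import numpy as np

def quantize_blockwise(weights: np.ndarray, bits: int = 4, block_size: int = 64):
    """Split a weight tensor into blocks and quantize each block independently."""
    levels = 2 ** (bits - 1) - 1                 # e.g., 7 for symmetric 4-bit ints
    flat = weights.ravel()
    pad = (-len(flat)) % block_size              # pad so the length divides evenly
    flat = np.concatenate([flat, np.zeros(pad)])
    blocks = flat.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / levels  # absmax scale per block
    scales[scales == 0] = 1.0                    # avoid division by zero for all-zero blocks
    q = np.clip(np.round(blocks / scales), -levels, levels).astype(np.int8)
    return q, scales, weights.shape, pad

def dequantize_blockwise(q, scales, shape, pad):
    """Recover a floating-point approximation of the original weights."""
    deq = (q.astype(np.float32) * scales).ravel()
    deq = deq[:deq.size - pad] if pad else deq
    return deq.reshape(shape)

# Example: round-trip a random weight vector and measure the error.
w = np.random.randn(1024).astype(np.float32)
q, s, shape, pad = quantize_blockwise(w, bits=4, block_size=64)
w_hat = dequantize_blockwise(q, s, shape, pad)
print("mean abs error:", np.abs(w - w_hat).mean())
```

Smaller blocks confine outliers to fewer weights and so reduce quantization error, at the cost of storing one scale value per block; this is the block-size trade-off the abstract refers to.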


Related research

A Comprehensive Study on Post-Training Quantization for Large Language Models (03/15/2023)
Post-training quantization (PTQ) had been recently shown as a compromising ...

ZeroQ: A Novel Zero Shot Quantization Framework (01/01/2020)
Quantization is a promising approach for reducing the inference time and...

Zero-Shot Dynamic Quantization for Transformer Inference (11/17/2022)
We introduce a novel run-time method for significantly reducing the accu...

Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design (05/22/2023)
Scaling laws have been recently employed to derive compute-optimal model...

8-bit Optimizers via Block-wise Quantization (10/06/2021)
Stateful optimizers maintain gradient statistics over time, e.g., the ex...

Combined Scaling for Zero-shot Transfer Learning (11/19/2021)
We present a combined scaling method called BASIC that achieves 85.7 zer...

IntraQ: Learning Synthetic Images with Intra-Class Heterogeneity for Zero-Shot Network Quantization (11/17/2021)
Learning to synthesize data has emerged as a promising direction in zero...
