Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models

05/21/2023
by   Yijia Zhang, et al.

Efficient deployment of large language models (LLMs) necessitates low-bit quantization to minimize model size and inference cost. While low-bit integer formats (e.g., INT8/INT4) have been the conventional choice, emerging low-bit floating-point formats (e.g., FP8/FP4) offer a compelling alternative and are gaining support from cutting-edge hardware, such as NVIDIA's H100 GPU. However, it remains unclear whether low-bit INT or FP formats are superior for LLM quantization. In this study, we conduct a comparative analysis of INT and FP quantization at the same bit-width, revealing that the optimal quantization format varies across layers due to the complexity and diversity of tensor distributions. Consequently, we advocate Mixture of Formats Quantization (MoFQ), which selects the optimal format on a layer-wise basis. This simple yet effective approach achieves state-of-the-art results in both weight-only (W-only) and weight-activation (WA) post-training quantization scenarios when tested on LLaMA across various tasks. In 4-bit W-only quantization, MoFQ surpasses GPTQ without complex hyperparameter tuning and with an order-of-magnitude faster quantization speed. In 8-bit WA quantization, MoFQ significantly outperforms INT/FP-only methods, achieving performance close to the full-precision model. Notably, MoFQ incurs no hardware overhead compared to INT/FP-only quantization, as the bit-width remains unchanged.
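
The core idea of MoFQ, layer-wise selection between INT and FP formats at the same bit-width, can be illustrated with a small sketch. The snippet below is not the authors' implementation: the per-tensor symmetric scaling, the FP4 (E2M1) value grid, the mean-squared-error selection criterion, and the function names (quantize_int, quantize_fp, select_format) are all illustrative assumptions, written against PyTorch.

# A minimal sketch of layer-wise format selection in the spirit of MoFQ:
# quantize each layer's weights with INT4 and with FP4 at the same bit-width,
# then keep whichever format reconstructs the tensor with lower error.
import torch

def quantize_int(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # Symmetric per-tensor integer quantization with round-to-nearest.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

def quantize_fp(w: torch.Tensor, grid: torch.Tensor) -> torch.Tensor:
    # Snap each weight to the nearest value on a scaled floating-point grid.
    scale = w.abs().max() / grid.abs().max()
    levels = grid * scale
    idx = (w.unsqueeze(-1) - levels).abs().argmin(dim=-1)
    return levels[idx]

# Representable magnitudes of an FP4 E2M1 format (sign handled by mirroring).
FP4_E2M1 = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = torch.cat([-FP4_E2M1.flip(0), FP4_E2M1])

def select_format(w: torch.Tensor) -> str:
    # Pick the format with the lower mean-squared reconstruction error.
    err_int = torch.mean((w - quantize_int(w)) ** 2)
    err_fp = torch.mean((w - quantize_fp(w, FP4_GRID)) ** 2)
    return "INT4" if err_int <= err_fp else "FP4"

# Example: assign a format to each linear layer of a toy model.
model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.Linear(16, 16))
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        print(name, select_format(module.weight.data))

The paper's actual selection metric and quantization granularity may differ; the sketch only conveys that the format choice is made per layer, so the stored bit-width, and hence the hardware cost, matches an INT-only or FP-only scheme.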


