INT-FP-QSim: Mixed Precision and Formats For Large Language Models and Vision Transformers

07/07/2023
by Lakshmi Nair, et al.

The recent rise of large language models (LLMs) has spurred efforts to run them at reduced precision. Lower precision eases resource constraints and furthers the democratization of LLMs, enabling users to run billion-parameter models on their personal devices. To supplement this ongoing effort, we propose INT-FP-QSim: an open-source simulator that enables flexible evaluation of LLMs and vision transformers at various numerical precisions and formats. INT-FP-QSim builds on existing open-source repositories such as TensorRT, QPytorch and AIMET to provide a combined simulator that supports various floating point and integer formats. Using our simulator, we survey the impact of different numerical formats on the performance of LLMs and vision transformers with 4-bit weights and 4-bit or 8-bit activations. We also compare the effect of recently proposed methods such as Adaptive Block Floating Point, SmoothQuant, GPTQ and RPTQ on model performance. We hope INT-FP-QSim will enable researchers to flexibly simulate models at various precisions and support further research in quantization of LLMs and vision transformers.
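As a rough illustration of what simulating reduced precision means, the sketch below applies symmetric per-tensor integer quantize-dequantize ("fake quantization") to a list of weights. It is a minimal sketch in plain Python and is not INT-FP-QSim's API: the function name `fake_quantize_int`, the symmetric per-tensor scaling, and the sample weights are all assumptions made for illustration.

```python
def fake_quantize_int(values, num_bits=4):
    """Quantize-dequantize floats on a symmetric signed integer grid.

    Hypothetical helper for illustration only; not part of INT-FP-QSim.
    """
    if not values:
        return []
    qmax = 2 ** (num_bits - 1) - 1          # largest integer level, e.g. 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0.0 for _ in values]
    scale = max_abs / qmax                  # one scale for the whole tensor (assumption)
    out = []
    for v in values:
        q = round(v / scale)                # snap to the nearest integer level
        q = max(-qmax, min(qmax, q))        # clamp to the representable range
        out.append(q * scale)               # map back to floating point
    return out


# Illustrative values, not taken from any model.
weights = [0.7, -0.35, 0.1, 0.02]
w4 = fake_quantize_int(weights, num_bits=4)
w8 = fake_quantize_int(weights, num_bits=8)

# Compare the worst-case rounding error at the two bit widths.
err4 = max(abs(a - b) for a, b in zip(weights, w4))
err8 = max(abs(a - b) for a, b in zip(weights, w8))
print("max error at 4 bits:", err4)
print("max error at 8 bits:", err8)
```

The model still computes in floating point, but every value is restricted to what the low-precision format could represent, so accuracy can be measured without dedicated hardware. Floating point and block floating point formats would need a different rounding rule than the integer grid used here.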

research
11/20/2020

HAWQV3: Dyadic Neural Network Quantization

Quantization is one of the key techniques used to make Neural Networks (...
research
05/21/2023

Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models

Efficient deployment of large language models (LLMs) necessitates low-bi...
research
10/28/2021

FAST: DNN Training Under Variable Precision Block Floating Point with Stochastic Rounding

Block Floating Point (BFP) can efficiently support quantization for Deep...
research
06/22/2023

Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing

Transformer models have been widely adopted in various domains over the ...
research
02/11/2018

QRkit: Sparse, Composable QR Decompositions for Efficient and Stable Solutions to Problems in Computer Vision

Embedded computer vision applications increasingly require the speed and...
research
02/09/2023

Binarized Neural Machine Translation

The rapid scaling of language models is motivating research using low-bi...
research
09/19/2022

SAMP: A Toolkit for Model Inference with Self-Adaptive Mixed-Precision

The latest industrial inference engines, such as FasterTransformer and ...
