Post-Training Sparsity-Aware Quantization

05/23/2021
by   Gil Shomron, et al.

Quantization is a technique used in deep neural networks (DNNs) to increase execution performance and hardware efficiency. Uniform post-training quantization (PTQ) methods are common, since they can be implemented efficiently in hardware and require neither a training set nor extensive hardware resources. Mapping FP32 models to INT8 with uniform PTQ yields models with negligible accuracy degradation; however, reducing precision below 8 bits with PTQ is challenging, because the increased quantization noise leads to noticeable accuracy loss. In this paper, we propose a sparsity-aware quantization (SPARQ) method, in which unstructured and dynamic activation sparsity is leveraged at different representation granularities. For 4-bit quantization, for example, the bits of each 8-bit value are examined dynamically and a 4-bit window is chosen, skipping the leading zero-value bits. Moreover, instead of quantizing activation-by-activation to 4 bits, we focus on pairs of 8-bit activations and examine whether one of the two is zero. If one is zero, the other can opportunistically use its 4-bit budget; if neither is zero, each is dynamically quantized to 4 bits as described. SPARQ achieves minor accuracy degradation, a 2x speedup over widely used hardware architectures, and a practical hardware implementation. The code is available at https://github.com/gilshm/sparq.
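As a rough illustration of the two mechanisms described in the abstract, the following Python sketch shows how an unsigned 8-bit activation might be trimmed to a 4-bit window that skips leading zero bits, and how a zero activation in a pair lets its neighbor keep its full 8-bit value. The helper names `pick_4bit_window` and `sparq_pair` are hypothetical and this is not the paper's released implementation; see the repository linked above for the actual code.

```python
def pick_4bit_window(x8: int) -> int:
    """Sketch: trim a non-negative 8-bit activation to a 4-bit window
    anchored at its most significant set bit (skipping leading zeros),
    rounding to nearest within the window. Hypothetical helper."""
    if x8 == 0:
        return 0
    msb = x8.bit_length() - 1              # position of the most significant set bit
    shift = max(msb - 3, 0)                # window covers bits [msb .. msb-3]
    window = (x8 + ((1 << shift) >> 1)) >> shift   # round to nearest
    window = min(window, 0xF)              # clamp if rounding overflowed the 4 bits
    return window << shift                 # place the window back at its position


def sparq_pair(a: int, b: int) -> tuple:
    """Sketch: if one activation of the pair is zero, the other keeps its
    full 8-bit value (it borrows the zero's 4-bit budget); otherwise both
    are trimmed to 4-bit windows. Hypothetical helper."""
    if a == 0 or b == 0:
        return a, b
    return pick_4bit_window(a), pick_4bit_window(b)
```

For example, under these assumptions `pick_4bit_window(37)` keeps the window covering bits 5..2 and returns 36, while `sparq_pair(0, 200)` leaves 200 untouched because its partner is zero.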


