VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision Neural Network Inference

02/08/2021
by Steve Dai, et al.

Quantization enables efficient acceleration of deep neural networks by reducing model memory footprint and exploiting low-cost integer math hardware units. Quantization maps floating-point weights and activations in a trained model to low-bitwidth integer values using scale factors. Excessive quantization, reducing precision too aggressively, results in accuracy degradation. When scale factors are shared at a coarse granularity across many dimensions of each tensor, the effective precision of individual elements within the tensor is limited. To reduce quantization-related accuracy loss, we propose using a separate scale factor for each small vector of (≈16-64) elements within a single dimension of a tensor. To achieve an efficient hardware implementation, the per-vector scale factors can be implemented with low-bitwidth integers when calibrated using a two-level quantization scheme. We find that per-vector scaling consistently achieves better inference accuracy at low precision compared to conventional scaling techniques for popular neural networks, without requiring retraining. We also modify a deep learning accelerator hardware design to study the area and energy overheads of per-vector scaling support. Our evaluation demonstrates that per-vector scaled quantization with 4-bit weights and activations achieves 37% area saving and 24% energy saving while maintaining over 75% accuracy for ResNet50 on ImageNet, and that 4-bit weights with 8-bit activations achieve near-full-precision accuracy for both BERT-base and BERT-large on SQuAD while reducing area by 26% compared to an 8-bit baseline.
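
The abstract describes the core mechanism compactly: one scale factor per small vector of elements, with the per-vector scales themselves stored as low-bitwidth integers on top of a single coarse floating-point scale (the two-level scheme). Below is a minimal NumPy sketch of that idea for intuition only; it is not the authors' implementation, and the function name vs_quant_sketch, the simple max-based calibration, and the default bitwidths are assumptions chosen for illustration.

```python
import numpy as np

def vs_quant_sketch(w, vec_size=16, bits=4, scale_bits=4):
    """Illustrative per-vector scaled quantization (not the paper's code).

    w          : 2-D weight matrix, quantized along its last dimension.
    vec_size   : elements sharing one scale factor (paper suggests ~16-64).
    bits       : integer bitwidth of the quantized weights.
    scale_bits : bitwidth of the integer per-vector scale factors in the
                 two-level scheme (integer per-vector scale times one
                 coarse floating-point scale for the whole matrix).
    """
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for signed 4-bit weights
    smax = 2 ** scale_bits - 1          # e.g. 15 for 4-bit unsigned scales

    rows, cols = w.shape
    assert cols % vec_size == 0, "pad the tensor so vectors divide evenly"
    vecs = w.reshape(rows, cols // vec_size, vec_size)

    # Level 1: one real-valued scale per vector (simple max calibration).
    per_vec_absmax = np.abs(vecs).max(axis=-1, keepdims=True)
    per_vec_scale = np.maximum(per_vec_absmax, 1e-8) / qmax

    # Level 2: quantize the per-vector scales with a single coarse
    # floating-point scale, so each vector only stores a small integer.
    coarse_scale = per_vec_scale.max() / smax
    int_scale = np.clip(np.round(per_vec_scale / coarse_scale), 1, smax)

    # Quantize weights against the reconstructed (integer * fp) scales.
    eff_scale = int_scale * coarse_scale
    q = np.clip(np.round(vecs / eff_scale), -qmax - 1, qmax)

    # Dequantize to inspect the error introduced by quantization.
    w_hat = (q * eff_scale).reshape(rows, cols)
    return q.astype(np.int8), int_scale.astype(np.int8), coarse_scale, w_hat

if __name__ == "__main__":
    w = np.random.randn(8, 64).astype(np.float32)
    q, s_int, s_fp, w_hat = vs_quant_sketch(w, vec_size=16, bits=4)
    print("mean abs quantization error:", np.abs(w - w_hat).mean())
```

In this sketch, each group of 16 weights pays only a small integer scale plus one shared floating-point scale per matrix, which is the kind of per-vector overhead the paper's hardware study quantifies.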

Related research

05/16/2023 · MINT: Multiplier-less Integer Quantization for Spiking Neural Networks
We propose Multiplier-less INTeger (MINT) quantization, an efficient uni...

02/18/2022 · LG-LSQ: Learned Gradient Linear Symmetric Quantization
Deep neural networks with lower precision weights and operations at infe...

03/20/2023 · Unit Scaling: Out-of-the-Box Low-Precision Training
We present unit scaling, a paradigm for designing deep learning models t...

07/01/2018 · SYQ: Learning Symmetric Quantization For Efficient Deep Neural Networks
Inference for state-of-the-art deep neural networks is computationally e...

03/19/2019 · Trained Uniform Quantization for Accurate and Efficient Neural Network Inference on Fixed-Point Hardware
We propose a method of training quantization clipping thresholds for uni...

04/10/2017 · WRPN: Training and Inference using Wide Reduced-Precision Networks
For computer vision applications, prior works have shown the efficacy of...

04/19/2018 · Minimizing Area and Energy of Deep Learning Hardware Design Using Collective Low Precision and Structured Compression
Deep learning algorithms have shown tremendous success in many recogniti...
