
VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision Neural Network Inference

02/08/2021
by Steve Dai, et al.

Quantization enables efficient acceleration of deep neural networks by reducing model memory footprint and exploiting low-cost integer math hardware units. Quantization maps floating-point weights and activations in a trained model to low-bitwidth integer values using scale factors. Excessive quantization, reducing precision too aggressively, results in accuracy degradation. When scale factors are shared at a coarse granularity across many dimensions of each tensor, the effective precision of individual elements within the tensor is limited. To reduce quantization-related accuracy loss, we propose using a separate scale factor for each small vector of (≈16-64) elements within a single dimension of a tensor. To achieve an efficient hardware implementation, the per-vector scale factors can be implemented with low-bitwidth integers when calibrated using a two-level quantization scheme. We find that per-vector scaling consistently achieves better inference accuracy at low precision compared to conventional scaling techniques for popular neural networks, without requiring retraining. We also modify a deep learning accelerator hardware design to study the area and energy overheads of per-vector scaling support. Our evaluation demonstrates that per-vector scaled quantization with 4-bit weights and activations achieves 37% area saving and 24% energy saving, and 4-bit weights with 8-bit activations achieve near-full-precision accuracy for both BERT-base and BERT-large on SQuAD while reducing area by 26% compared to an 8-bit baseline.
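To make the scheme concrete, below is a minimal NumPy sketch of per-vector scaled quantization with a two-level scale calibration. It assumes signed 4-bit weight values, unsigned 4-bit per-vector integer scales, one floating-point scale per output row, and simple max-abs calibration; the function name `vs_quant` and its parameters are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def vs_quant(weights, vec_size=16, bits=4, scale_bits=4):
    """Illustrative per-vector scaled quantization with two-level scales.

    Each row of `weights` is split into vectors of `vec_size` elements.
    Level 1: an unsigned integer scale per vector. Level 2: one
    floating-point scale per row that calibrates the integer scales.
    """
    qmax = 2 ** (bits - 1) - 1         # e.g. 7 for signed 4-bit values
    smax = 2 ** scale_bits - 1         # e.g. 15 for unsigned 4-bit scales

    rows, cols = weights.shape
    assert cols % vec_size == 0, "row length must be a multiple of vec_size"
    w = weights.reshape(rows, cols // vec_size, vec_size)

    # Ideal floating-point scale per vector from max-abs calibration.
    fp_scale = np.abs(w).max(axis=-1, keepdims=True) / qmax

    # Two-level scheme: quantize the per-vector scales themselves so they
    # become low-bitwidth integers; one fp scale per row absorbs the range.
    row_scale = np.maximum(fp_scale.max(axis=1, keepdims=True) / smax, 1e-8)
    int_scale = np.clip(np.round(fp_scale / row_scale), 1, smax)

    # Quantize weights against the reconstructed (integer * fp) scale.
    scale = int_scale * row_scale
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)

    # Dequantized weights, useful for measuring quantization error.
    deq = (q * scale).reshape(rows, cols)
    return q, int_scale.astype(np.uint8), row_scale, deq

# Example: quantize a random 64x64 weight matrix and check the error.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, vec_scales, row_scales, w_hat = vs_quant(w)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

In hardware, the low-bitwidth per-vector scales could be applied to integer partial sums inside the MAC datapath, with the per-row floating-point scale applied once at the accumulator output; the sketch above only mirrors the numerics, not the accelerator design the paper evaluates.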
