
ANT: Exploiting Adaptive Numerical Data Type for Low-bit Deep Neural Network Quantization

08/30/2022
by Cong Guo, et al.

Quantization is a technique to reduce the computation and memory cost of DNN models, which are growing increasingly large. Existing quantization solutions use fixed-point integer or floating-point types, which offer limited benefits, as both require more bits to maintain the accuracy of the original models. On the other hand, variable-length quantization uses low-bit quantization for normal values and high precision for a small fraction of outlier values. Although this line of work brings algorithmic benefits, it also introduces significant hardware overheads due to variable-length encoding and decoding. In this work, we propose a fixed-length adaptive numerical data type called ANT to achieve low-bit quantization with tiny hardware overheads. ANT leverages two key innovations to exploit the intra-tensor and inter-tensor adaptive opportunities in DNN models. First, we propose a new data type, flint, that combines the advantages of float and int to adapt to the importance of different values within a tensor. Second, we propose an adaptive framework that selects the best type for each tensor according to its distribution characteristics. We design a unified processing element architecture for ANT and show that it integrates easily with existing DNN accelerators. Our design achieves a 2.8× speedup and a 2.5× improvement in energy efficiency over state-of-the-art quantization accelerators.
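As a rough illustration of the inter-tensor adaptivity described in the abstract, the sketch below quantizes a tensor with two candidate 4-bit formats (a uniform integer grid and a toy non-uniform, float-like grid) and keeps whichever gives the lower reconstruction error. The formats, scaling rules, and MSE criterion are illustrative assumptions; this is not the flint encoding or the selection procedure from the paper.

```python
import numpy as np

def quantize_int4(x, scale):
    """Symmetric 4-bit integer quantization on a uniform grid."""
    q = np.clip(np.round(x / scale), -8, 7)
    return q * scale

def quantize_fp4(x, scale):
    """Toy 4-bit float-like quantization: nearest value on a non-uniform grid
    (the positive levels resemble an e2m1-style format, mirrored for sign)."""
    pos = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
    grid = np.concatenate([-pos[::-1], pos]) * scale
    idx = np.abs(x[..., None] - grid).argmin(axis=-1)
    return grid[idx]

def select_type(tensor):
    """Pick the candidate format with the lowest reconstruction MSE for this tensor."""
    scale_int = np.abs(tensor).max() / 7.0
    scale_fp = np.abs(tensor).max() / 6.0
    candidates = {
        "int4": quantize_int4(tensor, scale_int),
        "fp4": quantize_fp4(tensor, scale_fp),
    }
    errors = {name: np.mean((tensor - q) ** 2) for name, q in candidates.items()}
    best = min(errors, key=errors.get)
    return best, candidates[best]

# Example: a roughly Gaussian tensor vs. a heavy-tailed one with outliers.
best_gaussian, _ = select_type(np.random.normal(0.0, 1.0, 4096))
best_heavy_tail, _ = select_type(np.random.standard_t(2, 4096))
print(best_gaussian, best_heavy_tail)
```

The intent of the example is only to show the selection mechanism: a tensor whose values cluster near zero tends to favor the uniform integer grid, while a heavy-tailed tensor with outliers tends to favor the non-uniform grid that covers a wider dynamic range.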

Related research

08/06/2019 · Cheetah: Mixed Low-Precision Hardware Software Co-Design Framework for DNNs on the Edge
Low-precision DNNs have been extensively explored in order to reduce the...

10/13/2017 · TensorQuant - A Simulation Toolbox for Deep Neural Network Quantization
Recent research implies that training and inference of deep neural netwo...

01/28/2019 · Improving Neural Network Quantization using Outlier Channel Splitting
Quantization can improve the execution latency and energy efficiency of ...

01/28/2019 · Improving Neural Network Quantization without Retraining using Outlier Channel Splitting
Quantization can improve the execution latency and energy efficiency of ...

05/27/2020 · Accelerating Neural Network Inference by Overflow Aware Quantization
The inherent heavy computation of deep neural networks prevents their wi...

10/13/2019 · Overwrite Quantization: Opportunistic Outlier Handling for Neural Network Accelerators
Outliers in weights and activations pose a key challenge for fixed-point...