ANT: Exploiting Adaptive Numerical Data Type for Low-bit Deep Neural Network Quantization

08/30/2022
by   Cong Guo, et al.

Quantization is a technique to reduce the computation and memory cost of DNN models, which are growing increasingly large. Existing quantization solutions use fixed-point integer or floating-point types, which offer limited benefits, as both require more bits to maintain the accuracy of the original models. On the other hand, variable-length quantization uses low-bit quantization for normal values and high-precision encoding for a fraction of outlier values. Even though this line of work brings algorithmic benefits, it also introduces significant hardware overheads due to variable-length encoding and decoding. In this work, we propose a fixed-length adaptive numerical data type called ANT to achieve low-bit quantization with tiny hardware overheads. ANT leverages two key innovations to exploit intra-tensor and inter-tensor adaptive opportunities in DNN models. First, we propose a particular data type, flint, that combines the advantages of float and int to adapt to the importance of different values within a tensor. Second, we propose an adaptive framework that selects the best type for each tensor according to its distribution characteristics. We design a unified processing element architecture for ANT and show its ease of integration with existing DNN accelerators. Our design results in a 2.8× speedup and 2.5× better energy efficiency over state-of-the-art quantization accelerators.
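
To make the inter-tensor selection idea concrete, here is a minimal Python sketch. It is not the paper's flint encoding, ANT framework, or hardware design; the e2m1-style value set, the mean-squared-error criterion, and all function names are illustrative assumptions. The sketch quantizes a tensor with two candidate 4-bit grids, a uniform integer grid and a non-uniform float-like grid, and keeps whichever yields the lower error, mirroring the observation that the best data type depends on the tensor's value distribution.

import numpy as np

def quantize_int4(x, scale):
    # Uniform 4-bit grid: 16 evenly spaced levels, [-8, 7] * scale.
    return np.clip(np.round(x / scale), -8, 7) * scale

def quantize_fp4(x, scale):
    # Illustrative e2m1-style float grid (1 sign, 2 exponent, 1 mantissa bit):
    # non-uniform levels, denser near zero and sparser toward the maximum.
    levels = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
    mags = np.abs(x) / scale
    idx = np.argmin(np.abs(mags[:, None] - levels), axis=1)
    return np.sign(x) * levels[idx] * scale

def select_type(tensor):
    # Hypothetical per-tensor selection: try both grids and keep the one
    # with the lowest mean-squared quantization error.
    amax = np.max(np.abs(tensor))
    candidates = {
        "int4": quantize_int4(tensor, amax / 7.0),
        "fp4":  quantize_fp4(tensor, amax / 6.0),
    }
    errors = {name: float(np.mean((tensor - q) ** 2)) for name, q in candidates.items()}
    return min(errors, key=errors.get), errors

rng = np.random.default_rng(0)
print(select_type(rng.uniform(-1.0, 1.0, 10_000)))  # flat, outlier-free tensor
print(select_type(rng.standard_t(2, 10_000)))       # heavy-tailed tensor with outliers
# The flat tensor typically favors the uniform int4 grid, while the heavy-tailed
# tensor typically favors the float-like grid; this distribution dependence is
# what motivates choosing a data type per tensor.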
