A Survey of Quantization Methods for Efficient Neural Network Inference

03/25/2021
by Amir Gholami, et al.

As soon as abstract mathematical computations were adapted to computation on digital computers, the problem of efficient representation, manipulation, and communication of the numerical values in those computations arose. Strongly related to the problem of numerical representation is the problem of quantization: in what manner should a set of continuous real-valued numbers be distributed over a fixed discrete set of numbers to minimize the number of bits required and also to maximize the accuracy of the attendant computations? This perennial problem of quantization is particularly relevant whenever memory and/or computational resources are severely restricted, and it has come to the forefront in recent years due to the remarkable performance of Neural Network models in computer vision, natural language processing, and related areas. Moving from floating-point representations to low-precision fixed integer values represented in four bits or less holds the potential to reduce the memory footprint and latency by a factor of 16x; and, in fact, reductions of 4x to 8x are often realized in practice in these applications. Thus, it is not surprising that quantization has emerged recently as an important and very active sub-area of research in the efficient implementation of computations associated with Neural Networks. In this article, we survey approaches to the problem of quantizing the numerical values in deep Neural Network computations, covering the advantages/disadvantages of current methods. With this survey and its organization, we hope to have presented a useful snapshot of the current research in quantization for Neural Networks and to have given an intelligent organization to ease the evaluation of future research in this area.
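The abstract frames quantization as mapping a set of continuous real values onto a fixed discrete grid. As an illustration only, the sketch below shows uniform affine quantization to signed 4-bit integers, one common scheme among the many the survey covers; the function names and the simple min/max calibration are assumptions for this example, not the paper's specific method.

```python
# Illustrative sketch of uniform affine quantization to signed 4-bit
# integers. The min/max calibration and scale/zero-point formulas follow
# a common convention, not any particular method from the survey.

def quantize(values, num_bits=4):
    """Map real values onto a grid of 2**num_bits integer levels."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    lo, hi = min(values), max(values)
    # Guard against a degenerate range (all values equal).
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    # Round to the nearest level and clamp to the representable range.
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate real values from the integer codes."""
    return [(qi - zero_point) * scale for qi in q]

vals = [0.0, 0.4, 1.0, 2.0, 3.0]
q, s, z = quantize(vals)
approx = dequantize(q, s, z)
```

With 4 bits the grid has only 16 levels, so the reconstruction error of each value is bounded by the step size `s`; this accuracy-versus-bit-width trade-off is exactly the tension the abstract describes.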


Related research:

- Scaled Quantization for the Vision Transformer (03/23/2023)
- Ternary Quantization: A Survey (03/02/2023)
- A Survey on Methods and Theories of Quantized Neural Networks (08/13/2018)
- Designing strong baselines for ternary neural network quantization through support and mass equalization (06/30/2023)
- ANT: Exploiting Adaptive Numerical Data Type for Low-bit Deep Neural Network Quantization (08/30/2022)
- Accelerating Neural Network Inference by Overflow Aware Quantization (05/27/2020)
- Progressive Stochastic Binarization of Deep Networks (04/03/2019)
