Quantized Neural Network Inference with Precision Batching

02/26/2020
by Maximilian Lam, et al.

We present PrecisionBatching, a quantized inference algorithm for speeding up neural network execution on traditional hardware platforms at low bitwidths, without the need for retraining or recalibration. PrecisionBatching decomposes a neural network into individual bitlayers and accumulates them using fast 1-bit operations while maintaining activations in full precision. PrecisionBatching not only facilitates quantized inference at low bitwidths (< 8 bits) without retraining or recalibration, but also 1) enables traditional hardware platforms to realize inference speedups at a finer granularity of quantization (e.g., 1-16 bit execution) and 2) allows accuracy/speedup tradeoffs at runtime by exposing the number of bitlayers to accumulate as a tunable parameter. Across a variety of applications (MNIST, language modeling, natural language inference) and neural network architectures (fully connected, RNN, LSTM), PrecisionBatching yields end-to-end speedups of over 8x on a GPU within a < 1% error margin of the full-precision baseline, outperforming traditional 8-bit quantized inference by over 1.5x-2x at the same error tolerance.
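
The bitlayer idea is easiest to see with a small sketch. The NumPy code below is an illustrative reconstruction of the mechanism described in the abstract, not the paper's implementation: the function names (decompose_bitlayers, precision_batched_matvec), the per-tensor scale, and the folding of signs into the layers are all assumptions made for clarity. A weight matrix is split into binary bitplanes; at inference time, each active bitplane is multiplied against the full-precision activation vector and the partial results are accumulated with the appropriate power-of-two weight. On real hardware each bitplane product would run as a fast 1-bit kernel rather than a dense matmul.

```python
import numpy as np

def decompose_bitlayers(W, n_bits=8):
    """Split a weight matrix into bitlayers so that, approximately,
    W ~= scale / (2**n_bits - 1) * sum_i 2**(n_bits - 1 - i) * B_i.
    Signs are folded into the layers for simplicity (entries in {-1, 0, 1});
    a true 1-bit kernel would handle sign/two's complement differently."""
    scale = np.abs(W).max()
    sign = np.sign(W)
    # Fixed-point magnitudes in [0, 2**n_bits - 1]
    mag = np.round(np.abs(W) / scale * (2**n_bits - 1)).astype(np.int64)
    layers = []
    for i in range(n_bits):
        bit = (mag >> (n_bits - 1 - i)) & 1  # most significant plane first
        layers.append((bit * sign).astype(np.float32))
    return layers, scale

def precision_batched_matvec(layers, scale, x, n_active):
    """Accumulate the first n_active bitlayers against a full-precision
    activation vector x. Fewer layers -> faster, coarser result."""
    n_bits = len(layers)
    y = np.zeros(layers[0].shape[0], dtype=np.float32)
    for i, B in enumerate(layers[:n_active]):
        y += float(2 ** (n_bits - 1 - i)) * (B @ x)
    return scale / (2**n_bits - 1) * y

# Accuracy improves as more bitlayers are accumulated at runtime.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128)).astype(np.float32)
x = rng.standard_normal(128).astype(np.float32)
layers, scale = decompose_bitlayers(W, n_bits=8)
for k in (2, 4, 8):
    err = np.abs(precision_batched_matvec(layers, scale, x, k) - W @ x).mean()
    print(f"{k} bitlayers: mean abs error {err:.4f}")
```

Because n_active is just a loop bound over precomputed layers, the accuracy/speed tradeoff can be adjusted per request at runtime, which is the tunable-parameter property the abstract highlights.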

Related Research

09/14/2020 - Fast Implementation of 4-bit Convolutional Neural Networks for Mobile Devices
Quantized low-precision neural networks are very popular because they re...

02/18/2019 - Low-bit Quantization of Neural Networks for Efficient Inference
Recent breakthrough methods in machine learning make use of increasingly...

09/22/2016 - Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations
We introduce a method to train Quantized Neural Networks (QNNs) --- neur...

06/16/2022 - Accelerating Inference and Language Model Fusion of Recurrent Neural Network Transducers via End-to-End 4-bit Quantization
We report on aggressive quantization strategies that greatly accelerate ...

12/29/2019 - Mixed-Precision Quantized Neural Network with Progressively Decreasing Bitwidth For Image Classification and Object Detection
Efficient model inference is an important and practical issue in the dep...

05/25/2018 - Heterogeneous Bitwidth Binarization in Convolutional Neural Networks
Recent work has shown that fast, compact low-bitwidth neural networks ca...

02/09/2022 - Lightweight Jet Reconstruction and Identification as an Object Detection Task
We apply object detection techniques based on deep convolutional blocks ...
