Mixed Low-precision Deep Learning Inference using Dynamic Fixed Point

01/31/2017
by Naveen Mellempudi, et al.

We propose a cluster-based quantization method to convert pre-trained full precision weights into ternary weights with minimal impact on accuracy. In addition, we constrain the activations to 8 bits, enabling a sub 8-bit full integer inference pipeline. Our method uses small clusters of N filters with a common scaling factor to minimize the quantization loss while also maximizing the number of ternary operations. We show that with a cluster size of N=4 on Resnet-101 we can achieve 71.8% top-1 accuracy, within about 6% of the full precision results, while replacing roughly 85% of the multiplications with 8-bit accumulations. Using the same method with 4-bit weights achieves 76.3% top-1 accuracy, which is within 2% of the full precision result. We also study the impact of the size of the cluster on both performance and accuracy: larger cluster sizes such as N=64 can replace roughly 98% of the multiplications with ternary operations, but introduce a significant drop in accuracy, which necessitates fine-tuning the parameters by retraining the network at lower precision. To address this, we have also trained a low-precision Resnet-50 with 8-bit activations and ternary weights by pre-initializing the network with full precision weights, achieving 68.9% top-1 accuracy. The resulting model can run on a full 8-bit compute pipeline, with a potential 16x improvement in performance compared to baseline full-precision models.
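
A rough sketch of the cluster-wise ternarization and 8-bit activation quantization described above is given below, assuming NumPy. The function names, the 0.7 * mean(|w|) threshold heuristic (borrowed from ternary weight networks), and the per-tensor symmetric activation scale are illustrative assumptions; the paper derives its per-cluster threshold and scaling factor differently and uses dynamic fixed point for the activations.

```python
import numpy as np

def ternarize_clustered(weights, cluster_size=4, t=0.7):
    """Cluster-wise ternary quantization of convolution weights.

    weights      : float32 array of shape (K, C, R, S) -- K output filters
    cluster_size : number of filters (N) sharing one scaling factor
    t            : threshold multiplier; 0.7 * mean|w| is the classic
                   ternary-weight-network heuristic (an assumption here)

    Returns (ternary, scales):
      ternary : int8 array of {-1, 0, +1}, same shape as weights
      scales  : float32 array with one alpha per cluster of N filters
    """
    num_filters = weights.shape[0]
    assert num_filters % cluster_size == 0, "K must be divisible by N"
    num_clusters = num_filters // cluster_size

    ternary = np.zeros(weights.shape, dtype=np.int8)
    scales = np.zeros(num_clusters, dtype=np.float32)

    for c in range(num_clusters):
        sl = slice(c * cluster_size, (c + 1) * cluster_size)
        w = weights[sl]
        delta = t * np.mean(np.abs(w))            # per-cluster threshold
        mask = np.abs(w) > delta                  # weights kept as +/-1
        # per-cluster scale: mean magnitude of the surviving weights
        scales[c] = np.abs(w[mask]).mean() if mask.any() else 0.0
        ternary[sl] = np.where(mask, np.sign(w), 0).astype(np.int8)

    return ternary, scales


def quantize_activations_int8(x):
    """Symmetric per-tensor 8-bit activation quantization (an assumption:
    the paper uses dynamic fixed point rather than a single float scale)."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


if __name__ == "__main__":
    # Example: a Resnet-style 3x3 conv layer with 64 filters, clusters of N=4
    rng = np.random.default_rng(0)
    W = rng.standard_normal((64, 64, 3, 3)).astype(np.float32)
    Wq, alphas = ternarize_clustered(W, cluster_size=4)
    A, a_scale = quantize_activations_int8(rng.standard_normal((1, 64, 56, 56)))
    print(Wq.dtype, alphas.shape, A.dtype, a_scale)
```

With the weights restricted to {-1, 0, +1} inside each cluster, the inner products reduce to additions and subtractions of 8-bit activations, and the per-cluster scale (together with the activation scale) is applied once per output. This is where the replacement of most multiplications with 8-bit accumulations comes from.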

Related research

05/02/2017 · Ternary Neural Networks with Fine-Grained Quantization
We propose a novel fine-grained quantization (FGQ) method to ternarize p...

01/30/2023 · Efficient and Effective Methods for Mixed Precision Neural Network Quantization for Faster, Energy-efficient Inference
For effective and efficient deep neural network inference, it is desirab...

07/15/2017 · Ternary Residual Networks
Sub-8-bit representation of DNNs incur some discernible loss of accuracy...

09/11/2018 · Discovering Low-Precision Networks Close to Full-Precision Networks for Efficient Embedded Inference
To realize the promise of ubiquitous embedded deep network inference, it...

11/13/2022 · FullPack: Full Vector Utilization for Sub-Byte Quantized Inference on General Purpose CPUs
Although prior art has demonstrated negligible accuracy drop in sub-byte...

12/20/2022 · Redistribution of Weights and Activations for AdderNet Quantization
Adder Neural Network (AdderNet) provides a new way for developing energy...

01/13/2021 · FBGEMM: Enabling High-Performance Low-Precision Deep Learning Inference
Deep learning models typically use single-precision (FP32) floating poin...
