1 Introduction
State-of-the-art neural networks have demonstrated promising performance in tasks such as image classification and object detection [1, 2, 3, 4]. These network models are designed for high accuracy with little consideration of computational cost and inference delay. Thus, deploying them on resource-constrained platforms such as mobile phones is usually inefficient or even infeasible. Even the recently proposed MobileNet [5], which is customized for mobile platforms, has a relatively large 4.24M parameters. Extensive studies have been carried out in developing network models for resource-constrained platforms. Quantization is one of the most popular approaches [6, 7]. For example, Bai et al. recently proposed proximal operators for training low-precision deep neural networks [8]. Courbariaux et al. proposed a radical binary representation of the inputs, weights, and activations [6]. Rastegari et al. [7] theoretically analyzed the binary network and introduced a scaling scheme for their XNOR-Net, a network based on [6] with higher accuracy. Despite the significant improvement in inference accuracy, none of the above networks achieves accuracy comparable to its full-precision counterpart.
We observe that different layers may have different accuracy sensitivity to quantization; thus a fine-grained quantization for each layer has the potential to preserve accuracy under the same compression rate (defined as the ratio of the original model size to the compressed model size) compared to traditional coarse-grained quantization, which uses the same bit-width for the entire model. The dotted line in Figure 1 shows the tradeoff between accuracy and compression rate under traditional quantization for VGG16. Our goal is to push the tradeoff between accuracy and compression rate into the shaded region of Figure 1 to achieve better compression efficiency, i.e., higher accuracy under the same compression rate. To achieve this, we propose a fine-grained quantization approach that relaxes the search space of quantization bit-widths from the discrete to the continuous domain and applies gradient descent optimization to generate the best quantization scheme for each layer, i.e., it applies a lower bit-width to layers that are less sensitive to quantization while preserving high bit precision for quantization-sensitive layers. Our experimental results show that the proposed approach outperforms the accuracy of traditional quantization methods under the same compression rate.
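To make the compression-rate definition above concrete, the following is a minimal sketch of how the rate could be computed for a layer-wise bit-width assignment; the layer sizes and bit-widths are hypothetical illustrative values, not taken from this work.

```python
# Sketch: compression rate of a mixed-precision scheme relative to a
# 32-bit full-precision model. Layer parameter counts and bit-widths
# below are hypothetical, for illustration only.

def compression_rate(layer_params, layer_bits, full_bits=32):
    """Ratio of original model size to compressed model size."""
    original = sum(n * full_bits for n in layer_params)
    compressed = sum(n * b for n, b in zip(layer_params, layer_bits))
    return original / compressed

# Example: three layers; the (less sensitive) middle layer is binarized,
# the others kept at 8-bit.
params = [100_000, 1_000_000, 100_000]
bits = [8, 1, 8]
print(round(compression_rate(params, bits), 2))  # prints 14.77
```

Note that binarizing the largest layer dominates the savings, which is why a fine-grained, per-layer assignment can reach compression rates close to fully binary quantization.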
2 Proposed Approach
In this section, we propose a methodology to judiciously determine the best quantization scheme for each layer based on each layer's accuracy sensitivity to quantization. For ease of description, we use only two-level quantization, binary and 8-bit, as an example to illustrate our approach and conduct a preliminary evaluation; it is straightforward to extend to more quantization levels. We relax the discrete variables to a continuous domain, whose finer granularity provides a more accurate indication during the quantization search. We adopt a gradient-descent-based search algorithm, as it is fast and can be easily deployed in different machine learning frameworks.
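As a concrete reference point, the two quantization levels used in this example can be sketched as follows. These are common formulations (sign binarization as in [6], symmetric uniform 8-bit); the exact quantizers used in our implementation may differ.

```python
import numpy as np

# Illustrative sketch of the two quantization levels (binary and 8-bit).
# These are standard formulations, assumed for illustration.

def quantize_binary(x):
    """Binarize to {-1, +1}."""
    return np.where(x >= 0, 1.0, -1.0)

def quantize_8bit(x):
    """Symmetric uniform 8-bit quantization (255 levels)."""
    scale = np.abs(x).max() / 127.0 + 1e-12
    return np.clip(np.round(x / scale), -127, 127) * scale

x = np.array([-0.8, -0.05, 0.3, 1.2])
print(quantize_binary(x))  # [-1. -1.  1.  1.]
print(quantize_8bit(x))    # close to x, within half a quantization step
```

The binary quantizer discards all magnitude information (hence its higher accuracy loss on sensitive layers), while the 8-bit quantizer keeps the reconstruction error below half a quantization step.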
We use the Softmax function to relax the search space from discrete to continuous. We denote the output of layer $l$ with continuous relaxation as $\bar{o}_l$, and the outputs of binary and 8-bit quantization as $o_l^{(1)}$ and $o_l^{(8)}$, respectively. The Softmax of the architecture parameters $\alpha_l^{(1)}$ and $\alpha_l^{(8)}$ can be interpreted as the probabilities of binary and 8-bit quantization, respectively. Thus $\bar{o}_l$ can be computed as

$$\bar{o}_l = \mathrm{BN}\left(\frac{e^{\alpha_l^{(1)}}}{e^{\alpha_l^{(1)}} + e^{\alpha_l^{(8)}}}\, o_l^{(1)} + \frac{e^{\alpha_l^{(8)}}}{e^{\alpha_l^{(1)}} + e^{\alpha_l^{(8)}}}\, o_l^{(8)}\right), \quad (1)$$

where $\mathrm{BN}(\cdot)$ is the batch normalization operation. The output $\bar{o}_l$ is used as the input of the following layer. The search space for a network with $N$ layers is $\alpha = \{\alpha_l^{(1)}, \alpha_l^{(8)}\}_{l=1}^{N}$. To explore the tradeoff between different quantization schemes, we model the target objective function as

$$\min_{\alpha} \; S(\alpha), \quad (2)$$
$$\text{s.t.} \quad \mathcal{L}(w, \alpha) \le \bar{\mathcal{L}}, \quad (3)$$

where

$$S(\alpha) = \sum_{l=1}^{N} \left(\frac{e^{\alpha_l^{(1)}}}{e^{\alpha_l^{(1)}} + e^{\alpha_l^{(8)}}}\, s_l^{(1)} + \frac{e^{\alpha_l^{(8)}}}{e^{\alpha_l^{(1)}} + e^{\alpha_l^{(8)}}}\, s_l^{(8)}\right). \quad (4)$$

Here $S(\alpha)$ represents the model size, with $s_l^{(1)}$ and $s_l^{(8)}$ the sizes of layer $l$ under binary and 8-bit quantization; $\mathcal{L}$ is the cross-entropy loss; $\bar{\mathcal{L}}$ is the expected maximum loss; $w$ denotes the weights of the model; and $\alpha_l^{(\cdot)}$ represents the coefficient of either quantization method (binary or 8-bit) in a certain layer. In our model, (3) is the constraint for optimization problem (2). We can rewrite the above as a bilevel optimization problem:

$$\min_{\alpha} \max_{\lambda \ge 0} \; S(\alpha) + \lambda \left(\mathcal{L}(w^*(\alpha), \alpha) - \bar{\mathcal{L}}\right) \quad \text{s.t.} \quad w^*(\alpha) = \arg\min_{w} \mathcal{L}(w, \alpha). \quad (5)$$
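A minimal end-to-end sketch of this constrained search on a toy three-layer model is given below. The per-layer sizes, the sensitivity-based surrogate for the training loss, and all step sizes are illustrative assumptions, not the values used in our experiments; the point is only to show the Softmax relaxation of the model size and the alternating multiplier/parameter updates.

```python
import numpy as np

# Toy sketch of the layer-wise search: the multiplier `lam` is raised
# whenever the loss budget is exceeded (dual ascent), and the relaxed
# architecture parameters `alpha` descend the Lagrangian by numeric
# gradient descent. SIZES, SENS, and all hyperparameters are assumed
# illustrative values.

SIZES = np.array([[1.0, 8.0]] * 3)   # per-layer size under (binary, 8-bit)
SENS = np.array([0.1, 1.0, 0.1])     # accuracy sensitivity to binarization

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def model_size(alpha):               # Softmax-weighted model size
    return float((softmax(alpha) * SIZES).sum())

def surrogate_loss(alpha):           # stand-in for the cross-entropy loss
    return float((softmax(alpha)[:, 0] * SENS).sum())

def search(loss_budget, steps=2000, lr=0.5, lam_lr=0.1, eps=1e-4):
    alpha, lam = np.zeros((3, 2)), 0.0
    for _ in range(steps):
        # Dual ascent: lam grows while the loss budget is violated.
        lam = max(0.0, lam + lam_lr * (surrogate_loss(alpha) - loss_budget))
        lagrangian = lambda a: model_size(a) + lam * surrogate_loss(a)
        base = lagrangian(alpha)
        grad = np.zeros_like(alpha)  # forward-difference gradient
        for idx in np.ndindex(*alpha.shape):
            a2 = alpha.copy()
            a2[idx] += eps
            grad[idx] = (lagrangian(a2) - base) / eps
        alpha = np.clip(alpha - lr * grad, -2.5, 2.5)
    return alpha

probs = softmax(search(loss_budget=0.2))
# Insensitive layers lean toward binary; the sensitive layer toward 8-bit.
```

Under a tight loss budget, the two insensitive layers settle on binary quantization while the sensitive middle layer is kept at 8-bit, mirroring the intended fine-grained behavior.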
To solve this bilevel optimization problem, we adopt the approximate algorithm in [9]. First, we retrain the network to find the weights that result in the minimal loss on the training set. Then the Lagrange multiplier problem is solved with the weights fixed. As shown in Algorithm 1, solving the Lagrange multiplier problem starts with maximizing the target function w.r.t. $\lambda$: if $\mathcal{L} \le \bar{\mathcal{L}}$, $\lambda$ approaches 0; otherwise $\lambda$ approaches infinity. Here $\bar{\mathcal{L}}$ is a tunable hyperparameter representing the tolerance of accuracy drop. A smaller $\bar{\mathcal{L}}$ tolerates less accuracy drop but may also result in a smaller compression rate, while a larger $\bar{\mathcal{L}}$ can potentially achieve a higher compression rate at the cost of a larger accuracy drop. We set $\bar{\mathcal{L}}$ using the expected (target) loss of the full-precision model. Finally, we minimize the target function w.r.t. $\alpha$. Once we obtain the hyperparameter set with the best tradeoff, we retrain with the selected quantization scheme and fine-tune the quantized weights to generate the final network model.
3 Experimental Evaluation
We evaluate our proposed methodology on a pretrained 2-layer depthwise separable convolutional neural network using the MNIST dataset, as well as on the VGG16 model using the CIFAR-10 dataset. For each model, we compare our approach with the following baselines: the 32-bit floating-point (full-precision) model, the 8-bit fixed-precision model, and the binary fixed-precision model. As shown in Table 1, the results of the MNIST experiment suggest that our algorithm is capable of finding a quantization scheme that achieves a 28x compression rate while keeping the accuracy drop below 0.5%. In the CIFAR-10 experiment, we set $\bar{\mathcal{L}}$ to 0.6 for VGG16. Compared to fully binary quantization, our approach obtains a compression rate that is very close to binary quantization while gaining 1.5% more accuracy. Figure 2 shows the memory consumption of our model and the original 32-bit full-precision model. The memory usage is dramatically decreased, especially at the middle layers. It is worth mentioning that our method is orthogonal to weight pruning. Combined with state-of-the-art pruning methods [10, 11], which achieve approximately 30x compression rate, the overall compression rate can be up to approximately 900x.

Table 1: Compression rate (Comp.) and accuracy (Accu.) on MNIST and CIFAR-10.

Quant.    MNIST Comp.   MNIST Accu.(%)   CIFAR-10 Comp.   CIFAR-10 Accu.(%)
float32   1             98.66            1                84.80
8-bit     4             98.48            4                84.07
ours      28            98.20            30               83.06
binary    32            96.34            32               81.56
4 Conclusion and Ongoing Work
In this paper, we propose a differentiable mixed-precision search method for compressing deep neural networks efficiently. Unlike traditional quantization methods, our approach relaxes quantization bit-widths to a continuous domain and couples them with the loss function. With our methodology, deep neural networks can be quantized either from the start of the training phase or from a pretrained model. Moreover, our approach ensures that the quantized model maintains similar accuracy while being compressed by up to 30x.
The proposed methodology is not tied to any specific neural network topology, so it can potentially be extended to mixed-precision quantization of different neural network architectures, such as RNNs and LSTMs. We are currently working on providing more quantization options for each layer, so that each layer can be quantized to additional bit-widths beyond binary and 8-bit. These new quantization options drastically increase the search space. Therefore, we plan to design a predictor combined with an encoder-decoder architecture to expedite the search process of layer-wise quantization.
Acknowledgement
This work is supported in part by the following grants: National Science Foundation CCF-1756013, IIS-1838024, and 1717657, and Air Force Research Laboratory FA8750-18-2-0057.
References
 [1] C. Szegedy et al. Going deeper with convolutions. In CVPR, 2015.
 [2] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

 [3] K. He et al. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
 [4] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV. Springer, 2014.
 [5] A. Howard et al. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [6] M. Courbariaux et al. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.

 [7] M. Rastegari et al. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, pages 525–542. Springer, 2016.
 [8] Y. Bai, Y.-X. Wang, and E. Liberty. ProxQuant: Quantized neural networks via proximal operators. arXiv preprint arXiv:1810.00861, 2018.
 [9] H. Liu, K. Simonyan, and Y. Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
 [10] T. Zhang, K. Zhang, S. Ye, J. Li, J. Tang, W. Wen, X. Lin, M. Fardad, and Y. Wang. ADAM-ADMM: A unified, systematic framework of structured weight pruning for DNNs. arXiv preprint arXiv:1807.11091, 2018.
 [11] S. Ye et al. Progressive weight pruning of deep neural networks using ADMM. arXiv preprint arXiv:1810.07378v1, 2018.