Mixed-Precision Quantization with Cross-Layer Dependencies

07/11/2023
by Zihao Deng, et al.

Quantization is commonly used to compress and accelerate deep neural networks. Assigning the same bit-width to all layers leads to large accuracy degradation at low precision and is wasteful at high precision. Mixed-precision quantization (MPQ) instead assigns varied bit-widths to layers to optimize the accuracy-efficiency trade-off. Existing methods simplify the MPQ problem by assuming that quantization errors at different layers act independently. We show that this assumption does not reflect the true behavior of quantized deep neural networks. We propose the first MPQ algorithm that captures the cross-layer dependency of quantization error. Our algorithm (CLADO) enables a fast approximation of pairwise cross-layer error terms by solving linear equations that require only forward evaluations of the network on a small amount of data. Layerwise bit-width assignments are then determined by optimizing a new MPQ formulation that depends on these cross-layer quantization errors, cast as an Integer Quadratic Program (IQP), which can be solved within seconds. We conduct experiments on multiple networks on the ImageNet dataset and demonstrate an improvement, in top-1 classification accuracy, of up to 27% over existing MPQ methods.
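The abstract describes a two-stage pipeline: first estimate pairwise cross-layer quantization error terms from forward evaluations on calibration data, then choose per-layer bit-widths by optimizing the resulting quadratic objective. Below is a minimal sketch of that structure, not the authors' implementation: `apply_quantization` and `evaluate_loss` are hypothetical helpers (assumed to quantize the given layers to the given bit-widths inside a context, and to forward-evaluate the loss on a small calibration set), and an exhaustive search stands in for the IQP solver on a toy-sized search space.

```python
import itertools


def estimate_cross_layer_errors(model, layers, bit_choices, calib_data,
                                apply_quantization, evaluate_loss):
    """Estimate per-layer and pairwise cross-layer quantization error terms.

    Each measurement is one forward evaluation of the (partially quantized)
    network on a small calibration set. The pairwise term is recovered from
    the linear relation  cross_ij = delta_ij - delta_i - delta_j, i.e. the
    loss increase with both layers quantized minus the individual increases.
    `apply_quantization` is assumed to be a context manager (hypothetical).
    """
    base_loss = evaluate_loss(model, calib_data)

    # Individual loss increases: delta[(layer, bits)].
    delta = {}
    for i in layers:
        for b in bit_choices:
            with apply_quantization(model, {i: b}):
                delta[(i, b)] = evaluate_loss(model, calib_data) - base_loss

    # Pairwise cross-layer terms: cross[((i, bi), (j, bj))].
    cross = {}
    for i, j in itertools.combinations(layers, 2):
        for bi in bit_choices:
            for bj in bit_choices:
                with apply_quantization(model, {i: bi, j: bj}):
                    joint = evaluate_loss(model, calib_data) - base_loss
                cross[(i, bi), (j, bj)] = joint - delta[(i, bi)] - delta[(j, bj)]
    return delta, cross


def solve_bit_assignment(layers, bit_choices, delta, cross, layer_sizes,
                         memory_budget_bits):
    """Pick per-layer bit-widths minimizing the quadratic error model.

    The paper formulates this as an Integer Quadratic Program over one-hot
    bit-width selection variables; here we enumerate assignments
    exhaustively, which optimizes the same objective but only scales to a
    handful of layers. `layer_sizes` maps each layer to its weight count.
    """
    best_cost, best_assign = float("inf"), None
    for assign in itertools.product(bit_choices, repeat=len(layers)):
        choice = dict(zip(layers, assign))
        # Memory constraint: total quantized weight size within budget.
        mem = sum(layer_sizes[i] * choice[i] for i in layers)
        if mem > memory_budget_bits:
            continue
        cost = sum(delta[(i, choice[i])] for i in layers)
        cost += sum(cross[(i, choice[i]), (j, choice[j])]
                    for i, j in itertools.combinations(layers, 2))
        if cost < best_cost:
            best_cost, best_assign = cost, choice
    return best_assign, best_cost
```

With N layers and B candidate bit-widths, the exhaustive search visits B**N assignments and is only viable as an illustration; the IQP formulation over one-hot binary variables is what lets the same objective, including the cross-layer terms, be solved within seconds at network scale.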
