Efficient and Effective Quantization for Sparse DNNs

by Yiren Zhao, et al.
University of Cambridge
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences

Deep convolutional neural networks (CNNs) are powerful tools for a wide range of vision tasks, but the enormous amount of memory and compute resources they require makes them difficult to deploy on constrained devices. Existing compression techniques show promising performance in reducing the size and computational complexity of CNNs for efficient inference, but a method to integrate them effectively is still lacking. In this paper, we attend to the statistical properties of sparse CNNs and present focused quantization, a novel quantization strategy based on powers-of-two values, which exploits the weight distributions after fine-grained pruning. The proposed method dynamically discovers the most effective numerical representation for weights in layers with varying sparsities, to minimize the impact of quantization on task accuracy. Multiplications in the quantized CNNs can then be replaced with much cheaper bit-shift operations for efficient inference. Coupled with lossless encoding, we build a compression pipeline that gives CNNs high compression ratios (CR) with minimal loss in accuracy. In ResNet-50, we achieve an 18.08× CR with only 0.24% loss in top-5 accuracy, outperforming existing compression pipelines.
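To illustrate why powers-of-two weights allow multiplications to be replaced by bit shifts, here is a minimal Python sketch. It is not the paper's focused quantization method: it only rounds each weight to the nearest power of two in the log domain and leaves pruned (zero) weights at zero. The function names `quantize_pow2` and `shift_multiply`, and the exponent range `min_exp`/`max_exp`, are assumptions made for this example.

```python
import math


def quantize_pow2(w, min_exp=-8, max_exp=0):
    """Round a weight to a signed power of two, returned as (sign, exponent).

    Assumption for this sketch: rounding is done on log2(|w|) and the
    exponent is clamped to [min_exp, max_exp]. A sign of 0 marks a pruned
    (zero) weight.
    """
    if w == 0.0:
        return 0, 0
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))
    exp = max(min_exp, min(max_exp, exp))
    return sign, exp


def shift_multiply(x, sign, exp):
    """Multiply a non-negative integer activation x by sign * 2**exp using shifts only."""
    if sign == 0:
        return 0  # pruned weight contributes nothing
    shifted = x << exp if exp >= 0 else x >> -exp
    return sign * shifted


# Example: the weight 0.3 is quantized to +2**-2 = 0.25,
# so multiplying an activation of 64 becomes a right shift by 2.
sign, exp = quantize_pow2(0.3)
result = shift_multiply(64, sign, exp)
```

Note that a right shift truncates toward zero for non-negative integers, so the shifted product matches the real-valued product exactly only when the activation is divisible by the corresponding power of two.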

Code Repositories


Mayo: Auto-generation of hardware-friendly deep neural networks.