Efficient and Effective Quantization for Sparse DNNs

03/07/2019
by   Yiren Zhao, et al.
0

Deep convolutional neural networks (CNNs) are powerful tools for a wide range of vision tasks, but the enormous amount of memory and compute resources required by CNNs poses a challenge in deploying them on constrained devices. Existing compression techniques show promising performance in reducing the size and computation complexity of CNNs for efficient inference, but there lacks a method to integrate them effectively. In this paper, we attend to the statistical properties of sparse CNNs and present focused quantization, a novel quantization strategy based on powers-of-two values, which exploits the weight distributions after fine-grained pruning. The proposed method dynamically discovers the most effective numerical representation for weights in layers with varying sparsities, to minimize the impact of quantization on the task accuracy. Multiplications in quantized CNNs can be replaced with much cheaper bit-shift operations for efficient inference. Coupled with lossless encoding, we build a compression pipeline that provides CNNs high compression ratios (CR) and minimal loss in accuracies. In ResNet-50, we achieve a 18.08 × CR with only 0.24% loss in top-5 accuracy, outperforming existing compression pipelines.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
03/04/2022

Structured Pruning is All You Need for Pruning CNNs at Initialization

Pruning is a popular technique for reducing the model size and computati...
research
12/29/2018

Quantized Guided Pruning for Efficient Hardware Implementations of Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are state-of-the-art in numerous co...
research
08/29/2019

Smaller Models, Better Generalization

Reducing network complexity has been a major research focus in recent ye...
research
07/23/2021

Pruning Ternary Quantization

We propose pruning ternary quantization (PTQ), a simple, yet effective, ...
research
11/24/2019

Pyramid Vector Quantization and Bit Level Sparsity in Weights for Efficient Neural Networks Inference

This paper discusses three basic blocks for the inference of convolution...
research
06/22/2020

Exploiting Weight Redundancy in CNNs: Beyond Pruning and Quantization

Pruning and quantization are proven methods for improving the performanc...
research
12/20/2019

EAST: Encoding-Aware Sparse Training for Deep Memory Compression of ConvNets

The implementation of Deep Convolutional Neural Networks (ConvNets) on t...

Please sign up or login with your details

Forgot password? Click here to reset