Efficient Latency-Aware CNN Depth Compression via Two-Stage Dynamic Programming

01/28/2023
by Jinuk Kim, et al.

Recent works on neural network pruning argue that reducing the depth of a network is more effective at cutting run-time memory usage and accelerating inference than reducing its width through channel pruning. Accordingly, several recent works propose depth compression algorithms that merge convolution layers. However, the existing algorithms have a limited search space and rely on human-engineered heuristics. In this paper, we propose a novel depth compression algorithm that targets general convolution operations. We formulate a subset selection problem that replaces inefficient activation layers with identity functions and optimally merges consecutive convolution operations into shallow equivalent convolutions, reducing end-to-end inference latency. Since this subset selection problem is NP-hard, we introduce a surrogate optimization problem that can be solved exactly via two-stage dynamic programming within a few seconds. We evaluate our method and the baselines with TensorRT for a fair inference latency comparison. Our method outperforms the baseline with higher accuracy and faster inference on MobileNetV2 trained on the ImageNet dataset. Specifically, we achieve a 1.61× speed-up with only a 0.62%p accuracy drop on MobileNetV2-1.4.
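The merging step rests on a standard identity: two consecutive convolutions with no nonlinearity between them collapse into a single equivalent convolution of kernel size k1 + k2 - 1. The sketch below illustrates only that identity, not the paper's algorithm; the function name, the stride-1 setting, and the use of unpadded (valid) convolutions are simplifying assumptions for illustration.

```python
# Minimal sketch: merge conv2(conv1(x)) into one equivalent Conv2d.
# Assumes stride 1 and no padding, so the equivalence is exact.
import torch
import torch.nn.functional as F


def merge_convs(conv1: torch.nn.Conv2d, conv2: torch.nn.Conv2d) -> torch.nn.Conv2d:
    w1, b1 = conv1.weight, conv1.bias          # (c_mid, c_in, k1, k1)
    w2, b2 = conv2.weight, conv2.bias          # (c_out, c_mid, k2, k2)
    k1 = w1.shape[-1]
    # Composing two cross-correlations amounts to convolving the kernels:
    # flip w1 spatially, swap its in/out channel axes, correlate with w2.
    merged_w = F.conv2d(w2, w1.permute(1, 0, 2, 3).flip(2, 3), padding=k1 - 1)
    # A constant bias passes through the second (valid) convolution as a
    # channel-wise weighted sum of w2.
    merged_b = b2 + w2.sum(dim=(2, 3)) @ b1
    merged = torch.nn.Conv2d(conv1.in_channels, conv2.out_channels,
                             kernel_size=merged_w.shape[-1], bias=True)
    merged.weight.data.copy_(merged_w)
    merged.bias.data.copy_(merged_b)
    return merged


# Quick check: the merged layer matches the two-layer stack on random input.
conv1 = torch.nn.Conv2d(8, 16, kernel_size=3)
conv2 = torch.nn.Conv2d(16, 32, kernel_size=3)
merged = merge_convs(conv1, conv2)
x = torch.randn(1, 8, 32, 32)
assert torch.allclose(conv2(conv1(x)), merged(x), atol=1e-4)
```

The resulting 5×5 convolution performs the same computation in a single layer, which is the kind of shallow equivalent the subset selection problem chooses among when it decides which activation layers to replace with identities.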


