PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning

01/01/2020
by Wei Niu, et al.

With the emergence of a spectrum of high-end mobile devices, many applications that formerly required desktop-level computation capability are being transferred to these devices. However, executing Deep Neural Network (DNN) inference is still challenging given its high computation and storage demands, especially when real-time performance with high accuracy is required. Weight pruning of DNNs has been proposed, but existing schemes represent two extremes in the design space: non-structured pruning is fine-grained and accurate but not hardware friendly; structured pruning is coarse-grained and hardware-efficient but incurs higher accuracy loss. In this paper, we introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in the design space. With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to regain and guarantee high hardware efficiency. In other words, our method achieves the best of both worlds and is desirable across the theory/algorithm, compiler, and hardware levels. The proposed PatDNN is an end-to-end framework that executes DNNs efficiently on mobile devices with the help of a novel model compression technique (pattern-based pruning based on an extended ADMM solution framework) and a set of thorough architecture-aware compiler- and code-generation-based optimizations (filter kernel reordering, compressed weight storage, register load redundancy elimination, and parameter auto-tuning). Evaluation results demonstrate that PatDNN outperforms three state-of-the-art end-to-end DNN frameworks, TensorFlow Lite, TVM, and Alibaba Mobile Neural Network, with speedups of up to 44.5x, 11.4x, and 7.1x, respectively, with no accuracy compromise. Real-time inference of representative large-scale DNNs (e.g., VGG-16, ResNet-50) can be achieved on mobile devices.
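To make the pattern-pruning idea concrete, below is a minimal NumPy sketch of restricting every 3x3 convolution kernel to a fixed library of 4-entry patterns. The pattern masks, the function name `apply_pattern_pruning`, and the greedy magnitude-based selection rule are illustrative assumptions; the paper derives its pattern assignments through the extended ADMM solution framework rather than this simple heuristic.

```python
import numpy as np

# Illustrative library of 4-entry patterns for 3x3 kernels (assumed masks,
# not the pattern set actually produced by PatDNN's ADMM framework).
PATTERNS = np.array([
    [[1, 1, 0], [1, 1, 0], [0, 0, 0]],
    [[0, 1, 1], [0, 1, 1], [0, 0, 0]],
    [[0, 0, 0], [1, 1, 0], [1, 1, 0]],
    [[0, 0, 0], [0, 1, 1], [0, 1, 1]],
], dtype=np.float32)

def apply_pattern_pruning(weights):
    """weights: conv tensor of shape (out_ch, in_ch, 3, 3).

    For each kernel, keep the candidate pattern that preserves the most
    weight magnitude and zero out everything outside it. Returns the
    pruned tensor and the chosen pattern id per kernel."""
    out_ch, in_ch, _, _ = weights.shape
    pruned = np.zeros_like(weights)
    chosen = np.zeros((out_ch, in_ch), dtype=np.int64)
    for f in range(out_ch):
        for c in range(in_ch):
            kernel = weights[f, c]
            # magnitude retained under each candidate pattern
            scores = [(np.abs(kernel) * p).sum() for p in PATTERNS]
            best = int(np.argmax(scores))
            chosen[f, c] = best
            pruned[f, c] = kernel * PATTERNS[best]
    return pruned, chosen

# Example: prune a random layer with 16 filters over 8 input channels.
w = np.random.randn(16, 8, 3, 3).astype(np.float32)
w_pruned, pattern_ids = apply_pattern_pruning(w)
```

Because every kernel keeps exactly four weights in one of a handful of known layouts, the sparsity is regular enough for a compiler to exploit, which is what distinguishes this scheme from non-structured pruning.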
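The compiler-side passes can be sketched in the same spirit. Assuming the `pruned` weights and per-kernel `chosen` pattern ids from the snippet above, the hypothetical helpers below illustrate two of the listed optimizations: filter kernel reordering, which groups filters with similar pattern compositions so the generated inner loops stay regular, and compressed weight storage, which records only a pattern id plus the four surviving weights per kernel. The grouping key and storage layout are assumptions, not PatDNN's actual compact format.

```python
import numpy as np

def reorder_filters(pruned, chosen):
    """Filter kernel reordering sketch: sort output filters by the multiset
    of pattern ids of their kernels so filters with similar patterns become
    adjacent; returns the permuted tensors plus the permutation itself,
    which is needed to map output channels back to their original order."""
    order = sorted(range(pruned.shape[0]),
                   key=lambda f: tuple(sorted(chosen[f].tolist())))
    return pruned[order], chosen[order], order

def compress_weights(pruned, chosen, patterns):
    """Compressed weight storage sketch: each kernel is stored as a pattern
    id plus its 4 surviving weights instead of 9 dense values, so the mask
    positions never need to be stored per kernel."""
    out_ch, in_ch = chosen.shape
    values = np.empty((out_ch, in_ch, 4), dtype=pruned.dtype)
    for f in range(out_ch):
        for c in range(in_ch):
            mask = patterns[chosen[f, c]].astype(bool)
            values[f, c] = pruned[f, c][mask]
    return values, chosen
```

The remaining passes named in the abstract, register load redundancy elimination and parameter auto-tuning, operate on the generated code itself rather than on the weight tensors, and are not reproduced here.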

Related Research

09/06/2019
PCONV: The Missing but Desirable Sparsity in DNN Weight Pruning for Real-time Execution on Mobile Devices
Model compression techniques on Deep Neural Network (DNN) have been wide...

07/20/2020
Achieving Real-Time Execution of 3D Convolutional Neural Networks on Mobile Devices
Mobile devices are becoming an important carrier for deep learning tasks...

05/02/2019
26ms Inference Time for ResNet-50: Towards Real-Time Execution of all DNNs on Smartphone
With the rapid emergence of a spectrum of high-end mobile devices, many ...

04/22/2020
Towards Real-Time DNN Inference on Mobile Platforms with Model Pruning and Compiler Optimization
High-end mobile platforms rapidly serve as primary computing devices for...

03/14/2020
CoCoPIE: Making Mobile AI Sweet As PIE - Compression-Compilation Co-Design Goes a Long Way
Assuming hardware is the major constraint for enabling real-time mobile ...

07/04/2022
CPrune: Compiler-Informed Model Pruning for Efficient Target-Aware DNN Execution
Mobile devices run deep learning models for various purposes, such as im...

12/20/2021
Load-balanced Gather-scatter Patterns for Sparse Deep Neural Networks
Deep neural networks (DNNs) have been proven to be effective in solving ...
