Deep Compression for PyTorch Model Deployment on Microcontrollers

03/29/2021
by Eren Dogan, et al.

Neural network deployment on low-cost embedded systems, i.e., on microcontrollers (MCUs), has recently been attracting more attention than ever. Since MCUs have limited memory capacity as well as limited compute speed, it is critical to employ model compression, which reduces both memory and compute requirements. In this paper, we add model compression, specifically Deep Compression, to Unlu's earlier work on arXiv, which efficiently deploys PyTorch models on MCUs, and optimize it further. First, we prune the weights in convolutional and fully connected layers. Second, the remaining weights and activations are quantized from 32-bit floating point to 8-bit integers. Finally, the forward-pass functions are compressed using special data structures for sparse matrices that store only nonzero weights, without degrading inference speed or accuracy. For the LeNet-5 model, the memory footprint was reduced by 12.45x and inference was sped up by 2.57x.
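
The three steps can be illustrated concretely. Below is a minimal sketch of the first two, magnitude-based pruning followed by symmetric 8-bit quantization, written in plain PyTorch; the sparsity level, the quantization scheme, and the helper names (magnitude_prune, quantize_int8) are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: magnitude pruning + symmetric int8 quantization of a layer's
# weights. Hyperparameters and helper names are illustrative.
import torch
import torch.nn as nn

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.9) -> torch.Tensor:
    """Zero out the smallest-magnitude weights until `sparsity` is reached."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values
    return torch.where(weight.abs() > threshold, weight, torch.zeros_like(weight))

def quantize_int8(weight: torch.Tensor):
    """Symmetric linear quantization: float32 -> int8 plus a float scale."""
    scale = weight.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return q, scale  # dequantize approximately as q.float() * scale

layer = nn.Conv2d(1, 6, kernel_size=5)   # e.g., LeNet-5's first conv layer
pruned = magnitude_prune(layer.weight.data, sparsity=0.9)
q_weight, scale = quantize_int8(pruned)
```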
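
The final step, sparse storage of the surviving weights, can be pictured with a compressed sparse row (CSR) layout. The sketch below (assumed helpers to_csr and sparse_fc_forward; the paper's actual data structures and its MCU-side implementation may differ) stores only nonzero int8 weights and evaluates a fully connected layer by iterating over them, accumulating in 32-bit integers to avoid overflow.

```python
# Sketch: CSR storage of a pruned int8 weight matrix and a sparse
# fully connected forward pass. Names are illustrative.
import numpy as np

def to_csr(q_weight: np.ndarray):
    """Convert a 2-D int8 weight matrix to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in q_weight:
        nz = np.nonzero(row)[0]
        values.extend(row[nz])
        col_idx.extend(nz)
        row_ptr.append(len(values))
    return (np.array(values, dtype=np.int8),
            np.array(col_idx, dtype=np.int32),
            np.array(row_ptr, dtype=np.int32))

def sparse_fc_forward(x, values, col_idx, row_ptr, scale):
    """y = W @ x with W in CSR form; `scale` would fold in the weight and
    activation scales of a full int8 pipeline (simplified here)."""
    y = np.zeros(len(row_ptr) - 1, dtype=np.int32)
    for r in range(len(row_ptr) - 1):
        for i in range(row_ptr[r], row_ptr[r + 1]):
            y[r] += int(values[i]) * int(x[col_idx[i]])
    return y.astype(np.float32) * scale
```

Because only (values, col_idx, row_ptr) are kept, memory scales with the number of nonzero weights rather than the dense matrix size, which is how pruning and sparse storage together can yield the kind of footprint reduction reported above.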


Related research

03/08/2017  Deep Convolutional Neural Network Inference with Floating-point Weights and Fixed-point Activations
Deep convolutional neural network (CNN) inference requires significant a...

07/09/2019  A Targeted Acceleration and Compression Framework for Low bit Neural Networks
1 bit deep neural networks (DNNs), of which both the activations and wei...

05/24/2019  Structured Compression by Unstructured Pruning for Sparse Quantized Neural Networks
Model compression techniques, such as pruning and quantization, are beco...

07/02/2020  Efficient Neural Network Deployment for Microcontroller
Edge computing for neural networks is getting important especially for l...

10/11/2022  Deep learning model compression using network sensitivity and gradients
Deep learning model compression is an improving and important field for ...

05/27/2018  Compact and Computationally Efficient Representation of Deep Neural Networks
Dot product operations between matrices are at the heart of almost any f...

07/09/2021  Model compression as constrained optimization, with application to neural nets. Part V: combining compressions
Model compression is generally performed by using quantization, low-rank...
