Memory-Aware Fusing and Tiling of Neural Networks for Accelerated Edge Inference

07/14/2021
by Jackson Farley, et al.

A rising research challenge is running costly machine learning (ML) networks locally on resource-constrained edge devices. ML networks with large convolutional layers can easily exceed available memory, increasing latency due to excessive swapping. Previous memory-reduction techniques such as pruning and quantization reduce model accuracy and often require retraining. Alternatively, distributed methods partition the convolutions into equivalent smaller sub-computations, but these implementations introduce communication costs and require a network of devices. However, the same partitioning approach can also be used to shrink the memory footprint on a single device by subdividing the network into smaller operations. This report extends prior work on distributed partitioning into a memory-aware execution of tiled and fused convolutional layers on a single device. Our approach extends prior fusing strategies by allowing two groups of convolutional layers that are fused and tiled independently, which reduces overhead through data reuse and further shrinks the memory footprint. We also propose a memory-usage predictor coupled with a search algorithm that selects fusing and tiling configurations for an arbitrary set of convolutional layers. Applied to the YOLOv2 object detection network, our approach runs in less than half the memory, with a speedup of up to 2.78× under severe memory constraints. Additionally, our algorithm returns a configuration whose latency is within 6% of one found by a manual search.
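The core idea of fusing and tiling — evaluating a group of stacked convolutional layers one small output tile at a time, so that the full intermediate feature maps are never materialized — can be sketched in a few lines. The following is an illustrative single-channel NumPy sketch with hypothetical names (`conv2d`, `fused_tiled_conv`), not the paper's implementation:

```python
import numpy as np

def conv2d(x, w):
    """Reference single-channel 'valid' convolution computed in one shot."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def fused_tiled_conv(x, w1, w2, tile=8):
    """Two stacked convolutions evaluated tile by tile: only a small input
    patch (the tile plus the halo both kernels need) and one intermediate
    tile are live at a time, never the full intermediate feature map."""
    kh1, kw1 = w1.shape
    kh2, kw2 = w2.shape
    oh = x.shape[0] - kh1 - kh2 + 2
    ow = x.shape[1] - kw1 - kw2 + 2
    out = np.empty((oh, ow))
    for ti in range(0, oh, tile):
        for tj in range(0, ow, tile):
            h = min(tile, oh - ti)
            wd = min(tile, ow - tj)
            # Input patch includes the combined halo of both kernels.
            patch = x[ti:ti + h + kh1 + kh2 - 2,
                      tj:tj + wd + kw1 + kw2 - 2]
            mid = conv2d(patch, w1)            # intermediate tile only
            out[ti:ti + h, tj:tj + wd] = conv2d(mid, w2)
    return out
```

The tiled-and-fused version produces the same output as running both layers whole, but the overlapping halos are recomputed for every tile — the kind of overhead that the data reuse between the two fused groups, and the configuration search, aim to keep small.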

