Demand Layering for Real-Time DNN Inference with Minimized Memory Usage

10/08/2022
by Mingoo Ji et al.

When executing a deep neural network (DNN), its model parameters are loaded into GPU memory before execution, incurring a significant GPU memory burden. Some studies reduce GPU memory usage by exploiting CPU memory as a swap device. However, this approach is not applicable in most embedded systems with integrated GPUs, where the CPU and GPU share a common memory. In this regard, we present Demand Layering, which employs a fast solid-state drive (SSD) as a co-running partner of the GPU and exploits the layer-by-layer execution of DNNs. In our approach, a DNN is loaded and executed layer by layer, minimizing memory usage to the order of a single layer. We also developed a pipeline architecture that hides most of the additional delay caused by the interleaved parameter loadings alongside layer executions. Our implementation shows a 96.5% memory reduction for representative DNNs. Furthermore, by exploiting the memory-delay tradeoff, near-zero delay overhead (under 1 ms) can be achieved with a slightly increased memory usage (still an 88.4% reduction), demonstrating the potential of Demand Layering.
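The layer-by-layer pipeline described in the abstract can be sketched with a bounded producer-consumer queue: a loader thread fetches the next layer's parameters while the current layer executes, and the queue bound keeps resident parameter memory on the order of a single layer. This is a minimal illustrative sketch, not the paper's implementation; `load_params` and `run_layer` are hypothetical stand-ins for the SSD read and GPU kernel launch.

```python
import threading
import queue

def load_params(layer_id):
    # Hypothetical stand-in for reading one layer's parameters from the SSD.
    return f"params[{layer_id}]"

def run_layer(layer_id, params, x):
    # Hypothetical stand-in for executing one layer on the GPU.
    return x + [f"out{layer_id}({params})"]

def demand_layering(num_layers, x):
    # maxsize=1 acts as a double buffer: at most one preloaded layer waits,
    # so parameter memory stays bounded to roughly two layers at any time.
    buf = queue.Queue(maxsize=1)

    def loader():
        for i in range(num_layers):
            buf.put((i, load_params(i)))  # blocks while the buffer is full

    t = threading.Thread(target=loader)
    t.start()
    for _ in range(num_layers):
        i, params = buf.get()  # parameters were loaded during the previous layer
        x = run_layer(i, params, x)
    t.join()
    return x

print(demand_layering(3, []))
```

Because loading layer i+1 overlaps with executing layer i, the added delay is only what loading cannot hide behind computation; enlarging the queue bound trades a little more memory for even less exposed loading delay, which mirrors the memory-delay tradeoff noted above.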


Related research

vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design (02/25/2016)
The most widely used machine learning frameworks require users to carefu...

A Two-Layer Component-Based Allocation for Embedded Systems with GPUs (01/19/2019)
Component-based development is a software engineering paradigm that can ...

Training Large Neural Networks with Constant Memory using a New Execution Algorithm (02/13/2020)
Widely popular transformer-based NLP models such as BERT and GPT have en...

Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks (05/03/2017)
Popular deep learning frameworks require users to fine-tune their memory...

ED-Batch: Efficient Automatic Batching of Dynamic Neural Networks via Learned Finite State Machines (02/08/2023)
Batching has a fundamental influence on the efficiency of deep neural ne...

Triangulating Python Performance Issues with Scalene (12/15/2022)
This paper proposes Scalene, a profiler specialized for Python. Scalene ...

Co-Optimizing Performance and Memory Footprint via Integrated CPU/GPU Memory Management, an Implementation on Autonomous Driving Platform (03/17/2020)
Cutting-edge embedded system applications, such as self-driving cars and...
