Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU

07/10/2023
by Zhihe Zhao, et al.

Many applications, such as autonomous driving and augmented reality, require the concurrent execution of multiple deep neural networks (DNNs) that impose different real-time performance requirements. However, coordinating multiple DNN tasks with varying levels of criticality on edge GPUs remains an area of limited study. Unlike server-grade GPUs, edge GPUs are resource-constrained and lack hardware-level resource management mechanisms for avoiding resource contention. We therefore propose Miriam, a contention-aware task coordination framework for multi-DNN inference on edge GPUs. Miriam integrates two main components, an elastic-kernel generator and a runtime dynamic kernel coordinator, to support mixed-criticality DNN inference. To evaluate Miriam, we build a new CUDA-based DNN inference benchmark with diverse, representative DNN workloads. Experiments on two edge GPU platforms show that Miriam can increase system throughput by 92% while incurring only marginal latency overhead for critical tasks, compared to state-of-the-art baselines.
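As a rough illustration of the kind of kernel-level coordination the abstract describes, and not Miriam's actual implementation, the sketch below co-runs a latency-critical kernel and a best-effort kernel on one GPU using plain CUDA streams. The grid-stride elastic_vec_add kernel, the stream names, and the chosen grid sizes are all hypothetical. The underlying idea is that a grid-stride kernel remains correct at any launch size, so the best-effort task can be launched with few thread blocks to leave SMs free for the critical task, while stream priorities bias the hardware scheduler toward the critical stream.

#include <cstdio>
#include <cuda_runtime.h>

// A grid-stride ("elastic") kernel: its result does not depend on the launch
// grid size, so the number of resident thread blocks can be chosen at launch
// time, e.g. shrunk to leave SMs free for a latency-critical co-runner.
__global__ void elastic_vec_add(const float* a, const float* b, float* c, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c_crit, *c_be;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c_crit, n * sizeof(float));
    cudaMallocManaged(&c_be, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Two streams: the critical task gets the highest priority the device
    // supports, the best-effort task the lowest (lower number = higher priority).
    int least, greatest;
    cudaDeviceGetStreamPriorityRange(&least, &greatest);
    cudaStream_t critical, best_effort;
    cudaStreamCreateWithPriority(&critical, cudaStreamNonBlocking, greatest);
    cudaStreamCreateWithPriority(&best_effort, cudaStreamNonBlocking, least);

    // Best-effort task: deliberately small grid (8 blocks) so it cannot occupy
    // every SM; critical task: full-size grid on the high-priority stream.
    elastic_vec_add<<<8, 256, 0, best_effort>>>(a, b, c_be, n);
    elastic_vec_add<<<(n + 255) / 256, 256, 0, critical>>>(a, b, c_crit, n);

    cudaStreamSynchronize(critical);
    cudaStreamSynchronize(best_effort);
    printf("critical: %f  best-effort: %f\n", c_crit[0], c_be[0]);

    cudaStreamDestroy(critical);
    cudaStreamDestroy(best_effort);
    cudaFree(a); cudaFree(b); cudaFree(c_crit); cudaFree(c_be);
    return 0;
}

This sketch only fixes the launch configuration once; Miriam itself, per the abstract, generates elastic kernels and coordinates them dynamically at runtime, which a static launch like this cannot capture.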

