MoCA: Memory-Centric, Adaptive Execution for Multi-Tenant Deep Neural Networks

05/10/2023
by Seah Kim, et al.

Driven by the wide adoption of deep neural networks (DNNs) across different application domains, multi-tenant execution, in which multiple DNNs are deployed simultaneously on the same hardware, has been proposed to satisfy the latency requirements of different applications while improving overall system utilization. However, multi-tenant execution can lead to undesired system-level resource contention, causing quality-of-service (QoS) degradation for latency-critical applications. To address this challenge, we propose MoCA, an adaptive multi-tenancy system for DNN accelerators. Unlike existing solutions that focus on partitioning compute resources, MoCA dynamically manages the shared memory resources of co-located applications to meet their QoS targets. Specifically, MoCA leverages regularities in both DNN operators and accelerators to dynamically modulate memory access rates based on latency targets and user-defined priorities, so that co-located applications get the resources they demand without significantly starving their co-runners. We demonstrate that MoCA improves the service-level agreement (SLA) satisfaction rate by up to 3.9x (1.8x on average), system throughput by 2.3x (1.7x on average), and fairness by 1.3x (1.2x on average) compared to prior work.
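The core idea, modulating each tenant's memory access rate from its latency target and user-defined priority, can be illustrated with a small sketch. This is a hypothetical software policy for intuition only, not MoCA's actual hardware mechanism: each tenant's share of a memory-bandwidth budget is weighted by its priority, boosted when it is running behind its latency target.

```python
def modulate_rates(tenants, total_bandwidth):
    """Split a shared memory-bandwidth budget among co-located tenants.

    Each tenant dict has:
      - 'priority': user-defined weight (higher = more important)
      - 'target':   latency target (e.g., ms)
      - 'observed': currently observed latency (same unit)

    A tenant missing its latency target gets extra weight proportional
    to how far behind it is; shares are normalized so they sum to the
    total budget, so boosting one tenant throttles its co-runners
    without starving them entirely. (Illustrative policy, not the
    paper's implementation.)
    """
    weights = []
    for t in tenants:
        # urgency > 1.0 only when the tenant is behind its target
        urgency = max(1.0, t['observed'] / t['target'])
        weights.append(t['priority'] * urgency)
    total = sum(weights)
    return [total_bandwidth * w / total for w in weights]
```

For example, a high-priority tenant running 1.5x over its latency target receives a proportionally larger bandwidth share than an on-target best-effort co-runner, while the shares still sum to the fixed budget.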

Related research

- BCEdge: SLO-Aware DNN Inference Services with Adaptive Batching on Edge Platforms (05/01/2023)
- SCOPE: Safe Exploration for Dynamic Computer Systems Optimization (04/22/2022)
- OSMOSIS: Enabling Multi-Tenancy in Datacenter SmartNICs (09/07/2023)
- Intelligent Resource Scheduling for Co-located Latency-critical Services: A Multi-Model Collaborative Learning Approach (11/26/2019)
- Throughput Maximization of DNN Inference: Batching or Multi-Tenancy? (08/26/2023)
- DynaMIX: Resource Optimization for DNN-Based Real-Time Applications on a Multi-Tasking System (02/03/2023)
- Memory Planning for Deep Neural Networks (02/23/2022)
