GACER: Granularity-Aware ConcurrEncy Regulation for Multi-Tenant Deep Learning

04/23/2023
by Yongbo Yu, et al.

As deep learning advances and is applied to increasingly complex scenarios, the demand to deploy multiple neural network models concurrently, commonly referred to as multi-tenant computing, has grown steadily. However, even mature GPU-based computing systems struggle to handle the significant heterogeneity and complexity among concurrent models in resource allocation and runtime scheduling, which often leads to poor resource utilization and degraded throughput. To tackle these issues, this work proposes a set of optimization techniques that refine the granularity of computing management from both the spatial and temporal perspectives, tailored to heterogeneous model compositions for deep learning inference and training. These techniques are integrated into GACER, an automated optimization framework that provides high-utilization, high-throughput, and low-latency multi-tenant computing support. Our experiments demonstrate that GACER significantly improves overall resource utilization and consistently achieves substantial speedups over native GPU computing frameworks and existing state-of-the-art optimization works.
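For readers unfamiliar with the baseline the abstract refers to, the sketch below (not GACER itself, and not code from the paper) illustrates the basic form of multi-tenant GPU concurrency that such a regulator manages: two heterogeneous models issued on separate CUDA streams so their kernels can overlap on one device. The PyTorch APIs are standard; the specific models, batch sizes, and variable names are illustrative assumptions.

```python
# Minimal sketch of multi-tenant GPU concurrency via CUDA streams.
# This is an illustrative baseline, not the GACER framework; model
# choices and batch sizes are assumptions for the example only.
import torch
import torchvision.models as models

device = torch.device("cuda")

# Two heterogeneous tenants sharing the same GPU.
tenant_a = models.resnet50().eval().to(device)
tenant_b = models.mobilenet_v2().eval().to(device)

stream_a = torch.cuda.Stream()
stream_b = torch.cuda.Stream()

batch_a = torch.randn(8, 3, 224, 224, device=device)
batch_b = torch.randn(32, 3, 224, 224, device=device)

# Make the side streams wait for the default stream, so the input
# tensors are ready before the tenant kernels are issued.
stream_a.wait_stream(torch.cuda.current_stream())
stream_b.wait_stream(torch.cuda.current_stream())

with torch.no_grad():
    # Kernels issued on different streams may overlap on the GPU,
    # subject to available resources; closing that utilization gap
    # is the problem GACER targets.
    with torch.cuda.stream(stream_a):
        out_a = tenant_a(batch_a)
    with torch.cuda.stream(stream_b):
        out_b = tenant_b(batch_b)

torch.cuda.synchronize()  # wait for both tenants before reading results
print(out_a.shape, out_b.shape)
```

Per the abstract, GACER's contribution lies above this baseline: deciding at what spatial and temporal granularity such concurrent work is issued and regulated, rather than leaving scheduling entirely to the GPU's default behavior.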


Related research:

Hera: A Heterogeneity-Aware Multi-Tenant Inference Server for Personalized Recommendations (02/23/2023). While providing low latency is a fundamental requirement in deploying re...

Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU (11/28/2021). With the fast development of deep neural networks (DNNs), many real-worl...

PARIS and ELSA: An Elastic Scheduling Algorithm for Reconfigurable Multi-GPU Inference Servers (02/27/2022). In cloud machine learning (ML) inference systems, providing low latency ...

A Survey of Multi-Tenant Deep Learning Inference on GPU (03/17/2022). Deep Learning (DL) models have achieved superior performance. Meanwhile,...

Dynamic Space-Time Scheduling for GPU Inference (12/31/2018). Serving deep neural networks in latency critical interactive settings of...

Towards QoS-Aware and Resource-Efficient GPU Microservices Based on Spatial Multitasking GPUs In Datacenters (05/05/2020). While prior researches focus on CPU-based microservices, they are not ap...

Octopus: A Heterogeneous In-network Computing Accelerator Enabling Deep Learning for network (08/22/2023). Deep learning (DL) for network models have achieved excellent performanc...
