Towards QoS-Aware and Resource-Efficient GPU Microservices Based on Spatial Multitasking GPUs In Datacenters

05/05/2020
by   Wei Zhang, et al.

While prior research focuses on CPU-based microservices, its techniques do not apply to GPU-based microservices because the contention patterns differ. It is challenging to optimize resource utilization while guaranteeing the QoS of GPU microservices. We find that the overhead is caused by inter-microservice communication, GPU resource contention, and imbalanced throughput within the microservice pipeline. We propose Camelot, a runtime system that manages GPU microservices with these factors in mind. In Camelot, a global memory-based communication mechanism enables onsite data sharing that significantly reduces the end-to-end latencies of user queries. We also propose two contention-aware resource allocation policies that either maximize the peak supported service load or minimize the resource usage at low load while ensuring the required QoS. The two policies consider the microservice pipeline effect and the runtime GPU resource contention when allocating resources for the microservices. Compared with state-of-the-art work, Camelot increases the supported peak load by up to 64.5 usage at low load while achieving the desired 99
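
The pipeline effect mentioned in the abstract can be illustrated with a minimal sketch. This is not Camelot's allocation policy: it is a hypothetical greedy balancer that assumes each stage's throughput scales linearly with the number of GPU resource units it receives and that it ignores contention entirely. The names `pipeline_throughput`, `balance_allocation`, `stage_rates`, and `total_units` are illustrative assumptions. The sketch only shows why a pipeline is limited by its slowest stage and why extra resources help most when given to that stage.

```python
def pipeline_throughput(stage_rates, alloc):
    """Throughput of a pipeline is bounded by its slowest stage.

    stage_rates[i]: queries/s that stage i serves per resource unit
                    (hypothetical linear-scaling assumption).
    alloc[i]:       resource units assigned to stage i.
    """
    return min(rate * units for rate, units in zip(stage_rates, alloc))


def balance_allocation(stage_rates, total_units):
    """Greedy sketch: give each spare unit to the current bottleneck stage.

    This is an illustrative heuristic, not the policy from the paper.
    """
    n = len(stage_rates)
    if total_units < n:
        raise ValueError("need at least one unit per stage")
    alloc = [1] * n  # every stage needs at least one unit to run
    for _ in range(total_units - n):
        # Find the stage with the lowest current throughput.
        bottleneck = min(range(n), key=lambda i: stage_rates[i] * alloc[i])
        alloc[bottleneck] += 1
    return alloc


if __name__ == "__main__":
    rates = [10, 5, 20]  # hypothetical per-unit rates for three stages
    alloc = balance_allocation(rates, total_units=7)
    print(alloc, pipeline_throughput(rates, alloc))
```

With the hypothetical rates above, the slowest stage receives the most units, and the pipeline throughput rises until all stages are matched. A real policy would also have to model the contention between co-located microservices, which this sketch leaves out.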
