PARIS and ELSA: An Elastic Scheduling Algorithm for Reconfigurable Multi-GPU Inference Servers

02/27/2022
by   Yunseong Kim, et al.

In cloud machine learning (ML) inference systems, providing low latency to end users is of utmost importance. However, maximizing server utilization and system throughput is also crucial for ML service providers, as it helps lower the total cost of ownership. GPUs have often been criticized for ML inference because their massive compute and memory throughput is hard to fully utilize under low-batch inference scenarios. To address this limitation, NVIDIA's recently announced Ampere GPU architecture provides features to "reconfigure" one large, monolithic GPU into multiple smaller "GPU partitions". This feature gives cloud ML service providers the ability to use a reconfigurable GPU not only for large-batch training but also for small-batch inference, with the potential to achieve high resource utilization. In this paper, we study this emerging reconfigurable GPU architecture to develop a high-performance multi-GPU ML inference server. Our first contribution is a sophisticated partitioning algorithm for reconfigurable GPUs that systematically determines a heterogeneous set of multi-granular GPU partitions best suited for the inference server's deployment. Furthermore, we co-design an elastic scheduling algorithm, tailored to our heterogeneously partitioned GPU server, that effectively balances low latency with high GPU utilization.
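The abstract does not spell out how the elastic scheduler works. As a rough illustration of scheduling inference requests over a heterogeneous set of MIG-style GPU partitions, the Python sketch below greedily assigns each request to the partition that would finish it earliest. The slice sizes, the greedy policy, and the simplistic cost model (service time inversely proportional to partition size) are all illustrative assumptions, not the authors' PARIS/ELSA algorithms.

```python
def schedule(requests, partition_slices):
    """Greedy sketch of scheduling over a heterogeneous GPU-partition set.

    requests: list of (arrival_time, work) pairs, sorted by arrival time.
    partition_slices: size of each partition in MIG-style compute slices,
        e.g. [4, 2, 1] for one half-GPU partition and two smaller ones
        (hypothetical configuration, not from the paper).
    Service time is modeled, simplistically, as work / slices.
    Returns the completion time of each request.
    """
    # Each partition is tracked as (time it next becomes free, slice count).
    parts = [(0.0, s) for s in partition_slices]
    completions = []
    for arrival, work in requests:
        # Pick the partition that would finish this request earliest:
        # it must wait until the partition is free, then run work/slices.
        best = min(range(len(parts)),
                   key=lambda i: max(parts[i][0], arrival) + work / parts[i][1])
        free_at, slices = parts[best]
        done = max(free_at, arrival) + work / slices
        parts[best] = (done, slices)
        completions.append(done)
    return completions
```

For example, `schedule([(0.0, 4.0), (0.0, 1.0), (0.5, 1.0)], [4, 2, 1])` sends the large request to the 4-slice partition and the small ones to the 2-slice partition, keeping the big partition free of short jobs. Balancing latency against utilization across such partition sizes is the trade-off the paper's scheduler targets.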


Related research

09/01/2021  Multi-model Machine Learning Inference Serving with GPU Spatial Partitioning
As machine learning techniques are applied to a widening range of applic...

09/11/2020  Fast LDPC GPU Decoder for Cloud RAN
The GPU as a digital signal processing accelerator for cloud RAN is inve...

03/09/2023  Cloudless-Training: A Framework to Improve Efficiency of Geo-Distributed ML Training
Geo-distributed ML training can benefit many emerging ML scenarios (e.g....

09/18/2021  Serving DNN Models with Multi-Instance GPUs: A Case of the Reconfigurable Machine Scheduling Problem
Multi-Instance GPU (MIG) is a new feature introduced by NVIDIA A100 GPUs...

03/09/2023  GPU-enabled Function-as-a-Service for Machine Learning Inference
Function-as-a-Service (FaaS) is emerging as an important cloud computing...

04/23/2023  GACER: Granularity-Aware ConcurrEncy Regulation for Multi-Tenant Deep Learning
As deep learning continues to advance and is applied to increasingly com...

05/14/2022  A Low-latency Communication Design for Brain Simulations
Brain simulation, as one of the latest advances in artificial intelligen...
