iGniter: Interference-Aware GPU Resource Provisioning for Predictable DNN Inference in the Cloud

11/03/2022
by   Fei Xu, et al.

GPUs are essential to accelerating latency-sensitive deep neural network (DNN) inference workloads in cloud datacenters. To fully utilize GPU resources, spatial sharing of GPUs among co-located DNN inference workloads is becoming increasingly compelling. However, GPU sharing inevitably brings severe performance interference among co-located inference workloads, as shown by our empirical measurement study of DNN inference on EC2 GPU instances. Existing work on guaranteeing inference performance service level objectives (SLOs) focuses either on temporal sharing of GPUs or on reactive GPU resource scaling and inference migration techniques, while how to proactively mitigate such severe performance interference has received comparatively little attention. In this paper, we propose iGniter, an interference-aware GPU resource provisioning framework for cost-efficiently achieving predictable DNN inference in the cloud. iGniter comprises two key components: (1) a lightweight DNN inference performance model, which leverages practically accessible system and workload metrics to capture the performance interference; and (2) a cost-efficient GPU resource provisioning strategy that jointly optimizes GPU resource allocation and adaptive batching based on our inference performance model, with the aim of achieving predictable performance of DNN inference workloads. We implement a prototype of iGniter based on the NVIDIA Triton inference server hosted on EC2 GPU instances. Extensive prototype experiments on four representative DNN models and datasets demonstrate that iGniter can guarantee the performance SLOs of DNN inference workloads with practically acceptable runtime overhead, while saving the monetary cost by up to 25% compared with the state-of-the-art GPU resource provisioning strategies.

research
08/08/2020

Spatial Sharing of GPU for Autotuning DNN models

GPUs are used for training, inference, and tuning the machine learning m...
research
07/10/2023

Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU

Many applications such as autonomous driving and augmented reality, requ...
research
08/14/2023

Symphony: Optimized Model Serving using Centralized Orchestration

The orchestration of deep neural network (DNN) model inference on GPU cl...
research
01/21/2019

No DNN Left Behind: Improving Inference in the Cloud with Multi-Tenancy

With the rise of machine learning, inference on deep neural networks (DN...
research
01/05/2016

Resource Sharing for Multi-Tenant NoSQL Data Store in Cloud

Multi-tenancy hosting of users in cloud NoSQL data stores is favored by ...
research
09/01/2023

FaST-GShare: Enabling Efficient Spatio-Temporal GPU Sharing in Serverless Computing for Deep Learning Inference

Serverless computing (FaaS) has been extensively utilized for deep learn...
research
01/18/2022

Model-driven Cluster Resource Management for AI Workloads in Edge Clouds

Since emerging edge applications such as Internet of Things (IoT) analyt...
