D-STACK: High Throughput DNN Inference by Effective Multiplexing and Spatio-Temporal Scheduling of GPUs

03/31/2023
by   Aditya Dhakal, et al.
0

Hardware accelerators such as GPUs are required for real-time, low-latency inference with Deep Neural Networks (DNN). However, due to the inherent limits to the parallelism they can exploit, DNNs often under-utilize the capacity of today's high-end accelerators. Although spatial multiplexing of the GPU, leads to higher GPU utilization and higher inference throughput, there remain a number of challenges. Finding the GPU percentage for right-sizing the GPU for each DNN through profiling, determining an optimal batching of requests to balance throughput improvement while meeting application-specific deadlines and service level objectives (SLOs), and maximizing throughput by appropriately scheduling DNNs are still significant challenges. This paper introduces a dynamic and fair spatio-temporal scheduler (D-STACK) that enables multiple DNNs to run in the GPU concurrently. To help allocate the appropriate GPU percentage (we call it the "Knee"), we develop and validate a model that estimates the parallelism each DNN can utilize. We also develop a lightweight optimization formulation to find an efficient batch size for each DNN operating with D-STACK. We bring together our optimizations and our spatio-temporal scheduler to provide a holistic inference framework. We demonstrate its ability to provide high throughput while meeting application SLOs. We compare D-STACK with an ideal scheduler that can allocate the right GPU percentage for every DNN kernel. D-STACK gets higher than 90 percent throughput and GPU utilization compared to the ideal scheduler. We also compare D-STACK with other GPU multiplexing and scheduling methods (e.g., NVIDIA Triton, Clipper, Nexus), using popular DNN models. Our controlled experiments with multiplexing several popular DNN models achieve up to 1.6X improvement in GPU utilization and up to 4X improvement in inference throughput.

READ FULL TEXT
research
08/08/2020

Spatial Sharing of GPU for Autotuning DNN models

GPUs are used for training, inference, and tuning the machine learning m...
research
08/18/2022

L3: Accelerator-Friendly Lossless Image Format for High-Resolution, High-Throughput DNN Training

The training process of deep neural networks (DNNs) is usually pipelined...
research
08/26/2023

Throughput Maximization of DNN Inference: Batching or Multi-Tenancy?

Deployment of real-time ML services on warehouse-scale infrastructures i...
research
09/06/2019

PREMA: A Predictive Multi-task Scheduling Algorithm For Preemptible Neural Processing Units

To amortize cost, cloud vendors providing DNN acceleration as a service ...
research
08/14/2023

Symphony: Optimized Model Serving using Centralized Orchestration

The orchestration of deep neural network (DNN) model inference on GPU cl...
research
08/31/2023

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

Large Language Model (LLM) inference consists of two distinct phases - p...
research
11/08/2020

Towards Latency-aware DNN Optimization with GPU Runtime Analysis and Tail Effect Elimination

Despite the superb performance of State-Of-The-Art (SOTA) DNNs, the incr...

Please sign up or login with your details

Forgot password? Click here to reset