Throughput Maximization of DNN Inference: Batching or Multi-Tenancy?

Deployment of real-time ML services on warehouse-scale infrastructure is increasing, so reducing the latency and increasing the throughput of the deep neural network (DNN) inference applications that power those services have attracted attention from both academia and industry. A common solution is to leverage hardware accelerators such as GPUs. Two approaches are commonly employed to improve the inference throughput of DNNs deployed on GPUs: Batching and Multi-Tenancy. Our preliminary experiments show that the effect of these approaches on throughput depends on the DNN architecture. Building on this observation, we design and implement DNNScaler, which aims to maximize the throughput of interactive AI-powered services while meeting their latency requirements. DNNScaler first detects which approach (Batching or Multi-Tenancy) is most beneficial for a given DNN in terms of throughput improvement. It then adjusts the control knob of the detected approach (batch size for Batching, number of co-located instances for Multi-Tenancy) to increase throughput while maintaining latency. In an extensive set of experiments with well-known DNNs from a variety of domains, several popular datasets, and a cutting-edge GPU, DNNScaler improves throughput by up to 14x (218% on average) compared to the alternative approach, while meeting the latency requirements of the services.
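To make the control loop concrete, the following is a minimal Python sketch of a DNNScaler-style controller, assuming a two-phase design: a brief profiling phase that compares the marginal throughput gain of each knob, then a scaling phase that grows the winning knob until the latency budget is exhausted. The `profile` helper, the knob-growth policy, and all constants are illustrative assumptions, not the paper's published implementation.

```python
# Minimal sketch of a DNNScaler-style controller. The profiling helper,
# the knob-growth policy, and all constants are illustrative assumptions,
# not the interface or algorithm published in the paper.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Measurement:
    throughput: float  # inferences per second
    latency: float     # tail latency in seconds

def detect_approach(profile: Callable[[int, int], Measurement]) -> str:
    """Pick Batching or Multi-Tenancy by probing each knob once.

    profile(batch_size, n_instances) is a hypothetical helper that runs
    the DNN briefly under the given configuration and reports results.
    """
    batched = profile(2, 1)  # double the batch size
    multi = profile(1, 2)    # co-locate a second instance
    if batched.throughput >= multi.throughput:
        return "batching"
    return "multi-tenancy"

def scale(profile: Callable[[int, int], Measurement],
          latency_slo: float, max_knob: int = 64) -> tuple[str, int]:
    """Grow the chosen knob until the latency SLO would be violated."""
    approach = detect_approach(profile)
    knob = 1
    while knob < max_knob:
        nxt = knob * 2 if approach == "batching" else knob + 1
        config = (nxt, 1) if approach == "batching" else (1, nxt)
        if profile(*config).latency > latency_slo:
            break  # the next step would violate the latency requirement
        knob = nxt
    return approach, knob

# Toy profile for a batching-friendly model: throughput grows sublinearly
# with total work, latency grows linearly (purely illustrative numbers).
def toy_profile(batch_size: int, n_instances: int) -> Measurement:
    work = batch_size * n_instances
    return Measurement(throughput=100 * work ** 0.8, latency=0.010 * work)

print(scale(toy_profile, latency_slo=0.100))  # -> ('batching', 8)
```

The batch-size doubling here mirrors the common practice of searching power-of-two batch sizes; a finer-grained or feedback-driven search would serve equally well in this role.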

