1 Introduction
Deep neural networks (DNNs) have become increasingly popular tools for implementing artificial intelligence (AI) capabilities, such as image classification and speech recognition, in mobile applications. Because executing inference with DNNs is computationally heavy for mobile devices, inference jobs are typically offloaded to a cloud (or fog) server. From the server's perspective, many inference jobs originating from a large number of different mobile devices arrive, and the server must process them within the latency requirements of the applications. To realize high-speed DNN inference, such a server usually exploits the parallel computing capability of a GPU, which greatly accelerates the inference process [1, 20].

GPU-based inference has an interesting characteristic: batching many jobs drastically increases the computing efficiency in terms of both processing speed and energy consumption [1, 5, 20]. Table 1 shows measured computing performance for two GPUs (Tesla V100 and Tesla P4) and two precisions (FP16/FP32 mixed precision and INT8), as reported in [1]. A DNN called ResNet-50, the winner of the ImageNet Large Scale Visual Recognition Challenge 2015 (ILSVRC2015), is employed for these measurements. Note that the energy efficiency is expressed as the number of inference jobs that can be processed per unit time with unit power (measured in watts); equivalently, it is the average number of inference jobs processed with unit energy (measured in joules). In each case of Table 1, we see that both the throughput and the energy efficiency increase substantially when multiple jobs are batched.
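To make this energy-efficiency metric concrete, the following sketch computes inferences per joule as throughput divided by board power. The throughput and power figures below are hypothetical placeholders, not the actual Table 1 values.

```python
# Hypothetical illustration of the batching effect: throughput
# (inferences/s) and average board power (W) at two batch sizes.
# The numbers are made up for illustration; see [1] for real data.
measurements = {
    1:  {"throughput": 1000.0, "power": 200.0},  # batch size 1
    32: {"throughput": 8000.0, "power": 250.0},  # batch size 32
}

for batch_size, m in measurements.items():
    # Energy efficiency: inference jobs processed per joule,
    # i.e., (jobs per second) / (joules per second).
    efficiency = m["throughput"] / m["power"]
    print(f"batch size {batch_size:2d}: {efficiency:.1f} inferences/J")
```

With these placeholder numbers, the batched configuration processes several times more inferences per joule, mirroring the qualitative trend reported in Table 1.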


Because of this characteristic of GPU-based inference, it is efficient for a server to combine multiple inference jobs arriving from different devices into a batch and process them simultaneously. Such a dynamic batching procedure is indeed supported by DNN serving libraries such as TensorFlow Serving [17] and TensorRT Inference Server [2].

The main purpose of this paper is to introduce a queueing-theoretic perspective on GPU-based DNN inference systems with dynamic batching. We formulate an inference server as a batch-service queueing model with batch-size-dependent processing times, and we present a novel analytical method for this model. Although the analysis of batch-service queues is a well-studied subject in queueing theory [3, 4, 6, 7, 8, 10, 12, 14, 15, 16], no closed-form characterization of performance metrics has been reported in the literature, except for the special case in which the processing time distribution is independent of the batch size. Therefore, existing closed-form formulas are not applicable to the performance evaluation of GPU-based inference servers with a dynamic batching scheme, where service times increase with the batch size.
While most previous works focus on models with batch-size-independent processing times, Neuts [14], [15], [16, Section 4.2] considers the case with batch-size-dependent processing times, presenting computational procedures to numerically obtain several performance metrics. In particular, the matrix-analytic method developed in [16] provides a unified way to perform an algorithmic analysis of a wide range of batch-service queueing models. However, the main weakness of numerical approaches like the matrix-analytic method is that the derived mathematical formulas provide little information about the impact of the model parameters on the system performance.
In this paper, we first show that the energy efficiency of the system monotonically increases with the arrival rate of inference jobs (i.e., the system load), by means of stochastic comparison techniques [13, 18]. This result suggests that it is energy-efficient to operate the server at a utilization level as high as possible within a latency requirement. We then derive a closed-form upper bound on the mean latency, which provides a simple characterization of the latency performance of GPU-based inference servers.
The key idea of our approach is to model the system as a batch-service queueing model with an infinite maximum batch size and batch processing times that increase linearly with the batch size. Note that the finiteness assumption on the maximum batch size is essential in approaches based on the matrix-analytic method, because it is a necessary condition for the system to be formulated as a Markov chain with a block upper or lower Hessenberg transition probability matrix. As we will see, however, the assumptions of an infinite maximum batch size and linear batch processing times enable us to derive a simple closed-form upper bound on the mean latency. Furthermore, numerical and simulation experiments show that the mean latency is quite well approximated by this closed-form upper bound, even in the case of a finite maximum batch size.
The rest of this paper is organized as follows. In Section 2, we introduce the mathematical model considered in this paper. In Section 3, we first show the monotonicity of the energy efficiency with respect to the system load under a relatively general setting, and then derive a closed-form upper bound for the mean latency assuming linear batch processing times. In Section 4, we conduct numerical and simulation experiments to discuss the tightness of the derived upper bound. Finally, we conclude this paper in Section 5.
2 Model
We model an inference server with dynamic batching as a single-server batch-service queueing model with an infinite buffer. We assume that arrivals of inference jobs follow a Poisson process with rate . The server can process multiple inference jobs simultaneously in a batch, and the processing times of batches are assumed to be independent, following a probability distribution that depends on the batch size. Let (,) denote the cumulative distribution function (CDF) of the processing time for a batch of size . Let () denote a generic random variable following the CDF . We define () as the mean throughput (the number of inference jobs processed per unit time) for a batch size : (1)
Throughout this paper, we make the following assumption:
Assumption 1.
(i) (), i.e., the mean throughput is nondecreasing with the batch size .
(ii) .
Assumption 1 (i) reflects the characteristic of GPU-based inference that the computing efficiency increases with the batch size. Note that under Assumption 1 (i), the limit is always well-defined. Clearly, Assumption 1 (ii) is a necessary (and, in the batching scheme described below, also sufficient) condition for the system to be stable.
In order to construct a tractable model, we assume the following simple dynamic batching scheme: whenever the server is idle and there is at least one waiting job in the buffer, all of the waiting jobs are combined into a single batch, and its processing is immediately initiated. To be more specific, suppose that the server is idle and the buffer is empty at time . Let () denote the size of the th batch processed after time . Also, let () denote the number of waiting inference jobs just before the departure of the th batch. For convenience, we define . Under the batching scheme described above, all waiting jobs are put into the next batch, so that if . If , on the other hand, the next batch contains only the single inference job which has arrived to the empty system. Therefore, it follows that
(2) 
where denotes an indicator function.
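The batching dynamics described above can be sketched in a short simulation of the batch-size sequence: Poisson arrivals are assumed, and for concreteness the batch processing times are taken to be deterministic and linear in the batch size (anticipating Assumption 4 below); the numeric constants are purely illustrative.

```python
import random

def simulate_batch_sizes(lam, tau, n_batches, seed=0):
    """Simulate the chain of processed batch sizes under the dynamic
    batching scheme: at each service completion, all waiting jobs form
    the next batch; if none are waiting, the next batch holds the single
    job that arrives to the empty system."""
    rng = random.Random(seed)

    def poisson(mean):
        # Sample a Poisson variate by counting unit-rate exponential
        # interarrivals falling within an interval of length `mean`.
        count, t = 0, rng.expovariate(1.0)
        while t <= mean:
            count += 1
            t += rng.expovariate(1.0)
        return count

    sizes = []
    b = 1  # first batch: one job arriving at the empty system
    for _ in range(n_batches):
        sizes.append(b)
        # Jobs arriving during this batch's (deterministic) processing
        # time all join the next batch.
        q = poisson(lam * tau(b))
        b = max(q, 1)  # an empty queue yields a next batch of size 1
    return sizes

# Hypothetical linear processing times tau(n) = 0.01 + 0.002 * n seconds
# (illustrative constants, not the values estimated from Table 1).
sizes = simulate_batch_sizes(lam=100.0, tau=lambda n: 0.01 + 0.002 * n,
                             n_batches=50_000)
print(sum(sizes) / len(sizes))  # simulated mean processed batch size
```

The `max(q, 1)` step is exactly the case distinction made above: a nonempty queue is served in full, while an empty queue waits for a single arrival.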
In the next section, we will derive analytical results for the batch-service queueing system described so far.
3 Queueing Analysis
3.1 Preliminaries
Let () denote the number of inference jobs arriving during the processing time of the th batch. By definition, the probability function of () is given by
(3) 
where (, ) is defined as
(4) 
It is readily verified that the number of waiting jobs () at the th processing completion satisfies
so that we obtain from (2),
(5) 
It then follows from (3) and (5) that the sequence of processed batch sizes forms a discrete-time Markov chain on the state space , whose transition probability matrix is given by
(6) 
Note that this Markov chain is of GI/G/1-type, i.e., there is no skip-free structure in the transition matrix . In general, it is difficult to characterize the exact stationary distribution of a GI/G/1-type Markov chain, and one has to resort to numerical approximation methods such as truncation techniques [9, 11, 19]. As we will see in Section 3.3, however, we can obtain a closed-form upper bound on the mean latency by assuming linearly increasing batch processing times.
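As a rough illustration of the truncation approach, the following sketch builds the transition matrix of the batch-size chain on a truncated state space and approximates its stationary distribution by power iteration. Deterministic linear processing times and illustrative constants are assumed here; they are not part of the general model.

```python
import math

def stationary_batch_dist(lam, tau, N=60, iters=300):
    """Approximate the stationary distribution of the batch-size chain by
    truncating the state space to {1, ..., N} (mass that would leave the
    truncation is augmented onto state N) and applying power iteration.
    `tau` is assumed to give deterministic batch processing times."""
    def poisson_pmf(mean, m):
        return math.exp(-mean) * mean**m / math.factorial(m)

    # P[i][j] = P(next batch size = j+1 | current batch size = i+1):
    # the next batch size is max(A, 1), where A is Poisson with mean
    # lam * tau(current batch size).
    P = []
    for i in range(1, N + 1):
        mean = lam * tau(i)
        row = [0.0] * N
        row[0] = poisson_pmf(mean, 0) + poisson_pmf(mean, 1)  # max(A,1)=1
        for j in range(2, N):
            row[j - 1] = poisson_pmf(mean, j)
        row[N - 1] = max(0.0, 1.0 - sum(row[:N - 1]))  # augmentation
        P.append(row)

    pi = [1.0 / N] * N
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(N)) for j in range(N)]
    return pi

pi = stationary_batch_dist(lam=100.0, tau=lambda n: 0.01 + 0.002 * n)
mean_batch = sum((j + 1) * p for j, p in enumerate(pi))
print(round(mean_batch, 2))  # approximate stationary mean batch size
```

As the text notes, such numerical schemes yield numbers but little structural insight, which motivates the closed-form bound derived in Section 3.3.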
In the rest of this subsection, we derive some basic relations among key performance metrics in steady state. Let denote a generic random variable following the stationary distribution of . Let denote a generic random variable for the stationary number of inference jobs in the system at an arbitrary time instant. Further, let () denote a generic random variable following the probability function (). It is readily verified from (4) that the first two moments of are given by
(7)
We define and () as the probability generating functions (PGFs) of and :
(8)
Lemma 1.
() satisfies
(9) 
Proof.
Let () denote the number of inference jobs in the system at time . By definition, each sample path of is given by a step function with unit upward jumps (arrivals of customers) and downward jumps of magnitude (completions of batch processing). For convenience, we assume that each sample path of is constructed so that it is right-continuous with left limits. Because the system is stable, there is a one-to-one correspondence between an upward jump and the contribution of an inference job to a downward jump (see Fig. 1). To be more specific, let and denote the arrival and departure times of the th arriving job (). We define and as
(10) 
i.e., denotes the number of inference jobs in the system seen by the th inference job on arrival, and denotes the number of inference jobs arrived in the sojourn time of the th inference job which are in the system just before its departure. It is then verified that for each sample path, there is a bijection such that .
Let (resp. ) denote a generic random variable for (resp. ) in steady state. Owing to PASTA and the observation above, we obtain
(11) 
where denotes equality in distribution. We then consider the distribution of to prove (9).
Let denote the latency (sojourn time) of a randomly chosen inference job. We define (resp. ) as a generic random variable for the processing time of a randomly chosen batch (resp. a randomly chosen inference job). Note that the distributions of and are given by (cf. (12))
(13)  
(14) 
Lemma 2.
The mean latency is given by
(15) 
Proof.
Remark 1.
We can verify that the first term (resp. the second term) on the right-hand side of (15) represents the mean waiting (resp. processing) time of a randomly chosen inference job.
3.2 Monotonicity of the Energy Efficiency
In this subsection, we show that the larger the system load is, the more energy-efficient the system is, under some additional assumptions. Let () denote the amount of energy consumed in processing a batch of size . can be calculated from Table 1 as the product of the average board power and the batch processing time (i.e., the batch size divided by the throughput). For each case in Table 1, is well fitted by a linear function (with the least squares method, we have the coefficient of determination for Tesla V100 and for Tesla P4). See Fig. 2 for plotted as a function of .
We thus make the following assumption on :
Assumption 2.
() is given by
(17) 
for some and .
In steady state, the server processes batches per unit time with energy consumption on average. We then define the average energy efficiency of the system as
(18) 
i.e., the mean number of inference jobs processed with unit energy. Under Assumption 2, (18) is rewritten as
(19) 
In what follows, we show that the energy efficiency is nondecreasing with respect to the arrival rate . To establish this monotonicity result for , we need an additional assumption on the batch processing time distribution ():
Definition 1 ([18, Eq. (1.A.1)]).
Let and denote nonnegative random variables. is said to be smaller than in the usual stochastic order if and only if
Remark 2 ([18, Eq. (1.A.7)]).
holds if and only if
for any nondecreasing function () provided the expectations exist. In particular, .
Assumption 3.
holds for any .
Although Assumption 3 is a strong assumption on the batch processing time distribution, for several probability distributions it reduces to a condition on the mean values alone, as shown in the following example:
Example 1.
In the following cases, we have (cf. Remark 2):

(i) () follows a gamma distribution with a fixed coefficient of variation , i.e., where and denote the gamma function and the lower incomplete gamma function, respectively.
(ii) () takes a constant value, i.e.,
Let and () denote the stationary batch size and the energy efficiency represented as functions of the arrival rate .
Theorem 1.
Under Assumption 3, the stationary batch size () increases with the arrival rate in the usual stochastic order, i.e.,
(20) 
Proof.
Let () denote the transition probability matrix of given , and let () denote the th element of . To prove (20), it is sufficient to show that, in the sense of the usual stochastic order, the probability distribution increases with and is smaller than [13, pp. 186–187], i.e.,
(21) 
and
(22) 
Using (6), we rewrite (21) and (22) as
(23) 
and
(24) 
where () is defined as (cf. (4))
Let () denote a generic random variable satisfying ().
Corollary 1.
Corollary 1 suggests that it is energyefficient to operate the inference server under a utilization level as high as possible within a latency requirement of inference jobs. In the following subsection, we derive a closedform upper bound of the mean latency, assuming linearly increasing batch processing times.
3.3 Deterministic Linear Batch Processing Times
In Lemma 2, we showed that the mean latency is expressed in terms of the stationary distribution of the Markov chain of batch sizes. As mentioned above, an exact analysis of the stationary distribution of this GI/G/1-type Markov chain is difficult, and only numerical approximations are known in the literature.
In this subsection, we show that a closed-form upper bound on the mean latency can be obtained by assuming a specific structure of the batch processing times. Specifically, we make the following assumption throughout this subsection:
Assumption 4.
The batch processing time () takes a constant value equal to , which is given by
(25) 
for some and .
The deterministic distribution is a natural choice for modeling batch inference times, because most DNNs take a vector of fixed size (the input dimension times the batch size) as their input, and the output is computed by applying a predefined sequence of operations to it, such as matrix multiplications and nonlinear activation functions; the computational steps are thus the same regardless of the input vector. Furthermore, the linearity assumption (25) is consistent with the measurement results in Table 1: with the least squares method, we have the coefficient of determination (resp. ) with and (resp. , ) for batch processing times calculated from the data in Table 1 (a) (resp. Table 1 (b)) by dividing batch sizes by throughputs (cf. (1)). Note that under Assumption 4, the throughput () is written as
(26)
As shown in Fig. 3, the throughput characteristics in Table 1 are well fitted by this simple rational function.
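The fitting procedure outlined above can be reproduced in a few lines: per-batch processing times are computed as batch size divided by measured throughput and fitted to a linear model by ordinary least squares. The (batch size, throughput) pairs below are hypothetical stand-ins for the Table 1 measurements.

```python
# Hypothetical (batch size, throughput in inferences/s) measurements.
data = [(1, 95.0), (2, 180.0), (4, 330.0), (8, 560.0), (16, 860.0)]

ns = [n for n, _ in data]
ts = [n / thr for n, thr in data]  # per-batch processing times t_n = n / throughput

# Closed-form simple linear regression of t_n = a + b * n.
m = len(data)
nbar = sum(ns) / m
tbar = sum(ts) / m
b = sum((n - nbar) * (t - tbar) for n, t in zip(ns, ts)) / \
    sum((n - nbar) ** 2 for n in ns)
a = tbar - b * nbar

# The fitted throughput is then n / (a + b * n); its limit 1 / b is the
# server's maximum processing rate, giving the stability condition
# (arrival rate) < 1 / b.
print(f"a = {a:.5f}, b = {b:.5f}, capacity 1/b = {1 / b:.1f} jobs/s")
```

Plotting the fitted rational function against the raw throughput points is the quickest way to check the linearity assumption for a given GPU and precision.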
We can readily verify from (26) that Assumption 4 ensures Assumption 1 (i). Furthermore, (26) implies
so that the stability condition stated in Assumption 1 (ii) is rewritten as
(27) 
In view of this relation, the normalized load represents the ratio of the arrival rate to the server’s processing capacity, which corresponds to the traffic intensity in ordinary singleserver queueing models.
Assumption 4 simplifies the analysis mainly because under this assumption, , , and (see (13) and (14)) are given in terms of the first two moments and of the stationary batch size distribution:
(28)  
(29)  
(30) 
Lemma 3.
Proof.
Lemma 4.
Under Assumption 4, the mean latency is given in terms of the probability that the server is idle by
(35) 
Proof.
It follows from (15), (30), (31), and (32) that
(36) 
Note here that (31) and (32) imply
(37) 
In addition, owing to Little’s law, the server utilization (i.e., the mean number of batches being served in steady state) is equal to the product of the number of batches processed per unit time and the mean batch processing time:
(38) 
where we used (28) for the second equality. Therefore, we obtain (35) from (36), (37), and (38). ∎
Remark 3.
By definition, we have (see (8)).
Even under Assumption 4, it seems difficult to determine the exact value of . However, we have the following simple lower bound for this quantity:
Lemma 5.
Under Assumption 4, is bounded below by
(39) 
Remark 4.
If , the quantity is equal to the probability that the server is idle in a stationary single-service M/D/1 queue with arrival rate and processing time , where arriving inference jobs are processed one by one.
We are in a position to obtain the main result of this paper:
Theorem 2.
Under Assumption 4, the mean latency is bounded above by
(41)  
and  
(42) 
In addition, we have if and only if .
Proof.
Theorem 2 provides a surprisingly simple upper bound for the mean latency . For convenience, let
(43) 
Even though this upper bound is obtained by replacing the idle probability with its almost trivial lower bound in (39), it provides quite a good approximation to the exact value of the mean latency , as we will see in the next section.
4 Numerical Evaluation
In this section, we present some numerical and simulation experiments. Throughout this section, we concentrate on the case with deterministic linear batch processing times considered in Section 3.3. In particular, we employ the model parameters and estimated in Section 3.3 from Table 1 (see the paragraph just after Assumption 4).
Fig. 5 shows simulation results for the mean latency and its upper bounds and given in (41) and (42) (recall that the normalized load is defined in (27)). We observe that the combination (43) of these upper bounds approximates the exact curve of quite well. In particular, except for small values of , takes values fairly close to .
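A minimal discrete-event simulation of the mean latency under Assumption 4, of the kind behind such plots, can be sketched as follows. The arrival rate and processing-time constants are illustrative, not the parameters estimated from Table 1.

```python
import random

def mean_latency_sim(lam, a, b, n_jobs=100_000, seed=1):
    """Simulate the dynamic-batching queue with deterministic linear
    processing times a + b * (batch size), returning the simulated mean
    latency (sojourn time) of an inference job."""
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    for _ in range(n_jobs):
        t += rng.expovariate(lam)  # Poisson arrival process
        arrivals.append(t)

    total_latency = 0.0
    i = 0              # index of the next job not yet batched
    server_free = 0.0  # time at which the server next becomes idle
    while i < n_jobs:
        # A batch starts when the server is free and a job is present.
        start = max(server_free, arrivals[i])
        # All jobs that have arrived by `start` join the batch.
        j = i
        while j < n_jobs and arrivals[j] <= start:
            j += 1
        depart = start + a + b * (j - i)  # deterministic linear service
        for k in range(i, j):
            total_latency += depart - arrivals[k]
        server_free = depart
        i = j
    return total_latency / n_jobs

# Hypothetical parameters; normalized load rho = lam * b = 0.5 here.
print(round(mean_latency_sim(lam=250.0, a=0.01, b=0.002), 4))
```

Sweeping `lam` from near 0 up to just below `1 / b` reproduces a latency-versus-load curve that can be compared against the closed-form bound (43).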
Recall that the upper bound is obtained by replacing the idle probability with its trivial lower bound . In Fig. 3(b), the server utilization is plotted as a function of the normalized load . As a reference, we also plot its upper bound (cf. (39)). From this figure, we see that the server utilization takes a value close to even for moderate values of , which is quite different from ordinary single-server queues, where the server utilization equals the traffic intensity. This phenomenon stems from the fact that the server's processing speed increases substantially with the batch size, so that the system is overloaded for small batch sizes even under a moderate load level . Because of this behavior of the server utilization, the upper bound is a good approximation to the mean latency for a wide range of .
On the other hand, for small , the upper bound is a good approximation to . Note that is obtained by replacing the mean batch size with its trivial lower bound . Therefore, implies that the mean batch size , i.e., the server does not sufficiently leverage its batch-processing capability in that region.
We next discuss the energy efficiency, using the linear model (17) considered in Section 3.2. Recall that the average energy efficiency is defined in (18) and represents the mean number of jobs processed with unit energy. In Fig. 7, simulation results for and its lower bound (40) are plotted as functions of the normalized load . From this figure, we observe that the energy efficiency can be substantially enhanced by keeping the server adequately loaded. Also, the energy efficiency is well approximated by the lower bound (40) except for small values of . Fig. 7 shows the energy-latency tradeoff, where the relation between and the mean latency is plotted with parameter . In this figure, we also plot approximation curves obtained by combining (40) and (43). We see that the closed-form bounds (40) and (43) are useful for determining an adequate operating point of the server, taking the energy-latency tradeoff into consideration.
Finally, we discuss the relation between the model considered in this paper and a corresponding batch-service queue with a finite maximum batch size . As mentioned in Section 1, the mean latency in the case of finite can be numerically obtained with the results in [16, Section 4.2]. Fig. 8 shows that if is sufficiently large, the mean latency is well approximated by our closed-form upper bound given by (43). If is small, on the other hand, the mean latency deviates from for arrival rates near the stability boundary . However, we observe from this figure that even for small values of , the mean latency is still well approximated by (43) if the system is moderately loaded, i.e., if is sufficiently small compared to .
5 Conclusions
In this paper, we introduced a queueing model representing GPU-based inference servers with dynamic batching. We modeled an inference server as a batch-service queueing model with an infinite maximum batch size and batch-size-dependent processing times. We first showed that the energy efficiency of the server increases with the arrival rate of inference jobs, which suggests that it is energy-efficient to operate the server under a traffic load as high as possible within the latency requirement of inference jobs. We then derived a simple closed-form upper bound for the mean latency in Theorem 2, under the assumption that the batch processing time increases linearly with the batch size. Through numerical and simulation experiments, we showed that the exact value of the mean latency is well approximated by this simple upper bound.
Acknowledgements
This work was supported in part by JSPS KAKENHI Grant Number 18K18007.
References
[1] Nvidia AI Inference Platform, Giant Leaps in Performance and Efficiency for AI Services, from the Data Center to the Network's Edge. https://www.nvidia.com/enus/datacenter/resources/inferencetechnicaloverview/ (accessed 06-Dec-2019).

[2] Nvidia TensorRT Inference Server. https://docs.nvidia.com/deeplearning/sdk/tensorrtinferenceserverguide/docs/ (accessed 06-Dec-2019).
[3] N. T. J. Bailey, On queueing processes with bulk service, J. Roy. Stat. Soc. B 16 (1954) 80–87.
[4] G. Brière and M. L. Chaudhry, Computational analysis of single-server bulk-service queues, M/G/1, Adv. Appl. Prob. 21 (1989) 207–225.
[5] D. Crankshaw, X. Wang, G. Zhou, M. J. Franklin, J. E. Gonzalez, and I. Stoica, Clipper: A low-latency online prediction serving system, in Proc. of the 14th USENIX Symposium on Networked Systems Design and Implementation (2017) 613–627.
 [6] R. K. Deb and R. F. Serfozo, Optimal control of batch service queues, Adv. Appl. Prob. 5 (1973) 340–361.
 [7] F. Downton, Waiting time in bulk service queues, J. Roy. Stat. Soc. B 17 (1955) 256–261.
 [8] F. Downton, On limiting distributions arising in bulk service queues, J. Roy. Stat. Soc. B 18 (1956) 265–274.
 [9] D. Gibson and E. Seneta, Augmented truncations of infinite stochastic matrices, J. Appl. Prob. 24 (1987) 600–608.
[10] N. K. Jaiswal, Time-dependent solution of the bulk-service queueing problem, Oper. Res. 8 (1960) 773–781.
[11] Y. Liu, Augmented truncation approximations of discrete-time Markov chains, Oper. Res. Lett. 38 (2010) 218–222.
 [12] J. Medhi, Waiting time distribution in a Poisson queue with a general bulk service rule, Manag. Sci. 21 (1975) 777–782.
 [13] A. Müller and D. Stoyan, Comparison Methods for Stochastic Models and Risks, John Wiley & Sons, Chichester, UK, 2002.
 [14] M. F. Neuts, The busy period of a queue with batch service, Oper. Res. 13 (1965) 815–819.
 [15] M. F. Neuts, A general class of bulk queues with Poisson input, Ann. Math. Stat. 38 (1967) 759–770.
 [16] M. F. Neuts, Structured Stochastic Matrices of M/G/1 Type and Their Applications, Marcel Dekker, New York, 1989.
[17] C. Olston, N. Fiedel, K. Gorovoy, J. Harmsen, L. Lao, F. Li, V. Rajashekhar, S. Ramesh, and J. Soyke, TensorFlow-Serving: Flexible, high-performance ML serving, in Proc. of the Workshop on ML Systems at NIPS 2017, 2017.
 [18] M. Shaked and J. G. Shanthikumar, Stochastic Orders, Springer, New York, NY, 2007.
[19] R. L. Tweedie, Truncation approximations of invariant measures for Markov chains, J. Appl. Prob. 35 (1998) 517–536.

[20] R. Xu, F. Han, and Q. Ta, Deep learning at scale on NVIDIA V100 accelerators, in Proc. of the 2018 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS18), 2018.