Deep neural networks (DNNs) have become increasingly popular tools for implementing artificial intelligence (AI) capabilities, such as image classification and speech recognition, in mobile applications. Because executing inference with DNNs is computationally heavy for mobile devices, inference jobs are typically offloaded to a cloud (or fog) server. From the server's perspective, many inference jobs originating from a large number of different mobile devices arrive, and the server should process them within the latency requirements of the applications. To realize high-speed DNN inference, such a server usually utilizes the parallel computing capability of a GPU, which greatly accelerates the inference process [1, 20].
GPU-based inference has an interesting characteristic: batching many jobs drastically increases the computing efficiency in terms of both processing speed and energy consumption [1, 5, 20]. Table 1 shows measured computing performance reported in [1] for two different GPUs (Tesla V100 and Tesla P4) and precisions (FP16/FP32 mixed precision and INT8). A DNN called ResNet-50, the winner of the ImageNet Large Scale Visual Recognition Challenge 2015 (ILSVRC2015), is employed for these measurements. Note that the energy efficiency is represented as the number of inference jobs that can be processed per unit time with unit power (measured in watts). We can also interpret this quantity as the average number of inference jobs processed with unit energy (measured in joules). In each case of Table 1, we see that the throughput and the energy efficiency increase substantially when multiple jobs are batched.
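The equivalence between the two readings of the energy-efficiency metric (jobs per second per watt, and jobs per joule) is simple arithmetic. The following sketch uses hypothetical numbers, not the Table 1 measurements, to illustrate how batching can improve both throughput and jobs per joule:

```python
# A small arithmetic sketch (hypothetical numbers, not the Table 1
# measurements): jobs per second per watt is the same quantity as
# jobs per joule.

def throughput(batch_size, batch_time_s):
    """Inference jobs completed per second when batches of this size run."""
    return batch_size / batch_time_s

def energy_efficiency(batch_size, batch_time_s, board_power_w):
    """Jobs per joule = (jobs per second) / (joules per second)."""
    return throughput(batch_size, batch_time_s) / board_power_w

# Hypothetical GPU: one job in 2 ms at 200 W vs. eight jobs in 6 ms at 250 W.
single = energy_efficiency(1, 0.002, 200.0)
batched = energy_efficiency(8, 0.006, 250.0)
print(single, batched)  # batching improves both throughput and jobs/joule
```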
Because of this characteristic of GPU-based inference, it is efficient for a server to combine multiple inference jobs arriving from different devices into a batch and process them simultaneously. Such a dynamic batching procedure is indeed supported by DNN serving libraries such as TensorFlow-Serving [17] and TensorRT Inference Server [2].
The main purpose of this paper is to introduce a queueing-theoretic perspective on GPU-based DNN inference systems with dynamic batching. We formulate an inference server as a batch-service queueing model with batch-size-dependent processing times, and we present a novel analytical method for this model. Although the analysis of batch-service queues is a well-studied subject in queueing theory [3, 4, 6, 7, 8, 10, 12, 14, 15, 16], no closed-form characterization of performance metrics is reported in the literature, except for the special case in which the processing time distribution is independent of the batch size. Therefore, existing closed-form formulas are not applicable to performance evaluations of GPU-based inference servers with a dynamic batching scheme, where service times increase with the batch size.
While most previous works focus on models with batch-size-independent processing times, Neuts [14, 15] and [16, Section 4.2] consider the case of batch-size-dependent processing times and present computational procedures for numerically obtaining several performance metrics. In particular, the matrix-analytic method developed in [16] provides a unified way to perform an algorithmic analysis of a wide range of batch-service queueing models. However, the main weakness of numerical approaches like the matrix-analytic method is that the derived mathematical formulas provide little information about the impact of the model parameters on the system performance.
In this paper, we first show that the energy efficiency of the system monotonically increases with the arrival rate of inference jobs (i.e., the system load), by means of stochastic comparison techniques [13, 18]. This result suggests that it is energy-efficient to operate the server under a utilization level as high as possible within a latency requirement. We then derive a closed-form upper bound of the mean latency, which provides a simple characterization of the latency performance of GPU-based inference servers.
The key idea of our approach is to model the system as a batch-service queueing model with an infinite maximum batch size and batch processing times that increase linearly with the batch size. Note that the finiteness assumption on the maximum batch size is essential in approaches based on the matrix-analytic method, because it is a necessary condition for the system to be formulated as a Markov chain with a block upper or lower Hessenberg transition probability matrix. As we will see, however, the assumptions of an infinite maximum batch size and linear batch processing times enable us to derive a simple closed-form upper bound of the mean latency. Furthermore, it is shown through numerical and simulation experiments that the mean latency is quite well approximated by this closed-form upper bound, even in the case of a finite maximum batch size.
The rest of this paper is organized as follows. In Section 2, we introduce the mathematical model considered in this paper. In Section 3, we first show the monotonicity of the energy-efficiency with respect to the system load under a relatively general setting, and then derive a closed-form upper bound for the mean latency assuming linear batch processing times. In Section 4, we conduct numerical and simulation experiments to discuss the tightness of the derived upper bound. Finally, we conclude this paper in Section 5.
We model an inference server with dynamic batching as a single-server batch-service queueing model with an infinite buffer. We assume that arrivals of inference jobs follow a Poisson process with rate λ (> 0). The server can process multiple inference jobs simultaneously in a batch, and the processing times of batches are assumed to be independent and to follow probability distributions depending on their batch sizes. Let H_n (n = 1, 2, …) denote the cumulative distribution function (CDF) of the processing time of a batch of size n. Let S_n (n = 1, 2, …) denote a generic random variable following the CDF H_n. We define γ_n (n = 1, 2, …) as the mean throughput (the number of inference jobs processed per unit time) for batch size n:

γ_n = n / E[S_n], n = 1, 2, ….  (1)
Throughout this paper, we make the following assumption:
Assumption 1.
(i) γ_n ≤ γ_{n+1} (n = 1, 2, …), i.e., the mean throughput is non-decreasing in the batch size n.
(ii) λ < lim_{n→∞} γ_n.
Assumption 1 (i) reflects the characteristic of GPU-based inference that the computing efficiency increases with the batch size. Note that under Assumption 1 (i), the limit lim_{n→∞} γ_n is always well-defined. Clearly, Assumption 1 (ii) is a necessary (and, under the batching scheme described below, sufficient) condition for the system to be stable.
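As a concrete sanity check, the linear mean processing times E[S_n] = α + βn used later in Section 3.3 satisfy Assumption 1: γ_n = n/(α + βn) is non-decreasing with limit 1/β. The sketch below verifies this numerically for placeholder values of α and β (not the values fitted from Table 1):

```python
# Numerical check of Assumption 1 for linear mean processing times
# E[S_n] = alpha + beta * n (placeholder parameter values).

alpha, beta = 0.01, 0.002  # hypothetical setup time and per-job time (s)

def gamma(n):
    """Mean throughput gamma_n = n / E[S_n] for batch size n."""
    return n / (alpha + beta * n)

gammas = [gamma(n) for n in range(1, 1001)]
# Assumption 1 (i): gamma_n is non-decreasing in n
assert all(a <= b for a, b in zip(gammas, gammas[1:]))

limit = 1.0 / beta        # lim_{n -> inf} gamma_n = 1 / beta
lam = 400.0               # an arrival rate satisfying Assumption 1 (ii)
print(gamma(1), limit, lam < limit)
```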
In order to construct a tractable model, we assume the following simple dynamic batching scheme: whenever the server is idle and there is at least one waiting job in the buffer, all of the waiting jobs are incorporated into a single batch, and its processing is immediately initiated. To be more specific, suppose that the server is idle and the buffer is empty at time 0. Let B_k (k = 1, 2, …) denote the size of the kth batch processed after time 0. Also, let N_k (k = 1, 2, …) denote the number of waiting inference jobs just before the departure of the kth batch. For convenience, we define N_0 = 0. Under the batching scheme described above, all waiting jobs are put into the next batch, so that B_{k+1} = N_k if N_k ≥ 1. If N_k = 0, on the other hand, the (k+1)st batch contains only one inference job, which has arrived at the empty system. Therefore, it follows that

B_{k+1} = N_k 1{N_k ≥ 1} + 1{N_k = 0} = max(N_k, 1), k = 0, 1, …,  (2)

where 1{·} denotes the indicator function.
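The batching recursion translates directly into a simulation of the embedded batch-size process. The sketch below assumes deterministic linear processing times (as in Section 3.3) with placeholder parameters; only the recursion itself comes from the scheme described above:

```python
# Simulation sketch of the embedded recursion B_{k+1} = max(N_k, 1),
# where N_k is the number of Poisson(lam) arrivals during the kth
# processing time. Linear deterministic processing times with
# placeholder alpha, beta are assumed.

import random

def simulate_batch_sizes(lam, alpha, beta, num_batches, seed=1):
    rng = random.Random(seed)
    n_wait = 0                      # N_0 = 0
    sizes = []
    for _ in range(num_batches):
        b = max(n_wait, 1)          # B_{k+1} = max(N_k, 1)
        s = alpha + beta * b        # processing time of this batch
        # count Poisson arrivals during (0, s] via exponential gaps
        t, count = 0.0, 0
        while True:
            t += rng.expovariate(lam)
            if t > s:
                break
            count += 1
        n_wait = count              # N_{k+1}
        sizes.append(b)
    return sizes

sizes = simulate_batch_sizes(lam=400.0, alpha=0.01, beta=0.002,
                             num_batches=5000)
print(sum(sizes) / len(sizes))      # mean batch size under load 0.8
```

Note that the idle period before a size-1 batch does not affect the recursion, since N_{k+1} only counts arrivals during the processing time itself.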
In the next section, we will derive analytical results for the batch-service queueing system described so far.
3 Queueing Analysis
Let A_k (k = 1, 2, …) denote the number of inference jobs arriving during the processing time of the kth batch. By definition, the probability function of A_k (k = 1, 2, …) is given by

P(A_k = j | B_k = n) = a_n(j), j = 0, 1, …,  (3)

where a_n(j) (n = 1, 2, …; j = 0, 1, …) is defined as

a_n(j) = ∫_0^∞ e^{−λx} (λx)^j / j! dH_n(x).  (4)

It is readily verified that the number of waiting jobs N_k (k = 1, 2, …) at the kth processing completion satisfies

N_k = A_k, k = 1, 2, …,

so that we obtain from (2) that {N_k; k = 0, 1, …} forms a discrete-time Markov chain whose transition probabilities are given by

P(N_{k+1} = j | N_k = i) = a_{max(i,1)}(j), i, j = 0, 1, ….  (5)
Note that this Markov chain is of GI/G/1-type, i.e., there is no skip-free structure in its transition matrix. In general, it is difficult to characterize the exact stationary distribution of a GI/G/1-type Markov chain, and one has to resort to numerical approximation methods such as truncation techniques [9, 11, 19]. As we will see in Section 3.3, however, we can obtain a closed-form upper bound of the mean latency by assuming linearly increasing batch processing times.
In the rest of this subsection, we derive some basic relations among key performance metrics in steady state. Let N denote a generic random variable following the stationary distribution of {N_k; k = 0, 1, …}. Let L denote a generic random variable for the stationary number of inference jobs in the system at an arbitrary time instant. Further, let A_n (n = 1, 2, …) denote a generic random variable following the probability function a_n(·). It is readily verified from (4) that the first two moments of A_n are given by

E[A_n] = λ E[S_n],   E[A_n²] = λ E[S_n] + λ² E[S_n²].

We define Π(z) and Â_n(z) (n = 1, 2, …) as the probability generating functions (PGFs) of N and A_n:

Π(z) = E[z^N],   Â_n(z) = E[z^{A_n}], |z| ≤ 1.
Let L(t) (t ≥ 0) denote the number of inference jobs in the system at time t. By definition, each sample path of {L(t); t ≥ 0} is given by a step function with unit upward jumps (arrivals of jobs) and downward jumps of magnitude B_k (completions of batch processing). For convenience, we assume that each sample path of {L(t); t ≥ 0} is constructed so that it is right-continuous with left limits. Because the system is stable, there is a one-to-one correspondence between an upward jump and the contribution of an inference job to a downward jump (see Fig. 1). To be more specific, let α_m and δ_m denote the arrival and departure times of the mth arriving job (m = 1, 2, …). We define U_m and V_m as

U_m = L(α_m−),   V_m = #{j ≥ 1; α_m < α_j < δ_m, δ_j ≥ δ_m},

i.e., U_m denotes the number of inference jobs in the system seen by the mth inference job on arrival, and V_m denotes the number of inference jobs arriving in the sojourn time of the mth inference job that are in the system just before its departure. It is then verified that for each sample path, there is a bijection σ on {1, 2, …} such that U_m = V_{σ(m)}.
Let U (resp. V) denote a generic random variable for U_m (resp. V_m) in steady state. Owing to PASTA and the observation above, we obtain

L =d U =d V,

where =d denotes equality in distribution. We then consider the distribution of V to prove (9).
Let D denote the latency (sojourn time) of a randomly chosen inference job. We define S^(B) (resp. S^(J)) as a generic random variable for the processing time of a randomly chosen batch (resp. of a randomly chosen inference job). Letting B denote a generic random variable for the stationary batch size, the distributions of S^(B) and S^(J) are given by (cf. (12))

P(S^(B) ≤ x) = Σ_{n=1}^∞ P(B = n) H_n(x),   P(S^(J) ≤ x) = Σ_{n=1}^∞ (n P(B = n) / E[B]) H_n(x), x ≥ 0.
The mean latency E[D] is given by (15). We can verify that the first term (resp. the second term) on the right-hand side of (15) represents the mean waiting (resp. processing) time of a randomly chosen inference job.
3.2 Monotonicity of the Energy Efficiency
In this subsection, we show that the larger the system load, the more energy-efficient the system is, under some additional assumptions. Let E_n (n = 1, 2, …) denote the amount of energy consumed in processing a batch of size n. E_n is calculated from Table 1 as the product of the average board power and the batch processing time (i.e., the batch size divided by the throughput). For each case in Table 1, E_n is well fitted by a linear function of n (with the least-squares method, we obtain a coefficient of determination close to one for both the Tesla V100 and the Tesla P4 data). See Fig. 2, where E_n is plotted as a function of n.
We thus make the following assumption on E_n:

Assumption 2. E_n (n = 1, 2, …) is given by

E_n = e_0 + e_1 n,  (17)

for some e_0 ≥ 0 and e_1 > 0.
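The least-squares fit behind this assumption can be sketched as follows. The power and timing numbers below are hypothetical placeholders, not the Table 1 measurements; per-batch energy is regressed on the batch size:

```python
# Ordinary least-squares fit of the linear energy model E_n = e0 + e1 * n.
# Power and timing values are hypothetical, not the Table 1 data.

def fit_line(xs, ys):
    """OLS for y = e0 + e1 * x; returns (e0, e1, r2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    e1 = sxy / sxx
    e0 = my - e1 * mx
    ss_res = sum((y - (e0 + e1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return e0, e1, 1.0 - ss_res / ss_tot

batch_sizes = [1, 2, 4, 8, 16, 32]
power = 250.0                                    # watts (hypothetical)
times = [0.01 + 0.002 * n for n in batch_sizes]  # seconds (hypothetical)
energies = [power * t for t in times]            # joules per batch

e0, e1, r2 = fit_line(batch_sizes, energies)
print(e0, e1, r2)  # exactly linear data gives r2 = 1 up to rounding
```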
In steady state, the server processes λ/E[B] batches per unit time, with mean energy consumption E[E_B] = e_0 + e_1 E[B] per batch, where B denotes the stationary batch size. We then define the average energy efficiency η of the system (the mean number of jobs processed with unit energy) as

η = λ / ((λ/E[B]) E[E_B]) = E[B] / (e_0 + e_1 E[B]).  (18)
In what follows, we show that the energy efficiency η is non-decreasing with respect to the arrival rate λ. To establish this monotonicity result for η, we need an additional assumption on the batch processing time distributions H_n (n = 1, 2, …):
Definition 1 ([18, Eq. (1.A.1)]).
Let X and Y denote non-negative random variables. X is said to be smaller than Y in the usual stochastic order (denoted X ≤st Y) if and only if

P(X > x) ≤ P(Y > x) for all x ≥ 0.
Remark 2 ([18, Eq. (1.A.7)]).
X ≤st Y holds if and only if

E[f(X)] ≤ E[f(Y)]

for any non-decreasing function f(·), provided the expectations exist. In particular, X ≤st Y implies E[X] ≤ E[Y].

Assumption 3. S_n ≤st S_{n+1} holds for any n = 1, 2, ….
Although Assumption 3 is a strong assumption on the batch processing time distributions, it reduces to a condition on only their mean values for several probability distributions, as shown in the following example:
In the following cases, we have S_n ≤st S_{n+1} (n = 1, 2, …) (cf. Remark 2):
Let B(λ) and η(λ) denote the stationary batch size and the energy efficiency, represented as functions of the arrival rate λ.

Under Assumption 3, the stationary batch size B(λ) increases with the arrival rate λ in the usual stochastic order, i.e.,

B(λ) ≤st B(λ′) for λ ≤ λ′.  (20)
Let P(λ) denote the transition probability matrix of {N_k; k = 0, 1, …} given the arrival rate λ, and let p_{i,j}(λ) (i, j = 0, 1, …) denote the (i, j)th element of P(λ). To prove (20), it is sufficient to show that, in the sense of the usual stochastic order, the probability distribution (p_{i,j}(λ))_{j=0,1,…} increases with i, and that it is smaller under λ than under λ′ for λ ≤ λ′ [13, Pages 186–187], i.e.,

Σ_{j≥m} p_{i,j}(λ) ≤ Σ_{j≥m} p_{i+1,j}(λ),   Σ_{j≥m} p_{i,j}(λ) ≤ Σ_{j≥m} p_{i,j}(λ′), m = 0, 1, …,

where p_{i,j}(λ) is given by (cf. (4))

p_{i,j}(λ) = a_{max(i,1)}(j; λ).

Let A_n(λ) (n = 1, 2, …) denote a generic random variable satisfying P(A_n(λ) = j) = a_n(j; λ) (j = 0, 1, …).
Corollary 1 suggests that it is energy-efficient to operate the inference server under a utilization level as high as possible within a latency requirement of inference jobs. In the following subsection, we derive a closed-form upper bound of the mean latency, assuming linearly increasing batch processing times.
3.3 Deterministic Linear Batch Processing Times
In Lemma 2, we showed that the mean latency is given in terms of the stationary distribution of the Markov chain of batch sizes. As mentioned above, an exact analysis of the stationary distribution of the GI/G/1-type Markov chain is difficult, and only numerical approximations are known in the literature.
In this subsection, it is shown that we can obtain a closed-form upper bound of the mean latency by assuming a specific structure in batch processing times. Specifically, we make the following assumption throughout this subsection:
Assumption 4. The batch processing time S_n (n = 1, 2, …) takes a constant value equal to E[S_n], which is given by

E[S_n] = α + βn, n = 1, 2, …,  (25)

for some α ≥ 0 and β > 0.
The deterministic distribution is a natural choice for modeling batch inference times because most DNNs take a vector of fixed size (the input dimension times the batch size) as input, and the output is computed by applying a predefined sequence of operations to it, such as matrix multiplications and non-linear activation functions, so that the computational steps are invariant regardless of the input vector. Furthermore, the linearity assumption (25) is consistent with the measurement results in Table 1: applying the least-squares method to the batch processing times calculated from the data in Table 1 (a) and Table 1 (b) (by dividing batch sizes by throughputs; cf. (1)), we obtain estimates of α and β with coefficients of determination close to one. Note that under Assumption 4, the throughput γ_n (n = 1, 2, …) is written as

γ_n = n / (α + βn), n = 1, 2, …,  (26)

so that the stability condition stated in Assumption 1 (ii) is rewritten as

ρ := λβ < 1.  (27)
In view of this relation, the normalized load ρ = λβ represents the ratio of the arrival rate λ to the server's limiting processing capacity 1/β, which corresponds to the traffic intensity in ordinary single-server queueing models.
Under Assumption 4, the mean latency is given in terms of the probability that the server is idle by
In addition, owing to Little’s law, the server utilization (i.e., the mean number of batches being served in steady state) is equal to the product of the number of batches processed per unit time and the mean batch processing time:
By definition, we have (see (8)).
Even under Assumption 4, it seems difficult to determine the exact value of the probability that the server is idle. However, we have the following simple lower bound for this quantity:

Under Assumption 4, the probability that the server is idle is bounded below by

max(1 − λ(α + β), 0).

If λ(α + β) < 1, the quantity 1 − λ(α + β) is equal to the probability that the server is idle in a stationary single-server M/D/1 queue with arrival rate λ and processing time α + β, where arriving inference jobs are processed one by one.
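This M/D/1 interpretation can be checked directly by simulation: in a stable M/D/1 queue, the long-run fraction of idle time equals one minus the traffic intensity. The parameter values below are placeholders:

```python
# Simulation check of the M/D/1 idle probability 1 - lam * (alpha + beta).
# Parameter values are hypothetical; jobs are served one by one, FCFS.

import random

def md1_idle_fraction(lam, d, num_jobs=200_000, seed=7):
    rng = random.Random(seed)
    t_arrival, busy_until, busy_time = 0.0, 0.0, 0.0
    for _ in range(num_jobs):
        t_arrival += rng.expovariate(lam)        # Poisson arrivals
        start = max(t_arrival, busy_until)       # single server, FCFS
        busy_until = start + d                   # deterministic service d
        busy_time += d
    return 1.0 - busy_time / busy_until          # idle fraction estimate

lam, alpha, beta = 50.0, 0.01, 0.002             # lam*(alpha+beta) = 0.6 < 1
est = md1_idle_fraction(lam, alpha + beta)
print(est, 1.0 - lam * (alpha + beta))           # estimate vs. exact 0.4
```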
We are in a position to obtain the main result of this paper:
Under Assumption 4, the mean latency is bounded above by
In addition, we have if and only if .
Theorem 2 provides a surprisingly simple upper bound for the mean latency . For convenience, let
Even though this upper bound is obtained by replacing the idle probability with its almost trivial lower bound in (39), it provides quite a good approximation to the exact value of the mean latency, as we will see in the next section.
4 Numerical Evaluation
In this section, we present some numerical and simulation experiments. Throughout this section, we concentrate on the case of deterministic linear batch processing times considered in Section 3.3. In particular, we employ the model parameters α and β estimated in Section 3.3 from Table 1 (see the paragraph just after Assumption 4).
Fig. 5 shows simulation results for the mean latency and its upper bounds given in (41) and (42) (recall that the normalized load ρ is defined in (27)). We observe that the combination (43) of these upper bounds approximates the exact curve of the mean latency quite well. In particular, except for small values of ρ, (43) takes values fairly close to the exact mean latency.
Recall that this upper bound is obtained by replacing the idle probability with its trivial lower bound (cf. (39)). In Fig. 3(b), the server utilization is plotted as a function of the normalized load ρ. As a reference, we also plot its upper bound (cf. (39)). From this figure, we see that the server utilization takes a value close to one even for moderate values of ρ, which is quite different from ordinary single-server queues, where the server utilization equals the traffic intensity. This phenomenon comes from the fact that the server's processing speed largely increases with the batch size, so that the system is overloaded at small batch sizes even under a moderate load level ρ. Because of this behavior of the server utilization, the upper bound is a good approximation to the mean latency for a wide range of ρ.
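The qualitative behavior described here can be reproduced with a compact simulation of the model of Section 2 under Assumption 4. The parameter values below are placeholders, not the values fitted from Table 1:

```python
# Event-driven sketch of the dynamic-batching queue with deterministic
# linear processing times alpha + beta * n. Under a moderate normalized
# load rho = lam * beta, the server utilization already sits close to one.

import random

def simulate(lam, alpha, beta, num_jobs=100_000, seed=3):
    rng = random.Random(seed)
    arrivals, t = [], 0.0
    for _ in range(num_jobs):
        t += rng.expovariate(lam)
        arrivals.append(t)
    i, now, busy, total_latency = 0, 0.0, 0.0, 0.0
    while i < len(arrivals):
        if arrivals[i] > now:
            now = arrivals[i]                # server idles until an arrival
        j = i
        while j < len(arrivals) and arrivals[j] <= now:
            j += 1                           # batch all jobs waiting at 'now'
        s = alpha + beta * (j - i)           # deterministic batch time
        now += s
        busy += s
        total_latency += sum(now - a for a in arrivals[i:j])
        i = j
    return total_latency / len(arrivals), busy / now

lam, alpha, beta = 400.0, 0.01, 0.002        # rho = 0.8
mean_latency, utilization = simulate(lam, alpha, beta)
print(mean_latency, utilization)
```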
On the other hand, for small ρ, the other upper bound is a good approximation to the mean latency. Note that this bound is obtained by replacing the mean batch size with its trivial lower bound 1. Therefore, its accuracy for small ρ implies that the mean batch size is close to one, i.e., the server does not sufficiently leverage its batch-processing capability in such a region.
We next discuss the energy efficiency using the linear model (17) considered in Section 3.2. Recall that the average energy efficiency is defined in (18) and represents the mean number of jobs processed with unit energy. In Fig. 7, simulation results for the energy efficiency and its lower bound (40) are plotted as functions of the normalized load ρ. From this figure, we observe that the energy efficiency can be greatly enhanced by keeping the server adequately loaded. Also, the energy efficiency is well approximated by the lower bound (40) except for small values of ρ. Fig. 7 also shows the energy-latency tradeoff, where the relation between the energy efficiency and the mean latency is plotted with λ as a parameter. In this figure, we also plot approximation curves obtained by combining (40) and (43). We see that the closed-form bounds (40) and (43) are useful for determining an adequate operating point of the server, taking the energy-latency tradeoff into consideration.
Finally, we discuss the relation between the model considered in this paper and a corresponding batch-service queue with a finite maximum batch size. As mentioned in Section 1, the mean latency in the finite case can be obtained numerically with the results in [16, Section 4.2]. Fig. 8 shows that if the maximum batch size is sufficiently large, the mean latency is well approximated by our closed-form upper bound (43). If the maximum batch size is small, on the other hand, the mean latency deviates from (43) for arrival rates near the stability boundary. However, we observe from this figure that even for small maximum batch sizes, the mean latency is still well approximated by (43) if the system is moderately loaded, i.e., λ is sufficiently small compared to the boundary.
In this paper, we introduced a queueing model representing GPU-based inference servers with dynamic batching. We modeled an inference server as a batch-service queueing model with an infinite maximum batch size and batch-size-dependent processing times. We first showed that the energy efficiency of the server increases with the arrival rate of inference jobs, which suggests that it is energy-efficient to operate the server under a traffic load as large as possible within the latency requirement of inference jobs. We then derived a simple closed-form upper bound for the mean latency in Theorem 2, under the assumption that the batch processing time increases linearly with the batch size. Through numerical and simulation experiments, we showed that the exact value of the mean latency is well approximated by this simple upper bound.
This work was supported in part by JSPS KAKENHI Grant Number 18K18007.
-  Nvidia AI Inference Platform, Giant Leaps in Performance and Efficiency for AI Services, from the Data Center to the Network’s Edge. https://www.nvidia.com/en-us/data-center/resources/inference-technical-overview/ (accessed 06-Dec-2019).
-  Nvidia TensorRT Inference Server. https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/ (accessed 06-Dec-2019).
-  N. T. J. Bailey, On queueing processes with bulk service, J. Roy. Stat. Soc. B 16 (1954) 80–87.
-  G. Brière and M. L. Chaudhry, Computational analysis of single-server bulk-service queues, M/G/1, Adv. Appl. Prob. 21 (1989) 207–225.
-  D. Crankshaw, X. Wang, G. Zhou, M. J. Franklin, J. E. Gonzalez, and I. Stoica, Clipper: A low-latency online prediction serving system, in Proc. of 14th USENIX Symposium on Networked Systems Design and Implementation (2017) 613–627.
-  R. K. Deb and R. F. Serfozo, Optimal control of batch service queues, Adv. Appl. Prob. 5 (1973) 340–361.
-  F. Downton, Waiting time in bulk service queues, J. Roy. Stat. Soc. B 17 (1955) 256–261.
-  F. Downton, On limiting distributions arising in bulk service queues, J. Roy. Stat. Soc. B 18 (1956) 265–274.
-  D. Gibson and E. Seneta, Augmented truncations of infinite stochastic matrices, J. Appl. Prob. 24 (1987) 600–608.
-  N. K. Jaiswal, Time-dependent solution of the bulk-service queueing problem, Oper. Res. 8 (1960) 773–781.
-  Y. Liu, Augmented truncation approximations of discrete-time Markov chains, Oper. Res. Lett. 38 (2010) 218–222.
-  J. Medhi, Waiting time distribution in a Poisson queue with a general bulk service rule, Manag. Sci. 21 (1975) 777–782.
-  A. Müller and D. Stoyan, Comparison Methods for Stochastic Models and Risks, John Wiley & Sons, Chichester, UK, 2002.
-  M. F. Neuts, The busy period of a queue with batch service, Oper. Res. 13 (1965) 815–819.
-  M. F. Neuts, A general class of bulk queues with Poisson input, Ann. Math. Stat. 38 (1967) 759–770.
-  M. F. Neuts, Structured Stochastic Matrices of M/G/1 Type and Their Applications, Marcel Dekker, New York, 1989.
-  C. Olston, N. Fiedel, K. Gorovoy, J. Harmsen, L. Lao, F. Li, V. Rajashekhar, S. Ramesh, and J. Soyke, TensorFlow-Serving: Flexible, high-performance ML serving, in Proc. of Workshop on ML Systems at NIPS 2017, 2017.
-  M. Shaked and J. G. Shanthikumar, Stochastic Orders, Springer, New York, NY, 2007.
-  R. L. Tweedie, Truncation approximations of invariant measures for Markov chains, J. Appl. Prob. 35 (1998) 517–536.
-  R. Xu, F. Han, and Q. Ta, Deep learning at scale on NVIDIA V100 accelerators, in Proc. of 2018 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS18), 2018.