1 Introduction
Deep neural networks (DNNs) have become increasingly popular tools for implementing artificial intelligence (AI) capabilities, such as image classification and speech recognition, in mobile applications. Because executing inference with DNNs is computationally heavy for mobile devices, inference jobs are typically offloaded to a cloud (or fog) server. From the server's perspective, many inference jobs originating from a large number of different mobile devices arrive, and the server must process them within the latency requirements of the applications. To realize high-speed DNN inference, such a server usually exploits the parallel computing capability of a GPU, which greatly accelerates the inference process [1, 20].

GPU-based inference has an interesting characteristic: batching many jobs drastically increases the computing efficiency in terms of both processing speed and energy consumption [1, 5, 20]. Table 1 shows measured computing performance for two GPUs (Tesla V100 and Tesla P4) and two precisions (FP16/FP32 mixed precision and INT8), as reported in [1]. A DNN called ResNet-50, the winner of the ImageNet Large Scale Visual Recognition Challenge 2015 (ILSVRC2015), is employed for these measurements. Note that the energy efficiency is expressed as the number of inference jobs that can be processed per unit time with unit power (measured in watts); equivalently, it is the average number of inference jobs processed with unit energy (measured in joules). In each case of Table 1, we see that both the throughput and the energy efficiency increase substantially when multiple jobs are batched.
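To make this energy-efficiency metric concrete, the following sketch computes inferences per joule as throughput divided by board power. The throughput and power figures below are hypothetical placeholders, not the actual Table 1 values.

```python
# Hypothetical illustration of the batching effect: throughput
# (inferences/s) and average board power (W) at two batch sizes.
# The numbers are made up for illustration; see [1] for real data.
measurements = {
    1:  {"throughput": 1000.0, "power": 200.0},  # batch size 1
    32: {"throughput": 8000.0, "power": 250.0},  # batch size 32
}

for batch_size, m in measurements.items():
    # Energy efficiency: inference jobs processed per joule,
    # i.e., (jobs per second) / (joules per second).
    efficiency = m["throughput"] / m["power"]
    print(f"batch size {batch_size:2d}: {efficiency:.1f} inferences/J")
```

With these placeholder numbers, the batched configuration processes several times more inferences per joule, mirroring the qualitative trend reported in Table 1.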


Because of this characteristic of GPU-based inference, it is efficient for a server to combine multiple inference jobs arriving from different devices into a batch and process them simultaneously. Such a dynamic batching procedure is indeed supported by DNN serving libraries such as TensorFlow Serving [17] and TensorRT Inference Server [2].

The main purpose of this paper is to introduce a queueing-theoretic perspective on GPU-based DNN inference systems with dynamic batching. We formulate an inference server as a batch-service queueing model with batch-size-dependent processing times, and we present a novel analytical method for this model. Although the analysis of batch-service queues is a well-studied subject in queueing theory [3, 4, 6, 7, 8, 10, 12, 14, 15, 16], no closed-form characterization of performance metrics has been reported in the literature, except for the special case in which the processing time distribution is independent of the batch size. Therefore, existing closed-form formulas are not applicable to the performance evaluation of GPU-based inference servers with a dynamic batching scheme, where service times increase with the batch size.
While most previous works focus on models with batch-size-independent processing times, Neuts [14], [15], [16, Section 4.2] considers the case with batch-size-dependent processing times, presenting computational procedures to numerically obtain several performance metrics. In particular, the matrix-analytic method developed in [16] provides a unified way to perform an algorithmic analysis of a wide range of batch-service queueing models. However, the main weakness of numerical approaches like the matrix-analytic method is that the derived mathematical formulas provide little information about the impact of the model parameters on the system performance.
In this paper, we first show that the energy efficiency of the system monotonically increases with the arrival rate of inference jobs (i.e., the system load), by means of stochastic comparison techniques [13, 18]. This result suggests that it is energy-efficient to operate the server at a utilization level as high as possible within a latency requirement. We then derive a closed-form upper bound on the mean latency, which provides a simple characterization of the latency performance of GPU-based inference servers.
The key idea of our approach is to model the system as a batch-service queueing model with an infinite maximum batch size and batch processing times that increase linearly with the batch size. Note that the finiteness assumption on the maximum batch size is essential in approaches based on the matrix-analytic method, because it is a necessary condition for the system to be formulated as a Markov chain with a block upper or lower Hessenberg transition probability matrix. As we will see, however, the assumptions of an infinite maximum batch size and linear batch processing times enable us to derive a simple closed-form upper bound on the mean latency. Furthermore, numerical and simulation experiments show that the mean latency is quite well approximated by this closed-form upper bound, even in the case of a finite maximum batch size.
The rest of this paper is organized as follows. In Section 2, we introduce the mathematical model considered in this paper. In Section 3, we first show the monotonicity of the energy efficiency with respect to the system load under a relatively general setting, and then derive a closed-form upper bound for the mean latency assuming linear batch processing times. In Section 4, we conduct numerical and simulation experiments to discuss the tightness of the derived upper bound. Finally, we conclude this paper in Section 5.
2 Model
We model an inference server with dynamic batching as a single-server batch-service queueing model with an infinite buffer. We assume that arrivals of inference jobs follow a Poisson process with rate . The server can process multiple inference jobs simultaneously in a batch, and the processing times of batches are assumed to be independent, following a probability distribution that depends on the batch size. Let (,) denote the cumulative distribution function (CDF) of the processing time for a batch of size . Let () denote a generic random variable following the CDF . We define () as the mean throughput (the number of inference jobs processed per unit time) for a batch size : (1)
Throughout this paper, we make the following assumption:
Assumption 1.
(i) (), i.e., the mean throughput is nondecreasing with the batch size .
(ii) .
Assumption 1 (i) reflects the characteristic of GPU-based inference that the computing efficiency increases with the batch size. Note that under Assumption 1 (i), the limit is always well-defined. Clearly, Assumption 1 (ii) is a necessary (and, in the batching scheme described below, also sufficient) condition for the system to be stable.
In order to construct a tractable model, we assume the following simple dynamic batching scheme: whenever the server is idle and there is at least one waiting job in the buffer, all of the waiting jobs are combined into a single batch, and its processing is immediately initiated. To be more specific, suppose that the server is idle and the buffer is empty at time . Let () denote the size of the th batch processed after time . Also, let () denote the number of waiting inference jobs just before the departure of the th batch. For convenience, we define . Under the batching scheme described above, all waiting jobs are put into the next batch, so that if . If , on the other hand, the next batch contains only the single inference job which has arrived to the empty system. Therefore, it follows that
(2) 
where denotes an indicator function.
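The batching dynamics described above can be sketched in a short simulation of the batch-size sequence: Poisson arrivals are assumed, and for concreteness the batch processing times are taken to be deterministic and linear in the batch size (anticipating Assumption 4 below); the numeric constants are purely illustrative.

```python
import random

def simulate_batch_sizes(lam, tau, n_batches, seed=0):
    """Simulate the chain of processed batch sizes under the dynamic
    batching scheme: at each service completion, all waiting jobs form
    the next batch; if none are waiting, the next batch holds the single
    job that arrives to the empty system."""
    rng = random.Random(seed)

    def poisson(mean):
        # Sample a Poisson variate by counting unit-rate exponential
        # interarrivals falling within an interval of length `mean`.
        count, t = 0, rng.expovariate(1.0)
        while t <= mean:
            count += 1
            t += rng.expovariate(1.0)
        return count

    sizes = []
    b = 1  # first batch: one job arriving at the empty system
    for _ in range(n_batches):
        sizes.append(b)
        # Jobs arriving during this batch's (deterministic) processing
        # time all join the next batch.
        q = poisson(lam * tau(b))
        b = max(q, 1)  # an empty queue yields a next batch of size 1
    return sizes

# Hypothetical linear processing times tau(n) = 0.01 + 0.002 * n seconds
# (illustrative constants, not the values estimated from Table 1).
sizes = simulate_batch_sizes(lam=100.0, tau=lambda n: 0.01 + 0.002 * n,
                             n_batches=50_000)
print(sum(sizes) / len(sizes))  # simulated mean processed batch size
```

The `max(q, 1)` step is exactly the case distinction made above: a nonempty queue is served in full, while an empty queue waits for a single arrival.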
In the next section, we will derive analytical results for the batch-service queueing system described so far.
3 Queueing Analysis
3.1 Preliminaries
Let () denote the number of inference jobs arriving during the processing time of the th batch. By definition, the probability function of () is given by
(3) 
where (, ) is defined as
(4) 
It is readily verified that the number of waiting jobs () at the th processing completion satisfies
so that we obtain from (2),
(5) 
It then follows from (3) and (5) that the sequence of processed batch sizes forms a discrete-time Markov chain on the state space , whose transition probability matrix is given by
(6) 
Note that this Markov chain is of GI/G/1-type, i.e., there is no skip-free structure in the transition matrix . In general, it is difficult to characterize the exact stationary distribution of a GI/G/1-type Markov chain, and one has to resort to numerical approximation methods such as truncation techniques [9, 11, 19]. As we will see in Section 3.3, however, we can obtain a closed-form upper bound on the mean latency by assuming linearly increasing batch processing times.
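As a rough illustration of the truncation approach, the following sketch builds the transition matrix of the batch-size chain on a truncated state space and approximates its stationary distribution by power iteration. Deterministic linear processing times and illustrative constants are assumed here; they are not part of the general model.

```python
import math

def stationary_batch_dist(lam, tau, N=60, iters=300):
    """Approximate the stationary distribution of the batch-size chain by
    truncating the state space to {1, ..., N} (mass that would leave the
    truncation is augmented onto state N) and applying power iteration.
    `tau` is assumed to give deterministic batch processing times."""
    def poisson_pmf(mean, m):
        return math.exp(-mean) * mean**m / math.factorial(m)

    # P[i][j] = P(next batch size = j+1 | current batch size = i+1):
    # the next batch size is max(A, 1), where A is Poisson with mean
    # lam * tau(current batch size).
    P = []
    for i in range(1, N + 1):
        mean = lam * tau(i)
        row = [0.0] * N
        row[0] = poisson_pmf(mean, 0) + poisson_pmf(mean, 1)  # max(A,1)=1
        for j in range(2, N):
            row[j - 1] = poisson_pmf(mean, j)
        row[N - 1] = max(0.0, 1.0 - sum(row[:N - 1]))  # augmentation
        P.append(row)

    pi = [1.0 / N] * N
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(N)) for j in range(N)]
    return pi

pi = stationary_batch_dist(lam=100.0, tau=lambda n: 0.01 + 0.002 * n)
mean_batch = sum((j + 1) * p for j, p in enumerate(pi))
print(round(mean_batch, 2))  # approximate stationary mean batch size
```

As the text notes, such numerical schemes yield numbers but little structural insight, which motivates the closed-form bound derived in Section 3.3.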
In the rest of this subsection, we derive some basic relations among key performance metrics in steady state. Let denote a generic random variable following the stationary distribution of . Let denote a generic random variable for the stationary number of inference jobs in the system at an arbitrary time instant. Further, let () denote a generic random variable following the probability function (). It is readily verified from (4) that the first two moments of are given by
(7)
We define and () as the probability generating functions (PGFs) of and :
(8)
Lemma 1.
() satisfies
(9) 
Proof.
Let () denote the number of inference jobs in the system at time . By definition, each sample path of is given by a step function with unit upward jumps (arrivals of customers) and downward jumps of magnitude (completions of batch processing). For convenience, we assume that each sample path of is constructed so that it is right-continuous with left limits. Because the system is stable, there is a one-to-one correspondence between an upward jump and the contribution of an inference job to a downward jump (see Fig. 1). To be more specific, let and denote the arrival and departure times of the th arriving job (). We define and as
(10) 
i.e., denotes the number of inference jobs in the system seen by the th inference job on arrival, and denotes the number of inference jobs arrived in the sojourn time of the th inference job which are in the system just before its departure. It is then verified that for each sample path, there is a bijection such that .
Let (resp. ) denote a generic random variable for (resp. ) in steady state. Owing to PASTA and the observation above, we obtain
(11) 
where denotes equality in distribution. We then consider the distribution of to prove (9).
Let denote the latency (sojourn time) of a randomly chosen inference job. We define (resp. ) as a generic random variable for the processing time of a randomly chosen batch (resp. a randomly chosen inference job). Note that the distributions of and are given by (cf. (12))
(13)  
(14) 
Lemma 2.
The mean latency is given by
(15) 
Proof.
Remark 1.
We can verify that the first term (resp. the second term) on the right-hand side of (15) represents the mean waiting (resp. processing) time of a randomly chosen inference job.
3.2 Monotonicity of the Energy Efficiency
In this subsection, we show that the larger the system load is, the more energy-efficient the system is, under some additional assumptions. Let () denote the amount of energy consumed in processing a batch of size . can be calculated from Table 1 as the product of the average board power and the batch processing time (i.e., the batch size divided by the throughput). For each case in Table 1, is well fitted by a linear function (with the least squares method, we have the coefficient of determination for Tesla V100 and for Tesla P4). See Fig. 2 for plotted as a function of .
We thus make the following assumption on :
Assumption 2.
() is given by
(17) 
for some and .
In steady state, the server processes batches per unit time with energy consumption on average. We then define the average energy efficiency of the system as
(18) 
i.e., the mean number of inference jobs processed with unit energy. Under Assumption 2, (18) is rewritten as
(19) 
In what follows, we show that the energy efficiency is nondecreasing with respect to the arrival rate . To establish this monotonicity result for , we need an additional assumption on the batch processing time distribution ():
Definition 1 ([18, Eq. (1.A.1)]).
Let and denote nonnegative random variables. is said to be smaller than in the usual stochastic order if and only if
Remark 2 ([18, Eq. (1.A.7)]).
holds if and only if
for any nondecreasing function () provided the expectations exist. In particular, .
Assumption 3.
holds for any .
Although Assumption 3 is a strong assumption on the batch processing time distribution, for several probability distributions it reduces to a condition on the mean values alone, as shown in the following example:
Example 1.
In the following cases, we have (cf. Remark 2):

(i) () follows a gamma distribution with a fixed coefficient of variation , i.e., where and denote the gamma function and the lower incomplete gamma function, respectively.
(ii) () takes a constant value, i.e.,
Let and () denote the stationary batch size and the energy efficiency represented as functions of the arrival rate .
Theorem 1.
Under Assumption 3, the stationary batch size () increases with the arrival rate in the usual stochastic order, i.e.,
(20) 
Proof.
Let () denote the transition probability matrix of given , and let () denote the th element of . To prove (20), it is sufficient to show that, in the sense of the usual stochastic order, the probability distribution increases with and is smaller than [13, pp. 186–187], i.e.,
(21) 
and
(22) 
Using (6), we rewrite (21) and (22) as
(23) 
and
(24) 
where () is defined as (cf. (4))
Let () denote a generic random variable satisfying ().
Corollary 1.
Corollary 1 suggests that it is energyefficient to operate the inference server under a utilization level as high as possible within a latency requirement of inference jobs. In the following subsection, we derive a closedform upper bound of the mean latency, assuming linearly increasing batch processing times.
3.3 Deterministic Linear Batch Processing Times
In Lemma 2, we showed that the mean latency is expressed in terms of the stationary distribution of the Markov chain of batch sizes. As mentioned above, an exact analysis of the stationary distribution of this GI/G/1-type Markov chain is difficult, and only numerical approximations are known in the literature.
In this subsection, we show that a closed-form upper bound on the mean latency can be obtained by assuming a specific structure of the batch processing times. Specifically, we make the following assumption throughout this subsection:
Assumption 4.
The batch processing time () takes a constant value equal to , which is given by
(25) 
for some and .
The deterministic distribution is a natural choice for modeling batch inference times, because most DNNs take a vector of fixed size (the input dimension times the batch size) as their input, and the output is computed by applying a predefined sequence of operations to it, such as matrix multiplications and nonlinear activation functions; the computational steps are thus the same regardless of the input vector. Furthermore, the linearity assumption (25) is consistent with the measurement results in Table 1: with the least squares method, we have the coefficient of determination (resp. ) with and (resp. , ) for batch processing times calculated from the data in Table 1 (a) (resp. Table 1 (b)) by dividing batch sizes by throughputs (cf. (1)). Note that under Assumption 4, the throughput () is written as
(26)
As shown in Fig. 3, the throughput characteristics in Table 1 are well fitted by this simple rational function.
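The fitting procedure outlined above can be reproduced in a few lines: per-batch processing times are computed as batch size divided by measured throughput and fitted to a linear model by ordinary least squares. The (batch size, throughput) pairs below are hypothetical stand-ins for the Table 1 measurements.

```python
# Hypothetical (batch size, throughput in inferences/s) measurements.
data = [(1, 95.0), (2, 180.0), (4, 330.0), (8, 560.0), (16, 860.0)]

ns = [n for n, _ in data]
ts = [n / thr for n, thr in data]  # per-batch processing times t_n = n / throughput

# Closed-form simple linear regression of t_n = a + b * n.
m = len(data)
nbar = sum(ns) / m
tbar = sum(ts) / m
b = sum((n - nbar) * (t - tbar) for n, t in zip(ns, ts)) / \
    sum((n - nbar) ** 2 for n in ns)
a = tbar - b * nbar

# The fitted throughput is then n / (a + b * n); its limit 1 / b is the
# server's maximum processing rate, giving the stability condition
# (arrival rate) < 1 / b.
print(f"a = {a:.5f}, b = {b:.5f}, capacity 1/b = {1 / b:.1f} jobs/s")
```

Plotting the fitted rational function against the raw throughput points is the quickest way to check the linearity assumption for a given GPU and precision.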
We can readily verify from (26) that Assumption 4 ensures Assumption 1 (i). Furthermore, (26) implies
so that the stability condition stated in Assumption 1 (ii) is rewritten as
(27) 
In view of this relation, the normalized load represents the ratio of the arrival rate to the server’s processing capacity, which corresponds to the traffic intensity in ordinary singleserver queueing models.
Assumption 4 simplifies the analysis mainly because under this assumption, , , and (see (13) and (14)) are given in terms of the first two moments and of the stationary batch size distribution:
(28)  
(29)  
(30) 
Lemma 3.
Proof.
Lemma 4.
Under Assumption 4, the mean latency is given in terms of the probability that the server is idle by
(35) 
Proof.
It follows from (15), (30), (31), and (32) that
(36) 
Note here that (31) and (32) imply
(37) 
In addition, owing to Little’s law, the server utilization (i.e., the mean number of batches being served in steady state) is equal to the product of the number of batches processed per unit time and the mean batch processing time:
(38) 
where we used (28) for the second equality. Therefore, we obtain (35) from (36), (37), and (38). ∎
Remark 3.
By definition, we have (see (8)).
Even under Assumption 4, it seems difficult to determine the exact value of . However, we have the following simple lower bound for this quantity:
Lemma 5.
Under Assumption 4, is bounded below by
(39) 
Remark 4.
If , the quantity is equal to the probability that the server is idle in a stationary single-service M/D/1 queue with arrival rate and processing time , where arriving inference jobs are processed one by one.
We are in a position to obtain the main result of this paper:
Theorem 2.
Under Assumption 4, the mean latency is bounded above by
(41)  
and  
(42) 
In addition, we have if and only if .
Proof.
Theorem 2 provides a surprisingly simple upper bound for the mean latency . For convenience, let
(43) 
Even though this upper bound is obtained by replacing the idle probability with its almost trivial lower bound in (39), it provides quite a good approximation to the exact value of the mean latency , as we will see in the next section.
4 Numerical Evaluation
In this section, we present some numerical and simulation experiments. Throughout this section, we concentrate on the case with deterministic linear batch processing times considered in Section 3.3. In particular, we employ the model parameters and estimated in Section 3.3 from Table 1 (see the paragraph just after Assumption 4).
Fig. 5 shows simulation results for the mean latency and its upper bounds and given in (41) and (42) (recall that the normalized load is defined in (27)). We observe that the combination (43) of these upper bounds approximates the exact curve of quite well. In particular, except for small values of , takes values fairly close to .
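A minimal discrete-event simulation of the mean latency under Assumption 4, of the kind behind such plots, can be sketched as follows. The arrival rate and processing-time constants are illustrative, not the parameters estimated from Table 1.

```python
import random

def mean_latency_sim(lam, a, b, n_jobs=100_000, seed=1):
    """Simulate the dynamic-batching queue with deterministic linear
    processing times a + b * (batch size), returning the simulated mean
    latency (sojourn time) of an inference job."""
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    for _ in range(n_jobs):
        t += rng.expovariate(lam)  # Poisson arrival process
        arrivals.append(t)

    total_latency = 0.0
    i = 0              # index of the next job not yet batched
    server_free = 0.0  # time at which the server next becomes idle
    while i < n_jobs:
        # A batch starts when the server is free and a job is present.
        start = max(server_free, arrivals[i])
        # All jobs that have arrived by `start` join the batch.
        j = i
        while j < n_jobs and arrivals[j] <= start:
            j += 1
        depart = start + a + b * (j - i)  # deterministic linear service
        for k in range(i, j):
            total_latency += depart - arrivals[k]
        server_free = depart
        i = j
    return total_latency / n_jobs

# Hypothetical parameters; normalized load rho = lam * b = 0.5 here.
print(round(mean_latency_sim(lam=250.0, a=0.01, b=0.002), 4))
```

Sweeping `lam` from near 0 up to just below `1 / b` reproduces a latency-versus-load curve that can be compared against the closed-form bound (43).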
Recall that the upper bound is obtained by replacing the idle probability with its trivial lower bound . In Fig. 3(b), the server utilization is plotted as a function of the normalized load . As a reference, we also plot its upper bound (cf. (39)). From this figure, we see that the server utilization takes a value close to even for moderate values of , which is quite different from ordinary single-server queues, where the server utilization equals the traffic intensity. This phenomenon stems from the fact that the server's processing speed increases substantially with the batch size, so that the system is overloaded for small batch sizes even under a moderate load level . Because of this behavior of the server utilization, the upper bound is a good approximation to the mean latency for a wide range of .
On the other hand, for small , the upper bound is a good approximation to . Note that is obtained by replacing the mean batch size with its trivial lower bound . Therefore, implies that the mean batch size , i.e., the server does not sufficiently leverage its batch-processing capability in that region.
We next discuss the energy efficiency, using the linear model (17) considered in Section 3.2. Recall that the average energy efficiency is defined in (18) and represents the mean number of jobs processed with unit energy. In Fig. 7, simulation results for and its lower bound (40) are plotted as functions of the normalized load . From this figure, we observe that the energy efficiency can be substantially enhanced by keeping the server adequately loaded. Also, the energy efficiency is well approximated by the lower bound (40) except for small values of . Fig. 7 shows the energy-latency tradeoff, where the relation between and the mean latency is plotted with parameter . In this figure, we also plot approximation curves obtained by combining (40) and (43). We see that the closed-form bounds (40) and (43) are useful for determining an adequate operating point of the server, taking the energy-latency tradeoff into consideration.
Finally, we discuss the relation between the model considered in this paper and a corresponding batch-service queue with a finite maximum batch size . As mentioned in Section 1, the mean latency in the case of finite can be numerically obtained with the results in [16, Section 4.2]. Fig. 8 shows that if is sufficiently large, the mean latency is well approximated by our closed-form upper bound given by (43). If is small, on the other hand, the mean latency deviates from for arrival rates near the stability boundary . However, we observe from this figure that even for small values of , the mean latency is still well approximated by (43) if the system is moderately loaded, i.e., if is sufficiently small compared to .
5 Conclusions
In this paper, we introduced a queueing model representing GPU-based inference servers with dynamic batching. We modeled an inference server as a batch-service queueing model with an infinite maximum batch size and batch-size-dependent processing times. We first showed that the energy efficiency of the server increases with the arrival rate of inference jobs, which suggests that it is energy-efficient to operate the server under a traffic load as high as possible within the latency requirement of inference jobs. We then derived a simple closed-form upper bound for the mean latency in Theorem 2, under the assumption that the batch processing time increases linearly with the batch size. Through numerical and simulation experiments, we showed that the exact value of the mean latency is well approximated by this simple upper bound.
Acknowledgements
This work was supported in part by JSPS KAKENHI Grant Number 18K18007.
References
[1] Nvidia AI Inference Platform, Giant Leaps in Performance and Efficiency for AI Services, from the Data Center to the Network's Edge. https://www.nvidia.com/enus/datacenter/resources/inferencetechnicaloverview/ (accessed 06-Dec-2019).

[2] Nvidia TensorRT Inference Server. https://docs.nvidia.com/deeplearning/sdk/tensorrtinferenceserverguide/docs/ (accessed 06-Dec-2019).
[3] N. T. J. Bailey, On queueing processes with bulk service, J. Roy. Stat. Soc. B 16 (1954) 80–87.
[4] G. Brière and M. L. Chaudhry, Computational analysis of single-server bulk-service queues, M/G/1, Adv. Appl. Prob. 21 (1989) 207–225.
[5] D. Crankshaw, X. Wang, G. Zhou, M. J. Franklin, J. E. Gonzalez, and I. Stoica, Clipper: A low-latency online prediction serving system, in Proc. of the 14th USENIX Symposium on Networked Systems Design and Implementation (2017) 613–627.
 [6] R. K. Deb and R. F. Serfozo, Optimal control of batch service queues, Adv. Appl. Prob. 5 (1973) 340–361.
 [7] F. Downton, Waiting time in bulk service queues, J. Roy. Stat. Soc. B 17 (1955) 256–261.
 [8] F. Downton, On limiting distributions arising in bulk service queues, J. Roy. Stat. Soc. B 18 (1956) 265–274.
 [9] D. Gibson and E. Seneta, Augmented truncations of infinite stochastic matrices, J. Appl. Prob. 24 (1987) 600–608.
[10] N. K. Jaiswal, Time-dependent solution of the bulk-service queueing problem, Oper. Res. 8 (1960) 773–781.
[11] Y. Liu, Augmented truncation approximations of discrete-time Markov chains, Oper. Res. Lett. 38 (2010) 218–222.
 [12] J. Medhi, Waiting time distribution in a Poisson queue with a general bulk service rule, Manag. Sci. 21 (1975) 777–782.
 [13] A. Müller and D. Stoyan, Comparison Methods for Stochastic Models and Risks, John Wiley & Sons, Chichester, UK, 2002.
 [14] M. F. Neuts, The busy period of a queue with batch service, Oper. Res. 13 (1965) 815–819.
 [15] M. F. Neuts, A general class of bulk queues with Poisson input, Ann. Math. Stat. 38 (1967) 759–770.
 [16] M. F. Neuts, Structured Stochastic Matrices of M/G/1 Type and Their Applications, Marcel Dekker, New York, 1989.
[17] C. Olston, N. Fiedel, K. Gorovoy, J. Harmsen, L. Lao, F. Li, V. Rajashekhar, S. Ramesh, and J. Soyke, TensorFlow-Serving: Flexible, high-performance ML serving, in Proc. of the Workshop on ML Systems at NIPS 2017, 2017.
 [18] M. Shaked and J. G. Shanthikumar, Stochastic Orders, Springer, New York, NY, 2007.
[19] R. L. Tweedie, Truncation approximations of invariant measures for Markov chains, J. Appl. Prob. 35 (1998) 517–536.

[20] R. Xu, F. Han, and Q. Ta, Deep learning at scale on NVIDIA V100 accelerators, in Proc. of the 2018 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS18), 2018.