# Queueing Analysis of GPU-Based Inference Servers with Dynamic Batching: A Closed-Form Characterization

GPU-accelerated computing is a key technology for realizing high-speed inference servers based on deep neural networks (DNNs). An important characteristic of GPU-based inference is that the computational efficiency, in terms of processing speed and energy consumption, increases drastically when multiple jobs are processed together in a batch. In this paper, we formulate GPU-based inference servers as a batch service queueing model with batch-size dependent processing times. We first show that the energy efficiency of the server monotonically increases with the arrival rate of inference jobs, which suggests that it is energy-efficient to operate the inference server at a utilization level as high as possible within the latency requirement of inference jobs. We then derive a closed-form upper bound for the mean latency, which provides a simple characterization of the latency performance. Through simulation and numerical experiments, we show that the exact value of the mean latency is well approximated by this upper bound.


## 1 Introduction

Deep neural networks (DNNs) have become increasingly popular tools for implementing artificial intelligence (AI) capabilities, such as image classification and speech recognition, in mobile applications. Because executing inference with DNNs is computationally heavy for mobile devices, inference jobs are typically offloaded to a cloud (or fog) server. From the server's perspective, many inference jobs originating from a large number of different mobile devices arrive, and the server should process them within the latency requirements of the applications. To realize high-speed DNN inference, such a server usually utilizes the parallel computing capability of a GPU, which largely accelerates the inference process [1, 20].

GPU-based inference has an interesting characteristic: batching many jobs drastically increases the computing efficiency in terms of processing speed and energy consumption [1, 5, 20]. Table 1 shows measurement results of the computing performance for two different GPUs (Tesla V100 and Tesla P4) and precisions (FP16/FP32 mixed precision and INT8), which are reported in [1]. A DNN called ResNet-50, a winner of the ImageNet Large Scale Visual Recognition Challenge 2015 (ILSVRC2015), is employed for these measurements. Note that the energy efficiency is represented as the number of inference jobs per unit time that can be processed with unit power (measured in Watts); we can also interpret this quantity as the average number of inference jobs processed with unit energy (measured in Joules). In each case of Table 1, we see that the throughput and the energy efficiency largely increase by batching multiple jobs.

Because of this characteristic of GPU-based inference, it is efficient for a server to combine multiple inference jobs arriving from different devices into a batch and process them simultaneously. Such a dynamic batching procedure is indeed supported by DNN serving libraries such as TensorFlow-Serving [17] and TensorRT Inference Server [2].

The main purpose of this paper is to introduce a queueing-theoretic perspective on GPU-based DNN inference systems with dynamic batching. We formulate an inference server as a batch-service queueing model with batch-size dependent processing times, and we present a novel analytical method for this model. Although the analysis of batch-service queues is a well-studied subject in queueing theory [3, 4, 6, 7, 8, 10, 12, 14, 15, 16], no closed-form characterization of performance metrics is reported in the literature, except for the special case in which the processing time distribution is independent of the batch size. Therefore, existing closed-form formulas are not applicable to the performance evaluation of GPU-based inference servers with a dynamic batching scheme, where service times increase with the batch size.

While most previous works focus on models with batch-size-independent processing times, Neuts [14], [15], [16, Section 4.2] considers the case of batch-size dependent processing times, presenting computational procedures for numerically obtaining several performance metrics. In particular, the matrix-analytic method developed in [16] provides a unified way to perform an algorithmic analysis of a wide range of batch-service queueing models. However, the main weakness of numerical approaches such as the matrix-analytic method is that the derived mathematical formulas provide little information about the impact of the model parameters on the system performance.

In this paper, we first show that the energy efficiency of the system monotonically increases with the arrival rate of inference jobs (i.e., the system load), by means of stochastic comparison techniques [13, 18]. This result suggests that it is energy-efficient to operate the server at a utilization level as high as possible within the latency requirement. We then derive a closed-form upper bound on the mean latency, which provides a simple characterization of the latency performance of GPU-based inference servers.

The key idea of our approach is to model the system as a batch-service queueing model with an infinite maximum batch size and batch processing times that increase linearly with the batch size. Note that the finiteness assumption on the maximum batch size is essential in approaches based on the matrix-analytic method, because it is a necessary condition for the system to be formulated as a Markov chain with a block upper or lower Hessenberg transition probability matrix. As we will see, however, the assumptions of an infinite maximum batch size and linear batch processing times enable us to derive a simple closed-form upper bound on the mean latency. Furthermore, it is shown through numerical and simulation experiments that the mean latency is quite well approximated by this closed-form upper bound, even in the case of a finite maximum batch size.

The rest of this paper is organized as follows. In Section 2, we introduce the mathematical model considered in this paper. In Section 3, we first show the monotonicity of the energy efficiency with respect to the system load under a relatively general setting, and then derive a closed-form upper bound for the mean latency assuming linear batch processing times. In Section 4, we conduct numerical and simulation experiments to discuss the tightness of the derived upper bound. Finally, we conclude the paper in Section 5.

## 2 Model

We model an inference server with dynamic batching as a single-server batch-service queueing model with an infinite buffer. We assume that arrivals of inference jobs follow a Poisson process with rate $\lambda$. The server can process multiple inference jobs simultaneously in a batch, and processing times of batches are assumed to be independent, following a probability distribution that depends on the batch size. Let $H^{[b]}(x)$ ($b = 1, 2, \ldots$) denote the cumulative distribution function (CDF) of the processing time for a batch of size $b$. Let $H^{[b]}$ denote a generic random variable following the CDF $H^{[b]}(x)$. We define $\mu^{[b]}$ as the mean throughput (the number of inference jobs processed per unit time) for a batch size $b$:

$$\mu^{[b]} = \frac{b}{\mathrm{E}[H^{[b]}]}. \tag{1}$$

Throughout this paper, we make the following assumption:

###### Assumption 1.

(i) $\mu^{[b+1]} \ge \mu^{[b]}$ ($b = 1, 2, \ldots$), i.e., the mean throughput is non-decreasing with the batch size $b$.

(ii) $\lambda < \lim_{b \to \infty} \mu^{[b]}$.

Assumption 1 (i) reflects the characteristic of GPU-based inference that the computing efficiency increases with the batch size. Note that under Assumption 1 (i), the limit $\lim_{b \to \infty} \mu^{[b]}$ is always well-defined. Clearly, Assumption 1 (ii) is a necessary (and, in the batching scheme described below, sufficient) condition for the system to be stable.

In order to construct a tractable model, we assume the following simple dynamic batching scheme: whenever the server is idle and there is at least one waiting job in the buffer, all of the waiting jobs are incorporated into a single batch, and its processing is immediately initiated. To be more specific, suppose that the server is idle and the buffer is empty at time $0$. Let $B_n$ ($n = 1, 2, \ldots$) denote the size of the $n$th batch processed after time $0$. Also, let $L_{\mathrm{D},n}$ ($n = 1, 2, \ldots$) denote the number of waiting inference jobs just before the departure of the $n$th batch. For convenience, we define $L_{\mathrm{D},0} = 0$. Under the batching scheme described above, all waiting jobs are put into the next batch, so that $B_{n+1} = L_{\mathrm{D},n}$ if $L_{\mathrm{D},n} \ge 1$. If $L_{\mathrm{D},n} = 0$, on the other hand, the $(n+1)$st batch contains only the one inference job that has arrived at the empty system. Therefore, it follows that

$$B_{n+1} = L_{\mathrm{D},n} + \mathbb{1}\{L_{\mathrm{D},n} = 0\}, \quad n = 0, 1, \ldots, \tag{2}$$

where $\mathbb{1}\{\cdot\}$ denotes an indicator function.
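The batching recursion (2) is easy to simulate. The sketch below assumes Poisson arrivals and, for concreteness, the deterministic linear processing times of Assumption 4 introduced later; the parameter values are illustrative and are not those estimated from Table 1.

```python
import math
import random

def poisson_sample(rng, mean):
    """Knuth's method for a Poisson variate (adequate for moderate means)."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate_batch_sizes(lam, alpha, tau0, n_batches, seed=1):
    """Simulate the recursion B_{n+1} = L_{D,n} + 1{L_{D,n} = 0}.

    Each batch collects every job that arrived (Poisson with rate lam)
    during the previous batch's deterministic processing time
    tau^[b] = alpha*b + tau0; if none arrived, the server waits for one
    job and serves it alone.
    """
    rng = random.Random(seed)
    sizes = []
    b = 1  # the first batch after an empty system holds a single job
    for _ in range(n_batches):
        sizes.append(b)
        # number of arrivals during this batch's processing time
        arrivals = poisson_sample(rng, lam * (alpha * b + tau0))
        b = arrivals if arrivals >= 1 else 1
    return sizes
```

Under a load close to the stability boundary, the simulated batch sizes concentrate well above one, illustrating how the scheme automatically forms larger batches as the load grows.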

In the next section, we will derive analytical results for the batch-service queueing system described so far.

## 3 Queueing Analysis

### 3.1 Preliminaries

Let $A_n$ ($n = 1, 2, \ldots$) denote the number of inference jobs arriving during the processing time of the $n$th batch. By definition, the conditional probability function of $A_n$ given $B_n$ is

$$\Pr(A_n = k \mid B_n = b) = a^{[b]}_k, \quad k = 0, 1, \ldots, \tag{3}$$

where $a^{[b]}_k$ ($b = 1, 2, \ldots$, $k = 0, 1, \ldots$) is defined as

$$a^{[b]}_k = \int_0^{\infty} e^{-\lambda x} \frac{(\lambda x)^k}{k!}\, \mathrm{d}H^{[b]}(x). \tag{4}$$

It is readily verified that the number of waiting jobs $L_{\mathrm{D},n}$ at the $n$th processing completion satisfies

$$L_{\mathrm{D},n} = A_n,$$

so that we obtain from (2),

$$B_{n+1} = A_n + \mathbb{1}\{A_n = 0\}. \tag{5}$$

It then follows from (3) and (5) that the sequence $(B_n)$ of processed batch sizes forms a discrete-time Markov chain on the state space $\{1, 2, \ldots\}$, whose transition probability matrix $\boldsymbol{P}$ is given by

$$
\boldsymbol{P} =
\begin{pmatrix}
a^{[1]}_0 + a^{[1]}_1 & a^{[1]}_2 & a^{[1]}_3 & \cdots \\
a^{[2]}_0 + a^{[2]}_1 & a^{[2]}_2 & a^{[2]}_3 & \cdots \\
a^{[3]}_0 + a^{[3]}_1 & a^{[3]}_2 & a^{[3]}_3 & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{pmatrix}. \tag{6}
$$
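For numerical illustration, the matrix (6) can be truncated and its stationary distribution computed directly. The sketch below assumes deterministic linear processing times (so that $a^{[b]}_k$ is a Poisson probability); the truncation level `bmax` is a numerical device only, since the paper keeps the state space infinite.

```python
import math

def poisson_pmf(mean, kmax):
    """Poisson pmf values p_0, ..., p_kmax, computed iteratively for stability."""
    p = math.exp(-mean)
    out = [p]
    for k in range(1, kmax + 1):
        p *= mean / k
        out.append(p)
    return out

def transition_matrix(lam, alpha, tau0, bmax):
    """Truncation of the transition matrix (6) of the batch-size chain,
    assuming deterministic processing times tau^[b] = alpha*b + tau0,
    so that a_k^[b] is the Poisson(lam * tau^[b]) pmf."""
    P = []
    for b in range(1, bmax + 1):
        a = poisson_pmf(lam * (alpha * b + tau0), bmax)
        row = [a[0] + a[1]] + a[2:]   # P(b,1) = a_0 + a_1, P(b,k) = a_k (k >= 2)
        row[-1] += 1.0 - sum(row)     # fold the truncated tail into state bmax
        P.append(row)
    return P

def stationary(P, iters=500):
    """Stationary distribution of a finite stochastic matrix by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi
```

This is exactly the kind of truncation-based numerical procedure the text contrasts with the closed-form bounds derived later.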

Note that this Markov chain is of GI/G/1-type, i.e., there is no skip-free structure in the transition matrix $\boldsymbol{P}$. In general, it is difficult to characterize the exact stationary distribution of a GI/G/1-type Markov chain, and one has to resort to numerical approximation methods such as truncation techniques [9, 11, 19]. As we will see in Section 3.3, however, we can obtain a closed-form upper bound of the mean latency by assuming linearly increasing batch processing times.

In the rest of this subsection, we derive some basic relations among key performance metrics in steady state. Let $B$ denote a generic random variable following the stationary distribution of $(B_n)$. Let $L$ denote a generic random variable for the stationary number of inference jobs in the system at an arbitrary time instant. Further, let $A^{[b]}$ ($b = 1, 2, \ldots$) denote a generic random variable following the probability function $(a^{[b]}_k)_{k = 0, 1, \ldots}$. It is readily verified from (4) that the first two moments of $A^{[b]}$ are given by

$$\mathrm{E}[A^{[b]}] = \lambda \mathrm{E}[H^{[b]}], \qquad \mathrm{E}[(A^{[b]})^2] = \lambda \mathrm{E}[H^{[b]}] + \lambda^2 \mathrm{E}[(H^{[b]})^2]. \tag{7}$$

We define $\pi(z)$ and $a^{[b]}(z)$ as the probability generating functions (PGFs) of $L$ and $A^{[b]}$:

$$\pi(z) = \mathrm{E}[z^L] = \sum_{n=0}^{\infty} \Pr(L = n) z^n, \qquad a^{[b]}(z) = \mathrm{E}[z^{A^{[b]}}] = \sum_{n=0}^{\infty} \Pr(A^{[b]} = n) z^n. \tag{8}$$
###### Lemma 1.

The PGF $\pi(z)$ satisfies

$$\pi(z) = \sum_{b=1}^{\infty} \frac{b \Pr(B = b)}{\mathrm{E}[B]} \cdot \frac{1 - z^b}{b(1 - z)} \cdot a^{[b]}(z). \tag{9}$$
###### Proof.

Let $L_t$ ($t \ge 0$) denote the number of inference jobs in the system at time $t$. By definition, each sample path of $(L_t)_{t \ge 0}$ is given by a step function with unit upward jumps (arrivals of jobs) and downward jumps of magnitude $b$ (completions of processing of batches of size $b$). For convenience, we assume that each sample path of $(L_t)_{t \ge 0}$ is constructed so that it is right-continuous with left limits. Because the system is stable, there is a one-to-one correspondence between an upward jump and the contribution of an inference job to a downward jump (see Fig. 1). To be more specific, let $t_n$ and $t'_n$ denote the arrival and departure times of the $n$th arriving job ($n = 1, 2, \ldots$). We define $\hat{L}_{\mathrm{A},n}$ and $\hat{L}_{\mathrm{D},n}$ as

$$\hat{L}_{\mathrm{A},n} = \lim_{\delta t \to 0+} L_{t_n - \delta t}, \qquad \hat{L}_{\mathrm{D},n} = \bigl|\{k > n;\, t'_k = t'_n\}\bigr| + L_{t'_n}, \tag{10}$$

i.e., $\hat{L}_{\mathrm{A},n}$ denotes the number of inference jobs in the system seen by the $n$th inference job on arrival, and $\hat{L}_{\mathrm{D},n}$ denotes the number of inference jobs that arrived during the sojourn time of the $n$th inference job and are in the system just before its departure. It is then verified that for each sample path, there is a bijection $\sigma$ on $\{1, 2, \ldots\}$ such that $\hat{L}_{\mathrm{A},n} = \hat{L}_{\mathrm{D},\sigma(n)}$.

Let $\hat{L}_{\mathrm{A}}$ (resp. $\hat{L}_{\mathrm{D}}$) denote a generic random variable for $\hat{L}_{\mathrm{A},n}$ (resp. $\hat{L}_{\mathrm{D},n}$) in steady state. Owing to PASTA and the observation above, we obtain

$$L =_{\mathrm{st}} \hat{L}_{\mathrm{A}} =_{\mathrm{st}} \hat{L}_{\mathrm{D}}, \tag{11}$$

where $=_{\mathrm{st}}$ denotes equality in distribution. We then consider the distribution of $\hat{L}_{\mathrm{D}}$ to prove (9).

Let $\hat{B}$ denote a generic random variable for the size of the batch in which a randomly chosen inference job is processed. It is readily verified that $\hat{B}$ follows the length-biased batch size distribution, i.e.,

$$\Pr(\hat{B} = b) = \frac{b \Pr(B = b)}{\mathrm{E}[B]}. \tag{12}$$

We then obtain from (10) and (11),

$$\pi(z) = \mathrm{E}\bigl[z^{\hat{L}_{\mathrm{D}}}\bigr] = \sum_{b=1}^{\infty} \Pr(\hat{B} = b) \left( \sum_{k=0}^{b-1} \frac{1}{b} \cdot z^k \right) a^{[b]}(z),$$

which implies (9). ∎

Let $W$ denote the latency (sojourn time) of a randomly chosen inference job. We define $H$ (resp. $\hat{H}$) as a generic random variable for the processing time of a randomly chosen batch (resp. of the batch containing a randomly chosen inference job). Note that the distributions of $H$ and $\hat{H}$ are given by (cf. (12))

$$\Pr(H \le x) = \sum_{b=1}^{\infty} \Pr(B = b) \Pr(H^{[b]} \le x), \tag{13}$$

$$\Pr(\hat{H} \le x) = \sum_{b=1}^{\infty} \frac{b \Pr(B = b)}{\mathrm{E}[B]} \cdot \Pr(H^{[b]} \le x). \tag{14}$$

###### Lemma 2.

The mean latency $\mathrm{E}[W]$ is given by

$$\mathrm{E}[W] = \frac{\mathrm{E}[B^2] - \mathrm{E}[B]}{2\lambda \mathrm{E}[B]} + \mathrm{E}[\hat{H}]. \tag{15}$$
###### Proof.

Taking the derivative of (9) and letting $z \to 1$, we have

$$\mathrm{E}[L] = \sum_{b=1}^{\infty} \frac{b \Pr(B = b)}{\mathrm{E}[B]} \left( \frac{b - 1}{2} + \lambda \mathrm{E}[H^{[b]}] \right) = \frac{\mathrm{E}[B^2] - \mathrm{E}[B]}{2 \mathrm{E}[B]} + \lambda \mathrm{E}[\hat{H}], \tag{16}$$

where we used (7) in the first equality and (14) in the second equality. (15) thus follows from Little's law $\mathrm{E}[L] = \lambda \mathrm{E}[W]$. ∎

###### Remark 1.

We can verify that the first term (resp. the second term) on the right-hand side of (15) represents the mean waiting time (resp. the mean processing time) of a randomly chosen inference job.

### 3.2 Monotonicity of the Energy Efficiency

In this subsection, we show that, under some additional assumptions, the larger the system load, the more energy-efficient this system is. Let $c^{[b]}$ ($b = 1, 2, \ldots$) denote the amount of energy consumed in processing a batch of size $b$. $c^{[b]}$ is calculated from Table 1 as the product of the average board power and the batch processing time (i.e., the batch size divided by the throughput). For each case in Table 1, $c^{[b]}$ is well fitted by a linear function of $b$ (the least-squares fit yields a coefficient of determination close to one for both the Tesla V100 and the Tesla P4 data). See Fig. 2 for $c^{[b]}$ plotted as a function of $b$.

We thus make the following assumption on $c^{[b]}$:

###### Assumption 2.

$c^{[b]}$ ($b = 1, 2, \ldots$) is given by

$$c^{[b]} = \beta b + c_0, \tag{17}$$

for some constants $\beta > 0$ and $c_0 > 0$.

In steady state, the server processes $\lambda / \mathrm{E}[B]$ batches per unit time, with mean energy consumption $\sum_{b=1}^{\infty} \Pr(B = b)\, c^{[b]}$ per batch. We then define the average energy efficiency $\eta$ of the system as

$$\eta := \frac{\lambda}{\dfrac{\lambda}{\mathrm{E}[B]} \displaystyle\sum_{b=1}^{\infty} \Pr(B = b)\, c^{[b]}}, \tag{18}$$

i.e., the mean number of inference jobs processed with unit energy. Under Assumption 2, (18) is rewritten as

$$\eta = \frac{1}{\beta + c_0 / \mathrm{E}[B]}. \tag{19}$$
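Formula (19) is elementary to evaluate. The sketch below transcribes it directly; the values of $\beta$ and $c_0$ are illustrative, not the fitted values of Fig. 2.

```python
def energy_efficiency(beta, c0, mean_batch_size):
    """Average energy efficiency (19) under the linear energy model (17):
    eta = 1 / (beta + c0 / E[B])."""
    return 1.0 / (beta + c0 / mean_batch_size)
```

Since the term $c_0 / \mathrm{E}[B]$ shrinks as the mean batch size grows, $\eta$ increases from $1/(\beta + c_0)$ (batches of one job) toward the limit $1/\beta$, which is the mechanism behind the monotonicity result proved below.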

In what follows, we show that the energy efficiency $\eta$ is non-decreasing with respect to the arrival rate $\lambda$. To establish this monotonicity result, we need an additional assumption on the batch processing time distributions $H^{[b]}(x)$ ($b = 1, 2, \ldots$), stated in terms of the usual stochastic order:

###### Definition 1 ([18, Eq. (1.A.1)]).

Let $X$ and $Y$ denote non-negative random variables. $X$ is said to be smaller than $Y$ in the usual stochastic order, written $X \le_{\mathrm{st}} Y$, if and only if

$$\Pr(X > x) \le \Pr(Y > x), \quad \text{for all } x \ge 0.$$

###### Remark 2 ([18, Eq. (1.A.7)]).

$X \le_{\mathrm{st}} Y$ holds if and only if

$$\mathrm{E}[\phi(X)] \le \mathrm{E}[\phi(Y)],$$

for any non-decreasing function $\phi$ for which the expectations exist. In particular, $X \le_{\mathrm{st}} Y$ implies $\mathrm{E}[X] \le \mathrm{E}[Y]$.

###### Assumption 3.

$H^{[b]} \le_{\mathrm{st}} H^{[b+1]}$ holds for any $b = 1, 2, \ldots$.

Although Assumption 3 is a strong assumption on the batch processing time distributions, for several families of distributions it reduces to a condition on the mean values alone, as shown in the following example:

###### Example 1.

In each of the following cases, $\mathrm{E}[H^{[b]}] \le \mathrm{E}[H^{[b+1]}]$ ($b = 1, 2, \ldots$) implies $H^{[b]} \le_{\mathrm{st}} H^{[b+1]}$ (cf. Remark 2):

• $H^{[b]}$ ($b = 1, 2, \ldots$) follows an exponential distribution, i.e.,

$$\Pr(H^{[b]} > x) = e^{-x / \mathrm{E}[H^{[b]}]}, \quad x \ge 0.$$

• $H^{[b]}$ ($b = 1, 2, \ldots$) follows a gamma distribution with a fixed coefficient of variation $c$, i.e.,

$$\Pr(H^{[b]} > x) = 1 - \frac{\gamma\bigl(1/c^2,\, x/(c^2 \mathrm{E}[H^{[b]}])\bigr)}{\Gamma(1/c^2)}, \quad x \ge 0,$$

where $\Gamma$ and $\gamma$ denote the gamma function and the lower incomplete gamma function, respectively.

• $H^{[b]}$ ($b = 1, 2, \ldots$) takes a constant value, i.e.,

$$\Pr(H^{[b]} > x) = \mathbb{1}\{\mathrm{E}[H^{[b]}] > x\}, \quad x \ge 0.$$
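The exponential case of Example 1 can be checked numerically against Definition 1: on any grid of points, the survival function of the smaller-mean exponential never exceeds that of the larger-mean one. A minimal sketch:

```python
import math

def exp_tail(mean, x):
    """Survival function Pr(H > x) of an exponential r.v. with the given mean."""
    return math.exp(-x / mean)

def check_usual_stochastic_order(mean1, mean2, grid):
    """Numerically check Definition 1 on a grid: the tail with mean1 never
    exceeds the tail with mean2 (i.e., the first variable is st-smaller)."""
    return all(exp_tail(mean1, x) <= exp_tail(mean2, x) for x in grid)
```

Reversing the means makes the check fail for any $x > 0$, which illustrates that the ordering of means characterizes the usual stochastic order within this family.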

Let $B^{\langle\lambda\rangle}$ and $\eta^{\langle\lambda\rangle}$ denote the stationary batch size and the energy efficiency, regarded as functions of the arrival rate $\lambda$.

###### Theorem 1.

Under Assumption 3, the stationary batch size $B^{\langle\lambda\rangle}$ increases with the arrival rate $\lambda$ in the usual stochastic order, i.e.,

$$B^{\langle\lambda_1\rangle} \le_{\mathrm{st}} B^{\langle\lambda_2\rangle}, \quad \text{for any } \lambda_1 \le \lambda_2. \tag{20}$$
###### Proof.

Let $\boldsymbol{P}^{\langle\lambda\rangle}$ denote the transition probability matrix (6) of $(B_n)$ given the arrival rate $\lambda$, and let $p^{\langle\lambda\rangle}_{i,j}$ denote its $(i, j)$th element. To prove (20), it is sufficient to show that, in the sense of the usual stochastic order, the probability distribution given by the $i$th row of $\boldsymbol{P}^{\langle\lambda_1\rangle}$ increases with $i$, and that $\boldsymbol{P}^{\langle\lambda_1\rangle}$ is smaller than $\boldsymbol{P}^{\langle\lambda_2\rangle}$ [13, Pages 186–187], i.e.,

$$\sum_{j=k}^{\infty} p^{\langle\lambda_1\rangle}_{i,j} \le \sum_{j=k}^{\infty} p^{\langle\lambda_1\rangle}_{i',j}, \quad i \le i', \; k = 1, 2, \ldots, \tag{21}$$

and

$$\sum_{j=k}^{\infty} p^{\langle\lambda_1\rangle}_{i,j} \le \sum_{j=k}^{\infty} p^{\langle\lambda_2\rangle}_{i,j}, \quad i = 1, 2, \ldots, \; k = 1, 2, \ldots. \tag{22}$$

Using (6), we rewrite (21) and (22) as

$$\sum_{j=k}^{\infty} a^{[i],\langle\lambda_1\rangle}_j \le \sum_{j=k}^{\infty} a^{[i'],\langle\lambda_1\rangle}_j, \quad i \le i', \; k = 2, 3, \ldots, \tag{23}$$

and

$$\sum_{j=k}^{\infty} a^{[i],\langle\lambda_1\rangle}_j \le \sum_{j=k}^{\infty} a^{[i],\langle\lambda_2\rangle}_j, \quad i = 1, 2, \ldots, \; k = 2, 3, \ldots, \tag{24}$$

where $a^{[i],\langle\lambda_m\rangle}_j$ ($m = 1, 2$) is defined as (cf. (4))

$$a^{[i],\langle\lambda_m\rangle}_j := \int_0^{\infty} e^{-\lambda_m x} \frac{(\lambda_m x)^j}{j!}\, \mathrm{d}H^{[i]}(x).$$

Let $A^{[i],\langle\lambda_m\rangle}$ denote a generic random variable satisfying $\Pr(A^{[i],\langle\lambda_m\rangle} = j) = a^{[i],\langle\lambda_m\rangle}_j$ ($j = 0, 1, \ldots$). Note that the Poisson distribution is increasing in the usual stochastic order with respect to its mean [18, Example 8.A.2], so that we have from Assumption 3 and [18, Theorem 1.A.6],

$$A^{[i],\langle\lambda_1\rangle} \le_{\mathrm{st}} A^{[i'],\langle\lambda_1\rangle}, \quad i \le i',$$

and from [18, Theorem 1.A.3 (d)],

$$A^{[i],\langle\lambda_1\rangle} \le_{\mathrm{st}} A^{[i],\langle\lambda_2\rangle}, \quad \lambda_1 \le \lambda_2, \; i = 1, 2, \ldots.$$

Therefore, we obtain (23) and (24), which completes the proof. ∎

###### Corollary 1.

Under Assumptions 2 and 3, the energy efficiency $\eta^{\langle\lambda\rangle}$ is non-decreasing with the arrival rate $\lambda$.

###### Proof.

Corollary 1 immediately follows from Theorem 1 (together with Remark 2) and (19). ∎

Corollary 1 suggests that it is energy-efficient to operate the inference server under a utilization level as high as possible within a latency requirement of inference jobs. In the following subsection, we derive a closed-form upper bound of the mean latency, assuming linearly increasing batch processing times.

### 3.3 Deterministic Linear Batch Processing Times

In Lemma 2, we showed that the mean latency is given in terms of the stationary distribution of the Markov chain of batch sizes. As mentioned above, an exact analysis of the stationary distribution of the GI/G/1-type Markov chain is difficult, and only numerical approximations are available in the literature.

In this subsection, it is shown that we can obtain a closed-form upper bound of the mean latency by assuming a specific structure in batch processing times. Specifically, we make the following assumption throughout this subsection:

###### Assumption 4.

The batch processing time $H^{[b]}$ ($b = 1, 2, \ldots$) takes a constant value $\tau^{[b]}$, which is given by

$$\tau^{[b]} = \alpha b + \tau_0, \tag{25}$$

for some constants $\alpha > 0$ and $\tau_0 > 0$.

The deterministic distribution is a natural choice for modeling batch inference times: most DNNs take a fixed-size input vector (the input dimension times the batch size), and the output is computed by applying a predefined sequence of operations to it, such as matrix multiplications and non-linear activation functions, so that the computational steps are invariant regardless of the input vector. Furthermore, the linearity assumption (25) is consistent with the measurement results in Table 1: the least-squares fit yields a coefficient of determination close to one for the batch processing times calculated from the data in Table 1 (a) and Table 1 (b) by dividing batch sizes by throughputs (cf. (1)). Note that under Assumption 4, the throughput $\mu^{[b]}$ is written as

$$\mu^{[b]} = \frac{b}{\alpha b + \tau_0}. \tag{26}$$

As shown in Fig. 3, the throughput characteristics in Table 1 are well fitted by this simple rational function.
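The shape of (26) is easy to inspect numerically: the throughput is strictly increasing in $b$ and saturates at $1/\alpha$. A minimal sketch with illustrative parameters (not the fitted values from Table 1):

```python
def throughput(b, alpha, tau0):
    """Throughput (26) under Assumption 4: mu^[b] = b / (alpha*b + tau0).
    Increasing in b, saturating at the limit 1/alpha."""
    return b / (alpha * b + tau0)
```

The saturation level $1/\alpha$ is exactly the server's maximum processing capacity appearing in the stability condition (27) below.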

We can readily verify from (26) that Assumption 4 ensures Assumption 1 (i). Furthermore, (26) implies

$$\lim_{b \to \infty} \mu^{[b]} = \frac{1}{\alpha},$$

so that the stability condition stated in Assumption 1 (ii) is rewritten as

$$\rho := \lambda \alpha < 1. \tag{27}$$

In view of this relation, the normalized load $\rho$ represents the ratio of the arrival rate to the server's maximum processing capacity, which corresponds to the traffic intensity in ordinary single-server queueing models.

Assumption 4 simplifies the analysis mainly because, under this assumption, $\mathrm{E}[H]$, $\mathrm{E}[H^2]$, and $\mathrm{E}[\hat{H}]$ (see (13) and (14)) are given in terms of the first two moments $\mathrm{E}[B]$ and $\mathrm{E}[B^2]$ of the stationary batch size distribution:

$$\mathrm{E}[H] = \alpha \mathrm{E}[B] + \tau_0, \tag{28}$$

$$\mathrm{E}[H^2] = \alpha^2 \mathrm{E}[B^2] + 2\alpha\tau_0 \mathrm{E}[B] + \tau_0^2, \tag{29}$$

$$\mathrm{E}[\hat{H}] = \alpha \cdot \frac{\mathrm{E}[B^2]}{\mathrm{E}[B]} + \tau_0. \tag{30}$$
###### Lemma 3.

Let $A$ denote a generic random variable for $A_n$ in steady state (cf. (3)):

$$\Pr(A = k) = \sum_{b=1}^{\infty} \Pr(B = b)\, a^{[b]}_k, \quad k = 0, 1, \ldots.$$

Under Assumption 4, $\mathrm{E}[B]$ and $\mathrm{E}[B^2]$ are given in terms of $\Pr(A = 0)$ by

$$\mathrm{E}[B] = \frac{\lambda\tau_0 + \Pr(A = 0)}{1 - \lambda\alpha}, \tag{31}$$

$$\mathrm{E}[B^2] = \frac{(1 + 2\lambda^2\alpha\tau_0)\,\mathrm{E}[B] + \lambda^2\tau_0^2}{1 - \lambda^2\alpha^2}. \tag{32}$$
###### Proof.

From (7) and (13), we have

$$\mathrm{E}[A] = \sum_{b=1}^{\infty} \Pr(B = b)\, \lambda \mathrm{E}[H^{[b]}] = \lambda \mathrm{E}[H], \tag{33}$$

$$\mathrm{E}[A^2] = \sum_{b=1}^{\infty} \Pr(B = b) \left( \lambda \mathrm{E}[H^{[b]}] + \lambda^2 \mathrm{E}[(H^{[b]})^2] \right) = \lambda \mathrm{E}[H] + \lambda^2 \mathrm{E}[H^2]. \tag{34}$$

It then follows from (5), (28), and (33) that

$$\mathrm{E}[B] = \mathrm{E}[A] + \Pr(A = 0) = \lambda(\alpha \mathrm{E}[B] + \tau_0) + \Pr(A = 0),$$

so that we obtain (31). Similarly, it follows from (5), (29), and (34) that

$$\mathrm{E}[B^2] = \mathrm{E}[A^2] + \Pr(A = 0) = \lambda(\alpha \mathrm{E}[B] + \tau_0) + \lambda^2\left(\alpha^2 \mathrm{E}[B^2] + 2\alpha\tau_0 \mathrm{E}[B] + \tau_0^2\right) + \Pr(A = 0).$$

We then obtain (32) by rearranging the terms of this equation, using (31). ∎

###### Lemma 4.

Under Assumption 4, the mean latency $\mathrm{E}[W]$ is given in terms of the probability $\pi_0$ that the server is idle by

$$\mathrm{E}[W] = \alpha + \tau_0 + \frac{\lambda(1 + 2\lambda\alpha)\left(2\alpha\tau_0 + \alpha^2 + (1 - \pi_0 - \lambda\alpha)\tau_0/\lambda\right)}{2(1 - \lambda^2\alpha^2)}. \tag{35}$$

###### Proof.

It follows from (15), (30), (31), and (32) that

$$\mathrm{E}[W] = \tau_0 + \alpha \cdot \frac{\mathrm{E}[B^2]}{\mathrm{E}[B]} + \frac{\mathrm{E}[B^2] - \mathrm{E}[B]}{2\lambda \mathrm{E}[B]} = \alpha + \tau_0 + \frac{(1 + 2\lambda\alpha)\left(\mathrm{E}[B^2] - \mathrm{E}[B]\right)}{2\lambda \mathrm{E}[B]}. \tag{36}$$

Note here that (31) and (32) imply

$$\frac{\mathrm{E}[B^2] - \mathrm{E}[B]}{\lambda \mathrm{E}[B]} = \frac{(2\lambda^2\alpha\tau_0 + \lambda^2\alpha^2)\,\mathrm{E}[B] + \lambda^2\tau_0^2}{(1 - \lambda^2\alpha^2)\,\lambda \mathrm{E}[B]} = \frac{\lambda\left(2\alpha\tau_0 + \alpha^2 + \tau_0^2/\mathrm{E}[B]\right)}{1 - \lambda^2\alpha^2}. \tag{37}$$

In addition, owing to Little's law, the server utilization $1 - \pi_0$ (i.e., the mean number of batches being served in steady state) is equal to the product of the number of batches processed per unit time and the mean batch processing time:

$$1 - \pi_0 = \frac{\lambda}{\mathrm{E}[B]} \cdot \mathrm{E}[H] = \lambda\alpha + \frac{\lambda\tau_0}{\mathrm{E}[B]}, \tag{38}$$

where we used (28) for the second equality. Therefore, we obtain (35) from (36), (37), and (38). ∎
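Formula (35) is easy to evaluate once $\pi_0$ is known or bounded. A direct transcription (valid for $\rho = \lambda\alpha < 1$; the parameter values in the usage below are illustrative):

```python
def mean_latency_given_pi0(lam, alpha, tau0, pi0):
    """Exact mean latency (35) as a function of the empty probability pi_0."""
    num = lam * (1 + 2 * lam * alpha) * (
        2 * alpha * tau0 + alpha ** 2 + (1 - pi0 - lam * alpha) * tau0 / lam)
    return alpha + tau0 + num / (2 * (1 - (lam * alpha) ** 2))
```

Since the right-hand side of (35) is non-increasing in $\pi_0$, any lower bound on $\pi_0$ (such as (39) below) yields an upper bound on the mean latency, which is exactly how Theorem 2 is obtained.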

###### Remark 3.

By definition, we have $\pi_0 = \Pr(L = 0) = \pi(0)$ (see (8)).

Even under Assumption 4, it seems difficult to determine the exact value of $\pi_0$. However, we have the following simple lower bound for this quantity:

###### Lemma 5.

Under Assumption 4, $\pi_0$ is bounded below by

$$\pi_0 \ge \max\bigl(0,\, 1 - \lambda(\alpha + \tau_0)\bigr). \tag{39}$$

###### Proof.

Because $B \ge 1$ holds with probability one, $\mathrm{E}[B] \ge 1$ holds. We then have from (38),

$$\pi_0 \ge 1 - \lambda(\alpha + \tau_0),$$

which, together with $\pi_0 \ge 0$, implies (39). ∎

###### Remark 4.

If $\lambda(\alpha + \tau_0) < 1$, the quantity $1 - \lambda(\alpha + \tau_0)$ is equal to the probability that the server is idle in a stationary single-service M/D/1 queue with arrival rate $\lambda$ and processing time $\alpha + \tau_0$, where arriving inference jobs are processed one by one.

###### Remark 5.

It follows from (38) and $\pi_0 \ge 0$ that $\mathrm{E}[B] \ge \lambda\tau_0/(1 - \lambda\alpha)$, so that if Assumption 2 is satisfied, we have (cf. (19))

$$\eta \ge \frac{1}{\beta + c_0 / \max\bigl(1,\, \lambda\tau_0/(1 - \lambda\alpha)\bigr)}. \tag{40}$$
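The bound (40) transcribes directly into code; the parameter values in the usage below are illustrative.

```python
def eta_lower_bound(lam, alpha, tau0, beta, c0):
    """Closed-form lower bound (40) on the energy efficiency, combining the
    linear energy model (17) with E[B] >= max(1, lam*tau0 / (1 - lam*alpha));
    requires rho = lam*alpha < 1."""
    mean_b_lb = max(1.0, lam * tau0 / (1.0 - lam * alpha))
    return 1.0 / (beta + c0 / mean_b_lb)
```

Note that the bound is non-decreasing in $\lambda$, mirroring the monotonicity of $\eta$ established in Corollary 1.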

We are now in a position to state the main result of this paper:

###### Theorem 2.

Under Assumption 4, the mean latency $\mathrm{E}[W]$ is bounded above by

$$\mathrm{E}[W] \le \frac{\alpha + \tau_0}{2(1 - \lambda\alpha)}\left(1 + 2\lambda\tau_0 + \frac{1 - \lambda\tau_0}{1 + \lambda\alpha}\right) =: \phi_0(\lambda, \alpha, \tau_0), \tag{41}$$

and

$$\mathrm{E}[W] \le \frac{3}{2} \cdot \frac{\tau_0}{1 - \lambda\alpha} + \frac{\alpha}{2} \cdot \frac{\lambda\alpha + 2}{1 - \lambda^2\alpha^2} =: \phi_1(\lambda, \alpha, \tau_0). \tag{42}$$

In addition, we have $\phi_0(\lambda, \alpha, \tau_0) \le \phi_1(\lambda, \alpha, \tau_0)$ if and only if $\lambda(\alpha + \tau_0) \le 1$.

###### Proof.

As stated in Lemma 5, we have two lower bounds for $\pi_0$, and the right-hand side of (35) is non-increasing in $\pi_0$. Using $\pi_0 \ge 1 - \lambda(\alpha + \tau_0)$ and (35), we have

$$\mathrm{E}[W] \le \alpha + \tau_0 + \frac{\lambda(1 + 2\lambda\alpha)(\alpha + \tau_0)^2}{2(1 - \lambda^2\alpha^2)} = \frac{\alpha + \tau_0}{2(1 - \lambda\alpha)} \cdot \frac{2 + \lambda\alpha + \lambda\tau_0 + 2\lambda^2\alpha\tau_0}{1 + \lambda\alpha},$$

which implies (41). On the other hand, we have (42) from $\pi_0 \ge 0$ and (35) as follows:

$$\mathrm{E}[W] \le \alpha + \tau_0 + \frac{(1 + 2\lambda\alpha)\left(\lambda\alpha\tau_0 + \lambda\alpha^2 + \tau_0\right)}{2(1 - \lambda^2\alpha^2)} = \alpha + \tau_0 + \frac{(1 + 2\lambda\alpha)\tau_0}{2(1 - \lambda\alpha)} + \frac{(1 + 2\lambda\alpha)\lambda\alpha^2}{2(1 - \lambda^2\alpha^2)} = \frac{3\tau_0}{2(1 - \lambda\alpha)} + \frac{\lambda\alpha^2 + 2\alpha}{2(1 - \lambda^2\alpha^2)}.$$

The claimed relation between $\phi_0$ and $\phi_1$ is then obvious from these derivations, because the upper bound obtained from (35) is tighter for the larger of the two lower bounds on $\pi_0$. ∎

Theorem 2 provides a surprisingly simple upper bound for the mean latency $\mathrm{E}[W]$. For convenience, let

$$\phi(\lambda, \alpha, \tau_0) := \min\bigl(\phi_0(\lambda, \alpha, \tau_0),\, \phi_1(\lambda, \alpha, \tau_0)\bigr). \tag{43}$$

Even though this upper bound is obtained by replacing the idle probability $\pi_0$ with the almost trivial lower bound (39), it provides a quite good approximation to the exact value of the mean latency $\mathrm{E}[W]$, as we will see in the next section.
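The bounds (41)-(43) are elementary to compute. A direct transcription, valid for $\rho = \lambda\alpha < 1$:

```python
def phi0(lam, alpha, tau0):
    """Upper bound (41) on the mean latency; requires rho = lam*alpha < 1."""
    return (alpha + tau0) / (2 * (1 - lam * alpha)) * (
        1 + 2 * lam * tau0 + (1 - lam * tau0) / (1 + lam * alpha))

def phi1(lam, alpha, tau0):
    """Upper bound (42) on the mean latency; requires rho = lam*alpha < 1."""
    return (1.5 * tau0 / (1 - lam * alpha)
            + 0.5 * alpha * (lam * alpha + 2) / (1 - (lam * alpha) ** 2))

def phi(lam, alpha, tau0):
    """Combined closed-form bound (43): the tighter of (41) and (42)."""
    return min(phi0(lam, alpha, tau0), phi1(lam, alpha, tau0))
```

As $\lambda \to 0$, $\phi$ reduces to $\alpha + \tau_0$, the processing time of a batch of one job, which is the exact mean latency of an empty system.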

## 4 Numerical Evaluation

In this section, we present numerical and simulation experiments. Throughout this section, we concentrate on the case of deterministic linear batch processing times considered in Section 3.3. In particular, we employ the model parameters $\alpha$ and $\tau_0$ estimated in Section 3.3 from Table 1 (see the paragraph just after Assumption 4).

Fig. 5 shows simulation results for the mean latency $\mathrm{E}[W]$ and its upper bounds $\phi_0$ and $\phi_1$ given in (41) and (42), plotted against the normalized load $\rho$ defined in (27). We observe that the combination (43) of these upper bounds approximates the exact curve of $\mathrm{E}[W]$ quite well. In particular, except for small values of $\rho$, $\phi_1$ takes values fairly close to $\mathrm{E}[W]$.

Recall that the upper bound $\phi_1$ is obtained by replacing the idle probability $\pi_0$ with its trivial lower bound $0$. In Fig. 3(b), the server utilization $1 - \pi_0$ is plotted as a function of the normalized load $\rho$. As a reference, we also plot its upper bound $\min(1, \lambda(\alpha + \tau_0))$ (cf. (39)). From this figure, we see that the server utilization takes a value close to $1$ even for a moderate value of $\rho$, which is quite different from ordinary single-server queues, where the server utilization is equal to the traffic intensity. This phenomenon comes from the fact that the server's processing speed largely increases as the batch size increases, so that the system is overloaded for small batch sizes even under a moderate load level $\rho$. Because of this behavior of the server utilization, the upper bound $\phi_1$ is a good approximation to the mean latency for a wide range of $\rho$.

On the other hand, for small $\rho$, the upper bound $\phi_0$ is a good approximation to $\mathrm{E}[W]$. Note that $\phi_0$ is obtained by replacing the mean batch size $\mathrm{E}[B]$ with its trivial lower bound $1$. Therefore, $\phi(\lambda, \alpha, \tau_0) = \phi_0(\lambda, \alpha, \tau_0)$ implies that the mean batch size is close to one, i.e., the server does not sufficiently leverage its batch-processing capability in such a region.

We next discuss the energy efficiency using the linear model (17) considered in Section 3.2. Recall that the average energy efficiency $\eta$ is defined by (18) and represents the mean number of jobs processed with unit energy. In Fig. 7, simulation results for $\eta$ and its lower bound (40) are plotted as functions of the normalized load $\rho$. From this figure, we observe that the energy efficiency can be largely enhanced by keeping the server adequately loaded. Also, the energy efficiency is well approximated by the lower bound (40), except for small values of $\rho$. Fig. 7 shows the energy-latency tradeoff, where the relation between $\eta$ and the mean latency $\mathrm{E}[W]$ is plotted with $\rho$ as a parameter. In this figure, we also plot approximation curves obtained by combining (40) and (43). We see that the closed-form bounds (40) and (43) are useful for determining an adequate operating point of the server, taking the energy-latency tradeoff into consideration.

Finally, we discuss the relation between the model considered in this paper and a corresponding batch-service queue with a finite maximum batch size. As mentioned in Section 1, the mean latency in the case of a finite maximum batch size can be numerically obtained with the results in [16, Section 4.2]. Fig. 8 shows that if the maximum batch size is sufficiently large, the mean latency is well approximated by our closed-form upper bound $\phi$ given by (43). If the maximum batch size is small, on the other hand, the mean latency deviates from $\phi$ for arrival rates near the stability boundary. However, we observe from this figure that even for small maximum batch sizes, the mean latency is still well approximated by (43) if the system is moderately loaded, i.e., if the arrival rate is sufficiently small compared to the stability boundary.
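The experiments above can be reproduced in spirit with a short self-contained simulation of the model of Section 2 under Assumption 4. The parameter values below ($\lambda = 0.5$, $\alpha = \tau_0 = 1$) are illustrative, not those estimated from Table 1; the empirical mean latency is compared against the bound (43) evaluated inline.

```python
import random

def simulate_mean_latency(lam, alpha, tau0, n_jobs, seed=7):
    """Event-driven simulation of the batch-service queue of Section 2 with
    deterministic linear processing times tau^[b] = alpha*b + tau0."""
    rng = random.Random(seed)
    t = 0.0                       # time at which the server becomes free
    next_arr = rng.expovariate(lam)
    waiting = []                  # arrival times of jobs waiting in the buffer
    total_sojourn, served = 0.0, 0
    while served < n_jobs:
        if not waiting:
            # empty system: the server idles until the next arrival
            t = next_arr
            waiting.append(next_arr)
            next_arr += rng.expovariate(lam)
        batch, waiting = waiting, []          # all waiting jobs form one batch
        done = t + alpha * len(batch) + tau0  # deterministic processing time
        while next_arr < done:                # arrivals during service must wait
            waiting.append(next_arr)
            next_arr += rng.expovariate(lam)
        total_sojourn += sum(done - a for a in batch)
        served += len(batch)
        t = done
    return total_sojourn / served

# Bound (43) at lam = 0.5, alpha = tau0 = 1: here phi_0 = phi_1 = 14/3.
bound = 14.0 / 3.0
est = simulate_mean_latency(0.5, 1.0, 1.0, n_jobs=50000)
```

With these parameters the simulated mean latency stays below the closed-form bound and reasonably close to it, consistent with the observations reported in this section.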

## 5 Conclusions

In this paper, we introduced a queueing model representing GPU-based inference servers with dynamic batching. We modeled an inference server as a batch-service queueing model with an infinite maximum batch size and batch-size dependent processing times. We first showed that the energy efficiency of the server increases with the arrival rate of inference jobs, which suggests that it is energy-efficient to operate the server under a traffic load as large as possible within the latency requirement of inference jobs. We then derived a simple closed-form upper bound for the mean latency in Theorem 2, under the assumption that the batch processing time increases linearly with the batch size. Through numerical and simulation experiments, we showed that the exact value of the mean latency is well approximated by this simple upper bound.

## Acknowledgements

This work was supported in part by JSPS KAKENHI Grant Number 18K18007.

## References

• [1] Nvidia AI Inference Platform, Giant Leaps in Performance and Efficiency for AI Services, from the Data Center to the Network’s Edge. https://www.nvidia.com/en-us/data-center/resources/inference-technical-overview/ (accessed 06-Dec-2019).
• [2] Nvidia TensorRT Inference Server.
https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/ (accessed 06-Dec-2019).
• [3] N. T. J. Bailey, On queueing processes with bulk service, J. Roy. Stat. Soc. B 16 (1954) 80–87.
• [4] G. Brière and M. L. Chaudhry, Computational analysis of single-server bulk-service queues, M/G/1, Adv. Appl. Prob. 21 (1989) 207–225.
• [5] D. Crankshaw, X. Wang, G. Zhou, M. J. Franklin, J. E. Gonzalez, and I. Stoica, Clipper: A low-latency online prediction serving system, in Proc. of 14th USENIX Symposium on Networked Systems Design and Implementation (2017) 613–627.
• [6] R. K. Deb and R. F. Serfozo, Optimal control of batch service queues, Adv. Appl. Prob. 5 (1973) 340–361.
• [7] F. Downton, Waiting time in bulk service queues, J. Roy. Stat. Soc. B 17 (1955) 256–261.
• [8] F. Downton, On limiting distributions arising in bulk service queues, J. Roy. Stat. Soc. B 18 (1956) 265–274.
• [9] D. Gibson and E. Seneta, Augmented truncations of infinite stochastic matrices, J. Appl. Prob. 24 (1987) 600–608.
• [10] N. K. Jaiswal, Time-dependent solution of the bulk-service queueing problem, Oper. Res. 8 (1960) 773–781.
• [11] Y. Liu, Augmented truncation approximations of discrete-time Markov chains, Oper. Res. Lett. 38 (2010) 218–222.
• [12] J. Medhi, Waiting time distribution in a Poisson queue with a general bulk service rule, Manag. Sci. 21 (1975) 777–782.
• [13] A. Müller and D. Stoyan, Comparison Methods for Stochastic Models and Risks, John Wiley & Sons, Chichester, UK, 2002.
• [14] M. F. Neuts, The busy period of a queue with batch service, Oper. Res. 13 (1965) 815–819.
• [15] M. F. Neuts, A general class of bulk queues with Poisson input, Ann. Math. Stat. 38 (1967) 759–770.
• [16] M. F. Neuts, Structured Stochastic Matrices of M/G/1 Type and Their Applications, Marcel Dekker, New York, 1989.
• [17] C. Olston, N. Fiedel, K. Gorovoy, J. Harmsen, L. Lao, F. Li, V. Rajashekhar, S. Ramesh, and J. Soyke, TensorFlow-Serving: Flexible, high-performance ML serving, in Proc. of Workshop on ML Systems at NIPS 2017, 2017.
• [18] M. Shaked and J. G. Shanthikumar, Stochastic Orders, Springer, New York, NY, 2007.
• [19] R. L. Tweedie, Truncation approximations of invariant measures for Markov chains, J. Appl. Prob. 35 (1998) 517–536.
• [20] R. Xu, F. Han, and Q. Ta, Deep learning at scale on NVIDIA V100 accelerators, in Proc. of 2018 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS18), 2018.