Please come back later: Benefiting from deferrals in service systems

by   Anmol Kagrecha, et al.

The performance evaluation of loss service systems, where customers who cannot be served upon arrival get dropped, has a long history going back to the classical Erlang B model. In this paper, we consider the performance benefits arising from the possibility of deferring customers who cannot be served upon arrival. Specifically, we consider an Erlang B type loss system where the system operator can, subject to certain constraints, ask a customer arriving when all servers are busy, to come back at a specified time in the future. If the system is still fully loaded when the deferred customer returns, she gets dropped for good. For such a system, we ask: How should the system operator determine the rearrival times of the deferred customers based on the state of the system (which includes those customers already deferred and yet to arrive)? How does one quantify the performance benefit of such a deferral policy? Our contributions are as follows. We propose a simple state-dependent policy for determining the rearrival times of deferred customers. For this policy, we characterize the long run fraction of customers dropped. We also analyse a relaxation where the deferral times are bounded in expectation. Via extensive numerical evaluations, we demonstrate the superiority of the proposed state-dependent policies over naive state-independent deferral policies.




I Introduction

Many service systems have the property that denial of service upon the arrival of a request results in the request getting dropped. The classical example comes from telephony, where an incoming call request must either be connected, or dropped. Another contemporary example is an electric car charging facility, which cannot serve an incoming customer when all charging stations are occupied. Similarly, certain online services decline fresh requests when congested.

Such service systems, by their very nature, are unable to queue requests that arrive when service capacity is fully utilized. It is therefore natural to ask: What if these systems were instead able to defer some requests, i.e., have these requests ‘come back’ at a pre-specified time in the future? If so, what is the optimal strategy for deferring requests? To what extent does the throughput of the system improve from such a deferral policy? The goal of this paper is to shed light on these questions.

Formally, we consider an Erlang B (M/M/K/K) service system, where the system operator can, subject to certain constraints, defer jobs that arrive when all servers are occupied. The constraints include an upper bound on the deferral time (which is natural given QoS considerations), and/or a constraint on the number of jobs that can be deferred at any time. In this setting, our goal is to design simple deferral policies that are effective, while also being analytically tractable. The main difficulty in the performance evaluation of deferral policies is that the state description must incorporate information about (future) arrival instants of deferred jobs.

Our main contribution is the design and analysis of a deferral policy, which spaces arrivals of deferred jobs uniformly in the future. Even though this policy induces a complicated (uncountable) Markovian state space, we are able to analytically characterize the blocking probability via a novel combination of steady state and transient analysis of certain Markov processes. Via numerical experiments, we show that the proposed policy outperforms naive state-independent deferral policies, as well as the Erlang B system (which does not allow deferrals).

A relaxation on the deferral time constraint is considered next, where the upper bound on the deferral time is imposed in expectation. For this relaxed model, we propose a deferral policy, which admits a considerably more explicit performance characterization. Via numerical experiments, we show that the performance of this policy is in fact a good approximation to the policy designed for the ‘hard’ deferral constraint.

The remainder of the paper is as follows. We provide a brief survey of related literature next. In Section II, we describe the primary system model and state some preliminary results. The ‘hard’ case where deferral times are bounded deterministically is considered in Section III, and the relaxed model is considered in Section IV.

Related literature: This work is motivated by the recent work [3], which implemented a deferral system for a web-server that is prone to congestion. To the best of our knowledge, there is no prior work that provides an analytical treatment of queueing systems that allow deferral.

This work is peripherally related to the considerable literature on retrial queues (we refer the reader to a recent survey paper [6] and the book [1]). Retrial queues are systems where blocked customers ‘try again later’. The main contrast between retrial queues and deferral queues is that in the former setting, the retrial model is taken to be exogenous and indicative of customer behavior, while in the latter setting, deferral is a control decision on part of the system operator.

Another related class of loss systems that have been studied are abandonment based queues (references include [2] and [7]). In these systems, customers who are denied service upon arrival wait for a certain (potentially random) amount of time. If service is not offered before this patience time runs out, the customer leaves. In contrast, in deferral queues, a deferred customer does not actually wait, but rather leaves the system to come back at a pre-specified time. So this deferred customer can only be served if service capacity is available at the precise moment of re-entry. In this sense, abandonment based systems provide lower bounds on the performance achievable in deferral systems.

II Model and Preliminaries

We consider a loss service system consisting of K identical parallel servers of unit speed. Jobs (or customers) arrive according to a Poisson process of rate λ, and service times are exponentially distributed with rate μ. In the absence of deferrals, this system is the classical Erlang B (M/M/K/K) queue.

On this baseline Erlang B model, we allow the system operator (or the scheduling policy) to defer jobs that arrive when all servers are busy, to come back to the system at a pre-specified time in the future. If a free server is available when this deferred job comes back, it gets admitted into service. If not, the job gets dropped. In other words, we only allow a job to be deferred once. We refer to the time instant at which a deferred job comes back to the system as its rearrival time, and we refer to the interval between the rearrival time and the original arrival time as the deferral time.

Of course, it is natural to impose certain constraints on the deferral policy. Firstly, QoS considerations dictate that the deferral time cannot be too large. We impose this constraint in two ways: In Section III, we consider the case where there is a deterministic upper bound on deferral times. In Section IV, we consider a relaxed model where the deferral times are bounded above in expectation. A second constraint we impose is an upper bound on the number of deferred jobs at any time. Finally, we impose work conservation, i.e., when a job arrives (either for the first time, or post deferral), it must be served so long as there is at least one free server available.

Our metric for evaluating the performance of a deferral policy is the long run fraction of jobs blocked, a.k.a. the blocking probability. Note that jobs can be blocked in two ways: First, a job arrives when all servers are busy, and the number of previous arrivals already scheduled for rearrival in the future has reached the deferral limit. Second, a deferred job on rearrival finds that all the servers are (still) busy.

A naive deferral policy would be to assign every deferred job the same large deferral time, subject to the limit on the number of deferred jobs. Another naive policy would be to assign a deferral time sampled uniformly at random, again subject to the deferral limit. Such policies are state-independent, in that the deferral time of a job is assigned independently of the deferral times assigned to previous jobs. In this paper, we propose low-overhead state-dependent deferral policies that schedule rearrivals of deferred jobs cognizant of the already scheduled rearrivals.

Finally, we note that work conservation ensures that the blocking probability of any deferral policy is at most the blocking probability in the Erlang B system. In other words, the Erlang B formula provides an upper bound on the blocking probability of any deferral policy.
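The Erlang B upper bound mentioned above can be evaluated with the standard numerically stable recursion. The following is a minimal Python sketch; the function name `erlang_b` and the argument `offered_load` (arrival rate divided by the per-server service rate) are illustrative names, not notation taken from the paper.

```python
def erlang_b(num_servers: int, offered_load: float) -> float:
    """Blocking probability of an M/M/K/K loss system (Erlang B formula).

    offered_load is the arrival rate divided by the per-server service rate.
    Uses the recursion B(0) = 1, B(k) = a*B(k-1) / (k + a*B(k-1)).
    """
    blocking = 1.0  # with zero servers, every arrival is blocked
    for k in range(1, num_servers + 1):
        blocking = offered_load * blocking / (k + offered_load * blocking)
    return blocking


if __name__ == "__main__":
    # Upper bound on the blocking probability of any work-conserving
    # deferral policy, for a few system sizes at unit offered load.
    for servers in (1, 2, 10):
        print(servers, erlang_b(servers, offered_load=1.0))
```

Any work-conserving deferral policy with the same number of servers and the same offered load should produce a blocking probability no larger than the value returned here.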

Throughout the paper, for the set is denoted by

III Bounded Deferral Times

In this section, we consider the case where the deferral times are bounded by a constant, i.e., the system cannot defer an incoming job by more than this many time units. (Footnote 1: While we take this bound to be an exogenous model parameter, its value may in practice be determined based on QoS considerations and customer patience.) For this model, we propose a simple state-dependent deferral policy, which we refer to as the Deterministically Spaced Rearrival Times (DSRT) policy, which spreads the rearrival instants of deferred jobs ‘uniformly’, so as to maximize the chances of admitting them. We provide a characterization of the blocking probability under this policy, via a novel combination of the steady state analysis of a Markov chain that captures the system evolution in the absence of deferrals, and the transient analysis of another Markov chain that captures the system evolution in the presence of deferrals. Finally, we present a case study that demonstrates the superiority of our state-dependent deferral policy over naive state-independent policies.

III-A DSRT Policy

The proposed deferral policy is described as follows. It is parameterized by a spacing parameter, which specifies the spacing between arrivals of deferred jobs into the system. If there are no jobs currently deferred, and an arriving job finds all servers busy, then the policy defers that job by one spacing interval. On the other hand, if there are deferred jobs, and an arriving job finds all servers busy, then the policy schedules its rearrival time to be one spacing interval after the rearrival time of the last deferred job, subject to the limit on the number of deferred jobs at any time. This policy is defined formally as Algorithm 1.

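procedure DSRT
     if all servers are busy and the deferral limit has not been reached then
         Schedule rearrival one spacing interval after the last scheduled rearrival (or after the current time if no job is deferred)
     else if all servers are busy then
         Customer blocked
     end if
end procedure
Algorithm 1 DSRT policy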
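The decision rule of Algorithm 1, as described in the prose above, can be sketched in Python as follows. The names `dsrt_on_arrival`, `pending`, `spacing` and `max_deferred` are hypothetical. The sketch does not check the hard bound on deferral times; it assumes the caller chooses `spacing` and `max_deferred` so that the bound holds, and that the caller removes rearrival times from `pending` once they have passed.

```python
from typing import List, Optional, Tuple


def dsrt_on_arrival(
    now: float,
    busy: int,
    num_servers: int,
    pending: List[float],
    spacing: float,
    max_deferred: int,
) -> Tuple[str, Optional[float]]:
    """Decide what happens to a fresh arrival at time `now`.

    `pending` holds the scheduled rearrival times of currently deferred
    jobs in increasing order; it is updated in place on a deferral.
    Returns ("admit", None), ("defer", rearrival_time) or ("block", None).
    """
    if busy < num_servers:
        # Work conservation: a free server must be used.
        return ("admit", None)
    if len(pending) < max_deferred:
        # Space the new rearrival one interval after the last scheduled
        # one, or after the current time if nothing is deferred.
        base = pending[-1] if pending else now
        rearrival = base + spacing
        pending.append(rearrival)
        return ("defer", rearrival)
    return ("block", None)


if __name__ == "__main__":
    scheduled: List[float] = []
    print(dsrt_on_arrival(0.0, 1, 1, scheduled, spacing=2.0, max_deferred=2))
    print(dsrt_on_arrival(0.5, 1, 1, scheduled, spacing=2.0, max_deferred=2))
    print(dsrt_on_arrival(0.7, 1, 1, scheduled, spacing=2.0, max_deferred=2))
```

Keeping the pending rearrival times in a single ordered list is what makes the policy state-dependent: each new deferral only needs the last entry.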

Note that the proposed DSRT policy attempts to spread the rearrival instants a fixed spacing apart, subject to the deferral constraints. Intuitively, if the spacing is too small, then the deferred jobs are likely to find all servers occupied upon rearrival. On the other hand, if the spacing is too large, the policy is inefficient, since fewer jobs may be deferred at any time (given the constraint on the deferral times), and fewer opportunities are created to admit deferred jobs. Thus, the policy must operate at an intermediate ‘sweet spot’; we address this issue in our numerical experiments.

Under the DSRT policy, the temporal evolution of the system can be captured as a Markov process over state space Here, the state indicates that servers are currently busy, there are jobs in deferral, and that the earliest rearrival of a deferred job will occur after time Note that the state space is uncountable, making an analysis of the stationary distribution cumbersome. We overcome this difficulty by directly characterizing which is the long run fraction of time the state is of the form for some This is done by separately analysing the temporal evolution of the system in the presence and absence of deferred jobs, and then combining these analyses in a novel manner.

We begin by considering the special case of a single server, and a limit of at most one deferral at any time (). This special case is instructive, not just because it illustrates our analysis approach, but also because it admits a closed form characterization of the blocking probability. We then consider the case of general in Section III-C; in this case, our characterization is less explicit, though amenable to an exact computation.

III-B Single-Server, Single-Deferral System

Fig. 1: Illustration of the temporal evolution of when Deferral phases () of length are interspaced between non-deferral phases ().

The temporal evolution of the system for the case is illustrated in Figure 1. The key observation that informs our analysis is the following. Considering only the times when the system has no deferrals (i.e., disregarding the deferral phases, when ), the system configuration evolves as per the Markov chain depicted at the bottom of Figure 1. (Footnote 2: We avoid referring to as the state of the system, and instead refer to it as the configuration.) On the other hand, over each deferral phase of length (when ) the system configuration evolves as per the Markov chain depicted at the top of Figure 1.

Since the system configuration evolves as per the Markov chain depicted at the bottom of Figure 1 when there are no deferrals, the ratio between and is dictated by the flow balance equations for that Markov chain:


Turning now to the configurations where we note that the relationship between and is dictated by the transient behavior of the Markov chain depicted on the top of Figure 1 over an interval of length starting in configuration Focusing on this Markov chain over the interval the distribution of the chain at time captured by is given by:


It now follows that


Next, we need to relate the long run fraction of time spent in non-deferral configurations to those under deferral configurations . To do this, we note, using the PASTA property, that the rate at which transitions occur from non-deferral configurations to deferral configurations equals On the other hand, transitions from deferral configurations to non-deferral configurations occur at rate Equating the two, we get:


Lastly, we have the normalization condition


Solving the system of equations (1)–(4) yields a closed form characterization of the long run fraction of time spent in each configuration.

Fig. 2: Illustration of the temporal evolution of when

Fig. 3: System evolution over the non-deferral configurations

Fig. 4: System evolution within each deferral stage

Finally, we are interested in characterizing the blocking probability which is the long run fraction of customers dropped. This can be done as follows:


The first term above captures the long run fraction of customers dropped on arrival (because the server was busy and there was already a deferred customer). The second term captures the long run fraction of customers dropped post deferral on rearrival. Indeed, the probability that any deferred customer gets dropped on rearrival equals Thus, the rate of drops of deferred customers equals

To summarize, solving equations (1)–(4), and applying the solution to (5) yields the following result.

Theorem 1

For the case under the DSRT policy, the blocking probability is given by


where .

Note that is the blocking probability of an M/M/1/1 system. It is therefore clear from (6) that the blocking probability under the DSRT policy is a strict improvement over the scenario where deferral is not allowed. Moreover, the improvement diminishes to zero when and as While the optimal value of that minimizes does not admit an explicit formula, it can be shown that the optimal value is bounded between and .
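The claimed improvement over the no-deferral system can be sanity-checked by simulation. The following Python sketch simulates a loss system under the DSRT rule; it is an illustrative event-driven simulation written for this purpose, not the authors' code. The names `simulate_dsrt`, `lam`, `mu`, `spacing` and `max_deferred` are assumptions, and the hard bound on deferral times is assumed to be respected through the choice of `spacing` and `max_deferred`.

```python
import random
from collections import deque


def simulate_dsrt(num_servers, lam, mu, spacing, max_deferred,
                  num_customers, seed=0):
    """Estimate the fraction of customers dropped under the DSRT rule."""
    rng = random.Random(seed)
    now = 0.0
    busy = 0
    rearrivals = deque()  # scheduled rearrival times, increasing
    arrived = 0
    dropped = 0
    while arrived < num_customers:
        # Time to the next arrival or service completion (memoryless).
        rate = lam + busy * mu
        dt = rng.expovariate(rate)
        if rearrivals and rearrivals[0] <= now + dt:
            # A deterministic rearrival happens first; by memorylessness
            # the exponential clocks can simply be resampled afterwards.
            now = rearrivals.popleft()
            if busy < num_servers:
                busy += 1
            else:
                dropped += 1  # dropped for good on rearrival
            continue
        now += dt
        if rng.random() < lam / rate:
            arrived += 1
            if busy < num_servers:
                busy += 1
            elif len(rearrivals) < max_deferred:
                last = rearrivals[-1] if rearrivals else now
                rearrivals.append(last + spacing)
            else:
                dropped += 1  # dropped on arrival
        else:
            busy -= 1  # service completion
    return dropped / arrived


if __name__ == "__main__":
    with_deferral = simulate_dsrt(1, 1.0, 1.0, 1.0, 1, 200_000, seed=1)
    no_deferral = simulate_dsrt(1, 1.0, 1.0, 1.0, 0, 200_000, seed=1)
    print("single deferral:", with_deferral)
    print("no deferral:    ", no_deferral)
```

With `max_deferred=0` the simulation reduces to the plain loss system, so its estimate should be close to the Erlang B value, while allowing a single deferral should give a visibly smaller estimate.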

III-C Multiple-Servers, Multiple-Deferrals System

We now turn to the performance evaluation in the general setting, i.e., the number of servers and the maximum number of deferrals are arbitrary. Figure 2 illustrates the temporal evolution of in this case. Note that each deferral phase (an interval over which ) is no longer necessarily of duration but is instead made up of a random number of deferral stages, each of duration Moreover, at the end of each deferral phase, the configuration is of the form where Thus, disregarding the deferral phases, the system configuration evolves as per the Markov chain depicted in Figure 3. Here, denotes the probability that a deferral phase ends with active servers. (The values of are in turn dictated by the behavior of the system during each deferral phase; we will characterize shortly.) Moreover, over each deferral stage of duration the system configuration evolves as per the Markov chain shown in Figure 4.

In order to characterize and to arrive at the system of equations relating the long run fractions of time spent in the different configurations, we first analyse the evolution of the system configuration over a single deferral stage, and then consider the sequence of starting configurations across deferral stages within a deferral phase.

III-C1 Evolution of system configuration within a single deferral stage

Deferral stages begin in a configuration of the form where and (Footnote 3: Specifically, the first deferral stage in any deferral phase begins in configuration .) Subsequently, for duration the system configuration evolves as per the Markov chain depicted in Figure 4. The transition rate matrix corresponding to this chain is given by

where 0 is an all-zero square matrix of size , is the transition rate matrix of an M/M/K/K queue, is a diagonal matrix with entries , and . In writing this rate matrix, the configurations have been ordered by stacking them row-wise from Figure 4.

Considering the evolution of this chain over the interval the distribution at time is denoted by the vector


Here, the initial probability vector would have an entry 1 according to the starting configuration of the stage under consideration, the other entries being zero.
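The transient distribution of a finite continuous-time Markov chain, as used above, can be computed numerically in several ways. The sketch below uses uniformization, which is one standard choice and not necessarily the method used by the authors. For brevity it is demonstrated only on the M/M/K/K block of the rate matrix; the full matrix in the text also tracks the number of deferred jobs. The names `mmkk_generator` and `transient_distribution` are hypothetical.

```python
import math


def mmkk_generator(num_servers, lam, mu):
    """Transition rate matrix of an M/M/K/K queue on states 0..K."""
    n = num_servers + 1
    q = [[0.0] * n for _ in range(n)]
    for k in range(n):
        if k < num_servers:
            q[k][k + 1] = lam      # arrival admitted
        if k > 0:
            q[k][k - 1] = k * mu   # one of k services completes
        q[k][k] = -sum(q[k])       # diagonal makes each row sum to zero
    return q


def transient_distribution(generator, initial, horizon,
                           tol=1e-12, max_terms=100_000):
    """Distribution at time `horizon` via uniformization."""
    n = len(generator)
    unif = max(-generator[i][i] for i in range(n))
    if unif == 0.0 or horizon == 0.0:
        return list(initial)
    # Jump matrix of the uniformized chain: P = I + Q / unif.
    jump = [[(1.0 if i == j else 0.0) + generator[i][j] / unif
             for j in range(n)] for i in range(n)]
    weight = math.exp(-unif * horizon)  # Poisson weight of zero jumps
    vec = list(initial)
    result = [weight * v for v in vec]
    accumulated = weight
    for term in range(1, max_terms):
        if 1.0 - accumulated <= tol:
            break
        vec = [sum(vec[i] * jump[i][j] for i in range(n)) for j in range(n)]
        weight *= unif * horizon / term
        accumulated += weight
        result = [r + weight * v for r, v in zip(result, vec)]
    return result


if __name__ == "__main__":
    q = mmkk_generator(1, lam=1.0, mu=1.0)
    # Start with the single server busy and look one time unit ahead.
    print(transient_distribution(q, [0.0, 1.0], horizon=1.0))
```

Setting the initial vector to a unit vector, as the text describes, gives the row of transient probabilities for that starting configuration.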

III-C2 Evolution of deferral stage starting configurations

As mentioned before, the starting configuration in a deferral stage is of the form where and We construct a discrete time Markov chain to capture the evolution of the starting configurations across successive deferral stages. To make the chain recurrent, we add a dummy state which captures the end of the deferral phase. Note that the transition from the dummy state is to the configuration since that is the starting configuration on the next deferral phase.

To describe the transition probability matrix corresponding to this discrete time Markov chain, we need the following notation. Within a deferral stage, let denote the probability of being in configuration at time with starting configuration Note that these probabilities are defined in Section III-C1.

Now, let denote the probability that the starting configuration in the next deferral stage is given that the starting configuration in the present deferral stage is where

The transition probabilities are given by

Given these transition probabilities corresponding to this finite irreducible discrete time Markov chain, one can obtain the stationary distribution. Let denote the stationary probability associated with configuration
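Obtaining the stationary distribution of a finite irreducible discrete time Markov chain is a routine numerical step. One simple option, sketched below in Python, is power iteration on the 'lazy' version of the chain, which has the same stationary distribution and avoids problems with periodicity. The function name `stationary_distribution` and the example matrix are illustrative assumptions; the actual transition probabilities of the chain in the text are not reproduced here.

```python
def stationary_distribution(transition, tol=1e-13, max_iter=1_000_000):
    """Stationary distribution of an irreducible finite DTMC.

    Iterates pi <- pi (I + P) / 2. The lazy chain (I + P) / 2 is
    aperiodic and shares its stationary distribution with P.
    """
    n = len(transition)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        stepped = [sum(pi[i] * transition[i][j] for i in range(n))
                   for j in range(n)]
        new = [0.5 * (a + b) for a, b in zip(pi, stepped)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi


if __name__ == "__main__":
    # Toy two-state chain, used only to illustrate the call.
    example = [[0.9, 0.1],
               [0.5, 0.5]]
    print(stationary_distribution(example))
```

For the small state spaces considered in the case studies, a direct linear solve of the balance equations would work equally well; power iteration is shown only because it needs no external libraries.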

III-C3 The equations relating

We begin by relating the long run fractions of time spent in the non-deferral configurations. This is in turn based on the Markov chain depicted in Figure 3. The probabilities are characterized as follows.

Lemma 1



The characterization of follows directly from the discrete-time Markov chain discussed in Section III-C2; indeed, is simply the probability that a deferral phase ends with a transition to the configuration Consider a (discrete-time) renewal process where renewal instants correspond to visits to the dummy state (i.e., the end of a deferral phase) in the above DTMC. Each renewal cycle produces a reward of 1 if it ends with a transition to the configuration and zero otherwise. An application of the renewal reward theorem yields

The statement of the lemma now follows, once we note that the balance equations for the DTMC under consideration imply

Given this result, we have:


Next, we focus on the ratio of long run fraction of time spent in deferral configurations. This is given by


where and .

Finally, it remains to relate the time fractions corresponding to deferral configurations to those corresponding to non-deferral configurations. This is done as follows:


This is justified as follows. gives the rate of transition from non-deferral configurations to deferral configurations. On the other hand, the rate of rearrivals equals gives the probability of exiting the positive deferral configurations at the end of any deferral stage of duration . (9) follows by combining these arguments.

Solving the system of linear equations (7)–(9) along with the normalization condition yields for

III-C4 Blocking probability characterization

The blocking probability can be characterized in a similar manner as before.

Lemma 2

Under the DSRT policy,


Customers are blocked in two ways. First, a customer arrives when the system configuration is Using the PASTA property, the long run fraction of arrivals blocked this way equals which accounts for the first term in (10).

Second, deferred customers can get blocked when all servers are busy upon rearrival. Let denote the number of deferral phases completed until time and denote the number of deferred customers blocked in deferral phase  Let denote the number of arrivals until time  The long run fraction of arrivals blocked upon rearrival equals

Finally, a renewal reward argument analogous to that in the proof of Lemma 1 yields

III-D Case Study

In this section, we numerically compare the performance of the DSRT policy with state-independent policies and the M/M/K/K system in the Halfin-Whitt regime (a.k.a. quality and efficiency driven regime; see [4]). We consider two simple state-independent policies: the first policy assigns a deferral time equal to and the second policy assigns a deferral time uniformly sampled in the interval . The Halfin-Whitt regime corresponds to a sequence of queueing systems parameterized by the number of servers  The arrival rate in the system with servers equals , where is a constant. In our experiments, the service rate for each server is set to be 1, the patience time is set to be 10, and is set to be 0.1.

We compare the policies for the cases when equals 1 and 2. We emphasize that acts as a parameter for the state-independent policies while the deferral limit for DSRT , where . For DSRT, the blocking probability is evaluated numerically using the procedure developed in previous sections and is optimized by tuning the parameter  For the state-independent policies, we run Monte Carlo simulations where the blocking probability is calculated for 100,000 customers. The blocking probability for the M/M/K/K system is calculated numerically using the Erlang-B formula. The blocking probabilities of the policies for the cases where is equal to 1 and 2 are given in Figure 5(a) and Figure 5(b) respectively. We note that DSRT outperforms the state-independent policies, which produce a blocking probability close to the Erlang-B formula.

(a) Maximum Deferrals: 1
(b) Maximum Deferrals: 2
Fig. 5: Performance Evaluation

The optimal value of is plotted against the number of servers in Figure 6. Note that as the number of servers increases, the optimal value of shrinks. Characterizing this behavior analytically presents an interesting avenue for future work.

Fig. 6: Optimal for DSRT

IV Deferral Times Bounded in Expectation

In this section, we consider a relaxation of the deferral constraint considered before, in that the upper bound on deferral times is imposed in expectation (rather than on each instance, as in Section III). Specifically, we require that the expected deferral time should not exceed . This relaxation may be interpreted as an approximation of the ‘hard’ deferral constraint considered before.

In the above setting, we propose and analyse a deferral policy, referred to as the Exponentially Spaced Rearrival Times (ESRT) policy. As the name suggests, this policy introduces an exponentially distributed spacing between rearrival instants. Specifically, the ESRT policy allots exponentially separated deferral times, i.e., the difference between the rearrival times of deferred arrivals is exponential with rate , where , the deferral limit being . This policy is described formally in Algorithm 2. Note that refers to an exponential random variable with rate that is independent of the workload, as well as other inter-rearrival intervals.

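procedure ESRT
     if all servers are busy and the deferral limit has not been reached then
         Schedule rearrival an independent exponentially distributed interval after the last scheduled rearrival (or after the current time if no job is deferred)
     else if all servers are busy then
         Customer blocked
     end if
end procedure
Algorithm 2 ESRT Policy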

Given the memorylessness of the interarrival times, service times, and the time until next rearrival, the evolution of the system under the ESRT policy is described by a continuous time Markov chain over the state space where state indicates that servers are currently active, and there are jobs in deferral. The transition rate diagram for this Markov chain is depicted in Figure 7. Note that memorylessness of the time until the next rearrival results in a much simpler system description (compared to the DSRT policy analysed in the previous section), and yields a much more explicit performance evaluation. In the following, we first compute the stationary distribution corresponding to the ESRT Markov chain, and then use that to characterize the blocking probability.

















Fig. 7: Dynamics of ESRT

IV-A Obtaining the stationary distribution

Matrix geometric methods can be used to obtain the stationary distribution corresponding to the Markov chain shown in Figure 7. Let represent the steady state probability of state

We begin by writing the flow balance equations for states with deferrals

Letting , the above system of equations can be used to characterize up to a multiplicative constant. Specifically, we write


where is a row vector whose last entry is 1.

Now, consider the collection of states corresponding to deferrals, where deferrals. The flow balance equations for these states are as follows

Let , then the above equations can be written as


One can show using a simple manipulation and strict diagonal dominance that is non-singular. Hence, we have the following recursive relation



Finally, considering the flow balance equations for states with no deferrals,

The equations above can be written as implying


where . (Footnote 4: The invertibility of follows since the determinant of is .)

Using (11)–(13), the stationary distribution is completely determined up to the multiplicative constant which can be obtained by applying the law of total probability:

IV-B The blocking probability

Having characterized the stationary distribution associated with the system state, we now express the blocking probability in terms of the stationary distribution. As before, a job may be blocked either on arrival into state or on rearrival if all servers are occupied. Since is the rate of dropped rearrivals, an elementary renewal argument shows that the long run fraction of jobs dropped on rearrival equals Thus,

For the special case the blocking probability takes the following explicit form:

where and .
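The ESRT blocking probability can also be evaluated by solving the Markov chain numerically rather than through the matrix-geometric recursion. The Python sketch below builds the chain from the verbal description of the dynamics (arrivals, service completions, and rearrivals at a constant rate while at least one job is deferred) and finds its stationary distribution by power iteration on the uniformized chain. This is an illustrative alternative, not the authors' procedure; the names `esrt_blocking`, `lam`, `mu`, `nu` (rearrival rate) and `max_deferred` are assumptions, and the constraint linking the rearrival rate to the bound on the expected deferral time is not enforced.

```python
def esrt_blocking(num_servers, lam, mu, nu, max_deferred,
                  tol=1e-13, max_iter=1_000_000):
    """Blocking probability under ESRT from the stationary distribution.

    State (k, n): k busy servers, n jobs in deferral.
    """
    K, N = num_servers, max_deferred
    states = [(k, n) for n in range(N + 1) for k in range(K + 1)]
    index = {s: i for i, s in enumerate(states)}
    transitions = []  # transitions[i] = list of (target index, rate)
    for (k, n) in states:
        out = []
        if k < K:
            out.append((index[(k + 1, n)], lam))      # arrival admitted
        elif n < N:
            out.append((index[(K, n + 1)], lam))      # arrival deferred
        if k > 0:
            out.append((index[(k - 1, n)], k * mu))   # service completion
        if n > 0:
            # Rearrival: admitted if a server is free, dropped otherwise.
            out.append((index[(min(k + 1, K), n - 1)], nu))
        transitions.append(out)
    unif = 1.1 * max(sum(rate for _, rate in out) for out in transitions)
    pi = [1.0 / len(states)] * len(states)
    for _ in range(max_iter):
        new = [0.0] * len(states)
        for i, out in enumerate(transitions):
            stay = 1.0
            for j, rate in out:
                new[j] += pi[i] * rate / unif
                stay -= rate / unif
            new[i] += pi[i] * stay
        diff = max(abs(a - b) for a, b in zip(new, pi))
        pi = new
        if diff < tol:
            break
    dropped_on_arrival = pi[index[(K, N)]]
    full_with_deferrals = sum(pi[index[(K, n)]] for n in range(1, N + 1))
    return dropped_on_arrival + (nu / lam) * full_with_deferrals


if __name__ == "__main__":
    print(esrt_blocking(1, lam=1.0, mu=1.0, nu=1.0, max_deferred=1))
    print(esrt_blocking(1, lam=1.0, mu=1.0, nu=1.0, max_deferred=0))
```

With `max_deferred=0` the chain collapses to the plain loss system, so the function returns the Erlang B value, which gives a quick consistency check.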

IV-C Case Study

Given that the ESRT policy is more tractable than the DSRT policy, our first goal is to check if the performance of the former is a good approximation to the performance of the latter. The numerical analysis is done for the case when , and , i.e., the system is allowed to defer one or two customers. We then consider a sequence of queueing systems under the Halfin-Whitt scaling regime, with In Figure 7(a), we compare the blocking probability under ESRT using the optimal computed value of the policy parameter, with the blocking probability under DSRT using (Footnote 5: It is natural to approximate the performance under DSRT with parameter with the performance under ESRT with ) Note that the two curves are nearly indistinguishable, suggesting that the approximation is sound (at least for small values of ). In Figure 7(b), we plot versus Since decreases with it follows that it is optimal to decrease the spacing between rearrivals as the size of the system grows under the Halfin-Whitt regime. This is to be expected, since the arrival rate and the service rate of the system grow nearly proportionately under this scaling regime.

(a) Approximate
(b) Optimal for ESRT

Next we study the impact of the maximum number of deferrals. Under the Halfin-Whitt scaling regime as before, in Figure 7(c), we plot the optimal (with respect to the deferral parameter ) blocking probability when and along with the Erlang B formula (which corresponds to ). For this case, we set Note that increasing the limit on the number of deferrals does improve the performance significantly. Moreover, it is well-known that the blocking probability of the Erlang B (M/M/K/K) system decays as under the Halfin-Whitt scaling (see [5]). The numerical results, as shown in Figure 7(c), suggest that the decay of blocking probability of ESRT is also (note the nearly linear scaling of blocking probability with on a log-log scale), at least when is small. Having a large value of could possibly allow a decay in the blocking probability. Formalizing this scaling behavior is an interesting avenue for future work.

When is 1000, we also plot the inverse of optimal deferral parameter against the number of servers in Figure 7(d). Notice the nearly linear decay of in the log-log plot, indicating a regularly varying scaling with the number of servers in the Halfin-Whitt regime. Characterizing the optimal deferral parameters for both ESRT and DSRT would also be illuminating.

(c) Performance Evaluation
(d) Optimal for ESRT

V Concluding Remarks

Our work motivates several directions for future work, including the following. First, characterizing the optimal parameter values and optimal blocking probabilities in a large system limit for DSRT/ESRT. Second, designing deferral policies operating under constraints on the number of deferrals and patience time that can beat the blocking probability scaling under the Halfin-Whitt regime. Finding fundamental lower bounds on the blocking probability of such deferral policies would also be interesting.


  • Artalejo and Gómez-Corral [1999] J. R. Artalejo and A. Gómez-Corral. Retrial queueing systems. Mathematical and Computer Modelling, 30(3-4):xiii–xv, 1999.
  • Baccelli and Hebuterne [1981] F. Baccelli and G. Hebuterne. On queues with impatient customers. 1981.
  • Doshi et al. [2015] B. Doshi, C. Kumar, P. Piyush, and M. Vutukuru. WebQ: A virtual queue for improving user experience during web server overload. In 2015 IEEE 23rd International Symposium on Quality of Service (IWQoS), pages 135–140. IEEE, 2015.
  • Halfin and Whitt [1981] S. Halfin and W. Whitt. Heavy-traffic limits for queues with many exponential servers. Operations Research, 29(3):567–588, 1981.
  • Jagerman [1974] D. L. Jagerman. Some properties of the Erlang loss function. Bell System Technical Journal, 53(3):525–551, 1974.
  • Kim and Kim [2016] J. Kim and B. Kim. A survey of retrial queueing systems. Annals of Operations Research, 247(1):3–36, 2016.
  • Zeltyn and Mandelbaum [2005] S. Zeltyn and A. Mandelbaum. Call centers with impatient customers: many-server asymptotics of the M/M/n+G queue. Queueing Systems, 51(3-4):361–402, 2005.