Background and motivation. Load balancing algorithms provide a crucial mechanism for achieving efficient resource allocation in parallel-server systems, ensuring high server utilization and robust user performance. The design of scalable load balancing algorithms has attracted immense interest in recent years, motivated by the challenges involved in dispatching jobs in large-scale cloud networks and data centers with massive numbers of servers.
In particular, token-based algorithms such as the Join-the-Idle-Queue (JIQ) scheme [1, 7] have gained huge popularity recently. In the JIQ scheme, idle servers send tokens to the dispatcher (or one among several dispatchers) to advertise their availability. When a job arrives and the dispatcher has tokens available, it assigns the job to one of the corresponding servers (and disposes of the token). When no tokens are available at the time of a job arrival, the job may either be discarded or forwarded to a randomly selected server. Note that a server only issues a token when a job completion leaves its queue empty. Thus at most one message is generated per job (or possibly two messages, in case a token is revoked when an idle server receives a job through random selection from a dispatcher without any tokens).
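The token mechanics described above can be sketched in a few lines of Python. This is our own toy simulator, not the authors' code; the server count, dispatcher rates, horizon and seed are illustrative parameters, and both the blocking and the queueing variant (including token revocation) are modeled.

```python
import random

def simulate_jiq(num_servers, arrival_rates, horizon, blocking=True, seed=1):
    """Toy continuous-time simulation of JIQ with multiple dispatchers.

    arrival_rates[r] is the Poisson arrival rate at dispatcher r; service
    times are exponential with unit mean.  Returns the fraction of jobs
    that found no token at their dispatcher (blocked jobs in the blocking
    variant, blindly forwarded jobs in the queueing variant)."""
    rng = random.Random(seed)
    queues = [0] * num_servers
    tokens = [set() for _ in arrival_rates]   # idle-server tokens per dispatcher
    owner = {}                                # server -> dispatcher holding its token
    for s in range(num_servers):              # all servers start idle
        d = rng.randrange(len(arrival_rates))
        tokens[d].add(s)
        owner[s] = d
    total_arrival = sum(arrival_rates)
    t, arrivals, misses = 0.0, 0, 0
    while t < horizon:
        busy = [s for s in range(num_servers) if queues[s] > 0]
        total_rate = total_arrival + len(busy)
        t += rng.expovariate(total_rate)
        if rng.random() < total_arrival / total_rate:       # next event: job arrival
            u, d = rng.random() * total_arrival, 0
            while d < len(arrival_rates) - 1 and u > arrival_rates[d]:
                u -= arrival_rates[d]                        # pick the dispatcher
                d += 1
            arrivals += 1
            if tokens[d]:                                   # token available: use it
                s = tokens[d].pop()
                del owner[s]
                queues[s] += 1
            else:
                misses += 1
                if not blocking:                            # forward to a random server
                    s = rng.randrange(num_servers)
                    if s in owner:                          # idle server: revoke its token
                        tokens[owner[s]].discard(s)
                        del owner[s]
                    queues[s] += 1
        else:                                               # next event: service completion
            s = rng.choice(busy)
            queues[s] -= 1
            if queues[s] == 0:                              # now idle: issue a fresh token
                d = rng.randrange(len(arrival_rates))
                tokens[d].add(s)
                owner[s] = d
    return misses / max(arrivals, 1)
```

For instance, `simulate_jiq(10, [2.0, 2.0], 200.0)` estimates the blocking fraction in a small symmetric system, and setting `blocking=False` switches to the queueing variant.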
Under Markovian assumptions, the JIQ scheme achieves a zero probability of wait for any fixed subcritical load per server in a regime where the total number of servers grows large. Thus the JIQ scheme provides asymptotically optimal performance with minimal communication overhead (at most one or two messages per job), and outperforms power-of-d policies as we will further discuss below.
The latter asymptotic optimality of the JIQ scheme prevails in a multiple-dispatcher scenario provided the job arrival rates at the various dispatchers are exactly equal. When the various dispatchers receive jobs from external sources, however, it is difficult to perfectly balance the job arrival rates, and hence it is not uncommon for skewed load patterns to arise.
Key contributions. In the present paper we examine the performance of the JIQ scheme in the presence of possibly heterogeneous dispatcher loads. We distinguish two scenarios, referred to as blocking and queueing, depending on whether jobs are discarded or forwarded to a randomly selected server in the absence of any tokens at the dispatcher. We use exact product-form distributions and fluid-limit techniques to establish that the blocking and wait no longer vanish for asymmetric dispatcher loads as the total number of servers grows large. In fact, even for an arbitrarily small degree of skewness and arbitrarily low overall load, the blocking and wait are strictly positive in the limit. We show that, surprisingly, it is the least-loaded dispatcher that acts as a bottleneck and throttles the flow of tokens. The accumulation of tokens at the least-loaded dispatcher hampers the visibility of idle servers to the heavier-loaded dispatchers, and leaves idle servers stranded while jobs queue up at other servers.
In order to counter the above-described performance degradation for asymmetric dispatcher loads, we introduce two extensions to the basic JIQ scheme. In the first mechanism tokens are not uniformly distributed among dispatchers but in proportion to the respective loads. We prove that this enhancement achieves zero blocking and wait in a many-server regime, for any subcritical overall load and arbitrarily skewed load patterns. In the second approach, tokens are continuously exchanged among the various dispatchers at some exponential rate. We establish that for any load profile with subcritical overall load there exists a finite token exchange rate for which the blocking and wait vanish in the many-server limit. Extensive simulation experiments are conducted to corroborate these results, indicating that they apply even in moderately sized systems.
In summary we make three key contributions:
1) We show how the blocking scenario can be represented in terms of a closed Jackson network. We leverage the associated product-form distribution to express the blocking probability as a function of the relevant load parameters.
2) We use fluid-limit techniques to establish that in both the blocking and the queueing scenario the system performance depends on the aggregate load and the minimum load across all dispatchers. The fluid-limit regime not only offers analytical tractability, but is also highly relevant given the massive numbers of servers in data centers and cloud operations.
3) We propose two enhancements to the basic JIQ scheme where tokens are either distributed non-uniformly or occasionally exchanged among the various dispatchers. We demonstrate that these mechanisms can achieve zero blocking and wait in the many-server limit, for any subcritical overall load and arbitrarily skewed load profiles.
Discussion of alternative schemes and related work. As mentioned above, the JIQ scheme outperforms power-of-d policies in terms of communication overhead and user performance. In a power-of-d policy an incoming job is assigned to a server with the shortest queue among d servers selected uniformly at random from the total pool of N servers. In the absence of memory at the dispatcher(s), this involves an exchange of 2d messages per job (assuming d ≥ 2).
In [8, 17] mean-field limits are established for power-of-d policies in Markovian scenarios with a single dispatcher and identical servers. These results indicate that even a value as small as d = 2 yields significant performance improvements over a purely random assignment scheme (d = 1) in large-scale systems, in the sense that the tail of the queue length distribution at each individual server falls off much more rapidly. This is commonly referred to as the ‘power-of-two’ effect. At the same time, a small value of d significantly reduces the amount of information exchange compared to the classical Join-the-Shortest-Queue (JSQ) policy (which corresponds to d = N) in large-scale systems. These results also extend to heterogeneous servers, non-Markovian service requirements and loss systems [2, 3, 13, 14, 18].
In summary, power-of-d policies involve low communication overhead for fixed d, and can even deliver asymptotically optimal performance when the value of d suitably scales with N [10, 11, 12]. In contrast to the JIQ scheme, however, there is no single value of d for which a power-of-d policy can achieve both low communication overhead and asymptotically optimal performance, which is also reflected in recent results in [4]. The only exception arises in case of batch arrivals when the value of d and the batch size grow large in a specific proportion, as can be deduced from the arguments in [19].
Scenarios with multiple dispatchers have hardly received any attention so far. The results for the JIQ scheme in [7, 9, 16] all assume that the loads at the various dispatchers are strictly equal. We are not aware of any results for heterogeneous dispatcher loads. To the best of our knowledge, power-of-d policies have not been considered in a multiple-dispatcher scenario at all. While the results in [15, 16] show that the JIQ scheme is asymptotically optimal for symmetric dispatcher loads, even when the servers are heterogeneous, it is readily seen that power-of-d policies cannot even be maximally stable in that case for any fixed value of d.
Organization of the paper. The remainder of the paper is organized as follows. In Section II we present a detailed model description, specify the two proposed enhancements and state the main results. In Section III we describe how the blocking scenario can be represented in terms of a closed Jackson network, and leverage the associated product-form distribution to obtain an insightful formula for the blocking probability. We then turn to a fluid-limit approach in Section IV to analyze the two proposed enhancements in the blocking scenario. A similar analysis is adopted in Section V in the queueing scenario to obtain results for the basic model and the enhanced variants. Finally, in Section VI we make some concluding remarks and briefly discuss future research directions.
II Model description, notation and key results
We consider a system with N parallel identical servers and a fixed set of R dispatchers (with R not depending on N), as depicted in Figure 1. Jobs arrive at dispatcher r as a Poisson process of rate λ_r N, with λ_r > 0, r = 1, …, R, and λ = λ_1 + ⋯ + λ_R denoting the job arrival rate per server. Without loss of generality we assume that the dispatchers are indexed such that λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_R. The job processing requirements are independent and exponentially distributed with unit mean at each of the servers.
When a server becomes idle, it sends a token to one of the dispatchers selected uniformly at random, advertising its availability. When a job arrives at a dispatcher which has tokens available, one of the tokens is selected, and the job is immediately forwarded to the corresponding server.
We distinguish two scenarios when a job arrives at a dispatcher which has no tokens available, referred to as the blocking and queueing scenario respectively. In the blocking scenario, the incoming job is blocked and instantly discarded. In the queueing scenario, the arriving job is forwarded to one of the servers selected uniformly at random. If the selected server happens to be idle, then the outstanding token at one of the other dispatchers is revoked.
In the queueing scenario we assume λ < 1, which is not only necessary but also sufficient for stability. It is not difficult to show that the joint queue length process is stochastically majorized by the case where each job is sent to a server selected uniformly at random. In the latter case, the system decomposes into N independent M/M/1 queues, each of which has load λ < 1 and is stable.
Denote by X(t) the number of busy servers and by Y_r(t) the number of tokens held by dispatcher r at time t, r = 1, …, R. Note that Y_1(t) + ⋯ + Y_R(t) equals the number of idle servers for all t. Also, denote by Z_i(t) the number of servers with i jobs (including a possible job being processed) at time t, i = 0, 1, …, so that Z_0(t) + Z_1(t) + ⋯ = N.
In the blocking scenario, no server can have more than one job, so each server is either idle or serving a single job. Because of the symmetry among the servers, the state of the system can thus be described by the vector of token counts at the R dispatchers, and evolves as a Markov process; its state space consists of all token vectors with nonnegative components summing to at most N.
Likewise, in the queueing scenario, the state of the system can be described by the token counts at the various dispatchers together with the numbers of servers with given queue lengths, and also evolves as a Markov process.
Denote by B(N) the steady-state blocking probability of an arbitrary job in the blocking scenario. Also, denote by W(N) a random variable with the steady-state waiting-time distribution of an arbitrary job in the queueing scenario.
In Section III we will prove the following theorem for the blocking scenario.
Theorem 1 (Least-loaded dispatcher determines blocking). As N → ∞, B(N) → max{1 − Rλ_R/λ, 1 − 1/λ}, with λ the aggregate job arrival rate per server and λ_R the load of the least-loaded dispatcher.
Theorem 1 shows that in the many-server limit the system performance in terms of blocking is either determined by the relative load of the least-loaded dispatcher, or by the aggregate load. This may be informally explained as follows. Let δ be the expected fraction of busy servers in steady state, so that tokens are generated at rate δN and each dispatcher receives tokens on average at a rate δN/R. We distinguish two cases, depending on whether a positive fraction of the tokens reside at the least-loaded dispatcher R in the limit or not. If that is the case, then the job arrival rate at dispatcher R must equal the rate at which it receives tokens, i.e., λ_R N = δN/R, so that δ = Rλ_R. Otherwise, the job arrival rate at dispatcher R must be no less than the rate at which it receives tokens, i.e., λ_R ≥ δ/R. Since dispatcher R is the least loaded, it then follows that λ_r ≥ δ/R for all r, which means that the job arrival rate at all the dispatchers is higher than the rate at which tokens are received. Thus the fraction of tokens at each dispatcher is zero in the limit, i.e., the fraction of idle servers is zero, implying δ = 1. Combining the two cases, and observing that δ ≤ 1, we conclude δ = min{Rλ_R, 1}. Because of Little’s law, δ is related to the blocking probability B as δ = λ(1 − B). This yields B = 1 − min{Rλ_R, 1}/λ, or equivalently, B = max{1 − Rλ_R/λ, 1 − 1/λ}, as stated in Theorem 1.
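As a concrete illustration (the numbers are ours, chosen for exposition; the formula B = max{1 − Rλ_R/λ, 1 − 1/λ} is our reconstruction of Theorem 1 from the discussion above): take R = 2 dispatchers with λ_1 = 0.3 and λ_2 = 0.1, so λ = 0.4. Then

```latex
B \;=\; \max\Bigl\{\,1-\tfrac{R\lambda_R}{\lambda},\; 1-\tfrac{1}{\lambda}\Bigr\}
  \;=\; \max\Bigl\{\,1-\tfrac{2\cdot 0.1}{0.4},\; 1-\tfrac{1}{0.4}\Bigr\}
  \;=\; \max\{0.5,\,-1.5\} \;=\; 0.5,
```

so even though the overall load is only 0.4, half of the jobs are blocked in the limit; with perfectly balanced loads (λ_1 = λ_2 = 0.2) the same formula gives B = 0.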
The above explanation also reveals that in the limit dispatcher (or the set of least-loaded dispatchers in case of ties) inevitably ends up with all the available tokens, if any. The accumulation of tokens hampers the visibility of idle servers to the heavier-loaded dispatchers, and leaves idle servers stranded while jobs queue up at other servers.
Figure 2 illustrates Theorem 1, and clearly reflects the two separate regions in which the blocking probability depends on either the minimum load or the aggregate load. The line represents the cross-over curve between these two regions.
In Section V we will establish the following theorem for the queueing scenario.
Theorem 2 (Mean waiting time).
Theorem 2 expresses the limiting mean waiting time in the queueing scenario in terms of the aggregate load λ and the minimum load λ_R across the dispatchers. The central quantity in the expression can be interpreted as the rate at which jobs are forwarded to randomly selected servers. Furthermore, dispatchers then receive tokens at a lower rate than incoming jobs arrive, and this forwarding rate is strictly positive if and only if λ_1 > λ_R. When the dispatcher loads are equal, the expression in Theorem 2 simplifies accordingly.
When the arrival rates at all dispatchers are strictly equal, i.e., λ_r = λ/R for all r, Theorems 1 and 2 indicate that the stationary blocking probability and the mean waiting time asymptotically vanish in a regime where the total number of servers grows large, which is in agreement with the results in [9, 16]. However, when the arrival rates at the various dispatchers are not perfectly equal, so that λ_1 > λ_R, the blocking probability and mean wait are strictly positive in the limit, even for arbitrarily low overall load and an arbitrarily small degree of skewness in the arrival rates. Thus, the basic JIQ scheme fails to achieve asymptotically optimal performance when the dispatcher loads are not strictly equal.
In order to counter the above-described performance degradation for asymmetric dispatcher loads, we propose two enhancements.
Enhancement 1 (Non-uniform token allotment).
When a server becomes idle, it sends a token to dispatcher r with probability β_r, where β_1, …, β_R are design parameters with β_1 + ⋯ + β_R = 1.
Enhancement 2 (Token exchange mechanism).
Any token is transferred to a dispatcher selected uniformly at random at an exponential rate ν.
Note that the token exchange mechanism only creates a constant communication overhead per job as long as the rate ν does not depend on the number of servers N, and thus preserves the scalability of the basic JIQ scheme.
The above enhancements can achieve asymptotically optimal performance for suitable values of the parameters β_r and the exchange rate ν, as stated in the next proposition.
Proposition 1 (Vanishing blocking and waiting). In both the blocking and the queueing scenario, the blocking probability and the mean waiting time vanish as N → ∞ under Enhancement 1 with β_r = λ_r/λ for all r, as well as under Enhancement 2 with a sufficiently large, finite exchange rate ν.
The minimum value of β_1 required in the blocking scenario may be intuitively understood as follows. Zero blocking means that a fraction λ of the servers must be busy, so that tokens are generated at rate λN, while the tokens of the remaining fraction 1 − λ of the servers reside with the various dispatchers. The heaviest-loaded dispatcher 1 must then receive enough tokens for all incoming jobs: β_1 λN ≥ λ_1 N, which is satisfied by the given minimum value β_1 = λ_1/λ.
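A similar balance argument suggests a sufficient exchange rate for Enhancement 2. This is our own sketch based on the reconstructed token-flow rates: with uniform allotment, dispatcher 1 receives tokens from job completions at rate λN/R and from the exchange mechanism at rate at most ν(1 − λ)N/R, and it must keep up with its arrival rate λ_1 N:

```latex
\frac{\lambda}{R} + \nu\,\frac{1-\lambda}{R} \;\ge\; \lambda_1
\quad\Longleftrightarrow\quad
\nu \;\ge\; \frac{R\lambda_1 - \lambda}{1-\lambda},
```

so a finite exchange rate suffices for any subcritical overall load λ < 1, consistent with Proposition 1.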
A similar reasoning applies to the queueing scenario, although in that case the number of servers with exactly one job no longer equals the number of busy servers, and a different approach is needed.
In order to establish Propositions 1 and 2, we examine in Sections IV and V the fluid limits for the blocking and queueing scenarios, respectively. Rigorous proofs to establish weak convergence to the fluid limit are omitted, but can be constructed along similar lines as in [5]. The fluid-limit regime not only provides mathematical tractability, but is also particularly relevant given the massive numbers of servers in data centers and cloud operations. Simulation experiments will be conducted to verify the accuracy of the fluid-limit approximations, and show an excellent match, even in small systems (small values of N).
III Jackson network representation
In this section we describe how the blocking scenario can be represented in terms of a closed Jackson network. We leverage the associated product-form distribution to express the asymptotic blocking probability as a function of the aggregate load and the minimum load across all dispatchers, proving Theorem 1.
We view the system dynamics in the blocking scenario in terms of a fixed total population of N tokens that circulate through a network of R + 1 stations. Specifically, a token resides either at station 0, meaning that the corresponding server is busy, or at station r, indicating that the corresponding server is idle and has an outstanding token with dispatcher r, r = 1, …, R.
Let μ_j(n) denote the service rate at station j when n tokens are present. Then μ_0(n) = n and μ_r(n) = λ_r N for n ≥ 1. The service times are exponentially distributed at all stations, but station 0 is an infinite-server node with mean service time 1, while station r is a single-server node with mean service time 1/(λ_r N), r = 1, …, R. The routing probabilities of tokens moving from station 0 to station r are 1/R, r = 1, …, R, and tokens always return from station r to station 0. With γ_j denoting the throughput of tokens at station j, the traffic equations γ_0 = γ_1 + ⋯ + γ_R and γ_r = γ_0/R, r = 1, …, R,
uniquely determine the relative values of the throughputs.
Let π(n_0, n_1, …, n_R) be the stationary probability that the process resides in state (n_0, n_1, …, n_R), with n_0 + n_1 + ⋯ + n_R = N. The theory of closed Jackson networks [6] implies
π(n_0, n_1, …, n_R) = C (γ_0^{n_0}/n_0!) Π_{r=1}^{R} (γ_r/(λ_r N))^{n_r},
with C a normalization constant.
The blocking probability can then be expressed by summing the probabilities over all the states with n_r = 0, in which no tokens are available at dispatcher r, and weighting these with the fractions λ_r/λ, r = 1, …, R:
B(N) = Σ_{r=1}^{R} (λ_r/λ) Σ_{n: n_r = 0} π(n_0, n_1, …, n_R).
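This computation is easy to carry out numerically for small systems. The sketch below is our own illustration (not the authors' code); it takes the relative throughputs γ_0 = 1, γ_r = 1/R from the traffic equations above and enumerates all token configurations:

```python
from itertools import product
from math import factorial

def blocking_probability(N, lam):
    """Blocking probability of the JIQ blocking scenario via the closed
    Jackson network product form.  lam[r] is the per-server load of
    dispatcher r; station 0 (busy servers) is an infinite-server node,
    station r (tokens at dispatcher r) a single-server node with service
    rate lam[r]*N.  Relative throughputs: gamma_0 = 1, gamma_r = 1/R."""
    R, total = len(lam), sum(lam)
    weight = {}
    for n in product(range(N + 1), repeat=R):    # token counts at the dispatchers
        if sum(n) > N:
            continue
        n0 = N - sum(n)                          # tokens at busy servers
        w = 1.0 / factorial(n0)                  # infinite-server term gamma_0^n0 / n0!
        for r in range(R):
            w *= (1.0 / (R * lam[r] * N)) ** n[r]   # (gamma_r / (lam_r N))^{n_r}
        weight[n] = w
    Z = sum(weight.values())                     # normalization constant
    # a job arriving at dispatcher r is blocked iff n_r = 0
    return sum((lam[r] / total) * sum(w for n, w in weight.items() if n[r] == 0)
               for r in range(R)) / Z
```

For skewed loads such as `blocking_probability(30, [0.3, 0.1])` the result is already close to the limiting value 0.5 predicted by Theorem 1, while for balanced loads of the same total it is far smaller.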
Despite this rather complicated expression, Theorem 1 provides a compact characterization of the blocking probability in the many-server limit , as will be proved in the Appendix. The proof uses stochastic coupling, for which we define a ‘better’ system and a ‘worse’ system. Both systems are amenable to analysis and have an identical blocking probability in the many-server limit .
The better system merges the first R − 1 dispatchers into one super-dispatcher, which results in two dispatchers with arrival rates (λ − λ_R)N and λ_R N, respectively. However, in contrast to the original blocking scenario, when a job is completed and leaves a server idle, a token is not sent to either dispatcher with equal probability. Instead, tokens are sent to the super-dispatcher with probability (R − 1)/R. To analyze this better system, we study in Subsection A-A the blocking scenario with two dispatchers enhanced with non-uniform token allotment.
The worse system thins the incoming streams of jobs at the dispatchers, so that some jobs are blocked, irrespective of whether or not the dispatcher has any tokens available. This thinning process is defined as follows: a job arriving at dispatcher r is blocked with probability 1 − λ_R/λ_r. The thinning is designed in such a way that the system with admitted jobs behaves as a system with total arrival rate Rλ_R N in which all arrival rates are equal, which is analyzed in Subsection A-B.
With coupling, one can show that the blocking probability of the ‘better system’ is lower and that of the ‘worse system’ is higher, which completes the proof. Specifically, when the arrival epochs, the service times and the token allotment are coupled, the number of tokens used at each dispatcher by time t is always lower in the worse system and higher in the better system. Due to page limitations, the detailed coupling arguments are omitted, but it is intuitively clear that the better system performs better and the worse system performs worse. Namely, the tokens at dispatchers 1 to R − 1 are consolidated in the better system: if there is at least one token among these dispatchers, any job arriving at any of them can make use of a token, whereas in the original system a job may be blocked when the tokens among the first R − 1 dispatchers are not present at the dispatcher at which the job arrives. The worse system obviously performs worse, since blocking jobs beforehand has no benefits for the acceptance of jobs.
IV Fluid limit in the blocking scenario
We now turn to the fluid-limit analysis and start with the blocking scenario. We consider a sequence of systems indexed by the total number of servers N. Denote by δ^N(t) the fraction of busy servers and by u_r^N(t) the number of tokens held by dispatcher r, normalized by N, in the N-th system at time t. Further define u^N(t) = (u_1^N(t), …, u_R^N(t)) and assume that (δ^N(0), u^N(0)) → (δ(0), u(0)) as N → ∞, with δ(0) + u_1(0) + ⋯ + u_R(0) = 1. Then any weak limit (δ(t), u(t)) of the sequence (δ^N(t), u^N(t)) as N → ∞ is called a fluid limit.
The fluid limit obeys the differential equations
dδ(t)/dt = Σ_{r=1}^{R} ξ_r(t) − δ(t),   (2)
du_r(t)/dt = β_r δ(t) + ν((1 − δ(t))/R − u_r(t)) − ξ_r(t),   (3)
where
ξ_r(t) = λ_r if u_r(t) > 0, and ξ_r(t) = min{λ_r, β_r δ(t) + ν((1 − δ(t))/R − u_r(t))} if u_r(t) = 0,   (4)
with β_r = 1/R and ν = 0 for the basic JIQ scheme, and initial condition (δ(0), u(0)).
The above set of fluid-limit equations may be interpreted as follows. The term ξ_r(t) represents the (scaled) rate at which dispatcher r uses tokens and forwards incoming jobs to idle servers at time t. Equation 4 reflects that the latter rate equals the job arrival rate λ_r, unless the fraction of tokens held by dispatcher r is zero (u_r(t) = 0) and the rate at which it receives tokens from idle servers or through the exchange mechanism is less than the job arrival rate. Equation 2 states that the rate of change in the fraction of busy servers is the difference between the aggregate rate Σ_r ξ_r(t) at which the various dispatchers use tokens and forward jobs to idle servers, and the rate δ(t) at which jobs are completed and busy servers become idle. Equation 3 captures that the rate of change of the fraction u_r(t) of tokens held by dispatcher r is the balance of the rate at which it receives tokens from idle servers or through the exchange mechanism, and the rate at which it uses tokens and forwards jobs to idle servers or releases tokens through the exchange mechanism.
Figure 3 shows the exact and simulated fluid-limit trajectories. We observe that the simulation results closely match the fluid-limit dynamics. We further note that in the long run only the dispatcher with the lowest arrival rate holds a strictly positive fraction of the tokens, corroborating Theorem 1.
IV-A Fixed-point analysis
In order to determine the fixed point(s) (δ*, u*), we set dδ(t)/dt = 0 and du_r(t)/dt = 0 for all r, and obtain
Without proof, we assume that the many-server (N → ∞) and stationary (t → ∞) limits commute, so that δ* is also the limit of the mean fraction of busy servers in stationarity. Because of Little’s law, the limit B of the blocking probability satisfies δ* = λ(1 − B).
This in particular implies that δ* = λ leads to B = 0: vanishing blocking.
Now let S = {r : λ_r = λ_R} be the index set of the least-loaded dispatchers. Equation 9 forces u_r* = 0 for all r ∉ S.
We now distinguish two cases, depending on whether or not u_r* = 0 for all r ∈ S as well. If that is the case, then we must have u_1* + ⋯ + u_R* = 1 − δ* = 0, and hence δ* = 1, i.e., Rλ_R ≥ 1. Otherwise, we must have λ_R = δ*/R, i.e., δ* = Rλ_R, so δ* ≤ 1 forces Rλ_R ≤ 1.
In conclusion, we have δ* = min{Rλ_R, 1}. When Rλ_R ≥ 1, so that δ* = 1, it must be the case that u_r* = 0 for all r. When Rλ_R < 1, so that δ* = Rλ_R, any vector u* with u_r* = 0 for all r ∉ S and Σ_{r∈S} u_r* = 1 − Rλ_R is a fixed point. In particular, for equal dispatcher loads, i.e., λ_r = λ/R for all r, so that Rλ_R = λ, we have u_r* = 0 for all r when λ ≥ 1, while any vector u* with Σ_r u_r* = 1 − λ is a fixed point when λ < 1.
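The fixed-point behavior can be checked numerically. The sketch below integrates our reconstruction of the fluid dynamics (our reading of Equations 2–4, with token-use rates ξ_r, allotment probabilities β_r and exchange rate ν; it is an illustration, not a verbatim transcription) with a simple forward-Euler scheme:

```python
def fluid_step(delta, u, lam, beta, nu, dt):
    """One forward-Euler step of the (reconstructed) blocking-scenario fluid
    dynamics: dispatcher r uses tokens at rate xi_r, receives tokens at rate
    beta[r]*delta plus exchange inflow nu*(1-delta)/R, and loses tokens to
    the exchange at rate nu*u[r]."""
    R = len(lam)
    xi, du = [], []
    for r in range(R):
        recv = beta[r] * delta + nu * ((1.0 - delta) / R - u[r])
        xi.append(lam[r] if u[r] > 0.0 else min(lam[r], recv))
        du.append(recv - xi[r])
    new_u = [max(0.0, u[r] + dt * du[r]) for r in range(R)]
    return delta + dt * (sum(xi) - delta), new_u

def run_fluid(lam, beta=None, nu=0.0, T=200.0, dt=0.01):
    R = len(lam)
    beta = beta or [1.0 / R] * R                 # uniform allotment by default
    delta, u = 0.0, [1.0 / R] * R                # all servers idle initially
    for _ in range(int(T / dt)):
        delta, u = fluid_step(delta, u, lam, beta, nu, dt)
    return delta, u
```

For λ = (0.3, 0.1), `run_fluid([0.3, 0.1])` settles at δ* = Rλ_R = 0.2 with essentially all tokens at the least-loaded dispatcher, whereas both `run_fluid([0.3, 0.1], beta=[0.75, 0.25])` (Enhancement 1) and `run_fluid([0.3, 0.1], nu=2.0)` (Enhancement 2) settle at δ* = λ = 0.4, i.e., vanishing blocking.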
In Table I we compare the fluid-limit approximations for the blocking probability with the exact formula from the Jackson network representation and simulation results for various numbers of servers.
We now examine the behavior of the system for Enhancements 1 and 2, and show that they can achieve asymptotically zero blocking for any λ < 1 and suitable parameter values as identified in Proposition 1. In light of Equation 8 it suffices to show that δ* = λ for both enhancements.
which by Equation 5 gives δ* = λ, so that B = 0.
Figure 4 displays the blocking probability for the system with both enhancements. Since zero blocking requires β_1 ≥ λ_1/λ, the choice β_1 = λ_1/λ is optimal. The blocking probability decreases as β_1 approaches λ_1/λ and as the exchange rate ν increases. Moreover, it suffices to choose β_1 close to λ_1/λ, which implies that it is not necessary to know the exact loads for the enhancements to be effective.
V Fluid limit in the queueing scenario
We now proceed to the queueing scenario (with λ < 1 for stability). As before, we consider a sequence of systems indexed by the total number of servers N. Denote by f_i^N(t) the fraction of servers with i jobs and by u_r^N(t) the number of tokens held by dispatcher r, normalized by N, in the N-th system at time t, i = 0, 1, …, r = 1, …, R. Further define f^N(t) = (f_0^N(t), f_1^N(t), …) and u^N(t) = (u_1^N(t), …, u_R^N(t)), and assume that (f^N(0), u^N(0)) → (f(0), u(0)) as N → ∞, with Σ_i f_i(0) = 1. Then any weak limit (f(t), u(t)) of the sequence as N → ∞ is called a fluid limit.
and initial condition (f(0), u(0)).
The above set of fluid-limit equations may be interpreted as follows. Similarly as in the blocking scenario, the term ξ_r(t) represents the (scaled) rate at which dispatcher r uses tokens and forwards incoming jobs to idle servers at time t. Accordingly, Σ_{r=1}^{R} ξ_r(t) is the aggregate rate at which dispatchers use tokens to forward jobs to (guaranteed) idle servers at time t, while λ − Σ_{r=1}^{R} ξ_r(t) is the aggregate rate at which jobs are forwarded to randomly selected servers (which may or may not be idle). Equation 10 reflects that the rate of change in the fraction f_0(t) of idle servers is the difference between the aggregate rate at which jobs are completed by servers with one job, and the rate at which dispatchers use tokens to forward jobs to idle servers plus the rate at which jobs are forwarded to randomly selected servers that happen to be idle. Equation 11 states that the rate of change in the fraction f_i(t) of servers with i jobs is the balance of the rate at which jobs are forwarded to randomly selected servers with i − 1 jobs plus the aggregate rate at which jobs are completed by servers with i + 1 jobs, and the rate at which jobs are forwarded to randomly selected servers with i jobs plus the aggregate rate at which jobs are completed by servers with i jobs. In case i = 1, the rate at which dispatchers use tokens to forward jobs to idle servers should be included as an additional positive term.
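These dynamics can also be integrated numerically. The sketch below is our reconstruction of the verbal description above (not a verbatim transcription of Equations 10–12): f[i] tracks the fraction of servers with i jobs, truncated at a level K, the blind-forwarding rate is λ minus the aggregate token-use rate, and the term blind*u[r] reflects token revocation when a blindly forwarded job lands on an idle server.

```python
def queueing_fluid(lam, beta=None, T=400.0, dt=0.01, K=60):
    """Forward-Euler sketch of the (reconstructed) queueing-scenario fluid
    dynamics: f[i] is the fraction of servers with i jobs (truncated at K),
    u[r] the token fraction at dispatcher r, xi[r] the token-use rate of
    dispatcher r, and 'blind' the rate of jobs sent to random servers."""
    R = len(lam)
    beta = beta or [1.0 / R] * R
    f = [1.0] + [0.0] * K                        # all servers start idle
    u = [1.0 / R] * R                            # one outstanding token per idle server
    for _ in range(int(T / dt)):
        xi = [lam[r] if u[r] > 0.0 else min(lam[r], beta[r] * f[1])
              for r in range(R)]
        blind = sum(lam) - sum(xi)               # jobs forwarded blindly
        df = [f[1] - sum(xi) - blind * f[0]]     # idle servers (Equation 10)
        for i in range(1, K):                    # servers with i jobs (Equation 11)
            d = blind * f[i - 1] + f[i + 1] - (blind + 1.0) * f[i]
            if i == 1:
                d += sum(xi)                     # token-forwarded jobs land on idle servers
            df.append(d)
        df.append(0.0)                           # negligible mass at the truncation level
        u = [max(0.0, u[r] + dt * (beta[r] * f[1] - xi[r] - blind * u[r]))
             for r in range(R)]
        f = [max(0.0, f[i] + dt * df[i]) for i in range(K + 1)]
    return sum(i * fi for i, fi in enumerate(f)), f, u
```

For λ = (0.2, 0.2) the mean number of jobs per server settles at λ = 0.4 (zero wait), whereas for λ = (0.3, 0.1) it settles strictly above 0.4 (positive wait) under the basic scheme, and back at 0.4 under the non-uniform allotment β = (0.75, 0.25).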
V-a Fixed-point analysis
In order to determine the fixed point(s) (f*, u*), we set df_i(t)/dt = 0 for all i, and du_r(t)/dt = 0 for all r. We obtain
Solving Equation 14 gives
Thus the mean number of jobs at a server is
As for the blocking scenario, we assume that the many-server and stationary limits commute. Little’s law then gives
where the left-hand side represents the mean number of jobs at a server in the many-server limit. We use Equation 15 to obtain
which shows a vanishing wait in case Σ_{r=1}^{R} ξ_r* = λ, i.e., when no jobs are forwarded to randomly selected servers.
We also obtain the following equations for the fixed point:
Basic JIQ scheme. In case β_r = 1/R for all r and ν = 0, Equation 21 can be rewritten to
and by further calculations, using that this expression is monotonically decreasing, to the expression for the limiting mean waiting time in Theorem 2.
In Table II we compare the fluid-limit approximations for the mean waiting time with simulation figures for various numbers of servers.
Table II shows that the fluid-limit analysis agrees with the simulation results, although the number of servers needs to be larger than in the blocking scenario for extremely high accuracy to be observed. Similarly to Table I, the more symmetric the loads, the better the performance and the lower the mean waiting time, which is in line with Theorem 2.
We examine the behavior of the system for Enhancements 1 and 2 and show that they can achieve asymptotically zero waiting for any λ < 1 and suitable parameter values as identified in Proposition 1. In view of Equation 16 it suffices to show that Σ_{r=1}^{R} ξ_r* = λ for both enhancements. We first consider Enhancement 1, in which β_r = λ_r/λ for all r and ν = 0. Equation 20 gives ξ_r* = λ_r for all r. We obtain
which has a unique solution with Σ_{r=1}^{R} ξ_r* = λ, so that the wait vanishes.
which by Equation 19 gives ξ_r* = λ_r for all r, and since Σ_r λ_r = λ, we again obtain a vanishing wait.
Figure 5 displays the mean waiting time for the system with both enhancements. Similarly to Figure 4, we can greatly improve the performance by tuning β_1 and ν. Again zero waiting requires β_1 ≥ λ_1/λ, so that β_1 = λ_1/λ is the best choice. The mean waiting time decreases as β_1 approaches λ_1/λ, or as the exchange rate ν increases. Exact knowledge of the arrival rates is not required, and a rough approximation of the loads and a small value of ν are sufficient for the mean waiting time to vanish.
VI Conclusion
We examined the performance of the Join-the-Idle-Queue (JIQ) scheme in large-scale systems with several possibly heterogeneous dispatchers. We used product-form representations and fluid limits to show that the basic JIQ scheme fails to deliver zero blocking and wait for any asymmetric dispatcher loads, even for arbitrarily low overall load. Remarkably, it is the least-loaded dispatcher that throttles tokens and leaves idle servers stranded, thus acting as a bottleneck.
In order to counter the performance degradation for asymmetric dispatcher loads, we introduced two extensions of the basic JIQ scheme where tokens are either distributed non-uniformly or occasionally exchanged among the various dispatchers. We proved that these extensions can achieve zero blocking and wait in the many-server limit, for any subcritical overall load and arbitrarily skewed load profiles. Extensive simulation experiments corroborated these results, indicating that they apply even in moderately sized systems.
It is worth emphasizing that the proposed enhancements involve no or constant additional communication overhead per job, and hence retain the scalability of the basic JIQ scheme. The algorithms do rely on suitable parameter settings, and it would be of interest to develop learning techniques for determining these settings adaptively.
While we allowed the dispatchers to be heterogeneous, we assumed all the servers to be statistically identical, and the service requirements to be exponentially distributed. As noted earlier, Theorem 1 in fact holds for non-exponential service requirement distributions as well. In ongoing work we aim to extend Propositions 1 and 2 to possibly non-exponential service requirement distributions.
This work is supported by the NWO Gravitation Networks grant 024.002.003, an NWO TOP-GO grant and an ERC Starting Grant.
[1] R. Badonnel, M. Burgess (2008). Dynamic pull-based load balancing for autonomic servers. Proc. IEEE NOMS 2008, 751–754.
[2] M. Bramson, Y. Lu, B. Prabhakar (2010). Randomized load balancing with general service time distributions. ACM SIGMETRICS Perf. Eval. Rev. 38 (1), 275–286.
[3] M. Bramson, Y. Lu, B. Prabhakar (2012). Asymptotic independence of queues under randomized load balancing. Queueing Systems 71 (3), 247–292.
[4] D. Gamarnik, J. Tsitsiklis, M. Zubeldia (2016). Delay, memory and messaging tradeoffs in distributed service systems. ACM SIGMETRICS Perf. Eval. Rev. 44 (1), 1–12.
[5] P. Hunt, T. Kurtz (1994). Large loss networks. Stoc. Proc. Appl. 53 (2), 363–378.
[6] F.P. Kelly (2011). Reversibility and Stochastic Networks. Cambridge University Press, New York.
[7] Y. Lu, Q. Xie, G. Kliot, A. Geller, J. Larus, A. Greenberg (2011). Join-Idle-Queue: a novel load balancing algorithm for dynamically scalable web services. Perf. Eval. 68 (11), 1056–1071.
[8] M. Mitzenmacher (2001). The power of two choices in randomized load balancing. IEEE Trans. Par. Distr. Syst. 12 (10), 1094–1104.
[9] M. Mitzenmacher (2016). Analyzing distributed Join-Idle-Queue: a fluid limit approach. Preprint, arXiv:1606.01833.
[10] D. Mukherjee, S.C. Borst, J.S.H. van Leeuwaarden, P.A. Whiting (2016). Asymptotic optimality of power-of-d load balancing in large-scale systems. Preprint, arXiv:1612.00722.
[11] D. Mukherjee, S.C. Borst, J.S.H. van Leeuwaarden, P.A. Whiting (2016). Universality of power-of-d load balancing in many-server systems. Preprint, arXiv:1612.00723.
[12] D. Mukherjee, S.C. Borst, J.S.H. van Leeuwaarden, P.A. Whiting (2016). Universality of power-of-d load balancing schemes. ACM SIGMETRICS Perf. Eval. Rev. 44 (2), 36–38.
[13] A. Mukhopadhyay, A. Karthik, R.R. Mazumdar (2015). Randomized assignment of jobs to servers in heterogeneous clusters of shared servers for low delay. Stoc. Syst. 6 (1), 90–131.
[14] A. Mukhopadhyay, A. Karthik, R.R. Mazumdar, F. Guillemin (2016). Mean field and propagation of chaos in multi-class heterogeneous loss models. Perf. Eval. 91, 117–131.
[15] A.L. Stolyar (2015). Pull-based load distribution in large-scale heterogeneous service systems. Queueing Systems 80 (4), 341–361.
[16] A.L. Stolyar (2015). Pull-based load distribution among heterogeneous parallel servers: the case of multiple routers. Queueing Systems, to appear.
[17] N. Vvedenskaya, R. Dobrushin, F. Karpelevich (1996). Queueing system with selection of the shortest of two queues: an asymptotic approach. Prob. Inf. Trans. 32 (1), 20–34.
[18] Q. Xie, X. Dong, Y. Lu, R. Srikant (2015). Power of d choices for large-scale bin packing: a loss model. ACM SIGMETRICS Perf. Eval. Rev. 43 (1), 321–334.
[19] L. Ying, R. Srikant, X. Kang (2015). The power of slightly more than one sample in randomized load balancing. Proc. IEEE INFOCOM 2015, 1131–1139.
Appendix A Proof of Theorem 1
A-A Blocking model with two dispatchers (better system)
Consider the blocking scenario with N servers, two dispatchers and arrival rates λ_1 N and λ_2 N. The probability of sending a token to dispatcher 1 is β, and to dispatcher 2 it is 1 − β. Without loss of generality, assume λ_1 ≥ λ_2. Since this is a closed Jackson network (see Section III), we have the stationary distribution
with C the normalization constant. We use this to prove the following proposition.
Proposition 2 (Limiting blocking probability with two dispatchers).
A-B Symmetric blocking model (worse system)
Consider the blocking model with N servers and R dispatchers, assume equal arrival rates λ_r = λ/R at all dispatchers, and use the stationary distribution and blocking probability provided in Section III.
Proposition 3 (Recursive formula blocking probability).
Proposition 4 (Vanishing blocking probability in symmetric systems).
For λ < 1, the blocking probability vanishes as N → ∞.
The proofs again start from an exact expression for the blocking probability obtained from the stationary distribution: