In this survey we review scalable load balancing algorithms (LBAs) which achieve excellent delay performance in large-scale systems and yet only involve low implementation overhead. LBAs play a critical role in distributing service requests or tasks (e.g. compute jobs, database look-ups, file transfers) among servers or distributed resources in parallel-processing systems. The analysis and design of LBAs have attracted strong attention in recent years, mainly spurred by crucial scalability challenges arising in cloud networks and data centers with massive numbers of servers. LBAs can be broadly categorized as static, dynamic, or some intermediate blend, depending on the amount of feedback or state information (e.g. congestion levels) that is used in allocating tasks. The use of state information naturally allows dynamic policies to achieve better delay performance, but also involves higher implementation complexity and a substantial communication burden. The latter issue is particularly pertinent in cloud networks and data centers with immense numbers of servers handling a huge influx of service requests. In order to capture the large-scale context, we examine scalability properties through the prism of asymptotic scalings where the system size grows large, and identify LBAs which strike an optimal balance between delay performance and implementation overhead in that regime.
The most basic load balancing scenario consists of $N$ identical parallel servers and a dispatcher where tasks arrive that must immediately be forwarded to one of the servers. Tasks are assumed to have unit-mean exponentially distributed service requirements, and the service discipline at each server is supposed to be oblivious to the actual service requirements. In this canonical setup, the celebrated Join-the-Shortest-Queue (JSQ) policy has several strong stochastic optimality properties. In particular, the JSQ policy achieves the minimum mean overall delay among all non-anticipating policies that do not have any advance knowledge of the service requirements [29, 135]. In order to implement the JSQ policy, however, a dispatcher requires instantaneous knowledge of all the queue lengths, which may involve a prohibitive communication burden with a large number of servers.
This poor scalability has motivated consideration of JSQ($d$) policies, where an incoming task is assigned to a server with the shortest queue among $d$ servers selected uniformly at random. Note that this involves an exchange of $2d$ messages per task, irrespective of the number of servers $N$. Results in Mitzenmacher and Vvedenskaya et al. indicate that even sampling as few as $d = 2$ servers yields significant performance enhancements over purely random assignment ($d = 1$) as $N$ grows large, which is commonly referred to as the power-of-two or power-of-choice effect. Specifically, when tasks arrive at rate $\lambda N$, the queue length distribution at each individual server exhibits super-exponential decay for any fixed $\lambda < 1$ as $N$ grows large, a considerable improvement compared to exponential decay for purely random assignment.
The diversity parameter $d$ thus induces a fundamental trade-off between the amount of communication overhead and the delay performance. Specifically, a random assignment policy ($d = 1$) does not entail any communication burden, but the mean waiting time remains constant as $N$ grows large for any fixed $\lambda < 1$. In contrast, a nominal implementation of the JSQ policy (without maintaining state information at the dispatcher) involves $2N$ messages per task, but the mean waiting time vanishes as $N$ grows large for any fixed $\lambda < 1$. Although JSQ($d$) policies with $d \geq 2$ yield major performance improvements over purely random assignment while reducing the communication burden by a factor $O(N/d)$ compared to the JSQ policy, the mean waiting time does not vanish in the limit. Hence, no fixed value of $d$ will provide asymptotically optimal delay performance. This is evidenced by results of Gamarnik et al. indicating that in the absence of any memory at the dispatcher the communication overhead per task must increase with $N$ in order for any scheme to achieve a zero mean waiting time in the limit.
We will explore the intrinsic trade-off between delay performance and communication overhead as governed by the diversity parameter $d$, in conjunction with the relative load. The latter trade-off is examined in an asymptotic regime where not only the overall task arrival rate is assumed to grow with $N$, but also the diversity parameter is allowed to depend on $N$. We write $\lambda(N)$ and $d(N)$, respectively, to explicitly reflect that, and investigate what growth rate of $d(N)$ is required, depending on the scaling behavior of $\lambda(N)$, in order to achieve a zero mean waiting time in the limit. The analysis covers both fluid-scaled and diffusion-scaled versions of the queue length process in regimes where $\lambda(N)/N \to \lambda < 1$ and $(N - \lambda(N))/\sqrt{N} \to \beta > 0$ as $N \to \infty$, respectively. We establish that the limiting processes are insensitive to the exact growth rate of $d(N)$, as long as the latter is sufficiently fast, and in particular coincide with the limiting processes for the JSQ policy. This demonstrates that the optimality of the JSQ policy can asymptotically be preserved while dramatically lowering the communication overhead.
We will also consider network scenarios where the $N$ servers are assumed to be inter-connected by some underlying graph topology $G_N$. Tasks arrive at the various servers as independent Poisson processes of rate $\lambda$, and each incoming task is assigned to whichever server has the shortest queue among the one where it appears and its neighbors in $G_N$. In case $G_N$ is a clique (fully connected graph), each incoming task is assigned to the server with the shortest queue across the entire system, and the behavior is equivalent to that under the JSQ policy. The stochastic optimality properties of the JSQ policy thus imply that the queue length process in a clique will be ‘better’ than in an arbitrary graph $G_N$. We will establish sufficient conditions for the fluid-scaled and diffusion-scaled versions of the queue length process in an arbitrary graph $G_N$ to be equivalent to the limiting processes in a clique as $N \to \infty$. The conditions demonstrate that the optimality of a clique can be asymptotically preserved while dramatically reducing the number of connections, provided the graph $G_N$ is suitably random.
While a zero waiting time can be achieved in the limit by sampling only $d(N) = o(N)$ servers, the amount of communication overhead in terms of $d(N)$ must still grow with $N$. This may be explained from the fact that a large number of servers need to be sampled for each incoming task to ensure that at least one of them is found idle with high probability. This can be avoided by introducing memory at the dispatcher, in particular maintaining a record of vacant servers, and assigning tasks to idle servers, if there are any. This so-called Join-the-Idle-Queue (JIQ) scheme [12, 77] has gained huge popularity recently, and can be implemented through a simple token-based mechanism generating at most one message per task. As shown by Stolyar, the fluid-scaled queue length process under the JIQ scheme is equivalent to that under the JSQ policy as $N \to \infty$, and we will extend this result to the diffusion-scaled queue length process. Thus, the use of memory allows the JIQ scheme to achieve asymptotically optimal delay performance with minimal communication overhead. In particular, ensuring that tasks are assigned to idle servers whenever available is sufficient to achieve asymptotic optimality, and using any additional queue length information yields no meaningful performance benefits on the fluid or diffusion levels.
Stochastic coupling techniques play an instrumental role in the proofs of the above-described universality and asymptotic optimality properties. A direct analysis of the queue length processes under a JSQ($d(N)$) policy, in a load balancing graph $G_N$, or under the JIQ scheme is confronted with formidable obstacles, and does not seem tractable. As an alternative route, we leverage novel stochastic coupling constructions to relate the relevant queue length processes to the corresponding processes under a JSQ policy, and show that the deviation between these processes is asymptotically negligible under suitable assumptions on $d(N)$ or $G_N$.
While the stochastic coupling schemes provide an effective and overarching approach, they defy a systematic recipe and involve some degree of ingenuity and customization. Indeed, the specific coupling arguments that we develop are not only different from those that were originally used in establishing the stochastic optimality properties of the JSQ policy, but also differ in critical ways between a JSQ($d(N)$) policy, a load balancing graph $G_N$, and the JIQ scheme. Yet different coupling constructions are devised for model variants with infinite-server dynamics that we will discuss in Section 5.
The survey is organized as follows. In Section 2 we discuss various LBAs and evaluate their scalability properties. In Section 3 we introduce some useful preliminary concepts, and then review fluid and diffusion limits for the JSQ policy as well as JSQ($d$) policies with a fixed value of $d$. In Section 4 we explore the trade-off between delay performance and communication overhead as a function of the diversity parameter $d(N)$, in conjunction with the relative load. In particular, we establish asymptotic universality properties for JSQ($d(N)$) policies, which are extended to systems with server pools and network scenarios in Sections 5 and 6, respectively. In Section 7 we establish asymptotic optimality properties for the JIQ scheme. We discuss somewhat related redundancy policies and alternative scaling regimes and performance metrics in Section 8. The survey is concluded in Section 9 with a discussion of yet further extensions and several open problems and emerging research directions.
2 Scalability spectrum
In this section we review a wide spectrum of LBAs and examine their scalability properties in terms of the delay performance vis-a-vis the associated implementation overhead in large-scale systems.
2.1 Basic model
Throughout this section and most of the paper, we focus on a basic scenario with $N$ parallel single-server infinite-buffer queues and a single dispatcher where tasks arrive as a Poisson process of rate $\lambda(N)$, as depicted in Figure 2. Arriving tasks cannot be queued at the dispatcher, and must immediately be forwarded to one of the servers. Tasks are assumed to have unit-mean exponentially distributed service requirements, and the service discipline at each server is supposed to be oblivious to the actual service requirements. This is the supermarket model described in Section 1.
When tasks do not get served and never depart but simply accumulate, the above setup corresponds to a so-called balls-and-bins model, and we will further elaborate on the connections and differences with work in that domain in Subsection 8.5.
2.2 Asymptotic scaling regimes
An exact analysis of the delay performance is quite involved, if not intractable, for all but the simplest LBAs. Numerical evaluation or simulation are not straightforward either, especially for high load levels and large system sizes. A common approach is therefore to consider various limit regimes, which not only provide mathematical tractability and illuminate the fundamental behavior, but are also natural in view of the typical conditions in which cloud networks and data centers operate. One can distinguish several asymptotic scalings that have been used for these purposes:
(i) In the classical heavy-traffic regime, $\lambda(N) = N\lambda$ with a fixed number of servers $N$ and a relative load $\lambda$ that tends to one in the limit.

(ii) In the conventional large-capacity or many-server regime, the relative load $\lambda(N)/N$ approaches a constant $\lambda < 1$ as the number of servers $N$ grows large.

(iii) The popular Halfin-Whitt regime combines heavy traffic with a large capacity, with
$$\frac{N - \lambda(N)}{\sqrt{N}} \to \beta > 0 \quad \text{as } N \to \infty, \qquad (2.1)$$
so the relative capacity slack behaves as $\beta/\sqrt{N}$ as the number of servers $N$ grows large.

(iv) The so-called non-degenerate slow-down regime involves $N - \lambda(N) \to \gamma > 0$, so the relative capacity slack shrinks as $\gamma/N$ as the number of servers $N$ grows large.
The term non-degenerate slow-down refers to the fact that in the context of a centralized multi-server queue, the mean waiting time in regime (iv) tends to a strictly positive constant as $N \to \infty$, and is thus of similar magnitude as the mean service requirement. In contrast, in regimes (ii) and (iii), the mean waiting time in a multi-server queue decays exponentially fast in $N$ or is of the order $1/\sqrt{N}$, respectively, as $N \to \infty$, while in regime (i) the mean waiting time grows arbitrarily large relative to the mean service requirement.
In the context of a centralized M/M/$N$ queue, scalings (ii), (iii) and (iv) are commonly referred to as Quality-Driven (QD), Quality-and-Efficiency-Driven (QED) and Efficiency-Driven (ED) regimes. These terms reflect that (ii) offers excellent service quality (vanishing waiting time), (iv) provides high resource efficiency (utilization approaching one), and (iii) achieves a combination of these two, providing the best of both worlds.
In the present paper we will focus on scalings (ii) and (iii), and occasionally also refer to these as fluid and diffusion scalings, since it is natural to analyze the relevant queue length process on fluid scale ($1/N$) and diffusion scale ($1/\sqrt{N}$) in these regimes, respectively. We will not provide a detailed account of scalings (i) and (iv), which do not capture the large-scale perspective and do not allow for low delays, respectively, but we will briefly mention some results for these regimes in Subsections 8.2 and 8.3.
An important issue in the context of scaling limits is the rate of convergence and the accuracy for finite-size systems. Some interesting results for the accuracy of mean-field approximations for interacting-particle systems including load balancing models may be found in recent work of Gast, Gast & Van Houdt, and Ying [137, 138].
2.3 Basic load balancing algorithms
2.3.1 Random assignment: N independent M/M/1 queues
One of the most basic LBAs is to assign each arriving task to a server selected uniformly at random. In that case, the various queues collectively behave as $N$ independent M/M/1 queues, each with arrival rate $\lambda(N)/N$ and unit service rate. In particular, at each of the queues, the total number of tasks in stationarity has a geometric distribution with parameter $\lambda(N)/N$. By virtue of the PASTA property, the probability that an arriving task incurs a non-zero waiting time is $\lambda(N)/N$. The mean number of waiting tasks (excluding the possible task in service) at each of the queues is $\frac{(\lambda(N)/N)^2}{1 - \lambda(N)/N}$, so the total mean number of waiting tasks is $\frac{\lambda(N)^2/N}{1 - \lambda(N)/N}$, which by Little's law implies that the mean waiting time of a task is $\frac{\lambda(N)/N}{1 - \lambda(N)/N}$. In particular, when $\lambda(N) = N\lambda$, the probability that a task incurs a non-zero waiting time is $\lambda$, and the mean waiting time of a task is $\frac{\lambda}{1 - \lambda}$, independent of $N$, reflecting the independence of the various queues.
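These closed-form M/M/1 expressions are easy to sanity-check numerically; the helper below is a minimal sketch (the function name is our own choice) that evaluates them for a given per-server load.

```python
def mm1_metrics(rho):
    """Stationary quantities for one M/M/1 queue with arrival rate rho < 1
    and unit service rate, i.e., one queue under random assignment."""
    p_wait = rho                           # PASTA: arrival finds the server busy w.p. rho
    mean_num_waiting = rho**2 / (1 - rho)  # mean number waiting, excluding task in service
    mean_wait = rho / (1 - rho)            # mean waiting time, via Little's law
    return p_wait, mean_num_waiting, mean_wait
```

For instance, at per-server load 0.5 the waiting probability is 0.5 and the mean waiting time is 1.0; note that the mean number waiting equals the load times the mean waiting time, which is Little's law applied to the waiting room.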
As we will see later, a broad range of queue-aware LBAs can deliver a probability of a non-zero waiting time and a mean waiting time that vanish asymptotically. While a random assignment policy is evidently not competitive with such queue-aware LBAs, it still plays a relevant role due to the strong degree of tractability inherited from its simplicity. For example, the queue process under purely random assignment can be shown to provide an upper bound (in a stochastic majorization sense) for various more involved queue-aware LBAs for which even stability may be difficult to establish directly, yielding conservative performance bounds and stability guarantees.
A slightly better LBA is to assign tasks to the servers in a Round-Robin manner, dispatching every $N$-th task to the same server. In the fluid regime where $\lambda(N) = N\lambda$, the inter-arrival time of tasks at each given queue will then converge to $1/\lambda$ as $N \to \infty$. Thus each of the queues will behave as a D/M/1 queue in the limit, and the probability of a non-zero waiting time and the mean waiting time will be somewhat lower than under purely random assignment. However, both the probability of a non-zero waiting time and the mean waiting time will still tend to strictly positive values and not vanish as $N \to \infty$.
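The D/M/1 comparison can be made concrete with the classical GI/M/1 fixed-point equation: with unit service rate and deterministic inter-arrival time $1/\lambda$, the waiting probability is the root $\sigma \in (0,1)$ of $\sigma = e^{-(1-\sigma)/\lambda}$, and the mean waiting time is $\sigma/(1-\sigma)$. A small sketch (our own code, applying these standard textbook formulas):

```python
import math

def dm1_wait_prob(lam, tol=1e-12):
    """Waiting probability in a D/M/1 queue (inter-arrival time 1/lam, unit-rate
    service): the root in (0,1) of sigma = exp(-(1 - sigma)/lam), found by
    fixed-point iteration started from the M/M/1 value lam."""
    sigma = lam
    for _ in range(10_000):
        nxt = math.exp(-(1.0 - sigma) / lam)
        if abs(nxt - sigma) < tol:
            break
        sigma = nxt
    return sigma

def dm1_mean_wait(lam):
    """Mean waiting time sigma/(1 - sigma) in the D/M/1 queue."""
    s = dm1_wait_prob(lam)
    return s / (1.0 - s)
```

For $\lambda = 0.5$ the root is noticeably below the M/M/1 waiting probability of 0.5, yet still bounded away from zero, which is precisely the point made above: Round-Robin improves on random assignment but its waiting time does not vanish.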
2.3.2 Join-the-Shortest Queue (JSQ)
Under the Join-the-Shortest-Queue (JSQ) policy, each arriving task is assigned to the server with the currently shortest queue. In the basic model described above, the JSQ policy has several strong stochastic optimality properties, and yields the ‘most balanced and smallest’ queue process among all non-anticipating policies that do not have any advance knowledge of the service requirements [29, 135].
2.3.3 Join-the-Smallest-Workload (JSW): centralized M/M/N queue
Under the Join-the-Smallest-Workload (JSW) policy, each arriving task is assigned to the server with the currently smallest workload. Note that this is an anticipating policy, since it requires advance knowledge of the service requirements of all the tasks in the system. Further observe that this policy (myopically) minimizes the waiting time for each incoming task, and mimics the operation of a centralized $N$-server queue with a FCFS discipline. The equivalence with a centralized $N$-server FCFS queue yields a strong optimality property of the JSW policy: the vector of joint workloads at the various servers observed by each incoming task is smaller in the Schur convex sense than under any alternative admissible policy.
It is worth observing that the above optimality properties in fact do not rely on Poisson arrival processes or exponential service requirement distributions. At the same time, these optimality properties do not imply that the JSW policy minimizes the mean stationary waiting time. In our setting with Poisson arrivals and exponential service requirements, however, it can be shown through direct means that the total number of tasks under the JSW policy is stochastically smaller than under the JSQ policy. Even though the JSW policy requires a similar excessive communication overhead, aside from its anticipating nature, the equivalence with a centralized FCFS queue means that there cannot be any idle servers while tasks are waiting and that the total number of tasks behaves as a birth-death process, which renders it far more tractable than the JSQ policy. Specifically, given that all the servers are busy, the total number of waiting tasks is geometrically distributed with parameter $\lambda(N)/N$. The total mean number of waiting tasks is then $\Pi_W \frac{\lambda(N)/N}{1 - \lambda(N)/N}$, and the mean waiting time is $\Pi_W \frac{1}{N - \lambda(N)}$, with $\Pi_W$ denoting the probability of all servers being occupied and a task incurring a non-zero waiting time. This immediately shows that the mean waiting time is smaller by at least a factor $\lambda(N)$ than for the random assignment policy considered in Subsection 2.3.1.
In the large-capacity regime $\lambda(N) = N\lambda$ with $\lambda < 1$, it can be shown that the probability of a non-zero waiting time decays exponentially fast in $N$, and hence so does the mean waiting time. In the Halfin-Whitt heavy-traffic regime (2.1), the probability of a non-zero waiting time converges to a finite constant strictly between zero and one, implying that the mean waiting time of a task is of the order $1/\sqrt{N}$, and thus vanishes as $N \to \infty$.
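The waiting probability $\Pi_W$ is the classical Erlang-C formula, which can be evaluated with the standard numerically stable recursion on the Erlang-B blocking probability; the sketch below uses our own function names.

```python
def erlang_c(n, a):
    """Erlang-C probability that all n servers are busy in an M/M/n queue
    with offered load a = arrival rate / service rate, requiring a < n."""
    b = 1.0
    for k in range(1, n + 1):
        b = a * b / (k + a * b)            # Erlang-B recursion: B(k, a) from B(k-1, a)
    return b / (1.0 - (a / n) * (1.0 - b))

def mmn_mean_wait(n, a):
    """Mean waiting time in M/M/n with unit service rate: Erlang-C / (n - a)."""
    return erlang_c(n, a) / (n - a)
```

For $n = 1$ this reduces to the familiar M/M/1 values, and since the Erlang-C probability is at most one, the mean waiting time is always at most $1/(N - \lambda(N))$, consistent with the factor-$\lambda(N)$ comparison against random assignment above.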
2.3.4 Power-of-d load balancing (JSQ(d))
We have seen that the Achilles heel of the JSQ policy is its excessive communication overhead in large-scale systems. This poor scalability has motivated consideration of so-called JSQ($d$) policies, where an incoming task is assigned to a server with the shortest queue among $d$ servers selected uniformly at random. Results in Mitzenmacher and Vvedenskaya et al. indicate that in the fluid regime where $\lambda(N) = N\lambda$, the probability that there are $i$ or more tasks at a given queue is proportional to $\lambda^{(d^i - 1)/(d - 1)}$ as $N \to \infty$, and thus exhibits super-exponential decay in $i$, as opposed to the exponential decay $\lambda^i$ for the random assignment policy considered in Subsection 2.3.1.
As alluded to in Section 1, the diversity parameter $d$ thus induces a fundamental trade-off between the amount of communication overhead and the performance in terms of queue lengths and delays. A rudimentary implementation of the JSQ policy ($d = N$, without replacement) involves a communication overhead of $2N$ messages per task, but it can be shown that the probability of a non-zero waiting time and the mean waiting time vanish as $N \to \infty$, just like in a centralized M/M/$N$ queue. Although JSQ($d$) policies with a fixed parameter $d \geq 2$ yield major performance improvements over purely random assignment, the probability of a non-zero waiting time and the mean waiting time do not vanish as $N \to \infty$.
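The contrast between exponential and super-exponential decay shows up already in a toy simulation. The sketch below is purely illustrative (our own code, with arbitrary server count, load, and event budget): it runs the embedded jump chain of the system under JSQ($d$) sampling and reports the long-run fraction of queues holding two or more tasks, which drops markedly as soon as $d = 2$.

```python
import random

def frac_ge2_jsq_d(num_servers, load, d, num_events=200_000, seed=1):
    """Jump-chain simulation of JSQ(d): Poisson arrivals at total rate
    load * num_servers, unit-rate exponential services.  Returns the average
    (over the second half of the jump chain) fraction of queues with >= 2 tasks."""
    rng = random.Random(seed)
    queues = [0] * num_servers
    arrival_rate = load * num_servers
    acc, samples = 0.0, 0
    for step in range(num_events):
        busy = [i for i, q in enumerate(queues) if q > 0]
        if rng.random() < arrival_rate / (arrival_rate + len(busy)):
            # arrival: join the shortest among d uniformly sampled queues
            target = min(rng.sample(range(num_servers), d), key=queues.__getitem__)
            queues[target] += 1
        elif busy:
            # departure from a uniformly chosen busy server
            queues[rng.choice(busy)] -= 1
        if step >= num_events // 2:
            acc += sum(1 for q in queues if q >= 2) / num_servers
            samples += 1
    return acc / samples
```

Since the estimate is taken over the embedded jump chain rather than in continuous time, the absolute numbers are only indicative, but the gap between $d = 1$ and $d = 2$ is robust.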
2.3.5 Token-based mechanisms: Join-the-Idle-Queue (JIQ)
While a zero waiting time can be achieved in the limit by sampling only $d(N) = o(N)$ servers, the amount of communication overhead in terms of $d(N)$ must still grow with $N$. This can be countered by introducing memory at the dispatcher, in particular maintaining a record of vacant servers, and assigning tasks to idle servers as long as there are any, or to a server selected uniformly at random otherwise. This so-called Join-the-Idle-Queue (JIQ) scheme [12, 77] has received keen interest recently, and can be implemented through a simple token-based mechanism. Specifically, idle servers send tokens to the dispatcher to advertise their availability, and when a task arrives and the dispatcher has tokens available, it assigns the task to one of the corresponding servers (and disposes of the token). Note that a server only issues a token when a task completion leaves its queue empty, thus generating at most one message per task. Surprisingly, the mean waiting time and the probability of a non-zero waiting time vanish under the JIQ scheme in both the fluid and diffusion regimes, as we will further discuss in Section 7. Thus, the use of memory allows the JIQ scheme to achieve asymptotically optimal delay performance with minimal communication overhead.
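The token mechanics just described can be sketched in a few lines; the class below is purely illustrative (the names and the uniformly random fallback are our own choices, following the verbal description above).

```python
import random
from collections import deque

class JIQDispatcher:
    """Minimal sketch of the token-based JIQ mechanism: idle servers deposit
    a token; an arriving task consumes a token if one exists, else it is sent
    to a uniformly random server."""
    def __init__(self, num_servers):
        self.queues = [0] * num_servers
        self.tokens = deque(range(num_servers))   # all servers start idle

    def arrive(self):
        if self.tokens:
            server = self.tokens.popleft()        # consume a token: one message per task
        else:
            server = random.randrange(len(self.queues))
        self.queues[server] += 1
        return server

    def complete(self, server):
        self.queues[server] -= 1
        if self.queues[server] == 0:
            self.tokens.append(server)            # server advertises its idleness
```

Note that a token is only deposited when a completion empties a queue, so the message count never exceeds one per task, matching the overhead claim above.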
2.4 Performance comparison
We now present some simulation experiments to compare the above-described LBAs in terms of delay performance.
Specifically, we evaluate the mean waiting time and the probability of a non-zero waiting time in both a fluid regime and a diffusion regime. The results are shown in Figure 1. An overview of the delay performance and overhead associated with various LBAs is given in Table 1.
We are specifically interested in distinguishing two classes of LBAs – the ones delivering a mean waiting time and probability of a non-zero waiting time that vanish asymptotically, and the ones that fail to do so – and relating that dichotomy to the associated communication overhead and memory requirement at the dispatcher. We give these classifications for both the fluid regime and the diffusion regime.
JSQ, JIQ, and JSW.
Three schemes that clearly have a vanishing waiting time are JSQ, JIQ and JSW. The optimality of JSW is observed in the figures: JSW has the smallest mean waiting time, and all three schemes have vanishing waiting times in both the fluid and diffusion regimes. There is a significant difference, however, between JSW and JSQ/JIQ. We observe that the probability of positive wait does not vanish for JSW, while it does vanish for JSQ/JIQ. This implies that the mean of all positive waiting times is an order of magnitude larger under JSQ/JIQ than under JSW. Intuitively, this is clear since under JSQ/JIQ, when a task is placed in a queue, it waits for at least one specific other task. Under JSW, which is equivalent to the M/M/$N$ queue, a task that cannot start service immediately can start service as soon as any of the $N$ servers becomes idle.
Random and Round-Robin.
The mean waiting time does not vanish for Random and Round-Robin in the fluid regime, as already mentioned in Subsection 2.3.1. Moreover, the mean waiting time grows without bound in the diffusion regime for these two schemes. This is because the system can still be decomposed into single-server queues, and the loads of the individual M/M/1 and D/M/1 queues tend to 1.
Three versions of JSQ($d(N)$) are included in Figure 1: a fixed $d(N) = 2$, a slowly growing $d(N) = \lceil \log N \rceil$, and a faster-growing choice for which $d(N)/(\sqrt{N} \log N) \to \infty$. Note that the graph for $d(N) = \lceil \log N \rceil$ shows sudden jumps when $d(N)$ increases by 1. As can be seen in Figure 1, the choices for which $d(N) \to \infty$ have vanishing wait in the fluid regime, while $d(N) = 2$ has not. Overall, we see that JSQ($d$) policies clearly outperform Random and Round-Robin dispatching, while JSQ, JIQ, and JSW are better in terms of mean wait.
|Scheme|Queue length|Waiting time (fixed $\lambda < 1$)|Waiting time ($1 - \lambda(N)/N \sim 1/\sqrt{N}$)|Overhead per task|
|JSQ($d(N)$), $d(N) \to \infty$|same as JSQ|same as JSQ|??|$2d(N)$|
|JSQ($d(N)$), $d(N)/(\sqrt{N}\log N) \to \infty$|same as JSQ|same as JSQ|same as JSQ|$2d(N)$|
|JIQ|same as JSQ|same as JSQ|same as JSQ|$\leq 1$|
3 Preliminaries, JSQ policy, and power-of-d algorithms
In this section we first introduce some useful notation and preliminary concepts, and then review fluid and diffusion limits for the JSQ policy as well as JSQ($d$) policies with a fixed value of $d$.
We keep focusing on the basic scenario where all the servers are homogeneous, the service requirements are exponentially distributed, and the service discipline at each server is oblivious to the actual service requirements. In order to obtain a Markovian state description, it therefore suffices to only track the number of tasks, and in fact we do not need to keep record of the number of tasks at each individual server, but only count the number of servers with a given number of tasks. Specifically, we represent the state of the system by a vector $\mathbf{Q}(t) = (Q_1(t), Q_2(t), \ldots)$, with $Q_i(t)$ denoting the number of servers with $i$ or more tasks at time $t$, including the possible task in service, $i = 1, 2, \ldots$. Note that if we represent the queues at the various servers as (vertical) stacks, and arrange these from left to right in non-descending order, then the value of $Q_i(t)$ corresponds to the width of the $i$-th (horizontal) row, as depicted in the schematic diagram in Figure 3.
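As a concrete illustration of this state descriptor, the helper below (our own sketch) converts per-server queue lengths into the occupancy vector; note that the result is automatically non-increasing, matching the stacked-rows picture.

```python
def occupancy(queue_lengths):
    """Occupancy vector (Q_1, ..., Q_max): Q_i is the number of servers
    holding i or more tasks, including the possible task in service."""
    max_len = max(queue_lengths, default=0)
    return [sum(1 for q in queue_lengths if q >= i) for i in range(1, max_len + 1)]
```

For example, four servers with queue lengths 0, 1, 1, 3 yield the occupancy vector (3, 1, 1): three servers hold at least one task, and a single server accounts for the second and third rows.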
In order to examine the fluid and diffusion limits in regimes where the number of servers grows large, we consider a sequence of systems indexed by $N$, and attach a superscript $N$ to the associated state variables. The fluid-scaled occupancy state is denoted by $\mathbf{q}^N(t) = (q_1^N(t), q_2^N(t), \ldots)$, with $q_i^N(t) = Q_i^N(t)/N$ representing the fraction of servers in the $N$-th system with $i$ or more tasks at time $t$, $i = 1, 2, \ldots$. Let $\mathcal{S}$ be the set of all possible fluid-scaled states. Whenever we consider fluid limits, we assume the sequence of initial states is such that $\mathbf{q}^N(0) \to \mathbf{q}(0) \in \mathcal{S}$ as $N \to \infty$.
The diffusion-scaled occupancy state is defined as $\bar{\mathbf{Q}}^N(t) = (\bar{Q}_1^N(t), \bar{Q}_2^N(t), \ldots)$, with
$$\bar{Q}_1^N(t) = -\frac{N - Q_1^N(t)}{\sqrt{N}}, \qquad \bar{Q}_i^N(t) = \frac{Q_i^N(t)}{\sqrt{N}}, \quad i \geq 2. \qquad (3.2)$$
Note that $-\bar{Q}_1^N(t)$ corresponds to the number of vacant servers, normalized by $\sqrt{N}$. The reason why $Q_1^N(t)$ is centered around $N$ while $Q_i^N(t)$, $i \geq 2$, are not, is that for the scalable LBAs we consider, the fraction of servers with exactly one task tends to one, whereas the fraction of servers with two or more tasks tends to zero as $N \to \infty$. For convenience, we will assume that each server has an infinite-capacity buffer, but all the results extend to the finite-buffer case.
3.1 Fluid limit for JSQ(d) policies
The sequence of processes $\{\mathbf{q}^N(t)\}_{t \geq 0}$ has a weak limit $\{\mathbf{q}(t)\}_{t \geq 0}$ that satisfies the system of differential equations
$$\frac{\mathrm{d}q_i(t)}{\mathrm{d}t} = \lambda \big(q_{i-1}(t)^d - q_i(t)^d\big) - \big(q_i(t) - q_{i+1}(t)\big), \quad i = 1, 2, \ldots, \qquad (3.3)$$
with the convention $q_0(t) \equiv 1$.
The fluid-limit equations may be interpreted as follows. The first term represents the rate of increase in the fraction of servers with $i$ or more tasks due to arriving tasks that are assigned to a server with exactly $i - 1$ tasks. Note that the latter occurs in fluid state $\mathbf{q}$ with probability $q_{i-1}^d - q_i^d$, i.e., the probability that all $d$ sampled servers have $i - 1$ or more tasks, but not all of them have $i$ or more tasks. The second term corresponds to the rate of decrease in the fraction of servers with $i$ or more tasks due to service completions from servers with exactly $i$ tasks, and the latter rate is given by $q_i - q_{i+1}$. While the system in (3.3) characterizes the functional law of large numbers (FLLN) behavior of systems under the JSQ($d$) scheme, weak convergence to a certain Ornstein-Uhlenbeck process (both in the transient regime and in steady state) has also been established, yielding a functional central limit theorem (FCLT) result. Strong approximations for systems under the JSQ($d$) scheme on any finite time interval by the deterministic system in (3.3), a certain infinite-dimensional jump process, and a diffusion approximation have likewise been established.
When the derivatives in (3.3) are set equal to zero for all $i$, the unique fixed point for any $d \geq 2$ is obtained as
$$q_i^\infty = \lambda^{\frac{d^i - 1}{d - 1}}, \quad i = 1, 2, \ldots. \qquad (3.4)$$
It can be shown that the fixed point is asymptotically stable in the sense that $\mathbf{q}(t) \to \mathbf{q}^\infty$ as $t \to \infty$ for any initial fluid state $\mathbf{q}(0)$ with $\sum_i q_i(0) < \infty$. As mentioned earlier, the fixed point reveals that the stationary queue length distribution at each individual server exhibits super-exponential decay as $N \to \infty$, as opposed to exponential decay for a random assignment policy. It is worth observing that this involves an interchange of the many-server ($N \to \infty$) and stationary ($t \to \infty$) limits. The justification is provided by the asymptotic stability of the fixed point along with a few further technical conditions.
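The convergence of (3.3) to the fixed point (3.4) can be checked with a simple forward-Euler integration; the sketch below is our own code (truncation depth, step size, and horizon are arbitrary choices) and starts from an empty system.

```python
def jsq_d_fluid(lam, d, depth=8, dt=0.05, t_end=200.0):
    """Forward-Euler integration of the JSQ(d) fluid limit (3.3):
    dq_i/dt = lam*(q_{i-1}^d - q_i^d) - (q_i - q_{i+1}), with q_0 = 1
    and the tail truncated at q_{depth+1} = 0."""
    q = [0.0] * depth
    for _ in range(int(t_end / dt)):
        ext = [1.0] + q + [0.0]                    # prepend q_0, append truncation
        q = [ext[i] + dt * (lam * (ext[i - 1]**d - ext[i]**d)
                            - (ext[i] - ext[i + 1]))
             for i in range(1, depth + 1)]
    return q

def jsq_d_fixed_point(lam, d, depth=8):
    """Fixed point (3.4): q_i = lam**((d**i - 1)/(d - 1)), for d >= 2."""
    return [lam ** ((d**i - 1) / (d - 1)) for i in range(1, depth + 1)]
```

With $\lambda = 0.7$ and $d = 2$ the trajectory settles close to $(0.7, 0.7^3, 0.7^7, \ldots)$, illustrating the super-exponential decay of the tail.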
3.2 Fluid limit for JSQ policy
We now turn to the fluid limit for the ordinary JSQ policy, which rather surprisingly was not rigorously established until fairly recently, leveraging martingale functional limit theorems and time-scale separation arguments.
In order to state the fluid limit starting from an arbitrary fluid-scaled occupancy state, we first introduce some additional notation. For any fluid state $\mathbf{q} \in \mathcal{S}$, denote by $m(\mathbf{q}) = \min\{i \geq 0 : q_{i+1} < 1\}$ the minimum queue length among all servers. Now if $m(\mathbf{q}) = 0$, then define $p_0(\mathbf{q}) = 1$ and $p_i(\mathbf{q}) = 0$ for all $i \geq 1$. Otherwise, in case $m(\mathbf{q}) \geq 1$, define
$$p_{m(\mathbf{q})-1}(\mathbf{q}) = \min\Big\{\frac{1 - q_{m(\mathbf{q})+1}}{\lambda},\ 1\Big\}, \qquad p_{m(\mathbf{q})}(\mathbf{q}) = 1 - p_{m(\mathbf{q})-1}(\mathbf{q}),$$
and $p_i(\mathbf{q}) = 0$ for all other $i$.
Any weak limit of the sequence of processes $\{\mathbf{q}^N(t)\}_{t \geq 0}$ is given by the deterministic system $\{\mathbf{q}(t)\}_{t \geq 0}$ that satisfies the system of differential equations
$$\frac{\mathrm{d}^+ q_i(t)}{\mathrm{d}t} = \lambda p_{i-1}(\mathbf{q}(t)) - \big(q_i(t) - q_{i+1}(t)\big), \quad i = 1, 2, \ldots, \qquad (3.6)$$
where $\mathrm{d}^+/\mathrm{d}t$ denotes the right-derivative. The reason we have used the derivative in (3.3) and the right-derivative in (3.6) is that the limiting trajectory for the JSQ policy may not be differentiable at all time points. In fact, one of the major technical challenges in proving the fluid limit for the JSQ policy is that the drift of the process is not continuous, which leads to non-smooth limiting trajectories.
The fluid-limit trajectory in (3.6) can be interpreted as follows. The coefficient $p_i(\mathbf{q})$ represents the instantaneous fraction of incoming tasks assigned to servers with a queue length of exactly $i$ in the fluid state $\mathbf{q}$; below we write $m$ for the minimum queue length $m(\mathbf{q})$. Note that a strictly positive fraction $1 - q_{m+1}$ of the servers have a queue length of exactly $m$. Clearly the fraction of incoming tasks that get assigned to servers with a queue length of $m + 1$ or larger is zero: $p_i(\mathbf{q}) = 0$ for all $i \geq m + 1$. Also, tasks at servers with a queue length of exactly $i$ are completed at (normalized) rate $q_i - q_{i+1}$, which is zero for all $i \leq m - 1$, and hence the fraction of incoming tasks that get assigned to servers with a queue length of $m - 2$ or less is zero as well: $p_i(\mathbf{q}) = 0$ for all $i \leq m - 2$. This only leaves the fractions $p_{m-1}(\mathbf{q})$ and $p_m(\mathbf{q})$ to be determined. Now observe that the fraction of servers with a queue length of exactly $m - 1$ is zero. If $m = 0$, then clearly the incoming tasks will join an empty queue, and thus $p_0(\mathbf{q}) = 1$ and $p_i(\mathbf{q}) = 0$ for all $i \geq 1$. Furthermore, if $m \geq 1$, since tasks at servers with a queue length of exactly $m$ are completed at (normalized) rate $1 - q_{m+1}$, incoming tasks can be assigned to servers with a queue length of exactly $m - 1$ at that rate. We thus need to distinguish between two cases, depending on whether the normalized arrival rate $\lambda$ is larger than $1 - q_{m+1}$ or not. If $\lambda \leq 1 - q_{m+1}$, then all the incoming tasks can be assigned to a server with a queue length of exactly $m - 1$, so that $p_{m-1}(\mathbf{q}) = 1$ and $p_m(\mathbf{q}) = 0$. On the other hand, if $\lambda > 1 - q_{m+1}$, then not all incoming tasks can be assigned to servers with a queue length of exactly $m - 1$, and a positive fraction will be assigned to servers with a queue length of exactly $m$: $p_{m-1}(\mathbf{q}) = (1 - q_{m+1})/\lambda$ and $p_m(\mathbf{q}) = 1 - p_{m-1}(\mathbf{q})$.
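The case distinction above is compact enough to state in code; the function below is an illustrative sketch with our own conventions (the argument q is the truncated fluid state $(q_1, q_2, \ldots)$ as a list, and entry $i$ of the output is the fraction assigned to queues of length exactly $i$).

```python
def assignment_fractions(q, lam, depth=10):
    """Instantaneous fractions of incoming tasks assigned to servers with
    queue length exactly i under the JSQ fluid limit, where q[k-1] = q_k."""
    # m: minimum queue length, i.e., smallest i with q_{i+1} < 1
    m = 0
    while m < len(q) and q[m] == 1.0:
        m += 1
    p = [0.0] * depth
    if m == 0:
        p[0] = 1.0                                        # idle servers present
        return p
    free_rate = 1.0 - (q[m] if m < len(q) else 0.0)       # rate 1 - q_{m+1}
    p[m - 1] = min(free_rate / lam, 1.0)                  # filled as fast as they free up
    p[m] = 1.0 - p[m - 1]                                 # overflow joins length-m queues
    return p
```

For instance, in the state $q_1 = 1$, $q_2 = 0.3$ with $\lambda = 0.9 > 1 - q_2 = 0.7$, a fraction $0.7/0.9$ of arrivals joins the emptying servers and the remainder joins queues of length one, and the fractions always sum to one.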
The unique fixed point of the dynamical system in (3.6) is given by
$$q_1^\infty = \lambda, \qquad q_i^\infty = 0 \ \text{ for } i = 2, 3, \ldots. \qquad (3.7)$$
Note that the fixed point (3.7) naturally emerges when $d \to \infty$ in the fixed point expression (3.4) for JSQ($d$) policies with fixed $d$. However, the process-level results in [85, 130] for fixed $d$ cannot be readily used to handle joint scalings of $d$ with $N$, and do not yield the entire fluid-scaled sample path for arbitrary initial states as given by (3.6).
The fixed point in (3.7), in conjunction with an interchange of limits argument, indicates that in stationarity the fraction of servers with a queue length of two or larger under the JSQ policy is negligible as $N \to \infty$.
3.3 Diffusion limit for JSQ policy
We next describe the diffusion limit for the JSQ policy in the Halfin-Whitt heavy-traffic regime (2.1), as derived by Eschenfeldt & Gamarnik. Recall the centered and diffusion-scaled processes in (3.2).
For suitable initial conditions, the sequence of processes $\{\bar{\mathbf{Q}}^N(t)\}_{t \geq 0}$ converges weakly to the limit $\{\bar{\mathbf{Q}}(t)\}_{t \geq 0}$, where $\bar{\mathbf{Q}}(t) = (\bar{Q}_1(t), \bar{Q}_2(t), \ldots)$ is the unique solution to the following system of SDEs:
$$\mathrm{d}\bar{Q}_1(t) = \sqrt{2}\,\mathrm{d}W(t) - \big(\beta + \bar{Q}_1(t) - \bar{Q}_2(t)\big)\,\mathrm{d}t - \mathrm{d}U_1(t),$$
$$\mathrm{d}\bar{Q}_2(t) = \mathrm{d}U_1(t) - \big(\bar{Q}_2(t) - \bar{Q}_3(t)\big)\,\mathrm{d}t, \qquad (3.8)$$
and $\mathrm{d}\bar{Q}_i(t) = -\big(\bar{Q}_i(t) - \bar{Q}_{i+1}(t)\big)\,\mathrm{d}t$, for $i \geq 3$, where $W$ is the standard Brownian motion and $U_1$ is the unique non-decreasing non-negative process satisfying $\int_0^\infty \mathbf{1}_{[\bar{Q}_1(t) < 0]}\,\mathrm{d}U_1(t) = 0$.
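A crude Euler-Maruyama scheme suffices to get a feel for sample paths of a diffusion of this form. The sketch below is illustrative only and encodes the drift, Brownian, and reflection structure described above under our reconstruction of the dynamics (truncation depth, horizon, and step size are arbitrary choices); the reflection term $U_1$ is discretized by projecting $\bar{Q}_1$ back to the non-positive half-line and crediting the overshoot to $\bar{Q}_2$.

```python
import math
import random

def simulate_diffusion(beta, depth=4, T=50.0, dt=1e-3, seed=0):
    """Euler-Maruyama sketch of a reflected diffusion of the form (3.8):
    Q[0] = Q_1 <= 0 is kept at the boundary by the reflection increment dU,
    which simultaneously feeds Q[1] = Q_2.  Higher coordinates only decay."""
    rng = random.Random(seed)
    Q = [0.0] * depth
    for _ in range(int(T / dt)):
        dW = rng.gauss(0.0, math.sqrt(dt))
        q1 = Q[0] + math.sqrt(2.0) * dW - (beta + Q[0] - Q[1]) * dt
        dU = max(q1, 0.0)                  # reflection: keep Q_1 non-positive
        Q[0] = q1 - dU
        Q[1] += dU - (Q[1] - Q[2]) * dt    # overshoot credited to Q_2
        for i in range(2, depth - 1):
            Q[i] += -(Q[i] - Q[i + 1]) * dt
        Q[depth - 1] += -Q[depth - 1] * dt
    return Q
```

Over a long horizon $\bar{Q}_1$ hovers around $-\beta$ while $\bar{Q}_2$ stays non-negative, consistent with the steady-state discussion that follows.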
The above convergence of the scaled occupancy measure was initially established only in the transient regime, on any finite time interval. The tightness of the sequence of diffusion-scaled steady-state occupancy measures (Q̄_1^N(∞), Q̄_2^N(∞)), the ergodicity of the limiting diffusion process (3.8), and hence the interchange of limits were open until Braverman recently further established that the weak-convergence result extends to the steady state as well, i.e., (Q̄_1^N(∞), Q̄_2^N(∞)) converges weakly to (Q̄_1(∞), Q̄_2(∞)) as N → ∞, where (Q̄_1(∞), Q̄_2(∞)) has the stationary distribution of the process (Q̄_1, Q̄_2). Thus, the steady state of the diffusion process in (3.8) is proved to capture the asymptotic behavior of large-scale systems under the JSQ policy.
In that work a Lyapunov function is obtained via a generator expansion framework using Stein’s method, which establishes exponential ergodicity of the limiting diffusion process. Although this approach gives a good handle on the rate of convergence to stationarity, it sheds little light on the form of the stationary distribution of the limiting diffusion process (3.8) itself. In two companion papers Banerjee & Mukherjee [15, 14] perform a thorough analysis of the steady state of this diffusion process. Using a classical regenerative process construction of the diffusion process in (3.8), [15, Theorem 2.1] establishes that Q̄_1(∞) has a Gaussian tail, with a tail exponent that is uniformly bounded by constants which do not depend on β, whereas Q̄_2(∞) has an exponentially decaying tail, with a coefficient in the exponent that is linear in β. More precisely, there exist positive constants C_1, C_2, C_3, C_4 not depending on β and positive constants x_0, K_1, K_2, K_3, K_4 depending only on β such that for all x ≥ x_0,

K_1 e^{-C_1 x^2} ≤ P(Q̄_1(∞) < -x) ≤ K_2 e^{-C_2 x^2},
K_3 e^{-C_3 β x} ≤ P(Q̄_2(∞) > x) ≤ K_4 e^{-C_4 β x}.   (3.9)
It was further shown in [15, Theorem 2.2] that there exists a positive constant not depending on β such that, almost surely along any sample path, the fluctuations of Q̄_1(t) are eventually of width at most a constant multiple of √(log t), while those of Q̄_2(t) are of width at most a constant multiple of β^{-1} log t. Notice that the width of fluctuation of Q̄_1 does not depend on the value of β, whereas that of Q̄_2 is linear in β^{-1}.
Since the N-th system is ergodic and its arrival rate is N − β√N, it is straightforward to see that E[Q_1^N(∞)] = N − β√N for all N, and hence, as expected, it can also be derived from the evolution of the limiting diffusion process that E[Q̄_1(∞)] = −β.
Thus, intuitively, for large enough β, the system typically has many idle servers, and the number of servers with a queue length of two or larger diminishes.
But the way E(Q̄_2(∞)) scales as β becomes large is highly non-trivial.
Specifically, it was shown that there exist β_0 > 0 and positive constants C_1, C_2, D_1, D_2 such that for all β ≥ β_0,

e^{-C_1 β^2} ≤ E_π(Q̄_2(∞)) ≤ e^{-C_2 β^2}  and  P(Q̄_2(∞) ≥ e^{-e^{D_1 β^2}}) ≤ e^{-D_2 β^2},

i.e., the steady-state mean is of order e^{-Θ(β^2)}, but most of the steady-state mass concentrates at the much smaller scale e^{-e^{Θ(β^2)}}. This suggests intermittency in the behavior of the Q̄_2 process: Q̄_2 is typically of order e^{-e^{Θ(β^2)}}, but during rare events when it achieves higher values, it takes a long time to decay. However, for small enough β, the behavior is qualitatively different. Since λ(N)/N = 1 − β/√N, the system is expected to become more congested as β becomes smaller. As a result, intuitively, E(Q̄_2(∞)) should increase as β decreases. In this regime as well, Q̄_2(∞) exhibits some surprising behavior. Specifically, it was shown that there exist positive constants M_1 and M_2 such that for all sufficiently small β,

M_1 β^{-1} ≤ E(Q̄_2(∞)) ≤ M_2 β^{-1}.
Comparison with the M/M/N queue.
It is worth mentioning that the M/M/N queue in the Halfin-Whitt heavy-traffic regime has been studied extensively (see [41, 42, 58, 123, 124, 40, 125], and the references therein). In this case, the centered and scaled total number of tasks in the system converges weakly to a diffusion process S [58, Theorem 2] with

dS(t) = √2 dW(t) + m(S(t)) dt,  where m(x) = −β for x ≥ 0 and m(x) = −β − x for x < 0,   (3.11)

where W is the standard Brownian motion. As reflected in (3.8) and (3.11), the JSQ policy and the M/M/N system share some surprising similarities in terms of the qualitative behavior of the total number of tasks in the system. In particular, in both cases the number of idle servers and the number of waiting tasks are of the order √N. This shows that despite the distributed queueing operation a suitable load balancing policy can deliver a similar combination of excellent service quality and high resource utilization efficiency in the QED regime as in a centralized queueing arrangement. Moreover, the interchange of limits result implies that for systems under the JSQ policy, the centered and scaled total number of tasks converges weakly to Q̄_1(∞) + Q̄_2(∞), which has an Exponential upper tail (large positive deviation) and a Gaussian lower tail (large negative deviation), see (3.9). This is again reminiscent of the corresponding tail asymptotics for the M/M/N queue. Note that since S is a simple combination of a Brownian motion with a negative drift (when all servers are fully occupied) and an Ornstein–Uhlenbeck (OU) process (when there are idle servers), the steady-state distribution of S(∞) can be computed explicitly, and is indeed a combination of an exponential distribution (from the Brownian motion with a negative drift) and a Gaussian distribution (from the OU process).
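This explicit combination can be checked numerically. The sketch below, with grid parameters of our own choosing and assuming the piecewise drift in (3.11), glues the exponential and Gaussian pieces and recovers the Halfin–Whitt delay probability.

```python
# Unnormalized stationary density of the one-dimensional limit: exp(-beta*x) on
# x >= 0 (Brownian motion with drift -beta, all servers busy) glued to
# exp(-beta*x - x^2/2) on x < 0 (the OU phase, with idle servers).
import math

def stationary_density(beta=1.0, lo=-8.0, hi=40.0, n=200000):
    dx = (hi - lo) / n
    xs = [lo + (k + 0.5) * dx for k in range(n)]
    f = [math.exp(-beta * x) if x >= 0 else math.exp(-beta * x - x * x / 2.0)
         for x in xs]
    z = sum(f) * dx                       # normalizing constant
    return xs, [v / z for v in f]

xs, pi = stationary_density(beta=1.0)
dx = xs[1] - xs[0]
# mass on {x >= 0} = steady-state probability that all servers are busy, which
# should match the Halfin-Whitt formula (1 + beta * Phi(beta) / phi(beta))^{-1}
print(sum(p for x, p in zip(xs, pi) if x >= 0) * dx)
```

For β = 1 this mass evaluates to roughly 0.223, in line with the classical Halfin–Whitt delay-probability formula.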
Observe that in case of M/M/N systems, whenever there are some waiting tasks (equivalent to Q̄_2 being positive in our case), the queue length has a constant negative drift towards zero. This leads to the exponential upper tail of S(∞), by comparing with the stationary distribution of a reflected Brownian motion with constant negative drift. In the JSQ case, however, the rate of decrease of Q̄_2 is always proportional to Q̄_2 itself, which makes it somewhat counter-intuitive that its stationary distribution has an exponential tail.
In the M/M/N system, the number of idle servers can be non-zero only when the number of waiting tasks is zero. Thus, the dynamics of both the number of idle servers and the number of waiting tasks are completely captured by the one-dimensional total-task process, and by the one-dimensional diffusion S in the limit. But in the JSQ case, Q̄_2 is never zero, and the dynamics of (Q̄_1, Q̄_2) are truly two-dimensional (although the diffusion is non-elliptic), with Q̄_1 and Q̄_2 interacting with each other in an intricate manner.
From (3.8) we see that Q̄_2 never hits zero. Thus, in steady state, there is no mass at Q̄_2 = 0, and the system always has waiting tasks. This is in sharp contrast with the M/M/N case, where the system has no waiting tasks in steady state with positive probability.
In the M/M/N system, a positive fraction of the tasks incur a non-zero waiting time as N → ∞, but a non-zero waiting time is only of length O(1/√N) in expectation. Moreover, in the JSQ case, it is easy to see that Q̄_1 (the limit of the scaled number of idle servers) spends zero time at the origin, i.e., in steady state the fraction of arriving tasks that find all servers busy vanishes in the large-N limit. However, such tasks will have to wait for the duration of a residual service time (the time till the service of the task ahead of it in its queue finishes), yielding a waiting time of order 1.
As β → 0, [58, Proposition 2] implies that for the M/M/N queue the scaled number of waiting tasks, multiplied by β, converges weakly to a unit-mean exponential distribution. In contrast, the results of Banerjee & Mukherjee show that β Q̄_2(∞) converges weakly to a Gamma random variable with mean two. This indicates that despite a similar order of performance, due to the distributed operation, in terms of the number of waiting tasks JSQ is a factor 2 worse in expectation than the corresponding centralized system.
3.4 JSQ(d) policies in heavy-traffic regime
Finally, we briefly discuss the behavior of JSQ(d) policies with a fixed value of d in the Halfin-Whitt heavy-traffic regime (2.1). While a complete characterization of the occupancy process for fixed d has remained elusive so far, significant partial results were recently obtained by Eschenfeldt & Gamarnik. In order to describe the transient asymptotics, introduce the following rescaled processes:

Q̄_i^N(t) = (Q_i^N(t) − N)/√N, i = 1, 2, ….   (3.12)
Note that in contrast to (3.2), in (3.12) all components are centered by N. For suitable initial states, [30, Theorem 2] establishes that on any finite time interval, the rescaled occupancy process converges weakly to a deterministic system that satisfies the system of ODEs in (3.13), with the convention that Q̄_0(t) ≡ 0. It is noteworthy that the scaled occupancy process loses its diffusive behavior for fixed d. It is further shown that with high probability the steady-state fraction of queues with length at least log_d(√N/β) tasks approaches unity, which in turn implies that with high probability the steady-state delay is at least of order log_d(√N) as N → ∞. The diffusion approximation of the JSQ(d) policy in the Halfin-Whitt regime (2.1), starting from a different initial scaling, has been studied by Budhiraja & Friedlander.
In the work of Ying, a broad framework involving Stein’s method was introduced to analyze the rate of convergence of the stationary distribution in a heavy-traffic regime where (N − λ(N))/√N = g(N) as N → ∞, with g(N) a positive function diverging to infinity. Note that the case g(N) ≡ β corresponds to the Halfin-Whitt heavy-traffic regime (2.1). Using this framework, it was proved that the bound in (3.14) holds when d(N) grows suitably fast relative to g(N). Equation (3.14) not only shows that asymptotically the stationary occupancy measure concentrates at the fluid fixed point, but also provides the rate of convergence.
4 Universality of JSQ(d) policies
In this section we will further explore the trade-off between delay performance and communication overhead as a function of the diversity parameter d, in conjunction with the relative load. The latter trade-off will be examined in an asymptotic regime where not only the total task arrival rate λ(N) grows with N, but also the diversity parameter depends on N, and we write d(N) to explicitly reflect that. We will specifically investigate what growth rate of d(N) is required, depending on the scaling behavior of λ(N), in order to asymptotically match the optimal performance of the JSQ policy and achieve a zero mean waiting time in the limit. Unless specified otherwise, the results presented in the remainder of the section are based on the work in which these universality properties were established.
Theorem 4.1 (Universality fluid limit for JSQ(d(N))). If d(N) → ∞ as N → ∞, then for suitable initial conditions the fluid limit of the sequence of occupancy processes under the JSQ(d(N)) scheme coincides with that of the ordinary JSQ policy, and in particular is given by the dynamical system in (3.6).
Theorem 4.2 (Universality diffusion limit for JSQ(d(N))).
If d(N)/(√N log N) → ∞ as N → ∞, then for suitable initial conditions the weak limit of the sequence of centered and diffusion-scaled occupancy processes coincides with that of the ordinary JSQ policy, and in particular is given by the system of SDEs in (3.8).
The above universality properties indicate that the JSQ overhead can be lowered by almost a factor O(N) and O(√N/log N) while retaining fluid- and diffusion-level optimality, respectively. In other words, Theorems 4.1 and 4.2 reveal that it is sufficient for d(N) to grow at any rate, and faster than √N log N, in order to observe similar scaling benefits as in a pooled system with N parallel single-server queues on fluid scale and diffusion scale, respectively. The stated conditions are in fact close to necessary, in the sense that if d(N) is uniformly bounded or d(N)/(√N log N) → 0 as N → ∞, then the fluid-limit and diffusion-limit paths of the system occupancy process under the JSQ(d(N)) scheme differ, respectively, from those under the ordinary JSQ policy. In particular, if d(N) is uniformly bounded, the mean steady-state delay does not vanish asymptotically as N → ∞.
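These scaling benefits are already visible in a toy event-driven simulation contrasting d = 2, d = ceil(log N), and d = N (the latter recovering JSQ); all parameter choices below, and the busy-server proxy for the probability of wait, are ours and purely illustrative.

```python
# Jump-chain simulation of N exponential unit-rate servers with arrival rate
# lam*N and JSQ(d) routing; reports the fraction of arrivals whose chosen
# server was already busy (a rough proxy for the probability of wait).
import math, random

def simulate(n_servers, lam, d, horizon=5000, seed=1):
    rng = random.Random(seed)
    q = [0] * n_servers
    waits, arrivals = 0, 0
    while arrivals < horizon:
        busy = [i for i, x in enumerate(q) if x > 0]
        if rng.random() < lam * n_servers / (lam * n_servers + len(busy)):
            sample = rng.sample(range(n_servers), d)    # d = N recovers JSQ
            target = min(sample, key=lambda i: q[i])
            waits += q[target] > 0
            q[target] += 1
            arrivals += 1
        else:
            q[rng.choice(busy)] -= 1                    # departure
    return waits / arrivals

N = 100
for d in (2, math.ceil(math.log(N)), N):
    print(d, simulate(N, lam=0.9, d=d))
```

Even at moderate N, the sampled-d variants with growing d track the full JSQ policy closely, while d = 2 leaves a clearly positive fraction of arrivals joining a busy server.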
One implication of Theorem 4.1 is that in the subcritical regime any growth rate of d(N) is enough to achieve an asymptotically vanishing steady-state probability of wait. This result is complemented by recent results of Liu and Ying and Brightwell et al., where the steady-state analysis is extended to heavy-traffic regimes. Specifically, it is established in the former work that when the load of the N-th system scales as 1 − N^{−α} with 0 < α < 1/2 (i.e., the system is in heavy traffic, but the load is lighter than that in the Halfin-Whitt regime), the steady-state probability of wait for the JSQ(d) policy with suitably growing d(N) vanishes as N → ∞. The results of the latter work imply that, for suitable joint scalings of the load and d(N), with probability tending to 1 as N → ∞, the proportion of queues with a specific queue length approaches unity and there are no longer queues. It is important to note that a crucial difference between the result stated in Theorem 4.2 and the results in [23, 73] is that the former analyzes the system on diffusion scale (and describes its behavior in terms of a limiting diffusion process), whereas [23, 73] analyze the system on fluid scale (and characterize its behavior in terms of the limiting fluid-scaled occupancy state).
High-level proof idea.
The proofs of both Theorems 4.1 and 4.2 rely on a stochastic coupling construction to bound the difference in the queue length processes between the JSQ policy and a scheme with an arbitrary value of d(N). This coupling is then exploited to obtain the fluid and diffusion limits of the JSQ(d(N)) policy, along with the associated fixed point, under the conditions stated in Theorems 4.1 and 4.2.
A direct comparison between the JSQ(d(N)) scheme and the ordinary JSQ policy is not straightforward, which is why the class CJSQ(n(N)) of schemes is introduced as an intermediate scenario to establish the universality result. Just like the JSQ(d(N)) scheme, the schemes in the class CJSQ(n(N)) may be thought of as “sloppy” versions of the JSQ policy, in the sense that tasks are not necessarily assigned to a server with the shortest queue, but to one of the n(N)+1 lowest ordered servers, as graphically illustrated in Figure 3(a). In particular, for n(N) = 0, the class only includes the ordinary JSQ policy. Note that the JSQ(d(N)) scheme is guaranteed to identify the lowest ordered server, but only among a randomly sampled subset of d(N) servers. In contrast, a scheme in the CJSQ(n(N)) class only guarantees that one of the n(N)+1 lowest ordered servers is selected, but across the entire pool of N servers. We will show that for sufficiently small n(N), any scheme from the class CJSQ(n(N)) is still ‘close’ to the ordinary JSQ policy. We will further prove that for sufficiently large d(N) relative to n(N) we can construct a scheme called JSQ(n(N), d(N)), belonging to the class CJSQ(n(N)), which differs ‘negligibly’ from the JSQ(d(N)) scheme. Therefore, for a ‘suitable’ choice of d(N) the idea is to produce a ‘suitable’ n(N). This proof strategy is schematically represented in Figure 3(b).
In order to prove the stochastic comparisons among the various schemes, the many-server system is described as an ensemble of stacks, in a way that two different ensembles can be ordered. This stack formulation has also been considered in the literature for establishing the stochastic optimality properties of the JSQ policy [110, 113, 114]. In Remark 4.7 we will compare and contrast the various stochastic comparison techniques. In this formulation, at each step, items are added or removed (corresponding to an arrival or a departure) according to some rule. From a high level, it is then shown that if two systems follow some specific rules, then at any step, the two ensembles maintain some kind of deterministic ordering. This deterministic ordering turns into an almost sure ordering in the probability space constructed by a specific coupling. In what follows, each server along with its queue is thought of as a stack of items, and the stacks are always considered to be arranged in non-decreasing order of their heights. The ensemble of stacks then represents the empirical CDF of the queue length distribution, and the i-th horizontal bar corresponds to Q_i^Π (for some task assignment scheme Π), as depicted in Figure 3. For the sake of completeness, we will describe the coupling construction in the scenario when the buffer capacity b at each stack can possibly be finite. If b < ∞ and an arriving item happens to land on a stack which already contains b items, then the item is discarded, and is added to a special stack of discarded items, where it stays forever.
Any two ensembles A and B, each having N stacks and a maximum height b per stack, are said to follow Rule(k, ℓ_A, ℓ_B) at some step, if either an item is removed from the k-th stack in both ensembles (if nonempty), or an item is added to the ℓ_A-th stack in ensemble A and to the ℓ_B-th stack in ensemble B.
For any two ensembles of stacks A and B, if at every step Rule(k, ℓ_A, ℓ_B) is followed for some value of k, ℓ_A, and ℓ_B, with ℓ_A ≤ ℓ_B, then the following ordering is always preserved: for all m ≥ 1,

Σ_{i ≥ m} Q_i^A + L^A ≤ Σ_{i ≥ m} Q_i^B + L^B,

where Q_i^E denotes the number of stacks with at least i items in ensemble E, and L^E the cumulative number of discarded items.
This proposition says that, while adding the items to the ordered stacks, if we ensure that in ensemble A the item is always placed to the left of (or at the same position as) that in ensemble B, and if the items are removed from the same ordered stack in both ensembles, then the aggregate size of the m highest horizontal bars as depicted in Figure 3 plus the cumulative number of discarded items is no larger in A than in B throughout.
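The deterministic ordering can be checked mechanically on random ensembles. The toy check below (with infinite buffers, so no discarded items) follows our reading of the rule and proposition above, and is in no way the survey's proof.

```python
# Two ensembles of 6 ordered stacks; at each arrival A places the item at a
# lower or equal ordered position than B, and departures remove from the same
# ordered stack in both (if nonempty). We verify that the aggregate size of the
# m highest stacks in A never exceeds that in B, for every m.
import random

def top_sums(stacks):
    s = sorted(stacks)
    return [sum(s[len(s) - m:]) for m in range(1, len(s) + 1)]

rng = random.Random(7)
n = 6
A, B = [0] * n, [0] * n
for _ in range(2000):
    A.sort(); B.sort()                    # keep stacks ordered by height
    if rng.random() < 0.55:               # arrival
        la = rng.randrange(n)             # position used by ensemble A
        lb = rng.randrange(la, n)         # B uses a higher or equal position
        A[la] += 1
        B[lb] += 1
    else:                                 # departure from the k-th ordered stack
        k = rng.randrange(n)
        if A[k] > 0: A[k] -= 1
        if B[k] > 0: B[k] -= 1
    assert all(x <= y for x, y in zip(top_sums(A), top_sums(B)))
print("ordering preserved:", top_sums(A), top_sums(B))
```

Since stack identities are immaterial, only the sorted height multisets matter, which is why re-sorting before each step faithfully mimics the ordered-stack formulation.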
Another type of sloppiness.
Recall that the class CJSQ(n) contains all schemes that assign incoming tasks by some rule to one of the n+1 lowest ordered servers. Let MJSQ(n) be a particular scheme that always assigns incoming tasks to precisely the (n+1)-th ordered server. Notice that this scheme is effectively the JSQ policy when the system always maintains n idle servers, or equivalently, uses only N − n servers, and MJSQ(n) belongs to CJSQ(n). For brevity, we will often suppress n in the notation where it is clear from the context. We call any two systems S-coupled, if they have synchronized arrival clocks and departure clocks of the k-th longest queue, for k = 1, …, N (‘S’ in the name of the coupling stands for ‘Server’). Consider three S-coupled systems following respectively the JSQ policy, any scheme from the class CJSQ, and the MJSQ scheme. Recall that Q_i^Π(t) is the number of servers with at least i tasks at time t and L^Π(t) is the total number of lost tasks up to time t, for the schemes Π = JSQ, CJSQ, MJSQ. The following proposition provides a stochastic ordering for any scheme in the class CJSQ with respect to the ordinary JSQ policy and the MJSQ scheme.
For any fixed m ≥ 1,

(i) {Σ_{i ≥ m} Q_i^{JSQ}(t) + L^{JSQ}(t)}_{t ≥ 0} is stochastically dominated by {Σ_{i ≥ m} Q_i^{CJSQ}(t) + L^{CJSQ}(t)}_{t ≥ 0},
(ii) {Σ_{i ≥ m} Q_i^{CJSQ}(t) + L^{CJSQ}(t)}_{t ≥ 0} is stochastically dominated by {Σ_{i ≥ m} Q_i^{MJSQ}(t) + L^{MJSQ}(t)}_{t ≥ 0},

provided the inequalities hold at time t = 0.
The above proposition has the following immediate corollary, which will be used to prove bounds on the fluid and the diffusion scale.
In the joint probability space constructed by the S-coupling of the three systems under respectively JSQ, MJSQ, and any scheme from the class CJSQ, the following ordering is preserved almost surely throughout the sample path: for any fixed m ≥ 1,

Σ_{i ≥ m} Q_i^{JSQ}(t) + L^{JSQ}(t) ≤ Σ_{i ≥ m} Q_i^{CJSQ}(t) + L^{CJSQ}(t) ≤ Σ_{i ≥ m} Q_i^{MJSQ}(t) + L^{MJSQ}(t) for all t ≥ 0,

provided the inequalities hold at time t = 0.
Note that Σ_i min(Q_i^Π, m) represents the aggregate size of the m rightmost stacks, i.e., the m longest queues. Using this observation, the stochastic majorization property of the JSQ policy as stated in [110, 113, 114] can be shown following similar arguments as in the proof of Proposition 4.5. Conversely, the stochastic ordering between the JSQ policy and the MJSQ scheme presented in Proposition 4.5 can also be derived from the weak majorization arguments developed in [110, 113, 114]. But it is only through the stack arguments described above that the results can be extended to compare any scheme from the class CJSQ with the scheme MJSQ as well, as in Proposition 4.5 (ii).
Comparing two arbitrary schemes.
To analyze the JSQ(d) scheme, we need a further stochastic comparison argument. Consider two S-coupled systems following schemes Π_1 and Π_2. Fix a specific arrival epoch, and let the arriving task join the ν_i-th ordered server in the i-th system following scheme Π_i, i = 1, 2 (ties can be broken arbitrarily in both systems). We say that at a specific arrival epoch the two systems differ in decision if ν_1 ≠ ν_2, and denote by Δ_{Π_1, Π_2}(t) the cumulative number of times the two systems differ in decision up to time t.
For two S-coupled systems under schemes Π_1 and Π_2, the following inequality is preserved almost surely:

Σ_{i ≥ 1} |Q_i^{Π_1}(t) − Q_i^{Π_2}(t)| ≤ 2 Δ_{Π_1, Π_2}(t) for all t ≥ 0,

provided the two systems start from the same occupancy state at t = 0, i.e., Q_i^{Π_1}(0) = Q_i^{Π_2}(0) for all i ≥ 1.
A bridge between two types of sloppiness.
We will now introduce the JSQ(n, d) scheme with n, d ≤ N, which is an intermediate blend between the CJSQ(n) schemes and the JSQ(d) scheme. At its first step, just as in the JSQ(d) scheme, the JSQ(n, d) scheme chooses the shortest of d randomly sampled candidates, but only sends the arriving task to that server’s queue if it is one of the n+1 lowest ordered servers. If it is not, then at the second step it picks one of the n+1 lowest ordered servers uniformly at random and sends the task to that server’s queue. Note that by construction, JSQ(n, d) is a scheme in CJSQ(n). Consider two S-coupled systems with a JSQ(d) and a JSQ(n, d) scheme. Assume that at some specific arrival epoch, the incoming task is dispatched to the ν-th ordered server in the system under the JSQ(d) scheme. If ν ≤ n+1, then the system under the JSQ(n, d) scheme also assigns the arriving task to the ν-th ordered server. Otherwise, it dispatches the arriving task uniformly at random among the first n+1 ordered servers.
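A sketch of this two-step rule, with function names and tie-breaking conventions of our own choosing:

```python
# JSQ(n, d) sketch: sample d servers; if the sampled minimum is among the n+1
# lowest ordered servers, use it; otherwise pick one of the n+1 lowest ordered
# servers uniformly at random. Ties in the ordering are broken by server index.
import random

def jsq_n_d_assign(queues, n, d, rng):
    order = sorted(range(len(queues)), key=lambda i: queues[i])
    sampled = rng.sample(range(len(queues)), d)
    best = min(sampled, key=lambda i: queues[i])
    if order.index(best) <= n:
        target = best                        # sampled minimum is low enough
    else:
        target = rng.choice(order[: n + 1])  # fall back on the n+1 lowest
    queues[target] += 1
    return target

rng = random.Random(3)
q = [0, 2, 1, 3, 0, 2]
t = jsq_n_d_assign(q, n=1, d=2, rng=rng)
assert q[t] == 1   # with n = 1 the task lands on one of the two empty servers
```

Note that in either branch the chosen server is among the n+1 lowest ordered ones, which is exactly what places JSQ(n, d) inside the class CJSQ(n).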
The next proposition provides a bound on the number of times these two systems differ in decision on any finite time interval. For any T ≥ 0, let A(T) and Δ(T) be the total number of arrivals to the system and the cumulative number of times that the JSQ(d) scheme and the JSQ(n, d) scheme differ in decision up to time T.
For any T ≥ 0, Δ(T) is stochastically bounded by a Binomial random variable with parameters A(T) and p, where p is the probability that a uniformly sampled set of d servers contains none of the n+1 lowest ordered servers.
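In our reading, the two coupled schemes differ in decision exactly when the sample of d servers avoids the n+1 lowest ordered ones, which for sampling without replacement has probability C(N−n−1, d)/C(N, d); a quick Monte Carlo sanity check:

```python
# Probability that a uniform sample of d out of N servers contains none of the
# n+1 lowest ordered servers, compared against an empirical frequency.
import math, random

def differ_prob(N, n, d):
    return math.comb(N - n - 1, d) / math.comb(N, d)

rng = random.Random(5)
N, n, d = 50, 9, 4
trials = 100000
hits = sum(min(rng.sample(range(N), d)) > n for _ in range(trials))
print(hits / trials, differ_prob(N, n, d))
```

This probability decays geometrically in d once n is a non-trivial fraction of N, which is the mechanism behind choosing d(N) large relative to N/n(N).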
First it is shown that if n(N)/N → 0 as N → ∞, then the MJSQ(n(N)) scheme has the same fluid limit as the ordinary JSQ policy.
Then the application of Corollary 4.6 proves that as long as n(N)/N → 0, any scheme from the class CJSQ(n(N)) has the same fluid limit as the ordinary JSQ policy.
The proof of Theorem 4.2 follows the same arguments, but uses the candidate n(N) with n(N)/√N → 0 (instead of n(N)/N → 0) in Steps (i) and (ii), and the candidate