In multi-server distributed queueing systems, the access of stochastically arriving jobs to resources, or servers, is often regulated by a central dispatcher, also known as a load balancer. A fundamental problem consists in designing a load balancing algorithm able to minimize the delays experienced by jobs. In this paper, we are interested in a setting where a traffic of rate λN needs to be distributed across N unit-rate parallel servers, each with its own queue, as indicated in Figure 1.
The load balancer may rely on feedback information coming from the servers, which may also be stored in a local memory. This type of model finds applications, for instance, in computer and communication systems, hospitals, and road networks, and it is not surprising that there exists a significant and growing number of references; see, e.g., the recent works Ying et al. , Gardner et al. , Gamarnik et al. , Gupta and Walton  and the references therein. Nevertheless, it is often difficult to establish whether one algorithm is better than another because in general the answer strongly depends on the underlying architecture, application or traffic conditions.
During the last two decades, the power-of-d-choices algorithm, introduced in Mitzenmacher , Vvedenskaya et al.  and referred to as SQ(d), has emerged as a breakthrough in the foundations of this area due to its versatility and its appealing asymptotic properties. It works as follows: upon arrival of each job, d servers are contacted uniformly at random, their state (e.g., queue length or workload) is retrieved, and finally the job is dispatched to a server in the best observed state. The first remarkable property is that in the large-server limit, N → ∞, the stationary proportion of servers with at least k jobs decreases doubly exponentially in k, though it remains strictly positive for all k. This result has been generalized in Bramson et al.  to the case where service times are heavy-tailed rather than exponential. It turns out that SQ(d) is heavy-traffic optimal in the sense that it minimizes the workload over all time in the diffusion limit where λ ↑ 1; see Chen and Ye  and Maguluri et al. . In Ying et al. , it is shown that the number of sampled servers can be dramatically reduced by using the fact that tasks arrive in batches. This is useful to reduce communication overheads. In Mitzenmacher et al. , the power-of-d-choices algorithm is studied in the case where the load balancer is endowed with a local memory that stores the index and the state of the least loaded server out of the d sampled each time a job arrives. When the n-th job arrives, the winning server is chosen among the d servers randomly selected upon its arrival and the least loaded server seen by the (n−1)-th job. The resulting performance is better than the one achieved by SQ(d). If d is allowed to depend on N and λ, SQ(d) has been recently shown to become fluid (or mean-field) optimal, i.e., optimal in the large-system limit; see Mukherjee et al. , Dieker and Suk . In this setting, optimality refers to the ability of assigning each incoming job to an idle server.
Our work also aims at achieving fluid optimality, though we will consider d as a constant in order to keep the communication overhead at a minimum. Towards this purpose, we will show that it is enough to endow the load balancer with a local memory that keeps track of the latest observation collected on each server. This approach is close to Mitzenmacher et al. , though different because in that reference the memory can only store one observation. In fact, one observation (or even a finite number of observations) is not enough to achieve fluid optimality.
In Algorithm 1, we provide a pseudocode for the power-of-d-choices algorithm with memory and N servers, referred to as SQ(d) with memory; in the Conclusions, we will describe some variants and practical optimizations. Upon arrival of one job, the states collected from d randomly chosen servers are stored in the local array Memory. Then, the job is sent to a server chosen uniformly at random among the ones having the lowest recorded state. Finally, the observation of the selected server is incremented by one. Server selections are thus with replacement.
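The per-arrival logic described above can be sketched as follows; this is our own minimal Python rendering of Algorithm 1 (names such as `dispatch`, `queues` and `memory` are ours, not the paper's):

```python
import random

def dispatch(queues, memory, d, rng=random):
    """One arrival step of power-of-d-choices with memory (a sketch of
    Algorithm 1; variable names are ours, not the paper's).

    queues: list of current queue lengths (the true server states)
    memory: list of the last observation recorded for each server
    d:      number of servers sampled per arrival
    Returns the index of the server that receives the job.
    """
    n = len(queues)
    # Sample d servers uniformly at random, with replacement, and
    # refresh their entries in the local memory (Line 5 of Algorithm 1).
    for _ in range(d):
        i = rng.randrange(n)
        memory[i] = queues[i]
    # Send the job to a server chosen uniformly among those with the
    # lowest *recorded* state (Line 8 of Algorithm 1).
    lowest = min(memory)
    winner = rng.choice([i for i, m in enumerate(memory) if m == lowest])
    queues[winner] += 1
    # The recorded observation of the winner is incremented by one,
    # so memory stays consistent with the assignment just made.
    memory[winner] += 1
    return winner
```

Note that the memory is updated both by the sampling step and by the assignment itself, which is what allows stale observations to coexist with fresh ones.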
It is intuitive that SQ(d) with memory results in more balanced allocations than SQ(d). This follows by using the coupling argument developed in Theorem 3.5 of Azar et al. , which can be adapted to argue that at any point in time the vector of queue lengths achieved with SQ(d) with memory is majorized by the vector of queue lengths achieved with SQ(d). On the other hand, it is not clear how large such an improvement can be. Quantifying it is the goal of the present paper.
We investigate the dynamics of SQ(d) with memory by means of a continuous-time Markov chain that keeps track of the proportion of servers with k jobs and for which the last observation collected by the load balancer is m, for all k and m. To the best of our knowledge, this is the first paper that studies the time-varying dynamics induced by SQ(d) with memory. The transition rates of this Markov chain are non-Lipschitz, and a satisfactory analysis when N is finite seems to be out of reach. Our main contributions are as follows:
In Theorem 1, we let N → ∞ and identify the fluid limit of the Markov chain, an absolutely continuous function that is interpreted as a first-order approximation of the original model. The fluid limit is motivated by the fact that real systems are composed of many servers and by the fact that it enables a tractable analysis of the dynamics of SQ(d) with memory. A fluid limit is necessarily a fluid solution, as introduced in Definition 1. The proof of the fluid limit is the main technical part of this work and is given in Section 4. The main difficulty lies in the discontinuous structure of the drift; see Section 2.2 for further details.
We then study fixed points, i.e., fluid solutions that are constant over time. Theorem 2 shows that there exists a unique fixed point. The general structure of this fixed point as a function of λ and d is quite cumbersome and implies that
The queue length of each server never exceeds k*, where k* will be defined in Section 3 by means of a polynomial equation. This is in contrast with SQ(d), where queue lengths are unbounded; see Mitzenmacher . Figure 2 illustrates the behaviour of the tight upper bound k* by varying λ and d, and shows that the size of the most loaded server will remain very small even when λ is very close to its critical value, one.
In fact, even for such values of λ and d, no server will contain more than just 5 (respectively, 3) jobs.
The load balancer memory can only contain two possible observations, namely k* − 1 and k*.
The case of particular interest is when λ < λ1, where k* = 1 and thus the load balancer memory always contains a strictly positive proportion of zeros. This means that the load balancer can always assign incoming jobs to idle servers, which is clearly the ideal situation for any incoming job. In this sense we say that SQ(d) with memory is asymptotically optimal. When λ > λ1, the load balancer memory will never contain a strictly positive mass of zeros, but it will still be able to assign a fraction of jobs to idle servers, ensuring that the average number of jobs in each queue belongs to the interval given in Proposition 2.
Finally, we investigate stability properties of the unique fixed point. Theorem 3 establishes that fluid solutions converge to this point regardless of the initial condition, and exponentially fast, provided that λ < λ1. Thus, in this case all fluid solutions will eventually be asymptotically optimal, as the load balancer memory will eventually be populated by a strictly positive mass of zeros. The proof of this result, given in Section 5.3, is based on a Lyapunov-type argument that allows us to show that the time evolution of fluid solutions is eventually governed by the unique solution of a simple linear and autonomous ODE system.
2 Performance Models
In order to describe the time-varying effects of SQ(d) with memory on queue lengths, we introduce a stochastic and a deterministic model. The stochastic model is meant to capture the variability of job inter-arrival and service times that is intrinsic in multi-server distributed queueing systems. Due to its intractability, a satisfactory analysis of this model is out of reach. In this respect, the deterministic model is convenient because it does enable analytical tractability. In this section, we also show our first result, which states that the two models are connected to each other: the deterministic model can be interpreted as a first-order approximation of the stochastic one.
In the following, we will refer to a server with k jobs and for which its last observation at the controller is m as a (k, m)-server.
2.1 Markov Model
First, we model the dynamics induced by SQ(d) with memory as a continuous-time Markov chain: arrivals at the load balancer are assumed to follow a Poisson process with rate λN, and each queue has finite capacity. A job that is sent to a server whose queue is full is rejected.
Let be the system state at time : represents the number of jobs in queue at time , and represents the last observation collected from server by the controller at time . At time zero, the controller is assumed to know the vector of initial queue lengths.
Since the system is symmetric with respect to the servers, it is convenient to represent the system state by the empirical measure whose (k, m)-component
denotes the proportion of (k, m)-servers at time t. It is clear that this process is still a Markov chain with values in some finite set that is a subset of
The transitions and rates of the Markov chain that are due to server departures are easy to write because they have no impact on memory: for , the transition occurs with rate , where and is the Kronecker delta. On the other hand, the transitions and rates that are due to job arrivals are quite complex to write and are omitted. However, in Section 4.1 we will show how to construct the sample paths of the process.
2.2 Fluid Model
For any , let
The next definition introduces the fluid model for the dynamics of .
A function is said to be a fluid model (or fluid solution) if the following conditions are satisfied:
is absolutely continuous, and
almost everywhere, for every and ,
where is given by
The discontinuous function above will be referred to as the drift, and to some extent it may be interpreted as the conditional expected change from state of the Markov chain, though this may only be true when , where for all and the formulas above become linear, admitting a very intuitive explanation.
Let us provide some intuition for the drift expressions in Definition 1, and let us start with coordinate (0,0). At the moment of each arrival at the load balancer, the states of d servers are sampled and idle servers that the load balancer has not yet spotted are sampled with probability . Since and arrivals occur with rate λN, the rationale behind the first term in (3) is clear. The dynamics that remain to be specified are the ones related to the effective job assignments, that is, where singularities can happen. In order to build a fluid model ‘consistent’ with the finite stochastic system, one should take into account the fluctuations of order that appear when . These bring discontinuities in the drift. Let and . We notice that , where is defined in (8), will be interpreted as the proportion of time where the process stays at zero in the limit where first and then ; this will be formalized in Section 4.3.2. Since is the probability of sampling servers containing more than jobs only, is then interpreted as the rate at which the process tends to remain at zero. Thus, the term represents the rate at which jobs are assigned to (0,0)-servers, which become (1,1)-servers as soon as they receive a job. This explains the drift expression in (3). The particular structure of given in (8) will be the outcome of the stochastic analysis performed in Section 4.
Let us provide some intuition also for the drift expression on coordinate (1,1) (see (5)), as it brings some additional interpretation that also applies to general coordinates. The first term says that departures from -servers occur with rate and the second one says that new -servers are discovered with rate . This can be easily justified as done above for (0,0). Then, we notice that the term has already been interpreted above, and thus the dynamics that remain to be specified are the ones related to job assignments at (1,1)-servers. According to SQ(d) with memory, if the load balancer knows no -server then it randomizes over the set of -servers, and thus within this scenario should decrease with a rate proportional to . This is indeed the case if . Thus, is the rate at which jobs are assigned to (1,1)-servers when . It remains to model the rate at which jobs are assigned to (1,1)-servers when . Since we aim at building a deterministic model ‘consistent’ with the stochastic one, to model the rate of job assignments to (1,1)-servers when one should take into account the fluctuations of order that appear when . The term is indeed such a rate, and again it will be the outcome of the stochastic analysis developed in Section 4.
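Since the drift expressions (3)–(8) are not reproduced here, the following is only a generic sketch of how a fluid solution can be explored numerically once a drift function is fixed; the two-dimensional drift below is a toy placeholder, not the drift of Definition 1:

```python
def euler(drift, x0, t_end, dt=1e-3):
    """Forward-Euler integration of dx/dt = drift(x).

    This is only a numerical sketch: a fluid solution in the sense of
    Definition 1 is an absolutely continuous function satisfying the
    ODE almost everywhere, and the Euler iterates approximate it.
    """
    x = list(x0)
    t = 0.0
    while t < t_end:
        f = drift(x)
        x = [xi + dt * fi for xi, fi in zip(x, f)]
        t += dt
    return x

# Toy placeholder drift (NOT the paper's): mass drains from the first
# coordinate at unit rate and flows into the second, which also drains.
toy = lambda x: [-x[0], x[0] - x[1]]
```

For the toy drift the exact solution starting from (1, 0) is (e^{−t}, t e^{−t}), which the Euler iterates track closely for small step sizes.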
The following proposition will be proven in Section 4.
Fluid solutions exist.
2.3 Connecting the Markov and the Fluid Models
Our first result is the following connection between the stochastic and the fluid models.
Assume that almost surely. With probability one, any limit point of the stochastic process satisfies the conditions that define a fluid solution.
In view of this result, proven in Section 4, a fluid solution may be interpreted as an accurate approximation of the time-dependent dynamics of the finite stochastic system when N is sufficiently large.
Given , let us define the functions
We notice that represents the number of jobs in the system at time scaled by , and that represents the number of jobs, scaled by , that the load balancer believes are in the system at time . Since the system is symmetric with respect to the servers, the former is also interpreted as the average number of jobs at time in each queue.
It is clear that , which is to be expected because (k, m)-servers cannot contain more than m jobs.
The following corollary of Theorem 1 is immediate.
Let be a fluid solution. Assume that is a limit point of with probability one. Then, and are limit points of and , respectively, with probability one.
We complement Theorem 1 and Corollary 1 by presenting some numerical simulations to support the claim that the fluid model provides a remarkably accurate approximation of the sample paths of the stochastic system even when N is finite and relatively small. Assuming , Figure 3 plots the time-dependent dynamics of and .
At time zero, we have chosen and such that , which means that all servers are idle and the load balancer is aware of it. Each curve on these plots is an average over ten simulations. The fluid (stochastic) model is always represented by dashed (continuous) lines. In the picture on the left, we set and notice that the fluid model already captures in an accurate manner the dynamics of , which turn out to be concentrated more and more on just three components, namely (0,0), (0,1) and (1,1). As a matter of fact, this mass gets closer and closer to 1 when both and increase. In the picture on the right, dynamics are distributed over several components and for convenience we have plotted together with its fluid-model counterpart . We notice that the latter almost overlaps the trajectory of already when . This size is in agreement with the magnitude of modern distributed computing systems such as web-server farms or data centers, as they are often composed of (tens of) thousands of servers.
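Simulations of this kind can be reproduced with a short event-driven sketch of the finite stochastic system. The code below is our own minimal rendering (Poisson arrivals of rate λN, unit-rate exponential services, and, for simplicity, no buffer limit), not the authors' simulator:

```python
import random

def simulate(n, lam, d, t_end, rng):
    """Event-driven simulation of n servers under power-of-d-choices
    with memory: Poisson(lam*n) arrivals, unit-rate exponential services.
    Returns the queue-length vector at time t_end.  A sketch only."""
    queues = [0] * n
    memory = [0] * n
    t = 0.0
    while True:
        busy = sum(1 for q in queues if q > 0)
        rate = lam * n + busy              # total event rate
        t += rng.expovariate(rate)
        if t > t_end:
            return queues
        if rng.random() < lam * n / rate:  # arrival event
            for _ in range(d):             # sample d servers w/ replacement
                i = rng.randrange(n)
                memory[i] = queues[i]
            lowest = min(memory)
            w = rng.choice([i for i, m in enumerate(memory) if m == lowest])
            queues[w] += 1
            memory[w] += 1
        else:                              # departure from a busy server
            w = rng.choice([i for i, q in enumerate(queues) if q > 0])
            queues[w] -= 1
```

Averaging the empirical proportions of (k, m)-servers over several such runs produces the continuous curves of Figure 3.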
3 Main Results
In this section we focus on fluid solutions and establish properties concerning their optimality and stability. First, we are interested in fixed points.
We say that a fluid solution is a fixed point if for all .
When fluid solutions are fixed points, we drop the dependency on .
Let us introduce the intervals
where λ0 := 0 and λk, for k ≥ 1, is the unique root in (0, 1) of the polynomial equation
The first values of λk can be computed numerically. Let also k* be the unique integer such that λ belongs to the k*-th of these intervals, and for simplicity let us assume that λ is not a boundary point.
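The polynomial equation defining λk is not reproduced here; purely as an illustration of how such a threshold can be computed numerically once the polynomial is known, here is a generic bisection sketch (the sample polynomial used in the test is a placeholder, not the equation referenced above):

```python
def bisect_root(p, lo=0.0, hi=1.0, tol=1e-12):
    """Find the unique root of p in (lo, hi) by bisection, assuming
    p changes sign exactly once on the interval."""
    assert p(lo) * p(hi) < 0, "p must change sign on (lo, hi)"
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if p(lo) * p(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

For instance, for the placeholder polynomial p(x) = x² + x − 1, bisection on (0, 1) recovers the positive root (√5 − 1)/2.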
Theorem 2 (Existence and Uniqueness of Fixed Points).
There exists a unique fixed point, say . It is such that and
This result, proven in Section 5.1, establishes the existence and uniqueness of a fixed point and says that its mass is concentrated only on coordinates of the form and . Thus, our first remark is that queue lengths are bounded by k*. As we show in our proof, an explicit expression for the fixed point seems to be out of reach, though it can easily be computed numerically when λ and d are fixed. In fact, in Section 5.1 we provide an explicit expression for the fixed point when as a function of , and identify by means of a polynomial equation of degree (see (60)).
A case of particular interest is when λ < λ1, i.e., k* = 1, where we have the following remark.
Remark 1 (Asymptotic Optimality).
If λ < λ1, then Theorem 2 implies that , , and on the remaining coordinates. Thus, provided that dynamics converge to the fixed point, we have shown that a load balancer implementing SQ(d) with memory is always aware of the fact that some servers are idle when λ < λ1 and N is sufficiently large. In this scenario, the load balancer can certainly assign each incoming job to one of such idle servers, and the job itself would incur zero delay. This is in fact the ideal situation for any arriving job, and in this sense we say that SQ(d) with memory is asymptotically optimal.
The next proposition provides further insights on the system performance at the fixed point .
Let the fixed point be as in Theorem 2. Then,
Proposition 2, proven in Section 5.2, provides simple bounds on the average number of jobs in each queue. It also says that there is a fluid mass equal to that the load balancer will never spot. In other words, the samplings performed by the load balancer at each arrival will correctly reconstruct the true state of the system only up to an (absolute) error of .
In Remark 1, we discussed the asymptotic optimality of SQ(d) with memory postulating some form of stability for fluid solutions when λ < λ1. The next result shows that fluid solutions are indeed globally stable when λ < λ1.
Theorem 3 (Global Stability).
Let be a fluid solution. If λ < λ1, then
where ‖·‖ is the Euclidean norm.
The proof of Theorem 3 is given in Section 5.3 and is based on the following ‘Lyapunov-type’ argument. When λ < λ1, we first show that , which implies that decreases with derivative bounded away from zero. Thus, in finite time must increase, and when it does we show that the fluid solution is uniquely determined by the unique solution of a linear ODE system of the form . At this point, (13) follows by standard results of ODE theory. Our proof actually says more than the statement of the theorem itself because it provides insights on the speed of convergence to the fixed point. Given that the general solution of the previous linear ODE has the matrix exponential form
and the eigenvalues of the associated matrix will be shown to be strictly negative, the convergence to the fixed point occurs exponentially fast. When λ ≥ λ1, a generalization of this argument is complicated by the involved structure of the drift. However, we conjecture that the fixed point is again globally stable, i.e., (13) still holds true. This is also confirmed by the numerical simulations shown in Section 2.3.
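The exponential-convergence mechanism can be illustrated with a toy computation: for any matrix A whose eigenvalues have strictly negative real parts, the solution exp(At)x₀ of x′ = Ax decays exponentially. The matrix below is a placeholder, not the one arising in the proof:

```python
import numpy as np

def expm_times(A, x0, t):
    """Evaluate exp(A t) @ x0 via eigendecomposition (A assumed
    diagonalizable); this is the general solution of x' = A x."""
    w, V = np.linalg.eig(A)
    return (V @ np.diag(np.exp(w * t)) @ np.linalg.inv(V) @ x0).real

# Toy stable matrix (placeholder): its eigenvalues are -1 and -3.
A = np.array([[-2.0, 1.0], [1.0, -2.0]])
```

With x₀ an eigenvector for the eigenvalue −3, the solution is exactly e^{−3t} x₀, and its norm vanishes exponentially fast.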
4 Connection Between the Fluid and the Markov Models
Our proof is based on three steps. First, we construct the sample paths of the process on each pair of coordinates. This is achieved using a common coupling technique that defines the processes for all N on a single probability space and in terms of a finite number of “fundamental processes”. Then, we show that limit trajectories exist and are Lipschitz continuous with probability one. This is done by using standard arguments, e.g., Gamarnik et al. , Tsitsiklis and Xu , and Bramson . Finally, we prove that any such limit trajectory must be a fluid solution, which is the main difficulty. This last step is based on technical arguments that are specific to the stochastic model under investigation.
4.1 Probability Space and Coupled Construction of Sample Paths
We construct a probability space where the stochastic processes are coupled. All the processes of interest will be a function of the following fundamental processes, all of them independent of each other:
, the Poisson processes of job arrivals, with rate , defined on ;
, the Poisson processes of potential job departures, with rate 1, defined on ;
for all , , , where the random variables , and , for all and
, are all independent and uniformly distributed over the interval [0, 1]. These are selection processes: will select the servers to sample at each arrival (see Line 5 of Algorithm 1), will be used to randomize among the servers having the lowest observations (see Line 8 of Algorithm 1) and will select the server that fires a departure. These processes are defined on ;
, the process of the initial conditions, where each random variable takes values in , defined on .
Each process , with , can be constructed on this probability space by using that , where denotes equality in distribution. This equality ensures that the Poisson process with rate λN, which represents the arrival process associated to the N-th system, is coupled with the fundamental Poisson process . Since , this coupling is also used for the processes of potential job departures.
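The time-change coupling can be sketched as follows: a single unit-rate Poisson process drives the arrival processes of every system size N, with the N-th arrival process obtained by rescaling time by λN. All names and normalizations below are ours:

```python
import random

def unit_poisson_times(t_end, rng):
    """Jump times of a single unit-rate Poisson process on [0, t_end]."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(1.0)
        if t > t_end:
            return times

def coupled_arrival_times(times, n, lam):
    """Arrival times of the n-th system under the time change
    A^n(t) = A(lam * n * t): the k-th arrival of the n-th system occurs
    when the driving unit-rate process has its k-th jump."""
    return [t / (lam * n) for t in times]
```

Because all system sizes share the same driving jump times, the resulting arrival processes are defined on one probability space, which is exactly what the coupling requires.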
Now, let and be the times of the th jump of the Poisson processes and , respectively. Let also and . In view of the coupling discussed above, we can construct as follows
In the above expression, the term (14a) is related to the action of sampling servers and the term (14b) to the action of assigning each job to a server. At the arrival of the -th job, , the proportion of (0,0)-servers increases by if -servers are sampled, for any , which justifies the term in (14a), and decreases by except when such proportion is zero immediately before the arrival and no idle server is sampled at time , which justifies the term in (14b).
Using the random variables and , an expression similar to (14) can be written for the remaining coordinates. For simplicity, let us define
We notice that is the number of -servers sampled immediately before time . Furthermore, let also
if , and
if . We will use , with , in the scenario where and for all and , that is, the case where the load balancer memory contains no observation less than and no server containing less than jobs is sampled immediately before . In this case, according to SQ(d) with memory, the -th job must be routed to a random -server, provided that such a server exists. This randomness is captured by the uniform random variable , and we notice that is the number of -servers, or equivalently the number of occurrences of in the memory of the load balancer, at the arrival of the -th job and after having performed the associated sampling of the states of d random servers. Under these conditions, the job arriving at time is routed to an -server if and only if .
Provided that , the following formula constructs the process on coordinates
The summations in (18a) and (18b) refer, respectively, to job departures from - and -servers, the summation in (18c) refers to the case where -servers are sampled (as soon as of them are sampled, they become -servers, and thus decreases by ), and the summations in (18d) and (18e) refer to the cases where a job is assigned to an -server and to an -server, respectively. We notice that a job can be assigned at time to an -server only if , i.e., the memory contains no server with observation less than (i.e., ) and no -server, for some and for any , has been sampled at time . Summation (18f) covers the boundary case where , and admits a similar interpretation.
Similarly, when , we have