1 Introduction
In this tutorial, our aim is to summarize the implications of learning and information for queueing systems. Over the past decade, the theory of machine learning and data science has grown rapidly. As a consequence, techniques from these areas have infused themselves within the various disciplines of operations research. The potential scope for applying these techniques to queueing networks is vast and important given the role of queueing networks in areas such as communications, manufacturing, healthcare, supply chain and transportation. However, it is reasonable to say that the combination of learning and queueing theory is still in its infancy, particularly when compared with other operations research disciplines, such as revenue management, where there are well-established connections between parameter estimation and decision making.
From a practical perspective, it is often possible to apply existing learning techniques, such as reinforcement learning, to a queueing system. However, from a theoretical perspective, the assumptions of these learning frameworks often do not naturally fit within the assumptions of a queueing theoretic model. Moreover, in instances where they do match, the two approaches can often be treated as independent tasks. For instance, statistical uncertainties in parameter estimates might simply be propagated through an appropriate queueing formula.
With this noted, the aim of this article is to introduce the reader to a few techniques which we believe are important for understanding the role of information and learning in queueing systems, as well as potential challenges. We do this first by providing examples of theoretical results applied to a queueing system. These initial examples provide motivation from which we can then survey recent work incorporating learning within queueing network analysis. Toward the end of this tutorial, we discuss the challenges in developing techniques from reinforcement learning for queueing systems.
We now briefly discuss the two examples that constitute the theoretical results presented in this article. First, we relate Blackwell’s Approachability Theorem with the MaxWeight (and Backpressure) policies for queueing networks. Blackwell’s Approachability Theorem is a classical result in game theory, learning and control. It provides a necessary and sufficient condition for the average vector of rewards in a two-player game to approach a convex set. A well-known consequence is the existence of sublinear regret algorithms. A variety of approaches in online convex optimization follow from Blackwell’s result. On the other hand, the MaxWeight algorithm is a celebrated algorithm in the design of scheduling for switched queueing networks. MaxWeight has the desirable property that it ensures stability in a queueing network whenever it is possible to stabilize the network for a given set of arrival rates. In this article, we argue that the desirable stability properties of MaxWeight can be seen as a consequence of Blackwell’s Approachability Theorem. From this we can see that maximal stability is a form of approachability. This viewpoint helps us understand when maximal stability is possible and why parameters which regulate service rates may still need to be learned via estimation whereas arrival parameters do not.
This leads us to the second example considered. Here, we discuss the role of expert and bandit algorithms in queueing systems. A number of papers over recent years have investigated the role of learning and regret in queueing. We provide a simple result that relates Lindley’s recursion for the G/G/1 queue to the regret of a queue’s waiting time. We discuss informally how the regret of bandit learning in this setting can be analyzed. Then we develop an example where service rates are estimated and regret is bounded when we use a perceptron algorithm to classify the service policy applied to different customer classes in the queue. Through a simple example, we discuss the impact of learning system parameters on the performance of our queueing model.
These first two studies provide mechanisms to understand further literature on learning, information and queueing and lead us to review other aspects of decision making in queueing systems. We discuss the role of information in queueing networks and resource allocation. Although parameter estimation might improve performance, the increased use of expanded state information can also achieve this goal. We review results where future information can assist in reducing delay when making admission control decisions. Further, we discuss how memory and communication at a dispatcher can help to reduce load.
Finally, we review recent progress on the application of reinforcement learning theory to queueing systems. We note that many early examples of reinforcement learning (including early deep reinforcement learning) are designed for solving scheduling problems in queueing systems. However, while it is possible to apply reinforcement learning algorithms to queueing systems in a relatively straightforward manner, many of the underlying assumptions that guarantee convergence and correctness in reinforcement learning tend not to hold in the case of a queueing system. We identify key areas where more theoretical development is required to understand the application of reinforcement learning algorithms to queueing systems. We then provide a review of recent literature that works to address this challenge. While this area is very nascent, we hope that this tutORial provides an introduction that is both instructive and a useful guide to recent literature and potential themes of future research.
Organization.
The remainder of the paper is organized as follows. In Section 2 we prove the relationship between the MaxWeight policy and Blackwell’s Approachability Theorem. In Section 3, we analyse the regret of queueing systems when service rates are unknown. In Section 4, we discuss the use of detailed state information and its impact on performance. In Section 5, we review the recent progress applying reinforcement learning theory to queueing systems.
2 Blackwell Approachability and MaxWeight.
In this section, we relate a number of classical approaches to learning, control and queueing. Specifically, we discuss connections between Blackwell’s Approachability Theorem, sublinear regret and the MaxWeight policy. We start by presenting Blackwell’s Approachability Theorem, and, as a corollary, we prove the Hannan-Gaddum Theorem. We then introduce the MaxWeight policy and prove its stability for subcritical queueing networks. Finally, we prove that the stability of MaxWeight can be viewed as a consequence of Blackwell’s Approachability Theorem.
2.1 Blackwell Approachability
Blackwell’s Approachability Theorem is a foundational result in game theory, learning and control. The result of blackwell1956analog gives a necessary and sufficient condition for the mean of a vector payoff to approach a convex set under adversarial perturbations. The result is a generalization of the Minimax Theorem for two-person zero-sum games. In a subsequent note, blackwell1954controlled observed that this approachability property can be used to prove sublinear regret of sequential decision policies. Below we present the model setting and then state Blackwell’s Approachability Theorem; a proof of this result is provided in the appendix.
2.1.1 A vector valued game.
At time , a player makes a decision and its adversary makes a decision . Here is a closed, bounded, convex subset of and is closed, bounded, convex subset of . From this an dimensional vector of payoff is made
(1) 
where, for each and , . We let . Decisions from the player and adversary at each time may be a function of previous decisions and payoffs. We let denote the decisions and payoffs up until time . We allow and to be random but must be conditionally independent given the past payoffs and decisions, . We let and . Starting from some initial position , the average payoff vector is
(2) 
Without loss of generality, we assume that is chosen so that .
The task of the player is to make a sequence of decisions such that the mean payoff converges towards a closed, convex set , regardless of decisions of the adversary . We say that the set is approachable if this convergence is possible regardless of the sequence of decisions made by the adversary. Specifically, for all sequences there exists a sequence such that
where gives the norm of the distance between and :
2.1.2 Blackwell’s Approachability Theorem.
Blackwell’s Approachability Theorem provides necessary and sufficient conditions for approachability to hold.
Theorem 1 (Blackwell’s Approachability Theorem).
The following are equivalent:
1. is approachable.
2. For every there exists such that .
3. Every halfspace containing is approachable.
We focus on the equivalence between Parts 1 and 3, above, and, for reasons of brevity, we assume that the equivalence between Parts 1 and 2 has already been shown. We note that approachability of halfspaces can be shown to be equivalent to the classical Minimax Theorem. Further we note that Blackwell’s proof that Part 3 implies Part 1 is constructive, and the algorithm applied is as follows.
We define to be the projection with respect to the Euclidean norm of onto . We let
(3)  
(4) 
This defines a hyperplane
(5) 
Note that is contained in . Blackwell proposes to choose a random variable
(6) 
The existence of this choice is not obvious at this point, but it will follow as a consequence of the Minimax Theorem.
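As a concrete illustration, the standard form of Blackwell's projection-based construction can be sketched as follows. The notation here is assumed for illustration and may differ from the authors': $\bar r_t$ is the average payoff, $\mathcal{C}$ the target set, $\Pi_{\mathcal{C}}$ the Euclidean projection onto it, $r(x, y)$ the (expected) payoff, and $\mathcal{X}$, $\mathcal{Y}$ the decision sets of the player and the adversary.

```latex
% Assumed notation (not taken from the source): \bar r_t, \mathcal{C}, \Pi_{\mathcal{C}}, r(x,y)
\begin{align*}
  \pi_t &= \Pi_{\mathcal{C}}(\bar r_t)
         = \operatorname*{arg\,min}_{c \in \mathcal{C}} \| \bar r_t - c \|_2, \\
  \lambda_t &= \bar r_t - \pi_t, \\
  H_t &= \{ z : \lambda_t^\top (z - \pi_t) \le 0 \}, \\
  x_{t+1} &\in \{ x \in \mathcal{X} : r(x, y) \in H_t \ \text{for all } y \in \mathcal{Y} \}.
\end{align*}
```

Because $\mathcal{C}$ is convex, every point $c \in \mathcal{C}$ satisfies $\lambda_t^\top (c - \pi_t) \le 0$, so the halfspace $H_t$ contains the target set. The decision $x_{t+1}$ then forces the next payoff onto the same side of the hyperplane as the target set, whatever the adversary chooses.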
2.1.3 Sublinear Regret as an application of Blackwell Approachability.
Regret minimization is an objective in a variety of sequential decision making tasks. The regret, which we define below, measures the performance of a sequence of decisions against the best fixed decision. A “good” algorithm should have regret that grows sublinearly with time. Sublinear regret is a consequence of approachability, see blackwell1954controlled.
We consider the following setting: again, a player makes a decision and its adversary makes a decision . The decision set is assumed to be the set of probability distributions indexed by . The player receives a reward for choosing index . Consequently, the reward for distribution is
The regret of the sequence is
(7) 
A good algorithm should have low regret regardless of the sequence . Notice that if holds, then the performance of the algorithm is at least as good as the best fixed choice in . Further if then we can see that the algorithm is in effect learning the best policy. Such algorithms are referred to as being Hannan consistent, see cesa2006prediction.
One interesting consequence of Blackwell’s Approachability Theorem is that it can be used to construct an algorithm that is Hannan consistent.
Theorem 2 (Hannan-Gaddum Theorem).
There exists a playing strategy such that for any
(8) 
In other words, our performance in the game is asymptotically as good as the best fixed action.
The result above shows that regardless of the choices of the adversary, there are always ways to learn the best fixed decision. This result and its proof have become the basis of much subsequent development in adversarial and nonstochastic sequential decision making.
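One concrete algorithm that arises from applying Blackwell's condition to the negative orthant of regret vectors is commonly known as regret matching. The sketch below is illustrative rather than the construction used in the proof of Theorem 2. The function name `regret_matching`, the reward range [0, 1], and the use of expected rewards under the mixed decision are assumptions made here.

```python
import math


def regret_matching(reward_rows):
    """Play regret matching against a fixed sequence of reward vectors.

    reward_rows[t][i] is the reward (assumed to lie in [0, 1]) that index i
    would have earned at time t. Returns the per-index cumulative regret,
    measured against the expected reward of the mixed decision p_t.
    """
    n = len(reward_rows[0])
    regret = [0.0] * n  # cumulative regret against each fixed index
    for rewards in reward_rows:
        positive = [max(r, 0.0) for r in regret]
        total = sum(positive)
        # Play proportionally to positive regret; uniform if there is none.
        p = [x / total for x in positive] if total > 0 else [1.0 / n] * n
        earned = sum(pi * ri for pi, ri in zip(p, rewards))
        regret = [r + ri - earned for r, ri in zip(regret, rewards)]
    return regret


# A deterministic "adversary" that changes which index is rewarded over time.
T, n = 3000, 3
rows = [[1.0 if (t * t) % n == i else 0.0 for i in range(n)] for t in range(T)]
final_regret = regret_matching(rows)
print(max(final_regret) / T)  # average regret per round
```

Under these assumptions the largest cumulative regret is at most the square root of `n * T`, so the average regret per round vanishes as the horizon grows. This is the sense in which performance is asymptotically as good as the best fixed action.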
2.2 Stability and Approachability in MaxWeight
A classical approach to scheduling in queueing networks is the MaxWeight algorithm, which was first introduced in the seminal work of tassiulas1990stability. In that paper, it is proved that the MaxWeight scheduling algorithm has the property of stabilizing a queueing network whenever it is possible to construct a stabilizing policy, a statement that is somewhat similar to the statement of Blackwell’s result. Below we discuss how MaxWeight can be interpreted as a special case of Blackwell’s policy (6). Thus the desirable stability properties of MaxWeight can be seen as a consequence of approachability.
Below, we present the queueing network setting and the MaxWeight policy. Afterwards we present the main stability result for MaxWeight; a proof of the result is provided in the appendix. Following this, we make the connection with Blackwell’s Approachability Theorem.
2.2.1 Switched Queueing Network and MaxWeight.
We consider a discrete time queueing network, called a switched queueing network. In this model, there are constraints on which queues can be served simultaneously. We let index the set of queues. At time , the number of arrivals to each queue is given by the vector . For simplicity, we suppose that the components above are bounded by some maximum value . The vector may be random, in which case, we let be the expectation of . We let be the number of departures for each queue at time . We suppose that belongs to some set of schedules . Again we suppose that the elements of are bounded by some maximum value and that each queue can be served by some element in . Further, we suppose that the set of schedules is monotone, meaning that for any and if componentwise then . We let be the convex closure of .
From some initial queue size vector , we can define the queueing process
(9) 
for . (Note by our monotonicity property on , we may assume that whenever and more generally that so that queues remain positive.)
The MaxWeight policy is a well-known scheduling rule for choosing the departure vector among the set of schedules . Specifically, at time , MaxWeight chooses
(10) 
where ties may be broken arbitrarily. Since the scheduling decision is a function of the current queue sizes, the resulting queue size process under the recursion (9) is a Markov chain. Thus we can discuss the positive recurrence and transience of this Markov chain. When the queueing network is positive recurrent, we often refer to the network as being stable, whereas a transient network is referred to as unstable.
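The scheduling rule and queue recursion described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' model specification: the recursion is taken to be Q(t+1) = max(Q(t) - sigma(t), 0) + a(t), arrivals are Bernoulli, and the two-queue schedule set `SCHEDULES` and the helper names `maxweight` and `simulate` are hypothetical.

```python
import random


def maxweight(queues, schedules):
    """Return a schedule maximizing the queue-weighted service sum_j Q_j * s_j."""
    return max(schedules, key=lambda s: sum(q * x for q, x in zip(queues, s)))


def simulate(arrival_probs, schedules, steps, seed=0):
    """Simulate Q(t+1) = max(Q(t) - sigma(t), 0) + a(t) with Bernoulli arrivals."""
    rng = random.Random(seed)
    queues = [0] * len(arrival_probs)
    longest = 0  # largest total backlog seen so far
    for _ in range(steps):
        sigma = maxweight(queues, schedules)
        arrivals = [1 if rng.random() < p else 0 for p in arrival_probs]
        queues = [max(q - s, 0) + a for q, s, a in zip(queues, sigma, arrivals)]
        longest = max(longest, sum(queues))
    return queues, longest


# Hypothetical two-queue switch: at most one queue can be served per slot.
SCHEDULES = [(0, 0), (1, 0), (0, 1)]
final_queues, longest = simulate([0.3, 0.4], SCHEDULES, steps=20000)
print(final_queues, longest)
```

Note that `maxweight` never uses the arrival probabilities. With a total load of 0.7 jobs per slot against a capacity of one job per slot, the backlog in this example stays small throughout the run.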
2.2.2 Maximal Stability of MaxWeight.
A well-known result due to Tassiulas and Ephremides is the following:
Theorem 3.
Given are independent and identically distributed with mean , then:
1. If then, regardless of the policy used, is transient.
2. If then, under the MaxWeight policy, is positive recurrent.
The first part of the theorem, above, establishes that is the maximal stability region for the switched queueing network. That is, the set is the largest set of arrival rates for which we can hope to find a policy to stabilize the queueing network. (Note that, in principle, this policy may need to know the arrival rates to stabilize the network.) What the second part shows is that there is a single policy, specifically MaxWeight, that achieves the maximal stability region, regardless of which arrival rate inside the stability region is in effect. (Footnote 1: Note that here we are ignoring arrival rates on the boundary.)
The proof of Theorem 3 is given in the appendix. We note that the proof for MaxWeight has several connections with the proof of the Blackwell Approachability Theorem:

The argument is constructive and applies a quadratic Lyapunov function.

The proof applies the Minimax Theorem. (Footnote 2: We note that we apply the Minimax Theorem in the way that we construct our proof, but this is not necessarily standard.)

The divergence of the policy follows from a separating hyperplane argument.
All of these properties indicate a close connection between MaxWeight and Blackwell Approachability, which we develop next.
2.2.3 MaxWeight and Approachability.
Here we show that MaxWeight is an instance of Blackwell’s policy, (6), for a specific choice of . Thus the robust stability properties of MaxWeight can be seen as a consequence of the robustness to adversarial perturbations found in Blackwell’s Approachability Theorem.
Proposition 1.
A few remarks about the above result can be made.

The rate of convergence can be found to be of the order . Thus the result can be seen as a form of “fluid stability” or “rate stability” for the policy, because the rate of departures from the network is of the same order as the rate of arrivals. Provided the result can be improved to form a bound that is .

The result proven above holds when the arrival rates can change over time (in an adversarial way) so long as they belong to the interior of the stability region .

We note that there are variants of Blackwell Approachability that expand upon the basic version to incorporate general Lyapunov functions (cf. cesa2006prediction). If we apply these alternative algorithms for queueing network scheduling then we arrive at the MaxWeight policies. This is a variant of MaxWeight where the queue size in the MaxWeight policy is replaced by for a suitably positive increasing function .

The argument given in Proposition 1 can be extended to the BackPressure policies. The BackPressure policy is an extension of the MaxWeight policy introduced by tassiulas1990stability. BackPressure allows for routing decisions that send jobs through multiple queues. BackPressure can also be seen to be an instance of Blackwell Approachability for a suitable choice of .

MaxWeight as stated assumes unit-sized jobs, and there is no randomness in the service received by a job. However, the policy can be extended to jobs which have a random size or a random probability of being served. In this case, maximal stability does not hold under the policy (10). We need to consider a policy of the following form
(11)
where is the rate of service of jobs at queue . This can be seen as a consequence of the form of the function in this case. Specifically, service must be represented within each matrix in the proof, whereas arrivals can be implemented as an adversarial perturbation. This gives some rationale for the requirement to change the policy under random services.
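For concreteness, a commonly used form of a rate-weighted scheduling rule is sketched below. The notation is assumed for illustration and may differ in detail from the authors' display: $Q_j(t)$ is the size of queue $j$ at time $t$, $\mu_j$ is the service rate at queue $j$, and $\sigma$ ranges over the set of schedules $\mathcal{S}$.

```latex
% Assumed notation (not taken from the source): Q_j(t), \mu_j, \sigma, \mathcal{S}
\[
  \sigma(t) \in \operatorname*{arg\,max}_{\sigma \in \mathcal{S}}
    \sum_{j} \mu_j \, Q_j(t) \, \sigma_j .
\]
```

Setting $\mu_j = 1$ for every queue recovers the unweighted rule. The weights $\mu_j$ enter the decision itself, which is why service rates need to be known or estimated, while arrival rates never appear.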
2.3 Literature Review
As mentioned earlier, the original Approachability Theorem is due to blackwell1956analog. The connection with sublinear regret is described by blackwell1954controlled. The use of Blackwell’s condition is now commonly applied to analyze the regret of online learning algorithms. See cesa2006prediction for an overview. The text of cesa2006prediction provides a generalized Lyapunov version of Blackwell’s condition, which as discussed specializes to well known variants of MaxWeight in the case of switched queueing networks, see shah2012switched. It has also been shown that sublinear regret is equivalent to Blackwell approachability, abernethy2011blackwell.
The BackPressure policy, of which MaxWeight is a special case, was first introduced by tassiulas1990stability. During this period several commonly used queueing policies were found to be unstable when each queue was nominally underloaded, for instance see rybko1992ergodicity. The BackPressure policy is a popular algorithm that circumvents these issues. It was later found that input-queued switches, which are the routers used in the core of the internet, could have a suboptimal stability region under the greedy work-conserving strategies used at the time. For this reason the MaxWeight algorithm was proposed by mckeown1999achieving as a strategy to achieve maximal stability. Since then the MaxWeight policy has been applied extensively in the analysis of input-queued switches. More generally the MaxWeight and BackPressure policies have been widely analyzed in a broad set of queueing models. The paper dai2005maximum gives an account of their broad applicability in a wide variety of contexts including manufacturing, call centres as well as communications systems. We refer to georgiadis2006resource for a survey on MaxWeight and BackPressure. Also the recent text dai2020processing provides an excellent introduction to stochastic networks and their stability, as well as a good account of the various use cases of the MaxWeight and BackPressure policies.
The connection between Blackwell Approachability and MaxWeight/Backpressure appears to be new but we note it is a natural consequence of the two approaches. Specifically, the results of neely2010universal are particularly relevant to our analysis as it is shown that MaxWeight and Backpressure are robust to changing arrival rates so long as the mean arrival rate lies within the stability set over each time period (see also neely2010stochastic). The literature on adversarial queueing models is also relevant [borodin1996adversarial]. Similar fluid stability results are sufficient for various adversarial queueing models [gamarnik1998stability]. MaxWeight has been analysed in the context of adversarial queueing by lim2013stability.
From a theoretical perspective, understanding of MaxWeight has improved through a series of papers analysing state-space collapse of MaxWeight in heavy traffic, see for instance stolyar2004maxweight and maguluri2016heavy. It was also proven that MaxWeight does not enjoy the same robust stability properties that BackPressure enjoys in large networks bramson2019stability. An intriguing fluctuation bound for MaxWeight is found in sharifnassab2020fluctuation. More recent literature on MaxWeight and BackPressure in queueing networks has focused on closed systems, such as using it as a decision rule in ride sharing platforms (see banerjee2018dynamic and kanoria2019blind). Another application concerns road traffic signal control [le2015decentralized, varaiya2013max, xiao2015throughput], which we will discuss in some more detail in Section 5.
A variety of recent papers combine elements of learning and MaxWeight. One area of research employs MaxWeight-like policies in designing learning and information processing systems, such as those encountered in crowdsourcing, in order to achieve maximum throughput or minimize regret [hsu2021integrated, massoulie2018capacity, shah2020adaptive]. The novelty here is that instead of working with queues of jobs, the decision maker is trying to minimize a certain information backlog that corresponds to “uncertainty” in the system. Another line of work investigates how to augment MaxWeight with explicit learning and side information to improve performance [krishnasamy2018augmenting, neely2012max]. However, these works tend to focus more on parameter estimation, rather than approachability properties of the algorithm. We will focus on such methods in the next section.
3 Regret Bounds and Queueing
When we analyzed MaxWeight, we saw that stability is achievable under an unknown arrival rate. However, we do require knowledge of service rates to make decisions. So if service rates are not known in advance then they must be learned. Recently there have been a number of papers that investigate sequential learning of service rates in queueing systems. These papers consider how bounds on regret (as defined above) relate to the performance of a queueing system. We will review these shortly, but first we discuss the main ideas by developing a simple example.
3.1 Regret and service in a queue.
We suppose that jobs arrive to a single server queue. We let , , be the interarrival time between the th arrival and the th arrival. When we serve a job, we suppose that there are different modes of service that we can select. This represents the manner in which we are going to serve the job, with some service modes being more suitable than others. We let be the set of service modes. A policy chooses a service mode for each job, i.e. for the th job at the queue we select service mode . We let be the time spent serving the th job under service mode . We assume that for each job, one service mode (which is not known to us) will lead to a fast service time, , and the other service modes will lead to a slower service time. We let be the time taken to serve the th job under policy . We will discuss shortly a model that relates the service modes to the speed of service but for now we leave this unspecified.
Following our earlier definition, (7), the regret of policy is
I.e., the regret is the difference between the total service we receive under policy and the service that we receive under optimal service. Given the interarrival times, , and service times, , the waiting time of the th customer satisfies the famous Lindley recursion:
The waiting time under the optimal policy is then
In both cases we assume that the queues are initially empty: .
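As a minimal sketch, the standard Lindley recursion $W_{k+1} = \max(W_k + S_k - A_k, 0)$ can be computed as below. The symbols $W_k$, $S_k$, $A_k$ and the names `lindley_waits`, `services` and `interarrivals` are illustrative choices made here rather than the authors' notation.

```python
def lindley_waits(services, interarrivals, initial_wait=0.0):
    """Waiting times from Lindley's recursion W[k+1] = max(W[k] + S[k] - A[k], 0).

    services[k] is the service time of job k and interarrivals[k] is the time
    between the arrivals of job k and job k + 1. The queue starts empty by
    default (initial_wait = 0).
    """
    waits = [initial_wait]
    for s, a in zip(services, interarrivals):
        waits.append(max(waits[-1] + s - a, 0.0))
    return waits


# Job 0 is slow (3 time units), so jobs 1 and 2 wait; the long gap before
# job 3 lets the queue drain.
print(lindley_waits([3.0, 1.0, 1.0], [1.0, 1.0, 5.0]))  # [0.0, 2.0, 2.0, 0.0]
```

Running the same function once with the service times chosen by a policy and once with the fastest service times gives the two waiting time sequences that are compared in the discussion that follows.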
The following simple lemma provides a way to compare the waiting time of jobs in terms of the regret of the policy.
Lemma 1.
If is the last time before where the queue for policy is empty, i.e. , then
(12) 
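A bound of this type follows directly from Lindley's recursion, and the short derivation below shows why. The notation is assumed for illustration and the indexing may differ from the authors' statement: $W^{\pi}_k$ and $W^{*}_k$ are the waiting times under the policy and under optimal service, $S^{\pi}_j$ and $S^{*}_j$ the corresponding service times, $A_j$ the interarrival times, $\tau$ the last time at or before $k$ with $W^{\pi}_{\tau} = 0$, and $\mathrm{Rg}(k) = \sum_{j < k} (S^{\pi}_j - S^{*}_j)$ the regret.

```latex
% Assumed notation (not taken from the source): W^\pi_k, W^*_k, S^\pi_j, S^*_j, A_j, \tau, \mathrm{Rg}
\begin{align*}
  W^{\pi}_{k} &= \sum_{j=\tau}^{k-1} \bigl( S^{\pi}_{j} - A_{j} \bigr)
    && \text{(since $W^{\pi}_{\tau} = 0$ and $W^{\pi}_{j} > 0$ for $\tau < j \le k$)}, \\
  W^{*}_{k} &\ge W^{*}_{\tau} + \sum_{j=\tau}^{k-1} \bigl( S^{*}_{j} - A_{j} \bigr)
    \ge \sum_{j=\tau}^{k-1} \bigl( S^{*}_{j} - A_{j} \bigr)
    && \text{(since $\max(x, 0) \ge x$ and $W^{*}_{\tau} \ge 0$)}, \\
  W^{\pi}_{k} - W^{*}_{k} &\le \sum_{j=\tau}^{k-1} \bigl( S^{\pi}_{j} - S^{*}_{j} \bigr)
    = \mathrm{Rg}(k) - \mathrm{Rg}(\tau).
\end{align*}
```

The excess waiting time is therefore controlled by the regret accumulated since the queue last emptied, rather than by the regret accumulated from the start.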
With this result in hand, there are a host of ways that we can apply regret analysis to a queueing model. In informal terms, here are two ways that one can proceed. First, notice that if , i.e. queues do not empty often, then the term will dominate the above expression, (12). Thus we have a form of rate stability. Specifically
So if there is some policy under which queues do not grow linearly, then the queues will not grow linearly under , either. This is the form of the result that one can expect in an adversarial setting. Second, in a stochastic setting we can expect a queue to be positive recurrent, that is . Thus the bound is of the form of the expected change in . In a stochastic multiarm bandit problem, the regret is often of the order . So , and we can expect that
(13) 
This is essentially the observation made by krishnasamy2021learning for a specific algorithm, although the result likely holds in more generality.
There are a host of algorithms that are likely to permit exact analysis; however, one classical learning algorithm that permits a particularly tractable analysis is the perceptron algorithm, which we discuss next.
3.2 Analysis with the Perceptron Algorithm
We consider the setting where there are two modes of service . With each job, , we associate a context and a response . We assume is a bounded vector in , i.e. for some , and we assume that belongs to . The context summarizes information about the job and with this information we are allowed to make a service mode decision, . We assume that the sign of determines whether the job receives fast or slow service. Specifically, if the sign of and match then the job receives fast service and otherwise it receives a slower rate of service. That is we assume that
Here corresponds to the slow service time and so . This correctly classifies the context of a job and leads to faster service.
We restrict ourselves to policies that apply weights to each context in order to make a decision. Specifically, for weights , the policy with context makes the decision
For weights , the loss of the th job is
We can upper-bound this by the hinge loss
It is not hard to check that . The resulting optimization
is known as a support vector machine. (Footnote 3: Often an additional regularization term is added to this objective, though we do not consider that here.) Note that
If we apply an online gradient descent algorithm, we get the following update algorithm for the weights :
where here is the learning rate of the algorithm and we take . The above algorithm is known as the perceptron algorithm.
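As an illustration, the sketch below implements the classical mistake-driven perceptron, which updates the weights only when the sign is predicted incorrectly. This may differ in detail from the hinge-loss gradient step and learning rate used by the authors. The names `perceptron`, `contexts` and `labels`, and the synthetic data, are assumptions made for illustration.

```python
def perceptron(contexts, labels, passes=1, eta=1.0):
    """Classical perceptron: w <- w + eta * y * x whenever y * <w, x> <= 0.

    Returns the final weights and the number of mistakes (updates) made.
    """
    w = [0.0] * len(contexts[0])
    mistakes = 0
    for _ in range(passes):
        for x, y in zip(contexts, labels):
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            if margin <= 0:  # wrong (or undecided) sign: count a mistake
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                mistakes += 1
    return w, mistakes


# Synthetic, well-separated contexts: the label is the sign of the first
# coordinate, whose magnitude is at least gamma = 0.5.
labels = [1 if i % 2 == 0 else -1 for i in range(50)]
contexts = [
    (y * (0.5 + 0.1 * (i % 5)), ((7 * i) % 11 - 5) / 5.0)
    for i, y in enumerate(labels)
]
weights, mistakes = perceptron(contexts, labels, passes=10)
print(weights, mistakes)
```

On separable data of this kind the number of updates stays finite however many passes are made. In the queueing interpretation, each update corresponds to one job that received the slow service mode.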
Allowing the learning rate to depend on time, the following is a standard bound in online convex optimization (OCO):
Proposition 2 (OCO bound).
For any function and , as above, then for all
The above result bounds the performance of the sequence of weights in comparison with any fixed choice . Thus the above bound is a regret bound. Appropriate conditions on the boundedness of and the magnitude lead to sharp regret bounds. We pursue this now in the context of the Perceptron Algorithm. However, a more general overview is given by hazan2019introduction.
A consequence of the above bound is the following classical result:
Theorem 4 (Perceptron Mistake Bound).
If there exists such that then the perceptron algorithm is such that
If we apply the perceptron algorithm to the service of customers as considered in Lemma 1 then, since the number of mistakes made by the algorithm is finite, the regret of the algorithm is bounded by a constant. Thus, applying Lemma 1, we arrive at the following result.
Corollary 1.
For weights determined by the perceptron algorithm then the waiting time of customers at the queue is
Moreover if is the first time that after the last mistake of the perceptron algorithm, then for all
The analysis here is particularly clean as the number of mistakes made by the perceptron algorithm in this “well-separated” setting is finite. (Footnote 4: Here well-separated means that there exists such that .) In general, the difference in aggregate loss (or number of mistakes) will continue to increase, and the change in losses will decrease rather than be zero after some point. In this case, slightly more careful bounds need to be applied to gain similar results, although the main idea remains the same.
3.3 Literature
The application of Lindley’s recursion to the regret analysis of a queueing system follows the discussion in walton2014two. There, a result similar to Lemma 1 is derived and the regret of a variety of expert and bandit algorithms is considered. The results of walton2014two are primarily concerned with adversarial arrival and service types. The regret of stochastic bandit problems is considered in NIPS2016_430c3626. There, a specific algorithm called UCB is considered and a detailed analysis of the transition from instability to stable behavior is given. The discussion in display (13) is a heuristic derivation of this result in NIPS2016_430c3626 and suggests that the results hold more broadly than for the UCB algorithm. The work is developed further by the authors to consider multiple queues and matching constraints [krishnasamy2021learning]. The adversarial setting is considered for a MaxWeight model by liang2018minimizing. Further formulations for wireless networks [stahlbuhk2019learning] and for network utility maximization [liang2018network] are considered. A strategic adversarial model of queueing is considered by gaitonde2020stability. Here the loss in capacity through adversarial coordination is the object of interest. The authors improve these results by exploiting submodularity structure in gaitonde2020virtues.
The set of papers above are amongst the first to consider regret analysis within a queueing system. Nonetheless, it should be noted that there is an extensive prior literature addressing related problems. For a discounted reward objective, the Gittins index is known to be a Bayesian optimal bandit algorithm, see [lattimore2020bandit, Section 35]. There are several works which develop a queueing approach to the multiarm bandit problem. For instance, see aalto2009gittins, jacko2010restless, scully2020simple as well as the text gittins2011multi. Further, adversarial models of queueing systems are considered in some detail in a literature initiated by the work of borodin1996adversarial.
The perceptron algorithm is a classical algorithm considered in machine learning [novikoff1963convergence]. The mistake bound forms the basis for VC and Rademacher complexity bounds [vapnik2013nature]. Online convex optimization and perceptron mistake bounds follow the proofs given in shalev2011online.
Further, it is worth noting that the analysis of regret only applies to algorithms used in an online setting where parameters must be fitted statistically. That is, a regret analysis is not relevant to models trained by simulation, where only the quality of the final solution reached by the learning algorithm matters. If an algorithm is trained online then care must be taken with the potentially changing dynamics and statistics of the environment. Such a viewpoint is taken by besbes2015non in the setting of stochastic approximation and by daniely2015strongly in the nonstochastic case.
4 Role of Information
The algorithms introduced in the previous sections generally have a “learning” flavor: the algorithm either explicitly gathers information about unknown quantities in the environment, such as in the case of bandits, or implicitly controls the system in a manner that mitigates the lack of knowledge of such information, such as MaxWeight being able to stabilize the queues despite not knowing the arrival rate. What exactly, then, is the information that is being learned, and how does such information shape the performance and algorithmic design? These questions will guide our discussion in the next two subsections.
At a highlevel, by information we are referring to knowledge of uncertain values in the system. It is instructive to further consider the following two categories of uncertainty:

Epistemic information: this corresponds to static uncertain parameters and system model specifications. For instance, in a typical queueing network, uncertainty in the arrival rate or mean service time belongs to this category.

Aleatoric information: this corresponds to uncertain realizations and stochastic events that manifest over time and, in particular, remain unknown even if we have a complete specification of the system model. For example, while we may know that the arrival count in a given time step follows a Poisson distribution with mean 1 (epistemic information), the exact realization of the number of arrivals in that step is still uncertain. Knowledge of the realized value would correspond to aleatoric information.
For the purpose of our discussion, one can roughly think of epistemic information as being about static quantities, whereas aleatoric information captures dynamic realizations of randomness over time.
Aleatoric information: past, present and future
We will focus in this section on the impact of aleatoric information. We will have a broader discussion about the role of epistemic information in the next section in the context of reinforcement learning. Note that the bandit-inspired algorithms and the MaxWeight algorithm both focus on addressing epistemic uncertainty.
We will further orient our discussion along three subtypes of aleatoric information, determined by where the information is situated along the time axis, relative to the present time frame:

Information about the future inputs, a.k.a. predictive information. E.g., how many jobs will arrive and to what queues in the next 5 hours.

Information about the current system state, such as the queue lengths at various servers in a load-balancing model.

Information about the past, possibly stored as part of the algorithm’s internal memory.
Measuring the value of information
One of the core questions when it comes to the role of information is its decision-theoretic value, expressed as the performance improvement that accessing such information may enable. For instance, we may seek a function that maps a given amount of future information to the amount of performance improvement it makes achievable.
Two immediate questions come to mind. First, in various stochastic resource allocation problems, how does the value depend on the amount of information? In a centralized decision-making framework (e.g., barring competing selfish agents), more information cannot hurt the decision maker, so we would naturally expect the value to be always nonnegative and monotonically nondecreasing. A more interesting question is how fast the value increases as the amount of information increases, if at all. As we will see in some of the results surveyed in this section, sometimes a seemingly small amount of information can lead to significant performance improvement. Conversely, performance improvement could quickly plateau after obtaining “enough” information, beyond which more information only brings small, marginal benefits.
Once we understand how much benefit information can bring, an immediate follow-up question is how to achieve the optimal performance improvement using efficient algorithms. A notion that is suitable for most engineering applications is to require that the resulting decision policy be computationally tractable. A taller order might be warranted when the application involves humans, such as in a call center or healthcare setting: there, we would ideally like the algorithm to be simple and intuitive from the perspective of a human operator.
4.1 Improving Queueing with Future Information
Our first example concerns admission control in a queueing system using future information [spencer2014queueing]. We will look at an operator who may have access to a third-party oracle (via predictive algorithms, machine learning, or simply a side channel) that offers a lookahead window containing future arrival and departure patterns. We will see that in this example, not only does queueing delay improve with future information, but the value of information increases sharply early on before petering out, suggesting that the system can benefit a lot from having access to a moderate amount of information.
The model is depicted in Figure 1. A single stream of jobs arrives to a queue following a Poisson process. Service variability is modeled by an independent Poisson process: whenever there is a jump in this process, the queue is reduced by one if it is not empty, and stays at zero otherwise. It is easy to show that in this system the queue length process evolves according to the same law as the number of jobs in system (including in service) in a conventional M/M/1 queue with the same arrival and service rates.
Now, notice that if the arrival rate exceeds the service rate, the queue length will eventually grow to infinity. We thus assume that the policy designer makes admission decisions on a dynamic basis: for each arriving job, the algorithm decides whether the job is admitted to the queue or diverted, in which case the job leaves the system directly. Because diversions are costly (the diverted job either does not receive the needed service or needs to be served elsewhere), we impose an upper bound on the average rate of diversion. This model of admission control has a long history and has been studied in various applications, ranging from Internet congestion control to ambulance diversion for Emergency Departments; see references in [stidham1985optimal, spencer2014queueing, xu2016using].
A key question is how to control admissions in a way that minimizes the waiting time experienced by the admitted jobs, while obeying the diversion rate constraint. Everyday intuition would suggest that jobs should be turned away when the queue is already very long, and conversely, that we should probably admit the job if the queue is nearly empty. In other words, we could implement a threshold policy where a job is admitted if and only if the queue length at the time of arrival is less than a certain threshold, and the threshold would be chosen to be the smallest one that results in a feasible admission rate. Showing that this is indeed optimal is nontrivial, and was elegantly carried out in stidham1985optimal. In particular, under the optimal policy (which uses a threshold rule), the average steady-state queue length scales as:
(14) 
Here, the limit corresponds to the conventional heavy-traffic regime, where the arrival rate of the admitted traffic approaches the service capacity. Note that in this limit the queue length explodes to infinity, as one would expect from standard heavy-traffic theory of queues, albeit at a more modest, logarithmic rate compared to the typical heavy-traffic delay scaling.
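To make the threshold rule concrete, the following is a minimal simulation sketch rather than the construction used in the cited works. It assumes Poisson arrivals and an independent Poisson stream of service events, as in the model above; the names `lam`, `mu`, `threshold` and the function itself are illustrative assumptions. It uses the standard fact that the superposition of the two Poisson processes has exponential inter-event times, with each event being an arrival with probability `lam / (lam + mu)`.

```python
import random

def simulate_threshold_policy(lam, mu, threshold, num_events=200_000, seed=0):
    """Simulate admission control with a queue-length threshold.

    Arrivals form a Poisson process of rate `lam` (assumed name); service
    events form an independent Poisson process of rate `mu` (assumed name).
    An arrival is admitted only if the queue length on arrival is below
    `threshold`; otherwise it is diverted.
    Returns (time-average queue length, fraction of arrivals diverted).
    """
    rng = random.Random(seed)
    total_rate = lam + mu
    queue = 0
    area = 0.0        # integral of the queue length over time
    elapsed = 0.0
    arrivals = 0
    diverted = 0
    for _ in range(num_events):
        dt = rng.expovariate(total_rate)      # time until the next event of either type
        area += queue * dt
        elapsed += dt
        if rng.random() < lam / total_rate:   # the event is an arrival
            arrivals += 1
            if queue < threshold:
                queue += 1
            else:
                diverted += 1
        elif queue > 0:                       # a service event hits a non-empty queue
            queue -= 1
    return area / elapsed, diverted / max(arrivals, 1)

if __name__ == "__main__":
    # Raising the threshold lowers the diversion rate but lengthens the queue.
    for k in (2, 5, 10):
        avg_queue, diversion = simulate_threshold_policy(lam=1.2, mu=1.0, threshold=k)
        print(k, round(avg_queue, 2), round(diversion, 3))
```

Sweeping the threshold in this way exposes the trade-off described above: the smallest threshold whose diversion fraction meets the constraint is the one an online policy of this form would select.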
Let us now introduce future information in this model. Suppose now that the decision maker is equipped with some advance knowledge of the future arrivals and potential services in the form of a lookahead window: at any time, we will assume that the algorithm is able to access the realizations of both the arrival and service Poisson processes over a window of a given length starting at the present time. That is, the decision maker not only has at their disposal how long the queue is now, but also how arrivals and services will materialize over the length of that window. We would expect that the availability of such future (aleatoric) information should allow the algorithm to plan more proactively and improve performance. But how large an improvement can we get, and how much future information do we need in order to make such improvement possible?
It turns out that future information does make a big difference. Specifically, it is shown in spencer2014queueing that there exists some universal constant such that, if the size of the lookahead window exceeds a threshold determined by this constant, then the queue length under an optimal policy satisfies:
(15) 
That is, equipped with a sufficient amount of future information, the heavy-traffic queue length does not diverge, but instead converges to a finite constant, a sharp departure from the scaling in (14).
We have thus seen that having access to some future information can bring a potentially massive performance improvement. However, (15) is only half of the picture, as we might want to know whether the same performance improvement can be attained with much less information. The answer turns out to be negative, as demonstrated by the following lower bound of xu2015necessity: there exists another positive constant such that, if the lookahead window size falls below the threshold determined by this constant, then the queue length under any algorithm must satisfy:
(16) 
Comparing the above with (14) and (15), we see that, quite surprisingly, the performance deteriorates to the same level as an online algorithm as soon as the lookahead window size drops by even a constant factor. In other words, there appears to be a critical threshold for performance-relevant future information, where queueing delay is highly sensitive to whether the lookahead window is greater or less than this level.
The above results also reveal an interesting duality between information and steady-state queue length. While one may expect future information to help reduce the steady-state queue length, it may be surprising that they are related to each other so strongly as to be almost interchangeable. Examining (14) through (16) shows that the system can essentially operate in one of two regimes, as the system load approaches capacity:

Information-rich: if the lookahead window size grows faster than the critical threshold, then the steady-state queue length remains bounded.

Information-scarce: if the lookahead window grows slower than the critical threshold, then the steady-state queue length grows at the same rate as under an online policy.
That is, the operator “pays” a cost of the same order either in information or in queue length.
Insight on Policy Design
We may also be interested in knowing how policy design should adjust to the availability of future information and forecasts. Do we need to drastically revamp our usual intuition, or are we only looking at minor tweaks to existing policies?
The reality is likely the former. One of the most striking findings of spencer2014queueing is that not only is the optimal control with future information different from the one without, but the difference is so large that it would almost appear counterintuitive. As discussed earlier, the intuition for an online version of the admission control problem suggests that a good admission policy should follow some threshold rule and admit arrivals to the queue only when the queue is short. In contrast, spencer2014queueing show that when future information is available, the optimal admission policy does almost the opposite: it is more likely to reject jobs when the queue is short, and to admit when the queue is long. In the extreme case where the lookahead window size is infinite, the optimal policy would only reject jobs when the queue is empty. Why does future information lead to such drastic changes in optimal admission rules?
In some sense, this departure is but an instance of a broader shift from being reactive to proactive as future information becomes part of our decision-making apparatus. In the online setting, since the decision maker has no knowledge of the future realizations of arrivals and services, the buildup in the queue serves as the only warning of congestion. Consequently, the decision maker can react to bursts in arrivals only after they have already occurred. When future information is available, however, the emphasis of the decision maker naturally shifts from what has happened to what will happen, and as such, the current “state” of the system (e.g., queue length) will carry less significance as the amount of future information grows. In the admission control problem, if the decision maker is able to foresee future arrivals, then rejecting jobs that occur at the beginning of a burst of arrivals will prove far more profitable than rejecting those jobs that arrive later in the same bursty period. This is because rejecting early arrivals reduces waiting for all subsequent arrivals, whereas rejecting the later arrivals does not have the same “domino effect.” It is for this reason that the optimal policy under future information tends to be more aggressive in turning away jobs when the queue is small; not because the queue’s being small in itself forebodes future congestion, but simply because early arrivals in a burst of jobs tend to, by definition, arrive when the queue is small. Clearly, with a lack of foresight, it would not have been possible for an online policy to have such discernment.
While these specific ways in which policy design changes as a result of more future information may or may not generalize to other stochastic systems, a broader point to be made here is that variations in information structure will likely require us to reevaluate our usual intuition, and sometimes in fairly drastic ways.
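The “domino effect” can be checked on a toy sample path. The sketch below is only an illustration under stated assumptions: it uses a hypothetical discrete sequence of events (`+1` for an arrival, `-1` for a service event), measures congestion by the sum of queue lengths after each event, and compares rejecting the first arrival of a burst against rejecting the last one. The event encoding and the helper names are not taken from the cited works.

```python
def queue_path(events, rejected_index=None):
    """Queue length after each event; events are +1 (arrival) or -1 (service event)."""
    queue, path = 0, []
    for i, event in enumerate(events):
        if event == +1:
            if i != rejected_index:   # the rejected arrival never joins the queue
                queue += 1
        elif queue > 0:               # a service event only acts on a non-empty queue
            queue -= 1
        path.append(queue)
    return path

def total_queue(events, rejected_index=None):
    """Sum of queue lengths along the path, a proxy for total waiting."""
    return sum(queue_path(events, rejected_index))

# A burst of five arrivals followed by five service events.
burst = [+1] * 5 + [-1] * 5

if __name__ == "__main__":
    print(total_queue(burst))                     # 25: no rejection
    print(total_queue(burst, rejected_index=0))   # 16: reject the first arrival of the burst
    print(total_queue(burst, rejected_index=4))   # 20: reject the last arrival of the burst
```

On this path, rejecting the first arrival removes nine units of queue length, while rejecting the last removes only five, even though the first arrival finds an empty queue and the last finds the longest one.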
4.2 Communication and Memory in Load Balancing
Our second example shifts the focus of information from the future to that of the present and past. Here, the agent may not even have complete information about the current state of the system. A typical application that motivates such consideration is that of a large data center, where it is often expensive, if not impossible, for a scheduling algorithm to have at its disposal the real-time state of all the servers and queues. In this case, the lack of accurate state information comes from two possible sources:

Limited communication: the decision maker may not be able to communicate frequently with all the queues and servers.

Limited memory: there may not be enough memory to store all information gathered in the past.
To make matters more interesting, it turns out that the effects of communication and memory are often intertwined. With unlimited communication capability (and suitable Markovian assumptions), the true system state is readily accessible and there would be little need for memory. Conversely, when communication is limited, memory becomes crucial, as the decision maker could make up for the lack of real-time information by piecing together past observations to form a better estimate of the system state. Considering that communication and memory can both be prized resources in large-scale queueing networks, how they jointly impact performance is an important question.
The tension between communication, memory and system performance was elegantly captured by a model proposed by gamarnik2018delay in the context of load balancing. The system is that of a standard load balancing setup, where a single dispatcher routes arriving jobs to a collection of parallel servers, each operating at unit speed. The dispatcher, however, does not have access to the queue lengths at all the servers by default, but must rely on messages sent from the servers to gather such information. A main innovation of this model is its systematic accounting of the communication and memory overhead.

Communication model: when a server idles, it sends messages to the dispatcher according to a Poisson process. Each message contains the identity of the server, encoded as a binary string whose length is logarithmic in the number of servers. The server sends nothing if it is currently busy.

Memory model: the dispatcher is equipped with a finite-sized memory bank. Since specifying the index of a single server takes a number of bits logarithmic in the number of servers, a memory budget in bits is effectively the same as the dispatcher being able to store the identities of a limited number of (idle) servers. Whenever the dispatcher receives a new message, it adds the identity of that server to its memory, or discards the message if the memory is already full.
Under the above memory and communication model, gamarnik2018delay study the delay performance of the following natural routing policy. Upon the arrival of a new job, the dispatcher randomly chooses a server ID from its memory bank, if it is nonempty, and sends the new job to the said server while erasing the ID from the memory. If the memory bank is empty, then the dispatcher simply routes the job to a randomly chosen server. The main result of gamarnik2018delay characterizes the resulting delay as a function of the level of memory and communication available. In particular, they consider the regime as the number of servers tends to infinity, and distinguish whether the average systemwide queueing delay vanishes in the limit.
Their main results are summarized in Figure 3. There are several interesting takeaways. First, as one would expect, delay improves as the amount of memory or the message rate increases. For instance, in the “High message regime,” the messaging rate of each server tends to infinity as the number of servers increases. This high rate of messages allows the system to achieve vanishing delay even when the dispatcher has only a limited memory size, capable of storing only a bounded number of server IDs. In contrast, for the same memory size, if the messaging rate per server remains bounded, then delay becomes strictly positive in the large-server limit. The same complementarity between memory and communication is partially mirrored in the “High memory regime,” where the servers send messages at a constant rate, no smaller than the average arrival rate per server, but the dispatcher can remember an unbounded number of server IDs. Here, the lack of communication is made up for by an abundance of memory, and vanishing delay is again achievable.
A second, and perhaps more striking, insight is that the roles of memory and communication in this model are not entirely symmetrical. In the “High memory regime,” the messaging rate per server is assumed to be bounded. In this case, having more memory alone is not sufficient for achieving vanishing delay. In particular, assuming that the number of server IDs the memory can store grows to infinity as the number of servers increases, vanishing delay is achieved if and only if the messaging rate per server is at least the average arrival rate per server. Another interpretation of this result is that each server needs to communicate its idling status at the rate of one message per job in order for more memory to have a meaningful impact on delay performance.
It is worth noting that the analysis carried out in gamarnik2018delay applies to the specific dispatching policy. While it is conceivable that some other policy could utilize the same memory and communication resources more efficiently, a follow-up paper [gamarnik2020lower] shows that any potential improvement will be limited. They prove that, under certain natural symmetry assumptions, any policy operating in the “Constrained regime” (see Figure 3), where the messaging rate per server stays bounded and the memory is limited in size, necessarily incurs a strictly positive delay in the large-server limit.
The interplay between memory, communication and system performance in a queueing network has also been investigated by xu2020information in a different system model. They study dynamic resource allocation in a single-hop stochastic processing network model under imperfect communication and limited memory. Unlike gamarnik2018delay, where communication is captured by messaging rates, xu2020information considers a more general communication model using memoryless Shannon channels from information theory, which allows for capturing both rate-limited and noisy communication. The focus of xu2020information is on understanding how to achieve the maximum throughput region when the decision maker can access the state of a queueing system only through a noisy and rate-limited channel, and when the algorithms are memory-limited. The main results of xu2020information consist of a family of Episodic MaxWeight policies that (approximately) achieve the maximum stability region under various combinations of memory and channel conditions. As another departure from the above load-balancing model, in which only the dispatcher possesses a memory, both the queues and the scheduler in xu2020information have their own memory bank. A surprising conclusion from the results in xu2020information is that it is not only how much memory that matters, but also where the memory is located: granting more memory to the scheduler, rather than to the queues, leads to a significantly larger impact on the system’s stability region.
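The dispatching policy of gamarnik2018delay described above can be explored numerically with a small event-driven simulation. The following is a minimal sketch under assumptions that are ours rather than the paper's: service times are exponential with unit rate, the memory is measured directly in the number of server IDs it can hold, a server whose ID is already stored is not stored twice, and all parameter names (`n`, `lam`, `msg_rate`, `capacity`) are illustrative.

```python
import random

def simulate_dispatcher(n, lam, msg_rate, capacity, horizon=200.0, seed=0):
    """Time-average number of jobs per server under the memory-based dispatcher.

    n        : number of servers, each serving at unit rate (assumed exponential)
    lam      : arrival rate per server (total arrival rate is lam * n)
    msg_rate : rate at which each idle server messages the dispatcher
    capacity : number of server IDs the dispatcher can store
    """
    rng = random.Random(seed)
    queues = [0] * n
    memory = []                 # IDs of servers that have reported being idle
    t, area = 0.0, 0.0
    while t < horizon:
        busy = [i for i in range(n) if queues[i] > 0]
        idle = [i for i in range(n) if queues[i] == 0]
        rates = (lam * n, float(len(busy)), msg_rate * len(idle))
        total = sum(rates)
        dt = rng.expovariate(total)
        area += sum(queues) * min(dt, horizon - t)   # do not count time past the horizon
        t += dt
        u = rng.random() * total
        if u < rates[0]:                             # a job arrives
            if memory:
                server = memory.pop(rng.randrange(len(memory)))  # use and erase a stored ID
            else:
                server = rng.randrange(n)            # memory empty: route at random
            queues[server] += 1
        elif u < rates[0] + rates[1]:                # service completion at a busy server
            queues[rng.choice(busy)] -= 1
        elif idle:                                   # message from an idle server
            server = rng.choice(idle)
            if server not in memory and len(memory) < capacity:
                memory.append(server)                # otherwise the message is discarded
    return area / (horizon * n)

if __name__ == "__main__":
    for rate in (0.0, 0.5, 5.0):
        print(rate, round(simulate_dispatcher(n=50, lam=0.7, msg_rate=rate, capacity=5), 3))
```

Varying `msg_rate` and `capacity` in such a simulation gives a rough, finite-system feel for the regimes discussed above, although the results in the cited works concern the limit as the number of servers grows.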
4.3 Literature Review and Discussions
There has been a growing body of literature that examines the impact of predictive information in queueing networks. xu2016using studies admission control in Emergency Departments where the operator may have access to limited and noisy future information. They propose a family of admission control policies that provably outperforms the optimal online policy at all arrival rates, in contrast to the policy in spencer2014queueing, which is only optimal in the heavy-traffic limit. Using simulations based on historical hospital admission data, xu2016using also demonstrate that leveraging future information can lead to substantial improvements in queueing delay over online policies even if the predictions are noisy and the lookahead window size is limited. ata2020optimal applies a version of the model with future information of spencer2014queueing to the problem of admission control in a call center with callback options. A key insight there is that the type of proactive admission policies that are shown to be optimal in the overloaded setting of spencer2014queueing remains effective in an underloaded system.
In some cases, having information about future arrivals goes hand in hand with being able to intervene on a more proactive basis. huang2015backpressure examine proactive scheduling policies that improve the standard BackPressure algorithms in a queueing network. In this model, future arrivals to a queue can be not only predicted, but also served, in advance. In a similar spirit, but in a different domain, hu2020optimal uses a queueing model to study how to provide proactive care in a healthcare application, where the care provider strives to achieve better treatment outcomes by proactively intervening on a patient who is at risk of deteriorating in the near future.
There is a vast body of literature on dispatching policies for load balancing that aim to reduce the communication overhead while maintaining desirable delay performance, including the celebrated power-of-d-choices algorithm, where the dispatcher queries d randomly sampled servers and sends the arriving job to the least loaded among them. mitzenmacher2002load, and more recently anselmi2020power, hellemans2020performance, study the impact of the dispatcher’s memory on the performance of this class of algorithms. To the best of our knowledge, gamarnik2018delay is the first to formally model the joint impact of communication and memory.
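For reference, the sampling-based dispatching rule mentioned above takes only a few lines. This is a minimal sketch in which the number of sampled servers is called `d` and sampling is done without replacement; both the parameter name and that choice are assumptions made for illustration.

```python
import random

def power_of_d_dispatch(queues, d, rng=random):
    """Sample d distinct servers uniformly at random and return the least loaded one."""
    sampled = rng.sample(range(len(queues)), d)
    return min(sampled, key=lambda i: queues[i])

if __name__ == "__main__":
    queues = [4, 0, 7, 2, 5]
    server = power_of_d_dispatch(queues, d=2)
    queues[server] += 1      # route the arriving job to the chosen server
    print(server, queues)
```

Each arrival costs only `d` queue-length queries, which is the sense in which this family of policies trades communication for delay.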
5 Reinforcement Learning in Queues
We now discuss the challenges in applying reinforcement learning theory to queueing systems. This discussion acts as a literature review of recent progress in the field. We highlight novel approaches in applying reinforcement learning in the context of queueing systems.
We will assume some knowledge of the formalism of Markov decision processes and reinforcement learning, such as Q-learning, policy gradients, function approximation, as well as the important role of simulation in training models. To very briefly summarize, a Markov decision process (MDP) is a Markov process where the evolution of the process is also influenced by actions, which are chosen by a policy. The purpose of the policy is to maximize the expected cumulative rewards received by the process. The optimal policy and its value can be calculated using the Bellman equation. In practice, the evolution of the MDP and its rewards are often unknown and instead must be estimated from data. The suite of algorithms for this joint estimation and optimization is referred to as reinforcement learning (RL). Often the MDP has an unreasonably large complexity; thus different forms of function approximation must instead approximate the solution of the Bellman equation or directly perform an approximate policy optimization on the MDP objective. Since the early ’90s, and particularly over the last 5 years, neural networks have been a commonly used framework for function approximation. Successful policies can require large amounts of data to be learned. Thus policy training is often performed via simulation. We refer the uninitiated reader to texts such as sutton2018reinforcement, bertsekas2019reinforcement and szepesvari2010algorithms for introductory accounts.
Below we emphasize the role of queueing models as an important application area for early reinforcement learning models. We then discuss research challenges in the theory of reinforcement learning when applied to queueing systems. We review recent literature that addresses a number of these challenges and finally emphasize their importance in emerging application areas.
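As a concrete instance of solving the Bellman equation when the model is known, the sketch below runs value iteration on a small admission control MDP. Everything about the instance is an assumption chosen for illustration: time is discrete, each step brings an arrival with probability `p_arr` or a service event with probability `p_srv`, the reward is the negative queue length minus a penalty `reject_cost` per rejected arrival, the queue is truncated at `max_queue`, and rewards are discounted by `gamma`.

```python
def value_iteration(p_arr=0.45, p_srv=0.5, reject_cost=5.0, max_queue=30,
                    gamma=0.95, tol=1e-8):
    """Value iteration for admission control on a queue truncated at `max_queue`.

    Returns (values, policy), where policy[q] is 1 if arrivals are admitted
    in state q and 0 if they are rejected.
    """
    assert p_arr + p_srv <= 1.0
    p_idle = 1.0 - p_arr - p_srv
    values = [0.0] * (max_queue + 1)

    def action_values(q, v):
        # Contribution of the outcomes in which no arrival joins the queue.
        no_arrival = p_srv * v[max(q - 1, 0)] + p_idle * v[q]
        reject = -q - p_arr * reject_cost + gamma * (p_arr * v[q] + no_arrival)
        if q == max_queue:
            admit = float("-inf")   # truncation: a full buffer must reject
        else:
            admit = -q + gamma * (p_arr * v[q + 1] + no_arrival)
        return reject, admit

    while True:
        new_values = [max(action_values(q, values)) for q in range(max_queue + 1)]
        gap = max(abs(a - b) for a, b in zip(new_values, values))
        values = new_values
        if gap < tol:               # Bellman updates have (numerically) converged
            break
    policy = []
    for q in range(max_queue + 1):
        reject, admit = action_values(q, values)
        policy.append(int(admit >= reject))
    return values, policy

if __name__ == "__main__":
    values, policy = value_iteration()
    print(policy)
```

Reinforcement learning replaces the known probabilities in `action_values` with estimates obtained from data or simulation; the truncation at `max_queue` is exactly the kind of modeling shortcut whose consequences for unbounded queues are discussed later in this section.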
5.1 Queueing in Early RL Literature.
The application of dynamic programming and reinforcement learning to queueing systems has existed almost since their inception. A good example is the work of crites1996improving. The paper considers allocation of a set of elevators in a building in order to minimize waiting time. The work is amongst the earliest examples of Q-learning with neural networks. The paper highlights a number of specific challenges that can occur within a queueing system: time is continuous and so the times at which decisions are made are part of the modeling process (in order to regulate the branching factor); there is often incomplete state information (for instance about the number of people waiting); there is decentralized coordination required between different agents; and the stochastic and potentially nonstationary arrival of passengers must be modeled. Many of these issues remain challenges for applying reinforcement learning in queueing systems.
The work of zhang1995reinforcement considers the application to job shop scheduling problems. Here complex planning constraints are approximated. singh1997reinforcement gives an early example of reinforcement learning applied to dynamic channel allocation. Here a channel assignment problem in a cellular telecommunication system is modeled as a dynamic program, and a reinforcement learning solution with function approximation is found to outperform the best heuristics. The application of neural network reinforcement learning to inventory management is considered by van1997neuro. In addition to neural network approximations, the cross-entropy method, when first applied to reinforcement learning, used inventory control as an example application in mannor2003cross. From these works it is clear that, after their inception, new reinforcement learning methods can readily be applied to a queueing system.
It is worth noting that the success of these methods does depend greatly on the predictability of the underlying queueing process. There is a risk of over-optimizing a system for incorrect parameters. As a related anecdotal example, we consider the routing of telephone calls. Here AT&T would perform periodic learning and optimization in order to determine dynamic routing of telephone traffic, see 391435. However, dynamic alternative routing (DAR) [gibbens1988dynamic], where routing decisions are made based on local blocking decisions, was found to be much more successful. Here, centralized learning and optimization of (epistemic) parameters performed worse than decentralized state-dependent (aleatoric) optimization. This is because the local state provided a better representation of the decisions to be made than the estimation of the state evolution. The role of current information in decision making, as discussed in the previous section, remains important here.
As we will discuss next, it continues to be the case that queueing systems provide a rich set of models where simulation, and thus reinforcement learning methods, are applicable. Nonetheless, it is also reasonable to say that, even in simple settings, mathematical guarantees on the correctness and convergence of reinforcement learning algorithms are not well understood beyond canonical assumptions such as a finite state space with each state having a lower bound on the proportion of time visited.
5.2 Challenges in applying Reinforcement Learning Theory to Queueing Networks.
There is a growing interest in the application of reinforcement learning to queueing systems, but there are evidently challenges. We need algorithms that learn parameters that we know to be fixed but provide flexibility to model variability. To do this, as the previous section suggests, models require correct and extended notions of state information. For instance, information on future arrivals, or contextual information on service rates, can improve decision making for these transient parameters. However, decentralization, continuous time modeling, incomplete state information, and simulation model misspecification can all inhibit our ability to leverage the benefits of reinforcement learning in queueing settings.
Nonetheless, the potential practical power of reinforcement learning methods remains appealing. Many features of a queueing model (e.g. the speed up and slow down of vehicles at a road junction) cannot be well modeled in closed form, and so simulation and data-driven solutions have the ability to capture features that are not well represented by handcrafted models. Before discussing recent literature, we highlight challenges that apply to the development of reinforcement learning for queueing systems:

Unbounded State Space. Queueing systems often have unbounded state spaces, while theoretical results on reinforcement learning assume large but finite state spaces. Function approximation can help alleviate this; however, optimization of rewards for simulations with small queue sizes may lead to suboptimal performance for a system in high load or in overload, which is often the situation that the system designer cares most about.

Unbounded State Cost. Even if function approximation may assist, queueing systems can have long regenerative cycles and thus large mixing times. Thus, if we wish to minimize queue size or delay, then the associated costs will be unboundedly large.

Stability Guarantees. We may wish to guarantee a level of first-order performance. We may wish to optimize cumulative rewards subject to stability over a range of loads. There are few guarantees of robustness to parameter uncertainty in a trained reinforcement learning model.

Nonstationarity. Parameters may change in time. Daily variations can be modeled through the use of contexts. However, in settings such as communications, fluctuations in load can be erratic, which again leads to the risk of learning one part of the state space at the cost of performance outside of these samples.

Decentralized Coordination. Queueing systems often consist of interconnected, geographically and logically diverse components. Often centralized coordination is not possible and simulation of the whole system may not be possible either. The optimization of individual components should contribute to the overall optimality of the system as a whole.

Dependence on Simulation. As with other application areas such as robotics, a simulation-driven solution may simulate behaviors not found in reality. Here we could imagine the modeling of road traffic: a simulation might incorporate the speed up and slow down of vehicles but may not account for lane changes or drivers accelerating through an amber light.

Continuous Time Modeling. Although not insurmountable, it is worth noting that many queueing applications evolve in continuous time, and thus appropriate decision epochs must be defined in order to reduce branching factors while allowing for a suitably rich set of policies to be implemented over time.

Incomplete State Information. In many applications the state of a queueing system may not be fully observable or may only be locally observable. In the Internet, congestion in one part of the system may only be sensed locally. In a data center, a dispatcher may only be able to gain information on a subset of its servers. We may not know how many people are waiting at a traffic light when a button is pressed.
The above list is by no means exhaustive but covers many issues that RL theory does not currently address for queueing systems.
5.3 Recent literature.
Here we review recent theoretical works that look to bridge the gap between theory and practice for reinforcement learning applied to queueing systems.
The paper of shah2020stable looks to address the issue of infinite state spaces found in queueing systems. The method considers a setting where there is known to be a Lyapunov function providing stability for some (optimal) policy. The approach places particular emphasis on evaluating actions for, thus far, unvisited states. Here, the online algorithm does have access to an oracle (or simulator) that can predict future performance from the current state. The paper of liu2019reinforcement takes a different approach by truncating the state space on which reinforcement learning can be applied. Here it is assumed that there is a known but perhaps suboptimal policy (such as MaxWeight) that can stabilize the queueing network. A standard RL algorithm is used for learning while the state remains within the truncated region; however, once the Lyapunov function associated with this policy increases beyond some threshold, the suboptimal stabilizing policy is applied.
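The truncation idea just described can be sketched as a wrapper around two policies. This is a minimal illustration under stated assumptions, not the algorithm of the cited paper: the quadratic Lyapunov function, the serve-the-longest-queue fallback, the constant placeholder for the learned policy, and all function names are hypothetical.

```python
def quadratic_lyapunov(queues):
    """Hypothetical Lyapunov function: sum of squared queue lengths."""
    return sum(q * q for q in queues)


def serve_longest_queue(queues):
    """Stand-in stabilizing policy: index of the longest queue (first on ties)."""
    return max(range(len(queues)), key=lambda j: queues[j])


def make_guarded_policy(learned_policy, stabilizing_policy, threshold):
    """Use the learned policy inside the truncated region, else fall back."""
    def policy(queues):
        if quadratic_lyapunov(queues) > threshold:
            # Outside the truncated region: apply the known stabilizing policy.
            return stabilizing_policy(queues)
        # Inside the truncated region: apply the (RL-trained) policy.
        return learned_policy(queues)
    return policy


def learned(queues):
    """Placeholder for an RL-trained policy; here it always serves queue 1."""
    return 1


policy = make_guarded_policy(learned, serve_longest_queue, threshold=50)
```

The design point is that learning only ever has to cover a finite set of states, while the fallback controls behavior whenever the queues grow large.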
Although not explicitly analyzing queueing networks, the work of qu2020scalable does much to analyze the effects of decentralized learning and control in a network setting. Here it is proven that if separate nodes of a network implement actions and if the impact of actions at neighboring nodes is delayed, then applying reinforcement learning locally yields a policy that is within a constant factor of the globally optimal policy. The degradation factor can be expressed in terms of either the discount factor of the MDP or the mixing time of the Markov decision process.
The previous papers place more focus on tabular reinforcement learning methods. However, in practice it would appear that function approximation is more commonly applied to yield a solution. The paper of dai2020queueing analyses the application of deep reinforcement learning to queueing systems. The paper considers the implications of proximal algorithms such as Proximal Policy Optimization and Trust Region Policy Optimization in application to queueing problems. A number of credible mathematical arguments are introduced to explain why policies will improve with these algorithms in a queueing setting. Further, a number of variance reduction techniques are introduced.
An under-investigated area is whether known formulas can be used to improve the optimization of a queueing system. Albeit not strictly a reinforcement learning setting, the paper of dieker2017optimal shows how closed-form queueing expressions can be incorporated into stochastic gradient descent procedures in order to expedite the optimization of a queueing network model.
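To illustrate how a closed-form expression can supply a gradient, the sketch below tunes the service rate of a single M/M/1 queue. It is a deliberately simplified, deterministic gradient descent and not the procedure of the cited paper. The M/M/1 model, the linear capacity cost `c * mu`, the step size and all names are assumptions made for illustration.

```python
def mm1_mean_queue_length(lam, mu):
    """Mean number in system for a stable M/M/1 queue: rho / (1 - rho)."""
    return lam / (mu - lam)


def cost_gradient(lam, mu, c):
    """Derivative with respect to mu of c * mu + lam / (mu - lam)."""
    return c - lam / (mu - lam) ** 2


def optimize_service_rate(lam, c, mu0, step=0.1, iters=500):
    """Gradient descent on capacity cost plus mean queue length."""
    mu = mu0
    for _ in range(iters):
        mu -= step * cost_gradient(lam, mu, c)
        # Project back into the stable region mu > lam.
        mu = max(mu, lam + 1e-3)
    return mu


# With lam = 1 and c = 1 the minimizer is lam + sqrt(lam / c) = 2.
mu_star = optimize_service_rate(lam=1.0, c=1.0, mu0=3.0)
```

Because the derivative of the mean queue length is available in closed form, no simulation-based gradient estimate is needed in this toy setting.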
A similar approach exploiting the structure of the queueing system is to simply start with an MDP queueing system with a known solution and then to use this structure along with parameter learning in order to establish theoretical bounds. This approach is taken in the case of inventory control; see agrawal2019learning. Here the base-stock policy is known to be optimal and the value function and cost functions are known to be convex. Through this, precise regret bounds are established for an online policy in comparison with the best base-stock solution.
5.4 Application Areas
We review a range of queueing application areas where reinforcement learning approaches have been applied. These studies usually focus on performance under simulation. Theoretical correctness and robustness guarantees are often not provided.
dai2019inpatient consider the application of reinforcement learning for the optimization of hospital patient inflows. Here the overflow decision problem is cast as an average cost MDP. An approximate dynamic program is formulated and verified. Markov decision process approaches to emergency response and vehicle assignment have been applied by van2018real. Reinforcement learning formulations are then considered by lopez2018distributed. feng2020scalable develop the work of dai2020queueing and apply this to ride sharing systems. A number of recent works consider applications in ridesharing systems; see qin2019deep for a review and discussion of the application in Didi’s ridesharing fleet.
Other transport applications involve traffic signal control. The SurTrac system in the US applies estimation along with a forward dynamic programming solution; see smith2013smart. Companies such as Vivacity explicitly apply deep reinforcement learning in their traffic control system (see https://vivacitylabs.com/technology/junctioncontrol/). See cabrejas2021reinforcement for a recent study that demonstrates the ability of deep RL to outperform current commercial traffic signal controllers.
Studies of distributed reinforcement learning for manufacturing processes are considered in qu2018dynamic. Here the stochastic processing network model of dai2020processing is considered within a reinforcement learning framework. Further studies in semiconductor manufacturing are given by park2019reinforcement.
Applications in communications systems and data centres were initiated by tesauro2005online. See also tesauro2005utility and tesauro2006hybrid for data centre applications of RL. The area has received further attention with DeepMind’s application in Google data centres (see https://deepmind.com/blog/article/deepmindaireducesgoogledatacentrecoolingbill40). A more recent study focusing on congestion control protocols is given by tessler2021reinforcement.
The above references are intended to be indicative of applications of current interest. Regardless of theoretical challenges, there are also challenges in more applied studies. It is reasonable to say that there are no standardized queueing problems, even within specific fields. For instance, there are no generic simulators like OpenAI’s Gym environment, although within fields there are simulation environments such as the ns-3 simulator (https://www.nsnam.org/) for networking, or Vissim (https://www.ptvgroup.com/en/solutions/products/ptvvissim/) and SUMO (https://sumo.dlr.de/docs/index.html) for traffic control. Further, some queueing areas remain very challenging. For instance, the DAG scheduling problem in heterogeneous computing remains challenging due to the large array of potential DAGs that a platform must support, yet dynamic programming heuristics remain popular; see topcuoglu2002performance. Further, in many practical applications (air traffic control or hospital assignment) there is human-level control which governs part of the dynamics and the costs and rewards of the process. Thus a set of solutions is likely required to be readily available. This may limit the ability to consider completely model-free learning in some application areas, or at least requires models that can be trained in close to real time.
6 Conclusions
In this tutorial, we have seen how classical queueing algorithms such as MaxWeight and Back Pressure can be viewed as an application of Blackwell’s Approachability Theorem. Through this we see connections between queueing and online learning theory. This online learning approach does not require the explicit statistical estimation of parameters. However, to consider more general models, we see that the underlying parameters such as service rates must be estimated. We discuss this next and develop connections between statistical learning of sequences of data and Lindley’s recursion for the G/G/1 queue. Through this we develop a notion of regret for a single server queue. As an example we show how the perceptron algorithm can be used for the classification of customer service at a queue, and we bound the impact that this has on the queue length of the implemented policy relative to the (unknown) optimal policy. It would be interesting to develop these notions for networks of queues. We then discuss the role of state estimation when compared with parameter estimation. Here we discuss two examples. The first considers the importance of future information in making access control decisions. The second considers the importance of memory in load balancing systems. Finally, we review the application of reinforcement learning, highlighting the importance of reinforcement learning to solve practical queueing challenges as well as current gaps between theoretical results in reinforcement learning and their applicability to modern queueing problems.
Certainly we anticipate that over the coming years different approaches in learning, control and information will grow in their applicability to queueing systems, and we hope that this tutorial provides some necessary background and introduction to modern methods in this emerging area.
Appendix A Appendix
A.1 Proof of Blackwell’s Approachability Theorem
We now restate and prove Blackwell’s Approachability Theorem.
Theorem 5 (Blackwell’s Approachability Theorem).
The following are equivalent

is approachable.

For every there exists such that .

Every halfspace containing is approachable.
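For reference, the three conditions can be written out in commonly used notation. The symbols below are assumptions chosen for illustration and may differ from those used elsewhere in this tutorial: $\mathcal{A} \subset \mathbb{R}^d$ is a closed convex set, $p$ and $q$ are the mixed actions of the player and the adversary, and $r(p,q)$ is the expected vector payoff.

```latex
% Assumed notation: \mathcal{A} closed and convex, p the player's mixed action,
% q the adversary's mixed action, r(p,q) the expected vector payoff.
\begin{enumerate}
  \item $\mathcal{A}$ is approachable.
  \item For every $q$ there exists $p$ such that $r(p,q) \in \mathcal{A}$.
  \item Every halfspace containing $\mathcal{A}$ is approachable.
\end{enumerate}
```

Here a set is called approachable if the player has a strategy under which the distance from the time-averaged payoff vector to the set tends to zero, whatever the adversary does.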
Proof.
First we need to show that the decision given in (6) exists. To this end, we recall that the Minimax Theorem states that for a matrix it holds that
A less compact way of writing the Minimax Theorem is that for every
(17) 
(Informally this states that if there exists a good choice of for each then there exists a good choice for all .)
Now if a halfspace, , that contains is approachable, then it must be that when the adversary chooses there exists a choice such that
(18) 
If this were not true then certainly would not be approachable.
The statement (18) can be recast in terms of the Minimax Theorem. Specifically, we can define the matrix with components . Thus by definition . Equation (18) now states that such that
Thus by the Minimax Theorem, (17), there exists such that
So, we now know that for all hyperplanes containing , there exists a such that
(19) 
This establishes the existence as described in (6).
Next we show that the sequence , is such that approaches .
(20) 
The first inequality holds since the projection is by definition closer to . The equality is the parallelogram rule. The final inequality follows since, by the definition of and since
Further, applying the above equality, to the final term in (20) gives
(21) 
Substituting (21) into (20) and taking expectations, we have
(22) 
where in the 1st inequality above we note that
and in the 2nd inequality above we use the fact that
which holds by our assumption on , (6).
Multiplying both sides of (A.1) by and rearranging gives
Summing these interpolating terms gives
Thus, as required,
(23) 
∎
A.2 Hannan-Gaddum Theorem
Theorem 6 (Hannan-Gaddum Theorem).
There exists a playing strategy such that for any
(24) 
In other words, our performance in the game is asymptotically as good as the best fixed action.
The proof follows as a consequence of Blackwell’s Approachability Theorem.
Proof. We define the vector payoff and convex region . For all there exists such that componentwise , in particular we choose to be the probability distribution with where . This verifies condition 2 of Blackwell’s Approachability Theorem. Thus there exists a strategy such that for all
As a consequence, we can analyse the expected regret:
(25)  
(26)  
For the first inequality above, we note that the maximizing term in (25) is included in the sum in (26). The second inequality follows from Jensen’s inequality. ∎
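A strategy of the kind guaranteed by Theorem 6 can be illustrated numerically. The sketch below uses regret matching, a standard algorithm derived from approachability arguments; it is offered as an illustration and is not necessarily the exact strategy constructed in the proof above. Rewards are assumed to lie in [0, 1], expected rather than sampled rewards are used so that the run is deterministic, and the random payoff sequence and all names are hypothetical.

```python
import random


def regret_matching_average_regret(payoffs):
    """Average regret of regret matching against the best fixed action.

    payoffs[t][i] in [0, 1] is the reward of action i in round t.
    """
    n = len(payoffs[0])
    cum_regret = [0.0] * n
    earned = 0.0
    for u in payoffs:
        positive = [max(r, 0.0) for r in cum_regret]
        total = sum(positive)
        # Play proportionally to positive regret; uniform if none is positive.
        p = [x / total for x in positive] if total > 0 else [1.0 / n] * n
        expected = sum(pi * ui for pi, ui in zip(p, u))
        earned += expected
        # Regret of each action relative to the expected reward obtained.
        cum_regret = [r + ui - expected for r, ui in zip(cum_regret, u)]
    best_fixed = max(sum(u[i] for u in payoffs) for i in range(n))
    return (best_fixed - earned) / len(payoffs)


rng = random.Random(0)
T, N = 10000, 3
payoffs = [[rng.random() for _ in range(N)] for _ in range(T)]
avg_regret = regret_matching_average_regret(payoffs)
```

For rewards in [0, 1], the standard analysis of this rule bounds the average regret by the square root of N / T, which vanishes as T grows, in line with the statement of the theorem.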
A.3 Proof of MaxWeight Stability
Theorem 7.
Given are independent identically distributed with mean then

If then, regardless of the policy used, is transient.

If then, under the MaxWeight policy, is positive recurrent.
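As an illustration of the policy in the second part of the theorem, a minimal sketch of MaxWeight for a discrete-time switched queue follows. It assumes the standard form of the rule, in which the schedule maximizing the queue-weighted service is chosen in each slot. The Bernoulli arrival model, the two-schedule example and all function names are illustrative assumptions rather than part of the theorem.

```python
import random


def maxweight_schedule(queues, schedules):
    """Return the schedule sigma maximizing sum_j queues[j] * sigma[j]."""
    return max(schedules, key=lambda s: sum(q * x for q, x in zip(queues, s)))


def simulate_peak_queue(arrival_probs, schedules, steps, seed=0):
    """Run MaxWeight and return the largest total queue length observed."""
    rng = random.Random(seed)
    queues = [0] * len(arrival_probs)
    peak = 0
    for _ in range(steps):
        # Independent Bernoulli arrivals to each queue.
        for j, a in enumerate(arrival_probs):
            if rng.random() < a:
                queues[j] += 1
        # Serve according to the MaxWeight schedule; queues cannot go negative.
        sigma = maxweight_schedule(queues, schedules)
        queues = [max(q - s, 0) for q, s in zip(queues, sigma)]
        peak = max(peak, sum(queues))
    return peak


# Two queues sharing one server: exactly one queue is served per slot.
schedules = [(1, 0), (0, 1)]
peak = simulate_peak_queue((0.3, 0.4), schedules, steps=20000)
```

The rule uses only the current queue lengths and never the arrival rates, which is the sense in which arrival parameters do not need to be estimated.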
Proof. For Part i), we note that given then there is a hyperplane separating from . We let be the normal vector of this hyperplane. Since this hyperplane is separating there exists an such that
for all . Thus we see that irrespective of the policy used it must hold that , which clearly diverges as . This proves Part i).
We now focus on Part ii). Here the crux of this argument is that for all queue sizes , is by some margin strictly less than the MaxWeight choice . Given above, we aim to use , the biggest such that for all time. This can be defined as the biggest such that for all i.e. so the distance of the smallest component to the boundary is maximized. Given this optimization description we observe the following