In this work, we consider a statistical learning problem that is motivated by the following application. Consider a wireless communication system consisting of a single transmitter/receiver pair and channels over which they may communicate. Packets randomly arrive to the transmitter’s queue and wait in the queue until they are successfully delivered to the receiver. At each time slot, the transmitter can decide to transmit a frame on one of the channels. If the queue is non-empty, the transmitted frame carries a packet with it over the channel, and if the frame is then successfully received, the packet is successfully delivered. For each channel, each frame transmission attempt is successful according to a probability that is initially unknown. The transmitter is informed whether a transmission was successful via receiver feedback that immediately follows each transmission. The objective of the controller is to minimize the queue’s backlog by using the receiver feedback to quickly identify the best channel to transmit on.
In the above application, each successful frame transmission offers one packet’s worth of service to the queue. Thus, the channels in the above application behave like servers in a general queueing system that, when selected, offer a random amount of service. Given the above motivation, we consider the problem of identifying the best of available servers to minimize the queue’s backlog. To this end, we associate a one unit delay-cost to each time slot that each packet has to wait in the queue. To obtain good performance, the controller must schedule the servers to explore the offered service rate that each gives and also exploit its knowledge to schedule the server that appears to give the most service. We define the queue length regret as , where is the backlog under a learning policy and is the backlog under a controller that knows the best server. Our objective is to find a policy that minimizes this regret.
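The queue length regret defined above can be illustrated with a small simulation sketch. This is a hedged illustration, not the paper's code: the names (`simulate_backlog`, `queue_length_regret`) and the Bernoulli arrival/service model anticipate the problem setup given later, and the coupling of sample paths via a shared seed is our own simplification.

```python
import random

def simulate_backlog(policy, mu, lam, T, seed=0):
    """Simulate a single queue with Bernoulli(lam) arrivals and servers
    that, when scheduled, offer Bernoulli(mu[i]) service.

    `policy(t, q)` returns the index of the server to schedule given the
    current backlog q.  Returns the list of backlogs over slots 0..T-1.
    """
    rng = random.Random(seed)
    q, backlogs = 0, []
    for t in range(T):
        backlogs.append(q)
        i = policy(t, q)
        served = 1 if rng.random() < mu[i] else 0   # offered service
        arrived = 1 if rng.random() < lam else 0    # new arrival
        q = max(q - served, 0) + arrived            # serve, then admit
    return backlogs

def queue_length_regret(policy, mu, lam, T, runs=20):
    """Estimate the extra cumulative backlog of a learning policy relative
    to a genie that always schedules the best server (coupled seeds)."""
    best = max(range(len(mu)), key=lambda i: mu[i])
    total = 0.0
    for r in range(runs):
        ql = simulate_backlog(policy, mu, lam, T, seed=r)
        qs = simulate_backlog(lambda t, q: best, mu, lam, T, seed=r)
        total += sum(ql) - sum(qs)
    return total / runs
```

Running the genie itself through `queue_length_regret` returns exactly zero, since the coupled sample paths coincide; any policy that lingers on a suboptimal server accrues positive regret.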
Our problem is closely related to the stochastic multi-armed bandit problem. In this problem, a player is confronted by a set of possible actions of which it may select only one at a given time. Each action provides an i.i.d. stochastic reward, but the statistics of the rewards are initially unknown. Over a set of successive rounds, the player must select from the actions both to explore how much reward each gives and to exploit its knowledge to focus on the action that appears to give the most reward. Learning policies for solving the multi-armed bandit have long been considered . Historically, the performance of a policy is evaluated using regret, which is defined to be the expected difference between the reward accumulated by the learning policy and by a player that knows, a priori, the action with the highest mean reward. This quantifies the cost of having to learn the best action. It is well known that there exist policies such that the regret scales on the order of and that the order of this bound is tight . In the seminal work of , policies for achieving an asymptotically efficient regret rate were derived, and subsequent work in [5, 6, 7] has provided simplified policies that are often used in practice.
One approach to solving our problem would be to simply use traditional bandit algorithms. In this approach, the reward in the bandit problem would be viewed as offered service in our problem, and the resulting methods would focus on maximizing the rate of offered service to the queue. In the context of the wireless system described above, this is equivalent to targeting a high rate of successful frame transmissions without regard to which frames are packet-bearing. Note that in an infinitely backlogged system, which always has packets available to send, this approach would maximize throughput. Since, as a rule of thumb, higher service throughput generally corresponds to lower packet delay in queueing systems, this approach has merit.
However, the drawback of this approach is that it does not exploit the fundamental queueing dynamics of the system. During time periods when the queue is empty, the offered service goes unused, and the controller can therefore freely poll the servers without hurting its objective. This contrasts with non-empty periods, when exploring can be costly since it potentially uses a suboptimal server. However, a controller cannot take this lesson too far and restrict its explorations only to time slots when the queue is empty. This is because some servers may have a service rate that is below the packet arrival rate, and if the controller refuses to explore during non-empty periods, it may settle on destabilizing actions that cause the backlog to grow to infinity.
Instead, policies should favor exploration during periods when the queue is empty but perform enough exploration during non-empty periods to maintain stability. In this work, we show that for systems that have at least one server whose service rate is greater than the arrival rate, there exist queue-length based policies under which the queue length regret remains bounded, i.e., O(1). (For functions f and g, f(t) = O(g(t)) iff there exist c > 0 and t0 such that f(t) ≤ c g(t) for all t ≥ t0. Likewise, f(t) = Ω(g(t)) iff there exists c > 0 such that f(t) ≥ c g(t) for all t ≥ t0.) Likewise, we show that any traditional bandit learning algorithm, when applied to our setting, can have queue length regret that grows logarithmically in time. We additionally show that if there does not exist a server with a service rate that is greater than the arrival rate, then for any policy the queue-length regret can grow as .
The problem considered herein is related to the work of , which also considered learning algorithms for scheduling service to a queue. The main difference is that in  the controller did not use the queue backlog to make decisions. As a result, the policies considered in that work were closely related to those in the bandit literature and focused on maximizing offered service. Under these policies,  showed that the tail of diminishes as , implying that is logarithmic. In this work, we show that by exploiting empty periods, the queue length regret is bounded.
Our problem can also be compared to the more general literature on reinforcement learning. In reinforcement learning, an actor seeks to learn the optimal policy for a sequential decision making problem with an evolving system state. The decision made by the actor at each state causes the actor to obtain a probabilistic reward and the system to randomly change its state. The objective is to learn how to obtain high reward by estimating the system’s statistics. See [9, 10] for a review of reinforcement learning. The online performance of reinforcement learning algorithms has been previously explored in  and , which used the upper confidence methods UCRL and UCRL2. The work of  has improved upon these bounds, and 
extended the results to weakly-communicating problems. Thompson sampling has also emerged as a design principle. Posterior sampling reinforcement learning (PSRL), which is based on this principle, has become a popular method that can, in certain problems, outperform upper confidence methods [16, 17, 18, 19].
Our work can be viewed as a reinforcement learning problem with known structure. The queue backlog is the problem’s state, and a penalty is paid for inefficiently scheduling whenever the backlog is nonzero. Note that in our problem, the amount of regret that can be accrued at a given time is unbounded, and the problem’s state space (the queue occupancy) is infinite. Our work has some overlap with the field of safe reinforcement learning [20, 21, 22]: we seek learning policies that do not destabilize the system and that regularly return the queue backlog to the empty state. This structure separates our work from most of the previous literature, which explores algorithms that are not designed to exploit queueing dynamics and does not focus on analyzing queue stability under different policies. The algorithms we design obtain their performance guarantees by using what is known about the problem (the dynamics of the queue and their impact on regret) while focusing on learning what is unknown (the server rates). Since the amount of service that a server will offer when scheduled is independent of the current queue backlog, our methods do not waste time trying to match the action-rewards to individual states, as would occur with a general reinforcement learning algorithm that knew nothing about the problem’s structure.
The use of statistical learning methods for optimizing wireless channel access in systems with uncertain channel conditions has been considered in the previous literature [23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35]. However, most previous works examined algorithms for maximizing transmission opportunities and did not focus on minimizing queueing delay. A major contribution of this work is to show that these two objectives are not necessarily the same. A learning algorithm that maximizes total offered service to a queue may not statistically minimize the queue backlog over time.
Scheduling algorithms that use queue state to make service decisions have a long history. In the seminal work of  and , the max-weight algorithm for assigning service to queues was shown to maximize throughput under complex scheduling constraints and probabilistic dynamics. This framework has been extended and applied to network switching , satellite communications , ad-hoc networking [40, 41], packet multicasting and broadcasting , packet-delivery-time reduction , multi-user MIMO , energy harvesting systems , and age-of-information minimization [46, 47]. In the works of  and , learning algorithms were used for achieving network stability under unknown arrival and channel statistics. The methods considered in those works augmented the max-weight algorithm with a statistical learning component. The resulting methods were shown to achieve stability but did not analyze queue length regret. In , learning algorithms for controlling networks with adversarial dynamics were considered and the max-weight algorithm and tracking algorithm were both explored as solutions for minimizing queue backlogs. However, the adversarial model considered in that work is pessimistic compared to the stochastic model considered herein.
The focus of this work is on proving the achievability of asymptotically-optimal regret growth. This requires that the controller sample all servers often enough to converge on the optimal server and achieve bounded regret. Importantly, the controller must concurrently guarantee that the queue does not blow up to infinity and that exploration of sub-optimal servers is eventually limited to times when the queue is empty. The policies we use to prove our results are not optimized for finite-time queue length regret. As a result, although the tail of their regret is optimal, they can initially accrue large regret. We designed these policies to facilitate analysis, which is made difficult by the high correlation between the queue backlog (which determines when to sample the servers to converge on the best server) and the observations of the servers (which determine the queue backlog and are used for identifying the best server). Our analyzed algorithms disentangle this relationship at the expense of good non-asymptotic performance. Nevertheless, we show through simulation that the insight gained from our analysis can inspire well-performing heuristics that exploit the structure found in our theoretical proofs. Analyzing and improving upon these heuristics is a potential direction for future research.
The rest of this paper is organized as follows. In Section II, we give the problem setup. We analyze the performance of traditional bandit algorithms, when applied to our problem, in Section III. In Sections IV and V, we characterize the regret of queue-length based policies when the best server has a service rate above and below the arrival rate, respectively. In Section VI, we use our theoretical analysis to define heuristic policies and test their performance through simulation. We conclude in Section VII. Note that a subset of the material in this paper appeared in .
II Problem Setup
II-A Problem Description
We consider a system consisting of a single queue and servers (where integer ) operating over discrete time slots . See Fig. 1. Packets arrive to the queue as a Bernoulli process with rate and wait in the queue until they receive service from one of the servers. Each packet can be serviced no sooner than the next time slot after its arrival. Each server’s ability to offer service randomly fluctuates. Specifically, the number of packets that server can service at time follows a Bernoulli process with rate . The arrival process and server processes are assumed to be independent of each other. We refer to server as stabilizing if , which implies that the rate at which the server can provide service keeps up with the arrival process. Otherwise, we refer to it as non-stabilizing. Given the above, an instantiation of our problem is characterized by the tuple , where
is the vector of service rates. Then, we let be the set of all problems (i.e., tuples).
The main challenge in our problem is that the system can only activate one server at a given time. At each time slot, a system controller must select only one of the servers and ask it to offer service. Then the selected server will inform the controller of the number of packets that it can offer to serve and, if the queue is non-empty, will service that number of packets. We denote the controller’s choice at time as . For decision , the service offered to the queue is then denoted which is equal to . Throughout this work, the controller is not allowed to observe the offered service processes prior to making its decision , and therefore it cannot know which servers will offer service prior to its choice.
Given the above, the queue backlog, , evolves as
where is used to denote the maximum of and . We assume the queue begins empty, and the controller knows this.
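One step of the queue evolution can be written out explicitly. This is a sketch under the stated conventions (serve first, then admit the arrival, since a packet can be serviced no sooner than the slot after it arrives); the function name is ours.

```python
def next_backlog(q, offered, arrivals):
    """One step of the queue recursion: the backlog is reduced by the
    offered service (floored at zero) and then increased by new arrivals.
    Matches the convention that a packet cannot be served in the slot
    in which it arrives."""
    return max(q - offered, 0) + arrivals
```

For example, an arrival to an empty queue raises the backlog to one, and a subsequent unit of offered service returns it to zero; service offered to an empty queue is simply lost.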
The system incurs a unit cost, in delay, for each time slot that each packet has to wait in the queue. Therefore, for a time horizon , the controller wishes to minimize . Since the controller cannot observe the values of prior to making its decision , the optimal action is to select the server
to provide service to the queue. For simplicity, we assume is always unique. (We then let be the set of all tuples such that is unique.) In our framework, the controller does not a priori know the values of and must therefore use observations of to identify . This implies that the service available from chosen server is revealed to the controller after time , but the service that would have been available from all other servers remains hidden. Note that the controller can observe at all times , even when . Finally, in this work, we assume that in addition to not knowing the values of , the system does not initially know the exact value of , either.
Given the above, the objective is to design a controller that uses its observations to decide which server to schedule at each time slot to minimize
To this end, denoting the history of all previous arrivals, service opportunities, and decisions as
we need to specify a controller policy , which at each time uses any portion of to (possibly randomly) make a decision on . Note that since contains the history of all previous arrivals and offered service to the queue, at time , the policy implicitly knows the current backlog as well (i.e., we could explicitly include in , but this would be redundant). We let be the set of all policies that we could design.
Define to be the queue backlog under the controller that always schedules , and let be the backlog under our policy that must learn which server is optimal. We will analyze the performance of using the following definition of queue length regret,
where the expectation is over the product measure of , where denotes the operation of and the operation of the scheduler that always selects . We will often simply refer to this metric as regret in the rest of the paper. Observe that, if our policy minimizes (2) over the set , it must also minimize over the set.
II-B Analyzing Queue Length Regret
Since at any time , server has the highest probability of offering service to the queue, it is clear that the controller that always schedules must minimize the expected queue backlog. Therefore, for every time and any policy ,
which implies that is monotonically increasing in for all . Since the queue begins empty at time , using (1), .
The focus in this work will be on characterizing how queue length regret can scale with time horizon . Therefore, we will not focus on finding policies that optimize for some finite value of . Instead, we will characterize as goes to infinity. We will proceed to show that under different assumptions on and , the regret . Define the tail of the regret to be
Intuitively, this is the additional cost-to-go of infinitely running the policy past time . Then the following proposition holds.
If for policy and subset , for each , then as for every .
Thus, we see that a policy that obtains queue length regret seeks to minimize the tail of the regret as time goes to infinity (i.e., causes the tail to trend to zero). A policy that optimizes the tail of the regret is not guaranteed to minimize the queue length regret. Indeed, the policies we proceed to analyze in Section IV do not generally achieve small queue length regret and are instead designed to facilitate proving that queue length regret is achievable. Nevertheless, we will show in Section VI that the insight gained from these policies can be used to construct heuristics with good regret performance.
III Regret of Traditional Bandit Algorithms
In this section, we establish that any policy that does not use previous observations of the arrival process must have a queue length regret that grows logarithmically with for some . Under this restriction, the policies in this section (in effect) do not observe previous arrivals to the queue, and their decision making process is solely focused on the observed offered services. Since the policies do not monitor process , they cannot directly use the queue backlog to make their decisions, either. Even so, we may still borrow any strategy from the traditional stochastic multi-armed bandit literature to solve the problem. These policies focus only on maximizing offered service to the queue, using previous observations of the offered service to guide their decisions. Without loss of generality, the theorem is established for a system with two servers. The theorem’s proof makes use of a well-known lower bound on the performance of bandit learning algorithms [3, Theorem 2.2].
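A queue-blind policy of the class analyzed here can be sketched with a UCB1-style index rule. This is a hedged illustration: the section does not prescribe UCB1 specifically, and the class name and structure below are ours. The scheduler sees only the offered-service feedback of the servers it selected, never the arrivals or the backlog.

```python
import math

class UCB1Scheduler:
    """Queue-blind scheduler: selects servers by a UCB1-style index on
    observed offered service, ignoring arrivals and the queue backlog
    (the class of policies this section lower-bounds)."""

    def __init__(self, n_servers):
        self.counts = [0] * n_servers     # times each server was scheduled
        self.means = [0.0] * n_servers    # sample mean of offered service
        self.t = 0                        # number of decisions made

    def select(self):
        self.t += 1
        for i, c in enumerate(self.counts):
            if c == 0:                    # schedule each server once first
                return i
        return max(range(len(self.counts)),
                   key=lambda i: self.means[i]
                   + math.sqrt(2 * math.log(self.t) / self.counts[i]))

    def observe(self, i, offered):
        """Fold an offered-service observation into server i's mean."""
        self.counts[i] += 1
        self.means[i] += (offered - self.means[i]) / self.counts[i]
```

Fed deterministic feedback in which server 0 always offers service and server 1 never does, the index rule concentrates its selections on server 0 while still sampling server 1 at a slowly growing rate, which is exactly the behavior the logarithmic lower bound exploits.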
Let be the service rate when the service rates are sorted in increasing order.
For any that does not use previous observations of the arrival process, with server rates and () such that for any fixed , .
Note that to facilitate later comparison to Theorem 4, the theorem statement specifies , but the proof actually holds for the larger set . We let
be the indicator random variable.
We apply a sample path argument. There are two servers in the system. Denote the optimal server as and the suboptimal server as . Consider an arbitrary policy that does not make its decisions using observations of the arrival process. Define to be the decision process of a controller that follows up to time , but at time chooses server with probability . Let and be the queue backlogs under the two respective controllers.
Then, for any outcome defining a fixed sample path for the arrival process, service processes, and possible randomization in
Denote the offered service from servers and as and respectively and the decision by policy at time as . Then, from (3) and the queue evolution equation,
Thus, taking expectation and summing over time
Note that each defines a different decision process leading up to time . For example, chooses server with probability at time , but follows the actions of policy at time (and chooses server with probability at time , instead). One can then see that in the above, at each index , we are comparing the performance of to the performance of a different decision process. Since the queue begins empty at , by (1), the expected queue backlog under any controller must be equal to and at times and , respectively. Thus, the left-hand side of (4) has been chosen as a summation from time slot to .
Now, and are functions of events up to time and are therefore independent of and . Thus, we can write
Now, and are not necessarily independent events. However, we can use the fact that an arrival at time is a subset of the event that to lower bound the above (i.e., ). Then,
We then use the fact that policy is independent of the arrival process to obtain
Note that .
Now, policy is a strategy for solving the stochastic multi-armed bandit problem (i.e., it uses the previous actions and observed offered service to decide on which server to schedule at time ). From [3, Theorem 2.2], there must exist a such that (this is because either the policy has for some and some Bernoulli distribution on the rewards, which we can assume to be without loss of generality, or for any Bernoulli distribution on the rewards; see [3, Theorem 2.2] for details)
Then, because for all (it is easy to show through dynamic programming that, when is given to the controller a priori, the controller that minimizes the expected queue backlog at each time schedules for all time ),
Therefore, by (8),
Using the definition of given by (2), this proves the result. ∎
IV Regret of Queue-Length-Based Policies
In this section, we examine the asymptotic scaling of regret for policies that make decisions using (or equivalently, the previous observations of the arrival process). We do so by considering the problem for different subsets of that impose different assumptions on the relationship between and . This section will build to the main result of this work, Theorem 4, which states that there exists a such that for any problem with , .
We begin our analysis in Subsection IV-A under the assumption that every server is stabilizing (i.e., , ). Under this assumption, the controller does not need to account for the possibility of system instability. As a result, the controller can limit itself to performing exploration only on time slots in which the queue is empty. During time slots in which the queue is backlogged, the controller will exploit its previously obtained knowledge to schedule the server it believes to be best.
In Subsection IV-B, we allow for both stabilizing and non-stabilizing servers in the system (i.e., is allowed to be less than for a non-empty subset of the servers). However, we will assume that the controller is given a randomized policy that achieves an offered service rate that is greater than the arrival rate to the queue. Note that the controller does not need to learn this policy and can use it to return the queue to the empty state. The given randomized policy is not required to have any relationship to and will not, in general, minimize queue length regret. As a result, to minimize queue length regret, the controller will not want to excessively rely upon it.
In Subsection IV-C, we further relax our assumptions on the problem and only require that . In this subsection, the controller will need to identify which servers are stabilizing, while simultaneously trying to minimize . This will require a policy that does not destabilize the system. To this end, the controller will have to explore the servers’ offered service rates during both time slots when the queue is empty and backlogged. Explorations during time slots when the queue is backlogged, in general, waste work and should therefore be performed sparingly. Intuitively, as the controller identifies which subset of servers have service rates , it can focus its explorations on time slots when the queue is empty.
The above three cases build upon one another. The insight from one will point to a policy for the next, and we will therefore analyze the above cases in sequence. Under each of the above assumptions, we will find that there exists a policy such that the regret converges for all meeting the assumption. In contrast, in Section V, we show that there does not exist a policy that can achieve convergent regret over the class of all problems for which , .
Note that the main difficulty in evaluating a policy’s performance is the correlation between the time-evolving queue backlog and the history of observed offered service, which is used to find the best server. In general, the queue will be empty more often once we begin to identify the best server, and, likewise, the best server will be more easily identified when we begin to obtain more empty periods in which we can freely explore. The policies considered in this section have been chosen to disentangle this relationship so that well-known tools from statistics can be applied in our analysis. Specifically, our results will make use of Hoeffding’s inequality, which we restate here for convenience.
Consider where are i.i.d. random variables with mean and support between for . Then, by Hoeffding’s inequality, for all ,
For simplicity, we will often use Hoeffding’s inequality in our analysis even when tighter concentration inequalities may exist.
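As a numeric sanity check of the two-sided Hoeffding bound for i.i.d. samples supported on [0, 1], the sketch below compares the bound 2·exp(−2nε²) against the empirical deviation frequency of Bernoulli sample means. The function names and parameter choices are ours, for illustration only.

```python
import math
import random

def hoeffding_bound(n, eps):
    # Two-sided Hoeffding bound for the sample mean of n i.i.d.
    # random variables supported on [0, 1].
    return 2 * math.exp(-2 * n * eps * eps)

def deviation_frequency(p, n, eps, trials=2000, seed=0):
    """Fraction of trials in which the sample mean of n Bernoulli(p)
    draws deviates from p by at least eps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        m = sum(1 if rng.random() < p else 0 for _ in range(n)) / n
        if abs(m - p) >= eps:
            hits += 1
    return hits / trials
```

For Bernoulli(0.7) samples with n = 100 and ε = 0.15, the bound evaluates to roughly 0.022, while the empirical frequency is an order of magnitude smaller, consistent with the remark that tighter concentration inequalities exist.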
IV-A All Servers Are Stabilizing
In this subsection, we assume all servers are stabilizing and design policies that are able to obtain regret under this assumption. Specifically, we will assume that the environment will only choose parameters from a subset defined below.
Under this assumption we will prove the following theorem, which states that there exists a policy such that, for every in the set , the regret converges. The proof of Theorem 2 is constructive and will give a policy for obtaining the result.
Under Assumption 1, such that, for each , .
To prove Theorem 2, we analyze the policy shown in Fig. 2 on an arbitrary . This policy maintains sample mean variables that estimate the servers’ service rates . Each sample mean is updated using observations of server ’s offered service. However, it is important to note that each sample mean is not updated at every time slot that we schedule its corresponding server, and we will therefore not use every observation of the offered services to construct the sample means. Instead, to facilitate analysis, we will only update the sample means at strategically chosen time slots. Throughout this section, sample mean variables are initialized to zero prior to their first update with a service observation. After the first update, the sample mean variables equal the sum of the used observations divided by the number of used observations.
Note that under policy , the queue backlog will transition through alternating time intervals, wherein the queue is continuously empty over an interval of time (i.e., ) and then continuously busy over an interval of time (i.e., ). We enumerate the periods using positive integers and refer to the occurrence of an empty period (busy period) as empty period (busy period , respectively). To arrive at Theorem 2, we will analyze the integrated queue backlog taken over the busy periods. To this end, we define the integral of the queue backlog over busy period to be the summation of the queue backlog taken over those times that are in busy period . By our problem formulation, a unit cost is incurred by the system for each time slot each packet waits in the queue, and the integral of the queue backlog over busy period can then be interpreted as the total cost accumulated over the busy period. In Fig. 3, we illustrate our terminology on a sample path of .
We now briefly point out some aspects of policy . See Fig. 4, which provides an illustration of . Under the policy, the first time slot of each empty period is used to update one of the sample means. To do this, the policy chooses one of the servers uniformly at random and schedules it. The amount of service that is offered from the chosen server is then observed and used to update the server’s corresponding sample mean. Note that the sample means are only updated during these time slots, and all other observations of the servers’ offered services, that are made during other time slots, are not used in our estimations. Now, following an empty period, when a packet arrives to the empty queue, the system enters a busy period. At the start of each busy period, the policy chooses to schedule the server with the highest sample mean (breaking ties between sample means according to any arbitrary decision rule) and continues to schedule that server until the queue finally empties and the busy period ends. Thus, during each busy period the queue is serviced by only one server (namely, the one with highest sample mean). We again stress that the policy does not use the observations of the service obtained during the busy period to update the sample mean. Notionally, the policy performs no exploration during busy periods and instead focuses on exploiting previous observations to empty the queue quickly.
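The policy just described can be sketched as follows. This is a minimal illustration under our own naming (`EmptyExploreScheduler`); the policy of Fig. 2 may differ in details such as tie-breaking. Sample means are updated only on the first slot of each empty period, and each busy period is served entirely by the server with the highest sample mean at its start.

```python
import random

class EmptyExploreScheduler:
    """Sketch of the analyzed policy: probe a uniformly chosen server on
    the first slot of each empty period (the only observations that
    enter the sample means), and commit each busy period to the server
    with the current highest sample mean."""

    def __init__(self, n_servers, seed=0):
        self.rng = random.Random(seed)
        self.n = n_servers
        self.sums = [0.0] * n_servers
        self.counts = [0] * n_servers
        self.committed = None   # server serving the current busy period
        self.explored = False   # probed during this empty period yet?

    def _best(self):
        means = [self.sums[i] / self.counts[i] if self.counts[i] else 0.0
                 for i in range(self.n)]
        return max(range(self.n), key=means.__getitem__)

    def select(self, q):
        """Return (server, update) for a slot with backlog q, where
        `update` flags whether the observation enters the sample mean."""
        if q == 0:
            self.committed = None
            if not self.explored:        # first slot of the empty period
                self.explored = True
                return self.rng.randrange(self.n), True
            return self._best(), False   # later empty slots: free choice
        self.explored = False            # next empty period probes anew
        if self.committed is None:       # first slot of a busy period
            self.committed = self._best()
        return self.committed, False     # no mid-busy-period switching

    def observe(self, server, offered, update):
        if update:
            self.sums[server] += offered
            self.counts[server] += 1
```

Note that exploration happens only while the backlog is zero, so it is free with respect to the delay objective, and the committed server never changes within a busy period, which is the property the sample-path arguments below rely on.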
Now, suppose we are at the beginning of busy period and policy chooses to schedule server for this busy period. Then, the duration of the busy period (i.e., the amount of time that will be required to empty the queue) is given by a random variable , whose mean we denote . Note that conditioned on server being scheduled during the busy period, is not a function of the busy period number . By Assumption 1, for all we have that and therefore is finite. Now, to establish the proof of Theorem 2, we will be interested in the integral of the queue backlog taken over the busy period. Note that, as with the busy period’s duration, the integral of the queue backlog is given by a random variable, which we denote . Using to denote the arbitrary start of the busy period,
and has mean . As with the busy period duration, variable is not a function of the busy period number .
We introduce one last piece of notation before moving to the proof of Theorem 2. At the start of each busy period, policy schedules the server with the highest sample mean to empty the queue (i.e., the server with the highest sample mean is scheduled to every time slot in the busy period until the queue empties). We will refer to this decision as scheduling the busy period to server . In the following analysis, we will be interested in the frequency with which the policy schedules each server to busy periods. Therefore, we will use to denote the number of busy periods that schedules to server in the first busy periods. We will use these variables to bound, in expectation, the number of busy periods that are scheduled to a server other than .
For example, if in Fig. 4 the first and fourth busy periods were scheduled to server but the second and third busy periods were scheduled to server , then and . Lemma 1 shows that we can upper bound the queue length regret with the queue backlog summed over times that is not scheduling . The proof follows from a sample path argument.
Consider a fixed outcome for the arrival and service processes and randomization in policy . Then, is the queue backlog of policy at time for this sample path, and is the backlog at time for a controller that schedules for all time slots since the start of time for this sample path. Now, assume at time policy is in a busy period where it is scheduling . Then, . The intuition behind this claim follows from the fact that the queue is empty right before the busy period begins and does not switch servers mid-busy period. Therefore, for the given sample path of arrivals and service processes, had the controller scheduled for all time slots since the start of time, the queue could not be less. Taking expectations and using the definition of queue length regret in (2) then gives the result. The complete proof of the lemma can be found in Appendix A.
Next, in Lemma 2, the expected value of is shown to be finite. Note that by our problem’s definition at most one packet can arrive to the queue at each time slot. Then, since a busy period must start with one packet in queue at its first time slot (i.e., the packet that arrives to the queue to begin the busy period), the maximum size of the queue backlog during the busy period cannot be larger than the busy period’s duration. Thus, the expected value of can be bounded by the expected value of , which is finite. The complete proof of the lemma can be found in Appendix B.
We next proceed to bound the expected value of (9) with Lemma 3. The lemma states that the integrated queue backlog (taken over busy periods scheduled to server ) up until time is less than the integrated queue backlog (taken over busy periods scheduled to server ) through the busy period. Observe that the right-hand side of (10) is the expected number of busy periods out of busy periods scheduled to server , , multiplied by the expected integrated queue backlog, . This quantity is the expected value of the integral of the queue backlog, over busy periods scheduled to , up to busy period .
The intuition behind Lemma 3 is simple. Since, by definition, each busy period must be at least one time slot in duration, we cannot have more than busy periods in time slots. (In fact, since empty periods must also be at least one time slot in duration, we will have far fewer than busy periods in any time slots.) Therefore, for any outcome of the system’s performance, the integrated queue backlog up to time slot will be less than the integrated queue backlog through the busy period. Taking expectations then gives the result. The complete proof of the lemma can be found in Appendix C.
Finally, in Lemma 4, we show that the expected number of times schedules server to busy periods is bounded by a finite constant which is independent of . In order to schedule server to a busy period, must estimate to be no less than . Since we obtain a new observation of one of the servers during each empty period , using Hoeffding’s inequality, we can show that the probability that there exists a server such that decays exponentially in . The result then follows from the convergence of geometric series. The proof of the lemma can be found in Appendix D.
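The Hoeffding-plus-geometric-series reasoning can be sketched numerically. In the snippet below, the gap between the best and second-best rates and the per-server sample count (roughly one new observation per empty period, spread uniformly over the servers) are illustrative assumptions rather than quantities from the paper.

```python
import math

def misselect_bound(k, n_servers, gap):
    """Hoeffding-style upper bound on the probability that some suboptimal
    server's sample mean is at least the best server's at busy period k,
    assuming roughly k / n_servers observations per server by then."""
    m = k // n_servers  # crude per-server sample count (illustrative)
    if m == 0:
        return 1.0
    # A mis-selection requires some sample mean to deviate by gap/2;
    # union bound over servers, then Hoeffding on each sample mean.
    return min(1.0, 2 * n_servers * math.exp(-2 * m * (gap / 2) ** 2))

# The bound decays geometrically in k, so its sum over all busy periods is
# finite -- which is what bounds the expected number of mis-scheduled periods.
total = sum(misselect_bound(k, n_servers=3, gap=0.2) for k in range(1, 20000))
```

The finiteness of `total` is exactly the convergent geometric series invoked at the end of the lemma's proof sketch.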
Given the above four lemmas, we are now ready to establish Theorem 2.
As a final note, it is important to observe that, by its statement, Theorem 2 does not imply that there exists a constant that bounds for all . It only states that for any chosen , under policy , the queue length regret converges to a finite number that is dependent on the chosen parameters.
IV-B Non-Stabilizing Servers
In this subsection, we relax Assumption 1 to allow for non-stabilizing servers. Thus, we will now allow some strict subset of the servers to be non-stabilizing. Importantly, we will assume that the controller is given a known convex combination of the servers’ rates that strictly dominates the arrival rate to the queue. Then, by randomizing over the servers according to this convex combination, the controller can always stabilize the system. Concretely, we make the following assumption.
For some given and ,
Observe that the set defined above is determined by the values of that are given to the controller, and different values will lead to different sets. We emphasize that Assumption 2 does not imply that the controller knows any of the values of , a priori. Rather, it states that the controller can be confident that the convex combination it is provided will meet the condition of (11).
For example, if then Assumption 2 reduces to
and this fact being known by the controller, a priori. Then, if at any given moment the controller wishes to return the queue to the empty state, it can simply resort to persistently scheduling server and waiting for the queue to empty. However, note that server may not be the optimal server , and, as a result, persistently scheduling server in every time slot will not in general give good queue length regret performance for every .
Generalizing the above example to arbitrary values, we see that Assumption 2 implies that, if at any point the controller decides to resort to a strategy of scheduling at each time slot server with probability , it can be guaranteed that the queue will eventually empty. In other words, the values of define a stationary, randomized policy that has a service rate that dominates the arrival rate to the queue and can, therefore, empty the queue infinitely often. However, since by the assumption’s statement the values are not required to have any special relationship to , the randomized policy will not generally minimize regret, and its use should not be overly relied upon.
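As a quick sanity check of this claim, the sketch below simulates a queue served by scheduling server i with probability q[i] in every slot. All rates and weights are hypothetical; they are chosen so that the convex combination of service rates (0.4) strictly dominates the arrival rate (0.3), so the queue keeps returning to the empty state.

```python
import random

def empty_slot_count(T=20000, lam=0.3, mu=(0.3, 0.5), q=(0.5, 0.5), seed=0):
    """Schedule server 0 w.p. q[0], else server 1, in every slot, and count
    the slots at which the queue is empty. Stability requires
    q[0]*mu[0] + q[1]*mu[1] > lam (here 0.4 > 0.3)."""
    rng = random.Random(seed)
    backlog, empties = 0, 0
    for _ in range(T):
        if backlog == 0:
            empties += 1
        i = 0 if rng.random() < q[0] else 1   # stationary randomized choice
        if backlog > 0 and rng.random() < mu[i]:
            backlog -= 1
        if rng.random() < lam:                # at most one arrival per slot
            backlog += 1
    return empties
```

Because the randomized policy's service rate strictly dominates the arrival rate, the empty state is visited a positive fraction of the time, so the queue empties infinitely often.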
Before continuing, we note that Assumption 2 is intentionally artificial. It assumes that a randomized policy is given to the controller and that the system will only encounter parameters for and such that the given policy can stabilize the queue. In most practical applications, this assumption is unrealistic. However, the results of this subsection will be highly instructive for the next subsection, where we relax Assumption 2 and no longer assume that a known stabilizing policy is given to the controller. Importantly, in the following, we will develop a method that can sparingly use a suboptimal policy to consistently guarantee future empty periods for the queue, without relying on the suboptimal policy so heavily as to lose the bounded regret result.
We now give the main theorem of this subsection.
Under Assumption 2, such that, for each , .
A policy that achieves Theorem 3 is given in Fig. 5 and is referred to as throughout this subsection. The policy, similar to the policy of Subsection IV-A, iterates over empty and busy periods. See Fig. 6 for an illustration of the policy. As with policy , in the first time slot of each empty period, the policy schedules one server uniformly at random, observes the resulting offered service, and uses the observation to update a sample mean that estimates the server’s service rate. Again as with policy , only uses the observations made during these time slots to update its sample means and does not use observations made during other time slots.
At the start of each busy period , policy schedules the server with the highest sample mean for the first time slots of the busy period or until the queue empties and the busy period ends (whichever occurs first). However, if the busy period lasts longer than time slots, the policy hits a time-out threshold and resorts to the randomized policy defined by Assumption 2 to bring the queue backlog back to the empty state. The time-out threshold grows linearly in the busy period number , and therefore, with each busy period, the controller becomes more reluctant to call upon the randomized policy. In the following analysis, we will see that this growing reluctance is important to establishing Theorem 3.
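The busy-period logic can be sketched as follows. This is a simplified, hypothetical stand-in: `serve_best` and `serve_randomized` are callables that advance the queue by one slot (serving the empirically best server and the Assumption 2 randomized policy, respectively), and `threshold_slope` plays the role of the unnamed linear time-out coefficient.

```python
def run_busy_period(k, serve_best, serve_randomized, threshold_slope=5):
    """Serve the empirically best server for up to threshold_slope * k slots;
    if the queue has not emptied by then, fall back to the randomized
    stabilizing policy until it does.
    Returns (hit_timeout, busy_period_duration)."""
    backlog, t = 1, 0                  # a busy period begins with one packet
    timeout = threshold_slope * k      # threshold grows linearly in k
    while backlog > 0:
        step = serve_best if t < timeout else serve_randomized
        backlog = step(backlog)
        t += 1
    return t > timeout, t
```

Because `timeout` grows with the busy period number k, later busy periods give the empirically best server progressively longer before the fallback is invoked.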
We now establish Theorem 3 by analyzing on an arbitrary . The proof is derived through three lemmas. We begin by introducing some new notation that will facilitate understanding.
First, we introduce notation that allows us to reference the period number that a time slot corresponds to. To this end, for each time , define to be the period number of the empty or busy period in which resides. Note that for a given sample path, maps time slots to their corresponding period number . For example, in Fig. 6, for , since these time slots occur during the first empty and busy periods, and for , since these time slots occur during the second empty and busy periods.
In the following analysis, we will be concerned with identifying the busy periods in which the policy does not exclusively schedule server over the entire busy period. Note that an ideal policy would schedule over all busy periods, and we therefore wish to identify the busy periods during which policy falls short of this ideal. To facilitate this analysis, for each busy period , we let be an indicator that takes the value if either a server not equal to is first scheduled by policy at the start of the busy period or the time-out threshold is hit. By hitting the time-out threshold, we mean that the queue did not empty during the first time slots of busy period . If neither of these two conditions is met, we let the indicator . Note that, by the definition of , if any server is scheduled during busy period , then must equal (see Fig. 5). This is because, by the definition of policy , a server can only be scheduled during a busy period if either the policy began the busy period by scheduling or it began the busy period scheduling server but still hit the time-out threshold, leading it to fall back on the stabilizing policy. Since our objective in minimizing regret is to mimic, as closely as possible, the policy that schedules for all time slots, we can notionally view equaling as policy failing to meet this ideal over busy period .
In the following, we will combine the above two notations as . Then, if is in a busy period, if a server not equal to is scheduled at any time during the busy period in which belongs. If is in an empty period, indicates whether during the busy period following this empty period, a server not equal to is ever scheduled. Then,
is the queue backlog integrated, up to time slot , over those busy periods for which the conditions of are met. Note that if is in an empty period, , and therefore the time slot contributes a value of to the above summation regardless of the value of . For example, for the sample path in Fig. 6, if scheduled for the start of busy period , then since these busy periods reached their time-out thresholds, , and . Likewise, if for the sample path, did not schedule for the start of busy period , then and .
Given this notation, we have the following lemma, which states that the regret is upper bounded in expectation by the sum of queue backlogs over those busy periods in which indicator . The proof is similar to that of Lemma 1 and can be found in Appendix E.
Now, to upper bound the right-hand side of Lemma 5, we will want to bound the probability that indicator . Recall that can equal if one of the following events occurs: the policy begins the busy period by choosing a server other than to be scheduled for the first time slots, or it crosses the time-out threshold. In the following lemma, we establish that the probability of either event occurring diminishes exponentially in . To understand the intuition behind this result, first note that at each busy period , the probability that there exists a server with a decreases exponentially with . This follows from Hoeffding’s inequality and the fact that the policy updates one of the sample means with a new observation at each empty period. Thus, the probability that, at busy period , is not chosen to be scheduled over the first time slots is also exponentially decreasing with . Note that this argument is very similar to the one made in the proof of Lemma 4. Now suppose that is scheduled at the start of busy period . Then, since , we can show that the probability that the queue does not empty before reaching the time-out threshold must also decrease exponentially in . We now state the lemma, whose proof can be found in Appendix F.
There exist positive constants , , and such that for all ,
To complete our upper bound on the right-hand side of Lemma 5, we will need to bound the cost of indicator equaling . To this end, let variable denote the integral of the queue backlog taken over busy period . For example, for the sample path shown in Fig. 6, and . In Lemma 7, we show that grows at most quadratically with . This is because the queue backlog can grow at most linearly over the first time slots of the busy period. Thus, if the time-out threshold is hit, the amount of time that the randomized policy will need to empty the queue can also be at most on the order of . Since the maximum queue backlog at any time slot in a given busy period cannot be greater than the busy period’s duration, the expected integrated queue backlog will be at most on the order of .
There exist positive constants and such that for all ,
The proof of the lemma can be found in Appendix G.
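The quadratic-growth argument above can be written as a short chain of inequalities. The constants $c_1$ and $c_2$ below are hypothetical stand-ins for the paper's unnamed constants, and the geometric-tail remark is the standard condition under which the second moment scales like the square of the mean.

```latex
% Let D_k be the duration of busy period k and c_1 k its time-out threshold.
% Since at most one packet arrives per slot, the backlog obeys Q(t) <= c_1 k
% over the first c_1 k slots of the busy period. If the time-out is hit, the
% randomized policy drains the queue with a strictly positive expected drift
% and geometrically decaying overshoot tails, so the residual duration is
% O(k) in expectation and E[D_k^2] = O(k^2). Since the backlog never exceeds
% the busy period's duration,
%     \sum_{t \in \text{busy period } k} Q(t) \;\le\; D_k^2 ,
% and hence
%     \mathbb{E}\Big[\textstyle\sum_{t \in \text{busy period } k} Q(t)\Big]
%     \;\le\; \mathbb{E}\big[D_k^2\big] \;=\; O(k^2).
```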
Before continuing to the proof of the theorem, we observe that , since implies that the time-out threshold was not reached. Thus, we see that for all , is finite (implying the expected duration of each busy period is finite under policy ). We are now ready to establish the theorem.
Beginning with Lemma 5, we have that the regret is bounded by the integrated queue backlog over those busy periods such that the conditions of are met, i.e.,
Now, no more than busy periods can occur in time slots, since each busy period, by definition, must be at least one time slot long. In fact, as was noted in the previous subsection, there must be far fewer than busy periods in time slots, since the empty periods in between the busy periods must also be at least one time slot long. Thus, we can bound the right-hand side of (12), which is the integral of the queue backlogs over busy periods with up to time , with the integral of the queue backlogs over busy periods with through the busy period. Formally,
Now, recall that is an indicator. When , . Thus,
which follows from well-known results for series. Thus, we see that , which gives the result. ∎
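The series fact invoked here is the convergence of the quadratic-geometric series: combining the exponential bound of Lemma 6 with the quadratic bound of Lemma 7 leaves a sum of the form below, which is finite for any ratio strictly less than one. The constants in the snippet are illustrative.

```python
def quadratic_geometric_sum(r=0.9, M=3.0, terms=2000):
    """Partial sum of M * k^2 * r^k for k = 1..terms-1. For 0 < r < 1 this
    converges, which is the series fact used to conclude bounded regret."""
    return sum(M * k * k * r ** k for k in range(1, terms))

# Closed form of the full series: M * r * (1 + r) / (1 - r)^3.
closed_form = 3.0 * 0.9 * (1 + 0.9) / (1 - 0.9) ** 3
```

The partial sum agrees with the closed form, confirming that the bound on the regret converges to a finite constant.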
IV-C Learning Stabilizing Policies
We now relax Assumption 2 and no longer assume that a randomized policy is given to the controller that can be relied upon to stabilize the queue. As a result, the controller will have to learn how to take actions so as to empty the queue infinitely often. Once the controller can accomplish this, it may use the empty periods to identify the best server. To this end, we only make the following assumption about the parameters and .
We then have the following theorem, which is analogous to the previous subsections. The rest of this subsection will be devoted to proving its statement.
Under Assumption 3, such that, for each , .
To prove the theorem, we will analyze policy on an arbitrary . Policy is shown in Fig. 7. The policy is very similar to , except that now, when the time-out threshold of a busy period is reached, rather than relying on the known randomized policy given by Assumption 2, uses the method of Fig. 8 to bring the queue back to the empty state. Starting from the time-out threshold’s crossing, the method is run until the queue finally empties and the busy period ends. As can be seen in Fig. 8, we use time slot counter to indicate the number of time slots since the method’s invocation (i.e., is the time slot number normalized to the time-out threshold’s crossing).
In its execution, the method in Fig. 8 dedicates certain time slots to exploration. During these dedicated exploration time slots, the policy chooses from the servers uniformly at random, observes the offered service of the chosen server, and updates the corresponding sample mean using the observation. Note that the sample means are new variables that are used by the method during its execution at busy period and are completely separate from the sample means in Fig. 7. The sample means are initialized to zero at the start of the method, when the time-out threshold is crossed, and do not make use of any of the previous observations that form sample means . Likewise, sample means are not updated with any of the observations made during the method’s dedicated exploration time slots. Finally, it is worth stressing that each call to the method of Fig. 8, at a new busy period , initializes new sample means that are used only during this execution and do not share observations with previous executions of the method. Thus, the variables are estimates formed only from observations made during the exploration time slots in busy period . This property will be important to our later analysis.
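A sketch of this bookkeeping is below. Everything here is a hypothetical stand-in: `mu` are the unknown service rates, `lam` the arrival rate, and `explore_slots` the fixed relative positions of the dedicated exploration slots; only the separation of the fresh per-invocation sample means from the policy's main estimates reflects the text.

```python
import random

def recovery_method(backlog, mu, lam, explore_slots, rng, max_slots=100000):
    """Drain the queue after a time-out. Fresh sample means, initialized to
    zero and fed only by the dedicated exploration slots, are kept separate
    from the policy's main estimates and discarded when the method ends."""
    n = len(mu)
    counts, means = [0] * n, [0.0] * n    # fresh per-invocation estimates
    s = 0                                 # slots since the method's invocation
    while backlog > 0 and s < max_slots:
        if s in explore_slots:            # dedicated exploration slot
            i = rng.randrange(n)          # uniformly random server
            obs = rng.random() < mu[i]
            counts[i] += 1
            means[i] += (obs - means[i]) / counts[i]  # running sample mean
            served = obs
        else:                             # exploit the fresh estimates
            i = max(range(n), key=lambda j: means[j])
            served = rng.random() < mu[i]
        if served:
            backlog -= 1
        if rng.random() < lam:            # at most one arrival per slot
            backlog += 1
        s += 1
    return backlog, means
```

Each invocation constructs `counts` and `means` anew, so no observations are shared across invocations or with the policy's main sample means, mirroring the independence property stressed above.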
Now, examining Fig. 8, one may note that the exact locations of the dedicated exploration time slots are not specified. In fact, we allow an implementer of the method to choose their exact locations, subject to constraints that we now specify. We assume the locations are fixed and specified relative to the start of the method’s invocation. We also require that the same relative locations be used for all busy periods. Thus, the implementer must choose for which values of in Fig. 8 dedicated explorations take place and must use the same choice at each busy period in which the method is invoked. To give an example, one choice could be to have dedicated explorations occur at times for (i.e.,