On Learning the cμ Rule: Single and Multiserver Settings

02/02/2018 ∙ by Subhashini Krishnasamy, et al. ∙ The University of Texas at Austin, Stanford University

We consider learning-based variants of the cμ rule -- a classic and well-studied scheduling policy -- in single- and multi-server settings for multi-class queueing systems. In the single-server setting, the cμ rule is known to minimize the expected holding cost (weighted queue lengths summed over both classes and time). We focus on the setting where the service rates μ are unknown, and are interested in the holding-cost regret -- the difference in expected holding cost between a learning-based rule (which must learn μ) and the cμ rule (which knows the service rates) over any fixed time horizon. We first show that empirically learning the service rates and then scheduling using these learned values results in a holding-cost regret that does not depend on the time horizon. The key insight that allows such a constant regret bound is that a work-conserving scheduling policy in this setting allows explore-free learning, where no penalty is incurred for exploring and learning server rates. We next consider the multi-server setting. We show that in general the cμ rule is not stabilizing: there are stabilizable arrival and service rate parameters for which the multi-server cμ rule results in unstable queues. We then characterize sufficient conditions for stability (and also concentration bounds on busy periods). Using these results, we show that learning-based variants of the cμ rule again result in constant regret (i.e., regret that does not depend on the time horizon). This result hinges on (i) the busy-period concentrations of the multi-server cμ rule, and (ii) the fact that our learning-based rule is designed to dynamically explore server rates, but in such a manner that it eventually satisfies an explore-free condition.







1 Introduction.

We consider a canonical scheduling problem in a discrete-time, multi-class, multi-server parallel server queueing system. In particular, we consider a system with distinct queues (indexed by i) and distinct servers (indexed by j). Each queue corresponds to a different class of arrivals; arrivals to queue i are Bernoulli(λ_i), i.i.d. across time. Service rates are heterogeneous across every pair of queue and server (i.e., a "link"). At each time step, a central scheduler may match at most one queue to each server. Services are also Bernoulli, with success probability μ_ij on link (i, j); thus jobs may fail to be served when matched, and in this case the policy is allowed to choose a different server for the same job in subsequent time step(s). Jobs in queue i incur a holding cost c_i per time step spent waiting for service. Letting Q_i(t) denote the queue length of queue i at time t, the performance measure of interest up to time T is the cumulative expected holding cost incurred up to time T:

J(T) = E[ Σ_{t=1}^{T} Σ_i c_i Q_i(t) ].

(All our analysis extends to the case where the objective of interest is a time-discounted cost, i.e., where the t'th term is scaled by β^t, where the discount factor satisfies 0 < β < 1.)

Our emphasis in this paper is on solving this problem when the link service rates are a priori unknown; the scheduler only learns the link service rates by matching queues to servers, and observing the outcomes. We use as our benchmark the cμ rule for scheduling, when link service rates are known. The cμ rule operates as follows: at each time step, each link from a nonempty queue i to server j is given a weight c_i μ_ij; all other links are given weight zero. The scheduler then chooses a maximum weight matching on the resulting graph as the schedule for that time step. It is well known that when there is only a single server, this rule delivers the optimal expected holding cost among all feasible scheduling policies. Further, there has been extensive analysis of the performance and optimality properties of this rule even in multiple-server settings. (See related work below.)
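To make the max-weight scheduling step concrete, here is a minimal Python sketch of one cμ scheduling decision for a small parallel-server system. The function name, the brute-force search, and the example numbers are our own illustration (not the paper's implementation); the weights are c_i μ_ij for nonempty queues, with at most one server per job.

```python
from itertools import product

def cmu_schedule(Q, c, mu):
    """One time step of the c-mu rule: choose a max-weight assignment.

    Q  : queue lengths, Q[i] = jobs waiting in queue i
    c  : holding costs, c[i] = cost per job per slot in queue i
    mu : service rates, mu[i][j] = success prob. of queue i at server j
    Returns a list `a` with a[j] = queue matched to server j (or None).
    Brute force over all server -> queue maps; fine for small systems.
    """
    n_q, n_s = len(Q), len(mu[0])
    best_w, best_a = -1.0, [None] * n_s
    for a in product([None] + list(range(n_q)), repeat=n_s):
        # at most one server per job: servers on queue i cannot exceed Q[i]
        if any(a.count(i) > Q[i] for i in range(n_q)):
            continue
        # links from nonempty queues get weight c_i * mu_ij
        w = sum(c[a[j]] * mu[a[j]][j] for j in range(n_s) if a[j] is not None)
        if w > best_w:
            best_w, best_a = w, list(a)
    return best_a
```

For instance, with Q = [3, 1], c = [2, 1], and mu = [[0.5, 0.3], [0.9, 0.8]], the link weights are (1.0, 0.6) for queue 0 and (0.9, 0.8) for queue 1, so the max-weight choice sends server 0 to queue 0 and server 1 to queue 1.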

When service rates are unknown, we measure the performance of any policy using its (expected) regret at time T: this is the expected difference between the cumulative cost of the policy and the cumulative cost of the cμ rule. Our goal is to characterize policies that minimize regret. In typical learning problems such as the stochastic multi-armed bandit (MAB) problem, optimal policies must resolve an exploration-exploitation tradeoff. In particular, in order to minimize regret, the policy must invest effort to learn about unknown actions, some of which may later prove to be suboptimal—and thus incur regret in the process. In such settings, any optimal policy incurs regret that increases without bound as T grows; for example, for the standard MAB problem, it is well known that optimal regret scales logarithmically in T [15, 3, 1].

In this paper, we show a striking result: in a wide range of settings, the empirical cμ rule—i.e., the cμ rule applied using the current estimates of the mean service rates—is regret optimal, and further, the resulting optimal regret is bounded by a constant independent of T. Thus, in such settings there is no tradeoff between exploration and exploitation. The scheduler can simply execute the optimal schedule given its current best estimate of the service rates of the links. In other words, the empirical cμ rule benefits from free exploration.

We make three main contributions: (1) regret analysis of the empirical cμ rule in the single-server setting; (2) stability analysis of the cμ rule in the multi-server setting; and (3) subsequent regret analysis of the empirical cμ rule in the multi-server setting. We summarize these contributions below.

  1. Learning in the single-server setting. We begin our analysis by focusing on the single-server setting, where the cμ rule is known to be optimal on any finite time horizon. This setting admits a particularly elegant analysis, due to the following two observations. First, the empirical cμ rule is work-conserving, as is the benchmark cμ rule with known service rates. Second, all work-conserving scheduling policies have the property that they induce the same busy period distribution on the queueing system. Using this observation, we can couple the empirical cμ rule to the cμ rule with known service rates, and divide our analysis into epochs defined by busy periods. At the end of any busy period, all queue lengths are identical in both systems: namely, zero. We show that after a sufficiently large number of busy periods have elapsed, with high probability the empirical cμ rule has sufficient knowledge of each arm that it exactly matches the cμ rule going forward. Finally, we use the fact that any work-conserving policy induces a queue-length process that is geometrically ergodic to show that the expected regret is bounded by a constant.

  2. An interlude: Stability in the multi-server setting. Next, we turn our attention to regret analysis in the setting of multiple servers. Here, however, we face a challenge: in contrast to the single-server setting, where the cμ rule is known to be optimal, with multiple servers the cμ rule may not even be stabilizing, despite the availability of sufficient service capacity. Further, somewhat surprisingly, there are no known general results in the literature on stability of the parallel-server cμ rule. In order to carry out regret analysis, of course, we require such conditions; therefore we develop them for our analysis. These results are of independent interest.

    We provide three results on stability. First, we construct a class of examples demonstrating that the cμ rule need not be stabilizing. Second, we develop a general condition for stability of the cμ rule on a particular class of queueing networks, where the cμ rule takes the form of a hierarchical static priority rule. Informally, these are networks where the configuration of service rates and costs is such that a priority structure among the queues can be embedded in a hierarchical graph. In particular, we show for these systems that stability is equivalent to geometric ergodicity of the resulting queue-length process. This condition is not stated directly over model primitives; thus, in our third result, we provide a stronger sufficient condition for geometric ergodicity of the cμ rule that can be checked directly on model primitives for a generic scheduling problem. We exhibit a number of network configurations for which this condition holds.

  3. Learning in the multi-server setting. Having determined a sufficient condition for stability, we turn our attention to learning in the multi-server setting. We show that for problem instances where the cμ rule with known service rates yields a geometrically ergodic queue-length process, the empirical cμ rule yields a difference in queue lengths with the benchmark that decays at least polynomially with time. As in the single-server setting, this again results in constant regret, following two insights that parallel our analysis of the single-server setting: first, that the system eventually reaches a state of "free exploration"; and second, that the tails of the busy period can be shown to be sufficiently light.

1.1 Related work.

Many variants of the dynamic stochastic scheduling problem, for both discrete- and continuous-time queueing networks, have long been studied [21, 17]. Conventionally, the problem has been studied in the Markov decision process framework, where it is assumed that the service rates are known a priori, and the proposed solution is usually an index-type policy that schedules non-empty queues according to a static priority order based on the mean service times and holding costs. The simplest variant of the problem is that of a multi-class single-server system, for which the cμ rule has been shown to be optimal in different settings [7, 6, 12]. Klimov [14] extended the cμ rule to multi-class single-server systems with Bernoulli feedback. Van Mieghem [23] studies the case of convex costs for a single-server queue and proves the asymptotic optimality of the generalized cμ rule in heavy traffic. Ansell et al. [2] develop Whittle's index rule for a single-server system with convex holding costs.

The works in [11, 5] study a simple parallel server model—the N-network, which is a two-queue, two-server model with one flexible and one dedicated server—and propose policies that achieve asymptotic optimality in heavy traffic. Glazebrook and Mora [9] consider the parallel server system with multiple homogeneous servers and propose an index rule that is optimal as arrival rates approach system capacity. Lott and Teneketzis [16] also study the parallel server system with multiple homogeneous servers and derive sufficient conditions that guarantee the optimality of an index policy. Mandelbaum and Stolyar [18] study the continuous-time parallel server system with non-homogeneous servers and convex costs, and prove the asymptotic optimality of the generalized cμ rule in heavy traffic. Among the above papers, only [7, 6, 12, 16] consider the holding cost across a finite horizon; the rest take as their objective the infinite-horizon discounted and/or average cost. Our work provides results for both the finite-horizon discounted cost and finite-horizon total cost problems.

Another framework in which the problem can be studied is the stochastic multi-armed bandit problem, where the aim is to minimize the regret in finite time. Traditional work in the space of MAB problems focuses on the exploration-exploitation tradeoff and investigates various exploration strategies to achieve optimal regret [15, 3, 1]. More recently, exploration-free or greedy algorithms have been studied and shown to be effective in a few contexts. [19] studies the linear bandit problem in the Bayesian setting and shows asymptotic optimality of a greedy algorithm with respect to the known prior. For a variant of the linear contextual bandits, Bastani et al. [4] propose to reduce exploration by dynamically deciding to incorporate exploration only when it is clear that the greedy strategy is performing poorly. For a slightly different variant of the linear contextual bandits, Kannan et al. [13] show that perturbing the context randomly and dynamically can give non-trivial regret bounds for the greedy algorithm with some initial training. In a similar vein, our work proposes to reduce exploration through a conditional-exploration strategy. We show that this policy eventually transforms into a purely greedy strategy because the system naturally provides free exploration.

1.2 Organization of the paper.

We describe the queueing model and the main objective of this work in Section 2. In Section 3, we present the analysis for the single-server system. In Section 4, we show that the stability region of the cμ rule is a strict subset of the capacity region, and give sufficient conditions for geometric ergodicity under the cμ rule. In Section 5, we extend the analysis presented in Section 3 to parallel server systems and show constant-order regret when the system is geometrically ergodic under the cμ rule. Appendix 6 is devoted to the study of a special class of scheduling rules called hierarchical cμ rules, for which we exhibit a recursive procedure that verifies geometric ergodicity from the system parameters. The more technical proofs are organized in Appendices 7-11.

2 Problem Setting.

We describe the model, the objective, and the cμ rule.

2.1 Parallel server system with linear costs.

Consider a discrete-time parallel server system with multiple queues (indexed by i) and servers (indexed by j). Jobs arrive to queue i according to a Bernoulli process with rate λ_i, independent of all other events. Denote the joint arrival process by A(t). At any time, a server can be assigned only to a single job and vice versa; however, multiple servers are allowed to be assigned to different jobs in the same queue. For convenience of exposition, we assume that jobs are assigned according to FCFS. At any time, the probability that a job from queue i assigned to server j is successfully served is μ_ij, independent of all other events. We denote the joint service rate matrix by μ. Jobs that are not successfully served remain in the queue and can be reassigned to any server in subsequent time-slots. The queues have infinite capacity, and c_i denotes the waiting cost per job per time-slot for queue i. For this system, a scheduling rule is defined as one that decides, at the beginning of every time-slot, the assignment of servers to queues. It is assumed that

  1. the outcome of an assignment is not known in advance, i.e., in any time-slot, whether or not a scheduled job is served successfully can be observed only at the end of the time-slot;

  2. the waiting cost per job is known for all the queues.

We study the learning variant of the problem, and therefore make the additional assumption that

  1. the arrival rates and success probabilities are unknown.

For T time-slots, the expected total waiting cost in finite time is given by

J(T) = E[ Σ_{t=1}^{T} Σ_i c_i Q_i(t) ].    (1)

Here Q_i(t) is the queue-length of queue i at the beginning of time-slot t, with the evolution dynamics given by the equation

Q(t+1) = Q(t) + A(t) − S(t),

where A(t) and S(t) are the arrival vector and allocated service vector respectively. In (1), the instantaneous waiting cost is a linear function of the queue-lengths. [Stability] For a Markov policy π, i.e., a scheduling rule that makes decisions in every time-slot based on the current queue-state, the system is said to be stable under π if the chain {Q(t)} is positive recurrent and the expected total queue-length is finite under its invariant distribution.

For a given service rate (success probability) matrix μ and a Markov policy π, let the stability region Λ_π(μ) be the set of all arrival rate vectors λ for which the system is stable under π. The capacity region Λ(μ) of the parallel server system with service rate matrix μ is given by the union of the stability regions over all Markov policies. The capacity region can be characterized by the class of static-split scheduling policies, i.e., policies that randomize the assignment of each server across queues according to a fixed right stochastic matrix.

2.2 The cμ rule.

In this paper, we focus on the cμ rule with linear costs for the parallel server system. This rule (see Algorithm 1), which is a straightforward generalization of the single-server cμ rule, allocates servers to jobs based on a priority rule determined by the product of the waiting cost and the success probability.

At time t:
Solve the following max weight optimization problem:
    maximize Σ_{i,j} c_i μ_ij y_ij
    subject to Σ_i y_ij ≤ 1 for every server j, Σ_j y_ij ≤ Q_i(t) for every queue i, y_ij ∈ {0, 1}.
Assign a job from queue i to server j if and only if y_ij = 1 in the resulting solution.
Algorithm 1 The cμ Algorithm with costs c and rates μ

For a single-server system, and when the success probabilities for all the links are known a priori, it has been established that the cμ rule optimizes the expected total waiting cost over a finite time horizon [6]. For a parallel server system, there are no known algorithms that achieve optimal cost as in the single-server case. For waiting costs that are strictly convex in the queue-lengths, Mandelbaum and Stolyar [18] prove that, in heavy traffic, the generalized cμ rule asymptotically optimizes the instantaneous waiting cost.

In order for the cμ rule to be unambiguously defined, we impose the assumption that ties do not occur, i.e., that the max weight optimization in Algorithm 1 has a unique solution.
Our interest lies in designing scheduling algorithms that can mimic the cμ rule in the absence of channel statistics. We evaluate an algorithm based on a finite-time performance measure called regret. Conventionally, in the bandit literature, regret measures the difference in the performance objective between an adaptive algorithm and a genie algorithm that has a priori knowledge of the system parameters. For our problem, the genie algorithm applies the cμ rule at every step, using the true service matrix μ. Therefore, regret here is defined as the difference between the total waiting costs (given by equation (1)) under the proposed algorithm and under the cμ algorithm. For any given parameter set with λ in the capacity region Λ(μ), we study the asymptotic behavior of regret as the time horizon T tends to infinity.

3 Learning the cμ Rule—Single Server System.

We first consider the single-server system in order to highlight a few key aspects of the problem. We later extend our discussion and results to the parallel-server case in Section 5. For the single-server system, we propose a natural 'learning' extension of the cμ algorithm, which we refer to as the empirical cμ algorithm, or empirical cμ rule. This scheduling algorithm applies the cμ rule using empirical means of past service observations as a surrogate for the actual success probabilities. Let us denote the queue-lengths under the cμ and empirical cμ rules by Q(t) and Q̂(t) respectively. Further, we denote the regret of the empirical cμ algorithm by

Ψ(T) = Ĵ(T) − J(T),

where Ĵ(T) and J(T) are the respective expected total waiting costs.

We show that the queue-length error for the empirical cμ algorithm decays geometrically with time. It then follows that the regret scales as a constant with increasing T. It is interesting to observe that this scaling is achieved merely by using the empirical means in every time-slot, without an explicit explore strategy. Our results show that this scheduling policy delivers free exploration due to some unique properties of the single-server system, as we describe further below. For single-server systems, the capacity region takes the simple form Σ_i λ_i / μ_i < 1.

For any arrival rate vector in the capacity region, there exist constants B and δ > 0 such that the expected queue-length difference between the empirical cμ and cμ systems at time t is at most B e^{−δ t}, for any t. In particular, there exists a constant M, independent of T, such that the regret is at most M.

Before proving the result, we briefly outline the intuition. The result relies on the following key observation.

Observation 1

The distribution of busy cycles is the same for all work conserving scheduling policies in a single server system.

This can be confirmed by considering a stochastically equivalent system where, for each queue i, jobs arrive with i.i.d. inter-arrival times (geometrically distributed with parameter λ_i) and i.i.d. service times (geometrically distributed with parameter μ_i). In such a system, a scheduling algorithm only decides which part of the outstanding work is completed in each time-slot; therefore, all work-conserving algorithms give the same busy cycles.
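The equivalent-workload construction can be checked numerically. The sketch below (our own illustration, with made-up rates) pre-draws geometric service times for each arriving job, which is stochastically equivalent to Bernoulli service, and runs two different work-conserving priority policies on the same arrival and work sequences; the set of busy slots comes out identical.

```python
import random

def geometric(rng, p):
    """Sample a Geometric(p) service time on {1, 2, ...}."""
    s = 1
    while rng.random() >= p:
        s += 1
    return s

def busy_slots(priority, arrivals, works):
    """Run a single-server, two-queue system slot by slot.

    arrivals[t] : queue indices receiving an arrival in slot t
    works[i]    : pre-drawn service times for successive jobs of queue i
    priority    : function(jobs) -> index of the queue to serve
    Returns the set of slots in which the server was busy.
    """
    jobs = {0: [], 1: []}          # remaining work of queued jobs (FCFS)
    nxt = {0: 0, 1: 0}             # next pre-drawn service time to use
    busy = set()
    for t, arr in enumerate(arrivals):
        for i in arr:
            jobs[i].append(works[i][nxt[i]])
            nxt[i] += 1
        if jobs[0] or jobs[1]:     # work conserving: serve whenever nonempty
            busy.add(t)
            i = priority(jobs)
            jobs[i][0] -= 1        # exactly one unit of work per busy slot
            if jobs[i][0] == 0:
                jobs[i].pop(0)
    return busy

rng = random.Random(1)
T, lam, mu = 2000, [0.2, 0.2], [0.6, 0.5]
arrivals = [[i for i in (0, 1) if rng.random() < lam[i]] for _ in range(T)]
works = {i: [geometric(rng, mu[i]) for _ in range(T)] for i in (0, 1)}

serve_0_first = lambda jobs: 0 if jobs[0] else 1
serve_1_first = lambda jobs: 1 if jobs[1] else 0
# Two different work-conserving policies, identical busy slots:
same = busy_slots(serve_0_first, arrivals, works) == \
       busy_slots(serve_1_first, arrivals, works)
print(same)  # True: each busy slot completes one unit of total work
```

The equality is deterministic given the pre-drawn arrivals and service times: any work-conserving policy completes exactly one unit of total work per busy slot, so the total-work process (and hence the set of busy slots) is policy-independent.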

We now see how Observation 1 can be used to prove the result above. The observation implies that the cμ and empirical cμ systems have the same queue length (equal to zero) at the end of their common busy cycles. In order for the priority order estimated by the empirical cμ algorithm to agree with the cμ rule, it needs a sufficient number of samples for all the links. Since the number of samples for each queue at the end of a busy cycle equals the total work (in terms of service time) that has arrived to the queue, it suffices to consider the end of a busy cycle by which the system has seen sufficiently many arrivals to every queue. Thus, every work-conserving policy has the same number of samples for each of the links at the end of a busy cycle. Finally, we exploit the fact that busy periods have geometrically decaying tails to show that, as a consequence, the empirical cμ algorithm makes the same scheduling decision as the cμ rule after a random time that has finite expectation. This argument is a clear example of free exploration, since there is no need to incorporate an explicit exploration strategy into the scheduling algorithm as long as it is work-conserving.


Proof of the above result. The crux of the proof lies in characterizing the random time after which the empirical cμ algorithm makes the same scheduling decision as the cμ rule in all future time-slots. In any time-slot t, the empirical cμ algorithm makes the same scheduling decision as the cμ rule if (i) the two systems hold the same queue-lengths, and (ii) the estimated priority order agrees with the cμ rule at time t.

Our argument crucially relies on Observation 1. We start by noting that the queue-length process under any work-conserving algorithm is geometrically ergodic and that busy cycle lengths have geometrically decaying tails. Specifically, there exist constants C and θ > 0 such that the first hitting time of the empty state, denoted by τ, satisfies

P(τ > t) ≤ C e^{−θ t}.    (3)

To formalize the intuition in the paragraph preceding the proof, consider time t/2, and let σ(t) be the end of the busy period that contains t/2. Then from Eq. (3), using Markov's inequality, we have

P(σ(t) > t) ≤ C' e^{−θ' t},    (4)

where the second inequality follows by the definition of σ(t). Now, let μ̂_i(s) be the average number of successes in the first s assignments of the server to queue i. Consider the following two events:

  1. σ(t) ≤ t, i.e., the busy period containing t/2 ends by time t;

  2. at time σ(t), the empirical estimates μ̂_i preserve the true cμ priority order.

Then, conditioned on both events, the empirical cμ algorithm agrees with the cμ rule after σ(t), and therefore its queue-length equals that of the cμ system after σ(t). It is easy to show, using the Chernoff–Hoeffding bound for Bernoulli random variables, that the second event fails with probability at most

C'' e^{−θ'' t}    (5)

for some C'' and θ'' > 0. Using bounds (4) and (5), for any norm, the expected queue-length error at time t is at most B e^{−δ t} for some B and δ > 0. This also shows that the regret scales as O(1) with T.
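The Chernoff–Hoeffding step can be illustrated numerically: with n Bernoulli samples per link, the probability that the empirical cμ priority order disagrees with the true order decays exponentially in n. The sketch below is our own toy two-queue instance; the rates, costs, and the 4·exp(−n·gap²/2) union bound are illustrative assumptions, not quantities from the paper.

```python
import math, random

def misorder_prob(n, mu, c, trials=4000, seed=0):
    """Monte Carlo estimate of P(empirical c-mu order != true order)
    when each service rate is estimated from n Bernoulli samples."""
    rng = random.Random(seed)
    true_order = sorted(range(len(mu)), key=lambda i: -c[i] * mu[i])
    bad = 0
    for _ in range(trials):
        # empirical mean of n Bernoulli(mu_i) samples for each link
        est = [sum(rng.random() < m for _ in range(n)) / n for m in mu]
        if sorted(range(len(mu)), key=lambda i: -c[i] * est[i]) != true_order:
            bad += 1
    return bad / trials

mu, c = [0.7, 0.4], [1.0, 1.0]
gap = c[0] * mu[0] - c[1] * mu[1]            # = 0.3
# Hoeffding union bound: misordering requires some estimate to be off by
# at least gap/2, so P(misorder) <= 4 * exp(-n * gap**2 / 2).
for n in (10, 50, 200):
    print(n, misorder_prob(n, mu, c), 4 * math.exp(-n * gap ** 2 / 2))
```

By n = 200 samples per link, both the Monte Carlo estimate and the bound are negligible, which is the sense in which a linear-in-t supply of samples yields the exponential bound (5).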

Note that an O(1) scaling with T also holds for the regret with discounted cost, for any discount factor.

4 Stability of the cμ Rule for Parallel Server Systems.

As for the single-server system, we are interested in upper bounds on regret for the parallel server system. In the single-server proof, we crucially used the property that busy cycles are identically distributed across work-conserving policies. Note that, in that case, the stability region of the cμ rule (or any work-conserving policy) is the entire capacity region, and the busy cycles have exponentially decaying tails for any arrival rate in this region.

In this section, we show for the parallel server system that the cμ rule (which is based on linear costs) does not necessarily ensure stability for all arrival rates in the capacity region. In particular, it is not throughput optimal for a general parallel server system. In Subsection 4.2, we characterize a subset of the stability region of the cμ rule for which the busy cycles have exponentially decaying tails.

4.1 Instability of the cμ rule in the general case.

As defined in Algorithm 1, the cμ rule allocates server j to a job in the queue that maximizes c_i μ_ij. We show that such a static priority policy, which prioritizes queues irrespective of their queue-lengths (other than their being non-empty), can be detrimental to the stability of the system. For example, in a two-queue, two-server system in which queue 1 has the larger cμ-index at both servers, the cμ rule prioritizes queue 1 over queue 2 for the allocation of both servers, which results in service being allocated to queue 2 only when queue 1 has fewer than two jobs. It is intuitively clear that such a policy is not stabilizing if the arrival rate of queue 2 is larger than the service rate that this policy can allocate to it. We formalize this in the theorem below, where we characterize a set of arrival rates outside the stability region of the cμ rule for a class of systems.
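This instability can be observed in simulation. The sketch below is our own toy instance (the rates, costs, and arrival rates are assumptions, chosen so that dedicating server 0 to queue 0 and server 1 to queue 1 would stabilize the system): since c_0 μ_0j > c_1 μ_1j at both servers, queue 0 monopolizes both servers under the cμ rule and queue 1 grows.

```python
import random
from itertools import product

def cmu_assign(Q, c, mu):
    """Max-weight (c-mu) assignment, brute force over server->queue maps."""
    best_w, best_a = -1.0, (None, None)
    for a in product((None, 0, 1), repeat=2):
        if any(a.count(i) > Q[i] for i in (0, 1)):
            continue
        w = sum(c[a[j]] * mu[a[j]][j] for j in (0, 1) if a[j] is not None)
        if w > best_w:
            best_w, best_a = w, a
    return best_a

def simulate(T, lam, mu, c, seed=0):
    """Run the c-mu rule for T slots; return the final queue lengths."""
    rng = random.Random(seed)
    Q = [0, 0]
    for _ in range(T):
        for i in (0, 1):                       # Bernoulli arrivals
            if rng.random() < lam[i]:
                Q[i] += 1
        a = cmu_assign(Q, c, mu)               # static c-mu priorities
        for j in (0, 1):                       # Bernoulli services
            if a[j] is not None and rng.random() < mu[a[j]][j]:
                Q[a[j]] -= 1
    return Q

# Queue 0 has the larger c*mu index at BOTH servers (5.0, 1.0 vs 0.1, 0.9),
# so the c-mu rule gives it both servers whenever it holds >= 2 jobs,
# even though server 1 serves queue 1 far more efficiently.
c, mu = [10.0, 1.0], [[0.5, 0.1], [0.1, 0.9]]
lam = [0.45, 0.8]   # stabilizable: 0.45 < 0.5 (server 0), 0.8 < 0.9 (server 1)
print(simulate(50000, lam, mu, c, seed=3))
```

A static split that dedicates each server to "its" queue keeps both queues stable, while under the cμ rule queue 1 is served at rate 0.9 only in the rare slots in which queue 0 holds at most one job, so its backlog grows roughly linearly.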

For any two-queue, two-server system with service rates μ, costs c, and arrival rates λ satisfying conditions (6) and (7) (which require, respectively, that queue 1 has the larger cμ-index at both servers, and that the arrival rate of queue 2 exceeds the stationary rate of service it receives under the resulting static priority, where the latter is computed from the stationary distribution of the Markov chain describing queue 1), there exist positive constants, depending on the parameters, such that queue 2 grows linearly in time with at least that probability.

It is easy to construct an example of a system with parameters satisfying conditions (6) and (7) and with λ in the capacity region Λ(μ). This shows that for such systems the stability region of the cμ rule is, in general, a strict subset of the capacity region. Such an example can be constructed as follows: first pick the service rates and costs so that queue 1 has the larger cμ-index at both servers; next, compute the stationary distribution of queue 1 when served by both servers; finally, choose the arrival rate of queue 2 to exceed the stationary service rate allocated to it while keeping λ inside the capacity region. Since the system parameters then satisfy conditions (6) and (7), it follows by the theorem above that λ lies outside the stability region of the cμ rule.

The criterion for instability above is rather sharp, as evidenced by the following result.

Any system with service rates μ and costs c satisfying condition (6) is stable under the cμ rule if and only if condition (8) holds, where π is the stationary distribution of the Markov chain describing queue 1. In addition, (8) implies that the queueing process is geometrically ergodic under the cμ rule. In particular, there exist a function V ≥ 1, constants ρ < 1 and b < ∞, and a finite set S, such that PV ≤ ρV + b·1_S, where P denotes the transition kernel of the chain {Q(t)}. It is well known that this implies that there exist constants C and θ > 0 such that τ, the first hitting time of the empty state, satisfies

P(τ > t) ≤ C e^{−θ t}.    (9)
The proofs of the two results above can be found in Appendix 7.

4.2 Sufficient conditions for geometric ergodicity of the system.

We now obtain sufficient conditions, in terms of the parameters (λ, μ, c), for the busy cycles to have exponentially decaying tails. This condition, in particular, implies that the queue-length process is geometrically ergodic.

For any queue-state q, let σ_i(q) denote the total service rate assigned by the cμ rule to queue i when the queue-state is q. If, uniformly over all sufficiently large queue-states q, the arrival rate vector λ is dominated (with some positive margin) by a convex combination, with weights in the probability simplex, of the achievable service rate vectors σ(q), a condition we refer to as (10), then we can construct an appropriate Lyapunov function for which the one-step drift under the algorithm is negative outside a finite set. This enables us to show the following tail probability bound for the busy period of the system.

Let τ denote the first hitting time of the empty state under the cμ rule. If Condition (10) is true, then there exist constants C, θ > 0 such that, for any t,

P(τ > t) ≤ C e^{−θ t}.
Details of the proof of this lemma are given in Appendix 8.

Below, we explicitly derive the sufficient conditions given by (10) for a couple of examples. Further, for the case of the N-network in Subsection 4.2 (which is a special case of the network in Subsection 4.1), we compare the resulting region with the stability region.

Consider the example where queue 1 has priority over queue 2 for all servers. Let π be the stationary distribution of queue 1's queue-length (note that, in every time-slot, the service offered to queue 1 is independent of the current queue-length of queue 2). The stability region is then given by the corresponding rate condition on the arrival rates.

We now obtain a subset of the region (10) by choosing specific values of the convex weights. With this choice, (10) reduces to an explicit rate condition. To see this, note that the resulting drift bound holds for every sufficiently large queue-state, which shows that the region given by (10) contains this explicit region.

Consider the N-network, i.e., a two-queue, two-server system with one flexible and one dedicated server, and let the first queue have higher priority according to the cμ rule. This is a special case of the system in Subsection 4.1. Let π be the stationary distribution of queue 1. A closed-form expression for π can be found in Appendix 11. Thus, for this system, we can determine the stability region analytically through (8). Moreover, as seen in Subsection 4.1, we have geometric ergodicity in all of the stability region. Below, we compare the region given by (10) with the stability region.

Case 1: the flexible server is allocated to Queue 2 when Queue 1 has only a single job in its queue. In this case, as discussed above, the stability region is given by an explicit rate condition, whereas Condition (10) is equivalent to a stricter rate condition: the stability region of a system in which each server serves the two queues at fixed rates.

Case 2: the flexible server is allocated to Queue 2 even when Queue 1 has a single job in its queue. In this case, the stability region is again given by an explicit rate condition, whereas Condition (10) is equivalent to a stricter condition.

In this example, while Condition (10) does not cover the entire stability region, the region it covers is "close" to the stability region in some limiting regimes: in each of the two cases above, as the relevant service rates approach their limiting values, the region given by (10) approaches the stability region.

5 Learning the cμ Rule—Parallel Server System.

5.1 The algorithm.

We now propose a learning extension of the cμ rule for the parallel server system. Recall that the number of samples for a link in any time-slot is the number of times it has been scheduled before that time-slot. For the single-server system, a sufficient number of samples can be ensured without explicit exploration, owing to the stabilizing property of work-conserving policies, all of which have the same busy periods. However, this property does not hold in general for the parallel server system, and thus a straightforward extension of the cμ rule based on empirical means, without explicit exploration, may not obtain enough samples to learn the system. The following example shows how a naive extension of the cμ rule can fail to stabilize a network. Consider a two-queue, two-server network whose service rates μ, costs c, and arrival rates λ are such that the network is stable under the cμ rule. We show that, under the policy that does not explore and schedules according to the empirical estimates of the service rates, the queues grow linearly with positive probability. With positive probability, the initial samples produce empirical estimates that invert the true priority order; conditioned on this event, the greedy algorithm schedules only the (inferior) mis-ordered links after obtaining the initial samples, and never corrects its estimates of the remaining links. Using Hoeffding's inequality, we can derive concentrations for the total number of arrivals to each of the queues and the total service offered by the scheduled links, to show that there exist positive constants such that the total queue-length grows linearly in time with at least a constant probability.

As a solution to the above problem, we propose an algorithm that dynamically decides to explore whenever the number of samples falls below a threshold. We refer to this as the empirical cμ algorithm for parallel server networks, and define it in Algorithm 2 below.

5.1.1 Dynamic explore—conditional ε-greedy.

In each time-slot, the algorithm explores conditionally based on the number of samples, i.e., it uses an ε-greedy policy if the minimum number of samples over all links is below some threshold. More specifically, let:

  1. E be a collection of assignments such that their union covers the complete bipartite graph between queues and servers;

  2. N_ij(t) be the number of samples of link (i, j) at time t;

  3. N_min(t) = min_{i,j} N_ij(t) be the minimum sample count over all links;

  4. f(t) be the exploration threshold at time t;

  5. μ̂(t) be the estimated rate matrix at time t.

At time t, if N_min(t) < f(t), the algorithm decides to explore with probability ε; otherwise it follows the cμ rule using the estimated rate matrix μ̂(t).

At time t:
E(t) ← independent Bernoulli sample of mean ε
if N_min(t) is below the threshold and E(t) = 1 then
     Explore: Schedule an assignment from E uniformly at random.
else
     Exploit: Schedule according to the cμ rule with parameters c, μ̂(t).
end if
Algorithm 2 The empirical cμ algorithm for parallel server networks
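A minimal sketch of the conditional-exploration step follows. This is our own illustration: the threshold schedule log²(t), the ε value, and the function names are assumptions (the paper's exact threshold sequence may differ), and the exploit step computes the empirical cμ rule by brute force.

```python
import math, random
from itertools import product

def max_weight(Q, c, mu):
    """Empirical c-mu schedule: brute-force max-weight server->queue map."""
    n_q, n_s = len(Q), len(mu[0])
    best_w, best_a = -1.0, [None] * n_s
    for a in product([None] + list(range(n_q)), repeat=n_s):
        if any(a.count(i) > Q[i] for i in range(n_q)):
            continue
        w = sum(c[a[j]] * mu[a[j]][j] for j in range(n_s) if a[j] is not None)
        if w > best_w:
            best_w, best_a = w, list(a)
    return best_a

def schedule(t, Q, c, mu_hat, n_samples, cover, rng, eps=0.25):
    """One slot of the conditional eps-greedy rule (Algorithm 2 sketch).

    cover     : assignments whose union covers all (queue, server) links
    n_samples : n_samples[i][j] = times link (i, j) has been scheduled
    Explores only while some link is under-sampled; otherwise purely greedy.
    """
    threshold = math.log(t + 2) ** 2          # hypothetical threshold schedule
    undersampled = min(min(row) for row in n_samples) < threshold
    if undersampled and rng.random() < eps:
        return list(rng.choice(cover))        # explore: a covering assignment
    return max_weight(Q, c, mu_hat)           # exploit: empirical c-mu rule
```

Once every link has been sampled often enough (N_min(t) above the threshold), the rule never explores again; the point of the analysis is that, after a finite random time, the busy-cycle structure keeps N_min(t) growing on its own, so the rule becomes purely greedy.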

5.2 O(1) regret for the empirical cμ algorithm.

In this subsection, we prove a regret bound that scales as a constant with increasing T for a subset of the capacity region. This subset is given by the region in which the cμ algorithm achieves exponentially decaying busy cycles. In the theorem which follows, we show that the queue-length error for the empirical cμ algorithm decays super-polynomially with time if (11) is satisfied. Again, as in the single-server system, this translates to an O(1) regret. For any λ such that (11) is satisfied, the expected queue-length difference between the empirical cμ system and the genie system decays super-polynomially in t. In particular, there exists a constant M, independent of T, such that the regret is at most M.

As for the single-server system, the main idea in proving this result is to characterize the coupling time of the queue-lengths of the actual and genie systems. More specifically, we show that the queue-length of the actual system at time t does not exceed that of the genie system with probability approaching one polynomially fast. For this, we first show in Appendix 9 that the algorithm obtains a sufficient number of samples, owing to its conditional explore policy, thus enabling it to agree with the cμ rule in its exploit phase after time t/2. In turn, this ensures exponentially decaying tails for the busy cycles after time t/2, by the lemma of Subsection 4.2.

This concentration for the busy cycles can be used to further show that the following two ‘events’ occur with polynomially high probability:

  1. The algorithm does not need to explore in the latter half of the horizon (Section 9). This can be explained as follows: whenever the system hits the zero state, there is a positive probability that only a select subset of queues is non-empty in the subsequent time-slots. Therefore, under any work-conserving algorithm, every link has a positive probability of being scheduled at the beginning of a new busy cycle. If the algorithm stabilizes the system well enough that it hits the zero state regularly, then it obtains a sufficient number of samples without explicit exploration. We use the busy cycle tail bound in Subsection 4.2 to show that the system hits the zero state often enough to yield at least the required number of samples in the first half of the horizon (the constant depends on the system parameters). This ensures that the algorithm does not need to explore in the latter half.

  2. The system hits the zero state at least once in the latter half of the horizon (Section 9). This can be verified using the busy cycle concentration in Subsection 4.2.

Next, we show (in Section 9) the following monotonicity property for the cμ rule: if two systems with identical parameters, whose initial queue-states satisfy an element-wise ordering, both follow the algorithm, then the same ordering of their respective queue-states is maintained in all subsequent time-slots.
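The coupling behind this monotonicity property can be illustrated in the simpler single-server setting: two systems with element-wise ordered initial states are driven by common arrivals and a common uniform service draw per slot. The rates and priorities here are illustrative (the higher-priority class also has the higher service rate, as in a rule with queue priority).

```python
import numpy as np

def step(q, U, A, mu, priority):
    """One slot of a priority rule under a common coupling: arrivals A are
    shared by both systems, and a single uniform U decides the service
    outcome of whichever class is served."""
    q = q + A
    for i in priority:
        if q[i] > 0:
            if U < mu[i]:
                q[i] -= 1               # service success for the served class
            break
    return q

# Two coupled systems: identical parameters, ordered initial states.
rng = np.random.default_rng(1)
mu = np.array([0.9, 0.5])               # priority order matches the rate order
priority = [0, 1]
q_small = np.array([0, 1])
q_big = np.array([2, 3])                # element-wise >= q_small
ordered = True
for _ in range(2000):
    U = rng.random()
    A = (rng.random(2) < np.array([0.2, 0.1])).astype(int)
    q_small = step(q_small, U, A, mu, priority)
    q_big = step(q_big, U, A, mu, priority)
    ordered = ordered and bool((q_small <= q_big).all())
print(ordered)  # True: the ordering is preserved pathwise
```

A short case analysis shows the invariant is deterministic under this coupling: whenever the smaller system serves a lower-priority class, its higher-priority queues are empty, so no service outcome can break the element-wise ordering.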

To summarize the argument, we have with polynomially high probability that (i) the algorithm agrees with the cμ rule while exploiting after time (Section 9), (ii) it only exploits in the latter half of the horizon and does not explore (Section 9), and (iii) the system reaches the zero state (which is smaller than any state the genie system could be in) at least once in the latter half of the horizon (Section 9). Thus, the monotonicity property (Section 9 in Appendix 9) shows that the system's queue-length never exceeds that of the genie system after it first hits the zero state in the latter half of the horizon. Effectively, at time , the regret increment is positive only with polynomially small probability, which gives us the required decay of the expected queue-length error in Subsection 5.2.

The detailed proofs of Subsection 5.2 and of the supporting lemmas are given in Appendices 9 and 10.

The degradation of the convergence rate of the queue-length error, from exponential in the single server system to super-polynomial in the parallel server system, can be explained by the addition of explicit exploration in the algorithm for the latter. In this situation, we can only show that the algorithm needs to explore with a probability that vanishes at a polynomial rate. For exponential convergence, however, one would need to establish that the algorithm deviates from the cμ rule with a probability that vanishes at an exponential rate. Designing algorithms with the best achievable convergence rates is an area of future work.

5.3 Extension to other genie policies.

We now discuss the scope for generalizing the results in this paper to scheduling policies other than the cμ rule. Consider the bipartite graph with queues and servers as the nodes and the links between them as the edges. We define a static priority rule as a scheduling policy which allocates servers to non-empty queues according to a given priority order on the links. For example, the cμ rule is a static priority rule where the priority order of the links is given by the descending order of their weights . Now, consider genie algorithms that are based on static priority rules, i.e., in every time-slot, the same priority order is used to assign servers to non-empty queues. If the cμ rule is replaced by any static priority genie rule, the same proof technique given above can be applied whenever the monotonicity property in Subsections 10.1 and 9 holds for the corresponding static priority rule. This monotonicity property can be proved for any rule with queue priority, i.e., a static priority rule where queues have a specified order of priority and, for each queue, the links are ordered according to their service rates to that queue. Therefore, the regret bound in Subsection 5.2 also holds for algorithms where the exploit rule in Algorithm 2 is replaced by a rule with queue priority.
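A rule with queue priority, as described above, can be sketched as follows. The restriction that each queue receives at most one server per slot is our simplifying assumption for this sketch.

```python
import numpy as np

def queue_priority_schedule(q, mu, queue_order):
    """Allocate servers under a 'rule with queue priority': scan queues in a
    fixed priority order and let each non-empty queue grab its fastest
    still-free server (at most one server per queue per slot)."""
    free = set(range(mu.shape[1]))
    schedule = {}                        # server j -> queue i
    for i in queue_order:
        if q[i] > 0 and free:
            j = max(free, key=lambda s: mu[i, s])
            schedule[j] = i
            free.remove(j)
    return schedule

q = np.array([2, 0, 1])
mu = np.array([[0.3, 0.7],
               [0.5, 0.1],
               [0.9, 0.2]])
print(queue_priority_schedule(q, mu, [0, 1, 2]))  # {1: 0, 0: 2}
```

Note the induced link order is static: queues are ranked first, and within each queue the links are ranked by service rate, which is exactly the structure for which the monotonicity property is claimed.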

Moreover, Subsection 4.2 holds for any static priority algorithm, whereas the region of arrival rates given by Condition (10) depends on the priority rule for a general parallel server system. In Section 6, we show that exponential tail bounds for busy cycles hold within the entire stability region for a special class of policies that we refer to as hierarchical rules. Thus, for a hierarchical rule that satisfies the monotonicity property, we can show O(1) regret for Algorithm 2 (with the cμ rule replaced by the hierarchical rule) within the entire stability region.

6 Stability of hierarchical rules in parallel-server networks.

In this section we extend the results of Subsections 4.1 and 4.1, and show a special class of rules for which geometric ergodicity holds in the entire stability region.

Consider a queueing network with classes of customers and servers. The queues are labeled as and the servers as . Set and . Each queue can be served by a subset of servers, and each server can serve a subset of queues. For each , let be the subset of servers that can serve queue , and for each , let be the subset of queues that can be served by server . For each and , if queue can be served by server , we denote as an edge in the bipartite graph formed by the nodes in and ; otherwise, we denote . Let be the collection of all these edges. Let be the bipartite graph formed by the nodes (vertices) and the edges . We assume that is connected.

A static priority rule can be identified with a permutation of the edges of the graph , i.e., a one-to-one map defined by the priority rule: if edge has higher priority than edge .

Definition (Hierarchical Rule). For a static priority rule and for any and with , we say that if for all . A static priority rule is hierarchical if defines a partial order on , and for any and with , either or .

It is easy to see that if is a tree, then every static priority rule is hierarchical.

In the rest of this section, we study only hierarchical rules.

6.1 Hierarchical decomposition.

Consider a queueing network with graph , parameters and , under a hierarchical static priority rule . We denote by the minimal elements of under . The dependence on the arrival rates is suppressed in this notation, since at each step of the decomposition the arrival rates match the original ones, while the service rates are modified.

Consider the subgraph with queue nodes and server nodes . Since consists of minimal elements, it follows that if and . Hence each queue forms a Markov process, which is geometrically ergodic, since . Let denote the stationary distribution of .

Next, we remove the nodes and associated edges from , and denote the resulting graph, which might not be connected, by . We let denote the minimal elements of under . Removing these nodes and associated edges from , we obtain a graph , and so on by induction. We let denote the largest integer such that .

Let , , and let denote the queueing process restricted to . It is clear that this is Markov. Provided that it is positive recurrent, we let denote its invariant probability measure.
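The layer-by-layer removal of minimal elements described above is the standard peeling of a finite partial order, which can be sketched generically, with a `less_than` predicate standing in for the priority relation on edges:

```python
def peel_layers(elements, less_than):
    """Decompose a finite partial order into successive layers of minimal
    elements: layer 1 is the set of minimal elements, which is removed
    before computing the next layer, and so on (mirroring the hierarchical
    decomposition of the graph).

    less_than(a, b) is True iff a strictly precedes b.
    """
    remaining = set(elements)
    layers = []
    while remaining:
        minimal = {a for a in remaining
                   if not any(less_than(b, a) for b in remaining if b != a)}
        layers.append(sorted(minimal))
        remaining -= minimal
    return layers

# Example with divisibility as the partial order:
divides = lambda a, b: b % a == 0 and a != b
print(peel_layers([2, 3, 4, 9, 12], divides))  # [[2, 3], [4, 9], [12]]
```

In the queueing-network setting the first layer corresponds to queue-server pairs whose dynamics do not depend on any other queue, which is what makes the first-layer queues Markov on their own.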

6.2 The structure of the transition kernels.

Let for some . It is clear that the transition kernel of depends on , and thus takes the form . Under the hierarchical rule, a server may not be available to a given queue if higher-priority queues are sufficiently long. It is evident then that the transition kernel of has the following structure: there exists a finite partition of and associated transition kernels , with each corresponding to a queue with arrival rate and served by a subset of the servers , such that


We illustrate this via the following example. Consider the ‘W’ network in Figure 1.

Figure 1: Figure to demonstrate the structure of the transition kernels.

It is clear that


where we use the notation to denote the transition kernel of a single-queue, single-server system with parameters and . Continuing, we also have


with , and . Here corresponds to a transient process, with arrivals but no service.
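The single-queue, single-server kernel referred to above can be written out explicitly for Bernoulli arrivals and services. The arrivals-before-service convention and the truncation at a finite buffer are our assumptions for this sketch.

```python
import numpy as np

def kernel(lam, mu, qmax):
    """Transition matrix of a discrete-time single-queue, single-server
    system: Bernoulli(lam) arrivals, Bernoulli(mu) services when non-empty,
    truncated at buffer size qmax with a reflecting boundary."""
    P = np.zeros((qmax + 1, qmax + 1))
    for q in range(qmax + 1):
        up = lam * (1 - mu) if q > 0 else lam       # arrival, no departure
        down = mu * (1 - lam) if q > 0 else 0.0     # departure, no arrival
        if q < qmax:
            P[q, q + 1] = up
        else:
            P[q, q] += up        # mass reflected at the truncation boundary
        if q > 0:
            P[q, q - 1] = down
        P[q, q] += 1 - up - down
    return P

P = kernel(0.3, 0.6, 50)
```

This birth-death chain is geometrically ergodic precisely when the upward drift is smaller than the downward drift, i.e., lam < mu, matching the drift conditions used in the ergodicity discussion.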

Next we discuss the ergodic properties of the ‘W’ network in Figure 1. Suppose that the arrival rates lie in the capacity region. Then of course and is a geometrically ergodic Markov chain with stationary distribution . It follows by Eq. 13 and the proof of Subsections 4.1 and 4.1 that if


then the chain is geometrically ergodic, and if the opposite inequality holds in Eq. 15, then it is transient. Continuing, assume Eq. 15, and let denote the stationary distribution of . Applying the same reasoning to Eq. 14, it follows that if


then is geometrically ergodic, otherwise it is not. Thus, combining the above discussion with Subsection 4.1, it is clear that the queueing process is geometrically ergodic if and only if Eqs. 15 and 16 hold.

6.3 The averaged kernel.

Recall the notation introduced in Subsection 6.1. Suppose that the queueing process is geometrically ergodic, and as introduced earlier, let denote its invariant probability measure. We define the averaged kernel of Eq. 12 by

Recall that each kernel corresponds to a single-queue system with arrival rate , and service rates for a subset of the original server nodes ( might be empty). For each define


It is clear that