I Introduction
The problem of scheduling and resource allocation has been widely recognized as a way to improve network performance and meet service requirements. Many resource allocation problems have been studied in the past in wired and wireless networks. In this paper, we are interested in the problem of scheduling in queueing systems where a set of users or queues share a set of servers. At each time slot, the servers are allocated to the users so as to minimize the total expected length of the users’ queues. Although this problem is well known in the literature, one can show that it is a Restless Bandit Problem (RBP), which is very hard to solve, as we will see in the sequel. This problem has been well studied in the past from a stability perspective [6], [16], [7]. It has been shown that the max weight policy is throughput optimal, and many variants have been proposed to deal with different settings and conditions [16], [7]. The main weakness of the max weight policy is that it may result in a high (but finite) average delay. In order to improve the average delay in the system, we are interested in minimizing the total average length of the queues. This problem can be cast as an RBP, a particular model of Markov Decision Processes (MDP). RBPs are PSPACE-hard, see Papadimitriou et al. [13], which means that their optimal solution is out of reach. One therefore has to develop a suboptimal but well-performing policy.
In this paper, we propose a Whittle index policy to deal with the aforementioned problem. The development of such a policy is not straightforward: it requires proving that the policy exists and then deriving it. First, we introduce a new discount factor in the reward function and analyze the Lagrangian relaxation of the resulting discounted reward problem. We then derive the Whittle index of this relaxed problem as a function of the discount factor and take the limit as the discount factor tends to 1 to find the Whittle index of the original problem. The interest of finding the Whittle index expressions for our problem is that we can use the known Whittle index policy (WIP) to allocate the resources to users. We will show that, for our model, an explicit expression of the Whittle index can be found. WIP has been proposed as a suboptimal policy for many problems in the literature, see for instance [11, 1]. It has also been shown to perform near optimally in many scenarios, and in the particular case of multiclass M/M/1 queues, WIP, which simplifies to the cμ rule, is optimal, see [2] and [9]. Therefore, WIP (when it is possible to obtain it) is a well-performing policy. This motivates the development of the Whittle index expressions in this paper.
I-A Related Work
Many works study the problem of resource allocation in wireless networks. For instance, in [5], [6], [16], [7], the authors give throughput-optimal policies for single-channel, multichannel, and multiuser MIMO contexts using the max weight rule, which is known not to be delay optimal. To overcome this issue, many works have been developed to minimize the average delay of the users’ traffic (see, e.g., [4] and the references therein). Most of them describe the minimization problem as a Markov Decision Process (MDP) and develop resource allocation policies from the Bellman equation, for example with the value iteration algorithm. However, as mentioned in the abstract, MDP formulations and the associated Bellman equations are hard to solve. The works in [18], [3] try to minimize the average delay of the users’ queues using stochastic learning algorithms, but such algorithms are time- and memory-consuming and have a high computational complexity.
On the other hand, for some MDP problems, the optimal policy turns out to be reachable and takes the form of an index policy. For instance, in a multiclass single-server queue with linear holding costs, the optimal index policy is the cμ rule, which schedules the user with the highest product of holding cost c and service rate μ, see [2]. Another classical result that can be seen as an index policy is the optimality of Shortest-Remaining-Processing-Time (SRPT), where the index of each customer is given by its remaining service time [15]. Both examples fit the general context of Multi-Armed Bandit Problems (MABP), a particular case of MDP: at each decision time, only one bandit is selected and its state evolves stochastically, while the states of the other bandits stay unchanged. The aim of the scheduler is to maximize the total average reward or to minimize the total average cost. Whittle [19] introduces the so-called Restless Bandit Problem (RBP), where the scheduler selects a fixed number of bandits and all bandits (active or not) might evolve. He defines the Whittle index and the Whittle index policy, which is asymptotically optimal under some conditions. However, calculating the Whittle index raises two main difficulties: first, indexability must be established, and second, the calculation of the index itself might be infeasible in some cases. The Whittle index policy has been derived for birth-and-death multiclass multi-server queues in [10]. In [17], an optimal index policy called the generalized cμ rule (Gcμ) is developed in the heavy-traffic regime with convex delay costs. Furthermore, in contrast to the cμ rule, [12] establishes the optimality of the generalized cμ rule (Gcμ) even with multiple servers. In [1], the authors calculate the Whittle index policy for a multiclass queue with general holding cost functions.
The aforementioned works consider continuous-time models. [8] considers a different MDP model with a discrete-time slotted system and a finite buffer length. In this paper, we consider the same model, but with an infinite buffer length. The major difference between our work and [8] lies in the derivation of the Whittle index: we use a new discounted cost approach, and we explain the difference with respect to [8] in more detail in Section IV. We provide an explicit characterization of the Whittle indices by introducing a discounted cost approach, and we develop a Whittle index allocation policy for our original problem (the average cost case) by adapting the Whittle index expressions when the discount parameter is close to one. We find that, for large queue states, the resulting policy can be seen as a cμ rule in which c is replaced by the weight factor and μ is replaced by the transmission rate, i.e., the number of packets that can be transmitted per time slot.
The remainder of the paper is organized as follows. In Section II, we describe the system model and formulate the average cost minimization problem. In Section III, we introduce the Lagrangian relaxation and show the optimality of threshold/monotone policies for the relaxed dual problem. In Section IV, we characterize Whittle’s indices explicitly for all queue states and explain Whittle’s index policy. Numerical results are provided in Section V and Section VI concludes the paper. The proofs are provided in the appendix.
II System Model
II-A System Model Description
We consider a time-slotted system with one central scheduler, N queues, and a set of uncorrelated “channels” or “servers”. The words channels and servers are used interchangeably, as are the terms users and queues. At each time slot, the scheduler chooses a subset of the N users and allocates exactly one channel to each chosen user. Users are grouped into classes: a user of class k can transmit at most a class-dependent number of packets per time slot, called the maximum transmission rate, if a channel/server is assigned to it. In other words, the server rate is not the same for all queues and depends on the class of the user. The number of classes is fixed, and the class rates satisfy an assumption that we briefly justify later. Each class accounts for a given proportion of the total number of users. We further consider the number of packets that arrive to each class-k queue at each time slot, and we assume that packet arrivals follow a uniform distribution, so that every admissible number of arriving packets is equally likely in each time slot. The transmission decision is a binary variable indicating whether the user in class k is scheduled or not. We consider that all users have an infinite queue length, which is the main difference with respect to the work in [8], in which a finite queue length is assumed. The number of packets in each queue of class k then evolves according to the queue dynamics (1).
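The queue dynamics described above can be illustrated with a short simulation sketch. It assumes the standard recursion in which a scheduled queue first sends up to its maximum transmission rate and then receives the new arrivals, and it assumes that arrivals are uniform on the integers from 0 to the maximum transmission rate. Both assumptions, as well as the names `step_queue`, `sample_arrival`, `simulate`, and `rate`, are illustrative and are not taken from the paper.

```python
import random


def step_queue(queue_len: int, scheduled: int, rate: int, arrivals: int) -> int:
    """One slot of the assumed recursion: serve up to `rate` packets, then add arrivals."""
    served = rate if scheduled else 0
    return max(queue_len - served, 0) + arrivals


def sample_arrival(rate: int, rng: random.Random) -> int:
    """Assumed arrival law: uniform on {0, 1, ..., rate}."""
    return rng.randint(0, rate)


def simulate(rate: int, slots: int, schedule_prob: float, seed: int = 0) -> list:
    """Trajectory of one infinite-buffer queue that is scheduled with prob. `schedule_prob`."""
    rng = random.Random(seed)
    q, path = 0, [0]
    for _ in range(slots):
        scheduled = 1 if rng.random() < schedule_prob else 0
        q = step_queue(q, scheduled, rate, sample_arrival(rate, rng))
        path.append(q)
    return path
```

Under these assumptions, a queue that is scheduled in every slot never holds more packets than its maximum transmission rate, whereas an unscheduled queue can only grow, which is why the buffer is modeled as unbounded.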
II-B Problem Formulation
We consider the broad class of scheduling policies that make a scheduling decision based on the history of observed queue states and scheduling actions. The scheduling problem consists in finding a policy in this class that minimizes the infinite-horizon expected average queue length, subject to the constraint on the number of users selected in each time slot, i.e., the number of scheduled users must not exceed the number of available channels. According to Little’s law, minimizing the average queue length reduces the average delay experienced by the users. Given a weight factor for each class and the initial state, the problem is formulated as:
(2)  
s.t. 
III Relaxed Problem and Threshold Policy
The problem described in Section II-B is a Restless Bandit Problem (RBP), see Whittle [19]: at each time slot, a subset of the N users is scheduled, and the state of each queue evolves even if the queue is not scheduled. Since RBPs are PSPACE-hard, see Papadimitriou et al. [13], we need to develop an approximation in order to derive well-performing policies. To that end, we first analyze a relaxed version of the original problem and then use the structure of its optimal policy to develop a Whittle index policy for the original problem. The relaxation considered here is the Lagrangian relaxation, which relaxes the constraint on the number of available servers. In other words, we require the constraint in Equation (2) to be satisfied on average and not in every time slot. That means:
(3) 
Denoting by W the Lagrange multiplier of the constrained problem, the Lagrange function is equal to:
where W can be seen as a subsidy for not transmitting, or as the price of taking the active action. Therefore, the dual problem for a given W is
(4) 
III-A Problem Decomposition
The relaxation allows us to decompose the multi-dimensional problem into much simpler one-dimensional subproblems. To do so, we fix the Lagrange multiplier W and discard from the dual problem formulation the sum that does not depend on the policy (since the problem is an optimization over a set of policies). Hence, the dual problem is equivalent to:
(5) 
In fact, the solution of this problem is the stationary policy that solves the well-known Bellman equation, see, e.g., Ross [14]. Namely,
(6) 
Equation (6) involves the value function, the optimal average cost, and the holding cost of the class. The optimal decision for each state can be obtained by minimizing the right-hand side of Equation (6). One can show that, for a given W, this relaxed problem can be decomposed into independent subproblems. We skip the proof here for brevity and refer the reader to [8], as the model therein is similar to ours except that the queues there have a limited capacity.
III-B Threshold Policy
In this section, we show that the solution of each individual problem (one per user) is a threshold policy. We first give some useful definitions.
Definition 1.
For given class, a threshold policy is a policy for which there exists an such that when the queue of user is in state , the prescribed action is . And when the queue , the prescribed action is and .
Since there are only two possible actions, a policy is of the form threshold policy if and only if it is monotone in .
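The equivalence between threshold and monotone policies can be made concrete with a small sketch. The convention used here, passive (0) at or below the threshold and active (1) above it, is an assumption made for illustration, and the function names are hypothetical.

```python
def threshold_action(queue_len: int, threshold: int) -> int:
    """Return 0 (passive) at or below the threshold and 1 (active) above it."""
    return 1 if queue_len > threshold else 0


def is_monotone(policy: list) -> bool:
    """True if the 0/1 actions never switch back from active to passive as the state grows."""
    return all(a <= b for a, b in zip(policy, policy[1:]))
```

With two actions, a monotone action sequence can switch from passive to active at most once, and the last passive state plays the role of the threshold.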
Definition 2.
We say that function is Rconvex in , if for any and in such that , we have:
Definition 3.
Let be a real valued function defined on , with , and . We say that is submodular if for all on .
The solution of Bellman equation (6) can be obtained by an algorithm called Value iteration. This consists in updating by the following equation
(7) 
After many iterations, the iterates converge to the unique fixed point of Equation (6).
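As an illustration of the value iteration idea, the following sketch runs relative value iteration for a single queue of the relaxed problem. It is not the paper's algorithm: it assumes a per-slot cost equal to the weighted queue length plus the price W when the queue is scheduled, uniform arrivals on the integers from 0 to the maximum transmission rate, and a state space truncated at a hypothetical `q_max` so that the computation is finite. All names are illustrative.

```python
def relative_value_iteration(rate, weight, subsidy, q_max=40, sweeps=500):
    """Relative value iteration for one queue of the relaxed problem (truncated at q_max).

    Assumed per-slot cost: weight * q + subsidy * action.
    Assumed arrivals: uniform on {0, ..., rate}.
    Returns (gain estimate, relative values h, greedy policy).
    """
    states = range(q_max + 1)
    p = 1.0 / (rate + 1)
    h = [0.0] * (q_max + 1)

    def q_value(q, action, values):
        # Serve first (if scheduled), then add arrivals; cap at q_max (truncation).
        after_service = max(q - rate * action, 0)
        expected = sum(p * values[min(after_service + a, q_max)] for a in range(rate + 1))
        return weight * q + subsidy * action + expected

    gain = 0.0
    for _ in range(sweeps):
        updated = [min(q_value(q, 0, h), q_value(q, 1, h)) for q in states]
        gain = updated[0]                  # state 0 is used as the reference state
        h = [v - gain for v in updated]    # keep the values relative to the reference
    policy = [0 if q_value(q, 0, h) <= q_value(q, 1, h) else 1 for q in states]
    return gain, h, policy
```

The number of sweeps is fixed rather than driven by a stopping rule, and the truncation may distort the policy near `q_max`, so the output should be read as a qualitative check of the structural results rather than as a faithful solution of the infinite-buffer problem.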
Definition 4.
We define the operator such that for each
Proposition 1.
For each class and user , the optimal solution that resolves the Bellman equation (6) is of type increasing threshold: there exists a state such that for each state the optimal decision is passive action, and for each state the optimal decision is active action.
In order to prove this result we need to prove that is submodular.
Proof outline: Since our model is similar to the one considered in [8], the proof is similar. We provide here a high-level description of the proof:
1) Establish by induction that every value iteration iterate is increasing and R-convex. From that, conclude that the limiting value function is also increasing and R-convex.
2) Demonstrate that these two properties imply the required submodularity.
3) Conclude that the optimal solution is an increasing threshold in the queue state by exploiting this submodularity.
IV Whittle’s Index
In this section, we introduce the notion of the Whittle index, which will be useful to develop a new heuristic for the original problem. We review the main result and approach of [8] and explain its limitation, namely why this approach cannot be used to find the Whittle index values if the queues have unlimited capacity.
At a given state, the Whittle index is the Lagrange multiplier, or subsidy for passivity, at which the scheduler is indifferent between the two actions (the passive and active decisions are both optimal). This definition requires the indexability property: as the subsidy for passivity W increases, the set of states in which the optimal action is passive grows.
Before providing a rigorous definition of indexability, we recall from the previous section that the optimal policy of the relaxed problem for a given W is a threshold policy, and we consider the stationary distribution of the states under a threshold policy in each class. We now formalize the concepts of indexability and Whittle index in the following definition.
Definition 5.
A class of queues is indexable if the set of states in which the passive action is the optimal action (denoted by ) increases in . That is, . When the class is indexable, the Whittle’s index in state in class is defined as:
We start by showing the indexability of the problem. The proof is straightforward and can be obtained from previous work in this area.
Proposition 2.
Assuming that for each and class, the optimal solution is of type threshold , and is increasing in , then the class is indexable.
Where is an optimal threshold at W(i.e. optimal solution of the relaxed problem for given W) in class and the stationary distribution of the queue states under threshold policy in class. One can show that the condition in the aforementioned proposition is satisfied for our problem. The proof is similar to the one in [8] and is skipped here for brevity.
Several works have been conducted in the past to find Whittle index values for different scheduling problems, e.g. [9] and the references therein. In [9], an algorithm is provided to compute the Whittle index for a queueing system with one server. This algorithm gives a recursive expression of the Whittle index for a given state. However, its complexity grows with the number of states, and hence it cannot be practically applied in our context. More generally, the algorithm cannot be applied when the average time spent in the passive decision takes different values over an infinite set of states, since the number of iterations is then infinite. Since we consider an unbounded queue state, the above algorithm cannot be used. In [8], a closed-form expression of the Whittle index is given, which simplifies the computation.
Let us now restate the Whittle’s index result in [8].
Proposition 3.
[8]
The Whittle’s indices expressions are defined for states and are given as,
for :
Based on the above Whittle index result, the work in [8] provides a Whittle index policy, which consists in allocating the servers, at each time slot, to the users that have the highest Whittle index.
However, the above result covers only a limited set of states. The technique used in [8] consists of finding the stationary distribution of the states under a threshold policy, reformulating the relaxed problem using this stationary distribution, and analyzing the reformulated problem (which is similar to a deterministic one) to find the explicit expressions of the Whittle index based on the algorithm discussed above. In this paper, since the algorithm that gives the Whittle index expressions cannot be applied to all states, we rely on another method that allows us to find the Whittle index values for all possible states. In order to work with the original cost function, we formulate a discounted reward problem with a discount factor. We analyze this discounted problem, find the Whittle index expressions (which depend on the discount factor), and then obtain the Whittle index of our original problem by letting the discount factor tend to 1.
IV-A Problem Reformulation Using the Discounted Cost Approach
We start by formulating the original problem with the expected discounted cost:
(8)  
s.t. 
Following the same steps as in Section III, we relax the problem and give the dual relaxed problem for a given W:
(9) 
We then decompose it into individual problems, since the Bellman equation that solves the dual problem is decomposable. The Bellman equation for an individual problem is given by [14]:
(10) 
In fact is no more than the discounted cost when the initial state is , , and .
Following the same method as in Section III-B, we can prove that the optimal solution of this Bellman equation is a threshold policy by proving the required submodularity. We can also conclude that the value function has the same structural properties as in Section III-B; in particular, submodularity and convexity hold. However, contrary to what has been done in [8], finding the steady-state distribution does not give an explicit expression of problem (9). Nevertheless, we can work with the Bellman equation alone to derive the Whittle index, thanks to the discount factor, which helps us find the Whittle index for all states.
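To make the idea of working directly with the discounted Bellman equation concrete, the sketch below estimates a Whittle index numerically by bisection on the subsidy W. It is a numerical illustration and not the closed-form derivation of this paper. It assumes a per-slot cost equal to the weighted queue length plus W when the queue is scheduled, uniform arrivals on the integers from 0 to the maximum transmission rate, a discount factor named `beta`, and a state space truncated at a hypothetical `q_max`. The bracket `upper` is also an assumption and must be large enough for the chosen parameters.

```python
def discounted_values(rate, weight, subsidy, beta, q_max, sweeps):
    """Value iteration for the discounted single-queue problem truncated at q_max."""
    p = 1.0 / (rate + 1)
    v = [0.0] * (q_max + 1)
    for _ in range(sweeps):
        new_v = []
        for q in range(q_max + 1):
            costs = []
            for action in (0, 1):
                base = max(q - rate * action, 0)
                ev = sum(p * v[min(base + a, q_max)] for a in range(rate + 1))
                costs.append(weight * q + subsidy * action + beta * ev)
            new_v.append(min(costs))
        v = new_v
    return v


def action_gap(q, rate, weight, subsidy, beta, q_max=20, sweeps=150):
    """Active cost minus passive cost at state q (positive means passive is preferred)."""
    v = discounted_values(rate, weight, subsidy, beta, q_max, sweeps)
    p = 1.0 / (rate + 1)
    ev_passive = sum(p * v[min(q + a, q_max)] for a in range(rate + 1))
    ev_active = sum(p * v[min(max(q - rate, 0) + a, q_max)] for a in range(rate + 1))
    return subsidy + beta * (ev_active - ev_passive)


def numerical_whittle_index(q, rate, weight, beta, upper=400.0, steps=25):
    """Bisection on the subsidy until the two actions are nearly equally good at state q."""
    lo, hi = 0.0, upper
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if action_gap(q, rate, weight, mid, beta) < 0.0:
            lo = mid   # active is still strictly better, so the index is larger
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, `numerical_whittle_index(5, rate=2, weight=1.0, beta=0.9)` returns the subsidy at which a queue holding five packets is indifferent between the two actions in this truncated model. The bisection implicitly relies on indexability, which is established in the paper for the untruncated model only.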
Definition 6.
We define and in class as the discounted costs starting at the initial queue state at which the decision taken is to not be scheduled () or to be scheduled () respectively and when the policy considered is threshold , explicitly:
Where is the value function under threshold policy .
Definition 7.
We define as function defined in , such that for all ,
Proposition 4.
For :
For :
Proof.
See Appendix A ∎
We emphasize that to prove that for fixed W, in class, a given state is indeed an optimal threshold (i.e. if the queue is not scheduled and otherwise it is scheduled), we just need to prove that it satisfies for all states and for (according to Bellman equation). In other words, we suppose that is a threshold (i.e. if the queue is not scheduled and otherwise it is scheduled), and we show that for all states and for (for given value of , the optimal threshold might not be unique).
Proposition 5.
For class, if there exists such that , then is an optimal threshold.
Proof.
See Appendix B ∎
Proposition 6.
If , then for , is an optimal threshold. And if , then for all is an optimal threshold.
Proof.
See Appendix C ∎
In order to establish the Whittle indices, we study the function defined in Definition 7.
Lemma 1.
is strictly increasing in , and decreasing in .
Proof.
is clearly strictly increasing in and decreasing in from its expression. ∎
To prove that the Whittle index for a given state is a given in class, we have to demonstrate that for all , at state the decision must be the active action. In other words, since the optimal solution is surely a threshold policy, we need to prove that for all states greater than , they cannot be the optimal threshold. For that, we will suppose that if the optimal threshold is higher than , including the case of infinite threshold, and we will prove that there is a contradiction.
Proposition 7.
If , then the optimal threshold is surely finite.
Proof.
See Appendix D ∎
Proposition 8.
For each queue state in class, the Whittle index expression is given by:
For , .
For , .
Proof.
See Appendix E ∎
We know that, as the discount factor tends to 1, the solution of problem (9) coincides with that of problem (5), see [14]. Hence, to derive the Whittle index for the expected average cost case, we must let the discount factor tend to 1. However, for large enough states, the Whittle indices of Proposition 8 cannot be used directly in this limit. On the other hand, our policy consists in selecting the users whose states have the highest Whittle index values, so the policy is unchanged if the Whittle index values are modified in a way that preserves their order from the largest to the smallest. In the following, we consider the Whittle index of each state in class k.
Theorem 1.
For any , the Whittle index policy where the Whittle indices in each class are given by:
For :
For :
is exactly the Whittle index policy when the Whittle indices are given by Proposition 8.
Proof.
See Appendix F ∎
As the discount factor tends to 1, the condition given in Theorem 1 still holds, and hence we obtain the Whittle index policy for our original problem with expected average cost. The policy consists in allocating the channels (or servers) to the users having the highest Whittle index values given in the aforementioned theorem. Note that the Whittle indices of states greater than the maximum transmission rate differ only through the weight factor and the transmission rate of the class. It is worth mentioning that the obtained policy can be seen as a cμ rule when all states are greater than the maximum transmission rate, since we then choose the users with the highest product of weight factor and transmission rate.
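The allocation step of the policy can be sketched as follows. The index function is passed in by the caller: the closed-form indices of Theorem 1 are not reproduced here, and the toy index used in the usage note is a placeholder, not the paper's expression. The function and parameter names are hypothetical.

```python
import heapq


def whittle_index_policy(states, classes, index_fn, n_servers):
    """Return the set of user ids that receive a server in the current slot.

    `states[i]` is the queue length of user i and `classes[i]` its class.
    `index_fn(queue_len, user_class)` is a caller-supplied index function.
    Ties are broken in favour of the smaller user id.
    """
    ranked = heapq.nlargest(
        n_servers,
        range(len(states)),
        key=lambda i: (index_fn(states[i], classes[i]), -i),
    )
    return set(ranked)
```

For instance, with the placeholder index `lambda q, k: {0: 1.0, 1: 2.0}[k] * q`, queue lengths `[3, 0, 5, 2]`, classes `[0, 0, 1, 1]`, and two servers, the users with ids 2 and 3 are served.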
V Numerical Results
In this section, we show that the Whittle index policy (denoted by WI) performs well when the number of users is large. Since computing the optimal policy is computationally prohibitive, we take advantage of the fact that the optimal cost of the relaxed problem is lower than the optimal cost of the original problem. We therefore compare the cost obtained by our policy with the one obtained for the relaxed problem, for which a simple threshold policy is optimal. The gap between these two costs is then an upper bound on the gap between our policy and the optimal one (in terms of achieved average cost). In addition, we compare the average cost of our policy WI with that of the myopic (MaxWeight) policy, which schedules the queues with the highest instantaneous delay cost. We plot the results in Figures 1 and 2, where we consider two classes of users with their respective transmission rates, and where the number of servers is set as a function of the number of users N. The two figures use different parameter values. These figures suggest that our policy WI is asymptotically optimal and performs much better than the myopic policy. This supports our main motivation for developing the Whittle index policy.
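A comparison of this kind can be set up with a small simulation harness such as the one sketched below. It is not the code behind Figures 1 and 2. It assumes uniform arrivals on the integers from 0 to the maximum transmission rate and the serve-then-arrive queue recursion, and it only implements the myopic baseline; any other index function, including a Whittle index, would have to be supplied by the caller. All names are hypothetical.

```python
import random


def simulate_average_cost(index_fn, rates, weights, n_servers, slots, seed=0):
    """Average weighted queue length when the n_servers largest indices are served."""
    rng = random.Random(seed)
    n_users = len(rates)
    queues = [0] * n_users
    total = 0.0
    for _ in range(slots):
        order = sorted(
            range(n_users),
            key=lambda i: index_fn(queues[i], rates[i], weights[i]),
            reverse=True,
        )
        served = set(order[:n_servers])
        for i in range(n_users):
            after = max(queues[i] - rates[i], 0) if i in served else queues[i]
            queues[i] = after + rng.randint(0, rates[i])   # assumed uniform arrivals
        total += sum(w * q for w, q in zip(weights, queues))
    return total / slots


def myopic_index(queue_len, rate, weight):
    """Myopic rule: rank users by their instantaneous weighted queue length."""
    return weight * queue_len
```

Because the arrival stream depends only on the seed and not on the scheduling decisions, two policies run with the same seed face identical arrivals, which makes their average costs directly comparable.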
VI Conclusion
We considered the problem of resource allocation in a queueing system composed of multiple queues and servers. We showed that minimizing the average expected queue length in a discrete-time slotted system is a Restless Bandit Problem, for which finding the optimal solution is out of reach. We therefore developed a simple Whittle index policy for this problem. While previous works on the Whittle index for discrete-time queueing systems have mainly been limited to finite buffer length models, we provided an extension and derived an index policy without restricting the buffer length to be less than a fixed value. Our development relies on introducing a new discount factor in the cost function, deriving the Whittle index as a function of this discount factor, and then obtaining the index value of the original problem by taking the limit as this factor tends to 1. Numerical results suggest that our policy is asymptotically optimal in the many-user regime.
References
 [1] PS Ansell, Kevin D Glazebrook, José Niño-Mora, and M O’Keeffe. Whittle’s index policy for a multiclass queueing system with convex holding costs. Mathematical Methods of Operations Research, 57(1):21–39, 2003.
 [2] C Buyukkoc, P Varaiya, and J Walrand. The cμ rule revisited. Advances in Applied Probability, 17(1):237–238, 1985.
 [3] Ying Cui and Vincent KN Lau. Distributive stochastic learning for delay-optimal OFDMA power and subband allocation. IEEE Transactions on Signal Processing, 58(9):4848–4858, 2010.
 [4] Ying Cui, Vincent KN Lau, Rui Wang, Huang Huang, and Shunqing Zhang. A survey on delay-aware resource control for wireless systems: large deviation theory, stochastic Lyapunov drift, and distributed stochastic learning. IEEE Transactions on Information Theory, 58(3):1677–1701, 2012.
 [5] Matha Deghel, Mohamad Assaad, Merouane Debbah, and Anthony Ephremides. Queueing stability and CSI probing of a TDD wireless network with interference alignment. IEEE Transactions on Information Theory, 64(1):547–576, 2018.
 [6] Apostolos Destounis, Mohamad Assaad, Mérouane Debbah, and Bessem Sayadi. Traffic-aware training and scheduling for MISO wireless downlink systems. IEEE Transactions on Information Theory, 61(5):2574–2599, 2015.
 [7] Leonidas Georgiadis, Michael J Neely, Leandros Tassiulas, et al. Resource allocation and cross-layer control in wireless networks. Foundations and Trends® in Networking, 1(1):1–144, 2006.
 [8] Saad Kriouile, Maialen Larranaga, and Mohamad Assaad. Asymptotically optimal delay-aware scheduling in wireless networks. https://arxiv.org/pdf/1807.00352.pdf, 2018.
 [9] Maialen Larrañaga. Dynamic control of stochastic and fluid resource-sharing systems. PhD thesis, 2015.
 [10] Maialen Larrañaga, Urtzi Ayesta, and Ina Maria Verloop. Dynamic control of birth-and-death restless bandits: Application to resource-allocation problems. IEEE/ACM Transactions on Networking, 24(6):3812–3825, 2016.
 [11] Keqin Liu and Qing Zhao. Indexability of restless bandit problems and optimality of Whittle index for dynamic multichannel access. IEEE Transactions on Information Theory, 56(11):5547–5567, 2010.
 [12] Avishai Mandelbaum and Alexander L Stolyar. Scheduling flexible servers with convex delay costs: Heavy-traffic optimality of the generalized cμ-rule. Operations Research, 52(6):836–855, 2004.
 [13] Christos H Papadimitriou and John N Tsitsiklis. The complexity of optimal queuing network control. Mathematics of Operations Research, 24(2):293–305, 1999.
 [14] Sheldon M Ross. Introduction to Stochastic Dynamic Programming. Academic Press, 2014.
 [15] Linus E Schrage and Louis W Miller. The queue M/G/1 with the shortest remaining processing time discipline. Operations Research, 14(4):670–684, 1966.
 [16] Leandros Tassiulas and Anthony Ephremides. Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks. IEEE Transactions on Automatic Control, 37(12):1936–1948, 1992.
 [17] Jan A Van Mieghem. Dynamic scheduling with convex delay costs: The generalized cμ rule. The Annals of Applied Probability, pages 809–833, 1995.
 [18] Rui Wang and Vincent KN Lau. Delay-aware two-hop cooperative relay communications via approximate MDP and stochastic learning. IEEE Transactions on Information Theory, 59(11):7645–7670, 2013.
 [19] Peter Whittle. Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 25(A):287–298, 1988.
Appendix A Proof of Proposition 4
1) :
We start first by giving a useful lemma
Lemma 2.
For all ,
Proof.
We decompose the discounted cost in the cost incurred at first time slot plus the discounted cost starting at the next time slot. At state , the decision taken is to transmit since , and at state , the decision taken is passive action since , hence,
Subtracting the second term from the first term,
∎
2):
We consider a given threshold , i.e., for states less than , we don’t transmit otherwise we transmit.
At state if we decide to transmit then the next possible states are ( varies from to ) with the probability to reach each state is , hence, we have,
At state , since then, the decision taken is passive action ( is threshold), thus if we decompose again , , Replacing by its value,
We know that,
Hence,
That means,
At state if we decide to not transmit then the next possible states are with the probability to reach each state is hence, we have, At state for , since then, the decision taken is active action ( is threshold), thus if we decompose again , , Replacing when by its value,
However is no more than , hence,
That means,
Then,
Hence,
Appendix B Proof of Proposition 5
As the function is submodular, then for all , we have , and for all , we have . That means is indeed an optimal threshold.
Appendix C Proof of Proposition 6
1):
According to proposition 4 for , .
Knowing that .
That means, for ,
Hence, using proposition 5, is indeed an optimal threshold.
2):
According to proposition 4 for , .
Knowing that .
Hence,
.
Applying proposition 5, the threshold is indeed an optimal solution when . That is true for all , which concludes the proof.
Appendix D Proof of Proposition 7
We consider that the optimal solution is an infinite threshold and we consider is the value function under infinite threshold.
Lemma 3.
For all .
Proof.
Under infinite threshold, since the decision taken for all states is passive action, then we have for all ,