Learning Augmented Index Policy for Optimal Service Placement at the Network Edge

We consider the problem of service placement at the network edge, in which a decision maker has to choose between N services to host at the edge to satisfy the demands of customers. Our goal is to design adaptive algorithms to minimize the average service delivery latency for customers. We pose the problem as a Markov decision process (MDP) in which the system state is given by describing, for each service, the number of customers that are currently waiting at the edge to obtain the service. However, solving this N-services MDP is computationally expensive due to the curse of dimensionality. To overcome this challenge, we show that the optimal policy for a single-service MDP has an appealing threshold structure, and derive explicitly the Whittle indices for each service as a function of the number of requests from customers based on the theory of Whittle index policy. Since request arrival and service delivery rates are usually unknown and possibly time-varying, we then develop efficient learning augmented algorithms that fully utilize the structure of optimal policies with a low learning regret. The first of these is UCB-Whittle, and relies upon the principle of optimism in the face of uncertainty. The second algorithm, Q-learning-Whittle, utilizes Q-learning iterations for each service by using a two time scale stochastic approximation. We characterize the non-asymptotic performance of UCB-Whittle by analyzing its learning regret, and also analyze the convergence properties of Q-learning-Whittle. Simulation results show that the proposed policies yield excellent empirical performance.



I Introduction

An increasing number of emerging devices, such as smart wearables and mobile phones, demand varied high-data-rate services such as video streaming, web browsing, and software downloads. This has pushed service providers (e.g., CDNs) to pursue new service technologies that yield both a desirable quality of service and a good quality of experience. One such technology entails network densification by deploying edge servers, each co-located with a small base station (SBS), as in the storage-assisted future mobile Internet architecture and cache-assisted 5G systems [1]. Service requests often originate from dispersed sources, including end users, devices, and sensors located around the globe. As a result, the distributed infrastructure of such a storage-assisted system (e.g., Akamai CDN, Google Analytics, etc.) typically has a hub-and-spoke topology, as illustrated in Figure 1.

A major challenge in such storage-assisted network edges is to decide which services each edge server should host in order to efficiently satisfy the demands of end users. We refer to this as the optimal service placement problem. The service placement needs to take into account the heterogeneity of edge servers, end user requests, services, etc. For example, end users send service requests to geographically distributed edge servers near them. The requests are satisfied with negligible latency if the services are available at the edge; otherwise, requests are forwarded to the central service warehouse (CSW) at the cost of a longer latency. While the CSW is often located in a well-provisioned data center, resources are typically limited at the edge, i.e., the capacity of edge servers is usually small. Furthermore, the network interface and hardware configuration of edge servers can greatly impact the response of services [14]. These issues pose significant challenges to ensuring that customers (we use the terms “end user” and “customer” interchangeably in this paper) waiting at the network edge for a service face minimal delay.

Fig. 1: The distributed model for a typical service placement comprises a single central service warehouse and multiple edge servers connected by cheap backhaul links.

Existing works, e.g., [22, 16, 44, 43, 53], that address the issue of placing services at the edge to minimize customer delays suffer from severe limitations. First, they consider deterministic system models, and hence the resulting service placements are highly pessimistic. Second, their solutions often assume that the system dynamics (e.g., service request rates) are fixed and known to the service provider, whereas these quantities are typically unknown and possibly time varying in real systems. In departure from these works, we first model the network edge using an appropriate stochastic framework. Second, we leverage modern machine learning (ML) techniques to efficiently learn the unknown network system parameters and derive optimal service placement strategies. This is particularly important since, with the advent of cost-effective ML solutions, the network operator can deploy them to optimize the system in real time. In particular, we raise the following question:

Can we leverage ML for maximizing the benefits of storage resources available at the edges and optimizing the performance of service placement at the network edge?

Main Results

We first focus on the scenario in which the system parameters, e.g., arrival rates and service delivery rates, are known to the decision maker. We pose the problem of placing services at the network edge so as to minimize the cumulative delays faced by customers waiting in queue as a Markov decision process (MDP), in which the system state describes, for each service, the number of customers currently waiting at the edge to obtain that service. This MDP is intractable in general: the optimal policy has a complex structure that depends on the entire state-space description, and hence is neither easily implemented nor easily learned.

Our MDP can be viewed as a special instance of the Restless Multi-Armed Bandit (RMAB) [55, 19], where each service is a bandit, and its queue length (number of customers waiting to receive it) is the state of the bandit. The problem of minimizing the cumulative customer delay is seen to be equivalent to maximizing the rewards earned by “pulling bandits” (placing services at edge) under the constraint that the total number of services placed at the edge is less than the edge capacity. However, the RMAB based formulation in general suffers from the curse of dimensionality [8] and is provably hard [42].

Since the Whittle index policy [55] is computationally tractable and known to be asymptotically optimal [52], we propose to use it to overcome this computational challenge. We show that our MDP is indexable, and explicitly derive the Whittle index of each service as a function of the number of customers waiting to receive it. Our derivation relies on a key result showing that the single-service MDP, in which a service is charged a price for being placed at the edge in addition to being penalized for the outstanding customers waiting for it, has an optimal policy of threshold type.

Since the system parameters, e.g., service request rates and delivery rates, are typically unknown and time varying in practice, we further design efficient learning augmented algorithms to address this challenge. Although the problem can be posed as an RL problem, the resulting learning regret scales linearly with the size of the state-space [23], and hence would be too large for the algorithm to be of any practical use. Our contribution is to develop efficient RL based algorithms that achieve a low learning regret [33] by utilizing the structure of optimal policies.

Our first algorithm, entitled UCB-Whittle, relies upon the principle of optimism in the face of uncertainty [29, 3, 37]. UCB-Whittle combines the asymptotic optimality property of the Whittle policy [52] with the efficient exploratory behavior of UCB-based [3] learning algorithms. Thus, UCB-Whittle maintains a confidence ball that contains the true (unknown) parameter with high probability. Within an RL episode, it uses the Whittle index policy based on an optimistic estimate of the parameter chosen from within this confidence ball. Since the computational complexity of the Whittle index policy scales linearly with the number of services, UCB-Whittle is practical, with a learning regret that does not scale with the number of services.

Our second algorithm, entitled Q-learning-Whittle, utilizes Q-learning iterations for each service via a two-timescale stochastic approximation. It leverages the threshold structure of the optimal policy, under which the action at each state is deterministic. Hence, Q-learning-Whittle only learns the Q-values of those state-action pairs visited by the current threshold policy. This new update rule substantially improves the sample efficiency of Q-learning. Our numerical evaluations show that the proposed policies yield excellent empirical performance.

The rest of the paper is organized as follows. We describe the system model and the MDP formulation in Section II. We present the Lagrangian relaxation of the original MDP and derive the Whittle index policy in Section III. We design the novel learning augmented index policies UCB-Whittle and Q-learning-Whittle in Section IV, and prove their asymptotic performance guarantees in Section V. Numerical results are presented in Section VI. We discuss related work in Section VII and conclude the paper in Section VIII. Additional proof details are presented in Appendix IX.

II System Model

We consider a heterogeneous network, as shown in Figure 1, where each geographically distributed edge server covers an area to provide services to end users. End users send their requests for different services to nearby edge servers. An edge server delivers a requested service directly if the service is placed on it; otherwise, the request is forwarded to the CSW at the cost of a longer latency.

In particular, we focus on one edge server in Figure 1 due to the distributed nature of our system. A single service provider offers N distinct services, indexed by i = 1, …, N, to end users via this edge server. For simplicity, we assume that all services are unit sized; our model can be easily generalized to variable-sized services. The capacity of the edge server is B, where B < N.

Fig. 2: A service placement system where the edge server can host at most B services at any moment in time.

We formulate the service placement problem using the framework of a Markov decision process (MDP) [45]. In particular, we model the service requests as a continuous-time Markov process. We first specify the states, actions, costs, and transition kernel of the corresponding MDP.

State: The requests from end users for service i arrive at the edge server according to a Poisson process with arrival rate λ_i, and the service delivery time is exponentially distributed with mean 1/μ_i. We model each service as a bandit (in the rest of the paper, we use the terms “bandit” and “service” interchangeably), and describe its state by a queue, as shown in Figure 2. The queue length S_i(t) of service i at time t represents the number of customers (i.e., requests) waiting for service i since its latest placement at the edge server. We assume that S_i(t) can take arbitrarily large but bounded values. The state of the system at time t is then defined as S(t) = (S_1(t), …, S_N(t)).
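To make these per-service dynamics concrete, the sketch below simulates one queue as a continuous-time birth-death process (Poisson request arrivals, exponentially distributed delivery while the service is hosted). The function and parameter names are illustrative choices, not notation from the paper.

```python
import random

def simulate_service_queue(lam, mu, hosted, horizon, seed=0):
    """Simulate one service's queue. lam: request arrival rate (Poisson);
    mu: delivery rate when the service is hosted at the edge; hosted:
    whether the service is placed (active); horizon: simulated time span.
    Returns the time-averaged queue length."""
    rng = random.Random(seed)
    t, q = 0.0, 0
    area = 0.0                     # integral of queue length over time
    while t < horizon:
        # total event rate: arrivals always, departures only if hosted
        rate = lam + (mu if hosted and q > 0 else 0.0)
        dt = rng.expovariate(rate)
        area += q * min(dt, horizon - t)
        t += dt
        if t >= horizon:
            break
        # next event is an arrival w.p. lam/rate, otherwise a departure
        if rng.random() < lam / rate:
            q += 1
        else:
            q -= 1
    return area / horizon
```

A hosted service with λ < μ behaves like a stable M/M/1 queue, whereas a passive service accumulates requests without bound, which is the tension the placement policy must manage.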

Action: Decisions are made at the moments when a bandit changes state, i.e., when there is a new service request. Upon each decision, two actions can be chosen for each bandit: action 0 (passive), i.e., not placing the service at the edge server, and action 1 (active), i.e., placing the service at the edge server.

Transition Kernel: Throughout this paper we consider the restless multi-armed bandit (RMAB) problem [55], in which each bandit is modeled as a continuous-time birth-and-death process (BDP); this model has been widely used for real-world applications, e.g., [30, 32]. Specifically, the state of bandit i can evolve from S_i to state S_i + 1 or state S_i − 1, both when it is active and when it is passive. The transition rates of the vector S(t) are expressed as

S → S + e_i with rate λ_i(S_i),   S → S − e_i with rate μ_i(S_i, a_i),   (1)

where e_i is an N-dimensional vector with all elements zero except the i-th entry, which equals 1. In particular, given the action taken in state S, the dynamics of each bandit is independent of the others, as described in (1).

Different from a conventional BDP, we define the transition rates as follows, owing to the unique nature of our service placement model. The birth rates satisfy λ_i(S_i) = λ_i, which represents the rate of new requests from end users for service i, and the death rates satisfy μ_i(S_i, a_i) = μ_i(S_i) 1{a_i = 1}, i.e., departures occur only if service i is active. A state-dependent death rate enables us to model more complicated scenarios, such as the case in which the wireless channel capacity between the services and the edge server changes dynamically with the service state [32]. In this paper, we assume the departure rate of the classic M/M/1 queue for simplicity, i.e., μ_i(S_i) = μ_i.

Cost: For each service i, let C_i(S_i, a_i) denote the cost per unit time incurred in state S_i when the service is either passive (a_i = 0) or active (a_i = 1). The cost represents the average latency experienced by end users in obtaining the service from the edge server.

Optimal Service Placement Problem: We consider a policy that determines which bandits are active, i.e., which services to place on the edge server at any moment in time. Since we model the bandits as continuous-time BDPs, a special case of continuous-time Markov processes, we can restrict attention to policies that make decisions based only on the current state of the bandits. For a given policy π, let S_i^π(t) be the state of bandit i at time t and S^π(t) = (S_1^π(t), …, S_N^π(t)). Denote the actions as a^π(t) = (a_1^π(t), …, a_N^π(t)), where a_i^π(t) = 1 means that service i is made active at time t under policy π, and a_i^π(t) = 0 means that it is made passive.

The expected average service delivery latency to end users under policy π is defined as

C̄^π := lim sup_{T → ∞} (1/T) E_π [ ∫_0^T Σ_{i=1}^N C_i(S_i^π(t), a_i^π(t)) dt ],   (2)

where E_π denotes the conditional expectation given policy π, and Π is the set of feasible policies that also ensure that the resulting state process is ergodic.

Our goal is then to derive a policy that minimizes the average delivery latency under the edge server capacity constraint. Hence, the optimal service placement problem can be formulated as the following MDP:

min_{π ∈ Π} C̄^π   s.t.   Σ_{i=1}^N a_i^π(t) ≤ B,  ∀t.   (3)

The following result states the existence of the value function and the average cost for the MDP (II); it follows directly by applying results from Chapter 6 of [45].

Lemma 1.

There exist a constant β and a function V(·) that satisfy the dynamic programming equation

0 = min_a [ C(S, a) − β + Σ_{S′ ≠ S} q(S, S′; a) ( V(S′) − V(S) ) ],   (4)

where C(S, a) = Σ_i C_i(S_i, a_i) and q(·, ·; a) are the transition rates in (1). A stationary policy that realizes the minimum in (4) is optimal, with β being the optimal average cost and V(·) the value function [45]. An optimal policy can be obtained numerically by using value iteration or policy improvement algorithms.
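As an illustration of Lemma 1, the dynamic programming equation can be solved numerically by relative value iteration after uniformizing and truncating the chain. The sketch below does this for the joint N-service MDP, assuming (for illustration only) a holding cost equal to the total queue length and at most B hosted services; its state space of size (S+1)^N is precisely what makes the approach impractical for large N.

```python
import itertools
import numpy as np

def joint_value_iteration(lams, mus, B, S=8, iters=2000):
    """Relative value iteration for the joint N-service MDP, with each
    queue truncated at S. Returns (average cost per unit time, number of
    states). Assumptions: holding cost sum(q) per unit time, at most B
    services hosted, uniformization constant sum(lams) + sum(mus)."""
    N = len(lams)
    Lam = sum(lams) + sum(mus)
    states = list(itertools.product(range(S + 1), repeat=N))
    idx = {st: k for k, st in enumerate(states)}
    actions = [a for a in itertools.product((0, 1), repeat=N) if sum(a) <= B]
    h = np.zeros(len(states))
    g = 0.0
    zero = tuple([0] * N)
    for _ in range(iters):
        Th = np.empty_like(h)
        for st in states:
            best = float("inf")
            for a in actions:
                val = float(sum(st))          # holding-cost rate
                used = 0.0
                for i in range(N):
                    up = list(st); up[i] = min(st[i] + 1, S)
                    val += lams[i] * h[idx[tuple(up)]]
                    used += lams[i]
                    if a[i] == 1 and st[i] > 0:
                        dn = list(st); dn[i] = st[i] - 1
                        val += mus[i] * h[idx[tuple(dn)]]
                        used += mus[i]
                val += (Lam - used) * h[idx[st]]   # self-loop mass
                best = min(best, val / Lam)
            Th[idx[st]] = best
        g = Lam * Th[idx[zero]]               # average cost per unit time
        h = Th - Th[idx[zero]]                # relative values
    return g, len(states)
```

Even for two services truncated at queue length 5 there are already 36 joint states, and the count grows as (S+1)^N, which motivates the index-based decomposition developed next.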

Though in principle one could use the iterative algorithms mentioned above, in reality the “curse of dimensionality”, i.e., the exponential growth of the size of the state-space with the number of services N, renders such a solution impractical. Thus, we resort to the Whittle index policy, which is computationally appealing.

III Service Placement using Whittle Index Policy

In general, even if the state-space were bounded (say, by truncating the queue lengths), the MDP (II) is hard to solve because the optimal decisions for different services are strongly coupled. We observe that problem (II) can be posed as a restless multi-armed bandit (RMAB) problem [55] in which S_i is the state of the i-th bandit. A tractable solution, i.e., one whose computational complexity scales linearly with N, was proposed by Whittle [55]. We briefly describe the notion of indexability and the Whittle index policy, and then show that our problem is indexable.

III-A Whittle Index Policy

Consider the following problem, which arises from (II) by relaxing its hard constraint to a time-average constraint:

min_{π ∈ Π} C̄^π   s.t.   lim sup_{T → ∞} (1/T) E_π [ ∫_0^T Σ_{i=1}^N a_i^π(t) dt ] ≤ B.   (5)

It can be shown that the relaxed problem (III-A) is computationally tractable, with a computational complexity that scales linearly with N, and admits a decentralized solution in which the decisions for each service are made only on the basis of the state of that service.

Consider the Lagrangian associated with this problem,

L(π, W) = lim sup_{T → ∞} (1/T) E_π [ ∫_0^T Σ_{i=1}^N ( C_i(S_i^π(t), a_i^π(t)) + W a_i^π(t) ) dt ] − W B,   (6)

where W is the Lagrange multiplier. Also define the associated dual problem,

max_W min_{π ∈ Π} L(π, W).   (7)

Since the Lagrangian decouples into a sum of individual service MDPs, it turns out that in order to evaluate the dual function at W, it suffices to solve N single-service MDPs of the following form (π_i is a policy for the i-th service):

min_{π_i} lim sup_{T → ∞} (1/T) E_{π_i} [ ∫_0^T ( C_i(S_i(t), a_i(t)) + W a_i(t) ) dt ].   (8)

Definition 1.

(Indexability): Consider the single-service MDP (8) for the i-th service. Let D_i(W) denote the set of states in which the optimal action is to choose action 0 (passive). Then, the i-th MDP is indexable if the set D_i(W) increases monotonically with W, i.e., W ≤ W′ implies D_i(W) ⊆ D_i(W′). The original MDP (II) is indexable if each of the N single-service MDPs is indexable.

Definition 2.

(Whittle Index) If the single-service MDP for the i-th service is indexable, then the Whittle index in state S_i is denoted W_i(S_i), and is given by W_i(S_i) := inf { W : S_i ∈ D_i(W) }. Thus, it is the smallest value of the price W such that the optimal policy for service i is indifferent between the passive and active actions when the state equals S_i.

III-B MDP (II) is Indexable

Our proof of indexability relies on the “threshold” property of the optimal policy for the single-service MDP, i.e., the service is placed on the edge server only when the number of requests for it at the edge is above a certain threshold.

Proposition 1.

Fix a price W, and consider the single-service MDP (8). The optimal policy for this problem is of threshold type: the service is made active only when its queue length is above a threshold, which depends upon W.

We now compute the stationary distribution of a threshold policy as a function of its threshold. This result is useful while proving indexability of the original MDP.

Proposition 2.

The stationary distribution π_R of the threshold-R policy satisfies

π_R(R̃) = (μ_i / λ_i) π_R(R),   π_R(R + k) = (λ_i / μ_i)^k π_R(R),  k ≥ 0,

where R̃ is a dummy state representing states 0 to R − 1.
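Under the classic M/M/1-type dynamics assumed above, this distribution can be written down directly; the sketch below is our own derivation for illustration, identifying the dummy state with queue length R − 1 and using the geometric tail with ratio ρ = λ/μ above the threshold.

```python
def threshold_stationary_dist(lam, mu, R, smax=200):
    """Stationary distribution of the single-service threshold-R policy
    (passive below R, active at or above). States below R - 1 are
    transient, so the chain lives on {R-1, R, R+1, ...}; balance gives
    pi(R-1) = (mu/lam) * pi(R) and a geometric tail above R. The tail is
    truncated at smax for normalization. Requires R >= 1 and lam < mu."""
    rho = lam / mu
    assert 0 < rho < 1 and R >= 1
    w = {R - 1: 1.0 / rho}               # unnormalized weight, pi(R) = 1
    for k in range(0, smax - R + 1):
        w[R + k] = rho ** k
    Z = sum(w.values())
    return {s: p / Z for s, p in w.items()}
```

For example, with λ = 0.5, μ = 1, and R = 2, the chain spends half its time in the lumped passive state.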

We conclude this subsection by showing that all bandits are indexable under our model.

Proposition 3.

The MDP (II) is indexable.

We are now ready to derive the Whittle indices for the MDP (II).

Proposition 4.

The Whittle index W_i(R) is given by

W_i(R) = ( C̄_{R+1} − C̄_R ) / ( b̄_R − b̄_{R+1} ),   (11)

when the right-hand side of (11) is non-decreasing in R, where a subscript denotes that the associated quantities (the average cost C̄ and the average fraction of active time b̄) involve a threshold policy with the value of the threshold set equal to this value.

Remark 1.

If (11) is non-monotone in R, then the Whittle index cannot be derived by equating the average costs of two consecutive threshold policies. Instead, the algorithm described in [20, 31] provides a means to obtain the Whittle index. In particular, the threshold-0 policy is compared to an appropriately chosen threshold policy R₁, yielding a common Whittle index for all thresholds 0, …, R₁ − 1; the right choice of R₁ is the result of an optimization problem. Similarly, the index in state R₁ is computed by comparing the threshold-R₁ policy to an appropriate threshold policy R₂ > R₁, and the resulting Whittle index applies for all thresholds R₁, …, R₂ − 1. The algorithm terminates when, in one iteration, all remaining states have been assigned an index.

Remark 2.

Since the cost function and the stationary probabilities are known,  (11) can be numerically computed.

Remark 3.

From (11), it is clear that the index of bandit i does not depend on the number of requests for the other services j ≠ i. Therefore, it provides a systematic way to derive simple policies that are easily implementable.
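Operationally, the resulting policy is simple: at each decision epoch, rank the services by the index evaluated at their current queue lengths and host the top B. A minimal sketch, with the per-service index tables passed in as functions (a hypothetical interface, not the paper's notation):

```python
def whittle_policy(indices, states, B):
    """One decision step of the Whittle index policy. indices[i] maps
    service i's current queue length to its index value; states holds
    the current queue lengths. Returns the 0/1 activation vector that
    hosts the B services with the largest indices."""
    ranked = sorted(range(len(states)),
                    key=lambda i: indices[i](states[i]), reverse=True)
    active = set(ranked[:B])
    return [1 if i in active else 0 for i in range(len(states))]
```

Because each index depends only on that service's own state, the per-step cost is just a sort over N values.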

III-C Asymptotic Optimality

The Whittle index policy is in general not an optimal solution to the original problem (II). However, it has been proved to be asymptotically optimal, e.g., [54, 52, 40], as the number of bandits that can simultaneously be made active grows proportionally with the total number of bandits.

IV Reinforcement Learning Based Whittle Index Policy

The computation of the Whittle indices requires knowledge of the arrival rates λ_i and service rates μ_i. Since these quantities are typically unknown and possibly time-varying, the assumption that they are known to the service provider is not practical. Hence, we now develop learning algorithms [33] that ensure almost the same performance as the Whittle index policy. In particular, we propose two reinforcement learning algorithms that fully exploit the structure of our proposed Whittle index policy, and we quantify the sub-optimality of our learning algorithms in terms of learning regret [12]. For ease of exposition, we present the algorithm design and main results in this section, and relegate the proofs to Section V.

IV-A UCB-Whittle

We first propose an Upper Confidence Bound (UCB) type algorithm based on the principle of optimism in the face of uncertainty [3, 33, 37]. To simplify the exposition, we sample our continuous-time system at the discrete time-instants at which the system state changes, i.e., when either an arrival or a departure occurs. This gives rise to a discrete-time controlled Markov process [45], and henceforth we work exclusively with this discrete-time system. We use 1/μ_i to denote the mean delivery time of service i, and 1/λ_i to denote the mean inter-arrival time of requests for the i-th service. In what follows, we parameterize the system by the mean inter-arrival times and mean delivery times of all services. This is useful since the empirical estimates of mean inter-arrival times and mean delivery times are unbiased, while the empirical estimates of arrival rates and delivery rates are biased; this parametrization therefore greatly simplifies the exposition and analysis. We denote θ_i = (1/λ_i, 1/μ_i) and θ = (θ_1, …, θ_N). Thus θ denotes the vector comprising the true system parameters.

Let S, S′ be two possible system state values, and let a be an allocation (placement decision) for the combined system composed of N services. We write p_θ(S, S′; a) for the one-step controlled transition probability of the combined system from state S to state S′ under allocation a when the system parameter is θ; the subscript denotes the dependence upon the value of the true parameter. Since the Whittle indices are also a function of the parameter θ, we denote them W(θ) to depict this dependence, and by notational abuse we also use W(θ) to denote the Whittle index policy that uses the indices W(θ) for placing services.

The learning algorithm knows that θ belongs to a finite set Θ. For a stationary control policy π that chooses the allocation on the basis of the current state, we let C̄(π, θ′) denote its infinite-horizon average expected cost (i.e., latency) when the true system parameter equals θ′, and C_T(π, θ′) the cumulative cost incurred during T time-steps. As is common in the RL literature [23, 33, 37], we define the “gap” Δ as follows,

Δ := min { C̄(W(θ′), θ) − C̄(W(θ), θ) : θ′ ∈ Θ, C̄(W(θ′), θ) > C̄(W(θ), θ) }.   (12)

Throughout, we assume that we are equipped with a probability space (Ω, F, P) [47].

IV-A1 Learning Setup

We assume that the algorithm operates for T time steps. We consider an “episodic RL” setup [50] in which the total operating horizon of T steps is composed of multiple “episodes”, each consisting of H consecutive time steps. Let t_k denote the starting time of the k-th episode, and E_k the set of time-slots that comprise the k-th episode. We assume that the system state is reset to 0 at the beginning of each episode; this can be attained by discarding, at the end of each episode, those users who have not received service by then. Denote by F_t the sigma-algebra [48] generated by the random variables observed up to time t. A learning rule is a collection of maps φ_t (each mapping F_t to the action space, i.e., the set of feasible placements) that utilizes the past observation history in order to make the service placement decision for the current time-step t.

IV-A2 Learning Rule

Let θ̂_i(t) denote the empirical estimate of the mean inter-arrival time and the mean delivery time of service i at time t, i.e., the sample means of the inter-arrival times and of the delivery times observed for service i up to time t ((13)-(14)), and construct the confidence interval associated with θ̂_i(t) as follows,

[ θ̂_i(t) − r_i(t), θ̂_i(t) + r_i(t) ],

where r_i(t) denotes the radius of the confidence interval, given as

r_i(t) = c √( log(1/δ) / n_i(t) ),

where δ is a user-specified parameter, n_i(t) denotes the number of samples obtained for estimating the arrival rate (delivery rate) of the i-th service until time t, and c is a constant. As discussed in Theorem 1, we will choose δ appropriately as a function of T in order to ensure that the learning algorithm has a good performance.

The confidence ball of the overall system is denoted as C_t, the Cartesian product of the per-service confidence intervals (though we define confidence intervals and empirical estimates at every time, they are used only at the times that mark the beginning of a new episode). The learning rule then derives an “optimistic estimate” θ_k of the true parameter as follows,

θ_k ∈ arg min_{θ′ ∈ Θ ∩ C_{t_k}} C̄(W(θ′), θ′).   (19)

In case Θ ∩ C_{t_k} is empty, θ_k is chosen uniformly at random from the set Θ. During E_k, the learning rule implements W(θ_k). We summarize this as Algorithm 1.
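A minimal sketch of the two ingredients of this rule, with hypothetical constants rather than the paper's exact ones: a confidence radius that shrinks as more samples accrue, and the optimistic selection over the finite candidate set with the uniform fallback when the ball is empty.

```python
import math
import random

def confidence_radius(n_samples, t, c=1.0):
    """UCB-style radius for an empirical mean built from n_samples
    observations by time t; shrinks as O(sqrt(log t / n)). The constant
    c and the log t choice are illustrative assumptions."""
    if n_samples == 0:
        return float("inf")
    return c * math.sqrt(math.log(max(t, 2.0)) / n_samples)

def optimistic_parameter(candidates, predicted_cost, estimates, radii, rng=random):
    """Optimistic selection over a finite candidate set: keep the
    candidates lying inside the confidence ball (within per-coordinate
    radii of the empirical estimates) and return the one with the
    lowest predicted average cost; if the ball is empty, fall back to a
    uniform draw from the full set, mirroring the rule in the text."""
    ball = [th for th in candidates
            if all(abs(th[j] - estimates[j]) <= radii[j]
                   for j in range(len(th)))]
    if not ball:
        return rng.choice(candidates)
    return min(ball, key=predicted_cost)
```

The chosen parameter is then handed to the Whittle index computation for the episode, as in Algorithm 1.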

Initialize: empirical estimates and sample counts n_i = 0 for all services i.
for Episodes k = 1, 2, … do
     Update the estimates θ̂ by using (13)-(14).
     Calculate the confidence ball according to (IV-A2).
     Calculate the optimistic estimate θ_k by solving (19).
     Calculate the Whittle index policy W(θ_k) and implement it in the current episode.
Algorithm 1 UCB-Whittle

IV-A3 Learning Regret

We define the regret of the learning rule φ as follows,

R(T) := C_T(φ, θ) − T C̄(W(θ), θ),   (20)

where C_T(φ, θ) is the expected value of the cost incurred during T time-slots when φ is used on the system that has parameter equal to θ. Our goal is to design a learning rule that minimizes the expected value of the regret R(T). Thus, our benchmark is the Whittle index rule W(θ) designed for the system with parameters θ. We next present the regret of the UCB-Whittle algorithm.

Theorem 1.

Consider the UCB-Whittle algorithm applied to adaptively place services on the edge server so as to minimize the average latency faced by users. Then, with probability greater than 1 − δ, the regret is bounded by O(log T), with a pre-factor that depends upon the user-specified parameter δ, a lower bound on the mean inter-arrival and delivery times, the gap Δ of (12), and the quantity appearing in Lemma 9. With δ chosen appropriately as a function of T, we obtain that the expected regret of UCB-Whittle is upper-bounded as O(log T).

Proof Sketch: We show that the sample-path regret can be upper-bounded by the number of episodes in which θ_k is not equal to θ, i.e., the sub-optimal episodes. We then decompose the sample space into a “good set” and a “bad set”. Loosely speaking, on the good set each service is sampled sufficiently many times, and moreover the sample estimates concentrate around the true value θ. We show that these properties of the good set imply a bound on the number of sub-optimal episodes, and hence on the regret. The expected regret on the bad set is bounded primarily by bounding its probability. The proof details are presented in Section V-A2.

Remark 4.

It is by now well known that the learning regret of commonly encountered non-trivial learning tasks grows asymptotically as log T [29, 33, 3]. A key research problem is to design a learning algorithm that yields the lowest possible pre-factor to this logarithmic regret. Theorem 1 shows that UCB-Whittle attains O(log T) regret. An interesting open question is whether this regret can be attained with computationally tractable algorithms together with the “optimal” instance-dependent [33] pre-factor.

IV-A4 Computational Complexity

Problem (19) needs to be solved at the beginning of the k-th episode. Solving (19) requires us to evaluate the average cost of the Whittle policy for each possible system parameter θ′ ∈ Θ. Deriving the exact average cost of the Whittle policy is computationally expensive because of the curse of dimensionality [9]. However, the Whittle index policy is asymptotically optimal as the population size scales up, and it is shown in [52] that the limiting value of the normalized average cost (normalization involves dividing the expected cost by the population size) equals the value of the allocation problem in which the hard constraint is relaxed to a time-average constraint. This means that for a large population we can approximate the cost by the optimal value of the relaxed problem. The relaxed problem is tractable since the decisions for different services decouple, so the computational complexity scales linearly with the number of services. Moreover, since a threshold policy is optimal for each individual service, and the stationary distribution of a single service employing a threshold policy is easily computed, the required average cost can be obtained easily.

IV-B Q-learning-Whittle

We further design a heuristic learning algorithm for the Whittle index, based on the off-policy Q-learning algorithm. The proposed algorithm leverages the threshold structure of the optimal policy developed in Section III to learn the state-action values, which significantly differs from the conventional ε-greedy Q-learning for RMABs [6].

We first rewrite the single-service MDP in (III-A) as


where we drop the subscript on the policy since we consider a single service. (21) can be further formulated as a dynamic programming equation, i.e.,


where V(s) is the value function for the bandit in state s under policy π, and β is the optimal average cost. Given (IV-B), we define the state-action value function Q(s, a) as


Since the Whittle index for the bandit in state s is defined as the value W(s) such that actions 0 and 1 are equally preferred in state s, i.e., Q(s, 0) = Q(s, 1), we can express the closed form for W(s) as


However, the transition probabilities are unknown, and therefore we cannot directly calculate the Whittle index according to (IV-B). In the following, we propose a Q-learning based algorithm to jointly update Q(s, a) and W(s).

To efficiently learn the Whittle index, we leverage the threshold structure of our policy as follows. Given a threshold policy with threshold R, any state s < R is made passive, and any other state is made active. Q-learning under this policy only updates Q(s, 0) for s < R and Q(s, 1) for s ≥ R; the remaining Q-values are unchanged and set to infinity. Under this setting, the desired Whittle index W(R) is the value satisfying


We notice from (IV-B) and (25) that W(R) is a function only of the Whittle indices of previous states, i.e., W(0), …, W(R − 1), which inspires the following state-by-state update algorithm.

Since each bandit shares the same update mechanism, we drop the subscript i for notational simplicity. We consider the same learning setting as in UCB-Whittle, i.e., the algorithm operates in an episodic setup, with each episode containing consecutive time steps. Inside each episode k, the Q-learning recursion for threshold R is defined as in (26),


where t_k is the start time slot of episode k, and W_k is the Whittle index estimate for episode k, which is kept fixed for the entire episode. The learning rate α_n is assumed to satisfy the standard stochastic approximation conditions Σ_n α_n = ∞ and Σ_n α_n² < ∞.

Our goal is to learn the Whittle index by iteratively updating it in the following manner,


with the step sizes β_k satisfying the standard two-timescale conditions, i.e., decaying faster than the Q-learning step sizes. For each state R, the Whittle index is updated according to the two-timescale stochastic approximation given by (26) and (IV-B). The entire procedure is summarized in Algorithm 2, and its convergence is presented in Lemma 2.

1:Initialize: Q-values and Whittle index estimates for all states.
2:for each state R = 0, 1, … do
3:     Set the threshold policy with threshold R.
4:     for Episodes k = 1, 2, … do
5:         The initial state is assumed to be 0 for all bandits.
6:         Update W_k according to (IV-B).
7:         for each time step in the episode do
8:              Update Q(s, 0) for s < R according to (26).
9:              Update Q(s, 1) for s ≥ R according to (26).               
10:     W(R) ← W_k.
11:Return: W(R) for all states R.
Algorithm 2 Q-learning-Whittle

Note that Algorithm 2 is presented for a single bandit. The server simply repeats the procedure of Algorithm 2 N times to obtain the Whittle indices of all bandits. Since the update mechanism is identical across bandits, we omit the algorithm description for the whole system. When accounting for the capacity of the server, i.e., that it can serve at most B bandits simultaneously, an easy implementation is to divide the bandits into multiple groups of size B, and let the server sequentially learn the Whittle indices of the bandits in each group.
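The two-timescale scheme above can be sketched as follows for a single service. The step sizes, the exploration at the threshold state, and the average-cost tracking constant are our own illustrative choices, not the paper's exact recursions (26).

```python
import random

def q_learning_whittle(lam, mu, R, S=20, episodes=100, steps=500, seed=0):
    """Two-timescale sketch of Q-learning-Whittle for one service.
    Within an episode the price estimate W is frozen and average-cost
    Q-values are updated on the fast timescale under the threshold-R
    policy (passive below R, active otherwise; both actions are tried
    at R to learn the indifference point). Across episodes W moves on
    the slow timescale toward Q(R, 0; W) = Q(R, 1; W), the condition
    defining the Whittle index at R."""
    rng = random.Random(seed)
    Lam = lam + mu                        # uniformization constant
    p_arr = lam / Lam                     # P(next event is an arrival)
    Q = [[0.0, 0.0] for _ in range(S + 1)]
    W, g = 0.0, 0.0                       # index estimate, avg-cost tracker
    for ep in range(episodes):
        s = 0
        for n in range(1, steps + 1):
            # threshold policy, except explore both actions at R
            a = rng.randint(0, 1) if s == R else int(s >= R)
            cost = s + (W if a == 1 else 0.0)
            if rng.random() < p_arr:      # sampled uniformized transition
                s2 = min(s + 1, S)
            elif a == 1:
                s2 = max(s - 1, 0)
            else:
                s2 = s                    # passive: mu-rate is a self-loop
            a2 = int(s2 >= R)
            alpha = 10.0 / (10.0 + n)     # fast step size
            Q[s][a] += alpha * (cost - g + Q[s2][a2] - Q[s][a])
            g += 0.01 * (cost - g)        # relative-cost tracking term
            s = s2
        # slow step toward indifference at the threshold state
        W += (Q[R][0] - Q[R][1]) / (1.0 + ep)
    return W, Q
```

The negative feedback through the cost term (a larger W inflates Q(R, 1) and pulls W back down) is what drives the iterates toward the indifference point.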

Lemma 2.

The update of the Whittle index in (IV-B) converges almost surely to the true Whittle index.

Remark 5.

The conventional Q-learning based method [6] relies on an epsilon-greedy mechanism to balance the trade-off between exploration and exploitation. The major difference between our Q-learning-Whittle and the traditional Q-learning based index policy [6] lies in the update of the Q-value in (26). In the traditional method, when a bandit moves into a new state, it greedily selects the action that maximizes the Q-value with high probability, and selects a random action with the remaining exploration probability; it therefore needs to learn the Q-value of every state-action pair. In contrast, our proposed threshold-type Q-learning-Whittle takes a deterministic action at each state as in (26). Therefore, under each threshold policy, it only learns the Q-values of the state-action pairs that follow the current threshold policy. Moreover, the Whittle index of the current state relies only on the Whittle indices of previous states, whereas the Whittle indices of all states are coupled in the conventional Q-learning based method. The proposed scheme thus substantially enhances sample efficiency and converges faster, as verified in the numerical study.
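The contrast in Remark 5 can be made concrete with two schematic action-selection rules. Both functions are illustrative, not the paper's code: `Q` is a dictionary mapping (state, action) pairs to estimated costs, and the threshold is a free parameter.

```python
import random

def eps_greedy_action(Q, state, actions, eps=0.1):
    """Conventional Q-learning: a random action with probability eps, the
    cost-minimizing action otherwise. Every (state, action) pair must
    eventually be visited and learned."""
    if random.random() < eps:
        return random.choice(actions)
    return min(actions, key=lambda a: Q.get((state, a), 0.0))

def threshold_action(state, threshold):
    """Q-learning-Whittle: the action is a deterministic function of the
    state, so only policy-consistent (state, action) pairs are ever sampled."""
    return 1 if state > threshold else 0
```

The deterministic rule is what lets Q-learning-Whittle skip learning Q-values for pairs the current threshold policy never generates.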

V Proofs of Main Results

In this section, we provide detailed proofs of Theorem 1 on UCB-Whittle and Lemma 2 on Q-learning-Whittle.

V-a Proof of Theorem 1 on UCB-Whittle

Throughout, we will make the following assumption regarding the set that is known to contain the true parameter .

Assumption 1.

The process that denotes the number of outstanding service requests under the application of the Whittle policy is ergodic [38], i.e., the Markov chain is positive recurrent. Moreover, the associated average cost is finite.

V-A1 Preliminary Results

We provide the following equivalent formulation of UCB-Whittle. This characterization turns out to be useful while analyzing its learning regret.

Lemma 3.

UCB-Whittle can equivalently be described as follows. At the beginning of each episode, the learning algorithm computes a certain “value” for each candidate parameter in the confidence set as follows,


where the confidence interval at the current time is defined in (18). The algorithm then chooses the candidate parameter that has the least value, and implements the corresponding Whittle index rule during the episode.
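The selection rule just described can be sketched generically. Here `avg_cost` is a hypothetical stand-in for the average-cost functional, and the candidate sets are toy inputs.

```python
# Sketch of the equivalent formulation in Lemma 3: each candidate parameter in
# the confidence set is assigned the best average cost achievable by any
# candidate Whittle policy under that parameter, and the algorithm picks the
# candidate with the least such value ("optimism in the face of uncertainty").

def optimistic_choice(confidence_set, policies, avg_cost):
    value = {theta: min(avg_cost(pi, theta) for pi in policies)
             for theta in confidence_set}
    return min(value, key=value.get)   # the least value wins
```

As a toy illustration, with candidates {2, 5}, policies {1, 3}, and `avg_cost = lambda pi, theta: pi + theta`, the candidate 2 attains the least value and is selected.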


Note that the operation of obtaining the optimistic estimate can equivalently be rewritten as follows


After exchanging the order of minimization, (29) reduces to


Since the Whittle index policy is asymptotically optimal [52] and there are only finitely many candidate policies in the inner-minimization problem above, (30) reduces to


However, this is exactly the problem (19) that needs to be solved in order to obtain the optimistic estimate. ∎

We now prove a concentration result for the parameter estimates.

Lemma 4.

Define the set as follows,


Consider UCB-Whittle employed with a parameter chosen so as to satisfy the stated condition. We have,


Fix the number of samples used for estimating the service rate. It then follows from Lemma 11, with the parameter set appropriately, that the probability with which the estimate of the service rate lies outside the confidence ball is less than the corresponding bound, and hence this probability is upper-bounded as claimed.

Since the number of samples can assume only finitely many values, the proof follows by applying the union bound over the number of estimation samples and the number of users. ∎
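In generic notation (the paper's exact constants are not reproduced in this excerpt), the union-bound step has the familiar form: if each of $K$ estimate/sample-count/user combinations fails its confidence bound with probability at most $\delta/K$, then

\[
\mathbb{P}\Big(\bigcup_{j=1}^{K}\mathcal{E}_j\Big) \;\le\; \sum_{j=1}^{K}\mathbb{P}(\mathcal{E}_j) \;\le\; K\cdot\frac{\delta}{K} \;=\; \delta,
\]

where $\mathcal{E}_j$ denotes the event that the $j$-th estimate falls outside its confidence ball.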

We now show that, with high probability, the value of the Whittle policy corresponding to the true parameter never exceeds a certain threshold value.

Lemma 5.

On the set we have .


It follows from the equivalent definition (28) of UCB-Whittle developed in Lemma 3 that the claimed inequality holds on this set. The proof then follows from Lemma 4. ∎

Lemma 6.

Consider the process that evolves under the application of the Whittle index rule corresponding to the optimistic estimate obtained at the beginning of each episode by solving (19). Then, on the set under consideration, the following holds true: there exists a time and a constant such that for each service,

where the indicator is equal to one if the service is placed at the given time, and zero otherwise. A similar inequality is also satisfied by the cumulative arrivals.


Note that in order for the average cost to be finite, each service must be allocated a non-zero fraction of the bandwidth at the server. Since under Assumption 1 the controlled process is ergodic and has finite cost, the cost incurred on this set is finite, and hence each service is provided a non-zero fraction of the total bandwidth at the server. The proof then follows since there are only finitely many choices. ∎

We next show that if each service has been sampled “sufficiently” many times, then the value attached to any sub-optimal parameter stays above the threshold.



where the parameter satisfies the condition stated in Lemma 9. Define the following set


where in the above we denote

On this set we can ensure lower bounds on the number of samples of a user, since the sample counts are tightly concentrated around their mean values. Since the regret at any time increases with the size of the confidence balls (18), and since this size decreases with the number of samples, we are able to obtain tight upper bounds on the regret on this set.

Lemma 7.

Let the episode index be as in (34). On this set we have that, for all subsequent episodes, the value of any sub-optimal parameter is greater than the threshold.


On this set, the number of samples satisfies the required lower bound, and hence it follows from the definition of the confidence intervals (16) that the confidence balls are correspondingly small. Consider the solution of problem (28). It then follows from Lemma 9 that the following holds on this set,

This completes the proof. ∎

Now we are ready to prove the regret of UCB-Whittle. We begin with the following result that allows us to decompose the regret into sum of “episodic regrets”.

Lemma 8.

For a learning algorithm, we can upper-bound its regret (20) by a sum of “episodic regrets” as follows,

\[
\mathcal{R}(T) \;\le\; \sum_{k=1}^{K(T)} \mathcal{R}_k,
\]

where \(K(T)\) denotes the number of episodes until time \(T\) and \(\mathcal{R}_k\) denotes the regret incurred during episode \(k\).


The proof follows since in each episode, the learning algorithm uses a stationary control policy from within the set of stabilizing controllers . ∎

V-A2 Proof of Regret of UCB-Whittle

Proof of Theorem 1.

Lemma 8, the decomposition result, shows us the following:

  1. The episodic regret is bounded in those episodes in which the optimistic estimate is equal to the true parameter.

  2. If the optimistic estimate is not equal to the true parameter, then the episodic regret is bounded by a constant times the episode length.

Define the good set as the intersection of the two sets in (32) and (35), respectively. Note that it follows from Lemma 4 and Lemma 10 that the probability of the good set is greater than the stated bound.
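In generic notation, the probability of the good set follows from a union bound over the complements of its two constituent sets:

\[
\mathbb{P}(\mathcal{G}_1 \cap \mathcal{G}_2) \;\ge\; 1 - \mathbb{P}(\mathcal{G}_1^{c}) - \mathbb{P}(\mathcal{G}_2^{c}),
\]

so high-probability bounds on each of the sets in (32) and (35) translate into a high-probability bound on their intersection.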

We thus have the following three sources of regret:

(i) Regret due to sub-optimal episodes on the good set: It follows from Lemma 5 and Lemma 7 that the regret is bounded in any such episode on this set.

(ii) Regret when the confidence intervals fail: As seen in the proof of Lemma 4, the probability that the confidence interval in an episode fails is suitably upper-bounded. Since the regret within such an episode is bounded, the cumulative expected regret from this source is upper-bounded by a constant.

(iii) Regret on the complement of the sampling set: Since the probability of this set is small, and the regret on it can be trivially upper-bounded, this component of the regret is negligible.

The proof of the claimed upper-bound on expected regret then follows by adding the above three bounds. ∎

V-B Proof of Lemma 2 on Q-Learning-Whittle

Proof of Lemma 2.

A standard theorem in stochastic convergence (Theorem 2.3.1 of [28]) states that if are updated according to