I. Introduction
An increasing number of emerging agents, such as smart wearables and mobile phones, are trending towards varied high-data-rate services such as video streaming, web browsing, and software downloads. This has urged service providers (e.g., CDNs) to pursue new service technologies that yield both a desirable quality of service and a good quality of experience. One such technology entails network densification by deploying edge servers, each of which is empowered with a small base station (SBS), e.g., the storage-assisted future mobile Internet architecture and cache-assisted 5G systems [1]. Service requests are often derived from dispersed sources, including end users, devices, and sensors located around the globe. As a result, the distributed infrastructure of such a storage-assisted system (e.g., Akamai CDN, Google Analytics) typically follows a hub-and-spoke model, as illustrated in Figure 1.
A major challenge in such storage-assisted network edges is to decide which services each edge should host in order to efficiently satisfy the demands of end users. We refer to this as the optimal service placement problem. The service placement needs to take into account the heterogeneity of edge servers, end user requests, services, etc. For example, end users send service requests to geographically distributed edge servers near them. The requests are satisfied with negligible latency if the services are available at the edge. Otherwise, requests are further sent to the central service warehouse (CSW) at the cost of longer latency. While the CSW is often located in a well-provisioned data center, resources are typically limited at the edge, i.e., the capacity of edge servers is usually limited. Furthermore, the network interface and hardware configuration of edge servers can greatly impact the response times of services [14]. These issues pose significant challenges to ensuring that customers (we use the terms "end user" and "customer" interchangeably in this paper) waiting at the network edge for a service face minimal delay.
Existing works, e.g., [22, 16, 44, 43, 53]
that address the issue of placing services at the edge to minimize customer delays suffer from severe limitations. First, they consider a deterministic system model, and hence the resulting service placements are highly pessimistic. Second, their solutions often assume that the system dynamics (e.g., service request rates) are fixed and known to the service provider, whereas these quantities are typically unknown, and possibly time-varying, in real systems. In departure from these works, our first goal is to model the network edge using an appropriate stochastic framework. Second, we leverage machine learning (ML) techniques to efficiently learn the unknown system parameters and derive optimal service placement strategies. This is particularly important since, with the advent of cost-effective ML solutions, the network operator can deploy them to optimize the system in real time. In particular, we raise the following question:
Can we leverage ML to maximize the benefits of the storage resources available at the edge and to optimize the performance of service placement at the network edge?

Main Results
We first focus on the scenario in which the system parameters, e.g., arrival rates and service delivery rates, are known to the decision maker. We pose the problem of placing services at the network edge so as to minimize the cumulative delays faced by customers waiting in queue as a Markov decision process (MDP), in which the system state records, for each service, the number of customers currently waiting at the edge for that service. This MDP is intractable in general: state-of-the-art solutions have a complex structure that depends on the entire state-space description, and hence are not easily implemented or learned.
Our MDP can be viewed as a special instance of the Restless Multi-Armed Bandit (RMAB) [55, 19], where each service is a bandit, and its queue length (the number of customers waiting to receive it) is the state of the bandit. The problem of minimizing the cumulative customer delay is then equivalent to maximizing the rewards earned by "pulling bandits" (placing services at the edge) under the constraint that the total number of services placed at the edge not exceed the edge capacity. However, the RMAB formulation in general suffers from the curse of dimensionality [8] and is provably hard [42].
Since the Whittle index policy [55] is computationally tractable, and also known to be asymptotically optimal [52], we propose to use it to overcome this computational challenge. We show that our MDP is indexable, and explicitly derive the Whittle index of each service as a function of the number of customers waiting to receive it. Our derivation relies on a key result showing that the single-service MDP, in which a service is charged a price for being placed at the edge in addition to being penalized for having outstanding customers waiting for it, admits an optimal policy of threshold type.
Since the system parameters, e.g., service request rates and delivery rates, are typically unknown and time-varying in practice, we further explore the design of efficient learning-augmented algorithms. Although this can be posed as a reinforcement learning (RL) problem, the resulting learning regret scales linearly with the size of the state space [23], and hence would be too large for the algorithm to be of any practical use. Our contribution is to develop efficient RL-based algorithms that attain low learning regret [33] by exploiting the structure of the optimal policies.
Our first algorithm, entitled UCB-Whittle, relies upon the principle of optimism in the face of uncertainty [29, 3, 37]. UCB-Whittle combines the asymptotic optimality of the Whittle policy [52] with the efficient exploratory behavior of UCB-based [3] learning algorithms. Thus, UCB-Whittle maintains a confidence ball that contains the true (unknown) parameter with high probability. Within an RL episode, it uses the Whittle index policy based on an optimistic estimate of the parameter chosen from within this confidence ball. Since the computational complexity of the Whittle index policy scales linearly with the number of services, UCB-Whittle is practical, with a learning regret that does not scale with the number of services.
Our second algorithm, entitled Q-learning-Whittle, utilizes Q-learning iterations for each service via a two-timescale stochastic approximation. It leverages the threshold structure of the optimal policy, under which the action at each state is deterministic. Hence, Q-learning-Whittle only learns the Q-values of those state-action pairs that follow the current threshold policy. This new update rule leads to a substantial enhancement of the sample efficiency of Q-learning. Our numerical evaluations show that the proposed policies yield excellent empirical performance.
The rest of the paper is organized as follows. We describe the system model and the MDP formulation in Section II. We present the Lagrangian relaxation of the original MDP and derive the Whittle index policy in Section III. We design the novel learning-augmented index policies UCB-Whittle and Q-learning-Whittle in Section IV, and the proofs of the asymptotic performance of these algorithms are given in Section V. Numerical results are presented in Section VI. We discuss related work in Section VII and conclude the paper in Section VIII. Additional proof details are presented in Appendix IX.
II. System Model
We consider a heterogeneous network, as shown in Figure 1, where geographically distributed edge servers each cover an area and provide services to end users. End users send their requests for different services to the edge servers near them. An edge server delivers a requested service directly if that service is placed on it; otherwise, the request is forwarded to the CSW at the cost of longer latency.
In particular, due to the distributed nature of our system, we focus on one edge server in Figure 1. There is one service provider that provides N distinct services, indexed by i ∈ {1, …, N}, to end users via the edge server. For simplicity, we assume that all services are unit sized; our model can easily be generalized to variable-sized services. The capacity of the edge server is B, where B < N.
We formulate the service placement problem using the framework of a Markov decision process (MDP) [45]. In particular, we model the service request dynamics as a continuous-time Markov process. We first specify the states, actions, costs, and transition kernel of the corresponding MDP.
State: Requests from end users for service i arrive at the edge server according to a Poisson process with arrival rate λ_i, and the service delivery requirement is exponentially distributed with mean 1/μ_i. We model each service as a bandit (in the rest of the paper, we use the terms "bandit" and "service" interchangeably), and describe its state by a queue, as shown in Figure 2. The queue length of service i at time t is denoted S_i(t), which represents the number of customers (i.e., requests) waiting for service i since the latest placement of service i at the edge server. Furthermore, we assume that S_i(0) = 0, and that S_i(t) can be arbitrarily large but bounded. The state of the system at time t is then defined as S(t) = (S_1(t), …, S_N(t)).

Action: Decisions are made at the moments when a bandit changes state, i.e., when there is a new service request. Upon each decision, two actions can be chosen for each bandit: action a_i = 0 (passive), i.e., do not place the service at the edge server, and action a_i = 1 (active), i.e., place the service at the edge server.
Transition Kernel: Throughout this paper we consider the restless multi-armed bandit (RMAB) problem [55], where each bandit is modeled as a continuous-time birth-and-death process (BDP), which has been widely used to model real-world applications, e.g., [30, 32]. Specifically, the state of bandit i can evolve from S_i to state S_i + 1 or state S_i − 1, both when active and when passive. The transition rates of the vector S = (S_1, …, S_N) are expressed as

S → S + e_i with rate λ_i(S_i, a_i),
S → S − e_i with rate μ_i(S_i, a_i),   (1)

where e_i is an N-dimensional vector with all elements zero except the i-th entry, which equals 1. In particular, given the action taken in state S_i, the dynamics of each bandit are independent of the others, as described in (1).
Different from a conventional BDP, we define the transition rates as follows, owing to the unique nature of our service placement model. The birth rates satisfy λ_i(S_i, a_i) = λ_i, representing the rate of new requests from end users for service i, and the death rates satisfy μ_i(S_i, 1) = μ_i(S_i) when service i is active. A state-dependent death rate enables us to model more complicated scenarios, such as the case in which the wireless channel capacity between the services and the edge server changes dynamically with the service state [32]. In this paper, for simplicity, we take the departure rate of the classic M/M/1 queue, i.e., μ_i(S_i, 1) = μ_i.
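Since the active-service dynamics reduce to a classic M/M/1 queue, a small numerical sanity check can confirm its geometric stationary distribution. The rates below are illustrative assumptions, not values from the paper:

```python
# Sketch: stationary distribution of an M/M/1 queue (one active service),
# truncated at max_n states. The rates lam, mu are illustrative.

def mm1_stationary(lam, mu, max_n):
    """Truncated geometric stationary distribution with ratio rho = lam/mu."""
    rho = lam / mu
    return [(1.0 - rho) * rho ** k for k in range(max_n)]

def mean_queue_length(pi):
    return sum(k * p for k, p in enumerate(pi))

lam, mu = 0.5, 1.0                      # assumed arrival / delivery rates
pi = mm1_stationary(lam, mu, 200)
rho = lam / mu
assert abs(sum(pi) - 1.0) < 1e-6        # distribution sums to 1
# Closed-form mean of the M/M/1 queue length: rho / (1 - rho)
assert abs(mean_queue_length(pi) - rho / (1.0 - rho)) < 1e-6
```

The truncation at 200 states makes the neglected tail mass negligible for rho = 0.5.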
Cost: For each service i, let C_i(S_i, a_i) be the cost per unit time when the service is in state S_i and is either passive (a_i = 0) or active (a_i = 1). The cost represents the average latency end users incur in obtaining the service from the edge server.
Optimal Service Placement Problem: We consider a policy π that determines which bandits are made active, i.e., which services are placed on the edge server, at each moment in time. Since we model the bandits as continuous-time BDPs, a special case of a continuous-time Markov process, we can restrict attention to policies that make decisions based only on the current state of the bandits. For a given policy π, let S_i^π(t) be the state of bandit i at time t, and S^π(t) = (S_1^π(t), …, S_N^π(t)). Denote the actions as a^π(t) = (a_1^π(t), …, a_N^π(t)), where a_i^π(t) = 1 means that service i is made active at time t under policy π, and a_i^π(t) = 0 means that service i is made passive at time t under policy π.
The expected average service delivery latency to end users under policy π is defined as

J(π) = limsup_{T→∞} (1/T) E_π [ ∫_0^T Σ_{i=1}^N C_i(S_i^π(t), a_i^π(t)) dt ],   (2)

where E_π denotes the conditional expectation given policy π, and Π is the set of feasible policies that also ensure that the resulting state process is ergodic.
Our goal is then to derive a policy that minimizes the average delivery latency under the edge server capacity constraint. Hence, the optimal service placement problem can be formulated as the following MDP:

min_{π ∈ Π} J(π)
s.t. Σ_{i=1}^N a_i^π(t) ≤ B for all t.   (3)
The following result states the existence of the value function and the average cost for the MDP (II); it follows directly by applying Chapter 6 of [45].
Lemma 1.
It is known that there exist a constant β and a function V(·) that satisfy the dynamic programming equation

min_{a} { Σ_{i=1}^N C_i(S_i, a_i) + Σ_{S′ ≠ S} q(S′ | S, a) [ V(S′) − V(S) ] } = β,   (4)

where q(S′ | S, a) denotes the transition rates in (1); a stationary policy that realizes the minimum in (4) is optimal, with β the optimal average cost and V the value function [45]. An optimal policy for (4) can be obtained numerically by value iteration or policy improvement algorithms.
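To make the average-cost dynamic programming equation concrete, the following sketch runs relative value iteration for a single service in isolation (the joint MDP couples services through the capacity constraint, so this is only a one-dimensional illustration). The holding cost s per unit time, the placement charge w while active, the slower passive "remote" service rate mu0, and the truncation level are all assumptions made for illustration, not the paper's exact model:

```python
# Hedged sketch: relative value iteration on a uniformized, truncated
# birth-death chain for one service. Assumed costs/rates, for illustration.

def backup(h, lam, mu0, mu1, w, s_max):
    """One Bellman backup; returns new relative values and the greedy policy."""
    tot = lam + mu1                       # uniformization rate
    nh, pol = [], []
    for s in range(s_max + 1):
        q = []
        for a in (0, 1):
            mu = mu1 if a == 1 else mu0
            p_arr = lam / tot
            p_dep = (mu / tot) if s > 0 else 0.0
            up, down = min(s + 1, s_max), max(s - 1, 0)
            stay = 1.0 - p_arr - p_dep
            q.append((s + w * a) / tot
                     + p_arr * h[up] + p_dep * h[down] + stay * h[s])
        nh.append(min(q))
        pol.append(1 if q[1] < q[0] else 0)
    return nh, pol

def rvi(lam, mu0, mu1, w, s_max, iters=3000):
    h = [0.0] * (s_max + 1)
    g = 0.0
    for _ in range(iters):
        nh, pol = backup(h, lam, mu0, mu1, w, s_max)
        g = nh[0]                         # h[0] is pinned to 0 each iteration
        h = [v - nh[0] for v in nh]
    return g * (lam + mu1), pol           # average cost per unit time, policy

beta, pol = rvi(lam=0.5, mu0=0.6, mu1=1.5, w=2.0, s_max=30)
assert 0.0 < beta < 10.0                  # finite average cost
assert pol == sorted(pol)                 # the greedy policy is threshold-type
```

The final assertion anticipates the threshold structure established in Section III: the converged greedy policy is passive below some backlog level and active above it.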
Though in principle one could use the iterative algorithms mentioned above, in practice the "curse of dimensionality", i.e., the exponential growth of the state space with the number of services N, renders such a solution impractical. Thus, we resort to the Whittle index policy, which is computationally appealing.
III. Service Placement Using the Whittle Index Policy
In general, even if the state space were bounded (say, by truncating the queue lengths), the MDP (II) is hard to solve because the optimal decisions for the services are strongly coupled. We observe that problem (II) can be posed as a restless multi-armed bandit (RMAB) problem [55] in which S_i(t) is the state of the i-th bandit. A tractable solution, i.e., one whose computational complexity scales linearly with N, was proposed by Whittle [55]. We briefly describe the notion of indexability and the Whittle index policy, and then show that our problem is indexable.
III-A. Whittle Index Policy
Consider the following problem, which arises from (II) by relaxing its hard per-time constraint to a time-average constraint:

min_{π ∈ Π} J(π)
s.t. limsup_{T→∞} (1/T) E_π [ ∫_0^T Σ_{i=1}^N a_i^π(t) dt ] ≤ B.   (5)
It can be shown that the relaxed problem (5) is computationally tractable: its computational complexity scales linearly with N, and it admits a decentralized solution in which the decision for each service is made only on the basis of that service's own state.
Consider the Lagrangian associated with this problem,

L(π, W) = J(π) + W ( limsup_{T→∞} (1/T) E_π [ ∫_0^T Σ_{i=1}^N a_i^π(t) dt ] − B ),   (6)

where W ≥ 0 is the Lagrange multiplier. Also define the associated dual problem,

max_{W ≥ 0} min_{π} L(π, W).   (7)
Since the Lagrangian decouples into a sum of individual service MDPs, it turns out that, in order to evaluate the dual function at W, it suffices to solve N single-service MDPs of the following form (π_i is a policy for the i-th service):

min_{π_i} limsup_{T→∞} (1/T) E_{π_i} [ ∫_0^T \bar{C}_i(S_i(t), a_i(t)) dt ],   (8)

where the effective cost combines the latency cost with the placement charge,

\bar{C}_i(s, a) = C_i(s, a) + W a.   (9)
Definition 1.
(Indexability): Consider the single-service MDP (8) for the i-th service. Let D_i(W) denote the set of states in which the optimal action is the passive one (a = 0). Then the i-th MDP is indexable if the set D_i(W) increases with W, i.e., if W ≤ W′ then D_i(W) ⊆ D_i(W′). The original MDP (II) is indexable if each of the single-service MDPs is indexable.
Definition 2.
(Whittle Index) If the single-service MDP for the i-th service is indexable, then the Whittle index in state S_i is denoted W_i(S_i), and is given as follows:

W_i(S_i) = inf { W ≥ 0 : S_i ∈ D_i(W) }.

Thus, it is the smallest value of the charge W for which the optimal policy for service i is indifferent between the active and passive actions when the state equals S_i.
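This definition suggests a direct numerical procedure: assuming indexability, the index at state s can be found by bisecting on the charge w until the optimal action at s flips from active to passive. The sketch below does this for a toy single-service model (uniformized, truncated, with an assumed slower passive service rate mu0); all rates and costs are illustrative assumptions, not the paper's exact model:

```python
# Hedged sketch: Whittle index by bisection over the activation charge w.
# Uses relative value iteration on a toy uniformized birth-death arm with
# holding cost s, assumed rates, and truncation at s_max.

def optimal_policy(lam, mu0, mu1, w, s_max, iters=800):
    """Greedy policy from relative value iteration on the uniformized chain."""
    tot = lam + mu1
    h = [0.0] * (s_max + 1)
    pol = [0] * (s_max + 1)
    for _ in range(iters):
        nh = []
        for s in range(s_max + 1):
            q = []
            for a in (0, 1):
                mu = mu1 if a == 1 else mu0
                p_arr = lam / tot
                p_dep = (mu / tot) if s > 0 else 0.0
                up, down = min(s + 1, s_max), max(s - 1, 0)
                q.append((s + w * a) / tot + p_arr * h[up]
                         + p_dep * h[down] + (1 - p_arr - p_dep) * h[s])
            nh.append(min(q))
            pol[s] = 1 if q[1] < q[0] else 0
        h = [v - nh[0] for v in nh]
    return pol

def whittle_index(s, lam, mu0, mu1, s_max, w_hi=20.0, tol=1e-2):
    lo, hi = 0.0, w_hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if optimal_policy(lam, mu0, mu1, mid, s_max)[s] == 1:
            lo = mid                      # still active: charge below the index
        else:
            hi = mid
    return 0.5 * (lo + hi)

idx = [whittle_index(s, 0.5, 0.6, 1.5, 20) for s in (1, 2, 3)]
assert all(0.0 <= w < 20.0 for w in idx)  # indices are finite
assert idx[0] <= idx[1] + 0.05            # and grow with the backlog
assert idx[1] <= idx[2] + 0.05
```

The monotonicity of the computed indices in the backlog is exactly what indexability predicts for this toy arm.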
III-B. MDP (II) Is Indexable
Our proof of indexability relies on the "threshold" property of the optimal policy for the single-service MDP: the service is placed on the edge server only when the number of requests for it at the edge exceeds a certain threshold.
Proposition 1.
Fix a charge W ≥ 0, and consider the single-service MDP (8). The optimal policy for this problem is of threshold type, and the threshold depends upon W.
We now compute the stationary distribution of a threshold policy as a function of its threshold. This result is useful while proving indexability of the original MDP.
Proposition 2.
The stationary distribution of the threshold policy with threshold R satisfies the birth-death balance equations

π(s) λ_i = π(s + 1) μ_i(s + 1, a(s + 1)), s ≥ 0,   (10)

where a(s) denotes the action prescribed by the threshold policy in state s, and the states 0 through R − 1 can be aggregated into a dummy state.
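As a numerical sanity check of the product-form structure behind such stationary distributions, the sketch below builds the birth-death stationary law of a threshold policy and verifies global balance. The passive death rate mu0 below the threshold and all numeric values are assumptions made for illustration:

```python
# Sketch: stationary law of a threshold policy on a birth-death chain.
# Assumed dynamics: death rate mu0 below threshold R, mu1 at or above it.
# Product form: pi(s+1) / pi(s) = lam / mu(s+1).

def threshold_stationary(lam, mu0, mu1, R, s_max):
    mu = lambda s: mu1 if s >= R else mu0      # death rate in state s
    w = [1.0]
    for s in range(1, s_max + 1):
        w.append(w[-1] * lam / mu(s))
    z = sum(w)
    return [x / z for x in w], mu

pi, mu = threshold_stationary(0.5, 0.6, 1.5, R=3, s_max=40)

# Global balance: total flow out of each state equals total flow in.
for s in range(41):
    out_rate = (0.5 if s < 40 else 0.0) + (mu(s) if s > 0 else 0.0)
    inflow = (pi[s - 1] * 0.5 if s > 0 else 0.0) \
             + (pi[s + 1] * mu(s + 1) if s < 40 else 0.0)
    assert abs(pi[s] * out_rate - inflow) < 1e-12
```

Because birth-death chains are reversible, the product form satisfies global balance exactly, which the loop confirms state by state.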
We conclude this subsection by showing that all bandits are indexable under our model.
Proposition 3.
The MDP (II) is indexable.
We are now ready to derive the Whittle indices for the MDP (II).
Proposition 4.
The Whittle index W_i(S_i) is given by
(11)
whenever the right-hand side of (11) is non-decreasing in S_i, where a subscript denotes that the associated quantities are computed under the threshold policy whose threshold is set equal to that value.
Remark 1.
If (11) is non-monotone in S_i, then the Whittle index cannot be derived by equating the average costs of two consecutive threshold policies. Instead, the algorithm described in [20, 31] provides a means to obtain the Whittle index. In particular, the current threshold policy is compared to an appropriately chosen larger threshold policy, and the resulting value serves as the Whittle index for all the intermediate thresholds; the right choice of the comparison threshold is itself the result of an optimization problem. The indices of subsequent states are computed analogously, and the algorithm terminates when every state has been assigned an index.
Remark 2.
Since the cost function and the stationary probabilities are known, (11) can be numerically computed.
Remark 3.
From (11), it is clear that the index of bandit i does not depend on the number of requests for the other services. Therefore, it provides a systematic way to derive simple policies that are easily implementable.
III-C. Asymptotic Optimality
The Whittle index policy is in general not an optimal solution to the original problem (II). However, it has been proved that the Whittle index policy is asymptotically optimal, e.g., [54, 52, 40], as the number of bandits that can be simultaneously made active grows proportionally with the total number of bandits.
IV. Reinforcement Learning Based Whittle Index Policy
The computation of the Whittle indices requires knowledge of the arrival rates λ_i and service rates μ_i. Since these quantities are typically unknown and possibly time-varying, the assumption that they are known to the service provider is not practical. Hence, we now develop learning algorithms [33] that ensure almost the same performance as the Whittle index policy. In particular, we propose two reinforcement learning algorithms that fully exploit the structure of the proposed Whittle index policy, and we quantify the suboptimality of our learning algorithms in terms of their learning regret [12]. For ease of exposition, we present the algorithm designs and main results in this section, and relegate the proofs to Section V.
IV-A. UCB-Whittle
We first propose an Upper Confidence Bound (UCB) type algorithm based on the principle of optimism in the face of uncertainty [3, 33, 37]. To simplify the exposition, we sample our continuous-time system at the discrete time instants at which the system state changes, i.e., when either an arrival or a departure occurs. This gives rise to a discrete-time controlled Markov process [45], and henceforth we work exclusively with this discrete-time system. We use 1/μ_i to denote the mean delivery time of service i, and 1/λ_i to denote the mean inter-arrival time of requests for the i-th service. In what follows, we parameterize the system by the mean inter-arrival times and mean delivery times of all services. This is useful since the empirical estimates of mean inter-arrival times and mean delivery times are unbiased, while the empirical estimates of arrival rates and delivery rates are biased; this parametrization therefore greatly simplifies the exposition and analysis. We let θ denote the vector comprising the true values of these system parameters.
Let S and S′ be two possible system state values, and let a be an allocation for the combined system composed of N services. We write p_θ(S, S′; a) for the one-step controlled transition probability of the combined system from state S to state S′ under allocation a; the subscript denotes its dependence upon the value of the true parameter θ. Since the Whittle indices are also a function of the parameter θ, we denote them W(θ) in order to depict this dependence, and by notational abuse we also use π(θ) to denote the Whittle index policy that uses the indices W(θ) for placing services.
The learning algorithm knows that θ belongs to a finite set Θ. For a stationary control policy π that chooses its actions on the basis of the current state, we let J(π; θ) denote its infinite-horizon average expected cost (i.e., latency) when the true system parameter equals θ, and we let Φ(T; π) denote the cumulative cost incurred during T time steps. As is common in the RL literature [23, 33, 37], we define the "gap" Δ as follows,

Δ = min_{θ ∈ Θ : π(θ) ≠ π(θ*)} J(π(θ); θ*) − J(π(θ*); θ*).   (12)
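A toy computation may make the definition concrete. The dictionary below holds hypothetical average costs J(π(θ_a); θ_b), indexed by the pair (policy parameter, true parameter); all numbers are made up for illustration:

```python
# Toy illustration of the gap (12). J[(a, b)] stands for J(pi(theta_a); theta_b):
# the average cost of the Whittle policy designed for theta_a when the true
# parameter is theta_b. Hypothetical values.

J = {("t1", "t1"): 2.0, ("t2", "t1"): 2.7, ("t3", "t1"): 3.1}
theta_star = "t1"

best = J[(theta_star, theta_star)]
gap = min(J[(th, theta_star)] - best for th in ("t2", "t3"))
assert abs(gap - 0.7) < 1e-9   # the closest suboptimal policy defines the gap
```

The gap is thus the smallest extra cost any misspecified Whittle policy pays on the true system; a larger gap makes suboptimal parameters easier to rule out.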
Throughout, we assume that we are equipped with a probability space [47].
IV-A1. Learning Setup
We assume that the algorithm operates for T time steps. We consider an "episodic RL" setup [50] in which the total operating horizon of T steps is composed of multiple "episodes", each consisting of H consecutive time steps. Let t_k denote the starting time of the k-th episode, and E_k the set of time slots that comprise the k-th episode. We assume that the system state is reset to 0 at the beginning of each episode; this can be attained by discarding, at the end of each episode, those users who have not received service by then. Denote by F_t the sigma-algebra [48] generated by the random variables observed up to time t. A learning rule is a collection of maps from the observation history F_t to the action space that utilizes the past observations in order to make service placement decisions at the current time step t.

IV-A2. Learning Rule
Let \widehat{1/λ_i}(t) and \widehat{1/μ_i}(t) denote the empirical estimates at time t of the mean inter-arrival time and the mean delivery time of service i, i.e., the sample averages

\widehat{1/λ_i}(t) = (1 / N^λ_i(t)) Σ_{k=1}^{N^λ_i(t)} X_{i,k},   (13)
\widehat{1/μ_i}(t) = (1 / N^μ_i(t)) Σ_{k=1}^{N^μ_i(t)} Y_{i,k},   (14)

where X_{i,k} (Y_{i,k}) is the k-th observed inter-arrival time (delivery time) of service i, and N^λ_i(t) (N^μ_i(t)) is the number of such observations collected until time t.
Denote \hatθ(t) = ( \widehat{1/λ}(t), \widehat{1/μ}(t) ), and construct the confidence interval associated with service i as follows,

C_i(t) = [ \widehat{1/λ_i}(t) − r^λ_i(t), \widehat{1/λ_i}(t) + r^λ_i(t) ] × [ \widehat{1/μ_i}(t) − r^μ_i(t), \widehat{1/μ_i}(t) + r^μ_i(t) ],   (15)

where r^λ_i(t), r^μ_i(t) denote the radii of the confidence intervals, and are given as follows,

r^λ_i(t) = c_1 sqrt( log(T/δ) / N^λ_i(t) ),   (16)
r^μ_i(t) = c_2 sqrt( log(T/δ) / N^μ_i(t) ),   (17)

where δ is a user-specified parameter, N^λ_i(t) (N^μ_i(t)) denotes the number of samples obtained for estimating the arrival rate (delivery rate) of the i-th service until time t, and c_1, c_2 are constants. As discussed in Theorem 1, we will use δ = 1/T in order to ensure that the learning algorithm has good performance.
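The construction of (13)-(17) can be sketched as follows. The constant c, the horizon, and the exponentially distributed inter-arrival samples are illustrative assumptions; only the radius shape sqrt(log(·)/n) mirrors (16)-(17):

```python
import math
import random

# Sketch: empirical mean of observed inter-arrival (or delivery) times plus a
# confidence radius shrinking as sqrt(log(.) / n), in the spirit of (16)-(17).
# The constant c and all numeric values are illustrative assumptions.

def conf_interval(samples, horizon, c=5.0):
    n = len(samples)
    mean = sum(samples) / n
    radius = c * math.sqrt(math.log(max(horizon, 2)) / n)
    return mean - radius, mean + radius, radius

random.seed(0)
true_mean = 2.0                                   # assumed mean inter-arrival time
samples = [random.expovariate(1.0 / true_mean) for _ in range(5000)]

lo, hi, rad = conf_interval(samples, horizon=5000)
lo_small, hi_small, rad_small = conf_interval(samples[:50], horizon=5000)

assert lo <= true_mean <= hi       # the truth lies inside the ball (w.h.p.)
assert rad < rad_small             # the radius shrinks with more samples
```

With more samples the ball tightens, which is what drives the regret bound: once every service has been sampled often enough, suboptimal parameters fall outside the confidence ball.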
The confidence ball of the overall system is then denoted as (though we define confidence intervals and empirical estimates at each time, they are used only at times that mark the beginning of a new episode)

C(t) = { θ ∈ Θ : θ_i ∈ C_i(t) for all i = 1, …, N }.   (18)
The learning rule then derives an "optimistic estimate" \tildeθ_k of the true parameter at the beginning of episode k as follows,

\tildeθ_k ∈ arg min_{θ ∈ C(t_k)} J(π(θ); θ).   (19)

In case C(t_k) is empty, \tildeθ_k is chosen uniformly at random from the set Θ. During E_k, the learning rule implements π(\tildeθ_k). We summarize this as Algorithm 1.
IV-A3. Learning Regret
We define the regret of a learning rule φ as follows,

R(T) = Φ(T; φ) − T · J(π(θ*); θ*),   (20)

where Φ(T; φ) is the expected value of the cost incurred during T time slots when φ is used on the system that has parameter θ*. Our goal is to design a learning rule that minimizes the expected regret E[R(T)]. Thus, our benchmark is the Whittle index rule designed for the system with parameter θ*. We next present the regret of the UCB-Whittle algorithm.
Theorem 1.
Consider the UCB-Whittle algorithm applied to adaptively place services on the edge server so as to minimize the average latency faced by users. Then, with probability greater than 1 − δ, the regret is bounded by a term of order log(T/δ), with a prefactor depending on the user-specified parameter δ, on a lower bound on the mean inter-arrival and delivery times, on the gap Δ of (12), and on the quantity appearing in Lemma 9. With δ = 1/T, we obtain that the expected regret of UCB-Whittle is upper-bounded as O(log T).
Proof Sketch: We show that the sample-path regret can be upper-bounded by the number of episodes in which \tildeθ_k is not equal to θ*, i.e., suboptimal episodes. We then decompose the sample space into a "good set" G and a "bad set" G^c. Loosely speaking, on G each service is sampled sufficiently many times, and moreover the sample estimates concentrate around the true value θ*. We show that the good properties of G imply a bound on the number of suboptimal episodes, and hence on the regret. The expected regret on G^c is bounded primarily by bounding its probability. The proof details are presented in Section V-A2.
Remark 4.
It is well known by now that the learning regret of commonly encountered nontrivial learning tasks grows asymptotically as Θ(log T) [29, 33, 3]. A key research problem is to design a learning algorithm that yields the "lowest possible" prefactor for this logarithmic regret. Theorem 1 shows that UCB-Whittle has a regret of O(log T). An interesting question is whether we could attain this regret with computationally tractable algorithms, and what the "optimal" instance-dependent [33] prefactor is.
IV-A4. Computational Complexity
Problem (19) needs to be solved at the beginning of the k-th episode. Solving (19) requires us to evaluate the average cost of the Whittle policy for each candidate system parameter θ ∈ Θ. However, deriving the average cost of the Whittle policy is computationally expensive because of the curse of dimensionality [9]. We note, however, that the Whittle index policy is asymptotically optimal as the population size is scaled up. Moreover, it is shown in [52] that the limiting value of the normalized average cost (normalization involves dividing the expected cost by the population size) is equal to the value of the allocation problem in which the hard constraints are relaxed to time-average constraints. This means that for a large population we can approximate the cost by the optimal value of the relaxed problem. The relaxed problem is tractable since the decisions for different services decouple, so the computational complexity scales linearly with the number of services. Moreover, since a threshold policy is optimal for each individual service, and the stationary distribution of a single service employing a threshold policy is easily computed, the required cost values can be obtained easily.
IV-B. Q-learning-Whittle
We further design a heuristic learning algorithm for the Whittle index, based on the off-policy Q-learning algorithm. The proposed algorithm leverages the threshold structure of the optimal policy developed in Section III to learn the state-action values, which significantly differs from conventional ε-greedy Q-learning for RMABs [6]. We first rewrite the single-service MDP from (III-A) as
min_π limsup_{T→∞} (1/T) E_π [ Σ_{t=1}^T ( C(S(t), a(t)) + W a(t) ) ],   (21)

where we drop the service subscript on the policy since we consider a single service. (21) can be further formulated as a dynamic program, i.e.,

β + V(s) = min_{a ∈ {0,1}} { C(s, a) + W a + Σ_{s′} p(s′ | s, a) V(s′) },   (22)

where V(s) is the value function for the bandit in state s and β is the optimal average cost. Given (22), we define the state-action value function as
Q(s, a) = C(s, a) + W a + Σ_{s′} p(s′ | s, a) V(s′) − β.   (23)

Since the Whittle index for the bandit in state s is defined as the value of the charge W at which actions a = 0 and a = 1 are equally preferred in state s, i.e., Q(s, 0) = Q(s, 1), we can express it in closed form as

W(s) = C(s, 0) − C(s, 1) + Σ_{s′} [ p(s′ | s, 0) − p(s′ | s, 1) ] V(s′).   (24)
However, the transition probabilities are unknown, and therefore we cannot directly calculate the Whittle index according to (24). In the following, we propose a Q-learning based algorithm to jointly update the Q-values and the Whittle index.
To efficiently learn the Whittle index, we leverage the threshold structure of our policy as follows. Given a threshold policy with threshold R, any state s < R is made passive and any state s ≥ R is made active. Q-learning under this policy therefore only updates Q(s, 0) for s < R and Q(s, 1) for s ≥ R; the remaining Q-values are unchanged and set to infinity. Under this setting, the desired Whittle index is the value of the charge satisfying

Q(R, 0) = Q(R, 1).   (25)
We notice from (24) and (25) that the Whittle index W(s) depends only on the Whittle indices of the previous states W(0), …, W(s − 1), which inspires the following state-by-state update algorithm.
Since each bandit shares the same update mechanism, we drop the bandit subscript for notational simplicity. We consider the same learning setting as in UCB-Whittle, i.e., the algorithm operates in an episodic setup, with each episode containing H consecutive time slots. Inside each episode k, the Q-learning recursion for threshold R is defined as in (26),

Q_{t+1}(s, a) = Q_t(s, a) + α_t [ C(s, a) + W_k a + Q_t(s′, a′) − Q_t(s, a) ],   (26)

where t ∈ E_k with t_k being the start slot of episode k, s′ is the observed next state, a′ is the action prescribed by the threshold policy in s′, and W_k is the Whittle index estimate for episode k, which is kept fixed for the entire episode. The learning rate α_t is assumed to satisfy the standard stochastic approximation conditions Σ_t α_t = ∞ and Σ_t α_t^2 < ∞.
Our goal is to learn the Whittle index by iteratively updating W_k in the following manner,

W_{k+1} = W_k + η_k ( Q(R, 0) − Q(R, 1) ),   (27)

with the step sizes η_k satisfying Σ_k η_k = ∞, Σ_k η_k^2 < ∞, and decaying faster than α_t, so that (27) evolves on a slower timescale than (26). For each state, the Whittle index is thus updated according to the two-timescale stochastic approximation formed by (26) and (27). The entire procedure is summarized in Algorithm 2, and its convergence is presented in Lemma 2.
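The two-timescale coupling can be illustrated with a toy sketch. This is not the paper's exact recursion: it uses discounted Q-learning on the fast timescale, constant step sizes, and assumed rates, costs, and target state. It does exhibit the key property that only state-action pairs along the threshold policy (plus both actions at the target state) are ever updated:

```python
import random

# Toy two-timescale sketch in the spirit of (26)-(27). Fast loop: discounted
# Q-updates along a threshold policy. Slow loop: the charge w_hat is nudged
# toward indifference at a target state. All numbers are illustrative.

random.seed(1)
LAM, MU0, MU1 = 0.5, 0.6, 1.5      # arrival, passive and active service rates
S_MAX, GAMMA, S_TGT = 20, 0.95, 3

def step(s, a):
    """Sample the next state of the uniformized birth-death chain."""
    u = random.random()
    tot = LAM + MU1
    if u < LAM / tot:
        return min(s + 1, S_MAX)
    mu = MU1 if a == 1 else MU0
    if s > 0 and u < (LAM + mu) / tot:
        return s - 1
    return s

def policy(s):
    return 1 if s >= S_TGT else 0      # threshold policy with threshold S_TGT

Q = [[0.0, 0.0] for _ in range(S_MAX + 1)]
w_hat, s, flip = 0.0, 0, 0
for _ in range(150000):
    if s == S_TGT:
        a, flip = flip, 1 - flip       # alternate actions to learn both Q-values
    else:
        a = policy(s)                  # deterministic action elsewhere
    s2 = step(s, a)
    cost = s + w_hat * a               # holding cost plus charge while active
    # fast timescale: one-step deviation, then follow the threshold policy
    Q[s][a] += 0.05 * (cost + GAMMA * Q[s2][policy(s2)] - Q[s][a])
    # slow timescale: raise the charge while the active action is still cheaper
    w_hat += 5e-4 * (Q[S_TGT][0] - Q[S_TGT][1])
    s = s2

assert 0.0 < w_hat < 50.0              # the charge settles at a finite value
# only state-action pairs along the policy (plus the target) were touched
assert all(Q[s][1 - policy(s)] == 0.0 for s in range(S_MAX + 1) if s != S_TGT)
```

The final assertion is the sample-efficiency point of the section: off-policy pairs are never visited, so no samples are spent on them.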
Note that Algorithm 2 handles a single bandit. The server simply repeats the procedure of Algorithm 2 N times to obtain the Whittle indices of all bandits. Since the update mechanism is identical across bandits, we omit the algorithm description for the whole system. When considering the capacity of the server, i.e., that it can serve at most B bandits simultaneously, a simple implementation is to divide the bandits into multiple groups of size B, and let the server sequentially learn the Whittle indices of the bandits in each group.
Lemma 2.
The sequence of Whittle index iterates in (27) converges almost surely, i.e., W_k converges almost surely to the Whittle index as k → ∞.
Remark 5.
The conventional Q-learning based method [6] relies on an ε-greedy mechanism to balance the trade-off between exploration and exploitation. The major difference between our Q-learning-Whittle and the traditional Q-learning based index policy [6] lies in the update of the Q-values in (26). In the traditional method, when the bandit moves into a new state, it greedily selects the action optimizing the Q-value with probability 1 − ε, and selects a random action with probability ε; it must therefore learn the Q-value of every state-action pair. In contrast, our threshold-type Q-learning-Whittle takes a deterministic action at each state, as in (26). Therefore, under each threshold policy it only learns the Q-values of the state-action pairs visited by the current threshold policy. Moreover, the Whittle index in the current state relies only on the Whittle indices of previous states, whereas the Whittle indices of all states are coupled in the conventional Q-learning based method. The proposed scheme thus substantially enhances sample efficiency and converges faster, as verified in the numerical study.
V. Proofs of Main Results
In this section, we provide the detailed proofs of Theorem 1 on UCB-Whittle and Lemma 2 on Q-learning-Whittle.
V-A. Proof of Theorem 1 on UCB-Whittle
Throughout, we will make the following assumption regarding the set Θ that is known to contain the true parameter θ*.
Assumption 1.
For every θ ∈ Θ, the process denoting the system state under the application of the Whittle policy π(θ) is ergodic [38], i.e., the Markov chain is positive recurrent. Moreover, the associated average cost is finite.

V-A1. Preliminary Results
We provide the following equivalent formulation of UCB-Whittle. This characterization turns out to be useful in analyzing its learning regret.
Lemma 3.
UCB-Whittle can equivalently be described as follows. At the beginning of E_k, the learning algorithm computes a certain "value" for each θ ∈ Θ as follows,

value(θ) = min_{θ′ ∈ C(t_k)} J(π(θ); θ′),   (28)

where C(t_k) is the confidence ball (18) at time t_k. It then chooses the θ that has the least value (28), and implements the corresponding Whittle index rule π(θ) during E_k.
Proof.
Note that the operation of obtaining the optimistic estimate \tildeθ_k could equivalently be rewritten as follows
(29) 
After exchanging the order of minimization, (29) reduces to
(30) 
Since the Whittle index policy is asymptotically optimal [52] and there are only finitely many candidate policies in the innerminimization problem above, (30) reduces to
(31) 
However, this is exactly the problem (19) that needs to be solved in order to obtain the optimistic estimate \tildeθ_k. ∎
We now prove a concentration result for the empirical estimates \hatθ(t).
Lemma 4.
Define the set G_1 as follows,
(32) 
Consider UCB-Whittle employed with the parameter δ, where δ is chosen so as to satisfy the requirement of Lemma 11. We have,
(33) 
Proof.
Fix the number of samples used for estimating \hatθ(t). It then follows from Lemma 11 that the probability with which the estimate of a service rate lies outside the confidence ball is suitably small. Since the number of samples can assume at most T distinct values, the proof follows by using a union bound over the number of estimation samples and over the N services. ∎
We now show that, with high probability, the value (28) corresponding to the true parameter θ* never rises above a certain threshold.
Lemma 5.
On the set G_1, the value (28) associated with the true parameter θ* is at most J(π(θ*); θ*), since θ* belongs to the confidence ball C(t_k) on G_1.
Proof.
Lemma 6.
Consider the process that evolves under the application of π(\tildeθ_k), where \tildeθ_k is the optimistic estimate obtained at the beginning of E_k by solving (19). Then, on the set G_1, the following holds: there exist ε > 0 and a finite time such that, for each service i, the long-run fraction of time during which service i is placed at the edge is at least ε, where the placement indicator equals 1 if service i is placed at time t and 0 otherwise. A similar inequality is also satisfied by the cumulative arrivals.
Proof.
Note that, in order for the average cost to be finite, each service must be allocated a nonzero fraction of the bandwidth at the server. Since under Assumption 1 the controlled process is ergodic and has finite cost, on G_1 the cost incurred under π(\tildeθ_k) is finite, and hence each service is provided a nonzero fraction of the total bandwidth at the server. The proof then follows since there are only finitely many choices of θ in Θ. ∎
We next show that if each service has been sampled "sufficiently" many times, then the value (28) attached to any suboptimal θ stays above this threshold.
Let
(34) 
where the parameter is as in Lemma 9. Define G_2 to be the following set
(35) 
where in the above we denote
On the set G_2 we can ensure lower bounds on the number of samples of each service, since these counts are tightly concentrated around their mean values. Since the regret at any time increases with the size of the confidence balls (18), and since this size decreases with the number of samples, we are able to obtain tight upper bounds on the regret on G_2.
Lemma 7.
Let T_1 be as in (34). On the set G_2 we have that, for episodes beginning after T_1, the value (28) of any suboptimal θ is greater than the threshold of Lemma 5.
Proof.
Now we are ready to prove the regret of UCBWhittle. We begin with the following result that allows us to decompose the regret into sum of “episodic regrets”.
Lemma 8.
For a learning algorithm φ, we can upper-bound its regret (20) by the sum of the regrets incurred in the individual episodes, where K(T) denotes the number of episodes until time T.
Proof.
The proof follows since in each episode, the learning algorithm uses a stationary control policy from within the set of stabilizing controllers . ∎
V-A2. Proof of Regret of UCB-Whittle
Proof of Theorem 1.
Lemma 8, the decomposition result, shows us the following:

• The episodic regret is zero in those episodes in which \tildeθ_k is equal to θ*.

• If \tildeθ_k is not equal to θ*, then the episodic regret is bounded by a constant multiple of the episode length H.
Define the good set G = G_1 ∩ G_2, where G_1 and G_2 are as in (32) and (35), respectively. Note that it follows from Lemma 4 and Lemma 10 that the probability of G is greater than 1 − O(δ).
We thus have the following three sources of regret:
(i) Regret due to suboptimal episodes on G: It follows from Lemmas 5 and 7 that, on G, the regret is zero in any episode that begins after the time T_1 of (34).
(ii) Regret on episodes where the confidence intervals fail: As seen in the proof of Lemma 4, the probability that the confidence interval in episode k fails is appropriately upper-bounded. Since the regret within such an episode is bounded, the cumulative expected regret from such episodes is upper-bounded by a constant.
(iii) Regret on the bad set G^c: Since the probability of this set is small, and the regret on it can be trivially upper-bounded, this component contributes only a bounded amount to the expected regret.
The proof of the claimed upperbound on expected regret then follows by adding the above three bounds. ∎