## I Introduction

The dramatic growth of wireless traffic due to an enormous increase in the number of mobile devices is posing many challenges to the current mobile network infrastructures. In addition to this increase in the volume of traffic, many emerging applications such as Augmented/Virtual Reality, autonomous vehicles and video streaming, are latency-sensitive. In view of this, the traditional approach of offloading the tasks to remote data centers is becoming less attractive. Furthermore, since these emerging applications typically require unprecedented computational power, it is not possible to run them on mobile devices, which are typically resource-constrained.

To support such stringent timeliness and computational requirements, mobile edge computing architectures have been proposed as a means to improve the quality of experience (QoE). These architectures move servers from the cloud to edges, often wireless ones, that are closer to end users. Such edge servers are often empowered with a small wireless base station, e.g., the storage-assisted future mobile Internet architecture and cache-assisted 5G systems [2].
By using such edge servers, content providers are able to ensure that contents or services are provided with a high QoE (with minimal latency). The success of edge servers relies upon “content caching”, in which popular content such as movies, videos, and software is placed at the cache associated with the wireless edge. If the content requested by end users is available at the wireless edge, then it is promptly delivered to them. Otherwise, the request is forwarded to the remote server, which increases the latency. While the remote server often resides in a well-provisioned data center, resources are typically limited at the edge, so that only a limited amount of content can be cached there. These issues are further exacerbated in the case of wireless edges, where the requested content is delivered over *unreliable* channels.

In this work, we are interested in minimizing the average latency incurred while delivering contents to end users connected to a wireless edge via unreliable channels. We design dynamic policies that decide which content should be cached at the wireless edge so as to minimize the average latency of end users, while simultaneously satisfying resource constraints associated with the wireless edge.

We pose this problem as a Markov decision process (MDP) [35] in Section III. This MDP turns out to be a restless multi-armed bandit (RMAB) problem [45]. Even though in theory an RMAB can be solved by using relative value iteration [35, 5], this approach suffers from the curse of dimensionality, and fails to provide insight into the solution. Thus, it is desirable to derive low-complexity solutions and provide guarantees on their performance. A celebrated policy for RMAB is the Whittle index policy [45]. We propose to use the Whittle index policy for solving the problem of optimal caching.

Following the approach taken by Whittle [45], we begin by relaxing the hard constraints of the original MDP, which require that the number of cached contents at each time is exactly equal to the cache size. These are relaxed to a constraint which requires that the number of cached contents is equal to the cache size on average. We then consider the Lagrangian of this relaxed problem, which yields a set of decoupled average-reward MDPs, which we call the per-content MDPs. Instead of directly optimizing the average cost (latency) of this per-content MDP, we first consider a discounted per-content MDP, and prove that the optimal policy for each discounted per-content MDP has an appealing threshold structure. This structural result is then shown to also hold for the average latency problem. We use this structural result to show that our problem is indexable [45]. We derive Whittle indices for each content. The Whittle index policy then prioritizes contents in decreasing order of their Whittle indices, and caches as many contents as the available cache size allows. The Whittle index policy is computationally tractable since its complexity increases linearly with the number of contents. Moreover, it is known to be asymptotically optimal [43, 41] as the number of contents and the cache size are scaled up, while keeping their ratio constant. 
Our contribution in Section IV is non-trivial since establishing indexability of RMAB problems is typically intractable [30], and Whittle indices of many practical problems remain unknown except for a few special cases.

Note that the Whittle index policy needs to know the controlled transition probabilities of the underlying MDP, which in our case amounts to knowing the statistics of the content request process associated with end users, and also the reliability of wireless channels. However, these parameters are often unknown and time-varying. Hence, in Section V, we design an efficient reinforcement learning (RL) algorithm to make optimal content caching decisions dynamically when these parameters are unknown. We do not directly apply off-the-shelf RL methods such as UCRL2 [17][14] since the size of the state space grows exponentially with the number of contents, and hence the computational complexity and learning regret would also grow exponentially. Thus, the resulting algorithms would be too slow to be of any practical use. To overcome these challenges, we derive a model-free RL algorithm called Q-Whittle learning. By coupling conventional Q-functions with Whittle indices [3, 12], our Q-Whittle learning leverages the threshold structure of the optimal policy to learn only Q-values of state-action pairs following the current threshold policy. This novel update rule significantly improves the sample efficiency compared with Q-learning under the conventional $\epsilon$-greedy policy. Finally, our Q-Whittle learning can be viewed through the lens of a two-timescale stochastic approximation (2TSA) [8, 20]. We prove a bound on its finite-time convergence rate in Section VI. To the best of our knowledge, our work is the first to consider an RL approach towards a Whittle index policy derived from an MDP in the context of content caching at the wireless edge with unreliable channels, and the first to provide a finite-time analysis of a Whittle index based Q-learning algorithm.

Finally, we provide extensive numerical results using both synthetic and real traces to support our theoretical findings in Section VII, which demonstrate that our proposed algorithms produce significant performance gains over state-of-the-art methods.

## II Related Work

We overview two areas closely related to our work: content caching and restless bandits, and provide a brief discussion of our design methodology in the context of prior work.

Content Caching. The content caching problem has been studied in numerous domains with different objectives such as minimizing expected delay [34] or operational costs [1]. Joint content caching and request routing has also been investigated, e.g., [16, 23]. Most prior works formulated the problem as a constrained or stochastic optimization problem. None of those works provided a formulation using the RMAB framework or developed an index-based caching policy. Furthermore, all of the above works assumed full knowledge of the request processes and hence did not incorporate a learning component. A recent line of work considered content caching from an online learning perspective, e.g., [33, 48], and used learning regret or competitive ratio as the performance metric. Works such as [36, 37] used deep RL methods. However, deep RL methods lack theoretical performance guarantees. Our model, objective and formulation depart significantly from those considered in the aforementioned works: we pose the content caching problem at the wireless edge as an MDP and develop a simple index policy with a performance guarantee that can be easily learned through a model-free RL framework.

Restless Bandits. The RMAB is a general framework for sequential decision making problems, e.g., [27, 22, 21]. However, RMAB is notoriously intractable [32]. One celebrated heuristic is the Whittle index policy [45]. However, the Whittle index is well-defined only when the indexability condition is satisfied, which is in general hard to verify. Further, even when an arm is indexable, finding its Whittle index can still be intractable [30]. A few successes, e.g., [22, 21], all rely on specific assumptions and are hard to generalize. Additionally, applying the Whittle index requires full system knowledge, which is often unavailable in practice. Thus it is important to examine RMAB from a learning perspective, e.g., [9, 25, 26, 40, 31, 19, 42]. However, these methods do not exploit the special structure available in the problem and contend directly with an extremely high-dimensional state-action space, which makes the algorithms too slow to be of any practical use. Recently, RL-based algorithms have been developed [3, 12, 46] to explore the problem structure through index policies. However, [3, 12] lack finite-time performance analysis and multi-timescale SA algorithms usually suffer from slow convergence, and [46] depends on a simulator for a finite-horizon setting, which cannot be directly applied here since it is difficult to build a perfect simulator in complex wireless edge environments. In contrast, we provide a finite-time analysis of our Q-Whittle learning algorithm. The closest work is [11], which characterized the convergence rate for a general non-linear 2TSA. We generalize that result to prove the convergence rate of our Q-Whittle learning algorithm.

Our Design Philosophy. We make contributions to both areas in this paper. First, we formulate the content caching problem at the wireless edge with unreliable channels, with the goal of minimizing the average content request latency, as an average-reward MDP, which turns out to be an RMAB. Second, we consider this RMAB from an online perspective, given that the content request process and the unreliable wireless channel are often unknown and time-varying. 
Our approach differs from existing ones in two respects: (i) we focus on designing an index policy for content caching at the wireless edge, which operates on a much lower-dimensional subspace by exploiting the inherent structure of our problem; and (ii) our index-based approach naturally lends itself to a lightweight model-free RL framework that can fully exploit the structure of our index policy so as to reduce the high computational complexity.

## III System Model and Problem Formulation

In this section, we present the system model and formulate the average latency minimization problem.

### III-A System Model

Consider a wireless edge system connected to a remote server (e.g., data center) through backhaul links as shown in Figure 1. The wireless edge is equipped with a cache of size units in which it stores contents that are provided to end users. We denote the set of distinct contents as with Without loss of generality (W.l.o.g.), we assume that all contents are of unit size (footnote 1: our model can be generalized to the case of contents with variable sizes by dividing contents into unit-sized chunks). End users make requests for different contents to the wireless edge.
If the requested content is available at the wireless edge, then it is delivered to end users directly through a wireless channel that is unreliable. Otherwise, the request is sent to the remote server at the cost of a longer latency. The goal of the wireless edge is to decide which content to cache, subject to cache capacity constraint, so that the average content request latency experienced by end users is minimal.

Content Request and Delivery Model. We assume that requests for content arrive at the wireless edge from end users according to a Poisson process with arrival rate (footnote 2: Poisson arrivals have been widely used in the literature, e.g., [16, 23] and references therein; however, our model holds for general stationary processes [4], and our RL-based algorithm and analysis in Section V hold for any request process).
The time taken to deliver content to end users is a random variable that is exponentially distributed with mean .

Unreliable Channel. In case the content requested by a user is available in the cache, it is delivered to the user through a wireless channel, which is often unreliable due to noise or interference. We assume that the transmission succeeds with probability .

Queue Model. To each content , we associate a “request queue” at the wireless edge, which stores the number of outstanding requests for content at time . This is denoted by . This assumption is justified since the wireless edge is closer to end users and receives content requests at a much faster timescale as compared with the content downloads or updates in the wireless edge from the remote server. The content requested from an end user might not be served immediately, so that there will be latency associated with the user getting content. This motivates us to consider a queueing model that captures the latency experienced by end users.

### III-B System Dynamics

We now formulate the problem of average latency minimization for the above model as an MDP.

States. We denote the state of the wireless edge at time as , where is the number of outstanding requests for content from end users at time as described above.

Actions. The action for content at time is denoted as , where means that content is cached in the wireless edge at time , and otherwise. Denote Taking the cache capacity constraint at the wireless edge into account, we have that must satisfy the following constraints,

(1) |

A content caching policy maps the state of the wireless edge to the caching decision action , i.e.,

Transition Kernel. The state of the -th request queue can change from to either , or at the beginning of each decision epoch. More specifically,

(2) |

where is a -dimensional vector with the -th entry being and all others being , and with Note that we allow for state-dependent content delivery rates, which enables us to model realistic settings [21, 22]. In particular, we consider the classic queue, i.e., .

### III-C Problem Formulation

We denote by the instantaneous cost incurred by content at time . Note that this depends upon its state and also the action that is applied to it. It follows from Little’s Law [18] that minimizing the average latency is equivalent to minimizing the average total number of outstanding requests in the system. Hence, with this choice of instantaneous cost, the average cost represents the average latency for end users. We denote the immediate total cost at time as

(3) |

Our objective is to derive a policy that makes decisions regarding which content should be cached at the capacity-constrained wireless edge to minimize the average latency. This problem can be formulated as the following MDP:

s.t. | (4) |

where the subscript denotes the fact that the expectation is taken with respect to the measure induced by policy and is the set of all feasible content caching policies.

For simplicity, we convert the continuous-time MDP problem (4) into an equivalent discrete-time MDP problem by using the method of uniformization [35]. Thus, a time slot corresponds to either a new arrival (of a request) or a departure (content is delivered to a user). Let denote the set of time slots such that the state of the wireless edge does not change during each time slot. Denote the system state at time slot as . W.l.o.g., we scale time so that the transition probabilities for the -th request queue in the MDP are defined as

(5) |

such that . The equivalent discrete-time MDP, obtained after uniformization, is as follows:

s.t. | (6) |

Henceforth, we refer to (6) as the “original MDP”. Since it is an infinite-horizon average cost per stage problem, in principle it can be solved by using relative value iteration [35, 5].
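To make the uniformization step concrete, the following is a minimal sketch for a single request queue, truncated at a finite level. It is an illustration under stated assumptions rather than the exact kernel in (5): the names `arrival_rate`, `service_rate`, `success_prob`, and `s_max` are hypothetical, departures are assumed to occur only while the content is cached (`action = 1`), and arrivals are assumed to be blocked at the truncation level.

```python
import numpy as np


def uniformized_kernel(arrival_rate, service_rate, success_prob, s_max, action):
    """One-step transition matrix of a single truncated request queue.

    Assumptions (illustrative, not taken from the paper):
      * service_rate(s) is the state-dependent delivery rate in state s,
      * a departure happens only if the content is cached (action == 1)
        and the unreliable transmission succeeds (success_prob),
      * arrivals are blocked once the queue holds s_max requests.
    """
    states = range(s_max + 1)
    # Uniformization constant: an upper bound on the total event rate
    # in any state under either action, so both actions share one clock.
    unif = arrival_rate + max(success_prob * service_rate(s) for s in states)

    P = np.zeros((s_max + 1, s_max + 1))
    for s in states:
        if s < s_max:
            P[s, s + 1] = arrival_rate / unif
        if s > 0:
            P[s, s - 1] = action * success_prob * service_rate(s) / unif
        # The remaining probability mass is a fictitious self-transition.
        P[s, s] = 1.0 - P[s].sum()
    return P


# Example: constant delivery rate, compared under both actions.
P_cached = uniformized_kernel(0.5, lambda s: 1.0, 0.9, s_max=10, action=1)
P_not_cached = uniformized_kernel(0.5, lambda s: 1.0, 0.9, s_max=10, action=0)
```

Using one uniformization constant for both actions keeps the time scale of the discrete-time chain identical whether or not the content is cached, which is what allows the two actions to be compared slot by slot.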

###### Lemma 1.

One can always obtain an optimal policy which fulfills with methods such as the relative value iteration. However, this approach suffers from the curse of dimensionality, i.e., the size of the state space, and hence the computational complexity, grows exponentially with the number of distinct contents , rendering such a solution impractical. Furthermore, this approach fails to provide insight into the structure of the solution. For this reason, many efforts have focused on developing computationally appealing solutions.

### III-D Lagrangian Relaxation

In this subsection, we discuss the Lagrangian relaxation of the original MDP (6) and the corresponding per-content problems. The Lagrangian multipliers together with these per-content problems form the building blocks of our Whittle index policy, which will be formally introduced in Section IV.

Following Whittle’s approach [45], we first consider the following “relaxed problem,” which relaxes the “hard” constraint in (6) to an average constraint:

s.t. | (8) |

Next, we consider the following Lagrangian relaxation [13]. The Lagrangian can be written as,

(9) |

where is the Lagrangian multiplier, and is the caching policy. The dual problem is then defined as

(10) |

Given the Lagrangian multiplier , the relaxed problem decouples into “per-content MDPs,” where the MDP for the -th content is given as follows,

(11) |

where , and is the policy for content . With this decomposition, in order to evaluate the dual function at , it suffices to iteratively solve all independent per-content MDPs (11) [35, 5]. The relaxed problem (8) can be solved by solving each of these per-content MDPs, and then combining their solutions, i.e., to each content we apply the solution corresponding to its individual MDP. Note that this solution does not always provide a content caching policy that is feasible for the original problem (6), since the original problem requires that the cache capacity constraint (1) be met at all times, rather than just in the average sense as in the constraint of (8). The Whittle index policy combines these solutions in such a way that the resulting allocation is also feasible for the original problem (6), i.e., it satisfies the hard constraints.
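The decoupling can be sketched as follows. The notation is assumed for illustration only and may differ from the paper's: $S_n(t)$ is the queue length of content $n$, $A_n(t)\in\{0,1\}$ its caching action, $B$ the cache size, $N$ the number of contents, and $\lambda \ge 0$ the multiplier. We also assume, for simplicity, an inequality form of the average constraint and that the long-run averages exist.

```latex
% Illustrative sketch in assumed notation; not a reproduction of (8)-(11).
\begin{align*}
L(\pi,\lambda)
  &= \lim_{T\to\infty}\frac{1}{T}\,
     \mathbb{E}_{\pi}\!\left[\sum_{t=1}^{T}\sum_{n=1}^{N} S_n(t)\right]
   + \lambda\left(\lim_{T\to\infty}\frac{1}{T}\,
     \mathbb{E}_{\pi}\!\left[\sum_{t=1}^{T}\sum_{n=1}^{N} A_n(t)\right] - B\right) \\
  &= \sum_{n=1}^{N}\,\lim_{T\to\infty}\frac{1}{T}\,
     \mathbb{E}_{\pi}\!\left[\sum_{t=1}^{T}\bigl(S_n(t) + \lambda A_n(t)\bigr)\right]
   - \lambda B .
\end{align*}
% For fixed \lambda the term -\lambda B is constant, and each summand involves
% only content n, so minimizing over \pi splits into N independent problems:
\begin{equation*}
\min_{\pi_n}\;\lim_{T\to\infty}\frac{1}{T}\,
  \mathbb{E}_{\pi_n}\!\left[\sum_{t=1}^{T}\bigl(S_n(t) + \lambda A_n(t)\bigr)\right],
  \qquad n = 1,\dots,N .
\end{equation*}
```

Charging $\lambda$ for each active slot, as above, is equivalent up to the additive constant $\lambda$ to paying a subsidy $\lambda$ for each passive slot, i.e., to using the per-slot cost $S_n(t) - \lambda\,(1 - A_n(t))$; both conventions lead to the same optimal per-content policy.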

## IV Whittle Index Policy

We now design the Whittle index policy for content caching at the wireless edge with unreliable channels, as illustrated in Figure 2. To the best of our knowledge, the Whittle index policy has not previously been used to solve this problem in the literature. Specifically, the content caching problem (6) can be posed as an RMAB problem in which each content corresponds to an arm . At each time slot , the queue length of the corresponding request queue is the state of arm , and is the action taken for the content . We let denote caching for content at time , while denotes not caching. It is well known that the Whittle index policy is a computationally tractable solution to the RMAB [45], with a computational complexity that scales linearly with .

### IV-A Indexability and Whittle Index

Our proposed Whittle index policy is based on the solution to the relaxed problem (8). In order to derive this policy, we need to establish that our MDP is indexable. Roughly speaking, this property requires showing that as the Lagrangian multiplier increases, the collection of states in which the optimal action is passive (i.e., not to cache) increases. This property was first introduced by Whittle [45] and we define it formally here for completeness.

###### Definition 1.

Following the indexability property, the Whittle index in a particular state is defined as follows.

###### Definition 2.

(Whittle Index) The Whittle index in state for the indexable -th MDP (11) is the smallest value of the Lagrangian multiplier such that the optimal policy for content at state is indifferent towards actions and . We denote such a Whittle index as satisfying .

### IV-B The per-content MDP (11) is indexable

Our proof of indexability relies on the “threshold” property of the optimal policy for each per-content MDP (11), i.e., content is cached in the wireless edge only when the number of outstanding requests for content is above a certain threshold. To this end, we focus on the per-content MDP (11) for a particular content , and drop the subscript in the rest of this subsection for ease of exposition.

#### IV-B1 Optimal Threshold Policy

To avoid convergence issues in the presence of bounded value function [35], we start with the problem of minimizing the expected total discounted latency of content requests over the wireless edge, i.e., the discounted latency problem. Then we will extend our results to the average latency problem (11). The discounted latency problem in the equivalent discrete-time MDP is given as

(12) |

where is a discount factor. It is known that there exists an optimal deterministic stationary policy for the discounted latency problem [35]. Hence we only need to consider the class of deterministic stationary policies. We apply the value iteration method to find the optimal policy.

We assume that value functions of the initial state are bounded real-value functions, i.e., the system is stable. Let denote the Banach space of bounded real-value functions on with supremum norm. Define operator as

(13) |

where and the expectation is taken over all possible next state when action is taken at state . Let denote the optimal expected total discounted cost of initial state . Then we have that , i.e., is a solution of the Bellman equation satisfying

(14) |

As described in (2), the next state can be either or . Define for ease of expression. Hence we can further write (14) as

(15) |

The corresponding state-action value function satisfies

Therefore, we have

As shown in [45], for an extremely large value of the Lagrangian multiplier , it is optimal to keep the arm passive (never cache the content), i.e., for each state . Hence, we make the following assumption.

###### Assumption 1.

The Lagrangian multiplier is a finite positive real number such that there exists at least one state satisfying . (Footnote 3: This assumption is valid and such a state always exists. Otherwise, the optimal action is for any state . From (15), we have . It is straightforward to show that such a recursion results in a non-decreasing in . Hence, there exists a lowest value of state satisfying for any finite .)

We now show that the optimal policy for (12) is of the threshold-type under a fixed .

###### Proposition 1.

Proof is provided in Appendix A.
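As a numerical companion to this structural result, the sketch below runs value iteration on a truncated, uniformized single-content queue and extracts the resulting policy, so that its shape can be inspected. It is an illustration under our own assumptions and not the paper's procedure: the delivery rate is constant, the per-slot cost is taken as the queue length plus `lam` whenever the content is cached, arrivals are blocked at `s_max`, and all parameter names and default values are hypothetical. Truncation may distort the policy near `s_max`, so the sketch does not by itself establish the threshold property.

```python
import numpy as np


def solve_discounted_per_content(lam, arrival=0.5, service=1.0, success=0.9,
                                 s_max=30, beta=0.95, tol=1e-10):
    """Value iteration for a discounted single-content caching MDP (sketch)."""
    unif = arrival + success * service        # uniformization constant
    p_up = arrival / unif                     # probability of an arrival
    p_down = success * service / unif         # probability of a delivery
    s = np.arange(s_max + 1)
    up = np.minimum(s + 1, s_max)             # arrivals blocked at s_max
    down = np.maximum(s - 1, 0)
    d = np.where(s > 0, p_down, 0.0)          # no delivery from an empty queue

    V = np.zeros(s_max + 1)
    while True:
        # Passive (not cached): only arrivals can change the queue.
        q0 = s + beta * (p_up * V[up] + (1.0 - p_up) * V)
        # Active (cached): pay lam, deliveries become possible.
        q1 = s + lam + beta * (p_up * V[up] + d * V[down]
                               + (1.0 - p_up - d) * V)
        V_new = np.minimum(q0, q1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, (q1 < q0).astype(int)
        V = V_new


def is_threshold(policy):
    """True if the policy never switches from active back to passive."""
    return bool(np.all(np.diff(policy) >= 0))


value, policy = solve_discounted_per_content(lam=0.5)
threshold_like = is_threshold(policy)
```

Sweeping `lam` and recording the smallest active state gives a quick empirical view of how the passive set grows with the multiplier.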

###### Remark 1.

Existing works, e.g., [38, 15], have also shown that a threshold policy is optimal for an MDP-based problem formulation in order to establish Whittle indexability. The key is to show that if the optimal action at state is active, i.e., , then the optimal action at state is also active, i.e., . This depends on showing the convexity [38] or monotonicity [15] of the discounted value function by leveraging dynamic value iteration. These properties hold under a fixed and state-independent kernel [38, 15]; however, they are hard to show, or might not hold at all, when transition probabilities are state-dependent as in (15). Consequently, existing results cannot be directly applied here, and the analyses in this paper and those in [38, 15] are significantly different.

The following proposition extends our results in Proposition 1 for the discounted latency problem in (12) to the original average latency in (11).

###### Proposition 2.

There exists an optimal stationary policy of the threshold-type for the average latency problem in (11).

###### Proof.

According to [24], the optimal expected total discounted latency under optimal policy with discount factor and the optimal average latency under optimal policy satisfy Since our action set is finite, there exists an optimal stationary policy for the average latency problem such that [24], which implies the optimal policy for (11) is of the threshold-type. ∎

#### IV-B2 Indexability of the per-content MDP (11)

We are now ready to show that the per-content MDP (11) is indexable.

###### Proposition 3.

The per-content MDP (11) is indexable.

Proof is provided in Appendix B.

###### Proposition 4.

###### Remark 2.

Since the cost function and stationary probabilities are known, (16) can be numerically computed. From (16), it is clear that the index of content does not depend on the number of requests to other contents . Therefore, it provides a systematic way to derive simple policies that are easy to implement.

From (16), it is clear that the stationary distribution of the threshold policy is required to compute the Whittle indices. We now compute this stationary distribution under our model.

###### Proposition 5.

The stationary distribution of the threshold policy satisfies

(17) |

where is a dummy state representing state to .

Proof is provided in Appendix C.
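For intuition, the stationary distribution of a threshold policy can also be obtained numerically on a truncated chain. The sketch below is an assumption-laden illustration and not the closed form in (17): the content is taken to be cached exactly when the queue length is at least `threshold`, the delivery rate is constant, arrivals are blocked at `s_max`, and all names and default values are hypothetical.

```python
import numpy as np


def threshold_stationary_distribution(threshold, arrival=0.5, service=1.0,
                                      success=0.9, s_max=40):
    """Stationary distribution of a truncated queue under a threshold policy."""
    unif = arrival + success * service
    n = s_max + 1
    P = np.zeros((n, n))
    for s in range(n):
        active = s >= threshold               # assumed deterministic threshold rule
        up = arrival / unif if s < s_max else 0.0
        down = success * service / unif if (active and s > 0) else 0.0
        if s < s_max:
            P[s, s + 1] = up
        if s > 0:
            P[s, s - 1] = down
        P[s, s] = 1.0 - up - down

    # Solve pi P = pi together with sum(pi) = 1 as one least-squares system.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi


pi = threshold_stationary_distribution(threshold=5)
```

Under this rule, states strictly below `threshold - 1` can only be left through arrivals and are never re-entered, so they carry no stationary mass, which is consistent with lumping the low states into a single dummy state.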

### IV-C Whittle Index Policy

We now describe how the solutions to the relaxed problem (8) are used to obtain a policy for the original problem (6). It is clear that the optimal solutions to (8) are not always feasible for (6), since in the latter at most contents can be cached at the wireless edge. To this end, Whittle [45] proposed a heuristic, referred to as the Whittle index policy, which assigns an index to the -th MDP (11) for all , which depends on its current state and current time. The Whittle index policy then activates the arms with the highest Whittle indices given that the -th arm is in state at the current time. Although the Whittle index policy is not optimal for the original problem (6) in general, it has been shown that such a policy is asymptotically optimal [43, 41] as the number of contents and the cache size are scaled up, while keeping their ratio constant.
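The selection step itself is simple once the indices are available. A minimal sketch, assuming the current Whittle indices of all contents have already been computed and are passed in as an array (the function name and arguments are hypothetical):

```python
import numpy as np


def whittle_index_policy(indices, cache_size):
    """Cache the contents with the largest current Whittle indices.

    indices[n] is the Whittle index of content n in its current state;
    the returned 0/1 vector activates at most cache_size contents.
    """
    indices = np.asarray(indices, dtype=float)
    actions = np.zeros(indices.shape[0], dtype=int)
    # Stable sort on the negated indices: ties go to the lower content id.
    top = np.argsort(-indices, kind="stable")[:cache_size]
    actions[top] = 1
    return actions


actions = whittle_index_policy([0.3, 1.2, 0.7, 0.1], cache_size=2)
```

Because the decision only requires sorting one scalar per content, the hard capacity constraint is met in every slot without searching over the joint state space.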

## V Q-Whittle Learning

The exact knowledge of the transition kernel associated with the MDP described in Section III-B is needed to compute the Whittle index policy developed in Section IV. However, this kernel is often unknown and time-varying at wireless edges. We now adopt a learning perspective on top of the Whittle index policy. Specifically, we design a novel model-free reinforcement learning algorithm, called Q-Whittle learning, which leverages the threshold structure of the optimal policy developed in Section IV while learning Q-functions for different state-action pairs.

### V-A Preliminaries

We first review some preliminaries for Q-learning augmented Whittle index policy, which was first proposed in [12], and further generalized in [3]. The per-content MDP in (11) can be formulated as a DP [35, 5], i.e.,

(18) |

where is the minimal long-term average cost of this MDP with parameter , and is the optimal state value up to an additive constant, which depends on the parameter The Q-function can then be defined as [5]

(19) |

such that .

The Whittle index associated with state [45] is defined as the value such that actions and are equally favorable in state with a “subsidy” , i.e., . Combining with (19), the closed-form for satisfies [12]

(20) |

However, the unknown transition probabilities prevent us from directly evaluating the Whittle index according to (20). Next, we overcome this limitation by proposing a Q-learning based algorithm to jointly learn the Q-function and Whittle index by leveraging the inherent structure in our problem. Since the Q-function and Whittle index are coupled, this requires a two-timescale iteration wherein the faster timescale performs the Q-function update with a fixed , which is itself updated on a slower timescale.

### V-B Q-Whittle Learning

Since the parameter introduced by the long-term average MDP is unknown, a widely-adopted approach is to learn the discounted MDP (with the same states, actions, cost function, and transition) instead, for some discount factor according to the Blackwell optimality theorem [7]. This indicates that there exists an optimal solution of the -discounted cumulative cost of the discounted MDP for all , and when is close to , this solution is also long-run average optimal. Such a technique has been applied to the study of average-reward MDP in [44] and references therein. Thus we focus on the discounted Q-learning below.

As the optimal policy for the per-content MDP (11) is of the threshold-type, i.e., provided a threshold , the arm is made passive for , and active , our key insight is that this appealing property can significantly reduce the exploration overhead for the update of Q-functions. Specifically, the Q-learning under such a threshold policy only needs to update Q-functions with all other state-action values unchanged since the optimal action for is deterministic, i.e., . Similarly, only is updated for . When the arm is in state , it randomly chooses actions or . For simplicity, we assume that these two actions are equally chosen in state This key observation dramatically reduces the complexity of Q-learning on top of Whittle index compared to existing methods, e.g., [3, 12].

More specifically, with this key observation, our Q-functions are updated as follows:

Case 1: When , we have

(21) |

where is a sequence of real numbers in , satisfying and . follows from the above insight that only Q-functions need to be updated when . This differs significantly from existing methods, e.g., [3, 12], where both and need to be updated when . This is because our proposed algorithm builds the threshold-type optimal policy into the Q-function update, a structure which either does not exist or is not leveraged in existing works. Similar insights lead to the updates of and .

Case 2: When , we have

(22) |

where the updates of , and leverage the similar insights as those in Case 1.

We denote by the Q value at the beginning of time step under the threshold policy . With the above Q-function updates, we have

(23) |

Given the above Q-function updates, the parameter under the threshold policy at time step is updated as

(24) |

where and

We summarize the Q-Whittle learning in Algorithm 1. Since the wireless edge can cache at most contents, an easy implementation is to find the possible activation set for threshold at epoch and activate the arms with highest Whittle indices . To leverage the exploration gain, with probability , the algorithm randomly selects arms from (lines 6-9). The Q-function and parameter updates can be simply repeated for all contents (lines 11-12).
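The per-content updates can be sketched in a few lines. This is an illustrative reading of the threshold-based scheme rather than a transcription of (21)-(24) or Algorithm 1: we assume a per-slot cost equal to the queue length plus `lam` when the content is cached, a tabular `Q` of shape (number of states, 2), and hypothetical names throughout. With this cost convention, the multiplier is raised when caching looks cheaper than not caching at the threshold state and lowered otherwise; the paper's sign convention may differ.

```python
import numpy as np


def threshold_action(state, threshold, rng):
    """Passive below the threshold, active above it, a fair coin at it."""
    if state < threshold:
        return 0
    if state > threshold:
        return 1
    return int(rng.integers(2))


def q_whittle_step(Q, lam, state, action, next_state, threshold,
                   alpha, eta, beta=0.95):
    """One fast-timescale Q update and one slow-timescale multiplier update."""
    # Assumed cost: holding cost plus the multiplier when caching.
    cost = state + lam * action
    target = cost + beta * Q[next_state].min()
    # Only the visited pair, chosen by the threshold rule, is updated.
    Q[state, action] += alpha * (target - Q[state, action])
    # Drive the two actions toward indifference at the threshold state.
    return lam + eta * (Q[threshold, 0] - Q[threshold, 1])


rng = np.random.default_rng(0)
Q = np.zeros((5, 2))
a = threshold_action(state=3, threshold=3, rng=rng)
lam = q_whittle_step(Q, lam=1.0, state=3, action=1, next_state=2,
                     threshold=3, alpha=0.5, eta=0.1)
```

In a full run, `alpha` would decay more slowly than `eta`, so that the Q-values approximately equilibrate for the current multiplier before the multiplier moves.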

###### Remark 3.

Some definitions (e.g., ) in this paper are similar to those in [3, 12], which also studied Whittle index policy based Q-learning through a two-timescale update. However, our Q-Whittle learning algorithm differs significantly from those in [3, 12]. First, [3, 12] adopted the conventional $\epsilon$-greedy rule for Q-value updates. In contrast, we build the threshold-type structure of the optimal policy into the Q-value updates as in (21) and (22). Such a threshold-type Q-value update dramatically reduces the computational complexity (e.g., by reducing the exploration space by at least half) since each state only has a fixed action to explore. Second, the threshold policy further enables us to update Whittle indices in an incremental manner, i.e., the converged Whittle index in state can be taken as the initial value for the subsequent state (line 14 in Algorithm 1), instead of being randomly initialized as in [12, 3]. This further speeds up the learning process. In addition, [12] lacks a convergence guarantee. While [3] directly considered Q-learning for average reward with a tuning scheme for Whittle indices, it required a reference function to approximate the unknown parameter . The choice of the reference function is not unique and may be problem dependent. We depart from [3] by studying the discounted counterpart as motivated by [44], since the difference in the optimal value between the discounted and average settings is small as long as is close to [24, 7]. Recently, another line of work [6] leveraged Q-learning to approximate Whittle indices through a single-timescale SA where the Q-function and Whittle indices were learned independently. However, [6] considered the finite-horizon MDP and cannot be directly applied to infinite-horizon discounted or average-reward MDPs. Finally, we are the first to provide a finite-time analysis of Whittle index based Q-learning, which further differentiates our work.

## VI Finite-Time Performance Analysis

In this section, we provide a finite-time analysis of our Q-Whittle learning algorithm, which can be viewed through the lens of 2TSA. Our key technique is motivated by [11], which deals with a general nonlinear 2TSA. To this end, we first need to rewrite our Q-Whittle learning updates in (23) and (24) in the form of a 2TSA.

### VI-A Two-Timescale Stochastic Approximation

Given the threshold policy , the corresponding true Whittle index associated with the threshold state is . Denote as the optimal action-value function obtained by Q-Whittle learning in Algorithm 1. Following the conventional ODE method, we first transform the Q-Whittle updates in (23) and (24) into a standard 2TSA. For ease of exposition, we present the synchronous update where every entry of the Q-function is updated at each time step [47, 39]. Our goal is to rewrite the updates (23) and (24) as

(25) | ||||

(26) |

where represents the Q-function update at time step ; is an appropriate martingale difference sequence conditioned on the σ-field generated by the iterations and trajectory up to time step ; is a suitable error sequence; and are appropriate Lipschitz functions defined below that satisfy the conditions needed for our ODE analysis; and the step sizes satisfy Assumption 4 below.
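To illustrate the general shape of a two-timescale update with noise, the following toy sketch runs a fast iterate `x` and a slow iterate `y` with different step sizes. It is not the Q-Whittle update: the drift functions `h` and `g`, the step-size exponents, and the Gaussian noise model are all assumptions chosen so that the coupled system has the single equilibrium `x = y = 1`.

```python
import random


def two_timescale_sa(num_steps=20000, noise_std=0.1, seed=0):
    """Toy two-timescale stochastic approximation (illustrative only).

    x is the fast iterate and y the slow iterate. The drifts below are
    hypothetical: h pulls x toward y, and g pulls y toward 1 once x
    tracks y, so the equilibrium is x = y = 1.
    """
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    for n in range(num_steps):
        alpha = 1.0 / (n + 1) ** 0.6  # fast step size
        beta = 1.0 / (n + 1)          # slow step size; beta / alpha -> 0
        h = y - x                     # fast drift
        g = 1.0 - x                   # slow drift
        # Both iterates are updated from the same previous values.
        x_next = x + alpha * (h + rng.gauss(0.0, noise_std))
        y_next = y + beta * (g + rng.gauss(0.0, noise_std))
        x, y = x_next, y_next
    return x, y
```

Because the slow step size decays faster than the fast one, the fast iterate sees the slow one as nearly constant, which is the separation that the ODE method exploits.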

Specifically, using the operator in (13), we rewrite the Q-learning update in (23) for as

where

Hence, we have

(27) |

which is Lipschitz in both and . Similarly, we have

(28) |

which is Lipschitz in . W.l.o.g., we assume since the update of in (24) is deterministic. With these two functions identified, the asymptotic convergence of our 2TSA can be established by using the ODE method following the solution of suitably defined differential equations [8, 39, 3, 10, 11]. For ease of exposition, we temporarily assume fixed step sizes here, so that the 2TSA reduces to the following differential equations:

(29) |

where the ratio represents the difference in timescale between these two updates. Our focus here is on characterizing the finite-time convergence rate of to the global asymptotically optimal equilibrium point of (29) for each . Using an idea from [11], the key part of our analysis is based on the choice of two step sizes and a Lyapunov function. We first define the following two error terms as

(30) |

which characterize the coupling between and . If and go to zero simultaneously, is established. Thus, to prove the convergence of of our 2TSA to its true value , we instead study the convergence of by providing a finite-time analysis of the mean squared error generated by (25)-(26). To couple the two updates, we define the following Lyapunov function

(31) |

We make the following assumptions in our analysis.

###### Assumption 2.

Provided , there exists an operator such that is the unique solution to where and are Lipschitz continuous with positive constants and such that

(32) |

The operator is Lipschitz continuous with constant , i.e.,

###### Assumption 3.

###### Assumption 4.

The step sizes and satisfy , is non-increasing as and

### VI-B Finite-Time Analysis of Q-Whittle Learning

###### Theorem 1.

The proof is provided in Appendix D.

###### Remark 4.

Our finite-time analysis of Q-Whittle learning consists of two steps. First, we rewrite the Q-Whittle updates as a 2TSA in (25)-(26). The key is to identify two critical terms and . Second, we prove a bound on the finite-time convergence rate of Q-Whittle learning by leveraging and generalizing the machinery of nonlinear two-timescale stochastic approximation [11]. The key is the choice of two step sizes (as characterized in Theorem 1) and the Lyapunov function given in (31). Though the main steps of our proofs are motivated by [11], we need to characterize the specific requirements for our setting, as mentioned above. Note that we do not need the assumption that and are strongly monotone as in [11], which requires a re-derivation of the main results.

## VII Numerical Results

In this section, we numerically evaluate the performance of our Whittle index policy and Q-Whittle learning algorithm using both synthetic and real traces.

### VII-A Baselines

Content Caching Algorithms. We compare our Whittle index policy to state-of-the-art methods when system parameters are known: (a) Greedy policy, which stores the contents with the largest request queues; (b) *Continuous-Greedy with Power Series approximation* (CG-PS) [29], an optimization based algorithm using samples with a gradient estimator based on power series expansion; (c) *Projected Gradient Ascent* (PGA-10) [16], an optimization based method with a measure period ; and (d) Least-Recently-Used (LRU). For CG-PS and PGA-10, we adopt the default settings in [29] and [16], and omit their descriptions due to space constraints. We refer interested readers to [29] and [16] for details.
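For readers unfamiliar with the two simplest baselines, a minimal sketch of LRU and of the Greedy policy is given below. The class and function names (`LRUCache`, `greedy_cache`) and the representation of request queues as a list of queue lengths are assumptions made for illustration; they are not taken from the evaluated implementations.

```python
from collections import OrderedDict


class LRUCache:
    """Least-Recently-Used cache over content identifiers (sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._store = OrderedDict()  # oldest entry first

    def request(self, content_id):
        """Return True on a cache hit, False on a miss."""
        if content_id in self._store:
            # Hit: mark the content as most recently used.
            self._store.move_to_end(content_id)
            return True
        # Miss: cache the content, evicting the least recently used one.
        self._store[content_id] = None
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)
        return False

    def contents(self):
        return list(self._store)


def greedy_cache(queue_lengths, capacity):
    """Greedy baseline: cache the contents with the largest request queues."""
    ranked = sorted(range(len(queue_lengths)),
                    key=lambda i: queue_lengths[i], reverse=True)
    return sorted(ranked[:capacity])
```

With a capacity of 2, the request sequence 1, 2, 1, 3, 2 yields a single hit (the second request for content 1) under this LRU sketch.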

Q-learning based Algorithms. We compare our Q-Whittle learning to existing Q-learning algorithms (see Remark 3) when system parameters are unknown: (a) *Q-learning Whittle Index Controller* (Fu) [12]; (b) *Q learning for Whittle index* (AB) [3]; (c) *Whittle Index Q-learning* (WIQL) [6]; and (d) our Whittle policy, which assumes full knowledge of the underlying transition probabilities. The discount factor is , learning rates are initialized to and , and are decayed by half every time steps. The exploration and exploitation parameter is set as .
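The halving schedule for the learning rates can be written as a simple step-decay rule. The sketch below is illustrative: the function name `step_decay` is hypothetical, and the numeric values in the usage note are placeholders rather than the settings used in the experiments.

```python
def step_decay(initial_rate, decay_period, step):
    """Learning rate after `step` time steps when the rate is halved
    every `decay_period` steps (hypothetical helper)."""
    num_halvings = step // decay_period
    return initial_rate * 0.5 ** num_halvings
```

For example, with a placeholder initial rate of 0.1 and a placeholder period of 1000 steps, the rate is 0.1 for steps 0 to 999, 0.05 for steps 1000 to 1999, and so on. The same rule would be applied separately to each of the two learning rates.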

### VII-B Evaluation Using Synthetic Traces

We simulate a system with the number of distinct contents