1 Introduction
In contrast to the classical multi-armed bandits (MABs), restless multi-armed bandits (RMABs), introduced by Whittle (1988), assume reward distributions that change over time. Due to their non-stationary nature, RMABs can model more complicated systems and have thus attracted attention in both practice and the theoretical literature. In practice, they are used in a wide spectrum of applications including sensor management (Chp. 7 in Hero et al. (2007) and Chp. 5 in Biglieri et al. (2013)), dynamic channel access problems (Liu et al., 2011, 2013), and online recommendation systems (Meshram et al., 2017). Theoretically, a variety of research communities have contributed to the literature on restless bandits, e.g., complexity theory (Blondel and Tsitsiklis, 2000),
applied probability (Weber and Weiss, 1990), and optimization (Bertsimas and Niño-Mora, 2000).

In this setting, there are $N$ independent arms indexed by $i \in [N]$.^1 (^1 For an integer $n$, we denote the set $\{1, \dots, n\}$ by $[n]$.) Each arm is characterized by an internal state which evolves in a Markovian fashion according to the (possibly distinct) transition matrices $P_i^{\text{active}}$ and $P_i^{\text{passive}}$, depending on whether the arm is pulled (i.e., active) or not (i.e., passive). The reward of pulling an arm depends on its state, which is the source of the non-stationarity.
We aggregate the transition matrices as $\theta = \{(P_i^{\text{active}}, P_i^{\text{passive}})\}_{i \in [N]}$ and view this problem as a reinforcement learning problem where $\theta$ is unknown to the learner. One complication of this problem lies in defining the baseline competitor against which the learner competes. Without additional assumptions, the optimal policy is not guaranteed to exist, and even when it exists, Papadimitriou and Tsitsiklis (1999) show that it is generally PSPACE-hard to compute.
Researchers take different paths to tackle this challenge. Some define the regret using a simpler policy that can be easily computed (e.g., see Tekin and Liu (2012); Liu et al. (2013)). They compare the learner's reward to a policy that pulls a fixed set of arms every round. Their algorithms are efficient and enjoy strong (logarithmic) regret guarantees, but this baseline policy is known to be weak in the RMAB setting, which makes the regret less meaningful. Our empirical results in Sec. 6 also show the weakness of this policy. A breakthrough was made by Ortner et al. (2012), who show a sublinear regret bound against the optimal policy. However, they ignore the computational burden of their algorithm.
Jung and Tewari (2019) propose another interesting direction in that they introduce a deterministic policy mapping $\mu$. It takes the system parameter $\theta$ as input and outputs a deterministic stationary policy $\mu(\theta)$. The learner then competes against the policy $\mu(\theta_\star)$, where $\theta_\star$ denotes the true system. This framework is general enough to include the best fixed arm policy and the optimal policy mentioned earlier. Consequently, one can obtain an efficient algorithm by choosing an efficient mapping, or make the regret more meaningful by choosing a stronger policy. In fact, different lines of work (e.g., Whittle (1988); Liu and Zhao (2010); Meshram et al. (2017)) study an efficient way, namely the Whittle index policy, to approximate the optimal policy. Using this policy as the mapping, one can obtain an efficient algorithm and a meaningful regret simultaneously.
In this paper, we also adopt the policy mapping from Jung and Tewari (2019) and answer an open question raised by them. Specifically, they prove a regret bound for Thompson sampling in episodic restless bandits, where the system periodically resets. Thanks to the episodic assumption, the problem boils down to a finite horizon problem, which simplifies the analysis. However, in many cases (e.g., online recommendations) the periodic reset is not natural, and they list the analysis of a learning algorithm over an infinite time horizon as an open question.
We identify explicit conditions in Sec. 4 that ensure the existence of a solution to the Bellman equation of the entire Markov decision process (MDP). Vanilla Thompson sampling is hard to analyze in this setting, so we adapt Thompson sampling with dynamic episodes (TSDE) of Ouyang et al. (2017), originally proposed for fully observable MDPs. TSDE (Algorithm 1) has two termination conditions, one deterministic and one random, and switches to a new episode when either is met. At the beginning of each episode, TSDE draws a system parameter from the posterior distribution, computes the corresponding policy, and runs this policy throughout the episode. We theoretically prove a sublinear regret bound for this algorithm and empirically test it on a simulated dynamic channel access problem.

1.1 Main Result
As mentioned earlier, our learner competes against the policy $\mu(\theta_\star)$ without the knowledge of $\theta_\star$. We denote the average long term reward of $\mu(\theta_\star)$ on the system $\theta_\star$ by $J(\theta_\star)$, which is a well-defined notion under certain assumptions that will be discussed later. Then we define the frequentist regret by
(1) $R(T) := \sum_{t=1}^{T} \big( J(\theta_\star) - r_t \big),$
where $r_t$ is the learner's reward at time $t$. We focus on bounding the following Bayesian regret
(2) $BR(T) := \mathbb{E}_{\theta_\star \sim Q}\big[ R(T) \big],$
where $Q$ is a prior distribution over the system parameters and is known to the learner. Our main result is a bound on the Bayesian regret of TSDE.
Theorem 1.
The Bayesian regret of TSDE satisfies $BR(T) = \tilde{O}(\sqrt{T})$, where $\tilde{O}$ hides logarithmic factors and problem-dependent constants; the exact upper bound appears later in Sec. 5.
2 Preliminaries
We begin by formally defining our problem setting.
2.1 Problem Setting
As stated earlier, we focus on a Bayesian framework where the true system, denoted by $\theta_\star$, is a random object drawn from a prior distribution $Q$ before the interaction with the system begins. In line with Ouyang et al. (2017), we assume that the prior is known to the learner, and we denote its support by $\Theta$.
At each time step $t$, the learner selects $m$ arms out of the $N$, which become active while the others remain passive. Following Ortner et al. (2012), we impose the passive Markov chains $P_i^{\text{passive}}$ to be irreducible and aperiodic. As a result, we can associate with each arm $i$ the mixing time of $P_i^{\text{passive}}$. Let $\nu_{i,s}^{n}$ be the distribution of the state of arm $i$ starting from a state $s$ and remaining passive for $n$ steps, and let $\nu_i^{\text{stat}}$ be the stationary distribution. Then, we define
(3) $T_i^{\text{mix}}(\epsilon) := \min\big\{ n \ge 1 : \max_{s} \lVert \nu_{i,s}^{n} - \nu_i^{\text{stat}} \rVert_1 \le \epsilon \big\},$
and work under the assumption of a known mixing time.^2 (^2 The knowledge of the mixing time may be relaxed to the knowledge of an upper bound on it, without affecting our result.)
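To make the mixing time in Eq. 3 concrete, here is a minimal numerical sketch; the two-state passive chain, the helper name `mixing_time`, and the tolerance `eps` are our own illustrative choices, not the paper's.

```python
import numpy as np

def mixing_time(P, eps=0.25, t_max=10_000):
    """Smallest n such that every row of P^n is within eps in L1 distance
    of the stationary distribution of the irreducible, aperiodic chain P."""
    # Stationary distribution: left eigenvector of P for eigenvalue 1.
    w, v = np.linalg.eig(P.T)
    stat = np.real(v[:, np.argmin(np.abs(w - 1))])
    stat = stat / stat.sum()
    Pn = np.eye(P.shape[0])
    for n in range(1, t_max + 1):
        Pn = Pn @ P
        if np.abs(Pn - stat).sum(axis=1).max() <= eps:
            return n
    raise RuntimeError("chain did not mix within t_max steps")

# Hypothetical two-state passive chain (rows: from-state, columns: to-state).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
```

Loosening `eps` shrinks the returned time, which is why knowing only an upper bound on the mixing time suffices.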
Assumption 1 (Mixing times).
For all $i \in [N]$, the chain $P_i^{\text{passive}}$ is irreducible and aperiodic, and its mixing time is known to the learner.
The learner's action at time $t$ is written as $a_t \in \{0,1\}^N$ with $\lVert a_t \rVert_1 = m$, where $a_{i,t} = 1$ indicates the active action. For all chosen arms, the learner observes the state $s_{i,t}$ and receives a reward $r_i(s_{i,t})$, where the rewards $r_i$ are deterministic known functions of the state for all $i \in [N]$. The objective of the learner is to choose the best sequence of arms, given the history (states and actions) observed so far, which maximizes the long term average reward
(4) $\limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[ \sum_{t=1}^{T} \sum_{i : a_{i,t} = 1} r_i(s_{i,t}) \Big].$
2.2 From POMDP to MDP
By nature, the RMAB problem we consider is a partially observable Markov decision process (POMDP), since the arms evolve in a Markovian fashion and we only observe the states of the active arms. Nonetheless, one can turn this POMDP into a fully observable Markov decision process (MDP) by introducing belief states, i.e., distributions over states given the history. Notice that the number of belief states therefore becomes (countably) infinite even if the original problem is finite. Following Ortner et al. (2012) and Jung and Tewari (2019), we track the history by introducing a meta-state $x_t = (s_i, n_i)_{i \in [N]}$, fully observed at time $t$, from which we can reconstruct the belief states. For each arm $i$, $s_i$ is the last observed state of the arm before time $t$, and $n_i$ is the time elapsed since this last observation. Further, it is clear that $(x_t)_{t \ge 1}$ is a Markov process on a countably infinite state space $\mathcal{X}$. As a result, the maximization of the partially observable problem in Eq. 4 is equivalent to the maximization of the fully observable one
(5) $\limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[ \sum_{t=1}^{T} r_\theta(x_t, a_t) \Big],$
where
(6) $r_\theta(x, a) := \sum_{i : a_i = 1} \mathbb{E}_\theta\big[ r_i(s_i') \,\big|\, x \big]$
is the expected active reward under the belief induced by the meta-state $x$.
We use the notation $\mathbb{E}_\theta$ and $\mathbb{P}_\theta$ to emphasize that the random behavior of $(x_t)$ is governed by the system $\theta$. We also assume that the initial state $x_1$ is known to the learner.
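The meta-state update itself is mechanically simple. The sketch below is a hypothetical helper (`update_meta_state` is our own name, and the exact indexing convention may differ from the paper's): pulled arms reveal a fresh state and reset their elapsed-time counter, while passive arms only age.

```python
def update_meta_state(meta, active, observed):
    """One-round update of the meta-state: for each arm, the last observed
    state and the number of rounds elapsed since that observation.

    meta:     dict arm -> (last_state, n)
    active:   set of arms pulled this round
    observed: dict arm -> state observed for each arm in `active`
    """
    new_meta = {}
    for arm, (s, n) in meta.items():
        if arm in active:
            new_meta[arm] = (observed[arm], 1)   # just observed: reset the clock
        else:
            new_meta[arm] = (s, n + 1)           # one more passive round elapsed
    return new_meta

meta = {0: ("good", 3), 1: ("bad", 1)}
meta = update_meta_state(meta, active={0}, observed={0: "bad"})
# meta is now {0: ("bad", 1), 1: ("bad", 2)}
```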
2.3 Policy Mapping
To maximize the long term average reward in Eq. 5, Ortner et al. (2012) construct a finite approximation of the countable MDP, which allows them, under a bounded diameter assumption, to compute an optimal policy for a given $\theta$. However, the computational complexity of their approach is prohibitive for practical applications. As explained in the introduction, we follow a different approach, in line with Jung and Tewari (2019), which achieves both tractability and optimality through the use of a policy mapping $\mu$. It associates each parameter $\theta \in \Theta$ with a stationary deterministic policy $\mu(\theta)$. To ensure the well-posedness of the long-term average reward, we impose the following assumption on $\mu$.
Assumption 2 (Bounded span).
For all $\theta \in \Theta$, the parameter/policy pair $(\theta, \mu(\theta))$ satisfies Cond. 2.
3 Algorithm
Algorithm 1 builds on Thompson Sampling with Dynamic Episodes (TSDE) of Ouyang et al. (2017). At the beginning of each episode $k$, we draw a system parameter $\theta_k$ from the latest posterior, compute the policy $\pi_k = \mu(\theta_k)$, and run $\pi_k$ throughout the episode. We proceed to the next episode if one of the termination conditions, which will appear shortly, occurs.
Before introducing the termination conditions, let us discuss Asm. 1. As pointed out in Ortner et al. (2012, Eq. 1), the state distribution of a passive arm approaches its stationary distribution at a rate governed by the mixing time. As we only need this approximation at an accuracy of order $1/T$, which does not affect the regret significantly, we define the truncation level $L := \max_i T_i^{\text{mix}}(1/T)$.
Here we assume the time horizon $T$ is known. When it is unknown, we can use the doubling trick and get the same regret bound up to a constant factor. We remark that $L$ scales only logarithmically with $T$.
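For completeness, the doubling trick mentioned above can be sketched as follows (the names `run_with_doubling` and `learn` are illustrative, not from the paper): restart the algorithm with geometrically growing horizon guesses, which preserves the regret rate up to a constant factor.

```python
def run_with_doubling(learn, total_T):
    """Doubling trick: run `learn(horizon)` for horizons 1, 2, 4, ... until
    total_T rounds have elapsed. Each restart is tuned for its own horizon,
    and the overall regret rate is preserved up to a constant factor."""
    rewards, t, j = [], 0, 0
    while t < total_T:
        horizon = min(2 ** j, total_T - t)   # truncate the final phase
        rewards.extend(learn(horizon))
        t += horizon
        j += 1
    return rewards

# A dummy learner that collects reward 1 every round of its phase.
out = run_with_doubling(lambda h: [1.0] * h, total_T=10)
```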
For tuples $z = (i, s, n, a)$ of an arm, its last observed state, the elapsed time, and an action, we define the visit counter
$N_t(z) := \sum_{u < t} \mathbb{1}\{ (i, s_{i,u}, n_{i,u}, a_{i,u}) = z \}.$
Then we introduce the truncated counter
$\tilde{N}_t(i, s, n, a) := N_t(i, s, \min(n, L), a).$
The intuition behind this aggregation is that the distribution of the states remains similar for all sufficiently large $n$, thanks to the mixing time. As a result, the number of possible tuples with $n \le L$ is finite. When there is no ambiguity, we write $\tilde{N}_t(z)$ for brevity and let $\mathcal{Z}$ be the set of all possible values of $z$.
We terminate episode $k$ at time $t$ if
(7) $t - t_k > T_{k-1}$ or $\tilde{N}_t(z) > 2\, \tilde{N}_{t_k}(z)$ for some $z \in \mathcal{Z}$,
where $t_k$ is the start time of episode $k$ and $T_k = t_{k+1} - t_k$ represents the length of episode $k$. This quantity can differ across episodes, which is where the name dynamic episodes comes from. In addition, the second condition makes the episode length random, and one recovers the well-known lazy update scheme from this condition (Jaksch et al., 2010; Ouyang et al., 2017). The underlying intuition is that one should update the policy only after gathering enough additional information about the unknown Markov process.
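The two termination rules can be sketched as a single test run every round; the helper below is a hypothetical illustration in the style of TSDE (Ouyang et al., 2017), with our own names and with the deterministic rule phrased as "previous length plus one".

```python
def episode_should_end(t, t_k, prev_len, counts, counts_at_start):
    """Dynamic-episode termination test in the style of TSDE
    (Ouyang et al., 2017); names and exact constants are illustrative.

    t:               current time
    t_k:             start time of the current episode
    prev_len:        length of the previous episode
    counts:          dict tuple -> visit count at time t
    counts_at_start: dict tuple -> visit count at time t_k
    """
    # Deterministic rule: the episode cannot exceed the previous length plus one.
    if t - t_k >= prev_len + 1:
        return True
    # Random (lazy-update) rule: some visit count has doubled since t_k.
    for z, n in counts.items():
        if n > 2 * counts_at_start.get(z, 0):
            return True
    return False
```

Note that a tuple never visited before the episode started trips the doubling rule on its first visit, which is what makes newly reached meta-states trigger a policy update quickly.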
4 Planning Problem
The MDP reformulation in Sec. 2.2 reduces the objective to maximizing Eq. 5. However, we inherit from the original POMDP problem severe difficulties in the planning task. For example, given the parametrization $\theta$, how can we efficiently compute a stationary and deterministic policy (i.e., one that maps a state to an action in a deterministic manner) that maximizes the average long term reward, and more importantly, does such a policy exist? Unfortunately, the average reward POMDP problem is not well understood, in contrast with the finite state average reward MDP. In particular, it is known (Bertsekas, 1995) that the long term average reward may not be constant w.r.t. the initial state. Even when it is, 1) the Bellman equation may not have a solution; 2) value iteration may fail to converge to the optimal average reward; 3) there may not exist an optimal policy, stationary or non-stationary; 4) finally, even when the optimal policy exists, Papadimitriou and Tsitsiklis (1999) show that it is generally PSPACE-hard to compute it.
To overcome this difficulty, Ortner et al. (2012) perform a state aggregation to reduce the countably infinite MDP to a finite one, which under the bounded diameter assumption can be solved using standard techniques. Although this reduction allows them to compute an optimal policy, the computational complexity of their approach remains prohibitive for practical applications. On the other hand, a significant amount of work has been done on designing good policies in the RMAB framework, for instance the best fixed arm policy (which is optimal in the classical MAB framework), the myopic policy (Javidi et al., 2008), and the Whittle index policy (Whittle, 1988; Liu and Zhao, 2010). In line with Jung and Tewari (2019), we leverage this prior knowledge through an alternative approach that consists in competing with the best policy within some known class of policies. Formally, let $\Pi$ be the set of stationary deterministic policies, and assume a policy mapping $\mu : \Theta \to \Pi$ is given and known to the learner. This class of deterministic mappings is quite rich in that the optimal policy can also be represented when it exists. If one cares more about efficiency, one can use an efficient mapping, at the cost of weakening the competitor.
Finally, in contrast to Ortner et al. (2012), our approach does not turn the countable MDP problem into a finite one. Hence, it requires a further condition on the parameter space and the policy mapping for the average reward criterion in Eq. 5 to be well-posed. More precisely, we expect the average reward to be independent of the initial state and associated with a Bellman equation with a bias function of bounded span. For a given $\theta$ and associated policy $\pi = \mu(\theta)$, we introduce the following conditions.
Condition 1.
Let $\mathcal{H}$ be the set of bounded span real-valued functions on $\mathcal{X}$. There exist $h \in \mathcal{H}$ and a constant $J$ which satisfy, for all $x \in \mathcal{X}$,
$J + h(x) = r_\theta(x, \pi(x)) + \mathbb{E}\big[ h(x') \big],$
where the expectation is taken over the next state $x'$ evolving from $x$ given the action $\pi(x)$ and the system $\theta$.
Under Cond. 1, it is known (see Prop. 2) that the long term average reward of $\pi$ is well-defined (the $\limsup$ reduces to the standard $\lim$), independent of the initial state $x_1$, and associated with the Bellman equation with a bounded span bias function. However, Cond. 1 is implicit and not easy to verify, as it relies on an existence result.^3 (^3 If a function $h$ satisfies Cond. 1, it is not unique, since adding any constant to $h$ still meets the requirement.) This motivates the alternative condition, known as the discounted approach in the literature.
Condition 2.
For any discount factor $\beta \in (0,1)$, let $V_\beta$ be the discounted infinite horizon value function defined as
$V_\beta(x) := \mathbb{E}\Big[ \sum_{t=1}^{\infty} \beta^{t-1}\, r_\theta(x_t, \pi(x_t)) \,\Big|\, x_1 = x \Big].$
Then the span of $V_\beta$ is uniformly bounded over $\beta \in (0,1)$.
The introduction of the discount factor $\beta$ guarantees that $V_\beta$ is a well-defined function, and hence Cond. 2 reduces to asserting the uniform boundedness of a known family of functions. Furthermore, it also guarantees that the long term average reward is well-defined, as it implies Cond. 1.
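Cond. 2 can be checked numerically on small instances. The sketch below is a hypothetical example (the chain, reward, and helper name `value_span` are ours): for a fixed policy's Markov reward process, the discounted value solves the linear system $V = r + \beta P V$, and the span stays bounded as $\beta \to 1$.

```python
import numpy as np

def value_span(P, r, beta):
    """Span (max minus min) of the discounted value V solving V = r + beta*P*V
    for the Markov reward process induced by a fixed policy."""
    V = np.linalg.solve(np.eye(len(r)) - beta * P, r)
    return V.max() - V.min()

# Hypothetical two-state chain and reward (1 in the good state, 0 otherwise).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])
spans = [value_span(P, r, b) for b in (0.9, 0.99, 0.999)]
```

For this particular chain the spans increase toward the finite limit $1/(1 - 0.7) = 10/3$ as $\beta \to 1$, consistent with the uniform boundedness Cond. 2 requires.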
Proposition 2.
Let $\theta$ be a system parameter and $\pi$ be a policy. Then the following hold.
We denote by $H$ the uniform upper bound on the span.
5 Regret Bound
In this section, we bound the Bayesian regret of TSDE (Algorithm 1). The analysis crucially relies on four distinct properties: 1) the Bellman equation in Eq. 9 satisfied by the average cost at each policy update; 2) the Thompson sampling mechanism, which samples parameters according to the posterior, ensuring that $\theta_k$ and $\theta_\star$ are conditionally identical in distribution; 3) the concentration of the empirical estimates around the true transition model; and 4) the update scheme in Eq. 7, which controls the number of episodes while preserving sufficient measurability of the termination times. We provide here a proof sketch to explain how we leverage those properties and how they translate into key intermediate results that lead to the final bound. The formal proofs can be found in App. B.
5.1 Regret Decomposition
Under Asm. 2, Prop. 2 ensures that each sampled parameter/policy pair $(\theta_k, \pi_k)$ satisfies the Bellman equation (Eq. 9):
$J(\theta_k) + h_{\theta_k}(x) = r_{\theta_k}(x, \pi_k(x)) + \mathbb{E}_{\theta_k}\big[ h_{\theta_k}(x') \,\big|\, x \big].$
As a result, we can decompose the frequentist regret episode by episode and obtain, over the horizon $T$, a decomposition into four terms $R_1, R_2, R_3, R_4$. See App. B for a more detailed derivation.
Bounding $R_1$. The first regret term is addressed thanks to the well-known expectation identity (see Russo and Van Roy (2014)), leveraging that, conditionally on the history, $\theta_k$ and $\theta_\star$ are identically distributed.
Lemma 3 (Expectation identity).
Suppose $\theta_k$ and $\theta_\star$ have the same distribution given a history $\mathcal{H}$. Then for any measurable function $f$, we have $\mathbb{E}\big[ f(\theta_k) \mid \mathcal{H} \big] = \mathbb{E}\big[ f(\theta_\star) \mid \mathcal{H} \big].$
As pointed out in Ouyang et al. (2017), one cannot apply Lemma 3 directly to $\theta_k$ and $\theta_\star$ because of the measurability issue arising from the lazy update scheme in Eq. 7. In line with Ouyang et al. (2017), we overcome this difficulty thanks to the first, deterministic termination rule in Eq. 7. Taking the expectation then leads to the following lemma.
Lemma 4 (Ouyang et al. (2017), Lemma 3 and 4).
where $K_T$ is the total number of episodes until time $T$.
Bounding $R_2$. Clearly, $R_2$ involves telescoping sums over each episode $k$. As a result, it depends solely on the number of policy switches $K_T$ and on the uniform span bound $H$ of Prop. 2.
Lemma 5.
As a result, bounding both $R_1$ and $R_2$ reduces to a fine bound on the number of episodes $K_T$.
Bounding $R_3$ and $R_4$. Finally, the last regret terms deal with the model misspecification. That is to say, they depend on the on-policy error between the empirical estimate and the true transition model. Formally, Lemmas 6 and 7 show that they scale with
$\sum_{k} \sum_{t=t_k}^{t_{k+1}-1} \sum_{i : a_{i,t} = 1} \lVert \hat{\nu}_{k,i} - \nu_{\star,i} \rVert_1,$
where $\nu_{\star,i}$ is the probability distribution of arm $i$'s state under parametrization $\theta_\star$ and $\hat{\nu}_{k,i}$ is its empirical estimate at the beginning of episode $k$. The core of the proofs thus lies in deriving a high-probability confidence set whose associated on-policy error is cumulatively bounded by $\tilde{O}(\sqrt{T})$. We state the lemmas here and postpone the proofs to App. B.

Lemma 6.
$R_3$ satisfies the following bound
Lemma 7.
$R_4$ satisfies the following bound
We detail the construction and probabilistic argument of the confidence set later in the section.
5.2 Bounding the Number of Episodes
As briefly discussed in Sec. 3, each episode has a random length $T_k$, and the number of episodes $K_T$ is therefore also random. In order to bound $R_1$ and $R_2$, we first bound this quantity. As discussed in Osband and Van Roy (2014), the specific structure of our problem, due to the MDP formulation of the original POMDP problem, allows us to guarantee a tighter bound w.r.t. the number of states than straightforwardly applying the TSDE analysis to the meta-state $x_t$. In particular, we leverage this structure to obtain a bound that depends on the number of states through the summation $\sum_i |\mathcal{S}_i|$ instead of the product $\prod_i |\mathcal{S}_i|$.
Lemma 8.
The number of episodes satisfies the following inequality almost surely
Proof.
Following Ouyang et al. (2017), we define macro episodes with start times $t_{m_j}$ for a subsequence $(m_j)$ such that $m_1 = 1$ and episode $m_j$ is triggered by the second termination criterion.
Note that a macro episode starts when the second termination criterion happens. Ouyang et al. (2017) prove in their Lemma 1 that
(10) 
where $M$ is the number of macro episodes. We claim
(11) 
which proves our lemma when combined with Eq. 10.
For each , we define
This means that the counter $\tilde{N}_t(z)$ gets doubled this many times over the episodes. It leads to the following inequality
Then we have
where we add one to account for the initial episode; the third inequality holds by Jensen's inequality together with the concavity of the square root. The final equality holds because the truncated counters sum to the total number of active arm pulls until time $T$. This proves our claim (Eq. 11) and therefore the lemma. ∎
5.3 Confidence Set
To bound $R_3$ and $R_4$, we construct a confidence set for the system parameters. Recall that $z$ represents a tuple $(i, s, n, a)$. Suppose that at time $t$, the state of arm $i$ was last observed to be $s$, $n$ rounds ago. Let $\nu_\theta(\cdot \mid z)$ denote the probability distribution of the arm's state if the true system were $\theta$. For an individual probability weight, we write $\nu_\theta(s' \mid z)$ for a state $s'$. Using the samples collected so far, we can also compute an empirical distribution $\hat{\nu}_t(\cdot \mid z)$. We construct a confidence set as a collection of $\theta$ such that $\nu_\theta$ is close to this empirical estimate. Namely, in episode $k$, we define
where
Lemma 9.
For every episode , we can bound
Proof.
For an episode $k$, pick $z = (i, s, n, a) \in \mathcal{Z}$ and let $N = \tilde{N}_{t_k}(z)$. If $N$ equals $0$, then the empirical estimate is vacuous and the inequality becomes trivial. Suppose $N \ge 1$. We first analyze the case $n \le L$. Weissman et al. (2003) show that
(12) $\mathbb{P}\big( \lVert \hat{p} - p \rVert_1 \ge \epsilon \big) \le \big( 2^{|\mathcal{S}_i|} - 2 \big) \exp\big( - N \epsilon^2 / 2 \big),$
for a distribution $p$ over $\mathcal{S}_i$ and its empirical estimate $\hat{p}$ from $N$ i.i.d. samples. Setting $\epsilon = \sqrt{\tfrac{2}{N} \log\big( \tfrac{2^{|\mathcal{S}_i|}}{\delta} \big)}$, we get
(13) $\mathbb{P}\big( \lVert \hat{p} - p \rVert_1 \ge \epsilon \big) \le \delta.$
For the case $n > L$, we want to prove the same probability bound as in Eq. 13, but we cannot directly use Eq. 12 due to the aggregation. We can still show a similar bound by using the proof technique of Weissman et al. (2003).
For simplicity, write , , and . Then it can be easily checked that
Using this and the union bound, we can write
(14) 
By the definition of , we have
Then Hoeffding’s inequality implies that
Plugging this into Eq. 14, we get
which shows Eq. 13 for the case $n > L$.
Since the set $\mathcal{Z}$ is finite, applying the union bound over $z \in \mathcal{Z}$ finishes the proof. ∎
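To illustrate the Weissman et al. (2003) style deviation bound driving Lemma 9, here is a quick simulation. The distribution, sample size, and confidence-width helper are hypothetical illustrations; the paper's exact constants may differ.

```python
import math
import random

def l1_confidence_width(n, support, delta):
    """Radius eps such that, in the spirit of Weissman et al. (2003),
    P(||p_hat - p||_1 >= eps) <= 2^support * exp(-n * eps^2 / 2) <= delta."""
    return math.sqrt(2 * (support * math.log(2) + math.log(1 / delta)) / n)

random.seed(0)
p = [0.5, 0.3, 0.2]                      # hypothetical true distribution
n = 5000
samples = random.choices(range(3), weights=p, k=n)
p_hat = [samples.count(s) / n for s in range(3)]
dev = sum(abs(a - b) for a, b in zip(p_hat, p))   # empirical L1 deviation
width = l1_confidence_width(n, support=3, delta=0.05)
```

On typical runs the empirical deviation lands well inside the radius, and the radius shrinks at the $1/\sqrt{n}$ rate that the regret analysis exploits.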
Furthermore, the confidence set satisfies that the cumulative on-policy error (see Sec. 5.1) is bounded.
Lemma 10.
On the high-probability event that $\theta_\star$ belongs to the confidence set, we can show
The proof of Lemma 10 is postponed to App. B. We want to emphasize that the confidence set only appears in the analysis and plays no role in running TSDE. For example, we can set an arbitrary value for the confidence parameter to make the proof work. The main idea for bounding $R_3$ and $R_4$ is that this event happens with high probability, and when it does, $\theta_k$ and $\theta_\star$ behave similarly.
5.4 Putting Everything Together
Theorem 1 (Exact regret bound, restated).
The Bayesian regret of TSDE is bounded by
where .
6 Experiments
We empirically evaluated TSDE (Algorithm 1) on simulated data. Following Jung and Tewari (2019), we chose the Gilbert-Elliott channel model in Figure 1 to model each arm. This model assumes binary states and is widely used in communication systems (e.g., see Liu and Zhao (2010)).
For simplicity, we assumed $P_i^{\text{active}} = P_i^{\text{passive}}$ and $r_i(s) = \mathbb{1}\{s = \text{good}\}$. This means that the learner's action does not affect the transition matrix and the binary reward equals one if and only if the state is good. We also assumed the initial states of the arms are all good. Each arm has two parameters: $p_{01}$, the probability of transitioning from the bad to the good state, and $p_{11}$, the probability of staying in the good state. We set the prior to be uniform over a finite set. Expectations are approximated by Monte Carlo simulation.
We investigated three index-based policies: the best fixed arm policy, the myopic policy, and the Whittle index policy. Index-based policies compute an index for each arm using only the samples from that arm and choose the top $m$ arms. Due to this decoupling nature, these policies are computationally efficient. The best fixed arm policy computes the expected reward according to the stationary distribution. The myopic policy maximizes the expected reward of the current round. The Whittle index policy was first introduced by Whittle (1988) and shown to be powerful in this particular setting by Liu and Zhao (2010). It is very popular in RMABs as it can efficiently approximate the optimal policy in many different settings. As a remark, all these policies reduce to the best fixed arm policy in the stationary bandits.
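For concreteness, here is a minimal sketch of the Gilbert-Elliott belief recursion underlying two of these indices. The parameter names are illustrative (`p01` for the bad-to-good transition probability, `p11` for good-to-good), and the Whittle index (which has a closed form in this setting, see Liu and Zhao (2010)) is omitted.

```python
def belief_update(omega, p01, p11):
    """One passive step of the belief that a Gilbert-Elliott arm is in the
    good state: omega' = omega * p11 + (1 - omega) * p01."""
    return omega * p11 + (1 - omega) * p01

def stationary_good_prob(p01, p11):
    """Stationary probability of the good state: p01 / (p01 + 1 - p11).
    This is the index used by the best fixed arm policy; the myopic index
    is the current belief itself."""
    return p01 / (p01 + 1 - p11)

# Hypothetical arm parameters; the arm starts observed in the good state.
omega = 1.0
for _ in range(3):                       # three passive rounds
    omega = belief_update(omega, p01=0.3, p11=0.7)
```

A passive arm's belief drifts toward the stationary probability, so the myopic index of a long-unobserved arm approaches the best-fixed-arm index.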
We first analyzed the Bayesian regret. The true system was drawn from the uniform prior. The average rewards smoothed by the prior were the highest for the Whittle index policy, showing its power. As described in Figure 2, the Bayesian regrets were sublinear regardless of the competitor policy. The log-log plot shows that they are indeed $\tilde{O}(\sqrt{T})$, as the dotted line has a slope of $1/2$.
Then we tested the frequentist setting to empirically validate that TSDE still performs well even though our theory only bounds the Bayesian regret. We fixed a particular true system, again adopting the setting from Jung and Tewari (2019). This setting is particularly interesting because each arm has the same stationary distribution, which means that the best fixed arm policy becomes indifferent among the arms. The average rewards were again the highest for the Whittle index policy, justifying its power. On the left plot of Figure 3, three horizontal dotted lines represent the average reward of each of the competitors. The solid lines show the time-averaged cumulative rewards. Every solid line converged to the corresponding dotted line. The right figure plots the posterior probability of the true parameters when competing against the Whittle index policy. For all arms, these probabilities monotonically increased to one, illustrating that TSDE was learning the true system properly. From this, we can assert that TSDE still performs reasonably well, at least when the true parameters lie in the support of the prior.

Acknowledgements
AT and YJ acknowledge the support of NSF CAREER grant IIS-1452099. AT was also supported by a Sloan Research Fellowship.
References
Bertsekas (1995) Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, volume 2. Athena Scientific, Belmont, MA, 1995.
Bertsimas and Niño-Mora (2000) Dimitris Bertsimas and José Niño-Mora. Restless bandits, linear programming relaxations, and a primal-dual index heuristic. Operations Research, 48(1):80–90, 2000.
Biglieri et al. (2013) Ezio Biglieri, Andrea J. Goldsmith, Larry J. Greenstein, Narayan B. Mandayam, and H. Vincent Poor. Principles of Cognitive Radio. Cambridge University Press, 2013.
Blondel and Tsitsiklis (2000) Vincent D. Blondel and John N. Tsitsiklis. A survey of computational complexity results in systems and control. Automatica, 36(9):1249–1274, 2000.
Hero et al. (2007) Alfred O. Hero, David Castañón, Doug Cochran, and Keith Kastella. Foundations and Applications of Sensor Management. Springer Science & Business Media, 2007.
Jaksch et al. (2010) Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
Javidi et al. (2008) Tara Javidi, Bhaskar Krishnamachari, Qing Zhao, and Mingyan Liu. Optimality of myopic sensing in multichannel opportunistic access. In 2008 IEEE International Conference on Communications, pages 2107–2112. IEEE, 2008.
Jung and Tewari (2019) Young Hun Jung and Ambuj Tewari. Regret bounds for Thompson sampling in restless bandit problems. arXiv preprint arXiv:1905.12673, 2019.
Liu et al. (2011) Haoyang Liu, Keqin Liu, and Qing Zhao. Logarithmic weak regret of non-Bayesian restless multi-armed bandit. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1968–1971. IEEE, 2011.
Liu et al. (2013) Haoyang Liu, Keqin Liu, and Qing Zhao. Learning in a changing world: Restless multi-armed bandit with unknown dynamics. IEEE Transactions on Information Theory, 59(3):1902–1916, 2013.
Liu and Zhao (2010) Keqin Liu and Qing Zhao. Indexability of restless bandit problems and optimality of Whittle index for dynamic multichannel access. IEEE Transactions on Information Theory, 56(11):5547–5567, 2010.
Meshram et al. (2017) Rahul Meshram, Aditya Gopalan, and D. Manjunath. Restless bandits that hide their hand and recommendation systems. In IEEE International Conference on Communication Systems and Networks (COMSNETS), pages 206–213. IEEE, 2017.
Ortner et al. (2012) Ronald Ortner, Daniil Ryabko, Peter Auer, and Rémi Munos. Regret bounds for restless Markov bandits. In International Conference on Algorithmic Learning Theory, pages 214–228. Springer, 2012.
Osband and Van Roy (2014) Ian Osband and Benjamin Van Roy. Near-optimal reinforcement learning in factored MDPs. In Advances in Neural Information Processing Systems, pages 604–612, 2014.
Ouyang et al. (2017) Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. Learning unknown Markov decision processes: A Thompson sampling approach. In Advances in Neural Information Processing Systems, pages 1333–1342, 2017.
Papadimitriou and Tsitsiklis (1999) Christos H. Papadimitriou and John N. Tsitsiklis. The complexity of optimal queuing network control. Mathematics of Operations Research, 24(2):293–305, 1999.
Platzman (1980) Loren K. Platzman. Optimal infinite-horizon undiscounted control of finite probabilistic systems. SIAM Journal on Control and Optimization, 18(4):362–380, 1980.
Puterman (2014) Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
Russo and Van Roy (2014) Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
Tekin and Liu (2012) Cem Tekin and Mingyan Liu. Online learning of rested and restless bandits. IEEE Transactions on Information Theory, 58(8):5588–5611, 2012.
Weber and Weiss (1990) Richard R. Weber and Gideon Weiss. On an index policy for restless bandits. Journal of Applied Probability, 27(3):637–648, 1990.
Weissman et al. (2003) Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdú, and Marcelo J. Weinberger. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep., 2003.
Whittle (1988) Peter Whittle. Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 25(A):287–298, 1988.
Appendix A Proof of Prop. 2
We first prove that Cond. 1 guarantees the constant average cost and the associated Bellman equation, and then show that Cond. 2 implies Cond. 1.

Let $\theta$ and $\pi$ satisfy Cond. 1 for some bounded span function $h$ and constant $J$. Then, for all $x$,
where the next meta-state $x'$ is drawn according to the Markov transition probability given the current state $x$ and the action $\pi(x)$ under parametrization $\theta$. Thus,
Multiplying by both sides of the equation and taking the expectation given leads to
Finally, since is bounded, letting one has and thus
Furthermore, since the average reward is constant, it is independent of the initial state. Replacing accordingly in Cond. 1, we directly obtain that it is associated with the Bellman equation. Since the function $h$ is arbitrary up to a constant term (it still satisfies the Bellman equation and the span is unaffected), we can set it without loss of generality to be nonnegative, and the resulting pair satisfies the Bellman equation (Eq. 9). Additionally, we have

We now show that Cond. 2 implies Cond. 1. The proof is adapted from Puterman (2014, Thm. 8.10.7), which is derived for optimal policies. The core idea is to consider a sequence of discount factors $\beta_j \to 1$ and to choose an appropriate subsequence (also indexed by $j$ for ease of notation) to assert the existence of $h$ and $J$, thanks to the uniform boundedness of the span of $V_{\beta_j}$.
First, notice that the rewards are bounded, and thus so are the discounted value functions. Also, it is well known that $V_\beta$ satisfies the discounted Bellman equation. Let $x_0$ be an arbitrary state and define $h_\beta(x) := V_\beta(x) - V_\beta(x_0)$. Clearly, $h_\beta$ is uniformly bounded and satisfies
(15) Since $h_\beta$ and the rewards are uniformly bounded, so is $(1-\beta)\, V_\beta(x_0)$. Further, the Bolzano-Weierstrass theorem for bounded sequences, together with a standard diagonal argument, ensures that there exists a subsequence such that
$(1-\beta_j)\, V_{\beta_j}(x_0)$ converges to some constant $J$, and
$h_{\beta_j}$ converges pointwise to some function $h$.
Finally, since $h_{\beta_j}$ is uniformly bounded, so is its limit $h$: