1 Introduction
Restless bandits [Whittle, 1988] are variants of multi-armed bandit (MAB) problems [Robbins, 1952]. Unlike the classical MABs, the arms have nonstationary reward distributions. Specifically, we will focus on the class of restless bandits whose arms change their states based on Markov chains. Restless bandits are also distinguished from rested bandits, where only the active arms evolve and the passive arms remain frozen. We will assume that each arm changes according to two different Markov chains depending on whether it is played or not. Because of their extra flexibility in modeling nonstationarity, restless bandits have been applied to practical problems such as dynamic channel access [Liu et al., 2011, 2013] and online recommendation systems [Meshram et al., 2017].
Due to the arms’ nonstationary nature, playing the same set of arms in every round usually does not produce the optimal performance. This makes the optimal policy highly nontrivial, and Papadimitriou and Tsitsiklis [1999] show that it is generally PSPACE-hard to identify the optimal policy for restless bandits. As a consequence, many researchers have devoted themselves to finding efficient ways to approximate the optimal policy [Liu and Zhao, 2010, Meshram et al., 2018]. This line of work primarily takes the optimization perspective, in which the system parameters are already known.
Since the true system parameters are unavailable in many cases, it becomes important to examine restless bandits from a learning perspective. Due to the learner’s additional uncertainty, however, analyzing a learning algorithm in restless bandits is significantly more challenging. Liu et al. [2011, 2013] and Tekin and Liu [2012] prove bounds for confidence-bound-based algorithms, but their competitor always selects a fixed set of actions, which is known to be weak (see Section 5 for an empirical example of the weakness of the best fixed action competitor). Dai et al. [2011, 2014] show bounds against the optimal policy, but their assumptions on the underlying model are very restrictive. Ortner et al. [2012] prove an $\tilde{O}(\sqrt{T})$ bound in general restless bandits, but their algorithm is intractable in general.
In a different line of work, Osband et al. [2013] study Thompson sampling in the setting of a fully observable Markov decision process (MDP) and show a Bayesian regret bound of $\tilde{O}(\sqrt{T})$ (hiding dependence on system parameters like state and action space size). Unfortunately, this result is not applicable in our setting as ours is partially observable due to bandit feedback. Following Ortner et al. [2012], it is possible to transform our setting to the fully observable case, but then we end up having exponentially many states, which restricts the practical utility of existing results.
In this work, we analyze Thompson sampling in restless bandits where the system resets every episode of a fixed length and the rewards are binary. We directly tackle the partial observability and achieve a meaningful regret bound, which when restricted to the classical MABs matches the Thompson sampling result in that setting. We are not the first to analyze Thompson sampling in restless bandits: Meshram et al. [2016] study this type of algorithm as well, but their regret analysis is limited to the one-armed case with a fixed reward for not pulling the arm. They explicitly mention that a regret analysis of Thompson sampling in the multi-armed case is an interesting open question.
2 Problem setting
We begin by introducing our setting. There are $K$ arms, and the algorithm selects $N$ arms every round. We denote the learner’s action at time $t$ by a binary vector $A_t \in \{0,1\}^K$ where $\|A_t\|_1 = N$. We call the selected arms active and the rest passive. We assume each arm $k$ has binary states, $\{0, 1\}$, which evolve as a Markov chain with transition matrix either $P^{\text{active}}_k$ or $P^{\text{passive}}_k$, depending on whether the learner pulled the arm or not.
At round $t$, pulling an arm $k$ incurs a binary reward $X_{t,k}$, which is the arm’s current state. As we are in the bandit setting, the learner only observes the rewards of active arms, which we denote by $X_{t,A_t}$, and does not observe the passive arms’ rewards or their states. This feature makes our setting a partially observable Markov decision process, or POMDP. We denote the history of the learner’s actions and rewards up to time $t$ by $\mathcal{H}_t = (A_1, X_{1,A_1}, \ldots, A_t, X_{t,A_t})$.
We assume the system resets every episode of length $L$, which is also known to the learner. This means that at the beginning of each episode, the states of the arms are drawn from an initial distribution. The entire time horizon is denoted by $T$, and for simplicity, we assume it is a multiple of $L$, or $T = mL$.
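The setting above can be sketched as a small simulator. This is a minimal illustration, not the authors' code: the function names, the representation of each transition matrix as a pair `(P(next = 1 | state 0), P(next = 1 | state 1))`, the round-robin action rule, and all numeric values are assumptions made for the example. Here `K` is the number of arms, `N` the number of arms pulled per round, and `L` the episode length.

```python
import random

def step_state(state, trans, rng):
    """Sample the next binary state; trans[s] is P(next state = 1 | current state = s)."""
    return 1 if rng.random() < trans[state] else 0

def run_episode(init_probs, p_active, p_passive, choose_action, L, rng):
    """Simulate one episode of length L and return the learner's history.

    init_probs[k]   : probability that arm k starts in state 1
    p_active[k][s]  : P(next state = 1 | state s) when arm k is pulled
    p_passive[k][s] : the same quantity when arm k is left passive
    choose_action   : callable (t, history) -> indices of the active arms
    """
    K = len(init_probs)
    states = [1 if rng.random() < init_probs[k] else 0 for k in range(K)]
    history = []  # one (active arms, observed rewards) pair per round
    for t in range(1, L + 1):
        active = tuple(choose_action(t, history))
        # Bandit feedback: only the states of the active arms are revealed.
        rewards = tuple(states[k] for k in active)
        history.append((active, rewards))
        # Every arm moves, active or not; this is what makes the bandit restless.
        states = [
            step_state(states[k], p_active[k] if k in active else p_passive[k], rng)
            for k in range(K)
        ]
    return history

# Hypothetical example values.
rng = random.Random(0)
K, N, L = 4, 2, 5
init_probs = [0.5] * K
p_active = [(0.2, 0.8)] * K
p_passive = [(0.3, 0.7)] * K
round_robin = lambda t, h: [(t + j) % K for j in range(N)]
history = run_episode(init_probs, p_active, p_passive, round_robin, L, rng)
```

Calling `run_episode` once per episode with freshly drawn initial states mirrors the periodic reset assumed in the text.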
2.1 Bayesian regret and competitor policy
Let $\theta \in \Theta$ denote the entire parameters of the system. It includes the transition matrices $P^{\text{active}}$ and $P^{\text{passive}}$, and an initial distribution of each arm’s state. The learner does not know these parameters at the beginning.
In order to define a regret, we need a competitor policy, or a benchmark. We first define a class of deterministic policies and policy mappings.
Definition 1.
A deterministic policy $\pi$ takes a time index and a history $(t, \mathcal{H}_{t-1})$ as an input and outputs a fixed action $A_t = \pi(t, \mathcal{H}_{t-1})$. A deterministic policy mapping $\mu$ takes system parameters $\theta$ as an input and outputs a deterministic policy $\pi = \mu(\theta)$.
We fix a deterministic policy mapping $\mu$ and let our algorithm compete against a deterministic policy $\pi^\star = \mu(\theta^\star)$, where $\theta^\star$ represents the true system parameters, which are unknown to the learner.
We keep our competitor policy abstract mainly because we are in the nonstationary setting. Unlike the classical (stationary) MABs, pulling the same set of arms with the largest expected rewards is not necessarily optimal. Moreover, it is in general PSPACE-hard to compute the optimal policy even when the true system parameters are given. Regarding these statements, we refer the readers to the book by Gittins et al. [1989]. As a consequence, researchers have identified conditions under which the (efficient) myopic policy is optimal [Ahmad et al., 2009] or proven that a tractable index-based policy has a reasonable performance against the optimal policy [Liu and Zhao, 2010].
We observe that most proposed policies, including the optimal policy, the myopic policy, and the index-based policy, are deterministic. Therefore, researchers can plug in any competitor policy of their choice, and our regret bound will apply as long as the chosen policy mapping is deterministic.
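Before defining the regret, we introduce a value function
$$V^{\theta}_{\pi,i}(\mathcal{H}) = \mathbb{E}_{\theta,\pi}\Big[\sum_{j=i}^{L} A_j \cdot X_j \,\Big|\, \mathcal{H}\Big], \tag{1}$$
where $X_j = (X_{j,1}, \ldots, X_{j,K})$ is the vector of rewards at round $j$. This is the expected reward of running a policy $\pi$ from round $i$ to $L$ where the system parameters are $\theta$ and the starting history is $\mathcal{H}$. Note that the benchmark policy $\pi^\star$ obtains $V^{\theta^\star}_{\pi^\star,1}(\emptyset)$ rewards per episode in expectation. Thus, we can define the regret as
$$R(T;\theta^\star) = m V^{\theta^\star}_{\pi^\star,1}(\emptyset) - \mathbb{E}_{\theta^\star}\Big[\sum_{t=1}^{T} A_t \cdot X_t\Big]. \tag{2}$$
If an algorithm chooses to fix a policy $\pi_l$ for the entire episode $l$, which is the case of our algorithm, then the regret can be written as
$$R(T;\theta^\star) = \sum_{l=1}^{m} \mathbb{E}_{\theta^\star}\Big[V^{\theta^\star}_{\pi^\star,1}(\emptyset) - V^{\theta^\star}_{\pi_l,1}(\emptyset)\Big],$$
where the expectation is over the algorithm’s possibly random choice of $\pi_l$.
We particularly focus on the case where $\theta^\star$ is random and bound the following Bayesian regret,
$$BR(T) = \mathbb{E}_{\theta^\star \sim Q}\big[R(T;\theta^\star)\big],$$
where $Q$ is a prior distribution over the set of system parameters $\Theta$. We assume that the prior is known to the learner. We caution our readers that there is at least one other regret definition in the literature, which is called either frequentist regret or worst-case regret. For this type of regret, one views $\theta^\star$ as a fixed unknown object and directly bounds $R(T;\theta^\star)$. Even though our primary interest is to bound the Bayesian regret, we can establish a connection to the frequentist regret in the special case where the prior has a finite support and the benchmark is the optimal policy (see Corollary 6).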
3 Algorithm
Our algorithm is an instance of Thompson sampling or posterior sampling, first proposed by Thompson [1933]. At the beginning of episode $l$, the algorithm draws system parameters $\theta_l$ from the posterior and plays the policy $\pi_l = \mu(\theta_l)$ given by the policy mapping $\mu$ throughout the episode. Once an episode is over, it updates the posterior based on the additional observations. Algorithm 1 describes the steps.
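The episode-level loop described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not a reproduction of Algorithm 1: it assumes a prior with finite support, represents parameters as a hypothetical triple `(init, p_act, p_pas)`, and uses a toy policy mapping `fixed_arm_map` and toy parameter values invented for the example. The posterior update multiplies each candidate's weight by the likelihood of the episode's observations, computed by propagating a per-arm belief.

```python
import random

def likelihood(theta, history, K):
    """Probability of one episode's observed rewards under parameters theta.

    theta = (init, p_act, p_pas): init[k] is P(state = 1) at the episode start,
    and p_act[k][s], p_pas[k][s] are P(next state = 1 | state s) for arm k
    when it is active or passive.
    """
    init, p_act, p_pas = theta
    belief = list(init)  # belief[k] = P(arm k is in state 1 | past observations)
    lik = 1.0
    for active, rewards in history:
        observed = dict(zip(active, rewards))
        for k in range(K):
            if k in observed:
                x = observed[k]
                lik *= belief[k] if x == 1 else 1.0 - belief[k]
                belief[k] = p_act[k][x]  # state revealed, then an active transition
            else:
                b = belief[k]
                belief[k] = b * p_pas[k][1] + (1.0 - b) * p_pas[k][0]
    return lik

def simulate_episode(theta, policy, K, L, rng):
    """Play `policy` for L rounds in an environment governed by theta."""
    init, p_act, p_pas = theta
    states = [int(rng.random() < init[k]) for k in range(K)]
    history = []
    for t in range(1, L + 1):
        active = policy(t, history)
        history.append((active, tuple(states[k] for k in active)))
        states = [
            int(rng.random() < (p_act[k] if k in active else p_pas[k])[states[k]])
            for k in range(K)
        ]
    return history

def thompson_sampling(support, prior, policy_map, true_theta, K, L, m, rng):
    """Run m episodes and return the final posterior weights over `support`."""
    posterior = list(prior)
    for _ in range(m):
        theta_l = rng.choices(support, weights=posterior)[0]  # posterior sample
        policy = policy_map(theta_l)  # deterministic policy, fixed for the episode
        history = simulate_episode(true_theta, policy, K, L, rng)
        weights = [q * likelihood(th, history, K) for q, th in zip(posterior, support)]
        total = sum(weights)
        posterior = [w / total for w in weights]
    return posterior

def fixed_arm_map(theta):
    """Hypothetical policy mapping: always pull the arm most likely to start in state 1."""
    init = theta[0]
    best = max(range(len(init)), key=lambda k: init[k])
    return lambda t, history: (best,)

# Two hypothetical candidates that differ only in their initial distributions.
sticky = ((0.1, 0.9), (0.1, 0.9))
theta_a = ((0.9, 0.1), sticky, sticky)
theta_b = ((0.1, 0.9), sticky, sticky)
posterior = thompson_sampling(
    [theta_a, theta_b], [0.5, 0.5], fixed_arm_map,
    true_theta=theta_a, K=2, L=5, m=30, rng=random.Random(0),
)
```

The sketch assumes the true parameters lie in the support, so the normalizing constant never vanishes. Any deterministic policy mapping could replace `fixed_arm_map`; only the sampling and update steps are specific to posterior sampling.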
We want to point out that the history $\mathcal{H}$ fulfills two different purposes. One is to update the posterior, and the other is as an input to a policy $\pi_l$. For the latter, however, we do not need the entire history as the arms reset every episode. That is why we set $\mathcal{H} = \emptyset$ (step 5) and feed $\mathcal{H}$ to $\pi_l$ (step 7). Furthermore, as we assume that the arms evolve based on Markov chains, the history can be summarized as
$$(r_1, n_1, \ldots, r_K, n_K), \tag{3}$$
which means that arm $k$ was played $n_k$ rounds ago and $r_k$ is the observed reward in that round. If an arm $k$ is never played in the episode, then $n_k$ becomes $t$, and $r_k$ becomes the expected reward from the initial distribution based on $\theta$. As we assume the episode length is fixed to be $L$, there are $L$ possible values for $n_k$. Due to the binary reward assumption, $r_k$ can take three values including the case where the arm is never played. From these, we can infer that there are $(3L)^K$ possible tuples of $(r_1, n_1, \ldots, r_K, n_K)$. By considering these tuples as states and following the reasoning of Ortner et al. [2012], one can view our POMDP as a fully observable MDP. Then one can use the existing algorithms for fully observable MDPs (e.g., Osband et al. [2013]), but the regret bounds easily become vacuous since the number of states depends exponentially on the number of arms $K$.
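The history summary can be illustrated with a short helper. This is a sketch under assumptions: the function name `summarize`, the history layout (one `(active arms, rewards)` pair per past round), and the convention that an arm never played is assigned the current round index are choices made for this example. For each arm it returns the most recent observed reward and how many rounds ago it was observed.

```python
def summarize(history, K, init_reward):
    """Return [(r_k, n_k)] for every arm, given the within-episode history.

    history     : list of (active arms, observed rewards), one entry per past round
    init_reward : init_reward[k] is the expected initial reward of arm k
    """
    t = len(history) + 1  # index of the round about to be played
    summary = []
    for k in range(K):
        r, n = init_reward[k], t  # never played: fall back to the initial distribution
        for rounds_ago, (active, rewards) in enumerate(reversed(history), start=1):
            if k in active:
                r, n = rewards[active.index(k)], rounds_ago
                break
        summary.append((r, n))
    return summary

# Two past rounds: arm 0 gave reward 1, then arm 1 gave reward 0; arm 2 was never played.
history = [((0,), (1,)), ((1,), (0,))]
summary = summarize(history, K=3, init_reward=[0.5, 0.5, 0.5])

# With 3 reward values and L elapsed-time values per arm, the number of
# summaries grows exponentially in the number of arms.
L, K = 5, 3
num_summaries = (3 * L) ** K
```

Even for the small values above the count is already in the thousands, which is the blow-up that makes the fully observable reformulation unattractive.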
Due to its generality, it is hard to analyze the time and space complexity of Algorithm 1. Two major steps are computing the policy (step 4) and updating the posterior (step 10). Computing the policy depends on our choice of competitor mapping $\mu$. If the competitor policy has better performance but is harder to compute, then our regret bound gets more meaningful as the benchmark is stronger, but the running time gets longer. Regarding the posterior update, the computational burden depends on the choice of the prior and its support. If there is a closed-form update, then the step is computationally cheap, but otherwise the burden increases with the size of the support.
4 Regret bound
In this section, we prove that the Bayesian regret of Algorithm 1 is at most $\tilde{O}(\sqrt{T})$. One main idea of analyzing Thompson sampling is that the distributions of the true parameters $\theta^\star$ and the sampled parameters $\theta_l$ are identical given the history up to the end of episode $l-1$ (e.g., see Lattimore and Szepesvári [2018, Chp. 36]). To state it more formally, let $\sigma(\mathcal{H})$ be the $\sigma$-algebra generated by the history $\mathcal{H}$. Then we call a random variable $X$ $\sigma(\mathcal{H})$-measurable, or simply $\mathcal{H}$-measurable, if its value is deterministically known given the information $\mathcal{H}$. Similarly, we call a random function $f$ $\mathcal{H}$-measurable if its mapping is deterministically known given $\mathcal{H}$. We record as a lemma an observation made by Russo and Van Roy [2014].
Lemma 2.
(Expectation identity) Suppose $\theta^\star$ and $\theta_l$ have the same distribution given $\mathcal{H}$. For any $\mathcal{H}$-measurable function $f$, we have
$$\mathbb{E}[f(\theta^\star) \mid \mathcal{H}] = \mathbb{E}[f(\theta_l) \mid \mathcal{H}].$$
Recall that we assume the competitor mapping is deterministic. Furthermore, the value function in (1) is deterministic given and . This implies
(4) 
where is the history up to the end of episode . This observation leads to the following regret decomposition.
Lemma 3.
(Regret decomposition) The Bayesian regret of Algorithm 1 can be decomposed as
Proof.
Note that we can compute as we know and . We can also infer the value of from the algorithm’s observations. The main point of Lemma 3 is to rewrite the Bayesian regret using terms that are relatively easy to analyze.
Next, we define the Bellman operator
It is not hard to check that . The next lemma further decomposes the regret.
Lemma 4.
(Per-episode regret decomposition) Fix and , and let . Then we have
Proof.
Using the relation , we may write
The second term can be written as
and we can repeat this times to obtain the desired equation. ∎
Now we are ready to prove our main theorem. A complete proof can be found in Appendix A.
Theorem 5.
(Bayesian regret bound of Thompson sampling) The Bayesian regret of Algorithm 1 satisfies the following bound
Remark.
If the system is the classical stationary MAB, then it corresponds to the case $L = 1$ and $N = 1$, and our result reproduces the result of [Lattimore and Szepesvári, 2018, Chp. 36]. Furthermore, when $N > K/2$, we can think of the problem as choosing the passive arms, and the smaller bound with $N$ replaced by $K - N$ would apply.
Proof Sketch.
We fix an episode and analyze the regret in this episode. Let so that the episode starts at time . Define
It counts the number of rounds where the arm was chosen by the learner with history and (see (3) for definition). Note that
where is the initial success rate of the arm . This implies there are tuples of .
Let denote the conditional probability of given a history and system parameters . Also let denote the empirical mean of this quantity (using past observations and set the estimate to if ). Then define
Since is measurable, so is the set . Using the Hoeffding inequality, one can show . In other words, we can claim that with high probability, is small for all .
We now turn our attention to the following Bellman operator
Since is a deterministic policy, is also deterministic given and . Let be the active arms at time and write . Then we can rewrite
where . Under the event that , we have
where the dependence on comes from the mapping from to . When and are close for all , we can actually bound the difference between the following Bellman operators as
Then by applying Lemma 4, we get
The above inequality holds whenever . When or , which happens with probability less than , we have a trivial bound . We can deduce
Combining this with Lemma 3, we can show
(5) 
After some algebra, bounding sums of finite differences by integrals, and applying the Cauchy-Schwarz inequality, we can bound the second summation by
(6) 
As discussed in Section 2, researchers sometimes pay more attention to the case where the true parameters are deterministically fixed in advance, in which the frequentist regret becomes more relevant. It is not easy to directly extend our analysis to the frequentist regret in general, but we can achieve a meaningful bound with extra assumptions. Suppose our prior is discrete and the competitor is the optimal policy. Then we know is always nonnegative due to the optimality of the benchmark and can deduce , where is the probability mass on . This leads to the following corollary.
Corollary 6.
(Frequentist regret bound of Thompson sampling) Suppose the prior is discrete and puts a nonzero mass on parameters . Additionally, assume that the competitor policy is the optimal policy. Then Algorithm 1 satisfies the following bound
5 Experiments
We empirically investigate the Gilbert-Elliott channel model, which is studied by Liu and Zhao [2010] from a restless bandit perspective. This model can be broadly used in communication systems such as cognitive radio networks, downlink scheduling in cellular systems, opportunistic transmission over fading channels, and resource-constrained jamming and anti-jamming.
Each arm has two parameters $p_{01}$ and $p_{11}$, the probabilities of moving to the good state from the bad state and from the good state respectively, which determine the transition matrix. We assume $P^{\text{active}} = P^{\text{passive}}$, that is, each arm’s transition matrix is independent of the learner’s action. There are only two states, good and bad, and the reward of playing an arm is $1$ if its state is good and $0$ otherwise. Figure 1 summarizes this model. We assume the initial distribution of an arm follows the stationary distribution. In other words, its initial state is good with probability $\omega = p_{01}/(p_{01} + 1 - p_{11})$.
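As a quick numerical check of the two-state chain described above, the following sketch computes the stationary probability of the good state and propagates a belief forward. The names `p01` and `p11` are assumed here to denote the probabilities of moving to the good state from the bad and the good state respectively, and the numeric values are invented for illustration.

```python
def stationary_good_prob(p01, p11):
    """Stationary probability of the good state for a two-state chain.

    p01 = P(good next | bad now), p11 = P(good next | good now).
    Solves w = w * p11 + (1 - w) * p01 for w.
    """
    return p01 / (p01 + 1.0 - p11)

def propagate_belief(belief, p01, p11, steps=1):
    """Advance P(arm is good) by `steps` transitions without any observation."""
    for _ in range(steps):
        belief = belief * p11 + (1.0 - belief) * p01
    return belief

omega = stationary_good_prob(0.3, 0.6)
# The stationary probability is a fixed point of the belief update.
still_omega = propagate_belief(omega, 0.3, 0.6, steps=3)
# Right after observing a good state, the next-round belief is p11.
after_good = propagate_belief(1.0, 0.3, 0.6)
```

Starting every arm at `omega` means an unobserved arm's belief stays constant, so the differences between arms come only from what the learner has recently observed.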
We fix and . We use Monte Carlo simulation with size or greater to approximate expectations. As each arm has two parameters, there are parameters. For these, we set the prior distribution to be uniform over a finite support .
5.1 Competitors
As mentioned earlier, one important strength of our result is that various policy mappings can be used as benchmarks. Here we test three different policies: the best fixed arm policy, the myopic policy, and the Whittle index policy. We want to emphasize again that these competitor policies know the system parameters while our algorithm does not.
The best fixed arm policy computes the stationary distribution $\omega_k$ for all $k$ and pulls the $N$ arms with the top values. The myopic policy keeps updating the belief for each arm being in a good state and pulls the top $N$ arms. Finally, the Whittle index policy computes the Whittle index of each arm and uses it to rank the arms. The Whittle index was proposed by Whittle [1988], and Liu and Zhao [2010] find a closed-form formula to compute the Whittle index in this particular setting. The Whittle index policy is very popular in the optimization literature as it decouples the optimization process into independent problems for each arm, which significantly reduces the computational complexity while maintaining a reasonable performance against the optimal policy.
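The first two competitors can be sketched in a few lines. This is an illustration under assumptions: each arm is described by a hypothetical pair `(p01, p11)` of probabilities of moving to the good state from the bad and the good state, the function names are invented, and the Whittle index policy is omitted because its closed-form expression is not reproduced in this text.

```python
def best_fixed_arms(params, N):
    """Pull the N arms with the largest stationary probability of being good."""
    omega = [p01 / (p01 + 1.0 - p11) for p01, p11 in params]
    return sorted(range(len(params)), key=lambda k: -omega[k])[:N]

def myopic_arms(beliefs, N):
    """Pull the N arms that are currently most likely to be in the good state."""
    return sorted(range(len(beliefs)), key=lambda k: -beliefs[k])[:N]

def update_beliefs(beliefs, params, active, rewards):
    """One-round belief update: condition on observed arms, then transition all arms."""
    observed = dict(zip(active, rewards))
    new_beliefs = []
    for k, (p01, p11) in enumerate(params):
        b = float(observed[k]) if k in observed else beliefs[k]
        new_beliefs.append(b * p11 + (1.0 - b) * p01)
    return new_beliefs

# Hypothetical parameters for three arms.
params = [(0.1, 0.9), (0.3, 0.5), (0.4, 0.8)]
fixed = best_fixed_arms(params, N=2)
beliefs = [p01 / (p01 + 1.0 - p11) for p01, p11 in params]  # start at stationarity
beliefs = update_beliefs(beliefs, params, active=(0, 2), rewards=(1, 0))
myopic = myopic_arms(beliefs, N=2)
```

In this toy instance the fixed policy always prefers arm 2 over arm 0, while the myopic policy flips that order after seeing arm 0 in the good state and arm 2 in the bad state.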
One observation is that all three policies reduce to the best fixed arm policy in the stationary case. However, the first two policies are known to be suboptimal in general [Gittins et al., 1989]. Liu and Zhao [2010] justify both theoretically and empirically the performance of the Whittle index policy for the Gilbert-Elliott channel model.
5.2 Results
We first analyze the Bayesian regret. For this, we use and . The value functions of the best fixed arm policy, the myopic policy, and the Whittle index policy are and , respectively. If a competitor policy has a weak performance, then Thompson sampling also uses this weak policy mapping to get a policy for the episode . This implies that the regret does not necessarily become negative when the benchmark policy is weak. Figure 2 shows the trend of the Bayesian regret as a function of episode indices. Regardless of the choice of policy mapping, the regret is sublinear, and the slope of the log-log plot is less than $1/2$, which agrees with Theorem 5.
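One way to read off such a slope is an ordinary least-squares fit on log-transformed values. The sketch below is an assumption about how the check could be done, not the authors' procedure, and the regret curve it uses is synthetic, generated only to exercise the function.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x); xs and ys must be positive."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den

# Synthetic cumulative regret growing like the square root of the episode index.
episodes = list(range(1, 31))
synthetic_regret = [3.0 * math.sqrt(l) for l in episodes]
slope = loglog_slope(episodes, synthetic_regret)
```

A curve growing like the square root of the episode count gives a slope of one half, so an empirical slope below that level is consistent with a square-root-type bound up to logarithmic factors.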
Next we fix true parameters and investigate the model’s behavior more closely. For this, we choose , , and . This choice results in for all , and the best fixed arm policy becomes indifferent. Therefore achieving zero regret against the best fixed arm becomes trivial. We use the same uniform prior as in the previous experiment. Figure 3 presents the trend of value functions and how Thompson sampling puts more posterior weight on the correct parameters as it proceeds. Three horizontal lines in the left figure represent the values of the competitor policies. The values of the best fixed arm policy, the myopic policy, and the Whittle index policy are and , respectively. This is a good example of why one should not pull the same arms all the time in restless bandits. The value function of Thompson sampling successfully converges to the corresponding competitor value for every benchmark, while the one with the myopic policy needs more episodes to fully converge. This supports Corollary 6 in that our model can be used even in the non-Bayesian setting as long as the prior has a nonzero weight on the true parameters. Also, the posterior weights on the correct parameters monotonically increase (Figure 3, right), which again confirms our model’s performance. We measure these weights when the competitor map is the Whittle index policy.
6 Discussion and future directions
In this paper, we have analyzed Thompson sampling in restless bandits with binary rewards. The Bayesian regret can be theoretically bounded as $\tilde{O}(\sqrt{T})$, which naturally extends the results in the stationary MAB. One primary strength of our analysis is that the bound applies to arbitrary deterministic competitor policy mappings, which include the optimal policy and many other practical policies. Experiments with the simulated Gilbert-Elliott channel models support the theoretical results. In the special case where the prior has a discrete support and the benchmark is the optimal policy, our result extends to the frequentist regret, which is also supported by empirical results.
There are at least two interesting directions to be explored.

Our setting is episodic with known length $L$. The system resets periodically, which makes the analysis of the regret simpler. However, it is sometimes unrealistic to assume this periodic reset (e.g., the online recommendation system studied by Meshram et al. [2017]). Analyzing a learning algorithm in the non-episodic setting will be useful.

Corollary 6 is not directly applicable in the case of continuous prior. In stationary MABs, it has been shown that Thompson sampling enjoys the frequentist regret bound of with additional assumptions [Lattimore and Szepesvári, 2018, Chp. 36]. Extending this to the restless bandit setting will be an interesting problem.
Acknowledgments
We acknowledge the support of NSF CAREER grant IIS1452099. AT was also supported by a Sloan Research Fellowship. AT visited Criteo AI Lab, Paris and had discussions with Criteo researchers – Marc Abeille, Clément Calauzènes, and Jérémie Mary – regarding nonstationarity in bandit problems. These discussions were very helpful in attracting our attention to the regret analysis of restless bandit problems and the need for considering a variety of benchmark competitors when defining regret.
References
 Ahmad et al. [2009] Sahand Haji Ali Ahmad, Mingyan Liu, Tara Javidi, Qing Zhao, and Bhaskar Krishnamachari. Optimality of myopic sensing in multichannel opportunistic access. IEEE Transactions on Information Theory, 55(9):4040–4050, 2009.
 Dai et al. [2011] Wenhan Dai, Yi Gai, Bhaskar Krishnamachari, and Qing Zhao. The non-Bayesian restless multi-armed bandit: A case of near-logarithmic regret. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2940–2943. IEEE, 2011.
 Dai et al. [2014] Wenhan Dai, Yi Gai, and Bhaskar Krishnamachari. Online learning for multi-channel opportunistic access over unknown Markovian channels. In IEEE International Conference on Sensing, Communication, and Networking (SECON), pages 64–71. IEEE, 2014.
 Gittins et al. [1989] John C Gittins, Kevin D Glazebrook, and Richard Weber. Multi-armed bandit allocation indices, volume 25. Wiley Online Library, 1989.
 Lattimore and Szepesvári [2018] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. preprint, 2018.
 Liu et al. [2011] Haoyang Liu, Keqin Liu, and Qing Zhao. Logarithmic weak regret of non-Bayesian restless multi-armed bandit. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1968–1971. IEEE, 2011.
 Liu et al. [2013] Haoyang Liu, Keqin Liu, and Qing Zhao. Learning in a changing world: Restless multi-armed bandit with unknown dynamics. IEEE Transactions on Information Theory, 59(3):1902–1916, 2013.
 Liu and Zhao [2010] Keqin Liu and Qing Zhao. Indexability of restless bandit problems and optimality of Whittle index for dynamic multichannel access. IEEE Transactions on Information Theory, 56(11):5547–5567, 2010.
 Meshram et al. [2016] Rahul Meshram, Aditya Gopalan, and D Manjunath. Optimal recommendation to users that react: Online learning for a class of POMDPs. In IEEE 55th Conference on Decision and Control (CDC), pages 7210–7215. IEEE, 2016.
 Meshram et al. [2017] Rahul Meshram, Aditya Gopalan, and D Manjunath. Restless bandits that hide their hand and recommendation systems. In IEEE International Conference on Communication Systems and Networks (COMSNETS), pages 206–213. IEEE, 2017.
 Meshram et al. [2018] Rahul Meshram, D Manjunath, and Aditya Gopalan. On the Whittle index for restless multi-armed hidden Markov bandits. IEEE Transactions on Automatic Control, 63(9):3046–3053, 2018.
 Ortner et al. [2012] Ronald Ortner, Daniil Ryabko, Peter Auer, and Rémi Munos. Regret bounds for restless Markov bandits. In International Conference on Algorithmic Learning Theory, pages 214–228. Springer, 2012.
 Osband et al. [2013] Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.
 Papadimitriou and Tsitsiklis [1999] Christos H Papadimitriou and John N Tsitsiklis. The complexity of optimal queuing network control. Mathematics of Operations Research, 24(2):293–305, 1999.
 Robbins [1952] Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.
 Russo and Van Roy [2014] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
 Tekin and Liu [2012] Cem Tekin and Mingyan Liu. Online learning of rested and restless bandits. IEEE Transactions on Information Theory, 58(8):5588–5611, 2012.
 Thompson [1933] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
 Whittle [1988] Peter Whittle. Restless bandits: Activity allocation in a changing world. Journal of applied probability, 25(A):287–298, 1988.
Appendix A Proof of Theorem 5
We begin by introducing a technical lemma.
Lemma 7.
Let and for . Then we can show
(7) 
Proof.
Fix a binary vector . For simplicity, let and . Since is either or , we have . Then we can deduce
When summing up for all binary vectors , we can write the coefficient of as
where the second equality holds because . This completes the proof. ∎
Now we prove the main theorem.
Theorem 5.
(Bayesian regret bound of Thompson sampling) The Bayesian regret of Algorithm 1 satisfies the following bound
Proof.
We fix an episode and analyze the regret in this episode. Let so that the episode starts at time . Define
It counts the number of rounds where the arm was chosen by the learner with history and (see (3) for definition). Note that
where is the initial success rate of the arm . This implies there are tuples of .
Let denote the conditional probability of given a history and system parameters . Also let denote the empirical mean of this quantity (using past observations and set the estimate to if ). Then define
Since is measurable, so is the set . Using the Hoeffding inequality, one can show .
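For reference, Hoeffding’s inequality for the empirical mean $\hat{p}$ of $n$ independent samples in $[0,1]$ with mean $p$ states that $\mathbb{P}(|\hat{p} - p| \ge \epsilon) \le 2\exp(-2n\epsilon^2)$. The sketch below turns this into a confidence radius. The function name and the example values are illustrative assumptions, and the exact radius and failure probability used in the proof are not reproduced here.

```python
import math

def hoeffding_radius(n, delta):
    """Radius eps such that 2 * exp(-2 * n * eps**2) == delta.

    With probability at least 1 - delta, the empirical mean of n independent
    [0, 1]-valued samples lies within eps of the true mean.
    """
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

# More observations of a tuple give a tighter estimate of its success probability.
wide = hoeffding_radius(n=10, delta=0.05)
narrow = hoeffding_radius(n=1000, delta=0.05)
```

The radius shrinks at rate $1/\sqrt{n}$, which is the source of the square-root dependence on the number of observations in bounds of this kind.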
We now turn our attention to the following Bellman operator
Since is a deterministic policy, is also deterministic given and . Let be the active arms at time and write . Then we can rewrite
(8) 
where . Under the event that , we have
(9) 
where the dependence on comes from the mapping from to . Lemma 7 provides
(10) 
From (8), (10), and the fact that , we obtain given and the event ,
Then by applying Lemma 4, we get
The above inequality holds whenever . When or , which happens with probability less than , we have a trivial bound . We can deduce
Combining this with Lemma 3, we can show
(11) 
We further analyze the summation to finish the argument. Note that for this summation, we have . We shorten to for simplicity. By the definition of in (9), we get
(12) 
where the second inequality holds because there are possible tuples of and a tuple can contribute at most to the first summation.
We can bound the second term as follows