Thompson Sampling in Non-Episodic Restless Bandits

October 12, 2019 · Young Hun Jung et al. · University of Michigan and Criteo

Restless bandit problems assume time-varying reward distributions of the arms, which adds flexibility to the model but makes the analysis more challenging. We study learning algorithms over the unknown reward distributions and prove a sub-linear, O(√(T)log T), regret bound for a variant of Thompson sampling. Our analysis applies in the infinite time horizon setting, resolving the open question raised by Jung and Tewari (2019) whose analysis is limited to the episodic case. We adopt their policy mapping framework, which allows our algorithm to be efficient and simultaneously keeps the regret meaningful. Our algorithm adapts the TSDE algorithm of Ouyang et al. (2017) in a non-trivial manner to account for the special structure of restless bandits. We test our algorithm on a simulated dynamic channel access problem with several policy mappings, and the empirical regrets agree with the theoretical bound regardless of the choice of the policy mapping.


1 Introduction

In contrast to classical multi-armed bandits (MABs), restless multi-armed bandits (RMABs), introduced by Whittle (1988), assume reward distributions that change over time. Due to their non-stationary nature, RMABs can model more complicated systems and have therefore received attention in both applied and theoretical work. In practice, they are used in a wide spectrum of applications including sensor management (Chp. 7 in Hero et al. (2007) and Chp. 5 in Biglieri et al. (2013)), dynamic channel access problems (Liu et al., 2011, 2013), and online recommendation systems (Meshram et al., 2017). Theoretically, a variety of research communities have contributed to the literature on restless bandits, e.g., complexity theory (Blondel and Tsitsiklis, 2000), applied probability (Weber and Weiss, 1990), and optimization (Bertsimas and Niño-Mora, 2000).

In this setting, there are independent arms indexed by . (For an integer , we denote the corresponding set of indices by .) Each arm is characterized by an internal state which evolves in a Markovian fashion according to the (possibly distinct) transition matrices and , depending on whether the arm is pulled (i.e., active) or not (i.e., passive). The reward of pulling an arm depends on its state , which is the source of the non-stationarity.
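To make these dynamics concrete, here is a minimal simulation sketch of a single restless arm with separate active and passive transition matrices; the class name, the two-state example, and its numerical values are ours, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

class RestlessArm:
    """One arm whose hidden state evolves under P_active or P_passive."""

    def __init__(self, P_active, P_passive, rewards, init_state=0):
        self.P_active = np.asarray(P_active)    # row-stochastic, used when pulled
        self.P_passive = np.asarray(P_passive)  # row-stochastic, used when resting
        self.rewards = np.asarray(rewards)      # known deterministic reward per state
        self.state = init_state

    def step(self, active):
        """Collect the reward of the current state (only if pulled), then transition."""
        reward = self.rewards[self.state] if active else 0.0
        P = self.P_active if active else self.P_passive
        self.state = rng.choice(len(self.rewards), p=P[self.state])
        return reward

# Illustrative two-state arm; the matrices below are arbitrary placeholders.
arm = RestlessArm(P_active=[[0.7, 0.3], [0.4, 0.6]],
                  P_passive=[[0.9, 0.1], [0.2, 0.8]],
                  rewards=[0.0, 1.0])
print(sum(arm.step(active=True) for _ in range(10)))
```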

We aggregate the transition matrices as and consider this problem as a reinforcement learning problem where is unknown to the learner. One complication of this problem is defining the baseline competitor against which the learner competes. Without additional assumptions, the optimal policy is not guaranteed to exist, and even when it exists, Papadimitriou and Tsitsiklis (1999) show that it is generally PSPACE-hard to compute.

Researchers take different paths to tackle this challenge. Some define the regret using a simpler policy that can be easily computed (e.g., see Tekin and Liu (2012); Liu et al. (2013)). They compare the learner’s reward to a policy that pulls a fixed set of arms every round. Their algorithm is efficient and has a strong regret guarantee, , but this baseline policy is known to be weak in the RMAB setting, which makes the regret less meaningful. Our empirical results in Sec. 6 also show the weakness of this policy. Another breakthrough was made by Ortner et al. (2012), who show a sub-linear regret bound against the optimal policy. However, they ignore the computational burden of their algorithm.

Jung and Tewari (2019) propose another interesting direction in that they introduce a deterministic policy mapping . It takes the system parameter as an input and outputs a deterministic stationary policy . The learner then competes against the policy , where denotes the true system. This framework is general enough to include both the best fixed arm policy and the optimal policy mentioned earlier. That is, one can obtain an efficient algorithm by choosing an efficient mapping, or make the regret more meaningful by choosing a stronger policy. In fact, there are different lines of work (e.g., Whittle (1988); Liu and Zhao (2010); Meshram et al. (2017)) that study an efficient way, namely the Whittle index policy, to approximate the optimal policy. Using this policy as a mapping, one can obtain an efficient algorithm and a meaningful regret simultaneously.
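As one concrete (and deliberately simple) instance of a policy mapping, the sketch below maps a parameter to the best-fixed-arm policy by ranking arms by their expected reward under the stationary distribution of the matrix they would follow when always pulled. The function names and the shape assumed for the parameter are ours, not the paper's; a Whittle-index mapping would slot into the same interface at the cost of more computation.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of an irreducible, aperiodic chain P (row-stochastic)."""
    n = P.shape[0]
    # Solve pi @ P = pi together with sum(pi) = 1 as a least-squares system.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1); b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def best_fixed_arm_mapping(theta, num_active):
    """Policy mapping: theta -> a stationary deterministic policy.

    Here theta is assumed to be a list of (P_active, rewards) pairs, one per arm;
    arms are ranked by expected reward under the stationary distribution of P_active.
    """
    scores = [stationary_distribution(np.asarray(P)) @ np.asarray(r) for P, r in theta]
    top = set(np.argsort(scores)[::-1][:num_active].tolist())
    # The returned policy ignores the meta-state and always pulls the same arms.
    return lambda meta_state: top
```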

In this paper, we also adopt the policy mapping framework of Jung and Tewari (2019) and answer an open question raised by them. Specifically, they prove a regret bound for Thompson sampling in episodic restless bandits, where the system periodically resets. Under the episodic assumption, the problem reduces to a finite-horizon problem, which makes the analysis simpler. However, there are many cases (e.g., online recommendations) where a periodic reset is not natural, and they mention the analysis of a learning algorithm in the infinite time horizon as an open question.

We identify explicit conditions in Sec. 4 that ensure the existence of a solution to the Bellman equation of the entire Markov decision process (MDP). Vanilla Thompson sampling is hard to analyze in this setting, so we adapt Thompson sampling with dynamic episodes (TSDE) of Ouyang et al. (2017), originally developed for fully observable MDPs. TSDE (Algorithm 1) has one deterministic and one random termination condition and switches to a new episode if either is met. At the beginning of each episode, TSDE draws a system parameter from the posterior distribution, computes a policy from it, and runs this policy throughout the episode. We theoretically prove a sub-linear regret bound for this algorithm and empirically test it on a simulated dynamic channel access problem.

1.1 Main Result

As mentioned earlier, our learner competes against the policy without the knowledge of . We denote the average long term reward of on the system by , which is a well-defined notion under certain assumptions that will be discussed later. Then we define the frequentist regret by

(1)

where is the learner’s reward at time . We focus on bounding the following Bayesian regret

(2)

where is a prior distribution over and is known to the learner. Our main result is to bound the Bayesian regret of TSDE.
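The displayed definitions in Eqs. 1 and 2 did not survive extraction; a plausible rendering, with symbols introduced here only for readability (θ_* for the true system, J_{π(θ_*)}(θ_*) for the average long term reward of the competitor policy on the true system, and Q for the prior), would be

```latex
% Hedged reconstruction; the symbol names are ours, not necessarily the paper's.
R(T;\theta_*) \;=\; T \, J_{\pi(\theta_*)}(\theta_*) \;-\; \sum_{t=1}^{T} r_t ,
\qquad
BR(T) \;=\; \mathbb{E}_{\theta_* \sim Q}\!\left[ R(T;\theta_*) \right].
```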

Theorem 1.

The Bayesian regret of TSDE satisfies the following bound

where the exact upper bound appears later in Sec. 5.

2 Preliminaries

We begin by formally defining our problem setting.

2.1 Problem Setting

As stated earlier, we focus on a Bayesian framework where the true system, denoted as , is a random object that is drawn from a prior distribution before the interaction with the system begins. In line with Ouyang et al. (2017), we assume that the prior is known to the learner, and we denote its support by .

At each time step , the learner selects arms from , which become active while the others remain passive. Following Ortner et al. (2012), we require the passive Markov chains to be irreducible and aperiodic. As a result, we can associate with each arm the mixing time of . Let be the distribution of the state of arm starting from a state and remaining passive for steps, and let be the stationary distribution. Then, we define

(3)

and work under the assumption of a known mixing time. (This knowledge may be relaxed to knowledge of an upper bound on the mixing time without affecting our results.)

Assumption 1 (Mixing times).

For all and , is irreducible and aperiodic, and is known to the learner.
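As an illustration of Asm. 1, the sketch below estimates a passive chain's mixing time by iterating the transition matrix and checking the L1 distance to the stationary distribution from every starting state; the threshold is an arbitrary placeholder, since the precise accuracy used in Eq. 3 is not reproduced above.

```python
import numpy as np

def passive_mixing_time(P_passive, threshold=0.05, max_steps=10_000):
    """Smallest t such that, from every starting state, the t-step distribution of the
    passive chain is within `threshold` of the stationary distribution in L1 norm.
    (Illustrative; Eq. 3 fixes its own accuracy parameter.)"""
    P = np.asarray(P_passive, dtype=float)
    n = P.shape[0]
    # Stationary distribution: left eigenvector of P for eigenvalue 1.
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    pi = pi / pi.sum()
    dist = np.eye(n)   # row s = state distribution after 0 passive steps from s
    for t in range(1, max_steps + 1):
        dist = dist @ P
        if np.max(np.abs(dist - pi).sum(axis=1)) <= threshold:
            return t
    raise RuntimeError("chain did not mix within max_steps")
```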

The learner’s action at time is written as , indicating the active arms. For all the chosen arms, the learner observes the state and receives a reward , where the rewards are known deterministic functions of the state for all . The objective of the learner is to choose the best sequence of arms, given the history (states and actions) observed so far, so as to maximize the long term average reward

(4)

2.2 From POMDP to MDP

By nature, the RMAB problem we consider is a partially observable Markov decision process (POMDP), since the arms evolve in a Markovian fashion and we only observe the states of the active arms. Nonetheless, one can turn this POMDP into a fully observable Markov decision process (MDP) by introducing belief states, i.e., distributions over states given the history. Notice that the number of belief states therefore becomes (countably) infinite even if the original problem is finite. Following Ortner et al. (2012) and Jung and Tewari (2019), we track the history by introducing a meta-state , fully observed at time , from which we can reconstruct the belief states. Formally, we define where

For each , is the last observation of the state process before time , and is the time elapsed since this last observation. Further, it is clear that is a Markov process on a countably infinite state space . As a result, the maximization of the partially observable problem in Eq. 4 is equivalent to the maximization of the fully observable one

(5)

where

(6)

We use the notation and to emphasize that the random behavior of is governed by the system . We also assume that the initial state is known to the learner.
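The meta-state bookkeeping above is straightforward to implement: for every arm we keep its last observed state and the number of rounds elapsed since that observation. A minimal sketch (names and the exact counting convention are ours):

```python
from dataclasses import dataclass

@dataclass
class ArmMetaState:
    last_state: int   # last observed state of this arm
    elapsed: int      # rounds since that observation (convention assumed: 1 right after a pull)

def update_meta_states(meta, observations):
    """End-of-round update. `meta` holds one ArmMetaState per arm; `observations`
    maps the indices of the pulled arms to the states observed this round."""
    for k, m in enumerate(meta):
        if k in observations:
            meta[k] = ArmMetaState(last_state=observations[k], elapsed=1)
        else:
            meta[k] = ArmMetaState(last_state=m.last_state, elapsed=m.elapsed + 1)
    return meta
```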

2.3 Policy Mapping

To maximize the long term average reward in Eq. 5, Ortner et al. (2012) construct a finite approximation of the countable MDP, which allows them, under a bounded diameter assumption, to compute an -optimal policy for a given . However, their computational complexity is prohibitive for practical applications. As explained in the introduction, we follow a different approach, in line with Jung and Tewari (2019), which achieves both tractability and optimality through the use of a policy mapping . It associates each parameter with a stationary deterministic policy . To ensure the well-posedness of the long-term average reward, we impose the following assumption on .

Assumption 2 (Bounded span).

For all , the parameter/policy pair satisfies Cond. 2.

Cond. 2 is formalized and discussed in detail in Sec. 4. Asm. 2 should be understood as the counterpart of the bounded diameter assumption made by Ortner et al. (2012) or the bounded span assumption by Ouyang et al. (2017) adapted to our policy mapping approach.

3 Algorithm

Algorithm 1 builds on Thompson Sampling with Dynamic Episodes (TSDE) of Ouyang et al. (2017). At the beginning of each episode , we draw system parameters from the latest posterior , compute the policy , and run throughout the episode. We proceed to the next episode if one of the termination conditions, which will appear shortly, occurs.

1:  Input prior , policy mapping ,
2:  Input mixing time , initial state
3:  Initialize , ,
4:  for episodes  do
5:     Set and
6:     Draw and compute
7:     while not termination condition (Eq. 7) do
8:        Select active arms
9:        Observe states for active arms
10:        Update to and to
11:        Increment
12:     end while
13:  end for
Algorithm 1 TSDE in restless bandits
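The control flow of Algorithm 1 can be rendered compactly in Python as below. All names are ours; `sample_posterior`, `update_posterior`, and the environment are abstracted behind callables, and the `snapshot_counts` / `terminate` / `obs_to_keys` helpers are placeholders, the first two of which are sketched after Eq. 7 below.

```python
def tsde(sample_posterior, update_posterior, policy_mapping, env, horizon):
    """Skeleton of TSDE for restless bandits; a sketch, not the authors' code.

    sample_posterior(data)    -> a sampled system parameter theta_k
    update_posterior(data, o) -> data augmented with the new observations o
    policy_mapping(theta)     -> stationary deterministic policy: meta-state -> active arms
    env.step(arms)            -> (observations, reward, next meta-state)
    """
    data, meta_state, t = [], env.reset(), 0
    prev_len = 0        # length of the previous episode (deterministic rule)
    visit_log = []      # truncated (arm, last state, elapsed) tuples, see Sec. 3
    while t < horizon:
        theta_k = sample_posterior(data)      # line 6 of Algorithm 1
        policy = policy_mapping(theta_k)
        ep_len, counts_at_start = 0, snapshot_counts(visit_log)
        while t < horizon and not terminate(ep_len, prev_len, visit_log, counts_at_start):
            arms = policy(meta_state)         # select active arms
            obs, reward, meta_state = env.step(arms)
            data = update_posterior(data, obs)
            visit_log.extend(obs_to_keys(obs))  # hypothetical helper producing truncated tuples
            ep_len, t = ep_len + 1, t + 1
        prev_len = ep_len
    return data
```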

Before introducing the termination conditions, let us discuss Asm. 1. As pointed out in Ortner et al. (2012, Eq. 1), we have for all and . Since we only need an accuracy of , which does not affect the regret significantly, we define

Here we assume the time horizon is known. When it is unknown, we can use the doubling trick and get the same regret bound up to a constant factor. We remark that .

For tuples , we define

Then we introduce the truncated counter

The intuition behind this aggregation is that the distribution of the states remains similar for sufficiently large , thanks to the mixing time. As a result, the possible number of tuples with is at most . When there is no ambiguity, we write for brevity and let be the set of all possible values of .

We terminate the episode if

(7)

where represents the length of episode . This quantity can differ across episodes, which is where the name dynamic episodes comes from. In addition, the second condition makes the quantity random, and one recovers the well-known lazy update scheme from this condition (Jaksch et al., 2010; Ouyang et al., 2017). The underlying intuition is that one should update the policy only after gathering enough additional information about the unknown Markov process.
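A sketch of the two rules, matching the `terminate` and `snapshot_counts` placeholders above. As we read Eq. 7 (and following Ouyang et al. (2017)), the deterministic rule ends the episode once its length exceeds the previous episode's length, and the lazy-update rule ends it as soon as the count of some truncated (arm, last state, elapsed time) tuple has more than doubled since the episode began.

```python
from collections import Counter

def truncated_key(arm, last_state, elapsed, t_mix):
    """Aggregate elapsed times beyond the (known) mixing time into a single bucket."""
    return (arm, last_state, min(elapsed, t_mix))

def snapshot_counts(visit_log):
    """Counts of the truncated tuples observed so far."""
    return Counter(visit_log)

def terminate(ep_len, prev_len, visit_log, counts_at_start):
    # Rule 1 (deterministic): the episode length exceeds the previous one.
    if ep_len > prev_len:
        return True
    # Rule 2 (lazy update): some truncated counter more than doubled within the episode.
    # With zero initial counts this also fires on first visits, a convention we assume.
    current = Counter(visit_log)
    return any(n_now > 2 * counts_at_start.get(key, 0) for key, n_now in current.items())
```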

4 Planning Problem

The MDP reformulation in Sec. 2.2 reduces the objective to maximizing Eq. 5. However, we inherit from the original POMDP severe difficulties in the planning task. For example, given the parametrization , how can one efficiently compute a stationary and deterministic policy (i.e., one that maps a state to an action in a deterministic manner) that maximizes the average long term reward, and more importantly, does such a policy exist? Unfortunately, the average reward POMDP problem is not well understood, in contrast with the finite state average reward MDP. In particular, it is known (Bertsekas, 1995) that the long term average reward may not be constant w.r.t. the initial state. Even when it is, 1) the Bellman equation may not have a solution; 2) value iteration may fail to converge to the optimal average reward; 3) an optimal policy, stationary or non-stationary, may not exist; 4) finally, even when an optimal policy exists, Papadimitriou and Tsitsiklis (1999) show that it is generally PSPACE-hard to compute.

To overcome this difficulty, Ortner et al. (2012) perform a state aggregation to reduce the countably infinite MDP to a finite one, which under the bounded diameter assumption can be solved using standard techniques. Although this reduction allows them to compute an -optimal policy, the computational complexity of their approach remains prohibitive for practical applications. On the other hand, a significant amount of work has been done on designing good policies in the RMAB framework, for instance the best fixed arm policy (which is optimal in the classical MAB framework), the myopic policy (Javidi et al., 2008), or the Whittle index policy (Whittle, 1988; Liu and Zhao, 2010). In line with Jung and Tewari (2019), we leverage this prior knowledge and follow an alternative approach that consists in competing with the best policy within some known class of policies. Formally, let be the set of stationary deterministic policies, and assume a policy mapping is given and known to the learner. This set of deterministic mappings is quite rich in that the optimal policy can also be represented when it exists. If one cares more about efficiency, one can instead use an efficient mapping, at the cost of weakening the competitor.

Finally, in contrast to Ortner et al. (2012), our approach does not turn the countable MDP problem into a finite one. Hence, it requires a further condition on the parameter space and the policy mapping for the average reward criterion in Eq. 5 to be well-posed. More precisely, we expect the average reward to be independent of the initial state and associated with a Bellman equation whose bias function has a bounded span. For a given and associated policy , we introduce the following conditions.

Condition 1.

Let be the set of real-valued functions with bounded span. There exist and a constant which satisfy, for all ,

where the expectation is taken over evolving from given the action and the system .
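In words, Cond. 1 asks for a constant average reward and a bounded-span bias function that together satisfy an average-reward Bellman equation for the fixed policy. In generic notation (ours, used only for readability since the paper's symbols are elided above), the displayed relation plausibly reads

```latex
% Generic fixed-policy average-reward Bellman equation; the symbols are ours.
J + h(s) \;=\; r\bigl(s, \pi(s)\bigr) + \mathbb{E}\bigl[\, h(s') \mid s, \pi(s), \theta \,\bigr]
\qquad \text{for every meta-state } s .
```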

Under Cond. 1, it is known (see Prop. 2) that the long term average reward of is well-defined (the reduces to the standard ), independent of the initial state , and associated with a Bellman equation with a bounded span bias function. However, Cond. 1 is implicit and difficult to verify, as it relies on an existence result. (If a function satisfies Cond. 1, it is not unique, since adding any constant to it still meets the requirement.) This motivates the following alternative condition, known as the discounted approach in the literature.

Condition 2.

For any , let be the discounted infinite horizon value function defined as

Then is uniformly bounded for all .

The introduction of the discount factor guarantees that is a well-defined function, and hence Cond. 2 reduces to asserting the uniform boundedness of a known family of functions. Further, it also guarantees that the long term average reward is well-defined, as it implies Cond. 1.
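On a finite truncation of the meta-state space (our simplification; the condition itself concerns the full countable space), Cond. 2 can be probed numerically for a fixed policy: the discounted value function solves a linear system, and one can watch how its span behaves as the discount factor approaches one, which is a plausible reading of the elided quantity the condition bounds.

```python
import numpy as np

def discounted_value(P_pi, r_pi, gamma):
    """Discounted value of a fixed policy on a finite state space:
    v = r_pi + gamma * P_pi @ v, i.e. v solves (I - gamma * P_pi) v = r_pi."""
    n = len(r_pi)
    return np.linalg.solve(np.eye(n) - gamma * np.asarray(P_pi), np.asarray(r_pi))

def span(v):
    return float(np.max(v) - np.min(v))

def probe_condition_2(P_pi, r_pi, gammas=(0.9, 0.99, 0.999)):
    """Finite-truncation heuristic: report the span of the discounted values
    for discount factors approaching one."""
    return {g: span(discounted_value(P_pi, r_pi, g)) for g in gammas}
```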

Proposition 2.

Let be a system parameter and be a policy. Then the following hold.


  • Cond. 2 implies Cond. 1.

  • Under Cond. 1 (or Cond. 2), the quantity

    (8)

    is constant and independent of the initial state. Further, there exists a non-negative function , with bounded span , such that for any ,

    (9)

We denote as the uniform upper bound on the span.

The proof of Prop. 2 can be adapted from Puterman (2014, Thm.8.10.7) for a given (i.e., not necessarily optimal) policy. We postpone the proof to App. A.

5 Regret Bound

In this section, we bound the Bayesian regret of TSDE (Algorithm 1). The analysis crucially relies on four distinct properties: 1) the Bellman equation in Eq. 9 satisfied by the average cost at each policy update, 2) the Thompson sampling algorithm which samples parameters according to the posterior, hence ensuring that and are conditionally identical in distribution, 3) the concentration of the empirical estimates around the , and 4) the update scheme in Eq. 7 which controls the number of episodes while preserving sufficient measurability of the termination times.

We provide here a proof sketch to explain how we leverage those properties and how they translate in key intermediate results that allow us to obtain the final bound. The formal proofs can be found in App. B.

5.1 Regret Decomposition

Under Asm. 2, Prop. 2 ensures that each sampled parameter policy pair satisfies the Bellman equation (Eq. 9):

As a result, we can decompose the frequentist regret episode by episode and obtain, over ,

where

See App. B for a more detailed derivation.

Bounding . The first regret term is addressed thanks to the well-known expectation identity (see Russo and Van Roy (2014)), leveraging that conditionally, .

Lemma 3 (Expectation identity).

Suppose and have the same distribution given a history . For any -measurable function , we have

As pointed out in Ouyang et al. (2017), one cannot apply Lemma 3 directly to and because of the measurability issue arising from the lazy-update scheme in Eq. 7. In line with Ouyang et al. (2017), we overcome this difficulty thanks to the first deterministic termination rule in Eq. 7. Taking the expectation w.r.t. leads to the following lemma.

Lemma 4 (Ouyang et al. (2017), Lemma 3 and 4).

where is the total number of episodes until time .

Bounding . Clearly, involves telescopic sums over each episode . As a result, it solely depends on the number of policy switches and on the uniform span bound in Prop. 2.

Lemma 5.

As a result, bounding both and reduces to a tight bound on the number of episodes, .

Bounding and . Finally, the last regret terms deal with the model misspecification. That is to say, they depend on the on-policy error between the empirical estimate and the true transition model. Formally, Lemmas 6 and 7 show that they scale with

where is the probability distribution of arm 's state under parametrization , and is its empirical estimate at the beginning of episode . The core of the proofs thus lies in deriving a high-probability confidence set whose associated on-policy error is cumulatively bounded by . We state the lemmas here and postpone the proofs to App. B.

Lemma 6.

satisfies the following bound

Lemma 7.

satisfies the following bound

We detail the construction and probabilistic argument of the confidence set later in the section.

5.2 Bounding the Number of Episodes

As briefly discussed in Sec. 3, each episode has a random length , and the number of episodes also becomes random. In order to bound and , we first bound this quantity. As discussed in Osband and Van Roy (2014), the specific structure of our problem, due to the MDP formulation of the original POMDP, allows us to guarantee a tighter bound w.r.t. the number of states than straightforwardly applying the TSDE analysis to the meta-state . In particular, we leverage this structure to obtain a bound that depends on the number of states through the summation instead of the product .

Lemma 8.

The number of episodes satisfies the following inequality almost surely

Proof.

Following Ouyang et al. (2017), we define macro episodes with start times for a sub-sequence such that and

Note that a macro episode starts when the second termination criterion is triggered. Ouyang et al. (2017) prove in their Lemma 1 that

(10)

where is the number of macro episodes. We claim

(11)

which proves our lemma when combined with Eq. 10.

For each , we define

This means that gets doubled times out of episodes. It leads to the following inequality

Then we have

where we added to account for the initial case and the third inequality holds due to Jensen’s inequality along with the fact that . The equality holds because is the total number of active arms until time . This proves our claim (Eq. 11) and therefore the lemma. ∎

5.3 Confidence Set

To bound and , we construct a confidence set for the system parameters . Recall that represents . Suppose at time , the state of arm was observed to be in rounds ago. Let denote the probability distribution of the arm’s state if the true system were . For an individual probability weight, we write for . Using the samples collected so far, we can also compute an empirical distribution . We construct a confidence set as a collection of such that is close to . Namely in episode , we define

where

Since is -measurable, Lemma 3 provides

The following lemma bounds this probability.

Lemma 9.

For every episode , we can bound

Proof.

For an episode , pick and let . If equals , then and the inequality is trivial. Suppose . We first analyze the case . Weissman et al. (2003) show that

(12)

Setting , we get

(13)

For the case , we want to prove the same probability bound as in Eq. 13 but cannot directly use Eq. 12 due to the aggregation. We can still show a similar bound using the proof technique of Weissman et al. (2003).

For simplicity, write , , and . Then it can be easily checked that

Using this and the union bound, we can write

(14)

By the definition of , we have

Then Hoeffding’s inequality implies that

Plugging this in Eq. 14, we get

which shows Eq. 13 for the case .

Since , applying the union bound finishes the proof. ∎

Furthermore, the confidence set satisfies that the cumulative on-policy error (see Sec. 5.1) is bounded.

Lemma 10.

On the high-probability event , we can show

The proof of Lemma 10 is postponed to App. B. We want to emphasize that the set only appears in the analysis and has nothing to do with running TSDE. For example, we can set an arbitrary value for to make the proof work. The main idea of bounding and is that the event happens with high probability, and when it does, and behave similarly.
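To make the construction of Sec. 5.3 concrete, here is a sketch of the confidence check: the model's predicted next-state distribution for an arm, conditioned on a truncated (last state, elapsed time) tuple, must lie within an L1 ball around the empirical distribution, with a Weissman-style radius shrinking like sqrt(log(·)/n). The exact radius in the paper's definition is elided above, so the constants below are placeholders.

```python
import numpy as np

def l1_radius(n_samples, n_states, delta=0.05):
    """Placeholder Weissman-style radius: with probability at least 1 - delta,
    ||p_hat - p||_1 <= sqrt(2 * (n_states * log 2 + log(1/delta)) / n_samples)."""
    if n_samples == 0:
        return 2.0  # the L1 distance between two distributions never exceeds 2
    return float(np.sqrt(2.0 * (n_states * np.log(2.0) + np.log(1.0 / delta)) / n_samples))

def in_confidence_set(p_model, p_hat, n_samples, delta=0.05):
    """Is the predicted distribution consistent with the empirical one for a given
    (arm, last state, truncated elapsed time) tuple?"""
    p_model, p_hat = np.asarray(p_model), np.asarray(p_hat)
    return float(np.abs(p_model - p_hat).sum()) <= l1_radius(n_samples, len(p_model), delta)
```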

5.4 Putting Everything Together

Plugging Lemma 4, 5, 6, and 7 into the regret decomposition, we prove our main result.

Theorem 1 (Exact regret bound, restated).

The Bayesian regret of TSDE is bounded by

where .

6 Experiments

We empirically evaluated TSDE (Algorithm 1) on simulated data. Following Jung and Tewari (2019), we chose the Gilbert-Elliott channel model in Figure 1 to model each arm. This model assumes binary states and is widely used in communication systems (e.g., see Liu and Zhao (2010)).

Figure 1: The Gilbert-Elliott channel model

For simplicity, we assumed and . This means that the learner’s action does not affect the transition matrices and the binary reward equals one if and only if the state is good. We also assumed the initial states of the arms are all good. Each arm has two parameters: and . We set the prior to be uniform over a finite set . Expectations are approximated by Monte Carlo simulation with sample size or greater.

We investigated three index-based policies: the best fixed arm policy, the myopic policy, and the Whittle index policy. Index-based policies compute an index for each arm using only the samples from that arm and choose the top arms. Due to this decoupling, these policies are computationally efficient. The best fixed arm policy computes the expected reward according to the stationary distribution. The myopic policy maximizes the expected reward of the current round. The Whittle index policy was first introduced by Whittle (1988) and shown to be powerful in this particular setting by Liu and Zhao (2010). It is very popular in RMABs as it can efficiently approximate the optimal policy in many different settings. As a remark, all these policies reduce to the best fixed arm policy in stationary bandits.
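For concreteness, the building blocks of these indices for the Gilbert-Elliott model can be sketched as follows. We write p01 for the bad-to-good and p11 for the good-to-good transition probabilities (names are ours): the best-fixed-arm index is the stationary probability of the good state, and the myopic index is the current belief of being in the good state, propagated from the last observation. The Whittle index has a closed form for this model, derived in Liu and Zhao (2010), which we do not reproduce here.

```python
import numpy as np

def stationary_good_prob(p01, p11):
    """Best-fixed-arm index: long-run probability of the good state."""
    p10 = 1.0 - p11
    return p01 / (p01 + p10)

def belief_good(p01, p11, last_state_good, elapsed):
    """Myopic index: probability the channel is good now, given an observation
    made `elapsed` rounds ago, obtained by propagating the 2x2 chain."""
    P = np.array([[1.0 - p01, p01],
                  [1.0 - p11, p11]])          # state 0 = bad, state 1 = good
    start = np.array([0.0, 1.0]) if last_state_good else np.array([1.0, 0.0])
    return float((start @ np.linalg.matrix_power(P, elapsed))[1])

def myopic_policy(thetas, meta, num_active):
    """Pull the arms with the highest current belief of being in the good state.
    `thetas` is a list of (p01, p11) pairs and `meta` a list of (last_state_good, elapsed)."""
    beliefs = [belief_good(p01, p11, s, tau)
               for (p01, p11), (s, tau) in zip(thetas, meta)]
    return set(np.argsort(beliefs)[-num_active:].tolist())
```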


Figure 2: Bayesian regrets of TSDE (left) and their log-log plots (right)

We first analyzed the Bayesian regret. Here we used , , and . The true system was actually drawn from the uniform prior. The average rewards smoothed by the prior, , were (fixed), (myopic), and (Whittle), showing the power of the Whittle index policy. As shown in Figure 2, the Bayesian regrets were sub-linear regardless of the competitor policy. The log-log plot shows that they are indeed , as the dotted line has a slope of .


Figure 3: Average rewards of TSDE converge to their benchmarks (left); Posterior weights of the true parameters monotonically increase to one (right)

Then we tested the frequentist setting to empirically validate that TSDE still performs well even though our theory only bounds the Bayesian regret. We chose , , , and we again adopted the setting from Jung and Tewari (2019). This setting is particularly interesting because each arm has the same stationary distribution of , which means that the best fixed arm policy becomes indifferent among the arms. The average rewards, , were (fixed), (myopic), and (Whittle), again justifying the power of the Whittle index policy. On the left plot of Figure 3, the three horizontal dotted lines represent the benchmarks for each of the competitors, and the solid lines show the time-averaged cumulative rewards. Every solid line converged to the corresponding dotted line. The right plot shows the posterior probability of the true parameters under the Whittle index policy. For all arms, these probabilities monotonically increased to one, illustrating that TSDE was learning properly. From this, we can assert that TSDE still performs reasonably well, at least when the true parameters lie in the support of the prior.

Acknowledgements

AT and YJ acknowledge the support of NSF CAREER grant IIS-1452099. AT was also supported by a Sloan Research Fellowship.

References

  • Bertsekas (1995) Dimitri P. Bertsekas. Dynamic programming and optimal control, volume 2. Athena Scientific, Belmont, MA, 1995.
  • Bertsimas and Niño-Mora (2000) Dimitris Bertsimas and José Niño-Mora. Restless bandits, linear programming relaxations, and a primal-dual index heuristic. Operations Research, 48(1):80–90, 2000.
  • Biglieri et al. (2013) Ezio Biglieri, Andrea J. Goldsmith, Larry J. Greenstein, Narayan B. Mandayam, and H. Vincent Poor. Principles of cognitive radio. Cambridge University Press, 2013.
  • Blondel and Tsitsiklis (2000) Vincent D. Blondel and John N. Tsitsiklis. A survey of computational complexity results in systems and control. Automatica, 36(9):1249–1274, 2000.
  • Hero et al. (2007) Alfred O. Hero, David Castañón, Doug Cochran, and Keith Kastella. Foundations and applications of sensor management. Springer Science & Business Media, 2007.
  • Jaksch et al. (2010) Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
  • Javidi et al. (2008) Tara Javidi, Bhaskar Krishnamachari, Qing Zhao, and Mingyan Liu. Optimality of myopic sensing in multi-channel opportunistic access. In 2008 IEEE International Conference on Communications, pages 2107–2112. IEEE, 2008.
  • Jung and Tewari (2019) Young Hun Jung and Ambuj Tewari. Regret bounds for Thompson sampling in restless bandit problems. arXiv preprint arXiv:1905.12673, 2019.
  • Liu et al. (2011) Haoyang Liu, Keqin Liu, and Qing Zhao. Logarithmic weak regret of non-Bayesian restless multi-armed bandit. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1968–1971. IEEE, 2011.
  • Liu et al. (2013) Haoyang Liu, Keqin Liu, and Qing Zhao. Learning in a changing world: Restless multiarmed bandit with unknown dynamics. IEEE Transactions on Information Theory, 59(3):1902–1916, 2013.
  • Liu and Zhao (2010) Keqin Liu and Qing Zhao. Indexability of restless bandit problems and optimality of Whittle index for dynamic multichannel access. IEEE Transactions on Information Theory, 56(11):5547–5567, 2010.
  • Meshram et al. (2017) Rahul Meshram, Aditya Gopalan, and D. Manjunath. Restless bandits that hide their hand and recommendation systems. In IEEE International Conference on Communication Systems and Networks (COMSNETS), pages 206–213. IEEE, 2017.
  • Ortner et al. (2012) Ronald Ortner, Daniil Ryabko, Peter Auer, and Rémi Munos. Regret bounds for restless Markov bandits. In International Conference on Algorithmic Learning Theory, pages 214–228. Springer, 2012.
  • Osband and Van Roy (2014) Ian Osband and Benjamin Van Roy. Near-optimal reinforcement learning in factored MDPs. In Advances in Neural Information Processing Systems, pages 604–612, 2014.
  • Ouyang et al. (2017) Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. Learning unknown Markov decision processes: A Thompson sampling approach. In Advances in Neural Information Processing Systems, pages 1333–1342, 2017.
  • Papadimitriou and Tsitsiklis (1999) Christos H. Papadimitriou and John N. Tsitsiklis. The complexity of optimal queuing network control. Mathematics of Operations Research, 24(2):293–305, 1999.
  • Platzman (1980) Loren K. Platzman. Optimal infinite-horizon undiscounted control of finite probabilistic systems. SIAM Journal on Control and Optimization, 18(4):362–380, 1980.
  • Puterman (2014) Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
  • Russo and Van Roy (2014) Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
  • Tekin and Liu (2012) Cem Tekin and Mingyan Liu. Online learning of rested and restless bandits. IEEE Transactions on Information Theory, 58(8):5588–5611, 2012.
  • Weber and Weiss (1990) Richard R. Weber and Gideon Weiss. On an index policy for restless bandits. Journal of Applied Probability, 27(3):637–648, 1990.
  • Weissman et al. (2003) Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdú, and Marcelo J. Weinberger. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep., 2003.
  • Whittle (1988) Peter Whittle. Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 25(A):287–298, 1988.


Appendix A Proof of Prop. 2

We first prove that Cond. 1 guarantees the constant average cost and the associated Bellman equation and then show that Cond. 2 implies Cond. 1.

  1. Let and satisfy Cond. 1 for some bounded function and constant . Then, for all ,

    where the next "meta"-state is drawn according to the Markov transition probability of knowing the current state and the action under parametrization . Thus,

    Multiplying both sides of the equation by and taking the expectation given leads to

    Finally, since is bounded, letting one has and thus

    Furthermore, since is constant, it ensures that is independent of the initial state. Replacing by in Cond. 1, we directly obtain that is associated with the Bellman equation. Since the function is arbitrary up to a constant term (adding a constant still satisfies the Bellman equation and does not affect the span), we can set it to be non-negative without loss of generality by defining , and the pair satisfies the Bellman equation (Eq. 9). Additionally, we have

  2. We now show that Cond. 2 implies Cond. 1. The proof is adapted from Puterman (2014, Thm. 8.10.7), which is derived for optimal policies. The core idea is to consider a sequence of discount factors and to choose an appropriate subsequence (also indexed by for ease of notation) to assert the existence of and , thanks to the uniform boundedness of .
    First, notice that for all , and thus that for all . Also, it is well known that satisfies the discounted Bellman equation:

    Let be an arbitrary state and define . Clearly, is uniformly bounded and satisfies

    (15)

    Since and are uniformly bounded, so is . Further, the Bolzano-Weierstrass theorem for bounded sequences together with a standard diagonal argument ensures that there exists a subsequence such that

    • converges pointwise to some function .

    Finally, since is uniformly bounded so is :