Thompson Sampling in Non-Episodic Restless Bandits

10/12/2019
by Young Hun Jung, et al.

Restless bandit problems assume time-varying reward distributions of the arms, which adds flexibility to the model but makes the analysis more challenging. We study learning algorithms over the unknown reward distributions and prove a sublinear O(√T log T) regret bound for a variant of Thompson sampling. Our analysis applies in the infinite time horizon setting, resolving the open question raised by Jung and Tewari (2019), whose analysis is limited to the episodic case. We adopt their policy mapping framework, which allows our algorithm to be efficient and simultaneously keeps the regret meaningful. Our algorithm adapts the TSDE algorithm of Ouyang et al. (2017) in a non-trivial manner to account for the special structure of restless bandits. We test our algorithm on a simulated dynamic channel access problem with several policy mappings, and the empirical regrets agree with the theoretical bound regardless of the choice of the policy mapping.
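
The abstract only sketches the algorithmic ideas. Below is a minimal illustrative sketch, not the paper's implementation, of episode-based Thompson sampling on a simulated dynamic channel access problem where each arm is a two-state (Gilbert-Elliott) Markov chain. The channel model, the simplified episode schedule, the myopic policy mapping, and all names (sample_model, myopic_policy, etc.) are assumptions for illustration; the paper's actual algorithm uses a TSDE-style episode termination rule and the policy mapping framework of Jung and Tewari (2019).

    # Illustrative sketch (not the paper's code): episode-based Thompson
    # sampling for a restless bandit with two-state (Gilbert-Elliott) arms.
    # Assumptions for this sketch:
    #   * each arm is an independent 2-state Markov chain with unknown
    #     transition probabilities p01 (bad->good) and p11 (good->good);
    #   * pulling an arm yields reward 1 if it is in the good state, else 0,
    #     and reveals that arm's current state;
    #   * the "policy mapping" is a simple myopic rule that plays the arm
    #     with the highest believed probability of being in the good state.
    import numpy as np

    rng = np.random.default_rng(0)

    N, T = 4, 5000                       # number of arms, horizon
    true_p01 = rng.uniform(0.1, 0.5, N)  # unknown to the learner
    true_p11 = rng.uniform(0.5, 0.9, N)

    # Beta(1, 1) priors on each arm's p01 and p11: columns = (successes, failures).
    post = {"p01": np.ones((N, 2)), "p11": np.ones((N, 2))}

    def sample_model():
        """Draw one plausible model from the current posterior (Thompson step)."""
        p01 = rng.beta(post["p01"][:, 0], post["p01"][:, 1])
        p11 = rng.beta(post["p11"][:, 0], post["p11"][:, 1])
        return p01, p11

    def myopic_policy(belief):
        """Policy mapping: play the arm most likely to be in the good state."""
        return int(np.argmax(belief))

    states = (rng.random(N) < 0.5).astype(int)   # true (hidden) arm states
    belief = np.full(N, 0.5)                     # learner's belief of "good"
    last_played, last_obs = -1, -1
    episode_len, t, total_reward = 1, 0, 0.0

    while t < T:
        p01, p11 = sample_model()                # resample once per episode
        for _ in range(episode_len):
            if t >= T:
                break
            arm = myopic_policy(belief)
            reward = states[arm]                 # observe the pulled arm's state
            total_reward += reward

            # Posterior update only when the same arm was played on the previous
            # step, so the observed transition is a single Bernoulli draw.
            if arm == last_played:
                key = "p11" if last_obs == 1 else "p01"
                post[key][arm, 0 if reward == 1 else 1] += 1
            last_played, last_obs = arm, reward

            # Belief update: the observed arm's belief resets to its state, then
            # all beliefs drift under the sampled dynamics (arms are restless).
            belief = belief * p11 + (1 - belief) * p01
            belief[arm] = reward * p11[arm] + (1 - reward) * p01[arm]

            # All arms evolve according to the true (unknown) dynamics.
            good = rng.random(N) < np.where(states == 1, true_p11, true_p01)
            states = good.astype(int)
            t += 1
        episode_len += 1                         # slowly growing episodes (TSDE-like)

    print(f"average reward over {T} steps: {total_reward / T:.3f}")

In the actual algorithm the episode length is controlled by the TSDE termination criteria rather than this linear schedule, and the policy mapping can be swapped (e.g., for a Whittle-index-style policy) without changing the regret analysis, which is what the channel access experiments in the paper vary.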

Related research

07/23/2020 · Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation
We develop several new algorithms for learning Markov Decision Processes...

07/12/2021 · Continuous Time Bandits With Sampling Costs
We consider a continuous-time multi-arm bandit problem (CTMAB), where th...

05/29/2019 · Regret Bounds for Thompson Sampling in Restless Bandit Problems
Restless bandit problems are instances of non-stationary multi-armed ban...

12/06/2019 · Solving Bernoulli Rank-One Bandits with Unimodal Thompson Sampling
Stochastic Rank-One Bandits (Katariya et al., 2017a,b) are a simple fram...

10/23/2017 · Sequential Matrix Completion
We propose a novel algorithm for sequential matrix completion in a recom...

11/27/2018 · Optimal Learning for Dynamic Coding in Deadline-Constrained Multi-Channel Networks
We study the problem of serving randomly arriving and delay-sensitive tr...

06/15/2023 · Langevin Thompson Sampling with Logarithmic Communication: Bandits and Reinforcement Learning
Thompson sampling (TS) is widely used in sequential decision making due ...
