Regret Bounds for Thompson Sampling in Restless Bandit Problems

05/29/2019
by   Young Hun Jung, et al.

Restless bandit problems are instances of non-stationary multi-armed bandits. These problems are well studied from the optimization perspective, where the goal is to efficiently find a near-optimal policy when the system parameters are known. However, very few papers adopt a learning perspective, in which the parameters are unknown. In this paper, we analyze the performance of Thompson sampling in restless bandits with unknown parameters. We consider a general policy map to define our competitor and prove an Õ(√T) Bayesian regret bound. Our competitor is flexible enough to represent various benchmarks, including the best fixed action policy, the optimal policy, the Whittle index policy, and the myopic policy. We also present empirical results that support our theoretical findings.
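The abstract names the core ingredients: a posterior over unknown system parameters, a policy map applied to a posterior sample, and Bayesian regret against a benchmark policy, i.e. E[Σ_t (reward of the benchmark − reward of the learner)] averaged over the prior. The snippet below is a minimal, hypothetical sketch of that recipe, not the paper's algorithm or its regret analysis: a few two-state Bernoulli restless arms with unknown transition probabilities, Beta posteriors, and a myopic policy applied to each sampled model. All names and modeling choices here are illustrative assumptions.

```python
# Minimal sketch (not the paper's algorithm): Thompson sampling for a
# two-state restless bandit with Bernoulli rewards and unknown transition
# probabilities, using Beta posteriors and a myopic policy per sample.
import numpy as np

rng = np.random.default_rng(0)

K, T = 3, 2000                          # number of arms, horizon
# True (unknown) per-arm probability of moving to the "good" state (1)
# at the next step, given the current state (0 = bad, 1 = good).
p_true = rng.uniform(0.2, 0.8, size=(K, 2))
reward_of_state = np.array([0.0, 1.0])  # reward 1 in the good state, else 0

# Beta(1, 1) posteriors over each transition probability.
alpha = np.ones((K, 2))
beta = np.ones((K, 2))

states = rng.integers(0, 2, size=K)     # only the pulled arm's state is observed
belief = np.full(K, 0.5)                # P(arm is currently in the good state)

total_reward = 0.0
for t in range(T):
    # Thompson sampling: draw one model from the posterior ...
    p_sample = rng.beta(alpha, beta)
    # ... and act myopically under the sampled model: pull the arm with the
    # largest sampled one-step expected reward.
    exp_reward = belief * p_sample[:, 1] + (1 - belief) * p_sample[:, 0]
    a = int(np.argmax(exp_reward))

    # All arms evolve (restless), but only the pulled arm is observed.
    prev = states.copy()
    for k in range(K):
        states[k] = rng.random() < p_true[k, states[k]]
    r = reward_of_state[states[a]]
    total_reward += r

    # Posterior update for the observed transition of the pulled arm.
    alpha[a, prev[a]] += states[a]
    beta[a, prev[a]] += 1 - states[a]

    # Belief update: pulled arm is known; others drift under the posterior mean.
    p_mean = alpha / (alpha + beta)
    belief = belief * p_mean[:, 1] + (1 - belief) * p_mean[:, 0]
    belief[a] = float(states[a])

print(f"average reward over {T} rounds: {total_reward / T:.3f}")
```

In this toy setup, comparing the learner's cumulative reward against a benchmark policy run with the true p_true (averaged over draws of p_true from the prior) would give an empirical estimate of the Bayesian regret the abstract refers to.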


Related research

04/24/2022 · Complete Policy Regret Bounds for Tallying Bandits
Policy regret is a well established notion of measuring the performance ...

10/14/2022 · Continuous-in-time Limit for Bayesian Bandits
This paper revisits the bandit problem in the Bayesian setting. The Baye...

09/15/2012 · Further Optimal Regret Bounds for Thompson Sampling
Thompson Sampling is one of the oldest heuristics for multi-armed bandit...

09/29/2021 · Batched Bandits with Crowd Externalities
In Batched Multi-Armed Bandits (BMAB), the policy is not allowed to be u...

03/07/2018 · Satisficing in Time-Sensitive Bandit Learning
Much of the recent literature on bandit learning focuses on algorithms t...

10/12/2019 · Thompson Sampling in Non-Episodic Restless Bandits
Restless bandit problems assume time-varying reward distributions of the...

12/03/2018 · Thompson Sampling for Noncompliant Bandits
Thompson sampling, a Bayesian method for balancing exploration and explo...
