Regret Bounds for Restless Markov Bandits

09/12/2012
by Ronald Ortner, et al.

We consider the restless Markov bandit problem, in which the state of each arm evolves according to a Markov process independently of the learner's actions. We suggest an algorithm that, after T steps, achieves Õ(√T) regret with respect to the best policy that knows the distributions of all arms. No assumptions on the Markov chains are made except that they are irreducible. In addition, we show that index-based policies are necessarily suboptimal for the considered problem.
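To make the setting concrete, below is a minimal simulation sketch of a restless Markov bandit environment: every arm's chain transitions at every step whether or not that arm is pulled, and the learner only sees the reward of the arm it chose. The class names, the two-state chains, and the uniform-play baseline are all hypothetical illustrations; this is not the paper's algorithm.

```python
import numpy as np

class RestlessArm:
    """One arm: an irreducible Markov chain with a mean reward per state."""

    def __init__(self, transition, rewards, rng):
        self.P = np.asarray(transition)   # row-stochastic transition matrix
        self.r = np.asarray(rewards)      # reward attached to each state
        self.state = 0
        self.rng = rng

    def step(self):
        # The chain evolves regardless of the learner's action ("restless").
        self.state = self.rng.choice(len(self.r), p=self.P[self.state])

    def pull(self):
        # Reward of the arm's current state, revealed only when pulled.
        return self.r[self.state]

rng = np.random.default_rng(0)

# Two illustrative two-state arms (hypothetical parameters).
arms = [
    RestlessArm([[0.9, 0.1], [0.1, 0.9]], [0.0, 1.0], rng),
    RestlessArm([[0.5, 0.5], [0.5, 0.5]], [0.2, 0.8], rng),
]

T = 10_000
total = 0.0
for t in range(T):
    a = rng.integers(len(arms))   # naive uniform baseline, not the paper's method
    total += arms[a].pull()
    for arm in arms:              # every chain moves at every step
        arm.step()

print(f"average reward of uniform play over T={T}: {total / T:.3f}")
```

The regret studied in the paper is measured against the best policy that knows all transition matrices and reward distributions; the uniform baseline above merely exercises the environment.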


Related research

02/28/2022 · Restless Multi-Armed Bandits under Exogenous Global Markov Process
We consider an extension to the restless multi-armed bandit (RMAB) probl...

12/17/2021 · Learning in Restless Bandits under Exogenous Global Markov Process
We consider an extension to the restless multi-armed bandit (RMAB) probl...

11/05/2020 · Restless-UCB, an Efficient and Low-complexity Algorithm for Online Restless Bandits
We study the online restless bandit problem, where the state of each arm...

02/07/2022 · On learning Whittle index policy for restless bandits with scalable regret
Reinforcement learning is an attractive approach to learn good resource ...

07/20/2020 · Regret Analysis of a Markov Policy Gradient Algorithm for Multi-arm Bandits
We consider a policy gradient algorithm applied to a finite-arm bandit p...

07/08/2022 · Information-Gathering in Latent Bandits
In the latent bandit problem, the learner has access to reward distribut...

02/09/2016 · Compliance-Aware Bandits
Motivated by clinical trials, we study bandits with observable non-compl...
