Regret Bounds for Restless Markov Bandits

09/12/2012
by Ronald Ortner, et al.

We consider the restless Markov bandit problem, in which the state of each arm evolves according to a Markov process independently of the learner's actions. We suggest an algorithm that after T steps achieves Õ(√T) regret with respect to the best policy that knows the distributions of all arms. No assumptions on the Markov chains are made except that they are irreducible. In addition, we show that index-based policies are necessarily suboptimal for the considered problem.
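To make the restless setting concrete, the following is a minimal simulation sketch and not the algorithm from the paper. It assumes finite-state arms given as row-stochastic transition matrices, deterministic per-state rewards (a simplification), and a learner that only sees the state of the arm it pulls. The names `step_chain`, `simulate_restless`, and the two example arms are hypothetical and chosen for illustration only.

```python
import random


def step_chain(state, transition, rng):
    """Sample the next state from row `state` of a row-stochastic matrix."""
    row = transition[state]
    return rng.choices(range(len(row)), weights=row)[0]


def simulate_restless(arms, policy, horizon, seed=0):
    """Run `policy` for `horizon` steps and return the total reward.

    arms:   list of (transition_matrix, reward_per_state) pairs (assumed format)
    policy: callable (t, history) -> index of the arm to pull
    """
    rng = random.Random(seed)
    states = [0] * len(arms)  # assumption: every arm starts in state 0
    history = []              # (arm, observed state, reward) of pulled arms only
    total = 0.0
    for t in range(horizon):
        arm = policy(t, history)
        reward = arms[arm][1][states[arm]]
        total += reward
        history.append((arm, states[arm], reward))
        # Restless: every arm advances, whether or not it was pulled.
        states = [step_chain(s, P, rng) for s, (P, _) in zip(states, arms)]
    return total


# Arm 0 alternates deterministically between a state paying 0 and one paying 1.
# Arm 1 has a single state that always pays 0.5.
ARMS = [
    ([[0.0, 1.0], [1.0, 0.0]], [0.0, 1.0]),
    ([[1.0]], [0.5]),
]


def always_arm_0(t, history):
    return 0


def switch_with_phase(t, history):
    # Pull arm 0 only on odd steps, when it sits in its rewarding state.
    return 0 if t % 2 == 1 else 1


if __name__ == "__main__":
    print(simulate_restless(ARMS, always_arm_0, 10))       # 5.0
    print(simulate_restless(ARMS, switch_with_phase, 10))  # 7.5
```

In this toy instance both arms have the same average reward of 0.5, yet the policy that switches arms in phase with arm 0 collects 7.5 over ten steps, while committing to either arm alone yields 5. This is the sense in which the benchmark is the best policy rather than the best single arm.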
