Restless-UCB, an Efficient and Low-complexity Algorithm for Online Restless Bandits

11/05/2020
by Siwei Wang et al.

We study the online restless bandit problem, where the state of each arm evolves according to a Markov chain, and the reward of pulling an arm depends on both the pulled arm and the current state of the corresponding Markov chain. In this paper, we propose Restless-UCB, a learning policy that follows the explore-then-commit framework. In Restless-UCB, we present a novel method to construct offline instances, which only requires O(N) time complexity (N is the number of arms) and is exponentially better than the complexity of existing learning policies. We also prove that Restless-UCB achieves a regret upper bound of Õ((N+M^3)T^{2/3}), where M is the size of the Markov chain state space and T is the time horizon. Compared to existing algorithms, our result eliminates the exponential factor (in M, N) in the regret upper bound, thanks to a novel exploitation of the sparsity of transitions in general restless bandit problems. As a result, our analysis technique can also be adopted to tighten the regret bounds of existing algorithms. Finally, we conduct experiments based on a real-world dataset to compare the Restless-UCB policy with state-of-the-art benchmarks. Our results show that Restless-UCB outperforms existing algorithms in regret and significantly reduces the running time.
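To make the explore-then-commit framework concrete, here is a minimal, hypothetical sketch of the two-phase structure for a restless bandit: each arm is first pulled along a contiguous trajectory (so its Markov chain is observed in sequence), and the policy then commits to the empirically best arm. This is a deliberate simplification for illustration only; the paper's actual Restless-UCB additionally builds offline instances from the estimated transition matrices and applies UCB-style confidence bonuses, which are omitted here.

```python
import numpy as np

def explore_then_commit(arms, horizon, explore_len):
    """Illustrative explore-then-commit for restless bandits.

    `arms` is a list of callables; arms[i]() advances arm i's Markov
    chain by one step and returns the observed reward. This sketch
    estimates each arm's average reward during exploration and commits
    to the empirically best arm for the remaining rounds.
    """
    n = len(arms)
    total = 0.0
    means = np.zeros(n)
    # Exploration phase: pull each arm for `explore_len` consecutive
    # steps, so its chain is observed along a contiguous trajectory.
    for i in range(n):
        rewards = [arms[i]() for _ in range(explore_len)]
        means[i] = np.mean(rewards)
        total += sum(rewards)
    # Commit phase: play the empirically best arm for the rest of the
    # horizon (no further confidence bonuses in this simplification).
    best = int(np.argmax(means))
    for _ in range(horizon - n * explore_len):
        total += arms[best]()
    return best, total
```

With deterministic single-state "chains" (constant rewards), the policy identifies the better arm after the exploration phase and plays it for the remaining horizon.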


research
07/20/2020

Regret Analysis of a Markov Policy Gradient Algorithm for Multi-arm Bandits

We consider a policy gradient algorithm applied to a finite-arm bandit p...
research
05/23/2019

Average reward reinforcement learning with unknown mixing times

We derive and analyze learning algorithms for policy evaluation, apprent...
research
09/20/2021

Reinforcement Learning for Finite-Horizon Restless Multi-Armed Multi-Action Bandits

We study a finite-horizon restless multi-armed bandit problem with multi...
research
09/12/2012

Regret Bounds for Restless Markov Bandits

We consider the restless Markov bandit problem, in which the state of ea...
research
06/01/2017

Scalable Generalized Linear Bandits: Online Computation and Hashing

Generalized Linear Bandits (GLBs), a natural extension of the stochastic...
research
12/02/2021

Convergence Guarantees for Deep Epsilon Greedy Policy Learning

Policy learning is a quickly growing area. As robotics and computers con...
research
01/20/2023

GBOSE: Generalized Bandit Orthogonalized Semiparametric Estimation

In sequential decision-making scenarios i.e., mobile health recommendati...
