Optimistic Whittle Index Policy: Online Learning for Restless Bandits

05/30/2022
by Kai Wang, et al.

Restless multi-armed bandits (RMABs) extend multi-armed bandits to allow for stateful arms, where the state of each arm evolves restlessly with different transitions depending on whether that arm is pulled. However, solving RMABs requires information on transition dynamics, which is often not available upfront. To plan in RMAB settings with unknown transitions, we propose the first online learning algorithm based on the Whittle index policy, using an upper confidence bound (UCB) approach to learn transition dynamics. Specifically, we formulate a bilinear program to compute the optimistic Whittle index from the confidence bounds on the transition dynamics. Our algorithm, UCWhittle, achieves sublinear O(√(T log T)) frequentist regret in RMABs with unknown transitions. Empirically, we demonstrate that UCWhittle leverages the structure of RMABs and the Whittle index policy solution to achieve better performance than existing online learning baselines across three domains, including real-world maternal and childcare data from a program aimed at reducing maternal mortality.
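
To make the UCB idea concrete, the following is a minimal sketch of how confidence bounds on unknown transition probabilities might be built from visit counts for a two-state arm. It is not the paper's construction: the Hoeffding-style radius, the function name `transition_confidence_interval`, the `delta` parameter, and the example counts are all assumptions chosen for illustration. The bilinear program that turns such intervals into an optimistic Whittle index is not shown.

```python
import math


def transition_confidence_interval(successes, pulls, delta=0.05):
    """Hoeffding-style interval for one Bernoulli transition probability.

    Assumption: a generic two-sided Hoeffding bound, which may differ from
    the confidence region used by UCWhittle.
    """
    if pulls == 0:
        # No observations yet: every probability in [0, 1] is still plausible.
        return 0.0, 1.0
    p_hat = successes / pulls
    radius = math.sqrt(math.log(2.0 / delta) / (2.0 * pulls))
    # Clip to [0, 1] because the quantity is a probability.
    return max(0.0, p_hat - radius), min(1.0, p_hat + radius)


# Hypothetical counts for one arm:
# counts[(state, action)] = (times the next state was 1, times (state, action) was visited)
counts = {(0, 0): (2, 10), (0, 1): (7, 10), (1, 0): (5, 10), (1, 1): (0, 0)}

intervals = {
    state_action: transition_confidence_interval(k, n)
    for state_action, (k, n) in counts.items()
}

for state_action, (low, high) in sorted(intervals.items()):
    print(state_action, round(low, 3), round(high, 3))
```

An optimistic planner would then pick, within each interval, the transition probabilities that make pulling the arm look most valuable. As an arm is visited more often the intervals shrink, so the optimism fades and the estimates approach the empirical frequencies.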


research
04/30/2023

Indexability of Finite State Restless Multi-Armed Bandit and Rollout Policy

We consider finite state restless multi-armed bandit problem. The decisi...
research
02/27/2020

Online Learning for Active Cache Synchronization

Existing multi-armed bandit (MAB) models make two implicit assumptions: ...
research
07/04/2021

Robust Restless Bandits: Tackling Interval Uncertainty with Deep Reinforcement Learning

We introduce Robust Restless Bandits, a challenging generalization of re...
research
10/30/2022

Revisiting Simple Regret Minimization in Multi-Armed Bandits

Simple regret is a natural and parameter-free performance criterion for ...
research
12/07/2022

Stochastic Rising Bandits

This paper is in the field of stochastic Multi-Armed Bandits (MABs), i.e...
research
06/09/2020

Online Learning in Iterated Prisoner's Dilemma to Mimic Human Behavior

Prisoner's Dilemma mainly treat the choice to cooperate or defect as an ...
research
01/04/2018

Lazy Restless Bandits for Decision Making with Limited Observation Capability: Applications in Wireless Networks

In this work we formulate the problem of restless multi-armed bandits wi...
