Thompson Sampling with Information Relaxation Penalties

02/12/2019
by   Seungki Min, et al.

We consider a finite time horizon multi-armed bandit (MAB) problem in a Bayesian framework, for which we develop a general set of control policies that leverage ideas from information relaxations of stochastic dynamic optimization problems. In crude terms, an information relaxation allows the decision maker (DM) to have access to the future (unknown) rewards and incorporate them in her optimization problem to pick an action at time t, but penalizes the decision maker for using this information. In our setting, the future rewards allow the DM to better estimate the unknown mean reward parameters of the multiple arms, and optimize her sequence of actions. By picking different information penalties, the DM can construct a family of policies of increasing complexity that includes, for example, Thompson Sampling and the true optimal (but intractable) policy as special cases. We systematically develop this framework of information relaxation sampling, propose an intuitive family of control policies for our motivating finite time horizon Bayesian MAB problem, and prove associated structural results and performance bounds. Numerical experiments suggest that this new class of policies performs well, in particular in settings where the finite time horizon introduces significant tension in the problem. Finally, inspired by the finite time horizon Gittins index, we propose an index policy that builds on our framework and outperforms the state-of-the-art algorithms in our numerical experiments.
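
For orientation, the simplest member of the family described above is ordinary Thompson Sampling. The following is a minimal sketch of it for a finite-horizon Bernoulli bandit with independent Beta(1, 1) priors; it is not the paper's information relaxation sampling policies and applies no information penalty. The function name `thompson_sampling`, the Bernoulli reward model, the uniform priors, and the arm means in the usage lines are all assumptions made for illustration.

```python
import random


def thompson_sampling(true_means, horizon, seed=0):
    """Run standard Thompson Sampling on a Bernoulli bandit (illustrative only).

    true_means: hypothetical success probabilities, unknown to the policy.
    horizon:    number of rounds T in the finite time horizon.
    Returns the total reward and the number of pulls per arm.
    """
    rng = random.Random(seed)
    k = len(true_means)
    # Assumed Beta(1, 1) prior on each arm's mean reward.
    alpha = [1.0] * k
    beta = [1.0] * k
    pulls = [0] * k
    total_reward = 0

    for _ in range(horizon):
        # Draw one plausible mean per arm from the current posterior.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(k)]
        # Act greedily with respect to the sampled means.
        arm = max(range(k), key=lambda a: samples[a])
        # Observe a Bernoulli reward from the chosen arm.
        reward = 1 if rng.random() < true_means[arm] else 0
        # Conjugate Beta-Bernoulli posterior update.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
        total_reward += reward

    return total_reward, pulls


if __name__ == "__main__":
    reward, pulls = thompson_sampling([0.1, 0.9], horizon=1000, seed=1)
    print("total reward:", reward)
    print("pulls per arm:", pulls)
```

Note that this sampler ignores how many rounds remain: it explores in the same way at the first round as at the last. That is the kind of finite-horizon tension the abstract refers to.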

research
04/30/2023

Indexability of Finite State Restless Multi-Armed Bandit and Rollout Policy

We consider a finite state restless multi-armed bandit problem. The decisi...
research
03/21/2023

Adaptive Experimentation at Scale: Bayesian Algorithms for Flexible Batches

Standard bandit algorithms that assume continual reallocation of measure...
research
09/20/2021

Reinforcement Learning for Finite-Horizon Restless Multi-Armed Multi-Action Bandits

We study a finite-horizon restless multi-armed bandit problem with multi...
research
06/28/2019

Adaptive Sequential Experiments with Unknown Information Flows

Systems that make sequential decisions in the presence of partial feedba...
research
01/06/2016

On Bayesian index policies for sequential resource allocation

This paper is about index policies for minimizing (frequentist) regret i...
research
10/29/2021

Variational Bayesian Optimistic Sampling

We consider online sequential decision problems where an agent must bala...
research
10/21/2021

Can Q-learning solve Multi Armed Bantids?

When a reinforcement learning (RL) method has to decide between several ...
