Variational Bayesian Optimistic Sampling

10/29/2021
by   Brendan O'Donoghue, et al.

We consider online sequential decision problems where an agent must balance exploration and exploitation. We derive a set of Bayesian 'optimistic' policies which, in the stochastic multi-armed bandit case, includes the Thompson sampling policy. We provide a new analysis showing that any algorithm producing policies in the optimistic set enjoys Õ(√(AT)) Bayesian regret for a problem with A actions after T rounds. We extend the regret analysis for optimistic policies to bilinear saddle-point problems, which include zero-sum matrix games and constrained bandits as special cases. In this case we show that Thompson sampling can produce policies outside of the optimistic set and suffer linear regret in some instances. Finding a policy inside the optimistic set amounts to solving a convex optimization problem, and we call the resulting algorithm 'variational Bayesian optimistic sampling' (VBOS). The procedure works for any posterior, i.e., it does not require the posterior to have any special properties, such as log-concavity, unimodality, or smoothness. The variational view of the problem has many useful properties, including the ability to tune the exploration-exploitation tradeoff, add regularization, incorporate constraints, and linearly parameterize the policy.
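For readers unfamiliar with the Thompson sampling policy mentioned in the abstract, the following is a minimal sketch of it on a Bernoulli multi-armed bandit with independent Beta(1, 1) priors. It is not the VBOS algorithm and is not taken from the paper. The function name `thompson_sampling`, the arm means, and the horizon are hypothetical choices made for illustration. The quantity it tracks is the regret on one fixed instance, whereas the Bayesian regret in the abstract also averages over instances drawn from the prior.

```python
import random


def thompson_sampling(true_means, horizon, seed=0):
    """Run Beta-Bernoulli Thompson sampling on one fixed bandit instance.

    Illustrative sketch only: `true_means` (hypothetical) holds the success
    probability of each arm, which the agent does not observe directly.
    Returns the cumulative expected regret and the pull count of each arm.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [0] * n_arms  # observed rewards of 1 per arm
    failures = [0] * n_arms   # observed rewards of 0 per arm
    best_mean = max(true_means)
    regret = 0.0

    for _ in range(horizon):
        # Draw one sample per arm from its Beta posterior (Beta(1, 1) prior).
        samples = [
            rng.betavariate(1 + successes[a], 1 + failures[a])
            for a in range(n_arms)
        ]
        # Act greedily with respect to the sampled means.
        arm = max(range(n_arms), key=samples.__getitem__)

        # Observe a Bernoulli reward and update that arm's posterior counts.
        if rng.random() < true_means[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1

        # Accumulate the expected regret of the chosen arm.
        regret += best_mean - true_means[arm]

    pulls = [successes[a] + failures[a] for a in range(n_arms)]
    return regret, pulls


if __name__ == "__main__":
    total_regret, pull_counts = thompson_sampling([0.2, 0.5, 0.8], horizon=2000)
    print("cumulative regret:", round(total_regret, 2))
    print("pulls per arm:", pull_counts)
```

Sampling from the posterior and acting greedily on the sample means each arm is played with the posterior probability that it is optimal, which is the policy the abstract places inside the optimistic set for the stochastic bandit case.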


research
09/16/2022

Thompson Sampling with Virtual Helping Agents

We address the problem of online sequential decision making, i.e., balan...
research
09/10/2017

Bayesian bandits: balancing the exploration-exploitation tradeoff via double sampling

Reinforcement learning studies how to balance exploration and exploitati...
research
06/30/2023

Thompson sampling for improved exploration in GFlowNets

Generative flow networks (GFlowNets) are amortized variational inference...
research
03/27/2017

Thompson Sampling for Linear-Quadratic Control Problems

We consider the exploration-exploitation tradeoff in linear quadratic (L...
research
05/25/2018

Myopic Bayesian Design of Experiments via Posterior Sampling and Probabilistic Programming

We design a new myopic strategy for a wide class of sequential design of...
research
06/04/2021

Fair Exploration via Axiomatic Bargaining

Motivated by the consideration of fairly sharing the cost of exploration...
research
02/12/2019

Thompson Sampling with Information Relaxation Penalties

We consider a finite time horizon multi-armed bandit (MAB) problem in a ...
