Nonparametric Gaussian mixture models for the multi-armed contextual bandit

08/08/2018
by   Iñigo Urteaga, et al.

The multi-armed bandit is a sequential allocation task in which an agent must learn a policy that maximizes long-term payoff, while observing only the reward of the played arm at each iteration. In the stochastic setting, the reward for each action is generated from an unknown distribution, which depends on a given 'context' available at each interaction with the world. Thompson sampling is a generative, interpretable multi-armed bandit algorithm that has been shown both to perform well in practice and to enjoy optimality properties for certain reward functions. Nevertheless, Thompson sampling requires sampling from parameter posteriors and computing expected rewards, which is tractable only for a limited choice of distributions. Here, we extend Thompson sampling to more complex scenarios by adopting a very flexible class of reward distributions: nonparametric Gaussian mixture models. The generative process of Bayesian nonparametric mixtures naturally aligns with the Bayesian modeling of multi-armed bandits. This allows for the implementation of an efficient and flexible Thompson sampling algorithm: the nonparametric model autonomously determines its complexity in an online fashion, as it observes new rewards for the played arms. We show how the proposed method sequentially learns the nonparametric mixture model that best approximates the true underlying reward distribution. Our contribution is valuable for practical scenarios, as it avoids stringent model specifications and yet attains reduced regret.
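To make the Thompson sampling loop described above concrete, below is a minimal sketch (not the paper's code) of posterior sampling for a stochastic bandit. It assumes a deliberately simplified reward model, a single Gaussian per arm with known variance and a conjugate Normal prior on its mean, whereas the paper places a nonparametric Gaussian mixture on each arm's rewards. The loop structure is the same in both cases: draw reward-model parameters from each arm's posterior, play the arm whose sampled expected reward is largest, and update the played arm's posterior with the observed reward. All variable names and the simulated environment are illustrative assumptions.

```python
import numpy as np

# Minimal Thompson sampling sketch (illustrative, not the paper's implementation).
# Simplification: each arm's reward ~ N(mean, obs_var) with a conjugate Normal
# prior on the mean; the paper instead uses a nonparametric Gaussian mixture
# per arm, updated online as rewards are observed.

rng = np.random.default_rng(0)

n_arms, n_rounds = 3, 2000
true_means = np.array([0.0, 0.5, 1.0])   # unknown to the agent
obs_var = 1.0                             # assumed known observation variance

# Normal prior N(mu0, tau0^2) on each arm's mean reward
mu0, tau0_sq = 0.0, 10.0
post_mean = np.full(n_arms, mu0)
post_var = np.full(n_arms, tau0_sq)

cumulative_regret = 0.0
for t in range(n_rounds):
    # (1) posterior sampling: one draw of each arm's mean reward
    sampled_means = rng.normal(post_mean, np.sqrt(post_var))
    # (2) play the arm that looks best under the sampled parameters
    arm = int(np.argmax(sampled_means))
    reward = rng.normal(true_means[arm], np.sqrt(obs_var))
    # (3) conjugate Normal update of the played arm's posterior
    precision = 1.0 / post_var[arm] + 1.0 / obs_var
    post_mean[arm] = (post_mean[arm] / post_var[arm] + reward / obs_var) / precision
    post_var[arm] = 1.0 / precision
    cumulative_regret += true_means.max() - true_means[arm]

print(f"cumulative regret after {n_rounds} rounds: {cumulative_regret:.1f}")
```

In the paper's setting, step (1) would instead draw the mixture parameters (component assignments, weights, means, and variances) of each arm's nonparametric Gaussian mixture posterior, and step (3) would update that mixture online, letting the number of components grow with the observed rewards.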
