
Nonparametric Gaussian mixture models for the multi-armed contextual bandit

by Iñigo Urteaga, et al.
Columbia University

The multi-armed bandit is a sequential allocation task in which an agent must learn a policy that maximizes long-term payoff, while observing only the reward of the played arm at each iteration. In the stochastic contextual setting, the reward for each action is generated from an unknown distribution that depends on a 'context' available at each interaction with the world. Thompson sampling is a generative, interpretable multi-armed bandit algorithm that has been shown both to perform well in practice and to enjoy optimality properties for certain reward functions. Nevertheless, Thompson sampling requires sampling from parameter posteriors and computing expected rewards, which is tractable for only a limited family of distributions. Here we extend Thompson sampling to more complex scenarios by adopting a very flexible family of reward distributions: nonparametric Gaussian mixture models. The generative process of Bayesian nonparametric mixtures aligns naturally with the Bayesian modeling of multi-armed bandits. This allows for an efficient and flexible Thompson sampling algorithm: the nonparametric model autonomously determines its complexity in an online fashion, as it observes new rewards for the played arms. We show how the proposed method sequentially learns the nonparametric mixture model that best approximates the true underlying reward distribution. Our contribution is valuable for practical scenarios, as it avoids stringent model specifications yet attains reduced regret.
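To make the Thompson sampling loop described above concrete, the following is a minimal, context-free sketch with a conjugate Gaussian reward model per arm (known noise variance, Normal prior on each arm's mean) — a deliberate simplification of the paper's nonparametric Gaussian mixture posteriors. The bandit instance (three arms, the chosen true means, prior, and horizon) is entirely hypothetical and only illustrates the sample-play-update cycle.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-armed Gaussian bandit; the true means are unknown to the agent.
true_means = np.array([0.0, 0.5, 1.0])
n_arms, n_rounds, noise_sd = 3, 2000, 1.0

# Conjugate Normal prior N(0, 1) on each arm's mean reward.
mu = np.zeros(n_arms)      # posterior means
prec = np.ones(n_arms)     # posterior precisions (1 / variance)
counts = np.zeros(n_arms)  # how often each arm was played

for t in range(n_rounds):
    # Thompson sampling: draw one posterior sample per arm, play the argmax.
    theta = rng.normal(mu, 1.0 / np.sqrt(prec))
    a = int(np.argmax(theta))
    r = rng.normal(true_means[a], noise_sd)
    # Conjugate update for a Gaussian likelihood with known variance noise_sd^2.
    new_prec = prec[a] + 1.0 / noise_sd**2
    mu[a] = (prec[a] * mu[a] + r / noise_sd**2) / new_prec
    prec[a] = new_prec
    counts[a] += 1

best = int(np.argmax(counts))
```

As the posteriors concentrate, exploration decays and the best arm (index 2 here) dominates the play counts. The paper's contribution is to replace the per-arm conjugate Gaussian above with a nonparametric Gaussian mixture whose number of components grows as needed, so that the same sample-play-update cycle applies to far more complex reward distributions.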

