
Variational inference for the multi-armed contextual bandit

by   Iñigo Urteaga, et al.
Columbia University

In many biomedical, scientific, and engineering problems, one must sequentially decide which action to take next so as to maximize rewards. Reinforcement learning is an area of machine learning that studies how this maximization balances exploration and exploitation, optimizing interactions with the world while simultaneously learning how the world operates. One general class of algorithms for this type of learning is the multi-armed bandit setting and, in particular, the contextual bandit case, in which observed rewards depend both on the action taken and on side information, or 'context', available at each interaction with the world. The Thompson sampling algorithm has recently been shown to perform well in real-world settings and to enjoy provable optimality properties for this class of problems. It facilitates generative and interpretable modeling of the problem at hand, though the complexity of the model limits its applicability, since one must both sample from the modeled distributions and compute their expected rewards. We here show how these limitations can be overcome using variational approximations, bringing to the reinforcement learning setting variational inference techniques developed in the machine learning community over the past two decades. We consider bandit applications where the true reward distribution is unknown and approximate it with a mixture model, whose parameters are inferred via variational inference.
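To make the Thompson sampling mechanics concrete, the following is a minimal sketch of contextual Thompson sampling using a conjugate Bayesian linear-Gaussian reward model per arm. This is a simplification for illustration only: the paper's method instead approximates an unknown reward distribution with a mixture model whose parameters are fit via variational inference, and all names and settings below (number of arms, context dimension, noise variance) are assumptions of this sketch, not the paper's.

```python
import numpy as np

# Illustrative sketch of Thompson sampling for a contextual bandit.
# Each arm has a conjugate Bayesian linear-Gaussian reward model;
# the paper instead uses a variational mixture-model approximation.

rng = np.random.default_rng(0)
n_arms, d, noise_var, T = 3, 5, 0.1, 2000

# Unknown true reward parameters per arm (used only to simulate rewards).
theta_true = rng.normal(size=(n_arms, d))

# Per-arm posterior sufficient statistics: precision matrix B and vector f,
# so that posterior mean = B^{-1} f and covariance = noise_var * B^{-1}.
B = np.stack([np.eye(d) for _ in range(n_arms)])
f = np.zeros((n_arms, d))

regret = 0.0
for t in range(T):
    x = rng.normal(size=d)  # observed context at time t
    # Thompson sampling: draw one parameter sample from each arm's
    # posterior and play the arm whose sample predicts the highest reward.
    samples = [
        rng.multivariate_normal(
            np.linalg.solve(B[a], f[a]), noise_var * np.linalg.inv(B[a])
        )
        for a in range(n_arms)
    ]
    a = int(np.argmax([s @ x for s in samples]))
    r = theta_true[a] @ x + rng.normal(scale=np.sqrt(noise_var))
    # Conjugate posterior update for the played arm only.
    B[a] += np.outer(x, x)
    f[a] += r * x
    regret += np.max(theta_true @ x) - theta_true[a] @ x

print(f"average regret per step: {regret / T:.3f}")
```

The exploration-exploitation balance emerges automatically: arms with wide posteriors occasionally produce large samples (exploration), while well-estimated good arms are played most of the time (exploitation). Replacing the per-arm Gaussian posterior with a variationally fitted mixture is what allows the paper's approach to handle reward distributions of unknown form.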

