Variational inference for the multi-armed contextual bandit

09/10/2017
by Iñigo Urteaga, et al.

In many problems in biomedicine, science, and engineering, one must sequentially decide which action to take next so as to maximize rewards. Reinforcement learning is an area of machine learning that studies how this maximization balances exploration and exploitation, optimizing interactions with the world while simultaneously learning how the world operates. One general class of algorithms for this type of learning is the multi-armed bandit setting and, in particular, the contextual bandit case, in which observed rewards depend on each action as well as on given information or 'context' available at each interaction with the world. The Thompson sampling algorithm has recently been shown to perform well in real-world settings and to enjoy provable optimality properties for this set of problems. It facilitates generative and interpretable modeling of the problem at hand, though the complexity of the model limits its application, since one must both sample from the modeled distributions and compute their expected rewards. We here show how these limitations can be overcome using variational approximations, bringing to the reinforcement learning setting advances in approximate inference developed by the machine learning community over the past two decades. We consider bandit applications where the true reward distribution is unknown and approximate it with a mixture model, whose parameters are inferred via variational inference.
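
The paper's contribution is the variational mixture-model approximation of the unknown reward distribution; the Thompson sampling loop it builds on can be illustrated with a much simpler sketch. The following uses a conjugate Bayesian linear model per arm (exact Gaussian posteriors with unit noise variance) in place of the variational posterior; function names and the simulation setup are illustrative, not from the paper.

```python
import numpy as np

def thompson_sampling_contextual(contexts, reward_fn, n_arms, d, n_rounds, seed=0):
    """Thompson sampling with a Bayesian linear reward model per arm.

    Each arm a keeps a Gaussian posterior over weights theta_a.
    Each round: sample theta_a from every arm's posterior, play the arm
    whose sampled expected reward x @ theta_a is largest, observe the
    reward, and update that arm's posterior (conjugate update, unit
    noise variance, standard-normal prior).
    """
    rng = np.random.default_rng(seed)
    # Posterior parameters per arm: precision matrix B and vector f,
    # so that the posterior mean solves B @ mean = f.
    B = [np.eye(d) for _ in range(n_arms)]
    f = [np.zeros(d) for _ in range(n_arms)]
    total_reward = 0.0
    pulls = np.zeros(n_arms, dtype=int)
    for t in range(n_rounds):
        x = contexts[t]
        sampled = []
        for a in range(n_arms):
            cov = np.linalg.inv(B[a])
            mean = cov @ f[a]
            theta = rng.multivariate_normal(mean, cov)  # posterior draw
            sampled.append(x @ theta)
        a = int(np.argmax(sampled))   # act greedily on the sampled model
        r = reward_fn(t, a, x)
        total_reward += r
        pulls[a] += 1
        B[a] += np.outer(x, x)        # posterior update for the played arm
        f[a] += r * x
    return total_reward, pulls
```

In the paper's setting, the exact per-arm posterior above is unavailable because the reward distribution is unknown; it is replaced by a variational mixture whose parameters are fit at each step, while the sample-then-act loop is unchanged.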

