
(Sequential) Importance Sampling Bandits

by Iñigo Urteaga et al.
Columbia University

The multi-armed bandit (MAB) problem is a sequential allocation task in which the goal is to learn a policy that maximizes long-term payoff while only the reward of the executed action is observed; i.e., optimal decisions are made sequentially while simultaneously learning how the world operates. In the stochastic setting, the reward for each action is generated from an unknown distribution. To decide the next optimal action, one must compute sufficient statistics of this unknown reward distribution, e.g., upper confidence bounds (UCB) or the expectations used in Thompson sampling. Closed-form expressions for these statistics of interest are analytically intractable except in simple cases. We propose to leverage Monte Carlo estimation and, in particular, the flexibility of (sequential) importance sampling (IS) for accurate estimation of the statistics of interest within the MAB problem. IS methods estimate posterior densities or expectations in probabilistic models that are analytically intractable. We first show how IS can be combined with state-of-the-art MAB algorithms (Thompson sampling and Bayes-UCB) for classic (Bernoulli and contextual linear-Gaussian) bandit problems. We then leverage the power of sequential IS to extend the applicability of these algorithms beyond the classic settings and tackle additional useful cases: specifically, the dynamic linear-Gaussian bandit, as well as the static and dynamic logistic bandits. The flexibility of (sequential) importance sampling proves fundamental for obtaining efficient estimates of the key sufficient statistics in these challenging scenarios.
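To make the idea concrete, here is a minimal sketch (not the paper's implementation) of IS-based Thompson sampling for a Bernoulli bandit: each arm's posterior draw is approximated by weighting proposal samples from the uniform prior with the Bernoulli likelihood of the observed counts, then resampling one particle (sampling-importance-resampling). The function names, particle count, horizon, and arm means are illustrative assumptions.

```python
# A minimal sketch of importance-sampling-based Thompson sampling for a
# Bernoulli bandit, assuming a Uniform(0,1) prior as the proposal and
# sampling-importance-resampling (SIR) for approximate posterior draws.
# All constants below are illustrative, not values from the paper.
import numpy as np

rng = np.random.default_rng(0)

def is_posterior_draw(successes, failures, n_particles=1000):
    """Approximate one posterior draw of an arm's success probability."""
    # Proposal draws from the Uniform(0,1) prior, clipped for log-safety.
    theta = np.clip(rng.uniform(0.0, 1.0, n_particles), 1e-12, 1 - 1e-12)
    # Importance weights: Bernoulli likelihood of the observed counts.
    log_w = successes * np.log(theta) + failures * np.log1p(-theta)
    log_w -= log_w.max()                 # numerical stability
    w = np.exp(log_w)
    w /= w.sum()                         # normalize the weights
    return rng.choice(theta, p=w)        # SIR: resample one particle

def thompson_is(true_means, horizon=2000):
    """Run IS-based Thompson sampling against fixed Bernoulli arms."""
    n_arms = len(true_means)
    succ = np.zeros(n_arms)
    fail = np.zeros(n_arms)
    rewards = []
    for _ in range(horizon):
        # One approximate posterior sample per arm; play the argmax.
        samples = [is_posterior_draw(succ[a], fail[a]) for a in range(n_arms)]
        a = int(np.argmax(samples))
        r = rng.random() < true_means[a]
        succ[a] += r
        fail[a] += 1 - r
        rewards.append(r)
    return np.mean(rewards)

print(thompson_is([0.4, 0.5, 0.7]))  # average reward should approach 0.7
```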
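For the dynamic settings, sequential IS can be sketched as a per-arm bootstrap particle filter: particles are propagated through an assumed transition model, reweighted by the reward likelihood when the arm is played, and resampled to combat weight degeneracy; Thompson sampling then draws one particle per arm. The Gaussian random-walk model, noise levels, and class names below are assumptions for illustration, not the paper's specification.

```python
# A compact sketch of sequential importance sampling (a bootstrap particle
# filter) for Thompson sampling in a dynamic bandit, where each arm's mean
# reward is assumed to follow a Gaussian random walk.
import numpy as np

rng = np.random.default_rng(1)

class ParticleArm:
    """Per-arm particle filter tracking a drifting mean reward."""

    def __init__(self, n_particles=500, sigma_process=0.03, sigma_obs=0.5):
        self.theta = rng.normal(0.0, 1.0, n_particles)    # particles from the prior
        self.w = np.full(n_particles, 1.0 / n_particles)  # importance weights
        self.sp, self.so = sigma_process, sigma_obs

    def propagate(self):
        # Transition step: the latent mean drifts every round, played or not.
        self.theta += rng.normal(0.0, self.sp, self.theta.size)

    def update(self, reward):
        # Reweight by the Gaussian reward likelihood, then resample
        # (sequential importance resampling) to fight weight degeneracy.
        self.w *= np.exp(-0.5 * ((reward - self.theta) / self.so) ** 2)
        self.w /= self.w.sum()
        idx = rng.choice(self.theta.size, size=self.theta.size, p=self.w)
        self.theta = self.theta[idx]
        self.w.fill(1.0 / self.theta.size)

    def posterior_sample(self):
        # Thompson sampling draw: one particle according to its weight.
        return rng.choice(self.theta, p=self.w)

def run(true_paths, sigma_obs=0.5):
    """true_paths: (horizon, n_arms) array of drifting true mean rewards."""
    horizon, n_arms = true_paths.shape
    arms = [ParticleArm(sigma_obs=sigma_obs) for _ in range(n_arms)]
    total = 0.0
    for t in range(horizon):
        for arm in arms:
            arm.propagate()
        a = int(np.argmax([arm.posterior_sample() for arm in arms]))
        reward = true_paths[t, a] + rng.normal(0.0, sigma_obs)
        arms[a].update(reward)
        total += reward
    return total / horizon

# Two arms whose true means drift and eventually cross.
T = 1000
paths = np.stack([np.linspace(1.0, 0.0, T), np.linspace(0.0, 1.0, T)], axis=1)
print(run(paths))
```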




Related Research

Nonparametric Gaussian mixture models for the multi-armed contextual bandit

The multi-armed bandit is a sequential allocation task where an agent mu...

Contextual Bandits with Stochastic Experts

We consider the problem of contextual bandits with stochastic experts, w...

An empirical evaluation of active inference in multi-armed bandits

A key feature of sequential decision making under uncertainty is a need ...

Variational inference for the multi-armed contextual bandit

In many biomedical, science, and engineering problems, one must sequenti...

Debiasing Samples from Online Learning Using Bootstrap

It has been recently shown in the literature that the sample averages fr...

Balanced Off-Policy Evaluation in General Action Spaces

In many practical applications of contextual bandits, online learning is...

GBOSE: Generalized Bandit Orthogonalized Semiparametric Estimation

In sequential decision-making scenarios, i.e., mobile health recommendati...

Code Repositories


Public repository for the work on bandit problems
