Differentiable Bandit Exploration

02/17/2020
by Craig Boutilier, et al.

We learn bandit policies that maximize the average reward over bandit instances drawn from an unknown distribution P, using only a sample of instances from P. Our approach is an instance of meta-learning, and its appeal is that the properties of P can be exploited without restricting it. We parameterize our policies in a differentiable way and optimize them by policy gradients, an approach that is easy to implement and pleasantly general. The challenge is then to design effective gradient estimators and good policy classes. To make policy gradients practical, we introduce novel variance reduction techniques. We experiment with various bandit policy classes, including neural networks and a novel soft-elimination policy. The latter has regret guarantees and is a natural starting point for our optimization. Our experiments highlight the versatility of our approach. We also observe that neural network policies can learn implicit biases that are expressed only through the bandit instances sampled during training.
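
The idea of optimizing a differentiable bandit policy by policy gradients can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the policy is a softmax over empirical arm means with a single scalar parameter theta, the instances are Bernoulli bandits, and the variance reduction is a simple leave-one-out baseline over the batch of sampled instances, which stands in for the paper's own techniques. The names softmax_probs, run_episode, and gradient_estimate are hypothetical.

```python
import math
import random

def softmax_probs(theta, means):
    # Assumed policy class: p(i) proportional to exp(theta * empirical mean of arm i).
    m = max(theta * v for v in means)          # subtract the max for numerical stability
    w = [math.exp(theta * v - m) for v in means]
    z = sum(w)
    return [x / z for x in w]

def run_episode(theta, arm_means, horizon, rng):
    """Play one Bernoulli bandit instance.

    Returns (total reward, sum over rounds of d/dtheta log p(chosen arm)).
    """
    k = len(arm_means)
    pulls = [0] * k
    sums = [0.0] * k
    total, score = 0.0, 0.0
    for _ in range(horizon):
        # Empirical means; unpulled arms default to 0.5 (an arbitrary choice).
        est = [sums[i] / pulls[i] if pulls[i] else 0.5 for i in range(k)]
        p = softmax_probs(theta, est)
        a = rng.choices(range(k), weights=p)[0]
        # Score function of the softmax: est[a] - sum_j p[j] * est[j].
        score += est[a] - sum(pj * ej for pj, ej in zip(p, est))
        r = 1.0 if rng.random() < arm_means[a] else 0.0
        pulls[a] += 1
        sums[a] += r
        total += r
    return total, score

def gradient_estimate(theta, instances, horizon, rng):
    """REINFORCE estimate of d/dtheta of the average total reward.

    Uses a leave-one-out baseline (needs at least two instances).
    """
    results = [run_episode(theta, inst, horizon, rng) for inst in instances]
    n = len(results)
    reward_sum = sum(t for t, _ in results)
    grad = 0.0
    for t, s in results:
        baseline = (reward_sum - t) / (n - 1)  # mean reward of the other episodes
        grad += (t - baseline) * s
    return grad / n

if __name__ == "__main__":
    rng = random.Random(0)
    theta = 1.0
    for _ in range(50):
        # Assumed instance distribution P: 3 arms with means uniform on [0, 1].
        batch = [[rng.random() for _ in range(3)] for _ in range(32)]
        theta += 0.05 * gradient_estimate(theta, batch, horizon=50, rng=rng)
    print("learned theta:", theta)
```

The baseline for each episode is computed from the other episodes only, so it is independent of that episode's actions and the estimator stays unbiased while its variance is reduced relative to using the raw total reward.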

research
06/09/2020

Differentiable Meta-Learning in Contextual Bandits

We study a contextual bandit setting where the learning agent has access...
research
01/23/2019

Meta-Learning for Contextual Bandit Exploration

We describe MELEE, a meta-learning algorithm for learning a good explora...
research
12/29/2018

Meta Reinforcement Learning with Distribution of Exploration Parameters Learned by Evolution Strategies

In this paper, we propose a novel meta-learning method in a reinforcemen...
research
10/23/2020

A Practical Guide of Off-Policy Evaluation for Bandit Problems

Off-policy evaluation (OPE) is the problem of estimating the value of a ...
research
07/20/2020

A Short Note on Soft-max and Policy Gradients in Bandits Problems

This is a short communication on a Lyapunov function argument for softma...
research
03/05/2020

Generalized Policy Elimination: an efficient algorithm for Nonparametric Contextual Bandits

We propose the Generalized Policy Elimination (GPE) algorithm, an oracle...
research
06/04/2022

Interpolating Between Softmax Policy Gradient and Neural Replicator Dynamics with Capped Implicit Exploration

Neural replicator dynamics (NeuRD) is an alternative to the foundational...
