Safe Exploration for Optimizing Contextual Bandits

02/02/2020
by Rolf Jagerman, et al.

Contextual bandit problems are a natural fit for many information retrieval tasks, such as learning to rank, text classification, and recommendation. However, existing learning methods for contextual bandit problems have one of two drawbacks: either they do not explore the space of all possible document rankings (i.e., actions) and may therefore miss the optimal ranking, or they present suboptimal rankings to users and thereby harm the user experience. We introduce a new learning method for contextual bandit problems, the Safe Exploration Algorithm (SEA), which overcomes both drawbacks. SEA starts by executing a baseline (or production) ranking system (i.e., policy), which is safe because it does not harm the user experience, but which performs suboptimally and therefore needs to be improved. SEA then uses counterfactual learning to learn a new policy from the logged behavior of the baseline policy, and high-confidence off-policy evaluation to estimate the performance of the newly learned policy. Once the newly learned policy performs at least as well as the baseline policy, SEA starts using it to execute new actions, allowing it to actively explore favorable regions of the action space. In this way, SEA never performs worse than the baseline policy, so it does not harm the user experience, while still exploring the action space and thus remaining able to find an optimal policy. Our experiments on text classification and document retrieval confirm this by comparing SEA (and a boundless variant called BSEA) to online and offline learning methods for contextual bandit problems.
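
To make SEA's deployment decision concrete, here is a minimal Python sketch of the safety gate described above. It is an illustration under assumed simplifications, not the paper's implementation: the clipped IPS estimator, the Hoeffding-style bound, and the names clipped_ips, hoeffding_lower_bound, and should_deploy are all hypothetical choices, and the paper's high-confidence off-policy evaluation may use a different estimator and concentration bound.

import numpy as np

def clipped_ips(logs, new_policy_probs, clip=10.0):
    # Per-sample clipped inverse-propensity-scoring (IPS) estimates of the
    # new policy's reward, computed from logs of (context, action,
    # logging propensity, reward) tuples gathered while the baseline acted.
    terms = []
    for x, a, propensity, r in logs:
        w = min(new_policy_probs(x)[a] / propensity, clip)
        terms.append(w * r)
    return np.asarray(terms)

def hoeffding_lower_bound(samples, delta=0.05, upper=10.0):
    # (1 - delta) lower confidence bound on the mean of samples bounded in
    # [0, upper]; a simple stand-in for high-confidence off-policy evaluation.
    n = len(samples)
    return samples.mean() - upper * np.sqrt(np.log(1.0 / delta) / (2.0 * n))

def should_deploy(logs, new_policy_probs, delta=0.05):
    # SEA-style safety gate: switch to the learned policy only once its
    # high-confidence lower bound reaches the baseline's estimated value.
    baseline_value = np.mean([r for (_, _, _, r) in logs])  # on-policy mean
    lcb = hoeffding_lower_bound(clipped_ips(logs, new_policy_probs), delta=delta)
    return lcb >= baseline_value

In a full SEA loop, the baseline would keep acting and logging, the new policy would be periodically retrained on the logs via counterfactual learning, and should_deploy would be re-checked; once it returns True, the learned policy takes over and begins exploring on its own.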

Related research

Meta-Learning for Contextual Bandit Exploration (01/23/2019)
We describe MELEE, a meta-learning algorithm for learning a good explora...

Combining Online Learning and Offline Learning for Contextual Bandits with Deficient Support (07/24/2021)
We address policy learning with logged data in contextual bandits. Curre...

Improved Algorithms for Conservative Exploration in Bandits (02/08/2020)
In many fields such as digital marketing, healthcare, finance, and robot...

A Unified Framework of Policy Learning for Contextual Bandit with Confounding Bias and Missing Observations (03/20/2023)
We study the offline contextual bandit problem, where we aim to acquire ...

BanditRank: Learning to Rank Using Contextual Bandits (10/23/2019)
We propose an extensible deep learning method that uses reinforcement le...

Learning from Logged Implicit Exploration Data (02/27/2010)
We provide a sound and consistent foundation for the use of nonrandom ex...

Robust Generalization and Safe Query-Specialization in Counterfactual Learning to Rank (02/11/2021)
Existing work in counterfactual Learning to Rank (LTR) has focussed on o...
