Off-policy Bandits with Deficient Support

06/16/2020
by Noveen Sachdeva, et al.

Learning effective contextual-bandit policies from past actions of a deployed system is highly desirable in many settings (e.g., voice assistants, recommendation, search), since it enables the reuse of large amounts of log data. State-of-the-art methods for such off-policy learning, however, are based on inverse propensity score (IPS) weighting. A key theoretical requirement of IPS weighting is that the policy that logged the data has "full support", which typically translates into requiring non-zero probability for any action in any context. Unfortunately, many real-world systems produce support-deficient data, especially when the action space is large, and we show how existing methods can fail catastrophically. To overcome this gap between theory and applications, we identify three approaches that provide various guarantees for IPS-based learning despite the inherent limitations of support-deficient data: restricting the action space, reward extrapolation, and restricting the policy space. We systematically analyze the statistical and computational properties of these three approaches, and we empirically evaluate their effectiveness. In addition to providing the first systematic analysis of support deficiency in contextual-bandit learning, we conclude with recommendations that provide practical guidance.
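
To make the full-support requirement concrete, the following is a minimal illustrative sketch (not code from the paper) of the standard IPS estimator on synthetic logged bandit data; all names and numbers (logging_probs, target_probs, the reward values) are hypothetical. It shows how IPS silently ignores the probability mass the target policy places on an action the logging policy never plays, producing a badly biased value estimate rather than merely a noisy one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_logs = 5, 100_000

# True (unknown) expected reward of each action; a single shared context
# keeps the sketch short.
true_reward = np.array([0.1, 0.2, 0.3, 0.4, 0.9])

# Logging policy with deficient support: action 4 is never played.
logging_probs = np.array([0.25, 0.25, 0.25, 0.25, 0.0])

# Target policy to evaluate: it mostly plays the unsupported action 4.
target_probs = np.array([0.05, 0.05, 0.05, 0.05, 0.8])

# Simulate the logged data under the logging policy.
actions = rng.choice(n_actions, size=n_logs, p=logging_probs)
rewards = rng.binomial(1, true_reward[actions])

# Standard IPS estimate: V_hat = mean( pi_target(a_i) / pi_logging(a_i) * r_i )
weights = target_probs[actions] / logging_probs[actions]
ips_estimate = np.mean(weights * rewards)

true_value = float(np.dot(target_probs, true_reward))
print(f"true value of target policy: {true_value:.3f}")    # ~0.77
print(f"IPS estimate:                {ips_estimate:.3f}")   # ~0.05, badly biased

# Since pi_logging(action 4) = 0, action 4 never appears in the log, and IPS
# silently drops the 0.8 probability mass the target policy places on it.
# The estimate is biased (not merely noisy), and more data does not fix it.
```

The three approaches named in the abstract correspond to different ways of handling that unsupported action: dropping it from consideration (restricting the action space), imputing its reward with a model (reward extrapolation), or only learning policies that stay on the logging policy's support (restricting the policy space).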

Related research

Combining Online Learning and Offline Learning for Contextual Bandits with Deficient Support (07/24/2021)
We address policy learning with logged data in contextual bandits. Curre...

Off-Policy Evaluation for Large Action Spaces via Embeddings (02/13/2022)
Off-policy evaluation (OPE) in contextual bandits has seen rapid adoptio...

Conformal Off-Policy Prediction in Contextual Bandits (06/09/2022)
Most off-policy evaluation methods for contextual bandits have focused o...

Imitation-Regularized Offline Learning (01/15/2019)
We study the problem of offline learning in automated decision systems u...

Corrupted Contextual Bandits with Action Order Constraints (11/16/2020)
We consider a variant of the novel contextual bandit problem with corrup...

Off-Policy Evaluation in Embedded Spaces (03/05/2022)
Off-policy evaluation methods are important in recommendation systems an...

Practical Bandits: An Industry Perspective (02/02/2023)
The bandit paradigm provides a unified modeling framework for problems t...
