Learning from eXtreme Bandit Feedback

09/27/2020
by Romain Lopez, et al.

We study the problem of batch learning from bandit feedback in the setting of extremely large action spaces. Learning from extreme bandit feedback is ubiquitous in recommendation systems, in which billions of decisions are made over sets consisting of millions of choices in a single day, yielding massive observational data. In these large-scale real-world applications, supervised learning frameworks such as eXtreme Multi-label Classification (XMC) are widely used despite the fact that they incur significant biases due to the mismatch between bandit feedback and supervised labels. Such biases can be mitigated by importance sampling techniques, but these techniques suffer from impractical variance when dealing with a large number of actions. In this paper, we introduce a selective importance sampling estimator (sIS) that operates in a significantly more favorable bias-variance regime. The sIS estimator is obtained by performing importance sampling on the conditional expectation of the reward with respect to a small subset of actions for each instance (a form of Rao-Blackwellization). We employ this estimator in a novel algorithmic procedure—named Policy Optimization for eXtreme Models (POXM)—for learning from bandit feedback on XMC tasks. In POXM, the selected actions for the sIS estimator are the top-p actions of the logging policy, where p is adjusted based on the data and is significantly smaller than the size of the action space. We use a supervised-to-bandit conversion on three XMC datasets to benchmark our POXM method against three competing methods: BanditNet, a previously applied partial matching pruning strategy, and a supervised learning baseline. Whereas BanditNet sometimes improves marginally over the logging policy, our experiments show that POXM systematically and significantly improves over all baselines.
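The flavor of the sIS idea can be conveyed with a small numerical sketch. The code below is a simplified illustration, not the paper's exact Rao-Blackwellized estimator: it assumes full probability tables for the logging and target policies are available, and it simply restricts a vanilla importance-sampling value estimate to logged actions that fall inside the top-p set of the logging policy. The function name `sis_value_sketch` and all variable names are hypothetical.

```python
import numpy as np


def sis_value_sketch(logging_probs, target_probs, actions, rewards, p):
    """Simplified selective importance sampling (sIS) style value estimate.

    logging_probs: (n, K) action probabilities under the logging policy pi_0
    target_probs:  (n, K) action probabilities under the policy being evaluated
    actions:       (n,)   logged action index for each instance
    rewards:       (n,)   observed bandit reward for each logged action
    p:             number of top logging-policy actions kept per instance
    """
    n, _ = logging_probs.shape

    # Top-p actions per instance under the logging policy.
    top_p = np.argsort(-logging_probs, axis=1)[:, :p]            # (n, p)
    in_top_p = (top_p == actions[:, None]).any(axis=1)           # (n,)

    # Standard importance weights for the logged actions.
    idx = np.arange(n)
    w = target_probs[idx, actions] / logging_probs[idx, actions]

    # Selective IS: only logged actions inside the top-p set contribute;
    # samples outside the set are zeroed out instead of importance-weighted.
    contrib = np.where(in_top_p, w * rewards, 0.0)
    return contrib.mean()
```

Restricting the weights to a small per-instance action subset is what keeps the variance manageable when the action space has millions of entries, at the cost of a bias that the paper analyzes for its actual estimator.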
