Learning from Bandit Feedback: An Overview of the State-of-the-art

09/18/2019
by Olivier Jeunen, et al.

In machine learning we often try to optimise a decision rule that would have worked well over a historical dataset; this is the so-called empirical risk minimisation principle. In the context of learning from recommender system logs, applying this principle becomes problematic because we do not observe the reward of decisions we did not take. To handle this "bandit-feedback" setting, several Counterfactual Risk Minimisation (CRM) methods have been proposed in recent years; they attempt to estimate the performance of different policies on historical data. Through importance sampling and various variance reduction techniques, these methods allow more robust learning and inference than classical approaches. It is difficult to accurately estimate the performance of policies that frequently take actions that were rarely taken in the past, and a number of different types of estimators have been proposed to address this. In this paper, we review several methods, based on different off-policy estimators, for learning from bandit feedback. We discuss key differences and commonalities among existing approaches, and compare their empirical performance on the RecoGym simulation environment. To the best of our knowledge, this work is the first comparison study for bandit algorithms in a recommender system setting.
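To make the importance-sampling idea concrete, here is a minimal sketch of two standard off-policy estimators: inverse propensity scoring (IPS) and its self-normalised variant (SNIPS), a common variance reduction technique. It is an illustration under stated assumptions, not the paper's implementation and not RecoGym code: the log format `(context, action, reward, logging_prob)`, the function names, and the two-action toy simulation are all hypothetical.

```python
import random


def ips_estimate(logs, target_policy):
    """IPS estimate of the target policy's expected reward.

    logs: list of (context, action, reward, logging_prob) tuples, where
          logging_prob is the probability the logging policy gave the action.
    target_policy: function mapping a context to a list of action probabilities.
    (Both are assumed formats for this sketch.)
    """
    total = 0.0
    for context, action, reward, logging_prob in logs:
        # Importance weight: how much more (or less) likely the target
        # policy is to take the logged action than the logging policy was.
        weight = target_policy(context)[action] / logging_prob
        total += weight * reward
    return total / len(logs)


def snips_estimate(logs, target_policy):
    """Self-normalised IPS: divide by the sum of weights instead of n."""
    weighted_reward = 0.0
    weight_sum = 0.0
    for context, action, reward, logging_prob in logs:
        weight = target_policy(context)[action] / logging_prob
        weighted_reward += weight * reward
        weight_sum += weight
    return weighted_reward / weight_sum


def simulate_logs(n, seed=0):
    """Toy context-free log: uniform logging policy over two actions,
    Bernoulli rewards with assumed means 0.2 (action 0) and 0.8 (action 1)."""
    rng = random.Random(seed)
    reward_means = [0.2, 0.8]
    logs = []
    for _ in range(n):
        action = rng.randrange(2)  # uniform logging policy
        reward = 1.0 if rng.random() < reward_means[action] else 0.0
        logs.append((None, action, reward, 0.5))
    return logs


def target_policy(context):
    # Hypothetical target policy that strongly prefers action 1.
    return [0.1, 0.9]


if __name__ == "__main__":
    logs = simulate_logs(20000)
    # True value of the target policy here is 0.1 * 0.2 + 0.9 * 0.8 = 0.74.
    print("IPS:  ", ips_estimate(logs, target_policy))
    print("SNIPS:", snips_estimate(logs, target_policy))
```

In this toy setting both estimates should land near the true value of 0.74. The weights grow large when the target policy favours actions the logging policy rarely took, which is the source of the estimation difficulty the abstract describes.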


