Off-Policy Evaluation and Learning from Logged Bandit Feedback: Error Reduction via Surrogate Policy

08/01/2018
by Yuan Xie, et al.

When learning from a batch of logged bandit feedback, the discrepancy between the policy to be learned and the off-policy training data imposes statistical and computational challenges. Unlike classical supervised learning and online learning settings, in batch contextual bandit learning one only has access to a collection of logged feedback from the actions taken by a historical policy, and must learn a policy that takes good actions in possibly unseen contexts. Such a batch learning setting is ubiquitous in online and interactive systems, such as ad platforms and recommendation systems. Existing approaches based on inverse propensity weights, such as Inverse Propensity Scoring (IPS) and the Policy Optimizer for Exponential Models (POEM), enjoy unbiasedness but often suffer from large mean squared error. In this work, we introduce a new approach named Maximum Likelihood Inverse Propensity Scoring (MLIPS) for batch learning from logged bandit feedback. Instead of using the given historical policy as the proposal in the inverse propensity weights, we estimate a maximum-likelihood surrogate policy from the logged action-context pairs and use this surrogate policy as the proposal. We prove that MLIPS is asymptotically unbiased and, moreover, has a smaller nonasymptotic mean squared error than IPS. Such an error reduction is somewhat surprising, since the estimated surrogate policy is less accurate than the given historical policy. Experiments on multi-label classification problems and a large-scale ad placement dataset demonstrate the empirical effectiveness of MLIPS. Furthermore, the proposed surrogate policy technique is complementary to existing error reduction techniques and, when combined with them, consistently boosts the performance of several widely used approaches.
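To make the contrast between the two estimators concrete, below is a minimal sketch (not the authors' implementation) of plain IPS versus the MLIPS idea described above: fit a maximum-likelihood surrogate of the logging policy on the logged (context, action) pairs, then plug the surrogate's propensities into the inverse propensity weights. The function names, the logistic-regression surrogate class, and the `pi_target` interface are all illustrative assumptions.

```python
# Minimal sketch of plain IPS vs. the MLIPS idea, assuming logged data
# (X, A, R): contexts, integer actions in {0, ..., K-1}, and rewards.
# Not the authors' implementation; names and the logistic-regression
# surrogate are illustrative choices.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ips_estimate(pi_target, X, A, R, logging_propensities):
    """Plain IPS: weight each reward by pi(a_i | x_i) / pi_0(a_i | x_i),
    using the *given* historical propensities pi_0(a_i | x_i)."""
    n = len(A)
    target_probs = pi_target(X)[np.arange(n), A]  # pi(a_i | x_i)
    return np.mean(R * target_probs / logging_propensities)

def mlips_estimate(pi_target, X, A, R):
    """MLIPS sketch: fit a maximum-likelihood surrogate of the logging
    policy from the logged (context, action) pairs, then use the
    surrogate's propensities as the proposal in the IPS weights."""
    n = len(A)
    # Multinomial logistic regression as the (assumed) surrogate policy
    # class; assumes every action in {0, ..., K-1} appears in the log.
    surrogate = LogisticRegression(max_iter=1000).fit(X, A)
    surrogate_propensities = surrogate.predict_proba(X)[np.arange(n), A]
    target_probs = pi_target(X)[np.arange(n), A]
    return np.mean(R * target_probs / surrogate_propensities)
```

Here `pi_target` is assumed to map a batch of contexts to an (n, K) matrix of action probabilities. Note that `mlips_estimate` never touches the true logging propensities; the paper's claim is that plugging in the estimated surrogate nonetheless yields a smaller mean squared error than using the exact propensities in `ips_estimate`.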


Related research:

02/09/2015  Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
We develop a learning principle and an efficient algorithm for batch lea...

09/27/2020  Learning from eXtreme Bandit Feedback
We study the problem of batch learning from bandit feedback in the setti...

06/12/2017  Data-Efficient Policy Evaluation Through Behavior Policy Search
We consider the task of evaluating a policy for a Markov decision proces...

09/18/2019  Learning from Bandit Feedback: An Overview of the State-of-the-art
In machine learning we often try to optimise a decision rule that would ...

04/04/2016  Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning
In this paper we present a new way of predicting the performance of a re...

04/24/2019  Three Methods for Training on Bandit Feedback
There are three quite distinct ways to train a machine learning model on...

12/05/2017  Approaching the Ad Placement Problem with Online Linear Classification: The winning solution to the NIPS'17 Ad Placement Challenge
The task of computational advertising is to select the most suitable adv...
