Variance-Optimal Augmentation Logging for Counterfactual Evaluation in Contextual Bandits

02/03/2022
by   Aaron David Tucker, et al.

Methods for offline A/B testing and counterfactual learning are seeing rapid adoption in search and recommender systems, since they allow efficient reuse of existing log data. However, there are fundamental limits to using existing log data alone: the counterfactual estimators commonly used in these methods can have large bias and large variance when the logging policy is very different from the target policy being evaluated. To overcome this limitation, we explore the question of how to design data-gathering policies that most effectively augment an existing dataset of bandit feedback with additional observations, for both learning and evaluation. To this end, the paper introduces Minimum Variance Augmentation Logging (MVAL), a method for constructing logging policies that minimize the variance of the downstream evaluation or learning problem. We explore multiple approaches to computing MVAL policies efficiently, and find that they can be substantially more effective at reducing the variance of an estimator than naïve approaches.
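Since the abstract turns on how the mismatch between logging and target policies inflates estimator variance, a minimal sketch may help. The snippet below is not the paper's MVAL procedure: it simulates the inverse propensity scoring (IPS) estimator on an invented one-context toy problem and compares a uniform logging policy against the classical variance-optimal proposal h*(a) ∝ π(a)·√E[r²|a], the single-batch textbook case that MVAL extends to settings where logged data already exists. All names and numbers here (K, mean_reward, pi) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-context toy problem (invented for illustration, not from
# the paper): K actions with fixed Bernoulli mean rewards, and a target
# policy pi whose value we want to estimate from logged bandit feedback.
K = 5
mean_reward = np.array([0.9, 0.2, 0.1, 0.1, 0.1])
pi = np.array([0.70, 0.10, 0.10, 0.05, 0.05])   # target policy

def ips_estimate(h, n):
    """Log n samples under policy h; return the IPS estimate of pi's value."""
    a = rng.choice(K, size=n, p=h)               # logged actions
    r = rng.binomial(1, mean_reward[a])          # logged Bernoulli rewards
    w = pi[a] / h[a]                             # importance weights pi/h
    return np.mean(w * r)

uniform = np.full(K, 1.0 / K)

# Classical variance-optimal logging policy for plain importance sampling:
# h*(a) proportional to pi(a) * sqrt(E[r^2 | a]).  For Bernoulli rewards,
# E[r^2 | a] = mean_reward[a].  This is only the single-batch case; MVAL
# additionally conditions on the data that has already been logged.
h_star = pi * np.sqrt(mean_reward)
h_star /= h_star.sum()

true_value = float(pi @ mean_reward)
for name, h in [("uniform", uniform), ("variance-optimal", h_star)]:
    estimates = [ips_estimate(h, n=1000) for _ in range(2000)]
    print(f"{name:17s} bias={np.mean(estimates) - true_value:+.4f} "
          f"std={np.std(estimates):.4f}")
```

Running this shows that both logging policies give essentially unbiased estimates, while the variance-optimal one yields a noticeably smaller standard deviation: exactly the effect that a well-chosen augmentation logging policy is meant to achieve.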


Related research

11/06/2018
CAB: Continuous Adaptive Blending Estimator for Policy Evaluation and Learning
The ability to perform offline A/B-testing and off-policy learning using...

09/10/2018
Efficient Counterfactual Learning from Bandit Feedback
What is the most statistically efficient way to do off-policy evaluation...

12/04/2022
Counterfactual Learning with General Data-generating Policies
Off-policy evaluation (OPE) attempts to predict the performance of count...

08/22/2018
Genie: An Open Box Counterfactual Policy Estimator for Optimizing Sponsored Search Marketplace
In this paper, we propose an offline counterfactual policy estimation fr...

09/18/2019
Learning from Bandit Feedback: An Overview of the State-of-the-art
In machine learning we often try to optimise a decision rule that would ...

10/28/2021
Sayer: Using Implicit Feedback to Optimize System Policies
We observe that many system policies that make threshold decisions invol...

10/16/2012
Sample-efficient Nonstationary Policy Evaluation for Contextual Bandits
We present and prove properties of a new offline policy evaluator for an...
