Off-Policy Evaluation Using Information Borrowing and Context-Based Switching

12/18/2021
by Sutanoy Dasgupta, et al.

We consider the off-policy evaluation (OPE) problem in contextual bandits, where the goal is to estimate the value of a target policy using data collected by a logging policy. Most popular approaches to OPE are variants of the doubly robust (DR) estimator, obtained by combining a direct method (DM) estimator with a correction term involving the inverse propensity score (IPS). Existing algorithms primarily focus on strategies to reduce the variance of the DR estimator arising from large IPS weights. We propose a new approach, the Doubly Robust with Information borrowing and Context-based switching (DR-IC) estimator, which aims to reduce both bias and variance. The DR-IC estimator replaces the standard DM estimator with a parametric reward model that borrows information from 'closer' contexts through a correlation structure that depends on the IPS. The DR-IC estimator also adaptively interpolates between this modified DM estimator and a modified DR estimator based on a context-specific switching rule. We give provable guarantees on the performance of the DR-IC estimator and demonstrate its superior performance over state-of-the-art OPE algorithms on a number of benchmark problems.
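To make the structure of such estimators concrete, the snippet below is a minimal NumPy sketch of a generic doubly robust OPE estimate with a simple per-context switch: when the importance weight is too large, the estimate falls back to the DM term alone. The function name dr_ope_with_switch, the weight_cap threshold, and the hard thresholding rule are illustrative assumptions for exposition only; the paper's DR-IC estimator instead uses an information-borrowing reward model and its own context-specific switching rule.

```python
import numpy as np

def dr_ope_with_switch(pi_target, mu_logged, actions, rewards, r_hat,
                       weight_cap=20.0):
    """Illustrative doubly robust OPE estimate with a per-context switch.

    pi_target : (n, K) target-policy probabilities for each logged context
    mu_logged : (n,)   logging-policy probability of the logged action
    actions   : (n,)   logged action indices
    rewards   : (n,)   observed rewards
    r_hat     : (n, K) reward-model predictions for every action (DM component)
    weight_cap: hypothetical threshold; contexts whose importance weight
                exceeds it use the DM term only (a stand-in for a
                context-specific switching rule, not the paper's DR-IC rule).
    """
    n = len(rewards)
    # Direct-method term: model-based value of the target policy per context.
    dm = np.sum(pi_target * r_hat, axis=1)
    # Importance weights pi(a_i | x_i) / mu(a_i | x_i) for the logged actions.
    w = pi_target[np.arange(n), actions] / mu_logged
    # IPS correction using the observed rewards and the reward-model residuals.
    correction = w * (rewards - r_hat[np.arange(n), actions])
    # Context-based switch: drop the correction where the weight is too large.
    use_dr = w <= weight_cap
    per_context = dm + np.where(use_dr, correction, 0.0)
    return per_context.mean()


# Toy usage with synthetic data (K = 3 actions, n = 5 logged rounds).
rng = np.random.default_rng(0)
n, K = 5, 3
pi_target = rng.dirichlet(np.ones(K), size=n)
mu_logged = rng.uniform(0.2, 0.8, size=n)
actions = rng.integers(0, K, size=n)
rewards = rng.uniform(0, 1, size=n)
r_hat = rng.uniform(0, 1, size=(n, K))
print(dr_ope_with_switch(pi_target, mu_logged, actions, rewards, r_hat))
```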


Related research

02/10/2018 · More Robust Doubly Robust Off-policy Evaluation
We study the problem of off-policy evaluation (OPE) in reinforcement lea...

06/03/2021 · Off-Policy Evaluation via Adaptive Weighting with Data from Contextual Bandits
It has become increasingly common for data to be collected adaptively, f...

02/23/2023 · Asymptotically Unbiased Off-Policy Policy Evaluation when Reusing Old Data in Nonstationary Environments
In this work, we consider the off-policy policy evaluation problem for c...

05/28/2021 · Enhanced Doubly Robust Learning for Debiasing Post-click Conversion Rate Estimation
Post-click conversion, as a strong signal indicating the user preference...

02/19/2022 · Doubly Robust Distributionally Robust Off-Policy Evaluation and Learning
Off-policy evaluation and learning (OPE/L) use offline observational dat...

02/16/2018 · Policy Evaluation and Optimization with Continuous Treatments
We study the problem of policy evaluation and learning from batched cont...

03/31/2022 · Doubly-Robust Estimation for Unbiased Learning-to-Rank from Position-Biased Click Feedback
Clicks on rankings suffer from position bias: generally items on lower r...
