Unbiased Estimation of the Value of an Optimized Policy

06/07/2018
by   Elon Portugaly, et al.

Randomized trials, also known as A/B tests, are used to select between two policies: a control and a treatment. Given a corresponding set of features, we can ideally learn an optimized policy P that maps the A/B test data features to the action space and optimizes reward. However, although A/B testing provides an unbiased estimator for the value of deploying B (i.e., switching from policy A to B), directly applying those samples to learn the optimized policy P generally does not provide an unbiased estimator of the value of P, because the same samples were observed when constructing P. In situations where the cost and risks associated with deploying a policy are high, such an unbiased estimator is highly desirable. We present a procedure for learning optimized policies and obtaining unbiased estimates of the value of deploying them. We wrap any policy learning procedure with a bagging process and obtain out-of-bag policy inclusion decisions for each sample. We then prove that the inverse-propensity-weighting effect estimator is unbiased when applied to the optimized subset. Likewise, we apply the same idea to obtain out-of-bag unbiased per-sample value estimates of the measurement that are independent of the randomized treatment, and use these estimates to build an unbiased doubly-robust effect estimator. Lastly, we show empirically that even when the average treatment effect is negative, we can find a positive optimized policy.
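
The bagging idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy learner `learn_sign_policy`, the majority vote over out-of-bag policies, the synthetic data, and all function and variable names are assumptions made for illustration. Each bootstrap policy votes only on samples it was not trained on, and the inverse-propensity-weighting (IPW) estimator is then applied to the samples whose out-of-bag vote says "treat".

```python
import random


def ipw_effect(samples, include, p_treat):
    """IPW estimate of the mean per-sample gain from treating only included samples.

    samples: list of (x, t, y) with t in {0, 1} assigned with probability p_treat.
    include: list of booleans, one per sample.
    """
    total = 0.0
    for (x, t, y), d in zip(samples, include):
        if d:
            total += y / p_treat if t == 1 else -y / (1.0 - p_treat)
    return total / len(samples)


def learn_sign_policy(train, p_treat):
    """Toy learner (hypothetical): treat a side of x = 0 if its IPW effect is positive."""
    def side_effect(positive_side):
        sub = [s for s in train if (s[0] > 0) == positive_side]
        if not sub:
            return 0.0
        return ipw_effect(sub, [True] * len(sub), p_treat)

    treat_pos = side_effect(True) > 0
    treat_neg = side_effect(False) > 0
    return lambda x: treat_pos if x > 0 else treat_neg


def out_of_bag_inclusion(samples, learner, p_treat, n_bags=50, seed=0):
    """Majority vote, per sample, over policies trained on bags that exclude it."""
    rng = random.Random(seed)
    n = len(samples)
    votes = [0] * n   # out-of-bag votes in favour of treating sample i
    counts = [0] * n  # number of bags for which sample i was out-of-bag
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap indices
        in_bag = set(idx)
        policy = learner([samples[i] for i in idx], p_treat)
        for i in range(n):
            if i not in in_bag:
                counts[i] += 1
                votes[i] += 1 if policy(samples[i][0]) else 0
    # Samples that were never out-of-bag default to "not included".
    return [c > 0 and 2 * v > c for v, c in zip(votes, counts)]


# Synthetic A/B test (assumed): treatment helps when x > 0 and hurts otherwise.
rng = random.Random(1)
p_treat = 0.5
samples = []
for _ in range(2000):
    x = rng.uniform(-1.0, 1.0)
    t = 1 if rng.random() < p_treat else 0
    y = t * (1.0 if x > 0 else -2.0) + rng.gauss(0.0, 0.1)
    samples.append((x, t, y))

include = out_of_bag_inclusion(samples, learn_sign_policy, p_treat)
ate_hat = ipw_effect(samples, [True] * len(samples), p_treat)  # treat everyone
policy_hat = ipw_effect(samples, include, p_treat)             # treat optimized subset
print("estimated ATE:", ate_hat)
print("estimated value of optimized policy:", policy_hat)
```

In this synthetic setup the average treatment effect is negative by construction, while treating only the out-of-bag-selected subset yields a positive estimated effect, mirroring the abstract's final claim. The sketch covers only the IPW estimator; the doubly-robust variant is not shown.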

