Doubly Robust Policy Evaluation and Learning

03/23/2011
by Miroslav Dudík, et al.

We study decision making in environments where the reward is only partially observed, but can be modeled as a function of an action and an observed context. This setting, known as contextual bandits, encompasses a wide variety of applications including health-care policy and Internet advertising. A central task is evaluation of a new policy given historic data consisting of contexts, actions and received rewards. The key challenge is that the past data typically does not faithfully represent proportions of actions taken by a new policy. Previous approaches rely either on models of rewards or models of the past policy. The former are plagued by a large bias whereas the latter have a large variance. In this work, we leverage the strength and overcome the weaknesses of the two approaches by applying the doubly robust technique to the problems of policy evaluation and optimization. We prove that this approach yields accurate value estimates when we have either a good (but not necessarily consistent) model of rewards or a good (but not necessarily consistent) model of past policy. Extensive empirical comparison demonstrates that the doubly robust approach uniformly improves over existing techniques, achieving both lower variance in value estimation and better policies. As such, we expect the doubly robust approach to become common practice.
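Since the abstract describes the doubly robust estimator only in prose, here is a minimal sketch of the idea in Python. It combines a (possibly misspecified) reward model with an inverse-propensity correction based on a (possibly misspecified) model of the logging policy. All names (doubly_robust_value, reward_model, target_policy, and the assumption of a deterministic target policy) are illustrative, not taken from the paper.

```python
import numpy as np

def doubly_robust_value(contexts, actions, rewards, propensities,
                        reward_model, target_policy):
    """Doubly robust estimate of a target policy's value from logged
    contextual-bandit data.

    contexts:      observed contexts x_i
    actions:       logged actions a_i
    rewards:       observed rewards r_i for the logged actions
    propensities:  estimated probabilities p_hat(a_i | x_i) under the
                   logging policy
    reward_model:  callable (x, a) -> estimated reward r_hat(x, a)
    target_policy: callable x -> action chosen by the policy to evaluate
    """
    values = []
    for x, a, r, p in zip(contexts, actions, rewards, propensities):
        pi_a = target_policy(x)
        # Direct-method term: model-based reward of the target action.
        dm = reward_model(x, pi_a)
        # IPS correction: applied only when the logged action matches the
        # target action; it debiases the reward model using observed reward.
        correction = (r - reward_model(x, a)) * (a == pi_a) / p
        values.append(dm + correction)
    return float(np.mean(values))
```

The double robustness is visible in the two terms: if the reward model is correct, the correction term has expectation zero; if the propensities are correct, the correction cancels the reward model's bias in expectation. Either condition alone yields an accurate value estimate, which is the guarantee the abstract states.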


Related research

03/10/2015 · Doubly Robust Policy Evaluation and Optimization
We study sequential decision making in environments where rewards are on...

12/23/2016 · Constructing Effective Personalized Policies Using Counterfactual Inference from Biased Data Sets with Many Features
This paper proposes a novel approach for constructing effective personal...

10/16/2012 · Sample-efficient Nonstationary Policy Evaluation for Contextual Bandits
We present and prove properties of a new offline policy evaluator for an...

01/15/2019 · Imitation-Regularized Offline Learning
We study the problem of offline learning in automated decision systems u...

05/30/2019 · Defining Admissible Rewards for High Confidence Policy Evaluation
A key impediment to reinforcement learning (RL) in real applications wit...

10/19/2022 · Anytime-valid off-policy inference for contextual bandits
Contextual bandit algorithms are ubiquitous tools for active sequential ...

11/08/2021 · Safe Optimal Design with Applications in Policy Learning
Motivated by practical needs in online experimentation and off-policy le...
