Reliable Off-policy Evaluation for Reinforcement Learning

11/08/2020
by   Jie Wang, et al.
0

In a sequential decision-making problem, off-policy evaluation (OPE) estimates the expected cumulative reward of a target policy using logged transition data generated from a different behavior policy, without execution of the target policy. Reinforcement learning in high-stake environments, such as healthcare and education, is often limited to off-policy settings due to safety or ethical concerns, or inability of exploration. Hence it is imperative to quantify the uncertainty of the off-policy estimate before deployment of the target policy. In this paper, we propose a novel framework that provides robust and optimistic cumulative reward estimates with statistical guarantees and develop non-asymptotic as well as asymptotic confidence intervals for OPE, leveraging methodologies from distributionally robust optimization. Our theoretical results are also supported by empirical analysis.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
10/22/2020

CoinDICE: Off-Policy Confidence Interval Estimation

We study high-confidence behavior-agnostic off-policy evaluation in rein...
research
02/10/2020

Interpretable Off-Policy Evaluation in Reinforcement Learning by Highlighting Influential Transitions

Off-policy evaluation in reinforcement learning offers the chance of usi...
research
11/28/2021

Identification of Subgroups With Similar Benefits in Off-Policy Policy Evaluation

Off-policy policy evaluation methods for sequential decision making can ...
research
05/10/2021

Deeply-Debiased Off-Policy Interval Estimation

Off-policy evaluation learns a target policy's value with a historical d...
research
06/21/2019

Leveraging Reinforcement Learning Techniques for Effective Policy Adoption and Validation

Rewards and punishments in different forms are pervasive and present in ...
research
03/09/2021

Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and Dual Bounds

Off-policy evaluation (OPE) is the task of estimating the expected rewar...
research
12/29/2022

Quantile Off-Policy Evaluation via Deep Conditional Generative Learning

Off-Policy evaluation (OPE) is concerned with evaluating a new target po...

Please sign up or login with your details

Forgot password? Click here to reset