Understanding the Curse of Horizon in Off-Policy Evaluation via Conditional Importance Sampling

10/15/2019
by Yao Liu, et al.

We establish a connection between the importance sampling estimators typically used for off-policy policy evaluation in reinforcement learning and the extended conditional Monte Carlo method. Through a set of examples, we show that in the finite-horizon case there is in general no strict ordering between the variances of these conditional importance sampling estimators: the variance of the per-decision or stationary variants may in fact be higher than that of the crude importance sampling estimator. We also provide sufficient conditions for the finite-horizon case under which the per-decision or stationary estimators do reduce the variance. We then develop an asymptotic analysis and derive sufficient conditions under which there is an exponential versus polynomial gap (in terms of the horizon T) between the variance of importance sampling and that of the per-decision or stationary estimators.
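
To make the comparison concrete, below is a minimal sketch of the two estimators the abstract contrasts: crude (trajectory-wise) importance sampling, which weights the whole return by the product of likelihood ratios over the full horizon, and per-decision importance sampling, which weights each reward only by the ratios of the actions taken up to that time step. The data layout and function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def ordinary_is(trajectories):
    """Crude (trajectory-wise) importance sampling.

    Each trajectory is a list of (rho_t, r_t) pairs, where
    rho_t = pi_e(a_t | s_t) / pi_b(a_t | s_t) is the per-step
    likelihood ratio and r_t is the observed reward (assumed layout).
    The full return is weighted by the product of all ratios.
    """
    estimates = []
    for traj in trajectories:
        rho = np.prod([ratio for ratio, _ in traj])
        ret = np.sum([r for _, r in traj])
        estimates.append(rho * ret)
    return np.mean(estimates)

def per_decision_is(trajectories):
    """Per-decision importance sampling.

    Reward r_t is weighted only by the cumulative ratio of actions
    taken up to and including time t, rather than the full-horizon product.
    """
    estimates = []
    for traj in trajectories:
        rho = 1.0
        total = 0.0
        for ratio, r in traj:
            rho *= ratio        # cumulative ratio up to time t
            total += rho * r    # reward at t sees only ratios up to t
        estimates.append(total)
    return np.mean(estimates)
```

As the abstract notes, the per-decision estimator is not guaranteed to have lower variance than the crude one for every finite-horizon problem; the paper's examples and sufficient conditions characterize when the variance reduction does or does not occur.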


