Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation

10/29/2018
by Qiang Liu, et al.

We consider the off-policy estimation problem of estimating the expected reward of a target policy using samples collected by a different behavior policy. Importance sampling (IS) has been a key technique for deriving (nearly) unbiased estimators, but it is known to suffer from excessively high variance in long-horizon problems. In the extreme case of infinite-horizon problems, the variance of an IS-based estimator may even be unbounded. In this paper, we propose a new off-policy estimation method that applies IS directly to the stationary state-visitation distributions, avoiding the exploding-variance issue faced by existing estimators. Our key contribution is a novel approach to estimating the density ratio of two stationary distributions using trajectories sampled only from the behavior distribution. We develop a mini-max loss function for the estimation problem and derive a closed-form solution for the case of a reproducing kernel Hilbert space (RKHS). We support our method with both theoretical and empirical analyses.
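The idea of applying IS to stationary state-visitation distributions can be illustrated with a small tabular sketch. It is not the paper's estimator: it assumes a hypothetical 3-state, 2-action ergodic MDP in the undiscounted average-reward setting, and it computes the density ratio `w(s)` exactly from a known transition model (an oracle) instead of estimating it from data with the mini-max loss. All names (`stationary_distribution`, `stationary_is_estimate`, `demo`) are illustrative. The point is that each sample is reweighted by a single per-step factor, `w(s) * target(a|s) / behavior(a|s)`, rather than by a product of ratios over the whole trajectory.

```python
import numpy as np


def stationary_distribution(P):
    """Stationary distribution d of a row-stochastic matrix P (assumed ergodic)."""
    n = P.shape[0]
    # Solve d^T P = d^T together with sum(d) = 1 as one linear system.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d


def state_transition_matrix(T, policy):
    """State-to-state matrix induced by a policy; T[s, a, s2], policy[s, a]."""
    return np.einsum("sa,sat->st", policy, T)


def average_reward(T, R, policy):
    """Exact long-run average reward of a policy (ground truth for the demo)."""
    d = stationary_distribution(state_transition_matrix(T, policy))
    return float(np.sum(d[:, None] * policy * R))


def stationary_is_estimate(states, actions, rewards, w, target, behavior):
    """Self-normalized average of rewards with one weight per (s, a) sample."""
    rho = w[states] * target[states, actions] / behavior[states, actions]
    return float(np.sum(rho * rewards) / np.sum(rho))


def demo(seed=0, n_steps=50_000):
    rng = np.random.default_rng(seed)
    n_s, n_a = 3, 2
    # Hypothetical MDP: random dense transitions and rewards in [0, 1).
    T = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # shape (n_s, n_a, n_s)
    R = rng.uniform(size=(n_s, n_a))
    behavior = np.full((n_s, n_a), 0.5)
    target = np.array([[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])

    # Oracle density ratio of the two stationary state distributions
    # (an assumption of this sketch; the paper estimates it from data).
    d_b = stationary_distribution(state_transition_matrix(T, behavior))
    d_t = stationary_distribution(state_transition_matrix(T, target))
    w = d_t / d_b

    # Roll out one long trajectory under the behavior policy only.
    states = np.empty(n_steps, dtype=int)
    actions = np.empty(n_steps, dtype=int)
    rewards = np.empty(n_steps)
    s = 0
    for t in range(n_steps):
        a = rng.choice(n_a, p=behavior[s])
        states[t], actions[t], rewards[t] = s, a, R[s, a]
        s = rng.choice(n_s, p=T[s, a])

    est = stationary_is_estimate(states, actions, rewards, w, target, behavior)
    return est, average_reward(T, R, target)


if __name__ == "__main__":
    est, truth = demo()
    print(f"estimate={est:.4f}  exact={truth:.4f}")
```

Because each weight depends only on the current state and action, its magnitude does not grow with trajectory length, which is the contrast with trajectory-wise IS that the abstract draws.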


