SOPE: Spectrum of Off-Policy Estimators

11/06/2021
by   Christina J. Yuan, et al.

Many sequential decision-making problems are high-stakes and require off-policy evaluation (OPE) of a new policy using historical data collected under some other policy. One of the most common OPE techniques that provides unbiased estimates is trajectory-based importance sampling (IS). However, due to the high variance of trajectory IS estimates, importance sampling methods based on state-action visitation distributions (SIS) have recently been adopted. Unfortunately, while SIS often provides lower-variance estimates for long horizons, estimating the state-action distribution ratios can be challenging and can lead to biased estimates. In this paper, we present a new perspective on this bias-variance trade-off and show the existence of a spectrum of estimators whose endpoints are SIS and IS. Additionally, we establish a spectrum for doubly-robust and weighted versions of these estimators. We provide empirical evidence that estimators in this spectrum can be used to trade off between the bias of SIS and the variance of IS, and can achieve lower mean-squared error than both.
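As a concrete reference point for the IS endpoint of the spectrum, a minimal sketch of the trajectory-based importance sampling estimator is shown below. The function names and the `pi_e`/`pi_b` callables are hypothetical illustrations, not the paper's implementation: each trajectory's discounted return is reweighted by the product of per-step probability ratios between the evaluation and behaviour policies.

```python
def trajectory_is_estimate(trajectories, pi_e, pi_b, gamma=0.99):
    """Trajectory-based importance sampling (IS) estimate of a policy's value.

    trajectories: list of trajectories, each a list of (state, action, reward)
        tuples collected under the behaviour policy.
    pi_e(a, s), pi_b(a, s): action probabilities under the evaluation and
        behaviour policies (hypothetical callables for illustration).
    gamma: discount factor.
    """
    estimates = []
    for traj in trajectories:
        rho = 1.0    # cumulative importance weight over the full trajectory
        ret = 0.0    # discounted return of the trajectory
        for t, (s, a, r) in enumerate(traj):
            rho *= pi_e(a, s) / pi_b(a, s)
            ret += (gamma ** t) * r
        estimates.append(rho * ret)
    # Unbiased (given full support of pi_b), but the variance of rho
    # grows multiplicatively with the horizon, motivating SIS and the
    # intermediate estimators studied in the paper.
    return sum(estimates) / len(estimates)
```

When `pi_e` equals `pi_b`, every weight `rho` is 1 and the estimator reduces to the on-policy Monte Carlo average, which is a useful sanity check.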


Related research

12/07/2022 · Low Variance Off-policy Evaluation with State-based Importance Sampling
In off-policy reinforcement learning, a behaviour policy performs explor...

12/14/2022 · Scaling Marginalized Importance Sampling to High-Dimensional State-Spaces via State Abstraction
We consider the problem of off-policy evaluation (OPE) in reinforcement ...

04/04/2016 · Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning
In this paper we present a new way of predicting the performance of a re...

01/26/2023 · Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning
Off-policy learning from multistep returns is crucial for sample-efficie...

04/18/2021 · Off-Policy Risk Assessment in Contextual Bandits
To evaluate prospective contextual bandit policies when experimentation ...

07/13/2023 · Leveraging Factored Action Spaces for Off-Policy Evaluation
Off-policy evaluation (OPE) aims to estimate the benefit of following a ...

08/24/2022 · Entropy Regularization for Population Estimation
Entropy regularization is known to improve exploration in sequential dec...
