Scaling Marginalized Importance Sampling to High-Dimensional State-Spaces via State Abstraction

12/14/2022
by Brahma S. Pavse, et al.

We consider the problem of off-policy evaluation (OPE) in reinforcement learning (RL), where the goal is to estimate the performance of an evaluation policy, π_e, using a fixed dataset, 𝒟, collected by one or more policies that may be different from π_e. Current OPE algorithms may produce poor OPE estimates under policy distribution shift, i.e., when the probability of a particular state-action pair occurring under π_e is very different from the probability of that same pair occurring in 𝒟 (Voloshin et al. 2021, Fu et al. 2021). In this work, we propose to improve the accuracy of OPE estimators by projecting the high-dimensional state-space into a low-dimensional state-space using concepts from the state abstraction literature. Specifically, we consider marginalized importance sampling (MIS) OPE algorithms, which compute state-action distribution correction ratios to produce their OPE estimate. In the original ground state-space, these ratios may have high variance, which may lead to high-variance OPE estimates. However, we prove that in the lower-dimensional abstract state-space the ratios can have lower variance, resulting in lower-variance OPE. We then highlight the challenges that arise when estimating the abstract ratios from data, identify sufficient conditions to overcome these issues, and present a minimax optimization problem whose solution yields these abstract ratios. Finally, our empirical evaluation on difficult, high-dimensional state-space OPE tasks shows that the abstract ratios can make MIS OPE estimators achieve lower mean-squared error and become more robust to hyperparameter tuning than the ground ratios.
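To make the MIS idea concrete, below is a minimal sketch (not the authors' implementation) of how an MIS OPE estimate is formed once state-action distribution correction ratios w(s, a) ≈ d_{π_e}(s, a) / d_𝒟(s, a) are available, and how those ratios would instead be evaluated in an abstract state-space via a state-abstraction map φ. The names `mis_ope_estimate`, `abstract_ratios`, `ratio_fn`, and `phi` are hypothetical placeholders introduced only for illustration.

```python
# Minimal sketch of a marginalized importance sampling (MIS) OPE estimate.
# Assumes the correction ratios have already been estimated (e.g., by a
# minimax procedure such as the one described in the abstract); here they
# are supplied as a callable for illustration only.
import numpy as np


def mis_ope_estimate(rewards, ratios):
    """Weighted average of observed rewards: (1/n) * sum_i w_i * r_i."""
    rewards = np.asarray(rewards, dtype=float)
    ratios = np.asarray(ratios, dtype=float)
    return float(np.mean(ratios * rewards))


def abstract_ratios(states, actions, ratio_fn, phi):
    """Evaluate correction ratios on abstract states phi(s) rather than on
    the high-dimensional ground states s themselves."""
    return np.array([ratio_fn(phi(s), a) for s, a in zip(states, actions)])


# Toy usage with made-up data and a trivial abstraction (first coordinate).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    states = rng.normal(size=(100, 10))   # high-dimensional ground states
    actions = rng.integers(0, 2, size=100)
    rewards = rng.normal(size=100)
    phi = lambda s: s[:1]                 # hypothetical abstraction map
    ratio_fn = lambda z, a: 1.0           # placeholder: uniform ratios
    w = abstract_ratios(states, actions, ratio_fn, phi)
    print("MIS OPE estimate:", mis_ope_estimate(rewards, w))
```

The point of the abstraction step is that the ratios are functions of φ(s) rather than s, so the same estimator is computed with (potentially much lower-variance) abstract ratios in place of the ground ratios.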

