Off-Policy Risk Assessment in Markov Decision Processes

09/21/2022
by   Audrey Huang, et al.

Addressing such diverse ends as safety, alignment with human preferences, and the efficiency of learning, a growing line of reinforcement learning research focuses on risk functionals that depend on the entire distribution of returns. Recent work on off-policy risk assessment (OPRA) for contextual bandits introduced consistent estimators for the target policy's CDF of returns, along with finite-sample guarantees that extend to (and hold simultaneously over) all risk functionals. In this paper, we lift OPRA to Markov decision processes (MDPs), where importance sampling (IS) CDF estimators suffer high variance on longer trajectories due to small effective sample size. To mitigate this, we incorporate model-based estimation to develop the first doubly robust (DR) estimator for the CDF of returns in MDPs. This estimator enjoys significantly lower variance and, when the model is well specified, achieves the Cramér-Rao variance lower bound. Moreover, for many risk functionals, the downstream estimates enjoy both lower bias and lower variance. Additionally, we derive the first minimax lower bounds for off-policy CDF and risk estimation, which match our error bounds up to a constant factor. Finally, we demonstrate the precision of our DR CDF estimates experimentally on several different environments.
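The baseline the abstract improves on is the trajectory-level importance-sampling CDF estimator, F̂(t) = (1/n) Σᵢ wᵢ · 1{Gᵢ ≤ t}, where wᵢ is the product of per-step policy ratios along trajectory i. The sketch below is an illustrative assumption about that setup (the `is_cdf_estimate` helper and its trajectory format are hypothetical, not the authors' code); it makes the horizon-dependent variance problem concrete, since the weight is a product over the trajectory's steps.

```python
import numpy as np

def is_cdf_estimate(trajectories, thresholds):
    """Importance-sampling (IS) estimate of the target policy's CDF of returns.

    Each trajectory is a (return, per_step_ratios) pair, where each ratio is
    pi_target(a|s) / pi_behavior(a|s). The trajectory weight is the product of
    its per-step ratios, so variance grows with horizon; this shrinking
    effective sample size is what motivates the doubly robust estimator.
    """
    returns = np.array([g for g, _ in trajectories])
    weights = np.array([np.prod(ratios) for _, ratios in trajectories])
    # F_hat(t) = (1/n) * sum_i w_i * 1{G_i <= t}, clipped into [0, 1]
    # since the IS estimate of a probability can fall outside that range.
    indicators = returns[None, :] <= np.asarray(thresholds)[:, None]
    return np.clip(indicators @ weights / len(trajectories), 0.0, 1.0)
```

With on-policy data (all ratios equal to 1) this reduces to the empirical CDF of observed returns; any risk functional of the return distribution can then be read off the estimated CDF.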


Related research

02/10/2018 · More Robust Doubly Robust Off-policy Evaluation
We study the problem of off-policy evaluation (OPE) in reinforcement lea...

04/18/2021 · Off-Policy Risk Assessment in Contextual Bandits
To evaluate prospective contextual bandit policies when experimentation ...

09/12/2014 · On Minimax Optimal Offline Policy Evaluation
This paper studies the off-policy evaluation problem, where one aims to ...

01/31/2023 · Sharp Variance-Dependent Bounds in Reinforcement Learning: Best of Both Worlds in Stochastic and Deterministic Environments
We study variance-dependent regret bounds for Markov decision processes ...

06/08/2019 · Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling
Motivated by the many real-world applications of reinforcement learning ...

06/09/2019 · Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning
Off-policy evaluation (OPE) in both contextual bandits and reinforcement...

03/09/2022 · ReVar: Strengthening Policy Evaluation via Reduced Variance Sampling
This paper studies the problem of data collection for policy evaluation ...
