Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling

06/08/2019
by Tengyang Xie, et al.

Motivated by the many real-world applications of reinforcement learning (RL) that require safe policy iteration, we consider the problem of off-policy evaluation (OPE), i.e., evaluating a new policy using historical data obtained from different behavior policies, under the model of nonstationary episodic Markov Decision Processes (MDPs) with a long horizon and a large action space. Existing importance sampling (IS) methods often suffer from variance that grows exponentially with the RL horizon H. To address this problem, we consider a marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution of the target policy at every step. MIS achieves a mean-squared error of O(H^2 R_max^2 ∑_{t=1}^H E_μ[w_{π,μ}(s_t, a_t)^2] / n) for large n, where w_{π,μ}(s_t, a_t) is the ratio between the marginal state-action distributions at step t under π and under μ, H is the horizon, R_max is the maximal reward, and n is the sample size. This result nearly matches the Cramér-Rao lower bound for DAG MDPs established by Jiang and Li (2016) in most non-trivial regimes. To the best of our knowledge, this is the first OPE estimator with provably optimal dependence on H and on the second moments of the importance weights. Beyond the theoretical optimality, we empirically demonstrate the superiority of our method in time-varying, partially observable, and long-horizon RL environments.
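To make the recursion concrete, here is a minimal tabular sketch of an MIS estimator in Python. It is an illustrative instantiation under stated assumptions, not the authors' implementation: the function name mis_estimate, the data layout, and the assumption that the behavior probabilities μ are known and tabulated are all hypothetical. The key point is that the state marginal d̂_t^π is propagated forward one step at a time through an importance-weighted transition estimate, so the estimator never forms a product of H per-step importance ratios.

```python
import numpy as np

def mis_estimate(states, actions, rewards, pi, mu):
    """Marginalized importance sampling (MIS) off-policy value estimate.

    A minimal tabular sketch (names and data layout are illustrative):
      states, actions : int arrays of shape (n, H), entries in [0, S) / [0, A)
      rewards         : float array of shape (n, H)
      pi, mu          : float arrays of shape (H, S, A); target/behavior
                        action probabilities (assumed known, with mu > 0
                        wherever it is evaluated)
    """
    n, H = states.shape
    S = pi.shape[1]

    # d_hat[t, s]: estimated marginal distribution of s_t under pi.
    d_hat = np.zeros((H, S))
    # Step 1: the empirical initial-state distribution needs no reweighting,
    # since pi and mu share the same initial-state distribution.
    d_hat[0] = np.bincount(states[:, 0], minlength=S) / n

    value = 0.0
    for t in range(H):
        s_t, a_t = states[:, t], actions[:, t]
        # One-step importance ratio pi/mu for each trajectory at step t.
        rho = pi[t, s_t, a_t] / mu[t, s_t, a_t]

        # Importance-weighted per-state mean reward r_hat_t^pi(s).
        counts = np.bincount(s_t, minlength=S)
        r_hat = np.bincount(s_t, weights=rho * rewards[:, t], minlength=S)
        r_hat = np.divide(r_hat, counts, out=np.zeros(S), where=counts > 0)

        # Accumulate sum_s d_hat_t(s) * r_hat_t(s).
        value += d_hat[t] @ r_hat

        # Importance-weighted transition estimate P_hat_t^pi(s' | s),
        # then one recursive step: d_hat_{t+1} = P_hat_t^pi d_hat_t.
        if t + 1 < H:
            s_next = states[:, t + 1]
            P_hat = np.zeros((S, S))                  # P_hat[s', s]
            np.add.at(P_hat, (s_next, s_t), rho)      # weighted counts
            P_hat = np.divide(P_hat, counts, out=np.zeros((S, S)),
                              where=counts > 0)
            d_hat[t + 1] = P_hat @ d_hat[t]

    return value
```

Because every step reweights by a single ratio π(a_t|s_t)/μ(a_t|s_t) rather than the cumulative product of t such ratios used by trajectory-wise IS, the variance is driven by the second moment of the per-step marginal ratio w_{π,μ}(s_t, a_t), which is how the exponential dependence on H is avoided.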


Related research

11/11/2015 · Doubly Robust Off-policy Value Evaluation for Reinforcement Learning
We study the problem of off-policy value evaluation in reinforcement lea...

01/29/2020 · Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning
We consider the problem of off-policy evaluation for reinforcement learn...

09/10/2021 · Projected State-action Balancing Weights for Offline Reinforcement Learning
Offline policy evaluation (OPE) is considered a fundamental and challeng...

04/04/2016 · Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning
In this paper we present a new way of predicting the performance of a re...

09/21/2022 · Off-Policy Risk Assessment in Markov Decision Processes
Addressing such diverse ends as safety alignment with human preferences,...

06/06/2020 · Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies
Offline reinforcement learning, wherein one uses off-policy data logged ...

01/19/2021 · Minimax Off-Policy Evaluation for Multi-Armed Bandits
We study the problem of off-policy evaluation in the multi-armed bandit ...
