Representation Balancing MDPs for Off-Policy Policy Evaluation

05/23/2018
by Yao Liu et al.

We study the problem of off-policy policy evaluation (OPPE) in RL. In contrast to prior work, we consider how to estimate both the individual policy value and the average policy value accurately. Drawing inspiration from recent work in causal reasoning, we propose a new finite-sample generalization error bound for value estimates from MDP models. Using this upper bound as an objective, we develop a learning algorithm that fits an MDP model with a balanced representation, and we show that our approach can yield substantially lower MSE on a common synthetic domain and on a challenging real-world sepsis management problem.
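To illustrate the balanced-representation idea described above, here is a minimal, hypothetical sketch (not the authors' released code) in PyTorch: it fits a reward/next-state model on top of a shared state representation and adds a distribution-balancing penalty, here a linear-kernel MMD between representations of transitions whose logged action agrees with the evaluation policy's action and those whose action does not. All names (RepModel, mmd_linear, balance_weight) are illustrative assumptions, not the paper's API.

# Minimal sketch of a representation-balancing model-learning objective.
# Assumes batches of (state, one-hot action, reward, next state) logged by a
# behavior policy, plus the evaluation policy's one-hot action at each state.
import torch
import torch.nn as nn

class RepModel(nn.Module):
    def __init__(self, state_dim, action_dim, rep_dim=32):
        super().__init__()
        # Shared representation phi(s)
        self.phi = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, rep_dim))
        # Reward and next-state predictors on top of (phi(s), a)
        self.reward_head = nn.Linear(rep_dim + action_dim, 1)
        self.next_state_head = nn.Linear(rep_dim + action_dim, state_dim)

    def forward(self, s, a_onehot):
        z = self.phi(s)
        za = torch.cat([z, a_onehot], dim=-1)
        return z, self.reward_head(za), self.next_state_head(za)

def mmd_linear(x, y):
    # Squared difference of representation means (linear-kernel MMD).
    if x.shape[0] == 0 or y.shape[0] == 0:
        return x.sum() * 0.0 + y.sum() * 0.0  # zero penalty if a group is empty
    return (x.mean(0) - y.mean(0)).pow(2).sum()

def loss_fn(model, s, a_onehot, r, s_next, a_eval_onehot, balance_weight=1.0):
    # Model-fitting loss plus a representation-balancing penalty. The penalty
    # pulls together the representation distributions of transitions where the
    # logged action matches the evaluation policy and where it does not,
    # mirroring the factual/counterfactual balancing idea from causal inference.
    z, r_hat, s_next_hat = model(s, a_onehot)
    fit_loss = (r_hat.squeeze(-1) - r).pow(2).mean() \
             + (s_next_hat - s_next).pow(2).mean()
    agree = (a_onehot.argmax(-1) == a_eval_onehot.argmax(-1))
    penalty = mmd_linear(z[agree], z[~agree])
    return fit_loss + balance_weight * penalty

In a full implementation, the weight on the balancing term would follow from the paper's generalization bound rather than being a hand-tuned constant, and a richer distance (e.g., Wasserstein or RBF-kernel MMD) could replace the linear-kernel term used here for brevity.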

research
09/28/2020

Best Policy Identification in discounted MDPs: Problem-specific Sample Complexity

We investigate the problem of best-policy identification in discounted M...
research
06/24/2021

A Fully Problem-Dependent Regret Lower Bound for Finite-Horizon MDPs

We derive a novel asymptotic problem-dependent lower-bound for regret mi...
research
09/28/2022

Online Policy Optimization for Robust MDP

Reinforcement learning (RL) has exceeded human performance in many synth...
research
01/09/2023

Minimax Weight Learning for Absorbing MDPs

Reinforcement learning policy evaluation problems are often modeled as f...
research
02/11/2015

Off-policy evaluation for MDPs with unknown structure

Off-policy learning in dynamic decision problems is essential for provid...
research
07/11/2012

Discretized Approximations for POMDP with Average Cost

In this paper, we propose a new lower approximation scheme for POMDP wit...
research
04/02/2022

Model-Free and Model-Based Policy Evaluation when Causality is Uncertain

When decision-makers can directly intervene, policy evaluation algorithm...
