Future-Dependent Value-Based Off-Policy Evaluation in POMDPs

07/26/2022
by   Masatoshi Uehara, et al.

We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general function approximation. Existing methods such as sequential importance sampling estimators and fitted-Q evaluation suffer from the curse of horizon in POMDPs. To circumvent this problem, we develop a novel model-free OPE method by introducing future-dependent value functions that take future proxies as inputs. Future-dependent value functions play roles analogous to those of classical value functions in fully observable MDPs. We derive a new Bellman equation for future-dependent value functions, expressed as conditional moment equations that use history proxies as instrumental variables. We further propose a minimax learning method that learns future-dependent value functions from this new Bellman equation. We obtain a PAC result, which implies that our OPE estimator is consistent as long as futures and histories contain sufficient information about latent states and Bellman completeness holds. Finally, we extend our methods to learning dynamics and establish the connection between our approach and well-known spectral learning methods for POMDPs.
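For intuition only (not verbatim from the paper), the future-dependent Bellman equation can be sketched as a conditional moment restriction in which the history proxy serves as an instrumental variable, and the minimax learner as an adversarial empirical objective in one standard stabilized form. The notation below is assumed: F_t and H_t are the future and history proxies, O_t, A_t, R_t the observation, action, and reward, mu the observation-based importance ratio of the evaluation policy over the behavior policy, gamma the discount factor, and G, Xi hypothesis classes for the value function and the critic.

% Conditional moment restriction defining a future-dependent value function g,
% with the history proxy H_t acting as an instrumental variable (assumed notation):
\mathbb{E}\bigl[\, \mu(O_t, A_t)\{ R_t + \gamma\, g(F_{t+1}) \} - g(F_t) \,\bigm|\, H_t \,\bigr] = 0

% One standard minimax (adversarial) empirical estimator over the value class G
% and critic class Xi, with a quadratic stabilizer on the critic:
\hat{g} \in \operatorname*{arg\,min}_{g \in \mathcal{G}} \; \max_{\xi \in \Xi} \;
\mathbb{E}_n\Bigl[ \xi(H_t)\bigl( \mu(O_t, A_t)\{ R_t + \gamma\, g(F_{t+1}) \} - g(F_t) \bigr) - \tfrac{1}{2}\, \xi(H_t)^2 \Bigr]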


