Off-policy Evaluation in Infinite-Horizon Reinforcement Learning with Latent Confounders

07/27/2020
by Andrew Bennett, et al.

Off-policy evaluation (OPE) in reinforcement learning is an important problem in settings where experimentation is limited, such as education and healthcare. But in these very same settings, observed actions are often confounded by unobserved variables, making OPE even more difficult. We study an OPE problem in an infinite-horizon, ergodic Markov decision process with unobserved confounders, where states and actions can act as proxies for the unobserved confounders. We show how, given only a latent variable model for states and actions, policy value can be identified from off-policy data. Our method involves two stages. In the first, we show how to use proxies to estimate stationary distribution ratios, extending recent work on breaking the curse of horizon to the confounded setting. In the second, we show how optimal balancing can be combined with such learned ratios to obtain policy value while avoiding direct modeling of reward functions. We establish theoretical guarantees of consistency and benchmark our method empirically.
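For context, a minimal sketch (notation mine, not taken from the paper) of the unconfounded stationary-distribution-ratio estimator that the first stage extends: policy value is estimated by reweighting observed rewards with the ratio of stationary state-action distributions,

\[
  \hat{v}(\pi_e) \;=\; \frac{1}{n} \sum_{i=1}^{n} \hat{w}(s_i, a_i)\, r_i,
  \qquad
  \hat{w}(s, a) \;\approx\; \frac{d_{\pi_e}(s, a)}{d_{\pi_b}(s, a)},
\]

where $d_\pi$ denotes the stationary state-action distribution under policy $\pi$, and $\pi_e$, $\pi_b$ are the evaluation and behavior policies. In the confounded setting studied here, states and actions are only proxies for the latent confounder, so the ratio must be identified through the latent variable model rather than estimated directly from observed $(s, a)$ pairs.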


Related research

02/11/2020
Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement Learning
Off-policy evaluation of sequential decision policies from observational...

10/28/2021
Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes
In applications of offline reinforcement learning to observational data,...

09/12/2019
Efficiently Breaking the Curse of Horizon: Double Reinforcement Learning in Infinite-Horizon Processes
Off-policy evaluation (OPE) in reinforcement learning is notoriously dif...

02/22/2022
Off-Policy Confidence Interval Estimation with Confounded Markov Decision Process
This paper is concerned with constructing a confidence interval for a ta...

03/24/2020
Black-box Off-policy Estimation for Infinite-Horizon Reinforcement Learning
Off-policy estimation for long-horizon problems is important in many rea...

09/10/2021
Projected State-action Balancing Weights for Offline Reinforcement Learning
Offline policy evaluation (OPE) is considered a fundamental and challeng...

08/06/2019
Policy Evaluation with Latent Confounders via Optimal Balance
Evaluating novel contextual bandit policies using logged data is crucial...
