Stateful Offline Contextual Policy Evaluation and Learning

10/19/2021
by Nathan Kallus, et al.

We study off-policy evaluation and learning from sequential data in a structured class of Markov decision processes that arise from repeated interactions with an exogenous sequence of arrivals with contexts, which generate unknown individual-level responses to agent actions. This model can be thought of as an offline generalization of contextual bandits with resource constraints. We formalize the relevant causal structure of problems such as dynamic personalized pricing and other operations management problems in the presence of potentially high-dimensional user types. The key insight is that an individual-level response is often not causally affected by the state variable and can therefore easily be generalized across timesteps and states. When this is true, we study implications for (doubly robust) off-policy evaluation and learning by instead leveraging single time-step evaluation, estimating the expectation over a single arrival via data from a population, for fitted-value iteration in a marginal MDP. We study sample complexity and analyze error amplification that leads to the persistence, rather than attenuation, of confounding error over time. In simulations of dynamic and capacitated pricing, we show improved out-of-sample policy performance in this class of relevant problems.
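To make the fitted-value-iteration idea concrete, below is a minimal sketch, not the authors' implementation, of doubly robust fitted-value iteration in a marginal MDP: each backward pass estimates the expectation over a single arrival from pooled population data, using an outcome model plus an importance-weighted residual. All names here (dr_fitted_value_iteration, behavior_prob, target_prob, the scalar state assumption) are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def dr_fitted_value_iteration(states, next_states, contexts, actions, rewards,
                              behavior_prob, target_prob, horizon, gamma=1.0):
    """Illustrative doubly robust fitted-value iteration for a marginal MDP.

    Each row is one logged (state, arrival context, action, reward, next state)
    transition, pooled across time steps; behavior_prob and target_prob are the
    probabilities of the logged action under the behavior and target policies.
    Assumes a scalar state (e.g., remaining capacity). All names are hypothetical.
    """
    X_full = np.column_stack([states, contexts, actions])  # features for the outcome model
    X_state = np.asarray(states).reshape(-1, 1)            # marginal value depends on state only
    X_next = np.asarray(next_states).reshape(-1, 1)
    v_next = np.zeros(len(rewards))                        # value beyond the horizon

    for _ in range(horizon):
        target = rewards + gamma * v_next                  # one-step Bellman target
        outcome_model = GradientBoostingRegressor().fit(X_full, target)
        mu_hat = outcome_model.predict(X_full)
        w = target_prob / behavior_prob                    # single time-step importance weight
        dr_pseudo = mu_hat + w * (target - mu_hat)         # doubly robust pseudo-outcome
        value_model = GradientBoostingRegressor().fit(X_state, dr_pseudo)
        v_next = value_model.predict(X_next)               # back up through the logged next state

    return value_model
```

The doubly robust pseudo-outcome combines the fitted outcome model with an importance-weighted residual, so each single-step backup remains consistent if either the outcome model or the behavior propensities are well estimated; this is a sketch of the general recipe, not the paper's exact estimator.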

research · 10/28/2021
Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes
In applications of offline reinforcement learning to observational data,...

research · 09/21/2022
Off-Policy Evaluation for Episodic Partially Observable Markov Decision Processes under Non-Parametric Models
We study the problem of off-policy evaluation (OPE) for episodic Partial...

research · 09/12/2014
On Minimax Optimal Offline Policy Evaluation
This paper studies the off-policy evaluation problem, where one aims to ...

research · 02/12/2018
Policy Gradients for Contextual Bandits
We study a generalized contextual-bandits problem, where there is a stat...

research · 05/13/2023
Thompson Sampling for Parameterized Markov Decision Processes with Uninformative Actions
We study parameterized MDPs (PMDPs) in which the key parameters of inter...

research · 07/13/2021
Inverse Contextual Bandits: Learning How Behavior Evolves over Time
Understanding an agent's priorities by observing their behavior is criti...

research · 02/01/2023
Robust Fitted-Q-Evaluation and Iteration under Sequentially Exogenous Unobserved Confounders
Offline reinforcement learning is important in domains such as medicine,...
