Universal Off-Policy Evaluation

04/26/2021
by Yash Chandak, et al.

When faced with sequential decision-making problems, it is often useful to be able to predict what would happen if decisions were made using a new policy. Those predictions must often be based on data collected under some previously used decision-making rule. Many previous methods enable such off-policy (or counterfactual) estimation of the expected value of a performance measure called the return. In this paper, we take the first steps towards a universal off-policy estimator (UnO): one that provides off-policy estimates and high-confidence bounds for any parameter of the return distribution. We use UnO for estimating and simultaneously bounding the mean, variance, quantiles/median, inter-quantile range, CVaR, and the entire cumulative distribution of returns. Finally, we also discuss UnO's applicability in various settings, including fully observable, partially observable (i.e., with unobserved confounders), Markovian, non-Markovian, stationary, smoothly non-stationary, and settings with discrete distribution shifts.
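To make the idea in the abstract concrete, here is a rough illustrative sketch (not the authors' implementation): estimate the cumulative distribution of returns under the evaluation policy from behavior-policy data using per-trajectory importance weights, then read off parameters such as the mean, variance, median, and CVaR as plug-in quantities. The function names, synthetic data, and weight construction below are assumptions made for illustration only.

```python
import numpy as np

# Illustrative sketch only; not the UnO implementation from the paper.
# Assumed inputs:
#   returns : observed returns G_i from trajectories collected by the behavior policy
#   weights : per-trajectory importance weights rho_i = prod_t pi_e(a_t|s_t) / pi_b(a_t|s_t)

def off_policy_cdf(returns, weights, grid):
    """Importance-weighted estimate of F_e(g) = Pr(G <= g) under the evaluation policy."""
    returns = np.asarray(returns, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Average rho_i * 1{G_i <= g} at each grid point, then enforce a valid, monotone CDF.
    cdf = np.array([np.mean(weights * (returns <= g)) for g in grid])
    return np.clip(np.maximum.accumulate(cdf), 0.0, 1.0)

def plug_in_parameters(grid, cdf, alpha=0.25):
    """Read common distributional parameters off the estimated CDF."""
    pmf = np.diff(np.concatenate([[0.0], cdf]))        # implied mass on each grid point
    pmf = pmf / max(pmf.sum(), 1e-12)                  # renormalize after clipping
    mean = float(np.sum(pmf * grid))
    variance = float(np.sum(pmf * (grid - mean) ** 2))
    median = grid[min(np.searchsorted(cdf, 0.5), len(grid) - 1)]
    q_alpha = grid[min(np.searchsorted(cdf, alpha), len(grid) - 1)]
    tail = grid <= q_alpha                             # worst alpha-fraction of returns
    cvar = float(np.sum(pmf[tail] * grid[tail]) / max(pmf[tail].sum(), 1e-12))
    return {"mean": mean, "variance": variance, "median": median, f"cvar@{alpha}": cvar}

# Synthetic usage example with hypothetical returns and importance weights.
rng = np.random.default_rng(0)
G = rng.normal(1.0, 0.5, size=2000)
rho = rng.lognormal(0.0, 0.3, size=2000)
grid = np.linspace(G.min(), G.max(), 256)
print(plug_in_parameters(grid, off_policy_cdf(G, rho, grid)))
```

The sketch gives point estimates only; the paper additionally constructs high-confidence bounds around such distributional estimates, which is where the simultaneous bounding of all parameters comes from.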


Related research

High-Confidence Off-Policy (or Counterfactual) Variance Estimation (01/25/2021)
Many sequential decision-making systems leverage data collected using pr...

Quantile Off-Policy Evaluation via Deep Conditional Generative Learning (12/29/2022)
Off-Policy evaluation (OPE) is concerned with evaluating a new target po...

Decision Making in Non-Stationary Environments with Policy-Augmented Monte Carlo Tree Search (02/25/2022)
Decision-making under uncertainty (DMU) is present in many important pro...

Off-Policy Evaluation for Action-Dependent Non-Stationary Environments (01/24/2023)
Methods for sequential decision-making are often built upon a foundation...

Identification of Subgroups With Similar Benefits in Off-Policy Policy Evaluation (11/28/2021)
Off-policy policy evaluation methods for sequential decision making can ...

Blockwise Sequential Model Learning for Partially Observable Reinforcement Learning (12/10/2021)
This paper proposes a new sequential model learning architecture to solv...

The Choice Function Framework for Online Policy Improvement (10/01/2019)
There are notable examples of online search improving over hand-coded or...
