On Minimax Optimal Offline Policy Evaluation

09/12/2014
by Lihong Li et al.

This paper studies the off-policy evaluation problem, in which one aims to estimate the value of a target policy from a sample of observations collected under a different behavior policy. We first consider the multi-armed bandit case, establish a minimax risk lower bound, and analyze the risk of two standard estimators. We show, and verify in simulation, that one estimator is minimax optimal up to a constant factor, while the other can be arbitrarily worse, despite its empirical success and popularity. The results are applied to related problems in contextual bandits and fixed-horizon Markov decision processes, and are also connected to semi-supervised learning.
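The abstract leaves the two standard estimators unnamed. As a rough illustration of the setting, the sketch below implements the two usual candidates for off-policy value estimation in the multi-armed bandit case: the importance-sampling (likelihood-ratio) estimator and the plug-in (regression) estimator. This is a minimal sketch under assumed function names and a made-up simulation setup; it is not code from the paper, and it does not indicate which of the two estimators the paper proves minimax optimal.

    import numpy as np

    def importance_sampling_estimate(rewards, behavior_probs, target_probs):
        # Likelihood-ratio estimator: reweight each logged reward by
        # pi_target(a) / pi_behavior(a) for the action actually taken.
        weights = target_probs / behavior_probs
        return float(np.mean(weights * rewards))

    def plug_in_estimate(actions, rewards, target_policy, n_actions):
        # Regression (plug-in) estimator: estimate each arm's mean reward
        # from the logged data, then average under the target policy.
        mu_hat = np.zeros(n_actions)
        for a in range(n_actions):
            mask = actions == a
            if mask.any():
                mu_hat[a] = rewards[mask].mean()
        return float(np.dot(target_policy, mu_hat))

    # Toy simulation (all numbers are illustrative assumptions).
    rng = np.random.default_rng(0)
    n, K = 10_000, 3
    behavior = np.array([0.6, 0.3, 0.1])    # logging (behavior) policy
    target = np.array([0.1, 0.3, 0.6])      # target policy to evaluate
    true_means = np.array([0.2, 0.5, 0.8])  # unknown mean reward per arm

    actions = rng.choice(K, size=n, p=behavior)
    rewards = rng.binomial(1, true_means[actions]).astype(float)

    print("importance sampling:", importance_sampling_estimate(
        rewards, behavior[actions], target[actions]))
    print("plug-in:", plug_in_estimate(actions, rewards, target, K))
    print("true value:", float(np.dot(target, true_means)))

In a benign setup like this one, both estimators recover the true value; the paper's contribution concerns worst-case (minimax) risk, where the two behave very differently.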

Related research

03/03/2020 · MOTS: Minimax Optimal Thompson Sampling
Thompson sampling is one of the most widely used algorithms for many onl...

05/30/2023 · Sharp high-probability sample complexities for policy evaluation with linear function approximation
This paper is concerned with the problem of policy evaluation with linea...

09/21/2022 · Off-Policy Risk Assessment in Markov Decision Processes
Addressing such diverse ends as safety alignment with human preferences,...

10/19/2021 · Stateful Offline Contextual Policy Evaluation and Learning
We study off-policy evaluation and learning from sequential data in a st...

01/19/2021 · Minimax Off-Policy Evaluation for Multi-Armed Bandits
We study the problem of off-policy evaluation in the multi-armed bandit ...

10/24/2021 · Off-Policy Evaluation in Partially Observed Markov Decision Processes
We consider off-policy evaluation of dynamic treatment rules under the a...

01/05/2021 · Off-Policy Evaluation of Slate Policies under Bayes Risk
We study the problem of off-policy evaluation for slate bandits, for the...
