Minimax Off-Policy Evaluation for Multi-Armed Bandits

01/19/2021
by Cong Ma, et al.

We study the problem of off-policy evaluation in the multi-armed bandit model with bounded rewards, and develop minimax rate-optimal procedures under three settings. First, when the behavior policy is known, we show that the Switch estimator, a method that alternates between the plug-in and importance sampling estimators, is minimax rate-optimal for all sample sizes. Second, when the behavior policy is unknown, we analyze performance in terms of the competitive ratio, thereby revealing a fundamental gap between the settings of known and unknown behavior policies. When the behavior policy is unknown, any estimator must have mean-squared error larger – relative to the oracle estimator equipped with the knowledge of the behavior policy – by a multiplicative factor proportional to the support size of the target policy. Moreover, we demonstrate that the plug-in approach achieves this worst-case competitive ratio up to a logarithmic factor. Third, we initiate the study of the partial knowledge setting in which it is assumed that the minimum probability taken by the behavior policy is known. We show that the plug-in estimator is optimal for relatively large values of the minimum probability, but is sub-optimal when the minimum probability is low. In order to remedy this gap, we propose a new estimator based on approximation by Chebyshev polynomials that provably achieves the optimal estimation error. Numerical experiments on both simulated and real data corroborate our theoretical findings.
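As a rough illustration of the three estimators named in the abstract, the sketch below implements a plug-in estimator, an importance sampling estimator, and one common form of a Switch estimator for a bandit with a finite number of arms. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `threshold` parameter, and the rule "use plug-in on arms whose likelihood ratio exceeds the threshold, importance sampling elsewhere" are illustrative choices, and the behavior policy is assumed to be known and strictly positive on every arm.

```python
import numpy as np


def empirical_means(actions, rewards, k):
    """Per-arm empirical mean reward; arms never observed get mean 0."""
    counts = np.bincount(actions, minlength=k)
    sums = np.bincount(actions, weights=rewards, minlength=k)
    return np.divide(sums, counts, out=np.zeros(k), where=counts > 0)


def plug_in(actions, rewards, target):
    """Plug-in estimate: target-weighted average of empirical arm means."""
    means = empirical_means(actions, rewards, len(target))
    return float(target @ means)


def importance_sampling(actions, rewards, target, behavior):
    """Importance sampling estimate; requires a known behavior policy."""
    rho = target[actions] / behavior[actions]  # likelihood ratios per sample
    return float(np.mean(rho * rewards))


def switch(actions, rewards, target, behavior, threshold):
    """Hypothetical Switch rule: plug-in where target/behavior > threshold,
    importance sampling on the remaining arms."""
    k = len(target)
    rho = target / behavior
    use_plugin = rho > threshold  # boolean mask over arms (assumed rule)
    means = empirical_means(actions, rewards, k)
    plug_part = np.sum(target[use_plugin] * means[use_plugin])
    # Only samples from arms outside the plug-in set contribute here.
    is_part = np.mean(rho[actions] * rewards * ~use_plugin[actions])
    return float(plug_part + is_part)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    behavior = np.array([0.6, 0.3, 0.1])      # assumed known behavior policy
    target = np.array([0.1, 0.3, 0.6])        # target policy to evaluate
    true_means = np.array([0.2, 0.5, 0.8])    # Bernoulli rewards in [0, 1]
    actions = rng.choice(3, size=1000, p=behavior)
    rewards = rng.binomial(1, true_means[actions]).astype(float)
    print(plug_in(actions, rewards, target))
    print(importance_sampling(actions, rewards, target, behavior))
    print(switch(actions, rewards, target, behavior, threshold=2.0))
```

With this rule, setting `threshold` to infinity recovers pure importance sampling, and setting it below every likelihood ratio recovers the plug-in estimate, so the threshold interpolates between the two.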

research
01/29/2023

SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits

In this paper, we study the problem of optimal data collection for polic...
research
03/18/2022

Approximate Function Evaluation via Multi-Armed Bandits

We study the problem of estimating the value of a known smooth function ...
research
06/03/2021

A Closer Look at the Worst-case Behavior of Multi-armed Bandit Algorithms

One of the key drivers of complexity in the classical (stochastic) multi...
research
09/12/2014

On Minimax Optimal Offline Policy Evaluation

This paper studies the off-policy evaluation problem, where one aims to ...
research
06/08/2019

Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling

Motivated by the many real-world applications of reinforcement learning ...
research
02/18/2020

Adaptive Estimator Selection for Off-Policy Evaluation

We develop a generic data-driven method for estimator selection in off-p...
research
06/28/2019

Adaptive Sequential Experiments with Unknown Information Flows

Systems that make sequential decisions in the presence of partial feedba...
