Confidence Interval for Off-Policy Evaluation from Dependent Samples via Bandit Algorithm: Approach from Standardized Martingales

06/12/2020
by   Masahiro Kato, et al.

This study addresses the problem of off-policy evaluation (OPE) from dependent samples collected by a bandit algorithm. The goal of OPE is to evaluate a new policy using historical data generated by the behavior policies of a bandit algorithm. Because a bandit algorithm updates its policy based on past observations, the resulting samples are not independent and identically distributed (i.i.d.). Nevertheless, several existing OPE methods ignore this dependence and assume that the samples are i.i.d. In this study, we address the problem by constructing an estimator from a standardized martingale difference sequence. To standardize the sequence, we consider using evaluation data or sample splitting with a two-step estimation. This technique yields an estimator that is asymptotically normal without restricting the class of behavior policies. In experiments, the proposed estimator outperforms existing methods, which assume that the behavior policy converges to a time-invariant policy.
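To make the construction concrete, the following is a minimal sketch of how a confidence interval can be built from per-round scores standardized by estimated conditional standard deviations. It illustrates the general standardized-martingale technique, not the paper's exact estimator; the function name, the AIPW-style form of the scores, and the availability of conditional standard deviation estimates `sigma` are all assumptions of this sketch.

```python
import numpy as np
from scipy.stats import norm

def standardized_mds_ci(scores, sigma, alpha=0.05):
    """Point estimate and (1 - alpha) confidence interval from per-round
    scores standardized by estimated conditional standard deviations.
    A sketch under stated assumptions, not the paper's exact construction.

    scores[t] : an unbiased-per-round score, e.g. an AIPW-style term
                w_t * r_t + (1 - w_t) * mu_hat_t, with importance weight
                w_t = pi_e(a_t | x_t) / pi_t(a_t | x_t), where pi_t is the
                behavior policy the bandit algorithm used at round t
                (known at logging time, so w_t needs no estimation)
    sigma[t]  : estimate of the conditional standard deviation of
                scores[t] given the past, fitted on data independent of
                round t (e.g., via sample splitting or evaluation data)
    """
    scores = np.asarray(scores, dtype=float)
    inv = 1.0 / np.asarray(sigma, dtype=float)
    T = len(scores)
    # Solving sum_t (scores[t] - theta) / sigma[t] = 0 for theta gives a
    # variance-weighted average. Each standardized summand is then
    # (approximately) a martingale difference with unit conditional
    # variance, so a martingale central limit theorem gives asymptotic
    # normality even though the behavior policy changes over time.
    theta_hat = np.sum(inv * scores) / np.sum(inv)
    half_width = norm.ppf(1.0 - alpha / 2.0) * np.sqrt(T) / np.sum(inv)
    return theta_hat, (theta_hat - half_width, theta_hat + half_width)

# Toy check with synthetic scores whose true mean is 0.5; sigma is set
# to its oracle value here purely for illustration.
rng = np.random.default_rng(0)
scores = 0.5 + rng.normal(scale=1.0, size=2000)
theta_hat, ci = standardized_mds_ci(scores, np.ones(2000))
print(theta_hat, ci)
```

Standardizing each summand by its conditional standard deviation is what removes the need for the behavior policy to converge to a time-invariant policy: the central limit theorem is applied to the standardized sum directly, rather than to the raw scores.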
