Off-Policy Evaluation of Bandit Algorithm from Dependent Samples under Batch Update Policy

10/23/2020
by Masahiro Kato, et al.

The goal of off-policy evaluation (OPE) is to evaluate a new policy using historical data obtained via a behavior policy. However, because a contextual bandit algorithm updates its policy based on past observations, the samples are not independent and identically distributed (i.i.d.). This paper tackles the problem by constructing an estimator from a martingale difference sequence (MDS) for the dependent samples. In the data-generating process, we do not assume that the policy converges; instead, we assume that the policy keeps the same conditional probability of choosing an action throughout each batch period. Under this setting, we derive an asymptotically normal estimator of the value of an evaluation policy. As a further advantage, the batch-based approach simultaneously solves the deficient support problem. Using benchmark and real-world datasets, we experimentally confirm the effectiveness of the proposed method.
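To make the batch-update setting concrete, the following is a minimal sketch and not the estimator proposed in the paper. It assumes a context-free (multi-armed) simplification, Bernoulli rewards, and an epsilon-greedy behavior policy that is frozen within each batch and updated only between batches. It then applies standard inverse probability weighting (IPW) with the logged action probabilities. The names `simulate_batched_bandit` and `ipw_value`, and all parameter values, are hypothetical and chosen for illustration only.

```python
import random


def simulate_batched_bandit(num_batches, batch_size, true_means, rng, eps=0.3):
    """Log data from a behavior policy that changes only between batches.

    Returns a list of (behavior_probs, samples) pairs, where samples is a
    list of (action, reward) tuples drawn under behavior_probs.
    Assumption: epsilon-greedy updates on running reward means.
    """
    k = len(true_means)
    probs = [1.0 / k] * k          # first batch: uniform exploration
    sums = [0.0] * k
    counts = [0] * k
    batches = []
    for _ in range(num_batches):
        samples = []
        for _ in range(batch_size):
            action = rng.choices(range(k), weights=probs)[0]
            reward = 1.0 if rng.random() < true_means[action] else 0.0
            samples.append((action, reward))
        # Store the probabilities that were actually used for this batch.
        batches.append((list(probs), samples))
        # Update the policy only after the batch ends, using all past data.
        for action, reward in samples:
            sums[action] += reward
            counts[action] += 1
        means = [sums[a] / counts[a] if counts[a] else 0.0 for a in range(k)]
        best = max(range(k), key=lambda a: means[a])
        # Every action keeps probability at least eps / k.
        probs = [eps / k + (1.0 - eps) * (a == best) for a in range(k)]
    return batches


def ipw_value(batches, eval_policy):
    """Standard IPW estimate of the evaluation policy's value.

    Each term is weighted by eval_policy[a] / behavior_probs[a], using the
    behavior probabilities of the batch in which the sample was collected.
    """
    total = 0.0
    n = 0
    for behavior_probs, samples in batches:
        for action, reward in samples:
            total += eval_policy[action] / behavior_probs[action] * reward
            n += 1
    return total / n


if __name__ == "__main__":
    rng = random.Random(0)
    true_means = [0.2, 0.5, 0.8]       # hypothetical arm reward rates
    eval_policy = [0.1, 0.1, 0.8]      # hypothetical evaluation policy
    logged = simulate_batched_bandit(20, 500, true_means, rng)
    true_value = sum(p * m for p, m in zip(eval_policy, true_means))
    print("IPW estimate:", ipw_value(logged, eval_policy))
    print("True value:  ", true_value)
```

Because the behavior probabilities in each batch depend only on earlier batches, each weighted term minus the true value has conditional mean zero given the past, which is the martingale difference structure the abstract refers to. The sketch does not attempt to reproduce the paper's asymptotic normality result or its treatment of deficient support.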


research
06/12/2020

Confidence Interval for Off-Policy Evaluation from Dependent Samples via Bandit Algorithm: Approach from Standardized Martingales

This study addresses the problem of off-policy evaluation (OPE) from dep...
research
10/08/2020

Theoretical and Experimental Comparison of Off-Policy Evaluation from Dependent Samples

We theoretically and experimentally compare estimators for off-policy ev...
research
10/23/2020

A Practical Guide of Off-Policy Evaluation for Bandit Problems

Off-policy evaluation (OPE) is the problem of estimating the value of a ...
research
02/26/2020

Off-Policy Evaluation and Learning for External Validity under a Covariate Shift

We consider the evaluation and training of a new policy for the evaluati...
research
02/13/2020

Adaptive Experimental Design for Efficient Treatment Effect Estimation: Randomized Allocation via Contextual Bandit Algorithm

Many scientific experiments have an interest in the estimation of the av...
research
06/15/2021

Control Variates for Slate Off-Policy Evaluation

We study the problem of off-policy evaluation from batched contextual ba...
research
06/10/2020

Distributional Robust Batch Contextual Bandits

Policy learning using historical observational data is an important prob...
