1. Introduction
Interactive bandit and reinforcement learning algorithms have been used to optimize decision making in many real-life scenarios such as precision medicine, recommender systems, and advertising. We often use these algorithms to maximize the expected reward, but they also produce log data valuable for evaluating and redesigning future decision making. For example, the logs of a news recommender system record which news article was presented and whether the user read it, giving the decision maker a chance to make its recommendations more relevant. Exploiting log data is, however, more difficult than conventional supervised machine learning, because the outcome is observed only for the action chosen by the algorithm, not for all the other actions the system could have taken. The logs are also biased, in that they overrepresent the actions favored by the algorithm used to collect the data. An online experiment, or A/B test, is a potential solution to this issue: it compares the performance of counterfactual algorithms in an online environment, enabling unbiased evaluation and comparison. However, A/B testing counterfactual algorithms is often difficult, since deploying a new policy to a real environment is time-consuming and may damage user satisfaction (Gilotte et al., 2018; Saito, 2020). This motivates us to study Off-policy Evaluation (OPE), which aims to estimate the performance of an evaluation policy using only log data collected by a behavior policy. Such an evaluation allows us to compare the performance of candidate policies safely and helps us decide which policy to deploy in the field. This offline evaluation approach thus has the potential to overcome the above issues with online A/B tests.
With growing interest in OPE, the research community has produced a number of estimators, including Direct Method (DM) (Beygelzimer and Langford, 2009), Inverse Probability Weighting (IPW) (Precup, 2000; Strehl et al., 2010), Self-Normalized IPW (SNIPW) (Swaminathan and Joachims, 2015), Doubly Robust (DR) (Dudík et al., 2014), Switch-DR (Wang et al., 2017), and Doubly Robust with Optimistic Shrinkage (DRos) (Su et al., 2020a). One emerging challenge with this trend is that practitioners need to select and tune appropriate hyperparameters of OPE estimators for their specific application (Su et al., 2020b; Voloshin et al., 2019).
For example, DM first estimates the expected reward function using an arbitrary machine learning method, then uses its estimate for OPE.
Therefore, one has to identify a good machine learning method to estimate the expected reward before the offline evaluation phase.
Identifying the appropriate machine learning method for DM is difficult, because its accuracy cannot be easily quantified from bandit data (Jiang and Li, 2016).
Sophisticated estimators such as Switch-DR (Wang et al., 2017) and DRos (Su et al., 2020a) show improved offline evaluation performance in some experiments.
However, these estimators have a larger number of hyperparameters to be tuned compared to the baseline estimators.
A difficulty here is that the estimation accuracy of OPE estimators is highly sensitive to the choice of hyperparameters, as implied in empirical studies (Voloshin et al., 2019; Saito et al., 2020).
When we rely on OPE in real-world applications, it is desirable to use an estimator that is robust to the choice of hyperparameters and achieves accurate evaluations without requiring significant hyperparameter tuning.
Moreover, we want the estimators to be robust to other possible configuration changes such as evaluation policies.
An estimator of this type is preferable, because tuning hyperparameters of OPE estimators with only logged bandit data is challenging in nature, and we often apply an estimator to several different policies to compare the performance of candidate policies offline.
The aim of this paper is thus to enable a safer OPE practice by developing a procedure to evaluate the estimators’ robustness.
Current dominant evaluation procedures. The current evaluation procedure used in OPE research is not suitable for evaluating the estimators’ robustness. Almost all OPE papers evaluate an estimator’s performance for a single given set of hyperparameters and an arbitrary evaluation policy (Saito et al., 2020; Dudík et al., 2014; Wang et al., 2017; Su et al., 2019; Su et al., 2020a; Narita et al., 2019; Vlassis et al., 2019; Liu et al., 2019; Farajtabar et al., 2018; Kato et al., 2020). Even though it is common to repeat trials with different random seeds to provide an estimate of the performance, this procedure cannot evaluate the estimators’ robustness to hyperparameter choices or to changes in evaluation policies, which is critical in real-world scenarios. The performance derived from this common procedure does not properly account for the uncertainty in offline evaluation, as the reported metric is a single random variable drawn from the distribution of the estimator’s performance. Consequently, choosing an appropriate OPE estimator is difficult, as robustness to hyperparameter choices or to changes in evaluation policies is not quantified in existing experiments.
Contributions. Motivated to promote reliable use of OPE in practice, we develop an interpretable and scalable evaluation procedure for OPE estimators that quantifies their robustness to the choice of hyperparameters and to possible changes in evaluation policies. Our evaluation procedure compares several OPE estimators as depicted in Figure 1. This figure compares the offline evaluation performance of IPW and DM by illustrating their accuracy distributions as we vary their hyperparameters, evaluation policies, and random seeds. The x-axis is the squared error in offline evaluation; a lower value indicates that an estimator is more accurate. The figure is visually interpretable, and in this case, we are confident that IPW is better: it has lower squared errors with high probability, is robust to changes in configurations, and is more accurate even in the worst case. In addition to developing the evaluation procedure, we have implemented open-source Python software, pyIEOE (https://github.com/sony/pyIEOE), so that researchers can easily implement our procedure in their experiments, and practitioners can identify the best estimator for their specific environment.

Using our procedure and software, we evaluate a wide variety of existing OPE estimators on Open Bandit Dataset (Saito et al., 2020) (Section 5) and several classification datasets (Appendix A). Through these extensive experiments, we demonstrate that IEOE can provide informative results, in particular about the estimators’ robustness to hyperparameter settings and evaluation policy changes, which could not be obtained using the typical experimental procedure in OPE research.
Finally, as a proof of concept, we use our procedure to select the best estimator for the offline evaluation of coupon treatment policies on a real-world e-commerce platform. The platform uses OPE to improve its coupon optimization policy safely without running A/B tests. However, the platform’s data scientists do not know which OPE estimator is appropriate for their setting. We apply our procedure to identify an appropriate estimator for the platform. This real-world application demonstrates how our procedure reduces the uncertainty and risk faced in real-world offline evaluation.
Our contributions are summarized as follows.

We develop an experimental procedure called IEOE that is useful for identifying robust estimators and avoiding the use of estimators sensitive to configuration changes.

We have implemented pyIEOE, open-source Python software that facilitates the use of our experimental procedure both in research and in practice.

We conduct comprehensive benchmark experiments on public datasets and demonstrate that IEOE is useful for identifying estimators sensitive to configuration changes, and thus can help avoid potential failures in OPE.

We apply IEOE to a realworld OPE application and demonstrate how this procedure helps us safely conduct OPE in practice.
2. Off-Policy Evaluation
2.1. Setup
We consider a general contextual bandit setting. Let $r \in [0, r_{\max}]$ denote a reward or outcome variable (e.g., whether a coupon assignment results in an increase in revenue) and $a \in \mathcal{A}$ be a discrete action. We let $x \in \mathcal{X}$ be a context vector (e.g., the user’s demographic profile) that the decision maker observes when picking an action. Rewards and contexts are sampled from unknown probability distributions $p(r \mid x, a)$ and $p(x)$, respectively. We call a function $\pi: \mathcal{X} \rightarrow \Delta(\mathcal{A})$ a policy. It maps each context $x \in \mathcal{X}$ into a distribution over actions, where $\pi(a \mid x)$ is the probability of taking action $a$ given context vector $x$.

Let $\mathcal{D} := \{(x_i, a_i, r_i)\}_{i=1}^{n}$ be a historical logged bandit feedback dataset with $n$ observations. $a_i$ is a discrete variable indicating which action in $\mathcal{A}$ is chosen for individual $i$. $r_i$ and $x_i$ denote the reward and the context observed for individual $i$. We assume that a logged bandit feedback dataset is generated by a behavior policy $\pi_b$ as follows:

$$\{(x_i, a_i, r_i)\}_{i=1}^{n} \sim \prod_{i=1}^{n} p(x_i)\, \pi_b(a_i \mid x_i)\, p(r_i \mid x_i, a_i),$$

where each context-action-reward triplet is sampled independently from the identical product distribution. Then, for a function $f(x, a, r)$, we use $\mathbb{E}_n[f] := n^{-1} \sum_{i=1}^{n} f(x_i, a_i, r_i)$ to denote its empirical expectation over $n$ observations in $\mathcal{D}$. We also use $q(x, a) := \mathbb{E}[r \mid x, a]$ to denote the mean reward function for a given context and action.
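To make this data-generating process concrete, the following sketch (ours, not from the paper) simulates a logged bandit feedback dataset; the particular choices of $p(x)$, $\pi_b$, and $p(r \mid x, a)$ are toy assumptions for illustration only.

```python
import numpy as np

def generate_logged_bandit_data(n, n_actions, dim, rng):
    """Draw n (context, action, reward) triplets i.i.d. from
    p(x) * pi_b(a|x) * p(r|x,a); every distribution here is a toy choice."""
    X = rng.normal(size=(n, dim))                        # x_i ~ p(x)
    theta = rng.normal(size=(dim, n_actions))            # parameterizes q(x, a)
    q = 1.0 / (1.0 + np.exp(-X @ theta))                 # mean reward q(x, a) = E[r | x, a]
    logits = 3.0 * q                                     # behavior policy prefers high-reward actions
    pi_b = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    actions = np.array([rng.choice(n_actions, p=p) for p in pi_b])
    rewards = rng.binomial(1, q[np.arange(n), actions])  # r_i ~ p(r | x_i, a_i)
    return X, actions, rewards, pi_b, q

rng = np.random.default_rng(0)
X, actions, rewards, pi_b, q = generate_logged_bandit_data(1000, 4, 5, rng)
```

Because the behavior policy here favors high-reward actions, the resulting logs exhibit exactly the kind of bias discussed in the introduction.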
In OPE, we are interested in using historical logged bandit data to estimate the following policy value of a given evaluation policy $\pi_e$, which might be different from $\pi_b$:

$$V(\pi_e) := \mathbb{E}_{x \sim p(x),\, a \sim \pi_e(a \mid x),\, r \sim p(r \mid x, a)}[r].$$
Estimating $V(\pi_e)$ before deploying $\pi_e$ in an online environment is useful in practice, because $\pi_e$ may perform poorly. Additionally, this makes it possible to select the evaluation policy that maximizes the policy value by comparing the estimated performance of several candidate policies without incurring additional implementation cost.
2.2. Existing OPE Estimators
Given the policy value as the estimand, the goal of researchers is to propose an accurate estimator. An OPE estimator $\hat{V}$ estimates the policy value of an arbitrary evaluation policy as $V(\pi_e) \approx \hat{V}(\pi_e; \mathcal{D}, \theta)$, where $\mathcal{D}$ is an available logged bandit feedback dataset and $\theta$ is a set of pre-defined hyperparameters of $\hat{V}$.
Below, we summarize the definitions and properties of several existing OPE estimators. We also summarize their builtin hyperparameters in Table 1.
OPE Estimators | Hyperparameters
Direct Method (DM) | $\hat{q}$, $K$
Inverse Probability Weighting with Pessimistic Shrinkage (IPWps) (Su et al., 2020a; Strehl et al., 2010) | $\lambda$, ($\hat{\pi}_b$)
Self-Normalized Inverse Probability Weighting (SNIPW) (Swaminathan and Joachims, 2015) | ($\hat{\pi}_b$)
Doubly Robust with Pessimistic Shrinkage (DRps) (Dudík et al., 2014; Su et al., 2020a) | $\hat{q}$, $K$, $\lambda$, ($\hat{\pi}_b$)
Self-Normalized Doubly Robust (SNDR) | $\hat{q}$, $K$, ($\hat{\pi}_b$)
Switch Doubly Robust (Switch-DR) (Wang et al., 2017) | $\hat{q}$, $K$, $\tau$, ($\hat{\pi}_b$)
Doubly Robust with Optimistic Shrinkage (DRos) (Su et al., 2020a) | $\hat{q}$, $K$, $\lambda$, ($\hat{\pi}_b$)
Note: $\hat{q}$ is an estimator for the mean reward function constructed by an arbitrary machine learning method. $K$ is the number of folds in the cross-fitting procedure. $\hat{\pi}_b$ is an estimated behavior policy; it is unnecessary when we know the true behavior policy, and is thus in parentheses. $\lambda$ and $\tau$ are non-negative hyperparameters for defining the corresponding estimators.
Direct Method (DM)
DM (Beygelzimer and Langford, 2009) first trains a supervised machine learning method, such as ridge regression, to estimate the mean reward function $q(x, a)$. DM then estimates the policy value as

$$\hat{V}_{\mathrm{DM}}(\pi_e; \mathcal{D}, \hat{q}) := \mathbb{E}_{n}\left[\sum_{a \in \mathcal{A}} \pi_e(a \mid x_i)\, \hat{q}(x_i, a)\right],$$

where $\hat{q}(x, a)$ is the estimated mean reward function. If $\hat{q}$ is a good approximation to the mean reward function, this estimator accurately estimates the policy value of the evaluation policy. If $\hat{q}$ fails to approximate the mean reward function well, however, the final estimator tends to fail in OPE.
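The DM estimate is a one-line computation once $\pi_e$ and $\hat{q}$ have been evaluated on the logged contexts; a minimal sketch (our own helper, not the paper's implementation):

```python
import numpy as np

def direct_method(pi_e, q_hat):
    """DM estimate: E_n[ sum_a pi_e(a | x_i) * q_hat(x_i, a) ].
    pi_e and q_hat are (n, |A|) arrays evaluated on the logged contexts."""
    return float(np.mean(np.sum(pi_e * q_hat, axis=1)))
```

Note that DM never touches the logged actions or rewards directly, so its accuracy is entirely determined by how well $\hat{q}$ approximates $q$.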
Inverse Probability Weighting (IPW)
To alleviate the issue with DM, researchers often use IPW (Precup, 2000; Strehl et al., 2010). IPW reweights the rewards by the ratio of the evaluation policy to the behavior policy as

$$\hat{V}_{\mathrm{IPW}}(\pi_e; \mathcal{D}) := \mathbb{E}_{n}\left[w(x_i, a_i)\, r_i\right],$$

where $w(x, a) := \pi_e(a \mid x) / \pi_b(a \mid x)$ is called the importance weight. When the behavior policy is known, IPW is unbiased and consistent for the policy value. However, it can have high variance, especially when the evaluation policy deviates significantly from the behavior policy. To reduce the variance of IPW, the following weight clipping is often applied:

$$\hat{V}_{\mathrm{IPWps}}(\pi_e; \mathcal{D}, \lambda) := \mathbb{E}_{n}\left[\min\{w(x_i, a_i), \lambda\}\, r_i\right],$$

where $\lambda \ge 0$ is a clipping hyperparameter. A lower value of $\lambda$ greatly reduces the variance while introducing a large bias. Following Su et al. (2020a), we call IPW with weight clipping IPW with Pessimistic Shrinkage (IPWps). When $\lambda \rightarrow \infty$, IPWps is identical to IPW.
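A sketch of IPW with clipping (ours, for illustration); setting the clipping level to infinity recovers plain IPW:

```python
import numpy as np

def ipw_ps(rewards, w, lam=np.inf):
    """IPWps estimate: E_n[ min{w_i, lam} * r_i ], where
    w_i = pi_e(a_i | x_i) / pi_b(a_i | x_i); lam = inf recovers plain IPW."""
    return float(np.mean(np.minimum(w, lam) * rewards))
```

Clipping caps the contribution of any single observation, which is exactly the bias-variance trade-off described above.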
Doubly Robust (DR)
DR (Dudík et al., 2014) combines DM and IPW as follows:

$$\hat{V}_{\mathrm{DR}}(\pi_e; \mathcal{D}, \hat{q}) := \mathbb{E}_{n}\left[\sum_{a \in \mathcal{A}} \pi_e(a \mid x_i)\, \hat{q}(x_i, a) + w(x_i, a_i)\left(r_i - \hat{q}(x_i, a_i)\right)\right].$$

DR uses the estimated mean reward function as a control variate to decrease the variance of IPW. It is also doubly robust in that it is consistent for the policy value if either the importance weight or the mean reward estimator is accurate. Weight clipping can also be applied to DR as follows:

$$\hat{V}_{\mathrm{DRps}}(\pi_e; \mathcal{D}, \hat{q}, \lambda) := \mathbb{E}_{n}\left[\sum_{a \in \mathcal{A}} \pi_e(a \mid x_i)\, \hat{q}(x_i, a) + \min\{w(x_i, a_i), \lambda\}\left(r_i - \hat{q}(x_i, a_i)\right)\right],$$

where $\lambda \ge 0$ is a clipping hyperparameter. DR with weight clipping is called DR with Pessimistic Shrinkage (DRps). When $\lambda \rightarrow \infty$, DRps is identical to DR.
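DRps adds a clipped importance-weighted correction on top of the DM term; a sketch (our own code, with the same array conventions as the earlier sketches):

```python
import numpy as np

def dr_ps(rewards, actions, w, pi_e, q_hat, lam=np.inf):
    """DRps estimate: the DM term plus a clipped importance-weighted
    correction, E_n[ sum_a pi_e(a|x_i) q_hat(x_i,a)
                     + min{w_i, lam} * (r_i - q_hat(x_i, a_i)) ]."""
    n = len(rewards)
    dm_term = np.sum(pi_e * q_hat, axis=1)
    correction = np.minimum(w, lam) * (rewards - q_hat[np.arange(n), actions])
    return float(np.mean(dm_term + correction))
```

When $\hat{q}$ exactly matches the observed rewards at the logged actions, the correction term vanishes and DRps reduces to DM, which illustrates the control variate role of $\hat{q}$.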
Self-Normalized Estimators
SNIPW (Swaminathan and Joachims, 2015) is an approach to address the variance issue of IPW. It estimates the policy value by dividing the sum of weighted rewards by the sum of importance weights:

$$\hat{V}_{\mathrm{SNIPW}}(\pi_e; \mathcal{D}) := \frac{\mathbb{E}_{n}\left[w(x_i, a_i)\, r_i\right]}{\mathbb{E}_{n}\left[w(x_i, a_i)\right]}.$$

SNIPW is more stable than IPW, because the policy value estimated by SNIPW is bounded in the support of rewards, and its conditional variance given action and context is bounded by the conditional variance of the rewards (Kallus and Uehara, 2019). IPW does not have these properties. We can define Self-Normalized Doubly Robust (SNDR) in a similar manner:

$$\hat{V}_{\mathrm{SNDR}}(\pi_e; \mathcal{D}, \hat{q}) := \mathbb{E}_{n}\left[\sum_{a \in \mathcal{A}} \pi_e(a \mid x_i)\, \hat{q}(x_i, a) + \frac{w(x_i, a_i)}{\mathbb{E}_{n}[w(x_i, a_i)]}\left(r_i - \hat{q}(x_i, a_i)\right)\right].$$
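Both self-normalized estimators can be sketched as follows (our own code); the normalization by the mean importance weight is what keeps the SNIPW estimate inside the support of the rewards:

```python
import numpy as np

def snipw(rewards, w):
    """SNIPW estimate: E_n[w_i r_i] / E_n[w_i]; always lies in the
    support of the rewards, unlike plain IPW."""
    return float(np.sum(w * rewards) / np.sum(w))

def sndr(rewards, actions, w, pi_e, q_hat):
    """SNDR estimate: DR with the correction weights normalized by E_n[w]."""
    n = len(rewards)
    dm_term = np.sum(pi_e * q_hat, axis=1)
    correction = (w / np.mean(w)) * (rewards - q_hat[np.arange(n), actions])
    return float(np.mean(dm_term + correction))
```

With a zero reward model, SNDR collapses to SNIPW, which is a quick sanity check for an implementation.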
Switch Estimator
DR can still be subject to the variance issue, particularly when the importance weights are large due to low overlap between behavior and evaluation policies. Switch-DR (Wang et al., 2017) aims to further reduce the variance by switching to DM where the importance weight is large:

$$\hat{V}_{\mathrm{SwitchDR}}(\pi_e; \mathcal{D}, \hat{q}, \tau) := \mathbb{E}_{n}\left[\sum_{a \in \mathcal{A}} \pi_e(a \mid x_i)\, \hat{q}(x_i, a) + w(x_i, a_i)\, \mathbb{I}\{w(x_i, a_i) \le \tau\}\left(r_i - \hat{q}(x_i, a_i)\right)\right],$$

where $\mathbb{I}\{\cdot\}$ is the indicator function and $\tau \ge 0$ is a hyperparameter. Switch-DR interpolates between DM and DR: when $\tau = 0$, it is identical to DM, while $\tau \rightarrow \infty$ yields DR.

Doubly Robust with Optimistic Shrinkage (DRos)
Su et al. (2020a) propose DRos based on a new weight function that directly minimizes sharp bounds on the mean-squared-error (MSE) of the resulting estimator. DRos is defined as

$$\hat{V}_{\mathrm{DRos}}(\pi_e; \mathcal{D}, \hat{q}, \lambda) := \mathbb{E}_{n}\left[\sum_{a \in \mathcal{A}} \pi_e(a \mid x_i)\, \hat{q}(x_i, a) + \hat{w}(x_i, a_i; \lambda)\left(r_i - \hat{q}(x_i, a_i)\right)\right],$$

where $\lambda \ge 0$ is a hyperparameter and $\hat{w}$ is defined as

$$\hat{w}(x, a; \lambda) := \frac{\lambda}{w(x, a)^2 + \lambda}\, w(x, a).$$

When $\lambda = 0$, $\hat{w}(x, a; \lambda) = 0$, leading to DM. On the other hand, as $\lambda \rightarrow \infty$, $\hat{w}(x, a; \lambda) \rightarrow w(x, a)$, leading to DR.
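The only difference between Switch-DR, DRos, and plain DR is the weight applied to the correction term; the two shrinkage schemes can be sketched as weight transformations (our own helpers):

```python
import numpy as np

def switch_dr_weight(w, tau):
    """Switch-DR keeps the importance weight only where w <= tau;
    elsewhere the estimator falls back to the DM term."""
    return w * (w <= tau)

def dr_os_weight(w, lam):
    """DRos shrunk weight: lam * w / (w^2 + lam).
    lam = 0 gives 0 (pure DM); lam -> inf approaches w (pure DR)."""
    if lam == 0:
        return np.zeros_like(w)
    return lam * w / (w ** 2 + lam)
```

Switch-DR truncates weights abruptly at the threshold, while DRos shrinks them smoothly, which is why the two estimators behave differently near the boundary.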
Cross-Fitting Procedure.
To obtain a reward estimator $\hat{q}$, we sometimes use cross-fitting to avoid the substantial bias that might arise due to overfitting (Narita et al., 2020). The cross-fitting procedure constructs a model-dependent estimator such as DM and DR as follows:

1. Take a $K$-fold random partition $(\mathcal{D}_k)_{k=1}^{K}$ of the logged bandit feedback dataset $\mathcal{D}$ such that the size of each fold is $n_k = n / K$. Also, for each $k = 1, \ldots, K$, define $\mathcal{D}_{-k} := \mathcal{D} \setminus \mathcal{D}_k$.

2. For each $k$, construct a reward estimator $\hat{q}_k$ using the subset of data $\mathcal{D}_{-k}$.

3. Given $\{\hat{q}_k\}_{k=1}^{K}$ and a model-dependent estimator $\hat{V}$, estimate the policy value by $K^{-1} \sum_{k=1}^{K} \hat{V}(\pi_e; \mathcal{D}_k, \hat{q}_k)$.
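The steps above can be sketched generically; `fit` and `estimate` are our own illustrative interface standing in for the reward-model trainer and the model-dependent estimator:

```python
import numpy as np

def cross_fitted_estimate(n, K, fit, estimate, rng):
    """Cross-fitting sketch: for each fold k, train q_hat_k on the other
    folds (fit), evaluate the model-dependent estimator on fold k
    (estimate), and average over the K folds."""
    folds = np.array_split(rng.permutation(n), K)
    values = []
    for k in range(K):
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        q_hat_k = fit(train_idx)                 # reward model trained without fold k
        values.append(estimate(folds[k], q_hat_k))
    return float(np.mean(values))
```

The key property is that each fold's reward model never sees the data it is evaluated on, which removes the overfitting bias of training and estimating on the same samples.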
Hyperparameter Tuning Procedure.
As Table 1 summarizes, most OPE estimators have hyperparameters such as $\hat{q}$, $K$, $\lambda$, and $\tau$ that should be set appropriately. Su et al. (2020a) propose to select a set of hyperparameters based on the following criterion:

$$\hat{\theta} := \arg\min_{\theta \in \Theta} \; \widehat{\mathrm{Bias}}(\theta)^2 + \hat{\mathbb{V}}(\theta), \quad (1)$$

where $\hat{\mathbb{V}}(\theta)$ is the sample variance in OPE, and $\widehat{\mathrm{Bias}}(\theta)$ is the upper bound of the bias estimated using $\mathcal{D}$. There are several ways to derive the bias upper bound, as stated in Su et al. (2020a). One way is the direct bias estimation:

$$\widehat{\mathrm{Bias}}(\theta) := \mathbb{E}_{n}\left[\left(w(x_i, a_i) - \hat{w}(x_i, a_i; \theta)\right) r_i\right] + w_{\max}\sqrt{\frac{2 \log(1 / \delta)}{n}},$$

where $\delta$ is the confidence delta used to derive the high-probability upper bound, and $w_{\max}$ is the maximum importance weight. $\hat{w}(x, a; \theta)$ is the importance weight modified by a hyperparameter; for example, $\hat{w}(x, a; \lambda) = \min\{w(x, a), \lambda\}$ for IPWps and DRps, and $\hat{w}(x, a; \tau) = w(x, a)\, \mathbb{I}\{w(x, a) \le \tau\}$ for Switch-DR.
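A minimal sketch of this bias-variance criterion for tuning the clipping level of IPWps; note that the bias proxy below is our simplification of the direct bias estimation (the exact bound in Su et al. (2020a) differs in its constants), and `delta` and the candidate grid are illustrative assumptions:

```python
import numpy as np

def select_lambda(rewards, w, candidates, delta=0.05):
    """Pick lambda minimizing (estimated bias upper bound)^2 + sample variance.
    Bias proxy: |E_n[(w_lam - w) r]| + w_max * sqrt(log(1/delta) / n),
    a simplification of the direct bias estimation in Su et al. (2020a)."""
    n = len(rewards)
    w_max = float(np.max(w))
    best_lam, best_score = None, np.inf
    for lam in candidates:
        w_lam = np.minimum(w, lam)
        terms = w_lam * rewards                      # per-sample IPWps terms
        bias = abs(float(np.mean((w_lam - w) * rewards))) \
            + w_max * np.sqrt(np.log(1.0 / delta) / n)
        variance = float(np.var(terms, ddof=1)) / n  # variance of the sample mean
        score = bias ** 2 + variance
        if score < best_score:
            best_lam, best_score = lam, score
    return best_lam
```

Because the bias term itself must be estimated from the same logged data, this tuning step can overfit, which is one of the motivations for evaluating estimator robustness rather than relying on tuning alone.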
3. Evaluating Offline Evaluation
So far, we have seen that the OPE community has developed a variety of OPE estimators. In their experiments, OPE research papers compare the performance (estimation accuracy) of existing estimators and report the results. A typical and dominant way to do so is to estimate the following mean-squared-error (MSE) as the estimator’s performance metric:

$$\mathrm{MSE}(\hat{V}; \theta, \pi_e) := \mathbb{E}_{\mathcal{D}}\left[\left(V(\pi_e) - \hat{V}(\pi_e; \mathcal{D}, \theta)\right)^2\right],$$

where $V(\pi_e)$ is the policy value and $\hat{V}$ is an estimator to be evaluated. MSE measures the squared distance between the policy value and its estimated value; a lower value means a more accurate OPE by $\hat{V}$. Researchers often calculate the MSE of each estimator several times with different random seeds and report its mean.
The issue with this procedure is that most estimators have hyperparameters that should be chosen properly before the estimation process. Moreover, the estimation performance can vary when evaluating different evaluation policies (especially in finite sample cases). However, the current dominant procedure for evaluating OPE estimators uses only one set of hyperparameters and an arbitrary evaluation policy for each estimator, and then discusses the derived results (Wang et al., 2017; Farajtabar et al., 2018; Su et al., 2019; Agarwal et al., 2017; Vlassis et al., 2019). (This is why we denote MSE as $\mathrm{MSE}(\hat{V}; \theta, \pi_e)$, highlighting that it depends on the estimator’s hyperparameters $\theta$ and the evaluation policy $\pi_e$.) This type of simplified experimental procedure does not accurately capture the uncertainty in the performance of OPE estimators. Specifically, it cannot evaluate the robustness to hyperparameter choices and evaluation policy settings, as the reported score is for a single arbitrary set of hyperparameters and a single evaluation policy.
What is often critical in offline evaluation practice is to identify an estimator that performs well for a variety of evaluation policies without problem-specific hyperparameter tuning. An estimator robust to changes in such configurations can be used reliably in uncertain real-life scenarios. In contrast, an estimator that performs well only on a narrow set of hyperparameters and evaluation policies entails a higher risk of failure in its particular application, so we want to avoid using such sensitive estimators. In the next section, we describe an experimental procedure that can evaluate the estimators’ robustness to experimental configurations, leading to informative estimator comparisons in OPE research and reliable estimator selection in practice.
4. Interpretable Evaluation for Offline Evaluation
Here, we outline our experimental protocol, Interpretable Evaluation for Offline Evaluation (IEOE). As we have discussed, the expected value of a performance metric (e.g., MSE) alone is insufficient to properly evaluate the real-world applicability of an estimator, as it discards information about robustness to hyperparameter choices and changes in evaluation policies. We can conduct a more informative experiment by estimating the cumulative distribution function (CDF) of an estimator’s performance, as done in some studies on reinforcement learning (Engstrom et al., 2020; Jordan et al., 2020; Jordan et al., 2018). The CDF is the function $F_Z(z) := \mathbb{P}(Z \le z)$, where $Z$ is a random variable representing the performance metric of an estimator (e.g., the squared error); in the following, without loss of generality, we assume that a lower value of $Z$ means a more accurate OPE. $F_Z$ maps a performance metric to the probability that the estimator achieves a performance better than or equal to that score.
When we have a set $\mathcal{Z} = \{z_1, \ldots, z_m\}$ of $m$ realizations of $Z$, we can estimate the CDF by

$$\hat{F}_Z(z) := \frac{1}{m} \sum_{j=1}^{m} \mathbb{I}\{z_j \le z\}. \quad (2)$$

Using the CDF for evaluating OPE estimators allows researchers to compare different estimators with respect to their robustness to varying configurations. Specifically, we can examine the CDF of the estimators’ performance visually or compute summary scores of the CDF as the estimators’ performance metric. For example, we can score an estimator by the area under the CDF curve up to a threshold $z_{\max}$ (AU-CDF):

$$\mathrm{AU\text{-}CDF}(z_{\max}) := \int_{0}^{z_{\max}} F_Z(z)\, dz.$$

Another possible summary score is the conditional value-at-risk (CVaR), which computes the expected value of the random variable $Z$ above a given probability $\alpha$:

$$\mathrm{CVaR}_{\alpha}(Z) := \mathbb{E}\left[Z \mid Z \ge F_Z^{-1}(\alpha)\right],$$

where $F_Z^{-1}$ is the inverse of the CDF. When using CVaR, the estimators are evaluated based on the average performance of the worst $(1 - \alpha) \times 100$ percent of trials. For example, $\mathrm{CVaR}_{0.7}(Z)$ is the average performance of the worst 30% of trials. In addition, we can use the standard deviation (Std), $\mathrm{Std}(Z)$, and some other moments such as the skewness of $Z$ as summary scores.

IEOE with Synthetic or Classification Data
In research papers, it is common to use synthetic or classification data to evaluate OPE estimators (Dudík et al., 2014; Wang et al., 2017; Su et al., 2020a; Kallus and Uehara, 2019; Kallus et al., 2021). We first present how to apply the IEOE procedure to synthetic or classification data in Algorithm 1. To evaluate the estimation performance of $\hat{V}$, we need to specify a candidate set of hyperparameters $\Theta$, a set of evaluation policies $\Pi_e$, a hyperparameter sampling function $\phi$, and a set of random seeds $S$. Then, for every seed $s \in S$, the algorithm samples a set of hyperparameters $\theta$ based on the sampler $\phi$. What kind of $\phi$ we use can depend on the purpose of the evaluation of OPE. For example, we can use a hyperparameter tuning method for OPE estimators, such as the method described in Section 2.2, as $\phi$, assuming practitioners use it in real-world applications. When we cannot implement such a hyperparameter tuning method due to its implementation cost or the risk of overfitting, we can be conservative and use the uniform distribution over $\Theta$ as $\phi$. Next, the IEOE algorithm samples an evaluation policy $\pi_e$ from the discrete uniform distribution over $\Pi_e$. Then, it replicates the data generating process using bootstrap sampling from $\mathcal{D}$: a bootstrapped logged bandit feedback dataset $\mathcal{D}^{*}$ is defined such that each tuple is sampled independently from $\mathcal{D}$ with replacement. Finally, for the sampled tuple $(\theta, \pi_e, \mathcal{D}^{*})$, it computes a performance metric (e.g., the squared error). After applying Algorithm 1 to several estimators and obtaining the empirical CDF of their evaluation performances, we can visualize them or compute summary scores to evaluate and compare the estimators’ robustness.

IEOE with Real-World Data
It is also possible to apply IEOE to real-world logged bandit data. Algorithm 2 presents the version of IEOE that can be used in real-world applications. To evaluate the performance of $\hat{V}$ with real-world data, we need to prepare several logged bandit feedback datasets, where each dataset is collected by a different policy. Then, for every seed $s \in S$, the algorithm samples a set of hyperparameters $\theta$ based on a sampler $\phi$. Next, the algorithm samples an evaluation policy $\pi_e$ from the discrete uniform distribution. Then, the evaluation and test sets are defined by splitting the data, where the evaluation set $\mathcal{D}_{\mathrm{ev}}$ is used in OPE and the test set $\mathcal{D}_{\mathrm{te}}$ is used to calculate the ground-truth performance of $\pi_e$. Then, the algorithm replicates the environment using bootstrap sampling from $\mathcal{D}_{\mathrm{ev}}$: a bootstrapped logged bandit feedback dataset $\mathcal{D}_{\mathrm{ev}}^{*}$ is defined such that each tuple is sampled independently from $\mathcal{D}_{\mathrm{ev}}$ with replacement. Finally, for a sampled tuple $(\theta, \pi_e, \mathcal{D}_{\mathrm{ev}}^{*})$, it computes the squared error as

$$z := \left(\hat{V}(\pi_e; \mathcal{D}_{\mathrm{ev}}^{*}, \theta) - V_{\mathrm{on}}(\pi_e)\right)^2,$$

where $V_{\mathrm{on}}(\pi_e)$ is the on-policy estimate of the policy value of $\pi_e$ estimated with the test set.
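The empirical CDF of Eq. (2) and the summary scores of this section can be sketched as follows (our own implementation; AU-CDF is computed with a simple grid approximation):

```python
import numpy as np

def empirical_cdf(scores):
    """Return F_hat(z) = (1/m) * #{ z_j <= z }, per Eq. (2)."""
    z_sorted = np.sort(np.asarray(scores, dtype=float))
    return lambda z: np.searchsorted(z_sorted, z, side="right") / len(z_sorted)

def au_cdf(scores, z_max, grid=1000):
    """Area under the empirical CDF on [0, z_max]; larger is better."""
    F = empirical_cdf(scores)
    zs = np.linspace(0.0, z_max, grid)
    return float(np.mean([F(z) for z in zs]) * z_max)

def cvar(scores, alpha=0.7):
    """Average of the worst (1 - alpha) share of scores; lower is better."""
    z_sorted = np.sort(np.asarray(scores, dtype=float))
    k = int(np.ceil(alpha * len(z_sorted)))
    return float(np.mean(z_sorted[k:])) if k < len(z_sorted) else float(z_sorted[-1])
```

Feeding the set of squared errors produced by Algorithm 1 or 2 into these functions yields the curves and summary scores used in the experiments below.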
5. Experiments with Open Bandit Dataset
In this section, we use IEOE to evaluate the robustness of a wide variety of OPE estimators on Open Bandit Dataset (OBD) (https://research.zozo.com/data.html). We run the experiments using our pyIEOE software, with which anyone can easily replicate the results (the code is available at https://github.com/sony/pyIEOE/benchmark). We also provide a detailed description of the software in Appendix B.
5.1. Setup
OBD is a set of logged bandit feedback datasets collected on a large-scale fashion e-commerce platform and provided by Saito et al. (2020). There are three campaigns: “ALL”, “Men”, and “Women”. We use randomly subsampled data of sizes 30,000 and 300,000 from the “ALL” campaign. The dataset contains user context as the feature vector $x$, a fashion item recommendation as the action $a$, and a click indicator as the reward $r$. The dimension of the feature vector is 20, and the number of actions is 80.
The dataset consists of subsets of data collected by two different policies, the uniform random policy and the Bernoulli Thompson Sampling policy (Thompson, 1933). We let $\mathcal{D}_A$ denote the dataset collected by the uniform random policy $\pi_A$ and $\mathcal{D}_B$ denote that collected by the Bernoulli Thompson Sampling policy $\pi_B$. We apply Algorithm 2 to obtain a set of SEs as the performance metric of the estimators.

Tables 2 and 3 describe the hyperparameter spaces for the OPE estimators and for the machine learning models used as reward estimators, respectively. Note (Table 2): LR/RR means that LogisticRegression (LR) is used when $r$ is binary and RidgeRegression (RR) is used otherwise; RF stands for RandomForest; $\hat{\pi}_b$ is an estimated behavior policy, which is unnecessary when we know the true behavior policy (we estimate the behavior policy only in the experiments with classification data in Appendix A), and is thus in parentheses. Note (Table 3): we follow the scikit-learn package as to the names of the hyperparameters.
5.2. Estimators and Hyperparameters
We use our protocol and evaluate DM, IPWps, SNIPW, DRps, SNDR, Switch-DR, and DRos in an interpretable manner.
In the experiment, we use the true behavior policy contained in the dataset to derive the importance weights. In this setting, SNIPW is hyperparameter-free, while the other estimators need to be tested for robustness to the choice of the pre-defined hyperparameters and to changes in evaluation policies. In addition, we use the hyperparameter tuning method described in Section 2.2 to tune estimator-specific hyperparameters such as $\lambda$ and $\tau$. Then, we use RandomizedSearchCV implemented in scikit-learn to tune the hyperparameters of the reward estimator $\hat{q}$. Tables 2 and 3 describe the hyperparameter spaces for each estimator. Finally, we fix the set of random seeds $S$.
Table 4: Summary scores (Mean, AU-CDF, CVaR, and Std) of the estimators’ squared error for the small ($n = 30{,}000$) and large ($n = 300{,}000$) settings.
Note: A larger value is better for AU-CDF, and a lower value is better for Mean, CVaR, and Std. The scores are normalized by dividing them by the best score among all estimators. Highlighting in the table marks the best, second-best, and worst estimators.
5.3. Results
Figure 2 visually compares the CDF of the estimators’ squared error. Table 4 reports the Mean, AU-CDF, CVaR, and Std as summary scores.
When the dataset size is small ($n = 30{,}000$), we see that the typical way of reporting only the mean of the squared error cannot tell which estimator is accurate or robust. However, other summary scores show that DM has more robust and stable estimation performance than the other estimators, having lower CVaR and Std. Moreover, Figure 2 provides more detailed information about the estimators’ performance. Specifically, DM performs better in the worst case, while the other estimators show better performance in the low squared error region. Thus, when we are conservative and prioritize the worst case performance, DM is the most appropriate choice; otherwise, other estimators might be a better choice. We cannot obtain this conclusion by comparing only the mean (typical metric) of the squared error.
When the dataset size is large ($n = 300{,}000$), we confirm that IPWps and SNIPW are more accurate than the model-based estimators. In particular, Figure 2 shows that IPWps performs better than the other estimators in all regions, meaning that we should use it whether we prioritize the best or the worst case performance.
Overall, the results indicate that the appropriate estimator can drastically change depending on the situation, such as the data size. Therefore, we argue that identifying a reasonable estimator before conducting OPE is essential in practice. Moreover, we demonstrate that the IEOE procedure provides more informative insight into the estimators’ performance than the typical metric.
6. Real-World Application
In this section, we apply the IEOE procedure to a realworld application.
6.1. Setup
To show how to use IEOE in a real-world application, we conducted a data collection experiment on a real e-commerce platform in September 2020. The platform wants to use OPE to improve the performance of its coupon optimization policy safely without conducting A/B tests. However, it does not know which estimator is appropriate for its specific application and environment. Therefore, we apply the IEOE procedure with the aim of providing a suitable estimator choice for the platform.
During the data collection experiment, we constructed three datasets by randomly assigning three different policies to users on the platform. In this application, $x$ is a user’s context vector, $a$ is a coupon assignment variable (there are four different types of coupons, i.e., $|\mathcal{A}| = 4$), and $r$ is either a user’s content consumption indicator (binary outcome) or the revenue from each user observed within the 7-day period after the coupon assignment (continuous outcome). The total number of users considered in the experiment was 39,687, and each of the three datasets contains approximately one third of the users.
Note that, in this application, there is a risk of overfitting due to the intensive hyperparameter tuning of OPE estimators, as the size of the logged bandit feedback data is not large. Moreover, the data scientists want to use an OPE estimator to evaluate the performance of several candidate policies. Therefore, we aim to find an estimator that performs stably for a wide range of evaluation policies with fewer hyperparameters.
6.2. Performance Metric
To apply our evaluation procedure, we need to define a performance metric (step 8 of Algorithm 2). We can do this using our real-world data. We first pick one of the three policies as the evaluation policy and regard the others as behavior policies: when one policy is chosen as the evaluation policy, its dataset serves as the test set and the remaining datasets serve as the evaluation set. Then, by applying Algorithm 2, we obtain a set of SEs to evaluate the robustness and real-world applicability of the estimators.
6.3. Estimators and Hyperparameters
We use the IEOE protocol to evaluate the robustness of DM, IPWps, SNIPW, DRps, SNDR, Switch-DR, and DRos. Then, we utilize the experimental results to help the platform’s data scientists choose an appropriate estimator.
During the data collection experiment, we logged the true action choice probabilities of the three policies, and thus SNIPW is hyperparameter-free. We use the hyperparameter spaces defined in Tables 2 and 3 for our real-world application. In addition, we use the hyperparameter tuning method described in Section 2.2 to tune estimator-specific hyperparameters such as $\lambda$ and $\tau$. Then, we use the uniform distribution as $\phi$ to sample the hyperparameters of the reward regression model $\hat{q}$. Finally, we fix the set of random seeds $S$.
Table 5: Summary scores (Mean, AU-CDF, CVaR, and Std) of the estimators’ squared error for the binary and continuous outcomes.
Note: Binary Outcome reports the results when the outcome is each user’s content consumption indicator; Continuous Outcome reports the results when the outcome is the revenue from each user observed within the 7-day period after the coupon assignment. A larger value is better for AU-CDF, and a lower value is better for Mean, CVaR, and Std. The scores are normalized by dividing them by the best score among all estimators. Highlighting in the table marks the best, second-best, and worst estimators.
6.4. Results
We applied Algorithm 2 to the above estimators for both the binary and continuous outcome data.
Figure 3 compares the CDF of the estimators’ squared error for each outcome. First, SNIPW is clearly the best estimator for the binary outcome case, achieving the best accuracy in almost all regions. We can also argue that SNIPW is preferable for the continuous outcome case, because it achieves the most accurate estimation in the worst case and is hyperparameter-free, although it underperforms DM in some regions. On the other hand, IPWps performs poorly for both outcomes, because our dataset is not large and some behavior policies are near deterministic, which makes IPWps unstable. Moreover, SwitchDR fails to accurately evaluate the performance of the evaluation policies. Thus, it is unsafe to use these estimators in our application, even though we tune their hyperparameters.
We additionally confirm the above observations in a quantitative manner. For both binary and continuous outcomes, we compute AUCDF, CVaR, and Std of the squared error for each OPE estimator. We report these summary scores in Table 5; the results demonstrate that SNIPW clearly outperforms the other estimators in almost all situations. In particular, SNIPW is the best with respect to CVaR and Std for both binary and continuous outcomes, showing that it is the most stable estimator in our environment. Moreover, SNIPW is hyperparameter-free, so overfitting is less likely to occur than with estimators that have hyperparameters to be tuned. Through this evaluation of OPE estimators, we concluded that the e-commerce platform should use SNIPW for its offline evaluation. After comprehensive accuracy and stability verification, the platform is now using SNIPW to improve its coupon optimization policy safely.
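To make the summary scores concrete, the following Python sketch (our own helper functions, independent of pyIEOE) computes AUCDF and CVaR from hypothetical arrays of squared errors:

```python
import numpy as np

def au_cdf(squared_errors, threshold):
    """Area under the CDF of squared errors, up to `threshold`.

    Larger is better: an estimator whose errors concentrate near zero
    accumulates more area below the threshold.
    """
    grid = np.linspace(0.0, threshold, 1000)
    cdf = np.array([np.mean(squared_errors <= z) for z in grid])
    return np.trapz(cdf, grid) / threshold  # normalized to [0, 1]

def cvar(squared_errors, alpha=0.7):
    """Conditional value at risk: mean of the worst (1 - alpha) fraction."""
    var = np.quantile(squared_errors, alpha)
    return squared_errors[squared_errors >= var].mean()

# Hypothetical squared errors of two estimators: one stable, one heavy-tailed.
rng = np.random.default_rng(0)
errors_a = rng.exponential(scale=1.0, size=10_000)
errors_b = rng.exponential(scale=3.0, size=10_000)

assert au_cdf(errors_a, threshold=2.0) > au_cdf(errors_b, threshold=2.0)
assert cvar(errors_a) < cvar(errors_b)
assert errors_a.std() < errors_b.std()
```

The stable estimator wins on all three scores, mirroring how the table above summarizes robustness.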
7. Conclusion and Future Work
In this paper, we argued that the current dominant evaluation procedure for OPE cannot evaluate the robustness of the estimators’ performance. In contrast, the IEOE procedure provides an interpretable way to evaluate how robust each estimator is to the choice of hyperparameters or changes in evaluation policies. We have also developed open-source software to streamline our interpretable evaluation procedure. It enables rapid benchmarking and validation of OPE estimators so that practitioners can spend more time on real decision making problems, and OPE researchers can focus on tackling advanced technical questions. We performed an extensive evaluation of a wide variety of OPE estimators and demonstrated that our experiments are more informative than the typical procedure, showing which estimators are more sensitive to configuration changes. Finally, we applied our procedure to a real-world application and demonstrated its practical usage.
Although our procedure is useful to evaluate the robustness of estimators, we need to prepare at least two logged bandit feedback datasets collected by different policies to apply it to realworld applications, as described in Algorithm 2. Thus, it would be beneficial to construct a procedure to enable the evaluation of OPE estimators with only logged bandit data collected by a single policy.
Acknowledgements.
The authors would like to thank Masahiro Nomura, Ryo Kuroiwa, and Richard Liu for their help in reviewing the paper. Additionally, we would like to thank the anonymous reviewers for their constructive reviews and discussions.

References
 Agarwal et al. (2017) Aman Agarwal, Soumya Basu, Tobias Schnabel, and Thorsten Joachims. 2017. Effective evaluation using logged bandit feedback from multiple loggers. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 687–696.
 Beygelzimer and Langford (2009) Alina Beygelzimer and John Langford. 2009. The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 129–138.
 Dua and Graff (2017) Dheeru Dua and Casey Graff. 2017. UCI machine learning repository. (2017).
 Dudík et al. (2014) Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. 2014. Doubly robust policy evaluation and optimization. Statist. Sci. 29, 4 (2014), 485–511.
 Engstrom et al. (2020) Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. 2020. Implementation Matters in Deep RL: A Case Study on PPO and TRPO. In International Conference on Learning Representations.
 Farajtabar et al. (2018) Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. 2018. More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, Vol. 80. PMLR, 1447–1456.
 Gilotte et al. (2018) Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. 2018. Offline A/B testing for recommender systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. 198–206.
 Jiang and Li (2016) Nan Jiang and Lihong Li. 2016. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, Vol. 48. PMLR, 652–661.
 Jordan et al. (2020) Scott Jordan, Yash Chandak, Daniel Cohen, Mengxue Zhang, and Philip Thomas. 2020. Evaluating the performance of reinforcement learning algorithms. In Proceedings of the 37th International Conference on Machine Learning. PMLR, 4962–4973.
 Jordan et al. (2018) Scott M Jordan, Daniel Cohen, and Philip S Thomas. 2018. Using cumulative distribution based performance analysis to benchmark models. In NeurIPS 2018 Workshop on Critiquing and Correcting Trends in Machine Learning.
 Kallus et al. (2021) Nathan Kallus, Yuta Saito, and Masatoshi Uehara. 2021. Optimal Off-Policy Evaluation from Multiple Logging Policies. In Proceedings of the 38th International Conference on Machine Learning, Vol. 139. PMLR, 5247–5256.
 Kallus and Uehara (2019) Nathan Kallus and Masatoshi Uehara. 2019. Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning. In Advances in Neural Information Processing Systems, Vol. 32. 3325–3334.
 Kato et al. (2020) Masahiro Kato, Shota Yasui, and Masatoshi Uehara. 2020. Off-Policy Evaluation and Learning for External Validity under a Covariate Shift. In Advances in Neural Information Processing Systems, Vol. 33. 49–61.
 Liu et al. (2019) Anqi Liu, Hao Liu, Anima Anandkumar, and Yisong Yue. 2019. Triply Robust Off-Policy Evaluation. arXiv preprint arXiv:1911.05811 (2019).

 Narita et al. (2019) Yusuke Narita, Shota Yasui, and Kohei Yata. 2019. Efficient counterfactual learning from bandit feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 4634–4641.
 Narita et al. (2020) Yusuke Narita, Shota Yasui, and Kohei Yata. 2020. Off-policy Bandit and Reinforcement Learning. arXiv preprint arXiv:2002.08536 (2020).
 Precup (2000) Doina Precup. 2000. Eligibility traces for offpolicy policy evaluation. Computer Science Department Faculty Publication Series (2000), 80.
 Raghu et al. (2018) Aniruddh Raghu, Omer Gottesman, Yao Liu, Matthieu Komorowski, Aldo Faisal, Finale Doshi-Velez, and Emma Brunskill. 2018. Behaviour policy estimation in off-policy policy evaluation: Calibration matters. arXiv preprint arXiv:1807.01066 (2018).
 Saito (2020) Yuta Saito. 2020. Doubly robust estimator for ranking metrics with post-click conversions. In Fourteenth ACM Conference on Recommender Systems. 92–100.
 Saito et al. (2020) Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. 2020. Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation. arXiv preprint arXiv:2008.07146 (2020).
 Strehl et al. (2010) Alex Strehl, John Langford, Lihong Li, and Sham M Kakade. 2010. Learning from Logged Implicit Exploration Data. In Advances in Neural Information Processing Systems, Vol. 23. 2217–2225.
 Su et al. (2020a) Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. 2020a. Doubly robust off-policy evaluation with shrinkage. In International Conference on Machine Learning, Vol. 119. PMLR, 9167–9176.
 Su et al. (2020b) Yi Su, Pavithra Srinath, and Akshay Krishnamurthy. 2020b. Adaptive Estimator Selection for Off-Policy Evaluation. In Proceedings of the 37th International Conference on Machine Learning, Vol. 119. PMLR, 9196–9205.
 Su et al. (2019) Yi Su, Lequn Wang, Michele Santacatterina, and Thorsten Joachims. 2019. Cab: Continuous adaptive blending for policy evaluation and learning. In International Conference on Machine Learning, Vol. 97. PMLR, 6005–6014.
 Swaminathan and Joachims (2015) Adith Swaminathan and Thorsten Joachims. 2015. The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems, Vol. 28. 3231–3239.
 Thompson (1933) William R Thompson. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, 3/4 (1933), 285–294.
 Vlassis et al. (2019) Nikos Vlassis, Aurelien Bibaut, Maria Dimakopoulou, and Tony Jebara. 2019. On the design of estimators for bandit offpolicy evaluation. In International Conference on Machine Learning, Vol. 97. PMLR, 6468–6476.
 Voloshin et al. (2019) Cameron Voloshin, Hoang M Le, Nan Jiang, and Yisong Yue. 2019. Empirical study of off-policy policy evaluation for reinforcement learning. arXiv preprint arXiv:1911.06854 (2019).
 Wang et al. (2017) Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. 2017. Optimal and adaptive off-policy evaluation in contextual bandits. In International Conference on Machine Learning, Vol. 70. PMLR, 3589–3597.
Appendix A Benchmark Experiments on Classification Datasets
Here, we conduct experiments on three classification datasets, OptDigits, PenDigits, and SatImage, provided at the UCI repository (Dua and Graff, 2017). Table 6 shows statistics of the datasets used in the benchmark experiment.
A.1. Setup
Following previous studies (Farajtabar et al., 2018; Dudík et al., 2014; Wang et al., 2017; Kallus et al., 2021), we transform classification data into contextual bandit feedback data. In a classification dataset, we have a feature vector x and a ground-truth label y. Here, we regard a machine learning classifier π_det as a deterministic policy that chooses a class label as action a given feature vector x. We then define the reward variable r := 1{a = y}. Since the original classifier is deterministic, we make it stochastic by combining π_det and the uniform random policy π_u as:

π(a | x) = α · π_det(a | x) + (1 − α) · π_u(a | x),

where α ∈ [0, 1] is an additional experimental setting.
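This mixing construction can be sketched in a few lines of Python (the array-based representation and function name are our own):

```python
import numpy as np

def make_stochastic_policy(pi_det_actions, alpha, n_actions):
    """Mix a deterministic policy with the uniform random policy:
    pi(a|x) = alpha * pi_det(a|x) + (1 - alpha) * pi_u(a|x).

    pi_det_actions: deterministic action choice per sample, shape (n,).
    Returns action choice probabilities of shape (n, n_actions).
    """
    n = len(pi_det_actions)
    det_dist = np.zeros((n, n_actions))
    det_dist[np.arange(n), pi_det_actions] = 1.0  # one-hot deterministic part
    uniform_dist = np.full((n, n_actions), 1.0 / n_actions)
    return alpha * det_dist + (1.0 - alpha) * uniform_dist

# e.g., a classifier's predictions on three samples, mixed with alpha = 0.8
pi_b = make_stochastic_policy(np.array([0, 2, 1]), alpha=0.8, n_actions=3)
```

With alpha = 1 the deterministic classifier is recovered, while smaller alpha values inject more exploration.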
To apply IEOE to classification data, we first randomly split each dataset into training and test sets, D_tr and D_te. Then, we train a classifier on D_tr and use it to construct a behavior policy and a class of evaluation policies. By running the behavior policy on D_te, we transform D_te into logged bandit feedback data, where each action is the one sampled by the behavior policy. Then, by applying the following procedure, we compute the squared error (SE) of an estimator for each iteration in Algorithm 1:

1. Estimate the policy value of the evaluation policy on the logged bandit feedback data, using the OPE estimator and the tuple sampled in the algorithm.

2. Estimate the ground-truth policy value of the evaluation policy using the fully observed rewards in D_te.

3. Compare the off-policy estimate with its ground-truth value, using the squared error between the two as the performance metric of the estimator.
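The three steps above can be sketched as follows (the function names and toy data are our own; a real run would use the policies derived from a trained classifier):

```python
import numpy as np

def run_behavior_policy(pi_b_dist, y_true, rng):
    """Transform classification data into logged bandit feedback:
    sample one action per context and set reward r = 1{action == true label}."""
    n, n_actions = pi_b_dist.shape
    actions = np.array([rng.choice(n_actions, p=p) for p in pi_b_dist])
    rewards = (actions == y_true).astype(float)
    return actions, rewards

def ground_truth_value(pi_dist, y_true):
    """Ground-truth policy value on fully observed data: the probability that
    the sampled action matches the true label, averaged over all samples."""
    n = len(y_true)
    return pi_dist[np.arange(n), y_true].mean()

# Toy test set: 4 samples, 3 classes, uniform behavior policy.
rng = np.random.default_rng(0)
y_true = np.array([0, 1, 2, 1])
pi_b_dist = np.full((4, 3), 1.0 / 3.0)

actions, rewards = run_behavior_policy(pi_b_dist, y_true, rng)
v_true = ground_truth_value(pi_b_dist, y_true)

# Step 3: compare an off-policy estimate v_hat against v_true.
v_hat = rewards.mean()  # stand-in for an actual OPE estimate
se = (v_hat - v_true) ** 2
```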
A.2. Estimators and Hyperparameters
We use IEOE to evaluate the robustness of DM, IPWps, SNIPW, DRps, SNDR, SwitchDR, and DRos.
Here, we run the experiments under two different settings. First, we test the case where the true behavior policy is available. Next, we investigate the OPE estimators with an estimated behavior policy, where we assume that the true behavior policy is unknown. In this case, we additionally test the OPE estimators for robustness to the choice of machine learning method used to obtain the behavior policy estimate.
Tables 3 and 3 (in the main text) describe the hyperparameter spaces for each estimator. We use RandomizedSearchCV implemented in scikit-learn to tune the hyperparameters of the reward estimator and the behavior policy estimator. We additionally use CalibratedClassifierCV implemented in scikit-learn when estimating the behavior policy, as calibrating the behavior policy estimator matters in OPE (Raghu et al., 2018). Then, we use the hyperparameter tuning method described in Section 2.2 to tune estimator-specific hyperparameters such as the thresholds of SwitchDR and DRos. Table 7 describes how we construct the true behavior policy and five different evaluation policies. Finally, we fix the remaining experimental settings.
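As a sketch of this tuning-and-calibration pipeline for the behavior policy estimator (the model choice, hyperparameter ranges, and data here are illustrative, not the ones used in the experiment):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical logged data: contexts X and logged actions a, which act as
# the "labels" when estimating the behavior policy.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
a = rng.integers(0, 3, size=500)

# Tune the behavior policy estimator with RandomizedSearchCV ...
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"max_depth": [2, 4, 8], "n_estimators": [10, 50]},
    n_iter=4,
    cv=2,
    random_state=0,
)
search.fit(X, a)

# ... then calibrate its probability outputs, since calibration of the
# estimated behavior policy matters in OPE (Raghu et al., 2018).
calibrated = CalibratedClassifierCV(search.best_estimator_, cv=2)
calibrated.fit(X, a)
pi_b_hat = calibrated.predict_proba(X)  # estimated action choice probabilities
```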
A.3. Results
Figures 5 and 5 visually compare the CDF of the estimators’ squared error for each dataset in the true and estimated behavior policy settings. We also confirm the observations in a quantitative manner by computing AUCDF, CVaR, and Std of the squared error of each OPE estimator. We report these summary scores in Tables 9 and 9.
First, in the setting where the true behavior policy is available, IPWps is clearly the best estimator, achieving the most accurate estimation in almost all regions (see Figure 5). SNIPW also performs considerably better than the other estimators. In contrast, model-dependent estimators, especially DM, perform poorly compared to typical estimators such as IPWps and SNIPW. We observe that these model-dependent estimators perform worse when the reward estimator has a serious bias issue. On the other hand, we do not have to worry about the specification of the reward estimator when we use IPWps or SNIPW. Therefore, our experimental procedure suggests that simple estimators with fewer hyperparameters tend to perform well and remain robust across a wide variety of settings when the true behavior policy is recorded.
In the setting where the behavior policy needs to be estimated, we observe similar trends. First, Figure 5 and Table 9 show that IPWps achieves the most accurate estimation even when it uses the estimated behavior policy. Second, estimators based on DR, such as DRps, SwitchDR, and DRos, show considerably large squared errors when the behavior policy is estimated. This is because DR is vulnerable to overfitting of the behavior policy estimate. DR produces large squared errors when the behavior policy estimator overfits the data and outputs extreme estimations (we observe extremely small minimum estimated action choice probabilities). With these extreme estimated action choice probabilities, the importance weights used in these estimators become large, amplifying the estimation error of the reward estimator. This leads to serious overestimation of the policy value, even though the cut-off hyperparameters are properly tuned.
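A tiny numerical illustration (with made-up probabilities) of how a single overfit propensity estimate blows up the importance weights:

```python
import numpy as np

# True and (overfit) estimated action choice probabilities for three logged actions.
true_propensity = np.array([0.20, 0.25, 0.30])
est_propensity = np.array([0.20, 0.25, 1e-8])  # one extreme underestimate
pi_e_prob = np.array([0.30, 0.30, 0.30])       # evaluation policy probabilities

true_weights = pi_e_prob / true_propensity
est_weights = pi_e_prob / est_propensity

# A single overfit propensity makes one importance weight explode,
# amplifying any error of the reward estimator it multiplies.
assert est_weights.max() / true_weights.max() > 1e6
```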
We suggest that future OPE research use the IEOE procedure to test the stability and robustness of OPE estimators as we have demonstrated. This additional experimental effort will produce substantial information about the estimators’ usability in practice.
Appendix B Software Implementation
In addition to developing the evaluation procedure, we have implemented open-source Python software, pyIEOE, to streamline the evaluation of OPE with our experimental protocol. This package is built with the intention of being used with OpenBanditPipeline (obp) (https://github.com/st-tech/zr-obp).
Below, we show the essential code to conduct an interpretable evaluation of various OPE estimators with our software so that one can grasp its usage easily. Only four lines of code are sufficient to complete the IEOE procedure in Algorithms 1 and 2, aside from some preparations.
In the following subsections, we explain the procedure, including the preparations, in detail by showing an example of conducting an interpretable evaluation of OPE estimators on a synthetic bandit dataset.
B.1. Preparing Dataset and Evaluation Policies
Before using pyIEOE, we first need to prepare logged bandit feedback data and a set of evaluation policies. Here, each evaluation policy consists of its action distribution and ground-truth policy value. We can conduct this preparation by using the dataset module of obp.
In addition to synthetic datasets, users can utilize multi-class classification data, public real-world data (such as Open Bandit Dataset (Saito et al., 2020)), and their own real-world data to evaluate the robustness of OPE estimators by following the preprocessing procedure of obp. Users are also free to define a set of evaluation policies by themselves.
B.2. Defining Hyperparameter Spaces
After preparing the synthetic data and a set of evaluation policies, we now define hyperparameter spaces of OPE estimators. Users of the software can define hyperparameter spaces of OPE estimators by themselves as follows.
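One plausible way to express such a hyperparameter space is a nested dictionary of ranges; note that the exact schema expected by pyIEOE may differ from this sketch:

```python
import numpy as np

# A hypothetical hyperparameter-space definition. Each entry gives a range
# from which an estimator-specific hyperparameter is sampled during IEOE.
hyperparam_space = {
    "SwitchDR": {"tau": {"lower": 1.0, "upper": 100.0, "log": True}},
    "DRos": {"lambda_": {"lower": 1.0, "upper": 100.0, "log": True}},
}

def sample_hyperparam(space, rng):
    """Sample a value from the range, uniformly on a log scale if requested."""
    lo, hi = space["lower"], space["upper"]
    if space.get("log"):
        return float(np.exp(rng.uniform(np.log(lo), np.log(hi))))
    return float(rng.uniform(lo, hi))

rng = np.random.default_rng(0)
tau = sample_hyperparam(hyperparam_space["SwitchDR"]["tau"], rng)
lam = sample_hyperparam(hyperparam_space["DRos"]["lambda_"], rng)
```

Sampling hyperparameters from such ranges, rather than fixing a single value, is what lets the procedure measure robustness to the hyperparameter choice.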
B.3. Interpretable OPE Evaluation
Finally, we evaluate OPE estimators in an interpretable manner. Our software provides an easy way to conduct this evaluation workflow.
Users can intuitively evaluate the robustness of the estimators by comparing the CDF of the squared error. A quantitative comparison is also possible by calculating summary scores such as AUCDF and CVaR. In this example, it is easy to see that SNDR is more reliable than DRos.
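Such a CDF comparison can also be reproduced outside the software as follows (the squared errors here are synthetic stand-ins for the SNDR and DRos results, not the experiment's actual numbers):

```python
import numpy as np

def cdf(values, grid):
    """Empirical CDF of `values` evaluated on `grid`."""
    return np.array([np.mean(values <= z) for z in grid])

# Hypothetical squared errors for two estimators (stand-ins for SNDR and DRos).
rng = np.random.default_rng(1)
se_sndr = rng.exponential(scale=0.5, size=5000)
se_dros = rng.exponential(scale=2.0, size=5000)

grid = np.linspace(0.0, 5.0, 200)
cdf_sndr, cdf_dros = cdf(se_sndr, grid), cdf(se_dros, grid)

# The more reliable estimator's CDF lies above the other's almost everywhere:
# it attains small squared errors more often.
assert np.mean(cdf_sndr[1:] >= cdf_dros[1:]) > 0.9
```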