1 Background and Problem Description
We will assume a contextual bandit setup, where our data consists of independent observations of state-action-reward triples $(s, a, r)$. For each unit, a state $s$ is observed, an action $a$ is taken in accordance with some policy $\mu$, and a reward $r$ is observed in response. We use the notation $\pi$ to refer to both a policy and its density, and use $\pi(s)$ to denote the action that would be taken under policy $\pi$ for a state $s$.
The problem addressed is as follows: given a proposed policy $\pi$ and observed data collected following a behavior policy $\mu$, estimate the expected reward of instead following $\pi$ on the observed states. We denote the reward function as $r(s, a)$, and an estimated reward function as $\hat{r}(s, a)$.
We assume the following throughout:
A 1.
A 2.
A 3.
A 4.
The distribution of rewards across potential actions is independent of policy, conditional on state.
1.1 Off-Policy Estimation
We now briefly review the three broad classes of off-policy estimation: direct modeling, importance sampling, and doubly robust estimation. Throughout this section we assume that $\{(s_i, a_i, r_i)\}_{i=1}^{n}$ are data collected under observed policy $\mu$, and $\pi(s_i)$ is the action that would be taken under the proposed policy $\pi$.
The direct method approach to this problem fits a regression model $\hat{r}(s, a)$ to approximate the reward function under the observed policy $\mu$. The counterfactual policy value is estimated by predicting the rewards that would have been observed under the actions of policy $\pi$, i.e., $\hat{V}_{\mathrm{DM}} = \frac{1}{n}\sum_{i=1}^{n} \hat{r}(s_i, \pi(s_i))$. In order for the resulting estimate to be consistent, the reward model needs to generalize well to the reward distribution that would be observed under policy $\pi$. In practice, this method can be badly biased if the observed state-action data do not adequately represent the counterfactual distribution (Dudík et al., 2011).
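The direct method can be sketched in a few lines. This is a minimal illustration under our own assumptions (variable names and the choice of a scikit-learn gradient boosted regressor are ours, not the paper's):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def direct_method_estimate(S, A, R, pi_actions):
    """Direct method sketch: fit a reward model on logged (state, action)
    pairs, then average its predictions at the proposed policy's actions.

    S: (n, d) states; A: (n,) logged actions; R: (n,) rewards;
    pi_actions: (n,) actions the proposed policy would take.
    """
    X = np.column_stack([S, A])               # logged state-action features
    model = GradientBoostingRegressor().fit(X, R)
    X_pi = np.column_stack([S, pi_actions])   # counterfactual actions
    return model.predict(X_pi).mean()
```

The bias discussed in the text arises because `model` is only ever fit on the logged action distribution, yet is queried at the proposed policy's actions.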
Importance sampling is another approach, which reweights the observed rewards by an inverse propensity score (IPS) and a rejection sampling term, i.e., $\hat{V}_{\mathrm{IPS}} = \frac{1}{n}\sum_{i=1}^{n} \frac{\mathbb{1}[\pi(s_i) = a_i]}{\mu(a_i \mid s_i)}\, r_i$. Importance sampling, while unbiased, often suffers from high variance. The weighted importance sampling estimator (also called the "self-normalized" or Hájek estimator) has been used to reduce variance, at the cost of small bias, while maintaining consistency (Swaminathan and Joachims, 2015; Cochran, 1977): $\hat{V}_{\mathrm{WIS}} = \sum_{i=1}^{n} w_i r_i \big/ \sum_{i=1}^{n} w_i$, where $w_i = \mathbb{1}[\pi(s_i) = a_i]/\mu(a_i \mid s_i)$. For continuous action spaces, Kallus and Zhou (2018) recently proposed an IPS-based method that replaces the indicator function with a kernel smoothing term $\frac{1}{h}K\!\left(\frac{\pi(s_i) - a_i}{h}\right)$. The corresponding weighted importance sampling estimator is defined analogously.
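The discrete-action IPS and self-normalized estimators can be sketched as follows (a minimal NumPy implementation; variable names are ours):

```python
import numpy as np

def ips_estimate(R, A, pi_actions, mu_prob, self_normalize=False):
    """IPS estimate of the proposed policy's value for discrete actions.

    R: rewards; A: logged actions; pi_actions: proposed actions;
    mu_prob: behavior-policy probability of each logged action.
    """
    # Rejection-sampling term: keep a sample only if the proposed
    # policy would have taken the logged action.
    match = (A == pi_actions).astype(float)
    w = match / mu_prob
    if self_normalize:
        # Hajek / weighted IS: small bias, lower variance, consistent.
        return np.sum(w * R) / np.sum(w)
    return np.mean(w * R)
```

The variance problem is visible here: when `mu_prob` is small for a matched sample, `w` explodes, which motivates both self-normalization and the balancing approach developed later.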
Finally, doubly robust estimators combine the direct method and importance sampling. These tend to have lower variance, and are consistent if either the direct method regression model or the importance sampling weights are correctly specified (Dudík et al., 2014; Thomas and Brunskill, 2016). For discrete or continuous action spaces, the reward is estimated as (Dudík et al., 2014) $\hat{V}_{\mathrm{DR}} = \frac{1}{n}\sum_{i=1}^{n}\left[\hat{r}(s_i, \pi(s_i)) + \frac{\delta(a_i, \pi(s_i))}{\mu(a_i \mid s_i)}\bigl(r_i - \hat{r}(s_i, a_i)\bigr)\right]$, where $\delta(a_i, \pi(s_i))$ is a suitable rejection sampling term. The SWITCH estimator of Wang et al. (2017) uses IPS unless the weight is too large, in which case it uses the direct method.
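A minimal sketch of the discrete-action doubly robust estimator (our names; the reward-model predictions are passed in as arrays):

```python
import numpy as np

def dr_estimate(R, A, pi_actions, mu_prob, r_hat_logged, r_hat_pi):
    """Doubly robust estimate for discrete actions.

    r_hat_logged: reward-model prediction at the logged action;
    r_hat_pi:     reward-model prediction at the proposed action.
    Consistent if either the reward model or mu_prob is correct.
    """
    match = (A == pi_actions).astype(float)
    # IPS-weighted correction of the reward model's residual.
    correction = match / mu_prob * (R - r_hat_logged)
    return np.mean(r_hat_pi + correction)
```

If the reward model is perfect the correction term vanishes and the estimate is pure direct method; if the reward model is identically zero the estimate reduces to plain IPS.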
2 Balancing Importance Sampling
There are several weaknesses with existing approaches that leverage importance sampling with inverse propensity scores. First, the probability of some observed actions for some observed states may be very close to 0 or 1, leading to instability and small-sample bias of the propensity score model (Ma and Wang, 2018). Second, the propensity score model must be correctly specified. In the absence of this, prior work has shown that the performance of IPS can be arbitrarily bad, because there is no guarantee of balance under a misspecified propensity score (Kang et al., 2007; Smith and Todd, 2005; Imai and Ratkovic, 2014). In particular, a misspecified IPS will not, in general, ensure that the weighted state-action distribution of the observed policy matches that of the proposed policy. This implies that policy evaluation will be incorrect, as it reflects the performance of a policy on the wrong state distribution. Using doubly robust estimation partially addresses the case of a misspecified propensity model. However, while doubly robust estimators provide consistent estimates when either the direct method or the propensity score model is correctly specified, they do not protect against failure of both. To address this weakness, recent work has focused on weighting estimators that explicitly optimize for balance, seeking weighting functions that make the choice of action independent of the observed contexts (Liu et al., 2018; Kallus, 2018).
These estimators have been shown to provide strong results in their respective applications, even under misspecification. However, their use is limited to discrete action spaces, and they often involve hyperparameters that must be set by heuristics, or are computationally intractable. To remedy this, we now describe BOPe, which defines a class of balancing importance samplers for off-policy evaluation.
BOPe leverages classifier-based density ratio estimation (Sugiyama et al., 2012; Menon and Ong, 2016) to learn importance sampling ratios. Specifically, off-policy evaluation using BOPe consists of four steps:

1. Create a supervised learning problem using the concatenated proposed-policy instances $(s_i, \pi(s_i))$ and observed-policy instances $(s_i, a_i)$ as covariates, giving a label of 0 to the observed policy and 1 to the proposed policy.

2. Learn a classifier $\hat{p}(s, a)$ to distinguish between the observed and proposed policy.

3. Take the importance sampling ratio as $\hat{w}(s, a) = \hat{p}(s, a) / (1 - \hat{p}(s, a))$.

4. Take the off-policy estimate as $\hat{V}_{\mathrm{BOPe}} = \frac{1}{n}\sum_{i=1}^{n} \hat{w}(s_i, a_i)\, \delta(a_i, \pi(s_i))\, r_i$,
where $\delta(a_i, \pi(s_i))$ defines a rejection sampler term between the observed action $a_i$ and the proposed action $\pi(s_i)$. For discrete action spaces, this is simply $\mathbb{1}[a_i = \pi(s_i)]$. For continuous actions, we use the kernel term of Kallus and Zhou, that is, $\frac{1}{h}K\!\left(\frac{\pi(s_i) - a_i}{h}\right)$, where $K$ is some kernel function. A corresponding doubly robust estimator can also be constructed. We can see how step three arrives at the importance sampler through an application of Bayes' rule (Bickel et al., 2009): $\frac{p(y = 1 \mid s, a)}{p(y = 0 \mid s, a)} = \frac{p(s, a \mid y = 1)}{p(s, a \mid y = 0)} \cdot \frac{p(y = 1)}{p(y = 0)}$, where $p(y = 1) = p(y = 0)$ by design.
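The four steps above can be sketched with any probabilistic classifier. Below is a minimal version using logistic regression for illustration (the classifier choice and variable names are ours; the paper's experiments use gradient boosted trees):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bope_weights(S, A_obs, A_pi):
    """Steps 1-3: label proposed-policy pairs 1 and observed pairs 0,
    fit a probabilistic classifier, and convert its class probability
    into a density-ratio weight via Bayes' rule, w = p / (1 - p),
    valid because the two classes are balanced by construction."""
    X = np.column_stack([np.vstack([S, S]),
                         np.concatenate([A_obs, A_pi])])
    y = np.concatenate([np.zeros(len(A_obs)), np.ones(len(A_pi))])
    clf = LogisticRegression().fit(X, y)
    p = clf.predict_proba(np.column_stack([S, A_obs]))[:, 1]
    return p / (1 - p)

def bope_estimate(weights, R, A_obs, A_pi):
    """Step 4 for discrete actions, shown self-normalized: weighted
    average of rewards with a rejection-sampling indicator."""
    match = (A_obs == A_pi).astype(float)
    w = weights * match
    return np.sum(w * R) / np.sum(w)
```

A useful sanity check: when the proposed policy equals the observed one, the classifier cannot separate the classes, the weights collapse to 1, and the estimate reduces to the observed mean reward.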
As described, this procedure provides a large degree of flexibility to practitioners, requiring only that a classification model be learned. The question of which classifiers fit within this framework is answered by the following assumption:
A 5.
The classifier is trained using a strictly proper composite loss^{1}, $\ell$, with a twice-differentiable Bayes risk.

^{1}A loss $\ell$ is strictly proper composite if the Bayes-optimal score is given by $\Psi(\eta)$, where $\eta$ is the class probability and $\Psi$ is an invertible link function. Readers should see Buja et al. (2005) and Reid and Williamson (2010) for complete treatments of strictly proper composite losses.
This assumption allows for a large number of widely used loss functions, such as logistic, exponential, and mean squared error, as well as models commonly used for distribution comparison, such as the kernel-based density ratio estimators of Sugiyama et al. (2012) and maximum mean discrepancy (Kallus, 2018). Given that BOPe targets the policy density ratio, it optimizes a measure of balance, namely the difference between the reweighted source and target distributions, as described in the following proposition:
Proposition 1.
Let $f$ and $g$ be real-valued functions of $s$ and $a$, respectively. The functional discrepancy between the observed policy and the proposed policy under BOPe is bounded above by $B_{\phi}(w^* \,\|\, \hat{w})$, where $B_{\phi}$ is a Bregman divergence.
The proof for this proposition can be found in the supplement. When $\hat{w} = w^*$, this discrepancy trivially reduces to 0. Thus, the degree to which balance is attained is implied by the quality of the approximation of $\hat{w}$ to $w^*$. The upper bound involves a Bregman divergence which depends on the classifier used. We discuss this more in the next section, when connecting the minimization of imbalance to the bias and variance of BOPe.
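For reference, the Bregman divergence generated by a convex function $\phi$ takes the standard form below; with the generator $\phi(t) = t \log t$ associated with logistic loss, it recovers a generalized KL-divergence. This is a standard identity, not an equation taken from the paper:

```latex
% Bregman divergence generated by a convex function \phi:
B_{\phi}(w^* \,\|\, \hat{w}) = \phi(w^*) - \phi(\hat{w}) - \phi'(\hat{w})\,(w^* - \hat{w}).
% With \phi(t) = t \log t this yields a generalized KL-divergence:
B_{\phi}(w^* \,\|\, \hat{w}) = w^* \log\frac{w^*}{\hat{w}} - (w^* - \hat{w}).
```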
While the BOPe procedure as described above gives an importance sampling estimator, the resulting weights can be used in any off-policy method which uses importance weights. Extension to doubly robust estimation is trivial, as is extension to methods which adaptively combine direct-method predictions and importance sampling weights, such as the SWITCH estimator of Wang et al. (2017). In Section 5, we implement BOPe within both of these frameworks, and compare against inverse propensity score (IPS) weights.
3 Estimator Analysis and Asymptotics
In this section, we describe the statistical properties of our estimator, and prove consistency for the target policy value. Let $p^*(s, a)$ denote the true class probability of observing data under the target policy $\pi$ instead of the behavior policy $\mu$. This is estimated with a probabilistic classifier $\hat{p}$ on labelled state-action data. Additionally, let $w^*$ denote the true policy density ratio, with estimator $\hat{w}$. We assume the classifier has regret that decays with increasing sample size $n$.
A 6.
Let $\hat{p}$ be a probabilistic classifier whose regret decays as $O(n^{-\alpha})$ for some constant $\alpha > 0$.
Next, we require that our importance sampling weight estimator, $\hat{w}$, is independent of the observed rewards $r_i$. This can be easily achieved through sample splitting: training the classifier and applying BOPe on independent datasets.
A 7.
Given observed state-action data, the density ratio estimator $\hat{w}$ is independent of the observed rewards $r_1, \ldots, r_n$.
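The sample splitting that yields Assumption 7 can be sketched as a small helper (illustrative, not from the paper): the classifier is fit on one fold, while the weights are computed and the rewards averaged on the other.

```python
import numpy as np

def split_indices(n, seed=0):
    """Sample splitting sketch: train the density-ratio classifier on
    one fold and evaluate BOPe on the other, so the estimated weights
    are independent of the rewards they multiply."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    return perm[: n // 2], perm[n // 2:]
```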
Finally, we require certain regularity conditions and rates to use in our theoretical results.
A 8.
(i) The reward function and policy densities have bounded second derivatives with respect to the action, and (ii) in the continuous action domain, the bandwidth parameter satisfies $h \to 0$ and $nh \to \infty$ as $n \to \infty$.
We now show that the BOPe estimator is asymptotically unbiased, and derive a bound for its variance. We accomplish this by characterizing the asymptotic quantities in terms of the Bregman divergence between the estimated and true density ratios. In the propositions below, we use $\hat{V}$ to denote the BOPe estimate and $V$ to denote the true policy value.
Proposition 2.
In discrete action spaces, the expected bias of $\hat{V}$ is bounded above by the Bregman divergence $B_{\phi}(w^* \,\|\, \hat{w})$ between the true and estimated density ratios. In continuous action spaces, the expected bias of $\hat{V}$ obeys an analogous bound with an additional term depending on the bandwidth $h$.
Proposition 3.
In discrete action spaces, the variance of $\hat{V}$ obeys an analogous bound. In continuous action spaces, the variance bound additionally depends on the kernel $K$ and the bandwidth $h$.
The proofs are deferred to the supplement. The implication of Proposition 2 is that the expected bias of BOPe is bounded from above by the Bregman divergence between the true density ratio between the observed and proposed policy and the model estimate of the density ratio. The specific Bregman divergence depends on the choice of classifier; for example, a logistic regression classifier would imply a KL-divergence. We note that Bregman divergences define a wide variety of divergences, including KL-divergence and maximum mean discrepancy (Huszár, 2013), that are often considered in the analysis of off-policy evaluation and covariate shift (Kallus, 2018; Bickel et al., 2009; Gretton et al., 2009). We can then appeal to Proposition 3 of Menon and Ong (2016), which provides an explicit link between the risk of the classifier and the Bregman divergence between $w^*$ and $\hat{w}$. We now prove our main result below:
Proposition 4.
Under Assumptions 1–8, the mean squared error of the BOPe estimator $\hat{V}$ tends to zero as $n \to \infty$.
Proof sketch.
This result follows by leveraging Propositions 2 and 3 to characterize the asymptotic bias and variance of $\hat{V}$. Then, we use Proposition 3 of Menon and Ong (2016) to connect the Bregman loss to the error of the classifier. Therefore, under Assumption 6, the bias and variance vanish as $n \to \infty$, and the mean squared error of $\hat{V}$ tends to zero. ∎
The full proof and technical details for these results can be found in the supplement. It is worth briefly discussing the implications of Propositions 1–3 combined with Proposition 3 of Menon and Ong (2016), which ties classifier risk to the quality of the density ratio estimate. Proposition 1 implies that optimizing classifier performance directly translates into optimizing the quality of the importance sampler. This provides a powerful property for BOPe: the bias and variance of the estimated policy evaluation can be minimized by optimizing for classifier performance. Because the classifier risk is directly tied to the quality of the off-policy estimate, the problem is essentially reduced to model selection for supervised learning.
4 Related Work
Related work can roughly be divided into three categories: off-policy evaluation of contextual bandits, balancing estimators, and density ratio estimation. The most closely related is prior work on off-policy evaluation for contextual bandits. Li et al. (2011) introduced the use of rejection sampling for offline evaluation of contextual bandit problems. Within the causal inference community there is a long literature on the use of doubly robust estimators (cf. Bang and Robins (2005), Kang et al. (2007), Tan (2010), Cao et al. (2009)). Dudík et al. (2011) later proposed the use of doubly robust estimation for off-policy evaluation of contextual bandits, combining the doubly robust estimator of causal effects with a rejection sampler. Since then, several works have sought to minimize the variance and improve the robustness of the doubly robust estimator. Farajtabar et al. (2018) and Wang et al. (2017) present work to minimize the variance of the estimators by reducing the dependence on the inverse propensity score in high-variance settings. Swaminathan and Joachims (2015) use a Hájek-style estimator (Hájek et al., 1964). Later work from Thomas (2015) and Swaminathan and Joachims (2015) builds on this to improve estimation.
A second related line of work is balancing estimators. Under correct specification of the conditional model, Rosenbaum and Rubin (1983) show that the propensity score is balancing. More recently, a growing literature seeks to develop balancing estimators which are robust to misspecification. Hainmueller (2012) and Zubizarreta (2015) provide optimization-based procedures which define weights that are balancing but are not necessarily valid propensity scores. Imai and Ratkovic (2014) later defined an estimator which strives to find a valid propensity score subject to balancing constraints. This was extended to general treatment regimes by Fong et al. (2018). However, none of these directly address the problem of off-policy evaluation for contextual bandits. Kallus (2018) introduces a method for balanced policy evaluation that relies on a regularized estimator that seeks to minimize the maximum mean discrepancy (Gretton et al., 2012). Calculation of the weights is achieved through a quadratic program, which presents computational challenges as the sample size grows large. It is interesting to note that the proposed evaluation optimization of Kallus (2018) fits within the assumptions of BOPe, where the scoring rule is maximum mean discrepancy (a strictly proper scoring rule) and the model is learned with variance regularization. The accompanying classifier can be defined via a modification of support vector machine classification (Bickel et al., 2009). Dimakopoulou et al. (2018) propose balancing in the context of online learning of linear contextual bandits by reweighting based on the propensity score. This differs from the present work in its focus on online learning rather than policy evaluation, and in its use of a linear model-based propensity score, which provides mean balance only in the case of correct specification. Wu and Wang (2018) propose a method which seeks to minimize a divergence in order to minimize regret, similar to the target in this work. However, in the setting of Wu and Wang (2018) access to the true propensities is assumed, whereas BOPe estimates the density ratio directly from observed and proposed state-action pairs. The final line of related work is density ratio estimation. The use of classification for density ratio estimation dates back to at least Qin (1998). Later work leverages classification for covariate shift adaptation (Bickel et al., 2007, 2009) and two-sample testing (Friedman, 2004; Lopez-Paz and Oquab, 2017). However, this work represents the first time classifier-based density ratio estimation has been used for off-policy evaluation. There is also a growing literature on density ratio estimators defined outside of the classification framework. These methods largely rely on kernels to perform estimation (Huang et al., 2007; Sugiyama et al., 2012). KL importance estimation (KLIEP) (Sugiyama et al., 2008) and least squares importance fitting (LSIF) (Kanamori et al., 2009) are the most directly relevant, given their ability to optimize hyperparameters via cross-validation. Interestingly, Menon and Ong (2016) provide losses for classification-based density ratio estimation that recover KLIEP and LSIF. Thus, these estimators can be included inside of BOPe by considering the corresponding loss functions for the classifier.
5 Experiments
In the experiments that follow, we evaluate direct method, importance sampling, and SWITCH estimators for off-policy evaluation. For the latter two methods, we compare inverse propensity score and BOPe weights, and use the self-normalized versions of the estimators given in Section 1.1. We defer our results for doubly robust estimators to the supplement, but we found the same trends in those evaluations. The direct method, propensity score, and BOPe estimators are all trained as gradient boosted tree classifiers (or regressors for the continuous evaluations).
5.1 Discrete Action Spaces
In this section, we evaluate the accuracy of our estimator for the value of an unobserved policy in the discrete reward setting.
We employ the method of Dudík et al. (2011) to turn a $k$-class classification problem into a $k$-armed contextual bandit problem.
We split our data, training a classifier on one half of the data (train).
This classifier defines our target policy, wherein the action taken is the label predicted.
The reward is defined as an indicator of whether the predicted label is the true label.
The optimal policy, then, is to take an action equal to the true label in the original data.
Evaluating this policy corresponds to estimating the classifier’s accuracy.
In the second half of the dataset (test) we retain only a 'partially labeled' dataset, wherein we uniformly sample actions (labels) and observe the resulting rewards.
The train half of the data is also used to train direct method, propensity score, and BOPe models.
These are then applied to the test data to estimate the relevant quantities for off-policy evaluation methods.
We compare the expected reward estimates to the true mean reward of the target policy applied to the test data.
For each dataset, this process is repeated over 100 iterations, where we vary the actions under the observed uniform policy.
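The classification-to-bandit transformation described above can be sketched as follows (function and variable names are ours; the uniform logging policy matches the text):

```python
import numpy as np

def classification_to_bandit(y_true, n_actions, seed=0):
    """Dudik et al. (2011)-style transformation sketch: draw logged
    actions uniformly at random and reveal only the bandit reward
    r = 1[action == true label], discarding the full labels."""
    rng = np.random.default_rng(seed)
    actions = rng.integers(0, n_actions, size=len(y_true))
    rewards = (actions == y_true).astype(float)
    return actions, rewards
```

Under this construction the logging propensity is known to be $1/k$ for every action, and evaluating a policy that predicts labels amounts to estimating that classifier's accuracy.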
Our policy models are trained as random forest classifiers. These models use the default hyperparameter values from scikit-learn, with the exception of the number of trees, which we increase as a function of sample size in order to provide increasingly complex policies to evaluate. The propensity score, BOPe, and direct method (one-vs-rest) models are gradient boosted decision trees with default XGBoost hyperparameters, with the exception of the number of boosting iterations, which is likewise set as a function of sample size to adapt the estimator to the dataset. We use the same datasets from the UCI repository (Dua and Graff, 2017) used by Dudík et al. (2011), and summarize their characteristics in the supplement. The policy we evaluate is given by training a multiclass random forest model. The performance results of the estimators are summarized in Figure 1, where we plot the root mean squared error and bias averaged over 100 iterations. We see that the direct method estimator tends to be heavily biased for the true policy value, compared to BOPe and IPS. The direct method generally performs quite poorly in terms of overall accuracy. The standard BOPe estimator performs at least as well as, and typically better than, the IPS estimator. This also holds for the corresponding SWITCH estimators. While BOPe often has slightly higher bias than IPS, it strikes a better balance between bias and variance, leading to substantially improved accuracy in most cases.
5.2 Continuous Action Spaces
For the continuous action case, we provide a novel extension of the same transformation employed in the previous section for evaluation of discrete actions.
We take a selection of datasets with continuous outcomes, and train a predictive model on the train half of the data, which constitutes our target policy.
The reward of a prediction (defined to be an action in our evaluation) is the negative of the Euclidean distance to the true outcome.
Thus, the optimal policy is to choose actions equal to the true outcome, as in the discrete evaluation.
Evaluating the target policy is equivalent to estimating the mean squared error of the predictive model.
As before, we retain the test data for evaluation, while using the train data to train direct method, propensity score, and BOPe models. For our observed policy, we sample actions from the empirical distribution of train outcomes, and compute the corresponding rewards.
We then estimate the target policy value, repeating this over 300 iterations.
We retain the same basic models from the previous section for this evaluation, swapping out classifiers for regressors as appropriate.
We use datasets from the UCI repository (Dua and Graff, 2017) and Kaggle, and summarize their characteristics in the supplement.
The policy we evaluate is given by training a random forest regression to predict the continuous outcome.
We also use gradient boosted regression trees for training direct method, propensity score, and BOPe models.
Specifically, to obtain a continuous propensity score, we apply our observed policy to the train data, and train a model to predict actions from state features. Then, conditional on state, the action is assumed to come from a normal distribution centered at the model's prediction, with constant variance, as is standard practice (Hirano and Imbens, 2004). For each state-action pair in the test data, the propensity score is then the density of this distribution at the observed action.
As in the previous section, we compare BOPe to IPS (with the Kallus and Zhou (2018) kernel) and the direct method, including the relevant SWITCH estimators.
These results are displayed in Figure 2.
We see that BOPe outperforms the other methods uniformly across all datasets.
In contrast to the discrete setting, BOPe does a better job of correcting for bias than does the naïve IPS method.
This is not surprising, as IPS is forced to make strong assumptions about the conditional distribution of action given state, which BOPe need not make.
Given real-world data that rarely conforms to ideal theoretical distributions, this provides major benefits.
In addition to reducing bias, BOPe greatly reduces RMSE in most datasets.
The BOPe SWITCH estimator improves on the IPS version in both RMSE and bias in almost all cases.
On the power dataset, BOPe provides half the RMSE of IPS when used within the SWITCH estimator. On admissions and auto, BOPe incurs less than one-third of the RMSE of standard IPS.
6 Conclusions
In this work, we introduced BOPe, a simple, flexible, and powerful method for off-policy evaluation of contextual bandits. BOPe is easily implemented using off-the-shelf classifiers and trivially generalizes to arbitrary action types, e.g., continuous or multi-valued. In Section 3 we tie the bias and variance of our estimator to the risk of the classification task, and show that BOPe is inherently balance-seeking. As a consequence of the theoretical results, hyperparameter tuning and model selection can be performed by minimizing classification error using well-known strategies from supervised learning. Experimental evidence indicates that BOPe provides state-of-the-art performance for discrete and continuous action spaces. A natural direction for future work is considering the case of evaluation with sequential decision making and structured action spaces. Our method could also be extended to perform policy optimization in all of these settings. It would also be interesting to consider the integration of BOPe with methods for variance reduction, e.g., Thomas and Brunskill (2016) and Farajtabar et al. (2018), to further improve performance.
References
 Acharya et al. [2019] Mohan Acharya, Asfia Armaan, and Aneeta Anthony. A comparison of regression models for prediction of graduate admissions. IEEE International Conference on Computational Intelligence in Data Science, 2019.
 Bang and Robins [2005] Heejung Bang and James M Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005.
 Bickel et al. [2007] Steffen Bickel, Michael Brückner, and Tobias Scheffer. Discriminative learning for differing training and test distributions. In Proceedings of the 24th international conference on Machine learning, pages 81–88. ACM, 2007.
 Bickel et al. [2009] Steffen Bickel, Michael Brückner, and Tobias Scheffer. Discriminative learning under covariate shift. Journal of Machine Learning Research, 10(Sep):2137–2155, 2009.
 Bregman [1967] L.M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200 – 217, 1967.
 Buja et al. [2005] Andreas Buja, Werner Stuetzle, and Yi Shen. Loss functions for binary class probability estimation and classification: Structure and applications. Working draft, November 3, 2005.
 Cao et al. [2009] Weihua Cao, Anastasios A Tsiatis, and Marie Davidian. Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika, 96(3):723–734, 2009.
 Cochran [1977] William G Cochran. Sampling techniques. Wiley, 1977.
 Cortez et al. [2009] Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547–553, 2009.
 Dimakopoulou et al. [2018] Maria Dimakopoulou, Zhengyuan Zhou, Susan Athey, and Guido Imbens. Balanced linear contextual bandits. arXiv preprint arXiv:1812.06227, 2018.
 Dua and Graff [2017] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
 Dudík et al. [2011] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pages 1097–1104, 2011.
 Dudík et al. [2014] Miroslav Dudík, Dumitru Erhan, John Langford, Lihong Li, et al. Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485–511, 2014.
 Farajtabar et al. [2018] Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, pages 1446–1455, 2018.
 Fong et al. [2018] Christian Fong, Chad Hazlett, Kosuke Imai, et al. Covariate balancing propensity score for a continuous treatment: Application to the efficacy of political advertisements. The Annals of Applied Statistics, 12(1):156–177, 2018.
 Friedman [2004] Jerome Friedman. On multivariate goodness-of-fit and two-sample testing. Technical report, Stanford Linear Accelerator Center, Menlo Park, CA (US), 2004.
 Gretton et al. [2009] Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. Dataset shift in machine learning, 3(4):5, 2009.
 Gretton et al. [2012] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel twosample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
 Hainmueller [2012] Jens Hainmueller. Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis, 20(1):25–46, 2012.
 Hájek and others [1964] Jaroslav Hájek et al. Asymptotic theory of rejective sampling with varying probabilities from a finite population. The Annals of Mathematical Statistics, 35(4):1491–1523, 1964.
 Hirano and Imbens [2004] Keisuke Hirano and Guido W Imbens. The propensity score with continuous treatments. Applied Bayesian modeling and causal inference from incomplete-data perspectives, 226164:73–84, 2004.
 Huang et al. [2007] Jiayuan Huang, Arthur Gretton, Karsten Borgwardt, Bernhard Schölkopf, and Alex J Smola. Correcting sample selection bias by unlabeled data. In Advances in neural information processing systems, pages 601–608, 2007.
 Huszár [2013] Ferenc Huszár. Scoring rules, divergences and information in Bayesian machine learning. PhD thesis, University of Cambridge, 2013.
 Imai and Ratkovic [2014] Kosuke Imai and Marc Ratkovic. Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):243–263, 2014.
 Imbens [2000] Guido W Imbens. The role of the propensity score in estimating doseresponse functions. Biometrika, 87(3):706–710, 2000.

 Kallus and Zhou [2018] Nathan Kallus and Angela Zhou. Policy evaluation and optimization with continuous treatments. In International Conference on Artificial Intelligence and Statistics, pages 1243–1251, 2018.
 Kallus [2018] Nathan Kallus. Balanced policy evaluation and learning. In Advances in Neural Information Processing Systems, pages 8909–8920, 2018.
 Kanamori et al. [2009] Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A leastsquares approach to direct importance estimation. Journal of Machine Learning Research, 10(Jul):1391–1445, 2009.
 Kang et al. [2007] Joseph DY Kang, Joseph L Schafer, et al. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical science, 22(4):523–539, 2007.
 Kaya et al. [2012] Heysem Kaya, Pınar Tüfekci, and Fikret S Gürgen. Local and global learning methods for predicting power of a combined gas & steam turbine. In Proceedings of the International Conference on Emerging Trends in Computer and Electronics Engineering ICETCEE, pages 13–18, 2012.

 Langford and Zhang [2007] John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Proceedings of the 20th International Conference on Neural Information Processing Systems, pages 817–824. Citeseer, 2007.
 Li et al. [2010] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW '10, pages 661–670, New York, NY, USA, 2010. ACM.
 Li et al. [2011] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 297–306. ACM, 2011.
 Liu et al. [2018] Yao Liu, Omer Gottesman, Aniruddh Raghu, Matthieu Komorowski, Aldo A Faisal, Finale Doshi-Velez, and Emma Brunskill. Representation balancing MDPs for off-policy policy evaluation. In Advances in Neural Information Processing Systems, pages 2649–2658, 2018.
 Lopez-Paz and Oquab [2017] David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. In International Conference on Learning Representations, 2017.
 Ma and Wang [2018] Xinwei Ma and Jingshen Wang. Robust inference using inverse probability weighting. arXiv preprint arXiv:1810.11397, 2018.
 Menon and Ong [2016] Aditya Menon and Cheng Soon Ong. Linking losses for density ratio and classprobability estimation. In International Conference on Machine Learning, pages 304–313, 2016.
 Qin [1998] Jing Qin. Inferences for casecontrol and semiparametric twosample density ratio models. Biometrika, 85(3):619–630, 1998.
 Reid and Williamson [2010] Mark D Reid and Robert C Williamson. Composite binary losses. Journal of Machine Learning Research, 11(Sep):2387–2422, 2010.
 Rosenbaum and Rubin [1983] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
 Siebert [1987] J.P. Siebert. Vehicle Recognition Using Rule Based Methods. TIRM–87018. Turing Institute, 1987.
 Smith and Todd [2005] Jeffrey A Smith and Petra E Todd. Does matching overcome lalonde’s critique of nonexperimental estimators? Journal of econometrics, 125(12):305–353, 2005.
 Sugiyama et al. [2008] Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4):699–746, 2008.
 Sugiyama et al. [2012] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine learning. Cambridge University Press, 2012.
 Swaminathan and Joachims [2015] Adith Swaminathan and Thorsten Joachims. The selfnormalized estimator for counterfactual learning. In advances in neural information processing systems, pages 3231–3239, 2015.
 Tan [2010] Zhiqiang Tan. Bounded, efficient and doubly robust estimation with inverse weighting. Biometrika, 97(3):661–682, 2010.
 Tewari and Murphy [2017] Ambuj Tewari and Susan A Murphy. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, pages 495–517. Springer, 2017.

Thomas and Brunskill [2016]
Philip Thomas and Emma Brunskill.
Dataefficient offpolicy policy evaluation for reinforcement learning.
In International Conference on Machine Learning, pages 2139–2148, 2016.  Thomas [2015] Philip S Thomas. Safe reinforcement learning. PhD thesis, University of Massachusetts Libraries, 2015.
 Tüfekci [2014] Pınar Tüfekci. Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. International Journal of Electrical Power and Energy Systems, 60:126 – 140, 2014.
 Wang et al. [2017] YuXiang Wang, Alekh Agarwal, and Miroslav Dudı˝́k. Optimal and adaptive offpolicy evaluation in contextual bandits. In International Conference on Machine Learning, pages 3589–3597, 2017.
 Wu and Wang [2018] Hang Wu and May Wang. Variance regularized counterfactual risk minimization via variational divergence minimization. In International Conference on Machine Learning, pages 5349–5358, 2018.
 Zubizarreta [2015] José R Zubizarreta. Stable weights that balance covariates for estimation with incomplete outcome data. Journal of the American Statistical Association, 110(511):910–922, 2015.
References
Acharya et al. [2019] Mohan Acharya, Asfia Armaan, and Aneeta Anthony. A comparison of regression models for prediction of graduate admissions. IEEE International Conference on Computational Intelligence in Data Science, 2019.
Bang and Robins [2005] Heejung Bang and James M Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005.
Bickel et al. [2007] Steffen Bickel, Michael Brückner, and Tobias Scheffer. Discriminative learning for differing training and test distributions. In Proceedings of the 24th International Conference on Machine Learning, pages 81–88. ACM, 2007.
Bickel et al. [2009] Steffen Bickel, Michael Brückner, and Tobias Scheffer. Discriminative learning under covariate shift. Journal of Machine Learning Research, 10(Sep):2137–2155, 2009.
Bregman [1967] L.M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.
Buja et al. [2005] Andreas Buja, Werner Stuetzle, and Yi Shen. Loss functions for binary class probability estimation and classification: Structure and applications. Working draft, November 3, 2005.
Cao et al. [2009] Weihua Cao, Anastasios A Tsiatis, and Marie Davidian. Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika, 96(3):723–734, 2009.
Cochran [1977] William G Cochran. Sampling Techniques. Wiley, 1977.
Cortez et al. [2009] Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547–553, 2009.
Dimakopoulou et al. [2018] Maria Dimakopoulou, Zhengyuan Zhou, Susan Athey, and Guido Imbens. Balanced linear contextual bandits. arXiv preprint arXiv:1812.06227, 2018.
Dua and Graff [2017] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
Dudík et al. [2011] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pages 1097–1104, 2011.
Dudík et al. [2014] Miroslav Dudík, Dumitru Erhan, John Langford, Lihong Li, et al. Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485–511, 2014.
Farajtabar et al. [2018] Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, pages 1446–1455, 2018.
Fong et al. [2018] Christian Fong, Chad Hazlett, Kosuke Imai, et al. Covariate balancing propensity score for a continuous treatment: Application to the efficacy of political advertisements. The Annals of Applied Statistics, 12(1):156–177, 2018.
Friedman [2004] Jerome Friedman. On multivariate goodness-of-fit and two-sample testing. Technical report, Stanford Linear Accelerator Center, Menlo Park, CA (US), 2004.
Gretton et al. [2009] Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. Dataset Shift in Machine Learning, 3(4):5, 2009.
Gretton et al. [2012] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
Hainmueller [2012] Jens Hainmueller. Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis, 20(1):25–46, 2012.
Hájek et al. [1964] Jaroslav Hájek et al. Asymptotic theory of rejective sampling with varying probabilities from a finite population. The Annals of Mathematical Statistics, 35(4):1491–1523, 1964.
Hirano and Imbens [2004] Keisuke Hirano and Guido W Imbens. The propensity score with continuous treatments. Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives, 226164:73–84, 2004.
Huang et al. [2007] Jiayuan Huang, Arthur Gretton, Karsten Borgwardt, Bernhard Schölkopf, and Alex J Smola. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems, pages 601–608, 2007.
Huszár [2013] Ferenc Huszár. Scoring rules, divergences and information in Bayesian machine learning. PhD thesis, University of Cambridge, 2013.
Imai and Ratkovic [2014] Kosuke Imai and Marc Ratkovic. Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):243–263, 2014.
Imbens [2000] Guido W Imbens. The role of the propensity score in estimating dose-response functions. Biometrika, 87(3):706–710, 2000.
Kallus and Zhou [2018] Nathan Kallus and Angela Zhou. Policy evaluation and optimization with continuous treatments. In International Conference on Artificial Intelligence and Statistics, pages 1243–1251, 2018.
Kallus [2018] Nathan Kallus. Balanced policy evaluation and learning. In Advances in Neural Information Processing Systems, pages 8909–8920, 2018.
Kanamori et al. [2009] Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10(Jul):1391–1445, 2009.
Kang et al. [2007] Joseph DY Kang, Joseph L Schafer, et al. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22(4):523–539, 2007.
Kaya et al. [2012] Heysem Kaya, Pınar Tüfekci, and Fikret S Gürgen. Local and global learning methods for predicting power of a combined gas & steam turbine. In Proceedings of the International Conference on Emerging Trends in Computer and Electronics Engineering ICETCEE, pages 13–18, 2012.
Langford and Zhang [2007] John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Proceedings of the 20th International Conference on Neural Information Processing Systems, pages 817–824. Citeseer, 2007.
Li et al. [2010] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW '10, pages 661–670, New York, NY, USA, 2010. ACM.
Li et al. [2011] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 297–306. ACM, 2011.
Liu et al. [2018] Yao Liu, Omer Gottesman, Aniruddh Raghu, Matthieu Komorowski, Aldo A Faisal, Finale Doshi-Velez, and Emma Brunskill. Representation balancing MDPs for off-policy policy evaluation. In Advances in Neural Information Processing Systems, pages 2649–2658, 2018.
Lopez-Paz and Oquab [2017] David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. In International Conference on Learning Representations, 2017.
Ma and Wang [2018] Xinwei Ma and Jingshen Wang. Robust inference using inverse probability weighting. arXiv preprint arXiv:1810.11397, 2018.
Menon and Ong [2016] Aditya Menon and Cheng Soon Ong. Linking losses for density ratio and class-probability estimation. In International Conference on Machine Learning, pages 304–313, 2016.
Qin [1998] Jing Qin. Inferences for case-control and semiparametric two-sample density ratio models. Biometrika, 85(3):619–630, 1998.
Reid and Williamson [2010] Mark D Reid and Robert C Williamson. Composite binary losses. Journal of Machine Learning Research, 11(Sep):2387–2422, 2010.
Rosenbaum and Rubin [1983] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
Siebert [1987] J.P. Siebert. Vehicle Recognition Using Rule Based Methods. TIRM-87-018. Turing Institute, 1987.
Smith and Todd [2005] Jeffrey A Smith and Petra E Todd. Does matching overcome LaLonde's critique of nonexperimental estimators? Journal of Econometrics, 125(1-2):305–353, 2005.
Sugiyama et al. [2008] Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4):699–746, 2008.
Sugiyama et al. [2012] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012.
Swaminathan and Joachims [2015] Adith Swaminathan and Thorsten Joachims. The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems, pages 3231–3239, 2015.
Tan [2010] Zhiqiang Tan. Bounded, efficient and doubly robust estimation with inverse weighting. Biometrika, 97(3):661–682, 2010.
Tewari and Murphy [2017] Ambuj Tewari and Susan A Murphy. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, pages 495–517. Springer, 2017.
Thomas and Brunskill [2016] Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148, 2016.
Thomas [2015] Philip S Thomas. Safe Reinforcement Learning. PhD thesis, University of Massachusetts Libraries, 2015.
Tüfekci [2014] Pınar Tüfekci. Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. International Journal of Electrical Power and Energy Systems, 60:126–140, 2014.
Wang et al. [2017] Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. Optimal and adaptive off-policy evaluation in contextual bandits. In International Conference on Machine Learning, pages 3589–3597, 2017.
Wu and Wang [2018] Hang Wu and May Wang. Variance regularized counterfactual risk minimization via variational divergence minimization. In International Conference on Machine Learning, pages 5349–5358, 2018.
Zubizarreta [2015] José R Zubizarreta. Stable weights that balance covariates for estimation with incomplete outcome data. Journal of the American Statistical Association, 110(511):910–922, 2015.
Appendix A
A.1 Proofs of technical results
Here, we provide technical proofs of the propositions in Section 3.
Proposition 5.
Let be the class conditional and be the class conditional with marginal class probability . Let be the joint distribution over , decomposed into and and the marginal . Under assumption A5, for any scorer , , where . The proof can be found in Menon and Ong [2016].
a.1.1 Proof of Proposition 2
Because each weight in the denominator is consistent for 1, their sum is consistent for . By the continuous mapping theorem, it therefore suffices to consider the expectation of a single term in the numerator.
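This consistency argument can be checked with a small Monte Carlo sketch. The binary-action logging policy, target policy, and reward below are hypothetical choices for illustration only: the density-ratio weights average to 1 under the logging policy, and the self-normalized estimator recovers the on-sample value of the target policy.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Illustrative setup: scalar state s, binary action a.
s = rng.uniform(size=n)
p_mu = np.full(n, 0.5)            # logging policy: P(a=1 | s) = 0.5
p_pi = 0.2 + 0.6 * s              # target policy: P(a=1 | s)
a = (rng.uniform(size=n) < p_mu).astype(float)
r = a * s + rng.normal(scale=0.1, size=n)   # reward; mean a*s

# Policy-ratio weights pi(a|s)/mu(a|s); E[w | s] = 1 under the
# logging policy, so the denominator sum is consistent for n.
w = np.where(a == 1.0, p_pi / p_mu, (1 - p_pi) / (1 - p_mu))
assert abs(w.mean() - 1.0) < 0.01

ips = np.mean(w * r)                   # plain importance-sampling estimate
hajek = np.sum(w * r) / np.sum(w)      # self-normalized (Hajek) estimate
true_value = np.mean(p_pi * s)         # on-sample value of the target policy
assert abs(ips - true_value) < 0.02
assert abs(hajek - true_value) < 0.01
```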
Recall that denotes the true density ratio and is the estimated density ratio. Further let . First, we consider the discrete action setting. We can express the expectation as:
We can show that the first term is equal to the policy value of , while the second term provides the estimator’s bias. Considering the first term, we have:
where denotes .
Now, considering the bias term, and bounding with the Bregman divergence between and , we have:
We now move on to the continuous action setting. We can express the expectation as:
We can show that the first term is equal to the true counterfactual policy value, while the second term describes the bias induced from estimating the density ratio. Considering the first term, we have:
Let . Thus, and . Then, taking a second-order Taylor expansion of around :
This result follows similarly to those in Kallus and Zhou [2018], by properties of kernels, boundedness of rewards, and the fact that has a bounded second derivative with respect to .
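The kernel-smoothed estimator analyzed here can be sketched numerically. The continuous-action logging policy, deterministic target policy, and quadratic reward below are illustrative assumptions; the sketch shows the estimator landing within the O(h²) smoothing bias of the on-sample target value.

```python
import numpy as np

rng = np.random.default_rng(1)
n, h = 200_000, 0.2               # sample size and kernel bandwidth

# Illustrative setup: logging policy draws a ~ N(s, 1);
# the target policy deterministically plays pi(s) = 0.5 * s.
s = rng.uniform(size=n)
a = s + rng.normal(size=n)
r = 1.0 - (a - s) ** 2 + rng.normal(scale=0.1, size=n)

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

a_pi = 0.5 * s                        # target-policy actions
mu_density = gaussian_kernel(a - s)   # logging density at the logged action

# Kernel-smoothed IPS (Kallus and Zhou, 2018): the indicator
# 1{a = pi(s)} is replaced by K((a - pi(s)) / h) / h.
weights = gaussian_kernel((a - a_pi) / h) / (h * mu_density)
estimate = np.mean(weights * r)
truth = np.mean(1.0 - (a_pi - s) ** 2)   # mean reward at target actions

# Smoothing bias is of order h^2 (here about 0.04), so the estimate
# is close to, but not exactly at, the on-sample target value.
assert abs(estimate - truth) < 0.1
```

Shrinking h reduces the smoothing bias at the price of larger weights, which is exactly the bias-variance trade-off quantified in the surrounding propositions.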
Now, considering the bias term, we use the same substitution and Taylor expansion as before. We also bound by the Bregman divergence between and , yielding:
a.1.2 Proof of Proposition 3
We consider the second moment of a single numerator term, writing the estimator in terms of and as above. We first consider the discrete action setting. The variance of the estimator is then bounded by:
Next, we consider the second moment of a term in the estimator in the continuous action setting:
We substitute as before. Then, and .
Next, we apply a second-order Taylor series expansion of , , and around . Given that these functions have bounded second derivatives, we can bound the remainder by , as in Kallus and Zhou [2018]. This yields: