Balanced Off-Policy Evaluation in General Action Spaces

06/09/2019
by   Arjun Sondhi, et al.
Facebook
University of Washington

In many practical applications of contextual bandits, online learning is infeasible and practitioners must rely on off-policy evaluation (OPE) of logged data collected from prior policies. OPE generally consists of a combination of two components: (i) directly estimating a model of the reward given state and action and (ii) importance sampling. While recent work has made significant advances in adaptively combining these two components, less attention has been paid to improving the quality of the importance weights themselves. In this work we present balancing off-policy evaluation (BOP-e), an importance sampling procedure that directly optimizes for balance and can be plugged into any OPE estimator that uses importance sampling. BOP-e directly estimates the importance sampling ratio via a classifier which attempts to distinguish state-action pairs from an observed versus a proposed policy. BOP-e can be applied to continuous, mixed, and multi-valued action spaces without modification and is easily scalable to many observations. Further, we show that minimization of regret in the constructed binary classification problem translates directly into minimizing regret in the off-policy evaluation task. Finally, we provide experimental evidence that BOP-e outperforms inverse propensity weighting-based approaches for offline evaluation of policies in the contextual bandit setting under both discrete and continuous action spaces.


1 Background and Problem Description

We will assume a contextual bandit setup, where our data consist of independent observations $(s_i, a_i, r_i)$, $i = 1, \dots, n$. For each unit, a state $s_i$ is observed, an action $a_i$ is taken in accordance with some policy $\pi$, and a reward $r_i$ is observed in response. We use the notation $\pi$ to refer to both a policy and its density, and use $\pi(s)$ to denote the action that would be taken under policy $\pi$ for a state $s$.

The problem addressed is as follows: given a proposed policy $\pi_1$ and observed data collected following a policy $\pi_0$, estimate the expected reward of instead following $\pi_1$ on the observed states. We denote the reward function as $\rho(s, a)$ and an estimated reward function as $\hat{\rho}(s, a)$.

We assume the following throughout:

A 1.

A 2.

A 3.


A 4.

The distribution of rewards across potential actions is independent of policy, conditional on state.

1.1 Off-Policy Estimation

We now briefly review the three broad classes of off-policy estimation: direct modeling, importance sampling, and doubly robust estimation. Throughout this section we assume that $\{(s_i, a_i, r_i)\}_{i=1}^{n}$ are data collected under the observed policy $\pi_0$, and that $\pi_1(s_i)$ is the action that would be taken under the proposed policy $\pi_1$.

The direct method approach to this problem fits a regression model $\hat{\rho}(s, a)$ to approximate the reward function under the observed policy $\pi_0$. The counterfactual policy value, $V(\pi_1)$, is estimated by predicting the rewards that would have been observed under the actions of policy $\pi_1$, i.e., $\hat{V}_{\mathrm{DM}} = \frac{1}{n} \sum_{i=1}^{n} \hat{\rho}(s_i, \pi_1(s_i))$. In order for the resulting estimate to be consistent, the reward model needs to generalize well to the reward distribution that would be observed under policy $\pi_1$. In practice, this method can be badly biased if the observed state-action data do not adequately represent the counterfactual distribution (Dudík et al., 2011).
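
To make the estimator concrete, the following is a minimal sketch of the direct method under the notation above; the array names, the choice of a scikit-learn gradient boosted regressor, and the `target_policy` callable are illustrative assumptions, not the paper's implementation.

```python
# Direct method (DM) sketch: fit rho_hat on logged data, then average its
# predictions at the proposed policy's actions. Assumes numerically encoded
# actions and NumPy arrays; names are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def direct_method_value(states, actions, rewards, target_policy):
    # Fit a reward model rho_hat(s, a) on the logged (state, action, reward) data.
    X = np.column_stack([states, actions])
    reward_model = GradientBoostingRegressor().fit(X, rewards)
    # Plug in the actions the proposed policy would take on the observed states.
    proposed_actions = np.array([target_policy(s) for s in states])
    X_proposed = np.column_stack([states, proposed_actions])
    return reward_model.predict(X_proposed).mean()
```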

Importance sampling is another approach, which reweights the observed rewards by an inverse propensity score (IPS) and a rejection sampling term, i.e., $\hat{V}_{\mathrm{IPS}} = \frac{1}{n} \sum_{i=1}^{n} \frac{\mathbb{1}\{\pi_1(s_i) = a_i\}}{\pi_0(a_i \mid s_i)} r_i$. Importance sampling, while unbiased, often suffers from high variance. The weighted importance sampling estimator (also called the “self-normalized” or Hájek estimator) has been used to reduce variance, at the cost of a small bias, while maintaining consistency (Swaminathan and Joachims, 2015; Cochran, 1977): $\hat{V}_{\mathrm{WIS}} = \sum_{i=1}^{n} w_i r_i \big/ \sum_{i=1}^{n} w_i$, with $w_i = \mathbb{1}\{\pi_1(s_i) = a_i\} / \pi_0(a_i \mid s_i)$. For continuous action spaces, Kallus and Zhou (2018) recently proposed an IPS-based method that replaces the indicator function with a kernel smoothing term, i.e., $w_i = \frac{1}{h} K\!\left(\frac{\pi_1(s_i) - a_i}{h}\right) \big/ \pi_0(a_i \mid s_i)$ for a kernel $K$ and bandwidth $h$. The corresponding weighted importance sampling estimator is defined analogously.
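
The following sketch implements the (self-normalized) IPS estimator and its kernel-smoothed continuous-action variant, assuming logged arrays of actions, rewards, and behaviour propensities; the Gaussian kernel and the function names are illustrative choices, not taken from the paper.

```python
# Importance sampling sketches under the notation above; names illustrative.
import numpy as np

def ips_value(actions, rewards, pi0, proposed_actions, self_normalize=True):
    # Discrete actions: rejection-sampling term 1{pi_1(s_i) = a_i}.
    accept = (actions == proposed_actions).astype(float)
    w = accept / pi0
    if self_normalize:  # Hajek / weighted importance sampling estimator
        return np.sum(w * rewards) / np.sum(w)
    return np.mean(w * rewards)

def kernel_ips_value(actions, rewards, pi0, proposed_actions, h):
    # Continuous actions: replace the indicator with K((pi_1(s_i) - a_i) / h) / h,
    # as in Kallus and Zhou (2018); a Gaussian kernel is used for illustration.
    u = (proposed_actions - actions) / h
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    w = kernel / (h * pi0)
    return np.sum(w * rewards) / np.sum(w)
```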

Finally, doubly robust estimators combine the direct method and importance sampling. These tend to have lower variance, and are consistent if either the direct method regression model or the importance sampling weights are correctly specified (Dudík et al., 2014; Thomas and Brunskill, 2016). For discrete or continuous action spaces, the reward is estimated as (Dudík et al., 2014) $\hat{V}_{\mathrm{DR}} = \frac{1}{n} \sum_{i=1}^{n} \left[ \hat{\rho}(s_i, \pi_1(s_i)) + \frac{\delta(a_i, \pi_1(s_i))}{\pi_0(a_i \mid s_i)} \bigl(r_i - \hat{\rho}(s_i, a_i)\bigr) \right]$, where $\delta(a_i, \pi_1(s_i))$ is a suitable rejection sampling term. The SWITCH estimator of Wang et al. (2017) uses IPS unless the weight is too large, in which case it falls back to the direct method.
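
A minimal sketch of the doubly robust combination, assuming a fitted reward model `rho_hat(s, a)` and the same logged arrays as above; this illustrates the standard DR form rather than the authors' code.

```python
# Doubly robust (DR) sketch: direct-method prediction plus an importance-
# weighted correction on the logged rewards. Names are illustrative.
import numpy as np

def doubly_robust_value(states, actions, rewards, pi0, proposed_actions, rho_hat):
    accept = (actions == proposed_actions).astype(float)   # rejection term
    w = accept / pi0
    dm = np.array([rho_hat(s, a) for s, a in zip(states, proposed_actions)])
    logged_pred = np.array([rho_hat(s, a) for s, a in zip(states, actions)])
    return np.mean(dm + w * (rewards - logged_pred))
```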

2 Balancing Importance Sampling

There are several weaknesses in existing approaches that leverage importance sampling with inverse propensity scores. First, the probability of some observed actions for some observed states may be very close to 0 or 1, leading to instability and small-sample bias of the propensity score model (Ma and Wang, 2018). Second, the propensity score model must be correctly specified. In the absence of this, prior work has shown that the performance of IPS can be arbitrarily bad, because there is no guarantee of balance under a misspecified propensity score (Kang et al., 2007; Smith and Todd, 2005; Imai and Ratkovic, 2014). In particular, a misspecified IPS model will not, in general, ensure that the weighted state-action distribution of the observed policy matches that of the proposed policy. This implies that policy evaluation will be incorrect, as it reflects the performance of a policy on the wrong state distribution.

Using doubly robust estimation partially addresses the case of a misspecified propensity model. However, while doubly robust estimators provide consistent estimates when either the direct method or the propensity score model is unbiased, they do not protect against failure of both. To address this weakness, recent work has focused on weighting estimators that explicitly optimize for balance, seeking weighting functions that make the choice of action independent of the observed contexts (Liu et al., 2018; Kallus, 2018). These estimators have been shown to provide strong results in their respective applications even under misspecification. However, their use is limited to discrete action spaces, and they often involve hyperparameters that must be set by heuristics or are computationally intractable. To remedy this, we now describe BOP-e, which defines a class of balancing importance samplers for off-policy evaluation.

BOP-e leverages classifier-based density ratio estimation (Sugiyama et al., 2012; Menon and Ong, 2016) to learn importance sampling ratios. Specifically, off-policy evaluation using BOP-e consists of four steps:

  1. Create a supervised learning problem using the concatenated proposed policy instances $\{(s_i, \pi_1(s_i))\}_{i=1}^{n}$ and observed policy instances $\{(s_i, a_i)\}_{i=1}^{n}$ as covariates, giving a label ($c$) of 0 to the observed policy and 1 to the proposed policy.

  2. Learn a classifier $\hat{\eta}(s, a)$ to distinguish between the observed and proposed policy.

  3. Take the importance sampling ratio as $\hat{w}(s, a) = \hat{\eta}(s, a) / (1 - \hat{\eta}(s, a))$.

  4. Take the off-policy estimate as $\hat{V}_{\text{BOP-e}} = \sum_{i=1}^{n} \hat{w}(s_i, a_i)\, \delta(a_i, \pi_1(s_i))\, r_i \big/ \sum_{i=1}^{n} \hat{w}(s_i, a_i)\, \delta(a_i, \pi_1(s_i))$ (see the sketch following this list),

where $\delta(a_i, \pi_1(s_i))$ defines a rejection sampling term between the observed action $a_i$ and the proposed action $\pi_1(s_i)$. For discrete action spaces, this is simply $\mathbb{1}\{a_i = \pi_1(s_i)\}$. For continuous actions, we use the kernel term of Kallus and Zhou, that is, $\frac{1}{h} K\!\left(\frac{\pi_1(s_i) - a_i}{h}\right)$, where $K$ is some kernel function. A corresponding doubly robust estimator can also be constructed. We can see how step three arrives at the importance sampler through an application of Bayes' rule (Bickel et al., 2009): $\frac{\eta(s, a)}{1 - \eta(s, a)} = \frac{p(s, a \mid c = 1)\, p(c = 1)}{p(s, a \mid c = 0)\, p(c = 0)} = \frac{\pi_1(a \mid s)}{\pi_0(a \mid s)}$, where $p(c = 1) = p(c = 0) = 1/2$ by design.
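
As a concrete illustration, the following is a minimal sketch of the four steps for discrete actions, using an XGBoost classifier as in the experiments; the function and variable names are illustrative, and in practice the classifier should be fit on a sample independent of the evaluation rewards (see Assumption 7 below).

```python
# BOP-e sketch for discrete, numerically encoded actions; names illustrative.
import numpy as np
from xgboost import XGBClassifier

def bop_e_value(states, actions, rewards, proposed_actions):
    # Step 1: stack observed (label 0) and proposed (label 1) state-action pairs.
    X_obs = np.column_stack([states, actions])
    X_prop = np.column_stack([states, proposed_actions])
    X = np.vstack([X_obs, X_prop])
    c = np.concatenate([np.zeros(len(X_obs)), np.ones(len(X_prop))]).astype(int)

    # Step 2: learn a classifier to distinguish the two policies.
    clf = XGBClassifier().fit(X, c)

    # Step 3: importance ratio via Bayes' rule, eta / (1 - eta); the class
    # prior odds are 1 by construction. Clip for numerical stability.
    eta = np.clip(clf.predict_proba(X_obs)[:, 1], 1e-6, 1 - 1e-6)
    w = eta / (1.0 - eta)

    # Step 4: self-normalized reweighting with the rejection-sampling term.
    accept = (actions == proposed_actions).astype(float)
    return np.sum(w * accept * rewards) / np.sum(w * accept)
```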

As described, this procedure provides a large degree of flexibility to practitioners, requiring only that a classification model be learned. The question of which classifiers fit within this framework is answered by the following assumption:

A 5.

The classifier is trained using a strictly proper composite loss¹, $\ell$, with a twice-differentiable Bayes risk, $\underline{L}$.

¹A loss is strictly proper composite if the Bayes-optimal score is given by $v^{*}(s, a) = \Psi(\eta(s, a))$, where $\Psi$ is an invertible link function. Readers should see Buja et al. (2005) and Reid and Williamson (2010) for complete treatments of strictly proper composite losses.

This assumption allows for a large number of widely used loss functions, such as logistic, exponential, and mean squared error, as well as models commonly used for distribution comparison, such as the kernel-based density ratio estimators of Sugiyama et al. (2012) and maximum mean discrepancy (Kallus, 2018).
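
As a concrete instance of Assumption 5 (an illustration, not reproduced from the paper), the logistic loss is strictly proper composite with a logit link and a twice-differentiable Bayes risk:

$$\ell(y, v) = \log\bigl(1 + e^{-y v}\bigr), \quad y \in \{-1, +1\}, \qquad v^{*}(s, a) = \log\frac{\eta(s, a)}{1 - \eta(s, a)} = \Psi\bigl(\eta(s, a)\bigr),$$

$$\underline{L}(\eta) = \min_{v} \Bigl[ \eta\, \ell(1, v) + (1 - \eta)\, \ell(-1, v) \Bigr] = -\eta \log \eta - (1 - \eta) \log (1 - \eta),$$

so the link $\Psi$ is the logit and the Bayes risk is twice differentiable on $(0, 1)$.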

Given that BOP-e targets the policy density ratio, it optimizes a measure of balance, namely the discrepancy between the reweighted source distribution and the target distribution, as described in the following proposition:

Proposition 1.

Let and be real-valued functions of and , respectively. The functional discrepancy between the observed policy and the proposed policy under BOP-e is given by where is a Bregman divergence.

The proof for this proposition can be found in the supplement. When $\hat{w}$ equals the true density ratio $w^{*}(s, a) = \pi_1(a \mid s) / \pi_0(a \mid s)$, the discrepancy trivially reduces to 0. Thus, the degree to which balance is attained is governed by the quality of the approximation of $\hat{w}$ to $w^{*}$. The upper bound involves a Bregman divergence which depends on the classifier used. We discuss this further in the next section, when connecting the minimization of imbalance to the bias and variance of BOP-e.

While the BOP-e procedure as described above gives an importance sampling estimator, the resulting weights can be used in any off-policy method that uses importance weights. Extension to doubly robust estimation is trivial, as is extension to methods which adaptively combine direct method predictions and importance sampling weights, such as the SWITCH estimator of Wang et al. (2017). In Section 5, we implement BOP-e within both of these frameworks and compare to using inverse propensity score (IPS) weights.

3 Estimator Analysis and Asymptotics

In this section, we describe the statistical properties of our estimator and prove consistency for the target policy value. Let $\eta(s, a)$ denote the true class probability of observing data under the target policy $\pi_1$ instead of the behaviour policy $\pi_0$. This is estimated with a probabilistic classifier $\hat{\eta}(s, a)$ fit on labelled state-action data. Additionally, let $w^{*}(s, a) = \pi_1(a \mid s) / \pi_0(a \mid s)$ denote the true policy density ratio, with estimator $\hat{w}(s, a) = \hat{\eta}(s, a) / (1 - \hat{\eta}(s, a))$. We assume the classifier has regret that decays with increasing $n$.

A 6.

Let be a probabilistic classifier such that for some constant .

Next, we require that our importance sampling weight estimator, $\hat{w}$, is independent of the observed rewards $r_i$. This can easily be achieved through sample splitting, training the classifier and applying BOP-e on independent subsets of the data.

A 7.

Given the observed state-action data, the density ratio estimator $\hat{w}$ is independent of the observed rewards $r_1, \dots, r_n$.
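
A minimal sketch of the sample splitting that satisfies Assumption 7; the helpers `fit_ratio_model` and `evaluate` are hypothetical placeholders for fitting the BOP-e classifier and applying the resulting weights.

```python
# Sample splitting sketch: the density-ratio classifier never sees the
# rewards it is later used to reweight. Helper names are hypothetical.
import numpy as np
from sklearn.model_selection import train_test_split

def split_then_evaluate(states, actions, rewards, proposed_actions,
                        fit_ratio_model, evaluate):
    idx = np.arange(len(rewards))
    fit_idx, eval_idx = train_test_split(idx, test_size=0.5, random_state=0)
    # Fit eta_hat on the first fold only; its rewards are never used.
    ratio_model = fit_ratio_model(states[fit_idx], actions[fit_idx],
                                  proposed_actions[fit_idx])
    # Apply the learned weights to the held-out fold's rewards.
    return evaluate(ratio_model, states[eval_idx], actions[eval_idx],
                    rewards[eval_idx], proposed_actions[eval_idx])
```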

Finally, we require certain regularity conditions and rates to use in our theoretical results.

A 8.

(i) The functions , and have bounded second derivatives with respect to , and (ii) In the continuous action domain, the bandwidth parameter .

We now show that the BOP-e estimator is asymptotically unbiased and derive a bound for its variance. We accomplish this by characterizing the asymptotic quantities in terms of the Bregman divergence between the estimated and true density ratios. In the propositions below, we write $w^{*}$ for the true density ratio $w^{*}(s_i, a_i)$ and $\hat{w}$ for its estimate $\hat{w}(s_i, a_i)$.

Proposition 2.

In discrete action spaces, the expected bias of obeys the following bound: . In continuous action spaces, the expected bias of obeys the following bound

Proposition 3.

In discrete action spaces, the variance of obeys the following bound . In continuous action spaces, the variance of obeys the following bound where .

The proofs are deferred to the supplement. The implication of Proposition 2 is that the expected bias of BOP-e is bounded from above by the Bregman divergence between the true density ratio of the observed and proposed policies and the model's estimate of that ratio. The specific Bregman divergence depends on the choice of classifier; for example, a logistic regression classifier would imply a KL-divergence. We note that Bregman divergences define a wide variety of divergences, including the KL-divergence and maximum mean discrepancy (Huszár, 2013), that are often considered in the analysis of off-policy evaluation and covariate shift (Kallus, 2018; Bickel et al., 2009; Gretton et al., 2009). We can then appeal to Proposition 3 of Menon and Ong (2016), which provides an explicit link between the risk of the classifier and the Bregman divergence between $w^{*}$ and $\hat{w}$.
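
For reference (our notation, not reproduced from the paper), the Bregman divergence generated by a convex function $f$ is

$$B_{f}(u, v) = f(u) - f(v) - \nabla f(v)^{\top}(u - v),$$

and taking $f$ to be the negative entropy, $f(u) = \sum_j u_j \log u_j$, yields the generalized KL divergence $B_{f}(u, v) = \sum_j \bigl( u_j \log \tfrac{u_j}{v_j} - u_j + v_j \bigr)$ mentioned above.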

We now prove our main result below:

Proposition 4.

Under Assumptions 1-8, and with bounded variance of the Bregman divergence, the BOP-e estimator is consistent for the counterfactual policy value; that is, $\hat{V}_{\text{BOP-e}} \overset{p}{\to} V(\pi_1)$ as $n \to \infty$.

Proof sketch.

This result follows by leveraging Propositions 2 and 3 to characterize the asymptotic bias and variance of $\hat{V}_{\text{BOP-e}}$. Then, we use Proposition 3 of Menon and Ong (2016) to connect the Bregman loss to the error of the classifier. Therefore, under Assumption 6, the bias and variance vanish as $n \to \infty$, and the mean squared error of $\hat{V}_{\text{BOP-e}}$ tends to zero. ∎

The full proof and technical details for these results can be found in the supplement. It is worth briefly discussing the implications of Propositions 1-3 combined with Proposition 3 of Menon and Ong (2016) which ties classifier risk to the quality of the density ratio estimate. Proposition 1 implies that optimizing classifier performance directly translates into optimizing the quality of the importance sampler. This provides a powerful property for BOP-e: the bias and variance of the estimated policy evaluation can be minimized by optimizing for classifier performance. Because the classifier risk is directly tied to the quality of the off-policy estimate, the problem is essentially reduced to model selection for supervised learning.

4 Related Work

Related work can roughly be divided into three categories: off-policy evaluation of contextual bandits, balancing estimators, and density ratio estimation. The most closely related is prior work on off-policy evaluation for contextual bandits. Li et al. (2011) introduced the use of rejection sampling for offline evaluation of contextual bandit problems. Within the causal inference community there is a long literature on the use of doubly robust estimators (cf. Bang and Robins (2005), Kang et al. (2007), Tan (2010), Cao et al. (2009)). Dudík et al. (2011) later proposed the use of doubly robust estimation for off-policy evaluation of contextual bandits, combining the doubly robust estimator of causal effects with a rejection sampler. Since then, several works have sought to reduce the variance and improve the robustness of the doubly robust estimator. Farajtabar et al. (2018) and Wang et al. (2017) present work to minimize the variance of the estimators by reducing the dependence on the inverse propensity score in high variance settings. Swaminathan and Joachims (2015) use a Hájek-style estimator (Hájek and others, 1964). Later work from Thomas (2015) and Swaminathan and Joachims (2015) builds on this to improve estimation.

A second related line of work is balancing estimators. Under correct specification of the conditional model, Rosenbaum and Rubin (1983) show that the propensity score is balancing. More recently, a growing literature seeks to develop balancing estimators which are robust to misspecification. Hainmueller (2012) and Zubizarreta (2015) provide optimization-based procedures which define weights that are balancing but are not necessarily valid propensity scores. Imai and Ratkovic (2014) later defined an estimator which strives to find a valid propensity score subject to balancing constraints. This was extended to general treatment regimes by Fong et al. (2018). However, none of these directly address the problem of off-policy evaluation for contextual bandits. Kallus (2018) introduces a method for balanced policy evaluation that relies on a regularized estimator that seeks to minimize the maximum mean discrepancy (Gretton et al., 2012). Calculation of weights is achieved through a quadratic program, which presents computational challenges as the sample size grows large. It is interesting to note that the proposed evaluation optimization of Kallus (2018) fits within the assumptions of BOP-e, where the scoring rule is maximum mean discrepancy (a strictly proper scoring rule) and the model is learned with variance regularization; the accompanying classifier can be defined via a modification of support vector machine classification (Bickel et al., 2009). Dimakopoulou et al. (2018) propose balancing in the context of online learning of linear contextual bandits by reweighting based on the propensity score. This differs from our work in its focus on online learning rather than policy evaluation and in its use of a linear model-based propensity score, which provides mean balance only in the case of correct specification. Wu and Wang (2018) propose a method which seeks to minimize a divergence between policies in order to minimize regret, similar to the target in this work. However, in the setting of Wu and Wang (2018) access to the true propensities is assumed, whereas BOP-e estimates the density ratio directly from observed and proposed state-action pairs.

The final line of related work is density ratio estimation. The use of classification for density ratio estimation dates back to at least Qin (1998). Later work leverages classification for covariate shift adaptation (Bickel et al., 2007, 2009) and two-sample testing (Friedman, 2004; Lopez-Paz and Oquab, 2017). However, this work represents the first time classifier-based density ratio estimation has been used for off-policy evaluation. There is also a growing literature on density ratio estimators defined outside of the classification framework. These methods largely rely on kernels to perform estimation (Huang et al., 2007; Sugiyama et al., 2012). KL importance estimation (KLIEP) (Sugiyama et al., 2008) and least-squares importance fitting (LSIF) (Kanamori et al., 2009) are the most directly relevant, given their ability to optimize hyperparameters via cross-validation. Interestingly, Menon and Ong (2016) provide losses for classification-based density ratio estimation that recover KLIEP and LSIF. Thus, these estimators can be included inside of BOP-e by considering the corresponding loss functions for the classifier.

5 Experiments

In the experiments that follow, we evaluate direct method, importance sampling, and SWITCH estimators for off-policy evaluation. For the latter two methods, we compare inverse propensity score and BOP-e weights, and use the self-normalized versions of the estimators given in Section 1. We defer our results for doubly robust estimators to the supplement, but found the same trends in those evaluations. The direct method, propensity score, and BOP-e estimators are all trained as gradient boosted tree classifiers (or regressors for the continuous evaluations).

5.1 Discrete Action Spaces

Figure 1: Root mean-squared error (RMSE) and bias plots for discrete action spaces using the classifier trick of Dudík et al. (2014).

In this section, we evaluate the accuracy of our estimator for the value of an unobserved policy in the discrete reward setting. We employ the method of Dudík et al. (2011) to turn a k-class classification problem into a k-armed contextual bandit problem. We split our data, training a classifier on one half of the data (train). This classifier defines our target policy, wherein the action taken is the label predicted. The reward is defined as an indicator of whether the predicted label is the true label. The optimal policy, then, is to take an action equal to the true label in the original data. Evaluating this policy corresponds to estimating the classifier’s accuracy.

In the second half of the dataset (test) we retain only a ‘partially labeled’ dataset wherein we uniformly sample actions (labels) and observe the resulting rewards. The train half of the data is also used to train direct method, propensity score, and BOP-e models. These are then applied to the test data to estimate the relevant quantities for off-policy evaluation methods. We compare the expected reward estimates to the true mean reward of the target policy applied to the test data. For each dataset, this process is repeated over 100 iterations, where we vary the actions under the observed uniform policy.
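
The following is a minimal sketch of this classification-to-bandit transformation, assuming a feature matrix `X` and label vector `y` loaded as NumPy arrays; the random forest policy and uniform logging policy follow the description above, while the helper name and seed handling are illustrative.

```python
# Classification-to-bandit transformation sketch (Dudik et al., 2011);
# dataset handling and names are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def make_bandit_problem(X, y, seed=0):
    rng = np.random.default_rng(seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=seed)

    # Target policy: the predicted label of a classifier fit on the train half.
    policy = RandomForestClassifier(random_state=seed).fit(X_train, y_train)
    proposed_actions = policy.predict(X_test)

    # Logged data: uniformly random actions with reward 1{action == true label}.
    labels = np.unique(y)
    logged_actions = rng.choice(labels, size=len(y_test))
    rewards = (logged_actions == y_test).astype(float)

    # The ground-truth value of the target policy is its test accuracy.
    true_value = (proposed_actions == y_test).mean()
    return X_test, logged_actions, rewards, proposed_actions, true_value
```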

Our policy models are trained as random forest classifiers. These models use the default hyperparameter values from scikit-learn, with the exception of the number of trees, which we increase as a function of sample size in order to provide increasingly complex policies to evaluate. The propensity score, BOP-e, and direct method (one-vs-rest) models are gradient boosted decision trees with default XGBoost hyperparameters, with the exception of the number of boosting iterations, which is likewise set as a function of sample size in order to adapt the estimator to the size of the dataset. We use the same datasets from the UCI repository (Dua and Graff, 2017) used by Dudík et al. (2011), and summarize their characteristics in the supplement. The policy we evaluate is given by training a multi-class random forest model.

The performance results of the estimators are summarized in Figure 1, where we plot the root mean squared error and bias averaged over 100 iterations. We see that the direct method estimator tends to be heavily biased for the true policy value, compared to BOP-e and IPS. The direct method generally performs quite poorly in terms of overall accuracy. The standard BOP-e estimator performs at least as well as and typically better than the IPS estimator. This also holds for the corresponding SWITCH estimators. While BOP-e often has slightly higher bias than IPS, it strikes a better balance between bias and variance, leading to substantially improved accuracy in most cases.

5.2 Continuous Action Spaces

Figure 2: Root mean-squared error (RMSE) and bias plots for continuous action spaces, using a modification of the classifier trick of Dudík et al. (2014) for regression detailed in Section 5.2.

For the continuous action case, we provide a novel extension of the transformation employed in the previous section for the evaluation of discrete actions. We take a selection of datasets with continuous outcomes and train a predictive model on the train half of the data, which constitutes our target policy. The reward of a prediction (defined to be an action in our evaluation) is the negative of the Euclidean distance to the true outcome. Thus, the optimal policy is to choose actions equal to the true outcome, as in the discrete evaluation. Evaluating the target policy is then equivalent to estimating the prediction error of the model.

As before, we retain the test data for evaluation, while using the train data to train direct method, propensity score, and BOP-e models. For our observed policy, we sample actions from the empirical distribution of train outcomes, and compute the corresponding rewards. We then estimate the target policy value, repeating this over 300 iterations. We retain the same basic models from the previous section for this evaluation, swapping out classifiers for regressors as appropriate.
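
A minimal sketch of the continuous-action analogue of the transformation, under the same illustrative assumptions as the discrete sketch above (NumPy arrays `X` and `y`, hypothetical helper names):

```python
# Continuous-action transformation sketch: actions are real-valued
# predictions, the logged policy samples from the empirical distribution of
# training outcomes, and the reward is the negative distance to the truth.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def make_continuous_bandit_problem(X, y, seed=0):
    rng = np.random.default_rng(seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=seed)

    # Target policy: real-valued predictions of a regressor fit on the train half.
    policy = RandomForestRegressor(random_state=seed).fit(X_train, y_train)
    proposed_actions = policy.predict(X_test)

    # Logged policy: actions drawn from the empirical distribution of train outcomes.
    logged_actions = rng.choice(y_train, size=len(y_test))
    rewards = -np.abs(logged_actions - y_test)

    # Ground truth: the (negative) prediction error of the target policy.
    true_value = -np.abs(proposed_actions - y_test).mean()
    return X_test, logged_actions, rewards, proposed_actions, true_value
```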

We use datasets from the UCI repository (Dua and Graff, 2017) and Kaggle, and summarize their characteristics in the supplement. The policy we evaluate is given by training a random forest regression model to predict the continuous outcome. We also use gradient boosted regression trees for training the direct method, propensity score, and BOP-e models. Specifically, to obtain a continuous propensity score, we apply our observed policy to the train data and train a model to predict actions from state features. Then, conditional on state $s$, the action is assumed to come from a normal distribution whose mean is the model's prediction and whose variance is estimated from the residuals, as is standard practice (Hirano and Imbens, 2004). For each state-action pair in the test data, the propensity score is then the density of this distribution evaluated at the observed action.
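
A minimal sketch of this continuous propensity model, assuming a homoscedastic Gaussian with a residual-based scale estimate; the regressor choice and function name are illustrative.

```python
# Gaussian continuous propensity sketch (cf. Hirano and Imbens, 2004):
# predict the logged action from state features, then evaluate a normal
# density at the observed action. Names are illustrative.
import numpy as np
from scipy.stats import norm
from xgboost import XGBRegressor

def gaussian_propensity(states_train, actions_train, states_test, actions_test):
    model = XGBRegressor().fit(states_train, actions_train)
    mu_train = model.predict(states_train)
    sigma = np.std(actions_train - mu_train)   # homoscedastic residual scale
    mu_test = model.predict(states_test)
    # Propensity: density of N(mu_test, sigma^2) at each observed test action.
    return norm.pdf(actions_test, loc=mu_test, scale=sigma)
```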

As in the previous section, we compare BOP-e to IPS (with the Kallus and Zhou (2018) kernel) and the direct method, including the relevant SWITCH estimators. These results are displayed in Figure 2. We see that BOP-e outperforms the other methods uniformly across all datasets. In contrast to the discrete setting, BOP-e does a better job of correcting for bias than does the naïve IPS method. This is not surprising, as IPS is forced to make strong assumptions about the conditional distribution of action given state which BOP-e need not make. Given real-world data that rarely conform to ideal theoretical distributions, this provides major benefits. In addition to reducing bias, BOP-e greatly reduces RMSE on most datasets. The BOP-e SWITCH estimator improves on the IPS version in both RMSE and bias in almost all cases. On the power dataset, BOP-e provides half the RMSE of IPS when used within the SWITCH estimator. On admissions and auto, BOP-e incurs less than one-third of the RMSE of standard IPS.

6 Conclusions

In this work, we introduced BOP-e, a simple, flexible, and powerful method for off-policy evaluation of contextual bandits. BOP-e is easily implemented using off-the-shelf classifiers and trivially generalizes to arbitrary action types, e.g., continuous and multi-valued. In Section 3 we tie the bias and variance of our estimator to the risk of the classification task, and show that BOP-e is inherently balance-seeking. As a consequence of the theoretical results, hyperparameter tuning and model selection can be performed by minimizing classification error using well-known strategies from supervised learning. Experimental evidence indicates that BOP-e provides state-of-the-art performance for discrete and continuous action spaces. A natural direction for future work is the case of evaluation with sequential decision making and structured action spaces. Our method could also be extended to perform policy optimization in all of these settings. It would also be interesting to consider the integration of BOP-e with methods for variance reduction, e.g., Thomas and Brunskill (2016) and Farajtabar et al. (2018), to further improve performance.


References

  • Acharya et al. [2019] Mohan Acharya, Asfia Armaan, and Aneeta Anthony. A comparison of regression models for prediction of graduate admissions. IEEE International Conference on Computational Intelligence in Data Science, 2019.
  • Bang and Robins [2005] Heejung Bang and James M Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005.
  • Bickel et al. [2007] Steffen Bickel, Michael Brückner, and Tobias Scheffer. Discriminative learning for differing training and test distributions. In Proceedings of the 24th international conference on Machine learning, pages 81–88. ACM, 2007.
  • Bickel et al. [2009] Steffen Bickel, Michael Brückner, and Tobias Scheffer. Discriminative learning under covariate shift. Journal of Machine Learning Research, 10(Sep):2137–2155, 2009.
  • Bregman [1967] L.M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200 – 217, 1967.
  • Buja et al. [2005] Andreas Buja, Werner Stuetzle, and Yi Shen. Loss functions for binary class probability estimation and classification: Structure and applications. Working draft, November, 3, 2005.
  • Cao et al. [2009] Weihua Cao, Anastasios A Tsiatis, and Marie Davidian. Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika, 96(3):723–734, 2009.
  • Cochran [1977] William G Cochran. Sampling techniques. Wiley, 1977.
  • Cortez et al. [2009] Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547–553, 2009.
  • Dimakopoulou et al. [2018] Maria Dimakopoulou, Zhengyuan Zhou, Susan Athey, and Guido Imbens. Balanced linear contextual bandits. arXiv preprint arXiv:1812.06227, 2018.
  • Dua and Graff [2017] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
  • Dudík et al. [2011] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pages 1097–1104, 2011.
  • Dudík et al. [2014] Miroslav Dudík, Dumitru Erhan, John Langford, Lihong Li, et al. Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485–511, 2014.
  • Farajtabar et al. [2018] Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, pages 1446–1455, 2018.
  • Fong et al. [2018] Christian Fong, Chad Hazlett, Kosuke Imai, et al. Covariate balancing propensity score for a continuous treatment: Application to the efficacy of political advertisements. The Annals of Applied Statistics, 12(1):156–177, 2018.
  • Friedman [2004] Jerome Friedman. On multivariate goodness-of-fit and two-sample testing. Technical report, Stanford Linear Accelerator Center, Menlo Park, CA (US), 2004.
  • Gretton et al. [2009] Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. Dataset shift in machine learning, 3(4):5, 2009.
  • Gretton et al. [2012] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
  • Hainmueller [2012] Jens Hainmueller. Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis, 20(1):25–46, 2012.
  • Hájek and others [1964] Jaroslav Hájek et al. Asymptotic theory of rejective sampling with varying probabilities from a finite population. The Annals of Mathematical Statistics, 35(4):1491–1523, 1964.
  • Hirano and Imbens [2004] Keisuke Hirano and Guido W Imbens. The propensity score with continuous treatments. Applied Bayesian modeling and causal inference from incomplete-data perspectives, 226164:73–84, 2004.
  • Huang et al. [2007] Jiayuan Huang, Arthur Gretton, Karsten Borgwardt, Bernhard Schölkopf, and Alex J Smola. Correcting sample selection bias by unlabeled data. In Advances in neural information processing systems, pages 601–608, 2007.
  • Huszár [2013] Ferenc Huszár. Scoring rules, divergences and information in Bayesian machine learning. PhD thesis, University of Cambridge, 2013.
  • Imai and Ratkovic [2014] Kosuke Imai and Marc Ratkovic. Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):243–263, 2014.
  • Imbens [2000] Guido W Imbens. The role of the propensity score in estimating dose-response functions. Biometrika, 87(3):706–710, 2000.
  • Kallus and Zhou [2018] Nathan Kallus and Angela Zhou. Policy evaluation and optimization with continuous treatments. In International Conference on Artificial Intelligence and Statistics, pages 1243–1251, 2018.
  • Kallus [2018] Nathan Kallus. Balanced policy evaluation and learning. In Advances in Neural Information Processing Systems, pages 8909–8920, 2018.
  • Kanamori et al. [2009] Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10(Jul):1391–1445, 2009.
  • Kang et al. [2007] Joseph DY Kang, Joseph L Schafer, et al. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical science, 22(4):523–539, 2007.
  • Kaya et al. [2012] Heysem Kaya, Pınar Tüfekci, and Fikret S Gürgen. Local and global learning methods for predicting power of a combined gas & steam turbine. In Proceedings of the International Conference on Emerging Trends in Computer and Electronics Engineering ICETCEE, pages 13–18, 2012.
  • Langford and Zhang [2007] John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Proceedings of the 20th International Conference on Neural Information Processing Systems, pages 817–824. Citeseer, 2007.
  • Li et al. [2010] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW ’10, pages 661–670, New York, NY, USA, 2010. ACM.
  • Li et al. [2011] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 297–306. ACM, 2011.
  • Liu et al. [2018] Yao Liu, Omer Gottesman, Aniruddh Raghu, Matthieu Komorowski, Aldo A Faisal, Finale Doshi-Velez, and Emma Brunskill. Representation balancing mdps for off-policy policy evaluation. In Advances in Neural Information Processing Systems, pages 2649–2658, 2018.
  • Lopez-Paz and Oquab [2017] David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. In International Conference on Learning Representations, 2017.
  • Ma and Wang [2018] Xinwei Ma and Jingshen Wang. Robust inference using inverse probability weighting. arXiv preprint arXiv:1810.11397, 2018.
  • Menon and Ong [2016] Aditya Menon and Cheng Soon Ong. Linking losses for density ratio and class-probability estimation. In International Conference on Machine Learning, pages 304–313, 2016.
  • Qin [1998] Jing Qin. Inferences for case-control and semiparametric two-sample density ratio models. Biometrika, 85(3):619–630, 1998.
  • Reid and Williamson [2010] Mark D Reid and Robert C Williamson. Composite binary losses. Journal of Machine Learning Research, 11(Sep):2387–2422, 2010.
  • Rosenbaum and Rubin [1983] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
  • Siebert [1987] J.P. Siebert. Vehicle Recognition Using Rule Based Methods. TIRM–87-018. Turing Institute, 1987.
  • Smith and Todd [2005] Jeffrey A Smith and Petra E Todd. Does matching overcome lalonde’s critique of nonexperimental estimators? Journal of econometrics, 125(1-2):305–353, 2005.
  • Sugiyama et al. [2008] Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4):699–746, 2008.
  • Sugiyama et al. [2012] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine learning. Cambridge University Press, 2012.
  • Swaminathan and Joachims [2015] Adith Swaminathan and Thorsten Joachims. The self-normalized estimator for counterfactual learning. In advances in neural information processing systems, pages 3231–3239, 2015.
  • Tan [2010] Zhiqiang Tan. Bounded, efficient and doubly robust estimation with inverse weighting. Biometrika, 97(3):661–682, 2010.
  • Tewari and Murphy [2017] Ambuj Tewari and Susan A Murphy. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, pages 495–517. Springer, 2017.
  • Thomas and Brunskill [2016] Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148, 2016.
  • Thomas [2015] Philip S Thomas. Safe reinforcement learning. PhD thesis, University of Massachusetts Libraries, 2015.
  • Tüfekci [2014] Pınar Tüfekci. Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. International Journal of Electrical Power and Energy Systems, 60:126 – 140, 2014.
  • Wang et al. [2017] Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. Optimal and adaptive off-policy evaluation in contextual bandits. In International Conference on Machine Learning, pages 3589–3597, 2017.
  • Wu and Wang [2018] Hang Wu and May Wang. Variance regularized counterfactual risk minimization via variational divergence minimization. In International Conference on Machine Learning, pages 5349–5358, 2018.
  • Zubizarreta [2015] José R Zubizarreta. Stable weights that balance covariates for estimation with incomplete outcome data. Journal of the American Statistical Association, 110(511):910–922, 2015.


Appendix A Appendix

a.1 Proofs of technical results

Here, we provide technical proofs of the propositions in Section 3.

Proposition 5.

Let be the class conditional and be the class conditional with marginal class probability . Let be the joint distribution over decomposed into and and the marginal . Under Assumption A5, for any scorer , , where .

The proof can be found in Menon and Ong [2016].

a.1.1 Proof of Proposition 2

Because the weights in the denominator are each consistent for 1, we have that the sum is consistent for . Therefore, by the continuous mapping theorem, we can consider the expectation of a single term in the numerator.

Recall that denotes the true density ratio and is the estimated density ratio. Further let . First, we consider the discrete action setting. We can express the expectation as:

We can show that the first term is equal to the policy value of , while the second term provides the estimator’s bias. Considering the first term, we have:

where denotes .

Now, considering the bias term, and bounding with the Bregman divergence between and , we have:

We now move on to the continuous action setting. We can express the expectation as:

We can show that the first term is equal to the true counterfactual policy value, while the second term describes the bias induced from estimating the density ratio. Considering the first term, we have:

Let . Thus, and . Then, taking a second-order Taylor expansion of around :

This result follows similarly to those in Kallus and Zhou [2018], by properties of kernels, bounded rewards, and since has a bounded second derivative with respect to .

Now, considering the bias term, we use the same substitution and Taylor expansion as before. We also bound by the Bregman divergence between and , yielding:

a.1.2 Proof of Proposition 3

We consider the second moment of a single numerator term, and write the estimator in terms of and as above. We first consider the discrete action setting.

Therefore, the variance of the estimator is bounded by:

Next, we consider the second moment of a term in the estimator in the continuous action setting:

We substitute as before. Then, and .

Next, we apply a second-order Taylor series expansion of , , and around . Given that these functions have bounded second derivatives, we can bound the remainder by , as in Kallus and Zhou [2018]. This yields: