1 Introduction
Personalization is a key feature of modern decision making in a variety of contexts, and learning how to personalize is a central problem in machine learning. In targeted advertising, learning systems observe data about incoming users such as their market segment, decide which ad to display to the individual user, and observe whether or not the user clicks on the ad. In personalized medicine, medical history, demographics, and genetics of patients may be leveraged to administer individually tailored treatments. Contextual-bandit algorithms, where the learning system repeatedly takes an action for a context (feature vector) and observes the corresponding outcome, and other randomized experiments such as A/B tests, are the gold standards for comparing the efficacy of different policies and learning the best one. However, in practice, it can be prohibitively costly, risky, or impossible to collect new data through experimentation. Off-policy evaluation and optimization is the problem of assessing and optimizing new personalized decision policies based on observational data collected under other historical policies.
The existing literature on policy evaluation has primarily considered only discrete action spaces, where the learning system chooses one of a finite set of treatments for each unit [3, 4, 16], except for model-based approaches for policy evaluation [12]. In many important applications, however, the treatment is a continuous variable. For example, the dosage of a medical drug is continuous and, by using personalized dosing policies, doctors may adjust dosages to account for individual factors such as genes. As another example, in dynamic pricing, different values of customer rebates can be viewed as continuous treatments offered to the customer. The duration or intensity of exposure to an intervention, such as a job training program, can be considered a continuous treatment as well. Treating such variables as discrete, e.g., by discretizing the data, must rely on ad hoc modeling and may impede the fidelity of evaluation and the performance of optimized policies. Interpreting treatments as continuous is helpful even if not all continuum values are observed, since such an interpretation allows off-policy evaluation to learn across treatments that are different but close.
We present a framework for policy evaluation and optimization with continuous treatments. Our proposed estimator, introduced in Sec. 2, effectively uses outcome data from data points where the treatment was close to the target policy. In Sec. 3, we analyze the bias, variance, and mean-squared error of our policy evaluation. For the corresponding estimated optimal policy, we analyze the consistency of policy optimization in Sec. 4. Finally, in Sec. 5, we conduct experiments showing that our approach performs well. In particular, we consider a case study with clinical data from a dataset of warfarin patients.
2 Policy Evaluation Methodology
2.1 Problem Setup and Notation
We study an observational dataset collected from interactions of the decision-making system assigning treatments to units. For each interaction, the system observes an input (context or feature). Following the logging policy, the system assigns some treatment according to a conditional probability density given the input. Outcome data is then observed, generated from a joint distribution on the covariates, treatments, and induced outcomes that is unknown to us. The observational dataset comprises i.i.d. observations of these data. The term treatment corresponds to arms or actions discussed in other works on off-policy evaluation. The generalized propensity score (GPS), denoted $Q$, is defined as the conditional density of the treatment given the covariates and generalizes the discrete propensity score to the continuous setting; we assume it exists [7]. We assume the logging policy is known, which is reasonable when we have control over the system. Otherwise, the GPS can be imputed with standard approaches for predicting conditional densities such as regression under parametric noise models or kernel density estimation.
Off-policy evaluation estimates the policy value: the expected outcome induced by the policy, expressed in terms of potential outcomes under the Neyman-Rubin causal framework, where $Y(t)$ denotes the potential outcome of a unit had it received treatment $t$ [19]. We require standard assumptions in causal inference of unconfoundedness (also known as ignorability) and common support. Unconfoundedness asserts that $Y(t) \perp T \mid X$ for all $t$: treatment is exogenous and its assignment depends only on the covariates $X$. The data generation process described above is consistent with unconfoundedness. The common support condition requires that the generalized propensity score be positive at the treatments the policy may assign: otherwise, if possible treatments may never be observed in the dataset, there is no chance for accurate estimation of their respective outcomes.
The task of policy optimization is to determine the optimal policy within a restricted function class of policies. Since the optimal policy is deterministic (for each user, assign the optimal treatment), we focus on evaluating deterministic policies. The empirically optimal policy is the policy minimizing the estimated value from our proposed estimator. The best-in-class policy minimizes the unknown expected policy value.
2.2 Related Work
Previous approaches for off-policy evaluation broadly include the direct method (DM), inverse propensity weighted estimators (IPW), or estimators which combine them. The direct method estimates the relationship between outcomes and the covariates and treatment jointly, and generates a plug-in estimator. By unconfoundedness, regression-based estimation of the conditional mean function corresponds to estimation of the potential outcome [19]. However, this approach is subject to issues with model misspecification and, without addressing dataset imbalance from a logging policy, may over- or underestimate the relevance of outcomes under a different policy [3]. In [12], the authors have investigated the evaluation of continuous treatment effects with the SuperLearner, an ensemble model which incorporates multiple models of the entire dose-response surface.
IPW-based estimation normalizes the observed outcomes by the inverse of the propensity weights of the logging policy [8]. IPW estimation corrects distribution mismatch by averaging outcomes over a new dataset created out of reweighted instances where the logging and target policies assign the same treatment [16]. IPW is unbiased, with a slower rate of convergence dependent on the number of treatments [3], and it is optimal in the sense of minimax efficiency when no additional information about the reward structure is available [24]. However, dividing by the propensity score can inflate the variance of IPW estimators.
The doubly robust (DR) estimator combines DM and IPW estimators. When the direct estimate of the reward is biased, such as when using nonparametric or high-dimensional regression, the doubly robust estimator weights the model residuals by inverse propensity weights in order to remove the bias. DR achieves a multiplicative bias when propensities are estimated, and its convergence requires only that one of the estimators is consistent [23, 4]. Recent work in [23, 24] switches between using an IPW and a reward estimator, using the reward estimator when the propensity is smaller than some threshold which optimizes the MSE bias-variance tradeoff. In [21, 22], the authors propose counterfactual risk minimization for policy optimization by minimizing an upper bound on the MSE of an IPW-based estimator.
The continuous setting for policy evaluation and optimization presents new challenges. We note that the generalized propensity score as introduced in [7] is analyzed in the context of treatment effect evaluation, and is used in practice to motivate appropriate discretizations of a continuous treatment variable for assessing balance. Policy optimization in discrete action spaces is generally reduced to a weighted multiclass classification problem, where the classes are treatments and are weighted by their off-policy evaluation. For each context, the policy determines the action which provides the highest reward as its classification label [4]. However, policy optimization in the continuous setting will be fundamentally different since the problem does not decompose into discrete classes of outcomes.
2.3 Off-Policy Continuous Estimator
Previous IPW approaches for offpolicy evaluation in discrete action spaces propose the following estimator, which filters the observational dataset by rejection and importance sampling:
In the continuous setting, we will not be able to employ rejection sampling since for any continuous probability density. The rejection sampling term can be viewed as a Dirac delta function, , and in the discrete case, it enforces that the only outcome data used for estimation are the observations where the same treatment was observed under the logging policy and is assigned by the target policy. For continuous treatments, our proposed estimator reweights the dataset to consider outcomes where the observed treatment and offpolicy treatment are close.
We propose the continuoustreatment offpolicy evaluator, denoted as , which smoothly relaxes the unit mass of the Dirac delta function using a kernel function :
Properties of the kernel function include symmetry about the origin ($K(u) = K(-u)$) and that it integrates to 1 ($\int K(u)\,du = 1$). Kernel density estimation, also known as Parzen-window estimation, can be viewed as a smooth nonparametric generalization of computing histogram 'buckets'. Instead of assuming a specific parametric statistical model, kernel density estimation assumes smoothness of the underlying joint density [6]. We state our results for univariate kernels, where the treatment is one-dimensional, and note that analogous results hold if we use multidimensional kernel functions. When the treatment is multidimensional, the estimator takes an analogous form in which the scalar bandwidth is replaced by a bandwidth matrix. Examples of kernels include the Gaussian kernel, $K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$, and the Epanechnikov kernel, $K(u) = \frac{3}{4}(1 - u^2)\,\mathbb{1}[|u| \le 1]$.
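The displayed formula for the estimator is not available in this text, so the following is only a minimal sketch reconstructed from the prose description: each observed outcome is divided by its generalized propensity score and weighted by a kernel of the scaled distance between the policy's treatment and the logged treatment. The function and argument names (`kernel_ope`, `policy_t`, `gps`, `h`) are hypothetical, and the `1 / (n * h)` normalization is the usual kernel-smoothing convention, assumed here rather than taken from the source.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def gaussian(u):
    """Standard Gaussian kernel."""
    u = np.asarray(u, dtype=float)
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kernel_ope(y, t, policy_t, gps, h, kernel=epanechnikov):
    """Kernel-weighted inverse-propensity estimate of a policy's value.

    y        : observed outcomes
    t        : logged (observed) treatments
    policy_t : treatments the target policy would assign to the same units
    gps      : logging density evaluated at each (t_i, x_i)
    h        : bandwidth (assumed scalar, univariate treatment)
    """
    y, t, policy_t, gps = (np.asarray(a, dtype=float) for a in (y, t, policy_t, gps))
    # Weight is large only when the logged treatment is within ~h of the policy's.
    weights = kernel((policy_t - t) / h) / (h * gps)
    return float(np.mean(weights * y))
```

Observations whose logged treatment lies more than one bandwidth away from the policy's treatment receive zero weight under the Epanechnikov kernel, which is the smooth analogue of the exact-match filter used in the discrete case.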
This approach extends the IPW and rejection sampling approach taken in discrete treatment spaces to continuous treatment spaces. The extent of the kernel smoothing, parametrized by the bandwidth $h$, can be chosen to minimize the mean-squared error (MSE). In particular, our estimator differs from using the direct method with kernel-based regression of the conditional density and evaluation of the estimate at the treatment policy. Thus, we avoid the curse of dimensionality: kernel regression performs dramatically worse as the number of covariate dimensions increases, whereas the convergence rates of our method rely only on the treatment dimension [17].
In the special case that the logging policy is unknown such that propensities must be imputed, and if a dose-response estimator is available, our estimator can be extended to a doubly robust one that has bias in excess of that in Thm. 1 that is multiplicative in the estimation biases of propensities and dose response:
2.3.1 Self-Normalized Propensity Weight Estimator
As discussed in [22], IPW methods can suffer from high variance in estimates due to the propensity weights. Normalizing the IPW estimator by the sum of its weights maintains consistency but can reduce variance by adjusting estimates for regions of the treatment space that would have been sampled with greater or lower probability.
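As a rough illustration of the normalization just described, the sketch below divides the weighted sum of outcomes by the sum of the weights. The exact normalizer used by the authors is not visible in this text, so this form (and the hypothetical name `self_normalized_kernel_ope`) is an assumption; the Epanechnikov kernel is used only for concreteness.

```python
import numpy as np

def self_normalized_kernel_ope(y, t, policy_t, gps, h):
    """Self-normalized kernel IPW estimate (illustrative sketch).

    The weights are kernel(distance / h) / gps; the 1/h factor cancels
    between numerator and denominator, so it is omitted.
    """
    y, t, policy_t, gps = (np.asarray(a, dtype=float) for a in (y, t, policy_t, gps))
    u = (policy_t - t) / h
    k = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)  # Epanechnikov
    w = k / gps
    total = w.sum()
    if total == 0.0:
        # No logged treatment falls within one bandwidth of the policy.
        return float("nan")
    return float(np.sum(w * y) / total)
```

Because the weights are renormalized to sum to one, the estimate always lies within the range of the observed outcomes, which is one way to see why its variance can be lower than that of the unnormalized estimator.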
2.3.2 Practical Concerns
When implementing our off-policy evaluation estimator in practice, some adjustments need to be made for empirical performance.
Bandwidth selection: Selecting a good bandwidth is key to good evaluation and optimization. We compute the asymptotically optimal bandwidth in Thm. 1 below, but beyond the order in $n$, the expression includes constants that are generally unknown a priori. In the case of kernel density estimation, the literature focuses on methods for bandwidth selection which do not incorporate loss scalings and would perform poorly in our setting [18]. Instead, we propose to select the optimal bandwidth via a plug-in estimator, estimating the quantities in the expression for the optimal bandwidth (eq. 3.2): we estimate the conditional density via kernel density estimation and subsequently estimate the second derivative and the conditional expectation via numerical differentiation and integration.
Boundary bias: If the treatment space is bounded, the kernel may extend past the boundaries where necessarily no data point exists, biasing boundary estimates downwards. This can be addressed by truncating and normalizing the kernel: if the treatment space is a bounded interval, we simply divide each term in our estimator by the mass of the scaled kernel that falls within that interval.
Clipping propensity weights: When using IPW-based estimators, in practice if the propensity score is very small, it is clipped at some threshold. This introduces additional bias but may significantly reduce the variance, yielding smaller total error.
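The clipping and boundary adjustments can be sketched as two small helpers. This is an illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, the kernel is taken to be Epanechnikov (whose integral has a closed form), and the truncated kernel is assumed to be centred at the treatment assigned by the policy, since the exact normalizing expression is missing from this text.

```python
import numpy as np

def clip_gps(gps, threshold):
    """Replace propensities below `threshold` by `threshold` (adds bias, cuts variance)."""
    return np.maximum(np.asarray(gps, dtype=float), threshold)

def epanechnikov_cdf(u):
    """Integral of the Epanechnikov kernel from -1 up to u (clamped to [-1, 1])."""
    u = np.clip(np.asarray(u, dtype=float), -1.0, 1.0)
    return 0.5 + 0.75 * u - 0.25 * u**3

def boundary_mass(policy_t, h, lower, upper):
    """Fraction of a kernel centred at `policy_t` with bandwidth `h` inside [lower, upper].

    Dividing each term of the estimator by this mass renormalizes the
    truncated kernel so that it again integrates to one.
    """
    policy_t = np.asarray(policy_t, dtype=float)
    return epanechnikov_cdf((upper - policy_t) / h) - epanechnikov_cdf((lower - policy_t) / h)
```

For a policy treatment exactly on the boundary, half of the kernel falls outside the treatment space, so the corresponding term would be divided by 0.5.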
3 Off-Policy Evaluation Analysis
3.1 Bias and Variance of Kernelized Policy Evaluation
We compute the bias and variance of the estimator and prove consistency. Some technical assumptions are required for the analysis:
Assumption 1.
The conditional outcome and treatment densities, and , exist.
Assumption 2.
The conditional outcome density is twice differentiable and the conditional treatment density is differentiable.
Assumption 3.
Outcomes are bounded with finite second moments.
Assumption 4.
Common support between the treatment propensities observed in the data and the treatment policy : , for almost everywhere and some fixed .
Assumption 5.
(Unconfoundedness) for all : treatment is exogenous and its assignment depends only on .
Theorem 1.
Proof outline.
The theorem follows by applying Bayes’ rule with the GPS and Taylor expansion of around . Details in Appendix 7.1. ∎
The bias introduced by kernel density estimation is $O(h^2)$ and depends on the curvature of the unknown density evaluated at the policy: if the outcome distribution changes rapidly with small changes in treatment value, our approach for leveraging local information will incur more bias. The variance depends inversely on the generalized propensity score. As expected, the estimator may have high variance in regions where we are unlikely to observe the treatment assigned by the policy.
3.2 Mean Squared Error and Consistency
We analyze mean squared error derived from bias and variance and characterize the optimal bandwidth. Intuitively, the bandwidth controls the scale of proximity we require on treatments: a bandwidth too large introduces high bias because we simply average over the entire dataset, while small bandwidths increase variance.
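The trade-off can be made concrete with a back-of-the-envelope calculation. It assumes the usual orders for kernel estimators, a bias of order $h^2$ and a variance of order $1/(nh)$; the constants $C_1$ and $C_2$ are placeholders introduced here for illustration, standing in for the curvature- and propensity-dependent quantities of the formal result.

```latex
\begin{align*}
\mathrm{MSE}(h) &\approx C_1 h^4 + \frac{C_2}{n h}, \\
\frac{d}{dh}\,\mathrm{MSE}(h) &= 4 C_1 h^3 - \frac{C_2}{n h^2} = 0
\;\Longrightarrow\;
h^* = \left(\frac{C_2}{4 C_1 n}\right)^{1/5} \propto n^{-1/5}, \\
\mathrm{MSE}(h^*) &= O\!\left(n^{-4/5}\right).
\end{align*}
```

Under these assumptions both the squared bias and the variance shrink at the same rate at $h^*$, which is why a bandwidth computed for one sample size can be rescaled for another by a power of the ratio of sample sizes.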
Proof outline.
Proof Outline.
Follows from convergent MSE and Markov’s inequality. Full proof in Appendix 7.3 ∎
Corollary 4.
Proof.
The result follows from Slutsky’s theorem since . ∎
4 Continuous Policy Optimization
Accurate off-policy evaluation is a necessary prerequisite for policy optimization, the task of estimating which treatment policy minimizes the expected outcome. We analyze how the empirically optimal policy, the policy minimizing the off-policy evaluation, performs out of sample.
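To illustrate what computing an empirically optimal policy might involve, the sketch below fits a linear policy by gradient descent with random restarts. Several choices are assumptions made for the example and are not taken from the source: the self-normalized objective, the Gaussian kernel, the step size and iteration count, and starting each restart near a least-squares fit of the logged treatments so that the initial policy overlaps the data. All function names are hypothetical.

```python
import numpy as np

def sn_value_and_grad(beta, X, t, y, gps, h):
    """Self-normalized kernel estimate for the linear policy x -> x @ beta, and its gradient."""
    u = (X @ beta - t) / h
    w = np.exp(-0.5 * u**2) / gps          # Gaussian kernel; constants cancel after normalizing
    dw = (-u * w / h)[:, None] * X         # chain rule: d w_i / d beta
    num, den = np.sum(w * y), np.sum(w)
    value = num / den
    grad = (dw.T @ y) / den - num * dw.sum(axis=0) / den**2   # quotient rule
    return value, grad

def fit_linear_policy(X, t, y, gps, h, n_restarts=5, n_steps=300, lr=0.1, seed=0):
    """Gradient descent from several random starts; keep the lowest estimated value."""
    rng = np.random.default_rng(seed)
    # Start near a policy imitating the logged treatments (an assumed heuristic).
    center = np.linalg.lstsq(X, t, rcond=None)[0]
    best_beta, best_value = None, np.inf
    for _ in range(n_restarts):
        beta = center + rng.normal(scale=0.3, size=X.shape[1])
        for _ in range(n_steps):
            _, grad = sn_value_and_grad(beta, X, t, y, gps, h)
            beta = beta - lr * grad
        value, _ = sn_value_and_grad(beta, X, t, y, gps, h)
        if value < best_value:             # nan values are never selected
            best_beta, best_value = beta, value
    return best_beta, best_value
```

The objective is nonconvex in `beta`, so nothing here guarantees a global minimum; taking the best of several restarts is only a practical safeguard.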
For a constrained policy class, such as a space of linear policies, the policy optimization problem can be interpreted as a weighted empirical risk minimization problem over a constrained policy space, in which we find the policy minimizing the estimated policy value. Gradients can be computed easily by applying the chain rule to the kernel function, and we provide additional examples in the Appendix. We can similarly optimize the other estimators and incorporate variance regularization. The nonconvex optimization can be solved by a numerical optimizer such as L-BFGS or gradient descent, but generally with no guarantees of global convergence. In practice we take the best solution from random restarts.
4.1 Consistency of Policy Optimization
Our analysis bounds the generalization error for the empirical risk-minimizing policy, the error incurred by minimizing the empirical risk instead of the unknown expected risk. Generalization bounds for this problem follow from [1, Thm. 8], and depend on the Rademacher complexity of the loss function class. The empirical Rademacher complexity and the Rademacher complexity of a function class are defined in terms of i.i.d. Rademacher random variables, which take the values $+1$ or $-1$ each with probability $1/2$. Restricting the function class provides better generalization error by reducing the Rademacher complexity: a function class which is less able to fit arbitrary data sequences is less vulnerable to overfitting.
Assumption 6.
Outcome values are bounded on a fixed interval. The inverse propensity weight is also bounded.
Assumption 7.
The kernel function is bounded by and has Lipschitz constant .
Theorem 5.
Proof outline.
The corollary shows that the regret of our policy optimizer converges to zero, i.e., achieves best-in-class performance, as long as , , and . As an example, consider a function class of linear decision rules with bounded norm: . Assuming that , [10] shows that the Rademacher complexity of this class is bounded as . Therefore, the optimal bandwidth ensures consistent policy learning of the best linear policy. Similar results hold for policies in a bounded ball of a reproducing kernel Hilbert space.
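For intuition about the quantity being bounded, the empirical Rademacher complexity of a norm-bounded linear class can be estimated by Monte Carlo. This sketch rests on assumptions not confirmed by the text: the norm is taken to be Euclidean, and the complexity is defined without an absolute value or a leading factor of 2 (conventions differ across references). Under those assumptions the supremum over the ball has a closed form via the Cauchy-Schwarz inequality.

```python
import numpy as np

def empirical_rademacher_linear(X, norm_bound, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    {x -> <beta, x> : ||beta||_2 <= norm_bound} on the sample X (n x d)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Each row is one draw of n independent +/-1 signs.
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
    # sup over the ball of (1/n) * sum_i sigma_i <beta, x_i>
    #   = (norm_bound / n) * || sum_i sigma_i x_i ||_2
    sups = norm_bound * np.linalg.norm(sigma @ X, axis=1) / n
    return float(sups.mean())
```

By Jensen's inequality the exact value is at most `norm_bound * max_i ||x_i|| / sqrt(n)`, so the estimate should decay roughly like one over the square root of the sample size for a fixed norm bound.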
Variance regularization: If the optimization space includes a policy which assigns treatments arbitrarily far from the observed treatments, such a policy trivially minimizes the loss by driving the kernel weights, and hence the estimate, to zero. Regularizing the objective by the estimated sample standard deviation of the policy evaluation should mitigate this effect from overly expressive policy classes.
5 Experiments
5.1 Validation on Synthetic Data
We first consider a controlled setting with synthetic data to illustrate our method. We consider , where and . We consider treatment assignment that is either completely randomized uniformly on an interval without regard to the covariates, or confounded by the covariates and normally distributed. The optimal policy is linear and sets where .
We consider how the performance of off-policy evaluation changes with $n$ by evaluating the optimal policy with $n$ observational data points generated either using completely randomized treatments or treatments confounded by the covariates, clipping generalized propensities below 0.1. We use the Epanechnikov kernel with the self-normalized estimator and estimate the bandwidth by using kernel density estimation of and for the conditional density . From this estimate we obtain an estimate for the second derivative by numerical differentiation and compute an approximate conditional expectation of via numerical integration. Since computing the bandwidth is numerically intensive, we compute it for one value of $n$ and adjust it for different $n$ by multiplying by . We compare to standard clipped-IPW discrete-treatment policy evaluation by discretizing the treatments into 10 evenly sized bins from the minimal to maximal observed treatment, computing the discrete propensity score by integrating the GPS over the bin (“discretized OPE”). We also compare against the direct method, using either a trained random forest regressor (“DM RF”) or polynomial regression of order 3 (“DM Poly”).
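The discretized baseline described above might be sketched as follows. This is an illustrative reading of the description, not the authors' code: `logging_cdf` is a hypothetical callable returning the logging policy's cumulative distribution at the given treatment values (one value per unit), so that differences of it integrate the GPS over a bin; policy treatments outside the observed range are assigned to the nearest end bin.

```python
import numpy as np

def discretized_ope(y, t, policy_t, logging_cdf, n_bins=10, clip=0.1):
    """Clipped-IPW evaluation after binning treatments into equal-width bins."""
    y, t, policy_t = (np.asarray(a, dtype=float) for a in (y, t, policy_t))
    edges = np.linspace(t.min(), t.max(), n_bins + 1)
    # Bin index of each logged and each policy treatment, clamped to valid bins.
    obs_bin = np.clip(np.digitize(t, edges) - 1, 0, n_bins - 1)
    pol_bin = np.clip(np.digitize(policy_t, edges) - 1, 0, n_bins - 1)
    # Discrete propensity: logging probability mass of the observed bin.
    prop = logging_cdf(edges[obs_bin + 1]) - logging_cdf(edges[obs_bin])
    prop = np.maximum(prop, clip)
    # Only units whose logged bin matches the policy's bin contribute.
    match = (obs_bin == pol_bin).astype(float)
    return float(np.mean(match * y / prop))
```

All logged treatments in the matching bin are treated as equivalent to the policy's treatment, which is the source of the discretization bias discussed in the comparison.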
For each $n$ between 10 and 300, we simulate 50 replications of the process. The results in Fig. 2 include 95% confidence intervals around the mean over replications. In both settings our policy evaluation indeed converges to the truth, and while the discretization approach performs reasonably well, it is systematically biased, inconsistent, and has larger variance. The discretization is sensitive to the distribution of the data and the variation in the unknown true relationship between covariates, treatment, and outcome. In practical settings with real data, it is unclear what the best discretization would be.
To consider off-policy optimization in this simple setting, we fix and evaluate linear policies with ranging over . Again we consider 50 replications. Figs. 2 and 4 show 95% confidence intervals around the mean over replications. Our policy evaluations are tight near the optimum and subsequent optimization over the range of values is consistent with the true optimal . In Fig. 4 we evaluate the consistency of policy optimization by analyzing the out-of-sample error of the empirically optimal policy computed on datasets of size $n$ varying from 10 to 300. We compute 20 replications and display 95% confidence intervals around the mean, observing that, following the theory, the out-of-sample error from off-policy evaluation converges to zero as $n$ increases.
5.2 Policy Optimization Simulation
We consider a similar controlled setting in higher dimension with richer dose-response structure, still with synthetic data, and illustrate the resulting outcome distribution under various learned policies. We randomly generate independent ten-dimensional covariates, normally distributed with zero mean and randomly generated covariances (following a normal distribution which is offset for positivity). The true outcome model is quadratic in : we set the noiseless outcome as where , , and . We induce sparsity on the coefficients by independently and randomly sampling 3 covariates to remain positive on each coefficient vector , , and . We include a constant treatment effect interaction term of .
We sample treatments as normally distributed conditional on covariates, generate a training dataset of 400 instances, and evaluate on a test set of 1000 instances. Outcomes are generated from the model with additional i.i.d. mean-zero Gaussian noise with variance . For policy learning, we consider the case that the propensity model is well-specified but unknown and impute the generalized propensity score from the training data via linear regression.
In Fig. 5 we compare the outcome distributions under learned policies using a box plot, displaying the means on the right. For reference we compute the best treatment assignment for each unit given the full counterfactual model (best out-of-class (o.o.c.)). We evaluate continuous off-policy evaluation over linear policies with a bandwidth of 2.6. When optimizing the off-policy estimator for the linear policy, we use the L-BFGS algorithm with multiple random restarts since the objective function is nonconvex and L-BFGS performs well even in the nonconvex setting [15]. We evaluate the direct method (DM), using a random forest regressor with a linear policy space, which we optimize by using the numerical differentiation available with BFGS. We use the same random forest regressor for doubly robust continuous policy optimization (DR CPO). We also consider a discretization approach which optimizes over a continuous treatment policy (linear) but evaluates performance by discretizing treatment into uniformly sized bins, running standard self-normalized CPE. Discretizing the resulting linear policy yields a constant policy in this setting: we compare to the best constant policy found using continuous policy evaluation (CPE, cons.). Finally, the baseline is a constant policy which assigns the mean dose. Comparing the results, we see that off-policy evaluation is able to improve upon the mean risk of other methods and nears the performance of the best treatment assignment with full information (which is out of the linear policy class). While the best constant policy found using OPE has good performance in the sense of mean risk, the linear policy is better able to personalize treatment based on covariates.
5.3 Warfarin case study
Unlike for discrete off-policy evaluation, no evaluation datasets are available for continuous treatments with full counterfactuals. We evaluate our estimator in an experimental setting by developing a case study from a PharmGKB [9] dataset on warfarin dosing which includes information on patient covariates, final therapeutic dosages, and patient outcomes (INR, International Normalized Ratio). Warfarin is a blood thinner whose therapeutic dosage varies widely across patients and whose administration must be closely monitored to prevent adverse side effects. Previous work on predicting dosage policies [2, 11] has evaluated accuracy based on prediction of the correct category of dosage, “low” (< 21 mg/wk), “medium” (> 21 mg/wk, < 49 mg/wk) or “high” (> 49 mg/wk). However, clinical guidelines suggest fine adjustments to dosage (15-20%) during monitoring, and recommend splitting tablets to deliver precise treatment [9]. Therefore, treating warfarin dosage as a continuous variable better captures the inherently continuous nature of dosage amounts.
We develop a semi-simulated study by simulating a dosage process in a way that allows us to simulate counterfactual outcomes. Following the procedure set out in [9], we consider correct prediction as being within 10% of the therapeutic dose, since measurements of patient INR are inherently noisy and dose is adjusted until the patient INR presents within a target range. Since the clinical risk of incorrect dosage increases with absolute distance from the target range [5], we use a semi-simulated loss function of absolute distance from the therapeutic dose, instead of simulating unavailable INR outcomes:
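The displayed loss is not available in this text. One plausible form consistent with the description (zero loss within 10% of the therapeutic dose, growing with absolute distance beyond that band) can be sketched as below; the function name, the `tolerance` and `power` parameters, and the choice to square the excess rather than the raw distance for the squared penalty are all assumptions made for illustration.

```python
import numpy as np

def thresholded_dose_loss(dose, therapeutic_dose, tolerance=0.10, power=1):
    """Distance from the band of doses within `tolerance` of the therapeutic dose.

    power=1 gives an absolute penalty, power=2 a squared penalty.
    """
    dose = np.asarray(dose, dtype=float)
    target = np.asarray(therapeutic_dose, dtype=float)
    # Loss is zero inside the +/- tolerance band and grows linearly outside it.
    excess = np.maximum(np.abs(dose - target) - tolerance * target, 0.0)
    return excess ** power
```

For example, under these assumptions a policy dose of 50 mg/wk against a therapeutic dose of 40 mg/wk lies 10 mg/wk away, of which 4 mg/wk is tolerated, giving an absolute loss of 6.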
We sample our dosage data as a mixture of a patient’s BMI z-score and i.i.d. standard normal noise , scaled to preserve the moments of the therapeutic dose distribution, , such that with : . It follows that the propensity score is where and is the continuous density of a standard normal random variable.
We impose bounds on the coefficient, , to prevent evaluating a policy with no overlap with the observed dataset, where is the maximal treatment, is the mean of the th covariate and denotes dimension. We run a priori feature selection on the full dataset before evaluating policy optimization, using the importance weights from a random forest regressor on the therapeutic dose to select the 81 most important features.
Policy       Mean L1   Std. dev. L1   Mean L2
Best         8.93      8.64           154.37
Cont. OPE    10.19     10.19          207.78
DM           11.02     10.96          241.68
Mean dose    11.67     10.52          246.80
Original     15.27     13.08          404.28
We conduct policy optimization on these simulated outcomes and evaluate how the empirically optimal policy performs on the thresholded loss function with absolute and squared penalties. The best-in-class linear treatment policy from median regression, , has access to information about the true therapeutic dose (“Best-in-class” on the figure). We also evaluate the best linear model from a random forest regressor (DM estimator) for . We compare against the linear policy found using our CPO method (“Continuous OPE”), which achieves a mean loss of 10.2. The baseline is a constant policy corresponding to the mean dose, and for reference we plot the distribution of outcomes according to the original initial treatment assignment observed in the dataset, which doctors adjusted until a therapeutic dose was reached when patient INR was within the target range. We tested discrete off-policy optimization (POEM and Norm-POEM) with various uniformly spaced or quantile-based discretizations of dosages, but the propensity scores are mostly zero or one and hamper the resulting optimization, illustrating the difficulty of finding appropriate discretizations for real datasets [21, 22].
Comparing the results in Fig. 6, we see that our method is competitive with the best-in-class linear policy and improves upon the direct method, further reducing the median (Mean L1) and mean of the difference between the policy dose and therapeutic dose from the naive benchmark policy giving the mean dose. In Table 7 we also report the squared distance from (mean L2): we see that performance of the DM policy shows less improvement from giving the mean dose when we weight outliers more heavily, and results in larger variance in the distribution of absolute losses (std. dev. L1). Our approach for policy optimization based on continuous off-policy evaluation works well in simulated and semi-experimental settings.
6 Conclusion
We developed an inverse-propensity-weighted estimator for off-policy evaluation and learning with continuous treatments, extending previous methods which have only considered discrete actions. The estimator replaces the rejection sampling used in IPW-based estimators with a kernel function to incorporate local information about similar treatments. Our generalization bound for policy optimization shows that the empirically optimal treatment policy computed by minimizing the off-policy evaluation also converges to the policy minimizing the expected loss. We demonstrate the efficacy of our approach for estimation and evaluation on simulated data, as well as on a real-world dataset of warfarin dosages for patients.
Acknowledgments
This material is based upon work supported by the National Science Foundation under Grant No. 1656996. Angela Zhou is supported through the National Defense Science & Engineering Graduate Fellowship (NDSEG) Program.
References
[1] Peter Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 2002.
[2] Hamsa Bastani and Mohsen Bayati. Online decision-making with high-dimensional covariates. Management Science, 2015.
[3] Alina Beygelzimer and John Langford. The offset tree for learning with partial labels. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009.
[4] Miroslav Dudik, Dumitru Erhan, John Langford, and Lihong Li. Doubly robust policy evaluation and optimization. Statistical Science, 2014.
[5] Valentin Fuster, Lars E. Ryden, Davis S. Cannom, Harry J. Crijns, Anne B. Curtis, Kenneth A. Ellenbogen, Jonathan L. Halperin, G. Neal Kay, Jean-Yves Le Huezey, James E. Lowe, S. Bertil Olsson, Eric N. Prystowsky, Juan Luis Tamargo, and L. Samuel Wann. 2011 ACCF/AHA/HRS focused updates incorporated into the ACC/AHA/ESC 2006 guidelines for the management of patients with atrial fibrillation. Circulation, 2011.
[6] Bruce Hansen. Lecture notes on nonparametrics. Technical report, University of Wisconsin, 2009.
[7] Keisuke Hirano and Guido Imbens. The propensity score with continuous treatments. In Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives: An Essential Journey with Donald Rubin’s Statistical Family, chapter 7. John Wiley & Sons, Ltd, 2004.
[8] Daniel Horvitz and Donovan Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 1952.
[9] International Warfarin Pharmacogenetics Consortium, T E Klein, R B Altman, N Eriksson, B F Gage, S E Kimmel, M-T M Lee, N A Limdi, D Page, D M Roden, M J Wagner, M D Caldwell, and J A Johnson. Estimation of the warfarin dose with clinical and pharmacogenetic data. The New England Journal of Medicine, 2009.
[10] Sham Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. Advances in Neural Information Processing Systems, 2009.
[11] Nathan Kallus. Recursive partitioning for personalization using observational data. Proceedings of the Thirty-fourth International Conference on Machine Learning, 2017.
[12] Noemi Kreif, Richard Grieve, Ivan Diaz, and David Harrison. Evaluation of the effect of a continuous treatment: A machine learning approach with an application to treatment for traumatic brain injury. Health Economics, 2015.
[13] Gert R. Lanckriet and Bharath K. Sriperumbudur. On the convergence of the concave-convex procedure. Advances in Neural Information Processing Systems 22, 2009.
[14] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.
[15] Dong-Hui Li and Masao Fukushima. On the global convergence of the BFGS method for nonconvex unconstrained optimization problems. SIAM Journal on Optimization, 2000.
[16] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, 2011.
[17] Adrian Pagan and Aman Ullah. Nonparametric Econometrics. Cambridge University Press, 1999.
[18] Byeong Park and J.S. Marron. Comparison of data-driven bandwidth selectors. Journal of the American Statistical Association, 2009.
[19] Donald Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 1974.
[20] Bernard Silverman. Weak and strong uniform consistency of the kernel estimate of a density and its derivatives. The Annals of Statistics, 1978.
[21] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization. Journal of Machine Learning Research, 2015.
[22] Adith Swaminathan and Thorsten Joachims. The self-normalized estimator for counterfactual learning. Proceedings of NIPS, 2015.
[23] Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. Journal of Machine Learning Research, 2016.
[24] Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudik. Optimal and adaptive off-policy evaluation in contextual bandits. Proceedings of Neural Information Processing Systems 2017, 2017.
7 Appendix
7.1 Bias and Variance: Proof of Theorem 1
Proof.
We provide full computations for the bias and variance.
We compute the expectation of the estimator at one data point, , omitting the term. By linearity of expectation:
The analysis follows the structure of standard bias and variance calculations for kernel density estimation [17]. We can express the conditional expectation of the kernel estimator via the integral convolution of the kernel and the conditional density. Note that by the symmetric property of kernel functions, : we use them interchangeably. By iterated expectation and the definition of conditional expectation:
By a change of variables, let . Then and . Changing variables in the integral corresponds to computing a local expansion of the conditional outcome density around .
(1) 
We use the definition of Bayes’ rule and conditional densities, , with the definition of Q as , to transform the conditional density from the conditional density to the target density, .
Consider a 2nd order Taylor expansion of around :
Then we can compute the conditional expectation by integrating the approximation to the density term by term, where denotes the jth kernel moment. The second order term describes the bias.
For a symmetric kernel, the oddorder moments integrate to 0.
For the bias to vanish asymptotically, we require that , assuming that outcomes are bounded, and that the second derivative of the conditional density of given is bounded.
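The steps of the bias calculation can be collected into a single chain of equalities. The notation is an assumption made for illustration, since the displayed equations are not available in this text: $\tau(x)$ is the treatment assigned by the policy, $h$ the bandwidth, $f_{Y|T,X}$ the conditional outcome density, $Q(t, x) = f_{T|X}(t \mid x)$ the generalized propensity score, and $\kappa_2 = \int u^2 K(u)\,du$ the second kernel moment.

```latex
\begin{align*}
\mathbb{E}\!\left[\frac{1}{h} K\!\left(\frac{T - \tau(x)}{h}\right) \frac{Y}{Q(T, x)} \,\middle|\, X = x\right]
&= \iint \frac{1}{h} K\!\left(\frac{t - \tau(x)}{h}\right) y\,
   \frac{f_{Y,T|X}(y, t \mid x)}{f_{T|X}(t \mid x)} \, dy \, dt \\
&= \iint \frac{1}{h} K\!\left(\frac{t - \tau(x)}{h}\right) y\,
   f_{Y|T,X}(y \mid t, x) \, dy \, dt \\
&= \iint K(u)\, y\, f_{Y|T,X}\big(y \mid \tau(x) + h u,\, x\big) \, du \, dy \\
&= \mathbb{E}\big[Y \mid T = \tau(x), X = x\big]
   + \frac{h^2 \kappa_2}{2} \int y \,
     \frac{\partial^2}{\partial t^2} f_{Y|T,X}(y \mid t, x)\Big|_{t = \tau(x)} dy
   + o(h^2).
\end{align*}
```

The second line divides the joint density by the propensity (Bayes' rule), the third applies the change of variables $u = (t - \tau(x))/h$, and the last integrates the second-order Taylor expansion term by term: the zeroth-order term uses $\int K(u)\,du = 1$ and the first-order term vanishes because $\int u K(u)\,du = 0$ for a symmetric kernel.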
We also consider the multivariate case. We assume the kernel function for the vector u is a product kernel, in the sense that it is the product of univariate kernels: , each with bandwidth . The multidimensional change of variables takes the form . Then by the multivariate Taylor expansion of the conditional density , the bias is:
Calculations for Variance: A similar analysis follows for the variance; since we assume the data are i.i.d., it suffices to consider the variance of one term of the estimator.
We will rewrite the squared expectation using the bias analysis as approximately for bounded outcomes .
We compute by analyzing a Taylor expansion of the conditional density after a change of variables:
For convenience, we denote the quotient . We expand this function around the argument . In the asymptotic perspective, we omit the exact expressions of the derivatives , but we will require that the treatment density is nonzero for (a standard overlap condition required for counterfactual policy evaluation). Then consider each term of the expansion in turn:
The first term is equivalent to
Evaluating the second and third terms exactly would require integrating by parts, which requires specifying the structure of the kernel. Since the integrals of these terms evaluate to constants, under the assumption that , the integral of the second and third terms of the expansion is .
So
is the ‘roughness’ term, where .
Combining these results: