Permutation Weighting

by David Arbour et al.

This work introduces permutation weighting: a weighting estimator for observational causal inference under general treatment regimes which preserves arbitrary measures of covariate balance. We show that estimating weights which obey balance constraints is equivalent to a simple binary classification problem between the observed data and a permuted dataset (no matter the cardinality of treatment). Arbitrary probabilistic classifiers may be used in this method; the hypothesis space of the classifier corresponds to the nature of the balance constraints imposed through the resulting weights. We show equivalence between existing covariate balancing weight estimators and permutation weighting, and demonstrate estimation with improved efficiency under this framework. We provide theoretical results on the consistency of estimation of causal effects and on the necessity of balance in finite samples. Empirical evaluations indicate that the proposed method outperforms existing state-of-the-art weighting methods for causal effect estimation, even when the data generating process corresponds to the assumptions imposed by prior work.








1 Introduction

Observational causal inference is a set of methods and techniques to infer causal effects in the absence of an explicit randomization mechanism. Given observed treatment, outcome, and the full set of confounding pretreatment covariates, identification of the causal effect of a treatment is made possible by rendering either treatment or outcome independent of the pretreatment covariates [1, 2]. Propensity score reweighting is a common way of adjusting for the former condition, in which outcomes are reweighted according to the inverse of the conditional probability of receiving the observed treatment given pretreatment covariates [3]. Unlike in a randomized control trial (RCT), in which the relationship between treatment and covariates is known by design, propensity score models require estimating this relationship from observed data. When the distribution of treatment conditional on covariates is specified correctly, the reweighted data will behave as if it were generated from an experiment, with balance (statistical independence) over all observed and unobserved pretreatment covariates. Under misspecification, however, there are no guarantees of balance, and arbitrary dependencies between treatment and covariates may remain. Inverse propensity score weighting has become widely used in the last decade in fields as diverse as epidemiology [4], economics [5], and computer science [6].

In RCTs, the theoretical unbiasedness of estimates across randomizations is insufficient when, conditional on a randomization, one observes severe imbalance on important baseline covariates. As Gossett noted in [7], “it would be pedantic to continue with an arrangement of plots known beforehand to be likely to lead to a misleading conclusion.” The same is true in observational methods which seek to approximate an RCT: weighting methods which retain clear imbalance on important covariates should leave the credibility of their conclusions in doubt. Recent work has proposed covariate balancing propensity scores, which augment the propensity score model to additionally maintain a mean difference of zero between pretreatment covariates (cf. [8, 9, 10]). Thus, even when the propensity score model is wrong, if all relevant covariates attain mean balance, differences in means cannot bias causal effect estimation. These recent improvements, however, define balance only in terms of linear dependence [11], require bespoke optimization procedures, and are typically focused only on binary treatments.

Our framework is based on explicitly constructing the target RCT [12] through permutation of treatment assignment, thereby allowing comparison of the permuted data to the observed data. By distinguishing these datasets using an arbitrary probabilistic classifier, we generate propensity score weights which remove all dependencies in the observed data detectable by the classifier, using the tools of classifier-based density ratio estimation [13, 14, 15]. This procedure is amenable to any possible type of treatment (binary, multi-valued, or continuous) and reduces them all to the same simple binary classification problem. This procedure does not require the specification of the conditional distribution of treatment. Rather, the choice of classifier and specification of the classification problem defines the form of the dependence between treatment and covariates, i.e., the balance conditions. As such, existing propensity score [3] and balancing [8, 16, 17] methods which require balance constraints are subsumed as special cases of our framework. Given a suitably flexible classifier, arbitrary forms of balance can be attained. Cross-validation can be used to tune this classifier's hyperparameters using off-the-shelf software. With this flexibility, researchers can simply choose a flexible classifier and focus on the shoe-leather work [18] of measuring the right set of variables with which to make causal inferences credible [2], rather than on arcane choices around specification.

2 Problem Statement

We first fix notation before describing the problem setup and our proposed solution. We denote random variables in upper case, constant values and observations drawn from random variables in lower case, and sets with boldface. Let $\mathcal{D} = \{(a_i, y_i, \mathbf{x}_i)\}_{i=1}^n$ be a dataset consisting of observations of treatment, $A$, outcome, $Y$, and a set of covariates, $\mathbf{X}$. We assume the following properties of the observed data throughout this work:


(A1) Independence conditional on $\mathbf{X}$, i.e., $Y(a) \perp A \mid \mathbf{X}$ for all $a$.

(A2) Positivity over treatment status, i.e., $p(A = a \mid \mathbf{X} = \mathbf{x}) > 0$ for all $a$ and $\mathbf{x}$ in their support.

(A3) $\mathbf{X}$ only consists of pretreatment covariates, i.e., $\mathbf{X}$ is not caused by either $A$ or $Y$.

The causal estimand we seek to measure is the dose-response function, $\mu(a) = \mathbb{E}[Y(a)]$, i.e., the expected value of the outcome after intervening and assigning treatment to value $a$. The dose-response function is a general construct that does not presuppose a specific type, e.g. binary, for treatment. Further, identification of the dose-response function implies identification of the effect of any treatment contrast of interest. For example, the average treatment effect, a common quantity of interest in the causal inference literature under binary treatments, is given as $\mu(1) - \mu(0)$.
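As a toy illustration of the dose-response function and a treatment contrast, consider a simulation in which both potential outcomes are generated directly (all quantities below are our own illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulation: both potential outcomes are generated directly,
# so the dose-response function mu(a) = E[Y(a)] can be computed exactly.
n = 100_000
x = rng.normal(size=n)
y0 = x + rng.normal(size=n)        # Y(0)
y1 = x + 2.0 + rng.normal(size=n)  # Y(1): constant additive effect of 2

mu = {0: y0.mean(), 1: y1.mean()}  # dose-response evaluated at a = 0 and a = 1
ate = mu[1] - mu[0]                # the average treatment effect contrast
```

In observed data only one of the two potential outcomes is available per unit, which is precisely why the weighting machinery below is needed.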

A common method for estimating the dose-response function is weighting by the inverse of the conditional probability of receiving treatment given observed covariates, i.e., inverse propensity score weighting [3]. Weighting by the inverse of this score provides the standard inverse-probability weighting estimator of [19], which reweights data such that there is no relationship between $A$ and $\mathbf{X}$, providing identification of causal effects:

$$\mu(a) = \mathbb{E}\left[\frac{\mathbb{1}(A = a)\, Y}{p(A = a \mid \mathbf{X})}\right].$$

In the finite sample setting, a Monte Carlo approximation can be used to estimate this quantity from the observed data,

$$\hat{\mu}(a) = \frac{1}{n} \sum_{i=1}^n \frac{\mathbb{1}(a_i = a)\, y_i}{p(a_i \mid \mathbf{x}_i)}.$$

To improve efficiency, many practitioners use the Hájek [20] estimator, which replaces $n$ with the sum of the weights,

$$\hat{\mu}(a) = \frac{\sum_{i=1}^n \mathbb{1}(a_i = a)\, w_i\, y_i}{\sum_{i=1}^n \mathbb{1}(a_i = a)\, w_i}, \qquad w_i = \frac{1}{p(a_i \mid \mathbf{x}_i)}.$$

When the marginal distribution of treatment is far from uniform, both inverse propensity score weighting and the Hájek estimator can have high variance. To remedy this, Robins [21] proposed stabilizing the inverse-probability weights (IPSW) by placing the marginal density of treatment in the numerator, i.e.,

$$w(a, \mathbf{x}) = \frac{p(a)}{p(a \mid \mathbf{x})}.$$

When the conditional distribution has been correctly specified, inverse propensity score weighting results in the balance condition [3], i.e., $A \perp \mathbf{X}$ in the reweighted data. However, when the conditional distribution is not well specified, in terms of either the functional form or the assumed sufficient set of pretreatment covariates, inverse propensity score weighting will fail to produce balance and the resulting causal estimate may be badly biased [22].
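A minimal sketch of these weighting estimators on a toy binary-treatment simulation, assuming oracle (known) propensity scores purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy confounded data: x affects both treatment and outcome (hypothetical).
n = 5000
x = rng.normal(size=n)
p_a = 1 / (1 + np.exp(-x))           # true propensity p(A=1 | X=x)
a = rng.binomial(1, p_a)
y = x + 2 * a + rng.normal(size=n)   # true effect of A on Y is 2

# Inverse propensity weights (oracle propensities for illustration).
w = a / p_a + (1 - a) / (1 - p_a)

# Horvitz-Thompson style estimate of E[Y(1)] - E[Y(0)].
ht = np.mean(w * y * a) - np.mean(w * y * (1 - a))

# Hajek estimate: normalize by the sum of weights within each arm.
hajek = (np.sum(w * y * a) / np.sum(w * a)
         - np.sum(w * y * (1 - a)) / np.sum(w * (1 - a)))

# Stabilized weights multiply by the marginal density of treatment;
# they average to one, which tames the variance of the estimator.
p_marg = a * a.mean() + (1 - a) * (1 - a.mean())
w_stab = p_marg * w
```

In practice the propensities must themselves be estimated, and misspecification of that model is exactly the failure mode permutation weighting targets.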

Recent work by [9] and [8] reorients the concept of the propensity score around its balancing properties to reduce bias from mis-specification. Under the assumption of linear dependence, a valid inverse propensity score weight entails that the weighted covariance of treatment with all covariates is zero. These moment conditions suffice to estimate weights which will correspond to IPSW if the propensity score is correctly specified. Even when the propensity score is incorrect, these estimators retain their balancing properties by minimizing the difference in means directly. This, in turn, will tend to reduce the bias in the causal estimator in two ways: indirectly, by reducing the bias from unobserved factors correlated with measured covariates [23], and directly, by eliminating the bias from the balanced components of the measured covariates. While balancing estimators have shown significant gains with respect to inverse propensity score weighting [8, 9], existing estimators simply seek to minimize mean imbalance, which corresponds to a balance condition where dependence is linear. Further, extending or revising the conditions of existing estimators requires devising a novel estimation procedure for each new balance criterion (cf. [8], [11], [9]).

In this work we revisit the estimand defined by IPSW, which will allow us to define a broad class of balancing estimators. We begin with a previously unrecognized identity of the IPSW,

$$w(a, \mathbf{x}) = \frac{p(a)}{p(a \mid \mathbf{x})} = \frac{p(a)\, p(\mathbf{x})}{p(a, \mathbf{x})}. \tag{1}$$

This form of the IPSW makes plain that the weights are importance sampling weights where the target distribution is the distribution under balance, i.e., $p(a)\, p(\mathbf{x})$. To be explicit, the goal of IPSW is to transform expectations (of $Y$, for instance) over the observed joint distribution of $A$ and $\mathbf{X}$ to one in which $A$ and $\mathbf{X}$ behave as if they were generated from an RCT. However, the importance sampling weights under IPSW are constructed indirectly by combining the conditional and marginal treatment densities. The contribution of this work is a method, permutation weighting, which estimates this quantity directly via a probabilistic classification problem which we describe in the sequel. Direct estimation provides more than just intuitive appeal. Unlike IPSW, direct estimation of the importance sampling ratio explicitly seeks to minimize imbalance. The result is that bias is reduced even in the case of mis-specification.

3 Permutation Weighting

We now introduce permutation weighting, which allows for the direct estimation of the importance sampler defined by equation 1. Permutation weighting consists of two steps:

  1. A pseudo-dataset, $\tilde{\mathcal{D}}$, is constructed by permuting the treatment values in the observed sample, which we will refer to as $\mathcal{D}$. As a result of the permutation, $\tilde{\mathcal{D}}$ is faithful to the balanced distribution $p(a)\, p(\mathbf{x})$.

  2. The importance sampling weights, $w$, are constructed by estimating the density ratio between $\tilde{\mathcal{D}}$ and $\mathcal{D}$.

To avoid possible bias from taking a single permutation, we repeat this procedure over multiple bootstrap samples: a separate bootstrap sample is drawn, the permutation and density ratio estimation are carried out as described above, and the individual density ratio estimates are averaged to provide the final estimate of each weight.
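The two steps above, together with the bootstrap averaging, can be sketched as follows. This is a minimal illustration using scikit-learn's logistic regression as the probabilistic classifier; the function and feature names are ours, not from the paper. Interaction features $a \cdot \mathbf{x}$ are included because the permutation leaves both marginals unchanged, so a purely additive linear model could not separate the two classes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def _features(a, X):
    # Interactions a * x let a linear classifier detect dependence between
    # treatment and covariates; the marginals match across classes by design.
    return np.column_stack([a, X, a[:, None] * X])

def permutation_weights(a, X, n_boot=10, seed=0):
    """Minimal sketch of permutation weighting (names are illustrative).

    Distinguishes observed (a_i, x_i) rows (label 0) from rows with permuted
    treatment (label 1); the classifier odds estimate p(a)p(x) / p(a, x).
    """
    rng = np.random.default_rng(seed)
    n = len(a)
    w = np.zeros(n)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)            # bootstrap resample
        a_b, X_b = a[idx], X[idx]
        a_perm = rng.permutation(a_b)               # break the (a, x) link
        feats = np.vstack([_features(a_b, X_b), _features(a_perm, X_b)])
        labels = np.r_[np.zeros(n), np.ones(n)]     # 0 = observed, 1 = permuted
        clf = LogisticRegression(max_iter=1000).fit(feats, labels)
        eta = np.clip(clf.predict_proba(_features(a, X))[:, 1], 1e-6, 1 - 1e-6)
        w += eta / (1 - eta)                        # odds-ratio weights
    return w / n_boot
```

Reweighting by these odds should drive the weighted covariance between treatment and covariates toward zero, which is the mean-balance condition implied by a logistic-regression classifier.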

In order to estimate the density ratios (step 2), we employ classifier-based density ratio estimation [13, 14, 15], which transforms the problem of density ratio estimation into classification by building a training set from the concatenation of $\mathcal{D}$ and $\tilde{\mathcal{D}}$, where the values of $a$ and $\mathbf{x}$ are the features and a label, $C$, denotes whether the instance belongs to $\mathcal{D}$ ($C = 0$) or $\tilde{\mathcal{D}}$ ($C = 1$). A probabilistic classifier is then learned to recover $P(C = 1 \mid a, \mathbf{x})$. We introduce the following assumption, which defines which classifiers can be used as estimators:


(A4) The classifier used to estimate $P(C = 1 \mid a, \mathbf{x})$ is trained using a strictly proper scoring rule.

Strictly proper scoring rules cover a large portion of possible loss functions, including logistic loss, boosting, and mean squared error, along with the information gain and Gini impurity for decision trees [24]. (We refer the reader to the supplement for a full definition of strictly proper scoring rules.)

After training the classifier, the importance weights are recovered by taking the odds ratio,

$$w(a, \mathbf{x}) = \frac{P(C = 1 \mid a, \mathbf{x})}{P(C = 0 \mid a, \mathbf{x})} \cdot \frac{P(C = 0)}{P(C = 1)}.$$

When the numbers of observations in the original observed data, $\mathcal{D}$, and the permuted data, $\tilde{\mathcal{D}}$, are equal, the ratio of marginal class probabilities cancels and we arrive at

$$w(a, \mathbf{x}) = \frac{P(C = 1 \mid a, \mathbf{x})}{P(C = 0 \mid a, \mathbf{x})}.$$

The use of a probabilistic classifier for density ratio estimation has a growing literature [25, 26, 27], but it has not, to this point, been employed in the context of observational causal inference.

One important advantage of using a classifier to estimate the density ratio is the ability to infer hyper-parameters. Because the problem is a simple binary classification task, cross-validation can be used for tuning the parameters of the classifier and for model selection. The latter holds particular appeal, since the probabilistic classifier plays a critical role in this procedure: the hypothesis space of the classifier implicitly defines the functional form of dependence entailed by the balance condition. For example, the first-order balance conditions required by current methods (cf. [8, 11, 9]) arise when the classifier is logistic regression. Additionally, a large number of other classifiers may be considered to capture non-linear/non-parametric forms of dependence, such as boosting [26], decision trees [24], and kernel logistic regression [28]. The use of non-parametric estimators can often yield important gains in reducing bias, as demonstrated in the empirical evaluation. Toward this end, we provide an examination of the relationship between permutation weighting and a number of existing weighting estimators [21, 8, 9, 29, 16] in Appendix C. This examination yields equivalent formulations of the aforementioned methods as instances of permutation weighting in all but one case. By unifying these problems under the framework of permutation weighting, choosing the right balancing method reduces to model selection for binary classification, a thoroughly studied and understood problem.

Before looking at some of the properties of permutation weighting, we first introduce the following additional assumption:


(A5) The dependence function between $A$ and $\mathbf{X}$ resides in the space of functions that can be modeled by the user-chosen classifier.

Another way of stating assumption A5 is that the chosen classifier is capable of modeling the function that would render $A$ and $\mathbf{X}$ independent.

With the necessary assumptions in hand, we now examine some finite and large sample behavior of permutation weighting.

Proposition 1.

Presume assumptions A1, A2, A3, and A5. Permutation weighting minimizes a Bregman divergence between the estimated weights and the true density ratio between the permuted and observed data, $p(a)\, p(\mathbf{x}) / p(a, \mathbf{x})$.

The proof follows directly from Proposition 3 of [26]. The specific Bregman divergence being minimized follows as a consequence of the scoring rule used by the classifier. An equivalent way to state Proposition 1 is that permutation weighting minimizes the risk of the importance sampler targeting the independent distribution. As we show in the supplement, when mean balance is targeted with a logistic regression, this corresponds to reweighting the treatment and control groups to minimize the distance to their permuted counterparts.

Consistency of the estimate is established with the following proposition.

Proposition 2.

Presume assumptions A1, A2, A3, and A5. Letting $\hat{w}$ be the weights estimated by permutation weighting, $\hat{w}(a, \mathbf{x}) \to p(a)\, p(\mathbf{x}) / p(a, \mathbf{x})$ as $n \to \infty$, where $p(a)$ and $p(\mathbf{x})$ are the marginal densities of $A$ and $\mathbf{X}$, respectively, and $p(a, \mathbf{x})$ is their joint density.


We note that the triple-bootstrapping approach used by permutation weighting approximates a scheme which first considers a dataset consisting of the original dataset (with label 0) and the set of all possible permutations (with label 1), and then draws a weighted bootstrap sample from this combined dataset. (This view also leads to a method for learning permutation weights in the very large sample setting using stochastic optimization and resampling; the full procedure is outlined in the supplement.) Through the lens of the weighted bootstrap, permutation weighting is an instance of aggregated bootstrapped (bagged) classifiers. [30] (Proposition 1) provides consistency of an average of bagged classifiers under the condition that each individual classifier is consistent, i.e., if each classifier minimizes the Bayes risk asymptotically, then the average of the classifiers does as well. Connecting this with [26], who show that minimizing the risk of a classifier employing a strictly proper scoring rule produces valid density ratio estimates, we arrive at the proposition. ∎

4 Demonstrations

For the following simulation studies, we will examine only the performance of simple weighting estimators as in equation 1. Appendix I provides similar results for the doubly-robust estimators of [31] as well as for the estimation of weighted outcome regressions.

Our evaluations are based on integrated root mean squared error (IRMSE), as in [31], with $s$ indexing the $S$ simulations and $\bar{Y}_s(a)$ being the unconditional expectation of a given potential outcome in simulation $s$:

$$\mathrm{IRMSE} = \int_{\mathcal{A}} \sqrt{\frac{1}{S} \sum_{s=1}^{S} \left(\hat{\mu}_s(a) - \bar{Y}_s(a)\right)^2}\; p(a)\, da.$$

That is, we take an average of RMSE weighted by the marginal probability of treatment. Like [31], we perform this evaluation over the middle 90% of the distribution of $A$ (in the case of binary treatments, we evaluate over the entire distribution of $A$).

We also evaluate the integrated mean absolute bias (IMAB), which replaces the inner root mean square with the absolute value of the mean error, $\left|\frac{1}{S}\sum_{s=1}^{S} \left(\hat{\mu}_s(a) - \bar{Y}_s(a)\right)\right|$.
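The two metrics can be sketched as follows, with the integral over $a$ discretized to a weighted sum over a grid of doses (the array-shape conventions are ours):

```python
import numpy as np

def irmse(mu_hat, mu_true, p_a):
    """IRMSE: RMSE across simulations at each dose, integrated over doses.

    mu_hat:  (S, K) estimates over S simulations and K grid points
    mu_true: (S, K) true dose-response values
    p_a:     (K,) marginal treatment density on the grid (sums to 1)
    """
    rmse = np.sqrt(np.mean((mu_hat - mu_true) ** 2, axis=0))  # per dose
    return np.sum(p_a * rmse)

def imab(mu_hat, mu_true, p_a):
    """Integrated mean absolute bias: |mean error| per dose, then integrate."""
    mab = np.abs(np.mean(mu_hat - mu_true, axis=0))
    return np.sum(p_a * mab)
```

Note the distinction the two metrics draw: errors that change sign across simulations still contribute to IRMSE but cancel in IMAB, so IMAB isolates systematic bias while IRMSE also captures variance.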

When permutation weighting is used, we perform 100 independent bootstraps to generate weights.

4.1 Binary Treatment

Our first set of simulations follows the design of [32]. In this simulation, four independent, standard normal covariates are drawn. A linear combination of them forms the outcome and treatment processes (the latter passed through a logit link). Two versions of this standard simulation are analyzed: one in which the four covariates are observed directly, and one in which only four non-linear and interactive functions of them are observed. For full details of the data generating process, see the appendix.


Results from simulations based on the correctly specified data generating process are shown in figure 1. A few points are worth noting in this case. First, the correct treatment and outcome models are linear. As such, the propensity score model is specified correctly and therefore performs well. Stable balancing weights also perform very strongly in this case (particularly in low sample sizes), as they both explicitly reduce the variance of the weights and correctly specify the form of the confounding relationship. These results show that, given sufficient sample size, PW with a logistic regression classifier (or boosting) replicates and eventually outperforms stable balancing weights (SBW), even when the true relationships are linear. Note, however, that bias under the boosted model is driven to the minimum very quickly, even though the more complicated model increases the variance (hurting the overall IRMSE). PW with logistic regression outperforms CBPS in this simulation, despite the fact that both seek to balance the correct specification of confounding. This difference in performance comes from the small data-dependent regularization in the score condition estimated by permutation weighting (details in C.2).

Figure 2 demonstrates results for the case when the researcher does not know the correct specifications of the confounding relationships of the covariate set with treatment. In these results, PW with boosting drastically improves on the state-of-the-art weighting estimators, reducing IRMSE by around half relative to state-of-the-art balancing propensity scores at the largest sample size. At smaller sample sizes the improvements are less substantial, but PW with boosting outperforms stable balancing weights well before the largest sample size. This is unsurprising, given that boosting is able to learn a more expressive balancing model (and, therefore, reduce bias) more effectively than any other method. Once there is sufficient data to learn a flexible classification model, PW begins to greatly outperform the state of the art. Detailed results in tabular format are available in appendix I.

Figure 1: Kang-Schafer simulation under correct specification of confounding variables.
Figure 2: Kang-Schafer simulation under misspecification of confounding variables.

4.2 Continuous Treatment

The same basic setup can be extended to the continuous treatment case by replacing the Bernoulli treatment assignment with a continuous analogue. For the following simulations, we simulate treatment dosage as a linear function as in [32], but with additive standard normal noise. Finally, the dosage enters the outcome model through a logit function to introduce a small non-linearity in the response to dose. Complete details are in appendix F.

Figure 3 shows the results when treatment is a linear function of the observed covariates. Propensity score weighting does quite well in terms of reducing bias (consistently with the lowest amount of bias out of all methods), but permutation weighting (particularly using a boosted model) does a better job of trading off bias and variance so as to reduce IRMSE (outperforming the normal-linear IPSW model by around 15% at larger sample sizes, for instance). Thus, even when the propensity score model is well specified, boosting is able to outperform it by regularizing more effectively (and, thereby, reducing variance). Since these simulations do not incorporate hyper-parameter tuning, it is likely that these effects would be even stronger if the boosting model were tuned data-adaptively, rather than with package defaults.

In the case of misspecification, figure 4 shows the learning curves of the various methods. Permutation weighting outperforms all methods at all sample sizes. While PW with a logistic classifier has very similar levels of bias to [11], it does so with sufficiently lower variance that it reduces IRMSE by around 15% at larger sample sizes. The boosting model improves upon the GLM both in terms of bias and IRMSE, providing estimates with around 40% lower IRMSE than does npCBPS at the largest sample size. A useful comparison is that permutation weighting outperforms the current state of the art by around four times as much as the state of the art improves on no weighting whatsoever.

Figure 3: Continuous Kang-Schafer simulation under correct specification of confounding variables.
Figure 4: Continuous Kang-Schafer simulation under misspecification of confounding variables.

4.3 Lalonde evaluation with continuous treatment

To explore the behavior of permutation weighting under continuous treatment regimes with irregularly distributed data, we turn to the data of [33] and, in particular, the Panel Study of Income Dynamics observational sample (discarding all treated units from the experimental sample). This dataset has been used productively as a means of evaluating the performance of methods for observational causal inference [34]. In particular, the useful properties of this data are that: 1) treatment effects are known due to the existence of an experimental benchmark [33]; 2) covariate data is sufficient to believe causal identification may be plausible (though the precise form of this identification is subject to debate, as in [35]); 3) units in the experiment are very different from those in the observational sample; and 4) the data are highly non-ellipsoidal, consisting of point-masses and otherwise irregular distributions. Following the simulation study in [36], we simulate a nonlinear process determining assignment of units to dosage level and, then, to outcome based on observed covariates. The treatment process is made to behave similarly to real-world data by estimating a random forest to predict presence in the experimental/observational sample as a function of observed covariates. Dosage is then a quartic function of that predicted score as well as the nonlinear function determining treatment assignment in [36] (for full details, see appendix G). The shape of the true dose-response function is similarly a quartic function of dose.

Figure 5 shows the IRMSE of a variety of state-of-the-art weighting estimators on this simulated benchmark. With just a simple weighting estimator (the most direct test of the performance of a set of weights), only weights generated by permutation weighting outperform the raw, unweighted data. When a logistic regression is used as the classifier, PW performs better than no weighting by around 8% in terms of IRMSE. When a boosted model is used, this gap grows substantially, with PW outperforming the raw estimates by around 30%. While the rank order remains essentially unchanged when an outcome model is incorporated into the estimator, the differences in performance are reduced substantially due to the quality of the outcome model estimator.

Figure 5: Integrated Root mean squared error for continuous-treatment simulation based on the LaLonde data

4.4 Dube et al (2018) replication

In [37], the authors report results from an observational study of the labor supply elasticity of workers on the Mechanical Turk platform. In this study, they seek to estimate the response (in terms of the duration of time a task is available) to the price of a task. Negative values of this slope reflect positive labor supply elasticities; when the magnitude is high, this implies a large elasticity (and vice versa). Experimental studies provide a benchmark estimate of this elasticity. Figure 6 shows the estimated elasticities from the observational study of [37] using the Double ML technique of [38] in the left panel, while our results using only IPSW with permutation weights (estimated with a random forest) are in the right panel. It should be noted that Double ML estimates a conditional-variance-weighted average effect (which explains the clustering of points near the origin in the left panel). Permutation weighting, however, recovers the unconditional dose-response function without any form of conditional variance weighting, a much more difficult estimand to target. Nevertheless, permutation weighting is able to capture the same underlying causal effect. Crucially, both estimates are statistically indistinguishable from the experimental benchmark and from one another.

Figure 6: Estimated labor supply elasticities.

5 Summary

Weighting is one of the most commonly applied estimators for causal inference. This work provides a new lens on weighting by framing the problem in terms of importance sampling towards the distribution of treatment and covariates that would be observed under an RCT. Through this lens we introduced permutation weighting, which casts the weighting problem into a generic two-class classification problem and allows the standard machine learning toolkit to be applied to the problem. Permutation weighting generalizes existing balancing schemes, admits selection via cross-validation, and provides a framework to sensibly integrate generic treatment types within the same weighting estimator. Simulations have shown that permutation weighting outperforms existing methods even in conditions unfavorable to the assumptions underlying the model.

Appendix A Proper Scoring Rules

Definition 1 provides a more formal explanation of proper scoring rules.

Definition 1.

[39] Let $\Omega$ be a sample space and $\mathcal{A}$ a set of events, and let $P$ and $Q$ be two probability distributions defined with respect to $(\Omega, \mathcal{A})$. For a set of probability distributions $\mathcal{P}$, a scoring rule $S$ is said to be proper if, for any $P, Q \in \mathcal{P}$,

$$\mathbb{E}_{Y \sim P}\left[S(P, Y)\right] \geq \mathbb{E}_{Y \sim P}\left[S(Q, Y)\right].$$

A scoring rule is a strictly proper scoring rule if that inequality is strict whenever $Q \neq P$.

More intuitively, strictly proper scoring rules define a function which, when optimized, provides calibrated forecasts. The most common examples of strictly proper scoring rules are logistic, exponential, and quadratic loss [39]. Strictly proper scoring rules also provide a natural connection to Bregman divergences: every proper scoring rule is associated with a Bregman divergence between the estimated and true forecasting distribution [40]. We refer readers to [39] for a thorough introduction to proper scoring rules, and to [41], which provides theory that unifies Bregman divergences and scoring rules.
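As a worked example of strict propriety, consider the quadratic (Brier) loss for a binary outcome $Y \in \{0, 1\}$ with $P(Y = 1) = p$ and forecast $q$. The expected loss decomposes as

```latex
\mathbb{E}_{Y \sim p}\left[(q - Y)^2\right]
  = q^2 (1 - p) + (1 - q)^2 p
  = (q - p)^2 + p(1 - p),
```

which is uniquely minimized at $q = p$, so the quadratic loss is a strictly proper scoring rule: only the true forecast attains the minimum expected loss.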

Appendix B Alternative Derivation of the Balance Conditions of Permutation Weighting with Logistic Regression

Given the prominence of mean balance within the weighting literature, we now provide additional focus on the mean balance condition of permutation weighting. Specifically, we derive the first-order score condition for permutation weighting using logistic regression with a feature specification $\phi(a, \mathbf{x})$ and regression coefficients $\theta$, which allows for optimization via the generalized method of moments. We define the model $\eta_\theta(a, \mathbf{x}) = \sigma(\theta^\top \phi(a, \mathbf{x}))$ as a function that infers the probability that an instance $(a, \mathbf{x})$ belongs to the permuted data, denoted $c = 1$, rather than the original observed data, denoted $c = 0$; in other words, $\eta_\theta(a, \mathbf{x}) = P(C = 1 \mid a, \mathbf{x})$. We assume that there are $2n$ total data instances, with $n$ of the instances being the permuted data and $n$ being the original data. Note that the logistic loss is convex and twice continuously differentiable in the parameters $\theta$ (all strictly proper scoring rules obey these properties). We maximize the log-likelihood of the labels,

$$\ell(\theta) = \sum_{i=1}^{2n} c_i \log \eta_\theta(a_i, \mathbf{x}_i) + (1 - c_i) \log\left(1 - \eta_\theta(a_i, \mathbf{x}_i)\right),$$

which gives the following first-order condition:

$$\sum_{i=1}^{2n} \left(c_i - \eta_\theta(a_i, \mathbf{x}_i)\right) \phi(a_i, \mathbf{x}_i) = 0.$$

The contribution of the permuted instances ($c_i = 1$) may be rewritten as $\sum_{i : c_i = 1} \left(1 - \eta_\theta(a_i, \mathbf{x}_i)\right) \phi(a_i, \mathbf{x}_i)$, and the contribution of the observed instances ($c_i = 0$) as $-\sum_{i : c_i = 0} \eta_\theta(a_i, \mathbf{x}_i)\, \phi(a_i, \mathbf{x}_i)$. Substituting both in, we have the following score condition:

$$\sum_{i : c_i = 1} \left(1 - \eta_\theta(a_i, \mathbf{x}_i)\right) \phi(a_i, \mathbf{x}_i) = \sum_{i : c_i = 0} \eta_\theta(a_i, \mathbf{x}_i)\, \phi(a_i, \mathbf{x}_i). \tag{4}$$


An interpretation of the final term in equation 4 is as the difference in reweighted means between the treatment and control groups and the overall population. The case of balance, i.e., when both treatment and control are reweighted to the marginal mean of $\phi$, satisfies the score condition.

Appendix C Connections to Existing Weighting Estimators

We now examine the connection between permutation weighting and a number of covariate balancing estimators in the literature. In what follows, we first revisit the relationship with stabilized inverse propensity score weighting; we then look at score-based estimators, i.e., estimators which can be estimated using the generalized method of moments, with a particular focus on the covariate balancing propensity score [8]; we then examine margin-based estimators and their kernel extensions, establishing an equivalence between permutation weighting and kernel mean matching [17], kernel balancing [16], and kernel-based covariate functional balancing [29]; and finally we relate permutation weighting to stabilized balancing weights [9].

C.1 Stabilized Inverse Propensity Score

Perhaps the most immediately evident equivalence is to the stabilized inverse propensity score (IPSW) [21]. This relationship is addressed in the main text; it is included in this section for completeness. Recall that the stabilized propensity score weight is simply the marginal density of treatment divided by the conditional density of treatment given covariates, i.e., $w(a, \mathbf{x}) = p(a)/p(a \mid \mathbf{x})$. Employing simple algebra, we see that this quantity will be equivalent to the permutation weights, i.e., $p(a)\,p(\mathbf{x})/p(a, \mathbf{x})$, under correct specification of the conditional density. The crucial difference between permutation weighting and IPSW comes under mis-specification. Permutation weighting will still seek balance with respect to the conditions implied by the classifier. IPSW, on the other hand, may fail to seek balance under mis-specification. This can result in substantial bias, as we have seen in the empirical results of the main text.

C.2 Covariate Balancing Propensity Scores

We first note the score condition of the covariate balancing propensity score [8], which, in the notation of that work (with propensity model $\pi_\beta$ and balancing functions $\tilde{x}$ of the covariates), requires

$$\mathbb{E}\left[\frac{T \tilde{x}}{\pi_\beta(x)} - \frac{(1 - T)\tilde{x}}{1 - \pi_\beta(x)}\right] = 0.$$
Recalling the derivation of the score condition for PW with logistic loss from equation 4 provides a simple comparison:

Here we can see that both estimators explicitly minimize a balance condition: PW balances toward the product (i.e., independent) distribution, while CBPS balances between treatment classes. In large samples these are equivalent conditions. An interesting interpretation emerges, however, when we consider smaller samples. Permutation weighting attempts to match the level of balance that exists in the empirical sample, which provides a data-dependent regularization. This regularization likely explains the improvement of PW over CBPS in the synthetic experiments involving mis-specification.

We note that a similar derivation yields the following for IPW:


Here we can see that balance is not directly optimized, which explains much of the poor performance of IPW in the synthetic experiments with a mis-specified estimator.

C.3 Stabilized Balancing Weights, Kernel Mean Matching, and Kernel Balancing

We will now briefly introduce the MMD and weighting methods predicated on it, e.g. [17, 42], followed by a discussion of their connection to stabilized balancing weights [9].

The maximum mean discrepancy (MMD) [43] is a two-sample test statistic that distinguishes between two candidate distributions by finding the maximum distance between the means of the two samples after transformation by a function in some class $\mathcal{F}$, i.e.,

$$\mathrm{MMD}(\mathcal{F}, p, q) = \sup_{f \in \mathcal{F}} \left( \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{y \sim q}[f(y)] \right).$$

When $\mathcal{F}$ is the unit ball of a reproducing kernel Hilbert space this can be estimated as the squared difference of the sample means in feature space. Letting $\phi$ be the feature map of the kernel, with samples $\{x_i\}_{i=1}^{m}$ drawn from $p$ and $\{y_j\}_{j=1}^{n}$ drawn from $q$, the finite sample estimate of equation 7 is given as

$$\widehat{\mathrm{MMD}}^2 = \left\| \frac{1}{m} \sum_{i=1}^{m} \phi(x_i) - \frac{1}{n} \sum_{j=1}^{n} \phi(y_j) \right\|^2,$$

with $m$ and $n$ being the sizes of the samples drawn from $p$ and $q$, respectively. The value of the MMD reflects the distance between these means. There are a couple of points worth noting. First, if the kernel being used obeys certain properties, i.e., is characteristic [44], the MMD is able to differentiate between two exponential-family distributions on an arbitrary number of moments [43]. Second, when a linear kernel is employed this value is simply the squared difference in means between the two groups.
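As a concrete illustration (a minimal sketch, not code from the paper), the squared finite-sample MMD with an RBF kernel can be computed directly from the three kernel blocks:

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """k(a, b) = exp(-gamma * (a - b)^2) for 1-d samples."""
    d = a[:, None] - b[None, :]
    return np.exp(-gamma * d ** 2)

def mmd2(x, y, gamma=0.5):
    """Biased estimate of the squared MMD between samples x ~ p and y ~ q."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 500)
y = rng.normal(2.0, 1.0, 500)   # a shifted distribution
```

Identical samples give a squared MMD of exactly zero, while the shifted sample yields a clearly positive value.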

MMD has been used throughout the literature as an objective for minimizing imbalance. Within the context of domain adaptation, [17] introduce kernel mean matching (KMM), which defines an optimization procedure seeking a set of weights such that the distance between the reweighted source distribution and the target distribution is minimized, specifically

$$\min_{w \geq 0} \; \left\| \frac{1}{m} \sum_{i=1}^{m} w_i \phi(x_i) - \frac{1}{n} \sum_{j=1}^{n} \phi(y_j) \right\|^2 \quad \text{subject to} \quad \left| \frac{1}{m} \sum_{i=1}^{m} w_i - 1 \right| \leq \epsilon.$$
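One way to solve the KMM objective numerically is as a small quadratic program; the sketch below (an illustration using a generic SLSQP solver rather than a dedicated QP solver, with hypothetical parameter choices) expands the squared norm into its kernel form:

```python
import numpy as np
from scipy.optimize import minimize

def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def kmm_weights(x_src, x_tgt, gamma=0.5, eps=0.1, w_max=10.0):
    """Kernel mean matching: reweight x_src so its kernel mean matches x_tgt."""
    m, n = len(x_src), len(x_tgt)
    K = rbf(x_src, x_src, gamma)
    kappa = rbf(x_src, x_tgt, gamma).sum(axis=1) / n
    def obj(w):
        # ||(1/m) sum_i w_i phi(x_i) - (1/n) sum_j phi(y_j)||^2, up to a constant
        return w @ K @ w / m ** 2 - 2.0 * (w @ kappa) / m
    cons = [{'type': 'ineq', 'fun': lambda w: eps - (w.mean() - 1.0)},
            {'type': 'ineq', 'fun': lambda w: eps + (w.mean() - 1.0)}]
    res = minimize(obj, np.ones(m), bounds=[(0.0, w_max)] * m,
                   constraints=cons, method='SLSQP')
    return res.x

rng = np.random.default_rng(0)
x_src = rng.normal(0.0, 1.0, 100)
x_tgt = rng.normal(1.0, 1.0, 100)   # the target distribution to match
w = kmm_weights(x_src, x_tgt)
```

The resulting weights should pull the source sample's weighted mean toward the target mean, which is the first-moment consequence of matching kernel means.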
This procedure was later rediscovered for the task of balancing weights by [16]. Somewhat surprisingly, the connection to permutation weighting can be obtained via results already in the literature. [41] relate a pessimistic MMD to the support vector machine (SVM), seeking to maximize the MMD by solving the SVM problem

where $y_i$ indicates which dataset a sample has been drawn from and is coded $\pm 1$. [15] (Section 8) shows that solving the KMM objective is equivalent to solving the above SVM problem under the additional modification that the dual weights for one class are fixed to a constant, producing a Rocchio-style approximation [45] to the SVM. In both cases the final weighting is given directly by the values of the dual weights ($\alpha_i$).

The aforementioned classifiers may be employed within the context of permutation weighting by taking the two samples to be the observed data and the permuted data as the target distribution (as we have assumed throughout). In order to use the dual weights directly, the bootstrap procedure is replaced by an average over permutations. Alternatively, the weight function may be used directly, as in [15]. The benefit of the latter approach is the ability to use cross-validation for setting the hyper-parameters of the classifier. Asymptotically, since the independence property is obeyed by the permuted data, permutation weighting and these procedures are equivalent in the binary treatment setting. To see why this is the case, consider the explicit form of permutation weighting under the MMD loss:

where we have again assumed that the hypothesis space is a reproducing kernel Hilbert space. This is precisely the Hilbert-Schmidt independence criterion [46], which was shown by [47] to be equivalent to the maximum mean discrepancy between the two samples associated with treatment and control, respectively, when treatment is binary. In the non-binary case the permutation weighting setup defines a novel kernel balancing estimator for general treatments.
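The Hilbert-Schmidt independence criterion mentioned here can be made concrete with its standard biased empirical estimate, $\operatorname{tr}(KHLH)/n^2$ with $H$ the centering matrix [46]; the following is a small illustrative sketch (simulated data, not from the paper):

```python
import numpy as np

def rbf_gram(a, gamma=0.5):
    d = a[:, None] - a[None, :]
    return np.exp(-gamma * d ** 2)

def hsic(x, t, gamma=0.5):
    """Biased empirical HSIC: (1/n^2) tr(K H L H), H the centering matrix."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = rbf_gram(x, gamma), rbf_gram(t, gamma)
    return np.trace(K @ H @ L @ H) / n ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=400)
t_dep = x + 0.1 * rng.normal(size=400)   # strongly dependent with x
t_ind = rng.normal(size=400)             # independent of x
```

HSIC is near zero for the independent pair and clearly larger for the dependent pair, which is the property the permutation-weighting objective exploits.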

Finally, we examine the relationship between permutation weighting and the stable balancing weights of [9]. Zubizarreta [9] defined the following quadratic program to infer what he refers to as stable balancing weights (stated here in slightly simplified notation, with balancing functions $B_k$ and tolerances $\delta_k$):

$$\min_{w} \; \sum_i w_i^2$$

Subject to

$$\left| \sum_i w_i B_k(x_i) - \overline{B}_k \right| \leq \delta_k \;\; \forall k, \qquad \sum_i w_i = 1, \qquad w_i \geq 0.$$
Intuitively, this attempts to minimize the variance of the weights subject to constraints on marginal balance conditions. Comparing this to the kernel mean matching problem (equation 8), we see that stable balancing weights emphasize uniform weights subject to predetermined levels of marginal balance, whereas kernel mean matching seeks to minimize the maximum discrepancy between the two distributions. The benefit of minimizing the discrepancy rather than setting it to a fixed level is that it removes a large amount of possible human-induced error in the form of additional hyperparameters. While the MMD approach does not have an explicit mechanism to reduce variance, an approximation can be applied by solving the SVM problem using a $\nu$-SVM [48], which imposes an additional constraint limiting the size of individual weights.
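To illustrate the stable-balancing-weights program (a sketch using a generic SLSQP solver with simulated data and hypothetical tolerances, not the implementation of [9]):

```python
import numpy as np
from scipy.optimize import minimize

def stable_weights(X, target_means, delta=0.05):
    """Minimize sum of squared weights subject to approximate mean balance:
    |sum_i w_i X[i, k] - target_means[k]| <= delta, w >= 0, sum w = 1."""
    n, d = X.shape
    cons = [{'type': 'eq', 'fun': lambda w: w.sum() - 1.0}]
    for k in range(d):
        cons.append({'type': 'ineq',
                     'fun': lambda w, k=k: delta - (w @ X[:, k] - target_means[k])})
        cons.append({'type': 'ineq',
                     'fun': lambda w, k=k: delta + (w @ X[:, k] - target_means[k])})
    res = minimize(lambda w: np.sum(w ** 2), np.full(n, 1.0 / n),
                   bounds=[(0.0, 1.0)] * n, constraints=cons, method='SLSQP')
    return res.x

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(100, 2))
# hypothetical target: shift the first covariate's mean, keep the second
target = np.array([X[:, 0].mean() + 0.3, X[:, 1].mean()])
w = stable_weights(X, target)
```

In contrast to KMM, the balance level here is a fixed input ($\delta$), and the objective keeps the weights as close to uniform as the constraints allow.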

Appendix D Stochastic Training of Permutation Weights

We now outline how stochastic training can be used to efficiently estimate permutation weights in large data settings. Assume that we have constructed a dataset consisting of the original dataset (with label 0) and the set of all possible permutations (with label 1). Under appropriate weighting of the classes, the resulting weights provide a consistent estimate of the density ratio $p(x)p(t)/p(x, t)$. It is clearly infeasible to solve such a problem over all permutations using traditional optimization techniques such as gradient descent; however, stochastic optimization lets us approximate it efficiently. The process is as follows: at each iteration a batch is drawn by taking a bootstrap sample from the observed data (negative class) and independent bootstrap samples of the covariates and treatments (positive class). The loss and parameter updates are then carried out as they would be in standard SGD optimization. This process also enjoys consistency, with convergence depending on the number of iterations [49]. Note that this is in contrast to non-stochastic optimization procedures such as gradient descent, where convergence occurs as a function of sample size [49].
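The procedure above can be sketched as follows (an illustration with a simple logistic model and simulated data; batch size, learning rate, and iteration count are hypothetical choices, not the paper's settings). Each minibatch pairs a bootstrap of observed $(x, t)$ pairs against pairs whose $x$ and $t$ are bootstrapped independently, which approximates sampling from the set of all permutations:

```python
import numpy as np

def pw_sgd(x, t, n_iter=4000, batch=128, lr=0.05, seed=0):
    """Stochastic permutation weighting: each minibatch classifies bootstrapped
    observed pairs (label 0) against independently bootstrapped x, t (label 1)."""
    rng = np.random.default_rng(seed)
    n = len(t)
    def feats(xx, tt):
        return np.column_stack([np.ones(len(tt)), xx, tt, xx * tt])
    beta = np.zeros(4)
    for _ in range(n_iter):
        i0 = rng.integers(0, n, batch)                     # observed pairs
        ix, it = rng.integers(0, n, batch), rng.integers(0, n, batch)
        Z = np.vstack([feats(x[i0], t[i0]), feats(x[ix], t[it])])
        y = np.concatenate([np.zeros(batch), np.ones(batch)])
        p = 1.0 / (1.0 + np.exp(-Z @ beta))
        beta -= lr * Z.T @ (p - y) / len(y)                # SGD step on log-loss
    return np.exp(feats(x, t) @ beta)                      # odds = estimated weights

rng = np.random.default_rng(1)
x = rng.normal(size=2000)
t = (x + rng.normal(size=2000) > 0).astype(float)
w = pw_sgd(x, t)
```

As with the full-batch version, the learned weights should reduce the dependence between covariates and treatment in the reweighted sample.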

Appendix E Kang-Schafer Data Generating Process

Under the non-linear "misspecification", the covariates are not observed directly, but instead only the following transformations:
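For reference, the transformations commonly stated for the Kang and Schafer (2007) design are sketched below (reproduced from the literature on this benchmark rather than from this paper, so treat the exact constants as an assumption):

```python
import numpy as np

def kang_schafer_observed(x):
    """Non-linear transformations of latent covariates x (n x 4), as commonly
    stated for the Kang-Schafer (2007) 'misspecified' design."""
    z1 = np.exp(x[:, 0] / 2.0)
    z2 = x[:, 1] / (1.0 + np.exp(x[:, 0])) + 10.0
    z3 = (x[:, 0] * x[:, 2] / 25.0 + 0.6) ** 3
    z4 = (x[:, 1] + x[:, 3] + 20.0) ** 2
    return np.column_stack([z1, z2, z3, z4])

x = np.random.default_rng(0).normal(size=(1000, 4))
z = kang_schafer_observed(x)
```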

Appendix F Continuous Kang-Schafer Data Generating Process

Misspecification is handled identically to the binary case.

Appendix G LaLonde Simulation Data Generating Process

Presence in the experimental sample is estimated using a random forest (using all standard pre-treatment covariates). The predicted values from this model enter the dosage assignment model through a quartic function, along with non-linear effects of age, education and pre-treatment income. Dosage enters the outcome model through a quartic function (thus, the true dose-response function is quartic), along with post-treatment income and a non-linear function of pre-treatment income. The outcome is drawn from a normal distribution with a standard deviation of 10.

Appendix H Doubly Robust Estimation

A straightforward estimator for the dose-response function is given by the so-called ‘direct method’ (e.g. see [6]). For this method of estimation, the dose-response at a treatment level $a$ would be estimated as:

$$\hat{\mu}_{\mathrm{DM}}(a) = \frac{1}{n} \sum_{i=1}^{n} \hat{\mu}(x_i, a).$$

The direct method, then, trains a predictor $\hat{\mu}(x, a)$ for the outcome and then, for each $a$, marginalizes over the empirical covariate distribution.

It is possible to improve on the direct method by incorporating a doubly robust estimator as in [31]:

This quantity provides unbiased estimates when either the propensity score model or the outcome model is correctly specified. We can further swap out the propensity scores in the simple doubly-robust estimator for those generated through PW or any other IPSW-like method, as in:

That is, we can observe that the first term of the doubly-robust estimator is simply the IPSW term, and replace it with any IPSW-like weight. Moreover, the use of case-weights in estimating the outcome model does not affect the consistency of the estimator (under our assumption of positivity), but with appropriate weights it may do a better job of ensuring that accuracy is prioritized in the areas of covariate space where it is actually needed for the dose-response function. Thus, in practice, we apply case-weights when estimating a machine learning model for use in the direct method.
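The structure of this estimator can be sketched on a toy linear problem (a hypothetical simulated setting where the true densities are known; the weights below come from those known densities rather than from PW, purely to keep the sketch self-contained). The pseudo-outcome combines a weighted outcome-model residual with the direct-method term:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
a = x + rng.normal(size=n)            # continuous treatment, confounded by x
y = a + x + rng.normal(size=n)        # true dose-response: E[Y | do(a)] = a

# outcome model mu(x, a): ordinary least squares on (1, x, a)
D = np.column_stack([np.ones(n), x, a])
coef, *_ = np.linalg.lstsq(D, y, rcond=None)
mu_hat = D @ coef

# stabilized weights p(a) / p(a | x), here from the known normal densities
def norm_pdf(v, m, s):
    return np.exp(-0.5 * ((v - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
w = norm_pdf(a, 0.0, np.sqrt(2.0)) / norm_pdf(a, x, 1.0)

# DR-style pseudo-outcome: weighted residual plus the direct-method term,
# (1/n) sum_j mu(x_j, a_i), which is linear in a for this outcome model
direct = coef[0] + coef[1] * x.mean() + coef[2] * a
xi = w * (y - mu_hat) + direct
```

Regressing the pseudo-outcomes on dosage then recovers the dose-response curve; with the outcome model correctly specified, the weighted residual term contributes noise but no bias.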

In order to actually estimate a dose-response function from data, it is necessary to additionally pass these per-observation estimates through a flexible function approximation method (such as a local kernel regression) to estimate the dose-response curve. We will not dwell on this latter component; interested readers may see [31] for more details.

Appendix I Extended simulation results

The following tables show IRMSE and Bias estimates (defined identically as in the main text) for all combinations of outcome estimation strategies (weighting only, direct method and doubly-robust), models (OLS and random forests) and weighting methods. Estimates of bias or IRMSE are followed by the standard error (estimated via non-parametric bootstrap).

I.1 Binary treatment – Kang and Schafer (2007)

Model          Metric  Unweighted      Logit           CBPS            SBW             PW (Logit)      PW (Boosting)
Model Free     Bias    9.414 (0.094)   0.369 (0.051)   0.093 (0.053)   0.502 (0.003)   0.759 (0.026)   0.105 (0.037)
               IRMSE   9.477 (0.106)   0.825 (0.079)   0.704 (0.053)   0.501 (0.003)   0.831 (0.031)   0.487 (0.033)
Direct Method
OLS            Bias    0.503 (0.003)   0.503 (0.003)   0.503 (0.003)   0.504 (0.003)   0.504 (0.003)   0.505 (0.003)
               IRMSE   0.503 (0.003)   0.503 (0.003)   0.503 (0.003)   0.503 (0.003)   0.503 (0.003)   0.505 (0.003)
Random Forest  Bias    0.416 (0.014)   0.035 (0.005)   0.032 (0.005)   0.059 (0.004)   0.028 (0.005)   0.031 (0.006)
               IRMSE   0.435 (0.014)   0.081 (0.005)   0.083 (0.005)   0.133 (0.008)   0.077 (0.004)   0.079 (0.005)
Doubly Robust
OLS            Bias    0.503 (0.003)   0.503 (0.003)   0.503 (0.003)   0.504 (0.003)   0.504 (0.003)   0.505 (0.003)
               IRMSE   0.503 (0.003)   0.503 (0.003)   0.503 (0.003)   0.503 (0.003)   0.503 (0.003)   0.505 (0.003)
Random Forest  Bias    0.466 (0.017)   0.196 (0.011)   0.192 (0.010)   0.024 (0.011)   0.193 (0.012)   0.198 (0.012)
               IRMSE   0.488 (0.016)   0.218 (0.010)   0.213 (0.009)   0.126 (0.009)   0.214 (0.010)   0.221 (0.010)
Table 1: Well specified
Model          Metric  Unweighted      Logit           CBPS            SBW             PW (Logit)      PW (Boosting)
Model Free     Bias    9.392 (0.103)   5.811 (1.184)   2.238 (0.098)   2.237 (0.061)   2.283 (0.073)   1.665 (0.070)
               IRMSE   9.428 (0.100)   9.812 (2.710)   2.685 (0.091)   2.402 (0.053)   2.436 (0.063)   1.775 (0.069)
Direct Method
OLS            Bias    2.845 (0.064)   2.671 (0.097)   2.677 (0.103)   2.236 (0.061)   2.174 (0.058)   1.199 (0.051)
               IRMSE   2.878 (0.061)   2.763 (0.087)   2.770 (0.100)   2.401 (0.053)   2.235 (0.055)   1.268 (0.049)
Random Forest  Bias    0.784 (0.020)   0.240 (0.012)   0.282 (0.014)   0.245 (0.018)   0.234 (0.013)   0.217 (0.012)
               IRMSE   0.803 (0.018)   0.276 (0.012)   0.306 (0.012)   0.308 (0.015)   0.267 (0.010)   0.243 (0.010)
Doubly Robust
OLS            Bias    2.845 (0.064)   2.671 (0.097)   2.677 (0.103)   2.236 (0.061)   2.174 (0.058)   1.199 (0.051)
               IRMSE   2.878 (0.061)   2.763 (0.087)   2.770 (0.100)   2.401 (0.053)   2.235 (0.055)   1.268 (0.049)
Random Forest  Bias    0.909 (0.025)   0.512 (0.020)   0.538 (0.020)   0.393 (0.019)   0.488 (0.020)   0.491 (0.019)
               IRMSE   0.933 (0.022)   0.542 (0.017)   0.563 (0.017)   0.450 (0.015)   0.517 (0.017)   0.516 (0.017)
Table 2: Misspecified

I.2 Continuous treatment – Kang and Schafer (2007)

Model          Metric  Unweighted       Normal-Linear    npCBPS           PW (Logit)       PW (Boosting)
Model Free     Bias    15.750 (0.114)   4.113 (0.285)    7.996 (0.377)    8.862 (0.116)    6.021 (0.149)
               IRMSE   16.142 (0.114)   8.685 (0.242)    11.445 (0.191)   9.697 (0.119)    7.576 (0.143)
Direct Method
OLS            Bias    0.269 (0.000)    0.269 (0.000)    0.269 (0.000)    0.269 (0.000)    0.269 (0.000)
               IRMSE   0.269 (0.000)    0.284 (0.003)    0.298 (0.009)    0.270 (0.000)    0.273 (0.001)
Random Forest  Bias    2.507 (0.031)    1.171 (0.019)    1.444 (0.026)    1.302 (0.014)    1.280 (0.016)
               IRMSE   2.574 (0.030)    1.229 (0.021)    1.499 (0.028)    1.339 (0.015)    1.321 (0.017)
Doubly Robust
OLS            Bias    0.249 (0.001)    0.255 (0.003)    0.259 (0.002)    0.260 (0.001)    0.257 (0.002)
               IRMSE   0.272 (0.002)    0.354 (0.008)    0.372 (0.017)    0.289 (0.001)    0.304 (0.003)
Random Forest  Bias    2.620 (0.033)    1.423 (0.021)    1.719 (0.027)    1.603 (0.015)    1.560 (0.017)
               IRMSE   2.711 (0.031)    1.500 (0.025)    1.797 (0.029)    1.663 (0.017)    1.624 (0.019)
Table 3: Well specified
Model          Metric  Unweighted       Normal-Linear     npCBPS           PW (Logit)       PW (Boosting)
Model Free     Bias    15.549 (0.130)   16.810 (0.526)    10.821 (0.259)   10.810 (0.126)   8.406 (0.141)
               IRMSE   16.002 (0.115)   23.581 (0.881)    14.747 (0.315)   11.637 (0.143)   9.418 (0.150)
Direct Method
OLS            Bias    0.269 (0.000)    1.351 (0.676)     0.551 (0.195)    2.539 (0.043)    1.276 (0.053)
               IRMSE   0.269 (0.000)    4.436 (2.533)     1.687 (0.178)    2.549 (0.049)    1.323 (0.051)
Random Forest  Bias    3.221 (0.034)    6.865 (3.195)     2.601 (0.051)    2.544 (0.036)    2.725 (0.056)
               IRMSE   3.273 (0.032)    24.459 (13.162)   2.685 (0.056)    2.600 (0.038)    2.811 (0.059)
Doubly Robust
OLS            Bias    2.450 (0.050)    1.709 (0.461)     1.326 (0.105)    2.382 (0.044)    1.306 (0.059)
               IRMSE   2.879 (0.061)    6.012 (1.929)     3.922 (0.211)    2.692 (0.045)    2.285 (0.047)
Random Forest  Bias    3.427 (0.034)    6.922 (3.121)     2.896 (0.051)    2.840 (0.037)    3.027 (0.056)
               IRMSE   3.503 (0.033)    24.528 (13.104)   2.994 (0.056)    2.918 (0.039)    3.130 (0.059)
Table 4: Misspecified


  • [1] Donald B Rubin. Causal inference using potential outcomes. Journal of the American Statistical Association, 2011.
  • [2] Judea Pearl. Causality. Cambridge University Press, 2009.
  • [3] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
  • [4] Stephen R. Cole and Miguel A. Hernán. Constructing inverse probability weights for marginal structural models. American Journal of Epidemiology, 168(6):656–664, 2008.
  • [5] Keisuke Hirano, Guido W Imbens, and Geert Ridder. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161–1189, 2003.
  • [6] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pages 1097–1104. Omnipress, 2011.
  • [7] Student. Comparison between balanced and random arrangements of field plots. Biometrika, 29(3/4):363–378, 1938.
  • [8] Kosuke Imai and Marc Ratkovic. Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):243–263, 2014.
  • [9] José R Zubizarreta. Stable weights that balance covariates for estimation with incomplete outcome data. Journal of the American Statistical Association, 110(511):910–922, 2015.
  • [10] Jens Hainmueller. Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis, 20(1):25–46, 2012.
  • [11] Christian Fong, Chad Hazlett, Kosuke Imai, et al. Covariate balancing propensity score for a continuous treatment: Application to the efficacy of political advertisements. The Annals of Applied Statistics, 12(1):156–177, 2018.
  • [12] Miguel A Hernán and Sarah L Taubman. Does obesity shorten life? the importance of well-defined interventions to answer causal questions. International journal of obesity, 32(S3):S8, 2008.
  • [13] Jing Qin. Inferences for case-control and semiparametric two-sample density ratio models. Biometrika, 85(3):619–630, 1998.
  • [14] Kuang Fu Cheng, Chih-Kang Chu, et al. Semiparametric density estimation under a two-sample density ratio model. Bernoulli, 10(4):583–604, 2004.
  • [15] Steffen Bickel, Michael Brückner, and Tobias Scheffer. Discriminative learning for differing training and test distributions. In Proceedings of the 24th international conference on Machine learning, pages 81–88. ACM, 2007.
  • [16] Chad Hazlett. Kernel balancing: A flexible non-parametric weighting procedure for estimating causal effects. arXiv preprint arXiv:1605.00155, 2016.
  • [17] Jiayuan Huang, Arthur Gretton, Karsten M Borgwardt, Bernhard Schölkopf, and Alex J Smola. Correcting sample selection bias by unlabeled data. In Advances in neural information processing systems, pages 601–608, 2007.
  • [18] David A. Freedman. Statistical models and shoe leather. Sociological Methodology, 21:291–313, 1991.
  • [19] Daniel G. Horvitz and Donovan J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.
  • [20] Jaroslav Hajek. Asymptotic theory of rejective sampling with varying probabilities from a finite population. Ann. Math. Statist., 35(4):1491–1523, 12 1964.
  • [21] James M Robins. Causal inference from complex longitudinal data. In Latent variable modeling and applications to causality, pages 69–117. Springer, 1997.
  • [22] Valerie S Harder, Elizabeth A Stuart, and James C Anthony. Propensity score techniques and the assessment of measured covariate balance to test causal associations in psychological research. Psychological methods, 15(3):234, 2010.
  • [23] Paul R Rosenbaum. Observational studies. In Observational studies, pages 1–17. Springer, 2002.
  • [24] Andreas Buja, Werner Stuetzle, and Yi Shen. Loss functions for binary class probability estimation and classification: Structure and applications. Technical Report, 2005.
  • [25] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density-ratio matching under the bregman divergence: a unified framework of density-ratio estimation. Annals of the Institute of Statistical Mathematics, 64(5):1009–1044, 2012.
  • [26] Aditya Menon and Cheng Soon Ong. Linking losses for density ratio and class-probability estimation. In International Conference on Machine Learning, pages 304–313, 2016.
  • [27] Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.
  • [28] Ji Zhu and Trevor Hastie. Kernel logistic regression and the import vector machine. In Advances in neural information processing systems, pages 1081–1088, 2002.
  • [29] Raymond KW Wong and Kwun Chuen Gary Chan. Kernel-based covariate functional balancing for observational studies. Biometrika, 105(1):199–213, 2017.
  • [30] Gérard Biau, Luc Devroye, and Gábor Lugosi. Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research, 9(Sep):2015–2033, 2008.
  • [31] Edward H. Kennedy, Zongming Ma, Matthew D. McHugh, and Dylan S. Small. Non-parametric methods for doubly robust estimation of continuous treatment effects. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(4):1229–1245, 2016.
  • [32] Joseph DY Kang, Joseph L Schafer, et al. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical science, 22(4):523–539, 2007.
  • [33] Robert J LaLonde. Evaluating the econometric evaluations of training programs with experimental data. The American economic review, pages 604–620, 1986.
  • [34] Rajeev H Dehejia and Sadek Wahba. Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American statistical Association, 94(448):1053–1062, 1999.
  • [35] Jeffrey A Smith and Petra E Todd. Does matching overcome lalonde’s critique of nonexperimental estimators? Journal of econometrics, 125(1-2):305–353, 2005.
  • [36] Alexis Diamond and Jasjeet S Sekhon. Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. Review of Economics and Statistics, 95(3):932–945, 2013.
  • [37] Arindrajit Dube, Jeff Jacobs, Suresh Naidu, and Siddharth Suri. Monopsony in online labor markets. Technical report, National Bureau of Economic Research, 2018.
  • [38] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018.
  • [39] Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.
  • [40] Ferenc Huszar. Scoring rules, divergences and information in Bayesian machine learning. PhD thesis, University of Cambridge, 2013.
  • [41] Mark D Reid and Robert C Williamson. Information, divergence and risk for binary experiments. Journal of Machine Learning Research, 12(Mar):731–817, 2011.
  • [42] Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Scholkopf. Covariate shift by kernel mean matching. Dataset Shift in Machine Learning, pages 131–160, 2009.
  • [43] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
  • [44] Bharath K Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Gert RG Lanckriet, and Bernhard Schölkopf. Injective hilbert space embeddings of probability measures. In COLT, volume 21, pages 111–122, 2008.
  • [45] Thorsten Joachims. A probabilistic analysis of the rocchio algorithm with tfidf for text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 143–151. Morgan Kaufmann Publishers Inc., 1997.
  • [46] A. Gretton, R. Herbrich, A. Smola, O. Bousquet, and B. Schoelkopf. Kernel methods for measuring independence. Journal of Machine Learning Research, 6:2075–2129, 2005.
  • [47] Le Song. Learning via Hilbert space embedding of distributions. 2008.
  • [48] Bernhard Scholkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2001.
  • [49] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in neural information processing systems, pages 161–168, 2008.