Estimating individual treatment effect: generalization bounds and algorithms

06/13/2016 ∙ by Uri Shalit, et al.

There is intense interest in applying machine learning to problems of causal inference in fields such as healthcare, economics and education. In particular, individual-level causal inference has important applications such as precision medicine. We give a new theoretical analysis and family of algorithms for predicting individual treatment effect (ITE) from observational data, under the assumption known as strong ignorability. The algorithms learn a "balanced" representation such that the induced treated and control distributions look similar. We give a novel, simple and intuitive generalization-error bound showing that the expected ITE estimation error of a representation is bounded by a sum of the standard generalization-error of that representation and the distance between the treated and control distributions induced by the representation. We use Integral Probability Metrics to measure distances between distributions, deriving explicit bounds for the Wasserstein and Maximum Mean Discrepancy (MMD) distances. Experiments on real and simulated data show the new algorithms match or outperform the state-of-the-art.




1 Introduction

Making predictions about causal effects of actions is a central problem in many domains. For example, a doctor deciding which medication will cause better outcomes for a patient; a government deciding who would benefit most from subsidized job training; or a teacher deciding which study program would most benefit a specific student. In this paper we focus on the problem of making these predictions based on observational data. Observational data is data which contains past actions, their outcomes, and possibly more context, but without direct access to the mechanism which gave rise to the action. For example we might have access to records of patients (context), their medications (actions), and outcomes, but we do not have complete knowledge of why a specific action was applied to a patient.

The hallmark of learning from observational data is that the actions observed in the data depend on variables which might also affect the outcome, resulting in confounding: For example, richer patients might better afford certain medications, and job training might only be given to those motivated enough to seek it. The challenge is how to untangle these confounding factors and make valid predictions. Specifically, we work under the common simplifying assumption of “no-hidden confounding”, assuming that all the factors determining which actions were taken are observed. In the examples above, it would mean that we have measured a patient’s wealth or an employee’s motivation.

As a learning problem, estimating causal effects from observational data is different from classic learning in that in our training data we never see the individual-level effect. For each unit, we only see their response to one of the possible actions - the one they had actually received. This is close to what is known in the machine learning literature as “learning from logged bandit feedback” (Strehl et al., 2010; Swaminathan & Joachims, 2015), with the distinction that we do not have access to the model generating the action.

Our work differs from much work in causal inference in that we focus on the individual-level causal effect (also known as “c-specific treatment effects”; Shpitser & Pearl (2006); Pearl (2015)), rather than the average or population-level effect. Our main contribution is to give what is, to the best of our knowledge, the first generalization-error bound for estimating individual-level causal effect, where each individual is identified by its features $x$. (Our use of the term generalization differs from its use in the study of transportability, where the goal is to generalize causal conclusions across distributions (Bareinboim & Pearl, 2016).) The bound leads naturally to a new family of representation-learning based algorithms (Bengio et al., 2013), which we show to match or outperform state-of-the-art methods on several causal effect inference tasks.

We frame our results using the Rubin-Neyman potential outcomes framework (Rubin, 2011), as follows. We assume that for a unit with features $x \in \mathcal{X}$, and an action (also known as treatment or intervention) $t \in \{0, 1\}$, there are two potential outcomes: $Y_0$ and $Y_1$. In our data, for each unit we only see one of the potential outcomes, depending on the treatment assignment: if $t = 0$ we observe $y = Y_0$, and if $t = 1$ we observe $y = Y_1$; this is known as the Consistency assumption. For example, $x$ can denote the set of lab tests and demographic factors of a diabetic patient, $t = 0$ denotes the standard medication for controlling blood sugar, $t = 1$ denotes a new medication, and $Y_0$ and $Y_1$ indicate the patient’s blood sugar level if they were to be given medications $t = 0$ and $t = 1$, respectively.

We will denote $m_1(x) = \mathbb{E}[Y_1 \mid x]$, $m_0(x) = \mathbb{E}[Y_0 \mid x]$. We are interested in learning the function $\tau(x) := \mathbb{E}[Y_1 - Y_0 \mid x] = m_1(x) - m_0(x)$. $\tau(x)$ is the expected treatment effect of $t = 1$ relative to $t = 0$ on an individual unit with characteristics $x$, or the Individual Treatment Effect (ITE), sometimes known as the Conditional Average Treatment Effect (CATE). For example, for a patient with features $x$, we can use this to predict which of two treatments will have a better outcome. The fundamental problem of causal inference is that for any $x$ in our data we only observe $Y_1$ or $Y_0$, but never both.

As mentioned above, we make an important “no-hidden confounders” assumption, in order to make the conditional causal effect identifiable. We formalize this assumption by using the standard strong ignorability condition: $(Y_1, Y_0) \perp t \mid x$, and $0 < p(t = 1 \mid x) < 1$ for all $x$. Strong ignorability is a sufficient condition for the ITE function $\tau(x)$ to be identifiable (Imbens & Wooldridge, 2009; Pearl, 2015; Rolling, 2014): see proof in the supplement. The validity of strong ignorability cannot be assessed from data, and must be determined by domain knowledge and understanding of the causal relationships between the variables.

One approach to the problem of estimating the function $\tau(x)$ is by learning the two functions $m_0(x)$ and $m_1(x)$ using samples from $p(Y_t \mid x, t)$. This is similar to a standard machine learning problem of learning from finite samples. However, there is an additional source of variance at work here: For example, if mostly rich patients received treatment $t = 1$, and mostly poor patients received treatment $t = 0$, we might have an unreliable estimation of $m_1(x)$ for poor patients. In this paper we upper bound this additional source of variance using an Integral Probability Metric (IPM) measure of distance between two distributions $p(x \mid t = 0)$ and $p(x \mid t = 1)$, also known as the control and treated distributions. In practice we use two specific IPMs: the Maximum Mean Discrepancy (Gretton et al., 2012), and the Wasserstein distance (Villani, 2008; Cuturi & Doucet, 2014). We show that the expected error in learning the individual treatment effect function $\tau(x)$ is upper bounded by the error of learning $Y_1$ and $Y_0$, plus the IPM term. In the randomized controlled trial setting, where $t \perp x$, the IPM term is $0$, and our bound naturally reduces to a standard learning problem of learning two functions.

The bound we derive points the way to a family of algorithms based on the idea of representation learning (Bengio et al., 2013): Jointly learn hypotheses for both treated and control on top of a representation which minimizes a weighted sum of the factual loss (the standard supervised machine learning objective), and the IPM distance between the control and treated distributions induced by the representation. This can be viewed as learning the functions $m_0$ and $m_1$ under a constraint that encourages better generalization across the treated and control populations. In the Experiments section we apply algorithms based on multi-layer neural nets as representations and hypotheses, along with MMD or Wasserstein distributional distances over the representation layer; see Figure 1 for the basic architecture.

In his foundational text about causality, Pearl (2009) writes: “Whereas in traditional learning tasks we attempt to generalize from one set of instances to another, the causal modeling task is to generalize from behavior under one set of conditions to behavior under another set. Causal models should therefore be chosen by a criterion that challenges their stability against changing conditions…” [emphasis ours]. We believe our work points the way to one such stability criterion, for causal inference in the strongly ignorable case.

Figure 1: Neural network architecture for ITE estimation. $L$ is a loss function, $\mathrm{IPM}_G$ is an integral probability metric. Note that only one of $h_0$ and $h_1$ is updated for each sample during training.

2 Related work

Much recent work in machine learning for causal inference focuses on causal discovery, with the goal of discovering the underlying causal graph or causal direction from data (Hoyer et al., 2009; Maathuis et al., 2010; Triantafillou & Tsamardinos, 2015; Mooij et al., 2016). We focus on the case when the causal graph is simple and known to be of the form $(Y_1, Y_0) \perp t \mid x$, with no hidden confounders.

Under the causal model we assume, the most common goal of causal effect inference as used in the applied sciences is to obtain the average treatment effect: $\mathrm{ATE} = \mathbb{E}_{x \sim p(x)}[\tau(x)]$. We will briefly discuss how some standard statistical causal effect inference methods relate to our proposed method. Note that most of these approaches assume some form of ignorability.

One of the most widely used approaches to estimating ATE is covariate adjustment, also known as back-door adjustment or the G-computation formula (Pearl, 2009; Rubin, 2011). In its basic version, covariate adjustment amounts to estimating the functions $m_1(x)$, $m_0(x)$. Therefore, covariate adjustment methods are the most natural candidates for estimating ITE as well as ATE, using the estimates of $m_t(x)$. However, most previous work on this subject focused on asymptotic consistency (Belloni et al., 2014; Athey et al., 2016; Chernozhukov et al., 2016), and so far there has not been much work on the generalization-error of such a procedure. One way to view our results is that we point out a previously unaccounted for source of variance when using covariate adjustment to estimate ITE. We suggest a new type of regularization, by learning representations with reduced IPM distance between treated and control, enabling a new type of bias-variance trade-off.

Another widely used family of statistical methods in causal effect inference is weighting methods. Methods such as propensity score weighting (Austin, 2011) re-weight the units in the observational data so as to make the treated and control populations more comparable. These methods do not lend themselves immediately to estimating an individual-level effect, and adapting them for that purpose is an interesting research question. Doubly robust methods combine re-weighting the samples and covariate adjustment in clever ways to reduce model bias (Funk et al., 2011). Again, we believe that finding how to adapt the concept of double robustness to the problem of effectively estimating ITE is an interesting open question.

Adapting machine learning methods for causal effect inference, and in particular for individual-level treatment effect, has gained much interest recently. For example, Wager & Athey (2015) and Athey & Imbens (2016) discuss how tree-based methods can be adapted to obtain a consistent estimator with semi-parametric asymptotic convergence rate. Recent work has also looked into how machine learning methods can help detect heterogeneous treatment effects when some data from randomized experiments is available (Taddy et al., 2016; Peysakhovich & Lada, 2016). Neural nets have also been used for this purpose, exemplified in early work by Beck et al. (2000), and more recently by the work of Hartford et al. (2016) on deep instrumental variables. Our work differs from all the above by focusing on the generalization-error aspects of estimating individual treatment effect, as opposed to asymptotic consistency, and by focusing solely on the observational study case, with no randomized components or instrumental variables.

Another line of work in the causal inference community relates to bounding the estimate of the average treatment effect given an instrumental variable (Balke & Pearl, 1997; Bareinboim & Pearl, 2012), or under hidden confounding, for example when the ignorability assumption does not hold (Pearl, 2009; Cai et al., 2008). Our work differs, in that we only deal with the ignorable case, and in that we bound a very different quantity: the generalization-error of estimating individual level treatment effect.

Our work has strong connections with work on domain adaptation. In particular, estimating ITE requires prediction of outcomes over a different distribution from the observed one. Our ITE error upper bound has similarities with generalization bounds in domain adaptation given by Ben-David et al. (2007); Mansour et al. (2009); Ben-David et al. (2010); Cortes & Mohri (2014). These bounds employ distribution distance metrics such as the A-distance or the discrepancy metric, which are related to the IPM distance we use. Our algorithm is similar to a recent algorithm for domain adaptation by Ganin et al. (2016), and in principle other domain adaptation methods (e.g. Daumé III (2009); Pan et al. (2011); Sun et al. (2016)) could be adapted for use in ITE estimation as presented here.

Finally, our paper builds on work by Johansson et al. (2016), where the authors show a connection between covariate shift and the task of estimating the counterfactual outcome in a causal inference scenario. They proposed learning a representation of the data that makes the treated and control distributions more similar, and fitting a linear ridge-regression model on top of it. They then bounded the relative error of fitting a ridge-regression using the distribution with reverse treatment assignment versus fitting a ridge-regression using the factual distribution. Unfortunately, the relative error bound is not at all informative regarding the absolute quality of the representation. In this paper we focus on a related but more substantive task: estimating the individual treatment effect, building on top of the counterfactual error term. We further provide an informative bound on the absolute quality of the representation. We also derive a much more flexible family of algorithms, including non-linear hypotheses and much more powerful distribution metrics in the form of IPMs such as the Wasserstein and MMD distances. Finally, we conduct significantly more thorough experiments including a real-world dataset and out-of-sample performance, and show our methods outperform previously proposed ones.

3 Estimating ITE: Error bounds

In this section we prove a bound on the expected error in estimating the individual treatment effect for a given representation, and a hypothesis defined over that representation. The bound is expressed in terms of (1) the expected loss of the model when learning the observed outcomes $y$ as a function of $x$ and $t$, denoted $\epsilon_F$, with $F$ standing for “Factual”; (2) an Integral Probability Metric (IPM) distance between the distribution of treated and control units. The term $\epsilon_F$ is the classic machine learning generalization-error, and in turn can be upper bounded using the empirical error and model complexity terms, applying standard machine learning theory (Shalev-Shwartz & Ben-David, 2014).

3.1 Problem setup

We will employ the following assumptions and notations. The most important notations are in the Notation box in the supplement. The space of covariates is a bounded subset $\mathcal{X} \subset \mathbb{R}^d$. The outcome space is $\mathcal{Y} \subset \mathbb{R}$. Treatment $t$ is a binary variable. We assume there exists a joint distribution $p(x, t, Y_0, Y_1)$, such that $(Y_1, Y_0) \perp t \mid x$ and $0 < p(t = 1 \mid x) < 1$ for all $x \in \mathcal{X}$ (strong ignorability). The treated and control distributions are the distributions of the features $x$ conditioned on treatment: $p^{t=1}(x) := p(x \mid t = 1)$, and $p^{t=0}(x) := p(x \mid t = 0)$, respectively.

Throughout this paper we will discuss representation functions of the form $\Phi: \mathcal{X} \to \mathcal{R}$, where $\mathcal{R}$ is the representation space. We make the following assumption about $\Phi$:

Assumption 1.

The representation $\Phi$ is a twice-differentiable, one-to-one function. Without loss of generality we will assume that $\mathcal{R}$ is the image of $\mathcal{X}$ under $\Phi$. We then have $\Psi: \mathcal{R} \to \mathcal{X}$ as the inverse of $\Phi$, such that $\Psi(\Phi(x)) = x$ for all $x \in \mathcal{X}$.

The representation $\Phi$ pushes forward the treated and control distributions into the new space $\mathcal{R}$; we denote the induced distribution by $p_\Phi$.

Definition 1.

Define $p_\Phi^{t=1}(r) := p_\Phi(r \mid t = 1)$, $p_\Phi^{t=0}(r) := p_\Phi(r \mid t = 0)$, to be the treated and control distributions induced over $\mathcal{R}$. For a one-to-one $\Phi$, the distributions $p_\Phi^{t=1}$ and $p_\Phi^{t=0}$ can be obtained by the standard change of variables formula, using the determinant of the Jacobian of $\Psi$.

Let $\Phi: \mathcal{X} \to \mathcal{R}$ be a representation function, and $h: \mathcal{R} \times \{0, 1\} \to \mathcal{Y}$ be an hypothesis defined over the representation space $\mathcal{R}$. Let $L: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ be a loss function. We define two complementary loss functions: one is the standard machine learning loss, which we will call the factual loss and denote $\epsilon_F(h, \Phi)$. The other is the expected loss with respect to the distribution where the treatment assignment is flipped, which we call the counterfactual loss, $\epsilon_{CF}(h, \Phi)$.

Definition 2.

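The expected loss for the unit and treatment pair $(x, t)$ is:

$$\ell_{h,\Phi}(x, t) = \int_{\mathcal{Y}} L\big(Y_t, h(\Phi(x), t)\big)\, p(Y_t \mid x)\, dY_t.$$

The expected factual and counterfactual losses of $h$ and $\Phi$ are:

$$\epsilon_F(h, \Phi) = \int_{\mathcal{X} \times \{0,1\}} \ell_{h,\Phi}(x, t)\, p(x, t)\, dx\, dt, \qquad \epsilon_{CF}(h, \Phi) = \int_{\mathcal{X} \times \{0,1\}} \ell_{h,\Phi}(x, t)\, p(x, 1 - t)\, dx\, dt.$$

If $x$ denotes patients’ features, $t$ a treatment, and $Y_t$ a potential outcome such as mortality, we think of $\epsilon_F$ as measuring how well $h$ and $\Phi$ predict mortality for the patients and doctors’ actions sampled from the same distribution as our data sample. $\epsilon_{CF}$ measures how well our prediction with $h$ and $\Phi$ would do in a “topsy-turvy” world where the patients are the same but the doctors are inclined to prescribe exactly the opposite treatment to the one the real-world doctors would prescribe.

Definition 3.

The expected factual treated and control losses are:

$$\epsilon_F^{t=1}(h, \Phi) = \int_{\mathcal{X}} \ell_{h,\Phi}(x, 1)\, p^{t=1}(x)\, dx, \qquad \epsilon_F^{t=0}(h, \Phi) = \int_{\mathcal{X}} \ell_{h,\Phi}(x, 0)\, p^{t=0}(x)\, dx.$$

For $u := p(t = 1)$, it is immediate to show that $\epsilon_F(h, \Phi) = u\, \epsilon_F^{t=1}(h, \Phi) + (1 - u)\, \epsilon_F^{t=0}(h, \Phi)$.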
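As a sketch of why the factual loss splits into a treated part and a control part, the only step needed is the factorization $p(x, t) = p(t)\, p(x \mid t)$. The notation below is assumed for this sketch: $u = p(t = 1)$, $\ell_{h,\Phi}(x, t)$ is the per-unit expected loss of Definition 2, and $p^{t=1}$, $p^{t=0}$ are the treated and control feature distributions.

```latex
\begin{aligned}
\epsilon_F(h,\Phi)
 &= \sum_{t \in \{0,1\}} \int_{\mathcal{X}} \ell_{h,\Phi}(x,t)\, p(x,t)\, dx \\
 &= p(t=1) \int_{\mathcal{X}} \ell_{h,\Phi}(x,1)\, p(x \mid t=1)\, dx
  + p(t=0) \int_{\mathcal{X}} \ell_{h,\Phi}(x,0)\, p(x \mid t=0)\, dx \\
 &= u\, \epsilon_F^{t=1}(h,\Phi) + (1-u)\, \epsilon_F^{t=0}(h,\Phi).
\end{aligned}
```

Applying the same factorization to the flipped assignment $p(x, 1 - t)$ gives $\epsilon_{CF}(h,\Phi) = (1-u) \int_{\mathcal{X}} \ell_{h,\Phi}(x,1)\, p^{t=0}(x)\, dx + u \int_{\mathcal{X}} \ell_{h,\Phi}(x,0)\, p^{t=1}(x)\, dx$: each head is evaluated on the population it was not trained on, which is where a distance between the treated and control distributions enters.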

Definition 4.

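The treatment effect (ITE) for unit $x$ is:

$$\tau(x) := \mathbb{E}[Y_1 - Y_0 \mid x].$$

Let $f: \mathcal{X} \times \{0, 1\} \to \mathcal{Y}$ be an hypothesis. For example, we could have that $f(x, t) = h(\Phi(x), t)$.

Definition 5.

The treatment effect estimate of the hypothesis $f$ for unit $x$ is:

$$\hat{\tau}_f(x) = f(x, 1) - f(x, 0).$$

Definition 6.

The expected Precision in Estimation of Heterogeneous Effect (PEHE, Hill (2011)) loss of $f$ is:

$$\epsilon_{PEHE}(f) = \int_{\mathcal{X}} \big(\hat{\tau}_f(x) - \tau(x)\big)^2\, p(x)\, dx. \tag{1}$$

When $f(x, t) = h(\Phi(x), t)$, we will also use the notation $\epsilon_{PEHE}(h, \Phi) = \epsilon_{PEHE}(f)$.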
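On a finite sample where the true effect is known (as in simulated data), the PEHE loss can be estimated as the mean squared difference between estimated and true per-unit effects. The following is a minimal sketch; the function name `pehe` and the toy arrays are illustrative assumptions, not part of the paper.

```python
import numpy as np

def pehe(tau_hat, tau_true):
    """Empirical PEHE: mean squared error between estimated and true effects."""
    tau_hat = np.asarray(tau_hat, dtype=float)
    tau_true = np.asarray(tau_true, dtype=float)
    return float(np.mean((tau_hat - tau_true) ** 2))

# Hypothetical toy values: true conditional means and model predictions
# for three units under control (0) and treatment (1).
m0 = np.array([1.0, 2.0, 3.0])
m1 = np.array([2.0, 2.5, 5.0])
f0 = np.array([1.0, 2.0, 2.0])
f1 = np.array([2.0, 3.0, 5.0])

tau_true = m1 - m0   # true effects:      [1.0, 0.5, 2.0]
tau_hat = f1 - f0    # estimated effects: [1.0, 1.0, 3.0]
print(pehe(tau_hat, tau_true))  # (0 + 0.25 + 1) / 3
```

Note that a model can have a small error on the average of `tau_hat` while still having a large PEHE, since PEHE penalizes per-unit deviations.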

Our proof relies on the notion of an Integral Probability Metric (IPM), which is a class of metrics between probability distributions (Sriperumbudur et al., 2012; Müller, 1997). For two probability density functions $p$, $q$ defined over $\mathcal{S} \subseteq \mathbb{R}^d$, and for a family $G$ of functions $g: \mathcal{S} \to \mathbb{R}$, we have that

$$\mathrm{IPM}_G(p, q) := \sup_{g \in G} \left| \int_{\mathcal{S}} g(s)\, \big(p(s) - q(s)\big)\, ds \right|.$$

Integral probability metrics are always symmetric and obey the triangle inequality, and trivially satisfy $\mathrm{IPM}_G(p, p) = 0$. For rich enough function families $G$, we also have that $\mathrm{IPM}_G(p, q) = 0 \Rightarrow p = q$, and then $\mathrm{IPM}_G$ is a true metric over the corresponding set of probabilities. Examples of function families $G$ for which $\mathrm{IPM}_G$ is a true metric are the family of bounded continuous functions, the family of $1$-Lipschitz functions (Sriperumbudur et al., 2012), and the unit-ball of functions in a universal reproducing kernel Hilbert space (Gretton et al., 2012).
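To make the IPM concrete, the sketch below estimates the squared MMD between two samples with a Gaussian (RBF) kernel, using the simple biased estimator that averages kernel values within and across the samples. The kernel choice, the bandwidth `sigma`, and the function names are assumptions made for illustration; the IPM itself corresponds to the square root of this quantity.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = (np.sum(a ** 2, axis=1)[:, None]
          + np.sum(b ** 2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def mmd2_biased(x, y, sigma=1.0):
    """Biased estimate of the squared MMD between samples x and y."""
    return float(rbf_kernel(x, x, sigma).mean()
                 + rbf_kernel(y, y, sigma).mean()
                 - 2.0 * rbf_kernel(x, y, sigma).mean())

rng = np.random.default_rng(0)
control = rng.standard_normal((200, 2))
treated_same = rng.standard_normal((200, 2))   # same distribution as control
treated_shifted = control + 3.0                # clearly different distribution

print(mmd2_biased(control, treated_same))      # close to zero
print(mmd2_biased(control, treated_shifted))   # much larger
```

The estimate is symmetric in its arguments and vanishes when both samples are identical, mirroring the IPM properties stated above.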

Definition 7.

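Recall that $m_t(x) = \mathbb{E}[Y_t \mid x]$. The expected variance of $Y_t$ with respect to a distribution $p(x, t)$ is:

$$\sigma^2_{Y_t}\big(p(x, t)\big) = \int_{\mathcal{X} \times \mathcal{Y}} \big(Y_t - m_t(x)\big)^2\, p(Y_t \mid x)\, p(x, t)\, dY_t\, dx.$$

We define:

$$\sigma^2_{Y_t} = \min\big\{\sigma^2_{Y_t}(p(x, t)),\ \sigma^2_{Y_t}(p(x, 1 - t))\big\}, \qquad \sigma^2_Y = \min\big\{\sigma^2_{Y_0},\ \sigma^2_{Y_1}\big\}.$$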

3.2 Bounds

We first state a Lemma bounding the counterfactual loss, a key step in obtaining the bound on the error in estimating individual treatment effect. We then give the main Theorem. The proofs and details are in the supplement.

Let $u := p(t = 1)$ be the marginal probability of treatment. By the strong ignorability assumption, $0 < u < 1$.

Lemma 1.

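Let $\Phi: \mathcal{X} \to \mathcal{R}$ be a one-to-one representation function, with inverse $\Psi$. Let $h: \mathcal{R} \times \{0, 1\} \to \mathcal{Y}$ be an hypothesis. Let $G$ be a family of functions $g: \mathcal{R} \to \mathcal{Y}$. Assume there exists a constant $B_\Phi > 0$, such that for fixed $t \in \{0, 1\}$, the per-unit expected loss functions $\ell_{h,\Phi}(\Psi(r), t)$ (Definition 2) obey $\frac{1}{B_\Phi} \cdot \ell_{h,\Phi}(\Psi(r), t) \in G$. We have:

$$\epsilon_{CF}(h, \Phi) \le (1 - u)\, \epsilon_F^{t=1}(h, \Phi) + u\, \epsilon_F^{t=0}(h, \Phi) + B_\Phi \cdot \mathrm{IPM}_G\big(p_\Phi^{t=1}, p_\Phi^{t=0}\big),$$

where $\epsilon_{CF}$, $\epsilon_F^{t=0}$ and $\epsilon_F^{t=1}$ are as in Definitions 2 and 3.

Theorem 1.

Under the conditions of Lemma 1, and assuming the loss $L$ used to define $\ell_{h,\Phi}$ in Definitions 2 and 3 is the squared loss, we have:

$$\epsilon_{PEHE}(h, \Phi) \le 2\big(\epsilon_{CF}(h, \Phi) + \epsilon_F(h, \Phi) - 2\sigma^2_Y\big) \le 2\big(\epsilon_F^{t=0}(h, \Phi) + \epsilon_F^{t=1}(h, \Phi) + B_\Phi\, \mathrm{IPM}_G\big(p_\Phi^{t=1}, p_\Phi^{t=0}\big) - 2\sigma^2_Y\big), \tag{2}$$

where $\epsilon_F$ and $\epsilon_{CF}$ are defined w.r.t. the squared loss.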

The main idea of the proof is showing that $\epsilon_{PEHE}$ is upper bounded by the sum of the expected factual loss $\epsilon_F$ and expected counterfactual loss $\epsilon_{CF}$. However, we cannot estimate $\epsilon_{CF}$, since we only have samples relevant to $\epsilon_F$. We therefore bound the difference $\epsilon_{CF} - \epsilon_F$ using an IPM.

Choosing a small function family $G$ will make the bound tighter. However, choosing too small a family could result in an incomputable bound. For example, for the minimal choice $G = \{\ell_{h,\Phi}(x, 0), \ell_{h,\Phi}(x, 1)\}$, we will have to evaluate an expectation term of $Y_1$ over $p^{t=0}$, and of $Y_0$ over $p^{t=1}$. We cannot in general evaluate these expectations, since by assumption when $t = 0$ we only observe $Y_0$, and the same for $t = 1$ and $Y_1$. In addition, for some function families there is no known way to efficiently compute the IPM distance or its gradients. In this paper we use two function families for which there are available optimization tools. The first is the family of $1$-Lipschitz functions, which leads to the IPM being the Wasserstein distance (Villani, 2008; Sriperumbudur et al., 2012), denoted $\mathrm{Wass}(\cdot, \cdot)$. The second is the family of norm-$1$ reproducing kernel Hilbert space (RKHS) functions, leading to the MMD metric (Gretton et al., 2012; Sriperumbudur et al., 2012), denoted $\mathrm{MMD}(\cdot, \cdot)$. Both the Wasserstein and MMD metrics have consistent estimators which can be efficiently computed in the finite sample case (Sriperumbudur et al., 2012). Both have been used for various machine learning tasks in recent years (Gretton et al., 2009, 2012; Cuturi & Doucet, 2014).

In order to explicitly evaluate the constant $B_\Phi$ in Theorem 1, we have to make some assumptions about the elements of the problem. For the Wasserstein case these are the loss $L$, the Lipschitz constants of $p(Y_t \mid x)$ and $h$, and the condition number of the Jacobian of $\Phi$. For the MMD case, we make assumptions about the RKHS representability and RKHS norms of $h$, $\Phi$, and the standard deviation of $Y_t \mid x$. The full details are given in the supplement, with the major results stated in Theorems 2 and 3. In all cases we obtain that making $\Phi$ smaller increases the constant $B_\Phi$, precluding trivial solutions such as making $\Phi$ arbitrarily small.

For an empirical sample, and a family of representations and hypotheses, we can further upper bound $\epsilon_F^{t=0}$ and $\epsilon_F^{t=1}$ by their respective empirical losses and a model complexity term using standard arguments (Shalev-Shwartz & Ben-David, 2014). The IPMs we use can be consistently estimated from finite samples (Sriperumbudur et al., 2012). The negative variance term $\sigma^2_Y$ arises from the fact that, following Hill (2011) and Athey & Imbens (2016), we define the error $\epsilon_{PEHE}$ in terms of the conditional mean functions $m_t(x)$, as opposed to fitting the random variables $Y_t$.

Our results hold for any given $h$ and $\Phi$ obeying the Theorem conditions. This immediately suggests an algorithm in which we minimize the upper bound in Eq. (2) with respect to $\Phi$ and $h$ and either the Wasserstein or MMD IPM, in order to minimize the error in estimating the individual treatment effect. This leads us to Algorithm 1 below.

4 Algorithm for estimating ITE

We propose a general framework called CFR (for Counterfactual Regression) for ITE estimation based on the theoretical results above. Our algorithm is an end-to-end, regularized minimization procedure which simultaneously fits both a balanced representation of the data and a hypothesis for the outcome. CFR draws on the same intuition as the approach proposed by Johansson et al. (2016), but overcomes the following limitations of their method: a) Their theory requires a two-step optimization procedure and is specific to linear hypotheses of the learned representation (and does not support e.g. deep neural networks), b) The treatment indicator might get lost if the learned representation is high-dimensional (see discussion below).

We assume there exists a distribution $p(x, t, Y_0, Y_1)$ over $\mathcal{X} \times \{0, 1\} \times \mathcal{Y} \times \mathcal{Y}$, such that strong ignorability holds. We further assume we have a sample from that distribution $(x_1, t_1, y_1), \ldots, (x_n, t_n, y_n)$, where $y_i \sim p(Y_1 \mid x_i)$ if $t_i = 1$, and $y_i \sim p(Y_0 \mid x_i)$ if $t_i = 0$. This standard assumption means that the treatment assignment determines which potential outcome we see. Our goal is to find a representation $\Phi: \mathcal{X} \to \mathcal{R}$ and hypothesis $h: \mathcal{R} \times \{0, 1\} \to \mathcal{Y}$ that will minimize $\epsilon_{PEHE}(f)$ for $f(x, t) := h(\Phi(x), t)$.

In this work, we let $\Phi(x)$ and $h(\Phi, t)$ be parameterized by deep neural networks trained jointly in an end-to-end fashion, see Figure 1. This model allows for learning complex non-linear representations and hypotheses with large flexibility. Johansson et al. (2016) parameterized $h(\Phi, t)$ with a single network using the concatenation of $\Phi$ and $t$ as input. When the dimension of $\Phi$ is high, this risks losing the influence of $t$ on $h$ during training. To combat this, our first contribution is to parameterize $h_1(\Phi)$ and $h_0(\Phi)$ as two separate “heads” of the joint network, the former used to estimate the outcome under treatment, and the latter under control. This means that statistical power is shared in the representation layers of the network, while the effect of treatment is retained in the separate heads. Note that each sample is used to update only the head corresponding to the observed treatment; for example, an observation $(x_i, t_i = 1, y_i)$ is only used to update $h_1$.
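The two-head design can be illustrated with a small forward pass: a shared representation feeds two separate outcome heads, the factual prediction is read from the head matching the observed treatment, and the estimated effect is the difference between the heads. This is a minimal sketch with a single ELU layer and linear heads; the parameter names and sizes are assumptions for illustration and do not reproduce the architecture used in the experiments.

```python
import numpy as np

def elu(z):
    """Exponential-linear unit, computed without overflow for large z."""
    return np.where(z > 0, z, np.exp(np.minimum(z, 0.0)) - 1.0)

def two_head_predict(x, params):
    """Return predictions of the control head (y0) and treated head (y1)."""
    phi = elu(x @ params["W_phi"] + params["b_phi"])   # shared representation
    y0 = phi @ params["w_h0"] + params["b_h0"]         # control head h_0
    y1 = phi @ params["w_h1"] + params["b_h1"]         # treated head h_1
    return y0, y1

rng = np.random.default_rng(0)
params = {
    "W_phi": rng.standard_normal((4, 3)), "b_phi": np.zeros(3),
    "w_h0": rng.standard_normal(3), "b_h0": 0.0,
    "w_h1": rng.standard_normal(3), "b_h1": 0.0,
}
x = rng.standard_normal((5, 4))
t = np.array([1, 0, 1, 0, 0])

y0, y1 = two_head_predict(x, params)
factual = np.where(t == 1, y1, y0)   # only this enters the training loss
tau_hat = y1 - y0                    # estimated effect for every unit
```

Because the loss only sees `factual`, the gradient for a treated unit is zero with respect to the control head's parameters, and vice versa, while both kinds of unit update the shared representation.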

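Our second contribution is to explicitly account and adjust for the bias induced by treatment group imbalance. To this end, we seek a representation and hypothesis that minimizes a trade-off between predictive accuracy and imbalance in the representation space, using the following objective:

$$\min_{h, \Phi,\ \|\Phi\| = 1}\ \frac{1}{n} \sum_{i=1}^{n} w_i \cdot L\big(h(\Phi(x_i), t_i), y_i\big) + \lambda \cdot \mathfrak{R}(h) + \alpha \cdot \mathrm{IPM}_G\big(\{\Phi(x_i)\}_{i: t_i = 0},\ \{\Phi(x_i)\}_{i: t_i = 1}\big), \tag{3}$$

with $w_i = \frac{t_i}{2u} + \frac{1 - t_i}{2(1 - u)}$, where $u = \frac{1}{n} \sum_{i=1}^{n} t_i$, and $\mathfrak{R}$ is a model complexity term.

Note that $u = p(t = 1)$ in the definition of $w_i$ is simply the proportion of treated units in the population. The weights $w_i$ compensate for the difference in treatment group size in our sample, see Theorem 1. $\mathrm{IPM}_G(\cdot, \cdot)$ is the (empirical) integral probability metric defined by the function family $G$. For most IPMs, we cannot compute the factor $B_\Phi$ in Equation 2, but treat it as part of the hyperparameter $\alpha$. This makes our objective sensitive to the scaling of $\Phi$, even for a constant $\alpha$. We therefore normalize $\Phi$ through either projection or batch-normalization with fixed scale. We refer to the model minimizing (3) with $\alpha > 0$ as Counterfactual Regression (CFR) and the variant without balance regularization ($\alpha = 0$) as Treatment-Agnostic Representation Network (TARNet).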
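The sketch below evaluates a simplified version of this kind of objective on a toy sample: a weighted squared factual loss plus a balance penalty. It rests on several assumptions made for illustration: the weights are taken to be $t_i/(2u) + (1-t_i)/(2(1-u))$ with $u$ the treated fraction, the IPM is replaced by the squared distance between group means of the representation (the squared MMD under a linear kernel), and the model complexity term is omitted.

```python
import numpy as np

def cfr_weights(t):
    """Per-sample weights that give the two treatment groups equal total weight."""
    t = np.asarray(t, dtype=float)
    u = t.mean()                                   # proportion of treated units
    return t / (2.0 * u) + (1.0 - t) / (2.0 * (1.0 - u))

def linear_mmd2(phi, t):
    """Squared distance between treated and control means of the representation."""
    diff = phi[t == 1].mean(axis=0) - phi[t == 0].mean(axis=0)
    return float(diff @ diff)

def cfr_objective(y_pred, y, phi, t, alpha):
    """Weighted squared factual loss plus alpha times the balance penalty."""
    factual = np.mean(cfr_weights(t) * (y_pred - y) ** 2)
    return float(factual + alpha * linear_mmd2(phi, t))

# Hypothetical toy sample: one treated unit and three controls.
t = np.array([1, 0, 0, 0])
phi = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 2.0]])
y = np.array([1.0, 0.0, 0.0, 0.0])
y_pred = np.zeros(4)

print(cfr_weights(t))                                  # [2.0, 0.667, 0.667, 0.667]
print(cfr_objective(y_pred, y, phi, t, alpha=1.0))     # 0.5 + 2.0 = 2.5
```

With these weights the treated group and the control group each contribute total weight $n/2$, regardless of how unequal the group sizes are. Setting `alpha=0.0` recovers the purely predictive objective without a balance term.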

We train our models by minimizing (3) using stochastic gradient descent, where we backpropagate the error through both the hypothesis and representation networks, as described in Algorithm 1. Both the prediction loss and the penalty term are computed for one mini-batch at a time. Details of how to obtain the gradient with respect to the empirical IPMs are in the supplement.

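Algorithm 1 CFR: Counterfactual regression with integral probability metrics

1: Input: Factual sample $(x_1, t_1, y_1), \ldots, (x_n, t_n, y_n)$, scaling parameter $\alpha > 0$, loss function $L(\cdot, \cdot)$, representation network $\Phi_W$ with initial weights $W$, outcome network $h_V$ with initial weights $V$, function family $G$ for IPM.
2: Compute $u = \frac{1}{n} \sum_{i=1}^{n} t_i$
3: Compute $w_i = \frac{t_i}{2u} + \frac{1 - t_i}{2(1 - u)}$ for $i = 1, \ldots, n$
4: while not converged do
5:    Sample mini-batch $\{i_1, i_2, \ldots, i_m\} \subset \{1, 2, \ldots, n\}$
6:    Calculate the gradient of the IPM term: $g_1 = \nabla_W\, \mathrm{IPM}_G\big(\{\Phi_W(x_{i_j})\}_{t_{i_j} = 0},\ \{\Phi_W(x_{i_j})\}_{t_{i_j} = 1}\big)$
7:    Calculate the gradients of the empirical loss: $g_2 = \nabla_V \frac{1}{m} \sum_j w_{i_j} L\big(h_V(\Phi_W(x_{i_j}), t_{i_j}), y_{i_j}\big)$, $g_3 = \nabla_W \frac{1}{m} \sum_j w_{i_j} L\big(h_V(\Phi_W(x_{i_j}), t_{i_j}), y_{i_j}\big)$
8:    Obtain step size scalar or matrix $\eta$ with standard neural net methods e.g. Adam (Kingma & Ba, 2014)
9:    $[W, V] \leftarrow [W - \eta(\alpha g_1 + g_3),\ V - \eta(g_2 + 2\lambda V)]$
10:   Check convergence criterion
11: end while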
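The overall shape of such a training procedure can be sketched on a deliberately simplified model. Everything below is an assumption made to keep the example short and self-contained: the representation and both heads are linear, the IPM is replaced by the squared distance between group means of the representation, plain full-batch gradient descent stands in for mini-batch Adam, and the representation normalization and weight decay are omitted. It illustrates the structure of the updates (re-weighted factual loss plus a balance penalty that only affects the representation weights), not the neural-network implementation used in the experiments.

```python
import numpy as np

def cfr_linear_fit(x, t, y, alpha=1.0, lr=0.02, steps=500, k=3, seed=0):
    """Full-batch gradient descent on a linear CFR-style objective.

    Returns the fitted parameters and the objective value at each step.
    """
    rng = np.random.default_rng(seed)
    n, d = x.shape
    W = 0.1 * rng.standard_normal((d, k))   # representation weights
    V = 0.1 * rng.standard_normal((2, k))   # V[0] is head h_0, V[1] is head h_1
    b = np.zeros(2)                         # one bias per head
    u = t.mean()                            # treated fraction
    w = t / (2 * u) + (1 - t) / (2 * (1 - u))
    delta = x[t == 1].mean(axis=0) - x[t == 0].mean(axis=0)
    history = []
    for _ in range(steps):
        phi = x @ W
        pred = np.sum(phi * V[t], axis=1) + b[t]   # factual head only
        gap = delta @ W                            # group-mean gap of phi
        history.append(np.mean(w * (pred - y) ** 2) + alpha * gap @ gap)
        g = 2.0 * w * (pred - y) / n               # d(loss) / d(pred)
        # Representation gradient: factual loss term plus balance penalty.
        grad_W = x.T @ (g[:, None] * V[t]) + 2.0 * alpha * np.outer(delta, gap)
        # Each head only receives gradient from units with its treatment.
        grad_V = np.stack([phi.T @ (g * (t == 0)), phi.T @ (g * (t == 1))])
        grad_b = np.array([g[t == 0].sum(), g[t == 1].sum()])
        W, V, b = W - lr * grad_W, V - lr * grad_V, b - lr * grad_b
    return W, V, b, history

# Hypothetical confounded toy data: treatment is more likely when x[:, 0] is large.
rng = np.random.default_rng(1)
x = rng.standard_normal((200, 5))
t = (rng.random(200) < 1 / (1 + np.exp(-x[:, 0]))).astype(int)
y = x @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + 1.0 * t + 0.1 * rng.standard_normal(200)

W, V, b, history = cfr_linear_fit(x, t, y)
tau_hat = (x @ W) @ (V[1] - V[0]) + (b[1] - b[0])   # estimated effect per unit
print(history[0], history[-1])
```

Without a normalization of the representation, a linear model can shrink the balance penalty simply by scaling `W` down, which is one reason the objective in the text constrains the scale of the representation.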

5 Experiments

Evaluating causal inference algorithms is more difficult than many machine learning tasks, since for real-world data we rarely have access to the ground truth treatment effect. Existing literature mostly deals with this in two ways. One is by using synthetic or semi-synthetic datasets, where the outcome or treatment assignment are fully known; we use the semi-synthetic IHDP dataset from Hill (2011). The other is using real-world data from randomized controlled trials (RCT). The problem in using data from RCTs is that there is no imbalance between the treated and control distributions, making our method redundant. We partially overcome this problem by using the Jobs dataset from LaLonde (1986), which includes both a randomized and a non-randomized component. We use both for training, but can only use the randomized component for evaluation. This alleviates, but does not solve, the issue of a completely balanced dataset being unsuited for our method.

We evaluate our framework CFR, and its variant without balancing regularization (TARNet), in the task of estimating ITE and ATE. CFR is implemented as a feed-forward neural network with 3 fully-connected exponential-linear layers for the representation and 3 for the hypothesis. Layer sizes were 200 for all layers used for Jobs and 200 and 100 for the representation and hypothesis used for IHDP. The model is trained using Adam (Kingma & Ba, 2014). For an overview, see Figure 1. Layers corresponding to the hypothesis are regularized with a small weight decay. For continuous data we use mean squared loss and for binary data, we use log-loss. While our theory does not immediately apply to log-loss, we were curious to see how our model performs with it.

We compare our method to Ordinary Least Squares with treatment as a feature (OLS-1), OLS with separate regressors for each treatment (OLS-2), $k$-nearest neighbor ($k$-NN), Targeted Maximum Likelihood, which is a doubly robust method (TMLE) (Gruber & van der Laan, 2011), Bayesian Additive Regression Trees (BART) (Chipman et al., 2010; Chipman & McCulloch, 2016), Random Forests (Rand. For.) (Breiman, 2001), Causal Forests (Caus. For.) (Wager & Athey, 2015), as well as the Balancing Linear Regression (BLR) and Balancing Neural Network (BNN) by Johansson et al. (2016). For classification tasks we substitute Logistic Regression (LR) for OLS. Choosing hyperparameters for estimating PEHE is non-trivial; we detail our selection procedure, applied to all methods, in subsection C.1 of the supplement.

We evaluate our model in two different settings. One is within-sample, where the task is to estimate ITE for all units in a sample for which the (factual) outcome of one treatment is observed. This corresponds to the common scenario in which a cohort is selected once and not changed. This task is non-trivial, as we never observe the ITE for any unit. The other is the out-of-sample setting, where the goal is to estimate ITE for units with no observed outcomes. This corresponds to the case where a new patient arrives and the goal is to select the best possible treatment. Within-sample error is computed over both the training and validation sets, and out-of-sample error over the test set.

5.1 Simulated outcome: IHDP

Hill (2011) compiled a dataset for causal effect estimation based on the Infant Health and Development Program (IHDP), in which the covariates come from a randomized experiment studying the effects of specialist home visits on future cognitive test scores. The treatment groups have been made imbalanced by removing a biased subset of the treated population. The dataset comprises 747 units (139 treated, 608 control) and 25 covariates measuring aspects of children and their mothers. We use the simulated outcome implemented as setting “A” in the NPCI package (Dorie, 2016). Following Hill (2011), we use the noiseless outcome to compute the true effect. We report the estimated (finite-sample) PEHE loss $\epsilon_{PEHE}$ (Eq. 1), and the absolute error in average treatment effect $\epsilon_{ATE} = \left| \frac{1}{n} \sum_{i=1}^{n} \big(f(x_i, 1) - f(x_i, 0)\big) - \frac{1}{n} \sum_{i=1}^{n} \big(m_1(x_i) - m_0(x_i)\big) \right|$. The results of the experiments on IHDP are presented in Table 1 (left). We average over 1000 realizations of the outcomes with 63/27/10 train/validation/test splits.

Table 1: Results on IHDP (left) and Jobs (right). MMD is squared linear MMD. Lower is better.

Figure 2: Out-of-sample ITE error versus IPM regularization for CFR Wass, relative to the error at $\alpha = 0$, on 500 realizations of IHDP, with high ($q = 1$), medium and low (artificial) imbalance between control and treated.

We investigate the effects of increasing imbalance between the original treatment groups by constructing biased subsamples of the IHDP dataset. A logistic-regression propensity score model is fit to form estimates $\hat{p}(t = 1 \mid x)$ of the conditional treatment probability. Then, repeatedly, with probability $q$ we remove the remaining control observation $x$ that has $\hat{p}(t = 1 \mid x)$ closest to $1$, and with probability $1 - q$, we remove a random control observation. The higher $q$, the more imbalance. For each value of $q$, we remove 347 observations from each set, leaving 400.
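The removal procedure can be sketched as follows. The sketch takes already-fitted propensity estimates as input rather than fitting the logistic regression itself, and it assumes that the biased step removes the remaining control unit whose estimated treatment probability is closest to 1 (the control that looks most like a treated unit). The function name, arguments, and toy values are hypothetical.

```python
import numpy as np

def biased_control_removal(p_hat, t, q, k, seed=0):
    """Remove k control units and return the indices of the units that remain.

    With probability q the control with the highest estimated propensity is
    removed; otherwise a uniformly random remaining control is removed.
    """
    rng = np.random.default_rng(seed)
    p_hat = np.asarray(p_hat, dtype=float)
    t = np.asarray(t)
    keep = np.ones(len(t), dtype=bool)
    for _ in range(k):
        controls = np.flatnonzero(keep & (t == 0))
        if controls.size == 0:
            break                                   # nothing left to remove
        if rng.random() < q:
            # For probabilities in [0, 1], closest to 1 means the largest value.
            drop = controls[np.argmax(p_hat[controls])]
        else:
            drop = rng.choice(controls)
        keep[drop] = False
    return np.flatnonzero(keep)

# Hypothetical propensity estimates for four controls and one treated unit.
p_hat = np.array([0.9, 0.1, 0.8, 0.5, 0.7])
t = np.array([0, 0, 0, 1, 0])
kept = biased_control_removal(p_hat, t, q=1.0, k=2)
print(kept)   # controls 0 and 2 are removed; treated unit 3 is never touched
```

With `q=0.0` the procedure reduces to uniform subsampling of the controls, which shrinks the dataset without adding imbalance.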

Figure 3: Policy risk on Jobs as a function of treatment inclusion rate. Lower is better. Subjects are included in treatment in order of their estimated treatment effect given by the various methods. CFR Wass is similar to CFR and is omitted to avoid clutter.

5.2 Real-world outcome: Jobs

The study by LaLonde (1986) is a widely used benchmark in the causal inference community, where the treatment is job training and the outcomes are income and employment status after training. This dataset combines a randomized study based on the National Supported Work program with observational data to form a larger dataset (Smith & Todd, 2005). The presence of the randomized subgroup gives a way to estimate the “ground truth” causal effect. The study includes 8 covariates such as age and education, as well as previous earnings. We construct a binary classification task, called Jobs, where the goal is to predict unemployment, using the feature set of Dehejia & Wahba (2002). Following Smith & Todd (2005), we use the LaLonde experimental sample (297 treated, 425 control) and the PSID comparison group (2490 control). There were 482 (15%) subjects unemployed by the end of the study. We average over 10 train/validation/test splits with ratios 56/24/20.

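Because all the treated subjects $T$ were part of the original randomized sample $E$, we can compute the true average treatment effect on the treated by $\text{ATT} = |T|^{-1}\sum_{i\in T} y_i - |C\cap E|^{-1}\sum_{i\in C\cap E} y_i$, where $C$ is the control group. We report the error $\epsilon_{\text{ATT}} = \big|\text{ATT} - |T|^{-1}\sum_{i\in T}\big(f(x_i,1)-f(x_i,0)\big)\big|$. We cannot evaluate $\epsilon_{\text{PEHE}}$ on this dataset, since there is no ground truth for the ITE. Instead, in order to evaluate the quality of ITE estimation, we use a measure we call policy risk. The policy risk is defined as the average loss in value when treating according to the policy implied by an ITE estimator. In our case, for a model $f$, we let the policy be to treat, $\pi_f(x)=1$, if $f(x,1)-f(x,0) > \lambda$, and to not treat, $\pi_f(x)=0$, otherwise. The policy risk is $R_{\text{Pol}}(\pi_f) = 1 - \big(\mathbb{E}[Y_1 \mid \pi_f(x)=1]\cdot p(\pi_f=1) + \mathbb{E}[Y_0 \mid \pi_f(x)=0]\cdot p(\pi_f=0)\big)$, which we can estimate for the randomized trial subset of Jobs by $\hat{R}_{\text{Pol}}(\pi_f) = 1 - \big(\mathbb{E}[Y_1 \mid \pi_f(x)=1, t=1]\cdot p(\pi_f=1) + \mathbb{E}[Y_0 \mid \pi_f(x)=0, t=0]\cdot p(\pi_f=0)\big)$. See Figure 3 for risk as a function of treatment threshold $\lambda$, aligned by proportion of treated, and Table 1 for the risk when $\lambda = 0$.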
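The following sketch shows one way the policy risk and the ATT error could be estimated on a randomized subset. It is an illustration under stated assumptions rather than the authors' evaluation code: `effect_hat` stands for the model's estimated effect $f(x,1)-f(x,0)$, `y` is an observed outcome in $[0,1]$ where larger is better, and the function names and the choice to score an empty group as 0 are ours.

```python
import numpy as np

def policy_risk(effect_hat, t, y, threshold=0.0):
    """Estimate policy risk on randomized data for the policy 'treat if effect_hat > threshold'."""
    effect_hat = np.asarray(effect_hat, dtype=float)
    t = np.asarray(t).astype(bool)
    y = np.asarray(y, dtype=float)
    treat = effect_hat > threshold          # policy recommendation per unit
    p_treat = treat.mean()                  # fraction the policy would treat
    followed_treat = treat & t              # recommended treatment and received it
    followed_control = ~treat & ~t          # recommended control and received it
    # Assumption: an empty group contributes value 0.
    value_treat = y[followed_treat].mean() if followed_treat.any() else 0.0
    value_control = y[followed_control].mean() if followed_control.any() else 0.0
    return float(1.0 - (value_treat * p_treat + value_control * (1.0 - p_treat)))

def att_error(effect_hat, t, y, experimental):
    """Absolute error of the estimated ATT against the experimental difference in means."""
    effect_hat = np.asarray(effect_hat, dtype=float)
    t = np.asarray(t).astype(bool)
    y = np.asarray(y, dtype=float)
    experimental = np.asarray(experimental).astype(bool)
    att_true = y[t].mean() - y[~t & experimental].mean()
    return float(abs(att_true - effect_hat[t].mean()))

# Toy example (made-up data): four randomized units.
t = [1, 0, 1, 0]
y = [1, 0, 0, 1]
print(policy_risk([0.5, 0.2, -0.1, -0.3], t, y))
print(att_error([0.5, 0.2, -0.1, -0.3], t, y, experimental=[1, 1, 1, 1]))
```

Only units whose randomized assignment agrees with the policy recommendation contribute to the value estimate, which is why the estimate is restricted to the randomized component of the data.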

5.3 Results

We begin by noting that imbalance indeed confers an advantage to using the IPM regularization term, as our theoretical results indicate; see e.g. the results for CFR Wass ($\alpha > 0$) and TARNet ($\alpha = 0$) on IHDP in Table 1. We also see in Figure 2 that even for the harder case of increased imbalance ($q = 1$) between treated and control, the relative gain from using our method remains significant. On Jobs, we see a smaller gain from using IPM penalties than on IHDP. We believe this is the case because, while we are minimizing our bound over observational data and accounting for this bias, we are evaluating the predictions only on a randomized subset, where the treatment groups are distributed identically. For both IHDP and Jobs, non-linear estimators do significantly better than linear ones in terms of individual effect ($\epsilon_{\text{PEHE}}$, $R_{\text{Pol}}$). On the Jobs dataset, straightforward logistic regression does remarkably well in estimating the ATT. However, being a linear model, LR can only ascribe a uniform policy, in this case “treat everyone”. The more nuanced policies offered by non-linear methods achieve lower policy risk in the case of Causal Forests and CFR. This emphasizes the fact that estimating average effect and individual effect can require different models. Specifically, while smoothing over many units may yield a good ATE estimate, this might significantly hurt ITE estimation. $k$-nearest neighbors has very good within-sample results on Jobs, because evaluation is performed over the randomized component, but suffers heavily in generalizing out of sample, as expected.

6 Conclusion

In this paper we give a meaningful and intuitive error bound for the problem of estimating individual treatment effect. Our bound relates ITE estimation to the classic machine learning problem of learning from finite samples, along with methods for measuring distributional distances from finite samples. The bound lends itself naturally to the creation of learning algorithms; we focus on using neural nets as representations and hypotheses. We apply our theory-guided approach to both synthetic and real-world tasks, showing that in every case our method matches or outperforms the state-of-the-art. Important open questions are theoretical considerations in choosing the IPM weight $\alpha$, how to best derive confidence intervals for our model’s predictions, and how to integrate our work with more complicated causal models such as those with hidden confounding or instrumental variables.


Acknowledgments

We wish to thank Aahlad Manas for his assistance with the experiments. We also thank Jennifer Hill, Marco Cuturi, Esteban Tabak and Sanjong Misra for fruitful conversations, and Stefan Wager for his help with the code for Causal Forests. DS and US were supported by NSF CAREER award #1350965.


References

  • (1) MathOverflow: functions with orthogonal Jacobian. Accessed: 2016-05-05.
  • Athey & Imbens (2016) Athey, Susan and Imbens, Guido. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016.
  • Athey et al. (2016) Athey, Susan, Imbens, Guido W, and Wager, Stefan. Efficient inference of average treatment effects in high dimensions via approximate residual balancing. arXiv preprint arXiv:1604.07125, 2016.
  • Aude et al. (2016) Aude, Genevay, Cuturi, Marco, Peyré, Gabriel, and Bach, Francis. Stochastic optimization for large-scale optimal transport. arXiv preprint arXiv:1605.08527, 2016.
  • Austin (2011) Austin, Peter C. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate behavioral research, 46(3):399–424, 2011.
  • Balke & Pearl (1997) Balke, Alexander and Pearl, Judea. Bounds on treatment effects from studies with imperfect compliance. Journal of the American Statistical Association, 92(439):1171–1176, 1997.
  • Bareinboim & Pearl (2012) Bareinboim, Elias and Pearl, Judea. Controlling selection bias in causal inference. In AISTATS, pp. 100–108, 2012.
  • Bareinboim & Pearl (2016) Bareinboim, Elias and Pearl, Judea. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113(27):7345–7352, 2016.
  • Beck et al. (2000) Beck, Nathaniel, King, Gary, and Zeng, Langche. Improving quantitative studies of international conflict: A conjecture. American Political Science Review, 94(01):21–35, 2000.
  • Belloni et al. (2014) Belloni, Alexandre, Chernozhukov, Victor, and Hansen, Christian. Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2):608–650, 2014.
  • Ben-David et al. (2007) Ben-David, Shai, Blitzer, John, Crammer, Koby, Pereira, Fernando, et al. Analysis of representations for domain adaptation. Advances in neural information processing systems, 19:137, 2007.
  • Ben-David et al. (2010) Ben-David, Shai, Blitzer, John, Crammer, Koby, Kulesza, Alex, Pereira, Fernando, and Vaughan, Jennifer Wortman. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.
  • Ben-Israel (1999) Ben-Israel, Adi. The change-of-variables formula using matrix volume. SIAM Journal on Matrix Analysis and Applications, 21(1):300–312, 1999.
  • Bengio et al. (2013) Bengio, Yoshua, Courville, Aaron, and Vincent, Pierre. Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):1798–1828, 2013.
  • Breiman (2001) Breiman, Leo. Random forests. Machine learning, 45(1):5–32, 2001.
  • Cai et al. (2008) Cai, Zhihong, Kuroki, Manabu, Pearl, Judea, and Tian, Jin. Bounds on direct effects in the presence of confounded intermediate variables. Biometrics, 64(3):695–701, 2008.
  • Chernozhukov et al. (2016) Chernozhukov, Victor, Chetverikov, Denis, Demirer, Mert, Duflo, Esther, Hansen, Christian, et al. Double machine learning for treatment and causal parameters. arXiv preprint arXiv:1608.00060, 2016.
  • Chipman & McCulloch (2016) Chipman, Hugh and McCulloch, Robert. BayesTree: Bayesian Additive Regression Trees, 2016.
  • Chipman et al. (2010) Chipman, Hugh A, George, Edward I, and McCulloch, Robert E. BART: Bayesian additive regression trees. The Annals of Applied Statistics, pp. 266–298, 2010.
  • Cortes & Mohri (2014) Cortes, Corinna and Mohri, Mehryar. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014.
  • Cuturi (2013) Cuturi, Marco. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pp. 2292–2300, 2013.
  • Cuturi & Doucet (2014) Cuturi, Marco and Doucet, Arnaud. Fast computation of Wasserstein barycenters. In Proceedings of The 31st International Conference on Machine Learning, pp. 685–693, 2014.
  • Daumé III (2009) Daumé III, Hal. Frustratingly easy domain adaptation. arXiv preprint arXiv:0907.1815, 2009.
  • Dehejia & Wahba (2002) Dehejia, Rajeev H and Wahba, Sadek. Propensity score-matching methods for nonexperimental causal studies. Review of Economics and statistics, 84(1):151–161, 2002.
  • Dorie (2016) Dorie, Vincent. NPCI: Non-parametrics for Causal Inference, 2016.
  • Funk et al. (2011) Funk, Michele Jonsson, Westreich, Daniel, Wiesen, Chris, Stürmer, Til, Brookhart, M Alan, and Davidian, Marie. Doubly robust estimation of causal effects. American journal of epidemiology, 173(7):761–767, 2011.
  • Ganin et al. (2016) Ganin, Yaroslav, Ustinova, Evgeniya, Ajakan, Hana, Germain, Pascal, Larochelle, Hugo, Laviolette, François, Marchand, Mario, and Lempitsky, Victor. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.
  • Gretton et al. (2009) Gretton, Arthur, Smola, Alex, Huang, Jiayuan, Schmittfull, Marcel, Borgwardt, Karsten, and Schölkopf, Bernhard. Covariate shift by kernel mean matching. Dataset shift in machine learning, 3(4):5, 2009.
  • Gretton et al. (2012) Gretton, Arthur, Borgwardt, Karsten M., Rasch, Malte J., Schölkopf, Bernhard, and Smola, Alexander. A kernel two-sample test. J. Mach. Learn. Res., 13:723–773, March 2012. ISSN 1532-4435.
  • Gruber & van der Laan (2011) Gruber, Susan and van der Laan, Mark J. tmle: An R package for targeted maximum likelihood estimation. 2011.
  • Grunewalder et al. (2013) Grunewalder, Steffen, Gretton, Arthur, and Shawe-Taylor, John. Smooth operators. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 1184–1192, 2013.
  • Hartford et al. (2016) Hartford, Jason, Lewis, Greg, Leyton-Brown, Kevin, and Taddy, Matt. Counterfactual prediction with deep instrumental variables networks. arXiv preprint arXiv:1612.09596, 2016.
  • Hill (2011) Hill, Jennifer L. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1), 2011.
  • Hoyer et al. (2009) Hoyer, Patrik O, Janzing, Dominik, Mooij, Joris M, Peters, Jonas, and Schölkopf, Bernhard. Nonlinear causal discovery with additive noise models. In Advances in neural information processing systems, pp. 689–696, 2009.
  • Imbens & Wooldridge (2009) Imbens, Guido W and Wooldridge, Jeffrey M. Recent developments in the econometrics of program evaluation. Journal of economic literature, 47(1):5–86, 2009.
  • Johansson et al. (2016) Johansson, Fredrik D., Shalit, Uri, and Sontag, David. Learning representations for counterfactual inference. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.
  • Kingma & Ba (2014) Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kuang & Tabak (2016) Kuang, Max and Tabak, Esteban. Preconditioning of optimal transport. Preprint, 2016.
  • LaLonde (1986) LaLonde, Robert J. Evaluating the econometric evaluations of training programs with experimental data. The American economic review, pp. 604–620, 1986.
  • Maathuis et al. (2010) Maathuis, Marloes H, Colombo, Diego, Kalisch, Markus, and Bühlmann, Peter. Predicting causal effects in large-scale systems from observational data. Nature Methods, 7(4):247–248, 2010.
  • Mansour et al. (2009) Mansour, Yishay, Mohri, Mehryar, and Rostamizadeh, Afshin. Domain adaptation: Learning bounds and algorithms. 2009.
  • Mooij et al. (2016) Mooij, Joris M, Peters, Jonas, Janzing, Dominik, Zscheischler, Jakob, and Schölkopf, Bernhard. Distinguishing cause from effect using observational data: methods and benchmarks. Journal of Machine Learning Research, 17(32):1–102, 2016.
  • Müller (1997) Müller, Alfred. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, pp. 429–443, 1997.
  • Pan et al. (2011) Pan, Sinno Jialin, Tsang, Ivor W, Kwok, James T, and Yang, Qiang. Domain adaptation via transfer component analysis. Neural Networks, IEEE Transactions on, 22(2):199–210, 2011.
  • Pearl (2009) Pearl, Judea. Causality. Cambridge university press, 2009.
  • Pearl (2015) Pearl, Judea. Detecting latent heterogeneity. Sociological Methods & Research, pp. 0049124115600597, 2015.
  • Peysakhovich & Lada (2016) Peysakhovich, Alexander and Lada, Akos. Combining observational and experimental data to find heterogeneous treatment effects. arXiv preprint arXiv:1611.02385, 2016.
  • Rolling (2014) Rolling, Craig Anthony. Estimation of Conditional Average Treatment Effects. PhD thesis, University of Minnesota, 2014.
  • Rubin (2011) Rubin, Donald B. Causal inference using potential outcomes. Journal of the American Statistical Association, 2011.
  • Shalev-Shwartz & Ben-David (2014) Shalev-Shwartz, Shai and Ben-David, Shai. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
  • Shpitser & Pearl (2006) Shpitser, Ilya and Pearl, Judea. Identification of conditional interventional distributions. In Proceedings of the Twenty-second Conference on Uncertainty in Artificial Intelligence, pp. 437–444. UAI Press, 2006.
  • Smith & Todd (2005) Smith, Jeffrey A and Todd, Petra E. Does matching overcome LaLonde’s critique of nonexperimental estimators? Journal of econometrics, 125(1):305–353, 2005.
  • Sriperumbudur et al. (2012) Sriperumbudur, Bharath K, Fukumizu, Kenji, Gretton, Arthur, Schölkopf, Bernhard, Lanckriet, Gert RG, et al. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550–1599, 2012.
  • Steinwart & Christmann (2008) Steinwart, Ingo and Christmann, Andreas. Support vector machines. Springer Science & Business Media, 2008.
  • Strehl et al. (2010) Strehl, Alex, Langford, John, Li, Lihong, and Kakade, Sham M. Learning from logged implicit exploration data. In Advances in Neural Information Processing Systems, pp. 2217–2225, 2010.
  • Sun et al. (2016) Sun, Baochen, Feng, Jiashi, and Saenko, Kate. Return of frustratingly easy domain adaptation. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • Swaminathan & Joachims (2015) Swaminathan, Adith and Joachims, Thorsten. Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research, 16:1731–1755, 2015.
  • Taddy et al. (2016) Taddy, Matt, Gardner, Matt, Chen, Liyun, and Draper, David. A nonparametric bayesian analysis of heterogenous treatment effects in digital experimentation. Journal of Business & Economic Statistics, 34(4):661–672, 2016.
  • Triantafillou & Tsamardinos (2015) Triantafillou, Sofia and Tsamardinos, Ioannis. Constraint-based causal discovery from multiple interventions over overlapping variable sets. Journal of Machine Learning Research, 16:2147–2205, 2015.
  • Villani (2008) Villani, Cédric. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
  • Wager & Athey (2015) Wager, Stefan and Athey, Susan. Estimation and inference of heterogeneous treatment effects using random forests. arXiv preprint arXiv:1510.04342, 2015.

Appendix A Proofs

A.1 Definitions, assumptions, and auxiliary lemmas

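Notation:

  • $p(x,t)$: distribution on $\mathcal{X}\times\{0,1\}$.
  • $u = p(t=1)$: the marginal probability of treatment.
  • $p^{t=1}(x) = p(x \mid t=1)$: treated distribution.
  • $p^{t=0}(x) = p(x \mid t=0)$: control distribution.
  • $\Phi$: representation function mapping from $\mathcal{X}$ to $\mathcal{R}$.
  • $\Psi$: the inverse function of $\Phi$, mapping from $\mathcal{R}$ to $\mathcal{X}$.
  • $p_\Phi(r,t)$: the distribution induced by $\Phi$ on $\mathcal{R}\times\{0,1\}$.
  • $p_\Phi^{t=1}(r)$, $p_\Phi^{t=0}(r)$: treated and control distributions induced by $\Phi$ on $\mathcal{R}$.
  • $L(\cdot,\cdot)$: loss function, from $\mathcal{Y}\times\mathcal{Y}$ to $\mathbb{R}_+$.
  • $\ell_{h,\Phi}(x,t)$: the expected loss of $h(\Phi(x),t)$ for the unit $x$ and treatment $t$.
  • $\epsilon_F(h,\Phi)$, $\epsilon_{CF}(h,\Phi)$: expected factual and counterfactual loss of $h(\Phi(x),t)$.
  • $\tau(x) := \mathbb{E}[Y_1 - Y_0 \mid x]$: the expected treatment effect for unit $x$.
  • $\epsilon_{\text{PEHE}}(f)$: expected error in estimating the individual treatment effect of a function $f(x,t)$.
  • $\text{IPM}_G(p,q)$: the integral probability metric distance induced by function family $G$ between distributions $p$ and $q$.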

We first define the necessary distributions and prove some simple results about them. We assume a joint distribution function $p(x,t,Y_0,Y_1)$, such that $(Y_1,Y_0) \perp t \mid x$, and $0 < p(t=1 \mid x) < 1$ for all $x$. Recall that we assume Consistency, that is, we assume that we observe $y = Y_1$ when $t=1$ and $y = Y_0$ when $t=0$.

Definition A1.

The treatment effect for unit $x$ is:

$$\tau(x) := \mathbb{E}[Y_1 - Y_0 \mid x].$$

We first show that under consistency and strong ignorability, the ITE function is identifiable:

Lemma A1.

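We have:

$$\begin{aligned}
\mathbb{E}[Y_1 - Y_0 \mid x] &= \mathbb{E}[Y_1 \mid x] - \mathbb{E}[Y_0 \mid x] \\
&= \mathbb{E}[Y_1 \mid x, t=1] - \mathbb{E}[Y_0 \mid x, t=0] && (4) \\
&= \mathbb{E}[y \mid x, t=1] - \mathbb{E}[y \mid x, t=0]. && (5)
\end{aligned}$$

Equality (4) is because we assume that $Y_t$ and $t$ are independent conditioned on $x$. Equality (5) follows from the consistency assumption. Finally, the last equation is composed entirely of observable quantities and can be estimated from data since we assume $0 < p(t=1 \mid x) < 1$ for all $x$.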

Definition A2.

Let $p^{t=1}(x) := p(x \mid t=1)$ and $p^{t=0}(x) := p(x \mid t=0)$ denote respectively the treatment and control distributions.

Let $\Phi: \mathcal{X}\to\mathcal{R}$ be a representation function. We will assume that $\Phi$ is differentiable.

Assumption A1.

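The representation function $\Phi$ is one-to-one. Without loss of generality we will assume that $\mathcal{R}$ is the image of $\mathcal{X}$ under $\Phi$, and define $\Psi: \mathcal{R}\to\mathcal{X}$ to be the inverse of $\Phi$, such that $\Psi(\Phi(x)) = x$ for all $x \in \mathcal{X}$.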

Definition A3.

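For a representation function $\Phi: \mathcal{X}\to\mathcal{R}$, and for a distribution $p$ defined over $\mathcal{X}$, let $p_\Phi$ be the distribution induced by $\Phi$ over $\mathcal{R}$. Define $p_\Phi^{t=1}(r) := p_\Phi(r \mid t=1)$, $p_\Phi^{t=0}(r) := p_\Phi(r \mid t=0)$, to be the treatment and control distributions induced over $\mathcal{R}$.

For a one-to-one $\Phi$, the distribution $p_\Phi$ over $\mathcal{R}\times\{0,1\}$ can be obtained by the standard change of variables formula, using the determinant of the Jacobian of $\Psi(r)$. See (Ben-Israel, 1999) for the case of a mapping between spaces of different dimensions.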

Lemma A2.

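For all $r \in \mathcal{R}$, $t \in \{0,1\}$:

$$p_\Phi(t \mid r) = p(t \mid \Psi(r)), \qquad p_\Phi(Y_t \mid r) = p(Y_t \mid \Psi(r)).$$

Let $J_\Psi(r)$ be the absolute value of the determinant of the Jacobian of $\Psi(r)$. Then

$$p_\Phi(t \mid r) = \frac{p_\Phi(t,r)}{p_\Phi(r)} \overset{(a)}{=} \frac{p(t,\Psi(r))\,J_\Psi(r)}{p(\Psi(r))\,J_\Psi(r)} = \frac{p(t,\Psi(r))}{p(\Psi(r))} = p(t \mid \Psi(r)),$$

where equality (a) is by the change of variable formula. The proof is identical for $p_\Phi(Y_t \mid r)$. ∎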

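Let $L: \mathcal{Y}\times\mathcal{Y}\to\mathbb{R}_+$ be a loss function, e.g. the absolute loss or squared loss.

Definition A4.

Let $\Phi: \mathcal{X}\to\mathcal{R}$ be a representation function. Let $h: \mathcal{R}\times\{0,1\}\to\mathcal{Y}$ be a hypothesis defined over the representation space $\mathcal{R}$. The expected loss for the unit and treatment pair $(x,t)$ is:

$$\ell_{h,\Phi}(x,t) = \int_{\mathcal{Y}} L\big(Y_t, h(\Phi(x),t)\big)\, p(Y_t \mid x)\, dY_t.$$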

Definition A5.

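The expected factual and counterfactual losses of $h$ and $\Phi$ are, respectively:

$$\epsilon_F(h,\Phi) = \int_{\mathcal{X}\times\{0,1\}} \ell_{h,\Phi}(x,t)\, p(x,t)\, dx\, dt,$$

$$\epsilon_{CF}(h,\Phi) = \int_{\mathcal{X}\times\{0,1\}} \ell_{h,\Phi}(x,t)\, p(x,1-t)\, dx\, dt.$$

When it is clear from the context, we will sometimes use $\epsilon_F(f)$ and $\epsilon_{CF}(f)$ for the expected factual and counterfactual losses of an arbitrary function $f(x,t)$.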

Definition A6.

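The expected treated and control losses are:

$$\begin{aligned}
\epsilon_F^{t=1}(h,\Phi) &= \int_{\mathcal{X}} \ell_{h,\Phi}(x,1)\, p^{t=1}(x)\, dx, \\
\epsilon_F^{t=0}(h,\Phi) &= \int_{\mathcal{X}} \ell_{h,\Phi}(x,0)\, p^{t=0}(x)\, dx, \\
\epsilon_{CF}^{t=1}(h,\Phi) &= \int_{\mathcal{X}} \ell_{h,\Phi}(x,1)\, p^{t=0}(x)\, dx, \\
\epsilon_{CF}^{t=0}(h,\Phi) &= \int_{\mathcal{X}} \ell_{h,\Phi}(x,0)\, p^{t=1}(x)\, dx.
\end{aligned}$$

The four losses above are simply the loss conditioned on either the control or treated set. Let $u := p(t=1)$ be the proportion of treated in the population. We then have the immediate result: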

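Lemma A3.

$$\begin{aligned}
\epsilon_F(h,\Phi) &= u\,\epsilon_F^{t=1}(h,\Phi) + (1-u)\,\epsilon_F^{t=0}(h,\Phi), \\
\epsilon_{CF}(h,\Phi) &= (1-u)\,\epsilon_{CF}^{t=1}(h,\Phi) + u\,\epsilon_{CF}^{t=0}(h,\Phi).
\end{aligned}$$

The proof is immediate, noting that, with $u = p(t=1)$, we have $p(x,t=1) = u\,p^{t=1}(x)$ and $p(x,t=0) = (1-u)\,p^{t=0}(x)$, and from the Definitions A4 and A6 of the losses.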
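The factual part of this decomposition has a direct empirical analogue, which the short check below illustrates: the average loss over a sample equals the mixture of the treated and control average losses, weighted by the treated fraction. The per-unit losses here are random stand-ins and the function name is hypothetical; this is a sanity check of the identity, not part of the proof.

```python
import numpy as np

def factual_loss_decomposition(loss, t):
    """Return (overall mean loss, treated/control mixture) for per-unit factual losses."""
    loss = np.asarray(loss, dtype=float)
    t = np.asarray(t).astype(bool)
    u = t.mean()                      # empirical proportion of treated units
    overall = loss.mean()             # empirical factual loss
    mixture = u * loss[t].mean() + (1.0 - u) * loss[~t].mean()
    return float(overall), float(mixture)

# Stand-in data: random treatment assignment and random non-negative losses.
rng = np.random.default_rng(0)
t = rng.random(1000) < 0.3
loss = rng.random(1000) ** 2
overall, mixture = factual_loss_decomposition(loss, t)
print(overall, mixture)  # the two values agree up to floating-point error
```

The counterfactual decomposition has no such sample analogue, since counterfactual losses are never observed; this is what the IPM bound in the next subsection addresses.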

Definition A7.

Let $G$ be a function family consisting of functions $g: \mathcal{S}\to\mathbb{R}$. For a pair of distributions $p_1$, $p_2$ over $\mathcal{S}$, define the Integral Probability Metric:

$$\text{IPM}_G(p_1,p_2) = \sup_{g\in G}\left|\int_{\mathcal{S}} g(s)\,\big(p_1(s) - p_2(s)\big)\, ds\right|.$$

$\text{IPM}_G(\cdot,\cdot)$ defines a pseudo-metric on the space of probability functions over $\mathcal{S}$, and for sufficiently large function families, $\text{IPM}_G(\cdot,\cdot)$ is a proper metric (Müller, 1997). Examples of sufficiently large function families include the set of bounded continuous functions, the set of 1-Lipschitz functions, and the set of unit norm functions in a universal Reproducing Kernel Hilbert Space. The latter two give rise to the Wasserstein and Maximum Mean Discrepancy metrics, respectively (Gretton et al., 2012; Sriperumbudur et al., 2012). We note that for function families $G$ such as the three mentioned above, for which $g \in G \implies -g \in G$, the absolute value can be omitted from Definition A7.
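Two simple empirical IPMs can be sketched in a few lines. The first is the squared linear MMD, which reduces to the squared Euclidean distance between sample means (the IPM over unit-norm linear functions). The second is the 1-Wasserstein distance between two one-dimensional samples of equal size, computed by matching sorted values. Both function names are illustrative, and the one-dimensional, equal-size restriction is a simplifying assumption made here, not the setting used in the experiments.

```python
import numpy as np

def squared_linear_mmd(a, b):
    """Squared linear MMD between samples a, b of shape (n, d) and (m, d)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diff = a.mean(axis=0) - b.mean(axis=0)   # difference of sample means
    return float(np.dot(diff, diff))

def wasserstein_1d(x, y):
    """1-Wasserstein distance between two 1-D samples with the same number of points."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equally sized samples")
    return float(np.mean(np.abs(x - y)))     # optimal 1-D coupling matches sorted points

# Toy samples (made-up values).
treated = [[0.0, 0.0], [2.0, 0.0]]
control = [[1.0, 1.0], [1.0, 3.0]]
print(squared_linear_mmd(treated, control))
print(wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))
# Equal means but different spread: the linear MMD cannot tell these apart.
print(squared_linear_mmd([[-1.0], [1.0]], [[0.0], [0.0]]))
```

The last line illustrates the pseudo-metric caveat: the family of linear functions is not large enough to separate all distributions, so the linear MMD can be zero for two different distributions, whereas the Wasserstein distance between the same two samples is positive.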

A.2 General IPM bound

We now state and prove the most important technical lemma of this section.

Lemma A4 (Lemma 1, main text).

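Let $\Phi: \mathcal{X}\to\mathcal{R}$ be an invertible representation with $\Psi$ its inverse. Let $p_\Phi^{t=1}$, $p_\Phi^{t=0}$ be defined as in Definition A3. Let $u = p(t=1)$. Let $G$ be a family of functions $g: \mathcal{R}\to\mathbb{R}$, and denote by $\text{IPM}_G(\cdot,\cdot)$ the integral probability metric induced by $G$. Let $h: \mathcal{R}\times\{0,1\}\to\mathcal{Y}$ be a hypothesis. Assume there exists a constant $B_\Phi > 0$, such that for $t = 0,1$, the function $g_{\Phi,h}(r,t) := \frac{1}{B_\Phi}\cdot\ell_{h,\Phi}(\Psi(r),t) \in G$. Then we have: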