There is intense interest in applying machine learning to problems of causal inference in fields such as healthcare, economics and education. In particular, individual-level causal inference has important applications such as precision medicine. We give a new theoretical analysis and family of algorithms for predicting individual treatment effect (ITE) from observational data, under the assumption known as strong ignorability. The algorithms learn a "balanced" representation such that the induced treated and control distributions look similar. We give a novel, simple and intuitive generalization-error bound showing that the expected ITE estimation error of a representation is bounded by a sum of the standard generalization-error of that representation and the distance between the treated and control distributions induced by the representation. We use Integral Probability Metrics to measure distances between distributions, deriving explicit bounds for the Wasserstein and Maximum Mean Discrepancy (MMD) distances. Experiments on real and simulated data show the new algorithms match or outperform the state-of-the-art.
Making predictions about causal effects of actions is a central problem in many domains. For example, a doctor deciding which medication will cause better outcomes for a patient; a government deciding who would benefit most from subsidized job training; or a teacher deciding which study program would most benefit a specific student. In this paper we focus on the problem of making these predictions based on observational data. Observational data is data which contains past actions, their outcomes, and possibly more context, but without direct access to the mechanism which gave rise to the action. For example we might have access to records of patients (context), their medications (actions), and outcomes, but we do not have complete knowledge of why a specific action was applied to a patient.
The hallmark of learning from observational data is that the actions observed in the data depend on variables which might also affect the outcome, resulting in confounding: For example, richer patients might better afford certain medications, and job training might only be given to those motivated enough to seek it. The challenge is how to untangle these confounding factors and make valid predictions. Specifically, we work under the common simplifying assumption of “no-hidden confounding”, assuming that all the factors determining which actions were taken are observed. In the examples above, it would mean that we have measured a patient’s wealth or an employee’s motivation.
As a learning problem, estimating causal effects from observational data is different from classic learning in that in our training data we never see the individual-level effect. For each unit, we only see their response to one of the possible actions - the one they had actually received. This is close to what is known in the machine learning literature as “learning from logged bandit feedback” (Strehl et al., 2010; Swaminathan & Joachims, 2015), with the distinction that we do not have access to the model generating the action.
Our work differs from much work in causal inference in that we focus on the individual-level causal effect (also known as “c-specific treatment effects”; Shpitser & Pearl (2006); Pearl (2015)), rather than the average or population level. Our main contribution is to give what is, to the best of our knowledge, the first generalization-error bound for estimating individual-level causal effect, where each individual is identified by its features $x$. (Our use of the term generalization is different from its use in the study of transportability, where the goal is to generalize causal conclusions across distributions (Bareinboim & Pearl, 2016).) The bound leads naturally to a new family of representation-learning based algorithms (Bengio et al., 2013), which we show to match or outperform state-of-the-art methods on several causal effect inference tasks.
We frame our results using the Rubin-Neyman potential outcomes framework (Rubin, 2011), as follows. We assume that for a unit with features $x \in \mathcal{X}$, and an action (also known as treatment or intervention) $t \in \{0, 1\}$, there are two potential outcomes: $Y_0$ and $Y_1$. In our data, for each unit we only see one of the potential outcomes, depending on the treatment assignment: if $t = 0$ we observe $y = Y_0$, if $t = 1$ we observe $y = Y_1$; this is known as the Consistency assumption. For example, $x$ can denote the set of lab tests and demographic factors of a diabetic patient, $t = 0$ denotes the standard medication for controlling blood sugar, $t = 1$ denotes a new medication, and $Y_0$ and $Y_1$ indicate the patient’s blood sugar level if they were to be given medications $t = 0$ and $t = 1$, respectively.
We will denote $m_1(x) = \mathbb{E}[Y_1 \mid x]$, $m_0(x) = \mathbb{E}[Y_0 \mid x]$. We are interested in learning the function $\tau(x) := \mathbb{E}[Y_1 - Y_0 \mid x] = m_1(x) - m_0(x)$. $\tau(x)$ is the expected treatment effect of $t = 1$ relative to $t = 0$ on an individual unit with characteristics $x$, or the Individual Treatment Effect (ITE; sometimes known as the Conditional Average Treatment Effect, CATE). For example, for a patient with features $x$, we can use this to predict which of two treatments will have a better outcome. The fundamental problem of causal inference is that for any $x$ in our data we only observe $Y_1$ or $Y_0$, but never both.
As mentioned above, we make an important “no-hidden confounders” assumption, in order to make the conditional causal effect identifiable. We formalize this assumption by using the standard strong ignorability condition: $(Y_1, Y_0) \perp t \mid x$, and $0 < p(t = 1 \mid x) < 1$ for all $x$. Strong ignorability is a sufficient condition for the ITE function $\tau(x)$ to be identifiable (Imbens & Wooldridge, 2009; Pearl, 2015; Rolling, 2014): see proof in the supplement. The validity of strong ignorability cannot be assessed from data, and must be determined by domain knowledge and understanding of the causal relationships between the variables.
One approach to the problem of estimating the function $\tau(x)$ is by learning the two functions $m_0(x)$ and $m_1(x)$ using samples from $p(Y_t \mid x, t)$. This is similar to a standard machine learning problem of learning from finite samples. However, there is an additional source of variance at work here: For example, if mostly rich patients received treatment $t = 1$, and mostly poor patients received treatment $t = 0$, we might have an unreliable estimation of $m_1(x)$ for poor patients. In this paper we upper bound this additional source of variance using an Integral Probability Metric (IPM) measure of distance between two distributions $p(x \mid t = 0)$ and $p(x \mid t = 1)$, also known as the control and treated distributions. In practice we use two specific IPMs: the Maximum Mean Discrepancy (Gretton et al., 2012), and the Wasserstein distance (Villani, 2008; Cuturi & Doucet, 2014). We show that the expected error in learning the individual treatment effect function $\tau(x)$ is upper bounded by the error of learning $Y_1$ and $Y_0$, plus the IPM term. In the randomized controlled trial setting, where $t \perp x$, the IPM term is $0$, and our bound naturally reduces to a standard learning problem of learning two functions.
The bound we derive points the way to a family of algorithms based on the idea of representation learning (Bengio et al., 2013): Jointly learn hypotheses for both treated and control on top of a representation which minimizes a weighted sum of the factual loss (the standard supervised machine learning objective), and the IPM distance between the control and treated distributions induced by the representation. This can be viewed as learning the functions $m_0$ and $m_1$ under a constraint that encourages better generalization across the treated and control populations. In the Experiments section we apply algorithms based on multi-layer neural nets as representations and hypotheses, along with MMD or Wasserstein distributional distances over the representation layer; see Figure 1 for the basic architecture.
In his foundational text about causality, Pearl (2009) writes: “Whereas in traditional learning tasks we attempt to generalize from one set of instances to another, the causal modeling task is to generalize from behavior under one set of conditions to behavior under another set. Causal models should therefore be chosen by a criterion that challenges their stability against changing conditions…” [emphasis ours]. We believe our work points the way to one such stability criterion, for causal inference in the strongly ignorable case.
Much recent work in machine learning for causal inference focuses on causal discovery, with the goal of discovering the underlying causal graph or causal direction from data (Hoyer et al., 2009; Maathuis et al., 2010; Triantafillou & Tsamardinos, 2015; Mooij et al., 2016). We focus on the case when the causal graph is simple and known to be of the form $(Y_1, Y_0) \perp t \mid x$, with no hidden confounders.
Under the causal model we assume, the most common goal of causal effect inference as used in the applied sciences is to obtain the average treatment effect: $\text{ATE} = \mathbb{E}_{x \sim p(x)}[\tau(x)]$. We will briefly discuss how some standard statistical causal effect inference methods relate to our proposed method. Note that most of these approaches assume some form of ignorability.
One of the most widely used approaches to estimating ATE is covariate adjustment, also known as back-door adjustment or the G-computation formula (Pearl, 2009; Rubin, 2011). In its basic version, covariate adjustment amounts to estimating the functions $m_1(x)$, $m_0(x)$. Therefore, covariate adjustment methods are the most natural candidates for estimating ITE as well as ATE, using the estimates of $m_t(x)$. However, most previous work on this subject focused on asymptotic consistency (Belloni et al., 2014; Athey et al., 2016; Chernozhukov et al., 2016), and so far there has not been much work on the generalization-error of such a procedure. One way to view our results is that we point out a previously unaccounted for source of variance when using covariate adjustment to estimate ITE. We suggest a new type of regularization, by learning representations with reduced IPM distance between treated and control, enabling a new type of bias-variance trade-off.
Another widely used family of statistical methods used in causal effect inference are weighting methods. Methods such as propensity score weighting (Austin, 2011) re-weight the units in the observational data so as to make the treated and control populations more comparable. These methods do not yield themselves immediately to estimating an individual level effect, and adapting them for that purpose is an interesting research question. Doubly robust methods combine re-weighting the samples and covariate adjustment in clever ways to reduce model bias (Funk et al., 2011). Again, we believe that finding how to adapt the concept of double robustness to the problem of effectively estimating ITE is an interesting open question.
Adapting machine learning methods for causal effect inference, and in particular for individual level treatment effect, has gained much interest recently. For example, Wager & Athey (2015) and Athey & Imbens (2016) discuss how tree-based methods can be adapted to obtain a consistent estimator with semi-parametric asymptotic convergence rate. Recent work has also looked into how machine learning methods can help detect heterogeneous treatment effects when some data from randomized experiments is available (Taddy et al., 2016; Peysakhovich & Lada, 2016). Neural nets have also been used for this purpose, exemplified in early work by Beck et al. (2000), and more recently by Hartford et al. (2016)’s work on deep instrumental variables. Our work differs from all the above by focusing on the generalization-error aspects of estimating individual treatment effect, as opposed to asymptotic consistency, and by focusing solely on the observational study case, with no randomized components or instrumental variables.
Another line of work in the causal inference community relates to bounding the estimate of the average treatment effect given an instrumental variable (Balke & Pearl, 1997; Bareinboim & Pearl, 2012), or under hidden confounding, for example when the ignorability assumption does not hold (Pearl, 2009; Cai et al., 2008). Our work differs, in that we only deal with the ignorable case, and in that we bound a very different quantity: the generalization-error of estimating individual level treatment effect.
Our work has strong connections with work on domain adaptation. In particular, estimating ITE requires prediction of outcomes over a different distribution from the observed one. Our ITE error upper bound has similarities with generalization bounds in domain adaptation given by Ben-David et al. (2007); Mansour et al. (2009); Ben-David et al. (2010); Cortes & Mohri (2014). These bounds employ distribution distance metrics such as the A-distance or the discrepancy metric, which are related to the IPM distance we use. Our algorithm is similar to a recent algorithm for domain adaptation by Ganin et al. (2016), and in principle other domain adaptation methods (e.g. Daumé III (2009); Pan et al. (2011); Sun et al. (2016)) could be adapted for use in ITE estimation as presented here.
Finally, our paper builds on work by Johansson et al. (2016), where the authors show a connection between covariate shift and the task of estimating the counterfactual outcome in a causal inference scenario. They proposed learning a representation of the data that makes the treated and control distributions more similar, and fitting a linear ridge-regression model on top of it. They then bounded the relative error of fitting a ridge-regression using the distribution with reverse treatment assignment versus fitting a ridge-regression using the factual distribution. Unfortunately, the relative error bound is not at all informative regarding the absolute quality of the representation. In this paper we focus on a related but more substantive task: estimating the individual treatment effect, building on top of the counterfactual error term. We further provide an informative bound on the absolute quality of the representation. We also derive a much more flexible family of algorithms, including non-linear hypotheses and much more powerful distribution metrics in the form of IPMs such as the Wasserstein and MMD distances. Finally, we conduct significantly more thorough experiments including a real-world dataset and out-of-sample performance, and show our methods outperform previously proposed ones.
In this section we prove a bound on the expected error in estimating the individual treatment effect for a given representation $\Phi$, and a hypothesis $h$ defined over that representation. The bound is expressed in terms of (1) the expected loss of the model when learning the observed outcomes $y$ as a function of $x$ and $t$, denoted $\epsilon_F$, with $F$ standing for “Factual”; (2) an Integral Probability Metric (IPM) distance between the distribution of treated and control units. The term $\epsilon_F$ is the classic machine learning generalization-error, and in turn can be upper bounded using the empirical error and model complexity terms, applying standard machine learning theory (Shalev-Shwartz & Ben-David, 2014).
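We will employ the following assumptions and notations. The most important notations are in the Notation box in the supplement. The space of covariates is a bounded subset $\mathcal{X} \subset \mathbb{R}^d$. The outcome space is $\mathcal{Y} \subset \mathbb{R}$. Treatment $t$ is a binary variable. We assume there exists a joint distribution $p(x, t, Y_0, Y_1)$, such that $(Y_1, Y_0) \perp t \mid x$ and $0 < p(t = 1 \mid x) < 1$ for all $x \in \mathcal{X}$ (strong ignorability). The treated and control distributions are the distribution of the features $x$ conditioned on treatment: $p^{t=1}(x) := p(x \mid t = 1)$, and $p^{t=0}(x) := p(x \mid t = 0)$, respectively.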
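Throughout this paper we will discuss representation functions of the form $\Phi: \mathcal{X} \to \mathcal{R}$, where $\mathcal{R}$ is the representation space. We make the following assumption about $\Phi$:
The representation $\Phi$ is a twice-differentiable, one-to-one function. Without loss of generality we will assume that $\mathcal{R}$ is the image of $\mathcal{X}$ under $\Phi$. We then have $\Psi: \mathcal{R} \to \mathcal{X}$ as the inverse of $\Phi$, such that $\Psi(\Phi(x)) = x$ for all $x \in \mathcal{X}$.
The representation $\Phi$ pushes forward the treated and control distributions into the new space $\mathcal{R}$; we denote the induced distribution by $p_\Phi$.
Define $p_\Phi^{t=1}(r) := p_\Phi(r \mid t = 1)$, $p_\Phi^{t=0}(r) := p_\Phi(r \mid t = 0)$, to be the treated and control distributions induced over $\mathcal{R}$. For a one-to-one $\Phi$, the distributions $p_\Phi^{t=1}(r)$ and $p_\Phi^{t=0}(r)$ can be obtained by the standard change of variables formula, using the determinant of the Jacobian of $\Psi(r)$.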
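Let $\Phi: \mathcal{X} \to \mathcal{R}$ be a representation function, and $h: \mathcal{R} \times \{0, 1\} \to \mathcal{Y}$ be a hypothesis defined over the representation space $\mathcal{R}$. Let $L: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ be a loss function. We define two complementary loss functions: one is the standard machine learning loss, which we will call the factual loss and denote $\epsilon_F(h, \Phi)$. The other is the expected loss with respect to the distribution where the treatment assignment is flipped, which we call the counterfactual loss, $\epsilon_{CF}(h, \Phi)$.
The expected loss for the unit and treatment pair $(x, t)$ is:
$$\ell_{h,\Phi}(x, t) = \int_{\mathcal{Y}} L(Y_t, h(\Phi(x), t))\, p(Y_t \mid x)\, dY_t.$$
The expected factual and counterfactual losses of $h$ and $\Phi$ are:
$$\epsilon_F(h, \Phi) = \int_{\mathcal{X} \times \{0,1\}} \ell_{h,\Phi}(x, t)\, p(x, t)\, dx\, dt,$$
$$\epsilon_{CF}(h, \Phi) = \int_{\mathcal{X} \times \{0,1\}} \ell_{h,\Phi}(x, t)\, p(x, 1 - t)\, dx\, dt.$$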
If $x$ denotes patients’ features, $t$ a treatment, and $Y_t$ a potential outcome such as mortality, we think of $\epsilon_F$ as measuring how well $h$ and $\Phi$ predict mortality for the patients and doctors’ actions sampled from the same distribution as our data sample. $\epsilon_{CF}$ measures how well our prediction with $h$ and $\Phi$ would do in a “topsy-turvy” world where the patients are the same but the doctors are inclined to prescribe exactly the opposite treatment than the one the real-world doctors would prescribe.
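The expected factual treated and control losses are:
$$\epsilon_F^{t=1}(h, \Phi) = \int_{\mathcal{X}} \ell_{h,\Phi}(x, 1)\, p^{t=1}(x)\, dx,$$
$$\epsilon_F^{t=0}(h, \Phi) = \int_{\mathcal{X}} \ell_{h,\Phi}(x, 0)\, p^{t=0}(x)\, dx.$$
For $u := p(t = 1)$, it is immediate to show that $\epsilon_F(h, \Phi) = u\, \epsilon_F^{t=1}(h, \Phi) + (1 - u)\, \epsilon_F^{t=0}(h, \Phi)$.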
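The treatment effect (ITE) for unit $x$ is:
$$\tau(x) := \mathbb{E}[Y_1 - Y_0 \mid x].$$
Let $f: \mathcal{X} \times \{0, 1\} \to \mathcal{Y}$ be a hypothesis. For example, we could have that $f(x, t) = h(\Phi(x), t)$.
The treatment effect estimate of the hypothesis $f$ for unit $x$ is:
$$\hat{\tau}_f(x) = f(x, 1) - f(x, 0).$$
The expected Precision in Estimation of Heterogeneous Effect (PEHE, Hill (2011)) loss of $f$ is:
$$\epsilon_{\text{PEHE}}(f) = \int_{\mathcal{X}} \left(\hat{\tau}_f(x) - \tau(x)\right)^2 p(x)\, dx. \quad (1)$$
When $f(x, t) = h(\Phi(x), t)$, we will also use the notation $\epsilon_{\text{PEHE}}(h, \Phi) = \epsilon_{\text{PEHE}}(f)$.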
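As a minimal sketch of the finite-sample version of the PEHE loss, assuming the true effect is available for every unit (which only holds for synthetic or semi-synthetic data), the two quantities above can be computed as follows. The function names and the toy hypothesis `toy_f` are hypothetical and serve only as illustration.

```python
import numpy as np

def ite_estimate(f, x):
    """Estimated effect f(x, 1) - f(x, 0) for every row of x."""
    x = np.asarray(x, dtype=float)
    ones = np.ones(len(x))
    zeros = np.zeros(len(x))
    return f(x, ones) - f(x, zeros)

def empirical_pehe(tau_hat, tau_true):
    """Mean squared difference between estimated and true effects."""
    tau_hat = np.asarray(tau_hat, dtype=float)
    tau_true = np.asarray(tau_true, dtype=float)
    return float(np.mean((tau_hat - tau_true) ** 2))

# Hypothetical hypothesis: treatment shifts the outcome by the first feature.
def toy_f(x, t):
    return x[:, 0] * t

x = np.array([[1.0], [2.0], [3.0]])
tau_true = np.array([1.0, 2.0, 5.0])  # known only in (semi-)synthetic data
tau_hat = ite_estimate(toy_f, x)
pehe = empirical_pehe(tau_hat, tau_true)
```

On real observational data `tau_true` is never observed, which is why the analysis bounds this quantity instead of estimating it directly.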
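Our proof relies on the notion of an Integral Probability Metric (IPM), which is a class of metrics between probability distributions (Sriperumbudur et al., 2012; Müller, 1997). For two probability density functions $p$, $q$ defined over $\mathcal{S} \subseteq \mathbb{R}^d$, and for a function family $G$ of functions $g: \mathcal{S} \to \mathbb{R}$, we have that
$$\text{IPM}_G(p, q) := \sup_{g \in G} \left| \int_{\mathcal{S}} g(s)\left(p(s) - q(s)\right) ds \right|.$$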
Integral probability metrics are always symmetric and obey the triangle inequality, and trivially satisfy $\text{IPM}_G(p, p) = 0$. For rich enough function families $G$, we also have that $\text{IPM}_G(p, q) = 0 \implies p = q$, and then $\text{IPM}_G$ is a true metric over the corresponding set of probabilities. Examples of function families $G$ for which $\text{IPM}_G$ is a true metric are the family of bounded continuous functions, the family of $1$-Lipschitz functions (Sriperumbudur et al., 2012), and the unit-ball of functions in a universal reproducing kernel Hilbert space (Gretton et al., 2012).
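To make the RKHS case concrete, the following is a minimal sketch of a finite-sample estimate of the squared MMD between two samples. It assumes a Gaussian (RBF) kernel with a hand-picked bandwidth `sigma` and uses the simple biased V-statistic; the function names are hypothetical and this is not necessarily the estimator used in our experiments.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = (np.sum(a ** 2, axis=1)[:, None]
          + np.sum(b ** 2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def mmd2_biased(x, y, sigma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    kxx = rbf_kernel(x, x, sigma).mean()
    kyy = rbf_kernel(y, y, sigma).mean()
    kxy = rbf_kernel(x, y, sigma).mean()
    return float(kxx + kyy - 2.0 * kxy)
```

The estimate is zero when both samples coincide and grows as the samples move apart, mirroring the properties listed above; the bandwidth controls which differences the kernel is sensitive to.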
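Recall that $m_t(x) = \mathbb{E}[Y_t \mid x]$. The expected variance of $Y_t$ with respect to a distribution $p(x, t)$:
$$\sigma^2_{Y_t}(p(x, t)) = \int_{\mathcal{X} \times \mathcal{Y}} \left(Y_t - m_t(x)\right)^2 p(Y_t \mid x)\, p(x, t)\, dY_t\, dx.$$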
We first state a Lemma bounding the counterfactual loss, a key step in obtaining the bound on the error in estimating individual treatment effect. We then give the main Theorem. The proofs and details are in the supplement.
Let $u := p(t = 1)$ be the marginal probability of treatment. By the strong ignorability assumption, $0 < u < 1$.
The main idea of the proof is showing that $\epsilon_{\text{PEHE}}$ is upper bounded by the sum of the expected factual loss $\epsilon_F$ and expected counterfactual loss $\epsilon_{CF}$. However, we cannot estimate $\epsilon_{CF}$, since we only have samples relevant to $\epsilon_F$. We therefore bound the difference $\epsilon_{CF} - \epsilon_F$ using an IPM.
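As a hedged summary of this argument, the chain of inequalities can be sketched as below. The notation is an assumption made for this sketch: $u = p(t=1)$, $B_\Phi$ is the constant of Theorem 1, $\sigma_Y^2$ is the variance term, and the loss is taken to be the squared loss. The precise statements and conditions are those of the Lemma, Theorem 1 and the supplement.

```latex
\begin{align*}
\epsilon_F(h,\Phi) &= u\,\epsilon_F^{t=1}(h,\Phi) + (1-u)\,\epsilon_F^{t=0}(h,\Phi), \\
\epsilon_{CF}(h,\Phi) &\le (1-u)\,\epsilon_F^{t=1}(h,\Phi) + u\,\epsilon_F^{t=0}(h,\Phi)
  + B_\Phi\,\mathrm{IPM}_G\!\left(p_\Phi^{t=1}, p_\Phi^{t=0}\right), \\
\epsilon_{\mathrm{PEHE}}(h,\Phi) &\le 2\left(\epsilon_{CF}(h,\Phi) + \epsilon_F(h,\Phi) - 2\sigma_Y^2\right) \\
 &\le 2\left(\epsilon_F^{t=0}(h,\Phi) + \epsilon_F^{t=1}(h,\Phi)
  + B_\Phi\,\mathrm{IPM}_G\!\left(p_\Phi^{t=1}, p_\Phi^{t=0}\right) - 2\sigma_Y^2\right).
\end{align*}
```

Adding the first two lines gives the last step: the weights $u$ and $1-u$ on the treated and control losses sum to one, leaving the two factual losses plus the IPM term.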
Choosing a small function family $G$ will make the bound tighter. However, choosing too small a family could result in an incomputable bound. For example, for the minimal choice $G = \{\ell_{h,\Phi}(x, 0), \ell_{h,\Phi}(x, 1)\}$, we will have to evaluate an expectation term of $Y_1$ over $p^{t=0}$, and of $Y_0$ over $p^{t=1}$. We cannot in general evaluate these expectations, since by assumption when $t = 0$ we only observe $Y_0$, and the same for $t = 1$ and $Y_1$. In addition, for some function families there is no known way to efficiently compute the IPM distance or its gradients. In this paper we use two function families for which there are available optimization tools. The first is the family of $1$-Lipschitz functions, which leads to IPM being the Wasserstein distance (Villani, 2008; Sriperumbudur et al., 2012), denoted $\text{Wass}(\cdot, \cdot)$. The second is the family of norm-$1$ reproducing kernel Hilbert space (RKHS) functions, leading to the MMD metric (Gretton et al., 2012; Sriperumbudur et al., 2012), denoted $\text{MMD}(\cdot, \cdot)$. Both the Wasserstein and MMD metrics have consistent estimators which can be efficiently computed in the finite sample case (Sriperumbudur et al., 2012). Both have been used for various machine learning tasks in recent years (Gretton et al., 2009, 2012; Cuturi & Doucet, 2014).
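For intuition about the Wasserstein case, here is a minimal sketch restricted to one-dimensional samples of equal size, where the empirical 1-Wasserstein distance reduces to the mean absolute difference between sorted samples. This special case is an assumption chosen for simplicity: multi-dimensional representations need an optimal transport solver, which is not shown, and the function name is hypothetical.

```python
import numpy as np

def wasserstein1_1d(a, b):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    if a.ndim != 1 or a.shape != b.shape:
        raise ValueError("expected two 1-D samples of equal size")
    # With equal weights, the optimal coupling in 1-D matches order statistics.
    return float(np.mean(np.abs(a - b)))
```

For example, shifting every point of a sample by a constant `c` yields a distance of `abs(c)`, which matches the intuition of moving mass by that amount.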
In order to explicitly evaluate the constant $B_\Phi$ in Theorem 1, we have to make some assumptions about the elements of the problem. For the Wasserstein case these are the loss $L$, the Lipschitz constants of $p(Y_t \mid x)$ and $h$, and the condition number of the Jacobian of $\Phi$. For the MMD case, we make assumptions about the RKHS representability and RKHS norms of $h$, $\Phi$, and the standard deviation of $Y_t \mid x$. The full details are given in the supplement, with the major results stated in Theorems 2 and 3. In all cases we obtain that making $\Phi$ smaller increases the constant $B_\Phi$, precluding trivial solutions such as making $\Phi$ arbitrarily small.
For an empirical sample, and a family of representations and hypotheses, we can further upper bound $\epsilon_F^{t=0}$ and $\epsilon_F^{t=1}$ by their respective empirical losses and a model complexity term using standard arguments (Shalev-Shwartz & Ben-David, 2014). The IPMs we use can be consistently estimated from finite samples (Sriperumbudur et al., 2012). The negative variance term $\sigma_Y^2$ arises from the fact that, following Hill (2011) and Athey & Imbens (2016), we define the error $\epsilon_{\text{PEHE}}$ in terms of the conditional mean functions $m_t(x)$, as opposed to fitting the random variables $Y_t$.
Our results hold for any given $h$ and $\Phi$ obeying the Theorem conditions. This immediately suggests an algorithm in which we minimize the upper bound in Eq. (2) with respect to $\Phi$ and $h$ and either the Wasserstein or MMD IPM, in order to minimize the error in estimating the individual treatment effect. This leads us to Algorithm 1 below.
We propose a general framework called CFR (for Counterfactual Regression) for ITE estimation based on the theoretical results above. Our algorithm is an end-to-end, regularized minimization procedure which simultaneously fits both a balanced representation of the data and a hypothesis for the outcome. CFR draws on the same intuition as the approach proposed by Johansson et al. (2016), but overcomes the following limitations of their method: a) Their theory requires a two-step optimization procedure and is specific to linear hypotheses of the learned representation (and does not support e.g. deep neural networks), b) The treatment indicator might get lost if the learned representation is high-dimensional (see discussion below).
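We assume there exists a distribution $p(x, t, Y_0, Y_1)$ over $\mathcal{X} \times \{0, 1\} \times \mathcal{Y} \times \mathcal{Y}$, such that strong ignorability holds. We further assume we have a sample from that distribution $(x_1, t_1, y_1), \ldots, (x_n, t_n, y_n)$, where $y_i \sim p(Y_1 \mid x_i)$ if $t_i = 1$, $y_i \sim p(Y_0 \mid x_i)$ if $t_i = 0$. This standard assumption means that the treatment assignment determines which potential outcome we see. Our goal is to find a representation $\Phi: \mathcal{X} \to \mathcal{R}$ and hypothesis $h: \mathcal{R} \times \{0, 1\} \to \mathcal{Y}$ that will minimize $\epsilon_{\text{PEHE}}(f)$ for $f(x, t) := h(\Phi(x), t)$.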
In this work, we let $\Phi(x)$ and $h(\Phi, t)$ be parameterized by deep neural networks trained jointly in an end-to-end fashion, see Figure 1. This model allows for learning complex non-linear representations and hypotheses with large flexibility. Johansson et al. (2016) parameterized $h(\Phi, t)$ with a single network using the concatenation of $\Phi$ and $t$ as input. When the dimension of $\Phi$ is high, this risks losing the influence of $t$ on $h$ during training. To combat this, our first contribution is to parameterize $h_1(\Phi)$ and $h_0(\Phi)$ as two separate “heads” of the joint network, the former used to estimate the outcome under treatment, and the latter under control. This means that statistical power is shared in the representation layers of the network, while the effect of treatment is retained in the separate heads. Note that each sample is used to update only the head corresponding to the observed treatment; for example, an observation $(x_i, t_i = 1, y_i)$ is only used to update $h_1$.
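The two-head parameterization can be sketched with a forward pass in plain NumPy. This is an illustrative sketch only: the layer counts, sizes, initialization and all function names are assumptions, and training is omitted. Its purpose is to show that the factual prediction for a unit depends only on the shared representation and on the head matching that unit's observed treatment.

```python
import numpy as np

def elu(z):
    """Exponential-linear activation."""
    return np.where(z > 0, z, np.exp(np.minimum(z, 0.0)) - 1.0)

def dense(rng, n_in, n_out):
    """Random weight matrix and zero bias for one fully-connected layer."""
    return rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_in, n_out)), np.zeros(n_out)

def init_params(d_in, d_rep, d_hid, seed=0):
    rng = np.random.default_rng(seed)
    return {
        "rep": dense(rng, d_in, d_rep),                          # shared representation
        "h0": [dense(rng, d_rep, d_hid), dense(rng, d_hid, 1)],  # control head
        "h1": [dense(rng, d_rep, d_hid), dense(rng, d_hid, 1)],  # treated head
    }

def representation(params, x):
    w, b = params["rep"]
    return elu(x @ w + b)

def head(layers, r):
    (w1, b1), (w2, b2) = layers
    return (elu(r @ w1 + b1) @ w2 + b2)[:, 0]

def predict_both(params, x):
    """Predicted outcomes under control and under treatment for every unit."""
    r = representation(params, x)
    return head(params["h0"], r), head(params["h1"], r)

def predict_factual(params, x, t):
    """Prediction from the head that matches each unit's observed treatment."""
    y0, y1 = predict_both(params, x)
    return np.where(t == 1, y1, y0)
```

Because a loss on `predict_factual` never touches the other head for a given unit, its gradient with respect to that head's parameters is zero for that unit, while both groups contribute to the shared representation.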
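Our second contribution is to explicitly account and adjust for the bias induced by treatment group imbalance. To this end, we seek a representation $\Phi$ and hypothesis $h$ that minimizes a trade-off between predictive accuracy and imbalance in the representation space, using the following objective:
$$\min_{h, \Phi,\ \|\Phi\| = 1}\ \frac{1}{n} \sum_{i=1}^{n} w_i \cdot L\left(h(\Phi(x_i), t_i), y_i\right) + \lambda \cdot \mathfrak{R}(h) + \alpha \cdot \text{IPM}_G\left(\{\Phi(x_i)\}_{i: t_i = 0}, \{\Phi(x_i)\}_{i: t_i = 1}\right), \quad (3)$$
with $w_i = \frac{t_i}{2u} + \frac{1 - t_i}{2(1 - u)}$, where $u = \frac{1}{n} \sum_{i=1}^{n} t_i$, and $\mathfrak{R}$ is a model complexity term.
Note that $u$ in the definition of $w_i$ is simply the proportion of treated units in the population. The weights $w_i$ compensate for the difference in treatment group size in our sample, see Theorem 1. $\text{IPM}_G(\cdot, \cdot)$ is the (empirical) integral probability metric defined by the function family $G$. For most IPMs, we cannot compute the factor $B_\Phi$ in Equation 2, but treat it as part of the hyperparameter $\alpha$. This makes our objective sensitive to the scaling of $\Phi$, even for a constant $\alpha$. We therefore normalize $\Phi$ through either projection or batch-normalization with fixed scale. We refer to the model minimizing (3) with $\alpha > 0$ as Counterfactual Regression (CFR) and the variant without balance regularization ($\alpha = 0$) as Treatment-Agnostic Representation Network (TARNet).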
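A minimal sketch of how the objective can be evaluated on one batch is given below. It assumes squared loss, sample weights of the form $w_i = t_i/(2u) + (1-t_i)/(2(1-u))$ with $u$ the fraction of treated units in the batch, and omits the weight-decay term. The IPM here is a simple stand-in, the distance between group means (a linear-kernel MMD), not the Wasserstein or RKHS MMD estimators; all function names are hypothetical.

```python
import numpy as np

def sample_weights(t):
    """Weights that give the treated and control groups equal total weight."""
    t = np.asarray(t, dtype=float)
    u = t.mean()
    if u <= 0.0 or u >= 1.0:
        raise ValueError("batch must contain both treated and control units")
    return t / (2.0 * u) + (1.0 - t) / (2.0 * (1.0 - u))

def linear_mmd(r_control, r_treated):
    """Stand-in IPM: Euclidean distance between the two group means."""
    return float(np.linalg.norm(r_control.mean(axis=0) - r_treated.mean(axis=0)))

def cfr_objective(y_pred, y, t, rep, alpha, ipm=linear_mmd):
    """Weighted factual squared loss plus alpha times the imbalance penalty."""
    t = np.asarray(t)
    w = sample_weights(t)
    factual = np.mean(w * (np.asarray(y_pred) - np.asarray(y)) ** 2)
    imbalance = ipm(rep[t == 0], rep[t == 1])
    return float(factual + alpha * imbalance)
```

Setting `alpha=0.0` recovers the objective without balance regularization, and passing a different `ipm` callable swaps in another distance without changing the rest of the computation.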
We train our models by minimizing (3). Both the prediction loss and the penalty term are computed for one mini-batch at a time. Details of how to obtain the gradient with respect to the empirical IPMs are in the supplement.
Evaluating causal inference algorithms is more difficult than many machine learning tasks, since for real-world data we rarely have access to the ground truth treatment effect. Existing literature mostly deals with this in two ways. One is by using synthetic or semi-synthetic datasets, where the outcome or treatment assignment are fully known; we use the semi-synthetic IHDP dataset from Hill (2011). The other is using real-world data from randomized controlled trials (RCT). The problem in using data from RCTs is that there is no imbalance between the treated and control distributions, making our method redundant. We partially overcome this problem by using the Jobs dataset from LaLonde (1986), which includes both a randomized and a non-randomized component. We use both for training, but can only use the randomized component for evaluation. This alleviates, but does not solve, the issue of a completely balanced dataset being unsuited for our method.
We evaluate our framework CFR, and its variant without balancing regularization (TARNet), in the task of estimating ITE and ATE. CFR is implemented as a feed-forward neural network with 3 fully-connected exponential-linear layers for the representation and 3 for the hypothesis. Layer sizes were 200 for all layers used for Jobs and 200 and 100 for the representation and hypothesis used for IHDP. The model is trained using Adam (Kingma & Ba, 2014). For an overview, see Figure 1. Layers corresponding to the hypothesis are regularized with a small weight decay. For continuous data we use mean squared loss and for binary data, we use log-loss. While our theory does not immediately apply to log-loss, we were curious to see how our model performs with it.
We compare our method to Ordinary Least Squares with treatment as a feature (OLS-1), OLS with separate regressors for each treatment (OLS-2), $k$-nearest neighbor ($k$-NN), Targeted Maximum Likelihood, which is a doubly robust method (TMLE) (Gruber & van der Laan, 2011), Bayesian Additive Regression Trees (BART) (Chipman et al., 2010; Chipman & McCulloch, 2016), Random Forests (Rand. For.) (Breiman, 2001), Causal Forests (Caus. For.) (Wager & Athey, 2015), as well as the Balancing Linear Regression (BLR) and Balancing Neural Network (BNN) by Johansson et al. (2016). For classification tasks we substitute Logistic Regression (LR) for OLS. Choosing hyperparameters for estimating PEHE is non-trivial; we detail our selection procedure, applied to all methods, in subsection C.1 of the supplement.
We evaluate our model in two different settings. One is within-sample, where the task is to estimate ITE for all units in a sample for which the (factual) outcome of one treatment is observed. This corresponds to the common scenario in which a cohort is selected once and not changed. This task is non-trivial, as we never observe the ITE for any unit. The other is the out-of-sample setting, where the goal is to estimate ITE for units with no observed outcomes. This corresponds to the case where a new patient arrives and the goal is to select the best possible treatment. Within-sample error is computed over both the training and validation sets, and out-of-sample error over the test set.
Hill (2011) compiled a dataset for causal effect estimation based on the Infant Health and Development Program (IHDP), in which the covariates come from a randomized experiment studying the effects of specialist home visits on future cognitive test scores. The treatment groups have been made imbalanced by removing a biased subset of the treated population. The dataset comprises 747 units (139 treated, 608 control) and 25 covariates measuring aspects of children and their mothers. We use the simulated outcome implemented as setting “A” in the NPCI package (Dorie, 2016). Following Hill (2011), we use the noiseless outcome to compute the true effect. We report the estimated (finite-sample) PEHE loss $\epsilon_{\text{PEHE}}$ (Eq. 1), and the absolute error in average treatment effect, $\epsilon_{\text{ATE}}$. The results of the experiments on IHDP are presented in Table 1 (left). We average over 1000 realizations of the outcomes with 63/27/10 train/validation/test splits.
We investigate the effects of increasing imbalance between the original treatment groups by constructing biased subsamples of the IHDP dataset. A logistic-regression propensity score model is fit to form estimates of the conditional treatment probability. Then, repeatedly, with probability we remove the remaining control observation that has closest to , and with probability , we remove a random control observation. The higher , the more imbalance. For each value of , we remove observations from each set, leaving .
The study by LaLonde (1986) is a widely used benchmark in the causal inference community, where the treatment is job training and the outcomes are income and employment status after training. This dataset combines a randomized study based on the National Supported Work program with observational data to form a larger dataset (Smith & Todd, 2005). The presence of the randomized subgroup gives a way to estimate the “ground truth” causal effect. The study includes 8 covariates such as age and education, as well as previous earnings. We construct a binary classification task, called Jobs, where the goal is to predict unemployment, using the feature set of Dehejia & Wahba (2002). Following Smith & Todd (2005), we use the LaLonde experimental sample (297 treated, 425 control) and the PSID comparison group (2490 control). There were 482 (15%) subjects unemployed by the end of the study. We average over 10 train/validation/test splits with ratios 56/24/20.
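Because all the treated subjects $T$ were part of the original randomized sample $E$, we can compute the true average treatment effect on the treated by $\text{ATT} = |T|^{-1} \sum_{i \in T} y_i - |C \cap E|^{-1} \sum_{i \in C \cap E} y_i$, where $C$ is the control group. We report the error $\epsilon_{\text{ATT}} = \left|\text{ATT} - \frac{1}{|T|} \sum_{i \in T} \left(f(x_i, 1) - f(x_i, 0)\right)\right|$. We cannot evaluate $\epsilon_{\text{PEHE}}$ on this dataset, since there is no ground truth for the ITE. Instead, in order to evaluate the quality of ITE estimation, we use a measure we call policy risk. The policy risk is defined as the average loss in value when treating according to the policy implied by an ITE estimator. In our case, for a model $f$, we let the policy be to treat, $\pi_f(x) = 1$, if $f(x, 1) - f(x, 0) > \lambda$, and to not treat, $\pi_f(x) = 0$, otherwise. The policy risk is $R_{\text{Pol}}(\pi_f) = 1 - \left(\mathbb{E}[Y_1 \mid \pi_f(x) = 1] \cdot p(\pi_f = 1) + \mathbb{E}[Y_0 \mid \pi_f(x) = 0] \cdot p(\pi_f = 0)\right)$, which we can estimate for the randomized trial subset of Jobs by $\hat{R}_{\text{Pol}}(\pi_f) = 1 - \left(\hat{\mathbb{E}}[y \mid \pi_f(x) = 1, t = 1] \cdot \hat{p}(\pi_f = 1) + \hat{\mathbb{E}}[y \mid \pi_f(x) = 0, t = 0] \cdot \hat{p}(\pi_f = 0)\right)$. See Figure 3 for risk as a function of treatment threshold $\lambda$, aligned by proportion of treated, and Table 1 for the risk when $\lambda = 0$.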
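A minimal sketch of a policy risk estimate on a randomized sample follows. It assumes outcomes `y` scaled to $[0, 1]$ with larger values being better, and estimates the value of a policy from the units whose randomized treatment happens to agree with the policy's recommendation. The function names, and the convention of returning zero value for an empty agreement group, are assumptions made for this illustration.

```python
import numpy as np

def policy_from_effect(tau_hat, threshold=0.0):
    """Treat (1) when the estimated effect exceeds the threshold, else do not (0)."""
    return (np.asarray(tau_hat, dtype=float) > threshold).astype(int)

def policy_risk(y, t, policy):
    """One minus the estimated value of the policy on a randomized sample."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(t)
    treat = np.asarray(policy) == 1
    p_treat = float(np.mean(treat))
    agree_treated = treat & (t == 1)      # policy says treat, unit was treated
    agree_control = (~treat) & (t == 0)   # policy says no treat, unit was control
    value_treated = y[agree_treated].mean() if agree_treated.any() else 0.0
    value_control = y[agree_control].mean() if agree_control.any() else 0.0
    return float(1.0 - (value_treated * p_treat + value_control * (1.0 - p_treat)))
```

Sweeping `threshold` in `policy_from_effect` traces out risk as a function of the proportion of units the policy would treat.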
We begin by noting that indeed imbalance confers an advantage to using the IPM regularization term, as our theoretical results indicate, see e.g. the results for CFR Wass ($\alpha > 0$) and TARNet ($\alpha = 0$) on IHDP in Table 1. We also see in Figure 2 that even for the harder case of increased imbalance ($q > 0$) between treated and control, the relative gain from using our method remains significant. On Jobs, we see a smaller gain from using IPM penalties than on IHDP. We believe this is the case because, while we are minimizing our bound over observational data and accounting for this bias, we are evaluating the predictions only on a randomized subset, where the treatment groups are distributed identically. For both IHDP and Jobs, non-linear estimators do significantly better than linear ones in terms of individual effect ($\epsilon_{\text{PEHE}}$ and policy risk, respectively). On the Jobs dataset, straightforward logistic regression does remarkably well in estimating the ATT. However, being a linear model, LR can only ascribe a uniform policy - in this case, “treat everyone”. The more nuanced policies offered by non-linear methods achieve lower policy risk in the case of Causal Forests and CFR. This emphasizes the fact that estimating average effect and individual effect can require different models. Specifically, while smoothing over many units may yield a good ATE estimate, this might significantly hurt ITE estimation. $k$-nearest neighbors has very good within-sample results on Jobs, because evaluation is performed over the randomized component, but suffers heavily in generalizing out of sample, as expected.
In this paper we give a meaningful and intuitive error bound for the problem of estimating individual treatment effect. Our bound relates ITE estimation to the classic machine learning problem of learning from finite samples, along with methods for measuring distributional distances from finite samples. The bound lends itself naturally to the creation of learning algorithms; we focus on using neural nets as representations and hypotheses. We apply our theory-guided approach to both synthetic and real-world tasks, showing that in every case our method matches or outperforms the state-of-the-art. Important open questions are theoretical considerations in choosing the IPM weight $\alpha$, how to best derive confidence intervals for our model's predictions, and how to integrate our work with more complicated causal models such as those with hidden confounding or instrumental variables.
We wish to thank Aahlad Manas for his assistance with the experiments. We also thank Jennifer Hill, Marco Cuturi, Esteban Tabak and Sanjong Misra for fruitful conversations, and Stefan Wager for his help with the code for Causal Forests. DS and US were supported by NSF CAREER award #1350965.
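We first define the necessary distributions and prove some simple results about them. We assume a joint distribution function $p(x, t, Y_0, Y_1)$, such that $(Y_1, Y_0) \perp t \mid x$, and $0 < p(t = 1 \mid x) < 1$ for all $x$. Recall that we assume Consistency, that is, we assume that we observe $y = Y_1$ if $t = 1$ and $y = Y_0$ if $t = 0$.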
Definition A1. The treatment effect for unit $x$ is:
$$\tau(x) := \mathbb{E}[Y_1 - Y_0 \mid x].$$
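We first show that under consistency and strong ignorability, the ITE function $\tau(x)$ is identifiable:
Lemma A1. We have:
$$\mathbb{E}[Y_1 - Y_0 \mid x] = \mathbb{E}[Y_1 \mid x] - \mathbb{E}[Y_0 \mid x] = \mathbb{E}[Y_1 \mid x, t = 1] - \mathbb{E}[Y_0 \mid x, t = 0] = \mathbb{E}[y \mid x, t = 1] - \mathbb{E}[y \mid x, t = 0].$$
The second equality holds because $Y_t$ and $t$ are independent conditioned on $x$, and the third follows from the consistency assumption. The final expression involves only observable quantities.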
Definition A2. Let $p^{t=1}(x) := p(x \mid t = 1)$ and $p^{t=0}(x) := p(x \mid t = 0)$ denote respectively the treatment and control distributions.
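Let $\Phi: \mathcal{X} \to \mathcal{R}$ be a representation function. We will assume that $\Phi$ is differentiable.
Assumption A1. The representation function $\Phi$ is one-to-one. Without loss of generality we will assume that $\mathcal{R}$ is the image of $\mathcal{X}$ under $\Phi$, and define $\Psi: \mathcal{R} \to \mathcal{X}$ to be the inverse of $\Phi$, such that $\Psi(\Phi(x)) = x$ for all $x \in \mathcal{X}$.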
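Definition A3. For a representation function $\Phi: \mathcal{X} \to \mathcal{R}$, and for a distribution $p$ defined over $\mathcal{X}$, let $p_\Phi$ be the distribution induced by $\Phi$ over $\mathcal{R}$. Define $p_\Phi^{t=1}(r) := p_\Phi(r \mid t = 1)$ and $p_\Phi^{t=0}(r) := p_\Phi(r \mid t = 0)$ to be the treatment and control distributions induced over $\mathcal{R}$.
For a one-to-one $\Phi$, the distribution $p_\Phi$ over $\mathcal{R} \times \{0,1\}$ can be obtained by the standard change of variables formula, using the determinant of the Jacobian of $\Psi(r)$. See (Ben-Israel, 1999) for the case of a mapping $\Phi$ between spaces of different dimensions.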
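Lemma A2. For all $r \in \mathcal{R}$, $t \in \{0,1\}$:
$$p_\Phi(t \mid r) = p(t \mid \Psi(r)), \qquad p(Y_t \mid r) = p(Y_t \mid \Psi(r)).$$
Proof. Let $J_\Psi(r)$ be the absolute value of the determinant of the Jacobian of $\Psi(r)$. Then
$$p_\Phi(t \mid r) = \frac{p_\Phi(t, r)}{p_\Phi(r)} \overset{(a)}{=} \frac{p(t, \Psi(r))\, J_\Psi(r)}{p(\Psi(r))\, J_\Psi(r)} = \frac{p(t, \Psi(r))}{p(\Psi(r))} = p(t \mid \Psi(r)),$$
where equality (a) is by the change of variable formula. The proof is identical for $p(Y_t \mid r)$. ∎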
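As a small numerical illustration of why the Jacobian factor cancels in the conditional distribution of treatment, the following sketch uses a one-dimensional toy model. Everything in it is an assumption chosen for illustration and does not come from the paper: the covariate densities are unit-variance Gaussians, the treated proportion is 0.3, and the representation is the invertible map $r = e^x$ with inverse $\log r$.

```python
import math

U = 0.3  # assumed proportion of treated units, p(t = 1)


def normal_pdf(x, mean, std=1.0):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def p_joint_x(x, t):
    """Toy joint density p(x, t): treated ~ N(1, 1), control ~ N(0, 1)."""
    return U * normal_pdf(x, 1.0) if t == 1 else (1.0 - U) * normal_pdf(x, 0.0)


def psi(r):
    """Inverse of the toy representation phi(x) = exp(x)."""
    return math.log(r)


def jac_psi(r):
    """|d psi / d r| for psi(r) = log(r), valid for r > 0."""
    return 1.0 / r


def p_joint_r(r, t):
    """Induced joint density over (r, t) via change of variables."""
    return p_joint_x(psi(r), t) * jac_psi(r)


def p_t_given_x(x, t):
    return p_joint_x(x, t) / (p_joint_x(x, 0) + p_joint_x(x, 1))


def p_t_given_r(r, t):
    # The Jacobian appears in numerator and denominator and cancels.
    return p_joint_r(r, t) / (p_joint_r(r, 0) + p_joint_r(r, 1))


for r in (0.2, 1.0, 5.0):
    print(r, p_t_given_r(r, 1), p_t_given_x(psi(r), 1))
```

The two printed probabilities agree at every point up to floating-point error, because the same Jacobian term multiplies both the joint and the marginal density in representation space.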
Let $L: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ be a loss function, e.g. the absolute loss or squared loss.
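Definition A4. Let $\Phi: \mathcal{X} \to \mathcal{R}$ be a representation function. Let $h: \mathcal{R} \times \{0,1\} \to \mathcal{Y}$ be a hypothesis defined over the representation space $\mathcal{R}$. The expected loss for the unit and treatment pair $(x, t)$ is:
$$\ell_{h,\Phi}(x, t) = \int_{\mathcal{Y}} L\left(Y_t, h(\Phi(x), t)\right) p(Y_t \mid x)\, dY_t.$$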
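Definition A5. The expected factual and counterfactual losses of $h$ and $\Phi$ are, respectively:
$$\epsilon_F(h, \Phi) = \int_{\mathcal{X} \times \{0,1\}} \ell_{h,\Phi}(x, t)\, p(x, t)\, dx\, dt,$$
$$\epsilon_{CF}(h, \Phi) = \int_{\mathcal{X} \times \{0,1\}} \ell_{h,\Phi}(x, t)\, p(x, 1 - t)\, dx\, dt.$$
When it is clear from the context, we will sometimes use $\epsilon_F(f)$ and $\epsilon_{CF}(f)$ for the expected factual and counterfactual losses of an arbitrary function $f: \mathcal{X} \times \{0,1\} \to \mathcal{Y}$.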
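Definition A6. The expected treated and control losses are:
$$\epsilon_F^{t=1}(h, \Phi) = \int_{\mathcal{X}} \ell_{h,\Phi}(x, 1)\, p^{t=1}(x)\, dx, \qquad \epsilon_F^{t=0}(h, \Phi) = \int_{\mathcal{X}} \ell_{h,\Phi}(x, 0)\, p^{t=0}(x)\, dx,$$
$$\epsilon_{CF}^{t=1}(h, \Phi) = \int_{\mathcal{X}} \ell_{h,\Phi}(x, 1)\, p^{t=0}(x)\, dx, \qquad \epsilon_{CF}^{t=0}(h, \Phi) = \int_{\mathcal{X}} \ell_{h,\Phi}(x, 0)\, p^{t=1}(x)\, dx.$$
The four losses above are simply the loss conditioned on either the control or treated set. Let $u := p(t = 1)$ be the proportion of treated in the population. We then have the immediate result:
Lemma A3.
$$\epsilon_F(h, \Phi) = u \cdot \epsilon_F^{t=1}(h, \Phi) + (1 - u) \cdot \epsilon_F^{t=0}(h, \Phi),$$
$$\epsilon_{CF}(h, \Phi) = (1 - u) \cdot \epsilon_{CF}^{t=1}(h, \Phi) + u \cdot \epsilon_{CF}^{t=0}(h, \Phi).$$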
Definition A7. Let $G$ be a function family consisting of functions $g: \mathcal{S} \to \mathbb{R}$. For a pair of distributions $p_1$, $p_2$ over $\mathcal{S}$, define the Integral Probability Metric:
$$\mathrm{IPM}_G(p_1, p_2) = \sup_{g \in G} \left| \int_{\mathcal{S}} g(s) \left(p_1(s) - p_2(s)\right) ds \right|.$$
$\mathrm{IPM}_G(\cdot, \cdot)$ defines a pseudo-metric on the space of probability functions over $\mathcal{S}$, and for sufficiently large function families, $\mathrm{IPM}_G(\cdot, \cdot)$ is a proper metric (Müller, 1997). Examples of sufficiently large function families include the set of bounded continuous functions, the set of 1-Lipschitz functions, and the set of unit norm functions in a universal Reproducing Kernel Hilbert Space. The latter two give rise to the Wasserstein and Maximum Mean Discrepancy metrics, respectively (Gretton et al., 2012; Sriperumbudur et al., 2012). We note that for function families $G$ such as the three mentioned above, for which $g \in G \implies -g \in G$, the absolute value can be omitted from Definition A7.
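To make the two named metrics concrete, here is a minimal finite-sample sketch in plain Python. It is an illustration under stated assumptions, not the estimator used in the paper's experiments: `mmd_squared` is the standard biased (V-statistic) estimate of the squared MMD with a Gaussian RBF kernel whose bandwidth `sigma` is chosen arbitrarily, and `wasserstein_1d` covers only the special case of one-dimensional samples of equal size, where the Wasserstein-1 distance reduces to the mean absolute difference of sorted values. All function names are hypothetical.

```python
import math


def rbf_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel between two points given as tuples."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, sigma) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance for 1-D samples of equal size (assumed)."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)


# Toy "treated" and "control" representations (made-up numbers).
treated = [(0.0,), (1.0,)]
control = [(3.0,), (4.0,)]
print(mmd_squared(treated, control))
print(wasserstein_1d([r[0] for r in treated], [r[0] for r in control]))
```

Both quantities are zero when the two samples coincide and grow as the treated and control samples move apart, which is the behaviour a balancing penalty relies on.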
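We now state and prove the most important technical lemma of this section.
Lemma A4. Let $\Phi: \mathcal{X} \to \mathcal{R}$ be an invertible representation with $\Psi$ its inverse. Let $p_\Phi^{t=1}$, $p_\Phi^{t=0}$ be defined as in Definition A3. Let $u = p(t = 1)$. Let $G$ be a family of functions $g: \mathcal{R} \to \mathbb{R}$, and denote by $\mathrm{IPM}_G(\cdot, \cdot)$ the integral probability metric induced by $G$. Let $h: \mathcal{R} \times \{0,1\} \to \mathcal{Y}$ be a hypothesis. Assume there exists a constant $B_\Phi > 0$, such that for $t = 0, 1$, the function $g_{\Phi,h}(r, t) := \frac{1}{B_\Phi} \cdot \ell_{h,\Phi}(\Psi(r), t) \in G$. Then we have:
$$\epsilon_{CF}(h, \Phi) \le (1 - u)\, \epsilon_F^{t=1}(h, \Phi) + u\, \epsilon_F^{t=0}(h, \Phi) + B_\Phi \cdot \mathrm{IPM}_G\left(p_\Phi^{t=1}, p_\Phi^{t=0}\right).$$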