1 Introduction
Learning individual-level causal effects is concerned with learning how units of interest respond to interventions or treatments. These could be medications prescribed to particular patients, training programs offered to job seekers, or educational courses for students. Ideally, such causal effects would be estimated from randomized controlled trials, but in many cases, such trials are unethical or expensive: researchers cannot randomly prescribe smoking to assess health risks. Observational data offers an alternative, typically with larger sample sizes, lower costs, and more relevance to the target population. However, the price we pay for using observational data is lower certainty in our causal estimates, due to the possibility of unmeasured confounding and to the measured and unmeasured differences between the populations that received different treatments.
Progress in learning individual-level causal effects is being accelerated by deep learning approaches to causal inference
(Johansson et al., 2016; Louizos et al., 2017; Atan et al., 2018; Shi et al., 2019a). Such neural networks can be used to learn causal effects from observational data, but current deep learning tools for causal inference cannot yet indicate when they are unfamiliar with a given data point. For example, a system may offer a patient a recommendation even though it may not have learned from data belonging to anyone with similar age or gender as the patient, or it may have never observed someone like this patient receive a specific treatment before. In the language of machine learning and causal inference, the first example corresponds to a
covariate shift, and the second example corresponds to a violation of the overlap assumption, also known as positivity. When a system experiences either covariate shift or violations of overlap, the recommendation would be uninformed and could lead to undue stress, financial burden, false hope, or worse. In this paper, we explain and examine how covariate shift and violations of overlap are concerns for the real-world use of learning conditional average treatment effects (CATE) from observational data and why deep learning systems should indicate their lack of confidence when these phenomena are encountered, and we develop a new and principled approach to incorporating uncertainty estimation into the design of systems for CATE inference.
First, we reformulate the lack of overlap at test time as an instance of covariate shift, allowing us to address both problems with one methodology. When an observation lacks overlap, the model predicts the outcome for a treatment
that has probability zero or near-zero under the training distribution. We extend the Causal-Effect Variational Autoencoder (CEVAE)
(Louizos et al., 2017) by introducing a method for out-of-distribution (OoD) training, negative sampling, to model uncertainty on OoD inputs. Negative sampling is effective and theoretically justified but usually intractable (Hafner et al., 2018). Our insight is that it becomes tractable for addressing non-overlap since the distribution of test-time inputs is known: it equals the training distribution but with a different choice of treatment (for example, if at training we observe outcome $y$ for patient $x$ only under treatment $t = 0$, then we know that the outcome for $(x, t = 1)$ should be uncertain). This can be seen as a special case of transductive learning (Vapnik, 1999, Ch. 9). For addressing covariate shift in the inputs $x$, negative sampling remains intractable as the new covariate distribution is unknown; however, it has been shown in non-causal applications that Bayesian parameter uncertainty captures “epistemic” uncertainty, which can indicate covariate shift (Kendall and Gal, 2017). We therefore propose to treat the decoder in CEVAE as a Bayesian neural network able to capture epistemic uncertainty.
In addition to casting lack of overlap as a distribution shift problem and proposing an OoD training methodology for the CEVAE model, we further extend the modeling of epistemic uncertainty to a range of state-of-the-art neural models including TARNet, CFRNet (Shalit et al., 2017), and Dragonnet (Shi et al., 2019b), developing a practical Bayesian counterpart to each. We demonstrate that, by excluding test points with high epistemic uncertainty at test time, we outperform baselines that use the propensity score to exclude points that violate overlap. This result holds across different state-of-the-art architectures on the causal inference benchmarks IHDP (Hill, 2011) and ACIC (Shimoni et al., 2018).
Leveraging uncertainty for exclusion ties into causal inference practice, where a large number of overlap-violating points must often be discarded or submitted for further scrutiny (Rosenbaum and Rubin, 1983; Imbens, 2004; Crump et al., 2009; Imbens and Rubin, 2015; Hernan and Robins, 2010). Finally, we introduce a new semi-synthetic benchmark dataset, CEMNIST, to explore the problem of non-overlap in high-dimensional settings.
2 Background
Classic machine learning is concerned with functions that map an input (e.g. an image) to an output (e.g. “is a person”). The specific function for a given task is typically chosen by an algorithm that minimizes a loss between the outputs and targets over a dataset of input covariates and output targets. Causal effect estimation differs in that, for each input $x$, there is a corresponding treatment $t \in \{0, 1\}$ and two potential outcomes $Y^0$, $Y^1$, one for each choice of treatment (Rubin, 2005). In this work, we are interested in the Conditional Average Treatment Effect (CATE):
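(1)  $\mathrm{CATE}(x) = \mathbb{E}[Y^1 - Y^0 \mid X = x]$
(2)  $\phantom{\mathrm{CATE}(x)} = \mu^1(x) - \mu^0(x), \qquad \mu^t(x) := \mathbb{E}[Y^t \mid X = x],$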
where the expectations arise because the outcome is non-deterministic. Under the assumption of ignorability conditioned on $x$ (or no-hidden confounding), which we make in this paper, we have that $\mathbb{E}[Y^a \mid X = x] = \mathbb{E}[y \mid X = x, t = a]$, thus opening the way to estimate CATE from observational data (Imbens and Rubin, 2015). Specifically, we are motivated by cases where $x$ is high-dimensional, for example, a patient’s entire medical record, in which case we can think of the CATE as representing an individual-level causal effect. Though the specific meaning of a CATE measurement depends on context, in general, a positive value indicates that an individual with covariates $x$ will have a positive response to treatment, a negative value indicates a negative response, and a value of zero indicates that the treatment will have no effect on such an individual.
The fundamental problem of learning to infer CATE from an observational dataset $\mathcal{D} = \{(x_i, y_i, t_i)\}_{i=1}^{N}$ is that only the factual outcome $y = Y^t$ corresponding to the treatment $t$ can be observed. Because the counterfactual outcome $Y^{1-t}$ is never observed, it is difficult to learn a function for $\mathrm{CATE}(x)$ directly. Instead, a standard approach is either to treat $t$ as an additional covariate (Gelman and Hill, 2007) or to focus on learning functions for $\mu^0(x)$ and $\mu^1(x)$ using the observed $y$ in $\mathcal{D}$ as targets (Shalit et al., 2017; Louizos et al., 2017; Shi et al., 2019a).
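The second approach (one outcome function per treatment arm, trained only on factual outcomes) can be sketched as follows. This is a minimal illustration rather than any of the models used in this paper: it assumes a single scalar covariate, a linear outcome model fit by ordinary least squares, and synthetic data whose true effect is 2 everywhere. The names `fit_line` and `t_learner_cate` are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a scalar covariate."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def t_learner_cate(data):
    """Fit one outcome model per treatment arm from factual outcomes only,
    then return a function x -> mu1(x) - mu0(x)."""
    arms = {0: ([], []), 1: ([], [])}
    for x, t, y in data:
        arms[t][0].append(x)
        arms[t][1].append(y)
    a0, b0 = fit_line(*arms[0])
    a1, b1 = fit_line(*arms[1])
    return lambda x: (a1 + b1 * x) - (a0 + b0 * x)

# Synthetic observational data (an assumption of this sketch): true CATE is 2.
rng = random.Random(0)
data = []
for _ in range(2000):
    x = rng.uniform(-1.0, 1.0)
    t = 1 if rng.random() < 0.5 else 0       # treatment assignment
    y = x + 2.0 * t + rng.gauss(0.0, 0.1)    # only the factual outcome is recorded
    data.append((x, t, y))

cate_hat = t_learner_cate(data)
```

Calling `cate_hat(0.3)` returns the difference between the two fitted outcome functions at that covariate value; the counterfactual outcome of any individual unit is never used.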
2.1 Epistemic uncertainty and covariate shift
In probabilistic modelling, predictions may be assumed to come from a graphical model: a distribution over outputs (the likelihood) given a single set of parameters $\omega$. Considering a binary label $y$ given an input $x$, a neural network can be described as a function defining the likelihood $p(y = 1 \mid x, \omega)$, with parameters $\omega$ defining the network weights. Different draws $\omega$ from a distribution over parameters would then correspond to different neural networks, i.e. functions from $x$ to $y$ (e.g. the blue curves in Fig. 1 (left)).
For parametric models such as neural networks (NNs), we treat the weights $\omega$ as random variables and, with a chosen prior distribution $p(\omega)$, aim to infer the posterior distribution $p(\omega \mid \mathcal{D})$ given the training data $\mathcal{D}$. The blue curves in Figure 1(a) are individual NNs sampled from the posterior of such a Bayesian neural network (BNN). Bayesian inference can be performed by marginalizing the likelihood function over the posterior in order to obtain the posterior predictive probability $p(y \mid x, \mathcal{D}) = \int p(y \mid x, \omega)\, p(\omega \mid \mathcal{D})\, d\omega$. This marginalization is intractable for BNNs in practice, so variational inference is commonly used as a scalable approximate inference technique, for example, by sampling the weights from a Dropout approximate posterior $q(\omega \mid \mathcal{D})$ (Gal and Ghahramani, 2016).
Figure 1(a) illustrates the effects of a BNN’s parameter uncertainty (shaded region). While all function samples (shown in blue) agree with each other for in-distribution inputs, these functions make disagreeing predictions for inputs that lie out-of-distribution (OoD) with respect to the training distribution. This is an example of covariate shift.
To avoid overconfident erroneous extrapolations on such OoD examples, we would like to indicate that the prediction is uncertain. This epistemic uncertainty stems from a lack of data, not from measurement noise (also called aleatoric uncertainty). Epistemic uncertainty about the random variable (r.v.) $y$ can be quantified in various ways. For classification tasks, a popular information-theoretic measure is the information gained about the r.v. $\omega$ if the label $y$ were observed for a new data point $x$, given the training dataset $\mathcal{D}$ (Houlsby et al., 2011). This is captured by the mutual information between $\omega$ and $y$, given by
(3)  $\mathcal{I}(\omega; y \mid \mathcal{D}, x) = \mathcal{H}[y \mid x, \mathcal{D}] - \mathbb{E}_{p(\omega \mid \mathcal{D})}\big[\mathcal{H}[y \mid x, \omega]\big],$
where $\mathcal{H}[\cdot]$ is the entropy of a given r.v. For regression tasks, it is common to measure how the r.v. $\mu_{\omega}(x) := \mathbb{E}[y \mid x, \omega]$ varies when marginalizing over $\omega$:
(4)  $\mathrm{Var}_{p(\omega \mid \mathcal{D})}\big[\mu_{\omega}(x)\big].$
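Both measures can be estimated from a handful of sampled networks. The sketch below assumes we already have, for one input, the predictions of K parameter samples; the numbers are made up to contrast an in-distribution input (samples agree) with an OoD input (samples disagree), and the function names are hypothetical.

```python
import math

def entropy(p):
    """Entropy in nats of a Bernoulli variable with P(y = 1) = p."""
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)

def mutual_information(probs):
    """Monte Carlo estimate of the mutual information between the label and
    the parameters: entropy of the mean prediction minus the mean entropy.
    `probs` holds P(y = 1 | x, w_k) for K sampled parameter settings w_k."""
    mean_p = sum(probs) / len(probs)
    return entropy(mean_p) - sum(entropy(p) for p in probs) / len(probs)

def predictive_mean_variance(means):
    """Regression analogue: variance of the predicted mean across samples."""
    m = sum(means) / len(means)
    return sum((v - m) ** 2 for v in means) / len(means)

in_dist = [0.90, 0.91, 0.89, 0.90]  # sampled networks agree
ood = [0.05, 0.95, 0.10, 0.90]      # sampled networks disagree
mi_in = mutual_information(in_dist)
mi_ood = mutual_information(ood)
```

Here `mi_ood` is far larger than `mi_in`, even though every individual network is confident on the OoD input; the signal comes from disagreement between samples, not from any single prediction.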
3 Non-overlap as a covariate shift problem
Standard causal inference tasks, under the assumption of ignorability conditioned on $x$, usually deal with estimating both $\mu^0(x)$ and $\mu^1(x)$. Overlap is usually assumed as a means to address this problem. The overlap assumption (also known as common support or positivity) states that there exists $0 < \eta < 0.5$ such that the propensity score $p(t = 1 \mid x)$ satisfies:
(5)  $\eta < p(t = 1 \mid x) < 1 - \eta,$
i.e., that for every $x \sim p(x)$, we have a non-zero probability of observing its outcome under $t = 1$ as well as under $t = 0$. This version is sometimes called strict overlap; see D’Amour et al. (2017) for discussion. When overlap does not hold for some $x$, we might lack data to estimate either $\mu^0(x)$ or $\mu^1(x)$; this is the case in the grey shaded areas in Figure 1(c).
Overlap is a central assumption in causal inference (Rosenbaum and Rubin, 1983; Imbens, 2004). Nonetheless, it is usually not satisfied for all units in a given observational dataset (Rosenbaum and Rubin, 1983; Imbens, 2004; Crump et al., 2009; Imbens and Rubin, 2015; Hernan and Robins, 2010). It is even harder to satisfy for high-dimensional data such as images and comprehensive demographic data (D’Amour et al., 2017) where neural networks are used in practice (Goodfellow et al., 2016).
Since overlap must be assumed for most causal inference methods, an enormously popular practice is “trimming”: removing the data points for which overlap is not satisfied before training (Hernan and Robins, 2010; Fogarty et al., 2016; Shi et al., 2019a; King and Nielsen, 2019; D’Agostino Jr, 1998). In practice, points are trimmed when they have a propensity close to 0 or 1, as predicted by a trained propensity model. The average treatment effect (ATE) is then calculated by averaging the estimated effects over the remaining training points.
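Trimming itself is a few lines of code once propensity estimates are available. The sketch below is illustrative only: the thresholds 0.05 and 0.95, the propensity estimates, and the outcome predictions are all made-up values, and `trim_by_propensity` and `ate_over` are hypothetical helper names.

```python
def trim_by_propensity(propensities, low=0.05, high=0.95):
    """Indices of units whose estimated propensity lies within [low, high]."""
    return [i for i, e in enumerate(propensities) if low <= e <= high]

def ate_over(indices, mu0, mu1):
    """Average of the predicted effect mu1 - mu0 over the retained units."""
    return sum(mu1[i] - mu0[i] for i in indices) / len(indices)

e_hat = [0.01, 0.30, 0.50, 0.99, 0.70]  # hypothetical propensity estimates
mu0 = [0.0, 1.0, 1.0, 5.0, 2.0]         # hypothetical predictions under t = 0
mu1 = [9.0, 2.0, 3.0, 5.0, 3.0]         # hypothetical predictions under t = 1

kept = trim_by_propensity(e_hat)        # units 0 and 3 are trimmed
ate = ate_over(kept, mu0, mu1)
```

Note that the result depends directly on the two thresholds, which is one of the ad hoc choices that an uncertainty-based rejection rule aims to avoid.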
However, trimming has a different implication when estimating the CATE for each unit with covariates $x$: it means that for some units a CATE estimate is not given. If we think of CATE as a tool for recommending treatment assignment, a trimmed unit receives no treatment recommendation. This reflects the uncertainty in estimating one of the potential outcomes for this unit, since treatment was rarely (if ever) given to similar units. In what follows, we will explore how trimming can be replaced with more data-efficient rejection methods which are specifically focused on assessing the level of uncertainty in estimating the expected outcomes for $x$ under both treatment options.
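Our model of the CATE is:
(6)  $\widehat{\mathrm{CATE}}_{\omega}(x) = \mu^1_{\omega}(x) - \mu^0_{\omega}(x),$
where $\mu^t_{\omega}(x)$ is the network's prediction of $\mathbb{E}[Y^t \mid X = x]$ under parameters $\omega$.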
In Figure 1, we illustrate that lack of overlap constitutes a covariate shift problem. When $p(t = 1 \mid x) \approx 0$, we face a covariate shift for $\mu^1_{\omega}(x)$ (because $p(t = 1) > 0$ would imply by Bayes rule $p(x \mid t = 1) \approx 0$). When $p(t = 1 \mid x) \approx 1$, we face a covariate shift for $\mu^0_{\omega}(x)$, and when $p(x) \approx 0$, we face a covariate shift for the CATE (“out of distribution” in the figure). With this understanding, we can deploy tools for epistemic uncertainty to address both covariate shift and non-overlap simultaneously.
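The Bayes-rule step in the parenthetical can be written out explicitly. The notation $p(x \mid t)$ for the covariate distribution within a treatment arm is an assumption of this sketch; the outcome function for arm $t = 1$ is trained only on samples from $p(x \mid t = 1)$, which is why a vanishing propensity places $x$ outside that arm's training distribution.

```latex
p(x \mid t = 1) \;=\; \frac{p(t = 1 \mid x)\, p(x)}{p(t = 1)},
\qquad\text{so, for } p(t = 1) > 0,\qquad
p(t = 1 \mid x) \approx 0 \;\Longrightarrow\; p(x \mid t = 1) \approx 0 .
```

The symmetric statement for the control arm follows by replacing $t = 1$ with $t = 0$ and using $p(t = 0 \mid x) = 1 - p(t = 1 \mid x)$.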
3.1 Epistemic uncertainty in CATE
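To the best of our knowledge, uncertainty in high-dimensional CATE (i.e. where each value of $x$ is only expected to be observed at most once) has not been previously addressed. $\mathrm{CATE}(x)$ can be seen as the first moment of the random variable $Y^1 - Y^0$ given $X = x$. Here, we extend this notion and examine the second moment, the variance, which we can decompose into its aleatoric and epistemic parts by using the law of total variance:
(7)  $\mathrm{Var}[Y^1 - Y^0 \mid x, \mathcal{D}] = \mathbb{E}_{p(\omega \mid \mathcal{D})}\big[\mathrm{Var}[Y^1 - Y^0 \mid x, \omega]\big] + \mathrm{Var}_{p(\omega \mid \mathcal{D})}\big[\mathbb{E}[Y^1 - Y^0 \mid x, \omega]\big].$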
The second term on the r.h.s. is $\mathrm{Var}_{p(\omega \mid \mathcal{D})}[\widehat{\mathrm{CATE}}_{\omega}(x)]$, the variance of the modeled CATE across parameter draws. It measures the epistemic uncertainty in CATE since it only stems from the disagreement between predictions for different values of the parameters, not from noise in $Y^1$ and $Y^0$. We will use this uncertainty in our methods and estimate it directly by sampling from the approximate posterior $q(\omega \mid \mathcal{D})$. The first term on the r.h.s. is the expected aleatoric uncertainty, which is disregarded in CATE estimation (but could be relevant elsewhere).
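Referring back to Figure 1(c), when overlap is not satisfied for $x$, $\mathrm{Var}_{\omega}[\widehat{\mathrm{CATE}}_{\omega}(x)]$ is large because at least one of $\mathrm{Var}_{\omega}[\mu^0_{\omega}(x)]$ and $\mathrm{Var}_{\omega}[\mu^1_{\omega}(x)]$ is large. Similarly, under regular covariate shift ($p(x) \approx 0$), both will be large.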
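Estimating the epistemic term by sampling amounts to taking the variance of the predicted effect across parameter samples. A minimal sketch, assuming the per-sample predictions for each arm have already been computed (the numbers below are invented, and `cate_epistemic_variance` is a hypothetical name):

```python
from statistics import pvariance

def cate_epistemic_variance(mu0_samples, mu1_samples):
    """Variance, across parameter samples, of the predicted effect
    mu1_w(x) - mu0_w(x) at a single input x."""
    return pvariance([m1 - m0 for m0, m1 in zip(mu0_samples, mu1_samples)])

# Hypothetical sampled predictions at two test points.
# With overlap: both arms are well constrained, so samples barely move.
with_overlap = cate_epistemic_variance(
    [1.0, 1.1, 0.9, 1.0], [3.0, 3.2, 2.8, 3.1])
# No treated units near x: the treated-arm predictions disagree strongly.
no_treated = cate_epistemic_variance(
    [1.0, 1.1, 0.9, 1.0], [0.0, 6.0, 3.0, -2.0])
```

A single poorly constrained arm is enough to inflate the variance of the difference, which is the behaviour described for the grey regions of the figure.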
4 Adapting neural causal models for covariate shift
4.1 Parameter uncertainty
To obtain the epistemic uncertainty in the CATE, we must infer the parameter uncertainty distribution conditioned on the training data, $p(\omega \mid \mathcal{D})$, which defines the distribution of each network $\mu^0_{\omega}$, $\mu^1_{\omega}$, conditioned on $\mathcal{D}$. There exists a large suite of methods we can leverage for this task, surveyed in Gal (2016). Here, we use MC Dropout (Gal and Ghahramani, 2016) because of its high scalability (Tran et al., 2019), ease of implementation, and state-of-the-art performance (Filos et al., 2019). However, our contributions are compatible with other approximate inference methods. We can adapt almost all neural causal inference methods we know. CEVAE (Louizos et al., 2017), however, is more complicated and will be addressed in the next section.
MC Dropout is a simple change to existing methods. Gal and Ghahramani (2016) showed that we can simply add dropout (Srivastava et al., 2014) with L2 regularization in each of $\mu^0_{\omega}$, $\mu^1_{\omega}$ during training and then sample from the same dropout distribution at test time to get samples from $q(\omega \mid \mathcal{D})$. With tuning of the dropout probability, this is equivalent to sampling from a Bernoulli approximate posterior (with standard Gaussian prior). MC Dropout has been used in various applications (Zhu and Laptev, 2017; McAllister et al., 2017; Jungo et al., 2017).
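The test-time half of MC Dropout can be illustrated without a deep learning framework. The sketch below is not the training procedure: it takes a tiny one-hidden-layer ReLU network with fixed, randomly drawn weights standing in for trained ones, keeps dropout active at prediction time, and uses the spread of the stochastic forward passes as the epistemic variance. All names and values are assumptions of this example.

```python
import random

def mc_dropout_predict(x, w_hidden, w_out, p_drop, n_samples, rng):
    """Run n_samples stochastic forward passes of a one-hidden-layer ReLU
    network, drawing a fresh Bernoulli dropout mask on the hidden units
    for every pass."""
    keep = 1.0 - p_drop
    outputs = []
    for _ in range(n_samples):
        y = 0.0
        for w_h, w_o in zip(w_hidden, w_out):
            h = max(0.0, w_h * x)                       # ReLU hidden unit
            mask = 1.0 if rng.random() < keep else 0.0  # dropout stays on
            y += w_o * h * mask / keep                  # inverted-dropout scaling
        outputs.append(y)
    return outputs

rng = random.Random(0)
w_hidden = [rng.gauss(0, 1) for _ in range(32)]  # stand-ins for trained weights
w_out = [rng.gauss(0, 1) for _ in range(32)]

samples = mc_dropout_predict(1.5, w_hidden, w_out, p_drop=0.5, n_samples=100, rng=rng)
mean = sum(samples) / len(samples)
epistemic_var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Each pass corresponds to one draw of the parameters from the dropout distribution; with `p_drop=0.0` every pass returns the same value and the estimated variance collapses to zero.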
4.2 Bayesian CEVAE
The Causal Effect Variational Autoencoder (CEVAE, Louizos et al. (2017)) was introduced as a means to relax the common assumption that the data points contain accurate measurements of all confounders; instead, it assumes that the observed $x$ are a noisy transformation of some true confounders $z$, whose conditional distribution can nonetheless be recovered. To do so, CEVAE encodes each observation $(x, t, y)$ into a distribution over latent confounders $z$ and reconstructs the entire observation with a decoder network. For each possible value of $t \in \{0, 1\}$, there is a separate branch of the model. For each branch $j$, the encoder has an auxiliary distribution $q(y \mid x, t = j)$ to approximate the posterior $q(z \mid x, t, y)$ at test time. It additionally has a single auxiliary distribution $q(t \mid x)$ which generates $t$. See Figure 2 in Louizos et al. (2017) for an illustration. The decoder reconstructs the entire observation, so it learns the three components of $p(x, t, y \mid z) = p(x \mid z)\, p(t \mid z)\, p(y \mid t, z)$. We will omit the parameters of these distributions to ease our notation. The encoder parameters are summarized as $\psi$ and the decoder parameters as $\omega$.
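If the treatment and outcome were known at test time, the training objective (ELBO) would be
(8)  $\mathcal{L} = \sum_{i=1}^{N} \mathbb{E}_{q(z_i \mid x_i, t_i, y_i)}\big[\log p(x_i, t_i \mid z_i) + \log p(y_i \mid t_i, z_i)\big] - \mathrm{KL}\big(q(z_i \mid x_i, t_i, y_i) \,\|\, p(z_i)\big),$
where $\mathrm{KL}$ is the Kullback-Leibler divergence. However, $t$ and $y$ need to be predicted at test time, so CEVAE learns the two additional distributions by using the objective
(9)  $\mathcal{F} = \mathcal{L} + \sum_{i=1}^{N} \big(\log q(t_i = t_i^{*} \mid x_i) + \log q(y_i = y_i^{*} \mid x_i, t_i^{*})\big),$
where a star indicates that the variable is only observed at training time. At test time, we calculate the CATE, so $t$ is set to $0$ and $1$ for the corresponding branch and $y$ is sampled both times.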
Although the encoder performs Bayesian inference to infer $z$, CEVAE does not model epistemic uncertainty because the decoder lacks a distribution over $\omega$. The recently introduced Bayesian Variational Autoencoder (Daxberger and Hernández-Lobato, 2019) attempts to model such epistemic uncertainty in VAEs using MCMC sampling. We adapt their model for causal inference by inferring an approximate posterior $q(\omega \mid \mathcal{D})$. In practice, this is again a simple change if we use Monte Carlo (MC) Dropout in the decoder¹. This is implemented by adding dropout layers to the decoder and adding a term $\mathrm{KL}(q(\omega \mid \mathcal{D}) \,\|\, p(\omega))$ to (9), where $p(\omega)$ is standard Gaussian. Furthermore, the expectation in (8) now goes over the joint posterior $q(z \mid x, t, y)\, q(\omega \mid \mathcal{D})$ by performing stochastic forward passes with Dropout ‘turned on’.
¹ We do not treat the parameters of the encoder distributions as random variables. This is because the encoder does not infer $z$ directly. Instead, it parameterizes the parameters of a Gaussian posterior over $z$ (see eq. (5) in Louizos et al. (2017) for details). These parameters specify the uncertainty over $z$ themselves.
Negative sampling for non-overlap.
Negative sampling is a powerful method for modeling uncertainty under a covariate shift by adding loss terms that penalize confident predictions on inputs sampled outside the training distribution (Sun et al., 2019; Lee et al., 2017; Hafner et al., 2018; Hendrycks et al., 2018; Rowley et al., 1998). However, it is usually intractable because the input space is high-dimensional. Our insight is that it becomes tractable for non-overlap, because the OoD inputs are created by simply flipping $t$ on the in-distribution inputs $(x, t)$ to create the new inputs $(x, t' = 1 - t)$. Our negative sampling is implemented by mapping each $x$ through both branches of the encoder. On the counterfactual branch, where $t' = 1 - t$, we only minimize the KL divergence from the posterior over $z$ to the prior $p(z)$, but none of the other terms in (9). This is to encode that we have no information on the counterfactual prediction. Figure 2 illustrates the effect that negative sampling has on epistemic uncertainty measurements. Training with negative sampling (Figure 2(a)) leads to higher epistemic uncertainty estimates for non-overlap and out-of-distribution examples, as well as sharper transitions between in-distribution and out-of-distribution examples, when compared to training without negative sampling (Figure 2(b)). In Appendix C we study negative sampling and demonstrate improved uncertainty.
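The structure of this loss can be sketched in a few lines. The example below is a loose illustration under several assumptions: the approximate posterior is a diagonal Gaussian, the prior is standard normal, the two branches are weighted equally, and the encoder and the factual likelihood terms are replaced by toy stand-in functions (`toy_encode`, `toy_factual_terms`). None of these details are specified by the text above.

```python
import math

def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, exp(log_var)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mean, log_var))

def negative_sampling_loss(encode, x, t, factual_terms):
    """Per-example loss: all terms on the factual branch t, but only the KL
    term on the counterfactual branch 1 - t.

    encode(x, t) -> (mean, log_var) of the approximate posterior over z.
    factual_terms(x, t) -> negated likelihood terms, factual branch only.
    """
    kl_factual = kl_to_standard_normal(*encode(x, t))
    kl_counterfactual = kl_to_standard_normal(*encode(x, 1 - t))  # flipped t
    return factual_terms(x, t) + kl_factual + kl_counterfactual

# Toy stand-ins for a trained encoder and for the factual likelihood terms.
def toy_encode(x, t):
    return ([0.5 * x + t], [0.0])

def toy_factual_terms(x, t):
    return 1.0

loss = negative_sampling_loss(toy_encode, x=2.0, t=0,
                              factual_terms=toy_factual_terms)
```

Because the counterfactual branch contributes no likelihood term, the only gradient it receives pulls its posterior toward the prior, which is how the absence of information about the flipped treatment is encoded.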
5 Related work
Epistemic uncertainty is modeled out of the box by non-parametric Bayesian methods such as Gaussian Processes (GPs) (Rasmussen, 2003) and Bayesian Additive Regression Trees (BART) (Chipman et al., 2010). Various non-parametric models have been applied to causal inference (Alaa and van der Schaar, 2017; Chipman et al., 2010; Zhang et al., 2020; Hill, 2011; Wager and Athey, 2018). However, recent state-of-the-art results for high-dimensional data have been dominated by neural network approaches (Johansson et al., 2016; Louizos et al., 2017; Atan et al., 2018; Shi et al., 2019a). Since these do not incorporate epistemic uncertainty out of the box, our extensions are meant to fill this gap in the literature.
Causal effects are usually estimated after discarding/rejecting points that violate overlap, using the estimated propensity score (Crump et al., 2009; Hernan and Robins, 2010; Fogarty et al., 2016; Shi et al., 2019a; King and Nielsen, 2019; D’Agostino Jr, 1998). This process is cumbersome, and results are often sensitive to a large number of ad hoc choices (Hill et al., 2011) which can be avoided with our methods. Hill and Su (2013) proposed alternative heuristics for discarding by using the epistemic uncertainty provided by BART on low-dimensional data, but focus on learning the ATE, the average treatment effect over the training set, and neither uses uncertainty in nor .
For CATE estimation, unlike ATE estimation, we additionally face test data, which may also violate overlap. Test data also introduces the possibility of covariate shift away from the training distribution, which has so far been studied outside the causal inference literature (Quiñonero-Candela et al., 2009; Li et al., 2011; Sugiyama et al., 2007; Shimodaira, 2000). In both cases, we may wish to reject $x$, e.g. to consult a human expert instead of making a possibly false treatment recommendation. To our knowledge, there has been no comparison of rejection methods for CATE inference.
6 Experiments
In this section, we show empirical evidence for the following claims: that our uncertainty-aware methods are robust both to violations of the overlap assumption and to a failure mode of propensity-based trimming (6.1); that they indicate high uncertainty when covariate shifts occur between training and test distributions (6.2); and that they yield lower errors while rejecting fewer points than propensity-based trimming (6.2). In the process, we introduce a new, high-dimensional, individual-level causal effect prediction benchmark dataset called CEMNIST to demonstrate robustness to overlap violations and propensity failure (6.1). Finally, we introduce a modification to the IHDP causal inference benchmark to explore covariate shift (6.2).
We evaluate our methods by considering treatment recommendations. A simplified recommendation strategy for an individual-level treatment is to assign $t = 1$ if the predicted CATE is positive, and $t = 0$ if negative. However, if there is insufficient knowledge about an individual, and a high cost associated with making errors, it might be preferable to withhold the recommendation. It is therefore important to have an informed rejection policy for a treatment assigned based on a given CATE estimator. We use two rejection policies based on estimators for epistemic and predictive uncertainty derived from equation (7). We compare the utility of these policies to a random rejection baseline and two policies based on a trained propensity score model (propensity trimming and quantiles). Details of the uncertainty and propensity score policies are reported in Appendix A.4. For a CATE estimator, we assign a cost of 1 to making an incorrect prediction and a cost of 0 for either making a correct recommendation or withholding an automated recommendation and deferring the decision to a human expert instead. At a fixed number of rejections, the utility of a policy is defined as the inverse of the total number of erroneous recommendations made. We also report the error over the non-rejected subset as measured by the Precision in Estimation of Heterogeneous Treatment Effect (PEHE) (Hill, 2011) as implemented by Shalit et al. (2017). We report the mean and standard error of all metrics over a dataset-dependent number of training runs.
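The evaluation protocol can be made concrete with a small sketch. It assumes that ground-truth CATE values are available (as in semi-synthetic benchmarks), uses invented numbers, and reports the root of the mean squared CATE error on the retained units as a stand-in for the PEHE implementation cited above; the function names are hypothetical.

```python
import math

def reject_most_uncertain(uncertainty, rate):
    """Indices kept after rejecting the `rate` fraction of units with the
    highest uncertainty."""
    n_reject = int(round(rate * len(uncertainty)))
    order = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i])
    return sorted(order[: len(order) - n_reject])

def recommendation_errors(kept, cate_hat, cate_true):
    """Cost 1 for each retained unit whose recommended treatment (sign of the
    predicted CATE) differs from the sign of the true CATE; rejected units
    cost 0."""
    return sum((cate_hat[i] > 0) != (cate_true[i] > 0) for i in kept)

def root_pehe(kept, cate_hat, cate_true):
    """Root mean squared CATE error over the retained units."""
    return math.sqrt(
        sum((cate_hat[i] - cate_true[i]) ** 2 for i in kept) / len(kept))

cate_true = [1.0, -1.0, 2.0, -2.0]   # hypothetical ground truth
cate_hat = [0.8, -1.2, -0.5, 1.5]    # last two predictions have the wrong sign
uncertainty = [0.1, 0.2, 0.9, 0.8]   # and the highest uncertainty

everyone = list(range(4))
kept = reject_most_uncertain(uncertainty, rate=0.5)
errors_before = recommendation_errors(everyone, cate_hat, cate_true)
errors_after = recommendation_errors(kept, cate_hat, cate_true)
```

A random or propensity-based policy would be evaluated the same way by passing a different score in place of `uncertainty`, which is what makes the policies directly comparable at a fixed rejection rate.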
We evaluate and compare each rejection policy using several uncertainty-aware CATE estimators. The estimators are the Bayesian versions of CEVAE (Louizos et al., 2017), TARNet, CFR-MMD (Shalit et al., 2017), Dragonnet (Shi et al., 2019a), and a deep T-Learner. Each model is augmented by introducing Bayesian parameter uncertainty and by predicting a distribution over model outputs. For imaging experiments, a two-layer CNN encoder is added to each model. Details for each model are given in Appendix B. In the result tables, each model’s name is prefixed with a “B” for “Bayesian”.
6.1 Using uncertainty when overlap is violated
Causal effect MNIST (CEMNIST). We introduce the CEMNIST dataset using handwritten digits from the MNIST dataset (LeCun) to demonstrate that our uncertainty measures capture non-overlap on high-dimensional data and that they are robust to a failure mode of propensity score rejection.
Table 1: Data-generating process for CEMNIST. Rows (digit groups): 9; 2; other odds; other evens.
Table 1 depicts the data-generating process for CEMNIST. In expectation, half of the samples in a generated dataset will be nines, and even though the propensity for treating a nine is relatively low, there are still on average twice as many treated nines as there are samples of other treated digits (except for twos). Therefore, it is reasonable to expect that the CATE can be estimated most accurately for nines. For twos, there is strict non-overlap. Therefore, the CATE cannot be estimated accurately. For the remaining digits, the CATE estimate should be less confident than for nines because there are fewer examples during training, but more confident than for twos because there are both treated and untreated training examples.
This experimental setup is chosen to demonstrate where the propensity trimming rejection policy can be inappropriate for the prediction of individual-level causal effects. Figure 3(a) shows the histogram over training set predictions for a deep propensity model on a realization of the CEMNIST dataset. A data scientist following the trimming paradigm (Caliendo and Kopeinig, 2008) would be justified in choosing a lower threshold around 0.05 and an upper threshold around 0.75. The upper threshold would properly reject twos, but the lower threshold would start rejecting nines, which represent the population that the CATE estimator can be most confident about. Therefore, rejection choices can be worse than random.
Figure 3(b) shows that the recommendation-error-rate is significantly lower for the uncertainty-based policies (red and green) than for both the random baseline policy (purple) and the propensity-based policies (orange and blue). These results hold across a range of SOTA CATE estimators, as shown in the l.h.s. of Table 2, and in Appendix C. Details on the protocol generating these results are in Appendix A.1.
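Table 2: Error per rejection policy (rand. = random, prop. = propensity, unct. = uncertainty); entries are mean ± standard error.
Method / Pol. | CEMNIST rand. | CEMNIST prop. | CEMNIST unct. | IHDP Cov. rand. | IHDP Cov. prop. | IHDP Cov. unct. | IHDP rand. | IHDP prop. | IHDP unct.
BART | 2.1 ± .02 | 2.1 ± .02 | 2.0 ± .03 | 2.6 ± .2 | 2.7 ± .3 | 1.8 ± .2 | 1.9 ± .2 | 1.9 ± .2 | 1.6 ± .1
BT-Learner | .27 ± .00 | .18 ± .01 | .04 ± .01 | 2.3 ± .2 | 2.3 ± .2 | 1.3 ± .1 | 1.0 ± .0 | 0.9 ± .0 | 0.7 ± .0
BTARNet | .18 ± .01 | .16 ± .01 | .00 ± .00 | 2.2 ± .3 | 2.0 ± .3 | 1.2 ± .1 | 1.1 ± .0 | 1.0 ± .0 | 0.8 ± .0
BCFR-MMD | .32 ± .01 | .25 ± .01 | .13 ± .02 | 2.5 ± .2 | 2.4 ± .3 | 1.7 ± .2 | 1.3 ± .1 | 1.3 ± .1 | 0.9 ± .0
BDragonnet | .22 ± .01 | .19 ± .01 | .02 ± .01 | 2.4 ± .3 | 2.2 ± .3 | 1.3 ± .2 | 1.5 ± .1 | 1.4 ± .1 | 1.1 ± .0
BCEVAE | .30 ± .01 | .23 ± .01 | .04 ± .01 | 2.5 ± .2 | 2.4 ± .3 | 1.7 ± .1 | 1.8 ± .1 | 1.9 ± .1 | 1.5 ± .1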
6.2 Uncertainty under covariate shift
Infant Health and Development Program (IHDP). When deploying a machine learning system, we must often deal with a test distribution of $x$ that differs from the training distribution. We induce a covariate shift in the semi-synthetic dataset IHDP (Hill, 2011; Shalit et al., 2017) by excluding instances from the training set for which the mother is unmarried. Mother’s marital status is chosen because it has a balanced frequency of ; furthermore, it has a mild association with the treatment as indicated by a log odds ratio of ; and most importantly, there is evidence of a simple distribution shift, indicated by a predictive accuracy of
for marital status using a logistic regression model over the remaining covariates. We comment on the ethical implications of this experimental setup, describe IHDP, and explain the experimental protocol in Appendix A.2.
We report the mean and standard error in recommendation-error-rates and PEHE over 1000 realizations of the IHDP Covariate-Shift dataset to evaluate each policy by computing each metric over the test set (both subpopulations included). We sweep the rejection rate from 0.0 to 1.0 in increments of 0.05. Figure 4(b) shows, for the BT-Learner, that the epistemic uncertainty policy significantly outperforms the uncertainty-oblivious policies across the whole range of rejection rates, and we show in Appendix C that this trend holds across all models. The middle section of Table 2 supports this claim by reporting the PEHE for each model at a fixed rejection rate that approximates the frequency of the excluded population. Every model class shows improved rejection performance. However, comparisons between model classes are not necessarily appropriate since some models target different scenarios; for example, CEVAE targets non-synthetic data where confounders aren’t directly observed, and it is known to underperform on IHDP (Louizos et al., 2017).
7 Conclusions
Observational data often violates the crucial overlap assumption, especially when the data is high-dimensional (D’Amour et al., 2017). When these violations occur, causal inference can be difficult or impossible, and ideally, a good causal model should communicate this failure to the user. However, the only current approach for identifying these failures in deep models is via the propensity score. We develop here a principled approach to modeling outcome uncertainty in individual-level causal effect estimates, leading to more accurate identification of cases where we cannot expect accurate predictions, whereas the propensity score approach can be both over- and under-conservative. We further show that the same uncertainty modeling approach can be usefully applied to predicting causal effects under covariate shift. More generally, since causal inference is often needed in high-stakes domains such as medicine, we believe it is crucial to effectively communicate uncertainty and refrain from providing ill-conceived predictions.
References
 Tensorflow: a system for largescale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §B.1.
 Bayesian inference of individualized treatment effects using multitask gaussian processes. In Advances in Neural Information Processing Systems, pp. 3424–3432. Cited by: §5.

Deeptreat: learning optimal personalized treatments from observational data using neural networks.
In
ThirtySecond AAAI Conference on Artificial Intelligence
, Cited by: §1, §5.  Some practical guidance for the implementation of propensity score matching. Journal of economic surveys 22 (1), pp. 31–72. Cited by: §A.4, §6.1.
 BART: bayesian additive regression trees. The Annals of Applied Statistics 4 (1), pp. 266–298. Cited by: Appendix C, §5.
 Dealing with limited overlap in estimation of average treatment effects. Biometrika 96 (1), pp. 187–199. Cited by: §1, §3, §5.
 Propensity score methods for bias reduction in the comparison of a treatment to a nonrandomized control group. Statistics in medicine 17 (19), pp. 2265–2281. Cited by: §3, §5.
 Overlap in observational studies with highdimensional covariates. arXiv preprint arXiv:1711.02582. Cited by: §3, §3, §7.
 Bayesian variational autoencoders for unsupervised outofdistribution detection. arXiv preprint arXiv:1912.05651. Cited by: §4.2.
 Automated versus doityourself methods for causal inference: lessons learned from a data analysis competition. Statistical Science 34 (1), pp. 43–68. Cited by: §A.3.
 NPCI: nonparametrics for causal inference. URL: https://github. com/vdorie/npci. Cited by: §A.2.
 A systematic comparison of bayesian deep learning robustness in diabetic retinopathy tasks. arXiv preprint arXiv:1912.10481. Cited by: §4.1.
 Discrete optimization for interpretable study populations and randomization inference in an observational study of severe sepsis mortality. Journal of the American Statistical Association 111 (514), pp. 447–458. Cited by: §3, §5.
 Dropout as a bayesian approximation: representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059. Cited by: §2.1, §4.1, §4.1.
 Uncertainty in deep learning. University of Cambridge 1, pp. 3. Cited by: §4.1.
 Causal inference using regression on the treatment variable. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cited by: §2.
 Deep learning. MIT press. Cited by: §3.
 Reliable uncertainty estimates in deep neural networks using noise contrastive priors. Cited by: §1, §4.2.

Deep anomaly detection with outlier exposure
. arXiv preprint arXiv:1812.04606. Cited by: §4.2.  Causal inference. Cited by: §1, §3, §3, §5.
 Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics 20 (1), pp. 217–240. Cited by: §A.2, §1, §5, §6.2, §6.
 Assessing lack of common support in causal inference using bayesian nonparametrics: implications for evaluating the effect of breastfeeding on children’s cognitive outcomes. The Annals of Applied Statistics, pp. 1386–1420. Cited by: §5.
 Challenges with propensity score strategies in a highdimensional setting and a potential alternative. Multivariate Behavioral Research 46 (3), pp. 477–513. Cited by: §5.

 Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745. Cited by: §2.1.
 Causal inference in statistics, social, and biomedical sciences. Cambridge University Press. Cited by: §1, §2, §3.
 Nonparametric estimation of average treatment effects under exogeneity: a review. Review of Economics and statistics 86 (1), pp. 4–29. Cited by: §1, §3.
 Learning representations for counterfactual inference. In International conference on machine learning, pp. 3020–3029. Cited by: §1, §5.
 Towards uncertainty-assisted brain tumor segmentation and survival prediction. In International MICCAI Brainlesion Workshop, pp. 474–485. Cited by: §4.1.

 What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in neural information processing systems, pp. 5574–5584. Cited by: §1.
 Why propensity scores should not be used for matching. Political Analysis 27 (4), pp. 435–454. Cited by: §3, §5.
 Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix B.
 The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Cited by: §6.1.

 Training confidence-calibrated classifiers for detecting out-of-distribution samples. arXiv preprint arXiv:1711.09325. Cited by: §4.2.
 Knows what it knows: a framework for self-aware learning. Machine learning 82 (3), pp. 399–443. Cited by: §5.
 Causal effect inference with deep latent-variable models. In Advances in Neural Information Processing Systems, pp. 6446–6456. Cited by: Appendix B, §1, §1, §2, §4.1, §4.2, §5, §6.2, §6, footnote 1.
 Visualizing data using t-SNE. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: §A.2.
 Concrete problems for autonomous vehicle safety: advantages of bayesian deep learning. Cited by: §4.1.
 The collaborative perinatal study of the national institute of neurological diseases and stroke. The Woman and Their Pregnancies. Cited by: §A.3.
 Dataset shift in machine learning. The MIT Press. Cited by: §5.
 Gaussian processes in machine learning. In Summer School on Machine Learning, pp. 63–71. Cited by: §5.
 The central role of the propensity score in observational studies for causal effects. Biometrika 70 (1), pp. 41–55. Cited by: §1, §3.

 Neural network-based face detection. IEEE Transactions on pattern analysis and machine intelligence 20 (1), pp. 23–38. Cited by: §4.2.
 Causal inference using potential outcomes: design, modeling, decisions. Journal of the American Statistical Association 100 (469), pp. 322–331. Cited by: §2.
 Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 3076–3085. Cited by: §A.2, §A.2, Appendix B, §1, §2, §6.2, §6, §6.
 Adapting neural networks for the estimation of treatment effects. In Advances in Neural Information Processing Systems, pp. 2503–2513. Cited by: Appendix B, §1, §2, §3, §5, §5, §6.
 Adapting neural networks for the estimation of treatment effects. In Advances in Neural Information Processing Systems, pp. 2503–2513. Cited by: §1.
 Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference 90 (2), pp. 227–244. Cited by: §5.
 Benchmarking framework for performance-evaluation of causal inference analysis. arXiv preprint arXiv:1802.05046. Cited by: §A.3, §1.
 Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §4.1.
 Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research 8 (May), pp. 985–1005. Cited by: §5.
 Functional variational bayesian neural networks. arXiv preprint arXiv:1903.05779. Cited by: §4.2.

 Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 648–656. Cited by: Appendix B.
 Bayesian layers: a module for neural network uncertainty. In Advances in Neural Information Processing Systems, pp. 14633–14645. Cited by: §4.1.

 The nature of statistical learning theory. Springer Science & Business Media. Cited by: §1.
 Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association 113 (523), pp. 1228–1242. Cited by: §5.
 Learning overlapping representations for the estimation of individualized treatment effects. arXiv preprint arXiv:2001.04754. Cited by: §5.
 Deep and confident prediction for time series at uber. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 103–110. Cited by: §4.1.
Appendix A Datasets
A.1 CEMNIST
Digit(s)  Number of train samples  Number treated  

other odds  each  each  
other evens  each  each 
The original MNIST image dataset contains a training set of size 60000 and a test set of size 10000, where each digit class 0–9 represents roughly 10% of points. We use a subset of the training data, shown in Table 3. Similarly, we use a subset of the test set, with the same proportion for each digit as in the training set (and the same proportion of treated points). The variables , are deterministic as shown in Table 3. Some numbers in Table 3 are approximate because they are generated according to the probabilities in Table 1.
The dataset serves two purposes. First, it illustrates why the standard practice of rejecting points with propensity scores close to 0 or 1 can be worse than rejecting randomly. The digit has the most data, making it easy to predict the CATE, but its propensity score is only , so that s will be rejected early. In practice, it may be common for one subpopulation to represent the majority of the data, so that its CATE is easy to estimate. Second, the digit suffers from strict non-overlap (propensity score of ). It should be the first digit class to be rejected by any method since its CATE cannot be estimated. When increasing the rejected proportion, digits other than should subsequently be rejected, as only and examples are observed for their treatment and control groups respectively. However, propensity-based rejection is likely to retain these subpopulations because their propensity score is .
We repeated the CEMNIST experiment times, each time generating a new dataset with a different random initialization for each model. Note that this is a single dataset, unlike other causal inference benchmarks, so it is only suited for CATE estimation, not ATE estimation.
A.2 IHDP
Hill (2011) introduced a causal inference dataset based on the Infant Health and Development Program (IHDP), a randomized experiment that assessed the impact of specialist home visits on children’s performance in cognitive tests. Real covariates and treatments related to each participant are used in the IHDP dataset. However, outcomes are simulated based on covariates and treatment, making this dataset semi-synthetic. Covariates were made different between the treatment and control groups by removing units with non-white mothers from the treated population. There are 747 units in the dataset (139 treated, 608 control), with 25 covariates related to the children and their mothers. Following Shalit et al. (2017) and Hill (2011), we use the simulated outcome implemented as setting “A” in the NPCI package (Dorie, 2016), and we use the noiseless/expected outcome to compute the ground truth CATE. The IHDP dataset is available for download at https://www.fredjo.com/.
We run the experiment according to the protocol described in Shalit et al. (2017): we run 1000 repetitions of the experiment, where each test set has 75 points and the remaining 672 available points are split 70% to 30% for training and validation. The ground truth outcomes are normalized to a mean of 0 and standard deviation of 1 over the training set. For evaluation, each model’s predictions are unnormalized to calculate the PEHE.
IHDP Covariate Shift. As previously mentioned, we selected a variable (marital status of mother) and excluded data points where the mother was unmarried from training (while leaving the test set unaltered). We selected this feature for three reasons: it is active in roughly 50% of data points, the distributions of the remaining covariates were distinct based on a t-SNE visualization (Maaten and Hinton, 2008), and the feature is only marginally correlated with treatment (which ensures that we study the impact of covariate shift, not unobserved confounding). The feature is hidden from the models to make the detection of covariate shift non-trivial, and to induce a more realistic scenario where latent factors are often unaccounted for in observational data.
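The split and covariate-shift protocol described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the binary encoding of the shift feature (1 marking the excluded group), and the choice to filter only the training split are all assumptions.

```python
import random

def ihdp_style_split(n_units=747, n_test=75, train_frac=0.7, seed=0):
    """Return (train, valid, test) index lists for one repetition."""
    rng = random.Random(seed)
    idx = list(range(n_units))
    rng.shuffle(idx)
    test, rest = idx[:n_test], idx[n_test:]
    n_train = int(round(train_frac * len(rest)))  # 70% of the 672 remaining units
    return rest[:n_train], rest[n_train:], test

def drop_shifted_units(train_idx, covariates, shift_col):
    """Keep only training units whose shift feature is 0 (assumed encoding)."""
    return [i for i in train_idx if covariates[i][shift_col] == 0]

def hide_feature(row, shift_col):
    """Remove the shift feature so that a model never observes it."""
    return row[:shift_col] + row[shift_col + 1:]

# Toy covariates: column 1 plays the role of the hidden binary feature.
rng = random.Random(1)
covariates = [[rng.random(), rng.randint(0, 1), rng.random()] for _ in range(747)]

train, valid, test = ihdp_style_split()
shifted_train = drop_shifted_units(train, covariates, shift_col=1)
train_inputs = [hide_feature(covariates[i], 1) for i in shifted_train]
test_inputs = [hide_feature(covariates[i], 1) for i in test]  # test set is unaltered
```

Because the feature is removed from the inputs of both splits, a model can only detect the shift through its effect on the remaining covariates.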
Marital status may be considered a sensitive socio-economic factor. We do not intend the experiment to be politically insensitive; rather, it is meant to emphasize the problem of demographic exclusion in observational data due to issues such as historical bias, along with the danger of making confident but uninformed predictions when demographic exclusion is latent. Omitting these variables can lead to subpar model performance, particularly for members of a socio-economic minority.
A.3 ACIC 2016
Dorie et al. (2019) introduced a dataset named after the 2016 Atlantic Causal Inference Conference (ACIC), where it was used for a competition. ACIC is a collection of semi-synthetic datasets whose covariates are taken from a large study conducted on pregnant women and their children to identify causal factors leading to developmental disorders (Niswander, 1972). There are 4802 observations and 58 covariates. Outcomes and treatments are simulated, as in IHDP, according to a different data-generating process for each dataset. We chose this dataset instead of the 2018 ACIC challenge (Shimoni et al., 2018) because the latter is aimed only at ATE estimation and the CATE is equal for each observation in most datasets.
A.4 Evaluation metrics
We evaluate our methods by considering treatment recommendations. A simplified recommendation strategy for an individual-level treatment of a unit, given its covariates, is to recommend treatment if the predicted CATE is positive, and no treatment if it is negative. However, if there is insufficient knowledge about an individual, and a high cost associated with making errors, it may be preferable to withhold the recommendation and, for example, refer the case for further scrutiny. It is therefore important to have an informed rejection policy for a treatment assigned based on a given CATE estimator.
To evaluate a rejection policy for a CATE estimator, we assign a cost of 1 to an incorrect recommendation and a cost of 0 to a correct one. At a fixed number of rejections, the utility of a policy can be defined as the inverse of the total number of erroneous recommendations made; that is, a policy that correctly identifies the model’s mistakes and refers such patients to a human expert should have a higher utility.
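The evaluation described above can be illustrated with a short sketch. It assumes that the ground-truth CATE is available (as in the semi-synthetic benchmarks) and that a policy ranks points by a scalar uncertainty score; the function names and toy numbers are hypothetical.

```python
def recommend(cate):
    """Recommend treatment (1) if the CATE is positive, otherwise no treatment (0)."""
    return 1 if cate > 0 else 0

def errors_after_rejection(cate_pred, cate_true, uncertainty, n_reject):
    """Count incorrect recommendations among the points that are kept
    after rejecting the n_reject points with the highest uncertainty."""
    order = sorted(range(len(cate_pred)), key=lambda i: uncertainty[i], reverse=True)
    kept = order[n_reject:]
    return sum(recommend(cate_pred[i]) != recommend(cate_true[i]) for i in kept)

# Toy example: the two most uncertain points are exactly the model's mistakes.
cate_pred = [0.8, -0.5, 0.3, -0.2]
cate_true = [1.0, -1.0, -1.0, 1.0]
uncertainty = [0.1, 0.2, 0.9, 0.8]

for n_reject in range(3):
    print(n_reject, errors_after_rejection(cate_pred, cate_true, uncertainty, n_reject))
```

A lower error count at the same number of rejections corresponds to a higher utility; a random policy can be compared by passing random scores as `uncertainty`.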
Rejection policies. We introduce two rejection policies based on the epistemic and predictive uncertainty estimates of an uncertainty-aware CATE estimator. Both policies opt to reject if the uncertainty estimate is greater than a threshold that rejects a given proportion of the training data. The training data is used because there may not be a large enough test set in practice. For all policies, we determine thresholds on the training set to simulate a real-world individual-level recommendation scenario. The epistemic uncertainty policy uses a sample-based estimator of the uncertainty in the CATE (second r.h.s. term in equation (7)), given by
(10) 
where Monte Carlo samples are taken from each of . Note that, for the T-Learner, this posterior factorizes into two independent distributions because there are separate models for the outcome given treatment and given no treatment. Furthermore, other models share parameters for so the individual parameters in may overlap. The predictive uncertainty policy uses an estimator of equation (7), , which has the same functional form as in (10), but instead of being over the difference in expected values of the output distribution, it is over samples of the output distribution.
We compare the utility of these policies to a random rejection baseline and two policies based on propensity scores. The first propensity policy (propensity quantiles) finds a two-sided threshold on the distribution of estimated propensity scores such that a proportion of the training data is retained. The second policy (propensity trimming) implements a trimming algorithm following the guidelines proposed by Caliendo and Kopeinig (2008).
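As a rough illustration of an uncertainty-based rejection policy of this kind, the sketch below computes a sample variance of the CATE across Monte Carlo parameter draws for one unit and derives a threshold from training-set uncertainties. It is an assumption-laden sketch rather than a reproduction of equation (10): the population (1/M) variance, the function names, and the quantile rule are all illustrative choices.

```python
import statistics

def epistemic_cate_variance(mu1_samples, mu0_samples):
    """Variance, across MC parameter draws, of the difference between the
    expected outcome under treatment and under no treatment for one unit."""
    diffs = [m1 - m0 for m1, m0 in zip(mu1_samples, mu0_samples)]
    return statistics.pvariance(diffs)

def rejection_threshold(train_uncertainties, reject_fraction):
    """Uncertainty value above which roughly `reject_fraction` of the
    training points fall."""
    ranked = sorted(train_uncertainties)
    k = int(round((1.0 - reject_fraction) * len(ranked)))
    k = min(max(k, 1), len(ranked))
    return ranked[k - 1]

def should_reject(uncertainty, threshold):
    """Withhold the recommendation when the uncertainty exceeds the threshold."""
    return uncertainty > threshold

# Toy usage: ten training uncertainties, reject the most uncertain 20%.
train_unc = [i / 10 for i in range(1, 11)]
threshold = rejection_threshold(train_unc, reject_fraction=0.2)
n_rejected = sum(should_reject(u, threshold) for u in train_unc)
```

For the predictive uncertainty policy, the same variance computation would be applied to samples of the output distribution rather than to expected values.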
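The propensity quantiles policy can be sketched as below. The text does not specify how the two-sided threshold is placed, so this sketch assumes symmetric tails (equal numbers of points dropped at each end); the function names are hypothetical.

```python
def two_sided_thresholds(train_propensities, retain_fraction):
    """Return (low, high) so that roughly `retain_fraction` of the training
    propensity scores lie in [low, high], dropping equal tails (an assumption)."""
    ranked = sorted(train_propensities)
    n = len(ranked)
    n_drop_each = int(round((1.0 - retain_fraction) * n / 2.0))
    return ranked[n_drop_each], ranked[n - 1 - n_drop_each]

def reject_by_propensity(score, low, high):
    """Reject a point whose estimated propensity score falls outside [low, high]."""
    return score < low or score > high

# Toy usage: 100 evenly spaced scores, retain the central 80%.
scores = [i / 100 for i in range(1, 101)]
low, high = two_sided_thresholds(scores, retain_fraction=0.8)
n_retained = sum(not reject_by_propensity(s, low, high) for s in scores)
```

Note that a policy of this form retains points with mid-range propensity scores regardless of how few examples support them, which is the failure mode the CEMNIST experiment is designed to expose.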
Appendix B Models
We evaluate and compare each rejection policy using several uncertainty-aware CATE estimators. The estimators are the Bayesian versions of CEVAE (Louizos et al., 2017), TARNet, CFR-MMD (Shalit et al., 2017), Dragonnet (Shi et al., 2019a), and a deep version of the T-Learner (Shalit et al., 2017). Each model is augmented by introducing Bayesian parameter uncertainty and by predicting a distribution over model outputs. For image data, two convolutional bottom layers are added to each model.
Each model is augmented with Bayesian parameter uncertainty by adding dropout with a probability of 0.1 after each layer (0.5 for layers before the output layer), and setting weight decay penalties to be inversely proportional to the number of examples in the training dataset. At test time, uncertainty estimates are calculated over 100 MC samples.
For the Bayesian T-Learner we use two BNNs, each having 5 dense layers of 200 neurons. Dropout is added after each dense layer, followed by ELU activation functions. A linear output layer is added to each network, with a sigmoid activation function if the target is binary. For image data, we add a 2-layer convolutional neural network module, with 32 and 64 filters per layer. Spatial dropout (Tompson et al., 2015) and ELU activations follow each convolutional layer, and the output is flattened before being passed to the rest of the network. For image data, the Bayesian CEVAE decoder is modified by using a transposed convolution block for the part of the decoder that models . For the propensity policies, we use a propensity model that has the same form as a single branch of the Bayesian T-Learner. The propensity model’s L2 regularization is tuned for calibration, as this is important for propensity models. We also experimented with a logistic regression model, which performed worse.
Adam optimization (Kingma and Ba, 2014) is used with a learning rate of 0.001 (on CEMNIST the learning rate for the BCEVAE is reduced to 0.0002), and we train each model for a maximum of 2000 epochs, using early stopping with a patience of 50.
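The stopping rule can be illustrated in isolation. The sketch below assumes that the monitored quantity is a validation loss and that training halts once it has not improved for `patience` consecutive epochs; neither detail is stated in the text, and the function name is hypothetical.

```python
def early_stopping_epoch(val_losses, patience=50, max_epochs=2000):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops at the first epoch that is `patience` epochs past the
    best loss seen so far, or at the last available epoch otherwise."""
    best_loss = float("inf")
    best_epoch = -1
    losses = val_losses[:max_epochs]
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(losses) - 1, best_epoch

# Toy usage: the loss improves for three epochs and then plateaus.
losses = [1.0, 0.9, 0.8] + [0.85] * 100
stop_epoch, best_epoch = early_stopping_epoch(losses)
```

In practice the weights from `best_epoch` would typically be restored before evaluation.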
Aside from these changes, model architectures, optimization strategies and loss weighting follow what is reported in their respective papers. More details can be seen in the attached code.
B.1 Compute infrastructure
All neural network models were implemented in TensorFlow 2.2 (Abadi et al., 2016), using Nvidia GPUs. BART was implemented using the dbarts R package, available at https://cran.r-project.org/web/packages/dbarts/index.html.
Appendix C Additional Results
Table 4 shows that each uncertainty-aware neural network model outperforms BART (Chipman et al., 2010) on the datasets considered. BART is chosen as a baseline here because it can quantify epistemic uncertainty.
CEMNIST()  IHDP Cov. ()  IHDP ()  
Method / Pol.  rand.  prop.  unct.  rand.  prop.  unct.  rand.  prop.  unct. 
BART  2.1 ± .02  2.1 ± .02  2.0 ± .03  2.6 ± .2  2.7 ± .3  1.8 ± .2  1.9 ± .2  1.9 ± .2  1.6 ± .1 
BT-Learner  .27 ± .00  .18 ± .01  .04 ± .01  2.3 ± .2  2.3 ± .2  1.3 ± .1  1.0 ± .0  0.9 ± .0  0.7 ± .0 
BTARNet  .18 ± .01  .16 ± .01  .00 ± .00  2.2 ± .3  2.0 ± .3  1.2 ± .1  1.1 ± .0  1.0 ± .0  0.8 ± .0 
BCFR-MMD  .32 ± .01  .25 ± .01  .13 ± .02  2.5 ± .2  2.4 ± .3  1.7 ± .2  1.3 ± .1  1.3 ± .1  0.9 ± .0 
BDragonnet  .22 ± .01  .19 ± .01  .02 ± .01  2.4 ± .3  2.2 ± .3  1.3 ± .2  1.5 ± .1  1.4 ± .1  1.1 ± .0 
BCEVAE  .30 ± .01  .23 ± .01  .04 ± .01  2.5 ± .2  2.4 ± .3  1.7 ± .1  1.8 ± .1  1.9 ± .1  1.5 ± .1 
C.1 CEMNIST
Table 5 and Figure 5 compare the BCEVAE model when trained with and without negative sampling on the CEMNIST dataset.
()  Rec. Err. ()  
Method / Pol.  rand.  prop.  unct.  rand.  prop.  unct. 
Negative Sampling  .295 ± .005  .227 ± .007  .037 ± .009  .010 ± .001  .005 ± .001  .000 ± .000 
No Negative Sampling  .286 ± .005  .226 ± .007  .033 ± .007  .011 ± .001  .007 ± .001  .000 ± .000 
C.2 IHDP
Table 6 shows the performance of the Bayesian models relative to the results reported in their respective papers for the IHDP dataset.
within-sample  out-of-sample  
Method  
OLS2  2.4 ± .1  .14 ± .01  2.5 ± .1  .31 ± .02 
BART  2.1 ± .1  .23 ± .01  2.3 ± .1  .34 ± .02 
BNN  2.2 ± .1  .37 ± .03  2.1 ± .1  .42 ± .03 
GANITE  1.9 ± .4  .43 ± .05  2.4 ± .4  .49 ± .05 
CEVAE  2.7 ± .1  .34 ± .01  2.6 ± .1  .46 ± .02 
TARNet  .88 ± .0  .26 ± .01  .95 ± .0  .28 ± .01 
CFR-MMD  .73 ± .0  .3 ± .01  .78 ± .0  .31 ± .01 
Dragonnet  .14 ± .01  .20 ± .01  
BT-Learner  .95 ± .0  .21 ± .01  .88 ± .0  .18 ± .01 
BTARNet  1.1 ± .0  .23 ± .01  .96 ± .0  .20 ± .01 
BCFR-MMD  1.3 ± .1  .29 ± .01  1.2 ± .1  .26 ± .01 
BDragonnet  1.5 ± .1  .30 ± .01  1.3 ± .0  .27 ± .01 
BCEVAE  1.8 ± .1  .47 ± .01  1.8 ± .1  .50 ± .02 
C.3 ACIC
Figure 6 visualizes the performance of the rejection policies on the ACIC 2016 dataset. Table 7 enumerates the results for the ACIC 2016 dataset, and we see that the epistemic uncertainty policy rejects recommendations for data points with high errors.
()  Rec. Err. ()  
Method / Pol.  rand.  prop.  unct.  rand.  prop.  unct. 
BT-Learner  2.31 ± .139  2.19 ± .136  1.77 ± .095  .072 ± .006  .069 ± .006  .066 ± .006 
BTARNet  2.18 ± .145  2.05 ± .142  1.67 ± .094  .064 ± .007  .061 ± .007  .059 ± .007 
BCFR-MMD  2.26 ± .150  2.13 ± .147  1.71 ± .105  .062 ± .007  .060 ± .007  .058 ± .007 
BDragonnet  2.30 ± .127  2.17 ± .122  1.81 ± .088  .069 ± .006  .067 ± .006  .066 ± .006 
BCEVAE  3.26 ± .161  3.19 ± .156  2.93 ± .132  .097 ± .010  .094 ± .010  .094 ± .010 