Estimating Potential Outcome Distributions with Collaborating Causal Networks

by Tianhui Zhou et al.
Duke University

Many causal inference approaches have focused on identifying an individual's outcome change due to a potential treatment, or the individual treatment effect (ITE), from observational studies. Rather than only estimating the ITE, we propose Collaborating Causal Networks (CCN) to estimate the full potential outcome distributions. This modification facilitates estimating the utility of each treatment and allows for individual variation in utility functions (e.g., variability in risk tolerance). We show that CCN learns distributions that asymptotically capture the correct potential outcome distributions under standard causal inference assumptions. Furthermore, we develop a new adjustment approach that is empirically effective in alleviating sample imbalance between treatment groups in observational studies. We evaluate CCN by extensive empirical experiments and demonstrate improved distribution estimates compared to existing Bayesian and Generative Adversarial Network-based methods. Additionally, CCN empirically improves decisions over a variety of utility functions.






1 Introduction

Personalized medicine requires estimating how an individual’s intrinsic characteristics trigger heterogeneous responses to treatment (Yazdani and Boerwinkle, 2015). Under the potential outcome framework for causal inference (Imbens and Rubin, 2015), these individual treatment effects (ITE), or Conditional Average Treatment Effects (CATE), are defined as the difference between an individual’s expected potential outcomes under different treatment conditions. Since only the outcome for the assigned treatment is observed, estimating the ITE requires inferring the missing potential outcomes (Ding and Li, 2018).

Machine learning approaches have been adapted to estimate ITEs by extending methods such as the Random Forest (Wager and Athey, 2018) and by creating bespoke neural network frameworks (Shalit, Johansson, and Sontag, 2017; Shi, Blei, and Veitch, 2019). ITEs, though, do not necessarily align with optimal choices. In a decision-theoretic framework, the optimal decision maximizes the expected utility, $\mathbb{E}[u(Y)]$, over the distribution of outcomes (Joyce, 1999). The ITE is a special case with the identity utility function, $u(y) = y$, but more general utility functions require alternative approaches. Policy learning is one approach that learns a decision maker by optimizing a predefined utility function as the objective (Kallus and Zhou, 2018; Qian and Murphy, 2011). However, the decision should account not only for the heterogeneity of an individual’s potential outcomes but also for their customized needs, i.e., personalized utility functions (Pennings and Smidts, 2003). Hence, we propose an approach that estimates the potential outcome distributions to maintain flexibility for personalization.

Previous efforts to estimate potential outcome distributions include Bayesian Additive Regression Trees (BART) (Chipman, George, and McCulloch, 2010; Hill, 2011), variational methods (Louizos et al., 2017), the generalized additive model for location, scale and shape (GAMLSS) (Hohberg, Pütz, and Kneib, 2020), and techniques based on adversarial networks (Yoon, Jordon, and van der Schaar, 2018). Empirically, these techniques often impose explicit or implicit assumptions about the outcome distributions (e.g., Gaussian errors), which may not match the true data-generating mechanism. In response, we propose a novel dual neural network approach, the Collaborating Causal Networks (CCN). CCN modifies the structure of Collaborating Networks (Zhou et al., 2020) to create a new causal framework that flexibly represents distributions. Under standard causal inference assumptions, we prove that CCN asymptotically captures the potential outcome distributions. We then propose a novel adjustment method to address imbalance between treatment groups, which hurts generalization and is a common problem in practice. Empirically, this adjustment method improves point estimates, potential outcome distribution estimates, and decision-making.

In summary, the main contributions of the paper are:

  1. We propose the Collaborating Causal Networks (CCN) to estimate potential outcome distributions.

  2. We prove the asymptotic properties of CCN.

  3. We propose a new adjustment scheme to alleviate the treatment group imbalance for observational studies.

As a further minor contribution, we propose personalized utilities in decision making, which are practically meaningful and address distinct user needs. We evaluate how well existing algorithms perform on this task for the first time. We demonstrate that CCN improves optimal individual decisions based on utility functions by targeting the full potential outcome distributions.

2 Problem Statement

We define the covariates as $X \in \mathcal{X}$. We assume a binary treatment condition, and each unit is assigned a treatment $T \in \{0, 1\}$. We choose the binary setup for clarity; the multi-treatment case can be constructed under the same framework. We let $Y(0)$ and $Y(1)$ represent the continuous potential outcomes under the two treatments; $Y = T\,Y(1) + (1 - T)\,Y(0)$ is the observed outcome. We use lowercase letters with subscript $i$ to denote concrete realizations: $(x_i, t_i, y_i)$. A common goal is to estimate the ITE, $\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x]$, also known as the conditional average treatment effect (CATE).

Our goal is to use the incomplete data to infer the distributions of both potential outcomes, $p(Y(0) \mid X = x)$ and $p(Y(1) \mid X = x)$. Successful estimation of these distributions enables us to study a wide range of objectives besides the ITE through the introduction of utility functions (Dehejia, 2005). We define the treatment-specific utility functions as $u_0$ and $u_1$. These utility functions will often be the same, but can vary due to cost of treatment, etc. We can then estimate the change in utility by approximating the expectations, $\mathbb{E}[u_1(Y(1)) \mid X = x] - \mathbb{E}[u_0(Y(0)) \mid X = x]$. We note that the identity function, $u(y) = y$, returns an estimate of the ITE. In practice, utility functions are often nonlinear in measured outcomes (Pennings and Smidts, 2003). For example, a decision maker could define a utility function such as $u(y) = \mathbb{1}(y \geq \delta)$ to evaluate the chance of a meaningful outcome at least at level $\delta$. The utility function can also be varied and adapted to accommodate various personal preferences or conditions.
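Concretely, once an estimate of a conditional CDF is available (as CCN will provide), any expected utility can be approximated by integrating the utility against the CDF increments. A minimal numpy sketch, where the grid-evaluated Gaussian CDF and the threshold utility are purely illustrative stand-ins:

```python
import numpy as np
from math import erf

def expected_utility(cdf_vals, grid, utility):
    """Approximate E[u(Y)] by summing u over the CDF increments dF."""
    dF = np.diff(cdf_vals, prepend=0.0)
    return float(np.sum(utility(grid) * dF))

# Toy example: Y ~ N(0, 1), threshold utility u(y) = 1{y >= delta} with delta = 0.
grid = np.linspace(-6.0, 6.0, 4001)
cdf = 0.5 * (1.0 + np.vectorize(erf)(grid / np.sqrt(2.0)))
chance = expected_utility(cdf, grid, lambda y: (y >= 0.0).astype(float))
# `chance` approximates P(Y >= 0) = 0.5
```

The same routine handles the identity utility (recovering a mean estimate) or any personalized threshold, which is exactly the flexibility the distributional view buys.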

Like most causal methods, CCN relies on the standard strong ignorability assumption (Rosenbaum and Rubin, 1983) to estimate the full potential outcome distributions when each datum observes only a single treatment outcome. Strong ignorability consists of two sub-assumptions:

Assumption 1 (Positivity or overlap).

For all $x \in \mathcal{X}$, the probability of assignment to either treatment group is bounded away from zero and one: $0 < p(T = 1 \mid X = x) < 1$.

Assumption 2 (Ignorability or Unconfoundedness).

The potential outcomes are jointly independent of the treatment assignment conditional on $X$: $(Y(0), Y(1)) \perp T \mid X$.

3 Collaborating Causal Networks

The CCN approach approximates the conditional distributions $p(Y(0) \mid X = x)$ and $p(Y(1) \mid X = x)$. CCN uses a two-function framework based on the Collaborating Networks (CN) method (Zhou et al., 2020). We choose to extend CN to the causal setting because it automatically adapts to different distribution families, including non-Gaussian distributions.

We first give an overview of CN, then present CCN, and then introduce our proposed adjustment strategies. Proofs of all theoretical claims are in Appendix A.

Overview of Collaborating Networks

CN estimates the conditional distribution, $p(y \mid x)$, with two neural networks: a network $g(\ell, x)$ to approximate the conditional CDF, $F(\ell \mid x) = p(y \leq \ell \mid x)$, and a network $f(q, x)$ to approximate its inverse. Information sharing is enforced by the fact that the CDF and its inverse form an identity mapping for any quantile $q \in (0, 1)$: $g(f(q, x), x) = q$. The networks form a collaborative scheme with their respective losses,

$$\text{f-loss:} \quad \mathbb{E}_{x}\,\mathbb{E}_{q}\big[\,\text{BCE}\big(q,\; g(f(q, x), x)\big)\big], \tag{1}$$

$$\text{g-loss:} \quad \mathbb{E}_{x, y}\,\mathbb{E}_{q}\big[\,\text{BCE}\big(\mathbb{1}(y \leq f(q, x)),\; g(f(q, x), x)\big)\big]. \tag{2}$$

The quantile $q$ is randomly sampled (e.g., $q \sim \text{Uniform}(0, 1)$), and BCE represents the binary cross-entropy loss. The parameters for $f$ and $g$ are only updated with their respective losses. When trained with (1) and (2), a fixed point of the optimization is at the true conditional CDF and its inverse (Zhou et al., 2020). The relative importance of $f$ and $g$ is reflected in these statistical properties, as $g$ only relies on $f$ to cover the outcome space, whereas $f$ requires an optimal $g$. Zhou et al. (2020) showed that $f$ can be replaced by other space-searching tools, including simple distributions, which suffer a minor performance loss in exchange for ease of optimization. Thus, we focus on extending $g$ and the g-loss for causal inference, and we replace $f$ with the notation $\pi$ as a general form of a space-searching tool or distribution sample. The g-loss can then be simplified to

$$\text{g-loss:} \quad \mathbb{E}_{x, y}\,\mathbb{E}_{\ell \sim \pi}\big[\,\text{BCE}\big(\mathbb{1}(y \leq \ell),\; g(\ell, x)\big)\big]. \tag{3}$$

In practice, the expectation in (3) is replaced with an empirical approximation.
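The empirical approximation of the g-loss is just a binary cross-entropy between the indicator $\mathbb{1}(y \leq \ell)$ and a candidate CDF evaluated at proposal points $\ell \sim \pi$. The following numpy sketch, in which a wide Gaussian $\pi$ and closed-form candidate CDFs stand in for trained networks (both illustrative assumptions), shows that the true conditional CDF attains a lower empirical g-loss than a misspecified one:

```python
import numpy as np
from math import erf

rng = np.random.default_rng(0)

def g_loss(g, x, y, n_props=64):
    """Empirical g-loss: BCE between 1{y <= l} and g(l, x), with proposal
    points l drawn from a simple distribution pi covering the outcome space."""
    total = 0.0
    for _ in range(n_props):
        l = rng.normal(0.0, 3.0, size=y.shape)  # pi: wide Gaussian proposal
        target = (y <= l).astype(float)
        p = np.clip(g(l, x), 1e-7, 1.0 - 1e-7)
        total += -np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))
    return total / n_props

Phi = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / np.sqrt(2.0))))

# Data: y | x ~ N(x, 1); the true conditional CDF is Phi(l - x).
x = rng.normal(size=5000)
y = x + rng.normal(size=5000)
loss_true = g_loss(lambda l, xx: Phi(l - xx), x, y)
loss_shifted = g_loss(lambda l, xx: Phi(l - xx - 1.0), x, y)  # misspecified CDF
# loss_true comes out lower than loss_shifted
```

This matches the fixed-point property above: the loss is minimized when $g$ reports the true conditional probability $p(y \leq \ell \mid x)$.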

(a) CCN Training
(b) Infer Distribution
Figure 1: 1(a) visualizes the CCN network with the adjustment strategy. 1(b) describes how the trained $g_0$ and $g_1$ functions can be used to sketch the underlying CDFs of the potential outcomes, $Y(0)$ and $Y(1)$.

Causal Inference Formulation

Based on (3), we define a network for each group, $g_0$ and $g_1$, with corresponding losses,

$$\text{g-loss}_t: \quad \mathbb{E}_{x, y \mid T = t}\,\mathbb{E}_{\ell \sim \pi}\big[\,\text{BCE}\big(\mathbb{1}(y \leq \ell),\; g_t(\ell, x)\big)\big], \quad t \in \{0, 1\}. \tag{4}$$

We combine the two treatment groups as

$$\text{CCN-loss} = \text{g-loss}_0 + \text{g-loss}_1. \tag{5}$$

We choose to give each treatment arm a separate network; alternatively, the treatment assignment could be included as an additional covariate within a single network. We next present the asymptotic properties of CCN. Under Assumptions 1 and 2 and a minor additional assumption below, the fixed-point solution and consistency of CCN hold regardless of the treatment group imbalance.
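To make the per-arm training concrete, here is a minimal numpy sketch that splits the sample by assigned treatment and sums the two arms' binary cross-entropy g-losses; the candidate CDFs, the Gaussian proposal distribution, and the unweighted sum are all illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def arm_g_loss(g, x, y, rng, n_props=32):
    """Per-arm empirical g-loss: BCE between 1{y <= l} and g(l, x)."""
    total = 0.0
    for _ in range(n_props):
        l = rng.normal(0.0, 3.0, size=y.shape)  # proposal points covering outcomes
        target = (y <= l).astype(float)
        p = np.clip(g(l, x), 1e-7, 1.0 - 1e-7)
        total += -np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))
    return total / n_props

def ccn_loss(g0, g1, x, t, y, rng):
    """Each network g_t is trained only on units assigned to arm t."""
    m0, m1 = (t == 0), (t == 1)
    return arm_g_loss(g0, x[m0], y[m0], rng) + arm_g_loss(g1, x[m1], y[m1], rng)

rng = np.random.default_rng(1)
x = rng.normal(size=200)
t = rng.binomial(1, 0.5, size=200)
y = np.where(t == 1, x + 1.0, x) + rng.normal(size=200)  # toy potential outcomes
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
loss = ccn_loss(lambda l, xx: sigmoid(l - xx),
                lambda l, xx: sigmoid(l - xx - 1.0), x, t, y, rng)
```

In practice each `g_t` would be a neural network and the losses would be minimized by gradient descent; the masking logic is the key structural point.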

Assumption 3 (Covariate Shift).

Both potential outcome distributions, $p(Y(0) \mid X = x)$ and $p(Y(1) \mid X = x)$, are independent of the covariate distribution $p(X)$.

To summarize, Assumption 2 connects the conditional distribution, $p(Y \mid X = x, T = t)$, to the potential outcome distribution on each treatment group's covariate space, whereas Assumption 3 generalizes the potential outcome distributions from each group's covariate space to the full space $\mathcal{X}$. Assumption 3 is often assumed in the analysis frameworks for causal neural networks (Johansson et al., 2020). Given our assumptions, we state:

Proposition 1 (Optimal solution for $g_0$ and $g_1$).

When the distribution of $\pi$ covers the full outcome space, the functions $g_0$ and $g_1$ that minimize the combined loss are optimal when they are equivalent to the conditional CDFs of $Y(0)$ and $Y(1)$, such that $g_t(\ell, x) = p(Y(t) \leq \ell \mid X = x)$.

Proposition 2 (Consistency of $g_0$ and $g_1$).

Assume the ground-truth CDF functions for $Y(0)$ and $Y(1)$ satisfy Lipschitz continuity and that $\pi$ covers the full outcome space. Denote the ground truth as $g_0^*$ and $g_1^*$. As the sample size $n \to \infty$, the finite-sample estimators $\hat{g}_0$ and $\hat{g}_1$ have the following consistency property: $d(\hat{g}_t, g_t^*) \to 0$ under some metric $d$, such as the $L_2$ norm.

Taken together, these propositions state that the CDF estimators $g_0$ and $g_1$ inherit the large-sample properties of CN for estimating potential outcome distributions.

Adjustment for Treatment Group Imbalance

One obstacle for causal inference is treatment group imbalance, where the covariate distributions $p(X \mid T = 0)$ and $p(X \mid T = 1)$ significantly differ. In the asymptotic regime (Proposition 2), we depend only on overlap (Assumption 1) holding, as any region with positive density will eventually be densely covered with samples. For finite samples, however, this imbalance hurts generalization. Thus, we propose a new adjustment scheme that alleviates this imbalance and further improves CCN.

Succinctly, the adjustment method encodes the original covariates into a more balanced space that maintains relevant predictive information. The new covariate space is represented as $(\Phi(x), \hat{e}(x))$, with a neural network $\Phi$ that transforms the input into a $k$-dimensional latent space, and a propensity estimator $\hat{e}(x)$ to predict the propensity of treatment assignment, $p(T = 1 \mid X = x)$.

While $\hat{e}(x)$ may seem redundant given $\Phi(x)$, including it in the latent covariate space implicitly encourages propensity score stratification. We denote the g-loss trained with this new covariate space as g-loss$_\Phi$. The full loss can then be expressed as the sum of g-loss$_\Phi$ and terms to train $\Phi$ and $\hat{e}$,

$$\text{Full-loss} = \text{g-loss}_\Phi + \alpha \cdot \text{Wass-loss} + \beta \cdot \text{Assignment-loss}.$$

This full network is sketched in Figure 1, showing the training of the different functions with respect to their losses. The Wass-loss and Assignment-loss are fully described below. The tuning parameters $\alpha$ and $\beta$ vary the relative importance of the Wass-loss and Assignment-loss during the learning of $\Phi$. If $\alpha = \beta = 0$ and we remove the latent representation and propensity, this structure reduces to the regular CCN.

Wass-loss to Alleviate the Treatment Group Imbalance

The Wass-loss is motivated by CounterFactual Regression (CFR) implemented with the Wasserstein distance (Shalit, Johansson, and Sontag, 2017). CFR is a causal estimator based on a latent representation $\Phi(x)$. The goal is to find latent representations where the groups $p(\Phi(x) \mid T = 0)$ and $p(\Phi(x) \mid T = 1)$ are more balanced than in the original covariate space while remaining predictive of the potential outcomes. We use the Wasserstein-1 distance, which represents the total “work” required to transform one distribution into another (Vallender, 1974). Through the Kantorovich–Rubinstein duality (Villani, 2008), this distance between distributions $p$ and $q$ is

$$W_1(p, q) = \sup_{\|h\|_L \leq 1} \; \mathbb{E}_{z \sim p}[h(z)] - \mathbb{E}_{z \sim q}[h(z)],$$

where the supremum is over the family of 1-Lipschitz functions. We approximate this distance by adopting the approach of Arjovsky, Chintala, and Bottou (2017), which turns the problem into a min–max regime over a parameterized critic $h$. The critic $h$ is parameterized by a small neural network, and the Lipschitz constraint on $h$ is enforced through weight clipping. The Wass-loss penalizes differences in the latent space between the treatment and control groups, which improves generalization between the groups. However, finding such a balanced representation becomes increasingly challenging with increased dimensionality (Shi, Blei, and Veitch, 2019). Moreover, this representation by itself does not account for the treatment assignment mechanism (propensity score), which has the potential to reduce bias and confounding effects (Austin, 2011).
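For intuition about the quantity being penalized: in one dimension the Wasserstein-1 distance has a simple primal form in which optimal transport pairs sorted samples. This pure-numpy sketch verifies it on two Gaussians; the paper's actual approximation uses a weight-clipped neural critic on the latent space, which is not reproduced here:

```python
import numpy as np

def w1_sorted(a, b):
    """Wasserstein-1 between two equal-size 1-D samples: optimal transport
    pairs the sorted samples, so W1 is the mean absolute sorted difference."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

rng = np.random.default_rng(2)
a = rng.normal(0.0, 1.0, 20000)
b = rng.normal(1.5, 1.0, 20000)
dist = w1_sorted(a, b)
# For two Gaussians with equal variance, W1 equals the mean shift (1.5 here).
```

The critic-based dual estimate targets this same quantity but scales to the multi-dimensional latent space $\Phi(x)$, where sorting is unavailable.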

Assignment-loss and propensity score stratification

The Assignment-loss is inspired by Dragonnet (Shi, Blei, and Veitch, 2019), a recent deep learning method that incorporates treatment assignment to help learn a latent representation that predicts the Average Treatment Effect (ATE). It is defined as the binary cross-entropy loss (BCE) on predicting the treatment assignment label $t$ with $\hat{e}(x)$,

$$\text{Assignment-loss} = \mathbb{E}_{x, t}\big[\,\text{BCE}\big(t,\; \hat{e}(x)\big)\big].$$

We extend this technique by incorporating the estimated propensity directly into our covariate space, which encourages propensity score stratification. Using propensity scores in the predictive model is an implicit form of stratification that reduces the bias of point estimation by facilitating information sharing within sample strata (Hahn, Murray, and Carvalho, 2020).
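The Assignment-loss itself is an ordinary binary cross-entropy between treatment labels and estimated propensities. A minimal sketch with hand-picked propensity values for illustration:

```python
import numpy as np

def assignment_loss(t, e_hat, eps=1e-7):
    """BCE between treatment labels t and estimated propensities e_hat(x)."""
    p = np.clip(e_hat, eps, 1.0 - eps)
    return float(-np.mean(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)))

t = np.array([1.0, 0.0, 1.0, 0.0])
informative = assignment_loss(t, np.array([0.9, 0.1, 0.8, 0.2]))
uninformative = assignment_loss(t, np.array([0.5, 0.5, 0.5, 0.5]))
# An informative propensity estimate attains a lower loss than a constant one,
# which is what pushes e_hat(x) toward the true assignment mechanism.
```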

Finally, we note that Shi, Blei, and Veitch (2019) also included a targeted loss term, which relates to double-robustness methods in causal inference (Bang and Robins, 2005). It does not directly apply to the CCN framework and is not included.

4 Related Work

Below, we discuss three branches of related research.

ITE estimation and representation learning. A common approach to combining causal inference and machine learning is the matching strategy, which identifies pairs of similar individuals (Rubin, 1973; Rosenbaum and Rubin, 1983; Li and Fu, 2017; Schwab, Linhardt, and Karlen, 2018). This idea also motivates many tree-based methods that identify similar individuals within automatically-identified regions of the covariate space (Liaw and Wiener, 2002; Zhang and Lu, 2012; Athey and Imbens, 2016; Wager and Athey, 2018). Adversarial methods have attempted to balance treatment groups by learning a treatment-invariant space (Johansson et al., 2020; Johansson, Shalit, and Sontag, 2016; Du et al., 2019). Alternatively, methods have been used to explicitly encode treatment propensity information in the learned representations (Shi, Blei, and Veitch, 2019). Representation learning can also be combined with weighting to enforce additional covariate balance (Assaad et al., 2021). These methods largely focus on estimating the difference of the means of the outcome distributions rather than the full distributions (some exceptions are noted below). As noted by Park et al. (2021), the first-moment difference reflected in the ITE may be insufficient to reflect the full picture of different treatment regimes. In contrast, we estimate full distributions to assess the utility and confidence of a decision.

Potential outcome distribution sketching. Bayesian methods have been used to estimate outcome distributions, including Gaussian Processes (Alaa and van der Schaar, 2017), Bayesian dropout (Alaa, Weisz, and Van Der Schaar, 2017), and Bayesian Additive Regression Trees (BART) (Chipman, George, and McCulloch, 2010). BART has gained popularity in recent years and has been the focus of further modifications, including variations to account for regions with poor overlap (Hahn, Murray, and Carvalho, 2020). However, Bayesian methods are subject to model mis-specification (Walker, 2013), which can create problems when the outcome distribution does not match model assumptions (e.g., the actual outcome distribution is non-Gaussian). Variational methods are capable of drawing inferences from complex structures, such as the Causal Effect Variational Autoencoder (CEVAE) and its extensions (Louizos et al., 2017; Jesson et al., 2020); hybrid architectures are sometimes adopted to account for certain types of missing data mechanisms (Hassanpour and Greiner, 2020). Frequentist approaches can also achieve flexible representations of distributions. A well-known adaptation is the generalized additive model for location, scale and shape (GAMLSS), which estimates the parameters of a baseline distribution with up to three transformations given a specific family (Briseño Sanchez et al., 2020; Hohberg, Pütz, and Kneib, 2020). Alternatively, the CDF may be estimated nonparametrically by weighting the empirical count of data points in a given neighborhood, adapting nearest-neighbor approaches (Shen, 2019). However, this approach is less smooth and accurate in regions with treatment group imbalance due to the lack of neighboring points. GAN-inspired methods, including GANITE (Yoon, Jordon, and van der Schaar, 2018), can also learn non-Gaussian outcome distributions. There is also emerging literature on conformal prediction in treatment effect estimation (Lei and Candès, 2020; Chernozhukov, Wüthrich, and Zhu, 2021). However, conformal prediction targets a specific level of coverage and cannot estimate an expected utility. Additionally, its coverage guarantees are proved for populations rather than individuals.

                  CCN          Adj-CCN      GANITE       CEVAE        BART         GAMLSS       CF
PEHE              1.59 ± .16   1.30 ± .19   2.40 ± .40   2.60 ± .10   2.23 ± .33   3.00 ± .39   3.52 ± .57
LL               -1.78 ± .02  -1.64 ± .02   *           -2.82 ± .08  -1.99 ± .08  -2.34 ± .13   NA
AUC (Linear)      .925 ± .011  .942 ± .010  .723 ± .017  .523 ± .008  .923 ± .009  .930 ± .10   .896 ± .009
AUC (Threshold)   .913 ± .011  .935 ± .010   *           .564 ± .010  .917 ± .009  .925 ± .10   NA

Table 1: Quantitative results on IHDP. Each metric's mean and standard error are reported. *GANITE is only used for estimating the ITE, as it is relatively challenging to optimize on this small dataset according to Yoon, Jordon, and van der Schaar (2018). CF estimates only the ITE, so its distribution metrics are NA.

Policy learning and utility functions. A key purpose of estimating individual causal effects is to serve personalized decisions. This aligns with the goal of policy learning: to identify the policy (treatment) that benefits an individual most (Kallus and Zhou, 2018; Qian and Murphy, 2011; Bertsimas et al., 2017). A common strategy is to express the policy as a function of the covariate feature space and learn the policy that optimizes the utility (Beygelzimer and Langford, 2009). Often, utilities studied in policy learning are linear transformations of the potential outcomes, which can be described as the difference between benefit and cost (Athey and Wager, 2021). Unfortunately, the observed utility may be subject to information loss according to the Data Processing Inequality (Beaudry and Renner, 2012) (e.g., binarization of a continuous variable greatly reduces information). Additionally, policy learning requires every individual to share a utility function, rather than allowing personalization.

5 Experiments

We follow established literature and use semi-synthetic scenarios to assess individualized causal effects. First, we use the Infant Health and Development Program (IHDP) (Hill, 2011), where the outcome of each subject is simulated under a standard Gaussian distribution with a heterogeneous treatment effect. This first situation describes the ideal scenario for many methods, including BART. The second example is based on a field experiment in India studying the impact of education (EDU). In this case, we synthesize each individual outcome with heterogeneous effect and variability using a non-Gaussian distribution. The semi-synthetic procedures are briefly outlined below, with full details in Appendix B. In addition, we briefly describe several additional synthetic data experiments and ablation studies below and include full details in the appendices.

We include our base approach, CCN, and the version adjusted for imbalance, Adj-CCN. We compare to existing approaches that estimate potential outcome distributions, including Bayesian approaches (CEVAE (Louizos et al., 2017) and BART (Hill, 2011)), a frequentist approach, GAMLSS (Hohberg, Pütz, and Kneib, 2020), and a GAN-based approach, GANITE (Yoon, Jordon, and van der Schaar, 2018). Causal Forests (CF) (Wager and Athey, 2018) is also benchmarked for non-distribution metrics as a popular recent ITE-only method. GAMLSS's flexibility and strength in estimating distributions are dependent on a close match to the true underlying distribution families, which is normally unknown in practice. We evaluated two scenarios, one where GAMLSS uses a flexible distribution family and another where GAMLSS is provided the closest possible distribution, meaning that GAMLSS is provided more information than any other method. However, GAMLSS with a flexible distribution did not converge well and gave uncompetitive performance, so only results for the second scenario are reported.

Full implementation details and specifications for all models are given in Appendix C. We use a standard neural network architecture for CCN, but detail an alternative structure that enforces a monotonic constraint in Appendix D. Code to replicate all experiments has been included and will be released with an MIT license if accepted.


We evaluate how well each method captures the mean of the distribution through the Precision in Estimation of Heterogeneous Effect (PEHE) and the full distribution by estimating the log-likelihood (LL) of the predictions. We view LL as the key metric here, since it evaluates full distributions. In addition, we evaluate how well each method makes decisions through the Area Under the Curve (AUC) for chosen utility functions, to demonstrate that improved distributional estimates lead to improved decisions. Full mathematical definitions of the metrics are given in Appendix E.
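Of these metrics, PEHE has a particularly compact form: the root-mean-squared error between true and estimated individual effects. A sketch, with the square-root convention following common usage (exact definitions are in Appendix E):

```python
import numpy as np

def pehe(tau_true, tau_hat):
    """Precision in Estimation of Heterogeneous Effect, reported here as the
    RMSE between true and estimated individual treatment effects."""
    diff = np.asarray(tau_true) - np.asarray(tau_hat)
    return float(np.sqrt(np.mean(diff ** 2)))

perfect = pehe([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])   # exact recovery gives 0
off = pehe([0.0, 0.0], [3.0, 4.0])                 # sqrt((9 + 16) / 2)
```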


The Infant Health and Development Program (IHDP) was originally a randomized experiment that was modified to mimic an observational study by removing a nonrandom portion of the treatment group. We use response surface B introduced in Hill (2011) for heterogeneous treatment effects. The study consists of 747 subjects (139 in the treated group) with 19 binary and 6 continuous variables (25 covariates in total). We utilize 100 replications of the data for out-of-sample evaluation, following the simulation process of Shalit, Johansson, and Sontag (2017).

The quantitative results are in Table 1. CCN overall outperforms the other competing methods in both mean and distribution metrics. Its edge in these metrics can be attributed to the relative robustness of collaborating networks to outliers and overfitting on small data (Zhou et al., 2020). The advantage of combining the imbalance adjustment strategy is evident, as Adj-CCN clearly improves over CCN. It is worth noting that the LL calculated under the ground-truth model is -1.41, demonstrating that Adj-CCN is highly effective in capturing the potential outcome distributions.

        CCN            Adj-CCN        GANITE         CEVAE          BART           GAMLSS         CF
PEHE     .392 ± .049    .332 ± .042   1.253 ± .181   1.911 ± .351    .534 ± .042    .314 ± .053   1.022 ± .051
LL     -2.178 ± .024  -2.130 ± .017  -6.479 ± .992  -3.558 ± .055  -2.443 ± .063  -2.250 ± .025   NA
AUC      .933 ± .026    .949 ± .027    .760 ± .053    .622 ± .039    .906 ± .015    .941 ± .010    NA

Table 2: Quantitative results on the EDU dataset. Each metric's mean and standard error are reported.

We evaluate two sets of utility functions: a linear utility and a non-linear threshold utility. As the ATE for surface B is 4 (Hill, 2011), the two setups reflect whether a method can correctly distinguish an individual's treatment effect from the population level. Table 1 shows AUC (Linear) and AUC (Threshold) for these utilities, demonstrating that CCN's improved distribution estimates contribute to accurate and competitive decisions, despite the fact that a homoskedastic Gaussian distribution is well matched to BART and GAMLSS.

We note that recent alternative methods have shown improvements on PEHE estimation in IHDP (Zhang, Bellot, and Schaar, 2020; Assaad et al., 2021); however, our focus here is on techniques that estimate the full distribution of individualized treatment effects.


This semi-synthetic dataset is based on a randomized field experiment in India between 2011 and 2012 (Banerji, Berry, and Shotland, 2017, 2019). The experiment studied whether providing a mother with adult education benefits her children's learning. We define the binary treatment as whether a mother received adult education and the continuous outcome as the difference between the final and the baseline test scores. The total sample size is 8,627, with 18 continuous covariates and 14 binary covariates (32 covariates in total).

We create a semi-synthetic dataset over the two potential outcomes by the following procedure. We first train two neural networks on the observed outcomes of the control and treated populations to serve as mean functions. The uncertainty models for the control and treated groups are based on a Gaussian distribution and an exponential distribution, respectively, which helps demonstrate that the structure of CCN can automatically adapt to different families. We scale the variability by an indicator of whether the mother had received any previous school education, as we hypothesize the variability is higher for mothers without previous education. The final potential outcomes combine these mean functions with the group-specific noise distributions.

The treatment group imbalance comes from two aspects. First, a logistic treatment assignment model in which we assign large coefficients to induce imbalance. Second, truncation: we remove well-balanced subjects whose estimated propensities fall in a central range. We keep 1,000 samples for evaluation and use the rest for training. The full procedure is repeated 10 times to generate the variability assessment.
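The imbalance mechanism can be sketched as follows; the coefficient values and the truncation bounds here are illustrative placeholders, since the exact numbers used in the experiment are not reproduced in this section:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5000, 3
x = rng.normal(size=(n, d))
beta = np.array([2.0, -2.0, 1.5])          # large coefficients induce imbalance
e = 1.0 / (1.0 + np.exp(-(x @ beta)))      # logistic propensity model
t = rng.binomial(1, e)
keep = (e < 0.3) | (e > 0.7)               # drop well-balanced (middle-propensity) units
x, t, e = x[keep], t[keep], e[keep]
# The retained sample over-represents extreme propensities, making the
# covariate distributions of the two arms differ sharply.
```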

The utility function is customized for each subject to mimic personalized decisions. For subject $i$, the utility is a threshold function on the final score, with the threshold raised for mothers without previous education. The interpretation of this utility is that each mother hopes the probability of her child scoring above a certain threshold will increase; for mothers without previous education, the expectations are higher. This design coincides with the expectation that education should have a positive effect on the final score in exchange for a finite cost, and we would only invest in the intervention for a positive return. Table 2 summarizes the evaluation results. As in IHDP, the CCN methods flexibly model different distributions, with Adj-CCN providing some improvement over CCN. Notably, GAMLSS was provided additional help by specifying the best distribution family a priori. Despite this, it improved over Adj-CCN in PEHE only, not in LL or the decision metric.

(a) Adj-CCN
Figure 2: The estimated versus true 90% interval widths given the four combinations of treatment group and prior-education indicator. CCN and GANITE reflect the main trend of how uncertainties change across scenarios. Adj-CCN can clearly discern the four scenarios by aligning its estimated interval widths with the true interval widths.

The ability to make an optimal decision depends heavily on how closely the estimated distribution aligns with the true distribution and all relevant heterogeneity. Thus, the AUCs follow their respective LLs, with the CCN-based methods performing well. CEVAE has two facets of model misspecification that hurt performance: (i) its homogeneous Gaussian error, and (ii) it decodes the continuous covariates into a Gaussian distribution, whereas the real outcomes come from heteroskedastic Gaussian or exponential distributions. Hence, CEVAE captures the marginal distributions well but does not provide helpful personalized suggestions.

(a) CCN
(b) Adj-CCN
(e) BART
Figure 3: Visualization of a random sample of EDU. Adj-CCN follows the theoretical curve and the rest diverge.

Next, we randomly draw a data sample and compare the estimated CDFs against the true CDFs in Figure 3 (see Figure S1 for additional samples). We find that the CCN-based approaches closely recover the true CDFs for random individuals, whereas the other methods have gaps in their estimation. GAMLSS is accurate on the control group but not the treatment group. This is partly because the GAMLSS package (Rigby and Stasinopoulos, 2005) does not support the exponential distribution with a location shift, so a skewed normal distribution was chosen as the closest reasonable substitute. Overall, GAMLSS is flexible but requires precise specification on a case-by-case basis, whereas CCN can robustly use the same approach.

Another aspect to assess is whether a method captures the heteroskedasticity of the outcome distributions. The combination of treatment group and prior-education indicator produces four uncertainty models. We visualize the predictive 90% interval widths from each method in Figure 2. Adj-CCN captures the bimodal nature of the interval widths. In contrast, GANITE only captures a small fraction of the difference between the low- and high-variance cases. Both CEVAE (Louizos et al., 2017) and BART (Hill, 2011) did not capture the heterogeneity and are not shown. Because GAMLSS required restricting which variables affect its uncertainty model in order to converge effectively, it produced limited variability in interval width. In summary, the ground-truth interval widths are 1.47, 1.64, 2.94 and 3.29, while GAMLSS estimates 0.92, 1.52, 3.37 and 3.37. CCN and its variants are more accurate.
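For reference, such interval widths can be read directly off an estimated CDF by differencing its 5% and 95% quantiles; a unit-variance Gaussian, for instance, yields the familiar width of roughly 2 × 1.645 ≈ 3.29. A sketch in which the grid-evaluated Gaussian CDF is an illustrative stand-in for a trained $g_t$:

```python
import numpy as np
from math import erf

def central_interval_width(cdf_vals, grid, level=0.90):
    """Width of the central predictive interval implied by a grid-evaluated
    CDF: distance between the (1-level)/2 and (1+level)/2 quantiles."""
    lo = grid[np.searchsorted(cdf_vals, (1.0 - level) / 2.0)]
    hi = grid[np.searchsorted(cdf_vals, (1.0 + level) / 2.0)]
    return float(hi - lo)

grid = np.linspace(-8.0, 8.0, 8001)
cdf = 0.5 * (1.0 + np.vectorize(erf)(grid / np.sqrt(2.0)))
width = central_interval_width(cdf, grid)  # ~3.29 for a standard normal
```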

Additional Comparisons and Properties

There are several additional experiments included in the appendices. First, we compare to policy learning algorithms in Appendix G. While policy learning is not directly comparable to the other approaches, we can compare decisions for a utility function defined a priori. Specifically, we conducted experiments with policytree (Sverdrup et al., 2021) against Adj-CCN on the IHDP dataset. We set up two scenarios, one with a linear utility and one with a threshold utility. Adj-CCN performed well in both, with over 80% accuracy. However, policytree's accuracy dropped from 77% to 58% when we switched to the threshold utility. These results show that learning the full potential outcome distribution improves upon a policy learning approach, despite the potential disadvantage of not having access to the utility function during training.

(a) Adj-CCN
Figure 4: Visualization of estimated density on the potential outcome distributions for multi-modal outcomes.

We also include additional experiments on synthetic data with a variety of distributions in Appendix H, including Gumbel, Gamma, and Weibull distributions. The results are qualitatively similar to the previously presented semi-synthetic cases: CCN straightforwardly adapts to these distributions and Adj-CCN provides additional improvements. GAMLSS is comparable when provided the true distribution family a priori, but falls behind when given a more generic distribution family. While we chose relatively standard distributions for our semi-synthetic experiments to match the existing literature, we also wanted to highlight the flexibility of CCN. Thus, we evaluated a multi-modal outcome distribution in Appendix I. As expected, CCN and Adj-CCN naturally adjust to the multi-modal space, whereas the competing methods do not naturally handle this case, as shown briefly in Figure 4 and for all methods in Figure S3. GANITE captures some of the mixture structure here, but because its objective trades off the GAN loss against a supervised loss, it has no asymptotic guarantee of approximating the true distribution.

Finally, we note that there are several components of the adjustment strategy. We perform an ablation study in Appendix J, which suggests that no single part subsumes the others, but including all three adjustment strategies builds robustness. We additionally assess on which data samples Adj-CCN improves over CCN by visualizing performance as a function of propensity on the synthetic datasets described in Appendix J. These results demonstrate that the primary advantage of the adjustment strategy is in improving estimates at very high or low propensities, and that the strategies have similar performance in well-balanced cases.

6 Discussion

CCN is a novel framework to estimate individualized potential outcome distributions, with novel theoretical proofs and a new adjustment method to address treatment group imbalance. Empirically, CCN is effective in inferring individualized causal effects and is relatively robust to treatment group imbalance in semi-synthetic experiments. We demonstrate that improving distribution estimates leads to improved decision-making even without a priori access to utility functions.

Like nearly all causal methods, we are dependent on Assumptions 1, 2, and 3. Evaluating whether these assumptions hold in practice is non-trivial and requires close interaction with domain experts. Despite these limitations, we are encouraged by CCN's theoretical backing and empirical performance in our evaluations.


Research reported in this manuscript was supported by the National Institute of Biomedical Imaging and Bioengineering and the National Institute of Mental Health through the National Institutes of Health BRAIN Initiative under Award Number R01EB026937 and the National Institute of Mental Health of the National Institutes of Health under Award Number R01MH125430. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. We would also like to thank all members of the Carlson lab for proofreading and providing valuable comments on this manuscript.


  • Alaa and van der Schaar (2017) Alaa, A. M.; and van der Schaar, M. 2017. Bayesian inference of individualized treatment effects using multi-task gaussian processes. Advances in Neural Information Processing Systems, 3424–3432.
  • Alaa, Weisz, and Van Der Schaar (2017) Alaa, A. M.; Weisz, M.; and Van Der Schaar, M. 2017. Deep counterfactual networks with propensity-dropout. Proceedings of International Conference on Machine Learning.
  • Arjovsky, Chintala, and Bottou (2017) Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein generative adversarial networks. International Conference on Machine Learning.
  • Assaad et al. (2021) Assaad, S.; Zeng, S.; Tao, C.; Datta, S.; Mehta, N.; Henao, R.; Li, F.; and Carin, L. 2021. Counterfactual representation learning with balancing weights. Artificial Intelligence and Statistics.
  • Athey and Imbens (2016) Athey, S.; and Imbens, G. 2016. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27): 7353–7360.
  • Athey and Wager (2021) Athey, S.; and Wager, S. 2021. Policy learning with observational data. Econometrica, 89(1): 133–161.
  • Austin (2011) Austin, P. C. 2011. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46(3): 399–424.
  • Banerji, Berry, and Shotland (2017) Banerji, R.; Berry, J.; and Shotland, M. 2017. The impact of maternal literacy and participation programs: Evidence from a randomized evaluation in india. American Economic Journal: Applied Economics, 9(4): 303–37.
  • Banerji, Berry, and Shotland (2019) Banerji, R.; Berry, J.; and Shotland, M. 2019. The impact of mother literacy and participation programs on child learning: evidence from a randomized evaluation in India.
  • Bang and Robins (2005) Bang, H.; and Robins, J. M. 2005. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4): 962–973.
  • Beaudry and Renner (2012) Beaudry, N. J.; and Renner, R. 2012. An intuitive proof of the data processing inequality. Quantum Information & Computation, 12(5-6): 432–441.
  • Bertsimas et al. (2017) Bertsimas, D.; Kallus, N.; Weinstein, A. M.; and Zhuo, Y. D. 2017. Personalized diabetes management using electronic medical records. Diabetes care, 40(2): 210–217.
  • Beygelzimer and Langford (2009) Beygelzimer, A.; and Langford, J. 2009. The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 129–138.
  • Briseño Sanchez et al. (2020) Briseño Sanchez, G.; Hohberg, M.; Groll, A.; and Kneib, T. 2020. Flexible instrumental variable distributional regression. Journal of the Royal Statistical Society: Series A (Statistics in Society), 183(4): 1553–1574.
  • Chernozhukov, Wüthrich, and Zhu (2021) Chernozhukov, V.; Wüthrich, K.; and Zhu, Y. 2021. An exact and robust conformal inference method for counterfactual and synthetic controls. Journal of the American Statistical Association, 1–44.
  • Chipman and McCulloch (2016) Chipman, H.; and McCulloch, R. 2016. BayesTree: Bayesian Additive Regression Trees. R package version 1.4.
  • Chipman, George, and McCulloch (2007) Chipman, H. A.; George, E. I.; and McCulloch, R. E. 2007. Bayesian ensemble learning. In Advances in neural information processing systems, 265–272.
  • Chipman, George, and McCulloch (2010) Chipman, H. A.; George, E. I.; and McCulloch, R. E. 2010. BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1): 266–298.
  • Dehejia (2005) Dehejia, R. H. 2005. Program evaluation as a decision problem. Journal of Econometrics, 125(1-2): 141–173.
  • Ding and Li (2018) Ding, P.; and Li, F. 2018. Causal inference: A missing data perspective. Statistical Science, 33(2): 214–237.
  • Du et al. (2019) Du, X.; Sun, L.; Duivesteijn, W.; Nikolaev, A.; and Pechenizkiy, M. 2019. Adversarial balancing-based representation learning for causal effect inference with observational data. arXiv preprint arXiv:1904.13335.
  • Hahn, Murray, and Carvalho (2020) Hahn, P. R.; Murray, J. S.; and Carvalho, C. 2020. Bayesian Regression Tree Models for Causal Inference: Regularization, Confounding, and Heterogeneous Effects (with Discussion). Bayesian Analysis, 15(3): 965–1056.
  • Han and Moraga (1995) Han, J.; and Moraga, C. 1995. The influence of the sigmoid function parameters on the speed of backpropagation learning. In International Workshop on Artificial Neural Networks, 195–201. Springer.
  • Hassanpour and Greiner (2020) Hassanpour, N.; and Greiner, R. 2020. Variational auto-encoder architectures that excel at causal inference. Advances in Neural Information Processing Systems.
  • Hill (2011) Hill, J. L. 2011. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1): 217–240.
  • Hohberg, Pütz, and Kneib (2020) Hohberg, M.; Pütz, P.; and Kneib, T. 2020. Treatment effects beyond the mean using distributional regression: Methods and guidance. PloS one, 15(2): e0226514.
  • Imbens and Rubin (2015) Imbens, G. W.; and Rubin, D. B. 2015. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.
  • Jesson et al. (2020) Jesson, A.; Mindermann, S.; Shalit, U.; and Gal, Y. 2020. Identifying causal-effect inference failure with uncertainty-aware models. Advances in Neural Information Processing Systems, 33.
  • Johansson, Shalit, and Sontag (2016) Johansson, F.; Shalit, U.; and Sontag, D. 2016. Learning representations for counterfactual inference. International Conference on Machine Learning, 3020–3029.
  • Johansson et al. (2020) Johansson, F. D.; Shalit, U.; Kallus, N.; and Sontag, D. 2020. Generalization bounds and representation learning for estimation of potential outcomes and causal effects. arXiv preprint arXiv:2001.07426.
  • Joyce (1999) Joyce, J. M. 1999. The foundations of causal decision theory. Cambridge University Press.
  • Kallus and Zhou (2018) Kallus, N.; and Zhou, A. 2018. Confounding-Robust policy improvement. Advances in Neural Information Processing Systems, 31.
  • Kullback and Leibler (1951) Kullback, S.; and Leibler, R. 1951. On information and sufficiency. The Annals of Mathematical Statistics, 22: 79–86.
  • Lei and Candès (2020) Lei, L.; and Candès, E. J. 2020. Conformal inference of counterfactuals and individual treatment effects. arXiv preprint arXiv:2006.06138.
  • Li and Fu (2017) Li, S.; and Fu, Y. 2017. Matching on balanced nonlinear representations for treatment effects estimation. Advances in Neural Information Processing Systems, 929–939.
  • Liaw and Wiener (2002) Liaw, A.; and Wiener, M. 2002. Classification and regression by randomForest. R news, 2(3): 18–22.
  • Louizos et al. (2017) Louizos, C.; Shalit, U.; Mooij, J. M.; Sontag, D.; Zemel, R.; and Welling, M. 2017. Causal effect inference with deep latent-variable models. Advances in Neural Information Processing Systems, 6446–6456.
  • Lunceford and Davidian (2004) Lunceford, J. K.; and Davidian, M. 2004. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine, 23(19): 2937–2960.
  • Park et al. (2021) Park, J.; Shalit, U.; Schölkopf, B.; and Muandet, K. 2021. Conditional Distributional Treatment Effect with Kernel Conditional Mean Embeddings and U-Statistic Regression. CoRR, abs/2102.08208.
  • Pennings and Smidts (2003) Pennings, J. M.; and Smidts, A. 2003. The shape of utility functions and organizational behavior. Management Science, 49(9): 1251–1263.
  • Qian and Murphy (2011) Qian, M.; and Murphy, S. A. 2011. Performance guarantees for individualized treatment rules. Annals of statistics, 39(2): 1180.
  • Rigby and Stasinopoulos (2005) Rigby, R. A.; and Stasinopoulos, D. M. 2005. Generalized additive models for location, scale and shape (with discussion). Applied Statistics, 54: 507–554.
  • Rosenbaum and Rubin (1983) Rosenbaum, P. R.; and Rubin, D. B. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1): 41–55.
  • Rubin (1973) Rubin, D. B. 1973. Matching to remove bias in observational studies. Biometrics, 159–183.
  • Schwab, Linhardt, and Karlen (2018) Schwab, P.; Linhardt, L.; and Karlen, W. 2018. Perfect match: A simple method for learning representations for counterfactual inference with neural networks. arXiv preprint arXiv:1810.00656.
  • Shalit, Johansson, and Sontag (2017) Shalit, U.; Johansson, F. D.; and Sontag, D. 2017. Estimating individual treatment effect: generalization bounds and algorithms. In International Conference on Machine Learning, 3076–3085. PMLR.
  • Shen (2019) Shen, S. 2019. Estimation and inference of distributional partial effects: theory and application. Journal of Business & Economic Statistics, 37(1): 54–66.
  • Shi, Blei, and Veitch (2019) Shi, C.; Blei, D.; and Veitch, V. 2019. Adapting neural networks for the estimation of treatment effects. In Advances in Neural Information Processing Systems, 2507–2517.
  • Sverdrup et al. (2021) Sverdrup, E.; Kanodia, A.; Zhou, Z.; Athey, S.; and Wager, S. 2021. policytree: Policy Learning via Doubly Robust Empirical Welfare Maximization over Trees. R package version 1.1.1.
  • Tibshirani, Athey, and Wager (2020) Tibshirani, J.; Athey, S.; and Wager, S. 2020. grf: Generalized Random Forests. R package version 1.2.0.
  • Vallender (1974) Vallender, S. 1974. Calculation of the Wasserstein distance between probability distributions on the line. Theory of Probability & Its Applications, 18(4): 784–786.
  • Villani (2008) Villani, C. 2008. Optimal transport: old and new, volume 338. Springer Science & Business Media.
  • Wager and Athey (2018) Wager, S.; and Athey, S. 2018. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523): 1228–1242.
  • Walker (2013) Walker, S. G. 2013. Bayesian inference with misspecified models. Journal of Statistical Planning and Inference, 143(10): 1621–1633.
  • Yazdani and Boerwinkle (2015) Yazdani, A.; and Boerwinkle, E. 2015. Causal inference in the age of decision medicine. Journal of Data Mining in Genomics & Proteomics, 6(1).
  • Yoon, Jordon, and van der Schaar (2018) Yoon, J.; Jordon, J.; and van der Schaar, M. 2018. GANITE: Estimation of individualized treatment effects using generative adversarial nets. International Conference on Learning Representations.
  • Zhang and Lu (2012) Zhang, G.; and Lu, Y. 2012. Bias-corrected random forests in regression. Journal of Applied Statistics, 39(1): 151–160.
  • Zhang, Bellot, and Schaar (2020) Zhang, Y.; Bellot, A.; and Schaar, M. 2020. Learning overlapping representations for the estimation of individualized treatment effects. In International Conference on Artificial Intelligence and Statistics, 1005–1014. PMLR.
  • Zhou et al. (2020) Zhou, T.; Li, Y.; Wu, Y.; and Carlson, D. 2020. Estimating uncertainty intervals from collaborating networks. arXiv preprint arXiv:2002.05212.

Appendix A Discussions on Propositions 1 and 2

Zhou et al. (2020) assume that the covariate distributions are the same for training and evaluation in CN. In the observational causal setting, the treated, control, and full covariate distributions all differ. Hence, the central challenge in migrating CN’s properties to CCN is to show its robustness to covariate space imbalance (or mismatch). We show that CCN is robust under a certain type of covariate space mismatch. First, we explore the properties of CCN in the presence of covariate space mismatch. Second, we expand on how CCN, under the strong ignorability and covariate shift assumptions, satisfies this type of covariate space mismatch.

Here, we restate the two propositions from the main article. Note that they are both claimed on the full covariate space where the covariate density is positive.

Proposition 1 (Optimal solution for and ). When the space searching tool is able to cover the full outcome space, the functions and that minimize are optimal when they are equivalent to the conditional CDF of and , such that .

Proposition 2 (Consistency of and ). Assume the ground truth CDF functions for satisfy Lipschitz continuity. Denote the ground truth as and . As , the finite sample estimators and have the following consistency property: under some metric such as norm, and with the space searching tool being able to cover the full outcome space.

Restating Claims from Zhou et al. (2020)

We first restate two similar propositions in CN under the non-causal setting.

Proposition S1 (Optimal solution for from Zhou et al. (2020)).

Assume that approximates the conditional quantile of (inverse CDF, not necessarily perfect). If spans , then a minimizing (1) is optimal when it is equivalent to the conditional CDF, or such that .

Proposition S2 (Consistency of from Zhou et al. (2020)).

Assume the ground truth CDF functions satisfy Lipschitz continuity. As , the finite sample estimator has the following consistency property: under some metric such as norm, and with capable of searching the full outcome space.

Overcoming Covariate Space Mismatch

Proposition S1 demonstrates that a fixed point solution estimates the correct distributions. Proposition S2 states that the optimal learned function asymptotically estimates the ground truth distribution. However, neither answers whether these properties hold on an evaluation space that differs from the training space.

We address this limitation by developing Proposition S3, which shows that these properties can still be claimed under a certain type of space mismatch. For generality, we distinguish the training space from the evaluation space.

Proposition S3 (The dependency of CN on ).

If the conditional outcome distribution remains invariant between the training and evaluation spaces, and the training space covers the evaluation space, then the solutions in Proposition S1 and the consistency in Proposition S2 also generalize to the evaluation space wherever its density is positive.

Proof of Proposition S3.

With Propositions S1 and S2, we have that the learned function estimates the conditional distribution on the training space. Next we generalize it to an evaluation space. Given the covering condition, any point in the evaluation space with positive density is also covered by the training space. From Propositions S1 and S2, we know that for such points the optimum can be obtained. The covariate shift assumption on the invariance of outcome distributions then guarantees that the optimum at such points in the training space is also the optimum in the evaluation space. Therefore, each point in the evaluation space with positive density attains its optimum, so the optimum generalizes to the evaluation space.

This proposition enables us to extend CN’s optimum to different evaluation spaces under certain conditions. We next show how it applies to the causal setup, where the training and evaluation spaces differ due to imbalance between treatment groups. First, we give a weaker version of Propositions 1 and 2 that does not account for the mismatch between the training and evaluation spaces.

Claim S1 (Potential distributions on each treatment space).

The optimal solutions for the two networks given in (5) capture the CDFs of the corresponding potential outcomes on each treatment group’s covariate space.

Discussion on Claim S1.

We only discuss the first part of Claim S1, without loss of generality, as the other group can be shown with identical steps. The full loss can be expressed as the sum of the two group-specific losses; however, optimizing one group’s loss only involves updating the parameters of that group’s networks.

A direct conclusion from Propositions S1 and S2 is that the optimized network is the fixed point solution and consistently estimates the true CDF of the observed outcome. This is the full conditional distribution rather than the potential outcome distribution. By virtue of the ignorability assumption, the observed conditional outcome and the potential outcome are identically distributed. Therefore, we can successfully establish the estimators for the potential outcome distributions, though currently limited to each treatment subspace. ∎

Given Claim S1, we generalize it back to the full covariate space. This depends on satisfying the condition in Proposition S3, which is verified by the following Lemma S1.

Lemma S1 (Positivity relating to the space generalization).

Under Assumption 1, the equivalent condition holds: , and .

Proof of Lemma S1.

The positivity in Assumption 1 claims that . Then for each from the full covariate space where , we can find a constant that satisfies .

By Bayes rule, . Then . From another direction, if , each component on the right hand side needs to be positive. Therefore, . The same argument holds for . ∎
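In standard notation, with p(x) the covariate density and e(x) = P(T = 1 | X = x) the propensity score, the Bayes-rule step above can be written out explicitly (a reconstruction of the elided symbols, not a verbatim restatement):

```latex
p(x \mid T = t) \;=\; \frac{P(T = t \mid X = x)\, p(x)}{P(T = t)}, \qquad t \in \{0, 1\}.
```

If $p(x) > 0$ and $0 < e(x) < 1$, the numerator is positive for both $t = 0$ and $t = 1$, so $p(x \mid T = 1) > 0$ and $p(x \mid T = 0) > 0$. Conversely, if $p(x \mid T = t) > 0$, every factor on the right-hand side must be positive, giving $p(x) > 0$ and $P(T = t \mid X = x) > 0$.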

With Lemma S1 satisfying the condition in Proposition S3, Propositions 1 and 2 naturally follow. Under Claim S1, we have shown that the optimal solution of CCN estimates the potential outcome distributions on each treatment group’s covariate space. The gap in migrating from these treatment-specific spaces to the full covariate space is now filled by Proposition S3.

Thus, under the standard assumptions in causal inference, the CCN method will capture the full potential outcome distributions. This procedure is not limited to a binary treatment condition, and is extendable to the multiple treatment setup.

Appendix B Semi-synthetic Data Generation


We focus on making our simulation results comparable to the published results of other causal methods. The simulation replications for the IHDP data are downloaded directly from the source used to generate the results of WASS-CFR (Shalit, Johansson, and Sontag, 2017) and CEVAE (Louizos et al., 2017). The dataset does not contain personally identifiable information or offensive content.

Education Data

The raw education data are downloaded from the Harvard Dataverse and consist of 33,167 observations and 378 variables. The dataset does not contain personally identifiable information or offensive content. We pre-process the data by removing and combining repetitive information and deleting covariates with over 25,000 missing values. This yields a clean dataset containing 8,627 observations.

The response surface functions for the treated and control groups are learned from the observed outcomes of each group. They are both designed as single-hidden-layer neural networks with 32 units per layer and sigmoid activation functions (Han and Moraga, 1995). A logistic regression model and the resulting propensity scores are used to generate the treatment labels and mimic an observational setting. The coefficients are randomly generated.
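The generation steps above can be sketched in NumPy. Everything here is a hedged illustration: the paper fits the two response surfaces to observed outcomes, whereas this sketch draws their weights at random, and all sizes and the noise scale are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, hidden = 1000, 10, 32  # illustrative sample size and dimensions

def random_mlp(d, hidden, rng):
    """A single-hidden-layer network with 32 sigmoid units, as in the appendix."""
    W1 = rng.normal(size=(d, hidden))
    b1 = rng.normal(size=hidden)
    W2 = rng.normal(size=hidden)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    return lambda X: sigmoid(X @ W1 + b1) @ W2

mu0 = random_mlp(d, hidden, rng)  # control response surface (fit to data in the paper)
mu1 = random_mlp(d, hidden, rng)  # treated response surface (fit to data in the paper)

X = rng.normal(size=(n, d))
beta = rng.normal(size=d)                        # logistic-regression coefficients
propensity = 1.0 / (1.0 + np.exp(-(X @ beta)))   # P(T = 1 | X)
T = rng.binomial(1, propensity)                  # observational treatment labels
Y = np.where(T == 1, mu1(X), mu0(X)) + rng.normal(scale=0.5, size=n)  # observed outcome
```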

Appendix C Detailed Method Implementations

All Python-based methods (CCN, CEVAE, and GANITE) are run on a single NVIDIA P100 GPU; the R-based methods (BART, CF, and GAMLSS) are run on an Intel(R) Xeon(R) Gold 6154 CPU.

CCN and Adj-CCN

The implementation of CCN and all its variants is based on the code base for CN (Zhou et al., 2020), released under the MIT license. The networks follow the structures in Zhou et al. (2020), and we fix the outcome-sampling distribution as a uniform distribution covering the range of the observed outcomes.

In Wass-CCN and Dragon-CCN, we introduce a latent representation . We set its dimension to 50. It is parametrized through a neural network with a single hidden layer of 100 units.

The Wasserstein distance in the Wass-loss is learned through a critic network with two hidden layers of 100 and 60 units. We adopted the weight clipping strategy with threshold (-0.01, 0.01) to maintain its Lipschitz constraint (Arjovsky, Chintala, and Bottou, 2017). The two penalty hyper-parameters were tuned for Adj-CCN using the log-likelihood calculated on the observed outcomes. We proposed candidate values of 5e-3, 1e-3, 5e-4, 1e-4, 5e-5, and 1e-5, as we did not want these penalties to be large enough to overtake the main part of the loss that learns the distribution. We then performed a grid search to determine the hyper-parameters. Based on the results of the first few simulations, we fixed the penalties at 5e-4 and 1e-5 in IHDP and at 1e-5 and 5e-3 in EDU. We found this specification to consistently improve performance over regular CCN.
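The tuning procedure above amounts to a grid search over the candidate penalty values; in the sketch below, train_and_score is a hypothetical placeholder for fitting Adj-CCN and returning the held-out log-likelihood:

```python
import itertools

# Candidate penalty values listed in the appendix.
candidates = [5e-3, 1e-3, 5e-4, 1e-4, 5e-5, 1e-5]

def train_and_score(lam1, lam2):
    # Placeholder objective for illustration only; a real run would train
    # Adj-CCN with these penalties and return the observed-outcome
    # log-likelihood on held-out data.
    return -(abs(lam1 - 5e-4) + abs(lam2 - 1e-5))

# Evaluate every pair of candidates and keep the best-scoring one.
best = max(itertools.product(candidates, candidates),
           key=lambda pair: train_and_score(*pair))
```

With the placeholder objective above, the search returns the pair (5e-4, 1e-5), mirroring the values fixed for IHDP.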

To access the potential outcome distributions and take expectations over a defined utility function, we draw 3,000 samples for each test data point with the learned networks.

The code for CCN and its adjustment will be made public on GitHub under the MIT license when the manuscript is accepted.

BART (Chipman, George, and McCulloch, 2010)

The implementation of BART uses the R package BayesTree (Chipman and McCulloch, 2016) with a GPL (≥ 2) license. We use the default settings for its model structure; Chipman, George, and McCulloch (2007) suggest that BART’s performance with the default prior is already highly competitive and not highly dependent on fine tuning. We set the number of burn-in iterations to 1,000. We draw 1,000 random samples per individual to access their posterior predictive distributions, as the package stores all the chain information and is not scalable for large data.

CEVAE (Louizos et al., 2017)

CEVAE is implemented with publicly available PyTorch code, with no license specified. We follow its default structure in defining encoders and decoders. The latent confounder size is 20. The optimizer is ADAM with weight decay, following Louizos et al. (2017). We use their recommended learning rate and decay rate for IHDP. For the EDU dataset, the learning rate was set to 1e-4 and the decay to 1e-3 after tuning. We draw 3,000 posterior samples to access the posterior distributions.

GANITE (Yoon, Jordon, and van der Schaar, 2018)

The implementation of GANITE is based on the authors’ code, released with the MIT license. The model consists of two GANs: one for imputing missing outcomes (counterfactual block) and one for generating the potential outcome distributions (ITE block). Each block has a supervised loss on the observed outcomes to augment the mean estimation of the potential outcomes. We use the recommended specification of Yoon, Jordon, and van der Schaar (2018) to train on the IHDP data. For the education data, the hyper-parameters for the supervised loss were set to 2 (counterfactual block) and 1e-3 (ITE block) after tuning.

CF (Wager and Athey, 2018)

The implementation of CF uses the R package grf (Tibshirani, Athey, and Wager, 2020) with a GPL-3 license. We specified the argument tune.parameters="all" so that all hyper-parameters were automatically tuned.

GAMLSS (Hohberg, Pütz, and Kneib, 2020)

The implementation of GAMLSS uses the R package gamlss (Rigby and Stasinopoulos, 2005) with a GPL-3 license. Since the method uses likelihood to estimate its parameters and often does not converge under complex models, we feed its location, scale, and shape models with relevant variables only. In the location models, we fit all continuous variables with penalized splines. In the scale and shape models, we use relevant variables in their linear forms. This choice balances representation power against the model’s stability.

Appendix D Enforcing a Monotonicity Constraint

The learned CCN system should have the monotonic property that the estimated CDF is non-decreasing in the outcome. In our experiments, this condition is learned with a standard neural network architecture and our training scheme, and we did not see any non-trivial violations of this requirement. If required, though, this property can be enforced by modifying the neural network structure. One way of accomplishing this goal is to use a neural network of the form,

Here, σ(·) denotes the sigmoid function, and the three component functions are neural networks that map from the input space to a K-dimensional vector. In this case, the formulation of the outcome is still highly flexible, but the output becomes an admixture of sigmoid functions. As the multiplier on the outcome inside each sigmoid is required to be positive, each individual sigmoid is monotonically increasing in the outcome. Because the weight on each sigmoid is also positive, the sum is a fully monotonic function of the outcome.
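A minimal NumPy sketch of this construction (all names and sizes are our own assumptions): three networks map the covariates to K-dimensional vectors, softplus keeps the mixture weights and slopes positive, and the output is a positive-weighted sum of sigmoids, hence monotone in the outcome:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class MonotoneCDFHead:
    """F(y | x) as a positive-weighted admixture of sigmoids, monotone in y."""
    def __init__(self, d, K, rng):
        self.Wp = rng.normal(size=(d, K))  # -> mixture weights (softplus, normalized)
        self.Wa = rng.normal(size=(d, K))  # -> slopes (softplus keeps them positive)
        self.Wb = rng.normal(size=(d, K))  # -> location shifts

    def cdf(self, x, y):
        pi = softplus(x @ self.Wp)
        pi = pi / pi.sum()        # weights sum to one, so the output stays in (0, 1)
        a = softplus(x @ self.Wa)  # strictly positive slopes -> monotone in y
        b = x @ self.Wb
        return float(pi @ sigmoid(a * (y - b)))

rng = np.random.default_rng(0)
head = MonotoneCDFHead(d=5, K=8, rng=rng)
x = rng.normal(size=5)
ys = np.linspace(-5.0, 5.0, 101)
vals = [head.cdf(x, y) for y in ys]  # non-decreasing in y by construction
```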

We implemented this structure and found that it was competitive with a more standard architecture but was more difficult to optimize. As it was not empirically necessary to implement this strategy, we prefer the more standard architecture in our implementations.

Appendix E Metric Definitions

Below, we give the full definitions and procedures for our metrics.

Precision in Estimation of Heterogeneous Effect (PEHE): We adopt the definition of Hill (2011). For each unit, PEHE compares the estimated difference between the two potential outcome means against the true difference, aggregated as a root mean squared error across units. We note that PEHE only evaluates a point estimate, not the distribution or utility.
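In code, this metric is the root-mean-squared error between estimated and true individual treatment effects; a minimal sketch:

```python
import numpy as np

def pehe(mu1_hat, mu0_hat, mu1_true, mu0_true):
    """Root-mean-squared error of the estimated ITE against the true ITE."""
    ite_hat = np.asarray(mu1_hat) - np.asarray(mu0_hat)
    ite_true = np.asarray(mu1_true) - np.asarray(mu0_true)
    return float(np.sqrt(np.mean((ite_hat - ite_true) ** 2)))

# Perfect estimates give zero error; a constant bias of 0.5 gives PEHE = 0.5.
err0 = pehe([2.0, 3.0], [1.0, 1.0], [2.0, 3.0], [1.0, 1.0])
err1 = pehe([2.5, 3.5], [1.0, 1.0], [2.0, 3.0], [1.0, 1.0])
```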

Log Likelihood (LL): Log likelihood measures how well each method captures the potential outcome distribution. It is normally based on evaluating the PDF at the observed data. However, closed-form densities are not available for GAN-like approaches such as the variants of CCN and GANITE. Instead, we approximate the log likelihood using the CDF on a small neighborhood of the realized outcome: the log of the estimated probability mass placed on an ε-neighborhood of the observed value, where ε is a small positive value. Asymptotically, the true distribution prevails under this evaluation, which can be shown under the Kullback–Leibler divergence criterion (Kullback and Leibler, 1951) as ε shrinks and the sample size grows. We set ε separately for IHDP and EDU to adjust for the scale of the outcomes.
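One common form of this approximation divides the CDF difference by the window width so it estimates the log density; the sketch below uses an empirical CDF as a stand-in for the learned model, and the exact normalization in our implementation may differ:

```python
import numpy as np

def approx_log_likelihood(cdf, y_obs, eps):
    """log of the probability mass the CDF places on (y-eps, y+eps],
    divided by the window width 2*eps, approximating log pdf(y)."""
    mass = cdf(y_obs + eps) - cdf(y_obs - eps)
    return float(np.log(np.maximum(mass, 1e-12) / (2.0 * eps)))

# An empirical CDF from generative samples stands in for the learned model.
rng = np.random.default_rng(0)
samples = rng.normal(size=200_000)
cdf = lambda y: np.mean(samples <= y)

ll = approx_log_likelihood(cdf, 0.0, eps=0.05)
# The true standard-normal log density at 0 is log(1/sqrt(2*pi)) ≈ -0.919,
# so ll should be close to that value.
```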

Area Under the Curve (AUC): A decision on the optimal treatment requires contrasting the expected utilities under treatment and control, which should match the ground truth optimal decision. For our semi-synthetic cases, the true optimal decision is known. Then, using the estimated gain in utility as the decision score, we can compute the AUC.

Appendix F Additional CDF Visualizations

We provide additional visualizations of the estimated potential outcome distributions for each method in Figure S1, based on another random sample. These results augment the results visualized in Figure 3. The two variants of CCN capture the main shape of the true CDF curves, including the asymmetry of the exponential distribution, with higher fidelity. GAMLSS is less accurate in the treatment group because it uses a skewed normal for the exponential distribution with a location shift. GANITE’s two-GAN structure is highly reliant on data richness for accurate prediction (Yoon, Jordon, and van der Schaar, 2018), so in our experiments it falls short in cases with greater treatment group imbalance. CEVAE captures the overall marginal distribution of the potential outcomes, as shown in Figure S1(g), but fails to discern the heterogeneity of each individual in Figure S1(f). BART overall provides reasonable estimates in both cases but struggles with misspecification from its Gaussian form.

(a) CCN
(b) Adj-CCN
(d) BART
Figure S1: Visualization of each method’s estimate of the outcome distributions. The two variants of CCN give good distribution estimates. Other methods give less accurate estimates. By comparing the posterior distributions of CEVAE against the conditional distribution and marginal distribution of the ground truth in S1(f) and S1(g), we conclude that CEVAE primarily captures the marginal distribution in this study.

Appendix G Policy Learning

In decision making, the core difference between a traditional policy learning method and a distribution learning method is whether the utility is determined in advance. Though a policy learning method can tailor decisions to different utilities, it comes at the cost of refitting a model for each proposed utility. Beyond this computational inconvenience, we discuss below another shortcoming of traditional policy learning methods. To train a traditional policy learning approach, the first step is often to convert the raw outcome to the observed utility. While this is less problematic for bijective utilities, it can incur information loss for discretized utilities.

To demonstrate, we compare Adj-CCN to policytree using its published package (Athey and Wager, 2021; Sverdrup et al., 2021) on IHDP. We propose two types of utilities: a linear utility and a threshold utility. Since the policytree package only outputs the decision, we use accuracy as the metric. The results are summarized in Table S1. Adj-CCN consistently makes more correct decisions in both cases. On the other hand, the information loss from the threshold utility negatively impacts policytree’s performance. We note that a threshold drastically reduces the information available for training, as it reduces a continuous variable to a binary one.

Fundamentally, these two methods are different and address similar problems from different perspectives. It may be possible to combine their merits to further improve decision making. Hence, we will consider exploring their interaction in future work.

Utility/Method Adj-CCN % policytree %
Linear 85.60 ± 1.35 76.86 ± 1.03
Threshold 86.07 ± 1.44 57.64 ± .71
Table S1: Comparing the accuracy of policy learning between Adj-CCN and policytree.

Appendix H Additional Distribution Tests

Theoretically, CCN has the potential to model different types of distributions with high fidelity. To further illustrate this, we simulate the potential outcomes from three additional distributions to assess its adaptability. For comparison, we include GAMLSS given the true distribution families, BART, CCN, and Adj-CCN. The assessment is based on log-likelihood (LL), which reflects closeness to the underlying distributions.

First, we simulate the covariate space and treatment assignment according to the following procedure:

Treatment assignment:

The resultant distributions of the propensity scores are given in Figure S2. Given the magnitude of , we created a covariate space with limited overlap between the two treatment groups. With a limited sample size, this also lets us evaluate the robustness of our method when the positivity condition in Assumption 1 is possibly violated. We then specify three scenarios with sufficient nonlinearity added to the potential-outcome generating processes.
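A rough sketch of how a large propensity coefficient limits overlap follows; the covariates, coefficient value, and dimension below are illustrative assumptions, not the paper's design:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 2000, 5

# Hypothetical covariates; the paper's exact design matrix is not shown here.
X = rng.normal(size=(n, p))

# A large coefficient on the first covariate pushes propensity scores toward
# 0 and 1, producing limited overlap between the treatment groups.
gamma = 5.0                                 # illustrative magnitude only
e = 1.0 / (1.0 + np.exp(-gamma * X[:, 0]))  # propensity score
T = rng.binomial(1, e)                      # treatment assignment

# Fraction of units with extreme propensity (outside [0.1, 0.9]):
extreme = np.mean((e < 0.1) | (e > 0.9))
print(round(float(extreme), 2))
```

With this coefficient, most units land in the extreme-propensity regions, which is the kind of severe imbalance Figure S2 visualizes.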

Distribution / Method True Value CCN Adj-CCN BART GAMLSS (true family) GAMLSS (Gaussian)
Gumbel -2.87 -3.67 ± .02 -3.56 ± .02 -3.92 ± .06 -3.67 ± .02 -3.90 ± .05
Gamma -3.17 -3.83 ± .04 -3.74 ± .06 -3.97 ± .02 -3.77 ± .02 -3.95 ± .02
Weibull -2.87 -3.41 ± .02 -3.33 ± .03 -3.86 ± .12 -3.32 ± .04 -3.71 ± .09
Table S2: The estimated LL under different simulated distributions. GAMLSS (true family) and GAMLSS (Gaussian) represent fitting GAMLSS with the true family and a heteroskedastic Gaussian, respectively.

Gumbel Distribution:

Gamma Distribution:

Weibull Distribution:

We generated 2,000 data points in each case and summarize the results in Table S2 using 5-fold cross-validation. BART clearly falls behind in this comparison due to the substantial difference between the Gaussian and the three proposed distributions. GAMLSS with a heteroskedastic Gaussian gains marginally over BART through its added flexibility. In each case, CCN is close to the GAMLSS trained in the unrealistically idealized situation where it is given the true distribution family. Though CCN is blind to the distribution families, it still captures them effectively. In all three cases, Adj-CCN increases the LL by around .1 over CCN.
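The LL metric itself can be sketched as follows for one of these families. The sketch uses a homogeneous Gumbel without covariates, purely for illustration; the paper's outcomes are covariate-dependent:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical homogeneous Gumbel outcomes (no covariates), only to
# illustrate the held-out log-likelihood computation.
y = stats.gumbel_r.rvs(loc=0.0, scale=1.0, size=2000, random_state=rng)

# 5-fold cross-validated average log-likelihood.
folds = np.array_split(rng.permutation(2000), 5)
lls = []
for k in range(5):
    test_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(5) if j != k])
    loc, scale = stats.gumbel_r.fit(y[train_idx])     # MLE on training folds
    lls.append(stats.gumbel_r.logpdf(y[test_idx], loc, scale).mean())

print(round(float(np.mean(lls)), 2))
```

The cross-validated mean sits near the negative entropy of the true distribution, which is the sense in which a higher LL reflects closeness to the underlying distribution.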

Figure S2: The propensity score overlap. By adopting a large coefficient , we created a situation where only slight overlap is observed in the propensity scores between the two treatment groups. This also indicates a severe treatment group imbalance.

Appendix I Estimating Multimodal Distributions

A fundamental reason that we chose to extend CN to estimate potential outcome distributions is its adaptability to different outcome forms. We demonstrate that with another example. Below, we simulate the outcomes from mixture distributions:

The control group is a mixture of two Gaussian distributions, and the treatment group is a mixture of a Gaussian and an exponential distribution. We assume that the mixture information is not given to any model, and each model approximates the distributions in its original form. Each model was trained on 1,600 simulated samples. Figure S3 visualizes the estimated density for a random testing point. CCN and its variants still recover the true distribution faithfully, as guaranteed by their asymptotic properties, while the other models fail due to their constraints and assumptions.
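A data-generating sketch of this kind of setup is below. The component means, scales, and mixture weights are hypothetical placeholders, not the paper's exact parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1600

# Control arm: mixture of two Gaussians (parameters are illustrative).
z0 = rng.binomial(1, 0.5, size=n)
y0 = np.where(z0 == 1, rng.normal(-2.0, 0.5, n), rng.normal(2.0, 0.5, n))

# Treated arm: mixture of a Gaussian and a shifted exponential component.
z1 = rng.binomial(1, 0.5, size=n)
y1 = np.where(z1 == 1, rng.normal(0.0, 0.5, n), rng.exponential(1.0, n) + 2.0)

# The bimodality of the control arm shows up as a spread far larger than
# either component's own scale.
print(round(float(y0.std()), 2))
```

A single-Gaussian model of `y0` would place its mode between the two components, which is why models constrained to unimodal families fail on this example.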

(a) CCN
(b) Adj-CCN
(e) BART
Figure S3: Visualization of each method’s estimated density of the potential outcome distributions. The two variants of CCN outperform the other methods by a clear margin. GANITE captures some of the mixture structure here, but because its objective trades off a GAN loss against a supervised loss, it is not guaranteed to approximate the true distributions.

Appendix J Ablation Study

We include three components to account for treatment group imbalance in our adjustment scheme: the Assignment-loss (Assignment), the Wass-loss (Wass), and propensity stratification (PS), which incorporates the propensity score into the covariate space. Below, we inspect how CCN empirically benefits from each component.
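How the two loss components could compose with the outcome loss can be sketched numerically. Everything below is a simplified stand-in: the representations are 1-D, the assignment head is an identity map, and the weights and placeholder outcome loss are hypothetical, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(4)

def wasserstein_1d(a, b):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples
    (sorted-coupling formula)."""
    return np.abs(np.sort(a) - np.sort(b)).mean()

def assignment_loss(logits, t):
    """Binary cross-entropy of predicting treatment from the representation."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))

# Hypothetical 1-D representations of treated and control units.
phi_t = rng.normal(1.0, 1.0, 500)
phi_c = rng.normal(0.0, 1.0, 500)
t = np.r_[np.ones(500), np.zeros(500)]
logits = np.r_[phi_t, phi_c]   # stand-in for an assignment head
outcome_loss = 1.0             # placeholder for the prediction loss

alpha, beta = 0.1, 0.1         # illustrative weights
total = (outcome_loss
         + alpha * assignment_loss(logits, t)
         + beta * wasserstein_1d(phi_t, phi_c))
print(round(float(total), 2))
```

Minimizing the Wass term pulls the group representations together (balancing), while the Assignment term keeps the representation predictive of treatment (confounder-relevant), which is the tension the ablation below probes.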

Metrics/Method CCN Wass Assignment PS Assignment+PS Full Adjustment
PEHE 1.59 ± .16 1.32 ± .17 1.42 ± .25 1.15 ± .10 1.22 ± .15 1.30 ± .19
LL -1.78 ± .02 -1.65 ± .02 -1.64 ± .03 -1.67 ± .15 -1.65 ± .02 -1.64 ± .02
AUC (Linear) .925 ± .011 .938 ± .010 .940 ± .010 .918 ± .012 .940 ± .010 .942 ± .010
AUC (Threshold) .913 ± .011 .932 ± .011 .932 ± .011 .911 ± .012 .934 ± .011 .935 ± .010
Table S3: Quantitative results on the IHDP dataset regarding different variants of CCN.
Metrics/Method CCN Wass Assignment PS Assignment+PS Full Adjustment
PEHE .392 ± .049 .324 ± .046 .343 ± .041 .400 ± .052 .339 ± .039 .332 ± .042
LL -2.178 ± .024 -2.128 ± .020 -2.132 ± .023 -2.171 ± .024 -2.129 ± .029 -2.130 ± .017
AUC .933 ± .026 .951 ± .013 .946 ± .018 .932 ± .022 .952 ± .019 .949 ± .027
Table S4: Quantitative results on the EDU dataset regarding different variants of CCN.

Tables S3 and S4 summarize the evaluation metrics on both the IHDP and EDU datasets. Overall, when CCN is modified by these components, it more accurately estimates the potential outcomes. However, how each component contributes varies with its attributes. Propensity score stratification mainly facilitates information sharing between strata, especially for point estimates (Lunceford and Davidian, 2004); hence it excels on the IHDP dataset, where the imbalance was caused by the disparity in sample size between the control and treatment groups. However, at the individual level, stratification can be regarded as a form of aggregation, which might hinder precision. Hence, we do not see much gain in distribution-based metrics or personalized decisions from including the propensity alone. As the sample size grows and the dataset becomes more balanced, the effect of stratification becomes less significant, as reflected in Table S4. The Assignment-loss overcomes the confounding effect by extracting a representation relevant to both treatment assignment and outcome prediction, and we observe that it effectively boosts performance on all metrics. As noted by Shi, Blei, and Veitch (2019), it can also neglect information useful solely for outcome prediction, which decreases predictive power at the individual level. This might explain why the Assignment-loss offers fewer benefits for point estimates (PEHE). Combining the Assignment-loss with propensity stratification unites the merits of the two approaches. The Wass-loss finds a representation that balances the treatment and control groups and improves both point and distribution estimation on the two datasets. Nevertheless, it can become increasingly challenging to optimize when a dataset contains many irrelevant covariates (Shi, Blei, and Veitch, 2019). The visualization in Figure S4 also reflects these characteristics: the Assignment-loss and Wass-loss improve distribution estimation, while stratification does not.

(a) CCN
(b) Wass
(c) Assignment
(d) PS
(e) Assignment+PS
(f) Full Adjustment
Figure S4: The true 90 % interval widths versus the estimated 90 % interval widths given the four combinations of and on all variants of CCN. Overall, the Assignment-loss and Wass-loss contribute to increased accuracy in distribution estimation.

The full adjustment method takes advantage of each component and stably outperforms CCN in all respects. However, it does not always prevail over every single-adjustment method. Regardless, we adopt the combination of techniques, as it performs the most robustly across a variety of scenarios. We view a greater understanding of the trade-offs between these components as a future direction toward more robustly estimating potential outcomes in different scenarios.

(a) CCN
(b) Wass
(c) Assignment
(d) PS
(e) Assignment+PS
(f) Full Adjustment
Figure S5: The scatter plot of the propensity scores (x-axis) versus the absolute difference between the true ITEs and their estimates (y-axis). Overall, the Assignment-loss and Wass-loss contribute more to alleviating the treatment group imbalance and help the full adjustment method estimate the ITE more robustly in different regions.

An Additional Imbalance Adjustment Study

Below, we give another motivating example to visualize the added robustness of our adjustment scheme. We use the same covariate space and treatment assignment mechanism as described in Appendix H. However, we posit a nonstandard distribution whose location model is a trigonometric function and whose outcome uncertainty model is a heteroskedastic Beta distribution. We generated 2,000 data points in total, with an 80/20 train/test split. The detailed synthetic procedure for the outcomes can be described as:

We visualize the performance of the different adjustment schemes in scatter plots where the x-axis corresponds to each point’s true propensity score, a measurement of imbalance. Figure S5 depicts the absolute difference between the inferred ITEs and true ITEs; lower vertical positions represent lower error. We observe that every method’s performance deteriorates for points close to the two boundaries (extreme propensity scores), the regions that are generally the most challenging in observational studies. Compared with the baseline CCN, each adjustment scheme by itself lowers the error to some extent. Among them, Wass-CCN, Assignment-CCN and Adj-CCN reduce the average error by over 50 %. Continuous propensity stratification (PS) can effectively reduce bias when there is more homogeneity within each stratum; however, the severe imbalance in this case yields homogeneity only in the strata where the propensity is around 0.5, so PS provides fewer benefits here. In contrast, the Wass and Assignment losses seek new representations that either rectify group-level imbalance or exclude confounding effects. They prove more effective in the regions of imbalance, which contain the majority of points in this case.
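The kind of summary behind these scatter plots can be sketched directly. The propensity scores and error pattern below are synthetic assumptions chosen only to mimic the boundary-deterioration pattern described above:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400

# Hypothetical propensity scores and ITE estimation errors: by construction,
# errors grow near the boundaries, mimicking the pattern in Figure S5.
e = rng.uniform(0.02, 0.98, n)
err = np.abs(rng.normal(0.0, 0.2 + 2.0 * np.abs(e - 0.5), n))

# Median absolute error within propensity bins: a simple smoothing of the
# scatter, analogous to the median curves discussed for Figure S6.
bins = np.linspace(0.0, 1.0, 6)
idx = np.digitize(e, bins) - 1          # bin index 0..4 for each point
med = np.array([np.median(err[idx == k]) for k in range(5)])
print(med.round(2))
```

The binned medians rise toward both ends of the propensity range, which is the visual signature of treatment group imbalance that the adjustment schemes are meant to flatten.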

Similar trends are noticed in the scatter plot for log likelihood (LL) in Figure S6. Although each method still struggles in the regions of imbalance, extreme estimated values are greatly reduced by the adjustments. The median smoothing curves for the full adjustment method as well as Assignment-CCN and Wass-CCN are more stable and no longer present sharp disparities in regions with different propensity scores.

In practice, the type of imbalance or outcome distribution can vary case by case. As shown in the two semi-synthetic examples and this fully synthetic example, the impact of the different adjustment components varies accordingly. Hence, we combine these components into the full adjustment scheme to provide a robust algorithm for many different scenarios.

(a) True Model
(b) CCN
(c) Wass
(d) Assignment
(e) PS
(f) Assignment+PS
(g) Full Adjustment
Figure S6: The scatter plot of the propensity scores (x-axis) versus the estimated log-likelihood (LL) (y-axis). Overall, the Assignment-loss and Wass-loss contribute more to alleviating the treatment group imbalance and help the full adjustment method estimate distributions more robustly, similar to the true model in Figure S6(a).