1 Introduction
Personalized medicine requires estimating how an individual’s intrinsic characteristics trigger heterogeneous responses to treatment (Yazdani and Boerwinkle, 2015). Under the potential outcome framework to causal inference (Imbens and Rubin, 2015), these individual treatment effects (ITE) (or Conditional Average Treatment Effect (CATE)) are defined as the difference between an individual’s expected potential outcomes under different treatment conditions. Since only the outcome for the assigned treatment is observed, estimating the ITE requires inferring the missing potential outcomes Ding and Li (2018).
Machine learning approaches have been adapted to estimate ITEs by extending approaches such as the Random Forest (Wager and Athey, 2018)
and creating bespoke neural network frameworks
(Shalit, Johansson, and Sontag, 2017; Shi, Blei, and Veitch, 2019). ITEs, though, do not necessarily align with optimal choices. In a decision theoretic framework, the optimal decision maximizes the expected utility function, , over the distribution of outcomes (Joyce, 1999). ITE is a special case with the identity utility function, , but more general utility functions requires alternative approaches. Policy learning is one approach to learn a decision maker by optimizing a predefined utility function as the objective function. (Kallus and Zhou, 2018; Qian and Murphy, 2011). However, the decision should not only account for the heterogeneity of an individual’s potential outcomes but also their customized needs (personalized utility functions) Pennings and Smidts (2003). Hence, we propose an approach that estimates the potential outcome distributions to maintain flexibility for personalization.Previous efforts to estimate potential outcome distributions include Bayesian Additive Regression Trees (BART) (Chipman, George, and McCulloch, 2010; Hill, 2011), variational methods (Louizos et al., 2017), generalized additive model with location, shape and scale (GAMLSS) (Hohberg, Pütz, and Kneib, 2020), and techniques based on adversarial networks (Yoon, Jordon, and van der Schaar, 2018). Empirically, these techniques often impose certain explicit or implicit assumptions about the outcome distributions (e.g., Gaussian errors), which may not match with the true data generating mechanism. In response, we propose a novel dual neural network approach, the Collaborating Causal Networks (CCN). CCN modifies the structure of the Collaborating Networks (Zhou et al., 2020) to create a new causal framework that flexibly represents distributions. Under standard causal inference assumptions, we prove that CCN asymptotically captures the potential outcome distributions. We then propose a novel adjustment method to address poor imbalance between treatment groups, which hurts generalization and is a common problem in practice. Empirically, this adjustment method improves the point estimates, potential distribution estimates, and decisionmaking.
In summary, the main contributions of the paper are:

We propose the Collaborating Causal Networks (CCN) to estimate potential outcome distributions.

We prove the asymptotic properties of CCN.

We propose a new adjustment scheme to alleviate the treatment group imbalance for observational studies.
In addition, we have the further minor contribution that we propose personalized utilities in decision making, which is practically more meaningful and addresses distinct user needs. We evaluate how well existing algorithms perform on this task for the first time. We demonstrate that CCN improves optimal individual decisions based on utility functions by targeting the full potential distribution of individual treatment effects.
2 Problem Statement
We define the covariates as . We assume a binary treatment condition and each unit is assigned a treatment . We choose the binary setup for clarity, and the multiclass case could be constructed under the same framework. We let and represent the continuous potential outcomes under the two treatments; is the observed outcome. We use lowercase letters with subscript to denote concrete realizations: . A common goal is to estimate the , also known as conditional average treatment effect (CATE).
Our goal is to use the incomplete data to infer the distributions on both potential outcomes, and . Successful estimation of these distributions enables us to study a wide range of objectives besides ITE through the introduction of utility functions (Dehejia, 2005). We define the treatmentspecific utility functions as and . These utility functions will often be the same, but can vary due to cost of treatment, etc. We can then estimate the change in utility by approximating the expectations, . We note that the identity function, , returns an estimate of the ITE. In practice, utility functions are often nonlinear in measured outcomes Pennings and Smidts (2003). For example, a decision maker could define a utility function such as to evaluate the chance that they get a meaningful outcome at least at level . The utility function could also be varied and adapted to accommodate for various personal preferences or conditions.
Like most causal methods, CCN relies on the standard strong ignorability assumption (Rosenbaum and Rubin, 1983), to estimate the full potential outcome distributions when each datum only observes a single treatment outcome. Strong ignorability consists of two subassumptions:
Assumption 1 (Positivity or overlap).
Assumption 2 (Ignorability or Unconfoundedness).
The potential outcomes are jointly independent of the treatment assignment conditional on : .
3 Collaborating Causal Networks
The CCN approach approximates the conditional distributions and . CCN uses a twofunction framework based on the Collaborating Networks (CN) method (Zhou et al., 2020)
. We choose to extend CN to the causal setting because it automatically adapts to different distribution families, including nonGaussian distributions.
We first give an overview of CN, then present CCN, and then introduce our proposed adjustment strategies. Proofs of all theoretical claims are in Appendix A.
Overview of Collaborating Networks
CN estimates the conditional distribution, , with two neural networks: a network to approximate the conditional CDF, , and a network
to approximate its inverse. Information sharing is enforced by the fact that the CDF and its inverse are an identity mapping for any quantile
: . The networks form a collaborative scheme with their respective losses,(1)  
(2) 
The quantile is randomly sampled (e.g., ). represents the binary crossentropy loss. The parameters for and are only updated with their respective losses. When trained with (1) and (2), a fixed point of the optimization is at the true conditional CDF and its inverse (Zhou et al., 2020). The relative importance of and is reflected these statistical properties, as only relies on to cover the outcome space, whereas requires an optimal . Zhou et al. (2020) showed that can be replaced by other space searching tools, including simple distributions, which suffer a minor performance loss in exchange for ease of optimization. Thus, we focus on extending and gloss for causal inference and we replace with notation as a general form of a space searching tool or distribution sample. The gloss can be simplified to
(3) 
In practice, the expectation in (3) is replaced with an empirical approximation.
Causal Inference Formulation
Based on (3), we define a network for each group, and , with corresponding losses,
(4) 
We combine the two treatment groups as
(5) 
We choose for each treatment arm to have a separate network; alternatively, the treatment assignment could also be combined as an additional covariate within a single network. We next present the asymptotic properties of CCN. Under Assumptions 1 and 2 and a minor additional assumption below, the fixed point solution and consistency of CCN hold regardless of the treatment group imbalance.
Assumption 3 (Covariate Shift).
Both potential outcome distributions and are independent of the covariate distribution .
To summarize, Assumption 2 connects the conditional distribution, , to the potential outcome distribution on each covariate space whereas Assumption 3 generalizes the potential outcome distributions from each space to the full space . Assumption 3 is often assumed in the analysis frameworks for causal neural networks (Johansson et al., 2020). Given our assumptions, we state:
Proposition 1 (Optimal solution for and ).
When the distribution of covers the full outcome space, the functions and that minimize are optimal when they are equivalent to the conditional CDF of and , such that .
Proposition 2 (Consistency of and ).
Assume the ground truth CDF functions for satisfy Lipschitz continuity and that covers the full outcome space. Denote the ground truth as and . As , the finite sample estimators and have the following consistency property: under some metrics , such as the norm.
Taken together, these propositions state that the CDF estimators and inherit the large sample properties from CN for estimating potential outcome distributions.
Adjustment for Treatment Group Imbalance
One obstacle for causal inference is the treatment group imbalance, where the distributions of the covariate space and significantly differ. In the asymptotic regime (Proposition 2), we are only dependent on overlap (Assumption 1) holding, as any region with positive density will eventually be densely covered with samples. However, for finite samples, this imbalance hurts generalization. Thus, we propose a new adjustment scheme that alleviates this imbalance and further improves the CCN.
Succinctly, the adjustment method encodes the original covariates into a more balanced space that maintains relevant predictive information. The new covariate space is represented as , with a neural network that transforms the input into a dimensional latent space, and a propensity estimator to predict the propensity of treatment assignment, .
While may seem redundant given , including in the latent covariate space implicitly encourages propensity score stratification. We denote the gloss trained with this new covariate space as gloss. Then the full loss can be expressed the sum of gloss and terms to train and ,
This full network is sketched in Figure 1, showing the training of different functions with respect to their losses. Wassloss and Assignmentloss are fully described below. The tuning parameters and vary the relative importance of Wassloss and Assignmentloss during the learning of the . If and we remove the latent representation and propensity, this structure reduces to the regular CCN.
Wassloss to Alleviate the Treatment Group Imbalance
The Wassloss is motivated by CounterFactual Regression (CFR) implemented with the Wasserstein distance (Shalit, Johansson, and Sontag, 2017). CFR is a causal estimator based on a latent representation . The goal is to find latent representations where and are more balanced than the original covariate space while remaining predictive of the potential outcomes. We use the Wasserstein1 distance, which represents the total “work” required transform one distribution to another (Vallender, 1974). Through the KantorovichRubinstein duality Villani (2008), this distribution distance is,
represents the family of 1Lipschitz functions. We approximate this distance by adopting the approach of Arjovsky, Chintala, and Bottou (2017). This turns into the following minmax regime,
(6) 
is parameterized by a small neural network, and the Lipschitz constraint on is enforced through weight clipping. The Wassloss penalizes differences in the latent space between the treatment and control group, which improves generalization between the groups. However, finding such a balanced representation becomes increasingly challenging with increased dimensionality Shi, Blei, and Veitch (2019). Moreover, this representation by itself does not account for the treatment assignment mechanism (propensity score), which has the potential to reduce bias and confounding effects (Austin, 2011).
Assignmentloss and propensity score stratification
The Assignmentloss is inspired by Dragonnet (Shi, Blei, and Veitch, 2019)
, a recent deep learning method that incorporates treatment assignment to help learn a latent representation that predicts the Average Treatment Effect (ATE). It is defined as binary crossentropy loss (
) on predicting the treatment assignment label with ,(7) 
We extend this technique by incorporating the estimated propensity directly into our covariate space, which encourages propensity score stratification. Using propensity scores in the predictive model is a implicit form of stratification to reduce the bias of point estimation by facilitating information sharing within the sample strata (Hahn, Murray, and Carvalho, 2020).
4 Related Work
Below, we discuss three branches of related research.
ITE estimation and representation learning A common approach to combine causal inference and machine learning is the matching strategy, which identifies pairs of similar individuals (Rubin, 1973; Rosenbaum and Rubin, 1983; Li and Fu, 2017; Schwab, Linhardt, and Karlen, 2018). This idea also motivates many treebased methods that identify similar individuals within automaticallyidentified regions of the covariate space (Liaw and Wiener, 2002; Zhang and Lu, 2012; Athey and Imbens, 2016; Wager and Athey, 2018). Adversarial methods have attempted to balance treatment groups by learning a treatmentinvariant space (Johansson et al., 2020; Johansson, Shalit, and Sontag, 2016; Du et al., 2019). Alternatively, methods have been used to explicitly encode treatment propensity information in the learned representations (Shi, Blei, and Veitch, 2019). Representation learning can also be combined with weighting to enforce additional covariate balance (Assaad et al., 2021). These methods largely focus on estimating the difference of the means of the outcome distributions rather than the full distributions (some exceptions will be noted below). As noted by Park et al. (2021)
, the first moment difference reflected in ITE might be insufficient to reflect the full picture of different treatment regimes. In contrast, we estimate full distributions to assess the utility and confidence of a decision.
Potential outcome distribution sketching. Bayesian methods have been used to estimate outcome distributions, including methods such as Gaussian Processes (Alaa and van der Schaar, 2017), Bayesian dropout (Alaa, Weisz, and Van Der Schaar, 2017), and Bayesian Additive Regression Trees (BART) (Chipman, George, and McCulloch, 2010). BART has gained popularity in recent years and has been the focus of further modifications, including variations to account for regions with poor overlap (Hahn, Murray, and Carvalho, 2020). However, Bayesian methods are subject to model misspecification (Walker, 2013)
, which can create problems when the outcome distribution does not match model assumptions (e.g., actual outcome distribution is nonGaussian). Variational methods are capable of drawing inferences from complex structures, such as the Causal Effect Variational Autoencoder (CEVAE) and its extensions
(Louizos et al., 2017; Jesson et al., 2020); hybrid architectures are sometimes adopted to account for certain types of missing data mechanisms Hassanpour and Greiner (2020). Frequentist approaches can also achieve flexible representations of distributions. A wellknown adaptation is the generalized additive model with location, scale and shift (GAMLSS), which estimates the parameters for a baseline distribution with up to three transformations given a specific family Briseño Sanchez et al. (2020); Hohberg, Pütz, and Kneib (2020). Alternatively, the CDF may be estimated nonparametrically by weighting the empirical count of data points in a given neighborhood by adapting nearest neighbor approaches Shen (2019). However, this approach is less smooth and accurate in regions with treatment group imbalance due to the lack of neighboring points. GANinspired methods, including GANITE (Yoon, Jordon, and van der Schaar, 2018), can also learn nonGaussian outcome distributions. There is also emerging literature on conformal prediction in treatment effect estimation (Lei and Candès, 2020; Chernozhukov, Wüthrich, and Zhu, 2021). However, conformal prediction learns a specific level of coverage and is not feasible to estimate the expected utility. Additionally, its coverage probabilities are proved for populations rather than individuals.Metrics/Method  CCN  AdjCCN  GANITE  CEVAE  BART  GAMLSS  CF 

PEHE  1.59 .16  1.30 .19  2.40 .40  2.60 .10  2.23 .33  3.00 .39  3.52 .57 
LL  1.78 .02  1.64 .02  *  2.82 .08  1.99 .08  2.34 .13  NA 
AUC (Linear)  .925 .011  .942 .010  .723 .017  .523 .008  .923 .009  .930 .10  .896 .009 
AUC (Threshold)  .913 .011  .935 .010  *  .564 .010  .917 .009  .925 .10  NA 
Quantitative results on IHDP. Each metric’s mean and standard error are reported. *GANITE is only used for estimating the ITE as it is relatively challenging to optimize for this small dataset according to
Yoon, Jordon, and van der Schaar (2018).Policy learning and utility functions. A key purpose of estimating the individual causal effect is to serve personalized decisions. This aligns with the goal of policy learning to identify the policy (treatment) that benefits an individual most (Kallus and Zhou, 2018; Qian and Murphy, 2011; Bertsimas et al., 2017). A common strategy is to express the policy as a function of the covariate feature space and learn the policy to optimize the utility (Beygelzimer and Langford, 2009)
. Often, utilities studied in policy learning are linear transformation of the potential outcomes, which can be described as the difference between the benefit and cost
(Athey and Wager, 2021). Unfortunately, the observed utility may be subject to information loss according to the Data Processing Inequality (Beaudry and Renner, 2012)(e.g., binarization of a continuous variable greatly reduces information). Additionally, policy learning requires each individual to share a utility function, rather than allowing personalization.
5 Experiments
We follow established literature and use semisynthetic scenarios to assess individualized causal effects. First, we use the Infant Health and Development Program (IHDP) (Hill, 2011), where the outcome of each subject is simulated under a standard Gaussian distribution with a heterogeneous treatment effect. This first situation describes the ideal scenario for many methods, including BART. The second example is based on a field experiment in India studying the impact of education (EDU). In this case, we synthesize each individual outcome with heterogeneous effect and variability using a nonGaussian distribution. The semisynthetic procedures are briefly outlined below with full details in Appendix B. In addition, we briefly describe several additional synthetic data experiments and ablations studies below and include full details in the appendices.
We include our base approach, CCN, and the version adjusted for imbalance, AdjCCN. We compare to existing approaches that estimate potential outcome distributions, including Bayesian approaches (CEVAE (Louizos et al., 2017) and BART (Hill, 2011)), a frequentist approach, GAMLSS (Hohberg, Pütz, and Kneib, 2020), and a GANbased approach, GANITE (Yoon, Jordon, and van der Schaar, 2018). Causal Forests (CF) (Wager and Athey, 2018) is also benchmarked for nondistribution metrics as a popular recent ITEonly method. GAMLSS’s flexibility and strength in estimating distributions is dependent on a close match to the true underlying distribution families, which is normally unknown in practice. We evaluated two scenarios, one where GAMLSS uses a flexible distribution family and another where GAMLSS is provided the closest possible distribution, meaning that GAMLSS is provided more information than any other method. However, GAMLSS with a flexible distribution did not converge well and gave uncompetitive performance, so only results for the second scenario are reported.
Full implementation details and specifications for all models are given in Appendix C. We use a standard neural network architecture for CCN, but detail an alternative structure that enforces a monotonic constraint in Appendix D. Code to replicate all experiments has been included, which will be released with an MIT license if accepted.
Metrics
We evaluate how well each method captures the mean of the distribution through Precision in Estimation of Heterogenous Effect (PEHE) and the full distribution by estimating the loglikelihood (LL) of the predictions. We view LL as the key metric here since it evaluates full distributions. In addition, we evaluate how well each method makes decisions by evaluating the Area Under the Curve (AUC) for chosen utility functions to demonstrate that improved distributional estimates lead to improved decisions. Full mathematical definitions of the metrics are given in Appendix E.
Ihdp
The Infant Health and Development Program (IHDP) was originally a randomized experiment that was modified to mimic an observational study by removing a nonrandom portion from the treatment group. We use the response surface B introduced in Hill (2011) for heterogeneous treatment effect. The study consists of 747 subjects (139 in the treated group) with 19 binary and 6 continuous variables (). We utilize 100 replications of the data for outofsample evaluation by following the simulation process of Shalit, Johansson, and Sontag (2017).
The quantitative results are in Table 1
. CCN overall outperforms other competing methods in both mean and distribution metrics. Its edge in these metrics could be attributed to the relative robustness of the collaborating networks to the presence of outliers or overfitting in small data
(Zhou et al., 2020). The advantages of combining the imbalance adjustment strategy is evident as AdjCCN clearly improves over CCN. It is worth noting that LL calculated under the ground truth model is 1.41, demonstrating that AdjCCN is highly effective in capturing the potential outcome distributions.Metrics/Method  CCN  AdjCCN  GANITE  CEVAE  BART  GAMLSS  CF 

PEHE  .392 .049  .332 .042  1.253 .181  1.911 .351  .534 .042  .314 .053  1.022 .051 
LL  2.178 .024  2.130 .017  6.479 .992  3.558 .055  2.443 .063  2.250 .025  NA 
AUC  .933 .026  .949 .027  .760 .053  .622 .039  .906 .015  .941 .010  NA 
We evaluate two sets of utility functions: a linear utility and a nonlinear utility with and . As the ATE for surface B is 4 (Hill, 2011), the two setups reflect whether a method can correctly distinguish an individual from the population level of treatment effect. Table 1 shows AUC (Linear) and AUC (NonLinear) for these utilities, demonstrating that CCN’s improved distribution estimates contribute to accurate and competitive decisions, despite the fact that a homoskedastic Gaussian distribution is well matched to BART and GAMLSS.
Edu
This semisynthetic dataset is based on a randomized field experiment in India between 2011 and 2012 (Banerji, Berry, and Shotland, 2017, 2019). The experiment studied whether providing a mother with adult education benefits their children’s learning. We define the binary treatment as whether a mother received adult education and the continuous outcome as the difference between the final and the baseline test scores. The total sample size is 8,627 with 18 continuous covariates and 14 binary covariates: ().
We create a semisynthetic dataset over the two potential outcomes by the following procedure. We first train two neural networks,
, on the observed outcomes for the control and treated population. The uncertainty model for the control and treated group are based on a Gaussian distribution and an exponential distribution, respectfully, which helps demonstrate that the structure of CCN can automatically adapt to different families. We represent
as an indicator of whether the mother had received any previous school education as we hypothesize the variability is higher for the mothers without previous education. The final form for the potential outcomes can be expressed,The treatment group imbalance comes from two aspects. One aspect is from a treatment assignment model with propensity where we assign large coefficients in to add imbalance. The second aspect is from truncation, as we remove wellbalanced subjects with estimated propensities in the range of . We keep 1,000 samples for evaluation and use the rest for training. The full procedure is repeated 10 times to generate the variability assessment.
The utility function is customized for each subject to mimic personalized decisions. For subject , , and where . The interpretation of this utility is that each mother might hope the probability of their child scoring above certain threshold to increase. For the mothers without previous education, their expectations are higher by . This design coincides with the expectation that the education should have a positive effect on the final score in exchange for a finite cost, and we would only invest in the intervention for a positive return. Table 2 summarizes the evaluation result. As in IHDP, the CCN methods flexibly model different distributions, with AdjCCN providing some improvement over CCN. Notably, GAMLSS was provided with additional help by specifying the best distribution family a priori. Despite this, it displayed improved performance over AdjCCN in PEHE only, not in LL or decision metrics.
The ability to make an optimal decision is highly dependent on how close its estimated distribution aligns with the true distribution and all relevant heterogeneity. Thus, the AUCs follow their respective LLs with the CCN based methods performing well. CEVAE has two facets of model misspecification that hurt performance: () its homogeneous Gaussian error, and () it decodes the continuous covariates into a Gaussian distribution, whereas the real outcomes come from heteroskedastic Gaussian or exponential distributions. Hence, CEVAE captures the marginal distributions well but does not provide helpful personalized suggestions.
Next, we randomly draw a data sample and compare the estimated CDFs against the true CDFs in Figure 3 (see Figure S1 for additional samples). We find that CCNbased approaches are capable of closely recovering the true CDFs on random individuals, whereas the other methods have gaps in their estimation. GAMLSS is accurate on the control group but not the treatment group. This is partly due to GAMLSS package (Rigby and Stasinopoulos, 2005)
not supporting the exponential distribution with location shift, so skewed normal distribution was chosen as the closest reasonable substitute. Overall, GAMLSS is flexible but requires precise specification on a casebycase basis, whereas CCN can robustly use the same approach.
Another aspect to assess is whether a method captures the heteroskedasticity of the outcome distributions. The combination of and produces four uncertainty models. We visualize the predictive 90% interval widths from each method in Figure 2
. AdjCCN captures the bimodal nature of the interval widths. In contrast, GANITE only captures a small fraction the difference between the low and high variance cases. Both CEVAE
Louizos et al. (2017) and BART Hill (2011) did not capture the heterogeneity and are not shown. As GAMLSS requires us to specify that only and affected its uncertainty model to converge effectively, it did not produce variability in interval width. In summary, the ground truth interval widths are 1.47, 1.64, 2.94 and 3.29, while GAMLSS estimates 0.92, 1.52, 3.37 and 3.37. CCN and its variants are more accurate.Additional Comparisons and Properties
There are several additional experiments included in the appendices. First, we compare to policy learning algorithms in Appendix G. While policy learning is not directly comparable to these other approaches, we can compare decisions for a utility function defined a priori. Specifically, we conducted experiments with policytree Sverdrup et al. (2021) against AdjCCN in the IDHP dataset. We set up two scenarios, one with a linear utility and one with a threshold utility. AdjCCN performed well in both with over 80% accuracy. However, policytree’s accuracy dropped from 77% to 58 % when we switched to the threshold utility. These results show that learning the full potential distribution improves upon a policy learning approach despite the potential disadvantage of not having access to the utility function during training.
We also include additional experiments on synthetic data with a variety of distributions in Appendix H, including Gumbel, Gamma, and Weibull distributions. The results are qualitatively similar to the previously presented semisynthetic cases, where CCN straightforwardly adapts to these distributions and AdjCCN provides additional improvements. GAMLSS is comparable when provided the true distribution a priori but does not compare with more flexible distribution families. While we choose relatively standard distributions for our semisynthetic experiments, matching existing literature, we wanted to highlight the flexibility of CCN. Thus, we evaluated a multimodal outcome distribution in Appendix I. As expected in this case, CCN and CCNAdj naturally adjust to the multimodal space whereas the competing methods do not naturally handle the case, as shown briefly in Figure 4 and on all methods in Figure S3. GANITE is aware of certain mixture here. As its objective makes a tradeoff between GAN loss and supervised loss, it does not have asymptotic guarantee to approximate the true distribution.
Finally, we note that there are several components of the adjustment strategy. We perform an ablation study in Appendix J, which suggests that no part subsumes the others, but including all three adjustment strategies builds robustness. We additionally assess on which data sample AdjCCN improves over CCN by visualizing the performance as a function of propensity on synthetic datasets described in Appendix J. These results demonstrate that the primary advantage of the adjustment strategy is on improving estimates with very high or low propensities, and that the strategies have similar performance in wellbalanced cases.
6 Discussion
CCN is a novel framework to estimate individualized potential outcome distributions, with novel theoretical proofs and a new adjustment method to address treatment group imbalance. Empirically, CNN is effective in inferring individualized causal effects and is relatively robust with regard to treatment group imbalance in semisynthetic experiments. We demonstrate that improving distribution estimates leads to improved decisionmaking even without a priori access to utility functions.
Like nearly all causal methods, we are dependent on Assumptions 1, 2, and 3. Evaluating whether these assumptions hold in practice is nontrivial and requires close interaction with domain experts. Regardless of these limitations, we are encouraged about CCN due to the theoretical backing and empirical performance in our evaluations.
Acknowledgements
Research reported in this manuscript was supported by the National Institute of Biomedical Imaging and Bioengineering and the National Institute of Mental Health through the National Institutes of Health BRAIN Initiative under Award Number R01EB026937 and the National Institute of Mental Health of the National Institutes of Health under Award Number R01MH125430. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. We would also like to thank all members of the Carlson lab for proofreading and providing valuable comments on this manuscript.
References
 Alaa and van der Schaar (2017) Alaa, A. M.; and van der Schaar, M. 2017. Bayesian inference of individualized treatment effects using multitask gaussian processes. Advances in Neural Information Processing Systems, 3424–3432.
 Alaa, Weisz, and Van Der Schaar (2017) Alaa, A. M.; Weisz, M.; and Van Der Schaar, M. 2017. Deep counterfactual networks with propensitydropout. Proceedings of International Conference on Machine Learning.
 Arjovsky, Chintala, and Bottou (2017) Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein generative adversarial networks. International Conference on Machine Learning.
 Assaad et al. (2021) Assaad, S.; Zeng, S.; Tao, C.; Datta, S.; Mehta, N.; Henao, R.; Li, F.; and Carin, L. 2021. Counterfactual representation learning with balancing weights. Artificial Intelligence and Statistics.
 Athey and Imbens (2016) Athey, S.; and Imbens, G. 2016. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27): 7353–7360.
 Athey and Wager (2021) Athey, S.; and Wager, S. 2021. Policy learning with observational data. Econometrica, 89(1): 133–161.
 Austin (2011) Austin, P. C. 2011. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46(3): 399–424.
 Banerji, Berry, and Shotland (2017) Banerji, R.; Berry, J.; and Shotland, M. 2017. The impact of maternal literacy and participation programs: Evidence from a randomized evaluation in india. American Economic Journal: Applied Economics, 9(4): 303–37.
 Banerji, Berry, and Shotland (2019) Banerji, R.; Berry, J.; and Shotland, M. 2019. The impact of mother literacy and participation programs on child learning: evidence from a randomized evaluation in India.
 Bang and Robins (2005) Bang, H.; and Robins, J. M. 2005. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4): 962–973.
 Beaudry and Renner (2012) Beaudry, N. J.; and Renner, R. 2012. An intuitive proof of the data processing inequality. Quantum Information & Computation, 12(56): 432–441.
 Bertsimas et al. (2017) Bertsimas, D.; Kallus, N.; Weinstein, A. M.; and Zhuo, Y. D. 2017. Personalized diabetes management using electronic medical records. Diabetes care, 40(2): 210–217.
 Beygelzimer and Langford (2009) Beygelzimer, A.; and Langford, J. 2009. The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 129–138.
 Briseño Sanchez et al. (2020) Briseño Sanchez, G.; Hohberg, M.; Groll, A.; and Kneib, T. 2020. Flexible instrumental variable distributional regression. Journal of the Royal Statistical Society: Series A (Statistics in Society), 183(4): 1553–1574.
 Chernozhukov, Wüthrich, and Zhu (2021) Chernozhukov, V.; Wüthrich, K.; and Zhu, Y. 2021. An exact and robust conformal inference method for counterfactual and synthetic controls. Journal of the American Statistical Association, 1–44.
 Chipman and McCulloch (2016) Chipman, H.; and McCulloch, R. 2016. BayesTree: Bayesian Additive Regression Trees. R package version 1.4.
 Chipman, George, and McCulloch (2007) Chipman, H. A.; George, E. I.; and McCulloch, R. E. 2007. Bayesian ensemble learning. In Advances in neural information processing systems, 265–272.
 Chipman, George, and McCulloch (2010) Chipman, H. A.; George, E. I.; and McCulloch, R. E. 2010. BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1): 266–298.
 Dehejia (2005) Dehejia, R. H. 2005. Program evaluation as a decision problem. Journal of Econometrics, 125(12): 141–173.
 Ding and Li (2018) Ding, P.; and Li, F. 2018. Causal inference: A missing data perspective. Statistical Science, 33(2): 214–237.
 Du et al. (2019) Du, X.; Sun, L.; Duivesteijn, W.; Nikolaev, A.; and Pechenizkiy, M. 2019. Adversarial balancingbased representation learning for causal effect inference with observational data. arXiv preprint arXiv:1904.13335.
 Hahn, Murray, and Carvalho (2020) Hahn, P. R.; Murray, J. S.; and Carvalho, C. 2020. Bayesian Regression Tree Models for Causal Inference: Regularization, Confounding, and Heterogeneous Effects (with Discussion). Bayesian Analysis, 15(3): 965–1056.

Han and Moraga (1995)
Han, J.; and Moraga, C. 1995.
The influence of the sigmoid function parameters on the speed of backpropagation learning.
In International Workshop on Artificial Neural Networks, 195–201. Springer.  Hassanpour and Greiner (2020) Hassanpour, N.; and Greiner, R. 2020. Variational autoencoder architectures that excel at causal inference. Advances in Neural Information Processing Systems.
 Hill (2011) Hill, J. L. 2011. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1): 217–240.
 Hohberg, Pütz, and Kneib (2020) Hohberg, M.; Pütz, P.; and Kneib, T. 2020. Treatment effects beyond the mean using distributional regression: Methods and guidance. PloS one, 15(2): e0226514.
 Imbens and Rubin (2015) Imbens, G. W.; and Rubin, D. B. 2015. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.
 Jesson et al. (2020) Jesson, A.; Mindermann, S.; Shalit, U.; and Gal, Y. 2020. Identifying causaleffect inference failure with uncertaintyaware models. Advances in Neural Information Processing Systems, 33.
 Johansson, Shalit, and Sontag (2016) Johansson, F.; Shalit, U.; and Sontag, D. 2016. Learning representations for counterfactual inference. International Conference on Machine Learning, 3020–3029.
 Johansson et al. (2020) Johansson, F. D.; Shalit, U.; Kallus, N.; and Sontag, D. 2020. Generalization bounds and representation learning for estimation of potential outcomes and causal effects. arXiv preprint arXiv:2001.07426.
 Joyce (1999) Joyce, J. M. 1999. The foundations of causal decision theory. Cambridge University Press.
 Kallus and Zhou (2018) Kallus, N.; and Zhou, A. 2018. ConfoundingRobust policy improvement. Advances in Neural Information Processing Systems, 31.
 Kullback and Leibler (1951) Kullback, S.; and Leibler, R. 1951. On information and sufficiency. The Annals of Mathematical Statistics, 22: 79–86.
 Lei and Candès (2020) Lei, L.; and Candès, E. J. 2020. Conformal inference of counterfactuals and individual treatment effects. arXiv preprint arXiv:2006.06138.
 Li and Fu (2017) Li, S.; and Fu, Y. 2017. Matching on balanced nonlinear representations for treatment effects estimation. Advances in Neural Information Processing Systems, 929–939.
 Liaw and Wiener (2002) Liaw, A.; and Wiener, M. 2002. Classification and regression by randomForest. R news, 2(3): 18–22.
 Louizos et al. (2017) Louizos, C.; Shalit, U.; Mooij, J. M.; Sontag, D.; Zemel, R.; and Welling, M. 2017. Causal effect inference with deep latentvariable models. Advances in Neural Information Processing Systems, 6446–6456.
 Lunceford and Davidian (2004) Lunceford, J. K.; and Davidian, M. 2004. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine, 23(19): 2937–2960.
 Park et al. (2021) Park, J.; Shalit, U.; Schölkopf, B.; and Muandet, K. 2021. Conditional Distributional Treatment Effect with Kernel Conditional Mean Embeddings and UStatistic Regression. CoRR, abs/2102.08208.
 Pennings and Smidts (2003) Pennings, J. M.; and Smidts, A. 2003. The shape of utility functions and organizational behavior. Management Science, 49(9): 1251–1263.
 Qian and Murphy (2011) Qian, M.; and Murphy, S. A. 2011. Performance guarantees for individualized treatment rules. Annals of statistics, 39(2): 1180.
 Rigby and Stasinopoulos (2005) Rigby, R. A.; and Stasinopoulos, D. M. 2005. Generalized additive models for location, scale and shape,(with discussion). Applied Statistics, 54: 507–554.
 Rosenbaum and Rubin (1983) Rosenbaum, P. R.; and Rubin, D. B. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1): 41–55.
 Rubin (1973) Rubin, D. B. 1973. Matching to remove bias in observational studies. Biometrics, 159–183.
 Schwab, Linhardt, and Karlen (2018) Schwab, P.; Linhardt, L.; and Karlen, W. 2018. Perfect match: A simple method for learning representations for counterfactual inference with neural networks. arXiv preprint arXiv:1810.00656.
 Shalit, Johansson, and Sontag (2017) Shalit, U.; Johansson, F. D.; and Sontag, D. 2017. Estimating individual treatment effect: generalization bounds and algorithms. In International Conference on Machine Learning, 3076–3085. PMLR.
 Shen (2019) Shen, S. 2019. Estimation and inference of distributional partial effects: theory and application. Journal of Business & Economic Statistics, 37(1): 54–66.
 Shi, Blei, and Veitch (2019) Shi, C.; Blei, D.; and Veitch, V. 2019. Adapting neural networks for the estimation of treatment effects. In Advances in Neural Information Processing Systems, 2507–2517.
 Sverdrup et al. (2021) Sverdrup, E.; Kanodia, A.; Zhou, Z.; Athey, S.; and Wager, S. 2021. policytree: Policy Learning via Doubly Robust Empirical Welfare Maximization over Trees. R package version 1.1.1.
 Tibshirani, Athey, and Wager (2020) Tibshirani, J.; Athey, S.; and Wager, S. 2020. grf: Generalized Random Forests. R package version 1.2.0.

Vallender (1974)
Vallender, S. 1974.
Calculation of the Wasserstein distance between probability distributions on the line.
Theory of Probability & Its Applications, 18(4): 784–786.  Villani (2008) Villani, C. 2008. Optimal transport: old and new, volume 338. Springer Science & Business Media.
 Wager and Athey (2018) Wager, S.; and Athey, S. 2018. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523): 1228–1242.
 Walker (2013) Walker, S. G. 2013. Bayesian inference with misspecified models. Journal of Statistical Planning and Inference, 143(10): 1621–1633.
 Yazdani and Boerwinkle (2015) Yazdani, A.; and Boerwinkle, E. 2015. Causal inference in the age of decision medicine. Journal of Data Mining in Genomics & Proteomics, 6(1).
 Yoon, Jordon, and van der Schaar (2018) Yoon, J.; Jordon, J.; and van der Schaar, M. 2018. GANITE: Estimation of individualized treatment effects using generative adversarial nets. International Conference on Learning Representations.
 Zhang and Lu (2012) Zhang, G.; and Lu, Y. 2012. Biascorrected random forests in regression. Journal of Applied Statistics, 39(1): 151–160.
 Zhang, Bellot, and Schaar (2020) Zhang, Y.; Bellot, A.; and Schaar, M. 2020. Learning overlapping representations for the estimation of individualized treatment effects. In International Conference on Artificial Intelligence and Statistics, 1005–1014. PMLR.
 Zhou et al. (2020) Zhou, T.; Li, Y.; Wu, Y.; and Carlson, D. 2020. Estimating uncertainty intervals from collaborating networks. arXiv preprint arXiv:2002.05212.
Appendix A Discussions on Propositions 1 and 2
Zhou et al. (2020) assumes that the covariate distributions are the same for training and evaluation for CN. In the observational causal setting, , , and all differ. Hence, the central challenge of migrating CN’s properties to CCN is to show its robustness to covariate space imbalance (or mismatch). We discover that CCN is robust under a certain type of covariate space mismatch. First, we will explore the properties of CCN under the presence of covariate space mismatch. Second, we will expand on how CCN with the strong ignorability and covariate shift assumptions satisfies this type of covariate space mismatch.
Here, we restate the two propositions from the main article. Note that they are both claimed on the full covariate space such that .
Proposition 1 (Optimal solution for and ). When the space searching tool is able to cover the full outcome space, the functions and that minimize are optimal when they are equivalent to the conditional CDF of and , such that .
Proposition 2 (Consistency of and ). Assume the ground truth CDF functions for satisfy Lipschitz continuity. Denote the ground truth as and . As , the finite sample estimators and have the following consistency property: under some metric such as norm, and with the space searching tool being able to cover the full outcome space.
Restating Claims from Zhou et al. (2020)
We first restate two similar propositions in CN under the noncausal setting.
Proposition S1 (Optimal solution for from Zhou et al. (2020)).
Assume that approximates the conditional quantile of (inverse CDF, not necessarily perfect). If spans , then a minimizing (1) is optimal when it is equivalent to the conditional CDF, or such that .
Proposition S2 (Consistency of from Zhou et al. (2020)).
Assume the ground truth CDF functions satisfy Lipschitz continuity. As , the finite sample estimator has the following consistency property: under some metric such as norm, and with capable of searching the full outcome space.
Overcoming Covariate Space Mismatch
The Proposition S1 demonstrates that a fixed point solution estimates the correct distributions. Proposition S2 states that the optimal learned function asymptotically estimates the ground truth distribution. However, they do not answer whether these properties hold on an evaluation space that differs from the training space.
We address this existing limitation by developing Proposition S3 which shows that the these properties can still be claimed given a certain type of space mismatch. For generality, we define the training space as and the evaluation space as .
Proposition S3 (The dependency of CN on ).
Proof of Proposition S3.
With the Propositions S1 and S2, we have that estimates the conditional distribution of in the training space where . Next we generalize it to an evaluation space . Given the condition , for any in evaluation space with , it is covered in the training space where . From the Proposition S1 and S2, we know that for such , the optimum can be obtained. The covariate shift assumption on the invariance of outcome distributions then guarantees that the optimum of such in the training space is also the optimum in the evaluation space. Therefore, each point in the evaluation space with can obtain their optimum, so we claim that the optimum can be generalized to the evaluation space .
∎
This proposition enables us to extend CN’s optimum to different evaluation spaces given certain conditions. We next show how it relates and applies to the causal setup where the training and evaluation spaces differ from the imbalance between treatment groups. First, we give a weaker version of Propositions 1 and 2 without accounting for the mismatch between the training and evaluation spaces.
Claim S1 (Potential distributions on each treatment space).
The optimal solutions for and given in (5) guarantees that capture the CDF of , such that for .
Discussion on Claim S1.
We only discuss the first part of this claim S1 for without loss of generality as the other group can be shown with identical steps. The full loss can be expressed as . However, optimizing only involves updating parameters in .
A direct conclusion from Propositions S1 and S2 is that the optimized is the fixed point solution and consistently estimates the true CDF of , such that . This is the full conditional distribution rather than the potential outcome distribution of . By virtue of the ignorability assumption, the following two outcomes are identically distributed: . Therefore, we can successfully establish the estimators for potential outcome distributions, but currently limited to each treatment subspace. ∎
Given Claim S1, we generalize it back to the full covariate space with density . This depends on satisfying the condition in Proposition S3, which is verified by the following Lemma S1.
Lemma S1 (Positivity relating to the space generalization).
Under Assumption 1, the equivalent condition holds: , and .
Proof of Lemma S1.
The positivity in Assumption 1 claims that . Then for each from the full covariate space where , we can find a constant that satisfies .
By Bayes rule, . Then . From another direction, if , each component on the right hand side needs to be positive. Therefore, . The same argument holds for . ∎
With Lemma S1 satisfying the condition in the Proposition S3, Propositions 1 and 2 naturally follow. Under Claim S1, we have shown that the optimal solution of CCN estimates and on each treatment space and . The gap in migrating from and to is now filled by the Proposition S3.
Thus, under the standard assumptions in causal inference, the CCN method will capture the full potential outcome distributions. This procedure is not limited to a binary treatment condition, and is extendable to the multiple treatment setup.
Appendix B Semisynthetic Data Generation
Ihdp
We focus on making our simulation results comparable to other inclusive causal methods’ published results. The simulation replications for the IHDP data are downloaded directly from https://github.com/clinicalml/cfrnet, which were used to generate the results of WASSCFR (Shalit, Johansson, and Sontag, 2017) and CEVAE Louizos et al. (2017). The dataset does not contain personally identifiable information or offensive content.
Education Data
The raw education data are downloaded from the Harvard Dataverse^{1}^{1}1https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/19PPE7, which consist of 33,167 observations and 378 variables. The dataset does not contain personally identifiable information or offensive content. We preprocess the data such as by removing and combining repetitive information, deleting covariates with over 2,5000 missing values. Then we end up with a clean data containing 8,627 observations.
The function and
are learned from the observed outcomes for the treated and control group. They are both designed as singlehiddenlayer neural networks with 32 units per layer and sigmoids as activation functions
Han and Moraga (1995). A logistic regression model with coefficient
and propensity score are used to generate treatment label and mimic an observational setting. The coefficients are randomly generated as .Appendix C Detailed Method Implementations
All Pythonbased methods: CCN, CEVAE and GANITE are run on a single NVIDIA P100 GPU; the Rbased methods, BART, CF, and GAMLSS are run on a Intel(R) Xeon(R) Gold 6154 CPU.
CCN and AdjCCN
The implementation of CCN and all its variants are based on the code base for CN Zhou et al. (2020), which is provided at https://github.com/thuizhou/CollaboratingNetworks with the MIT license. The follow the structures of in Zhou et al. (2020) and we fix
as a uniform distribution covering the range of the observed outcomes.
In WassCCN and DragonCCN, we introduce a latent representation . We set its dimension to 50. It is parametrized through a neural network with a single hidden layer of 100 units.
The Wasserstein distance in Wassloss is learned through , which is a network with two hidden layers of 100 and 60 units per layer. We adopted the weight clipping strategy with threshold: (0.01,0.01) to maintain its Lipschitz constraint Arjovsky, Chintala, and Bottou (2017). The hyperparameter and were tuned for AdjCCN. We used the loglikelihood calculated upon the observed outcomes to tune our parameters. We proposed a few candidate values for as: 5e3, 1e3, 5e4, 1e4, 5e5, 1e5, as we did not want these values to be too large to overtake the main part of the loss that learns the distribution. Then we did grid search to determine the hyperparameters. Based on the results on the first few simulations, we fixed =5e4 and =1e5 in IDHP and =1e5 and =5e3 in EDU. We found this specification to consistently improve the performance over regular CCN.
To access the potential outcome distributions and take expectation over a defined utility function, we draw 3,000 samples for each test data point with the learned .
The code for CCN and its adjustment will be public on Github with the MIT license when the manuscript is accepted.
Bart (Chipman, George, and McCulloch, 2010)
The implementation of BART uses the R package BayesTree (Chipman and McCulloch, 2016) with GPL ( 2) license. We use the default setting in its model structure. Chipman, George, and McCulloch (2007)
suggest that BART’s performance with default prior is already highly competitive and are not highly dependent on fine tuning. We set the burnin iteration to be 1,000. We draw 1,000 random samples per individual to access their posterior predicted distributions, as the package stores all the chain information and is not scalable for large data.
Cevae (Louizos et al., 2017)
The CEVAE is implemented with the publicly available code from https://github.com/rikhelwegen/CEVAE˙pytorch/ with no license specified. We follow its default structure in defining encoders and decoders. The latent confounder size is 20. The optimizer is based on ADAM with weight decay according to Louizos et al. (2017). We use their recommended learning rate and decay rate in IHDP. In EDU dataset, the learning rate was set as 1e4, and the decay as 1e3 after tuning. We draw 3,000 posterior samples to access the posterior distributions.
Ganite (Yoon, Jordon, and van der Schaar, 2018)
The implementation of GANITE is based on https://github.com/jsyoon0823/GANITE
with the MIT license. The model consists of two GANs: one for imputing missing outcomes (counterfactual block) and one for generating the potential outcome distributions (ITE block). Within each block, they have a supervised loss on the observed outcomes to augment the mean estimation of the potential outcomes. We use the recommended specification of
Yoon, Jordon, and van der Schaar (2018) to train the IHDP data. In the education data, the hyperparameters for the supervised loss were set as 2 (counterfactual block) and 1e3 (ITE block) after tuning.Cf (Wager and Athey, 2018)
The implementation of CF uses the R package grf Tibshirani, Athey, and Wager (2020) with GPL3 license. We specified the argument tune.parameters=‘‘all’’ so that all the hyperparameters were automatically tuned.
Gamlss (Hohberg, Pütz, and Kneib, 2020)
The implementation of GAMLSS uses the R package gamlss Rigby and Stasinopoulos (2005) with GPL3 license. Since the method uses likelihood to estimate its parameters and often does not converge under complex models, we feed its location, scale and shape models with relevant variables only. In location models, we fit all continuous variables with penalized splines. In scale and shape models, we uses relevant variables in their linear forms. The choice is based on a balanced consideration of the representation power and model’s stability.
Appendix D Enforcing a Monotonicity Constraint
The learned CCN system should have a monotonic property that . In our experiments, this condition is learned with a standard neural network architecture and our training scheme, and we did not see any nontrivial violations of this requirement. If required, though, this scheme can be enforced by modifying the neural network structure. One way of accomplishing this goal is to use a neural network with the form,
Here, represents the sigmoid function. , , and are all functions neural networks that map from the input space to a
dimensional vector,
In this case, the formulation of the outcome is still highly flexible, but becomes an admixture of sigmoid functions. As the multiplier on is required to be positive, each individual sigmoid function is monotonically increasing as a function of . Because the weight on each sigmoid is positive, this creates a full monotonic function as a function of .We implemented this structure and found that it was competitive with a more standard architecture but was more difficult to optimize. As it was not empirically necessary to implement this strategy, we prefer the more standard architecture in our implementations.
Appendix E Metric Definitions
Below, we give the full definitions and procedures for our metrics.
Precision in Estimation of Heterogeneous Effect (PEHE): We adopt the definition of Hill (2011). Specifically, for unit with covariates , , and estimated means and , then,
(8) 
We note that only evaluates a point estimate, not the distribution or utility.
Log Likelihood (LL): Log likelihood measures how well each method captures the potential outcome distribution. It is normally based on evaluating the PDF functions at the observed data. However, closedform distributions are not directly specified for GANlike approaches such as the variants of CCN and GANITE. Instead, we approximate the log likelihood using the CDF on a neighborhood of the realized outcome , , where is a small positive value. Then, the loglikelihood estimate is,
(9) 
Asymptotically, the true distribution prevails in this evaluation, and this can be shown under the criterion of Kullback–Leibler divergence
(Kullback and Leibler, 1951) , as and . We set for the IHDP and for the EDU to adjust for the scale of outcomes.Area Under the Curve (AUC): A decision on the optimal treatment requires contrasting the quantities and , which should match the ground truth optimal decision. For our semisynthetic cases, the true optimal decision is known. Then, using the estimated gain in utility as the decision score we can estimate the AUC.
Appendix F Additional CDF Visualizations
We provide additional visualizations to evaluate the estimated potential outcome distributions with each method in Figure S1 based on another random sample. These results augment the results visualized in Figure 3. The two variants of CCN are capable of capturing the main shape of true CDF curves, including the asymmetry of the exponential distribution, with higher fidelity. GAMLSS is less accurate in the treatment group due to using skewed normal for the exponential distribution with a location shift. GANITE’s twoGAN structures are highly reliant on data richness for accurate prediction (Yoon, Jordon, and van der Schaar, 2018), so in our experiments it falls short in cases with greater treatment group imbalance. CEVAE captures the overall marginal distribution for the potential outcomes as shown in Figure 1(g), but fails to discern the heterogeneity in each individual in figure S1(f). Bart overall provides reasonable estimation in both cases but struggles with misspecification from its Gaussian form.
Appendix G Policy Learning
In decision making, the core difference between a traditional policy learning method and a distribution learning method is whether the utility is determined in advance. Though a policy learning method can tailor decisions based on different utilities, it is at the cost of fitting models towards each proposed utility. Regardless of the inconvenience in computation, we discuss below another shortcoming of traditional policy learning methods. To train a traditional policy learning approach, the first step is often to convert the raw outcome to the observed utility. While this is less problematic for bijective utilities, it might incur information loss if we deal with discretized utilities.
To demonstrate, we compare AdjCCN to policytree using its published package (Athey and Wager, 2021; Sverdrup et al., 2021) on IHDP. We propose two types of utilities. They are defined as the linear utility, , and the threshold utility, . Since the policytree package only outputs the decision, we use accuracy as the metric. The results are summarized in Table S1
. AdjCCN consistently and faithfully make more correct decisions in both cases. On the other hand, the information loss in the threshold utility negatively impacts policytree’s performance. We note that a threshold drastically reduces the information available for training, as it reduces a continuous variable to a binary variable.
Fundamentally, these two methods are different and they address similar problems from different perspectives. There might be some possibilities that we could combine the merits of them to further improve decision making. Hence, we will conisder exploring there interaction more in future work.
Utility/Method  AdjCCN %  policytree % 

Linear  85.60 1.35  76.86 1.03 
Threshold  86.07 1.44  57.64 .71 
Appendix H Additional Distribution Tests
Theoretically, CCN has the potential to model different types of distributions with high fidelity. To further illustrate this, we simulate the potential outcomes from three extra distributions to assess its adaptability. To compare, we include GAMLSS given true distribution families, BART, CCN and AdjCCN. The assessment is based on log likelihood (LL) to reflect the closeness to the underlying distributions.
First, we simulate the covariate space and treatment assignment according to the following procedure:
Covariates:  
Treatment assignment:  
The resultant distributions of the propensity scores are given in Figure S2. Given the magnitude of , we created a covariate space with limited overlap between two treatment groups. With limited sample size, it also helps us evaluate the robustness of our method when positivity in Assumption 1 is possibly violated. Then we specify three scenarios with sufficient nolinearity added to the potential outcome generating processes.
Distribution / Method  True Value  CCN  AdjCCN  BART  GAMLSS  GAMLSS 

Gumbel  2.87  3.67 .02  3.56 .02  3.92 .06  3.67 .02  3.90 .05 
Gamma  3.17  3.83 .04  3.74 .06  3.97 .02  3.77 .02  3.95 .02 
Weibull  2.87  3.41 .02  3.33 .03  3.86 .12  3.32 .04  3.71 .09 
Gumbel Distribution:
Weibull Distribution:
We generated 2,000 data points in each case, and summarize the results in Table S2 with 5fold cross validations. BART clearly falls behind in this comparison due to the substantial difference between Gaussian and the proposed three distributions. The GAMLSS with heteroskedastic Gaussian has some marginal gain over BART with its added flexibility. In each case, CCN is close to the GAMLSS which is trained in the unrealistic idealized situation where it is given the true distribution families. Though CCN is blind to the distribution families, it still effectively captures them. In all three cases, AdjCCN increases the LL by around .1 over CCN.
Appendix I Estimating Multimodal Distributions
A fundamental reason that we chose to extend CN to estimate potential outcome distributions is its adaptability to different outcome forms. We demonstrate that with another example. Below, we simulate the outcomes from a mixture distributions:
The control group is a mixture of two Gaussian distributions, and the treatment group is a mixture of Gaussian and exponential distributions. Assume that the mixture information is not given to any model, and we simply use their original form to approximate the distributions. Each model was trained using 1,600 simulated samples. Figure S3 visualizes the estimated density for a random testing point. CCN and its variants can still recover the true distribution faithfully guaranteed by its asymptotic properties, while other models fail due to their constraints and assumptions.
Appendix J Ablation Study
We include three components to account for the treatment group imbalance in our adjustment scheme. They are the Assignmentloss (Assignment), the Wassloss (Wass), and the propensity stratification (PS), which combines the propensity score into the covariate spaces. Below, we inspect how CCN empirically benefits from each component.
Metrics/Method  CCN  Wass  Assignment  PS  Assignment+PS  Full Adjustment 

PEHE  1.59 .16  1.32 .17  1.42 .25  1.15 .10  1.22 .15  1.30 .19 
LL  1.78 .02  1.65 .02  1.64 .03  1.67 .15  1.65 .02  1.64 .02 
AUC (Linear)  .925 .011  .938 .010  .940 .010  .918 .012  .940 .010  .942 .010 
AUC (Threshold)  .913 .011  .932 .011  .932 .011  .911 .012  .934 .011  .935 .010 
Metrics/Method  CCN  Wass  Assignment  PS  Assignment+PS  Full Adjustment 

PEHE  .392 .049  .324 .046  .343 .041  .400 .052  .339 .039  .332 .042 
LL  2.178 .024  2.128 .020  2.132 .023  2.171 .024  2.129 .029  2.130 .017 
AUC  .933 .026  .951 .013  .946 .018  .932 .022  .952 .019  .949 .027 
summarize the results of the evaluating metrics in both the IHDP and EDU datasets. Overall, when CCN modified by these components, it more accurately estimates the potential outcomes. However, the aspects on how these components contribute vary by their attributes. The propensity score stratification mainly facilitates information sharing between strata especially for point estimates
(Lunceford and Davidian, 2004), hence it excels in the IHDP dataset where the imbalnce was caused by the disparity in sample size between the control and treatment groups. However, on individual level, the stratification can be regarded as a form of aggregation, which might hinder the precision. Hence, we do not see mcuh gain in distribution based metrics or personalized decisions by solely including the propensity. As the sample size gets larger and dataset becomes more balanced, the effect of stratification becomes less significant as reflected in Table S4. The Assignmentloss overcomes the confounding effect by extracting representation relevant to both treatment assignment and outcome prediction, and we observe that it effectively boosts the model performance on all metrics. As mentioned by Shi, Blei, and Veitch (2019), it could also neglect information useful for outcome prediction solely, which decreases its predicting power on the individual level. This might explain why the Assignmentloss offers fewer benefits for point estimates (PEHE). The combination of Assignmentloss and propensity stratification is displayed to combine the merits of these two approaches. The Wassloss finds a representation that balances the treatment and control groups and improves both point and distribution estimation in the two datasets. Nevertheless, it could be increasingly challenging to optimize when the number of irrelevant covariates is large in a dataset (Shi, Blei, and Veitch, 2019). The visualization in Figure S4 also reflects the aforementioned characteristics of these components. The Assignmentloss and Wassloss improve the distribution estimation, while stratification does not.The full adjustment method takes advantage of each component and stably outperforms CCN in all aspects. However, it does not always prevail compared to other single adjustment method. Regardless, the combination of techniques as it performs the most robustly across a variety of scenarios. We view a greater understanding the tradeoffs between these components as a future direction to more robustly estimate the potential outcome in different scenarios.
An Additional Imbalance Adjustment Study
Below, we give another motivating example to visualize the added robustness with our adjustment scheme. We use the same covariate space and treatment assignment mechanism as described in Appendix H
. However, we posit a nonstandard distribution with its location model as a trigonometric function, and outcome uncertainty model as heterosekdastic Beta distribution. We generated 2,000 data points in total, with 8/2 split in training and testing. The detailed synthetic procedure for the outcomes can be described as:
We visualize the performance of different adjustment schemes in scatter plots where axis corresponds to each point’s true propensity score, a measurement of imbalance. Figure S5 depicts the absolute difference between the inferred ITEs and true ITEs. Lower vertical positions represent lower error. We observe that the performance deteriorates in each method if a point is close to two boundaries (extreme propensity scores), which is the area that generally struggles the most in observational studies. Compared with the baseline CCN, each adjustment scheme by itself lowers the error to some extent. Among them, WASSCCN, AssignmentCCN and AdjCCN are able to reduce the average error by over 50 %. The continuous propensity stratification (PS) can effectively reduce bias when there is more homogeneity within each stratum. However, severe imbalance in this case only gives homogeneity in the strata where propensity is around 0.5. Hence, we do not gain as many benefits with PS. In contrast, WASS and Assignment losses seek new representations to either rectify group level imbalance or exclude confounding effects. They prove to be more effective in the regions of imbalance, which include the majority of points in this case.
Similar trends are noticed in the scatter plot for log likelihood (LL) in Figure S6. Although each method still struggles in the regions of imbalance, extreme estimated values are greatly reduced by the adjustments. The median smoothing curves for the full adjustment method as well as AssignmentCCN and WassCCN are more stable and no longer present sharp disparities in regions with different propensity scores.
In practice, the type of imbalance or outcome distribution can vary case by case. As shown in the two semisynthetic examples and this fully synthetic example, the impact of the different adjustment components varies in response. Hence, we combine these components and form the full adjustment scheme to provide us with a robust algorithm for many different scenarios.
Comments
There are no comments yet.