Introduction
Understanding the individual treatment effect (ITE) of an intervention on an individual with given features is a challenging problem in medicine. When inferring the ITE from observational data, it is common to assume that all of the confounders – factors that affect both the intervention and the outcome – are measurable and captured in the observed data, as shown in Figure 1(a). However, in practice, there are often unobserved (latent) confounders, as shown in Figure 1(b). For example, socioeconomic status cannot be directly measured, but it can influence the types of medications that a subject has access to; therefore, it acts as a confounder between the medication and the patient’s health. If such latent confounders are not appropriately accounted for, the estimated ITE will suffer from confounding bias, making it impossible to estimate the effect of the intervention on the outcome without bias [1, 2]. A common technique for mitigating confounding bias is the use of proxy variables: measurable proxies for the latent confounders that can enable unbiased estimation of the ITE. For instance, in the causal diagram in Figure 1(b), the observed features can be viewed as providing noisy proxies of the latent confounders.
Contributions: We introduce an adversarial framework to infer the complex nonlinear relationship between the proxy variables (i.e., observations) and the latent confounders via approximately recovering posterior distributions that can be used to infer the ITE. Our experiments on synthetic and semi-synthetic observational datasets show that the proposed method is competitive with – and often outperforms – state-of-the-art methods when there are no latent confounders or when the proxy noise is small, and outperforms all tested benchmarks when the proxy variables become noisy.
Causal Effect with Latent Confounders
Our goal is to estimate the ITE from an observational dataset in which each subject is described by a feature vector, a treatment assignment (we assume that the treatment is binary), and an outcome vector. The ITE for a subject with observed potential confounders is defined as
(1)
To recover the ITE under the latent confounder model in Figure 1(b), we need to identify the interventional outcome distributions for both treatment arms. The former can be calculated as follows:
(2) 
where the second equality follows from the rules of do-calculus applied to the causal graph in Figure 1(b) [3]. (The do-operator [1] simulates physical interventions by deleting certain functions from the model, replacing them with a constant value, while keeping the rest of the model unchanged.) The corresponding quantity for the other treatment arm can be derived similarly. It is worth highlighting that the interventional distribution is equivalent to the ordinary conditional distribution when the unconfoundedness assumption holds, as in Figure 1(a).
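Since the inline math in (2) was lost here, the adjustment can be written out explicitly in standard (assumed) notation, with $\mathbf{x}$ the features, $t$ the treatment, $\mathbf{y}$ the outcomes, and $\mathbf{z}$ the latent confounders:

```latex
p(\mathbf{y} \mid \mathbf{x}, do(t = 1))
  = \int p(\mathbf{y} \mid \mathbf{x}, t = 1, \mathbf{z})\,
         p(\mathbf{z} \mid \mathbf{x})\, d\mathbf{z}
```

Intuitively, conditioning on the latent confounders blocks the back-door path between the treatment and the outcome, and marginalizing over their posterior given the proxies recovers the interventional distribution.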
Thus, from (1) and (2), we can estimate the ITE without confounding bias using estimates of the required conditional distributions. Since the latent confounder is unobservable, we assume that the joint distribution over the observations and the latent confounder can be approximately recovered solely from the observations, as justified in [4].
Adversarial Learning for Causal Effect
In this section, we propose a method that estimates Causal Effects using a Generative Adversarial Network (CEGAN). CEGAN’s objective is to estimate the conditional posteriors in (2) under the causal graph in Figure 1(b) so that we can estimate the ITE (1) for new subjects. However, since we cannot measure the true latent confounder, we are unable to directly learn its posterior distribution. Instead, we learn a mapping between the data (observations) and an arbitrary latent space following an adversarial learning framework similar to those developed in [5] and [6].
Our model, depicted in Figure 2, comprises a prediction network (right) and a reconstruction network (left). Each network includes an encoder-decoder pair, where the encoder is shared between them. The posterior distributions required to solve (2) can be estimated using bidirectional models [5, 6] by factorizing the posterior into terms produced by the components of the prediction network and the propensity score. Meanwhile, the reconstruction network is a denoising autoencoder [7], which helps the prediction network find a meaningful mapping to the latent space that preserves information in the data space.
Prediction Network
The prediction network has two components: a generator (which consists of the encoder, the prediction decoder, and the inference subnetwork) and a discriminator.
The encoder, which is employed in both the reconstruction and prediction networks, maps the data space to the latent space. The inference subnetwork is introduced to infer, from the data and the latent variable, the remaining component of the tuple required by the discriminator. The prediction decoder outputs the estimated outcome given a sample drawn from the data distribution and a latent variable inferred by the encoder. Note that the outputs of the generator are randomized by a noise term using the universal approximator technique described in [8].
With the conditional probabilities obtained from the generator, we are able to define two joint distributions: one induced by the encoder and one induced by the prediction decoder. Using tuples drawn from these two joint distributions, CEGAN attempts to match them by playing an adversarial game between the generator and the discriminator. To do so, the prediction discriminator maps tuples to a probability in [0, 1], estimating the probability that a tuple was drawn from the encoder’s joint distribution rather than the decoder’s. The discriminator thus tries to distinguish between tuples drawn from the two distributions. Following the framework in [5], the two distributions can be matched (i.e., the game reaches a saddle point) by solving the following min-max problem between the generator and the discriminator:
(3)
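As a concrete sketch of the min-max game in (3): the discriminator is trained to assign high scores to tuples drawn from the encoder's joint distribution and low scores to tuples from the decoder's, while the generator is trained with the roles reversed. The function names below are illustrative, not from the paper, and the scores are assumed to already be probabilities in (0, 1).

```python
import numpy as np

def discriminator_loss(d_enc, d_dec, eps=1e-8):
    """D ascends the value in (3): score encoder tuples (x, t, y, z_enc)
    toward 1 and decoder tuples (x, t, y_hat, z) toward 0.  Written as a
    loss, so D *minimizes* the negative value."""
    return -np.mean(np.log(d_enc + eps) + np.log(1.0 - d_dec + eps))

def generator_loss(d_enc, d_dec, eps=1e-8):
    """Non-saturating generator loss: push decoder tuples toward 1 and
    encoder tuples toward 0, driving the two joints to match."""
    return -np.mean(np.log(d_dec + eps) + np.log(1.0 - d_enc + eps))
```

At the saddle point the two joints coincide and the discriminator outputs 0.5 everywhere, where each loss equals 2 log 2 ≈ 1.386.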
Reconstruction Network
The relationship between the data and the latent space is not fully specified in the prediction network. Consequently, the network may converge to an undesirable matched joint distribution. For instance, it may learn to match the two joint distributions while inferring latent variables that provide no information about the data samples. We introduce a reconstruction network to nudge the prediction network toward learning a meaningful mapping between the data and latent spaces. We utilize a denoising autoencoder for the reconstruction network, which employs the same encoder as the prediction network together with a reconstruction decoder that recovers the original input of the encoder from its output.
Then, we define the following reconstruction loss:
(4) 
where the loss uses an element-wise logarithm (a cross-entropy) for binary values and a suitable regression loss for continuous values. By minimizing (4) iteratively with the min-max problem (3), the encoder is able to map data samples into the latent space while preserving information that is available in the data space.
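A minimal numerical sketch of (4), assuming squared error for continuous coordinates and cross-entropy for binary coordinates; the split into the two parts follows the text, but the function and mask names are ours.

```python
import numpy as np

def reconstruction_loss(x_true, x_rec, binary_mask, eps=1e-8):
    """Per-sample loss: squared error on continuous coordinates plus
    cross-entropy on binary coordinates, averaged over the batch."""
    binary_mask = np.asarray(binary_mask)
    cont = ~binary_mask
    loss = np.sum((x_true[:, cont] - x_rec[:, cont]) ** 2, axis=1)
    xb = x_true[:, binary_mask]
    pb = np.clip(x_rec[:, binary_mask], eps, 1.0 - eps)  # avoid log(0)
    loss += -np.sum(xb * np.log(pb) + (1.0 - xb) * np.log(1.0 - pb), axis=1)
    return float(np.mean(loss))
```

A perfect reconstruction drives both terms to (numerically) zero, so the loss directly measures how much information about the input survives the encoder-decoder round trip.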
Experiments
Ground truth counterfactual outcomes are never available in observational datasets, which makes it difficult to evaluate causal inference methods. Thus, we evaluate CEGAN against various benchmarks using a semi-synthetic dataset where we model the proxy mechanism to generate latent confounding. In the appendix, we perform further comparisons using a semi-synthetic dataset suggested in [4] and a synthetic dataset.
Performance Metrics: We use two performance metrics in our evaluations – the expected precision in the estimation of heterogeneous effect (PEHE) and the error in the average treatment effect (ATE) [9]:
where, for each sample, the ground-truth treated and controlled outcomes are compared against their estimates.
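Given ground-truth and estimated potential outcomes, the two metrics can be computed as below. This is a sketch; we write PEHE in its common square-root form, which may differ from the exact convention used in the tables.

```python
import numpy as np

def sqrt_pehe(y1, y0, y1_hat, y0_hat):
    """Square root of the expected squared error between true and
    estimated individual treatment effects."""
    ite_true = np.asarray(y1) - np.asarray(y0)
    ite_est = np.asarray(y1_hat) - np.asarray(y0_hat)
    return float(np.sqrt(np.mean((ite_true - ite_est) ** 2)))

def ate_error(y1, y0, y1_hat, y0_hat):
    """Absolute error between true and estimated average treatment effects."""
    return float(abs(np.mean(np.asarray(y1) - np.asarray(y0))
                     - np.mean(np.asarray(y1_hat) - np.asarray(y0_hat))))
```

Note that a constant bias in the estimated treated outcome shifts both metrics by the same amount, while heterogeneous errors can inflate PEHE without affecting the ATE error.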
Method  PEHE  ATE error
no latent confounding  latent confounding  no latent confounding  latent confounding
In-sample  Out-sample  In-sample  Out-sample  In-sample  Out-sample  In-sample  Out-sample
LR1  0.365±0.00  0.367±0.00  0.413±0.01  0.423±0.02  0.045±0.02  0.186±0.03  0.064±0.02  0.206±0.03
LR2  0.404±0.02  0.411±0.02  0.442±0.02  0.454±0.02  0.128±0.03  0.206±0.04  0.148±0.03  0.227±0.04
kNN  0.486±0.02  0.506±0.02  0.492±0.02  0.515±0.02  0.254±0.04  0.264±0.04  0.271±0.04  0.285±0.04
CForest  0.356±0.01  0.372±0.01  0.417±0.02  0.429±0.02  0.025±0.02  0.188±0.03  0.023±0.02  0.186±0.03
BART  0.569±0.06  0.562±0.06  0.877±0.08  0.871±0.08  0.432±0.08  0.429±0.08  0.790±0.09  0.786±0.09
CMGP  0.367±0.01  0.365±0.01  0.430±0.05  0.438±0.05  0.034±0.03  0.036±0.04  0.192±0.09  0.213±0.09
CFR  0.371±0.03  0.371±0.03  0.427±0.05  0.438±0.05  0.056±0.06  0.071±0.06  0.205±0.07  0.226±0.07
CEVAE  0.363±0.00  0.364±0.00  0.423±0.00  0.428±0.00  0.071±0.01  0.165±0.01  0.088±0.01  0.183±0.01
CEGAN  0.363±0.00  0.362±0.00  0.369±0.00  0.369±0.00  0.018±0.01  0.017±0.01  0.022±0.01  0.021±0.02
We compare CEGAN against benchmarks (see the appendix for details of the tested benchmarks) using a semi-synthetic dataset (TWINS) similar to the one first proposed in [4]. Based on records of twin births in the USA from 1989–1991 [10], we artificially create a binary treatment such that the treated (control) condition denotes being born the heavier (lighter) twin. The binary outcome is the mortality of each of the twins in their first year. (Since we have records for both twins, we treat their outcomes as the two potential outcomes with respect to the treatment assignment of being born heavier.) Due to its high correlation with the outcome [11, 12], we select the feature ‘GESTAT’, which is the gestational age in weeks. The treatment assignment is based only on this single variable, using the min-max normalized value of ‘GESTAT’. The data generation process is not exactly equivalent to that proposed in [4], as i) the latter includes artificial proxies of the latent variable in the observational dataset, which is less realistic, and ii) in the latter, the treatment assignment is based not only on the latent variable but also on the observed variables, which is not consistent with the causal model in Figure 1(b). (In the appendix, we report details and results for the TWINS dataset with the same data generation process as in [4].)
To assess the performance of causal inference methods in the presence of latent confounding, we test them on two datasets: “no latent confounding” which contains ‘GESTAT’ and relies on the unconfoundedness assumption as depicted in Figure 1(a) and “latent confounding” which excludes ‘GESTAT’ from the observational dataset and follows the latent causal graph in Figure 1(b).
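The exact treatment model was lost from the text above; as an illustration of "treatment based only on min-max normalized 'GESTAT'", one plausible stand-in draws the treatment from a Bernoulli whose parameter is the normalized gestational age. The Bernoulli link is our assumption, not necessarily the paper's.

```python
import numpy as np

def assign_treatment(gestat, rng=None):
    """t_i ~ Bernoulli(g_i), where g_i is the min-max normalized
    gestational age; the treatment depends on this single variable only."""
    rng = rng or np.random.default_rng(0)
    g = np.asarray(gestat, dtype=float)
    g_norm = (g - g.min()) / (g.max() - g.min())
    return rng.binomial(1, g_norm)
```

Under any such link, removing 'GESTAT' from the observations turns it into a latent confounder, since it drives both the treatment and (through its correlation with mortality) the outcome.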
Throughout the evaluation, for CEVAE and CEGAN we average over 100 Monte Carlo samples from the estimated posteriors derived by each method to compute the interventional outcome estimates in (1). The reported values in Table 1 are averaged over 50 realizations with the same 64/16/20 train/validation/test splits.
The PEHE and ATE performance is reported in Table 1 for both within-sample and out-of-sample tests. The ITE estimation accuracy decreases for all of the evaluated methods after removing ‘GESTAT’ from the observational dataset, due to both the information loss and the confounding bias introduced by the latent confounder. CEGAN provides competitive performance compared to the state-of-the-art when there is no latent confounding, while outperforming all benchmarks on both metrics under latent confounding. When there is latent confounding and the treatment assignment is based solely on the latent confounder, CEGAN provides more robust performance than CEVAE.
Conclusion
In this paper, we studied the problem of estimating causal effects in the latent confounder model. In order to obtain unbiased estimates of the ITE, we introduced a novel method, CEGAN, which utilizes an adversarially learned bidirectional model along with a denoising autoencoder. CEGAN achieves performance competitive with numerous state-of-the-art benchmarks when the unconfoundedness assumption holds or the proxy noise is small, while outperforming state-of-the-art causal inference methods when latent confounding is present. CEGAN performs especially well when the proxy noise is large and the treatment is determined solely by the latent confounders.
References
 [1] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.
 [2] Manabu Kuroki and Judea Pearl. Measurement bias and effect restoration in causal inference. Biometrika, 101(2):423–437, March 2014.

 [3] Judea Pearl. On measurement bias in causal inference. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI 2010), pages 425–432, 2010.
 [4] Christos Louizos, Uri Shalit, Joris Mooij, David Sontag, Richard Zemel, and Max Welling. Causal effect inference with deep latent-variable models. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), 2017.
 [5] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), 2017.
 [6] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), 2017.

 [7] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408, 2010.
 [8] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
 [9] Jennifer L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.
 [10] Douglas Almond, Kenneth Y. Chay, and David S. Lee. The costs of low birth weight. The Quarterly Journal of Economics, 120(3):1031–1083, 2005.
 [11] Kath Moser, Alison Macfarlane, Yuan Huang Chow, Lisa Hilder, and Nirupa Dattani. Introducing new data on gestation-specific infant mortality among babies born in 2005 in England and Wales. Health Stat Q., 35:13–27, 2007.
 [12] M. J. Platt. Outcomes in preterm infants. Public Health, 128(5):399–403, 2014.
 [13] Richard K. Crump, V. Joseph Hotz, Guido W. Imbens, and Oscar A. Mitnik. Nonparametric tests for treatment effect heterogeneity. The Review of Economics and Statistics, 90(3):389–405, 2008.
 [14] Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. Bart: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266–298, 2010.

 [15] Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 2017.
 [16] Uri Shalit, Fredrik D. Johansson, and David Sontag. Estimating individual treatment effect: Generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), 2017.
 [17] Ahmed M. Alaa and Mihaela van der Schaar. Bayesian inference of individualized treatment effects using multi-task Gaussian processes. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), 2017.
 [18] Elizabeth S Allman, Catherine Matias, and John A Rhodes. Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics, pages 3099–3132, 2009.
Optimization of CEGAN
CEGAN is trained in an iterative fashion: we alternate between optimizing the reconstruction network and the prediction network until convergence. In this section, we describe the empirical loss functions that are used to optimize each component. Pseudocode for training CEGAN is provided in Algorithm 1.
We train the reconstruction network by optimizing the following objective in a supervised fashion:
For the prediction network, we define an empirical value function for the min-max optimization problem in (3):
In addition, we define the following reconstruction loss at the prediction decoder:
where the loss is defined as in (4). Overall, the discriminator and the generator iteratively optimize the following objectives, which are balanced by a trade-off parameter:
(5) 
When optimizing CEGAN, the reconstruction loss at the prediction decoder is used as a regularizer that improves training compared to using only the GAN loss (5), and the reconstruction network’s loss drives the learning process toward informative latent codes. Specifically, we train the generator to minimize the reconstruction loss iteratively with the GAN loss (5), which forces the encoder to learn a latent mapping that is informative enough to reconstruct the data and, thus, drives it to favor a meaningful latent structure over a trivial one.
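The alternating scheme described above can be sketched as follows; the step functions and the generator objective are schematic stand-ins for the actual gradient updates, and the trade-off weighting mirrors the parameter in (5).

```python
def generator_objective(gan_value, pred_recon_loss, lam=1.0):
    """Generator objective: adversarial value plus a lambda-weighted
    reconstruction loss at the prediction decoder."""
    return gan_value + lam * pred_recon_loss

def train_cegan(step_recon, step_disc, step_gen, n_iters=100):
    """Alternate until the iteration budget is exhausted: update the
    denoising autoencoder, then the discriminator, then the generator."""
    for _ in range(n_iters):
        step_recon()  # minimize the reconstruction objective
        step_disc()   # ascend the empirical min-max value in (3)
        step_gen()    # descend generator_objective(...)
```

In practice each step would be one (or a few) optimizer updates on a minibatch; the fixed ordering keeps the discriminator roughly in equilibrium with the generator while the autoencoder anchors the latent space.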
Additional Experiments
Benchmarks
We compare CEGAN with several state-of-the-art methods, including logistic regression using the treatment as a feature (LR1), logistic regression trained separately for each treatment assignment (LR2), k-nearest neighbors (kNN) [13], Bayesian additive regression trees (BART) [14], causal forests (CForest) [15], counterfactual regression with the Wasserstein distance (CFR) [16] (https://github.com/clinicalml/cfrnet), causal multi-task Gaussian processes (CMGP) [17], and the Causal Effect VAE (CEVAE) [4] (https://github.com/AMLab-Amsterdam/CEVAE). For continuous outcomes, the logistic regressions are replaced with least-squares linear regressions. We also compare against a variant of CEGAN trained only with the supervised prediction loss, which is equivalent to a feed-forward network consisting of the encoder and the prediction decoder. Consequently, this variant does not exploit adversarial learning and does not account for latent confounders (i.e., its inferred latent variables are not trained to have a meaningful relationship with the data samples).
Simulation Settings
Unless otherwise specified, we set the trade-off parameter in (5) to a fixed value and assume a 20-dimensional latent space. A fully-connected network is used for each component of the prediction network, and a multi-output network is used for the reconstruction network. Each of these networks comprises several layers of hidden units with ReLU activation functions. The networks are trained using the Adam optimizer with a fixed minibatch size and learning rate. Dropout is applied, and Xavier and zero initializations are used for the weight matrices and bias vectors, respectively. CEGAN is implemented using TensorFlow.
Semi-Synthetic Dataset: TWINS proposed in [4]
Method  PEHE  ATE error
small proxy noise  large proxy noise  small proxy noise  large proxy noise
In-sample  Out-sample  In-sample  Out-sample  In-sample  Out-sample  In-sample  Out-sample
LR1  0.373±0.00  0.365±0.00  0.379±0.00  0.370±0.00  0.025±0.01  0.021±0.01  0.069±0.02  0.064±0.02
LR2  0.376±0.00  0.374±0.01  0.384±0.01  0.381±0.01  0.021±0.01  0.016±0.01  0.069±0.02  0.063±0.02
kNN  0.385±0.01  0.398±0.01  0.409±0.01  0.422±0.01  0.020±0.02  0.019±0.02  0.116±0.03  0.111±0.03
CForest  0.400±0.30  0.410±0.26  0.409±0.33  0.416±0.27  0.016±0.01  0.024±0.01  0.055±0.02  0.066±0.02
BART  0.400±0.02  0.397±0.02  0.455±0.03  0.456±0.03  0.074±0.05  0.078±0.05  0.234±0.06  0.241±0.07
CMGP  0.377±0.01  0.370±0.02  0.380±0.01  0.371±0.01  0.054±0.02  0.049±0.02  0.078±0.03  0.072±0.03
CFR  0.373±0.00  0.366±0.00  0.379±0.01  0.373±0.01  0.024±0.02  0.021±0.02  0.068±0.03  0.063±0.03
CEVAE  0.369±0.00  0.367±0.00  0.373±0.00  0.368±0.01  0.015±0.01  0.020±0.01  0.032±0.02  0.027±0.02
CEGAN  0.371±0.00  0.364±0.00  0.372±0.00  0.366±0.00  0.021±0.01  0.016±0.01  0.030±0.01  0.026±0.01
In this subsection, we compare CEGAN against the aforementioned benchmarks using a semi-synthetic dataset that was first proposed in [4]. The dataset is based on records of twin births in the USA from 1989–1991 [10]. Using this real-world dataset, we artificially create a binary treatment such that the treated (control) condition denotes being born the heavier (lighter) twin. The binary outcome corresponds to the mortality of each of the twins in their first year. Since we have records for both twins, we treat their outcomes as the two potential outcomes with respect to the treatment assignment of being born heavier. To make a semi-synthetic dataset, we choose same-sex twins, discard features that are only available after birth, and focus on cases where both twins have birth weights below 2 kg. Overall, we have a dataset of 10,286 twins with 49 features related to the parents, the pregnancy, and the birth. (We made every effort to faithfully reproduce the dataset from its source [10] using the same criteria as in [4], but did not end up with the same number of twins or features.) The mortality rate of the lighter twin is higher than that of the heavier twin, which yields a nonzero average treatment effect.
For the TWINS dataset whose data generation process is equivalent to that proposed in [4], we base our treatment assignment on the feature ‘GESTAT10’, which is a categorical value from 0 to 9 representing the number of gestation weeks. (In this experiment, ‘GESTAT’ is discarded; see the description in the manuscript.) We then follow the treatment and noisy proxy generation procedures reported in [4], drawing the treatment from a model that depends on both ‘GESTAT10’ and the observed features. (Since we have four more features, we did not obtain comparable treatment proportions using the parameters reported in [4], so we calibrated the mean of the treatment model from 5 to 9 to achieve similar results.) We artificially generate noisy proxies by using three randomly flipped replicas of the one-hot encoded ‘GESTAT10’. It is worth highlighting that, compared to the TWINS dataset proposed in the manuscript, artificial proxy variables are created based on the gestational age feature and included as additional observed features, and the treatment depends not only on the latent variable but also on these artificial proxies.
The PEHE and ATE results are reported in Table 2 for both in-sample and out-of-sample tests. When the proxy noise is relatively small, CEGAN and CEVAE achieve performance comparable to the other benchmarks. This aligns with the well-known result that three independent views of a latent feature guarantee that it can be recovered [18], so even techniques that do not account for latent confounders can make accurate predictions. In contrast, when the artificial proxy variables become too noisy to be useful, CEGAN and CEVAE achieve comparable performance to each other and outperform the other benchmarks due to their robustness to latent confounders. However, the artificially generated treatments in this dataset are conditioned on both the latent variable and the observed features, which is inconsistent with the causal diagram in Figure 1(b), where the treatment depends only on the latent features.
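The proxy construction described above (three independently bit-flipped replicas of the one-hot encoded ‘GESTAT10’) can be sketched as follows; the flip probability values used in the experiments were lost from the text here, so it is left as a parameter.

```python
import numpy as np

def noisy_proxies(gestat10, flip_p, n_replicas=3, n_cats=10, rng=None):
    """Return n_replicas one-hot copies of GESTAT10 (values 0..9), each
    bit flipped independently with probability flip_p."""
    rng = rng or np.random.default_rng(0)
    onehot = np.eye(n_cats)[np.asarray(gestat10)]   # (n, n_cats)
    reps = np.tile(onehot, (1, n_replicas))         # (n, n_cats * n_replicas)
    flips = rng.random(reps.shape) < flip_p
    return np.abs(reps - flips.astype(float))
```

With a low flip probability the three replicas act as three independent views of the latent gestational age, which is exactly the regime where the identifiability result in [18] applies.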
Synthetic dataset: toy example
To further illustrate the robustness of CEGAN to latent confounders, we generate a synthetic dataset as follows:
(6) 
where the treatment model uses the sigmoid function. We assume a binary treatment, a one-dimensional latent confounder, and 5-dimensional features and outcomes.
Since we only have access to the observations, in which the features are a noisy proxy of the latent confounder, the above generation process introduces latent confounding between the treatment and the outcome, as illustrated in Figure 1(b). Without measuring the confounder, we expect causal inference methods to suffer from confounding bias. In our experiments, we evaluate over a fixed sample size and average over 50 realizations of the outcomes with the same 64/16/20 train/validation/test splits.
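The concrete parameters of (6) were lost from the text; the sketch below generates a generic instance of the same structure (a latent binary confounder, noisy 5-dimensional proxies, and a treatment that depends only on the confounder), with a scalar outcome for simplicity even though the paper uses a 5-dimensional one. All numeric choices here are placeholders.

```python
import numpy as np

def generate_toy(n, sigma, rng=None):
    """z: 1-d latent confounder; x: 5-d noisy proxy of z; t depends only
    on z; y1/y0: potential outcomes that also depend on z."""
    rng = rng or np.random.default_rng(0)
    z = rng.binomial(1, 0.5, size=n).astype(float)
    x = z[:, None] + sigma * rng.standard_normal((n, 5))       # noisy proxies
    t = rng.binomial(1, 1.0 / (1.0 + np.exp(-3.0 * (z - 0.5))))  # sigmoid link
    y1 = z + 0.1 * rng.standard_normal(n)                      # treated outcome
    y0 = -z + 0.1 * rng.standard_normal(n)                     # control outcome
    y = t * y1 + (1 - t) * y0                                  # observed outcome
    return x, t, y, y1, y0
```

Because z drives both t and the outcomes while only the noisy x is observed, any method that regresses y on (x, t) alone inherits confounding bias that grows with sigma.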
In Figure 3(a), using out-of-sample tests, we illustrate how the PEHE varies with the standard deviation of the noise in the proxy mechanism that maps the latent confounder to the features. The PEHE increases with the noise for all evaluated benchmarks because the conditional entropy of the latent confounder given the proxy grows with the noise level. However, CEGAN is more robust to the noise than the other benchmarks – including CEVAE, which accounts for latent confounders.
Following the previous discussion regarding its relationship to CEVAE, we believe that CEGAN performs better than CEVAE because it requires fewer intermediate steps to infer the outcome when estimating the ITE (1). In particular, both CEVAE and CEGAN need to predict an intermediate treatment assignment, while CEVAE also needs to predict an intermediate outcome. Since the treatment is binary, we adopt the cross-entropy defined in (4) between the true and predicted treatments to quantify the error in the predicted intermediate treatment; this error affects both CEVAE and CEGAN. To evaluate the error in predicting the intermediate outcome, we compute the absolute difference between the true outcome and the intermediate outcome conditioned on the predicted (rather than true) treatment; this error only affects CEVAE. In Figures 3(b) and 3(c), we show how these two errors vary with the standard deviation of the noise.
Figures 3(b) and 3(c) demonstrate that the intermediate predictions made by both CEGAN and CEVAE become less accurate as the noise increases. We conjecture that this error accumulates and propagates to the inference of the outcome, eventually decreasing the accuracy of the ITE estimates. Consequently, since CEVAE requires more intermediate steps, it performs worse than CEGAN. Note that we omit error measurements in the latent space because (i) differences between latent variables do not necessarily correspond to the accuracy of predictions based on them, and (ii) we cannot directly compare errors in the different latent spaces generated by CEVAE and CEGAN.