Estimation of Individual Treatment Effect in Latent Confounder Models via Adversarial Learning

Estimating the individual treatment effect (ITE) from observational data is essential in medicine. A central challenge in estimating the ITE is handling confounders, which are factors that affect both an intervention and its outcome. Most previous work relies on the unconfoundedness assumption, which posits that all the confounders are measured in the observational data. However, if there are unmeasurable (latent) confounders, then confounding bias is introduced. Fortunately, noisy proxies for the latent confounders are often available and can be used to make an unbiased estimate of the ITE. In this paper, we develop a novel adversarial learning framework to make unbiased estimates of the ITE using noisy proxies.





Understanding the individual treatment effect (ITE) on an outcome $y$ of an intervention $t$ on an individual with features $x$ is a challenging problem in medicine. When inferring the ITE from observational data, it is common to assume that all of the confounders – factors that affect both the intervention and the outcome – are measurable and captured in the observed data, as shown in Figure 1(a). However, in practice, there are often unobserved (latent) confounders, as shown in Figure 1(b). For example, socio-economic status cannot be directly measured, but can influence the types of medications that a subject has access to; therefore, it acts as a confounder between the medication and the patient’s health. If such latent confounders are not appropriately accounted for, then the estimated ITE will be subject to confounding bias, making it impossible to estimate the effect of the intervention on the outcome without bias Pearl:00; Kuroki:14. A common technique for mitigating confounding bias is using proxy variables, which are measurable proxies for the latent confounders that can enable unbiased estimation of the ITE. For instance, in the causal diagram in Figure 1(b), $X$ can be viewed as providing noisy proxies of the latent confounders $Z$.

Figure 1: Causal diagrams. (a) $X$ is an observed confounder between the intervention $T$ and outcome $Y$. (b) $Z$ is a latent confounder, and $X$ serves as a proxy that provides noisy views of $Z$. Shaded and unshaded nodes denote observed and unobserved (latent) variables, respectively.

Contributions: We introduce an adversarial framework to infer the complex non-linear relationship between the proxy variables (i.e., observations) and the latent confounders via approximately recovering posterior distributions that can be used to infer the ITE. Our experiments on synthetic and semi-synthetic observational datasets show that the proposed method is competitive with – and often outperforms – state-of-the-art methods when there are no latent confounders or when the proxy noise is small, and outperforms all tested benchmarks when the proxy variables become noisy.

Causal Effect with Latent Confounders

Our goal is to estimate the ITE from an observational dataset $\mathcal{D} = \{(x_i, t_i, y_i)\}_{i=1}^{N}$, where $x_i$, $t_i$, and $y_i$ denote the $i$-th subject’s feature vector, treatment (we assume that the treatment is binary, i.e., $t_i \in \{0, 1\}$), and outcome vector, respectively, and $N$ is the number of subjects. The ITE for a subject with observed potential confounder $x$ is defined as

$$\mathrm{ITE}(x) = \mathbb{E}[y \mid do(t=1), x] - \mathbb{E}[y \mid do(t=0), x]. \qquad (1)$$

To recover the ITE under the latent confounder model in Figure 1(b), we need to identify $p(y \mid do(t=1), x)$ and $p(y \mid do(t=0), x)$. The former can be calculated as follows:

$$p(y \mid do(t=1), x) = \int_{z} p(y \mid do(t=1), z, x)\, p(z \mid do(t=1), x)\, dz = \int_{z} p(y \mid t=1, z)\, p(z \mid x)\, dz, \qquad (2)$$

where the second equality follows from the rules of do-calculus¹ applied to the causal graph in Figure 1(b) Pearl:10. ($p(y \mid do(t=0), x)$ can be derived similarly.) It is worth highlighting that $p(y \mid do(t), x)$ is equivalent to $p(y \mid t, x)$ if the unconfoundedness assumption holds, as in Figure 1(a).

¹The do-operator Pearl:00 simulates physical interventions by deleting certain functions from the model, replacing them with a constant value, while keeping the rest of the model unchanged.
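As a concrete illustration of the adjustment formula above, the marginalization over the latent confounder can be sketched with a discrete toy model; the distributions below (a binary latent $z$ and a binary outcome $y$) are illustrative stand-ins, not the paper's model:

```python
# Sketch of the adjustment p(y | do(t), x) = sum_z p(y | t, z) p(z | x)
# for a discrete latent confounder. All probabilities are illustrative.

# p(z | x): posterior over a binary latent confounder given the proxy x.
p_z_given_x = {0: 0.3, 1: 0.7}

# p(y = 1 | t, z): outcome model indexed by treatment and latent confounder.
p_y1_given_tz = {(0, 0): 0.10, (0, 1): 0.30,
                 (1, 0): 0.40, (1, 1): 0.80}

def p_y1_do(t):
    """p(y = 1 | do(t), x): marginalize the latent confounder out of the outcome model."""
    return sum(p_y1_given_tz[(t, z)] * pz for z, pz in p_z_given_x.items())

# ITE(x) = E[y | do(t=1), x] - E[y | do(t=0), x]
ite = p_y1_do(1) - p_y1_do(0)
print(round(ite, 3))  # 0.44 for these illustrative numbers
```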

Thus, from (1) and (2), we can make estimates of the ITE without confounding bias using the estimates of the conditional distributions $p(y \mid t, z)$ and $p(z \mid x)$. Since $z$ is unobservable, we assume that the joint distribution $p(z, x, t, y)$ can be approximately recovered solely from the observations $(x, t, y)$, as justified in Louizos:17.

Adversarial Learning for Causal Effect

In this section, we propose a method that estimates the Causal Effect using a Generative Adversarial Network (CEGAN). CEGAN’s objective is to estimate the conditional posteriors in (2) under the causal graph in Figure 1(b) so that we can estimate the ITE (1) for new subjects. However, since we cannot measure the true latent confounder, we are unable to directly learn the posterior distribution $p(z \mid x, t, y)$. Instead, we learn a mapping between the data (observations) and an arbitrary latent space following an adversarial learning framework similar to those developed in Dumoulin:17 and Donahue:17.

Figure 2: An illustration of the proposed network architecture.

Our model, depicted in Figure 2, comprises a prediction network (right) and a reconstruction network (left). Each network includes an encoder-decoder pair, where the encoder is shared between them. The posterior distributions that are required to solve (2) can be estimated using bidirectional models Dumoulin:17; Donahue:17 via factorizing the posterior distribution as $q(y, z \mid x, t) = q(y \mid x, z, t)\, q(z \mid x)$ and $q(t \mid x)$, where $q(z \mid x)$ and $q(y \mid x, z, t)$ are the components of the prediction network and $q(t \mid x)$ is the propensity score. Meanwhile, the reconstruction network is a denoising autoencoder Vincent:10, which helps the prediction network find a meaningful mapping to the latent space that preserves information in the data space.

Prediction Network

The prediction network has two components: a generator $G$ (which consists of the encoder $E$, the prediction decoder $P$, and the inference subnetwork $I$) and a discriminator $D$.

The encoder ($E$), which is employed in both the reconstruction and prediction networks, maps the data space to the latent space. Thus, the output of the encoder is given by $\tilde{z} = E(x, t, y, \epsilon)$. The inference subnetwork ($I$) is introduced to infer $z$ based on $x$ and the noise $\epsilon$ given $t$; its output is given by $\tilde{z} = I(x, t, \epsilon)$. The prediction decoder ($P$) is a function that outputs the estimated outcome given a sample drawn from the data distribution and a latent variable inferred by $I$; thus, $\hat{y} = P(x, t, \tilde{z})$. Note that the outputs of the generator are randomized by the noise term $\epsilon$ using the universal approximator technique described in Makhzani:16.
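The universal approximator technique mentioned above (feeding an auxiliary noise input to a deterministic network so that its output is a sample from a conditional distribution) can be sketched as follows; the toy "encoder" and its coefficients are purely illustrative:

```python
import random

# Sketch of the universal approximator trick: instead of outputting
# distribution parameters, a network takes an extra noise input and returns a
# sample directly. The hand-written map below stands in for a neural network.

def encoder(x, eps):
    # A deterministic map of (input, noise) -> latent sample; with eps ~ N(0, 1)
    # this induces a conditional distribution q(z | x) without a closed form.
    return 0.5 * x + 0.1 * eps

random.seed(0)
samples = [encoder(2.0, random.gauss(0.0, 1.0)) for _ in range(10000)]
mean = sum(samples) / len(samples)
print(round(mean, 2))  # the sample mean is close to 0.5 * x = 1.0
```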

With the conditional probabilities $q_E(z \mid x, t, y)$, $q_I(z \mid x, t)$, and $q_P(y \mid x, z, t)$ obtained from the generator, we are able to define two joint distributions: $q_E(x, t, y, z) = p(x, t, y)\, q_E(z \mid x, t, y)$ for the encoder and $q_P(x, t, y, z) = p(x, t)\, q_I(z \mid x, t)\, q_P(y \mid x, z, t)$ for the prediction decoder. Using tuples $(x, t, y, z)$ drawn from the two joint distributions, CEGAN attempts to match these distributions by playing an adversarial game between the generator and the discriminator. To do so, the prediction discriminator ($D$) maps tuples $(x, t, y, z)$ to a probability in $[0, 1]$. Specifically, $D(x, t, y, z)$ and $1 - D(x, t, y, z)$ denote estimates of the probabilities that the tuple is drawn from $q_E$ and $q_P$, respectively. The discriminator tries to distinguish between tuples that are drawn from $q_E$ and $q_P$. Following the framework in Dumoulin:17, the two distributions can be matched (i.e., they reach the same saddle point) by solving the following min-max problem between the generator and the discriminator:

$$\min_{G} \max_{D} \; \mathbb{E}_{(x,t,y,z) \sim q_E}\big[\log D(x, t, y, z)\big] + \mathbb{E}_{(x,t,y,z) \sim q_P}\big[\log\big(1 - D(x, t, y, z)\big)\big]. \qquad (3)$$
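A minimal numeric sketch of the adversarial value function in (3): a toy discriminator scores tuples $(x, t, y, z)$, and the empirical value sums the average log-score of encoder-side samples and the average log of one minus the score of decoder-side samples. The scorer and the sampled tuples are illustrative stand-ins, not trained networks:

```python
import math
import random

# Sketch of the min-max objective: the discriminator D scores tuples
# (x, t, y, z) drawn from the encoder joint (first term) versus the
# prediction-decoder joint (second term). D is an illustrative fixed scorer.

def D(x, t, y, z):
    # Toy discriminator: logistic score of a fixed linear combination.
    s = 0.8 * x - 0.5 * t + 0.3 * y - 0.6 * z
    return 1.0 / (1.0 + math.exp(-s))

def value(encoder_tuples, decoder_tuples):
    # V(D, G) = E_{q_E}[log D] + E_{q_P}[log(1 - D)]
    a = sum(math.log(D(*tup)) for tup in encoder_tuples) / len(encoder_tuples)
    b = sum(math.log(1.0 - D(*tup)) for tup in decoder_tuples) / len(decoder_tuples)
    return a + b

random.seed(1)
enc = [(random.gauss(0, 1), 1, random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100)]
dec = [(random.gauss(0, 1), 0, random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100)]
v = value(enc, dec)
print(v < 0.0)  # a sum of averaged log-probabilities is always negative
```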
Reconstruction Network

The relationship between the data and the latent space is not specified in the prediction network. Consequently, the network may converge to an undesirable matched joint distribution. For instance, it may learn to match the joint distributions $q_E$ and $q_P$ while inferring latent variables that provide no information about the data samples. We introduce a reconstruction network to nudge the prediction network toward learning a meaningful mapping between the data and latent spaces. We utilize a denoising autoencoder for the reconstruction network, which employs the same encoder as the prediction network, $E$, and a reconstruction decoder $R$. $R$ reconstructs the original input of the encoder from the output of $E$; the output can be given as $(\hat{x}, \hat{t}, \hat{y}) = R(E(x, t, y, \epsilon))$.

Then, we define the following reconstruction loss:

$$\mathcal{L}_R(v, \hat{v}) = \begin{cases} \lVert v - \hat{v} \rVert_2^2 & \text{for continuous values}, \\ -\big[ v^\top \log \hat{v} + (1 - v)^\top \log(1 - \hat{v}) \big] & \text{for binary values}, \end{cases} \qquad (4)$$

where $v \in \{x, t, y\}$ and $\hat{v} \in \{\hat{x}, \hat{t}, \hat{y}\}$ with $(\hat{x}, \hat{t}, \hat{y}) = R(E(x, t, y, \epsilon))$. Here, $\log$ denotes the element-wise logarithm. By minimizing (4) iteratively with the min-max problem (3), $E$ is able to map data samples into the latent space while preserving information that is available in the data space.
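The two branches of the reconstruction loss (squared error for continuous values, element-wise cross-entropy for binary values) can be sketched directly; the helper names below are hypothetical, not from the paper:

```python
import math

# Sketch of the reconstruction loss in (4): squared error for continuous
# vectors and element-wise cross-entropy for binary vectors.

def recon_loss_continuous(v, v_hat):
    # Squared L2 distance between target and reconstruction.
    return sum((a - b) ** 2 for a, b in zip(v, v_hat))

def recon_loss_binary(v, v_hat, eps=1e-12):
    # -[v log v_hat + (1 - v) log(1 - v_hat)], summed element-wise;
    # eps guards against log(0).
    return -sum(a * math.log(b + eps) + (1 - a) * math.log(1 - b + eps)
                for a, b in zip(v, v_hat))

mse = recon_loss_continuous([1.0, 2.0], [1.5, 2.0])
ce = recon_loss_binary([1, 0], [0.9, 0.2])
print(round(mse, 2), round(ce, 3))  # 0.25 and -(ln 0.9 + ln 0.8) ~= 0.329
```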


Experiments

Ground truth counterfactual outcomes are never available in observational datasets, which makes it difficult to evaluate causal inference methods. Thus, we evaluate CEGAN against various benchmarks using a semi-synthetic dataset where we model the proxy mechanism to generate latent confounding. In the appendix, we perform further comparisons using a semi-synthetic dataset suggested in Louizos:17 and a synthetic dataset.

Performance Metrics: We use two different performance metrics in our evaluations – the expected precision in the estimation of heterogeneous effect (PEHE) and the error in the average treatment effect (ATE) Hill:11:

$$\epsilon_{\mathrm{PEHE}} = \frac{1}{N} \sum_{i=1}^{N} \Big( \big(y_i^{(1)} - y_i^{(0)}\big) - \big(\hat{y}_i^{(1)} - \hat{y}_i^{(0)}\big) \Big)^2, \qquad \epsilon_{\mathrm{ATE}} = \bigg| \frac{1}{N} \sum_{i=1}^{N} \big(y_i^{(1)} - y_i^{(0)}\big) - \frac{1}{N} \sum_{i=1}^{N} \big(\hat{y}_i^{(1)} - \hat{y}_i^{(0)}\big) \bigg|,$$

where $y_i^{(1)}$ and $y_i^{(0)}$ are the ground-truth treated and controlled outcomes for the $i$-th sample and $\hat{y}_i^{(1)}$ and $\hat{y}_i^{(0)}$ are their estimates.
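The two metrics can be sketched as follows (reporting the square root of PEHE, as in the tables); the toy outcomes are illustrative:

```python
import math

# Sketch of the evaluation metrics: PEHE penalizes per-subject errors in the
# estimated effect, while the ATE error only compares population averages.

def sqrt_pehe(y1, y0, y1_hat, y0_hat):
    n = len(y1)
    return math.sqrt(sum(((a - b) - (c - d)) ** 2
                         for a, b, c, d in zip(y1, y0, y1_hat, y0_hat)) / n)

def ate_error(y1, y0, y1_hat, y0_hat):
    n = len(y1)
    true_ate = sum(a - b for a, b in zip(y1, y0)) / n
    est_ate = sum(c - d for c, d in zip(y1_hat, y0_hat)) / n
    return abs(true_ate - est_ate)

# Toy ground-truth potential outcomes and estimates for three subjects.
y1, y0 = [1.0, 0.0, 1.0], [0.0, 0.0, 1.0]
y1h, y0h = [0.8, 0.1, 0.9], [0.1, 0.0, 0.8]
print(round(sqrt_pehe(y1, y0, y1h, y0h), 3), round(ate_error(y1, y0, y1h, y0h), 3))
```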

| Method | $\sqrt{\epsilon_{\mathrm{PEHE}}}$, no latent conf. (In / Out) | $\sqrt{\epsilon_{\mathrm{PEHE}}}$, latent conf. (In / Out) | $\epsilon_{\mathrm{ATE}}$, no latent conf. (In / Out) | $\epsilon_{\mathrm{ATE}}$, latent conf. (In / Out) |
|---|---|---|---|---|
| LR-1 | 0.365±0.00 / 0.367±0.00 | 0.413±0.01 / 0.423±0.02 | 0.045±0.02 / 0.186±0.03 | 0.064±0.02 / 0.206±0.03 |
| LR-2 | 0.404±0.02 / 0.411±0.02 | 0.442±0.02 / 0.454±0.02 | 0.128±0.03 / 0.206±0.04 | 0.148±0.03 / 0.227±0.04 |
| kNN | 0.486±0.02 / 0.506±0.02 | 0.492±0.02 / 0.515±0.02 | 0.254±0.04 / 0.264±0.04 | 0.271±0.04 / 0.285±0.04 |
| CForest | 0.356±0.01 / 0.372±0.01 | 0.417±0.02 / 0.429±0.02 | 0.025±0.02 / 0.188±0.03 | 0.023±0.02 / 0.186±0.03 |
| BART | 0.569±0.06 / 0.562±0.06 | 0.877±0.08 / 0.871±0.08 | 0.432±0.08 / 0.429±0.08 | 0.790±0.09 / 0.786±0.09 |
| CMGP | 0.367±0.01 / 0.365±0.01 | 0.430±0.05 / 0.438±0.05 | 0.034±0.03 / 0.036±0.04 | 0.192±0.09 / 0.213±0.09 |
| CFR | 0.371±0.03 / 0.371±0.03 | 0.427±0.05 / 0.438±0.05 | 0.056±0.06 / 0.071±0.06 | 0.205±0.07 / 0.226±0.07 |
| CEVAE | 0.363±0.00 / 0.364±0.00 | 0.423±0.00 / 0.428±0.00 | 0.071±0.01 / 0.165±0.01 | 0.088±0.01 / 0.183±0.01 |
| CEGAN | 0.363±0.00 / 0.362±0.00 | 0.369±0.00 / 0.369±0.00 | 0.018±0.01 / 0.017±0.01 | 0.022±0.01 / 0.021±0.02 |

Table 1: Comparison of $\sqrt{\epsilon_{\mathrm{PEHE}}}$ and $\epsilon_{\mathrm{ATE}}$ (mean ± std) on the TWINS dataset.

We compare CEGAN against benchmarks (see the appendix for details of the tested benchmarks) using a semi-synthetic dataset (TWINS) that is similar to the one first proposed in Louizos:17. Based on records of twin births in the USA from 1989–1991 Almond:05, we artificially create a binary treatment such that $t_i = 1$ ($t_i = 0$) denotes being born the heavier (lighter) twin. The binary outcome is the mortality of each of the twins in their first year. (Since we have records for both twins, we treat their outcomes as two potential outcomes, i.e., $y_i^{(1)}$ and $y_i^{(0)}$, with respect to the treatment assignment of being born heavier.) Due to its high correlation with the outcome Moser:07; Platt:14, we select the feature ‘GESTAT’, which is the gestational age in weeks, as the latent confounder. The treatment assignment is based only on this single variable, i.e., $t_i \sim \mathrm{Bernoulli}(\tilde{z}_i)$, where $\tilde{z}_i$ is the min-max normalized value of ‘GESTAT’. The data generation process is not exactly equivalent to that proposed in Louizos:17 because i) the latter includes artificial proxies of the latent variable in the observational dataset, which is less realistic, and ii) its treatment assignment is based not only on the latent variable but also on the observed variables, which is not consistent with the causal model in Figure 1(b). (In the appendix, we report details and results for the TWINS dataset with the same data generation process as in Louizos:17.)
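The confounded treatment assignment described above, a Bernoulli draw whose parameter is the min-max normalized gestational age, can be sketched as follows; the gestational ages below are randomly generated stand-ins for the real records:

```python
import random

# Sketch of the TWINS-style treatment assignment: the treatment is drawn from
# a Bernoulli whose parameter is the min-max normalized gestational age.
# The gestational ages are illustrative, not the actual records.

def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

random.seed(42)
gestat_weeks = [random.randint(28, 42) for _ in range(1000)]
z = min_max_normalize(gestat_weeks)

# t_i ~ Bernoulli(z_i): longer gestation -> higher probability of t = 1.
t = [1 if random.random() < zi else 0 for zi in z]
print(round(sum(t) / len(t), 2))
```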

To assess the performance of causal inference methods in the presence of latent confounding, we test them on two datasets: “no latent confounding”, which contains ‘GESTAT’ and relies on the unconfoundedness assumption depicted in Figure 1(a), and “latent confounding”, which excludes ‘GESTAT’ from the observational dataset and follows the latent causal graph in Figure 1(b).

Throughout the evaluation, for CEVAE and CEGAN we average over 100 Monte Carlo samples from the estimated posteriors derived using each method to compute $\mathbb{E}[y \mid do(t=1), x]$ and $\mathbb{E}[y \mid do(t=0), x]$ in (1). The reported values in Table 1 are averaged over 50 realizations with the same 64/16/20 train/validation/test splits.

The $\sqrt{\epsilon_{\mathrm{PEHE}}}$ and $\epsilon_{\mathrm{ATE}}$ performance is reported in Table 1 for both within-sample and out-of-sample tests. The ITE estimation accuracy decreases for all of the evaluated methods after removing ‘GESTAT’ from the observational dataset, due to information loss and the confounding bias introduced by the latent confounder. CEGAN provides competitive performance compared to the state of the art when there is no latent confounding, while outperforming all benchmarks under latent confounding for both $\sqrt{\epsilon_{\mathrm{PEHE}}}$ and $\epsilon_{\mathrm{ATE}}$. When there is latent confounding and the treatment assignment is based solely on this latent confounder, CEGAN provides more robust performance than CEVAE.


Conclusion

In this paper, we studied the problem of estimating causal effects in the latent confounder model. In order to obtain unbiased estimates of the ITE, we introduced a novel method, CEGAN, which utilizes an adversarially learned bidirectional model along with a denoising autoencoder. CEGAN achieves competitive performance with numerous state-of-the-art benchmarks when the unconfoundedness assumption holds or the proxy noise is small, while outperforming state-of-the-art causal inference methods when latent confounding is present. CEGAN performs especially well when the proxy noise is large and the treatment is determined based solely on the latent confounders.


References

  • [1] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.
  • [2] Manabu Kuroki and Judea Pearl. Measurement bias and effect restoration in causal inference. Biometrika, 101(2):423–437, March 2014.
  • [3] Judea Pearl. On measurement bias in causal inference. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI 2010), pages 425–432, 2010.
  • [4] Christos Louizos, Uri Shalit, Joris Mooij, David Sontag, Richard Zemel, and Max Welling. Causal effect inference with deep latent-variable models. In Proceedings of the 30th Conference on Neural Information Processing Systems (NIPS 2017), 2017.
  • [5] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), 2017.
  • [6] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), 2017.
  • [7] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408, 2010.
  • [8] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2016.
  • [9] Jennifer L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.
  • [10] Douglas Almond, Kenneth Y. Chay, and David S. Lee. The costs of low birth weight. The Quarterly Journal of Economics, 120(3):1031–1083, 2005.
  • [11] Kath Moser, Alison Macfarlane, Yuan Huang Chow, Lisa Hilder, and Nirupa Dattani. Introducing new data on gestation-specific infant mortality among babies born in 2005 in England and Wales. Health Stat Q., 35:13–27, 2007.
  • [12] M. J. Platt. Outcomes in preterm infants. Public Health, 128(5):399–403, 2014.
  • [13] Richard K. Crump, V. Joseph Hotz, Guido W. Imbens, and Oscar A. Mitnik. Nonparametric tests for treatment effect heterogeneity. The Review of Economics and Statistics, 90(3):389–405, 2008.
  • [14] Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266–298, 2010.
  • [15] Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 2017.
  • [16] Uri Shalit, Fredrik D. Johansson, and David Sontag. Estimating individual treatment effect: Generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), 2017.
  • [17] Ahmed M Alaa and Mihaela van der Schaar. Bayesian inference of individualized treatment effects using multi-task gaussian processes. In Proceedings of the 30th Conference on Neural Information Processing Systems (NIPS 2017), 2017.
  • [18] Elizabeth S Allman, Catherine Matias, and John A Rhodes. Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics, pages 3099–3132, 2009.

Optimization of CEGAN

CEGAN is trained in an iterative fashion: we alternate between optimizing the reconstruction network and the prediction network until convergence. In this section, we describe the empirical loss functions that are used to optimize each component. Pseudo-code for training CEGAN is provided in Algorithm 1.


  Input: Observational dataset $\mathcal{D}$
  Output: CEGAN parameters
  repeat
      1) Reconstruction network optimization
      Sample a minibatch of data and noise samples
      Update the reconstruction network ($E$, $R$) using stochastic gradient descent (SGD) with the gradient of the reconstruction objective
      2) Prediction network optimization
      Sample a minibatch of data and noise samples
      Update the discriminator $D$ using SGD with the gradient of the discriminator objective
      Sample a minibatch of data and noise samples
      Update the generator ($E$, $I$, $P$) using SGD with the gradient of the generator objective
  until convergence
Algorithm 1: Pseudo-code of CEGAN

We train the reconstruction network ($E$, $R$) by optimizing the following objective in a supervised fashion:

$$\hat{\mathcal{L}}_{\mathrm{rec}} = \frac{1}{n} \sum_{i=1}^{n} \sum_{v \in \{x, t, y\}} \mathcal{L}_R(v_i, \hat{v}_i),$$

where $(\hat{x}_i, \hat{t}_i, \hat{y}_i) = R(E(x_i, t_i, y_i, \epsilon_i))$ and $n$ is the minibatch size. For the prediction network, we define an empirical value function for the min-max optimization problem in (3):

$$\hat{V}(D, G) = \frac{1}{n} \sum_{i=1}^{n} \log D\big(x_i, t_i, y_i, E(x_i, t_i, y_i, \epsilon_i)\big) + \frac{1}{n} \sum_{i=1}^{n} \log\Big(1 - D\big(x_i, t_i, P(x_i, t_i, \tilde{z}_i), \tilde{z}_i\big)\Big), \quad \tilde{z}_i = I(x_i, t_i, \epsilon'_i). \qquad (5)$$

In addition, we define the following reconstruction loss at the prediction decoder $P$:

$$\hat{\mathcal{L}}_{P} = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}_R(y_i, \hat{y}_i),$$

where $\mathcal{L}_R$ is defined as in (4). Overall, the discriminator and generator iteratively optimize the following objectives, where $\lambda$ is a trade-off parameter:

$$\max_{D} \; \hat{V}(D, G), \qquad \min_{G} \; \hat{V}(D, G) + \lambda \, \hat{\mathcal{L}}_{P}.$$

When optimizing CEGAN, the prediction network’s reconstruction loss $\hat{\mathcal{L}}_{P}$ is used as a regularizer to improve training compared to using only the GAN loss (5), and the reconstruction network’s loss drives the learning process. Specifically, we train the generator to minimize $\hat{\mathcal{L}}_{P}$ iteratively with the GAN loss (5), which forces $E$ to learn a latent mapping that is informative enough to reconstruct the data and, thus, drives the generator to favor a meaningful latent structure over a trivial one.
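The alternating optimization scheme (one reconstruction step, then a discriminator step, then a generator step whose loss adds the prediction reconstruction term with weight $\lambda$) can be sketched with toy quadratic losses; none of the losses below are the actual CEGAN objectives:

```python
# Sketch of the alternating training scheme: a reconstruction step, a
# discriminator step, and a generator step whose loss combines a GAN-like term
# with a reconstruction term weighted by lam. The quadratic surrogate losses
# (0.5 * theta**2, each pulling a scalar parameter to 0) are purely toy.

lam = 1.0   # trade-off parameter between the GAN and reconstruction terms
lr = 0.1    # SGD step size
theta_rec, theta_d, theta_g = 5.0, -3.0, 4.0  # toy scalar "network" parameters

def grad(theta):
    # Gradient of the toy loss 0.5 * theta**2.
    return theta

for _ in range(200):
    theta_rec -= lr * grad(theta_rec)                       # 1) reconstruction network step
    theta_d -= lr * grad(theta_d)                           # 2a) discriminator step
    theta_g -= lr * (grad(theta_g) + lam * grad(theta_g))   # 2b) generator: GAN + lam * recon
print(abs(theta_rec) < 1e-3, abs(theta_d) < 1e-3, abs(theta_g) < 1e-3)
```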

Additional Experiments


We compare CEGAN with several state-of-the-art methods, including logistic regression using the treatment as a feature (LR-1), logistic regression trained separately for each treatment assignment (LR-2), $k$-nearest neighbors (kNN) [13], Bayesian additive regression trees (BART) [14], causal forests (CForest) [15], counterfactual regression with the Wasserstein distance (CFR) [16], the causal multi-task Gaussian process (CMGP) [17], and the Causal Effect VAE (CEVAE) [4]. For continuous outcomes, the logistic regressions are replaced with least-squares linear regressions. We also compare against CEGAN trained only with $\hat{\mathcal{L}}_{P}$ (CEGAN($\hat{\mathcal{L}}_{P}$)), which is equivalent to a feed-forward network consisting of $I$ and $P$. Consequently, CEGAN($\hat{\mathcal{L}}_{P}$) does not exploit adversarial learning and does not account for latent confounders (i.e., the $\tilde{z}$ inferred by $I$ is not trained to have a meaningful relationship with the data samples).

Simulation Settings

Unless otherwise specified, we fix the trade-off parameter $\lambda$ in (5) and assume a 20-dimensional latent space for $z$. A fully-connected network is used for each component of the prediction network (i.e., $E$, $I$, $P$, and $D$), and a multi-output network is used for the reconstruction network $R$. Each of these networks comprises fully-connected hidden layers with ReLU activation functions. The networks are trained using the Adam optimizer with minibatch SGD. Dropout is applied, and Xavier and zero initializations are used for the weight matrices and bias vectors, respectively.


Semi-Synthetic Dataset: TWINS proposed in [4]

| Method | $\sqrt{\epsilon_{\mathrm{PEHE}}}$, small flip. prob. (In / Out) | $\sqrt{\epsilon_{\mathrm{PEHE}}}$, large flip. prob. (In / Out) | $\epsilon_{\mathrm{ATE}}$, small flip. prob. (In / Out) | $\epsilon_{\mathrm{ATE}}$, large flip. prob. (In / Out) |
|---|---|---|---|---|
| LR-1 | 0.373±0.00 / 0.365±0.00 | 0.379±0.00 / 0.370±0.00 | 0.025±0.01 / 0.021±0.01 | 0.069±0.02 / 0.064±0.02 |
| LR-2 | 0.376±0.00 / 0.374±0.01 | 0.384±0.01 / 0.381±0.01 | 0.021±0.01 / 0.016±0.01 | 0.069±0.02 / 0.063±0.02 |
| kNN | 0.385±0.01 / 0.398±0.01 | 0.409±0.01 / 0.422±0.01 | 0.020±0.02 / 0.019±0.02 | 0.116±0.03 / 0.111±0.03 |
| CForest | 0.400±0.30 / 0.410±0.26 | 0.409±0.33 / 0.416±0.27 | 0.016±0.01 / 0.024±0.01 | 0.055±0.02 / 0.066±0.02 |
| BART | 0.400±0.02 / 0.397±0.02 | 0.455±0.03 / 0.456±0.03 | 0.074±0.05 / 0.078±0.05 | 0.234±0.06 / 0.241±0.07 |
| CMGP | 0.377±0.01 / 0.370±0.02 | 0.380±0.01 / 0.371±0.01 | 0.054±0.02 / 0.049±0.02 | 0.078±0.03 / 0.072±0.03 |
| CFR | 0.373±0.00 / 0.366±0.00 | 0.379±0.01 / 0.373±0.01 | 0.024±0.02 / 0.021±0.02 | 0.068±0.03 / 0.063±0.03 |
| CEVAE | 0.369±0.00 / 0.367±0.00 | 0.373±0.00 / 0.368±0.01 | 0.015±0.01 / 0.020±0.01 | 0.032±0.02 / 0.027±0.02 |
| CEGAN | 0.371±0.00 / 0.364±0.00 | 0.372±0.00 / 0.366±0.00 | 0.021±0.01 / 0.016±0.01 | 0.030±0.01 / 0.026±0.01 |

Table 2: Comparison of $\sqrt{\epsilon_{\mathrm{PEHE}}}$ and $\epsilon_{\mathrm{ATE}}$ (mean ± std) on the TWINS dataset with Scenario 1.

In this subsection, we compare CEGAN against the aforementioned benchmarks using a semi-synthetic dataset that was first proposed in [4]. The dataset is based on records of twin births in the USA from 1989–1991 [10]. Using this real-world dataset, we artificially create a binary treatment such that $t_i = 1$ ($t_i = 0$) denotes being born the heavier (lighter) twin. The binary outcome corresponds to the mortality of each of the twins in their first year. Since we have records for both twins, we treat their outcomes as two potential outcomes, i.e., $y_i^{(1)}$ and $y_i^{(0)}$, with respect to the treatment assignment of being born heavier. To make a semi-synthetic dataset, we choose same-sex twins, discard features that are only available after birth, and focus on cases where both twins have birth weights below 2 kg. Overall, we have a dataset of 10,286 twins with 49 features related to the parents, the pregnancy, and the birth. (We made every effort to faithfully reproduce the dataset from its source [10] using the same criteria as in [4], but did not end up with the same number of twins or features.) The mortality rate of the lighter twin ($t_i = 0$) is higher than that of the heavier twin ($t_i = 1$), which yields a negative average treatment effect.

For the TWINS dataset whose data generation process is equivalent to the one proposed in [4], we base our treatment assignment on the feature ‘GESTAT10’, which is a categorical value from 0 to 9 representing the number of gestation weeks. (In this experiment, ‘GESTAT’ is discarded; see the description in the manuscript.) We then follow the treatment and noisy-proxy generation procedures reported in [4]. Specifically, we let $t_i \mid z_i, x_i \sim \mathrm{Bern}\big(\sigma(w_o^\top x_i + w_h (z_i / 10 - 0.1))\big)$, where $\sigma$ denotes the sigmoid function, $w_o \sim \mathcal{N}(0, 0.1 \cdot I)$, and $w_h \sim \mathcal{N}(9, 0.1)$. (Since we have four more features, we did not obtain comparable treatment assignments using the parameters reported in [4], so we calibrated the mean of $w_h$ from 5 to 9 to achieve similar results.) We artificially generate noisy proxies by using three randomly flipped replicas of the one-hot encoded ‘GESTAT10’ with a fixed flipping probability. It is worth highlighting that, compared to the TWINS dataset proposed in the manuscript, the artificial proxy variables are created based on the gestational age feature and included as additional observed features, and the treatment depends not only on the latent variable but also on these artificial proxies.
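The noisy-proxy construction described above (three bit-flipped replicas of the one-hot encoded categorical feature) can be sketched as follows; the category value and flipping probability are illustrative:

```python
import random

# Sketch of the noisy-proxy construction: a categorical 'GESTAT10' value is
# one-hot encoded and replicated three times, and each bit of each replica is
# flipped independently with some probability.

def one_hot(category, num_categories=10):
    return [1 if i == category else 0 for i in range(num_categories)]

def noisy_replicas(category, flip_prob, n_replicas=3, rng=random):
    proxies = []
    for _ in range(n_replicas):
        # b ^ 1 flips a bit; each bit is flipped with probability flip_prob.
        replica = [b ^ 1 if rng.random() < flip_prob else b
                   for b in one_hot(category)]
        proxies.append(replica)
    return proxies

random.seed(0)
proxies = noisy_replicas(category=4, flip_prob=0.2)
print(len(proxies), len(proxies[0]))  # 3 replicas of a 10-dimensional one-hot
```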

The $\sqrt{\epsilon_{\mathrm{PEHE}}}$ and $\epsilon_{\mathrm{ATE}}$ results are reported in Table 2 for both in-sample and out-of-sample tests. When the proxy noise is relatively small, CEGAN and CEVAE achieve comparable performance to the other benchmarks. This aligns with the well-known result that three independent views of a latent feature guarantee that it can be recovered [18], so even techniques that do not account for latent confounders can make accurate predictions. In contrast, when the artificial proxy variables become too noisy to be useful, CEGAN and CEVAE achieve comparable performance to each other and outperform the other benchmarks due to their robustness to latent confounders. However, the artificially generated treatments in this dataset are conditioned on both $x$ and $z$, which is inconsistent with the causal diagram in Figure 1(b), where $t$ only depends on the latent features $z$.

Figure 3: Performance evaluation on the synthetic dataset. The x-axes denote the standard deviation of the noise ($\sigma_{\epsilon}$) in the proxy mechanism mapping $z$ to $x$. (a) $\sqrt{\epsilon_{\mathrm{PEHE}}}$ vs. $\sigma_{\epsilon}$ for CEGAN and CEVAE. LR-1, LR-2, and CEGAN($\hat{\mathcal{L}}_{P}$) are included for reference. (b) Cross-entropy between $t$ and $\hat{t}$. (c) Absolute error between the intermediate outcome predictions.

Synthetic dataset: toy example

To further illustrate the robustness of CEGAN to latent confounders, we generate a synthetic dataset as follows:


where $z$ is the latent confounder, $\epsilon \sim \mathcal{N}(0, \sigma_{\epsilon}^2)$ is the proxy noise ($\sigma_{\epsilon}$ denotes its standard deviation), and $\sigma$ is the sigmoid function. We assume a binary treatment $t \in \{0, 1\}$, a one-dimensional $z$, and 5-dimensional $x$ and $y$, i.e., $x, y \in \mathbb{R}^5$.

Since we only have access to the observations $(x, t, y)$, where $x$ is a noisy proxy of $z$, the above generation process introduces latent confounding between $t$ and $y$ through $z$, as illustrated in Figure 1(b). Without measuring $z$, we expect causal inference methods to suffer from confounding bias. In our experiments, we evaluate over a fixed sample size and average over 50 realizations of the outcomes with the same 64/16/20 train/validation/test splits.
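A generation process in this spirit can be sketched as follows; the coefficients and noise scales are illustrative, not the paper's exact design:

```python
import math
import random

# Sketch of a latent-confounder generation process in the spirit of the toy
# example: a latent z drives the treatment, the outcome, and a noisy proxy x.
# All coefficients and noise scales are illustrative.

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def generate(n, noise_std, rng):
    data = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)                                  # latent confounder
        x = [z + rng.gauss(0.0, noise_std) for _ in range(5)]    # 5-dim noisy proxy
        t = 1 if rng.random() < sigmoid(z) else 0                # confounded treatment
        y = [z + t + rng.gauss(0.0, 0.1) for _ in range(5)]      # 5-dim outcome
        data.append((x, t, y, z))
    return data

rng = random.Random(7)
data = generate(1000, noise_std=0.5, rng=rng)
treated_frac = sum(t for _, t, _, _ in data) / len(data)
print(round(treated_frac, 2))  # roughly half the subjects are treated
```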

In Figure 3(a), using out-of-sample tests, we illustrate how $\sqrt{\epsilon_{\mathrm{PEHE}}}$ varies with the standard deviation of the noise ($\sigma_{\epsilon}$) in the proxy mechanism that maps $z$ to $x$. The PEHE increases with the noise for all evaluated benchmarks because the conditional entropy of $z$ given the proxy $x$, i.e., $H(z \mid x)$, grows with $\sigma_{\epsilon}$. However, CEGAN is more robust to the noise than the other benchmarks, including CEVAE, which considers latent confounders.

Following the previous discussion regarding its relationship to CEVAE, we believe that CEGAN performs better than CEVAE because it does not require as many intermediate steps to infer $z$ when estimating the ITE (1). In particular, both CEVAE and CEGAN need to predict an intermediate treatment assignment, i.e., $x \mapsto \hat{t}$, while CEVAE also needs to predict an intermediate outcome, i.e., $(x, \hat{t}) \mapsto \hat{y}$, where $(x, t, y)$ denote the true observations and $(\hat{t}, \hat{y})$ denote the intermediate predictions. Since the treatment is binary, we adopt the cross-entropy defined in (4) between $t$ and $\hat{t}$ to quantify the error in the predicted intermediate treatment $\hat{t}$. This error affects both CEVAE and CEGAN. To evaluate the error in predicting the intermediate outcome, we compute the absolute difference between $\hat{y}$ and $\hat{y}'$, i.e., $|\hat{y} - \hat{y}'|$, where $\hat{y}'$ is the intermediate outcome conditioned on $\hat{t}$ instead of $t$. This error only affects CEVAE. In Figures 3(b) and 3(c), we show how the cross-entropy and the absolute error vary with respect to the standard deviation of the noise ($\sigma_{\epsilon}$), respectively.

Figures 3(b) and 3(c) demonstrate that the intermediate predictions made by both CEGAN and CEVAE become less accurate as the noise increases. We conjecture that this error is accumulated and propagated to the inference of $z$, eventually decreasing the accuracy of the ITE estimates. Consequently, since CEVAE requires more intermediate steps to infer $z$, it performs worse than CEGAN. Note that we omit error measurements in the latent space because (i) differences between latent variables do not necessarily correspond to the accuracy of predictions based on them and (ii) we cannot directly compare errors in the different latent spaces generated by CEVAE and CEGAN.