A pressing concern faced by cancer patients is their prognosis under different treatment options. Considering a binary treatment, e.g., whether or not to receive radiotherapy, the problem can be characterized as estimating the treatment effect of radiotherapy on the survival outcome of the patients. Estimating treatment effects from observational studies is a fundamental problem, yet it remains especially challenging due to the counterfactual and confounding problems. In this work, we show the importance of differentiating confounding factors from factors that only affect the treatment or the outcome, and propose a data-driven approach to learn and disentangle the latent factors into three disjoint sets for more accurate treatment effect estimation. Empirical validations on semi-synthetic benchmark and real-world datasets demonstrate the effectiveness of the proposed method.
A fundamental question in many areas of scientific research can be stated as: whether, and by how much, does an intervention affect an outcome? In other words, in the case of a binary treatment, whether and to what degree does the outcome without the treatment differ from the outcome with the treatment? In social economics, policy makers need to study whether a job training program will improve the employment prospects of workers [4]; in cancer care, oncologists need to determine whether prescribing a treatment will improve patients’ prognoses [25].
At the center of these questions lies the counterfactual problem: each individual is associated with two potential outcomes, one with treatment and one without. After an individual receives a treatment, it is impossible to know what the potential outcome would have been if the treatment had been different. For example, if a cancer patient has received radiotherapy, we will never know what her prognosis would have been had she not been treated. Since the underlying true treatment effect is defined by both the factual and counterfactual outcomes, estimation of treatment effect is difficult in observational studies [1].
In order to estimate treatment effect from observational data, the treatment assignment needs to be independent of the potential outcomes when conditioned on the observed variables, i.e., the unconfoundedness assumption [20] must be satisfied. Under unconfoundedness, the treatment effect can be estimated from observational data by adjusting for the confounders. If not all the confounders are measured and included in the study, the causal effect of the treatment on the outcome may not be estimable without bias from the data [19].
From a theoretical perspective, practitioners are tempted to include as many variables as possible in the adjustment set to ensure unconfoundedness. This is because confounders can be difficult to measure in the real world, and practitioners need to include noisy proxy variables to ensure unconfoundedness. For example, the socio-economic statuses of the patients confound treatment and prognosis, but cannot be included in the electronic medical records due to privacy concerns. Luckily, it is often the case that such unmeasured confounders can be inferred from noisy proxy variables which are easier to measure. For instance, the patients’ zip codes and job types can be used as proxies to infer their socio-economic statuses [21].
From a practical perspective, the inflated number of variables in the adjustment set reduces the efficiency of treatment effect estimation due to the curse of dimensionality. Moreover, it has been previously shown that including unnecessary covariates is suboptimal when the treatment effect is estimated nonparametrically [7, 1]. In a high dimensional scenario, many variables are not confounders and should be excluded from the adjustment set. Apart from irrelevant variables that are not related to the treatment or the outcome, some variables may only affect the treatment but have no effect on the outcome, while others may affect the outcome but be unrelated to the treatment.
Existing treatment effect estimation algorithms treat the variables “as is”, and leave the daunting task of choosing variables for the adjustment set to the users. Practitioners are thus left with a dilemma: including more variables than needed in the adjustment set produces inefficient and inaccurate estimators, whereas limiting the size of the adjustment set could exclude confounders or their proxy variables and result in biased estimation. With only a handful of variables, this can be solved by consulting domain experts. However, a data-driven approach is required in the big data era to ease the practitioners’ burden.
In this work, we propose a data-driven approach for simultaneously inferring confounding factors from proxy variables and disentangling the learned factors into three disjoint sets (Figure 1): instrumental factors, which only affect the treatment; risk factors, which only affect the outcome; and confounding factors, which affect both the treatment and the outcome. Since we build on recent developments in disentangled variational autoencoders, we name it Treatment Effect by Disentangled Variational Autoencoder (TEDVAE). Our main contributions are summarized as follows:
We address a new problem of treatment effect estimation using observational data in real-world scenarios, in the presence of noisy proxies and non-confounders, which is critical for designing efficient and accurate estimators from observational data in the big data era.
We propose a novel data-driven algorithm to simultaneously infer latent factors from proxy variables and disentangle confounding factors from the others for efficient and accurate estimation of both ATE and ITE.
We validate the effectiveness of the proposed method for ATE and ITE estimation on semi-synthetic treatment effect estimation benchmarks and real-world datasets.
The rest of this paper is organized as follows. In Section 2, we review related work on treatment effect estimation. The details of our method are presented in Section 3. In Section 4, we discuss the evaluation metrics, datasets and experiment results. Finally, we conclude the paper in Section 5.
The seminal work in [4] sparked the machine learning community's interest in treatment effect estimation. In [4], a tree method is proposed which utilizes a treatment effect specific splitting criterion for recursive partitioning. Later on, a variety of methods have been proposed by the machine learning community [25, 15, 18, 22, 23]. For example, [23] proposes an ensemble-based estimator similar to random forest that uses Causal Tree as the base learner. In [15], a meta-algorithm called the X-Learner is proposed and shown to improve upon the two-model approach.
Neural network based methods have attracted increasing research interest during the past few years [22, 18, 8, 15, 24]. [22] proposes Counterfactual Regression (CFR), which reduces the discrepancy between the treated and untreated groups of samples by learning a representation space such that the treated and untreated are as close to each other as possible. [3] uses an auto-encoder network to learn a representation which reduces the bias by minimizing the cross entropy loss between the original treatment distribution and the conditional distribution of treatment conditioned on the representation. [8] also proposes an improvement over CFR by borrowing ideas from covariate shift. However, these designs are not capable of separating the covariates that only contribute to the treatment assignment from those that only contribute to the outcomes. Furthermore, these methods are not able to infer latent covariates from proxies.
Variable decomposition has been investigated in [14]. In their work, the covariates are decomposed into confounding and risk variables using optimization. Apart from the capability of estimating both ITE and ATE, our method has several advantages: (i) we are able to identify the non-linear relationships between the latent factors and their proxies, whereas their approach is linear; (ii) we can learn and disentangle instrumental factors that affect only the treatment, which yields a more accurate estimation.
Perhaps the work most related to ours is CEVAE [18], which also utilizes a variational autoencoder to learn confounders from proxy variables. However, our method has a major advantage over CEVAE. CEVAE does not consider the existence of risk and instrumental factors, and only learns a single set of latent factors. As demonstrated by the experiments, ignoring the existence of instrumental and risk factors results in decreased estimation accuracy.
Let $t_i \in \{0, 1\}$ denote a binary treatment, where $t_i = 0$ indicates that the $i$-th individual receives no treatment (control) and $t_i = 1$ indicates that the individual receives the treatment (treated). We use $y_i(1)$ to denote the potential outcome of individual $i$ if it were treated, and $y_i(0)$ to denote the potential outcome if it were not treated. Noting that only one of the potential outcomes can be realized, the observed outcome is $y_i = t_i\, y_i(1) + (1 - t_i)\, y_i(0)$. Additionally, let $\mathbf{x}_i$ denote the “as is” set of covariates for the $i$-th individual.
We assume that the following standard assumptions needed for treatment effect estimation from observational data [20] are satisfied:
(SUTVA) The Stable Unit Treatment Value Assumption requires that the potential outcomes for one unit are unaffected by the treatment assignment of other units.
(Unconfoundedness) The distribution of treatment is independent of the potential outcomes when conditioning on the observed variables: $t \perp (y(0), y(1)) \mid \mathbf{x}$.
(Overlap) Every unit has a nonzero probability of receiving either treatment given the observed variables, i.e., $0 < P(t = 1 \mid \mathbf{x}) < 1$.
An important goal of causal inference from observational data is to evaluate the average treatment effect (ATE) and the individual treatment effect (ITE). The ATE is defined as:
$$\mathrm{ATE} = \mathbb{E}[y(1) - y(0)] = \mathbb{E}[y \mid do(t = 1)] - \mathbb{E}[y \mid do(t = 0)], \tag{1}$$
where $do(t = 1)$ denotes a manipulation on $t$ that removes all incoming edges to $t$ and sets its value to $1$ [19].
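The ITE for an individual $i$ is defined as
$$\tau_i = y_i(1) - y_i(0).$$
Due to the counterfactual problem, we never observe $y_i(1)$ and $y_i(0)$ at the same time, and thus $\tau_i$ is not observed for any individual. Instead, we estimate the conditional average treatment effect (CATE) $\tau(\mathbf{x})$, which is defined as
$$\tau(\mathbf{x}) = \mathbb{E}[\tau \mid \mathbf{x}] = \mathbb{E}[y \mid \mathbf{x}, do(t = 1)] - \mathbb{E}[y \mid \mathbf{x}, do(t = 0)]. \tag{2}$$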
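To make the counterfactual problem concrete, the following minimal sketch simulates data in which both potential outcomes are known, which is only possible in simulation. The binary confounder, the assignment probabilities, and the constant effect of 1.0 are assumptions chosen for illustration and are not taken from any dataset in this paper. It shows that the observed outcome reveals only one potential outcome per individual, and that the naive difference of group means deviates from the true ATE when the treatment assignment is confounded.

```python
import random


def simulate(n=20000, seed=0):
    """Simulate confounded data in which both potential outcomes are known."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        c = rng.random() < 0.5                      # hypothetical binary confounder
        y0 = (2.0 if c else 0.0) + rng.gauss(0, 1)  # potential outcome without treatment
        y1 = y0 + 1.0                               # assumed constant true effect of 1.0
        p_treat = 0.8 if c else 0.2                 # the confounder drives treatment assignment
        t = 1 if rng.random() < p_treat else 0
        y = t * y1 + (1 - t) * y0                   # only one potential outcome is observed
        rows.append((c, t, y, y0, y1))
    return rows


rows = simulate()

# The true ATE needs both potential outcomes, which real data never provide.
true_ate = sum(y1 - y0 for _, _, _, y0, y1 in rows) / len(rows)

# The naive estimate compares observed group means and ignores the confounder.
treated = [y for _, t, y, _, _ in rows if t == 1]
control = [y for _, t, y, _, _ in rows if t == 0]
naive_ate = sum(treated) / len(treated) - sum(control) / len(control)

print("true ATE:", round(true_ate, 3))
print("naive difference in means:", round(naive_ate, 3))
```

In this setup the naive estimate is inflated because treated individuals are more likely to have the confounder, which also raises their outcomes.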
Throughout the rest of this paper we adopt the causal model in Figure 1, where the variables $\mathbf{x}$ can be viewed as generated from three disjoint sets of latent factors $\mathbf{z} = (\mathbf{z}_t, \mathbf{z}_c, \mathbf{z}_y)$. Here $\mathbf{z}_t$ are the instrumental factors that only affect the treatment but not the outcome, $\mathbf{z}_y$ are the risk factors which only affect the outcome but not the treatment, and $\mathbf{z}_c$ are the confounding factors that affect both the treatment and the outcome.
It is important to note that our causal diagram does not impose any restriction other than the three standard assumptions discussed in Section 3.1. Indeed, the widely used causal diagram for treatment effect estimation (Figure 2 in [11]) can be viewed as a special case of ours. To see this, consider the case where every variable in $\mathbf{x}$ is itself a confounder; then the generating mechanism is $\mathbf{x} = \mathbf{z}_c$ with $\mathbf{z}_t = \mathbf{z}_y = \emptyset$.
With our causal structure, estimation of the treatment effect is immediate using the back-door criterion:
Theorem 1. The effect of $t$ on $y$ can be estimated if we recover the confounding factors $\mathbf{z}_c$ from the data.
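As an illustration of why recovering the confounding factors suffices, the sketch below applies the back-door adjustment formula, $\sum_c P(c)\,(\mathbb{E}[y \mid t = 1, c] - \mathbb{E}[y \mid t = 0, c])$, in the simplest possible setting. It assumes a single observed, discrete confounder `c`, which is a simplification for illustration and not the latent-factor setting of this paper; the function name and the simulated data are hypothetical.

```python
import random
from collections import defaultdict


def backdoor_ate(samples):
    """Back-door adjusted ATE for samples (c, t, y) with discrete c and binary t."""
    sums = defaultdict(lambda: [0.0, 0.0])   # c -> [sum of y for t=0, sum of y for t=1]
    counts = defaultdict(lambda: [0, 0])     # c -> [count for t=0, count for t=1]
    for c, t, y in samples:
        sums[c][t] += y
        counts[c][t] += 1
    n = len(samples)
    ate = 0.0
    for c, (n0, n1) in counts.items():
        if n0 == 0 or n1 == 0:
            # The overlap assumption fails in this stratum.
            raise ValueError("no overlap in stratum %r" % (c,))
        effect_c = sums[c][1] / n1 - sums[c][0] / n0
        ate += (n0 + n1) / n * effect_c      # weight each stratum by its empirical P(c)
    return ate


# Hypothetical confounded data with a true effect of 1.0.
rng = random.Random(0)
samples = []
for _ in range(20000):
    c = int(rng.random() < 0.5)
    t = int(rng.random() < (0.8 if c else 0.2))
    y = 1.0 * t + 2.0 * c + rng.gauss(0, 1)
    samples.append((c, t, y))

adjusted = backdoor_ate(samples)
print("back-door adjusted ATE:", round(adjusted, 3))
```

Stratifying on the confounder removes the bias that a plain difference of group means would carry in the same data.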
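For estimation of the conditional average treatment effect, our result follows from Theorem 3.4.1 in [19]:
Theorem 2. The conditional average treatment effect of $t$ on $y$ conditioned on $\mathbf{x}$ can be estimated if we recover the confounding factors $\mathbf{z}_c$ and risk factors $\mathbf{z}_y$.
Let $G_{\overline{t}}$ denote the causal structure obtained by removing all incoming edges of $t$ in Figure 1, and let $G_{\underline{t}}$ denote the structure obtained by deleting all outgoing edges of $t$.
Noting that $y \perp \mathbf{z}_t \mid t, \mathbf{z}_c, \mathbf{z}_y$ in $G_{\overline{t}}$, using the three rules of do-calculus we can remove $\mathbf{z}_t$ from the conditioning set and obtain
$$P(y \mid do(t), \mathbf{z}_t, \mathbf{z}_c, \mathbf{z}_y) = P(y \mid do(t), \mathbf{z}_c, \mathbf{z}_y). \tag{6}$$
This step utilizes Rule 1. Furthermore, using Rule 2 with $y \perp t \mid \mathbf{z}_c, \mathbf{z}_y$ in $G_{\underline{t}}$ yields
$$P(y \mid do(t), \mathbf{z}_c, \mathbf{z}_y) = P(y \mid t, \mathbf{z}_c, \mathbf{z}_y).$$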
A noteworthy implication of Theorems 1 and 2 is that they are not restricted to binary treatments. In other words, our proposed method can be used for estimating the treatment effect of a continuous treatment variable, while most of the existing estimators are not able to do so. However, due to the lack of datasets with continuous treatment variables for evaluating this, we focus on the case of a binary treatment variable and leave the continuous treatment case for future work.
Theorems 1 and 2 suggest that disentangling the confounding factors allows us to exclude unnecessary factors when estimating ATE and ITE. However, keen readers may wonder: since we have already assumed unconfoundedness, doesn’t straightforwardly adjusting for $\mathbf{x}$ achieve the same goal?
To give an intuitive illustration of the necessity of excluding unnecessary variables from the conditioning set, consider that the variation in $t$ consists of three components: the variation explained by $\mathbf{z}_t$, the variation explained by $\mathbf{z}_c$, and the unexplained variation. When conditioning on $\mathbf{z}_t$, one source of variation is removed, which makes the variation explained by $\mathbf{z}_c$ a larger proportion of the remaining variation.
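The intuition above can be checked with a small simulation. The sketch below is an illustration under strong assumptions that are not part of this paper: a linear outcome model, a continuous treatment, scalar factors named `z_t` and `z_c`, and hand-picked coefficients. It estimates the treatment coefficient by least squares twice, once adjusting only for the confounder and once also adjusting for the instrument, and compares the spread of the two estimators across repetitions. It illustrates only the loss of efficiency, not the bias result.

```python
import random
import statistics


def ols(columns, y):
    """Least-squares coefficients via the normal equations (small problems only)."""
    k, n = len(columns), len(y)
    # Build the augmented matrix [X'X | X'y].
    a = [[sum(columns[i][r] * columns[j][r] for r in range(n)) for j in range(k)]
         + [sum(columns[i][r] * y[r] for r in range(n))] for i in range(k)]
    for i in range(k):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(i, k), key=lambda r: abs(a[r][i]))
        a[i], a[p] = a[p], a[i]
        for r in range(k):
            if r != i:
                f = a[r][i] / a[i][i]
                a[r] = [vr - f * vi for vr, vi in zip(a[r], a[i])]
    return [a[i][k] / a[i][i] for i in range(k)]


def simulate(n_reps=200, n=200, tau=1.0, seed=0):
    """Return treatment-coefficient estimates without and with the instrument."""
    rng = random.Random(seed)
    est_without, est_with = [], []
    for _ in range(n_reps):
        z_t = [rng.gauss(0, 1) for _ in range(n)]   # instrument: affects t only
        z_c = [rng.gauss(0, 1) for _ in range(n)]   # confounder: affects t and y
        t = [3.0 * a + b + rng.gauss(0, 1) for a, b in zip(z_t, z_c)]
        y = [tau * ti + ci + rng.gauss(0, 1) for ti, ci in zip(t, z_c)]
        ones = [1.0] * n
        est_without.append(ols([ones, t, z_c], y)[1])
        est_with.append(ols([ones, t, z_c, z_t], y)[1])
    return est_without, est_with


est_without, est_with = simulate()
print("variance adjusting for z_c only:   ", statistics.variance(est_without))
print("variance adjusting for z_c and z_t:", statistics.variance(est_with))
```

Both estimators are centered on the true coefficient in this setup, but conditioning on the instrument removes most of the variation in the treatment and therefore yields a noticeably noisier estimate.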
Theoretically, it has been shown that both the bias [1] and the variance [7] of treatment effect estimation will increase if variables unrelated to the outcome are included during the estimation. Therefore, it is crucial to differentiate the instrumental, confounding and risk factors and only use the appropriate factors during treatment effect estimation. In the next section, we propose our data-driven approach to learn and disentangle the latent factors.
In the above discussion, we have seen that removing unnecessary factors from the adjustment set is crucial for efficient and accurate treatment effect estimation. We have assumed that the mechanism which generates the observed variables from the latent factors and the decomposition of latent factors are available. However, in practice neither the mechanism nor the decomposition is known. Therefore, the practical approach is to utilize the complete set of available variables during modeling to ensure that the unconfoundedness assumption is satisfied, and to adopt a data-driven approach to simultaneously learn and disentangle the latent factors into disjoint subsets.
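To this end, our goal is to learn the posterior distribution $p(\mathbf{z} \mid \mathbf{x})$ for the set of latent factors with $\mathbf{z} = (\mathbf{z}_t, \mathbf{z}_c, \mathbf{z}_y)$, where $\mathbf{z}_t$, $\mathbf{z}_c$, and $\mathbf{z}_y$ are independent of each other and correspond to the instrumental factors, confounding factors, and risk factors, respectively. Because exact inference would be intractable, we employ neural network based variational inference to approximate the posterior $p(\mathbf{z} \mid \mathbf{x})$. Specifically, we utilize three separate encoders $q_{\phi_t}(\mathbf{z}_t \mid \mathbf{x})$, $q_{\phi_c}(\mathbf{z}_c \mid \mathbf{x})$, and $q_{\phi_y}(\mathbf{z}_y \mid \mathbf{x})$ that serve as variational posteriors over the latent factors. These latent factors are then used by a single decoder $p_\theta(\mathbf{x} \mid \mathbf{z}_t, \mathbf{z}_c, \mathbf{z}_y)$ for the reconstruction of $\mathbf{x}$. Following standard VAE design, the prior distributions $p(\mathbf{z}_t)$, $p(\mathbf{z}_c)$, and $p(\mathbf{z}_y)$ are chosen as Gaussian distributions [13].
Given the training samples, all of the parameters can be optimized by maximizing the evidence lower bound (ELBO) given the input:
$$\begin{aligned}
\mathrm{ELBO}(\mathbf{x}) = {} & \mathbb{E}_{q_{\phi_t} q_{\phi_c} q_{\phi_y}}\left[\log p_\theta(\mathbf{x} \mid \mathbf{z}_t, \mathbf{z}_c, \mathbf{z}_y)\right] && (7) \\
& - D_{KL}\left(q_{\phi_t}(\mathbf{z}_t \mid \mathbf{x}) \,\|\, p(\mathbf{z}_t)\right) && (8) \\
& - D_{KL}\left(q_{\phi_c}(\mathbf{z}_c \mid \mathbf{x}) \,\|\, p(\mathbf{z}_c)\right) && (9) \\
& - D_{KL}\left(q_{\phi_y}(\mathbf{z}_y \mid \mathbf{x}) \,\|\, p(\mathbf{z}_y)\right) && (10)
\end{aligned}$$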
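To further encourage the disentanglement of the latent factors, we add two auxiliary objectives to the loss function, ensuring that the treatment $t$ can be predicted from $\mathbf{z}_t$ and $\mathbf{z}_c$, and that the outcome $y$ can be predicted from $\mathbf{z}_y$ and $\mathbf{z}_c$ (together with the treatment $t$). Finally, the objective of TEDVAE can be expressed as
$$\begin{aligned}
\mathcal{L}_{TEDVAE} = {} & \mathrm{ELBO}(\mathbf{x}) && (11) \\
& + \alpha_t\, \mathbb{E}_{q_{\phi_t} q_{\phi_c}}\left[\log q_{\omega_t}(t \mid \mathbf{z}_t, \mathbf{z}_c)\right] && (12) \\
& + \alpha_y\, \mathbb{E}_{q_{\phi_c} q_{\phi_y}}\left[\log q_{\omega_y}(y \mid t, \mathbf{z}_c, \mathbf{z}_y)\right] && (13)
\end{aligned}$$
where $\alpha_t$ and $\alpha_y$ are weights that balance the two auxiliary objectives.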
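The structure of this objective can be sketched for a single sample as follows. This is a minimal illustration and not the released implementation: it assumes diagonal Gaussian encoders and standard normal priors so that each KL term has a closed form, it takes the reconstruction and auxiliary log-likelihoods as precomputed numbers, and the function names and the weighting arguments `alpha_t` and `alpha_y` are illustrative choices.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def tedvae_style_objective(recon_log_lik, posteriors, log_q_t, log_q_y,
                           alpha_t=1.0, alpha_y=1.0):
    """Combine the ELBO with the two auxiliary prediction terms (to be maximized).

    recon_log_lik: log-likelihood of x under the decoder for the sampled factors
    posteriors:    dict mapping "z_t", "z_c", "z_y" to (mu, log_var) pairs
    log_q_t:       log-likelihood of the observed treatment given (z_t, z_c)
    log_q_y:       log-likelihood of the observed outcome given (t, z_c, z_y)
    """
    kl_total = sum(kl_to_standard_normal(mu, lv) for mu, lv in posteriors.values())
    elbo = recon_log_lik - kl_total
    return elbo + alpha_t * log_q_t + alpha_y * log_q_y


# Usage with made-up numbers for one sample.
posteriors = {
    "z_t": ([1.0], [0.0]),            # KL = 0.5
    "z_c": ([0.0, 0.0], [0.0, 0.0]),  # KL = 0
    "z_y": ([0.0], [0.0]),            # KL = 0
}
value = tedvae_style_objective(-10.0, posteriors, log_q_t=-0.7, log_q_y=-1.3)
print(value)
```

In a full model the three `(mu, log_var)` pairs would come from the three encoders, and the log-likelihood terms from the decoder and the two auxiliary predictors.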
Since the main goal is to exclude unnecessary covariates from the adjustment set, during inference we only use the encoders $q_{\phi_c}(\mathbf{z}_c \mid \mathbf{x})$ and $q_{\phi_y}(\mathbf{z}_y \mid \mathbf{x})$, and the auxiliary classifier $q_{\omega_y}(y \mid t, \mathbf{z}_c, \mathbf{z}_y)$.
It has been shown that unsupervised learning of disentangled factors is impossible for generative models, and that constraints on the latent space are necessary to identify a model that matches the underlying generating mechanism [17]. To avoid this problem, TEDVAE uses the outcome $y$ and the treatment $t$ along with $\mathbf{x}$ during training. Furthermore, the marginal distribution of $\mathbf{z}$ is forced to take the decomposition $p(\mathbf{z}) = p(\mathbf{z}_t)\, p(\mathbf{z}_c)\, p(\mathbf{z}_y)$, which prevents the unsupervised dilemma described in [17]. The additional information in $t$ and $y$ makes it possible for TEDVAE to learn the disentanglement that matches the data generating mechanism.
A noteworthy issue is that by employing variational inference parameterized by neural networks, we cannot guarantee that the learned model is identical to the true generating mechanism. However, we argue that the VAE’s lack of theoretical guarantees is mitigated by its strong empirical performance, as demonstrated by the empirical evaluations, and by its demonstrated capability of learning disentangled factors [9].
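The inference step described above, in which only the confounding and risk encoders and the outcome predictor are used, can be sketched as follows. The three callables are hypothetical stand-ins for trained networks, and using a single deterministic encoding per input (for example the posterior mean) is an assumption of this sketch rather than a detail confirmed by the text.

```python
def estimate_ite(x, encode_c, encode_y, predict_outcome):
    """Estimate the effect for one covariate vector x.

    encode_c, encode_y: map x to the confounding and risk factors
    predict_outcome:    maps (t, z_c, z_y) to the expected outcome
    The instrumental factors are deliberately never computed or used.
    """
    z_c = encode_c(x)
    z_y = encode_y(x)
    return predict_outcome(1, z_c, z_y) - predict_outcome(0, z_c, z_y)


def estimate_ate(xs, encode_c, encode_y, predict_outcome):
    """Average the individual estimates over a sample."""
    effects = [estimate_ite(x, encode_c, encode_y, predict_outcome) for x in xs]
    return sum(effects) / len(effects)


# Toy stand-ins for the trained components (purely illustrative).
def toy_encode_c(x):
    return [x[0]]


def toy_encode_y(x):
    return [x[1]]


def toy_predict(t, z_c, z_y):
    return 2.0 * t + z_c[0] + 0.5 * t * z_y[0]


print(estimate_ite([1.0, 4.0], toy_encode_c, toy_encode_y, toy_predict))
```

Because the predictor is evaluated at both treatment values with the same factors, the confounder contribution cancels and only the treatment-dependent part remains.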
We empirically compare the proposed algorithm with traditional and neural network based ITE estimators. For statistical methods, we compare with Causal Tree (CT) [4], Causal Random Forest (CRF) [23], Bayesian Additive Regression Tree (BART) [10], and X-Learner Random Forest (XRF) [15]. For neural network representation learning methods, we compare with Counterfactual Regression Net (CFR) [22], CFR with importance probability weight (CFR-IPW) [8], Similarity Preserved Individual Treatment Effect (SITE) [24], and Causal Effect Variational Autoencoder (CEVAE) [18]. For all compared algorithms, we use the implementation provided by the original authors on GitHub (for neural network based algorithms) and CRAN (for traditional methods). Parameters for the compared methods are tuned by cross-validated grid search over the value ranges recommended in the code repositories. Since TEDVAE and CEVAE both utilize variational autoencoders, we set them to use the same network structures in terms of number of hidden layers and nodes to ensure a fair comparison.
It is well known that evaluating treatment effect estimation methods is difficult due to the counterfactual problem [10, 18, 8]. For semi-synthetic datasets with known groundtruth individual treatment effects, we use metrics such as the Precision in Estimation of Heterogeneous Effect (PEHE) [10]:
$$\epsilon_{PEHE} = \frac{1}{N} \sum_{i=1}^{N} \left(\hat{\tau}_i - \tau_i\right)^2, \tag{14}$$
where $\tau_i$ and $\hat{\tau}_i$ are the true and the estimated ITE, respectively. For real-world data from randomized controlled trials, the true ITE is not available but the groundtruth average treatment effect can be calculated. Therefore, we evaluate the absolute error in the average treatment effect
$$\epsilon_{ATE} = \left| \widehat{\mathrm{ATE}} - \mathrm{ATE} \right|. \tag{15}$$
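Both metrics are simple to compute once estimates are available. The sketch below assumes that PEHE is the mean squared difference between estimated and true individual effects, as is common in this literature, and that the estimated ATE is the average of the estimated individual effects; the function names are illustrative.

```python
import math


def pehe(tau_true, tau_hat):
    """Mean squared error between true and estimated individual effects."""
    return sum((h - t) ** 2 for t, h in zip(tau_true, tau_hat)) / len(tau_true)


def abs_ate_error(true_ate, tau_hat):
    """Absolute error of the estimated ATE (mean of the individual estimates)."""
    return abs(sum(tau_hat) / len(tau_hat) - true_ate)


tau_true = [1.0, 2.0, 3.0]
tau_hat = [1.0, 2.0, 5.0]
print("PEHE:", pehe(tau_true, tau_hat))
print("root PEHE:", math.sqrt(pehe(tau_true, tau_hat)))
print("ATE error:", abs_ate_error(2.0, tau_hat))
```

The square root of PEHE is often the quantity reported in benchmark tables because it is on the same scale as the outcome.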
Without knowing the groundtruth treatment effects, the uplift curve has long been used for evaluating individual treatment effect estimation on real-world datasets [6]. The intuition behind the uplift curve is that when individuals are ranked by their estimated treatment effect, a good estimator should rank individuals with positive outcomes in the treated group and those with negative outcomes in the control group higher than the others. It is worth noting that the POL curve used in [22, 18] is equivalent to the uplift curve up to a constant related to the proportion of treated and control samples.
To provide a clear definition of the uplift curve, we introduce some extra notation. For a given estimator $\hat{\tau}$ and subjects $\{1, \dots, N\}$, let $\pi$ be the descending ordering of the subjects according to their estimated treatment effects, and let $\pi(k)$ be the first $k$ subjects from the ordering. Let $R_\pi(k)$ be the count of individuals with positive outcomes in $\pi(k)$, and let $R^{t=1}_\pi(k)$ and $R^{t=0}_\pi(k)$ be the numbers of positive outcomes in the treatment and control groups respectively from $\pi(k)$: $R^{t=1}_\pi(k) = |\{i \in \pi(k) : y_i = 1, t_i = 1\}|$, and $R^{t=0}_\pi(k) = |\{i \in \pi(k) : y_i = 1, t_i = 0\}|$. Finally, let $N^{t=1}_\pi(k)$ and $N^{t=0}_\pi(k)$ be the numbers of subjects in the treated and control groups from the top-$k$ subjects. The uplift curve is then defined as:
$$\mathrm{Uplift}(k) = \left( \frac{R^{t=1}_\pi(k)}{N^{t=1}_\pi(k)} - \frac{R^{t=0}_\pi(k)}{N^{t=0}_\pi(k)} \right) \left( N^{t=1}_\pi(k) + N^{t=0}_\pi(k) \right). \tag{16}$$
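A minimal sketch of computing such a curve is given below. It assumes the commonly used form of the uplift curve, namely the difference in positive-outcome rates between the treated and control subjects among the top-$k$ ranked subjects, scaled by $k$. Treating the rate of an empty group as zero is a convention chosen for this sketch, and the function name is illustrative.

```python
def uplift_curve(tau_hat, t, y):
    """Uplift value for every prefix of the ranking by estimated effect.

    tau_hat: estimated treatment effects
    t:       binary treatment indicators (0 or 1)
    y:       binary outcomes (1 = positive outcome)
    """
    # Rank subjects by estimated effect, largest first (ties keep input order).
    order = sorted(range(len(tau_hat)), key=lambda i: tau_hat[i], reverse=True)
    r1 = r0 = n1 = n0 = 0
    curve = []
    for i in order:
        if t[i] == 1:
            n1 += 1
            r1 += y[i]
        else:
            n0 += 1
            r0 += y[i]
        rate1 = r1 / n1 if n1 else 0.0   # convention for an empty treated group
        rate0 = r0 / n0 if n0 else 0.0   # convention for an empty control group
        curve.append((rate1 - rate0) * (n1 + n0))
    return curve


curve = uplift_curve(
    tau_hat=[0.9, 0.8, 0.1, 0.2],
    t=[1, 0, 1, 0],
    y=[1, 0, 0, 1],
)
print(curve)  # [1.0, 2.0, 1.5, 0.0]
```

A ranking that places treated subjects with positive outcomes and control subjects with negative outcomes first produces a curve that rises early, as in this small example.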
Methods | PEHE (training) | PEHE (test) |
---|---|---|
CT | 1.48 ± 0.12 | 1.56 ± 0.13 |
CRF | 1.01 ± 0.08 | 1.09 ± 0.16 |
BART | 1.13 ± 0.28 | 1.35 ± 0.30 |
XRF | 0.98 ± 0.08 | 1.09 ± 0.15 |
CFR | 0.89 ± 0.04 | 0.96 ± 0.12 |
CFR-IPW | 0.90 ± 0.04 | 1.00 ± 0.17 |
SITE | 0.80 ± 0.05 | 0.82 ± 0.10 |
CEVAE | 1.13 ± 0.07 | 1.37 ± 0.19 |
TEDVAE | 0.79 ± 0.07 | 0.82 ± 0.07 |
Table 1: Mean and standard errors on training and test samples for ITE estimation on the IHDP dataset. The uplift curves on Jobs and Twins are shown in Figure 2.
Unlike parameter selection in standard supervised tasks, where models can be selected using cross-validation, a major challenge of parameter tuning for treatment effect estimation is that there is no groundtruth ITE for any individual. Therefore, the algorithm needs to use some surrogate to approximate the true treatment effect $\tau_i$. A common approach used by many previous methods [4, 14, 12] is the matching surrogate $\tilde{\tau}_i = (2 t_i - 1)(y_i - y_{j(i)})$, where $j(i)$ is the index of the nearest neighbor of the $i$-th individual whose treatment is opposite to $t_i$.
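For reference, a nearest-neighbor matching surrogate of this kind can be sketched as below. The sketch assumes Euclidean distance on the raw covariates and signs the difference so that it always reads as treated minus control; both choices, and the function name, are assumptions made for illustration.

```python
def matching_surrogate(x, t, y):
    """Surrogate effect per individual from the nearest opposite-treatment neighbor.

    x: list of covariate vectors, t: binary treatments, y: observed outcomes.
    Raises ValueError if some individual has no neighbor with opposite treatment.
    """
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    surrogate = []
    for i in range(len(x)):
        opposite = [j for j in range(len(x)) if t[j] != t[i]]
        j = min(opposite, key=lambda k: sq_dist(x[i], x[k]))
        sign = 1 if t[i] == 1 else -1        # always treated minus control
        surrogate.append(sign * (y[i] - y[j]))
    return surrogate


x = [[0.0], [0.1], [1.0], [1.1]]
t = [1, 0, 1, 0]
y = [3.0, 1.0, 5.0, 2.0]
print(matching_surrogate(x, t, y))  # [2.0, 2.0, 3.0, 3.0]
```

The loop over all pairs makes the cost quadratic in the number of individuals, which is acceptable for a sketch but would call for a spatial index on larger datasets.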
However, even with a moderate number of variables, it is unlikely that the matching surrogate is a good choice for cross-validation. The intuition behind this is two-fold. Firstly, due to the curse of dimensionality, finding a meaningful match becomes exponentially more difficult as the number of variables increases. Secondly, the outcome of the matched individual in the opposite treatment group may not be a good representative of the counterfactual outcome, because in observational studies treatment is not randomly assigned.
An alternative approach is to utilize a traditional estimator, e.g., BART, as a surrogate to provide an estimate of the true treatment effect [8]. However, we argue that this approach is also not optimal. It inevitably leads to a model that is similar to the chosen surrogate estimator, and thus it is unlikely to produce models better than the surrogate on a wide range of datasets. Furthermore, we argue that a treatment effect estimation algorithm should be self-sufficient and should not rely on others. To avoid the above problems, we use each algorithm’s objective function on the validation set for model selection.
Methods | IHDP | Jobs | Twins |
---|---|---|---|
CT | 0.36 ± 0.03 | 0.017 ± 0.006 | 0.003 ± 0.005 |
CRF | 0.18 ± 0.03 | 0.018 ± 0.003 | 0.002 ± 0.003 |
BART | 0.54 ± 0.08 | 0.014 ± 0.005 | 0.007 ± 0.005 |
XRF | 0.17 ± 0.02 | 0.016 ± 0.006 | 0.002 ± 0.004 |
CFR | 0.30 ± 0.10 | 0.019 ± 0.009 | 0.005 ± 0.003 |
CFR-IPW | 0.31 ± 0.10 | 0.020 ± 0.010 | 0.006 ± 0.004 |
SITE | 0.30 ± 0.04 | 0.017 ± 0.005 | 0.005 ± 0.002 |
CEVAE | 0.39 ± 0.04 | 0.025 ± 0.020 | 0.032 ± 0.010 |
TEDVAE | 0.20 ± 0.05 | 0.016 ± 0.013 | 0.005 ± 0.004 |
The Infant Health and Development Program (IHDP) dataset is designed to evaluate the effect of home visits from specialist doctors on the cognitive test scores of premature infants. The dataset was first used for benchmarking treatment effect estimation by [10], where selection bias is induced by removing a non-random subset of the treated samples from the original randomized controlled trial data to create a realistic observational dataset. The resulting dataset contains 747 samples (608 control and 139 treated) with 25 pre-treatment variables that describe both the infants and their mothers.
We followed the exact procedure described in [10, 12, 18] in the experiment, in which the counterfactual outcomes are simulated using the Non-Parametric Causal Inference (NPCI) package [5] with “Setting A” and “sample.kind” set to “rounding”. The reported performances are averaged over 100 replications with a training/validation/test split of 60%/30%/10%.
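For concreteness, a 60%/30%/10% split of the 747 IHDP samples could be produced as in the sketch below. The rounding convention (truncating the training and validation sizes and assigning the remainder to the test set), the shuffling, and the seed are assumptions of this sketch; the text does not specify them.

```python
import random


def split_indices(n, fractions=(0.6, 0.3, 0.1), seed=0):
    """Shuffle indices 0..n-1 and cut them into train/validation/test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(fractions[0] * n)   # truncated size of the training part
    n_val = int(fractions[1] * n)     # truncated size of the validation part
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]      # the remainder goes to the test part
    return train, val, test


train, val, test = split_indices(747)
print(len(train), len(val), len(test))  # 448 224 75
```

Using a different seed per replication would give a different split for each of the 100 replications.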
The Jobs dataset is based on the randomized controlled trial samples originally used in [16], and has since become a widely used benchmark for treatment effect estimation. The treatment is whether a subject has participated in a job training program, and the outcome is the subject’s employment status. This dataset combines a randomized study based on the National Supported Work program with observational data to form a larger dataset. The randomized controlled trial permits us to estimate the true average treatment effect. The study includes 8 covariates such as age and education, as well as previous earnings of the participants. We follow the procedure in [22], where the goal is to predict whether the program is effective for improving the participants’ future employment and income.
The Twins dataset has been used for evaluating causal inference in [18, 24]. It consists of samples from twin births in the USA between 1989 and 1991 [2]. Each sample contains 40 pre-treatment variables related to the parents, the pregnancy and the birth statistics of the twins. The treatment is $t = 1$ if a sample is the heavier one of the twins, and $t = 0$ if the sample is the lighter one. The outcome is the children’s mortality after a one-year follow-up period. After eliminating the records containing missing features, the final dataset contains 4821 samples.
For estimation of the individual treatment effect, Table 1 shows that in terms of PEHE, the performances of SITE and TEDVAE are similar to each other and significantly better than those of the other compared methods. When contrasted with CEVAE, another generative estimator that does not consider the disentanglement of latent factors, TEDVAE performs significantly better. Since we set the two algorithms to use similar network structures in terms of numbers of hidden layers and nodes, this result illustrates the effectiveness of learning disentangled sets of instrumental, confounding and risk factors. Within the group of traditional methods, CRF and XRF perform competitively on the IHDP dataset. For the variants of the popular CFR, CFR-IPW does not demonstrate significant improvement over CFR. SITE performs better than CFR and CFR-IPW, and performs competitively when compared to the proposed TEDVAE.
Since Twins and Jobs have no groundtruth ITE available, we employ the uplift curve (Equation 16) for evaluation. From the uplift curves (Figure 2), we can see that TEDVAE is the clear winner on both the Jobs and Twins datasets. For neural network based methods, the variants of CFR (CFR-IPW and SITE) perform similarly to CFR on these two datasets in terms of uplift curves. To avoid clutter, we do not show the curves for CFR-IPW and SITE in the figures. For traditional methods, X-Learner shows good performance while the performances of CT, CRF and BART are not ideal (their uplift curves are omitted to avoid clutter).
For estimation of the average treatment effect, TEDVAE is the best performing neural network method on the IHDP dataset. Furthermore, we point out that although TEDVAE outperforms the other neural network methods, its ATE error is (slightly) higher than those of CRF and XRF. On the Jobs and Twins datasets, the differences in ATE error between the different algorithms are not significant. The reason behind these slightly worse ATE results is two-fold. Firstly, since estimating ATE is easier than estimating ITE, it is reasonable that state-of-the-art estimators perform similarly on a relatively easy dataset. Secondly, because the variations of ITE between individuals are relatively large in the IHDP dataset, a model that performs better in ITE estimation (e.g., TEDVAE) needs more capacity to model the variation, and thus may perform slightly worse than other methods on the ATE. Nonetheless, TEDVAE performs better than other neural network methods.
In this paper, we study the problem of estimating treatment effect in the wild with the possible existence of noisy proxy variables and non-confounding variables. We argued that most previous methods can be improved from two perspectives. Firstly, they do not consider the fact that difficult-to-measure confounders may be represented by noisy proxy variables. Secondly, they take the given variable set “as is” and do not consider the fact that many of the variables may not be true confounders and could increase bias and variance in the estimation. Based on our causal diagram, which assumes that the observed variables are generated from three disjoint sets of latent factors (the instrumental, confounding, and risk factors), we proposed the Treatment Effect by Disentangled Variational Autoencoder (TEDVAE) algorithm to jointly learn and disentangle the latent factors from observed covariates. Experimental results on semi-synthetic benchmark and real-world datasets verified the practical usefulness of our model and the effectiveness of our TEDVAE algorithm for estimating treatment effect using data from observational studies.
For future directions, a path worth following for TEDVAE is to explore its effectiveness in estimating the treatment effect of continuous treatment variables. This is important because in medical applications, treatments often involve different dosage levels. However, almost all of the existing methods are only applicable to binary treatments and thus cannot be applied. In general, the road ahead for data-driven treatment effect estimation is still long and arduous. One major roadblock is the long-standing lack of large-scale benchmark datasets. Another issue related to neural network based treatment effect estimators is the need for parameter tuning, which is magnified by the counterfactual problem in treatment effect estimation. Although traditional algorithms perform slightly worse than recent neural network estimators, they still have the advantage of being less sensitive to hyperparameters. Nonetheless, recent developments in learning disentangled latent factors show a promising way forward for data-driven treatment effect estimation.
Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI’18)