1 Introduction
Many domains of science require inference of causal effects, including healthcare (Glass et al., 2013; Casucci et al., 2017), economics and marketing (Chernozhukov et al., 2013; LaLonde, 1986; Smith and Todd, 2005), sociology (Morgan and Harding, 2006), and education (Zhao and Heffernan, 2017). For instance, medical scientists must know whether a new medicine benefits patients; teachers want to know whether a teaching plan helps their students; economists need to evaluate how a policy affects unemployment rates. Hence, proper causal effect inference is an important task for data mining and machine learning research.
Conducting Randomized Controlled Trials (RCT) can be time-consuming, expensive, or unethical (e.g., for studying the effect of smoking). Therefore, approaches for estimating causal effects from observational data are needed. The core issue of causal effect inference from observational data is confounding: variables might affect both the treatment assignment and the treatment outcome. For example, students who receive after-school tutoring are more likely to try a new teaching plan, which increases their likelihood of success; patients with more personal wealth are in a better position to obtain new medicines, increasing the likelihood that they survive. Making causal effect inferences without controlling for confounders will lead to errors. Following (Pearl, 2009), we assume that all confounders can be observed, so that the effects of confounding variables can be blocked.
Some methods conduct matching of treatment and control units, e.g., based on mutual information (Sun and Nikolaev, 2016) or propensity scores (Dehejia and Wahba, 2002), to estimate the Average Treatment Effect (ATE) and the Average Treatment effect on the Treated (ATT). These methods account for treatment selection bias and can achieve balance between the control and treatment groups. Some methods employ propensity weighting to design unbiased estimators of causal effects (Lunceford and Davidian, 2004). Due to latent heterogeneity (Pearl, 2017), there might be subgroups of the study population for which an intervention is more or less effective; some research focuses on identifying those exceptional subgroups (Bertsimas et al., 2018). To account for heterogeneity in such subpopulations, work on Individual Treatment Effects (ITE) has appeared recently (Shalit et al., 2017; Lu et al., 2018; Yoon et al., 2018; Louizos et al., 2017; Yao et al., 2018). These methods are discussed in detail in Sections 2 and 5 of this paper. We focus on estimating ITE from observational data. The main challenges for this task are twofold: on the one hand, in observational data we only know the factual outcome of each unit (treated or untreated), but we never observe the counterfactual outcome; on the other hand, the distributions of covariates in the treatment and control groups are usually unbalanced (treatment selection bias
). If we directly employ the standard supervised learning framework to learn the treatment outcome, we will obtain a biased model suffering from generalization error, similar to the problem of learning from logged bandit feedback (Swaminathan and Joachims, 2015b). To overcome these challenges, we propose a Deep Neural Network (DNN) model that encodes the input covariates into a latent representation space and estimates the treatment outcome from the learned representations. There are three components on top of the encoder in our model: (1) mutual information estimation: a neural network estimator is established to estimate and maximize the mutual information between the representations and the input covariates; (2) adversarial balancing: we build a discriminator to distinguish representations of the treatment group from those of the control group; the encoder plays an adversarial game with the discriminator, trying to fool it by minimizing the distance between the distributions of representations from the treatment and control groups; (3) treatment outcome prediction: a neural network predictor is employed to predict the factual and counterfactual outcomes given the representation of a unit. By jointly optimizing the three components via backpropagation, we obtain a well-trained encoder. The architecture of our model is shown in Figure 1.
1.1 Main Contributions

We propose a novel model: Adversarial Balancing-based representation learning for Causal Effect Inference (ABCEI). ABCEI addresses the information loss and selection bias issues in observational data by learning highly informative and balanced representations in the latent space.

A neural network encoder is constrained by a mutual information estimator to minimize the information loss between representations and the input covariates, which preserves highly predictive information for causal effect inference.

We employ an adversarial learning method to balance the distributions of representations between treatment and control groups, which deals with the selection bias problem without any assumption on the form of the treatment selection function, unlike, e.g., the propensity score method.

We conduct various experiments on synthetic and real-world datasets. ABCEI outperforms most of the state-of-the-art baselines on benchmark datasets. The experimental results show that our model is accurate and robust. By supporting mini-batch training, ABCEI scales to large datasets.
2 Background
Work on causality learning falls into two categories: causal inference and causal discovery (Mooij et al., 2016). In the branch of causal inference, three kinds of data are used: data from Randomized Controlled Trials (RCT), observational data for which all the (potential) confounders can be observed, and observational data with unobserved confounders. A branch of research with RCT datasets focuses on identification of heterogeneous treatment effects. Both machine learning (Lamont et al., 2018; Taddy et al., 2016) and optimization (Bertsimas et al., 2018) approaches are applied. Due to the difficulties of obtaining RCT datasets, observational studies become an alternative. Removing confounding is a core issue in causal inference with observational data. Some research estimates population causal effects with an instrumental variable (Bareinboim and Pearl, 2012); some research uses latent variable models to simultaneously discover hidden confounders and estimate causal effects (Louizos et al., 2017), which is robust against hidden confounding. We focus on the branch of observational studies with no hidden confounders, which satisfies the strong ignorability assumption (Pearl, 2017, 2009; Cai et al., 2008).
Methods from representation learning are used to transform covariates from the original space into a latent representation space (Li and Fu, 2017). The representations are used as the input of predictors for individual and population causal effect inference. One study used a single neural network with the concatenation of representations and the treatment variable as the input (Johansson et al., 2016). Separate models were trained for different treatments, associated with a probabilistic integral metric to bound the generalization error, in (Shalit et al., 2017). Hard samples were used to preserve local similarity during the balancing process in (Yao et al., 2018).
ABCEI does not need prior knowledge about treatment assignment. By following the design of Wasserstein GAN (Gulrajani et al., 2017), our adversarial balancing makes the encoder generate more similar distributions for the treatment and control groups. Another advantage of our method is that we account for the information loss problem by using a mutual information estimator to regularize the encoder. The mutual information estimator uses a neural network to simultaneously approximate and minimize the information loss, which encourages the encoder to learn representations preserving highly predictive information. Based on those advantages, the two components -- mutual information estimator and adversarial balancing -- combined allow us to find a proper predictor for causal effect inference.
3 Problem Setup
Given an observational dataset $D = \{(x_i, t_i, y_i)\}_{i=1}^{n}$, with covariate matrix $X \in \mathbb{R}^{n \times d}$, binary treatment vector $t \in \{0,1\}^n$, and treatment outcome vector $y \in \mathbb{R}^n$. Here, $n$ denotes the number of observed units and $d$ denotes the number of covariates in the dataset. For each unit $i$, we have covariates $x_i$, associated with one treatment variable $t_i$ and one treatment outcome $y_i$. According to the Rubin-Neyman causal model (Rubin, 2005), two potential outcomes $Y_1(x_i)$, $Y_0(x_i)$ exist for treatments $t_i = 1$, $t_i = 0$, respectively. When $t_i = 1$ is assigned to unit $i$, we say unit $i$ is treated, with the outcome $Y_1(x_i)$; otherwise, we say unit $i$ is untreated or control, with the outcome $Y_0(x_i)$. Respectively, we call the observed outcome the factual outcome, denoted by $y^F_i$, and the unobserved one the counterfactual outcome, denoted by $y^{CF}_i$. Assuming there is a joint distribution $p(x, t, Y_0, Y_1)$, we make the following assumptions:
Assumption 1 (Strong Ignorability)
Conditioned on $x$, the potential outcomes are independent of $t$, which can be stated as: $(Y_1, Y_0) \perp\!\!\!\perp t \mid x$.
This assumption indicates that all the confounders are observed, i.e., no unmeasured confounder is present.
Assumption 2 (Overlap)
For all sets of covariates and for all treatments, the probability of treatment assignment will always be strictly larger than $0$ and strictly smaller than $1$, which can be expressed as: $p(t = 1 \mid x) > 0$ and $p(t = 1 \mid x) < 1$. Under this assumption, we know that the Individual Treatment Effect (ITE) can be estimated for any $x$ in the covariate space.
Under these assumptions, we can formalize the definition of ITE for our task:
Definition 1
The Individual Treatment Effect (ITE), also known as Conditional Average Treatment Effect (CATE), for unit $i$ is:
$$\mathrm{ITE}(x_i) = \mathbb{E}\left[Y_1(x_i) - Y_0(x_i)\right].$$
We can then define the Average Treatment Effect (ATE) and the Average Treatment effect on the Treated (ATT) as:
$$\mathrm{ATE} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{ITE}(x_i), \qquad \mathrm{ATT} = \frac{1}{|\{i : t_i = 1\}|}\sum_{i : t_i = 1}\mathrm{ITE}(x_i).$$
Because the joint distribution $p(x, t, Y_0, Y_1)$ is unknown, we can only estimate $\mathrm{ITE}$ with the observational data at hand. A function over the covariate space can be defined as $f : \mathcal{X} \times \{0, 1\} \to \mathcal{Y}$. The estimate of $\mathrm{ITE}$ can now be defined:
Definition 2
Given an observational dataset $D$ and a function $f$, for unit $i$, the estimate of $\mathrm{ITE}$ is:
$$\widehat{\mathrm{ITE}}(x_i) = f(x_i, 1) - f(x_i, 0).$$
We employ the Precision in Estimation of Heterogeneous Effect (PEHE) (Hill, 2011) as a metric for the accuracy of ITE estimation:
$$\epsilon_{\mathrm{PEHE}} = \frac{1}{n}\sum_{i=1}^{n}\left(\widehat{\mathrm{ITE}}(x_i) - \mathrm{ITE}(x_i)\right)^2. \tag{1}$$
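As a minimal sketch of this metric (the function name `pehe` is our own, not from the paper), the PEHE of Equation (1) is simply the mean squared error between true and estimated ITEs:

```python
import numpy as np

def pehe(ite_true, ite_pred):
    """Precision in Estimation of Heterogeneous Effect (Eq. 1):
    mean squared error between true and estimated ITEs."""
    ite_true = np.asarray(ite_true, dtype=float)
    ite_pred = np.asarray(ite_pred, dtype=float)
    return np.mean((ite_pred - ite_true) ** 2)

# Example: ground-truth ITEs (known on semi-simulated data such as IHDP)
# versus a model's estimates
ite_true = np.array([1.0, 2.0, 0.5])
ite_pred = np.array([1.2, 1.8, 0.5])
print(pehe(ite_true, ite_pred))  # mean of [0.04, 0.04, 0.0] ≈ 0.0267
```

Note that some works report the square root of this quantity; the definition above follows Equation (1) literally.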
In order to properly accomplish the task of ITE estimation, we need to find an optimal function $f$ over the covariate space for both treatment arms ($t = 0$ and $t = 1$). The most challenging problem for this task is that from the observational dataset we only know the factual outcomes. If we directly apply the classical supervised learning framework, we will obtain a biased model, which will suffer from high generalization error (Swaminathan and Joachims, 2015a). In the next section, we show how our proposed method overcomes this problem.
4 Proposed Method
In order to overcome the challenges in the task of ITE estimation, we build our model on recent advances in latent representation learning. We propose to define an encoder function $\Phi : \mathcal{X} \to \mathcal{R}$, mapping covariates to a latent representation space $\mathcal{R}$, and a prediction function $h : \mathcal{R} \times \{0, 1\} \to \mathcal{Y}$. Then we have $f(x, t) = h(\Phi(x), t)$. Instead of directly estimating the treatment outcome conditioned on covariates, we first use a neural network encoder to learn latent representations of the covariates. Our aim is to simultaneously learn latent representations and estimate the treatment outcome. However, the function $f$ would still suffer from information loss and treatment selection bias, unless we constrain the encoder to learn balanced representations while preserving useful information.
4.1 Mutual Information Estimation
Consider the information loss problem when learning latent representations from the original covariates. Mutual information (MI) captures non-linear statistical dependencies between variables (Kinney and Atwal, 2014) and thus reflects the true dependency. We therefore use the MI between the learned representations and the original covariates as a measure of information loss:
$$I(X; Z) = \int_{\mathcal{X}} \int_{\mathcal{Z}} p(x, z) \log \frac{p(x, z)}{p(x)\,p(z)} \, dz \, dx. \tag{2}$$
We denote the joint distribution between covariates and representations by $\mathbb{P}_{XZ}$ and the product of marginals by $\mathbb{P}_X \otimes \mathbb{P}_Z$. From the viewpoint of Shannon information theory, the mutual information can be represented in the form of a Kullback-Leibler (KL) divergence, $I(X; Z) = D_{\mathrm{KL}}(\mathbb{P}_{XZ} \,\|\, \mathbb{P}_X \otimes \mathbb{P}_Z)$. As Equation (2) suggests, MI is difficult to compute directly in continuous and high-dimensional spaces. A lower bound of MI can be obtained from the Donsker-Varadhan representation of the KL divergence (Donsker and Varadhan, 1983):
Theorem 1 (Donsker-Varadhan)
$$D_{\mathrm{KL}}(\mathbb{P} \,\|\, \mathbb{Q}) = \sup_{T \in \mathcal{F}} \ \mathbb{E}_{\mathbb{P}}[T] - \log \mathbb{E}_{\mathbb{Q}}\left[e^{T}\right].$$
Here $\mathcal{F}$ denotes the set of unconstrained functions (a detailed proof is in the supplementary materials). Inspired by Mutual Information Neural Estimation (MINE) (Belghazi et al., 2018), we propose to establish a neural network estimator for MI. Specifically, let $T_\omega : \mathcal{X} \times \mathcal{Z} \to \mathbb{R}$ be a function parametrized by a deep neural network with parameters $\omega$; we have:
$$I(X; Z) \ \ge \ I_\omega(X; Z) = \mathbb{E}_{\mathbb{P}_{XZ}}[T_\omega] - \log \mathbb{E}_{\mathbb{P}_X \otimes \mathbb{P}_Z}\left[e^{T_\omega}\right]. \tag{3}$$
By distinguishing the joint distribution from the product of marginals, the estimator approximates the MI with arbitrary precision. In practice, as shown in Figure 2, we concatenate the input covariates with their representations one by one to create positive samples (samples from the true joint distribution). We then randomly shuffle along the batch axis to create fake input covariates, which are concatenated with the representations to create negative samples (samples from the product of marginals). Based on Equation (3), we can write down the loss function for the MI estimator:
$$\mathcal{L}_{mi} = -\left(\mathbb{E}_{\mathbb{P}_{XZ}}[T_\omega] - \log \mathbb{E}_{\mathbb{P}_X \otimes \mathbb{P}_Z}\left[e^{T_\omega}\right]\right).$$
Information loss is diminished by simultaneously training the encoder and the MI estimator to minimize $\mathcal{L}_{mi}$ iteratively via gradient descent.
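To make the Donsker-Varadhan bound of Equation (3) concrete, here is a small numpy sketch that evaluates it with a fixed, untrained bilinear statistics function standing in for the learned network $T_\omega$ (in the actual model, $T_\omega$ is a deep network trained by gradient descent); the shuffle trick for negative samples follows the description above:

```python
import numpy as np

def dv_lower_bound(x, z, T):
    """Donsker-Varadhan bound: E_joint[T(x,z)] - log E_marginal[e^T(x,z')],
    where z' is a shuffled copy of z (samples from the product of marginals)."""
    rng = np.random.default_rng(0)
    z_shuffled = z[rng.permutation(len(z))]  # break the pairing -> marginals
    joint_term = np.mean(T(x, z))
    marginal_term = np.log(np.mean(np.exp(T(x, z_shuffled))))
    return joint_term - marginal_term

rng = np.random.default_rng(42)
x = rng.normal(size=5000)
z_dep = x + 0.1 * rng.normal(size=5000)  # strongly dependent on x
z_ind = rng.normal(size=5000)            # independent of x

T = lambda a, b: 0.5 * a * b             # toy, untrained critic
print(dv_lower_bound(x, z_dep, T) > dv_lower_bound(x, z_ind, T))  # True
```

Even this crude critic assigns a clearly higher bound to the dependent pair; training $T_\omega$ tightens the bound toward the true MI.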
4.2 Adversarial Balancing
The representations of the treatment and control groups are denoted by $Z_T$ and $Z_C$, corresponding to the input covariate groups $X_T$ and $X_C$. Even though information loss has been accounted for by MI maximization, a discrepancy between the distributions of the two groups still exists, and it needs to be addressed. To decrease this discrepancy, we propose an adversarial learning method that trains the encoder to produce treatment and control representations with balanced distributions. We build an adversarial game between a discriminator $D$ and the encoder $E$, inspired by the framework of Generative Adversarial Networks (GAN) (Goodfellow et al., 2014). In the classical GAN framework, a source of noise is mapped to a generated image by a generator $G$. A discriminator is trained to distinguish whether an input sample comes from the true image distribution or from the synthetic distribution induced by the generator. The aim of classical GAN training is to obtain a reliable discriminator that distinguishes fake from real images, and then to use that discriminator to train the generator to produce images that fool it. However, the training process of the classical GAN is sometimes unstable.
In our adversarial game: (1) we draw a noise vector $\epsilon \sim p_\epsilon$, where $p_\epsilon$ can be a spherical Gaussian or a uniform distribution; (2) the representations are separated with regard to the treatment assignments, forming two distributions $Z_T$ and $Z_C$; (3) we concatenate $\epsilon$ with the representation vectors to obtain $\tilde{Z}_T$ and $\tilde{Z}_C$; (4) we train a discriminator $D$ to distinguish the concatenated vectors of the treatment group from those of the control group; (5) we adjust the encoder to generate balanced representations that fool the discriminator. According to the architecture of ABCEI, the encoder $E$ is associated with the MI estimator, the treatment outcome predictor, and the adversarial discriminator. This means that the training process iteratively adjusts each of the components, so the instability of GAN training can become severe in this context.
To stabilize GAN training, we adopt the framework of Wasserstein GAN with gradient penalty (Gulrajani et al., 2017). By removing the sigmoid output layer and applying a gradient penalty on points sampled between the treatment and control representation distributions, we constrain the discriminator to (approximately) satisfy the 1-Lipschitz condition:
$$|D(z_1) - D(z_2)| \le \|z_1 - z_2\|_2 \quad \text{for all } z_1, z_2.$$
We can write down the form of our adversarial game:
$$\min_{E}\,\max_{D}\ \mathbb{E}_{\tilde z \sim \mathbb{P}_{\tilde Z_T}}[D(\tilde z)] - \mathbb{E}_{\tilde z \sim \mathbb{P}_{\tilde Z_C}}[D(\tilde z)] - \lambda\,\mathbb{E}_{\hat z \sim \mathbb{P}_{\hat z}}\left[\left(\|\nabla_{\hat z} D(\hat z)\|_2 - 1\right)^2\right],$$
where $\mathbb{P}_{\hat z}$ is the distribution acquired by uniformly sampling along the straight lines between pairs of samples from $\mathbb{P}_{\tilde Z_T}$ and $\mathbb{P}_{\tilde Z_C}$. The adversarial learning process is shown in Figure 3.
This ensures that the encoder can be smoothly trained to generate balanced representations. We can write down the training objectives for the discriminator and the encoder, respectively:
$$\mathcal{L}_{D} = \mathbb{E}_{\tilde z \sim \mathbb{P}_{\tilde Z_C}}[D(\tilde z)] - \mathbb{E}_{\tilde z \sim \mathbb{P}_{\tilde Z_T}}[D(\tilde z)] + \lambda\,\mathbb{E}_{\hat z \sim \mathbb{P}_{\hat z}}\left[\left(\|\nabla_{\hat z} D(\hat z)\|_2 - 1\right)^2\right],$$
$$\mathcal{L}_{adv} = \mathbb{E}_{\tilde z \sim \mathbb{P}_{\tilde Z_T}}[D(\tilde z)] - \mathbb{E}_{\tilde z \sim \mathbb{P}_{\tilde Z_C}}[D(\tilde z)].$$
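The gradient penalty itself requires automatic differentiation, but the sampling of $\mathbb{P}_{\hat z}$ (points along lines between treatment/control pairs) can be sketched in a few lines of numpy; the helper name `interpolate_pairs` is ours, not from the paper:

```python
import numpy as np

def interpolate_pairs(z_treat, z_control, rng):
    """Sample points uniformly along straight lines between random pairs
    of treatment and control representations -- the distribution P_hat
    on which the WGAN gradient penalty is evaluated."""
    m = min(len(z_treat), len(z_control))
    a = z_treat[rng.permutation(len(z_treat))[:m]]
    b = z_control[rng.permutation(len(z_control))[:m]]
    eps = rng.uniform(size=(m, 1))  # one mixing weight per pair
    return eps * a + (1.0 - eps) * b

rng = np.random.default_rng(0)
z_t = rng.normal(loc=1.0, size=(64, 8))   # toy treatment representations
z_c = rng.normal(loc=-1.0, size=(32, 8))  # toy control representations
z_hat = interpolate_pairs(z_t, z_c, rng)
print(z_hat.shape)  # (32, 8)
```

In a full implementation, the discriminator's gradient norm at each `z_hat` row is pushed toward 1 by the penalty term in $\mathcal{L}_{D}$.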
4.3 Treatment Outcome Prediction
The final step for ITE estimation is to predict the treatment outcomes with the learned representations. We establish a neural network predictor, which takes the latent representation and treatment assignment of a unit as input to conduct outcome prediction: $\hat y_i = h(\Phi(x_i), t_i)$. We can write down the loss function of the training objective as:
$$\mathcal{L}_{y} = \frac{1}{n}\sum_{i=1}^{n}\left(h(\Phi(x_i), t_i) - y_i\right)^2 + \lambda_h\,\mathcal{R}(h).$$
Here, $\lambda_h\,\mathcal{R}(h)$ is a regularization term on $h$ penalizing model complexity.
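A minimal numpy sketch of this factual-outcome loss, assuming an $\ell_2$ penalty on the predictor weights as the regularizer (one possible choice for $\mathcal{R}(h)$; the function name is illustrative):

```python
import numpy as np

def prediction_loss(y_pred, y_true, weights, lam=1e-3):
    """Factual-outcome loss: MSE between predicted and observed outcomes,
    plus an L2 complexity penalty on the predictor weights."""
    mse = np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)
    l2 = sum(np.sum(w ** 2) for w in weights)
    return mse + lam * l2

y_true = np.array([1.0, 0.0, 2.0])
y_pred = np.array([0.9, 0.1, 2.1])
w = [np.array([[1.0, -1.0]]), np.array([0.5])]
print(prediction_loss(y_pred, y_true, w))  # 0.01 + 1e-3 * 2.25 = 0.01225
```

Only the factual outcome enters the loss; counterfactual predictions are obtained at inference time by flipping the treatment input.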
4.4 Learning Optimization
With regard to the architecture of our model in Figure 1, we propose to minimize $\mathcal{L}_{mi}$, $\mathcal{L}_{D}$, $\mathcal{L}_{adv}$, and $\mathcal{L}_{y}$, respectively, and to iteratively find the optimal parameter setting for the global model. In each training step, we first minimize $\mathcal{L}_{mi}$ by simultaneously optimizing the encoder $E$ and the MI estimator with a one-step gradient update. Then the representations are passed to the discriminator, and we minimize $\mathcal{L}_{D}$ by optimizing $D$ with $k$ gradient steps, in order to obtain a stable discriminator. Next, we use the discriminator to train the encoder by minimizing $\mathcal{L}_{adv}$ with a one-step gradient update. Finally, the encoder and the predictor are optimized simultaneously by minimizing $\mathcal{L}_{y}$. The optimization steps are handled with the stochastic method Adam (Kingma and Ba, 2014). The pseudocode of our model is given in Algorithm 1.
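The alternating schedule above can be sketched as the following training-loop skeleton; the four update functions are hypothetical placeholders standing in for the actual Adam gradient steps on each loss:

```python
def train(batches, update_mi, update_disc, update_adv, update_pred, k=5):
    """One ABCEI-style alternating optimization pass per mini-batch:
    (1) encoder + MI estimator, (2) k discriminator steps,
    (3) encoder vs. discriminator (adversarial), (4) encoder + predictor."""
    for batch in batches:
        update_mi(batch)          # minimize L_mi  (encoder + T_omega)
        for _ in range(k):
            update_disc(batch)    # minimize L_D   (discriminator only)
        update_adv(batch)         # minimize L_adv (encoder only)
        update_pred(batch)        # minimize L_y   (encoder + predictor)

# Verify the schedule with stand-in updates that just record their calls
calls = []
train(range(2),
      lambda b: calls.append("mi"),
      lambda b: calls.append("d"),
      lambda b: calls.append("adv"),
      lambda b: calls.append("y"),
      k=3)
print(calls.count("d"))  # 6
```

Because each step consumes only a mini-batch, the procedure scales to large datasets, as noted in the contributions.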
5 Experiments
Due to the lack of counterfactual treatment outcomes in observational data, it is difficult to validate and test the performance of causal effect inference methods. Fortunately, there are two ways to overcome this difficulty: one is to use simulated or semi-simulated treatment outcomes, e.g., the IHDP dataset (Hill, 2011); the other is to use RCT datasets and add a non-randomized component to generate imbalanced datasets, e.g., the Jobs dataset (LaLonde, 1986; Smith and Todd, 2005; Dehejia and Wahba, 2002). We designed experiments along both paths to evaluate our method.
The details of implementation, datasets and process of hyperparameter optimization are given in the supplementary materials.
5.1 Experiments on Benchmark Datasets
5.1.1 Evaluation Metrics
Since the ground-truth ITE is known for the IHDP dataset, we can employ Equation (1) to evaluate our method's ITE estimation, and subsequently evaluate the precision of the ATE estimates derived from the estimated ITEs. For the Jobs dataset, because we only know part of the ground truth (the randomized component), we cannot evaluate the performance of ATE estimation. Instead, following the suggestion of (Shalit et al., 2017), we evaluate the precision of ATT estimation for population causal effects and the policy risk for individual causal effects.
In this paper, we consider the policy $\pi_f(x) = 1$ when $f(x, 1) - f(x, 0) > 0$, and $\pi_f(x) = 0$ otherwise.
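As a sketch of how policy risk is estimated on the randomized subsample (following the evaluation in Shalit et al. (2017); variable and function names here are illustrative, not from the paper):

```python
import numpy as np

def policy_risk(y, t, pi):
    """Policy risk on randomized data: 1 minus the expected outcome under
    the policy pi, estimated from units whose random assignment t happens
    to agree with the policy's recommendation."""
    y, t, pi = (np.asarray(a) for a in (y, t, pi))
    p_treat = np.mean(pi == 1)        # fraction the policy would treat
    treated = (pi == 1) & (t == 1)    # policy says treat, RCT treated
    control = (pi == 0) & (t == 0)    # policy says don't, RCT control
    value = 0.0
    if treated.any():
        value += np.mean(y[treated]) * p_treat
    if control.any():
        value += np.mean(y[control]) * (1.0 - p_treat)
    return 1.0 - value

y = np.array([1, 0, 1, 1, 0, 0])   # binary outcomes (1 = success)
t = np.array([1, 1, 0, 0, 1, 0])   # randomized treatment assignment
pi = np.array([1, 0, 1, 0, 1, 0])  # policy derived from f(x,1) - f(x,0)
print(policy_risk(y, t, pi))
```

Lower policy risk means the policy induced by the estimated ITEs selects treatments with better expected outcomes.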
For the Twins dataset, because we only know the observed treatment outcome for each unit, we follow (Louizos et al., 2017) in using the area under the ROC curve (AUC) as the evaluation metric. For each dataset, the experimental results are averaged over repeated train/validation/test splits.
5.1.2 Baseline Methods
We compare with the following baselines: least squares regression using the treatment as a feature (OLS-1); separate least squares regressions for each treatment (OLS-2); balancing linear regression (BLR) and balancing neural network (BNN) (Johansson et al., 2016); k-nearest neighbors (k-NN) (Crump et al., 2008); Bayesian additive regression trees (BART) (Sparapani et al., 2016); random forests (RF) (Breiman, 2001); causal forests (CF) (Wager and Athey, 2017); treatment-agnostic representation networks (TARNet) and counterfactual regression with Wasserstein distance (CFR-Wass) (Shalit et al., 2017); causal effect variational autoencoders (CEVAE) (Louizos et al., 2017); and local similarity preserved individual treatment effect estimation (SITE) (Yao et al., 2018).
[Table 1: In-sample and out-of-sample results with mean and standard errors on the IHDP dataset (lower = better), for OLS-1, OLS-2, BLR, BART, k-NN, RF, CF, BNN, TARNet, CFR-Wass, CEVAE, SITE, and ABCEI.]
We show a quantitative comparison between our method and the state-of-the-art baselines. Experimental results (in-sample and out-of-sample) on the IHDP, Jobs, and Twins datasets are reported. We additionally report two ablated variants of our model: one without the mutual information estimation component and one without the adversarial learning component.
[Table 2: In-sample and out-of-sample results with mean and standard errors on the Jobs dataset.]
5.1.3 Results
Experimental results are shown in Tables 1, 2 and 3. ABCEI shows stable and good performance under various settings, across datasets with different numbers of samples and covariates. For the IHDP dataset, ABCEI achieves a competitive in-sample PEHE result, very close to the best method (SITE), and the best performance on in-sample ATE estimation as well as on both out-of-sample metrics. For the Jobs dataset, ABCEI achieves better performance than the other baselines on policy risk estimation, and performance close to the OLS baselines on in-sample ATT estimation. For the Twins dataset, our method performs best, outperforming all other baselines.
[Table 3: In-sample and out-of-sample results with mean and standard errors on the Twins dataset.]
Due to the existence of treatment selection bias, regression-based methods suffer from high generalization error. Nearest-neighbor-based methods consider similar units to overcome selection bias, but cannot achieve balance globally. Recent advances in representation learning bring improvements in causal effect estimation. CEVAE uses variational autoencoders to learn a latent-variable causal model; it has competitive performance on the Jobs and Twins datasets, but relatively weak performance on the IHDP dataset. BNN accounts for selection bias by balancing, but treats the treatment variable as a feature in a single network, which might dilute the impact of the treatment variable when the dimension of the representations is high (Shalit et al., 2017). TARNet employs two separate networks but follows an agnostic setting without balancing against selection bias. CFR-Wass does not consider local similarity information in the original covariate space. SITE uses hard samples to preserve local similarity information and considers the balance property. Compared with CFR-Wass, BNN, and SITE, our method ABCEI addresses both the information loss and the balancing problem. The mutual information estimator ensures that the encoder learns representations preserving useful information from the original covariate space, and the adversarial learning component constrains the encoder to learn balanced representations. This enables our method to achieve better performance than the other baselines. We also report the performance of our model without the mutual information estimator and without adversarial learning, respectively. Both variants perform worse than the full model, which demonstrates the importance of adversarial learning and mutual information estimation.
5.2 Empirical Robustness Analysis
5.2.1 Robustness Analysis on Selection Bias
In order to investigate the performance of our model under different levels of selection bias, we generate toy datasets by varying the discrepancy between the treatment and control groups. We draw control-group samples with ten covariates from a fixed distribution, and then draw treatment-group samples from a family of distributions whose parameters we adjust. By adjusting these parameters, we generate treatment groups with different degrees of selection bias, which can be measured by the Kullback-Leibler divergence between the two groups. Outcomes are generated as a noisy function of the covariates and the treatment.
In panel (a) of the corresponding figure, we compare the robustness of ABCEI with CFR-Wass, BART, and SITE. The reported experimental results are averaged over 100 test sets. From the figure, we can see that as the KL divergence increases, our method maintains more stable performance. We do not visualize standard deviations as they are negligibly small.
5.2.2 Robustness Analysis on Mutual Information Estimation
Our aim is to investigate the impact of minimizing the information loss on causal effect learning. We block the adversarial learning component and train our model on the IHDP dataset, recording the value of the estimated MI and the prediction error in each epoch. In panel (b), we report the experimental results averaged over 1000 test sets. We can see that as the MI value increases, the mean squared error decreases and reaches a stable region. But without the adversarial balancing component, the error cannot be lowered further, due to the selection bias. This result indicates that even though the estimators benefit from highly predictive information, they still suffer if imbalance is ignored.
6 Conclusions
We proposed a novel model for causal effect inference with observational data, called ABCEI, built on deep representation learning. The model engages two important performance-ensuring components: a mutual information estimator and adversarial balancing. These components help tackle the problems of information loss and selection bias when designing causal effect predictors. With the mutual information estimator, we preserve highly predictive information from the original covariate space by simultaneously estimating and maximizing the mutual information between covariates and learned representations. Our experimental results demonstrate the effectiveness of mutual information estimation for causal effect inference. At the same time, adversarial learning balances the representation distributions of the treatment and control groups. We establish a discriminator to distinguish representations of the treatment group from those of the control group; by adjusting the encoder parameters, we aim to find an encoder that fools the discriminator, which ensures that the distributions of the treatment and control groups are as similar as possible. Our balancing method makes no assumption on the form of the treatment selection function. Experimental results show that our encoder learns more similar distributions for the treatment and control groups, with better coverage. Finally, experimental results on benchmark and synthetic datasets also demonstrate that ABCEI achieves robust and better performance compared to state-of-the-art approaches.
In future work, we would like to explore further connections between methods in domain adaptation (Daume III and Marcu, 2006) and counterfactual learning (Swaminathan and Joachims, 2015b) and methods in causal inference. A natural extension is to consider multiple treatment assignments or the existence of hidden confounders. Furthermore, we also plan to investigate causal effects in subpopulations to detect latent heterogeneity, which is an important issue for decision makers in many fields such as public health and social security.
References
 Abadi et al. (2016) Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. (2016). Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283.
 Almond et al. (2005) Almond, D., Chay, K. Y., and Lee, D. S. (2005). The costs of low birth weight. The Quarterly Journal of Economics, 120(3):1031–1083.
 Bareinboim and Pearl (2012) Bareinboim, E. and Pearl, J. (2012). Controlling selection bias in causal inference. In Artificial Intelligence and Statistics, pages 100–108.
 Belghazi et al. (2018) Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, D. (2018). Mutual information neural estimation. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 531–540. PMLR.
 Bertsimas et al. (2018) Bertsimas, D., Korolko, N., and Weinstein, A. (2018). Identifying exceptional responders in randomized trials: An optimization approach. INFORMS Journal on Optimization, under review.
 Breiman (2001) Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.
 Cai et al. (2008) Cai, Z., Kuroki, M., Pearl, J., and Tian, J. (2008). Bounds on direct effects in the presence of confounded intermediate variables. Biometrics, 64(3):695–701.
 Casucci et al. (2017) Casucci, S., Lin, L., Hewner, S., and Nikolaev, A. (2017). Estimating the causal effects of chronic disease combinations on 30-day hospital readmissions based on observational Medicaid data. Journal of the American Medical Informatics Association, 25(6):670–678.
 Chernozhukov et al. (2013) Chernozhukov, V., FernándezVal, I., and Melly, B. (2013). Inference on counterfactual distributions. Econometrica, 81(6):2205–2268.
 Clevert et al. (2015) Clevert, D.-A., Unterthiner, T., and Hochreiter, S. (2015). Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289.
 Crump et al. (2008) Crump, R. K., Hotz, V. J., Imbens, G. W., and Mitnik, O. A. (2008). Nonparametric tests for treatment effect heterogeneity. The Review of Economics and Statistics, 90(3):389–405.

 Daume III and Marcu (2006) Daume III, H. and Marcu, D. (2006). Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26:101–126.
 Dehejia and Wahba (2002) Dehejia, R. H. and Wahba, S. (2002). Propensity score-matching methods for nonexperimental causal studies. Review of Economics and Statistics, 84(1):151–161.
 Diamond and Sekhon (2013) Diamond, A. and Sekhon, J. S. (2013). Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. Review of Economics and Statistics, 95(3):932–945.
 Donsker and Varadhan (1983) Donsker, M. D. and Varadhan, S. S. (1983). Asymptotic evaluation of certain Markov process expectations for large time. IV. Communications on Pure and Applied Mathematics, 36(2):183–212.
 Dorie (2016) Dorie, V. (2016). Npci: Nonparametrics for causal inference.
 Glass et al. (2013) Glass, T. A., Goodman, S. N., Hernán, M. A., and Samet, J. M. (2013). Causal inference in public health. Annual review of public health, 34:61–75.
 Goodfellow (2016) Goodfellow, I. (2016). NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160.
 Goodfellow et al. (2014) Goodfellow, I., PougetAbadie, J., Mirza, M., Xu, B., WardeFarley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680.
 Gulrajani et al. (2017) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. (2017). Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5767–5777.
 Hill (2011) Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240.
 Ho et al. (2011) Ho, D. E., Imai, K., King, G., Stuart, E. A., et al. (2011). Matchit: nonparametric preprocessing for parametric causal inference. Journal of Statistical Software, 42(8):1–28.

 Jiang and Li (2016) Jiang, N. and Li, L. (2016). Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 652–661. JMLR.org.
 Johansson et al. (2016) Johansson, F., Shalit, U., and Sontag, D. (2016). Learning representations for counterfactual inference. In International Conference on Machine Learning, pages 3020–3029.
 Kingma and Ba (2014) Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 Kingma and Welling (2013) Kingma, D. P. and Welling, M. (2013). Autoencoding variational bayes. arXiv preprint arXiv:1312.6114.
 Kinney and Atwal (2014) Kinney, J. B. and Atwal, G. S. (2014). Equitability, mutual information, and the maximal information coefficient. Proceedings of the National Academy of Sciences, page 201309933.
 Kuang et al. (2018) Kuang, K., Cui, P., Athey, S., Xiong, R., and Li, B. (2018). Stable prediction across unknown environments. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1617–1626. ACM.
 LaLonde (1986) LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. The American economic review, pages 604–620.
 Lamont et al. (2018) Lamont, A., Lyons, M. D., Jaki, T., Stuart, E., Feaster, D. J., Tharmaratnam, K., Oberski, D., Ishwaran, H., Wilson, D. K., and Van Horn, M. L. (2018). Identification of predicted individual treatment effects in randomized clinical trials. Statistical methods in medical research, 27(1):142–157.
 Li and Fu (2017) Li, S. and Fu, Y. (2017). Matching on balanced nonlinear representations for treatment effects estimation. In Advances in Neural Information Processing Systems, pages 929–939.
 Louizos et al. (2017) Louizos, C., Shalit, U., Mooij, J. M., Sontag, D., Zemel, R., and Welling, M. (2017). Causal effect inference with deep latentvariable models. In Advances in Neural Information Processing Systems, pages 6446–6456.
 Lu et al. (2018) Lu, M., Sadiq, S., Feaster, D. J., and Ishwaran, H. (2018). Estimating individual treatment effect in observational data using random forest methods. Journal of Computational and Graphical Statistics, 27(1):209–219.
 Lunceford and Davidian (2004) Lunceford, J. K. and Davidian, M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in medicine, 23(19):2937–2960.
 Mooij et al. (2016) Mooij, J. M., Peters, J., Janzing, D., Zscheischler, J., and Schölkopf, B. (2016). Distinguishing cause from effect using observational data: methods and benchmarks. The Journal of Machine Learning Research, 17(1):1103–1204.
 Morgan and Harding (2006) Morgan, S. L. and Harding, D. J. (2006). Matching estimators of causal effects: Prospects and pitfalls in theory and practice. Sociological methods & research, 35(1):3–60.
 Nowozin et al. (2016) Nowozin, S., Cseke, B., and Tomioka, R. (2016). f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279.
 Ozery-Flato et al. (2018) Ozery-Flato, M., Thodoroff, P., and El-Hay, T. (2018). Adversarial balancing for causal inference. arXiv preprint arXiv:1810.07406.
 Pearl (2009) Pearl, J. (2009). Causality. Cambridge University Press.
 Pearl (2017) Pearl, J. (2017). Detecting latent heterogeneity. Sociological Methods & Research, 46(3):370–389.
 Rubin (2001) Rubin, D. B. (2001). Using propensity scores to help design observational studies: application to the tobacco litigation. Health Services and Outcomes Research Methodology, 2(3-4):169–188.
 Rubin (2005) Rubin, D. B. (2005). Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469):322–331.
 Shalit et al. (2017) Shalit, U., Johansson, F. D., and Sontag, D. (2017). Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3076–3085. JMLR.org.
 Smith and Todd (2005) Smith, J. A. and Todd, P. E. (2005). Does matching overcome LaLonde's critique of nonexperimental estimators? Journal of Econometrics, 125(1-2):305–353.

 Smith (2002) Smith, L. I. (2002). A tutorial on principal components analysis. Technical report.
 Sparapani et al. (2016) Sparapani, R. A., Logan, B. R., McCulloch, R. E., and Laud, P. W. (2016). Nonparametric survival analysis using Bayesian additive regression trees (BART). Statistics in Medicine, 35(16):2741–2753.
 Sugiyama et al. (2007) Sugiyama, M., Krauledat, M., and Müller, K.-R. (2007). Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(May):985–1005.
 Sun and Nikolaev (2016) Sun, L. and Nikolaev, A. G. (2016). Mutual information based matching for causal inference with observational data. The Journal of Machine Learning Research, 17(1):6990–7020.
 Swaminathan and Joachims (2015a) Swaminathan, A. and Joachims, T. (2015a). Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research, 16(1):1731–1755.
 Swaminathan and Joachims (2015b) Swaminathan, A. and Joachims, T. (2015b). Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pages 814–823.
 Taddy et al. (2016) Taddy, M., Gardner, M., Chen, L., and Draper, D. (2016). A nonparametric Bayesian analysis of heterogeneous treatment effects in digital experimentation. Journal of Business & Economic Statistics, 34(4):661–672.

 Vincent et al. (2008) Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103.
 Wager and Athey (2017) Wager, S. and Athey, S. (2017). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association.
 Yao et al. (2018) Yao, L., Li, S., Li, Y., Huai, M., Gao, J., and Zhang, A. (2018). Representation learning for treatment effect estimation from observational data. In Advances in Neural Information Processing Systems, pages 2634–2644.
 Yoon et al. (2018) Yoon, J., Jordon, J., and van der Schaar, M. (2018). GANITE: Estimation of individualized treatment effects using generative adversarial nets. In International Conference on Learning Representations.
 Zhao and Heffernan (2017) Zhao, S. and Heffernan, N. (2017). Estimating individual treatment effects from educational studies with residual counterfactual networks. In 10th International Conference on Educational Data Mining.
Appendix A Related Work
One of the core issues to deal with in causal inference is treatment selection bias. From the viewpoint of balancing, there are three families of approaches. The first and classical one is matching (Ho et al., 2011): a control group is selected so as to maximize the similarity between the empirical covariate distributions in the treatment and control groups. Mahalanobis distance and propensity score matching methods have been proposed for population causal effect inference (Rubin, 2001; Diamond and Sekhon, 2013), and an information-theoretic approach uses mutual information as the similarity measure (Sun and Nikolaev, 2016). In the second family, inverse propensity score (IPS) methods build on variants of importance sampling (Sugiyama et al., 2007; Jiang and Li, 2016): the IPS is used to reweigh each unit in order to learn counterfactuals, akin to counterfactual learning from logged bandit feedback (Swaminathan and Joachims, 2015b, a). The main difference between ABCEI and the existing approaches is that, besides balancing, we address the information loss problem by simultaneously estimating and maximizing the mutual information between the latent representations and the input covariates.
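To make the reweighing idea concrete, here is a minimal numpy sketch (our own toy example, not the paper's code) of an IPS estimator of the ATE on synthetic confounded data, under the simplifying assumption that the true propensity score is known:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# One confounder x affects both treatment assignment and outcome.
x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-1.5 * x))          # true propensity score P(t=1 | x)
t = rng.binomial(1, p)
y = 2.0 * t + 3.0 * x + rng.normal(size=n)  # true ATE = 2

# Naive difference of group means is biased by the confounder.
naive = y[t == 1].mean() - y[t == 0].mean()

# IPS reweighing (Horvitz-Thompson form) corrects the selection bias.
ate_ips = np.mean(t * y / p - (1 - t) * y / (1 - p))

print(round(naive, 2), round(ate_ips, 2))  # naive is far from 2; ate_ips is close to 2
```

In practice the propensity score must itself be estimated from the covariates, which is what makes representation-based alternatives attractive.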
From a technical viewpoint, our method lies in the field of representation learning. The aim of representation learning is to extract useful information from the original data for downstream tasks such as building predictors or classifiers. From principal components analysis (PCA) (Smith, 2002) to autoencoders (Vincent et al., 2008), many approaches have been proposed for learning representations. A common way to evaluate the quality of learned representations is to measure the reconstruction error (Kingma and Welling, 2013). In particular, the reconstruction error is minimized by maximizing the mutual information between the input and the learned representations when the joint distributions of the encoder and decoder are matched (Belghazi et al., 2018). As a consequence, maximizing mutual information minimizes both the information loss and the expected reconstruction error. We adopt this approach to regularize the encoder to preserve information useful for the prediction tasks. However, accurately computing MI in continuous, high-dimensional spaces is difficult. KL-divergence (Donsker and Varadhan, 1983) and Jensen-Shannon divergence (JSD) (Nowozin et al., 2016) based methods have been introduced to approximate mutual information with neural networks. We follow this line of work to build our neural MI estimator.
Machine learning methods are increasingly employed for causal inference. For instance, Bayesian additive regression trees and random forests were used to estimate causal effects in (Sparapani et al., 2016) and (Wager and Athey, 2017), respectively. Other work discusses how domain adaptation (Daume III and Marcu, 2006) and generative adversarial networks (GANs) (Goodfellow, 2016) can be used for causal inference by generating balanced weights for unit samples (Ozery-Flato et al., 2018; Kuang et al., 2018). Yoon et al. (2018) proposed fitting a model with only the observed factual data using the GAN framework, which is suitable for any number of treatments. The main difference between ABCEI and these methods is that we use adversarial learning to balance the distributions of the treatment and control groups in the latent representation space.
Appendix B Details of Datasets
IHDP
The Infant Health and Development Program (IHDP) studies the impact of specialist home visits on future cognitive test scores. Covariates in the semi-simulated dataset are collected from a real-world randomized experiment. The treatment selection bias is created by removing a subset of the treatment group. We use setting 'A' in (Dorie, 2016) to simulate treatment outcomes. This dataset includes units ( control and treated) with covariates associated with each unit.
Jobs
The Jobs dataset (LaLonde, 1986; Smith and Todd, 2005) studies the effect of job training on employment status. It consists of a non-randomized component from observational studies and a randomized component based on the National Supported Work program. The randomized component includes units ( control and treated) with seven covariates, and the non-randomized component (PSID comparison group) includes control units.
Twins
The Twins dataset is created based on the "Linked Birth / Infant Death Cohort Data" by NBER (https://nber.org/data/linkedbirthinfantdeathdatavitalstatisticsdata.html). Inspired by (Almond et al., 2005), we employ a matching algorithm to select twin births in the USA between 1989-1991. By doing this, we obtain units associated with covariates including the education, age, and race of the parents, birth place, marital status of the mother, the month in which pregnancy prenatal care began, the total number of prenatal visits, and other variables indicating demographic and health conditions. We only select same-sex twins who both weigh less than . For the treatment variable, we use indicating the lighter twin and indicating the heavier twin. Inspired by (Louizos et al., 2017), we take the mortality of each twin in their first year of life as the treatment outcome. Finally, we have a dataset consisting of 12,828 pairs of twins, whose mortality rate is for the lighter twin and for the heavier twin. Hence, we have observational treatment outcomes for both treatments. In order to simulate selection bias, we selectively choose one of the twins to observe with regard to the covariates associated with each unit as follows: , where and .
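The exact assignment mechanism and its constants are not reproduced above. As an illustration only, a common scheme in this literature (in the spirit of Louizos et al. (2017)) samples which twin is observed from a sigmoid of a noisy random projection of the covariates; every name, size, and constant below is our own stand-in, not the real Twins data:

```python
import numpy as np

rng = np.random.default_rng(1)
n_pairs, n_cov = 1000, 30  # illustrative sizes, not the real Twins dimensions

x = rng.normal(size=(n_pairs, n_cov))     # stand-in covariates for each twin pair
y_light = rng.binomial(1, 0.19, n_pairs)  # stand-in mortality if the lighter twin is observed
y_heavy = rng.binomial(1, 0.16, n_pairs)  # stand-in mortality if the heavier twin is observed

# Covariate-dependent assignment induces selection bias: which twin is
# "observed" depends on x through an (illustrative) random projection w.
w = rng.uniform(-0.1, 0.1, size=n_cov)
noise = rng.normal(0.0, 0.1, size=n_pairs)
p_obs = 1.0 / (1.0 + np.exp(-(x @ w + noise)))
t = rng.binomial(1, p_obs)                # illustrative coding: 1 = heavier twin observed

y_obs = np.where(t == 1, y_heavy, y_light)  # factual outcome; the other twin is the counterfactual
```

Because both potential outcomes exist for every pair, the held-out twin provides a ground-truth counterfactual for evaluation.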
Appendix C Implementation details
The implementation of our method is based on Python and TensorFlow (Abadi et al., 2016). The architecture of our neural network model is shown in Figure 1. We adopt ELU (Clevert et al., 2015) as the nonlinear activation function unless specified otherwise. The model is trained using Adam (Kingma and Ba, 2014) within Algorithm 1. We employ various numbers of fully-connected hidden layers with various sizes across networks: four layers with size for the encoder network; two layers with size for the mutual information estimator network; three layers with size for the discriminator network; and finally, three layers with size for the predictor network, following the structure of TARnet (Shalit et al., 2017). The gradient penalty weight is set to , and the regularization weight is set to . All the experiments in this paper are conducted on a cluster with 1x Intel Xeon E5 2.2GHz CPU, 4x Nvidia Tesla V100 GPU, and 256GB RAM. The source code and datasets will be made available in the final version of the publication.
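As a sketch of the forward pass described above, here are the ELU activation and a fully-connected stack in plain numpy; the layer widths are illustrative choices, since the exact sizes are not reproduced here:

```python
import numpy as np

def elu(x, alpha=1.0):
    # ELU (Clevert et al., 2015): identity for x > 0, alpha*(exp(x)-1) otherwise.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def mlp_forward(x, layer_sizes, rng):
    """Forward pass through a stack of fully-connected layers with ELU
    activations. Weights are randomly initialized here for illustration."""
    h = x
    for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_in, d_out))
        b = np.zeros(d_out)
        h = elu(h @ W + b)
    return h

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 25))  # batch of 8 units with 25 covariates (illustrative)
z = mlp_forward(x, [25, 200, 200, 200, 200], rng)  # a four-layer encoder, widths assumed
print(z.shape)  # (8, 200)
```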
Appendix D Hyperparameter optimization
Because we cannot observe counterfactuals in observational datasets, standard cross-validation is not feasible. We follow the hyperparameter optimization criterion of (Shalit et al., 2017), with early stopping with respect to the lower bound on the validation set. The hyperparameter search space is shown in Table 4, and the optimal hyperparameter settings for each benchmark dataset are shown in Table 5.
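Since the stopping rule is only sketched above, the following generic illustration (all names are ours) shows early stopping with a patience window on a validation criterion where lower is better, standing in for the validation lower bound monitored in the paper:

```python
def train_with_early_stopping(step_fn, val_metric_fn, max_iters=1000, patience=20):
    """Stop once the validation criterion (lower is better) has not
    improved for `patience` consecutive iterations."""
    best, best_iter, since_best = float("inf"), -1, 0
    for i in range(max_iters):
        step_fn()
        v = val_metric_fn()
        if v < best:
            best, best_iter, since_best = v, i, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best, best_iter

# Toy demo: a fake validation metric that bottoms out at iteration 3.
state = {"i": -1}
def step():
    state["i"] += 1
def metric():
    return (state["i"] - 3) ** 2

best, best_iter = train_with_early_stopping(step, metric, max_iters=100, patience=2)
print(best, best_iter)  # 0 3
```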
Hyperparameter  Range
Optimizer  RMSProp, Adam
Depth of encoder layers
Depth of discriminator layers
Depth of predictor layers
Dimension of encoder layers
Dimension of discriminator layers
Dimension of MI estimator layers
Dimension of predictor layers
Batch size
Hyperparameters  IHDP  Jobs  Twins
Optimizer  Adam  Adam  Adam
Depth of encoder layers
Depth of discriminator layers
Depth of predictor layers
Dimension of encoder layers
Dimension of discriminator layers
Dimension of MI estimator layers
Dimension of predictor layers
Batch size
Appendix E Proofs
E.1 Donsker-Varadhan
Theorem 2 (Donsker-Varadhan)
Let $P$, $Q$ be distributions on the same support $\Omega$, and let $\mathcal{F}$ denote a family of functions $f: \Omega \to \mathbb{R}$. We have
$$D_{KL}(P \,\|\, Q) \;\geq\; \sup_{f \in \mathcal{F}} \mathbb{E}_{P}[f] - \log \mathbb{E}_{Q}\left[e^{f}\right].$$
Proof 1
Given a fixed function $f$, we can define a distribution $G$ by:
$$dG = \frac{1}{Z} e^{f} \, dQ, \qquad Z = \mathbb{E}_{Q}\left[e^{f}\right].$$
Equivalently, we have:
$$\mathbb{E}_{P}[f] - \log Z = \mathbb{E}_{P}\left[\log \frac{dG}{dQ}\right].$$
Then by construction, we have:
$$D_{KL}(P \,\|\, Q) - \left(\mathbb{E}_{P}[f] - \log Z\right) = \mathbb{E}_{P}\left[\log \frac{dP}{dQ} - \log \frac{dG}{dQ}\right] = \mathbb{E}_{P}\left[\log \frac{dP}{dG}\right] = D_{KL}(P \,\|\, G) \geq 0.$$
When the distribution $G$ is equal to $P$, this bound is tight.
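The bound can be checked numerically. In the toy example below (our own choice), $P = \mathcal{N}(0,1)$ and $Q = \mathcal{N}(1,1)$, for which $D_{KL}(P\|Q) = 1/2$ in closed form; the optimal critic $f^{*}(x) = \log \frac{dP}{dQ}(x) = \frac{1}{2} - x$ attains the bound, while an arbitrary other critic gives a strictly smaller value:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
xp = rng.normal(0.0, 1.0, n)  # samples from P = N(0, 1)
xq = rng.normal(1.0, 1.0, n)  # samples from Q = N(1, 1)

def dv_value(f):
    # Monte Carlo estimate of the Donsker-Varadhan objective E_P[f] - log E_Q[e^f].
    return f(xp).mean() - np.log(np.exp(f(xq)).mean())

kl_true = 0.5                       # closed form: (mu_P - mu_Q)^2 / 2 for unit variances
opt = dv_value(lambda x: 0.5 - x)   # optimal critic f* = log(dP/dQ)
sub = dv_value(lambda x: -0.5 * x)  # an arbitrary suboptimal critic

print(round(opt, 3), round(sub, 3))  # opt is close to 0.5; sub is strictly smaller
```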
Appendix F Balancing Performance of Adversarial Learning
In Figure 5, we visualize the learned representations on the IHDP and Jobs datasets using t-SNE. Compared to CFR-Wass, the representations learned by our method show better coverage of the treatment group over the control group. This showcases how adversarial balancing improves the performance of ABCEI, especially for population-level causal effect (ATE, ATT) inference.