1 Introduction
In domains such as healthcare, genomics, or social science there is high demand for data analysis that reveals causal relationships between independent and target variables. For example, doctors not only want models that accurately predict the status of patients, but also want to identify the factors that can improve it. The distinction between prediction and causation has at times been subject to debate in statistics and machine learning (Breiman et al., 2001; Shmueli, 2010). While machine learning has focused mostly on prediction tasks, in many scientific domains pure prediction without considering the underlying causal mechanisms is considered unscientific (Shmueli, 2010). In this work, we propose a causal regularizer that balances causal interpretability and predictive power.

We use the counterfactual causality framework (Pearl, 2000), in which one random variable (e.g., red wine consumption) causes another variable (e.g., a reduction in the risk of heart attack) if experimental testing of the first would be proven to change the distribution of the second (Spirtes, 2010). But there may also be competing explanations of the observed correlation between the two variables because of confounding (e.g., people of high socioeconomic status tend to drink more wine, and this is related to other lifestyle factors that cause a reduction in heart attacks), and these need to be reconciled in assessing the likelihood that the causal claim is true. Causal analytic methods can be used to prioritize what warrants testing in clinical trials among a diversity of hypotheses, or as primary evidence if controlled trials are not feasible or desirable (e.g., in climate science or health). In healthcare, in particular, it is common that an ensemble of many causal factors needs to occur simultaneously to have an effect on the target variable, a phenomenon we will call multivariate causation. Scalable methods are needed to explore the exponentially many combinations of the independent variables and their different transformations in order to detect multivariate causal relationships.

Methods for discovering causal relationships among multiple variables from observational data (Chickering, 2002; Kalisch and Bühlmann, 2007; Colombo et al., 2012) are largely based on the principle that any given set of causal relationships among multiple variables leaves well-defined marks in the joint distribution of the variables. However, when these methods are used for causal variable selection (Guyon et al., 2007; Cawley, 2008; Bontempi and Meyer, 2010; Sun et al., 2015), the process becomes very sensitive to small changes in the joint distribution of variables and may exclude many causal variables due to noise or selection bias in the data.

Our main idea is to design a causal regularizer that controls the complexity of the statistical models and at the same time favors causal explanations. Compared to the two-step procedure of causal variable selection followed by multivariate regression or classification, the proposed approach performs joint causal variable selection and prediction, thus avoiding the statistically sensitive thresholding of the causality scores in the causal variable selection step. It allows a few dependencies that cannot be explained via causation to still be included in the model, relaxing the variable selection procedure. Our technical contributions are as follows:

We use causality detectors to construct a causal regularizer that can guide predictive models towards learning causal relationships between the independent and target variables. We theoretically quantify the impact of the accuracy of the causality detector on the causal accuracy of the regularized models.

We propose a new nonlinear predictive model regularized by our causal regularizer, which allows causally interpretable neural networks.

Finally, we demonstrate that the proposed causal regularizer can be combined with neural representation learning techniques to efficiently detect multivariate causal hypotheses.
The proposed framework scales linearly with the number of variables, as opposed to many previous causal methods.
We applied the proposed algorithms to clinical predictive modeling problems using large EHR datasets: one on heart failure onset prediction and another on mortality prediction using the publicly available MIMIC III (Johnson et al., 2016) dataset. Altogether, we analyzed the collective influence of 17,081 independent variables on heart failure and validated the results by having a clinical expert manually review the findings in a blind setup. As shown in Figure 1, our proposed causally regularized algorithm significantly outperforms the baseline algorithms in causality detection performance. We show a similar boost in the causality score of the detected multivariate causal hypotheses. Finally, we show that the proposed algorithms are also competitive in predictive performance on both datasets.
2 Preliminaries on Causality Detection
We begin with a description of pairwise causal analysis of a single independent variable on the target variable, based on the independence of mechanisms (ICM) assumption, and then extend the pairwise causality detector to perform multivariate causality analysis in the next section. While our proposed causal regularizer can be constructed using any causality detection algorithm, a review of the ICM-based methods, as the state-of-the-art causality detection algorithms, is helpful because they are the baseline algorithms in the experiments.
[Figure 2, caption fragment: "... in each case and train a classifier to distinguish between these cases based on the (automatically learned) features of the joint distribution."]
We are interested in determining, from the joint distribution of an independent variable and the target variable, whether the former causes the latter, the latter causes the former, or the two are confounded. However, pairwise causality analysis is infeasible for arbitrary joint distributions. Thus, we need to resort to additional assumptions on the nature of the causal relationships. Recently, several algorithms have been proposed that distinguish between cause and effect based on the natural assumption that the steps in the process that generates the data are independent of each other; see (Lemeire and Dirkx, 2006; Janzing et al., 2012; Daniusis et al., 2010; Lopez-Paz, 2016; Chalupka et al., 2016; Kocaoglu et al., 2016) and the references therein. In this work, we follow (Lopez-Paz et al., 2016; Chalupka et al., 2016) to describe this causality detection approach. In the following sections, we describe our novel causal regularizer, which is designed based on this causality detection approach, and its application in nonlinear causality analysis and multivariate causal hypothesis generation.
Conceptual description of the independence between the cause and the mechanism. ICM states that the two processes of generating the cause and of mapping the cause to the effect are independent. In our case, we assume that when an independent variable causes the target, the marginal distribution of the cause and the conditional distribution of the effect given the cause are generated by independent higher-level distribution functions. Thus, we do not put assumptions on the functional form of the causal relationships between the variables of interest. ICM conforms to the scientific idea of Uniformitarianism (Gould, 1965) which, roughly put, states that the laws of nature apply to all objects similarly. ICM can be described in both a deterministic (Janzing and Schölkopf, 2010) and a probabilistic (Daniusis et al., 2010) sense; this work mainly uses the probabilistic interpretation.

ICM can be used to generate samples from distributions that agree with the possible graphical models over two observed variables and one unobserved variable shown in Figure 2, by requiring that the probability functions in the factorization of the joint distribution are independent of each other. The hidden variable can represent the other observed variables, which is critical in the design of the regularizer in the next section. Chalupka et al. (2016) developed an analytical likelihood ratio test that decides between the causal and anticausal cases (Figures 1(b) and 1(c)). However, taking the confounded cases into account is analytically difficult. Nevertheless, it is possible to generate samples from the scenarios in Figure 2 under the ICM and train a classifier to choose the maximum-likelihood causal structure given samples from the joint distribution. This is the key idea of the causality detectors in (Lopez-Paz et al., 2016; Chalupka et al., 2016) described in the rest of this section.
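The sampling idea above can be illustrated with a minimal sketch. It covers only two of the scenarios (direct causation and pure confounding), and every name, the discrete state sizes `k` and `m`, and the Dirichlet/uniform priors are illustrative assumptions rather than the distributions used in this work.

```python
import numpy as np

def sample_causal(rng, n, k=4):
    """X -> Y: P(X) and P(Y|X) are drawn independently of each other (ICM)."""
    p_x = rng.dirichlet(np.ones(k))        # mechanism that generates the cause
    p_y_given_x = rng.uniform(size=k)      # mechanism that maps cause to effect
    x = rng.choice(k, size=n, p=p_x)
    y = rng.binomial(1, p_y_given_x[x])
    return x, y

def sample_confounded(rng, n, k=4, m=3):
    """X <- Z -> Y: a hidden Z drives both variables; X has no effect on Y."""
    p_z = rng.dirichlet(np.ones(m))
    p_x_given_z = rng.dirichlet(np.ones(k), size=m)   # shape (m, k)
    p_y_given_z = rng.uniform(size=m)
    z = rng.choice(m, size=n, p=p_z)
    # Inverse-CDF sampling of X from the row of P(X|Z) selected by each z.
    cdf = np.cumsum(p_x_given_z[z], axis=1)
    u = rng.random(n)
    x = np.minimum((u[:, None] > cdf).sum(axis=1), k - 1)
    y = rng.binomial(1, p_y_given_z[z])
    return x, y

def make_training_set(n_samples=100, n=200, seed=0):
    """Alternate between the two scenarios; label 1 = causal, 0 = confounded."""
    rng = np.random.default_rng(seed)
    data, labels = [], []
    for i in range(n_samples):
        causal = i % 2 == 0
        sampler = sample_causal if causal else sample_confounded
        data.append(sampler(rng, n))
        labels.append(int(causal))
    return data, labels
```

Each call draws fresh distribution parameters, so every synthetic sample comes from a different joint distribution that nevertheless respects the assumed causal structure.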
Mathematical description of the causality detection algorithm. Formally, suppose we have a collection of independent variables. For each variable we observe a sample of paired observations of that variable and of a common target variable. For each sample, we are interested in the binary label that indicates whether the variable causes the target or not. In other words, we are interested in the function approximation problem of learning the mapping from samples to labels.
Several approaches can learn such a mapping function. When the independent and target variables are both discrete and finite, Chalupka et al. (2016) offer a means to construct the empirical joint distribution and train a supervised neural network as the mapping function. Lopez-Paz et al. (2016) learn a representation of each sample together with a neural network classifier, training both the representation learning function and the classification network jointly in a supervised way.
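As a rough illustration of the first approach, the sketch below turns one sample of a discrete variable and a binary target into its empirical joint distribution, which could then be fed to any supervised classifier. The function name and the assumption that the variable takes values in `0..k-1` are hypothetical.

```python
import numpy as np

def empirical_joint(x, y, k):
    """Flattened empirical joint distribution of (x, y) for x in 0..k-1 and y in {0, 1}."""
    counts = np.zeros((k, 2))
    np.add.at(counts, (x, y), 1.0)      # accumulate co-occurrence counts
    return (counts / counts.sum()).ravel()
```

The resulting vector has a fixed length of `2 * k` regardless of the sample size, which is what makes it usable as a classifier input.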
However, it is rare to have the true causal labels for training a causality detector. Instead, we generate synthetic datasets that represent the scenarios in Figure 2 based on the ICM assumption. The overall procedure is to generate samples from distributions that follow one of the ten possible scenarios in Figure 2. We need to select distributions that impose a minimum number of restrictions on the data, while the synthetically generated distributions should have statistics as similar as possible to those of our true data of interest. For example, in our datasets, the independent variables are counts of the number of disease codes in patients’ records (cf. Section 4). Thus, we sample the independent variable from a mixture of appropriate distributions for count data: the Zipf, Poisson, Uniform, and Bernoulli distributions. The hidden variable and the response variable are sampled from the Dirichlet and Bernoulli distributions, respectively. Details of our sampling and training procedures are provided in Appendix B and Algorithm B there.
3 Methodology
Given the causality detector in Section 2, we propose the causal regularizer for linear models in Section 3.1. In Section 3.2 we demonstrate that, using nonlinear deep neural networks regularized by our causal regularizer, we can learn nonlinear causal relationships between the independent and target variables. Finally, we show that the causal regularizer can efficiently explore the space of multivariate causal hypotheses and extract meaningful candidates for causality analysis.
3.1 The Causal Regularizer
Using the causality detection methods of the previous section for causal variable selection (Guyon et al., 2007; Cawley, 2008; Bontempi and Meyer, 2010; Sun et al., 2015) makes the variable selection process very sensitive to small changes in the joint distribution of variables and may exclude many causal variables due to noise or selection bias in the data. Ideally, if the ICM holds and if we had access to the true joint distributions and could discriminate between causal and noncausal variables with perfect accuracy, the two-step procedure would be sufficient. But observational datasets are usually not an accurate representation of the true probabilistic generative process because of measurement error and selection bias, which can perturb the causality scores generated by the neural network causality detector.
For example, consider the two-step analysis process of first finding, from a list of candidate variables, those that cause the target, and then performing a sparse multivariate regression on the selected variables to prioritize them. This procedure is sensitive because our causality detection algorithm might give soft scores that fall narrowly on either side of the cutoff to two variables. These soft scores can be interpreted as the probability that each variable is a cause of the target. If we use the two-step procedure, we will include one of the two variables in the regression model but not the other. However, the excluded variable could possibly contribute more to the predictive performance in the presence of the other variables in the multivariate regression. In other words, any hard cutoff for the purpose of two-step causal variable selection and regression raises the question of what the best cutoff threshold should be.
Instead, we propose a causally regularized regression approach, where this tradeoff is performed smoothly via a regularization parameter. We select variables that are both causal with high probability and also significantly predictive.
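One way to picture such a smooth trade-off is logistic regression with a per-variable weighted sparsity penalty, fitted by proximal gradient descent. The sketch below is only an illustration under stated assumptions: the `penalty` vector is assumed to be derived from detector scores (for instance one minus the estimated probability of being causal), and the solver, step size, and names are not taken from this work.

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of the weighted absolute-value penalty."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def causal_l1_logreg(X, y, penalty, lam=0.1, lr=0.1, n_iter=500):
    """Logistic regression with penalty lam * sum_j penalty[j] * |w[j]| (assumed form)."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (prob - y) / n                 # gradient of the mean log-loss
        w = soft_threshold(w - lr * grad, lr * lam * penalty)
    return w
```

Variables with a small penalty weight are shrunk very little, so they stay in the model whenever they are predictive, while variables with a large weight need substantial predictive value to survive.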
Causal Regularizer. Given the causality detector of Section 2, assume that we have a classifier that outputs a causality score for each independent variable. We can then design the following regularizer to encourage learning a causal predictive model:
(1) 
where the loss function measures the error in predicting the target variable given the independent variables. The above regularization term is the $\ell_1$ norm version of the causal regularizer, which will be used in our experiments; a version based on another norm can be defined similarly.

The first term in Eq. (1) is a multivariate analysis term, whereas the regularizer is constructed using a bivariate causality score between each independent variable and the target variable. This does not create a problem because in the design of the causal regularizer we have implicitly included the other variables as hidden variables in the analysis, which allows the regularizer to be used with multivariate regression. That is, the rest of the observed independent variables can be considered as hidden variables in our bivariate causality analysis, which allows proper regularization. The proposed causal regularizer is also a decomposable regularizer, which makes the analysis of its theoretical properties easier (Negahban et al., 2012).

The interplay between causation and prediction has been studied recently; see (Peters et al., 2015; Rojas-Carulla et al., 2015) and the references therein. In particular, the notion of a causal regularizer was previously recognized as possible (Lopez-Paz, 2016, p. 181; Lopez-Paz et al., 2016); however, a specific causal regularizer has never been developed and evaluated. Notice that using the score of a classifier that distinguishes only the causal and anticausal cases, without including the confounding cases, as e.g. in (Lopez-Paz et al., 2016), cannot properly regularize a multivariate model such as logistic regression. Moreover, a major novelty of our proposed causal regularizer is that it performs joint causal variable selection (via the $\ell_1$ regularization) and prediction, which the idea in (Lopez-Paz et al., 2016) cannot.
3.1.1 Analysis of Causal Regularization
The following theorem uses a simple setting to quantify the impact of the norm based causal regularizer.
Theorem 1
Consider the following general linear model (we have intentionally kept the setting of this theorem simple to obtain readable results; results for more general settings can be obtained, potentially at the expense of cluttering the results):
where the noise variable
is a zero mean random variable with variance
and a distribution that satisfies the regularity conditions of Theorem 3.2 in (White, 1982). We assume that causes but does not and its correlation with is due to an unobserved confounder. We have access to an imperfect causality detector with and , for . Without loss of generality, assume that. Under this setting, the causality accuracy of an estimate
is defined as follows:
Consider the fixed design setting where an i.i.d. sample of size is drawn from the model as follows:
where , , and . For cleanness of the results, we study the orthonormal design setting where . Using this sample, we obtain two estimates for : and which are the result of norm and norm based causally regularized regression, respectively. Asymptotically, as , we have the following results:
(2)  
(3) 
where
denotes the CDF of the unit Gaussian distribution.
A proof is provided in Appendix A. To understand the result, it is helpful to consider several special cases. When the causal detector is perfect (), we can rewrite as follows
Compared to Eq. (3), we see a factor scaling of the causal coefficient against the noncausal coefficient in the numerator, increasing the chance of correct causality detection. That is, a perfect causality detector guarantees causal interpretability if the magnitude of outweighs the predictive advantage of over . When the causal detector is random (), we can show that . That is, a noninformative causality detector makes causal regularization equivalent to standard regularization. Finally, in the limit of large penalization coefficient, we obtain:
The impact of the error rate of the causality detector in the numerator can be seen as a linear scaling of the causal coefficient by and the noncausal factor by .
Another property of the causal regularizer is that the two-step analysis can be cast as a form of causal regularization where we use hard scores instead of soft scores. Consider the following setting:
where if and otherwise. Now, consider the limiting case of and . This case corresponds to the two-step procedure with regularized logistic regression.
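The contrast between soft and hard scores can be sketched as follows. The specific conventions here are assumptions made for illustration only: scores are read as the probability that a variable is causal, soft penalty weights are taken to be one minus the score, and the hard variant gives selected variables an ordinary unit penalty and excluded variables an infinite one.

```python
import numpy as np

def soft_penalties(scores):
    """Soft per-variable penalty weights (assumed form: 1 - causality score)."""
    return 1.0 - np.asarray(scores, dtype=float)

def hard_penalties(scores, threshold=0.5):
    """Two-step analogue: unit penalty if selected, infinite penalty if excluded."""
    scores = np.asarray(scores, dtype=float)
    return np.where(scores >= threshold, 1.0, np.inf)
```

With hard penalties, two variables scored 0.49 and 0.51 are treated completely differently, whereas the soft weights differ by only 0.02.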
3.2 Causal Regularizers in Neural Networks
We demonstrate two key scenarios of using the causal regularizer as shown in Figure 3.
Nonlinear Modeling. The linear model in Eq. (1) assumes that the strength of the impact of each independent variable on the target variable is fixed. However, according to the probabilistic view of causation (Pearl, 2000), the strength of causation can change from subject to subject. Thus, we need nonlinear extensions of logistic regression that can be regularized by the causal regularizer and steered towards being causal.
To address this problem, we seek neural network architectures that represent the impact of each independent variable by a single coefficient (that can change for each subject) and regularize the coefficients with the causal regularizer. In particular, we propose the following nonlinear generalized linear model:
(4) 
where the embedding matrix maps the input to a lower-dimensional representation space and the symbol $\odot$ denotes the elementwise product. The logistic sigmoid function maps real values to the $(0, 1)$ interval. One term acts as the skip connection and is initialized by the result of the logistic regression. The embedding allows dealing with a very large set of discrete concepts and can be initialized via techniques such as skip-gram (Mikolov et al., 2013) or GloVe (Pennington et al., 2014). The remaining vector in Eq. (4) is computed using a Multilayer Perceptron (MLP).

The model in Eq. (4) is a nonlinear extension of logistic regression that is suitable for causal regularization. We can reorder the equations to write the right-hand side of Eq. (4) as a logistic regression whose regression coefficient can change with every input. Each coordinate of the new regression coefficient is calculated from the corresponding column of the embedding matrix. The variability of the coefficient for each input enables us to perform individual causality analysis. For training, we can penalize the coefficients and minimize the following loss function
(5) 
where the loss term denotes the negative log-likelihood of the model described by Eq. (4). The change of the prediction vector with each sample can be related to the probabilistic definition of causation (Pearl, 2000), in the sense that the strength of causality may change from one subject to another.
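To make the idea of input-dependent coefficients concrete, here is a small NumPy sketch. The exact parameterization is an assumption (an embedding `W_emb`, a weight vector `v`, a skip vector `w_skip`, and a one-layer tanh stand-in for the MLP), chosen so that the logit can be rewritten as an inner product between a per-input coefficient vector and the input; it should not be read as the precise form of Eq. (4) or Eq. (5).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_dependent_coefficients(x, W_emb, v, w_skip, A, b):
    """Per-input regression coefficients beta(x) under the assumed parameterization."""
    h = W_emb @ x                       # embed the input, shape (d,)
    alpha = np.tanh(A @ h + b)          # stand-in for the MLP output, shape (d,)
    # Coordinate j uses column j of the embedding matrix plus the skip weight.
    return W_emb.T @ (v * alpha) + w_skip

def predict_proba(x, params):
    beta = input_dependent_coefficients(x, *params)
    return sigmoid(beta @ x)

def penalized_loss(X, y, penalties, lam, params):
    """Mean negative log-likelihood plus a weighted penalty on beta(x) per sample."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        beta = input_dependent_coefficients(x_i, *params)
        z = beta @ x_i
        nll = np.logaddexp(0.0, z) - y_i * z        # numerically stable log-loss
        total += nll + lam * np.sum(penalties * np.abs(beta))
    return total / len(X)
```

Because the coefficients are recomputed for every input, inspecting `input_dependent_coefficients` for a single subject gives a per-subject view of which variables drive the prediction.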
Multivariate Causal Hypothesis Generation. A key application of our proposed causal regularizer in conjunction with deep representation learning is to efficiently extract multivariate causal hypotheses from the data. Figure 2(b) shows an example of causal hypothesis generation where the hypotheses are generated via an MLP. We assume that there is a representation learning network whose multidimensional output takes values in a range determined by its activation function, for example $(0, 1)$ for sigmoid and $[0, \infty)$ for ReLU activation functions. Our goal is to force each dimension of the learned representation to be causal, so that each coordinate of the representation can be used as a multivariate causal hypothesis. In particular, we aim at minimizing the following objective function:
(6)
Our approach is to train an anticausality detector based on (Lopez-Paz et al., 2016) and design the regularizer based on its score. Then, as shown in Figure 2(b), we can combine it with the neural network to regularize the coefficients of the last layer of the MLP, which predicts the labels from the learned representation. The weights of the lower layers are also regularized to make the generated causal hypotheses simple and interpretable.
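The regularization of the last layer can be sketched as below. Everything here is a hedged illustration: `detector` is a hypothetical callable assumed to return the probability that a hidden dimension is causal for the label given one batch, the penalty form (one minus the score times the absolute output weight) is an assumption, and `toy_detector` is only a correlation-based placeholder, not a causality detector.

```python
import numpy as np

def hypothesis_penalty(H, y, w_out, detector):
    """Score every hidden dimension on the batch and weight the last-layer penalty."""
    scores = np.array([detector(H[:, k], y) for k in range(H.shape[1])])
    penalty = float(np.sum((1.0 - scores) * np.abs(w_out)))
    return penalty, scores

def toy_detector(h, y):
    """Placeholder score in [0, 1] based on absolute correlation (illustration only)."""
    if np.std(h) == 0 or np.std(y) == 0:
        return 0.0
    return float(abs(np.corrcoef(h, y)[0, 1]))
```

In a real pipeline the scores would come from a frozen, pre-trained detector network evaluated on each training batch.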
The learning process has two steps. First, the causality detector network is trained on a synthetic dataset in which the causal and anticausal scenarios are assigned opposite binary labels. We select the nonlinearity of the representation network to be the logistic sigmoid function; thus we use the Beta distribution for generating the synthetic data for training the causality classifier. In the second step, the coefficients of the causality detector are fixed and we train the rest of the parameters in Eq. (6). To train the network, we select batches with a fixed size of 200 examples. The size of the batches indicates the number of samples that are available to the causality detector. We select this number to be large enough that the error rate of the causality detector in (Lopez-Paz et al., 2016) falls below a target level.
4 Experiments
We evaluate the proposed causal regularizer in Section 3.1 both in terms of its predictive and causal performance. Next, we compare the quality of the codes identified as causes of heart failure by different approaches. Finally, we evaluate the performance of multivariate causal hypothesis generation by qualitatively analyzing the extracted hypotheses. We defer evaluation of the causality detection algorithms to Appendix B, as they are not the main contributions of this work. Table 1 lists the acronyms and symbols for the techniques used in the experiments to improve the presentation.
Table 1: Acronyms and symbols for the techniques used in the experiments.
Symbol | Description
CD | Causality detector, described in Section 2
[symbol missing] | The output of a causality detector
LogCause | Logistic regression regularized by the causal regularizer
LogL | Logistic regression regularized by the $\ell_1$ regularizer
Twostep | The two-step procedure of causal variable selection and logistic regression, as discussed in Section 3.1
[symbol missing] | The regression coefficients of an algorithm (one of LogCause, Twostep, or LogL)
nonlinCause | The nonlinear causality analysis model in Eq. (4)
CauseHyp | The multivariate causal hypothesis generation described in Eq. (6)
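Table 2: Predictive performance (AUC) on the Heart Failure and MIMIC III datasets.
Algorithm | Heart Failure (AUC) | MIMIC III (AUC)
LogCause | |
LogL | |
TwoStep | |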
4.1 Data
The Sutter Health heart failure (HF) dataset consists of Electronic Health Records of middle-aged adults collected by Sutter Health for the study of heart failure. From the encounter records, medication orders, procedure orders, and problem lists, we extracted visit records consisting of diagnosis, medication, and procedure codes.
Given a patient’s visit sequence, we try to predict whether the patient will be diagnosed with heart failure (HF) and identify the key causes of increased heart failure risk. To this end, 3,884 cases are selected and approximately 10 controls are selected for each case (28,903 controls). The case/control selection criteria are fully described in Appendix D. Cases have index dates that denote the date they are diagnosed with HF. Controls have the same index dates as their corresponding cases. We extract diagnosis codes, medication codes, and procedure codes from the 18-month window before the index date. In total, there are 17,081 unique medical codes in this dataset.
The MIMIC III dataset (Johnson et al., 2016) is a publicly available dataset consisting of medical records of intensive care unit (ICU) patients over 11 years. We use a public query (https://github.com/MITLCP/mimiccode/blob/master/concepts/cookbook/mortality.sql) to extract the binary mortality labels for the patients. Our goal is to use the codes in the patients’ last visit to the ICU to predict their mortality outcome. Our dataset includes 46,520 patients, of whom 5,810 are deceased (mortality=1). A total of 14,587 different medical codes are used in this dataset.
Feature construction. Given the sequence of visits for each patient, we create a feature vector by counting the number of times each code is observed in the patient’s records. Given the large variations in the number of codes, we logarithmically bin the count data into 16 bins. The final data consist of pairs of a feature vector and the patient’s label: the heart failure outcome and the mortality outcome in the heart failure and MIMIC III datasets, respectively.
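The feature construction step can be sketched as follows. The binning rule is an assumption, since the exact bin edges are not given here: counts are mapped to `floor(log2(count + 1))` and capped at the last of the 16 bins. The function names and the toy codes in the usage note are hypothetical.

```python
import numpy as np

def count_features(visits, vocab):
    """Count how often each code in `vocab` occurs across one patient's visits."""
    index = {code: j for j, code in enumerate(vocab)}
    counts = np.zeros(len(vocab), dtype=int)
    for visit in visits:
        for code in visit:
            counts[index[code]] += 1
    return counts

def log_bin(counts, n_bins=16):
    """Logarithmic binning (assumed rule): floor(log2(count + 1)), capped at n_bins - 1."""
    bins = np.floor(np.log2(np.asarray(counts) + 1)).astype(int)
    return np.minimum(bins, n_bins - 1)
```

For example, a patient with visits `[["dx_a", "rx_b"], ["dx_a"]]` and vocabulary `["dx_a", "rx_b", "px_c"]` would get counts `[2, 1, 0]` before binning.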
Training details. Because we generate synthetic datasets for training the causality detector neural networks, we can generate as many new batches of data for training and parameter tuning purposes as required. For training and parameter tuning of the models in Section 3, we perform the common 75%/10%/15% training/validation/test splits. The full details of the training procedure for the neural networks are given in Appendix C.
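A 75%/10%/15% split of patient indices can be produced with a short helper like the one below. The function name, the random shuffling, and the fixed seed are illustrative assumptions; the split procedure actually used is described in Appendix C.

```python
import numpy as np

def split_indices(n, fractions=(0.75, 0.10, 0.15), seed=0):
    """Shuffle 0..n-1 and cut it into train/validation/test index arrays."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_train = int(round(fractions[0] * n))
    n_val = int(round(fractions[1] * n))
    # The test set takes whatever remains, so the three parts always cover all indices.
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]
```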
Table 3: Examples of multivariate causal hypotheses generated via the causal regularizer.
Name | Conditions | Description
Aortic Dissection from Trauma | Dissection of aorta; Burn in multiple sites of trunk; Abdominal pain, lower left quadrant | This collection of diagnoses is especially causal for heart failure, as heart failure can manifest as a complication of dissection of aorta. Dissection of aorta can present with abdominal pain, and may happen in traumatic injuries that involve burn of unspecified degree of other and multiple sites of trunk, occurring together.
Kidney Neoplasm and Severe Infections | Malignant neoplasm of kidney; History of infectious and parasitic diseases; Tuberculosis of lung | Neoplasms in the kidney may lead to paraneoplastic systemic effects that may lead to heart failure. Furthermore, having concurrent severe infections such as tuberculosis can also increase the risk of heart failure.
Metabolic Syndrome with Concurrent Infections and Pregnancy | Metabolic syndrome; Tuberculosis of lung; Obstetrical pulmonary embolism | Metabolic syndrome co-occurring with severe infections such as tuberculosis can lead to heart failure. Obstetrical pulmonary embolisms can lead to acute heart failure.
4.2 Predictive performance evaluation
Table 2 shows the test accuracy of heart failure and mortality prediction on the heart failure and MIMIC III datasets, respectively. We ran each algorithm ten times and report the mean and standard deviation of the performance measures. As we can see, the proposed causal regularizer does not hurt the predictive performance, whereas the two-step procedure significantly reduces the accuracy.
An interesting phenomenon, shown in Figure 4, is the relative robustness of the performance with respect to the value of the penalization parameter compared to the plain regularization case. This robustness comes as no surprise, because the causal regularizer assigns very small penalization coefficients to the causal variables and, as we discussed in Section 3.1, only very high values of penalization can force all coefficients to become zero; see Figures 3(c) and 3(f), which show the sparsity results. The predictive robustness of the causal regularizer can also be partially attributed to the invariant prediction property of causal models (Peters et al., 2015). That is, the robustness can be due to the fact that the causal regularizer might match the true generative process of the dataset better than the flat regularizer and put the model under less pressure as we increase the penalization parameter. We demonstrate the predictive gain of nonlinCause in Figure 4(a). Furthermore, the impact of changing the regularization parameter on the number of selected variables is visualized in Figures 7(a) and 7(b) in Appendix B.2.
4.3 Causality detection performance evaluation
The risk factors for heart failure are well studied in the medical literature, making the heart failure condition an ideal case for the study of causality. To evaluate the causality detection performance of the algorithms, we generate the top 100 influential factors by each method. We ask a clinical expert to label each factor as “causal”, “not causal”, or “potentially causal” and assign a numeric score to each label. To prevent bias by the expert, we ask him to label a single list of all unique codes in the three lists and use this list to find the scores for the individual lists. Figure 1 shows the average causality score of each algorithm based on the labels provided by the medical expert. As expected, the plain regularized logistic regression (LogL) performs poorly, as it is susceptible to the impact of confounded variables. The performance of the causally regularized logistic regression is superior to that of the two-step procedure, which suggests that picking factors that are both causal and highly predictive leads to a better causality score. The result in Figure 1 together with the predictive results in Table 2 confirm that the causal regularizer can be efficiently used for finding a few causal variables that are highly predictive of the target quantity.
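The blinded scoring protocol can be summarized in a few lines. The numeric values attached to the three labels in the test below are placeholders chosen for illustration, since the actual values are not recoverable here, and the function names are hypothetical.

```python
def pooled_factors(*top_lists):
    """Single de-duplicated list shown to the expert, hiding which method produced what."""
    return sorted(set().union(*top_lists))

def average_causality_score(top_factors, expert_labels, label_scores):
    """Average the expert's label scores over one method's top factors."""
    total = sum(label_scores[expert_labels[f]] for f in top_factors)
    return total / len(top_factors)
```

The expert labels each pooled factor once, and the same label dictionary is then reused to score every method's list.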
The qualitative advantages of the regularized approach can also be seen in the results in Table 5 in Appendix E. We have marked the disease codes that can potentially increase the risk of heart failure but whose predicted causality score falls below the cutoff, so that the two-step procedure would have eliminated them from the predictor set (as shown in Table 6 in Appendix E). Thus, the causal regularizer approach is able to establish a balance between prediction and causation and produce clinically more plausible results.
4.4 Evaluating the multivariate causal hypotheses
We evaluate the performance of the proposed causal hypothesis generation against the case when we do not use any causal regularization. We generate two lists of the top 30 hypotheses using the two algorithms and ask our medical expert to label each hypothesis as causal, noncausal, or possibly causal, with a corresponding numeric score for each label. The results in Figure 4(b) show that the causal regularizer can increase the causality score of the hypotheses. We also provide a qualitative analysis of the causal hypotheses generated by our algorithm by picking several hypotheses and showing that they are clinically meaningful. Three examples of multivariate causal hypotheses generated via the causal regularizer, together with descriptions of their clinical meaning, are shown in Table 3.
5 Conclusion and Discussion
We addressed the problem of exploring the highdimensional causal hypothesis space in applications such as healthcare. We designed a causal regularizer that maximally steers predictive models towards causally explainable models. The proposed causal regularizer, based on our causality detector, does not increase the computational complexity of the regularizer and can be seamlessly integrated with a neural network to perform nonlinear causality analysis. We also demonstrated the application of the proposed causal regularizer in generating multivariate causal hypotheses. Finally, we demonstrated the usefulness of the causal regularizer in detecting the risk factors of heart failure using an electronic health records dataset.
Acknowledgment
The authors would like to thank Frederick Eberhardt for helpful discussions. Mohammad Taha Bahadori acknowledges the previous discussions with David C. Kale and Micheal E. Hankin on the concept of causal regularizer. This work was supported by the National Science Foundation, award IIS#1418511 and CCF#1533768, research partnership between Children’s Healthcare of Atlanta and the Georgia Institute of Technology, CDC ISMILE project, Google Faculty Award, Sutter Health, UCB and Samsung Scholarship. Krzysztof Chalupka’s work was supported by the NSF grant #1564330.
References
 Bontempi and Meyer (2010) Bontempi, G. and P. E. Meyer (2010). Causal filter selection in microarray data. In ICML.
 Breiman et al. (2001) Breiman, L. et al. (2001). Statistical modeling: The two cultures. Statistical Science.

 Cawley (2008) Cawley, G. C. (2008). Causal and noncausal feature selection for ridge regression. In WCCI Causation and Prediction Challenge, pp. 107–128.
 Chalupka et al. (2016) Chalupka, K., F. Eberhardt, and P. Perona (2016). Estimating the causal direction and confounding of two discrete variables. arXiv Preprint.
 Chickering (2002) Chickering, D. M. (2002). Optimal structure identification with greedy search. JMLR.
 Colombo et al. (2012) Colombo, D., M. H. Maathuis, M. Kalisch, and T. S. Richardson (2012). Learning high-dimensional directed acyclic graphs with latent and selection variables. Ann. Stat.
 Daniusis et al. (2010) Daniusis, P., D. Janzing, J. Mooij, J. Zscheischler, B. Steudel, K. Zhang, and B. Schölkopf (2010). Inferring deterministic causal relations. In UAI.
 Gould (1965) Gould, S. J. (1965). Is uniformitarianism necessary? Am. J. Sci. 263(3), 223–228.
 Guyon et al. (2007) Guyon, I., C. Aliferis, and A. Elisseeff (2007). Causal feature selection. Computational methods of feature selection, 63–82.
 Janzing et al. (2012) Janzing, D., J. Peters, E. Sgouritsa, K. Zhang, J. M. Mooij, and B. Schölkopf (2012). On causal and anticausal learning. In ICML.
 Janzing and Scholkopf (2010) Janzing, D. and B. Scholkopf (2010). Causal inference using the algorithmic markov condition. IEEE Trans. Inf. Theory.
 Johnson et al. (2016) Johnson, A. E., T. J. Pollard, L. Shen, L.-w. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark (2016). MIMIC-III, a freely accessible critical care database. Scientific Data 3.
 Kalisch and Bühlmann (2007) Kalisch, M. and P. Bühlmann (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm. JMLR.
 Kocaoglu et al. (2016) Kocaoglu, M., A. G. Dimakis, S. Vishwanath, and B. Hassibi (2016). Entropic causal inference. arXiv preprint arXiv:1611.04035.
 Lemeire and Dirkx (2006) Lemeire, J. and E. Dirkx (2006). Causal models as minimal descriptions of multivariate systems.
 LopezPaz (2016) LopezPaz, D. (2016). From dependence to causation. Ph. D. thesis, University of Cambridge.
 LopezPaz et al. (2016) LopezPaz, D., R. Nishihara, S. Chintala, B. Schölkopf, and L. Bottou (2016). Discovering causal signals in images. arXiv:1605.08179.
 Mikolov et al. (2013) Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013). Distributed representations of words and phrases and their compositionality. In NIPS.
 Negahban et al. (2012) Negahban, S. N., P. Ravikumar, M. J. Wainwright, and B. Yu (2012). A unified framework for high-dimensional analysis of estimators with decomposable regularizers. Statist. Sci.
 Pearl (2000) Pearl, J. (2000). Causality. Cambridge university press.
 Pennington et al. (2014) Pennington, J., R. Socher, and C. D. Manning (2014). Glove: Global vectors for word representation. EMNLP.

 Peters et al. (2015) Peters, J., P. Bühlmann, and N. Meinshausen (2015). Causal inference using invariant prediction: identification and confidence intervals. JRSSB.
 RojasCarulla et al. (2015) RojasCarulla, M., B. Schölkopf, R. Turner, and J. Peters (2015). Causal transfer in machine learning.
 Shmueli (2010) Shmueli, G. (2010). To explain or to predict? Statistical science.
 Spirtes (2010) Spirtes, P. (2010). Introduction to causal inference. JMLR.
 Sun et al. (2015) Sun, Y., J. Li, J. Liu, C. Chow, B. Sun, and R. Wang (2015). Using causal discovery for feature selection in multivariate numerical time series. Machine Learning.
 Wainwright and Jordan (2008) Wainwright, M. J. and M. I. Jordan (2008). Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning 1(1–2), 1–305.
 White (1982) White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica.
Appendix A Proof of the theorem
The proof rests on two results: first, the connection between the two estimates and the ordinary least squares estimate; second, the known asymptotic normality of maximum likelihood estimation under misspecification. We need the latter because the theorem does not require the noise distribution to be Gaussian, which is what allows causality detection.
We can write both of the based causal regularization and Ridge regression in the following unified format:
where is a diagonal matrix. It is equal to the identity matrix for the ridge regression and for the causal regularization. Given this unified formulation, we can represent the estimates in the theorem in terms of the ordinary least squares estimate as follows:
(7) 
where . According to the asymptotic normality of the quasimaximum likelihood estimation in (White, 1982), as , the ordinary least squares estimate is normal with the following distribution :
(8) 
Given the results in Eqs. (7) and (8), we can find the distributions for the quantities of interest:
(9) 
where in the last step we have used the results on linear transformation of multivariate normal variables. Now, we can use the result in Eq. (9) to write:
(10) 
where is the cdf of the unit Gaussian distribution. Substituting for the ridge regression and for the causal regularization, we obtain the result in the theorem.
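The first step of the proof can be checked numerically. The sketch below assumes that the unified format is a generalized ridge problem, minimizing ||y - X b||^2 + lam * sum_j d_j b_j^2 with a diagonal penalty d (the displayed equation is not reproduced above, so this form is an assumption). It verifies that the penalized estimate equals a fixed linear map applied to the ordinary least squares estimate, which is the property that lets asymptotic normality carry over. The helper names and the weights in the second penalty vector are illustrative only.

```python
import random

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def solve(A, B):
    """Solve A Z = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for c in range(n):
        pivot_row = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot_row] = M[pivot_row], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                factor = M[r][c]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

def penalized_gram(X, lam, d):
    """Return X^T X + lam * diag(d)."""
    G = matmul(transpose(X), X)
    return [[G[i][j] + (lam * d[i] if i == j else 0.0) for j in range(len(d))]
            for i in range(len(d))]

def generalized_ridge(X, y, lam, d):
    """Minimizer of ||y - X b||^2 + lam * sum_j d[j] * b[j]^2 (assumed form)."""
    Xty = matmul(transpose(X), [[v] for v in y])
    return [row[0] for row in solve(penalized_gram(X, lam, d), Xty)]

def shrinkage_map(X, lam, d):
    """Matrix (X^T X + lam D)^{-1} X^T X that maps the OLS estimate to the penalized one."""
    return solve(penalized_gram(X, lam, d), matmul(transpose(X), X))

rng = random.Random(0)
n, p, lam = 200, 3, 5.0
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
# Uniform (non-Gaussian) noise, in the spirit of the misspecified setting.
y = [row[0] - 2 * row[1] + 0.5 * row[2] + rng.uniform(-1, 1) for row in X]

beta_ols = generalized_ridge(X, y, 0.0, [1.0] * p)
max_gap = 0.0
# First penalty: identity (ridge). Second: hypothetical per-feature weights.
for d in ([1.0, 1.0, 1.0], [0.1, 0.9, 0.5]):
    direct = generalized_ridge(X, y, lam, d)
    T = shrinkage_map(X, lam, d)
    via_ols = [sum(T[i][j] * beta_ols[j] for j in range(p)) for i in range(p)]
    max_gap = max(max_gap, max(abs(a - b) for a, b in zip(direct, via_ols)))
print(max_gap)
```

Because the map is linear and depends only on the design matrix and the penalty, a normal limit for the least squares estimate implies a normal limit for both penalized estimates.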
Appendix B Details of causality detector design
We first describe the sampling process for generating the synthetic datasets used for training the causality detector algorithms in Section B.1. Next, in Section B.2, we evaluate the impact of the proposed sampling procedure on the quality of the causality detection algorithms. Algorithm B summarizes the process described in Section 2.
Given the data, perform the following steps:
i. Generate data samples for from according to the ten cases in Figure 2.
ii. Assign label 1 to the cases in Figures 1(b), 1(d), 1(g) and 1(i) and label 0 to the rest.
iii. Train a classifier to classify them as causation (label=1) or not-causation (label=0). Because this is a synthetic dataset, we know these labels and can use supervised learning.
iv. On the test set, construct the test sample sets and use the classifier in step iii to classify the example.
b.1 Sampling procedure for count variables
As described in Section 4, our independent variables are count-valued. Thus, we need to generate data from distributions for count data, such as Poisson or Zipf distributions, with a fixed support size of 16. Looking at the histogram of the maximum number of code occurrences in Figure 6, we observe that many codes occur at most once or twice. Thus, we also generate binary and ternary distributions from flat Dirichlet distributions. Finally, to make sure that the space is fully spanned, we also generate samples from a Dirichlet distribution with 16 categories. In summary, the is the mixture of these five distributions. The parameters of Poisson and Zipf are sampled from distribution.
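The five-component mixture described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the equal mixture weights, the parameter ranges for the Poisson rate and the Zipf exponent, and the truncation of both families to 16 values are all assumptions, since the source does not state them here. All function names are hypothetical.

```python
import math
import random

SUPPORT = 16  # fixed support size stated in the text

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def flat_dirichlet(k, rng):
    """Draw from Dirichlet(1, ..., 1) by normalizing k Gamma(1, 1) variates."""
    return normalize([rng.gammavariate(1.0, 1.0) for _ in range(k)])

def pad(probs, size=SUPPORT):
    """Embed a short probability vector in the common 16-value support."""
    return probs + [0.0] * (size - len(probs))

def truncated_poisson(rate, size=SUPPORT):
    """Poisson pmf restricted to counts 0..size-1 and renormalized."""
    return normalize([rate ** k / math.factorial(k) for k in range(size)])

def truncated_zipf(exponent, size=SUPPORT):
    """Zipf pmf over ranks 1..size, stored at indices 0..size-1."""
    return normalize([(k + 1) ** -exponent for k in range(size)])

def sample_marginal(rng):
    """Pick one of the five families (equal weights assumed) and return its pmf."""
    kind = rng.choice(["poisson", "zipf", "binary", "ternary", "dirichlet16"])
    if kind == "poisson":
        probs = truncated_poisson(rng.uniform(0.5, 5.0))  # assumed range
    elif kind == "zipf":
        probs = truncated_zipf(rng.uniform(1.0, 3.0))     # assumed range
    elif kind == "binary":
        probs = pad(flat_dirichlet(2, rng))
    elif kind == "ternary":
        probs = pad(flat_dirichlet(3, rng))
    else:
        probs = flat_dirichlet(SUPPORT, rng)
    return kind, probs

rng = random.Random(0)
kind, probs = sample_marginal(rng)
draws = rng.choices(range(SUPPORT), weights=probs, k=10)
print(kind, draws)
```

Padding the binary and ternary cases with zeros keeps every sampled distribution on the same 16-value support, so downstream feature vectors have a fixed length.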
Let denote a discrete distribution with parameter and given support size . Direct : Sample . Generate . Sample for times. Compute the dimensional vector .
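For the direct case, the procedure can be sketched as below. The symbols in the algorithm box did not survive extraction, so the details here are assumptions: the marginal of the cause and each conditional of the effect are drawn from flat Dirichlet distributions, the effect is binary, and the resulting feature is the flattened empirical joint distribution of the sampled pairs. The function names are hypothetical.

```python
import random

def dirichlet(k, rng):
    """Draw from a flat Dirichlet by normalizing k Gamma(1, 1) variates."""
    g = [rng.gammavariate(1.0, 1.0) for _ in range(k)]
    total = sum(g)
    return [v / total for v in g]

def sample_direct_case(n_samples, x_support=16, y_support=2, seed=0):
    """Sample pairs from X -> Y and return the flattened empirical joint."""
    rng = random.Random(seed)
    p_x = dirichlet(x_support, rng)                       # marginal of the cause
    p_y_given_x = [dirichlet(y_support, rng)              # one conditional per value of X
                   for _ in range(x_support)]
    counts = [[0] * y_support for _ in range(x_support)]
    for _ in range(n_samples):
        x = rng.choices(range(x_support), weights=p_x)[0]
        y = rng.choices(range(y_support), weights=p_y_given_x[x])[0]
        counts[x][y] += 1
    # Row-major flattening: index x * y_support + y.
    return [c / n_samples for row in counts for c in row]

features = sample_direct_case(n_samples=1000)
print(len(features), round(sum(features), 6))
```

Under these assumptions the feature vector has one entry per (count value, label) pair, which is what a classifier over joint distributions would consume.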
Sampling from the other graphical models in Figure 2 is done by writing the factorization, sampling along the directed edges, and finally marginalizing with respect to the hidden variables (Wainwright and Jordan, 2008). The hidden variables are selected to be categorical variables with cardinality selected uniformly from the integers in the interval . The conditional distribution of the hidden variables is selected to be a Dirichlet distribution with all-ones parameter vector.
b.2 Evaluating the causality detector
Table 4 shows two advantages of the proposed sampling procedure for count data in comparison to the binary case proposed by Chalupka et al. (2016). First, in the synthetic dataset, the test error is significantly lower. This is because the size of input to the neural causality detector is compared to for the binary case. Second, applying the causality detectors to our data, we observe that the causality scores generated by our sampling scheme have significantly higher correlation with the mutual information between the independent variables and the target label. Figure 7 highlights another advantage of the sampling procedure for count data: it identifies a larger portion of the variables as noncausal, which is more in line with expectations. Table 6 shows that the mutual information identifies V70.0 (Routine general medical examination at a health care facility) as highly correlated, but the causality detector correctly identifies it as noncausal with causality score .
In particular, in Figure 7(a), the Spearman’s rank correlation is which indicates a strong correlation. This is intuitive, as we expect causal connections to create stronger correlations on average. Another consequence of the large correlation is that it makes regularization by the noncausality scores safer and guarantees that it will not significantly hurt the predictive performance. In Figure 7(a), we have marked four codes in the four corners of the figure. As an example of a highly correlated and causal code, we can point to 250.00 (Diabetes mellitus without mention of complication), which is a known cause of heart failure. Code 362.01 (Background diabetic retinopathy) is an effect of diabetes, a common cause of heart failure. Code V06.5 (Need for TD vaccination) is an example of a code that is neither causal nor correlated. Finally, code 365.00 (Preglaucoma) is known for increasing the risk of heart failure, despite the fact that it is not very correlated with heart failure.
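For readers who want to reproduce this kind of comparison, Spearman’s rank correlation between two score lists (for example, causality scores and mutual information per code) can be computed as the Pearson correlation of their ranks. The sketch below uses average ranks for ties; the two input lists are made-up toy values, not data from the paper.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def spearman(a, b):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(a), average_ranks(b))

# Toy inputs (hypothetical values for five codes).
causality_scores = [0.10, 0.25, 0.40, 0.80, 0.95]
mutual_information = [0.05, 0.06, 0.07, 0.08, 0.07]
rho = spearman(causality_scores, mutual_information)
print(round(rho, 4))
```

A rank-based measure is a natural choice here because causality scores and mutual information live on different scales and only their ordering is being compared.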
Algorithm  Error Rate  Spearman Correlation w/ Mutual Information 
Binary  0.2165  0.0099 (0.4506) 
Count  0.0617  0.6689 (0.0000) 
Appendix C Details of the neural networks
In this section we describe the details of the neural networks used in the paper. Implementation of all methods is done in Theano 0.8 and adamax is used for optimization. We also use early stopping based on the validation accuracy.
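Early stopping on validation accuracy can be sketched generically as follows. This is not the Theano training code used in the paper: the patience value, the callback interface, and the simulated accuracy curve are all assumptions made for illustration.

```python
def train_with_early_stopping(train_step, validate, max_epochs=100, patience=3):
    """Stop once validation accuracy has not improved for `patience` epochs.

    train_step(epoch) runs one epoch of training; validate(epoch) returns the
    validation accuracy after that epoch. Both are caller-supplied callbacks.
    """
    best_acc = float("-inf")
    best_epoch = -1
    for epoch in range(max_epochs):
        train_step(epoch)
        acc = validate(epoch)
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_acc

# Simulated validation curve (hypothetical numbers).
curve = [0.60, 0.65, 0.70, 0.69, 0.71, 0.70, 0.70, 0.69, 0.68, 0.72]
epochs_run = []
result = train_with_early_stopping(
    train_step=epochs_run.append,
    validate=lambda epoch: curve[epoch],
    max_epochs=len(curve),
    patience=3,
)
print(result, len(epochs_run))
```

In practice the model parameters from the best epoch would be checkpointed and restored after the loop; that bookkeeping is omitted here.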
c.1 Details of CD architecture
We used a multilayer perceptron with seven layers of size 1024 with rectified linear units as activation functions. We use batch normalization for each layer.
c.2 Details of nonlinCause
The network in nonlinCause is a multilayer perceptron with three layers of size 200 and rectified linear units as activation functions. Using the described tuning procedure, the embedding dimension is selected to be and we used dropout with rate . The results in Figure 4(a) are generated using for both LogCause and nonlinCause, though we observed similar performance gains for other values of the penalization coefficient.
c.3 Details of CauseHyp
The causality detector of (LopezPaz et al., 2016) is implemented in our CauseHyp by first generating features from the data using a three-layer MLP with 200 hidden nodes in each layer. Then, after averaging over the batch, we use a five-layer MLP with 200 hidden nodes in each layer. The architecture for the entire network is described in Section 3.2.
Appendix D Heart failure cohort design
Case patients were 40 to 85 years of age at the time of HF diagnosis. HF diagnosis (HFDx) is defined as:

Qualifying ICD9 codes for HF appeared in the encounter records or medication orders.

A minimum of three clinical encounters with qualifying ICD9 codes had to occur within 12 months of each other, where the date of diagnosis was assigned to the earliest of the three dates. If the time span between the first and second appearances of the HF diagnostic code was greater than 12 months, the date of the second encounter was used as the first qualifying encounter. The date at which HF diagnosis was given to the case is denoted as HFDx.
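The date rule above can be expressed as a sliding window over the sorted qualifying encounter dates. The sketch below is an interpretation rather than the study's actual cohort code: it treats "12 months" as 365 days and reads "within 12 months of each other" as the third encounter falling at most 365 days after the first. The function name is hypothetical.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=365)  # assumption: "12 months" taken as 365 days

def hf_diagnosis_date(encounter_dates):
    """Return the HFDx date, or None if the three-encounter criterion is never met.

    Scans sorted qualifying encounters for the earliest run of three that fits
    inside the window; the first date of that run is the diagnosis date. An
    isolated early encounter is skipped automatically, so the next encounter
    becomes the first qualifying one.
    """
    dates = sorted(encounter_dates)
    for i in range(len(dates) - 2):
        if dates[i + 2] - dates[i] <= WINDOW:
            return dates[i]
    return None

encounters = [
    date(2010, 1, 1),   # isolated: more than 12 months before the next code
    date(2012, 3, 1),
    date(2012, 6, 1),
    date(2012, 12, 1),
]
print(hf_diagnosis_date(encounters))
```

In this example the 2010 encounter is discarded because the second code appears more than 12 months later, so the diagnosis date falls on the second encounter.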
Up to ten eligible controls (in terms of sex, age, location) were selected for each case, yielding an overall ratio of 9 controls per case. Each control was also assigned an index date, which is the HFDx of the matched case. Controls are selected such that they did not meet the operational criteria for HF diagnosis prior to the HFDx plus 182 days of their corresponding case. Control subjects were required to have their first office encounter within one year of the matching HF case patient’s first office visit, and have at least one office encounter 30 days before or any time after the case’s HF diagnosis date to ensure similar duration of observations among cases and controls.
Appendix E Qualitative results
Code  Description  

794.31  Nonspecific abnormal electrocardiogram [ECG] [EKG]  0.3422  0.9351 
425.8  Cardiomyopathy in other diseases classified elsewhere  0.3272  0.2322 
786.05  Shortness of breath  0.3124  0.5536 
424.90  Endocarditis, valve unspecified, unspecified cause  0.3086  0.3908 
425.4  Other primary cardiomyopathies  0.2880  0.1351 
427.9  Cardiac dysrhythmia, unspecified  0.2531  0.9864 
785.9  Other symptoms involving cardiovascular system  0.2377  0.8024 
585.6  End stage renal disease  0.2225  0.3948 
511.9  Unspecified pleural effusion  0.2218  0.0839 
425.9  Secondary cardiomyopathy, unspecified  0.2203  0.8024 
782.3  Edema  0.2065  0.0027 
278.01  Morbid obesity  0.1955  0.0345 
424.0  Mitral valve disorders  0.1948  0.0003 
427.31  Atrial fibrillation  0.1762  1.0000 
410.90  Acute myocardial infarction of unspecified site, episode of care unspecified  0.1756  0.2510 
426.3  Other left bundle branch block  0.1690  0.4890 
424.1  Aortic valve disorders  0.1649  0.0012 
879.8  Open wound(s) (multiple) of unspecified site(s), without mention of complication  0.1645  0.6399 
429.3  Cardiomegaly  0.1619  0.5022 
780.60  Fever, unspecified  0.1602  0.7747 
482.9  Bacterial pneumonia, unspecified  0.1514  0.7482 
786.09  Other respiratory abnormalities  0.1454  0.7305 
496  Chronic airway obstruction, not elsewhere classified  0.1403  0.9990 
V42.0  Kidney replaced by transplant  0.1398  0.4351 
250.03  Diabetes mellitus without mention of complication, type I [juvenile type], uncontrolled  0.1388  0.4727 
276.51  Dehydration  0.1347  0.6738 
403.10  Hypertensive chronic kidney disease, benign, with chronic kidney disease stages I IV  0.1316  0.7488 
250.50  Diabetes with ophthalmic manifestations, type II, not stated as uncontrolled  0.1283  0.2271 
427.89  Other specified cardiac dysrhythmias  0.1282  0.9416 
250.51  Diabetes with ophthalmic manifestations, type I [juvenile type], not stated as uncontrolled  0.1234  0.5473 
Code  Description  

782.3  Edema  1.7355  0.0027 
424.1  Aortic valve disorders  2.2021  0.0012 
425.4  Other primary cardiomyopathies  2.2330  0.1351 
V70.0  Routine general medical examination at a health care facility  2.2420  0.0000 
443.9  Peripheral vascular disease, unspecified  2.2668  0.1145 
424.0  Mitral valve disorders  2.3088  0.0003 
250.50  Diabetes with ophthalmic manifestations, type II, not stated as uncontrolled  2.3288  0.2271 
511.9  Unspecified pleural effusion  2.3320  0.0839 
427.32  Atrial flutter  2.3508  0.0767 
278.01  Morbid obesity  2.3924  0.0345 
425.8  Cardiomyopathy in other diseases classified elsewhere  2.4090  0.2322 
362.01  Background diabetic retinopathy  2.4536  0.0815 
584.9  Acute kidney failure, unspecified  2.4661  0.2882 
412  Old myocardial infarction  2.4750  0.0780 
428.0  Congestive heart failure, unspecified  2.4946  0.0039 
791.0  Proteinuria  2.4984  0.1883 
357.2  Polyneuropathy in diabetes  2.5040  0.2120 
402.90  Unspecified hypertensive heart disease without heart failure  2.5103  0.2034 
250.42  Diabetes with renal manifestations, type II or unspecified type, uncontrolled  2.5310  0.2378 
280.9  Iron deficiency anemia, unspecified  2.5661  0.0283 
427.1  Paroxysmal ventricular tachycardia  2.5884  0.0367 
V53.31  Fitting and adjustment of cardiac pacemaker  2.6014  0.2894 
459.81  Venous (peripheral) insufficiency, unspecified  2.6093  0.0748 
410.90  Acute myocardial infarction of unspecified site, episode of care unspecified  2.6154  0.2510 
588.81  Secondary hyperparathyroidism (of renal origin)  2.6199  0.0478 
250.62  Diabetes with neurological manifestations, type II or unspecified type, uncontrolled  2.6367  0.2051 
414.8  Other specified forms of chronic ischemic heart disease  2.6403  0.0923 
362.02  Proliferative diabetic retinopathy  2.6490  0.2629 
586  Renal failure, unspecified  2.7324  0.2388 
250.52  Diabetes with ophthalmic manifestations, type II or unspecified type, uncontrolled  2.7375  0.2154 