Causal Regularization

02/08/2017 ∙ by Mohammad Taha Bahadori, et al.

In application domains such as healthcare, we want accurate predictive models that are also causally interpretable. In pursuit of such models, we propose a causal regularizer to steer predictive models towards causally-interpretable solutions and theoretically study its properties. In a large-scale analysis of Electronic Health Records (EHR), our causally-regularized model outperforms its L1-regularized counterpart in causal accuracy and is competitive in predictive performance. We perform non-linear causality analysis by causally regularizing a special neural network architecture. We also show that the proposed causal regularizer can be used together with neural representation learning algorithms to yield up to 20% improvement in detecting multivariate causation, a situation common in healthcare, where many causal factors should occur simultaneously to have an effect on the target variable.




1 Introduction

In domains such as healthcare, genomics or social science, there is high demand for data analysis that reveals causal relationships between independent and target variables. For example, doctors not only want models that accurately predict the status of patients, but also want to identify the factors that can improve it. The distinction between prediction and causation has at times been subject to debate in statistics and machine learning (Breiman et al., 2001; Shmueli, 2010). While machine learning has focused mostly on prediction tasks, in many scientific domains pure prediction without considering the underlying causal mechanisms is considered unscientific (Shmueli, 2010). In this work, we propose a causal regularizer that balances causal interpretability and predictive power.

Figure 1: The proposed causal regularizer achieves significantly higher causality score, computed using ground truth causality. We compute the score for top codes in the ranked list reported by three algorithms. The causal regularizer is also competitive in predictive performance, see Section 4 for more details.

We use the counterfactual causality framework (Pearl, 2000), in which one random variable $X$ (e.g., red wine consumption) causes another variable $Y$ (e.g., reduction in risk of heart attack), denoted as $X \to Y$, if experimental testing of $X$ would be proven to change the distribution of $Y$ (Spirtes, 2010). But there may also be competing explanations of the observed correlation between $X$ and $Y$ because of confounding (e.g., people of high socio-economic status tend to drink more wine and this is related to other lifestyle factors that cause a reduction in heart attack) that need to be reconciled in assessing the likelihood that $X \to Y$ is true. Causal analytic methods can be used to prioritize what warrants testing in clinical trials among a diversity of hypotheses or as primary evidence if controlled trials are not feasible or desirable (e.g., climate science or health). In healthcare, in particular, it is common that an ensemble of many causal factors needs to occur simultaneously to have an effect on the target variable, a phenomenon we will call multivariate causation. Scalable methods are needed to explore the exponential combinations of the independent variables and different transformations in order to detect multivariate causal relationships.

Methods for discovering causal relationships among multiple variables from observational data (Chickering, 2002; Kalisch and Bühlmann, 2007; Colombo et al., 2012) are largely based on the principle that any given set of causal relationships among multiple variables leaves well-defined marks in the joint distribution of the variables. However, when these methods are used for causal variable selection (Guyon et al., 2007; Cawley, 2008; Bontempi and Meyer, 2010; Sun et al., 2015), the process becomes very sensitive to small changes in the joint distribution of variables and may exclude many causal variables due to noise or selection bias in the data.

Our main idea is to design a causal regularizer that controls the complexity of the statistical models and at the same time favors causal explanations. Compared to the two-step procedure of causal variable selection followed by a multivariate regression/classification, the proposed approach performs joint causal variable selection and prediction, thus avoiding the statistically sensitive thresholding of the causality scores in the causal variable selection step. It allows a few dependencies that cannot be explained via causation to still be included in the model, relaxing the variable selection procedure. Our technical contributions are as follows:

  1. We use causality detectors to construct a causal regularizer that can guide predictive models towards learning causal relationships between the independent and target variables. We theoretically quantify the impact of the accuracy of the causality detector on the causal accuracy of the regularized models.

  2. We propose a new non-linear predictive model regularized by our causal regularizer, which allows causally interpretable neural networks.

  3. Finally, we demonstrate that the proposed causal regularizer can be combined with neural representation learning techniques to efficiently detect multivariate causal hypotheses.

The proposed framework scales linearly with the number of variables, as opposed to many previous causal methods.

We applied the proposed algorithms to clinical predictive modeling problems using large EHR datasets: one on heart failure onset prediction and another on mortality prediction using the publicly available MIMIC III (Johnson et al., 2016) dataset. Altogether, we analyzed the collective influence of 17,081 independent variables on heart failure and validated the results by having a clinical expert manually review the findings in a blind setup. As shown in Figure 1, our proposed causally-regularized algorithm significantly outperforms the baseline algorithms in causality detection performance. We show a similar boost in the causality score of the detected multivariate causal hypotheses. Finally, we show that the proposed algorithms are also competitive in predictive performance on both datasets.

2 Preliminaries on Causality Detection

We begin with a description of pairwise causal analysis ($X \to Y$) of a single independent variable $X$ on the target variable $Y$ based on the independence of mechanisms (ICM) assumption and then extend the pairwise causality detector to perform multivariate causality analysis in the next section. While our proposed causal regularizer can be constructed using any causality detection algorithm, a review of the ICM-based methods, as the state-of-the-art causality detection algorithms, is helpful because they are the baseline algorithms in the experiments.

(a) Independent
(b) Direct
(c) Reverse
(d) Indirect
(e) Indirect Reverse
(f) Confounded Correlation
(g) Confounded and Direct
(h) Confounded and Reverse
(i) Confounded and Indirect
(j) Confounded & Indirect Reverse
Figure 2: Common causal and anti-causal structures between two observed and one or more hidden variables. Under the algorithmic independence assumption, we can sample from the joint distribution of $X$ and $Y$ in each case and train a classifier to distinguish between these cases based on the (automatically learned) features of the joint distribution.

We are interested in finding causal models where $X$ causes $Y$, or $Y$ causes $X$, or the two are confounded, based on the joint distribution of $(X, Y)$. However, pairwise causality analysis is infeasible for arbitrary joint distributions. Thus, we need to resort to additional assumptions on the nature of the causal relationships. Recently, several algorithms have been proposed that distinguish between the cause and effect based on the natural assumption that steps in the process that generates the data are independent from each other, see (Lemeire and Dirkx, 2006; Janzing et al., 2012; Daniusis et al., 2010; Lopez-Paz, 2016; Chalupka et al., 2016; Kocaoglu et al., 2016) and the references therein. In this work, we follow (Lopez-Paz et al., 2016; Chalupka et al., 2016) to describe this causality detection approach. In the next section, we describe our novel causal regularizer designed based on this causality detection approach and its application in non-linear causality analysis and multivariate causal hypothesis generation.

Conceptual description of the independence between the cause and the mechanism. ICM states that the two processes of generation of the cause and mapping from cause to effect are independent. In our case, we assume that when $X \to Y$ ($X$ causes $Y$), the probabilities $P(X)$ and $P(Y \mid X)$ are generated by independent higher-level distribution functions. Thus, we do not put assumptions on the functional form of the causal relationships between the variables of interest. ICM conforms to the scientific idea of Uniformitarianism (Gould, 1965) which, roughly put, states that the laws of nature apply to all objects similarly. ICM can be described in both a deterministic (Janzing and Scholkopf, 2010) and a probabilistic (Daniusis et al., 2010) sense; this work mainly uses the probabilistic interpretation.

ICM can be used to generate samples from distributions that agree with the possible graphical models over two observed variables $X$ and $Y$ and an unobserved variable $Z$ shown in Figure 2, by requiring that the probability functions in the factorization of the joint distribution are independent from each other. The hidden variables can represent the other observed variables, which is critical in the design of the regularizer in the next section. Chalupka et al. (2016) developed an analytical likelihood ratio test that decides between the causal and anticausal cases (Figures 2(b) and 2(c)). However, taking into account the confounded cases is analytically difficult. Nevertheless, it is possible to generate samples from the scenarios in Figure 2 under the ICM and train a classifier to learn to choose the maximum-likelihood causal structure given samples from the joint distribution $P(X, Y)$. This is the key idea of the causality detectors in (Lopez-Paz et al., 2016; Chalupka et al., 2016) described in the rest of this section.

Mathematical description of the causality detection algorithm. Formally, suppose we have $p$ variables $X_1, \ldots, X_p$, each with dimensionality $d$. For each variable $X_j$ we observe a sample of size $n$ denoted by $S_j = \{(x_{ij}, y_i)\}_{i=1}^{n}$, where $y_i$ are observations of a common target variable $Y$. Let $\mathcal{S}$ denote the set of all such samples. For each sample $S_j$, we are interested in determining the binary label $\ell_j$ which determines whether $X_j$ causes $Y$ or not. In fact, we are interested in the function approximation problem of learning the mapping $f: \mathcal{S} \to \{0, 1\}$.

Several approaches can learn such a mapping function. When $X$ and $Y$ are both discrete and finite, Chalupka et al. (2016) offer a means to construct the empirical joint distribution and train a supervised neural network mapping function $f$ on it. Lopez-Paz et al. (2016) learn a representation of each sample and a neural network classifier on top of it, training both the representation learning function and the classification network jointly in a supervised way.
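As a rough illustration of the first approach, the sketch below builds the empirical joint distribution of one discrete variable and a binary target and flattens it into a feature vector that a classifier could consume. The function name `empirical_joint` and the toy sample are assumptions made for this example; they are not taken from Chalupka et al. (2016).

```python
import numpy as np

def empirical_joint(x, y, n_x_levels, n_y_levels=2):
    """Return the normalized contingency table of a discrete pair (x, y)."""
    x = np.asarray(x, dtype=int)
    y = np.asarray(y, dtype=int)
    table = np.zeros((n_x_levels, n_y_levels))
    # Count co-occurrences of each (x, y) value pair.
    np.add.at(table, (x, y), 1.0)
    return table / table.sum()

# Hypothetical sample: one variable with 3 levels against a binary outcome.
x_sample = [0, 0, 1, 2, 2, 2]
y_sample = [0, 0, 0, 1, 1, 0]
joint = empirical_joint(x_sample, y_sample, n_x_levels=3)
features = joint.ravel()  # flattened table, usable as classifier input
```

The flattened table has a fixed length regardless of the sample size, which is what lets a single classifier be applied to every variable.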

However, it is rare to have the true causal labels for training a causal detector. Rather, we generate synthetic datasets to represent the scenarios in Figure 2 based on the ICM assumption. The overall procedure is to generate samples from distributions that follow one of the ten possible scenarios in Figure 2. We need to select distributions that impose a minimum number of restrictions on the data and whose statistics are as similar as possible to those of our true data of interest. For example, in our datasets, the independent variables are counts of the number of disease codes in patients’ records (cf. Section 4). Thus, we sample $X$ from a mixture of appropriate distributions for count data: the Zipf, Poisson, Uniform, and Bernoulli distributions. The hidden variable $Z$ and the response variable $Y$ are sampled from the Dirichlet and Bernoulli distributions, respectively. Details of our sampling and training procedures are provided in Appendix B and Algorithm B there.
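The synthetic-data idea can be sketched for two of the ten scenarios, direct causation and confounded correlation. This is a deliberately simplified illustration, not the procedure of Appendix B: the distributions (Dirichlet and uniform priors over discrete tables), the number of levels, and the label convention (1 means "X does not cause Y") are all assumptions chosen to keep the example short.

```python
import numpy as np

def sample_direct(rng, n, k):
    """X -> Y: P(X) and P(Y | X) are drawn independently of each other (ICM)."""
    p_x = rng.dirichlet(np.ones(k))      # mechanism generating the cause
    p_y_given_x = rng.uniform(size=k)    # independently drawn causal mechanism
    x = rng.choice(k, size=n, p=p_x)
    y = rng.binomial(1, p_y_given_x[x])
    return x, y

def sample_confounded(rng, n, k, m=3):
    """X <- Z -> Y: X and Y are related only through a hidden variable Z."""
    p_z = rng.dirichlet(np.ones(m))
    p_x_given_z = rng.dirichlet(np.ones(k), size=m)  # one row per value of Z
    p_y_given_z = rng.uniform(size=m)
    z = rng.choice(m, size=n, p=p_z)
    # Inverse-CDF draw of X from the row selected by each hidden value.
    cdf = np.cumsum(p_x_given_z[z], axis=1)
    u = rng.uniform(size=n)
    x = (u[:, None] > cdf).sum(axis=1).clip(max=k - 1)
    y = rng.binomial(1, p_y_given_z[z])
    return x, y

def make_training_set(n_datasets, n=200, k=4, seed=0):
    """Return flattened empirical joints and labels (1 = X does not cause Y)."""
    rng = np.random.default_rng(seed)
    feats, labels = [], []
    for i in range(n_datasets):
        confounded = i % 2 == 1
        sampler = sample_confounded if confounded else sample_direct
        x, y = sampler(rng, n, k)
        table = np.zeros((k, 2))
        np.add.at(table, (x, y), 1.0)
        feats.append((table / n).ravel())
        labels.append(int(confounded))
    return np.array(feats), np.array(labels)

train_feats, train_labels = make_training_set(6)
```

Any supervised classifier can then be fit on `train_feats` and `train_labels`; extending the sketch to all ten scenarios means adding one sampler per graph in Figure 2.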

3 Methodology

Given the causality detector in Section 2, we propose the causal regularizer for linear models in Section 3.1. In Section 3.2, we demonstrate that a non-linear deep neural network regularized by our causal regularizer can learn non-linear causal relationships between the independent and target variables. Finally, we show that the causal regularizer can efficiently explore the space of multivariate causal hypotheses and extract meaningful candidates for causality analysis.

3.1 The Causal Regularizer

Using the causality detection methods in the previous section for causal variable selection (Guyon et al., 2007; Cawley, 2008; Bontempi and Meyer, 2010; Sun et al., 2015) makes the variable selection process very sensitive to small changes in the joint distribution of variables and may exclude many causal variables due to noise or selection bias in the data. Ideally, if the ICM holds and if we had access to the true joint distributions and could discriminate between causal and non-causal variables with perfect accuracy, the two-step procedure would be sufficient. But observational datasets are not usually an accurate representation of the true probabilistic generative process because of measurement error and selection bias, which can perturb the causality scores generated by the neural network causality detector.

For example, consider the two-step analysis process of first finding the variables that cause $Y$ from a list of variables $X_j$ for $j = 1, \ldots, p$ and then performing a sparse multivariate regression on the selected variables to prioritize them. This procedure is sensitive because our causality detection algorithm might give soft scores slightly below and slightly above the cut-off to two variables $X_1$ and $X_2$, respectively. These soft scores can be interpreted as the probability that each variable is a cause of $Y$. If we use the two-step procedure, we will include $X_2$ in the regression model but not $X_1$. However, $X_1$ could possibly contribute more to the predictive performance in the presence of other variables in the multivariate regression. In other words, any hard cut-off for the purpose of two-step causal variable selection and regression raises the question of “what should be the best cut-off threshold?”
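The sensitivity of a hard cut-off can be seen in a tiny numerical sketch. The scores below are hypothetical values chosen for illustration, and the conversion of a score into a penalty weight as one minus the score is an assumption of this example.

```python
import numpy as np

def two_step_mask(scores, threshold=0.5):
    """Keep only variables whose causality score exceeds a hard threshold."""
    return np.asarray(scores) > threshold

def soft_penalty_weights(scores):
    """Per-variable penalty weights: low for likely causes, high otherwise."""
    return 1.0 - np.asarray(scores)

# Hypothetical soft scores P[X_j causes Y] for three variables.
scores = np.array([0.49, 0.51, 0.95])
mask = two_step_mask(scores)            # first variable is dropped outright
weights = soft_penalty_weights(scores)  # first two variables treated almost alike
```

With the hard mask, two variables with nearly identical scores receive opposite treatment; with soft weights their penalties differ only marginally, and the regression fit decides between them.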

Instead, we propose a causally regularized regression approach, where this trade-off is performed smoothly via a regularization parameter. We select variables that are both causal with high probability and also significantly predictive.

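Causal Regularizer. Now, given Section 2, assume that we have a classifier that outputs $c_j = P[X_j \text{ does not cause } Y]$. We can design the following regularizer to encourage learning a causal predictive model:

$$\hat{w} = \arg\min_{w} \frac{1}{n} \sum_{i=1}^{n} L(x_i, y_i \mid w) + \lambda \sum_{j=1}^{p} c_j |w_j|, \qquad (1)$$

where $L(x_i, y_i \mid w)$ is the loss function for prediction of $y_i$ given $x_i$. The above regularization term is the $\ell_1$-norm version of the causal regularizer, which will be used in our experiments. However, we can define the $\ell_2$-norm version similarly as $\lambda \sum_{j=1}^{p} c_j w_j^2$.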

The first term in Eq. (1) is a multivariate analysis term, whereas the regularizer is constructed using a bivariate causality score of each independent variable $X_j$ and the target variable $Y$ for $j = 1, \ldots, p$. This does not create a problem because in the design of the causal regularizer we have implicitly included the other variables as hidden variables in the analysis to allow the regularizer to be used with multivariate regression. That is, the rest of the observed independent variables can be considered as hidden variables in our bivariate causality analysis, which allows proper regularization. The proposed causal regularizer is also a decomposable regularizer, which makes analysis of its theoretical properties easier (Negahban et al., 2012).
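A minimal sketch of causally regularized logistic regression follows, assuming the penalty takes the weighted $\ell_1$ form $\lambda \sum_j c_j |w_j|$ with $c_j$ read as the probability that variable $j$ is not a cause. The solver (proximal gradient with soft-thresholding) is a choice made for this illustration, not necessarily what was used in the experiments, and the synthetic data and scores are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the (weighted) l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fit_causal_logreg(X, y, c, lam, n_iter=2000):
    """Minimize mean logistic loss + lam * sum_j c[j] * |w[j]| by proximal gradient."""
    n, p = X.shape
    w = np.zeros(p)
    # Step size 1/L, with L = ||X||_2^2 / (4n) bounding the gradient's Lipschitz constant.
    step = 4.0 * n / (np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (prob - y) / n
        # Each coordinate is shrunk in proportion to its own penalty weight c[j].
        w = soft_threshold(w - step * grad, step * lam * c)
    return w

# Hypothetical data: both features are equally predictive of the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
logits = X[:, 0] + X[:, 1]
y = (rng.uniform(size=500) < 1.0 / (1.0 + np.exp(-logits))).astype(float)

c = np.array([0.05, 0.95])  # assumed detector output: feature 0 is likely causal
w = fit_causal_logreg(X, y, c, lam=0.5)
```

Because the penalty is decomposable, the proximal step is a coordinate-wise soft-threshold, so the per-iteration cost is the same as for a standard $\ell_1$ penalty.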

The interplay between causation and prediction has been studied recently, see (Peters et al., 2015; Rojas-Carulla et al., 2015) and the references therein. In particular, the notion of a causal regularizer was previously recognized as possible (Lopez-Paz, 2016, Page 181; Lopez-Paz et al., 2016); however, a specific causal regularizer has never been developed and evaluated. Notice that using the score of a “causal-anticausal”-only classifier without including the confounding cases, as e.g. in (Lopez-Paz et al., 2016), cannot properly regularize a multivariate model such as logistic regression. Moreover, a major novelty of our proposed causal regularizer is that it performs joint causal variable selection (the $\ell_1$ regularization) and prediction, which the idea in (Lopez-Paz et al., 2016) cannot do.

3.1.1 Analysis of Causal Regularization

The following theorem uses a simple setting to quantify the impact of the -norm based causal regularizer.

Theorem 1

Consider the following general linear model (we have intentionally made the settings of this theorem simple to have readable results; it is possible to obtain results on more general settings, potentially at the expense of cluttering the results):

where the noise variable

is a zero mean random variable with variance

and a distribution that satisfies the regularity conditions of Theorem 3.2 in (White, 1982). We assume that causes but does not and its correlation with is due to an unobserved confounder. We have access to an imperfect causality detector with and , for . Without loss of generality, assume that

. Under this setting, the causality accuracy of an estimate

is defined as follows:

Consider the fixed design setting where an i.i.d. sample of size $n$ is drawn from the model as follows:

where , , and . For clarity of the results, we study the orthonormal design setting where . Using this sample, we obtain two estimates for : and which are the result of -norm and -norm based causally regularized regression, respectively. Asymptotically, as $n \to \infty$, we have the following results:



denotes the CDF of the unit Gaussian distribution.

A proof is provided in Appendix A. To understand the result, considering several special cases can be helpful. When the causal detector is perfect (), we can rewrite as follows

Compared to Eq. (3), we see a factor scaling of the causal coefficient against the non-causal coefficient in the numerator, increasing the chance of correct causality detection. That is, a perfect causality detector guarantees causal interpretability if the magnitude of outweighs the predictive advantage of over . When the causal detector is random (), we can show that . That is, a non-informative causality detector makes causal regularization equivalent to standard regularization. Finally, in the limit of large penalization coefficient, we obtain:

The impact of the error rate of the causality detector in the numerator can be seen as linear scaling of the causal coefficient by and the non-causal factor by .
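The special cases can be checked numerically in the orthonormal design setting, where a weighted $\ell_1$ penalty has a closed-form solution by coordinate-wise soft-thresholding. The sketch below assumes the objective $\tfrac{1}{2}\|y - Xw\|^2 + \lambda \sum_j c_j |w_j|$ with $X^\top X = I$; this scaling, the variable names, and the toy data are assumptions of the example and may differ from the normalization used in the theorem.

```python
import numpy as np

def causal_lasso_orthonormal(X, y, c, lam):
    """Closed-form minimizer of 0.5*||y - Xw||^2 + lam*sum_j c[j]*|w[j]| when X.T @ X = I."""
    z = X.T @ y  # ordinary least-squares estimate under orthonormal design
    return np.sign(z) * np.maximum(np.abs(z) - lam * c, 0.0)

rng = np.random.default_rng(1)
X, _ = np.linalg.qr(rng.normal(size=(50, 2)))  # columns are orthonormal
y = X @ np.array([1.0, 1.0]) + 0.01 * rng.normal(size=50)

# Perfect detector: no penalty on the causal variable, full penalty on the other.
w_perfect = causal_lasso_orthonormal(X, y, c=np.array([0.0, 1.0]), lam=0.5)
# Uninformative detector: both variables get the same weight, as in plain l1.
w_random = causal_lasso_orthonormal(X, y, c=np.array([0.5, 0.5]), lam=0.5)
```

With equal true coefficients, the perfect detector leaves the first coefficient essentially unshrunk while halving the second, whereas the uninformative detector shrinks both equally, which is the behavior of a standard $\ell_1$ penalty with a rescaled $\lambda$.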

Another property of the causal regularizer is that the two-step analysis can be cast as a form of causal regularization where we use hard scores instead of soft scores. Consider the following setting:

where if and otherwise. Now, consider the limiting case of and . This case corresponds to the two-step procedure with regularized logistic regression.

3.2 Causal Regularizers in Neural Networks

We demonstrate two key scenarios of using the causal regularizer as shown in Figure 3.

(a) Non-linear causality analysis
(b) Multivariate causal hypothesis generation
Figure 3: Two scenarios of using the proposed causal regularizer: (a) In the proposed architecture, applying the causal regularizer allows identification of causal relationships in the non-linear settings, where the causality coefficient can change from subject to subject. (b) The causal regularizer allows us to explore the high-dimensional multivariate combinations of the variables and identify plausible hypotheses. Here, the causality detector generates the causal regularization coefficients for the hypotheses. The regularizer encourages the coordinates of the learned representation to be more causal.

Non-linear Modeling. The linear model in Eq. (1) assumes that the strength of the impact of each independent variable on the target variable is fixed. However, according to the probabilistic view of causation (Pearl, 2000), the strength of causation can change from subject to subject. Thus, we need non-linear extensions of logistic regression that can be regularized by the causal regularizer and steered towards being causal.

To address this problem, we seek neural network architectures that represent the impact of each independent variable by a single coefficient (that can change for each subject) and regularize the coefficients with the causal regularizer. In particular, we propose the following non-linear generalized linear model:


where the embedding matrix maps the input to a lower dimensional representation space and the symbol $\odot$ denotes the element-wise product. The logistic sigmoid function $\sigma$ maps the real values to the $[0, 1]$ interval. The linear term acts as the skip connection and is initialized by the result of the logistic regression. The embedding allows dealing with a very large set of discrete concepts and can be initialized via techniques such as skip-gram (Mikolov et al., 2013) or GloVe (Pennington et al., 2014). The coefficient vector applied to the embedded input is computed using a Multilayer Perceptron (MLP).

The model in Eq. (4) is a non-linear extension of logistic regression that is suitable for causal regularization. We can reorder the equations to write the right hand side of Eq. (4) as , where the new regression coefficient can change with every input. Each coordinate of the new regression coefficient can be calculated as , where denotes the th column of the embedding matrix . The variability of for each input enables us to perform individual causality analysis. For training, we can penalize the coefficients and minimize the following loss function


where denotes the negative log-likelihood of the model described by Eq. (4). The change of the prediction vector with each sample can be related to the probabilistic definition of causation (Pearl, 2000) in the sense that the strength of causality may change from a subject to another one.
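The reordering argument can be illustrated with a small NumPy sketch. The exact form of Eq. (4) is not reproduced here; the code assumes a model of the form $\sigma(\beta(x)^\top (W x) + w^\top x)$, where $W$ is the embedding matrix, $\beta(x)$ is produced by an MLP from the embedded input, and $w^\top x$ is the skip connection. All names (`W_emb`, `w_skip`, `beta_mlp`), the layer sizes, and the penalty form are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def beta_mlp(h, params):
    """One-hidden-layer MLP producing the input-dependent vector beta(x)."""
    W1, b1, W2, b2 = params
    return W2 @ np.tanh(W1 @ h + b1) + b2

def predict(x, W_emb, w_skip, params):
    """Assumed model: sigmoid(sum(beta * (W_emb x)) + w_skip . x)."""
    h = W_emb @ x
    beta = beta_mlp(h, params)
    # beta * h is the element-wise product; summing it gives the inner product.
    return sigmoid(np.sum(beta * h) + w_skip @ x)

def effective_coefficients(x, W_emb, w_skip, params):
    """Per-input coefficients: w_tilde[j] = beta(x) . W_emb[:, j] + w_skip[j]."""
    beta = beta_mlp(W_emb @ x, params)
    return W_emb.T @ beta + w_skip

def causal_penalty(x, c, lam, W_emb, w_skip, params):
    """Assumed penalty: lam * sum_j c[j] * |w_tilde[j](x)| for one input."""
    return lam * np.sum(c * np.abs(effective_coefficients(x, W_emb, w_skip, params)))

rng = np.random.default_rng(0)
p, d, hidden = 5, 3, 4
W_emb = rng.normal(size=(d, p))
w_skip = rng.normal(size=p)
params = (rng.normal(size=(hidden, d)), np.zeros(hidden),
          rng.normal(size=(d, hidden)), np.zeros(d))
x = rng.poisson(1.0, size=p).astype(float)
w_tilde = effective_coefficients(x, W_emb, w_skip, params)
```

Under this assumed form, `predict(x, ...)` equals `sigmoid(w_tilde @ x)`, so each input receives its own logistic-regression-style coefficient vector that can be inspected or penalized coordinate by coordinate.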

Multivariate Causal Hypothesis Generation. A key application of our proposed causal regularizer in conjunction with deep representation learning is to efficiently extract multivariate causal hypotheses from the data. Figure 3(b) shows an example of causal hypothesis generation where the hypotheses are generated via an MLP. We assume that there is a representation learning network with $k$-dimensional output $h \in \mathcal{R}^k$, where $\mathcal{R}$ denotes the range of the output, for example $[0, 1]$ for sigmoid and $[0, \infty)$ for ReLU activation functions. Our goal is to force each dimension of $h$ to be causal, so that each coordinate of $h$ can be used as a multivariate causal hypothesis. In particular, we aim at minimizing the following objective function:


Our approach is to train an anti-causality detector based on (Lopez-Paz et al., 2016) and design the regularizer based on its score. Then, as shown in Figure 3(b), we can combine it with the neural network to regularize the coefficients of the last layer of the MLP, which predicts the labels from the hypotheses. The weights of the lower layers of the network are also regularized to make the generated causal hypotheses simple and interpretable.

The learning process has two steps. First, the causality detector network is trained on a synthetic dataset in which the causal and anti-causal scenarios carry distinct binary labels. We select the non-linearity of the representation output to be the logistic sigmoid function, thus we use the Beta distribution for generating synthetic data for training of the causality classifier. In the second phase, the coefficients of the causality detector are fixed and we train the rest of the parameters in Eq. (6). To train the network, we select batches with a fixed size of 200 examples. The size of the batches indicates the number of samples that are available to the causality detector. We select this number to be large enough that the error rate of the causality detector in (Lopez-Paz et al., 2016) is sufficiently low.

4 Experiments

We evaluate the proposed causal regularizer in Section 3.1 both in terms of its predictive and causal performance. Next, we compare the quality of the codes identified as causes of heart failure identified by different approaches. Finally, we evaluate performance of multivariate causal hypothesis generation by qualitatively analyzing the extracted hypotheses. We defer evaluation of the causality detection algorithms to Appendix B, as they are not the main contributions of this work. Table 1 lists the acronyms and symbols for techniques used in the experiments to improve the presentation.

| Symbol | Description |
|---|---|
| CD | Causality detector, described in Section 2 |
| $c$ | The output of a causality detector |
| LogCause | Logistic regression regularized by the causal regularizer |
| LogL1 | Logistic regression regularized by the $\ell_1$ regularizer |
| Two-step | The two-step procedure of causal variable selection and logistic regression, as discussed in Section 3.1 |
| $w$ | The regression coefficients of an algorithm (one of LogCause, Two-step, or LogL1) |
| nonlinCause | The non-linear causality analysis model in Eq. (4) |
| CauseHyp | The multivariate causal hypothesis generation described in Eq. (6) |

Table 1: List of symbols and acronyms used in the experiment section. Bold font shows our proposed approaches.
Algorithms Heart Failure MIMIC III
Table 2: Prediction accuracy results on two datasets. (mean ± standard deviation)
(a) AUC on HF
(b) on HF
(c) AUC v. Sparsity (HF)
(d) AUC on MIMIC
(e) on MIMIC
(f) AUC v. Sparsity (MIMIC)
Figure 4: Comparison of variable selection in logistic regression via the causal and $\ell_1$ regularizers on two datasets and two accuracy measures. Note the stability of variable selection by LogCause as the penalization coefficient varies.

4.1 Data

The Sutter Health heart failure (HF) dataset consists of Electronic Health Records of middle-aged adults collected by Sutter Health for study of heart failure. From the encounter records, medication orders, procedure orders and problem lists, we extracted visit records consisting of diagnosis, medication and procedure codes. We denote the set of such codes by .

Given a patient's visit sequence, we try to predict whether the patient will be diagnosed with heart failure (HF) and identify the key causes of increased heart failure risk. To this end, 3,884 cases are selected and approximately 10 controls are selected for each case (28,903 controls). The case/control selection criteria are fully described in Appendix D. Cases have index dates to denote the date they are diagnosed with HF. Controls have the same index dates as their corresponding cases. We extract diagnosis codes, medication codes and procedure codes from the 18-month window before the index date. In total, there are 17,081 unique medical codes in this dataset.

The MIMIC III dataset (Johnson et al., 2016) is a publicly available dataset consisting of medical records of intensive care unit (ICU) patients over 11 years. We use a public query to extract the binary mortality labels for the patients. Our goal is to use the codes in the patients’ last visit to the ICU and predict their mortality outcome. Our dataset includes 46,520 patients, of whom 5,810 are deceased (mortality=1). A total of 14,587 different medical codes are used in this dataset.

Feature construction. Given the sequence of visits for $n$ patients, we create a feature vector $x_i$ by counting the number of codes observed in the records of the $i$-th patient. Given the large variations in the number of codes, we logarithmically bin the count data into 16 bins. The final data is in the form of pairs $(x_i, y_i)$, where $y_i$ is the $i$-th patient’s label: heart failure and mortality outcome in the heart failure and MIMIC III datasets, respectively.
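One way to realize logarithmic binning of counts into 16 bins is sketched below. The base-2 bin edges and the dedicated bin for a zero count are assumptions made for this example; the text does not specify the exact edges.

```python
import numpy as np

def log_bin(counts, n_bins=16):
    """Map non-negative counts to bin indices 0..n_bins-1 on a log2 scale."""
    counts = np.asarray(counts)
    bins = np.zeros(counts.shape, dtype=int)  # bin 0 is reserved for a zero count
    positive = counts > 0
    # Counts in [2^b, 2^(b+1)) fall into bin b + 1.
    bins[positive] = 1 + np.floor(np.log2(counts[positive])).astype(int)
    # Very large counts are collapsed into the last bin.
    return np.minimum(bins, n_bins - 1)

binned = log_bin([0, 1, 2, 3, 4, 1000, 10**6])
```

Binning on a log scale keeps rare and frequent codes on comparable ranges, so a handful of very frequent codes does not dominate the feature vector.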

Training details. Because we generate synthetic datasets for training the causality detector neural networks, we can generate as many new batches of data for training and parameter tuning purposes as required. For training and parameter tuning of the models in Section 3, we perform the common 75%/10%/15% training/validation/test splits. The full details of the training procedure for the neural networks are given in Appendix C.
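The 75%/10%/15% split can be sketched as a random permutation of patient indices. The function name, the seed, and the rounding rule are assumptions of this illustration.

```python
import numpy as np

def split_indices(n, fractions=(0.75, 0.10, 0.15), seed=0):
    """Randomly partition range(n) into training, validation and test indices."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_train = int(round(fractions[0] * n))
    n_val = int(round(fractions[1] * n))
    # Whatever remains after training and validation goes to the test set.
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices(1000)
```

Splitting at the patient level, as here, keeps all records of one patient in a single partition.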

(a) Predictive gain by nonlinCause
(b) Accuracy of CauseHyp
(c) Runtime
Figure 5: (a) The predictive gain by nonlinCause on the MIMIC III dataset. The gain is more visible when fewer features are used in the analysis; with more features, the inputs become more expressive by themselves. We select the variables in descending order of variance. (b) Average causality score computed using ground truth causality labels for generated hypotheses. We compute the score for top hypotheses reported by two algorithms. (c) Runtime of the proposed algorithms as the number of input variables changes.
| Name | Conditions | Description |
|---|---|---|
| Aortic Dissection from Trauma | Dissection of aorta; Burn in multiple sites of trunk; Abdominal pain, lower left quadrant | This collection of diagnoses is especially causal for heart failure, as heart failure can manifest as a complication of dissection of aorta. Dissection of aorta can present with abdominal pain, and may happen in traumatic injuries that involve burn of unspecified degree of other and multiple sites of trunk, occurring together. |
| Kidney Neoplasm and Severe Infections | Malignant neoplasm of kidney; History of infectious and parasitic diseases; Tuberculosis of lung | Neoplasms in the kidney may lead to paraneoplastic systemic effects that may lead to heart failure. Furthermore, having concurrent severe infections such as tuberculosis can also increase the risk of heart failure. |
| Metabolic Syndrome with Concurrent Infections and Pregnancy | Metabolic syndrome; Tuberculosis of lung; Obstetrical pulmonary embolism | Metabolic syndrome co-occurring with severe infections such as tuberculosis can lead to heart failure. Obstetrical pulmonary embolisms can lead to acute heart failure. |

Table 3: Examples of multivariate causal hypotheses generated via causal regularizer.

4.2 Predictive performance evaluation

Table 2 shows the test accuracy of heart failure and mortality prediction in heart failure and MIMIC datasets, respectively. We have run each algorithm ten times and report the mean and standard deviation of the performance measures. As we can see, the proposed causal regularizer does not hurt the predictive performance, whereas the two-step procedure significantly reduces the accuracy.

An interesting phenomenon, shown in Figure 4, is the relative robustness of the performance with respect to the value of the penalization parameter compared to the $\ell_1$ regularization case. This robustness comes as no surprise, because the causal regularizer assigns very small penalization coefficients to the causal variables and, as we discussed in Section 3.1, only with very high values of penalization can we force all coefficients to become zero; see Figures 4(c) and 4(f), which show the sparsity results. The predictive robustness of the causal regularizer can also be partially attributed to the invariant prediction property of causal models (Peters et al., 2015). That is, the robustness can be due to the fact that the causal regularizer might match the true generative process of the dataset better than the flat regularizer and put the model under less pressure as we increase the penalization parameter. We demonstrate the predictive gain by nonlinCause in Figure 5(a). Furthermore, the impact of changing the regularization parameter on the number of selected variables is visualized in Figures 7(a) and 7(b) in Appendix B.2.

4.3 Causality detection performance evaluation

The risk factors for heart failure are well-studied in the medical literature, making the heart failure condition an ideal case for the study of causality. To evaluate the causality detection performance of the algorithms, we generate the top 100 influential factors by each method. We ask a clinical expert to label each factor as “causal”, “not-causal”, or “potentially causal” and assign a corresponding numeric score to each label. To prevent bias by the expert, we ask him to label a single list of all unique codes in the three lists and use this list to find the scores for individual lists. Figure 1 shows the average causality score by each algorithm based on the labels provided by the medical expert. As expected, $\ell_1$-regularized logistic regression performs poorly, as it is susceptible to the impact of confounded variables. Performance of the causally regularized logistic regression is superior to the two-step procedure, which suggests that picking factors that are both causal and highly predictive leads to a better causality score. The result in Figure 1 together with the predictive results in Table 2 confirm that the causal regularizer can be efficiently used for finding a few causal variables that are highly predictive of the target quantity.
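The evaluation protocol can be sketched as follows. The numeric values attached to the three labels (1, 0 and 0.5) are assumptions of this example, and the code names and rankings are hypothetical placeholders rather than results from the study.

```python
# Assumed numeric scores for the expert's three labels.
SCORE = {"causal": 1.0, "not-causal": 0.0, "potentially causal": 0.5}

def codes_to_review(ranked_lists):
    """Merge the per-method lists into one de-duplicated list for blind labeling."""
    return sorted(set().union(*ranked_lists))

def average_causality_score(ranked_codes, expert_labels, top_k):
    """Mean expert score over the top_k entries of one method's ranked list."""
    top = ranked_codes[:top_k]
    return sum(SCORE[expert_labels[code]] for code in top) / len(top)

# Hypothetical rankings from two methods and hypothetical expert labels.
ranking_a = ["code_1", "code_2", "code_3"]
ranking_b = ["code_3", "code_4", "code_1"]
expert_labels = {"code_1": "causal", "code_2": "potentially causal",
                 "code_3": "not-causal", "code_4": "not-causal"}

review_list = codes_to_review([ranking_a, ranking_b])
score_a = average_causality_score(ranking_a, expert_labels, top_k=2)
score_b = average_causality_score(ranking_b, expert_labels, top_k=2)
```

Labeling the merged list once, rather than each method's list separately, means the expert cannot tell which method proposed a given code.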

The qualitative advantages of the regularized approach can also be seen in the results in Table 5 in Appendix E. We have marked the disease codes that can potentially increase the risk of heart failure but whose predicted causality score is lower than , so that the two-step procedure would have eliminated them from the predictor set (as shown in Table 6 in Appendix E). Thus, the causal regularizer is able to strike a balance between prediction and causation and produce clinically more plausible results.

4.4 Evaluating the multivariate causal hypotheses

We evaluate the performance of the proposed causal hypothesis generation against the case where we do not use any causal regularization. We generate two lists of the top 30 hypotheses using the two algorithms and ask our medical expert to label each hypothesis as causal, non-causal or possibly causal, with corresponding scores of , , and . The results in Figure 4(b) show that the causal regularizer can increase the causality score of the hypotheses by up to . We also provide a qualitative analysis of the causal hypotheses generated by our algorithm by picking several hypotheses and showing that they are clinically meaningful. Three examples of multivariate causal hypotheses generated via the causal regularizer, together with descriptions of their clinical meaning, are shown in Table 3.

5 Conclusion and Discussion

We addressed the problem of exploring the high-dimensional causal hypothesis space in applications such as healthcare. We designed a causal regularizer that maximally steers predictive models towards causally explainable models. The proposed causal regularizer, based on our causality detector, does not increase the computational complexity of the regularizer and can be seamlessly integrated with a neural network to perform non-linear causality analysis. We also demonstrated the application of the proposed causal regularizer in generating multivariate causal hypotheses. Finally, we demonstrated the usefulness of the causal regularizer in detecting the risk factors of heart failure using an electronic health records dataset.


The authors would like to thank Frederick Eberhardt for helpful discussions. Mohammad Taha Bahadori acknowledges the previous discussions with David C. Kale and Micheal E. Hankin on the concept of causal regularizer. This work was supported by the National Science Foundation, awards IIS-#1418511 and CCF-#1533768, a research partnership between Children’s Healthcare of Atlanta and the Georgia Institute of Technology, the CDC I-SMILE project, a Google Faculty Award, Sutter Health, UCB and a Samsung Scholarship. Krzysztof Chalupka’s work was supported by the NSF grant #1564330.


  • Bontempi and Meyer (2010) Bontempi, G. and P. E. Meyer (2010). Causal filter selection in microarray data. In ICML.
  • Breiman et al. (2001) Breiman, L. et al. (2001). Statistical modeling: The two cultures. Statistical Science.
  • Cawley (2008) Cawley, G. C. (2008). Causal and non-causal feature selection for ridge regression. In WCCI Causation and Prediction Challenge, pp. 107–128.
  • Chalupka et al. (2016) Chalupka, K., F. Eberhardt, and P. Perona (2016). Estimating the causal direction and confounding of two discrete variables. arXiv Preprint.
  • Chickering (2002) Chickering, D. M. (2002). Optimal structure identification with greedy search. JMLR.
  • Colombo et al. (2012) Colombo, D., M. H. Maathuis, M. Kalisch, and T. S. Richardson (2012). Learning high-dimensional directed acyclic graphs with latent and selection variables. Ann. Stat.
  • Daniusis et al. (2010) Daniusis, P., D. Janzing, J. Mooij, J. Zscheischler, B. Steudel, K. Zhang, and B. Schölkopf (2010). Inferring deterministic causal relations. In UAI.
  • Gould (1965) Gould, S. J. (1965). Is uniformitarianism necessary? Am. J. Sci. 263(3), 223–228.
  • Guyon et al. (2007) Guyon, I., C. Aliferis, and A. Elisseeff (2007). Causal feature selection. Computational methods of feature selection, 63–82.
  • Janzing et al. (2012) Janzing, D., J. Peters, E. Sgouritsa, K. Zhang, J. M. Mooij, and B. Schölkopf (2012). On causal and anticausal learning. In ICML.
  • Janzing and Scholkopf (2010) Janzing, D. and B. Scholkopf (2010). Causal inference using the algorithmic markov condition. IEEE Trans. Inf. Theory.
  • Johnson et al. (2016) Johnson, A. E., T. J. Pollard, L. Shen, L.-w. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark (2016). Mimic-iii, a freely accessible critical care database. Scientific data 3.
  • Kalisch and Bühlmann (2007) Kalisch, M. and P. Bühlmann (2007). Estimating high-dimensional directed acyclic graphs with the pc-algorithm. JMLR.
  • Kocaoglu et al. (2016) Kocaoglu, M., A. G. Dimakis, S. Vishwanath, and B. Hassibi (2016). Entropic causal inference. arXiv preprint arXiv:1611.04035.
  • Lemeire and Dirkx (2006) Lemeire, J. and E. Dirkx (2006). Causal models as minimal descriptions of multivariate systems.
  • Lopez-Paz (2016) Lopez-Paz, D. (2016). From dependence to causation. Ph. D. thesis, University of Cambridge.
  • Lopez-Paz et al. (2016) Lopez-Paz, D., R. Nishihara, S. Chintala, B. Schölkopf, and L. Bottou (2016). Discovering causal signals in images. arXiv:1605.08179.
  • Mikolov et al. (2013) Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013). Distributed representations of words and phrases and their compositionality. In NIPS.
  • Negahban et al. (2012) Negahban, S. N., P. Ravikumar, M. J. Wainwright, and B. Yu (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statist. Sci.
  • Pearl (2000) Pearl, J. (2000). Causality. Cambridge university press.
  • Pennington et al. (2014) Pennington, J., R. Socher, and C. D. Manning (2014). Glove: Global vectors for word representation. EMNLP.
  • Peters et al. (2015) Peters, J., P. Bühlmann, and N. Meinshausen (2015). Causal inference using invariant prediction: identification and confidence intervals.
  • Rojas-Carulla et al. (2015) Rojas-Carulla, M., B. Schölkopf, R. Turner, and J. Peters (2015). Causal transfer in machine learning.
  • Shmueli (2010) Shmueli, G. (2010). To explain or to predict? Statistical science.
  • Spirtes (2010) Spirtes, P. (2010). Introduction to causal inference. JMLR.
  • Sun et al. (2015) Sun, Y., J. Li, J. Liu, C. Chow, B. Sun, and R. Wang (2015). Using causal discovery for feature selection in multivariate numerical time series. Machine Learning.
  • Wainwright and Jordan (2008) Wainwright, M. J. and M. I. Jordan (2008). Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning 1(1-2), 1–305.
  • White (1982) White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica.

Appendix A Proof of the theorem

The proof rests on two results: first, the connection between the two estimates and the ordinary least squares estimate; second, the known asymptotic normality results for maximum likelihood estimation under misspecification. We need the latter because, to allow causality detection, the noise distribution in the theorem is not necessarily Gaussian.

We can write both of the based causal regularization and Ridge regression in the following unified format:

where is a diagonal matrix. It is equal to the identity matrix for the ridge regression and for the causal regularization. Given this unified formulation, we can represent the estimates in the theorem in terms of the ordinary least squares estimate as follows:


where . According to the asymptotic normality of the quasi-maximum likelihood estimation in (White, 1982), as , the ordinary least squares estimate is normal with the following distribution :


Given the results in Eqs. (7) and (8), we can find the distributions for the quantities of interest:


where in the last step we have used the results on linear transformation of multivariate normal variables. Now, we can use the result in Eq. (9) to write:


where is the cdf of the unit Gaussian distribution. Substituting for the ridge regression and for the causal regularization, we obtain the result in the theorem.
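The first step of the proof, expressing a penalized estimate as a linear transformation of the ordinary least squares estimate, can be checked numerically. The sketch below assumes that the unified objective has the generalized ridge form ||y - X b||^2 + lam * b^T D b with a positive diagonal matrix D; the exact scaling used in the theorem is not reproduced here, and the diagonal weights in `D_causal` are made-up values for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 200, 5, 3.0

X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
# Uniform (non-Gaussian) noise, since the proof does not assume Gaussian noise.
y = X @ beta_true + rng.uniform(-1.0, 1.0, size=n)


def penalized_estimate(X, y, lam, D):
    """Closed-form minimizer of ||y - X b||^2 + lam * b^T D b."""
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)


# Ordinary least squares estimate.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

D_ridge = np.eye(p)                             # identity: ridge regression
D_causal = np.diag([0.1, 0.2, 0.9, 0.3, 0.8])   # hypothetical per-variable weights

for name, D in [("ridge", D_ridge), ("causal", D_causal)]:
    direct = penalized_estimate(X, y, lam, D)
    # The same estimate written as a linear map applied to the OLS estimate.
    A = np.linalg.solve(X.T @ X + lam * D, X.T @ X)
    print(name, "matches OLS-based form:", np.allclose(direct, A @ beta_ols))
```

Because the penalized estimate is a fixed linear map of the ordinary least squares estimate, asymptotic normality of the latter carries over directly, which is the property the remainder of the proof relies on.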

Appendix B Details of causality detector design

We first describe the sampling process for generating the synthetic datasets used for training the causality detector algorithms in Section B.1. Next, in Section B.2, we evaluate the impact of the proposed sampling procedure on the quality of the causality detection algorithms. Algorithm B summarizes the process described in Section 2.

Given the data, perform the following steps:

  i. Generate data samples for from according to the ten cases in Figure 2.

  ii. Assign label to the cases in Figures 1(b), 1(d), 1(g) and 1(i) and to the rest.

  iii. Train a classifier to classify them as causation (label=1) or not-causation (label=0). Because this is a synthetic dataset, we know these labels and can use supervised learning.

  iv. On the test set, construct the test sample sets and use the classifier in step iii to classify the example.

Algorithm B: The algorithm for constructing the causality detector. The structure of the neural network classifier is given in Appendix C.

B.1 Sampling procedure for count variables

As described in Section 4, our independent variables are counts. Thus, we need to generate data from distributions for count data, such as Poisson or Zipf distributions with a fixed support size of 16. Looking at the histogram of the maximum number of code occurrences in Figure 6, we observe that many codes occur at most once or twice. Thus, we also generate binary and trinary distributions from flat Dirichlet distributions. Finally, to make sure that the space is fully spanned, we also generate samples from a Dirichlet distribution with 16 categories. In summary, the is the mixture of these five distributions. The parameters of Poisson and Zipf are sampled from distribution.
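A minimal sketch of such a five-component mixture is given below. The support size of 16 and the five component families follow the description above; the equal mixture weights, the uniform ranges for the Poisson rate and the Zipf exponent, and all function names are assumptions made for illustration, since the corresponding details are not recoverable from the text.

```python
import numpy as np

SUPPORT = 16  # fixed support size stated in the text


def truncated_poisson(rate):
    """Poisson pmf restricted to {0, ..., SUPPORT - 1} and renormalized."""
    k = np.arange(SUPPORT)
    # log(k!) for k = 0, ..., SUPPORT - 1
    log_fact = np.concatenate(([0.0], np.cumsum(np.log(np.arange(1, SUPPORT)))))
    logp = k * np.log(rate) - log_fact
    p = np.exp(logp - logp.max())
    return p / p.sum()


def truncated_zipf(exponent):
    """Zipf-like pmf proportional to rank ** (-exponent) over SUPPORT ranks."""
    ranks = np.arange(1, SUPPORT + 1, dtype=float)
    p = ranks ** (-exponent)
    return p / p.sum()


def padded_dirichlet(rng, k):
    """Flat Dirichlet over the first k values, zero probability elsewhere."""
    p = np.zeros(SUPPORT)
    p[:k] = rng.dirichlet(np.ones(k))
    return p


def sample_count_distribution(rng):
    """Draw one distribution from the mixture (equal weights are assumed)."""
    choice = rng.integers(5)
    if choice == 0:
        return truncated_poisson(rng.uniform(0.5, 5.0))  # assumed parameter range
    if choice == 1:
        return truncated_zipf(rng.uniform(1.1, 3.0))     # assumed parameter range
    if choice == 2:
        return padded_dirichlet(rng, 2)                  # binary
    if choice == 3:
        return padded_dirichlet(rng, 3)                  # trinary
    return padded_dirichlet(rng, SUPPORT)                # full 16 categories


rng = np.random.default_rng(0)
marginal = sample_count_distribution(rng)
print(marginal.round(3))
```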

Figure 6: Histogram of maximum number of code occurrences.

Let denote a discrete distribution with parameter and given support size . Direct : Sample . Generate . Sample for times. Compute the -dimensional vector .

Algorithm: Another example of generating the synthetic dataset.

Sampling from the other graphical models in Figure 2 is done by writing the factorization, sampling along the directed edges, and finally marginalizing with respect to the hidden variables (Wainwright and Jordan, 2008). The hidden variables are selected to be categorical variables with cardinality selected uniformly from the integers in the interval . The conditional distribution of the hidden variables is selected to be a Dirichlet distribution with an all-ones parameter vector.
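The factorize, sample, and marginalize recipe can be sketched for two illustrative structures: a direct effect of X on Y, and a purely confounded pair driven by a hidden variable H. This is a simplified sketch rather than the procedure used in the experiments: the marginal of X is drawn from a flat Dirichlet instead of the count mixture, the range of hidden cardinalities (2 to 10) is an assumption because the interval is not given in the text, and all function names are hypothetical.

```python
import numpy as np


def random_conditional(rng, n_parent, n_child):
    """Row i is P(child | parent = i), drawn from a flat Dirichlet."""
    return rng.dirichlet(np.ones(n_child), size=n_parent)


def joint_direct(rng, k):
    """X -> Y: P(x, y) = P(x) P(y | x)."""
    p_x = rng.dirichlet(np.ones(k))
    p_y_given_x = random_conditional(rng, k, k)
    return p_x[:, None] * p_y_given_x


def joint_confounded(rng, k, max_hidden=10):
    """X <- H -> Y: P(x, y) = sum_h P(h) P(x | h) P(y | h)."""
    n_h = rng.integers(2, max_hidden + 1)  # assumed range of hidden cardinality
    p_h = rng.dirichlet(np.ones(n_h))
    p_x_given_h = random_conditional(rng, n_h, k)
    p_y_given_h = random_conditional(rng, n_h, k)
    # Marginalize the hidden variable out of the factorized joint.
    return np.einsum("h,hx,hy->xy", p_h, p_x_given_h, p_y_given_h)


def empirical_feature(rng, joint, n_samples):
    """Sample (x, y) pairs and return the flattened empirical joint."""
    flat = joint.ravel()
    counts = rng.multinomial(n_samples, flat / flat.sum())
    return counts / n_samples


rng = np.random.default_rng(0)
causal_example = empirical_feature(rng, joint_direct(rng, 16), 1000)
confounded_example = empirical_feature(rng, joint_confounded(rng, 16), 1000)
print(causal_example.shape, confounded_example.shape)
```

Each call produces one training example for the detector: a fixed-length vector whose label is known from the structure that generated it.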

(a) Binary
(b) Count
Figure 7: Distribution of the coefficients generated by the causality detector.

B.2 Evaluating the causality detector

Table 4 shows two advantages of the proposed sampling procedure for count data in comparison to the binary case proposed by Chalupka et al. (2016). First, in the synthetic dataset, the test error is significantly lower. This is because the size of input to the neural causality detector is compared to for the binary case. Second, applying the causality detectors to our data, we observe that the causality scores generated by our sampling scheme have significantly higher correlation with the mutual information between the independent variables and the target label. Figure 7 highlights another advantage of the sampling procedure for count data: it is able to identify a larger portion of the variables as non-causal, which is more in line with expectations. Table 6 shows that the mutual information identifies V70.0 (Routine general medical examination at a health care facility) as highly correlated, but the causality detector correctly identifies it as non-causal with causality score .

In particular, in Figure 7(a), the Spearman’s rank correlation is 0.6689 (Table 4), which indicates a strong correlation. This is intuitive, as we expect causal connections to create stronger correlations on average. Another consequence of the large correlation is that regularization by the non-causality scores is safer and will not significantly hurt the predictive performance. In Figure 7(a), we have marked four codes in the four corners of the figure. As an example of a highly correlated and causal code, we can point out 250.00 (Diabetes mellitus without mention of complication), which is a known cause of heart failure. Code 362.01 (Background diabetic retinopathy) is an effect of diabetes, a common cause of heart failure. Code V06.5 (Need for TD vaccination) is an example of a code that is neither causal nor correlated. Finally, code 365.00 (Preglaucoma) is known for increasing the risk of heart failure, despite the fact that it is not very correlated with heart failure.
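For reference, Spearman's rank correlation is the Pearson correlation computed on the ranks of the two variables. The sketch below illustrates the computation on synthetic numbers; the arrays are made up and are not the causality scores or mutual information values from the experiments, and the simple ranking used here assumes there are no ties.

```python
import numpy as np


def ranks(values):
    """Rank positions 0..n-1 in sorted order (assumes no tied values)."""
    order = np.argsort(values)
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(len(values))
    return r


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return np.corrcoef(ranks(a), ranks(b))[0, 1]


rng = np.random.default_rng(0)
mutual_info = rng.uniform(size=500)                             # synthetic stand-in
causality_score = mutual_info ** 3 + 0.05 * rng.normal(size=500)  # synthetic stand-in
print("Spearman correlation:", spearman(causality_score, mutual_info))
```

Because only ranks enter the computation, the measure is unchanged by any monotone rescaling of either variable, which makes it a natural choice for comparing scores on different scales.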

Algorithm   Error Rate   Spearman Correlation w/ Mutual Information
Binary      0.2165       -0.0099 (0.4506)
Count       0.0617       0.6689 (0.0000)
Table 4: Summary of the results
(a) Causation score vs. mutual information
(b) The impact of the causal regularization parameter
Figure 8: (a) The scatter plot of causation score vs. mutual information. (b) The impact of the causal regularization parameter on the top 50 selected variables, marked in the original plot in Figure 7(a). As the parameter increases, there is a shift from the upper-left towards the lower-right corner, and the trade-off shifts towards selecting more causal codes despite possibly lower mutual information. In this figure, a small amount of noise has been added to the points so that overlapping points are visible.

Appendix C Details of the neural networks

In this section we describe the details of the neural networks used in the paper. All methods are implemented in Theano 0.8, and Adamax is used for optimization. We also use early stopping based on the validation accuracy.

C.1 Details of CD architecture

We used a multilayer perceptron with seven layers of size 1024 with rectified linear units as activation functions. We use batch normalization for each layer.

C.2 Details of nonlinCause

The network in nonlinCause is selected to be a multilayer perceptron with three layers of size 200 and rectified linear units as activation functions. Using the described tuning procedure, the embedding dimension is selected to be and we used dropout with rate . The results in Figure 4(a) are generated using for both LogCause and nonlinCause, though we observed similar performance gains for other values of the penalization coefficient.

C.3 Details of CauseHyp

In CauseHyp, the causality detector of Lopez-Paz et al. (2016) is implemented by first generating features from the data using a three-layer MLP with 200 hidden nodes in each layer. Then, after averaging over the batch, we use a five-layer MLP with 200 hidden nodes in each layer. The architecture for the entire network is described in Section 3.2.
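The embed, average, and classify structure of this detector can be sketched as a plain forward pass. The layer counts and widths follow the description above, while everything else is an assumption made for illustration: the input is taken to be a set of (x, y) pairs, the activations are rectified linear units, the five-layer classifier is read as four hidden layers plus a single sigmoid output, and the weights are random rather than trained. The original implementation used Theano; NumPy is used here only to keep the sketch self-contained.

```python
import numpy as np


def init_mlp(rng, sizes):
    """He-initialized weights and zero biases for consecutive layer sizes."""
    return [(rng.normal(scale=np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def mlp(params, x, final_linear=False):
    """Apply dense layers with ReLU; optionally keep the last layer linear."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if not (final_linear and i == len(params) - 1):
            x = np.maximum(x, 0.0)
    return x


def causality_score(embed_params, clf_params, pairs):
    """Score a set of observations; pairs has shape (n_samples, 2)."""
    features = mlp(embed_params, pairs)             # per-sample features
    pooled = features.mean(axis=0, keepdims=True)   # average over the batch
    logit = mlp(clf_params, pooled, final_linear=True)
    return 1.0 / (1.0 + np.exp(-logit[0, 0]))       # assumed sigmoid output


rng = np.random.default_rng(0)
embed_params = init_mlp(rng, [2, 200, 200, 200])            # three layers of 200
clf_params = init_mlp(rng, [200, 200, 200, 200, 200, 1])    # five layers
pairs = rng.normal(size=(64, 2))                            # synthetic input
print("untrained score:", causality_score(embed_params, clf_params, pairs))
```

Averaging over the batch makes the score independent of the order of the observations, so the network operates on the sample as a set rather than as a sequence.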

Appendix D Heart failure cohort design

Case patients were 40 to 85 years of age at the time of HF diagnosis. HF diagnosis (HFDx) is defined as:

  1. Qualifying ICD-9 codes for HF appeared in the encounter records or medication orders.

  2. A minimum of three clinical encounters with qualifying ICD-9 codes had to occur within 12 months of each other, where the date of diagnosis was assigned to the earliest of the three dates. If the time span between the first and second appearances of the HF diagnostic code was greater than 12 months, the date of the second encounter was used as the first qualifying encounter. The date at which HF diagnosis was given to the case is denoted as HFDx.
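One way to read the date-assignment rule above is as a sliding window over the sorted qualifying encounters: the diagnosis date is the earliest encounter that starts a run of three encounters falling within 12 months. The sketch below follows that reading; the function name is hypothetical, and approximating 12 months by 365 days is an assumption.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=365)  # "12 months", approximated as 365 days (assumption)


def hf_diagnosis_date(encounter_dates):
    """Return the HFDx date, or None if the criteria are never met.

    encounter_dates: dates of encounters carrying a qualifying ICD-9 code.
    """
    dates = sorted(encounter_dates)
    # Slide over consecutive triples; the first triple that fits inside the
    # window defines the diagnosis date (its earliest encounter).
    for first, third in zip(dates, dates[2:]):
        if third - first <= WINDOW:
            return first
    return None


encounters = [date(2010, 1, 1), date(2011, 6, 1), date(2011, 9, 1), date(2012, 2, 1)]
print(hf_diagnosis_date(encounters))
```

In this example the first encounter is more than 12 months before the second, so the window restarts at the second encounter, which then becomes the first qualifying encounter.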

Up to ten eligible controls (in terms of sex, age, location) were selected for each case, yielding an overall ratio of 9 controls per case. Each control was also assigned an index date, which is the HFDx of the matched case. Controls are selected such that they did not meet the operational criteria for HF diagnosis prior to the HFDx plus 182 days of their corresponding case. Control subjects were required to have their first office encounter within one year of the matching HF case patient’s first office visit, and have at least one office encounter 30 days before or any time after the case’s HF diagnosis date to ensure similar duration of observations among cases and controls.

Appendix E Qualitative results

Tables 5 and 6 are discussed in the experiments section for qualitative evaluation of the results.

Code Description
794.31 Nonspecific abnormal electrocardiogram [ECG] [EKG] 0.3422 0.9351
425.8 Cardiomyopathy in other diseases classified elsewhere 0.3272 0.2322
786.05 Shortness of breath 0.3124 0.5536
424.90 Endocarditis, valve unspecified, unspecified cause 0.3086 0.3908
425.4 Other primary cardiomyopathies 0.2880 0.1351
427.9 Cardiac dysrhythmia, unspecified 0.2531 0.9864
785.9 Other symptoms involving cardiovascular system 0.2377 0.8024
585.6 End stage renal disease 0.2225 0.3948
511.9 Unspecified pleural effusion 0.2218 0.0839
425.9 Secondary cardiomyopathy, unspecified 0.2203 0.8024
782.3 Edema 0.2065 0.0027
278.01 Morbid obesity 0.1955 0.0345
424.0 Mitral valve disorders 0.1948 0.0003
427.31 Atrial fibrillation 0.1762 1.0000
410.90 Acute myocardial infarction of unspecified site, episode of care unspecified 0.1756 0.2510
426.3 Other left bundle branch block 0.1690 0.4890
424.1 Aortic valve disorders 0.1649 0.0012
879.8 Open wound(s) (multiple) of unspecified site(s), without mention of complication 0.1645 0.6399
429.3 Cardiomegaly 0.1619 0.5022
780.60 Fever, unspecified 0.1602 0.7747
482.9 Bacterial pneumonia, unspecified 0.1514 0.7482
786.09 Other respiratory abnormalities 0.1454 0.7305
496 Chronic airway obstruction, not elsewhere classified 0.1403 0.9990
V42.0 Kidney replaced by transplant 0.1398 0.4351
250.03 Diabetes mellitus without mention of complication, type I [juvenile type], uncontrolled 0.1388 0.4727
276.51 Dehydration 0.1347 0.6738
403.10 Hypertensive chronic kidney disease, benign, with chronic kidney disease stages I-IV 0.1316 0.7488
250.50 Diabetes with ophthalmic manifestations, type II, not stated as uncontrolled 0.1283 0.2271
427.89 Other specified cardiac dysrhythmias 0.1282 0.9416
250.51 Diabetes with ophthalmic manifestations, type I [juvenile type], not stated as uncontrolled 0.1234 0.5473
Table 5: Top 30 codes increasing the heart failure risk identified by the LogCause algorithm.
Code Description
782.3 Edema -1.7355 0.0027
424.1 Aortic valve disorders -2.2021 0.0012
425.4 Other primary cardiomyopathies -2.2330 0.1351
V70.0 Routine general medical examination at a health care facility -2.2420 0.0000
443.9 Peripheral vascular disease, unspecified -2.2668 0.1145
424.0 Mitral valve disorders -2.3088 0.0003
250.50 Diabetes with ophthalmic manifestations, type II, not stated as uncontrolled -2.3288 0.2271
511.9 Unspecified pleural effusion -2.3320 0.0839
427.32 Atrial flutter -2.3508 0.0767
278.01 Morbid obesity -2.3924 0.0345
425.8 Cardiomyopathy in other diseases classified elsewhere -2.4090 0.2322
362.01 Background diabetic retinopathy -2.4536 0.0815
584.9 Acute kidney failure, unspecified -2.4661 0.2882
412 Old myocardial infarction -2.4750 0.0780
428.0 Congestive heart failure, unspecified -2.4946 0.0039
791.0 Proteinuria -2.4984 0.1883
357.2 Polyneuropathy in diabetes -2.5040 0.2120
402.90 Unspecified hypertensive heart disease without heart failure -2.5103 0.2034
250.42 Diabetes with renal manifestations, type II or unspecified type, uncontrolled -2.5310 0.2378
280.9 Iron deficiency anemia, unspecified -2.5661 0.0283
427.1 Paroxysmal ventricular tachycardia -2.5884 0.0367
V53.31 Fitting and adjustment of cardiac pacemaker -2.6014 0.2894
459.81 Venous (peripheral) insufficiency, unspecified -2.6093 0.0748
410.90 Acute myocardial infarction of unspecified site, episode of care unspecified -2.6154 0.2510
588.81 Secondary hyperparathyroidism (of renal origin) -2.6199 0.0478
250.62 Diabetes with neurological manifestations, type II or unspecified type, uncontrolled -2.6367 0.2051
414.8 Other specified forms of chronic ischemic heart disease -2.6403 0.0923
362.02 Proliferative diabetic retinopathy -2.6490 0.2629
586 Renal failure, unspecified -2.7324 0.2388
250.52 Diabetes with ophthalmic manifestations, type II or unspecified type, uncontrolled -2.7375 0.2154
Table 6: Top 30 codes with highest mutual information with the heart failure outcome. is the mutual information between and .