Structural Agnostic Model
We present the Structural Agnostic Model (SAM), a framework to estimate end-to-end non-acyclic causal graphs from observational data. In a nutshell, SAM implements an adversarial game in which a separate model generates each variable, given real values from all others. In tandem, a discriminator attempts to distinguish between the joint distributions of real and generated samples. Finally, a sparsity penalty forces each generator to consider only a small subset of the variables, yielding a sparse causal graph. SAM scales easily to hundreds of variables. Our experiments show the state-of-the-art performance of SAM on discovering causal structures and modeling interventions, in both acyclic and non-acyclic graphs.
The increasing need for interpretable machine learning places renewed importance on learning causal models. While the gold standard to establish causal relationships remains randomized experiments (Pearl, 2003; Imbens and Rubin, 2015), those often turn out to be costly, unethical, or infeasible in practice. For these reasons, hypothesizing causal relations from observational data has attracted a lot of interest in machine learning (Lopez-Paz et al., 2015; Mooij et al., 2016), with the goal of prioritizing experiments or simply interpreting data. This is often referred to as “causal discovery”. Observational causal discovery has found applications in fields such as bioinformatics (to infer network structures from gene expression data (Sachs et al., 2005)) and economics (to model the impact of monetary policies (Chen et al., 2007)).
One formal way to describe causal models is to use the framework of Functional Causal Models, or FCMs (Pearl, 2009). FCMs describe the causal structure of variables $X = (X_1, \ldots, X_d)$ using the equations:
$$X_i = f_i(X_{\mathrm{Pa}(i)}, E_i), \qquad (1)$$
for all $i = 1, \ldots, d$, where $\mathrm{Pa}(i) \subset \{1, \ldots, d\}$ are the indices of the direct causes of $X_i$, and $E_i$ is a source of random noise accounting for uncertainty. Finally, $f_i$ is a deterministic function that computes the effect $X_i$ from the causes $X_{\mathrm{Pa}(i)}$ and noise $E_i$.
The summary of an FCM is a causal graph associating each variable $X_i$ with one node, and modeling each direct causal relation from $X_j$ to $X_i$ as the directed edge $X_j \to X_i$. The goal of observational causal discovery is to learn this causal graph from samples of the joint probability distribution of $X$. In the restrictive case where the causal graph is a DAG, causal discovery builds on the principle of $d$-separation from Bayesian Networks (Heckerman et al., 1995).
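To make the FCM definition concrete, the following minimal sketch samples from a small acyclic FCM by evaluating each mechanism on its parents and a noise draw. The three-variable chain, the mechanisms, and the function name `sample_fcm` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sample_fcm(parents, mechanisms, n_samples, rng):
    """Sample from an acyclic FCM.

    parents[i] lists the indices of the direct causes of variable i.
    Variables are assumed to be indexed in topological order, so every
    parent has a smaller index than its child.
    """
    d = len(parents)
    X = np.zeros((n_samples, d))
    for i in range(d):
        noise = rng.normal(size=n_samples)  # E_i, independent across variables
        X[:, i] = mechanisms[i](X[:, parents[i]], noise)
    return X

# Hypothetical chain X0 -> X1 -> X2 with non-linear mechanisms.
parents = [[], [0], [1]]
mechanisms = [
    lambda pa, e: e,                                  # root cause: pure noise
    lambda pa, e: np.tanh(2.0 * pa[:, 0]) + 0.1 * e,  # additive noise
    lambda pa, e: pa[:, 0] ** 2 + 0.1 * e,
]
rng = np.random.default_rng(0)
X = sample_fcm(parents, mechanisms, n_samples=500, rng=rng)
```

The causal graph summarizing this FCM is exactly the `parents` structure: one node per variable, one directed edge per listed parent.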
The main limitation of state-of-the-art observational causal discovery is two-fold. On the one hand, searching for a causal graph is a combinatorial optimization problem, whose complexity grows super-exponentially in the number of variables. On the other hand, most phenomena in nature arguably lead to causal graphs with cycles. As one example, biological pathways may include positive and negative feedback loops (Sachs et al., 2009). Some alternatives to learn non-acyclic causal graphs have been proposed (Lacerda et al., 2012; Rothenhäusler et al., 2015). Many of these approaches unfold the cycles through time, as is done in Dynamic Bayesian Networks (Murphy and Russell, 2002). However, observational data often ignores the time information: in cross-sectional studies, values are averaged over time.
Our contribution is a novel approach to identify non-acyclic causal structures and all their associated causal mechanisms $f_i$, purely from continuous observational data. Our framework, the Structural Agnostic Model (SAM), learns each causal mechanism $f_i$ as a conditional neural network generator (Mirza and Osindero, 2014). This extends the bivariate cause-effect model of Lopez-Paz and Oquab (2017). SAM is “agnostic” in several respects: it eliminates the need for the “causal Markov condition” (contingent on directed acyclic graphs), the causal faithfulness condition (allowing us in particular to represent complex non-linear relationships like the XOR problem), and various other common assumptions, such as linearity of relationships, and Gaussianity or non-Gaussianity of the noise. Lastly, the commonly made “causal sufficiency” condition (that there exists no hidden common cause of any pair of observed variables) can also be relaxed.
SAM implements an adversarial game (Goodfellow et al., 2014) where each conditional generator (an artificial neural network) “plays” to produce one of the variables $X_i$, given real observations of a (small) subset of all the others, hypothesized to be its direct causes, and noise. In tandem, a joint discriminator “plays” to distinguish between real and generated observational samples. As a result of jointly learning neural networks over samples, SAM estimates end-to-end the complete causal graph and underlying causal mechanisms. For acyclic causal graphs, SAM can simulate either the observational distribution or any interventional distribution (for given interventions) (Pearl, 2003). For non-acyclic graphs, SAM can be used to simulate the instantaneous effects of interventions. In both cases, these are crucial tools to understand the causal functioning of a system of variables. The performance and the scalability of SAM are examined and compared to the state-of-the-art through an extensive variety of experiments on both controlled and real data.
There exists a vast literature on observational causal discovery, surveyed in volumes such as Spirtes et al. (2000) and Peters et al. (2017). A first class of observational causal discovery algorithms (Spirtes et al., 2000) considers the recovery of directed and acyclic causal graphs. These algorithms proceed by starting with a complete graph, and then using information about conditional independences to progressively remove edges. A second class of approaches associates one score with each candidate causal graph, and performs a combinatorial search to find the best candidate according to that score (Chickering, 2002). Although there exist many heuristics to explore the enormous space of directed acyclic graphs (Tsamardinos et al., 2006), many of these might become impractical for a high number of variables.
A second family of algorithms avoids the explicit combinatorial search of graphs by placing strong assumptions on the generative process of the observational data. For instance, Meinshausen and Bühlmann (2006) assume Gaussian data to recover the causal skeleton by examining the non-zero entries in the inverse covariance matrix of the data. In a similar spirit, another stream of works (Shojaie and Michailidis, 2010; Ren et al., 2016) assumes sparse causal graphs, and leverages Lasso-type regression techniques to reveal the strongest causal relations in the data. Still, all of these algorithms place Gaussianity assumptions (linear causal mechanisms with additive Gaussian noise) on the generative process of the data. This is a constraining limitation: not only is real-world data rarely Gaussian, but linear-Gaussian causal mechanisms also do not contain any asymmetries that could be leveraged as causal footprints in the data.
These asymmetries, or causal footprints, are the main target of a third family of observational causal discovery methods. On the one hand, some of these algorithms target the causal relation between two variables. Examples are the Additive Noise Model, or ANM (Hoyer et al., 2009); the Gaussian Process Inference causal model, or GPI (Stegle et al., 2010); and the Randomized Causation Coefficient, or RCC (Lopez-Paz et al., 2015). On the other hand, some of these algorithms are able to estimate the causal structure underlying multiple variables. These include the Causal Additive Model, or CAM (Bühlmann et al., 2014), and the Causal Generative Neural Network, or CGNN (Goudet et al., 2017). Both CAM and CGNN leverage conditional independences and distributional asymmetries for multivariate causal discovery. However, CAM and CGNN rely on the undesirable combinatorial search of directed acyclic graphs to find the best causal structure.
Our work aims to combine the best characteristics of these three families: exploring conditional independences, regularizing sparsity to avoid combinatorial graph search, and revealing distributional asymmetries. Furthermore, SAM exploits the representational power of deep learning in two ways. First, we estimate a probabilistic representation of each of the causal mechanisms forming the FCM that underlies the data. Second, we train an adversarial discriminator. This replaces the use of restrictive likelihood functions on possibly complex data (for instance, mean squared error minimization). During adversarial training, a sparsity penalty forces each estimated causal mechanism to consider only a small subset of the observed variables. Therefore, SAM borrows inspiration from sparse feature selection methods (Leray and Gallinari, 1999), and conditional adversarial training (Goodfellow et al., 2014; Mirza and Osindero, 2014).
Our starting point is the complete FCM on $d$ variables $X = (X_1, \ldots, X_d)$. In the complete FCM, each variable $X_i$ is caused by all other variables $X_{\setminus i}$ and an independent noise $E_i$:
$$X_i = f_i(X_{\setminus i}, E_i), \qquad (2)$$
for all $i = 1, \ldots, d$. The complete FCM may describe the local equilibria of a dynamic process without self-loops (Lacerda et al., 2012). But we also interpret it as a predictor of the instantaneous evolution of a quasi-static or slowly evolving system whose variables may be integrated or averaged over a given period of time. Without loss of generality, we assume that the noise variables $E_1, \ldots, E_d$ are jointly independent and follow a Gaussian distribution (since the functional combination of noise and observed variables may include any functional transform of the noise, leading to non-Gaussianity). The joint independence of noise terms presumes causal sufficiency: that is, there exist no hidden common causes of two observed variables. We will first present results making this common assumption and then propose a way to relax it in Section 5.
Any non parametric differentiable model may be used to represent causal mechanisms in our framework. In practice, we use a neural network with one hidden layer of neurons:
with , , , and . refers to element-wise product.
The neural network (Eq. 3) has three peculiarities:
Each observed input variable $X_j$ is multiplied by a causal filter $a_{i,j}$, where $a_i = (a_{i,1}, \ldots, a_{i,d})$ denotes the filter vector of the $i$-th generator. These causal filters will determine the estimated causal graph. Indeed, $a_{i,j}$ is a discriminant value in the problem of classifying the relationship between $X_j$ and $X_i$ as being a direct causal link vs. other possibilities (including indirect link, opposite causal link, common cause, or independence). To some extent, $a_{i,j}$ may be interpreted as a “causation coefficient” estimating the strength of a putative causal relationship.
The rows of the input weight matrix are normalized to unit Euclidean norm. So, the influence of each observed input variable $X_j$ on the outcome is solely controlled by the coefficient $a_{i,j}$.
The value of the noise variable $E_i$ is drawn from a Gaussian distribution at every evaluation. (Sampling the noise variables from a Gaussian distribution does not prevent the generators from producing non-Gaussian noise at their output, as the generative neural nets can generate complex distributions with normal variables as input.) Therefore, the neural network estimates an entire conditional distribution.
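The three peculiarities above can be illustrated with a small NumPy forward pass. This is a sketch under stated assumptions rather than the authors' implementation: the exact parameterization of Eq. (3) is not reproduced here, and the class name `FilteredGenerator` and the parameter names `W`, `w_noise`, `b`, `m`, `beta` are hypothetical.

```python
import numpy as np

class FilteredGenerator:
    """Sketch of one conditional generator: filtered inputs, tanh hidden layer, noise input."""

    def __init__(self, d, i, n_hidden, rng):
        # Causal filter: one coefficient per input variable, initialized at one;
        # the self-loop coefficient is held at zero.
        self.a = np.ones(d)
        self.a[i] = 0.0
        # Input weights, one n_hidden-dimensional row per input variable,
        # normalized to unit Euclidean norm so that only `a` scales each input.
        W = rng.normal(size=(d, n_hidden))
        self.W = W / np.linalg.norm(W, axis=1, keepdims=True)
        # Remaining parameters (hypothetical names): noise weights, biases, output layer.
        self.w_noise = rng.normal(size=n_hidden)
        self.b = np.zeros(n_hidden)
        self.m = rng.normal(size=n_hidden) / np.sqrt(n_hidden)
        self.beta = 0.0

    def __call__(self, x, rng):
        # Fresh Gaussian noise at every evaluation, so the output is a sample
        # from a conditional distribution rather than a point prediction.
        noise = rng.normal(size=(x.shape[0], 1))
        hidden = np.tanh((x * self.a) @ self.W + noise * self.w_noise + self.b)
        return hidden @ self.m + self.beta


rng = np.random.default_rng(0)
gen = FilteredGenerator(d=4, i=2, n_hidden=8, rng=rng)
x = rng.normal(size=(5, 4))
x_generated = gen(x, rng)  # one generated value of variable 2 per row of x
```

Because the filter entry for the generated variable itself is zero, changing column 2 of `x` leaves the output unchanged for a fixed noise draw; setting any other filter entry to zero removes the corresponding putative cause in the same way.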
In the following, we explain how we use adversarial training (Goodfellow et al., 2014) to estimate the causal mechanisms . For all , we generate the synthetic sample by employing: (i) the current estimate of the causal mechanism given by Eq. (3), (ii) the values for all the other observed variables given by a real observational sample , and (iii) . Then, we employ a single discriminator to distinguish between real observational samples and “almost real” observational samples , which differ only in the -th coordinate. More specifically, we define the following adversarial losses:
for , which are minimized to update the parameters of the shared discriminator , and maximized to update the parameters of the causal mechanisms .
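The construction of “almost real” samples can be sketched as follows. The helper names are hypothetical, the toy generators stand in for the learned mechanisms, and the loss shown is the standard GAN cross-entropy, which is an assumption here: it is merely consistent with the description of a loss that the discriminator minimizes and the generators maximize.

```python
import numpy as np

def almost_real_batches(x_real, generators, rng):
    """Return d batches; batch i equals x_real except column i, which is generated."""
    batches = []
    for i, generate in enumerate(generators):
        x_fake = x_real.copy()
        x_fake[:, i] = generate(x_real, rng)  # other columns keep their real values
        batches.append(x_fake)
    return batches

def cross_entropy_loss(d_real, d_fake, eps=1e-12):
    """Assumed loss: d_real and d_fake are discriminator outputs in (0, 1)."""
    return float(-np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps)))

def make_toy_generator(i):
    """Stand-in generator: mean of the other variables plus Gaussian noise."""
    def generate(x, rng):
        others = np.delete(x, i, axis=1)
        return others.mean(axis=1) + 0.1 * rng.normal(size=x.shape[0])
    return generate

rng = np.random.default_rng(0)
x_real = rng.normal(size=(6, 3))
batches = almost_real_batches(x_real, [make_toy_generator(i) for i in range(3)], rng)
loss_at_chance = cross_entropy_loss(np.full(6, 0.5), np.full(6, 0.5))
```

A single discriminator scores all `d` batches, so each generator receives a training signal that depends only on how plausible its own coordinate is given the real values of the others.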
In SAM, the structure of the learned causal graph is given by the collection of causal filters $a_1, \ldots, a_d$.
In particular, SAM estimates that $X_j \to X_i$ if $a_{i,j}$ is greater than a given threshold.
For better interpretability and generalization, we would be interested in recovering simple graphs: that is, a model where each causal mechanism has few non-zero filter coefficients. (In SAM models, exogenous variables are those whose filter vector is a vector of almost-zeros, and which are therefore generated purely from the source of noise.)
Borrowing inspiration from Lasso variable selection (Meinshausen and Bühlmann, 2006), we build our final loss function by adding a sparsity penalty to the collection of causal filters:
Figure 1 illustrates the SAM model. The final loss (4) implements a competition between sparse causal mechanisms and a shared discriminator . This sidesteps the usual combinatorial optimization problem involved in causal graph search. SAM does not avoid the construction of cycles in the estimated causal graph.
This methodological approach is justified for three reasons:
Avoiding the use of complex penalization techniques to prune cycles.
Allowing the modeling of generative processes that may contain cycles.
Allowing non-identifiable causal relationships to be modeled as bi-directed arrows (even in DAGs).
Our experimental results confirm the validity of our approach, which avoids forcing local conclusions regarding causal relationships and propagating errors by enforcing a DAG structure.
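As an illustration of how the filters yield a graph without any acyclicity constraint, the sketch below computes an L1-type sparsity penalty and thresholds a filter matrix into an adjacency matrix. The L1 form, the weight `lam`, the threshold value, and the convention that `filters[i, j]` is the filter of generator `i` on input `j` are assumptions made for this example.

```python
import numpy as np

def sparsity_penalty(filters, lam):
    """Lasso-style penalty: lam times the sum of absolute filter values."""
    return lam * np.abs(filters).sum()

def filters_to_adjacency(filters, threshold):
    """adjacency[i, j] is True when variable j is estimated to cause variable i."""
    adjacency = filters > threshold
    np.fill_diagonal(adjacency, False)  # self-loops are never allowed
    return adjacency

# Hypothetical learned filters for three variables.
filters = np.array([
    [0.0, 0.9, 0.05],
    [0.8, 0.0, 0.02],
    [0.1, 0.7, 0.0],
])
adjacency = filters_to_adjacency(filters, threshold=0.5)
penalty = sparsity_penalty(filters, lam=0.01)
```

In this toy matrix both `adjacency[0, 1]` and `adjacency[1, 0]` are true: the 2-cycle between variables 0 and 1 is kept as is, rather than being pruned to satisfy a DAG constraint.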
This section pursues two goals: interpreting the adjacency matrix given by SAM’s filters (Section 4.4) and evaluating the performance of SAM when compared to the state-of-the-art on a variety of causal graph discovery benchmarks (Section 4.5). We emphasize the discovery of non-linear causal mechanisms with non-additive noise terms, on both acyclic and non-acyclic graphs. All our code is available at https://github.com/Diviyan-Kalainathan/SAM.
We generate seven types of graphs, where :
Sigmoid AM: , where with , and .
Sigmoid Mix: , where is as in the previous bullet-point.
GP AM: where is a univariate Gaussian process with a Gaussian kernel of unit bandwidth.
GP Mix: , where is a multivariate Gaussian process with a Gaussian kernel of unit bandwidth.
Polynomial: , or , where is a random polynomial with a random degree in .
NN: , where is a random single-hidden-layer neural network with 20 ReLU units.
Under each of the seven generative processes, we build 20 directed acyclic graphs of 20 variables, and 10 directed acyclic graphs of 100 variables. We sample 500 data points for each graph. The samples are generated by (i) sampling values for the root causes, using a random Gaussian distribution for the datasets Linear, Sigmoid, and GP, and four-component Gaussian mixture models with random means and random covariances for the datasets Polynomial and NN; and (ii) running the causal mechanisms in the topological order of the graph.
Using the same seven generative processes, we build 20 cyclic graphs of 20 variables, and 10 cyclic graphs of 100 variables. In order to generate these graphs, for we sample two sets of 500 data points and from Gaussian distribution. Then, at each iteration, we generate each variable using the corresponding causal mechanism and the values of its parents at the previous time step: for . We repeat the process 50 times, and then average the values of the samples of each variable during 50 iterations: .
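The iterative procedure for cyclic graphs can be sketched as below. Two points are assumptions of this sketch: the two Gaussian sets are read as initial values and noise terms, with the noise kept fixed across iterations, and the average is taken over the same 50 iterations. The 2-cycle and its mechanisms are hypothetical.

```python
import numpy as np

def simulate_cyclic(parents, mechanisms, n_samples=500, n_iter=50, rng=None):
    """Iterate the mechanisms on the previous time step and average over iterations."""
    rng = np.random.default_rng() if rng is None else rng
    d = len(parents)
    x = rng.normal(size=(n_samples, d))      # initial values
    noise = rng.normal(size=(n_samples, d))  # noise terms, fixed across iterations
    total = np.zeros_like(x)
    for _ in range(n_iter):
        x_next = np.empty_like(x)
        for i in range(d):
            # Each variable uses the values of its parents at the previous step.
            x_next[:, i] = mechanisms[i](x[:, parents[i]], noise[:, i])
        x = x_next
        total += x
    return total / n_iter

# Hypothetical 2-cycle X0 <-> X1 with contracting mechanisms.
parents = [[1], [0]]
mechanisms = [
    lambda pa, e: 0.5 * np.tanh(pa[:, 0]) + 0.3 * e,
    lambda pa, e: -0.5 * pa[:, 0] + 0.3 * e,
]
X_cyclic = simulate_cyclic(parents, mechanisms, rng=np.random.default_rng(0))
```

Averaging over iterations mimics cross-sectional data in which the time information has been discarded.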
We recall that $a_{i,j}$ is a discriminant value in the problem of classifying the relationship between $X_j$ and $X_i$ as being a direct causal link vs. other possibilities (including indirect link, opposite causal link, common cause, or independence). There is always a trade-off between true positives and false positives, often evidenced by ROC curves or precision-recall curves. We chose precision-recall because it puts more emphasis on the positive class. To avoid choosing a specific threshold on $a_{i,j}$, we use the Average Precision (AP). This metric summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as a weight:
$$\mathrm{AP} = \sum_n (R_n - R_{n-1}) \, P_n,$$
where $P_n$ and $R_n$ are the precision and recall at the $n$-th threshold. The AP lies between 0 and 1, and 1 is best.
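The AP definition can be checked on a small example. The function below is a minimal sketch that treats every score as its own threshold, so it assumes distinct scores; the edge labels and filter values are made up for illustration.

```python
import numpy as np

def average_precision(y_true, scores):
    """AP = sum over thresholds of (recall increase) * precision."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # decreasing score
    y = np.asarray(y_true)[order]
    true_positives = np.cumsum(y)
    precision = true_positives / np.arange(1, len(y) + 1)
    recall = true_positives / y.sum()
    recall_prev = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - recall_prev) * precision))

# Four candidate edges: 1 marks a true direct causal link, scores are filter values.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

Here the two true edges are ranked first and third, which gives precisions of 1 and 2/3 at the two recall increases of 0.5 each, hence an AP of 5/6.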
In all experiments, SAM employs generators with one hidden layer of Tanh units, and one discriminator with two hidden layers of LeakyReLU units and batch normalization (Ioffe and Szegedy, 2015). All observed variables are normalized to zero mean and unit variance. All causal filters are initialized at one, except the self-loop terms, which are always held constant at zero. We use the Adam optimizer (Kingma and Ba, 2014) with learning rate 0.1 for epochs to train our neural networks. The noise terms appearing in each generator function are sampled anew at each epoch from a Gaussian distribution.
A key aspect of our method is to perform a calibration of hyper-parameters using a few “representative” problems for which the ground truth of causal relationships is known. In the experiments carried out in this paper, we optimized the hyper-parameters using only 10 training graphs (five of 20 variables and five of 100 variables) drawn from models of type VI, where the number of parents for each variable is uniform between zero and five. This led us to choose the values , for generating model graphs with 20 variables and , for graphs with 100 variables.
In this section, we provide graphical illustrations to interpret qualitatively the pair as indicators of the causal relation between and . Figure 2 plots the pairs with for some of our experiments on dimensions. Each pair is colored according to their ground truth: blue for , red for , green for indirect causal relation (there exists such that , or , yellow for confounded (there exists such that ), and gray for other. Additionally, we mark with crosses the pairs of variables that do not share a v-structure, and mark with a triangle pairs of variables sharing a cycle.
Then, four cases arise:
Both filter values close to zero (bottom left, grey area in Fig. 3): evidence that the two variables are statistically independent.
Both filter values far from zero (bisecting line, purple area in Fig. 3): evidence that the two variables are statistically dependent, but the causal direction is not identifiable. Several possible explanations could be given: (1) presence of a common cause (confounder), (2) presence of a 2-cycle (such as in Figure 1(d)-(2)), or (3) non-identifiable causal relation (such as linear-Gaussian associations, Figure 1(b)-(2)). Although not considered here, this case would also hint at the possible existence of hidden confounders.
(Bottom right, blue area on Fig.3): evidence that and share the causal relation .
These figures qualitatively illustrate properties of SAM, some of which are quantified in the next section. As explained in Figure 3, the further a pair is from the origin, the stronger the evidence for an association. The further a pair is from the diagonal, the stronger the evidence for a particular causal direction.
We observe that SAM is able to discriminate between causal directions (red versus blue points), and discriminate between causal relations (red and blue points) and non-causal associations (green, yellow, and grey points).
The four plots in Figure 2 highlight the different levels of difficulty of each dataset. For instance, the Linear dataset in Figure 1(b) shows a higher concentration of red/blue points along the diagonal, since in these data there is less asymmetry between cause and effect (in particular, these data will exhibit conditional independences but not distributional asymmetries). In contrast, the Sigmoid dataset in Figure 1(c) is much easier to solve, thanks to the strong non-linear nature of its causal mechanisms and the absence of cycles.
Figure 1(a)-(2) shows that the correct causal direction can be recovered by SAM for almost-linear relations between and in the absence of v-structure. Finally, SAM leverages multivariate non-linear interactions to discover direct causal relations: Figure 1(c)-(3) shows no strong dependence between and , but SAM is able to recover the true causal relation .
We compare SAM with a variety of state-of-the-art algorithms on the task of observational causal discovery: the PC algorithm (Spirtes et al., 2000), the score-based method GES (Chickering, 2002), the hybrid method MMHC (Tsamardinos et al., 2006), the L1 penalized method DAGL1 (Fu and Zhou, 2013), the LiNGAM algorithm (Shimizu et al., 2006) and the causal additive model CAM (Peters et al., 2014). When considering non-acyclic causal structures, we also compare to CCD (Richardson, 1996). Other methods such as BackShift (Rothenhäusler et al., 2015), which require observations from different interventions, are out of our scope.
More specifically, for PC, we employ the better-performing, order-independent version of the PC algorithm proposed by Colombo and Maathuis (2014). We compare against two versions of PC: one using a Gaussian conditional independence test on z-scores, and one using the HSIC independence test (Zhang et al., 2012) with a Gamma null distribution (Gretton et al., 2005). PC, GES and LiNGAM are implemented in the pcalg package (Kalisch et al., 2012). MMHC is implemented with the bnlearn package (Scutari, 2009). DAGL1 is implemented with the sparsebn package (Aragam et al., 2017). CCD is implemented from the Tetrad package (Scheines et al., 1998). The code of SAM is available online at https://github.com/Diviyan-Kalainathan/SAM. All results are averaged over runs.
| PC Gauss | PC HSIC | GES | MMHC | DAGL1 | LiNGAM | CAM | SAM |
Table 1 summarizes the results for all algorithms, datasets, and graph dimensions for the experiments on acyclic graphs. Overall, our results showcase the good performance of SAM. This is especially the case for non-linear datasets such as Sigmoid Mix and NN. Although SAM is a non-linear method, it also shows surprisingly good performance on linear problems (which may be due to the linear behaviour of Tanh around the origin). Linear methods such as PC-Gauss and GES also obtain good performance on linear problems, as expected. Other non-linear methods such as PC-HSIC exhibit worse performance on linear Gaussian problems, but perform better with complex distributions, because they use a non-parametric conditional independence test. As a note, we canceled the PC-HSIC algorithm after 50 hours of running time.
CAM outperforms SAM on the dataset GP AM, which corresponds directly to the model fitted by CAM (additive contributions, additive noise and Gaussian mechanisms). However, if one can assume additive noise, it is straightforward to restrict the structure of the conditional generators of SAM in order to better capture this assumption. The same goes for assumptions about linear mechanisms: in this case, the conditional generators of SAM would be restricted to linear neural networks. In terms of computation time, SAM scales well at variables, especially when compared to the best competitor CAM, which uses a combinatorial graph search.
Table 2 summarizes the results for all algorithms, datasets, and graph dimensions for experiments on cyclic graphs. SAM performs better than most methods, including CCD, which was specifically designed to address causal relationships in cyclic graphs, like SAM. In general, all algorithms degrade in performance when moving from acyclic graphs to cyclic graphs. Notably, the causal Markov assumption is no longer verified in the presence of cycles, and cycles are known to be particularly hard to detect using only observational data.
We apply the previous algorithms to a real-world problem studying the reconstruction of a protein network (Sachs et al., 2005). We use the same hyperparameters as in the previous section. The results are given in Table 3: SAM performs consistently compared to other algorithms.
Figure 4 shows the causal structure estimated by SAM. While SAM misses a number of causal relations, the recovered edges almost always agree with the ground truth given by Sachs et al. (2005, Fig. 2).
Notably, SAM correctly identifies the important pathway raf → mek → erk, with two arrows associated with a direct enzyme-substrate causal effect. We also recover the well-established influence of plcg on pip2 reported in Sachs et al. (2005).
The edge from erk to pka, marked as wrong in Figure 4, corresponds to a cycle in our model and was not reported in the original consensus network of Sachs et al. (2005), which is supposed to be acyclic. However, this link is reported in a more recent model by the same authors (Itani et al., 2010) that may include cycles.
We evaluated the computational complexity of our method. We recall that $n$ is the number of training examples and $d$ the number of variables, and that training with stochastic gradient descent incurs a computational complexity proportional to the number of training examples. We have $d$ generators with $d$ inputs each, resulting in a number of parameters in $O(d^2)$. In comparison, the number of parameters of the discriminator is negligible: $O(d)$.
Thus, the cost of generator and discriminator training being dominated by the generators, the total complexity is $O(n d^2)$.
While we made significant advances in the direction of designing a simple agnostic method for causal discovery lifting most commonly made assumptions (causal Markov, causal faithfulness, additive noise, Gaussianity), there remain some extensions to be made to lift the causal sufficiency assumption, and to generalize the model to binary and categorical variables. Further work could also lead to using SAM to predict the result of interventions. Finally, while our numerical experiments validate the approach, theoretical derivations of identifiability would strengthen the framework.
Regarding causal sufficiency, we performed a preliminary sanity check. We relaxed the independence assumption between the random noise variables to model unobserved common causes. Indeed, by introducing a noise covariance matrix with learnable parameters (Rothenhäusler et al., 2015), one can estimate hidden confounding effects at a minimal computational cost. After learning our SAM model, the presence of non-zero off-diagonal covariance terms indicates the existence of hidden confounders.
To validate this approach, we compare the estimated noise covariance from SAM models when trained on two mini test cases. First, a direct causal effect . Second, a confounded causal effect , where remains unobserved. After running SAM on 256 independent runs, we obtain the two following noise covariance matrices, respectively:
In the second case, the off-diagonal covariance terms reveal a correlation between noise variables with a statistical significance above , which in turn correctly hints at the existence of the hidden confounder .
We introduced SAM, a novel “agnostic” model for causal graph reconstruction, leveraging the power and non-parametric flexibility of Generative Adversarial Neural networks (GANs).
On an extensive set of experiments, we show its excellent performance compared to state-of-the-art methods on the task of discovering causal structures in acyclic and cyclic graphs. In particular, SAM fares well computationally. We evaluated that the computational complexity of SAM scales with the square of the number of observable variables of the system. In comparison, extensive search methods must explore a space of graphs, and in the worst case, algorithms such as the PC algorithm are exponential in the number of variables. SAM lifts most common restrictive assumptions of other methods (causal Markov, causal faithfulness, causal sufficiency, additive noise, Gaussian noise). It performs well across the board, even though it can on occasion be beaten by other models, particularly if those models were used to generate the data. It never performs catastrophically. Its strongest contender is CAM (Bühlmann et al., 2014), which it beats on cyclic graph reconstruction (by design) and on all datasets except those that were generated using the specific modeling assumptions made by CAM (additive noise model).