
Masked Gradient-Based Causal Structure Learning

by Ignavier Ng, et al.

Learning causal graphical models based on directed acyclic graphs is an important task in causal discovery and causal inference. We consider a general framework towards efficient causal structure learning with potentially large graphs. Within this framework, we propose a masked gradient-based structure learning method based on a binary adjacency matrix that exists for any structural equation model. To enable first-order optimization methods, we use the Gumbel-Softmax approach to approximate the binary valued entries of the adjacency matrix, which usually results in real values that are close to zero or one. The proposed method can readily include any differentiable score function and model function for learning causal structures. Experiments on both synthetic and real-world datasets are conducted to show the effectiveness of our approach.





1 Introduction

Causal graphical models based on Directed Acyclic Graphs (DAGs) describe causal systems without latent variables or selection biases, and have important applications in many empirical sciences such as weather forecasting (Abramson et al., 1996), biomedicine and healthcare (Lucas et al., 2004), and biology (Sachs et al., 2005; Pearl, 2009). Although randomized controlled experiments can be conducted to learn the causal structure effectively, they are generally expensive or even impossible in practice. It is hence more appealing to discover causal structures from observational data.

Indeed, there have been many efforts devoted to learning causal DAGs with passively observed data. Existing approaches can be roughly categorized into three classes: score-based methods (Cooper and Herskovits, 1992; Chickering, 2002), constraint-based methods (Spirtes and Glymour, 1991; Colombo and Maathuis, 2014), and hybrid methods that combine the previous two (Tsamardinos et al., 2006). Although constraint-based and hybrid methods are well suited for learning sparse systems, their assumptions on the underlying distributions, e.g., faithfulness, are usually stronger than those of score-based methods (Spirtes, 2010; Kalisch and Bühlmann, 2007). Recent advances on identifiability of causal structures (Shimizu et al., 2011; Peters et al., 2014) have also motivated much research attention on score-based methods (Bühlmann et al., 2014; Zheng et al., 2018; Yu et al., 2019; Lachapelle et al., 2019).

Traditionally, score-based methods first assign a score to each DAG according to some predefined score function, and then search over the space formed by all DAGs to find the one with the optimal score. However, due to the combinatorial nature of the search space (He et al., 2015; Chickering, 1996), brute-force search for the causal structure is usually infeasible. As such, a variety of approximate structure searching strategies have been proposed in the literature. For example, K2 locally finds the best parents set of each variable, assuming that the ordering of the variables is known (Cooper and Herskovits, 1992). Another strategy, Greedy Equivalence Search (GES), attempts to walk through the space of graph structures by adding, deleting or reversing an edge (Chickering, 2002), following a series of rules to avoid creating cycles every time an edge is added or reversed. These methods change the graph in a local way and their performance is usually less satisfactory in practice due to finite data and possible violation of model assumptions.

More recently, Zheng et al. (2018) propose NOTEARS, a new score-based method that formulates the above combinatorial optimization problem as a continuous one by posing a smooth constraint to enforce acyclicity. NOTEARS is specifically designed for linear Structural Equation Models (SEMs), whereas subsequent works DAG-GNN (Yu et al., 2019) and GraN-DAG (Lachapelle et al., 2019) have extended it to nonlinear cases, with Neural Networks (NNs) used to model the causal structures from the observed data.

In contrast with traditional search methods that only change one edge at a time, these methods can utilize existing first-order optimization methods to estimate the causal structure through weighted adjacency matrices: NOTEARS and DAG-GNN assume special forms of structural equations with weighted adjacency matrices that may not hold for general SEMs, while GraN-DAG utilizes feed-forward NNs to model causal relations and further constructs an equivalent weighted adjacency matrix from the NN weights (more details can be found in Section 2.2). A weighted adjacency matrix consisting of d² entries (with d being the number of variables) can make the non-convex optimization problems in NOTEARS, DAG-GNN and GraN-DAG harder to solve with larger graphs. Furthermore, if instead of feed-forward NNs one would like to use other powerful models, e.g., RNN or CNN based architectures, to estimate the causal relations, the equivalent weighted adjacency matrix cannot be easily constructed as in GraN-DAG. It then becomes interesting to devise a generic method so that an adjacency matrix representing the underlying causal structure exists for any SEM, and the continuous formulation of the structure learning problem can include any smooth model for estimating causal mechanisms.

In this work, we consider a general framework for learning causal DAGs towards more efficient learning with potentially larger graphs, as outlined in Fig. 1. This framework is similar to what has been used by Bühlmann et al. (2014); Lachapelle et al. (2019) but with different purposes (see Section 4 for more details). There are three stages: Preliminary Neighborhood Selection (PNS) for variable selection to reduce the search or optimization space, structure learning to estimate the directed graph that best describes the causal structure from observed data, and pruning to induce acyclicity and remove spurious edges. We empirically investigate existing PNS methods and find that these PNS methods may bring a negative effect by removing many true edges (cf. Section 4). We also develop a masked gradient-based causal structure learning method based on a binary adjacency matrix that exists for any SEM. To enable first-order methods, we use the Gumbel-Softmax approach (Jang et al., 2017; Maddison et al., 2016) to approximate the binary valued entries of the adjacency matrix, which usually results in real values close to zero or one (cf. Section 3). Compared with existing gradient-based methods, one can readily plug in any differentiable score function and model function for learning causal structures. Finally, we conduct experiments to show the effectiveness of our approach on both synthetic and real-world datasets.

Figure 1: A general framework for learning causal DAGs.

2 Background and Related Work

We briefly review Structural Equation Models (SEMs) that are widely used in causal inference and structure learning, followed by an introduction to recent gradient-based structure learning methods.

2.1 Causal DAGs and Structural Equation Models

Let G = (V, E) denote a DAG with vertex set V and edge set E. For each variable X_i, we use PA(X_i) to denote the set of parental nodes of X_i, so that there is an edge from each X_j ∈ PA(X_i) to X_i in G. A commonly used model in causal structure learning is the SEM, which contains two types of variables: substantive and noise variables (Spirtes, 2010). The former are of particular interest, since they determine the underlying causal mechanisms for generating the observed data. Each substantive variable is obtained from a function of some (or possibly none) other substantive variable(s) and a unique noise variable. In this work, we focus on the following recursive SEM with additive noises w.r.t. a DAG G:


X_i = f_i(PA(X_i)) + ε_i,  i = 1, …, d,  (1)

where the ε_i are jointly independent noises and the f_i are deterministic functions with input argument PA(X_i). More general SEMs can be similarly treated with our proposed approach, provided with a suitable score function and parametric models for causal relationships; see Section 3 for more details. We also assume that each f_i is non-degenerate w.r.t. PA(X_i), that is, for any X_j ∈ PA(X_i) and any fixed value of the remaining parents, f_i is not a constant function in X_j. This assumption corresponds to causal minimality, commonly used in causal structure learning (Peters et al., 2014, 2017).


Let X = (X_1, X_2, …, X_d) be the variable vector that concatenates all the substantive variables. We use P(X) to denote the marginal distribution over X induced by the above SEM. Then it can be shown that P(X) is Markovian to G (Spirtes et al., 2000; Pearl, 2009), that is, P(X) can be factorized according to G as

P(X) = ∏_{i=1}^{d} P(X_i | PA(X_i)),

where P(X_i | PA(X_i)) denotes the probability of X_i conditioning on its parents PA(X_i). In other words, G and P(X) form a causal Bayesian network (Spirtes, 2010; Peters et al., 2017).

2.2 Gradient-Based Structure Learning Methods

Traditional score-based methods rely on various heuristics such as removing or adding edges to directly search in the space of DAGs for the best DAG according to a score function. In contrast, recent methods have cast this combinatorial optimization problem into a continuous constrained one, which can then be solved by various first-order optimization methods.

NOTEARS (Zheng et al., 2018) is the first to formulate the structure learning problem as a continuous optimization one, thanks to a novel smooth characterization of the acyclicity constraint. NOTEARS is specifically developed for linear SEMs with X_i = w_i^T X + ε_i, where w_i is a coefficient vector and the indices of the non-zero elements in w_i correspond to the parents of X_i. Define W = [w_1 | w_2 | ⋯ | w_d] as the coefficient matrix with each column being w_i; then W can be viewed as the weighted adjacency matrix of the underlying causal DAG, and learning the structure of G is equivalent to learning W. Using the mean squared loss, NOTEARS solves the following optimization problem to estimate W:

min_W (1/2n)‖X − XW‖_F² + λ‖W‖_1   subject to   h(W) := tr(e^{W∘W}) − d = 0,  (2)

where X ∈ R^{n×d} is the design matrix containing n sample vectors, ∘ denotes the Hadamard product, and the penalty term λ‖W‖_1 is used to induce sparsity. Eq. (2) characterizes the acyclicity constraint: as shown in (Zheng et al., 2018, Thm. 2), Eq. (2) holds if and only if W is a weighted adjacency matrix of a DAG. The above problem can be solved by the augmented Lagrangian method (Bertsekas, 1999), followed by thresholding on the estimated adjacency matrix to reduce false discoveries.
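To make the constraint concrete, here is a minimal numerical sketch (our illustration, not the authors' code) of the acyclicity measure h(W) = tr(e^{W∘W}) − d, using SciPy's matrix exponential:

```python
import numpy as np
from scipy.linalg import expm

def notears_h(W: np.ndarray) -> float:
    """Acyclicity measure from Zheng et al. (2018):
    h(W) = tr(exp(W * W)) - d, which is zero iff W encodes a DAG."""
    return np.trace(expm(W * W)) - W.shape[0]

# The chain X1 -> X2 -> X3 is acyclic, so h evaluates to 0.
W_dag = np.array([[0.0, 1.5, 0.0],
                  [0.0, 0.0, -0.8],
                  [0.0, 0.0, 0.0]])
# Adding X3 -> X1 closes a cycle and makes h strictly positive.
W_cyc = W_dag.copy()
W_cyc[2, 0] = 0.5

print(round(notears_h(W_dag), 6))   # 0.0
print(notears_h(W_cyc) > 0)         # True
```

Because W∘W is nilpotent exactly when W is a DAG, every cycle contributes a positive term to the trace of the matrix exponential, which is what makes this penalty smooth yet exact.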

DAG-GNN (Yu et al., 2019) extends NOTEARS to handle nonlinear SEMs, based on the fact that a linear SEM can be equivalently written as X = W^T X + E, where E is the noise vector given by E = X − W^T X. DAG-GNN assumes the following data generating procedure: X = f_2((I − W^T)^{-1} f_1(E)), where f_1 and f_2 are possibly nonlinear functions. It then estimates W under the framework of Variational Autoencoders (VAEs), whose objective is to maximize the evidence lower bound, with the encoder and decoder being two graph neural networks (Kingma and Welling, 2013). DAG-GNN has been shown to achieve better results than NOTEARS for several nonlinear SEMs. Due to the nonlinear transformations on the weighted adjacency matrix W, however, the weights in this model are indeterminate and lack causal interpretability.
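The algebraic identity underlying this formulation can be checked numerically. The sketch below (our illustration, with arbitrary coefficients) generates data from a linear SEM variable by variable and verifies that it matches the closed form X = E(I − W)^{-1}, the row-sample version of X = (I − W^T)^{-1}E:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 200
# Strictly upper-triangular coefficients: variables are in causal order.
W = np.triu(rng.normal(size=(d, d)), k=1)
E = rng.normal(size=(n, d))            # one noise sample per row

# Generate each variable from its parents: X_i = w_i^T X + eps_i.
X = np.zeros((n, d))
for i in range(d):
    X[:, i] = X @ W[:, i] + E[:, i]

# Equivalent closed form of the linear SEM: X = E (I - W)^{-1}.
X_closed = E @ np.linalg.inv(np.eye(d) - W)
print(np.allclose(X, X_closed))        # True
```

The invertibility of I − W is guaranteed for any DAG, since W can be permuted to strictly triangular form.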

Rather than characterizing the nonlinearity of SEMs directly, GraN-DAG (Lachapelle et al., 2019) models the conditional distribution of each variable given its parents with feed-forward NNs. GraN-DAG defines the so-called NN-path as a way of measuring how one variable X_j affects another variable X_i. The sum of all NN-paths from X_j to X_i being zero implies that f_i is a constant function in X_j and there is no edge from X_j to X_i. In this way, an equivalent weighted adjacency matrix can be constructed, with each entry being the sum of all NN-paths between the corresponding pair of variables. This NN-path approach is specifically designed for feed-forward NNs and may require much extra effort to apply to other powerful NN models such as CNNs or RNNs. Our proposed method, on the other hand, is more generic, as it readily includes such NN models for learning causal structures.

In addition, Zhu et al. (2019) have proposed to use reinforcement learning to search for the DAG according to a predefined score function. The authors adopt policy gradient, with the gradient being estimated by the REINFORCE algorithm (Williams, 1992). Although this approach can include any score function, not necessarily differentiable, dealing with large graphs is still challenging.

3 Masked Gradient-Based Causal Structure Learning with Gumbel-Softmax

Both NOTEARS and DAG-GNN rely on the notion of a weighted adjacency matrix; however, the weighted adjacency matrix is not obvious for certain SEMs; see, e.g., Section 5.1.2. Noticing that every DAG corresponds to a binary adjacency matrix and vice versa, we consider estimating the binary adjacency matrix in this work. Fig. 2 outlines our proposed method.

Figure 2: Masked gradient-based causal structure learning. U_i denotes the Gumbel-Sigmoid indicator vector for variable X_i, each entry of which is either extremely close to 1 or extremely close to 0. For each X_i, the masked input vector U_i ∘ X is fed into the i-th NN to compute the estimate of X_i and evaluate the loss, together with the acyclicity loss computed based on U. Note that the NNs illustrated here can be replaced with other choices of learnable models that are first-order differentiable.

We aim to learn the causal graph G from observational data, or equivalently, to learn PA(X_i) for each variable X_i under the acyclicity constraint. Let U be the binary adjacency matrix associated with the true causal DAG G, whose i-th column U_i can be viewed as an indicator vector for variable X_i such that [U_i]_j, the j-th element of U_i, equals 1 if and only if X_j is a parent of X_i. For ease of presentation, we further assume that the input of f_i is the d-dimensional vector U_i ∘ X, where ∘ denotes the Hadamard product. With some abuse of notation, we rewrite Eq. (1) as

X_i = f_i(U_i ∘ X) + ε_i,  i = 1, …, d.

Consequently, learning the structure of G is equivalent to estimating U.

3.1 Approximating Binary Adjacency Matrix with Gumbel-Softmax

Although representing G as a binary adjacency matrix U is straightforward, the discrete nature of U prohibits direct use of first-order methods to estimate it efficiently. To proceed, we relax each entry of U to take real values from [0, 1], which can be interpreted as the probability that the corresponding directed edge exists. A naive approach is then to apply a logistic sigmoid function to a real valued variable. However, this approach is equivalent to rescaling the initial inputs and gradients to a large range and may have a negative effect on the subsequent optimization procedure. Moreover, the estimated entries of the adjacency matrix can lie in a very small range near zero, making it hard to identify which edges indicated by the adjacency matrix are indeed true positives. We would like each estimated entry, which has been relaxed to be continuous, to be either close to zero or close to one, so that it can be easily thresholded. To this end, we adopt the Gumbel-Softmax approach based on the reparameterization trick (Jang et al., 2017; Maddison et al., 2016), which has been shown to be more effective than straight-through estimation (Bengio et al., 2013) and several REINFORCE based methods.

The Gumbel-Softmax approach is based on the Gumbel-Softmax distribution, a continuous distribution over the simplex that can approximate samples from a categorical distribution (Jang et al., 2017; Maddison et al., 2016). Unlike other continuous distributions over the simplex, such as the Dirichlet distribution, the parameter gradients of Gumbel-Softmax can be computed efficiently with the reparameterization trick. In particular, we use the two-dimensional Gumbel-Softmax distribution for each binary valued entry. Define π = (π_1, π_2) with π_1 + π_2 = 1, where π_2 is the probability that there is no edge from X_j to X_i, and let g_1 and g_2 be two independent variables sampled from Gumbel(0, 1). Then u, given by

u = exp((log π_1 + g_1)/τ) / (exp((log π_1 + g_1)/τ) + exp((log π_2 + g_2)/τ)),  (3)

follows the Gumbel-Softmax distribution, with probability density function

p(u) = τ (π_1/π_2) u^{−τ−1} (1 − u)^{−τ−1} / ((π_1/π_2) u^{−τ} + (1 − u)^{−τ})²,

where τ is a hyperparameter denoting the softmax temperature. We also have

u = σ((log π_1 − log π_2 + g_1 − g_2)/τ),

where σ denotes the logistic sigmoid function. Thus, if we let α = log(π_1/π_2) and L = g_1 − g_2, then u can be written as

u = σ((α + L)/τ),  (4)

and α would be the optimization parameter.
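As a concrete illustration (our sketch; the function name and test values are ours), the reparameterized sampling in Eq. (4) draws Logistic noise L, which equals a difference of two independent Gumbel(0, 1) samples, and squashes (α + L)/τ through a sigmoid:

```python
import numpy as np

def gumbel_sigmoid(alpha, tau, rng):
    """u = sigmoid((alpha + L) / tau), where L is a Logistic(0, 1)
    sample (the difference of two independent Gumbel(0, 1) draws)
    and alpha = log(pi_1 / pi_2) is the trainable logit."""
    s = rng.uniform(size=np.shape(alpha))
    L = np.log(s) - np.log1p(-s)       # Logistic(0, 1) noise
    return 1.0 / (1.0 + np.exp(-(np.asarray(alpha) + L) / tau))

rng = np.random.default_rng(0)
u = gumbel_sigmoid(np.full(1000, 4.0), tau=0.2, rng=rng)
# A low temperature pushes samples toward {0, 1}; with a strongly
# positive logit, most samples land very close to 1.
print(u.min() >= 0.0 and u.max() <= 1.0)
print((u > 0.9).mean())
```

Because the noise enters additively before a deterministic squashing, gradients w.r.t. α flow through every sample, which is exactly what the reparameterization trick buys over REINFORCE-style estimators.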

3.2 Acyclicity Constraint and Optimization Problem

Now let U(α) be the d × d matrix with the u defined in Eq. (4) as its entries, where α now denotes the matrix of optimization parameters. Then U(α) can be viewed as a weighted adjacency matrix which approximates the binary adjacency matrix by use of the Gumbel-Softmax approach. To avoid self-loops, we simply mask the diagonal by setting U_ii = 0. We then have the following acyclicity constraint, according to Zheng et al. (2018):

h(U) := tr(e^{U∘U}) − d = 0,

which is differentiable w.r.t. U and also α.

A remaining issue is to model the causal mechanisms. We use feed-forward NNs in this work; in particular, let f̂_i denote an NN with parameters θ_i that estimates f_i, with input being U_i ∘ X. For ease of presentation, we write θ = (θ_1, …, θ_d) and X̂ = (X̂_1, …, X̂_d) as the estimate vector for X. Then we have the following optimization problem for structure learning:

min_{α, θ} (1/n) ∑_{k=1}^{n} L(x^(k), x̂^(k)) + λ‖U‖_1   subject to   h(U) = 0,

where x^(k) is the k-th sample, ‖·‖_1 is the elementwise ℓ_1 norm, and L is a properly selected loss function. While any differentiable score function can be used here, we focus on the squared loss in this work, i.e., L(x, x̂) = ∑_{i=1}^{d} (x_i − x̂_i)².

We use the augmented Lagrangian approach to solve the above problem, similar to Zheng et al. (2018) and Lachapelle et al. (2019). The temperature parameter τ can be set to a large value and then annealed to a small but non-zero value so that u is extremely close to zero or one. In the present work, we simply pick a fixed temperature that is found to work well in our experiments. The penalty term λ‖U‖_1 can be used to control false discoveries; however, we find that picking a proper value of λ is not easy, so we stick to a small value to slightly remove spurious edges, leaving the rest to be handled by pruning.
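To make the pieces concrete, the following sketch (ours; `masked_objective` and the linear stand-in for the NNs are illustrative, not the paper's implementation) evaluates the penalized objective with the squared loss, the ℓ_1 term, and the acyclicity measure folded in augmented-Lagrangian style:

```python
import numpy as np
from scipy.linalg import expm

def masked_objective(X, U, weights, lam=0.01, rho=1.0, mu=0.0):
    """Squared loss with inputs masked by U, plus lam*||U||_1 and the
    acyclicity terms mu*h(U) + (rho/2)*h(U)^2. For simplicity, each
    'model' here is the linear map weights[:, i]; in the paper it is
    the i-th neural network f_i."""
    n, d = X.shape
    sq = 0.0
    for i in range(d):
        x_hat = (X * U[:, i]) @ weights[:, i]   # f_i(U_i * X)
        sq += np.mean((X[:, i] - x_hat) ** 2)
    h = np.trace(expm(U * U)) - d               # acyclicity measure
    return sq + lam * np.abs(U).sum() + mu * h + 0.5 * rho * h ** 2

rng = np.random.default_rng(1)
W = np.array([[0.0, 2.0, 0.0],     # X1 -> X2 -> X3
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])
E = rng.normal(size=(500, 3))
X = E @ np.linalg.inv(np.eye(3) - W)
U_true = (W != 0).astype(float)

# The true mask with the true weights fits better than an empty graph.
print(masked_objective(X, U_true, W) < masked_objective(X, np.zeros((3, 3)), W))
```

In the actual method, U is replaced by the Gumbel-Sigmoid samples U(α), so that gradients w.r.t. both α and the model parameters are available to the augmented Lagrangian solver.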

4 Preliminary Neighborhood Selection and Pruning

For gradient-based methods, the solution outputted by the augmented Lagrangian only satisfies the acyclicity constraint up to a certain numerical precision, i.e., several entries of the adjacency matrix can be very small but not exactly zero. Hence, thresholding is required to remove such entries. However, even after thresholding, the inferred graph obtained from a reconstruction based score function, e.g., BIC or negative log-likelihood, is very likely to contain spurious edges in the finite sample regime, and pruning is necessary for controlling the false discovery rate. A useful approach is the CAM pruning proposed in Bühlmann et al. (2014), which applies significance testing of covariates based on generalized additive models and then declares significance if the reported p-values are lower than or equal to a predefined value. Although an additive model may not be correct for the true causal relationships, we find that it usually performs well in practice.
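As a simplified stand-in for CAM pruning (the paper's version fits generalized additive models and tests significance; the linear refit and the threshold below are our illustrative substitutes), one can refit each variable on its candidate parents and drop edges with negligible standardized coefficients:

```python
import numpy as np

def prune_edges(X, A, thresh=0.1):
    """Refit each variable on its current parent set by least squares
    and keep an edge only if its standardized coefficient exceeds
    `thresh` (a crude proxy for CAM's significance tests)."""
    d = A.shape[1]
    A_pruned = np.zeros_like(A)
    for i in range(d):
        parents = np.flatnonzero(A[:, i])
        if parents.size == 0:
            continue
        coef, *_ = np.linalg.lstsq(X[:, parents], X[:, i], rcond=None)
        scale = X[:, parents].std(axis=0)
        A_pruned[parents[np.abs(coef) * scale > thresh], i] = 1
    return A_pruned

rng = np.random.default_rng(2)
n = 2000
e = rng.normal(size=(n, 4))
x1, x4 = e[:, 0], e[:, 3]              # X4 is independent of the rest
x2 = 2.0 * x1 + e[:, 1]
x3 = 1.0 * x2 + e[:, 2]
X = np.column_stack([x1, x2, x3, x4])

A = np.zeros((4, 4))
A[0, 1] = A[1, 2] = 1                  # true edges X1->X2, X2->X3
A[3, 1] = 1                            # spurious edge X4->X2
A_pruned = prune_edges(X, A)
print(A_pruned[3, 1] == 0 and A_pruned[0, 1] == 1 and A_pruned[1, 2] == 1)
```

The spurious edge from the independent variable is removed because its refitted coefficient is near zero, while the true edges survive.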

Preliminary Neighborhood Selection (PNS) is a variable selection procedure that, for each variable, chooses a subset of the other variables as its possible parents, or equivalently, removes non-parental variables. In Bühlmann et al. (2014), this preprocessing step is implemented with a boosting method for additive model fitting and is observed to reduce the time consumption of the subsequent procedure significantly, particularly for large DAGs. While this procedure does greatly reduce the time consumption of the CAM algorithm, we find that it may also remove true edges and thus affect the quality of the inferred graph, as shown in Table 1. Without PNS, CAM needs hours to finish, whereas with PNS it takes only minutes.

                     20 nodes                50 nodes                 100 nodes
with PNS
  time (mins.)   1.6±0.2     2.2±0.2     5.6±0.8     6.0±0.7     14.7±0.7    14.0±0.7
  FDR            0.01±0.02   0.02±0.02   0.02±0.02   0.19±0.01   0.04±0.01   0.01±0.01
  TPR            1±0         0.92±0.03   1±0         0.88±0.01   1±0         0.90±0.08
  SHD            0.20±0.40   8.20±2.64   1.00±0.89   27.20±4.40  3.60±1.02   41.40±4.22
w/o PNS
  time (mins.)   15.3±0.4    15.6±0.2    777.8±7.0   796.1±9.2   N/A         N/A
  FDR            0±0         0.01±0.02   0.03±0.01   0.01±0.02   N/A         N/A
  TPR            1±0         0.99±0.01   1±0         0.99±0.01   N/A         N/A
  SHD            0±0         0.80±1.17   2.00±0.60   3.40±4.84   N/A         N/A
Table 1: Comparison of CAM with PNS and without PNS on a causal additive model, whose data generating procedure can be found in Section 5.1.1.
Total edges in true graphs   50.6±3.9     206.4±9.1    407.2±12.8
Total edges after PNS        623.2±3.9    561.6±63.1   660.6±84.4
TPR                          0.91±0.04    0.38±0.04    0.27±0.03
Table 2: Empirical results of the graph from the PNS step used in GraN-DAG on a linear-Gaussian data model, whose generating procedure is the same as the experiment setting used by Zheng et al. (2018).

Besides Bühlmann et al. (2014), we find that GraN-DAG also uses PNS, but with a different purpose. Without PNS, GraN-DAG, even after thresholding and pruning, can still have a much higher SHD due to both false discoveries and missing true positives. To implement PNS, GraN-DAG fits extremely randomized trees for each variable against all the other variables and selects potential parents according to the importance score based on the gain of purity. Although this procedure indeed reduces SHDs, we find that it may fail for linear-Gaussian models with somewhat dense DAGs. As one can see from Table 2, PNS removes too many true edges, and any structure learning method working on the graph after PNS can never increase the TPR.

Along with other experiments, we empirically find that PNS removes many non-parental nodes while retaining most true edges for large sparse graphs, thus reducing the search or optimization space for the subsequent structure learning method. However, we remark that PNS may also have a negative effect, in the sense that it can remove many true edges in certain cases, and should therefore be used with care. While we do not observe much effect of the PNS step on our method, we believe that the reduced search or optimization space is potentially beneficial to other structure learning methods.

5 Experiments

In this section, we compare our proposed method, denoted as MaskedNN, against several traditional and recent gradient-based structure learning methods, which include PC (Spirtes and Glymour, 1991), GES (Chickering, 2002), CAM (Bühlmann et al., 2014), NOTEARS (Zheng et al., 2018), DAG-GNN (Yu et al., 2019) and GraN-DAG (Lachapelle et al., 2019). We report True Positive Rate (TPR), False Discovery Rate (FDR), and Structural Hamming Distance (SHD) to evaluate the discrepancies between learned graphs and true graphs. Since GES and PC may output undirected edges, we treat them favorably by regarding an undirected edge as a true positive as long as the true graph has a directed edge in place of it.
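For reference, one common way of computing these metrics from binary adjacency matrices (conventions vary in the literature; this sketch of ours counts a reversed edge once in the SHD and treats reversals as false discoveries) is:

```python
import numpy as np

def graph_metrics(A_true, A_est):
    """TPR, FDR and SHD for directed graphs given as binary matrices.
    SHD counts extra, missing and reversed edges, with each reversal
    counted once; FDR treats any estimated edge not in the truth
    (including reversals) as a false discovery."""
    true = set(zip(*np.nonzero(A_true)))
    est = set(zip(*np.nonzero(A_est)))
    extra, missing = est - true, true - est
    reversed_ = {(i, j) for (i, j) in extra if (j, i) in missing}
    shd = len(extra) + len(missing) - len(reversed_)
    tpr = len(true & est) / max(len(true), 1)
    fdr = len(extra) / max(len(est), 1)
    return tpr, fdr, shd

A_true = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])  # X1->X2->X3
A_est = np.array([[0, 1, 0], [0, 0, 0], [0, 1, 0]])   # X2->X3 reversed
print(graph_metrics(A_true, A_est))   # (0.5, 0.5, 1)
```

Here the reversed edge costs one SHD point rather than two, matching the usual convention of counting a flip as a single structural operation.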

5.1 Synthetic Data

We conduct experiments on different datasets which vary along graph size, level of edge sparsity, and causal relationships. We first draw a random DAG G according to the Erdős–Rényi graph model with a pre-specified expected node degree. Given G, the observed data are then generated in the causal order indicated by G, according to the SEM in Eq. (1) with different choices of causal relationships f_i. In this work, the additive noise ε_i is distributed according to the standard Gaussian distribution. We pick a fixed sample size and consider graphs with 10 and 50 nodes at two levels of average degree.
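A standard way to draw such graphs (this sketch is our illustration; the paper does not specify its exact sampler) is to sample a lower-triangular edge pattern with the edge probability matched to the desired expected degree, then randomly relabel the nodes:

```python
import numpy as np

def sample_er_dag(d, expected_degree, rng):
    """Sample a binary DAG adjacency matrix: an Erdos-Renyi edge
    pattern on the strictly lower triangle (acyclic by construction),
    followed by a random relabeling of nodes, which preserves
    acyclicity."""
    p = expected_degree / (d - 1)
    mask = np.tril(rng.uniform(size=(d, d)) < p, k=-1)
    perm = rng.permutation(d)
    A = np.zeros((d, d))
    A[np.ix_(perm, perm)] = mask
    return A

rng = np.random.default_rng(3)
A = sample_er_dag(10, 2, rng)
# No cycles: every power of the adjacency matrix has zero trace.
print(all(np.trace(np.linalg.matrix_power(A, k)) == 0 for k in range(1, 11)))
```

Triangularity before permutation guarantees acyclicity, while the permutation removes any correlation between node index and causal order.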

5.1.1 Causal Additive Model

Our first dataset follows the causal additive model (Bühlmann et al., 2014), in which each variable is generated as a sum of nonlinear functions of its individual parents plus additive Gaussian noise, with the contribution of each parent scaled by a coefficient uniformly sampled from a fixed interval.

The empirical results are reported in Table 3. We observe that CAM has the best performance in terms of SHD and TPR in most settings, whereas MaskedNN and GraN-DAG both perform slightly worse than CAM. It is not surprising that CAM performs well, as this dataset follows the causal additive assumption made by CAM. DAG-GNN performs similarly to a linear method such as NOTEARS in most of the settings. As discussed in Section 2.2, we hypothesize that the nonlinear transformation on the adjacency matrix results in the lack of causal interpretability in DAG-GNN. PC and GES both have high TPRs on sparse graphs but very low TPRs on the denser graphs. This observation also holds in our next experiment in Section 5.1.2.

For MaskedNN, GraN-DAG and CAM, an additional pruning step is used to remove spurious edges, identical to what is done in Bühlmann et al. (2014). We did not apply this post-processing step to the other baselines because they have lower TPRs, especially on the denser graphs, while MaskedNN, GraN-DAG and CAM still achieve high TPRs on these graphs.

10 nodes           lower degree                          higher degree
              SHD         TPR         FDR          SHD          TPR         FDR
MaskedNN      0.8±1.3     0.92±0.14   0.10±0.19    8.8±2.6      0.80±0.05   0.04±0.03
GraN-DAG      2.2±1.7     0.85±0.11   0.16±0.19    7.2±4.0      0.84±0.09   0.08±0.04
DAG-GNN       3.6±1.4     0.68±0.09   0.03±0.06    26.8±3.0     0.38±0.08   0.09±0.07
NOTEARS       3.4±2.3     0.77±0.11   0.07±0.12    23.6±2.0     0.45±0.06   0.07±0.06
CAM           3.2±1.2     1.00±0.00   0.21±0.04    4.0±1.8      0.92±0.04   0.09±0.03
ICA-LiNGAM    14.8±3.7    0.42±0.12   0.72±0.07    29.4±1.4     0.47±0.04   0.38±0.04
GES           5.2±2.1     0.74±0.06   0.33±0.08    33.4±3.0     0.22±0.09   0.60±0.13
PC            4.4±4.0     0.71±0.23   0.23±0.25    34.6±1.6     0.19±0.03   0.54±0.08
50 nodes
MaskedNN      8.6±2.5     0.91±0.03   0.13±0.05    44.0±9.5     0.84±0.02   0.06±0.03
GraN-DAG      9.2±2.8     0.89±0.05   0.09±0.02    39.6±4.0     0.86±0.01   0.06±0.02
DAG-GNN       25.6±5.3    0.55±0.09   0.03±0.01    154.4±8.7    0.29±0.09   0.10±0.05
NOTEARS       19.6±3.6    0.70±0.05   0.10±0.05    150.0±9.6    0.38±0.04   0.23±0.04
CAM           1.0±0.9     1.00±0.00   0.02±0.02    27.2±4.4     0.88±0.01   0.19±0.01
ICA-LiNGAM    383.6±35.5  0.37±0.03   0.95±0.01    381.0±21.4   0.29±0.02   0.82±0.02
GES           22.0±4.6    0.75±0.06   0.22±0.07    183.6±5.9    0.24±0.01   0.45±0.03
PC            38.8±7.1    0.72±0.07   0.46±0.06    212.4±5.9    0.29±0.02   0.65±0.01
Table 3: Results of causal structure learning on causal additive model.

5.1.2 Gaussian Process

We consider another SEM, which is also used in Peters et al. (2014) and Lachapelle et al. (2019): each causal relationship f_i is a function sampled from a Gaussian process with an RBF kernel of bandwidth one. This setting is known to be identifiable according to Peters et al. (2014).

Our results are reported in Table 4. It is observed that MaskedNN, GraN-DAG and CAM outperform the rest of the methods across all settings in terms of SHD and TPR. DAG-GNN, NOTEARS and ICA-LiNGAM perform poorly on this dataset, possibly because they may not be able to model this type of causal relationship. More importantly, these methods operate on the notion of a weighted adjacency matrix, which is not obvious for this SEM.

10 nodes           lower degree                          higher degree
              SHD         TPR         FDR          SHD          TPR         FDR
MaskedNN      1.2±1.6     0.87±0.21   0.13±0.25    9.2±3.3      0.78±0.09   0.05±0.07
GraN-DAG      2.4±2.2     0.86±0.15   0.15±0.21    14.4±4.8     0.66±0.12   0.11±0.22
DAG-GNN       6.6±3.6     0.50±0.24   0.07±0.13    37.0±2.1     0.12±0.03   0.18±0.15
NOTEARS       5.0±2.9     0.62±0.19   0.04±0.05    35.2±2.5     0.15±0.05   0.02±0.05
CAM           5.0±2.3     0.92±0.08   0.31±0.07    16.6±3.4     0.63±0.08   0.29±0.11
ICA-LiNGAM    21.2±4.5    0.63±0.15   0.76±0.08    31.2±3.06    0.48±0.08   0.44±0.06
GES           3.4±1.7     0.78±0.13   0.13±0.07    29.4±1.0     0.30±0.30   0.32±0.08
PC            4.8±1.9     0.69±0.09   0.33±0.10    18.8±3.54    0.55±0.10   0.29±0.08
50 nodes
MaskedNN      7.4±3.8     0.90±0.06   0.10±0.07    61.8±15.0    0.73±0.05   0.09±0.03
GraN-DAG      6.6±3.5     0.92±0.05   0.08±0.07    59.4±12.9    0.75±0.05   0.10±0.02
DAG-GNN       32.2±7.8    0.45±0.10   0.10±0.06    186.2±14.5   0.10±0.04   0.14±0.08
NOTEARS       22.8±7.0    0.67±0.10   0.14±0.07    174.8±13.5   0.16±0.03   0.10±0.08
CAM           3.8±1.9     0.96±0.02   0.06±0.04    58.6±6.6     0.76±0.02   0.14±0.01
ICA-LiNGAM    683.0±29.0  0.61±0.06   0.96±0.01    777.6±9.5    0.58±0.04   0.88±0.01
GES           19.0±6.6    0.74±0.09   0.17±0.08    147.4±16.0   0.31±0.05   0.28±0.05
PC            28.0±6.5    0.67±0.07   0.37±0.08    123.2±11.4   0.43±0.03   0.41±0.05
Table 4: Results of causal structure learning on Gaussian process.

5.2 Real-World Data

In this section, we consider a real-world dataset concerning a protein signaling network, based on expression levels of proteins and phospholipids (Sachs et al., 2005). This is a widely used dataset for research on graphical models, with experimental annotations accepted by the biological research community. The dataset contains both observational and interventional data. Since we are interested in using observational data to infer the causal graph, we only consider the observational data. The ground truth causal graph proposed by Sachs et al. (2005) has 11 nodes and 17 edges. Note that the causal graph is sparse, so even an empty graph attains an SHD as low as 17.

The empirical results are shown in Table 5. We observe that MaskedNN and CAM achieve the lowest SHD. DAG-GNN and NOTEARS are on par with MaskedNN and CAM in terms of TPR, but have higher FDRs, leading to higher SHDs. We have also applied GES and PC to this dataset; both output graphs containing undirected edges, whereas all the inferred graphs of MaskedNN, CAM, and GraN-DAG consist of directed edges only.

        MaskedNN   GraN-DAG   DAG-GNN   NOTEARS   CAM    ICA-LiNGAM
SHD     12         13         16        19        12     14
TPR     0.35       0.29       0.35      0.35      0.35   0.24
FDR     0.40       0.50       0.60      0.70      0.40   0.50
Table 5: Results of causal structure learning on real-world data.

6 Concluding Remarks

In this work, we have considered a general framework for learning causal DAGs, which consists of PNS, structure learning, and pruning. We empirically study existing PNS methods and find that they may negatively affect structure learning by removing many true edges. We also develop a masked gradient-based causal structure learning method based on a binary adjacency matrix that exists for any SEM. As an advantage, our method can readily include any differentiable score function and model function for learning causal structures. Experiments on both synthetic and real-world datasets are conducted to show the effectiveness of the proposed approach.


  • Abramson et al. (1996) Bruce Abramson, John Brown, Ward Edwards, Allan Murphy, and Robert L. Winkler. Hailfinder: A Bayesian system for forecasting severe weather. International Journal of Forecasting, 12(1):57–71, 1996.
  • Bengio et al. (2013) Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  • Bertsekas (1999) Dimitri P Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
  • Bühlmann et al. (2014) Peter Bühlmann, Jonas Peters, Jan Ernest, et al. CAM: Causal additive models, high-dimensional order search and penalized regression. The Annals of Statistics, 42(6):2526–2556, 2014.
  • Chickering (2002) David M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507–554, 2002.
  • Chickering (1996) David Maxwell Chickering. Learning Bayesian networks is NP-complete. In Learning from Data, pages 121–130. Springer, 1996.
  • Colombo and Maathuis (2014) Diego Colombo and Marloes H. Maathuis. Order-independent constraint-based causal structure learning. Journal of Machine Learning Research, 15:3921–3962, 2014.
  • Cooper and Herskovits (1992) Gregory F. Cooper and Edward Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9(4):309–347, Oct 1992.
  • He et al. (2015) Yangbo He, Jinzhu Jia, and Bin Yu. Counting and exploring sizes of Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research, 16:2589–2609, 2015.
  • Jang et al. (2017) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. In International Conference on Learning Representations (ICLR), 2017.
  • Kalisch and Bühlmann (2007) Markus Kalisch and Peter Bühlmann. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research, 8(Mar):613–636, 2007.
  • Kingma and Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR), 2013.
  • Lachapelle et al. (2019) Sébastien Lachapelle, Philippe Brouillard, Tristan Deleu, and Simon Lacoste-Julien. Gradient-based neural DAG learning. arXiv preprint arXiv:1906.02226, 2019.
  • Lucas et al. (2004) Peter J. F. Lucas, Linda C. van der Gaag, and Ameen Abu-Hanna. Bayesian networks in biomedicine and healthcare. Artificial Intelligence in Medicine, 30(3):201–214, March 2004.
  • Maddison et al. (2016) Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint, 2016.

  • Pearl (2009) Judea Pearl. Causality. Cambridge University Press, 2009.
  • Peters et al. (2017) J. Peters, D. Janzing, and B. Schölkopf. Elements of Causal Inference - Foundations and Learning Algorithms. Adaptive Computation and Machine Learning Series. The MIT Press, Cambridge, MA, USA, 2017.
  • Peters et al. (2014) Jonas Peters, Joris M Mooij, Dominik Janzing, and Bernhard Schölkopf. Causal discovery with continuous additive noise models. The Journal of Machine Learning Research, 15(1):2009–2053, 2014.
  • Sachs et al. (2005) Karen Sachs, Omar Perez, Dana Pe’er, Douglas A Lauffenburger, and Garry P Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523–529, 2005.
  • Shimizu et al. (2011) Shohei Shimizu, Takanori Inazumi, Yasuhiro Sogawa, Aapo Hyvärinen, Yoshinobu Kawahara, Takashi Washio, Patrik O Hoyer, and Kenneth Bollen. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. Journal of Machine Learning Research, 12(Apr):1225–1248, 2011.
  • Spirtes et al. (2000) P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT press, Cambridge, MA, USA, 2nd edition, 2000.
  • Spirtes (2010) Peter Spirtes. Introduction to causal inference. Journal of Machine Learning Research, 11(May):1643–1662, 2010.
  • Spirtes and Glymour (1991) Peter Spirtes and Clark Glymour. An algorithm for fast recovery of sparse causal graphs. Social Science Computer Review, 9(1):62–72, 1991.
  • Tsamardinos et al. (2006) Ioannis Tsamardinos, Laura E Brown, and Constantin F Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1):31–78, 2006.
  • Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
  • Yu et al. (2019) Yue Yu, Jie Chen, Tian Gao, and Mo Yu. DAG-GNN: DAG structure learning with graph neural networks. In ICML, 2019.
  • Zheng et al. (2018) Xun Zheng, Bryon Aragam, Pradeep Ravikumar, and Eric P. Xing. DAGs with NO TEARS: Continuous optimization for structure learning. In Advances in Neural Information Processing Systems, 2018.
  • Zhu et al. (2019) Shengyu Zhu, Ignavier Ng, and Zhitang Chen. Causal discovery with reinforcement learning. arXiv preprint arXiv:1906.04477, 2019.