1 Introduction
Causal graphical models based on Directed Acyclic Graphs (DAGs) describe causal systems without latent variables or selection bias, and have important applications in many empirical sciences such as weather forecasting (Abramson et al., 1996), biomedicine and healthcare (Lucas et al., 2004), and biology (Sachs et al., 2005; Pearl, 2009). Although randomized controlled experiments can be conducted to learn causal structure effectively, they are generally expensive or even impossible in practice. It is hence more appealing to discover causal structures from observational data.
Indeed, much effort has been devoted to learning causal DAGs from passively observed data. Existing approaches can be roughly categorized into three classes: score-based methods (Cooper and Herskovits, 1992; Chickering, 2002), constraint-based methods (Spirtes and Glymour, 1991; Colombo and Maathuis, 2014), and hybrid methods combining the previous two (Tsamardinos et al., 2006). Although constraint-based and hybrid methods are well suited for learning sparse systems, their assumptions on the underlying distributions, e.g., faithfulness, are usually stronger than those of score-based methods (Spirtes, 2010; Kalisch and Bühlmann, 2007). Recent advances on the identifiability of causal structures (Shimizu et al., 2011; Peters et al., 2014) have also drawn much research attention to score-based methods (Bühlmann et al., 2014; Zheng et al., 2018; Yu et al., 2019; Lachapelle et al., 2019).
Traditionally, score-based methods first assign a score to each DAG according to some predefined score function, and then search over the space formed by all DAGs to find the one with the optimal score. However, due to the combinatorial nature of the search space (He et al., 2015; Chickering, 1996), brute-force search for the causal structure is usually infeasible. As such, a variety of approximate structure search strategies have been proposed in the literature. For example, K2 locally finds the best parent set of each variable, assuming that the ordering of the variables is known (Cooper and Herskovits, 1992). Another strategy, Greedy Equivalence Search (GES), attempts to walk through the space of graph structures by adding, deleting, or reversing an edge (Chickering, 2002), following a series of rules to avoid creating cycles every time an edge is added or reversed. These methods change the graph in a local way, and their performance is usually less satisfactory in practice due to finite data and possible violation of model assumptions.
More recently, Zheng et al. (2018) proposed NOTEARS, a new score-based method that formulates the above combinatorial optimization problem as a continuous one by posing a smooth constraint to enforce acyclicity. NOTEARS is specifically designed for linear Structural Equation Models (SEMs), whereas the subsequent works DAG-GNN (Yu et al., 2019) and GraN-DAG (Lachapelle et al., 2019) have extended it to nonlinear cases, with Neural Networks (NNs) used to model the causal structures from the observed data. In contrast with traditional search methods that change only one edge at a time, these methods can utilize existing first-order optimization methods to estimate the causal structure through weighted adjacency matrices: NOTEARS and DAG-GNN assume special forms of structural equations with weighted adjacency matrices that may not hold for general SEMs, while GraN-DAG utilizes feedforward NNs to model causal relations and further constructs an equivalent weighted adjacency matrix from the NN weights (more details can be found in Section 2.2). A weighted adjacency matrix consisting of d^2 entries (with d being the number of variables) can make the nonconvex optimization problems in NOTEARS, DAG-GNN, and GraN-DAG harder to solve for larger graphs. Furthermore, one may want to use other powerful models, e.g., RNN or CNN based architectures, to estimate the causal relations, for which the equivalent weighted adjacency matrix cannot be easily constructed as in GraN-DAG. It is therefore interesting to devise a generic method such that an adjacency matrix representing the underlying causal structure exists for any SEM and the continuous formulation of the structure learning problem can include any smooth model for estimating the causal mechanisms.

In this work, we consider a general framework for learning causal DAGs towards more efficient learning with potentially larger graphs, as outlined in Fig. 1. This framework is similar to what has been used by Bühlmann et al. (2014) and Lachapelle et al. (2019), but with different purposes (see Section 4 for more details). There are three stages: Preliminary Neighborhood Selection (PNS) for variable selection to reduce the search or optimization space, structure learning to estimate the directed graph that best describes the causal structure underlying the observed data, and pruning to induce acyclicity and remove spurious edges. We empirically investigate existing PNS methods and find that they may bring a negative effect by removing many true edges (cf. Section 4). We also develop a masked gradient-based causal structure learning method based on a binary adjacency matrix that exists for any SEM.
To enable first-order methods, we use the Gumbel-Softmax approach (Jang et al., 2017; Maddison et al., 2016) to approximate the binary-valued entries of the adjacency matrix, which usually results in real values close to zero or one (cf. Section 3). Compared with existing gradient-based methods, one can readily plug in any differentiable score function and model function for learning causal structures. Finally, we conduct experiments on both synthetic and real-world datasets to show the effectiveness of our approach.
2 Background and Related Work
We briefly review Structural Equation Models (SEMs), which are widely used in causal inference and structure learning, followed by an introduction to recent gradient-based structure learning methods.
2.1 Causal DAGs and Structural Equation Models
Let G = (V, E) denote a DAG with vertex set V = {X_1, X_2, ..., X_d} and edge set E. For each X_i, we use pa(X_i) to denote the set of parental variables of X_i, so that there is an edge from each X_j ∈ pa(X_i) to X_i in G. A commonly used model in causal structure learning is the SEM, which contains two types of variables: substantive and noise variables (Spirtes, 2010). The former are of particular interest, since they determine the underlying causal mechanisms for generating the observed data. Each substantive variable is obtained from a function of some (or possibly none) other substantive variable(s) and a unique noise variable. In this work, we focus on the following recursive SEM with additive noises w.r.t. a DAG G:

X_i = f_i(pa(X_i)) + ε_i,  i = 1, 2, ..., d,  (1)

where the ε_i are jointly independent noises and the f_i are deterministic functions with input argument pa(X_i). More general SEMs can be treated similarly with our proposed approach, provided with a suitable score function and parametric models for the causal relationships; see Section 3 for more details. We also assume that each f_i is non-degenerate w.r.t. pa(X_i), that is, for any X_j ∈ pa(X_i) and any fixed value of pa(X_i) \ {X_j}, f_i is not a constant function in X_j. This assumption corresponds to causal minimality, commonly used in causal structure learning (Peters et al., 2014, 2017).

Let X = [X_1, X_2, ..., X_d]^T be the variable vector that concatenates all the variables X_i. We use P(X) to denote the marginal distribution over X induced by the above SEM. It can then be shown that P(X) is Markovian to G (Spirtes et al., 2000; Pearl, 2009), that is, P(X) can be factorized according to G as

P(X) = ∏_{i=1}^{d} P(X_i | pa(X_i)),

where P(X_i | pa(X_i)) denotes the probability of X_i conditioning on its parents pa(X_i). In other words, G and P(X) form a causal Bayesian network (Spirtes, 2010; Peters et al., 2017).

2.2 Gradient-Based Structure Learning Methods
Traditional score-based methods rely on various heuristics, such as removing or adding edges, to directly search the space of DAGs for the best DAG according to a score function. In contrast, recent methods have cast this combinatorial optimization problem into a continuous constrained one, which can then be solved by various first-order optimization methods.
NOTEARS (Zheng et al., 2018) is the first to formulate the structure learning problem as a continuous optimization one, thanks to a novel smooth characterization of the acyclicity constraint. NOTEARS is specifically developed for linear SEMs with X_i = w_i^T X + ε_i, where w_i is a coefficient vector and the indices of the non-zero elements in w_i correspond to the parents of X_i. Define W = [w_1 | w_2 | ... | w_d] as the coefficient matrix with each column being w_i; then W can be viewed as the weighted adjacency matrix of the underlying causal DAG, and learning the structure of G is equivalent to learning W. Using the mean square loss, NOTEARS solves the following optimization problem to estimate W:

min_W (1/2n) ||X − XW||_F^2 + λ||W||_1  subject to  tr(e^{W∘W}) − d = 0,  (2)

where X ∈ R^{n×d} is the design matrix containing n sample vectors, ∘ denotes the elementwise (Hadamard) product, and the l1 penalty term is used to induce sparsity. The equality in Eq. (2) characterizes the acyclicity constraint: as shown in (Zheng et al., 2018, Thm. 2), it holds if and only if W is a weighted adjacency matrix of a DAG. The above problem can be solved by the augmented Lagrangian method (Bertsekas, 1999), followed by thresholding on the estimated adjacency matrix to reduce false discoveries.
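As a quick numerical sanity check of this characterization, the acyclicity function h(W) = tr(e^{W∘W}) − d can be evaluated directly; the minimal sketch below (an illustration, not the NOTEARS implementation) shows that it vanishes for a DAG and is strictly positive for a cyclic graph:

```python
import numpy as np
from scipy.linalg import expm

def notears_h(W):
    """Acyclicity measure of Zheng et al. (2018): h(W) = tr(exp(W * W)) - d,
    where * is the elementwise (Hadamard) product. h(W) = 0 exactly when W
    is the weighted adjacency matrix of a DAG."""
    d = W.shape[0]
    return np.trace(expm(W * W)) - d

W_dag = np.array([[0.0, 1.5],
                  [0.0, 0.0]])   # single edge 0 -> 1: acyclic
W_cyc = np.array([[0.0, 1.5],
                  [0.7, 0.0]])   # edges 0 -> 1 and 1 -> 0: a 2-cycle

print(notears_h(W_dag))  # 0 (up to numerical precision)
print(notears_h(W_cyc))  # strictly positive: the cycle is penalized
```

Because W∘W is nilpotent exactly when the graph is acyclic, every cycle contributes a positive term to the matrix exponential's trace.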
DAG-GNN (Yu et al., 2019) extends NOTEARS to handle nonlinear SEMs, based on the fact that a linear SEM X = W^T X + E can be equivalently written as X = (I − W^T)^{-1} E, where E = [ε_1, ε_2, ..., ε_d]^T is the noise vector. DAG-GNN assumes the following data generating procedure: X = f_2((I − W^T)^{-1} f_1(Z)), where f_1 and f_2 are possibly nonlinear functions. It then estimates W under the framework of Variational Auto-Encoders (VAEs), whose objective is to maximize the evidence lower bound, with the encoder and decoder being two graph neural networks (Kingma and Welling, 2013). DAG-GNN has been shown to achieve better results than NOTEARS for several nonlinear SEMs. Due to the nonlinear transformations applied to the weighted adjacency matrix W, however, the weights in this model are indeterminate and lack causal interpretability.

Rather than characterizing the nonlinearity of SEMs directly, GraN-DAG (Lachapelle et al., 2019) models the conditional distribution of each variable given its parents with feedforward NNs. GraN-DAG defines the so-called NN-path as a way of measuring how one variable X_j affects another variable X_i. The sum of all NN-paths from X_j to X_i being zero implies that the NN function for X_i is constant in X_j, so that there is no edge from X_j to X_i. In this way, an equivalent weighted adjacency matrix can be constructed with each entry being the sum of all NN-paths between the corresponding pair of variables. This NN-path approach is specifically designed for feedforward NNs and may require much extra effort to apply to other powerful NN models such as CNNs or RNNs. Our proposed method, on the other hand, is more generic, as it readily accommodates such NN models for learning causal structures.
In addition, Zhu et al. (2019) proposed to use reinforcement learning to search for the DAG according to a predefined score function, adopting policy gradient with the gradient estimated by the REINFORCE algorithm (Williams, 1992). Although this approach can include any score function, not necessarily a differentiable one, dealing with large graphs remains challenging.

3 Masked Gradient-Based Causal Structure Learning with Gumbel-Softmax
Both NOTEARS and DAG-GNN rely on the notion of a weighted adjacency matrix; however, the weighted adjacency matrix is not obvious for certain SEMs; see, e.g., Section 5.1.2. Noticing that every DAG corresponds to a binary adjacency matrix and vice versa, we instead consider estimating the binary adjacency matrix in this work. Fig. 2 outlines our proposed method.
We aim to learn the causal graph G from observational data, or equivalently, to learn pa(X_i) for each variable X_i under the acyclicity constraint. Let B ∈ {0, 1}^{d×d} be the binary adjacency matrix associated with the true causal DAG G, where the i-th column B_i can be viewed as an indicator vector for variable X_i such that B_{ji}, the j-th element of B_i, equals 1 if and only if X_j is a parent of X_i. For ease of presentation, we further assume that the input of f_i is the d-dimensional vector B_i ∘ X, where ∘ denotes the Hadamard product. With some abuse of notation, we rewrite Eq. (1) as

X_i = f_i(B_i ∘ X) + ε_i,  i = 1, 2, ..., d.  (3)

Consequently, learning the structure of G is equivalent to estimating B.
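The masked formulation in Eq. (3) can be illustrated with a small simulation. The mechanisms f_i below are hypothetical stand-ins, chosen only to show how the Hadamard mask B_i ∘ X hides non-parental inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 5

# Binary adjacency matrix of a chain DAG X0 -> X1 -> X2:
# B[j, i] = 1 iff X_j is a parent of X_i.
B = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]])

# Hypothetical causal mechanisms f_i; the mask B[:, i] zeroes out
# all non-parental inputs before f_i is applied, as in Eq. (3).
f = [lambda z: 0.0 * z.sum(),      # X0 is a root
     lambda z: np.tanh(z).sum(),   # effectively depends on X0 only
     lambda z: (z ** 2).sum()]     # effectively depends on X1 only

X = np.zeros((n, d))
for i in range(d):                 # variables in causal order
    eps = rng.normal(size=n)
    for k in range(n):
        X[k, i] = f[i](B[:, i] * X[k]) + eps[k]
```

The key property is that f_i becomes invariant to non-parents once masked: changing X0 cannot affect the masked input to f_2.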
3.1 Approximating the Binary Adjacency Matrix with Gumbel-Softmax
Although representing G as a d × d binary adjacency matrix B is straightforward, the discrete nature of B prohibits the direct use of first-order methods to estimate it efficiently. To proceed, we relax each entry to take real values in [0, 1], which can be interpreted as the probability that the corresponding directed edge exists. A naive approach is then to apply a logistic sigmoid function to a real-valued variable. However, this approach is equivalent to rescaling the initial inputs and gradients to a large range and may have a negative effect on the subsequent optimization procedure. Moreover, the estimated entries of the adjacency matrix can lie in a very small range near zero, making it hard to identify which edges indicated by the adjacency matrix are indeed true positives. We would like each estimated entry, which has been relaxed to be continuous, to be either close to zero or close to one, so that it can be easily thresholded. To this end, we adopt the Gumbel-Softmax approach based on the reparameterization trick (Jang et al., 2017; Maddison et al., 2016), which has been shown to be more effective than straight-through (Bengio et al., 2013) and several REINFORCE based methods.

The Gumbel-Softmax approach is based on the Gumbel-Softmax distribution, a continuous distribution over the simplex that can approximate samples from a categorical distribution (Jang et al., 2017; Maddison et al., 2016). Unlike other continuous distributions over the simplex, such as the Dirichlet distribution, the parameter gradients of the Gumbel-Softmax distribution can be computed efficiently with the reparameterization trick. In particular, we use the two-dimensional Gumbel-Softmax distribution for each binary-valued entry. Let p_{ij} ∈ (0, 1) denote the probability that there is an edge from X_i to X_j, so that 1 − p_{ij} is the probability that there is no such edge, and let g^1 and g^2 be two independent variables sampled from Gumbel(0, 1). Then (u_{ij}, 1 − u_{ij}), with u_{ij} given by

u_{ij} = exp((log p_{ij} + g^1)/τ) / [exp((log p_{ij} + g^1)/τ) + exp((log(1 − p_{ij}) + g^2)/τ)],

follows the two-dimensional Gumbel-Softmax distribution with probability density function

p(u) = τ p_{ij} (1 − p_{ij}) u^{−τ−1} (1 − u)^{−τ−1} [p_{ij} u^{−τ} + (1 − p_{ij})(1 − u)^{−τ}]^{−2},

where τ > 0 is a hyperparameter denoting the softmax temperature. We also have

u_{ij} = σ((log(p_{ij}/(1 − p_{ij})) + g^1 − g^2)/τ),

where σ(·) denotes the logistic sigmoid function. Thus, if we let L_{ij} = g^1 − g^2, which follows the standard Logistic distribution, and ψ_{ij} = log(p_{ij}/(1 − p_{ij})), then u_{ij} can be written as

u_{ij} = σ((ψ_{ij} + L_{ij})/τ),  (4)

and ψ_{ij} would be the optimization parameter.
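A minimal sketch of this sampling scheme (using the sigmoid form of Eq. (4); the parameter values below are illustrative) shows how lowering the temperature pushes the relaxed entries toward zero or one:

```python
import numpy as np
from scipy.special import expit  # logistic sigmoid

rng = np.random.default_rng(0)

def sample_gumbel_sigmoid(psi, tau, size):
    """Binary Gumbel-Softmax sample in the sigmoid form of Eq. (4):
    u = sigmoid((psi + L) / tau), with L = g1 - g2 the difference of two
    independent Gumbel(0, 1) draws, i.e. a standard Logistic variable."""
    L = rng.gumbel(size=size) - rng.gumbel(size=size)
    return expit((psi + L) / tau)

u_hot = sample_gumbel_sigmoid(psi=0.5, tau=2.0, size=10_000)
u_cold = sample_gumbel_sigmoid(psi=0.5, tau=0.2, size=10_000)

frac_extreme = lambda u: np.mean((u < 0.05) | (u > 0.95))
print(frac_extreme(u_hot), frac_extreme(u_cold))  # low temperature polarizes u
```

Note that regardless of τ, P(u > 1/2) = σ(ψ), so annealing the temperature sharpens the samples toward {0, 1} without changing the edge probability encoded by ψ.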
3.2 Acyclicity Constraint and Optimization Problem
Now let U be the d × d matrix with u_{ij} defined in Eq. (4) as its (i, j)-th entry. Then U can be viewed as a weighted adjacency matrix that approximates the binary adjacency matrix B via the Gumbel-Softmax approach. To avoid self-loops, we simply mask the diagonal entries by setting u_{ii} = 0. We then have the following acyclicity constraint, according to Zheng et al. (2018):

tr(e^{U∘U}) − d = 0,  (5)

which is differentiable w.r.t. U and hence also w.r.t. the parameters ψ_{ij}.
A remaining issue is to model the causal mechanisms. We use feedforward NNs in this work; in particular, let f̂_i denote an NN with parameters θ_i that estimates f_i, with input being U_i ∘ X, where U_i is the i-th column of U. For ease of presentation, we write θ = {θ_1, θ_2, ..., θ_d} and let X̂ = [f̂_1(U_1 ∘ X), f̂_2(U_2 ∘ X), ..., f̂_d(U_d ∘ X)]^T be the estimate vector for X. We then have the following optimization problem for structure learning:

min_{ψ, θ} (1/n) ∑_{k=1}^{n} L(x^(k), x̂^(k)) + λ||U||_1  subject to  tr(e^{U∘U}) − d = 0,

where x^(k) is the k-th sample, ||·||_1 is the elementwise l1 norm, and L is a properly selected loss function. While any differentiable score function can be used here, we focus on the squared loss in this work, i.e.,

L(x^(k), x̂^(k)) = ∑_{i=1}^{d} (x_i^(k) − x̂_i^(k))^2.
We use the augmented Lagrangian approach to solve the above problem, similar to Zheng et al. (2018) and Lachapelle et al. (2019). The temperature parameter τ can be set to a large value and then annealed to a small but non-zero value so that u_{ij} becomes extremely close to zero or one. In the present work, we simply pick a fixed temperature, which is found to work well in our experiments. The l1 penalty term can be used to control false discoveries; however, we find that picking a proper penalty weight λ is not easy, so we stick to a small value that slightly removes spurious edges, leaving the rest to be handled by pruning.
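The augmented Lagrangian scheme itself can be sketched on a toy one-dimensional problem; the objective, constraint, step sizes, and penalty update rule below are illustrative stand-ins for the actual structure learning problem:

```python
# Toy one-dimensional illustration of the augmented Lagrangian scheme used
# for the acyclicity constraint: minimize f(x) subject to h(x) = 0.
# Here f(x) = (x - 2)^2 and h(x) = x^2 - 1, whose solution is x = 1.
f = lambda x: (x - 2.0) ** 2
h = lambda x: x ** 2 - 1.0
df = lambda x: 2.0 * (x - 2.0)          # gradient of the objective
dh = lambda x: 2.0 * x                  # gradient of the constraint

x, alpha, rho = 0.5, 0.0, 1.0           # primal variable, multiplier, penalty
for _ in range(20):                     # outer iterations
    for _ in range(500):                # inner: gradient descent on L_rho
        grad = df(x) + (alpha + rho * h(x)) * dh(x)
        x -= 0.01 * grad
    alpha += rho * h(x)                 # dual (multiplier) update
    if abs(h(x)) > 0.25:                # enlarge penalty on slow progress
        rho *= 10.0
print(round(x, 3))                      # close to 1, with h(x) close to 0
```

In the structure learning problem, x plays the role of (ψ, θ), h is the acyclicity function in Eq. (5), and each inner loop is solved with a first-order optimizer.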
4 Preliminary Neighborhood Selection and Pruning
For gradient-based methods, the solution output by the augmented Lagrangian approach satisfies the acyclicity constraint only up to a certain numerical precision; i.e., several entries of the adjacency matrix can be very small but not exactly zero. Hence, thresholding is required to remove such entries. However, even after thresholding, the graph inferred from a reconstruction-based score function, e.g., BIC or negative log-likelihood, is very likely to contain spurious edges in the finite sample regime, and pruning is necessary for controlling the false discovery rate. A useful approach is the CAM pruning proposed by Bühlmann et al. (2014), which applies significance testing of covariates based on generalized additive models and then declares significance if the reported p-values are lower than or equal to a predefined level. Although an additive model may not be correct for the true causal relationships, we find that this pruning usually performs well in practice.
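A simplified version of this pruning step, using linear regression in place of CAM's generalized additive models (so the test statistic and the significance level below are illustrative assumptions), might look like:

```python
import numpy as np
from scipy import stats

def prune_parents(X, parents, i, alpha=0.001):
    """Keep only the parents of variable i whose coefficients are significant
    in a linear regression: a simplified linear stand-in for CAM pruning,
    which instead tests covariates under generalized additive models."""
    if not parents:
        return []
    A = np.column_stack([X[:, parents], np.ones(len(X))])  # design + intercept
    y = X[:, i]
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    n, p = A.shape
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - p)
    cov = sigma2 * np.linalg.inv(A.T @ A)
    t = beta[:-1] / np.sqrt(np.diag(cov)[:-1])             # per-parent t-stats
    pvals = 2 * stats.t.sf(np.abs(t), df=n - p)
    return [pa for pa, pv in zip(parents, pvals) if pv <= alpha]

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
X[:, 2] = 2.0 * X[:, 0] + rng.normal(size=n)  # only X0 is a true parent of X2
print(prune_parents(X, [0, 1], 2))            # X1 is expected to be pruned
```

The real CAM pruning fits an additive spline model per variable and drops covariates whose fitted component is not significant, but the keep/drop logic is the same.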
Preliminary Neighborhood Selection (PNS) is a variable selection procedure that chooses, for each variable, a subset of the other variables as its possible parents, or equivalently, removes non-parental variables. In Bühlmann et al. (2014), this preprocessing step is implemented with a boosting method for additive model fitting and is observed to significantly reduce the time consumption of the subsequent procedure, particularly for large DAGs. While this procedure does save much time for the CAM algorithm, we find that it may also remove true edges and thus affect the quality of the inferred graph, as shown in Table 1. Without PNS, CAM can take hours to finish, whereas with PNS it takes only minutes.
Table 1: Results of CAM with and without PNS (mean ± standard deviation; the two column groups per graph size correspond to the sparser and denser settings).

                        20 nodes                 50 nodes                  100 nodes
with PNS  time (mins.)  1.6±0.2    2.2±0.2       5.6±0.8    6.0±0.7        14.7±0.7   14.0±0.7
          FDR           0.01±0.02  0.02±0.02     0.02±0.02  0.19±0.01      0.04±0.01  0.01±0.01
          TPR           1±0        0.92±0.03     1±0        0.88±0.01      1±0        0.90±0.08
          SHD           0.20±0.40  8.20±2.64     1.00±0.89  27.20±4.40     3.60±1.02  41.40±4.22
w/o PNS   time (mins.)  15.3±0.4   15.6±0.2      777.8±7.0  796.1±9.2      N/A        N/A
          FDR           0±0        0.01±0.02     0.03±0.01  0.01±0.02      N/A        N/A
          TPR           1±0        0.99±0.01     1±0        0.99±0.01      N/A        N/A
          SHD           0±0        0.80±1.17     2.00±0.60  3.40±4.84      N/A        N/A
Table 2: Number of edges before and after PNS for linear-Gaussian models with dense DAGs (mean ± standard deviation).

Total edges in true graphs   50.6±3.9     206.4±9.1    407.2±12.8
Total edges after PNS        623.2±3.9    561.6±63.1   660.6±84.4
TPR                          0.91±0.04    0.38±0.04    0.27±0.03
Besides Bühlmann et al. (2014), GraN-DAG also uses PNS, but with a different purpose. Without PNS, GraN-DAG, even after thresholding and pruning, can still have a much higher SHD due to both false discoveries and missing true positives. To implement PNS, GraN-DAG fits extremely randomized trees for each variable against all the other variables and selects potential parents according to the importance score based on the gain of purity. Although this procedure indeed reduces SHDs, we find that it may fail for linear-Gaussian models with somewhat dense DAGs. As one can see from Table 2, PNS removes too many true edges, and any structure learning method working on the graph after PNS can never increase the TPR.
Along with other experiments, we empirically find that PNS removes many non-parental nodes while retaining most true edges for large sparse graphs, thus reducing the search or optimization space for the subsequent structure learning method. However, PNS can also have a negative effect, in the sense that it may remove many true edges in certain cases, and should therefore be used with care. While we do not observe much effect of the PNS step on our method, we believe that the reduced search or optimization space is potentially beneficial to other structure learning methods.
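To make the PNS idea concrete, here is a minimal sketch that selects a fixed number of candidate parents per variable; absolute correlation is used as a simple stand-in for the boosting- or tree-based importance scores discussed above:

```python
import numpy as np

def pns_topk(X, k):
    """Preliminary neighborhood selection: for each variable, keep the k
    other variables with the largest absolute correlation as candidate
    parents. A simple stand-in for the additive-model boosting (CAM) or
    extremely randomized trees (GraN-DAG) selectors discussed above."""
    d = X.shape[1]
    C = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(C, 0.0)                    # no self-loops
    return {i: list(np.argsort(-C[:, i])[:k]) for i in range(d)}

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 4))
X[:, 3] = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=n)
cands = pns_topk(X, k=2)
print(cands[3])  # the true parents X0 and X1 survive selection
```

The failure mode discussed above is visible here too: if k is smaller than the true in-degree, or the importance score is misleading, true parents are discarded and no downstream method can recover them.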
5 Experiments
In this section, we compare our proposed method, denoted as MaskedNN, against several traditional and recent gradient-based structure learning methods, including PC (Spirtes and Glymour, 1991), GES (Chickering, 2002), CAM (Bühlmann et al., 2014), NOTEARS (Zheng et al., 2018), DAG-GNN (Yu et al., 2019), and GraN-DAG (Lachapelle et al., 2019). We report True Positive Rate (TPR), False Discovery Rate (FDR), and Structural Hamming Distance (SHD) to evaluate the discrepancies between learned graphs and true graphs. Since GES and PC may output unoriented edges, we treat them favorably by regarding an undirected edge as a true positive as long as the true graph has a directed edge in its place.
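For concreteness, these three metrics can be computed from binary adjacency matrices as sketched below, under the common convention that missing, extra, and reversed edges each count once toward SHD (the favorable treatment of undirected PC/GES edges described above is not reproduced in this sketch):

```python
import numpy as np

def graph_metrics(B_true, B_est):
    """TPR, FDR, and SHD between the binary adjacency matrices of two DAGs."""
    tp = np.sum((B_est == 1) & (B_true == 1))
    fp = np.sum((B_est == 1) & (B_true == 0))
    tpr = tp / max(B_true.sum(), 1)
    fdr = fp / max(B_est.sum(), 1)
    # SHD: each pair (i, j) with a mismatched edge counts once, so a
    # reversed edge (one missing + one extra direction) contributes 1.
    diff = np.abs(B_true - B_est)
    shd = int(np.sum(np.triu((diff + diff.T) > 0, k=1)))
    return tpr, fdr, shd

B_true = np.array([[0, 1, 1],
                   [0, 0, 1],
                   [0, 0, 0]])
B_est  = np.array([[0, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]])   # edge 0 -> 2 reversed to 2 -> 0
print(graph_metrics(B_true, B_est))
```

In this example two of the three true edges are recovered (TPR 2/3), one of the three estimated edges is wrong (FDR 1/3), and the single reversed edge gives an SHD of 1.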
5.1 Synthetic Data
We conduct experiments on different datasets, which vary in graph size, level of edge sparsity, and causal relationships. We first draw a random DAG G according to the Erdős–Rényi graph model with a prespecified expected node degree. Given G, the observed data are then generated in the causal order indicated by G, according to the SEM in Eq. (1) with different causal relationships f_i. In this work, the additive noises are distributed according to the standard Gaussian distribution. We pick a fixed sample size and consider graphs with 10 and 50 nodes at two levels of average degree.

5.1.1 Causal Additive Model
Our first dataset follows the causal additive model (Bühlmann et al., 2014), in which each variable is generated as a sum of nonlinear functions of its individual parents plus the additive noise, with the combination coefficients being scalars sampled uniformly from a fixed interval.
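As an illustration, a data-generating sketch for such a causal additive model might look like the following, where the ER-DAG construction, the sin component functions, and the coefficient range are illustrative assumptions rather than the paper's exact choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, degree = 5, 1000, 2

# Erdős–Rényi DAG: keep variables in topological order by sampling only
# the strictly upper-triangular entries with probability degree / (d - 1).
B = np.triu((rng.random((d, d)) < degree / (d - 1)).astype(int), k=1)

# Additive mechanisms as in Eq. (1): each variable is a sum of nonlinear
# functions of its individual parents plus standard Gaussian noise.
X = np.zeros((n, d))
for j in range(d):
    contrib = np.zeros(n)
    for i in np.flatnonzero(B[:, j]):
        contrib += rng.uniform(0.5, 2.0) * np.sin(X[:, i])
    X[:, j] = contrib + rng.normal(size=n)
```

Because B is strictly upper triangular, the variables can be generated in index order and the graph is acyclic by construction.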
The empirical results are reported in Table 3. We observe that CAM has the best performance in terms of SHD and TPR in most settings, whereas MaskedNN and GraN-DAG both perform slightly worse than CAM. It is not surprising that CAM performs well, as this dataset satisfies the causal additive assumption made by CAM. DAG-GNN performs similarly to a linear method such as NOTEARS in most settings; as discussed in Section 2.2, we hypothesize that the nonlinear transformation on the adjacency matrix results in the lack of causal interpretability of DAG-GNN. PC and GES both have high TPRs on the sparse graphs but very low TPRs on the denser ones. The same observation is made in our next experiment in Section 5.1.2.
For MaskedNN, GraN-DAG, and CAM, an additional pruning step is used to remove spurious edges, identical to what is done in Bühlmann et al. (2014). We did not apply this post-processing step to the other baselines because they have lower TPRs, especially on the denser graphs; MaskedNN, GraN-DAG, and CAM still achieve high TPRs on these graphs.
Table 3: Empirical results on the causal additive model dataset (mean ± standard deviation; the left and right column groups correspond to the sparser and denser settings).

10 nodes
             SHD         TPR        FDR           SHD         TPR        FDR
MaskedNN     0.8±1.3     0.92±0.14  0.10±0.19     8.8±2.6     0.80±0.05  0.04±0.03
GraN-DAG     2.2±1.7     0.85±0.11  0.16±0.19     7.2±4.0     0.84±0.09  0.08±0.04
DAG-GNN      3.6±1.4     0.68±0.09  0.03±0.06     26.8±3.0    0.38±0.08  0.09±0.07
NOTEARS      3.4±2.3     0.77±0.11  0.07±0.12     23.6±2.0    0.45±0.06  0.07±0.06
CAM          3.2±1.2     1.00±0.00  0.21±0.04     4.0±1.8     0.92±0.04  0.09±0.03
ICA-LiNGAM   14.8±3.7    0.42±0.12  0.72±0.07     29.4±1.4    0.47±0.04  0.38±0.04
GES          5.2±2.1     0.74±0.06  0.33±0.08     33.4±3.0    0.22±0.09  0.60±0.13
PC           4.4±4.0     0.71±0.23  0.23±0.25     34.6±1.6    0.19±0.03  0.54±0.08

50 nodes
             SHD         TPR        FDR           SHD         TPR        FDR
MaskedNN     8.6±2.5     0.91±0.03  0.13±0.05     44.0±9.5    0.84±0.02  0.06±0.03
GraN-DAG     9.2±2.8     0.89±0.05  0.09±0.02     39.6±4.0    0.86±0.01  0.06±0.02
DAG-GNN      25.6±5.3    0.55±0.09  0.03±0.01     154.4±8.7   0.29±0.09  0.10±0.05
NOTEARS      19.6±3.6    0.70±0.05  0.10±0.05     150.0±9.6   0.38±0.04  0.23±0.04
CAM          1.0±0.9     1.00±0.00  0.02±0.02     27.2±4.4    0.88±0.01  0.19±0.01
ICA-LiNGAM   383.6±35.5  0.37±0.03  0.95±0.01     381.0±21.4  0.29±0.02  0.82±0.02
GES          22.0±4.6    0.75±0.06  0.22±0.07     183.6±5.9   0.24±0.01  0.45±0.03
PC           38.8±7.1    0.72±0.07  0.46±0.06     212.4±5.9   0.29±0.02  0.65±0.01
5.1.2 Gaussian Process
We consider another SEM, also used in Peters et al. (2014) and Lachapelle et al. (2019): each causal relationship f_i is a function sampled from a Gaussian process with an RBF kernel of bandwidth one. This setting is known to be identifiable according to Peters et al. (2014).
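Sampling such a causal mechanism can be sketched as follows; the sample size and jitter term are illustrative choices, and only the single-parent case is shown (multi-parent mechanisms use the same construction with vector-valued inputs):

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(a, b, bandwidth=1.0):
    """RBF kernel matrix between two 1-D arrays of inputs."""
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2.0 * bandwidth ** 2))

# Draw one causal mechanism f ~ GP(0, RBF) evaluated at the observed values
# of a single parent, then add standard Gaussian noise as in Eq. (1).
n = 200
x_parent = rng.normal(size=n)
K = rbf_kernel(x_parent, x_parent) + 1e-6 * np.eye(n)  # jitter for stability
f_vals = np.linalg.cholesky(K) @ rng.normal(size=n)    # GP prior sample
x_child = f_vals + rng.normal(size=n)
```

Since f is only specified through its values at the observed inputs, there is no closed-form coefficient vector here, which is precisely why a weighted adjacency matrix is not obvious for this SEM.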
Our results are reported in Table 4. We observe that MaskedNN, GraN-DAG, and CAM outperform the other methods across all settings in terms of SHD and TPR. DAG-GNN, NOTEARS, and ICA-LiNGAM perform poorly on this dataset, possibly because they cannot model this type of causal relationship. More importantly, these methods operate on the notion of weighted adjacency matrix, which is not obvious for this SEM.
Table 4: Empirical results on the Gaussian process dataset (mean ± standard deviation; the left and right column groups correspond to the sparser and denser settings).

10 nodes
             SHD         TPR        FDR           SHD          TPR        FDR
MaskedNN     1.2±1.6     0.87±0.21  0.13±0.25     9.2±3.3      0.78±0.09  0.05±0.07
GraN-DAG     2.4±2.2     0.86±0.15  0.15±0.21     14.4±4.8     0.66±0.12  0.11±0.22
DAG-GNN      6.6±3.6     0.50±0.24  0.07±0.13     37.0±2.1     0.12±0.03  0.18±0.15
NOTEARS      5.0±2.9     0.62±0.19  0.04±0.05     35.2±2.5     0.15±0.05  0.02±0.05
CAM          5.0±2.3     0.92±0.08  0.31±0.07     16.6±3.4     0.63±0.08  0.29±0.11
ICA-LiNGAM   21.2±4.5    0.63±0.15  0.76±0.08     31.2±3.06    0.48±0.08  0.44±0.06
GES          3.4±1.7     0.78±0.13  0.13±0.07     29.4±1.0     0.30±0.30  0.32±0.08
PC           4.8±1.9     0.69±0.09  0.33±0.10     18.8±3.54    0.55±0.10  0.29±0.08

50 nodes
             SHD         TPR        FDR           SHD          TPR        FDR
MaskedNN     7.4±3.8     0.90±0.06  0.10±0.07     61.8±15.0    0.73±0.05  0.09±0.03
GraN-DAG     6.6±3.5     0.92±0.05  0.08±0.07     59.4±12.9    0.75±0.05  0.10±0.02
DAG-GNN      32.2±7.8    0.45±0.10  0.10±0.06     186.2±14.5   0.10±0.04  0.14±0.08
NOTEARS      22.8±7.0    0.67±0.10  0.14±0.07     174.8±13.5   0.16±0.03  0.10±0.08
CAM          3.8±1.9     0.96±0.02  0.06±0.04     58.6±6.6     0.76±0.02  0.14±0.01
ICA-LiNGAM   683.0±29.0  0.61±0.06  0.96±0.01     777.6±9.5    0.58±0.04  0.88±0.01
GES          19.0±6.6    0.74±0.09  0.17±0.08     147.4±16.0   0.31±0.05  0.28±0.05
PC           28.0±6.5    0.67±0.07  0.37±0.08     123.2±11.4   0.43±0.03  0.41±0.05
5.2 Real-World Data
In this section, we consider a real-world dataset for discovering a protein signaling network based on expression levels of proteins and phospholipids (Sachs et al., 2005). This is a widely used dataset for research on graphical models, with experimental annotations accepted by the biological research community. The dataset contains both observational and interventional data. Since we are interested in using observational data to infer the causal graph, we only consider the observational samples. The ground truth causal graph given by Sachs et al. (2005) has 11 nodes and 17 edges. Note that the causal graph is sparse, so even an empty graph achieves an SHD as low as 17.
The empirical results are shown in Table 5. We observe that MaskedNN and CAM achieve the lowest SHD. DAG-GNN and NOTEARS are on par with MaskedNN and CAM in terms of TPR, but have higher FDRs, leading to higher SHDs. We have also applied GES and PC to this dataset; both output graphs containing undirected edges. By contrast, the inferred graphs of MaskedNN, CAM, and GraN-DAG consist solely of directed edges.
Table 5: Empirical results on the protein signaling dataset (Sachs et al., 2005).

       MaskedNN  GraN-DAG  DAG-GNN  NOTEARS  CAM   ICA-LiNGAM
SHD    12        13        16       19       12    14
TPR    0.35      0.29      0.35     0.35     0.35  0.24
FDR    0.40      0.50      0.60     0.70     0.40  0.50
6 Concluding Remarks
In this work, we have considered a general framework for learning causal DAGs that consists of PNS, structure learning, and pruning. We empirically studied existing PNS methods and found that they may negatively affect variable selection by removing many true edges. We also developed a masked gradient-based causal structure learning method based on a binary adjacency matrix that exists for any SEM. As an advantage, our method can readily include any differentiable score function and model function for learning causal structures. Experiments on both synthetic and real-world datasets show the effectiveness of the proposed approach.
References
 Abramson et al. (1996) Bruce Abramson, John Brown, Ward Edwards, Allan Murphy, and Robert L. Winkler. Hailfinder: A Bayesian system for forecasting severe weather. International Journal of Forecasting, 12(1):57–71, 1996.
 Bengio et al. (2013) Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
 Bertsekas (1999) Dimitri P Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
 Bühlmann et al. (2014) Peter Bühlmann, Jonas Peters, Jan Ernest, et al. CAM: Causal additive models, highdimensional order search and penalized regression. The Annals of Statistics, 42(6):2526–2556, 2014.

 Chickering (2002) David M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507–554, 2002.
 Chickering (1996) David Maxwell Chickering. Learning Bayesian networks is NP-complete. In Learning from Data, pages 121–130. Springer, 1996.
 Colombo and Maathuis (2014) Diego Colombo and Marloes H. Maathuis. Orderindependent constraintbased causal structure learning. Journal of Machine Learning Research, 15:3921–3962, 2014.
 Cooper and Herskovits (1992) Gregory F. Cooper and Edward Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9(4):309–347, Oct 1992.
 He et al. (2015) Yangbo He, Jinzhu Jia, and Bin Yu. Counting and exploring sizes of Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research, 16:2589–2609, 2015.
 Jang et al. (2017) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. In International Conference on Learning Representations (ICLR), 2017.
 Kalisch and Bühlmann (2007) Markus Kalisch and Peter Bühlmann. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research, 8(Mar):613–636, 2007.
 Kingma and Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR), 2013.
 Lachapelle et al. (2019) Sébastien Lachapelle, Philippe Brouillard, Tristan Deleu, and Simon Lacoste-Julien. Gradient-based neural DAG learning. arXiv preprint arXiv:1906.02226, 2019.
 Lucas et al. (2004) Peter J. F. Lucas, Linda C. van der Gaag, and Ameen AbuHanna. Bayesian networks in biomedicine and healthcare. Artificial Intelligence in Medicine, 30(3):201–214, March 2004.

 Maddison et al. (2016) Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
 Pearl (2009) Judea Pearl. Causality. Cambridge University Press, 2009.
 Peters et al. (2017) J. Peters, D. Janzing, and B. Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. Adaptive Computation and Machine Learning Series. The MIT Press, Cambridge, MA, USA, 2017.
 Peters et al. (2014) Jonas Peters, Joris M Mooij, Dominik Janzing, and Bernhard Schölkopf. Causal discovery with continuous additive noise models. The Journal of Machine Learning Research, 15(1):2009–2053, 2014.
 Sachs et al. (2005) Karen Sachs, Omar Perez, Dana Pe’er, Douglas A Lauffenburger, and Garry P Nolan. Causal proteinsignaling networks derived from multiparameter singlecell data. Science, 308(5721):523–529, 2005.
 Shimizu et al. (2011) Shohei Shimizu, Takanori Inazumi, Yasuhiro Sogawa, Aapo Hyvärinen, Yoshinobu Kawahara, Takashi Washio, Patrik O Hoyer, and Kenneth Bollen. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. Journal of Machine Learning Research, 12(Apr):1225–1248, 2011.
 Spirtes et al. (2000) P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT press, Cambridge, MA, USA, 2nd edition, 2000.
 Spirtes (2010) Peter Spirtes. Introduction to causal inference. Journal of Machine Learning Research, 11(May):1643–1662, 2010.
 Spirtes and Glymour (1991) Peter Spirtes and Clark Glymour. An algorithm for fast recovery of sparse causal graphs. Social Science Computer Review, 9(1):62–72, 1991.
 Tsamardinos et al. (2006) Ioannis Tsamardinos, Laura E Brown, and Constantin F Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1):31–78, 2006.
 Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256, 1992.
 Yu et al. (2019) Yue Yu, Jie Chen, Tian Gao, and Mo Yu. DAGGNN: DAG structure learning with graph neural networks. In ICML, 2019.
 Zheng et al. (2018) Xun Zheng, Bryon Aragam, Pradeep Ravikumar, and Eric P. Xing. DAGs with NO TEARS: Continuous optimization for structure learning. In Advances in Neural Information Processing Systems, 2018.
 Zhu et al. (2019) Shengyu Zhu, Ignavier Ng, and Zhitang Chen. Causal discovery with reinforcement learning. arXiv preprint arXiv:1906.04477, 2019.