DAG-GNN: DAG Structure Learning with Graph Neural Networks

by   Yue Yu, et al.

Learning a faithful directed acyclic graph (DAG) from samples of a joint distribution is a challenging combinatorial problem, owing to the intractable search space superexponential in the number of graph nodes. A recent breakthrough formulates the problem as a continuous optimization with a structural constraint that ensures acyclicity (Zheng et al., 2018). The authors apply the approach to the linear structural equation model (SEM) and the least-squares loss function that are statistically well justified but nevertheless limited. Motivated by the widespread success of deep learning that is capable of capturing complex nonlinear mappings, in this work we propose a deep generative model and apply a variant of the structural constraint to learn the DAG. At the heart of the generative model is a variational autoencoder parameterized by a novel graph neural network architecture, which we coin DAG-GNN. In addition to the richer capacity, an advantage of the proposed model is that it naturally handles discrete variables as well as vector-valued ones. We demonstrate that on synthetic data sets, the proposed method learns more accurate graphs for nonlinearly generated samples; and on benchmark data sets with discrete variables, the learned graphs are reasonably close to the global optima. The code is available at <https://github.com/fishmoon1234/DAG-GNN>.


edGNN: a Simple and Powerful GNN for Directed Labeled Graphs

The ability of a graph neural network (GNN) to leverage both the graph t...

DAMNETS: A Deep Autoregressive Model for Generating Markovian Network Time Series

In this work, we introduce DAMNETS, a deep generative model for Markovia...

Parameterized Hypercomplex Graph Neural Networks for Graph Classification

Despite recent advances in representation learning in hypercomplex (HC) ...

Search to aggregate neighborhood for graph neural network

Recent years have witnessed the popularity and success of graph neural n...

Graph Neural Network for Large-Scale Network Localization

Graph neural networks (GNNs) are popular to use for classifying structur...

On the Role of Sparsity and DAG Constraints for Learning Linear DAGs

Learning graphical structure based on Directed Acyclic Graphs (DAGs) is ...

Recurrent Graph Neural Networks for Rumor Detection in Online Forums

The widespread adoption of online social networks in daily life has crea...

Code Repositories

1 Introduction

Bayesian Networks (BN) have been widely used in machine learning applications (Spirtes et al., 1999; Ott et al., 2004). The structure of a BN takes the form of a directed acyclic graph (DAG) and plays a vital part in causal inference (Pearl, 1988) with many applications in medicine, genetics, economics, and epidemics. Its structure learning problem is however NP-hard (Chickering et al., 2004) and stimulates a proliferation of literature.

Score-based methods generally formulate the structure learning problem as optimizing a certain score function with respect to the unknown (weighted) adjacency matrix and the observed data samples, with a combinatorial constraint stating that the graph must be acyclic. The intractable search space (with a complexity superexponential in the number of graph nodes) poses substantial challenges for optimization. Hence, for practical problems in a scale beyond small, approximate search often needs to be employed with additional structure assumption (Nie et al., 2014; Chow & Liu, 1968; Scanagatta et al., 2015; Chen et al., 2016).

Recently, Zheng et al. (2018) formulate an equivalent acyclicity constraint by using a continuous function of the adjacency matrix (specifically, the matrix exponential of ). This approach drastically changes the combinatorial nature of the problem to a continuous optimization, which may be efficiently solved by using maturely developed blackbox solvers. The optimization problem is nevertheless nonlinear, thus these solvers generally return only a stationary-point solution rather than the global optimum. Nevertheless, the authors show that empirically such local solutions are highly comparable to the global ones obtained through expensive combinatorial search.

With the inspiring reformulation of the constraint, we revisit the objective function. The score-based objective functions generally make assumptions of the variables and the model class. For example, Zheng et al. (2018) demonstrate on the linear structural equation model (SEM) with a least-squares loss. While convenient, such assumptions are often restricted and they may not correctly reflect the actual distribution of real-life data.

Hence, motivated by the remarkable success of deep neural networks, which are arguably universal approximators, in this work we develop a graph-based deep generative model aiming at better capturing the sampling distribution faithful to the DAG. To this end, we employ the machinery of variational inference and parameterize a pair of encoder/decoder with specially designed graph neural networks (GNN). The objective function (the score), then, is the evidence lower bound. Different from the current flourishing designs of GNNs (Bruna et al., 2014; Defferrard et al., 2016; Li et al., 2016; Kipf & Welling, 2017; Hamilton et al., 2017; Gilmer et al., 2017; Chen et al., 2018; Velic̆ković et al., 2018), the proposed ones are generalized from linear SEM, so that the new model performs at least as well as linear SEM when the data is linear.

Our proposal has the following distinct features and advantages. First, the work is built on the widespread use of deep generative models (specifically, variational autoencoders, VAE (Kingma & Welling, 2014)) that are able to capture complex distributions of data and to sample from them. Under the graph setting, the weighted adjacency matrix is an explicit parameter, rather than a latent structure, learnable together with other neural network parameters. The proposed network architecture has not been used before.

Second, the framework of VAE naturally handles various data types, notably not only continuous but also discrete ones. All one needs to do is to model the likelihood distribution (decoder output) consistent with the nature of the variables.

Third, owing to the use of graph neural networks for parameterization, each variable (node) can be not only scalar-valued but also vector-valued. These variables are considered node features input to/output of the GNNs.

Fourth, we propose a variant of the acyclicity constraint more suitable for implementation under current deep learning platforms. The matrix exponential suggested by Zheng et al. (2018), while mathematically elegant, may not be implemented or supported with automatic differentiation in all popular platforms. We propose a polynomial alternative more practically convenient and as numerically stable as the exponential.

We demonstrate the effectiveness of the proposed method on synthetic data generated from linear and nonlinear SEMs, benchmark data sets with discrete variables, and data sets from applications. For synthetic data, the proposed DAG-GNN outperforms DAG-NOTEARS, the algorithm proposed by Zheng et al. (2018) based on linear SEM. For benchmark data, our learned graphs compare favorably with those obtained through optimizing the Bayesian information criterion by using combinatorial search.

2 Background and Related Work

A DAG and a joint distribution are faithful to each other if all and only the conditional independencies true in are entailed by  (Pearl, 1988). The faithfulness condition enables one to recover from . Given independent and iid samples from an unknown distribution corresponding to a faithful but unknown DAG, structure learning refers to recovering the DAG from .

Many exact and approximate algorithms for learning DAG from data have been developed, including score-based and constraint-based approaches (Spirtes et al., 2000a; Chickering, 2002; Koivisto & Sood, 2004; Silander & Myllymaki, 2006; Jaakkola et al., 2010; Cussens, 2011; Yuan & Malone, 2013). Score-based methods generally use a score to measure the goodness of fit of different graphs over data; and then use a search procedure—such as hill-climbing (Heckerman et al., 1995; Tsamardinos et al., 2006; Gámez et al., 2011), forward-backward search (Chickering, 2002), dynamic programming (Singh & Moore, 2005; Silander & Myllymaki, 2006), A (Yuan & Malone, 2013), or integer programming (Jaakkola et al., 2010; Cussens, 2011; Cussens et al., 2016)—in order to find the best graph. Commonly used Bayesian score criteria, such as BDeu and Bayesian information criterion (BIC), are decomposable, consistent, locally consistent (Chickering, 2002), and score equivalent (Heckerman et al., 1995).

To make the DAG search space tractable, approximate methods make additional assumptions such as bounded tree-width (Nie et al., 2014), tree-like structures (Chow & Liu, 1968), approximation (Scanagatta et al., 2015), and other constraints about the DAG (Chen et al., 2016). Many bootstrap (Friedman et al., 1999) and sampling-based structure learning algorithms (Madigan et al., 1995; Friedman & Koller, 2003; Eaton & Murphy, 2012; Grzegorczyk & Husmeier, 2008; Niinimäki & Koivisto, 2013; Niinimaki et al., 2012; He et al., 2016) are also proposed to tackle the expensive search problem.

Constraint-based methods, in contrast, use (conditional) independence tests to test the existence of edges between each pair of variables. Popular algorithms include SGS (Spirtes et al., 2000b), PC (Spirtes et al., 2000b), IC (Pearl, 2003), and FCI (Spirtes et al., 1995; Zhang, 2008). Recently, there appears a suite of hybrid algorithms that combine score-based and constraint-based methods, such as MMHC (Tsamardinos et al., 2003), and apply constraint-based methods to multiple environments (Mooij et al., 2016).

Due to the NP-hardness, traditional DAG learning methods usually deal with discrete variables, as discussed above, or jointly Gaussian variables (Mohan et al., 2012; Mohammadi et al., 2015). Recently, a new continuous optimization approach is proposed (Zheng et al., 2018), which transforms the discrete search procedure into an equality constraint. This approach enables a suite of continuous optimization techniques such as gradient descent to be used. The approach achieves good structure recovery results, although it is applied to only linear SEM for ease of exposition.

Neural-network approaches started to surface only very recently. Kalainathan et al. (2018) propose a GAN-style (generative adversarial network) method, whereby a separate generative model is applied to each variable and a discriminator is used to distinguish between the joint distributions of real and generated samples. The approach appears to scale well but acyclicity is not enforced.

3 Neural DAG Structure Learning

Our method learns the weighted adjacency matrix of a DAG by using a deep generative model that generalizes linear SEM, with which we start the journey.

3.1 Linear Structural Equation Model

Let be the weighted adjacency matrix of the DAG with nodes and be a sample of a joint distribution of variables, where each row corresponds to one variable. In the literature, a variable is typically a scalar, but it can be trivially generalized to a -dimensional vector under the current setting. The linear SEM model reads


where is the noise matrix. When the graph nodes are sorted in the topological order, the matrix is strictly upper triangular. Hence, ancestral sampling from the DAG is equivalent to generating a random noise followed by a triangular solve


3.2 Proposed Graph Neural Network Model

Equation (2) may be written as , a general form recognized by the deep learning community as an abstraction of parameterized graph neural networks that take node features as input and return as high level representations. Nearly all graph neural networks (Bruna et al., 2014; Defferrard et al., 2016; Li et al., 2016; Kipf & Welling, 2017; Hamilton et al., 2017; Gilmer et al., 2017; Chen et al., 2018; Velic̆ković et al., 2018) can be written in this form. For example, the popular GCN (Kipf & Welling, 2017) architecture reads

where is a normalization of and and are parameter matrices.

Owing to the special structure (2), we propose a new graph neural network architecture


The parameterized functions and effectively perform (possibly nonlinear) transforms on and , respectively. If is invertible, then (3) is equivalent to , a generalized version of the linear SEM (1).

We will defer the instantiation of these functions in a later subsection. One of the reasons is that the activation in must match the type of the variable , a subject to be discussed together with discrete variables.

3.3 Model Learning with Variational Autoencoder

Given a specification of the distribution of and samples , one may learn the generative model (3) through maximizing the log-evidence

which, unfortunately, is generally intractable. Hence, we appeal to variational Bayes.

To this end, we use a variational posterior to approximate the actual posterior . The net result is the evidence lower bound (ELBO)



Each individual term departs from the log-evidence by , the KL-divergence between the variational posterior and the actual one.

The ELBO lends itself to a variational autoencoder (VAE) (Kingma & Welling, 2014), where given a sample , the encoder (inference model) encodes it into a latent variable with density ; and the decoder (generative model) tries to reconstruct from with density . Both densities may be parameterized by using neural networks.

Modulo the probability specification to be completed later, the generative model (

3) discussed in the preceding subsection plays the role of the decoder. Then, we propose the corresponding encoder


where and are parameterized functions that conceptually play the inverse role of and , respectively.

Figure 1: Architecture (for continuous variables). In the case of discrete variables, the decoder output is changed from to .

3.4 Architecture and Loss Function

To complete the VAE, one must specify the distributions in (4). Recall that for now both and are matrices. For simplicity, the prior is typically modeled as the standard matrix normal .

For the inference model, we let

be a multilayer perceptron (MLP) and

be the identity mapping. Then, the variational posterior is a factored Gaussian with mean

and standard deviation

, computed from the encoder


where , and and are parameter matrices.

For the generative model, we let be the identity mapping and be an MLP. Then, the likelihood is a factored Gaussian with mean and standard deviation , computed from the decoder


where and are parameter matrices.

One may switch the MLP and the identity mapping inside each of the encoder/decoder, but we find that the performance is less competitive. One possible reason is that the current design (7) places an emphasis on the nonlinear transform of a sample from linear SEM, which better captures nonlinearity.

Based on (6) and (7), the KL-divergence term in the ELBO (4) admits a closed form


and the reconstruction accuracy term may be computed with Monte Carlo approximation


where is a constant and and are the outputs of the decoder (7) by taking as input Monte Carlo samples , for .

Note that under the autoencoder framework, is considered latent (rather than the noise in linear SEM). Hence, the column dimension of may be different from . From the neural network point of view, changing the column dimension of affects only the sizes of the parameter matrices and . Sometimes, one may want to use a smaller number than if he/she observes that the data has a smaller intrinsic dimension.

An illustration of the architecture is shown in Figure 1.

3.5 Discrete Variables

One advantage of the proposed method is that it naturally handles discrete variables. We assume that each variable has a finite support of cardinality .

Hence, we let each row of be a one-hot vector, where the “on” location indicates the value of the corresponding variable. We still use standard matrix normal to model the prior and factored Gaussian to model the variational posterior, with (6) being the encoder. On the other hand, we need to slightly modify the likelihood to cope with the discrete nature of the variables.

Specifically, we let be a factored categorical distribution with probability matrix

, where each row is a probability vector for the corresponding categorical variable. To achieve so, we change

from the identity mapping to a row-wise softmax and modify the decoder (7) to


Correspondingly for the ELBO, the KL term (8) remains the same, but the reconstruction term (9) needs be modified to


where is the output of the decoder (10) by taking as input Monte Carlo samples , for .

3.6 Connection to Linear SEM

One has seen from the forgoing discussions how the proposed model is developed from linear SEM: We apply nonlinearality to the sampling procedure (2) of SEM, treat the resulting generative model as a decoder, and pair with it a variational encoder for tractable learning. Compared with a plain autoencoder, the variational version allows a modeling of the latent space, from which samples are generated.

We now proceed, in a reverse thought flow, to establish the connection between the loss function of the linear SEM considered in Zheng et al. (2018) and that of ours. We first strip off the variational component of the autoencoder. This plain version uses (5) as the encoder and (3) as the decoder. For notational clarity, let us write as the output of the decoder, to distinguish it from the encoder input . A typical sample loss to minimize is

where the first term is the reconstruction error and the second term is a regularization of the latent space. One recognizes that the reconstruction error is the same as the negative reconstruction accuracy (9) in the ELBO, up to a constant, if the standard deviation is , the mean is taken as , and only one Monte Carlo sample is drawn from the variational posterior. Moreover, the regularization term is the same as the KL-divergence (8) in the ELBO if the standard deviation is and the mean is taken as .

If we further strip off the (possibly nonlinear) mappings to , then the encoder (5) and decoder (3) read, respectively, and . This pair results in perfect reconstruction, and hence the sample loss reduces to


which is the least-squares loss used and justified by Zheng et al. (2018).

3.7 Acyclicity Constraint

Neither maximizing the ELBO (4) nor minimizing the least-squares loss (12) guarantees that the corresponding graph of the resulting is acyclic. Zheng et al. (2018) pair the loss function with an equality constraint, whose satisfaction ensures acyclicity.

The idea is based on the fact that the positivity of the element of the -th power of a nonnegative adjacency matrix indicates the existence of a length- path between nodes and . Hence, the positivity of the diagonal of

reveals cycles. The authors leverage the trick that the matrix exponential admits a Taylor series (because it is analytic on the complex plane), which is nothing but a weighted sum of all nonnegative integer powers of the matrix. The coefficient of the zeorth power (the identity matrix

) is , and hence the trace of the exponential of must be exactly for a DAG. To satisfy nonnegativity, one may let be the elementwise square of ; that is, .

Whereas the formulation of this acyclicity constraint is mathematically elegant, support of the matrix exponential may not be available in all deep learning platforms. To ease the coding effort, we propose an alternative constraint that is practically convenient.

Theorem 1.

Let be the (possibly negatively) weighted adjacency matrix of a directed graph. For any , the graph is acyclic if and only if


We use (13) as the equality constraint when maximizing the ELBO. The computations of both and

may meet numerical difficulty when the eigenvalues of

have a large magnitude. However, the former is less severe than the latter with a judicious choice of .

Theorem 2.

Let for some . Then for any complex , we have .

In practice,

may be treated as a hyperparameter and its setting depends on an estimation of the largest eigenvalue of

in magnitude. This value is the spectral radius of , and because of nonnegativity, it is bounded by the maximum row sum according to the Perron–Frobenius theorem.

3.8 Training

Based on the foregoing, the learning problem is


where the unknowns include the matrix and all the parameters of the VAE (currently we have ). Nonlinear equality-constrained problems are well studied and we use the augmented Lagrangian approach to solve it. For completeness, we summarize the algorithm here; the reader is referred to standard textbooks such as Section 4.2 of Bertsekas (1999) for details and convergence analysis.

Define the augmented Lagrangian

where is the Lagrange multiplier and is the penalty parameter. When , the minimizer of must satisfy , in which case is equal to the objective function . Hence, the strategy is to progressively increase , for each of which minimize the unconstrained augmented Lagrangian. The Lagrange multiplier is correspondingly updated so that it converges to the one under the optimality condition.

There exist a few variants for updating and increasing , but a typical effective rule reads:


where and are tuning parameters. We find that often and work well.

The subproblem (14) may be solved by using blackbox stochastic optimization solvers, by noting that the ELBO is defined on a set of samples.

4 Experiments

In this section, we present a comprehensive set of experiments to demonstrate the effectiveness of the proposed method DAG-GNN. In Section 4.1, we compare with DAG-NOTEARS, the method proposed by Zheng et al. (2018) based on linear SEM, on synthetic data sets generated by sampling generalized linear models, with an emphasis on nonlinear data and vector-valued data (). In Section 4.2, we showcase the capability of our model with discrete data, often seen in benchmark data sets with ground truths for assessing quality. To further illustrate the usefulness of the proposed method, in Section 4.3 we apply DAG-GNN on a protein data set for the discovery of consensus protein signaling network, as well as a knowledge base data set for learning causal relations.

Our implementation is based on PyTorch 

(Paszke et al., 2017). We use Adam (Kingma & Ba, 2015) to solve the subproblems (14). To avoid overparameterization, we parameterize the variational posterior

as a factored Gaussian with constant unit variance, and similarly for the likelihood

. When extracting the DAG, we use a thresholding value , following the recommendation of Zheng et al. (2018). For benchmark and application data sets, we include a Huber-norm regularization of in the objective function to encourage more rapid convergence.

4.1 Synthetic Data Sets

The synthetic data sets are generated in the following manner. We first generate a random DAG by using the Erdős–Rényi model with expected node degree 3, then assign uniformly random weights for the edges to obtain the weighted adjacency matrix . A sample is generated by sampling the (generalized) linear model with some function elaborated soon. The noise follows standard matrix normal. When the dimension , we use lowercase letters to denote vectors; that is, . We compare DAG-GNN with DAG-NOTEARS and report the structural Hamming distance (SHD) and false discovery rate (FDR), each averaged over five random repetitions. With sample size , we run experiments on four graph sizes . In Sections 4.1.1 and 4.1.2 we consider scalar-valued variables ( and in Section 4.1.3 vector-valued variables ().

4.1.1 Linear Case

This case is the linear SEM model, with being the identity mapping. The SHD and FDR are plotted in Figure 2. One sees that the graphs learned by the proposed method are substantially more accurate than those by DAG-NOTEARS when the graphs are large.

Figure 2: Structure discovery in terms of SHD and FDR to the true graph, on synthetic data set generated by .

4.1.2 Nonlinear Case

We now consider data generated by the following model

for some nonlinear function . Taking first-order approximation (ignoring higher-order terms of ), one obtains an amendatory approximation of the graph adjacency matrix, . This approximate ground truth maintains the DAG structure, with only a scaling on the edge weights.

We take and plot the SHD and FDR in Figure 3. one observes that DAG-GNN slightly improves over DAG-NOTEARS in terms of SHD. Further, FDR is substantially improved, by approximately a factor of three, which indicates that DAG-GNN tends to be more accurate on selecting correct edges. This observation is consistent with the parameter estimates shown in Figure 4, where the ground truth is set as . The heat map confirms that DAG-GNN results in fewer “false alarms” and recovers a relatively sparser matrix.

Figure 3: Structure discovery in terms of SHD and FDR to the true graph, on synthetic data set generated by .
Figure 4: Parameter estimates (before thresholding) of the graph on synthetic data set generated by .

We further experiment with a more complex nonlinear generation model, where the nonlinearity occurs after the linear combination of the variables, as opposed to the preceding case where nonlinearity is applied to the variables before linear combination. Specifically, we consider

and plot the results in Figure 5. One sees that with higher nonlinearity, the proposed method results in significantly better SHD and FDR than does DAG-NOTEARS.

Figure 5: Structure discovery in terms of SHD and FDR to the true graph, on synthetic data set generated by .

4.1.3 Vector-Valued Case

The proposed method offers a modeling benefit that the variables can be vector-valued with . Moreover, since resides in the latent space of the autoencoder and is not interpreted as noise as in linear SEM, one may take a smaller column dimension if he/she believes that the variables have a lower intrinsic dimension. To demonstrate this capability, we construct a data set where the different dimensions come from a randomly scaled and perturbed sample from linear SEM. Specifically, given a graph adjacency matrix , we first construct a sample from the linear SEM , and then generate for the -th dimension , where and are random scalars from standard normal and is a standard normal vector. The eventual sample is .

We let and and compare DAG-GNN with DAG-NOTEARS. The SHD and FDR are plotted in Figure 6. The figure clearly shows the significantly better performance of the proposed method. Moreover, the parameter estimates are shown in Figure 7, compared against the ground-truth . One sees that the estimated graph from DAG-GNN successfully captures all the ground truth edges and that the estimated weights are also similar. On the other hand, DAG-NOTEARS barely learns the graph.

Figure 6: Structure discovery in terms of SHD and FDR to the true graph, on synthetic vector-valued data set.
Figure 7: Parameter estimates (before thresholding) of the graph on synthetic vector-valued data set.

4.2 Benchmark Data Sets

A benefit of the proposed method is that it naturally handles discrete variables, a case precluded by linear SEM. We demonstrate the use of DAG-GNN on three discrete benchmark data sets: Child, Alarm, and Pigs (Tsamardinos et al., 2006). For comparison is the state-of-the-art exact DAG solver GOPNILP (Cussens et al., 2016), which is based on a constrained integer programming formulation. We use 1000 samples for learning.

One sees from Table 1 that our results are reasonably close to the ground truth, whereas not surprisingly the results of GOPNILP are nearly optimal. The BIC score gap exhibits by DAG-GNN may be caused by the relatively simple autoencoder architecture, which is less successful in approximating multinomial distributions. Nevertheless, it is encouraging that the proposed method as a unified framework can handle discrete variables with only slight changes in the network architecture.

Dataset Groundtruth GOPNILP DAG-GNN
Child 20 -1.27e+4 -1.27e+4 -1.38e+4
Alarm 37 -1.07e+4 -1.12e+4 -1.28e+4
Pigs 441 -3.48e+5 -3.50e+5 -3.69e+5
Table 1: BIC scores on benchmark datasets of discrete variables.

4.3 Applications

We consider a bioinformatics data set (Sachs et al., 2005) for the discovery of a protein signaling network based on expression levels of proteins and phospholipids. This is a widely used data set for research on graphical models, with experimental annotations accepted by the biological research community. The data set offers continuous measurements of expression levels of multiple phosphorylated proteins and phospholipid components in human immune system cells, and the modeled network provides the ordering of the connections between pathway components. Based on samples of cell types, Sachs et al. (2005) estimate 20 edges in the graph.

In Table 2, we compare DAG-GNN with DAG-NOTEARS as well as FSG, the fast greedy search method proposed by Ramsey et al. (2017), against the ground truth offered by Sachs et al. (2005). The proposed method achieves the lowest SHD. We further show in Figure 8 our estimated graph. One observes that it is acyclic. Our method successfully learns 8 out of 20 ground-truth edges (as marked by red arrows), and predicts 5 indirectly connected edges (blue dashed arrows) as well as 3 reverse edges (yellow arrows).

Method SHD # Predicted edges
FGS 22 17
DAG-GNN 19 18
Table 2: Results on protein signaling network: comparison of the predicted graphs with respect to the ground truth.
Figure 8: Estimate protein signaling network.

For another application, we develop a new causal inference task over relations defined in a knowledge base (KB) schema. The task aims at learning a BN, the nodes of which are relations and the edges indicate whether one relation suggests another. For example, the relation person/Nationality may imply person/Language, because the spoken language of a person naturally associates with his/her nationality. This task has a practical value, because most existing KBs are constructed by hand. The success of this task helps suggest meaningful relations for new entities and reduce human efforts. We construct a data set from FB15K-237 (Toutanova et al., 2015) and list in Table 3 a few extracted causal relations. Because of space limitation, we defer the details and more results in the supplementary material. One sees that these results are quite intuitive. We plan a comprehensive study with field experts to systematically evaluate the extraction results.

film/ProducedBy film/Country
film/ProductionCompanies film/Country
person/Nationality person/Languages
person/PlaceOfBirth person/Languages
person/PlaceOfBirth person/Nationality
person/PlaceLivedLocation person/Nationality
Table 3: Examples of extracted edges with high confidence.

5 Conclusion

DAG structure learning is a challenging problem that has long been pursued in the literature of graphical models. The difficulty, in a large part, is owing to the NP-hardness incurred in the combinatorial formulation. Zheng et al. (2018)

propose an equivalent continuous constraint that opens the opportunity of using well developed continuous optimization techniques for solving the problem. In this context, we explore the power of neural networks as functional approximators and develop a deep generative model to capture the complex data distribution, aiming at better recovering the underlying DAG with a different design of the objective function. In particular, we employ the machinery of variational autoencoders and parameterize them with new graph neural network architectures. The proposed method handles not only data generated by parametric models beyond linear, but also variables in general forms, including scalar/vector values and continuous/discrete types. We have performed extensive experiments on synthetic, benchmark, and application data and demonstrated the practical competitiveness of the proposal.


  • Bertsekas (1999) Bertsekas, D. P. Nonlinear Programming. Athena Scientific, 2nd edition, 1999.
  • Bruna et al. (2014) Bruna, J., Zaremba, W., Szlam, A., and LeCun, Y. Spectral networks and locally connected networks on graphs. In ICLR, 2014.
  • Chen et al. (2016) Chen, E. Y.-J., Shen, Y., Choi, A., and Darwiche, A. Learning Bayesian networks with ancestral constraints. In NIPS, 2016.
  • Chen et al. (2018) Chen, J., Ma, T., and Xiao, C. FastGCN: Fast learning with graph convolutional networks via importance sampling. In ICLR, 2018.
  • Chickering (2002) Chickering, D. M. Optimal structure identification with greedy search. Journal of Machine Learning Research, 2002.
  • Chickering et al. (2004) Chickering, D. M., Heckerman, D., and Meek, C. Large-sample learning of Bayesian networks is NP-hard. Journal of Machine Learning Research, 5:1287–1330, 2004.
  • Chow & Liu (1968) Chow, C. and Liu, C.

    Approximating discrete probability distributions with dependence trees.

    IEEE transactions on Information Theory, 14(3):462–467, 1968.
  • Cussens (2011) Cussens, J. Bayesian network learning with cutting planes. In UAI, 2011.
  • Cussens et al. (2016) Cussens, J., Haws, D., and Studenỳ, M. Polyhedral aspects of score equivalence in Bayesian network structure learning. Mathematical Programming, pp. 1–40, 2016.
  • Defferrard et al. (2016) Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, 2016.
  • Eaton & Murphy (2012) Eaton, D. and Murphy, K. Bayesian structure learning using dynamic programming and MCMC. arXiv preprint arXiv:1206.5247, 2012.
  • Friedman & Koller (2003) Friedman, N. and Koller, D. Being Bayesian about network structure. a Bayesian approach to structure discovery in Bayesian networks. Machine learning, 50(1-2):95–125, 2003.
  • Friedman et al. (1999) Friedman, N., Goldszmidt, M., and Wyner, A. Data analysis with Bayesian networks: A bootstrap approach. In UAI, 1999.
  • Gilmer et al. (2017) Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In ICML, 2017.
  • Grzegorczyk & Husmeier (2008) Grzegorczyk, M. and Husmeier, D. Improving the structure MCMC sampler for Bayesian networks by introducing a new edge reversal move. Machine Learning, 71(2-3):265, 2008.
  • Gámez et al. (2011) Gámez, J., Mateo, J., and Puerta, J. Learning Bayesian networks by hill climbing: efficient methods based on progressive restriction of the neighborhood. Data Mining and Knowledge Discovery, 22(1-2):106–148, 2011.
  • Hamilton et al. (2017) Hamilton, W. L., Ying, R., and Leskovec, J. Inductive representation learning on large graphs. In NIPS, 2017.
  • He et al. (2016) He, R., Tian, J., and Wu, H. Structure learning in Bayesian networks of a moderate size by efficient sampling. Journal of Machine Learning Research, 17(1):3483–3536, 2016.
  • Heckerman et al. (1995) Heckerman, D., Geiger, D., and Chickering, D. M. Learning Bayesian networks: The combination of knowledge and statistical data. Machine learning, 20(3):197–243, 1995.
  • Jaakkola et al. (2010) Jaakkola, T., Sontag, D., Globerson, A., and Meila, M. Learning Bayesian network structure using LP relaxations. 2010.
  • Kalainathan et al. (2018) Kalainathan, D., Goudet, O., Guyon, I., Lopez-Paz, D., and Sebag, M. SAM: Structural agnostic model, causal discovery and penalized adversarial learning. arXiv preprint arXiv:1803.04929, 2018.
  • Kingma & Ba (2015) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.
  • Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In ICLR, 2014.
  • Kipf & Welling (2017) Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
  • Koivisto & Sood (2004) Koivisto, M. and Sood, K. Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, 5:549–573, 2004.
  • Li et al. (2016) Li, Y., Tarlow, D., Brockschmidt, M., and Zemel, R. Gated graph sequence neural networks. In ICLR, 2016.
  • Madigan et al. (1995) Madigan, D., York, J., and Allard, D. Bayesian graphical models for discrete data. International Statistical Review/Revue Internationale de Statistique, pp. 215–232, 1995.
  • Mohammadi et al. (2015) Mohammadi, A., Wit, E. C., et al. Bayesian structure learning in sparse Gaussian graphical models. Bayesian Analysis, 10(1):109–138, 2015.
  • Mohan et al. (2012) Mohan, K., Chung, M., Han, S., Witten, D., Lee, S.-I., and Fazel, M. Structured learning of Gaussian graphical models. In NIPS, 2012.
  • Mooij et al. (2016) Mooij, J. M., Magliacane, S., and Claassen, T. Joint causal inference from multiple contexts. arXiv preprint arXiv:1611.10351, 2016.
  • Nie et al. (2014) Nie, S., Mauá, D. D., De Campos, C. P., and Ji, Q. Advances in learning Bayesian networks of bounded treewidth. In NIPS, 2014.
  • Niinimaki et al. (2012) Niinimaki, T., Parviainen, P., and Koivisto, M. Partial order MCMC for structure discovery in Bayesian networks. arXiv preprint arXiv:1202.3753, 2012.
  • Niinimäki & Koivisto (2013) Niinimäki, T. M. and Koivisto, M. Annealed importance sampling for structure learning in Bayesian networks. In IJCAI, 2013.
  • Ott et al. (2004) Ott, S., Imoto, S., and Miyano, S. Finding optimal models for small gene networks. In Pacific symposium on biocomputing, 2004.
  • Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.
  • Pearl (1988) Pearl, J. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann Publishers, Inc., 2 edition, 1988.
  • Pearl (2003) Pearl, J. Causality: models, reasoning, and inference. Econometric Theory, 19(46):675–685, 2003.
  • Ramsey et al. (2017) Ramsey, J., Glymour, M., Sanchez-Romero, R., and Glymour, C. A million variables and more: the fast greedy equivalence search algorithm for learning high-dimensional graphical causal models, with an application to functional magnetic resonance images.

    International Journal of Data Science and Analytics

    , 3(2):121–129, 2017.
  • Sachs et al. (2005) Sachs, K., Perez, O., Peer, D., Lauffenburger, D. A., and Nolan, G. P. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523–529, 2005.
  • Scanagatta et al. (2015) Scanagatta, M., de Campos, C. P., Corani, G., and Zaffalon, M. Learning Bayesian networks with thousands of variables. In NIPS, 2015.
  • Silander & Myllymaki (2006) Silander, T. and Myllymaki, P. A simple approach for finding the globally optimal Bayesian network structure. In UAI, 2006.
  • Singh & Moore (2005) Singh, A. P. and Moore, A. W. Finding optimal Bayesian networks by dynamic programming. Technical report, Carnegie Mellon University, 2005.
  • Spirtes et al. (1995) Spirtes, P., Meek, C., and Richardson, T. Causal inference in the presence of latent variables and selection bias. In UAI, 1995.
  • Spirtes et al. (1999) Spirtes, P., Glymour, C. N., and Scheines, R. Computation, Causation, and Discovery. AAAI Press, 1999.
  • Spirtes et al. (2000a) Spirtes, P., Glymour, C., Scheines, R., Kauffman, S., Aimale, V., and Wimberly, F. Constructing Bayesian network models of gene expression networks from microarray data, 2000a.
  • Spirtes et al. (2000b) Spirtes, P., Glymour, C. N., Scheines, R., Heckerman, D., Meek, C., Cooper, G., and Richardson, T. Causation, prediction, and search. MIT press, 2000b.
  • Toutanova et al. (2015) Toutanova, K., Chen, D., Pantel, P., Poon, H., Choudhury, P., and Gamon, M. Representing text for joint embedding of text and knowledge bases. In EMNLP, 2015.
  • Tsamardinos et al. (2003) Tsamardinos, I., Aliferis, C. F., and Statnikov, A. Time and sample efficient discovery of markov blankets and direct causal relations. In SIGKDD, 2003.
  • Tsamardinos et al. (2006) Tsamardinos, I., Brown, L., and Aliferis, C. The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1):31–78, 2006.
  • Velic̆ković et al. (2018) Velic̆ković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention networks. In ICLR, 2018.
  • Yuan & Malone (2013) Yuan, C. and Malone, B. Learning optimal Bayesian networks: A shortest path perspective. 48:23–65, 2013.
  • Zhang (2008) Zhang, J. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence, 172(16-17):1873–1896, 2008.
  • Zheng et al. (2018) Zheng, X., Aragam, B., Ravikumar, P., and Xing, E. P. DAGs with NO TEARS: Continuous optimization for structure learning. In NIPS, 2018.

Appendix A Proofs

Proof of Theorem 1.

Let . Clearly, is nonnegative. The binomial expansion reads

It is known that there is a cycle of length if and only if when . Because if there is a cycle then there is a cycle of length at most , we conclude that there is no cycle if and only if . ∎

Proof of Theorem 2.


For given and , the right-hand side of the equality is a function of . This function monotonically increases for positive and has a limit . Hence, for any finite , . ∎

Appendix B Structure Learning over KB Relations

We construct the data set from triples in FB15K-237 (Toutanova et al., 2015), which is a subset of FreeBase with approximately 15k entities and 237 relations. Each sample corresponds to an entity and each variable corresponds to a relation in this knowledge base. Each sample has on average 7.36 relations (i.e. 7.36 non-zero entries in each row).

Table 4 gives additional examples learned by our model with highest confidence scores. For each target relation on the right-hand side, we show the highest ranked relations within the same domain (i.e. the contents in the field before “/” such as “film” and “tvProgram”). On the left-hand side, we omit the relations that are common to the associated entity types, e.g. “profession” and “gender” to persons and “genre” to films, because almost all entities with these types will contain such a relation.

film/ProducedBy film/Country
film/ProductionCompanies film/Country
tvProgram/CountryOfOriginal tvProgram/Language
tvProgram/RegularCast.regularTv/AppearanceActor tvProgram/Language
person/Nationality person/Languages
person/PlaceOfBirth person/Languages
person/PlaceOfBirth person/Nationality
person/PlaceLivedLocation person/Nationality
organization/Headquarters.mailingAddress/Citytown organization/PlaceFounded
organization/Headquarters.mailingAddress/StateProvince organization/PlaceFounded
Table 4: (Continued from Table 3) Examples of extracted edges with high confidence. The dot appearing in means that the sample entity is connected to a virtual node (i.e. compound value types in FreeBase) via relation , followed by a relation to a real entity.