Learning a faithful directed acyclic graph (DAG) from samples of a joint distribution is a challenging combinatorial problem, owing to the intractable search space superexponential in the number of graph nodes. A recent breakthrough formulates the problem as a continuous optimization with a structural constraint that ensures acyclicity (Zheng et al., 2018). The authors apply the approach to the linear structural equation model (SEM) and the least-squares loss function that are statistically well justified but nevertheless limited. Motivated by the widespread success of deep learning that is capable of capturing complex nonlinear mappings, in this work we propose a deep generative model and apply a variant of the structural constraint to learn the DAG. At the heart of the generative model is a variational autoencoder parameterized by a novel graph neural network architecture, which we coin DAG-GNN. In addition to the richer capacity, an advantage of the proposed model is that it naturally handles discrete variables as well as vector-valued ones. We demonstrate that on synthetic data sets, the proposed method learns more accurate graphs for nonlinearly generated samples; and on benchmark data sets with discrete variables, the learned graphs are reasonably close to the global optima. The code is available at <https://github.com/fishmoon1234/DAG-GNN>.READ FULL TEXT VIEW PDF
Bayesian Networks (BN) have been widely used in machine learning applications (Spirtes et al., 1999; Ott et al., 2004). The structure of a BN takes the form of a directed acyclic graph (DAG) and plays a vital part in causal inference (Pearl, 1988) with many applications in medicine, genetics, economics, and epidemics. Its structure learning problem is however NP-hard (Chickering et al., 2004) and stimulates a proliferation of literature.
Score-based methods generally formulate the structure learning problem as optimizing a certain score function with respect to the unknown (weighted) adjacency matrix and the observed data samples, with a combinatorial constraint stating that the graph must be acyclic. The intractable search space (with a complexity superexponential in the number of graph nodes) poses substantial challenges for optimization. Hence, for practical problems in a scale beyond small, approximate search often needs to be employed with additional structure assumption (Nie et al., 2014; Chow & Liu, 1968; Scanagatta et al., 2015; Chen et al., 2016).
Recently, Zheng et al. (2018) formulate an equivalent acyclicity constraint by using a continuous function of the adjacency matrix (specifically, the matrix exponential of ). This approach drastically changes the combinatorial nature of the problem to a continuous optimization, which may be efficiently solved by using maturely developed blackbox solvers. The optimization problem is nevertheless nonlinear, thus these solvers generally return only a stationary-point solution rather than the global optimum. Nevertheless, the authors show that empirically such local solutions are highly comparable to the global ones obtained through expensive combinatorial search.
With the inspiring reformulation of the constraint, we revisit the objective function. The score-based objective functions generally make assumptions of the variables and the model class. For example, Zheng et al. (2018) demonstrate on the linear structural equation model (SEM) with a least-squares loss. While convenient, such assumptions are often restricted and they may not correctly reflect the actual distribution of real-life data.
Hence, motivated by the remarkable success of deep neural networks, which are arguably universal approximators, in this work we develop a graph-based deep generative model aiming at better capturing the sampling distribution faithful to the DAG. To this end, we employ the machinery of variational inference and parameterize a pair of encoder/decoder with specially designed graph neural networks (GNN). The objective function (the score), then, is the evidence lower bound. Different from the current flourishing designs of GNNs (Bruna et al., 2014; Defferrard et al., 2016; Li et al., 2016; Kipf & Welling, 2017; Hamilton et al., 2017; Gilmer et al., 2017; Chen et al., 2018; Velic̆ković et al., 2018), the proposed ones are generalized from linear SEM, so that the new model performs at least as well as linear SEM when the data is linear.
Our proposal has the following distinct features and advantages. First, the work is built on the widespread use of deep generative models (specifically, variational autoencoders, VAE (Kingma & Welling, 2014)) that are able to capture complex distributions of data and to sample from them. Under the graph setting, the weighted adjacency matrix is an explicit parameter, rather than a latent structure, learnable together with other neural network parameters. The proposed network architecture has not been used before.
Second, the framework of VAE naturally handles various data types, notably not only continuous but also discrete ones. All one needs to do is to model the likelihood distribution (decoder output) consistent with the nature of the variables.
Third, owing to the use of graph neural networks for parameterization, each variable (node) can be not only scalar-valued but also vector-valued. These variables are considered node features input to/output of the GNNs.
Fourth, we propose a variant of the acyclicity constraint more suitable for implementation under current deep learning platforms. The matrix exponential suggested by Zheng et al. (2018), while mathematically elegant, may not be implemented or supported with automatic differentiation in all popular platforms. We propose a polynomial alternative more practically convenient and as numerically stable as the exponential.
We demonstrate the effectiveness of the proposed method on synthetic data generated from linear and nonlinear SEMs, benchmark data sets with discrete variables, and data sets from applications. For synthetic data, the proposed DAG-GNN outperforms DAG-NOTEARS, the algorithm proposed by Zheng et al. (2018) based on linear SEM. For benchmark data, our learned graphs compare favorably with those obtained through optimizing the Bayesian information criterion by using combinatorial search.
A DAG and a joint distribution are faithful to each other if all and only the conditional independencies true in are entailed by (Pearl, 1988). The faithfulness condition enables one to recover from . Given independent and iid samples from an unknown distribution corresponding to a faithful but unknown DAG, structure learning refers to recovering the DAG from .
Many exact and approximate algorithms for learning DAG from data have been developed, including score-based and constraint-based approaches (Spirtes et al., 2000a; Chickering, 2002; Koivisto & Sood, 2004; Silander & Myllymaki, 2006; Jaakkola et al., 2010; Cussens, 2011; Yuan & Malone, 2013). Score-based methods generally use a score to measure the goodness of fit of different graphs over data; and then use a search procedure—such as hill-climbing (Heckerman et al., 1995; Tsamardinos et al., 2006; Gámez et al., 2011), forward-backward search (Chickering, 2002), dynamic programming (Singh & Moore, 2005; Silander & Myllymaki, 2006), A (Yuan & Malone, 2013), or integer programming (Jaakkola et al., 2010; Cussens, 2011; Cussens et al., 2016)—in order to find the best graph. Commonly used Bayesian score criteria, such as BDeu and Bayesian information criterion (BIC), are decomposable, consistent, locally consistent (Chickering, 2002), and score equivalent (Heckerman et al., 1995).
To make the DAG search space tractable, approximate methods make additional assumptions such as bounded tree-width (Nie et al., 2014), tree-like structures (Chow & Liu, 1968), approximation (Scanagatta et al., 2015), and other constraints about the DAG (Chen et al., 2016). Many bootstrap (Friedman et al., 1999) and sampling-based structure learning algorithms (Madigan et al., 1995; Friedman & Koller, 2003; Eaton & Murphy, 2012; Grzegorczyk & Husmeier, 2008; Niinimäki & Koivisto, 2013; Niinimaki et al., 2012; He et al., 2016) are also proposed to tackle the expensive search problem.
Constraint-based methods, in contrast, use (conditional) independence tests to test the existence of edges between each pair of variables. Popular algorithms include SGS (Spirtes et al., 2000b), PC (Spirtes et al., 2000b), IC (Pearl, 2003), and FCI (Spirtes et al., 1995; Zhang, 2008). Recently, there appears a suite of hybrid algorithms that combine score-based and constraint-based methods, such as MMHC (Tsamardinos et al., 2003), and apply constraint-based methods to multiple environments (Mooij et al., 2016).
Due to the NP-hardness, traditional DAG learning methods usually deal with discrete variables, as discussed above, or jointly Gaussian variables (Mohan et al., 2012; Mohammadi et al., 2015). Recently, a new continuous optimization approach is proposed (Zheng et al., 2018), which transforms the discrete search procedure into an equality constraint. This approach enables a suite of continuous optimization techniques such as gradient descent to be used. The approach achieves good structure recovery results, although it is applied to only linear SEM for ease of exposition.
Neural-network approaches started to surface only very recently. Kalainathan et al. (2018) propose a GAN-style (generative adversarial network) method, whereby a separate generative model is applied to each variable and a discriminator is used to distinguish between the joint distributions of real and generated samples. The approach appears to scale well but acyclicity is not enforced.
Our method learns the weighted adjacency matrix of a DAG by using a deep generative model that generalizes linear SEM, with which we start the journey.
Let be the weighted adjacency matrix of the DAG with nodes and be a sample of a joint distribution of variables, where each row corresponds to one variable. In the literature, a variable is typically a scalar, but it can be trivially generalized to a -dimensional vector under the current setting. The linear SEM model reads
where is the noise matrix. When the graph nodes are sorted in the topological order, the matrix is strictly upper triangular. Hence, ancestral sampling from the DAG is equivalent to generating a random noise followed by a triangular solve
Equation (2) may be written as , a general form recognized by the deep learning community as an abstraction of parameterized graph neural networks that take node features as input and return as high level representations. Nearly all graph neural networks (Bruna et al., 2014; Defferrard et al., 2016; Li et al., 2016; Kipf & Welling, 2017; Hamilton et al., 2017; Gilmer et al., 2017; Chen et al., 2018; Velic̆ković et al., 2018) can be written in this form. For example, the popular GCN (Kipf & Welling, 2017) architecture reads
where is a normalization of and and are parameter matrices.
Owing to the special structure (2), we propose a new graph neural network architecture
We will defer the instantiation of these functions in a later subsection. One of the reasons is that the activation in must match the type of the variable , a subject to be discussed together with discrete variables.
Given a specification of the distribution of and samples , one may learn the generative model (3) through maximizing the log-evidence
which, unfortunately, is generally intractable. Hence, we appeal to variational Bayes.
To this end, we use a variational posterior to approximate the actual posterior . The net result is the evidence lower bound (ELBO)
Each individual term departs from the log-evidence by , the KL-divergence between the variational posterior and the actual one.
The ELBO lends itself to a variational autoencoder (VAE) (Kingma & Welling, 2014), where given a sample , the encoder (inference model) encodes it into a latent variable with density ; and the decoder (generative model) tries to reconstruct from with density . Both densities may be parameterized by using neural networks.
Modulo the probability specification to be completed later, the generative model (3) discussed in the preceding subsection plays the role of the decoder. Then, we propose the corresponding encoder
where and are parameterized functions that conceptually play the inverse role of and , respectively.
To complete the VAE, one must specify the distributions in (4). Recall that for now both and are matrices. For simplicity, the prior is typically modeled as the standard matrix normal .
For the inference model, we let
be a multilayer perceptron (MLP) andbe the identity mapping. Then, the variational posterior is a factored Gaussian with mean , computed from the encoder
where , and and are parameter matrices.
For the generative model, we let be the identity mapping and be an MLP. Then, the likelihood is a factored Gaussian with mean and standard deviation , computed from the decoder
where and are parameter matrices.
One may switch the MLP and the identity mapping inside each of the encoder/decoder, but we find that the performance is less competitive. One possible reason is that the current design (7) places an emphasis on the nonlinear transform of a sample from linear SEM, which better captures nonlinearity.
and the reconstruction accuracy term may be computed with Monte Carlo approximation
where is a constant and and are the outputs of the decoder (7) by taking as input Monte Carlo samples , for .
Note that under the autoencoder framework, is considered latent (rather than the noise in linear SEM). Hence, the column dimension of may be different from . From the neural network point of view, changing the column dimension of affects only the sizes of the parameter matrices and . Sometimes, one may want to use a smaller number than if he/she observes that the data has a smaller intrinsic dimension.
An illustration of the architecture is shown in Figure 1.
One advantage of the proposed method is that it naturally handles discrete variables. We assume that each variable has a finite support of cardinality .
Hence, we let each row of be a one-hot vector, where the “on” location indicates the value of the corresponding variable. We still use standard matrix normal to model the prior and factored Gaussian to model the variational posterior, with (6) being the encoder. On the other hand, we need to slightly modify the likelihood to cope with the discrete nature of the variables.
Specifically, we let be a factored categorical distribution with probability matrix
, where each row is a probability vector for the corresponding categorical variable. To achieve so, we changefrom the identity mapping to a row-wise softmax and modify the decoder (7) to
where is the output of the decoder (10) by taking as input Monte Carlo samples , for .
One has seen from the forgoing discussions how the proposed model is developed from linear SEM: We apply nonlinearality to the sampling procedure (2) of SEM, treat the resulting generative model as a decoder, and pair with it a variational encoder for tractable learning. Compared with a plain autoencoder, the variational version allows a modeling of the latent space, from which samples are generated.
We now proceed, in a reverse thought flow, to establish the connection between the loss function of the linear SEM considered in Zheng et al. (2018) and that of ours. We first strip off the variational component of the autoencoder. This plain version uses (5) as the encoder and (3) as the decoder. For notational clarity, let us write as the output of the decoder, to distinguish it from the encoder input . A typical sample loss to minimize is
where the first term is the reconstruction error and the second term is a regularization of the latent space. One recognizes that the reconstruction error is the same as the negative reconstruction accuracy (9) in the ELBO, up to a constant, if the standard deviation is , the mean is taken as , and only one Monte Carlo sample is drawn from the variational posterior. Moreover, the regularization term is the same as the KL-divergence (8) in the ELBO if the standard deviation is and the mean is taken as .
Neither maximizing the ELBO (4) nor minimizing the least-squares loss (12) guarantees that the corresponding graph of the resulting is acyclic. Zheng et al. (2018) pair the loss function with an equality constraint, whose satisfaction ensures acyclicity.
The idea is based on the fact that the positivity of the element of the -th power of a nonnegative adjacency matrix indicates the existence of a length- path between nodes and . Hence, the positivity of the diagonal of
reveals cycles. The authors leverage the trick that the matrix exponential admits a Taylor series (because it is analytic on the complex plane), which is nothing but a weighted sum of all nonnegative integer powers of the matrix. The coefficient of the zeorth power (the identity matrix) is , and hence the trace of the exponential of must be exactly for a DAG. To satisfy nonnegativity, one may let be the elementwise square of ; that is, .
Whereas the formulation of this acyclicity constraint is mathematically elegant, support of the matrix exponential may not be available in all deep learning platforms. To ease the coding effort, we propose an alternative constraint that is practically convenient.
Let be the (possibly negatively) weighted adjacency matrix of a directed graph. For any , the graph is acyclic if and only if
We use (13) as the equality constraint when maximizing the ELBO. The computations of both and
may meet numerical difficulty when the eigenvalues ofhave a large magnitude. However, the former is less severe than the latter with a judicious choice of .
Let for some . Then for any complex , we have .
Based on the foregoing, the learning problem is
where the unknowns include the matrix and all the parameters of the VAE (currently we have ). Nonlinear equality-constrained problems are well studied and we use the augmented Lagrangian approach to solve it. For completeness, we summarize the algorithm here; the reader is referred to standard textbooks such as Section 4.2 of Bertsekas (1999) for details and convergence analysis.
Define the augmented Lagrangian
where is the Lagrange multiplier and is the penalty parameter. When , the minimizer of must satisfy , in which case is equal to the objective function . Hence, the strategy is to progressively increase , for each of which minimize the unconstrained augmented Lagrangian. The Lagrange multiplier is correspondingly updated so that it converges to the one under the optimality condition.
There exist a few variants for updating and increasing , but a typical effective rule reads:
where and are tuning parameters. We find that often and work well.
The subproblem (14) may be solved by using blackbox stochastic optimization solvers, by noting that the ELBO is defined on a set of samples.
In this section, we present a comprehensive set of experiments to demonstrate the effectiveness of the proposed method DAG-GNN. In Section 4.1, we compare with DAG-NOTEARS, the method proposed by Zheng et al. (2018) based on linear SEM, on synthetic data sets generated by sampling generalized linear models, with an emphasis on nonlinear data and vector-valued data (). In Section 4.2, we showcase the capability of our model with discrete data, often seen in benchmark data sets with ground truths for assessing quality. To further illustrate the usefulness of the proposed method, in Section 4.3 we apply DAG-GNN on a protein data set for the discovery of consensus protein signaling network, as well as a knowledge base data set for learning causal relations.
Our implementation is based on PyTorch(Paszke et al., 2017). We use Adam (Kingma & Ba, 2015) to solve the subproblems (14). To avoid overparameterization, we parameterize the variational posterior
as a factored Gaussian with constant unit variance, and similarly for the likelihood. When extracting the DAG, we use a thresholding value , following the recommendation of Zheng et al. (2018). For benchmark and application data sets, we include a Huber-norm regularization of in the objective function to encourage more rapid convergence.
The synthetic data sets are generated in the following manner. We first generate a random DAG by using the Erdős–Rényi model with expected node degree 3, then assign uniformly random weights for the edges to obtain the weighted adjacency matrix . A sample is generated by sampling the (generalized) linear model with some function elaborated soon. The noise follows standard matrix normal. When the dimension , we use lowercase letters to denote vectors; that is, . We compare DAG-GNN with DAG-NOTEARS and report the structural Hamming distance (SHD) and false discovery rate (FDR), each averaged over five random repetitions. With sample size , we run experiments on four graph sizes . In Sections 4.1.1 and 4.1.2 we consider scalar-valued variables ( and in Section 4.1.3 vector-valued variables ().
This case is the linear SEM model, with being the identity mapping. The SHD and FDR are plotted in Figure 2. One sees that the graphs learned by the proposed method are substantially more accurate than those by DAG-NOTEARS when the graphs are large.
We now consider data generated by the following model
for some nonlinear function . Taking first-order approximation (ignoring higher-order terms of ), one obtains an amendatory approximation of the graph adjacency matrix, . This approximate ground truth maintains the DAG structure, with only a scaling on the edge weights.
We take and plot the SHD and FDR in Figure 3. one observes that DAG-GNN slightly improves over DAG-NOTEARS in terms of SHD. Further, FDR is substantially improved, by approximately a factor of three, which indicates that DAG-GNN tends to be more accurate on selecting correct edges. This observation is consistent with the parameter estimates shown in Figure 4, where the ground truth is set as . The heat map confirms that DAG-GNN results in fewer “false alarms” and recovers a relatively sparser matrix.
We further experiment with a more complex nonlinear generation model, where the nonlinearity occurs after the linear combination of the variables, as opposed to the preceding case where nonlinearity is applied to the variables before linear combination. Specifically, we consider
and plot the results in Figure 5. One sees that with higher nonlinearity, the proposed method results in significantly better SHD and FDR than does DAG-NOTEARS.
The proposed method offers a modeling benefit that the variables can be vector-valued with . Moreover, since resides in the latent space of the autoencoder and is not interpreted as noise as in linear SEM, one may take a smaller column dimension if he/she believes that the variables have a lower intrinsic dimension. To demonstrate this capability, we construct a data set where the different dimensions come from a randomly scaled and perturbed sample from linear SEM. Specifically, given a graph adjacency matrix , we first construct a sample from the linear SEM , and then generate for the -th dimension , where and are random scalars from standard normal and is a standard normal vector. The eventual sample is .
We let and and compare DAG-GNN with DAG-NOTEARS. The SHD and FDR are plotted in Figure 6. The figure clearly shows the significantly better performance of the proposed method. Moreover, the parameter estimates are shown in Figure 7, compared against the ground-truth . One sees that the estimated graph from DAG-GNN successfully captures all the ground truth edges and that the estimated weights are also similar. On the other hand, DAG-NOTEARS barely learns the graph.
A benefit of the proposed method is that it naturally handles discrete variables, a case precluded by linear SEM. We demonstrate the use of DAG-GNN on three discrete benchmark data sets: Child, Alarm, and Pigs (Tsamardinos et al., 2006). For comparison is the state-of-the-art exact DAG solver GOPNILP (Cussens et al., 2016), which is based on a constrained integer programming formulation. We use 1000 samples for learning.
One sees from Table 1 that our results are reasonably close to the ground truth, whereas not surprisingly the results of GOPNILP are nearly optimal. The BIC score gap exhibits by DAG-GNN may be caused by the relatively simple autoencoder architecture, which is less successful in approximating multinomial distributions. Nevertheless, it is encouraging that the proposed method as a unified framework can handle discrete variables with only slight changes in the network architecture.
We consider a bioinformatics data set (Sachs et al., 2005) for the discovery of a protein signaling network based on expression levels of proteins and phospholipids. This is a widely used data set for research on graphical models, with experimental annotations accepted by the biological research community. The data set offers continuous measurements of expression levels of multiple phosphorylated proteins and phospholipid components in human immune system cells, and the modeled network provides the ordering of the connections between pathway components. Based on samples of cell types, Sachs et al. (2005) estimate 20 edges in the graph.
In Table 2, we compare DAG-GNN with DAG-NOTEARS as well as FSG, the fast greedy search method proposed by Ramsey et al. (2017), against the ground truth offered by Sachs et al. (2005). The proposed method achieves the lowest SHD. We further show in Figure 8 our estimated graph. One observes that it is acyclic. Our method successfully learns 8 out of 20 ground-truth edges (as marked by red arrows), and predicts 5 indirectly connected edges (blue dashed arrows) as well as 3 reverse edges (yellow arrows).
|Method||SHD||# Predicted edges|
For another application, we develop a new causal inference task over relations defined in a knowledge base (KB) schema. The task aims at learning a BN, the nodes of which are relations and the edges indicate whether one relation suggests another. For example, the relation person/Nationality may imply person/Language, because the spoken language of a person naturally associates with his/her nationality. This task has a practical value, because most existing KBs are constructed by hand. The success of this task helps suggest meaningful relations for new entities and reduce human efforts. We construct a data set from FB15K-237 (Toutanova et al., 2015) and list in Table 3 a few extracted causal relations. Because of space limitation, we defer the details and more results in the supplementary material. One sees that these results are quite intuitive. We plan a comprehensive study with field experts to systematically evaluate the extraction results.
DAG structure learning is a challenging problem that has long been pursued in the literature of graphical models. The difficulty, in a large part, is owing to the NP-hardness incurred in the combinatorial formulation. Zheng et al. (2018)
propose an equivalent continuous constraint that opens the opportunity of using well developed continuous optimization techniques for solving the problem. In this context, we explore the power of neural networks as functional approximators and develop a deep generative model to capture the complex data distribution, aiming at better recovering the underlying DAG with a different design of the objective function. In particular, we employ the machinery of variational autoencoders and parameterize them with new graph neural network architectures. The proposed method handles not only data generated by parametric models beyond linear, but also variables in general forms, including scalar/vector values and continuous/discrete types. We have performed extensive experiments on synthetic, benchmark, and application data and demonstrated the practical competitiveness of the proposal.
Approximating discrete probability distributions with dependence trees.IEEE transactions on Information Theory, 14(3):462–467, 1968.
International Journal of Data Science and Analytics, 3(2):121–129, 2017.
Let . Clearly, is nonnegative. The binomial expansion reads
It is known that there is a cycle of length if and only if when . Because if there is a cycle then there is a cycle of length at most , we conclude that there is no cycle if and only if . ∎
For given and , the right-hand side of the equality is a function of . This function monotonically increases for positive and has a limit . Hence, for any finite , . ∎
We construct the data set from triples in FB15K-237 (Toutanova et al., 2015), which is a subset of FreeBase with approximately 15k entities and 237 relations. Each sample corresponds to an entity and each variable corresponds to a relation in this knowledge base. Each sample has on average 7.36 relations (i.e. 7.36 non-zero entries in each row).
Table 4 gives additional examples learned by our model with highest confidence scores. For each target relation on the right-hand side, we show the highest ranked relations within the same domain (i.e. the contents in the field before “/” such as “film” and “tvProgram”). On the left-hand side, we omit the relations that are common to the associated entity types, e.g. “profession” and “gender” to persons and “genre” to films, because almost all entities with these types will contain such a relation.