Efficient Permutation Discovery in Causal DAGs

November 6, 2020 · Chandler Squires et al.

The problem of learning a directed acyclic graph (DAG) up to Markov equivalence is equivalent to the problem of finding a permutation of the variables that induces the sparsest graph. Without additional assumptions, this task is known to be NP-hard. Building on the minimum degree algorithm for sparse Cholesky decomposition, but utilizing DAG-specific problem structure, we introduce an efficient algorithm for finding such sparse permutations. We show that on jointly Gaussian distributions, our method with depth w runs in O(p^{w+3}) time. We compare our method with w = 1 to algorithms for finding sparse elimination orderings of undirected graphs, and show that taking advantage of DAG-specific problem structure leads to a significant improvement in the discovered permutation. We also compare our algorithm to provably consistent causal structure learning algorithms, such as the PC algorithm, GES, and GSP, and show that our method achieves comparable performance with a shorter runtime. Thus, our method can be used on its own for causal structure discovery. Finally, we show that there exist dense graphs on which our method achieves almost perfect performance, so that unlike most existing causal structure learning algorithms, the situations in which our algorithm achieves both good performance and good runtime are not limited to sparse graphs.

1 Introduction

The discovery of causal structure, represented by a directed acyclic graph (DAG), from data has received much attention over the past two decades (Spirtes et al., 2000; Chickering, 2002; Shimizu et al., 2006; Hauser and Bühlmann, 2012; Peters et al., 2014; Solus et al., 2020), due to the ability of causal models to answer questions about the effect of hypothetical interventions, such as "how will a new treatment affect a patient's diabetes?" or "how will a new housing law affect rent prices?". Methods for causal structure discovery exploit a number of patterns in the data, including (1) conditional independencies, as in the PC algorithm (Spirtes et al., 2000), (2) asymmetries arising from nonlinearities, as in the LiNGAM algorithm (Shimizu et al., 2006), or (3) data likelihood, as in GES (Chickering, 2002). While some of these causal discovery algorithms are provably consistent, in the sense that given infinite data they converge to the correct causal model (or one that is equivalent to it), consistency comes at a high computational price; e.g., the complexity of PC and GES grows exponentially in the maximum indegree of the graph, so that these algorithms are infeasible to run on large, dense graphs. Since the problem of causal structure discovery is known to be NP-hard (Chickering et al., 2004), this motivates the development of alternative, approximate methods which scale to dense graphs while still providing insights about the true causal graph.

Finding causal orderings via sparsity. A number of recent algorithms (Peters et al., 2014; Raskutti and Uhler, 2018; Solus et al., 2020; Yang et al., 2018; Wang et al., 2017; Squires et al., 2020) have utilized the fact that, given the true causal ordering of the variables (i.e., an ordering that is consistent with the topological ordering of the causal DAG), the causal graph can easily be recovered from conditional independence statements. Motivated by this connection, we here develop a method for inferring the causal ordering by exploiting the relationships between conditional independencies in a novel way. Our new method, which we call Removal-Fill-Degree (RFD), can be seen as an extension of the Minimum-Degree (MD) algorithm for finding sparse elimination orderings of undirected graphs (Rose, 1972; Heggernes et al., 2001). MD and modifications thereof have been well studied for the purpose of finding sparse Cholesky decompositions, which have a wide array of applications in numerical linear algebra (Rothberg and Gupta, 1994). While these algorithms focus on undirected graphs, in this paper we introduce a number of modifications that are natural when extending them to DAGs.

Organization of the paper. In Section 2, we review background on graphical models, as well as related work on DAG structure learning and algorithms for finding sparse Cholesky decompositions. In Section 3, we introduce new concepts and theoretical results which motivate our method. In Section 4, we introduce our RFD algorithm for finding permutations which induce sparse causal graphs, and evaluate its runtime. We also describe a construction for a dense graph on which the RFD algorithm performs well, demonstrating that the computational efficiency of our algorithm does not limit it to perform well only on sparse graphs. In Section 5, we compare the RFD method to other methods on the tasks of causal structure discovery from (1) noiseless data and (2) noisy data.

2 Background and Related Work

UGs, DAGs and IMAPs. Given a graph G = (V, E) over nodes V = {1, ..., p}, we associate to each node i ∈ V a random variable X_i. In an undirected graph (UG) U, we write i ⊥_U j | S whenever two nodes i and j are separated given S ⊆ V. Similarly, in a directed acyclic graph (DAG) G, we write i ⊥_G j | S whenever two nodes i and j are d-separated given S (Koller and Friedman, 2009). We say that a distribution is Markov to a UG (DAG) if whenever i and j are (d-)separated given S in the graph, we have X_i ⊥⊥ X_j | X_S. If i and j are not (d-)separated given S, we call them (d-)connected given S, and write i ⊥̸ j | S. See Koller and Friedman (2009) for a review of separation and connection statements in graphical models. We use pa_G(i) and de_G(i) to refer to the parents and descendants of i in a directed graph G. We use ne_U(i) to refer to the neighbors of i in an undirected graph U; if i and j are neighbors in U, then we write i ~ j. We take for all of these terms the standard definitions from Lauritzen (1996). Given a path γ with consecutive nodes (i, k, j), the node k is a collider on γ if i → k ← j.

A graph H is an independence map (IMAP) of a graph G if every conditional independence statement entailed by H is also entailed by G. We denote the induced subgraph of a graph H on the vertices S ⊆ V by H_S. An IMAP H of G is minimal if no proper subgraph of H is an IMAP of G. Two DAGs G_1 and G_2 are Markov equivalent when they entail the same set of d-separation statements. Given a DAG G, for any permutation π of {1, ..., p}, the graph G_π with arrows π_i → π_j for i < j whenever π_i ⊥̸_G π_j | Pre_π(j) \ {π_i} is a minimal IMAP of G (Verma and Pearl, 1990), where Pre_π(j) := {π_1, ..., π_{j-1}} denotes the predecessors of π_j under π.

Ordering-Based Algorithms. Raskutti and Uhler (2018) established a fundamental connection between causal structure learning and the problem of finding permutations which induce sparse minimal IMAPs. Based on this connection, Solus et al. (2020) introduced the Greedy Sparsest Permutation (GSP) algorithm, which performs a greedy search over the space of permutations, searching for a graph with the minimum number of edges. In order to establish high-dimensional consistency guarantees, the authors relied on a good choice of an initial permutation in order to limit the steps (and thus, the number of hypothesis tests) required to find the optimal permutation.

Finding Sparse Elimination Orderings. As a heuristic for discovering a good initial permutation, Solus et al. (2020) proposed the use of the Minimum-Degree (MD) algorithm. The MD algorithm was designed for the problem of sparse Cholesky decomposition, i.e., finding a permutation of a matrix such that the lower triangular factor in its Cholesky decomposition is sparse. Given an undirected graph and an order of the nodes, vertex elimination is the process of iteratively removing each node in the order and connecting its neighbors. Initial work on sparse Cholesky decompositions (Rose, 1972) provided a connection between finding sparse Cholesky decompositions and finding elimination orderings that introduce few edges. Further work on this problem included improvements to time and space complexity, versions removing multiple nodes at a time (Liu, 1985), and versions using an approximation of the degree (Amestoy et al., 1996). See George and Liu (1989) or Heggernes et al. (2001) for an extensive survey.
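As a concrete illustration of the operations referenced above, the following is a minimal Python sketch of vertex elimination and the greedy minimum-degree heuristic on an undirected graph stored as a dictionary of neighbor sets. The function names and the example graph are ours and the code is illustrative only, not a reference implementation.

# Sketch of vertex elimination and the minimum-degree heuristic (illustrative only).
# A graph is a dict mapping each node to the set of its neighbors.

def eliminate(graph, v):
    """Remove v and connect all of its former neighbors (classic vertex elimination)."""
    nbrs = graph.pop(v)
    for u in nbrs:
        graph[u].discard(v)
    fill = 0
    for u in nbrs:
        for x in nbrs:
            if u != x and x not in graph[u]:
                graph[u].add(x)
                fill += 1
    return fill // 2  # each fill edge is counted twice

def min_degree_ordering(graph):
    """Greedy elimination ordering: repeatedly eliminate a node of minimum degree."""
    graph = {v: set(nbrs) for v, nbrs in graph.items()}  # work on a copy
    order = []
    while graph:
        v = min(graph, key=lambda u: len(graph[u]))
        eliminate(graph, v)
        order.append(v)
    return order

# Example: a 4-cycle 0-1-2-3-0; eliminating any vertex first adds one fill edge.
g = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(min_degree_ordering(g))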

Since the MD algorithm and its current extensions are designed for undirected graphs, these methods fail to exploit patterns that are helpful for discovering topological orderings of DAGs, which we describe in this paper. Moreover, these methods assume that the original matrix is sparse, i.e., has entries exactly equal to zero. However, in many applications, including causal structure learning, the matrix may be the result of some noisy process which only induces entries that are approximately equal to zero. We provide details on how to efficiently incorporate conditional independence testing into methods for finding sparse elimination orderings.

Figure 1: A DAG, its moral graph, a moral subgraph, and the corresponding elimination graph.

3 Theoretical Results

The interplay between DAGs and undirected graphs will be central to our algorithm. Recall that given a DAG G, its moral graph M(G) is the unique undirected minimal IMAP of G (Koller and Friedman, 2009). We extend this concept to arbitrary subsets S of the vertices of a DAG G:

Figure 2: A DAG and several of its moral subgraphs. Blue lines indicate fill edges, and dotted red lines indicate removed edges.
Definition 1.

The moral subgraph M_S(G) of a DAG G over vertices S ⊆ V is the undirected graph with vertex set S and edge set { {i, j} : i, j ∈ S, i ⊥̸_G j | S \ {i, j} }.

As with the moral graph, the moral subgraph M_S(G) is the unique undirected minimal IMAP of the marginal distribution of X_S.
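For small graphs, the moral subgraph can be computed directly from this definition by brute-force d-separation checks. The following Python sketch uses networkx for basic graph operations and the standard ancestral-moral-graph criterion for d-separation; the helper names are ours and the code is illustrative rather than optimized.

# Sketch: compute the moral subgraph of a DAG over a vertex subset S by brute-force
# d-separation checks (ancestral moral graph criterion). Illustrative, not optimized.
import itertools
import networkx as nx

def d_separated(dag, x, y, cond):
    """Check whether x and y are d-separated given cond in the DAG `dag`."""
    relevant = {x, y} | set(cond)
    ancestral = set(relevant)
    for v in relevant:
        ancestral |= nx.ancestors(dag, v)
    sub = dag.subgraph(ancestral)
    moral = nx.Graph(sub.to_undirected())
    for v in sub.nodes:                      # marry the parents of every node
        for a, b in itertools.combinations(sub.predecessors(v), 2):
            moral.add_edge(a, b)
    moral.remove_nodes_from(cond)            # condition by deleting the separating set
    return not (x in moral and y in moral and nx.has_path(moral, x, y))

def moral_subgraph(dag, S):
    """Undirected graph on S with an edge i-j iff i and j are d-connected given S minus {i, j}."""
    m = nx.Graph()
    m.add_nodes_from(S)
    for i, j in itertools.combinations(S, 2):
        if not d_separated(dag, i, j, set(S) - {i, j}):
            m.add_edge(i, j)
    return m

# Example: the collider 1 -> 3 <- 2 induces the "married" edge 1-2 in the moral subgraph over {1, 2, 3}.
dag = nx.DiGraph([(1, 3), (2, 3)])
print(sorted(moral_subgraph(dag, [1, 2, 3]).edges()))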

The elimination graph. The moral subgraph is closely related to the elimination graph that has previously been studied in the sparse elimination ordering literature. The elimination graph for a given elimination ordering is defined by successively removing each node in the ordering and connecting its neighbors. Equivalently, the elimination graph over a vertex set S can be obtained from the moral graph M(G) by eliminating the nodes of V \ S in the given order.

Figure 1 illustrates that the moral subgraph and elimination graph do not necessarily coincide; a notable exception is the case of chordal graphs when following a perfect elimination ordering (see e.g. Vandenberghe and Andersen (2015) for an excellent overview).

Fill and removal edges. Given a moral subgraph M_S(G), we consider the effect of marginalizing out a node j ∈ S. Since we will frequently add and remove single elements from sets, we write S + j := S ∪ {j} and S − j := S \ {j} for short. Marginalizing X_j from the distribution over X_S results in the new moral subgraph M_{S−j}(G). Unlike in the elimination algorithm, removing a vertex does not only add edges, but may also result in the removal of some edges. The edges removed and added after removing node j are captured in the following definitions:

Definition 2.

The removed edge set of vertex j, over vertices S, is R(j, S) := { {i, k} : i ~ k in M_S(G) but i ≁ k in M_{S−j}(G) }.

The removal score of j on S is r(j, S) := |R(j, S)|.

Definition 3.

The fill edge set of vertex j, over vertices S, is F(j, S) := { {i, k} : i ≁ k in M_S(G) but i ~ k in M_{S−j}(G) }.

The fill score of j over S is f(j, S) := |F(j, S)|.

The following example demonstrates these definitions.

Example 1.

Figure 2 shows a DAG, its moral graph, and several of its moral subgraphs. One of the subgraphs shows that removing a vertex may have no effect on the removal or fill score. Another shows that removing a vertex can lead to the removal of an edge between its parents. A third shows that the lack of an edge between the parents of a collider in G is not sufficient for removal to occur, in this case because they share another child. Finally, the last subgraph is an example of how removing a vertex with descendants in S may cause a significant amount of fill.
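Anticipating the Gaussian discussion in the next paragraph, where the moral subgraph over S is the support of the marginal precision matrix, the following numpy sketch computes the removal and fill scores of a node by marginalizing it out via a Schur complement. The helper names, tolerance, and example DAG are ours; this is a minimal sketch, not a reference implementation.

# Sketch: removal and fill scores in the jointly Gaussian case, where the moral subgraph
# over S is the support of the marginal precision matrix over S (see the discussion below).
import numpy as np

def support(theta, tol=1e-8):
    """Edge set of the undirected graph encoded by precision matrix theta."""
    p = theta.shape[0]
    return {frozenset({i, j}) for i in range(p) for j in range(i + 1, p)
            if abs(theta[i, j]) > tol}

def marginalize(theta, j):
    """Precision matrix after marginalizing out coordinate j (Schur complement)."""
    keep = [i for i in range(theta.shape[0]) if i != j]
    A = theta[np.ix_(keep, keep)]
    b = theta[keep, j]
    return A - np.outer(b, b) / theta[j, j], keep

def removal_and_fill_scores(theta, j):
    """Number of edges removed and added when node j is marginalized out."""
    before = support(theta)
    theta_new, keep = marginalize(theta, j)
    # relabel the smaller matrix back to the original node names
    after = {frozenset({keep[a], keep[b]}) for e in support(theta_new) for a, b in [tuple(e)]}
    before = {e for e in before if j not in e}
    return len(before - after), len(after - before)

# Example: the DAG 0 -> 2 <- 1 with unit edge weights and unit noise variances.
B = np.zeros((3, 3)); B[0, 2] = B[1, 2] = 1.0          # weighted adjacency, B[i, j] for i -> j
theta = (np.eye(3) - B) @ (np.eye(3) - B).T              # precision matrix of the linear SEM
print(removal_and_fill_scores(theta, 2))                 # marginalizing the sink removes edge 0-1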

Fill edges are closely related to the edges added to an undirected graph during the course of vertex elimination. In fact, Proposition 1 establishes that the fill edge set for a node is precisely the same as in vertex elimination. The proof of this proposition is trivial in the case of multivariate Gaussians. The marginal precision matrix (i.e., the precision matrix of the marginal distribution) over S − j of a multivariate Gaussian with precision matrix Θ_S may be computed via the recursive formula

Θ_{S−j} = (Θ_S)_{S−j, S−j} − (Θ_S)_{S−j, j} (Θ_S)_{j, S−j} / (Θ_S)_{jj}.

In a multivariate Gaussian, the conditional independence statement X_i ⊥⊥ X_k | X_{S \ {i, k}} is equivalent to (Θ_S)_{ik} = 0. Thus, an edge {i, k} is in the moral subgraph M_S(G) if and only if (Θ_S)_{ik} ≠ 0.

Thus, we may conclude that if (Θ_S)_{ik} = 0 and (Θ_{S−j})_{ik} ≠ 0, then (Θ_S)_{ij} ≠ 0 and (Θ_S)_{jk} ≠ 0; i.e., if {i, k} is a fill edge of j, then i and k are both neighbors of j in M_S(G). Proposition 1 establishes the corresponding result in the general, non-parametric case.

Proposition 1.

The fill edge set of node j over nodes S is equal to the set of pairs of neighbors of j in M_S(G) which are not themselves adjacent, i.e., F(j, S) = { {i, k} : i, k ∈ ne_{M_S(G)}(j), i ≁ k in M_S(G) }.

Proof.

Let i, k ∈ S − j be such that {i, k} ∈ F(j, S), i.e., {i, k} is a fill edge. Then there exists some path γ from i to k that is d-connecting given (S − j) \ {i, k} but not d-connecting given S \ {i, k}. It follows that (1) j is the only non-collider on γ which is also in S, and (2) each collider on γ has a descendant in (S − j) \ {i, k}. Thus, the subpath of γ from i to j and the subpath from j to k are d-connecting given S \ {i, j} and S \ {j, k}, respectively. In other words, i ~ j and j ~ k in M_S(G). By assumption i ⊥_G k | S \ {i, k}, so we conclude that i and k are non-adjacent neighbors of j in M_S(G).

Conversely, let i, k ∈ ne_{M_S(G)}(j) with i ≁ k in M_S(G); then there exist two d-connecting paths: γ_1 from i to j given S \ {i, j}, and γ_2 from j to k given S \ {j, k}. Since there is no d-connecting path from i to k given S \ {i, k}, j must be a non-collider on the concatenated path γ_1γ_2 given S \ {i, k}. But if we restrict the conditioning set to (S − j) \ {i, k}, then γ_1γ_2 becomes d-connecting, and as a result i ⊥̸_G k | (S − j) \ {i, k}. Since we assumed that i and k are not d-connected given S \ {i, k}, it follows that if i, k ∈ ne_{M_S(G)}(j) and i ≁ k in M_S(G), then {i, k} ∈ F(j, S). ∎

Maximal nodes. To discover a permutation with a sparse minimal IMAP, we build the permutation from the last vertex to the first. At each step, we would like to pick a vertex which has no descendants remaining. Formally, if S is the set of vertices left unpicked, we seek a node in max_G(S), where

max_G(S) := { j ∈ S : de_G(j) ∩ (S \ {j}) = ∅ }

is the set of maximal nodes of G with respect to S. The next two propositions establish that the removal score and fill score are helpful indicators of whether or not a vertex is maximal.

Proposition 2.

If r(j, S) > 0, then j ∈ max_G(S).

Proof.

We prove the contrapositive: if j has a descendant in S − j, then r(j, S) = 0. Let d ∈ S − j be a descendant of j, and let i, k ∈ S − j be such that i ⊥̸_G k | S \ {i, k}, i.e., there is a d-connecting path γ from i to k given S \ {i, k} in G. Assume first that j is a descendant of a collider on γ; then d is as well, because d is a descendant of j. As a result, γ remains d-connecting if we restrict the conditioning set to (S − j) \ {i, k}, because d ∈ (S − j) \ {i, k}. Thus, we conclude that i ⊥̸_G k | (S − j) \ {i, k}. Note that if j is not a descendant of a collider on γ, then it cannot lie on the path at all; otherwise, γ would not be d-connecting given S \ {i, k}. Thus, in this case i ⊥̸_G k | (S − j) \ {i, k} still holds. These results show that every edge of M_S(G) between vertices of S − j is also an edge of M_{S−j}(G), so R(j, S) = ∅ and it follows that r(j, S) = 0. ∎

Proposition 2 gives us a way to certify that a node has no descendants remaining, and thus that adding it to the end of the node ordering will be topologically consistent. However, the converse is not true: a node may have no descendants remaining, but still have removal score zero. This happens, for example, if the parents of j form a clique in M_S(G), as in Figure 1(c). Thus, a certificate is not always available. If every node has zero removal score, then we cannot use Proposition 2 to find maximal nodes. Instead, we resort to proving that some nodes are not maximal, which helps prune the search space and increases the likelihood that we pick a maximal node.

Proposition 3.

If f(j, S) > 0, then j ∉ max_G(S).

Proof.

We prove the contrapositive: if j ∈ max_G(S), then f(j, S) = 0. Let i, k ∈ ne_{M_S(G)}(j); then there exist two d-connecting paths: γ_1 from i to j given S \ {i, j}, and γ_2 from j to k given S \ {j, k}. We wish to show that the concatenation γ_1γ_2 is a d-connecting path from i to k given S \ {i, k}, which by Proposition 1 will imply that {i, k} ∉ F(j, S). To this end, we will prove that j must be a collider on γ_1γ_2. Suppose otherwise, and without loss of generality assume that the edge of γ_1 incident to j points out of j; then we claim that there exists a collider on γ_1. Otherwise, i would be a descendant of j in G. Thus, let c be the collider on γ_1 closest to j, and let d be a descendant of c in S \ {i, j}. We know that d exists, because if it did not, then γ_1 would not be a d-connecting path given S \ {i, j}. Note that d is also a descendant of j, but since d ∈ S this contradicts the hypothesis that j ∈ max_G(S). Thus we conclude that j must be a collider on γ_1γ_2, which implies a d-connecting path from i to k given S \ {i, k}, so that i ~ k in M_S(G) and, by Proposition 1, {i, k} ∉ F(j, S). ∎

However, again, the converse is not true. A non-maximal node may have zero fill score, for instance if the descendants of j form a clique in M_S(G).

Input: Distribution P, depth w
Output: Permutation π
Estimate the moral graph M via any undirected structure-learning algorithm
Let S ← {1, ..., p}
Let π ← ()
while S ≠ ∅ do
     Pick T ← RFDStep(M, S, w)
     Let π ← (T, π)
     Update S ← S \ T and marginalize the nodes of T out of M
end while
return π
Algorithm 1 RFD
1:Input: Moral subgraph M over vertices S, depth w
2:Output: Ordered set T of up to w nodes to place at the end of the remaining order (listed last node first)
3:paths ← {()}, the set containing only the empty path
4:depth ← 0
5:while depth < w and no path ends in a node with positive removal score do
6:     new_paths ← ∅
7:     for each path q ∈ paths do
8:         Let S_q ← S \ q, and compute r(j, S_q) and f(j, S_q) for each j ∈ S_q
9:         if max_{j ∈ S_q} r(j, S_q) > 0 then
10:              for each j ∈ S_q attaining the maximum removal score do
11:                   add q ∘ (j) to new_paths
12:              end for
13:         else
14:              Let f_min ← min_{j ∈ S_q} f(j, S_q)
15:              for each j ∈ S_q with f(j, S_q) = f_min do
16:                   add q ∘ (j) to new_paths
17:              end for
18:         end if
19:     end for
20:     paths ← new_paths
21:     depth ← depth + 1
22:end while
23:return the path whose final node has the largest removal score, breaking ties in favor of smaller degree
Algorithm 2 RFDStep

4 Method

The above theoretical results suggest using a combination of the removal and fill scores to discover the ordering of the nodes in the graph. In this section, we develop a method based on the removal score, fill score, and the degree of a node; we call this method the Removal-Fill-Degree (RFD) algorithm.

Algorithm 1 begins by estimating an undirected graph over all of the variables. In the multivariate Gaussian case, this can easily be done by thresholding the partial correlation matrix, but this estimator can only be computed when there are more samples than variables (n > p), and it performs poorly when n is close to p. Fortunately, there is a large literature on undirected graph estimation in the sparse high-dimensional setting. For example, the CLIME estimator (Cai et al., 2011), which estimates the precision matrix as a minimum ℓ1-norm solution under the constraint that its inverse is entry-wise close to the sample covariance matrix, converges in spectral norm to the true precision matrix at a rate governed by the number of nonzero entries in the true precision matrix.

Figure 3: The example call to RFDStep described in Example 2; deg(j) denotes the degree of node j in the current moral subgraph.

Description of RFD. The main principle behind our algorithm is to search for maximal nodes. At each step, we perform a breadth-first search of depth w to greedily pick the "best" set of up to w nodes to add to the end of the order. For each path in this search, we track the removal score and the degree of the most recently added node. As suggested by Proposition 2, a nonzero removal score for a node certifies that it is maximal. Thus, if any node has nonzero removal score, we pick amongst the nodes with maximum removal score and append them to our search paths. If all nodes have zero removal score, then Proposition 3 suggests a way to prune some search directions, since any node with positive fill score is not a candidate sink. In the noiseless case, we would only need to limit our search to nodes with zero fill score. However, on real data, we may find that all nodes have positive fill score due to noise, so we pick amongst the nodes with minimum fill score and append them to our search paths. If any path ends in a node with nonzero removal score, then we exit the search and return one of the paths with maximum removal score, with ties broken in favor of nodes with smaller degree, since this prefers sparser graphs.

The following example demonstrates RFD with depth w = 2.

Example 2.

Let the true DAG be as shown in Figure 3. On the first level of the breadth-first search, no node has positive removal score, and the nodes with minimum fill score (in this case, 0 since there is no noise) are 1, 4, and 8, so we take the else branch of RFDStep and start one search path from each of these nodes. On the second level of the breadth-first search, we find that removing 7 after 8, or removing 1 after 4, both result in a removal score of 1, so we extend both paths accordingly. Finally, tie-breaking between these two paths is done by picking the path with smaller degree, so we add 7, 8 to the end of the permutation.
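To make the procedure concrete, the following is a minimal Python sketch of RFD with depth w = 1 in the noiseless multivariate Gaussian case, where the moral subgraph over the remaining nodes is read off the marginal precision matrix. The function names, tie-breaking details, and example DAG are ours; this is an illustrative reconstruction of the procedure described above, not the reference implementation.

# Sketch of RFD with depth w = 1 in the noiseless Gaussian case: the moral subgraph is read
# off the (marginal) precision matrix, and nodes are appended to the end of the order using
# removal score, then fill score, then degree. Our reconstruction, not the reference code.
import numpy as np

def edges(theta, tol=1e-8):
    p = theta.shape[0]
    return {(i, j) for i in range(p) for j in range(i + 1, p) if abs(theta[i, j]) > tol}

def rfd_depth_one(theta, tol=1e-8):
    nodes = list(range(theta.shape[0]))        # current labels of the rows/columns of theta
    order = []
    while nodes:
        E = edges(theta, tol)
        deg = [sum((a == i or b == i) for a, b in E) for i in range(len(nodes))]
        removal, fill = [], []
        for i in range(len(nodes)):
            keep = [k for k in range(len(nodes)) if k != i]
            sub = theta[np.ix_(keep, keep)] - np.outer(theta[keep, i], theta[keep, i]) / theta[i, i]
            E_new = {(keep[a], keep[b]) for a, b in edges(sub, tol)}
            E_old = {e for e in E if i not in e}
            removal.append(len(E_old - E_new))
            fill.append(len(E_new - E_old))
        # prefer a certified maximal node (positive removal score); otherwise minimal fill;
        # break ties in favor of smaller degree, as in the description of RFD above
        best = max(range(len(nodes)), key=lambda i: (removal[i] > 0, removal[i], -fill[i], -deg[i]))
        order.append(nodes[best])
        keep = [k for k in range(len(nodes)) if k != best]
        theta = theta[np.ix_(keep, keep)] - np.outer(theta[keep, best], theta[keep, best]) / theta[best, best]
        nodes = [nodes[k] for k in keep]
    return order[::-1]                          # nodes were picked from last to first

# Example: the collider 0 -> 2 <- 1 with unit weights; node 2 (the sink) is placed last.
B = np.zeros((3, 3)); B[0, 2] = B[1, 2] = 1.0
theta = (np.eye(3) - B) @ (np.eye(3) - B).T
print(rfd_depth_one(theta))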

4.1 Runtime

We now characterize the runtime of RFD when run with depth w.

Proposition 4.

Suppose that updating the undirected graph after marginalizing out a node takes U(p, n) time, for p nodes and n samples. Then RFDStep with depth w takes O(p^w (p^2 + U(p, n))) time, and RFD with depth w takes O(p^{w+1} (p^2 + U(p, n))) time.

Proof.

At each step of the breadth-first search in RFDStep, we need to calculate the removal and fill scores of up to p candidate nodes for each of at most O(p^{w-1}) search paths, and to update the undirected graph once per path. Given the undirected graph, both scores take at most O(p^2) time to compute. Thus, each RFDStep takes O(p^w (p^2 + U(p, n))) time. RFD calls RFDStep at most p times. ∎

In the multivariate Gaussian case, when using the partial correlation thresholding estimator, we have U(p, n) = O(p^2), as described in Appendix A. This gives us the following corollary:

Corollary 1.

In the multivariate Gaussian setting, RFDStep with depth w takes O(p^{w+2}) time, and RFD with depth w takes O(p^{w+3}) time. In particular, RFD with depth 1 takes O(p^4) time.

In comparison, most provably consistent causal structure learning algorithms require bounds on certain graph parameters, such as the maximum indegree, in order to achieve a polynomial runtime. For instance, the prominent PC algorithm must perform a number of conditional independence tests that grows exponentially in the maximum indegree of the true DAG. Similarly, recent versions of GES (Chickering and Meek, 2015) bound the number of calls to a scoring function only under additional assumptions on the graph. In the case of GSP (Solus et al., 2020), the complexity at each step depends on the size of the Markov equivalence class, rather than the usual scaling based on maximum indegree.

Performance on Dense Graphs. Since RFD is able to run in polynomial time without an explicit sparsity assumption on the underlying graph, and RFD is an approximate algorithm, a natural question arises: “does RFD always perform poorly when the underlying graph is not sparse?” We show that the answer to this question is “no”: there exist dense graphs on which our method performs well; i.e., the combination of speed and performance achieved by our method does not rely on sparsity of the underlying graph.

Let denote a graph on nodes, with edges generated as follows. For each , pick , with , such that for any . Let for , and let there be a complete graph on , with topological order given by numerical order. Figure 4 shows an example of this construction.

Figure 4: An example of the dense graph construction described above.

The number of missing edges in is less than , whereas the number of possible edges is , so that is dense. Furthermore, the RFD algorithm perfectly recovers , since the removal score of each becomes exactly 1 only after all nodes after in the ordering are removed, and the ordering of the first nodes is arbitrary.

Figure 5: Performance of permutation-finding algorithms in the noiseless setting, measured by the density of their induced minimal IMAPs relative to the true graph. Each point represents the average over 100 randomly generated DAGs.

5 Empirical Results

In this section, we generate DAGs according to an Erdös-Rényi skeleton with a random topological order, varying the number of nodes p and the edge density, which may be a function of p. We pick edge weights independently from a distribution whose support is bounded away from zero.
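A minimal sketch of this sampling scheme is given below. The exact edge-weight distribution is not recoverable from the text above, so the Uniform(±[0.25, 1]) choice in the sketch is an assumption made purely for illustration.

# Sketch: sample a random DAG with an Erdos-Renyi skeleton, a uniformly random topological
# order, and edge weights bounded away from zero (the weight distribution below is an assumption).
import numpy as np

def random_dag(p, density, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(p)                                # random topological order
    B = np.zeros((p, p))                                      # weighted adjacency, B[i, j] for i -> j
    for a in range(p):
        for b in range(a + 1, p):
            if rng.random() < density:
                w = rng.uniform(0.25, 1.0) * rng.choice([-1.0, 1.0])
                B[order[a], order[b]] = w                     # edge from earlier to later node
    return B, order

B, order = random_dag(p=10, density=0.3)
print(int((B != 0).sum()), "edges; topological order:", order)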

5.1 Noiseless Setting

We first investigate the quality of the permutation found by our algorithm in the noiseless setting, i.e., when we are given the true precision matrix. Given the output permutation π of RFD, we may return the graph estimate G_π, i.e., the minimal IMAP discussed in Section 2. The performance of RFD can then be measured by the ratio of the number of edges in G_π to the number of edges in the true graph. This ratio is always greater than or equal to 1, with equality if and only if G_π is in the Markov equivalence class of the true DAG.

We compare to a number of baselines, including random permutations (RP) and the following greedy selection strategies, where S_t is the set of unpicked nodes at step t of the algorithm and x_t is the node picked at step t (one-line code sketches of these rules follow the list):

  • Min-degree (MD): x_t = argmin_{j ∈ S_t} deg_{M_{S_t}(G)}(j)

  • Min-fill (MF): x_t = argmin_{j ∈ S_t} f(j, S_t)

  • Max-remove (MR): x_t = argmax_{j ∈ S_t} r(j, S_t)
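Written as code, and assuming generic score callables degree, fill_score, and removal_score over the current set of unpicked nodes (hypothetical helper names, not from any particular library), the three one-step selection rules are:

# One-step selection rules for the greedy baselines; `degree`, `fill_score`, and `removal_score`
# are assumed callables giving the corresponding score of a node over the current set (our names).
def min_degree_step(S, degree):
    return min(S, key=degree)          # MD: unpicked node of smallest degree

def min_fill_step(S, fill_score):
    return min(S, key=fill_score)      # MF: node whose removal adds the fewest edges

def max_remove_step(S, removal_score):
    return max(S, key=removal_score)   # MR: node whose removal deletes the most edges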

Figure 5 demonstrates that the RFD algorithm is often able to find a permutation which induces a minimal IMAP that is nearly as sparse as the true DAG. The RFD algorithm clearly outperforms all of the baselines on this task. It is notable that on dense graphs with a large number of nodes, the MR algorithm matches the performance of the RFD algorithm, indicating that the removal score is an especially valuable signal for identifying maximal nodes in such settings. In contrast, on dense graphs, the performance of the MD and MF algorithms degrades as the number of nodes increases.

Figure 6: Computation time of various causal structure learning algorithms, for a fixed number of nodes and samples. Each point represents the average over 35 randomly generated DAGs.
Figure 7: Performance and computation time of GES, RFD, and MD as a function of the number of nodes, with the number of samples chosen as a function of the number of nodes.

5.2 Noisy Setting

We now compare the performance of RFD to that of other algorithms on the task of causal structure learning from data. We first investigate the computational scaling of our method along with several other prominent methods for causal structure learning across a range of densities. In Appendix B, we show that the ROC curves of all of the algorithms are similar across this range of densities. Since all algorithms perform similarly, we focus on how their computation time grows as a function of how dense a graph they estimate, measured via the true positive rate. As evidenced by Figure 6, the computation times of PC and GSP, and to a lesser extent GES, all scale poorly as their sparsity parameters are tuned to yield denser graphs. Meanwhile, the computation times of RFD and MD are almost constant across their ranges. These results suggest that, especially in dense and/or high-dimensional regimes, the RFD algorithm can be used as a computationally efficient alternative to existing causal structure learning algorithms.

In Figure 7, we further compare RFD and MD to GES as a function of the number of nodes in the graph, in a dense setting. For all algorithms, we aimed to pick their parameters such that the true positive rate was approximately 0.7 on 20-node graphs, since this was uniformly a point on the ROC curve which offered a good compromise between true positive rate and false positive rate. In the case of RFD and MD, we used a fixed significance level for all hypothesis tests. For GES, we picked two values of the regularization parameter which "sandwich" the TPR and FPR of our algorithms (denoted GES1 and GES2). We find that while the performance of GES is comparable to that of RFD, its computation time scales much more dramatically. Moreover, RFD beats the slightly faster MD algorithm at almost every point, with a higher true positive rate and lower false positive rate.

6 Discussion

In this paper, we introduced a novel, efficient method for approximately recovering the node ordering of a causal DAG. Our method, the RFD algorithm, is motivated by the minimum-degree algorithm for sparse Cholesky decomposition, but leverages additional DAG-specific structure for improved performance. In particular, our method is based on the phenomenon of edge removal after marginalization of a sink node. Our method systematically combines signals about the causal ordering from a combination of edge removal, edge addition, and node degrees, and greatly outperforms methods which use only a single one of these signals on the task of permutation discovery. Moreover, running our method with fixed depth offers a polynomial-time alternative to provably consistent causal structure learning algorithms, which only run in polynomial time under assumptions on the underlying graph. We demonstrate that using our algorithm for causal structure learning performs comparably to existing causal structure learning algorithms, but with a significant speedup in run time.

Developing scalable causal structure learning algorithms is critical, since many of the domains in which causal structure learning is valuable involve thousands to millions of variables, for example in genomics (Bucur et al., 2019; Belyaeva et al., 2020) and neuroscience (Dubois et al., 2017). Since the RFD algorithm is not specific to using the partial correlation thresholding estimator for the undirected graph, our algorithm can be used in non-Gaussian and high-dimensional settings. In the non-Gaussian case, our algorithm can be combined with nonparametric conditional independence tests such as HSIC (Gretton et al., 2008) for estimating the undirected graph. In the high-dimensional case, there are a variety of estimators with guarantees for sparse graphs. One current advantage of the partial correlation thresholding estimator, in comparison to these estimators, is the ability to update the moral subgraph via a rank-one matrix update in O(p^2) time. To put other estimators into practice with our algorithm, especially on large graphs, it is necessary to develop efficient ways of updating estimates after marginalization, with provable guarantees that such updates do not introduce additional error.

Acknowledgments

Chandler Squires was partially supported by an NSF Graduate Fellowship, MIT J-Clinic for Machine Learning and Health, and IBM. Caroline Uhler was partially supported by NSF (DMS-1651995), ONR (N00014-17-1-2147 and N00014-18-1-2765), and a Simons Investigator Award.

References

  • P. R. Amestoy, T. A. Davis, and I. S. Duff (1996) An approximate minimum degree ordering algorithm. SIAM Journal on Matrix Analysis and Applications 17 (4), pp. 886–905. Cited by: §2.
  • A. Belyaeva, L. Cammarata, A. Radhakrishnan, C. Squires, K. D. Yang, G. Shivashankar, and C. Uhler (2020) Causal network models of SARS-CoV-2 expression and aging to identify candidates for drug repurposing. arXiv preprint arXiv:2006.03735. Cited by: §6.
  • I. G. Bucur, T. Claassen, and T. Heskes (2019) Large-scale local causal inference of gene regulatory relationships. International Journal of Approximate Reasoning 115, pp. 50–68. Cited by: §6.
  • T. Cai, W. Liu, and X. Luo (2011) A constrained minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association 106 (494), pp. 594–607. Cited by: §4.
  • D. M. Chickering, D. Heckerman, and C. Meek (2004) Large-sample learning of Bayesian networks is NP-hard. Journal of Machine Learning Research 5 (Oct), pp. 1287–1330. Cited by: §1.
  • D. M. Chickering and C. Meek (2015) Selective greedy equivalence search: Finding optimal Bayesian networks using a polynomial number of score evaluations. arXiv preprint arXiv:1506.02113. Cited by: §4.1.
  • D. M. Chickering (2002) Optimal structure identification with greedy search. Journal of Machine Learning Research 3 (Nov), pp. 507–554. Cited by: §1.
  • J. Dubois, H. Oya, J. M. Tyszka, M. Howard III, F. Eberhardt, and R. Adolphs (2017) Causal mapping of emotion networks in the human brain: framework and initial findings. Neuropsychologia. Cited by: §6.
  • A. George and J. W. Liu (1989) The evolution of the minimum degree ordering algorithm. Siam Review 31 (1), pp. 1–19. Cited by: §2.
  • A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola (2008) A kernel statistical test of independence. In Advances in Neural Information Processing Systems, pp. 585–592. Cited by: §6.
  • A. Hauser and P. Bühlmann (2012) Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research 13 (1), pp. 2409–2464. Cited by: §1.
  • P. Heggernes, S. Eisestat, G. Kumfert, and A. Pothen (2001) The computational complexity of the minimum degree algorithm. Technical report Institute for Computer Applications in Science and Engineering, Hampton VA. Cited by: §1, §2.
  • D. Koller and N. Friedman (2009) Probabilistic graphical models: principles and techniques. MIT press. Cited by: §2, §3.
  • S. L. Lauritzen (1996) Graphical models. Vol. 17, Clarendon Press. Cited by: §2.
  • J. W. Liu (1985) Modification of the minimum-degree algorithm by multiple elimination. ACM Transactions on Mathematical Software (TOMS) 11 (2), pp. 141–153. Cited by: §2.
  • J. Peters, J. M. Mooij, D. Janzing, and B. Schölkopf (2014) Causal discovery with continuous additive noise models. Journal of Machine Learning Research 15 (1), pp. 2009–2053. Cited by: §1, §1.
  • G. Raskutti and C. Uhler (2018) Learning directed acyclic graph models based on sparsest permutations. Stat 7 (1), pp. e183. Cited by: §1, §2.
  • D. J. Rose (1972) A graph-theoretic study of the numerical solution of sparse positive definite systems of linear equations. In Graph Theory and Computing, pp. 183–217. Cited by: §1, §2.
  • E. Rothberg and A. Gupta (1994) An efficient block-oriented approach to parallel sparse Cholesky factorization. SIAM Journal on Scientific Computing 15 (6), pp. 1413–1439. Cited by: §1.
  • S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen (2006) A linear non-gaussian acyclic model for causal discovery. Journal of Machine Learning Research 7 (Oct), pp. 2003–2030. Cited by: §1.
  • L. Solus, Y. Wang, and C. Uhler (2020) Consistency guarantees for greedy permutation-based causal inference algorithms. Biometrika. Cited by: §1, §1, §2, §2, §4.1.
  • P. Spirtes, C. N. Glymour, R. Scheines, and D. Heckerman (2000) Causation, prediction, and search. MIT press. Cited by: §1.
  • C. Squires, Y. Wang, and C. Uhler (2020) Permutation-based causal structure learning with unknown intervention targets. In Proceedings of the Thirty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI). Cited by: §1.
  • L. Vandenberghe and M. S. Andersen (2015) Chordal graphs and semidefinite optimization. Foundations and Trends in Optimization 1 (4), pp. 241–433. Cited by: §3.
  • T. Verma and J. Pearl (1990) Causal networks: semantics and expressiveness. In Machine Intelligence and Pattern Recognition, Vol. 9, pp. 69–76. Cited by: §2.
  • Y. Wang, L. Solus, K. D. Yang, and C. Uhler (2017) Permutation-based causal inference algorithms with interventions. In Neural Information Processing Systems, Vol. 31. Cited by: §1.
  • K. D. Yang, A. Katcoff, and C. Uhler (2018) Characterizing and learning equivalence classes of causal DAGs under interventions. Proceedings of Machine Learning Research 80, pp. 5537–5546. Cited by: §1.

Supplementary Material


Appendix A Efficiently updating the undirected graph for multivariate Gaussians

We first describe the partial correlation thresholding estimator of the moral subgraph, showing that it takes O(p^2) time given the sample precision matrix. Then, we show that after marginalizing out a node, the new sample precision matrix takes O(p^2) time to compute. Thus, by retaining the sample precision matrix over the current set of nodes at each iteration of the RFD algorithm, we may compute the new undirected graph in U(p, n) = O(p^2) time.

A.1 Partial correlation thresholding estimator

The partial correlation between X_i and X_j given X_S, denoted ρ_{ij·S}, is equal to the correlation of the residuals of X_i and X_j after performing linear regression on X_S. Supposing X has a multivariate Gaussian distribution, recall that X_i ⊥⊥ X_j | X_S if and only if ρ_{ij·S} = 0. A classical result states that if ρ_{ij·S} = 0, and r is the sample partial correlation computed from n samples, then the quantity

z = √(n − |S| − 3) · (1/2) log((1 + r) / (1 − r))    (S.1)

is distributed as a standard normal; i.e., to test the null hypothesis ρ_{ij·S} = 0 at significance level α, we can reject if |z| > Φ^{-1}(1 − α/2).

Let Θ_S denote the marginal sample precision matrix over S, i.e., Θ_S = (Σ_{S,S})^{-1}, where Σ is the sample covariance matrix. Then the matrix of sample partial correlations can be efficiently computed from Θ_S via the following formula:

r_{ij·S\{i,j}} = −(Θ_S)_{ij} / √((Θ_S)_{ii} (Θ_S)_{jj}).

Applying (S.1) element-wise to this matrix and thresholding gives an estimate of the moral subgraph M_S(G). We perform a constant number of operations on each element of Θ_S, so computing the moral subgraph given Θ_S takes O(p^2) time.
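A minimal numpy sketch of this estimator is given below; the function name and its arguments are ours, and scipy is used only for the normal quantile.

# Sketch of the partial-correlation thresholding estimator: partial correlations are read off
# the (sample) precision matrix and tested with the Fisher z-transform. Helper names are ours.
import numpy as np
from scipy.stats import norm

def moral_graph_from_precision(theta, n, n_conditioned, alpha=0.05):
    """Edges (i, j) whose partial correlation is significantly nonzero at level alpha."""
    d = np.sqrt(np.diag(theta))
    rho = -theta / np.outer(d, d)                      # partial correlations given all other nodes
    np.fill_diagonal(rho, 1.0)
    rho = np.clip(rho, -0.999999, 0.999999)
    z = 0.5 * np.log((1 + rho) / (1 - rho)) * np.sqrt(max(n - n_conditioned - 3, 1))
    cutoff = norm.ppf(1 - alpha / 2)
    p = theta.shape[0]
    return {(i, j) for i in range(p) for j in range(i + 1, p) if abs(z[i, j]) > cutoff}

# Example: 500 samples from the collider 0 -> 2 <- 1; the population moral graph is the full triangle.
rng = np.random.default_rng(0)
B = np.zeros((3, 3)); B[0, 2] = B[1, 2] = 1.0
X = rng.standard_normal((500, 3)) @ np.linalg.inv(np.eye(3) - B)   # rows of X solve X = X B + eps
theta_hat = np.linalg.inv(np.cov(X, rowvar=False))
print(moral_graph_from_precision(theta_hat, n=500, n_conditioned=1))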

A.2 Updating the Marginal Precision Matrix

If we consider the effect of marginalizing out a node j ∈ S, the new marginal sample precision matrix Θ_{S−j} is related to Θ_S by the following rank-one update:

Θ_{S−j} = (Θ_S)_{S−j, S−j} − (Θ_S)_{S−j, j} (Θ_S)_{j, S−j} / (Θ_S)_{jj}.

Thus, given access to Θ_S, we may compute Θ_{S−j} in O(p^2) time.
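The update can be checked numerically with a few lines of numpy; the helper name below is ours.

# Sketch: after marginalizing out node j, the sample precision matrix over the remaining
# nodes is a rank-one (Schur complement) update of the previous one, computable in O(p^2).
import numpy as np

def marginal_precision(theta, j):
    keep = [k for k in range(theta.shape[0]) if k != j]
    b = theta[keep, j]
    return theta[np.ix_(keep, keep)] - np.outer(b, b) / theta[j, j]

# Numerical check against direct inversion of the corresponding covariance submatrix.
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
sigma = A @ A.T + 5 * np.eye(5)                 # a random covariance matrix
theta = np.linalg.inv(sigma)
direct = np.linalg.inv(sigma[np.ix_([0, 1, 2, 3], [0, 1, 2, 3])])   # drop node 4, then invert
print(np.allclose(marginal_precision(theta, 4), direct))             # True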

Appendix B Performance on 20-node graphs

Figure 8 shows that the various causal structure learning algorithms which we test perform similarly.

Figure 8: ROC curve