On Causal Discovery with Equal Variance Assumption

Prior work has shown that causal structure can be uniquely identified from observational data when these follow a structural equation model whose error terms have equal variances. We show that this fact is implied by an ordering among (conditional) variances. We demonstrate that ordering estimates of these variances yields a simple yet state-of-the-art method for causal structure learning that is readily extendable to high-dimensional problems.

Authors

• 38 publications
• 10 publications
• 11 publications
05/11/2012

Identifiability of Gaussian structural equation models with equal error variances

We consider structural equation models in which variables can be written...
03/29/2018

High-Dimensional Causal Discovery Under non-Gaussianity

We consider data from graphical models based on a recursive system of li...
06/21/2018

Identifiability of Gaussian Structural Equation Models with Dependent Errors Having Equal Variances

In this paper, we prove that some Gaussian structural equation models wi...
08/24/2015

A note on the complexity of the causal ordering problem

In this note we provide a concise report on the complexity of the causal...
03/29/2018

High-Dimensional Discovery Under non-Gaussianity

We consider data from graphical models based on a recursive system of li...
05/08/2018

On the Conditional Logic of Simulation Models

We propose analyzing conditional reasoning by appeal to a notion of inte...
01/29/2019

Identifiability of Gaussian Structural Equation Models with Homogeneous and Heterogeneous Error Variances

In this work, we consider the identifiability assumption of Gaussian str...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

A structural equation model for a random vector

postulates causal relations in which each variable is a function of a subset of the other variables and a stochastic error . Causal discovery/structure learning is the problem of inferring which of other variables each variable

depends on. We consider this problem where only observational data, that is, a sample from the joint distribution of

, is available. While in general only an equivalence class of structures can then be inferred (Pearl, 2009; Spirtes et al., 2000), recent work stresses that unique identification is possible under assumptions such as non-linearity with additive errors, linearity with non-Gaussian errors, and linearity with errors of equal variance; see the reviews of Drton and Maathuis (2017) and Heinze-Deml et al. (2018) or the book of Peters et al. (2017).

This note is concerned with the equal variance case treated by Peters and Bühlmann (2014) and Loh and Bühlmann (2014) who prove identifiability of the causal structure and propose greedy search methods for its estimation. Our key observation is that the identifiability is implied by an ordering among certain conditional variances. Ordering estimates of these variances yields a fast method for estimation of the causal ordering of the variables. The precise causal structure can then be inferred using variable selection techniques for regression (Shojaie and Michailidis, 2010). Specifically, we develop a top-down approach that infers the ordering by successively identifying sources. The method is developed for low- as well as high-dimensional problems. Simulations show significant gains in computational efficiency when compared with greedy search and increased accuracy when the number of variables is large.

An earlier version of this note also included a bottom-up method which identified the causal ordering by successively finding sinks via minimal precisions. However, after the note was finished, we became aware of Ghoshal and Honorio (2018) who proposed a similar bottom-up approach. We emphasize that our top-down approach only requires control of the maximum in-degree as opposed to the bottom-up approach which requires control of the maximum Markov blanket. This is discussed further in Section 4.2 and a direct numerical comparison is given in Section 5.2.

2. Structural Equation Models and Directed Acyclic Graphs

Suppose, without loss of generality, that the observed random vector is centered. In a linear structural equation model, then solves an equation system

 (1) Xj=∑k≠jβjkXk+εj,j=1,…,p,

where the

are independent random variables with mean zero, and the coefficients

are unknown parameters. Following Peters and Bühlmann (2014), we assume that all have a common unknown variance . We will write to express the assumption that there indeed exist independent errors of equal variance such that solves (1) for coefficients given by a real matrix with zeros along the diagonal.

The causal structure inherent to the equations in (1) is encoded in a directed graph with vertex set and edge set equal to the support of . So, . Inference of is the goal of causal discovery as considered in this paper. As in related work, we assume to be a directed acyclic graph (DAG) so that is permutation similar to a triangular matrix. Then (1) admits the unique solution where . Hence, the covariance matrix of is

 (2) Σ:=E(XXT)=σ2(I−B)−1(I−B)−T.

We will invoke the following graphical concepts. If the considered graph contains the edge , then is a parent of its child . We write for the set of all parents of a node . Similarly, is the set of children of . If there exists a directed path , then is an ancestor of its descendant . The sets of ancestors and descendants of are and , respectively. Here, and . A set of nodes is ancestral if for all . If is a DAG, then it admits a topological ordering of its vertices. In other words, there exists a numbering such that only if . Finally, every DAG contains at least one source, that is, a node with . Similarly, every DAG contains at least one sink, which is a node with .

3. Identifiability by Ordering Variances

The main result of Peters and Bühlmann (2014) shows that the graph and the parameters and are identifiable from the covariance in (2). No faithfulness assumptions are needed.

Theorem 1.

Let and with both and directed and acyclic. If , then , , and .

In this section we first give an inductive proof of Theorem 1 that proceeds by recursively identifying source nodes for and subgraphs. We then clarify that alternatively one could identify sink nodes. Our first lemma clarifies that the sources in are characterized by minimal variances. We define

 (3) ζ≡ζ(B)=min(k,j)∈E(B)β2jk.
Lemma 1.

Let with directed and acyclic. If , then . If , then .

Proof.

Let . Each total effect is the sum over directed paths from to of products of coefficients along each path. In particular, . From (2), . Hence, if , then because for all . If then by acyclicity of there exists a node such that . Then and

 var(Xj)=σ2(1+∑k≠jπ2jk)≥σ2(1+π2jℓ)≥σ2(1+ζ).

The next lemma shows that by conditioning on a source, or more generally an ancestral set, one recovers a structural equation model with equal error variance whose graph has the source node or the entire ancestral set removed. For a variable and a vector , we define .

Lemma 2.

Let with directed and acyclic. Let be an ancestral set in . Then for submatrix .

Proof.

Let . Since is ancestral, is a function of only and thus independent of . Hence, . Because it also holds that for , we have from (1) that

 Xj.C=∑k∈pa(j)∖CβjkXk.C+εj.

The lemmas can be combined to identify a topological ordering of and prove Theorem 1.

of Theorem 1.

The claim is trivial for variables, which gives the base for an induction on . If , then Lemma 1 identifies a source by variance minimization. Conditioning on as in Lemma 2 reduces the problem to size . By the induction assumption, and can be identified. The regression coefficients in the conditional expectations for identify the missing first row and column of ; see e.g. Drton (2018, §7). ∎

Next, we show that alternatively one may minimize precisions to identify a sink node. We state analogues of Lemma 1 and 2 which can also be used to prove Theorem 1.

Lemma 3.

Let with directed and acyclic. Let be the covariance matrix of , and the precision matrix. If , then . If , then .

Proof.

The diagonal entries of are So if , and if . ∎

Marginalization of a sink is justified by the following well-known fact (e.g. Drton, 2018, §5).

Lemma 4.

Let with directed and acyclic. Let be an ancestral set in . Then for submatrix .

4. Estimation Algorithms

4.1. Low-dimensional Problems

The results from Section 3 naturally yield an iterative top-down algorithm for estimation of a topological ordering for . In each step of the procedure we select a source node by comparing variances conditional on the previously selected variables, so the criterion in the minimization in Algorithm 1 is the variance

 (4) f1(^Σ,Θ,j)=^Σj,j−^Σj,Θ^Σ−1Θ,Θ^ΣΘ,j=1{(^ΣΘ∪{j},Θ∪{j})−1}j,j,

where is the sample covariance matrix. Alternatively, and as also observed by Ghoshal and Honorio (2018), a bottom-up procedure could construct the reverse causal ordering by successively minimizing precisions (or in other words, full conditional variances).

The bottom-up variant estimates a reversed topological ordering by minimization of precisions, so the criterion is

 (5) f2(^Σ,Θ,j)={(^ΣV∖Θ,V∖Θ)−1}j,j=1f1(^Σ,V∖Θ,j).

After a topological ordering is inferred, either method infers the graph by variable selection in the regressions of each variable on its predecessors in the ordering. We use the cross-validated lasso for this purpose.

To facilitate theoretical statements about our top-down procedure, we assume that the errors in (1) are all sub-Gaussian with maximal sub-Gaussian parameter . We indicate this by writing . Our analysis is restricted to inference of a topological ordering. Shojaie and Michailidis (2010) give results on lasso-based inference of the graph given an ordering.

Theorem 2.

Let with directed and acyclic. Suppose the covariance matrix

has minimum eigenvalue

. If

then Algorithm 1 using criterion criterion (4) recovers a topological ordering of

with probability at least

.

The result follows using concentration for sample covariances (Ravikumar et al., 2011, Lemma 1) and error propagation analysis as in Harris and Drton (2013, Lemma 5). We give details in Appendix A, which is found in the supplementary materials.

4.2. High-dimensional Problems

The consistency result in Theorem 2 requires the sample size to exceed a multiple of and only applies to low-dimensional problems. If , method will stop at the th step when the conditional variance in (4) becomes zero for all .

However, in the high-dimensional setting if has maximum in-degree bounded by a small integer , we may modify the criterion from (4) to

 (6) f2(^Σ,Θ,j)=minC⊆Θ,|C|=qf1(^Σ,C,j)=minC⊆Θ,|C|=q^Σj,j−^Σj,C(^ΣC,C)−1^ΣC,j.

The intuition is that in the population case, adjusting by a smaller set with yields the same results as adjusting by all of . The next lemma makes the idea rigorous.

Lemma 5.

Let with directed and acyclic with maximum in-degree at most . Let , and suppose is an ancestral set. If , then . If , then .

Proof.

The conditional variance of given is the variance of the residual . By Lemma 2, has the same distribution as when . Now, is a source of if and only if . Lemma 1 implies that if and otherwise. The claim about now follows. ∎

Based on Lemma 5, we have the following result whose proof is analogous to that of Theorem 2. The key feature of the result is a drop from to in the sample size requirement.

Theorem 3.

Let with directed and acyclic with of maximum in-degree at most . Suppose all principal submatrices of have minimum eigenvalue at least . If

 n>(q+1)2{log(p2+p)−log(ϵ/2)}128(1+4γ2σ2)2(maxj∈VΣj,j)2(ζλmin+2σ2ζλ2min)2,

then Algorithm 1 using criterion (6) recovers a topological ordering of with probability at least .

We contrast our guarantees with those for the bottom-up method of Ghoshal and Honorio (2018) which selects sinks by minimizing conditional precisions that are estimated using the CLIME estimator (Cai et al., 2011). Because CLIME requires small Markov blankets, the bottom-up procedure has sample complexity where is the maximum total degree. This implies that the procedure cannot consistently discover graphs with hubs, i.e., nodes with very large out-degree, in the high dimensional setting. This said, the computational complexity of the bottom-up procedure is polynomial in , while our top-down procedure is exponential in the maximum in-degree. In practice, we use a branch-and-bound procedure (Lumley, 2017) to efficiently select the set which minimizes the conditional variance; see Section 5.2.

Bottom-up Approach: At each step of the bottom-up algorithm, we require estimates of all conditional precisions, i.e., the inverses of all conditional variances. These precisions can be obtained by either estimating the entire conditional precision matrix, or by directly estimating all the conditional variances. It is worth noting that in high-dimensional setting, consistent estimation of both targets requires the considered graph to have small Markov blanket, as opposed to small maximum in-degree as for the top-down approach. Despite the stricter assumption needed, we present the bottom-up methods in this note as they could be potentially cheaper to compute.

It comes to our attention that a method based on precision matrix estimation has been proposed in Ghoshal and Honorio (2018). The method uses constrained inverse matrix estimation (CLIME) method by Cai et al. (2011). Performance of this method is evaluated in the simulation studies alongside our other proposed methods.

In this note, we propose a bottom-up approach based on conditional variance estimation. In the notation from (4) and (5), the conditional precisions are . In a high-dimensional setting, we may apply methods for estimation of the error variance in regularized regression as developed, e.g., by sun2012scaled or guo2017olasso. Specifically we apply the organic lasso method of guo2017olasso and estimate the conditional variance for given the yet unordered variables as

 (7) ^σ2j,Θ,λ:=minβ∈Rp−|Θ|−1(1n∥Xj−XV∖(Θ∪{j})β∥22+2λ∥β∥21).

with for some .

For fixed , consistently estimates in the high-dimensional setting, and in our numerical studies we see that the bottom-up approach works quite well. However, we will not pursue theoretical statements about the method in this note. For such theory, additional technical assumptions would have to be laid out that then allow one to turn theory such as the one in sun2012scaled into the exponential concentration inequalities for that would be needed for high-dimensionally consistent estimation of the topological order/the DAG. We emphasize that for such concentration, sparsity assumptions will be needed and for sparseness in from (8) the considered graph will need to have small maximum Markov blanket. Another issue is the subtlety that for graphs with non-unique topological ordering it is difficult to control which regression problems have to be considered in a union bound that yields a guaranteed probability of success of the bottom-up algorithm.

A variety of methods are available to estimate high-dimensional precision matrices (Drton and Maathuis, 2017). As we only need the diagonal entries of the precision matrices, i.e., the inverses of the full conditional variances , we may also simply use a lasso-approach for variance estimation in a high-dimensional linear model. Following guo2017olasso, we estimate the conditional variance with the organic lasso estimator, which is the minimal value of the penalized problem

 (8) ^σ2λ=minβ∈Rp(1n∥y−Xβ∥22+2λ∥β∥21).

We modify the criterion from (5) to

 (9) f4(^Σ,Θ,j)=ˆvar(Xj|XV∖(Θ∪{j})):=^σ2j,Θ,λ,

which is the solution of (8) when we regress on the standardized covariates . We derive the Theorem 4 using Theorem 9 in guo2017olasso. Although the statement of Theorem 4 does not explicitly enforce sparsity in the graph, in general, the value of will depend on the maximum degree of the graph.

Theorem 4.

Let with directed acyclic and covariance matrix . Let

 A=maxj∈[p]⎛⎜ ⎜⎝∥(ΣV∖{j},V∖{j})−1∥21∥ΣV∖{j},j∥21var(Xj|XV∖{j}),∥(ΣV∖{j},V∖{j})−1∥1∥ΣV∖{j},j∥1√var(Xj|XV∖{j})+14⎞⎟ ⎟⎠,

and fix some . If

 n>1ζ2ϵ{A(8M+p1−8Mlogp)(logp)1/2+√2}2,

then the bottom-up algorithm that uses criterion (9) with the organic lasso variance estimator for recovers a topological ordering of with probability at least .

5. Numerical Results

5.1. Low-dimensional Setting

We first assess performance in the low-dimensional setting. Random DAGs with nodes and a unique topological ordering are generated by: (1) always including edge for , and (2) including edge with probability for all . We consider a sparse setting with and a dense setting with . All linear coefficients are drawn uniformly from . The error terms are standard normal. Performance is measured using Kendall’s between rankings of variables according to the true and estimated topological orderings. Although the true graph admits a unique ordering by construction, the graph estimated by the greedy search may not admit a unique ordering. Nevertheless, the ranking of variables according to the estimated graph is unique if we allow ties, and Kendall’s remains a good measure for all the methods. We also compute the percentage of true edges discovered (Recall), the percentage of estimated edges that are flipped in the true graph (Flipped), and the proportion of estimated edges which are either flipped or not present in the true graph (false discovery rate; FDR). Tables 1 and 2 show averages over 500 random realizations for our top-down procedure (TD), the bottom-up procedure (BU) of Ghoshal and Honorio (2018), and greedy DAG search (GDS). For the bottom up procedure in the low-dimensional setting, we may in fact simply invert the sample covariance to estimate precisions. For GDS, we allow for 5 random restarts using the same procedure as Peters and Bühlmann (2014).

In both dense and sparse settings, when , greedy search performs best in all metrics. However, for and , the top-down approach does best, followed by bottom-up, and finally greedy search. The top-down and bottom-up method both have a substantially higher average Kendall’s than greedy search.

In our experiments, the proposed methods are roughly 50 to 500 times faster than greedy search as graph size and density increases. On our personal computer, the average run time in the dense setting with and is 8 seconds for the top-down and bottom-up methods, but 4,500 seconds for the greedy search.

5.2. High-dimensional Setting

We now test the proposed procedures in a high-dimensional setting with in two scenarios. Random DAGs with nodes and a unique topological ordering are generated by: (1) always including edge for , and either (2a) for each , including , where , and has out-degree , or (2b) for each , including , where . In both scenarios, the maximum in-degree is fixed to be . In the first scenario, it is also guaranteed that the maximum Markov blanket size is small, bounded by . In the second scenario when there exists hubs in the graph, the maximum Markov blanket size grows with , with . The errors are standard normal.

Algorithm 1 with (6) as HTD (high-dimensional top-down) and to the bottom-up method of Ghoshal and Honorio (2018) as HBU. The best subset search step in HTD is carried with subset size ; increasing beyond the true maximum in-degree does not change performance substantially. The HBU is tuned with . Results for greedy search are not shown as computation becomes intractable when . Performance is measured by Kendall’s to provide direct comparison.

Table 3 demonstrates that in the first scenario, both methods perform reasonably well when the considered graph has small Markov blanket. The HTD procedure performs the best in low-dimensional and moderately high-dimensional settings, and both methods have similar performance in very high-dimensional settings. However, when there exists nodes with very large Markov blanket, the top-down method substantially outperforms the bottom-up method.

On our personal computer, the average run time for problems of size is 10 minutes for the HTD method with . The computational complexity of HBU is determined by the choice of tunning parameter in the precisions estimation step.

Additional simulation settings are presented in Appendix B-E in the supplement including a setting with Rademacher errors as considered by Ghoshal and Honorio (2018).

6. Discussion

In this note, we proposed a simple method for causal discovery under a linear structural equation model with equal error variances. The procedure consistently estimates a topological ordering of the underlying graph and easily extends to the high-dimensional setting where . Simulations demonstrate that the procedure is an attractive alternative to previously considered greedy search methods in terms of both accuracy and computational effort. The advantages of the proposed procedures become especially salient as the number of considered variables increases.

In comparison to the related work of Ghoshal and Honorio (2018), our approach is computationally more demanding for graphs with higher in-degree but requires only control over the maximum in-degree of the graph as opposed to the maximum degree. We also note that as shown in simulations in Appendix E a hybrid method in which greedy search is initialized at estimates obtained from our variance ordering procedures can yield further improvements in performance.

Finally, we note that all discussed methods extend to structural equation models where the error variances are unequal, but known up to ratio. Indeed, if for some unknown but known , we may consider instead of the original variables.

Acknowledgements

This work was supported by the U.S. National Science Foundation (Grant No. DMS 1712535).

References

• Cai et al. (2011) Cai, T., Liu, W., and Luo, X. (2011). A constrained minimization approach to sparse precision matrix estimation. J. Amer. Statist. Assoc., 106(494):594–607.
• Drton (2018) Drton, M. (2018). Algebraic problems in structural equation modeling. In Hibi, T., editor, The 50th Anniversary of Gröbner Bases, Advanced Studies in Pure Mathematics. Mathematical Society of Japan. arXiv:1612.05994.
• Drton and Maathuis (2017) Drton, M. and Maathuis, M. H. (2017). Structure learning in graphical modeling. Annu. Rev. Stat. Appl., 4:365–393.
• Ghoshal and Honorio (2018) Ghoshal, A. and Honorio, J. (2018). Learning linear structural equation models in polynomial time and sample complexity. In

Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics

, volume 84, pages 1466–1475. PMLR.
• Harris and Drton (2013) Harris, N. and Drton, M. (2013). PC algorithm for nonparanormal graphical models. J. Mach. Learn. Res., 14:3365–3383.
• Heinze-Deml et al. (2018) Heinze-Deml, C., Maathuis, M. H., and Meinshausen, N. (2018). Causal structure learning. Annu. Rev. Stat. Appl., 5:371–394.
• Loh and Bühlmann (2014) Loh, P.-L. and Bühlmann, P. (2014). High-dimensional learning of linear causal networks via inverse covariance estimation. J. Mach. Learn. Res., 15:3065–3105.
• Lumley (2017) Lumley, T. (2017). leaps: Regression Subset Selection. R package version 3.0.
• Pearl (2009) Pearl, J. (2009). Causality. Cambridge University Press, Cambridge, second edition. Models, reasoning, and inference.
• Peters and Bühlmann (2014) Peters, J. and Bühlmann, P. (2014). Identifiability of Gaussian structural equation models with equal error variances. Biometrika, 101(1):219–228.
• Peters et al. (2017) Peters, J., Janzing, D., and Schölkopf, B. (2017). Elements of Causal Inference. MIT Press, Cambridge, MA.
• Ravikumar et al. (2011) Ravikumar, P., Wainwright, M. J., Raskutti, G., and Yu, B. (2011). High-dimensional covariance estimation by minimizing -penalized log-determinant divergence. Electron. J. Stat., 5:935–980.
• Shojaie and Michailidis (2010) Shojaie, A. and Michailidis, G. (2010). Penalized likelihood methods for estimation of sparse high-dimensional directed acyclic graphs. Biometrika, 97(3):519–538.
• Spirtes et al. (2000) Spirtes, P., Glymour, C., and Scheines, R. (2000). Causation, prediction, and search.

Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, second edition.

Appendix A Proof of Theorem 2

We first give a lemma that addresses the estimation error for inverse covariances.

Lemma 6.

Assume . Suppose all principal submatrices of have minimum eigenvalue at least . If for we have

 (10) n≥(q+1)2{log(p2+p)−log(ϵ/2)}128(1+4γ2σ2)2(maxj∈VΣj,j)2(ηλmin+1ηλ2min)2.

then

 maxC⊆V,|C|≤q+1∥(ΣC,C)−1−(^ΣC,C)−1∥∞≤η

with probability at least .

Proof.

Let . Because , by Lemma 5 from Harris and Drton (2013), we have

 maxC⊆V,|C|≤(q+1)∥(ΣC,C)−1−(^ΣC,C)−1∥∞≤(q+1)δ/λ2min1−(q+1)δ/λmin=η

provided . The proof is thus complete if we show that .

Note that has variance . Since is a bound on the sub-Gaussian parameters of all , it follows that is sub-Gaussian with parameter at most . Lemma 1 of Ravikumar et al. (2011) applies and gives

 P{|ˆΣi,j−Σi,j|>δ}≤4exp{−nδ2128(1+4γ2/σ2)2maxj(Σj,j)2}≤2p(p+1)ϵ.

A union bound over the entries of yields that indeed . ∎

of Theorem 2.

Our assumption on is as in (10) with . Lemma 6 thus implies that, with probability at least , we have for all subsets with that

 (11) ∥(^ΣΘ,Θ)−1−(ΣΘ,Θ)−1∥∞≤ζ2σ2.

Let be a source in , and let be a non-source. Note that variance of conditional on some set is

 σ2j∣C1=1{(ΣC1∪{j},C1∪{j})−1}j,j.

By Lemma 5, for any such that is an ancestral set and

 (12) {(ΣC1∪{j},C1∪{j})−1}j,j−{(ΣC2∪{k},C2∪{k})−1}k,k ≥1σ2−1σ2(1+ζ)≥ζσ2

Using (11), when and are both at most , we obtain that

 (13)

Thus which implies that Algorithm 1 correctly selects a source node at each step. On the first step, which is trivially an ancestral set. By induction, each subsequent step then correctly adds a sink to so remains ancestral and a correct ordering is recovered. ∎

Now to show consistency of the high-dimensional bottom-up algorithm. We first state Theorem 9 from guo2017olasso.

Theorem 5.

Suppose each column of of the matrix has been scaled so that for all , and . Then, for any constant , the organic lasso estimate,

 ^σλ=minβ∈Rp(1n∥y−Xβ∥22+2λ∥β∥21),

with satisfies the following relative mean squared error bound

 E⎧⎨⎩(^σ2λσ20−1)2⎫⎬⎭≤{(8M+p1−8Mlogp)max(∥β∗∥21σ20,∥β∗∥1σ0+14)(logpn)1/2+(2n)1/2}2.
of Theorem 4.

When estimating , we regress onto . So in our setting,

 β∗j=(ΣV∖{Θ,j},V∖{Θ,j})−1ΣV∖{Θ,j},j,

where for all not in the Markov blanket (parents, children, and parents of children) of in the sub-graph induced by . For any and ,

 (14) ∥β∗∥1 =∥(ΣV∖{Θ,j},V∖{Θ,j})−1ΣV∖{Θ,j},j∥1 ≤∥(ΣV∖{Θ,j},V∖{Θ,j})−1∥1∥ΣV∖{Θ,j},j∥1 ≤∥(ΣV∖{j},V∖{j})−1∥1∥ΣV∖{j},j∥1.

So provides a uniform bound over the relevant quantity for each . For the values of and we specified, the above theorem suggests . Then by Markov inequality, for any subset ,

 P{(^σ2j,Θ,λ−var(Xj|XV∖(Θ∪{j}))2)2>σ4ζ2}≤ϵ,

Let be a sink in and let be a non-sink, then,

 var(Xj|XV∖(Θ∪{j}))2−var(Xk|XV∖(Θ∪{k}))2=σ2−1/(Σ−1)k,k=σ2∑q∈ch(k)β2qk>ζσ2.

Hence, we obtain that

 ^σ2j,Θ,λ−^σ2k,Θ,λ≥var(Xj|XV∖(Θ∪{j}))2−var(Xk|XV∖(Θ∪{k}))2−σ2ζ>0.

This implies that in the first step the bottom-up algorithm correctly selects a sink node. By Lemma 4, we may repeat the argument just given in each step, and the bottom-up algorithm correctly estimates a reversed topological ordering of . ∎

Appendix B Simulations as in Peters and Bühlmann (2014)

We revisit the simulation study of Peters and Bühlmann (2014). DAGs are generated by first creating a random topological ordering, then between any two nodes, an edge is included with probability . We simulate a sparse setting with and a dense setting with . The linear coefficients are drawn uniformly from

and the errors are drawn from a standard Gaussian distribution. Since there may not be a unique ordering for the true graph, we compute the Hamming distance between the true and estimated adjacency matrix rather than Kendall’s

.

Tables 4 and 5 demonstrate that in both settings, the greedy algorithm performs better when is small. However, when the proposed algorithms infer the graph more accurately. In the dense setting, the proposed methods have similar FDR to greedy search, but substantially higher recall. In the sparse setting, the proposed methods have lower recall than greedy search, but also substantially lower FDR.

Appendix C Simulations as in Ghoshal and Honorio (2018)

We construct random graphs as in Section 5.2, but we follow the data sampling procedure as used in Ghoshal and Honorio (2018). All linear coefficients are drawn uniformly from , and errors are drown from the Rademacher distribution and scaled to have . Table 6 demonstrates that both methods performs reasonably well when Markov blankets are restricted to be small, and the top-down approach performs substantially better when there are hubs.

Appendix D Simulations of fully connected graphs

We run simulations with fully connected graphs, as suggested by a reviewer. The linear coefficients are drawn uniformly from and the errors are drawn from a standard Gaussian distribution. The results confirm the advantages of the proposed methods and are shown in Table 7. In general, the estimated graphs from the top-down and bottom-up procedure differ only slightly, and the values reported in the table differ in the 3rd or 4th digit.

Appendix E As initializer for greedy search

As suggested by a reviewer, we explore the performance of the greedy DAG search (GDS) algorithm initialized with the estimates from the proposed procedures. We run simulations with the same data as in Section 5.1. Tables 8 and 9 show averages over 500 random realizations for the top-down procedure (TD), the greedy DAG search with random initialization (GR), and the greedy DAG search with warm initialization (GW). The GR procedure is identical to the GDS procedure described in Section 5.1 and Peters and Bühlmann (2014). In the GW procedure, we initialize with the output from the top-down method, then search through a large number of graph neighbors () at each greedy step. Since the GW procedure is supplied with a good initializer, we do not restart the greedy search after it terminates, while 5 random restarting with is used in GR to insure performance. For simplicity, we omitted the experiment with the bottom-up procedure (BU).

Tables 8 and 9 shows that in all the settings, GW performs better than the other two methods, especially when is large. For reference, the average run time in the dense setting with and is 8 seconds for the top-down method, 4,500 seconds for GR, and 400 seconds for GW.