1. Introduction
A structural equation model for a random vector $X = (X_1, \dots, X_p)$ postulates causal relations in which each variable $X_j$ is a function of a subset of the other variables and a stochastic error $\varepsilon_j$. Causal discovery, or structure learning, is the problem of inferring which of the other variables each variable $X_j$ depends on. We consider this problem when only observational data, that is, a sample from the joint distribution of $X$, is available. While in general only an equivalence class of structures can then be inferred (Pearl, 2009; Spirtes et al., 2000), recent work stresses that unique identification is possible under assumptions such as nonlinearity with additive errors, linearity with non-Gaussian errors, and linearity with errors of equal variance; see the reviews of Drton and Maathuis (2017) and Heinze-Deml et al. (2018) or the book of Peters et al. (2017).
This note is concerned with the equal-variance case treated by Peters and Bühlmann (2014) and Loh and Bühlmann (2014), who prove identifiability of the causal structure and propose greedy search methods for its estimation. Our key observation is that this identifiability is implied by an ordering among certain conditional variances. Ordering estimates of these variances yields a fast method for estimating a causal ordering of the variables. The precise causal structure can then be inferred using variable selection techniques for regression (Shojaie and Michailidis, 2010). Specifically, we develop a top-down approach that infers the ordering by successively identifying sources. The method is developed for low- as well as high-dimensional problems. Simulations show significant gains in computational efficiency when compared with greedy search, and increased accuracy when the number of variables is large.
An earlier version of this note also included a bottom-up method which identified the causal ordering by successively finding sinks via minimal precisions. However, after the note was finished, we became aware of Ghoshal and Honorio (2018), who proposed a similar bottom-up approach. We emphasize that our top-down approach only requires control of the maximum in-degree, as opposed to the bottom-up approach, which requires control of the maximum Markov blanket. This is discussed further in Section 4.2, and a direct numerical comparison is given in Section 5.2.
2. Structural Equation Models and Directed Acyclic Graphs
Suppose, without loss of generality, that the observed random vector $X$ is centered. In a linear structural equation model, $X$ then solves the equation system
(1)  $X_j = \sum_{k \neq j} \beta_{jk} X_k + \varepsilon_j, \qquad j = 1, \dots, p,$
where the $\varepsilon_j$ are independent random variables with mean zero, and the coefficients $\beta_{jk}$ are unknown parameters. Following Peters and Bühlmann (2014), we assume that all $\varepsilon_j$ have a common unknown variance $\sigma^2$. We will write $X \sim (B, \sigma^2)$ to express the assumption that there indeed exist independent errors of equal variance $\sigma^2$ such that $X$ solves (1) for coefficients given by a real matrix $B = (\beta_{jk})$ with zeros along the diagonal.
The causal structure inherent to the equations in (1) is encoded in a directed graph $G$ with vertex set $V = \{1, \dots, p\}$ and edge set equal to the support of $B$. So, $k \to j$ is an edge of $G$ if and only if $\beta_{jk} \neq 0$. Inference of $G$ is the goal of causal discovery as considered in this paper. As in related work, we assume $G$ to be a directed acyclic graph (DAG), so that $B$ is permutation similar to a triangular matrix. Then (1) admits the unique solution $X = (I - B)^{-1}\varepsilon$. Hence, the covariance matrix of $X$ is
(2)  $\Sigma = \sigma^2 \, (I - B)^{-1} (I - B)^{-\top}.$
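To make the model concrete, the following short Python sketch, which is not code from the note and uses an illustrative graph, coefficients, and sample size, draws data from the equation system (1) and checks the covariance formula (2) empirically.

```python
import numpy as np

# A sketch (not code from the note): simulate from the linear SEM (1) with
# equal error variances and check the covariance formula (2).  The graph,
# coefficients, and sample size below are illustrative assumptions.
rng = np.random.default_rng(0)
p, n, sigma2 = 4, 100_000, 1.0

B = np.zeros((p, p))                          # B[j, k] = beta_{jk}, edge k -> j
B[1, 0], B[2, 0], B[3, 2] = 0.8, -0.6, 0.7    # DAG: 0 -> 1, 0 -> 2, 2 -> 3

eps = np.sqrt(sigma2) * rng.standard_normal((n, p))
Pi = np.linalg.inv(np.eye(p) - B)             # X = (I - B)^{-1} eps
X = eps @ Pi.T

Sigma = sigma2 * Pi @ Pi.T                    # population covariance, as in (2)
print(np.abs(np.cov(X, rowvar=False) - Sigma).max())   # small for large n
```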
We will invoke the following graphical concepts. If the considered graph contains the edge $k \to j$, then $k$ is a parent of its child $j$. We write $\mathrm{pa}(j)$ for the set of all parents of a node $j$. Similarly, $\mathrm{ch}(j)$ is the set of children of $j$. If there exists a directed path $k \to \cdots \to j$, then $k$ is an ancestor of its descendant $j$. The sets of ancestors and descendants of $j$ are $\mathrm{an}(j)$ and $\mathrm{de}(j)$, respectively. Here, we use the convention that $j \in \mathrm{an}(j)$ and $j \in \mathrm{de}(j)$. A set of nodes $A$ is ancestral if $\mathrm{an}(j) \subseteq A$ for all $j \in A$. If $G$ is a DAG, then it admits a topological ordering of its vertices. In other words, there exists a numbering of the vertices such that $\beta_{jk} \neq 0$ only if $k$ precedes $j$. Finally, every DAG contains at least one source, that is, a node $j$ with $\mathrm{pa}(j) = \emptyset$. Similarly, every DAG contains at least one sink, which is a node $j$ with $\mathrm{ch}(j) = \emptyset$.
3. Identifiability by Ordering Variances
The main result of Peters and Bühlmann (2014) shows that the graph $G$ and the parameters $B$ and $\sigma^2$ are identifiable from the covariance matrix $\Sigma$ in (2). No faithfulness assumptions are needed.
Theorem 1.
Let $X \sim (B, \sigma^2)$ and $\tilde X \sim (\tilde B, \tilde\sigma^2)$, with both associated graphs $G$ and $\tilde G$ directed and acyclic. If $X$ and $\tilde X$ have the same covariance matrix, then $G = \tilde G$, $B = \tilde B$, and $\sigma^2 = \tilde\sigma^2$.
In this section we first give an inductive proof of Theorem 1 that proceeds by recursively identifying source nodes of $G$ and its subgraphs. We then clarify that one could alternatively identify sink nodes. Our first lemma clarifies that the sources in $G$ are characterized by minimal variances. We define the matrix of total effects
(3)  $\Pi = (\pi_{jk}) := (I - B)^{-1}.$
Lemma 1.
Let $X \sim (B, \sigma^2)$ with $G$ directed and acyclic. If $\mathrm{pa}(j) = \emptyset$, then $\Sigma_{jj} = \sigma^2$. If $\mathrm{pa}(j) \neq \emptyset$, then $\Sigma_{jj} > \sigma^2$.
Proof.
Let $j \in V$. Each total effect $\pi_{jk}$ is the sum over directed paths from $k$ to $j$ of the products of coefficients along each path. In particular, $\pi_{jj} = 1$. From (2), $\Sigma_{jj} = \sigma^2 \sum_k \pi_{jk}^2$. Hence, if $\mathrm{pa}(j) = \emptyset$, then $\Sigma_{jj} = \sigma^2$ because $\pi_{jk} = 0$ for all $k \neq j$. If $\mathrm{pa}(j) \neq \emptyset$, then by acyclicity of $G$ there exists a parent $k$ of $j$ such that the edge $k \to j$ is the only directed path from $k$ to $j$. Then $\pi_{jk} = \beta_{jk} \neq 0$ and
$\Sigma_{jj} = \sigma^2 \sum_l \pi_{jl}^2 \ \geq\ \sigma^2 (1 + \beta_{jk}^2) \ >\ \sigma^2.$
∎
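As an informal numerical companion to Lemma 1, again an illustrative sketch rather than the authors' code, one can verify that the source attains the minimal marginal variance:

```python
import numpy as np

# Illustrative check of Lemma 1 (a sketch, not the authors' code): in the
# equal-variance model, a source attains the minimal marginal variance.
p, sigma2 = 4, 1.0
B = np.zeros((p, p))
B[1, 0], B[2, 1], B[3, 1] = 0.9, -0.7, 0.5    # assumed DAG: 0 -> 1 -> {2, 3}

Pi = np.linalg.inv(np.eye(p) - B)             # total effects, as in (3)
Sigma = sigma2 * Pi @ Pi.T                    # covariance from (2)

print(np.diag(Sigma))                         # entry 0 equals sigma2 = 1
print(int(np.argmin(np.diag(Sigma))))         # -> 0, the unique source
```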
The next lemma shows that by conditioning on a source, or more generally an ancestral set, one recovers a structural equation model with equal error variance whose graph has the source node or the entire ancestral set removed. For a variable $X_j$ and a vector $X_A = (X_k)_{k \in A}$, we define $X_{j \cdot A} = X_j - E(X_j \mid X_A)$.
Lemma 2.
Let $X \sim (B, \sigma^2)$ with $G$ directed and acyclic. Let $A$ be an ancestral set in $G$. Then $(X_{j \cdot A})_{j \notin A} \sim (B_{-A}, \sigma^2)$ for the submatrix $B_{-A} = (\beta_{jk})_{j,k \notin A}$.
Proof.
Let $j \notin A$. Since $A$ is ancestral, $X_A$ is a function of $(\varepsilon_k)_{k \in A}$ only and is thus independent of $\varepsilon_j$. Hence, $E(\varepsilon_j \mid X_A) = 0$. Because it also holds that $X_{k \cdot A} = 0$ for $k \in A$, we have from (1) that
$X_{j \cdot A} = \sum_{k \notin A} \beta_{jk} X_{k \cdot A} + \varepsilon_j.$
∎
The lemmas can be combined to identify a topological ordering of $G$ and prove Theorem 1.
Proof of Theorem 1.
The claim is trivial for $p = 1$, which gives the base for an induction on $p$. If $p > 1$, then Lemma 1 identifies a source $j$ by variance minimization. Conditioning on $X_j$ as in Lemma 2 reduces the problem to size $p - 1$. By the induction assumption, the error variance $\sigma^2$ and the coefficients among the remaining variables can be identified. The regression coefficients in the conditional expectations given $X_j$ then identify the missing first row and column of $B$ (with the source listed first); see e.g. Drton (2018, §7). ∎
Next, we show that alternatively one may minimize precisions to identify a sink node. We state analogues of Lemmas 1 and 2 which can also be used to prove Theorem 1.
Lemma 3.
Let $X \sim (B, \sigma^2)$ with $G$ directed and acyclic. Let $\Sigma$ be the covariance matrix of $X$, and $K = \Sigma^{-1}$ the precision matrix. If $\mathrm{ch}(j) = \emptyset$, then $K_{jj} = 1/\sigma^2$. If $\mathrm{ch}(j) \neq \emptyset$, then $K_{jj} > 1/\sigma^2$.
Proof.
The diagonal entries of $K = \sigma^{-2} (I - B)^{\top} (I - B)$ are
$K_{jj} = \frac{1}{\sigma^2}\Big(1 + \sum_{k \in \mathrm{ch}(j)} \beta_{kj}^2\Big).$
So $K_{jj} = 1/\sigma^2$ if $\mathrm{ch}(j) = \emptyset$, and $K_{jj} > 1/\sigma^2$ if $\mathrm{ch}(j) \neq \emptyset$. ∎
Marginalization of a sink is justified by the following well-known fact (e.g. Drton, 2018, §5).
Lemma 4.
Let $X \sim (B, \sigma^2)$ with $G$ directed and acyclic. Let $A$ be an ancestral set in $G$. Then $X_A \sim (B_A, \sigma^2)$ for the submatrix $B_A = (\beta_{jk})_{j,k \in A}$.
4. Estimation Algorithms
4.1. Low-dimensional Problems
The results from Section 3 naturally yield an iterative top-down algorithm for estimating a topological ordering of $G$. In each step of the procedure we select a source node by comparing variances conditional on the previously selected variables, so the criterion in the minimization in Algorithm 1 is the variance
(4)  $\hat\sigma^2_{j \cdot A} := \hat\Sigma_{jj} - \hat\Sigma_{jA}\,(\hat\Sigma_{AA})^{-1}\,\hat\Sigma_{Aj},$
where $\hat\Sigma$ is the sample covariance matrix, $A$ denotes the set of previously selected variables, and analogous notation is used for other conditioning sets. Alternatively, and as also observed by Ghoshal and Honorio (2018), a bottom-up procedure could construct the reverse causal ordering by successively minimizing precisions (that is, the reciprocals of full conditional variances).
The bottom-up variant estimates a reversed topological ordering by minimization of precisions, so the criterion is
(5)  $\big[(\hat\Sigma_{\Theta\Theta})^{-1}\big]_{jj} = 1\big/\hat\sigma^2_{j \cdot \Theta \setminus \{j\}},$
where $\Theta$ denotes the set of yet unordered variables and $j \in \Theta$.
After a topological ordering is inferred, either method infers the graph by variable selection in the regressions of each variable on its predecessors in the ordering. We use the cross-validated lasso for this purpose.
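For concreteness, the following is a minimal Python sketch of the ordering step just described and of the subsequent lasso step. It is not the authors' reference implementation; the function names topdown_order and select_parents are ours, and the use of scikit-learn's LassoCV for the cross-validated lasso is an illustrative choice.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def topdown_order(X):
    """Estimate a topological ordering by successively minimizing the
    sample conditional variance in (4).  X is an n-by-p data matrix."""
    n, p = X.shape
    S = np.cov(X, rowvar=False)              # sample covariance matrix
    order, remaining = [], list(range(p))
    while remaining:
        crit = {}
        for j in remaining:
            if order:
                S_AA = S[np.ix_(order, order)]
                S_jA = S[j, order]
                crit[j] = S[j, j] - S_jA @ np.linalg.solve(S_AA, S_jA)
            else:
                crit[j] = S[j, j]            # first step: marginal variances
        src = min(crit, key=crit.get)        # estimated source node
        order.append(src)
        remaining.remove(src)
    return order

def select_parents(X, order):
    """Cross-validated lasso regression of each variable on its predecessors."""
    parents = {order[0]: []}
    for i in range(1, len(order)):
        j, preds = order[i], order[:i]
        coef = LassoCV(cv=5).fit(X[:, preds], X[:, j]).coef_
        parents[j] = [k for k, c in zip(preds, coef) if abs(c) > 1e-8]
    return parents
```

Given an $n \times p$ data matrix, topdown_order returns the estimated ordering and select_parents the estimated parent sets, i.e., the edge set of the estimated graph.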
To facilitate theoretical statements about our top-down procedure, we assume that the errors $\varepsilon_j$ in (1) are all sub-Gaussian with maximal sub-Gaussian parameter $\tau^2$. We indicate this by writing $X \sim (B, \sigma^2, \tau^2)$. Our analysis is restricted to inference of a topological ordering. Shojaie and Michailidis (2010) give results on lasso-based inference of the graph given an ordering.
Theorem 2.
Let $X \sim (B, \sigma^2, \tau^2)$ with $G$ directed and acyclic. Suppose the covariance matrix $\Sigma$ has minimum eigenvalue $\lambda_{\min} > 0$. If the sample size $n$ exceeds a threshold that grows as a multiple of $p$ and depends on $\sigma^2$, $\tau^2$, and $\lambda_{\min}$, then Algorithm 1 using criterion (4) recovers a topological ordering of $G$ with high probability; the explicit sample size requirement and probability bound appear in the proof in Appendix A.
4.2. High-dimensional Problems
The consistency result in Theorem 2 requires the sample size to exceed a multiple of $p$ and only applies to low-dimensional problems. If $p > n$, the method will stop once the number of selected variables reaches the sample size, when the conditional variance in (4) becomes zero for all remaining variables.
However, in the high-dimensional setting, if $G$ has maximum in-degree bounded by a small integer $q$, we may modify the criterion from (4) to
(6)  $\min_{C \subseteq A,\ |C| \le q}\ \hat\sigma^2_{j \cdot C} \;=\; \min_{C \subseteq A,\ |C| \le q}\ \hat\Sigma_{jj} - \hat\Sigma_{jC}\,(\hat\Sigma_{CC})^{-1}\,\hat\Sigma_{Cj}.$
The intuition is that, in the population case, adjusting for a smaller set $C$ with $\mathrm{pa}(j) \subseteq C \subseteq A$ yields the same result as adjusting for all of $A$. The next lemma makes the idea rigorous.
Lemma 5.
Let $X \sim (B, \sigma^2)$ with $G$ directed and acyclic with maximum in-degree at most $q$. Let $j \notin A$, and suppose $A$ is an ancestral set. If $\mathrm{pa}(j) \subseteq A$, then $\min_{C \subseteq A, |C| \le q} \mathrm{Var}(X_j \mid X_C) = \sigma^2$. If $\mathrm{pa}(j) \not\subseteq A$, then $\min_{C \subseteq A, |C| \le q} \mathrm{Var}(X_j \mid X_C) > \sigma^2$.
Proof.
Since $A$ is ancestral and $j \notin A$, no element of $A$ is a descendant of $j$, so $\varepsilon_j$ is independent of $X_C$ for every $C \subseteq A$ and $\mathrm{Var}(X_j \mid X_C) \geq \sigma^2$, with equality whenever $\mathrm{pa}(j) \subseteq C$. If $\mathrm{pa}(j) \subseteq A$, then the set $C = \mathrm{pa}(j)$ has size at most $q$ and attains the value $\sigma^2$. If $\mathrm{pa}(j) \not\subseteq A$, then an argument as in the proofs of Lemmas 1 and 2 shows that $\mathrm{Var}(X_j \mid X_C) > \sigma^2$ for every $C \subseteq A$. ∎
Based on Lemma 5, we have the following result, whose proof is analogous to that of Theorem 2. The key feature of the result is a drop from a multiple of $p$ to a multiple of $\log p$ in the sample size requirement.
Theorem 3.
Let $X \sim (B, \sigma^2, \tau^2)$ with $G$ directed and acyclic and with maximum in-degree at most $q$. Under an eigenvalue condition on small principal submatrices of $\Sigma$ analogous to that of Theorem 2, if the sample size exceeds a threshold that grows only logarithmically in $p$, then Algorithm 1 using criterion (6) recovers a topological ordering of $G$ with high probability.
We contrast our guarantees with those for the bottom-up method of Ghoshal and Honorio (2018), which selects sinks by minimizing conditional precisions that are estimated using the CLIME estimator (Cai et al., 2011). Because CLIME requires small Markov blankets, the sample complexity of the bottom-up procedure is governed by the maximum total degree of the graph. This implies that the procedure cannot consistently discover graphs with hubs, i.e., nodes with very large out-degree, in the high-dimensional setting. This said, the computational complexity of the bottom-up procedure is polynomial in $p$, while our top-down procedure is exponential in the maximum in-degree. In practice, we use a branch-and-bound procedure (Lumley, 2017) to efficiently select the set that minimizes the conditional variance; see Section 5.2.
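To make the modified criterion concrete, the following sketch evaluates (6) by brute-force enumeration of conditioning sets of size at most q; it stands in for the branch-and-bound search used in practice, and the function name criterion6 is ours.

```python
import numpy as np
from itertools import combinations

# Sketch of criterion (6) (not the authors' implementation): a brute-force
# search over conditioning sets of size at most q, standing in for the
# branch-and-bound routine mentioned above.
def criterion6(S, j, selected, q):
    """Minimal sample conditional variance of variable j over all subsets of
    the previously selected variables of size at most q; S is the sample
    covariance matrix."""
    best = S[j, j]                             # empty conditioning set
    for size in range(1, min(q, len(selected)) + 1):
        for C in combinations(selected, size):
            C = list(C)
            v = S[j, j] - S[j, C] @ np.linalg.solve(S[np.ix_(C, C)], S[C, j])
            best = min(best, v)
    return best
```

Replacing the full conditional variance in the earlier top-down sketch by criterion6 gives the high-dimensional variant evaluated in Section 5.2.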
Bottom-up Approach: At each step of the bottom-up algorithm, we require estimates of all conditional precisions, i.e., the inverses of all full conditional variances. These precisions can be obtained either by estimating the entire conditional precision matrix or by directly estimating all the conditional variances. It is worth noting that in the high-dimensional setting, consistent estimation of both targets requires the considered graph to have small Markov blankets, as opposed to the small maximum in-degree required by the top-down approach. Despite the stricter assumption needed, we present the bottom-up methods in this note as they can potentially be cheaper to compute.
A method based on precision matrix estimation has been proposed in Ghoshal and Honorio (2018); it uses the constrained inverse matrix estimation (CLIME) method of Cai et al. (2011). The performance of this method is evaluated in the simulation studies alongside our other proposed methods.
In this note, we propose a bottom-up approach based on conditional variance estimation. In the notation from (4) and (5), the conditional precisions are the reciprocals $1/\hat\sigma^2_{j \cdot \Theta \setminus \{j\}}$ of the full conditional variances. In a high-dimensional setting, we may apply methods for estimation of the error variance in regularized regression as developed, e.g., by Sun and Zhang (2012) or Yu and Bien (2019). Specifically, we apply the organic lasso method of Yu and Bien (2019) and estimate the conditional variance of $X_j$ given the yet unordered variables as
(7)  $\tilde\sigma^2_{j \cdot \Theta \setminus \{j\}} = \min_{\beta}\ \frac{1}{n}\,\big\|\mathbf{X}_j - \mathbf{X}_{\Theta \setminus \{j\}}\,\beta\big\|_2^2 + 2\lambda\,\|\beta\|_1^2,$
where $\mathbf{X}_j \in \mathbb{R}^n$ collects the observations of $X_j$ and $\mathbf{X}_{\Theta \setminus \{j\}}$ those of the other yet unordered variables, with $\lambda$ proportional to $\sqrt{\log p / n}$.
For a fixed conditioning set, the estimator in (7) consistently estimates the corresponding full conditional variance in the high-dimensional setting, and in our numerical studies we see that the bottom-up approach works quite well. A full theoretical treatment, however, requires additional technical assumptions that allow one to turn theory such as that of Sun and Zhang (2012) into exponential concentration inequalities for the variance estimates, as would be needed for high-dimensionally consistent estimation of the topological order and the DAG. We emphasize that for such concentration, sparsity assumptions are needed, and for sparseness of the solutions of (8) the considered graph will need to have a small maximum Markov blanket. Another issue is the subtlety that for graphs with a non-unique topological ordering it is difficult to control which regression problems have to be considered in a union bound that yields a guaranteed probability of success of the bottom-up algorithm.
A variety of methods are available to estimate high-dimensional precision matrices (Drton and Maathuis, 2017). As we only need the diagonal entries of the precision matrices, i.e., the inverses of the full conditional variances, we may also simply use a lasso approach for variance estimation in a high-dimensional linear model. Following Yu and Bien (2019), we estimate the conditional variance with the organic lasso estimator, which is the minimal value of the penalized problem
(8)  $\min_{\beta}\ \frac{1}{n}\,\|\mathbf{y} - \mathbf{Z}\beta\|_2^2 + 2\lambda\,\|\beta\|_1^2$
for a response vector $\mathbf{y}$ and a matrix $\mathbf{Z}$ of covariate observations.
We modify the criterion from (5) by replacing each full conditional variance with its organic lasso estimate, giving
(9)  $1\big/\tilde\sigma^2_{j \cdot \Theta \setminus \{j\}},$
where the estimate is the minimal value of (8) when we regress $X_j$ on the standardized covariates $X_k / \hat\Sigma_{kk}^{1/2}$, $k \in \Theta \setminus \{j\}$. We derive Theorem 4 using Theorem 9 of Yu and Bien (2019). Although the statement of Theorem 4 does not explicitly enforce sparsity in the graph, in general the $\ell_1$-norms of the regression coefficient vectors that enter the analysis will depend on the maximum degree of the graph.
Theorem 4.
Let $X \sim (B, \sigma^2)$ with $G$ directed and acyclic and with covariance matrix $\Sigma$. Fix some $\epsilon > 0$. If the sample size satisfies the condition given in the proof in Appendix A, then the bottom-up algorithm that uses criterion (9) with the organic lasso variance estimator recovers a topological ordering of $G$ with probability at least $1 - \epsilon$.
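The following sketch shows the structure of the bottom-up ordering step. Because the organic lasso of Yu and Bien (2019) is not available in scikit-learn, the residual variance of a cross-validated lasso is used here as a rough stand-in for the estimator in (7)-(9); this changes the estimator, and the sketch (with our function name bottomup_order) is meant only to convey how the reversed ordering is built.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Sketch of the bottom-up ordering step with a lasso-based variance estimate.
# The cross-validated lasso residual variance is a rough stand-in for the
# organic lasso estimator used in the note.
def bottomup_order(X):
    n, p = X.shape
    reversed_order, remaining = [], list(range(p))
    while len(remaining) > 1:
        best_j, best_prec = None, np.inf
        for j in remaining:
            others = [k for k in remaining if k != j]
            fit = LassoCV(cv=5).fit(X[:, others], X[:, j])
            resid = X[:, j] - fit.predict(X[:, others])
            prec = 1.0 / np.var(resid)        # estimated conditional precision
            if prec < best_prec:              # a sink minimizes the precision
                best_j, best_prec = j, prec
        reversed_order.append(best_j)
        remaining.remove(best_j)
    reversed_order.append(remaining[0])
    return reversed_order[::-1]               # topological order, sources first
```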
5. Numerical Results
5.1. Low-dimensional Setting
We first assess performance in the low-dimensional setting. Random DAGs with $p$ nodes and a unique topological ordering are generated by: (1) always including the edge $j \to j+1$ for $j = 1, \dots, p-1$, and (2) including the edge $j \to k$ with a fixed probability for every other pair $j < k$. We consider a sparse setting and a dense setting that differ in this edge probability. All linear coefficients are drawn uniformly from a fixed range. The error terms are standard normal. Performance is measured using Kendall's $\tau$ between rankings of variables according to the true and estimated topological orderings. Although the true graph admits a unique ordering by construction, the graph estimated by the greedy search may not admit a unique ordering. Nevertheless, the ranking of variables according to the estimated graph is unique if we allow ties, and Kendall's $\tau$ remains a good measure for all the methods. We also compute the percentage of true edges discovered (Recall), the percentage of estimated edges that are flipped in the true graph (Flipped), and the proportion of estimated edges which are either flipped or not present in the true graph (false discovery rate; FDR). Tables 1 and 2 show averages over 500 random realizations for our top-down procedure (TD), the bottom-up procedure (BU) of Ghoshal and Honorio (2018), and greedy DAG search (GDS). For the bottom-up procedure in the low-dimensional setting, we may in fact simply invert the sample covariance matrix to estimate precisions. For GDS, we allow for 5 random restarts using the same procedure as Peters and Bühlmann (2014).
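The following sketch generates data according to the design just described; the edge probability and coefficient range are placeholder assumptions, since the exact values are not restated here, and the function name simulate_dag_data is ours.

```python
import numpy as np

# Illustrative data-generating sketch for the design described above (not the
# note's code): edge probability and coefficient range are placeholder values.
def simulate_dag_data(p=20, n=500, edge_prob=0.3, seed=0):
    rng = np.random.default_rng(seed)
    B = np.zeros((p, p))
    for j in range(p - 1):
        B[j + 1, j] = 1.0                      # edge j -> j+1: unique ordering
    for j in range(p):
        for k in range(j + 2, p):
            if rng.random() < edge_prob:       # optional extra edge j -> k
                B[k, j] = 1.0
    mask = B != 0
    B[mask] = rng.uniform(0.3, 1.0, mask.sum()) * rng.choice([-1.0, 1.0], mask.sum())
    X = rng.standard_normal((n, p)) @ np.linalg.inv(np.eye(p) - B).T
    return X, B

X, B_true = simulate_dag_data()
```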
Kendall’s τ  Recall %  Flipped %  FDR %
p  n  TD  BU  GDS  TD  BU  GDS  TD  BU  GDS  TD  BU  GDS
5  100  0.85  0.82  0.88  91  89  91  7  8  6  17  18  9 
500  0.98  0.97  0.98  99  98  99  1  1  1  4  4  2  
1000  0.99  0.98  0.99  99  99  99  1  1  1  3  3  1  
20  100  0.92  0.85  0.61  85  83  62  3  5  13  32  35  43 
500  0.99  0.97  0.75  99  98  81  1  1  11  28  29  35  
1000  1.00  0.99  0.82  100  100  88  0  0  8  26  26  28  
40  100  0.96  0.91  0.53  71  69  44  2  3  11  41  43  58 
500  0.99  0.98  0.59  96  96  63  0  1  14  41  42  57  
1000  1.00  0.99  0.64  97  97  71  0  0  14  40  41  57 
Kendall’s τ  Recall %  Flipped %  FDR %
p  n  TD  BU  GDS  TD  BU  GDS  TD  BU  GDS  TD  BU  GDS
5  100  0.87  0.84  0.88  91  89  90  6  7  6  16  17  9 
500  0.98  0.96  0.98  98  98  99  1  2  1  5  5  2  
1000  0.99  0.98  0.99  99  99  99  1  1  1  3  4  1  
20  100  0.77  0.59  0.60  85  79  77  9  13  15  35  40  39 
500  0.96  0.88  0.77  98  96  89  2  4  10  19  22  26  
1000  0.99  0.94  0.81  100  98  90  0  2  9  14  16  23  
40  100  0.72  0.44  0.47  81  72  72  10  16  20  38  46  54 
500  0.96  0.80  0.58  98  94  81  2  5  18  24  31  47  
1000  0.99  0.91  0.61  99  98  82  1  2  17  17  22  48 
In both dense and sparse settings, when $p = 5$, greedy search performs best in all metrics. However, for $p = 20$ and $p = 40$, the top-down approach does best, followed by bottom-up, and finally greedy search. The top-down and bottom-up methods both have a substantially higher average Kendall's $\tau$ than greedy search.
In our experiments, the proposed methods are roughly 50 to 500 times faster than greedy search as graph size and density increase. On our personal computer, the average run time in the dense setting with $p = 40$ and $n = 1000$ is 8 seconds for the top-down and bottom-up methods, but 4,500 seconds for the greedy search.
5.2. High-dimensional Setting
We now test the proposed procedures in a high-dimensional setting in two scenarios. Random DAGs with $p$ nodes and a unique topological ordering are generated by: (1) always including the edge $j \to j+1$ for $j = 1, \dots, p-1$, and either (2a) adding, for each node, a bounded number of additional incoming edges in such a way that every node also has small out-degree, or (2b) adding additional edges that emanate from a small number of hub nodes. In both scenarios, the maximum in-degree is fixed to a small constant. In the first scenario, it is also guaranteed that the maximum Markov blanket size is small. In the second scenario, where there exist hubs in the graph, the maximum Markov blanket size grows with $p$. The errors are standard normal.
We refer to Algorithm 1 with criterion (6) as HTD (high-dimensional top-down) and to the bottom-up method of Ghoshal and Honorio (2018) as HBU. The best subset search step in HTD is carried out with the subset size set to the true maximum in-degree; increasing the subset size beyond this value does not change performance substantially. HBU requires a tuning parameter for its precision estimation step. Results for greedy search are not shown, as its computation becomes intractable for problems of this size. Performance is measured by Kendall's $\tau$ to provide a direct comparison.
Small Markov blanket  Hub graph
p  p/n  HTD  HBU  HTD  HBU
80  0.5  0.99  0.89  1.00  0.70 
0.75  0.98  0.89  0.99  0.52  
1  0.95  0.87  0.95  0.39
1.5  0.84  0.83  0.77  0.25  
2  0.72  0.73  0.55  0.16  
100  0.5  1.00  0.93  1.00  0.70 
0.75  0.99  0.92  1.00  0.50  
1  0.97  0.87  0.97  0.38
1.5  0.86  0.84  0.74  0.26  
2  0.73  0.78  0.63  0.12  
200  0.5  1.00  0.95  1.00  0.77 
0.75  1.00  0.90  1.00  0.61  
1  0.99  0.79  0.99  0.48
1.5  0.87  0.74  0.80  0.20  
2  0.74  0.64  0.65  0.13 
Table 3 demonstrates that in the first scenario, where the considered graph has small Markov blankets, both methods perform reasonably well. The HTD procedure performs best in low-dimensional and moderately high-dimensional settings, and both methods have similar performance in very high-dimensional settings. However, when there exist nodes with very large Markov blankets, the top-down method substantially outperforms the bottom-up method.
On our personal computer, the average run time for the largest problems considered is 10 minutes for the HTD method. The computational complexity of HBU is determined by the choice of tuning parameter in the precision estimation step.
6. Discussion
In this note, we proposed a simple method for causal discovery under a linear structural equation model with equal error variances. The procedure consistently estimates a topological ordering of the underlying graph and easily extends to the high-dimensional setting where $p > n$. Simulations demonstrate that the procedure is an attractive alternative to previously considered greedy search methods in terms of both accuracy and computational effort. The advantages of the proposed procedures become especially salient as the number of considered variables increases.
In comparison to the related work of Ghoshal and Honorio (2018), our approach is computationally more demanding for graphs with higher in-degree but requires only control over the maximum in-degree of the graph as opposed to the maximum degree. We also note that, as shown in the simulations in Appendix E, a hybrid method in which greedy search is initialized at estimates obtained from our variance ordering procedures can yield further improvements in performance.
Finally, we note that all discussed methods extend to structural equation models in which the error variances are unequal but known up to ratio. Indeed, if $\mathrm{Var}(\varepsilon_j) = \sigma^2 r_j$ for an unknown $\sigma^2$ but known ratios $r_j > 0$, we may consider the rescaled variables $X_j/\sqrt{r_j}$ instead of the original variables.
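As a small illustration of this reduction, with assumed known ratios, the rescaling is a single line:

```python
import numpy as np

# Rescaling sketch: if Var(eps_j) = sigma^2 * r_j with known ratios r_j,
# dividing each variable by sqrt(r_j) restores equal error variances while
# leaving the graph unchanged.  The ratios and data below are illustrative.
r = np.array([1.0, 2.0, 0.5, 4.0])                        # known ratios r_j
X = np.random.default_rng(1).standard_normal((100, 4))    # placeholder data
X_rescaled = X / np.sqrt(r)                                # column-wise rescaling
```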
Acknowledgements
This work was supported by the U.S. National Science Foundation (Grant No. DMS 1712535).
References
 Cai et al. (2011) Cai, T., Liu, W., and Luo, X. (2011). A constrained minimization approach to sparse precision matrix estimation. J. Amer. Statist. Assoc., 106(494):594–607.
 Drton (2018) Drton, M. (2018). Algebraic problems in structural equation modeling. In Hibi, T., editor, The 50th Anniversary of Gröbner Bases, Advanced Studies in Pure Mathematics. Mathematical Society of Japan. arXiv:1612.05994.
 Drton and Maathuis (2017) Drton, M. and Maathuis, M. H. (2017). Structure learning in graphical modeling. Annu. Rev. Stat. Appl., 4:365–393.

 Ghoshal and Honorio (2018) Ghoshal, A. and Honorio, J. (2018). Learning linear structural equation models in polynomial time and sample complexity. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84, pages 1466–1475. PMLR.
 Harris and Drton (2013) Harris, N. and Drton, M. (2013). PC algorithm for nonparanormal graphical models. J. Mach. Learn. Res., 14:3365–3383.
 HeinzeDeml et al. (2018) HeinzeDeml, C., Maathuis, M. H., and Meinshausen, N. (2018). Causal structure learning. Annu. Rev. Stat. Appl., 5:371–394.
 Loh and Bühlmann (2014) Loh, P.L. and Bühlmann, P. (2014). Highdimensional learning of linear causal networks via inverse covariance estimation. J. Mach. Learn. Res., 15:3065–3105.
 Lumley (2017) Lumley, T. (2017). leaps: Regression Subset Selection. R package version 3.0.
 Pearl (2009) Pearl, J. (2009). Causality. Cambridge University Press, Cambridge, second edition. Models, reasoning, and inference.
 Peters and Bühlmann (2014) Peters, J. and Bühlmann, P. (2014). Identifiability of Gaussian structural equation models with equal error variances. Biometrika, 101(1):219–228.
 Peters et al. (2017) Peters, J., Janzing, D., and Schölkopf, B. (2017). Elements of Causal Inference. MIT Press, Cambridge, MA.
 Ravikumar et al. (2011) Ravikumar, P., Wainwright, M. J., Raskutti, G., and Yu, B. (2011). Highdimensional covariance estimation by minimizing penalized logdeterminant divergence. Electron. J. Stat., 5:935–980.
 Shojaie and Michailidis (2010) Shojaie, A. and Michailidis, G. (2010). Penalized likelihood methods for estimation of sparse highdimensional directed acyclic graphs. Biometrika, 97(3):519–538.

 Spirtes et al. (2000) Spirtes, P., Glymour, C., and Scheines, R. (2000). Causation, prediction, and search. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, second edition.
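 Sun and Zhang (2012) Sun, T. and Zhang, C.-H. (2012). Scaled sparse linear regression. Biometrika, 99(4):879–898.
 Yu and Bien (2019) Yu, G. and Bien, J. (2019). Estimating the error variance in a high-dimensional linear model. Biometrika, 106(3):533–546.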
Appendix A Proof of Theorem 2
We first give a lemma that addresses the estimation error for inverse covariances.
Lemma 6.
Assume . Suppose all principal submatrices of have minimum eigenvalue at least . If for we have
(10) 
then
with probability at least .
Proof.
Let . Because , by Lemma 5 from Harris and Drton (2013), we have
provided . The proof is thus complete if we show that .
Note that has variance . Since is a bound on the sub-Gaussian parameters of all , it follows that is sub-Gaussian with parameter at most . Lemma 1 of Ravikumar et al. (2011) applies and gives
A union bound over the entries of yields that indeed . ∎
Proof of Theorem 2.
Our assumption on is as in (10) with . Lemma 6 thus implies that, with probability at least , we have for all subsets with that
(11) 
Let $j$ be a source in $G$, and let $l$ be a non-source. Note that the variance of $X_l$ conditional on some set $C$ of other variables is
By Lemma 5, for any such that is an ancestral set and
(12) 
Using (11), when and are both at most , we obtain that
(13) 
Thus the criterion is minimized by a source, which implies that Algorithm 1 correctly selects a source node at each step. On the first step, the set of previously selected variables is empty, which is trivially an ancestral set. By induction, each subsequent step then correctly adds a node all of whose parents are already selected, so the selected set remains ancestral and a correct ordering is recovered. ∎
We now show consistency of the high-dimensional bottom-up algorithm. We first state Theorem 9 of Yu and Bien (2019).
Theorem 5.
Suppose each column of the matrix has been scaled so that for all , and . Then, for any constant , the organic lasso estimate,
with satisfies the following relative mean squared error bound
Proof of Theorem 4.
When estimating , we regress onto . So in our setting,
where for all not in the Markov blanket (parents, children, and parents of children) of in the subgraph induced by . For any and ,
(14)  
So provides a uniform bound over the relevant quantity for each . For the values of and we specified, the above theorem suggests . Then by Markov inequality, for any subset ,
Let $j$ be a sink in $G$ and let $l$ be a non-sink; then,
Hence, we obtain that
This implies that in the first step the bottom-up algorithm correctly selects a sink node. By Lemma 4, we may repeat the argument just given in each step, and the bottom-up algorithm correctly estimates a reversed topological ordering of $G$. ∎
Appendix B Simulations as in Peters and Bühlmann (2014)
We revisit the simulation study of Peters and Bühlmann (2014). DAGs are generated by first creating a random topological ordering; then, between any two nodes, an edge is included with a fixed probability. We simulate a sparse setting and a dense setting that differ in this edge probability. The linear coefficients are drawn uniformly from a fixed range, and the errors are drawn from a standard Gaussian distribution. Since there may not be a unique ordering for the true graph, we compute the Hamming distance between the true and estimated adjacency matrices rather than Kendall's $\tau$.
Tables 4 and 5 demonstrate that in both settings the greedy algorithm performs better when $p$ is small. However, when $p$ is large, the proposed algorithms infer the graph more accurately. In the dense setting, the proposed methods have similar FDR to greedy search but substantially higher recall. In the sparse setting, the proposed methods have lower recall than greedy search but also substantially lower FDR.
Hamming Dist.  Recall %  Flipped %  FDR %  
p  n  TD  BU  GDS  TD  BU  GDS  TD  BU  GDS  TD  BU  GDS 
5  100  1.3  1.3  1.1  73  73  78  7  7  7  16  15  18 
500  0.7  0.7  0.5  80  80  88  4  4  5  8  7  9  
1000  0.5  0.5  0.4  85  84  92  3  3  5  5  5  7  
20  100  31  32  30  73  73  74  4  3  6  27  28  25 
500  22  22  14  91  91  91  2  3  4  24  24  13  
1000  28  28  8  94  94  96  2  2  2  21  21  10  
40  100  170  174  215  66  65  54  2  3  8  36  37  45 
500  152  155  186  93  93  76  2  2  9  38  39  42  
1000  136  137  168  96  95  83  1  1  8  36  36  38 
Hamming Dist.  Recall %  Flipped %  FDR %  
p  n  TD  BU  GDS  TD  BU  GDS  TD  BU  GDS  TD  BU  GDS 
5  100  1.6  1.7  1.4  74  73  78  8  8  8  18  18  17 
500  0.8  0.9  0.6  85  84  91  3  4  5  7  7  9  
1000  0.6  0.6  0.4  88  88  94  3  4  5  6  6  7  
20  100  7  7  12  69  69  81  4  4  6  16  17  43 
500  3.5  3.5  4.5  85  84  93  4  4  4  9  8  21  
1000  2.2  2.2  2.8  90  90  97  3  2  3  5  5  14  
40  100  14  15  45  64  63  78  3  4  8  16  18  62 
500  7  7  16  84  84  94  3  3  3  8  7  33  
1000  5  5  10  90  89  97  3  3  3  6  6  24 
Appendix C Simulations as in Ghoshal and Honorio (2018)
We construct random graphs as in Section 5.2, but we follow the data sampling procedure used in Ghoshal and Honorio (2018). All linear coefficients are drawn uniformly from a fixed range, and the errors are drawn from the Rademacher distribution and scaled to have a common variance. Table 6 demonstrates that both methods perform reasonably well when Markov blankets are restricted to be small, and that the top-down approach performs substantially better when there are hubs.
Small Markov blanket  Hub graph
p  p/n  HTD  HBU  HTD  HBU
80  0.5  0.99  0.95  0.98  0.73 
0.75  0.98  0.90  0.89  0.46  
1  0.96  0.90  0.76  0.36
1.5  0.84  0.86  0.52  0.23  
2  0.71  0.80  0.35  0.10  
100  0.5  0.99  0.97  0.99  0.69 
0.75  0.99  0.95  0.92  0.46  
1  0.96  0.93  0.76  0.34
1.5  0.84  0.88  0.52  0.26  
2  0.72  0.82  0.39  0.13  
200  0.5  1.00  0.99  1.00  0.79 
0.75  1.00  0.98  0.98  0.59  
1  0.98  0.97  0.86  0.47
1.5  0.86  0.84  0.61  0.20  
2  0.73  0.77  0.48  0.10 
Appendix D Simulations of fully connected graphs
We run simulations with fully connected graphs, as suggested by a reviewer. The linear coefficients are drawn uniformly from a fixed range and the errors are drawn from a standard Gaussian distribution. The results confirm the advantages of the proposed methods and are shown in Table 7. In general, the estimated graphs from the top-down and bottom-up procedures differ only slightly, and the corresponding values reported in the table differ only in the third or fourth digit.
Kendall’s τ  Recall %  Flipped %  FDR %
p  n  TD  BU  GDS  TD  BU  GDS  TD  BU  GDS  TD  BU  GDS
5  100  0.92  0.93  0.83  91  92  80  4  3  7  4  4  9 
500  0.99  0.99  0.97  98  98  98  1  1  1  1  1  1  
1000  1.00  1.00  0.99  99  100  99  0  0  1  0  0  1  
20  100  0.98  0.98  0.62  74  74  45  1  1  9  1  1  17 
500  1.00  1.00  0.73  90  90  66  0  0  8  0  0  12  
1000  1.00  1.00  0.81  92  92  76  0  0  7  0  0  8  
40  100  0.99  0.99  0.55  42  42  33  0  0  7  1  1  17 
500  1.00  1.00  0.62  50  50  49  0  0  8  0  0  14  
1000  1.00  1.00  0.67  52  52  59  0  0  8  0  0  12 
Appendix E As an initializer for greedy search
As suggested by a reviewer, we explore the performance of the greedy DAG search (GDS) algorithm initialized with the estimates from the proposed procedures. We run simulations with the same data as in Section 5.1. Tables 8 and 9 show averages over 500 random realizations for the top-down procedure (TD), the greedy DAG search with random initialization (GR), and the greedy DAG search with warm initialization (GW). The GR procedure is identical to the GDS procedure described in Section 5.1 and Peters and Bühlmann (2014). In the GW procedure, we initialize with the output from the top-down method and then search through a large number of graph neighbors at each greedy step. Since the GW procedure is supplied with a good initializer, we do not restart the greedy search after it terminates, while 5 random restarts are used in GR to ensure good performance. For simplicity, we omit the experiment with the bottom-up procedure (BU).
Tables 8 and 9 show that in all the settings GW performs better than the other two methods, especially when $p$ is large. For reference, the average run time in the dense setting with $p = 40$ and $n = 1000$ is 8 seconds for the top-down method, 4,500 seconds for GR, and 400 seconds for GW.
Kendall’s τ  Recall %  Flipped %  FDR %
p  n  TD  GR  GW  TD  GR  GW  TD  GR  GW  TD  GR  GW
5  100  0.85  0.88  0.88  91  91  91  7  6  6  17  9  10 
500  0.98  0.98  0.99  99  99  99  1  1  1  4  2  2  
1000  0.99  0.99  0.99  99  99  99  1  1  1  3  1  1  
20  100  0.92  0.61  0.94  85  62  90  3  13  3  32  43  15 
500  0.99  0.75  0.99  99  81  99  1  11  0  28  35  3  
1000  1.00  0.82  1.00  100  88  100  0  8  0  26  28  2  
40  100  0.96  0.53  0.96  71  44  84  2  11  2  41  58  20 
500  0.99  0.59  1.00  96  63  100  0  14  0  41  57  4  
1000  1.00  0.64  1.00  97  71  100  0  14  0  40  57  2 
Kendall’s τ  Recall %  Flipped %  FDR %
p  n  TD  GR  GW  TD  GR  GW  TD  GR  GW  TD  GR  GW
5  100  0.87  0.88  0.87  91  90  91  6  6  6  16  9  10 
500  0.98  0.98  0.98  98  99  99  1  1  1  5  2  2  
1000  0.99  0.99  0.99  99  99  99  1  1  1  3  1  1  
20  100  0.77  0.60  0.82  85  77  90  9  15  7  35  39  25 
500  0.96  0.77  0.98  98  89  99  2  10  1  19  26  8  
1000  0.99  0.81  0.99  100  90  100  0  9  0  14  23  4  
40  100  0.72  0.47  0.79  81  72  89  10  20  7  38  54  36 
500  0.96  0.58  0.98  98  81  99  2  18  1  24  47  13  
1000  0.99  0.61  0.99  99  82  100  1  17  0  17  48  8 