Randomized experiments are the gold standard for inferring causal relationships. However, experiments can be expensive, unethical, or simply impossible given available technology. Observational studies thus remain an important source of information for many problems, and methodology that proposes causal relationships based on analysis of observational data is valuable for hypothesis generation and accelerating scientific discovery.
In this paper, we consider multivariate data from an observational study, specifically, the observations form an independent, identically distributed sample. We encode the causal structure generating dependences in the underlying
-variate joint distribution by a graphwith vertex set . Each node, , corresponds to an observed variable in , and each directed edge, , indicates that has a direct causal effect on . Thus, positing causal structure is equivalent to selecting a graph. We will only consider directed acyclic graphs (DAGs), so directed graphs which do not contain directed cycles. Given the correspondence between a node
and the random variable, we will at times let stand in for ; for instance, when stating stochastic independence relations.
In general, discovery of causal structure from observational data is difficult because of the super-exponential set of possible models, some of which may be indistinguishable from others. This can be especially complicated in settings where the number of variables, , is comparable or even exceeds the number of samples, . Despite the inherent difficulty of the problem, many methods for causal discovery have been developed and have seen fruitful applications; see the recent review of Drton and Maathuis (2017). In particular, the celebrated PC algorithm (Spirtes et al., 2000) is a constraint-based method which first infers a set of conditional independence relationships and then identifies the associated Markov equivalence class. The class contains all DAGs compatible with the inferred conditional independences. For the graphs from Figure 1, for example, conditional independence tests can distinguish model (a) from the rest but cannot distinguish models (b), (c), and (d) from each other. Results on the size and number of Markov equivalence classes, which may be exponentially large, can be found e.g. in Steinsky (2013). Kalisch and Bühlmann (2007) show if the maximum total degree of the graph is controlled and the data is Gaussian, then the PC algorithm can consistently recover the true Markov equivalence class also in high-dimensional settings. Harris and Drton (2013) extend the result to Gaussian copula models using rank correlations. However, although Maathuis et al. (2009) provide a procedure for bounding the size of a causal effect over graphs within an equivalence class, scientific interpretation of the set of possibly conflicting graphs can remain difficult.
In contrast, it has been shown that under various additional assumptions, the exact graph can be recovered from observational data (Loh and Bühlmann, 2014; Peters and Bühlmann, 2014; Ernest et al., 2016). In particular, Shimizu et al. (2006) show that this is the case when the data follow a linear non-Gaussian acyclic model (LiNGAM). The four models from Figure 1 are then mutually distinguishable. Shimizu et al. (2006)
use independent component analysis to estimate graph structure, and the subsequent DirectLINGAM(Shimizu et al., 2011) and Pairwise LiNGAM (Hyvärinen and Smith, 2013) methods iteratively select a causal ordering by computing pairwise statistics. However, the methods proposed for LiNGAMs are inconsistent in high-dimensional settings that allow the number of variables to scale as fast or faster than the sample size.
In this work, we consider recursive linear structural equation models with non-Gaussian errors, as in Shimizu et al. (2006)
. We propose a new test statistic which encodes causal direction and is suitable for the high-dimensional case. We also propose a modified DirectLiNGAM algorithm and show, under certain conditions and choice of tuning parameter, that it consistently recovers the causal graph even in high-dimensional settings and under slight model misspecification. Most notably, our theoretical analysis considers restricted maximum in-degree of the graph and assumes log-concave distributions. Our guarantees also apply to hub graphs where the maximum out-degree may grow with the size of the graph. This corresponds to many observed biological networks(Hao et al., 2012) which do not satisfy the conditions needed for high-dimensional consistency of the PC algorithm (Kalisch and Bühlmann, 2007).
2. Causal discovery algorithm
2.1. Generative model and notation
Consider real-valued -variate observations
which are independent, identically distributed replications. We assume that the observations are generated from a linear structural equation model so that the elements of each random vectorsatisfy
where the are unknown real parameters that quantify the direct linear effect of variable on variable , and is an error term of unknown distribution . We assume has mean 0 and is independent of all other error terms. Our interest is in models that postulate that a particular set of coefficients is zero. In particular, the absence of an edge, , indicates that the model constrains the parameter to zero. Assuming that the graph representation, , is a DAG implies that the structural equation model is recursive, that is, there exists a permutation of the variables, , such that is constrained to be zero unless .
We denote the model given by graph by . Each distribution is induced through a choice of linear coefficients and error distributions . Let be the matrix determined by the model constraints and the chosen free coefficients. Then the equations in (1) admit a unique solution with . The error vectors are independent and identically distributed and follow the product distribution . The distribution is then the joint distribution for that is induced by the transformation of .
Standard notation has the set comprise the parents of a given node . The set of ancestors contains any node with a directed path from to ; we let . The set of descendants contains the nodes with .
2.2. Parental faithfulness
An important approach to causal discovery begins by inferring relations such as conditional independence and then determines graphs compatible with empirically found relations. For this approach to succeed, the considered relations must correspond to structure in the graph as opposed to a special choice of parameters. In the context of conditional independence, the assumption that any relation present in an underlying joint distribution corresponds to the absence of certain paths in is known as the faithfulness assumption; see Uhler et al. (2013) for a detailed discussion. For our work, we define a weaker condition, parental faithfulness. In particular, if , we require that the total effect of on does not vanish when we modify the considered distribution by regressing onto any set of its non-descendants.
Let be a directed path in , so for all . Given coefficients , the path weight of is . Let be the set of all directed paths from to . Then the total effect of on is . If , the effect characterizes the conditional mean with . By convention, . If , then . The total effects may be calculated by inverting . Indeed, .
Let be the covariance matrix of the, for convenience, centered random vector . Let be the principal sub-matrix for a choice of indices . For , let be the sub-vector comprised of the entries in places for . Let
be the population regression coefficients when is regressed onto . The quantity is defined even if , and in general even if . A pair is parentally faithful if for any set , the residual total effect is
If this holds for every pair , we say that the joint distribution is parentally faithful with respect to
. Parental faithfulness only pertains to the linear coefficients and error variances, and the choices for which parental faithfulness fails form a set of (Lebesgue) measure zero. The concept is exemplified in Figure2.
2.3. Test statistic
In general, reliable determination of the causal direction between and requires all confounding to be removed. Thus, Shimizu et al. (2011) and Hyvärinen and Smith (2013) adjust and for all such that and . However, adjusting by an increasingly larger set of variables propagates error proportional to the number of variables, rendering high-dimensional estimation inconsistent, or impossible when the size of the adjustment set exceeds the sample size. On the other hand, restricting the size of the adjustment sets, may not remove confounding completely. The method we present provides a solution to this problem via the use of a test statistic that can simultaneously certify causal direction and a lack of confounding.
Shimizu et al. (2011) calculate the kernel based mutual information between and the residuals of when it is regressed onto . In general, the corresponding population information is positive if and only if or there is uncontrolled confounding between and , that is, and have a common ancestor even when certain edges are removed from the graph. Hence, the mutual information can be used to test the hypothesis that versus the hypothesis that or there is confounding between and . Unfortunately, calculating the mutual information can be computationally burdensome, so Hyvärinen and Smith (2013) propose a different parameter . Without confounding, if and if . With confounding, however, the parameter can take either sign, so it cannot be reliably used if we remain uncertain about whether or not confounding occurs.
We introduce a parameter which shares the favorable properties of the mutual information used by Shimizu et al. (2011)
but admits computationally inexpensive estimators. These estimators are rational functions of the sample moments ofwhich facilitates analysis of error propagation. Define for any the residual , where are the population regression coefficients from (2).
Let be a distribution in the model given by a DAG , and let . For , two distinct nodes and , and any set , define
If and , then .
Suppose is parentally faithful with respect to . If and , then for generic error moments up to order , we have .
Estimators of the parameter from (4) are naturally obtained from empirical regression coefficients and empirical moments.
In Theorem 12, the term generic indicates that the set of error moments for which this statement does not hold has measure zero. Given that there is a finite number of sets , the union of all exceptional sets is also a null set. A detailed proof of Theorem 1 is included in the Supplement. Claim 1 can be shown via direct calculation, and we give a brief sketch of 2 here. For fixed coefficients and set , is a rational function of the error moments. Thus existence of a single choice of error moments for which is sufficient to show that the statement holds for generic error moments. As the argument boils down to showing that a certain polynomial is not the zero polynomial (Okamoto, 1973), the choice considered need not necessarily be realizable by a particular distribution. In particular, we choose all moments of order less than
equal to those of the centered Gaussian distribution with variance, but for the th moment we add an offset , so
where is a double factorial. If there is no confounding between and , that is, no ancestor of is the source of a directed path to that avoids , then
with , by the assumed parental faithfulness. Thus, a choice of offsets with implies . A more involved but similar argument can be made in the case of confounding.
Let and be two distributions that each have all moments up to order equal to those of some Gaussian distribution. Then there exists a graph , for which , and distributions which are parentally faithful with respect to , but for some set .
Corollary 1 confirms that the null set to be avoided in Theorem 12 contains any point for which all error moments are consistent with some Gaussian distribution. Thus, our identification of causal direction requires that the error moments of order at most be inconsistent with all Gaussian distributions. In practice, we consider . Hoyer et al. (2008) give a full characterization of when graphs with both Gaussian and non-Gaussian errors are identifiable.
We now specify a parameter that reveals whether a node has a parent in some set .
Let , let , and consider two disjoint sets . For a chosen non-negative integer , define
where if and if .
If and , then
Suppose for all . If and , then for generic error moments of order up to , we have and .
1 The statement follows immediately from Theorem 1. 2 Since, , but , there exists some such that . For that and any , the residual total effect is because the assumed facts and imply that for all and . We have assumed , so, by Theorem 1, generic error moments will ensure that , which in turn implies that for . ∎
Theorem 12 requires parental faithfulness since we consider arbitrary , whereas Corollary 22 only requires that since we maximize over . Use of sample moments yields estimates , which in turn yields estimates of for . In the remainder of the paper, we drop the subscript in statements that apply to both parameters/estimators. Moreover, as we always fix , we lighten notation by omitting the superscript, writing , and .
3. Graph estimation procedure
We now present a modified DirectLiNGAM algorithm which estimates the underlying causal structure. At each step, we identify a root (i.e., a node without a parent), then recur on the sub-graph induced by removing the identified root. After step , we will then have a -tuple, , which gives an ordering of the roots identified so far, and are the remaining unordered nodes. We say that is consistent with a valid ordering of if for every , only if and . For each step, we identify a root by selecting the node with the smallest statistic for some .
In contrast to the DirectLiNGAM method, the proposed algorithm does not adjust for all non-descendants, but only for subsets of limited size. This is required for meaningful regression residuals when the number of variables exceeds the sample size and also limits error propagation from the estimated linear coefficients. However, if we naively allow , the number of subsets such that grow at . Thus, we reduce computational effort by pruning nodes which are not parents of by letting
for some cut-off . In practice, specifying a priori can be difficult, so we use a data-driven rising cut-off , where is the root selected at step and is a non-negative tuning parameter. A larger value of prunes more aggressively, decreasing the computational effort. However, setting too large could result in incorrect estimates if some parent of is incorrectly pruned from . Section 3.2 discusses selecting an appropriate .
For fixed , , and , the signs of (min-max) and (max-in) will always agree, so both could be used to certify whether . When the parameters are positive, , so the min-max statistic may be more robust to sampling error when testing if the parameter is non-zero. However, it comes at a greater computational cost. At each step, decreases by a single node and may grow by a single node. If the values of values have been stored, updating , the max-min, only requires testing the subsets of which include the variable selected at the previous step. Updating the min-max statistic without redundant computation would require storing the values of . In practice, we completely recompute at each step. Section 4 demonstrates this trade-off between computational burden and robustness on simulated data.
3.2. Deterministic statement
Theorem 2 below makes a deterministic statement about sufficient conditions under which Algorithm 1 will output a specific graph when given data . We assume each but allow model misspecification so that may not be in for any DAG . However, we require that the sample moments of are close enough to the population moments for some distribution . For notational convenience, for and , let denote a sample moment estimated from data , and let denote a population moment for .
For some -variate distribution , there exists a DAG with for all such that:
For all and with and ; if then the population quantities for satisfy .
For all and with and , if , then the population quantities for satisfy .
The population covariance of has minimum eigenvalue
has minimum eigenvalue.
All population moments of up to degree , for , are bounded by a constant for positive integer .
All sample moments of up to degree , for , are within of the corresponding population values of with .
The constraint in Condition (C3) that is only used to facilitate simplification of the error bounds, and not otherwise necessary. Condition (C1) is a faithfulness type assumption on , and in Theorem 2 we make a further assumption on which ensures strong faithfulness. However, it is not strictly stronger or weaker than the Gaussian strong faithfulness type assumption. In particular the condition requires the linear coefficients and error moments to be jointly “sufficiently parentally faithful and non-Gaussian,” so even if the linear coefficients would not have otherwise satisfied strong faithfulness, sufficiently non-Gaussian moments could still ensure the strong faithfulness condition holds.
Finally, let be the subset of distributions with whenever and . Then the set of linear coefficients and error moments that induce an element of has measure zero. This set difference includes distributions which are not parentally faithful with respect to and distributions for which there exist a parent/child pair for which both error distributions have Gaussian moments up to order .
For some -variate distribution and data :
Suppose Condition (C1) holds. Then there exists a DAG such that and is unique such that for any other DAG with maximum in-degree .
The main result of Theorem 2 is part (ii). The identifiability of a DAG was previously shown by Shimizu et al. (2006) by appealing to results for independent component analysis; however, our direct analysis of rational functions of allows for an explicit tolerance for how sample moments of may deviate from corresponding population moments of . This implicitly allows for model misspecification; see Corollary 3. The proof of Theorem 2 requires Lemmas 1-3, which we develop first. The lemmas are proven in the Supplement. Recall that are the population regression coefficients from (2), and let denote the coefficients estimated from .
Recall, that . Let denote the analogous quantity for , and let .
The proof of Lemma 2 relies on the fact that the map from moments of to the quantities of interest are Lipschitz continuous within a bounded domain.
The proof of Lemma 3 is an application of the triangle inequality.
of Theorem 2.
(ii) We proceed by induction. By Lemma 3 and assuming (7), each statistic is within of the corresponding population quantity. Thus, any statistic corresponding to a parameter with value 0 is less than and, by Condition (C1) and the condition on in (7), all statistics corresponding to a non-zero parameter are greater than .
Recall that is a topological ordering of nodes. Assume for some step , that is consistent with a valid ordering of . Let so that any is a root in the subgraph induced by and is consistent with . The base case for is trivially satisfied since .
Setting does not incorrectly prune any parents, so , which implies for all . Similarly, for any , there exists with for all . Thus, for every and . This implies the next root selected, must be in , and thus remains consistent with .
The statement in Theorem 2 concerned an explicit cut-off . In practice, we specify a tuning parameter which is easier to interpret and tune. Letting ensures that under the specified conditions, , so no parents will be incorrectly pruned and the estimated topological ordering will still be correct. This, however, may not remove all ancestors which are not parents so the estimated edge set may still a superset of the true edge set. Specifically, letting
which under the assumptions is strictly greater than 1, will correctly prune ancestors and not parents. Setting too large may result in an incorrect estimate of the ordering since a true parent may be errantly pruned. Thus, we advocate a more conservative approach of setting which anecdotally is more robust to sampling error and violations of strong faithfulness.
Suppose but is not necessarily parentally faithful with respect to . If and for any , then for generic error moments a correct ordering will still be recovered consistently as .
Corollary 22 holds even without parental faithfulness. So for generic error moments, such that for all for all steps . However, without parental faithfulness, a parent node may be errantly pruned if . To ensure Corollary 21 holds, we need for all , which is satisfied by letting . For fixed , since as , then there exists some sufficiently small so that .
3.3. High-dimensional consistency
We now consider a sequence of graphs, observations, and distributions indexed by the number of variables, : , , , and . For notational brevity, we do not explicitly include the index in the notation. The following corollary states conditions sufficient for the conditions of Theorem 2
to hold with probability tending to 1. We now make explicit assumptions onand let denote the population moments of . Again, we allow for misspecification, but require control of the distance between population moments of and some .
is a log-concave distribution.
All population moments of up to degree , for , are bounded by .
Each population moment of up to degree , for , is within of the corresponding population moment of .
When is actually generated from a recursive linear structural equation model, Condition (C7) trivially holds with and log-concave errors imply that is log-concave.
Conditions (C6) and (C7) imply Condition (C3). It remains to be shown that Condition (C4) and (7) hold for the specified in Condition (C1). Solving the inequality in Lemma 3 for shows (7) will be satisfied if the sample moments of are within of the population moments such that with less than
for defined in (8). Since , ensure that first term is the relevant term. We further simplify the expression since
Thus, the conditions of Theorem 2 will be satisfied if
Specifically, we analyze the case when and for all . If follows a log-concave distribution, we can directly apply Lemma B.3 of Lin et al. (2016) which states for , some degree polynomial of log-concave random variables , and some absolute constant, , if
Letting be the sample moments of up to degree , Condition (C6) implies the variance is bounded by