# High-Dimensional Causal Discovery Under non-Gaussianity

We consider data from graphical models based on a recursive system of linear structural equations. This implies that there is an ordering, σ, of the variables such that each observed variable Y_v is a linear function of a variable-specific error term and the other observed variables Y_u with σ(u) < σ(v). The causal relationships, i.e., which other variables the linear functions depend on, can be described using a directed graph. It has been previously shown that when the variable-specific error terms are non-Gaussian, the exact causal graph, as opposed to a Markov equivalence class, can be consistently estimated from observational data. We propose an algorithm that yields consistent estimates of the graph also in high-dimensional settings in which the number of variables may grow at a faster rate than the number of observations but in which the underlying causal structure features suitable sparsity; specifically, the maximum in-degree of the graph is controlled. Our theoretical analysis is couched in the setting of log-concave error distributions.


## 1. Introduction

Randomized experiments are the gold standard for inferring causal relationships. However, experiments can be expensive, unethical, or simply impossible given available technology. Observational studies thus remain an important source of information for many problems, and methodology that proposes causal relationships based on analysis of observational data is valuable for hypothesis generation and accelerating scientific discovery.

In this paper, we consider multivariate data from an observational study; specifically, the observations form an independent, identically distributed sample. We encode the causal structure generating dependences in the underlying p-variate joint distribution by a graph G = (V, E) with vertex set V = {1, …, p}. Each node, v ∈ V, corresponds to an observed variable Y_v, and each directed edge, u → v, indicates that Y_u has a direct causal effect on Y_v. Thus, positing causal structure is equivalent to selecting a graph. We will only consider directed acyclic graphs (DAGs), that is, directed graphs which do not contain directed cycles. Given the correspondence between a node v and the random variable Y_v, we will at times let v stand in for Y_v; for instance, when stating stochastic independence relations.

In general, discovery of causal structure from observational data is difficult because of the super-exponential set of possible models, some of which may be indistinguishable from others. This can be especially complicated in settings where the number of variables, p, is comparable to or even exceeds the number of samples, n. Despite the inherent difficulty of the problem, many methods for causal discovery have been developed and have seen fruitful applications; see the recent review of Drton and Maathuis (2017). In particular, the celebrated PC algorithm (Spirtes et al., 2000) is a constraint-based method which first infers a set of conditional independence relationships and then identifies the associated Markov equivalence class. The class contains all DAGs compatible with the inferred conditional independences. For the graphs from Figure 1, for example, conditional independence tests can distinguish model (a) from the rest but cannot distinguish models (b), (c), and (d) from each other. Results on the size and number of Markov equivalence classes, which may be exponentially large, can be found, e.g., in Steinsky (2013). Kalisch and Bühlmann (2007) show that if the maximum total degree of the graph is controlled and the data is Gaussian, then the PC algorithm can consistently recover the true Markov equivalence class also in high-dimensional settings. Harris and Drton (2013) extend the result to Gaussian copula models using rank correlations. However, although Maathuis et al. (2009) provide a procedure for bounding the size of a causal effect over graphs within an equivalence class, scientific interpretation of the set of possibly conflicting graphs can remain difficult.

In contrast, it has been shown that under various additional assumptions, the exact graph can be recovered from observational data (Loh and Bühlmann, 2014; Peters and Bühlmann, 2014; Ernest et al., 2016). In particular, Shimizu et al. (2006) show that this is the case when the data follow a linear non-Gaussian acyclic model (LiNGAM). The four models from Figure 1 are then mutually distinguishable. Shimizu et al. (2006) use independent component analysis to estimate graph structure, and the subsequent DirectLiNGAM (Shimizu et al., 2011) and Pairwise LiNGAM (Hyvärinen and Smith, 2013) methods iteratively select a causal ordering by computing pairwise statistics. However, the methods proposed for LiNGAMs are inconsistent in high-dimensional settings that allow the number of variables to scale as fast as or faster than the sample size.

In this work, we consider recursive linear structural equation models with non-Gaussian errors, as in Shimizu et al. (2006). We propose a new test statistic which encodes causal direction and is suitable for the high-dimensional case. We also propose a modified DirectLiNGAM algorithm and show, under certain conditions and choice of tuning parameter, that it consistently recovers the causal graph even in high-dimensional settings and under slight model misspecification. Most notably, our theoretical analysis considers restricted maximum in-degree of the graph and assumes log-concave distributions. Our guarantees also apply to hub graphs where the maximum out-degree may grow with the size of the graph. This corresponds to many observed biological networks (Hao et al., 2012) which do not satisfy the conditions needed for high-dimensional consistency of the PC algorithm (Kalisch and Bühlmann, 2007).

## 2. Causal discovery algorithm

### 2.1. Generative model and notation

Consider real-valued p-variate observations Y_1, …, Y_n which are independent, identically distributed replications. We assume that the observations are generated from a linear structural equation model so that the elements of each random vector Y_i = (Y_{vi})_{v∈V} satisfy

 (1) Y_{vi} = ∑_{u≠v} β_{vu} Y_{ui} + ε_{vi},

where the β_{vu} are unknown real parameters that quantify the direct linear effect of variable u on variable v, and ε_{vi} is an error term of unknown distribution F_v. We assume ε_{vi} has mean 0 and is independent of all other error terms. Our interest is in models that postulate that a particular set of coefficients is zero. In particular, the absence of an edge, u → v, indicates that the model constrains the parameter β_{vu} to zero. Assuming that the graph representation, G, is a DAG implies that the structural equation model is recursive, that is, there exists a permutation of the variables, σ, such that β_{vu} is constrained to be zero unless σ(u) < σ(v).

We denote the model given by graph G by M(G). Each distribution P ∈ M(G) is induced through a choice of linear coefficients and error distributions (F_v)_{v∈V}. Let B = (β_{vu}) be the p × p matrix determined by the model constraints and the chosen free coefficients. Then the equations in (1) admit a unique solution with Y_i = (I − B)^{-1} ε_i. The error vectors ε_1, …, ε_n are independent and identically distributed and follow the product distribution ⊗_{v∈V} F_v. The distribution P is then the joint distribution for Y_i that is induced by this linear transformation of ε_i.

Standard notation has the set pa(v) = {u : u → v ∈ E} comprise the parents of a given node v. The set of ancestors an(v) contains any node u with a directed path from u to v; we let An(v) = an(v) ∪ {v}. The set of descendants de(v) contains the nodes u with v ∈ an(u).
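To make the generative model concrete, the display (1) and its solution Y_i = (I − B)^{-1} ε_i can be simulated directly. The following sketch is our own illustration (function name and the Laplace error choice are assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_linear_sem(B, n):
    """Draw n i.i.d. rows of Y = (I - B)^{-1} eps for a strictly
    lower-triangular (hence recursive/acyclic) coefficient matrix B.
    Laplace errors are one convenient non-Gaussian choice."""
    p = B.shape[0]
    eps = rng.laplace(size=(n, p))       # independent, mean-zero errors
    A = np.linalg.inv(np.eye(p) - B)     # (I - B)^{-1} = matrix of total effects
    return eps @ A.T, A

# chain 0 -> 1 -> 2; entry B[v, u] holds the direct effect of u on v
B = np.zeros((3, 3))
B[1, 0], B[2, 1] = 0.8, -0.5
Y, A = sample_linear_sem(B, n=10_000)
print(A[2, 0])   # total effect of node 0 on node 2: 0.8 * (-0.5) = -0.4
```

Because B is strictly lower triangular, the identity ordering is a valid topological ordering, matching the permutation σ in the text.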

### 2.2. Parental faithfulness

An important approach to causal discovery begins by inferring relations such as conditional independence and then determines graphs compatible with empirically found relations. For this approach to succeed, the considered relations must correspond to structure in the graph as opposed to a special choice of parameters. In the context of conditional independence, the assumption that any relation present in an underlying joint distribution corresponds to the absence of certain paths in G is known as the faithfulness assumption; see Uhler et al. (2013) for a detailed discussion. For our work, we define a weaker condition, parental faithfulness. In particular, if u ∈ an(v), we require that the total effect of u on v does not vanish when we modify the considered distribution by regressing Y_v onto any set of its non-descendants.

Let ρ = (u_0, u_1, …, u_k) be a directed path in G, so u_{j-1} → u_j ∈ E for all j. Given coefficients B, the path weight of ρ is ∏_{j=1}^{k} β_{u_j u_{j-1}}. Let P(u, v) be the set of all directed paths from u to v. Then the total effect of u on v is π_{vu} = ∑_{ρ ∈ P(u,v)} (path weight of ρ). If u ∈ an(v), the effect characterizes the conditional mean of Y_v when regressed on Y_u alone. By convention, π_{vv} = 1. If u ∉ An(v), then π_{vu} = 0. The total effects may be calculated by inverting I − B. Indeed, (I − B)^{-1} = (π_{vu})_{v,u∈V}.

Let Σ be the covariance matrix of the, for convenience, centered random vector Y_i. Let Σ_{CC} be the principal sub-matrix for a choice of indices C ⊆ V. For v ∈ V, let Σ_{Cv} be the sub-vector comprised of the entries in places (c, v) for c ∈ C. Let

 (2) β_{vC} = (β_{vc.C})_{c∈C} = (Σ_{CC})^{-1} Σ_{Cv}

be the population regression coefficients when Y_v is regressed onto Y_C. The quantity is defined even if C ≠ pa(v), and in general β_{vc.C} ≠ β_{vc} even if c ∈ C ∩ pa(v). A pair (u, v) with u ∈ an(v) is parentally faithful if for any set C ⊆ V ∖ ({u} ∪ de(v)), the residual total effect is

 (3) π_{vu.C} = π_{vu} − ∑_{c∈C} β_{vc.C} π_{cu} ≠ 0.

If this holds for every pair (u, v) with u ∈ an(v), we say that the joint distribution is parentally faithful with respect to G. Parental faithfulness only pertains to the linear coefficients and error variances, and the choices for which parental faithfulness fails form a set of (Lebesgue) measure zero. The concept is exemplified in Figure 2.
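The residual total effect in (3) is computable from population quantities alone, which gives a quick way to inspect parental faithfulness for a candidate parameterization. A minimal sketch (names are ours; the covariance Σ = A diag(ω) Aᵀ with A = (I − B)^{-1} follows from the model):

```python
import numpy as np

def residual_total_effect(B, omega, v, u, C):
    """pi_{vu.C} = pi_{vu} - sum_{c in C} beta_{vc.C} * pi_{cu}, using the
    population covariance Sigma = A diag(omega) A^T with A = (I - B)^{-1}."""
    p = B.shape[0]
    A = np.linalg.inv(np.eye(p) - B)          # matrix of total effects pi_{vu}
    Sigma = A @ np.diag(omega) @ A.T
    C = list(C)
    if not C:
        return A[v, u]
    beta = np.linalg.solve(Sigma[np.ix_(C, C)], Sigma[C, v])  # eq. (2)
    return A[v, u] - beta @ A[C, u]

B = np.zeros((3, 3))
B[1, 0], B[2, 0], B[2, 1] = 1.0, 1.0, 1.0     # 0 -> 1 -> 2 and 0 -> 2
print(residual_total_effect(B, np.ones(3), v=2, u=0, C=[1]))  # -> 0.5
```

Here the total effect π_{20} = 2 shrinks to 0.5 after regressing onto node 1 but does not vanish, so the pair (0, 2) is parentally faithful for this choice of parameters.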

### 2.3. Test statistic

In general, reliable determination of the causal direction between Y_u and Y_v requires all confounding to be removed. Thus, Shimizu et al. (2011) and Hyvärinen and Smith (2013) adjust Y_u and Y_v for all variables preceding them in the causal ordering. However, adjusting by an increasingly larger set of variables propagates error proportional to the number of variables, rendering high-dimensional estimation inconsistent, or impossible when the size of the adjustment set exceeds the sample size. On the other hand, restricting the size of the adjustment sets may not remove confounding completely. The method we present provides a solution to this problem via the use of a test statistic that can simultaneously certify causal direction and a lack of confounding.

Shimizu et al. (2011) calculate the kernel-based mutual information between Y_u and the residuals of Y_v when it is regressed onto Y_u. In general, the corresponding population information is positive if and only if v ∈ an(u) or there is uncontrolled confounding between u and v, that is, u and v have a common ancestor even when certain edges are removed from the graph. Hence, the mutual information can be used to test the hypothesis that u precedes v versus the hypothesis that v precedes u or there is confounding between u and v. Unfortunately, calculating the mutual information can be computationally burdensome, so Hyvärinen and Smith (2013) propose a different parameter. Without confounding, the parameter takes one sign if u precedes v and the opposite sign if v precedes u. With confounding, however, the parameter can take either sign, so it cannot be reliably used if we remain uncertain about whether or not confounding occurs.

We introduce a parameter which shares the favorable properties of the mutual information used by Shimizu et al. (2011) but admits computationally inexpensive estimators. These estimators are rational functions of the sample moments of Y, which facilitates analysis of error propagation. Define for any C ⊆ V the residual Y_{vi.C} = Y_{vi} − β_{vC}^T Y_{Ci}, where β_{vC} are the population regression coefficients from (2).

###### Theorem 1.

Let P be a distribution in the model M(G) given by a DAG G = (V, E). For K ≥ 3, two distinct nodes u and v, and any set C ⊆ V ∖ {u, v}, define

 (4) τ^{(K)}_{v.C→u} = E_P(Y^{K−1}_{vi.C} Y_{ui}) E_P(Y^2_{vi.C}) − E_P(Y^K_{vi.C}) E_P(Y_{vi.C} Y_{ui}).

 (i) If pa(v) ⊆ C and C ∩ de(v) = ∅, then τ^{(K)}_{v.C→u} = 0.

 (ii) Suppose P is parentally faithful with respect to G. If u ∈ an(v) and C ⊆ V ∖ ({u} ∪ de(v)), then for generic error moments up to order K, we have τ^{(K)}_{v.C→u} ≠ 0.

Estimators of the parameter from (4) are naturally obtained from empirical regression coefficients and empirical moments.
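For instance, a plug-in estimator of τ^{(K)}_{v.C→u} might look as follows; this is a sketch under our own naming, not the paper's prescribed implementation:

```python
import numpy as np

def tau_hat(Y, v, u, C, K=4):
    """Estimate tau^{(K)}_{v.C->u} from eq. (4): regress Y_v on Y_C
    empirically, then combine sample moments of the residual with Y_u."""
    Yv = Y[:, v] - Y[:, v].mean()
    Yu = Y[:, u] - Y[:, u].mean()
    if len(C):
        X = Y[:, list(C)] - Y[:, list(C)].mean(axis=0)
        beta, *_ = np.linalg.lstsq(X, Yv, rcond=None)
        Yv = Yv - X @ beta                     # empirical residual Y_{v.C}
    return (np.mean(Yv**(K - 1) * Yu) * np.mean(Yv**2)
            - np.mean(Yv**K) * np.mean(Yv * Yu))
```

On data from a single edge u → v with non-Gaussian errors, τ̂ for the correctly oriented direction (residualizing the upstream node) concentrates near zero while the reversed direction is generically bounded away from zero.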

In Theorem 1(ii), the term generic indicates that the set of error moments for which the statement does not hold has measure zero. Given that there is a finite number of sets C, the union of all exceptional sets is also a null set. A detailed proof of Theorem 1 is included in the Supplement. Claim (i) can be shown via direct calculation, and we give a brief sketch of (ii) here. For fixed coefficients B and set C, τ^{(K)}_{v.C→u} is a rational function of the error moments. Thus existence of a single choice of error moments for which τ^{(K)}_{v.C→u} ≠ 0 is sufficient to show that the statement holds for generic error moments. As the argument boils down to showing that a certain polynomial is not the zero polynomial (Okamoto, 1973), the choice considered need not necessarily be realizable by a particular distribution. In particular, we choose all moments of order less than K equal to those of the centered Gaussian distribution with variance σ_v^2, but for the Kth moment we add an offset η_v, so

 (5) E(ε_v^K) = { η_v if K is odd; (K−1)!! σ_v^K + η_v if K is even },

where (K−1)!! is a double factorial. If there is no confounding between u and v, that is, no ancestor of u is also the source of a directed path to v that avoids u, then

 (6) τ^{(K)}_{v.C→u} = π_{vu.C} (π^{K−2}_{vu.C} η_u σ_v^2 − η_v σ_u^2)

with π_{vu.C} ≠ 0, by the assumed parental faithfulness. Thus, a choice of offsets with π^{K−2}_{vu.C} η_u σ_v^2 ≠ η_v σ_u^2 implies τ^{(K)}_{v.C→u} ≠ 0. A more involved but similar argument can be made in the case of confounding.
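As a sanity check (our own verification, not from the paper), in the two-node model Y_u = ε_u, Y_v = β ε_u + ε_v with K = 4 and C = ∅, expanding the moments in (4) by hand reduces τ exactly to the form of (6) with η the fourth-moment offset m4 − 3σ⁴:

```python
# Two-node model u -> v with K = 4, C empty: tau from eq. (4) computed by
# direct moment algebra, compared against the closed form of eq. (6).
def tau_two_node(beta, m2u, m4u, m2v, m4v):
    E_R3Yu = beta**3 * m4u + 3 * beta * m2u * m2v   # E(Y_v^3 Y_u)
    E_R2 = beta**2 * m2u + m2v                      # E(Y_v^2)
    E_R4 = beta**4 * m4u + 6 * beta**2 * m2u * m2v + m4v
    E_RYu = beta * m2u                              # E(Y_v Y_u)
    return E_R3Yu * E_R2 - E_R4 * E_RYu

beta, m2, m4 = 1.5, 2.0, 24.0       # Laplace(0,1) errors: m2 = 2, m4 = 24
eta = m4 - 3 * m2**2                # offset of the 4th moment from Gaussian
lhs = tau_two_node(beta, m2, m4, m2, m4)
rhs = beta * (beta**2 * eta * m2 - eta * m2)   # pi (pi^2 eta_u s_v^2 - eta_v s_u^2)
print(lhs, rhs)                     # both equal 45.0
```

A coefficient β = ±1 with identical error distributions would make the right-hand side vanish, illustrating the "generic" qualifier in Theorem 1(ii).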

###### Corollary 1.

Let F_u and F_v be two distributions that each have all moments up to order K equal to those of some Gaussian distribution. Then there exists a graph G, for which u ∈ an(v), and distributions P ∈ M(G) which are parentally faithful with respect to G, but τ^{(K)}_{v.C→u} = 0 for some set C.

###### Proof.

The moments of F_u and F_v satisfy (5) with η_u = η_v = 0. Consequently, if there exists a set C such that there is no confounding between u and v, then τ^{(K)}_{v.C→u} satisfies (6), the right-hand side of which is zero when η_u = η_v = 0. ∎

Corollary 1 confirms that the null set to be avoided in Theorem 1(ii) contains any point for which all error moments are consistent with some Gaussian distribution. Thus, our identification of causal direction requires that the error moments of order at most K be inconsistent with all Gaussian distributions. In practice, we consider a small fixed choice of K. Hoyer et al. (2008) give a full characterization of when graphs with both Gaussian and non-Gaussian errors are identifiable.

We now specify a parameter that reveals whether a node v has a parent in a given set of nodes.

###### Corollary 2.

Let v ∈ V, let J be a non-negative integer, and consider two disjoint sets V1, V2 ⊆ V ∖ {v}. For the chosen J, define

 T^{(K)}_1(v, V1, V2) = min_{C ∈ V1(J)} max_{u ∈ V2} |τ^{(K)}_{v.C→u}|,  T^{(K)}_2(v, V1, V2) = max_{u ∈ V2} min_{C ∈ V1(J)} |τ^{(K)}_{v.C→u}|,

where V1(J) = {C ⊆ V1 : |C| ≤ J} if J < |V1| and V1(J) is the set of all subsets of V1 if J ≥ |V1|.

 (i) If pa(v) ⊆ V1 and |pa(v)| ≤ J, then

 T^{(K)}_1(v, V1, V2) = T^{(K)}_2(v, V1, V2) = 0.

 (ii) Suppose C ∩ de(v) = ∅ for all C ∈ V1(J). If an(v) ∩ V2 ≠ ∅ and π_{va.C} ≠ 0 for some a ∈ an(v) ∩ V2 and all C ∈ V1(J), then for generic error moments of order up to K, we have T^{(K)}_1(v, V1, V2) > 0 and T^{(K)}_2(v, V1, V2) > 0.

###### Proof.

(i) The statement follows immediately from Theorem 1. (ii) Since an(v) ∩ V2 ≠ ∅, but a ∉ C for any C ∈ V1(J), there exists some a ∈ an(v) ∩ V2 such that the residual total effect π_{va.C} ≠ 0 for every C ∈ V1(J). For that a and any C ∈ V1(J), the residual total effect is nonzero because the assumed facts C ∩ de(v) = ∅ and a ∉ V1 imply that the adjustment does not cancel the effect of a on v. We have assumed C ∩ de(v) = ∅, so, by Theorem 1, generic error moments will ensure that τ^{(K)}_{v.C→a} ≠ 0, which in turn implies that T^{(K)}_i(v, V1, V2) > 0 for i = 1, 2. ∎

Theorem 1(ii) requires parental faithfulness since we consider arbitrary u ∈ an(v), whereas Corollary 2(ii) only requires that π_{va.C} ≠ 0 for a single ancestor a since we maximize over u ∈ V2. Use of sample moments yields estimates τ̂^{(K)}_{v.C→u}, which in turn yield estimates T̂^{(K)}_i(v, V1, V2) of T^{(K)}_i(v, V1, V2) for i = 1, 2. In the remainder of the paper, we drop the subscript i in statements that apply to both parameters/estimators. Moreover, as we always fix K, we lighten notation by omitting the superscript, writing τ_{v.C→u}, T(v, V1, V2), and T̂(v, V1, V2).
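The aggregation in Corollary 2 is a small combinatorial layer over the τ estimates. A sketch of that layer (the callable `tau` abstracts any estimator of τ̂_{v.C→u}; the names are ours):

```python
from itertools import combinations

def T_stats(tau, v, V1, V2, J):
    """T1 = min over C of max over u; T2 = max over u of min over C,
    with C ranging over subsets of V1 of size at most J."""
    sets = [C for j in range(min(J, len(V1)) + 1) for C in combinations(V1, j)]
    T1 = min(max(abs(tau(v, u, C)) for u in V2) for C in sets)
    T2 = max(min(abs(tau(v, u, C)) for C in sets) for u in V2)
    return T1, T2

# toy oracle: tau vanishes whenever node 1 (the only parent of v) is adjusted for
oracle = lambda v, u, C: 0.0 if 1 in C else 1.0
print(T_stats(oracle, v=0, V1=[1, 2], V2=[3, 4], J=1))  # -> (0.0, 0.0)
```

By the minimax inequality, T1 ≥ T2 always holds, consistent with the discussion of the two statistics in Section 3.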

## 3. Graph estimation procedure

### 3.1. Algorithm

We now present a modified DirectLiNGAM algorithm which estimates the underlying causal structure. At each step, we identify a root (i.e., a node without a parent), then recur on the sub-graph induced by removing the identified root. After step z, we will then have a z-tuple, θ(z) = (θ_1, …, θ_z), which gives an ordering of the roots identified so far, and Θ(z) = V ∖ {θ_1, …, θ_z} are the remaining unordered nodes. We say that θ(z) is consistent with a valid ordering of G if for every j, v = θ_j only if pa(v) ⊆ {θ_1, …, θ_{j−1}}. For each step, we identify a root by selecting the node with the smallest statistic T̂(v, V1, V2) for a suitable choice of V1 and V2.

In contrast to the DirectLiNGAM method, the proposed algorithm does not adjust for all non-descendants, but only for subsets of limited size. This is required for meaningful regression residuals when the number of variables exceeds the sample size and also limits error propagation from the estimated linear coefficients. However, if we naively allowed all conditioning sets, the number of subsets C with |C| ≤ J grows at order z^J at step z. Thus, we reduce computational effort by pruning nodes which are not parents of v by letting

 Ĉ(z)_v = { p ∈ θ(z−1) : min_{C ⊆ θ(z−1)∖{p}, |C| ≤ J} |τ̂_{v.C→p}| > g(z) }

for some cut-off g(z). In practice, specifying g(z) a priori can be difficult, so we use a data-driven rising cut-off g(z) = α T̂(θ_z), where θ_z is the root selected at step z, T̂(θ_z) is its selection statistic, and α is a non-negative tuning parameter. A larger value of α prunes more aggressively, decreasing the computational effort. However, setting α too large could result in incorrect estimates if some parent of v is incorrectly pruned from Ĉ(z)_v. Section 3.2 discusses selecting an appropriate α.

For fixed v, V1, and V2, the signs of T1 (min-max) and T2 (max-min) will always agree, so both could be used to certify whether the population parameter is zero. When the parameters are positive, T1 ≥ T2, so the min-max statistic may be more robust to sampling error when testing if the parameter is non-zero. However, it comes at a greater computational cost. At each step, V2 decreases by a single node and V1 may grow by a single node. If the values of min_C |τ̂_{v.C→u}| have been stored, updating T2, the max-min, only requires testing the subsets of V1 which include the variable selected at the previous step. Updating the min-max statistic without redundant computation would require storing the values of |τ̂_{v.C→u}| for every pair (C, u). In practice, we completely recompute T1 at each step. Section 4 demonstrates this trade-off between computational burden and robustness on simulated data.
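Putting the pieces together, the greedy ordering step can be sketched end to end. This is our own condensed illustration: it recomputes the max-min statistic at every step, conditions only on already-ordered nodes, and omits the pruning rule:

```python
import numpy as np
from itertools import combinations

def tau_hat(Y, v, u, C, K=4):
    """Plug-in estimate of tau^{(K)}_{v.C->u} from eq. (4)."""
    Yv = Y[:, v] - Y[:, v].mean()
    Yu = Y[:, u] - Y[:, u].mean()
    if C:
        X = Y[:, list(C)] - Y[:, list(C)].mean(axis=0)
        beta, *_ = np.linalg.lstsq(X, Yv, rcond=None)
        Yv = Yv - X @ beta
    return (np.mean(Yv**(K - 1) * Yu) * np.mean(Yv**2)
            - np.mean(Yv**K) * np.mean(Yv * Yu))

def causal_order(Y, J=1, K=4):
    """Greedy root search: repeatedly pick the remaining node whose
    max-min statistic over the other remaining nodes is smallest."""
    remaining = list(range(Y.shape[1]))
    ordered = []
    while remaining:
        def T2(v):
            others = [u for u in remaining if u != v]
            if not others:
                return 0.0
            sets = [C for j in range(min(J, len(ordered)) + 1)
                    for C in combinations(ordered, j)]
            return max(min(abs(tau_hat(Y, v, u, C, K)) for C in sets)
                       for u in others)
        root = min(remaining, key=T2)   # smallest statistic = most root-like
        ordered.append(root)
        remaining.remove(root)
    return ordered
```

On a chain 0 → 1 → 2 with non-Gaussian errors and generic coefficients, the returned list recovers the causal ordering [0, 1, 2] for large n.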

### 3.2. Deterministic statement

Theorem 2 below makes a deterministic statement about sufficient conditions under which Algorithm 1 will output a specific graph when given data Y = (Y_1, …, Y_n). We assume each Y_i ∼ P but allow model misspecification so that P may not be in M(G) for any DAG G. However, we require that the sample moments of Y are close enough to the population moments for some distribution P_0. For notational convenience, let a hatted moment denote a sample moment estimated from data Y, and let an unhatted moment denote a population moment for P_0.

###### Condition (C1).

For some p-variate distribution P_0, there exists a DAG G_0 = (V, E_0) with maximum in-degree max_v |pa(v)| ≤ J such that:

 (i) For all v, u ∈ V and C ⊆ V with |C| ≤ J and C ∩ de(v) = ∅: if pa(v) ⊆ C, then the population quantities for P_0 satisfy τ_{v.C→u} = 0.

 (ii) For all v, u ∈ V and C ⊆ V with |C| ≤ J and C ∩ de(v) = ∅: if u ∈ an(v) ∖ C, then the population quantities for P_0 satisfy |τ_{v.C→u}| ≥ γ > 0.

###### Condition (C2).

The population covariance matrix of P_0 has minimum eigenvalue at least λ_min > 0.

###### Condition (C3).

All population moments of P_0 up to degree 2K, for K ≥ 3, are bounded by a constant M ≥ 1.

###### Condition (C4).

All sample moments of Y up to degree 2K, for K ≥ 3, are within δ_1 of the corresponding population values of P_0, with δ_1 ≤ M.

The constraint in Condition (C3) that M ≥ 1 is only used to facilitate simplification of the error bounds, and is not otherwise necessary. Condition (C1) is a faithfulness-type assumption on P_0, and in Theorem 2 we make a further assumption on γ which ensures strong faithfulness. However, it is not strictly stronger or weaker than the Gaussian strong faithfulness type assumption. In particular, the condition requires the linear coefficients and error moments to be jointly "sufficiently parentally faithful and non-Gaussian," so even if the linear coefficients would not have otherwise satisfied strong faithfulness, sufficiently non-Gaussian moments could still ensure that the strong faithfulness condition holds.

Finally, let M_γ(G_0) be the subset of distributions in M(G_0) with |τ_{v.C→u}| ≥ γ whenever τ_{v.C→u} ≠ 0 and |C| ≤ J. Then the set of linear coefficients and error moments that induce an element of M(G_0) ∖ ∪_{γ>0} M_γ(G_0) has measure zero. This set difference includes distributions which are not parentally faithful with respect to G_0 and distributions for which there exists a parent/child pair for which both error distributions have Gaussian moments up to order K.

###### Theorem 2.

For some p-variate distribution P_0 and data Y:

 (i) Suppose Condition (C1) holds. Then there exists a DAG G_0 such that P_0 ∈ M(G_0), and G_0 is unique in that P_0 ∉ M(G) for any other DAG G with maximum in-degree at most J.

 (ii) Furthermore, suppose Conditions (C1)-(C4) hold for constants which satisfy

 (7) γ/2 > δ_3 := 4Mδ_1 {16(3K)(J+K) K^K J^{(K+4)/2} M^{K+1} / λ_min^{K+1}} + 2 [δ_1 {16(3K)(J+K) K^K J^{(K+4)/2} M^{K+1} / λ_min^{K+1}}]^2.

Then with pruning parameter g = γ/2, Algorithm 1 will output G_0.

The main result of Theorem 2 is part (ii). The identifiability of a DAG was previously shown by Shimizu et al. (2006) by appealing to results for independent component analysis; however, our direct analysis of rational functions of moments allows for an explicit tolerance for how sample moments of Y may deviate from corresponding population moments of P_0. This implicitly allows for model misspecification; see Corollary 3. The proof of Theorem 2 requires Lemmas 1-3, which we develop first. The lemmas are proven in the Supplement. Recall that β_{vC} are the population regression coefficients from (2), and let β̂_{vC} denote the coefficients estimated from Y.

###### Lemma 1.

Suppose Conditions (C2), (C3), and (C4) hold. Then for any v ∈ V and C ⊆ V ∖ {v} with |C| ≤ J,

 ‖β̂_{vC} − β_{vC}‖_∞ < δ_2 = 4 J^{3/2} M δ_1 / λ_min^2.

Recall that Y_{vi.C} = Y_{vi} − β_{vC}^T Y_{Ci}. Let Ŷ_{vi.C} = Y_{vi} − β̂_{vC}^T Y_{Ci} denote the analogous quantity for the estimated coefficients, and let Z ∼ P_0 with Z_{v.C} = Z_v − β_{vC}^T Z_C.

###### Lemma 2.

Suppose that Conditions (C2), (C3), and (C4) hold. Let s, r be non-negative integers such that s + r ≤ K, and let v, u ∈ V. For any C ⊆ V ∖ {v} such that |C| ≤ J,

 | (1/n) ∑_i Ŷ^s_{vi.C} Y^r_{ui} − E(Z^s_{v.C} Z^r_u) | < δ_1 Φ(J, K, M, λ_min),

where

 (8) Φ(J, K, M, λ_min) = 16(3K)(J+K) K^K J^{(K+4)/2} M^{K+1} / λ_min^{K+1}.

The proof of Lemma 2 relies on the fact that the maps from moments of Y to the quantities of interest are Lipschitz continuous within a bounded domain.

###### Lemma 3.

Suppose that Conditions (C2), (C3), and (C4) hold. Then

 |τ̂_{v.C→u} − τ_{v.C→u}| < 4Mδ_1 Φ(J, K, M, λ_min) + 2{δ_1 Φ(J, K, M, λ_min)}^2 = δ_3

for the function Φ given in Lemma 2.

The proof of Lemma 3 is an application of the triangle inequality.

###### Proof of Theorem 2.

(ii) We proceed by induction. By Lemma 3 and assuming (7), each statistic τ̂_{v.C→u} is within δ_3 of the corresponding population quantity. Thus, any statistic corresponding to a parameter with value 0 is less than δ_3 and, by Condition (C1) and the condition on γ in (7), all statistics corresponding to a non-zero parameter are greater than γ − δ_3 > γ/2.

Recall that θ(z) records the ordering of the nodes selected through step z. Assume for some step z that θ(z−1) is consistent with a valid ordering of G_0, and let R be the set of roots of the subgraph induced by Θ(z−1), so that appending any ρ ∈ R keeps the tuple consistent with a valid ordering of G_0. The base case for z = 1 is trivially satisfied since θ(0) is empty.

Setting g = γ/2 does not incorrectly prune any parents, so pa(v) ⊆ Ĉ(z)_v for each v ∈ Θ(z−1), which implies T̂(ρ) < δ_3 < γ/2 for all ρ ∈ R. Similarly, for any v ∈ Θ(z−1) ∖ R, there exists an ancestor a ∈ Θ(z−1) with |τ̂_{v.C→a}| > γ − δ_3 for all permissible C. Thus, T̂(v) > γ − δ_3 > γ/2 for every v ∈ Θ(z−1) ∖ R. This implies the next root selected must be in R, and thus θ(z) remains consistent with a valid ordering of G_0.

(i) The fact that P_0 ∈ M(G_0) follows directly from the definition. To show uniqueness, we use population quantities so that δ_1 = 0, which in turn implies δ_3 = 0. Then for any DAG G with maximum in-degree at most J and P_0 ∈ M(G), Algorithm 1 will return G. Thus, by (ii), G_0 must be unique. ∎

###### Remark.

Under the conditions of Theorem 2, if the pruning parameter is set below the level g = γ/2, Algorithm 1 will return a topological ordering consistent with G_0, but the estimated edge set may be a superset of E_0. However, there exists some g which will recover the exact graph.

The statement in Theorem 2 concerned an explicit cut-off g. In practice, we specify a tuning parameter α which is easier to interpret and tune. A small enough α ensures that, under the specified conditions, no parents will be incorrectly pruned and the estimated topological ordering will still be correct. This, however, may not remove all ancestors which are not parents, so the estimated edge set may still be a superset of the true edge set. Specifically, letting

 (9) α = [ min_v min_{a∈pa(v)} min_{C∩de(v)=∅} |τ̂_{v.C→a}| ] / [ max_v max_{a∈an(v)∖pa(v)} min_{C∩de(v)=∅} |τ̂_{v.C→a}| ],

which under the assumptions is strictly greater than 1, will correctly prune ancestors and not parents. Setting α too large may result in an incorrect estimate of the ordering since a true parent may be errantly pruned. Thus, we advocate a more conservative approach of setting α ≤ 1, which anecdotally is more robust to sampling error and violations of strong faithfulness.

###### Remark.

Suppose P ∈ M(G) but P is not necessarily parentally faithful with respect to G. If the maximum in-degree of G is at most J and the cut-off is chosen small enough, then for generic error moments a correct ordering will still be recovered consistently as n → ∞.

Corollary 2(ii) holds even without parental faithfulness. So for generic error moments, the selection statistics separate roots from non-roots at every step z. However, without parental faithfulness, a parent node may be errantly pruned if the corresponding residual total effect vanishes. To ensure Corollary 2(i) holds, we need pa(v) ⊆ Ĉ(z)_v for all v. For fixed p, since the estimation error δ_3 → 0 as n → ∞, there exists some sufficiently small cut-off so that no parent is pruned.

### 3.3. High-dimensional consistency

We now consider a sequence of graphs, observations, and distributions indexed by the number of variables, p: G_p, Y^{(p)}, P_p, and P_{0,p}. For notational brevity, we do not explicitly include the index p in the notation. The following corollary states conditions sufficient for the conditions of Theorem 2 to hold with probability tending to 1. We now make explicit assumptions on P and consider the population moments of P. Again, we allow for misspecification, but require control of the distance between population moments of P and some P_0.

###### Condition (C5).

P is a log-concave distribution.

###### Condition (C6).

All population moments of P up to degree 2K, for K ≥ 3, are bounded by M.

###### Condition (C7).

Each population moment of P up to degree 2K, for K ≥ 3, is within ξ of the corresponding population moment of P_0.

When P is actually generated from a recursive linear structural equation model, Condition (C7) trivially holds with ξ = 0, and log-concave errors imply that P is log-concave.

###### Corollary 3.

For a sequence of distributions P and data Y, assume Conditions (C1), (C2), (C5), (C6), and (C7) hold. For pruning parameter g = γ/2, Algorithm 1 will return the graph G_0 with probability tending to 1 if

 (10) (log p) n^{-1/(2K)} J^{5/2} K^{5/2} M^2 / (γ^{1/2} λ_min^{3/2}) → 0,  ξ (3K) K^{K+1} J^{3K/2+2} M^{K+2} / (γ λ_min^{K+1}) → 0

as n, p → ∞ with K held fixed.

###### Proof.

Conditions (C6) and (C7) imply Condition (C3). It remains to be shown that Condition (C4) and (7) hold for the γ specified in Condition (C1). Solving the inequality in Lemma 3 for δ_1 shows that (7) will be satisfied if the sample moments of Y are within δ of the population moments of P, where δ_1 = δ + ξ is less than

 min( [−8MΦ + {(8MΦ)^2 + 16Φ^2 γ}^{1/2}] / (8Φ^2), λ_min/(2J), M ) = min( [(M^2 + γ/4)^{1/2} − M] / Φ, λ_min/(2J) )

for Φ defined in (8). Since Φ grows in J, M, and 1/λ_min, the first term is the relevant term. We further simplify the expression since

 (M^2 + γ/4)^{1/2} ≥ M + γ min_{t∈(0,γ)} [∂(M^2 + t/4)^{1/2} / ∂t] = M + γ / {8(M^2 + γ/4)^{1/2}}.

Thus, the conditions of Theorem 2 will be satisfied if

 δ + ξ ≤ γ / {8(M^2 + γ/4)^{1/2} Φ} =: δ_4.

Specifically, we analyze the case when δ ≤ δ_4/2 and ξ ≤ δ_4/2 for all p. If Y follows a log-concave distribution, we can directly apply Lemma B.3 of Lin et al. (2016), which states that for δ > 0, a degree-K polynomial f of log-concave random variables Y, and some absolute constant L, if

 2L (δ / [var{f(Y)}]^{1/2})^{1/K} ≥ 2,

then

 Pr[ |f(Y) − E{f(Y)}| > δ ] ≤ exp{ −2L (δ / [var{f(Y)}]^{1/2})^{1/K} }.

Letting f(Y) be one of the needed sample moments of Y, Condition (C6) implies the variance is bounded by