Ancestor regression in linear structural equation models

by   Christoph Schultheiss, et al.

We present a new method for causal discovery in linear structural equation models. We propose a simple "trick" based on statistical testing in linear models that shall distinguish between ancestors and non-ancestors of any given variable. Naturally, this can then be extended to estimating full graphs. Unlike many methods, we provide explicit error control for false causal discovery, at least asymptotically. This holds true even under Gaussianity where various methods fail due to non-identifiable structures.


page 1

page 2

page 3

page 4


Joint estimation of linear non-Gaussian acyclic models

A linear non-Gaussian structural equation model called LiNGAM is an iden...

Case based error variance corrected estimation of structural models

A new method for estimating structural equation models (SEM) is proposed...

Decomposition and Identification of Linear Structural Equation Models

In this paper, we address the problem of identifying linear structural e...

Algebraic Equivalence of Linear Structural Equation Models

Despite their popularity, many questions about the algebraic constraints...

Empirical Likelihood for Linear Structural Equation Models with Dependent Errors

We consider linear structural equation models that are associated with m...

Communication-Efficient False Discovery Rate Control via Knockoff Aggregation

The false discovery rate (FDR)---the expected fraction of spurious disco...

Structural Equation Modeling using Computation Graphs

Structural equation modeling (SEM) is evolving as available data is beco...

1 Introduction

1.1 Our contribution and related work

We propose a very simple yet effective method to infer the ancestor variables in a linear structural equation model from observational data.

Consider a response variable of interest

and covariates in a linear structural equation model. The procedure is as follows. For a nonlinear function , for example , run a least squares regression of versus and all covariates : the p-value corresponding to the -th covariate is measuring the significance that is an ancestor variable of

, and it provides type I error control.

We refer to this methods as ancestor regression. Its power (i.e., type II error) depends on the nature of the underlying data-generating probability distribution. Obviously, the proposed method is extremely simple and easy to be used; yet, it deals with a difficult problem of finding the causal order among random variables. In particular, the proposed method does not need any new software and it is computationally very efficient.

Structure search methods based on observational data for the graphical structure in linear structural equation models have been developed extensively for various settings: for the Markov equivalence class in linear Gaussian structural equation models spirtes2001causation, Chapter 5.4; chickering2002optimal or for the single identifiable directed acyclic graph in non-Gaussian linear structural equation models (shimizu2006linear; gnecco2021causal)

or for models with equal error variances

(peters2014identifiability). All these methods do not come with p-values and type I error control. In addition, for the identifiable cases, the corresponding algorithms assume certain assumptions such as non-Gaussian errors, while our procedure does not rely on such conditions but automatically exploits whether the structure is identifiable or not.

Our contribution can be summarized in short as follows. We propose ancestor regression which, as ordinary least squares regression, is very easy to be used and equipped with type I error control for inferring ancestor variables in linear structural equation models. The method does not rely on additional assumptions such as non-Gaussianity: it automatically avoids to claim causal directions if the problem is non-identifiable.

1.2 Notation

We present some notation that is used throughout this work. We use upper case letters to denote a random variable, e.g., or . We use lower case letters to denote i.i.d. copies of a random variable, e.g., . If , then . With a slight abuse of notation, can either denote the copies or realizations thereof. We write to denote the j-th column of matrix and to denote the element in row and column . With , we emphasize that an equality between random variables is induced by a causal mechanism.

2 Ancestor regression

2.1 Model and method

Let be given by the following linear structural equation model (SEM)


where the are independent and centered random variables. We assume that such that the covariance matrix of exists and has full rank. We use the notation , , and for ’s parents, children, ancestors, and descendants. Further, assume that there exists a directed acyclic graph (DAG) representing this structure.

Let with be a variable of interest; it has been denoted as response in Section 1. Consider a nonlinear function . The following result describes the population property of ancestor regression, with general function .

Theorem 1.

Assume that the data follows the model (1). Consider the ordinary least squares regression versus , denote the according OLS parameter by and assume that it exists. Then,

Importantly, itself must be included in the set of predictors as well. The beauty of Theorem 1 lies in the fact that no assumptions on the distribution of the or the size of the

apart from existence of the moments must be taken. This allows to control against false discovery of ancestor variables. The proof of this and the subsequent theoretical results is available in the supplementary material.

Typically, holds for ancestors since a nonlinear function of that ancestor cannot be completely regressed out by the other regressors using only linear terms. For ancestors that are much further upstream, this effect might become vanishingly small. However, this is not such an issue since when fitting a linear model using the detected ancestors, those indirect ancestors are assigned a direct causal effect of anyway.

Based on Theorem 1, we suggest testing for in order to detect some or even all ancestors of . Doing so for all , requires nothing more than fitting a multiple linear model and using its corresponding z-tests for individual covariates.

Let be i.i.d. copies from the model 1. Define the following quantities


where is meant to be applied elementwise in .

Theorem 2.

Assume that the data follows the model (1), , and exists. Use the definitions from (2). Then, it holds

Due to this limiting distribution, we suggest testing the null hypothesis

with the p-value



denotes the cumulative distribution function of the standard normal distribution.

For ancestors, for which , the absolute z-statistics increases as . In typical setups, one can thus detect all ancestors. Having found all ancestors, one can easily infer the parents with an ordinary least squares regression of versus , using the -test for assigning the significance of being a parental variable.

2.2 Adversarial setups

There might be cases where does not hold true for some ancestors leading to reduced power of the method. We discuss such cases in the following.

Gaussian .

It is well-known in causal discovery for linear SEMs that Gaussian error terms lead to non-identifiability.


i.e., the children of through which a directed path from to begins.

Proposition 1.

Assume that the data follows the model (1). Let with . Then, it holds

Under the additional assumptions of Theorem 2,

Thus, if every directed path from to starts with an edge for which the nodes on either end have Gaussian noise terms, we have no power to detect this ancestor relationship. However, we neither detect the opposite direction as guaranteed by Theorem 1 and thus, control against false positives is guaranteed.

Error terms with the same distribution.

A pathological case occurs, if a child’s error term has the same distribution as the inherited contribution from the parent’s error term. Then, all ancestors to which all directed paths from start with that given child are not detectable.

Proposition 2.

Assume that the data follows the model (1). Let and . Then, it holds

For the variables discussed here, the limiting Gaussian distribution as stated in Theorem

2 is not guaranteed even though ; see also the proof in the supplemental material.

2.3 Simulation example

Figure 1: DAG of the linear SEM.

We study ancestor regression in a small simulation example. We generate data from a SEM that can be represented by the directed acyclic graph in Fig. 1 and aim to infer the ancestors of . We chose the precise setup such that neither of the adversarial cases discussed above is present. We let the sample size vary from to executing 200 simulation runs for each. As a nonlinear function, we use . By z-statistics, we mean as defined in Theorem 2. We calculate p-values according to (3) and apply a Bonferroni-Holm correction (without cutting of at for the sake of visualization) to them.

Figure 2: Detecting the ancestors of in a linear SEM corresponding to Fig. 1. The results are based on simulation runs. On the left: Average absolute z-statistics for the potential ancestors (circles), (triangles), (pluses), (crosses) and (diamonds) for different sample sizes. The dotted diagonals correspond to -growth fitted to perfectly match at . The horizontal line corresponds to , i.e., the first absolute moment of a standard normal. On the right: fraction of simulation runs with at least one false causal detection versus fraction of detected ancestors for the different sample sizes (solid), (dashed), (dotted) and (dot-dashed). The curve uses the level of the test as implicit curve parameter. The pluses correspond to .

In Fig. 1, we see the desired -growth of the absolute z-statistics for the ancestors, while for and their sample averages are close to their theoretical mean. The indirect ancestor is slightly harder to detect than the two parents. Although the null distribution is only asymptotically achieved, the type I family-wise error rate is controlled for every sample size, supporting our method’s main benefit, i.e., robustness against false causal discovery.

3 Ancestor detection in graphs: nodewise and recursive

3.1 Algorithm

In the previous section, we assumed that there is a (response) variable that is of special interest. However, this is not always the case, instead, one might be interested in inferring the full graph of causal connections between the variables. Naturally, our ancestor detection technique can be extended to graphs by applying it nodewise. We suggest the procedure described in Algorithm 1.

Input data , significance level and nonlinear function
      Output Estimated set of ancestors , adjusted significance level

1:for  to  do
2:     Calculate using (2) and (3)
3:Apply a multiplicity correction to the list of denote the corrected p-values by
4:Define such that and
5: FindStructure(, )
6:for  to  do
8:procedure FindStructure(, )
9:     Define such that if and else
10:     BuildRecursive()
12:     if  then
13:         return
14:     else
16:          FindStrucure(, )
17:         BuildRecursive()
18:         return      
19:procedure BuildRecursive()
20:     for  to  do
22:     for  to  do
24:         while  do
25:              for  do
26:                   and                        
28:     return
Algorithm 1 Graph detection

In BuildRecursive the graph is recursively constructed by adding the estimated ancestors of every estimated ancestor. This recursive construction facilitates the detection of all ancestors. This procedure cannot increase the type I family-wise error rate compared to just using the significant p-values because a false causal discovery can only be propagated if it existed in the first place.

Since there is no guarantee that the recursive construction does not create directed cycles, we need to address this. In FindStructure, the significance level is gradually reduced until no more directed cycles are outputted. This means, that the output becomes somewhat independent of the significance level, e.g., in a case with two variables and and , we would never claim no matter how large is chosen.

In practice, when there is no a priori knowledge on the structure of the data, one can also consider the largest such that no loops are created as a p-value for the null hypothesis that the modeling assumption (1) holds true if the multiplicity correction is designed to control the family-wise error rate, e.g., Bonferroni-Holm. We summarize the properties of our algorithm.

Corollary 1.

Assume the conditions of Theorem 2 hold . Let and be the output of Algorithm 1 with significance level and Bonferroni-Holm correction. Then, it holds

3.2 Simulation example

We extend the simulation from Section 2.3 to estimating every variable’s ancestors using Algorithm 1. We compare our method to LINGAM using the implementation provided in the R-package pcalg (kalisch2012causal). Based on the graph given in Fig. 1, we consider two slightly different linear SEMs. In the first, which uses the same data as the simulation in Section 2.3, only follows a Gaussian distribution while as in the second, both and do such that Proposition 1 applies. For our method, we apply a Bonferroni-Holm correction (without cutting off at for the sake of visualization). As LINGAM provides an estimated set of parents, we additionally apply BuildRecursive to the output to get an estimated set of ancestors to enable comparison with our method.

Figure 3: Graph detection in a linear SEM corresponding to Fig. 1. The results are based on simulation runs. Depicted is the family-wise error rate of false causal detection versus the fraction of detected ancestors for the different sample sizes (solid), (dashed), (dotted) and (dot-dashed). The curves use the level of the test as implicit curve parameter. The pluses correspond to . The other symbols correspond to the performance of the LINGAM algorithm for sample size (square), (circle), (triangle) and (diamond). On the left: only follows a Gaussian distribution. On the right: both and follow a Gaussian distribution.

For the model with only one Gaussian error variable, we can reliably detect all ancestors without any false causal claims for . Not all curves reach a power of even when letting the significance level become arbitrarily large. This can be explained by the possible insensitivity to the significance level as sketched in Section 3.1.

We are able to control the family-wise error rate even for low sample size, which is not the case for LINGAM. The power of LINGAM approaches faster than ancestor regression and one might consider it to be favorable for or in the model with one Gaussian noise term. The picture changes when looking at the model with multiple Gaussian noise terms. LINGAM is still more powerful for some sample sizes but does not control the error at all. Ancestor regression reaches a power of almost without any false causal discovery. The lacking are due to the fact that out of ancestor relationships depicted in Fig. 1 fulfills the conditions of Proposition 1 for the non-identifiable Gaussian case.

4 Real data example

Causal effect Suggested Passing HOLS Significant in linear model minimum p-value
RAF MEK 2 1 1 0
Akt Erk 7 2 2 5.1e-207
p38 PKC 4 1 1 1.1e-143
PKA Akt 3 1 1 5e-119
MEK RAF 5 2 2 7.1e-115
Akt PKA 4 2 2 2.1e-44
PIP3 PIP2 7 1 1 3.7e-41
PIP3 PLCg 4 2 1 9.6e-35
JNK p38 3 2 2 1.6e-17
PKC p38 4 1 1 3.6e-16
PKA Erk 3 1 1 6.6e-12
PKC JNK 4 1 1 3.1e-09
PIP2 PLCg 1 1 0 1
PLCg PIP2 5 1 0 1
Table 1: Analysis of the dataset by sachs2005causal. The second column reports the number of environments (among 8 possible ones) in which the edge connection is suggested by Algorithm 1. The third columns shows in how many of these environments it then passes the HOLS check. The fourth column additionally reports, in how many of these it is also significant at the

level in the respective linear model fit, i.e., a direct (parental) causal edge, correcting for the total number of suggested connections in all environments. The last column reports the minimum of the p-values from linear regression among environments, where the edge connection is suggested and passes the HOLS check.

We analyze the flow cytometry dataset presented by sachs2005causal. It contains cytometry measurements of 11 phosphorylated proteins and phospholipids. Data is available from various experimental conditions, some of which are interventional environments. The dataset has been further analyzed in various projects, see, e.g., mooij2013cyclic, meinshausen2016methods and taeb2021perturbations. Following these works, we consider data from 8 different environments, 7 of which are interventional. The sample size per environment ranges from to .

For each environment, we fit a graph using Algorithm 1 with nonlinear function and Boferroni-Holm correction. For the claimed set of ancestors, we subsequently fit an ordinary linear regression to see which ancestors might be parents.

The outputted ranges from to . This happens due to the occurrence of estimated directed cycles and thus, there is strong indication that the data does not follow the model (1). The deviation can be in terms of hidden variables, nonlinear effects, or noise that is not additive. Therefore, we want to additionally incorporate the HOLS check introduced in schultheiss2022assessing that aims to find locally unconfounded linear effects in regression models (here, when regressing on the estimated ancestor variables). We find some edge connections that pass this subsequent HOLS check at the level in certain environments. We omit the multiplicity correction for this check to lower the tendency to falsely claim causal detection. Thus, we find that the assumption of locally unconfounded linear effects is not unrealistic. We summarize our findings in Table 1 which reports suggested edge connections that pass the HOLS check in at least one environment.


C. Schultheiss and P. Bühlmann have received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme.


Appendix A Proofs

a.1 Additional notation

We introduce additional notation that is used for these proofs.

Subindex , e.g., denotes a matrix with all columns but the -th. is the

-dimensional identity matrix.

denotes the orthogonal projection onto and denotes the orthogonal projection onto its complement. is the orthogonal projection onto all .

For some random vector

, we have the moment matrix . This equals the covariance matrix for centered . We assume this matrix to be invertible. Then, the principal submatrix is also invertible. We denote statistical independence by .

a.2 Previous work

We adapt some definitions from and results proven in schultheiss2022assessing.

where (4)

Using these definitions, we have from partial regression. We cite a Lemma fundamental to our results.

Lemma 1.

Assume that the data follows the model (1) without hidden variables. Then,

for an appropriate set of parameters. Further, the support of (cf. (4)) is restricted to ’s Markov boundary.

a.3 Proof of Theorem 1

Let . Then, we can write for a suitable with if . We can now find using this representation.

The third equality follows from the independence of the . Naturally, for all we have such that . Further, if . To see this, note that would be lower triangular, if denoted a causal order. Then, its inverse would be lower triangular as well. Naturally, the same principle applies for every other permutation. Thus,

such that if , i.e., if is not an ancestor of .

Alternatively, we could invoke Lemma 1 to see that for a non-ancestor . Then,

a.4 Proof of Theorem 2


Since we assume the covariance matrix to be bounded, we find

where we use invertibility and the continuous mapping theorem. This then implies

and analogously

We get a better rate for than for since we assume existence of the fourth moments. Then,


The second line is restricted to covariates for which this variance exists, which includes all non-ancestors as . Using Slutsky’s theorem, we have


which proves the first part of the theorem.

It remains to consider the variance estimate. Similar to above, we have


Similar to before

Combined, we find

proving the second part of the theorem.

For non-ancestors, such that . Then,

such that the last statement of Theorem 2 follows again from Slutsky’s theorem and (6).

a.5 Proof of Proposition 1

Using the representation from above, we have

We find

The last remaining term corresponds to the only set such that , i.e., the only set that could potentially induce dependence. However, in the setup of Proposition 1, both sums are Gaussian random variables such that uncorrelatedness implies independence. Therefore, such that and .

By the same argument for all mediators between and . Generally, . As is independent of all mediators but also of all non-descendants of itself, it is independent of all ancestors of except itself. Therefore, it is also independent from

such that the variance as in (5) is consistently estimated.

a.6 Proof of Proposition 2

Following the derivation from the previous section, we see that dependence between and can only be induced through with . Since we exclude mediated paths in the assumptions of Proposition 2, dependence between and can only be induced by and . It follows

For the last equivalence, we use the assumption on the distribution.


since contributions of in must be mediated through with the according weights. We get

Since and the two terms appear with the same weight in , the two expectations cancel each other such that . However, in general such that is not generally true. Then, the limiting distribution of the estimator is not the same as for non-ancestors, see also (5).

a.7 Proof of Corollary 1

Let be such that . This means that there is at least one set
such that , where . If at least one of these must correspond to a false causal discovery, i.e., there is an such that but . We conclude

Let denote the number of true null hypotheses. By the construction of Bonferroni-Holm

Let as used in Theorem 2. We find

which proves the first part of the corollary. The second to last equality uses Theorem 2 and the continuous mapping theorem.

As the model (1) excludes the possibility of directed cycles, any output of BuildRecursive that contains cycles must include at least one false causal detection. If , it corresponds to the maximal p-value such that including the corresponding ancestor relationship creates cycles. Therefore, it must hold such that

using similar arguments as above.