# Causal Discovery with Continuous Additive Noise Models

We consider the problem of learning causal directed acyclic graphs from an observational joint distribution. One can use these graphs to predict the outcome of interventional experiments, from which data are often not available. We show that if the observational distribution follows a structural equation model with an additive noise structure, the directed acyclic graph becomes identifiable from the distribution under mild conditions. This constitutes an interesting alternative to traditional methods that assume faithfulness and identify only the Markov equivalence class of the graph, thus leaving some edges undirected. We provide practical algorithms for finitely many samples, RESIT (Regression with Subsequent Independence Test) and two methods based on an independence score. We prove that RESIT is correct in the population setting and provide an empirical evaluation.

## 1 Introduction

Many scientific questions deal with the causal structure of a data-generating process. If we know the reasons why an individual is more susceptible to a disease than others, for example, we can hope to develop new drugs in order to cure this disease or prevent its outbreak. Recent results indicate that knowing the causal structure is also useful for classical machine learning tasks. In the two-variable case, for example, knowing which is cause and which is effect has implications for semi-supervised learning and covariate shift adaptation (Schölkopf et al., 2012).

We consider a $p$-dimensional random vector $\mathbf{X} = (X_1, \ldots, X_p)$ with a joint distribution $\mathcal{L}(\mathbf{X})$ and assume that there is a true acyclic causal graph $\mathcal{G}_0$ that describes the data generating process (see Section 1.3). In this work we address the following problem of causal inference: given the distribution $\mathcal{L}(\mathbf{X})$, we try to infer the graph $\mathcal{G}_0$. A priori, the causal graph contains information about the physical process that cannot be found in properties of the joint distribution. One therefore requires assumptions connecting these two worlds. Traditional methods like PC, FCI (Spirtes et al., 2000) or score-based approaches (e.g., Chickering, 2002), which are explained in more detail in Section 2, make assumptions that enable us to recover the graph up to the Markov equivalence class. We investigate a different set of assumptions. If the data have been generated by an additive noise model (see Section 3), we will generically be able to recover the correct graph from the joint distribution.

In the remainder of this section we set up the required notation and definitions for graphs (Section 1.1), briefly introduce Judea Pearl’s do-calculus (Section 1.2) and use it to define our object of interest, a true causal graph (Section 1.3). We introduce structural equation models (SEMs) in Section 1.4. After discussing existing methods in Section 2, we provide the main results of this work in Section 3. We prove that for additive noise models (ANMs), a special class of SEMs, one can identify the graph from the joint distribution. This is possible not only for additive noise models but for all classes of SEMs that are able to identify graphs from a bivariate distribution, meaning they can distinguish between cause and effect. Section 4 proposes and compares algorithms that can be used in practice, when instead of the joint distribution, we are only given i.i.d. samples. These algorithms are tested in Section 5.

This paper builds on the conference papers of Hoyer et al. (2009), Peters et al. (2011b) and Mooij et al. (2009) but extends the material in several aspects (parts of Sections 1 and 2 have been taken and modified from the PhD thesis of Peters (2012)). All deliberations in Section 1.3 about the true causal graph and Example 1.4 are novel. The presentation of the theoretical results in Section 3 is improved. In particular, we added the motivating Example 3.2 and Propositions 1.1 and 3.2. Example 3.1 provides a non-identifiable case different from the linear Gaussian example. Proposition 3.1 is based on (Zhang and Hyvärinen, 2009) and contains important necessary conditions for the failure of identifiability. In Corollary 3.2 we present a novel identifiability result for a class of nonlinear functions and Gaussian noise variables. Proposition 6 proves that causal minimality is satisfied if the structural equations do not contain constant functions. Section 3.3 contains results that guarantee that the set of correct topological orderings is found when the assumption of causal minimality is dropped. Theorem 1 proves a conjecture from Mooij et al. (2009) by showing that, given an independence oracle, the algorithm provided in Mooij et al. (2009) is correct. We propose a new score function for estimating the true directed acyclic graph in Section 4.2 and present two corresponding score-based methods. We provide an extended section on simulation experiments and discuss experiments on real data.

### 1.1 Directed Acyclic Graphs

We start with some basic notation for graphs. Consider a finite family of random variables $\mathbf{X} = (X_1, \ldots, X_p)$ with index set $\mathbf{V} := \{1, \ldots, p\}$ (we use capital letters for random variables and bold letters for sets and vectors). We denote their joint distribution by $\mathcal{L}(\mathbf{X})$. We write $p_{\mathbf{X}}(\mathbf{x})$ or simply $p(\mathbf{x})$ for the Radon-Nikodym derivative of $\mathcal{L}(\mathbf{X})$ either with respect to the Lebesgue or the counting measure and (sometimes implicitly) assume its existence. A graph $\mathcal{G} = (\mathbf{V}, \mathcal{E})$ consists of nodes $\mathbf{V}$ and edges $\mathcal{E} \subseteq \mathbf{V}^2$ with $(v, v) \notin \mathcal{E}$ for any $v \in \mathbf{V}$. In a slight abuse of notation we identify the nodes (or vertices) $j \in \mathbf{V}$ with the variables $X_j$; the context should clarify the meaning. We also consider sets of variables as a single multivariate variable. We now introduce graph terminology that we require later. Most of the definitions can be found in (Spirtes et al., 2000; Koller and Friedman, 2009; Lauritzen, 1996), for example.

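Let $\mathcal{G} = (\mathbf{V}, \mathcal{E})$ be a graph with $\mathbf{V} := \{1, \ldots, p\}$ and corresponding random variables $\mathbf{X} = (X_1, \ldots, X_p)$. A graph $\mathcal{G}_1 = (\mathbf{V}_1, \mathcal{E}_1)$ is called a subgraph of $\mathcal{G}$ if $\mathbf{V}_1 = \mathbf{V}$ and $\mathcal{E}_1 \subseteq \mathcal{E}$; we then write $\mathcal{G}_1 \leq \mathcal{G}$. If additionally $\mathcal{E}_1 \neq \mathcal{E}$, we call $\mathcal{G}_1$ a proper subgraph of $\mathcal{G}$.

A node $i$ is called a parent of $j$ if $(i, j) \in \mathcal{E}$ and a child if $(j, i) \in \mathcal{E}$. The set of parents of $j$ is denoted by $\mathbf{PA}^{\mathcal{G}}_j$, the set of its children by $\mathbf{CH}^{\mathcal{G}}_j$. Two nodes $i$ and $j$ are adjacent if either $(i, j) \in \mathcal{E}$ or $(j, i) \in \mathcal{E}$. We call $\mathcal{G}$ fully connected if all pairs of nodes are adjacent. We say that there is an undirected edge between two adjacent nodes $i$ and $j$ if $(i, j) \in \mathcal{E}$ and $(j, i) \in \mathcal{E}$. An edge between two adjacent nodes is directed if it is not undirected. We then write $i \to j$ for $(i, j) \in \mathcal{E}$. Three nodes are called an immorality or a v-structure if one node is a child of the two others that themselves are not adjacent. The skeleton of $\mathcal{G}$ is the set of all edges without taking the direction into account, that is, all $(i, j)$ such that $(i, j) \in \mathcal{E}$ or $(j, i) \in \mathcal{E}$.

A path in $\mathcal{G}$ is a sequence of (at least two) distinct vertices $i_1, \ldots, i_n$, such that there is an edge between $i_k$ and $i_{k+1}$ for all $k = 1, \ldots, n-1$. If $(i_k, i_{k+1}) \in \mathcal{E}$ and $(i_{k+1}, i_k) \notin \mathcal{E}$ for all $k$, we speak of a directed path between $i_1$ and $i_n$ and call $i_n$ a descendant of $i_1$. We denote all descendants of $i$ by $\mathbf{DE}^{\mathcal{G}}_i$ and all non-descendants of $i$, excluding $i$, by $\mathbf{ND}^{\mathcal{G}}_i$. In this work, $i$ is neither a descendant nor a non-descendant of itself. If $(i_{k-1}, i_k) \in \mathcal{E}$ and $(i_{k+1}, i_k) \in \mathcal{E}$, and also $(i_k, i_{k-1}) \notin \mathcal{E}$ and $(i_k, i_{k+1}) \notin \mathcal{E}$, $i_k$ is called a collider on this path. $\mathcal{G}$ is called a partially directed acyclic graph (PDAG) if there is no directed cycle, i.e., no pair $(j, k)$ such that there are directed paths from $j$ to $k$ and from $k$ to $j$. $\mathcal{G}$ is called a directed acyclic graph (DAG) if it is a PDAG and all edges are directed.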

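In a DAG, a path between $i_1$ and $i_n$ is blocked by a set $\mathbf{S}$ (with neither $i_1$ nor $i_n$ in this set) whenever there is a node $i_k$ such that one of the following two possibilities holds: (1) $i_k \in \mathbf{S}$ and $i_{k-1} \to i_k \to i_{k+1}$ or $i_{k-1} \leftarrow i_k \leftarrow i_{k+1}$ or $i_{k-1} \leftarrow i_k \to i_{k+1}$; or (2) $i_{k-1} \to i_k \leftarrow i_{k+1}$ and neither $i_k$ nor any of its descendants is in $\mathbf{S}$. We say that two disjoint subsets of vertices $\mathbf{A}$ and $\mathbf{B}$ are $d$-separated by a third (also disjoint) subset $\mathbf{S}$ if every path between nodes in $\mathbf{A}$ and $\mathbf{B}$ is blocked by $\mathbf{S}$. Throughout this work, $\perp\!\!\!\perp$ denotes (conditional) independence. The joint distribution $\mathcal{L}(\mathbf{X})$ is said to be Markov with respect to the DAG $\mathcal{G}$ if

$$\mathbf{A}, \mathbf{B} \ d\text{-sep. by } \mathbf{C} \;\Rightarrow\; \mathbf{A} \perp\!\!\!\perp \mathbf{B} \mid \mathbf{C}$$

for all disjoint sets $\mathbf{A}, \mathbf{B}, \mathbf{C}$. $\mathcal{L}(\mathbf{X})$ is said to be faithful to the DAG $\mathcal{G}$ if

$$\mathbf{A}, \mathbf{B} \ d\text{-sep. by } \mathbf{C} \;\Leftarrow\; \mathbf{A} \perp\!\!\!\perp \mathbf{B} \mid \mathbf{C}$$

for all disjoint sets $\mathbf{A}, \mathbf{B}, \mathbf{C}$. A distribution satisfies causal minimality with respect to $\mathcal{G}$ if it is Markov with respect to $\mathcal{G}$, but not to any proper subgraph of $\mathcal{G}$.

We denote by $\mathcal{M}(\mathcal{G})$ the set of distributions that are Markov with respect to $\mathcal{G}$. Two DAGs $\mathcal{G}_1$ and $\mathcal{G}_2$ are Markov equivalent if $\mathcal{M}(\mathcal{G}_1) = \mathcal{M}(\mathcal{G}_2)$. This is the case if and only if $\mathcal{G}_1$ and $\mathcal{G}_2$ satisfy the same set of $d$-separations, which means that the Markov condition entails the same set of (conditional) independence conditions. The set of all DAGs that are Markov equivalent to some DAG (a so-called Markov equivalence class) can be represented by a completed PDAG. This graph contains the edge $(i, j)$ if and only if one member of the Markov equivalence class does. Verma and Pearl (1991) showed that two DAGs are Markov equivalent if and only if they have the same skeleton and the same immoralities.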
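The skeleton-and-immoralities criterion of Verma and Pearl (1991) lends itself to a direct check. The following is a minimal sketch, assuming that a DAG is given as a set of `(parent, child)` pairs over a shared node set; the function names are illustrative and not taken from the paper.

```python
from itertools import combinations

def skeleton(edges):
    """Edges of the DAG with their direction dropped."""
    return {frozenset(e) for e in edges}

def immoralities(edges):
    """All v-structures i -> k <- j in which i and j are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for i, k in edges:
        parents.setdefault(k, set()).add(i)
    found = set()
    for k, pa in parents.items():
        for i, j in combinations(sorted(pa), 2):
            if frozenset((i, j)) not in skel:
                found.add((frozenset((i, j)), k))
    return found

def markov_equivalent(edges_a, edges_b):
    """Same skeleton and same immoralities (Verma and Pearl, 1991)."""
    return (skeleton(edges_a) == skeleton(edges_b)
            and immoralities(edges_a) == immoralities(edges_b))

chain = {("X", "Y"), ("Y", "Z")}      # X -> Y -> Z
fork = {("Y", "X"), ("Y", "Z")}       # X <- Y -> Z
collider = {("X", "Y"), ("Z", "Y")}   # X -> Y <- Z

print(markov_equivalent(chain, fork))      # chain and fork share one class
print(markov_equivalent(chain, collider))  # the collider forms its own class
```

The chain and the fork have the same skeleton and no immorality, so they cannot be told apart by conditional independences alone, whereas the collider can.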

Faithfulness is not very intuitive at first glance. We now give an example of a distribution that is Markov but not faithful with respect to some DAG $\mathcal{G}_1$. This is achieved by making two paths cancel each other, which creates an independence that is not implied by the graph structure.

###### Example

Consider the two graphs in Figure 1.

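Corresponding to the left graph $\mathcal{G}_1$ we generate a joint distribution by the following equations: $X = N_X$, $Y = aX + N_Y$, $Z = bY + cX + N_Z$, with $N_X \sim \mathcal{N}(0, \sigma_X^2)$, $N_Y \sim \mathcal{N}(0, \sigma_Y^2)$ and $N_Z \sim \mathcal{N}(0, \sigma_Z^2)$ jointly independent. This is an example of a linear Gaussian structural equation model with graph $\mathcal{G}_1$ that we formally define in Section 1.4. Now, if $a \cdot b + c = 0$, the distribution is not faithful (more precisely: not triangle-faithful; Zhang and Spirtes, 2008) with respect to $\mathcal{G}_1$ since we obtain $X \perp\!\!\!\perp Z$.

Correspondingly, we generate a distribution related to graph $\mathcal{G}_2$: $X = \tilde{N}_X$, $Y = \tilde{a}X + \tilde{b}Z + \tilde{N}_Y$, $Z = \tilde{N}_Z$, with all $\tilde{N}_\cdot \sim \mathcal{N}(0, \tau_\cdot^2)$ jointly independent. If we choose $\tau_X^2 = \sigma_X^2$, $\tilde{a} = a$, $\tau_Z^2 = b^2\sigma_Y^2 + \sigma_Z^2$, $\tilde{b} = b\sigma_Y^2 / (b^2\sigma_Y^2 + \sigma_Z^2)$ and $\tau_Y^2 = \sigma_Y^2 - b^2\sigma_Y^4 / (b^2\sigma_Y^2 + \sigma_Z^2)$, both models lead to the covariance matrix

$$\Sigma = \begin{pmatrix} \sigma_X^2 & a\sigma_X^2 & 0 \\ a\sigma_X^2 & a^2\sigma_X^2 + \sigma_Y^2 & b\sigma_Y^2 \\ 0 & b\sigma_Y^2 & b^2\sigma_Y^2 + \sigma_Z^2 \end{pmatrix}$$

and thus to the same distribution. It can be checked that the distribution is faithful with respect to $\mathcal{G}_2$ if $\tilde{a}, \tilde{b} \neq 0$ and all $\tau_\cdot > 0$.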
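The path cancellation can be verified with exact arithmetic. The sketch below makes several assumptions that are not stated in the text: the left graph is taken to be $X \to Y \to Z$ with an additional edge $X \to Z$ whose coefficient $c = -ab$ cancels the indirect path, the right graph is taken to be the collider $X \to Y \leftarrow Z$, and the numerical values and the collider parameters (`b_tilde`, `t2`) are illustrative choices derived from the covariance matrix above.

```python
from fractions import Fraction as F

def covariance(order, coef, noise_var):
    """Covariance of a linear SEM X_j = sum_k coef[j][k] * X_k + N_j.

    `order` must be a topological order of the graph; rows and columns
    of the result are sorted by variable name (X, Y, Z).
    """
    weights = {}  # weights[j][m]: coefficient of noise N_m in X_j
    for j in order:
        w = {m: F(0) for m in order}
        w[j] = F(1)
        for k, beta in coef.get(j, {}).items():
            for m in order:
                w[m] += beta * weights[k][m]
        weights[j] = w
    names = sorted(order)
    return [[sum(weights[i][m] * weights[j][m] * noise_var[m] for m in order)
             for j in names] for i in names]

# Assumed model for the left graph: X -> Y -> Z and X -> Z with c = -a*b.
a, b = F(2), F(3)
c = -a * b
s2 = {"X": F(1), "Y": F(2), "Z": F(5)}  # illustrative noise variances
sigma_1 = covariance(["X", "Y", "Z"],
                     {"Y": {"X": a}, "Z": {"Y": b, "X": c}}, s2)

# Assumed model for the right graph: X -> Y <- Z, parameters matched to sigma_1.
t2_Z = b**2 * s2["Y"] + s2["Z"]
b_tilde = b * s2["Y"] / t2_Z
t2 = {"X": s2["X"], "Z": t2_Z, "Y": s2["Y"] - b**2 * s2["Y"]**2 / t2_Z}
sigma_2 = covariance(["X", "Z", "Y"], {"Y": {"X": a, "Z": b_tilde}}, t2)

print(sigma_1 == sigma_2)  # both SEMs induce the same Gaussian distribution
print(sigma_1[0][2])       # Cov(X, Z) vanishes although X -> Z is present
```

Because zero-mean Gaussian distributions are determined by their covariance matrix, equality of the two matrices means that the two models cannot be distinguished from observational data.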

The distribution from Example 1.1 is faithful with respect to $\mathcal{G}_2$, but not with respect to $\mathcal{G}_1$. Nevertheless, for both models, causal minimality is satisfied if none of the parameters vanishes: the distribution is not Markov to any proper subgraph of $\mathcal{G}_1$ or $\mathcal{G}_2$ since removing an arrow would correspond to a new (conditional) independence that does not hold in the distribution. Note that $\mathcal{G}_2$ is not a proper subgraph of $\mathcal{G}_1$. In general, causal minimality is weaker than faithfulness: if $\mathcal{L}(\mathbf{X})$ is faithful with respect to $\mathcal{G}$, then causal minimality is satisfied. This is due to the fact that any two nodes that are not directly connected by an edge can be $d$-separated.

Another, equivalent formulation of causal minimality reads as follows. Consider the random vector $\mathbf{X} = (X_1, \ldots, X_p)$ and assume that the joint distribution has a density with respect to a product measure. Suppose that $\mathcal{L}(\mathbf{X})$ is Markov with respect to $\mathcal{G}$. Then $\mathcal{L}(\mathbf{X})$ satisfies causal minimality with respect to $\mathcal{G}$ if and only if for all $X_j$ and all $Y \in \mathbf{PA}^{\mathcal{G}}_j$ we have that $X_j \not\perp\!\!\!\perp Y \mid \mathbf{PA}^{\mathcal{G}}_j \setminus \{Y\}$. See Appendix A.1.

### 1.2 Interventional Distributions

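Given a directed acyclic graph (DAG) $\mathcal{G}$, Pearl (2009) introduces the $do$-notation as a mathematical description of interventional experiments. More precisely, $do(X_j = \tilde{p}(x_j))$ stands for setting the variable $X_j$ randomly according to the distribution $\tilde{p}(x_j)$, irrespective of its parents, while not interfering with any other variable. Formally: Let $\mathbf{X} = (X_1, \ldots, X_p)$ be a collection of variables with joint distribution $\mathcal{L}(\mathbf{X})$ that we assume to be absolutely continuous with respect to the Lebesgue measure or the counting measure (i.e., there exists a probability density function or a probability mass function). Given a DAG $\mathcal{G}$ over $\mathbf{X}$, we define the interventional distribution $do(X_j = \tilde{p}(x_j))$ of $X_1, \ldots, X_p$ by

$$p(x_1, \ldots, x_p \mid do(X_j = \tilde{p}(x_j))) := \prod_{i \neq j}^{p} p(x_i \mid x_{\mathbf{PA}_i}) \cdot \tilde{p}(x_j),$$

if $p(x_1, \ldots, x_p) > 0$ and zero otherwise. Here $\tilde{p}(x_j)$ is either a probability density function or a probability mass function. Similarly, we can intervene at different nodes at the same time by defining the interventional distribution $do(X_j = \tilde{p}(x_j), j \in \mathbf{J})$ for $\mathbf{J} \subseteq \mathbf{V}$ as

$$p(x_1, \ldots, x_p \mid do(X_j = \tilde{p}(x_j), j \in \mathbf{J})) := \prod_{i \notin \mathbf{J}} p(x_i \mid x_{\mathbf{PA}_i}) \cdot \prod_{j \in \mathbf{J}} \tilde{p}(x_j),$$

if $p(x_1, \ldots, x_p) > 0$ and zero otherwise. Here, $x_{\mathbf{PA}_i}$ denotes the tuple of all $x_j$ for $X_j$ being a parent of $X_i$ in $\mathcal{G}$. Pearl (2009) introduces Definition 1.2 with the special case of $\tilde{p}(x_j) = \delta_{x_j, \tilde{x}_j}$, where $\delta_{x_j, \tilde{x}_j} = 1$ if $x_j = \tilde{x}_j$ and $\delta_{x_j, \tilde{x}_j} = 0$ otherwise; this corresponds to a point mass at $\tilde{x}_j$. For more details on soft interventions, see Eberhardt and Scheines (2007). Note that in general:

$$p(x_1, \ldots, x_p \mid do(X_j = \tilde{x}_j)) \neq p(x_1, \ldots, x_p \mid X_j = \tilde{x}_j).$$

The expression $p(x_1, \ldots, x_p \mid do(X_j = \tilde{x}_j))$ yields a distribution over $X_1, \ldots, X_p$. If we are only interested in computing the marginal $p(x_i \mid do(X_j = \tilde{x}_j))$, where $X_i$ is not a parent of $X_j$, we can use the parent adjustment formula (Pearl, 2009, Theorem 3.2.2)

$$p(x_i \mid do(X_j = \tilde{x}_j)) = \sum_{x_{\mathbf{PA}_j}} p(x_i \mid \tilde{x}_j, x_{\mathbf{PA}_j}) \, p(x_{\mathbf{PA}_j}). \qquad (1)$$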
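To see that intervening and conditioning differ, and that the parent adjustment formula (1) recovers the interventional marginal from observational quantities, consider a small discrete example. The graph $Z \to X$, $Z \to Y$, $X \to Y$ and all probability tables below are hypothetical and chosen only for illustration.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical binary model with graph Z -> X, Z -> Y, X -> Y.
p_z = {0: F(1, 2), 1: F(1, 2)}
p_x_given_z = {0: {0: F(9, 10), 1: F(1, 10)},   # p_x_given_z[z][x]
               1: {0: F(1, 10), 1: F(9, 10)}}
p_y1_given_xz = {(0, 0): F(1, 10), (0, 1): F(5, 10),  # P(Y = 1 | x, z)
                 (1, 0): F(3, 10), (1, 1): F(7, 10)}

def joint(x, y, z):
    """Observational joint p(x, y, z) via the Markov factorization."""
    py1 = p_y1_given_xz[(x, z)]
    return p_z[z] * p_x_given_z[z][x] * (py1 if y == 1 else 1 - py1)

def conditional_y1(x):
    """P(Y = 1 | X = x), obtained by conditioning."""
    num = sum(joint(x, 1, z) for z in (0, 1))
    den = sum(joint(x, y, z) for y, z in product((0, 1), repeat=2))
    return num / den

def do_y1_truncated(x):
    """P(Y = 1 | do(X = x)): drop the factor p(x | z), fix X = x."""
    return sum(p_z[z] * p_y1_given_xz[(x, z)] for z in (0, 1))

def do_y1_adjustment(x):
    """Formula (1) with PA_X = {Z}, computed from the joint only."""
    total = F(0)
    for z in (0, 1):
        p_xz = sum(joint(x, y, z) for y in (0, 1))
        p_z_marg = sum(joint(xx, y, z) for xx, y in product((0, 1), repeat=2))
        total += joint(x, 1, z) / p_xz * p_z_marg
    return total

print(conditional_y1(1), do_y1_truncated(1), do_y1_adjustment(1))
```

Here the truncated factorization and the adjustment formula agree, while the conditional probability is larger because observing $X = 1$ also carries information about the common cause $Z$.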

### 1.3 True Causal Graphs

In this section we clarify what we mean by a true causal graph . In short, we use this term if one can read off the results of randomized studies from and the observational joint distribution. This means that the graph and the observational joint distribution lead to causal effects that one observes in practice. Two important restrictive assumptions that we make throughout this work are acyclicity (the absence of directed cycles, in other words, no causal feedback loops are allowed) and causal sufficiency (the absence of hidden variables that are a common cause of at least two observed variables). Assume we are given a distribution over and distributions for all (think of the variables having been randomized). We then call the graph a true causal graph for these distributions if

• is a directed acyclic graph;

• the distribution is Markov with respect to ;

• for all and with the distribution coincides with , computed from as in Definition 1.2.

Definition 1.3 is purely mathematical if one considers as an abstract family of given distributions. But it is a small step to make the relation to the “real world”. We call the true causal graph of a data generating process if it is the true causal graph for the distributions and , where the latter are obtained by randomizing according to . In some situations, the precise design of a randomized experiment may not be obvious. While most people would agree on how to randomize over medical treatment procedures, there is probably less agreement how to randomize over the tolerance of a person (does this include other changes of his personality, too?). Only sometimes, this problem can be resolved by including more variables and taking a less coarse-grained point of view. We do not go into further detail since we believe that this would require philosophical deliberations, which lie beyond the scope of this work. Instead, we may explicitly add the requirement that “most people agree on what a randomized experiment should look like in this context”.

In general, there can be more than one true causal DAG. If one requires causal minimality, the true causal DAG is unique. Assume has a density and consider all true causal DAGs of . Then there is a partial order on using the subgraph property as an ordering. This ordering has a least element , i.e., for all . This element is the unique true causal DAG such that satisfies causal minimality with respect to . See Appendix A.2

We now briefly comment on a true causal graph’s behavior when some of the variables from the joint distribution are marginalized out.

###### Example

(i) If is the only true causal graph for and , there is no true causal graph for the variables and (the $do$-statements do not coincide).

(ii) Assume that the graph with additional is the only true causal graph for and and assume that is faithful with respect to this graph. Then, the only true causal graph for the variables and is .

(iii) If the situation is the same as in (ii) with the difference that (i.e., is not faithful with respect to the true causal graph), the empty graph is also a true causal graph for and .

Latent projections (Verma and Pearl, 1991) provide a formal way to obtain a true causal graph for marginalization. Cases (ii) and (iii) show that there are no purely graphical criteria that provide the minimal true causal graph described in Proposition 1.3.

The results presented in the remainder of this paper can be understood without causal interpretation. Using these techniques to infer a true causal graph, however, requires the assumption that such a true causal DAG for the observed distribution of $X_1, \ldots, X_p$ exists. This includes the assumption that all “relevant” variables have been observed, sometimes called causal sufficiency, and that there are no feedback loops.

Richardson and Spirtes (2002) introduce a representation of graphs (so-called Maximal Ancestral Graphs, or MAGs) with hidden variables that is closed under marginalization and conditioning. The FCI algorithm (Spirtes et al., 2000) exploits the conditional independences in the data to partially reconstruct the graph. Less work concentrates on hidden variables in structural equation models (e.g., Hoyer et al., 2008; Janzing et al., 2009; Silva and Ghahramani, 2009).

### 1.4 Structural Equation Models

A structural equation model (SEM) (also called a functional model) is defined as a tuple $(\mathcal{S}, \mathcal{L}(\mathbf{N}))$, where $\mathcal{S} = (S_1, \ldots, S_p)$ is a collection of $p$ equations

$$S_j: \quad X_j = f_j(\mathbf{PA}_j, N_j), \qquad j = 1, \ldots, p \qquad (2)$$

and $\mathcal{L}(\mathbf{N}) = \mathcal{L}(N_1, \ldots, N_p)$ is the joint distribution of the noise variables, which we require to be jointly independent (thus, $\mathcal{L}(\mathbf{N})$ is a product distribution) as we are assuming causal sufficiency. The $\mathbf{PA}_j$ are considered the direct causes of $X_j$. An SEM specifies how the $\mathbf{PA}_j$ affect $X_j$. Note that in physics (chemistry, biology, …), we would usually expect that such causal relationships occur in time, and are governed by sets of coupled differential equations. Under certain assumptions such as stable equilibria, one can derive an SEM describing how the equilibrium states of such a dynamical system will react to physical interventions on the observables involved (see Mooij et al., 2013). We do not deal with these issues in the present paper, but we take the SEM as our starting point. Moreover, we consider SEMs only for real-valued random variables $X_1, \ldots, X_p$. The graph of a structural equation model is obtained simply by drawing directed edges from each parent to its direct effects, i.e., from each variable $X_k$ occurring on the right-hand side of equation (2) to $X_j$. We henceforth assume this graph to be acyclic. According to the notation defined in Section 1.1, $\mathbf{PA}_j$ are the parents of $X_j$. Pearl (2009) shows in Theorem 1.4.1 that the law generated by an SEM is Markov with respect to the graph.
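An acyclic SEM of the form (2) can be sampled by evaluating the equations in a topological order of the graph, and a point-mass intervention can be simulated by overriding the corresponding equation. The following sketch uses hypothetical structural equations and standard normal noise; neither is taken from the paper.

```python
import random

def sample_sem(equations, order, noise, rng, do=None):
    """Draw one sample from an acyclic SEM.

    equations[j](values, n_j) computes X_j from already sampled values,
    order is a topological order, and do maps intervened variables to
    the fixed values that replace their structural equations.
    """
    do = do or {}
    values = {}
    for j in order:
        n_j = noise[j](rng)
        values[j] = do[j] if j in do else equations[j](values, n_j)
    return values

# Hypothetical SEM with graph X1 -> X2 -> X3.
equations = {
    "X1": lambda v, n: n,
    "X2": lambda v, n: v["X1"] ** 2 + n,
    "X3": lambda v, n: 2.0 * v["X2"] + n,
}
order = ["X1", "X2", "X3"]
noise = {j: (lambda rng: rng.gauss(0.0, 1.0)) for j in order}

rng = random.Random(0)
observational = [sample_sem(equations, order, noise, rng) for _ in range(5)]
interventional = [sample_sem(equations, order, noise, rng, do={"X2": 1.0})
                  for _ in range(5)]
print(observational[0])
print(interventional[0])
```

Under `do={"X2": 1.0}` the value of `X1` no longer influences `X2`, while `X3` still responds to the intervened value.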

Structural equation models contain strictly more information than their corresponding graph and law and hence also more information than the family of all interventional distributions together with the observational distribution. This information sometimes helps to answer counterfactual questions, as shown in the following example.

###### Example

Let and , such that the three variables are jointly independent. That is,

have a Bernoulli distribution with parameter

and . We define two different SEMs, first consider :

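$$S_A = \begin{cases} X_1 = N_1 \\ X_2 = N_2 \\ X_3 = (1_{N_3 > 0} \cdot X_1 + 1_{N_3 = 0} \cdot X_2) \cdot 1_{X_1 \neq X_2} + N_3 \cdot 1_{X_1 = X_2} \end{cases}$$

If $X_1$ and $X_2$ have different values, depending on $N_3$ we either choose $X_3 = X_1$ or $X_3 = X_2$. Otherwise $X_3 = N_3$. Now, $S_B$ differs from $S_A$ only in the latter case:

$$S_B = \begin{cases} X_1 = N_1 \\ X_2 = N_2 \\ X_3 = (1_{N_3 > 0} \cdot X_1 + 1_{N_3 = 0} \cdot X_2) \cdot 1_{X_1 \neq X_2} + (2 - N_3) \cdot 1_{X_1 = X_2} \end{cases}$$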

It can be checked that both SEMs generate the same observational distribution, which satisfies causal minimality with respect to the graph . They also generate the same interventional distributions, for any possible intervention. But the two models differ in a counterfactual statement (here, we make use of Judea Pearl’s definition of counterfactuals; Pearl, 2009). Suppose we have seen a sample and we are interested in the counterfactual question, what would have been if had been . From both and it follows that , and thus the two SEMs “predict” different values for under a counterfactual change of .
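The claims of this example can be checked by enumerating all noise configurations. The sketch below rests on assumptions that should be read as illustrative: $N_1, N_2$ are taken to be Bernoulli with parameter $1/2$, $N_3$ uniform on $\{0, 1, 2\}$, the observed sample is taken to be $(X_1, X_2, X_3) = (1, 0, 0)$, and the counterfactual sets $X_1 = 0$. The counterfactual is computed in the usual three steps (infer the noise from the observation, modify the equation, predict).

```python
from fractions import Fraction as F
from itertools import product

NOISE_GRID = list(product((0, 1), (0, 1), (0, 1, 2)))  # assumed supports

def x3_A(x1, x2, n3):
    """Third structural equation of S_A."""
    if x1 != x2:
        return x1 if n3 > 0 else x2
    return n3

def x3_B(x1, x2, n3):
    """Third structural equation of S_B (differs only if x1 == x2)."""
    if x1 != x2:
        return x1 if n3 > 0 else x2
    return 2 - n3

def distribution(x3_fn, do_x1=None):
    """Exact law of (X1, X2, X3), optionally under do(X1 = do_x1)."""
    dist = {}
    for n1, n2, n3 in NOISE_GRID:
        x1 = n1 if do_x1 is None else do_x1
        key = (x1, n2, x3_fn(x1, n2, n3))
        dist[key] = dist.get(key, F(0)) + F(1, len(NOISE_GRID))
    return dist

def counterfactual_x3(x3_fn, observed, new_x1):
    """Values of X3 had X1 been new_x1, given the observed sample."""
    results = set()
    for n1, n2, n3 in NOISE_GRID:
        if (n1, n2, x3_fn(n1, n2, n3)) == observed:   # noise consistent with data
            results.add(x3_fn(new_x1, n2, n3))        # re-evaluate with X1 changed
    return results

print(distribution(x3_A) == distribution(x3_B))
print(counterfactual_x3(x3_A, (1, 0, 0), 0), counterfactual_x3(x3_B, (1, 0, 0), 0))
```

Under these assumptions the observational and interventional laws coincide, yet the counterfactual value of $X_3$ differs between the two SEMs.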

If we want to use an estimated SEM to answer counterfactual questions, this example shows that we require assumptions that let us distinguish between $S_A$ and $S_B$. In this work we exploit the additive noise assumption to infer the structure of an SEM. We do not claim that we can predict counterfactual statements.

Another property of structural equation models is that they have the power to describe many distributions. Consider $X_1, \ldots, X_p$ and let $\mathcal{L}(\mathbf{X})$ be Markov with respect to $\mathcal{G}$. Then there exists an SEM $(\mathcal{S}, \mathcal{L}(\mathbf{N}))$ with graph $\mathcal{G}$ that generates the distribution $\mathcal{L}(\mathbf{X})$. See Appendix A.3. (A similar but weaker statement than Proposition 4 can be found in Druzdzel and van Leijen, 2001; Janzing and Schölkopf, 2010.)

Structural equation models have been used for a long time in fields like agriculture or social sciences (e.g., Wright, 1921; Bollen, 1989). Model selection, for example, was done by fitting different structures that were considered as reasonable given the prior knowledge about the system. These candidate structures were then compared using goodness of fit tests. In this work we instead consider the question of identifiability, which has not been addressed until recently.

###### Problem (population case)

Suppose we are given a distribution $\mathcal{L}(\mathbf{X})$ that has been generated by an (unknown) structural equation model with graph $\mathcal{G}_0$; in particular, $\mathcal{L}(\mathbf{X})$ is Markov with respect to $\mathcal{G}_0$. Can the (observational) distribution $\mathcal{L}(\mathbf{X})$ be generated by a structural equation model with a different graph $\mathcal{G} \neq \mathcal{G}_0$? If not, we call $\mathcal{G}_0$ identifiable from $\mathcal{L}(\mathbf{X})$.

In general, $\mathcal{G}_0$ is not identifiable from $\mathcal{L}(\mathbf{X})$: the joint distribution is certainly Markov with respect to many different graphs, e.g., to all fully connected acyclic graphs. Proposition 4 states the existence of corresponding SEMs. What can be done to overcome this indeterminacy? The hope is that by using additional assumptions one obtains restricted models, in which we can identify the graph from the joint distribution. Considering graphical models, we see in Section 2.1 how the assumption that $\mathcal{L}(\mathbf{X})$ is Markov and faithful with respect to $\mathcal{G}_0$ leads to identifiability of the Markov equivalence class of $\mathcal{G}_0$. Considering SEMs, we see in Section 3 that additive noise models as a special case of restricted SEMs even lead to identifiability of the correct DAG. Section 2.3 also contains such a restriction based on SEMs.

## 2 Alternative Methods

### 2.1 Estimating the Markov Equivalence Class: Independence-Based Methods

Conditional independence-based methods like the PC algorithm and the FCI algorithm (Spirtes et al., 2000) assume that $\mathcal{L}(\mathbf{X})$ is Markov and faithful with respect to the correct graph $\mathcal{G}_0$ (that means all conditional independences in the joint distribution are entailed by the Markov condition, cf. Section 1.1). Since both assumptions put restrictions only on the conditional independences in the joint distribution, these methods are not able to distinguish between two graphs that entail exactly the same set of (conditional) independences, i.e., between Markov equivalent graphs. Since many Markov equivalence classes contain more than one graph, conditional independence-based methods thus usually leave some arrows undirected and cannot uniquely identify the correct graph.

The first step of the PC algorithm determines the variables that are adjacent. One therefore has to test whether two variables are dependent given any other subset of variables. The PC algorithm exploits a very clever procedure to reduce the size of the conditioning set. In the worst case, however, one has to perform conditional independence tests with conditioning sets of up to $p - 2$ variables (where $p$ is the number of variables in the graph). Although there is recent work on kernel-based conditional independence tests (Fukumizu et al., 2008; Zhang et al., 2011), such tests are difficult to perform in practice if one does not restrict the variables to follow a Gaussian distribution, for example (e.g., Bergsma, 2004).

To prove consistency of the PC algorithm one does not only require faithfulness, but strong faithfulness (Zhang and Spirtes, 2003; Kalisch and Bühlmann, 2007). Uhler et al. (2013) argue that this is a restrictive condition. Since parts of faithfulness can be tested given the data (Zhang and Spirtes, 2008), the condition may be weakened.

From our perspective, independence-based methods face the following challenges: (1) We can identify the correct DAG only up to Markov equivalence classes. (2) Conditional independence testing, especially with a large conditioning set, is difficult in practice. (3) Simulation experiments suggest that in many cases the distribution is close to unfaithfulness. In these cases there is no guarantee that the inferred graph(s) will be close to the original one.

### 2.2 Estimating the Markov Equivalence Class: Score-Based Methods

Although the roots of score-based methods for causal inference may date back even further, we mainly refer to Geiger and Heckerman (1994), Heckerman (1997) and Chickering (2002) and references therein. Given the data $\mathcal{D}$ from a vector $\mathbf{X}$ of variables, i.e., $n$ i.i.d. samples, the idea is to assign a score $S(\mathcal{D}, \mathcal{G})$ to each graph $\mathcal{G}$ and search over the space of DAGs for the best scoring graph:

$$\hat{\mathcal{G}} := \operatorname*{argmax}_{\mathcal{G} \text{ DAG over } \mathbf{X}} S(\mathcal{D}, \mathcal{G}). \qquad (3)$$

There are several possibilities to define such a scoring function. Often a parametric model is assumed (e.g., linear Gaussian equations or multinomial distributions), which introduces a set of parameters $\theta \in \Theta$.

From a Bayesian point of view, we may define priors $p_{pr}(\mathcal{G})$ and $p_{pr}(\theta)$ over DAGs and parameters and consider the log posterior as a score function, or equivalently (note that $p(\mathcal{D})$ is constant over all DAGs):

$$S(\mathcal{D}, \mathcal{G}) := \log p_{pr}(\mathcal{G}) + \log p(\mathcal{D} \mid \mathcal{G}),$$

where $p(\mathcal{D} \mid \mathcal{G})$ is the marginal likelihood

$$p(\mathcal{D} \mid \mathcal{G}) = \int_{\Theta} p(\mathcal{D} \mid \mathcal{G}, \theta) \cdot p_{pr}(\theta) \, d\theta.$$

In this case, $\hat{\mathcal{G}}$ defined in (3) is the mode of the posterior distribution, which is usually called the maximum a posteriori (or MAP) estimator. Instead of a MAP estimator, one may be interested in the full posterior distribution over DAGs. This distribution can subsequently be averaged over all graphs to get a posterior of the hypothesis about the existence of a specific edge, for example.

In the case of parametric models, we call two graphs $\mathcal{G}_1$ and $\mathcal{G}_2$ distribution equivalent if for each parameter $\theta_1 \in \Theta_1$ there is a corresponding parameter $\theta_2 \in \Theta_2$, such that the distribution obtained from $\mathcal{G}_1$ in combination with $\theta_1$ is the same as the distribution obtained from graph $\mathcal{G}_2$ with $\theta_2$, and vice versa. It is known that in the linear Gaussian case (or for unconstrained multinomial distributions) two graphs are distribution equivalent if and only if they are Markov equivalent. One may therefore argue that $p(\mathcal{D} \mid \mathcal{G}_1)$ and $p(\mathcal{D} \mid \mathcal{G}_2)$ should be the same for Markov equivalent graphs $\mathcal{G}_1$ and $\mathcal{G}_2$. Heckerman and Geiger (1995) discuss how to choose the prior over parameters accordingly.

Instead, we may consider the maximum likelihood estimator $\hat{\theta}$ in each graph and define a score function by using a penalty, e.g., the Bayesian Information Criterion (BIC):

$$S(\mathcal{D}, \mathcal{G}) = \log p(\mathcal{D} \mid \hat{\theta}, \mathcal{G}) - \frac{d}{2} \log n,$$

where $n$ is the sample size and $d$ the dimensionality of the parameter $\theta$.

Since the search space of all DAGs grows super-exponentially in the number of variables (e.g., Chickering, 2002), greedy search algorithms are applied to solve equation (3): at each step there is a candidate graph and a set of neighboring graphs. For all these neighbors one computes the score and considers the best-scoring graph as the new candidate. If none of the neighbors obtains a better score, the search procedure terminates (not knowing whether one obtained only a local optimum). Clearly, one therefore has to define a neighborhood relation. Starting from a graph $\mathcal{G}$, we may define as neighbors of $\mathcal{G}$ all graphs that can be obtained by removing, adding or reversing one edge. In the linear Gaussian case, for example, one cannot distinguish between Markov equivalent graphs. It turns out that in those cases it is beneficial to change the search space to Markov equivalence classes instead of DAGs. The greedy equivalence search (GES) (Meek, 1997; Chickering, 2002) starts with the empty graph and consists of two phases. In the first phase, edges are added until a local maximum is reached; in the second phase, edges are removed until a local maximum is reached, which is then given as an output of the algorithm. Chickering (2002) proves consistency of this method by using consistency of the BIC (Haughton, 1988).
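The growth of the search space can be made concrete. The number of labeled DAGs on $n$ nodes satisfies a well-known recurrence due to Robinson, which is not part of the paper and is used here only to illustrate why exhaustive search over DAGs is infeasible beyond a handful of variables. The function name is illustrative.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_dags(n):
    """Number of labeled DAGs on n nodes via Robinson's recurrence:
    a(n) = sum_{k=1..n} (-1)^(k+1) * C(n, k) * 2^(k(n-k)) * a(n-k)."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

for n in range(1, 8):
    print(n, num_dags(n))
```

Already for five variables there are 29281 DAGs, which motivates greedy strategies that only evaluate local neighborhoods.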

### 2.3 Estimating the DAG: LiNGAM

Kano and Shimizu (2003) and Shimizu et al. (2006) propose an inspiring method exploiting non-Gaussianity of the data (a more detailed tutorial can be found at http://www.ar.sanken.osaka-u.ac.jp/~sshimizu/papers/Shimizu13BHMK.pdf). Although their work covers the general case, the idea is maybe best understood in the case of two variables:

###### Example

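Suppose

$$Y = \phi X + N, \qquad N \perp\!\!\!\perp X,$$

where $X$ and $N$ are normally distributed. It is easy to check that

$$X = \tilde{\phi} Y + \tilde{N}, \qquad \tilde{N} \perp\!\!\!\perp Y,$$

with $\tilde{\phi} = \frac{\phi \operatorname{var}(X)}{\phi^2 \operatorname{var}(X) + \operatorname{var}(N)}$ and $\tilde{N} = X - \tilde{\phi} Y$.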

If we consider non-Gaussian noise, however, the structural equation model becomes identifiable. Let $X$ and $Y$ be two random variables, for which

$$Y = \phi X + N, \qquad N \perp\!\!\!\perp X, \qquad \phi \neq 0$$

holds. Then we can reverse the process, i.e., there exists $\psi \in \mathbb{R}$ and a noise $\tilde{N}$, such that

$$X = \psi Y + \tilde{N}, \qquad \tilde{N} \perp\!\!\!\perp Y,$$

if and only if $X$ and $N$ are Gaussian distributed. Shimizu et al. (2006) were the first to report this result. They prove it even for more than two variables using Independent Component Analysis (ICA) (Comon, 1994, Theorem 11), which itself is proved using the Darmois-Skitovič theorem (Skitovič, 1954, 1962; Darmois, 1953). Alternatively, Proposition 2.3 can be proved directly using the Darmois-Skitovič theorem (e.g., Peters, 2008, Theorem 2.10).

[Shimizu et al. (2006)] Assume a linear SEM with graph $\mathcal{G}_0$

$$X_j = \sum_{k \in \mathbf{PA}^{\mathcal{G}_0}_j} \beta_{jk} X_k + N_j, \qquad j = 1, \ldots, p \qquad (4)$$

where all $N_j$ are jointly independent and non-Gaussian distributed. Additionally, for each $j$ we require $\beta_{jk} \neq 0$ for all $k \in \mathbf{PA}^{\mathcal{G}_0}_j$. Then, the graph $\mathcal{G}_0$ is identifiable from the joint distribution.

The authors call this model a linear non-Gaussian acyclic model (LiNGAM) and provide a practical method based on ICA that can be applied to a finite amount of data. Later, improved versions of this method have been proposed in (Shimizu et al., 2011; Hyvärinen and Smith, 2013).
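The asymmetry behind the bivariate result can be illustrated with a toy computation. As an assumption made purely for exactness, $X$ and $N$ are taken to be independent and uniform on $\{-1, +1\}$ (a discrete, non-Gaussian stand-in for the continuous setting of this paper) with $\phi = 1$. Any backward model with independent noise must use the least-squares coefficient $\psi = \operatorname{cov}(X, Y) / \operatorname{var}(Y)$, since independence implies zero covariance; the sketch shows that the resulting residual is uncorrelated with $Y$ but not independent of it.

```python
from fractions import Fraction as F
from itertools import product

# Assumed toy model: X, N independent and uniform on {-1, +1}; Y = X + N.
outcomes = [(F(x), F(x + n), F(1, 4)) for x, n in product((-1, 1), repeat=2)]

def expect(g):
    """Exact expectation of g(X, Y)."""
    return sum(p * g(x, y) for x, y, p in outcomes)

def cov(g, h):
    """Exact covariance of g(X, Y) and h(X, Y)."""
    return expect(lambda x, y: g(x, y) * h(x, y)) - expect(g) * expect(h)

# Only candidate coefficient for a backward model X = psi * Y + residual.
psi = cov(lambda x, y: x, lambda x, y: y) / cov(lambda x, y: y, lambda x, y: y)

def residual(x, y):
    return x - psi * y

cov_linear = cov(residual, lambda x, y: y)
cov_squares = cov(lambda x, y: residual(x, y) ** 2, lambda x, y: y ** 2)

print(psi, cov_linear, cov_squares)
```

A nonzero covariance between the squared residual and $Y^2$ rules out independence, so no backward linear model with independent noise exists for this distribution.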

### 2.4 Estimating the DAG: Gaussian SEMs with Equal Error Variances

There is another deviation from linear Gaussian SEMs that makes the graph identifiable. Peters and Bühlmann (2014) show that restricting the error (or noise) variables to have the same variance is sufficient to recover the graph structure.

[Peters and Bühlmann (2014)] Assume an SEM with graph $\mathcal{G}_0$

$$X_j = \sum_{k \in \mathbf{PA}^{\mathcal{G}_0}_j} \beta_{jk} X_k + N_j, \qquad j = 1, \ldots, p \qquad (5)$$

where all $N_j$ are i.i.d. and follow a Gaussian distribution. Additionally, for each $j$ we require $\beta_{jk} \neq 0$ for all $k \in \mathbf{PA}^{\mathcal{G}_0}_j$. Then, the graph $\mathcal{G}_0$ is identifiable from the joint distribution.

For estimating the coefficients $\beta_{jk}$ and the error variance $\sigma^2$, Peters and Bühlmann (2014) propose to use a penalized maximum likelihood method (BIC). For optimization they propose a greedy search algorithm in the space of DAGs. Rescaling the variables changes the variances of the error terms. Therefore, in many applications Theorem 2.4 cannot be sensibly applied. The BIC criterion, however, always allows us to compare the method’s score with the score of a linear Gaussian SEM that uses more parameters and does not make the assumption of equal error variances.

## 3 Identifiability of Continuous Additive Noise Models

Recall that equation (2) defines the general form of an SEM: $X_j = f_j(\mathbf{PA}_j, N_j)$, $j = 1, \ldots, p$, with jointly independent variables $N_j$. We have seen that these models are too general to identify the graph (Proposition 4). It turns out, however, that constraining the function class leads to identifiability. As a first step we restrict the form of the function to be additive with respect to the noise variable:

$$X_j = f_j(\mathbf{PA}_j) + N_j, \qquad j = 1, \ldots, p \tag{6}$$

and assume that all noise variables $N_j$ have a strictly positive density. For those models with strictly positive density, causal minimality reduces to the condition that each function $f_j$ is not constant in any of its arguments.

Consider a distribution generated by a model (6) and assume that the functions $f_j$ are not constant in any of their arguments, i.e., for all $j$ and $i \in \mathbf{PA}_j$ there are some $x_{\mathbf{PA}_j \setminus \{i\}}$ and some $x_i \neq x'_i$ such that

$$f_j(x_{\mathbf{PA}_j \setminus \{i\}}, x_i) \neq f_j(x_{\mathbf{PA}_j \setminus \{i\}}, x'_i).$$

Then the joint distribution satisfies causal minimality with respect to the corresponding graph. Conversely, if there is a $j$ and $i$ such that $f_j(x_{\mathbf{PA}_j \setminus \{i\}}, \cdot)$ is constant, causal minimality is violated. See Appendix A.4.

Linear functions and Gaussian variables identify only the correct Markov equivalence class and not necessarily the correct graph. In the remainder of this section we establish results showing that this is an exceptional case. We develop conditions that guarantee the identifiability of the DAG. Proposition 3.1 indicates that this condition is rather weak.

Throughout this section we assume that all random variables are absolutely continuous with respect to the Lebesgue measure. Peters et al. (2011a) provide an extension for variables that are absolutely continuous with respect to the counting measure.
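To make the model class (6) concrete, the following minimal sketch draws samples from a small additive noise model by ancestral sampling. The helper `sample_anm`, the three-node graph, the functions and the Gaussian noise are all hypothetical choices made for illustration; none of them come from the paper.

```python
import numpy as np


def sample_anm(parents, functions, noise_samplers, n, rng):
    """Draw n samples from X_j = f_j(PA_j) + N_j.

    `parents` must list the nodes in a topological order of the DAG,
    so that every parent is sampled before its children.
    """
    data = {}
    for j in parents:  # dicts preserve insertion order
        parent_values = [data[k] for k in parents[j]]
        data[j] = functions[j](*parent_values) + noise_samplers[j](rng, n)
    return data


# Hypothetical example graph: 1 -> 2, 1 -> 3, 2 -> 3.
parents = {1: [], 2: [1], 3: [1, 2]}
functions = {
    1: lambda: 0.0,                        # source node: X_1 = N_1
    2: lambda x1: np.tanh(2.0 * x1),       # non-constant in its argument
    3: lambda x1, x2: x1**2 + np.sin(x2),  # non-constant in both arguments
}
# Jointly independent noise variables with strictly positive density.
noise_samplers = {j: (lambda rng, n: rng.normal(0.0, 1.0, n)) for j in parents}

rng = np.random.default_rng(0)
samples = sample_anm(parents, functions, noise_samplers, n=1000, rng=rng)
print({j: values.shape for j, values in samples.items()})
```

Each function here depends on every one of its arguments, which is the condition under which causal minimality holds for models of the form (6).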

### 3.1 Bivariate Additive Noise Models

We now add another assumption about the form of the structural equations. Consider an additive noise model (6) with two variables, i.e., the two equations $X_i = N_i$ and $X_j = f_j(X_i) + N_j$ with $\{i, j\} = \{1, 2\}$. We call this SEM an identifiable bivariate additive noise model if the triple $(f_j, \mathcal{L}(X_i), \mathcal{L}(N_j))$ satisfies Condition 3.1. In particular, we require the noise variables to have strictly positive densities.

###### Condition

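The triple $(f_j, \mathcal{L}(X_i), \mathcal{L}(N_j))$ does not solve the following differential equation for all $x_i, x_j$ with $\nu''(x_j - f(x_i))\, f'(x_i) \neq 0$:

$$\xi''' = \xi''\left(-\frac{\nu''' f'}{\nu''} + \frac{f''}{f'}\right) - 2\nu'' f'' f' + \nu' f''' + \frac{\nu' \nu''' f'' f'}{\nu''} - \frac{\nu' (f'')^2}{f'}. \tag{7}$$

Here, $f := f_j$, and $\xi := \log p_{X_i}$ and $\nu := \log p_{N_j}$ are the logarithms of the strictly positive densities. To improve readability, we have skipped the arguments $x_j - f(x_i)$, $x_i$, and $x_i$ for $\nu$, $\xi$ and $f$ and their derivatives, respectively.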
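To illustrate how restrictive equation (7) is, the Gaussian case can be worked out by hand. The following is a sketch under assumptions that are ours and is not a reproduction of the appendix proof: input and noise are taken to be centered Gaussians with variances $\tau^2$ and $\sigma^2$ (hypothetical symbols), so that $\xi'' = -1/\tau^2$, $\nu'' = -1/\sigma^2$, $\nu' = -(x_j - f(x_i))/\sigma^2$ and $\xi''' = \nu''' = 0$. Equation (7) then reduces to

$$0 = f''\left(\frac{\xi''}{f'} - 2\nu'' f'\right) + \nu'\left(f''' - \frac{(f'')^2}{f'}\right).$$

For a fixed $x_i$ with $f'(x_i) \neq 0$, the factor $\nu'$ takes every real value as $x_j$ varies, while all other terms do not depend on $x_j$. Both parts must therefore vanish separately:

$$f''\left(\frac{2 f'}{\sigma^2} - \frac{1}{\tau^2 f'}\right) = 0, \qquad f''' = \frac{(f'')^2}{f'}.$$

The first equation forces $f'' = 0$ wherever $(f')^2 \neq \sigma^2 / (2\tau^2)$, and on any interval where $(f')^2$ equals this constant, $f'$ is constant and $f''$ vanishes as well. Under these assumptions a Gaussian triple can thus solve (7) only if $f$ is linear on every interval where $f' \neq 0$.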

Zhang and Hyvärinen (2009) even allow for a bijective transformation of the data, i.e., $X_j = g_j(f_j(X_i) + N_j)$, and obtain a differential equation similar to (7).

As the name in Definition 3.1 already suggests, we have identifiability for this class of SEMs.

Let $\mathcal{L}(X_1, X_2)$ be generated by an identifiable bivariate additive noise model with graph $\mathcal{G}_0$ and assume causal minimality, i.e., a non-constant function $f_j$ (Proposition 6). Then, $\mathcal{G}_0$ is identifiable from the joint distribution. The proof of Hoyer et al. (2009) is reproduced in Appendix A.5.

Intuitively speaking, we expect a “generic” triple $(f_j, \mathcal{L}(X_i), \mathcal{L}(N_j))$ to satisfy Condition 3.1. The following proposition presents one possible formalization. After fixing $(f_j, \mathcal{L}(N_j))$ we consider the space of all distributions $\mathcal{L}(X_i)$ such that Condition 3.1 is violated. This space is contained in a three-dimensional space. Since the space of continuous distributions is infinite dimensional, we can therefore say that Condition 3.1 is satisfied for “most distributions” $\mathcal{L}(X_i)$.

If for a fixed pair $(f_j, \mathcal{L}(N_j))$ there exists $x_j \in \mathbb{R}$ such that $\nu''(x_j - f(x_i))\, f'(x_i) \neq 0$ for all but a countable set of points $x_i \in \mathbb{R}$, the set of all $\mathcal{L}(X_i)$ for which $(f_j, \mathcal{L}(X_i), \mathcal{L}(N_j))$ does not satisfy Condition 3.1 is contained in a 3-dimensional space. The condition $\nu''(x_j - f(x_i))\, f'(x_i) \neq 0$ holds for all $x_i$ if there is no interval where $f$ is constant and the logarithm of the noise density is not linear, for example. See Appendix A.6.

In the case of Gaussian variables, the differential equation (7) simplifies. We thus have the following result. If $X_i$ and $N_j$ follow a Gaussian distribution and $(f_j, \mathcal{L}(X_i), \mathcal{L}(N_j))$ does not satisfy Condition 3.1, then $f_j$ is linear. See Appendix A.7.

Although non-identifiable cases are rare, the question remains when identifiability is violated. Zhang and Hyvärinen (2009) prove that non-identifiable additive noise models necessarily fall into one of five classes.

[Zhang and Hyvärinen (2009)] Consider $X_2 = f(X_1) + N_2$ with fully supported noise variable $N_2$ that is independent of $X_1$ and three times differentiable function $f$. Let further $\frac{d}{dx_1} f(x_1)\, \frac{d^2}{dx_2^2} \log p_{N_2}(x_2) = 0$ only at finitely many points $(x_1, x_2)$. If there is a backward model, i.e., we can write $X_1 = g(X_2) + M_1$ with $M_1$ independent of $X_2$, then one of the following must hold.

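• (I) $X_1$ is Gaussian, $N_2$ is Gaussian and $f$ is linear.

• (II) $X_1$ is log-mix-lin-exp, $N_2$ is log-mix-lin-exp and $f$ is linear.

• (III) $X_1$ is log-mix-lin-exp, $N_2$ is one-sided asymptotically exponential and $f$ is strictly monotonic with $f'(x_1) \to 0$ as $x_1 \to \infty$ or as $x_1 \to -\infty$.

• (IV) $X_1$ is log-mix-lin-exp, $N_2$ is a generalized mixture of two exponentials and $f$ is strictly monotonic with $f'(x_1) \to 0$ as $x_1 \to \infty$ or as $x_1 \to -\infty$.

• (V) $X_1$ is a generalized mixture of two exponentials, $N_2$ is two-sided asymptotically exponential and $f$ is strictly monotonic with $f'(x_1) \to 0$ as $x_1 \to \infty$ or as $x_1 \to -\infty$.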

Precise definitions can be found in Appendix A.8. In particular, we obtain identifiability whenever the function $f$ is not injective. Proposition 3.1 states that belonging to one of these classes is a necessary condition for non-identifiability. We now show sufficiency for two classes. The linear Gaussian case is well-known and easy to prove.

###### Example

Let $X_2 = a X_1 + N_2$ with independent $N_2 \sim \mathcal{N}(0, \sigma^2)$ and $X_1 \sim \mathcal{N}(0, \tau^2)$. We can then consider all variables in $L^2$ and project $X_1$ onto $X_2$. This leads to an orthogonal decomposition $X_1 = \tilde{a} X_2 + \tilde{N}_1$. Since for jointly Gaussian variables uncorrelatedness implies independence, we obtain a backward additive noise model. Figure 2 (left) shows the joint density and the functions for the forward and backward model.
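The backward model in this example can be written out explicitly. The following short calculation uses assumed notation: a forward model $X_2 = a X_1 + N_2$ with independent $X_1 \sim \mathcal{N}(0, \tau^2)$ and $N_2 \sim \mathcal{N}(0, \sigma^2)$, where the symbols $a$, $\tau$, $\sigma$, $\tilde{a}$ and $\tilde{N}_1$ are our labels.

$$\tilde{a} = \frac{\operatorname{cov}(X_1, X_2)}{\operatorname{var}(X_2)} = \frac{a \tau^2}{a^2 \tau^2 + \sigma^2}, \qquad \tilde{N}_1 := X_1 - \tilde{a} X_2, \qquad \operatorname{cov}(\tilde{N}_1, X_2) = a \tau^2 - \tilde{a}\,(a^2 \tau^2 + \sigma^2) = 0 .$$

Because $(\tilde{N}_1, X_2)$ is jointly Gaussian, zero covariance implies independence, so $X_1 = \tilde{a} X_2 + \tilde{N}_1$ is a valid backward additive noise model.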

We also give an example of a non-identifiable additive noise model with non-Gaussian distributions, where the forward model is described by case II and the backward model by case IV:

###### Example

Let with independent log-mix-lin-exp and , i.e., we have the log-densities

$$\xi(x) = \log p_{X_2}(x) = c_1 \exp(c_2 x) + c_3 x + c_4$$

and

$$\nu(n) = \log p_{N_2}(n) = \gamma_1 \exp(\gamma_2 n) + \gamma_3 n + \gamma_4 .$$

Then

is a generalized mixture of exponential distributions. If and only if

and we obtain a valid backward model with log-mix-lin-exp . Again, Figure 2 (right) shows the joint distribution over and and forward and backward functions.

See Appendix A.9.

Example 3.1 shows how parameters of function, input and noise distribution have to be “fine-tuned” to yield non-identifiability (Janzing and Steudel, 2010).

It can be shown that bivariate identifiability even holds generically when causal feedback is allowed (i.e., if each of the two variables causes the other), at least when assuming noise and input distributions to be Gaussian (Mooij et al., 2011).
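The bivariate identifiability result suggests a simple way to compare the two causal directions on data: regress each variable on the other and check in which direction the residuals look independent of the input. The following is a minimal sketch of that idea and not the authors' implementation. The polynomial regression, the biased HSIC statistic with a median-heuristic Gaussian kernel, the helper names `gaussian_gram`, `hsic` and `anm_score`, and the simulated data are all our assumptions.

```python
import numpy as np


def gaussian_gram(v):
    """Gaussian kernel Gram matrix; bandwidth set by the median heuristic."""
    d2 = (v[:, None] - v[None, :]) ** 2
    sigma2 = np.median(d2[d2 > 0])
    return np.exp(-d2 / (2.0 * sigma2))


def hsic(a, b):
    """Biased empirical HSIC statistic; larger values indicate dependence."""
    n = len(a)
    K, L = gaussian_gram(a), gaussian_gram(b)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.sum((H @ K @ H) * L) / n**2


def anm_score(cause, effect, degree=5):
    """Regress effect on cause, then measure dependence between the
    residuals and the putative cause (lower is better)."""
    coeffs = np.polyfit(cause, effect, degree)
    residuals = effect - np.polyval(coeffs, cause)
    return hsic(cause, residuals)


# Simulated data from a nonlinear additive noise model X -> Y.
rng = np.random.default_rng(0)
n = 400
x = rng.uniform(-2.0, 2.0, n)
y = x**3 + rng.uniform(-1.0, 1.0, n)

score_xy = anm_score(x, y)  # candidate direction X -> Y
score_yx = anm_score(y, x)  # candidate direction Y -> X
direction = "X -> Y" if score_xy < score_yx else "Y -> X"
print(score_xy, score_yx, direction)
```

Comparing raw scores is a simplification: a full procedure would turn the dependence measure into an independence test with a significance level and could decline to decide when both directions are rejected or both are accepted.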

### 3.2 From Bivariate to Multivariate Models

It turns out that Condition 3.1 also suffices to prove identifiability in the multivariate case. Assume we are given structural equations as in (6). If we fix all arguments of the functions except for one parent and the noise variable, we obtain a bivariate model. One may expect that it suffices to put restrictions like Condition 3.1 on this triple of function, input and noise distribution. This is not the case.

###### Example

Consider the following SEM

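$$X_1 = N_1, \qquad X_2 = f_2(X_1) + N_2, \qquad X_3 = f_3(X_1) + a \cdot X_2 + N_3$$

where $N_1$ is uniformly distributed on an interval and $N_2$ and $N_3$ are normally distributed. The variables $X_2$ and $X_3$ themselves are non-Gaussian but

$$X_3|_{X_1 = x_1} = c + a \cdot X_2|_{X_1 = x_1} + N_3$$

is a linear Gaussian equation for all $x_1$. We can reverse this equation and obtain the same joint distribution by an SEM of the form

$$X_1 = M_1, \qquad X_2 = g_2(X_1) + b \cdot X_3 + M_2, \qquad X_3 = g_3(X_1) + M_3$$

for suitable $g_2$, $g_3$, $b$ and noise variables $M_1$, $M_2$, $M_3$. Thus, the DAG is not identifiable from the joint distribution.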

Instead, we need to put restrictions on conditional distributions. Consider an additive noise model (6) with $p$ variables. We call this SEM a restricted additive noise model if for all $j \in \mathbf{V}$, $i \in \mathbf{PA}_j$ and all sets $\mathbf{S} \subseteq \mathbf{V}$ with $\mathbf{PA}_j \setminus \{i\} \subseteq \mathbf{S} \subseteq \mathbf{ND}_j \setminus \{i, j\}$, there is an $x_{\mathbf{S}}$ with $p_{\mathbf{S}}(x_{\mathbf{S}}) > 0$, s.t.

$$\Bigl(f_j(x_{\mathbf{PA}_j \setminus \{i\}}, \underbrace{\cdot}_{X_i}),\; \mathcal{L}(X_i \mid X_{\mathbf{S}} = x_{\mathbf{S}}),\; \mathcal{L}(N_j)\Bigr)$$

satisfies Condition 3.1. Here, the underbrace indicates the input component of $f_j$ for variable $X_i$. In particular, we require the noise variables to have non-vanishing densities and the functions $f_j$ to be continuous and three times continuously differentiable.

Assuming causal minimality, we can identify the structure of the SEM from the distribution. Let $\mathcal{L}(\mathbf{X})$ be generated by a restricted additive noise model with graph $\mathcal{G}_0$ and assume that $\mathcal{L}(\mathbf{X})$ satisfies causal minimality with respect to $\mathcal{G}_0$, i.e., the functions $f_j$ are not constant (Proposition 6). Then, $\mathcal{G}_0$ is identifiable from the joint distribution. See Appendix A.11.

Our proof of Theorem 3.2 contains a graphical statement that turns out to be a main argument for proving identifiability for Gaussian models with equal error variances (Peters and Bühlmann, 2014). We thus state it explicitly as a proposition. Let $\mathcal{G}$ and $\mathcal{G}'$ be two different DAGs over variables $\mathbf{X}$.

1. Assume that has a strictly positive density and satisfies the Markov condition and causal minimality with respect to and . Then there are variables such that for the sets ,