On the Intersection Property of Conditional Independence and its Application to Causal Discovery

03/03/2014
by Jonas Peters, et al.
ETH Zurich

This work investigates the intersection property of conditional independence. It states that for random variables $A$, $B$, $C$ and $X$, if $X$ is independent of $A$ given $(B, C)$ and $X$ is independent of $B$ given $(A, C)$, then $X$ is independent of $(A, B)$ given $C$. Under the assumption that the joint distribution has a continuous density, we provide necessary and sufficient conditions under which the intersection property holds. The result has direct applications to causal inference: it leads to strictly weaker conditions under which the graphical structure becomes identifiable from the joint distribution of an additive noise model.

1 Introduction

1.1 Application to Causal Inference

Inferring causal relationships is a major challenge in science. In recent decades, considerable effort has been made to learn causal statements from observational data. Causal discovery methods make assumptions that relate the joint distribution to properties of the causal graph. Constraint-based or independence-based methods (Pearl, 2009; Spirtes et al., 2000) and some score-based methods (Chickering, 2002; Heckerman et al., 1999) assume the Markov condition and faithfulness. A distribution is said to be Markov with respect to a directed acyclic graph (DAG) if each d-separation in the graph implies the corresponding (conditional) independence; the distribution is faithful with respect to the graph if the reverse statement holds. These two assumptions render the Markov equivalence class of the correct graph identifiable from the joint distribution, i.e. the skeleton and the v-structures of the graph can be inferred from the joint distribution (Verma and Pearl, 1991). Methods like LiNGAM (Shimizu et al., 2006) or additive noise models (Hoyer et al., 2009; Peters et al., 2013) assume the Markov condition, too, but do not require faithfulness; instead, these methods assume that the structural equations come from a restricted model class (e.g. linear with non-Gaussian noise or non-linear with additive Gaussian noise). In order to prove that the DAG is identifiable from the joint distribution, Peters et al. (2013) require a strictly positive density. Their proof makes use of the intersection property of conditional independence (Definition 2), which is known to hold for positive densities (e.g. Pearl, 2009, 1.1.5).

1.2 Main Contributions

In Section 3 we provide a necessary and sufficient condition on the density for the intersection property to hold (Corollary 1). This result is of independent interest since the developed condition is weaker than strict positivity.

As mentioned above, some causal discovery methods based on structural equation models require the intersection property for identification; they therefore rely on strict positivity of the density, which can be ensured by fully supported noise variables, for example. Using the new characterization of the intersection property, we can now relax the condition of strict positivity. In fact, we show in Section 4 that noise variables with path-connected support are sufficient for identifiability of the graph (Proposition 3). This was already known for linear structural equation models (Shimizu et al., 2006), but not for non-linear models. As an alternative, we provide a condition that excludes constant functions and leads to identifiability, too (Proposition 4).

In Section 2, we provide an example of a structural equation model that violates the intersection property (but satisfies causal minimality). Its corresponding graph is not identifiable from the joint distribution. In correspondence with the theoretical results of this work, some noise densities in the example do not have path-connected support and the functions are partially constant. We are not aware of any causal discovery method that is able to infer the correct DAG or the correct Markov equivalence class; the example therefore shows current limits of causal inference techniques. It is non-generic in the sense that it violates all sufficient assumptions mentioned in Section 4.

1.3 Conditional Independence and the Intersection Property

We now formally introduce the concept of conditional independence in the presence of densities, as well as the intersection property. Let therefore $X$, $A$, $B$ and $C$ be (possibly multi-dimensional) random variables that take values in metric spaces $\mathcal{X}$, $\mathcal{A}$, $\mathcal{B}$ and $\mathcal{C}$, respectively. We first introduce assumptions regarding the existence of a density and some of its properties that appear in different parts of this paper.

  • (A0) The distribution of $(X, A, B, C)$ is absolutely continuous with respect to a product measure of a metric space. We denote the density by $p$. This can be a probability mass function or a probability density function, for example.

  • (A1) The density $p$ is continuous.

  • (A2) For each $c$ with $p(c) > 0$ the support $\mathcal{S}_c$ of $(A, B)$ given $C = c$ contains only one path-connected component (see Definition 3).

  • (A2') The density $p$ is strictly positive.

Condition (A2') implies (A2). We assume (A0) throughout this work.

In this paper we work with the following definition of conditional independence.

Definition 1 (Conditional Independence).

We call $X$ independent of $A$ conditional on $B$ and write $X \perp\!\!\!\perp A \mid B$ if and only if

$p(x, a \mid b) = p(x \mid b) \, p(a \mid b) \qquad (1)$

for all $x, a, b$ such that $p(b) > 0$.

The intersection property of conditional independence is defined as follows (e.g. Pearl, 2009, 1.1.5).

Definition 2 (Intersection Property).

We say that the joint distribution of $(X, A, B, C)$ satisfies the intersection property if

$X \perp\!\!\!\perp A \mid (B, C) \ \text{ and } \ X \perp\!\!\!\perp B \mid (A, C) \ \Longrightarrow \ X \perp\!\!\!\perp (A, B) \mid C. \qquad (2)$

The intersection property (2) has been proven to hold for strictly positive densities (e.g. Pearl, 2009, 1.1.5). It is also known that the intersection property does not necessarily hold if the joint distribution does not have a density (e.g. Dawid, 1979b). Dawid (1980) provides measure-theoretic necessary and sufficient conditions for the intersection property. In this work we assume the existence of a density (A0) and provide more detailed conditions under which the intersection property holds.
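To make these definitions concrete, the following minimal sketch (in Python, assuming NumPy) checks the factorization (1) for discrete distributions and shows how both premises of (2) can hold while the conclusion fails once strict positivity is dropped. The helper is_ci and the toy probability mass function are hypothetical constructions for illustration, not objects from the paper.

```python
# A minimal numerical sketch of Definition 1 and the intersection property (2)
# for discrete distributions; the pmf below is a hypothetical toy example.
import numpy as np

def is_ci(p, tol=1e-12):
    """Check X independent of A given B for a pmf p[x, a, b],
    i.e. p(x, a | b) = p(x | b) p(a | b) whenever p(b) > 0 (Definition 1)."""
    p_b = p.sum(axis=(0, 1))                                     # p(b)
    for b in range(p.shape[2]):
        if p_b[b] <= 0:
            continue                                             # only b with p(b) > 0 matter
        joint = p[:, :, b] / p_b[b]                              # p(x, a | b)
        outer = np.outer(joint.sum(axis=1), joint.sum(axis=0))   # p(x | b) p(a | b)
        if not np.allclose(joint, outer, atol=tol):
            return False
    return True

# Toy pmf over (X, A, B) with "block" support: (A, B) lives on two diagonal
# blocks and X is determined by the block index (C is trivial here).
p = np.zeros((2, 2, 2))
p[0, 0, 0] = 0.5      # block 1: A = 0, B = 0, X = 0
p[1, 1, 1] = 0.5      # block 2: A = 1, B = 1, X = 1

# Both premises hold: X indep A | B and X indep B | A ...
print(is_ci(p), is_ci(p.transpose(0, 2, 1)))                     # True True

# ... but the conclusion X indep (A, B) fails, since X is a function of (A, B).
p_x_ab = p.reshape(2, 4)                                         # collapse (A, B)
indep = np.allclose(p_x_ab, np.outer(p_x_ab.sum(axis=1), p_x_ab.sum(axis=0)))
print(indep)                                                     # False
```

For a strictly positive pmf such a configuration cannot occur; this is the classical positivity result cited above.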

2 Counterexample

We now give an example of a distribution that does not satisfy the intersection property (2). Since the joint distribution is absolutely continuous with respect to the Lebesgue measure, the example shows that the intersection property requires further restrictions on the density apart from its existence. We will later use the same idea to prove Proposition 2, which shows the necessity of our new condition.

Example 1.

Consider a structural equation model for random variables $A$, $B$ and $X$ with

$X = f(A, B) + N_X$

and noise variables $N_A$, $N_B$, $N_X$ being jointly independent; the equations for $A$ and $B$ are chosen such that the support of $(A, B)$ consists of the two path-connected components shown in black in Figure 1. Let the function $f$ take the value $10$ on a neighbourhood of one component and $-10$ on a neighbourhood of the other, where the interpolating function $g$ used between the two regions can be chosen to make $f$ arbitrarily smooth. Some parts of this structural equation model are summarized in Figure 1. We clearly have $X \perp\!\!\!\perp A \mid B$ and $X \perp\!\!\!\perp B \mid A$, but $X \not\perp\!\!\!\perp A$ and $X \not\perp\!\!\!\perp B$. A formal proof of this statement is provided in the more general setting of Proposition 2. It will turn out to be important that the two connected components of the support of $A$ and $B$ cannot be connected by an axis-parallel line. In the notation introduced in Definition 3 below, this means that the components $\mathcal{S}_1$ and $\mathcal{S}_2$ are not equivalent. Within each component, however, that is, if we consider the areas $\mathcal{S}_1$ and $\mathcal{S}_2$ separately, we do have the independence statement that is predicted by the intersection property. This observation will be formalized as the weak intersection property in Proposition 1.

Figure 1: Example 1. The plot on the left hand side shows the support of the variables $A$ and $B$ in black; in the areas filled with dark gray and light gray, the function $f$ takes the values ten and minus ten, respectively. The plot on the right shows two graphs: the structural equation model corresponds to one of them (the correct DAG), but the distribution can also be generated by a structural equation model with the other (the alternative DAG).

Example 1 has the following important implication for causal inference. The distribution satisfies causal minimality with respect to two different graphs, namely the two DAGs shown in Figure 1. Since it violates faithfulness and the intersection property, we are not aware of any causal inference method that is able to recover the correct graph structure based on observational data only. Recall that Peters et al. (2013) assume strictly positive densities in order to ensure the intersection property. More precisely, the example shows that Lemma 37 in Peters et al. (2013) no longer holds when positivity is violated.
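The following simulation sketch illustrates the example empirically. The concrete supports $[0,1]^2$ and $[2,3]^2$, the uniform and Gaussian noise laws and the sample-based checks are hypothetical choices that are merely consistent with the description above and with Figure 1.

```python
# A simulation sketch of a concrete instance consistent with Example 1:
# the support of (A, B) consists of two diagonal squares that no axis-parallel
# line can connect, and f takes the values +10 and -10 on the two components.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

component = rng.integers(0, 2, size=n)         # which square a sample lies in
a = rng.uniform(0, 1, size=n) + 2 * component  # A in [0,1] or [2,3]
b = rng.uniform(0, 1, size=n) + 2 * component  # B in the matching square

f = np.where(component == 1, 10.0, -10.0)      # f constant on each component
x = f + rng.normal(0.0, 1.0, size=n)           # X = f(A, B) + N_X

# Marginally, X depends on A: the conditional means differ by about 20.
print(x[a < 1.5].mean(), x[a > 1.5].mean())    # approx. -10 vs +10

# Conditionally on B, X carries no further information about A: a thin slice
# of B fixes the component (and hence f), so the mean of X does not vary
# with A inside the slice.
s = (b > 0.4) & (b < 0.6)                      # forces component == 0
print(x[s & (a < 0.5)].mean(), x[s & (a >= 0.5)].mean())  # both approx. -10
```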

3 Necessary and Sufficient Condition for the Intersection Property

This section characterizes the intersection property in terms of the joint density over the corresponding random variables. In particular, we state a weak intersection property (Proposition 1) that leads to a necessary and sufficient condition for the classical intersection property, see Corollary 1. For these results, the notion of path-connectedness becomes important. A continuous mapping $\gamma : [0, 1] \to \mathcal{M}$ into a metric space $\mathcal{M}$ is called a path between $\gamma(0)$ and $\gamma(1)$ in $\mathcal{M}$. A subset $\mathcal{S} \subseteq \mathcal{M}$ is called path-connected if every pair of points from $\mathcal{S}$ can be connected by a path in $\mathcal{S}$. We require the following definition.

Definition 3.
  • (i) For each $c$ with $p(c) > 0$ we consider the (not necessarily closed) support of $A$ and $B$ given $C = c$:

    $\mathcal{S}_c := \{(a, b) \in \mathcal{A} \times \mathcal{B} : p(a, b \mid c) > 0\}.$

    We further write, for all sets $\mathcal{U} \subseteq \mathcal{A} \times \mathcal{B}$, $\mathcal{U}^A$ and $\mathcal{U}^B$ for the projections of $\mathcal{U}$ onto $\mathcal{A}$ and $\mathcal{B}$, respectively.

  • (ii) We denote the path-connected components of $\mathcal{S}_c$ by $\mathcal{S}_{c,1}, \mathcal{S}_{c,2}, \ldots$ Two path-connected components $\mathcal{S}_{c,i}$ and $\mathcal{S}_{c,j}$ are said to be coordinate-wise connected if

    $\mathcal{S}_{c,i}^A \cap \mathcal{S}_{c,j}^A \neq \emptyset \quad \text{or} \quad \mathcal{S}_{c,i}^B \cap \mathcal{S}_{c,j}^B \neq \emptyset.$

    We then say that $\mathcal{S}_{c,i}$ and $\mathcal{S}_{c,j}$ are equivalent if and only if there is a sequence of path-connected components starting in $\mathcal{S}_{c,i}$ and ending in $\mathcal{S}_{c,j}$ with any two neighbours being coordinate-wise connected. We represent these equivalence classes by the union of all their members; these unions we denote by $\mathcal{U}_{c,1}, \ldots, \mathcal{U}_{c,m_c}$.

    We further introduce a variable $Z$ that is a deterministic function of the variables $A$ and $C$ (equivalently, of $B$ and $C$). We set

    $Z := j \quad \text{if } (A, B) \in \mathcal{U}_{C,j}.$

    We have that $Z = j$ if and only if $A \in \mathcal{U}_{C,j}^A$ if and only if $B \in \mathcal{U}_{C,j}^B$.

    Note that the projections $\mathcal{U}_{c,j}^A$ are disjoint (for different $j$); similarly for the projections $\mathcal{U}_{c,j}^B$.

  • (iii) The case where there is no variable $C$ can be treated as if $C$ was deterministic: $C \equiv c$ for some fixed value $c$; we then drop the index $c$ from the notation.

In Example 1 there is no variable $C$. Figure 1 shows the support of $(A, B)$ in black. It contains two path-connected components. Since they cannot be connected by axis-parallel lines, they are not equivalent; thus, one of them corresponds to $\mathcal{U}_1$ and the other to $\mathcal{U}_2$. Figure 2 shows another example that contains three equivalence classes of path-connected components; again, there is no variable $C$, so we formally introduce a deterministic variable $C$ that always takes the same value $c$.

Figure 2: Each block represents one path-connected component of the support of $(A, B)$. All blocks with the same filling are equivalent since they can be connected by axis-parallel lines. There are three different fillings corresponding to the equivalence classes $\mathcal{U}_1$, $\mathcal{U}_2$ and $\mathcal{U}_3$.
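For intuition, the following sketch discretizes Definition 3: it labels the path-connected components of a support mask on a grid and merges components whose projections onto the $A$- or $B$-axis overlap (coordinate-wise connection), using a union-find structure for the equivalence classes. The mask is a hypothetical toy in the spirit of Figure 2, and SciPy's label function is assumed to be available.

```python
# Equivalence classes of path-connected components (Definition 3, discretized):
# merge grid components that share a row (same A-projection) or a column
# (same B-projection), i.e. that an axis-parallel line can connect.
import numpy as np
from scipy.ndimage import label   # assumes SciPy is available

def equivalence_classes(mask):
    comp, n = label(mask)                     # 4-connected components, labels 1..n
    parent = list(range(n + 1))               # union-find over component labels
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    def union(i, j):
        parent[find(i)] = find(j)
    for line in list(comp) + list(comp.T):    # rows, then columns
        ids = np.unique(line[line > 0])
        for k in ids[1:]:
            union(int(ids[0]), int(k))        # merge components sharing this line
    classes = {}
    for i in range(1, n + 1):
        classes.setdefault(find(i), []).append(i)
    return list(classes.values())

mask = np.zeros((6, 6), dtype=bool)
mask[0:2, 0:2] = True     # block 1
mask[0:2, 4:6] = True     # block 2: shares rows with block 1 -> same class
mask[4:6, 2:4] = True     # block 3: shares no row or column -> own class
print(equivalence_classes(mask))              # [[1, 2], [3]]
```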

Using Definition 3 we are now able to state the two main results, Propositions 1 and 2. As a direct consequence we obtain Corollary 1 which generalizes the condition of strictly positive densities.

Proposition 1 (Weak Intersection Property).

Assume (A0), (A1) and that $X \perp\!\!\!\perp A \mid (B, C)$ and $X \perp\!\!\!\perp B \mid (A, C)$. Consider now $c$ with $p(c) > 0$ and the variable $Z$ as defined in Definition 3(ii). We then have the weak intersection property:

$X \perp\!\!\!\perp (A, B) \mid (C, Z).$

This means that

$p(x \mid a, b, c) = p(x \mid c, z)$

for all $x$ and all $(a, b) \in \mathcal{U}_{c,z}$ with $p(a, b \mid c) > 0$.

Proposition 2 (Failure of Intersection Property).

Assume (A0), (A1) and that there are two different sets $\mathcal{U}_{c,i} \neq \mathcal{U}_{c,j}$ for some $c$ with $p(c) > 0$. Then there is a random variable $X$ such that the intersection property (2) does not hold for the joint distribution of $(X, A, B, C)$.

As a direct corollary from these two propositions we obtain a characterization of the intersection property in the case of continuous densities.

Corollary 1 (Intersection Property).

Assume (A0) and (A1). Then the intersection property (2) holds for every random variable $X$ if and only if, for each $c$ with $p(c) > 0$, all path-connected components of the support $\mathcal{S}_c$ are equivalent, that is, there is only one equivalence class $\mathcal{U}_{c,1} = \mathcal{S}_c$.

In particular, this is the case if (A2) holds (there is only one path-connected component) or if (A2') holds (the density is strictly positive).

4 Application to Causal Discovery

We now define what we mean by identifiability of the graph in continuous additive noise models. Assume that a joint distribution over $X_1, \ldots, X_d$ is generated by a structural equation model (SEM)

$X_j = f_j(X_{\mathrm{PA}_j}) + N_j, \qquad j = 1, \ldots, d, \qquad (3)$

with continuous, non-constant functions $f_j$, additive and jointly independent noise variables $N_j$ with mean zero, and sets $\mathrm{PA}_j$ that are the parents of node $j$ in a directed acyclic graph $\mathcal{G}_0$. To simplify notation, we identify each variable $X_j$ with its index (or node) $j$. We consider the following statement:

(ID) The directed acyclic graph $\mathcal{G}_0$ is identifiable from the joint distribution of $X_1, \ldots, X_d$.

Peters et al. (2013, Theorem 27) prove this identifiability by extending the identifiability from graphs with two nodes to graphs with an arbitrary number of variables. Because their argument requires the intersection property, it is shown only for strictly positive densities. But since Corollary 1 provides a weaker sufficient condition for the intersection property, we can use it to obtain new identifiability results.
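As an illustration of the model class (3), the following sketch samples from a hypothetical three-node ANM along a topological order of its DAG. The graph, the functions $f_j$ and the Gaussian noise (which has fully supported, hence path-connected, density) are our own choices for illustration.

```python
# A generative sketch of the additive noise model (3) for a hypothetical DAG
# with edges 1 -> 2, 1 -> 3 and 2 -> 3; each X_j = f_j(X_PA_j) + N_j.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

n1, n2, n3 = (rng.normal(size=n) for _ in range(3))  # mean-zero noise
x1 = n1                        # source node: X_1 = N_1
x2 = np.tanh(x1) + n2          # X_2 = f_2(X_1) + N_2, f_2 continuous, non-constant
x3 = x1 ** 2 - 0.5 * x2 + n3   # X_3 = f_3(X_1, X_2) + N_3

sample = np.stack([x1, x2, x3], axis=1)
print(sample.shape)            # (10000, 3): draws from the joint distribution
```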

Proposition 3.

Assume that a joint distribution over $X_1, \ldots, X_d$ is generated by a structural equation model (3). If the supports of all noise densities of $N_1, \ldots, N_d$ are path-connected, then the support of the density of $(X_1, \ldots, X_d)$ is path-connected, too. Thus, the intersection property (2) holds for any disjoint sets of variables (see Corollary 1). Therefore, statement (ID) holds if the noise variables have continuous densities and path-connected support.

Example 1 violates the assumption of Proposition 3 since the support of $(A, B)$ is not path-connected. It has another important property, too: the function $f$ is constant on certain regions. The following proposition shows that such constancy is necessary for a violation of identifiability.

Proposition 4.

Assume that a joint distribution over $X_1, \ldots, X_d$ is generated by a structural equation model (3) with graph $\mathcal{G}_0$. Let us denote the non-descendants of node $j$ in $\mathcal{G}_0$ by $\mathrm{ND}_j$. Assume that the structural equations are non-constant in the following way: for all $j$, for all its parents $k \in \mathrm{PA}_j$ and for all values $x_{\mathrm{pa}_j \setminus k}$, there are $x_k \neq \tilde{x}_k$ such that $p(x_k, x_{\mathrm{pa}_j \setminus k}) > 0$ and $p(\tilde{x}_k, x_{\mathrm{pa}_j \setminus k}) > 0$ and $f_j(x_k, x_{\mathrm{pa}_j \setminus k}) \neq f_j(\tilde{x}_k, x_{\mathrm{pa}_j \setminus k})$. Here, $x_{\mathrm{pa}_j \setminus k}$ represents the value of all parents of $j$ except $k$. Then for any $j$ and any $k \in \mathrm{PA}_j$, it holds that $X_j \not\perp\!\!\!\perp X_k \mid X_{\mathrm{ND}_j \setminus \{k\}}$. Therefore, statement (ID) follows.
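The non-constancy condition of Proposition 4 is straightforward to check for a tabulated structural equation. The following sketch does so for one hypothetical equation $X_j = f_j(X_k, X_m) + N_j$ with two parents on a finite grid; the grid, the support mask and the tabulated functions are illustrative assumptions.

```python
# Check Proposition 4's condition for one equation on a grid: for every value
# of the other parent there must exist two supported values x_k != x_k' with
# different function values f_j(x_k, .) != f_j(x_k', .).
import numpy as np

def non_constant_in_first_arg(f_table, support):
    """f_table[i, j] = f_j(x_k = grid[i], x_m = grid[j]); support is boolean."""
    for j in range(f_table.shape[1]):
        vals = f_table[support[:, j], j]      # f values over supported x_k
        if vals.size < 2 or np.allclose(vals, vals[0]):
            return False                      # f_j is constant in x_k here
    return True

grid = np.linspace(0.0, 1.0, 5)
support = np.ones((5, 5), dtype=bool)         # fully supported toy grid

f_good = np.add.outer(grid ** 2, grid)        # f(x_k, x_m) = x_k^2 + x_m
print(non_constant_in_first_arg(f_good, support))   # True

f_bad = np.tile(grid, (5, 1))                 # f(x_k, x_m) = x_m, constant in x_k
print(non_constant_in_first_arg(f_bad, support))    # False
```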

Proposition 4 provides an alternative way to prove identifiability. The results are summarized in Table 1.

additional assumption on continuous ANMs | identifiability of graph, see
noise variables with full support | Peters et al. (2013)
noise variables with path-connected support | Proposition 3
non-constant functions (in the sense of Proposition 4) | Proposition 4
none of the above satisfied | Example 1 (identifiability may fail)

Table 1: This table shows conditions for continuous additive noise models (ANMs) that lead to identifiability of the directed acyclic graph from the joint distribution. Using the characterization of the intersection property we could weaken the condition of a strictly positive density.

5 Conclusion

It is possible to prove the intersection property of conditional independence for variables whose distributions do not have a strictly positive density. A necessary and sufficient condition for the intersection property is that all path-connected components of the support of the density are equivalent, that is, they can be connected by axis-parallel lines. In particular, this condition is satisfied for densities whose support is path-connected. In the general case, the intersection property still holds after conditioning on the equivalence class of path-connected components; we call this the weak intersection property.

This insight has a direct application in causal inference. For continuous additive noise models we can prove identifiability of the graph from the joint distribution using strictly weaker assumptions than before.

6 Proofs

6.1 Proof of Proposition 1

We require the following well-known lemma (e.g. Dawid, 1979a).

Lemma 1.

We have $X \perp\!\!\!\perp A \mid B$ if and only if

$p(x \mid a, b) = p(x \mid \tilde{a}, b)$

for all $x, a, \tilde{a}, b$ such that $p(a, b) > 0$ and $p(\tilde{a}, b) > 0$.

Proof.

(of Proposition 1) We have by Lemma 1

$p(x \mid a, b, c) = p(x \mid \tilde{a}, b, c) \quad \text{and} \quad p(x \mid a, b, c) = p(x \mid a, \tilde{b}, c) \qquad (4)$

for all $x, a, \tilde{a}, b, \tilde{b}, c$ with $p(a, b \mid c)$, $p(\tilde{a}, b \mid c)$ and $p(a, \tilde{b} \mid c)$ greater than zero. As the main argument we show that

$p(x \mid a, b, c) = p(x \mid \tilde{a}, \tilde{b}, c) \qquad (5)$

for all $(a, b), (\tilde{a}, \tilde{b}) \in \mathcal{U}_{c,z}$ with $p(a, b \mid c) > 0$ and $p(\tilde{a}, \tilde{b} \mid c) > 0$, for the same equivalence class $\mathcal{U}_{c,z}$.
Step 1: we prove equation (5) for $(a, b)$ and $(\tilde{a}, \tilde{b})$ from the same path-connected component, that is, there is a path $\gamma : [0, 1] \to \mathcal{S}_c$ such that $p(\gamma(t) \mid c) > 0$ for all $t \in [0, 1]$, and $\gamma(0) = (a, b)$ and $\gamma(1) = (\tilde{a}, \tilde{b})$. Since the interval $[0, 1]$ is compact and $\gamma$ is continuous, the path $\gamma([0, 1])$ is compact, too. Define for each point on the path an open ball with radius small enough such that all $(a', b')$ in the ball satisfy $p(a', b' \mid c) > 0$; this is possible because the density is continuous (A1). Since this yields an open cover of the path, choose a finite subset, of size $n$ say, of all those balls that still provides an open cover of the path. Without loss of generality let $(a, b)$ be the center of ball $1$ and $(\tilde{a}, \tilde{b})$ be the center of ball $n$. It suffices to show that equation (5) holds for the centres of two neighbouring balls, say $(a_1, b_1)$ and $(a_2, b_2)$. Choose one point $(a', b')$ from the non-empty intersection of those two balls. Since the balls are defined with respect to the product (Euclidean) metric, we have that $p(a_1, b_1 \mid c)$, $p(a_1, b' \mid c)$, $p(a', b' \mid c)$, $p(a', b_2 \mid c)$ and $p(a_2, b_2 \mid c)$ are all greater than zero. Therefore, using equation (4) several times,

$p(x \mid a_1, b_1, c) = p(x \mid a_1, b', c) = p(x \mid a', b', c) = p(x \mid a', b_2, c) = p(x \mid a_2, b_2, c).$

This shows equation (5) for points from the same path-connected component.
Step 2: we prove equation (5) for $(a, b) \in \mathcal{S}_{c,i}$ and $(\tilde{a}, \tilde{b}) \in \mathcal{S}_{c,j}$, where $\mathcal{S}_{c,i}$ and $\mathcal{S}_{c,j}$ are coordinate-wise connected (and thus equivalent). If $i = j$, we know that

$p(x \mid a, b, c) = p(x \mid \tilde{a}, \tilde{b}, c)$

from the argument given in step 1 above. If $i \neq j$, then (assuming without loss of generality that the projections onto $\mathcal{A}$ intersect) there is an $a^*$ together with $b^*$ and $b^{**}$ such that $(a^*, b^*) \in \mathcal{S}_{c,i}$ and $(a^*, b^{**}) \in \mathcal{S}_{c,j}$. By equation (4) and the argument from step 1 we have

$p(x \mid a, b, c) = p(x \mid a^*, b^*, c) = p(x \mid a^*, b^{**}, c) = p(x \mid \tilde{a}, \tilde{b}, c).$

We can now combine these two steps in order to prove the original claim from equation (5). If $(a, b), (\tilde{a}, \tilde{b}) \in \mathcal{U}_{c,z}$, then $(a, b) \in \mathcal{S}_{c,i_1}$ and $(\tilde{a}, \tilde{b}) \in \mathcal{S}_{c,i_L}$, say. Further, there is a sequence $\mathcal{S}_{c,i_1}, \ldots, \mathcal{S}_{c,i_L}$ coordinate-connecting these components. Combining steps 1 and 2 along this sequence proves equation (5).

Consider now $c$ and $z$ such that $p(c, z) > 0$ (which implies $p(c) > 0$) and consider some $(a, b) \in \mathcal{U}_{c,z}$ with $p(a, b \mid c) > 0$, say. Observe further that $p(\tilde{a}, \tilde{b} \mid c, z) = 0$ for $(\tilde{a}, \tilde{b}) \notin \mathcal{U}_{c,z}$. We thus have

$p(x \mid c, z) = \int p(x \mid \tilde{a}, \tilde{b}, c) \, p(\tilde{a}, \tilde{b} \mid c, z) \, d\tilde{a} \, d\tilde{b},$

with the integral effectively ranging over $\mathcal{U}_{c,z}$ only. It is the case, however, that for all $(\tilde{a}, \tilde{b})$ with $p(\tilde{a}, \tilde{b} \mid c, z) > 0$ we have $(\tilde{a}, \tilde{b}) \in \mathcal{U}_{c,z}$. But since also $(a, b) \in \mathcal{U}_{c,z}$, we have $p(x \mid \tilde{a}, \tilde{b}, c) = p(x \mid a, b, c)$ by equation (5). Ergo,

$p(x \mid c, z) = p(x \mid a, b, c) \int p(\tilde{a}, \tilde{b} \mid c, z) \, d\tilde{a} \, d\tilde{b} = p(x \mid a, b, c).$

This implies

$X \perp\!\!\!\perp (A, B) \mid (C, Z).$

Together with equation (4) this leads to

$p(x \mid a, b, c) = p(x \mid c, z)$ for all $(a, b) \in \mathcal{U}_{c,z}$ with $p(a, b \mid c) > 0$,

which is the statement of the proposition. ∎

6.2 Proof of Proposition 2

Proof.

Define $X$ according to

$X := \mathbb{1}_{\{Z = i\}} + N_X,$

where $Z$ is the variable from Definition 3(ii), so that $\{Z = i\}$ is the event $(A, B) \in \mathcal{U}_{C,i}$, and where $N_X$ is uniformly distributed on $[0, 1]$ with $N_X$ and $(A, B, C)$ being jointly independent.

Fix a value $c$ with $p(c) > 0$ and two different equivalence classes $\mathcal{U}_{c,i} \neq \mathcal{U}_{c,j}$. We then have for all $x, a, b$ with $p(a, b \mid c) > 0$ that

$p(x \mid a, b, c) = p(x \mid b, c) \quad \text{and} \quad p(x \mid a, b, c) = p(x \mid a, c),$

because $Z$ can be written as a function of $(B, C)$ or of $(A, C)$. We therefore have that $X \perp\!\!\!\perp A \mid (B, C)$ and $X \perp\!\!\!\perp B \mid (A, C)$. Depending on whether $(a, b)$ is in $\mathcal{U}_{c,i}$ or not, we have $p(x \mid a, b, c) = \mathbb{1}_{[1,2]}(x)$ or $p(x \mid a, b, c) = \mathbb{1}_{[0,1]}(x)$, respectively. Thus,

$p(x \mid a, b, c) \neq p(x \mid c)$

for some $x$ and some $(a, b)$ with $p(a, b \mid c) > 0$, since both $\mathcal{U}_{c,i}$ and $\mathcal{U}_{c,j}$ have positive probability given $C = c$. This shows that $X \not\perp\!\!\!\perp (A, B) \mid C$. ∎

6.3 Proof of Proposition 3

Proof.

Since the true structure corresponds to a directed acyclic graph, we can find a causal ordering, i.e. a permutation $\pi$ of $\{1, \ldots, d\}$ such that

$\mathrm{PA}_{\pi(j)} \subseteq \{\pi(1), \ldots, \pi(j-1)\} \quad \text{for all } j = 1, \ldots, d.$

In this ordering, $\pi(1)$ is a source node and $\pi(d)$ is a sink node. We can then rewrite the structural equation model in (3) as

$X_{\pi(j)} = \tilde{f}_{\pi(j)}(X_{\pi(1)}, \ldots, X_{\pi(j-1)}) + N_{\pi(j)},$

where the functions $\tilde{f}_{\pi(j)}$ are the same as the $f_{\pi(j)}$ except that they are constant in the additional input arguments. The statement of the proposition then follows by induction over $j$ from the following argument: consider a one-dimensional random variable $N$ with mean zero and a (possibly multivariate) random vector $Y$, independent of $N$, both with path-connected support, and a continuous function $g$. Then the support of the random vector $(Y, g(Y) + N)$ is path-connected, too. Indeed, consider two points $(y_1, x_1)$ and $(y_2, x_2)$ from the support of $(Y, g(Y) + N)$ and fix a point $n$ from the support of $N$. The path can then be constructed by concatenating three sub-paths: (1) the path between $(y_1, x_1)$ and $(y_1, g(y_1) + n)$ ($N$'s support is path-connected), (2) the path between $(y_1, g(y_1) + n)$ and $(y_2, g(y_2) + n)$ on the graph of $y \mapsto g(y) + n$ (which is path-connected due to the continuity of $g$ and the path-connectedness of $Y$'s support) and (3) the path between $(y_2, g(y_2) + n)$ and $(y_2, x_2)$, analogously to (1).

Therefore the statements of Lemma 37 and thus Proposition 28 from Peters et al. (2013) remain correct, which proves statement (ID) for noise variables with continuous densities and path-connected support. ∎

6.4 Proof of Proposition 4

Proof.

The proof is almost immediate. Using that $N_j$ has mean zero and is independent of the non-descendants of $j$, we have

$\mathbb{E}[X_j \mid X_{\mathrm{ND}_j \setminus \{k\}} = x_{\mathrm{nd}_j \setminus k}, X_k = x_k] = f_j(x_{\mathrm{pa}_j \setminus k}, x_k) \neq f_j(x_{\mathrm{pa}_j \setminus k}, \tilde{x}_k) = \mathbb{E}[X_j \mid X_{\mathrm{ND}_j \setminus \{k\}} = x_{\mathrm{nd}_j \setminus k}, X_k = \tilde{x}_k].$

Since the means are not the same, the statement follows from Lemma 1.

In this case, Lemma 37 might not hold but more importantly Proposition 28 does (both from Peters et al., 2013). This proves statement (ID). ∎

References

  • Chickering [2002] D. M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507–554, 2002.
  • Dawid [1979a] A. P. Dawid. Conditional independence in statistical theory. Journal of the Royal Statistical Society. Series B, 41:1–31, 1979a.
  • Dawid [1979b] A. P. Dawid. Some misleading arguments involving conditional independence. Journal of the Royal Statistical Society. Series B (Methodological), 41:249–252, 1979b.
  • Dawid [1980] A. P. Dawid. Conditional independence for statistical operations. Annals of Statistics, 8:598–617, 1980.
  • Heckerman et al. [1999] D. Heckerman, C. Meek, and G. Cooper. A Bayesian approach to causal discovery. In C. Glymour and G. Cooper, editors, Computation, Causation, and Discovery, pages 141–165, Cambridge, MA, 1999. MIT Press.
  • Hoyer et al. [2009] P.O. Hoyer, D. Janzing, J. Mooij, J. Peters, and B. Schölkopf. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems 21 (NIPS), pages 689–696. MIT Press, 2009.
  • Pearl [2009] J. Pearl. Causality: Models, reasoning, and inference. Cambridge University Press, 2nd edition, 2009.
  • Peters et al. [2013] J. Peters, J. Mooij, D. Janzing, and B. Schölkopf. Causal discovery with continuous additive noise models, 2013. arXiv:1309.6779.
  • Shimizu et al. [2006] S. Shimizu, P.O. Hoyer, A. Hyvärinen, and A.J. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003–2030, 2006.
  • Spirtes et al. [2000] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2nd edition, 2000.
  • Verma and Pearl [1991] T. Verma and J. Pearl. Equivalence and synthesis of causal models. In Proceedings of the 6th Annual Conference on Uncertainty in Artificial Intelligence (UAI), 1991.