Semidefinite tests for latent causal structures

01/03/2017
by Aditya Kela, et al.
University of Cologne

Testing whether a probability distribution is compatible with a given Bayesian network is a fundamental task in the field of causal inference, where Bayesian networks model causal relations. Here we consider the class of causal structures where all correlations between observed quantities are solely due to the influence from latent variables. We show that each model of this type imposes a certain signature on the observable covariance matrix in terms of a particular decomposition into positive semidefinite components. This signature, and thus the underlying hypothetical latent structure, can be tested in a computationally efficient manner via semidefinite programming. This stands in stark contrast with the algebraic geometric tools required if the full observable probability distribution is taken into account. The semidefinite test is compared with tests based on entropic inequalities.

I Introduction

In spite of the primary importance of discovering causal relations in science, the statistical analysis of empirical data has historically shied away from causality. Only relatively recently has a rigorous theory of causality emerged (see, for instance, Pearl (2009); Spirtes et al. (2001)), showing that empirical data can indeed contain information about causation rather than mere correlation. Since then, causal inference has quickly become influential. Examples range from applications to the inference of genetic Friedman (2004) and social networks Ver Steeg and Galstyan (2011), to a better understanding of the role of causality within quantum physics Leifer and Spekkens (2013); Fritz (2012, 2016); Henson et al. (2014); Chaves et al. (2015a); Pienaar and Brukner (2015); Ried et al. (2015); Costa and Shrapnel (2016); Horsman et al. (2016).

To formalize causal mechanisms it has become popular to use directed acyclic graphs (DAGs), where nodes denote random variables and directed edges (arrows) account for their causal relations. Central problems within this context include inference or model selection: ‘Given samples from a number of observable variables, which DAG should we associate with them?’, as well as hypothesis testing: ‘Can the observed data be explained in terms of an assumed DAG?’ Here, we concentrate on the latter problem and propose a novel solution based on the covariances that a given causal structure gives rise to. To understand the relevance and applicability of this method it is useful to summarize the difficulties that we typically face when approaching such problems.

The most common method to infer the set of possible DAGs compatible with empirical observations is based on the Markov condition and the faithfulness assumption Pearl (2009); Spirtes et al. (2001). Under these conditions, and in the case where all variables composing a given DAG can be assumed to be empirically accessible, the conditional statistical independencies implied by the graph contain all the information required to test the compatibility of some data with the causal structure. However, for a variety of practical and fundamental reasons, we quite generally face causal discovery in the presence of latent (hidden) variables, that is, variables that may play an important role in the causal model but nonetheless cannot be accessed empirically. In this case we have to characterize the set of marginal probability distributions that a given DAG can give rise to. Unfortunately, as is widely recognized, generic causal models with latent variables impose highly non-trivial constraints on the correlations compatible with them Pitowsky (1991); Pearl (1995); Geiger and Meek (1999); Bonet (2001); Garcia et al. (2005); Kang and Tian (2006, 2007); Evans (2012); Lee and Spekkens (2015); Chaves (2016); Rosset et al. (2016); Wolfe et al. (2016). Although marginal compatibility can in principle be completely characterized in terms of semi-algebraic sets Geiger and Meek (1999), the resulting tests appear to be computationally intractable in practice beyond a few variables Garcia et al. (2005); Lee and Spekkens (2015).

One possible approach to deal with the apparent intractability is to consider relaxations of the original problem, that is, to design tests that define incomplete lists of constraints (outer approximations) to the set of compatible distributions Bonet (2001); Garcia et al. (2005); Kang and Tian (2006, 2007); Moritz et al. (2014); Chaves et al. (2014a, b). For instance, this approach has previously been considered in Chaves et al. (2014a, b); Steudel and Ay (2015); Weilenmann and Colbeck (2016), with tests based on entropic information theoretic inequalities; an idea originally conceived to tackle foundational questions in quantum mechanics Braunstein and Caves (1988); Cerf and Adami (1997); Chaves and Fritz (2012); Fritz and Chaves (2013); Chaves (2013); Chaves et al. (2015b); Chaves and Budroni (2016). Here we consider a relaxation in a similar spirit, but based on covariances rather than entropies.

Beyond dealing with potential computational intractability, an additional benefit of a relaxation based on covariances is that it involves at most bipartite marginals, and it seems reasonable to expect that this would be less data-intensive than methods based on the full multivariate distribution of the observables.

Figure 1: Bipartite DAGs. In this investigation we focus on the class of causal models where all correlations among the observables are due to a collection of independent latent variables. This setting can be described in terms of DAGs that are bipartite, where the latter means that all edges are directed from latent variables to the observables, and where there are no edges within each of these subsets.

I.1 Main assumptions and results

We focus on a particular class of latent causal structures, where we assume that there are no direct causal influences between the observables, but only from latent variables to observables (see figure 1). Hence, all correlations among the observables are due to the latent variables. This setting can be described by the class of DAGs where all edges are directed from latent vertices to observable vertices, with no edges within either of these two groups. In other words, we consider the case of DAGs that are bipartite, with the coloring ‘observable’ and ‘latent’. Alternatively, this can be described in terms of hypergraphs, where each independent latent cause is associated with a hyperedge consisting of the affected observable vertices (see e.g. Evans (2016)).

This class of graphs has previously been considered in the context of marginalization of Bayesian networks Moritz et al. (2014); Steudel and Ay (2015); Evans (2016). These graphs moreover provide examples of the difficulties that arise when characterizing latent structures Branciard et al. (2010); Fritz (2012); Branciard et al. (2012); Tavakoli et al. (2014); Chaves et al. (2014a, b), where standard techniques based on the use of conditional independencies can even yield erroneous results (for a discussion, see e.g. Wood and Spekkens (2015)). This type of latent structure furthermore emerges in the context of Bell’s theorem Bell (1964), as well as in recent generalizations Branciard et al. (2010); Fritz (2012); Branciard et al. (2012); Tavakoli et al. (2014); Chaves (2016); Rosset et al. (2016); Saunders et al. (2016); Carvacho et al. (2016), where it can be used to show that quantum correlations between distant observers (with no direct causal influences between them) are incompatible with our most basic notions of cause and effect.

Irrespective of the nature of the observables (categorical or continuous) we are free to assign vectors to each possible outcome of the observables. Our main result is to show that each bipartite DAG implies a particular decomposition of the resulting covariance matrix into positive semidefinite components. Hence, we can test whether the observed covariance matrix is compatible with a hypothetical bipartite DAG by checking whether it satisfies the corresponding positive semidefinite decomposition, and we will in the following somewhat colloquially refer to this as the ‘semidefinite test’. The semidefinite test can thus be phrased as a semidefinite membership problem, which in turn can be solved via semidefinite programming. The latter is known to be computationally efficient from a theoretical point of view, and has a good track record concerning algorithms that are efficient also in practice (see discussions in Vandenberghe and Boyd (1996)).

I.2 Structure of the paper

In section II we derive a general decomposition of covariance matrices, which forms the basis of our semidefinite test. In section III we rephrase this general result to fit with the particular structure of observables and latent variables that we employ, and in section IV we derive the main result, namely that every bipartite DAG implies a particular semidefinite decomposition of the observable covariance matrix. Section V focuses on the converse, namely that every covariance matrix that satisfies the decomposition of a given bipartite DAG can be realized by a corresponding causal model. Section VI relates the semidefinite decomposition to previous types of operator inequalities introduced in von Prillwitz (2015). To obtain a covariance matrix we may be required to assign vectors to the outcomes of the random variables, and section VII discusses the dependence of the semidefinite test on this assignment. In section VIII we briefly discuss the fact that the compatibility with a given bipartite DAG is not affected if the observables are processed locally, and that the semidefinite test respects this basic property under suitable conditions. Section IX considers a specific class of distributions where it is possible to analytically determine the conditions for a semidefinite decomposition. This class of distributions serves in section X as a testbed for comparisons with the above mentioned entropic tests. We conclude with a summary and outlook in section XI.

II Semidefinite decomposition of covariance matrices

In this section we develop the basic structure that forms the core of the semidefinite test. In essence it is obtained via a repeated application of a law of total variance for covariance matrices.

For a vector-valued random variable $V$ in a real or complex inner product space $\mathcal{V}$, we define the covariance matrix of $V$ as

$\mathrm{Cov}(V) := E(VV^{\dagger}) - E(V)E(V)^{\dagger},$  (1)

where $E(V)$ denotes the expectation of $V$ and $\dagger$ denotes the transposition if the underlying vector space is real, and the Hermitian conjugation if the space is complex. One should note that $\mathrm{Cov}(V)$ is positive semidefinite. We also define the cross-correlation for a pair of vector-valued variables $V, W$ (not necessarily belonging to the same vector space)

$\mathrm{Cov}(V,W) := E(VW^{\dagger}) - E(V)E(W)^{\dagger},$  (2)

where $\mathrm{Cov}(V,V) = \mathrm{Cov}(V)$. For a pair of random variables $V, Y$ we denote the expectation of $V$ conditioned on $Y$ as $E(V|Y)$. Via the conditional expectation we can also define the conditional covariance matrix

$\mathrm{Cov}(V|Y) := E(VV^{\dagger}|Y) - E(V|Y)E(V|Y)^{\dagger}.$  (3)

In a similar manner we can also obtain a conditional cross-correlation between two random vectors

$\mathrm{Cov}(V,W|Y) := E(VW^{\dagger}|Y) - E(V|Y)E(W|Y)^{\dagger}.$  (4)

The starting point for our derivations is the law of total expectation

$E(V) = E\bigl(E(V|Y)\bigr),$  (5)

where the ‘outer’ expectation corresponds to the averaging over the random variable $Y$. The law of total expectation can be iterated, such that for three random variables $V, Y, Z$ we have a law of total conditional expectation

$E(V|Z) = E\bigl(E(V|Y,Z)\,\big|\,Z\bigr),$  (6)

and thus $E(V) = E\Bigl(E\bigl(E(V|Y,Z)\,\big|\,Z\bigr)\Bigr)$.

From the law of total expectation (5) one can obtain a covariance-matrix version of the law of total variance

$\mathrm{Cov}(V) = E\bigl(\mathrm{Cov}(V|Y)\bigr) + \mathrm{Cov}\bigl(E(V|Y)\bigr),$  (7)

which can be confirmed by expanding the two sides of the above equality and applying (5).
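
For completeness, the expansion can be sketched as follows (using the notation introduced above; these intermediate steps are not spelled out in the text). By the definition (3) and the law of total expectation (5),

$E\bigl(\mathrm{Cov}(V|Y)\bigr) = E\bigl(E(VV^{\dagger}|Y)\bigr) - E\bigl(E(V|Y)E(V|Y)^{\dagger}\bigr) = E(VV^{\dagger}) - E\bigl(E(V|Y)E(V|Y)^{\dagger}\bigr),$

and by the definition (1) applied to the random vector $E(V|Y)$, together with (5),

$\mathrm{Cov}\bigl(E(V|Y)\bigr) = E\bigl(E(V|Y)E(V|Y)^{\dagger}\bigr) - E(V)E(V)^{\dagger}.$

Adding the two lines, the middle terms cancel and the sum equals $E(VV^{\dagger}) - E(V)E(V)^{\dagger} = \mathrm{Cov}(V)$, which is (7).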

For three random variables $V, Y, Z$ a conditional version of the law of total covariance reads

$\mathrm{Cov}(V|Z) = E\bigl(\mathrm{Cov}(V|Y,Z)\,\big|\,Z\bigr) + \mathrm{Cov}\bigl(E(V|Y,Z)\,\big|\,Z\bigr),$  (8)

which can be obtained by expanding the right hand side and applying the law of total conditional expectation (6).

The following lemma is obtained via an iterated application of the law of total covariance (7) and the law of total conditional covariance (8). One may note the similarities with the chain-rule for entropies (see e.g. chapter 2 in Cover and Thomas (2012)).

Lemma 1.

Let $V$ be a vector-valued random variable on a finite-dimensional real or complex inner product space $\mathcal{V}$, and let $Y_1,\dots,Y_L$ be random variables over the same probability space. Assuming that the underlying measure is such that all involved conditional expectations and covariances are well defined, then

$\mathrm{Cov}(V) = \sum_{l=1}^{L}\Sigma_l + \Sigma_R,$  (9)

where $\Sigma_1,\dots,\Sigma_L$ and $\Sigma_R$ are positive semidefinite operators on the space $\mathcal{V}$, defined by

$\Sigma_l := E\Bigl(\mathrm{Cov}\bigl(E(V|Y_1,\dots,Y_l)\,\big|\,Y_1,\dots,Y_{l-1}\bigr)\Bigr), \qquad \Sigma_R := E\bigl(\mathrm{Cov}(V|Y_1,\dots,Y_L)\bigr).$  (10)

One may note that the above decomposition is not necessarily unique; we could potentially obtain a new decomposition if the variables in the sequence $Y_1,\dots,Y_L$ are permuted.

Proof.

The law of total covariance (7), applied to $V$ and $Y_1$, combined with the law of total conditional covariance (8), applied to $V$, $Y_2$ and $Y_1$, yields

$\mathrm{Cov}(V) = \mathrm{Cov}\bigl(E(V|Y_1)\bigr) + E\Bigl(\mathrm{Cov}\bigl(E(V|Y_1,Y_2)\,\big|\,Y_1\bigr)\Bigr) + E\bigl(\mathrm{Cov}(V|Y_1,Y_2)\bigr).$  (11)

Suppose that for some $k$ it would be true that

$\mathrm{Cov}(V) = \sum_{l=1}^{k} E\Bigl(\mathrm{Cov}\bigl(E(V|Y_1,\dots,Y_l)\,\big|\,Y_1,\dots,Y_{l-1}\bigr)\Bigr) + E\bigl(\mathrm{Cov}(V|Y_1,\dots,Y_k)\bigr).$  (12)

The law of total conditional covariance (8), with $Y_{k+1}$ as the new conditioning variable and $Y_1,\dots,Y_k$ held fixed, gives

$\mathrm{Cov}(V|Y_1,\dots,Y_k) = E\bigl(\mathrm{Cov}(V|Y_1,\dots,Y_{k+1})\,\big|\,Y_1,\dots,Y_k\bigr) + \mathrm{Cov}\bigl(E(V|Y_1,\dots,Y_{k+1})\,\big|\,Y_1,\dots,Y_k\bigr).$

By inserting this expression into the last term of (12) one does again obtain (12), but with $k+1$ substituted for $k$. By (11) we can see that (12) is true for $k = 2$. Thus, by induction up to $k = L$, and the identifications in (10), we obtain (9).

Note that $\mathrm{Cov}\bigl(E(V|Y_1,\dots,Y_l)\,\big|\,Y_1,\dots,Y_{l-1}\bigr)$ is a positive semidefinite operator on $\mathcal{V}$ for each value of $Y_1,\dots,Y_{l-1}$. Hence, by averaging over these variables, and thus implementing the expectation that yields $\Sigma_l$, we do still have a positive semidefinite operator on $\mathcal{V}$. The same observation applies to $\Sigma_R$. ∎
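
As a sanity check of the decomposition (9), the following minimal numerical sketch (not part of the paper; the toy model, sample size and all variable names are illustrative assumptions) estimates $\Sigma_1$, $\Sigma_2$ and $\Sigma_R$ from samples of a model with two discrete latent variables, and verifies that the terms are positive semidefinite and sum to $\mathrm{Cov}(V)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy model (illustrative): two independent binary latent variables Y1, Y2
# and a 3-dimensional observed random vector V that depends on both.
Y1 = rng.integers(0, 2, size=n)
Y2 = rng.integers(0, 2, size=n)
V = np.stack([Y1, Y2, Y1 * Y2], axis=1) + rng.normal(scale=0.1, size=(n, 3))

def cov(a):
    """Sample covariance matrix of the rows of a."""
    return np.cov(a, rowvar=False)

def cond_mean(V, labels):
    """Replace each sample of V by its conditional mean given the labels."""
    out = np.empty_like(V)
    for lab in np.unique(labels):
        out[labels == lab] = V[labels == lab].mean(axis=0)
    return out

# Terms of the decomposition (9) for L = 2, obtained via the law of
# total covariance (7):
#   Sigma_1 = Cov(E[V|Y1]),
#   Sigma_2 = E[Cov(E[V|Y1,Y2] | Y1)] = Cov(E[V|Y1,Y2]) - Cov(E[V|Y1]),
#   Sigma_R = E[Cov(V|Y1,Y2)]         = Cov(V) - Cov(E[V|Y1,Y2]).
E_V_Y1 = cond_mean(V, Y1)
E_V_Y12 = cond_mean(V, 2 * Y1 + Y2)
Sigma_1 = cov(E_V_Y1)
Sigma_2 = cov(E_V_Y12) - cov(E_V_Y1)
Sigma_R = cov(V) - cov(E_V_Y12)

for name, S in [("Sigma_1", Sigma_1), ("Sigma_2", Sigma_2), ("Sigma_R", Sigma_R)]:
    # nonnegative up to numerical precision
    print(name, "min eigenvalue:", np.linalg.eigvalsh(S).min())
print("max deviation from Cov(V):",
      np.abs(Sigma_1 + Sigma_2 + Sigma_R - cov(V)).max())
```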

III Observable vs. latent variables, and feature maps

Here we consider the decomposition developed in the previous section for the more specific setting of observable and latent variables.

We consider a collection of observable variables $X_1,\dots,X_N$. To each of these variables we associate a mapping $f_i$, in some contexts referred to as a ‘feature map’ Schölkopf and Smola (2002), into a finite-dimensional vector space $\mathcal{V}_i$. We denote the resulting vector-valued random variables by $V_i := f_i(X_i)$, and for the sake of simplicity we will in the following tend to abuse the terminology and refer to the vectors $V_i$ themselves as feature maps. We also define the joint random vector $V := V_1\oplus\cdots\oplus V_N$ on $\mathcal{V} := \mathcal{V}_1\oplus\cdots\oplus\mathcal{V}_N$. (Hence, we can view $V$ as the concatenation of the vectors $V_1,\dots,V_N$.) One should note that while we regard the observable variables as being part of the setup that is ‘given’, the feature maps are part of the analysis, and we are free to assign these as we see fit. (Concerning the question of how the test depends on this choice, see section VII.)
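
As a concrete illustration of a feature map (an illustrative sketch, not taken from the paper), a categorical observable with a finite outcome set can be mapped to one-hot vectors; the joint random vector $V$ is then simply the concatenation of these encodings, and its covariance matrix can be estimated from samples.

```python
import numpy as np

def one_hot(x, num_outcomes):
    """Feature map: outcome k of a categorical variable -> basis vector e_k."""
    v = np.zeros((len(x), num_outcomes))
    v[np.arange(len(x)), x] = 1.0
    return v

rng = np.random.default_rng(1)
n = 100_000

# Illustrative observables: X1 and X2 share a latent cause, X3 is independent.
lam = rng.integers(0, 3, size=n)                 # latent variable
X1 = (lam + rng.integers(0, 2, size=n)) % 3      # child of lam
X2 = (lam + rng.integers(0, 2, size=n)) % 3      # child of lam
X3 = rng.integers(0, 2, size=n)                  # parentless observable

# Joint random vector V = V1 (+) V2 (+) V3 obtained from one-hot feature maps.
V = np.hstack([one_hot(X1, 3), one_hot(X2, 3), one_hot(X3, 2)])
Sigma = np.cov(V, rowvar=False)                  # observable covariance matrix
print(Sigma.shape)                               # (8, 8): blocks of size 3, 3, 2
```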

Figure 2: Observables, latent variables, and feature maps. The model consists of a collection of observable variables $X_1,\dots,X_N$ and a collection of latent variables $\lambda_1,\dots,\lambda_M$. Via feature maps, each $X_i$ is mapped to a vector $V_i$ in a vector space $\mathcal{V}_i$. On the vector space $\mathcal{V} = \mathcal{V}_1\oplus\cdots\oplus\mathcal{V}_N$ we define the joint random vector $V = V_1\oplus\cdots\oplus V_N$.

Let $P_i$ denote the projector onto the subspace $\mathcal{V}_i$ in $\mathcal{V}$. We divide the total covariance matrix $\Sigma := \mathrm{Cov}(V)$ into the cross-correlations $\Sigma_{ij} := P_i\Sigma P_j$ between the separate observable quantities. One can note that $\Sigma_{ij} = \mathrm{Cov}(V_i,V_j)$.

For a collection of latent variables $\lambda_1,\dots,\lambda_M$, we make the identifications $Y_l := \lambda_l$ in Lemma 1. Similarly as for the covariance matrix we decompose the operators $\Sigma_l$ and $\Sigma_R$ into ‘block-matrices’ $[\Sigma_l]_{ij}$ and $[\Sigma_R]_{ij}$, with $[\Sigma_l]_{ij} := P_i\Sigma_l P_j$ and $[\Sigma_R]_{ij} := P_i\Sigma_R P_j$, where we can write

$[\Sigma_l]_{ij} = E\Bigl(\mathrm{Cov}\bigl(E(V_i|\lambda_1,\dots,\lambda_l),\,E(V_j|\lambda_1,\dots,\lambda_l)\,\big|\,\lambda_1,\dots,\lambda_{l-1}\bigr)\Bigr), \qquad [\Sigma_R]_{ij} = E\bigl(\mathrm{Cov}(V_i,V_j|\lambda_1,\dots,\lambda_M)\bigr)$  (13)

for $i,j = 1,\dots,N$ and $l = 1,\dots,M$. In terms of these blocks we can thus reformulate (9) as

$\Sigma_{ij} = \sum_{l=1}^{M}[\Sigma_l]_{ij} + [\Sigma_R]_{ij}.$  (14)

One should keep in mind that $\Sigma_{ij}$, $[\Sigma_l]_{ij}$ and $[\Sigma_R]_{ij}$ in the general case are matrices (rather than scalar numbers) for each single pair $(i,j)$.

IV Decomposition of the covariance matrix for bipartite DAGs

We define a bipartite DAG as a finite DAG with vertices and edges, with a bipartition of the vertex set into a latent part and an observable part, such that all edges are directed from the elements in the latent part (the latent variables) to the elements in the observable part (the observables). Since the graph is finite, we enumerate the latent vertices as $\lambda_1,\dots,\lambda_M$ and the observable vertices as $X_1,\dots,X_N$. One may note that we generally will overload the notation and let $\lambda_l$ and $X_i$ denote the vertices in the underlying bipartite DAG, as well as the random variables associated with these vertices.

For a vertex $v$ in a directed graph we let $\mathrm{ch}(v)$ denote the children of $v$, i.e., the set of vertices $w$ for which there is an edge directed from $v$ to $w$. We let $\mathrm{pa}(v)$ denote the parents of $v$, i.e., the set of vertices $w$ for which there is an edge directed from $w$ to $v$. For bipartite DAGs an element in the latent part can only have children among the observables (and has no parents), and an element in the observable part can only have parents among the latent variables (and no children). As an example, for the bipartite DAG in figure 1 the children sets $\mathrm{ch}(\lambda_l)$ and the parent sets $\mathrm{pa}(X_i)$ can be read off directly from the depicted edges.

For a causal model defined by a general DAG the underlying probability distribution can be described via the Markov condition, where each edge represents a direct causal influence, and thus each vertex $v_k$ can only be directly influenced by its parents $\mathrm{pa}(v_k)$, resulting in distributions of the form $P(v_1,\dots,v_K) = \prod_{k} P\bigl(v_k|\mathrm{pa}(v_k)\bigr)$. Hence, for a bipartite DAG we get $P(x_1,\dots,x_N,\lambda_1,\dots,\lambda_M) = \prod_{i=1}^{N} P\bigl(x_i|\mathrm{pa}(x_i)\bigr)\prod_{l=1}^{M} P(\lambda_l)$, and thus all the latent variables are independent, and the observables are independent when conditioned on the latent variables.
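
To make the factorization concrete, the following sketch (illustrative only; the response functions and parameters are assumptions of this example, not taken from the paper) samples from the ‘triangular’ bipartite DAG of figure 3: three independent latent variables, each a parent of two of the three observables, and no direct edges between the observables.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Independent latent variables, one per pair of observables.
lam_AB = rng.integers(0, 2, size=n)
lam_BC = rng.integers(0, 2, size=n)
lam_AC = rng.integers(0, 2, size=n)

# Each observable is a function of its latent parents only, so the joint
# distribution factorizes as
#   P(a, b, c, lam) = P(a|lam_AB, lam_AC) P(b|lam_AB, lam_BC)
#                     P(c|lam_BC, lam_AC) P(lam_AB) P(lam_BC) P(lam_AC).
A = lam_AB + lam_AC
B = lam_AB + lam_BC
C = lam_BC + lam_AC + rng.integers(0, 2, size=n)   # extra local noise on C

# All pairwise correlations are mediated by the shared latent parents.
print("corr(A,B):", np.corrcoef(A, B)[0, 1])
print("corr(B,C):", np.corrcoef(B, C)[0, 1])
print("corr(A,C):", np.corrcoef(A, C)[0, 1])
```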

As in the previous section, we map the observables to vectors $V_i$ in vector spaces $\mathcal{V}_i$. For each latent variable $\lambda_l$ we define the projector $Q_l$ in $\mathcal{V}$ by

$Q_l := \sum_{i:\,X_i\in\mathrm{ch}(\lambda_l)} P_i.$  (15)

Hence, $Q_l$ is the projector onto all subspaces of $\mathcal{V}$ that are associated with the children of the latent variable $\lambda_l$. (In the above sum we should strictly speaking write $\{i : X_i\in\mathrm{ch}(\lambda_l)\}$. However, in order to avoid a too cumbersome notation we will from time to time take the liberty of writing $i\in\mathrm{ch}(\lambda_l)$ rather than $X_i\in\mathrm{ch}(\lambda_l)$, and $\mathrm{ch}(\lambda_l)$ rather than $\{i : X_i\in\mathrm{ch}(\lambda_l)\}$.)

Figure 3: Example: Triangular bipartite DAG. The covariance matrix resulting from the observables in a bipartite DAG is subject to a decomposition where each latent variable gives rise to a positive semidefinite component, and where the support of that component is determined by the children of the corresponding latent variable. In the case of the ‘triangular’ scenario of the bipartite DAG to the left, each of the three latent variables has two children. The covariance matrix, schematically depicted to the right, can consequently be decomposed into three positive semidefinite components, each with bipartite supports. This observation yields a method (which we refer to as the ‘semidefinite test’) to falsify a given bipartite DAG as an explanation of an observed covariance matrix.
Proposition 1.

For a bipartite DAG with latent variables $\lambda_1,\dots,\lambda_M$ and observables $X_1,\dots,X_N$ with assigned feature maps into finite-dimensional real or complex inner-product spaces $\mathcal{V}_1,\dots,\mathcal{V}_N$, the covariance matrix $\Sigma := \mathrm{Cov}(V)$ of $V = V_1\oplus\cdots\oplus V_N$ satisfies

$\Sigma = \sum_{l=1}^{M}\Sigma_l + \Sigma_R,$  (16)

where

$\Sigma_l \geq 0, \quad Q_l\Sigma_l Q_l = \Sigma_l, \qquad \Sigma_R \geq 0, \quad \Sigma_R = \sum_{i=1}^{N} P_i\Sigma_R P_i,$  (17)

and where the projectors $Q_l$ are as defined in (15) with respect to the given bipartite DAG, and where $P_i$ is the projector onto $\mathcal{V}_i$ in $\mathcal{V}$.

One may note that if the span of the supports of the projectors $Q_l$ covers $\mathcal{V}$, then we can distribute the blocks of $\Sigma_R$ and add them to the different $\Sigma_l$ in such a way that the new operators still are positive semidefinite and satisfy the support structure of the original $\Sigma_l$'s. The exception is if there is some observable that has no parent (as in figure 1).

Proof.

Select an enumeration $\lambda_1,\dots,\lambda_M$ of the latent variables. By Lemma 1 we know that the covariance matrix can be decomposed as in (9), with the positive semidefinite operators $\Sigma_l$ and $\Sigma_R$ as defined in (10). In the following we will make use of the block-decompositions $[\Sigma_l]_{ij}$ and $[\Sigma_R]_{ij}$ with respect to the subspaces $\mathcal{V}_i$ as in (13).

If $X_i \notin \mathrm{ch}(\lambda_l)$ then it means that $V_i$ is independent of $\lambda_l$, and thus

$E(V_i|\lambda_1,\dots,\lambda_l) = E(V_i|\lambda_1,\dots,\lambda_{l-1}).$

The analogous statement is true if $X_j \notin \mathrm{ch}(\lambda_l)$. By this it follows that

$\mathrm{Cov}\bigl(E(V_i|\lambda_1,\dots,\lambda_l),\,E(V_j|\lambda_1,\dots,\lambda_l)\,\big|\,\lambda_1,\dots,\lambda_{l-1}\bigr) = 0 \quad \text{if } X_i \notin \mathrm{ch}(\lambda_l) \text{ or } X_j \notin \mathrm{ch}(\lambda_l).$  (18)

By comparing (18) with (13) we can conclude that $[\Sigma_l]_{ij} = 0$ if $X_i \notin \mathrm{ch}(\lambda_l)$ or $X_j \notin \mathrm{ch}(\lambda_l)$. The definition of the projector $Q_l$ in (15) thus yields $Q_l\Sigma_l Q_l = \Sigma_l$. Moreover, we know from Lemma 1 that $\Sigma_l \geq 0$.

By construction, all the observables, and thus also the vectors $V_i$, are independent when conditioned on the latent variables. Hence,

$\mathrm{Cov}(V_i,V_j|\lambda_1,\dots,\lambda_M) = 0 \quad \text{for } i \neq j,$

and thus $\Sigma_R = \sum_i P_i\Sigma_R P_i$.

One may note that although the operators $\Sigma_l$ potentially may change if we generated them via a permutation of the sequence of latent variables $\lambda_1,\dots,\lambda_M$, the resulting projectors $Q_l$ would not change. Hence, the support-structure described by (16) and (17) is stable under rearrangements of the sequence. ∎

Deciding whether a given matrix is of the form (16) can be done via semi-definite programming (SDP). We end this section by describing an explicit SDP formulation.

The optimization will be over matrices which can be interpreted as the direct sum of candidates for $\Sigma_R$ and the $\Sigma_l$’s. More precisely, let

$\mathcal{W} := \mathcal{V}_1\oplus\cdots\oplus\mathcal{V}_N\oplus\mathcal{W}_1\oplus\cdots\oplus\mathcal{W}_M,$  (19)
$\mathcal{W}_l := \bigoplus_{i\in\mathrm{ch}(\lambda_l)}\mathcal{V}_i.$  (20)

Let $Z$ be a matrix on $\mathcal{W}$. According to the direct sum decomposition (19), the matrix $Z$ is a block matrix with $(N+M)\times(N+M)$ blocks. We think of the first $N$ diagonal blocks as carrying candidates for the blocks $P_i\Sigma_R P_i$ (which completely define $\Sigma_R$, according to (17)), while the rear $M$ diagonal blocks correspond to candidate $\Sigma_l$’s. Note that the rear summands in (19) are direct sums themselves. It therefore makes sense to use double indices to refer to spaces inside the $\mathcal{W}_l$’s. Concretely, the SDP includes affine constraints on the blocks $Z_{(l,i),(l,j)}$. The first part of the index selects the space $\mathcal{W}_l$ in (19). The second part refers to the space $\mathcal{V}_i$ within $\mathcal{W}_l$ according to (20). We use the convention that $Z_{(l,i),(l,j)}$ denotes $0$ if either $i$ or $j$ does not occur in $\mathrm{ch}(\lambda_l)$.

With these definitions, the semi-definite program that verifies whether a covariance matrix $\Sigma$ is of the form (16) reads

maximize $\quad 0$  (21)
subject to $\quad Z \geq 0,$  (22)
$\qquad\qquad \sum_{l} Z_{(l,i),(l,j)} + \delta_{ij}\,Z_{i,i} = \Sigma_{ij}, \quad i,j = 1,\dots,N,$  (23)

where the optimization is over symmetric (Hermitian) matrices $Z$ on $\mathcal{W}$. Up to a trivial re-expression of the linear functions of $Z$ in terms of trace inner products with suitable matrices, the optimization problem above is in the (dual) standard form of an SDP (Vandenberghe and Boyd, 1996, Section 3).
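
As an illustration of how the membership test can be set up with an off-the-shelf modeling tool, here is a minimal sketch using cvxpy (an implementation choice of this presentation, not the paper's code), written at the level of the decomposition (16) rather than in the standard form above; the triangle scenario, the block sizes and the numerical covariance matrix are illustrative assumptions.

```python
import cvxpy as cp
import numpy as np

def semidefinite_test(Sigma, dims, children):
    """Check whether Sigma admits a decomposition as in (16):
    Sigma = sum_l S_l + R, with each S_l >= 0 supported on the children of
    latent variable l, and R >= 0 block diagonal w.r.t. the observables."""
    d = sum(dims)
    offs = np.cumsum([0] + list(dims))
    blocks = [slice(offs[i], offs[i + 1]) for i in range(len(dims))]

    S = [cp.Variable((d, d), PSD=True) for _ in children]     # candidate Sigma_l's
    R = [cp.Variable((dims[i], dims[i]), PSD=True) for i in range(len(dims))]

    constraints = []
    # Support constraints: S_l vanishes outside the children subspaces.
    for S_l, ch in zip(S, children):
        for i, bi in enumerate(blocks):
            for j, bj in enumerate(blocks):
                if i not in ch or j not in ch:
                    constraints.append(S_l[bi, bj] == 0)
    # Reproduce the observed covariance matrix block by block.
    for i, bi in enumerate(blocks):
        for j, bj in enumerate(blocks):
            lhs = sum(S_l[bi, bj] for S_l in S)
            if i == j:
                lhs = lhs + R[i]
            constraints.append(lhs == Sigma[bi, bj])

    prob = cp.Problem(cp.Minimize(0), constraints)
    prob.solve(solver=cp.SCS)
    return prob.status in (cp.OPTIMAL, cp.OPTIMAL_INACCURATE)

# Triangle scenario: three scalar observables, latent parents {0,1}, {1,2}, {0,2}.
Sigma_obs = np.array([[1.0, 0.4, 0.3],
                      [0.4, 1.0, 0.2],
                      [0.3, 0.2, 1.0]])
print(semidefinite_test(Sigma_obs, dims=[1, 1, 1],
                        children=[{0, 1}, {1, 2}, {0, 2}]))
```

Feasibility of this program corresponds to the existence of a decomposition (16), while a formal infeasibility answer falsifies the presumed bipartite DAG at the level of the covariance matrix.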

The left-hand side of (23) implicitly defines a linear map $\Phi$ from matrices on $\mathcal{W}$ to matrices on $\mathcal{V}$. Explicitly, $\Phi$ maps the off-diagonal blocks of the decomposition (19) to $0$ and acts on the block-diagonal part as

$\Phi(Z)_{ij} = \sum_{l} Z_{(l,i),(l,j)} + \delta_{ij}\,Z_{i,i}.$

The constraints of the SDP can thus be written slightly more transparently as

$\Phi(Z) = \Sigma,$  (24)
$Z \geq 0.$  (25)

In this language, the dual of the above SDP is

minimize $\quad \mathrm{Tr}(Y\Sigma)$  (26)
subject to $\quad \Phi^{*}(Y) \geq 0,$  (27)

where $\Phi^{*}$ denotes the adjoint of $\Phi$. Let $Y^{\star}$ be the optimizer of (26). If $\mathrm{Tr}(Y^{\star}\Sigma) < 0$, then the original SDP is infeasible and therefore $\Sigma$ is not of the form (16). Indeed, by construction, such a $Y^{\star}$ has a negative trace inner product with the covariance matrix, but a positive trace inner product with all matrices that could potentially be feasible for the primal SDP (24). Thus, the dual SDP (26) can be used to find a witness or a dual certificate for the incompatibility of a covariance matrix with a presumed causal structure. The geometry of the involved objects is shown in figure 4. We will refer to this dual construction in section XI, where we sketch possibilities to base statistical hypothesis tests on such witnesses.
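
A witness of this kind can also be searched for directly at the level of the cone of compatible covariance matrices. The following sketch (a reformulation in the spirit of the dual program, with a Frobenius-norm normalization added to keep it bounded; it is not claimed to be identical to (26)-(27)) looks for a symmetric matrix Y whose restrictions to each children set and to each diagonal entry are positive semidefinite, so that Tr(Y Sigma') >= 0 for every compatible Sigma' while Tr(Y Sigma) is made as negative as possible; it is written for scalar observables.

```python
import cvxpy as cp
import numpy as np

def find_witness(Sigma, children):
    """Search for a witness Y: Tr(Y Sigma') >= 0 for every Sigma' of the
    form (16), while Tr(Y Sigma) < 0 certifies incompatibility of Sigma."""
    n = Sigma.shape[0]
    Y = cp.Variable((n, n), symmetric=True)
    cons = [cp.norm(Y, 'fro') <= 1]               # normalization (our choice)
    cons += [Y[i, i] >= 0 for i in range(n)]      # P_i Y P_i >= 0 (scalar blocks)
    for ch in children:                           # Q_l Y Q_l >= 0 on its support
        idx = sorted(ch)
        W = cp.Variable((len(idx), len(idx)), PSD=True)
        cons += [Y[p, q] == W[a, b]
                 for a, p in enumerate(idx) for b, q in enumerate(idx)]
    prob = cp.Problem(cp.Minimize(cp.trace(Y @ Sigma)), cons)
    prob.solve(solver=cp.SCS)
    return prob.value, Y.value                    # negative optimum => witness found

# A covariance pattern that cannot arise from the triangle DAG: strong negative
# correlations between all three pairs (illustrative numbers).
Sigma_bad = np.array([[1.0, -0.7, -0.7],
                      [-0.7, 1.0, -0.7],
                      [-0.7, -0.7, 1.0]])
val, Y = find_witness(Sigma_bad, children=[{0, 1}, {1, 2}, {0, 2}])
print(val)   # < 0: Sigma_bad is incompatible with the triangle structure
```

The optimal Y then plays the role of the normal vector of the separating hyperplane in figure 4.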

Figure 4: Dual Certificates. The set of covariance matrices compatible with a certain causal structure in the sense of proposition 1 forms a convex cone. The cone is the feasible set of the SDP (21). If a given covariance matrix is not an element of that cone, then there exists a hyperplane (depicted in red) separating the two convex sets. A normal vector for the separating hyperplane can be found using the dual SDP (26).

V Realizing a given decomposition

In the previous section we have shown that the observable covariance matrix associated with a given bipartite DAG always satisfies a particular semidefinite decomposition implied by that DAG. Here we show the converse, in the sense that if we have a positive semidefinite operator that satisfies the decomposition obtained from a particular bipartite DAG, then there exists a causal model associated with that DAG that has the given operator as its observable covariance matrix (see figure 5). The proof is based on the observation that each positive semidefinite operator on a vector space can be interpreted as the covariance of a vector-valued random variable on that space (e.g. as the covariance of a multivariate normal distribution, or of a variable over a finite alphabet, as discussed in section V.2). The essential idea is that we assign an independent random variable to each component in the decomposition, and take these as the latent variables, and that the support structure of the components furthermore determines the children of the latent variables.

V.1 Realization of decompositions

Let $\Omega := \{1,\dots,N\}$ be a finite set, and let $S_1,\dots,S_M$ be a collection of subsets of $\Omega$. The collection defines a bipartite DAG with observable nodes $X_1,\dots,X_N$, and a set of latent nodes $\lambda_1,\dots,\lambda_M$, with the edges assigned by the identification $\mathrm{ch}(\lambda_l) := \{X_i : i\in S_l\}$ for $l = 1,\dots,M$. In the following we denote this bipartite DAG by $G(S_1,\dots,S_M)$.

Figure 5: A positive semidefinite operator on a set of selected orthogonal subspaces can be regarded as the covariance matrix of a corresponding collection of vector-valued variables. If this operator separates into positive semidefinite components (as schematically depicted to the left), then the support structures of these components define a bipartite DAG (on the right). The components in the decomposition can be interpreted as the covariance matrices of independent vector-valued latent variables. Moreover, the collection of subspaces on which such an operator has support determines the observable children of the corresponding latent variable. Each observable variable can be constructed by adding the components collected from its parents.
Proposition 2.

Let $\mathcal{V}_1,\dots,\mathcal{V}_N$ be finite-dimensional real or complex inner-product spaces. For a number $M$ let $S_1,\dots,S_M$ be a collection of subsets $S_l\subseteq\{1,\dots,N\}$. Suppose that $\Sigma$ is a positive semidefinite operator on the space $\mathcal{V} := \mathcal{V}_1\oplus\cdots\oplus\mathcal{V}_N$, and that it can be written

$\Sigma = \sum_{l=1}^{M}\Sigma_l + \Sigma_R$  (28)

for

$\Sigma_l \geq 0, \quad Q_l\Sigma_l Q_l = \Sigma_l, \qquad \Sigma_R \geq 0, \quad \Sigma_R = \sum_{i=1}^{N} P_i\Sigma_R P_i,$  (29)

with $Q_l := \sum_{i\in S_l} P_i$ being the projectors onto the subspaces $\bigoplus_{i\in S_l}\mathcal{V}_i$. Then there exists a causal model for the bipartite DAG $G(S_1,\dots,S_M)$ with vector-valued variables $V_1,\dots,V_N$ in $\mathcal{V}_1,\dots,\mathcal{V}_N$ such that $V := V_1\oplus\cdots\oplus V_N$ satisfies

$\mathrm{Cov}(V) = \Sigma.$  (30)
Proof.

Let us define the set and its complement . By construction, is the set of observable nodes in the bipartite DAG that have no parents (like vertex in figure 1) and thus each element in has at least one parent. By the definition of in (29) it follows that . In other words, the operators have no support on the subspaces belonging to parentless observable nodes. Let us now turn to the operator and its block diagonal decomposition with . We can write . Consequently, can be decomposed in one operator on the subspace , and a collection of blocks on the corresponding subspaces for . Since is positive semidefinite, it can be interpreted as the covariance matrix of some random vector in . In the following we assume that we have made such an assignment for all . We also assume that these random vectors are independent.

Each for has its support inside the support of at least one . Hence, we can ‘distribute’ the operators for by forming new positive semidefinite operators such that

(31)

where one may note that .

In the following we shall assign observable and latent random variables to the vertices of the bipartite DAG . For each and each , let be a vector space that is isomorphic to , and let be an arbitrary isomorphism. (We assume that these isomorphisms preserve the inner-product structure, such that maps orthonormal bases of to orthonormal bases of .) We regard the spaces in the collection as being orthogonal to each other. Define , and the corresponding isomorphism . Since each is positive semidefinite, it can be interpreted as the covariance matrix of a vector-valued random variable on . Consequently, we can also find a vector-valued random variable on such that

(32)

We assume that the random variables are independent of each other, and also independent of .

The variables serve as the latent variables corresponding to the latent nodes in the bipartite DAG . In the following we shall construct a collection of vector-valued variables as deterministic functions of the latent variables , in such a way that these functions correspond to the arrows in , thus guaranteeing a valid causal model associated with this bipartite DAG.

Let us decompose the vector into its projections onto the subspaces . For each , the vector is associated to the observable node . (One can imagine it to be transferred to node .) Equivalently we can say that each observable node receives the vector from its ancestor . On the observable node we construct a new vector by adding all the vectors ‘sent to it’ from its parents

(33)

where the last equality follows since if , or equivalently if , and thus if . The collection we take as the observable variables, and we define .

Due to the fact that all of the latent contributions constructed above are independent of each other, their covariance matrices add up, and we get $\mathrm{Cov}(V) = \Sigma$, which is (30). ∎
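
As a concrete (and purely illustrative) version of this construction in the Gaussian case: draw one independent zero-mean Gaussian latent vector per component of the decomposition, each with the corresponding component as its covariance and hence supported on the children subspaces, and let every observable collect the contributions it receives from its parents. The component matrices below are illustrative numbers, chosen so that they sum to a valid covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(3)

def realize(components, n):
    """Sample n realizations of V from a Gaussian causal model in which each
    PSD component of the decomposition is the covariance of one independent
    latent vector; summing the latent vectors node-wise gives the observables."""
    d = components[0].shape[0]
    V = np.zeros((n, d))
    for C in components:
        V += rng.multivariate_normal(np.zeros(d), C, size=n, method='eigh')
    return V

# Triangle scenario with scalar observables: three pair-supported components
# plus a block-diagonal (here diagonal) remainder.
S01 = np.array([[0.4, 0.4, 0.0], [0.4, 0.4, 0.0], [0.0, 0.0, 0.0]])
S12 = np.array([[0.0, 0.0, 0.0], [0.0, 0.2, 0.2], [0.0, 0.2, 0.2]])
S02 = np.array([[0.3, 0.0, 0.3], [0.0, 0.0, 0.0], [0.3, 0.0, 0.3]])
R = np.diag([0.3, 0.4, 0.5])

V = realize([S01, S12, S02, R], n=200_000)
print(np.cov(V, rowvar=False).round(2))   # approximately S01 + S12 + S02 + R
```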

V.2 Positive semidefinite operators as covariance matrices of vector-valued random variables over finite alphabets

The material in the previous section presumes the existence of realizations of positive semidefinite operators as the covariance of some vector-valued variable, without making any restriction on their nature. As mentioned above, each positive semidefinite operator (over a finite-dimensional real or complex vector space) can be regarded as the covariance of a multivariate normal distribution. However, suppose that we would require that the variable can only take a finite number of outcomes. Here we briefly discuss the conditions for such realizations, and provide an explicit construction (in the proof of Lemma 3).

For a (possibly vector-valued) random variable over a finite alphabet, we say that the supported alphabet size is $M$ if there are precisely $M$ outcomes that occur with a non-zero probability.

Lemma 2.

If a random variable $V$ on a finite-dimensional real or complex inner-product space has a supported alphabet size $M$, then $\mathrm{rank}\bigl(\mathrm{Cov}(V)\bigr) \leq M - 1$.

Proof.

We first note that . Since very manifestly is a linear combination of , it follows that the range of is a subset of the range of , and thus . However, in the following we shall show that the stronger inequality holds. To see this, let us first consider the case that are linearly dependent. This means that at least one of these vectors is a linear combination of the others, and thus . Let us now instead assume that is a linearly independent set. Define by , then . Hence, is the matrix representation of with respect to the linearly independent, but not necessarily orthonormal set . One can realize that due to the linear independence, it follows that . Finally, let us define the -dimensional vector . One can confirm that . Hence, , and we can conclude that . ∎
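
A quick numerical illustration of this bound (an illustrative sketch, not from the paper): a vector-valued variable supported on $M$ outcomes has a covariance matrix of rank at most $M-1$, irrespective of the dimension of the vectors.

```python
import numpy as np

rng = np.random.default_rng(4)
M, dim, n = 4, 10, 50_000

outcomes = rng.normal(size=(M, dim))         # M supported outcomes in R^10
probs = rng.dirichlet(np.ones(M))            # arbitrary nonzero probabilities
samples = outcomes[rng.choice(M, size=n, p=probs)]

cov = np.cov(samples, rowvar=False)
print(np.linalg.matrix_rank(cov, tol=1e-8))  # <= M - 1 = 3, as in Lemma 2
```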

Lemma 3.

Let $\Sigma$ be a positive semidefinite operator on a finite-dimensional real or complex inner-product space $\mathcal{V}$. For every $M \geq \mathrm{rank}(\Sigma) + 1$ there exists a vector-valued random variable $V$ on $\mathcal{V}$ with supported alphabet size $M$, such that $\mathrm{Cov}(V) = \Sigma$. However, $\mathrm{Cov}(V) \neq \Sigma$ for all $V$ with a supported alphabet size $M \leq \mathrm{rank}(\Sigma)$.

Proof.

Let be the supported alphabet size of a vector-valued random variable . If , then we know from Lemma 2 that . Hence, it remains to show that it is possible to find a such that for every . We thus wish to find a collection of vectors , and with , and , such that .

Let be an orthonormal basis of the range (support) of the operator , and let be the projector onto the range. Let be a matrix in () if the underlying space is real (complex). Since , we can assign the th column of to be the vector (i.e., for all ) and we arbitrarily complete the rest of the matrix such that it becomes orthogonal (unitary). Since is orthogonal (unitary), it follows that its columns form an orthonormal basis of (). Hence, for each it must be the case that the vector is orthogonal to , and thus

(34)

Next, define the set of vectors by . One can confirm that , as well as , where we use (34). As the final step we define and for . One can confirm that