Despite the central importance of discovering causal relations in science, the statistical analysis of empirical data has historically shied away from causality. Only relatively recently has a rigorous theory of causality emerged (see, for instance, Pearl (2009); Spirtes et al. (2001)), showing that empirical data can indeed contain information about causation rather than mere correlation. Since then, causal inference has quickly become influential. Examples range from applications to the inference of genetic networks Friedman (2004) and social networks Ver Steeg and Galstyan (2011), to a better understanding of the role of causality within quantum physics Leifer and Spekkens (2013); Fritz (2012, 2016); Henson et al. (2014); Chaves et al. (2015a); Pienaar and Brukner (2015); Ried et al. (2015); Costa and Shrapnel (2016); Horsman et al. (2016).
To formalize causal mechanisms it has become popular to use directed acyclic graphs (DAGs), where nodes denote random variables and directed edges (arrows) account for their causal relations. Central problems within this context include inference or model selection: ‘Given samples from a number of observable variables, which DAG should we associate with them?’, as well as hypothesis testing: ‘Can the observed data be explained in terms of an assumed DAG?’ Here, we concentrate on the latter problem and propose a novel solution based on the covariances that a given causal structure gives rise to. To understand the relevance and applicability of this method it is useful to summarize the difficulties that we typically face when approaching such problems.
The most common method to infer the set of possible DAGs compatible with empirical observations is based on the Markov condition and the faithfulness assumption Pearl (2009); Spirtes et al. (2001). Under these conditions, and in the case where all variables composing a given DAG can be assumed to be empirically accessible, the conditional statistical independencies implied by the graph contain all the information required to test for the compatibility of some data with the causal structure. However, for a variety of practical and fundamental reasons, we quite generally face causal discovery in the presence of latent (hidden) variables, that is, variables that may play an important role in the causal model but nonetheless cannot be accessed empirically. In this case we have to characterize the set of marginal probability distributions that a given DAG can give rise to. Unfortunately, as is widely recognized, generic causal models with latent variables impose highly non-trivial constraints on the correlations compatible with them Pitowsky (1991); Pearl (1995); Geiger and Meek (1999); Bonet (2001); Garcia et al. (2005); Kang and Tian (2006, 2007); Evans (2012); Lee and Spekkens (2015); Chaves (2016); Rosset et al. (2016); Wolfe et al. (2016). Although the marginal compatibility can in principle be completely characterized in terms of semi-algebraic sets Geiger and Meek (1999), it appears that the resulting tests are in practice computationally intractable beyond a few variables Garcia et al. (2005); Lee and Spekkens (2015).
One possible approach to deal with the apparent intractability is to consider relaxations of the original problem, that is, to design tests that define incomplete lists of constraints (outer approximations) to the set of compatible distributions Bonet (2001); Garcia et al. (2005); Kang and Tian (2006, 2007); Moritz et al. (2014); Chaves et al. (2014a, b). For instance, this approach has previously been considered in Chaves et al. (2014a, b); Steudel and Ay (2015); Weilenmann and Colbeck (2016), with tests based on entropic information theoretic inequalities; an idea originally conceived to tackle foundational questions in quantum mechanics Braunstein and Caves (1988); Cerf and Adami (1997); Chaves and Fritz (2012); Fritz and Chaves (2013); Chaves (2013); Chaves et al. (2015b); Chaves and Budroni (2016). Here we consider a relaxation in a similar spirit, but based on covariances rather than entropies.
Beyond dealing with potential computational intractabilities, an additional benefit of a relaxation based on covariances is that it involves at most bipartite marginals, and it seems reasonable to expect that this would be less data-intensive than methods based on the full multivariate distribution of the observables.
I.1 Main assumptions and results
We focus on a particular class of latent causal structures, where we assume that there are no direct causal influences between the observables, but only from latent variables to observables (see figure 1). Hence, all correlations among the observables are due to the latent variables. This setting can be described by the class of DAGs where all edges are directed from latent vertices to observable vertices, but no edges within these two groups (see figure 1). In other words, we consider the case of DAGs that are bipartite, with the coloring ‘observable’ and ‘latent’. Alternatively, this can be described in terms of hypergraphs, where each independent latent cause is associated with a hyperedge consisting of the affected observable vertices (see e.g. Evans (2016)).
This class of graphs has previously been considered in the context of marginalization of Bayesian networks Moritz et al. (2014); Steudel and Ay (2015); Evans (2016). They moreover provide examples of the difficulties that arise when characterizing latent structures Branciard et al. (2010); Fritz (2012); Branciard et al. (2012); Tavakoli et al. (2014); Chaves et al. (2014a, b), where standard techniques based on conditional independencies can even yield erroneous results (for a discussion, see e.g. Wood and Spekkens (2015)). This type of latent structure furthermore emerges in the context of Bell’s theorem Bell (1964), as well as in recent generalizations Branciard et al. (2010); Fritz (2012); Branciard et al. (2012); Tavakoli et al. (2014); Chaves (2016); Rosset et al. (2016); Saunders et al. (2016); Carvacho et al. (2016), where it can be used to show that quantum correlations between distant observers (thus without direct causal influences between them) are incompatible with our most basic notions of cause and effect.
Irrespective of the nature of the observables (categorical or continuous), we are free to assign vectors to each possible outcome of the observables. Our main result is to show that each bipartite DAG implies a particular decomposition of the resulting covariance matrix into positive semidefinite components. Hence, we can test whether the observed covariance matrix is compatible with a hypothetical bipartite DAG by checking whether it satisfies the corresponding positive semidefinite decomposition, and we will in the following somewhat colloquially refer to this as the ‘semidefinite test’. The semidefinite test can thus be phrased as a semidefinite membership problem, which in turn can be solved via semidefinite programming. The latter is known to be computationally efficient from a theoretical point of view, and has a good track record concerning algorithms that are efficient also in practice (see discussions in Vandenberghe and Boyd (1996)).
I.2 Structure of the paper
In section II we derive a general decomposition of covariance matrices, which forms the basis of our semidefinite test. In section III we rephrase this general result to fit the particular structure of observables and latent variables that we employ, and in section IV we derive the main result, namely that every bipartite DAG implies a particular semidefinite decomposition of the observable covariance matrix. Section V focuses on the converse, namely that every covariance matrix that satisfies the decomposition of a given bipartite DAG can be realized by a corresponding causal model. Section VI relates the semidefinite decomposition to previous types of operator inequalities introduced in von Prillwitz (2015). To obtain a covariance matrix we may be required to assign vectors to the outcomes of the random variables, and section VII discusses the dependence of the semidefinite test on this assignment. In section VIII we briefly discuss the fact that the compatibility with a given bipartite DAG is not affected if the observables are processed locally, and that the semidefinite test respects this basic property under suitable conditions. Section IX considers a specific class of distributions where it is possible to analytically determine the conditions for a semidefinite decomposition. This class of distributions serves in section X as a testbed for comparisons with the above-mentioned entropic tests. We conclude with a summary and outlook in section XI.
II Semidefinite decomposition of covariance matrices
In this section we develop the basic structure that forms the core of the semidefinite test. In essence it is obtained via a repeated application of a law of total variance for covariance matrices.
For a vector-valued random variable , in a real or complex inner product space , we define the covariance matrix of as
where denotes the expectation of and denotes the transposition if the underlying vector space is real, and the Hermitian conjugation if the space is complex. One should note that . We also define the cross-correlation for a pair of vector-valued variables (not necessarily belonging to the same vector space)
where . For a pair of random variables we denote the expectation of conditioned on as . Via the conditional expectation we can also define the conditional covariance matrix
In a similar manner we can also obtain a conditional cross-correlation between two random vectors
The starting point for our derivations is the law of total expectation
where the ‘outer’ expectation corresponds to the averaging over the random variable . The law of total expectation can be iterated, such that for three random variables , we have a law of total conditional expectation
and thus .
From the law of total expectation (5) one can obtain a covariance-matrix version of the law of total variance
which can be confirmed by expanding the two sides of the above equality and applying (5).
For three random variables a conditional version of the law of total covariance reads
which can be obtained by expanding the right hand side and applying the law of total conditional expectation (6).
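The matrix law of total covariance above can be checked numerically for a small discrete model. The following sketch is purely illustrative: the distributions and outcome vectors are invented, and the check is an exact computation over the joint distribution rather than a sample estimate.

```python
import numpy as np

# Exact numeric check of Cov(X) = E[Cov(X|L)] + Cov(E[X|L]) for a toy model:
# a latent L with two values and a vector-valued X with four outcomes in R^3.
rng = np.random.default_rng(0)
p_l = np.array([0.3, 0.7])                      # distribution of latent L
outcomes = rng.normal(size=(2, 4, 3))           # outcome vectors of X, per value of L
p_x_given_l = np.array([[0.1, 0.2, 0.3, 0.4],
                        [0.25, 0.25, 0.25, 0.25]])

def cov(vectors, probs):
    """Covariance matrix and mean of a discrete vector-valued variable."""
    m = probs @ vectors
    centred = vectors - m
    return (centred * probs[:, None]).T @ centred, m

# Left-hand side: covariance of X under the joint distribution.
joint_p = (p_l[:, None] * p_x_given_l).reshape(-1)
joint_v = outcomes.reshape(-1, 3)
lhs, _ = cov(joint_v, joint_p)

# Right-hand side: expected conditional covariance plus covariance
# of the conditional means.
cond_covs, cond_means = zip(*(cov(outcomes[l], p_x_given_l[l]) for l in range(2)))
rhs = sum(p_l[l] * cond_covs[l] for l in range(2))
rhs = rhs + cov(np.array(cond_means), p_l)[0]
```

Both sides agree exactly, since all expectations are computed by direct summation.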
One may note the similarities with the chain rule for entropies (see e.g. chapter 2 in Cover and Thomas (2012)).
Let be a vector-valued random variable on a finite-dimensional real or complex inner product space , let be random variables over the same probability space. Assuming that the underlying measure is such that all involved conditional expectations and covariances are well defined, then
where and are positive semidefinite operators on the space , defined by
One may note that the above decomposition is not necessarily unique; we could potentially obtain a new decomposition if the variables in the sequence are permuted.
Suppose that for some it would be true that
The law of total conditional covariance (8), with and , gives
By inserting this expression into the last line of (12) one does again obtain (12) but with substituted for . By (11) we can see that (12) is true for . Thus, by induction to , and the identifications in (10), we obtain (9).
Note that is a positive semidefinite operator on for each value of . Hence, by averaging over these variables, and thus implementing the expectation that yields , we do still have a positive semidefinite operator on . The same observation applies to . ∎
III Observable vs. latent variables, and feature maps
Here we consider the decomposition developed in the previous section for the more specific setting of observable and latent variables.
We consider a collection of observable variables . To each of these variables we associate a mapping , in some contexts referred to as a ‘feature map’ Schölkopf and Smola (2002), into a finite-dimensional vector space . We denote the resulting vector-valued random variables by , and for the sake of simplicity we will in the following tend to abuse the terminology and refer to the vectors themselves as feature maps. We also define the joint random vector on . (Hence, we can view as the concatenation of the vectors .) One should note that while we regard the observable variables as being part of the setup that is ‘given’, the feature maps are part of the analysis, and we are free to assign these as we see fit. (Concerning the question of how the test depends on this choice, see section VII.)
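For categorical observables a standard choice of feature map is the one-hot (indicator) embedding. The following sketch, with an invented joint distribution over two small alphabets, shows how the concatenated feature vector yields a covariance matrix whose diagonal blocks are the individual covariances and whose off-diagonal block is the cross-correlation; the specific alphabet sizes are illustrative only.

```python
import numpy as np

# Two categorical observables: X1 with 3 outcomes, X2 with 2 outcomes.
# Feature map: one-hot vectors, concatenated into a joint vector in R^5.
p = np.array([[0.2, 0.1, 0.1],
              [0.1, 0.2, 0.3]])          # invented joint distribution p[x2, x1]

def phi(x1, x2):
    v = np.zeros(5)
    v[x1] = 1.0                          # one-hot block for X1 (dimension 3)
    v[3 + x2] = 1.0                      # one-hot block for X2 (dimension 2)
    return v

vectors = np.array([phi(x1, x2) for x2 in range(2) for x1 in range(3)])
probs = p.reshape(-1)                    # same x2-major ordering as the loop above

mean = probs @ vectors
Sigma = (vectors - mean).T @ np.diag(probs) @ (vectors - mean)

# Diagonal blocks: covariances of the individual feature vectors;
# off-diagonal block: their cross-correlation.
Sigma_11, Sigma_12 = Sigma[:3, :3], Sigma[:3, 3:]
```

Since the components of a one-hot vector sum to one, each row of a diagonal block sums to zero, which is a quick sanity check on the construction.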
Let denote the projector onto the subspace in . We divide the total covariance matrix into the cross-correlations between the separate observable quantities . One can note that .
For a collection of latent variables , we make the identifications in Lemma 1. Similarly as for the covariance matrix we decompose the operators and into ‘block-matrices’ and , with and , where we can write
for . In terms of these blocks we can thus reformulate (9) as
One should keep in mind that and in the general case are matrices (rather than scalar numbers) for each single pair .
IV Decomposition of the covariance matrix for bipartite DAGs
We define a bipartite DAG as a finite DAG with vertices and edges , with a bipartition , such that all edges in are directed from the elements in (the latent variables) to the elements in (the observables). Since is finite, we enumerate the elements of as and the elements of as . One may note that we will generally overload the notation and let and denote the vertices in the underlying bipartite DAG, as well as the random variables associated with these vertices.
For a vertex in a directed graph we let denote the children of , i.e., the set of vertices for which there is an edge directed from to . We let denote the parents of , i.e., the set of vertices for which there is an edge directed from to . For bipartite DAGs an element in can only have children in (and have no parents), and an element in can only have parents in (and no children). As an example, for the bipartite DAG in figure 1 we have , , and , and , , , , and .
For a causal model defined by a general DAG the underlying probability distribution can be described via the Markov condition where each edge represents a direct causal influence, and thus each vertex can only be directly influenced by its parents , resulting in distributions of the form . Hence, for a bipartite DAG we get , and thus all the latent variables are independent, and the observables are independent when conditioned on the latent variables.
As in the previous section, we map the observables to vectors in vector spaces . For each we define the projector in by
Hence, is the projector onto all subspaces of that are associated with the children of the latent variable . (In the above sum we should strictly speaking write . However, in order to avoid a too cumbersome notation we will from time to time take the liberty of writing rather than , and rather than .)
For a bipartite DAG with latent variables and observables with assigned feature maps into finite-dimensional real or complex inner-product spaces , the covariance matrix of satisfies
and where the projectors are as defined in (15) with respect to the given bipartite DAG, and where is the projector onto in .
One may note that if the span of the supports of covers , then we can distribute the blocks of and add them to the different in such a way that the new operators still are positive semidefinite and satisfy the support structure of the original s. The exception is if there is some observable that has no parent (as in figure 1).
Select an enumeration of the latent variables. By Lemma 1 we know that the covariance matrix can be decomposed as in (9) with the positive semidefinite operators and as defined in (10). In the following we will make use of the block-decomposition and with respect to the subspaces as in (13).
If then it means that is independent of and thus
The analogous statement is true if . By this it follows that
By construction, all the observables and thus also are independent when conditioned on the latent variables. Hence,
and thus .
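A simple special case makes the decomposition concrete: if each observable is a linear function of its latent parents, with independent latent vectors, then the covariance matrix is a sum of positive semidefinite terms, each supported on the children of the corresponding latent variable. The sketch below uses an invented toy structure with two latents and four one-dimensional observables; it illustrates the structure of the theorem, not its general proof.

```python
import numpy as np

# Linear toy model: X = sum_j A_j L_j with independent latent vectors L_j
# (unit covariance). Then Cov(X) = sum_j A_j A_j^T, and each term is PSD
# and supported on the children of latent j.
rng = np.random.default_rng(1)
d = 4                                    # four observables, 1-D features
children = [[0, 1], [1, 2, 3]]           # ch(L_1), ch(L_2), invented structure

Q_terms = []
for ch in children:
    A = np.zeros((d, len(ch)))
    A[ch, :] = rng.normal(size=(len(ch), len(ch)))   # mixing, support on ch
    Q_terms.append(A @ A.T)

Sigma = sum(Q_terms)

# Each component is PSD and vanishes outside the rows/columns of its children.
for Q, ch in zip(Q_terms, children):
    mask = np.zeros(d, dtype=bool)
    mask[ch] = True
    assert np.allclose(Q[~mask, :], 0) and np.allclose(Q[:, ~mask], 0)
```

This is exactly the support structure that the semidefinite test checks for, given only the covariance matrix and the hypothesized children sets.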
Deciding whether a given matrix is of the form (16) can be done via semidefinite programming (SDP). We end this section by describing an explicit SDP formulation.
The optimization will be over matrices which can be interpreted as the direct sum of candidates for and the ’s. More precisely, let
Let be a matrix on . According to the direct sum decomposition (19), the matrix is a block matrix with blocks. We think of the first diagonal blocks as carrying candidates for (which completely defines , according to (17)), while the rear diagonal blocks correspond to candidate ’s. Note that the rear summands in (19) are direct sums themselves. It therefore makes sense to use double indices to refer to spaces inside the ’s. Concretely, the SDP includes affine constraints on the blocks . The first part of the indices selects the space in (19). The second part refers to the space within according to (20). We use the convention that denotes if either or does not occur in .
With these definitions, the semidefinite program that verifies whether a covariance matrix is of the form (16) reads
where the optimization is over symmetric (Hermitian) matrices on . Up to a trivial re-expression of the linear functions of in terms of trace inner products with suitable matrices , the optimization problem above is in the (dual) standard form of an SDP (Vandenberghe and Boyd, 1996, Section 3).
The left-hand side of (23) implicitly defines a linear map from matrices on to matrices on . Explicitly, maps off-diagonal blocks to and acts on block-diagonal matrices as
The constraints of the SDP can thus be written slightly more transparently as
In this language, the dual of the above SDP is
Let be the optimizer of (26). If , then the original SDP is infeasible and therefore is not of the form (16). Indeed, by construction, such an has a negative trace inner product with the covariance matrix, but a positive trace inner product
with all matrices that could potentially be feasible for the primal SDP (24). Thus, the dual SDP (26) can be used to find a witness, or a dual certificate, for the incompatibility of a covariance matrix with a presumed causal structure. The geometry of the involved objects is shown in figure 4. We will refer to this dual construction in section XI, where we sketch possibilities to base statistical hypothesis tests on such witnesses.
V Realizing a given decomposition
In the previous section we have shown that the observable covariance matrix associated with a given bipartite DAG always satisfies a particular semidefinite decomposition implied by that DAG. Here we show the converse, in the sense that if we have a positive semidefinite operator that satisfies the decomposition obtained from a particular bipartite DAG, then there exists a causal model associated with that DAG that has the given operator as its observable covariance matrix (see figure 5). The proof is based on the observation that each positive semidefinite operator on a vector space can be interpreted as the covariance of a vector-valued random variable on that space (e.g. as the covariance of a multivariate normal distribution, or of a variable over a finite alphabet, as discussed in section V.2). The essential idea is that we assign an independent random variable to each component in the decomposition, take these as the latent variables, and let the support structure of the components determine the children of the latent variables.
V.1 Realization of decompositions
Let be a finite set, and let be a collection of subsets of . The collection defines a bipartite DAG with as observable nodes, and a set of latent nodes , with the edges assigned by the identification for . In the following we denote this bipartite DAG by .
Let be finite-dimensional real or complex inner-product spaces. For a number let be a collection of subsets . Suppose that is a positive semidefinite operator on the space , and that it can be written
with being the projectors onto the subspaces . Then there exists a causal model for the bipartite DAG with vector-valued variables in such that satisfies
Let us define the set and its complement . By construction, is the set of observable nodes in the bipartite DAG that have no parents (like vertex in figure 1) and thus each element in has at least one parent. By the definition of in (29) it follows that . In other words, the operators have no support on the subspaces belonging to parentless observable nodes. Let us now turn to the operator and its block diagonal decomposition with . We can write . Consequently, can be decomposed into one operator on the subspace , and a collection of blocks on the corresponding subspaces for . Since is positive semidefinite, it can be interpreted as the covariance matrix of some random vector in . In the following we assume that we have made such an assignment for all . We also assume that these random vectors are independent.
Each for has its support inside the support of at least one . Hence, we can ‘distribute’ the operators for by forming new positive semidefinite operators such that
where one may note that .
In the following we shall assign observable and latent random variables to the vertices of the bipartite DAG . For each and each , let be a vector space that is isomorphic to , and let be an arbitrary isomorphism. (We assume that these isomorphisms preserve the inner-product structure, such that maps orthonormal bases of to orthonormal bases of .) We regard the spaces in the collection as being orthogonal to each other. Define , and the corresponding isomorphism . Since each is positive semidefinite, it can be interpreted as the covariance matrix of a vector-valued random variable on . Consequently, we can also find a vector-valued random variable on such that
We assume that the random variables are independent of each other, and also independent of .
The variables serve as the latent variables corresponding to the latent nodes in the bipartite DAG . In the following we shall construct a collection of vector-valued variables as deterministic functions of the latent variables , in such a way that these functions correspond to the arrows in , thus guaranteeing a valid causal model associated with this bipartite DAG.
Let us decompose the vector into its projections onto the subspaces . For each , the vector is associated to the observable node . (One can imagine it to be transferred to node .) Equivalently we can say that each observable node receives the vector from its ancestor . On the observable node we construct a new vector by adding all the vectors ‘sent to it’ from its parents
where the last equality follows since if , or equivalently if , and thus if . The collection we take as the observable variables, and we define .
Due to the fact that all for are independent, and also independent of all , we get
V.2 Positive semidefinite operators as covariance matrices of vector-valued random variables over finite alphabets
The material in the previous section presumes the existence of realizations of positive semidefinite operators as the covariance of some vector-valued variable, without making any restriction on their nature. As mentioned above, each positive semidefinite operator (over a finite-dimensional real or complex vector space) can be regarded as the covariance of a multivariate normal distribution. However, suppose that we require that the variable can take only a finite number of outcomes. Here we briefly discuss the conditions for such realizations, and provide an explicit construction (in the proof of Lemma 3).
For a (possibly vector-valued) random variable over a finite alphabet, we say that the supported alphabet size is , if there are precisely outcomes that occur with non-zero probability.
If a random variable on a finite-dimensional real or complex inner-product space has a supported alphabet size , then .
We first note that . Since very manifestly is a linear combination of , it follows that the range of is a subset of the range of , and thus . However, in the following we shall show that the stronger inequality holds. To see this, let us first consider the case that are linearly dependent. This means that at least one of these vectors is a linear combination of the others, and thus . Let us now instead assume that is a linearly independent set. Define by , then . Hence, is the matrix representation of with respect to the linearly independent, but not necessarily orthonormal set . One can realize that due to the linear independence, it follows that . Finally, let us define the -dimensional vector . One can confirm that . Hence, , and we can conclude that . ∎
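The rank bound of the lemma is easy to verify numerically: for a variable with n supported outcomes, the centred outcome vectors satisfy one linear relation (their probability-weighted sum vanishes), so the covariance has rank at most n - 1. The sketch below uses invented outcome vectors and probabilities.

```python
import numpy as np

# A vector-valued variable with n = 4 supported outcomes in R^6:
# its covariance matrix has rank at most n - 1 = 3.
rng = np.random.default_rng(2)
n, d = 4, 6
vectors = rng.normal(size=(n, d))          # the n supported outcomes
probs = np.array([0.1, 0.2, 0.3, 0.4])     # all strictly positive

mean = probs @ vectors
centred = vectors - mean
Sigma = centred.T @ np.diag(probs) @ centred
rank = np.linalg.matrix_rank(Sigma)
```

The rank deficit comes precisely from the relation sum_i p_i (v_i - m) = 0 among the centred outcomes.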
Let be a positive semidefinite operator on a finite-dimensional real or complex inner-product space . For every there exists a vector-valued random variable on with supported alphabet size , such that . However, for all with a supported alphabet size .
Let be the supported alphabet size of a vector-valued random variable . If , then we know from Lemma 2 that . Hence, it remains to show that it is possible to find a such that for every . We thus wish to find a collection of vectors , and with , and , such that .
Let be an orthonormal basis of the range (support) of the operator , and let be the projector onto the range. Let be a matrix in () if the underlying space is real (complex). Since , we can assign the th column of to be the vector (i.e., for all ) and we arbitrarily complete the rest of the matrix such that it becomes orthogonal (unitary). Since is orthogonal (unitary), it follows that its columns form an orthonormal basis of (). Hence, for each it must be the case that the vector is orthogonal to , and thus
Next, define the set of vectors by . One can confirm that , as well as , where we use (34). As the final step we define and for . One can confirm that
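The construction of the lemma can be carried out explicitly: given a positive semidefinite matrix Q of rank r, one obtains a random vector with r + 1 equiprobable outcomes and covariance exactly Q by combining the eigendecomposition of Q with an orthogonal matrix whose first column is the normalized all-ones vector. The following sketch implements this recipe with an invented Q; the specific dimensions are illustrative.

```python
import numpy as np

# Realize a PSD matrix Q of rank r as the covariance of a variable
# with r + 1 equiprobable outcomes.
rng = np.random.default_rng(3)
d, r = 5, 3
B = rng.normal(size=(d, r))
Q = B @ B.T                                  # PSD matrix of rank r

evals, evecs = np.linalg.eigh(Q)
pos = evals > 1e-10
lam, U = evals[pos], evecs[:, pos]           # the r positive eigenpairs

# Orthogonal (r+1) x (r+1) matrix whose first column is proportional
# to the all-ones vector; the remaining columns are orthogonal to it.
A = np.concatenate([np.ones((r + 1, 1)), rng.normal(size=(r + 1, r))], axis=1)
O, _ = np.linalg.qr(A)

# Outcomes: the rows of X, each occurring with probability 1/(r+1).
X = np.sqrt(r + 1) * O[:, 1:] @ np.diag(np.sqrt(lam)) @ U.T

mean = X.mean(axis=0)                        # zero, as O[:, 1:] is orthogonal to ones
Sigma = X.T @ X / (r + 1)                    # equals Q by construction
```

The mean vanishes because each of the columns used is orthogonal to the all-ones vector, and the covariance reduces to the eigendecomposition of Q.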