Partial correlation hypersurfaces in Gaussian graphical models

06/01/2018 ∙ by Jan Draisma, et al. ∙ Universität Bern

We derive a combinatorial sufficient condition for a partial correlation hypersurface in the parameter space of a directed Gaussian graphical model to be nonsingular, and speculate on whether this condition can be used in algorithms for learning the graph. Since the condition is fulfilled in the case of a complete DAG on any number of vertices, the result implies an affirmative answer to a question raised by Lin-Uhler-Sturmfels-Bühlmann.

1. Introduction

DAGs

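Let $G = (V, E)$ be a directed, acyclic graph (DAG) with vertex set $V$ and edge set $E$. We write $i \to j$ if $(i, j) \in E$ and $i \not\to j$ otherwise. A path in $G$ from $i$ to $j$ of length $\ell$ is a sequence $i = i_0 \to i_1 \to \cdots \to i_\ell = j$ with $(i_{k-1}, i_k) \in E$ for all $k = 1, \ldots, \ell$; we allow $\ell = 0$. If there exists a path from $i$ to $j$ of length at least $1$ we say that $j$ is below $i$.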

Directed Gaussian graphical models

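We follow [DSS09, Page 87]. Associated to $G$ is the directed graphical model for jointly Gaussian random variables $X_j$, $j \in V$, related by

$$X_j = \sum_{i \to j} a_{ij} X_i + \varepsilon_j,$$

where the vector $\varepsilon = (\varepsilon_j)_{j \in V} \sim \mathcal{N}(0, I)$ and where the $a_{ij}$ are the parameters of the model. The vector $X = (X_j)_{j \in V}$ satisfies

$$X = A^T X + \varepsilon,$$

where $A$ is the matrix with $(i,j)$-entry $a_{ij}$ if $i \to j$ and $0$ otherwise. Therefore $X \sim \mathcal{N}(0, \Sigma)$ where

$$\Sigma = (I - A)^{-T} (I - A)^{-1}.$$

Note that, since $A$ is nilpotent, this is a matrix whose entries are polynomials in the parameters $a_{ij}$. For subsets $I, J \subseteq V$ we write $\Sigma[I, J]$ for the $I \times J$-submatrix of $\Sigma$, and we use notation such as $iK := \{i\} \cup K$.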
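The covariance computation above can be sketched in a few lines. The sketch assumes the standard structural-equation form with unit-variance noise, writes the edge weight of $i \to j$ as `a_ij`, and uses the finite geometric series for the inverse of $I - A$. The function name `covariance_from_dag` and the dictionary encoding of the edge weights are illustrative choices, not notation from the paper.

```python
from fractions import Fraction


def mat_mul(P, Q):
    """Product of two matrices given as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]


def transpose(P):
    return [list(row) for row in zip(*P)]


def covariance_from_dag(n, weights):
    """Covariance matrix for a DAG on vertices 0..n-1.

    weights: dict {(i, j): a_ij} with one entry per edge i -> j (assumed encoding).
    """
    A = [[Fraction(weights.get((i, j), 0)) for j in range(n)] for i in range(n)]
    identity = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    # (I - A)^{-1} = I + A + A^2 + ...; the series is finite because A is
    # nilpotent for an acyclic graph (A^n = 0).
    inverse = [row[:] for row in identity]
    power = [row[:] for row in identity]
    for _ in range(n - 1):
        power = mat_mul(power, A)
        inverse = [[inverse[i][j] + power[i][j] for j in range(n)]
                   for i in range(n)]
    # Sigma = (I - A)^{-T} (I - A)^{-1}
    return mat_mul(transpose(inverse), inverse)


# Chain 0 -> 1 -> 2 with a_01 = 2 and a_12 = 3.
sigma = covariance_from_dag(3, {(0, 1): 2, (1, 2): 3})
```

For this chain the diagonal entries are 1, 5 and 46, matching the variances obtained by substituting the structural equations into each other.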

Partial correlation hypersurfaces

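Let $i, j \in V$ be distinct and $K \subseteq V \setminus \{i, j\}$. In [LUSB14] the partial correlation hypersurface is defined as the zero locus of the polynomial

$$\det \Sigma[iK, jK];$$

the expression

$$\frac{\det \Sigma[iK, jK]}{\sqrt{\det \Sigma[iK, iK] \cdot \det \Sigma[jK, jK]}}$$

is the partial correlation of $X_i$ and $X_j$ given $X_K$.

So the vanishing of $\det \Sigma[iK, jK]$ is equivalent to the statement that $X_i$ and $X_j$ are conditionally independent given $X_K$. We assume that $\det \Sigma[iK, jK]$ is not identically zero on the parameter space $\mathbb{R}^E$. This is equivalent to the statement that $K$ does not d-separate $i$ and $j$ in $G$ [SGS01, §2.3.4]; the trek system expansion of Section 2 yields an equivalent combinatorial characterisation.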
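As a small numerical illustration, the partial correlation can be evaluated as a ratio of subdeterminants of the covariance matrix. The sketch below assumes that ratio-of-determinants formula; the helper names `det`, `submatrix` and `partial_correlation` are hypothetical, and the example covariance matrix is the one for the chain 0 → 1 → 2 with edge weights 2 and 3 and unit-variance noise.

```python
from math import sqrt


def det(M):
    """Determinant by Laplace expansion along the first row (small matrices only)."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for c in range(len(M)):
        minor = [row[:c] + row[c + 1:] for row in M[1:]]
        total += (-1) ** c * M[0][c] * det(minor)
    return total


def submatrix(S, rows, cols):
    return [[S[r][c] for c in cols] for r in rows]


def partial_correlation(S, i, j, K):
    """Partial correlation of variables i and j given the variables in K."""
    K = list(K)
    numerator = det(submatrix(S, [i] + K, [j] + K))
    d_i = det(submatrix(S, [i] + K, [i] + K))
    d_j = det(submatrix(S, [j] + K, [j] + K))
    return numerator / sqrt(d_i * d_j)


# Covariance of the chain 0 -> 1 -> 2 with weights 2 and 3 (unit noise variances).
sigma = [[1, 2, 6], [2, 5, 15], [6, 15, 46]]

given_middle = partial_correlation(sigma, 0, 2, [1])  # vertex 1 d-separates 0 and 2
marginal = partial_correlation(sigma, 0, 2, [])       # no conditioning
```

Here `given_middle` vanishes because the numerator determinant is 6·5 − 2·15 = 0, while `marginal` is nonzero, in line with the d-separation statement above.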

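The key motivation in [LUSB14] for studying the partial correlation hypersurface is that the behaviour for $\lambda \to 0$ of the volume (relative to some probability measure) of the set of parameter values at which the absolute value of the partial correlation is at most $\lambda$ is related to the singularities of the hypersurface. This volume scales linearly with $\lambda$ if the hypersurface is nonsingular but can be superlinear otherwise, whence the study of the real log-canonical threshold of the defining polynomial in [LUSB14]. The parameter values in this set correspond to probability distributions that are not $\lambda$-strongly-faithful to the graph: distributions where the PC algorithm for learning the graph might fail. So it is useful to know criteria for nonsingularity of the hypersurface.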

Main result

We will establish the following criterion for nonsingularity of the partial correlation hypersurface; the same applies when the noise terms have unequal variances (Proposition 2).

Assume that and that for all below we have . Then is nonsingular.

If is the DAG on with if and only if , then is nonsingular, independently of the choice of .

For this is [LUSB14, Theorem 4.1], which was established there by extensive computer calculations showing that some power of lies in the ideal generated by and its partial derivatives. Since is positive definite and hence has a nonzero determinant for all (real) values of the parameters, this shows that the (real) common vanishing locus of and its derivatives is empty.

We will follow a similar approach, except that we consider the principal submatrix , no power is needed, and indeed not but only some of its partial derivatives are needed.

Organisation

In Section 2 we review the expansion of subdeterminants of in terms of trek systems without sided intersection [STD10]. In Section 3 we use this to prove the theorem, and we conclude with a brief discussion in Section 4.

2. Background

The trek rule

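We recall results from [STD10]. Suppose we allow the variances of the $\varepsilon_j$ to be distinct, rather than all equal to $1$ as above. In that case, the covariance matrix becomes

$$\Sigma = (I - A)^{-T} \, \Omega \, (I - A)^{-1},$$

where $\Omega$ is the diagonal matrix with the variances $\omega_j$ of the $\varepsilon_j$ on the diagonal. Using the geometric series for $(I - A)^{-1}$ we find that

$$\sigma_{ij} = \sum_{\tau} w(\tau),$$

where the sum is over all treks $\tau$ from $i$ to $j$ as in the following definition.

A trek in $G$ is a pair $\tau = (P_{\mathrm{up}}, P_{\mathrm{down}})$ of paths in $G$ that start at the same vertex $t$, the top of the trek. The paths $P_{\mathrm{up}}, P_{\mathrm{down}}$ are called the up part and the down part of $\tau$, respectively. If $i$ is the last vertex of $P_{\mathrm{up}}$ and $j$ is the last vertex of $P_{\mathrm{down}}$, then we call $\tau$ a trek from $i$ to $j$, $i$ the starting vertex of $\tau$, and $j$ the end vertex of $\tau$. The weight of $\tau$ equals

$$w(\tau) = \omega_t \cdot \prod_{(k \to l) \in P_{\mathrm{up}}} a_{kl} \cdot \prod_{(k \to l) \in P_{\mathrm{down}}} a_{kl}.$$

We allow one or both of $P_{\mathrm{up}}, P_{\mathrm{down}}$ to have length $0$, in which case the corresponding factor(s) above is (are) $1$.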

The terminology derives from an informal interpretation of a trek as traversing upwards from (i.e., against the direction of its edges in ) and then traversing downwards to . In slightly different terms, the trek rule above goes back at least to [Wri34].
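The trek rule can be checked by brute force on a small graph. The following sketch assumes unit noise variances (so each trek weight is just the product of the edge weights along its up and down parts) and enumerates treks as pairs of directed paths with a common top. The function names and the dictionary encoding of edge weights are illustrative assumptions.

```python
def directed_paths(edges, start, end):
    """Yield every directed path from start to end as a list of edges.

    edges: dict {(u, v): weight}; the graph is assumed to be acyclic.
    """
    if start == end:
        yield []
    for (u, v) in edges:
        if u == start:
            for rest in directed_paths(edges, v, end):
                yield [(u, v)] + rest


def path_weight(edges, path):
    """Product of the edge weights along a path (1 for the empty path)."""
    weight = 1
    for edge in path:
        weight *= edges[edge]
    return weight


def trek_sum(vertices, edges, i, j):
    """Sum of the weights of all treks from i to j, with unit noise variances."""
    total = 0
    for top in vertices:
        for up in directed_paths(edges, top, i):
            for down in directed_paths(edges, top, j):
                total += path_weight(edges, up) * path_weight(edges, down)
    return total


# Chain 0 -> 1 -> 2 with edge weights 2 and 3.
vertices = [0, 1, 2]
edges = {(0, 1): 2, (1, 2): 3}
sigma_by_treks = [[trek_sum(vertices, edges, i, j) for j in vertices]
                  for i in vertices]
```

For instance, there are two treks from 1 to 2: one with top 0 (weight 2·2·3 = 12) and one with top 1 (weight 3), giving the covariance entry 15.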

Trek system expansion

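Equip $V$ with an arbitrary linear order. Then for $I, J \subseteq V$ of equal cardinality and a bijection $\pi: I \to J$ we define $\operatorname{sgn}(\pi)$ as $-1$ to the power the number of crossings: pairs $i, i' \in I$ with $i < i'$ but $\pi(i) > \pi(i')$.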
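The sign defined through crossings is straightforward to compute. A minimal sketch, assuming the bijection is given as a Python dictionary from source vertices to target vertices and that the vertices are comparable (the name `crossing_sign` is hypothetical):

```python
def crossing_sign(pi):
    """Sign of a bijection between two equally sized, linearly ordered sets.

    pi: dict mapping each source vertex to its target vertex.
    Returns (-1) ** (number of pairs a < b with pi[a] > pi[b]).
    """
    sources = sorted(pi)
    crossings = sum(
        1
        for x in range(len(sources))
        for y in range(x + 1, len(sources))
        if pi[sources[x]] > pi[sources[y]]
    )
    return (-1) ** crossings


no_crossing = crossing_sign({1: 3, 2: 4})        # order preserved
one_crossing = crossing_sign({1: 4, 2: 3})       # the single pair is reversed
two_crossings = crossing_sign({1: 2, 2: 3, 3: 1})
```

When the source and target sets coincide, this is the usual sign of a permutation counted by inversions.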

Let with . A trek system from to is a set of treks such that is precisely the set of starting vertices of the and is precisely the set of end vertices of the . We write . The map that sends the starting vertex of each trek to its end vertex is a bijection, and we define the sign of as . The weight of is .

A sided intersection between treks and is a vertex where either the up parts of and meet or the down parts of and meet. We say that a trek system has no sided intersections if there is no sided intersection between any two of its treks.

We have the following formula for subdeterminants of .

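[STD10] For $I, J \subseteq V$ of the same cardinality we have

$$\det \Sigma[I, J] = \sum_{T} \operatorname{sgn}(T) \, w(T), \qquad (*)$$

where the sum is over all trek systems $T$ from $I$ to $J$ without sided intersection.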

The proof is an application of tail swapping as in the classical Lindström-Gessel-Viennot Lemma [GV85]. We will see another instance of tail swapping in Section 3. In [STD10] the proposition is used to give a combinatorial criterion, generalising d-separation, for the determinant to be identically zero on the parameter space. Furthermore, in [DST13] it is shown that the sum above is cancellation-free: if two trek systems have the same weight, then they have the same sign. Moreover, it is shown there that the coefficient of each monomial is plus or minus a power of $2$.

All of these results (the formula (*) of course, but also the cancellation-freeness and the power-of-two phenomenon) persist when we specialise the diagonal matrix of variances to the identity matrix, as we did in Section 1 and as we do again in Section 3. Indeed, if $T$ is a trek system without sided intersection, then the tops of the treks in $T$ can be recovered from the specialisation of its weight as follows: a vertex is a top if and only if either

  1. at least one appears in and no appears in ; or else

  2. and contains no and no (then some trek is ).

Action by diagonal matrices

Let where the are in . Then

where and where has the same zero pattern as . Hence, the group acts on the parameter space and on the space of covariance matrices in such a manner that the map is equivariant. This implies that for any of equal cardinality the hypersurface in defined by is stable under this action.

Alternatively, this can be read off from (*): scaling each with and with , the weight of each trek from a vertex to a vertex gets scaled by , and therefore scales with .

Define and let be obtained from by specialising to the identity matrix. Let be the hypersurface in defined by and let be the hypersurface defined by in .

As real algebraic varieties, is isomorphic to . In particular, is nonsingular if and only if is.

Proof.

By the discussion above, the map

maps into . The inverse is given by

Both maps are morphisms of real algebraic varieties. ∎

3. Proof of the theorem

We retain the notation of Section 1; in particular, , and is the hypersurface defined by . In this section, we treat the as variables and our computations take place in the polynomial ring . Let be the ideal in this ring generated by all partial derivatives of .

For and with the variable does not appear in .

Proof.

Let be a trek system without sided intersection. If the arrow were used in the up (respectively, down) part of some trek in , then would have a sided intersection with the trek starting (respectively, ending) at . So that arrow is not used and the conclusion follows from (*). ∎

As a consequence, in the remaining discussion we may and will replace by , so that has no arrows going out of .

Suppose that has no outgoing arrows from elements of . For with the variable appears at most linearly in and its coefficient equals . In particular, .

Proof.

If a trek in a trek system without sided intersection uses the edge , then it does so in its down part—indeed, in its up part it would yield a sided intersection with the trek starting at .

In particular, the variable appears only linearly in . Furthermore, ends in , or else would have a sided intersection with the trek ending at . So if we remove from the arrow , then we obtain a trek system without sided intersection (Figure 1).

Figure 1. Proof of Lemma 3. We suggestively draw the arrows in up parts of treks as pointing in the south-west direction and arrows in down parts as pointing in the south-east direction—of course, this is not always possible!

Conversely, if we have any trek system without sided intersection, then no trek in it passes through on its way down, because has no outgoing arrows. Hence, adding the arrow to the trek in ending in yields a trek system without sided intersection.

Hence the map gives a bijection between the terms in (the trek system expansion of) divisible by and the terms in . Furthermore, equals , where the sign is the sign of the bijection that is the identity on and sends to ; in particular, this sign does not depend on . ∎

Assume that . The variable appears at most linearly in and its coefficient equals where

(**)

is the sum over all trek systems without sided intersection of which one trek contains in its down part. In particular, .

Proof.

If a trek in a trek system without sided intersection uses the edge , then it does so on its way down: on its way up it would yield a sided intersection with the trek starting at . In particular, the variable appears only linearly in .

Furthermore, ends in , or else it would have a sided intersection with the trek ending at . So if we remove from the arrow , then we obtain a trek system without sided intersection (Figure 2). Also, equals times the sign of the bijection that is the identity on and maps to ; this determines the sign in the lemma.

Figure 2. Proof of Lemma 3.

Conversely, given a trek system without sided intersection, we may try to add the arrow to the trek ending in . The resulting trek system has no sided intersection if and only if no trek of passes on its way down. The remaining trek systems must therefore be subtracted as in the lemma. ∎

For under define

the sum of the weights of all directed paths in from to .

The element from (**) satisfies

where is the identity on and sends to .

Proof.

Let be a trek system without sided intersection and let be the trek of ending in . Appending to any path from down to yields a trek system with sign . In this manner, precisely those trek systems arise for which

  1. a unique trek of passes on its way down, and

  2. every sided intersection of is between and some other trek of on their way down, and happens at a vertex below .

So the left-hand side of the equation in the lemma equals where runs over the trek systems with properties (1) and (2). The right-hand side is the sub-sum over all without any sided intersection. We construct a sign-changing involution on the remaining , as follows.

Let be the lowest vertex on the down part of that lies on the down part of some other trek of . Swapping the parts of and below yields treks and that still meet at . Let be the trek system obtained from by replacing with and with (Figure 3).

Figure 3. The tail swapping argument of Lemma 3. The sided intersections of with other treks are depicted as square vertices.

The trek system satisfies (1): is its unique trek that passes on its way down. As for (2): the sided intersections between and other treks are precisely the sided intersections between and other treks, so they happen below . Furthermore, cannot have sided intersections with treks other than , because those would have come from a sided intersection between and another trek happening below —this is where the choice of matters. Furthermore, , so there are no sided intersections between these treks. This shows that satisfies (2). Also, the map is an involution, since is the last intersection of the down part of with any down part of a trek in . Since , this shows that the terms on the left-hand side that do not appear in the right-hand side cancel out. ∎

Proof of the theorem.

We claim that the zero set of in is empty. By Lemma 3 we may delete from all outgoing arrows from elements of without changing . Since , by Lemma 3 we have . The identity in Lemma 3 expresses as a linear combination of the determinants in Lemma 3 where runs over the elements of below . By assumption, for each of these we have , so Lemma 3 implies that . Hence . But for any set of real parameters the matrix is positive definite, hence has a nonzero determinant. This proves the claim. ∎

4. A modest implication for the PC algorithm

In the edge-removal part of the PC algorithm [SGS01] for learning , in each step we have an undirected graph whose edge set, if no error has occurred so far, contains that of . Using the sample covariance matrix, a partial correlation is then computed for some triple such that there is an edge in and such that is contained in the -neighbours of or in the -neighbours of . Before this step all partial correlations with sets of cardinality smaller than that of have already been checked. If the absolute value of the partial correlation is less than some prescribed , then the edge is removed from .
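The edge-removal loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the PC algorithm as implemented in any particular package: it starts from the complete undirected graph, uses an exact covariance matrix in place of a sample covariance matrix, omits the orientation phase, and the names `pc_skeleton` and `lam` (the threshold) are hypothetical.

```python
from itertools import combinations
from math import sqrt


def det(M):
    """Determinant by Laplace expansion along the first row (small matrices only)."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** c * M[0][c] * det([row[:c] + row[c + 1:] for row in M[1:]])
               for c in range(len(M)))


def partial_correlation(S, i, j, K):
    """Ratio-of-determinants formula for the partial correlation of i, j given K."""
    rows_i, rows_j = [i] + list(K), [j] + list(K)

    def sub(rows, cols):
        return [[S[r][c] for c in cols] for r in rows]

    return det(sub(rows_i, rows_j)) / sqrt(det(sub(rows_i, rows_i))
                                           * det(sub(rows_j, rows_j)))


def pc_skeleton(S, n, lam):
    """Edge-removal phase on vertices 0..n-1, starting from the complete graph."""
    edges = {(i, j) for i in range(n) for j in range(i + 1, n)}

    def neighbours(v):
        return {a if b == v else b for (a, b) in edges if v in (a, b)}

    # Conditioning sets are tried in order of increasing cardinality.
    for size in range(n - 1):
        for (i, j) in sorted(edges):
            # Candidate sets lie among the neighbours of i or among those of j.
            candidates = set(combinations(sorted(neighbours(i) - {j}), size))
            candidates |= set(combinations(sorted(neighbours(j) - {i}), size))
            if any(abs(partial_correlation(S, i, j, K)) < lam for K in candidates):
                edges.discard((i, j))
    return edges


# Covariance of the chain 0 -> 1 -> 2 with weights 2 and 3 (unit noise variances).
sigma = [[1, 2, 6], [2, 5, 15], [6, 15, 46]]
skeleton = pc_skeleton(sigma, 3, 0.01)
```

On this chain no edge is removed with empty conditioning sets, and the edge between 0 and 2 is removed once vertex 1 is conditioned on, leaving the skeleton of the chain.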

Our theorem suggests that it might be advantageous to perform this check first for sets contained in the intersection of the neighbourhoods of and in . Then, if all the edges between present in are also present in the DAG (with some orientation), one readily checks that the conditions of the theorem are satisfied. Hence the volume of is proportional to , and the region in the parameter space of where we would erroneously delete in this step is small.

There are two obvious issues with this. First, in general it will not suffice to check in the intersection of the neighbourhoods of and . And second, the condition that all of those edges are indeed present in is rather strong. To make better use of our theorem, one might want to develop a version of the PC algorithm where orientation steps are intertwined with the edge-deletion steps.

We conclude this paper with two examples.

Figure 4. The graphs in Example 4.

To see that singular partial correlation hypersurfaces cannot be avoided in the edge removal step of the PC algorithm, consider the graph in Figure 4, taken from [LUSB14, Example 4.8]. In the beginning, the PC algorithm finds all nonconditional independencies (so with ), and hence removes the edge to arrive at the graph on the right. If the algorithm next chooses to consider the edge , then it will delete this edge after finding that are independent given . However, by symmetry of it is equally likely that it will first consider the edge .

In [LUSB14] it is shown that the partial correlation with and has a singular hypersurface and that the corresponding set of bad parameter values is fatter.

In addition to the study of correlation hypersurfaces, [LUSB14] discusses mathematical interpretations of existing heuristics in statistics. In particular, [LUSB14, Problem 6.2] discusses a volume inequality that would confirm the belief that “collider-stratification bias tends to attenuate when it arises from more extended paths”.

Figure 5. The graph from Example 4.

In particular, in the situation of Figure 5, their conjecture says that

The paper does not explicitly say with respect to which measure is defined here. If it is supposed to be true for all measures, then the above is equivalent to This is certainly not true in general: taking

yields So formulating this statistical belief as a precise mathematical conjecture remains a challenge.

References

  • [DSS09] Mathias Drton, Bernd Sturmfels, and Seth Sullivant. Lectures on Algebraic Statistics. Oberwolfach Seminars 39. Birkhäuser, Basel, 2009.
  • [DST13] Jan Draisma, Seth Sullivant, and Kelli Talaska. Positivity for Gaussian graphical models. Adv. Appl. Math., 50(5):661–674, 2013.
  • [GV85] Ira Gessel and Gérard Viennot. Binomial determinants, paths, and hook length formulae. Adv. Math., 58:300–321, 1985.
  • [LUSB14] Shaowei Lin, Caroline Uhler, Bernd Sturmfels, and Peter Bühlmann. Hypersurfaces and their singularities in partial correlation testing. Found. Comp. Math., 14:1079–1116, 2014.
  • [SGS01] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, prediction, and search. With additional material by David Heckerman, Christopher Meek, Gregory F. Cooper and Thomas Richardson. 2nd ed. MIT Press, Cambridge, MA, 2001.
  • [STD10] Seth Sullivant, Kelli Talaska, and Jan Draisma. Trek separation for Gaussian graphical models. Ann. Stat., 38(3):1665–1685, 2010.
  • [Wri34] S. Wright. The method of path coefficients. Ann. Math. Stat., 5:161–215, 1934.