Latent tree models are associated to a tree-structured graph in which some nodes represent observed variables and others represent unobserved (latent) variables. Due to their tractability, these models have found many applications in fields ranging from the traditional life sciences, biology and psychology to contemporary areas such as artificial intelligence and computer vision; refer toMourad et al. (2013) for a comprehensive review. In this paper, we study the problem of testing the goodness-of-fit of a postulated Gaussian latent tree model to an observed dataset. In a low dimensional setting where the number of observed variables is small relative to the sample size at hand, testing is usually based on the likelihood ratio which measures the divergence in maximum likelihood between the postulated latent tree model and an unconstrained Gaussian model. This, however, requires maximization of the possibly multimodal likelihood function of latent tree models. In contrast, recent work of Shiers et al. (2016) takes a different approach and leverages known polynomial constraints on the covariance matrix of the observed variables in a given Gaussian latent tree. Specifically, the postulated latent tree is tested with an aggregate statistic formed from estimates of the polynomial quantities involved. This approach can be traced back to Spearman (1904) and Wishart (1928); also see Drton et al. (2007, 2008).
We make the following new contributions. In Section 2, we extend the polynomial characterization of Shiers et al. (2016) to cases where observed nodes may also be inner nodes of the tree as considered, for example, in the tree learning algorithms of Choi et al. (2011). Section 3 describes how we may use polynomial equality constraints to test a star tree model. We base ourselves on the recent groundbreaking work of Chernozhukov et al. (2013a)
, form our test statistic as the maximum of unbiased estimates of the relevant polynomials, and calibrate the critical value for testing based on multiplier bootstrapping techniques. This new way of using the polynomials to furnish a test allows us to handle latent trees with a larger number of observed variables and avoids potential singularity issues caused by individual polynomials. Numerical experiments in Section4 makes comparisons to the likelihood ratio test and assesses the size of our tests in finite samples. Section 5 discusses future research directions.
Let be two positive integers. We let and write for the collection of subsets of with cardinality . The supremum norm of a vector is written
. For two random variablesand , the symbols indicates that and have the same distributions, and indicates that the distributions are approximately equal.
means a normal distribution with mean.
2 Characterization of general Gaussian latent trees
We first provide the definition of the models considered in this paper. A tree is an undirected graph in which any two nodes are connected by precisely one path. Let be a tree, where is the set of nodes, and is the set of edges which we take to be unordered duples of nodes in . We say that is a latent tree if it is paired with a set , corresponding to observed variables, such that whenever is of degree less than or equal to two. In particular, contains all leaf nodes of the tree (i.e., nodes of degree 1), but it may contain additional nodes. The nodes in correspond to latent variables that are not observed but each have at least three other neighbors in the tree. This minimal degree requirements of on the latent nodes ensures identifiability (Choi et al., 2011, p.1778). In the terminology of mathematical phylogenetics, is a semi-labeled tree on with an injective labeling map; see Semple and Steel (2003, p.16). However, phylogenetic trees are latent trees restricted to have equal to the set of leaves. While we have defined as a set of nodes, it will be convenient to abuse notation slightly and let also denote a random vector whose coordinates correspond to the nodes in question. The context will clarify whether we refer to nodes or random variables.
Now we present the polynomial characterization of a Gaussian latent tree graphical model that extends the results in Shiers et al. (2016). The Gaussian graphical model on , denoted , is the set of all
-variate Gaussian distributions respecting the pairwise Markov property of, i.e., for any pair with , the random variables associated to and are conditionally independent given the variables corresponding to . The -Gaussian latent tree model on , denoted , is the set of all -variate Gaussian distributions that are the marginal distribution for under some distribution in . For a given distribution in , let be the Pearson correlation of the pair for any . The pairwise Markov property implies that
where denotes the set of edges on the unique path that connects and in , and is the Pearson correlation between a pair of nodes and in . Of course, if and . In the sequel, we often abbreviate as for simplicity.
Suppose is the covariance matrix of . Our task is to test whether comes from against a saturated Gaussian graphical model. We assume that all edges in the tree correspond to a nonzero correlation, so that contains no zero entries. The covariance matrices for are parametrized via (2.1). As shown in Shiers et al. (2016), this set of covariance matrices may be characterized by leveraging results on pseudo-metrics defined on . Suppose is a function that assigns non-negative weights to the edges in . One can then define a pseudo-metric by
This is known as a -induced pseudo-metric on . The following lemma characterizes all the pseudo-metrics on that are -induced. The proof is a bit delicate and is given in our supplementary material.
Suppose is a pseudo-metric defined on . Let for any for simplicity. Then is a -induced pseudo-metric if and only if for any four distinct such that ,
and for any three distinct ,
Lemma 2.1 modifies Corollary in Shiers et al. (2016) by requiring the extra equality constraints in (2.3) concerning three distinct variable indices. For any subset , let be the restriction of to , that is, the minimal subtree of induced by the elements in with all the nodes of degree two not in suppressed (Semple and Steel, 2003, p.110); refer to Section 6.2 in our supplementary material for the related graphical notions. Shiers et al. (2016) only consider phylogenetic trees in which the observed variables always correspond to the set of nodes in with degree one. In this case the constraint in (2.3) is vacuous. Indeed, if are any three observed nodes in , then must have the configuration on the left panel of Figure 2.1, and it can be seen that for any permutation of . However, for a general latent tree whose observed nodes are not confined to be the leaves, condition (2.3) is necessary for a pseudo-metric to be -induced: may take the configuration on the right panel of Figure 2.1, where for some permutation of , , and it must hold that
if is -induced.
While condition (2.2) appears in the result of Shiers et al. (2016), it may lead to different patterns of constraints for a general latent tree. For four distinct indices , there are three possible partitions into two subsets of equal sizes, namely, , and . These three partitions correspond to the path pairs
respectively. Now refer to Figure 2.2 which shows all possible configurations of the restriction of to the four observed variables . In Figure 2.2(a)-(c), up to permutations of the indices , only one of three pairs in (2.4) can give an empty set when the intersection of its two component paths is taken. In light of (2.2), this implies that, for some permutation of the indices ,
On the contrary, in Figure 2.2(d) and (e), it must be the case that each of the three path pairs in (2.4) gives an empty set when an intersection is taken between its two component paths, giving the equalities in consideration of (2.2).
Lemma 2.1 readily implies a characterization of the latent tree model via polynomial constraints in the entries of the covariance matrix as spelt out in the ensuing corollary. Its proof employs similar arguments in Shiers et al. (2016) and is deferred to our supplementary material. In what follows, we let be the set of all quadruples such that only one of the three path pairs in (2.4) gives an empty set when the union of its two component paths is taken. In other words, contains all such that is one of the configurations in Figure 2.2(a)-(c). Given , we write to indicate that belongs to in a way that it is the path pairs and that have empty intersection. Similarly, we will let be the set of all triples such that has the configuration in Figure 2.1. We will use the notation to indicate that is the “middle point" such that .
Suppose is the covariance matrix of and has no zero entries. The following are together necessary and sufficient for the distribution of to belong to :
For any ,
For any ,
For any ,
For any , .
For any ,
For any ,
3 Testing a star tree model
In this section we illustrate how one can test a postulated Gaussian latent tree model using Corollary 2.2. In order to focus the discussion we treat the simple but important special case of a star tree, which corresponds to a single factor model. A single factor model with observed variables can be described by the linear system of equations
where is the mean of , is a latent variable, is the loading coefficient for variable , and is the idiosyncratic error for variable . All of , are independent. The model postulates that are conditionally independent given . It thus corresponds to the graphical model associated with a star tree with ,
Let be i.i.d. draws from the distribution of , which is assumed to be Gaussian. Our goal is to test whether the distribution of belongs to the single factor model . Without loss of generality, we may assume that for all (Anderson, 2003, Theorem 3.3.2). We proceed by testing whether all the constraints in Corollary 2.2 are simultaneously satisfied with respect to the latent tree . For simplicity, we will focus on testing the equality constraints in Corollary 2.2, and briefly discuss how one can incorporate the inequality constraints in Corollary 2.2 in Section 5. For , both sets and are empty, so that Corollary 2.2 and are automatically satisfied. Hence, we are only left with Corollary 2.2: For any ,
The two polynomials above, equal to and respectively, are known as tetrads in the literature of factor analysis. It is well-known that they define equality constraints for a single factor model (Bekker and de Leeuw, 1987; Bollen and Ting, 1993; Drton et al., 2007).
3.1 Estimating tetrads
The idea now is to estimate each one of the tetrads in (3.2), and aggregate the estimates in a test statistic. From the sample covariance matrix , a straightforward sample tetrad estimate, say , can be computed. If one define the vectors and , as well as the function , by the delta method it is expected that , where is the limiting covariance matrix of and is the gradient of evaluated at . However, the distribution of this sample tetrad becomes asymptotically degenerate at singularities, that is, when the gradient vanishes, which happens if the underlying true covariances are zero (Drton and Xiao, 2016)
. Consequently, a standardized sample tetrad cannot be well approximated by a normal distribution if the underlying correlations are weak. More generally, even for stronger correlations, we found it difficult to reliably estimate the variance of all sample tetrads in larger-scale models.
We propose alternative estimators for which sampling variability can be estimated more easily. Due to the independence of samples, the tetrad can be estimated unbiasedly with the differences
where the subscripts in is indicative of the row and column indices for the submatrix . These differences can then be averaged for an estimate of the tetrad. Similarly, one can form to estimate in (3.2). If we arrange all the tetrads from into a -vector , and correspondingly arrange the estimates into a -vector for each
, then the central limit theorem for1-dependent sums ensures that for sufficiently large sample size we have the distributional approximation
where and . The latter limiting covariance matrix will not degenerate to a singular matrix even if the underlying covariance matrix for has zeros at which some of the tetrads are singular (i.e. have zero gradient).
3.2 Bootstrap test
The fact from (3.4) could serve as the starting point for a test of model . However, the normal approximation quickly becomes of concern when moving beyond a small number of variables . Indeed, the dimension of , , may well be close to the sample size , or even larger. For instance, if , for a model with merely observed variables the dimension of is already , more than half the sample size. A recent work of Zhang and Wu (2017), which follows up on the groundbreaking paper of Chernozhukov et al. (2013a) on Gaussian approximation for maxima of high dimensional independent sums, suggests that while the approximation in (3.4) may be dubious, by taking a supremum norm on both sides, the Gaussian approximation
where , can be valid even the dimension of is large compared to . In fact, the original work of Chernozhukov et al. (2013a) suggested that asymptotically, the dimension can be sub-exponential in the sample size for the Gaussian approximation to hold. In what follows, we will discuss implementation of and experiments with a vanishing tetrad test based on (3.5). While it is possible to adapt the supporting theory for the present application, the technical details are involved and beyond the scope of this conference paper.
Since from (3.4) and (3.5) is an estimator of the vector of tetrads , it is natural to use as the test statistic and reject the model for large values of . The Gaussian approximation (3.5) suggests that when is true, i.e. , is distributed as . Nevertheless, to calibrate critical values based on the distribution of , one must estimate the unknown covariance matrix . Zhang and Wu (2017) suggested the batched mean estimator
where for a batch size and one considers the non-overlapping sets of samples , . The “batching" aims to capture the dependence among the ’s, and has been widely studied in the time series literature (Bühlmann, 2002; Lahiri, 2003). If model is true, then (3.5) yields that
where the right-hand side is to interpreted conditionally on , with and comprising only the diagonal of . More precisely, for a fixed test level , if we define to be the conditional
-quantile of the distribution ofgiven , then
according to Zhang and Wu (2017, Corollary 5.4). We will use as our test statistic for the model , and calibrate the critical value based on (3.7) by simulating the conditional quantile from for fixed .
While our above presentation invoked the estimate , which is a matrix with entries, we may in fact bypass the problem of computing such a large covariance matrix for the tetrad estimates. To simulate the conditional quantile in (3.7), let be i.i.d. standard normal random variables, and consider the expression
which has exactly the same distribution as conditioning on the data . We emphasize the diagonal entries of are easily computed as variances in (3.6). In conclusion, we perform the following multiplier bootstrap procedure: (i) Generate many, say , sets of , (ii) evaluate (3.8) for each of these sets, and (iii) take to be the quantile from the resulting numbers. Despite the bootstrap being a computationally intensive process, it is not hard to see that the evaluation of (3.8) for all sets of multipliers will involve operations, which even for moderate is far less than the operations needed to obtain an entire covariance matrix for all tetrads.
It is instructive to make a comparison with the testing methodology in Shiers et al. (2016), where the focus was on lower-dimensional applications. Suppose is the function that maps the covariance matrix into the vector of tetrads in (3.2). To test the vanishing of the tetrads, Shiers et al. (2016) form plug-in estimates for with the sample covariance matrix . Letting be the covariance matrix for the -vector , they form a Hotelling’s type statistic as
where is a consistent estimate for ; see also Drton et al. (2008). For a test of model
, this statistic is now compared to a chi-square distribution withdegrees of freedom. While this calibration is justified for sufficiently large sample size by a joint normal approximation analogous to (3.5), it can be problematic for large . Even more pressing can be the computational disadvantage that one explicitly uses the entire matrix with its entries.
4 Numerical experiments
We now report on some experiments with the bootstrap test based on the sup-norm of the estimated tetrads proposed in Section 3. In the implementation we always use sets of normal multipliers to simulate the quantile and work with batch size in (3.8). We also benchmark our methodology against the likelihood ratio test for factor models implemented by the function factanal in the base library of R, which implements a likelihood ratio (LR) test with Bartlett correction for more accurate asymptotic approximation. The critical value of the LR test is calibrated with the chi-square distribution with degrees of freedom (Drton et al., 2009, p.99).
4.1 Low dimensional setup
We first consider two experimental setups, each with data generated from the one-factor model in (3.1) for both and . The model parameters are as follows: (i) Setup 1: all loadings and error variances are taken to be . (ii) Setup 2: and are taken to be , while the other loadings are independently generated based on a normal distribution with mean and variance . The error variances all equal .
For different nominal test levels in the range that are apart, we compare the empirical sizes of our test based on the statistic and the likelihood ratio (LR) test implemented by the function factanal, using repetitions of experiments. The results are shown in Figure 4.1. The left two panels correspond to Setup 1 and the right two panels correspond to Setup 2, while the upper panels correspond to and lower correspond to . While we show the entire range for the x-axis, practical interest is typically in the initial part where the nominal error rate is in say .
In Setup , for both sample sizes, the empirical test sizes of the LR test align almost perfectly with the line as one would expect from classical theory. The sizes of our test based on also align better with line as sample sizes grow. Note that for nominal test levels that are of practical interest, also gives conservative test sizes for both sample sizes.
In Setup , where parameters are close to being “singular", one can see the true advantage of using over the LR test. The empirical test sizes of the LR test with factanal do not align well with the line as one normally expect from classical theory, whereas the test sizes of our statistic lean closer to the line as increases. Particularly the performance of the LR test is problematic since, by rejecting the true model (3.1) all too often, it fails to give even an approximate control on type error. Note that the values of and are such that, for the most part, the observed variables are rather weakly dependent on each other. If the observations were in fact independent then the likelihood ratio test statistic does not exhibit a chi-square limiting distribution (Drton, 2009, Theorem 6.1). This highlights the fact that, in addition to avoiding any non-convex optimization of the likelihood function of the factor model, our approach based on the simple estimates from (3.3) is not subject to non-standard limiting behaviors that plague the LR test when the parameter values lean close to singularities of the parameter space (Drton, 2009).
4.2 Higher dimensional setup
Our last experiment aims to compare the test sizes of the two tests when the number of observed variables is relatively large compared to . Data are exactly as in Setup , except that . For such a model with large , the number of tetrads involved in our testing methodology is so large that even after taking the supremum norm one shouldn’t expect (3.5) to hold; for example, when , the dimension of is , and one should be skeptical about the validity of (3.5) when we only have the sample size . To implement our test, we first randomly select of the tetrads, and proceed with the bootstrapping procedure in (3.8) with being estimates for this selected subset of tetrads alone. The choice of tetrads to be tested is based on the fact that, in the previous experiments with , our test gives reasonable empirical test sizes for a practical range of nominal levels when the total number of tetrads being tested, , is approximately . Since the subset of tetrads is randomly selected, our test is still expected to approximately control the test sizes at nominal level. The results are reported in Figure 4.2 .
As seen, the test based on
has the main features seen in the first experiment. In particular, it successfully controls type I error rates for the practical range of. In contrast, with increased to , the LR test drastically fails to control type I error rate. This is despite the fact that the setup is regular with parameter values that are far from any model singularity. The reason for the failure of the LR test is the fact that the dimension is on the same order as the sample size of . The sample size is not large enough for chi-square asymptotics based on fixed dimension to “kick in”.
In this paper we have established a full set of polynomial constraints on the covariance matrix of the observed variables, in the form of both equalities and inequalities, that characterizes a general Gaussian latent tree model whose observed nodes are not confined to be the leaves. Focusing on the special case of a star tree model, we also experimented with a new methodology for testing the equality constraints by forming unbiased estimates of the polynomials involved. In simulation studies, when the number of variables involved is large or the underlying parameters are close to being “singular", our test compares favorably with the likelihood ratio test in terms of test size.
Our results have paved the way for developing a full-fledged algebraic test for a Gaussian latent tree model. Although we have not pursued this generality in the present conference paper, we give a brief discussion here. Of course, to do so one would first need to write an efficient graph algorithm to tease out all the polynomials entailed by Corollary 2.2 for a given latent tree input. Then the current testing methodology can be adopted by forming unbiased estimates of all these polynomials at hand, which also brings to our attention that in Section 3 only the equality constraints in Corollary 2.2 were used to test the single factor model. For illustration, take the 3-degree monomial in Corollary 2.2 (i)(a) as an example. Like (3.3), one may form a summand which is unbiased for , and then use as an averaged estimator. To incorporate the constraints in Corollary 2.2 (i) into our test one can first arrange all those inequalities into “less than" conditions, i.e., Corollary 2.2 (i)(a) becomes and the corresponding estimate becomes . Following that, in the definition of the test statistic , one can take a maximum over all the unbiased estimates for the “less than" versions of the polynomials in Corollary 2.2, in addition to the absolute values of the estimates for the polynomials in Corollary 2.2. The resulting test statistic shall also reject the model when its value is too large. While critical values can still be calibrated with multiplier bootstrap, additional techniques such as inequality selection can be incorporated to contain the power loss as a result of testing the inequalities; see Chernozhukov et al. (2013b) for more details.
Another challenge is the determination of the batch size in (3.6). In our simulation studies of Section 4 we took since we believe that a batch size of should be enough to capture dependence among the -dependent summands. Batch size determination has been widely studied in the time series literature for low dimensional problems (Bühlmann, 2002; Hall et al., 1995; Lahiri, 2003). To the best of our knowledge, in high dimensions this is still a widely open problem. Theoretical research on this is far beyond the scope of our current work.
6 Supplementary material
In this supplement we furnish proofs for the main text of “Algebraic tests of general Gaussian latent tree models".
6.1 Proof of Corollary 2.2
We only sketch the proof here since it is exactly analogous to that of Theorem in Shiers et al. (2016). First, consider the special case where all the entries of , and hence the the Pearson correlations , , are strictly positive. In this case condition is redundant. Via the isomorphsim
between the parametrizations in (2.1) and all -induced pseudometrics, the discussion preceding our corollary readily translates (2.2) into and (2.3) into , whereas the triangular inequality property of pseudometrics is translated into for triples that are not in . The general case of with nonzero but not necessarily positive entries is then addressed by incorporating condition .
6.2 Proof of Lemma 2.1
To prove the lemma, we first collect all the required graphical notions borrowed from Semple and Steel (2003). We attempted to make this proof as self-contained as possible, but the readers are encouraged to read Semple and Steel (2003) for more background on mathematical phylogenetics.
Suppose we are given a tree . If is a subset of , denotes the minimal subtree of that contains all the nodes in . If , is the graph obtained by removing , and is the tree obtained from by identifying the ends of and then deleting . In particular, if is a node of degree two and is an edge incident with , is said to be obtained from by suppressing . If , is the set of edges on the unique path connecting and .
We will also need the notion of an -tree. An -tree, or semi-labeled tree on a set
, is an ordered pair, where is a tree with node set and is a (labeling) map with the property that, for each of degree at most two, . Note that is not necessarily injective. Moreover, if is a subset of , is the tree obtained from by suppressing all the nodes of degree two that are not in . We then define the restriction of to , denoted , to be the -tree .
Finally, we introduce the notion of -split. For a set , an -split is a partition of into two non-empty sets. We denote the -split whose blocks are and by where the order of and in the notation doesn’t matter. Now suppose is an -tree with an edge set . For each , must consist of two components and which induce an -split . We then define as the collection of all -splits induced by .
Important remark: In all the definitions above, is not specified as a subset of the node set for a given tree. Nonetheless, when we have a tree with a subset of observed nodes as in the main text, we will slightly abuse the notations by identifying with the -tree whose labeling map is simply the identity function. Moreover, if , we will also identify with the -tree that is the restriction of (as an -tree) to .
Now we begin to prove Lemma 2.1. The “only if" part of the theorem is trivial and we will only prove the “if" part of the statement.
We recall that . Let be a pseudo-metric on satisfying the two conditions (2.2) and (2.3) in display. For any four distinct points , given the tree structure of it must be true that for some permutation of . By (2.2), together with the fact that is a pseudo-metric, is in fact a tree metric (Semple and Steel (2003, Theorem 7.2.6)), i.e., there exists an -tree for a tree and a labeling map , as well as a strictly positive weighting function such that
for all . By Theorem 6.3.5 and Lemma 7.1.4 in Semple and Steel (2003), to show that can be induced from it suffices to show that for any of size at most , the two restricted -trees and are such . Note that this is trivial for and . For , we first note that
These characterization for the elements in can be easily checked; also see Semple and Steel (2003, p.148) where these characterizations are stated. To finish the proof it remains to show that, when , any -split as in (6.1) or any -split as in (6.2) must also be an element of .
First, towards a contradiction, suppose there exists an -split that is an element of but not an element of . Since is not an element of , by considering as a tree it must be the case that the node has degree at least two. Then, by condition (2.3), there must exist two distinct and in the set such that . But this reaches a contradiction since by (6.1) as .
Similarly, suppose is an element of but not an element of . Since , the two strict inequalities in (6.2) must be true. On the other hand, if has any of the configurations in Figure 2.2, since it must be true that , in which case it must lead to either or , contradicting one of the inequalities in by condition (2.2). If has the configuration in Figure 2.2 or , then it must be the case that both and