Marginal likelihood and model selection for Gaussian latent tree and forest models

12/29/2014 · by Mathias Drton et al. · Universitat Pompeu Fabra, Agency for Science, Technology and Research, University of Washington

Gaussian latent tree models, or more generally, Gaussian latent forest models have Fisher-information matrices that become singular along interesting submodels, namely, models that correspond to subforests. For these singularities, we compute the real log-canonical thresholds (also known as stochastic complexities or learning coefficients) that quantify the large-sample behavior of the marginal likelihood in Bayesian inference. This provides the information needed for a recently introduced generalization of the Bayesian information criterion. Our mathematical developments treat the general setting of Laplace integrals whose phase functions are sums of squared differences between monomials and constants. We clarify how in this case real log-canonical thresholds can be computed using polyhedral geometry, and we show how to apply the general theory to the Laplace integrals associated with Gaussian latent tree and forest models. In simulations and a data example, we demonstrate how the mathematical knowledge can be applied in model selection.


1 Introduction

Graphical models based on trees are particularly tractable, which makes them useful tools for exploring and exploiting multivariate stochastic dependencies, as first demonstrated by [CL68]. More recent work develops statistical methodology for extensions that allow for the inclusion of latent variables and in which the graph may be a forest, that is, a union of trees over disjoint vertex sets [CTAW11, TAW11, MRS13]. These extensions lead to a new difficulty in that the Fisher-information matrix of a latent tree model is typically singular along submodels given by subforests. As explained in [Wat09], such singularity invalidates the mathematical arguments that lead to the Bayesian information criterion (BIC) of [Sch78], which is widely used to guide model selection algorithms that infer trees or forests [EdAL10]. Indeed, the BIC will generally no longer share the asymptotic behavior of Bayesian methods; see also [DSS09, Sect. 5.1]. Similarly, Akaike's information criterion may no longer be an asymptotically unbiased estimator of the expected Kullback-Leibler divergence that it is designed to approximate [Wat09, Wat10a, Wat10b].

In this paper, we study the large-sample behavior of the marginal likelihood in Bayesian inference for Gaussian tree/forest models with latent variables, with the goal of obtaining the mathematical information needed to evaluate a generalization of BIC proposed in [DP13]. As we review below, this information comes in the form of so-called real log-canonical thresholds (also known as stochastic complexities or learning coefficients) that appear in the leading term of an asymptotic expansion of the marginal likelihood. We begin by more formally introducing the models that are the object of study.

Let X = (X_v : v ∈ V) be a random vector whose components are indexed by the vertices of an undirected tree T = (V, E) with edge set E. Via the paradigm of graphical modeling [Lau96], the tree induces a Gaussian tree model M(T) for the joint distribution of X. The model M(T) is the collection of all multivariate normal distributions on R^V under which X_u and X_v are conditionally independent given X_S for any choice of two nodes u and v and a set S ⊆ V ∖ {u, v} such that S contains a node on the (unique) path between u and v. For two nodes u, v ∈ V, let E(u, v) be the set of edges on the path between u and v. It can be shown that a normal distribution with correlation matrix R = (ρ_uv) belongs to M(T) if and only if

(1.1)   ρ_uv = ∏_{e ∈ E(u,v)} ρ_e,   u, v ∈ V,

where ρ_e = ρ_uv when e = {u, v} is the edge incident to u and v. Indeed, for three nodes u, v, w, the conditional independence of X_u and X_v given X_w is equivalent to ρ_uv = ρ_uw ρ_wv; compare also [MRS13, p. 4359].
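The product formula in (1.1) is straightforward to evaluate computationally. The following sketch (our own code, with hypothetical node labels; the star tree with inner node h and leaves 1, 2, 3 anticipates Example 1.1) computes leaf correlations as products of edge correlations along tree paths:

```python
from itertools import combinations

def path_edges(adj, u, v):
    """Edges on the unique path from u to v in a tree (iterative DFS)."""
    stack = [(u, None, [])]
    while stack:
        node, parent, edges = stack.pop()
        if node == v:
            return edges
        for nbr in adj[node]:
            if nbr != parent:
                stack.append((nbr, node, edges + [frozenset((node, nbr))]))
    raise ValueError("nodes are not connected")

def leaf_correlation(adj, rho, u, v):
    """Correlation of u and v as the product of edge correlations, cf. (1.1)."""
    prod = 1.0
    for e in path_edges(adj, u, v):
        prod *= rho[e]
    return prod

# Star tree with inner node 'h' joined to leaves 1, 2, 3 (assumed labels).
adj = {'h': [1, 2, 3], 1: ['h'], 2: ['h'], 3: ['h']}
rho = {frozenset(('h', 1)): 0.8,   # assumed edge correlations, one per edge
       frozenset(('h', 2)): 0.5,
       frozenset(('h', 3)): -0.4}

corr = {(u, v): leaf_correlation(adj, rho, u, v)
        for u, v in combinations([1, 2, 3], 2)}
```

The dictionary `corr` then contains, e.g., the correlation 0.8 · 0.5 = 0.4 between leaves 1 and 2.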

In this paper, we are concerned with latent tree models in which only the tree's leaves correspond to observed random variables. So let W ⊆ V be the set of leaves of the tree T. Then the Gaussian latent tree model M(T, W) for the distribution of the subvector X_W = (X_v : v ∈ W) is the set of all W-marginals of the distributions in M(T). The object of study in our work is the parametrization of the model M(T, W). Without loss of generality, we may assume that the latent variables at the inner nodes in V ∖ W have mean zero and variance one. Moreover, we assume that the observed vector X_W has mean zero. Then, based on (1.1), the distributions in M(T, W) can be parametrized by the variances ω_v for each variable X_v, v ∈ W, and the edge correlations ρ_e, e ∈ E.

Our interest is in the marginal likelihood of the model M(T, W) when the variance and correlation parameters are given a prior distribution with smooth and everywhere positive density. Following the theory developed by [Wat09], we will derive large-sample properties of the marginal likelihood by studying the geometry of the fibers (or preimages) of the parametrization map.

Example 1.1.

Suppose T is a star tree with one inner node h that is connected to each one of three leaves, labelled 1, 2, and 3. A positive definite matrix R = (ρ_uv) is the correlation matrix of a distribution in the model M(T, W) if

(1.2)   ρ_uv = ρ_uh ρ_hv,   1 ≤ u < v ≤ 3,

for a choice of the three correlation parameters ρ_1h, ρ_2h, ρ_3h that are associated with the three edges of the tree.

Now suppose that R is indeed the correlation matrix of a distribution in M(T, W) and that ρ_uv ≠ 0 for all u, v. Then, modulo a sign change that corresponds to negating the latent variable at the inner node h, the parameters can be identified uniquely using the identities

ρ_1h² = ρ_12 ρ_13 / ρ_23,   ρ_2h² = ρ_12 ρ_23 / ρ_13,   ρ_3h² = ρ_13 ρ_23 / ρ_12.

Hence, the fiber of the parametrization is finite, containing two points.

If instead the correlations between the leaves are zero then this identifiability breaks down. If R = I is the identity matrix with ρ_uv = 0 for u ≠ v, then every vector (ρ_1h, ρ_2h, ρ_3h) that lies in the set

{ρ_1h = ρ_2h = 0} ∪ {ρ_1h = ρ_3h = 0} ∪ {ρ_2h = ρ_3h = 0}

satisfies (1.2). The fiber of the identity matrix is thus the union of three line segments that form a one-dimensional semi-algebraic set with a singularity at the origin where the lines intersect.
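The two-point fiber in the identifiable case can be checked numerically. The sketch below is our own; it assumes the recovery identities of Example 1.1 (ρ_1h² = ρ12 ρ13 / ρ23 and its cyclic analogues) and fixes ρ_1h > 0 to pick one of the two fiber points:

```python
from math import sqrt, copysign

# Assumed edge correlations for the star tree with edges 1-h, 2-h, 3-h.
a, b, c = 0.8, 0.5, -0.4                # rho_1h, rho_2h, rho_3h
r12, r13, r23 = a * b, a * c, b * c     # induced leaf correlations, cf. (1.2)

# With all leaf correlations nonzero, the squared edge parameters are
# recovered by the identities from Example 1.1; fixing rho_1h > 0 selects
# one of the two points of the fiber (the other is the global sign flip
# that negates the latent variable at h).
a_hat = sqrt(r12 * r13 / r23)
b_hat = copysign(sqrt(r12 * r23 / r13), r12)
c_hat = copysign(sqrt(r13 * r23 / r12), r13)
```

Here the recovered values reproduce the edge correlations we started from, and their pairwise products reproduce the leaf correlations.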

Remark 1.2.

Some readers may be more familiar with rooted trees with directed edges and model specifications based on the Markov properties for directed graphs or structural equations. However, these are equivalent to the setup considered here, as can be seen by applying the so-called trek rule [SGS00]. Our later results also apply to Bayesian inference in graphical models associated with directed trees.

Suppose φ is a smooth and everywhere positive density that defines a prior distribution on the parameter space of the Gaussian latent tree model M(T, W). Let X = (X_1, …, X_n) be a sample consisting of independent and identically distributed random vectors in R^W, and write L(X) for the marginal likelihood of X under M(T, W). If X is generated from a distribution P* ∈ M(T, W) and n → ∞, then it holds that

(1.3)   log L(X) = ∑_{i=1}^n log p*(X_i) − (λ/2) log n + (m − 1) log log n + O_p(1),

where p* is the density of P* and λ is a rational number smaller than or equal to the dimension of the model M(T, W). The number m is an integer greater than or equal to 1. More detail on how (1.3) follows from results in [Wat09] is given in Section 2. In this paper, we derive formulas for the pair (λ, m) from (1.3), which will be seen to depend on the pattern of zeros in the correlation matrix of the distribution P*.

Let ω* = (ω*_v) and ρ* = (ρ*_uv) be the variances and the correlations of the data-generating distribution P*. The point of departure for our work is Proposition 2.3, which clarifies that the pair (λ, m) is also determined by the behavior of the deterministic Laplace integral

(1.4)   Z(n) = ∫_Θ exp{−n f(ω, ρ)} dω dρ,

where the phase function in the exponent is

f(ω, ρ) = ∑_{v ∈ W} (ω_v − ω*_v)² + ∑_{u<v} ( ∏_{e ∈ E(u,v)} ρ_e − ρ*_uv )².

In the formulation of our results, we adopt the notation

rlct_Θ(f) = (λ, m),

as λ is sometimes referred to as real log-canonical threshold and m is the threshold's multiplicity. Our formulas for rlct_Θ(f) are stated in Theorem 4.3. The proof of the theorem relies on facts presented in Section 3, which concern models with monomial parametrizations in general. As our formulas show, the marginal likelihood admits non-standard large-sample asymptotics, with λ differing from the model dimension if the data-generating correlation matrix Σ* exhibits zero correlations (recall Example 1.1). We describe the zero patterns of Σ* in terms of a subforest F of T with edge set E_F ⊆ E.

Our result for trees generalizes directly to models based on forests. If F is a forest with the set W comprising the leaves of the subtrees, then we may define a Gaussian latent forest model M(F, W) in the same way as for trees. Again we assign a variance parameter ω_v to each node v ∈ W and a correlation parameter ρ_e to each edge e of F. Forming products of correlations along paths, exactly as in (1.1), we obtain again a parametrization of the correlation matrix of a multivariate normal distribution on R^W. In contrast to the case of a tree, there may be pairs of nodes with necessarily zero correlation, namely, when two leaves u and v are in distinct connected components of F. Theorem 4.7 extends Theorem 4.3 to the case of forests. The non-standard cases arise when the data-generating distribution lies in the submodel defined by a proper subforest of the given forest F.

The remainder of the paper begins with a review of the connection between the asymptotics of the marginal likelihood and that of the Laplace integral in (1.4); see Section 2, which introduces the notion of a real log-canonical threshold (RLCT). Gaussian latent tree/forest models have a monomial parametrization, and we clarify in Section 3 how the monomial structure allows for calculation of RLCTs via techniques from polyhedral geometry. In Section 4, these techniques are applied to derive the aforementioned Theorems 4.3 and 4.7. In Section 5, we demonstrate how our results can be used in model selection with Bayesian information criteria (BIC). In a simulation study and an example of temperature data, we compare a criterion based on RLCTs to the standard BIC, which is based on model dimension alone.

2 Background

Consider an arbitrary parametric statistical model {P_θ : θ ∈ Θ}, with parameter space Θ ⊆ R^d. Let each distribution P_θ have density p_θ and, for Bayesian inference, consider a prior distribution with density φ on Θ. Writing X = (X_1, …, X_n) for a sample of size n from P_θ, the log-likelihood function of the model is

ℓ_n(θ) = ∑_{i=1}^n log p_θ(X_i).

The key quantity for Bayesian model determination is the integrated or marginal likelihood

(2.1)   L(X) = ∫_Θ exp{ℓ_n(θ)} φ(θ) dθ.

As in the derivation of the Bayesian information criterion in [Sch78], our interest is in the large-sample behavior of the marginal likelihood.

Let the sample be drawn from a true distribution with density p_0 that can be realized by the model, that is, p_0 = p_{θ_0} for some θ_0 ∈ Θ. Then, as we will make more precise below, the asymptotic properties of the marginal likelihood are tied to those of the Laplace integral

(2.2)   Z(n) = ∫_Θ exp{−n K(θ)} φ(θ) dθ,

where

(2.3)   K(θ) = ∫ p_0(x) log [ p_0(x) / p_θ(x) ] dx

is the Kullback-Leibler divergence between the data-generating distribution and distributions in the model. Note that K(θ) ≥ 0 for all θ ∈ Θ, and K(θ) = 0 precisely when θ satisfies p_θ = p_0. For large n the integrand in (2.2) is equal to φ(θ) if K(θ) = 0 and is negligibly small otherwise. Therefore, the main contribution to the integral comes from a neighborhood of the zero set

V(K) = {θ ∈ Θ : K(θ) = 0},

which we also call the K-fiber.

Suppose now that Θ is a semianalytic set and that K is an analytic function with compact K-fiber V(K). Suppose further that the prior density φ is a smooth and positive function. Then, under additional integrability conditions, the Main Theorem 6.2 in [Wat09] shows that the marginal likelihood has the following asymptotic behavior as the sample size n tends to infinity:

(2.4)   log L(X) = ℓ_n(θ_0) − (λ/2) log n + (m − 1) log log n + O_p(1).

In (2.4), λ is a rational number in (0, d], and m is an integer in {1, …, d}. The number λ is known as learning coefficient, stochastic complexity or also real log-canonical threshold, and m is the associated multiplicity. As explained in [Wat09, Chap. 4], the pair (λ, m) also satisfies

(2.5)   log Z(n) = −(λ/2) log n + (m − 1) log log n + O(1),   n → ∞.

Moreover, the pair can equivalently be defined using the concept of a zeta function as illustrated below; compare also [Lin11].

Definition 2.1 (The real log-canonical threshold).

Let f : Ω → [0, ∞) be a nonnegative analytic function whose zero set V(f) = {x ∈ Ω : f(x) = 0} is compact and nonempty. The zeta function

(2.6)   ζ(z) = ∫_Ω f(x)^{−z/2} dx

can be analytically continued to a meromorphic function on the complex plane. The poles of this continuation are real and positive. Let λ be the smallest pole, known as the real log-canonical threshold (rlct) of f, and let m be its multiplicity. Since we are interested in both the rlct and its multiplicity, we use the notation rlct_Ω(f) = (λ, m). When the domain Ω is clear from the context, we simply write rlct(f). Finally, if g is another nonnegative analytic function with rlct(g) = (λ_g, m_g), then we write rlct(g) < rlct(f) if λ_g < λ_f, or if λ_g = λ_f and m_g > m_f.

Example 2.2.

Suppose Ω = [−1, 1]² and f(x, y) = x²y². Then the f-fiber V(f) is the union of two segments of the coordinate axes. Taking φ ≡ 1, we have

Z(n) = ∫_{[−1,1]²} exp{−n x²y²} dx dy.

This example is simple enough that Z(n) can be computed by elementary means. Let Φ be the distribution function of the standard normal distribution. Then

Z(n) = 2 ∫_0^1 √(π/n) · (2Φ(√(2n) y) − 1) / y dy.

Integration by parts yields

Z(n) = √(π/n) · (log n) · (1 + o(1)).

Taking logarithms, we see that (2.5) holds with λ = 1 and m = 2. It follows that rlct_Ω(x²y²) = (1, 2). Concerning Definition 2.1, we have that

ζ(z) = ∫_{[−1,1]²} |xy|^{−z} dx dy = ( 2 / (1 − z) )²

for all real z with z < 1. In fact, this holds as long as Re(z) < 1. The meromorphic continuation of ζ given by (2/(1 − z))² has one pole at z = 1. The pole has multiplicity 2, confirming that rlct_Ω(x²y²) = (1, 2).
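The asymptotics in (2.5) can also be checked numerically. The sketch below is our own code; it assumes, as in the example, the function f(x, y) = x²y² on [−1, 1]² with φ ≡ 1, integrates out x exactly via the error function, and then estimates the power of n. After removing the log n factor corresponding to m = 2, the decay exponent should be close to λ/2 = 1/2:

```python
from math import erf, log, pi, sqrt

def inner_integral(a, steps=4000):
    """Integral of erf(t)/t over (0, a] for a >= 6: Simpson's rule on [0, 6]
    plus the exact tail log(a/6), since erf(t) = 1 to machine precision there."""
    g = lambda t: 2 / sqrt(pi) if t == 0 else erf(t) / t
    h = 6.0 / steps
    s = g(0.0) + g(6.0) + sum((4 if k % 2 else 2) * g(k * h) for k in range(1, steps))
    return s * h / 3 + log(a / 6.0)

def Z(n):
    """Laplace integral of exp(-n x^2 y^2) over [-1, 1]^2 (x integrated exactly)."""
    return 2 * sqrt(pi) * inner_integral(sqrt(n)) / sqrt(n)

# Remove the (log n)^(m-1) factor with m = 2, then read off the n-exponent.
n1, n2 = 1e6, 1e12
slope = (log(Z(n2)) - log(log(n2)) - log(Z(n1)) + log(log(n1))) / (log(n2) - log(n1))
# slope should be close to -1/2, i.e., Z(n) behaves like n^(-1/2) * log n.
```

The slight bias in the estimated slope reflects the O(1) remainder in (2.5), which decays only like 1/log n.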

In this paper we are concerned with Gaussian models, for which we may assume, without loss of generality, that all distributions are centered. So let the data-generating distribution be the multivariate normal distribution N(0, Σ*), with positive definite covariance matrix Σ*. Further, let p_Σ be the density of the distribution N(0, Σ) with positive definite covariance matrix Σ. Then

K(Σ) = ½ [ tr(Σ⁻¹Σ*) − log det(Σ⁻¹Σ*) − p ],

where p is the dimension of the observed random vector. For fixed positive definite Σ*, the function

Σ ↦ K(Σ)

has a full rank Hessian at Σ = Σ*. Hence, in a neighborhood of Σ*, we can both lower-bound and upper-bound K(Σ) by positive multiples of the function

Σ ↦ ∑_{u ≤ v} (σ_uv − σ*_uv)².

It follows that rlct(K) = rlct( ∑_{u ≤ v} (σ_uv − σ*_uv)² ); compare [Wat09, Remark 7.2]. For our study of Gaussian latent tree (and forest) models, it is convenient to change coordinates to correlations and consider the function

(2.7)   f(ω, ρ) = ∑_v (ω_v − ω*_v)² + ∑_{u<v} (ρ_uv − ρ*_uv)²,

where ρ = (ρ_uv) and ρ* = (ρ*_uv) are the correlations obtained from Σ or Σ*; so, e.g., ρ_uv = σ_uv / √(σ_uu σ_vv). Since

(2.8)   rlct( ∑_{u ≤ v} (σ_uv − σ*_uv)² ) = rlct(f),

our discussion of latent tree models may thus start from the following fact.

Proposition 2.3.

Let T = (V, E) be a tree with set of leaves W ⊆ V. Let Θ = (0, ∞)^W × [−1, 1]^E be the parameter space for the Gaussian latent tree model M(T, W), the parameters being the variances ω_v, v ∈ W, and the correlation parameters ρ_e, e ∈ E. Suppose the (data-generating) distribution P* is in M(T, W) and has variances ω*_v and a positive definite correlation matrix with entries ρ*_uv. Then the pair (λ, m) from (1.3) is equal to rlct_Θ(f), where

(2.9)   f(ω, ρ) = ∑_{v ∈ W} (ω_v − ω*_v)² + ∑_{u<v} ( ∏_{e ∈ E(u,v)} ρ_e − ρ*_uv )².

3 Monomial parametrizations

According to Proposition 2.3, the asymptotic behavior of the marginal likelihood of a Gaussian latent tree model is determined by the real log-canonical threshold of the function in (2.9). This function is a sum of squared differences between monomials formed from the parameter vector and constants determined by the data-generating distribution. In this section, we formulate general results on the real log-canonical thresholds for such monomial parametrizations, which also arise in other contexts [RG05, Zwi11].

Specifically, we treat functions of the form

(3.1)   f(x) = ∑_{i=1}^{k} ( x^{u_i} − c_i )²,

with domain Ω ⊆ R^d. Here, c_1, …, c_k are real constants and each monomial x^{u_i} = x_1^{u_i1} ⋯ x_d^{u_id} is given by a vector u_i = (u_i1, …, u_id) of nonnegative integers. Special cases of this setup are the regular case with k = d and u_i = e_i, the i-th canonical basis vector, and the quasi-regular case of [YW12], in which the vectors u_i have pairwise disjoint supports and all c_i = 0.

Let k_1 be the number of summands on the right-hand side of (3.1) that have c_i ≠ 0. Without loss of generality, assume that c_1, …, c_{k_1} ≠ 0 and c_{k_1+1} = ⋯ = c_k = 0. Furthermore, suppose that x_1, …, x_{d_1} are the parameters appearing in the monomials x^{u_1}, …, x^{u_{k_1}}, that is, supp(u_1) ∪ ⋯ ∪ supp(u_{k_1}) = {1, …, d_1}. If x ∈ V(f) then x^{u_i} = c_i ≠ 0 for all i ≤ k_1. Moreover, if the zero set V(f) is compact, then each one of the parameters x_1, …, x_{d_1} is bounded away from zero on V(f). (Clearly, the zero set of the function from Proposition 2.3 is compact.)

Now define the nonzero part of f as

(3.2)   f_1(x) = ∑_{i=1}^{k_1} ( x^{u_i} − c_i )²,

and the zero part of f as

(3.3)   f_0(x) = ∑_{i=k_1+1}^{k} ( x^{u_i} )².
Definition 3.1.

The Newton polytope N(f_0) of the zero part is the convex hull of the points ū_i, i = k_1+1, …, k, where ū_i = (u_i,d_1+1, …, u_id) collects the exponents for the variables x_{d_1+1}, …, x_d. The Newton polyhedron of f_0 is the polyhedron

P(f_0) = N(f_0) + [0, ∞)^{d − d_1}.

Let 1 = (1, …, 1) be the vector of all ones. Then the 1-distance of P(f_0) is the smallest t ≥ 0 such that t·1 ∈ P(f_0). The associated multiplicity is the codimension of the (inclusion-minimal) face of P(f_0) that contains t·1.

We say that Ω is a product of intervals if Ω = I_1 × ⋯ × I_d for intervals I_1, …, I_d ⊆ R. The following is the main result of this section. It is proved in Appendix A.

Theorem 3.2.

Suppose that Ω is a product of intervals, and let Ω_1 and Ω_0 be the projections of Ω onto the first d_1 and the last d − d_1 coordinates, respectively. Let f be the sum of squares from (3.1) and assume that the zero set V(f) is non-empty and compact. Let φ be a smooth positive function that is bounded above on Ω. Then

rlct_Ω(f) = ( c + 1/t, m ),

where c is the codimension of V(f_1) in R^{d_1}, and t is the 1-distance of the Newton polyhedron P(f_0), with associated multiplicity m. Here, 1/t = 0 and m = 1 if f has no zero part, i.e., if k_1 = k.

Remark 3.3.

In order to compute the codimension of V(f_1), one may consider one orthant at a time and take logarithms (accounting for signs). This turns the equations x^{u_i} = c_i into linear equations in log|x_1|, …, log|x_{d_1}|.

Example 3.4.

If k_1 = 0 and u_i = e_i for i = 1, …, d, then f(x) = x_1² + ⋯ + x_d², so (2.2) is a Gaussian integral and it is clear (c.f. (2.5)) that rlct_Ω(f) = (d, 1). The Newton polytope for f = f_0 is the convex hull of the canonical basis vectors e_1, …, e_d of R^d. The Newton polyhedron of f_0 has 1-distance t = 1/d with multiplicity 1. The same is true whenever

(3.4)
Example 3.5.

Earlier, we have shown that on Ω = [−1, 1]² the function f(x, y) = x²y² has rlct_Ω(f) = (1, 2); recall Example 2.2. The function has no nonzero part. Its Newton polytope consists of a single point, namely, (1, 1). The Newton polyhedron is (1, 1) + [0, ∞)². Clearly, the 1-distance of the Newton polyhedron is 1. Since the ray spanned by 1 = (1, 1) meets the Newton polyhedron in the vertex (1, 1), the multiplicity is 2, as it had to be according to our earlier calculation.

Example 3.6.

Consider the function

on . The nonzero part is and the zero part is . With , the codimension of is . The Newton polytope of is the convex hull of and . The Newton polyhedron of is . Hence, and . Note that while the point is a vertex of the Newton polytope, it lies on a one-dimensional face of the Newton polyhedron. In conclusion, .
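The 1-distance of Definition 3.1 is the optimal value of a small linear program: minimize, over convex weights on the exponent vectors, the largest coordinate of the weighted average. For low-dimensional examples such as the ones above, a brute-force grid over the weights suffices. The sketch below is ours (function names hypothetical); it recovers the 1-distance 1 of Example 3.5 and the value 1/d for the canonical basis vectors, as in the Gaussian case of Example 3.4:

```python
from itertools import combinations

def simplex_grid(k, N):
    """All weight vectors (j1/N, ..., jk/N) with nonnegative entries summing to 1."""
    for cuts in combinations(range(N + k - 1), k - 1):
        parts, prev = [], -1
        for c in cuts:
            parts.append(c - prev - 1)
            prev = c
        parts.append(N + k - 2 - prev)
        yield tuple(p / N for p in parts)

def one_distance(points, N=120):
    """Smallest t with t*(1,...,1) in conv(points) + nonnegative orthant,
    approximated by a grid search over convex combinations."""
    k, d = len(points), len(points[0])
    best = float("inf")
    for w in simplex_grid(k, N):
        t = max(sum(w[i] * points[i][j] for i in range(k)) for j in range(d))
        best = min(best, t)
    return best

# Example 3.5: the single exponent vector (1, 1) gives 1-distance 1.
t_single = one_distance([(1, 1)])
# Canonical basis vectors of R^3 (cf. Example 3.4): 1-distance 1/3.
t_basis = one_distance([(1, 0, 0), (0, 1, 0), (0, 0, 1)], N=99)
```

The grid search is exact whenever the optimal weights are rational with denominator dividing N; in general, any linear-programming routine can replace it.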

4 Gaussian latent tree and forest models

Let T = (V, E) be a tree with set of leaves W. By Proposition 2.3, our study of the marginal likelihood of the Gaussian latent tree model M(T, W) turns into the study of the function

(4.1)   f(ω, ρ) = ∑_{v ∈ W} (ω_v − ω*_v)² + ∑_{u<v} ( ∏_{e ∈ E(u,v)} ρ_e − ρ*_uv )².

Since ω*_v > 0 for all v ∈ W, the split of f into its zero and nonzero part depends solely on the zero pattern among the correlations ρ*_uv of the data-generating distribution. Furthermore, from the form of the parametrization in (1.1), it is clear that zero correlations can arise only if one sets ρ_e = 0 for one or more edges e in the edge set E. For a fixed set E_0 ⊆ E, the set of parameter vectors with ρ_e = 0 for all e ∈ E_0 parametrizes the forest model M(F, W), where F is the forest obtained from T by removing the edges in E_0. In this submodel, ρ_uv = 0 if and only if u and v lie in two different connected components of F.

It is possible that two different subforests induce the same pattern of zeros among the correlations ρ*_uv of the data-generating distribution. However, there is always a unique minimal forest inducing this zero pattern, and we term it the Σ*-forest. Put differently, the Σ*-forest is obtained from T by first removing every edge that lies only on paths E(u, v) between pairs of leaves u, v with zero correlation under Σ*, and then removing all inner nodes of T that have become isolated. Isolated leaf nodes are retained so that W ⊆ V_F. In the remainder of this section, we write F = (V_F, E_F) for the Σ*-forest and take E_0 = E ∖ E_F to be the set of edges whose removal defines F. We write u ∼_F v if u and v are two leaves in W that are joined by a path in the Σ*-forest F.

Example 4.1.

Let T be the quartet tree in Figure 1(a), with leaves 1, 2, 3, 4 and two inner nodes, here labelled a and b. Let Σ* have ρ*_12 ≠ 0 but ρ*_uv = 0 for all other pairs of leaves. The Σ*-forest is obtained by removing the edges in E_0 = {a-b, b-3, b-4}. Inner node b becomes isolated and is removed as well. The forest F thus has the five nodes in the set V_F = {1, 2, 3, 4, a}, and the two edges in the set E_F = {1-a, 2-a}; see Figure 1(b).

Figure 1: (a) A quartet tree ; (b) the -forest from Example 4.1.
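The construction of the Σ*-forest in Example 4.1 can be sketched in a few lines: keep exactly the edges that lie on a path between a pair of leaves with nonzero correlation, and retain every node incident to a kept edge together with all leaves. The code is our own, and the inner-node labels a and b are assumptions:

```python
def path_edges(adj, u, v, parent=None):
    """Edges on the unique u-v path in a tree, or None if v is not reachable."""
    if u == v:
        return []
    for nbr in adj[u]:
        if nbr != parent:
            rest = path_edges(adj, nbr, v, u)
            if rest is not None:
                return [frozenset((u, nbr))] + rest
    return None

def sigma_forest(adj, leaves, nonzero_pairs):
    """Minimal subforest whose leaf connectivity matches the nonzero
    correlations: keep the edges on paths between correlated leaf pairs."""
    kept = set()
    for u, v in nonzero_pairs:
        kept.update(path_edges(adj, u, v))
    nodes = set(leaves).union(*kept) if kept else set(leaves)
    return nodes, kept

# Quartet tree of Example 4.1: leaves 1, 2 meet inner node 'a'; leaves 3, 4
# meet inner node 'b'; only the leaf correlation rho*_12 is nonzero.
adj = {1: ['a'], 2: ['a'], 3: ['b'], 4: ['b'],
       'a': [1, 2, 'b'], 'b': ['a', 3, 4]}
nodes, edges = sigma_forest(adj, [1, 2, 3, 4], [(1, 2)])
```

For the quartet example this yields the five nodes {1, 2, 3, 4, a} and the two edges 1-a and 2-a, matching Figure 1(b).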

Moving on to the decomposition of the function f from (4.1), recall that we divide the parameter vector into coordinates that never vanish on the f-fiber V(f) and the remaining part. In our case, the former consists of all ω_v for v ∈ W and all ρ_e for e ∈ E_F, and the latter consists of the ρ_e for e ∈ E_0. Moreover,

(4.2)   f_1 = ∑_{v ∈ W} (ω_v − ω*_v)² + ∑_{u ∼_F v} ( ∏_{e ∈ E(u,v)} ρ_e − ρ*_uv )²,

and

(4.3)   f_0 = ∑_{u ≁_F v} ( ∏_{e ∈ E(u,v)} ρ_e )².

The Gaussian latent tree model given by a tree T with set of leaves W and edge set E has dimension

dim M(T, W) = |W| + |E| − v_2(T),

where v_2(T) denotes the number of degree-two nodes in T. Similarly, the model given by a forest F with set of leaves W and edge set E_F has dimension

dim M(F, W) = ∑_{i=1}^{r} dim M(T_i, W_i) = |W| + |E_F| − v_2(F),

where T_1, …, T_r are the trees defined by the connected components of F, the set W_i comprises the leaves of T_i, and v_2(F) is again the number of degree-two nodes.

Example 4.2.

The Σ*-forest F from Example 4.1 has dim M(F, W) = 5. The dimensions for the trees in the forest are 3, 1, and 1; the latter two trees each contain only a single node.
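Assuming the dimension count described above (leaves plus edges minus degree-two inner nodes, summed over connected components), the dimensions in Example 4.2 can be recomputed mechanically; the helper below is our own sketch:

```python
from collections import defaultdict

def forest_dimensions(nodes, edges, leaves):
    """Dimension |W_i| + |E_i| - #(inner degree-two nodes) per component."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    leafset, seen, dims = set(leaves), set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            x = stack.pop()
            if x not in comp:
                comp.add(x)
                stack.extend(adj[x])
        seen |= comp
        n_leaves = len(comp & leafset)
        n_edges = sum(1 for u, v in edges if u in comp)
        n_deg2 = sum(1 for x in comp if x not in leafset and len(adj[x]) == 2)
        dims.append(n_leaves + n_edges - n_deg2)
    return dims

# Sigma*-forest of Example 4.1: components {1, 2, a}, {3}, {4}.
dims = forest_dimensions([1, 2, 3, 4, 'a'], [(1, 'a'), (2, 'a')], [1, 2, 3, 4])
# Full quartet tree for comparison: a single component of dimension 9.
dim_tree = forest_dimensions([1, 2, 3, 4, 'a', 'b'],
                             [(1, 'a'), (2, 'a'), ('a', 'b'), (3, 'b'), (4, 'b')],
                             [1, 2, 3, 4])
```

For the forest this returns the component dimensions 3, 1, 1 summing to 5, while the full quartet tree has dimension 9.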

The following theorem provides the real log-canonical thresholds of Gaussian latent tree models. The proof of the theorem is given in Appendix B.

Theorem 4.3.

Let be a tree with set of leaves , and let be a distribution in the Gaussian latent tree model . Write for the parameter space of , and let be the -forest. If is a smooth positive function that is bounded above on , then the function from (4.1) has

where is the number of nodes that shares with , and is the number of nodes in that have degree two and are not in .

Theorem 4.3 implies in particular that the pair (λ, m) depends on Σ* only through the Σ*-forest F, and we write

Example 4.4.

In Example 4.1, dim M(F, W) = 5 (c.f. Example 4.2), and the removed edges contribute 3/2. Hence, the real log-canonical threshold is 13/2, which translates into a coefficient of 13/4 for the log n term in the asymptotic expansion of the log-marginal likelihood. Note that the threshold 13/2 is smaller than dim M(T, W) = 9, making the latent tree model behave like a lower-dimensional model.
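The value 13/2 can be re-derived from the ingredients of Section 3, assuming Theorem 3.2 combines them as the codimension of the nonzero part plus the reciprocal 1-distance of the zero part. For the quartet, the nonzero part fixes the four leaf variances and the product of the two retained edge correlations, giving codimension 5; the zero part has squared monomials with exponent vectors (1,1,0), (1,0,1), (0,1,1) in the removed-edge coordinates, which we label (ρ_ab, ρ_3b, ρ_4b):

```python
from fractions import Fraction

# Exponent vectors of the zero part in the coordinates (rho_ab, rho_3b, rho_4b).
points = [(1, 1, 0), (1, 0, 1), (0, 1, 1)]

# 1-distance: minimize the largest coordinate of a convex combination of the
# exponent vectors (a linear program; a grid with denominator 60 suffices here).
t = None
for i in range(61):
    for j in range(61 - i):
        w = (Fraction(i, 60), Fraction(j, 60), Fraction(60 - i - j, 60))
        val = max(sum(wk * p[d] for wk, p in zip(w, points)) for d in range(3))
        t = val if t is None or val < t else t

codim = 5                 # four variances plus the one nonzero correlation
lam = codim + 1 / t       # 5 + 3/2 = 13/2, the threshold of Example 4.4
```

The optimum is attained at equal weights, where every coordinate of the average equals 2/3, so the reciprocal 1-distance contributes 3/2 to the threshold.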

Example 4.5.

Suppose T has two leaves, labelled 1 and 2, and one inner node h, which then necessarily has degree two. If Σ* is a distribution under which the random variables at the two leaves are uncorrelated, then we have

f(ω, ρ) = (ω_1 − ω*_1)² + (ω_2 − ω*_2)² + (ρ_1h ρ_h2)².

Using the calculation from Example 2.2 or Example 3.5, we see that rlct_Θ(f) = (3, 2). When applying Theorem 4.3, the Σ*-forest F has the leaves 1 and 2 isolated and E_F = ∅. Since dim M(F, W) = 2, the formula from Theorem 4.3 yields rlct_Θ(f) = (3, 2), as it should; the multiplicity 2 reflects the degree-two node h, which is not part of F.

Remark 4.6.

Note that if T has an (inner) node of degree two, then we can contract one of the edges incident to this node to obtain a tree that defines the same model. Repeating such edge contractions, it is always possible to find a tree, with all inner nodes of degree at least three, that defines the same model as the original tree. Moreover, in applications such as phylogenetics, the trees of interest do not have nodes of degree two, in which case the multiplicity in the RLCT is always equal to one.

In the model selection problems that motivate this work, we wish to choose between different forests. We thus state an explicit result for forests in Theorem 4.7 below. For a forest F, we define Σ*-forests in analogy to the definition made for trees. In other words, we apply the previous definitions to each tree appearing in the connected components of F and then form the union of the results. Similarly, the proof of Theorem 4.7 is obtained by simply applying Theorem 4.3 to each connected component of the given forest F.

Theorem 4.7.

Let be a forest with the set of leaves , and let be a distribution in the Gaussian latent forest model . Write for the parameter space of , and let be the -forest. If is a smooth positive function that is bounded above on , then the function from (4.1) has

where is the number of nodes that shares with , and is the number of nodes in that have degree two and are not in .

As in Theorem 4.3, the pair (λ, m) depends on Σ* only through the Σ*-forest, and we write

Remark 4.8.

Fix a forest F with leaves W, and let F′ be any subforest of F with the same leaves (any Σ*-forest is of this form). Let d and d′ be such that d_v is the degree of node v in F for all v, and similarly for d′ with respect to F′. Note that

From this and our prior formula for the threshold we have that

where v_2(F′) is the number of degree-two nodes in F′. Computing the pair (λ, m) can now easily be done in time linear in the size of F, i.e., in O(|V_F|) time, under the assumption that we have stored F and F′ as adjacency lists and that there is a map, with O(1) access time, associating vertices in F′ with those in F. In computational practice we found that the prior two conditions are trivial to guarantee. In particular, note that if