The Inverse G-Wishart Distribution and Variational Message Passing

05/20/2020
by L. Maestrini, et al.

Message passing on a factor graph is a powerful paradigm for the coding of approximate inference algorithms for arbitrarily large graphical models. The notion of a factor graph fragment allows for compartmentalization of algebra and computer code. We show that the Inverse G-Wishart family of distributions enables fundamental variational message passing factor graph fragments to be expressed elegantly and succinctly. Such fragments arise in models for which approximate inference concerning covariance matrix or variance parameters is made, and are ubiquitous in contemporary statistics and machine learning.


1 Introduction

We argue that a very general family of covariance matrix distributions, known as the Inverse G-Wishart family, plays a fundamental role in the modularization of variational inference algorithms via variational message passing when a factor graph fragment (Wand, 2017) approach is used. A factor graph fragment, or fragment for short, is a sub-graph of the relevant factor graph consisting of a factor and all of its neighboring nodes. Even though use of the Inverse G-Wishart distribution is not necessary, its adoption allows for fundamental factor graph fragment natural parameter updates to be expressed elegantly and succinctly. An essential aspect of this strategy is that the Inverse G-Wishart distribution is the only distribution used for covariance matrix and variance parameters. The family includes as special cases the Inverse Chi-Squared, Inverse Gamma and Inverse Wishart distributions. Therefore, just a single distribution is required, which leads to savings in notation and code. Whilst similar comments concerning modularity apply to Monte Carlo-based approaches to approximate Bayesian inference, here we focus on variational inference.

Two of the most common contemporary approaches to fast approximate Bayesian inference are mean field variational Bayes (e.g. Attias, 1999) and expectation propagation (e.g. Minka, 2001). Minka (2005) explains how each approach can be expressed as message passing on relevant factor graphs with variational message passing (Winn & Bishop, 2005) being the name used for the message passing version of mean field variational Bayes. Wand (2017) introduced the concept of factor graph fragments, or fragments for short, for compartmentalization of variational message passing into atom-like components. Chen & Wand (2020) demonstrate the use of fragments for expectation propagation. Explanations of factor graph-based variational message passing that match the current exposition are given in Sections 2.4–2.5 of Wand (2017).

Sections 4.1.2–4.1.3 of Wand (2017) introduce two variational message passing fragments known as the Inverse Wishart prior fragment and the iterated Inverse G-Wishart fragment. The first of these simply corresponds to imposing an Inverse Wishart prior on a covariance matrix. In the scalar case this reduces to imposing an Inverse Chi-Squared or, equivalently, an Inverse Gamma prior on a variance parameter. The iterated Inverse G-Wishart fragment facilitates the imposition of arbitrarily non-informative priors on standard deviation parameters, such as members of the Half-$t$ family (Gelman, 2006; Polson & Scott, 2012). The extension to the covariance matrix case, for which there is the option to impose Uniform distribution priors over the interval $(-1,1)$ on correlation parameters, is elucidated in Huang & Wand (2013). These two fragments arise in many classes of Bayesian models, such as Gaussian and generalized response linear mixed models (e.g. McCulloch et al., 2008), Bayesian factor models (e.g. Conti et al., 2014), vector autoregressive models (e.g. Assaf et al., 2019), and generalized additive mixed models and group-specific curve models (e.g. Harezlak et al., 2018).

Despite the fundamental role of Inverse G-Wishart-based fragments in variational message passing, the main reference to date, Wand (2017), is brief in its exposition and contains some errors that affect certain cases. In this article we provide a detailed exposition of the Inverse G-Wishart distribution in the context of variational message passing and list the Inverse Wishart prior and iterated Inverse G-Wishart fragment updates in full ready-to-code forms. R functions (R Core Team, 2020) that implement these algorithms are provided as part of the supplementary material of this article. We also explain the errors in Wand (2017).

Section 2 contains relevant definitions and results concerning the G-Wishart and Inverse G-Wishart distributions. Connections with the Huang-Wand family of marginally noninformative prior distributions for covariance matrices are summarized in Section 3 and in Section 4 we point to background material on variational message passing. In Sections 5 and 6 we provide detailed accounts of the two variational message passing fragments pertaining to variance and covariance matrix parameters, expanding on what is presented in Sections 4.1.2 and 4.1.3 of Wand (2017), and making some corrections to what is presented there. In Section 7 we provide explicit instructions on how the two fragments are used to specify different types of prior distributions on standard deviation and covariance matrix parameters in variational message passing-based approximate Bayesian inference. Section 8 contains a data analytic example that illustrates the use of the covariance matrix fragment update algorithms. Some closing discussion is given in Section 9. A web-supplement contains relevant details.

2 The G-Wishart and Inverse G-Wishart Distributions

A random matrix $\boldsymbol{X}$ has an Inverse G-Wishart distribution if and only if $\boldsymbol{X}^{-1}$ has a G-Wishart distribution. In this section we first review the G-Wishart distribution, which has an established literature. Then we discuss the Inverse G-Wishart distribution and list properties that are relevant to its employment in variational message passing.

Let $G$ be an undirected graph with nodes labeled $1,\ldots,p$ and edge set $E$ consisting of pairs of nodes that are connected by an edge. We say that the symmetric $p\times p$ matrix $\boldsymbol{M}$ respects $G$ if

$$M_{ij}=0\quad\text{for all pairs }\{i,j\}\notin E,\ i\ne j.$$

Figure 1 shows the zero/non-zero entries of four symmetric matrices. For each matrix, the 4-node graph that the matrix respects is shown underneath.

Figure 1: The zero/non-zero entries of four symmetric matrices with non-zero entries denoted by $\times$. Underneath each matrix is the 4-node undirected graph that the matrix respects. The nodes are numbered according to the rows and columns of the matrices. A graph edge is present between nodes $i$ and $i'$ whenever the $(i,i')$ entry of the matrix is non-zero. The graph respected by the full matrix is denoted by $G_{\text{full}}$. The graph respected by the diagonal matrix is denoted by $G_{\text{diag}}$.

The first graph in Figure 1 is totally connected and corresponds to the matrix being full. Hence we denote this graph by $G_{\text{full}}$. At the other end of the spectrum is the last graph of Figure 1, which is totally disconnected. Since this corresponds to the matrix being diagonal we denote this graph by $G_{\text{diag}}$.

An important concept in G-Wishart and Inverse G-Wishart distribution theory is graph decomposability. An undirected graph is decomposable if and only if all cycles of four or more nodes have an edge that is not part of the cycle but connects two nodes of the cycle. In Figure 1 the first, third and fourth graphs are decomposable. However, the second graph is not decomposable since it contains a four-node cycle that is devoid of edges that connect pairs of nodes within this cycle. Alternative labels for decomposable graphs are chordal graphs and triangulated graphs.

In Sections 2.1 and 2.2 we define the G-Wishart and Inverse G-Wishart distributions and treat important special cases. This exposition depends on particular notation, which we define here. For a generic proposition $\mathcal{P}$ we define $I(\mathcal{P})$ to equal $1$ if $\mathcal{P}$ is true and zero otherwise. If the random variables $x_i$, $1\le i\le n$, are independent such that $x_i$ has distribution $\mathcal{D}_i$ we write $x_i\stackrel{\text{ind.}}{\sim}\mathcal{D}_i$, $1\le i\le n$. For a $d\times1$ vector $\boldsymbol{a}$ let $\operatorname{diag}(\boldsymbol{a})$ be the $d\times d$ diagonal matrix with diagonal comprising the entries of $\boldsymbol{a}$ in order. For a $d\times d$ matrix $\boldsymbol{A}$ let $\operatorname{diagonal}(\boldsymbol{A})$ denote the $d\times1$ vector comprising the diagonal entries of $\boldsymbol{A}$ in order. The vec and vech matrix operators are well-established (e.g. Gentle, 2007). If $\boldsymbol{a}$ is a $d^2\times1$ vector then $\operatorname{vec}^{-1}(\boldsymbol{a})$ is the $d\times d$ matrix such that $\operatorname{vec}\{\operatorname{vec}^{-1}(\boldsymbol{a})\}=\boldsymbol{a}$. The matrix $\boldsymbol{D}_d$, known as the duplication matrix of order $d$, is the $d^2\times\tfrac12 d(d+1)$ matrix containing only zeros and ones such that $\boldsymbol{D}_d\operatorname{vech}(\boldsymbol{A})=\operatorname{vec}(\boldsymbol{A})$ for any symmetric $d\times d$ matrix $\boldsymbol{A}$ (Magnus & Neudecker, 1999). For example,

$$\boldsymbol{D}_2=\begin{bmatrix}1&0&0\\0&1&0\\0&1&0\\0&0&1\end{bmatrix}.$$

The Moore-Penrose inverse of $\boldsymbol{D}_d$ is $\boldsymbol{D}_d^{+}=(\boldsymbol{D}_d^{\mathsf T}\boldsymbol{D}_d)^{-1}\boldsymbol{D}_d^{\mathsf T}$ and is such that $\boldsymbol{D}_d^{+}\operatorname{vec}(\boldsymbol{A})=\operatorname{vech}(\boldsymbol{A})$ for a symmetric matrix $\boldsymbol{A}$.
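The following R sketch (R being the language of this article's supplementary functions) illustrates the duplication matrix identities in the $d=2$ case; the matrix $\boldsymbol{A}$ used here is an arbitrary illustrative value.

```r
# Minimal sketch (base R): the order-2 duplication matrix and its
# Moore-Penrose inverse, verifying D_2 vech(A) = vec(A) for symmetric A.
D2 <- matrix(c(1, 0, 0,
               0, 1, 0,
               0, 1, 0,
               0, 0, 1), nrow = 4, byrow = TRUE)
D2plus <- solve(t(D2) %*% D2) %*% t(D2)   # Moore-Penrose inverse of D2

A <- matrix(c(3, 1,
              1, 2), 2, 2)                # a symmetric 2 x 2 matrix
vecA  <- as.vector(A)                     # vec: stack the columns
vechA <- vecA[c(1, 2, 4)]                 # vech: on/below-diagonal entries

all.equal(as.vector(D2 %*% vechA), vecA)        # TRUE
all.equal(as.vector(D2plus %*% vecA), vechA)    # TRUE
```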

2.1 The G-Wishart Distribution

The G-Wishart distribution (Atay-Kayis & Massam, 2005) is defined as follows:

Definition 1.

Let $\boldsymbol{X}$ be a $p\times p$ symmetric and positive definite random matrix and $G$ be a $p$-node undirected graph such that $\boldsymbol{X}$ respects $G$. For $\delta>0$ and a $p\times p$ symmetric positive definite matrix $\boldsymbol{\Lambda}$ we say that $\boldsymbol{X}$ has a G-Wishart distribution with graph $G$, shape parameter $\delta$ and rate matrix $\boldsymbol{\Lambda}$, and write

$$\boldsymbol{X}\sim\text{G-Wishart}(G,\delta,\boldsymbol{\Lambda})$$

if and only if the non-zero values of the density function of $\boldsymbol{X}$ satisfy

$$p(\boldsymbol{X})\propto|\boldsymbol{X}|^{(\delta-2)/2}\exp\{-\tfrac12\operatorname{tr}(\boldsymbol{\Lambda}\boldsymbol{X})\}.\qquad(1)$$

Obtaining an expression for the normalizing factor of a general G-Wishart density function is a challenging problem that was recently resolved by Uhler et al. (2018). In the special case where $G$ is a decomposable graph a relatively simple expression for the normalizing factor exists and is given, for example, by equation (1.4) of Uhler et al. (2018). The non-decomposable case is much more difficult and is treated in Section 3 of Uhler et al. (2018), but the normalizing factor does not have a succinct expression for general $G$. Similar comments apply to expressions for the mean of a G-Wishart random matrix. As discussed in Section 3 of Atay-Kayis & Massam (2005), the G-Wishart distribution has connections with other distributional constructs such as the hyper Wishart law defined by Dawid & Lauritzen (1993).

Let $G_{\text{full}}$ denote the totally connected $p$-node undirected graph and $G_{\text{diag}}$ denote the totally disconnected $p$-node undirected graph. The special cases of $G=G_{\text{full}}$ and $G=G_{\text{diag}}$ are such that the normalizing factor and mean do have simple closed form expressions. Since these cases arise in fundamental variational message passing algorithms we now turn our attention to them.

2.1.1 The Special Case $G=G_{\text{full}}$

In the case where $G=G_{\text{full}}$, so that $G$ is a fully connected graph, we have:

Result 1.

If the $p\times p$ random matrix $\boldsymbol{X}$ is such that $\boldsymbol{X}\sim\text{G-Wishart}(G_{\text{full}},\delta,\boldsymbol{\Lambda})$ then

$$p(\boldsymbol{X})=\frac{|\boldsymbol{\Lambda}|^{(\delta+p-1)/2}\,|\boldsymbol{X}|^{(\delta-2)/2}\exp\{-\tfrac12\operatorname{tr}(\boldsymbol{\Lambda}\boldsymbol{X})\}}{2^{p(\delta+p-1)/2}\,\pi^{p(p-1)/4}\,\prod_{j=1}^{p}\Gamma\{(\delta+p-j)/2\}}.\qquad(2)$$

The mean of $\boldsymbol{X}$ is

$$E(\boldsymbol{X})=(\delta+p-1)\boldsymbol{\Lambda}^{-1}.$$

Result 1 is not novel at all since the $G=G_{\text{full}}$ case corresponds to $\boldsymbol{X}$ having a Wishart distribution. In other words, (2) is simply the density function of a Wishart random matrix. However, it is worth pointing out that the shape parameter $\delta$ used here is different from that commonly used for the Wishart distribution. For example, in Table A.1 of Gelman et al. (2014) the shape parameter is denoted by $\nu$ and is related to the shape parameter $\delta$ of (2) according to

$$\nu=\delta+p-1,$$

and therefore $\nu$ and $\delta$ are the same only in the special case of $\boldsymbol{X}$ being scalar.
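As a quick numerical check of Result 1 and the shape parameter relationship $\nu=\delta+p-1$, the following hedged R sketch simulates $\text{G-Wishart}(G_{\text{full}},\delta,\boldsymbol{\Lambda})$ draws via the base R function rWishart; the values of $p$, $\delta$ and $\boldsymbol{\Lambda}$ are arbitrary illustrative choices.

```r
# Monte Carlo sketch: a G-Wishart(G_full, delta, Lambda) draw is a Wishart
# draw with Gelman et al. (2014) shape nu = delta + p - 1 and scale
# solve(Lambda); its mean should be (delta + p - 1) * solve(Lambda).
set.seed(1)
p <- 3; delta <- 5
Lambda <- crossprod(matrix(rnorm(p^2), p, p)) + diag(p)  # s.p.d. rate matrix

nu <- delta + p - 1
draws <- rWishart(100000, df = nu, Sigma = solve(Lambda))
round(apply(draws, 1:2, mean), 3)          # Monte Carlo mean
round(nu * solve(Lambda), 3)               # theoretical mean
```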

2.1.2 The Special Case $G=G_{\text{diag}}$

Before treating the $G=G_{\text{diag}}$ situation, we define the Chi-Squared distribution for a scalar random variable $x$.

Definition 2.

Let $x$ be a random variable. For $\delta>0$ and $\lambda>0$ we say that $x$ has a Chi-Squared distribution with shape parameter $\delta$ and rate parameter $\lambda$, and write $x\sim\text{Chi-Squared}(\delta,\lambda)$, if and only if the density function of $x$ satisfies

$$p(x)=\frac{(\lambda/2)^{\delta/2}}{\Gamma(\delta/2)}\,x^{\delta/2-1}\exp(-\lambda x/2),\quad x>0.$$

The $\text{G-Wishart}(G_{\text{diag}},\delta,\boldsymbol{\Lambda})$ distribution is tied intimately to the Chi-Squared distribution, as Result 2 shows.

Result 2.

Suppose that the $p\times p$ random matrix $\boldsymbol{X}$ is such that $\boldsymbol{X}\sim\text{G-Wishart}(G_{\text{diag}},\delta,\boldsymbol{\Lambda})$. Then the non-zero entries of $\boldsymbol{X}$ satisfy

$$X_{jj}\stackrel{\text{ind.}}{\sim}\text{Chi-Squared}(\delta,\Lambda_{jj}),\quad 1\le j\le p,$$

where $\Lambda_{jj}$ is the $j$th diagonal entry of $\boldsymbol{\Lambda}$. The density function of $\boldsymbol{X}$ is

$$p(\boldsymbol{X})=\prod_{j=1}^{p}\frac{(\Lambda_{jj}/2)^{\delta/2}}{\Gamma(\delta/2)}\,X_{jj}^{\delta/2-1}\exp(-\Lambda_{jj}X_{jj}/2).$$

The mean of $\boldsymbol{X}$ is

$$E(\boldsymbol{X})=\delta\,\{\operatorname{diag}(\Lambda_{11},\ldots,\Lambda_{pp})\}^{-1}.$$

We now make some remarks concerning Result 2.

  1. When $G=G_{\text{diag}}$ the off-diagonal entries of $\boldsymbol{\Lambda}$ have no effect on the distribution of $\boldsymbol{X}$. In other words, the declaration $\boldsymbol{X}\sim\text{G-Wishart}(G_{\text{diag}},\delta,\boldsymbol{\Lambda})$ is equivalent to the declaration $\boldsymbol{X}\sim\text{G-Wishart}(G_{\text{diag}},\delta,\operatorname{diag}\{\operatorname{diagonal}(\boldsymbol{\Lambda})\})$.

  2. The declaration $\boldsymbol{X}\sim\text{G-Wishart}(G_{\text{diag}},\delta,\boldsymbol{\Lambda})$ is equivalent to the diagonal entries of $\boldsymbol{X}$ being independent Chi-Squared random variables with shape parameter $\delta$ and rate parameters equalling the diagonal entries of $\boldsymbol{\Lambda}$; a numerical illustration is given in the sketch following these remarks.

  3. Even though statements concerning the distributions of independent Chi-Squared random variables may seem simpler than a statement of the form $\boldsymbol{X}\sim\text{G-Wishart}(G_{\text{diag}},\delta,\boldsymbol{\Lambda})$, the major thrust of this article is the elegance provided by key variational message passing fragment updates being expressed in terms of a single family of distributions.
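A short R sketch of Result 2, with arbitrary illustrative values of $\delta$ and of the diagonal entries of $\boldsymbol{\Lambda}$: the $\text{Chi-Squared}(\delta,\lambda)$ distribution is the Gamma distribution with shape $\delta/2$ and rate $\lambda/2$, so rgamma can be used directly.

```r
# Sketch of Result 2: with G = G_diag the diagonal entries of X are
# independent Chi-Squared(delta, Lambda_jj) variates, i.e. Gamma draws with
# shape delta/2 and rate Lambda_jj/2; their means should be delta / Lambda_jj.
set.seed(2)
delta <- 4
Lambda.diag <- c(0.5, 1, 2)               # diagonal entries of Lambda

X.diag <- sapply(Lambda.diag,
                 function(l) rgamma(100000, shape = delta/2, rate = l/2))
round(colMeans(X.diag), 3)                # Monte Carlo means
round(delta / Lambda.diag, 3)             # theoretical means
```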

2.1.3 Exponential Family Form and Natural Parameterisation

Suppose that $\boldsymbol{X}\sim\text{G-Wishart}(G,\delta,\boldsymbol{\Lambda})$. Then for $\boldsymbol{X}$ such that $p(\boldsymbol{X})\ne0$ we have

$$p(\boldsymbol{X})\propto\exp\{\boldsymbol{T}(\boldsymbol{X})^{\mathsf T}\boldsymbol{\eta}\},\qquad(3)$$

where

$$\boldsymbol{T}(\boldsymbol{X})=\begin{bmatrix}\log|\boldsymbol{X}|\\ \operatorname{vech}(\boldsymbol{X})\end{bmatrix}\quad\text{and}\quad\boldsymbol{\eta}=\begin{bmatrix}\eta_1\\ \boldsymbol{\eta}_2\end{bmatrix}=\begin{bmatrix}(\delta-2)/2\\ -\tfrac12\boldsymbol{D}_p^{\mathsf T}\operatorname{vec}(\boldsymbol{\Lambda})\end{bmatrix}$$

are, respectively, sufficient statistic and natural parameter vectors. The inverse of the natural parameter mapping is

$$\delta=2\eta_1+2\quad\text{and}\quad\boldsymbol{\Lambda}=-2\operatorname{vec}^{-1}(\boldsymbol{D}_p^{+\mathsf T}\boldsymbol{\eta}_2).$$

Note that, throughout this article, we use $\operatorname{vech}$ rather than $\operatorname{vec}$ since the former is more compact and avoids duplications. Section S.1 in the web-supplement has further discussion on this matter.
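The following hedged R sketch implements this natural parameter mapping and its inverse; the function names gw.natural and gw.natural.inv are our own illustrative labels, and duplication.matrix is a from-first-principles construction of $\boldsymbol{D}_p$.

```r
# Hedged sketch: natural parameter mapping for the G-Wishart exponential
# family form and its inverse, using the duplication matrix D_p.
duplication.matrix <- function(p) {      # build D_p directly
  Dp <- matrix(0, p^2, p*(p+1)/2)
  k <- 0
  for (j in 1:p) for (i in j:p) {
    k <- k + 1
    Dp[(j-1)*p + i, k] <- 1
    Dp[(i-1)*p + j, k] <- 1              # same column serves (i,j) and (j,i)
  }
  Dp
}
gw.natural <- function(delta, Lambda) {  # (delta, Lambda) -> eta
  Dp <- duplication.matrix(nrow(Lambda))
  c((delta - 2)/2, -0.5 * as.vector(t(Dp) %*% as.vector(Lambda)))
}
gw.natural.inv <- function(eta, p) {     # eta -> (delta, Lambda)
  Dp <- duplication.matrix(p)
  Dp.plus <- solve(t(Dp) %*% Dp) %*% t(Dp)
  list(delta = 2*eta[1] + 2,
       Lambda = -2 * matrix(t(Dp.plus) %*% eta[-1], p, p))
}
```

For instance, gw.natural.inv(gw.natural(5, Lambda), p = nrow(Lambda)) recovers the shape parameter 5 and the rate matrix Lambda.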

2.2 The Inverse G-Wishart Distribution

Suppose that $\boldsymbol{X}\sim\text{G-Wishart}(G,\delta,\boldsymbol{\Lambda})$, where $\boldsymbol{X}$ is $p\times p$, and let $\boldsymbol{Y}\equiv\boldsymbol{X}^{-1}$. Let the density functions of $\boldsymbol{X}$ and $\boldsymbol{Y}$ be denoted by $p_{\boldsymbol{X}}$ and $p_{\boldsymbol{Y}}$ respectively. Then the density function of $\boldsymbol{Y}$ is

$$p_{\boldsymbol{Y}}(\boldsymbol{Y})=p_{\boldsymbol{X}}(\boldsymbol{Y}^{-1})\,|J(\boldsymbol{Y})|,\qquad(4)$$

where

$$J(\boldsymbol{Y})\equiv\det\!\left\{\frac{\partial\operatorname{vech}(\boldsymbol{X})}{\partial\operatorname{vech}(\boldsymbol{Y})}\right\}$$

is the Jacobian of the transformation.

An important observation is that the form of $J(\boldsymbol{Y})$ is dependent on the graph $G$. In the case of $G$ being a decomposable graph an expression for $J(\boldsymbol{Y})$ is given by (2.4) of Letac & Massam (2007), with credit given to Roverato (2000). Therefore, if $G$ is decomposable, the density function of an Inverse G-Wishart random matrix can be obtained by substitution of (2.4) of Letac & Massam (2007) into (4). However, depending on the complexity of $G$, simplification of the density function expression may be challenging.

With variational message passing in mind, we now turn to the $G=G_{\text{full}}$ and $G=G_{\text{diag}}$ special cases. The $G=G_{\text{diag}}$ case is simple since it involves products of univariate density functions and we have

$$|J(\boldsymbol{Y})|=\prod_{j=1}^{p}Y_{jj}^{-2}=|\boldsymbol{Y}|^{-2}.\qquad(5)$$

The $G=G_{\text{full}}$ case is more challenging and is the focus of Theorem 2.1.8 of Muirhead (1982):

$$|J(\boldsymbol{Y})|=|\boldsymbol{Y}|^{-(p+1)}.\qquad(6)$$

This result is also stated as Lemma 2.1 in Letac & Massam (2007).

Combining (4), (5) and (6) we have:

Result 3.

Suppose that $\boldsymbol{Y}=\boldsymbol{X}^{-1}$ where $\boldsymbol{X}\sim\text{G-Wishart}(G,\delta,\boldsymbol{\Lambda})$ and $\boldsymbol{X}$ is $p\times p$.

  • If $G=G_{\text{full}}$ then $p_{\boldsymbol{Y}}(\boldsymbol{Y})\propto|\boldsymbol{Y}|^{-(\delta+2p)/2}\exp\{-\tfrac12\operatorname{tr}(\boldsymbol{\Lambda}\boldsymbol{Y}^{-1})\}$.

  • If $G=G_{\text{diag}}$ then $p_{\boldsymbol{Y}}(\boldsymbol{Y})\propto|\boldsymbol{Y}|^{-(\delta+2)/2}\exp\{-\tfrac12\operatorname{tr}(\boldsymbol{\Lambda}\boldsymbol{Y}^{-1})\}$.

Whilst Result 3 only covers $G=G_{\text{full}}$ or $G=G_{\text{diag}}$, it shows that, in these special cases, the density function of an Inverse G-Wishart random matrix is proportional to a power of $|\boldsymbol{Y}|$ multiplied by an exponentiated trace of a matrix multiplied by $\boldsymbol{Y}^{-1}$. This form does not necessarily arise for general $G$. Since the motivating variational message passing fragment update algorithms only involve the cases $G\in\{G_{\text{full}},G_{\text{diag}}\}$, we focus on them for the remainder of this section.

2.2.1 The Inverse G-Wishart Distribution When $G\in\{G_{\text{full}},G_{\text{diag}}\}$

For succinct statement of variational message passing fragment update algorithms involving variance and covariance matrix parameters it is advantageous to have a single Inverse G-Wishart distribution notation covering both the $G=G_{\text{full}}$ and $G=G_{\text{diag}}$ cases.

Definition 3.

Let $\boldsymbol{X}$ be a $p\times p$ symmetric and positive definite random matrix and $G$ be a $p$-node undirected graph such that $\boldsymbol{X}$ respects $G$. Let $\xi>0$ and $\boldsymbol{\Lambda}$ be a $p\times p$ symmetric positive definite matrix.

  • If $G=G_{\text{full}}$ and $\xi$ is restricted such that $\xi>2(p-1)$ then we say that $\boldsymbol{X}$ has an Inverse G-Wishart distribution with graph $G_{\text{full}}$, shape parameter $\xi$ and scale matrix $\boldsymbol{\Lambda}$, and write

$$\boldsymbol{X}\sim\text{Inverse-G-Wishart}(G_{\text{full}},\xi,\boldsymbol{\Lambda})$$

    if and only if the non-zero values of the density function of $\boldsymbol{X}$ satisfy

$$p(\boldsymbol{X})\propto|\boldsymbol{X}|^{-(\xi+2)/2}\exp\{-\tfrac12\operatorname{tr}(\boldsymbol{\Lambda}\boldsymbol{X}^{-1})\}.$$

  • If $G=G_{\text{diag}}$ then we say that $\boldsymbol{X}$ has an Inverse G-Wishart distribution with graph $G_{\text{diag}}$, shape parameter $\xi$ and scale matrix $\boldsymbol{\Lambda}$, and write

$$\boldsymbol{X}\sim\text{Inverse-G-Wishart}(G_{\text{diag}},\xi,\boldsymbol{\Lambda})$$

    if and only if the non-zero values of the density function of $\boldsymbol{X}$ satisfy

$$p(\boldsymbol{X})\propto|\boldsymbol{X}|^{-(\xi+2)/2}\exp\{-\tfrac12\operatorname{tr}(\boldsymbol{\Lambda}\boldsymbol{X}^{-1})\}.$$

  • If $G\notin\{G_{\text{full}},G_{\text{diag}}\}$ then the $\text{Inverse-G-Wishart}(G,\xi,\boldsymbol{\Lambda})$ distribution is not defined.

The shape parameter $\xi$ used in Definition 3 is a reasonable compromise between various competing parameterization choices for the Inverse G-Wishart distribution for $G\in\{G_{\text{full}},G_{\text{diag}}\}$ and for use in variational message passing algorithms. It has the following attractions:

  • The exponent of the determinant in the density function expression is $-(\xi+2)/2$ regardless of whether $G=G_{\text{full}}$ or $G=G_{\text{diag}}$, which is consistent with the G-Wishart distributional notation used in Definition 1.

  • In the $p=1$ case, $\xi$ matches the shape parameter in the most common parameterization of the Inverse Chi-Squared distribution, such as that used in Table A.1 of Gelman et al. (2014).

In the case where $G=G_{\text{full}}$ we have the following:

Result 4.

If the $p\times p$ random matrix $\boldsymbol{X}$ is such that $\boldsymbol{X}\sim\text{Inverse-G-Wishart}(G_{\text{full}},\xi,\boldsymbol{\Lambda})$ then

$$p(\boldsymbol{X})=\frac{|\boldsymbol{\Lambda}|^{(\xi-p+1)/2}\,|\boldsymbol{X}|^{-(\xi+2)/2}\exp\{-\tfrac12\operatorname{tr}(\boldsymbol{\Lambda}\boldsymbol{X}^{-1})\}}{2^{p(\xi-p+1)/2}\,\pi^{p(p-1)/4}\,\prod_{j=1}^{p}\Gamma\{(\xi-p+2-j)/2\}}.$$

The mean of $\boldsymbol{X}$ is

$$E(\boldsymbol{X})=\frac{\boldsymbol{\Lambda}}{\xi-2p},\quad\text{provided that }\xi>2p.$$

Result 4 follows directly from the fact that $\boldsymbol{X}\sim\text{Inverse-G-Wishart}(G_{\text{full}},\xi,\boldsymbol{\Lambda})$ if and only if $\boldsymbol{X}$ has an Inverse Wishart distribution with shape parameter $\xi-p+1$ and scale matrix $\boldsymbol{\Lambda}$, and established results for the density function and mean of this distribution given in, for example, Table A.1 of Gelman et al. (2014).
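A Monte Carlo sketch of Result 4 in R, with arbitrary illustrative values of $p$, $\xi$ and $\boldsymbol{\Lambda}$:

```r
# Monte Carlo sketch of Result 4: inverting Wishart draws with
# df = xi - p + 1 and scale solve(Lambda) gives
# Inverse-G-Wishart(G_full, xi, Lambda) draws with mean Lambda/(xi - 2p).
set.seed(3)
p <- 2; xi <- 9
Lambda <- matrix(c(4, 1, 1, 3), 2, 2)

W <- rWishart(200000, df = xi - p + 1, Sigma = solve(Lambda))
X.draws <- apply(W, 3, solve)                    # invert each draw
round(matrix(rowMeans(X.draws), p, p), 3)        # Monte Carlo mean of X
round(Lambda / (xi - 2*p), 3)                    # theoretical mean
```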

We now deal with the $G=G_{\text{diag}}$ case.

Definition 4.

Let $x$ be a random variable. For $\delta>0$ and $\lambda>0$ we say that the random variable $x$ has an Inverse Chi-Squared distribution with shape parameter $\delta$ and rate parameter $\lambda$, and write

$$x\sim\text{Inverse-}\chi^2(\delta,\lambda),$$

if and only if $1/x\sim\text{Chi-Squared}(\delta,\lambda)$. If $x\sim\text{Inverse-}\chi^2(\delta,\lambda)$ then the density function of $x$ is

$$p(x)=\frac{(\lambda/2)^{\delta/2}}{\Gamma(\delta/2)}\,x^{-\delta/2-1}\exp\{-\lambda/(2x)\},\quad x>0.$$

We are now ready to state:

Result 5.

Suppose that the $p\times p$ random matrix $\boldsymbol{X}$ is such that $\boldsymbol{X}\sim\text{Inverse-G-Wishart}(G_{\text{diag}},\xi,\boldsymbol{\Lambda})$. Then the non-zero entries of $\boldsymbol{X}$ satisfy

$$X_{jj}\stackrel{\text{ind.}}{\sim}\text{Inverse-}\chi^2(\xi,\Lambda_{jj}),\quad 1\le j\le p,$$

where $\Lambda_{jj}$ is the $j$th diagonal entry of $\boldsymbol{\Lambda}$. The density function of $\boldsymbol{X}$ is

$$p(\boldsymbol{X})=\prod_{j=1}^{p}\frac{(\Lambda_{jj}/2)^{\xi/2}}{\Gamma(\xi/2)}\,X_{jj}^{-\xi/2-1}\exp\{-\Lambda_{jj}/(2X_{jj})\}.$$

The mean of $\boldsymbol{X}$ is

$$E(\boldsymbol{X})=\frac{\operatorname{diag}(\Lambda_{11},\ldots,\Lambda_{pp})}{\xi-2},\quad\text{provided that }\xi>2.$$
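A minimal R sketch of the scalar case of Result 5, again with arbitrary illustrative hyperparameter values:

```r
# Sketch of Result 5 in the scalar case: an Inverse-Chi-Squared(xi, lambda)
# draw is the reciprocal of a Gamma(xi/2, rate = lambda/2) draw, and its
# mean should be lambda/(xi - 2) for xi > 2.
set.seed(4)
xi <- 6; lambda <- 3
x <- 1 / rgamma(100000, shape = xi/2, rate = lambda/2)
round(mean(x), 3)            # Monte Carlo mean
round(lambda / (xi - 2), 3)  # theoretical mean: 0.75
```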

2.2.2 Natural Parameter Forms and Sufficient Statistic Expectations

Suppose that $\boldsymbol{X}\sim\text{Inverse-G-Wishart}(G,\xi,\boldsymbol{\Lambda})$ where $G\in\{G_{\text{full}},G_{\text{diag}}\}$. Then for $\boldsymbol{X}$ such that $p(\boldsymbol{X})\ne0$,

$$p(\boldsymbol{X})\propto\exp\{\boldsymbol{T}(\boldsymbol{X})^{\mathsf T}\boldsymbol{\eta}\},$$

where

$$\boldsymbol{T}(\boldsymbol{X})=\begin{bmatrix}\log|\boldsymbol{X}|\\ \operatorname{vech}(\boldsymbol{X}^{-1})\end{bmatrix}\quad\text{and}\quad\boldsymbol{\eta}=\begin{bmatrix}\eta_1\\ \boldsymbol{\eta}_2\end{bmatrix}=\begin{bmatrix}-(\xi+2)/2\\ -\tfrac12\boldsymbol{D}_p^{\mathsf T}\operatorname{vec}(\boldsymbol{\Lambda})\end{bmatrix}\qquad(7)$$

are, respectively, sufficient statistic and natural parameter vectors. The inverse of the natural parameter mapping is

$$\xi=-2\eta_1-2\quad\text{and}\quad\boldsymbol{\Lambda}=-2\operatorname{vec}^{-1}(\boldsymbol{D}_p^{+\mathsf T}\boldsymbol{\eta}_2).\qquad(8)$$

As explained in Section S.1 of the web-supplement, alternatives to (7) are those that use $\operatorname{vec}(\boldsymbol{X}^{-1})$ instead of $\operatorname{vech}(\boldsymbol{X}^{-1})$. Throughout this article we use the more compact “vech” form.

The following result is fundamental to succinct formulation of the covariance matrix and variance parameter fragment updates for variational message passing:

Result 6.

Suppose that $\boldsymbol{X}$ is a $p\times p$ random matrix that has an Inverse G-Wishart distribution with graph $G\in\{G_{\text{full}},G_{\text{diag}}\}$ and natural parameter vector $\boldsymbol{\eta}=(\eta_1,\boldsymbol{\eta}_2^{\mathsf T})^{\mathsf T}$. Then

$$E(\boldsymbol{X}^{-1})=(\eta_1+\omega)\{\operatorname{vec}^{-1}(\boldsymbol{D}_p^{+\mathsf T}\boldsymbol{\eta}_2)\}^{-1},\quad\text{where}\quad\omega=\begin{cases}(p+1)/2&\text{if }G=G_{\text{full}},\\ 1&\text{if }G=G_{\text{diag}}.\end{cases}$$
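The following hedged R sketch implements Result 6 (the function name igw.EXinv is our own label) and checks it against the closed-form Inverse Wishart mean of $\boldsymbol{X}^{-1}$ in the $G_{\text{full}}$ case; it reuses duplication.matrix from the earlier sketch.

```r
# Hedged sketch of Result 6: E(X^{-1}) computed from the Inverse G-Wishart
# natural parameter vector, with omega chosen according to the graph.
igw.EXinv <- function(eta, p, graph = c("full", "diag")) {
  graph <- match.arg(graph)
  omega <- if (graph == "full") (p + 1)/2 else 1
  Dp <- duplication.matrix(p)
  Dp.plus <- solve(t(Dp) %*% Dp) %*% t(Dp)
  (eta[1] + omega) * solve(matrix(t(Dp.plus) %*% eta[-1], p, p))
}

p <- 2; xi <- 9; Lambda <- matrix(c(4, 1, 1, 3), 2, 2)
eta <- c(-(xi + 2)/2,
         -0.5 * as.vector(t(duplication.matrix(p)) %*% as.vector(Lambda)))
igw.EXinv(eta, p, "full")        # should equal (xi - p + 1) * solve(Lambda)
(xi - p + 1) * solve(Lambda)     # closed-form Inverse Wishart value
```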

2.2.3 Relationships with the Hyper Inverse Wishart Distributions

Throughout this article we follow the G-Wishart nomenclature as used by, for example, Atay-Kayis & Massam (2005), Letac & Massam (2007) and Uhler et al. (2018) in our naming of the Inverse G-Wishart family. Some earlier articles, such as Roverato (2000), use the term Hyper Inverse Wishart for the same family of distributions. The naming used here is in keeping with the more recent literature concerning Wishart distributions with graphical restrictions.

3 Connection with the Huang-Wand Family of Distributions

A major motivation for working with the Inverse G-Wishart distribution is the fact that the family of marginally noninformative priors proposed in Huang & Wand (2013) can be expressed succinctly in terms of the $\text{Inverse-G-Wishart}(G,\xi,\boldsymbol{\Lambda})$ family where $G\in\{G_{\text{full}},G_{\text{diag}}\}$. This means that variational message passing fragments that cater for Huang-Wand prior specification, as well as Inverse Wishart prior specification, only require natural parameter vector manipulations within a single distributional family.

If $\boldsymbol{\Sigma}$ is a $p\times p$ symmetric positive definite matrix then, for $\nu>0$ and $s_1,\ldots,s_p>0$, the specification

$$\boldsymbol{\Sigma}\,|\,\boldsymbol{A}\sim\text{Inverse-G-Wishart}(G_{\text{full}},\,\nu+2p-2,\,\boldsymbol{A}^{-1}),\quad\boldsymbol{A}\sim\text{Inverse-G-Wishart}\big(G_{\text{diag}},\,1,\,\{\nu\operatorname{diag}(s_1^2,\ldots,s_p^2)\}^{-1}\big)\qquad(9)$$

places a Huang-Wand distribution on $\boldsymbol{\Sigma}$ with shape parameter $\nu$ and scale parameters $s_1,\ldots,s_p$.

The specification (9) matches (2) of Huang & Wand (2013) but with some differences in notation. Firstly, $p$ is used for the matrix dimension here, whereas Huang & Wand (2013) use different notation for the dimension. Also, the $s_j$, $1\le j\le p$, scale parameters are denoted differently in Huang & Wand (2013). The auxiliary variables $a_1,\ldots,a_p$ in (2) of Huang & Wand (2013) are related to the matrix $\boldsymbol{A}$ via the expression $\boldsymbol{A}=\operatorname{diag}(a_1,\ldots,a_p)/(2\nu)$.

As discussed in Huang & Wand (2013), special cases of (9) correspond to marginally noninformative prior specification of the covariance matrix $\boldsymbol{\Sigma}$ in the sense that the standard deviation parameters $\sigma_j\equiv(\boldsymbol{\Sigma})_{jj}^{1/2}$, $1\le j\le p$, can have Half-$t$ priors with arbitrarily large scale parameters, controlled by the $s_j$ values. This is in keeping with the advice given in Gelman (2006). Moreover, the correlation parameters $(\boldsymbol{\Sigma})_{jj'}/(\sigma_j\sigma_{j'})$, $j\ne j'$, have a Uniform distribution over the interval $(-1,1)$ when $\nu=2$. We refer to this special case as the Huang-Wand marginally noninformative prior distribution with scale parameters $s_1,\ldots,s_p$ and write

$$\boldsymbol{\Sigma}\sim\text{Huang-Wand}(s_1,\ldots,s_p)\qquad(10)$$

as a shorthand for (9) with $\nu=2$.
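As an illustrative Monte Carlo sketch of the $p=1$ case, the following R code simulates from the hierarchy (9) with $\nu=2$ and compares the quantiles of the implied standard deviation draws with reference Half-$t$ draws; the scale value $s=1.5$ is an arbitrary choice.

```r
# Monte Carlo sketch of the p = 1 case of (9)-(10): with nu = 2 the standard
# deviation sigma should follow a Half-t distribution with 2 degrees of
# freedom and scale s.
set.seed(6)
n <- 100000; nu <- 2; s <- 1.5
a      <- 1 / rgamma(n, shape = 1/2, rate = 1/(2*nu*s^2))  # A, p = 1 case of (9)
sigma2 <- 1 / rgamma(n, shape = nu/2, rate = 1/(2*a))      # Sigma | A, p = 1
sigma  <- sqrt(sigma2)
half.t <- abs(s * rt(n, df = nu))                          # reference Half-t draws
round(quantile(sigma,  c(0.25, 0.5, 0.75)), 2)
round(quantile(half.t, c(0.25, 0.5, 0.75)), 2)             # should roughly agree
```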

4 Variational Message Passing Background

The overarching goal of this article is to identify and specify algebraic primitives for flexible imposition of covariance matrix priors within a variational message passing framework. In Wand (2017) these algebraic primitives are organised into fragments. This formalism is also used in Nolan & Wand (2017), Maestrini & Wand (2018) and McLean & Wand (2019).

Despite it being a central theme of this article, we will not provide a detailed description of variational message passing here. Instead we refer the reader to Sections 2–4 of Wand (2017) for the relevant variational message passing background material.

Since the notational conventions for messages used in this section’s references are used in the remainder of this article we summarize them here. If $f$ denotes a generic factor and $\theta$ denotes a generic stochastic variable that is a neighbor of $f$ in the factor graph then the message passed from $f$ to $\theta$ and the message passed from $\theta$ to $f$ are both functions of $\theta$ and are denoted by, respectively,

$$m_{f\to\theta}(\theta)\quad\text{and}\quad m_{\theta\to f}(\theta).$$

Typically, the messages are proportional to an exponential family density function with sufficient statistic $\boldsymbol{T}(\theta)$, and we have

$$m_{f\to\theta}(\theta)\propto\exp\{\boldsymbol{T}(\theta)^{\mathsf T}\boldsymbol{\eta}_{f\to\theta}\}\quad\text{and}\quad m_{\theta\to f}(\theta)\propto\exp\{\boldsymbol{T}(\theta)^{\mathsf T}\boldsymbol{\eta}_{\theta\to f}\},$$

where $\boldsymbol{\eta}_{f\to\theta}$ and $\boldsymbol{\eta}_{\theta\to f}$ are the message natural parameter vectors. Such vectors play a central role in variational message passing iterative algorithms. We also adopt the notation

$$\boldsymbol{\eta}_{f\leftrightarrow\theta}\equiv\boldsymbol{\eta}_{f\to\theta}+\boldsymbol{\eta}_{\theta\to f}.$$
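One convenient, though by no means mandatory, R representation of these objects stores each message as a list containing its graph and natural parameter vector; the function names here are our own illustrative labels.

```r
# Hedged sketch: an Inverse G-Wishart message as a (graph, eta) list, so
# that the combination eta_{f <-> theta} is simple vector addition.
make.igw.message <- function(graph, eta) list(graph = graph, eta = eta)

combine.messages <- function(msg1, msg2) {
  stopifnot(msg1$graph == msg2$graph)   # mirrors the same-graph conjugacy rule
  make.igw.message(msg1$graph, msg1$eta + msg2$eta)
}
```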

5 The Inverse G-Wishart Prior Fragment

The Inverse G-Wishart prior fragment corresponds to the following prior imposition on a covariance matrix $\boldsymbol{\Theta}$:

$$\boldsymbol{\Theta}\sim\text{Inverse-G-Wishart}(G_{\boldsymbol{\Theta}},\xi_{\boldsymbol{\Theta}},\boldsymbol{\Lambda}_{\boldsymbol{\Theta}})$$

for a $d$-node undirected graph $G_{\boldsymbol{\Theta}}\in\{G_{\text{full}},G_{\text{diag}}\}$, scalar shape parameter $\xi_{\boldsymbol{\Theta}}>0$ and $d\times d$ scale matrix $\boldsymbol{\Lambda}_{\boldsymbol{\Theta}}$. The fragment's factor is $p(\boldsymbol{\Theta})$.

Figure 2: Diagram of the Inverse G-Wishart prior fragment.

Figure 2 is a diagram of the fragment, which shows that its only factor to stochastic node message is

$$m_{p(\boldsymbol{\Theta})\to\boldsymbol{\Theta}}(\boldsymbol{\Theta})=p(\boldsymbol{\Theta}),$$

which leads to

$$m_{p(\boldsymbol{\Theta})\to\boldsymbol{\Theta}}(\boldsymbol{\Theta})=\exp\left\{\begin{bmatrix}\log|\boldsymbol{\Theta}|\\ \operatorname{vech}(\boldsymbol{\Theta}^{-1})\end{bmatrix}^{\mathsf T}\boldsymbol{\eta}_{p(\boldsymbol{\Theta})\to\boldsymbol{\Theta}}\right\}.$$

Therefore, the natural parameter update is

$$\boldsymbol{\eta}_{p(\boldsymbol{\Theta})\to\boldsymbol{\Theta}}\leftarrow\begin{bmatrix}-(\xi_{\boldsymbol{\Theta}}+2)/2\\ -\tfrac12\boldsymbol{D}_d^{\mathsf T}\operatorname{vec}(\boldsymbol{\Lambda}_{\boldsymbol{\Theta}})\end{bmatrix}.$$

Apart from passing the natural parameter vector out of the fragment, we should also pass the graph $G_{\boldsymbol{\Theta}}$ out of the fragment. This entails the update:

$$G_{p(\boldsymbol{\Theta})\to\boldsymbol{\Theta}}\leftarrow G_{\boldsymbol{\Theta}}.$$

Algorithm 1 provides the inputs, updates and outputs for the Inverse G-Wishart prior fragment.

Hyperparameter Inputs: $G_{\boldsymbol{\Theta}}\in\{G_{\text{full}},G_{\text{diag}}\}$, $\xi_{\boldsymbol{\Theta}}>0$, $\boldsymbol{\Lambda}_{\boldsymbol{\Theta}}$ symmetric and positive definite.
Updates: $G_{p(\boldsymbol{\Theta})\to\boldsymbol{\Theta}}\leftarrow G_{\boldsymbol{\Theta}}$ ; $\boldsymbol{\eta}_{p(\boldsymbol{\Theta})\to\boldsymbol{\Theta}}\leftarrow\begin{bmatrix}-(\xi_{\boldsymbol{\Theta}}+2)/2\\ -\tfrac12\boldsymbol{D}_d^{\mathsf T}\operatorname{vec}(\boldsymbol{\Lambda}_{\boldsymbol{\Theta}})\end{bmatrix}$
Outputs: $G_{p(\boldsymbol{\Theta})\to\boldsymbol{\Theta}}$, $\boldsymbol{\eta}_{p(\boldsymbol{\Theta})\to\boldsymbol{\Theta}}$.
Algorithm 1 The inputs, updates and outputs for the Inverse G-Wishart prior fragment.
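A hedged R sketch of Algorithm 1 follows; the supplementary material contains the authors' definitive implementations, the function name igw.prior.fragment is our own label, and duplication.matrix is from the earlier sketch.

```r
# Hedged sketch of Algorithm 1: the Inverse G-Wishart prior fragment update,
# returning the graph and natural parameter vector of the message
# from p(Theta) to Theta.
igw.prior.fragment <- function(G.Theta, xi.Theta, Lambda.Theta) {
  d <- nrow(Lambda.Theta)
  Dd <- duplication.matrix(d)
  eta <- c(-(xi.Theta + 2)/2,
           -0.5 * as.vector(t(Dd) %*% as.vector(Lambda.Theta)))
  list(graph = G.Theta, eta = eta)   # outputs for p(Theta) -> Theta
}
```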

6 The Iterated Inverse G-Wishart Fragment

The iterated Inverse G-Wishart fragment corresponds to the following specification involving a $d\times d$ covariance matrix $\boldsymbol{\Sigma}$:

$$\boldsymbol{\Sigma}\,|\,\boldsymbol{A}\sim\text{Inverse-G-Wishart}(G,\xi,\boldsymbol{A}^{-1}),$$

where $G$ is a $d$-node undirected graph such that $G\in\{G_{\text{full}},G_{\text{diag}}\}$ and $\xi$ is a particular deterministic value of the Inverse G-Wishart shape parameter, restricted according to Definition 3. Figure 3 is a diagram of this fragment, showing that it has a factor $p(\boldsymbol{\Sigma}|\boldsymbol{A})$ connected to two stochastic nodes $\boldsymbol{\Sigma}$ and $\boldsymbol{A}$.

Figure 3: Diagram of the iterated Inverse G-Wishart fragment.

The factor of the iterated Inverse G-Wishart fragment is, as a function of both $\boldsymbol{\Sigma}$ and $\boldsymbol{A}$, $p(\boldsymbol{\Sigma}|\boldsymbol{A})$.

As shown in Section S.2.1 of the web-supplement both of the factor to stochastic node messages of this fragment,

$$m_{p(\boldsymbol{\Sigma}|\boldsymbol{A})\to\boldsymbol{\Sigma}}(\boldsymbol{\Sigma})\quad\text{and}\quad m_{p(\boldsymbol{\Sigma}|\boldsymbol{A})\to\boldsymbol{A}}(\boldsymbol{A}),$$

are proportional to Inverse G-Wishart density functions. We assume the following conjugacy constraints:

All messages passed to and from $p(\boldsymbol{\Sigma}|\boldsymbol{A})$ from outside the fragment are proportional to Inverse G-Wishart density functions. The Inverse G-Wishart messages passed between $p(\boldsymbol{\Sigma}|\boldsymbol{A})$ and $\boldsymbol{\Sigma}$ have the same graph. The Inverse G-Wishart messages passed between $p(\boldsymbol{\Sigma}|\boldsymbol{A})$ and $\boldsymbol{A}$ have the same graph.

Under these constraints, and in view of e.g. (7) of Wand (2017), the message passed from $p(\boldsymbol{\Sigma}|\boldsymbol{A})$ to $\boldsymbol{\Sigma}$ has the form

$$m_{p(\boldsymbol{\Sigma}|\boldsymbol{A})\to\boldsymbol{\Sigma}}(\boldsymbol{\Sigma})=\exp\left\{\begin{bmatrix}\log|\boldsymbol{\Sigma}|\\ \operatorname{vech}(\boldsymbol{\Sigma}^{-1})\end{bmatrix}^{\mathsf T}\boldsymbol{\eta}_{p(\boldsymbol{\Sigma}|\boldsymbol{A})\to\boldsymbol{\Sigma}}\right\}$$

and the message passed from $p(\boldsymbol{\Sigma}|\boldsymbol{A})$ to $\boldsymbol{A}$ has the form

$$m_{p(\boldsymbol{\Sigma}|\boldsymbol{A})\to\boldsymbol{A}}(\boldsymbol{A})=\exp\left\{\begin{bmatrix}\log|\boldsymbol{A}|\\ \operatorname{vech}(\boldsymbol{A}^{-1})\end{bmatrix}^{\mathsf T}\boldsymbol{\eta}_{p(\boldsymbol{\Sigma}|\boldsymbol{A})\to\boldsymbol{A}}\right\}.$$

Algorithm 2 gives the full set of updates of the message natural parameter vectors and graphs for the iterated Inverse G-Wishart fragment. The derivation of Algorithm 2 is given in Section S.2 of the web-supplement.

Graph Input: $G\in\{G_{\text{full}},G_{\text{diag}}\}$.
Shape Parameter Input: $\xi>0$.
Message Graph Input: $G_{\boldsymbol{A}\to p(\boldsymbol{\Sigma}|\boldsymbol{A})}\in\{G_{\text{full}},G_{\text{diag}}\}$.
Natural Parameter Inputs: $\boldsymbol{\eta}_{p(\boldsymbol{\Sigma}|\boldsymbol{A})\to\boldsymbol{\Sigma}}$, $\boldsymbol{\eta}_{\boldsymbol{\Sigma}\to p(\boldsymbol{\Sigma}|\boldsymbol{A})}$, $\boldsymbol{\eta}_{p(\boldsymbol{\Sigma}|\boldsymbol{A})\to\boldsymbol{A}}$, $\boldsymbol{\eta}_{\boldsymbol{A}\to p(\boldsymbol{\Sigma}|\boldsymbol{A})}$.
Updates:
  $G_{p(\boldsymbol{\Sigma}|\boldsymbol{A})\to\boldsymbol{\Sigma}}\leftarrow G$ ; $G_{p(\boldsymbol{\Sigma}|\boldsymbol{A})\to\boldsymbol{A}}\leftarrow G_{\boldsymbol{A}\to p(\boldsymbol{\Sigma}|\boldsymbol{A})}$
  If $G_{\boldsymbol{A}\to p(\boldsymbol{\Sigma}|\boldsymbol{A})}=G_{\text{full}}$ then $\omega_{\boldsymbol{A}}\leftarrow(d+1)/2$
  If $G_{\boldsymbol{A}\to p(\boldsymbol{\Sigma}|\boldsymbol{A})}=G_{\text{diag}}$ then $\omega_{\boldsymbol{A}}\leftarrow1$
  $E_q(\boldsymbol{A}^{-1})\leftarrow\{(\boldsymbol{\eta}_{p(\boldsymbol{\Sigma}|\boldsymbol{A})\leftrightarrow\boldsymbol{A}})_1+\omega_{\boldsymbol{A}}\}\{\operatorname{vec}^{-1}(\boldsymbol{D}_d^{+\mathsf T}(\boldsymbol{\eta}_{p(\boldsymbol{\Sigma}|\boldsymbol{A})\leftrightarrow\boldsymbol{A}})_2)\}^{-1}$
  $\boldsymbol{\eta}_{p(\boldsymbol{\Sigma}|\boldsymbol{A})\to\boldsymbol{\Sigma}}\leftarrow\begin{bmatrix}-(\xi+2)/2\\ -\tfrac12\boldsymbol{D}_d^{\mathsf T}\operatorname{vec}\{E_q(\boldsymbol{A}^{-1})\}\end{bmatrix}$
  If $G=G_{\text{full}}$ then $\omega_{\boldsymbol{\Sigma}}\leftarrow(d+1)/2$
  If $G=G_{\text{diag}}$ then $\omega_{\boldsymbol{\Sigma}}\leftarrow1$
  $E_q(\boldsymbol{\Sigma}^{-1})\leftarrow\{(\boldsymbol{\eta}_{p(\boldsymbol{\Sigma}|\boldsymbol{A})\leftrightarrow\boldsymbol{\Sigma}})_1+\omega_{\boldsymbol{\Sigma}}\}\{\operatorname{vec}^{-1}(\boldsymbol{D}_d^{+\mathsf T}(\boldsymbol{\eta}_{p(\boldsymbol{\Sigma}|\boldsymbol{A})\leftrightarrow\boldsymbol{\Sigma}})_2)\}^{-1}$
  If $G=G_{\text{full}}$ then $\boldsymbol{\eta}_{p(\boldsymbol{\Sigma}|\boldsymbol{A})\to\boldsymbol{A}}\leftarrow\begin{bmatrix}-(\xi-d+1)/2\\ -\tfrac12\boldsymbol{D}_d^{\mathsf T}\operatorname{vec}\{E_q(\boldsymbol{\Sigma}^{-1})\}\end{bmatrix}$
  If $G=G_{\text{diag}}$ then $\boldsymbol{\eta}_{p(\boldsymbol{\Sigma}|\boldsymbol{A})\to\boldsymbol{A}}\leftarrow\begin{bmatrix}-\xi/2\\ -\tfrac12\boldsymbol{D}_d^{\mathsf T}\operatorname{vec}\{E_q(\boldsymbol{\Sigma}^{-1})\}\end{bmatrix}$
Outputs: $G_{p(\boldsymbol{\Sigma}|\boldsymbol{A})\to\boldsymbol{\Sigma}}$, $G_{p(\boldsymbol{\Sigma}|\boldsymbol{A})\to\boldsymbol{A}}$, $\boldsymbol{\eta}_{p(\boldsymbol{\Sigma}|\boldsymbol{A})\to\boldsymbol{\Sigma}}$, $\boldsymbol{\eta}_{p(\boldsymbol{\Sigma}|\boldsymbol{A})\to\boldsymbol{A}}$.
Algorithm 2 The inputs, updates and outputs for the iterated Inverse G-Wishart fragment.
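A hedged R sketch of the Algorithm 2 updates follows; it reuses igw.EXinv and duplication.matrix from the earlier sketches, stores messages as (graph, eta) lists as in Section 4, and is a schematic rendering rather than the authors' supplementary implementation.

```r
# Hedged sketch of Algorithm 2: the iterated Inverse G-Wishart fragment
# updates, with graphs encoded as the strings "full" and "diag".
iterated.igw.fragment <- function(G, xi, msg.Sigma.to.f, msg.A.to.f,
                                  msg.f.to.Sigma, msg.f.to.A) {
  d  <- (sqrt(8*length(msg.A.to.f$eta) - 7) - 1)/2  # recover d from eta length
  Dd <- duplication.matrix(d)
  # Message to Sigma, using E_q(A^{-1}) from the A-side message product:
  eta.A <- msg.f.to.A$eta + msg.A.to.f$eta
  EAinv <- igw.EXinv(eta.A, d, msg.A.to.f$graph)
  msg.f.to.Sigma <- list(graph = G,
    eta = c(-(xi + 2)/2, -0.5 * as.vector(t(Dd) %*% as.vector(EAinv))))
  # Message to A, using E_q(Sigma^{-1}) from the Sigma-side message product:
  eta.S <- msg.f.to.Sigma$eta + msg.Sigma.to.f$eta
  ESinv <- igw.EXinv(eta.S, d, G)
  eta1.A <- if (G == "full") -(xi - d + 1)/2 else -xi/2
  msg.f.to.A <- list(graph = msg.A.to.f$graph,
    eta = c(eta1.A, -0.5 * as.vector(t(Dd) %*% as.vector(ESinv))))
  list(to.Sigma = msg.f.to.Sigma, to.A = msg.f.to.A)
}
```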

6.1 Corrections to Section 4.1.3 of Wand (2017)

The iterated Inverse G-Wishart fragment was introduced in Section 4.1.3 of Wand (2017) and is one of the five fundamental fragments of semiparametric regression given in Table 1 of that article. However, there are some errors due to the author of Wand (2017) failing to recognise particular subtleties regarding the Inverse G-Wishart distribution, as discussed in Section 2.2. We now point out misleading or erroneous aspects of Section 4.1.3 of Wand (2017).

Firstly, the notation of Wand (2017) differs from that used here, with different symbols playing the roles of $\boldsymbol{\Sigma}$ and $\boldsymbol{A}$ and with the common dimension of $\boldsymbol{\Sigma}$ and $\boldsymbol{A}$ denoted by $d$. The first displayed equation of Section 4.1.3, labelled (11) here, states the fragment specification in Inverse Wishart form, but it is only in the $G=G_{\text{full}}$ case that such a statement is reasonable for general $d$. When $G=G_{\text{diag}}$ the shape parameter takes a different value according to the notation used in the current article. Therefore, (11) involves a different parameterization to that used throughout this article, and our first correction is to replace the first displayed equation of Section 4.1.3 of Wand (2017) by

$$\boldsymbol{\Sigma}\,|\,\boldsymbol{A}\sim\text{Inverse-G-Wishart}(G,\xi,\boldsymbol{A}^{-1}),$$

where the value of $\xi$ depends on whether $G=G_{\text{full}}$ or $G=G_{\text{diag}}$.

The sentence in Section 4.1.3 of Wand (2017) that begins “The fragment factor is of the form” should likewise be amended so that the factor is expressed in the Inverse G-Wishart form $p(\boldsymbol{\Sigma}|\boldsymbol{A})$ used in Section 6, rather than in Inverse Wishart form.
In equation (31) of Wand (2017), the first entry of the vector on the right-hand side of the “$\leftarrow$” should be $-(\xi+2)/2$, in keeping with the natural parameterization (7).

To match the correct parameterization of the Inverse G-Wishart distribution, as used in the current article, equation (32) of Wand (2017) should also be expressed in terms of (7).

The expectation equation in Section 4.1.3 of Wand (2017) should be replaced by

$$E(\boldsymbol{X}^{-1})=(\eta_1+\omega)\{\operatorname{vec}^{-1}(\boldsymbol{D}_d^{+\mathsf T}\boldsymbol{\eta}_2)\}^{-1},$$

where $\omega$ depends on the graph of the Inverse G-Wishart distribution corresponding to $\boldsymbol{\eta}=(\eta_1,\boldsymbol{\eta}_2^{\mathsf T})^{\mathsf T}$. If the graph is $G_{\text{full}}$ then $\omega=(d+1)/2$ and if the graph is $G_{\text{diag}}$ then $\omega=1$.

Lastly, the iterated Inverse G-Wishart fragment natural parameter updates given by equations (36) and (37) of Wand (2017) are affected by the oversights described in the preceding paragraphs. They should be replaced by the updates given in Algorithm 2 with the appropriate graph and shape parameter inputs.

7 Use of the Fragments for Covariance Matrix Prior Specification

The underlying rationale for the Inverse G-Wishart prior and iterated Inverse G-Wishart fragments is their ability to facilitate the specification of a wide range of covariance matrix priors within the variational message passing framework. In the $1\times1$ special case, covariance matrix parameters reduce to variance parameters and their square roots are standard deviation parameters. In this section we spell out how the fragments, and their natural parameter updates in Algorithms 1 and 2, can be used for prior specification in important special cases.

7.1 Imposing an Inverse Chi-Squared Prior on a Variance Parameter

Let $\sigma^2$ be a variance parameter and consider the prior imposition

$$\sigma^2\sim\text{Inverse-}\chi^2(\delta_{\sigma^2},\lambda_{\sigma^2})$$

for hyperparameters $\delta_{\sigma^2},\lambda_{\sigma^2}>0$ within a variational message passing scheme. Then Algorithm 1 should be called with inputs set to:

$$G_{\boldsymbol{\Theta}}\leftarrow G_{\text{diag}},\quad\xi_{\boldsymbol{\Theta}}\leftarrow\delta_{\sigma^2},\quad\boldsymbol{\Lambda}_{\boldsymbol{\Theta}}\leftarrow\lambda_{\sigma^2}.$$
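As a usage sketch, the Algorithm 1 function from Section 5's R sketch can be called with these settings; the hyperparameter values 0.01 are arbitrary illustrative choices.

```r
# Usage sketch: imposing sigma^2 ~ Inverse-Chi-Squared(0.01, 0.01) via the
# igw.prior.fragment() sketch above (d = 1, so Lambda is a 1 x 1 matrix).
prior.msg <- igw.prior.fragment(G.Theta = "diag", xi.Theta = 0.01,
                                Lambda.Theta = matrix(0.01, 1, 1))
prior.msg$eta    # c(-(0.01 + 2)/2, -0.005)
```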

7.2 Imposing an Inverse Gamma Prior on a Variance Parameter

Let $\sigma^2$ be a variance parameter and consider the prior imposition

$$\sigma^2\sim\text{Inverse-Gamma}(\alpha_{\sigma^2},\beta_{\sigma^2})\qquad(12)$$

for hyperparameters $\alpha_{\sigma^2},\beta_{\sigma^2}>0$. The density function corresponding to