Models involving higher level random effects commonly arise in a variety of contexts. The areas of study known as longitudinal data analysis (e.g. Fitzmaurice et al.
, 2008), mixed models (e.g. Pinheiro & Bates, 2000), multilevel models (e.g. Goldstein, 2010), panel data analysis (e.g. Baltagi, 2013) and small area estimation (e.g. Rao & Molina, 2015) potentially each require the handling of higher levels of nesting. Our main focus in this article is providing explicit algorithms that facilitate variational inference for up to three-level random effects and a pathway for handling even higher levels. Both direct and message passing approaches to mean field variational Bayes are treated. We also provide algorithms for frequentist best linear unbiased prediction that, to our knowledge, are not in the current literature and may be viewed as extensions of the exquisite streamlined mixed model computational strategies presented in Section 2.2 of Pinheiro & Bates (2000).
A useful prototype setting for understanding the nature of these models and their computational challenges is a fictitious sociology example in which residents (level 1 units) are divided into different towns (level 2 units) and those towns are divided into different districts (level 3 units). Following Goldstein (2010) we call these three-level data, although note that Pinheiro & Bates (2000) use the term “two-level”, corresponding to two levels of nesting, for the same setting. Figure 1
displays simulated regression data generated according to this setting, with a single predictor variable corresponding to years of education and the response corresponding to annual income. In Figure 1, the number of districts is 6, the number of towns per district is 8 and the resident sample size within each town is 25.
In each panel of Figure 1
, the line corresponds to the best linear unbiased prediction fit of a three-level random intercepts and slopes linear mixed model, as explained in Section 5.1. The variational Bayesian analogue, covered in Section 5.2
, is such that best linear unbiased prediction is replaced by variational approximate posterior means and confidence intervals are replaced by variational approximate credible intervals. Now suppose that the group and sample sizes are much larger with, say, 500 districts, 60 towns per district and 1,000 residents per town. Then naïve fitting is storage-greedy and computationally challenging since the combined fixed and random effects design matrices have almost two trillion entries, of which at least 99.99% equal zero. A major contribution of this article is explaining how variational inference can be achieved using only the 0.01% non-zero design matrix components, with updates that are linear in the numbers of groups.
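The storage burden is easy to quantify with a back-of-envelope calculation. The sketch below assumes a random intercepts and slopes model with two fixed effects and two random effects at each grouping level; these coefficient counts are our illustrative assumptions, while the group sizes are those quoted above:

```python
# Back-of-envelope sparsity count for the combined fixed and random
# effects design matrix of a three-level random intercepts and slopes
# model.  The group sizes match those quoted in the text; the
# per-level coefficient counts are illustrative assumptions.
n_districts = 500          # level 3 groups
n_towns = 60               # level 2 groups per district
n_residents = 1000         # level 1 units per town
p = 2                      # fixed effects: intercept and slope
q = 2                      # random intercept and slope at each level

n_rows = n_districts * n_towns * n_residents
n_cols = p + q * n_districts + q * n_districts * n_towns
total_entries = n_rows * n_cols

# each row has p nonzeros from the fixed effects columns plus q
# nonzeros for its district and q nonzeros for its town
nonzero_entries = n_rows * (p + 2 * q)
fraction_zero = 1 - nonzero_entries / total_entries

print(f"{total_entries:.3e} entries, {100 * fraction_zero:.4f}% zero")
```

Under these assumptions the design matrix has roughly 1.8 trillion entries, of which more than 99.99% are zero, in line with the figures quoted above.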
Our streamlined variational inference algorithms for higher level random effects models rely on four theorems provided by Nolan & Wand (2019), concerning linear system solutions and sub-blocks of matrix inverses for two-level and three-level sparse matrix problems, which are the basis for the fundamental Algorithms 1–4 in Section 3. In that article, as well as here, we treat one higher level situation at a time. Although four-level and higher level situations may be of interest in future analyses, the required theory is not yet in place. As we will see, covering both direct and message passing approaches for just the two-level and three-level cases is already a substantial task. Nevertheless, our results and algorithms shed important light on streamlined variational inference for general higher level random effects models.
After laying out the four fundamental algorithms in Section 3 we then derive an additional ten algorithms, labeled Algorithms 5–14, that facilitate streamlined frequentist and variational inference for two-level and three-level linear mixed models. Algorithms 5 and 10 treat best linear unbiased prediction as a prelude to the closely related mean field variational Bayes analogues, which are dealt with in Algorithms 6 and 11. The remaining algorithms are concerned with streamlined factor graph fragment updates according to the variational message passing infrastructure described in Wand (2017). As explained there, the message passing approach has the advantage of compartmentalizing the variational inference algebra and code. The inherent complexity of streamlined variational inference for higher level random effects models is such that the current article is restricted to ordinary linear mixed models. Extensions such as generalized additive mixed models with higher level random effects and higher level group-specific curve models follow from Algorithms 1–4, but must be treated elsewhere. Section 8 provides further details on this matter.
Our algorithms also build on previous work on streamlined variational inference for similar classes of models described in Lee & Wand (2016). However, Lee & Wand (2016) only treated the two-level case, did not employ QR decomposition enhancement and did not include any variational message passing algorithms. The current article is a systematic treatment of higher level random effects models beyond the common two-level case.
Section 2 provides background material concerning variational inference. In Section 3 we present four algorithms for solving higher level sparse matrix problems which are fundamental for variational inference involving general models with hierarchical random effects structure. The two-level situation is treated in Section 4, followed by treatment of the three-level situation in Section 5. Section 6 demonstrates the speed advantages of streamlining for variational inference in random effects models via some timing studies. Illustration for data from a large perinatal health study is given in Section 7. In Section 8 we close with some discussion about extensions to other settings.
2 Variational Inference Background
In keeping with the theme of this article, we will explain the essence of variational inference for a general class of Bayesian linear mixed models. Summaries of variational inference in wider statistical contexts are given in Ormerod & Wand (2010) and Blei, Kucukelbir & McAuliffe (2017).
Suppose that the response data vector y is modeled according to a Bayesian version of the Gaussian linear mixed model (e.g. Robinson, 1991)

y = Xβ + Zu + ε,  β ~ N(μ_β, Σ_β),  u | G ~ N(0, G),  ε | R ~ N(0, R),   (1)

for hyperparameters μ_β and Σ_β, and such that u and ε are independent. The β and u vectors are labeled fixed effects and random effects, respectively. Their corresponding design matrices are X and Z. We will allow for the possibility that prior specification for the covariance matrices G and R involves auxiliary covariance matrices A_G and A_R with conjugate Inverse G-Wishart distributions (Wand, 2017). The prior specification of G and R involves the specifications given in (2).
Figure 2 is a directed acyclic graph representation of (1) and (2). The circles, usually called nodes, correspond to the model’s random vectors and random matrices. The arrows depict conditional independence relationships (e.g. Bishop, 2006; Chapter 8).
Full Bayesian inference for β, the covariance matrices G and R, and the random effects u involves the posterior density function p(β, u, G, R, A_G, A_R | y), but this density typically is analytically intractable and Markov chain Monte Carlo approaches are required for practical ‘exact’ inference. Variational approximate inference involves mean field restrictions such as

p(β, u, G, R, A_G, A_R | y) ≈ q(β, u) q(G, R, A_G, A_R)   (3)

for density functions q(β, u) and q(G, R, A_G, A_R), which we call q-densities. The approximation at (3) represents the minimal product restriction for which practical variational inference algorithms arise. However, as explained in Section 10.2.5 of Bishop (2006), the graphical structure of Figure 2 induces further product density forms and the right-hand side of (3) admits the further factorization

q(β, u) q(G) q(R) q(A_G) q(A_R).   (4)
With this product density form in place, the forms and optimal parameters for the q-densities are obtained by minimizing the Kullback-Leibler divergence of the right-hand side of (3) from its left-hand side. The optimal q-density parameters are interdependent and a coordinate ascent algorithm (e.g. Algorithm 1 of Ormerod & Wand, 2010) is used to obtain their solution. For example, the optimal q-density for (β, u), denoted by q*(β, u), is a Multivariate Normal density function with mean vector μ_q(β,u) and covariance matrix Σ_q(β,u). The coordinate ascent algorithm is such that they are updated according to

Σ_q(β,u) ← {Cᵀ E_q(R⁻¹) C + blockdiag(Σ_β⁻¹, E_q(G⁻¹))}⁻¹,
μ_q(β,u) ← Σ_q(β,u) {Cᵀ E_q(R⁻¹) y + [Σ_β⁻¹ μ_β; 0]},   (5)

where E_q(G⁻¹) and E_q(R⁻¹) are the q-density expectations of G⁻¹ and R⁻¹, and C ≡ [X Z]. If, for example, (1) corresponds to a mixed model with three-level random effects of the type described in Section 1, then, with 60 groups at level 2 and 500 groups at level 3, the matrix C has almost 2 trillion entries of which 99.99% are zero. Moreover, Σ_q(β,u) has approximately 3.7 billion entries, of which only about 0.016% are required for variational inference under mean field restriction (3). Avoiding the wastage of the naïve updates given by (5) is the crux of this article and is dealt with in the upcoming sections. The updates for q(G) and q(R) depend on the parameterizations of G and R; for example, R = σ²I for some σ² > 0 throughout Sections 4 and 5. However, these covariance parameter updates are relatively simple and free of storage and computational efficiency issues. Similar comments apply to the updates for the q-density parameters of A_G and A_R.
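The 0.016% figure can be reproduced with one plausible accounting of which covariance sub-blocks streamlined inference retains. The block inventory below is our own assumption (two fixed effects, two random effects per level, with both orientations of each cross-block counted), offered as a back-of-envelope check rather than a statement of the algorithms:

```python
# Counting the entries of the q-density covariance matrix of the
# stacked fixed and random effects that streamlined variational
# inference plausibly needs.  The dimension choices (2 fixed effects,
# 2 random effects per level) and the inventory of required
# sub-blocks are illustrative assumptions.
p, q = 2, 2
m3, m2 = 500, 60                   # level 3 groups; level 2 groups per level 3 group

dim = p + q * m3 + q * m3 * m2     # dimension of the stacked coefficient vector
full_entries = dim * dim           # full covariance matrix: ~3.7 billion entries

# required: the fixed effects block; for each level 3 group its own
# block and the cross-blocks with the fixed effects (both
# orientations); for each level 2 group its own block and the
# cross-blocks with the fixed effects and its level 3 parent
needed = (p * p
          + m3 * (q * q + 2 * p * q)
          + m3 * m2 * (q * q + 2 * p * q + 2 * q * q))
fraction = needed / full_entries
print(dim, full_entries, f"{100 * fraction:.4f}%")
```

Under these assumptions the full covariance matrix has about 3.7 billion entries while only about 600 thousand are retained, a fraction of roughly 0.016%, matching the figure quoted above.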
An alternative approach to obtaining the relevant sub-blocks of the q-density covariance matrix, as well as the covariance and auxiliary variable q-parameter updates, is to use the notion of message passing on a factor graph. The relevant factor graph for model (1), according to the product density form (4), is shown in Figure 3.
The factors of this graph correspond to the terms in the joint density function of the model's random quantities, with factorization according to the conditional independence structure apparent from Figure 2. Then, as explained in e.g. Minka (2005), the q-density of (β, u) can be expressed as a product of factor-to-node functions known as messages, with subscripts indicating the factor from which, and the stochastic node to which, each message is passed. Messages are simply functions of the stochastic node to which the message is passed and, for mean field variational inference, are formed according to rules listed in Minka (2005) and Section 2.5 of Wand (2017). To compartmentalize algebra and coding for variational message passing, Wand (2017) advocates the use of fragments, which are sub-graphs of a factor graph containing a single factor and each of its neighboring stochastic nodes. In Sections 4 and 5 of Wand (2017), eight important fragments are identified and treated, including those needed for a wide range of linear mixed models. However, in the interests of brevity, Wand (2017) ignored issues surrounding potentially very large and sparse matrices in the message parameter vectors. In Sections 4 and 5 of this article, we explain how the messages passed to the (β, u) node can be streamlined to avoid massive sparse matrices.
A core component of the message passing approach to variational inference is the use of exponential family forms, sufficient statistics and natural parameters. For a d × 1 Multivariate Normal random vector x, this involves re-expression of its density function in exponential family form, with sufficient statistic vector [x; vech(xxᵀ)] and a corresponding natural parameter vector. The matrix D_d, known as the duplication matrix of order d, is the d² × d(d+1)/2 matrix containing only zeroes and ones such that D_d vech(A) = vec(A) for any symmetric d × d matrix A.
The accompanying log-partition function involves D_d⁺ ≡ (D_dᵀ D_d)⁻¹ D_dᵀ, the Moore-Penrose inverse of D_d, which is such that D_d⁺ vec(A) = vech(A) whenever A is symmetric. The inverse of the natural parameter transformation recovers the mean vector and covariance matrix from the natural parameter vector.
The vec and vech matrix operators are reasonably well-established (e.g. Gentle, 2007). If a is a d² × 1 vector then vec⁻¹(a) is the d × d matrix such that vec(vec⁻¹(a)) = a. We also require vec inversion to non-square matrices. If a is a d₁d₂ × 1 vector then vec⁻¹_{d₁×d₂}(a) is the d₁ × d₂ matrix such that vec(vec⁻¹_{d₁×d₂}(a)) = a.
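These operators are straightforward to realize concretely. The following sketch builds the duplication matrix and the vec, vech and vec-inverse operators directly from their definitions; the function names are ours:

```python
import numpy as np

def vec(A):
    # stack the columns of A into a single vector
    return A.reshape(-1, order="F")

def vech(A):
    # stack the on-and-below-diagonal entries of A, column by column
    d = A.shape[0]
    return np.concatenate([A[j:, j] for j in range(d)])

def vec_inv(a, d1, d2):
    # inverse of vec for a (possibly non-square) d1 x d2 target matrix
    return np.asarray(a).reshape((d1, d2), order="F")

def duplication_matrix(d):
    # zero-one matrix D_d satisfying D_d @ vech(A) == vec(A)
    # for every symmetric d x d matrix A
    D = np.zeros((d * d, d * (d + 1) // 2))
    col = 0
    for j in range(d):
        for i in range(j, d):
            D[i + j * d, col] = 1.0
            D[j + i * d, col] = 1.0   # mirror entry; redundant when i == j
            col += 1
    return D
```

For a symmetric matrix A, `duplication_matrix(d) @ vech(A)` reproduces `vec(A)`, and `vec_inv` undoes `vec` for non-square targets.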
The other major distributional family used throughout this article is a generalization of the Inverse Wishart distribution known as the Inverse G-Wishart distribution. It corresponds to the matrix inverses of random matrices that have a G-Wishart distribution (e.g. Atay-Kayis & Massam, 2005). For any positive integer d, let G be an undirected graph with d nodes labeled 1, …, d and edge set E consisting of pairs of nodes that are connected by an edge. We say that the symmetric d × d matrix M respects G if M_ij = 0 whenever i ≠ j and {i, j} ∉ E.
A random matrix X has an Inverse G-Wishart distribution with graph G and parameters ξ > 0 and symmetric matrix Λ, written X ~ Inverse-G-Wishart(G, ξ, Λ), if and only if the density function of X is proportional to |X|^{-(ξ+2)/2} exp{-½ tr(Λ X⁻¹)} over arguments X such that X is symmetric and positive definite and X⁻¹ respects G. Two important special cases are G = G_full, the totally connected graph, for which the Inverse G-Wishart distribution coincides with the ordinary Inverse Wishart distribution, and G = G_diag, the totally disconnected graph, for which the Inverse G-Wishart distribution coincides with a product of independent Inverse Chi-Squared random variables. The subscripts of G_full and G_diag reflect the fact that X is a full matrix in the first special case and a diagonal matrix in the second.
The G = G_full case corresponds to the ordinary Inverse Wishart distribution. However, with message passing in mind, we will work with the more general Inverse G-Wishart family throughout this article.
In the special case d = 1 the graphs G_full and G_diag coincide, and the Inverse G-Wishart distribution reduces to the Inverse Chi-Squared distribution. Throughout this article we write x ~ Inverse-χ²(ξ, λ) for this special case, with both the shape parameter ξ and the scale parameter λ scalar.
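As a concrete check of the d = 1 special case, the sketch below evaluates one common parameterization of the scaled Inverse Chi-Squared density (its Inverse-Gamma(ν/2, s/2) form; the parameter names are ours and need not match this article's exact convention) and verifies numerically that it integrates to one:

```python
import numpy as np
from math import gamma

def inv_chi_sq_density(x, nu, s):
    # scaled inverse chi-squared density in its Inverse-Gamma(nu/2, s/2)
    # form; one common parameterization, used here for illustration
    return (s / 2) ** (nu / 2) / gamma(nu / 2) * x ** (-nu / 2 - 1) * np.exp(-s / (2 * x))

# trapezoidal check that the density integrates to (approximately) one;
# the mass outside [1e-3, 200] is negligible for these parameter values
x = np.linspace(1e-3, 200.0, 400_001)
f = inv_chi_sq_density(x, nu=5.0, s=3.0)
area = float(np.sum((f[:-1] + f[1:]) / 2) * (x[1] - x[0]))
print(area)
```

The computed area is within a small numerical tolerance of one, confirming that the expression is a proper density for these parameter values.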
Finally, we remark on the p and q notation used for density functions in this article. In the variational inference literature these letters have become very commonplace for denoting density functions, with p for those corresponding to the model and q for the density functions of parameters according to the mean field approximation. However, the same letters are commonly used as dimension variables in the mixed models literature (e.g. Pinheiro & Bates, 2000). Therefore we use ordinary p and q as dimension variables and scripted versions of these letters (𝓅 and 𝓆) for density functions.
3 Multilevel Sparse Matrix Problem Algorithms
A key observation in this work is that streamlining of variational inference algorithms for higher level random effects models can be achieved by recognition and isolation of a few fundamental algorithms, which we call multilevel sparse matrix problem algorithms. These algorithms, based on the results of Nolan & Wand (2019), are identical to those used traditionally for fitting frequentist random effects models (Pinheiro & Bates, 2000). For each level there are two types of sparse matrix solution algorithms: one that applies to general forms and one that uses a QR-decomposition enhancement for a particular form that arises commonly for models containing random effects. Both types are needed for variational inference.
In theory, based on the infrastructure laid out in Nolan & Wand (2019), any number of levels can be handled. However, each higher level brings increasing complexity. Here we restrict attention to two-level and three-level sparse matrix algorithms.
3.1 Two-Level Sparse Matrix Algorithms
Two-level sparse matrix problems are described in Section 2 of Nolan & Wand (2019). The notation used there is also used in this section. Here we present two algorithms, named
SolveTwoLevelSparseMatrix and SolveTwoLevelSparseLeastSquares
which are at the heart of streamlining variational inference for two-level models.
The SolveTwoLevelSparseMatrix algorithm is concerned with solving the general two-level sparse linear system Ax = a, where A has the two-level sparse form defined in (7), and with obtaining the sub-matrices corresponding to the non-zero blocks of A⁻¹. As will be elaborated upon later, the remaining blocks of A⁻¹ are not of interest. SolveTwoLevelSparseMatrix is listed as Algorithm 1 and is justified by Theorem 1 of Nolan & Wand (2019).
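To make the structure concrete, the following toy sketch solves a small two-level sparse linear system by eliminating the group blocks through a Schur complement, then checks the answer against a dense solve. It illustrates the algebraic identity underlying Algorithm 1 rather than reproducing the algorithm's exact steps; all dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_spd(d):
    # random symmetric positive definite block
    M = rng.standard_normal((d, d))
    return M @ M.T + d * np.eye(d)

p, q, m = 3, 2, 4                  # corner size, group block size, number of groups
A11 = rand_spd(p)
A12 = [0.1 * rng.standard_normal((p, q)) for _ in range(m)]   # mild couplings
A22 = [rand_spd(q) for _ in range(m)]
a1 = rng.standard_normal(p)
a2 = [rng.standard_normal(q) for _ in range(m)]

# streamlined solve: accumulate the p x p Schur complement, solve for
# the corner unknowns, then back-solve each group block separately;
# the full (p + m q) x (p + m q) matrix is never formed
S, t = A11.copy(), a1.copy()
for Ai, Bi, bi in zip(A12, A22, a2):
    S -= Ai @ np.linalg.solve(Bi, Ai.T)
    t -= Ai @ np.linalg.solve(Bi, bi)
x1 = np.linalg.solve(S, t)
x2 = [np.linalg.solve(Bi, bi - Ai.T @ x1) for Ai, Bi, bi in zip(A12, A22, a2)]
x_streamlined = np.concatenate([x1] + x2)

# dense reference solve on the assembled two-level sparse matrix
A = np.zeros((p + m * q, p + m * q))
A[:p, :p] = A11
for i in range(m):
    r = p + i * q
    A[:p, r:r + q] = A12[i]
    A[r:r + q, :p] = A12[i].T
    A[r:r + q, r:r + q] = A22[i]
x_dense = np.linalg.solve(A, np.concatenate([a1] + a2))
print(np.allclose(x_streamlined, x_dense))
```

The streamlined path touches only the non-zero blocks, so its cost grows linearly in the number of groups, whereas the dense reference solve grows cubically in the total dimension.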
The SolveTwoLevelSparseLeastSquares algorithm arises in the special case where x is the minimizer of the least squares problem ‖b − Bx‖², where B and b have two-level block structures. In this case A = BᵀB and a = Bᵀb, so that the sub-blocks of A and a take special forms induced by the structures of B and b.
As demonstrated in Section 4, these forms arise in two-level random effects models. Theorem 2 of Nolan & Wand (2019) shows that this special form lends itself to a QR decomposition (e.g. Harville, 2008; Section 6.4.d) approach which has speed and stability advantages in regression settings (e.g. Gentle, 2007; Section 6.7.2).
SolveTwoLevelSparseLeastSquares is listed as Algorithm 2. Note that the symbol used to denote the number of rows in each group's sub-matrices differs from n, to avoid a notational clash with common grouped data dimension notation as used in Section 4. In the first loop over the groups of data, upper triangular matrices are obtained via QR decomposition, a standard procedure within most computing environments. Following that, all matrix equations involve upper triangular matrices and so can be solved rapidly via back-substitution.
Note that in Algorithm 2 calculations involving the transposed orthogonal factors do not require explicit storage of those factors or the use of ordinary matrix multiplication. Standard matrix algebraic programming languages store the information concerning each QR decomposition in a compact form (for example, as a product of Householder reflections) from which the required products can be obtained efficiently.
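The QR idea in isolation is simple. The following sketch (plain NumPy, not the streamlined multi-group version) solves a least squares problem through the R factor and back-substitution and compares the result against a direct least squares routine:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((50, 4))       # stacked design matrix
b = rng.standard_normal(50)            # stacked response

# thin QR: B = Q R with R upper triangular; the least squares solution
# solves the small triangular system R x = Q^T b, which is fast and
# numerically stable
Q, R = np.linalg.qr(B)                 # mode="reduced" by default
x_qr = np.linalg.solve(R, Q.T @ b)

# since B^T B = R^T R, sub-blocks of (B^T B)^{-1} = R^{-1} R^{-T} are
# also available from R alone, without ever forming B^T B
x_ref, *_ = np.linalg.lstsq(B, b, rcond=None)
print(np.allclose(x_qr, x_ref))
```

Working through R rather than the normal equations avoids squaring the condition number of B, which is the stability advantage referred to above.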
3.2 Three-Level Sparse Matrix Algorithms
Extension to the three-level situation is described in Section 3 of Nolan & Wand (2019). Theorems 3 and 4 given there lead to the algorithms
SolveThreeLevelSparseMatrix and SolveThreeLevelSparseLeastSquares
which facilitate streamlining variational inference for three-level models.
An illustrative three-level sparse matrix is:
and corresponds to particular choices of the level 2 group sizes and the level 3 group size. A general three-level sparse matrix consists of the following components:
A matrix designated the (1,1)-block position.
A set of partitioned matrices designated the (1,2)-block position, with one partition for each level 3 group and, within each partition, one sub-block for each of that group's level 2 groups.
A (2,1)-block, which is simply the transpose of the (1,2)-block.
A block diagonal structure along the (2,2)-block position, where each sub-block is a two-level sparse matrix, as defined in (7).
The three-level sparse linear system problem takes the form Ax = a, where the vectors x and a are partitioned conformably with A: leading sub-vectors conform with the (1,1)-block; then, for each level 3 group, a sub-vector conforms with that group's two-level diagonal sub-block; and, lastly, each of these sub-vectors is further partitioned with one piece per level 2 group.
The three-level sparse matrix inverse problem involves determination of the sub-blocks of the inverse corresponding to the non-zero sub-blocks of the original three-level sparse matrix. Our notation for these sub-blocks is illustrated by a small example case.
SolveThreeLevelSparseMatrix, which provides streamlined solutions for the general three-level sparse matrix problem, is listed as Algorithm 3.
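As with the two-level case, the underlying algebra can be sketched compactly: a three-level sparse system is a two-level system whose diagonal blocks are themselves two-level, so nested Schur-complement elimination solves it without ever assembling the full matrix. The toy below, with illustrative dimensions and helper names of our own choosing, checks this against a dense solve; it conveys the nesting idea rather than the exact steps of Algorithm 3:

```python
import numpy as np

rng = np.random.default_rng(2)

def rand_spd(d):
    M = rng.standard_normal((d, d))
    return M @ M.T + d * np.eye(d)

def two_level_solve(A11, A12, A22, a1, a2):
    # Schur-complement solve of a two-level sparse system; right-hand
    # sides may be vectors or matrices of stacked columns
    S, t = A11.copy(), a1.copy()
    for Ai, Bi, bi in zip(A12, A22, a2):
        S = S - Ai @ np.linalg.solve(Bi, Ai.T)
        t = t - Ai @ np.linalg.solve(Bi, bi)
    x1 = np.linalg.solve(S, t)
    return x1, [np.linalg.solve(Bi, bi - Ai.T @ x1) for Ai, Bi, bi in zip(A12, A22, a2)]

def assemble(A11, A12, A22):
    # dense assembly of a two-level sparse matrix, for reference only
    d0 = A11.shape[0]
    d = d0 + sum(B.shape[0] for B in A22)
    A = np.zeros((d, d))
    A[:d0, :d0] = A11
    r = d0
    for Ai, Bi in zip(A12, A22):
        s = Bi.shape[0]
        A[:d0, r:r + s], A[r:r + s, :d0], A[r:r + s, r:r + s] = Ai, Ai.T, Bi
        r += s
    return A

p, q1, q2 = 2, 2, 2
m = 2                     # level 3 groups
n = [2, 3]                # level 2 groups within each level 3 group

B11 = [rand_spd(q1) for i in range(m)]
B12 = [[0.1 * rng.standard_normal((q1, q2)) for _ in range(n[i])] for i in range(m)]
B22 = [[rand_spd(q2) for _ in range(n[i])] for i in range(m)]
dims = [q1 + n[i] * q2 for i in range(m)]
M11 = rand_spd(p)
M12 = [0.1 * rng.standard_normal((p, dims[i])) for i in range(m)]
a1 = rng.standard_normal(p)
a2 = [rng.standard_normal(dims[i]) for i in range(m)]

def inner_solve(i, R):
    # apply the inverse of the i-th two-level diagonal block to R,
    # using only its component blocks
    top, rest = R[:q1], R[q1:]
    groups = [rest[j * q2:(j + 1) * q2] for j in range(n[i])]
    y1, y2 = two_level_solve(B11[i], B12[i], B22[i], top, groups)
    return np.concatenate([y1] + y2, axis=0)

# outer Schur complement over the level 3 groups, with every solve
# against a diagonal block routed through the inner two-level solver
S, t = M11.copy(), a1.copy()
for i in range(m):
    S = S - M12[i] @ inner_solve(i, M12[i].T)
    t = t - M12[i] @ inner_solve(i, a2[i])
x1 = np.linalg.solve(S, t)
x_streamlined = np.concatenate(
    [x1] + [inner_solve(i, a2[i] - M12[i].T @ x1) for i in range(m)])

# dense reference: assemble each two-level block, then the full matrix
M_dense = assemble(M11, M12, [assemble(B11[i], B12[i], B22[i]) for i in range(m)])
x_dense = np.linalg.solve(M_dense, np.concatenate([a1] + a2))
print(np.allclose(x_streamlined, x_dense))
```

The same recursion is what makes extension to four and more levels conceptually possible, at the cost of the increasing bookkeeping discussed in Section 3.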