1 Introduction
Models involving higher level random effects commonly arise in a variety of contexts. The areas of study known as longitudinal data analysis (e.g. Fitzmaurice et al., 2008), mixed models (e.g. Pinheiro & Bates, 2000), multilevel models (e.g. Goldstein, 2010), panel data analysis (e.g. Baltagi, 2013) and small area estimation (e.g. Rao & Molina, 2015) each potentially require the handling of higher levels of nesting. Our main focus in this article is providing explicit algorithms that facilitate variational inference for up to three-level random effects, together with a pathway for handling even higher levels. Both direct and message passing approaches to mean field variational Bayes are treated. We also provide algorithms for frequentist best linear unbiased prediction that, to our knowledge, are not in the current literature and may be viewed as extensions of the exquisite streamlined mixed model computational strategies presented in Section 2.2 of Pinheiro & Bates (2000).
A useful prototype setting for understanding the nature and computational challenges is a fictitious sociology example in which residents (level 1 units) are divided into different towns (level 2 units) and those towns are divided into different districts (level 3 units). Following Goldstein (2010) we call these three-level data, although note that Pinheiro & Bates (2000) use the term “two-level”, corresponding to two levels of nesting, for the same setting. Figure 1
displays simulated regression data generated according to this setting with a single predictor variable corresponding to years of education and the response corresponding to annual income. In Figure
1, the number of districts is 6, the number of towns per district is 8 and the resident sample size within each town is 25. In each panel of Figure 1
, the line corresponds to the best linear unbiased prediction fit of a three-level random intercepts and slopes linear mixed model, as explained in Section
5.1. The variational Bayesian analogue, covered in Section 5.2, is such that best linear unbiased prediction is replaced by variational approximate posterior means and confidence intervals are replaced by variational approximate credible intervals. Now suppose that the group and sample sizes are much larger with, say, 500 districts, 60 towns per district and 1,000 residents per town. Then naïve fitting is storage-greedy and computationally challenging since the combined fixed and random effects design matrices have
entries of which at least 99.99% equal zero. A major contribution of this article is explaining how variational inference can be achieved using only the 0.01% nonzero design matrix components, with updates that are linear in the numbers of groups. Our streamlined variational inference algorithms for higher level random effects models rely on four theorems provided by Nolan & Wand (2019) concerning linear system solutions and subblocks of matrix inverses for two-level and three-level sparse matrix problems, which are the basis for the fundamental Algorithms 1–4 in Section 3. In that article, as well as here, we treat one higher level situation at a time. Even though four-level and even higher level situations may be of interest in future analyses, the required theory is not yet in place. As we will see, covering both direct and message passing approaches for just the two-level and three-level cases is quite a big task. Nevertheless, our results and algorithms shed important light on streamlined variational inference for general higher level random effects models.
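To make these magnitudes concrete, the dimension claims can be verified with a few lines of arithmetic. The sketch below assumes, as in the Figure 1 setting, two fixed effects and random intercepts and slopes at both higher levels, so each row of the combined design matrix has six nonzero entries; these particular dimension choices are our illustrative assumption.

```python
# Back-of-envelope check of the design matrix dimensions quoted above,
# for the 500 districts x 60 towns x 1,000 residents scenario.
n_districts = 500          # level 3 groups
n_towns = 60               # level 2 groups per district
n_residents = 1000         # level 1 units per town
q = 2                      # random intercept and slope (assumed)
p = 2                      # fixed intercept and slope (assumed)

n_rows = n_districts * n_towns * n_residents
n_cols = p + q * n_districts + q * n_districts * n_towns
n_entries = n_rows * n_cols
nonzero_frac = (p + 2 * q) / n_cols    # 6 nonzero entries per row

print(f"{n_entries:.2e} entries")            # about 1.8e12: almost 2 trillion
print(f"{100 * nonzero_frac:.4f}% nonzero")  # about 0.0098%, i.e. roughly 0.01%
```

The column count of 61,002 is dominated by the two random effects columns contributed by each of the 30,000 towns, which is why the nonzero fraction shrinks as the number of groups grows.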
After laying out the four fundamental algorithms in Section 3 we then derive an additional ten algorithms, labeled Algorithms 5–14, that facilitate streamlined frequentist and variational inference for two-level and three-level linear mixed models. Algorithms 5 and 10 treat best linear unbiased prediction as a prelude to the closely related mean field variational Bayes analogues, which are dealt with in Algorithms 6 and 11. The remaining four algorithms are concerned with streamlined factor graph fragment updates according to the variational message passing infrastructure described in Wand (2017). As explained there, the message passing approach has the advantage of compartmentalizing the variational inference algebra and code. The inherent complexity of streamlined variational inference for higher level random effects models is such that the current article is restricted to ordinary linear mixed models. Extensions such as generalized additive mixed models with higher level random effects and higher level group-specific curve models follow from Algorithms 1–4, but must be treated elsewhere. Section 8 provides further details on this matter.
Our algorithms also build on previous work on streamlined variational inference for similar classes of models described in Lee & Wand (2016). However, Lee & Wand (2016) only treated the two-level case, did not employ QR decomposition enhancement and did not include any variational message passing algorithms. The current article is a systematic treatment of higher level random effects models beyond the common two-level case.
Section 2 provides background material concerning variational inference. In Section 3 we present four algorithms for solving higher level sparse matrix problems, which are fundamental for variational inference involving general models with hierarchical random effects structure. The two-level situation is treated in Section 4, followed by treatment of the three-level situation in Section 5. Section 6 demonstrates the speed advantages of streamlining for variational inference in random effects models via some timing studies. Illustration for data from a large perinatal health study is given in Section 7. In Section 8 we close with some discussion about extensions to other settings.
2 Variational Inference Background
In keeping with the theme of this article, we will explain the essence of variational inference for a general class of Bayesian linear mixed models. Summaries of variational inference in wider statistical contexts are given in Ormerod & Wand (2010) and Blei, Kucukelbir & McAuliffe (2017).
Suppose that the response data vector
is modeled according to a Bayesian version of the Gaussian linear mixed model (e.g. Robinson, 1991)(1) 
for hyperparameters
and and such that and are independent. The and vectors are labeled fixed effects and random effects, respectively. Their corresponding design matrices are and . We will allow for the possibility that prior specification for the covariance matrices and involves auxiliary covariance matrices and with conjugate Inverse G-Wishart distributions (Wand, 2017). The prior specification of and involves the specifications(2) 
Figure 2 is a directed acyclic graph representation of (1) and (2). The circles, usually called nodes, correspond to the model’s random vectors and random matrices. The arrows depict conditional independence relationships (e.g. Bishop, 2006; Chapter 8).
Full Bayesian inference for the
, and and the random effects involves the posterior density function, which is typically analytically intractable, and so Markov chain Monte Carlo approaches are required for practical ‘exact’ inference. Variational approximate inference involves mean field restrictions such as
(3) 
for density functions and , which we call densities. The approximation at (3) represents the minimal product restriction for which practical variational inference algorithms arise. However, as explained in Section 10.2.5 of Bishop (2006), the graphical structure of Figure 2 induces further product density forms and the right-hand side of (3) admits the further factorization
(4) 
With this product density form in place, the forms and optimal parameters for the
densities are obtained by minimising the Kullback–Leibler divergence of the right-hand side of (3) from its left-hand side. The optimal density parameters are interdependent and a coordinate ascent algorithm (e.g. Algorithm 1 of Ormerod & Wand, 2010) is used to obtain their solution. For example, the optimal density for , denoted by , is a Multivariate Normal density function with mean vector and covariance matrix . The coordinate ascent algorithm is such that these are updated according to(5) 
where and are the density expectations of and and . If, for example, (1) corresponds to a mixed model with three-level random effects such that then, as pointed out in Section 1, with 60 groups at level 2 and 500 groups at level 3 the matrix has almost 2 trillion entries, of which 99.99% are zero. Moreover, is a matrix of which only about 0.016% of its approximately 3.7 billion entries are required for variational inference under mean field restriction (3). Avoiding the wastage of the naïve updates given by (5) is the crux of this article and is dealt with in the upcoming sections. The updates for and depend on parameterizations of and . For example, for some throughout Sections 4 and 5. However, these covariance parameter updates are relatively simple and free of storage and computational efficiency issues. Similar comments apply to the updates for the density parameters of and .
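As a deliberately scaled-down illustration of the interdependent updates at (5), the following sketch runs mean field coordinate ascent for a single-level Bayesian linear regression with an Inverse Gamma prior on the error variance. The model, hyperparameter values and variable names are our own illustrative choices, not the streamlined Algorithms of this article; the point is only to display the back-and-forth structure between the Multivariate Normal density update and the variance parameter's density update.

```python
import numpy as np

# Toy mean field coordinate ascent:
#   y | beta, sigma2 ~ N(X beta, sigma2 I),  beta ~ N(0, s2_beta I),
#   sigma2 ~ Inverse-Gamma(a, b),
# under the restriction q(beta, sigma2) = q(beta) q(sigma2).
rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.standard_normal(n)

a, b, s2_beta = 0.01, 0.01, 100.0   # vague hyperparameters (assumed)
E_recip_sigma2 = 1.0                # initialize E_q(1/sigma2)
for _ in range(50):
    # q(beta) is Multivariate Normal; cf. the update pattern at (5)
    Sigma_q = np.linalg.inv(E_recip_sigma2 * X.T @ X + np.eye(p) / s2_beta)
    mu_q = Sigma_q @ (E_recip_sigma2 * X.T @ y)
    # q(sigma2) is Inverse-Gamma; its expectation feeds back into q(beta)
    resid_sq = np.sum((y - X @ mu_q) ** 2) + np.trace(X.T @ X @ Sigma_q)
    E_recip_sigma2 = (a + n / 2) / (b + resid_sq / 2)

print(np.round(mu_q, 2))  # close to the least squares estimate
```

Even in this tiny example the update for q(beta) requires the expectation of the reciprocal variance under q(sigma2), and vice versa, which is exactly the interdependence that the coordinate ascent iterations resolve.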
An alternative approach to obtaining the relevant subblocks of and the covariance and auxiliary variable parameter updates is to use the notion of message passing on a factor graph. The relevant factor graph for model (1), according to the product density form (4), is shown in Figure 3.
The circles in Figure 3 correspond to the parameters in each factor of (4) and are referred to as stochastic nodes. The squares correspond to the factors of
(6) 
with factorization according to the conditional independence structure apparent from Figure 2. Then, as explained in e.g. Minka (2005), the density of can be expressed as
where
are known as messages, with the subscripts indicating that they are passed from to and to , respectively. Messages are simply functions of the stochastic node to which the message is passed and, for mean field variational inference, are formed according to rules listed in Minka (2005) and Section 2.5 of Wand (2017). To compartmentalize the algebra and coding for variational message passing, Wand (2017) advocates the use of fragments, which are subgraphs of a factor graph containing a single factor and each of its neighboring stochastic nodes. In Sections 4 and 5 of Wand (2017), eight important fragments are identified and treated, including those needed for a wide range of linear mixed models. However, in the interests of brevity, Wand (2017) ignored issues surrounding potentially very large and sparse matrices in the message parameter vectors. In Sections 4 and 5 of this article, we explain how the messages passed to the node can be streamlined to avoid massive sparse matrices.
A core component of the message passing approach to variational inference is exponential family forms, sufficient statistics and natural parameters. For a Multivariate Normal random vector
this involves re-expression of its density function according to
where
are, respectively, the sufficient statistic and natural parameter vectors. The matrix , known as the duplication matrix of order , is the matrix containing only zeroes and ones such that for any symmetric matrix . The function
is the log-partition function, where is the Moore–Penrose inverse of and is such that whenever is symmetric. The inverse of the natural parameter transformation is given by
The vec and vech matrix operators are reasonably well-established (e.g. Gentle, 2007). If is a vector then is the matrix such that . We also require vec inversion to non-square matrices. If is a vector then is the matrix such that .
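The operators just defined are straightforward to implement. The helper functions below are our own illustrative versions, not part of this article's algorithms; they check the defining property of the duplication matrix, namely that applying it to vech of a symmetric matrix recovers vec of that matrix, along with vec inversion to a non-square target.

```python
import numpy as np

def vec(A):
    # stack the columns of A into a single vector
    return A.reshape(-1, order="F")

def vec_inv(a, nrow, ncol):
    # inverse of vec for possibly non-square targets: reshape a
    # length nrow*ncol vector back into an nrow x ncol matrix
    return a.reshape(nrow, ncol, order="F")

def vech(A):
    # stack the columns of the lower triangle of square A
    d = A.shape[0]
    return np.concatenate([A[j:, j] for j in range(d)])

def duplication_matrix(d):
    # the d^2 x d(d+1)/2 zero-one matrix with
    # duplication_matrix(d) @ vech(A) == vec(A) for symmetric d x d A
    D = np.zeros((d * d, d * (d + 1) // 2))
    k = 0
    for j in range(d):
        for i in range(j, d):
            D[i + j * d, k] = 1.0
            D[j + i * d, k] = 1.0  # mirrored entry (same cell when i == j)
            k += 1
    return D

A = np.array([[2.0, 1.0], [1.0, 3.0]])      # symmetric example
assert np.allclose(duplication_matrix(2) @ vech(A), vec(A))
assert np.allclose(vec_inv(vec(A), 2, 2), A)
```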
The other major distributional family used throughout this article is a generalization of the Inverse Wishart distribution known as the Inverse G-Wishart distribution. It corresponds to the matrix inverses of random matrices that have a G-Wishart distribution (e.g. Atay-Kayis & Massam, 2005). For any positive integer , let be an undirected graph with nodes labeled and set consisting of sets of pairs of nodes that are connected by an edge. We say that the symmetric matrix respects if
A random matrix has an Inverse G-Wishart distribution with graph and parameters and symmetric matrix , written
if and only if the density function of satisfies
over arguments such that is symmetric and positive definite and respects . Two important special cases are
for which the Inverse G-Wishart distribution coincides with the ordinary Inverse Wishart distribution, and
for which the Inverse G-Wishart distribution coincides with a product of independent Inverse Chi-Squared random variables. The subscripts of
and reflect the fact that is a full matrix and is a diagonal matrix in each special case. The case corresponds to the ordinary Inverse Wishart distribution. However, with message passing in mind, we will work with the more general Inverse G-Wishart family throughout this article.
In the special case the graph
and the Inverse G-Wishart distribution reduces to the Inverse Chi-Squared distribution. Throughout this article we write
for this special case with and scalar.
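The scalar special case can be checked numerically. The snippet below is our own check, using generic scipy distributions rather than anything from this article: it confirms that the 1 x 1 Inverse Wishart density with shape nu and scale s matches an Inverse Gamma density with shape nu/2 and scale s/2, which is the Inverse Chi-Squared form.

```python
import numpy as np
from scipy import stats

# Compare the 1x1 Inverse Wishart density with the matching
# inverse gamma (Inverse Chi-Squared) density on a grid.
nu, s = 5.0, 2.0
x = np.linspace(0.1, 3.0, 50)
pdf_iw = np.array([stats.invwishart.pdf(xi, df=nu, scale=s) for xi in x])
pdf_ig = stats.invgamma.pdf(x, a=nu / 2, scale=s / 2)
print(np.max(np.abs(pdf_iw - pdf_ig)))  # essentially zero
```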
Finally, we remark on the and notation used for density functions in this article. In the variational inference literature these letters are very commonly used to denote the density functions corresponding to the model and the density functions of parameters according to the mean field approximation, with for the former and for the latter. However, the same letters are commonly used as dimension variables in the mixed models literature (e.g. Pinheiro & Bates, 2000). Therefore we use ordinary and as dimension variables and scripted versions of these letters ( and ) for density functions.
3 Multilevel Sparse Matrix Problem Algorithms
A key observation in this work is that streamlining of variational inference algorithms for higher level random effects models can be achieved by recognition and isolation of a few fundamental algorithms, which we call multilevel sparse matrix problem algorithms. These algorithms, based on the results of Nolan & Wand (2019), are identical to those used traditionally for fitting frequentist random effects models (Pinheiro & Bates, 2000). For each level there are two types of sparse matrix solution algorithms: one that applies to general forms and one that uses a QR decomposition enhancement for a particular form that arises commonly for models containing random effects. Both types are needed for variational inference.
In theory, based on the infrastructure laid out in Nolan & Wand (2019), any number of levels can be handled. However, each higher level brings increasing complexity. Here we restrict attention to two-level and three-level sparse matrix algorithms.
3.1 Two-Level Sparse Matrix Algorithms
Two-level sparse matrix problems are described in Section 2 of Nolan & Wand (2019). The notation used there is also used in this section. Here we present two algorithms, named
SolveTwoLevelSparseMatrix and SolveTwoLevelSparseLeastSquares
which are at the heart of streamlining variational inference for two-level models.
The SolveTwoLevelSparseMatrix algorithm is concerned with solving the general two-level sparse linear system problem , where
(7) 
and obtaining the submatrices corresponding to the nonzero blocks of :
(8) 
As will be elaborated upon later, the blocks represented by the symbol are not of interest. SolveTwoLevelSparseMatrix is listed as Algorithm 1 and is justified by Theorem 1 of Nolan & Wand (2019).
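For orientation, a small instance of the system at (7) can be assembled and solved with off-the-shelf sparse tools; the dimensions and matrix values below are arbitrary illustrative choices of ours. Algorithm 1 obtains the same solution, together with the required subblocks of the inverse, without ever assembling the full matrix.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

rng = np.random.default_rng(0)
p, q, m = 2, 2, 3                      # block dimensions, chosen arbitrarily
A11 = 10.0 * np.eye(p)                 # upper-left block
A12 = [rng.standard_normal((p, q)) for _ in range(m)]       # cross blocks
A22 = [3.0 * np.eye(q) + np.ones((q, q)) for _ in range(m)] # diagonal blocks

# assemble the two-level sparse ("arrowhead") matrix of (7)
blocks = [[A11] + A12]
for i in range(m):
    row = [A12[i].T] + [None] * m      # None marks a zero block
    row[1 + i] = A22[i]
    blocks.append(row)
A = sparse.bmat(blocks, format="csc")

a = rng.standard_normal(p + m * q)
x = spsolve(A, a)                      # generic sparse solve of A x = a
assert np.allclose(A @ x, a)
```

Note that only the nonzero blocks are ever stored here; the streamlined algorithms go further by also keeping the per-group computations linear in the number of groups.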
The SolveTwoLevelSparseLeastSquares algorithm arises in the special case where is the minimizer of the least squares problem where
(9) 
In this case , so that the subblocks of and take the forms
As demonstrated in Section 4, these forms arise in two-level random effects models. Theorem 2 of Nolan & Wand (2019) shows that this special form lends itself to a QR decomposition (e.g. Harville, 2008; Section 6.4.d) approach which has speed and stability advantages in regression settings (e.g. Gentle, 2007; Section 6.7.2).
SolveTwoLevelSparseLeastSquares is listed as Algorithm 2. Note that we use , rather than , to denote the number of rows in each of , and , to avoid a notational clash with common grouped data dimension notation as used in Section 4. In the first loop over the groups of data the upper triangular matrices , , are obtained via QR decomposition; a standard procedure within most computing environments. Following that, all matrix equations involve , which can be solved rapidly via backsolving.
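The QR-plus-backsolve pattern that Algorithm 2 relies on can be illustrated on a plain least squares problem (a generic sketch, not the two-level algorithm itself): with B = QR, the minimizer of the least squares criterion satisfies the upper triangular system R x = Q'b, which is backsolved rather than R being explicitly inverted.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(2)
B = rng.standard_normal((20, 4))           # arbitrary illustrative design matrix
b = rng.standard_normal(20)

Q, R = np.linalg.qr(B)                     # "economy" QR: R is 4 x 4 upper triangular
x = solve_triangular(R, Q.T @ b)           # backsolve instead of forming the inverse of R
assert np.allclose(x, np.linalg.lstsq(B, b, rcond=None)[0])
```

The backsolve costs order of the square of the number of columns per right-hand side, which is the source of the speed advantage mentioned above.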
Note that in Algorithm 2 calculations such as do not require storage of or use of ordinary multiplication. Standard matrix algebraic programming languages store information concerning in a compact form from which matrices such as can be efficiently obtained.
3.2 Three-Level Sparse Matrix Algorithms
Extension to the three-level situation is described in Section 3 of Nolan & Wand (2019). Theorems 3 and 4 given there lead to the algorithms
SolveThreeLevelSparseMatrix and SolveThreeLevelSparseLeastSquares
which facilitate streamlining variational inference for three-level models.
An illustrative three-level sparse matrix is:
(10) 
and corresponds to level 2 group sizes of and , and a level 3 group size of . A general three-level sparse matrix consists of the following components:

A matrix , which is designated the block position.

A set of partitioned matrices , which is designated the block position. For each , is , and for each , is .

A block, which is simply the transpose of the block.

A block diagonal structure along the block position, where each subblock is a two-level sparse matrix, as defined in (7). For each , is , and for each , is and is .
The three-level sparse linear system problem takes the form where we partition the vectors and as follows:
(11) 
Here and are vectors. Then, for each , and are vectors. Lastly, for each and the vectors and have dimension .
The three-level sparse matrix inverse problem involves determination of the subblocks of corresponding to the nonzero subblocks of . Our notation for these subblocks is illustrated by
(12) 
for the , and case.
SolveThreeLevelSparseMatrix, which provides streamlined solutions for the general three-level sparse matrix problem, is listed as Algorithm 3.