Binary response mixed model-based data analysis is ubiquitous in many areas of application, with examples such as analysis of biomedical longitudinal data (e.g. Diggle et al., 2002), social science multilevel data (e.g. Goldstein, 2010), small area survey data (e.g. Rao & Molina, 2015) and economic panel data (e.g. Baltagi, 2013). The standard approach for likelihood-based inference in the presence of multivariate random effects is Laplace approximation, which is well-known to be inconsistent and prone to inferential inaccuracy. Our main contribution is to overcome this problem using expectation propagation. The new approach possesses speed and scalability on par with that of Laplace approximation, but is provably consistent and demonstrably very accurate. Bayesian approaches and Monte Carlo methods offer another route to accurate inference for binary response mixed models (e.g. Gelman & Hill, 2007). However, speed and scalability issues aside, frequentist inference is the dominant approach in many areas in which mixed models are used. Henceforth, we focus on frequentist binary mixed model analysis.
The main obstacle for likelihood-based inference for binary mixed models is the presence of irreducible integrals. For grouped data with one level of nesting, the dimension of the integrals matches the number of random effects. The two most common approaches to dealing with these integrals are (1) quadrature and (2) Laplace approximation. For example, in the R computing environment (R Core Team, 2018) the function glmer() in the package lme4 (Bates et al., 2015) supports both adaptive Gauss-Hermite quadrature and Laplace approximation for univariate random effects. For multivariate random effects only Laplace approximation is supported by glmer(), presumably because of the inherent difficulties of higher dimensional quadrature. Laplace approximation eschews multivariate integration via quadratic approximation of the log-integrand. However, the resultant approximate inference is well-known to be inaccurate, often to an unacceptable degree, in binary mixed models (e.g. McCulloch et al., Section 14.4). An embellishment of Laplace approximation, known as integrated nested Laplace approximation (Rue, Martino & Chopin, 2009), has been successful in various Bayesian inference contexts.
Expectation propagation (e.g. Minka, 2001) is general prescription for approximation of integrals that arise in statistical inference problems. Most of its literature is within the realm of Computer Science and, in particular, geared towards approximate inference for Bayesian graphical models (e.g. Chapter 10, Bishop, 2006). A major contribution of this article is transferral of expectation propagation methodology to frequentist statistical inference. In principle, our approach applies to any generalized linear mixed model situation. However, expectation propagation for binary response mixed model analysis has some especially attractive features and therefore we focus on this class of models. In the special case of probit mixed models, the expectation propagation approximation to the log-likelihood is exact regardless of the dimension of the random effects. This leads to a new practical alternative to multivariate quadrature. Moreover, asymptotic theory reveals that expectation propagation provides consistent approximation of the exact likelihood surface. This implies very good inferential accuracy of expectation propagation, and is supported by our simulation results. We are not aware of any other quadrature-free approaches to generalized mixed model analysis that has such a strong theoretical underpinning.
To facilitate widespread use of the new approach, a new package in the R language (R Core Team, 2018) has been launched. The package, glmmEP (Wand & Yu, 2018), uses a low-level language implementation of expectation propagation for speedy approximate likelihood-based inference and scales well to large sample sizes.
Binary response mixed models, and their inherent computational challenges, are summarized in Section 2 The expectation propagation approach to fitting and approximate inference, with special attention given to the quadrature-free probit link situation, is given in Section 3. Section 4 presents the results of numerical studies for both simulated and real data, and shows expectation propagation to be of great practical value as a fast, high quality approximation that scales well to big data and big model situations. Theoretical considerations are summarised in Section 5. Higher level and random effects extensions are touched upon in Section 6. Lastly, we briefly discuss transferral of new approach to other generalized linear mixed model settings in Section 7.
2 Binary Response Mixed Models
Binary mixed models for grouped data with one level of nesting and Gaussian random effects has the general form
where , the inverse link
, is a pre-specified cumulative distribution function andis the th response for the th group, where number of groups is and the number of responses measurements within the th group is . Also, is a vector of predictors corresponding to , modeled as having fixed effects with coefficient vector . Similarly, is a vector of predictors modeled as having random effects with coefficient vectors , . Typically, is a sub-vector of . It is also very common for each of and to have first entry equal to , corresponding to fixed and random intercepts. The random effects covariance matrix has dimension .
By far, the most common choices for are
where and is the cumulative distribution function of the distribution.
Despite the simple form of (1), likelihood-based inference for the parameters and and best prediction of the random effects is very numerically challenging. Assuming that , as is the case for the logistic and probit cases, the log-likelihood is
and the best predictor of is
The -dimensional integrals in the and expressions cannot be reduced further and multivariate numerical integration must be called upon for their evaluation. In addition, has to be maximized over
-dimensional space to obtain maximum likelihood estimates. Lastly, there is the problem of obtaining approximate confidence intervals for the entries ofand and approximate prediction intervals for the entries of .
3 Expectation Propagation Likelihood Approximation
We will first explain expectation propagation for approximation of the log-likelihood . Approximation of follows relatively quickly. First note that where
Each of the are approximated individually and then summed to approximate The essence is of the approximation of is replacement of each
by an unnormalized Multivariate Normal density function, chosen according to an appropriate minimum Kullback-Leibler divergence criterion. The resultant integrand is then proportional to a product of Multivariate Normal density functions and admits an explicit form. The number approximating density functions of the same order of magnitude and, together with the properties of minimum Kullback-Leibler divergence, leads to accurate and statistically consistent approximation of. In probit case, where , the minimum Kullback-Leibler divergence steps are explicit. This leads to accurate approximation of without the need for any numerical integration – just some fixed-point iteration. The expectation propagation-approximate log-likelihood, which we denote by , can be evaluated quite rapidly and maximized using established derivative-free methods such as the Nelder-Mead algorithm (Nelder & Mead, 1965) or quasi-Newton optimization methods such as the Broyden-Fletcher-Goldfarb-Shanno approach with numerical derivatives. The latter also facilitates Hessian matrix approximation at the maximum, which can be used to construct approximate confidence intervals.
We now provide the details, with subsections on each of Kullback-Leibler projection onto unnormalized Multivariate Normal density functions, message passing formulation for organizing the required versions of these projections and quasi-Newton-based approximate inference. The upcoming subsections require some specialized matrix notation. If is matrix then is the vector obtained by stacking the columns of underneath each other in order from left to right. Also, is vector defined similarly to but only involving entries on and below the diagonal. The duplication matrix of order , denoted by , is the unique matrix of zeros and ones such that
The Moore-Penrose inverse of is
3.1 Projection onto Unnormalized Multivariate Normal Density Functions
Let denote the set of absolutely integrable functions on . For such that , the Kullback-Leibler divergence of from is
(e.g. Minka, 2005). In the special case where and are density functions the right-hand side of (3) reduces to the more common Kullback-Leibler divergence expression. However, we require this more general form that caters for unnormalized density functions.
Now consider the family of functions on of the form
where , is a vector and is a vector restricted in such a way that . Then (4) is the family of unnormalized Multivariate Normal density functions written in exponential family form with natural parameters , and .
Expectation propagation for generalized linear mixed models with Gaussian random effects has the following notion at its core:
|given , determine the , and that minimizes .||(5)|
The solution is termed the (Kullback-Leibler) projection onto the family of Multivariate Normal density functions and we write
denoting the set of all allowable natural parameters. Note that the special case of Kullback-Leibler projection onto the unnormalized Multivariate Normal family has a simple moment-matching representation, withbeing the unique vector such that zeroth-, first- and second-order moments of match those of .
For the binary mixed model (1), expectation propagation requires repeated projection of the form
onto the unnormalized Multivariate Normal family. An important observation is that case of probit mixed models, has an exact solution.
Let . It follows that
where is the density function. We are now in a position to define two algebraic functions which are fundamental for approximate likelihood-based inference in probit mixed models based on expectation propagation:
Definition 1. For primary arguments () and () such that is symmetric and positive definite, and auxiliary arguments and () the function is given by
and the function is given by
In addition, for primary arguments , () and () such that both and are symmetric and positive definite, and auxiliary arguments and (), the function is given by
Inspection of Definition 1 reveals that the and functions are simple functions up to evaluations of and . Even though software for is widely available, direct computation of and can be unstable and software such as the function zeta() in the R package sn (Azzalini, 2017) is recommended. Another option is use of continued fraction representation and Lentz’s Algorithm (e.g. Wand & Ormerod, 2012).
Expectation propagation for probit mixed models relies heavily upon: Theorem 1. If
A proof of Theorem 1 is given in Section S.1 of the online supplement.
3.2 Message Passing Formulation
The th summand of can be written as
where, for ,
are, respectively, the conditional density functions of each response given its random effect and the density function of that random effect. Note that product structure of the integrand in (6) can be represented using factor graph shown in Figure 1. The circle in Figure 1 corresponds to the random vector and factor graph parlance is a stochastic variable node. The solid rectangles correspond to each of the factors in the (6) integrand. Each of these factors depend on , which is signified by an edges connecting each factor node to the lone stochastic variable node.
Expectation propagation approximation of involves projection onto the unnormalized Multivariate Normal family. Suppose that
are initialized to be unnormalized Multivariate Normal density functions in . Then, for each , the update involves minimization of
as functions of . Noting that this problem has the form (5), Theorem 1 can be used to perform the update explicitly in the case of a probit link. This procedure is then iterated until the s converge.
A convenient way to keep track of the updates and compartmentalize the algebra and coding is to call upon the notion of message passing. Minka (2005) shows how to express expectation propagation as a message passing algorithm in the Bayesian graphical models context, culminating in his equation (54) and (83) update formulae. Exactly the same formulae arise here, as is made clear in Section S.2 of the online supplement. In particular, in keeping with (83) of Minka (2005), (8) can be expressed as
where is the message passed from the factor to the stochastic node and is the message passed from back to . The message passed from to is
In keeping with equation (54) of Minka (2005), the stochastic node to factor messages are updated according to
As laid out at the end of Section 6 of Minka (2005), the expectation message passing protocol is:
Initialize all factor to stochastic node messages.
Upon convergence, the expectation propagation propagation approximation to is
where the integrand is in keeping with the general form given by (44) of Minka & Winn (2008). The success of expectation propagation hinges on the fact that each of the messages in (13) is an unnormalized Multivariate Normal density function and the integral over can be obtained exactly as follows:
is as defined in Definition 1 and, for an unnormalized Multivariate Normal natural parameter vector , denotes the first entry (the zero subscript is indicative of the first entry being the coefficient of ) and denotes the remaining entries.
We have carried out extensive simulated data tests on Algorithm 1 using the starting values described in Section 3.3 and found convergence to be rapid. Moreover, each of updates in Algorithm 1 involve explicit calculations and low-level language implementation, used in our R package glmmEP, affords very fast evaluation of the approximate log-likelihood surface. As explained in (3.4), quasi-Newton methods can be used for maximization of and approximate likelihood-based inference.
3.3 Recommended Starting Values for Algorithm 1
and is a prediction of . A convenient choice for is that based on Laplace approximation. In the R computing environment the function glmer() in the package lme4 (Bates et al., 2015) provides fast Laplace approximation-based predictions for the . In our numerical experiments, we found convergence of the cycle loop of Algorithm 1 to be quite rapid, with convergents of
Therefore, we strongly recommend the starting values (14).
3.4 Quasi-Newton Optimization and Approximate Inference
Even though Algorithm 1 provides fast approximate evaluation of the probit mixed model likelihood surface, we still need to maximize over to obtain the expectation propagation-approximate maximum likelihood estimators . This is also the issue of approximate inference based on Fisher information theory.
Since is defined implicitly via an iterative scheme, differentiation for use in derivative-based optimization techniques is not straightforward. A practical workaround involves the employment of optimization methods such as those of the quasi-Newton variety for which derivatives are approximated numerically. In the R computing environment the function optim() supports several derivative-free optimization implementations. The Matlab computing environment (The Mathworks Incorporated, 2018) has similar capabilities via functions such as fminunc(). In the glmmEP package and the examples in Section 4 we use the Broyden-Fletcher-Goldfarb-Shanno quasi-Newton method (Broyden, 1970; Fletcher 1970; Goldfarb, 1970; Shanno, 1970) with Nelder-Mead starting values. Section 220.127.116.11 of Givens & Hoetig (2005) provides a concise summary of the Broyden-Fletcher-Goldfarb-Shanno method.
Since is constrained to be symmetric and positive definite, we instead perform quasi-Newton optimization over the unconstrained parameter vector where
and is the matrix logarithm of (e.g. Section 2.2 of Pinheiro & Bates, 2000). Note that can be obtained using
is the spectral decomposition of and denotes element-wise evaluation of the logarithm to the entries of . If is the maximizer of then the expectation propagation-approximate maximum likelihood estimate of is
is the spectral decomposition of the