1 Introduction
Binary response mixed model-based data analysis is ubiquitous in many areas of application, with examples such as analysis of biomedical longitudinal data (e.g. Diggle et al., 2002), social science multilevel data (e.g. Goldstein, 2010), small area survey data (e.g. Rao & Molina, 2015) and economic panel data (e.g. Baltagi, 2013). The standard approach for likelihood-based inference in the presence of multivariate random effects is Laplace approximation, which is well-known to be inconsistent and prone to inferential inaccuracy. Our main contribution is to overcome this problem using expectation propagation. The new approach possesses speed and scalability on par with that of Laplace approximation, but is provably consistent and demonstrably very accurate. Bayesian approaches and Monte Carlo methods offer another route to accurate inference for binary response mixed models (e.g. Gelman & Hill, 2007). However, speed and scalability issues aside, frequentist inference is the dominant approach in many areas in which mixed models are used. Henceforth, we focus on frequentist binary mixed model analysis.
The main obstacle for likelihood-based inference for binary mixed models is the presence of irreducible integrals. For grouped data with one level of nesting, the dimension of the integrals matches the number of random effects. The two most common approaches to dealing with these integrals are (1) quadrature and (2) Laplace approximation. For example, in the R computing environment (R Core Team, 2018) the function glmer() in the package lme4 (Bates et al., 2015) supports both adaptive Gauss-Hermite quadrature and Laplace approximation for univariate random effects. For multivariate random effects only Laplace approximation is supported by glmer(), presumably because of the inherent difficulties of higher dimensional quadrature. Laplace approximation eschews multivariate integration via quadratic approximation of the log-integrand. However, the resultant approximate inference is well-known to be inaccurate, often to an unacceptable degree, in binary mixed models (e.g. McCulloch et al., Section 14.4). An embellishment of Laplace approximation, known as integrated nested Laplace approximation (Rue, Martino & Chopin, 2009), has been successful in various Bayesian inference contexts.
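To make concrete the kind of one-dimensional integral that adaptive Gauss-Hermite quadrature targets, the following minimal sketch evaluates one group's likelihood contribution for a probit random intercept model. It is written in Python with numpy/scipy rather than the R tools discussed above, and all data and parameter values are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def group_loglik(y, x, z, beta, sigma, n_quad=30):
    """Log of  int prod_j F((2 y_j - 1)(x_j^T beta + z_j u)) N(u; 0, sigma^2) du
    for a probit link (F = Phi), via Gauss-Hermite quadrature."""
    t, w = np.polynomial.hermite.hermgauss(n_quad)  # nodes/weights for e^{-t^2}
    u = np.sqrt(2.0) * sigma * t                    # change of variables u = sqrt(2) sigma t
    s = (2 * y - 1)[:, None] * ((x @ beta)[:, None] + z[:, None] * u[None, :])
    log_prod = norm.logcdf(s).sum(axis=0)           # log prod_j Phi(...) at each node
    return np.log(np.sum(w * np.exp(log_prod)) / np.sqrt(np.pi))

# hypothetical group: 4 binary responses, random intercept (z_j = 1)
y = np.array([1, 0, 1, 1])
x = np.column_stack([np.ones(4), [0.2, -0.5, 1.0, 0.3]])
z = np.ones(4)
ll = group_loglik(y, x, z, beta=np.array([0.3, 0.5]), sigma=0.9)
```

For multivariate random effects the analogous product quadrature rule requires a number of nodes that grows exponentially with the random effects dimension, which is the difficulty alluded to above.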
Expectation propagation (e.g. Minka, 2001) is a general prescription for approximation of integrals that arise in statistical inference problems. Most of its literature is within the realm of Computer Science and, in particular, geared towards approximate inference for Bayesian graphical models (e.g. Chapter 10, Bishop, 2006). A major contribution of this article is transferral of expectation propagation methodology to frequentist statistical inference. In principle, our approach applies to any generalized linear mixed model situation. However, expectation propagation for binary response mixed model analysis has some especially attractive features and therefore we focus on this class of models. In the special case of probit mixed models, the integrals required by the expectation propagation approximation to the log-likelihood are available in exact form, regardless of the dimension of the random effects. This leads to a new practical alternative to multivariate quadrature. Moreover, asymptotic theory reveals that expectation propagation provides consistent approximation of the exact likelihood surface. This implies very good inferential accuracy of expectation propagation, and is supported by our simulation results. We are not aware of any other quadrature-free approaches to generalized mixed model analysis that have such a strong theoretical underpinning.
To facilitate widespread use of the new approach, a new package in the R language (R Core Team, 2018) has been launched. The package, glmmEP (Wand & Yu, 2018), uses a low-level language implementation of expectation propagation for speedy approximate likelihood-based inference and scales well to large sample sizes.
Binary response mixed models, and their inherent computational challenges, are summarized in Section 2. The expectation propagation approach to fitting and approximate inference, with special attention given to the quadrature-free probit link situation, is given in Section 3. Section 4 presents the results of numerical studies for both simulated and real data, and shows expectation propagation to be of great practical value as a fast, high quality approximation that scales well to big data and big model situations. Theoretical considerations are summarised in Section 5. Higher-level and random effects extensions are touched upon in Section 6. Lastly, we briefly discuss transferral of the new approach to other generalized linear mixed model settings in Section 7.
2 Binary Response Mixed Models
Binary mixed models for grouped data with one level of nesting and Gaussian random effects have the general form

$P(y_{ij} = 1 \mid u_i) = F(x_{ij}^T \beta + z_{ij}^T u_i), \qquad u_i \sim N(0, \Sigma) \ \text{independently},$  (1)

where the inverse link $F$ is a pre-specified cumulative distribution function and $y_{ij} \in \{0, 1\}$ is the $j$th response for the $i$th group; the number of groups is $m$ and the number of response measurements within the $i$th group is $n_i$, so that $1 \le j \le n_i$ and $1 \le i \le m$. Also, $x_{ij}$ is a $p \times 1$ vector of predictors corresponding to $y_{ij}$, modeled as having fixed effects with coefficient vector $\beta$. Similarly, $z_{ij}$ is a $q \times 1$ vector of predictors modeled as having random effects with coefficient vectors $u_i$, $1 \le i \le m$. Typically, $z_{ij}$ is a subvector of $x_{ij}$. It is also very common for each of $x_{ij}$ and $z_{ij}$ to have first entry equal to $1$, corresponding to fixed and random intercepts. The random effects covariance matrix $\Sigma$ has dimension $q \times q$. By far, the most common choices for $F$ are

$F(x) = \frac{e^x}{1 + e^x}$ (logistic link) and $F(x) = \Phi(x)$ (probit link),

where $\Phi$ is the cumulative distribution function of the $N(0,1)$ distribution.
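As an illustration of (1), the following sketch simulates grouped binary data from a probit random intercept model, i.e. $q = 1$ with $z_{ij} = 1$. It is written in Python with numpy/scipy (the paper's own software is R), and all parameter settings are hypothetical.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

m = 100                                  # number of groups
beta = np.array([0.5, -1.0])             # fixed effects (intercept, slope)
Sigma = np.array([[0.8]])                # q = 1: random intercept variance

y, X = [], []
for i in range(m):
    n_i = int(rng.integers(5, 11))                     # responses in group i
    u_i = rng.multivariate_normal(np.zeros(1), Sigma)  # u_i ~ N(0, Sigma)
    x = np.column_stack([np.ones(n_i), rng.normal(size=n_i)])
    z = x[:, :1]                                       # z_ij: first entry of x_ij
    prob = norm.cdf(x @ beta + z @ u_i)                # probit link: F = Phi
    y.append(rng.binomial(1, prob))
    X.append(x)
```

Note that the random effect $u_i$ is drawn once per group and shared by all responses in that group, which is what induces the within-group dependence that the integrals below must account for.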
Despite the simple form of (1), likelihood-based inference for the parameters $\beta$ and $\Sigma$ and best prediction of the random effects $u_i$ is very numerically challenging. Assuming that $F(-x) = 1 - F(x)$, as is the case for the logistic and probit cases, the log-likelihood is

$\ell(\beta, \Sigma) = \sum_{i=1}^{m} \log \int_{\mathbb{R}^q} \Big\{ \prod_{j=1}^{n_i} F\big((2 y_{ij} - 1)(x_{ij}^T \beta + z_{ij}^T u)\big) \Big\} (2\pi)^{-q/2} |\Sigma|^{-1/2} \exp\big(-\tfrac{1}{2} u^T \Sigma^{-1} u\big)\, du$  (2)
and the best predictor of $u_i$ is

$\mathrm{BP}(u_i) = E(u_i \mid y) = \dfrac{\int_{\mathbb{R}^q} u \prod_{j=1}^{n_i} F\big((2 y_{ij} - 1)(x_{ij}^T \beta + z_{ij}^T u)\big) \exp\big(-\tfrac{1}{2} u^T \Sigma^{-1} u\big)\, du}{\int_{\mathbb{R}^q} \prod_{j=1}^{n_i} F\big((2 y_{ij} - 1)(x_{ij}^T \beta + z_{ij}^T u)\big) \exp\big(-\tfrac{1}{2} u^T \Sigma^{-1} u\big)\, du}.$
The $q$-dimensional integrals in the $\ell(\beta, \Sigma)$ and $\mathrm{BP}(u_i)$ expressions cannot be reduced further and multivariate numerical integration must be called upon for their evaluation. In addition, $\ell(\beta, \Sigma)$ has to be maximized over $\{p + \tfrac{1}{2} q(q+1)\}$-dimensional space to obtain maximum likelihood estimates. Lastly, there is the problem of obtaining approximate confidence intervals for the entries of $\beta$ and $\Sigma$ and approximate prediction intervals for the entries of $u_i$.

3 Expectation Propagation Likelihood Approximation
We will first explain expectation propagation for approximation of the log-likelihood $\ell(\beta, \Sigma)$. Approximation of $\mathrm{BP}(u_i)$ follows relatively quickly. First note that $\ell(\beta, \Sigma) = \sum_{i=1}^m \ell_i(\beta, \Sigma)$, where

$\ell_i(\beta, \Sigma) = \log \int_{\mathbb{R}^q} \Big\{ \prod_{j=1}^{n_i} F\big((2 y_{ij} - 1)(x_{ij}^T \beta + z_{ij}^T u)\big) \Big\} (2\pi)^{-q/2} |\Sigma|^{-1/2} \exp\big(-\tfrac{1}{2} u^T \Sigma^{-1} u\big)\, du.$
Each of the $\ell_i(\beta, \Sigma)$ are approximated individually and then summed to approximate $\ell(\beta, \Sigma)$. The essence of the approximation of $\ell_i(\beta, \Sigma)$ is replacement of each factor of the form $F\big((2 y_{ij} - 1)(x_{ij}^T \beta + z_{ij}^T u)\big)$ by an unnormalized Multivariate Normal density function, chosen according to an appropriate minimum Kullback-Leibler divergence criterion. The resultant integrand is then proportional to a product of Multivariate Normal density functions and admits an explicit form. The number of approximating density functions is of the same order of magnitude as the number of responses and, together with the properties of minimum Kullback-Leibler divergence, this leads to accurate and statistically consistent approximation of $\ell(\beta, \Sigma)$. In the probit case, where $F = \Phi$, the minimum Kullback-Leibler divergence steps are explicit. This leads to accurate approximation of $\ell(\beta, \Sigma)$ without the need for any numerical integration – just some fixed-point iteration. The expectation propagation-approximate log-likelihood, which we denote by $\ell_{\mathrm{EP}}(\beta, \Sigma)$, can be evaluated quite rapidly and maximized using established derivative-free methods such as the Nelder-Mead algorithm (Nelder & Mead, 1965) or quasi-Newton optimization methods such as the Broyden-Fletcher-Goldfarb-Shanno approach with numerical derivatives. The latter also facilitates Hessian matrix approximation at the maximum, which can be used to construct approximate confidence intervals.

We now provide the details, with subsections on each of Kullback-Leibler projection onto unnormalized Multivariate Normal density functions, the message passing formulation for organizing the required versions of these projections, and quasi-Newton-based approximate inference. The upcoming subsections require some specialized matrix notation. If $A$ is a $d \times d$ matrix then $\mathrm{vec}(A)$ is the $d^2 \times 1$ vector obtained by stacking the columns of $A$ underneath each other in order from left to right. Also, $\mathrm{vech}(A)$ is the $\tfrac{1}{2} d(d+1) \times 1$ vector defined similarly to $\mathrm{vec}(A)$ but only involving entries on and below the diagonal. The duplication matrix of order $d$, denoted by $D_d$, is the unique $d^2 \times \tfrac{1}{2} d(d+1)$ matrix of zeros and ones such that $D_d\, \mathrm{vech}(A) = \mathrm{vec}(A)$ for all symmetric $A$.
The Moore-Penrose inverse of $D_d$ is $D_d^+ = (D_d^T D_d)^{-1} D_d^T$.
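These matrix-algebraic objects are easy to check numerically. The sketch below (Python; the helper names are ours) constructs $D_d$ for $d = 3$ and verifies the defining identity and the Moore-Penrose inverse formula.

```python
import numpy as np

def vec(A):
    """Stack the columns of A underneath each other, left to right."""
    return A.flatten(order="F")

def vech(A):
    """vec restricted to the entries on and below the diagonal."""
    d = A.shape[0]
    return np.concatenate([A[j:, j] for j in range(d)])

def duplication(d):
    """Duplication matrix D_d: the zero-one matrix satisfying
    D_d vech(A) = vec(A) for every symmetric d x d matrix A."""
    D = np.zeros((d * d, d * (d + 1) // 2))
    k = 0
    for j in range(d):
        for i in range(j, d):
            D[i + j * d, k] = 1.0   # entry (i, j) of vec(A)
            D[j + i * d, k] = 1.0   # entry (j, i): same value by symmetry
            k += 1
    return D

d = 3
A = np.array([[4.0, 1.0, 0.5], [1.0, 3.0, 0.2], [0.5, 0.2, 2.0]])  # symmetric
D = duplication(d)
D_plus = np.linalg.inv(D.T @ D) @ D.T   # Moore-Penrose inverse of D_d
```

Because $D_d$ has full column rank, $(D_d^T D_d)^{-1} D_d^T$ coincides with the general Moore-Penrose pseudoinverse.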
3.1 Projection onto Unnormalized Multivariate Normal Density Functions
Let $L_1(\mathbb{R}^d)$ denote the set of absolutely integrable functions on $\mathbb{R}^d$. For $f, g \in L_1(\mathbb{R}^d)$ such that $f, g \ge 0$, the Kullback-Leibler divergence of $g$ from $f$ is

$\mathrm{KL}(f \,\|\, g) = \int_{\mathbb{R}^d} f(x) \log\Big\{ \frac{f(x)}{g(x)} \Big\}\, dx + \int_{\mathbb{R}^d} \{ g(x) - f(x) \}\, dx$  (3)
(e.g. Minka, 2005). In the special case where $f$ and $g$ are density functions the right-hand side of (3) reduces to the more common Kullback-Leibler divergence expression. However, we require this more general form that caters for unnormalized density functions.
Now consider the family of functions on $\mathbb{R}^d$ of the form

$p(x; \eta) = \exp\big( \eta_0 + \eta_1^T x + \eta_2^T \mathrm{vech}(x x^T) \big)$  (4)

where $\eta_0 \in \mathbb{R}$, $\eta_1$ is a $d \times 1$ vector and $\eta_2$ is a $\tfrac{1}{2} d(d+1) \times 1$ vector restricted in such a way that (4) is integrable over $\mathbb{R}^d$. Then (4) is the family of unnormalized Multivariate Normal density functions written in exponential family form with natural parameters $\eta_0$, $\eta_1$ and $\eta_2$.
Expectation propagation for generalized linear mixed models with Gaussian random effects has the following notion at its core:
given $f \in L_1(\mathbb{R}^d)$, determine the $\eta_0$, $\eta_1$ and $\eta_2$ that minimize $\mathrm{KL}\big( f \,\|\, p(\cdot\,; \eta) \big)$.  (5)
The solution is termed the (Kullback-Leibler) projection of $f$ onto the family of unnormalized Multivariate Normal density functions and we write

$\mathrm{proj}[f] = p(\cdot\,; \eta^*)$,

where

$\eta^* = \underset{\eta \in H}{\mathrm{argmin}}\ \mathrm{KL}\big( f \,\|\, p(\cdot\,; \eta) \big)$,

with $H$ denoting the set of all allowable natural parameters. Note that the special case of Kullback-Leibler projection onto the unnormalized Multivariate Normal family has a simple moment-matching representation, with $\eta^*$ being the unique natural parameter vector such that the zeroth-, first- and second-order moments of $p(\cdot\,; \eta^*)$ match those of $f$.

For the binary mixed model (1), expectation propagation requires repeated projection of expressions of the form
onto the unnormalized Multivariate Normal family. An important observation is that, in the case of probit mixed models, this projection has an exact solution.
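The exactness rests on classical Gaussian identities such as $\int \Phi(a + b u)\, \phi(u; \mu, \sigma^2)\, du = \Phi\big( (a + b\mu) / \sqrt{1 + b^2 \sigma^2} \big)$, which delivers the zeroth moment of the probit-times-Gaussian product in closed form (the first and second moments have analogous closed forms involving $\phi/\Phi$ ratios). A numerical check with arbitrary illustrative values:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# hypothetical values
a, b, mu, sigma = 0.7, -1.3, 0.4, 1.5

lhs, _ = quad(lambda u: norm.cdf(a + b * u) * norm.pdf(u, loc=mu, scale=sigma),
              -np.inf, np.inf)
rhs = norm.cdf((a + b * mu) / np.sqrt(1.0 + b**2 * sigma**2))
```

No such closed form exists for the logistic link, which is why the probit case is singled out as quadrature-free.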
Let $\zeta(x) \equiv \log\{2 \Phi(x)\}$. It follows that $\zeta'(x) = \phi(x)/\Phi(x)$, where $\phi$ is the $N(0,1)$ density function. We are now in a position to define two algebraic functions which are fundamental for approximate likelihood-based inference in probit mixed models based on expectation propagation:
Definition 1. For primary arguments () and () such that is symmetric and positive definite, and auxiliary arguments and () the function is given by
with
and the function is given by
In addition, for primary arguments , () and () such that both and are symmetric and positive definite, and auxiliary arguments and (), the function is given by
with .
Inspection of Definition 1 reveals that its two functions are simple algebraic functions up to evaluations of $\zeta'$ and $\zeta''$. Even though software for $\Phi$ is widely available, direct computation of $\zeta'$ and $\zeta''$ can be unstable and software such as the function zeta() in the R package sn (Azzalini, 2017) is recommended. Another option is use of a continued fraction representation and Lentz's Algorithm (e.g. Wand & Ormerod, 2012).
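To illustrate the instability: the ratio $\phi(x)/\Phi(x)$ underflows to $0/0$ in double precision for $x$ around $-40$ when computed naively, whereas evaluation on the log scale remains accurate (this is the strategy behind stable implementations such as sn's zeta()). A Python analogue using scipy:

```python
import numpy as np
from scipy.stats import norm

def inv_mills(x):
    """phi(x)/Phi(x) evaluated on the log scale, stable for large negative x."""
    return np.exp(norm.logpdf(x) - norm.logcdf(x))

x = -40.0
with np.errstate(invalid="ignore"):
    naive = norm.pdf(x) / norm.cdf(x)   # both terms underflow to 0.0, giving nan
stable = inv_mills(x)                   # close to -x - 1/x = 40.025
```

The key ingredient is an accurate log cumulative distribution function (scipy's norm.logcdf); subtracting on the log scale avoids forming the two underflowing quantities separately.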
Expectation propagation for probit mixed models relies heavily upon:

Theorem 1. If
then
where
A proof of Theorem 1 is given in Section S.1 of the online supplement.
3.2 Message Passing Formulation
The $i$th summand of $\ell(\beta, \Sigma)$ can be written as

$\ell_i(\beta, \Sigma) = \log \int_{\mathbb{R}^q} \Big\{ \prod_{j=1}^{n_i} p(y_{ij} \mid u_i) \Big\}\, p(u_i)\, du_i$  (6)

where, for $1 \le j \le n_i$,

$p(y_{ij} \mid u_i) = F\big((2 y_{ij} - 1)(x_{ij}^T \beta + z_{ij}^T u_i)\big)$ and $p(u_i) = (2\pi)^{-q/2} |\Sigma|^{-1/2} \exp\big(-\tfrac{1}{2} u_i^T \Sigma^{-1} u_i\big)$

are, respectively, the conditional density functions of each response given its random effect and the density function of that random effect. Note that the product structure of the integrand in (6) can be represented using the factor graph shown in Figure 1. The circle in Figure 1 corresponds to the random vector $u_i$ and in factor graph parlance is a stochastic variable node. The solid rectangles correspond to each of the factors in the integrand of (6). Each of these factors depends on $u_i$, which is signified by the edges connecting each factor node to the lone stochastic variable node.
Expectation propagation approximation of $\ell_i(\beta, \Sigma)$ involves projection onto the unnormalized Multivariate Normal family. Suppose that the factor approximations

(7) 

are initialized to be unnormalized Multivariate Normal density functions in $u_i$. Then, for each $1 \le j \le n_i$, the update involves minimization of

(8) 

as a function of the approximating factor's natural parameters. Noting that this problem has the form (5), Theorem 1 can be used to perform the update explicitly in the case of a probit link. This procedure is then iterated until the approximating density functions converge.
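To convey the flavour of these iterations in the simplest possible setting, the following self-contained sketch runs expectation propagation for a one-dimensional analogue: factors $\Phi(c_j u)$ integrated against a $N(0,1)$ random effect density. This is Python, not the paper's Algorithm 1, and the tilted-moment formulas used are the standard Gaussian/probit moment-matching results; each site is an unnormalized Gaussian $\exp(s_j + h_j u - \tfrac{1}{2}\lambda_j u^2)$ matched in zeroth, first and second moments against its cavity distribution.

```python
import numpy as np
from scipy.stats import norm

def ep_probit_logZ(c, max_iter=100, tol=1e-10):
    """EP approximation to log of  int prod_j Phi(c_j u) N(u; 0, 1) du."""
    n = len(c)
    s, h, lam = np.zeros(n), np.zeros(n), np.zeros(n)  # site parameters
    tau, nu = 1.0, 0.0            # posterior natural parameters (prior N(0,1))
    for _ in range(max_iter):
        tau_old, nu_old = tau, nu
        for j in range(n):
            tau_c, nu_c = tau - lam[j], nu - h[j]      # cavity: remove site j
            mu_c, s2_c = nu_c / tau_c, 1.0 / tau_c
            t = np.sqrt(1.0 + s2_c)
            z = c[j] * mu_c / t
            r = np.exp(norm.logpdf(z) - norm.logcdf(z))          # phi(z)/Phi(z)
            mu_t = mu_c + c[j] * s2_c * r / t                    # tilted mean
            s2_t = s2_c - s2_c**2 * r * (z + r) / (1.0 + s2_c)   # tilted variance
            tau, nu = 1.0 / s2_t, mu_t / s2_t          # match first two moments
            lam[j], h[j] = tau - tau_c, nu - nu_c      # new site j
            # zeroth moment: site integrates to Phi(z) against the cavity density
            s[j] = norm.logcdf(z) - (0.5 * np.log(tau_c / tau)
                                     + nu**2 / (2 * tau) - nu_c**2 / (2 * tau_c))
        if abs(tau - tau_old) + abs(nu - nu_old) < tol:
            break
    # exact Gaussian integral of prod_j site_j(u) * N(u; 0, 1)
    return s.sum() - 0.5 * np.log(tau) + nu**2 / (2.0 * tau)
```

After convergence the integrand is a product of unnormalized Gaussians, so the approximate integral is available in closed form, mirroring how the $\ell_i(\beta, \Sigma)$ approximations become explicit.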
A convenient way to keep track of the updates and compartmentalize the algebra and coding is to call upon the notion of message passing. Minka (2005) shows how to express expectation propagation as a message passing algorithm in the Bayesian graphical models context, culminating in his equations (54) and (83) update formulae. Exactly the same formulae arise here, as is made clear in Section S.2 of the online supplement. In particular, in keeping with (83) of Minka (2005), (8) can be expressed as
(9) 
where is the message passed from the factor to the stochastic node and is the message passed from back to . The message passed from to is
(10) 
In keeping with equation (54) of Minka (2005), the stochastic node to factor messages are updated according to
(11) 
and
(12) 
As laid out at the end of Section 6 of Minka (2005), the expectation propagation message passing protocol is:

Initialize all factor to stochastic node messages.
Upon convergence, the expectation propagation approximation to $\ell_i(\beta, \Sigma)$ is
(13) 
where the integrand is in keeping with the general form given by (44) of Minka & Winn (2008). The success of expectation propagation hinges on the fact that each of the messages in (13) is an unnormalized Multivariate Normal density function and the integral over $\mathbb{R}^q$ can be obtained exactly as follows:
where
is as defined in Definition 1 and, for an unnormalized Multivariate Normal natural parameter vector , denotes the first entry (the zero subscript is indicative of the first entry being the coefficient of ) and denotes the remaining entries.
The full algorithm for expectation propagation approximation of is summarized as Algorithm 1. The derivational details are given in Section S.2.
We have carried out extensive simulated data tests on Algorithm 1 using the starting values described in Section 3.3 and found convergence to be rapid. Moreover, each of the updates in Algorithm 1 involves explicit calculations and the low-level language implementation used in our R package glmmEP affords very fast evaluation of the approximate log-likelihood surface. As explained in Section 3.4, quasi-Newton methods can be used for maximization of the expectation propagation-approximate log-likelihood and approximate likelihood-based inference.
3.3 Recommended Starting Values for Algorithm 1
In Section S.3 we use a Taylor series argument to justify the following starting values for Algorithm 1:
(14) 
where
and $u_i^0$ is a prediction of $u_i$. A convenient choice for $u_i^0$ is that based on Laplace approximation. In the R computing environment the function glmer() in the package lme4 (Bates et al., 2015) provides fast Laplace approximation-based predictions for the $u_i$. In our numerical experiments, we found convergence of the cycle loop of Algorithm 1 to be quite rapid with these starting values.
Therefore, we strongly recommend the starting values (14).
3.4 Quasi-Newton Optimization and Approximate Inference
Even though Algorithm 1 provides fast approximate evaluation of the probit mixed model likelihood surface, we still need to maximize the approximate log-likelihood over $(\beta, \Sigma)$ to obtain the expectation propagation-approximate maximum likelihood estimators. There is also the issue of approximate inference based on Fisher information theory.
Since the expectation propagation-approximate log-likelihood is defined implicitly via an iterative scheme, differentiation for use in derivative-based optimization techniques is not straightforward. A practical workaround involves the employment of optimization methods, such as those of the quasi-Newton variety, for which derivatives are approximated numerically. In the R computing environment the function optim() supports several derivative-free optimization implementations. The Matlab computing environment (The Mathworks Incorporated, 2018) has similar capabilities via functions such as fminunc(). In the glmmEP package and the examples in Section 4 we use the Broyden-Fletcher-Goldfarb-Shanno quasi-Newton method (Broyden, 1970; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970) with Nelder-Mead starting values. Section 2.2.2.3 of Givens & Hoeting (2005) provides a concise summary of the Broyden-Fletcher-Goldfarb-Shanno method.
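The same strategy can be mimicked in a few lines. The sketch below uses Python's scipy.optimize rather than optim(), and the objective is a hypothetical smooth stand-in for the negative approximate log-likelihood: Nelder-Mead supplies derivative-free starting values, BFGS then refines them using numerically approximated gradients, and the accumulated inverse-Hessian approximation gives rough curvature information at the maximum.

```python
import numpy as np
from scipy.optimize import minimize

# hypothetical smooth objective standing in for the negative approximate
# log-likelihood (glmmEP's actual objective is the output of Algorithm 1)
def objective(theta):
    return (theta[0] - 1.0) ** 2 + 2.0 * (theta[1] + 0.5) ** 2 \
        + 0.1 * theta[0] * theta[1]

start = minimize(objective, x0=np.zeros(2), method="Nelder-Mead").x  # derivative-free
fit = minimize(objective, x0=start, method="BFGS")   # gradients approximated numerically
theta_hat = fit.x
approx_se = np.sqrt(np.diag(fit.hess_inv))  # inverse-Hessian-based uncertainty proxy
```

For a genuine negative log-likelihood the final inverse Hessian approximates the inverse observed information, which is what underlies the approximate confidence intervals mentioned earlier; a more careful implementation would recompute the Hessian at the maximum by finite differences.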
Since $\Sigma$ is constrained to be symmetric and positive definite, we instead perform quasi-Newton optimization over the unconstrained parameter vector $\big(\beta^T, \mathrm{vech}(\log \Sigma)^T\big)^T$, where $\log \Sigma$ is the matrix logarithm of $\Sigma$ (e.g. Section 2.2 of Pinheiro & Bates, 2000). Note that $\log \Sigma$ can be obtained using $\log \Sigma = U (\log \Lambda) U^T$, where $\Sigma = U \Lambda U^T$ is the spectral decomposition of $\Sigma$ and $\log \Lambda$ denotes elementwise application of the logarithm to the diagonal entries of $\Lambda$. If $(\hat{\beta}, \hat{\theta})$ is the maximizer of the expectation propagation-approximate log-likelihood then the expectation propagation-approximate maximum likelihood estimate of $\Sigma$ is $\hat{\Sigma} = \hat{U} \exp(\hat{\Lambda}) \hat{U}^T$, where $\hat{U} \hat{\Lambda} \hat{U}^T$ is the spectral decomposition of the symmetric matrix corresponding to $\hat{\theta}$, so that the matrix exponential undoes the logarithmic transformation.
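The matrix logarithm/exponential pair in this parametrization can be sketched via the spectral decomposition as follows (Python; the function names are ours):

```python
import numpy as np

def mat_log(S):
    """Matrix logarithm of a symmetric positive definite matrix:
    log S = U (log Lambda) U^T from the spectral decomposition S = U Lambda U^T."""
    lam, U = np.linalg.eigh(S)
    return U @ np.diag(np.log(lam)) @ U.T

def mat_exp(L):
    """Matrix exponential of a symmetric matrix (inverse of mat_log)."""
    lam, U = np.linalg.eigh(L)
    return U @ np.diag(np.exp(lam)) @ U.T

Sigma = np.array([[1.0, 0.4], [0.4, 0.7]])  # hypothetical covariance matrix
theta = mat_log(Sigma)   # unconstrained symmetric matrix; vech of it is optimized over
```

Any symmetric matrix maps back to a symmetric positive definite one under the matrix exponential, so the optimizer is free to roam over all of the unconstrained space.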