# Fast and Accurate Binary Response Mixed Model Analysis via Expectation Propagation

Expectation propagation is a general prescription for approximation of integrals in statistical inference problems. Its literature is mainly concerned with Bayesian inference scenarios. However, expectation propagation can also be used to approximate integrals arising in frequentist statistical inference. We focus on likelihood-based inference for binary response mixed models and show that fast and accurate quadrature-free inference can be realized for the probit link case with multivariate random effects and higher levels of nesting. The approach is supported by asymptotic theory in which expectation propagation is seen to provide consistent estimation of the exact likelihood surface. Numerical studies reveal the availability of fast, highly accurate and scalable methodology for binary mixed model analysis.


## 1 Introduction

Binary response mixed model-based data analysis is ubiquitous in many areas of application, with examples such as analysis of biomedical longitudinal data (e.g. Diggle et al., 2002), social science multilevel data (e.g. Goldstein, 2010), small area survey data (e.g. Rao & Molina, 2015) and economic panel data (e.g. Baltagi, 2013). The standard approach for likelihood-based inference in the presence of multivariate random effects is Laplace approximation, which is well-known to be inconsistent and prone to inferential inaccuracy. Our main contribution is to overcome this problem using expectation propagation. The new approach possesses speed and scalability on par with that of Laplace approximation, but is provably consistent and demonstrably very accurate. Bayesian approaches and Monte Carlo methods offer another route to accurate inference for binary response mixed models (e.g. Gelman & Hill, 2007). However, speed and scalability issues aside, frequentist inference is the dominant approach in many areas in which mixed models are used. Henceforth, we focus on frequentist binary mixed model analysis.

The main obstacle for likelihood-based inference for binary mixed models is the presence of irreducible integrals. For grouped data with one level of nesting, the dimension of the integrals matches the number of random effects. The two most common approaches to dealing with these integrals are (1) quadrature and (2) Laplace approximation. For example, in the R computing environment (R Core Team, 2018) the function glmer() in the package lme4 (Bates et al., 2015) supports both adaptive Gauss-Hermite quadrature and Laplace approximation for univariate random effects. For multivariate random effects only Laplace approximation is supported by glmer(), presumably because of the inherent difficulties of higher dimensional quadrature. Laplace approximation eschews multivariate integration via quadratic approximation of the log-integrand. However, the resultant approximate inference is well-known to be inaccurate, often to an unacceptable degree, in binary mixed models (e.g. McCulloch et al., Section 14.4). An embellishment of Laplace approximation, known as integrated nested Laplace approximation (Rue, Martino & Chopin, 2009), has been successful in various Bayesian inference contexts.

Expectation propagation (e.g. Minka, 2001) is a general prescription for approximation of integrals that arise in statistical inference problems. Most of its literature is within the realm of Computer Science and, in particular, geared towards approximate inference for Bayesian graphical models (e.g. Chapter 10, Bishop, 2006). A major contribution of this article is transferral of expectation propagation methodology to frequentist statistical inference. In principle, our approach applies to any generalized linear mixed model situation. However, expectation propagation for binary response mixed model analysis has some especially attractive features and therefore we focus on this class of models. In the special case of probit mixed models, the Kullback-Leibler projections underlying the expectation propagation approximation to the log-likelihood are available in closed form regardless of the dimension of the random effects. This leads to a new practical alternative to multivariate quadrature. Moreover, asymptotic theory reveals that expectation propagation provides consistent approximation of the exact likelihood surface. This implies very good inferential accuracy of expectation propagation, and is supported by our simulation results. We are not aware of any other quadrature-free approach to generalized linear mixed model analysis that has such a strong theoretical underpinning.

To facilitate widespread use of the new approach, a new package in the R language (R Core Team, 2018) has been launched. The package, glmmEP (Wand & Yu, 2018), uses a low-level language implementation of expectation propagation for speedy approximate likelihood-based inference and scales well to large sample sizes.

Binary response mixed models, and their inherent computational challenges, are summarized in Section 2. The expectation propagation approach to fitting and approximate inference, with special attention given to the quadrature-free probit link situation, is given in Section 3. Section 4 presents the results of numerical studies for both simulated and real data, and shows expectation propagation to be of great practical value as a fast, high quality approximation that scales well to big data and big model situations. Theoretical considerations are summarized in Section 5. Higher level nesting and random effects extensions are touched upon in Section 6. Lastly, we briefly discuss transferral of the new approach to other generalized linear mixed model settings in Section 7.

## 2 Binary Response Mixed Models

Binary mixed models for grouped data with one level of nesting and Gaussian random effects have the general form

$$P(y_{ij} = 1\,|\,u_i) = F\big(\beta^T x^F_{ij} + u_i^T x^R_{ij}\big), \quad u_i \overset{\text{ind.}}{\sim} N(0, \Sigma), \tag{1}$$

where $F$ is a pre-specified cumulative distribution function and $y_{ij} \in \{0,1\}$ is the $j$th response for the $i$th group, where the number of groups is $m$ and the number of response measurements within the $i$th group is $n_i$. Also, $x^F_{ij}$ is a $d_F \times 1$ vector of predictors corresponding to $y_{ij}$, modeled as having fixed effects with coefficient vector $\beta$. Similarly, $x^R_{ij}$ is a $d_R \times 1$ vector of predictors modeled as having random effects with coefficient vectors $u_i$. Typically, $x^R_{ij}$ is a sub-vector of $x^F_{ij}$. It is also very common for each of $x^F_{ij}$ and $x^R_{ij}$ to have first entry equal to $1$, corresponding to fixed and random intercepts. The random effects covariance matrix $\Sigma$ has dimension $d_R \times d_R$.

By far, the most common choices for $F$ are

$$F = \begin{cases} \text{expit} & \text{for logistic mixed models} \\ \Phi & \text{for probit mixed models} \end{cases}$$

where $\text{expit}(x) \equiv e^x/(1+e^x)$ and $\Phi$ is the cumulative distribution function of the $N(0,1)$ distribution.

Despite the simple form of (1), likelihood-based inference for the parameters $\beta$ and $\Sigma$ and best prediction of the random effects $u_i$ is very numerically challenging. Assuming that $1 - F(x) = F(-x)$, as is the case for the logistic and probit cases, the log-likelihood is

$$\ell(\beta,\Sigma) = \sum_{i=1}^m \log \int_{\mathbb{R}^{d_R}} \Big\{ \prod_{j=1}^{n_i} F\big((2y_{ij}-1)(\beta^T x^F_{ij} + u^T x^R_{ij})\big) \Big\} \, |2\pi\Sigma|^{-1/2} \exp\big(-\tfrac{1}{2} u^T \Sigma^{-1} u\big)\, du \tag{2}$$

and the best predictor of $u_i$ is

$$\text{BP}(u_i) = \frac{\displaystyle\int_{\mathbb{R}^{d_R}} u \Big\{ \prod_{j=1}^{n_i} F\big((2y_{ij}-1)(\beta^T x^F_{ij} + u^T x^R_{ij})\big) \Big\} \exp\big(-\tfrac{1}{2} u^T \Sigma^{-1} u\big)\, du}{\displaystyle\int_{\mathbb{R}^{d_R}} \Big\{ \prod_{j=1}^{n_i} F\big((2y_{ij}-1)(\beta^T x^F_{ij} + u^T x^R_{ij})\big) \Big\} \exp\big(-\tfrac{1}{2} u^T \Sigma^{-1} u\big)\, du}, \quad 1 \le i \le m.$$

The $d_R$-dimensional integrals in the $\ell(\beta,\Sigma)$ and $\text{BP}(u_i)$ expressions cannot be reduced further and multivariate numerical integration must be called upon for their evaluation. In addition, $\ell(\beta,\Sigma)$ has to be maximized over $\{d_F + \tfrac{1}{2}d_R(d_R+1)\}$-dimensional space to obtain maximum likelihood estimates. Lastly, there is the problem of obtaining approximate confidence intervals for the entries of $\beta$ and $\Sigma$ and approximate prediction intervals for the entries of $u_i$.
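For $d_R = 1$ the integral in (2) is one-dimensional and Gauss-Hermite quadrature handles it well, which is essentially what glmer() offers for univariate random effects. A minimal sketch for a random-intercept probit model, with function and data names of our own choosing for illustration:

```python
import numpy as np
from scipy.stats import norm

def loglik_probit_ri(beta, sigma, y, x):
    """Evaluate (2) by Gauss-Hermite quadrature for a probit random-intercept
    model (d_R = 1): P(y_ij = 1 | u_i) = Phi(beta'x_ij + u_i), u_i ~ N(0, sigma^2)."""
    t, w = np.polynomial.hermite_e.hermegauss(30)   # nodes/weights w.r.t. exp(-t^2/2)
    ll = 0.0
    for yi, xi in zip(y, x):                        # loop over the m groups
        s = 2.0 * np.asarray(yi) - 1.0              # the (2y_ij - 1) sign flips
        eta = np.asarray(xi) @ beta                 # beta^T x_ij for each j
        # n_i x 30 array of F((2y_ij - 1)(beta^T x_ij + u)) at u = sigma * t
        F = norm.cdf(s[:, None] * (eta[:, None] + sigma * t[None, :]))
        ll += np.log(np.sum(w * np.prod(F, axis=0)) / np.sqrt(2.0 * np.pi))
    return ll
```

For $d_R \ge 2$ a product quadrature grid grows exponentially with $d_R$, which is exactly the difficulty that motivates the quadrature-free approach of Section 3.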

## 3 Expectation Propagation Likelihood Approximation

We will first explain expectation propagation for approximation of the log-likelihood $\ell(\beta,\Sigma)$. Approximation of $\text{BP}(u_i)$ follows relatively quickly. First note that $\ell(\beta,\Sigma) = \sum_{i=1}^m \ell_i(\beta,\Sigma)$ where

$$\ell_i(\beta,\Sigma) \equiv \log \int_{\mathbb{R}^{d_R}} \Big\{ \prod_{j=1}^{n_i} F\big((2y_{ij}-1)(\beta^T x^F_{ij} + u^T x^R_{ij})\big) \Big\} \, |2\pi\Sigma|^{-1/2} \exp\big(-\tfrac{1}{2} u^T \Sigma^{-1} u\big)\, du.$$

Each of the $\ell_i(\beta,\Sigma)$ are approximated individually and then summed to approximate $\ell(\beta,\Sigma)$. The essence of the approximation of $\ell_i(\beta,\Sigma)$ is replacement of each

$$F\big((2y_{ij}-1)(\beta^T x^F_{ij} + u^T x^R_{ij})\big), \quad 1 \le j \le n_i,$$

by an unnormalized Multivariate Normal density function, chosen according to an appropriate minimum Kullback-Leibler divergence criterion. The resultant integrand is then proportional to a product of Multivariate Normal density functions and the integral admits an explicit form. The number of approximating density functions is of the same order of magnitude as the sample size and, together with the properties of minimum Kullback-Leibler divergence, this leads to accurate and statistically consistent approximation of $\ell(\beta,\Sigma)$. In the probit case, where $F = \Phi$, the minimum Kullback-Leibler divergence steps are explicit. This leads to accurate approximation of $\ell(\beta,\Sigma)$ without the need for any numerical integration – just some fixed-point iteration. The expectation propagation-approximate log-likelihood, which we denote by $\underset{\sim}{\ell}(\beta,\Sigma)$, can be evaluated quite rapidly and maximized using established derivative-free methods such as the Nelder-Mead algorithm (Nelder & Mead, 1965) or quasi-Newton optimization methods such as the Broyden-Fletcher-Goldfarb-Shanno approach with numerical derivatives. The latter also facilitates Hessian matrix approximation at the maximum, which can be used to construct approximate confidence intervals.

We now provide the details, with subsections on each of Kullback-Leibler projection onto unnormalized Multivariate Normal density functions, the message passing formulation for organizing the required versions of these projections, and quasi-Newton-based approximate inference. The upcoming subsections require some specialized matrix notation. If $A$ is a matrix then $\text{vec}(A)$ is the vector obtained by stacking the columns of $A$ underneath each other in order from left to right. Also, $\text{vech}(A)$ is the vector defined similarly to $\text{vec}(A)$ but only involving the entries on and below the diagonal. The duplication matrix of order $d$, denoted by $D_d$, is the unique matrix of zeros and ones such that

$$D_d \,\text{vech}(A) = \text{vec}(A) \quad \text{for } A = A^T.$$

The Moore-Penrose inverse of $D_d$ is

$$D_d^+ \equiv (D_d^T D_d)^{-1} D_d^T.$$
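For concreteness, the duplication matrix and its Moore-Penrose inverse can be built directly from the definition; a small sketch of our own (not part of the glmmEP implementation):

```python
import numpy as np

def duplication_matrix(d):
    """D_d: the unique 0/1 matrix with D_d vech(A) = vec(A) for symmetric A."""
    D = np.zeros((d * d, d * (d + 1) // 2))
    col = 0
    for j in range(d):                    # vech stacks the columns' lower parts
        for i in range(j, d):
            D[j * d + i, col] = 1.0       # position of A[i, j] within vec(A)
            if i != j:
                D[i * d + j, col] = 1.0   # mirrored entry A[j, i] = A[i, j]
            col += 1
    return D

# Moore-Penrose inverse (D_d has full column rank)
def duplication_pinv(D):
    return np.linalg.solve(D.T @ D, D.T)
```

Applying `duplication_pinv(D)` to $\text{vec}(A)$ of a symmetric $A$ recovers $\text{vech}(A)$, which is the direction used repeatedly below via $D_d^{+T}$.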

### 3.1 Projection onto Unnormalized Multivariate Normal Density Functions

Let $L_1(\mathbb{R}^d)$ denote the set of absolutely integrable functions on $\mathbb{R}^d$. For $f_1, f_2 \in L_1(\mathbb{R}^d)$ such that $f_2 > 0$, the Kullback-Leibler divergence of $f_2$ from $f_1$ is

$$\text{KL}(f_1 \,\|\, f_2) = \int_{\mathbb{R}^d} \big[\, f_1(x) \log\{f_1(x)/f_2(x)\} + f_2(x) - f_1(x) \,\big]\, dx \tag{3}$$

(e.g. Minka, 2005). In the special case where $f_1$ and $f_2$ are density functions the right-hand side of (3) reduces to the more common Kullback-Leibler divergence expression. However, we require this more general form that caters for unnormalized density functions.

Now consider the family of functions on $\mathbb{R}^d$ of the form

$$f_{\text{UN}}(x) \equiv \exp\left\{ \begin{bmatrix} 1 \\ x \\ \text{vech}(xx^T) \end{bmatrix}^T \begin{bmatrix} \eta_0 \\ \eta_1 \\ \eta_2 \end{bmatrix} \right\} \tag{4}$$

where $\eta_0 \in \mathbb{R}$, $\eta_1$ is a $d \times 1$ vector and $\eta_2$ is a $\tfrac{1}{2}d(d+1) \times 1$ vector restricted in such a way that $f_{\text{UN}} \in L_1(\mathbb{R}^d)$. Then (4) is the family of unnormalized Multivariate Normal density functions written in exponential family form with natural parameters $\eta_0$, $\eta_1$ and $\eta_2$.

Expectation propagation for generalized linear mixed models with Gaussian random effects has the following notion at its core:

$$\text{given } f_{\text{input}} \in L_1(\mathbb{R}^d), \text{ determine the } \eta_0, \eta_1 \text{ and } \eta_2 \text{ that minimize } \text{KL}(f_{\text{input}} \,\|\, f_{\text{UN}}). \tag{5}$$

The solution is termed the (Kullback-Leibler) projection onto the family of unnormalized Multivariate Normal density functions and we write

$$\text{proj}[f_{\text{input}}](x) \equiv \exp\left\{ \begin{bmatrix} 1 \\ x \\ \text{vech}(xx^T) \end{bmatrix}^T \begin{bmatrix} \eta_0^* \\ \eta_1^* \\ \eta_2^* \end{bmatrix} \right\}$$

where

$$(\eta_0^*, \eta_1^*, \eta_2^*) = \underset{(\eta_0,\eta_1,\eta_2) \in H}{\text{argmin}}\ \text{KL}(f_{\text{input}} \,\|\, f_{\text{UN}}),$$

with $H$ denoting the set of all allowable natural parameters. Note that this special case of Kullback-Leibler projection onto the unnormalized Multivariate Normal family has a simple moment-matching representation, with $(\eta_0^*, \eta_1^*, \eta_2^*)$ being the unique vector such that the zeroth-, first- and second-order moments of $\text{proj}[f_{\text{input}}]$ match those of $f_{\text{input}}$.
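The moment-matching characterization suggests a direct numerical check in one dimension: compute the zeroth-, first- and second-order moments of an $f_{\text{input}}$ of the form arising below, then build the unnormalized Gaussian carrying the same mass, mean and variance. The constants here are illustrative values of our own choosing:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# f_input(x) = Phi(c0 + c1 x) exp(eta1 x + eta2 x^2): a d = 1 instance of (5)
c0, c1, eta1, eta2 = 0.4, 1.2, 0.5, -0.8
f_input = lambda x: norm.cdf(c0 + c1 * x) * np.exp(eta1 * x + eta2 * x * x)

# zeroth-, first- and second-order moments by numerical integration
m0 = quad(f_input, -np.inf, np.inf)[0]
m1 = quad(lambda x: x * f_input(x), -np.inf, np.inf)[0]
m2 = quad(lambda x: x * x * f_input(x), -np.inf, np.inf)[0]
mu, var = m1 / m0, m2 / m0 - (m1 / m0) ** 2

# natural parameters of proj[f_input]: the unnormalized Gaussian with the
# same mass m0, mean mu and variance var
eta2_star = -0.5 / var
eta1_star = mu / var
eta0_star = np.log(m0) - 0.5 * np.log(2 * np.pi * var) - 0.5 * mu * mu / var
proj = lambda x: np.exp(eta0_star + eta1_star * x + eta2_star * x * x)
```

The appeal of the probit case, developed next, is that this projection needs no numerical integration at all.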

For the binary mixed model (1), expectation propagation requires repeated projections of the form

$$f_{\text{input}}(x) = F(c_0 + c_1^T x) \exp\left\{ \begin{bmatrix} x \\ \text{vech}(xx^T) \end{bmatrix}^T \begin{bmatrix} \eta_1^{\text{input}} \\ \eta_2^{\text{input}} \end{bmatrix} \right\}$$

onto the unnormalized Multivariate Normal family. An important observation is that, in the case of probit mixed models, where $F = \Phi$, this projection has an exact solution.

Let $\zeta(x) \equiv \log\{2\Phi(x)\}$. It follows that

$$\zeta'(x) = \phi(x)/\Phi(x) \quad \text{and} \quad \zeta''(x) = -\zeta'(x)\{x + \zeta'(x)\}$$

where $\phi$ is the $N(0,1)$ density function. We are now in a position to define two algebraic functions which are fundamental for approximate likelihood-based inference in probit mixed models based on expectation propagation:

Definition 1. For primary arguments $a_1$ ($d \times 1$) and $a_2$ ($\tfrac{1}{2}d(d+1) \times 1$) such that $-\text{vec}^{-1}(D_d^{+T}a_2)$ is symmetric and positive definite, and auxiliary arguments $c_0$ ($1 \times 1$) and $c_1$ ($d \times 1$), define

$$A_2 \equiv \text{vec}^{-1}(D_d^{+T}a_2), \quad r_1 \equiv \sqrt{2(2 - c_1^T A_2^{-1} c_1)}, \quad r_2 \equiv (2c_0 - c_1^T A_2^{-1} a_1)/r_1,$$
$$r_3 \equiv 2\zeta'(r_2)/r_1, \quad r_4 \equiv -2\zeta''(r_2)/r_1^2 \quad \text{and} \quad R_5 \equiv (A_2 + r_4 c_1 c_1^T)^{-1} A_2.$$

The function $A_N$ is given by

$$A_N\!\left(\begin{bmatrix} a_1 \\ a_2 \end{bmatrix}\right) \equiv -\tfrac{1}{4} a_1^T A_2^{-1} a_1 - \tfrac{1}{2}\log\big|-2A_2\big|.$$

In addition, for primary arguments $a_1$, $b_1$ ($d \times 1$) and $a_2$, $b_2$ ($\tfrac{1}{2}d(d+1) \times 1$) such that both $-\text{vec}^{-1}(D_d^{+T}a_2)$ and $-\text{vec}^{-1}(D_d^{+T}b_2)$ are symmetric and positive definite, and auxiliary arguments $c_0$ ($1 \times 1$) and $c_1$ ($d \times 1$), the function $C_{\text{probit}}$ is given by

$$C_{\text{probit}}\!\left(\begin{bmatrix} a_1 \\ a_2 \end{bmatrix}, \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}; c_0, c_1\right) \equiv \log\Phi(r_2) + \tfrac{1}{4} b_1^T B_2^{-1} b_1 - \tfrac{1}{4} a_1^T A_2^{-1} a_1 + \tfrac{1}{2}\log\{|B_2|/|A_2|\}$$

with $B_2 \equiv \text{vec}^{-1}(D_d^{+T}b_2)$.

Inspection of Definition 1 reveals that the $A_N$ and $C_{\text{probit}}$ functions are simple algebraic functions up to evaluations of $\Phi$, $\zeta'$ and $\zeta''$. Even though software for $\Phi$ is widely available, direct computation of $\zeta'$ and $\zeta''$ can be unstable and software such as the function zeta() in the R package sn (Azzalini, 2017) is recommended. Another option is use of a continued fraction representation and Lentz's Algorithm (e.g. Wand & Ormerod, 2012).
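In Python, the same numerical stabilization that zeta() provides can be sketched via log-scale Normal routines (our illustration; glmmEP itself uses a low-level language implementation):

```python
import numpy as np
from scipy.stats import norm
from scipy.special import log_ndtr   # log of the N(0,1) cdf, accurate in the tails

def zeta1(x):
    """zeta'(x) = phi(x)/Phi(x), evaluated on the log scale for stability."""
    return np.exp(norm.logpdf(x) - log_ndtr(x))

def zeta2(x):
    """zeta''(x) = -zeta'(x){x + zeta'(x)}."""
    z1 = zeta1(x)
    return -z1 * (x + z1)
```

The naive ratio `norm.pdf(x) / norm.cdf(x)` underflows to $0/0$ far in the left tail, whereas the log-scale version remains accurate and behaves like $-x$ as $x \to -\infty$.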

Expectation propagation for probit mixed models relies heavily upon:

Theorem 1. If

$$f_{\text{input}}(x) = \Phi(c_0 + c_1^T x) \exp\left\{ \begin{bmatrix} x \\ \text{vech}(xx^T) \end{bmatrix}^T \begin{bmatrix} \eta_1^{\text{input}} \\ \eta_2^{\text{input}} \end{bmatrix} \right\}$$

then

$$\text{proj}[f_{\text{input}}](x) = \exp\left\{ \begin{bmatrix} 1 \\ x \\ \text{vech}(xx^T) \end{bmatrix}^T \begin{bmatrix} \eta_0^* \\ \eta_1^* \\ \eta_2^* \end{bmatrix} \right\}$$

where

$$\eta_1^* = R_5^T(\eta_1^{\text{input}} + r_3 c_1), \quad \eta_2^* = D_d^T \text{vec}(R_5^T A_2) \quad \text{and} \quad \eta_0^* = C_{\text{probit}}\!\left(\begin{bmatrix} \eta_1^{\text{input}} \\ \eta_2^{\text{input}} \end{bmatrix}, \begin{bmatrix} \eta_1^* \\ \eta_2^* \end{bmatrix}; c_0, c_1\right),$$

with $A_2$, $r_3$ and $R_5$ as in Definition 1 evaluated at $a_1 = \eta_1^{\text{input}}$ and $a_2 = \eta_2^{\text{input}}$. A proof of Theorem 1 is given in Section S.1 of the online supplement.

### 3.2 Message Passing Formulation

The $i$th summand of $\ell(\beta,\Sigma)$ can be written as

$$\ell_i(\beta,\Sigma) = \log \int_{\mathbb{R}^{d_R}} \Big\{ \prod_{j=1}^{n_i} p(y_{ij}|u_i;\beta) \Big\} \, p(u_i;\Sigma)\, du_i \tag{6}$$

where, for $1 \le j \le n_i$,

$$p(y_{ij}|u_i;\beta) \equiv F\big((2y_{ij}-1)(\beta^T x^F_{ij} + u_i^T x^R_{ij})\big) \quad \text{and} \quad p(u_i;\Sigma) \equiv |2\pi\Sigma|^{-1/2} \exp\big(-\tfrac{1}{2} u_i^T \Sigma^{-1} u_i\big)$$

are, respectively, the conditional density functions of each response given its random effect and the density function of that random effect. Note that the product structure of the integrand in (6) can be represented using the factor graph shown in Figure 1. The circle in Figure 1 corresponds to the random vector $u_i$ and in factor graph parlance is a stochastic variable node. The solid rectangles correspond to each of the factors in the (6) integrand. Each of these factors depends on $u_i$, which is signified by an edge connecting each factor node to the lone stochastic variable node.

Expectation propagation approximation of $\ell_i(\beta,\Sigma)$ involves projection onto the unnormalized Multivariate Normal family. Suppose that

$$\underset{\sim}{p}(y_{ij}|u_i;\beta) = \exp\left\{ \begin{bmatrix} 1 \\ u_i \\ \text{vech}(u_i u_i^T) \end{bmatrix}^T \eta_{ij} \right\}, \quad 1 \le j \le n_i, \tag{7}$$

are initialized to be unnormalized Multivariate Normal density functions in $u_i$. Then, for each $1 \le j \le n_i$, the update involves minimization of

$$\text{KL}\left( p(y_{ij}|u_i;\beta) \Big\{ \prod_{j' \ne j}^{n_i} \underset{\sim}{p}(y_{ij'}|u_i;\beta) \Big\} p(u_i;\Sigma) \;\Big\|\; \Big\{ \prod_{j'=1}^{n_i} \underset{\sim}{p}(y_{ij'}|u_i;\beta) \Big\} p(u_i;\Sigma) \right) \tag{8}$$

as a function of $\eta_{ij}$. Noting that this problem has the form (5), Theorem 1 can be used to perform the update explicitly in the case of a probit link. This procedure is then iterated until the $\eta_{ij}$ converge.

A convenient way to keep track of the updates and compartmentalize the algebra and coding is to call upon the notion of message passing. Minka (2005) shows how to express expectation propagation as a message passing algorithm in the Bayesian graphical models context, culminating in his equation (54) and (83) update formulae. Exactly the same formulae arise here, as is made clear in Section S.2 of the online supplement. In particular, in keeping with (83) of Minka (2005), (8) can be expressed as

$$m_{p(y_{ij}|u_i;\beta) \to u_i}(u_i) \longleftarrow \frac{\text{proj}\big[\, m_{u_i \to p(y_{ij}|u_i;\beta)}(u_i)\, p(y_{ij}|u_i;\beta) \,\big](u_i)}{m_{u_i \to p(y_{ij}|u_i;\beta)}(u_i)}, \quad 1 \le j \le n_i, \tag{9}$$

where $m_{p(y_{ij}|u_i;\beta) \to u_i}$ is the message passed from the factor $p(y_{ij}|u_i;\beta)$ to the stochastic node $u_i$ and $m_{u_i \to p(y_{ij}|u_i;\beta)}$ is the message passed from $u_i$ back to $p(y_{ij}|u_i;\beta)$. The message passed from $p(u_i;\Sigma)$ to $u_i$ is

$$m_{p(u_i;\Sigma) \to u_i}(u_i) \longleftarrow \frac{\text{proj}\big[\, m_{u_i \to p(u_i;\Sigma)}(u_i)\, p(u_i;\Sigma) \,\big](u_i)}{m_{u_i \to p(u_i;\Sigma)}(u_i)}. \tag{10}$$

In keeping with equation (54) of Minka (2005), the stochastic node to factor messages are updated according to

$$m_{u_i \to p(y_{ij}|u_i;\beta)}(u_i) = m_{p(u_i;\Sigma) \to u_i}(u_i) \Big\{ \prod_{j' \ne j}^{n_i} m_{p(y_{ij'}|u_i;\beta) \to u_i}(u_i) \Big\}, \quad 1 \le j \le n_i, \tag{11}$$

and

$$m_{u_i \to p(u_i;\Sigma)}(u_i) = \prod_{j=1}^{n_i} m_{p(y_{ij}|u_i;\beta) \to u_i}(u_i). \tag{12}$$

As laid out at the end of Section 6 of Minka (2005), the expectation propagation message passing protocol is:

• Initialize all factor to stochastic node messages.

• Cycle until all factor to stochastic node messages converge:

  • For each factor:

    • Compute the messages passed to the factor using (11) or (12).

    • Compute the messages passed from the factor using (9) or (10).

Upon convergence, the expectation propagation approximation to $\ell_i(\beta,\Sigma)$ is

$$\underset{\sim}{\ell}_i(\beta,\Sigma) = \log \int_{\mathbb{R}^{d_R}} \Big\{ \prod_{j=1}^{n_i} m_{p(y_{ij}|u_i;\beta) \to u_i}(u_i) \Big\} \, m_{p(u_i;\Sigma) \to u_i}(u_i)\, du_i, \tag{13}$$

where the integrand is in keeping with the general form given by (44) of Minka & Winn (2008). The success of expectation propagation hinges on the fact that each of the messages in (13) is an unnormalized Multivariate Normal density function and the integral over $\mathbb{R}^{d_R}$ can be obtained exactly as follows:

$$\begin{aligned}
\int_{\mathbb{R}^{d_R}} \Big\{ \prod_{j=1}^{n_i} & m_{p(y_{ij}|u_i;\beta) \to u_i}(u_i) \Big\} \, m_{p(u_i;\Sigma) \to u_i}(u_i)\, du_i \\
&= \int_{\mathbb{R}^{d_R}} \Bigg[ \prod_{j=1}^{n_i} \exp\left\{ \begin{bmatrix} 1 \\ u_i \\ \text{vech}(u_i u_i^T) \end{bmatrix}^T \eta_{p(y_{ij}|u_i;\beta) \to u_i} \right\} \Bigg] \exp\left\{ \begin{bmatrix} 1 \\ u_i \\ \text{vech}(u_i u_i^T) \end{bmatrix}^T \eta_{p(u_i;\Sigma) \to u_i} \right\} du_i \\
&= (2\pi)^{d_R/2} \exp\Big\{ \big(\eta_\Sigma + \text{SUM}\{\eta_{p(y_i|u_i;\beta) \to u_i}\}\big)_0 + A_N\big(\big(\eta_\Sigma + \text{SUM}\{\eta_{p(y_i|u_i;\beta) \to u_i}\}\big)_{-0}\big) \Big\}
\end{aligned}$$

where

$$\eta_\Sigma \equiv \begin{bmatrix} -\tfrac{1}{2}\log|2\pi\Sigma| \\ 0_{d_R} \\ -\tfrac{1}{2} D_{d_R}^T \text{vec}(\Sigma^{-1}) \end{bmatrix}, \quad \text{SUM}\{\eta_{p(y_i|u_i;\beta) \to u_i}\} \equiv \sum_{j=1}^{n_i} \eta_{p(y_{ij}|u_i;\beta) \to u_i},$$

$A_N$ is as defined in Definition 1 and, for an unnormalized Multivariate Normal natural parameter vector $\eta$, $(\eta)_0$ denotes the first entry (the zero subscript is indicative of the first entry being the coefficient of the constant $1$) and $(\eta)_{-0}$ denotes the remaining entries.

The full algorithm for expectation propagation approximation of $\ell(\beta,\Sigma)$ is summarized as Algorithm 1. The derivational details are given in Section S.2.

We have carried out extensive simulated data tests on Algorithm 1 using the starting values described in Section 3.3 and found convergence to be rapid. Moreover, each of the updates in Algorithm 1 involves explicit calculations, and the low-level language implementation used in our R package glmmEP affords very fast evaluation of the approximate log-likelihood surface. As explained in Section 3.4, quasi-Newton methods can be used for maximization of $\underset{\sim}{\ell}(\beta,\Sigma)$ and approximate likelihood-based inference.
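To make the message passing scheme concrete, the following self-contained sketch implements the protocol for the simplest $d_R = 1$ (random intercept) probit case, with sites held in natural parameter form and updated by exact one-dimensional moment matching. It is our own illustration of the ideas, not a transcription of Algorithm 1:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import log_ndtr

def zeta1(x):
    # zeta'(x) = phi(x)/Phi(x), computed on the log scale for stability
    return np.exp(norm.logpdf(x) - log_ndtr(x))

def ep_group_loglik(y, f, sigma2, n_sweeps=50):
    """EP approximation of one summand l_i for a random-intercept probit model:
    l_i = log int prod_j Phi((2y_j - 1)(f_j + u)) N(u; 0, sigma2) du,
    where f_j = beta'x_ij is the fixed effects part of the linear predictor."""
    y, f = np.asarray(y, float), np.asarray(f, float)
    s = 2.0 * y - 1.0
    c0, c1 = s * f, s                        # factor_j(u) = Phi(c0_j + c1_j u)
    eta = np.zeros((len(y), 3))              # site natural parameters (eta0, eta1, eta2)
    prior = np.array([-0.5 * np.log(2 * np.pi * sigma2), 0.0, -0.5 / sigma2])
    for _ in range(n_sweeps):
        for j in range(len(y)):
            k = prior + eta.sum(axis=0) - eta[j]        # cavity: all messages but site j
            m, v = -k[1] / (2 * k[2]), -1.0 / (2 * k[2])
            z = (c0[j] + c1[j] * m) / np.sqrt(1 + c1[j] ** 2 * v)
            # zeroth-, first- and second-order moment matching (the projection)
            mhat = m + v * c1[j] * zeta1(z) / np.sqrt(1 + c1[j] ** 2 * v)
            vhat = v - (v * c1[j]) ** 2 * zeta1(z) * (z + zeta1(z)) / (1 + c1[j] ** 2 * v)
            log_mass = k[0] + 0.5 * np.log(2 * np.pi * v) + 0.5 * m * m / v + log_ndtr(z)
            proj = np.array([log_mass - 0.5 * np.log(2 * np.pi * vhat)
                             - 0.5 * mhat ** 2 / vhat, mhat / vhat, -0.5 / vhat])
            eta[j] = proj - k                           # message update, as in (9)
    g = prior + eta.sum(axis=0)                         # product of converged messages
    # exact integral of the resulting unnormalized Gaussian, as in (13)
    return g[0] + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(-2 * g[2]) - g[1] ** 2 / (4 * g[2])
```

For small groups this can be checked directly against one-dimensional quadrature of (6), which is how the sketch was validated.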

### 3.3 Recommended Starting Values for Algorithm 1

In Section S.3 we use a Taylor series argument to justify the following starting values for the factor to stochastic node messages in Algorithm 1:

$$\eta^{\text{start}}_{p(y_{ij}|u_i;\beta) \to u_i} \equiv \begin{bmatrix} 0 \\ (2y_{ij}-1)\zeta'(\hat{a}_{ij})\, x^R_{ij} - \zeta''(\hat{a}_{ij})\, x^R_{ij} (x^R_{ij})^T \hat{u}_i \\ \tfrac{1}{2}\zeta''(\hat{a}_{ij})\, D_{d_R}^T \text{vec}\big(x^R_{ij} (x^R_{ij})^T\big) \end{bmatrix}, \quad 1 \le j \le n_i, \tag{14}$$

where

$$\hat{a}_{ij} \equiv (2y_{ij}-1)(\beta^T x^F_{ij} + \hat{u}_i^T x^R_{ij})$$

and $\hat{u}_i$ is a prediction of $u_i$. A convenient choice for $\hat{u}_i$ is that based on Laplace approximation. In the R computing environment the function glmer() in the package lme4 (Bates et al., 2015) provides fast Laplace approximation-based predictions of the $u_i$. In our numerical experiments, we found convergence of the cycle loop of Algorithm 1 to be quite rapid, with convergents of

$$\big(\eta_{p(y_{ij}|u_i;\beta) \to u_i}\big)_{-0} \quad \text{relatively close to} \quad \big(\eta^{\text{start}}_{p(y_{ij}|u_i;\beta) \to u_i}\big)_{-0}.$$

Therefore, we strongly recommend the starting values (14).

### 3.4 Quasi-Newton Optimization and Approximate Inference

Even though Algorithm 1 provides fast approximate evaluation of the probit mixed model likelihood surface, we still need to maximize $\underset{\sim}{\ell}(\beta,\Sigma)$ over $(\beta,\Sigma)$ to obtain the expectation propagation-approximate maximum likelihood estimates. There is also the issue of approximate inference based on Fisher information theory.

Since $\underset{\sim}{\ell}(\beta,\Sigma)$ is defined implicitly via an iterative scheme, differentiation for use in derivative-based optimization techniques is not straightforward. A practical workaround involves the employment of optimization methods such as those of the quasi-Newton variety for which derivatives are approximated numerically. In the R computing environment the function optim() supports several such optimization implementations. The Matlab computing environment (The Mathworks Incorporated, 2018) has similar capabilities via functions such as fminunc(). In the glmmEP package and the examples in Section 4 we use the Broyden-Fletcher-Goldfarb-Shanno quasi-Newton method (Broyden, 1970; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970) with Nelder-Mead starting values. Section 2.2.2.3 of Givens & Hoeting (2005) provides a concise summary of the Broyden-Fletcher-Goldfarb-Shanno method.
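The same two-stage strategy can be sketched with scipy, with a smooth toy objective standing in for $-\underset{\sim}{\ell}$ (which Algorithm 1 would supply in practice); all names here are our own illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-in for the negative EP-approximate log-likelihood; in glmmEP this
# objective would be evaluated by Algorithm 1 at each optimizer iterate.
def neg_loglik(par):
    beta, theta = par
    return (beta - 0.3) ** 2 + np.exp(theta) - theta   # minimized at (0.3, 0)

# Nelder-Mead starting values, then BFGS with numerically approximated gradients
start = minimize(neg_loglik, x0=np.zeros(2), method="Nelder-Mead").x
fit = minimize(neg_loglik, x0=start, method="BFGS")

# The accumulated inverse Hessian approximation yields rough standard errors
se = np.sqrt(np.diag(fit.hess_inv))
ci = np.column_stack([fit.x - 1.96 * se, fit.x + 1.96 * se])
```

The derivative-free Nelder-Mead stage guards against rough starting regions, while the BFGS stage supplies the Hessian approximation used for approximate confidence intervals.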

Since $\Sigma$ is constrained to be symmetric and positive definite, we instead perform quasi-Newton optimization over the unconstrained parameter vector $(\beta, \theta)$ where

$$\theta \equiv \text{vech}\big(\tfrac{1}{2}\log(\Sigma)\big)$$

and $\log(\Sigma)$ is the matrix logarithm of $\Sigma$ (e.g. Section 2.2 of Pinheiro & Bates, 2000). Note that $\log(\Sigma)$ can be obtained using

$$\log(\Sigma) = U_\Sigma \,\text{diag}\{\log(\lambda_\Sigma)\}\, U_\Sigma^T \quad \text{where} \quad \Sigma = U_\Sigma \,\text{diag}(\lambda_\Sigma)\, U_\Sigma^T$$

is the spectral decomposition of $\Sigma$ and $\log(\lambda_\Sigma)$ denotes element-wise application of the logarithm to the entries of $\lambda_\Sigma$. If $(\hat{\underset{\sim}{\beta}}, \hat{\underset{\sim}{\theta}})$ is the maximizer of $\underset{\sim}{\ell}$ then the expectation propagation-approximate maximum likelihood estimate of $\Sigma$ is

$$\hat{\underset{\sim}{\Sigma}} = U_{\hat{\theta}} \,\text{diag}\{\exp(2\lambda_{\hat{\theta}})\}\, U_{\hat{\theta}}^T \quad \text{where} \quad \text{vech}^{-1}(\hat{\underset{\sim}{\theta}}) = U_{\hat{\theta}} \,\text{diag}(\lambda_{\hat{\theta}})\, U_{\hat{\theta}}^T$$

is the spectral decomposition of $\text{vech}^{-1}(\hat{\underset{\sim}{\theta}})$.
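The parameterization and its inverse can be sketched as follows (our illustration, assuming the column-stacked vech convention of Section 3):

```python
import numpy as np

def matlog_param(Sigma):
    """theta = vech(log(Sigma)/2), via the spectral decomposition of Sigma."""
    lam, U = np.linalg.eigh(Sigma)
    L = U @ np.diag(0.5 * np.log(lam)) @ U.T   # (1/2) log(Sigma), symmetric
    i, j = np.triu_indices(L.shape[0])
    return L[i, j]                             # column-stacked lower triangle

def matlog_unparam(theta, d):
    """Recover Sigma = exp(2 * vech^{-1}(theta)) via the matrix exponential."""
    M = np.zeros((d, d))
    i, j = np.triu_indices(d)
    M[j, i] = theta                            # fill lower triangle column by column
    M = M + M.T - np.diag(np.diag(M))          # symmetrize
    lam, U = np.linalg.eigh(M)
    return U @ np.diag(np.exp(2.0 * lam)) @ U.T
```

Any $\theta \in \mathbb{R}^{d_R(d_R+1)/2}$ maps to a symmetric positive definite matrix under `matlog_unparam`, which is precisely why the quasi-Newton search can proceed unconstrained.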