 # Fast Monte Carlo Markov chains for Bayesian shrinkage models with random effects

When performing Bayesian data analysis using a general linear mixed model, the resulting posterior density is almost always analytically intractable. However, if proper conditionally conjugate priors are used, there is a simple two-block Gibbs sampler that is geometrically ergodic in nearly all practical settings, including situations where p > n (Abrahamsen and Hobert, 2017). Unfortunately, the (conditionally conjugate) multivariate normal prior on β does not perform well in the high-dimensional setting where p ≫ n. In this paper, we consider an alternative model in which the multivariate normal prior is replaced by the normal-gamma shrinkage prior developed by Griffin and Brown (2010). This change leads to a much more complex posterior density, and we develop a simple MCMC algorithm for exploring it. This algorithm, which has both deterministic and random scan components, is easier to analyze than the more obvious three-step Gibbs sampler. Indeed, we prove that the new algorithm is geometrically ergodic in most practical settings.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The general linear mixed model (or variance components model) is one of the most frequently applied statistical models. It takes the form

 Y=Xβ+m∑i=1Ziui+e,

where is an observable

data vector,

and are known matrices, is an unknown vector of regression coefficients, are independent random vectors whose elements represent the various levels of the random factors in the model, and . The random vectors and are independent, and , where is , , and . (We assume throughout that , and that for each .) For a book-length treatment of this model and its many applications, see McCulloch et al. (2008).

In the Bayesian setting, prior distributions are assigned to and . Unfortunately, any non-trivial prior leads to an intractable posterior density. However, if and are assigned conditionally conjugate priors, then a simple two-block Gibbs sampler can be used to explore the resulting posterior density. In particular, if we assign a multivariate normal prior to , and independent gamma priors to the precision parameters, then, letting , it is easily shown that given observed data , is multivariate normal, and is a product of independent gammas. (Since is unobservable, it is treated like a parameter.) Convergence rate results for this block Gibbs sampler can be found in Abrahamsen and Hobert (2017).

Now consider this Bayesian mixed model in the high-dimensional setting where . This situation can arise, e.g., in genetics and neuroscience where variability between subjects is most appropriately handled with random effects (see, e.g., Fazli et al., 2011; Rohart et al., 2014). While the model described above could certainly be used in this setting, the multivariate normal prior on is really not suitable. Indeed, when , it is often assumed that is sparse, i.e., that many components of are zero. Unfortunately, the multivariate normal prior for will shrink

the estimated coefficients towards zero, but not enough to produce an (approximately) sparse estimate of

. Additionally, when the components of have varying magnitudes, the estimates of the “large” components will be shrunk disproportionately compared to the estimates of the “small” components. Below we propose an alternative prior for that is tailored to the high-dimensional setting.

The well-known Bayesian interpretation of the lasso (involving iid Laplace priors for the regression parameters) has led to a flurry of recent research concerning the development of prior distributions for regression parameters (in linear models without

random effects) that yield posterior distributions with high posterior probability around sparse values of

. These prior distributions are called continuous shrinkage priors and the corresponding statistical models are referred to as Bayesian shrinkage models (see, e.g., Bhattacharya et al. (2013, 2015), Griffin and Brown (2010), Polson and Scott (2010), and Park and Casella (2008)). One such Bayesian shrinkage model is the so-called normal-gamma model of Griffin and Brown (2010), which is given by

 Y|β,τ,λ0 ∼Nn(Xβ,λ−10In) (1) β|τ,λ0 ∼Np(0,λ−10Dτ),

where and is a diagonal matrix with the s on the diagonal. The precision parameter, , and the components of are assumed to be a priori independent with and for . When , this model becomes the Bayesian lasso model introduced by Park and Casella (2008). We note that Bhattacharya et al. (2013, 2015) show that, in terms of frequentist optimality, the Bayesian lasso has sub-optimal prior concentration rates in that it does not place sufficient mass around sparse values of . Alternatively, shrinkage priors that have singularities at zero and robust tails (such as in the normal-gamma model with ), have been shown to perform well in empirical studies.

In this paper, we propose and analyze an MCMC algorithm for a new Bayesian general linear mixed model in which the standard multivariate normal prior on is replaced with the continuous shrinkage prior from the normal-gamma model. Our high-dimensional Bayesian general linear mixed model is defined as follows

 Y|β,u,τ,λ ∼Nn(Xβ+m∑i=1Ziui,λ−10In) (2) β|u,τ,λ ∼Np(0,λ−10Dτ) u|τ,λ ∼Nq(0,Λ−1),

where and are a priori independent with , for , and for . This model can be considered a Bayesian analog of the frequentist, high dimensional mixed model developed by Schelldorfer et al. (2011). (Of course, it can also be viewed as a mixed version of the normal-gamma shrinkage model.) A similar sparse Bayesian linear mixed model has been proposed by Zhou et al. (2013)

for polygenic modeling. They assume a “spike and slab” prior consisting of a mixture of a point mass at 0 and a normal distribution for the components of

. However, it is well-known that spike and slab priors lead to MCMC algorithms that have convergence problems, especially when is large (Polson and Scott (2010); Bhattacharya et al. (2015)).

Recall that , and let denote the posterior density associated with model (2

). This density is highly intractable and Bayesian inference requires MCMC, which should, of course, be based on a geometrically ergodic Monte Carlo Markov chain

(see, e.g. Roberts and Rosenthal, 1998; Jones and Hobert, 2001; Flegal et al., 2008). As we show in Section 2, the full conditional densities , , and all have standard forms, which means that there is a simple three-block Gibbs sampler available. Unfortunately, we have been unable to establish a convergence rate for this Gibbs sampler (in either deterministic or random scan form). However, we have been able to prove that a related hybrid algorithm does converge at a geometric rate. The invariant density of our Markov chain is . Let be fixed, and denote the Markov chain by . If the current state is , then we simulate the new state, , using the following three-step procedure.

Iteration of the hybrid algorithm:

1. Draw , and, independently, .

2. If , set where

3. Otherwise, set where

At each iteration, this sampler first performs a deterministic update of from its full conditional distribution. Next, a random scan update is performed, which updates either or with probability and , respectively. This sampler is no more difficult to implement than the three-block Gibbs sampler. Moreover, it is straightforward to show that the Markov chain driving this algorithm is reversible with respect to , and that it is Harris ergodic. This algorithm is actually a special case of a more general MCMC algorithm for Bayesian latent data models that was recently developed by Jung (2015).

Our main result provides a set of conditions under which the hybrid Markov chain defined above is geometrically ergodic (as defined by Jones and Hobert (2001, p.319)). Here is the result.

###### Proposition 1.

The Markov chain, , is geometrically ergodic for all if

1. has full column rank,

2. , and

3. for each .

Note that the conditions of Proposition 1 are quite simple to check. We do require to be full column rank, which holds for most basic designs, but there is no such restriction on , so the result is applicable when . While the second condition may become restrictive when

, this can be mitigated to some extent by the fact that the user if free to choose any positive value for the hyperparameter

. Indeed, when a large value of is required to satisfy condition (2), can be chosen such that the prior mean and variance of (which are given by and , respectively) have reasonable values.

Abrahamsen and Hobert (2017) established simple, easily-checked sufficient conditions for geometric ergodicity of the two-block Gibbs sampler for the standard Bayesian general linear mixed model that assigns a multivariate normal prior to . Note, however, that unlike the multivariate normal prior, the continuous shrinkage prior that we use here is hierarchical, which means that an additional set of latent variables (the s) must be managed, and this leads to more complicated MCMC algorithms that are more difficult to analyze. On a related note, Pal and Khare (2014) obtained geometric ergodicity results for the Gibbs sampler developed for the original normal-gamma model (without random effects). But again, adding random effects to the normal-gamma model leads to new MCMC algorithms that are more complex and harder to handle.

The remainder of this paper is organized as follows. Section 2 contains a formal definition of the Markov chain that drives our hybrid sampler. A proof of Proposition 1 is given in Section 3. Finally, a good deal of technical material, such as proofs of the lemmas that are used in the main proof, has been relegated to the Appendices.

## 2 The Hybrid Sampler

In this section, we formally define the Markov transition function (Mtf) of the hybrid algorithm. We begin with a brief derivation of the conditional densities, , . Let , , and recall that . The posterior density can be expressed up to a constant of proportionality as

 π(θ,τ,λ|y) ∝π(y|β,u,τ,λ)π(β|τ,λ)π(u|τ,λ)π(τ)π(λ) (3) ∝λn/20exp{−λ02(y−Wθ)T(y−Wθ)} ×λp/20[p∏j=1τ−1/2j]exp{−λ02βTD−1τβ} ×[m∏i=1λqi/2i]exp{−12uTΛu} ×[p∏j=1τc−1je−dτjIR+(τj)][m∏i=0λai−1ie−biλiIR+(λi)].

We will use (3) to derive the full conditional distributions of , and . First, it is shown in the Appendix that the full conditional distribution of is multivariate normal with

 E[θ|τ,λ,y]=[λ0T−1λ,τXTy−λ20T−1λ,τXTZQ−1λ,τZTMλ,τyλ0Q−1λ,τZTMλ,τy], (4)

and

 Var[θ|τ,λ,y]=[T−1λ,τ+λ20T−1λ,τXTZQ−1λ,τZTXT−1λ,τ−λ0T−1λ,τXTZQ−1λ,τ−λ0Q−1λ,τZTXT−1λ,τQ−1λ,τ], (5)

where , , and .

Next, it’s clear from (3) that the components of

are conditionally independent, and that each has a gamma distribution. Indeed,

 π2(λ0|θ,τ,y)∝λn/2+p/2+a0−10e−λ0(||y−Wθ||22+βTD−1τβ2+b0)IR+(λ0), (6)

and, for ,

 π2(λi|θ,τ,y)∝λqi/2+ai−1ie−λi(||ui||22+bi)IR+(λi). (7)

Lastly,

 π3(τ|θ,λ,y) =p∏j=1τ(c−1/2)−1jexp{−12(λ0β2jτj+2dτj)}IR+(τj).

Thus, the s are conditionally independent, and

 π3(τj|θ,λ,y)∝τ(c−1/2)−1je−12⎛⎝λ0β2jτj+2dτj⎞⎠IR+(τj). (8)

This brings us to a subtle technical problem that turns out to be very important in our convergence rate analysis. Note that, if and , then the right-hand side of (8) is not integrable. Moreover, it is precisely these values of that yield effective shrinkage priors. Of course, from a simulation standpoint, this technical problem is a non-issue because we will never observe an exact zero from the normal distribution. However, in order to perform a theoretical analysis of the Markov chain, we are obliged to define the Mtf for all points in the state space. Our solution is to simply delete the offending points from the state space. (Alternatively, we could make a special definition of the Mtf at the offending points, but this leads to a Markov chain that lacks the Feller property (Meyn and Tweedie, 2009, p.124), and this prevents us from employing Meyn and Tweedie’s (2009) Lemma 15.2.8.) Thus, we define the state space of our Markov chain to be

 X={(θ,λ)∈Rp+q×Rm+1+:|βj|>0for j=1,…,p}.

Taking the state space to be instead of has no effect on posterior inference because the difference between these two sets is a set of measure zero. However, as will become clear in Section 3, the deleted points do create some complications in the drift analaysis. We note that this particular technical issue has surfaced before (see, e.g., Román and Hobert, 2012), although, in contrast with the current situation, the culprit is typically improper priors.

It’s clear from (8) that conditional on , and , the distribution of is , where

denotes the Generalized Inverse Gaussian distribution with parameters

, , and . The density is given by

 fGIG(x;ζ,ξ,ψ)=ξζ/22ψζ/2Kζ(√ξψ)xζ−1e−12(ξx+ψx)IR+(x), (9)

where denotes the modified Bessel function of the second kind.

Fix , and let denote a measurable set in . The Mtf of the Markov chain that drives our hybrid algorithm is given by

 P((θ,λ),A) =r∫Rp+qIA(~θ,λ)[∫Rp+π(~θ|τ,λ)π(τ|θ,λ)dτ]d~θ +(1−r)∫Rm+IA(θ,~λ)[∫Rp+π(~λ|θ,τ)π(τ|θ,λ)dτ]d~λ,

where we (henceforth) omit the dependence on from the conditional densities for notational convenience. The Appendix contains a proof that the Markov chain defined by is a Feller chain. In the next section, we prove Proposition 1.

## 3 Proof of Proposition 1

This section contains a proof of Proposition 1. In particular, we establish a geometric drift condition for our Markov chain, , using a drift function, , that is unbounded off compact sets. Since our chain is Feller, geometric ergodicity then follows immediately from Meyn and Tweedie’s (2009) Theorem 6.0.1 and Lemma 15.2.8.

### 3.1 The drift function

Recall that in our model, . The hyperparameter will play a crucial role in our drift function. Define as

 ν(c)=cI(0,12](c)+min{1/2,2c−1}I(12,∞)(c).

Now let , , and define the drift function as follows

 v(θ,λ) =α1\normy−Wθ2+α2\normβ2 (10) +p∑j=11\absβjν(c)+m∑i=1δi\normui2 +α3λ0+α4λν(c)/20+λ−10+m∑i=1λi+m∑i=1ηiλ−1i,

where and . The values of these constants are to be determined. The Appendix contains a proof that is unbounded off compact sets.

###### Remark 1.

Using instead of as the state space has implications for the construction of the drift function. In particular, since the hyper-planes where are not part of , and we need the drift function to be unbounded off compact sets, we must have a term in the drift function that diverges as . In fact, this is the only reason why the term is part of . Interestingly, Pal and Khare (2014) established their convergence rate results using a drift/minorization argument, which does not require the drift function to be unbounded off compact sets, yet they still require this same term in their drift function. Fortunately, we are able to reuse some of the bounds that they developed.

Our goal is to demonstrate that

 E[v(~θ,~λ)|θ,λ]≤ρv(θ,λ)+L, (11)

for some and some finite constant . First, note that

 E[v(~θ,~λ)|θ,λ] =r∫Rp+qv(~θ,λ)[∫Rp+π(~θ|τ,λ)π(τ|θ,λ)dτ]d~θ +(1−r)∫Rm+v(θ,~λ)[∫Rp+π(~λ|θ,τ)π(τ|θ,λ)dτ]d~λ =rE[E[v(~θ,λ)|τ,λ]∣∣∣θ,λ]+(1−r)E[E[v(θ,~λ)|τ,θ]∣∣∣θ,λ].

It follows that

 E[v(~θ,~λ)|θ,λ] =rE[E[α1\normy−W~θ2+α2\norm~β2 (12) +p∑j=11\abs~βjν(c)+m∑i=1δi\norm~ui2∣∣τ,λ⎤⎥⎦∣∣∣θ,λ⎤⎥⎦ +r(α3λ0+α4λν(c)/20+λ−10+m∑i=1λi+m∑i=1ηiλ−1i) +(1−r)(α1\normy−Wθ2+α2\normβ2+p∑j=11\absβjν(c)+m∑i=1δi\normui2) +(1−r)E[E[α3~λ0+α4~λν(c)/20+~λ−10+m∑i=1~λi+m∑i=1ηi~λ−1i∣∣τ,θ]∣∣∣θ,λ].
###### Remark 2.

The hybrid sampler employs three full conditional densities, yet the expression above contains only two nested expectations. This is due to the random scan step in the hybrid sampler. In contrast, a drift analysis of the deterministic scan Gibbs sampler for this problem involves similar equations having three nested expectations, which greatly complicates the calculations. On the other hand, a drift analysis of the random scan Gibbs sampler involves no nested expectations, which suggests that such an analysis would be relatively easy. However, it turns out to be more difficult than the analysis of the hybrid algorithm.

Now define

 h(~θ)=α1\normy−W~θ2+α2\norm~β2+p∑j=11\abs~βjν(c)+m∑i=1δi\norm~ui2,

and

 g(~λ)=α3~λ0+α4~λν(c)/20+~λ−10+m∑i=1~λi+m∑i=1ηi~λ−1i.

In order to bound (12), we need to develop bounds for and . These will be handled separately in the next two subsections.

### 3.2 A bound on E[E[h(~θ)∣∣τ,λ]∣∣θ,λ]

Let

 Ri=[0qi×q1…0qi×qi−1 Iqi×qi 0qi×qi+1…0qi×qr],

so that for . It’s easy to see that

 E[\normy−W~θ2∣∣τ,λ] =tr(WVar[~θ|τ,λ]WT)+\normy−WE[~θ|τ,λ]2, (13) E[\norm~β2∣∣τ,λ] =tr(Var[~β|τ,λ])+\normE[~β|τ,λ]2,and E[\norm~ui2∣∣τ,λ] =tr(RiQ−1λ,τRTi)+\normE[~ui|τ,λ]2.

Hence,

 E[E[h(~θ)∣∣τ,λ]∣∣∣θ,λ] =α1E[tr(WVar[~θ|τ,λ]WT)+\normy−WE[~θ|τ,λ]2∣∣θ,λ] (14) +α2E[tr(Var[~β|τ,λ])+\normE[~β|τ,λ]2∣∣θ,λ] +E⎡⎢⎣E⎡⎢⎣p∑j=11\abs~βjν(c)∣∣∣τ,λ⎤⎥⎦∣∣∣θ,λ⎤⎥⎦ +m∑i=1δiE[tr(RiQ−1λ,τRTi)+\normE[~ui|τ,λ]2∣∣θ,λ].

We will bound (14) using the following lemmas, which are proven in the Appendix.

For all and ,

• ,   and

• ,

• .

###### Lemma 2.

For all and ,

 \normy−WE[θ|τ,λ]2≤2n\normy2+2n3\normy2.
###### Lemma 3.

For all and ,

• ,   and

• .

where

is the largest singular value of

and is a finite positive constant.

###### Lemma 4.

For all and ,

where

 κ(c):=Γ(1−ν(c)2)21−ν(c)2√2π,

and is the largest singular value of .

###### Lemma 5.

Assume that has full column rank. For all , and ,

• ,   and

• .

Substituting into (14) gives

 E[E[h(~θ)∣∣τ,λ] ∣∣∣θ,λ]≤α1rank(X)λ−10+α1tr(ZZT)m∑i=1λ−1i (15) +α2λ−10p∑j=1E[τj∣∣θ,λ]+α2c∗tr(ZZT)m∑i=1λ−1i +α2c∗n2\normy2s2maxp∑j=1E[τj∣∣θ,λ] +pκ(c)sν(c)maxλν(c)/20 +κ(c)λν(c)/20p∑j=1E⎡⎢⎣1τν(c)/2j∣∣∣θ,λ⎤⎥⎦ +m∑i=1δiqiλ−1i +maxi{δi}qtr[(ZTZ)−1]\normy2n3s2maxp∑j=1E[τj|θ,λ] +K0(α1,α2,δ),

where

 K0(α1,α2,δ):=2α1(n\normy2+n3\normy2)+α2c∗n2\normy2+tr[(ZTZ)−1]\normy2n3m∑i=1δiqi.

### 3.3 A bound on E[E[g(~λ)∣∣τ,θ]∣∣θ,λ]

From (6), we have

 E[λ0|θ,τ] =Γ(n+p+2a0+22)Γ(n+p+2a02)(||y−Wθ||2+βTD−1τβ+2b02)−1 (16) ≤(n+p+2a0)b−10,

and

 E[λ−10|θ,τ] =Γ(n+p+2a0−22)Γ(n+p+2a02)(\normy−Wθ2+βTD−1τβ+2b02) (17) =1n+p+2a0−2(\normy−Wθ2+βTD−1τβ+2b0).

Also, from Jensen’s inequality and (16),

 E[λν(c)/20∣∣τ,θ]≤E[λ0|τ,θ]ν(c)/2≤[(n+p+2a0)b−10]ν(c). (18)

Similarly, from (7), for we have

 E[λi|θ,τ]=Γ(qi+2ai+22)Γ(qi+2ai2)(\normui2+2bi2)−1≤(qi+2ai)b−1i, (19)

and

 E[λ−1i|θ,τ] =Γ(qi+2ai−22)Γ(qi+2ai2)(\normui2+2bi2) (20) =1qi+2ai−2(\normui2+2bi).

Conditions (2) and (3) of Proposition 1 imply that all of the above Gamma functions have positive arguments.

From (16)-(20), we have

 E[E[g(~λ)∣∣τ,θ]∣∣∣θ,λ] ≤1n+p+2a0−2\normy−Wθ2 (21) +1n+p+2a0−2p∑j=1β2jE[τ−1j|θ,λ