A Hybrid Alternative to Gibbs Sampling for Bayesian Latent Variable Models

08/27/2018 · by Grant Backlund, et al.

Gibbs sampling is a widely popular Markov chain Monte Carlo algorithm which is often used to analyze intractable posterior distributions associated with Bayesian hierarchical models. The goal of this article is to introduce an alternative to Gibbs sampling that is particularly well suited for Bayesian models which contain latent or missing data. The basic idea of this hybrid algorithm is to update the latent data from its full conditional distribution at every iteration, and then use a random scan to update the parameters of interest. The hybrid algorithm is often easier to analyze from a theoretical standpoint than the deterministic or random scan Gibbs sampler. We highlight a positive result in this direction from Abrahamsen and Hobert (2018), who proved geometric ergodicity of the hybrid algorithm for a Bayesian version of the general linear mixed model with a continuous shrinkage prior. The convergence rate of the Gibbs sampler for this model remains unknown. In addition, we provide new geometric ergodicity results for the hybrid algorithm and the Gibbs sampler for two classes of Bayesian linear regression models with non-Gaussian errors. In both cases, the conditions under which the hybrid algorithm is geometric are much weaker than the corresponding conditions for the Gibbs sampler. Finally, we show that the hybrid algorithm is amenable to a modified version of the sandwich methodology of Hobert and Marchev (2008), which can be used to speed up the convergence rate of the underlying Markov chain while requiring roughly the same computational effort per iteration.


1 Introduction

Let $\pi(x)$ be an intractable target density, and suppose that $f(x, z)$ is a joint density whose $x$-marginal is the target; i.e., $\int f(x, z)\,dz = \pi(x)$. Think of $x$ as the parameters in a Bayesian model, and $z$ as latent data. If straightforward sampling from the associated conditional densities $f(x \mid z)$ and $f(z \mid x)$ is possible, then we can use the data augmentation (DA) algorithm to explore $\pi$. Of course, running the algorithm entails alternating between draws from $f(z \mid x)$ and $f(x \mid z)$, which simulates the Markov chain whose Markov transition density (Mtd) is

$$k(x' \mid x) = \int f(x' \mid z)\, f(z \mid x)\, dz.$$

It's easy to see that $\pi(x)\, k(x' \mid x)$ is symmetric in $(x, x')$, so the DA Markov chain is reversible.

Unfortunately, there are many situations in which useful latent data exist, but the DA algorithm is not directly applicable. Specifically, it is often the case that it is possible to draw from $f(z \mid x)$, but it is not possible to draw from $f(x \mid z)$. On the other hand, in such cases, one can sometimes break $x$ into two pieces, $x = (u, v)$, where $u \in \mathsf{U}$ and $v \in \mathsf{V}$, in such a way that one is able to draw from $f(u \mid v, z)$ and from $f(v \mid u, z)$. If so, we can simulate a Markov chain according to the Mtd given by

$$\tilde{k}(u', v' \mid u, v) = \int f(v' \mid u', z)\, f(u' \mid v, z)\, f(z \mid u, v)\, dz. \qquad (1)$$

This chain still has the correct invariant density, but it is not reversible.

In this paper, we introduce an alternative to (1) that we call the hybrid algorithm. Because this new algorithm employs a random scan, the resulting Markov chain is reversible, which facilitates theoretical comparisons. Fix $r \in (0, 1)$ to play the role of the selection probability behind the random scan. Consider a Markov chain $\{(U_m, V_m)\}_{m=0}^{\infty}$ with state space $\mathsf{U} \times \mathsf{V}$ that evolves as follows. If the current state is $(U_m, V_m) = (u, v)$, then we simulate the new state, $(U_{m+1}, V_{m+1})$, using the following two-step procedure.

  Iteration m+1 of the hybrid algorithm:

  1. Draw $Z \sim f(z \mid u, v)$, call the result $z$, and, independently, draw $W \sim \mathrm{Uniform}(0, 1)$.

    1. If $W \le r$, draw $U_{m+1} \sim f(u \mid v, z)$ and set $V_{m+1} = v$.

    2. Otherwise, if $W > r$, draw $V_{m+1} \sim f(v \mid u, z)$ and set $U_{m+1} = u$.

 


It follows from Proposition 1 in Appendix B that the hybrid algorithm is reversible and has the target density $\pi(u, v)$ as its invariant density. (The result proven in Appendix B is actually for a general algorithm of which the hybrid algorithm is a special case.) Of course, as in the Mtd given by (1), the usual deterministic scan three-variable Gibbs sampler for this situation proceeds by repeatedly drawing from $f(z \mid u, v)$, $f(u \mid v, z)$ and $f(v \mid u, z)$, in that order. Hence, thinking of $z$ as latent data, the deterministic scan Gibbs sampler updates the latent data and both parameters of interest at every iteration. Moreover, the usual random scan Gibbs sampler proceeds by fixing two selection probabilities $p_1, p_2 \in (0, 1)$ so that $p_1 + p_2 < 1$, and using $p_1$, $p_2$ and $1 - p_1 - p_2$ in the obvious way to randomly choose one of $f(z \mid u, v)$, $f(u \mid v, z)$ or $f(v \mid u, z)$ to draw from at each iteration. This means that the random scan Gibbs sampler updates either the latent data or one of the two parameters of interest at each iteration, leaving the others fixed. On the other hand, the hybrid algorithm updates the latent data at every iteration, and then proceeds to update one set of parameters randomly while leaving the other set of parameters fixed.
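To make the generic recipe concrete, here is a minimal sketch (in Python) of a single hybrid iteration. It assumes user-supplied samplers for the three full conditionals; the names draw_z, draw_u, and draw_v are placeholders and are not part of the paper.

```python
import numpy as np

def hybrid_step(u, v, r, draw_z, draw_u, draw_v, rng):
    """One iteration of the hybrid algorithm.

    draw_z(u, v, rng) samples the latent data z ~ f(z | u, v);
    draw_u(v, z, rng) samples u ~ f(u | v, z);
    draw_v(u, z, rng) samples v ~ f(v | u, z).
    """
    z = draw_z(u, v, rng)      # latent data refreshed at every iteration
    if rng.uniform() <= r:     # random scan over the two parameter blocks
        u = draw_u(v, z, rng)  # update u, leave v fixed
    else:
        v = draw_v(u, z, rng)  # update v, leave u fixed
    return u, v

# Usage: iterate hybrid_step(...) with rng = np.random.default_rng(seed).
```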

There are several important advantages of the hybrid algorithm over deterministic scan Gibbs sampling. First, the Markov chain corresponding to the hybrid algorithm is reversible, which allows for the use of certain spectral theoretic techniques that are not applicable to non-reversible chains (see, e.g., Khare and Hobert 2011). Second, if specific information about the target distribution is known, the practitioner may vary the selection probability $r$ to cause one set of parameters to be updated more frequently than the other. Third, as we explain in Section 3, we may use the sandwich methodology of Hobert and Marchev (2008) to add as many as two extra steps to the hybrid algorithm, which can potentially speed up the convergence rate without adding to the computational complexity. Finally, the hybrid algorithm is often easier to analyze from a theoretical standpoint. This means that it is either easier to establish convergence rate results for the hybrid algorithm, or the corresponding results for the Gibbs sampler require stronger assumptions. We present several examples of these phenomena in Sections 2, 4, and 5. Like the hybrid algorithm, the Markov chain driven by the random scan Gibbs sampler is reversible, and its selection probabilities give the practitioner the option to adjust the frequency with which the variables are updated. However, the latter two advantages mentioned above continue to hold for the hybrid algorithm over random scan Gibbs.

The important practical benefits of basing one's MCMC algorithm on a geometrically ergodic Markov chain have been well documented by, e.g., Roberts and Rosenthal (1998), Jones and Hobert (2001), Flegal et al. (2008) and Łatuszyński et al. (2013). Indeed, geometric ergodicity, along with a moment condition, guarantees a central limit theorem for sample means and quantiles based on MCMC output. In addition, it ensures the consistency of various methods for estimating the variance in the asymptotic normal distribution. Hence, it allows one to construct valid asymptotic standard errors for MCMC-based estimators, which is extremely important from a practical perspective. It is for this reason that one might always favor an algorithm that is known to be geometrically ergodic over another algorithm for which there are no known theoretical convergence rate results, even if empirical evidence suggests it could be somewhat slower. See the discussion at the end of Appendix B for a summary of several ways to choose between two MCMC algorithms for the same problem. In the following, we provide several instances where the conditions for geometric ergodicity of the hybrid algorithm are strictly weaker than the corresponding conditions for the Gibbs sampler. In addition, we have found that the performance of the hybrid algorithm tends to fall in between the random and deterministic scan Gibbs samplers in terms of autocorrelation in most empirical studies, with the hybrid algorithm being better than random scan Gibbs, but worse than deterministic scan Gibbs. We illustrate this with several simulated data examples in Section 2.

The remainder of this paper is organized as follows. In Section 2, we present the hybrid algorithm for the general linear mixed model example of Abrahamsen and Hobert (2018). This section also contains simulation results used to compare the empirical performance of the hybrid and Gibbs algorithms. In Section 3, we explain how to add sandwich steps to the hybrid algorithm, resulting in the double sandwich (DS) algorithm. We illustrate by adding a nontrivial sandwich step to the hybrid algorithm for the linear mixed model of Section 2. In Section 4, we describe the hybrid algorithm for the Bayesian linear regression model with scale mixtures of normal errors. The details of the corresponding DS algorithm for a particular mixing distribution can be found in Appendix C. In Section 5, we discuss the hybrid algorithm for a generalization of the model presented in Section 4, namely the Bayesian linear regression model with skew scale mixtures of normal errors. The appendix contains a detailed example of the development of a DS algorithm for a toy model, some important theoretical results for the general DS algorithm, and the proofs of the main convergence rate results stated in later sections.

2 The General Linear Mixed Model with a Continuous Shrinkage Prior

The general linear mixed model takes the form

$$Y = X\beta + Zu + e, \qquad (2)$$

where $Y$ is an $n \times 1$ data vector, $X$ and $Z$ are known matrices with dimensions $n \times p$ and $n \times q$, respectively, $\beta$ is an unknown $p \times 1$ vector of regression coefficients, $u$ is a $q \times 1$ random vector whose elements represent the various levels of the random factors in the model, and the random vectors $u$ and $e$ are independent. Suppose that the model contains $k$ random factors, so that $u$ and $Z$ may be partitioned as $u = (u_1^T, \dots, u_k^T)^T$ and $Z = (Z_1 \, \cdots \, Z_k)$, where $u_i$ is $q_i \times 1$, $Z_i$ is $n \times q_i$, and $q_1 + \cdots + q_k = q$. Then $Zu = Z_1 u_1 + \cdots + Z_k u_k$. It is assumed that $e \sim N_n(0, \lambda_e^{-1} I_n)$ and, for $i = 1, \dots, k$, $u_i \sim N_{q_i}(0, \lambda_{u_i}^{-1} I_{q_i})$, where $\lambda_e$ and the $\lambda_{u_i}$ are unknown precisions. Finally, let $\lambda = (\lambda_e, \lambda_{u_1}, \dots, \lambda_{u_k})$ denote the vector of precision components.

One can describe a Bayesian version of the general linear mixed model by specifying a prior distribution for the unknown parameters $\beta$ and $\lambda$. A popular choice is the proper (conditionally) conjugate prior that takes $\beta$ to be multivariate normal, and takes each of the precision components to be gamma. However, in the increasingly important situation where $p$ is larger than $n$, we may wish to use a so-called Bayesian shrinkage prior on $\beta$ (see, e.g., Griffin and Brown 2010). Indeed, Abrahamsen and Hobert (2018) considered the following Bayesian shrinkage version of the general linear mixed model, which incorporates the normal-gamma prior due to Griffin and Brown (2010): conditional on a vector of latent variances $\tau = (\tau_1, \dots, \tau_p)$, the coefficient vector satisfies $\beta \mid \tau \sim N_p(0, D_\tau)$, where $D_\tau$ is a diagonal matrix with $\tau_1, \dots, \tau_p$ on the diagonal. Finally, all components of $\tau$ and $\lambda$ are assumed a priori independent, with each $\tau_j$ following a gamma distribution and each precision component following a gamma distribution. There is evidence (both empirical and theoretical) suggesting that small values of the shape parameter of the gamma prior on the $\tau_j$ lead to a posterior that concentrates on sparse vectors (Bhattacharya et al. 2012, 2015).
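To illustrate the shrinkage hierarchy, here is a small sketch that draws coefficients from a normal-gamma prior of the kind described above (component variances gamma, coefficients conditionally normal). The hyperparameter values are placeholders, not the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 10                     # number of regression coefficients
shape, rate = 0.25, 1.0    # placeholder hyperparameters; small shape => heavy shrinkage

# Normal-gamma hierarchy: tau_j ~ Gamma(shape, rate), beta_j | tau_j ~ N(0, tau_j)
tau = rng.gamma(shape, 1.0 / rate, size=p)
beta = rng.normal(0.0, np.sqrt(tau), size=p)
print(np.round(beta, 3))   # most coordinates land very close to zero
```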

Define $\theta = (\beta^T, u^T)^T$ and $W = (X \; Z)$, so that $X\beta + Zu = W\theta$. The vector $\tau$ is treated as latent data, and the distribution of interest is the posterior distribution of $(\theta, \lambda)$ given the data, $y$. Here is the full posterior density:

$$\pi(\theta, \lambda, \tau \mid y) \propto f(y \mid \theta, \lambda)\, \pi(\theta \mid \tau, \lambda)\, \pi(\tau)\, \pi(\lambda). \qquad (3)$$

In order to use the standard DA algorithm, we would need to be able to sample from $\pi(\tau \mid \theta, \lambda, y)$ and from $\pi(\theta, \lambda \mid \tau, y)$. The former is not a problem, as we now explain. We write $V \sim \mathrm{GIG}(\nu, a, b)$ to mean that $V$ has a generalized inverse Gaussian distribution with density

$$f(v; \nu, a, b) = \frac{(a/b)^{\nu/2}}{2 K_\nu(\sqrt{ab})}\, v^{\nu - 1} \exp\!\left\{-\frac{1}{2}\left(a v + \frac{b}{v}\right)\right\}, \qquad v > 0, \qquad (4)$$

where $a, b > 0$, $\nu \in \mathbb{R}$, and $K_\nu$ denotes the modified Bessel function of the second kind. Conditional on $(\theta, \lambda, y)$, the components of $\tau$ are independent, and each $\tau_j$ has a GIG full conditional.

Unfortunately, it is not straightforward to make draws from $\pi(\theta, \lambda \mid \tau, y)$. Thus, the standard DA algorithm is not applicable. On the other hand, the conditional density of $\theta$ given $(\lambda, \tau, y)$ is multivariate normal, and, given $(\theta, \tau, y)$, the components of $\lambda$ are independent gammas. Hence, the hybrid algorithm is applicable.
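For reference, draws from a generalized inverse Gaussian distribution can be obtained, for example, through SciPy. The sketch below assumes the three-parameter GIG form in (4) and rescales SciPy's two-parameter version accordingly; the parameter values shown are illustrative placeholders rather than the model's actual full-conditional parameters.

```python
import numpy as np
from scipy.stats import geninvgauss

def sample_gig(nu, a, b, size=1, rng=None):
    """Draw from the GIG density proportional to
    x**(nu - 1) * exp(-(a * x + b / x) / 2) on (0, inf), as in (4)."""
    rng = np.random.default_rng() if rng is None else rng
    # scipy's geninvgauss(p, c) has density proportional to
    # x**(p - 1) * exp(-c * (x + 1/x) / 2); rescaling a draw by
    # sqrt(b / a) maps it to the three-parameter form above.
    y = geninvgauss.rvs(nu, np.sqrt(a * b), size=size, random_state=rng)
    return y * np.sqrt(b / a)

# e.g., draws that could stand in for the tau_j updates
draws = sample_gig(nu=-0.25, a=2.0, b=1.5, size=5)
```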

We now state the conditional densities, beginning with $\lambda$. Conditional on $(\theta, \tau, y)$, the components of $\lambda$ are independent: $\lambda_e$ has a gamma full conditional, and so does $\lambda_{u_i}$ for each $i = 1, \dots, k$. Next, conditional on $(\lambda, \tau, y)$, $\theta$ is $(p+q)$-variate normal, with mean vector and covariance matrix available in closed form.

The hybrid algorithm is based on the Markov chain $\{(\theta_m, \lambda_m)\}_{m=0}^{\infty}$ with state space $\mathbb{R}^{p+q} \times \mathbb{R}_+^{k+1}$ and fixed selection probability $r \in (0, 1)$. If the current state is $(\theta_m, \lambda_m) = (\theta, \lambda)$, then we simulate the new state, $(\theta_{m+1}, \lambda_{m+1})$, using the following two-step procedure.

  Iteration m+1 of the hybrid algorithm:

  1. Draw $\tau_1, \dots, \tau_p$ independently, with each $\tau_j$ drawn from its GIG full conditional, let $\tau = (\tau_1, \dots, \tau_p)$, and, independently, flip an $r$-coin.

    1. If the coin comes up heads, draw $\lambda_e, \lambda_{u_1}, \dots, \lambda_{u_k}$ independently, with each drawn from its gamma full conditional,

      and let $\lambda_{m+1} = (\lambda_e, \lambda_{u_1}, \dots, \lambda_{u_k})$. Set $\theta_{m+1} = \theta$.

    2. Otherwise, if the coin comes up tails, draw $\theta_{m+1}$ from its $(p+q)$-variate normal full conditional,

      and set $\lambda_{m+1} = \lambda$.

 


It follows from the general results in Appendix B that the Markov chain driving this algorithm is reversible with respect to the marginal posterior density of $(\theta, \lambda)$, and it is straightforward to show that it is Harris ergodic. Abrahamsen and Hobert (2018) analyzed the hybrid algorithm, and proved that it is geometrically ergodic under mild regularity conditions. Here is the statement of their main theoretical result.

Theorem 1.

The hybrid Markov chain, $\{(\theta_m, \lambda_m)\}_{m=0}^{\infty}$, is geometrically ergodic for all $r \in (0, 1)$ if

  1. $Z$ has full column rank;

  2. the shape hyperparameters of the gamma priors satisfy certain explicit lower bounds; and

  3. an additional condition holds for each $i \in \{1, \dots, k\}$.

Note that the conditions of Theorem 1 are quite easy to check, and also that the result is applicable when $p > n$. However, note that there is no known convergence rate result for the three-variable Gibbs sampler (either deterministic or random scan). In fact, as Abrahamsen and Hobert (2018) discuss, the hybrid algorithm is much easier to analyze than three-variable Gibbs, despite being no more difficult to implement. This means that a practitioner who wishes to construct valid asymptotic standard errors for their MCMC-based estimators should prefer the hybrid algorithm over either version of the three-variable Gibbs sampler.

We now present some simulation results for the linear mixed model example discussed above. Note that no numerical or simulation results were provided in Abrahamsen and Hobert (2018). The goal is to compare the proposed hybrid algorithm with the three-variable Gibbs sampler (both deterministic and random scan). We also include in the comparison the double sandwich (DS) algorithm for this model, the theory of which is developed in Section 3. We consider three simulation settings corresponding to the situations where $p < n$, $p = n$, and $p > n$, respectively, in order to account for the effects of the shrinkage prior. The elements of the design matrix $X$ were chosen by generating iid random variables. For convenience, we consider the case of one random effect with 5 levels, i.e., $k = 1$ and $q = 5$, and consider the standard cell means model structure for the matrix $Z$. Recall from Theorem 1 that there are several restrictions on the hyperparameters that must be adhered to in order for the hybrid Markov chain to be geometrically ergodic. This sometimes requires the gamma shape parameters to be large. We mitigate this by setting the corresponding rate parameters equal to the shape parameters in each simulation setting, which gives each such prior distribution a mean of 1. The remaining hyperparameters are held fixed across all three simulations. Also, recall that there is empirical and theoretical evidence suggesting that small values of the shape parameter of the normal-gamma prior lead to a posterior that concentrates on sparse vectors. Accordingly, we fix that shape parameter at a small value throughout (see Table 1). Finally, in each of the simulations we fix the selection probability at $r = 1/2$ for the hybrid and DS algorithms. For the random scan Gibbs sampler, we fix the selection probabilities at $1/3$ each. Note that this gives each variable an equal chance of being updated at each iteration. Under this setup, the hybrid and DS algorithms take two iterations on average to update both parameters of interest, the random scan Gibbs sampler takes three iterations on average, and the deterministic Gibbs sampler updates both at every iteration. Hence, in order to perform an "apples to apples" comparison, if the random scan Gibbs sampler is run for $3N$ iterations after burn-in, the hybrid and DS algorithms should be run for $2N$ iterations each, and deterministic Gibbs should be run for $N$ iterations. Here is a summary of the simulation settings considered.

Table 1: Hyperparameter settings

Setting
1 100 10 1 5 1 1 1.5 1 0.25 1
2 100 100 1 5 77 77 1.5 1 0.25 1
3 100 200 1 5 152 152 1.5 1 0.25 1

In each simulation, we ran all four Markov chains for a burn-in period of 5,000 iterations. The post burn-in iterations (allocated across the four algorithms in the proportions described above) were used to compute the autocorrelations (up to lag 10) for a fixed scalar function of the current state, defined analogously for each of the four algorithms. This function is a natural choice as it involves both parameters of interest ($\theta$ and $\lambda$). The results are summarized in Figure 1. We can clearly see that, for all three simulations, the magnitude of the autocorrelations for the deterministic scan Gibbs sampler is the lowest and the magnitude of the autocorrelations for the random scan Gibbs sampler is the highest. The performance of the hybrid Markov chain and the DS Markov chain is comparable, with the DS algorithm being slightly better in the first two simulations. We note that the hybrid Markov chain and the DS Markov chain are also comparable in terms of clock time. For example, in the second simulation setting, the DS algorithm increased the run time by less than 1% relative to the hybrid algorithm across 15,000 iterations. Of course, the numerical results we provide in no way imply that this same ordering of the four algorithms will always occur in practice.
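For completeness, lag autocorrelations of a scalar summary of MCMC output can be computed with a few lines of NumPy; in the sketch below, the input sequence is a placeholder for whatever function of the chain one chooses to monitor, not the particular function used in Figure 1.

```python
import numpy as np

def autocorrelations(x, max_lag=10):
    """Sample autocorrelations of a scalar MCMC output sequence, lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    acov = np.array([np.mean(x[: len(x) - k] * x[k:]) for k in range(max_lag + 1)])
    return acov / acov[0]

# e.g., chain = [g(theta_m, lambda_m) for each post burn-in draw]
# acf = autocorrelations(chain, max_lag=10)
```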

Figure 1: Autocorrelations for Hybrid, DS, Deterministic Gibbs, and Random Scan Gibbs

3 Adding Sandwich Steps

Building on ideas in Liu and Wu (1999), Meng and van Dyk (1999) and van Dyk and Meng (2001), Hobert and Marchev (2008) introduced an alternative to DA that employs an extra move on the $z$-space that is "sandwiched" between the two conditional draws. In keeping with the notation of the introduction, let $f_Z(z)$ denote the $z$-marginal of $f(x, z)$, and suppose that $R(z, dz')$ is any Markov transition function (Mtf) that is reversible with respect to $f_Z$, i.e., $f_Z(z)\, R(z, dz')\, dz = f_Z(z')\, R(z', dz)\, dz'$. The sandwich algorithm simulates the Markov chain whose Mtd is

$$k_{SA}(x' \mid x) = \int \int f(x' \mid z')\, R(z, dz')\, f(z \mid x)\, dz.$$

Again, it's easy to see that $\pi(x)\, k_{SA}(x' \mid x)$ is symmetric in $(x, x')$. Note that the sandwich algorithm reduces to DA if we take $R$ to be the trivial Mtf whose chain is absorbed at the starting point. To run the sandwich algorithm, we simply run the DA algorithm as usual, except that after each $z$ is drawn, we perform the extra step $z \to z'$ before drawing the new $x$. Of course, a sandwich algorithm is a useful alternative to the underlying DA algorithm only if the computational burden of drawing from $R$ is small relative to the speed-up it provides. Consider, for example, the Mtf with density $f(z' \mid z) = \int f(z' \mid x)\, f(x \mid z)\, dx$. This leads to a sandwich algorithm that is nothing but two consecutive iterations of the DA algorithm. Here, whatever is gained in mixing is offset exactly by increased computational effort. Fortunately, it is often possible to find an $R$ that leads to a significant speed-up, while adding very little to the overall computational cost. This is typically accomplished by choosing $R$ such that, for fixed $z$, the (reducible) chain driven by $R$ lives in a low dimensional subspace of the $z$-space. (Note that such an $R$ would not have an Mtd with respect to Lebesgue measure on the $z$-space, and this is the reason why it is defined via its Mtf instead of an Mtd.) There are many examples of sandwich algorithms that drastically outperform their DA counterparts in empirical studies; see, e.g., Liu and Wu (1999) or Meng and van Dyk (1999). Moreover, the superiority of the sandwich algorithm has also been established theoretically. Indeed, results in Hobert and Marchev (2008) and Khare and Hobert (2011) show that the sandwich algorithm converges at least as fast as the DA algorithm, and is at least as good in the sense of asymptotic relative efficiency.
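A minimal sketch of one sandwich iteration, assuming user-supplied routines for the two conditional draws and for the extra move; all three names are placeholders.

```python
import numpy as np

def sandwich_step(x, draw_z_given_x, sandwich_move, draw_x_given_z, rng):
    """One iteration of the sandwich algorithm: z | x, then the extra
    move z -> z' (an Mtf reversible w.r.t. the z-marginal), then x | z'."""
    z = draw_z_given_x(x, rng)
    z = sandwich_move(z, rng)   # the "sandwiched" step; the identity map recovers DA
    return draw_x_given_z(z, rng)
```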

We now explain how to add two different sandwich steps to the hybrid algorithm. We call this new algorithm the double sandwich algorithm, or DS algorithm for short. For fixed $v$, let $R_v(z, dz')$ denote an Mtf on the $z$-space that is reversible with respect to $f(z \mid v)$, so that

$$f(z \mid v)\, R_v(z, dz')\, dz = f(z' \mid v)\, R_v(z', dz)\, dz'. \qquad (5)$$

Define

$$k_1(u' \mid u, v) = \int \int f(u' \mid v, z')\, R_v(z, dz')\, f(z \mid u, v)\, dz.$$

It's easy to show that $k_1$ leaves $f(u \mid v)$ invariant. (In fact, $f(u \mid v)\, k_1(u' \mid u, v)$ is symmetric in $(u, u')$ - see Appendix B.) Analogously, for fixed $u$, define

$$k_2(v' \mid u, v) = \int \int f(v' \mid u, z')\, \tilde{R}_u(z, dz')\, f(z \mid u, v)\, dz,$$

where $\tilde{R}_u(z, dz')$ is reversible with respect to $f(z \mid u)$. There are a couple of default ways to construct sandwich moves; see, e.g., Hobert and Marchev (2008) and Liu and Wu (1999). In some cases, the resulting Mtf can be simulated with relatively little additional computational effort, and in such cases, there is nothing to lose by adding the step. In other cases, the Mtfs that result require rejection sampling to implement. When the effort required to implement the rejection sampler outweighs the gain in speed, the step should obviously not be used. Fortunately, in this case, we could always choose one or both of $R_v$ and $\tilde{R}_u$ to be trivial. The potential benefits of the sandwich methodology remain as long as at least one of the steps is computationally efficient.

The DS algorithm is simply a random scan algorithm which, at each iteration, uses either $k_1$ or $k_2$. In particular, fix $r \in (0, 1)$, and consider a Markov chain $\{(U_m, V_m)\}_{m=0}^{\infty}$ with state space $\mathsf{U} \times \mathsf{V}$ that evolves as follows. If the current state is $(U_m, V_m) = (u, v)$, then we simulate the new state, $(U_{m+1}, V_{m+1})$, using the following two-step procedure.

  Iteration m+1 of the DS algorithm:

  1. Draw $Z \sim f(z \mid u, v)$, call the result $z$, and, independently, draw $W \sim \mathrm{Uniform}(0, 1)$.

    1. If $W \le r$, draw $Z' \sim R_v(z, \cdot)$, call the result $z'$, draw $U_{m+1} \sim f(u \mid v, z')$, and set $V_{m+1} = v$.

    2. Otherwise, if $W > r$, draw $Z' \sim \tilde{R}_u(z, \cdot)$, call the result $z'$, draw $V_{m+1} \sim f(v \mid u, z')$, and set $U_{m+1} = u$.

 


Note that the version of the DS algorithm in which $R_v$ and $\tilde{R}_u$ are both trivial is simply the hybrid algorithm from the introduction. Analogously to the hybrid algorithm, the DS algorithm is reversible. Using the reversibility property, we are able to establish for the DS algorithm analogues of the strong theoretical results that have been proven for the basic sandwich algorithm. In particular, we show in Appendix B that the DS algorithm is at least as good as the hybrid algorithm in terms of convergence rate, and in the sense of asymptotic relative efficiency. One important consequence of the convergence rate result is that geometric ergodicity of the hybrid algorithm implies geometric ergodicity of all corresponding DS algorithms. This is highly useful in practice because the hybrid algorithm is much simpler, and hence much easier to analyze. Pal et al. (2015) developed an alternative method for Bayesian latent data models based on the sandwich methodology of Hobert and Marchev (2008). Unfortunately, it is difficult to obtain theoretical results using their method because the corresponding Markov chains are not reversible.
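A minimal sketch of one DS iteration under the same conventions as the earlier hybrid sketch; the sandwich Mtfs are passed in as functions, and supplying identity maps for both recovers the hybrid algorithm. All function names are placeholders.

```python
import numpy as np

def ds_step(u, v, r, draw_z, sandwich_v, sandwich_u, draw_u, draw_v, rng):
    """One iteration of the double sandwich (DS) algorithm.

    sandwich_v(z, v, rng) applies an Mtf reversible w.r.t. f(z | v);
    sandwich_u(z, u, rng) applies an Mtf reversible w.r.t. f(z | u).
    """
    z = draw_z(u, v, rng)
    if rng.uniform() <= r:
        z = sandwich_v(z, v, rng)  # extra move before updating u
        u = draw_u(v, z, rng)
    else:
        z = sandwich_u(z, u, rng)  # extra move before updating v
        v = draw_v(u, z, rng)
    return u, v
```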

As an example, we now show how to add one nontrivial sandwich step to the hybrid algorithm for the linear mixed model of Abrahamsen and Hobert (2018). (See Appendix A for a simple example where two sandwich steps are possible.) We note that those authors only considered the vanilla hybrid algorithm with no sandwich steps. To get started, we require an Mtf that is reversible with respect to one of the two conditional densities of the latent vector $\tau$: the conditional given $(\theta, y)$, or the conditional given $(\lambda, y)$. We have found that sandwich moves corresponding to one of these two conditionals are difficult to implement in practice. For this reason, we focus on constructing an Mtf that is reversible with respect to the other, and a routine calculation yields that conditional density in closed form, up to a normalizing constant.

Consider the multiplicative move $\tau \to g\tau$ for $g > 0$. It follows from the group theoretic arguments in Hobert and Marchev (2008) that this move is reversible with respect to the conditional density in question if $g$ is drawn from a particular univariate density constructed from it. (This is a low-dimensional move since, for fixed $\tau$, the points $g\tau$ lie on a ray emanating from the origin and passing through the point $\tau$.) Writing that conditional density out as a function of $g$ shows that, after introducing a free parameter, the density from which $g$ must be drawn can be expressed as a product of two terms,

$$h(g) \propto h_1(g)\, h_2(g), \qquad (6)$$

where $h_1$ and $h_2$ depend on the free parameter. If we choose the free parameter appropriately, then two things happen: (1) the first term on the right-hand side of (6) is proportional to a scaled density from a standard family that can be sampled directly, and (2) the second term is bounded, achieving its maximum at an explicitly computable point. Thus, we can use a simple accept/reject algorithm with a candidate drawn from that scaled density to draw from (6). Here's the algorithm.

  Accept/Reject algorithm for (6):

  1. Draw $G$ from the density proportional to $h_1$, set $g = G$, and independently draw $U \sim \mathrm{Uniform}(0, 1)$.

  2. If $U \le h_2(g) / \sup_{g' > 0} h_2(g')$,

    then accept $g$ as a draw from (6); otherwise return to 1.
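As a generic illustration of this accept/reject step, here is a sketch of a rejection sampler for a density proportional to a product h1 * h2, where h1 can be sampled directly and h2 is bounded. The particular h1 and h2 shown are hypothetical stand-ins, not the actual factors appearing in (6).

```python
import numpy as np

def accept_reject(draw_h1, h2, h2_max, rng, max_tries=10_000):
    """Sample from a density proportional to h1(g) * h2(g), where draw_h1
    samples from the density proportional to h1 and h2 <= h2_max."""
    for _ in range(max_tries):
        g = draw_h1(rng)
        if rng.uniform() <= h2(g) / h2_max:
            return g
    raise RuntimeError("acceptance probability appears to be very low")

# Hypothetical illustration: h1 proportional to a Gamma(shape=3, rate=2)
# density and h2(g) = exp(-1/g), which is bounded above by 1 on (0, inf).
rng = np.random.default_rng(0)
draw_h1 = lambda rng: rng.gamma(3.0, 1.0 / 2.0)
h2 = lambda g: np.exp(-1.0 / g)
g_draw = accept_reject(draw_h1, h2, h2_max=1.0, rng=rng)
```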

 


Note that, due to the dependence on data-dependent constants in (6), the efficiency of this rejection sampler in general depends on the data at hand. In the numerical examples considered in the previous section, the sampler is quite efficient, with an acceptance probability of more than 70% in each of the simulation settings considered. So our DS algorithm proceeds as follows. Fix $r \in (0, 1)$ and let the current state of the chain be $(\theta_m, \lambda_m) = (\theta, \lambda)$. First, draw $\tau$ from its full conditional, and then flip an $r$-coin. If the coin comes up heads, we move to $(\theta_{m+1}, \lambda_{m+1})$ by first applying the sandwich move $\tau \to g\tau$, with $g$ drawn via the accept/reject algorithm above, and then drawing the corresponding parameter block from its full conditional, leaving the other block fixed. If the coin comes up tails, we move to $(\theta_{m+1}, \lambda_{m+1})$ by drawing the other parameter block from its full conditional, leaving the first block fixed. It follows from a result in Appendix B and the geometric ergodicity of the hybrid algorithm with no sandwich steps that this algorithm is also geometrically ergodic. Recall that some empirical results for this DS algorithm are depicted alongside the results of the hybrid and Gibbs algorithms in Figure 1 of the previous section. It is clear that, while the DS algorithm outperforms the hybrid algorithm in most of the simulations considered and outperforms random scan Gibbs in every simulation, it does not outperform deterministic scan Gibbs in terms of autocorrelation.

4 Bayesian Linear Regression with Scale Mixtures of Normal Errors

Let $Y_1, \dots, Y_n$ be independent random variables satisfying the linear regression model

$$Y_i = x_i^T \beta + \sigma \varepsilon_i, \qquad i = 1, \dots, n, \qquad (7)$$

where $x_i$ is a $p \times 1$ vector of known covariates associated with $Y_i$, $\beta$ is a $p \times 1$ vector of unknown regression coefficients, $\sigma$ is an unknown scale parameter, and $\varepsilon_1, \dots, \varepsilon_n$ are iid errors. The standard assumption that the errors are Gaussian is often inappropriate, e.g., when the data contain outliers. Various heavy-tailed alternatives can be constructed as scale mixtures of the Gaussian density. Consider an error density of the form

$$f_H(\varepsilon) = \int_0^\infty \sqrt{\frac{\lambda}{2\pi}}\, \exp\!\left\{-\frac{\lambda \varepsilon^2}{2}\right\}\, dH(\lambda), \qquad (8)$$

where $H$ is the distribution function of some non-negative random variable. By varying the mixing distribution $H$, quite a lot of symmetric and unimodal distributions can be constructed. Thus, datasets with various types of tail behavior (particularly with heavier tails than the normal) are often modeled by choosing a distribution from this class. In this section, we consider a Bayesian analysis of the linear regression model (7) when the errors are iid random variables with the general scale mixture of normals (SMN) density given in (8). There are several different prior distributions available that lead to interesting results. Hobert et al. (2018) consider a standard improper prior and show that a regular DA algorithm is available. A regular DA algorithm is also available in the case where we specify a proper conditionally conjugate prior on $(\beta, \sigma^2)$ sequentially, taking $\beta \mid \sigma^2$ to be multivariate normal, with prior mean vector and covariance matrix specified by the user, and $\sigma^2$ to be inverse gamma. Throughout this section, we will instead consider the proper prior which takes $\beta$ and $\sigma^2$ to be a priori independent, with a multivariate normal prior on $\beta$ and an inverse gamma prior on $\sigma^2$. Under this setup, it is straightforward to show that the conditional distribution of $(\beta, \sigma^2)$ given the latent data and $y$ is not available in closed form, so that the regular DA algorithm is not applicable. In this case, the hybrid algorithm is a viable alternative. Here are the details.
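As a concrete member of the SMN class, taking the mixing distribution in (8) to be gamma with shape and rate both equal to $\nu/2$ yields Student-t errors with $\nu$ degrees of freedom. The following sketch illustrates that representation; this particular mixing choice is just one standard example, not necessarily the one analyzed later in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
nu = 4.0      # degrees of freedom for the Student-t example
n = 10_000

# Mixing variable lambda ~ Gamma(shape = nu/2, rate = nu/2), then
# eps | lambda ~ N(0, 1/lambda): marginally, eps is Student-t with nu df.
lam = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)
eps = rng.normal(size=n) / np.sqrt(lam)

# Heavier tails than the standard normal, as expected:
print(np.mean(np.abs(eps) > 3), np.mean(np.abs(rng.normal(size=n)) > 3))
```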

Let $y = (y_1, \dots, y_n)^T$ denote the observed data. Let $X$ denote the $n \times p$ matrix whose $i$th row is $x_i^T$. We assume throughout that $X$ has full column rank. We also assume that the mixing distribution $H$ has a pdf, $h$, with respect to Lebesgue measure on $(0, \infty)$. The likelihood of the data from the linear regression model above is given by

$$f(y \mid \beta, \sigma^2) = \prod_{i=1}^n \frac{1}{\sigma}\, f_H\!\left(\frac{y_i - x_i^T \beta}{\sigma}\right).$$

The posterior density is then given by

$$\pi(\beta, \sigma^2 \mid y) \propto f(y \mid \beta, \sigma^2)\, \pi(\beta)\, \pi(\sigma^2).$$

Note that there is no direct method to sample from this intractable posterior density.

Define the complete data posterior density as

$$\pi(\beta, \sigma^2, \lambda \mid y) \propto \left[\prod_{i=1}^n \frac{\sqrt{\lambda_i}}{\sigma}\, \exp\!\left\{-\frac{\lambda_i (y_i - x_i^T \beta)^2}{2\sigma^2}\right\} h(\lambda_i)\right] \pi(\beta)\, \pi(\sigma^2),$$

and note that integrating this density with respect to $\lambda = (\lambda_1, \dots, \lambda_n)$ recovers $\pi(\beta, \sigma^2 \mid y)$, so that $\lambda$ constitutes latent data. We now derive the conditional densities needed for the hybrid algorithm. It is clear that, conditional on $(\beta, \sigma^2, y)$, the components $\lambda_1, \dots, \lambda_n$
are independent, and the conditional density of $\lambda_i$ given $(\beta, \sigma^2, y)$ is proportional to

$$\sqrt{\lambda_i}\, \exp\!\left\{-\frac{\lambda_i (y_i - x_i^T \beta)^2}{2\sigma^2}\right\} h(\lambda_i).$$

Next, let $\Lambda$ be an $n \times n$ diagonal matrix whose $i$th diagonal element is $\lambda_i$. We have

$$\pi(\sigma^2 \mid \beta, \lambda, y) \propto \pi(\sigma^2)\, (\sigma^2)^{-n/2}\, \exp\!\left\{-\frac{(y - X\beta)^T \Lambda (y - X\beta)}{2\sigma^2}\right\}.$$

Hence, the full conditional of $\sigma^2$ is inverse gamma. Finally, we have

$$\pi(\beta \mid \sigma^2, \lambda, y) \propto \pi(\beta)\, \exp\!\left\{-\frac{(y - X\beta)^T \Lambda (y - X\beta)}{2\sigma^2}\right\}.$$

After completing the square, we obtain that the full conditional of $\beta$ is $p$-variate normal, with mean vector and covariance matrix available in closed form.

The hybrid algorithm is based on the Markov chain $\{(\beta_m, \sigma^2_m)\}_{m=0}^{\infty}$ with state space $\mathbb{R}^p \times \mathbb{R}_+$ and selection probability $r \in (0, 1)$. The dynamics of the chain are defined by the following two-step procedure for moving from $(\beta_m, \sigma^2_m) = (\beta, \sigma^2)$ to $(\beta_{m+1}, \sigma^2_{m+1})$.

  Iteration m+1 of the hybrid algorithm:

  1. Draw $\lambda_1, \dots, \lambda_n$ independently, with $\lambda_i$ drawn from its full conditional given $(\beta, \sigma^2, y)$;

    call the observed values $\lambda = (\lambda_1, \dots, \lambda_n)$ and, independently, draw $W \sim \mathrm{Uniform}(0, 1)$.