 # Pseudo-extended Markov chain Monte Carlo

Sampling from the posterior distribution using Markov chain Monte Carlo (MCMC) methods can require an exhaustive number of iterations to fully explore the correct posterior. This is often the case when the posterior of interest is multi-modal, as the MCMC sampler can become trapped in a local mode for a large number of iterations. In this paper, we introduce the pseudo-extended MCMC method as an approach for improving the mixing of the MCMC sampler in complex posterior distributions. The pseudo-extended method augments the state-space of the posterior using pseudo-samples as auxiliary variables, where on the extended space, the MCMC sampler is able to easily move between the well-separated modes of the posterior. We apply the pseudo-extended method within an Hamiltonian Monte Carlo sampler and show that by using the No U-turn algorithm (Hoffman and Gelman, 2014), our proposed sampler is completely tuning free. We compare the pseudo-extended method against well-known tempered MCMC algorithms and show the advantages of the new sampler on a number of challenging examples from the statistics literature.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Markov chain Monte Carlo (MCMC) methods (see, e.g., Brooks et al. (2011)

) are generally regarded as the gold standard approach for sampling from high-dimensional distributions. In particular, MCMC algorithms have been extensively applied within the field of Bayesian statistics

(Green et al., 2015) to sample from posterior distributions when the posterior density can only be evaluated up to a constant of proportionality. Under mild conditions, it can be shown that asymptotically, the limiting distribution of the samples generated from the MCMC algorithm will converge to the posterior distribution of interest. While theoretically elegant, one of the main drawbacks of MCMC methods is that running the algorithm to stationarity can be prohibitively expensive if the posterior distribution is of a complex form, for example, contains multiple unknown modes. Notable examples of multi-modality include the posterior over model parameters in mixture models (McLachlan and Peel, 2000)

, deep neural networks

(Neal, 2012), and differential equation models (Calderhead and Girolami, 2009). In these settings, the Markov chain can become stuck for long periods of time without fully exploring the posterior distribution. Figure 1: Left: Density plot of the target π(x). Right: Contour plot of the extended target with N=2.

In this paper, we present the pseudo-extended Markov chain Monte Carlo method as an approach for augmenting the state-space of the original posterior distribution to allow the MCMC sampler to easily move between areas of high posterior density. The pseudo-extended method introduces pseudo-samples on the extended space to improve the mixing of the Markov chain. To illustrate how this method works, in Figure 1

we plot a mixture of two univariate Gaussian distributions (

left

). The area of low probability density between the two Gaussians will make it difficult for an MCMC sampler to traverse between them. Using the pseudo-extended approach (as detailed in Section

2), we can extend the state-space to two dimensions (right), where on the extended space, the modes are now connected allowing the MCMC sampler to easily mix between them.

The pseudo-extended framework can be applied for general MCMC sampling, however, in this paper we focus on using ideas from tempered MCMC (Jasra et al., 2007) to improve multi-modal posterior sampling. Unlike previous approaches which use MCMC to sample from multi-modal posteriors, i) we do not require a priori information regarding the number, or location, of modes, ii) nor do we need to specify a sequence of intermediary tempered distributions (Geyer, 1991).

We show that samples generated using the pseudo-extended method admit the correct posterior of interest as the limiting distribution. Furthermore, once weighted using a post-hoc correction step, it is possible to use all pseudo-samples for approximating the posterior distribution. The pseudo-extended method can be applied as an extension to many popular MCMC algorithms, including the random-walk Metropolis (Roberts et al., 1997) and Metropolis-adjusted Langevin algorithm (Roberts and Tweedie, 1996). However, in this paper, we focus on applying the popular Hamiltonian Monte Carlo (HMC) algorithm (Neal, 2010) within the pseudo-extended framework, and show that this leads to improved posterior exploration compared to standard HMC.

## 2 The Pseudo-Extended Method

Let be a target probability density on with respect to Lebesgue measure, defined for all by

 π(x):=γ(x)Z=exp{−ϕ(x)}Z, (1)

where is a continuously differentiable function and is the normalizing constant which is typically intractable. Throughout we will refer to as the target density. In the Bayesian setting, this would be the posterior, where for data , the likelihood is denoted as with parameters assigned a prior density

. The posterior density of the parameters given the data is derived from Bayes theorem to give

, where the marginal likelihood is the normalizing constant , which is typically not available analytically.

We extend the state-space of the original target distribution (1) by introducing pseudo-samples, , where the extended-target distribution is defined on . The pseudo-samples act as auxiliary variables, where for each , we introduce an instrumental distribution with support covering that of . In a similar vein to the pseudo-marginal MCMC algorithm (Beaumont, 2003; Andrieu and Roberts, 2009) our extended-target, including the auxiliary variables, is now of the form,

 πN(x1:N) :=1NN∑i=1π(xi)∏j≠iq(xj)=1Z{1NN∑i=1γ(xi)q(xi)}×∏iq(xi), (2)

where and are defined in (1). In pseudo-marginal MCMC, it is common to use

as a proposal distribution for an importance sampler, and to use it to compute an unbiased estimate of the intractable normalizing constant (see Section

2.3 for details). However, we will take a different approach, as explained below, and we do not even require that exact sampling from is possible; a fact that we will exploit in Section 3 for designing .

In the case where , our extended-target (2) simplifies back to the original target (1). For , the marginal distribution of the th pseudo-sample is a mixture between the target and the proposal

 πN(xi)=1Nπ(xi)+N−1Nq(xi).

Samples from the original target of interest, are given by using a post-hoc correction step, where the samples from the extended-target are weighted with weights proportional to .

###### Theorem 2.1.

Let be distributed according to the extended-target . Weighting each sample with self-normalized weights proportional to gives samples from the target distribution, , in the sense that

 EπN[∑Ni=1f(xi)γ(xi)/q(xi)∑Ni=1γ(xi)/q(xi)]=Eπ[f(x)] (3)

for arbitrary integrable .

The proof follows from the invariance of particle Gibbs (Andrieu et al., 2010) and is given in Section A of the Supplementary Material.

Introducing pseudo-samples has the effect of linearly increasing the state-space of the target. In general, we would prefer to perform MCMC on a lower dimensional space rather than increasing the size of the state-space as this leads to slower mixing of the Markov chain (Roberts and Rosenthal, 2004). This is less problematic for the HMC algorithm compared to the random-walk Metropolis as the step-size should be scaled as for HMC compared to

for the variance of the random-walk Metropolis proposal

(Beskos et al., 2013), giving a computational complexity of and , respectively, to explore the target in stationarity. This indicates that in high-dimensional spaces, the HMC algorithm is able to explore the target space faster. One disadvantage of HMC, and other gradient-based MCMC algorithms, is that they tend to be mode-seeking and are more prone to getting trapped in local modes of the target. Modifying the target distribution using the pseudo-extended framework, whereby the modes of the target can be connected on the extended space, reduces the mode-seeking behavior of HMC and allows the sampler to more easily move between the modes of the target.

### 2.1 Pseudo-extended Hamiltonian Monte Carlo

Recalling that our parameters of interest are , we introduce artificial momentum variables that are independent of . The Hamiltonian , represents the total energy of the system as the combination of the potential function , as defined in (1), and kinetic energy ,

 H(x,ρ):=ϕ(x)+12ρ⊤M−1ρ,

where

is a mass matrix and is often set to the identity matrix. The Hamiltonian now augments our target distribution so that we are sampling

from the joint distribution

, which admits the target as the marginal. In the case of the pseudo-extended target (2), the Hamiltonian is,

 HN(x1:N,ρ) =−log[N∑i=1exp{−ϕ(xi)+δ(xi)}]+N∑i=1δ(xi)+12ρ⊤M−1ρ, (4)

where now , and is the potential function of the instrumental distribution from eq. (2).

Aside from a few special cases, we generally cannot simulate from the dynamical system corresponding to the Hamiltonian (4) exactly (Neal, 2010). Instead, we discretize time using small step-sizes and calculate the state at , , , etc. using a numerical integrator. Several numerical integrators are available which preserve the volume and reversibility of the Hamiltonian system (Girolami and Calderhead, 2011), the most popular being the leapfrog integrator which takes steps, each of size , though the Hamiltonian dynamics (see Algorithm 1). After a fixed number of iterations , the algorithm generates samples approximately distributed according to the joint distribution , where after discarding the momentum variables , our MCMC samples will be approximately distributed according to the target .

Two important tuning parameters of Algorithm 1 are and . We estimate these using the No-U-turn sampler (NUTS) introduced by Hoffman and Gelman (2014), where the trajectory length is tuned to avoid the sampler doubling back on itself and is estimated using dual averaging (Nesterov, 2009). In this paper, we use the HMC algorithm because of its impressive mixing times, however, it is important to note that alternative MCMC schemes, such as the random-walk Metropolis (Roberts et al., 1997), Metropolis-adjusted Langevin algorithm (Roberts and Rosenthal, 1998), slice sampler (Neal, 2003) and other MCMC algorithms can also be used to sample from the pseudo-extended target (2).

### 2.2 One-dimensional illustration

Consider a bi-modal target of the form (see Figure 1 (left)),

 π(x)∝N(−1,0.1)+N(1,0.02).

If there are pseudo-samples, the pseudo-extended target (2) simplifies to

 π(x1:2)∝γ(x1)q(x2)+γ(x2)q(x1),

and for the sake of illustration, we choose .

Density plots for the original and pseudo-extended target are given in Figure 1. On the original target, the modes are separated by a region of low density and an MCMC sampler will therefore only pass between the modes with low probability, thus potentially requiring an exhaustive number of iterations. On the pseudo-extended target, the modes of the original target are now connected on the extended space . The proposal has the effect of increasing the density in the low probability regions of the target which separate the modes. A higher density between the modes means that the MCMC sampler can now traverse between the modes with higher probability than under the original target. Figure 2: 10,000 samples from the original target (left) and extended target (right) using HMC sampler

In Figure 2, density plots of the original target are overlayed with samples drawn from the original and pseudo-extended targets using the HMC algorithm, respectively.

After 10,000 iterations of the HMC sampler on the original target only one mode is discovered. Applying the same HMC algorithm on the pseudo-extended target, and then weighting the samples (as discussed in Section 2 and shown in Algorithm 1), both modes of the original target are discovered and the samples produce a good empirical approximation to the target.

### 2.3 Connections to pseudo-marginal MCMC

The pseudo-extended target (2) can be viewed as a special case of the pseudo-marginal target of Andrieu and Roberts (2009). In the pseudo-marginal setting, it is (typically) assumed that the target density is of the form , where is some “top-level” parameter, and where are now viewed as a latent variable that cannot be integrated out analytically. Using importance sampling, an unbiased Monte Carlo estimate of the target is computed using latent variable samples from a proposal distribution with density and then approximating the integral as

 ~π(θ):=1NN∑i=1π(θ,xi)q(xi),wherexi∼q(⋅).

The pseudo-marginal target is then defined, analogously to the pseudo-extended target (2), as

 ~πN(θ,x):=1NN∑i=1π(θ,xi)∏j≠iq(xj), (5)

which admits as a marginal. In the original pseudo-marginal method, the extended-target is sampled from using MCMC, with an independent proposal for (corresponding to importance sampling for these variables) and a standard MCMC proposal (e.g., random-walk) used for .

There are two key differences between pseudo-marginal MCMC and pseudo-extended MCMC. Firstly, we do not distinguish between latent variables and parameters, and simply view all unknown variables, or parameters, of interest as being part of . Secondly, we do not use an importance-sampling-based proposal to sample (which, effectively, would just result in a standard importance sampler for with proposal ), but instead we propose to simulate directly from the pseudo-extended target (2) using HMC as explained in Section 2.1. An important consequence of this is that we can use instrumental distributions without needing to sample from them. In Section 3 we exploit this fact to construct the instrumental distribution by tempering.

It has previously been proposed to simulate from the pseudo-marginal target (5) using HMC Lindsten and Doucet (2016). However, their motivation differs from ours since they still view as “nuisance variables” that are approximately marginalized out in order to mimic an ideal HMC algorithm targeting the marginal distribution . Furthermore, they study the convergence to this ideal HMC for large . An implication of this is that they rely on a splitting technique which requires a re-parameterization of

as a standard normal distribution. This limits the range of instrumental distributions that can be used with their method—in particular, they cannot use the tempered distributions which we propose in Section

3. On the contrary, based on the empirical results presented in Section 4, we are advocating using a small number of pseudo-samples and therefore we do not need to resort to the re-parameterization used by Lindsten and Doucet (2016). Another related method, which also uses this re-parameterization, is the pseudo-marginal slice sampler by Murray and Graham (2016).

In summary, the pseudo-marginal framework is a powerful technique for sampling from models with intractable likelihoods. The pseudo-extended method, on the other hand, is designed for sampling from complex target distributions, where the landscape of the target is difficult for standard MCMC samplers to traverse without an exhaustive number of MCMC iterations. In particular, where the target distribution is multi-modal, we show that extending the state-space allows our MCMC sampler to more easily explore the modes of the target.

## 3 Tempering

### 3.1 Tempered targets as instrumental distributions

In the case of importance sampling, we would choose a proposal which closely approximates the target . However, this would assume that we could find a tractable instrumental distribution for which i) sufficiently covers the support of the target and ii) captures its multi-modality. Approximations, such as the Laplace approximation (Rue et al., 2009) and variational Bayes (e.g., Bishop (2006), Chapter 10) could be used to choose , however, such approximations tend to locate only one mode of the target.

A significant advantage of the pseudo-extended framework (2) is that it permits a wide range of potential instrumental distributions. Unlike standard importance sampling, we also do not require to be a distribution that we can sample from, only that it can be evaluated point-wise up to proportionality. This is a simpler condition to satisfy and allows us to find better instrumental distributions for connecting the modes of the target. In this paper, we utilize a simple approach for choosing the instrumental distribution which does not require a closed-form approximation of the target. Specifically, we use tempered targets as instrumental distributions.

Tempering has previously been utilized in the MCMC literature to improve the sampling of multi-modal targets. Here we use a technique inspired by Graham and Storkey (2017) (see Section 3.2), where we consider the family of approximating distributions,

 Π:={πβ(x)=γβ(x)Z(β):β∈[0,1]}, (6)

where can be evaluated point-wise and is typically intractable.

We will construct an extended target distribution on consisting of pairs , for . This target distribution will be constructed in such a way that the marginal distribution of each -sample is a mixture with components selected from . This will typically make the marginal distribution more diffuse than the target itself, encouraging better mixing.

Let where, for the sake of tractability, we let , where can be evaluated point-wise and is a normalizing constant. Thus,

 q(x,β)=γβ(x)g(β)C. (7)

The joint proposal does not admit a closed-form expression and in general it is not possible to sample from it. We can, however, use an MCMC algorithm, in this case an HMC sampler, on the extended-target and therefore only require that can be evaluated point-wise up to proportionality. Under the proposal (6), the pseudo-extended target (2) is now

 πN(x1:N,β1:N):=1NN∑i=1π(xi)π(βi)∏j≠iq(xj,βj) (8) =1ZCN−1{1NN∑i=1γ(xi)π(βi)γβi(xi)g(βi)}N∏j=1γβj(xj)g(βj),

where is some arbitrary user-chosen target distribution for . Through our choice of , the normalizing constants for the target and proposal, and respectively are not dependent on or and so cancel in the Metropolis-Hastings ratio.

### 3.2 Related Work on tempered MCMC

Tempered MCMC is the most popular approach to sampling from multi-modal targets distributions (see Jasra et al. (2007) for a full review). The main idea behind tempered MCMC is to sample from a sequence of tempered targets,

 πt(x)∝exp{−βtϕ(x)},t=1,…,T,

where is a tuning parameter referred to as the temperature that is associated with . A sequence of temperatures, commonly known as the ladder, is chosen a priori, where . The intuition behind tempered MCMC is that when is small, the modes of the target are flattened out making it easier for the MCMC sampler to traverse through the regions of low density separating the modes. One of the most popular tempering algorithms is parallel tempering (PT) (Geyer, 1991), where in parallel, separate MCMC algorithms are run with each sampling from one of the tempered targets . Samples from neighboring Markov chains are exchanged (i.e. sample from chain exchanged with chain or ) using a Metropolis-Hasting step. These exchanges improve the convergence of the Markov chain to the target of interest , however, information from low targets can be slow to traverse up the temperature ladder. There is also serial version of this algorithm, known as simulated tempering (ST) (Marinari and Parisi, 1992). An alternative approach introduced by Neal (2001) is annealed importance sampling (AIS), which draws samples from a simple base distribution and then via a sequence of intermediate transition densities, moves the samples along the temperature ladder giving a weighted sample from the target distribution. Generally speaking, these tempered approaches can be very difficult to apply in practice often requiring extensive tuning. In the case of PT, the user needs to choose the number of parallel chains , temperature schedule, step-size for each chain and the number of exchanges at each iteration.

Our proposed tempering scheme is closely related to the continuously tempered HMC algorithm of Graham and Storkey (2017). They propose to run HMC on a distribution similar to (7) and then apply an importance weighting as a post-correction to account for the different temperatures. It thus has some resemblance with ST, in the sense that a single chain is used to explore the state space for different temperature levels. On the contrary, for our proposed pseudo-extended method, the distribution (7) is not used as a target, but merely as an instrumental distribution to construct the pseudo-extended target (8). The resulting method therefore has some resemblance with PT, since we propagate pseudo-samples in parallel, all possibly exploring different temperature levels. Furthermore, by mixing in part of the actual target we ensure that the samples do not simultaneously “drift away” from regions with high probability under .

Graham and Storkey (2017) propose to use a variational approximation to the target, both when defining the family of distributions (6) and for choosing the function . This is also possible with the pseudo-extended method, but we do not consider this possibility here for brevity. Finally, we note that in the pseudo-extended method the temperature parameter is treated as unknown and is sampled as part of the MCMC scheme, rather than being pre-specified as a sequence of fixed temperatures. This can be particularly beneficial as it avoids the need to tune the temperature ladder prior to implementing the MCMC sampler.

## 4 Experiments

We test the performance of the pseudo-extended method on four different examples. The first two (Sections 4.1 and 4.2) are chosen to show how the pseudo-extended method performs on simulated data when the target is multi-modal. The third example (Section 4.3

) considers a sparsity-inducing logistic regression model, where multi-modality occurs in the posterior from three real-world datasets. We compare against popular competing algorithms from the literature, including methods discussed in Section

3.2. In Section E of the Supplementary Material, we show how the pseudo-extended approach can be applied beyond multi-modal posteriors and consider two challenging target distributions.

All the simulations for the pseudo-extended method use the tempered proposal (Section 3) for and thus the pseudo-extended target is given by (8). For each simulation study we choose and

. We also use a logit transformation for

to map the parameters onto the unconstrained space. The pseudo-extended HMC algorithm (Alg. 1) is implemented within the STAN framework111All simulation code is available to the reviewers in the Supplementary Material and will be published on Github after the review process. (Carpenter et al., 2017).

### 4.1 Mixture of Gaussians

We consider a popular example from the literature (Kou et al., 2006; Tak et al., 2016), where the target is a mixture of 20 bivariate Gaussians,

 π(x)=20∑j=1wj2πσ2jexp{−12σ2j(x−μj)⊤(x−μj)},

where are specified in Kou et al. (2006). We consider two scenarios, where (a) each mixture component has weight and variance and where (b) the weights and variances

are unequal. Under scenario (a) the modes are well separated where most of the modes are more than 15 standard deviations apart. The distance between the modes makes it challenging for standard MCMC algorithms to explore all of the modes (see Figure

3). In scenario (b) the mixture components close to (5,5) have a higher weight with smaller variance. Figure 3: 10,000 samples drawn from the the target under scenario (a) (top) and scenario (b) (bottom) using the HMC sampler (left) and pseudo-extended method (right).

We ran the HMC and pseudo-extended HMC () samplers under the same conditions as in Kou et al. (2006) and Tak et al. (2016), for 10,000 iterations. Figure 3 shows the samples drawn using the standard HMC and pseudo-extended HMC samplers overlayed with the contour plots. In the case where the modes are well-separated (scenario (a)), the HMC sampler is only able to explore the modes locally clustered together, whereas the pseudo-extended HMC sampler is able to explore all of the modes, with the same number of iterations. Under scenario (b), the weights and variances of the mixture components are larger than under scenario (a), as a result, there is a higher density region separating the modes making it easier for the HMC sampler to move between the mixture components. Compared to the pseudo-extended HMC sampler, the standard HMC sampler is still not able to explore all of the modes of the target.

In Table 1, we compare the pseudo-extended sampler against parallel tempering (PT) (Geyer, 1991), repelling-attracting Metropolis (RAM) (Tak et al., 2016) and the equi-energy (EE) MCMC sampler (Kou et al., 2006)

, all of which are designed for sampling from multi-modal distributions. The first and second moments are estimated for each sampler taken over 20 independent simulations. Each sampler was ran for 50,000 iterations (after burn-in) and the specific tuning details for the temperature ladder of PT and the energy rings for EE are given in

Kou et al. (2006). Unlike the competing methods, the HMC and pseudo-extended (PE) HMC samplers are automatically tuned using the NUTS algorithm (Hoffman and Gelman, 2014). Furthermore, while not reported here, the HMC samplers produce Markov chains with significantly reduced auto-correlation compared to the EE and RAM samplers, which both rely on random-walk updates. We note from Table 1 that increasing the number of pseudo-samples leads to improved estimates at an increased computational cost. In the Supplementary Material (Section B), we show that, when taking account for computational cost, only 2-5 pseudo-samples are required.

### 4.2 Boltzmann machine relaxations

A challenging inference problem from the statistical physics literature is sampling from a Boltzmann machine distribution (Jordan et al., 1999). The Boltzmann machine distribution is defined on the binary space with mass function

 P(s)=1Zbexp{12s⊤Ws+s⊤b},withZb=∑s∈Sexp{12s⊤Ws+s⊤b}, (9)

where is a real symmetric matrix and are the model parameters. Sampling from this distribution typically requires Gibbs steps (Geman and Geman, 1984) which tend to mix very poorly as the states can be strongly correlated when the Boltzmann machine has high levels of connectivity (Salakhutdinov, 2010).

HMC methods have been shown to perform significantly better than Gibbs sampling when the states of the target distribution are highly correlated (Girolami and Calderhead, 2011). Unfortunately, HMC is generally restricted to sampling from continuous state spaces. Using the Gaussian integral trick (Hertz et al., 1991), we can introduce auxiliary variables and transform (9) to now sample from (see Section C in the Supplementary Material).

Following the set-up of Graham and Storkey (2017), the bias parameters are sampled as and the weight parameter was generated by first sampling a

random orthogonal matrix

(Stewart, 1980)

along with a vector of eigenvalues

, where and , for . The weight parameter is then set to , with the diagonal elements set to , for , where we set (). We set and which has been found empirically to produce highly multi-modal distributions.

Under the settings given above, we compare the HMC and pseudo-extended (PE) HMC algorithms against annealed importance sampling (AIS), simulated tempering (ST), and continuously tempered HMC algorithm of Graham and Storkey (2017) (GS). Set-up details for each algorithm are given in Section C of the Supplementary Material. Figure 4: Two-dimensional projection of 10,000 samples drawn from the target using each of the proposed methods, where the first plot gives the ground-truth sampled directly from the Boltzmann machine relaxation distribution. A temperature ladder of length 1,000 was used for both simulated tempering and annealed importance sampling. Figure 5: Root mean squared error (given on the log scale) of the first and second moment of the target taken over 10 independent simulations and calculated for each of the proposed methods.

In Figure 4, samples drawn using each algorithm are plotted alongside independent draws from the Boltzmann relaxation distribution (only computationally feasible for low to moderate dimensional settings). Most of the methods, with the exception of HMC, are able to capture the dominant modes of the target. The pseudo-extended method provides an accurate approximation of the target landscape, in particular the areas of low density. In Section C of the Supplementary Material we analytically derive the first two moments of the Boltzmann distribution, and in Figure 8 we give the root mean squared error of the moment approximations. These results support the conclusion seen visually in Figure 4, that better exploration of the target space leads to improved estimation of integrals of interest. Further simulations in Supplementary Material show the effect of varying the number of pseudo-samples, which again supports the conclusion from the mixture of Gaussians example (Section 4.1) that it is sensible to let .

### 4.3 Sparse logistic regression with horseshoe priors

In this section, we apply the pseudo-extended framework to the problem of sparse Bayesian inference with horseshoe priors. This is a common problem in statistics and machine learning, where the number of parameters

to estimate is much larger than the size of the available data used to fit the model. Taking a Bayesian approach, we can use shrinkage priors to shrink model parameters, in this case , to zero and prevent the model from over-fitting to the data. There are a range of shrinkage priors presented in the literature (Griffin and Brown, 2013) and here we use the horseshoe prior (Carvalho et al., 2010), in particular, the regularized horseshoe as proposed by Piironen and Vehtari (2017). From a sampling perspective, sparse Bayesian inference can be challenging as the posterior distributions are naturally multi-modal, where there is a spike at zero (indicating that variable is inactive) and some posterior mass centered away from zero.

Following Piironen and Vehtari (2017), we apply the regularized horseshoe prior on a logistic regression model (see Section D of the Supplementary Material for full details). We apply this model to three real-world data sets using micro-array data for cancer classification (see Piironen and Vehtari (2017) for further details regarding the data). We compare the pseudo-extended HMC algorithm against standard HMC and give the log-predictive density on a held-out test dataset in Figure 6. In order to ensure a fair comparison between HMC and pseudo-extended HMC, we run HMC for 10,000 iterations and reduce the number of iterations of the pseudo-extended algorithms (with N=2 and N=5) to give equal total computational cost. The results show that, for two out of three datasets, there is an improvement in using the pseudo-extended method. The strong performance of standard HMC is not surprising in this setting as the posterior density plots (given in the Supplementary Material) show that the posterior modes are close together, and as seen in scenario (b) of Section 4.1, the HMC sampler can usually locate and traverse between local modes. Figure 6: Log-predictive densities on a held-out test data for three cancer datasets comparing the HMC and pseudo-extended HMC samplers, with N=2 and N=5.

## 5 Discussion

We have introduced the pseudo-extended method as an approach for augmenting the target distribution for MCMC sampling. We have shown that the pseudo-extended method can be used to sample from multi-modal distributions, a challenging scenario for standard MCMC algorithms, and does not require prior knowledge of where, or how many, modes there are in the target. We have combined the pseudo-extended method with the Hamiltonian Monte Carlo algorithm to create an efficient MCMC sampling tool and have shown that a sensible choice for the instrumental distribution is to use a tempered version of the target. However, there are many alternative proposals that could also be used to potentially extend the pseudo-extended technique beyond sampling from multi-modal posteriors. Further work investigating these alternatives is ongoing.

### Acknowledgements

Christopher Nemeth is supported by ESPRC grants EP/S00159X/1 and EP/R01860X/1.

## References

• Andrieu et al. (2010) Andrieu, C., Doucet, A., and Holenstein, R. (2010). Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342.
• Andrieu and Roberts (2009) Andrieu, C. and Roberts, G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37:697–725.
• Beaumont (2003) Beaumont, M. (2003). Estimation of population growth or decline in genetically monitored populations. Genetics, 164(3):1139–60.
• Beskos et al. (2013) Beskos, A., Pillai, N., Roberts, G., Sanz-Serna, J.-M., and Stuart, A. (2013). Optimal tuning of the hybrid Monte Carlo algorithm. Bernoulli, 19(5A):1501–1534.
• Bishop (2006) Bishop, C. M. (2006). Pattern recognition and machine learning. springer.
• Brooks et al. (2011) Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. (2011). Handbook of markov chain monte carlo. CRC press.

Estimating Bayes factors via thermodynamic integration and population MCMC.

Computational Statistics and Data Analysis, 53(12):4028–4045.
• Carpenter et al. (2017) Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1):1–32.
• Carvalho et al. (2010) Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480.
• Geman and Geman (1984) Geman, S. and Geman, D. (1984). Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Trans Pattern Analysis and Machine Intelligence, 6(6):721–741.
• Geyer (1991) Geyer, C. J. (1991). Markov chain Monte Carlo maximum likelihood. In Computing Science and Statistics: Proc. 23rd Symp. on the Interface, pages 156–163. Fairfax.
• Girolami and Calderhead (2011) Girolami, M. and Calderhead, B. (2011). Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214.
• Graham and Storkey (2017) Graham, M. M. and Storkey, A. J. (2017). Continuously tempered Hamiltonian Monte Carlo. In

Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence

, pages 1–12.
• Green et al. (2015) Green, P. J., Latuszy, K., Pereyra, M., and Robert, C. P. (2015). Bayesian computation: a perspective on the current state, and sampling backwards and forwards. arXiv preprint arXiv:1502.01148.
• Griffin and Brown (2013) Griffin, J. E. and Brown, P. J. (2013). Some priors for sparse regression modelling. Bayesian Analysis, 8(3):691–702.
• Haario et al. (1999) Haario, H., Saksman, E., and Tamminen, J. (1999). Adaptive proposal distribution for random walk Metropolis algorithm. Computational Statistics, 14(3):375.
• Hertz et al. (1991) Hertz, J. A., Krogh, A. S., and Palmer, R. G. (1991). Introduction to the theory of neural computation, volume 1. Basic Books.
• Hoffman and Gelman (2014) Hoffman, M. and Gelman, A. (2014). The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15:1593–1623.
• Jasra et al. (2007) Jasra, A., Stephens, D. A., and Holmes, C. C. (2007). On population-based simulation for static inference. Statistics and Computing, 17(3):263–279.
• Jordan et al. (1999) Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233.
• Kou et al. (2006) Kou, S., Zhou, Q., and Wong, W. H. (2006). Equi-energy sampler with applications in statistical inference and statistical mechanics. Annals of Statistics, 34(4):1581–1619.
• Lindsten and Doucet (2016) Lindsten, F. and Doucet, A. (2016). Pseudo-marginal Hamiltonian Monte Carlo. arXiv.org, arXiv:1607.02516.
• Marinari and Parisi (1992) Marinari, E. and Parisi, G. (1992). Simulated Tempering : a New Monte Carlo Scheme. Europhysics Letters, 19(6):451–458.
• McLachlan and Peel (2000) McLachlan, G. J. and Peel, D. (2000). Finite mixture models. Wiley Series in Probability and Statistics, New York.
• Murray and Graham (2016) Murray, I. and Graham, M. M. (2016). Pseudo-marginal slice sampling. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), Cadiz, Spain.
• Neal (2001) Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11:125–139.
• Neal (2003) Neal, R. M. (2003). Slice sampling. Annals of Statistics, 31(3):743–748.
• Neal (2010) Neal, R. M. (2010). MCMC Using Hamiltonian Dynamics. In Handbook of Markov Chain Monte Carlo (Chapman & Hall/CRC Handbooks of Modern Statistical Methods), pages 113–162.
• Neal (2012) Neal, R. M. (2012). Bayesian learning for neural networks, volume 118. Springer Science & Business Media.
• Nesterov (2009) Nesterov, Y. (2009). Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259.
• Piironen and Vehtari (2017) Piironen, J. and Vehtari, A. (2017). Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2):5018–5051.
• Roberts and Rosenthal (2004) Roberts, G. and Rosenthal, J. S. (2004). General state space Markov chains and MCMC algorithms. Probability Surveys, 1:20–71.
• Roberts and Tweedie (1996) Roberts, G. and Tweedie, R. (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363.
• Roberts et al. (1997) Roberts, G. O., Gelman, A., and Gilks, W. (1997). Weak Convergence and Optimal Scaling of the Random Walk Metropolis Algorithms. The Annals of Applied Probability, 7(1):110–120.
• Roberts and Rosenthal (1998) Roberts, G. O. and Rosenthal, J. S. (1998). Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255–268.
• Rue et al. (2009) Rue, H., Martino, S., and Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):319–392.
• Salakhutdinov (2010) Salakhutdinov, R. (2010). Learning Deep Boltzmann Machines using Adaptive MCMC. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), volume 10, pages 943—-950.
• Sejdinovic et al. (2014) Sejdinovic, D., Strathmann, H., Garcia, M. L., Andrieu, C., and Gretton, A. (2014). Kernel Adaptive Metropolis-Hastings. In International Conference on Machine Learning, volume 31, pages 1665–1673.
• Stewart (1980) Stewart, G. W. (1980). The Efficient Generation of Random Orthogonal Matrices with an Application to Condition Estimators. SIAM Journal of Numerical Analysis, 17(3):403–409.
• Tak et al. (2016) Tak, H., Meng, X.-L., and van Dyk, D. A. (2016). A repulsive-attractive metropolis algorithm for multimodality. arXiv preprint arXiv:1601.05633.
• Zhang et al. (2012) Zhang, Y., Sutton, C., Storkey, A., and Ghahramani, Z. (2012). Continuous Relaxations for Discrete Hamiltonian Monte Carlo. In Advances in Neural Information Processing Systems 25, pages 3194–3202.

## Appendix A Proof of Theorem 2.1

We start by assuming that are distributed according to the extended-target . Assuming there exists a measurable function, , we define the expectation of the function over the extended-target as where is the unnormalized target density (1) and is the instrumental distribution (discussed in Section 2). Using the density for the pseudo-extended target (2), it follows that

 EπN[∑Ni=1f(xi)γ(xi)/q(xi)∑Ni=1γ(xi)/q(xi)]=∫∑Ni=1f(xi)γ(xi)/q(xi)∑Ni=1γ(xi)/q(xi)πN(x1:N)dx1:N =∫∑Ni=1f(xi)γ(xi)/q(xi)∑Ni=1γ(xi)/q(xi)×1Z{1NN∑i=1γ(xi)q(xi)}×∏iq(xi)dx1:N =1ZN∫{N∑i=1f(xi)γ(xi)q(xi)}×∏iq(xi)dx1:N =1N∫N∑i=1f(xi)π(xi)∏j≠iq(xj)dx1:N=1NN∑i=1∫f(xi)π(xi)dxi∏j≠iq(xj)=Eπ[f(x)]\qed

## Appendix B Mixture of Gaussians

The pseudo-extended sampler with tempered proposal (Section 3) performs well in both scenarios, where the modes are close or far apart. For the smallest number of pseudo-samples (), the pseudo-extended HMC sampler performs equally as well as the competing methods. Increasing the number of pseudo-samples leads to a decrease in the standard deviation of the moment estimates. However, increasing the number of pseudo-samples also increases the overall computational cost of the pseudo-extended sampler. Figure 7 measures the cost of the pseudo-extended sampler as the average mean squared error (over 20 runs) multiplied by the computational time. From the figure we see that by minimizing the error relative to computational cost, the optimal number of pseudo-samples, under both scenarios, is between 2 and 5. We also note that Figure 7 suggests that the number of pseudo-samples may be problem specific. In scenario (a), where the modes are well-separated, increasing the number of pseudo-samples beyond 5 does not significantly increase the cost of the sampler, whereas under scenario (b), using more than 5 pseudo-samples (where the mixture components are easier to explore) introduces a significant increase in the computational cost without a proportional reduction in the error. Figure 7: Average mean squared error (MSE) (given on the log scale) of the first and second moments taken over 20 independent simulations for varying number of pseudo-samples N, where MSE is scaled by computational time (CT) and plotted as MSE×CT.

## Appendix C Boltzmann machine relaxation derivations

The Boltzmann machine distribution is defined on the binary space with mass function

 P(s) =1Zbexp{12s⊤Ws+s⊤b}, (10) Zb =∑s∈Sexp{12s⊤Ws+s⊤b},

where and is a real symmetric matrix are the model parameters.

Following the approach of Graham and Storkey (2017) and Zhang et al. (2012), we convert the problem of sampling on the discrete space to a continuous problem using the Gaussian integral trick (Hertz et al., 1991). We introduce the auxiliary variable which follows a conditional Gaussian distribution,

 π(x|s)=1(2π)d/2exp{−12(x−Q⊤s)⊤(x−Q⊤s)}, (11)

where is a matrix such that and is a diagonal matrix chosen to ensure that is a positive semi-definite matrix.

Combining (10) and (11) the joint distribution is,

 π(x,s)= = 1(2π)d/2Zbexp{−12x⊤xs⊤(Qx+b)−12s⊤Ds} = 1(2π)d/2Zbexp{12Tr(D)}exp{−12x⊤x}×db∏k=1exp{sk(q⊤kx+bk)},

where are the rows of . The key feature of this trick is that the

term cancel. On the joint space the binary variables

variables are now decoupled and can be summed over independently to give the marginal density,

 π(x)= 2db(2π)d/2Zbexp{12Tr(D)}exp{−12x⊤x}×db∏i=kcosh(q⊤kx+bk),

which is referred to as the Boltzmann machine relaxation density, which is a Gaussian mixture with components.

We can rearrange the terms in the Boltzmann machine relaxation density to match our generic target , (1), where

 ϕ(x)=12x⊤x−db∑k=1logcosh(q⊤kx+bk),

and the normalizing constant is directly related to the Boltzmann machine distribution

 logZ=logZb+12Tr(D)+d2log(2π)−dblog2.

Converting a discrete problem onto the continuous space does not automatically guarantee that sampling from the continuous space will be any easier than on the discrete space. In fact, if the elements of are large, then on the relaxed space, the modes of the mixture components will be far apart making it difficult for an MCMC sampler to explore the target. Following Zhang et al. (2012), for the experiments in this paper we select by minimizing the maximum eigenvalue of which has the effect of decreasing the separation of the mixture components on the relaxed space.

For the MCMC simulation comparison given in Section 4.2, we compare our pseudo-extended (PE) method against HMC, annealed importance sampling (AIS), simulated tempering (ST) and the Graham and Storkey (2017) (GS) algorithm. For the setting where we can draw independent samples from the Boltzmann distribution (9), if where any large, then this would not be possible. We run each of the competing algorithms for 10,000 iterations and for PE, we test (see Figure 8) but in Figure 4 we only plot the results for . For ST and AIS, both of which require a temperature ladder , we used a ladder of length 1,000 with equally-spaced uniform intervals.

Finally, the first two moments of the relaxed distribution can be directly related to their equivalent moments for the Boltzmann machine distribution by

 E[X]= E[XX⊤]= ∑s∈S[∫Xxx⊤N(x|Q⊤s,I)dxP(s)]=E[Q⊤SS⊤Q+I]=Q⊤E[SS⊤]+I.

As noted in the mixture of Gaussians example (Section 4.1), increasing the number of pseudo-samples improves the accuracy of the pseudo-extended method, but at a computational cost which grows linearly with . When choosing the number of pseudo-samples it is sensible that increases linearly with the dimension of the target. However, taking into account computational cost (Figure 8), a significantly smaller number of pseudo-samples can be used while still achieving a high level of sampling accuracy. Figure 8: Average mean squared error (MSE) (given on the log scale) taken over 10 independent simulations with varying number of pseudo-samples N, where the MSE is scaled by computational time as MSE×CT

## Appendix D Sparse logistic regression plots

We consider the following logistic regression model for data ,

 Pr(Y=y)=py(1−p)1−y,

where

 p=11+exp(z⊤x)

and are covariates. In this setting our parameter of interest is the model coefficient, and recalling that , we can define a regularized horseshoe prior (Piironen and Vehtari, 2017) on each of the coefficients as,

 xj|λj,τ,c ∼N(0,τ2~λ2j),   ~λ2j=c2λ2jc2+τ2λ2j, λj ∼C+(0,1),j=1,…,d,

where is a constant (for which we follow Piironen and Vehtari (2017) in choosing) and

is a half-Cauchy distribution. To give an indication of how this prior behaves, when

, the coefficient is close to zero and the regularized horseshoe prior (above) approaches the original horseshoe (Carvalho et al., 2010). Alternatively, when , the coefficient moves away from zero and the regularizing feature of this prior means that it approaches .

Figure 9 gives the posterior density plots for a random subset of the model parameters for each dataset. We can see from these plots that the posteriors are mostly uni-modal with some posterior mass centered at zero. This is a common trait of horseshoe and similar priors for inducing sparsity, where the point-mass at zero indicates that the variable is turned-off (mass at zero), or contains some positive posterior mass elsewhere. We also note that, unlike the examples given in Sections B and 4.2, the posterior modes are close together. For this reason, it is unsurprising that the HMC algorithm is able to accurately explore the posterior space, and as a result, produce accurate log-predictive estimates (as seen in Figure 6).

## Appendix E Challenging targets

The examples considered so far have focused on using the pseudo-extended framework to sample from multi-modal targets. In this example, as an early piece of work, we consider applying the pseudo-extended method beyond sampling from multi-modal targets. In a more general sense, the pseudo-extended method can be applied to sample from target distributions which are challenging for standard MCMC samplers. We consider two illustrative examples to highlight the improvement of pseudo-extended HMC over standard HMC. The first is the warped Gaussian target, popularly known as banana-shaped target given in Haario et al. (1999)

 π(x)=N(x1|0,ν)N(x2|b(x21−ν),1),

where we set and for our experiments. Our second example is the flower target from Sejdinovic et al. (2014),

 π(x)=exp⎧⎪ ⎪⎨⎪ ⎪⎩−√x21+x22−r−Acos(ωarctan2(x2,x1))2σ2⎫⎪ ⎪⎬⎪ ⎪⎭,

where the distribution represents a periodic perturbation with amplitude and frequency . For our experiments we set , , and .

We use the NUTS (Hoffman and Gelman, 2014) tuning algorithm for HMC as implemented within STAN (Carpenter et al., 2017) for both standard HMC and pseudo-extended HMC, with pseudo-samples and both algorithms ran for 10,000 iterations, with the first half of the chain removed as burn-in. Figures 10(a) and 10(b) show the trace plots for both targets overlayed with a contour plot. For both targets the pseudo-extended model completely explores the target space, whereas the standard HMC algorithm struggles to fully explore the flower target within the fixed number of iterations and appears to under explore the tails of the banana target. The lines given in the plot show the trajectory of the Markov chain, what is particularly notable is that the pseudo-extended method allows the HMC sampler to make large jumps across the target space which reduces the autocorrelation of the Markov chain.