1 Introduction
Markov chain Monte Carlo (MCMC) methods (see, e.g., Brooks et al. (2011)) are generally regarded as the gold-standard approach for sampling from high-dimensional distributions. In particular, MCMC algorithms have been extensively applied within the field of Bayesian statistics (Green et al., 2015) to sample from posterior distributions when the posterior density can only be evaluated up to a constant of proportionality. Under mild conditions, it can be shown that the limiting distribution of the samples generated by the MCMC algorithm converges asymptotically to the posterior distribution of interest. While theoretically elegant, one of the main drawbacks of MCMC methods is that running the algorithm to stationarity can be prohibitively expensive if the posterior distribution has a complex form, for example, if it contains multiple unknown modes. Notable examples of multimodality include the posterior over model parameters in mixture models (McLachlan and Peel, 2000), deep neural networks (Neal, 2012), and differential equation models (Calderhead and Girolami, 2009). In these settings, the Markov chain can become stuck for long periods of time without fully exploring the posterior distribution.

In this paper, we present the pseudo-extended Markov chain Monte Carlo method as an approach for augmenting the state-space of the original posterior distribution to allow the MCMC sampler to easily move between areas of high posterior density. The pseudo-extended method introduces pseudo-samples on the extended space to improve the mixing of the Markov chain. To illustrate how this method works, in Figure 1 we plot a mixture of two univariate Gaussian distributions (left). The area of low probability density between the two Gaussians will make it difficult for an MCMC sampler to traverse between them. Using the pseudo-extended approach (as detailed in Section 2), we can extend the state-space to two dimensions (right), where on the extended space the modes are now connected, allowing the MCMC sampler to mix easily between them.

The pseudo-extended framework can be applied to general MCMC sampling; however, in this paper we focus on using ideas from tempered MCMC (Jasra et al., 2007) to improve multimodal posterior sampling. Unlike previous approaches that use MCMC to sample from multimodal posteriors, i) we do not require a priori information regarding the number, or location, of modes, and ii) we do not need to specify a sequence of intermediary tempered distributions (Geyer, 1991).

We show that samples generated using the pseudo-extended method admit the correct posterior of interest as the limiting distribution. Furthermore, once weighted using a post-hoc correction step, all pseudo-samples can be used for approximating the posterior distribution. The pseudo-extended method can be applied as an extension to many popular MCMC algorithms, including the random-walk Metropolis (Roberts et al., 1997) and the Metropolis-adjusted Langevin algorithm (Roberts and Tweedie, 1996). In this paper, however, we focus on applying the popular Hamiltonian Monte Carlo (HMC) algorithm (Neal, 2010) within the pseudo-extended framework, and show that this leads to improved posterior exploration compared to standard HMC.
2 The Pseudo-Extended Method
Let π be a target probability density on ℝ^d with respect to the Lebesgue measure, defined for all x ∈ ℝ^d by

π(x) = γ(x)/Z = exp(−φ(x))/Z,   (1)

where φ : ℝ^d → ℝ is a continuously differentiable function and Z = ∫ exp(−φ(x)) dx is the normalizing constant, which is typically intractable. Throughout we will refer to π as the target density. In the Bayesian setting, this would be the posterior, where for data y, the likelihood is denoted p(y | x) with parameters x assigned a prior density p(x). The posterior density of the parameters given the data is derived from Bayes' theorem to give π(x) = p(x | y) = p(y | x) p(x) / p(y), where the marginal likelihood p(y) = ∫ p(y | x) p(x) dx is the normalizing constant Z, which is typically not available analytically.

We extend the state-space of the original target distribution (1) by introducing N pseudo-samples, x_{1:N} = {x_1, …, x_N}, where the extended-target distribution is defined on ℝ^{d × N}. The pseudo-samples act as auxiliary variables, where for each x_i we introduce an instrumental distribution q(x_i) with support covering that of π. In a similar vein to the pseudo-marginal MCMC algorithm (Beaumont, 2003; Andrieu and Roberts, 2009), our extended-target, including the auxiliary variables, is now of the form

π_N(x_{1:N}) = (1/N) Σ_{i=1}^N [ γ(x_i) / (Z q(x_i)) ] Π_{j=1}^N q(x_j),   (2)

where γ and Z are defined in (1). In pseudo-marginal MCMC, it is common to use q as a proposal distribution for an importance sampler, and to use it to compute an unbiased estimate of the intractable normalizing constant (see Section 2.3 for details). However, we will take a different approach, as explained below, and we do not even require that exact sampling from q is possible; a fact that we will exploit in Section 3 for designing q.

In the case where N = 1, our extended-target (2) simplifies back to the original target (1). For N > 1, the marginal distribution of the i-th pseudo-sample is a mixture between the target and the proposal,

(1/N) π(x_i) + (1 − 1/N) q(x_i).

Samples from the original target of interest, π, are recovered using a post-hoc correction step, where the samples from the extended-target are weighted with weights proportional to γ(x_i)/q(x_i).
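As a concrete illustration of eq. (2), the pseudo-extended log density can be evaluated stably in log space using a log-sum-exp over the per-sample ratios. The sketch below uses toy stand-ins for log γ and log q (both illustrative choices, not from the paper):

```python
import numpy as np

def log_gamma(x):
    # Toy unnormalized log target: a bimodal mixture of two unit Gaussians
    # (an illustrative stand-in for log gamma(x) = -phi(x)).
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

def log_q(x):
    # Instrumental log density: a wide zero-mean Gaussian (variance 9).
    return -0.5 * x ** 2 / 9.0 - 0.5 * np.log(2 * np.pi * 9.0)

def log_pseudo_extended(xs):
    """Log of the extended target (2), up to the constant -log(N Z):
    logsumexp_i [log gamma(x_i) - log q(x_i)] + sum_j log q(x_j)."""
    xs = np.asarray(xs, dtype=float)
    log_w = log_gamma(xs) - log_q(xs)      # per-pseudo-sample log ratios
    return np.logaddexp.reduce(log_w) + log_q(xs).sum()
```

For N = 1 the ratio and the product cancel, so the expression reduces to log γ(x), recovering the original target up to an additive constant.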
Theorem 2.1.
Let x_{1:N} be distributed according to the extended-target π_N. Weighting each sample x_i with self-normalized weights proportional to w_i = γ(x_i)/q(x_i) gives samples from the target distribution π, in the sense that

E_{π_N}[ Σ_{i=1}^N ( w_i / Σ_{j=1}^N w_j ) f(x_i) ] = E_π[ f(x) ],   (3)

for an arbitrary integrable function f.
The proof follows from the invariance of particle Gibbs (Andrieu et al., 2010) and is given in Section A of the Supplementary Material.
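The post-hoc correction of Theorem 2.1 can be sketched as follows, assuming generic callables for log γ, log q, and the test function f (names are illustrative, not from the paper):

```python
import numpy as np

def weighted_estimate(xs, log_gamma, log_q, f):
    # Self-normalized weights w_i ∝ gamma(x_i)/q(x_i), computed in log
    # space for numerical stability, then used to estimate E_pi[f].
    xs = np.asarray(xs, dtype=float)
    log_w = log_gamma(xs) - log_q(xs)
    log_w -= np.logaddexp.reduce(log_w)    # normalize: weights sum to one
    return np.sum(np.exp(log_w) * f(xs))
```

As a sanity check, if q coincides with π (so γ ∝ q), the weights are uniform and the estimator reduces to a plain average of f over the pseudo-samples.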
Introducing pseudo-samples has the effect of linearly increasing the state-space of the target. In general, we would prefer to perform MCMC on a lower-dimensional space rather than increasing the size of the state-space, as this leads to slower mixing of the Markov chain (Roberts and Rosenthal, 2004). This is less problematic for the HMC algorithm compared to the random-walk Metropolis: the step-size should be scaled as O(d^{−1/4}) for HMC, compared to O(d^{−1}) for the variance of the random-walk Metropolis proposal (Beskos et al., 2013), giving a computational complexity of O(d^{5/4}) and O(d^2), respectively, to explore the target in stationarity. This indicates that in high-dimensional spaces, the HMC algorithm is able to explore the target space faster. One disadvantage of HMC, and of other gradient-based MCMC algorithms, is that they tend to be mode-seeking and are more prone to getting trapped in local modes of the target. Modifying the target distribution using the pseudo-extended framework, whereby the modes of the target can be connected on the extended space, reduces the mode-seeking behavior of HMC and allows the sampler to move more easily between the modes of the target.

2.1 Pseudo-extended Hamiltonian Monte Carlo
Recalling that our parameters of interest are x, we introduce artificial momentum variables p ∈ ℝ^d that are independent of x. The Hamiltonian H(x, p) represents the total energy of the system as the combination of the potential function φ(x), as defined in (1), and the kinetic energy p^⊤ M^{−1} p / 2,

H(x, p) = φ(x) + (1/2) p^⊤ M^{−1} p,

where M is a mass matrix, often set to the identity matrix. The Hamiltonian augments our target distribution so that we are sampling from the joint distribution π(x, p) ∝ exp(−H(x, p)), which admits the target π(x) as its marginal. In the case of the pseudo-extended target (2), the Hamiltonian is

H(x_{1:N}, p_{1:N}) = −log Σ_{i=1}^N exp( ψ(x_i) − φ(x_i) ) + Σ_{j=1}^N ψ(x_j) + (1/2) Σ_{j=1}^N p_j^⊤ M^{−1} p_j,   (4)

where now p_{1:N} ∈ ℝ^{d × N}, and ψ(x) = −log q(x) is the potential function of the instrumental distribution q from eq. (2).
Aside from a few special cases, we generally cannot simulate exactly from the dynamical system corresponding to the Hamiltonian (4) (Neal, 2010). Instead, we discretize time using a small step-size ε and calculate the state at times ε, 2ε, 3ε, etc. using a numerical integrator. Several numerical integrators are available which preserve the volume and reversibility of the Hamiltonian system (Girolami and Calderhead, 2011), the most popular being the leapfrog integrator, which takes L steps, each of size ε, through the Hamiltonian dynamics (see Algorithm 1). After a fixed number of iterations, the algorithm generates samples approximately distributed according to the joint distribution; after discarding the momentum variables p_{1:N}, our MCMC samples will be approximately distributed according to the target π_N.

Two important tuning parameters of Algorithm 1 are the step-size ε and the number of leapfrog steps L. We estimate these using the No-U-Turn Sampler (NUTS) introduced by Hoffman and Gelman (2014), where the trajectory length L is tuned to avoid the sampler doubling back on itself and ε is estimated using dual averaging (Nesterov, 2009). In this paper, we use the HMC algorithm because of its impressive mixing times; however, it is important to note that alternative MCMC schemes, such as the random-walk Metropolis (Roberts et al., 1997), the Metropolis-adjusted Langevin algorithm (Roberts and Rosenthal, 1998), the slice sampler (Neal, 2003), and other MCMC algorithms can also be used to sample from the pseudo-extended target (2).
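The leapfrog update described above can be sketched as follows. This is a generic implementation with an identity mass matrix, not the paper's Algorithm 1:

```python
import numpy as np

def leapfrog(x, p, grad_phi, eps, L):
    """L leapfrog steps of size eps for H(x, p) = phi(x) + p.p/2.
    The integrator is volume-preserving and time-reversible."""
    x, p = np.copy(x), np.copy(p)
    p = p - 0.5 * eps * grad_phi(x)          # initial half-step for momentum
    for step in range(L):
        x = x + eps * p                      # full position step
        if step < L - 1:
            p = p - eps * grad_phi(x)        # full momentum step
    p = p - 0.5 * eps * grad_phi(x)          # final half-step for momentum
    return x, p
```

For a standard Gaussian potential φ(x) = x²/2, the Hamiltonian is conserved up to O(ε²) error along the trajectory, and negating the final momentum and integrating again returns the initial state, which is what makes the resulting Metropolis correction valid.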
2.2 One-dimensional illustration
Consider a bimodal target π given by a mixture of two univariate Gaussian distributions (see Figure 1 (left)). With N = 2 pseudo-samples, the pseudo-extended target (2) simplifies to

π_2(x_1, x_2) ∝ (1/2) [ π(x_1) q(x_2) + q(x_1) π(x_2) ],

and for the sake of illustration, we fix a simple instrumental distribution q.
Density plots for the original and pseudo-extended targets are given in Figure 1. On the original target, the modes are separated by a region of low density, and an MCMC sampler will therefore only pass between the modes with low probability, thus potentially requiring an exhaustive number of iterations. On the pseudo-extended target, the modes of the original target are now connected on the extended space. The proposal q has the effect of increasing the density in the low-probability regions of the target which separate the modes. A higher density between the modes means that the MCMC sampler can now traverse between the modes with higher probability than under the original target.
In Figure 2, density plots of the original target are overlaid with samples drawn from the original and pseudo-extended targets using the HMC algorithm, respectively. After 10,000 iterations of the HMC sampler on the original target, only one mode is discovered. Applying the same HMC algorithm on the pseudo-extended target, and then weighting the samples (as discussed in Section 2 and shown in Algorithm 1), both modes of the original target are discovered and the samples produce a good empirical approximation to the target.
2.3 Connections to pseudo-marginal MCMC
The pseudo-extended target (2) can be viewed as a special case of the pseudo-marginal target of Andrieu and Roberts (2009). In the pseudo-marginal setting, it is (typically) assumed that the target density is of the form π(θ) = ∫ π(θ, x) dx, where θ is some "top-level" parameter and x is now viewed as a latent variable that cannot be integrated out analytically. Using importance sampling, an unbiased Monte Carlo estimate of the target is computed using N latent-variable samples from a proposal distribution with density q(x | θ), approximating the integral as

π̂(θ) = (1/N) Σ_{i=1}^N π(θ, x_i) / q(x_i | θ),   x_i ~ q(· | θ).

The pseudo-marginal target is then defined, analogously to the pseudo-extended target (2), as

π̄_N(θ, x_{1:N}) = (1/N) Σ_{i=1}^N [ π(θ, x_i) / q(x_i | θ) ] Π_{j=1}^N q(x_j | θ),   (5)

which admits π(θ) as a marginal. In the original pseudo-marginal method, the extended-target is sampled from using MCMC, with an independent proposal for x_{1:N} (corresponding to importance sampling for these variables) and a standard MCMC proposal (e.g., random-walk) used for θ.
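The unbiased importance-sampling estimate underlying the pseudo-marginal construction can be sketched as follows, using generic log-density callables with illustrative names:

```python
import numpy as np

def pm_estimate(theta, log_joint, sample_q, log_q, N, rng):
    """Importance-sampling estimate pi_hat(theta) of the marginal
    pi(theta) = ∫ pi(theta, x) dx, using N draws x_i ~ q(.|theta):
    pi_hat(theta) = (1/N) sum_i pi(theta, x_i) / q(x_i|theta)."""
    xs = sample_q(theta, N, rng)
    log_ratios = log_joint(theta, xs) - log_q(theta, xs)
    return np.exp(np.logaddexp.reduce(log_ratios) - np.log(N))
```

If the joint happens to equal c · q(x | θ) for a constant c, every ratio equals c and the estimate is exactly c, illustrating unbiasedness in the simplest case.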
There are two key differences between pseudo-marginal MCMC and pseudo-extended MCMC. Firstly, we do not distinguish between latent variables and parameters, and simply view all unknown variables, or parameters, of interest as being part of x. Secondly, we do not use an importance-sampling-based proposal to sample x_{1:N} (which, effectively, would just result in a standard importance sampler for π with proposal q), but instead we propose to simulate directly from the pseudo-extended target (2) using HMC, as explained in Section 2.1. An important consequence of this is that we can use instrumental distributions q without needing to sample from them. In Section 3 we exploit this fact to construct the instrumental distribution by tempering.
It has previously been proposed to simulate from the pseudo-marginal target (5) using HMC (Lindsten and Doucet, 2016). However, their motivation differs from ours, since they still view x_{1:N} as "nuisance variables" that are approximately marginalized out in order to mimic an ideal HMC algorithm targeting the marginal distribution π(θ). Furthermore, they study the convergence to this ideal HMC for large N. An implication of this is that they rely on a splitting technique which requires a reparameterization of q as a standard normal distribution. This limits the range of instrumental distributions that can be used with their method; in particular, they cannot use the tempered distributions which we propose in Section 3. On the contrary, based on the empirical results presented in Section 4, we advocate using a small number of pseudo-samples and therefore do not need to resort to the reparameterization used by Lindsten and Doucet (2016). Another related method, which also uses this reparameterization, is the pseudo-marginal slice sampler of Murray and Graham (2016).

In summary, the pseudo-marginal framework is a powerful technique for sampling from models with intractable likelihoods. The pseudo-extended method, on the other hand, is designed for sampling from complex target distributions, where the landscape of the target is difficult for standard MCMC samplers to traverse without an exhaustive number of MCMC iterations. In particular, where the target distribution is multimodal, we show that extending the state-space allows our MCMC sampler to more easily explore the modes of the target.
3 Tempering
3.1 Tempered targets as instrumental distributions
In the case of importance sampling, we would choose a proposal q which closely approximates the target π. However, this would assume that we could find a tractable instrumental distribution q which i) sufficiently covers the support of the target and ii) captures its multimodality. Approximations such as the Laplace approximation (Rue et al., 2009) and variational Bayes (e.g., Bishop (2006), Chapter 10) could be used to choose q; however, such approximations tend to locate only one mode of the target.

A significant advantage of the pseudo-extended framework (2) is that it permits a wide range of potential instrumental distributions. Unlike standard importance sampling, we also do not require q to be a distribution that we can sample from, only that it can be evaluated pointwise up to proportionality. This is a simpler condition to satisfy and allows us to find better instrumental distributions for connecting the modes of the target. In this paper, we utilize a simple approach for choosing the instrumental distribution which does not require a closed-form approximation of the target. Specifically, we use tempered targets as instrumental distributions.
Tempering has previously been utilized in the MCMC literature to improve the sampling of multimodal targets. Here we use a technique inspired by Graham and Storkey (2017) (see Section 3.2), where we consider the family of approximating distributions

π_β(x) = γ(x)^β / Z_β,   β ∈ (0, 1],   (6)

where γ(x)^β can be evaluated pointwise and Z_β = ∫ γ(x)^β dx is typically intractable.
We will construct an extended target distribution on ℝ^{(d+1) × N} consisting of pairs {x_i, β_i}, for i = 1, …, N. This target distribution will be constructed in such a way that the marginal distribution of each sample x_i is a mixture with components selected from the family (6). This will typically make the marginal distribution more diffuse than the target itself, encouraging better mixing.
Let q(x, β) = π_β(x) p(β) where, for the sake of tractability, we work with the unnormalized form, where γ(x)^β can be evaluated pointwise and Z_β is a normalizing constant. Thus,

q(x, β) ∝ exp( β log γ(x) + log p(β) ).   (7)

The joint proposal q(x, β) does not admit a closed-form expression and in general it is not possible to sample from it. We can, however, use an MCMC algorithm, in this case an HMC sampler, on the extended-target, and therefore only require that q can be evaluated pointwise up to proportionality. Under the proposal (6), the pseudo-extended target (2) is now

π_N(x_{1:N}, β_{1:N}) ∝ (1/N) Σ_{i=1}^N [ γ(x_i)^{1−β_i} / p(β_i) ] Π_{j=1}^N γ(x_j)^{β_j} p(β_j),   (8)

where p(β) is some arbitrary user-chosen distribution for β. Through our choice of q, the normalizing constants for the target and proposal, Z and Z_β respectively, are not dependent on x_{1:N} or β_{1:N} and so cancel in the Metropolis-Hastings ratio.
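To make the construction concrete, the log density of the tempered pseudo-extended target can be evaluated as below. This is a sketch of eq. (8) as reconstructed here, with a toy log γ and a uniform p(β), both illustrative assumptions rather than the paper's choices:

```python
import numpy as np

def log_gamma(x):
    # Toy bimodal unnormalized log target (illustrative only).
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

def log_p_beta(beta):
    # Uniform distribution for the temperature on (0, 1] (an assumption).
    return np.zeros_like(beta)

def log_tempered_extended(xs, betas):
    """Log of the tempered pseudo-extended target (8), up to a constant:
    logsumexp_i[(1 - beta_i) log gamma(x_i) - log p(beta_i)]
      + sum_j [beta_j log gamma(x_j) + log p(beta_j)]."""
    xs = np.asarray(xs, dtype=float)
    betas = np.asarray(betas, dtype=float)
    lg = log_gamma(xs)
    log_w = (1.0 - betas) * lg - log_p_beta(betas)
    return np.logaddexp.reduce(log_w) + np.sum(betas * lg + log_p_beta(betas))
```

For N = 1 with a uniform p(β), the temperature terms cancel and the expression reduces to log γ(x) regardless of β, consistent with (2) collapsing to the original target.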
3.2 Related work on tempered MCMC
Tempered MCMC is the most popular approach to sampling from multimodal target distributions (see Jasra et al. (2007) for a full review). The main idea behind tempered MCMC is to sample from a sequence of tempered targets,

π_{β_k}(x) ∝ γ(x)^{β_k},   k = 1, …, K,

where β_k is a tuning parameter, referred to as the temperature, that is associated with the target π_{β_k}. A sequence of temperatures, commonly known as the ladder, is chosen a priori, where 0 ≤ β_1 < β_2 < … < β_K = 1. The intuition behind tempered MCMC is that when β is small, the modes of the target are flattened out, making it easier for the MCMC sampler to traverse through the regions of low density separating the modes. One of the most popular tempering algorithms is parallel tempering (PT) (Geyer, 1991), where K separate MCMC algorithms are run in parallel, each sampling from one of the tempered targets π_{β_k}. Samples from neighboring Markov chains are exchanged (i.e., a sample from chain k may be exchanged with chain k − 1 or k + 1) using a Metropolis-Hastings step. These exchanges improve the convergence of the Markov chain to the target of interest π; however, information from low-β targets can be slow to traverse up the temperature ladder. There is also a serial version of this algorithm, known as simulated tempering (ST) (Marinari and Parisi, 1992). An alternative approach, introduced by Neal (2001), is annealed importance sampling (AIS), which draws samples from a simple base distribution and then, via a sequence of intermediate transition densities, moves the samples along the temperature ladder, giving a weighted sample from the target distribution. Generally speaking, these tempered approaches can be very difficult to apply in practice, often requiring extensive tuning. In the case of PT, the user needs to choose the number of parallel chains K, the temperature schedule, the step-size for each chain, and the number of exchanges at each iteration.
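The PT exchange step described above can be sketched as follows, assuming tempered targets of the form π_β(x) ∝ γ(x)^β (function names are illustrative):

```python
import numpy as np

def pt_swap(log_gamma, xs, betas, k, rng):
    """Propose swapping the states of chains k and k+1 and accept with
    the Metropolis-Hastings probability. For pi_beta ∝ gamma^beta the
    log acceptance ratio is
    (beta_k - beta_{k+1}) * (log gamma(x_{k+1}) - log gamma(x_k))."""
    log_a = (betas[k] - betas[k + 1]) * (log_gamma(xs[k + 1]) - log_gamma(xs[k]))
    if np.log(rng.uniform()) < min(0.0, log_a):
        xs[k], xs[k + 1] = xs[k + 1], xs[k]   # exchange the two states
        return True
    return False
```

Note that a swap between two chains at the same temperature has log ratio zero and is always accepted; the ratio only penalizes moving a low-probability state to a colder (larger β) chain.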
Our proposed tempering scheme is closely related to the continuously tempered HMC algorithm of Graham and Storkey (2017). They propose to run HMC on a distribution similar to (7) and then apply an importance weighting as a post-correction to account for the different temperatures. It thus has some resemblance to ST, in the sense that a single chain is used to explore the state space at different temperature levels. On the contrary, for our proposed pseudo-extended method, the distribution (7) is not used as a target, but merely as an instrumental distribution to construct the pseudo-extended target (8). The resulting method therefore has some resemblance to PT, since we propagate N pseudo-samples in parallel, all possibly exploring different temperature levels. Furthermore, by mixing in part of the actual target we ensure that the samples do not simultaneously "drift away" from regions with high probability under π.

Graham and Storkey (2017) propose to use a variational approximation to the target, both when defining the family of distributions (6) and for choosing the function used in their tempering scheme. This is also possible with the pseudo-extended method, but we do not consider this possibility here for brevity. Finally, we note that in the pseudo-extended method the temperature parameter β is treated as unknown and is sampled as part of the MCMC scheme, rather than being pre-specified as a sequence of fixed temperatures. This can be particularly beneficial as it avoids the need to tune the temperature ladder prior to implementing the MCMC sampler.
4 Experiments
We test the performance of the pseudo-extended method on four different examples. The first two (Sections 4.1 and 4.2) are chosen to show how the pseudo-extended method performs on simulated data when the target is multimodal. The third example (Section 4.3) considers a sparsity-inducing logistic regression model, where multimodality occurs in the posterior from three real-world datasets. We compare against popular competing algorithms from the literature, including methods discussed in Section 3.2. In Section E of the Supplementary Material, we show how the pseudo-extended approach can be applied beyond multimodal posteriors and consider two challenging target distributions.

All the simulations for the pseudo-extended method use the tempered proposal (Section 3) for q, and thus the pseudo-extended target is given by (8). For each simulation study we specify p(β) and the number of pseudo-samples N. We also use a logit transformation for β to map the parameters onto the unconstrained space. The pseudo-extended HMC algorithm (Alg. 1) is implemented within the STAN framework (Carpenter et al., 2017); all simulation code is available in the Supplementary Material and will be published on GitHub after the review process.

4.1 Mixture of Gaussians
We consider a popular example from the literature (Kou et al., 2006; Tak et al., 2016), where the target is a mixture of 20 bivariate Gaussians,

π(x) = Σ_{j=1}^{20} ( w_j / (2π σ_j²) ) exp( −‖x − μ_j‖² / (2σ_j²) ),

where the means μ_{1:20} are specified in Kou et al. (2006). We consider two scenarios: (a) each mixture component has equal weight w_j = 1/20 and equal variance σ_j², and (b) the weights w_{1:20} and variances σ²_{1:20} are unequal. Under scenario (a) the modes are well separated, with most of the modes more than 15 standard deviations apart. The distance between the modes makes it challenging for standard MCMC algorithms to explore all of the modes (see Figure 3). In scenario (b), the mixture components close to (5, 5) have a higher weight with smaller variance.

We ran the HMC and pseudo-extended HMC samplers under the same conditions as in Kou et al. (2006) and Tak et al. (2016), for 10,000 iterations. Figure 3 shows the samples drawn using the standard HMC and pseudo-extended HMC samplers overlaid with the contour plots. In the case where the modes are well-separated (scenario (a)), the HMC sampler is only able to explore the modes locally clustered together, whereas the pseudo-extended HMC sampler is able to explore all of the modes with the same number of iterations. Under scenario (b), the weights and variances of the mixture components are larger than under scenario (a); as a result, there is a higher-density region separating the modes, making it easier for the HMC sampler to move between the mixture components. Compared to the pseudo-extended HMC sampler, the standard HMC sampler is still not able to explore all of the modes of the target.
In Table 1, we compare the pseudo-extended sampler against parallel tempering (PT) (Geyer, 1991), the repelling-attracting Metropolis (RAM) (Tak et al., 2016) and the equi-energy (EE) MCMC sampler (Kou et al., 2006), all of which are designed for sampling from multimodal distributions. The first and second moments are estimated for each sampler, taken over 20 independent simulations. Each sampler was run for 50,000 iterations (after burn-in), and the specific tuning details for the temperature ladder of PT and the energy rings for EE are given in Kou et al. (2006). Unlike the competing methods, the HMC and pseudo-extended (PE) HMC samplers are automatically tuned using the NUTS algorithm (Hoffman and Gelman, 2014). Furthermore, while not reported here, the HMC samplers produce Markov chains with significantly reduced autocorrelation compared to the EE and RAM samplers, which both rely on random-walk updates. We note from Table 1 that increasing the number of pseudo-samples leads to improved estimates at an increased computational cost. In the Supplementary Material (Section B), we show that, when accounting for computational cost, only a small number of pseudo-samples is required.

Scenario (a)
Method  E(X1)  E(X2)  E(X1^2)  E(X2^2)

Truth  4.478  4.905  25.605  33.920 
RAM  4.471 (0.091)  4.932 (0.101)  25.572 (0.900)  33.223 (1.100) 
EE  4.502 (0.107)  4.944 (0.139)  25.924 (1.098)  34.476 (1.373) 
PT  4.419 (0.170)  4.879 (0.283)  24.986 (1.713)  33.597 (2.867) 
HMC  3.133 (2.857)  1.741 (0.796)  18.002 (27.201)  3.679 (2.226) 
PE (N=2)  4.383 (0.042)  4.913 (0.101)  24.585 (0.434)  34.038 (1.000) 
PE (N=5)  4.469 (0.039)  4.893 (0.049)  25.553 (0.374)  33.834 (0.437) 
PE (N=10)  4.467 (0.027)  4.903 (0.027)  25.479 (0.251)  33.907 (0.234) 
PE (N=20)  4.477 (0.017)  4.910 (0.020)  25.576 (0.145)  33.945 (0.213) 
Scenario (b)
Method  E(X1)  E(X2)  E(X1^2)  E(X2^2)
Truth  4.688  5.030  25.558  31.378 
RAM  4.673 (0.026)  5.029 (0.035)  25.508 (0.263)  31.456 (0.334) 
EE  4.699 (0.072)  5.037 (0.086)  25.693 (0.739)  31.433 (0.839) 
PT  4.709 (0.116)  5.001 (0.134)  25.813 (1.122)  31.105 (1.186) 
HMC  4.460 (0.151)  4.843 (0.474)  22.811 (1.481)  28.691 (3.978) 
PE (N=2)  4.666 (0.050)  5.054 (0.077)  25.572 (0.461)  31.922 (0.669) 
PE (N=5)  4.675 (0.015)  5.033 (0.021)  25.629 (0.167)  31.663 (0.220) 
PE (N=10)  4.670 (0.010)  5.026 (0.016)  25.598 (0.087)  31.646 (0.169) 
PE (N=20)  4.674 (0.013)  5.023 (0.013)  25.628 (0.134)  31.577 (0.121) 
4.2 Boltzmann machine relaxations
A challenging inference problem from the statistical physics literature is sampling from a Boltzmann machine distribution (Jordan et al., 1999). The Boltzmann machine distribution is defined on the binary space s ∈ {−1, 1}^{d_b} with mass function

p(s) = (1/Z_b) exp( (1/2) s^⊤ W s + s^⊤ b ),   (9)

where W is a real symmetric matrix and, together with the bias vector b, constitutes the model parameters. Sampling from this distribution typically requires Gibbs steps (Geman and Geman, 1984), which tend to mix very poorly as the states can be strongly correlated when the Boltzmann machine has high levels of connectivity (Salakhutdinov, 2010).

HMC methods have been shown to perform significantly better than Gibbs sampling when the states of the target distribution are highly correlated (Girolami and Calderhead, 2011). Unfortunately, HMC is generally restricted to sampling from continuous state spaces. Using the Gaussian integral trick (Hertz et al., 1991), we can introduce auxiliary variables x ∈ ℝ^d and transform (9) into a continuous relaxation from which to sample (see Section C in the Supplementary Material).
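The Gibbs conditional for a single ±1 spin follows directly from (9): flipping s_i changes the exponent by 2(W_i · s + b_i), giving a logistic conditional. A minimal sketch, assuming a zero diagonal on W:

```python
import numpy as np

def gibbs_sweep(s, W, b, rng):
    """One systematic-scan Gibbs sweep for p(s) ∝ exp(s'Ws/2 + s'b)
    on s ∈ {-1, +1}^d, with W symmetric and zero on the diagonal:
    p(s_i = +1 | s_-i) = sigmoid(2 * (W[i] @ s + b[i]))."""
    for i in range(len(s)):
        a = 2.0 * (W[i] @ s + b[i])   # W[i, i] = 0, so s[i] drops out
        prob_plus = 1.0 / (1.0 + np.exp(-a))
        s[i] = 1.0 if rng.uniform() < prob_plus else -1.0
    return s
```

When the couplings in W are strong, the conditionals become nearly deterministic given the neighboring spins, which is exactly the slow-mixing behavior described above.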
Following the setup of Graham and Storkey (2017), the bias parameters b are sampled from a Gaussian distribution, and the weight matrix W is generated by first sampling a random orthogonal matrix (Stewart, 1980) along with a vector of eigenvalues, from which W is constructed, with the diagonal elements adjusted as required by the Gaussian integral trick. These settings have been found empirically to produce highly multimodal distributions.

Under the settings given above, we compare the HMC and pseudo-extended (PE) HMC algorithms against annealed importance sampling (AIS), simulated tempering (ST), and the continuously tempered HMC algorithm of Graham and Storkey (2017) (GS). Setup details for each algorithm are given in Section C of the Supplementary Material.
In Figure 4, samples drawn using each algorithm are plotted alongside independent draws from the Boltzmann relaxation distribution (only computationally feasible in low- to moderate-dimensional settings). Most of the methods, with the exception of HMC, are able to capture the dominant modes of the target. The pseudo-extended method provides an accurate approximation of the target landscape, in particular the areas of low density. In Section C of the Supplementary Material we analytically derive the first two moments of the Boltzmann distribution, and in Figure 8 we give the root mean squared error of the moment approximations. These results support the conclusion seen visually in Figure 4, that better exploration of the target space leads to improved estimation of integrals of interest. Further simulations in the Supplementary Material show the effect of varying the number of pseudo-samples, which again supports the conclusion from the mixture of Gaussians example (Section 4.1) that it is sensible to keep N small.
4.3 Sparse logistic regression with horseshoe priors
In this section, we apply the pseudo-extended framework to the problem of sparse Bayesian inference with horseshoe priors. This is a common problem in statistics and machine learning, where the number of parameters to estimate is much larger than the size of the available data used to fit the model. Taking a Bayesian approach, we can use shrinkage priors to shrink model parameters, in this case the regression coefficients, towards zero and prevent the model from over-fitting to the data. There is a range of shrinkage priors presented in the literature (Griffin and Brown, 2013); here we use the horseshoe prior (Carvalho et al., 2010), in particular the regularized horseshoe proposed by Piironen and Vehtari (2017). From a sampling perspective, sparse Bayesian inference can be challenging as the posterior distributions are naturally multimodal, with a spike at zero (indicating that a variable is inactive) and some posterior mass centered away from zero.

Following Piironen and Vehtari (2017), we apply the regularized horseshoe prior to a logistic regression model (see Section D of the Supplementary Material for full details). We apply this model to three real-world datasets using microarray data for cancer classification (see Piironen and Vehtari (2017) for further details regarding the data). We compare the pseudo-extended HMC algorithm against standard HMC and give the log-predictive density on a held-out test dataset in Figure 6. In order to ensure a fair comparison between HMC and pseudo-extended HMC, we run HMC for 10,000 iterations and reduce the number of iterations of the pseudo-extended algorithms (with N = 2 and N = 5) to give equal total computational cost. The results show that, for two out of three datasets, there is an improvement from using the pseudo-extended method. The strong performance of standard HMC is not surprising in this setting, as the posterior density plots (given in the Supplementary Material) show that the posterior modes are close together and, as seen in scenario (b) of Section 4.1, the HMC sampler can usually locate and traverse between local modes.
5 Discussion
We have introduced the pseudo-extended method as an approach for augmenting the target distribution for MCMC sampling. We have shown that the pseudo-extended method can be used to sample from multimodal distributions, a challenging scenario for standard MCMC algorithms, and that it does not require prior knowledge of where, or how many, modes there are in the target. We have combined the pseudo-extended method with the Hamiltonian Monte Carlo algorithm to create an efficient MCMC sampling tool and have shown that a sensible choice for the instrumental distribution is a tempered version of the target. However, there are many alternative proposals that could also be used to potentially extend the pseudo-extended technique beyond sampling from multimodal posteriors. Further work investigating these alternatives is ongoing.
Acknowledgements
Christopher Nemeth is supported by EPSRC grants EP/S00159X/1 and EP/R01860X/1.
References
 Andrieu et al. (2010) Andrieu, C., Doucet, A., and Holenstein, R. (2010). Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342.
 Andrieu and Roberts (2009) Andrieu, C. and Roberts, G. O. (2009). The pseudomarginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37:697–725.
 Beaumont (2003) Beaumont, M. (2003). Estimation of population growth or decline in genetically monitored populations. Genetics, 164(3):1139–60.
 Beskos et al. (2013) Beskos, A., Pillai, N., Roberts, G., Sanz-Serna, J. M., and Stuart, A. (2013). Optimal tuning of the hybrid Monte Carlo algorithm. Bernoulli, 19(5A):1501–1534.
 Bishop (2006) Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
 Brooks et al. (2011) Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo. CRC Press.

 Calderhead and Girolami (2009) Calderhead, B. and Girolami, M. (2009). Estimating Bayes factors via thermodynamic integration and population MCMC. Computational Statistics and Data Analysis, 53(12):4028–4045.
 Carpenter et al. (2017) Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1):1–32.
 Carvalho et al. (2010) Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480.
 Geman and Geman (1984) Geman, S. and Geman, D. (1984). Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Trans Pattern Analysis and Machine Intelligence, 6(6):721–741.
 Geyer (1991) Geyer, C. J. (1991). Markov chain Monte Carlo maximum likelihood. In Computing Science and Statistics: Proc. 23rd Symp. on the Interface, pages 156–163. Fairfax.
 Girolami and Calderhead (2011) Girolami, M. and Calderhead, B. (2011). Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214.

 Graham and Storkey (2017) Graham, M. M. and Storkey, A. J. (2017). Continuously tempered Hamiltonian Monte Carlo. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence, pages 1–12.
 Green et al. (2015) Green, P. J., Łatuszyński, K., Pereyra, M., and Robert, C. P. (2015). Bayesian computation: a perspective on the current state, and sampling backwards and forwards. arXiv preprint arXiv:1502.01148.
 Griffin and Brown (2013) Griffin, J. E. and Brown, P. J. (2013). Some priors for sparse regression modelling. Bayesian Analysis, 8(3):691–702.
 Haario et al. (1999) Haario, H., Saksman, E., and Tamminen, J. (1999). Adaptive proposal distribution for random walk Metropolis algorithm. Computational Statistics, 14(3):375.
 Hertz et al. (1991) Hertz, J. A., Krogh, A. S., and Palmer, R. G. (1991). Introduction to the theory of neural computation, volume 1. Basic Books.
 Hoffman and Gelman (2014) Hoffman, M. and Gelman, A. (2014). The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15:1593–1623.
 Jasra et al. (2007) Jasra, A., Stephens, D. A., and Holmes, C. C. (2007). On population-based simulation for static inference. Statistics and Computing, 17(3):263–279.
 Jordan et al. (1999) Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233.
 Kou et al. (2006) Kou, S., Zhou, Q., and Wong, W. H. (2006). Equi-energy sampler with applications in statistical inference and statistical mechanics. Annals of Statistics, 34(4):1581–1619.
 Lindsten and Doucet (2016) Lindsten, F. and Doucet, A. (2016). Pseudo-marginal Hamiltonian Monte Carlo. arXiv preprint arXiv:1607.02516.
 Marinari and Parisi (1992) Marinari, E. and Parisi, G. (1992). Simulated Tempering: a New Monte Carlo Scheme. Europhysics Letters, 19(6):451–458.
 McLachlan and Peel (2000) McLachlan, G. J. and Peel, D. (2000). Finite mixture models. Wiley Series in Probability and Statistics, New York.
 Murray and Graham (2016) Murray, I. and Graham, M. M. (2016). Pseudo-marginal slice sampling. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), Cadiz, Spain.
 Neal (2001) Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11:125–139.
 Neal (2003) Neal, R. M. (2003). Slice sampling. Annals of Statistics, 31(3):743–748.
 Neal (2010) Neal, R. M. (2010). MCMC Using Hamiltonian Dynamics. In Handbook of Markov Chain Monte Carlo (Chapman & Hall/CRC Handbooks of Modern Statistical Methods), pages 113–162.
 Neal (2012) Neal, R. M. (2012). Bayesian learning for neural networks, volume 118. Springer Science & Business Media.
 Nesterov (2009) Nesterov, Y. (2009). Primaldual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259.
 Piironen and Vehtari (2017) Piironen, J. and Vehtari, A. (2017). Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2):5018–5051.
 Roberts and Rosenthal (2004) Roberts, G. and Rosenthal, J. S. (2004). General state space Markov chains and MCMC algorithms. Probability Surveys, 1:20–71.
 Roberts and Tweedie (1996) Roberts, G. and Tweedie, R. (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363.
 Roberts et al. (1997) Roberts, G. O., Gelman, A., and Gilks, W. (1997). Weak Convergence and Optimal Scaling of the Random Walk Metropolis Algorithms. The Annals of Applied Probability, 7(1):110–120.
 Roberts and Rosenthal (1998) Roberts, G. O. and Rosenthal, J. S. (1998). Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255–268.
 Rue et al. (2009) Rue, H., Martino, S., and Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):319–392.
 Salakhutdinov (2010) Salakhutdinov, R. (2010). Learning Deep Boltzmann Machines using Adaptive MCMC. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), volume 10, pages 943–950.
 Sejdinovic et al. (2014) Sejdinovic, D., Strathmann, H., Garcia, M. L., Andrieu, C., and Gretton, A. (2014). Kernel Adaptive Metropolis-Hastings. In International Conference on Machine Learning, volume 31, pages 1665–1673.
 Stewart (1980) Stewart, G. W. (1980). The Efficient Generation of Random Orthogonal Matrices with an Application to Condition Estimators. SIAM Journal of Numerical Analysis, 17(3):403–409.
 Tak et al. (2016) Tak, H., Meng, X.-L., and van Dyk, D. A. (2016). A repulsive-attractive Metropolis algorithm for multimodality. arXiv preprint arXiv:1601.05633.
 Zhang et al. (2012) Zhang, Y., Sutton, C., Storkey, A., and Ghahramani, Z. (2012). Continuous Relaxations for Discrete Hamiltonian Monte Carlo. In Advances in Neural Information Processing Systems 25, pages 3194–3202.
Appendix A Proof of Theorem 2.1
We start by assuming that the pseudosamples $x_{1:N}$ are distributed according to the extended target $\pi^N$. Assuming there exists a measurable function $f$, we define the expectation of the function over the extended target as
$$\mathbb{E}_{\pi^N}\left[\sum_{i=1}^{N} \frac{\gamma(x_i)/q(x_i)}{\sum_{j=1}^{N} \gamma(x_j)/q(x_j)}\, f(x_i)\right],$$
where $\gamma(\cdot)$ is the unnormalized target density (1) and $q(\cdot)$ is the instrumental distribution (discussed in Section 2). Using the density for the pseudoextended target (2), it follows that this weighted expectation coincides with $\mathbb{E}_{\pi}[f(x)]$, the expectation of $f$ under the original target.
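This weighting identity can be checked numerically with a standard self-normalized importance-sampling calculation. The sketch below is purely illustrative: the bimodal target, the instrumental distribution $q = \mathcal{N}(0, 5^2)$, and the sample size are all assumed choices, not the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized bimodal target: gamma(x) proportional to N(x; -3, 1) + N(x; 3, 1).
def log_gamma(x):
    return np.logaddexp(-0.5 * (x + 3.0) ** 2, -0.5 * (x - 3.0) ** 2)

# Instrumental distribution q = N(0, 5^2), wide enough to cover both modes.
def log_q(x):
    return -0.5 * (x / 5.0) ** 2 - np.log(5.0 * np.sqrt(2.0 * np.pi))

x = rng.normal(0.0, 5.0, size=200_000)

# Self-normalized weights w_i proportional to gamma(x_i) / q(x_i),
# computed on the log scale for numerical stability.
log_w = log_gamma(x) - log_q(x)
w = np.exp(log_w - log_w.max())
w /= w.sum()

est_mean = np.sum(w * x)         # true E[x] = 0 by symmetry
est_second = np.sum(w * x ** 2)  # true E[x^2] = 1 + 3^2 = 10
print(est_mean, est_second)
```

The weighted estimates recover the first two moments of the bimodal target even though no sample was drawn from it directly.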
Appendix B Mixture of Gaussians
The pseudoextended sampler with tempered proposal (Section 3) performs well in both scenarios, where the modes are close or far apart. Even for the smallest number of pseudosamples considered, the pseudoextended HMC sampler performs as well as the competing methods. Increasing the number of pseudosamples leads to a decrease in the standard deviation of the moment estimates; however, it also increases the overall computational cost of the pseudoextended sampler. Figure 7 measures the cost of the pseudoextended sampler as the average mean squared error (over 20 runs) multiplied by the computational time. From the figure we see that, when minimizing the error relative to computational cost, the optimal number of pseudosamples under both scenarios is between 2 and 5. Figure 7 also suggests that the best number of pseudosamples may be problem specific: in scenario (a), where the modes are well-separated, increasing the number of pseudosamples beyond 5 does not significantly increase the cost of the sampler, whereas in scenario (b), where the mixture components are easier to explore, using more than 5 pseudosamples introduces a significant increase in computational cost without a proportional reduction in error.
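The error-versus-cost diagnostic used in Figure 7 can be sketched as follows; all numbers below are made up for illustration and are not the values from the experiments.

```python
# Hypothetical results: mse[i] is the average mean squared error when using
# n_pseudo[i] pseudosamples, and seconds[i] the corresponding run time.
n_pseudo = [2, 5, 10, 20]
mse      = [0.040, 0.018, 0.012, 0.010]
seconds  = [1.0, 2.1, 4.3, 8.8]

# The diagnostic is the product of error and computational time; the best
# number of pseudosamples minimizes this product rather than the raw error.
cost = [m * t for m, t in zip(mse, seconds)]
best = n_pseudo[cost.index(min(cost))]
print(best)
```

With these illustrative numbers, raw error keeps falling as the number of pseudosamples grows, but the error-times-time product is minimized at an intermediate value, mirroring the behavior reported above.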
Appendix C Boltzmann machine relaxation derivations
The Boltzmann machine distribution is defined on the binary space $s \in \{-1, 1\}^{d_b}$ with mass function
$$p_b(s) = \frac{1}{Z_b} \exp\left(\frac{1}{2} s^\top W s + s^\top b\right), \qquad (10)$$
where $b \in \mathbb{R}^{d_b}$ and the real symmetric matrix $W$ are the model parameters.
Following the approach of Graham and Storkey (2017) and Zhang et al. (2012), we convert the problem of sampling on the discrete space to a continuous problem using the Gaussian integral trick (Hertz et al., 1991). We introduce the auxiliary variable $x \in \mathbb{R}^{d_r}$ which follows a conditional Gaussian distribution,
$$x \mid s \sim \mathcal{N}\left(Q^\top s,\, I_{d_r}\right), \qquad (11)$$
where $Q$ is a $d_b \times d_r$ matrix such that $Q Q^\top = W + D$ and $D$ is a diagonal matrix chosen to ensure that $W + D$ is a positive semi-definite matrix.
Combining (10) and (11), the joint distribution is
$$p(x, s) = \frac{(2\pi)^{-d_r/2}}{Z_b} \exp\left(-\frac{1}{2}\operatorname{tr}(D)\right) \exp\left(-\frac{1}{2} x^\top x\right) \exp\left(\sum_{i=1}^{d_b} s_i \left(q_i^\top x + b_i\right)\right),$$
where $q_1, \ldots, q_{d_b}$ are the rows of $Q$. The key feature of this trick is that the $\frac{1}{2} s^\top (W + D)\, s$ terms cancel. On the joint space the binary variables are now decoupled and can be summed over independently to give the marginal density
$$p_r(x) = \frac{2^{d_b}}{Z_b} \exp\left(-\frac{1}{2}\operatorname{tr}(D)\right) (2\pi)^{-d_r/2} \exp\left(-\frac{1}{2} x^\top x\right) \prod_{i=1}^{d_b} \cosh\left(q_i^\top x + b_i\right),$$
which is referred to as the Boltzmann machine relaxation density; it is a Gaussian mixture with $2^{d_b}$ components.
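As a numerical sanity check of the relaxation, the following numpy sketch compares the Gaussian-mixture form of the relaxation density against the product-of-cosh form at an arbitrary point. It assumes a small $d_b$ so the binary states can be enumerated, and constructs $Q$ from an eigendecomposition square root; these are implementation choices for illustration, not prescribed by the method.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
d_b = 3

# Random Boltzmann machine parameters: symmetric W (zero diagonal), bias b.
A = rng.normal(size=(d_b, d_b))
W = (A + A.T) / 2.0
np.fill_diagonal(W, 0.0)
b = rng.normal(size=d_b)

# Diagonal D chosen so that W + D is positive semi-definite.
eigmin = np.linalg.eigvalsh(W).min()
D = np.eye(d_b) * (max(0.0, -eigmin) + 1e-6)

# Q with Q Q^T = W + D (rows q_i), built from an eigendecomposition.
lam, V = np.linalg.eigh(W + D)
Q = V @ np.diag(np.sqrt(np.clip(lam, 0.0, None)))
d_r = d_b

# Enumerate the 2^{d_b} binary states to get p_b(s) exactly.
states = np.array(list(product([-1.0, 1.0], repeat=d_b)))
log_unnorm = np.array([0.5 * s @ W @ s + s @ b for s in states])
Z_b = np.exp(log_unnorm).sum()
p_b = np.exp(log_unnorm) / Z_b

x = rng.normal(size=d_r)  # arbitrary evaluation point

# Mixture form: sum over states of p_b(s) * N(x; Q^T s, I).
def gauss(x, mu):
    return np.exp(-0.5 * np.sum((x - mu) ** 2)) / (2.0 * np.pi) ** (len(x) / 2)

p_mix = sum(p_b[k] * gauss(x, Q.T @ states[k]) for k in range(len(states)))

# Product-of-cosh form of the relaxation density.
p_cosh = (2.0 ** d_b / Z_b) * np.exp(-0.5 * np.trace(D)) \
    * (2.0 * np.pi) ** (-d_r / 2) * np.exp(-0.5 * x @ x) \
    * np.prod(np.cosh(Q @ x + b))

print(p_mix, p_cosh)
```

The two evaluations agree to floating-point precision, confirming that the binary variables can indeed be summed out analytically.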
We can rearrange the terms in the Boltzmann machine relaxation density to match our generic target $\pi(x) = \gamma(x)/Z$, (1), where
$$\gamma(x) = \exp\left(-\frac{1}{2} x^\top x\right) \prod_{i=1}^{d_b} \cosh\left(q_i^\top x + b_i\right)$$
and the normalizing constant is directly related to that of the Boltzmann machine distribution,
$$Z = 2^{-d_b} (2\pi)^{d_r/2} \exp\left(\frac{1}{2}\operatorname{tr}(D)\right) Z_b.$$
Converting a discrete problem onto the continuous space does not automatically guarantee that sampling from the continuous space will be any easier than on the discrete space. In fact, if the elements of $W$ are large, then on the relaxed space the modes of the mixture components will be far apart, making it difficult for an MCMC sampler to explore the target. Following Zhang et al. (2012), for the experiments in this paper we select $D$ by minimizing the maximum eigenvalue of $W + D$, which has the effect of decreasing the separation of the mixture components on the relaxed space.
For the MCMC simulation comparison given in Section 4.2, we compare our pseudoextended (PE) method against HMC, annealed importance sampling (AIS), simulated tempering (ST) and the Graham and Storkey (2017) (GS) algorithm. We consider a setting where $d_b$ is small enough that we can draw independent samples from the Boltzmann distribution (9); if $d_b$ were large, this would not be possible. We run each of the competing algorithms for 10,000 iterations and, for PE, we test a range of values for the number of pseudosamples (see Figure 8), but in Figure 4 we only plot the results for a single choice of $N$. For ST and AIS, both of which require a temperature ladder, we used a ladder of length 1,000 with equally-spaced intervals.
Finally, the first two moments of the relaxed distribution can be directly related to their equivalent moments for the Boltzmann machine distribution by
$$\mathbb{E}_{p_r}[x] = Q^\top\, \mathbb{E}_{p_b}[s], \qquad \mathbb{E}_{p_r}\left[x x^\top\right] = Q^\top\, \mathbb{E}_{p_b}\left[s s^\top\right] Q + I_{d_r}.$$
As noted in the mixture of Gaussians example (Section 4.1), increasing the number of pseudosamples improves the accuracy of the pseudoextended method, but at a computational cost which grows linearly with $N$. When choosing the number of pseudosamples, it is sensible for $N$ to increase linearly with the dimension of the target. However, taking into account the computational cost (Figure 8), a significantly smaller number of pseudosamples can be used while still achieving a high level of sampling accuracy.
Appendix D Sparse logistic regression plots
We consider the following logistic regression model for data $\{y_i, x_i\}_{i=1}^n$, where
$$p(y_i = 1 \mid x_i, \beta) = \frac{1}{1 + \exp\left(-x_i^\top \beta\right)}$$
and $x_i \in \mathbb{R}^p$ are covariates. In this setting our parameter of interest is the vector of model coefficients $\beta$, and we define a regularized horseshoe prior (Piironen and Vehtari, 2017) on each of the coefficients as
$$\beta_j \mid \lambda_j, \tau, c \sim \mathcal{N}\left(0,\, \tau^2 \tilde{\lambda}_j^2\right), \qquad \tilde{\lambda}_j^2 = \frac{c^2 \lambda_j^2}{c^2 + \tau^2 \lambda_j^2}, \qquad \lambda_j \sim C^{+}(0, 1),$$
where $c$ is a constant (for which we follow Piironen and Vehtari (2017) in choosing a value) and $C^{+}(0, 1)$ is a half-Cauchy distribution. To give an indication of how this prior behaves, when $\tau^2 \lambda_j^2 \ll c^2$, the coefficient $\beta_j$ is close to zero and the regularized horseshoe prior (above) approaches the original horseshoe (Carvalho et al., 2010). Alternatively, when $\tau^2 \lambda_j^2 \gg c^2$, the coefficient moves away from zero and the regularizing feature of the prior means that it approaches $\mathcal{N}(0, c^2)$.

Figure 9 gives the posterior density plots for a random subset of the model parameters for each dataset. We can see from these plots that the posteriors are mostly unimodal with some posterior mass centered at zero. This is a common trait of the horseshoe and similar sparsity-inducing priors, where mass concentrated at zero indicates that the variable is turned off, while positive posterior mass elsewhere indicates an active variable. We also note that, unlike the examples given in Sections B and 4.2, the posterior modes are close together. For this reason, it is unsurprising that the HMC algorithm is able to accurately explore the posterior space and, as a result, produce accurate log-predictive estimates (as seen in Figure 6).
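The two shrinkage regimes of the regularized horseshoe can be illustrated with a short sketch. The effective-scale formula follows Piironen and Vehtari (2017); the helper function and the particular values of $c$, $\tau$ and $\lambda_j$ below are illustrative choices, not values used in the experiments.

```python
import numpy as np

def reg_horseshoe_scale(lam, tau, c):
    """Effective prior standard deviation of a coefficient under the
    regularized horseshoe: beta_j ~ N(0, tau^2 * lam_tilde_j^2), where
    lam_tilde_j^2 = c^2 lam_j^2 / (c^2 + tau^2 lam_j^2)."""
    lam_tilde_sq = c ** 2 * lam ** 2 / (c ** 2 + tau ** 2 * lam ** 2)
    return tau * np.sqrt(lam_tilde_sq)

c, tau = 2.0, 0.1

# Small local scale (tau^2 lam^2 << c^2): behaves like the original
# horseshoe, with effective s.d. approximately tau * lam = 0.05.
small = reg_horseshoe_scale(0.5, tau, c)

# Very large local scale (tau^2 lam^2 >> c^2): the prior is regularized
# towards N(0, c^2), so the effective s.d. tends to c = 2.
large = reg_horseshoe_scale(1e6, tau, c)
print(small, large)
```

This makes the limiting behavior described above concrete: weakly supported coefficients are shrunk as under the original horseshoe, while strongly supported coefficients are capped at scale $c$ rather than left unregularized.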
Appendix E Challenging targets
The examples considered so far have focused on using the pseudoextended framework to sample from multimodal targets. In this example, as a preliminary piece of work, we consider applying the pseudoextended method beyond the multimodal setting. More generally, the pseudoextended method can be applied to sample from target distributions which are challenging for standard MCMC samplers. We consider two illustrative examples to highlight the improvement of pseudoextended HMC over standard HMC. The first is the warped Gaussian target, popularly known as the banana-shaped target, given in Haario et al. (1999),
$$\pi(x) \propto \exp\left(-\frac{x_1^2}{2\sigma^2} - \frac{1}{2}\left(x_2 + b\left(x_1^2 - \sigma^2\right)\right)^2\right),$$
where the scale $\sigma$ and warping parameter $b$ are fixed for our experiments. Our second example is the flower target from Sejdinovic et al. (2014),
$$\pi(x) \propto \exp\left(-\frac{1}{2\sigma^2}\left(\sqrt{x_1^2 + x_2^2} - r - A \cos\left(\omega \operatorname{atan2}(x_2, x_1)\right)\right)^2\right),$$
where the distribution represents a ring of radius $r$ with a periodic perturbation of amplitude $A$ and frequency $\omega$; the values of $r$, $A$, $\omega$ and $\sigma$ are fixed for our experiments.
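Both targets can be sketched as unnormalized log-densities, which is all an MCMC sampler requires. The parameter values below (the Gaussian scale and warping for the banana, and the radius, amplitude, frequency and noise scale for the flower) are illustrative assumptions rather than the experimental settings.

```python
import numpy as np

# Warped ("banana") Gaussian of Haario et al. (1999): a Gaussian
# N(0, diag(sigma1^2, 1)) warped by x2 -> x2 + bw * (x1^2 - sigma1^2).
# sigma1 and bw are illustrative values.
def log_banana(x, sigma1=10.0, bw=0.03):
    y2 = x[1] + bw * (x[0] ** 2 - sigma1 ** 2)
    return -0.5 * (x[0] / sigma1) ** 2 - 0.5 * y2 ** 2

# Flower target of Sejdinovic et al. (2014): a ring of radius r with a
# periodic perturbation of amplitude A and frequency omega; the parameter
# values here are again only illustrative.
def log_flower(x, r=10.0, A=6.0, omega=6.0, sigma=1.0):
    radius = np.hypot(x[0], x[1])
    angle = np.arctan2(x[1], x[0])
    return -0.5 * ((radius - r - A * np.cos(omega * angle)) / sigma) ** 2

# Both return unnormalized log-densities, suitable as MCMC targets.
print(log_banana(np.zeros(2)), log_flower(np.array([16.0, 0.0])))
```

The point $(16, 0)$ lies on the outermost "petal" of the flower (radius $r + A$ at angle zero), so its log-density is at the target's maximum of zero.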
We use the NUTS (Hoffman and Gelman, 2014) tuning algorithm for HMC, as implemented within STAN (Carpenter et al., 2017), for both standard HMC and pseudoextended HMC (with a fixed number of pseudosamples). Both algorithms are run for 10,000 iterations, with the first half of the chain removed as burn-in. Figures 10(a) and 10(b) show the trace plots for both targets overlaid with a contour plot. For both targets the pseudoextended sampler completely explores the target space, whereas the standard HMC algorithm struggles to fully explore the flower target within the fixed number of iterations and appears to under-explore the tails of the banana target. The lines given in the plot show the trajectory of the Markov chain; it is particularly notable that the pseudoextended method allows the HMC sampler to make large jumps across the target space, which reduces the autocorrelation of the Markov chain.