Plateau Proposal Distributions for Adaptive Component-wise Multiple-Try Metropolis

09/06/2019
by   F. Din-Houn Lau, et al.

Markov chain Monte Carlo (MCMC) methods are sampling methods that have become a commonly used tool in statistics, for example to perform Monte Carlo integration. As a consequence of the increase in computational power, many variations of MCMC methods exist for generating samples from arbitrary, possibly complex, target distributions. The performance of an MCMC method is predominantly governed by the choice of the so-called proposal distribution. In this paper, we introduce a new type of proposal distribution for use in MCMC methods that operates component-wise and with multiple tries per iteration. Specifically, the distributions in this novel class, called Plateau distributions, do not overlap, ensuring that the multiple tries are drawn from different parts of the state space. We demonstrate in numerical simulations that this new type of proposal outperforms competitors and efficiently explores the state space.


1 Introduction

Markov chain Monte Carlo (MCMC) methods are essentially used to perform Monte Carlo integration, which has become a standard statistical tool. Specifically, MCMC methods produce samples from a target distribution $\pi$ by using an ergodic Markov chain with stationary distribution $\pi$. Typically, MCMC methods are used when it is difficult to sample from the target distribution directly, e.g. when the normalisation constant is unknown. There are many ways to construct this Markov chain, which has led to many variations of MCMC methods; see, e.g., Brooks et al. (2011).

The classic MCMC method is the Metropolis-Hastings algorithm (Metropolis et al., 1953). At each iteration, the Metropolis-Hastings algorithm is designed to update the entire current state (i.e. all components of the random vector generated at the previous iteration) at once. However, updating individual components, or subsets of components, is possible. Indeed, this type of component-wise updating was initially proposed in Metropolis et al. (1953), but did not receive much attention at first. In this paper, we focus on updating individual components and refer to this approach as the single-component Metropolis-Hastings algorithm.

Another variant of MCMC sampling is the multiple-try method (Liu et al., 2000), where several proposals, or trials, are suggested at each iteration, as opposed to a single proposal. The motivation behind the multiple-try method is that more of the space is explored, at the expense of an increased computational cost (i.e. proposal generation and evaluation of the acceptance criterion).

Recently, in Yang et al. (2019) the authors introduced a component-wise, multiple-try MCMC method with Gaussian proposals for each trial, where each uni-variate proposal has a different variance. In this work, we introduce a new class of proposal distributions for use in a component-wise, multiple-try MCMC method. These proposals, called Plateau distributions, do not overlap, in order to exploit the multiple-try nature of the method. Indeed, by using proposals that do not overlap for each trial, the Markov chain is forced to explore different parts of the state-space. Conversely, using proposals that overlap, e.g. Gaussians with different variances, can lead to an inefficient algorithm, as the trials tend to come from a similar region of the state-space.

Using Plateau distributions for the trial distributions leads to better exploration of the space than standard Gaussian proposals. The idea is intuitive, easy to implement, and leads to good results in the sense of exploring the state space. Moreover, using the Plateau proposals leads to a reversible Markov chain with the target distribution as its invariant distribution; see, e.g., Yang et al. (2019).

As is common for proposals used in MCMC methods, the Plateau distributions have parameters that need to be selected and tuned with respect to the target distribution to obtain an effective MCMC algorithm. Here, we propose an adaptive procedure that can be used to automatically tune these parameters as the MCMC progresses. Adaptation of MCMC methods typically entails tuning the parameter(s) of a class of proposal distributions, e.g. the variance parameter of a Normal distribution, in order to improve the convergence properties of the Markov chain. For instance, in Haario et al. (2001), a Gaussian proposal distribution is used whose variance (or covariance) is updated using the previously generated states of the Markov chain.

The goal of this work is thus to introduce a general class of proposal distributions that leads to good results for a variety of target distributions by combining improved state space exploration with adaptation to the target distribution. The proposed new class of proposal distributions is for use in a component-wise multiple-try MCMC method.

The remainder of this paper is organised as follows. In Section 2.1 a generic component-wise multiple-try algorithm is presented. The novel class of Plateau proposals is introduced and discussed in Section 3. In Section 4 we discuss how to adaptively select the parameters of the Plateau proposals and offer a detailed algorithmic description of the complete method. The performance of our new method is then compared with other MCMC methods in Section 5. Finally, a commentary on improvements and a summary are provided in Section 6.

2 Component-wise Update with Multiple-Tries

Let $\pi$ be a probability density function on $\mathbb{R}^d$, where $d \geq 1$. Our main interest is to sample from $\pi$; this is the target distribution. We assume that sampling directly from $\pi$ is difficult or impossible, for example because $\pi$ may only be known up to a multiplicative constant. In order to sample from $\pi$ we use MCMC methods. Specifically, given the current state $x_n$ of the Markov chain at iteration $n$, the Metropolis-Hastings algorithm, one of the simplest MCMC methods, proposes a candidate $y$ for the chain's next state $x_{n+1}$ by drawing a random variable from a distribution with probability density function (PDF) $q(\cdot \mid x_n)$ from which it is easier to sample than from the target distribution. That is, $y \sim q(\cdot \mid x_n)$, where $q(\cdot \mid x_n)$ is the conditional density, given the current value $x_n$, of what is commonly known as the proposal distribution. For example, the random-walk proposal (Metropolis et al., 1953; Hastings, 1970) uses a multivariate normal distribution, which we will write as $\mathcal{N}(x_n, \sigma^2 I_d)$ with given $\sigma^2 > 0$ and $I_d$ being the $d \times d$ identity matrix. Another example is $\mathcal{N}\big(x_n + \tfrac{\sigma^2}{2}\nabla \log \pi(x_n), \sigma^2 I_d\big)$ with fixed $\sigma^2 > 0$, which is the proposal used in the Metropolis-adjusted Langevin algorithm (Roberts and Tweedie, 1996). The realisation of the Markov chain's next state $x_{n+1}$ is then selected by means of an accept-reject procedure such that the resulting Markov chain's stationary distribution is $\pi$.
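As a point of reference for what follows, here is a minimal Python sketch of the random-walk Metropolis-Hastings method just described. The names (`rw_metropolis`, `log_target`) are illustrative assumptions, not code from the paper.

    import numpy as np

    def rw_metropolis(log_target, x0, sigma, n_steps, seed=0):
        # Random-walk Metropolis with N(x, sigma^2 I) proposals.
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float)
        chain = [x.copy()]
        for _ in range(n_steps):
            y = x + sigma * rng.standard_normal(x.shape)  # propose candidate
            # symmetric proposal: accept with probability min(1, pi(y)/pi(x))
            if np.log(rng.uniform()) < log_target(y) - log_target(x):
                x = y
            chain.append(x.copy())
        return np.array(chain)

    # Example: 5,000 steps targeting a standard 2-D Gaussian.
    chain = rw_metropolis(lambda x: -0.5 * float(x @ x), np.zeros(2), 1.0, 5000)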

One of the main difficulties when using an MCMC method is the choice of the proposal distribution. In particular, this choice may significantly affect the properties of the MCMC method, including the speed of convergence to equilibrium and its mixing properties (Rosenthal, 2011). Typically, the proposal distribution is selected from some family of well-known distributions, e.g., the family of normal distributions. It is noteworthy that optimal MCMC performance in high dimensions requires selecting the "scale" of the proposal appropriately (see, e.g., Roberts and Rosenthal, 2001). This scaling problem may become increasingly challenging as the number of dimensions increases.

Instead of proposing a single multivariate candidate $y$ by updating all components of the current state simultaneously via the (global) proposal distribution $q$, it is also possible to split the state space into its individual components (or small groups of components) and propose candidates for each component (or group of components) independently. This local (or projected) approach is intuitive, computationally efficient, and reduces the problem of selecting a multidimensional proposal to that of selecting lower-dimensional proposals that are easier to handle. MCMC methods using these types of updates are called component-wise MCMC methods and may use a potentially different proposal distribution per component of the state (Gilks et al., 1995, Ch. 1). In this work we focus on one-dimensional, component-wise proposals. We emphasise, however, that the ideas that follow are not restricted to one-dimensional components and are in fact general.

Considering independent one-dimensional, component-wise proposals is equivalent to the proposal distribution given $x$ being separable, in the sense that

$$q(y \mid x) = \prod_{i=1}^{d} q_i(y_i \mid x) \qquad (1)$$

for any $x, y \in \mathbb{R}^d$. Here, $q_i(\cdot \mid x)$, with $i = 1, \ldots, d$, denotes the one-dimensional proposal density given $x$ used to draw the candidate $y_i$ for component $i$. That is, the proposed (global) candidate $y = (y_1, \ldots, y_d)$ is obtained by sampling each $y_i \sim q_i(\cdot \mid x)$ mutually independently of any other $y_j$, $j \neq i$. Notice that for $d = 1$ this one-dimensional, component-wise proposal step is identical to that of the standard MCMC method, and hence it can be considered a natural extension to the multivariate case.

While the component-wise candidate proposal (1) is computationally efficient, by construction it does not account for correlations between the components. Consequently, if $\pi$ has highly correlated components, the proposed candidates may not be good representatives of the target distribution (i.e. most candidates will be rejected, resulting in a very low acceptance rate). These "uninformed" candidates may therefore lead to poor state-space exploration and thus to poor performance of the MCMC method. To remedy these defects, we combine component-wise proposals with multiple tries for each component. Specifically, the multiple-try technique works by proposing many candidates from a proposal distribution, rather than just a single one, amongst which the "best" one is selected. Each trial may be proposed from a different proposal distribution. Thus, in combination with component-wise proposals, we independently generate $N$ possible candidates (i.e., trials) for each component $i$. Let $q_i^{(j)}(\cdot \mid x)$, $j = 1, \ldots, N$, denote the proposal PDF of the $j$th trial for the $i$th component given $x$.

2.1 Generic Procedure of Component-wise Multiple-Try Metropolis

A generic component-wise multiple-try algorithm is now described – the full pseudo-code for generating samples from the target distribution is presented in Algorithm 1. Each MCMC step (or iteration) of the algorithm involves drawing multiple trials and then performing an acceptance-rejection step for each component sequentially.

We now describe the intuition behind the steps of the procedure. Suppose that the state of the chain at the beginning of the $n$th iteration is $x = x_n$. For the first component (i.e., $i = 1$), $N$ independent trials, $y_1, \ldots, y_N$, are drawn from $q_1^{(1)}(\cdot \mid x), \ldots, q_1^{(N)}(\cdot \mid x)$. These trials are then weighted according to

$$w_j = \pi\big(x(1, y_j)\big)\, q_1^{(j)}\big(x_1 \mid x(1, y_j)\big)\, \lambda_1^{(j)}\big(x(1, y_j), x\big), \qquad j = 1, \ldots, N,$$

where $x(1, y_j)$ denotes the vector that is identical to $x$ except for its first component, which is replaced by $y_j$, that is $x(1, y_j) = (y_j, x_2, \ldots, x_d)$. The functions $\lambda_i^{(j)}$, with $\lambda_i^{(j)}(u, v) \geq 0$ for any $u, v \in \mathbb{R}^d$, are non-negative functions, symmetric in $u$ and $v$, which are selected by the user. Further, it is required that $\lambda_i^{(j)}(u, v) > 0$ whenever $q_i^{(j)}(v_i \mid u) > 0$. Each trial $y_j$, $j = 1, \ldots, N$, has an associated weight $w_j$. A candidate for the first component of the chain's next state is then randomly selected amongst all trials according to these weights. The selected candidate is then accepted or rejected. The remaining components of the chain's state are updated in order in a similar fashion; Algorithm 1 gives detailed pseudo-code of the corresponding MCMC method.

0:  number of tries $N$; number of MCMC iterations $M$; starting position $x_0$; target distribution $\pi$ (possibly un-normalised); proposal distributions $q_i^{(j)}$, $i = 1, \ldots, d$, $j = 1, \ldots, N$
1:  Let $x = x_0$.
2:  for $n = 0, \ldots, M - 1$ do
3:     for $i = 1, \ldots, d$ do
4:        Propose trials: $y_j \sim q_i^{(j)}(\cdot \mid x)$ for $j = 1, \ldots, N$.
5:        Compute the trial weights $w_j = \pi(x(i, y_j))\, q_i^{(j)}(x_i \mid x(i, y_j))\, \lambda_i^{(j)}(x(i, y_j), x)$, $j = 1, \ldots, N$.
6:        Draw $J$ randomly from $\{1, \ldots, N\}$ with probability proportional to $w_j$.
7:        Draw $x_j^* \sim q_i^{(j)}(\cdot \mid x(i, y_J))$ for $j \neq J$ and let $x_J^* = x_i$.
8:        Let $y = x(i, y_J)$ and compute $\alpha = \min\big\{1, \sum_{j=1}^{N} w_j \big/ \sum_{j=1}^{N} w_j^*\big\}$, where $w_j^* = \pi(y(i, x_j^*))\, q_i^{(j)}(y_i \mid y(i, x_j^*))\, \lambda_i^{(j)}(y(i, x_j^*), y)$.
Draw $U \sim \mathrm{Uniform}(0, 1)$.
9:        if $U \leq \alpha$ then
10:           Accept and set $x = y$.
11:        else
12:           Keep $x$ unchanged.
13:        end if
14:     end for
15:     Set $x_{n+1} = x$.
16:  end for
17:  return $x_1, \ldots, x_M$
Algorithm 1 Generic Component-wise Multiple-Try Metropolis

Specifically, to propose a candidate for the $i$th component of the Markov chain's next state in step 5 of Algorithm 1, each trial $y_j$ for the component at the current state $x$ has an associated weight

$$w_j = \pi\big(x(i, y_j)\big)\, q_i^{(j)}\big(x_i \mid x(i, y_j)\big)\, \lambda_i^{(j)}\big(x(i, y_j), x\big).$$

As described above, a single trial is then randomly selected with probabilities proportional to $w_1, \ldots, w_N$. There are a number of choices for $\lambda_i^{(j)}$ in the multiple-try literature (Liu et al., 2000), such as $\lambda_i^{(j)}(u, v) = 1$ or symmetrised functions of the forward and backward proposal densities. In Yang et al. (2019) the authors suggest using

$$\lambda_i^{(j)}(u, v) = \lVert u - v \rVert^{\alpha}, \qquad \alpha > 0, \qquad (2)$$

where the value of $\alpha$ was chosen following a simulation study focusing on large moves in the state space. In simulations not reported here, we found that a different value of $\alpha$ performed best in terms of the mean squared error for a variety of target distributions; consequently, we use (2) with that value of $\alpha$ for the remainder of the paper. We note that it is beyond the scope of this paper to propose a particular form for the class of functions $\lambda_i^{(j)}$, due to the problem-dependent nature of this choice. Instead, we advocate that users perform a trial run with a variety of $\lambda_i^{(j)}$ to determine which is best suited to their application and performance metric.
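To make the procedure concrete, the following is a minimal Python sketch of one sweep of Algorithm 1 under the weight form (2). All names (`cmtm_sweep`, `log_target`, the scalar `sample`/`logpdf` interface of `proposals[i][j]`, `alpha`) are illustrative assumptions, not the authors' code.

    import numpy as np

    def cmtm_sweep(log_target, x, proposals, alpha, rng):
        # One component-wise multiple-try Metropolis sweep (sketch of Algorithm 1).
        # proposals[i][j] has .sample(c) and .logpdf(v, c): the jth trial proposal
        # for component i, conditioned on the current component value c.
        x = np.asarray(x, dtype=float).copy()
        d = len(x)
        for i in range(d):
            N = len(proposals[i])

            def log_w(val, j, base):
                # log of pi(base(i,val)) * q_ij(base_i | base(i,val)) * |val - base_i|^alpha
                z = base.copy(); z[i] = val
                return (log_target(z) + proposals[i][j].logpdf(base[i], val)
                        + alpha * np.log(abs(val - base[i]) + 1e-300))

            trials = [proposals[i][j].sample(x[i]) for j in range(N)]
            lw = np.array([log_w(t, j, x) for j, t in enumerate(trials)])
            w = np.exp(lw - lw.max())
            J = rng.choice(N, p=w / w.sum())
            y = x.copy(); y[i] = trials[J]
            # reference trials for the reverse move; the Jth one is the old component
            refs = [proposals[i][j].sample(y[i]) if j != J else x[i] for j in range(N)]
            lw_ref = np.array([log_w(r, j, y) for j, r in enumerate(refs)])
            w_ref = np.exp(lw_ref - lw.max())   # same shift keeps the ratio exact
            if rng.uniform() < min(1.0, w.sum() / w_ref.sum()):
                x = y                           # accept the candidate component
        return x

Here `proposals[i][j]` could wrap, for example, the Gaussian trial proposals of Yang et al. (2019) or the Plateau proposals introduced in the next section.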

3 Non-Overlapping Proposal Distributions

In principle, any family of proposals can be used in Algorithm 1. However, a careful choice of the proposals can lead to a more efficient algorithm.

Recall that $q_i^{(j)}(\cdot \mid x)$ is the $j$th trial proposal distribution for the $i$th component. As mentioned earlier, the motivation for using multiple tries is to explore a larger region of the state space than is achieved by using a single proposal. Therefore, it would not be beneficial to use proposals that are similar. To illustrate this, suppose that for a fixed component $i$ we use the Gaussian proposal distributions $q_i^{(j)}(\cdot \mid x) = \mathcal{N}(x_i, \sigma_j^2)$ for $j = 1, \ldots, 5$ with known standard deviations $\sigma_1 < \cdots < \sigma_5$, and take $x_i = 0$ without loss of generality. The probability density functions of these proposals are presented in Figure 1.

Figure 1: Probability density functions of 5 Normal distributions with zero mean and standard deviations $\sigma_j$, $j = 1, \ldots, 5$.

As illustrated in Figure 1, these proposals are very similar. Indeed, 95% of $q_i^{(1)}$'s density mass lies within the interval $[-z\sigma_1, z\sigma_1]$ centred around $0$, where $z = \Phi^{-1}(0.975) \approx 1.96$ and $\Phi^{-1}$ is the inverse of the cumulative distribution function of a standard normal. Suppose that we consider $\sigma_2 = 2\sigma_1$ for the second proposal distribution $q_i^{(2)}$; then a draw from $q_i^{(2)}$ falls in this same interval with probability $2\Phi(z/2) - 1 \approx 0.67$, for any $\sigma_1 > 0$. That is, draws from $q_i^{(1)}$ and $q_i^{(2)}$ will be located in the same region with high probability. Similar arguments hold for the wider Gaussian proposals. Thus draws from these Gaussian proposals will tend to be similar, leading to an inefficient use of the multiple-try technique. To avoid sampling similar values for each trial, we seek densities which do not overlap (or overlap only to a small degree).
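A quick numerical check of this overlap claim (a sketch; the 95% level follows the usual convention used above):

    import math

    def norm_cdf(z):
        # standard normal CDF via the error function
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    z = 1.959964  # Phi^{-1}(0.975): 95% of N(0, s^2) lies in [-z*s, z*s]
    for k in range(1, 6):
        p = 2.0 * norm_cdf(z / k) - 1.0  # P(a N(0, (k*s)^2) draw lands in that interval)
        print(f"sd = {k}s: overlap probability = {p:.3f}")
    # k = 2 prints ~0.673: draws from the twice-as-wide Gaussian still land
    # in the narrow proposal's 95% interval about two times out of three.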

Figure 2: (a) PDF of the plateau building block; (b) PDFs of the Plateau trial proposal distributions. The five different proposals each have a different colour.

Specifically, we advocate using proposals of the type illustrated in Figure 2(b). Each trial proposal for each component combines uniform distributions with exponentially decaying tails. Notice that this means the amount of overlap is controlled through how fast the tails decay. Specifically, we first introduce the PDF

$$g_{\delta, \beta}(u) = \frac{1}{K_{\delta, \beta}} \begin{cases} 1, & |u| \leq \delta/2, \\ \exp\{-\beta(|u| - \delta/2)\}, & |u| > \delta/2, \end{cases}$$

where $K_{\delta, \beta} = \delta + 2/\beta$ denotes the normalisation constant. This PDF is illustrated in Figure 2(a). For each component $i$ with given value $x_i$, we then set the PDF of the first trial proposal $q_i^{(1)}(\cdot \mid x)$ to be a plateau of this form centred at $x_i$, and, for $j = 2, \ldots, N$, the PDF of the $j$th trial proposal to be an equal mixture of two such plateaus placed symmetrically either side of $x_i$, so that consecutive plateaus abut without overlapping. The trial proposals shown in Figure 2(b) correspond to particular values of these parameters. We refer to proposals of this type as Plateau proposals, given the shape of their PDFs. The parameter $\delta_1$ controls the width of the central plateau centred at the current state $x_i$. The parameter $\delta$ is the width of the other plateaus. The value $\beta$ controls the decay of the tails on either side of the inner plateaus. The outer tails of the $N$th (outermost) proposal are described by their own decay parameters, one for each side.

To compare with the earlier coverage calculations for the Gaussian proposals: 95% of the density of the innermost Plateau proposal $q_i^{(1)}$ lies in an interval only slightly wider than its central plateau. A draw from the second Plateau proposal $q_i^{(2)}$ falls in this interval with a probability that is smaller by almost a factor of two than the corresponding probability for the overlapping Gaussian proposals, and the probabilities for the wider proposals $q_i^{(3)}, q_i^{(4)}, \ldots$ are smaller still. Thus, the Plateau proposals overlap less than the Gaussian proposals in the example above and, further, the extent of the overlap is controlled by the values of $\delta_1$, $\delta$, $\beta$, and the outer-tail decay.

Note that each Plateau proposal distribution has support on the whole real line. This ensures that the support of the target distribution is included within the support of the proposals; in theory, this allows the Markov chain to explore the entire support of the target distribution. In a practical setting, however, by selecting the value of $\beta$ appropriately, the tails of the distribution decay to zero very quickly, making the inner proposals effectively uniform distributions in view of numerical simulations. Therefore, in practice one could elect to sample from these distributions using direct draws from Uniform distributions to increase computation speed. However, in all the experiments performed in this paper, we sample from the exact Plateau proposal distributions.
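Since the plateau building block is a two-component mixture (a uniform body plus exponential tails), it can be sampled directly. The sketch below uses the parameterisation $g_{\delta, \beta}$ introduced above; the function name and interface are illustrative. The outer trial proposals are obtained by shifting such blocks either side of the current state.

    import numpy as np

    def sample_plateau(center, delta, beta, size=1, rng=None):
        # Draw from the plateau building block centred at `center`: uniform on
        # [center - delta/2, center + delta/2], exponential tails of rate beta.
        # The flat part carries mass delta / (delta + 2/beta).
        rng = np.random.default_rng() if rng is None else rng
        p_body = delta / (delta + 2.0 / beta)
        u = rng.uniform(size=size)
        body = rng.uniform(center - delta / 2, center + delta / 2, size=size)
        sign = rng.choice([-1.0, 1.0], size=size)   # left or right tail
        tail = center + sign * (delta / 2 + rng.exponential(1.0 / beta, size=size))
        return np.where(u < p_body, body, tail)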

Note that the suggested Plateau proposals are very general; the choice of proposal parameters allows for diverse sets of proposals. In the remainder of the paper, we set both outer-tail decay parameters equal to a common value $\beta_{\mathrm{out}}$, so that the decay of the two outer tails of the outermost proposal, $q_i^{(N)}$, is the same.

Using the set of Plateau proposals requires that the 4 parameters $\delta_1$, $\delta$, $\beta$, and $\beta_{\mathrm{out}}$ are selected. The selected values determine the movement of the Markov chain and thus its performance with respect to a particular target distribution. In principle, one could set the values of the parameters in an ad-hoc manner, e.g., via a manual search over the parameter space until a certain acceptance rate is achieved. A more practical approach is to automatically tune the parameters as the algorithm runs. In fact, in Section 4 an adaptation procedure that tunes the plateau widths $\delta_1$ and $\delta$ is introduced, so that only the tail-decay parameters $\beta$ and $\beta_{\mathrm{out}}$, together with initial values for the widths, need to be selected by the user. In simulations we found that a single such choice resulted in good performance for a variety of target distributions and it has therefore been used throughout this work.

4 Adaptation of Plateau Proposals

As is typically the case when working with MCMC methods, proposals involving parameters need to be appropriately tuned for the algorithm to be effective. Instead of manually tuning the parameters, an automated method can be used to adapt the proposals as the MCMC procedure runs. These adaptive methods use the information revealed so far to tune the parameters of the proposals. For instance, Haario et al. (2001) propose updating the covariance of a multivariate Normal proposal using an empirical estimate of the covariance of the target. We now describe a method to adapt the Plateau proposal distributions defined in Section 3.

In the innermost for-loop of Algorithm 1 (steps 4 to 8) with Plateau proposals $q_i^{(1)}, \ldots, q_i^{(N)}$, only one trial is selected (step 6). The selected trial is associated with its generating non-overlapping proposal. Over the Markov chain's iterations, the frequency with which trials are selected from each particular proposal can be monitored, which offers additional insight into the state-space exploration. We advocate a procedure to update the Plateau proposals to avoid two undesirable scenarios: (i) the innermost proposal $q_i^{(1)}$ is selected too often; and (ii) the outermost proposal $q_i^{(N)}$ is selected too often. Scenario (i) suggests that the proposal distributions are too wide, such that trials are regularly being suggested near the previous state of the chain, i.e., the majority of moves occur within the central plateau around the current state's component $x_i$. Conversely, scenario (ii) suggests that the proposal distributions are too narrow, such that trials are regularly being suggested in the "tails", far away from the current position $x_i$.

We suggest the following adaptation of the Plateau proposals to counteract these scenarios. First, adaptation takes place at regular, predefined intervals of length $L$. Within each interval, each proposal is selected some number of times. Let $s_i^{(j)}(n)$ denote the number of times $q_i^{(j)}$ was selected by the $n$th MCMC iteration within the current interval.

For scenario (i), if $s_i^{(1)}(n)/L > \tau_1$ for some threshold $\tau_1$, then the widths of all the plateaus are halved and the proposals are shifted closer to $x_i$ to leave no gaps. More precisely, the Plateau proposal parameters are updated as $\delta_1 \leftarrow \delta_1/2$ and $\delta \leftarrow \delta/2$.

For scenario (ii), if $s_i^{(N)}(n)/L > \tau_N$ for some threshold $\tau_N$, then the plateau widths are doubled and the proposals are shifted away from $x_i$ to leave no gaps. Formally, the Plateau proposal parameters are updated as $\delta_1 \leftarrow 2\delta_1$ and $\delta \leftarrow 2\delta$.

The proposed adaptation is summarised in Algorithm 2, which can be inserted between steps 2 and 3 of the MCMC Algorithm 1. Note that the adaptation operation is performed every $L$ iterations. At iteration $n$ the adaptation is performed with probability $p_n$, where $p_n \to 0$ as $n \to \infty$. This ensures that the amount of adaptation reduces the longer the algorithm runs and thus satisfies the diminishing adaptation condition; see, e.g., Roberts and Rosenthal (2007).

0:  thresholds $\tau_1, \tau_N$; starting plateau widths $\delta_1, \delta$; iteration number $n$; adaptation interval length $L$
1:  if $n = 0$ then
2:     Set $s_i^{(j)} = 0$ for all $i, j$
3:  end if
4:  if $n \bmod L = 0$ and $U \leq p_n$, where $U \sim \mathrm{Uniform}(0, 1)$, then
5:     if $s_i^{(1)}/L > \tau_1$ then
6:        Update: $\delta_1 \leftarrow \delta_1/2$ and $\delta \leftarrow \delta/2$
7:     end if
8:     if $s_i^{(N)}/L > \tau_N$ then
9:        Update: $\delta_1 \leftarrow 2\delta_1$ and $\delta \leftarrow 2\delta$
10:     end if
11:     Reset $s_i^{(j)} = 0$ for all $i, j$.
12:  end if
Algorithm 2 Adaptation of MCMC
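The following Python sketch mirrors Algorithm 2 for a single component. The counter array, threshold names, and the diminishing-adaptation schedule `p_n` are illustrative assumptions; the paper does not prescribe this exact schedule.

    import numpy as np

    def adapt_plateau_widths(counts, delta1, delta, n, L, tau_inner, tau_outer, rng):
        # counts[j] = number of times trial proposal j was selected over the
        # last interval of L iterations for this component.
        p_n = max(0.99 ** n, 1.0 / np.sqrt(n + 1.0))  # p_n -> 0: diminishing adaptation
        if n > 0 and n % L == 0 and rng.uniform() < p_n:
            if counts[0] / L > tau_inner:    # innermost chosen too often: too wide
                delta1, delta = delta1 / 2.0, delta / 2.0
            if counts[-1] / L > tau_outer:   # outermost chosen too often: too narrow
                delta1, delta = 2.0 * delta1, 2.0 * delta
            counts[:] = 0                    # reset the selection counts
        return delta1, delta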

4.1 A note on the convergence of adaptive component-wise multiple-try algorithms

The convergence (in total variation distance) of algorithms of the form of Algorithm 1 combined with the adaptation described in Section 4 has been proven in Yang et al. (2019). The proof of convergence requires that the algorithm, comprising both the MCMC algorithm and the adaptation procedure, satisfies two conditions: diminishing adaptation and containment. As mentioned earlier, diminishing adaptation is satisfied by adapting with probability $p_n \to 0$. For containment to hold, two technical, but not practical, modifications are required – these follow directly from Yang et al. (2019) and are presented as quotations below with altered notation. The first modification is to

“ … choose a very large nonempty compact set $K$ and force $x_n \in K$ for all $n$. Specifically, we reject all proposals outside $K$ (but if $x_n \notin K$, then we still accept/reject by the usual rule)… ”

The second modification which is altered for our proposed Plateau distributions is

“ … choose a very large constant $M^\ast$ and a very small constant $\epsilon > 0$ and force the proposal width to always be in $[\epsilon, M^\ast]$ … ”

The proof then follows Section 3.5 in Yang et al. (2019).

5 Results

In this section, we compare the performance of the proposed adaptive component-wise, multiple-try MCMC method with other MCMC methods in simulations. The adaptive Plateau MCMC (AP) is compared with a Metropolis-Hastings algorithm with Gaussian proposals (MH) and two versions of an adaptive Gaussian MCMC (AG1 and AG2), as introduced in Yang et al. (2019). The difference between AG1 and AG2 is that AG1 uses (2) with the same value of $\alpha$ as AP, while AG2 uses the value of $\alpha$ suggested in Yang et al. (2019).

The proposal distributions in both AG1 and AG2 are adapted as follows: if the proposal with the largest variance is under-selected (over-selected), then its variance is halved (doubled). Conversely, if the proposal with the smallest variance is under-selected (over-selected), then its variance is doubled (halved). After either of these two updates, the other variances are adjusted to be equidistant on a log-scale. For the complete algorithm with this adaptation scheme see Yang et al. (2019).

For all simulations and methods with multiple tries, we fix the number of trials $N$. Investigation of the methods' performance for differing values of $N$ is beyond the scope of this paper and is left for future work. However, interesting discussions in that direction already exist; see Martino and Louzada (2017) for example.

The proposal variances in the AG methods are initialised at fixed values $\sigma_j^2$, $j = 1, \ldots, N$. The Plateau parameter values for the AP method are initialised at fixed values of $\delta_1$, $\delta$, $\beta$, and $\beta_{\mathrm{out}}$.

The proposal distributions used in the MH method depend on the particular target distribution and the choices used are summarised in Table 2 below. The various target distributions considered in the simulations are now introduced.

5.1 Target Distributions

In order to compare the aforementioned methods, we investigate their performances by applying them to sample from a variety of target distributions.

Mixture of Gaussians

Consider a mixture of two 4-dimensional Gaussians with means $\mu_1$ and $\mu_2$ and covariance matrices $\Sigma_1$ and $\Sigma_2$. We refer to this target distribution as $\pi_{\mathrm{mix}}$.

Banana Distribution

Consider the 8-dimensional “banana-shaped” distribution (Haario et al., 1999), which is defined as follows. Let $\varphi_{\Sigma}$ be the density of the 8-dimensional normal distribution with zero mean and covariance $\Sigma = \mathrm{diag}(100, 1, \ldots, 1)$. The density function of the banana distribution with non-linearity parameter $b > 0$ is given by $\pi_{\mathrm{ban}}(x) = \varphi_{\Sigma}(\phi_b(x))$, where the function $\phi_b$ is

$$\phi_b(x) = \big(x_1,\; x_2 + b x_1^2 - 100 b,\; x_3, \ldots, x_8\big).$$

The value of $b$ determines the amount of non-linearity of $\pi_{\mathrm{ban}}$. Here, we consider the target distribution $\pi_{\mathrm{ban}}$ with a fixed value of $b$; thus, the first 2 components of this distribution are highly correlated – see Figure 3(a).
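For reference, a sketch of the un-normalised banana log-density under the standard Haario et al. (1999) construction; the value `b = 0.03` is illustrative and not necessarily the paper's exact setting.

    import numpy as np

    def banana_logpdf_unnorm(x, b=0.03):
        # Un-normalised log-density of the 8-D banana target: a
        # N(0, diag(100, 1, ..., 1)) density composed with the twist phi_b.
        x = np.asarray(x, dtype=float)
        y = x.copy()
        y[1] = x[1] + b * x[0] ** 2 - 100.0 * b  # the twisting map phi_b
        var = np.ones_like(y)
        var[0] = 100.0
        return -0.5 * float(np.sum(y ** 2 / var))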

Distributions perturbed by oscillations

Another target we consider is a perturbed 2-dimensional Gaussian, $\pi_{\mathrm{per}}$, whose probability density function is proportional to a bivariate Gaussian density multiplied by an oscillatory perturbation term; Figure 3(c) displays the un-normalised density. Lastly, we also consider a perturbed version of a 1-dimensional bi-stable distribution, which we denote by $\pi_{\mathrm{bi}}$. Figure 3(d) displays the PDF of $\pi_{\mathrm{bi}}$, where the normalising constant is approximated by numerical integration.

5.2 Run parameters

Each simulation run of the MCMC methods was independently repeated $R$ times, and for each run a burn-in period consisting of a fixed percentage of the MCMC iterations was used. During the burn-in period, AP, AG1 and AG2 were allowed to adapt their proposals. For each repetition, all methods started at the same random initial condition $x_0$. The number of MCMC steps, $M$, used for each method is presented in Table 1 and was determined by a trial run. In order to make fair comparisons, the number of MCMC steps performed by the Metropolis-Hastings algorithm is $d$ times larger than for the multiple-try versions, because the multiple-try methods cycle over all $d$ components in each iteration. This ensures that the number of times the target distribution is evaluated by each MCMC method is the same and that the computational effort is approximately the same.

Target                 Dimension ($d$)   Adaptive   MH
$\pi_{\mathrm{mix}}$          4            4,000    16,000
$\pi_{\mathrm{ban}}$          8           10,000    80,000
$\pi_{\mathrm{per}}$          2            3,000     6,000
$\pi_{\mathrm{bi}}$           1            3,000     3,000
Table 1: Number of MCMC steps used in simulations for each target distribution.

The proposal distribution used in the MH method for each target is presented in Table 2. Note that these proposals are based on the target distribution, which would typically be unknown in practice. Consequently, the MH method can be viewed as being optimally tuned. The particular scaling of the proposal variance follows from Gelman et al. (1996).

Table 2: Proposal distributions used in the Metropolis-Hastings algorithm for each target.
Figure 3: Selected marginal density plots of the simulation target distributions.

5.3 Simulation Results

For each target distribution, we compare the performance of the MCMC methods using measures which we now define. Denote the Markov chain produced by one of the MCMC methods for the $r$th independent repetition by $x_0^{(r)}, x_1^{(r)}, \ldots, x_M^{(r)}$, where $x_n^{(r)} \in \mathbb{R}^d$ for $n = 0, \ldots, M$. Denote the component-wise variances of the target by $\sigma_1^2, \ldots, \sigma_d^2$.

We will use the integrated autocorrelation time (ACT) of the MCMC methods as a measure of performance. The ACT for the chain's $i$th component is given by

$$\mathrm{ACT}_i = 1 + 2 \sum_{k=1}^{\infty} \rho_i(k),$$

where $\rho_i(k)$ denotes the lag-$k$ autocorrelation of the $i$th component, provided that the chain is stationary so that $\rho_i(k)$ does not depend on the iteration. For every repetition of the MCMC method, the integrated autocorrelation times are estimated component-wise from the observed Markov chain using the initial sequence estimator introduced in Geyer (1992). With slight abuse of notation, we will denote the resulting chain-based autocorrelation times by $\mathrm{ACT}_i^{(r)}$.
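A sketch of Geyer's (1992) initial positive sequence estimator for a one-dimensional chain, under the standard truncation rule (sum lag-autocorrelations in adjacent pairs and stop once a pair-sum turns negative); the function name is illustrative.

    import numpy as np

    def act_initial_sequence(x):
        # Integrated autocorrelation time of a 1-D chain.
        x = np.asarray(x, dtype=float)
        n = len(x)
        xc = x - x.mean()
        acov = np.correlate(xc, xc, mode="full")[n - 1:] / n  # gamma_0, gamma_1, ...
        rho = acov / acov[0]
        act = 1.0
        for m in range((n - 2) // 2):
            pair = rho[2 * m + 1] + rho[2 * m + 2]
            if pair < 0:          # truncation point of the initial positive sequence
                break
            act += 2.0 * pair
        return act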

Smaller autocorrelation times indicate that consecutive samples have lower correlation. Autocorrelation times are inversely proportional to the effective sample size (Kong, 1992; Kong et al., 1994), which is commonly used as a measure of performance. In fact, the effective sample size is often interpreted as the number of independent samples that would need to be drawn directly from the target in order to achieve the same variance as that of an estimator of interest based on the correlated MCMC samples. Higher effective sample sizes, and therefore lower autocorrelation times, are desirable. Another way of interpreting the ACTs is through the accuracy of a chain-based Monte Carlo integration. As a matter of fact, the mean squared error of a Monte Carlo estimator can be expressed as a sum of the component-wise ACTs weighted by the component-wise variances. Consequently, a method with lower ACTs will offer more accurate Monte Carlo integration for the same chain length. It is moreover noteworthy that an MCMC method's ACTs also characterise the asymptotic variance (the so-called time-averaged variance constant in this context) of a Monte Carlo estimator in the central limit theorem for Markov chains (see e.g. Asmussen and Glynn, 2007). In fact, lower ACTs will lead to smaller time-averaged variance constants. Finally, we mention that in practice the target distribution is intractable and therefore the variances of the components are unknown. However, these variances can be estimated by the initial sequence estimator method.

Another measure of performance is the chain's average squared jump distance (ASJD), which, for the $i$th component and repetition $r$, we define as

$$\mathrm{ASJD}_i^{(r)} = \frac{1}{M} \sum_{n=0}^{M-1} \big(x_{n+1, i}^{(r)} - x_{n, i}^{(r)}\big)^2.$$

The average squared jump distance measures the movement of the chain and is also linked with the acceptance rate of the MCMC method. Higher values of the average squared jump distance are desired, as they indicate larger moves and therefore more exploration of the space. We shall consider the ASJD as a measure of an MCMC method's ability to move around the state-space.
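Computing the ASJD from a stored chain is straightforward; the sketch below (illustrative, not the authors' code) returns the per-component ASJDs for a chain stored with one row per iteration.

    import numpy as np

    def asjd(chain):
        # chain has shape (M + 1, d); returns a length-d vector of
        # mean squared one-step displacements per component.
        chain = np.asarray(chain, dtype=float)
        return np.mean(np.diff(chain, axis=0) ** 2, axis=0)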

In summary, in the following results we are interested in the ACTs and the ASJDs per component. The distributions of the ACTs and ASJDs over the repetitions, i.e. $\{\mathrm{ACT}_i^{(r)}\}_{r=1}^{R}$ and $\{\mathrm{ASJD}_i^{(r)}\}_{r=1}^{R}$, will be presented using violin plots (Hintze and Nelson, 1998), which are boxplots with kernel density estimates attached to the sides. We use violin plots as opposed to plain boxplots to give a better illustration of the shape of the distributions. In the violin plots the median is represented by a horizontal line.

The corresponding results of the AP method without the adaptation procedure are presented in Appendix A for the reader’s convenience. These results indicate that the adaptation procedure is necessary for the MCMC method to effectively explore components with large or small variance relative to the chosen width of the Plateau proposals.

5.3.1 Mixture of Gaussians

The results for the 4-dimensional mixture of two Gaussians target, $\pi_{\mathrm{mix}}$, are presented in Fig. 4. Fig. 4(a) indicates that the AP method achieves lower ACTs than the other methods for all components (the MH method is omitted from this figure due to its very high ACTs). Further, the range of ACT values suggests that the AP method consistently produces MCMC chains with lower ACTs.

In terms of the movement of the MCMC chains, the AP method outperforms the other methods for this target distribution. In fact, the ASJDs presented in Fig. 4(b) show that the AP method moves around the state-space in larger jumps than the other methods. Since the AP and AG1 methods use the same weight function, as discussed in Section 2.1, this advantage is due to using the Plateau proposals in contrast to Gaussian proposals. These results suggest that the AP method is able to move between the two Gaussians in the target density efficiently.

(a) Autocorrelation times
(b) Average square jump distance
Figure 4: Distribution of the ACTs and ASJDs of the MCMC methods for target $\pi_{\mathrm{mix}}$. Red is AP, blue is AG1, green is AG2 and pink is MH.

5.3.2 Banana Distribution

The target $\pi_{\mathrm{ban}}$ is a difficult distribution to sample from due to the wide-ranging variances of its components and the correlation between the first two components arising from the unusual banana shape. The ACTs for the methods, presented in Fig. 5(a), show similar results across AP, AG1 and AG2 for the first two components. However, for the remaining components the AP method achieves notably smaller ACTs. The ACT results for the MH method are substantially larger for all components and are thus not included in this figure; as an indication, the median ACTs for the components are 580.77, 903.6, 52.09, 52.55, 52.13, 51.77, 52.27 and 51.15 for the MH method.

The ASJDs for the methods are shown in Fig. 5(b). The AP method again outperforms the other methods by achieving higher ASJDs for all components. Note the wide range of jump distances produced for the first component when using the Plateau proposals. This suggests that the AP method is able to navigate the banana shape in the first component easily.

(a) Autocorrelation times
(b) Average square jump distance
Figure 5: Distribution of the ACTs and ASJDs of the MCMC methods for target $\pi_{\mathrm{ban}}$. Red is AP, blue is AG1, green is AG2 and purple is MH.

5.3.3 2D Perturbed Distribution

For the perturbed 2-dimensional distribution, $\pi_{\mathrm{per}}$, the perturbations represent local modes where MCMC methods may potentially get stuck. Again, the AP method's ability to move slightly larger distances, as shown in Fig. 6(b), gives it a slight advantage over the other methods. This ability to jump further may explain the lower ACTs for the AP method – see Fig. 6(a).

(a) Distribution of log autocorrelation times
(b) Distribution of average squared jump distances
Figure 6: Distribution of the ACTs and ASJDs of the MCMC methods for target $\pi_{\mathrm{per}}$. Red is AP, blue is AG1, green is AG2 and pink is MH.

5.3.4 1D Perturbed Distribution

Similar to $\pi_{\mathrm{per}}$, the oscillations in $\pi_{\mathrm{bi}}$ are potential areas where an MCMC method may get stuck. The ACTs for the AP, AG1 and AG2 methods are presented in Fig. 7(a). The AP method achieves the lowest ACTs; however, there are a few outliers, which may indicate a few runs where the sampler got stuck in the local modes. This may also be the case for the AG1 method. For the MH method, the ACTs (not presented in the figure) are extremely large in comparison to the other methods, with a median of 145.

In terms of ASJDs, the AP method on average jumps roughly twice the distance of the AG1 and AG2 methods – see Fig. 7(b). Again there are some outlying ASJDs for the AP method, which may indicate some repetitions where the MCMC got stuck in local modes.

(a) Distribution of autocorrelation times
(b) Distribution of average square jump distance
Figure 7: Distribution of the ACTs and ASJDs of the MCMC methods for target $\pi_{\mathrm{bi}}$. Red is AP, blue is AG1, green is AG2 and pink is MH.

6 Conclusion

In this paper we have introduced Plateau distributions, a novel class of proposal distributions for use with component-wise, multiple-try Metropolis MCMC. These proposal distributions can be thought of as combinations of uniform distributions, leading to a family of distributions with non-overlapping support. The notion of using non-overlapping proposals in multiple-try MCMC methods is intuitive and, in fact, motivated as a means to counter the disadvantages (e.g. inefficient proposing of trials) of greatly overlapping proposal distributions such as Gaussians. Moreover, the class of Plateau distributions is simple to implement for use as proposals in MCMC methods and can be straightforwardly combined with the simple, yet highly effective, adaptation procedure presented in this paper.

We have demonstrated that using the Plateau proposal distributions with the suggested adaptation leads to MCMC methods that perform well for a variety of target distributions. In fact, the results indicate that these methods provide better MCMC chains, in terms of lower autocorrelation times and better exploration of the state-space, compared to other adaptive multiple-try methods with greatly overlapping Gaussian proposals. Furthermore, the simulation results suggest that the Plateau proposals are able to efficiently sample from target distributions with distant modes, complex shapes, and many nearby modes.

These results and the simplicity of their design make the Plateau proposals appealing for general use in component-wise, multiple-try MCMC algorithms. The introduced class of Plateau distributions is but one type of non-overlapping proposal. Further research may investigate other types of non-overlapping proposals, which may have multiple interacting trials (e.g. see Casarin et al. (2013)) and may be asymmetric.

Appendix A On the effects of adapting the Plateau distribution

In this appendix we present the same results for the AP method as shown in Section 5, along with results for the component-wise multiple-try MCMC using Plateau proposals without the adaptation method. We refer to the non-adapting MCMC method as non-AP. The initial proposal parameters are the same for both the AP and non-AP methods.

Figs 8 to 11 display the ACTs and the ASJDs for the target distributions $\pi_{\mathrm{mix}}$, $\pi_{\mathrm{ban}}$, $\pi_{\mathrm{per}}$, and $\pi_{\mathrm{bi}}$. In general, using the adaptation procedure gives lower or similar ACTs compared to the non-AP method, and higher or similar ASJDs. An exception occurs for component 3 of one target distribution, although this effect is negligible since that component's ACT is an order of magnitude smaller than the ACTs of the other components. On the other hand, the most notable improvements in performance when using the adaptation procedure are: component 4 for $\pi_{\mathrm{mix}}$ (Fig. 8), components 3 to 8 for $\pi_{\mathrm{ban}}$ (Fig. 9), components 1 and 2 for $\pi_{\mathrm{per}}$ (Fig. 10) and component 1 for $\pi_{\mathrm{bi}}$ (Fig. 11). The common feature of these components is that they correspond to small variances. Since the non-AP method does not tune the Plateau proposals, the resulting MCMC cannot scale to components with small variances: the Plateau proposals in the non-AP method are relatively wide, meaning that small-variance components are not explored effectively. The AP method, however, is able to tune the Plateau proposals according to the individual components' marginal variances.

(a) Distribution of autocorrelation times
(b) Distribution of average squared jump distances
Figure 8: Distribution of the ACTs and ASJDs of the MCMC methods for target $\pi_{\mathrm{mix}}$. Red is AP, orange is non-AP.
(a) Autocorrelation times
(b) Average square jump distance
Figure 9: Distribution of the ACTs and ASJDs of the MCMC methods for target $\pi_{\mathrm{ban}}$. Red is AP, orange is non-AP.
(a) Autocorrelation times
(b) Average square jump distance
Figure 10: Distribution of the ACTs and ASJDs of the MCMC methods for target $\pi_{\mathrm{per}}$. Red is AP, orange is non-AP.
(a) Autocorrelation times
(b) Average square jump distance
Figure 11: Distribution of the ACTs and ASJDs of the MCMC methods for target $\pi_{\mathrm{bi}}$. Red is AP, orange is non-AP.

References

  • Asmussen and Glynn (2007) Asmussen, S. and Glynn, P. W. (2007). Stochastic Simulation: Algorithms and Analysis. Springer.
  • Brooks et al. (2011) Brooks, S., Gelman, A., Jones, G. L., and Meng, X.-L., editors (2011). Handbook of Markov chain Monte Carlo. CRC Press, Boca Raton, FL.
  • Casarin et al. (2013) Casarin, R., Craiu, R., and Leisen, F. (2013). Interacting multiple try algorithms with different proposal distributions. Statistics and Computing, 23(2):185–200.
  • Gelman et al. (1996) Gelman, A., Roberts, G., and Gilks, W. (1996). Efficient Metropolis jumping rules. Bayesian statistics, 5(599-608):42.
  • Geyer (1992) Geyer, C. J. (1992). Practical Markov chain Monte Carlo. Statistical Science, 7(4):473–483.
  • Gilks et al. (1995) Gilks, W. R., Richardson, S., and Spiegelhalter, D. (1995). Markov chain Monte Carlo in practice. Chapman and Hall/CRC.
  • Haario et al. (1999) Haario, H., Saksman, E., and Tamminen, J. (1999). Adaptive proposal distribution for random walk Metropolis algorithm. Computational Statistics, 14(3):375–395.
  • Haario et al. (2001) Haario, H., Saksman, E., and Tamminen, J. (2001). An adaptive Metropolis algorithm. Bernoulli, 7(2):223–242.
  • Hastings (1970) Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109.
  • Hintze and Nelson (1998) Hintze, J. L. and Nelson, R. D. (1998). Violin plots: A box plot-density trace synergism. The American Statistician, 52(2):181–184.
  • Liu et al. (2000) Liu, J. S., Liang, F., and Wong, W. H. (2000). The multiple-try method and local optimization in Metropolis sampling. Journal of the American Statistical Association, 95(449):121–134.
  • Kong (1992) Kong, A. (1992). A note on importance sampling using standardized weights. University of Chicago, Dept. of Statistics, Technical Report, 348.
  • Kong et al. (1994) Kong, A., Liu, J. S., and Wong, W. H. (1994). Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 89(425):278–288.
  • Martino and Louzada (2017) Martino, L. and Louzada, F. (2017). Issues in the multiple try Metropolis mixing. Computational Statistics, 32(1):239–252.
  • Metropolis et al. (1953) Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092.
  • Roberts and Rosenthal (2001) Roberts, G. O. and Rosenthal, J. S. (2001). Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science, 16(4):351–367.
  • Roberts and Rosenthal (2007) Roberts, G. O. and Rosenthal, J. S. (2007). Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. Journal of Applied Probability, 44(2):458–475.
  • Roberts and Tweedie (1996) Roberts, G. O. and Tweedie, R. L. (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363.
  • Rosenthal (2011) Rosenthal, J. S. (2011). Optimal proposal distributions and adaptive MCMC. In Brooks, S., Gelman, A., Jones, G., and Meng, X.-L., editors, Handbook of Markov chain Monte Carlo. CRC press.
  • Yang et al. (2019) Yang, J., Levi, E., Craiu, R. V., and Rosenthal, J. S. (2019). Adaptive component-wise multiple-try Metropolis sampling. Journal of Computational and Graphical Statistics, 28(2):276–289.