
Orthogonal parallel MCMC methods for sampling and optimization

07/30/2015
by L. Martino, et al.

Monte Carlo (MC) methods are widely used for Bayesian inference and optimization in statistics, signal processing and machine learning. A well-known class of MC methods is the class of Markov Chain Monte Carlo (MCMC) algorithms. In order to foster better exploration of the state space, especially in high-dimensional applications, several schemes employing multiple parallel MCMC chains have been recently introduced. In this work, we describe a novel parallel interacting MCMC scheme, called orthogonal MCMC (O-MCMC), where a set of "vertical" parallel MCMC chains share information using some "horizontal" MCMC techniques working on the entire population of current states. More specifically, the vertical chains are led by random-walk proposals, whereas the horizontal MCMC techniques employ independent proposals, thus allowing an efficient combination of global exploration and local approximation. The interaction is contained in these horizontal iterations. Within the analysis of different implementations of O-MCMC, we also present novel schemes for reducing the overall computational cost of parallel multiple try Metropolis (MTM) chains. Furthermore, a modified version of O-MCMC for optimization is provided by considering parallel simulated annealing (SA) algorithms. Numerical results show the advantages of the proposed sampling scheme in terms of efficiency in the estimation, as well as robustness with respect to the initial values and the choice of the parameters.


1 Introduction

Monte Carlo (MC) methods are widely employed in different fields, for computational inference and stochastic optimization [1, 2, 3, 4]. Markov Chain Monte Carlo (MCMC) methods [4] are well-known MC methodologies to draw random samples and efficiently compute integrals involving a complicated multidimensional target probability density function (pdf), $\bar{\pi}(\mathbf{x})$ with $\mathbf{x} \in \mathcal{X} \subseteq \mathbb{R}^{d_x}$. MCMC techniques only need to be able to evaluate the target pdf, but the difficulty of diagnosing and speeding up the convergence has driven intensive research efforts in this field. For instance, several adaptive MCMC methods have been developed in order to adequately determine the shape and spread of the proposal density used to generate candidate samples within an MCMC scheme [5, 6, 4, 7]. Nevertheless, guaranteeing the theoretical convergence is still an issue in most cases. Moreover, in a single specific (long) run, the generated chain can remain trapped in a local mode and, in this scenario, the adaptation could even slow down the convergence. Thus, in order to speed up the exploration of the state space (and especially to deal with high-dimensional applications [8]), several schemes employing parallel chains have been recently proposed [2, 7], as well as multiple try and interacting schemes [9], but the problem is still far from being solved. The interest in parallel computation also stems from other motivations. For instance, several authors have studied the parallelization of MCMC algorithms, which have traditionally been implemented in an iterative non-parallel fashion, in order to reduce their computation time [10, 11]. Furthermore, parallel MCMC schemes are required in big data problems, where one possible approach consists of splitting the complete posterior distribution into different partial sub-posteriors [12, 13, 14, 15, 16].

In this work, we focus mainly on the implementation of parallel MCMC chains in order to foster the exploration of the state space and improve the overall performance. (A preliminary version of this work has been published in [17]. With respect to that paper, here we propose several novel interacting schemes for exchanging information among the chains, analyze the theoretical basis of the proposed approach and discuss in detail its relationships with other techniques. Different variants are presented for reducing the overall computational cost and for applying O-MCMC to optimization problems. Furthermore, we provide more exhaustive numerical simulations.) More specifically, we present a novel family of parallel MCMC schemes, called orthogonal MCMC (O-MCMC) algorithms, where $N$ different chains are independently run and, at some pre-specified iterations, they exchange information using another MCMC technique applied on the entire cloud of current states. Assuming that all the MCMC techniques used yield chains converging to the target pdf, the ergodicity of the global scheme is guaranteed: the whole kernel is still valid, since it is obtained as the multiplication of ergodic kernels with the same invariant pdf. For a fixed computational cost, the computing effort can be divided into parallel processes while, at some iterations, information is exchanged among the chains in order to enhance the overall mixing. Let us remark also that the novel O-MCMC scheme is able to combine efficiently both the random-walk and the independent proposal approaches, as both strategies have advantages and drawbacks. On the one hand, random-walk proposal pdfs are often used when there is no specific information about the target, since this approach turns out to be more explorative than using a fixed proposal. On the other hand, a well-chosen independent proposal density usually provides less correlation among the samples in the generated chain. In the novel method, the parallel "vertical" chains (based on random-walk proposals) move around as "free explorers" roaming the state space, whereas the "horizontal" MCMC technique (applied over the population of current states and based on independent proposals) works as a "park ranger", redirecting "lost explorers" towards the "beaten track" according to the target pdf. Unlike in [18], the exchange of information always takes into account the whole population of current states, instead of applying crossover or exchange schemes between specific pairs of chains. Furthermore, in this work tempering of the target pdf is not considered for sampling purposes (but it is employed for optimization). Hence, in this sense our approach resembles the nonreversible parallel MH algorithms described in [19, 20], where the whole population of states is also updated jointly at the times of interaction, pursuing non-reversibility instead of tempering as a means to accelerate convergence towards posterior mode regions. However, both tempering and crossovers could also be easily implemented within the O-MCMC framework.

Another contribution of this work is the computational improvement provided by novel parallel implementations of MCMC techniques using multiple candidates at each iteration. We present two novel schemes for parallel Multiple Try Metropolis (MTM) [21, 22, 23] chains that reduce the overall computational cost in the same fashion as [10], saving generated samples, target evaluations and multinomial sampling steps. One of them is an extended version, using several candidates, of the Block Independent Metropolis presented in [10]. The ergodicity of both schemes is guaranteed. These novel parallel MTM techniques are employed as horizontal methods in O-MCMC. The corresponding O-MCMC scheme (using a novel parallel MTM method) can also be interpreted as an MTM algorithm employing an adaptive proposal density. This pdf is a mixture of $N$ components: the adaptation of the location parameters of the components is driven by the $N$ vertical parallel chains (note that the outputs of these chains are also used in the estimation). Furthermore, we describe a modified version of O-MCMC for solving optimization problems (where we do employ tempering of the target), considering parallel simulated annealing (SA) algorithms [24, 25, 26] for the vertical movements. The application to big data problems is also discussed. Numerical simulations show that O-MCMC exhibits both flexibility and robustness with respect to the initialization and parameterization of the proposals.

The paper is structured as follows. Section 2 summarizes the general framework and the aim of the work. Section 3 describes a generic complete O-MCMC scheme, whereas Sections 4 and 5 provide different specific examples of vertical and horizontal movements, respectively. Section 6 discusses the O-MCMC framework for optimization and big data problems. Section 7 provides different numerical results. Finally, we finish with some remarks in Section 8.

2 Bayesian inference problem

In many applications, we aim at inferring a variable of interest given a set of observations or measurements. Let us denote the variable of interest by $\mathbf{x} \in \mathcal{X} \subseteq \mathbb{R}^{d_x}$, and let $\mathbf{y} \in \mathbb{R}^{d_y}$ be the observed data. The posterior pdf is then

$$\bar{\pi}(\mathbf{x}) = p(\mathbf{x}|\mathbf{y}) = \frac{\ell(\mathbf{y}|\mathbf{x})\, g(\mathbf{x})}{Z(\mathbf{y})}, \qquad (1)$$

where $\ell(\mathbf{y}|\mathbf{x})$ is the likelihood function, $g(\mathbf{x})$ is the prior pdf and $Z(\mathbf{y})$ is the model evidence (a.k.a. marginal likelihood). In general, $Z(\mathbf{y})$ is unknown, so we consider the corresponding unnormalized target function,

$$\pi(\mathbf{x}) = \ell(\mathbf{y}|\mathbf{x})\, g(\mathbf{x}) \propto \bar{\pi}(\mathbf{x}). \qquad (2)$$

In general, the analytical study of the posterior density $\bar{\pi}(\mathbf{x})$ is unfeasible (for instance, integrals involving $\bar{\pi}(\mathbf{x})$ are typically intractable), hence numerical approximations are required. Our goal is to approximate $\bar{\pi}(\mathbf{x})$ efficiently employing a cloud of random samples. In general, a direct method for drawing independent samples from $\bar{\pi}(\mathbf{x})$ is not available and alternative approaches (e.g., MCMC algorithms) are needed. The only required assumption is to be able to evaluate the unnormalized target function $\pi(\mathbf{x})$.
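To make the role of Eq. (2) concrete, the following Python sketch evaluates the unnormalized log-target $\log \pi(\mathbf{x}) = \log \ell(\mathbf{y}|\mathbf{x}) + \log g(\mathbf{x})$ for a toy model of our own choosing (unit-variance Gaussian likelihood, standard Gaussian prior); it is not a model used in the paper. This pointwise evaluation is all that the MCMC schemes below require.

```python
import numpy as np

# Toy illustration of Eq. (2): an unnormalized posterior pi(x) = l(y|x) g(x).
# The model (Gaussian likelihood with known unit variance, standard Gaussian
# prior) is an illustrative assumption, not one taken from the paper.

y = np.array([0.8, 1.1, 0.4])  # observed data

def log_target(x):
    """Log of the unnormalized target: log pi(x) = log l(y|x) + log g(x)."""
    log_lik = -0.5 * np.sum((y - x) ** 2)   # Gaussian likelihood, sigma = 1
    log_prior = -0.5 * x ** 2               # standard Gaussian prior
    return log_lik + log_prior

print(log_target(0.5))
```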

3 O-MCMC algorithms: General outline

Let us consider $N$ parallel "vertical" chains, $\{\mathbf{x}_{n,t}\}_{t \in \mathbb{N}}$ with $n = 1, \dots, N$, generated by different MCMC techniques with random-walk proposal pdfs $q_n(\mathbf{x}|\mathbf{x}_{n,t-1})$, i.e., $\mathbf{x}_{n,t-1}$ plays the role of a location parameter for the proposal pdf used in the next iteration. Let us denote the population of current states at the $t$-th iteration as

$$\mathcal{P}_t = \{\mathbf{x}_{1,t}, \dots, \mathbf{x}_{N,t}\}.$$

At certain selected iterations, we apply another MCMC technique taking into account the entire population of states $\mathcal{P}_{t-1}$, yielding a new cloud of samples $\mathcal{P}_t$. In these "horizontal" transitions, the different chains share information. The horizontal MCMC technique uses a proposal pdf which is independent of the previous states, unlike the random-walk proposals employed in the vertical MCMC chains. The general O-MCMC approach is represented graphically in Figure 1 and summarized below:

  1. Initialization: Set $t = 1$, $m = 1$. Choose the initial states,

    $$\mathcal{P}_0 = \{\mathbf{x}_{1,0}, \dots, \mathbf{x}_{N,0}\},$$

    the total number of iterations, $T$, and three positive integer values $M$, $T_V$, $T_H$ such that $(T_V + T_H)M = T$.

  2. For m=1,…,M:

    1. Vertical period: For $t = (m-1)(T_V + T_H) + 1, \dots, (m-1)(T_V + T_H) + T_V$,

      run $N$ independent MCMC techniques starting from $\mathbf{x}_{n,t-1}$, thus obtaining $\mathbf{x}_{n,t}$, for $n = 1, \dots, N$, i.e., a new population of states $\mathcal{P}_t$.

    2. Horizontal period: For $t = (m-1)(T_V + T_H) + T_V + 1, \dots, m(T_V + T_H)$,

      apply an MCMC approach taking into account the entire population $\mathcal{P}_{t-1}$ to generate the next cloud $\mathcal{P}_t$.

  3. Output: Return the samples contained in all the sets $\mathcal{P}_t$, $t = 1, \dots, T$.

One vertical period contains $T_V$ iterations of the $N$ chains, whereas in one horizontal period we have $T_H$ iterations. Hence, given $T$, after one cycle of vertical and horizontal steps, $T_V + T_H$ iterations of each chain have been performed. The total number of cycles, or epochs (one cycle includes one vertical period and one horizontal period), is then $M = \frac{T}{T_V + T_H}$. The ergodicity is guaranteed if the vertical and horizontal steps produce ergodic chains with invariant density $\bar{\pi}(\mathbf{x})$ (see Appendix A for further details). Relationships with other techniques, for instance [27, 28], are discussed in Section 5.4. In the following two sections, we introduce different specific implementations of O-MCMC; a schematic code sketch of the overall loop is given after Figure 1.

Figure 1: A graphical representation of the O-MCMC approach. After $T_V$ vertical transitions, $T_H$ horizontal steps are performed.
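The following Python sketch mirrors the outline above: $M$ epochs, each consisting of $T_V$ vertical iterations of $N$ independent chains followed by $T_H$ horizontal iterations over the whole population. The callables vertical_step and horizontal_step are hypothetical placeholders standing in for the specific kernels of Sections 4 and 5.

```python
import numpy as np

# Schematic O-MCMC outer loop. vertical_step(log_target, x, rng) performs one
# random-walk MCMC update of a single chain; horizontal_step(log_target, P, rng)
# performs one interaction update over the whole population. Both are
# placeholders for the kernels described in Sections 4 and 5.

def o_mcmc(log_target, P0, M, TV, TH, vertical_step, horizontal_step, rng):
    """Run M epochs; each epoch = TV vertical iterations + TH horizontal ones."""
    P = np.copy(P0)               # population of N current states (N x dim)
    samples = []
    for m in range(M):
        for _ in range(TV):       # vertical period: N independent chains
            P = np.array([vertical_step(log_target, x, rng) for x in P])
            samples.append(np.copy(P))
        for _ in range(TH):       # horizontal period: chains interact
            P = horizontal_step(log_target, P, rng)
            samples.append(np.copy(P))
    return np.array(samples)      # all populations, shape (M*(TV+TH), N, dim)
```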

4 Vertical Movements

In this section, we describe different implementations of the vertical parallel chains. Although it is not strictly necessary, we consider only random-walk proposal densities in the vertical chains. The idea is to exploit predominantly the explorative behavior of the independent parallel MCMC methods. Therefore, we only consider proposals of type $q_n(\mathbf{x}|\mathbf{x}_{n,t-1}) = q_n(\mathbf{x} - \mathbf{x}_{n,t-1})$. In this case, a sample $\mathbf{x}'$ can be expressed as

$$\mathbf{x}' = \mathbf{x}_{n,t-1} + \boldsymbol{\xi}_{n,t}, \qquad (3)$$

where $\boldsymbol{\xi}_{n,t} \sim q_n(\mathbf{x})$. Another more sophisticated possibility is to include the gradient information of the target within the proposal pdf, as suggested in the Metropolis-Adjusted Langevin Algorithm (MALA) [29]. In this case a sample becomes

$$\mathbf{x}' = \mathbf{x}_{n,t-1} + \frac{\epsilon^2}{2} \nabla \log \pi(\mathbf{x}_{n,t-1}) + \boldsymbol{\xi}_{n,t}, \qquad (4)$$

where $\boldsymbol{\xi}_{n,t} \sim q_n(\mathbf{x})$ and $\nabla f(\mathbf{x})$ denotes the gradient of a generic function $f(\mathbf{x})$. This second alternative can be particularly useful in high-dimensional spaces, although it inevitably increases the probability of the chain becoming trapped in one mode of the target in a multi-modal scenario. Thus, the joint application of $N$ parallel chains appears very appropriate in this scenario, since they can more easily reach different modes of the target. Moreover, the application of the O-MCMC scheme facilitates the jumps among different modes.

Regarding the MCMC algorithm, note that the random-walk proposal density can be applied within different MCMC kernels. The simplest possibility is using a Metropolis-Hastings (MH) algorithm [4]. For each $n = 1, \dots, N$ and for a given time step $t$, one MH update of the $n$-th chain is obtained as

  1. Draw $\mathbf{x}' \sim q_n(\mathbf{x}|\mathbf{x}_{n,t-1})$.

  2. Set $\mathbf{x}_{n,t} = \mathbf{x}'$ with probability

    $$\alpha = \min\left[1, \frac{\pi(\mathbf{x}')\, q_n(\mathbf{x}_{n,t-1}|\mathbf{x}')}{\pi(\mathbf{x}_{n,t-1})\, q_n(\mathbf{x}'|\mathbf{x}_{n,t-1})}\right].$$

    Otherwise, i.e., with probability $1 - \alpha$, set $\mathbf{x}_{n,t} = \mathbf{x}_{n,t-1}$.

Many other schemes can be used instead of the MH kernel for the vertical chains: two particularly appealing alternatives are the Multiple Try Metropolis (MTM) [21, 23] and the Delayed Rejection Metropolis [30] techniques. A code sketch of one random-walk MH update is given below.
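As an illustration of the simplest vertical kernel, here is a minimal random-walk MH update in Python, assuming a symmetric Gaussian increment with fixed scale sigma (our illustrative choice); since the proposal is symmetric, the proposal ratio in the MH test above cancels.

```python
import numpy as np

# Minimal random-walk Metropolis-Hastings update, usable as the vertical
# kernel in the sketch above. The Gaussian proposal with fixed scale sigma is
# an assumption for illustration; any symmetric random-walk proposal q_n works.

def rw_mh_step(log_target, x, rng, sigma=1.0):
    x_prop = x + sigma * rng.standard_normal(np.shape(x))  # Eq. (3): x' = x + noise
    log_alpha = log_target(x_prop) - log_target(x)         # symmetric proposal cancels
    if np.log(rng.uniform()) < log_alpha:                  # accept with prob min(1, ratio)
        return x_prop
    return x
```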

5 Horizontal Movements

As described above, at each iteration $t$, the vertical chains return a population $\mathcal{P}_t$. When $t = (m-1)(T_V + T_H) + T_V$, with $m = 1, \dots, M$, i.e., after $T_V$ vertical transitions, $T_H$ horizontal steps are performed. The purpose of these horizontal MCMC transitions is to exchange information among the different chains, improving the global mixing. In the following, we consider two different general approaches for sharing the information among the chains:

  • In the first one, a population-based MCMC algorithm is applied. The states of the vertical chains contained in $\mathcal{P}_t$ are used as the initial population. Furthermore, the population-based MCMC scheme takes into account the entire current population for making decisions about the next population.

  • In the second one, named the mixture-based approach, the initial population is also used for building a suitable density $\psi(\mathbf{x})$. This pdf is employed as an independent proposal by $N$ parallel MCMC chains for yielding the next populations. More specifically, in this work we suggest to construct $\psi(\mathbf{x})$ as a mixture of $N$ pdfs, each one centered at one of the states of the initial population.

Below we show several specific examples. In all cases, for the horizontal movements we consider the use of independent proposal pdfs, unlike for the vertical ones, where we have suggested the use of random-walk proposals.

5.1 Population-based approach

Here, we consider a generalized target density,

$$\bar{\pi}_g(\mathbf{x}_1, \dots, \mathbf{x}_N) = \prod_{n=1}^{N} \bar{\pi}(\mathbf{x}_n), \qquad (5)$$

where each marginal $\bar{\pi}(\mathbf{x}_n)$, for $n = 1, \dots, N$ and $\mathbf{x}_n \in \mathcal{X}$, coincides with the target pdf in Eq. (2). The idea is that the horizontal MCMC transitions leave invariant the extended target $\bar{\pi}_g$. Namely, after a "burn-in" period, the population $\mathcal{P}_t$ is distributed according to $\bar{\pi}_g$. The simplest possible population-based scheme consists of employing a standard Metropolis-Hastings (MH) algorithm directly in the extended domain, $\mathcal{X}^N$, with target $\bar{\pi}_g$, generating (block) transitions from $\mathcal{P}_{t-1}$ to $\mathcal{P}_t$. However, the probability of accepting a new population in this case becomes negligible as $N$ grows. As an alternative example of a population-based scheme, we consider the Sample Metropolis-Hastings (SMH) method [31, Chapter 4]. At each iteration, the underlying idea of SMH is replacing one "bad" sample in the population with a "better" one, according to a certain suitable probability. The new sample, a candidate to be incorporated into the population, is generated from an independent proposal pdf $\varphi(\mathbf{x})$. The algorithm is designed so that, after a "burn-in" period $t_b$, the elements in $\mathcal{P}_t$ ($t > t_b$) are distributed according to $\bar{\pi}_g$ in Eq. (5). Table 1 provides a detailed description of the algorithm.

  1. For $t = 1, \dots, T$:

    1. Draw $\mathbf{x}_{0,t-1} \sim \varphi(\mathbf{x})$.

    2. Choose a "bad" sample $\mathbf{x}_{k,t-1}$ in $\mathcal{P}_{t-1}$, i.e., select an index $k \in \{1, \dots, N\}$, with probability proportional to the inverse of the importance sampling weights, i.e.,

      $$\gamma_k = \frac{\varphi(\mathbf{x}_{k,t-1}) / \pi(\mathbf{x}_{k,t-1})}{\sum_{n=1}^{N} \varphi(\mathbf{x}_{n,t-1}) / \pi(\mathbf{x}_{n,t-1})}, \qquad k = 1, \dots, N.$$

    3. Accept the new population, $\mathcal{P}_t = \{\mathbf{x}_{1,t} = \mathbf{x}_{1,t-1}, \dots, \mathbf{x}_{k,t} = \mathbf{x}_{0,t-1}, \dots, \mathbf{x}_{N,t} = \mathbf{x}_{N,t-1}\}$, with probability

      $$\alpha = \frac{\sum_{n=1}^{N} \varphi(\mathbf{x}_{n,t-1}) / \pi(\mathbf{x}_{n,t-1})}{\sum_{n=0}^{N} \varphi(\mathbf{x}_{n,t-1}) / \pi(\mathbf{x}_{n,t-1}) - \min_{0 \le j \le N} \varphi(\mathbf{x}_{j,t-1}) / \pi(\mathbf{x}_{j,t-1})}. \qquad (6)$$

      Otherwise, set $\mathcal{P}_t = \mathcal{P}_{t-1}$.

Table 1: Sample Metropolis-Hastings (SMH) algorithm for horizontal transitions in O-MCMC.

Let us remark that the difference between $\mathcal{P}_t$ and $\mathcal{P}_{t-1}$ is at most one sample, and the acceptance probability $\alpha$ depends on the entire population, $\mathbf{x}_{n,t-1}$ for $n = 1, \dots, N$, and the new candidate sample $\mathbf{x}_{0,t-1}$. At each step, the sample chosen to be replaced is selected with probability proportional to the inverse of the corresponding importance weight. The ergodicity can be proved considering the extended density $\bar{\pi}_g$ as target pdf (see Appendix C). Note also that the SMH algorithm becomes the standard MH method for $N = 1$. Hence, for $N = 1$ the specific O-MCMC implementation using SMH consists of applying alternately two MH kernels with different types of proposals: a random-walk proposal, $q_1(\mathbf{x}|\mathbf{x}_{1,t-1})$, and an independent one, $\varphi(\mathbf{x})$. This is a well-known scheme (cf. [4, 31]), which can be seen as a particular case of the O-MCMC family of algorithms. A code sketch of one SMH transition is given below.
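The following Python sketch implements one SMH transition following Table 1 and the acceptance rule reconstructed in Eq. (6). The Gaussian independent proposal is an illustrative assumption; for clarity, the inverse weights $\varphi/\pi$ are computed directly rather than in log-space, so this is a sketch rather than a numerically robust implementation.

```python
import numpy as np
from scipy import stats

# One Sample Metropolis-Hastings (SMH) transition (Table 1). A Gaussian
# independent proposal phi is assumed for illustration. target(x) evaluates
# the unnormalized pdf pi(x) at one point; P has shape (N, dim).

def smh_step(target, P, rng, phi_mean, phi_cov):
    phi = stats.multivariate_normal(phi_mean, phi_cov)
    x0 = phi.rvs(random_state=rng)                      # new candidate from phi
    pop = np.vstack([x0, P])                            # index 0 = candidate
    inv_w = phi.pdf(pop) / np.array([target(x) for x in pop])  # phi/pi per sample
    gamma = inv_w[1:] / inv_w[1:].sum()                 # prob. of picking a "bad" sample
    k = rng.choice(len(P), p=gamma)                     # index to be replaced
    alpha = inv_w[1:].sum() / (inv_w.sum() - inv_w.min())      # Eq. (6)
    if rng.uniform() < alpha:
        P = np.copy(P)
        P[k] = x0                                       # replace the chosen sample
    return P
```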

5.2 Mixture-based approach

An alternative approach consists in defining the following mixture of $N$ pdfs, updated every $T_V$ vertical transitions,

$$\psi(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} \varphi_n(\mathbf{x}|\boldsymbol{\mu}_n, \boldsymbol{\Lambda}_n), \qquad (7)$$

where $\boldsymbol{\mu}_n = \mathbf{x}_{n,t'}$ (with $t'$ denoting the last iteration of the preceding vertical period) plays the role of the location parameter of the $n$-th component of the mixture, $\varphi_n$. Observe that $\psi(\mathbf{x})$ changes from one horizontal period to the next one (since it depends on the final population of the vertical period), but then it remains fixed within the $T_H$ iterations of each horizontal period. Thus, during the complete O-MCMC run we employ $M$ different mixtures, $\psi_1, \dots, \psi_M$, one for each horizontal period. However, for simplifying the notation, we use $\psi$. Figure 2 provides a graphical representation. We employ $\psi(\mathbf{x})$ within $N$ independent MCMC schemes as an independent proposal density, namely independent of the previous state of the chain. The underlying idea is using the information in $\mathcal{P}_{t'}$ to build a good proposal function for performing $N$ independent MCMC processes. The theoretical motivation is that, after the burn-in periods, the vertical chains have converged to the target, so $\mathbf{x}_{n,t'} \sim \bar{\pi}(\mathbf{x})$ for $n = 1, \dots, N$. Then, $\psi(\mathbf{x})$ in Eq. (7) can be interpreted as a kernel density estimate of $\bar{\pi}(\mathbf{x})$, where the components $\varphi_n$ play the role of the kernel functions.

Figure 2: A graphical representation of the mixture-based strategy. The mixture $\psi(\mathbf{x})$ is formed by $N$ components, $\varphi_n$, where $\boldsymbol{\mu}_n = \mathbf{x}_{n,t'}$ plays the role of a location parameter.

Figure 3: A schematic representation of the basic horizontal scheme described in Table 2. One specific transition of one specific chain is represented with the probability $\alpha_n = \alpha_n(\mathbf{x}_{n,t-1}, \mathbf{x}')$, where $n \in \{1, \dots, N\}$, showing the two possible future states at the $t$-th iteration of the $n$-th chain.

5.2.1 Basic schemes

As a first example of this strategy, we consider the application of MH transitions. At each iteration $t$, one sample $\mathbf{x}'$ is generated from $\psi(\mathbf{x})$ and then $N$ different MH tests are performed. The procedure is shown in Table 2 and represented in Figure 3. Alternatively, a different sample $\mathbf{x}_n'$, drawn from $\psi(\mathbf{x})$, can be tested for each chain, as shown in Table 3. Hence, $N$ different samples are drawn at each iteration (instead of only one) but, after building $\psi(\mathbf{x})$, the process could be completely parallelized. The variant in Table 3 provides in general better performance, although at the expense of an increased computational cost in terms of evaluations of the target and number of generated samples. However, the block independent MH methodology [10], proposed for reducing the computational effort by recycling generated samples and target evaluations, can be employed. To clarify this, let us consider for simplicity $T_H = N$. Step 2(a) in Table 2 could be modified by drawing only $N$ independent samples from $\psi(\mathbf{x})$ and, at each iteration $t$, a different circular permutation of this set could be tested in the $N$ different MH acceptance tests (for further clarification, see the extension of this scheme for a Multiple Try Metropolis method described in Section 5.2.3). Note that the scheme in Table 2 yields $N$ dependent chains, whereas the algorithm in Table 3 produces independent chains (the interaction, in this case, is only contained in the construction of the mixture $\psi$ at the beginning of the horizontal period); a code sketch of the latter is given after Table 3. Finally, observe that the procedure in Table 2 presents certain similarities with the Normal Kernel Coupler (NKC) method introduced in [32], thus indicating that NKC-type algorithms can also be employed as alternative population-based approaches.

  1. Build $\psi(\mathbf{x})$ as in Eq. (7), where $\boldsymbol{\mu}_n = \mathbf{x}_{n,t-1}$ for $n = 1, \dots, N$.

  2. For $t = 1, \dots, T_H$ (iterations of the horizontal period):

    1. Draw $\mathbf{x}' \sim \psi(\mathbf{x})$.

    2. For $n = 1, \dots, N$:

      1. Set $\mathbf{x}_{n,t} = \mathbf{x}'$, with probability

        $$\alpha_n = \min\left[1, \frac{\pi(\mathbf{x}')\, \psi(\mathbf{x}_{n,t-1})}{\pi(\mathbf{x}_{n,t-1})\, \psi(\mathbf{x}')}\right] = \min\left[1, \frac{\omega(\mathbf{x}')}{\omega(\mathbf{x}_{n,t-1})}\right],$$

        where $\omega(\mathbf{x}) = \frac{\pi(\mathbf{x})}{\psi(\mathbf{x})}$ and $\psi(\mathbf{x}) > 0$, for any $\mathbf{x} \in \mathcal{X}$. Otherwise, set $\mathbf{x}_{n,t} = \mathbf{x}_{n,t-1}$.

    3. Set $\mathcal{P}_t = \{\mathbf{x}_{1,t}, \dots, \mathbf{x}_{N,t}\}$.

Table 2: Basic mixture scheme for horizontal transitions in O-MCMC.
  1. Build $\psi(\mathbf{x})$ as in Eq. (7), where $\boldsymbol{\mu}_n = \mathbf{x}_{n,t-1}$ for $n = 1, \dots, N$.

  2. For $t = 1, \dots, T_H$ (iterations of the horizontal period):

    1. For $n = 1, \dots, N$:

      1. Draw $\mathbf{x}_n' \sim \psi(\mathbf{x})$.

      2. Set $\mathbf{x}_{n,t} = \mathbf{x}_n'$, with probability

        $$\alpha_n = \min\left[1, \frac{\pi(\mathbf{x}_n')\, \psi(\mathbf{x}_{n,t-1})}{\pi(\mathbf{x}_{n,t-1})\, \psi(\mathbf{x}_n')}\right] = \min\left[1, \frac{\omega(\mathbf{x}_n')}{\omega(\mathbf{x}_{n,t-1})}\right],$$

        where $\omega(\mathbf{x}) = \frac{\pi(\mathbf{x})}{\psi(\mathbf{x})}$ and $\psi(\mathbf{x}) > 0$, for any $\mathbf{x} \in \mathcal{X}$. Otherwise, set $\mathbf{x}_{n,t} = \mathbf{x}_{n,t-1}$.

    2. Set $\mathcal{P}_t = \{\mathbf{x}_{1,t}, \dots, \mathbf{x}_{N,t}\}$.

Table 3: Variant of the basic mixture scheme for horizontal transitions in O-MCMC.
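As an illustration of the variant in Table 3, the following Python sketch builds the mixture $\psi$ once (equal weights and a common covariance C are our simplifying assumptions) and then performs one horizontal iteration of $N$ independent MH tests with weights $\omega(\mathbf{x}) = \pi(\mathbf{x})/\psi(\mathbf{x})$.

```python
import numpy as np
from scipy import stats

# Mixture-based horizontal step (Table 3 variant): the mixture psi, built from
# the anchor population P_anchor at the start of the horizontal period, stays
# fixed while each chain runs its own independent MH test. Equal-weight
# Gaussian components with a common covariance C are illustrative assumptions.

def make_psi(P_anchor, C):
    comps = [stats.multivariate_normal(mu, C) for mu in P_anchor]
    psi_pdf = lambda x: np.mean([c.pdf(x) for c in comps])  # Eq. (7)
    return psi_pdf, comps

def mixture_mh_step(target, P, psi_pdf, comps, rng):
    P_new = np.copy(P)
    for n in range(len(P)):
        j = rng.integers(len(comps))                    # pick a mixture component
        x_prop = comps[j].rvs(random_state=rng)         # x' ~ psi
        w_prop = target(x_prop) / psi_pdf(x_prop)       # omega(x') = pi/psi
        w_curr = target(P[n]) / psi_pdf(P[n])
        if rng.uniform() < min(1.0, w_prop / w_curr):   # independent MH test
            P_new[n] = x_prop
    return P_new
```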

5.2.2 Schemes based on multiple candidates

More advanced techniques can also be modified and used as horizontal methods. More specifically, the adaptation to this scenario of multiple try schemes is particularly interesting. For instance, we adjust two special cases (they are special cases of the corresponding algorithms, since an independent proposal pdf is used) of the Ensemble MCMC (EnM) [33] and Multiple Try Metropolis (MTM) [21, 34, 23] methods for fitting them within O-MCMC. Tables 4 and 5 summarize them. Note that standard parallel EnM and MTM chains could be considered; however, we suggest two variants for reducing the computational cost. In both cases, $L$ different i.i.d. samples, $\mathbf{z}^{(1)}, \dots, \mathbf{z}^{(L)}$, are drawn from $\psi(\mathbf{x})$. In the parallel Ensemble MCMC (P-EnM) scheme, at each iteration $t$, one resampling step per chain is performed, considering the set of samples $\{\mathbf{z}^{(1)}, \dots, \mathbf{z}^{(L)}, \mathbf{x}_{n,t-1}\}$, $n = 1, \dots, N$, using importance weights. In the parallel MTM (P-MTM) scheme, at each iteration $t$, $N$ resampling steps are performed considering the set of candidates $\{\mathbf{z}^{(1)}, \dots, \mathbf{z}^{(L)}\}$, and the new possible states are tested (i.e., accepted or not) according to suitable acceptance probabilities $\alpha_n$, $n = 1, \dots, N$, involving also the previous states $\mathbf{x}_{n,t-1}$.

  1. Build $\psi(\mathbf{x})$ as in Eq. (7), where $\boldsymbol{\mu}_n = \mathbf{x}_{n,t-1}$ for $n = 1, \dots, N$.

  2. For $t = 1, \dots, T_H$ (iterations of the horizontal period):

    1. Draw $L$ i.i.d. candidates $\mathbf{z}^{(1)}, \dots, \mathbf{z}^{(L)} \sim \psi(\mathbf{x})$.

    2. For $n = 1, \dots, N$:

      1. Set $\mathbf{x}_{n,t} = \mathbf{z}^{(i)}$, i.e., select $\mathbf{z}^{(i)} \in \{\mathbf{z}^{(1)}, \dots, \mathbf{z}^{(L)}\}$, with probability

        $$\alpha_i = \frac{\omega(\mathbf{z}^{(i)})}{\sum_{\ell=1}^{L} \omega(\mathbf{z}^{(\ell)}) + \omega(\mathbf{x}_{n,t-1})}, \qquad i = 1, \dots, L, \qquad (8)$$

        or set $\mathbf{x}_{n,t} = \mathbf{x}_{n,t-1}$ with probability

        $$\frac{\omega(\mathbf{x}_{n,t-1})}{\sum_{\ell=1}^{L} \omega(\mathbf{z}^{(\ell)}) + \omega(\mathbf{x}_{n,t-1})}, \qquad (9)$$

        where $\omega(\mathbf{x}) = \frac{\pi(\mathbf{x})}{\psi(\mathbf{x})}$.

    3. Set $\mathcal{P}_t = \{\mathbf{x}_{1,t}, \dots, \mathbf{x}_{N,t}\}$.

Table 4: Parallel Ensemble MCMC (P-EnM) scheme for horizontal transitions in O-MCMC.
  1. Build $\psi(\mathbf{x})$ as in Eq. (7), where $\boldsymbol{\mu}_n = \mathbf{x}_{n,t-1}$ for $n = 1, \dots, N$.

  2. For $t = 1, \dots, T_H$ (iterations of the horizontal period):

    1. Draw $L$ i.i.d. candidates $\mathbf{z}^{(1)}, \dots, \mathbf{z}^{(L)} \sim \psi(\mathbf{x})$.

    2. Draw $N$ independent samples $\bar{\mathbf{z}}_1, \dots, \bar{\mathbf{z}}_N$ such that $\bar{\mathbf{z}}_n \in \{\mathbf{z}^{(1)}, \dots, \mathbf{z}^{(L)}\}$, i.e., select $\bar{\mathbf{z}}_n = \mathbf{z}^{(i)}$, for $n = 1, \dots, N$, with probability

      $$\bar{\beta}_i = \frac{\omega(\mathbf{z}^{(i)})}{\sum_{\ell=1}^{L} \omega(\mathbf{z}^{(\ell)})}, \qquad i = 1, \dots, L. \qquad (10)$$

      Namely, resample $N$ times the samples in the set $\{\mathbf{z}^{(1)}, \dots, \mathbf{z}^{(L)}\}$ with probabilities $\bar{\beta}_i$, $i = 1, \dots, L$.

    3. For $n = 1, \dots, N$:

      1. Set $\mathbf{x}_{n,t} = \bar{\mathbf{z}}_n$ with probability

        $$\alpha_n = \min\left[1, \frac{\sum_{\ell=1}^{L} \omega(\mathbf{z}^{(\ell)})}{\sum_{\ell=1}^{L} \omega(\mathbf{z}^{(\ell)}) - \omega(\bar{\mathbf{z}}_n) + \omega(\mathbf{x}_{n,t-1})}\right]. \qquad (11)$$

        Otherwise, set $\mathbf{x}_{n,t} = \mathbf{x}_{n,t-1}$ (with probability $1 - \alpha_n$).

    4. Set $\mathcal{P}_t = \{\mathbf{x}_{1,t}, \dots, \mathbf{x}_{N,t}\}$.

Table 5: Parallel Multiple Try Metropolis (P-MTM) scheme for horizontal transitions in O-MCMC.

The ergodicity of both schemes is discussed in Appendix D. The algorithms in Tables 4-5 are obtained by a rearrangement of the basic schemes in [33, 21, 34] in order to generate, at each iteration $t$, new states for $N$ independent parallel chains. The new states of the chains are selected by filtering the same set of candidates $\{\mathbf{z}^{(1)}, \dots, \mathbf{z}^{(L)}\}$, drawn from the same independent proposal pdf $\psi(\mathbf{x})$. Note that, with respect to a standard parallel approach, they require fewer evaluations of the target pdf: at each iteration, the algorithms in Tables 4-5 require $L$ new evaluations of the target instead of the $NL$ target evaluations required by a standard parallel approach. For further explanations, see Appendix D.1.1 and Figure 8. With $L = 1$, the algorithm in Table 4 coincides with the application of $N$ parallel MH methods with Barker's acceptance rule [35]. The algorithm in Table 5 with $L = 1$ coincides with the scheme presented in Table 2. Although any value of $L$ can be employed, a number of tries $L \geq N$ is suggested. Note that an important difference with respect to the standard parallel implementation is that the generated chains are no longer independent. A code sketch of one P-MTM transition is given below.
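A minimal Python sketch of one P-MTM transition (Table 5), under the reconstruction of Eqs. (10)-(11), follows; psi_sample and psi_pdf are generic callables for drawing from and evaluating the mixture $\psi$, and target evaluates the unnormalized pdf $\pi$.

```python
import numpy as np

# One P-MTM transition (Table 5): L shared candidates drawn from psi,
# resampled once per chain by importance weight (Eq. 10), each resampled try
# accepted/rejected with the MTM rule for an independent proposal (Eq. 11).

def p_mtm_step(target, psi_sample, psi_pdf, P, L, rng):
    N = len(P)
    Z = np.array([psi_sample(rng) for _ in range(L)])          # shared tries
    w = np.array([target(z) / psi_pdf(z) for z in Z])          # weights pi/psi
    S = w.sum()
    idx = rng.choice(L, size=N, p=w / S)                       # Eq. (10): resampling
    P_new = np.copy(P)
    for n in range(N):
        w_curr = target(P[n]) / psi_pdf(P[n])
        alpha = min(1.0, S / (S - w[idx[n]] + w_curr))         # Eq. (11)
        if rng.uniform() < alpha:
            P_new[n] = Z[idx[n]]
    return P_new
```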

5.2.3 Block Independent Multiple Try Metropolis algorithm

Previously, we have pointed out that with the scheme in Table 5 only $L$ evaluations of the target are required at each iteration, instead of the $NL$ of a standard parallel approach. The proposed scheme in Table 5 can also be modified in the same fashion as the block independent MH method [10], in order to reduce the number of multinomial sampling steps, without jeopardizing the ergodicity of the parallel chains. We remark that the corresponding technique, called Block Independent Multiple Try Metropolis (BI-MTM), can always be employed when $N$ parallel independent MTM chains are applied (even outside the O-MCMC scheme) in order to reduce the overall computational cost. Let us assume that the value $T_H$ is such that the total number of transitions of one chain, $T_H$, can be divided into $\frac{T_H}{N}$ blocks. The idea is based on using the $N$ circular permutations of the resampled set $\{\bar{\mathbf{z}}_1, \dots, \bar{\mathbf{z}}_N\}$, i.e.,

$$\mathcal{V}_j = \{\mathbf{v}_{1,j}, \dots, \mathbf{v}_{N,j}\} = \{\bar{\mathbf{z}}_j, \bar{\mathbf{z}}_{j+1}, \dots, \bar{\mathbf{z}}_N, \bar{\mathbf{z}}_1, \dots, \bar{\mathbf{z}}_{j-1}\}, \qquad j = 1, \dots, N, \qquad (12)$$

where each set $\mathcal{V}_j$ denotes one of the $N$ possible circular permutations of $\{\bar{\mathbf{z}}_1, \dots, \bar{\mathbf{z}}_N\}$. In order to preserve the ergodicity, each $\bar{\mathbf{z}}_h$ is drawn from a different set of tries $\mathcal{Z}_h$. More specifically, before a block of $N$ iterations, $NL$ tries are drawn from $\psi(\mathbf{x})$, yielding $N$ different sets $\mathcal{Z}_h = \{\mathbf{z}_h^{(1)}, \dots, \mathbf{z}_h^{(L)}\}$, for $h = 1, \dots, N$, each one containing $L$ elements. Then, one sample $\bar{\mathbf{z}}_h$ is resampled from each $\mathcal{Z}_h$ with probability proportional to the corresponding importance weight, and the $N$ circular permutations in Eq. (12) are created considering $\{\bar{\mathbf{z}}_1, \dots, \bar{\mathbf{z}}_N\}$. The complete BI-MTM algorithm is detailed in Table 6 and further considerations are provided in Appendix D. In Table 6, we have denoted the acceptance probability as $\alpha_n(\mathbf{x}_{n,t-1}, \mathbf{v}_{n,j})$ to remark the two possible future states of the $n$-th chain at the $t$-th iteration. Figure 4 depicts a schematic sketch of the different steps of one block within the BI-MTM algorithm. Moreover, Figure 8 provides a graphical comparison among different parallel MTM approaches. BI-MTM requires only $N$ multinomial sampling steps for each block, i.e., for $N$ iterations, instead of the $N^2$ of P-MTM in Table 5. Moreover, BI-MTM is completely parallelizable. Indeed, alternatively one could draw all the needed samples from $\psi(\mathbf{x})$, perform all the multinomial sampling steps within the different sets, and then run the parallel iterations of the chains, i.e., one unique block, using the circular permutations of the resampled tries (previously obtained). The reduction in the computational cost is obtained at the expense of a moderate decrease in performance.

Figure 4: A graphical representation of one block within the BI-MTM technique, described in Table 6. One specific transition of one MTM chain is represented with the probability $\alpha_n(\mathbf{x}_{n,t-1}, \mathbf{v}_{n,j})$, showing the two possible future states at the $t$-th iteration of the $n$-th chain. One block is formed by $N$ transitions.
  1. Let $N$ be the total number of parallel MTM chains and $T_H$ be the total number of iterations of each chain, such that $\frac{T_H}{N} \in \mathbb{N}$. Choose a number of tries $L$. Set $t_0 = (m-1)(T_V + T_H) + T_V + 1$ if BI-MTM is used within O-MCMC. Otherwise, set $t_0 = 1$.

  2. For each block $b = 1, \dots, \frac{T_H}{N}$ do:

    1. Draw $L$ i.i.d. candidates $\mathcal{Z}_h = \{\mathbf{z}_h^{(1)}, \dots, \mathbf{z}_h^{(L)}\}$, with $\mathbf{z}_h^{(\ell)} \sim \psi(\mathbf{x})$, for $h = 1, \dots, N$.

    2. Draw one sample $\bar{\mathbf{z}}_h$ from each set $\mathcal{Z}_h$, for $h = 1, \dots, N$, with probability

      $$\bar{\beta}_\ell^{(h)} = \frac{\pi(\mathbf{z}_h^{(\ell)}) / \psi(\mathbf{z}_h^{(\ell)})}{\sum_{i=1}^{L} \pi(\mathbf{z}_h^{(i)}) / \psi(\mathbf{z}_h^{(i)})}, \qquad \ell = 1, \dots, L.$$

      Thus, finally we have $N$ different samples, $\bar{\mathbf{z}}_1, \dots, \bar{\mathbf{z}}_N$, such that $\bar{\mathbf{z}}_h \in \mathcal{Z}_h$ for $h = 1, \dots, N$.

    3. Create the $N$ circular permutations $\mathcal{V}_j$, $j = 1, \dots, N$, as defined in Eq. (12).

    4. For $j = 1, \dots, N$ (i.e., exactly $N$ transitions):

      1. Set $t = t_0 + (b-1)N + j - 1$ (so that $N$ consecutive values of $t$ are covered in one block).

      2. For $n = 1, \dots, N$:

        1. Set $\mathbf{x}_{n,t} = \mathbf{v}_{n,j}$, with probability

          $$\alpha_n(\mathbf{x}_{n,t-1}, \mathbf{v}_{n,j}) = \min\left[1, \frac{\sum_{\ell=1}^{L} \pi(\mathbf{z}_\kappa^{(\ell)}) / \psi(\mathbf{z}_\kappa^{(\ell)})}{\sum_{\ell=1}^{L} \pi(\mathbf{z}_\kappa^{(\ell)}) / \psi(\mathbf{z}_\kappa^{(\ell)}) - \pi(\mathbf{v}_{n,j}) / \psi(\mathbf{v}_{n,j}) + \pi(\mathbf{x}_{n,t-1}) / \psi(\mathbf{x}_{n,t-1})}\right],$$

          where $\mathcal{Z}_\kappa$ denotes the set from which $\mathbf{v}_{n,j}$ was resampled. Otherwise, set $\mathbf{x}_{n,t} = \mathbf{x}_{n,t-1}$.

      3. Set $\mathcal{P}_t = \{\mathbf{x}_{1,t}, \dots, \mathbf{x}_{N,t}\}$.

Table 6: Block Independent Multiple Try Metropolis (BI-MTM) algorithm for $N$ parallel chains.
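As a small illustration of Eq. (12), the following Python sketch generates the $N$ circular permutations of the resampled tries; the indexing convention (chain n tests try $(n+j) \bmod N$ at block-iteration j) is our reconstruction of the rotation.

```python
import numpy as np

# Circular permutations of Eq. (12): given the N resampled tries
# z_bar[0..N-1], row n lists the sequence of tries tested by chain n over the
# N iterations of one block, so each chain tests every resampled try once.

def circular_permutations(z_bar):
    N = len(z_bar)
    # v[n, j] = z_bar[(n + j) % N]
    return np.array([[z_bar[(n + j) % N] for j in range(N)] for n in range(N)])

print(circular_permutations(np.array([1, 2, 3])))
# [[1 2 3]
#  [2 3 1]
#  [3 1 2]]
```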

5.3 Computational Cost

In general, the most costly steps are those requiring the evaluation of the target pdf, especially for complex models or a large number of data. The number of new evaluations of the target in one horizontal period is $T_H$ for SMH in Table 1, whereas it is $L T_H$ in P-EnM and P-MTM (considering, in all cases, only the new evaluations at each iteration; the others can be automatically reused). Using SMH, $T_H$ multinomial sampling steps are performed, each one over a population of $N$ samples. In P-EnM and P-MTM, $N T_H$ multinomial sampling steps are required, each one over a set of $L$ samples (or $L + 1$ in P-EnM, including the previous state). The total number of evaluations of the target, $E$, including the vertical transitions, is $E = M(N T_V + T_H)$ when SMH is employed in the horizontal steps, or $E = M(N T_V + L T_H)$ when P-EnM and P-MTM are employed. Furthermore, in BI-MTM, we have again $E = M(N T_V + L T_H)$, but only $T_H$ multinomial sampling steps per horizontal period instead of $N T_H$. Note also that in a standard parallel multiple try approach we would have $N L T_H$ evaluations of the target per horizontal period and $N T_H$ multinomial sampling steps, each one over a set of $L$ samples. Finally, we remark that, using SMH, we perform one acceptance test at each step, i.e., $T_H$ in one horizontal period. Using a multiple candidates scheme, we employ $N$ acceptance tests at each step, i.e., $N T_H$ in one horizontal period. All these considerations are summarized in Table 7. For further details and observations, see Appendix D.1.1.

Computational features                       SMH         P-MTM        BI-MTM       Standard par. MTM
Total number of multinomial sampling steps   $T_H M$     $N T_H M$    $T_H M$      $N T_H M$
Cardinality of the set for
the multinomial sampling                     $N$         $L$          $L$          $L$
Total number of acceptance tests             $T_H M$     $N T_H M$    $N T_H M$    $N T_H M$

Table 7: Computational cost of O-MCMC given different horizontal schemes.

5.4 Relationship with other techniques

The techniques in Tables 4 and 5 are particularly interesting, since they involve the use of resampling steps without jeopardizing the ergodicity of the resulting global O-MCMC process. Moreover, the SMH algorithm in Table 1 employs an inverted resampling scheme, since a sample in the population is chosen to be replaced with probability proportional to the inverse of the importance weights. Other methodologies in the literature employ a combination of MCMC iterations and resampling steps, for instance, the iterated batch importance sampler (IBIS) and sequential Monte Carlo (SMC) samplers for a static scenario [27, 28]. Their underlying idea could be interpreted as belonging to the O-MCMC philosophy: in these methodologies, the resampling steps are seen as a "horizontal" approach for exchanging information within the population. The resampling procedure generates samples from the particle approximation

$$\hat{\pi}(\mathbf{x}) = \sum_{\ell=1}^{L} \bar{\beta}_\ell\, \delta(\mathbf{x} - \mathbf{z}^{(\ell)}) \qquad (13)$$

of the measure of $\bar{\pi}(\mathbf{x})$, where $\delta$ denotes the Dirac delta and the weights $\bar{\beta}_\ell$ (or, similarly, those in [36]) and the samples $\mathbf{z}^{(\ell)}$ are defined in Eq. (10) in Table 5. The quality of this approximation improves as the number of samples $L$ grows. However, for a finite value of $L$ there exists a discrepancy which can produce problems in the corresponding sampling algorithm. For further details see Appendix B. One main issue is the loss of diversity in the population.

This problem is reduced drastically in O-MCMC, since the ergodicity is ensured in both the vertical and the horizontal movements. This improvement in the performance is obtained at the expense of an increase in the computational cost. For instance, let us consider the use of SMH in the horizontal transitions. The cloud of samples is not impoverished by the application of SMH, even if a poor choice of the proposal is made. In the worst case, the newly proposed samples are always discarded and computational time is wasted. In the best case, a proposal located in a low probability region can jump close to a mode of the target. Clearly, in the mixture-based multiple try approach, it is better to choose a large number of tries $L$ in order to foster the preservation of diversity. Moreover, in the mixture approach, the mixture $\psi(\mathbf{x})$ is built using the states in $\mathcal{P}_{t-1}$ as location parameters, and then it does not change for the next $T_H$ horizontal steps. Thus, the information contained in those states is employed in the next iterations even if some states are not well-located. To clarify this point, consider for instance the basic scheme in Table 2. The mixture $\psi(\mathbf{x})$ does not change, so the information provided by the population at the iteration $t-1$ is still used in the subsequent iterations. This feature is also the main difference between the scheme in Table 2 and the NKC-type methods [32], where one component of the mixture is relocated after each iteration.

5.5 Joint adaptation of the proposal densities

Let us denote as $\mathbf{C}_n$ and $\boldsymbol{\Lambda}_n$ the covariance matrices of the vertical and horizontal proposal pdfs, respectively. In order to design an algorithm as robust as possible, we suggest keeping the scale parameters $\mathbf{C}_n$ fixed for the vertical proposal pdfs $q_n$, to avoid a loss of diversity within the set of chosen variances. However, if desired, they could be easily adapted as suggested in [7]. On the other hand, we suggest adapting the scale parameters $\boldsymbol{\Lambda}_n$ of the horizontal proposal pdfs $\varphi_n$, $n = 1, \dots, N$, since it is less delicate. Indeed, let us recall that a poor choice of the $\boldsymbol{\Lambda}_n$'s entails an increase in the computational cost, but the diversity in the cloud of samples is always preserved. Several strategies have been proposed in [5, 37] and [7] for adapting proposal functions online within MCMC schemes. For the sake of simplicity, we discuss separately the cases of the population-based and the mixture-based approaches.

  • Adaptation within SMH: in this case, the strategies in [37, 7] are appropriate. Thus, after a training period $t_{tr}$, all the generated samples (i.e., $\mathbf{x}_{n,t}$ for each $n$ and from all the chains) can be used to adapt the location and scale parameters of the independent proposal pdf $\varphi(\mathbf{x}) = \varphi(\mathbf{x}|\boldsymbol{\mu}_t, \boldsymbol{\Lambda}_t)$. Namely, we can use the following approach (a recursive code sketch is given after this list):

    • If $t \leq t_{tr}$: set $\boldsymbol{\mu}_t = \boldsymbol{\mu}_0$ and $\boldsymbol{\Lambda}_t = \boldsymbol{\Lambda}_0$ (where $\boldsymbol{\mu}_0$ and $\boldsymbol{\Lambda}_0$ are the initial choices).

    • If $t > t_{tr}$: set $\boldsymbol{\mu}_t$ equal to the empirical mean of all the generated samples, and $\boldsymbol{\Lambda}_t = \hat{\mathbf{C}}_t + \mathbf{C}^*$, where $\hat{\mathbf{C}}_t$ is the empirical covariance matrix and $\mathbf{C}^*$ is a chosen covariance matrix (e.g., a small diagonal matrix ensuring positive definiteness). The empirical mean and covariance matrix estimators can also be computed recursively [5].

  • Adaptation of the mixture $\psi(\mathbf{x})$: the methods in Section 5.2 employ a mixture $\psi(\mathbf{x})$. In this case, every component $\varphi_n$ should be adapted, jointly with the weights of the mixture. A possible (and simple) adaptation scheme is provided in [5], where all the parameters of the mixture are updated online. The method in [5] can be easily reformulated for a framework with $N$ parallel chains. In this case, the states of the parallel chains are divided into $N$ different clusters according to the Euclidean distance between them and the location parameters of the components in the mixture $\psi(\mathbf{x})$. Then, the new centroids (i.e., location parameters), covariance matrices and weights are updated according to the mean, covariance and cardinality of each cluster, respectively.
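The recursive computation of the empirical mean and covariance mentioned above can be sketched as follows (a Welford-style update; the regularization term $\mathbf{C}^*$, here a small diagonal matrix, follows common adaptive-MH practice and is our assumption).

```python
import numpy as np

# Recursive (Welford-style) mean/scatter update, usable to adapt the location
# and scale of the independent proposal phi after the training period. The
# small diagonal jitter eps * I keeps the covariance positive definite; it is
# a common adaptive-MH convention, not a prescription taken from the paper.

def update_moments(mu, M2, x, t):
    """Incorporate the t-th sample x (t >= 1) into the running mean/scatter."""
    delta = x - mu
    mu = mu + delta / t
    M2 = M2 + np.outer(delta, x - mu)       # scatter matrix update
    return mu, M2

def proposal_params(mu, M2, t, eps=1e-6):
    """Return the adapted location and (regularized) covariance of phi."""
    cov = M2 / max(t - 1, 1) + eps * np.eye(len(mu))
    return mu, cov
```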

6 O-MCMC for optimization and big data context

The O-MCMC schemes can be easily modified, converting them into stochastic optimization algorithms. Indeed, it is possible to replace the vertical MH chains with parallel simulated annealing (SA) methods [24, 25]. Let us denote as $\gamma_t$ a finite scale parameter that is a decreasing function of $t$, approaching zero for $t \to \infty$, i.e.,

$$\gamma_t > 0, \qquad \lim_{t \to \infty} \gamma_t = 0, \qquad (14)$$

for all $t \in \mathbb{N}$. Moreover, for the sake of simplicity, we consider symmetric proposal functions $q_n(\mathbf{x}|\mathbf{x}_{n,t-1})$. Then, one transition of the $n$-th SA chain is described below:

  1. Draw $\mathbf{x}' \sim q_n(\mathbf{x}|\mathbf{x}_{n,t-1})$.

  2. Set $\mathbf{x}_{n,t} = \mathbf{x}'$ with probability

    $$\alpha = \min\left[1, \frac{[\pi(\mathbf{x}')]^{1/\gamma_t}}{[\pi(\mathbf{x}_{n,t-1})]^{1/\gamma_t}}\right].$$

    Otherwise, i.e., with probability $1 - \alpha$, set $\mathbf{x}_{n,t} = \mathbf{x}_{n,t-1}$.

Note that, with respect to the MH algorithm, we have replaced the target $\pi(\mathbf{x})$ with a modified target $[\pi(\mathbf{x})]^{1/\gamma_t}$, with modes that become sharper and narrower as we reduce the scale parameter $\gamma_t$. Note also that movements such that $\pi(\mathbf{x}') \geq \pi(\mathbf{x}_{n,t-1})$ are always accepted, whereas movements leading to $\pi(\mathbf{x}') < \pi(\mathbf{x}_{n,t-1})$ are accepted with probability $\alpha = [\pi(\mathbf{x}') / \pi(\mathbf{x}_{n,t-1})]^{1/\gamma_t}$. This probability vanishes as $\gamma_t \to 0$ (guaranteeing the convergence to the global maximum as $t \to \infty$). In the same fashion, the modified target $[\pi(\mathbf{x})]^{1/\gamma_t}$ is employed in the horizontal transitions of the mixture-based approach, whereas for the horizontal steps of the population-based approach we consider the modified extended target,

$$\bar{\pi}_g(\mathbf{x}_1, \dots, \mathbf{x}_N) \propto \prod_{n=1}^{N} [\pi(\mathbf{x}_n)]^{1/\gamma_t}, \qquad (15)$$

so that all the schemes previously described can be automatically applied. Several possible decreasing functions $\gamma_t$ have been suggested in [24, 26, 25]; a code sketch of one SA transition, with one such cooling schedule, is given below.
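A minimal Python sketch of one SA transition follows; the geometric cooling schedule $\gamma_t = \gamma_0 r^t$ is one common choice from the SA literature, not necessarily the one adopted in [24, 26, 25].

```python
import numpy as np

# One simulated-annealing transition: random-walk proposal accepted with
# probability min(1, [pi(x')/pi(x)]^(1/gamma_t)). The geometric cooling
# schedule gamma_t = gamma0 * r**t is an illustrative assumption.

def sa_step(log_target, x, t, rng, sigma=1.0, gamma0=1.0, r=0.99):
    gamma_t = gamma0 * (r ** t)                              # decreasing scale
    x_prop = x + sigma * rng.standard_normal(np.shape(x))
    log_ratio = (log_target(x_prop) - log_target(x)) / gamma_t
    if np.log(rng.uniform()) < log_ratio:                    # uphill moves always accepted
        return x_prop
    return x
```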

6.1 O-MCMC for big data and data-tempered distributions

In a big data context, the posterior density is typically split into $N$ partial posteriors, $\pi_n(\mathbf{x})$ with $n = 1, \dots, N$, each one considering a disjoint sub-set $\mathbf{y}_n$ of the observed data, and such that

$$\pi(\mathbf{x}) \propto \prod_{n=1}^{N} \pi_n(\mathbf{x}).$$

The key idea of these "divide-and-conquer" approaches is to generate samples from the partial posteriors $\pi_n(\mathbf{x})$, and then to combine them in some way to approximate the complete target [13]. Here, we remark that the O-MCMC philosophy can be applied in this framework, in order to improve the mixing of each chain. Thus, let us consider the application of $N$ parallel (vertical) MCMC chains, each one addressing a different partial target $\pi_n(\mathbf{x})$. Since each subset of observed data provides information about the same phenomenon, an interaction among the parallel chains can improve the mixing of the different MCMC techniques. In this context, the use of a mixture pdf $\psi(\mathbf{x})$, as described in the O-MCMC approach of Section 5.2, seems appropriate for providing a cooperative interaction among the parallel chains. Following similar observations, O-MCMC can be applied within the so-called "data point tempered" techniques [27], where a sequence of posteriors $\pi_1(\mathbf{x}), \pi_2(\mathbf{x}), \dots, \pi_P(\mathbf{x})$ with an increasing number of data is considered (typically, the last one, $\pi_P(\mathbf{x})$, involves all the data).

7 Numerical simulations

7.1 Multimodal target distribution

In this section, we consider a bivariate multimodal target pdf, which is itself a mixture of Gaussian pdfs, i.e.,

$$\bar{\pi}(\mathbf{x}) \propto \sum_{i=1}^{k} \mathcal{N}(\mathbf{x}; \boldsymbol{\nu}_i, \boldsymbol{\Sigma}_i), \qquad \mathbf{x} \in \mathbb{R}^2, \qquad (16)$$

with means