1 Introduction
Monte Carlo (MC) methods are widely employed in different fields, for computational inference and stochastic optimization [1, 2, 3, 4]. Markov Chain Monte Carlo (MCMC) methods [4]
are wellknown MC methodologies to draw random samples and efficiently compute integrals involving a complicated multidimensional target probability density function (pdf),
with . MCMC techniques only need to be able to evaluate the target pdf, but the difficulty of diagnosing and speeding up the convergence has driven intensive research effort in this field. For instance, several adaptive MCMC methods have been developed in order to determine adequately the shape and spread of the proposal density used to generate candidate samples within an MCMC scheme [5, 6, 4, 7]. Nevertheless, guaranteeing the theoretical convergence is still an issue in most of the cases. Moreover, in a single specific (long) run, the generated chain can remain trapped in a local mode and, in this scenario, the adaptation could even slow down the convergence. Thus, in order to speed up the exploration of the state space (and specially to deal with highdimensional applications [8]), several schemes employing parallel chains have been recently proposed [2, 7], as well as multiple try and interacting schemes [9], but the problem is still far from being solved. The interest in the parallel computation can be also originated by other motivations. For instance, several authors have studied the parallelization of MCMC algorithms, which have traditionally been implemented in an iterative nonparallel fashion, in order to reduce their computation time [10, 11]. Furthermore, parallel MCMC schemes are required in a big data problems, where one possible approach consists of splitting the complete posterior distribution in different partial subposteriors [12, 13, 14, 15, 16].In this work, we focus mainly on the implementation of parallel MCMC chains in order to foster the exploration of the state space and improve the overall performance.^{1}^{1}1A preliminary version of this work has been published in [17]. With respect to that paper, here we propose several novel interacting schemes for exchanging information among the chains, analyze the theoretical basis of the proposed approach and discuss its relationships with other techniques, in detail. Different variants are presented for reducing the overall computational cost and for applying OMCMC in optimization problems. Furthermore, we provide more exhaustive numerical simulations. More specifically, we present a novel family of parallel MCMC schemes, called orthogonal MCMC (OMCMC) algorithms, where different chains are independently run and, at some prespecified iterations, they exchange information using another MCMC technique applied on the entire cloud of current states. Assuming that all the MCMC techniques used yield chains converging to the target pdf, the ergodicity of the global scheme is guaranteed: the whole kernel is still valid, since it is obtained as the multiplication of ergodic kernels with the same invariant pdf. Fixing the computational cost, this computing effort can be divided in parallel processes but, at some iteration, information among the chains is exchanged in order to enhance the overall mixing. Let us remark also that the novel OMCMC scheme is able to combine efficiently both the randomwalk and the independent proposal approaches, as both strategies have advantages and drawbacks. On the one hand, randomwalk proposal pdfs are often used when there is no specific information about the target, since this approach turns out to be more explorative than using a fixed proposal. On the other hand, a wellchosen independent proposal density usually provides less correlation among the samples in the generated chain. In the novel method, the parallel “vertical” chains (based on randomwalk proposals) move around as “free explorers” roaming the state space, whereas the “horizontal” MCMC technique (applied over the population of current states and based on independent proposals) works as a “park ranger”, redirecting “lost explorers” towards the “beaten track” according to the target pdf. Unlike in [18], the exchange of information occurs taking always into account the whole population of current states, instead of applying crossover or exchange schemes between specific pairs of chains. Furthermore, in this work tempering of the target pdf is not considered for sampling purposes (but, it is employed for optimization). Hence, in this sense our approach resembles the nonreversible parallel MH algorithms described in [19, 20], where the whole population of states is also updated jointly at the times of interaction, pursuing nonreversibility instead of tempering as a means to accelerate convergence towards posterior mode regions. However, both tempering and crossovers could also be easily implemented within the OMCMC framework.
Another contribution of the work is the computational improvement provided by novel parallel implementations of MCMC techniques using multiple candidates at each iteration. We present two novel schemes for parallel Multiple try Metropolis (MTM) [21, 22, 23] chains in order to reduce the overall computational cost in the same fashion of [10], saving generated samples, target evaluations and multinomial sampling steps. One of them is an extended version, using several candidates, of the Block Independent Metropolis presented in [10]. The ergodicity of both schemes is guaranteed. These novel parallel MTM techniques are employed as horizontal methods in OMCMC. The corresponding OMCMC scheme (using a novel parallel MTM method) can also be interpreted as an MTM algorithm employing an adaptive proposal density. This pdf is a mixture of components: the adaptation of the location parameters of the components is driven by the vertical parallel chains (note that the outputs of these chains are also used in the estimation). Furthermore, we describe a modified version of OMCMC for solving optimization problems (where we employ tempering of the target), considering parallel Simulated Annealing algorithms [24, 25, 26] for the vertical movements. The application for big data problems is also discussed. Numerical simulations show that OMCMC exhibits both flexibility and robustness with respect to the initialization and parameterization of the proposals.
The paper is structured as follows. Section 2 summarizes the general framework and the aim of the work. Section 3 describes a generic complete OMCMC scheme, whereas Sections 4 and 5 provide different specific examples of vertical and horizontal movements, respectively. Section 6 discusses the OMCMC framework for optimization and big data problems. Section 7 provides different numerical results. Finally, we finish with some remarks in Section 8.
2 Bayesian inference problem
In many applications, we aim at inferring a variable of interest given a set of observations or measurements. Let us denote the variable of interest by , and let be the observed data. The posterior pdf is then
(1) 
where is the likelihood function, is the prior pdf and is the model evidence (a.k.a. marginal likelihood). In general, is unknown, so we consider the corresponding unnormalized target function,
(2) 
In general, the analytical study of the posterior density is unfeasible (for instance, integrals involving are typically intractable), hence numerical approximations are required. Our goal is to approximate efficiently employing a cloud of random samples. In general, a direct method for drawing independent samples from is not available and alternative approaches (e.g., MCMC algorithms) are needed. The only required assumption is to be able of evaluating the unnormalized target function .
3 OMCMC algorithms: General outline
Let us consider parallel “vertical” chains, with , generated by different MCMC techniques with randomwalk proposal pdfs , i.e., plays the role of a location parameter for the proposal pdf used in the next iteration. Let us denote the population of current states at the th iteration as
At certain selected iterations, we apply another MCMC technique taking into account the entire population of states , yielding a new cloud of samples . In this “horizontal” transitions, the different chains share information. The horizontal MCMC technique uses a proposal pdf which is independent from the previous states, unlike the random walk proposals employed in the vertical MCMC chains. The general OMCMC approach is represented graphically in Figure 1 and summarized below:

Initialization: Set . Choose the initial states,
the total number of iterations, , and three positive integer values such that .

For m=1,…,M:

Vertical period: For
run independent MCMC techniques starting from , thus obtaining , for , i.e., a new population of states .

Horizontal period: For
apply an MCMC approach taking into account the entire population to generate the next cloud .


Output: Return the samples contained in all the sets , .
One vertical period contains iterations of the chains, whereas in one horizontal period we have iterations. Hence, given , after one cycle of vertical and horizontal steps we have
. The total number of cycles (or epochs)
^{2}^{2}2One cycle, or epoch, includes one the vertical period and one horizontal period. is then . The ergodicity is guaranteed if the vertical and horizontal steps produce ergodic chains with invariant density (see Appendix A for further details). Relationships with other techniques, for instance [27, 28], are discussed in Section 5.4. In the following two sections, we introduce different specific implementations of OMCMC.4 Vertical Movements
In this section, we describe different implementations of the vertical parallel chains. Although it is not strictly necessary, we consider only random walk proposal densities in the vertical chains. The idea is to exploit predominantly the explorative behaviors of the independent parallel MCMC methods. Therefore, we only consider proposals of type . In this case, a sample can be expressed as
(3) 
where . Another more sophisticated possibility is to include the gradient information of the target within the proposal pdf, as suggest in the MetropolisAdjusted Langevin Algorithm (MALA) [29]. In this case a sample becomes
(4) 
where and denotes the gradient of a generic function . This second alternative can be particularly useful in highdimensional spaces, although it inevitably increases the probability of the chain becoming trapped in one mode of the target, in a multimodal scenario. Thus, the joint application of parallel chains appears very appropriate in this scenario, since they can easier reach different modes of the target. Moreover, the application of the OMCMC scheme facilitates the jumps among different modes.
Regarding the MCMC algorithm, note that the random walk proposal density can be applied within different MCMC kernels. The simplest possibility is using a MetropolisHastings (MH) algorithm [4]. For each and for a given time step , one MH update of the th chain is obtained as

Draw .

Set with probability
Otherwise, i.e., with probability , set .
Many other schemes can be used as alternative to MH kernel for the vertical chains: for instance, two particularly appealing alternatives are the Multiple Try Metropolis (MTM) [21, 23] and the Delayed Rejection Metropolis [30] techniques.
5 Horizontal Movements
As described above, at each iteration , the vertical chains return a population . When , with , i.e., after vertical transitions, then horizontal steps are performed. The purpose of these horizontal MCMC transitions is to exchange information among the different chains, improving the global mixing. In the following, we consider two different general approaches for sharing the information among the chains:

In the first one, a populationbased MCMC algorithm is applied. The states of the vertical chains contained in are used as the initial population. Furthermore, the populationbased MCMC scheme takes into account all the current population for making decisions about the next population.

In the second one, named as mixturebased approach, the initial population is also used for building a suitable density . This pdf is employed as proposal by the parallel MCMC chains for yielding the next populations . More specifically, in this work we suggest to construct as a mixture of pdfs, each one centered in .
We show several specific examples. In all the different cases, for the horizontal movements we consider the use of independent proposal pdfs, unlike for the vertical ones, where we have suggested the use of random walk proposals.
5.1 Populationbased approach
Here, we consider a generalized target density,
(5) 
where each marginal, for and , coincides with the target pdf in Eq. (2). The idea is that the horizontal MCMC transitions leave invariant the extended target . Namely, after a “burnin” period, the population is distributed according to . The simplest possible populationbased scheme consists of employing an standard MetropolisHastings (MH) algorithm directly in the extended domain, , with target , generating (block) transitions from to . However, the probability of accepting a new population in this case becomes negligible as grows. As an alternative example of a populationbased scheme, we consider the Sample MetropolisHastings (SMH) method [31, Chapter 4]. At each iteration, the underlying idea of SMH is replacing one “bad” sample in the population with a “better” one, according to a certain suitable probability. The new sample, candidate of be incorporated in the population, is generated from and independent proposal pdf . The algorithm is designed so that, after a “burnin” period , the elements in () are distributed according to in Eq. (5). Table 1 provides a deitailed description of the algorithm.

Let us remark that the difference between and is at most one sample, and the acceptance probability, , depends on the entire population, for , and the new candidate sample . At each step, the sample chosen to be replaced is selected according to a probability proportional to the inverse of the corresponding importance weight. The ergodicity can be proved considering the extended density as target pdf (see Appendix C). Note also that the SMH algorithm becomes the standard MH method for . Hence, for the specific OMCMC implementation using SMH consists of applying alternatively two MH kernels with different types of proposals: a random walk proposal, , and an independent one, . This a wellknown scheme (cf. [4, 31]), which can be seen as a particular case of the OMCMC family of algorithms.
5.2 Mixturebased approach
An alternative approach consists in defining the following mixture of pdfs, updated every vertical transitions,
(7) 
where and each plays the role of the location parameter of the th component of the mixture, . Observe that changes from one horizontal period to the next one (since it depends on the final population of the vertical period), but then it remains fixed within the iterations of each horizontal period. Thus, during the complete OMCMC run we employ different mixtures, , one for each horizontal period. However, for simplifying the notation, we use . Figure 2 provides a graphical representation. We employ within independent MCMC schemes as an independent proposal density, namely independent from the previous state of the chain. The underlying idea is using the information in , with , to build a good proposal function for performing independent MCMC processes. The theoretical motivation is that, after the burnin periods, the vertical chains have converged to the target, so for . Then, in Eq. (7
) can be interpreted as a kernel density estimation of
, where play the role of the kernel functions.5.2.1 Basic schemes
As a first example of this strategy, we consider the application of MH transitions. At each iteration , one sample is generated from and then different MH tests are performed. The procedure is shown in Table 2 and represented in Figure 3. Alternatively, a different sample , drawn from , can be tested for each chain, as shown in Table 3. Hence, different samples are drawn at each iteration (instead of only one) but, after building , the process could be completely parallelized. The variant in Table 3 provides in general better performance, although at the expense of a increasing computational cost in terms of evaluations of the target and number of generated samples. However, the block independent MH methodology [10], proposed for reducing the computational effort by recycling generated samples and target evaluations, can be employed. For clarifying that, let us consider for simplicity . Step 2(a) in Table 2 could be modified by drawing only independent samples from and, at each iteration , a different circular permutation of the set could be tested in the different acceptance MH tests^{3}^{3}3For further clarifications, see the extension of this scheme for a Multiple Try Metropolis method described in Section 5.2.3.. Note that, the scheme in Table 2 yields dependent chains, whereas the algorithm in Table 3 produces independent chains (the interaction, in this case, is only contained in the construction of the mixture at the beginning of the horizontal period). Finally, observe that the procedure in Table 2 presents certain similarities with the Normal Kernel Coupler (NKC) method introduced in [32], thus indicating that NKCtype algorithms can be also employed as alternative populationbased approaches.


5.2.2 Schemes based on multiple candidates
More advanced techniques can also be modified and used as horizontal methods. More specifically, the adaptation to this scenario of multiple try schemes is particularly interesting. For instance, we adjust two special cases^{4}^{4}4They are special cases of the corresponding algorithms, since an independent proposal pdf is used. of the Ensemble MCMC (EnM) [33] and Multiple Try Metropolis (MTM) methods [21, 34, 23] for fitting them within OMCMC. Tables 4 and 5 summarize them. Note that standard parallel EnM and MTM chains can be considered, however we suggest two variants for reducing the computational cost. In both cases, different i.i.d. samples, , are draw from . In the parallel Ensemble MCMC (PEnM) scheme, at each iteration , one resampling step per chain is performed, considering the set of samples , , using importance weights. In the parallel MTM (PMTM) scheme, at each iteration , resampling steps are performed considering the set of candidates and the new possible states are tested (i.e., accepted or not) according to suitable acceptance probabilities , , involving also the previous states .


The ergodicity of both schemes is discussed in Appendix D. The algorithms in Tables 45 are obtained by a rearrangement of the basic schemes in [33, 21, 34] in order to generate, at each iteration , new states for independent parallel chains. The new states of the chains are selected by filtering the same set of candidates , drawn from the same independent proposal pdf . Note that, with respect to a standard parallel approach, they require less evaluations of the target pdf: at each iteration, the algorithms in Tables 45 require new evaluations of the target instead of the target evaluations required by a standard parallel approach. For further explanations, see Appendix D.1.1 and Figure 8. With , the algorithm in Table 4 coincides with the application of parallel MH methods with Barker’s acceptance rule [35]. The algorithm in Table 5 with coincides with the scheme presented in Table 2. Although any can be employed, a number of tries is suggested. Note that an important difference with respect to the standard parallel implementation is that the generated chains are no longer independent.
5.2.3 Block Independent Multiple Try Metropolis algorithm
Previously, we have pointed out that with the scheme in Table 5 only evaluations of the target are required at each iteration, instead of as standard parallel approach. The proposed scheme in Table 5 can be also modified in the same fashion of the block independent MH method [10], in order to and reducing the number of multinomial sampling steps, without jeopardizing the ergodicity of the parallel chains. We remark that the corresponding technique, called Block Independent Multiple Try Metropolis (BIMTM), can always be employed when parallel independent MTM are applied (even outside the OMCMC scheme) in order to reduce the overall computational cost. Let us assume that the value is such that the number of total transitions of one chain, , can be divided in blocks. The idea is based on using circular permutations of the resampled set , i.e.,
(12) 
where each set denotes one the possible circular permutations of . In order to preserve the ergodicity, each is drawn from a different set of tries . More specifically, before a block of iterations, tries are drawn from , yielding different sets, for, each one containing elements. Then, one sample is resampled from each with probability proportional to the corresponding importance weight, and the circular permutations in Eq. (12) are created considering . The complete BIMTM algorithm is detailed in Table 6 and further considerations are provided in Appendix D. In Table 6, we have denoted the acceptance probability as for remarking the two possible future states of the th chain at the th iteration. Figure 4 depicts a schematic sketch of the different steps of one block within the BIMTM algorithm. Moreover, Figure 8 provides a graphical comparison among different parallel MTM approaches. BIMTM requires only multinomial sampling steps for each block, i.e., iterations, instead of as in PMTM in Table 5. Moreover, BIMTM is completely parallelizable. Indeed, alternatively one could draw samples from , perform multinomial sampling steps within different sets, and then run the parallel iterations of the chains, i.e., one unique block, using circular permutations of the resampled tries (previously obtained). The reduction in the computational cost is obtained at the expense of a moderate decrease of the performance.

5.3 Computational Cost
In general, the most costly steps are those requiring the evaluation of the target pdf, especially for complex models or a large number of data. The number of evaluations of the target, in one horizontal period, are for SMH in Table 1, whereas in PEnM and PMTM (considering, in all cases, only the new evaluations at each iteration, the others can be automatically reused). Using SMH, multinomial sampling steps are performed, each one over a population of samples. In PEnM and PMTM, multinomial sampling steps are required (with ), each one over a set of samples. The total number of evaluations of the target, , including the vertical transitions, is when the SMH is employed in the horizontal steps, or when PEnM and PMTM are employed. Furthermore, in BIMTM, we have again , but only multinomial sampling steps. Note also that in a standard parallel multiple try approach we would have evaluations of the target and multinomial sampling steps, each one over a set of samples. Finally, we remark that, using SMH, we perform one acceptance test in each step, i.e., in one horizontal period. Using a multiple candidates scheme, we employ acceptance test in one horizontal period. All these considerations are summarized in Table 7. For further details and observations, see Appendix D.1.1.
Computational features  SMH  PMTM  BIMTM  Standard par. MTM 
Total number of  
multinomial sampling steps  
Cardinality of the set for  
the multinomial sampling  
Total number of  
acceptance tests 
5.4 Relationship with other techniques
The techniques in Table 4 and 5 are particularly interesting, since they involve the use of resampling steps without jeopardizing the ergodicity of the resulting global OMCMC process. Moreover, the SMH algorithm in Table 1 employs an inverted resampling scheme, since a sample in the population is chosen to be replaced with probability proportional to the inverse of the importance weights. Other methodologies in the literature employ a combination of MCMC iterations and resampling steps, for instance, iterated batch importance sampler (IBIS) and sequential Monte Carlo (SMC) samplers for a static scenario [27, 28]. Their underlying idea could be interpreted belonging to the OMCMC philosophy: in these methodologies, the resampling steps are seen as a “horizontal” approach for exchanging information within the population. The resampling procedure generates samples from a particle approximation
(13) 
of the measure of , where (or, similarly, [36]) and are defined in Eq. (10) in Table 5, with . The quality of this approximation improves as the number of samples grows. However, for a finite value of there exists a discrepancy which can produce problems in the corresponding sampling algorithm. For further details see Appendix B. One main issue is the loss in diversity in the population.
This problem is reduced drastically in OMCMC, since the ergodicity is ensured in both the vertical and the horizontal movements. This improvement in the performance is obtained at the expense of an increase of the computational cost. For instance, let us consider the use of SMH in horizontal transitions. The cloud of samples is not impoverished by the application of SMH, even if a poor choice of the proposal is made. In the worst case, the newly proposed samples are always discarded and computational time is wasted. In the best case, a proposal located in a low probability region can jump close to a mode of the target. Clearly, In the mixture multiple try approach, it is better to choice for fostering the safeguard of the diversity. Moreover, in the mixture approach, the mixture is built using the states in as location parameters, and then it does not change for the next horizontal steps. Thus, the information contained in the states is employed in the next iterations even if some states are not welllocated. For clarifying this point, consider for instance the basic scheme in Table 2. The mixture does not change, so the information provided by the population at the iteration is still used in the iterations . This feature is also the main difference between the scheme in Table 2 and the NKCtype methods [32], where one component of the mixture is relocated after each iteration.
5.5 Joint adaptation of the proposal densities
Let us denote as and the covariance matrices of the vertical and horizontal proposal pdfs, respectively. In order to design an algorithm as robust as possible, we suggest keeping the scale parameters fixed for the vertical proposal pdfs
, to avoid a loss of diversity within the set of chosen variances. However, if desired, they could be easily adapted as suggested in
[7]. On the other hand, we suggest adapting the scale parameters of the horizontal proposal pdfs , , since it is less delicate. Indeed, let us recall that a poor choice of the ’s entails an increase in the computational cost, but the diversity in the cloud of samples is always preserved. Several strategies have been proposed in [5, 37] and [7], for adapting proposal functions online within MCMC schemes. For the sake of simplicity, we discuss separately the cases of the populationbased or the mixturebased approaches.
Adaptation within SMH: in this case, the strategies in [37, 7] are appropriate. Thus, After a training period , all the generated samples (i.e., for each and from all the chains) can be used to adapt the location and scale parameters of the proposal pdf . Namely, denoting , we can use the following approach:

If : set , (where and are the initial choices).

If : set , and , where is a chosen covariance matrix. The empirical mean and covariance matrix estimators can also be computed recursively [5].


Adaptation of the mixture : the methods in Section 5.2 employ a mixture . In this case, every component should be adapted, jointly with the weights of the mixture. A possible (and simple) adaptation scheme is provided in [5], where all the parameters of the mixture are updated online. The method in [5] can be easily reformulated for a framework with parallel chains. In this case, the states of the parallel chains are divided into different clusters according to the Euclidean distance between them and location parameters of the components in the mixture . Then, new centroids (i.e., location parameters), covariance matrices and weights are updated according to the mean, covariance and cardinality of each cluster, respectively.
6 OMCMC for optimization and big data context
The OMCMC schemes can be easily modified converting them in stochastic optimization algorithms. Indeed, it is possible to replace the vertical MH chains with parallel simulated annealing (SA) methods [24, 25]. Let us denote as a finite scale parameter that is a decreasing function of , approaching zero for , i.e.,
(14) 
for . Moreover, for the sake of simplicity, we consider symmetric proposal functions . Then, one transition of the th SA is described below:

Draw .

Set with probability
Otherwise, i.e., with probability , set .
Note that, with respect to the MH algorithm, we have replaced the target with a modified target , with modes that become sharper and narrower when we reduce the scale parameter . Note also that the movements such are always accepted, whereas movements leading to are accepted with probability . This probability vanishes to zero as (guaranteeing the convergence to the global maximum when ). In the same fashion, the modified target is employed in the horizontal transitions of the mixturebased approach, whereas for the horizontal steps of the populationbased approach we consider the modified extended target,
(15) 
so that all the presented schemes, previously described, can be automatically applied. Several possible decreasing functions have been suggested in [24, 26, 25].
6.1 OMCMC for big data and datatempered distributions
In a Big Data context, the posterior density is typically split into different partial posteriors, with , each one considering a disjoint subsets of the observed data and such that
The key idea of those “divideandconquer” approaches is to generate samples from the partial posteriors , and then to combine them in some way for approximating the complete target [13]. Here, we remark that the OMCMC philosophy can be applied in this framework, in order to improve the mixing of each chain. Thus, let us consider the application of parallel (vertical) MCMC chains, addressing each one a different partial target . Since each observed data provides information about the same phenomenon, an interaction among the parallel chains can be improve the mixing of different MCMC techniques. In this context, the use of a mixture pdf as described in the OMCMC approach of Section 5.2, seems to be appropriate for providing a cooperative interaction among the parallel chains. Using the similar observations, OMCMC can be applied within the so called “data point tempered” techniques [27] where a sequence of posteriors, , ,…,, with an increasing number of data, are considered (typically, the last one contains all the data, i.e., ).
7 Numerical simulations
7.1 Multimodal target distribution
In this section, we consider a bivariate multimodal target pdf, which is itself a mixture of Gaussian pdfs, i.e.,
(16) 
with means