Non-Reversible Parallel Tempering: an Embarassingly Parallel MCMC Scheme

05/08/2019 ∙ by Saifuddin Syed, et al. ∙ 0

Parallel tempering (PT) methods are a popular class of Markov chain Monte Carlo schemes used to explore complex high-dimensional probability distributions. These algorithms can be highly effective but their performance is contingent on the selection of a suitable annealing schedule. In this work, we provide a new perspective on PT algorithms and their tuning, based on two main insights. First, we identify and formalize a sharp divide in the behaviour and performance of reversible versus non-reversible PT methods. Second, we analyze the behaviour of PT algorithms using a novel asymptotic regime in which the number of parallel compute cores goes to infinity. Based on this approach we show that a class of non-reversible PT methods dominates its reversible counterpart and identify distinct scaling limits for the non-reversible and reversible schemes, the former being a piecewise-deterministic Markov process (PDMP) and the latter a diffusion. In particular, we identify a class of non-reversible PT algorithms which is provably scalable to massive parallel implementation, in contrast to reversible PT algorithms, which are known to collapse in the massive parallel regime. We then bring these theoretical tools to bear on the development of novel methodologies. We develop an adaptive non-reversible PT scheme which estimates the event rate of the limiting PDMP and uses this estimated rate to approximate the optimal annealing schedule. We provide a wide range of numerical examples supporting and extending our theoretical and methodological contributions. Our adaptive non-reversible PT method outperforms experimentally state-of-the-art PT methods in terms of taking less time to adapt, as well as providing better target approximations. Our scheme has no tuning parameters and appears in our simulations robust to violations of the theoretical assumption used to carry out our analysis.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Problem formulation. Markov Chain Monte Carlo (MCMC) methods are widely used to approximate expectations with respect to a probability distribution with density known up to a normalizing constant, i.e., where can be evaluated pointwise but the normalizing constant is unknown. Approximating such expectations is of central importance in the vast majority of modern Bayesian analysis scenarios as well as frequentist models with complex random effects. In both cases, can be written as a likelihood times a prior , and the distribution of interest is a posterior distribution over a variable . When the posterior distribution has multiple well-separated modes, highly varying curvature or when one is interested in sampling over combinatorial spaces, standard MCMC algorithms such as Metropolis–Hastings, slice sampling and Hamiltonian Monte Carlo can perform very poorly. This work is motivated by the need for practical methods for these difficult sampling problems and a natural direction is to use multiple cores and/or to distribute the computation.

Figure 1:

Stochastic Even-Odd swaps (SEO, top row) and Deterministic Even-Odd swaps (DEO, bottom row) for PT, ran with

chains (left column) and chains (right column) on a Bayesian change-point detection model applied to text message data [davidson-pilon_bayesian_2015]. Even swap moves (rows labelled ‘E’) propose to exchange states at chains with an even index with the corresponding state in chain . Each such swap is independently accepted (green oblique edges) or rejected (red horizontal edges) according to a Metropolis–Hastings step. Odd swap moves (rows labelled ‘O’) propose between odd index and . The only difference between DEO and SEO is the way these moves are composed: in SEO, they are selected at random, while in DEO, the two are deterministically alternated. For both SEO and DEO, exploration kernels are used between each swap round (not shown). This sequence of moves forms annealing parameter trajectories (paths formed by the red and green edges) in the space of annealing parameters . We show one such paths in bold as a visual aid. Here for simplicity we use equally spaced annealing parameters. From this figure it is evident that this choice is suboptimal: notice that most swaps between the prior and chain are rejected. This is corrected by adaptive tuning (Section 5.4).

Background: Parallel Tempering (PT). One popular approach for multi-core/distributed exploration of complex distributions is Parallel Tempering (PT) which was introduced independently in statistics [geyer1991markov] and physics [hukushima1996exchange]; see also [swendsen1986replica]

for an earlier related proposal. Since its inception, PT remains to this day the go-to “workhorse” MCMC method to sample from complex multi-modal target distributions arising in physics, chemistry, biology, statistics, and machine learning; see, e.g.,

[desjardins2014deep, cho2010parallel, earl2005parallel, andrec2005protein, pitera2003understanding, cheon2008phylogenetic]. A recent empirical benchmark [ballnus2017comprehensive] shows PT methods consistently outperform other state-of-the-art sampling methods in practice.

To sample from the distribution of interest , PT introduces a sequence of auxiliary tempered or annealed probability distributions with densities . The auxiliary distributions are parameterized by an annealing schedule, which consists of an increasing sequence of annealing parameters . This bridge of auxiliary distributions is used to progressively transform samples from the prior (), for which it is often possible to obtain independent samples, into samples from the posterior distribution (), for which only poorly mixing MCMC kernels may be available.

More precisely PT algorithms are based on Markov chains in which the states are -tuples, . The augmented MCMC sampler is designed so that its stationary distribution is given by . At each iteration , PT proceeds by applying in parallel MCMC kernels targeting for . We call these model-specific kernels the exploration kernels. The chains closer to the prior chain (i.e. those with annealing parameter close to zero) can traverse regions of low probability mass under while the chain at ensures that asymptotically we obtain samples from the distribution of interest. Frequent communication between the chains at the two ends of the spectrum is therefore critical for good performance, and achieved by proposing to swap the states of chains at adjacent annealing parameters. These proposals are accepted or rejected according to a Metropolis mechanism inducing a random permutation of the components of x.

Background: Tuning PT. The effectiveness of PT is determined by how quickly the swapping scheme can transfer information from the prior chain to the posterior chain. There have been many proposals made to improve this information transfer by adjusting the annealing schedule; see, e.g., [kone_selection_2005, atchade_towards_2011, miasojedow2013adaptive] or adaptively reducing annealing parameters; see, e.g., [lkacki2016state]. These proposals are useful but do not address a crucial limitation of PT, illustrated in the top row of Figure 1: in standard PT algorithms, each annealing parameter trajectory (shown in bold in Figure 1 and formally defined in Section 2.3) exhibits a diffusive behaviour, hence we can expect that when is large it takes roughly swap attempts for a state at to reach [diaconis2000analysis]. The user thus faces a trade-off. If is too large, the acceptance probabilities of the swap moves are high but it takes a time of order for a state at to reach . If is too low, the acceptance probabilities of swap moves deteriorate resulting in poor mixing between the different chains. Informally, even in a multi-core or distributed setting, for large, the gains in being able to harness more cores do not offset the cost of the diffusion (see Figures 4, and Section 3.4 where we formalize this argument). As a consequence, the general consensus is that the temperatures should be chosen to allow for about a 20–40% acceptance rate to maximize the square jump distance travelled per swap in the space of annealing parameters [kone_selection_2005, lingenheil2009efficiency, atchade_towards_2011]. Previous work has shown that adding more chains past this threshold actually deteriorates the performance of PT and there have even been attempts to adaptability reduce the number of additional chains [lkacki2016state]. This is a lost opportunity, since PT is otherwise suitable to implementation on multi-core or distributed architectures.

Overview of our contributions. The literature on optimal PT tuning strategies has so far implicitly assumed that the algorithm was reversible and/or serial. Our first contribution is a rigorous, non-asymptotic result showing that a popular non-reversible PT algorithm introduced in the physics literature, Deterministic Even-Odd swap (DEO) [okabe2001replica], is guaranteed to outperform its reversible counterpart, which we call Stochastic Even-Odd swap (SEO); see Figure 1 for an informal introduction to DEO and SEO. This result holds under an efficient local exploration condition that we argue is a reasonable model for scenarios of practical interest. The notion of optimality we analyze, the round trip rate, is closely aligned to the running time of practical PT algorithms; see Section 3.

Our second contribution is the asymptotic analysis of the round trip rate in which the number of parallel chains and cores is taken to infinity (Sections

4-5). This novel asymptotic regime is highly relevant to modern computational architectures such as GPUs and distributed computing, and yields several additional results both theoretical and practical:

  1. In particular, we show in Section 4 that in the non-reversible regime (DEO) for challenging sampling problems one should use at least as many chains as the number of cores available. This contrasts with the reversible algorithm SEO, where adding more chains, even in a multi-core setup, is eventually detrimental. In other words, for this non-reversible PT algorithm, the optimal tuning recommendations are qualitatively different compared to reversible PT algorithms.

  2. While adding more parallel cores to the task improves the performance of non-reversible PT, we show formally that there is a diminishing return in doing so for large . We quantify this diminishing return using both non-asymptotic bounds as well as an asymptotic analysis letting both the dimension of the problem and the number of chains go to infinity (Section 4.1 and 4.2 respectively).

  3. In Section 5 we analyze optimal annealing schedules using our high parallelism asymptotics. We then develop a novel adaptive scheme (Procedure 3), which is both experimentally effective and simple to implement.

Our third contribution is a novel analysis of the scaling limit for the annealing parameter trajectories as the number of parallel chains goes to infinity (Section 6). We show that non-reversible PT scales weakly to a Piecewise Deterministic Markov Process (PDMP) under realistic conditions, contrasting with the diffusive limit we obtain for reversible PT. This offers intuition explaining the fundamental differences observed between the round trip rates for non-reversible and reversible PT as discussed in Section 3.4 and 4.2. The rate parameter of the limiting PDMP is intimately connected to our adaptive scheme and provides more intuition on its behaviour.

Finally in Section 7, we present a variety of experiments to validate and extend our theoretical analysis. We compare the performance of our non-reversible scheme with other state-of-the-art adaptive PT methods. We also provide empirical evidence that our adaptive scheme is robust to situations where a simplifying assumption used to carry out our theoretical analysis is violated.

Literature review. Previous theoretical studies analyzed the asymptotic behaviour of standard PT based on a target consisting of a product of independent components of increasing dimension [atchade_towards_2011], or an increased swap frequency relative to a continuous time sampling process [dupuis_infinite_2011]. We instead let the number of cores available in a massively parallel setup go to infinity. One advantage of our approach is that, in contrast to these previous analyses, we do not need to make assumptions on the structure of neither the target distribution (such as [atchade_towards_2011] where they assume the target distribution is a product of independent and identical distributions and is large) nor the exploration kernels (such as [dupuis_infinite_2011], where the exploration kernel is assumed to be driven by a class of stochastic differential equations).

The DEO algorithm was proposed in [okabe2001replica]. This algorithm was presumably devised on algorithmic grounds (it performs the maximum number of swap attempts in parallel) since no theoretical justification was provided and the non-reversibility of the scheme was not mentioned. The arguments given in [lingenheil2009efficiency] to explain the superiority of DEO communication over various PT algorithms rely on an erroneous assumption, namely a diffusive scaling limit. We show in this work that the scaling limit of non-reversible PT is actually not diffusive as the number of parallel chains goes to infinity. In particular, [lingenheil2009efficiency] still recommends to stop adding chains after a target acceptance rate is achieved.

Another related PT algorithm is the Lifted Parallel Tempering algorithm (LPT), described in [wu_irreversible_2017]; see [sakai_irreversible_2016] for a closely related idea developed in the context of simulated tempering, and also [spill_convective_2013] for an earlier attempt to build a non-reversible PT scheme which was later shown not to be invariant with respect to the distribution of interest [zhang_replica_2014]. These strategies are based on a common recipe to design non-reversible sampling algorithms, which consists in expanding the state space to include a “lifting” parameter that allows for a more systematic exploration of the state space [chen1999lifting, diaconis2000analysis, turitsyn2011irreversible, vucelja2016lifting]. We will show here that both LPT and DEO are actually closely related in that the marginal behaviour of individual chains under DEO is in fact distributionally equivalent to the one LPT chain. In a multi-core/distributed context, the DEO scheme therefore dominates LPT by having up to swaps per iteration whereas LPT only performs one.

2 Setup and notation

2.1 Annealed distributions

Henceforth we will assume that the probability distributions and on admit strictly positive densities with respect to a common dominating measure . We will also denote these densities somewhat abusively by and . It will be useful to define and . Using this notation, the annealed distribution at an annealing parameter is given by

(1)

where is the corresponding normalizing constant.

We denote an annealing schedule by , and in our asymptotic analysis we will view it as a partition of with mesh-size . Given an annealing schedule we define

, the joint distribution on the augmented space

, by

(2)

2.2 Parallel tempering

In this section, we define formally the Markov kernels corresponding to the reversible (SEO) and non-reversible (DEO) PT algorithms described informally in the introduction and in Figure 1. We provide pseudo-code for the overall algorithm in Procedure 1.

The two phases of Parallel Tempering. For both SEO and DEO, the overall Markov kernel describing the algorithm is obtained by the composition of a exploration kernel and a communication kernel ,

(3)

where denotes the alternation of followed by , i.e. for any two transition kernels and , . The difference between SEO and DEO is in the communication phase, namely in the former case and in the latter. Both communication kernels are detailed further.

The exploration kernels. These are defined in the same way for both SEO and DEO. They are also model specific, so we assume we are given one -invariant kernel for each annealing parameter . These can be based on Hamiltonian Monte Carlo, Metropolis–Hastings, Gibbs Sampling, Slice Sampling, etc. The exploration kernel of the prior chain can often be taken to be , i.e. . We construct the overall exploration kernel by applying the annealing parameter specific kernels to each component independently from each other:

(4)

Swap kernels. Before defining the communication scheme, it will be useful to first construct its fundamental building block, the swap kernel

. A swap kernel is a Metropolis–Hastings move with a deterministic proposal which consists of permuting two coordinates of a state vector. The proposed state is denoted

(5)

The Metropolis–Hastings acceptance ratio of this proposal is given by

(6)
(7)

Let denote the Metropolis-Hastings kernel corresponding to this update:

(8)

where denotes the Dirac delta.

The Odd and Even kernels. These kernels are maximal groups of swap moves such that members of the group do not interfere with each other. See Figure 1 for an illustration. We first define the even and odd indices:

(9)
(10)

The corresponding even and odd kernels and are then given by

(11)

The communication kernel for SEO and DEO. For SEO, the kernel is given by a mixture of the odd and even kernels in equal proportion:

(12)

For DEO, the kernel is given by a deterministic alternation between odd and even kernels. This is encoded by the following time heterogeneous kernel

(13)

Proposal and swap indicators. In our theoretical analysis it will be useful to re-express the exploration kernels in the following equivalent fashion. Let

(14)

where denotes an indicator that a swap is proposed (attempted) between chains and at iteration . In DEO, is deterministic, i.e. for even and for odd . In SEO, . We also set

. To avoid having too many subscripts, we use the same random variables for SEO and DEO but differentiate their behaviour by using two different probability measures

and with associated expectation operators and . We use and for statements that hold for both algorithms.

Figure 2: Illustration of the proposal, acceptance and swap indicators on a non-reversible realization.

The swap proposals are then defined from the proposal indicators as

(15)

where are acceptance indicator variables (see Figure 2). The equivalence between and this representation is given by

(16)

Permutation augmentation. Another useful construction is to add a permutation to the state space to keep track of the cumulative effect of the swaps. The augmented state space becomes , where denotes the group of bijections of . A first instance where this construction is useful is in the context of PT algorithms distributed over several compute nodes or machines. In this context a key implementation point is that instead of having pairs of machines exchanging states when a swap is accepted (which could be detrimental due to network latency and lower throughput), the machines should exchange annealing parameters. If are indices for machines, then the permutation at iteration encodes that machine is currently responsible for annealing parameter . Formally, the swap kernel in the augmented space is defined as:

(17)

where denotes an augmented state, denotes a transposition (swap) between and , and is the composition of followed by the swap . We abuse notation here and denote the kernel in the augmented space with the same symbol. The exploration kernel does not cause swaps, so in the permutation-augmented space we set it to

with a similar abuse of notation.

Invariant distribution. The kernels and hence are all invariant with respect to

(18)

See Appendix A for details.

Non-reversibility of DEO. Written as an homogeneous Markov chain, DEO takes the form . If , this kernel is in general non-reversible (it satisfies global balance but not detailed balance). Examples violating detailed balance can be constructed using a uniform likelihood, , in which case and swaps are systematically accepted. Non-reversibility will be explored more deeply in Section 3.3.

Reversibility of SEO. Let us assume that the exploration kernel can be decomposed as . This is a reasonable assumption: often is itself a composition of exploration passes, we are just assuming here that is even. In this case, it is reasonable to analyse the kernel . Since this kernel is palindromic, if is reversible then the palindrome is also reversible. We will refer to the PT algorithm with SEO and DEO communication as reversible PT and non-reversible PT respectively even when is non-reversible.

1: for all Swap rejection statistics used in 5.4 to adapt
2: Initialize chain
3:for  in 1, 2, …,  do
4:     for  in  do
5:          Exploration phase (embarrassingly parallel)
6:               
7:     if  is even then Non-reversibility inducing alternation
8:          Equation (9)
9:     else
10:          Equation (10)      
11:     for  in  do Communication phase (embarrassingly parallel)
12:         
13:         
14:         
15:         if  and  then
16:               Equation (7).               
17:     
18: for all Equation (57)
19:return
Procedure 1 Non-reversible PT (number of scans , annealing schedule )

2.3 Annealing trajectories and the index process

Chains and replicas. We will refer to the sequences and as the -th chain and replica respectively. The -th chain tracks the sequence of states with annealing parameter and the replica tracks the sequence of states on machine .

Annealing trajectories. Closely related to the replica, we define the annealing trajectory for index by . As discussed in the last section, index can be interpreted as a machine in a distributed context. We will use the notation when is unimportant. The annealing trajectory tracks the sequence of annealing parameters that machine is responsible of, as a function of the iteration index . The concept is best understood visually: refer to the bold piecewise linear path in Figure 1. We shall see in Section 3 that annealing trajectories encode the impact of the communication kernel in PT algorithms, and will allow us to illuminate fundamental differences between reversible and non-reversible PT.

Index process. To analyse the annealing trajectory , it will be equivalent and easier to study the sequence . For the remainder of this section we will introduce an alternative recursive construction to give intuition on the dynamics of . This recursion forms the basis for the analysis in the rest of the paper.

We define the index process for machine as and use the notation when is unimportant. Initialize and if and otherwise. For , we have

(19)

and,

(20)

The variables represent the direction proposed at iteration . For SEO communication, the variables are independent and identically distributed and equal to or with equal probability, and consequentially the annealing trajectories exhibit a random walk behaviour. In contrast for DEO communication, we have so long as the proposal involving replica was accepted and otherwise. Therefore annealing trajectories for non-reversible PT have a persistence in one direction and only change when a swap involving replica is rejected or if the boundary is reached. The qualitative differences between the two regimes can be seen in Figure 1.

3 Non-asymptotic analysis of PT

In this section, we motivate a formal notion of computational efficiency for SEO and DEO, the round trip rate, denoted and for the two algorithms, and provide conditions under which non-reversible DEO is guaranteed to perform better than its reversible counterpart, .

3.1 Model of compute time

We start with a definition of what we model as one unit of compute time: throughout the paper, we assume a massively parallel computational setup, and hence that sampling once from each of the kernels , and takes one unit of time, independently of the number of chains .

This assumption is realistic in both GPU and parallel computing scenarios, since the communication cost in PT only involves pairs of neighbours, and moreover does not increase with the dimensionality of the problem (as explained when we introduced the permutation augmentation in Section 2.2). In particular, all simulations considered in this work involve at most an order of tens to hundreds of chains (see for example Fig 16 for an example with up to 640 chains), so they are within reach of current commodity hardware: for example GPUs used in modern scientific applications often have roughly 1,000—5,000 cores as of 2019.

We also assume that the number of MCMC iterations will still dominate the number of parallel core available, i.e. . This is reasonable since the focus of this paper is in challenging sampling problems.

Empirical studies on multi-core and distributed implementation of PT are numerous [altekar_parallel_2004, mingas_parallel_2012, fang_parallel_2014]. However, despite its practical relevance, we are not aware of previous theoretical work investigating this computational model for PT. From now on, all analysis and recommendations in this paper assume a parallelized or distributed setup.

3.2 Performance metrics for PT methods

The standard notion of computational efficiency of MCMC schemes is the effective sample size (ESS) per compute time [flegal_monte_2008]. However, for PT methods, since the ESS per compute time depends on the details of the exploration kernels , alternatives have been developed in the literature. These alternative metrics allow us to give a representative analysis of PT as a “meta-algorithm” without reference to the specifics of the exploration kernels. In this section we motivate and describe the round trip rate, one such PT performance metric popular in the PT literature [katzgraber2006feedback, lingenheil2009efficiency]. The notion of round trip rate seems a priori somewhat disconnected to ESS per compute time, so we first introduce a more intuitive notion, the restart rate, and then show that the restart and round trip rates are essentially equivalent.

Tempered restarts. Our definition of a tempered restart is motivated by situations where the prior chain () provides one independent sample at each iteration. In this context, notice that each sample from the prior chain will either “succeed” in getting propagated to the posterior chain (), or “fail” and come back to the prior chain. The number of tempered restarts is defined as the number of distinct independent samples generated by the prior chain which are successfully propagated, via communication and exploration steps, to the posterior chain during the first iterations. This notion of optimality is not the full picture since intermediate chains also perform exploration, but nonetheless captures the essence of difficult multi-modal problems where only an independent initialization combined with successive exploration and communication steps can reach distinct modes in reasonable computational time. Informally, a tempered restart can be thought of as a sampling equivalent to what is known in optimization as a random restart. We define the restart rate as . Note also that each tempered restart is carried by one of the annealing trajectories, so we can decompose the tempered restarts as .

Round trips. Next, we say a round trip has occurred for replica when the annealing trajectory for replica successfully increases from to and back to . Formally, we recursively define and for ,

(21)
(22)

We say the -th round trip for replica occurred at iteration . Let denote the number of round trips for replica in the first iterations, and be the total number of round trips. We define the round trip rate as . This metric is commonly used in the literature to compare the effectiveness of various PT algorithms [katzgraber2006feedback, lingenheil2009efficiency].111In [katzgraber2006feedback, lingenheil2009efficiency] the round trip rate per annealing trajectory was optimized, i.e. .

Equivalence: each restart, except possibly for the last one, coincides with a round trip in one of the annealing trajectories. Hence, , so , and

Alternative PT performance metrics. Another performance metric used in the PT literature is the expected square jump distance (ESJD) [atchade_towards_2011, kone_selection_2005], defined as

(23)

where and . While this criterion is useful within the context of reversible PT for selecting the optimal number of parallel chains, it is too coarse to compare reversible against non-reversible PT methods. Indeed, for any given annealing schedule, the EJSD for DOE and SEO are identical. More generally, the metric is less directly aligned to the quantity practitioners care about, which is the restart rate.

The work of [chodera_replica_2011] proposes to use the relaxation time of the process . However, in our context this ignores the special structure of the chain at , which is an independent sequence of random variables distributed according to .

3.3 Index process as a Markov chain

The analysis of the round trip times is in general intractable because the index process is not Markovian. This is because simulating a transition depends on the acceptance indicators (see Equation (15)), the distributions of which themselves depend on the full state configuration X. If we further assume that the sequence is stationary and the exploration kernel is “locally efficient,” as defined below, we obtain that the index process is actually Markovian, and this will allow us to analytically compute round trip rates for both SEO and DEO communication schemes. We formally outline these assumptions below.

Stationarity. We assume and thus for all as the kernel is -invariant. An important observation that follows from assuming the stationarity regime is that the marginal behaviour of the communication scheme only depends on the distribution of the state via univariate distributions, namely the distributions of the chain-specific energies , . To see why, note that if , then by the definition of the stationary distribution, and Equation (7), the random variables are independent, and

(24)

Remarkably, this observation allows us to build a theoretical analysis of PT which makes no assumption on the nature of the state space . In contrast, previous work such as [atchade_towards_2011] assumed a product space for large .

Efficient Local Exploration (ELE). Let and denote the negative log-likelihood before and after an exploration step for any chain , , for , . The ELE assumption posits that and are independent.

This condition is more reasonable than it may appear at first glance and it is weaker than assuming that and are independent as typically done in the literature [atchade_towards_2011, roberts2014minimising]. Consider for example a scenario where we seek to explore the posterior distribution of a mixture model with symmetries induced by label switching. In such cases, being able to design exploration kernels such as and are approximately independent can be understood as being able to efficiently visit a neighbourhood of one of the local maxima. In contrast, being able to sample independently from would defy the need for using PT in the first place.

These two assumptions are assumed to hold throughout the paper. The assumptions are not expected to be exactly satisfied in real problems. However, they provide the foundations of a model for PT algorithms. We validate the predictions made by the model in Section 7.2

and empirically show robustness in performance even when the ELE assumption is violated. We also provide an heuristic argument to explain why our theoretical results appear robust to ELE violations for non-reversible PT. Moreover, the model is used to make algorithm optimization choices such as picking annealing parameters, and even if a slightly suboptimal PT algorithm is used, this PT algorithm still exactly targets the distribution of interest. Previous work on analyzing PT has also made modelling assumptions that are not expected to hold in practice but yield useful guidelines.



Markov transition kernel for the index process. Under ELE, we can express the acceptance indicators as independent Bernoulli random variables . The constant is given by the expectation of Equation (24),

(25)

where the expectation is over two independent random variables , satisfying for . From this, we obtain that is Markovian under ELE by mirroring the construction in Section 2.3.

For SEO, initialize and . Define the Markov transition kernel,

via chain rule in two steps. In the first step, simulate

(26)

where the expression “” enforces the annealing parameter boundaries. In the second step, independently sample .

Similarly for DEO, initialize and if is even and otherwise. Analogous to the SEO construction, we define via chain rule in two steps. We first update same as (26), but in the the second step we apply the deterministic update,

(27)

The lifted property of non-reversible PT. By inspection, we have for

(28)

implying that is reversible with respect to the uniform stationary distribution. In fact, since are independent and identically distributed, by itself is a reversible Markov process for SEO. The index process for SEO has been analysed in this context by [nadler_generalized_2007], where a master equation for was heuristically assumed to hold. However this approach does not provide a good approximation to the DEO case since, even if one assumes ELE, the process is not Markovian in contrast to the index process . However, contrary to the SEO case, the index process does not satisfy the detailed balance condition (28

) but the following skew-detailed balance condition,

(29)

where . This implies that the index process for DEO falls within the generalized Metropolis–Hastings framework outlined in [stoltz2010free, wu_irreversible_2017]

, and is non-reversible with respect to the uniform distribution.

Reversibility necessitates that the constructed MCMC chain must be allowed to backtrack its movements, which leads to inefficient exploration of the state space. As a consequence, non-reversibility is typically a favourable property for MCMC chains. One popular approach to design non-reversible MCMC samplers is to enlarge the state space with a “lifting parameter” which breaks reversibility and forces persistency in exploration [chen1999lifting, diaconis2000analysis].

We can interpret the index process for DEO communication as a “lifted” version of the index process for SEO with lifting parameter . In DEO communication, travels in the direction of and only reverses direction when reaches a boundary or when a swap rejection occurs. This idea was first explored by [wu_irreversible_2017] in the context of parallel tempering.

Consequentially, this “lifted property” built into DEO trajectories explains the qualitatively different behaviour between SEO and DEO. In Section 3.4 we will formally show that DEO annealing trajectories perform round trips in PT iterations whereas SEO annealing trajectories require instead PT iterations. We will also show in Section 6 that the scaling behaviour of the index process for reversible and non-reversible PT are qualitatively different, in particular for non-reversible PT, the index process is non-diffusive in contrast to its reversible counterpart.

3.4 Non-asymptotic domination of non-reversible PT

Based on our two assumptions, we will be able to compute explicit formulae for the round trip rates PT with DEO and SEO communication (and hence restart rates). Using the fact that the index process is Markovian, we can rewrite the round trip rate via an expected hitting time of , and then provide analytic expressions for the expected hitting times of index process for both DEO and SEO communication based on its transition probabilities. This yields that non-reversible PT outperforms reversible PT for any annealing schedule.

Computation of the round trip rate. We defined our notion of optimality using an asymptotic expression in the number of iteration . Our first goal is to obtain an analytic and non-asymptotic expression for for a given annealing schedule . As we will show shortly, the choice of schedule enters in the said analytic expression in terms of a schedule inefficiency defined as a sum of rejection odds:

(30)

where is the probability of rejecting a swap.

To achieve our first goal, we note that for each , is delayed renewal processes with inter-arrival times for and . In particular, are independent and identically distributed with common distribution . The key renewal theorem then implies

(31)

The following proposition gives us analytical expressions for the expected round trip times of PT with SEO and DEO communication respectively in terms of the schedule inefficiency. The proof can be found in Appendix B.

Proposition 3.1.

For any annealing schedule ,

(32)
(33)

The first term of (33) is the minimum number of swaps needed for a round trip to occur with no rejections. In contrast, the first term in (32) is the expected number of steps needed for a simple random walk on of size to make a round trip. The second term of (32) and (33) represents the expected impact of rejected swaps on the round trip times.

Intuitively, can be interpreted as the expected number of scans required before the first replica achieves a round trip. Therefore we should Proposition 3.1 implies we need scans for non-reversible PT before the first round trip occur. This is in contrast to the scans required for reversible PT.

Corollary 3.2.

For any annealing schedule we have

(34)
(35)

so .

4 Asymptotic analysis of PT

While the main result from the previous section ranks the performance of DEO communication relative to SEO communication, it does not provide insight on the absolute performances of these schemes, because of the inefficiency term .

Overview. In this section, we provide asymptotic estimates of as . The main result in this section is that the round trip rate of the reversible PT decays to zero. This in contrast to the non-reversible PT, where asymptotically increases (as defined below) to a positive constant . Moreover, we provide a characterization of in terms of a “communication barrier”, , measuring the deviance of from . We show both and can be estimated from the MCMC trace in Section 5 and can be used as the basis of schedule adaptation schemes.

4.1 The communication barrier

We begin by analyzing the behaviour of the PT swaps as goes to zero. In order to do so, we define the swap and rejection functions respectively as,

(36)
(37)

where for are independent. The quantities and are symmetric in their arguments and represent the probability of swapping and rejection occurring between and under the ELE assumption. Note that .

Local communication barrier. To take the limit as , it will be useful to understand the behaviour of when . The key quantity that drives this asymptotic regime is given by a function defined as

(38)

where are independent random variables with common distribution . We will use the following estimate for derived in the context of the design of a different class of tempering models used in the physics literature called incomplete beta function laws [predescu2004incomplete].

Theorem 4.1.

[predescu2004incomplete] For , let and . Suppose is integrable with respect to and then we have,

(39)
(40)

Theorem 4.1 shows that encodes up to third order the behaviour of and as the annealing parameter difference between the chains goes to 0. Since , Theorem 4.1 implies that can be expressed equivalently as the instantaneous rate of rejection of a proposed swap at annealing parameter ,

(41)

Note that is small when , which combined with Theorem 4.1 and (41) implies measures how sensitive is to perturbation in .

Replica with annealing trajectory will have very little difficulty accepting swaps when is small and will suffer from high rejection rates in regions when is large. Since chains communicate only when swaps are successful, measures the difficulty of communication at .

Global communication barrier. When , is the Riemann sum for with a single rectangle. If , then standard midpoint rule error estimates yield

(42)

Proposition 4.2 implies that the regularity of

is controlled by the moments of

with respect to and .

Proposition 4.2.

If is integrable with respect to and , then .

By applying Proposition 4.2, we can substitute (42) into Theorem 4.1, to obtain the following corollary.

Corollary 4.3.

If is integrable with respect to and , we have

(43)
(44)

Motivated by Corollary 4.3 we will henceforth assume that is integrable with respect to and and define by

(45)

We denote as the global communication barrier.

Remark 4.4.

Notice that with equality if and only if for all . It can be easily verified from (38) that if and only if is constant -a.s. for all which happens precisely when . So defines a natural symmetric divergence and measures the difficulty of communication between and .

High-dimensional scaling of communication barrier. We now determine the asymptotic behaviour of and when the dimension of is large. To make the analysis tractable, we assume that as in [atchade_towards_2011, roberts2014minimising]. This provides a model for weakly dependent high-dimensional distributions.

The corresponding tempered distributions are thus given by