The sampling of probability distributions in high dimensions is a fundamental challenge in computational science, with broad applications to physics, chemistry, biology, finance, machine learning, and other areas (e.g.[1, 2, 3, 4]). The methods of choice to tackle this problem are based on Monte Carlo or the simulation of stochastic differential equations such as the Langevin equation, either of which can be designed to be ergodic for a wide class of probability distributions. Yet, straightforward application of these methods typically fails when the distribution displays complex features such as multimodality. In high dimensions, the per-step cost of generating samples is significant and can be taken as the unit of computational effort; naive approaches may require many millions or billions of iterations. A typical case in point arises in computational chemistry, where molecular dynamics (MD) simulation has become an invaluable tool for resolving chemical structures, exploring the conformational states of biomolecules and computing free energies. Yet, despite its versatility, the use of MD simulation is often limited by the intrinsic high dimensionality of the systems involved and the presence of entropic and energetic barriers which lead to slow diffusion and necessitate the use of very long trajectories. A typical MD simulation is thus spent oversampling a few free energy minima, with consequent poor approximation of properties of interest.
which directly estimates the density of states and thus the entropy during simulation; metadynamics[7, 8] which progressively modifies the potential energy during simulation to flatten out the landscape and accelerate transitions between states; temperature-accelerated molecular dynamics (TAMD) [9, 10], which effectively evolves collective variables on their free energy landscape at the physical temperature but use an artificially high temperature to speed-up their dynamics; or the adaptive biasing force (ABF) method  which modifies forces using a continually refined estimated free energy gradient.
Another popular class of accelerated sampling schemes is based on adding the temperature to the system state variables and allowing it to vary during the simulation. These methods, which originated from simulated annealing, were first introduced for Langevin dynamics. They include replica exchange molecular dynamics (REMD) [13, 14, 15, 16, 17], in which several replicas of the system are evolved at different temperatures which they exchange, and simulated tempering (ST) methods, in which the temperature is treated as a dynamical variable evolving in tandem with the physical variables [18, 19]. The general idea underpinning REMD and ST is that modification of the temperature introduces nonphysical states that accelerate the dynamics of the physical system at target conditions. Closely related to the tempering methods are various “alchemical” simulation schemes which allow dynamical modification of parameters of the energy function describing the molecule, which are in spirit similar to what is done in umbrella sampling. Hamiltonian replica exchange is an example, for example [20, 21, 22, 23, 24, 25] which involves e.g. softening a dihedral bend [26, 27] or reducing the forces acting between protein and explicit solvent bath .
In this paper, we focus our attention primarily on the original ST in which the temperature is evolved along with the particle positions and momenta. In the standard implementations of the method, the temperature is treated as a discrete random variable, evolving together with the physical variables via a discrete-time Markov chain. The practitioner is normally required to make a choice of saytemperatures distributed over some range,
which defines the temperature “ladder”. Letting , we prescribe a weight at each of these reciprocal temperatures, and the system state variables are then evolved along with the reciprocal temperature in the following way:111Note that in this version, tempering amounts to rescaling the forces rather than actually changing the temperature of the bath. In the context of MD simulations, this is the way ST is typically implemented in practice, to not temper with the kinetic energy of the system, see (10).
Given the current state of the reciprocal temperature, say , standard MD simulations are performed with the force rescaled by the factor for a lag time of duration (here is fixed);
At the end of each time interval of duration , a switch from to some is attempted, and accepted or rejected according to a Metropolis-Hastings criterion with acceptance probability
Here is the potential energy at the current position .
The simple idea in ST is that exploration is aided by the high temperatures, when , since the rescaling of the force by the factor will help the system traverse energetic barriers easily. The lower temperatures, when , complement the sampling by providing enhanced resolution of low energy states. The scheme above guarantees that this acceleration of the sampling is done in a controlled way, in that we know that the ST dynamics is ergodic with respect to the extended Gibbs distribution
is the mass tensor,is the physical dimension times the number of particles, and is the normalization constant
Knowledge of (3) permits one to unbias the ST sampling and compute averages with respect to the target Gibbs distribution in the original system state space:
where is the partition function. Standard techniques, such as the Langevin Dynamics associated with (5), explore this measure poorly. ST can in principle accelerate sampling for the reasons listed above. However, to design an effective implementation of ST, practitioners must make choices for the temperature ladder in (1), switching frequency , and weight factor in (2). These choices are all non-trivial and exhibit some clear interdependence. Our aim here is to explain how to chose these parameters and show how to do so in practice. Specifically:
By adapting the large deviation approach proposed by Dupuis et al. in Refs. [29, 30] in the context of REMD, we show that ST is most efficient if operated in the infinite switch limit, . In this limit, one can derive a limiting equation for the particle positions and momenta alone, in which the original potential is replaced by an averaged potential. In the context of the standard ST method described above, this averaged potential reads
In the infinite switch limit, the ST method then becomes similar to the “integrate-over-temperature” method that Gao proposed in Ref. , and there is no longer any need to update the temperatures – they have been averaged over.
Regarding the choice of temperature ladder, we show that it is better to make the reciprocal temperature vary continuously between and . In this case, the averaged potential (6) becomes
and we can think of (6) as a way to approximate this integral by discretization. When the reciprocal temperature takes continuous values, the infinite switch limit of ST is also the scheme one obtains by infinite acceleration of the temperature dynamics in the continuous tempering method proposed in Ref. .
Regarding the choice of , the conventional wisdom is to take , where is the configurational part of the partition function:
This choice is typically justified because it flattens the marginal of the distribution (3) over the temperatures. Indeed this marginal distribution is given for by
which is uniform if . Here we show that the choice also flattens the distribution of potential energy in the modified ensemble with averaged potential (7). Interestingly, this offers a new explanation of why ST becomes inefficient if the system undergoes a phase transition, such that its density of states is not log-concave. This perspective will allow us to make a connection between ST and the Wang-Landau method [5, 6].
Building on these results, an additional contribution of this article is to give a precise formulation of an algorithm for learning the weights on the fly. The implicit coupling between physical dynamics and weight determination complicates implementation, and remains an active area of research [33, 31, 34, 35]. This problem of weight determination is comparable to a machine learning problem in which parameters of a statistical model must be inferred from data (in this case the microscopic trajectories themselves). This algorithm derives from an estimator of the partition function that utilizes a full MD trajectory over the full set of temperatures . This is similar to the Rao-Blackwellization procedure proposed in Ref.  and in contrast to, but more efficient than, traditional methods which are restricted to a particular temperature .
As was outlined above, to establish these results it will be convenient to work with a continuous formulation of ST, in which the reciprocal temperatures vary continuously both in time and in value in the range . This formulation is introduced next, in Sec. 2.1. We stress that it facilitates the analysis, but does not affect the results: all our conclusions also hold if we were to start from the standard ST algorithm described above.
The remainder of this paper is organized as follows: In Sec. 2 we present the theoretical foundations of ST using the continuous variant that we propose, which is introduced in Sec. 2.1. In Sec. 2.2 we derive a closed effective equation for the system state variables alone in the infinite switch limit and we justify that it is most efficient to operate ST in this limit. In Sec. 2.3 we show how to estimate canonical expectations using this limiting equation – the same estimator can also be used for standard ST and amount to performing a Rao-Blackwellization of the standard estimator used in that context. We also show how to estimate the partition function . In Sec. 2.4 we go on to explain why the choice is optimal. Finally, in Sec. 2.5 we explain how to learn these weights on the fly. These theoretical results are then used to develop a practical numerical scheme in Sec. 3, and this scheme is tested on two examples in Sec. 4: the -dimensional harmonic oscillator, which allows us to investigate the effects of dimensionality in a simple situation where all the relevant quantities can be calculated analytically; and a continuous version of the Curie-Weiss model, which displays a second-order phase transition and allows us to investigate the performance of ST in the infinite temperature switch limit when this happens. Finally, some concluding remarks are given in Sec. 5.
2 Foundations of Infinite Switch Simulated Tempering (ISST)
In this section, we discuss the theoretical foundation of simulated tempering, in particular deriving a simplified system of equations for the physical variables that eliminates the need to perform a discrete switching over temperature. Although we still technically work with a temperature ‘ladder’, as in other works in this area, we shall see that these are only used to perform the averaging across temperatures in an efficient way.
2.1 A Continuous formulation of Simulated Tempering
In order to simplify our presentation and advance the large deviation argument, we first replace standard simulated tempering, where is taken from a discrete sequence in to with a model that incorporates a continuously variable reciprocal temperature , taking values in the interval . Note that , which continuously varies, should not be confused with the physical reciprocal temperature , which is fixed. In this continuous tempering setting, the extended Gibbs distribution has density
where is a normalization constant:
For sampling purposes, we also need to introduce a dynamical system that is ergodic with respect to the distribution (10). A possible choice is
Here is a Langevin friction coefficient, and
represent independent white noise processes,is a time-scale parameter, and we recall that is the tempering variable that evolves whereas is the physical reciprocal temperature that is fixed. Note that other choices of dynamics are possible , as long as they are ergodic with respect to (10); working in the limit where is infinitely fast compared to can be applied to these other dynamical systems as well. The effective equation one obtains in that limit is (13), given that the system variables are the same as in (12) (i.e. the specifics of how evolves does not matter in this limit).
2.2 Infinite Switch Limit
In this subsection, we argue that it is best to use simulated tempering in an infinite switch limit, and we derive an effective equation for the physical state variables that emerge in that limit. In the context of (12), this infinite switch limit can be achieved by letting in this equation, in which case equilibrates rapidly on the timescale before moves. The state variables thus only feel the average effect of on the timescale, that is, (12) reduces to the following limiting system for alone
where is the conditional average of with respect to (10) at fixed:
(13) can be derived by standard averaging theorems for Markov processes, and it is easy to see that in this equation we can view the effective force as the gradient of a modified potential:
where is the averaged potential defined in (7). We will analyze the properties of this effective potential in more details later in Sec. 2.4. We stress that (14) is a closed equation for . In other words, in the infinite switch limit it is no longer necessary to evolve the reciprocal temperature : the averaged effect the dynamics of has on is fully accounted for in (13).
Let us now establish that the limiting equations in (13) are more efficient than the original (12) (or variants thereof that lead to the same limiting equation) for sampling purposes. To this end we use the approach based on large deviation theory proposed by Dupuis and collaborators [29, 30]. Define the empirical measure for the dynamics up to time by
where we denote by the infinitesimal generator of (12):
Colloquially, the large deviation principle asserts that the probability that the empirical measure be close to , is asymptotically given for large by
For Langevin dynamics, the rate functional can be further simplified [39, 40], as we show next. Due to the Hamiltonian part, the generator is not self-adjoint with respect to , where is the density in (3); denote by the weighted adjoint, then we have
where is the operator that flips the momentum direction: . We will split the generator into the symmetric part (corresponding to damping and diffusion in momentum), the anti-symmetric part (corresponding to Hamiltonian dynamics), and the tempering part (corresponding to the dynamics of ):
Let be the Radon-Nikodym derivative of with respect to , where is the measure with density (10) (i.e. the invariant measure for the dynamics in (12)). If , then the large deviation rate functional is given by
In particular, after an integration by parts, the only -dependent term is given by
It is thus clear that for , we have , which leads to the conclusion that the rate function is pointwise monotonically decreasing in .
2.3 Estimation of Canonical Expectations and
where is the normalization constant defined in (11). As a result, if is an observable of interest, its canonical expectation at temperature can be expressed as a weighted average,
where we assumed ergodicity in the last equality and we define,
Expression (28) is different from the standard approach used in ST in that it uses the data from all to calculate expectations at reciprocal temperature . By contrast, the typical estimator used in ST only uses the part of the time series during which (or when one uses a discrete set of reciprocal temperatures). As identified in Ref. , (28
) amounts to performing a Rao-Blackwellization of the standard ST estimator, which always reduces the variance of this estimator.
In the same way, the expression (29) for the weights in (28) clearly indicates that the reweighting can only be done with knowledge of , which is typically not available a priori. Often the aim of sampling is precisely to calculate . Thus to make (13) useful one also needs to design a way to estimate from the simulation; this issue also arises with standard ST or when one uses (12). This can be done using the following estimator that can be verified by direct calculation using the definition of in (27),
where the last equality follows from ergodicity. The estimator (30) permits the calculation of the ratio for any pair , . It will allow us to kill two birds with one stone, since the scheme is most efficient if used with proportional to . By learning we are also able to adjust on-the-fly and thereby improve sampling efficiency along the way. As mentioned before, leads to a flattening of the distribution of potential energy, as in Wang-Landau sampling , as discussed in the next subsection.
2.4 Optimal Choice of the Temperature Weights
The choice is typically justified because it flattens the marginal distribution of (10) with respect to reciprocal temperature. Indeed, this marginal density is given by (which is the continuous equivalent of (9)):
Having a flat is deemed advantageous since it guarantees that the reciprocal temperature explores the entire available range homogeneously and does not get trapped for long stretches of time in regions of where is low. In the infinite switch limit, however, the reciprocal temperature is never trapped on the time-scale over which varies, since the reciprocal temperature evolves infinitely fast and is averaged over. Thus a different reasoning should be provided for the choice .
Such a justification can be found by looking at the probability density function of the potential energyin the system governed by (12) or its limiting version (13). It is given by,
where is the density of states
Next we show that, in the limit of large system size, can be made flat for a band of energies by setting . Note that, in contrast, the probability density of of the original system with canonical density , i.e.,
is in general very peaked at one value of in the large system size limit; this will also become apparent from the argument below.
To begin, use the standard expression of in terms of
where, without loss of generality we have assumed that on . In terms of the (dimensionless) microcanonical entropy and the canonical free energy defined as
we can write (35) as
Now suppose that the system size is large enough that the integral in (37) can be estimated asymptotically by Laplace’s method. If that is the case, then
where means that the ratio of the terms on both sides of the equation tends to 1 as the system size goes to infinity (thermodynamic limit). Equation (38) states that the free energy is asymptotically given by the Legendre-Fenchel transformation of the entropy . By the involution property of this transformation, this implies,
where is the concave envelope of . This asymptotic relation can also be written as
where means that the ratio of the logarithms of the terms on both sides of the equation tends to 1 as the system size goes to infinity. As a result
Comparing with (32), we see that if we set we have
where we have defined
As a result , as desired, provided that
the system energy is such that the minimizer of (39) lies in the interval , i.e. , and
(i.e. is concave down).
If is bounded not only from below but also from above, i.e. then the range of possible system’s energies is and we can adjust the interval so that the minimizer of (39) always lies in it. If is unbounded from above, which is the generic case, this is not possible unless we let (i.e. allow infinitely high temperatures). This limit breaks ergodicity of the dynamics, and cannot be implemented in practice. Rather, for unbounded , it is preferable to adjust to control the value of up to which is flat. Similarly, regulating allows one to keep flat up to some but not below it.
A more serious problem arises if , i.e. if is not concave down, since in this case we cannot flatten in the region where does not coincide with its concave envelope. This is the signature that a first-order phase transition happens. Indeed, a non-concave implies that the free energy (which, in contrast to , is always concave down) is not differentiable at at least one value of : at each of these values is the transform of a non-concave branch of . On these branches, ST does not flatten and as a result it does not a priori provide a significant acceleration over direct sampling with the original equations of motion. These observations give an alternative explanation to a well-known problem of ST in the presence of a first-order phase transition.
which also implies that . As a result, (13) becomes
The equations used in the Wang-Landau method are similar to (45) but with replaced by . In other words, these equations are asymptotically equivalent to (45) if . This is not surprising since this method also leads to a flat . In other words, ST in the infinite switch limit with is conceptually equivalent to the Wang-Landau method, although in practice the two methods differ. Indeed in ST we need to learn to adjust the weights, whereas in the Wang-Landau method we need to learn its dual .
2.5 Adaptive Learning of the Temperature Weights
Given any set of weights we can in principle learn (or ratios thereof) using the estimator (30). However we know from the results in Sec. 2.4 that this procedure will be inefficient in practice unless . Here we show how to adjust as we learn . To this end we introduce two quantities: , constructed in such a way that it converges towards a normalized variant of ; and , giving the current estimate of the weights.
with is given by
with is the instantaneous estimate of the weights, normalized so that
where is a parameter (defining the time-scale with which the inverse weights are updated in comparison to the evolution of the physical variables) and is a renormalizing factor added to guarantee that the dynamics in (48) preserve the constraint (47) (the form of this term will become more transparent when looking at (59)–(61), which are the time-discretized version of (48) considered in Sec. 3). Equation (48) should be solved with the initial condition , where is some initial guess for the weights consistent with (47), i.e. such that . In addition, in (46) , should be obtained by solving (13) with substituted for , i.e. using
where is given by
To understand these equations, suppose first that . In this case, is not evolving, i.e. for all . It is then clear that (49) reduces to (13) and (46) to the estimator at the right hand side of (30) as . In other words, when we have
i.e. for any , ,
When , is evolving and we need to consider the fixed point(s) of this equation. Suppose that at least one such a fixed point exists and denote it by . Since this fixed point must satisfy , it is easy to see from (48) that it must be given by
where which, from (46) is given by (replacing again by ):
where we used ergodicity and is the equilibrium density of (13) when the weights are used to calculate