Numerical sampling of high-dimensional probability distributions with unknown normalization has important applications in machine learning, Bayesian statistics, computational physics, and other related fields. The most popular approaches are based on Markov chain Monte Carlo (MCMC), including Langevin MCMC, underdamped Langevin MCMC, Hamiltonian Monte Carlo [41, 1, 36], bouncy particle and zigzag samplers [4, 3], etc. Many of these approaches, often combined with stochastic gradients [45, 30], have been widely used in machine learning.
When the target probability distribution is (strongly) log-concave, that is, when its density with respect to the Lebesgue measure is $\pi \propto e^{-V}$ with $V$ being (strongly) convex, it is known that most sampling schemes mentioned above can produce independent random samples efficiently [12, 15, 31, 9, 16]. The sampling problem becomes much more challenging when the probability distribution exhibits multi-modality, as it takes much longer for the sampling Markov chain to pass through low-probability regions in the phase space to explore and balance between multiple modes (a phenomenon known as metastability). Many enhanced sampling schemes have been proposed over the years to overcome this difficulty, including various tempering schemes [43, 19, 32, 35], biasing techniques [44, 24], and non-equilibrium sampling, to name a few.
In this work, we propose a simple, novel sampling dynamics based on a birth-death process to overcome metastability. On the continuous level of the evolution of the probability density, the proposed dynamics is given by
$$\partial_t \rho_t = \nabla\cdot(\rho_t \nabla V) + \Delta \rho_t - \Big(\log\frac{\rho_t}{\pi} - \mathbb{E}_{\rho_t}\Big[\log\frac{\rho_t}{\pi}\Big]\Big)\rho_t, \qquad (1)$$
where one adds a non-local (due to the expectation) update rule to the conventional overdamped Langevin dynamics. The advantage of the birth-death process is that it allows a global move of the mass of a probability density directly from one mode to another in the phase space according to their relative weights, without the difficulty of passing through low-probability regions suffered by any local dynamics such as the overdamped Langevin MCMC. It is possible to combine the birth-death process with other sampling dynamics, such as underdamped Langevin or various accelerated dynamics; for simplicity of presentation we focus only on the birth-death accelerated overdamped Langevin dynamics in the current work.
Our main theoretical result is that under mild assumptions the asymptotic convergence rate of the proposed sampling dynamics is independent of the barrier of the potential corresponding to the target measure – this is a substantial improvement of the convergence rate of overdamped Langevin diffusion, which is exponentially small due to metastability. Moreover, we also establish a gradient flow structure of the birth-death accelerated Langevin dynamics: it is a gradient flow of the Kullback-Leibler (KL)-divergence with respect to the Wasserstein-Fisher-Rao metric.
To demonstrate this improved convergence, we study two analytical examples showing significant speedup of mixing compared to Langevin diffusion. We also propose a practical interacting particle sampling scheme as a numerical implementation of the birth-death accelerated Langevin dynamics. The efficiency of the proposed algorithm is illustrated through several numerical examples.
1.2. Related works.
The proposed sampling scheme involves interacting particles that undergo Langevin diffusion and a birth-death process. Other sampling schemes via interacting particles have been proposed recently, including the Stein variational gradient descent (SVGD) flow [28, 27] (see also the study of its continuous limit). Unlike the samplers based on the Stein discrepancy, which replace the random noise in Langevin dynamics by repulsion between particles, the sampling scheme proposed in this work employs a birth-death process to enhance the mixing of existing sampling schemes. In fact, it could also potentially be combined with SVGD to improve its convergence.
The birth-death process has been used in sequential Monte Carlo (SMC) samplers. In SMC, the birth-death and branching process is used to reduce the variance of particle weights, whereas in the scheme proposed here it is used to globally move the sampling particles according to the target measure. The birth-death process has also been used recently to accelerate the training of neural networks in the mean-field regime, for a rather different purpose than accelerating the convergence of Monte Carlo samplers.
The work of Jianfeng Lu and the work of James Nolen were partially funded through grants DMS-1454939 and DMS-1351653 from the National Science Foundation, respectively.
2. Fokker-Planck equation and birth-death process
2.1. Langevin dynamics and its Fokker-Planck equation.
Recall that the (overdamped) Langevin diffusion is the solution to the following stochastic differential equation
$$dX_t = -\nabla V(X_t)\,dt + \sqrt{2}\,dB_t, \qquad (2)$$
where $V:\mathbb{R}^d \to \mathbb{R}$ and $B_t$ is a $d$-dimensional Brownian motion. Many popular sampling schemes are constructed from discretizations of (2), such as the unadjusted Langevin algorithm (ULA) and its Metropolized version, the Metropolis-adjusted Langevin algorithm (MALA). The law $\rho_t$ of (2) solves the linear Fokker-Planck equation (FPE)
$$\partial_t \rho_t = \nabla\cdot(\rho_t \nabla V) + \Delta \rho_t \qquad (3)$$
on $\mathbb{R}^d \times (0,\infty)$.
The stationary distribution of (2) and (3) has density $\pi \propto e^{-V}$. In the seminal work of Jordan-Kinderlehrer-Otto, the FPE (3) was identified as the gradient flow of the KL-divergence (i.e., relative entropy) $\mathrm{KL}(\rho\,\|\,\pi) = \int \rho \log(\rho/\pi)\,dx$ with respect to the $2$-Wasserstein distance. Moreover, if the target measure $\pi$ satisfies the logarithmic Sobolev inequality (LSI): for any probability distribution $\rho$,
$$\mathrm{KL}(\rho\,\|\,\pi) \le \frac{1}{2\lambda}\int \rho\,\Big|\nabla\log\frac{\rho}{\pi}\Big|^2\,dx, \qquad (4)$$
then the solution of (3) converges to $\pi$ exponentially fast in KL-divergence.
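To spell out the convergence implied by the LSI (a standard computation, included here for the reader's convenience): along the FPE (3), the time derivative of the KL-divergence equals minus the relative Fisher information, which the LSI bounds from below by the KL-divergence itself,

```latex
\frac{d}{dt}\,\mathrm{KL}(\rho_t\,\|\,\pi)
  = -\int \rho_t\,\Big|\nabla\log\frac{\rho_t}{\pi}\Big|^2\,dx
  \;\le\; -2\lambda\,\mathrm{KL}(\rho_t\,\|\,\pi),
```

so that Gronwall's inequality yields $\mathrm{KL}(\rho_t\,\|\,\pi) \le e^{-2\lambda t}\,\mathrm{KL}(\rho_0\,\|\,\pi)$.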
2.2. Pure birth-death process.
The main idea of this work is to use a birth-death process to accelerate sampling. Before combining it with the Langevin dynamics, let us consider the pure birth-death equation (BDE) given by
$$\partial_t \rho_t(x) = -\alpha(x,\rho_t)\,\rho_t(x), \qquad \alpha(x,\rho_t) = \log\frac{\rho_t(x)}{\pi(x)} - \int \log\frac{\rho_t}{\pi}\,\rho_t\,dx. \qquad (5)$$
As there is no spatial derivative involved in (5), it can be viewed as an (infinite) system of ordinary differential equations, indexed by $x$, coupled through the integral term in $\alpha$. Observe that $\pi$ is an invariant measure of (5). Moreover, equation (5) depends on $\pi$ only up to a multiplicative constant, making it feasible for sampling with an unknown normalization constant. The definition of the birth/death rate $\alpha$ in (5) is very intuitive. In fact, ignoring the last integral term in the definition of $\alpha$, one sees that the solution to (5) adjusts the mass according to the difference between the logarithm of the current density and that of the target: the density at a location $x$ increases (or decreases) if $\rho_t(x) < \pi(x)$ (or $\rho_t(x) > \pi(x)$). The integral term in (5) is added to guarantee that the total integral of $\rho_t$ is conserved during the evolution, so that $\rho_t$ remains a probability distribution (positivity is also easy to verify).
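These properties are easy to check numerically. Below is a minimal sketch of (5) on a one-dimensional grid; the target, initial density, grid, and step size are all illustrative choices of ours, not taken from the paper. The mean-zero rate conserves the total mass at every step, and the KL-divergence decreases along the flow:

```python
import numpy as np

# Grid sketch of the pure birth-death equation (5):
#   d rho/dt = -rho * (log(rho/pi) - E_rho[log(rho/pi)])
x = np.linspace(-3, 3, 400)
dx = x[1] - x[0]

def normalize(p):
    return p / (p.sum() * dx)

pi = normalize(np.exp(-x**2))                   # illustrative target density
rho = normalize(np.exp(-(x - 1.0)**2 / 0.5))    # initial density, positive everywhere

def kl(rho, pi):
    return np.sum(rho * np.log(rho / pi)) * dx

dt = 1e-3
kl0 = kl(rho, pi)
for _ in range(5000):
    r = np.log(rho / pi)
    alpha = r - np.sum(r * rho) * dx   # mean-zero rate: mass is conserved
    rho = rho - dt * alpha * rho       # forward Euler step

mass = rho.sum() * dx
```

Note that the scheme never evaluates the normalization of $\pi$ separately from $\rho$: an unnormalized target only shifts $\log(\rho/\pi)$ by a constant, which the mean-zero correction removes.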
The birth-death dynamics (5) differs substantially from the FPE (3) in many respects. The former is essentially a nonlinear system of ODEs (with a non-local coefficient) whereas the latter is a linear parabolic PDE. Due to the absence of diffusion, the support of the solution of (5) never increases during the evolution. This seems to suggest that the birth-death equation is unsuitable for sampling. However, we shall show in Theorem 3.4 that if the initial density is positive everywhere, then $\rho_t$ converges to $\pi$ as $t \to \infty$.
2.3. Fokker-Planck equation with birth-death dynamics.
The real power of the birth-death process above emerges when it is combined with the Fokker-Planck equation (3), which yields the following equation on the level of the probability density, which already appeared in the introduction:
$$\partial_t \rho_t = \nabla\cdot(\rho_t \nabla V) + \Delta \rho_t - \alpha(x,\rho_t)\,\rho_t, \qquad (6)$$
where $\alpha$ is defined as in (5). Before we discuss the discretization of (6) in Section 5, which will lead to an efficient particle sampler in practice, we study in what follows the Fokker-Planck equation of the birth-death accelerated Langevin dynamics (BDL-FPE) (6), in particular its favorable convergence properties compared with the original Langevin dynamics (3) and the pure birth-death process (5).
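For intuition, the combined equation can also be simulated directly on a grid in one dimension. The following sketch uses our own illustrative potential, grid, and step sizes (none taken from the paper): a conservative finite-difference flux for the Fokker-Planck part plus the mean-zero birth-death rate, so that mass is conserved exactly and the KL-divergence decreases:

```python
import numpy as np

# Sketch of the combined dynamics (6) on the one-dimensional torus [0, 1).
n = 128
dx = 1.0 / n
x = np.arange(n) * dx
V = 2.0 * np.cos(2 * np.pi * x)                            # illustrative potential
Vp_half = -4.0 * np.pi * np.sin(2 * np.pi * (x + dx / 2))  # V' at half points

pi = np.exp(-V)
pi /= pi.sum() * dx

rho = np.exp(3.0 * np.cos(2 * np.pi * (x - 0.3)))          # misplaced initial bump
rho /= rho.sum() * dx

def kl(r):
    return np.sum(r * np.log(r / pi)) * dx

kl0 = kl(rho)
dt = 2e-5
for _ in range(5000):
    # Fokker-Planck part in conservative flux form: d_t rho = d_x(rho V' + d_x rho).
    rho_half = 0.5 * (rho + np.roll(rho, -1))
    flux = rho_half * Vp_half + (np.roll(rho, -1) - rho) / dx
    rho = rho + dt * (flux - np.roll(flux, 1)) / dx
    # Birth-death part with mean-zero rate (conserves mass exactly).
    r = np.log(rho / pi)
    alpha = r - np.sum(r * rho) * dx
    rho = rho - dt * alpha * rho

mass = rho.sum() * dx
```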
3. Analysis of the Fokker-Planck equation with birth-death
3.1. Gradient flow structure
In parallel to the well-known fact that the FPE (3) is the 2-Wasserstein gradient flow of the KL-divergence, the BDL-FPE (6) can be viewed as a gradient flow of the KL-divergence with respect to a different metric. Our result is motivated by recent works on the Wasserstein-Fisher-Rao (WFR) distance [23, 10, 26, 33] in the study of unbalanced optimal transport. Specifically, we define the Wasserstein-Fisher-Rao distance (also known as the spherical Hellinger-Kantorovich distance [23, 7]) by
$$\mathrm{WFR}(\rho_0,\rho_1)^2 = \inf_{(\rho_t,(v_t,\alpha_t)) \in \mathcal{A}} \int_0^1\!\!\int \big(|v_t|^2 + \alpha_t^2\big)\,\rho_t\,dx\,dt, \qquad (7)$$
where the admissible set $\mathcal{A}$ consists of all pairs $(\rho_t,(v_t,\alpha_t))$ such that $\rho_t$ is a narrowly continuous curve in the space of probability measures connecting $\rho_0$ and $\rho_1$ and that
$$\partial_t \rho_t = -\nabla\cdot(\rho_t v_t) + \Big(\alpha_t - \int \alpha_t\,\rho_t\,dx\Big)\rho_t.$$
Here $\mathcal{P}$ denotes the space of probability measures on $\mathbb{R}^d$. We emphasize that for our sampling purpose we have modified the original definition of the WFR distance in [2, 23, 10] by adding the integral penalty term to keep the mass conserved; without this term, $\rho_t$ may gain and lose mass during the transportation. Our first result is the following theorem, which characterizes the gradient flow structure of the BDL-FPE (6); its proof is provided in Appendix A.
3.2. Convergence analysis
We now analyze the convergence of BDL-FPE (6). Proofs of results in this section can be found in Appendix B. We first establish in the following theorem the global convergence of (6) by assuming the validity of LSI (4).
Theorem 3.2 shows that the global convergence rate of the BDL-FPE (6) is no worse than that of the FPE (3). The convergence rate obtained this way is still fully characterized by the log-Sobolev constant, however, which may scale badly when the potential has high barriers. In contrast, we show in the next theorem that the birth-death term accelerates the diffusion dramatically, in the sense that the asymptotic convergence rate of the BDL-FPE (6) is independent of the potential barrier of $V$.
Theorem 3.3 states that as long as the solution of the BDL-FPE (6) is not too far from the target $\pi$ and satisfies a uniform (possibly tiny) lower bound with respect to $\pi$, it will converge to $\pi$ at a rate independent of $V$ after a short waiting time. In practice, the initial condition can be chosen to satisfy condition (10); see Section 4 for examples.
For completeness, we also show the global convergence of BDE (5) (without a rate) in the next theorem. The BDE also has a similar gradient flow structure; we will not go into details here.
Let $\rho_t$ be the solution to (5) with initial condition $\rho_0$. Assume that $\mathrm{KL}(\rho_0\,\|\,\pi)$ is finite. Assume also that $\rho_0 > 0$ and $\pi > 0$ everywhere. Then $\rho_t > 0$ for all $t > 0$, and $\rho_t$ converges to $\pi$ as $t \to \infty$.
4. Illustrative examples
4.1. Uniform distribution on torus.
Let $V \equiv 0$, and suppose the domain is the $d$-dimensional torus $\mathbb{T}^d$ with $\pi \equiv 1$ being the density of the uniform measure on $\mathbb{T}^d$. In this case, the FPE dynamics (3) corresponds to the heat equation on $\mathbb{T}^d$. While the heat flow alone may still take a long time to equilibrate when the initial KL-divergence is large, the FPE dynamics (3) may be used to prepare a good initial condition for the combined BDL-FPE dynamics (6). Specifically, a lower bound on the heat kernel shows that at a time of order one the solution to (3) satisfies a pointwise lower bound of the form $\rho_t \ge c\,\pi$ for a universal constant $c > 0$ that is independent of the initial data (assuming it is a probability measure) and of the dimension $d$. Then, if we use this solution as initial data for the combined dynamics (6), the condition (10) holds with this constant $c$, and Theorem 3.3 implies exponential convergence of (6) to $\pi$. In particular, the convergence rate does not depend on $d$, and the initial time lag is of order one.
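The relaxation of the heat semigroup can be checked mode by mode. Below is a small Fourier-space computation on the one-dimensional torus $[0, 1)$ under our own normalization (unit torus, unit diffusion; the stated constants are ours, not the paper's): mode $k$ decays at rate $(2\pi k)^2$, so the slowest nonconstant mode relaxes at rate $4\pi^2$.

```python
import numpy as np

# Heat equation on the 1D torus [0, 1): solve exactly in Fourier space.
n = 256
x = np.arange(n) / n
rho0 = 1.0 + 0.5 * np.cos(2 * np.pi * x)   # probability density, one k = 1 mode
T = 0.01
k = np.fft.fftfreq(n, d=1.0 / n)           # integer frequencies on the unit torus
rho_T = np.real(np.fft.ifft(np.fft.fft(rho0) * np.exp(-(2 * np.pi * k) ** 2 * T)))

# The k = 1 amplitude decays by exactly exp(-4 * pi^2 * T).
expected_amp = 0.5 * np.exp(-4 * np.pi**2 * T)
```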
4.2. Double well.
Suppose $V$ is a double-well potential whose two wells are separated by a barrier of height $h$, and regard $e^{-V}$ (normalized) as a density on the one-dimensional torus. This density has two modes, one in each well. It is known that for such a potential $V$, the FPE dynamics (3) exhibits a metastability phenomenon, and the mixing time for pure Langevin dynamics grows exponentially in the barrier height $h$ (see [5, 6, 34]).
Suppose that the initial density $\rho_0$ is the restriction of $\pi$ to one of the two wells (normalized to be a probability density). If $\rho_t$ evolves according to the pure Langevin dynamics (3) for a short initial time, a lower bound on the heat kernel (via a large deviation type estimate) implies that $\rho_t$ satisfies a pointwise lower bound of the form $\rho_t \ge c\,\pi$ for some positive constant $c$. Then, suppose that afterwards $\rho_t$ evolves according to the combined dynamics (6). Theorem 3.3 implies convergence of $\rho_t$ to $\pi$ at a rate that is independent of the barrier height. So, compared with the solution to the Langevin dynamics (3), the birth-death accelerated dynamics (6) exhibits a dramatic acceleration and converges to $\pi$ at a rate independent of the barrier after a brief initial delay.
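To make the metastability concrete, here is a small simulation of plain overdamped Langevin particles in a double well; the specific potential $V(x) = h(x^2 - 1)^2$ with $h = 5$, the time horizon, and the step sizes are our own illustrative choices rather than the paper's setup. Started in the left well, only a small fraction of the particles cross the barrier within a unit of time:

```python
import numpy as np

# Plain overdamped Langevin in the double well V(x) = h (x^2 - 1)^2, h = 5.
rng = np.random.default_rng(1)
h, dt, n_steps = 5.0, 0.002, 500   # total time t = 1.0

def grad_V(x):
    return 4 * h * x * (x**2 - 1)

X = rng.normal(loc=-1.0, scale=0.1, size=200)   # start near the left mode
for _ in range(n_steps):
    # Euler-Maruyama step: dX = -V'(X) dt + sqrt(2) dB
    X = X - grad_V(X) * dt + np.sqrt(2 * dt) * rng.normal(size=X.size)

frac_left = np.mean(X < 0)   # fraction of particles still in the left well
```

By symmetry the equilibrium puts half the mass in each well, so the large value of `frac_left` after a unit of time illustrates the exponentially slow barrier crossing of the purely local dynamics.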
5. An interacting particle implementation
As we mentioned earlier, the FPE (3) has a nice particle interpretation, since it governs the probability density of the Langevin diffusion (2). The BDL-FPE (6) does not have such a simple particle interpretation, due to the logarithmic nonlinearity in the birth-death term. To resolve this difficulty, given a smooth kernel function $K$ approximating the Dirac delta, we might approximate (6) by the equation
$$\partial_t \rho_t = \nabla\cdot(\rho_t \nabla V) + \Delta \rho_t - \Big(\log\frac{K*\rho_t}{\pi} - \int \log\frac{K*\rho_t}{\pi}\,\rho_t\,dx\Big)\rho_t. \qquad (13)$$
For this equation, the solution can be approximated by the empirical measure of a collection of $N$ interacting particles evolving as follows:
Step 1: between birth/death events, each particle diffuses independently according to (2).
Step 2: each particle also has an independent exponential clock with instantaneous birth-death rate
Specifically, if the rate $\beta_i$ of particle $X_i$ is positive, then $X_i$ is killed with instantaneous rate $\beta_i$ and another particle is duplicated at random to preserve the population size; if $\beta_i$ is negative, then $X_i$ is duplicated with instantaneous rate $|\beta_i|$ and another particle is killed at random to preserve the population size. Thus the total number of particles is preserved. The proposition below shows convergence of the empirical measure of the particle system described above to the solution of (13) in the large-particle limit. Its proof can be found in Appendix B.
Let $\mu_t^N$ be the empirical measure of the $N$ particles defined above. Assume that $\mu_0^N$ converges to $\rho_0$ as $N \to \infty$. Then for all $t > 0$, $\mu_t^N$ converges to $\rho_t$, where $\rho_t$ solves (13) with initial condition $\rho_0$.
To implement the birth-death particle dynamics above in practice, we also need time-discretization. In particular, discretizing the Langevin diffusion by the Euler-Maruyama scheme leads to the following birth-death accelerated Langevin sampler (BDLS).
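As a concrete (and deliberately simplified) illustration, the following is a minimal sketch of such a birth-death accelerated Langevin sampler for a one-dimensional standard Gaussian target. The target, kernel width, step size, clock discretization, and the pairing rule for duplications and kills are our own illustrative choices, not the paper's specification:

```python
import numpy as np

# Sketch of a birth-death accelerated Langevin sampler (illustrative setup):
# Euler-Maruyama steps for dX = -V'(X) dt + sqrt(2) dB, interleaved with
# kernel-based birth-death moves; target V(x) = x^2 / 2 (standard Gaussian).
rng = np.random.default_rng(0)

def grad_V(x):
    return x   # V(x) = x^2 / 2

def bdls(n_particles=500, n_steps=400, dt=0.01, delta=0.3):
    X = rng.normal(loc=3.0, scale=0.1, size=n_particles)   # poor initialization
    for _ in range(n_steps):
        # Step 1: Langevin (Euler-Maruyama) move.
        X = X - grad_V(X) * dt + np.sqrt(2 * dt) * rng.normal(size=n_particles)
        # Step 2: birth-death; approximate log rho via a Gaussian kernel estimate.
        diff = X[:, None] - X[None, :]
        kde = np.exp(-diff**2 / (2 * delta**2)).mean(axis=1) / (delta * np.sqrt(2 * np.pi))
        beta = np.log(kde) + X**2 / 2   # log(rho / pi) up to a constant
        beta -= beta.mean()             # mean-zero rate over the particles
        # Each particle fires with probability 1 - exp(-|beta_i| dt); a firing
        # particle is killed (beta_i > 0) or duplicated (beta_i < 0), paired
        # with a random duplicate/kill so the population size is unchanged.
        u = rng.random(n_particles)
        for i in np.nonzero(u < 1.0 - np.exp(-np.abs(beta) * dt))[0]:
            j = rng.integers(n_particles)
            if beta[i] > 0:
                X[i] = X[j]   # kill i, duplicate a random particle j
            else:
                X[j] = X[i]   # duplicate i, kill a random particle j
    return X

X = bdls()
```

With a poor initialization far from the mode, the birth-death moves reallocate mass globally while the Langevin moves provide local diffusion; the population size is exactly preserved at every step.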
6. Numerical results
In the numerical examples below, we compare the sampling efficiency of the proposed sampler BDLS of size $N$ with the sampler built from running $N$ independent copies of ULA (we call it parallel ULA, or simply ULA for short). We choose the kernel to be the Gaussian kernel with width $\delta$, i.e. $K_\delta(x) \propto e^{-|x|^2/(2\delta^2)}$. The kernel width $\delta$ varies between examples and is tuned to produce the best numerical performance; how to optimize the choice of $\delta$ on a sound theoretical basis is left for future work.
6.1. Example 1: multimodal distribution on a 1D torus.
Consider a multimodal target $\pi \propto e^{-V}$ with $V$ defined on the one-dimensional torus. We initialize the continuous dynamics and the particle systems according to a Gaussian distribution. This makes sampling difficult, since $\pi$ has four modes on the torus, while the initial density is very peaked and almost does not overlap with $\pi$. We show in the left panel of Figure 1 the convergence of the KL-divergence, from which one sees that the BDL-FPE (6) substantially accelerates the slow convergence of the FPE (3) of the Langevin diffusion, consistent with Theorem 3.3. The reason for the fast convergence of the BDE (5) is unclear to us and will be investigated in the future. To compare the particle algorithms, we plot in Figure 1 (middle and right panels) the mean square errors (MSE) of BDLS and ULA in estimating the mean and variance of the target versus the sample size. We see that BDLS performs much better than ULA. BDS performs the worst in this example (see the snapshots in Figure C.5): due to the absence of diffusion, the particles only rearrange themselves inside the small region around zero where they are initialized and never get out. We therefore do not plot the MSE of BDS, as it is too large to fit in the same figure. See Appendix C for the choice of the number of particles, the time step size, and the Gaussian kernel width, together with more implementation details and additional numerical results.
6.2. Example 2: two dimensional Gaussian mixture.
Consider now a target of two dimensional Gaussian mixture consisting of four components and initial particles sampled from the Gaussian , where the parameters are defined by
In this example, we use the same number of particles and time step size for both the ULA and BDLS algorithms. Figure 2 shows scatter plots, along with the corresponding marginals, of the particles computed using parallel ULA and BDLS at different numbers of iterations. The target distribution has a square shape and the particles are initialized within a small neighborhood of the top edge. Early in the iterations, the particles generated by BDLS already start equilibrating around all modes, whereas only very few particles generated by parallel ULA reach the bottom mode at the same time. We also compare the absolute errors of the estimated statistics in Figure 3. We find that the estimation errors of BDLS converge to their lowest values much faster than those of ULA.
Scatter plots of particles and their marginal distributions (computed by kernel density estimators) in Example 2. Top left figure displays the target density and the bottom left shows initial locations of particles. Each column in the rest shows the scatter plots and the marginal distributions of particles computed using parallel ULA (top, blue) and BDLS (bottom, red) at different iterations.
6.3. Example 3: Bayesian learning of Gaussian mixture model.
We consider the Bayesian approach to fitting the distribution of a dataset with a univariate Gaussian mixture model of three components. The unknown parameters are the means, the precisions, and the weights of the components. We use a prior with a hyperparameter describing the prior distribution of the precisions, thus defining a posterior distribution over the parameters. Due to the permutation invariance with respect to the component labels, the resulting posterior has multiple equivalent modes. We generate a synthetic dataset of 200 samples from the mixture measure with fixed "true" parameters. The data size is large enough to make the posterior peaked, so that hopping across different modes is challenging. We initialize the particles as iid samples from simple reference distributions. To compare the performance of BDLS with that of ULA, we show the evolution of the sampling particles in the mean coordinates in Figure 4 (see also Figure D.8 for snapshots in the precision coordinates). We see that the BDLS algorithm exhibits a stronger mode-exploration ability than ULA. Once all modes are identified, BDLS quickly redistributes the particles among the local modes towards the equilibrium through the birth-death process, while ULA takes a much longer time to equilibrate within the local modes. In fact, the distribution of the BDLS particles is already very close to the equilibrium well before that of ULA (see Figure D.9). Appendix D contains further details about the model and additional numerical results for this example.
7. Conclusion
We propose a new sampling dynamics based on a birth-death process, together with an interacting particle algorithm, to accelerate the classical Langevin dynamics for statistical sampling. Future directions include a rigorous analysis of the birth-death accelerated Langevin sampler and its further applications in combination with other conventional sampling schemes.
Appendix A Gradient flow structure
We first introduce a Riemannian structure on the space of smooth probability densities on $\mathbb{R}^d$. Consider the tangent space at a density $\rho$.
Since $\rho$ is a probability density, the tangent space can also be identified with the set of mean-zero perturbations.
Indeed, there is a "one-to-one" correspondence between the two identifications, since for any mean-zero tangent direction there exists a potential (determined uniquely up to a constant) solving the associated elliptic equation.
Here the relevant function space consists of functions that are square-integrable with respect to $\rho$. Informed by the Lagrangian minimization in the definition of the Wasserstein-Fisher-Rao distance (7), we also define the Riemannian metric tensor by
With this metric tensor, the Wasserstein-Fisher-Rao distance defined by (7) can be regarded as the geodesic distance on the space of probability densities with respect to this Riemannian metric, namely,
Let $E$ be a continuous and differentiable energy functional on the space of probability densities. Then the metric gradient of $E$ with respect to the metric tensor is
$$\mathrm{grad}\,E(\rho) = -\nabla\cdot\Big(\rho\,\nabla\frac{\delta E}{\delta\rho}\Big) + \Big(\frac{\delta E}{\delta\rho} - \int \frac{\delta E}{\delta\rho}\,\rho\,dx\Big)\rho.$$
As a result, the gradient flow of $E$ with respect to the Wasserstein-Fisher-Rao distance is given by
$$\partial_t \rho_t = \nabla\cdot\Big(\rho_t\,\nabla\frac{\delta E}{\delta\rho}\Big) - \Big(\frac{\delta E}{\delta\rho} - \int \frac{\delta E}{\delta\rho}\,\rho_t\,dx\Big)\rho_t.$$
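As a consistency check (a formal computation under the definitions above): take $E(\rho) = \mathrm{KL}(\rho\,\|\,\pi)$, whose first variation is $\delta E/\delta\rho = \log(\rho/\pi) + 1$. The additive constant $1$ drops out of both the gradient term and the mean-zero reaction term, and one recovers exactly the BDL-FPE (6):

```latex
\partial_t \rho_t
  = \nabla\cdot\Big(\rho_t\,\nabla\log\frac{\rho_t}{\pi}\Big)
    - \Big(\log\frac{\rho_t}{\pi} - \int \log\frac{\rho_t}{\pi}\,\rho_t\,dx\Big)\rho_t
  = \nabla\cdot(\rho_t \nabla V) + \Delta\rho_t - \alpha(x,\rho_t)\,\rho_t ,
```

using $\nabla\log(\rho/\pi) = \nabla\log\rho + \nabla V$.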
Appendix B Proofs of convergence results
Proof of Theorem 3.2.
Differentiating the KL-divergence in time gives
Then the theorem is proved by using (4) and the fact that the second term on the right side above is non-positive due to the Cauchy-Schwarz inequality. ∎
Proof of Theorem 3.3.
It suffices to assume . First, we claim that if (10) holds, then for all :
This is because the function satisfies
Now, under the evolution (6), the time derivative of is
We may ignore the first term on the right side since it is non-positive. Define . Then,
Observe that the functions and are both non-negative for , , and and if . Therefore, we have
holds for all . Since also holds, by assumption, this implies
for all , so that . Now, returning to (B.5), we have
for all .
Proof of Theorem 3.4.
If satisfies (5), then
So, $\mathrm{KL}(\rho_t\,\|\,\pi)$ is non-increasing and hence finite, since $\mathrm{KL}(\rho_0\,\|\,\pi)$ is finite. By the monotone convergence theorem the limit as $t \to \infty$ exists.
Next we show that the solution if . Let us denote . Then solves the equation
Hence, satisfies the relation
In particular, is increasing if , which implies that . As a result, . In addition, since and , we have
Because of this and (B.10), we conclude that for any , as , and as . However, since both and are probability densities, this implies , so that . Thus, and . ∎
Proof of Proposition 5.1.
We give a formal proof using the theory of measure-valued Markov processes; a similar proof strategy has been used recently in the literature. Our goal is to first derive the (infinite-dimensional) generator and the backward Kolmogorov equation of the measure-valued Markov process. To this end, let us define for any smooth functional the generator