Accelerating Langevin Sampling with Birth-death

May 23, 2019 · Yulong Lu et al. · Duke University

A fundamental problem in Bayesian inference and statistical machine learning is to efficiently sample from multimodal distributions. Due to metastability, multimodal distributions are difficult to sample using standard Markov chain Monte Carlo methods. We propose a new sampling algorithm based on a birth-death mechanism to accelerate the mixing of Langevin diffusion. Our algorithm is motivated by its mean field partial differential equation (PDE), which is a Fokker-Planck equation supplemented by a nonlocal birth-death term. This PDE can be viewed as a gradient flow of the Kullback-Leibler divergence with respect to the Wasserstein-Fisher-Rao metric. We prove that under some assumptions the asymptotic convergence rate of the nonlocal PDE is independent of the potential barrier, in contrast to the exponential dependence in the case of the Langevin diffusion. We illustrate the efficiency of the birth-death accelerated Langevin method through several analytical examples and numerical experiments.


1. Introduction

Numerical sampling of high dimensional probability distributions with unknown normalization has important applications in machine learning, Bayesian statistics, computational physics, and other related fields. The most popular approaches are based on Markov chain Monte Carlo (MCMC), including Langevin MCMC [40], underdamped Langevin MCMC [20], Hamiltonian Monte Carlo [41, 1, 36], bouncy particle and zigzag samplers [4, 3], etc. Many of these approaches, often combined with stochastic gradient [45, 30], have been widely used in machine learning.

When the target probability distribution is (strongly) log-concave, that is, when its density with respect to the Lebesgue measure is proportional to $e^{-V}$ with the potential $V$ being (strongly) convex, it is known that most of the sampling schemes mentioned above can produce independent random samples efficiently [12, 15, 31, 9, 16]. The sampling problem becomes much more challenging when the probability distribution exhibits multimodality, as it takes a much longer time for the sampling Markov chain to pass through low-probability regions of the phase space in order to explore and balance between multiple modes (a phenomenon known as metastability). Many enhanced sampling schemes have been proposed over the years to overcome this difficulty, including various tempering schemes [43, 19, 32, 35], biasing techniques [44, 24], and non-equilibrium sampling [37], just to name a few.

In this work, we propose a simple and novel sampling dynamics, based on a birth-death process, to overcome metastability. At the level of the evolution of the probability density, the proposed dynamics is given by

(1)   $\partial_t \rho = \nabla\cdot(\rho\nabla V) + \Delta\rho - \rho\Big(\log\frac{\rho}{\pi} - \mathbb{E}_{\rho}\Big[\log\frac{\rho}{\pi}\Big]\Big),$

where $\pi \propto e^{-V}$ is the target density, and one adds a non-local (due to the expectation) birth-death term to the conventional overdamped Langevin (Fokker-Planck) dynamics. The advantage of the birth-death process is that it allows a global move of the mass of the probability density directly from one mode to another, according to their relative weights, without having to travel through low-probability regions of the phase space, a difficulty suffered by any local dynamics such as the overdamped Langevin MCMC. It is possible to combine the birth-death process with other sampling dynamics, such as underdamped Langevin or various accelerated dynamics; for simplicity of presentation we focus only on the birth-death accelerated overdamped Langevin dynamics in the current work.

1.1. Contribution.

Our main theoretical result is that, under mild assumptions, the asymptotic convergence rate of the proposed sampling dynamics is independent of the barrier of the potential corresponding to the target measure; this is a substantial improvement over the overdamped Langevin diffusion, whose convergence rate can be exponentially small due to metastability. Moreover, we also establish a gradient flow structure of the birth-death accelerated Langevin dynamics: it is the gradient flow of the Kullback-Leibler (KL) divergence with respect to the Wasserstein-Fisher-Rao metric.

To demonstrate this improved convergence, we study two analytical examples showing a significant speedup of mixing compared to the Langevin diffusion. We also propose a practical interacting-particle sampling scheme as a numerical implementation of the birth-death accelerated Langevin dynamics. The efficiency of the proposed algorithm is illustrated through several numerical examples.

1.2. Related works.

The proposed sampling scheme involves interacting particles that undergo Langevin diffusion and a birth-death process. Other interacting-particle sampling schemes have been proposed recently, including the Stein variational gradient descent (SVGD) flow [28, 27] (see also its continuous limit studied in [29]). Unlike samplers based on the Stein discrepancy, which replace the random noise in Langevin dynamics by repulsion between particles, the scheme proposed in this work employs a birth-death process to enhance the mixing of existing sampling schemes. In fact, it could potentially also be combined with SVGD to improve its convergence.

The birth-death process has been used in sequential Monte Carlo (SMC) samplers [14], where the birth-death and branching mechanism serves to reduce the variance of the particle weights; in the scheme proposed here, it is instead used to globally move the sampling particles according to the target measure. The birth-death process has also recently been used to accelerate the training of neural networks in the mean-field regime [42], again for a rather different purpose than accelerating the convergence of Monte Carlo samplers.

Acknowledgement

The work of Jianfeng Lu and the work of James Nolen were partially funded through grants DMS-1454939 and DMS-1351653 from the National Science Foundation, respectively.

2. Fokker-Planck equation and birth-death process

2.1. Langevin dynamics and its Fokker-Planck equation.

Recall that the (overdamped) Langevin diffusion is the solution of the stochastic differential equation

(2)   $dX_t = -\nabla V(X_t)\,dt + \sqrt{2}\,dB_t,$

where $V:\mathbb{R}^d\to\mathbb{R}$ is the potential of the target density $\pi \propto e^{-V}$ and $B_t$ is a $d$-dimensional Brownian motion. Many popular sampling schemes are constructed from discretizations of (2), such as the unadjusted Langevin algorithm (ULA) and its Metropolized version, the Metropolis-adjusted Langevin algorithm (MALA) [40]. The probability density function $\rho_t$ of (2) solves the linear Fokker-Planck equation (FPE) on $\mathbb{R}^d$,

(3)   $\partial_t\rho = \nabla\cdot(\rho\nabla V) + \Delta\rho.$

The stationary distribution of (2) and (3) has density $\pi \propto e^{-V}$. In the seminal work of Jordan, Kinderlehrer, and Otto [22], the FPE (3) was identified as the gradient flow of the KL-divergence (i.e., relative entropy) $\mathrm{KL}(\rho\,\|\,\pi) = \int \rho\log\frac{\rho}{\pi}\,dx$ with respect to the 2-Wasserstein distance. Moreover, if the target measure $\pi$ satisfies the logarithmic Sobolev inequality (LSI) with constant $\lambda > 0$: for any probability density $\rho$,

(4)   $\mathrm{KL}(\rho\,\|\,\pi) \le \frac{1}{2\lambda}\int \rho\,\Big|\nabla\log\frac{\rho}{\pi}\Big|^2\,dx,$

then we have the exponential convergence $\mathrm{KL}(\rho_t\,\|\,\pi) \le e^{-2\lambda t}\,\mathrm{KL}(\rho_0\,\|\,\pi)$ along (3). The rate $\lambda$ may, however, be exponentially small when the target distribution is multimodal with a high potential barrier; see [5, 6].
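As a point of reference for the discrete-time algorithms discussed later, here is a minimal sketch of ULA, the Euler-Maruyama discretization of (2). The double-well potential and all parameter values are hypothetical choices for illustration, not settings taken from the paper.

import numpy as np

def ula(grad_V, x0, step=1e-3, n_iters=10_000, rng=None):
    # Unadjusted Langevin algorithm: Euler-Maruyama discretization of (2),
    # x_{k+1} = x_k - step * grad_V(x_k) + sqrt(2 * step) * xi_k,  xi_k ~ N(0, I).
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    samples = np.empty((n_iters,) + x.shape)
    for k in range(n_iters):
        x = x - step * grad_V(x) + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
        samples[k] = x
    return samples

# Hypothetical 1D double-well potential V(x) = (x^2 - 1)^2 / (2 * eps), for illustration only.
eps = 0.5
grad_V = lambda x: 2.0 * x * (x**2 - 1.0) / eps
chain = ula(grad_V, x0=[1.0], step=1e-3, n_iters=50_000)

With a high barrier (small eps), such a chain rarely crosses between the two wells; this is precisely the metastability that the birth-death mechanism introduced below is designed to bypass.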

2.2. Pure birth-death process.

The main idea of this work is to use a birth-death process to accelerate sampling. Before combining it with the Langevin dynamics, let us consider the pure birth-death equation (BDE) given by

(5)   $\partial_t\rho = -\rho\,\alpha_\rho, \qquad \alpha_\rho(x) = \log\frac{\rho(x)}{\pi(x)} - \int \rho\log\frac{\rho}{\pi}\,dx.$

As there is no spatial derivative involved in (5), it can be viewed as an (infinite) system of ordinary differential equations, indexed by $x\in\mathbb{R}^d$, coupled through the integral term in $\alpha_\rho$. Observe that $\pi$ is an invariant measure of (5). Moreover, equation (5) depends on $\pi$ only up to a multiplicative constant, making it feasible for sampling when the normalization constant is unknown. The definition of the birth/death rate $\alpha_\rho$ in (5) is very intuitive: ignoring the last integral term in the definition of $\alpha_\rho$, one sees that the solution to (5) adjusts the mass according to the difference between the logarithm of the current density and that of the target, so the density at a location $x$ increases if $\rho(x) < \pi(x)$ and decreases if $\rho(x) > \pi(x)$. The integral term in (5) is added to guarantee that the total integral of $\rho$ is conserved during the evolution, so that $\rho$ remains a probability density (positivity is also easy to verify).

The birth-death dynamics (5) differs substantially from the FPE (3) in many respects. The former is essentially a nonlinear system of ODEs (with a non-local coefficient), whereas the latter is a linear parabolic PDE. Due to the absence of diffusion, the support of the solution of (5) never grows during the evolution. This may seem to suggest that the birth-death equation is unsuitable for sampling. However, we show in Theorem 3.4 that if the initial density is positive everywhere, then $\rho_t$ converges to $\pi$ as $t\to\infty$.

2.3. Fokker-Planck equation with birth-death dynamics.

The real power of the birth-death process comes when it is combined with the Fokker-Planck equation (3), which yields the following equation at the level of the probability density, already stated in the introduction:

(6)   $\partial_t\rho = \nabla\cdot(\rho\nabla V) + \Delta\rho - \rho\,\alpha_\rho,$

where $\alpha_\rho(x) = \log\frac{\rho(x)}{\pi(x)} - \int\rho\log\frac{\rho}{\pi}\,dx$ as in (5). Before discussing the discretization of (6) in Section 5, which leads to an efficient particle sampler in practice, we study in what follows the Fokker-Planck equation of the birth-death accelerated Langevin dynamics (BDL-FPE) (6), in particular its favorable convergence properties compared to the original Langevin dynamics (3) and the pure birth-death process (5).
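Before turning to the particle discretization in Section 5, the qualitative behavior of (6) can already be explored with a crude finite-difference sketch in one dimension. Everything below (the periodic grid, the cosine potential, explicit Euler stepping) is an assumed toy setup for illustration, not the discretization used in the paper's experiments.

import numpy as np

# Toy 1D discretization of the BDL-FPE (6) on a periodic grid, explicit Euler in time.
n, dt = 256, 1e-4
x = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
dx = x[1] - x[0]
V = np.cos(2.0 * x)                       # hypothetical potential on the torus
pi = np.exp(-V); pi /= pi.sum() * dx      # target density, normalized on the grid

rho = np.exp(-((x - np.pi) ** 2) / 0.05)  # peaked initial density, far from equilibrium
rho /= rho.sum() * dx

def ddx(f):                               # centered first derivative, periodic
    return (np.roll(f, -1) - np.roll(f, 1)) / (2.0 * dx)

for _ in range(20_000):
    drift = ddx(rho * ddx(V))                                        # div(rho * grad V)
    diffusion = (np.roll(rho, -1) - 2.0 * rho + np.roll(rho, 1)) / dx**2
    alpha = np.log(rho / pi)
    alpha -= (alpha * rho).sum() * dx                                # subtract E_rho[log(rho/pi)]
    rho = rho + dt * (drift + diffusion - rho * alpha)
    # Dropping `drift + diffusion` above recovers the pure birth-death equation (5).
    rho = np.maximum(rho, 1e-300)                                    # guard against round-off before logs
    rho /= rho.sum() * dx                                            # renormalize against discretization drift

kl = (rho * np.log(rho / pi)).sum() * dx   # KL divergence, which should decay toward zero

Tracking kl inside the loop, with and without the birth-death term, gives a quick numerical impression of the acceleration analyzed in Section 3.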

3. Analysis of the Fokker-Planck equation with birth-death

3.1. Gradient flow structure

In parallel to the well-known fact that the FPE (3) is the 2-Wasserstein gradient flow of the KL-divergence, the BDL-FPE (6) can be viewed as a gradient flow of the KL-divergence with respect to a different metric. Our result is motivated by recent works on the Wasserstein-Fisher-Rao (WFR) distance [23, 10, 26, 33] in the study of unbalanced optimal transport. Specifically, we define the Wasserstein-Fisher-Rao distance (also known as the spherical Hellinger-Kantorovich distance [23, 7]) by

(7)   $d_{\mathrm{WFR}}(\rho_0,\rho_1)^2 = \inf_{(\rho_t,\phi_t)\in\mathcal{A}} \int_0^1\!\!\int_{\mathbb{R}^d} \Big( |\nabla\phi_t|^2 + \Big|\phi_t - \int\phi_t\rho_t\,dx\Big|^2 \Big)\,\rho_t\,dx\,dt,$

where the admissible set $\mathcal{A}$ consists of all pairs $(\rho_t,\phi_t)_{t\in[0,1]}$ such that $t\mapsto\rho_t$ is a narrowly continuous curve in $\mathcal{P}(\mathbb{R}^d)$ connecting $\rho_0$ and $\rho_1$ and such that

(8)   $\partial_t\rho_t + \nabla\cdot(\rho_t\nabla\phi_t) = \rho_t\Big(\phi_t - \int\phi_t\rho_t\,dx\Big).$

Here $\mathcal{P}(\mathbb{R}^d)$ denotes the space of probability measures on $\mathbb{R}^d$, and the potentials $\phi_t$ range over functions for which the action in (7) is finite. We emphasize that, for our sampling purpose, we have modified the original definition of the WFR distance in [2, 23, 10] by adding the integral penalty term, which keeps the total mass conserved; without this term, $\rho_t$ may gain or lose mass during the transport. Our first result is the following theorem, which characterizes the gradient flow structure of the BDL-FPE (6); its proof is provided in Appendix A.

Theorem 3.1.

The Fokker-Planck equation for birth-death accelerated Langevin (BDL-FPE) dynamics (6) is the gradient flow of the KL-divergence with respect to the Wasserstein-Fisher-Rao distance (7).

As a consequence of Theorem 3.1, the dynamics (6) dissipates the KL-divergence in a steepest descent manner with respect to the WFR metric (7), similar to the variational structure for the Fokker-Planck equation (3) (w.r.t. the 2-Wasserstein metric).
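To make the dissipation explicit, the gradient flow structure yields the following identity along solutions of (6); this is a sketch in the notation used above, obtained by pairing the right-hand side of (6) with $\log(\rho_t/\pi)$ and integrating by parts:

$\frac{d}{dt}\,\mathrm{KL}(\rho_t\,\|\,\pi) = -\int_{\mathbb{R}^d} \rho_t\,\Big|\nabla\log\frac{\rho_t}{\pi}\Big|^2\,dx \;-\; \mathrm{Var}_{\rho_t}\!\Big(\log\frac{\rho_t}{\pi}\Big) \;\le\; 0, \qquad \mathrm{Var}_{\rho}(f) := \int \rho f^2\,dx - \Big(\int \rho f\,dx\Big)^2.$

The first term is the dissipation already present for the Fokker-Planck equation (3); the second, variance term is contributed by the birth-death part and is the source of the barrier-independent rate established in Theorem 3.3 below.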

3.2. Convergence analysis

We now analyze the convergence of the BDL-FPE (6). Proofs of the results in this section can be found in Appendix B. We first establish, in the following theorem, the global convergence of (6) under the assumption that the LSI (4) holds.

Theorem 3.2.

Assume that $\pi$ satisfies the log-Sobolev inequality (4) with constant $\lambda > 0$. Then the solution $\rho_t$ to the BDL-FPE (6) with initial condition $\rho_0$ satisfies

(9)   $\mathrm{KL}(\rho_t\,\|\,\pi) \le e^{-2\lambda t}\,\mathrm{KL}(\rho_0\,\|\,\pi).$

Theorem 3.2 shows that the global convergence rate of the BDL-FPE (6) can be no worse than that of the FPE (3). The rate obtained this way is, however, fully characterized by the log-Sobolev constant, which may scale badly when the potential has high barriers. In contrast, we show in the next theorem that the birth-death term accelerates the diffusion dramatically, in the sense that the asymptotic convergence rate of the BDL-FPE (6) is independent of the potential barrier of $V$.

Theorem 3.3.

Let $\rho_t$ solve (6) for $t > 0$, with an initial condition $\rho_0$ that is not too far from $\pi$ in KL-divergence. Suppose also that the uniform lower-bound condition

(10)

holds for some (possibly small) constant. Then, after a short initial time, the exponential decay estimate

(11)

holds, with a rate independent of the potential barrier. In particular, the BDL-FPE (6) has an asymptotic convergence rate which is independent of the potential $V$ corresponding to $\pi$.

Theorem 3.3 states that as long as the solution of the BDL-FPE (6) is not too far from the target and satisfies a uniform (possibly tiny) lower bound with respect to $\pi$ at the initial time, it converges to $\pi$ at a rate independent of $V$ after a short waiting time. In practice, the initial density can be chosen so that condition (10) is satisfied; see Section 4 for examples.

For completeness, we also show the global convergence of the BDE (5) (without a rate) in the next theorem. The BDE has a similar gradient flow structure, which we will not detail here.

Theorem 3.4.

Let $\rho_t$ be the solution to (5) with initial condition $\rho_0$. Assume that $\mathrm{KL}(\rho_0\,\|\,\pi)$ is finite and that $\rho_0(x) > 0$ for all $x$. Then $\rho_t(x) > 0$ for all $t > 0$ and all $x$. Moreover, $\rho_t(x) \to \pi(x)$ for all $x$, and $\mathrm{KL}(\rho_t\,\|\,\pi) \to 0$ as $t\to\infty$.

4. Illustrative examples

Here we present two very simple examples illustrating how the combined dynamics of BDL-FPE (6) may significantly enhance convergence to equilibrium, compared to either FPE (3) or BDE (5).

4.1. Uniform distribution on torus.

Let $V \equiv 0$, and suppose the domain is the $d$-dimensional torus, with $\pi$ being the density of the uniform measure on it. In this case, the FPE dynamics (3) corresponds to the heat equation on the torus, and the rate of convergence to the equilibrium measure is governed by its spectral gap. While this convergence may be slow, the FPE dynamics (3) can be used to prepare a good initial condition for the combined BDL-FPE dynamics (6). Specifically, a lower bound on the heat kernel shows that after a short time the solution satisfies a pointwise lower bound of the form $\rho \ge c\,\pi$
for a universal constant $c$ that is independent of the initial data (assuming it is a probability measure) and of the dimension $d$. If we then use this density as initial data for the combined dynamics (6), the condition (10) holds, and Theorem 3.3 implies exponential convergence of the KL-divergence at a rate that does not depend on the dimension, after only a short initial time lag.

4.2. Double well.

Suppose the target has the double-well form $\pi \propto e^{-V/\varepsilon}$ for some small $\varepsilon > 0$, where $V$ is a smooth double-well potential, regarded as a density on the one-dimensional torus. This density has two modes, one at the bottom of each well, separated by a barrier. It is known that for this kind of potential the FPE dynamics (3) exhibits a metastability phenomenon, and the mixing time for the pure Langevin dynamics grows exponentially as $\varepsilon \to 0$ (see [5, 6, 34]).

Suppose that the initial density $\rho_0$ is the (normalized) restriction of $\pi$ to one of the two wells:

(12)   $\rho_0 = \frac{\pi\,\mathbf{1}_{A}}{\pi(A)}, \qquad A = \text{one of the two wells.}$

Then $\mathrm{KL}(\rho_0\,\|\,\pi) = \log 2$ (assuming the two wells are symmetric). If $\rho$ evolves according to the pure Langevin dynamics (3) for a short time, a lower bound on the heat kernel (via a large deviation type estimate [17], or by [38]) implies that at that time a pointwise lower bound of the form $\rho \ge c\,\pi$ holds
for some positive constant $c$, possibly very small. Suppose that afterwards $\rho$ evolves according to the combined dynamics (6). Theorem 3.3 then implies exponential decay of $\mathrm{KL}(\rho_t\,\|\,\pi)$ at a rate independent of $\varepsilon$. So, compared to the solution of the Langevin dynamics (3), the birth-death accelerated dynamics (6) exhibits a dramatic acceleration and converges to $\pi$ at a rate that is independent of $\varepsilon$ after only a brief initial delay.
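For concreteness, the value $\log 2$ above comes from the following one-line computation, a sketch under the assumption that the two wells are symmetric, so that $\pi(A) = 1/2$ and $\rho_0 = 2\pi$ on the well $A$:

$\mathrm{KL}(\rho_0\,\|\,\pi) = \int_A 2\pi(x)\,\log\frac{2\pi(x)}{\pi(x)}\,dx = \log 2\int_A 2\pi(x)\,dx = \log 2.$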

5. An interacting particle implementation

As we mentioned earlier, the FPE (3) has a nice particle interpretation, since it describes the probability density of the Langevin diffusion (2). The BDL-FPE (6) does not have such a simple particle interpretation, due to the logarithmic nonlinearity in the birth-death term. To resolve this difficulty, given a smooth kernel function $K$ approximating the Dirac delta, we approximate (6) by the kernelized equation

(13)   $\partial_t\rho = \nabla\cdot(\rho\nabla V) + \Delta\rho - \rho\Big( \log(K*\rho) + V - \int\big(\log(K*\rho) + V\big)\rho\,dx \Big),$ where $K*\rho$ denotes convolution; the normalization constant of $\pi$ drops out thanks to the mean subtraction.

For this equation, the solution can be approximated by the empirical measure of a collection of $N$ interacting particles $X_1,\dots,X_N$, evolving as follows:

Step 1: between birth/death events, each particle diffuses independently according to (2).

Step 2: each particle $X_i$ also has an independent exponential clock with instantaneous birth-death rate

(14)   $\beta_i = \log\Big(\frac{1}{N}\sum_{j=1}^N K(X_i - X_j)\Big) + V(X_i).$

Specifically, if $\beta_i > 0$, then the particle $X_i$ is killed with instantaneous rate $\beta_i$ and another particle, chosen uniformly at random from the rest, is duplicated to preserve the population size; if $\beta_i < 0$, then $X_i$ is duplicated with instantaneous rate $-\beta_i$ and another particle, chosen uniformly at random from the rest, is killed. Thus the total number of particles is preserved. The proposition below shows convergence of the empirical measure of the particle system described above to the solution of (13) in the large-particle limit; its proof can be found in Appendix B. (A runnable sketch of the resulting sampler is given after Algorithm 1 below.)

Proposition 5.1.

Let $\mu_t^N$ be the empirical measure of the $N$ particles defined above. Assume that $\mu_0^N$ converges to $\rho_0$ as $N\to\infty$. Then for all $t > 0$, $\mu_t^N$ converges to $\rho_t$, where $\rho_t$ solves (13) with initial condition $\rho_0$.

To implement the birth-death particle dynamics above in practice, we also need a time discretization. In particular, discretizing the Langevin diffusion with the Euler-Maruyama scheme leads to the following birth-death accelerated Langevin sampler (BDLS).

Input: a potential $V$ corresponding to the target distribution $\pi\propto e^{-V}$, a set of initial particles $\{X_i\}_{i=1}^N$, number of iterations $T$, time step $\tau$, kernel function $K$. Output: a set of particles $\{X_i\}_{i=1}^N$ whose empirical measure approximates $\pi$. for $k = 1,\dots,T$ do
          for $i = 1,\dots,N$ do
                  set $X_i \leftarrow X_i - \tau\nabla V(X_i) + \sqrt{2\tau}\,\xi_i$, where $\xi_i \sim \mathcal{N}(0, I_d)$
                  calculate $\beta_i = \log\big(\frac{1}{N}\sum_{j=1}^N K(X_i - X_j)\big) + V(X_i)$
                  set $p_i = 1 - e^{-|\beta_i|\tau}$
                  if $\beta_i > 0$
                           kill $X_i$ with probability $p_i$,
                           duplicate one particle that is uniformly chosen from the rest
                  else if $\beta_i < 0$
                           duplicate $X_i$ with probability $p_i$,
                           kill one particle that is uniformly chosen from the rest
                  end if
         end for
end for
Algorithm 1 BDLS: birth-death accelerated Langevin sampler
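The following NumPy sketch implements one reading of Algorithm 1. The Gaussian kernel, the rate formula as reconstructed in (14), the per-step jump probability $1 - e^{-|\beta_i|\tau}$, and the toy double-well target are illustrative assumptions rather than a transcription of the authors' code; in particular, the rates are computed once per sweep instead of particle by particle.

import numpy as np

def bdls(grad_V, V, X, n_iters=1_000, tau=1e-3, delta=0.2, rng=None):
    # Birth-death accelerated Langevin sampler (sketch). X: (N, d) array of particles.
    rng = np.random.default_rng() if rng is None else rng
    X = np.array(X, dtype=float)
    N, d = X.shape
    for _ in range(n_iters):
        # Step 1: one Euler-Maruyama step of the Langevin diffusion (2).
        X = X - tau * grad_V(X) + np.sqrt(2.0 * tau) * rng.standard_normal((N, d))

        # Kernel density estimate (K * mu_N)(X_i) with a Gaussian kernel of width delta.
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        kde = np.exp(-sq / (2.0 * delta**2)).mean(axis=1) / (2.0 * np.pi * delta**2) ** (d / 2)

        # Step 2: birth-death rates beta_i = log(kde_i) + V(X_i), as in (14).
        beta = np.log(kde) + V(X)
        p = 1.0 - np.exp(-np.abs(beta) * tau)        # jump probability over one time step
        u = rng.random(N)
        for i in np.nonzero(u < p)[0]:
            j = rng.integers(N - 1)
            j = j if j < i else j + 1                # uniformly chosen among the other particles
            if beta[i] > 0:
                X[i] = X[j]                          # kill X_i and duplicate X_j
            else:
                X[j] = X[i]                          # duplicate X_i and kill X_j
    return X

# Illustrative use on a hypothetical 1D double-well target pi ~ exp(-V), V(x) = (x^2 - 1)^2.
V = lambda X: np.sum((X**2 - 1.0) ** 2, axis=-1)
grad_V = lambda X: 4.0 * X * (X**2 - 1.0)
X0 = 1.0 + 0.1 * np.random.default_rng(0).standard_normal((500, 1))   # start near one mode
samples = bdls(grad_V, V, X0, n_iters=2_000, tau=1e-3, delta=0.2)

Starting all particles near one mode, the birth-death step quickly transfers particles to the other mode once a few diffusing particles have discovered it, which is the global move described in the introduction.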

6. Numerical results

In the numerical examples below, we compare the sampling efficiency of the proposed sampler BDLS with $N$ particles against the sampler built from running $N$ independent copies of ULA (which we call parallel ULA, or simply ULA for short). We choose the kernel $K$ to be the Gaussian kernel with width $\delta$, i.e. $K_\delta(x) = (2\pi\delta^2)^{-d/2}\exp\big(-|x|^2/(2\delta^2)\big)$. The kernel width varies across examples and is tuned to give the best numerical performance; how to optimize the choice of $\delta$ on a sound theoretical basis is left for future work.

6.1. Example 1: multimodal distribution on a 1D torus.

Consider a multimodal target defined on the one-dimensional torus. We initialize the continuous dynamics and the particle systems from a narrow Gaussian distribution centered near zero. This makes sampling the measure difficult, since the target has four modes on the torus, while the initial density is very peaked and has almost no overlap with the target. The left panel of Figure 1 shows the convergence of the KL-divergence, from which one sees that the BDL-FPE (6) substantially accelerates the slow convergence of the FPE (3) of the Langevin diffusion, consistent with Theorem 3.3. The reason for the fast convergence of the BDE (5) is unclear to us and will be investigated in the future. To compare the particle algorithms, we plot in the middle and right panels of Figure 1 the mean square errors (MSE) of BDLS and ULA in estimating the mean and the variance of the target, as a function of the number of particles. BDLS performs much better than ULA. The pure birth-death sampler (BDS) performs the worst in this example (see the snapshots in Figure C.5): due to the absence of diffusion, its particles only rearrange themselves inside the small region around zero where they are initialized and never escape it. We therefore do not plot the MSE of BDS, as it is too large to fit in the same figure. We use a fixed number of particles, time step size, and Gaussian kernel width in this example; see Appendix C for more implementation details and additional numerical results.

Figure 1. Convergence of the continuous dynamics and the particle systems in Example 1. The left panel shows the decay of the KL-divergence (in semi-log scale) along the evolution of the three continuous dynamics. The middle and right panels show, in log-log scale, the decay of the mean square error in estimating the mean and the variance, respectively, as the number of particles varies.

6.2. Example 2: two dimensional Gaussian mixture.

Consider now a two-dimensional target given by a Gaussian mixture with four components; the initial particles are sampled from a single, well-localized Gaussian.

In this example, we choose the same number of particles and the same time step size for both the ULA and BDLS algorithms. Figure 2 shows scatter plots, along with the corresponding marginals, of the particles computed using parallel ULA and BDLS after different numbers of iterations. The target distribution has a square shape and the particles are initialized within a small neighborhood of the top edge. Early in the run, the particles generated by BDLS already start equilibrating around all the modes, whereas only very few particles generated by parallel ULA have reached the bottom mode by the same iteration. We also compare the absolute errors of estimating several observables in Figure 3; the estimation errors of BDLS converge to their lowest values much faster than those of ULA.

Figure 2.

Scatter plots of the particles and their marginal distributions (computed by kernel density estimators) in Example 2. The top left panel displays the target density and the bottom left panel shows the initial locations of the particles. Each of the remaining columns shows the scatter plots and the marginal distributions of the particles computed using parallel ULA (top, blue) and BDLS (bottom, red) at different iterations.

(a)-(d) Absolute errors for four different observables.
Figure 3. The absolute errors of estimating various observables in Example 2. The blue dash-dot and red dotted lines show the estimation errors along the iterations for ULA and BDLS, respectively. For readability, the error is plotted every 400 iterations.

6.3. Example 3: Bayesian learning of Gaussian mixture model.

We consider the Bayesian approach to fitting the distribution of a dataset with a univariate Gaussian mixture model of three components, in the same setting as in [11]. The unknown parameters are the component means, the precisions, and the weights, which sum to one. We use the same prior as in [11], which has a hyperparameter describing the prior distribution of the precisions, thus defining a posterior distribution over all unknown parameters. Due to the permutation invariance with respect to the component labels, the resulting posterior has at least $3! = 6$ modes. We generate a synthetic dataset of 200 samples from the mixture measure with fixed "true" parameters. The data size is large enough to make the posterior peaked, so that hopping across the different modes is challenging. We use a fixed number of particles, time step size, and kernel width, and initialize the particles as i.i.d. samples from simple reference distributions. To compare the performance of BDLS with that of ULA, we show the evolution of the sampling particles in the coordinates of the component means in Figure 4 (see also Figure D.8 for snapshots in another parameter coordinate). The BDLS algorithm exhibits a stronger mode-exploration ability than ULA: once all the modes are identified, BDLS quickly redistributes the particles among the local modes towards the equilibrium through the birth-death process, while ULA takes a much longer time to equilibrate within and across the local modes. In fact, the distribution of the BDLS particles in the mean coordinates is already very close to the equilibrium after a moderate number of iterations (see Figure D.9). Appendix D contains further details about the model and additional numerical results for this example.

Figure 4. Evolution of the particles in the mean coordinates for Example 3. The first column shows the histogram of the synthetic data (top) and the initial locations of the particles in the mean coordinates (bottom). The remaining columns compare the scatter plots of the particles in the mean coordinates and their marginals, computed using parallel ULA (top, blue) and BDLS (bottom, red), at different iterations.

7. Conclusion

We propose a new sampling dynamics based on a birth-death process, together with an interacting-particle algorithm, to accelerate the classical Langevin dynamics for statistical sampling. Future directions include a rigorous analysis of the birth-death accelerated Langevin sampler and its further applications in combination with other conventional sampling schemes.

Appendix A Gradient flow structure

This section is devoted to the proof of Theorem 3.1. We mainly follow [39] and [18].

We first introduce a Riemannian structure, denoted by $g$, on the space $\mathcal{P}$ of smooth, everywhere positive probability densities on $\mathbb{R}^d$. Consider the tangent space at $\rho\in\mathcal{P}$,

$T_\rho\mathcal{P} = \Big\{ s : \int_{\mathbb{R}^d} s\,dx = 0 \Big\}.$

Since $\rho$ is a probability density, the tangent space can also be identified with a space of potential functions $\phi$. Indeed, there is a "one-to-one" correspondence between $s$ and $\phi$: for any $s$ such that $\int s\,dx = 0$, there exists $\phi$ (determined uniquely up to a constant) solving

$s = -\nabla\cdot(\rho\nabla\phi) + \rho\Big(\phi - \int\phi\,\rho\,dx\Big).$

Here $\phi$ is taken in a space of functions for which the integrals below are finite. Informed by the Lagrangian minimization in the definition of the Wasserstein-Fisher-Rao distance (7), we also define the Riemannian metric tensor $g_\rho$ by

(A.1)   $g_\rho(s_1, s_2) = \int_{\mathbb{R}^d} \Big( \nabla\phi_1\cdot\nabla\phi_2 + \Big(\phi_1 - \int\phi_1\rho\,dx\Big)\Big(\phi_2 - \int\phi_2\rho\,dx\Big) \Big)\,\rho\,dx,$

where $\phi_i$ corresponds to $s_i$ as above. With this metric tensor $g$, the Wasserstein-Fisher-Rao distance defined by (7) can be regarded as the geodesic distance on $\mathcal{P}$ with the Riemannian metric $g$, namely,

(A.2)   $d_{\mathrm{WFR}}(\rho_0,\rho_1)^2 = \inf\Big\{ \int_0^1 g_{\rho_t}(\partial_t\rho_t,\partial_t\rho_t)\,dt \;:\; (\rho_t)_{t\in[0,1]}\subset\mathcal{P} \text{ connects } \rho_0 \text{ and } \rho_1 \Big\}.$
Proposition A.1.

Let $F:\mathcal{P}\to\mathbb{R}$ be a continuous and differentiable energy functional. Then the metric gradient of $F$ with respect to the metric tensor $g$ is

(A.3)   $\mathrm{grad}_g F(\rho) = -\nabla\cdot\Big(\rho\,\nabla\frac{\delta F}{\delta\rho}\Big) + \rho\Big(\frac{\delta F}{\delta\rho} - \int \frac{\delta F}{\delta\rho}\,\rho\,dx\Big).$

As a result, the gradient flow of $F$ with respect to the Wasserstein-Fisher-Rao distance is given by

(A.4)   $\partial_t\rho = -\mathrm{grad}_g F(\rho) = \nabla\cdot\Big(\rho\,\nabla\frac{\delta F}{\delta\rho}\Big) - \rho\Big(\frac{\delta F}{\delta\rho} - \int \frac{\delta F}{\delta\rho}\,\rho\,dx\Big).$
Proof.

Let $(\rho_\varepsilon)_{\varepsilon\in(-1,1)}$ be a curve passing through $\rho_0 = \rho$ with tangent vector $s = \partial_\varepsilon\rho_\varepsilon\big|_{\varepsilon=0}$. The gradient with respect to $g$ is defined by

(A.5)   $g_\rho\big(\mathrm{grad}_g F(\rho),\, s\big) = \frac{d}{d\varepsilon}F(\rho_\varepsilon)\Big|_{\varepsilon=0} = \int \frac{\delta F}{\delta\rho}\, s\,dx.$

By the definition of the Riemannian metric in (A.1), the right-hand side above equals $g_\rho$ applied to $s$ and the candidate gradient given in (A.3). Since $s$ is arbitrary, this proves (A.3), and hence (A.4) follows. ∎

Proof of Theorem 3.1.

This is a direct consequence of Proposition A.1 and the fact that the functional derivative of the KL-divergence is $\frac{\delta}{\delta\rho}\mathrm{KL}(\rho\,\|\,\pi) = \log\frac{\rho}{\pi} + 1$; the additive constant plays no role, since it is annihilated both by the gradient and by the mean-subtraction in (A.4). ∎

Appendix B Proofs of convergence results

In this section we prove Theorem 3.2, Theorem 3.3, Theorem 3.4, and Proposition 5.1.

Proof of Theorem 3.2.

Differentiating $\mathrm{KL}(\rho_t\,\|\,\pi)$ in time along (6) gives

$\frac{d}{dt}\mathrm{KL}(\rho_t\,\|\,\pi) = -\int \rho_t\Big|\nabla\log\frac{\rho_t}{\pi}\Big|^2 dx \;-\; \Big[\int \rho_t\Big(\log\frac{\rho_t}{\pi}\Big)^2 dx - \Big(\int \rho_t\log\frac{\rho_t}{\pi}\,dx\Big)^2\Big].$

The theorem then follows by applying (4) to the first term and using the fact that the second term on the right-hand side is non-positive due to the Cauchy-Schwarz inequality. ∎

Proof of Theorem 3.3.

It suffices to assume . First, we claim that if (10) holds, then for all :

(B.1)

This is because the function satisfies

(B.2)

where . By the maximum principle, the minimum of , which must be negative, cannot decrease. In fact, (10) and (B.2) imply so that , which is (B.1). In particular, if , then we have

(B.3)

Now, under the evolution (6), the time derivative of is

(B.4)

We may ignore the first term on the right side since it is non-positive. Define . Then,

and

Observe that the functions and are both non-negative for , , and and if . Therefore, we have

The condition (B.3) implies for all . Combining these observations with (B.4), we see that

(B.5)

holds for all . Since also holds, by assumption, this implies

(B.6)

for all , so that . Now, returning to (B.5), we have

(B.7)

for all .

Proof of Theorem 3.4.

If satisfies (5), then

(B.8)

So, is non-increasing and hence is finite since is finite. By the monotone convergence theorem there exists such that .

Next we show that the solution if . Let us denote . Then solves the equation

(B.9)

Hence, satisfies the relation

(B.10)

In particular, is increasing if , which implies that . As a result, . In addition, since and , we have

(B.11)

Because of this and (B.10), we conclude that for any , as , and as . However, since both and are probability densities, this implies , so that . Thus, and . ∎

Proof of Proposition 5.1.

We give a formal proof using the theory of measure-valued Markov processes [13]; a similar proof strategy was recently used in [42]. Our goal is to first derive the (infinite-dimensional) generator and the backward Kolmogorov equation of the measure-valued Markov process $t\mapsto\mu_t^N$. To this end, let us define, for any smooth test functional, the generator