Suppose that we would like to sample from a density
$$p^*(x) = \frac{1}{Z} e^{-f(x)},$$
where $Z = \int_{\mathbb{R}^d} e^{-f(x)}\,dx$ is the normalizing constant. We know $f$, but we do not know the normalizing constant $Z$. This comes up, for example, in variational inference, when the normalization constant is computationally intractable.
One way to sample from $p^*$ is to consider the Langevin diffusion:
$$dx_t = -\nabla f(x_t)\,dt + \sqrt{2}\,dB_t, \qquad x_0 \sim p_0,$$
where $p_0$ is some initial distribution and $B_t$ is Brownian motion (see Section 4). The stationary distribution of the above SDE is $p^*$.
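As a concrete illustration, the Euler-Maruyama discretization of this diffusion can be simulated directly. The following is a minimal sketch, assuming the simple one-dimensional target $f(x) = x^2/2$ (so $p^* = \mathcal{N}(0, 1)$); the step size, iteration count, and burn-in fraction are illustrative choices, not values from the paper.

```python
import numpy as np

# Minimal sketch: Euler-Maruyama discretization of the Langevin diffusion
# targeting p*(x) proportional to exp(-f(x)), with f(x) = x^2 / 2, so that
# the stationary distribution is the standard Gaussian N(0, 1).

def grad_f(x):
    return x  # gradient of f(x) = x^2 / 2

rng = np.random.default_rng(0)
eta = 0.01           # step size (illustrative)
n_steps = 200_000
x = 5.0              # start far from the mode
samples = []
for k in range(n_steps):
    # x_{k+1} = x_k - eta * grad f(x_k) + sqrt(2 eta) * standard Gaussian noise
    x = x - eta * grad_f(x) + np.sqrt(2 * eta) * rng.standard_normal()
    if k > n_steps // 2:  # discard the first half as burn-in
        samples.append(x)

samples = np.array(samples)
print(np.mean(samples), np.var(samples))  # approximately 0 and 1
```

The small residual bias in the variance (of order the step size) is exactly the discretization error that the analysis in this paper controls.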
Previous works have shown the convergence of (4) in both total variation distance (Dalalyan, 2016; Durmus and Moulines, 2015) and 2-Wasserstein distance (Durmus and Moulines, 2016). The approach in these papers relies on first showing the convergence of (1), and then bounding the discretization error between (4) and (2).
In this paper, our main goal is to establish the convergence of the distribution of the iterates of (4) in KL-divergence. KL-divergence is perhaps the most natural notion of distance between probability distributions in this context, because of its close relationship to maximum likelihood estimation, its interpretation as information gain in Bayesian statistics, and its central role in information theory. Convergence in KL-divergence implies convergence in total variation and 2-Wasserstein distance, so we are able to obtain convergence rates in total variation and 2-Wasserstein distance that are comparable to the results of Dalalyan (2016) and Durmus and Moulines (2015; 2016).
2 Related Work
The first non-asymptotic analysis of the discrete Langevin diffusion (4) was due to Dalalyan (2016). This was soon followed by the work of Durmus and Moulines (2015), which improved upon Dalalyan's results. Subsequently, Durmus and Moulines (2016) also established convergence of (4) in the 2-Wasserstein distance. We remark that the proofs of Lemmas 7, 11 and 13 are essentially taken from these works.
Bubeck, Eldan and Lehec (2015) studied the setting in which $f$ is not smooth. This is important, for example, when we want to sample from the uniform distribution over some convex set, so that $f$ is the indicator function of that set.
Very recently, Dalalyan and Karagulyan (2017) proved the convergence of Langevin Monte Carlo when only stochastic estimates of the gradient are available.
Our work also borrows heavily from the theory established in the book of Ambrosio, Gigli and Savaré (2008), which studies the evolution of the probability distribution induced by (1) as a gradient flow in probability space. This allows us to view (4) as a deterministic convex optimization procedure over the space of probability distributions, with KL-divergence as the objective. This beautiful line of work relating SDEs to gradient flows in probability space was begun by Jordan, Kinderlehrer and Otto (1998). We refer the interested reader to the excellent survey by Santambrogio (2017).
3 Our Contribution
Our main contribution is establishing the first non-asymptotic convergence rate in Kullback-Leibler divergence for (4) when $f$ is strongly convex and smooth (see Theorem 3). As a consequence, we also unify the proofs of convergence in total variation and 2-Wasserstein distance as simple corollaries of the convergence in KL-divergence.
The following table compares the number of iterations of (3) required to achieve error $\epsilon$ in each of the three quantities, according to the analyses of various papers.
We denote by $\mathcal{P}(\mathbb{R}^d)$ the space of all probability distributions over $\mathbb{R}^d$. In the rest of this paper, only distributions with densities with respect to the Lebesgue measure will appear (see Lemma 16), both in the algorithm and in the analysis. With some abuse of notation, we use the same symbol (e.g. $\mu$) to denote both a probability distribution and its density with respect to the Lebesgue measure.
We let $B_t$ denote the $d$-dimensional Brownian motion.
Let $p^* \propto e^{-f}$ be the target distribution, where $f$ has $L$-Lipschitz continuous gradients and is $m$-strongly convex, i.e. for all $x, y \in \mathbb{R}^d$:
$$\frac{m}{2}\,\|x - y\|_2^2 \;\le\; f(y) - f(x) - \langle \nabla f(x),\, y - x\rangle \;\le\; \frac{L}{2}\,\|x - y\|_2^2.$$
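The two conditions can be checked numerically in a simple case. The sketch below assumes a quadratic $f(x) = \frac{1}{2}x^\top A x$ whose Hessian eigenvalues are placed in $[m, L]$ (all constants here are illustrative), and verifies the sandwich inequality on random pairs of points.

```python
import numpy as np

# Sketch: numerically verify
#   m/2 ||x - y||^2 <= f(y) - f(x) - <grad f(x), y - x> <= L/2 ||x - y||^2
# for f(x) = 0.5 x^T A x, where A is symmetric with eigenvalues in [m, L].

rng = np.random.default_rng(1)
d = 5
m_, L_ = 0.5, 4.0
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
eigs = rng.uniform(m_, L_, size=d)                # eigenvalues inside [m, L]
A = Q @ np.diag(eigs) @ Q.T

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

for _ in range(100):
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    gap = f(y) - f(x) - grad(x) @ (y - x)   # Bregman gap of f
    sq = 0.5 * np.dot(x - y, x - y)
    assert m_ * sq - 1e-9 <= gap <= L_ * sq + 1e-9
print("sandwich inequality holds on 100 random pairs")
```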
For a given initial distribution $p_0$, the Exact Langevin Diffusion is given by the following stochastic differential equation (recall $B_t$ is $d$-dimensional Brownian motion):
$$dx_t = -\nabla f(x_t)\,dt + \sqrt{2}\,dB_t, \qquad x_0 \sim p_0.$$
(This is identical to (1), restated here for ease of reference.) For a given initial distribution $p_0$, and for a given stepsize $\eta$, the Langevin MCMC Algorithm is given by the following iteration:
$$y_{k+1} = y_k - \eta\,\nabla f(y_k) + \sqrt{2\eta}\,\xi_k, \qquad \xi_k \sim \mathcal{N}(0, I_d), \qquad y_0 \sim p_0.$$
For a given initial distribution $p_0$ and stepsize $\eta$, the Discretized Langevin Diffusion is given by the following SDE:
$$d\tilde{x}_t = -\nabla f\big(\tilde{x}_{\lfloor t/\eta\rfloor \eta}\big)\,dt + \sqrt{2}\,dB_t, \qquad \tilde{x}_0 \sim p_0.$$
For the rest of this paper, we will use $\tilde{p}_t$ exclusively to denote the distribution of the Discretized Langevin Diffusion (4) at time $t$.
We assume without loss of generality that the minimizer of $f$ is $x^* = 0$, and that $f(0) = 0$. (We can always shift the space to achieve this, and the minimizer of $f$ is easy to find using, say, gradient descent.)
For the rest of this paper, we will let
$$F(\mu) := \mathrm{KL}(\mu \,\|\, p^*) = \int \mu(x) \log \frac{\mu(x)}{p^*(x)}\,dx$$
be the KL-divergence between $\mu$ and $p^*$. It is well known that $F$ is minimized by $p^*$, and $F(p^*) = 0$.
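To make the definition concrete, the following sketch evaluates $\mathrm{KL}(\mu\,\|\,p^*)$ numerically for one-dimensional Gaussians (illustrative choices, not distributions from the paper), compares against the known closed form, and checks that the divergence vanishes at $\mu = p^*$.

```python
import numpy as np

# Sketch: KL(mu || p*) for 1-D Gaussians, via a numerical integral of
# mu * log(mu / p*), compared with the Gaussian closed form; the
# divergence is zero exactly when mu = p*.

def gauss(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x = np.linspace(-20, 20, 400_001)
dx = x[1] - x[0]

def kl(mean1, var1, mean2, var2):
    mu, p = gauss(x, mean1, var1), gauss(x, mean2, var2)
    return np.sum(mu * np.log(mu / p)) * dx

# Closed form KL(N(1,1) || N(0,2)) = log(s2/s1) + (s1^2 + (m1-m2)^2)/(2 s2^2) - 1/2
closed = 0.5 * np.log(2.0) + (1.0 + 1.0) / 4.0 - 0.5
print(kl(1.0, 1.0, 0.0, 2.0), closed)  # the two values agree
print(kl(0.0, 2.0, 0.0, 2.0))          # KL(p*, p*) = 0
```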
Finally, given a vector field $v : \mathbb{R}^d \to \mathbb{R}^d$ and a distribution $\mu$, we define the $L^2(\mu)$-norm of $v$ as
$$\|v\|_{L^2(\mu)}^2 := \int \|v(x)\|_2^2\, \mu(x)\,dx.$$
4.1 Background on Wasserstein distance and curves in $\mathcal{P}(\mathbb{R}^d)$
Given two distributions $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$, let $\Gamma(\mu, \nu)$
be the set of all joint distributions over the product space $\mathbb{R}^d \times \mathbb{R}^d$ whose marginals equal $\mu$ and $\nu$ respectively. ($\Gamma(\mu, \nu)$ is the set of all couplings of $\mu$ and $\nu$.)
The 2-Wasserstein distance is defined as
$$W_2^2(\mu, \nu) := \inf_{\gamma \in \Gamma(\mu, \nu)} \int \|x - y\|_2^2\, d\gamma(x, y).$$
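In one dimension the optimal coupling is the monotone (quantile) coupling, so $W_2$ between two empirical samples can be estimated by sorting both samples and pairing them in order. A minimal sketch, assuming Gaussian marginals (illustrative choices) so that the closed form $W_2^2 = (m_1 - m_2)^2 + (\sigma_1 - \sigma_2)^2$ is available for comparison:

```python
import numpy as np

# Sketch: estimate W2 between two 1-D empirical distributions via the
# monotone coupling (sort both samples, pair in order), and compare with
# the Gaussian closed form W2^2 = (m1 - m2)^2 + (s1 - s2)^2.

rng = np.random.default_rng(2)
n = 200_000
a = rng.normal(0.0, 1.0, n)   # samples from N(0, 1)
b = rng.normal(3.0, 2.0, n)   # samples from N(3, 4)

w2_sq = np.mean((np.sort(a) - np.sort(b)) ** 2)
closed = (0.0 - 3.0) ** 2 + (1.0 - 2.0) ** 2   # = 10
print(w2_sq, closed)  # estimate is close to the closed form
```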
Let $(X, \mathcal{F})$ and $(Y, \mathcal{G})$ be two measurable spaces, $\mu$ be a measure on $X$, and $T : X \to Y$ be a measurable map. The push-forward measure of $\mu$ through $T$ is defined as
$$T_\#\mu(A) := \mu\big(T^{-1}(A)\big) \quad \text{for all } A \in \mathcal{G}.$$
Intuitively, for $x \sim \mu$, we have $T(x) \sim T_\#\mu$.
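This intuition admits a quick numerical sketch (the map and distribution below are arbitrary illustrative choices): pushing $\mu = \mathcal{N}(0, 1)$ forward through $T(x) = 2x + 3$ should give $T_\#\mu = \mathcal{N}(3, 4)$.

```python
import numpy as np

# Sketch of the push-forward: if x ~ mu and T is measurable, then
# T(x) ~ T_# mu. Here mu = N(0, 1) and T(x) = 2x + 3, so T_# mu = N(3, 4).

rng = np.random.default_rng(3)
x = rng.standard_normal(500_000)   # samples from mu = N(0, 1)
y = 2 * x + 3                      # samples from T_# mu
print(np.mean(y), np.var(y))       # approximately 3 and 4
```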
It is a well known result that for any two distributions $\mu$ and $\nu$ which have densities wrt the Lebesgue measure, the optimal coupling is induced by a map $T$, i.e. the infimum in the definition of $W_2$ is attained by
$$\gamma = (\mathrm{Id} \times T)_\#\, \mu,$$
where $\mathrm{Id}$ is the identity map, and $T$ satisfies $T_\#\mu = \nu$, so by definition $W_2^2(\mu, \nu) = \int \|x - T(x)\|_2^2\, \mu(x)\,dx$. We call $T$ the optimal transport map, and $v := T - \mathrm{Id}$ the optimal displacement map.
Given two points $\mu_0$ and $\mu_1$ in $\mathcal{P}(\mathbb{R}^d)$, a curve $(\mu_t)_{t \in [0,1]}$ is a constant-speed geodesic between $\mu_0$ and $\mu_1$ if its endpoints are $\mu_0$ and $\mu_1$, and $W_2(\mu_s, \mu_t) = |t - s|\, W_2(\mu_0, \mu_1)$ for all $s, t \in [0, 1]$. If $v$ is the optimal displacement map between $\mu_0$ and $\mu_1$, then the constant-speed geodesic is nicely characterized by
$$\mu_t = (\mathrm{Id} + t\,v)_\#\, \mu_0.$$
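For one-dimensional Gaussians this characterization can be verified in closed form. The sketch below assumes $\mu_0 = \mathcal{N}(0, 1)$ and $\mu_1 = \mathcal{N}(3, 4)$ (illustrative choices), for which the optimal map is $T(x) = 2x + 3$ and the geodesic is $\mu_t = \mathcal{N}(3t, (1 + t)^2)$, and checks the constant-speed property $W_2(\mu_s, \mu_t) = |t - s|\,W_2(\mu_0, \mu_1)$.

```python
import numpy as np

# Sketch: constant-speed geodesic between 1-D Gaussians. With
# mu0 = N(0, 1) and mu1 = N(3, 4), the optimal map is T(x) = 2x + 3,
# so mu_t = ((1 - t) Id + t T)_# mu0 = N(3t, (1 + t)^2). For Gaussians,
# W2(N(a, s^2), N(b, r^2)) = sqrt((a - b)^2 + (s - r)^2).

def w2(m1, s1, m2, s2):
    return np.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)

total = w2(0, 1, 3, 2)  # W2(mu0, mu1) = sqrt(10)
for s, t in [(0.0, 0.25), (0.25, 0.5), (0.1, 0.9)]:
    lhs = w2(3 * s, 1 + s, 3 * t, 1 + t)  # W2(mu_s, mu_t)
    print(lhs, abs(t - s) * total)        # equal for each pair
```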
Given a curve $(\mu_t)$, we define its metric derivative as
$$|\mu'|(t) := \lim_{s \to t} \frac{W_2(\mu_s, \mu_t)}{|s - t|}.$$
Intuitively, this is the speed of the curve in 2-Wasserstein distance. We say that a curve is absolutely continuous if $|\mu'|(t) < \infty$ for all $t$.
Given a curve $(\mu_t)$ and a family of velocity fields $(v_t)$, we say that $\mu_t$ and $v_t$ satisfy the continuity equation at $t$ if
$$\frac{\partial}{\partial t}\mu_t + \nabla \cdot (v_t\, \mu_t) = 0.$$
(We assume that $\mu_t$ has density wrt the Lebesgue measure for all $t$.)
We say that $v_t$ is tangent to $\mu_t$ at $t$ if the continuity equation holds and $\|v_t\|_{L^2(\mu_t)} \le \|w_t\|_{L^2(\mu_t)}$ for all $w_t$ such that $\mu_t$ and $w_t$ satisfy the continuity equation. Intuitively, $v_t$ is tangent to $\mu_t$ if it minimizes the $L^2(\mu_t)$-norm among all velocity fields that satisfy the continuity equation.
5 Preliminary Lemmas
This section presents some basic results needed for our main theorem.
5.1 Calculus over $\mathcal{P}(\mathbb{R}^d)$
In this section, we present some crucial lemmas which allow us to study the evolution of $F(\mu_t)$ along a curve $(\mu_t)$. These results are all immediate consequences of results proven in the book of Ambrosio, Gigli and Savaré (2008).
For any $\mu \in \mathcal{P}(\mathbb{R}^d)$, let the first variation of $F$ at $\mu$ be defined as $\frac{\delta F}{\delta \mu} = \log\frac{\mu}{p^*} + 1$. Let the subdifferential of $F$ at $\mu$ be given by $\nabla \frac{\delta F}{\delta \mu} = \nabla \log \frac{\mu}{p^*}$. For any curve $(\mu_t)$, and for any $(v_t)$ that satisfies the continuity equation for $(\mu_t)$ (see equation (7)), the following holds:
$$\frac{d}{dt}\, F(\mu_t) = \int \Big\langle \nabla \log \frac{\mu_t(x)}{p^*(x)},\, v_t(x) \Big\rangle\, \mu_t(x)\,dx.$$
Based on Lemma 1, we define (for any $\mu \in \mathcal{P}(\mathbb{R}^d)$) the operator
$$G_\mu(v) := \int \Big\langle \nabla \log \frac{\mu(x)}{p^*(x)},\, v(x) \Big\rangle\, \mu(x)\,dx.$$
$G_\mu(v)$ is linear in $v$.
Let $(\mu_t)$ be an absolutely continuous curve in $\mathcal{P}(\mathbb{R}^d)$ with tangent velocity field $(v_t)$. Let $|\mu'|(t)$ be the metric derivative of $(\mu_t)$.
For any , let , then
Furthermore, for any absolutely continuous curve $(\mu_t)$ with tangent velocity $(v_t)$, we have $|\mu'|(t) = \|v_t\|_{L^2(\mu_t)}$.
Let be an absolutely continuous curve with tangent velocity field . Then
5.2 Exact and Discrete Gradient Flow for
In this section, we will study the curve of distributions defined by (4). Unless otherwise specified, we will assume that the initial distribution $p_0$ is arbitrary.
Let be as defined in (4).
For any given and for all , we define a stochastic process as
(We let $\nu_t$ denote the distribution of this process at time $t$.)
From onwards, this is the exact Langevin diffusion with as the initial distribution (compare with expression (2)).
Finally, for each , we define a sequence by
(We let $\rho_k$ denote the distribution of this sequence at step $k$.)
represents the discretization error of through the divergence between and (formally stated in Lemma 5). Note that because .
Our proof strategy is as follows:
- In Lemma 5, we show that the divergence between the discretized Langevin diffusion and the exact Langevin diffusion can be represented as a curve.
- In Lemma 6, we show that the "decrease in $F$ due to the exact Langevin diffusion" is sufficiently negative.
- In Lemma 7, we show that the "discretization error" is small.
Added together, these imply that the overall change in $F$ per step is sufficiently negative.
For all and
6 Strong Convexity Result
In this section, we study the consequence of assuming strong convexity and smoothness of .
6.1 Theorem statement and discussion
Let and be as defined in (4) with .
The above theorem immediately allows us to obtain the convergence rate of Langevin MCMC in both total variation and 2-Wasserstein distance.
Using the choice of and in Theorem 3, we get
The first item follows from Pinsker’s inequality. The second item follows from (12), where we take the two distributions to be the output distribution and $p^*$. To achieve accuracy $\epsilon$ in total variation or $W_2$, we apply Theorem 3 with the correspondingly chosen KL accuracy in each case.
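Pinsker's inequality, $\mathrm{TV}(\mu, \nu) \le \sqrt{\mathrm{KL}(\mu\,\|\,\nu)/2}$, can be sanity-checked numerically. The sketch below uses shifted standard Gaussians (an illustrative family, not from the paper), for which both sides have closed forms: $\mathrm{TV} = \mathrm{erf}\!\big(|a|/(2\sqrt{2})\big)$ and $\mathrm{KL} = a^2/2$.

```python
import math

# Sketch: check Pinsker's inequality TV(mu, nu) <= sqrt(KL(mu || nu) / 2)
# for mu = N(a, 1) and nu = N(0, 1), where in closed form
# TV = erf(|a| / (2 sqrt(2))) and KL = a^2 / 2.

for a in [0.1, 0.5, 1.0, 2.0, 5.0]:
    tv = math.erf(a / (2 * math.sqrt(2)))
    kl = a * a / 2
    print(a, tv, math.sqrt(kl / 2))  # tv is always at most sqrt(kl / 2)
```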
6.2 Proof of Theorem 3
We now state the lemmas needed to prove Theorem 3. We first establish a notion of strong convexity of $F$ with respect to the $W_2$ metric.
If $f$ is $m$-strongly convex, then for all $\mu_0, \mu_1 \in \mathcal{P}(\mathbb{R}^d)$ and $t \in [0, 1]$, letting $(\mu_t)$ be the constant-speed geodesic between $\mu_0$ and $\mu_1$ (recall from (5) that if $v$ is the optimal displacement map from $\mu_0$ to $\mu_1$, then $\mu_t = (\mathrm{Id} + t\,v)_\#\,\mu_0$), we have
$$F(\mu_t) \le (1 - t)\,F(\mu_0) + t\,F(\mu_1) - \frac{m}{2}\,t(1 - t)\,W_2^2(\mu_0, \mu_1).$$
We call this the $m$-strong-geodesic-convexity of $F$ wrt the $W_2$ distance.
Next, we use the strong geodesic convexity of $F$ to upper bound $F(\mu)$ by the squared $L^2(\mu)$-norm of the subdifferential (for any $\mu$). This is analogous to the inequality $f(x) - f(x^*) \le \frac{1}{2m}\|\nabla f(x)\|_2^2$ for standard $m$-strongly-convex functions on $\mathbb{R}^d$.
Under our assumption that $f$ is $m$-strongly convex, we have that for all $\mu$,
$$F(\mu) \le \frac{1}{2m}\,\Big\|\nabla \log \frac{\mu}{p^*}\Big\|_{L^2(\mu)}^2.$$
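Assuming the bound takes the standard log-Sobolev form $F(\mu) \le \frac{1}{2m}\|\nabla\log(\mu/p^*)\|_{L^2(\mu)}^2$, the Gaussian case gives a quick sanity check (the constants below are illustrative): with $p^* = \mathcal{N}(0, 1/m)$, so that $f$ is $m$-strongly convex, and $\mu = \mathcal{N}(b, 1/m)$, one has $\mathrm{KL}(\mu\,\|\,p^*) = m b^2/2$ while $\nabla \log(\mu/p^*) \equiv m b$ is constant, so the inequality holds with equality.

```python
# Sketch: equality case of F(mu) <= (1/(2m)) ||grad log(mu/p*)||^2_{L2(mu)}
# with p* = N(0, 1/m) and mu = N(b, 1/m). Then:
#   KL(mu || p*) = m b^2 / 2,
#   grad log(mu/p*)(x) = -m(x - b) + m x = m b  (a constant vector field),
# so ||grad log(mu/p*)||^2_{L2(mu)} = (m b)^2.

m_, b = 2.0, 1.5
kl = m_ * b * b / 2
fisher = (m_ * b) ** 2          # squared L2(mu)-norm of the constant field
print(kl, fisher / (2 * m_))    # the two sides coincide
```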
Finally, we put everything together to prove Theorem 3.
Proof of Theorem 3
We first note that .
Suppose that , and let
where the last inequality holds because Lemma 10, together with our assumption, implies the stated bound.
where the last line once again follows from Lemma 10.
To handle the case when , we use the following argument:
We can conclude that implies .
Thus, if for some , then for all as implies and is continuous in . Thus .
Thus, we need only consider the case that for all . This means that (13) holds for all .
By Gronwall’s inequality, we get
We thus need to pick
Using the fact that . Using -smoothness and -strong convexity, we can show that
. We thus get that , so
7 Weak Convexity Result
In this section, we study the case when $f$ is not strongly convex (but still convex and smooth). Let $p_\eta$ be the stationary distribution of (4) with stepsize $\eta$.
We will assume that we can choose an initial distribution which satisfies
. Let be the largest stepsize such that
7.1 Theorem statement and discussion
Let , and be defined as in the beginning of this section.
Once again, applying Pinsker’s inequality, we get that the above choice of parameters yields convergence in total variation. Without strong convexity, we cannot get a bound on $W_2$ from bounding the KL-divergence as we did in Corollary 8.
In (Dalalyan, 2016), a guarantee in the non-strongly-convex case was obtained by running Langevin MCMC on a suitably regularized objective for a prescribed number of iterations.
On the other hand, if we assume and the results of Theorem 5 implies that
To get , we need
Even if we ignore and , our result is not strictly better than (17) as we have a worse dependence on . However, we do have a better dependence on .
-  Ambrosio, Luigi, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
-  Jordan, Richard, David Kinderlehrer, and Felix Otto. "The variational formulation of the Fokker–Planck equation." SIAM Journal on Mathematical Analysis 29.1 (1998): 1-17.
-  Dalalyan, Arnak S. "Theoretical guarantees for approximate sampling from smooth and log-concave densities." Journal of the Royal Statistical Society: Series B (Statistical Methodology) (2016).
-  Durmus, Alain, and Eric Moulines. "Non-asymptotic convergence analysis for the Unadjusted Langevin Algorithm." arXiv preprint arXiv:1507.05021 (2015).
-  Durmus, Alain, and Eric Moulines. "High-dimensional Bayesian inference via the Unadjusted Langevin Algorithm." (2016).
-  Bubeck, Sébastien, Ronen Eldan, and Joseph Lehec. "Sampling from a log-concave distribution with Projected Langevin Monte Carlo." Advances in Neural Information Processing Systems 28 (2015).
-  Rezende, Danilo Jimenez, and Shakir Mohamed. "Variational inference with normalizing flows." Proceedings of the 32nd International Conference on Machine Learning (2015).
-  Liu, Qiang, and Dilin Wang. "Stein variational gradient descent: A general purpose Bayesian inference algorithm." Advances in Neural Information Processing Systems. 2016.
-  Durmus, Alain, Eric Moulines, and Marcelo Pereyra. "Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau." arXiv preprint arXiv:1612.07471 (2016).
-  Santambrogio, Filippo. "Euclidean, metric, and Wasserstein gradient flows: an overview." Bulletin of Mathematical Sciences 7.1 (2017): 87-154.
-  Santambrogio, Filippo. "Optimal transport for applied mathematicians." Birkhäuser, NY (2015).
-  Eberle, Andreas. "Reflection couplings and contraction rates for diffusions." Probability Theory and Related Fields 166.3-4 (2016): 851-886.
-  Dalalyan, Arnak S., and Avetik G. Karagulyan. "User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient." arXiv preprint arXiv:1710.00095 (2017).
8 Supplementary Materials
Proof of Lemma 1 The proof follows directly from results in Ambrosio, Gigli and Savaré (2008). See Theorem 10.4.9 there, applied with the appropriate substitutions. The expression for the subdifferential comes from expression 10.1.16 (section E of chapter 10.1.2, page 233). See also expressions 10.4.67 and 10.4.68.
(One can also refer to Theorem 10.4.13 and Theorem 10.4.17 for proofs for the KL-divergence functional in more general settings.) By Lemma 16, the quantity in Lemma 1 is well defined for all $t$.
First, consider the case when . By definition, , and . By Fokker Planck,
On the other hand
Thus Lemma 5 holds.
In the remainder of this proof, we assume that .
For a given , we let denote the projection of onto its first coordinates, and denote the projection of onto its last coordinates. With abuse of notation, for , we let and denote the corresponding marginal densities.
We will consider three stochastic processes: over for .
First, we introduce the stochastic process for
We let denote the density for . Intuitively, is the joint density between and . One can verify that and . By Fokker-Planck, we have
Next, for any given , we introduce the stochastic process