Many modern learning tasks involve sampling from a high-dimensional and large-scale distribution, which calls for algorithms that are scalable with respect to both the dimension and the data size. One approach [welling2011bayesian] that has found wide success is to discretize the Langevin Dynamics:
where presents a target distribution and is a -dimensional Brownian motion. Such a framework has inspired numerous first-order sampling algorithms [ahn2012bayesian, chen2014stochastic, ding2014bayesian, durmus2016stochastic, lan2016sampling, liu2016stochastic, patterson2013stochastic, simsekli2016stochastic], and the convergence rates are by now well-understood for unconstrained and log-concave distributions [cheng2017convergence, dalalyan2017user, durmus2018analysis].
However, applying (1.1) to sampling from constrained distributions (i.e., when has a bounded convex domain) remains a difficult challenge. From the theoretical perspective, there are only two existing algorithms [brosse17sampling, bubeck2015sampling] that possess non-asymptotic guarantees, and theif rates are significantly worse than the unconstrained scenaria under the same assumtions; cf., Table 1. Furthermore, many important constrained distributions are inherently non-log-concave. A prominent instance is the Dirichlet posterior, which, in spite of the presence of several tailor-made first-order algorithms [lan2016sampling, patterson2013stochastic], is still lacking a non-asymptotic guarantee.
In this paper, we aim to bridge these two gaps at the same time. For general constrained distributions with a strongly convex potential , we prove the existence of a first-order algorithm that achieves the same convergence rates as if there is no constraint at all, suggesting the state-of-the-art can be brought down to . When specialized to the important case of simplex constraint, we provide the first non-asymptotic guarantee for Dirichlet posteriors, for deterministic and for the stochastic version of our algorithms; cf., Example 1 and 2 for the involved parameters.
Our framework combines ideas from the Mirror Descent [beck2003mirror, nemirovsky1983problem] algorithm for optimization and the theory of Optimal Transport [villani2008optimal]. Concretely, for constrained sampling problems, we propose to use the mirror map to transform the target into an unconstrained distribution, whereby many existing methods apply. Optimal Transport theory then comes in handy to relate the convergence rates between the original and transformed problems. For simplex constraints, we use the entropic mirror map to design practical first-order algorithms that possess rigorous guarantees, and are amenable to mini-batch extensions.
The rest of the paper is organized as follows. We briefly review the notion of push-forward measures in Section 2. In Section 3, we propose the Mirrored Langevin Dynamics and prove its convergence rates for constrained sampling problems. Mini-batch extensions are derived in Section 4. Finally, in Section 5, we provide synthetic and real-world experiments to demonstrate the empirical efficiency of our algorithms.
1.1 Related Work
First-Order Sampling Schemes with Langevin Dynamics: There exists a bulk of literature on (stochastic) first-order sampling schemes derived from Langevin Dynamics or its variants [ahn2012bayesian, brosse17sampling, bubeck2015sampling, chen2015convergence, cheng2017convergence, cheng2017underdamped, dalalyan2017user, durmus2018analysis, dwivedi2018log, luu2017sampling, patterson2013stochastic, welling2011bayesian]. However, to our knowledge, this work is the first to consider mirror descent extensions of the Langevin Dynamics.
The authors in [ma2015complete] proposed a formalism that can, in principle, incorporate any variant of Langevin Dynamics for a given distribution . The Mirrored Langevin Dynamics, however, is targeting the push-forward measure (see Section 3.1), and hence our framework is not covered in [ma2015complete].
For Dirichlet posteriors, there is a similar variable transformation as our entropic mirror map in [patterson2013stochastic] (see the “reduced-natural parametrization” therein). The dynamics in [patterson2013stochastic] is nonetheless drastically different from ours, as there is a position-dependent matrix multiplying the Brownian motion, whereas our dynamics has no such feature; see (3.2).
Mirror Descent-Type Dynamics for Stochastic Optimization: Although there are some existing work on mirror descent-type dynamics for stochastic optimization [krichene2017acceleration, mertikopoulos2018convergence, raginsky2012continuous, xu2018accelerated], we are unaware of any prior result on sampling.
Convergence Rates for Sampling from Dirichlet Posteriors: The work [dai2016provable] proposed a zeroth order method that achieves convergence in relative entropy for Dirichlet posteriors, which requires computation per iteration. Our method achieves the same rate with -complexity per iteration.
In this paper, all Lipschitzness and strong convexity are with respect to the Euclidean norm . We use to denote -times differentiable functions with continuous th derivative. The Fenchel dual [rockafellar2015convex] of a function is denoted by . Given two mappings of proper dimensions, we denote their composite map by
. For a probability measure, we write to mean that “
is a random variable whose probability law is”.
2.2 Push-Forward and Optimal Transport
Let be a probability measure with support , and be a convex function on . Throughout the paper we assume:
is closed, proper, on .
All measures have finite second moments.
All measures have finite second moments.
All measures vanish on sets with Hausdorff dimension [mandelbrot1983fractal] at most .
The gradient map induces a new probability measure through for every Borel set on . We say that is the push-forward measure of under , and we denote it by . If and , we will sometimes abuse the notation by writing to mean
If , the triplet must satisfy the Monge-Ampère equation:
Using and , we see that (2.1) is equivalent to
which implies .
3 Mirrored Langevin Dynamics
This section demonstrates a framework for transforming constrained sampling problems into unconstrained ones. We then focus on applications to sampling from strongly log-concave distributions and simplex-constrained distributions, even though the framework is more general and future-proof.
3.1 Motivation and Algorithm
We begin by briefly recalling the mirror descent (MD) algorithm for optimization. In order to minimize a function over a bounded domain, say , MD uses a mirror map to transform the primal variable into the dual space , and then performs gradient updates in the dual: for some step-size . The mirror map is chosen to adapt to the geometry of the constraint , which can often lead to faster convergence [nemirovsky1983problem] or, more pivotal to this work, an unconstrained optimization problem [beck2003mirror].
Inspired by the MD framework, we would like to use the mirror map idea to remove the constraint for sampling problems. Toward this end, we first establish a simple fact [villani2003topics]:
Let satisfy Assumption 1. Suppose that and . Then and .
For any Borel set , we have . Since is one-to-one, if and only if . ∎
In the context of sampling, Theorem 1 suggests the following simple procedure: For any target distribution with support , we choose a mirror map on satisfying Assumption 1, and we consider the dual distribution associated with and :
Theorem 1 dictates that if we are able to draw a sample from , then immediately gives a sample for the desired distribution . Furthermore, suppose for the moment that , so that is unconstrained. Then we can simply exploit the classical Langevin Dynamics (1.1) to efficiently take samples from .
The above reasoning leads us to set up the Mirrored Langevin Dynamics (MLD):
Notice that the stationary distribution of in MLD is , since is nothing but the Langevin Dynamics (1.1) with . As a result, we have .
In order to arrive at a practical algorithm, we then discretize the MLD, giving rise to the following equivalent iterations:
where in both cases , ’s are i.i.d. standard Gaussian, and ’s are step-sizes. The first formulation in (3.3) is useful when has a tractable form, while the second one can be computed using solely the information of and .
Next, we turn to the convergence of discretized MLD. Since in (3.2) is the classical Langevin Dynamics, and since we have assumed that is unconstrained, it is typically not difficult to prove the convergence of to . However, what we ultimately care about is the guarantee on the primal distribution . The purpose of the next theorem is to fill the gap between primal and dual convergence.
We consider three most common metrics in evaluating approximate sampling schemes, namely the 2-Wasserstein distance , the total variation , and the relative entropy .
Theorem 2 (Convergence in implies convergence in ).
If, furthermore, is -strongly convex: . Then .
See Appendix A. ∎
3.2 Applications to Sampling from Constrained Distributions
We now consider applications of MLD. For strongly log-concave distributions with general constraint, we prove matching rates to that of unconstrained ones; see Section 3.2.1. In Section 3.2.2, we consider the important case where the constraint is a probability simplex.
3.2.1 Sampling from a strongly log-concave distribution with constraint
As alluded to in the introduction, the existing convergence rates for constrained distributions are significantly worse than their unconstrained counterparts; see Table 1 for a comparison.
|MLD; this work|
|Langevin Dynamics [cheng2017convergence, dalalyan2017theoretical, durmus2018analysis]|
The main result of this subsection is the existence of a “good” mirror map for arbitrary constraint, with which the dual distribution becomes unconstrained:
Theorem 3 (Existence of a good mirror map for MLD).
Let be a probability measure with bounded convex support such that , , and is bounded away from in the interior of the support. Then there exists a mirror map such that the discretized MLD (3.3) yields
See Appendix B. ∎
We remark that Theorem 3 is only an existential result, not an actual algorithm. Practical algorithms are considered in the next subsection.
3.2.2 Sampling Algorithms on Simplex
We apply the discretized MLD (3.3) to the task of sampling from distributions on the probability simplex
, which is instrumental in many fields of machine learning and statistics.
On a simplex, the most natural choice of is the entropic mirror map [beck2003mirror], which is well-known to be 1-strongly convex:
In this case, the associated dual distribution can be computed explicitly.
Lemma 1 (Sampling on a simplex with entropic mirror map).
Let be the target distribution on , be the entropic mirror map (3.4), and . Then the potential of the push-forward measure admits the expression
where is the Fenchel dual of , which is strictly convex and 1-Lipschitz gradient.
See Appendix C. ∎
Crucially, we have , so that the Langevin Dynamics for is unconstrained.
Based on Lemma 1, we now present the surprising case of the non-log-concave Dirichlet posteriors, a distribution of central importance in topic modeling [blei2003latent], for which the dual distribution becomes strictly log-concave.
Example 1 (Dirichlet Posteriors).
Given parameters and observations where is the number of appearance of category , the probability density function of the Dirichlet posterior is
, the probability density function of the Dirichlet posterior is
where is a normalizing constant and . The corresponding is
The interesting regime of the Dirichlet posterior is when it is sparse, meaning the majority of the ’s are zero and a few ’s are large, say of order . It is also common to set for all in practice. Evidently, is neither convex nor concave in this case, and no existing non-asymptotic rate can be applied. However, plugging into (3.5) gives
which, magically, becomes strictly convex and -Lipschitz gradient no matter what the observations and parameters are! In view of Theorem 2 and Corollary 7 of [durmus2018analysis], one can then apply (3.3) to obtain an convergence in relative entropy, where is the initial Wasserstein distance to the target. ∎
4 Stochastic Mirrored Langevin Dynamics
We have thus far only considered deterministic methods based on exact gradients. In practice, however, evaluating gradients typically involves one pass over the full data, which can be time-consuming in large-scale applications. In this section, we turn attention to the mini-batch setting, where one can use a small subset of data to form stochastic gradients.
Toward this end, we assume:
Assumption 4 (Primal Decomposibility).
The target distribution admits a decomposable structure for some functions .
The above assumption is often met in machine learning applications, where each represents one data. If there is an additional prior term (that is, for some ), then one can redefine so that Assumption 4 still holds.
Consider the following common scheme in obtaining stochastic gradients. Given a batch-size , we randomly pick a mini-batch from with
, and form an unbiased estimate ofby computing
The following lemma asserts that exactly the same procedure can be carried out in the dual.
Assume that is 1-strongly convex. For i = let be such that
Define and , where is chosen as in (4.1). Then:
Primal decomposibility implies dual decomposability: There is a constant such that .
For each , the gradient depends only on and the mirror map .
The gradient estimate is unbiased: .
The dual stochastic gradient is more accurate: .
See Appendix D. ∎
Let be a distribution satisfying Assumption 4, and a 1-strongly convex mirror map. Let be the variance of the stochastic gradient of
be the variance of the stochastic gradient ofin (4.1). Suppose that the corresponding dual distribution satisfies . Then, applying SMLD with constant step-size yields222Our guarantee is given on a randomly chosen iterate from , instead of the final iterate . In practice, we observe that the final iterate always gives the best performance, and we will ignore this minor difference in the theorem statement.:
provided that .
See Appendix E. ∎
Example 2 (SMLD for Dirichlet Posteriors).
For the case of Dirichlet posteriors, we have seen in (3.7) that the corresponding dual distribution satisfies , where and . Furthermore, it is easy to see that the stochastic gradient can be efficiently computed (see Appendix F):
where is the number of observations of category in the mini-batch . As a result, Theorem 4 states that SMLD achieves
with a constant step-size.
We conduct experiments with a two-fold purpose. First, we use a low-dimensional synthetic data, where we can evaluate the total variation error by comparing histograms, to verify the convergence rates in our theory. Second, We demonstrate that the SMLD, modulo a necessary modification for resolving numerical issues, outperforms state-of-the-art first-order methods on the Latent Dirichlet Allocation (LDA) application with Wikipedia corpus.
5.1 Synthetic Experiment for Dirichlet Posterior
We implement the deterministic MLD for sampling from an 11-dimensional Dirichlet posterior (3.6) with , and , which aims to capture the sparse nature of real observations in topic modeling. We set for all .
As a baseline comparison, we include the Stochastic Gradient Riemannian Langevin Dynamics (SGRLD) [patterson2013stochastic] with the expanded-mean parametrization. SGRLD is a tailor-made first-order scheme for simplex constraints, and it remains one of the state-of-the-art algorithms for LDA. For fair comparison, we use deterministic gradients for SGRLD.
We perform a grid search over the constant step-size for both algorithms, and we keep the best three for MLD and the best five for SGRLD. For each iteration, we build an empirical distribution by running independent trials, and we compute its total variation with respect to the histogram generated by the true distribution.
Figure 1(a) reports the total variation error along the first dimension, where we can see that MLD outperforms SGRLD by a substantial margin. As dictated by our theory, all the MLD curves decay at the rate until they saturate at the dicretization error level. In contrast, SGRLD lacks non-asymptotic guarantees, and there is no clear convergence rate we can infer from Figure 1(a).
5.2 Latent Dirichlet Allocation with Wikipedia Corpus
An influential framework for topic modeling is the Latent Dirichlet Allocation (LDA) [blei2003latent], which, given a text collection, requires to infer the posterior word distributions without knowing the exact topic for each word. The full model description is standard but somewhat convoluted; we refer to the classic [blei2003latent] for details.
Each topic in LDA determines a word distribution , and suppose there are in total topics and words. The variable of interest is therefore . Since this domain is a Cartesian product of simplices, we propose to use , where is the entropic mirror map (3.4), for SMLD. It is easy to see that all of our computations for Dirichlet posteriors generalize to this setting.
5.2.1 Experimental Setup
We implement the SMLD for LDA on the Wikipedia corpus with documents, and we compare the performance against the SGRLD [patterson2013stochastic]. In order to keep the comparison fair, we adopt exactly the same setting as in [patterson2013stochastic], including the model parameters, the batch-size, the Gibbs sampler steps, etc. See Section 4 and 5 in [patterson2013stochastic] for omitted details.
Another state-of-the-art first-order algorithm for LDA is the SGRHMC in [ma2015complete], for which we skip the implementation, due to not knowing how the was chosen in [ma2015complete]. Instead, we will repeat the same experimental setting as [ma2015complete] and directly compare our results versus the ones reported in [ma2015complete]. See Appendix G for comparison against SGRHMC.
5.2.2 A Numerical Trick and the SMLD-approximate Algorithm
A major drawback of the SMLD in practice is that the stochastic gradients (4.4) involve exponential functions, which are unstable for large-scale problems. For instance, in python, np.exp(800) = inf, whereas the relevant variable regime in this experiment extends to 1600. To resolve such numerical issues, we appeal to the linear approximation333One can also use a higher-order Taylor approximation for , or add a small threshold to prevent the iterates from going to the boundary. In practice, we observe that these variants do not make a huge impact on the performance. . Admittedly, our theory no longer holds under such numerical tricks, and we shall not claim that our algorithm is provably convergent for LDA. Instead, the contribution of MLD here is to identify the dual dynamics associated with (3.7
), which would have been otherwise difficult to perceive. We name the resulting algorithm “SMLD-approximate” to indicate its heuristic nature.
Figure 1(b) reports the perplexity on the test data up to documents, with the five best step-sizes we found via grid search for SMLD-approximate. For SGRLD, we use the best step-sizes reported in [patterson2013stochastic].
From the figure, we can see a clear improvement, both in terms of convergence speed and the saturation level, of the SMLD-approximate over SGRLD. One plausible explanation for such phenomenon is that our MLD, as a simple unconstrained Langevin Dynamics, is less sensitive to discretization. On the other hand, the underlying dynamics for SGRLD is a more sophisticated Riemannian diffusion, which requires finer discretization than MLD to achieve the same level of approximation to the original continuous-time dynamics, and this is true even in the presence of noisy gradients and our numerical heuristics
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement n 725594 - time-data).
Appendix A Proof of Theorem 2
We first focus on the convergence for total variation and relative entropy, since they are in fact quite trivial. The proof for the 2-Wasserstein distance requires a bit more work.
a.1 Total Variation and Relative Entropy
Since is strictly convex, is one-to-one, and hence
On the other hand, it is well-known that applying a one-to-one mapping to distributions leaves the relative entropy intact. Alternatively, we may also simply write (letting ):
The “in particular” part follows from noticing that and .
a.2 2-Wasserstein Distance
Now, let be -strongly convex. The most important ingredient of the proof is Lemma 3 below, which is conceptually clean. Unfortunately, for the sake of rigor, we must deal with certain intricate regularity issues in the Optimal Transport theory. If the reader wishes, she/he can simply assume that the quantities (A.1) and (A.2) below are well-defined, which is always satisfied by any practical mirror map, and skip all the technical part about the well-definedness proof.
For the moment, assume ; the general case is given at the end. Every convex generates a Bregman divergence via . The following key lemma allows us to relate guarantees in between ’s and ’s. It can be seen as a generalization of the classical duality relation (A.4) in the space of probability measures.
Lemma 3 (Duality of Wasserstein Distances).
Before proving the lemma, let us see that the relation in is a simple corollary of Lemma 3. Since is -strongly convex, it is classical that, for any and ,
a.2.1 Proof of Lemma 3 When
We first prove that (A.2) is well-defined by verifying the sufficient conditions in Theorem 3.6 of [de2014monge]. Specifically, we will verify (C0)-(C2) in p.554 of [de2014monge] when the transport cost is .
Since is -strongly convex, is injective, and hence is also injective, which implies that is strictly convex. On the other hand, the strong convexity of implies , and hence is globally upper bounded by a quadratic function.
We now show that the conditions (C0)-(C2) are satisfied. Since we have assumed , we have . Since is upper bounded by a quadratic function, the condition (C0) is trivially satisfied. On the other hand, since is strictly convex, simple calculation reveals that, for any , the mapping is injective, which is (C1). Similarly, for any , the mapping is also injective, which is (C2). By Theorem 3.6 in [de2014monge], (A.2) is well-defined.
for all measurable . Using (A.5) in the definition of , we get
where the infimum is over all such that . Using the classical duality and , we may further write
where the infimum is again over all such that . In view of (A.6), the proof would be complete if we can show that if and only if .
For any two maps and , we claim that
Indeed, for any Borel set , we have, by definition of the push-forward,
On the other hand, recursively applying the definition of push-forward to gives
which establishes (A.7).
a.2.2 When is only
Appendix B Proof of Thereom 3
In previous sections, we are given a target distribution and a mirror map , and we derive the induced distribution through the Monge-Ampère equation (2.1). The high-level idea of this proof is to reverse the direction: We start with two good distributions and , and we invoke deep results in Optimal Transport to deduce the existence of a good mirror map .
First, notice that if has bounded domain, then the strong convexity of implies . Along with the assumption that is bounded away from in the interior, we see that is bounded away from and in the interior of support.
Let be the standard -dimensional Gaussian measure. By Brenier’s polarization theorem [brenier1987decomposition, brenier1991polar] and Assumption 2, 3, there exists a convex function whose gradient solves the optimal transportation problem. Caffarelli’s regularity theorem [caffarelli1990localization, caffarelli1992regularity, caffarelli2000monotonicity] then implies that the Brenier’s map is in . Finally, a slightly stronger form of Caffarelli’s contraction theorem [kolesnikov2011mass] asserts:
which implies is -strongly convex.
Let us consider the discretized MLD (3.3) corresponding to the mirror map . The SDE governing is simply the Ornstein-Uhlenbeck Process [oksendal2003stochastic]:
Appendix C Proof of Lemma 1
Straightforward calculations in convex analysis shows
which proves that is 1-strongly convex.