Langevin Monte Carlo without Smoothness

05/30/2019 ∙ by Niladri S. Chatterji, et al. ∙ UC Berkeley

Langevin Monte Carlo (LMC) is an iterative algorithm used to generate samples from a distribution that is known only up to a normalizing constant. The nonasymptotic dependence of its mixing time on the dimension and target accuracy is understood only in the setting of smooth (gradient-Lipschitz) log-densities, a serious limitation for applications in machine learning. In this paper, we remove this limitation, providing polynomial-time convergence guarantees for a variant of LMC in the setting of nonsmooth log-concave distributions. At a high level, our results follow by leveraging the implicit smoothing of the log-density that comes from a small Gaussian perturbation that we add to the iterates of the algorithm and controlling the bias and variance that are induced by this perturbation.


1 Introduction

The problem of generating a sample from a distribution that is known up to a normalizing constant is a core problem across the computational and inferential sciences [33, 18, 5, 31, 35, 6]. A prototypical example involves generating a sample from a log-concave distribution—a probability distribution of the following form:

$p^*(x) = \dfrac{e^{-f(x)}}{\int_{\mathbb{R}^d} e^{-f(y)}\, dy}$,

where the function $f : \mathbb{R}^d \to \mathbb{R}$ is convex and is referred to as the potential function. While generating a sample from the exact distribution $p^*$ is often computationally intractable, for most applications it suffices to generate a sample from a distribution that is close to $p^*$ in some distance (such as, e.g., total variation distance, Wasserstein distance, or Kullback-Leibler divergence).

The most commonly used methods for generating a sample from a log-concave distribution are (i) random walks [16, 21], (ii) different instantiations of Langevin Monte Carlo (LMC) [29], and (iii) Hamiltonian Monte Carlo (HMC) [25]. These methods trade off rate of convergence against per-iteration complexity and applicability. Random walks are typically the slowest in terms of the total number of iterations, but each step is fast, as it does not require gradients of the log-density, and they are broadly applicable. HMC is the fastest in the number of iterations, but each step is slow, as it uses gradients of the log-density, and it mainly applies to distributions with smooth log-densities.

LMC occupies a middle ground between random walks and HMC. In its standard form, LMC updates its iterates as:

$x_{k+1} = x_k - \eta \nabla f(x_k) + \sqrt{2\eta}\,\xi_k$,    (LMC)

where $\eta > 0$ is the step size and $\xi_k \sim N(0, I_d)$ are independent Gaussian random vectors. The per-iteration complexity is reduced relative to HMC because it only requires stochastic gradients of the log-density [36]. This also increases its range of applicability relative to HMC. While it is not a reversible Markov chain and the classical theory of MCMC does not apply, it is nonetheless amenable to theoretical analysis because it is obtained by discretizing an underlying stochastic differential equation (SDE). There is, however, a fundamental difficulty in connecting theory to the promised wide range of applications in statistical inference. In particular, the use of techniques from SDEs generally requires the potential $f$ to have Lipschitz-continuous gradients. This assumption excludes many natural applications [18, 14, 24]. In this work, we tackle this problem head-on and pose the following question:

Is it possible to obtain nonasymptotic convergence results for LMC with a nonsmooth potential?

We answer this question positively through a series of results that involve transformations of the basic stochastic dynamics in (LMC). In contrast to previous work that aims at nonsmooth potentials [1, 14, 17], the transformations we consider are simple (such as perturbing a gradient query point by a Gaussian), they do not require strong assumptions such as the existence of proximal maps, and they apply directly to nonsmooth Lipschitz potentials without any additional structure (such as the composite structure in [1, 14] or strong convexity in [17]). Our main result relies on Gaussian smoothing and is summarized in the following theorem.

Main Theorem (Informal).

Let $p^* \propto e^{-(f_1 + f_2)}$ be a probability distribution on $\mathbb{R}^d$, where $f_1$ is a convex subdifferentiable function whose subgradients are Hölder-continuous with exponent $\alpha \in [0, 1]$, and $f_2$ is $\mu$-strongly convex and $M$-smooth. There exists an algorithm—Perturbed Langevin Monte Carlo (P-LMC)—which has the same per-iteration computational complexity as (LMC) and that requires no more than $\mathrm{poly}(d, 1/\epsilon)$ iterations to generate a sample that is $\epsilon$-close to $p^*$ in 2-Wasserstein distance.

Further, if the goal is to sample from $\bar{p}^* \propto e^{-f_1}$, a variant of (P-LMC) takes $\mathrm{poly}(d, 1/\epsilon)$ iterations to generate a sample from a distribution that is $\epsilon$-close to $\bar{p}^*$ in total variation distance.

This informal version of the theorem displays only the dependence on the dimension $d$ and accuracy $\epsilon$. A detailed statement is provided in Theorems 3.3 and 3.4, and Corollary 4.1.

Our assumption on the subgradients of $f_1$ from the statement of the Main Theorem is known as Hölder continuity, or $\alpha$-weak smoothness of the function. It interpolates between Lipschitz gradients (smooth functions, when $\alpha = 1$) and bounded gradients (nonsmooth Lipschitz functions, when $\alpha = 0$). To understand the behavior of LMC on weakly smooth (including nonsmooth) potentials, we leverage results from the optimization literature. First, by using the fact that a weakly smooth function can be approximated by a smooth function—a result that has been exploited in the optimization literature to obtain methods with optimal convergence rates [26, 12]—we show that even the basic version of LMC can generate a sample in polynomial time, as long as $f_1$ is “not too nonsmooth” (namely, as long as $1/\alpha$ can be treated as a constant).
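For concreteness, here is the condition together with a simple endpoint example (the example is ours, chosen for illustration):

```latex
% alpha-weak smoothness (Hölder continuity of subgradients):
%   alpha = 1 recovers L-smoothness; alpha = 0 says only that the
%   subgradients are uniformly bounded (nonsmooth Lipschitz case).
\[
  \|g_x - g_y\| \;\le\; L\,\|x - y\|^{\alpha}, \qquad \alpha \in [0,1].
\]
% Example: f_1(x) = \|x\| (the Euclidean norm) has subgradients of norm
% at most 1, so for all x, y,
\[
  \|g_x - g_y\| \;\le\; 2 \;=\; 2\,\|x - y\|^{0},
\]
% i.e., f_1 is 0-weakly smooth with L = 2, yet it is not
% gradient-Lipschitz for any finite constant (the gradient jumps at 0).
```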

The main impediment to the convergence analysis of LMC when treating a weakly smooth function as an inexact version of a nearby smooth function is that a constant bias is induced on the gradients, as discussed in Section 3.1. To circumvent this issue, in Section 3.2 we argue that an LMC algorithm whose gradient query points are perturbed by a Gaussian can be analyzed as a different LMC run on a Gaussian-smoothed version of the potential, using unbiased stochastic estimates of the gradient of the smoothed potential. (A similar idea was used in [19] to view expected iterates of stochastic gradient descent as gradient descent on a smoothed version of the objective.)

Building on this reduction, we define a Perturbed Langevin Monte Carlo (P-LMC) algorithm that reduces the additional variance that the reduction introduces into the gradients.

To obtain our main theorem, we couple a result on the convergence of LMC with stochastic gradient estimates in Wasserstein distance [11] with carefully crafted applications of inequalities relating the Kullback-Leibler divergence, Wasserstein distance, and total variation distance. Also useful are structural properties of weakly smooth potentials and their Gaussian smoothings. As a byproduct of our techniques, we obtain a nonasymptotic result for the convergence of LMC with stochastic gradients in total variation distance.

1.1 Related Work

Starting with the work of Dalalyan [10], a variety of theoretical results have established mixing-time bounds for LMC [13, 32, 38, 7, 9, 11, 37] and closely related methods, such as Metropolis-adjusted LMC [15] and HMC [22, 4, 20, 23, 8]. These results apply to sampling from well-behaved distributions whose potential function is smooth (Lipschitz gradients) and (usually) strongly convex. For standard (LMC) with smooth and strongly convex potentials, the tightest known upper bounds on the mixing time scale as $\tilde{O}(d/\epsilon^2)$. They were obtained in [10, 13] for convergence in total variation (with a warm start; without a warm start the total variation bound incurs additional polynomial factors in $d$) and in 2-Wasserstein distance.

When it comes to using (LMC) with nonsmooth potential functions, there are far fewer results. In particular, we are only aware of approaches (such as, e.g., [1, 14]) that assume a composite structure of the potential (namely, that the potential is a sum of a smooth and a nonsmooth function) and rely on the use of proximal maps. Note that this is a very strong assumption. In fact, when the composite structure exists in convex optimization and proximal maps are efficiently computable, it is possible to solve nonsmooth optimization problems with the same iteration complexity as if the objective were smooth (see, e.g., [2]). Thus, while the method from [14] has a lower iteration complexity than our approach (both for smooth-plus-nonsmooth potentials and for smooth-and-strongly-convex-plus-nonsmooth potentials), the use of proximal maps increases its per-iteration complexity (each iteration needs to solve a convex optimization problem). It is also unclear how the performance of the method degrades when the proximal maps are computed only to finite accuracy. Finally, unlike our work, [1, 14] do not handle potentials that are purely nonsmooth, without a composite structure.

It is also worth mentioning that there exist approaches such as the Mirrored Langevin Algorithm [17] that can be used to efficiently sample from structured nonsmooth distributions such as the Dirichlet posterior. However, this algorithm’s applicability to general nonsmooth densities is unclear.

1.2 Outline

Section 2 provides the notation and background required to state the main technical results. Section 3 provides our main theorems, which are stated for potentials (negative log-densities) with composite structure and cover both deterministic and stochastic (Gaussian) smooth approximations of the potential. Section 4 extends the results of Section 3 to non-composite potentials. Finally, Section 5 presents our conclusions.

2 Preliminaries

The goal is to generate samples from a distribution $p^*(x) \propto e^{-f(x)}$, where $f : \mathbb{R}^d \to \mathbb{R}$. We equip the space $\mathbb{R}^d$ with the standard Euclidean norm $\|\cdot\|$ and use $\langle \cdot, \cdot \rangle$ to denote inner products. We assume the following for the potential (negative log-density) $f$:

  1. $f$ is convex and subdifferentiable. Namely, for all $x \in \mathbb{R}^d$ there exists a subgradient $g_x$ of $f$ such that $f(y) \ge f(x) + \langle g_x, y - x \rangle$ for all $y \in \mathbb{R}^d$.

  2. There exist $L > 0$ and $\alpha \in [0, 1]$ such that, for all $x, y \in \mathbb{R}^d$, we have

    $\|g_x - g_y\| \le L\, \|x - y\|^{\alpha}$,    (2.1)

    where $g_x$ denotes an arbitrary subgradient of $f$ at $x$.

  3. The distribution $p^*$ has a finite fourth moment:

    $\mathbb{E}_{p^*}\big[\|x - x^*\|^4\big] < \infty$,

    where $x^*$ is an arbitrary minimizer of $f$.

Assumption (A2) is known as $\alpha$-weak smoothness or Hölder continuity of the (sub)gradients of $f$. When $\alpha = 1$, it corresponds to standard smoothness (Lipschitz continuity of the gradients), while at the other extreme, when $\alpha = 0$, $f$ is (possibly) nonsmooth and Lipschitz-continuous.

Properties of weakly smooth functions.

A property that follows directly from (2.1) is that, for all $x, y \in \mathbb{R}^d$:

$f(y) \le f(x) + \langle g_x, y - x \rangle + \dfrac{L}{1+\alpha}\, \|y - x\|^{1+\alpha}$.    (2.2)

One of the most useful properties of weakly smooth functions that has been exploited in optimization is that they can be approximated by smooth functions to an arbitrary accuracy, at the cost of increasing their smoothness parameter [26, 12]. This was shown in [26, Lemma 1] and is summarized in the following lemma for the special case of the unconstrained Euclidean setting used here.

Lemma 2.1.

Let $f$ be a convex function that satisfies (2.1) for some $L > 0$ and $\alpha \in [0, 1]$. Then, for any $\delta > 0$ and all $x, y \in \mathbb{R}^d$, we have that

$f(y) \le f(x) + \langle g_x, y - x \rangle + \dfrac{L_\delta}{2}\, \|y - x\|^{2} + \dfrac{\delta}{2}$,    (2.3)

where one can take $L_\delta = L^{\frac{2}{1+\alpha}} \big( \frac{1-\alpha}{(1+\alpha)\,\delta} \big)^{\frac{1-\alpha}{1+\alpha}}$ (up to an absolute constant).

Furthermore, it is not hard to show that (2.3) implies the following (see [12, Section 2.2]):

$\|g_x - g_y\| \le L_\delta\, \|x - y\| + c\,\sqrt{L_\delta\, \delta}$    (2.4)

for an absolute constant $c > 0$, where $L_\delta$, for any $\delta > 0$, is as in Lemma 2.1.
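To see why the additive bias term in (2.4) is benign only when $\alpha$ is bounded away from zero, it is instructive to evaluate $L_\delta$ at the two endpoints (a short calculation based on the expression given above):

```latex
% alpha = 1 (smooth case): L_delta = L, independent of delta,
% and the additive term in (2.4) vanishes as delta -> 0.
\[
  \alpha = 1: \quad L_\delta = L, \qquad
  \sqrt{L_\delta\,\delta} = \sqrt{L\,\delta} \;\longrightarrow\; 0
  \quad (\delta \to 0).
\]
% alpha = 0 (nonsmooth Lipschitz case): L_delta = L^2/delta, so the
% additive term is of order L no matter how small delta is chosen.
\[
  \alpha = 0: \quad L_\delta = \frac{L^2}{\delta}, \qquad
  \sqrt{L_\delta\,\delta} = L .
\]
```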

Gaussian smoothing.

Given $\sigma > 0$, define the Gaussian smoothing of a function $f$ as:

$f^{\sigma}(x) := \mathbb{E}_{\zeta}\big[ f(x + \zeta) \big]$,

where $\zeta \sim N(0, \sigma^2 I_d)$. The reason for considering the Gaussian smoothing $f^{\sigma}$ instead of $f$ is that it generally enjoys better smoothness properties. In particular, $f^{\sigma}$ is smooth even if $f$ is not. Here we review some basic properties of $f^{\sigma}$, most of which can be found in [27, Section 2] for nonsmooth Lipschitz functions. We generalize some of these results to weakly smooth functions. While the results can be obtained for arbitrary normed spaces, here we state all of them for the Euclidean space $(\mathbb{R}^d, \|\cdot\|)$, which is the only setting considered in this paper.
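The smoothed gradient $\nabla f^{\sigma}(x) = \mathbb{E}_{\zeta}[\, g_{x+\zeta} \,]$ can be estimated by sampling. The following sketch (ours, for illustration; the function names and the choice $f(x) = \|x\|_1$ are assumptions, not taken from the paper) contrasts a single perturbed-subgradient draw with an averaged Monte Carlo estimate:

```python
import numpy as np

def subgrad_l1(x):
    """A subgradient of f(x) = ||x||_1 (sign vector; 0 is a valid choice at kinks)."""
    return np.sign(x)

def smoothed_grad_estimate(x, sigma, n_samples=1, rng=None):
    """Monte Carlo estimate of grad f^sigma(x) = E[g(x + zeta)], zeta ~ N(0, sigma^2 I).

    With n_samples=1 this is exactly the unbiased single-query estimator
    used by the perturbed-query variant of LMC discussed in Section 3.2.
    """
    rng = np.random.default_rng() if rng is None else rng
    zetas = sigma * rng.standard_normal((n_samples, x.shape[0]))
    return subgrad_l1(x + zetas).mean(axis=0)

x = np.array([0.3, -0.1, 0.0])
g_one = smoothed_grad_estimate(x, sigma=0.1)                     # unbiased, high variance
g_avg = smoothed_grad_estimate(x, sigma=0.1, n_samples=10_000)   # close to grad f^sigma(x)
```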

The following lemma is a simple extension of the results from [27, Section 2] and it establishes certain regularity conditions for Gaussian smoothing that will be used in our analysis.

Lemma 2.2.

Let $f$ be a convex function that satisfies Eq. (2.1) for some $L > 0$ and $\alpha \in [0, 1]$. Then $f^{\sigma}$ is convex and differentiable, it satisfies the sandwich bound

$f(x) \le f^{\sigma}(x) \le f(x) + \dfrac{L\, \sigma^{1+\alpha}}{1+\alpha}\, d^{\frac{1+\alpha}{2}}$ for all $x \in \mathbb{R}^d$,

and its gradient is Lipschitz-continuous with a constant $L_\sigma$ of order $L\, d^{\frac{1-\alpha}{2}}\, \sigma^{\alpha - 1}$ (recovering the $L\sqrt{d}/\sigma$ bound of [27] when $\alpha = 0$ and the standard constant $L$ when $\alpha = 1$).

Additionally, we show that Gaussian smoothing preserves strong convexity, as stated in the following (simple) lemma. Recall that a differentiable function $h$ is $\mu$-strongly convex if, for all $x, y \in \mathbb{R}^d$:

$h(y) \ge h(x) + \langle \nabla h(x), y - x \rangle + \dfrac{\mu}{2}\, \|y - x\|^{2}$.

Lemma 2.3.

Let $f_2$ be $\mu$-strongly convex. Then $f_2^{\sigma}$ is also $\mu$-strongly convex.
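A one-line justification (a standard argument, included here for completeness): $h$ is $\mu$-strongly convex if and only if $h - \frac{\mu}{2}\|\cdot\|^2$ is convex, and smoothing shifts that quadratic only by a constant:

```latex
\[
  f_2^{\sigma}(x) - \tfrac{\mu}{2}\|x\|^{2}
  = \mathbb{E}_{\zeta}\!\left[ f_2(x+\zeta) - \tfrac{\mu}{2}\|x+\zeta\|^{2} \right]
    + \tfrac{\mu}{2}\,\mathbb{E}\|x+\zeta\|^{2} - \tfrac{\mu}{2}\|x\|^{2}.
\]
% The expectation term is an average of convex functions of x, hence convex;
% since E[zeta] = 0, the remaining terms equal the constant (mu/2) sigma^2 d.
% Therefore f_2^sigma - (mu/2)||.||^2 is convex, i.e., f_2^sigma is
% mu-strongly convex.
```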

Composite potentials and regularization.

To prove convergence of the continuous-time process (which requires strong convexity), we work with potentials that have the following composite form:

$f(x) = f_1(x) + f_2(x)$,    (2.5)

where $f_1$ satisfies Assumptions (A1)–(A2) and $f_2$ is $M$-smooth and $\mu$-strongly convex. For obtaining guarantees in terms of convergence to $p^* \propto e^{-f}$, we do not need Assumption (A3), which bounds the fourth moment of the target distribution—this is only needed in establishing the results for $\bar{p}^* \propto e^{-f_1}$.

If the goal is to sample from the distribution $\bar{p}^* \propto e^{-f_1}$ (instead of $p^*$), then we need to ensure that the distributions $p^*$ and $\bar{p}^*$ are sufficiently close to each other. This can be achieved by choosing $f_2(x) = \frac{\lambda}{2}\|x - \bar{x}\|^2$, where $\lambda > 0$ is sufficiently small, for an arbitrary point $\bar{x} \in \mathbb{R}^d$ (see Corollary 4.1 for precise details).

Note that, by the triangle inequality, the subgradients $g_x = g^1_x + \nabla f_2(x)$ of the composite potential satisfy:

$\|g_x - g_y\| \le \|g^1_x - g^1_y\| + \|\nabla f_2(x) - \nabla f_2(y)\| \le L\|x - y\|^{\alpha} + M\|x - y\|$.    (2.6)

Thus, by (2.4), we have the following (deterministic) Lipschitz approximation of the gradients of $f$: for all $x, y \in \mathbb{R}^d$, any $\delta > 0$, and $L_\delta$ (as in Lemma 2.1):

$\|g_x - g_y\| \le (L_\delta + M)\,\|x - y\| + c\,\sqrt{L_\delta\, \delta}$.    (2.7)

On the other hand, for Gaussian-smoothed composite potentials, using Lemma 2.2, we have that $\nabla f^{\sigma}$ is Lipschitz-continuous:

$\|\nabla f^{\sigma}(x) - \nabla f^{\sigma}(y)\| \le (L_\sigma + M)\,\|x - y\|$,    (2.8)

where $L_\sigma$ denotes the smoothness constant of $f_1^{\sigma}$ from Lemma 2.2.

Distances between probability measures.

Given any two probability measures $P$ and $Q$ on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$, where $\mathcal{B}(\mathbb{R}^d)$ is the Borel $\sigma$-field of $\mathbb{R}^d$, the total variation distance between them is defined as:

$\|P - Q\|_{\mathrm{TV}} := \sup_{A \in \mathcal{B}(\mathbb{R}^d)} |P(A) - Q(A)|$.

The Kullback-Leibler divergence between $P$ and $Q$ is defined as:

$\mathrm{KL}(P \,\|\, Q) := \int_{\mathbb{R}^d} \log\!\Big( \dfrac{dP}{dQ}(x) \Big)\, dP(x)$,

where $\frac{dP}{dQ}$ is the Radon-Nikodym derivative of $P$ with respect to $Q$.

Define a transference plan $\gamma$, a distribution on $(\mathbb{R}^d \times \mathbb{R}^d, \mathcal{B}(\mathbb{R}^d \times \mathbb{R}^d))$ such that $\gamma(A \times \mathbb{R}^d) = P(A)$ and $\gamma(\mathbb{R}^d \times A) = Q(A)$ for any $A \in \mathcal{B}(\mathbb{R}^d)$. Let $\Gamma(P, Q)$ denote the set of all such transference plans. Then the 2-Wasserstein distance is defined as:

$W_2(P, Q) := \Big( \inf_{\gamma \in \Gamma(P, Q)} \int \|x - y\|^2\, d\gamma(x, y) \Big)^{1/2}$.
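These quantities are related by standard inequalities, which is what allows the arguments in Section 3.2 to move between distances; for orientation (standard facts, stated here for the reader's convenience):

```latex
% Pinsker's inequality: total variation is controlled by KL divergence.
\[
  \|P - Q\|_{\mathrm{TV}} \;\le\; \sqrt{\tfrac{1}{2}\,\mathrm{KL}(P \,\|\, Q)} .
\]
% KL can in turn be compared to W_2 under regularity of the reference
% measure, which is the route taken in Section 3.2 via the bound of [30]
% (stated as Proposition A.3 in Appendix A).
```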

3 Sampling for Composite Potentials

In this section, we consider the setting of composite potentials of the form $f = f_1 + f_2$, where $f_1$ is $\alpha$-weakly smooth (possibly with $\alpha = 0$, in which case $f_1$ is nonsmooth and Lipschitz-continuous) and $f_2$ is $M$-smooth and $\mu$-strongly convex. We provide results for mixing times of different variants of overdamped LMC in both 2-Wasserstein and total variation distance.

We first consider the deterministic smooth approximation of $f_1$, which follows from Lemma 2.1. This approach does not require making any changes to standard overdamped LMC. However, it leads to a polynomial dependence of the mixing time on $d$ and $1/\epsilon$ only when $\alpha$ is bounded away from zero (namely, when $1/\alpha$ can be treated as a constant).

We then consider another approach that relies on a Gaussian smoothing of $f$ and that leads to a polynomial dependence of the mixing time on $d$ and $1/\epsilon$ for all values of $\alpha \in [0, 1]$. In particular, the approach leads to a mixing time in 2-Wasserstein distance that matches the best known mixing time of overdamped LMC when $f$ is smooth ($\alpha = 1$), and preserves polynomial-time dependence on $d$ and $1/\epsilon$ even if $f_1$ is nonsmooth ($\alpha = 0$). The analysis requires us to consider a minor modification of standard LMC in which we perturb the points at which gradients of $f$ are queried by a Gaussian random variable. Note that it is unclear whether it is possible to obtain such bounds for (LMC) without this modification (see Appendix D).

3.1 First Attempt: Deterministic Approximation by a Smooth Function

In the optimization literature, deterministic smooth approximations of weakly smooth functions (as in Lemma 2.1) are generally useful for obtaining methods with optimal convergence rates [26, 12]. A natural question is whether the same type of approximation is useful for bounding the mixing times of the Langevin Monte Carlo method invoked for potentials that are weakly smooth.

There are (at least) two reasons why it is not obvious that such a deterministic approximation would be useful. First, to control the (worst-case, deterministic) error introduced by the smooth approximation, optimization methods crucially rely on averaging of the iterates [26, 12]. However, for sampling methods it is generally not clear how to incorporate averaging, and, to the best of our knowledge, there are no known results that bound the mixing time of sampling methods with averaging, for either continuous- or discrete-time methods. Second, the deterministic error introduces an adversarial bias in the Lipschitz approximation of the gradients (see Eq. (2.4)). While this bias can be made arbitrarily small for values of $\alpha$ that are bounded away from zero, when $\alpha = 0$ the induced bias is constant for any value of $\delta$.

We show that it is possible to bound the mixing times of LMC when the potential is “not too far” from being smooth. In particular, we show that the upper bound on the mixing time of LMC when applied to an $\alpha$-weakly smooth potential is polynomial in $d$ and $1/\epsilon$, with exponents that grow as $\alpha \to 0$, in both the 2-Wasserstein and total variation distance; the bound is thus polynomial whenever $\alpha$ is bounded away from zero. Although we do not prove any lower bounds on the mixing time in this case, the obtained result aligns well with our observation that the deterministic bias cannot be controlled for the deterministic smooth approximation of a nonsmooth Lipschitz function, as explained above. All technical details are deferred to Appendix C.

3.2 Gaussian Smoothing

The main idea can be summarized as follows. Recall that LMC with respect to the potential $f$ can be stated as:

$x_{k+1} = x_k - \eta\, g_{x_k} + \sqrt{2\eta}\,\xi_k$,    (LMC)

where $g_{x_k}$ is a (sub)gradient of $f$ at $x_k$ and $\xi_k \sim N(0, I_d)$ are independent Gaussian random vectors. This method corresponds to the Euler-Maruyama discretization of the Langevin diffusion.
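In code, one step of this recursion is a (sub)gradient step plus injected Gaussian noise. A minimal sketch (ours; the Gaussian potential at the end is an illustrative assumption):

```python
import numpy as np

def lmc_sample(grad_f, x0, eta, n_steps, rng=None):
    """Plain overdamped LMC: x <- x - eta * grad_f(x) + sqrt(2*eta) * xi."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        xi = rng.standard_normal(x.shape)
        x = x - eta * grad_f(x) + np.sqrt(2.0 * eta) * xi
    return x

# Illustration: f(x) = ||x||^2 / 2, so p* is the standard Gaussian.
samples = [lmc_sample(lambda x: x, x0=np.zeros(2), eta=0.05, n_steps=500)
           for _ in range(100)]
```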

Consider a modification of (LMC) in which we add another, independent Gaussian term:

$x_{k+1} = x_k - \eta\, g_{x_k} + \sqrt{2\eta}\,\xi_k + \zeta_k$,    (3.1)

where $\zeta_k \sim N(0, \sigma^2 I_d)$, $\sigma > 0$, and $\zeta_k$ is independent of $\xi_k$. Observe that (3.1) is simply another (LMC) with a slightly higher level of noise—$\sqrt{2\eta + \sigma^2}$ instead of $\sqrt{2\eta}$—per step. Let $y_k := x_k + \zeta_k$. Then:

$y_{k+1} = y_k - \eta\, g_{y_k - \zeta_k} + \sqrt{2\eta}\,\xi_k + \zeta_{k+1}$.    (S-LMC)

Taking expectations on both sides with respect to $\zeta_k$, we have:

$\mathbb{E}_{\zeta_k}\big[ g_{y_k - \zeta_k} \big] = \nabla f^{\sigma}(y_k)$,

where $f^{\sigma}$ is the Gaussian smoothing of $f$ as defined in Section 2 (by symmetry of the Gaussian, perturbing the query point by $-\zeta_k$ or $+\zeta_k$ yields the same expectation). Thus, we can view the sequence in Eq. (S-LMC) as obtained by simply transforming the standard LMC chain into another LMC chain for the smoothed potential $f^{\sigma}$, using stochastic estimates of its gradients. However, the variance of this gradient estimate is too high to handle nonsmooth functions, and, as before, our bound on the mixing time of this chain blows up as $\alpha \to 0$ (see Appendix D).

Thus, instead of working with the algorithm defined by Eq. (S-LMC), we correct for the extra induced variance and consider the sequence of iterates in which only the gradient query point is perturbed:

$x_{k+1} = x_k - \eta\, g_{x_k + \zeta_k} + \sqrt{2\eta}\,\xi_k$,    (P-LMC)

where $\zeta_k \sim N(0, \sigma^2 I_d)$ is drawn fresh at each iteration, independently of $\xi_k$ and of the past. This sequence will have a sufficiently small bound on the variance to obtain the desired results.
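The following sketch implements the perturbed-query recursion as written above (a minimal reading of (P-LMC); the parameter values and the composite example $f_1(x) = \|x\|_1$, $f_2(x) = \frac{\lambda}{2}\|x\|^2$ are illustrative assumptions, not the tuned choices of Theorem 3.3):

```python
import numpy as np

def plmc_sample(subgrad_f, x0, eta, sigma, n_steps, rng=None):
    """Perturbed LMC: the (sub)gradient is queried at x + zeta, zeta ~ N(0, sigma^2 I).

    By the smoothing identity, subgrad_f(x + zeta) is an unbiased estimate of
    grad f^sigma(x), so this chain behaves like LMC on the smoothed potential.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        zeta = sigma * rng.standard_normal(x.shape)
        xi = rng.standard_normal(x.shape)
        x = x - eta * subgrad_f(x + zeta) + np.sqrt(2.0 * eta) * xi
    return x

lam = 0.1  # strong-convexity parameter of the quadratic part (illustrative)
subgrad_f = lambda x: np.sign(x) + lam * x   # f = ||x||_1 + (lam/2)||x||^2
x_final = plmc_sample(subgrad_f, x0=np.zeros(5), eta=0.01, sigma=0.05, n_steps=2000)
```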

Lemma 3.1.

For any $x \in \mathbb{R}^d$ and $\zeta \sim N(0, \sigma^2 I_d)$, let $\hat{g}(x) := g_{x + \zeta}$ denote a stochastic (sub)gradient of $f$. Then $\hat{g}(x)$ is an unbiased estimator of $\nabla f^{\sigma}(x)$, whose variance can be bounded as:

$\mathbb{E}\big\| \hat{g}(x) - \nabla f^{\sigma}(x) \big\|^{2} \le 2 L^{2} \sigma^{2\alpha} d^{\alpha} + 2 M^{2} \sigma^{2} d$.
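This bound follows from Assumption (A2) and the $M$-smoothness of $f_2$ by centering the estimator at a fixed subgradient (a short derivation; the factor 2 comes from $(a+b)^2 \le 2a^2 + 2b^2$):

```latex
\[
  \mathbb{E}\bigl\|\hat{g}(x) - \nabla f^{\sigma}(x)\bigr\|^{2}
  \;\le\; \mathbb{E}\bigl\| g_{x+\zeta} - g_{x} \bigr\|^{2}
  \;\le\; \mathbb{E}\Bigl[ \bigl( L\|\zeta\|^{\alpha} + M\|\zeta\| \bigr)^{2} \Bigr]
\]
\[
  \;\le\; 2L^{2}\,\mathbb{E}\|\zeta\|^{2\alpha} + 2M^{2}\,\mathbb{E}\|\zeta\|^{2}
  \;\le\; 2L^{2}\sigma^{2\alpha} d^{\alpha} + 2M^{2}\sigma^{2} d,
\]
% using that the variance is minimized around the mean (first inequality),
% the composite subgradient bound (2.6) (second inequality), and Jensen's
% inequality E||zeta||^{2 alpha} <= (E||zeta||^2)^alpha = (sigma^2 d)^alpha
% for alpha in [0,1] (last inequality).
```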

Let the distribution of the iterate $x_k$ be denoted by $p_k$, and let $p^*_{\sigma} \propto e^{-f^{\sigma}}$ be the distribution with $f^{\sigma}$ as the potential. Our overall strategy for proving our main result is as follows. First, we show that the Gaussian smoothing does not change the target distribution significantly with respect to the Wasserstein distance, by bounding $W_2(p^*_{\sigma}, p^*)$ (Lemma 3.2). Using Lemma 3.1, we then invoke a result on mixing times of Langevin diffusion with stochastic gradients, which allows us to bound $W_2(p_K, p^*_{\sigma})$. Finally, using the triangle inequality and choosing a suitable step size $\eta$, smoothing radius $\sigma$, and number of steps $K$ so that $W_2(p_K, p^*) \le \epsilon$, we establish our final bound on the mixing time of (P-LMC), stated as Theorem 3.3.

Lemma 3.2.

Let $p^*$ and $p^*_{\sigma}$ be the distributions corresponding to the potentials $f$ and $f^{\sigma}$, respectively. Then, we have:

$W_2(p^*_{\sigma}, p^*) \le \omega(\sigma)$,

where $\omega(\sigma)$, made explicit in the full statement, depends polynomially on $\sigma$, $d$, $L$, and $1/\mu$, is increasing in $\sigma$, and vanishes as $\sigma \to 0$.

Our main result is stated in the following theorem.

Theorem 3.3.

Let the initial iterate $x_0$ be drawn from a probability distribution $p_0$. If the step size $\eta$ is sufficiently small relative to $\mu$ and the smoothness constant $L_\sigma + M$ of $f^{\sigma}$, then $W_2(p_K, p^*)$ is bounded by the sum of three terms: an initialization error that contracts geometrically (at rate $1 - \eta\mu$) in the number of steps $K$, a discretization-and-variance term controlled by $\eta$ and the variance bound of Lemma 3.1, and the smoothing bias $W_2(p^*_{\sigma}, p^*)$ of Lemma 3.2.

Further, if, for a given $\epsilon > 0$, we choose the smoothing radius $\sigma$, the step size $\eta$, and the number of steps $K$ appropriately (each polynomial in $\epsilon$, $1/d$, and the problem parameters), then $W_2(p_K, p^*) \le \epsilon$.

Treating $L$, $M$, and $\mu$ as constants and using the fact that $\mathbb{E}_{p^*}\|x - x^*\|^2 \le d/\mu$ for strongly log-concave targets (see [9, Lemma 13]), we find that Theorem 3.3 yields a mixing-time bound that is polynomial in $d$ and $1/\epsilon$ for every $\alpha \in [0, 1]$. When $\alpha = 1$ (the Lipschitz-gradient case), we recover the known mixing time of $\tilde{O}(d/\epsilon^2)$, while at the other extreme, when $\alpha = 0$ (the nonsmooth Lipschitz-potential case), the bound remains polynomial, with larger exponents in both $d$ and $1/\epsilon$.

Proof of Theorem 3.3.

By the triangle inequality,

$W_2(p_K, p^*) \le W_2(p_K, p^*_{\sigma}) + W_2(p^*_{\sigma}, p^*)$.    (3.2)

To bound the first term, $W_2(p_K, p^*_{\sigma})$, we invoke [10, Theorem 1] (see Theorem A.4 in Appendix A). Recall that $f^{\sigma}$ is continuously differentiable, smooth (with smoothness constant $L_\sigma + M$), and $\mu$-strongly convex. Additionally, the sequence of points $\{x_k\}$ produced by (P-LMC) can be viewed as a sequence of iterates of overdamped LMC with respect to the potential $f^{\sigma}$, where the iterates are updated using unbiased stochastic estimates of $\nabla f^{\sigma}$. Thus we obtain a bound of the form:

$W_2(p_K, p^*_{\sigma}) \le (1 - \eta\mu)^{K}\, W_2(p_0, p^*_{\sigma}) + E_{\mathrm{disc}}(\eta) + E_{\mathrm{var}}(\eta, \sigma)$,    (3.3)

where $E_{\mathrm{disc}}$ and $E_{\mathrm{var}}$ denote (schematically) the discretization and gradient-variance terms of that theorem, and, as shown in Lemma 3.1, the variance of the gradient estimates is at most $2L^2\sigma^{2\alpha}d^{\alpha} + 2M^2\sigma^2 d$.

The last piece we need is control over the distance between the distributions $p^*_{\sigma}$ and $p^*$. This is established above in Lemma 3.2, which gives:

$W_2(p^*_{\sigma}, p^*) \le \omega(\sigma)$,    (3.4)

where $\omega(\sigma)$ is as defined above. Combining Eqs. (3.2)-(3.4), we get a bound on $W_2(p_K, p^*)$ in terms of the relevant problem parameters. This proves the first part of the theorem.

It is straightforward to verify that our choice of $\sigma$ ensures that $W_2(p^*_{\sigma}, p^*)$ is at most a constant fraction of $\epsilon$. The choice of $\eta$ ensures the same for the discretization and variance terms, and the choice of $K$ ensures that the initial error contracts exponentially to the remaining fraction of $\epsilon$ (see the proof of Theorem 3.4 in Appendix F for a similar calculation). This yields the second claim. ∎

Further, we show that this result can be generalized to total variation distance.

Theorem 3.4.

Let the initial iterate $x_0$ be drawn from a probability distribution $p_0$. If we choose the step size $\eta$ sufficiently small, then $\|p_K - p^*\|_{\mathrm{TV}}$ is bounded by the sum of an exponentially contracting initialization term, a discretization-and-variance term controlled by $\eta$ and $\sigma$, and a smoothing-bias term controlled by $\sigma$.

Further, if, for a given $\epsilon > 0$, we choose the smoothing radius $\sigma$ appropriately, and then choose the step size $\eta$ and the number of steps $K$ as suitable functions of $\epsilon$, $d$, and the problem parameters, we have $\|p_K - p^*\|_{\mathrm{TV}} \le \epsilon$.

Remark 3.1.

Treating $L$, $M$, and $\mu$ as constants and using the fact that $\mathbb{E}_{p^*}\|x - x^*\|^2 \le d/\mu$ (by [9, Lemma 13]), Theorem 3.4 gives a mixing-time bound in total variation that is polynomial in $d$ and $1/\epsilon$ for all $\alpha \in [0, 1]$, with the best exponents attained at $\alpha = 1$ (Lipschitz gradients) and the worst at $\alpha = 0$ (nonsmooth Lipschitz potential). While the bound for the smooth case ($\alpha = 1$) is looser than the best known bound for LMC with a warm start [10], our results in TV distance do not require a warm start, improving in this respect upon the corresponding cold-start result of [10]. Further, we conjecture that our bound is improvable. The main loss is incurred when relating $W_2$ to the KL divergence, using an inequality from [30] (see Appendix A). If tighter inequalities were obtained in the literature, either relating $W_2$ and KL, or directly relating $W_2$ and TV, this result would immediately improve as a consequence. The results for LMC with non-Lipschitz gradients ($\alpha < 1$) are novel. Finally, as a byproduct of our approach, we obtain the first bound for stochastic-gradient LMC in TV distance (see Remark E.1 in Appendix E).

4 Sampling for Regularized Potentials

Consider now the case in which we are interested in sampling from a distribution $\bar{p}^* \propto e^{-f_1}$. As mentioned in Section 2, we can use the same analysis as in the previous section, by running (P-LMC) with a regularized potential $f(x) = f_1(x) + \frac{\lambda}{2}\|x - \bar{x}\|^2$, where $\lambda > 0$ is small. To obtain the desired result, the only missing piece is bounding the distance between $p^* \propto e^{-f}$ and $\bar{p}^*$, leading to the following corollary of Theorem 3.4.

Corollary 4.1.

Let the initial iterate satisfy $x_0 \sim p_0$ for some distribution $p_0$, and let $p_K$ denote the distribution of $x_K$. If we choose the step size $\eta$ sufficiently small, then $\|p_K - \bar{p}^*\|_{\mathrm{TV}}$ is bounded by the sum of the terms from Theorem 3.4 and an additional regularization-bias term that is controlled by $\lambda$ and the fourth moment of $\bar{p}^*$.

Further, if, for a given $\epsilon > 0$, we choose $\lambda$ sufficiently small and all other parameters as in Theorem 3.4, then we have $\|p_K - \bar{p}^*\|_{\mathrm{TV}} \le \epsilon$.

Remark 4.1.

Treating $L$ and the fourth moment of $\bar{p}^*$ as constants, the upper bound on the mixing time remains polynomial in $d$ and $1/\epsilon$, with exponents that degrade as $\alpha$ decreases from 1 to 0, mirroring Remark 3.1.
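In terms of the (P-LMC) sketch from Section 3.2, the regularized scheme amounts to adding a small quadratic pull toward an anchor point before sampling (again an illustrative sketch that reuses the hypothetical `plmc_sample` and the $f_1 = \|\cdot\|_1$ example defined there):

```python
import numpy as np

# Regularized potential: f(x) = f1(x) + (lam/2) * ||x - x_bar||^2,
# with f1(x) = ||x||_1 as the purely nonsmooth target from before.
lam, x_bar = 1e-2, np.zeros(5)   # small lam keeps p* close to exp(-f1)
subgrad_f = lambda x: np.sign(x) + lam * (x - x_bar)

x_final = plmc_sample(subgrad_f, x0=x_bar.copy(), eta=0.01, sigma=0.05,
                      n_steps=5000)
```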

5 Discussion

We obtained polynomial-time theoretical guarantees for a variant of Langevin Monte Carlo—(P-LMC)—that uses Gaussian smoothing and applies to target distributions with nonsmooth log-densities. The smoothing we apply is tantamount to perturbing the gradient query points in Langevin Monte Carlo by a Gaussian random variable, which is a minor modification to the standard method.

Beyond its applicability to sampling from more general weakly smooth and nonsmooth target distributions, our work also has some interesting implications. For example, we believe it is possible to extend our results to sampling from structured distributions with nonsmooth and nonconvex negative log-densities, following an argument from, e.g., [8]. Further, it seems plausible that coupling our results with the results for derivative-free Langevin Monte Carlo [34] (which only applies to distributions with smooth and strongly convex log-densities) would lead to a more broadly applicable derivative-free Langevin Monte Carlo algorithm.

Several other interesting directions for future research remain. For example, as discussed in Remark 3.1 and Remark E.1 (Appendix E), we conjecture that the asymptotic dependence on and in our bounds on the mixing times for total variation distance (Theorem 3.4) can be improved to match those obtained for the 2-Wasserstein distance (Theorem 3.3).

Acknowledgements

We gratefully acknowledge the support of the NSF through grants CCF-1740855 and IIS-1619362. This work was done in part while the authors were visiting the Simons Institute for the Theory of Computing.

References

Appendix A Additional Background

Here we state the results from related work that are invoked in our analysis.

First, the smooth approximations of the potentials used in this paper are pointwise larger than the original potentials, and have a bounded distance from the original potentials. This allows us to invoke the following lemma from [10].

Lemma A.1.

[10, Lemma 3] Let $h_1$ and $h_2$ be two functions such that $h_1(x) \le h_2(x)$ for all $x \in \mathbb{R}^d$, and both $e^{-h_1}$ and $e^{-h_2}$ are integrable. Then the Kullback-Leibler divergence between the distributions $\pi_1 \propto e^{-h_1}$ and $\pi_2 \propto e^{-h_2}$ defined by these densities can be bounded as:

$\mathrm{KL}(\pi_1 \,\|\, \pi_2) \le \mathbb{E}_{\pi_1}\big[ h_2(x) - h_1(x) \big]$.

As a consequence, $\|\pi_1 - \pi_2\|_{\mathrm{TV}} \le \sqrt{\tfrac{1}{2}\,\mathbb{E}_{\pi_1}[\, h_2(x) - h_1(x) \,]}$ by Pinsker's inequality.

The next result that we will be invoking allows us to bound the Wasserstein distance between the target distributions corresponding to the composite potential $f$ and its Gaussian smoothing $f^{\sigma}$.

Lemma A.2.

[3, Corollary 2.3] Let $(\mathcal{X}, d)$ be a measurable space equipped with a measurable distance $d$, let $p \ge 1$, and let $\nu$ be a probability measure on $\mathcal{X}$. Assume that there exist $x_0 \in \mathcal{X}$ and $\beta > 0$ such that $\int e^{\beta\, d(x_0, x)^p}\, d\nu(x)$ is finite. Then, for any other probability measure $\mu$ on $\mathcal{X}$:

$W_p(\mu, \nu) \le C \Big[ \mathrm{KL}(\mu \,\|\, \nu)^{1/p} + \Big( \dfrac{\mathrm{KL}(\mu \,\|\, \nu)}{2} \Big)^{1/(2p)} \Big]$,

where $C$ depends only on $\beta$ and the exponential moment above.

Another useful result, due to [30], lets us bound the KL divergence between two distributions in terms of their 2-Wasserstein distance. This is used to relate the TV distance between distributions to their respective Wasserstein distance in Section 3.2.

Proposition A.3.

[30, Proposition 1] Let $\nu$ be a density on $\mathbb{R}^d$ such that, for all $x \in \mathbb{R}^d$: $\|\nabla \log \nu(x)\| \le c_1 \|x\| + c_2$, for some $c_1 > 0$ and $c_2 \ge 0$. Then, for probability measures $P$ and $Q$ with finite second moments,

$\mathrm{KL}(P \,\|\, \nu) - \mathrm{KL}(Q \,\|\, \nu) \le \big( c_1 \big( \mathbb{E}_P \|x\|^2 + \mathbb{E}_Q \|x\|^2 \big)^{1/2} + c_2 \big)\, W_2(P, Q)$.

In particular, if $\nu \propto e^{-h}$ for some $\tilde{M}$-smooth function $h$, then we immediately have the same bound with $\nabla \log \nu = -\nabla h$, and the assumption of the proposition is satisfied with $c_1 = \tilde{M}$ and $c_2 = \|\nabla h(0)\|$.

We will be invoking a result from [11] that bounds the Wasserstein distance between the target distribution and the distribution of the $K$-th iterate of LMC with stochastic gradients. The assumption about the stochastic gradients is that their bias and variance are bounded. Namely:

$\big\| \mathbb{E}[\hat{g}(x)] - \nabla f(x) \big\| \le \delta_{\mathrm{bias}}$

and

$\mathbb{E}\big\| \hat{g}(x) - \mathbb{E}[\hat{g}(x)] \big\|^{2} \le \delta_{\mathrm{var}}^{2}$

for all $x \in \mathbb{R}^d$, where the stochastic-gradient chain is otherwise driven by the same diffusion term $\sqrt{2\eta}\,\xi_k$ as (LMC).