 # Weakly smooth Langevin Monte Carlo using p-generalized Gaussian smoothing

Langevin Monte Carlo (LMC) is an iterative process for sampling from a distribution of interests. The nonasymptotic mixing time is studied mostly in the context of smooth (gradient-Lipschitz) log-densities, a significant constraint for its deployment in many sciences including computational statistics and artificial intelligence. In the original article,  eliminates this restriction and establishes polynomial-time convergence assurances for a variation of LMC in the context of weakly smooth log-concave distributions. Based on their approach, we generalize the Gaussian smoothing to p-generalized Gaussian perturbation process, while maintaining the induced bias and variance bounded. We also improve their nonasymptotic dependence on the dimension and strongly convex parameters.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Over the last decade, Bayesian inference is one of the most prevalent inferring instruments for variety of disciplines including the computational statistics and artificial intelligence

[4, 6, 14, 19, 21]. In general, Bayesian inference seeks to generate samples of the full posterior distribution over the parameters and perhaps latent variables which yields a mean to quantify uncertainty in the model and prevents it from over-fitting [17, 24]. A typical problem is aiming to sample up to a normalizing constant from a log-concave posterior distribution of the subsequent form:

 π(x)=e−U(x)/∫Rde−U(y)dy,

where the function , also known as the potential function, is convex. The most conventional approaches for sampling from a posterior distribution are random walks Metropolis Hasting [12, 15] where sampling is reduced to constructing a Markov kernel on (-the Borel -field of

), whose invariant probability distribution is

.

However, selecting a proper proposal distribution for the Metropolis Hastings algorithm is a complicated matter. As a result, it has been proposed to consider continuous dynamics which inherently leave the objective distribution invariant. Probably, one of the most well-known example of these continuous dynamics applications are the over-damped Langevin diffusion  associated with , assumed to be continuously differentiable:

 dYt=−∇U(Yt)dt+√2dBt, (1.1)

where is a -dimensional Brownian motion. Under suitable assumptions on , this stochastic differential equations (SDE) possesses a unique strong solution and defines a strong Markov semigroup which converges to in total variation , or Wasserstein distance . Yet, simulating path solutions of such SDE is not feasible in most circumstances, and discretization of these equations are employed instead. Moreover, numerical solutions related to these approaches define Markov kernels for which is no longer stationary. Hence, measuring the error induced by these approximations is critical to justify their application to sample from the objective distribution . We study in this paper the Euler-Maruyama discretization of Eq.(1.1

) which defines the (possibly inhomogenous) Markov chain

given for all by

 xk+1=xk−ηk∇U(xk)+√2ηξk, (1.2)

where is a sequence of step sizes which can be kept constant or decreasing to , and

are independent Gaussian random vectors. Nonetheless, there is a critical gap in the theory of discretization of an underlying SDE to the potential broad spectrum of applications in statistical inference. In particular, the application of techniques from SDEs traditionally requires

to have Lipschitz-continuous gradients. This requirement prohibits many typical utilizations [11, 14, 16].

 has recently established an original approach to deal with weakly smooth (possibly non-smooth) potential problems directly. Their technique rests on results obtained from the optimization community, perturbing a gradient evaluating point by a Gaussian. They do not demand strong assumptions such as the existence of proximal maps, composite structure [2, 11] or strong convexity ). In this paper, we show that Gaussian smoothing can be generalized to -generalized Gaussian smoothing, which is novel in both Bayesian and optimization communities. In a more general context, this novel perturbation process provides additional features, comparable to Gaussian smoothing while preserving the same result when a -generalized Gaussian is a Gaussian. In particular, the bias and variance are bounded for a general while these bounds are equivalent to Gaussian smoothing for the case . In addition, we also improve the convergence result in Wasserstein distance by a simpler approach. We integrate this result with an LMC convergence result of stochastic gradient estimates , achieving a sharper nonasymptotic result for convergence in Wasserstein distance of LMC with stochastic gradients.

The paper is organized as follows. Section 2 sets out the notation and context necessary to give our core theorems, generalizing the results obtained from  for Gaussian smoothing. Section 3 broadens these outcomes of stochastic approximation of the potential (negative log-density) and composite structure of the potential, while Section 4 shows our conclusions.

## 2 Weakly smooth Langevin Monte Carlo using p-generalized Gaussian smoothing

The objective is to sample from a distribution , where . We furnish the space with the regular Euclidean norm and use to specify inner products. While sampling from the exact distribution is generally computationally demanding, it is largely adequate to sample from an approximated distribution which is in the vicinity of by some distances. In this paper, we use Wasserstein distance and briefly define it in Appendix A.

Suppose the following conditions, identical to the first two assumptions of :

###### Assumption 1.

is convex and sub-differentiable. Specifically, for all , there exists a sub-gradient of , ensuring that

###### Assumption 2.

There exist and so that , we have

 ∥∇U(x)−∇U(y)∥2≤L∥x−y∥α2, (2.1)

where represents an arbitrary sub-gradient of at .

###### Remark 1.

Condition 2 is known as -weakly smoothness or Holder continuity of the (sub)gradients of . A feature that follows straightfowardly from Eq.(2.1) is that:

 U(y)≤U(x)+⟨∇U(x), y−x⟩+L1+α∥y−x∥1+α, ∀x, y∈Rd. (2.2)

When , it is equivalent to the standard smoothness (Lipschitz continuity of the gradients), whereas at the opposite extreme, when is (possibly) non-smooth and Lipschitz-continuous.

 perturbs a gradient evaluating point by a Gaussian so that they can utilize weakly smooth Lipschitz potentials directly without any additional structure. Here, we generalize their approach to perturbation using -generalize Gaussian smoothing. Particularly, consider , -generalized Gaussian smoothing of is defined as:

 Uμ(y):=Eξ[U(y+μξ)],

where (the

-generalized Gaussian distribution),

. The rationale for taking into account the -generalized Gaussian smoothing rather than is that it typically benefits from superior smoothness properties. In particular, is smooth albeit is not. In addition,

-generalized Gaussian smoothing is more generalized than Gaussian smoothing in the sense that it contains all normal distribution when

and all Laplace distribution when

. This family of distributions allows for tails that are either heavier or lighter than normal and in the limit, it contains all the continuous uniform distribution. It is possible to study

but to simplify the proof, we are only interested in . Here we examine some primary features of based on adapting those results of .

###### Lemma 1.

Given that is a convex function which satisfies Eq.(2.1) for some and , then:

(i) : .

(ii) : .

###### Proof.

See Appendix B. ∎

The subsequent lemma is a straightforward adaptation of the results of  and it establishes particular regularity conditions for -generalized Gaussian smoothing that will be employed in our investigation. We demonstrate that -generalized Gaussian smoothing maintains strong convexity in the following lemma.

###### Lemma 2.

Given that : is -strongly convex, then is also -strongly convex.

###### Proof.

See Appendix B. ∎

## 3 Sampling for regularized potential

To study convergence of the continuous-time process (which involves strong convexity), we work with regularized potentials that have the following composite structure:

 ¯¯¯¯U(x):=U(x)+ψ(x), (3.1)

where is m-smooth and -strongly convex. Observe that by the triangle inequality, we have:

 ∥∇¯¯¯¯U(x)−∇¯¯¯¯U(y)∥2 ≤∥∇U(x)−∇U(y)∥2+∥∇ψ(x)−∇ψ(y)∥2 (3.2) ≤L∥x−y∥α2+m∥x−y∥2.

From which, by employing Lemma 1, we get:

 ∥∇¯¯¯¯Uμ(x)−∇¯¯¯¯Uμ(y)∥2≤⎛⎝Ldp−1p(1−α)μ1−α(1+α)1−α+m⎞⎠∥x−y∥2. (3.3)

In this case, is -weakly smooth (possibly with , is nonsmooth and Lipschitz-continuous). We now analyze LMC where we perturb the points at which gradients of are evaluated by a

-generalized Gaussian random variable. Remark that it remains unclear whether it is possible to achieve such bounds for (LMC) without this perturbation. Recall that LMC in terms of the potential

can be specified as:

 xk+1=xk−η∇¯¯¯¯U(xk)+√2ηξk, (3.4)

where are independent Gaussian random vectors. This method is actually the Euler-Mayurama discretization of the Langevin diffusion.

Rather than working with the algorithm specified by Eq. 3.4, we rectify the algorithm and inspect the folowing form:

 yk+1+μωk=yk+μωk−1−η∇¯¯¯¯U(yk+μωk−1)+√2ηξk, (3.5)

where for every and is sequence of -generalized Gaussian noise. We obtain the following result.

###### Lemma 3.

For any , and , let

 G(x,z1,z2):=∇¯¯¯¯U(x+μz1)−μηz1−μηz2

denote a stochastic gradient of . Then

is an unbiased estimator of

whose (normalized) variance can be bounded as:

 σ2: =Ez1,z2[∥∇¯¯¯¯Uμ(x)−G(x,z1,z2)∥22]d ≤4dα−1μ2αL2⎛⎜ ⎜⎝p2pΓ(3p)Γ(1p)⎞⎟ ⎟⎠α+4μ2m2⎛⎜ ⎜⎝p2pΓ(3p)Γ(1p)⎞⎟ ⎟⎠+8μ2p2pΓ(3p)η2Γ(1p).
###### Proof.

See Appendix B. ∎

Let the distribution of the iterate be represented by , and let be the distribution with as the potential. Our overall tatic for showing our core finding is as belows. First, we prove that the -generalized Gaussian smoothing does not alter the objective distribution substantially in term of the Wasserstein distance, by bounding (Lemma 4). Employing Lemma 3, we then deploy a result on mixing times of Langevin diffusion with stochastic gradients, which enables us to bound .

###### Lemma 4.

Assume that and . We have the following bounds, for any

 W22(¯¯¯π, ¯¯¯πμ)≤4(d+λ∥x∗∥22)λ(a+ea−1).

where is unique minimizer of , . If, in addition, and for sufficient small , then

 W2(¯¯¯π, ¯¯¯πμ)≤3√daλ,
###### Proof.

See Appendix B. ∎

Our primary outcome is reported in the subsequent theorem.

###### Theorem 1.

Let the initial iterate be drawn from a probability distribution . If the step size satisfies and for sufficient small , then:

 W2(¯¯¯πK,¯¯¯π)≤(1−λη)KW2(¯π0,¯¯¯πμ)+2(M+m)λ(ηd)1/2+σ2(ηd)1/2M+m+λ+σ√λ+3√daλ,

where , , and .

###### Proof.

Combining Lemma 3 and Lemma 4, it follows straightfowardly from  Theorem 3.1. ∎

###### Remark 2.

Our bound is equivalent to . In comparison to the previous result of , which is of order , our approach has better dependences on both dimension and strongly convex parameter , and strictly better whenever is not the dominant term. In addition, our method is more generalized than Gaussian smoothing, especially when we desire to explore the space by heavier tail distribution than normal.

## 4 Conclusion

We derive polynomial-time theoretical assurances for a Perturbed Langevin Monte Carlo (P-LMC) that uses -generalized Gaussian smoothing and deploy to objective distributions with weakly smooth log-densities. The algorithm we proposed, is perturbing the gradient evaluating points in Langevin Monte Carlo by a -generalized Gaussian random variable, which is a generalization of the recent Gaussian smoothing LMC method.

It is potential to broaden our results to sampling from structured distributions with nonsmooth and nonconvex negative log-densities . Further, it is not hard to show that our results can be integrated seamlessly in a largely applicable derivative-free Langevin Monte Carlo algorithm.

There are several attractive directions for future research. For instance, as discussed in Remark 2, we speculate that the dependence on and is not optimal and can be improved to match those obtained for the 2-Wasserstein distance using proximal assumption.

## Appendix A Distance measures

Define a transference plan , a distribution on (where is the Borel -field of ()) so that and for any . Let designate the set of all such transference plans. Then the 2-Wasserstein distance is formulated as:

 W2(P,Q) :=(infζ∈Γ(P,Q)∫x,y∈Rd∥x−y∥22dζ(x, y))1/2.

## Appendix B Proofs

###### Lemma 1.

Given that is a convex function which satisfies Eq. 2.1 for some and , then:

(i) : .

(ii) : .

###### Proof.

Part (i). Since , and , which implies , we have

 Uμ(x)−U(x)=E[U(x+μξ)−U(x)−μ⟨∇U(x), ξ⟩].

First, if is convex and , from A.1, for every and , so . By the definition of the density of -generalized Gaussian distribution , we also have:

 Uμ(x)−U(x)=⎛⎝p1−1p2Γ(1p)⎞⎠d∫Rd[U(x+μξ)−U(x)−μ⟨∇U(x), ξ⟩]e−∥ξ∥pp/pdξ.

Applying Eq. 2.2 and previous inequality:

 |Uμ(x)−U(x)| ≤L1+αμ1+α⎛⎝p1−1p2Γ(1p)⎞⎠d∫Rd∥ξ∥(1+α)2e−∥ξ∥pp/pdξ ≤L1+αμ1+α⎛⎝p1−1p2Γ(1p)⎞⎠d∫Rd∥ξ∥(1+α)pe−∥ξ∥pp/pdξ =L1+αμ1+α⎛⎝p1−1p2Γ(1p)⎞⎠d∫Rd∥ξ∥2p(1+α)/(2p)pe−∥ξ∥pp/pdξ.

The second inequality follows from for any . Since we have and is concave. It follows that

 Uμ(x)−U(x) ≤L1+αμ1+αE[∥ξ∥2p(1+α)/(2p)p] ≤L1+αμ1+α[E∥ξ∥2pp](1+α)/(2p) ≤L1+αμ1+α⎡⎢ ⎢⎣2pΓ(dp+2)Γ(dp)⎤⎥ ⎥⎦(1+α)/(2p) =L1+αμ1+α(2pd(d+p)p2)(1+α)/(2p) =L1+αμ1+α(2d(d+p)p)(1+α)/(2p).

Finally, for sufficiently large , we have

 |Uμ(x)−U(x)|≤L1+αμ1+αd1+αp.

Part (ii). First, observe that, by Jensen’s inequality and Eq. 2.1:

 ≤⎛⎝p1−1p2Γ(1p)⎞⎠d∫Rd∥∇U(y+μξ)−∇U(x+μξ)∥2e−∥ξ∥pp/pdξ (B.1) ≤L∥y−x∥α2.

Further, by exchanging gradient and integral, the gradient of can be expressed as:

 ∇Uμ(x)=1μ⎛⎝p1−1p2Γ(1p)⎞⎠d∫RdU(x+μξ)∥ξ∥p−1pe−∥ξ∥pp/pdξ.

Thus, applying Jensen’s inequality, we also have:

 ∥∇Uμ(y)−∇Uμ(x)∥2≤1μ⎛⎝p1−1p2Γ(1p)⎞⎠d∫Rd|U(x+μξ)−U(y+μξ)|∥ξ∥p−1pe−∥ξ∥pp/pdξ. (B.2)

Using Eq. 2.2, we have that:

 |U(x+μξ)−U(y+μξ)| ≤min{⟨∇U(y+μξ),x−y⟩+L1+α∥y−x∥1+α2, ⟨∇U(x+μξ),y−x⟩+L1+α∥y−x∥1+α2} ≤12⟨∇U(y+μξ)−∇U(x+μξ), x−y⟩+L1+α∥y−x∥1+α2 ≤L1+α∥y−x∥1+α2,

where the second inequality comes from the minimum being smaller than the mean, and the last inequality is by convexity of (which implies , , ). Thus, combining with Eq. B.2, we have:

 ∥∇Uμ(y)−∇Uμ(x)∥2 ≤Lμ(1+α)∥y−x∥1+α2⎛⎝p1−1p2Γ(1p)⎞⎠d∫Rd∥ξ∥p−1pe−∥ξ∥pp/pdξ =Lμ(1+α)∥y−x∥1+α2⎡⎢ ⎢⎣pp−1pΓ(n+p−1p)Γ(np)⎤⎥ ⎥⎦. (B.3)

From , we want to show that Since is log-convex, by Jensen’s inequality for any , we have

 1plogΓ(dp)+p−1plogΓ(dp+1) ≥logΓ(1pdp+p−1p(dp+1)), ≥logΓ(d+p−1p)>0.

Raising to the power of both sides, we get

 Γ(dp)1pΓ(dp+1)p−1p≥Γ(d+p−1p),

which implies that

 ⎡⎢ ⎢⎣Γ(dp+1)Γ(dp)⎤⎥ ⎥⎦p−1p ≥Γ(d+p−1p)Γ(dp), [dp]p−1p ≥Γ(d+p−1p)Γ(dp).

So

 ∥∇Uμ(y)−∇Uμ(x)∥2 ≤Lμ(1+α)∥y−x∥1+α2pp−1p[dp]p−1p ≤Lμ(1+α)∥y−x∥1+α2dp−1p.

Finally, combining Eqs. B.1 and B.3:

 ∥∇Uμ(y)−∇Uμ(x)∥2=∥∇Uμ(y)−∇Uμ(x)∥α2⋅∥∇Uμ(y)−∇Uμ(x)∥1−α2
 ≤Lα⎛⎝Ldp−1pμ(1+α)⎞⎠1−α∥y−x∥2
 =Ldp−1p(1−α)μ1−α(1+α)1−α∥y−x∥2,

as claimed. ∎

###### Lemma 2.

Given that : is -strongly convex, then is also -strongly convex.

###### Proof.

Recall that a differentiable function is -strongly convex if, :

By the definition of a -Gaussian smoothing, :

 ψμ(y)−ψμ(x)−⟨∇ψμ(x), y−x⟩ =⎛⎝p1−1p2Γ(1p)⎞⎠d∫Rd(ψ(y+μξ)−ψ(x+μξ)−⟨∇ψ(x+μξ), y−x⟩)e−∥ξ∥pp/pdξ =λ2∥y−x∥2,

where we have used -strong convexity of

###### Lemma 3.

For any , and , let denote a stochastic gradient of . Then is an unbiased estimator of whose (normalized) variance can be bounded as:

 σ2: =Ez1,z2[∥∇¯¯¯¯Uμ(x)−G(x,z1,z2)∥22]d ≤4dα−1μ2αL2⎛⎜ ⎜⎝p2pΓ(3p)Γ(1p)⎞⎟ ⎟⎠α+4μ2m2⎛⎜ ⎜⎝p2pΓ(3p)Γ(1p)⎞⎟ ⎟⎠+8μ2p2pΓ(3p)η2Γ(1p).
###### Proof.

Recall that by definition of , we have , where , and is independent of . Clearly, .

We now proceed to bound the variance of . First, using Young’s inequality (which implies and that , we have:

 Ez1,z2[∥∇¯¯¯¯Uμ(x)−G(x,z1,z2)∥22] ≤2Ez1[∥Ew[¯¯¯¯U(x+μw)]−∇¯¯¯¯U(x+μz1)∥22]+4μ2Ez1[∥z1∥22]η2+4μ2Ez12[∥z2∥22]η2 =2Ez1[∥Ew[¯U(x+μw)]−∇¯¯¯¯U(x+μz1)∥22]+8μ2η2tr(Σ) =2Ez1[∥Ew[¯U(x+μw)]−∇¯¯¯¯U(x+μz1)∥22]+8μ2p2pΓ(3p)dη2Γ(1p),

where the last equation is by . The rest of the proof is as follows. By Jensen’s inequality,

 Ez1[∥Ew[∇¯¯¯¯U(x+μw)]−∇¯¯¯¯U(x+μz1)∥22]≤Ez1,w[∥∇¯¯¯¯U(x+μw)−∇¯¯¯¯U(x+μz1)∥22].

Hence, applying Young’s inequality , we further have:

 Ez1,w[∥∇¯¯¯¯U(x+μw)−∇¯¯¯¯U(x+μz1)∥22]≤Ez1,w[(L∥μ(w−z1)∥α2+m∥μ(w−z1)∥2)2]
 ≤2L2μ2αEz1,w[∥w−z1∥2α2]+2m2μ2Ez1,w[∥w−z1∥22].

Observe that is a concave function, . Hence, we have that . As and are independent, . Thus, we finally have:

 Ez1,z2[∥∇¯¯¯¯Uμ(x)−G(x, z1, z2)∥22