Primal Dual Interpretation of the Proximal Stochastic Gradient Langevin Algorithm

06/16/2020 ∙ by Adil Salim, et al. ∙ 16

We consider the task of sampling with respect to a log concave probability distribution. The potential of the target distribution is assumed to be composite, i.e., written as the sum of a smooth convex term, and a nonsmooth convex term possibly taking infinite values. The target distribution can be seen as a minimizer of the Kullback-Leibler divergence defined on the Wasserstein space (i.e., the space of probability measures). In the first part of this paper, we establish a strong duality result for this minimization problem. In the second part of this paper, we use the duality gap arising from the first part to study the complexity of the Proximal Stochastic Gradient Langevin Algorithm (PSGLA), which can be seen as a generalization of the Projected Langevin Algorithm. Our approach relies on viewing PSGLA as a primal dual algorithm and covers many cases where the target distribution is not fully supported. In particular, we show that if the potential is strongly convex, the complexity of PSGLA is (1/ε^2) in terms of the 2-Wasserstein distance. In contrast, the complexity of the Projected Langevin Algorithm is (1/ε^12) in terms of total variation when the potential is convex.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Sampling from a target distribution is a fundamental task in machine learning. Consider the Euclidean space

and a convex function . Assuming that has a positive finite integral w.r.t. the Lebesgue measure , we consider the task of sampling from the distribution whose density is proportional to (we shall write ).

If is smooth, Langevin algorithm produces a sequence of iterates asymptotically distributed according to . Langevin algorithm performs iterations of the form

(1)

where and

is a sequence of i.i.d. standard Gaussian vectors in

. Each iteration of (1) can be seen as a gradient descent step for , where the gradient of perturbed by a Gaussian vector. Nonasymptotic bounds for Langevin algorithm have been established in dalalyan2017theoretical ; durmus2017nonasymptotic . Moreover, Langevin algorithm can be interpreted as an inexact gradient descent method to minimize the Kullback-Leibler (KL) divergence w.r.t.  in the space of probability measures cheng2018convergence ; durmus2018analysis ; bernton2018langevin ; wibisono2018sampling ; ma2019there ; ambrosio2008gradient .

In many applications, the function

is naturally written as the sum of a smooth and a nonsmooth term. In Bayesian statistics for example,

typically represents some posterior distribution. In this case, is the sum of the -likelihood (which is itself a sum over the data points) and the possibly nonsmooth potential of the prior distribution welling2011bayesian ; durmus2018efficient ; durmus2018analysis , which plays the role of a regularizer. In some other applications in Bayesian learning, the support of is not the whole space  bubeck2018sampling ; brosse2017sampling (i.e., can take the value ). In order to cover these applications, we consider the case where is written as

(2)

where

is a random variable,

for every , is smooth and convex and is nonsmooth and convex. We assume to have access to the stochastic gradient (where is a random variable with values in ) and to the proximity operator of . The template (2) covers many log concave densitiesChaux2007 ; durmus2018efficient ; durmus2018analysis ; bubeck2018sampling . In optimization, the minimization of can be efficiently tackled by the proximal stochastic gradient algorithm atc-for-mou-17 . Inspired by this optimization algorithm, the Proximal Stochastic Gradient Langevin Algorithm (PSGLA) durmus2018analysis is the method performing proximal stochastic Langevin steps of the form

(3)

where , is a sequence of i.i.d. standard Gaussian random vectors in , and is a sequence of i.i.d. copies of . Remarkably, the iterates of PSGLA remain in the domain of , i.e., the support of , a property that is useful in many contexts. When is Lipschitz continuous, the support of is and PSGLA can be interpreted as an inexact proximal gradient descent method for minimizing KL, with convergence rates proven in terms of the KL divergence durmus2018analysis . However, for general , the KL divergence can take infinite values along PSGLA. Therefore, a new approach is needed.

1.1 Related works

First, various instances of the PSGLA algorithm have already been considered.

The only instance allowing to be infinite (i.e., the support of not to be ) is the Projected Langevin Algorithm bubeck2018sampling , which corresponds to our setting in the special case with (i.e., the indicator function of a convex body111A convex body is a compact convex set with a nonempty interior. ), and (i.e., the full gradient of ). In this case, is the orthogonal projection onto and is supported by . Bubeck et al bubeck2018sampling provide complexity results in terms of sufficient number of iterations to achieve accuracy in terms of the Total Variation between the target distribution and the current iterate distribution. Assuming that is convex and smooth, the complexity of the Projected Langevin Algorithm is 222Our big O notation ignores logarithm factors., and if , the complexity is improved to .

Other instances of PSGLA were proposed in the case where is Lipschitz continuous or smooth (and hence finite). Wibisono wibisono2018sampling considered the case with and , proposing the Symmetrized Langevin Algorithm (SLA), and showed that the current iterate distribution converges linearly in Wasserstein distance to the invariant measure of the SLA, if is strongly convex and smooth. Durmus et al durmus2018analysis considered the case where is Lipschitz continuous, and showed that the complexity of PSGLA is in terms of the KL divergence and in terms of the Total Variation distance if is convex and smooth. If is strongly convex, the complexity is in Wasserstein distance and in KL divergence. Bernton bernton2018langevin studied a setting similar to durmus2018analysis and derived a similar result for the Proximal Langevin Algorithm (i.e., PSGLA without the gradient step) in the strongly convex case. The Proximal Langevin Algorithm was also studied in a recent paper of Wibisono wibisono2019proximal , where a rapid convergence result was proven in the case where is nonconvex but satisfies further smoothness and geometric assumptions.

Second, the task of sampling w.r.t. , where is nonsmooth and possibly takes infinite values, using Langevin algorithm, has also been considered.

When is strongly convex and an indicator function of a bounded convex set, the existence of an algorithm achieving in Wasserstein and Total Variation distances was proven by Hsieh et al (hsieh2018mirrored, , Theorem 3). However, an actual algorithm is only given in a specific, although nonconvex, case. Besides, MYULA (Moreau-Yosida Unadjusted Langevin Algorithm) durmus2018efficient ; brosse2017sampling can tackle the task of sampling from efficiently. MYULA is equivalent to Langevin algorithm (1) applied to sampling from , where is the Moreau-Yosida approximation of  atchade2015moreau . By choosing the smoothing parameter appropriately, and making assumptions that allow to control the distance between and (e.g., Lipschitz or ), complexity results for MYULA were established in durmus2018efficient ; brosse2017sampling . For example, if is the indicator function of a convex body, Brosse et al brosse2017sampling show that the complexity of MYULA is in terms of the Total Variation distance (resp. 1-Wasserstein distance) if is convex and smooth (resp., if is strongly convex and smooth), provided that the algorithm is initialized from a minimizer of . Similarly to PSGLA, MYULA involves one proximal step and one gradient step per iteration. However, the support of the smoothed distribution is always (even if is not fully supported), and therefore the iterates of MYULA do not remain in the support of the target distribution , contrary to PSGLA.

Finally, the task of sampling w.r.t. , where is not smooth but finite, has also been considered.

The Perturbed Langevin Algorithm proposed by Chatterji et al chatterji2019langevin allows to sample from in the case when satisfies a weak form of smoothness (generalizing both Lipschitz continuity and smoothness) and without accessing its proximity operator. Finally, if is Lipschitz continuous, the Stochastic Proximal Langevin Algorithm proposed by Salim et al salim2019stochastic and Schechtman et alschechtman2019passty allows to sample from using cheap stochastic proximity operators only.

1.2 Contributions

In summary, the only instance of PSGLA allowing to be infinite has complexity in Total Variation bubeck2018sampling and only applies to the case where is the indicator of a convex body. In this case, assuming moreover that is strongly convex, MYULA has complexity in 1-Wasserstein distance brosse2017sampling , but allows the iterates to leave the support of . Besides, in the latter case, there exists a Langevin algorithm achieving rate in the Wasserstein distance hsieh2018mirrored .

In this paper, we consider other cases where can be infinite. More precisely, we consider a general and we assume that has a mild Sobolev regularity. Under this assumption, we prove that PSGLA has the complexity in Wasserstein distance if is strongly convex. We also show that, if is just convex, PSGLA has the complexity in terms of a newly defined duality gap, which can be seen as the notion that replaces KL, since KL can be infinite.

Our approach follows the line of works cheng2018convergence ; durmus2018analysis ; wibisono2018sampling ; bernton2018langevin ; ma2019there ; wibisono2019proximal ; vempala2019rapid that formulate the task of sampling form as the problem of minimizing the KL divergence w.r.t . In summary, our contributions are the following.

In the first part of the paper, we reformulate the task of sampling from as the resolution of a monotone inclusion defined on the space of probability measures. We subsequently use this reformulation to define a duality gap for the minimization of the KL divergence, and show that strong duality holds.

In the second part of this paper, we use this reformulation to represent PSGLA as a primal dual stochastic Forward Backward algorithm involving monotone operators.

This new representation of PSGLA, along with the strong duality result from the first part, allows us to prove new complexity results for PSGLA that extend and improve the state of the art.

Finally, we conduct some numerical experiments for sampling from a distribution supported by a half space of matrices.

In the first part we combine tools from optimization duality condatreview and optimal transport ambrosio2008gradient and in the second part we combine tools from the analysis of the Langevin algorithm durmus2018analysis , and the analysis of primal dual optimization algorithms drori2015simple ; chambolle2016ergodic .

The remainder is organized as follows. In Section 2 we provide some background knowledge on convex analysis and optimal transport. In Section 3 we develop a primal dual optimality theory for the task of sampling from . In Section 4 we give a new representation of PSGLA using monotone operators. We use it to state our main complexity result on PSGLA in Section 5. We discuss the broader impact of our work in Section 6. Numerical experiments (which are secondary as this is mainly a theoretical work) and all proofs are postponed to the appendix. Therein we also provide further intuitions on PSGLA along with an extension of PSGLA for handling a third (stochastic and proximable) term in the definition of the potential  (2).

2 Background

Throughout this paper, we use the conventions and .

2.1 Convex analysis

In this section, we recall some facts from convex analysis. These facts will be used in the proofs without mention. For more details, the reader is referred to bau17 .

2.1.1 Convex optimization

By we denote the set of proper, convex, lower semicontinuous functions . A function is -smooth if is differentiable and its gradient is -Lipschitz continuous. Consider and denote its domain. Given , a subgradient of at is any vector satisfying

(4)

for every . If the set of subgradients of at is not empty, then there exists a unique element of with minimal norm. This particular subgradient is denoted . The set valued map is called the subdifferential. The proximity operator of , denoted , is defined by

(5)

By we denote the indicator function of set given by if and if . If , where is a closed convex set, then is the orthogonal projection onto . Moreover, is the only solution to the inclusion . The Fenchel transform of is the function defined by Several properties relate to its Fenchel transform . First, the Fenchel transform of is . Then, the subdifferential is characterized by the relation Finally, is -strongly convex if and only if is -smooth.

2.1.2 Maximal monotone operators

A set valued function is monotone if whenever and . The inverse of , denoted , is defined by the relation , and the set of zeros of is . If is monotone, is maximal if its resolvent, i.e., the map , is single valued. If , then is a maximal monotone operator and . Moreover, and . If

is a skew symmetric matrix on

, the operator is maximal monotone. Finally, the sum is also a maximal monotone operator. Many problems in optimization can be cast as the problem of finding a zero of the sum of two maximal monotone operators condatreview . For instance, . To solve this problem, the Forward Backward algorithm is given by the iteration , where is a symmetric positive definite matrix (),333The operators and are not monotone in general, however they are monotone under the inner product induced by . and is single valued. Using the definition of the resolvent, the Forward Backward algorithm can equivalently be written as

(6)

2.2 Optimal transport

In this section, we recall some facts from optimal transport theory. These facts will be used in the proofs without mention. For more details, the reader is referred to Ambrosio et al ambrosio2008gradient .

2.2.1 Wasserstein distance

By we denote the -field of Lesbesgue measurable subsets of , and by the set of probability measures over

with finite second moment

. Denote the support of . The identity map belongs to the Hilbert space of -square integrable random vectors in . We denote (resp. ) the inner product (resp. the norm) in this space. Given , where is some Euclidean space, the pushforward measure of by , also called the image measure, is defined by for every . Consider . A coupling between and (we shall write ) is a probability measure over such that , where , and , where . In other words, is a random variable such that the distribution of is (we shall write ) and if and only if the distribution of is a coupling. The (2-)Wasserstein distance is then defined by

(7)

Let be the set of elements such that is absolutely continuous w.r.t.  (we shall write ). Brenier’s theorem asserts that if , then the defining is actually a achieved by a unique minimizer . Moreover, there exists a uniquely determined -almost everywhere (a.e.) map such that , where . In this case, is called the optimal pushforward from to and satisfies

(8)
2.2.2 Geodesically convex functionals

We shall consider several functionals defined on the space . For every with density denoted w.r.t. , the entropy is defined by

(9)

and if , then . Given , the potential energy is defined for every by

(10)

Finally, if such that , the Kullback-Leibler (KL) divergence is defined by

(11)

where denotes the density of w.r.t. , and if is not absolutely continuous w.r.t. . The functionals , and satisfy a form of convexity over called geodesic convexity. If is geodesically convex, then for every , , and , Given , a (Wasserstein) subgradient of at is a random variable such that for every ,

(12)

Moreover, if is a subgradient of at , then the following monotonicity property holds

(13)

If the set of subgradients of at is not empty, then there exists a unique element of with minimal norm. This particular subgradient is denoted . However, the set might be empty. A typical condition for nonemptiness requires the density to have some Sobolev regularity. For every open set , we denote the Sobolev space of -integrable functions admitting a -integrable weak gradient . We say that if for every bounded open set . Obviously, .

2.3 Assumptions on and

Consider and . We make the following assumptions.

Assumption 1.

The function is convex and -smooth. Moreover, .

Note that . We denote (resp. ) the strong convexity parameter of (resp. ), equal to zero if (resp. ) is not strongly convex.

Assumption 2.

The integral is positive and finite.

Assumption 2 is needed to define the target distribution , and implies that , where .

Lemma 1.

If Assumptions 1 and 2 hold, then

Lemma 1 implies that and using Assumption 1, . Using (rockafellar1970convex, , Theorem 25.5), is differentiable -a.e. (almost everywhere) on .

Assumption 3.

The integral is finite.

Assumption 3 is equivalent to requiring , see below. Moreover, we assume the following regularity property for the function .

Assumption 4.

The function belongs to the space .

Assumption 4 is a necessary condition for , see below. This assumption precludes

from being a uniform distribution. However, Assumption 

4 is quite general, e.g., need not be continuous or positive (see the numerical experiment section). Finally, we assume that the stochastic gradients of

have a bounded variance. Consider an abstract measurable space

, and a random variable with values in .

Assumption 5.

For every , is integrable and . Moreover, there exists such that for every , .

The last assumption implies that the stochastic gradients are unbiased: for every .

3 Primal dual optimality in Wasserstein space

Let be defined by

(14)

Using Lemma 1, and are finite real numbers. Moreover, using (durmus2018analysis, , Lemma 1.b), for every such that , we have the identity

(15)

Equation (15) says that is the unique minimizer of :

3.1 Subdifferential calculus

The following result is a consequence of (ambrosio2008gradient, , Theorem 10.4.13).

Theorem 2.

Let be an element of . Then, and . Moreover, if and only if and there exists such that

(16)

for -a.e. . In this case, .

If Assumptions 1 and 2 hold, then using Lemma 1. Then, Theorem 2 implies that . Therefore, using (rockafellar1970convex, , Theorem 25.5), and are -a.s. differentiable, and .

Moreover, applying Theorem 2 with and , if and only if and for some . Using Assumption 4 and setting , we obtain that -a.e. Therefore, satisfies

(17)

Equation (17) can be seen as the first order optimality conditions associated with the minimization of the functional . Consider the "dual" variable defined a.e. Using Assumption 3 and , . We can express the first order optimality condition (17) as , a.e. Besides, , therefore using Denote and . The relationship between and can be summarized as

(18)

In the sequel, we fix the probability space , denote the mathematical expectation and the space . The expression "almost surely" (a.s.) will be understood w.r.t. . Recall that is the map and . Using Assumption 3, , , , and a.s.

3.2 Lagrangian function and duality gap

We introduce the following Lagrangian function defined for every and by

(19)

where . We also define the duality gap by

(20)

The next theorem, which is of independent interest, can be interpreted as a strong duality result for the Lagrangian function , see (rockafellar1970convex, , Lemma 36.2).

Theorem 3 (Strong duality).

Let Assumptions 14 hold true. Then, for every , and . Moreover, is a saddle point of with saddle value , i.e.,

(21)

Finally, if and only if , and, if is strictly convex, if and only if .

The proof of Theorem 3 relies on using (18) to write the duality gap as the sum of the Bregman divergences of , and . We shall use the nonnegativity of the duality gap to derive convergence bounds for PSGLA.

4 Forward Backward representation of PSGLA

In this section, we present our viewpoint on PSGLA (3). More precisely, we represent PSGLA as a (stochastic) Forward Backward algorithm involving (stochastic) monotone operators which are not necessarily subdifferentials.

Intuition.

Let and consider the set valued maps

(22)

where . The maps and satisfy a monotonicity property similar to (13) (note that is a maximal monotone operator as the sum of and the subdifferential of the function ). Inclusion (18) can be rewritten as

(23)
Rigorous Forward Backward representation.

The “monotone” inclusion (23) intuitively suggests the following stochastic Forward Backward algorithm for obtaining samples from (and hence from by marginalizing):

(24)
(25)

Above, is an appropriately chosen matrix. Indeed, Algorithm (24)-(25) is a stochastic Forward-Backward algorithm rosasco2016stochastic ; com-pes-pafa16 ; bia-hac-16 ; bia-hac-sal-jca17 where the gradient is perturbed by a Gaussian vector, as in the Langevin algorithm (1). In Algorithm (24)-(25), we cannot set to be the identity map of because the inclusion (25) is intractable in this case. We take , i.e., with our notations, . Although the matrix is only semi-definite positive, the next lemma shows that Algorithm (24)-(25) is still well defined. More precisely, the next lemma shows that (by taking in the lemma) and hence the resulting algorithm (24)-(25) is PSGLA. Based on the representation (24)-(25) of PSGLA, the next lemma also provides an important inequality used later in the proof of Theorem 5.

Lemma 4.

Let . Then if and only if and . Moreover, if is -smooth, then

(26)

5 Main results

We now provide our main result on PSGLA (3). For , denote (resp. ) the distribution of (resp. ), defined in the previous section.

Theorem 5.

Let Assumptions 12, 3 and 5 hold true. If is -strongly convex and is -smooth, then for every ,

(27)

where and where .

The proof of Theorem 5 relies on using Lemma 4 along with (durmus2018analysis, , Lemma 30). Inspecting the proof of Theorem 5, one can see that any can replace 444The proof does not rely on specific properties of the latter like being primal dual optimal.. The situation is similar to primal dual algorithms in optimization chambolle2016ergodic ; drori2015simple and Evolution Variational Inequalities in optimal transport ambrosio2008gradient .

The next corollary is obtained by using (Theorem 3) and iterating (5).

Corollary 6.

Let Assumptions 15 hold true. If , then

(28)
(29)

Finally, if , then

(30)

If is Lipschitz continuous (in particular if ), then our Assumptions hold true. Moreover, inequality (28) recovers (durmus2018analysis, , Corollary 18) but with the duality gap instead of the KL divergence. Obtaining a result in terms of KL divergence is hopeless for PSGLA in general because the KL divergence is infinite; see the appendix. To the best of our knowledge, inequality (29) that holds when is just convex is new, even in the particular case where is Lipschitz. Corollary 6 implies the following complexity results. Given , choosing and in inequality (28) leads to . If (i.e., if is smooth), choosing and in inequality (29) leads to . Finally, if (i.e., if is strongly convex), choosing and i.e.,

(31)

in inequality (30), leads to . In the case where is -Lipschitz continuous, the complexity (31) improves (durmus2018analysis, , Corollary 22) since .

6 Broader impact

We made a step towards theoretical understanding the properties of the Langevin algorithm in the elusive case where the target distribution is not smooth and not fully supported. The case is known to be difficult to analyze and has many applications brosse2017sampling ; durmus2018efficient ; bubeck2018sampling . Our analysis improves and extends the state of the art.

Moreover, our approach is new. We developed a primal dual theory for a minimization problem over the Wasserstein space, which is of independent interest and may inspire other researchers to push the field further. A broader duality theory for minimization problems in the Wasserstein space would be of practical and theoretical interest.

References

Appendix A Numerical experiments

In this section, we illustrate our results on PSGLA through numerical experiments.

Sampling a posteriori.

We consider a statistical framework where i.i.d. random vectors (data) with distribution are observed. We adopt a Bayesian strategy where we assume the distribution to be indexed by a random vector with values in . Denote the density of (a.k.a. the likelihood function) w.r.t. some reference measure. Given a prior distribution for with density w.r.t. , our goal is construct samples from the posterior distribution

(32)

in order e.g.

to estimate the mean

a posteriori via Monte Carlo approximations,

(33)
Wishart distribution.

In the experiments, the Euclidean space is a spac