# Stochastic Proximal Langevin Algorithm: Potential Splitting and Nonasymptotic Rates

We propose a new algorithm, the Stochastic Proximal Langevin Algorithm (SPLA), for sampling from a log-concave distribution. Our method is a generalization of the Langevin algorithm to potentials expressed as the sum of one stochastic smooth term and multiple stochastic nonsmooth terms. In each iteration, our splitting technique only requires access to a stochastic gradient of the smooth term and a stochastic proximal operator for each of the nonsmooth terms. We establish nonasymptotic sublinear and linear convergence rates under convexity and strong convexity of the smooth term, respectively, expressed in terms of the KL divergence and Wasserstein distance. We illustrate the efficiency of our sampling technique through numerical simulations on a Bayesian learning task.


### 1 Introduction

Many applications in the field of Bayesian machine learning require sampling from a probability distribution μ⋆ over ℝ^d with density proportional to exp(−U(x)). Due to their scalability, Markov chain Monte Carlo (MCMC) methods such as Langevin Monte Carlo (47) or Hamiltonian Monte Carlo (27) are popular algorithms to solve such problems. Monte Carlo methods typically generate a sequence of random variables (x_k) with the property that the distribution of x_k approaches μ⋆ as k grows.

While the theory of MCMC algorithms has remained mainly asymptotic, in recent years the exploration of nonasymptotic properties of such algorithms has led to a renaissance in the field (13; 25; 38; 14; 15; 18; 21; 11). In particular, if the potential U is a smooth convex function, (13; 18) provide explicit convergence rates for the Langevin algorithm (LA)

 x_{k+1} = x_k − γ∇U(x_k) + √(2γ) W_k, (1)

where γ > 0 is a stepsize and (W_k) is a sequence of i.i.d. standard Gaussian random variables. The function U, also called the potential, enters the algorithm through its gradient.
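For intuition, one LA iteration (1) is a gradient step plus Gaussian noise. The sketch below runs it on the stand-in potential U(x) = ‖x‖²/2 (so ∇U(x) = x and the target is the standard Gaussian); this example is ours and not from the paper:

```python
import numpy as np

def langevin_step(x, grad_U, gamma, rng):
    """One iteration of the Langevin algorithm (1):
    x_{k+1} = x_k - gamma * grad U(x_k) + sqrt(2*gamma) * W_k."""
    noise = rng.standard_normal(x.shape)
    return x - gamma * grad_U(x) + np.sqrt(2 * gamma) * noise

# Stand-in potential U(x) = ||x||^2 / 2, whose gradient is x and
# whose associated target distribution is the standard Gaussian.
rng = np.random.default_rng(0)
x = np.zeros(3)
for _ in range(1000):
    x = langevin_step(x, lambda v: v, gamma=0.1, rng=rng)
```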

In this work we study the setting where the potential is the sum of a single smooth function and a potentially large number of nonsmooth convex functions. In particular, we consider the problem

 Sample from μ⋆(x) ∝ exp(−U(x)), where U(x) := F(x) + ∑_{i=1}^n G_i(x), (2)

where F is a smooth convex function and G_1, …, G_n are (possibly nonsmooth) convex functions. The additive model for U offers ample flexibility: typically there are multiple decompositions of U in the form (2), and the user is then able to choose the one that fits any given situation best.

### 2 Contributions

We now briefly comment on some of the key contributions of this work.

A splitting technique for the Langevin algorithm. We propose a new variant of LA for solving (2), which we call the Stochastic Proximal Langevin Algorithm (SPLA). We assume that F and the G_i can be written as expectations over some simpler functions f and g_i:

 F(x) = E_ξ(f(x, ξ)), and G_i(x) = E_ξ(g_i(x, ξ)). (3)

SPLA (see Algorithm 1 in Section 4) only requires access to stochastic gradients of F and to stochastic proximity operators of the functions G_i. SPLA can be seen as a Langevin version of the stochastic Passty algorithm (29; 35). To the best of our knowledge, this is the first time a splitting technique that involves multiple (stochastic) proximity operators is used in a Langevin algorithm.

Remarks: Current forms of LA tackle problem (2) using stochastic subgradients (17). If n = 1 and G_1 is proximable (i.e., the learner has access to the full proximity operator of G_1), it has recently been proposed to use proximity operators instead of (sub)gradients (19; 17), as is done in the optimization literature (28; 2). Indeed, in this case, the proximal stochastic gradient method is an efficient method to minimize U. If n > 1 and the functions G_i are proximable, the minimization of U is usually tackled using the operator splitting framework: the (stochastic) three-operator splitting (50; 16) or (stochastic) primal-dual algorithms (12; 45; 9; 34). These algorithms involve the computation of (stochastic) gradients and (full) proximity operators, and enjoy numerical stability properties. However, proximity operators are sometimes difficult to implement. In this case, stochastic proximity operators are known to be cheaper than full proximity operators and numerically more stable than stochastic subgradients for handling nonsmooth (31; 30; 4; 5; 6) but also smooth (42) terms. In this paper, we bring together the advantages of operator splitting and stochastic proximity operators for sampling purposes.

Theory. We perform a nonasymptotic convergence analysis of SPLA. Our main result, Theorem 1, gives a tractable recursion involving the Kullback-Leibler divergence and the Wasserstein distance (when F is strongly convex) between μ⋆ and probability measures associated with certain samples generated by our method. We use this result to show that the KL divergence drops below ε after O(1/ε²) iterations if a constant stepsize of order ε is used (Corollary 2). Assuming F is α-strongly convex with α > 0, we show that the Wasserstein distance (resp. the KL divergence) decreases exponentially, up to an oscillation region whose size is controlled by the stepsize, as shown in Corollary 3 (resp. Corollary 4). If we wish to push the Wasserstein distance (resp. the KL divergence) below ε, this can be achieved by choosing a stepsize of order ε, and it is then sufficient to take O(log(1/ε)/ε) iterations. These results are summarized in Table 1. The obtained convergence rates match the previously known results obtained in simpler settings (17). Note that convergence rates of optimization methods involving multiple stochastic proximity operators have not been established yet.

Remarks: Our proof technique is inspired by (37), which is itself based on (17). In (37), the authors consider a more restricted setting and assume that the smooth function is proximable. In (17), a proximal stochastic (sub)gradient Langevin algorithm is studied. In this paper, convergence rates are established by showing that the probability distributions of the iterates shadow a certain discretized gradient flow defined on a measure space. Hence, our work is a contribution to recent efforts to understand the Langevin algorithm as an optimization algorithm in a space of probability measures (25; 48; 3).

Online setting. In online settings, U is unknown but revealed across time. Our approach provides a reasonable algorithm for such situations, especially in cases when the information revealed about U is stationary in time. In particular, this includes online Bayesian learning with structured priors or nonsmooth log likelihoods (49; 22; 39; 46). In this context, the learner is required to sample from some posterior distribution that takes the form (2), where F and the G_i are intractable. However, these functions can be cheaply sampled, or are revealed across time through i.i.d. streaming data.

Simulations. We illustrate the promise of our approach numerically by performing experiments with SPLA applied to sampling from the posterior distribution in the Graph Trend Filtering context (46). On this nonsmooth large-scale simulation problem, SPLA performs better than the state-of-the-art method that uses stochastic subgradients instead of stochastic proximity operators. Indeed, in the optimization literature (2), proximity operators are already known to be more stable than subgradients.

### 3 Technical Preliminaries

In this section, we recall certain notions from convex analysis and probability theory which are key to the developments in this paper, state our main assumptions, and introduce the needed notation.

#### 3.1 Subdifferential, minimal section and proximity operator

Given a convex function g : ℝ^d → ℝ, its subdifferential at x, ∂g(x), is the set

 ∂g(x) := {d ∈ ℝ^d : g(x) + ⟨d, y − x⟩ ≤ g(y), ∀y ∈ ℝ^d}. (4)

Since ∂g(x) is a nonempty closed convex set (2), the projection of 0 onto ∂g(x) (the least norm element in this set) is well defined, and we call this element ∇₀g(x). The function ∇₀g is called the minimal section of ∂g. The proximity operator associated with g is the mapping prox_g : ℝ^d → ℝ^d defined by

 prox_g(x) := argmin_{y∈ℝ^d} { (1/2)‖x − y‖² + g(y) }. (5)

Due to its implicit definition, prox_g(x) can be hard to evaluate.
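For many simple functions, however, the minimization in (5) can be solved in closed form. A classical example (ours, not from the paper) is g(y) = λ|y|, whose proximity operator is componentwise soft-thresholding; the brute-force check below verifies it directly against definition (5):

```python
import numpy as np

def prox_l1(x, lam):
    """Proximity operator (5) of g(y) = lam * |y| (componentwise):
    argmin_y { 0.5*(x - y)^2 + lam*|y| } = soft-thresholding of x."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Check against definition (5) by brute force on a fine grid.
x0, lam = 1.3, 0.5
grid = np.linspace(-3, 3, 100001)
brute = grid[np.argmin(0.5 * (x0 - grid) ** 2 + lam * np.abs(grid))]
assert abs(prox_l1(x0, lam) - brute) < 1e-3  # prox_l1(1.3, 0.5) ≈ 0.8
```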

#### 3.2 Stochastic structure of F and Gi: integrability, smoothness and convexity

Here we detail the assumptions behind the stochastic structure (3) of the functions F and G_i defining the potential U. Let (Ω, F, P) be a probability space, and denote E the mathematical expectation and V the variance. Consider a random variable ξ from Ω to another probability space, with a given distribution.

###### Assumption 1 (Integrability).

The functions f(x, ·) and g_i(x, ·), i = 1, …, n, are integrable for every x ∈ ℝ^d.

Furthermore, we will make the following convexity and smoothness assumptions.

###### Assumption 2 (Convexity and differentiability).

The function f(·, s) is convex and differentiable for every s. The functions g_i(·, s), i = 1, …, n, are convex for every s.

The gradient of f(·, s) is denoted ∇f(·, s), the subdifferential of g_i(·, s) is denoted ∂g_i(·, s), and its minimal section is denoted ∇₀g_i(·, s). Under Assumption 2, it is known that F is convex and differentiable, and that ∇F(x) = E(∇f(x, ξ)) (33). Next, we assume that F is smooth and α-strongly convex. However, we allow α = 0, in which case F need not be strongly convex. We will only assume that α > 0 in Corollaries 3 and 4.

###### Assumption 3 (Convexity and smoothness of F).

The gradient of F is L-Lipschitz continuous, where L > 0. Moreover, F is α-strongly convex, where α ≥ 0.

Under Assumption 2, the second part of the above holds with α = 0. Finally, we introduce two noise conditions on the stochastic (sub)gradients of F and the G_i.

###### Assumption 4 (Bounded variance of ∇f(x,⋅)).

There exists σ_F ≥ 0 such that V(∇f(x, ξ)) ≤ σ_F² for every x ∈ ℝ^d.

###### Assumption 5 (Bounded second moment of ∇0gi(x,⋅)).

For every i ∈ {1, …, n}, there exists L_i ≥ 0 such that E(‖∇₀g_i(x, ξ)‖²) ≤ L_i² for every x ∈ ℝ^d.

Note that if g_i(·, s) is L_i(s)-Lipschitz continuous for every s, and if L_i(·) is square integrable, then Assumption 5 holds.

#### 3.3 KL divergence, entropy and potential energy

Recall from (2) that U = F + ∑_{i=1}^n G_i, and assume that exp(−U) is Lebesgue integrable. Our goal is to sample from the unique distribution μ⋆ over ℝ^d with density (w.r.t. the Lebesgue measure, denoted dx) proportional to exp(−U(x)), for which we write μ⋆ ∝ exp(−U). The closeness between the samples of our algorithm and the target distribution μ⋆ will be evaluated in terms of information-theoretic and optimal-transport-theoretic quantities.

Let B(ℝ^d) be the Borel σ-field of ℝ^d. Given two nonnegative measures μ and ν on ℝ^d, we write μ ≪ ν if μ is absolutely continuous w.r.t. ν, and denote dμ/dν its density. The Kullback-Leibler (KL) divergence between μ and ν, KL(μ|ν), quantifies the closeness between μ and ν. If μ ≪ ν, then the KL divergence is defined by

 KL(μ|ν) := ∫ log( (dμ/dν)(x) ) dμ(x), (6)

and otherwise we set KL(μ|ν) := +∞. Up to an additive constant, KL(·|μ⋆) can be seen as the sum of two terms (36): the entropy H and the potential energy EU. The entropy of μ is given by H(μ) := ∫ log( (dμ/dx)(x) ) dμ(x), and the potential energy of μ is defined by EU(μ) := ∫ U(x) dμ(x).

#### 3.4 Wasserstein distance

Although the KL divergence KL(μ|ν) is equal to zero if and only if μ = ν, it is not a mathematical distance (metric). The Wasserstein distance, defined below, metrizes the space of probability measures over ℝ^d with a finite second moment. Consider two such measures μ and ν. A transference plan of (μ, ν) is a probability measure υ over ℝ^d × ℝ^d with marginals μ and ν: for every A ∈ B(ℝ^d), υ(A × ℝ^d) = μ(A) and υ(ℝ^d × A) = ν(A). In particular, the product measure μ ⊗ ν is a transference plan. We denote Γ(μ, ν) the set of transference plans. A coupling of (μ, ν) is a random variable (X, Y) over some probability space with values in ℝ^d × ℝ^d (i.e., X and Y are random variables with values in ℝ^d) such that the distribution of X is μ and the distribution of Y is ν. In other words, (X, Y) is a coupling of (μ, ν) if the distribution of (X, Y) is a transference plan of (μ, ν). The Wasserstein distance of order two between μ and ν is defined by

 W₂(μ, ν) := inf { ∫_{ℝ^d×ℝ^d} ‖x − y‖² dυ(x, y) : υ ∈ Γ(μ, ν) }. (7)

One can see that W₂(μ, ν) = inf E(‖X − Y‖²), where the inf is taken over all couplings (X, Y) of (μ, ν) defined on some probability space with expectation E.
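For intuition, on the real line the optimal transference plan between two empirical measures with the same number of atoms matches sorted samples monotonically, so the quantity in (7) reduces to an average of squared gaps. A small illustration (ours, not from the paper):

```python
import numpy as np

def w2_squared_1d(xs, ys):
    """Squared order-2 Wasserstein distance between two empirical
    measures on R with equally many atoms: in one dimension the
    optimal transference plan matches sorted samples monotonically."""
    xs, ys = np.sort(xs), np.sort(ys)
    return np.mean((xs - ys) ** 2)

a = np.array([0.0, 1.0, 2.0])
b = np.array([1.0, 2.0, 3.0])
d = w2_squared_1d(a, b)  # every atom is transported by exactly 1
```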

### 4 The SPLA Algorithm and its Convergence Rate

#### 4.1 The algorithm

To solve the sampling problem (2), our Stochastic Proximal Langevin Algorithm (SPLA) generates a sequence of random variables (x_k) with values in ℝ^d defined as follows:

 z_k = x_k − γ∇f(x_k, ξ_k)
 y_k^0 = z_k + √(2γ) W_k (8)
 y_k^i = prox_{γ g_i(·, ξ_k)}(y_k^{i−1}) for i = 1, …, n
 x_{k+1} = y_k^n, (9)

where (W_k) is a sequence of i.i.d. standard Gaussian random variables, (ξ_k) is a sequence of i.i.d. copies of ξ, and γ > 0 is a positive step size. Our SPLA method is formalized as Algorithm 1; its steps are explained therein.
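One iteration of (8)-(9) can be sketched as follows; `grad_f` and the list of proximity operators are problem-specific placeholders to be supplied by the user, and the way ξ_k is drawn below is only illustrative:

```python
import numpy as np

def spla_step(x, grad_f, prox_list, gamma, rng):
    """One SPLA iteration, following (8)-(9).

    grad_f(x, xi): stochastic gradient of the smooth term f(., xi).
    prox_list: callables prox_i(y, gamma, xi) implementing the
               proximity operator of gamma * g_i(., xi).
    """
    xi = rng.integers(10**6)          # illustrative stand-in for xi_k
    z = x - gamma * grad_f(x, xi)     # stochastic gradient step: z_k
    y = z + np.sqrt(2 * gamma) * rng.standard_normal(x.shape)  # y_k^0
    for prox in prox_list:            # y_k^i = prox_{gamma g_i}(y_k^{i-1})
        y = prox(y, gamma, xi)
    return y                          # x_{k+1} = y_k^n

# Toy run: F(x) = ||x||^2/2 and a single l1 term handled by its prox
# (soft-thresholding), i.e. the potential ||x||^2/2 + 0.1*||x||_1.
rng = np.random.default_rng(1)
soft = lambda y, g, xi: np.sign(y) * np.maximum(np.abs(y) - 0.1 * g, 0.0)
x = np.zeros(2)
for _ in range(500):
    x = spla_step(x, lambda v, xi: v, [soft], gamma=0.05, rng=rng)
```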

#### 4.2 Main theorem

We now state our main results in terms of the Kullback-Leibler divergence and the Wasserstein distance. For every random variable X defined on Ω, we denote μ_X its distribution.

###### Theorem 1.

Let Assumptions 1–5 hold and assume that γ ≤ 1/L. Then there exists a constant C ≥ 0 such that

 2γ KL(μ_{y_k^0} | μ⋆) ≤ (1 − γα) W₂(μ_{x_k}, μ⋆) − W₂(μ_{x_{k+1}}, μ⋆) + γ²(2σ_F² + 2Ld + C). (10)

The constant C can be expressed as a linear combination of L_1², …, L_n² with integer coefficients. Sharper expressions for C are available when n = 1 and, more generally, when each g_i(x, ·) admits almost surely a representation as a sum of independent components.

###### Proof.

A full proof can be found in the Supplementary material. We only sketch the main steps here. For every μ-integrable function φ : ℝ^d → ℝ, we denote Eφ(μ) := ∫ φ dμ. Moreover, we denote F(μ) := EU(μ) + H(μ). First, using (17, Lemma 1), EU(μ⋆) and H(μ⋆) are finite, and if μ ≪ μ⋆, then

 KL(μ|μ⋆) = EU(μ) + H(μ) − (EU(μ⋆) + H(μ⋆)) = F(μ) − F(μ⋆),

provided that EU(μ) and H(μ) are finite. Then, we decompose F(μ) = EF(μ) + ∑_{i=1}^n EG_i(μ) + H(μ), where we used U = F + ∑_i G_i. Using (17) again, we can establish the inequality

 2γ[H(μ_{y_k^0}) − H(μ⋆)] ≤ W₂(μ_{z_k}, μ⋆) − W₂(μ_{y_k^0}, μ⋆). (11)

Then, for every random variable a with distribution μ⋆, we obtain

 E[‖z_k − a‖²] ≤ (1 − γα) E[‖x_k − a‖²] + 2γ[EF(μ⋆) − EF(μ_{z_k})] + 2γ²σ_F², (12)

using standard computations for the stochastic gradient descent algorithm. Using the smoothness of F and the definition of the Wasserstein distance, this implies

 2γ[EF(μ_{y_k^0}) − EF(μ⋆)] ≤ (1 − γα) W₂(μ_{x_k}, μ⋆) − W₂(μ_{z_k}, μ⋆) + γ²(2σ_F² + 2Ld). (13)

It remains to control the terms involving the EG_i, which is the main technical challenge of the proof. This is done using the frameworks of Yosida approximations of random subdifferentials and Moreau regularizations of random convex functions (2). Equation (10) is obtained by summing the obtained inequalities. ∎

Equation (10) is reminiscent of the fact that SPLA shadows the gradient flow of F in the space of probability measures equipped with the Wasserstein distance. To see this, first consider the gradient flow associated to F. By definition, it is the flow of the differential equation (7)

 (d/dt) x(t) = −∇F(x(t)), t > 0. (14)

The function x(·) can alternatively be defined as a solution of the variational inequalities

 2(F(x(t)) − F(a)) ≤ −(d/dt) ‖x(t) − a‖², t > 0, ∀a ∈ ℝ^d. (15)

The iterates (u_k) of the stochastic gradient descent (SGD) algorithm applied to F can be seen as a (noisy) Euler discretization of (14) with a step size γ. This idea has been used successfully in the stochastic approximation literature (32; 23). The analogy goes further, since a fundamental inequality used to analyze SGD applied to F is (see (26))

 2γ E(F(u_{k+1}) − F(a)) ≤ E‖u_k − a‖² − E‖u_{k+1} − a‖² + γ²K, k ≥ 0, (16)

where K is some constant; this can be seen as a discrete counterpart of (15). Note that this inequality is similar to (12), which is used in the proof of Theorem 1.

In optimal transport theory, the point of view of (15) is used to define the gradient flow of a (geodesically) convex functional F defined on the space of probability measures (see (36) or (1, Page 280)). Indeed, the gradient flow (ν_t) of F in this space satisfies, for every μ,

 2(F(ν_t) − F(μ)) ≤ −(d/dt) W₂(ν_t, μ), (17)

which can be seen as a continuous-time counterpart of Equation (10) by setting μ = μ⋆. Furthermore, Equation (11) in the proof of Theorem 1 is also related to (17): it is obtained by applying Equation (17) with F replaced by the entropy H (see e.g. (17, Lemma 4)).

#### 4.3 Explicit convergence rates for convex and strongly convex F

Corollaries 2, 3 and 4 below are obtained by unrolling the recursion provided by Theorem 1. The convergence rates are summarized in Table 1.

###### Corollary 2 (Convex F).

Consider a sequence (j_k) of independent random variables such that j_k is independent of the iterates of the algorithm, and the distribution of j_k is uniform over {0, …, k}. Denote x̂_k := y_{j_k}^0. If γ ≤ 1/L, then

 KL(μ_{x̂_k} | μ⋆) ≤ (1/(2γ(k+1))) W₂(μ_{x_0}, μ⋆) + (γ/2)(2σ_F² + 2Ld + C). (18)

Hence, given any ε > 0, choosing the stepsize γ = min{1/L, ε/(2σ_F² + 2Ld + C)} and a number of iterations

 k + 1 ≥ max{ L/ε, (2σ_F² + 2Ld + C)/ε² } W₂(μ_{x_0}, μ⋆), (19)

implies KL(μ_{x̂_k} | μ⋆) ≤ ε.

###### Corollary 3 (Strongly convex F).

If α > 0 and γ ≤ 1/L, then

 W₂(μ_{x_k}, μ⋆) ≤ (1 − γα)^k W₂(μ_{x_0}, μ⋆) + γ(2σ_F² + 2Ld + C)/α. (20)

Hence, given any ε > 0, choosing the stepsize γ = min{1/L, εα/(2(2σ_F² + 2Ld + C))} and a number of iterations

 k ≥ max{ L/α, 2(2σ_F² + 2Ld + C)/(εα²) } log( 2W₂(μ_{x_0}, μ⋆)/ε ), (21)

implies W₂(μ_{x_k}, μ⋆) ≤ ε.

###### Corollary 4 (Strongly convex F).

Consider a sequence (j_k) of independent random variables such that j_k is independent of the iterates of the algorithm. Assume that the distribution of j_k is geometric over {0, …, k}:

 P(j_k = r) ∝ (1 − γα)^{−r}, r ∈ {0, …, k}.

Denote x̃_k := y_{j_k}^0. If α > 0 and γ ≤ 1/L, then

 KL(μ_{x̃_k} | μ⋆) ≤ (α W₂(μ_{x_0}, μ⋆)/2) · (1 − γα)^{k+1}/(1 − (1 − γα)^{k+1}) + γ(2σ_F² + 2Ld + C)/2. (22)

Hence, given any ε > 0, choosing the stepsize γ = min{1/L, ε/(2σ_F² + 2Ld + C)} and a number of iterations

 k ≥ max{ L/α, (2σ_F² + 2Ld + C)/(εα²) } log( 2 max{1, W₂(μ_{x_0}, μ⋆)/ε} ),

implies KL(μ_{x̃_k} | μ⋆) ≤ ε.

### 5 Application to Trend Filtering on Graphs

In this section we consider the following Bayesian point of view of trend filtering on graphs (41). Consider a finite undirected graph G = (V, E), where V is the set of vertices and E is the set of edges. Denote |V| the cardinality of V and |E| the cardinality of E. A realization of a random vector Y is observed. In a Bayesian framework, the distribution of Y is parametrized by a vector x ∈ ℝ^V which is itself random and whose distribution is proportional to exp(−λ TV(x, G)), where λ > 0 is a scaling parameter and where, for every x ∈ ℝ^V,

 TV(x, G) = ∑_{i,j∈V, {i,j}∈E} |x(i) − x(j)|,

is the Total Variation regularization over G. The goal is to learn x after an observation of Y. The paper (46) considers the case where the distribution of Y given x (a.k.a. the likelihood) is proportional to exp(−‖Y − x‖²/(2σ²)), where σ > 0 is another scaling parameter. In other words, the distribution of Y given x is N(x, σ²I), a normal distribution centered at x with variance σ²I (where I is the identity matrix). Denoting

 π(x|y) ∝ exp(−U(x)), U(x) = (1/(2σ²))‖x − y‖² + λ TV(x, G), (23)

the posterior distribution of x given Y = y, the maximum a posteriori estimator in this Bayesian framework is called the Graph Trend Filtering estimate (46). It can be written

 x⋆ = argmax_{x∈ℝ^V} π(x|Y) = argmin_{x∈ℝ^V} (1/(2σ²))‖x − Y‖² + λ TV(x, G).

Although maximum a posteriori estimators carry some information, they are not able to capture uncertainty in the learned parameters. Samples a posteriori provide a better understanding of the posterior distribution and allow one to compute other Bayesian estimates, such as confidence intervals. This helps to avoid overfitting, among other things. In our context, sampling a posteriori requires sampling from the target distribution π(·|Y).

In the case where G is a 2D grid (which can be identified with an image), the proximity operator of TV(·, G) can be computed using a subroutine (8), and the proximal stochastic gradient Langevin algorithm can be used to sample from π(·|Y) (19; 17). However, on a large/general graph, the proximity operator of TV(·, G) is hard to evaluate (40; 35). Since TV is written as a sum over edges, we shall rather select a batch of random edges and compute the proximity operators over these randomly chosen edges. More precisely, we write the potential U defining π(·|Y) in the form (2) by setting

 U(x) = F(x) + ∑_{i=1}^n G_i(x), F(x) = (1/(2σ²))‖x − Y‖², G_i(x) = (λ|E|/n) E_{e_i}(|x(v_i) − x(w_i)|), (24)

where for every i ∈ {1, …, n}, e_i = {v_i, w_i} is a uniformly random edge and the e_i are independent. The SPLA applied to sample from π(·|Y) is presented as Algorithm 2.
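Each G_i in (24) involves a single edge, so its stochastic proximity operator has a simple closed form: it moves the two endpoint values toward each other and leaves all other coordinates of x untouched. A sketch under this assumption (function and variable names are ours, not the paper's Algorithm 2):

```python
import numpy as np

def prox_edge(x, v, w, c):
    """Prox of x -> c*|x(v) - x(w)|: the difference x(v)-x(w) is
    soft-thresholded by 2c, i.e. both endpoints move toward each
    other by t = min(c, |x(v)-x(w)|/2); the sum x(v)+x(w) and all
    other coordinates are unchanged."""
    y = x.copy()
    d = x[v] - x[w]
    t = np.sign(d) * min(c, abs(d) / 2.0)
    y[v] -= t
    y[w] += t
    return y

x = np.array([3.0, 0.0, 1.0])
y = prox_edge(x, 0, 2, 0.5)  # shrinks the gap x[0]-x[2] from 2.0 to 1.0
```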

In our simulations, we plot the relevant functionals as functions of CPU time while running the algorithms. The parameters σ and λ are chosen such that the log likelihood term and the Total Variation regularization term have the same weight. The functionals are estimated using five random realizations of each iterate (the entropy is estimated using a kernel density estimator). The batch parameter n is held constant throughout. We consider cases where Y has a standard Gaussian distribution and cases where half of the components of Y are standard Gaussians and half are equal to zero (this corresponds to the graph inpainting task (10)). SPLA and SSLA are always simulated with the same step size.

As expected, the numerical experiments show the advantage of using stochastic proximity operators instead of stochastic subgradients. It is a standard fact that proximity operators handle ℓ₁-norm terms better than subgradients do (2). Our figures show that stochastic proximity operators are numerically more stable than the alternatives (42). Our figures also show the advantage of stochastic methods (SSLA or SPLA) over deterministic ones for large-scale problems. SSLA and SPLA produce iterates about one hundred times more frequently than ProxLA, and are faster in the first iterations.

### 6 Acknowledgements

The first author is grateful to Sholom Schechtman for his help in the numerical experiments.

### References

• (1) L. Ambrosio, N. Gigli, and G. Savaré. Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Springer Science & Business Media, 2008.
• (2) H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC. Springer, New York, 2011.
• (3) E. Bernton. Langevin Monte Carlo and JKO splitting. arXiv preprint arXiv:1802.08671, 2018.
• (4) P. Bianchi. Ergodic convergence of a stochastic proximal point algorithm. SIAM Journal on Optimization, 26(4):2235–2260, 2016.
• (5) P. Bianchi and W. Hachem. Dynamical behavior of a stochastic Forward-Backward algorithm using random monotone operators. Journal of Optimization Theory and Applications, 171(1):90–120, 2016.
• (6) P. Bianchi, W. Hachem, and A. Salim. A constant step Forward-Backward algorithm involving random maximal monotone operators. Journal of Convex Analysis, 2018.
• (7) H. Brézis. Opérateurs maximaux monotones et semi-groupes de contractions dans les espaces de Hilbert. North-Holland mathematics studies. Elsevier Science, Burlington, MA, 1973.
• (8) A. Chambolle, V. Caselles, D. Cremers, M. Novaga, and T. Pock. An introduction to total variation for image analysis. Theoretical foundations and numerical methods for sparse recovery, 9:263–340, 2010.
• (9) A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.
• (10) S. Chen, A. Sandryhaila, G. Lederman, Z. Wang, J. MF Moura, P. Rizzo, J. Bielak, J. H Garrett, and J. Kovačević. Signal inpainting on graphs via total variation minimization. In International Conference on Acoustics, Speech, and Signal Processing, pages 8267–8271, 2014.
• (11) X. Cheng, N. S Chatterji, P. L Bartlett, and M. I Jordan. Underdamped Langevin MCMC: A non-asymptotic analysis. arXiv preprint arXiv:1707.03663, 2017.
• (12) L. Condat. A primal-dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms. Journal of Optimization Theory and Applications, 158(2):460–479, 2013.
• (13) A. S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.
• (14) A. S Dalalyan and A. Karagulyan. User-friendly guarantees for the langevin monte carlo with inaccurate gradient. Stochastic Processes and their Applications, 2019.
• (15) A. S Dalalyan and L. Riou-Durand. On sampling from a log-concave density using kinetic Langevin diffusions. arXiv preprint arXiv:1807.09382, 2018.
• (16) D. Davis and W. Yin. A three-operator splitting scheme and its optimization applications. Set-Valued and Variational Analysis, 25(4):829–858, 2017.
• (17) A. Durmus, S. Majewski, and B. Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. arXiv preprint arXiv:1802.09188, 2018.
• (18) A. Durmus and E. Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27(3):1551–1587, 2017.
• (19) A. Durmus, E. Moulines, and M. Pereyra. Efficient Bayesian computation by proximal Markov Chain Monte Carlo: when Langevin meets Moreau. SIAM Journal on Imaging Sciences, 11(1):473–506, 2018.
• (20) E. Hazan, K. Y Levy, and S. Shalev-Shwartz. On graduated optimization for stochastic non-convex problems. In International conference on machine learning, pages 1833–1841, 2016.
• (21) Y.-P. Hsieh, A. Kavis, P. Rolland, and V. Cevher. Mirrored Langevin dynamics. In Advances in Neural Information Processing Systems, pages 2878–2887, 2018.
• (22) S. Kotz, T. Kozubowski, and K. Podgorski. The Laplace distribution and generalizations: a revisit with applications to communications, economics, engineering, and finance. Springer Science & Business Media, 2012.
• (23) H. J. Kushner and G. G. Yin. Stochastic approximation and recursive algorithms and applications, volume 35 of Applications of Mathematics (New York). Springer-Verlag, New York, second edition, 2003. Stochastic Modelling and Applied Probability.
• (24) Jure L. and Andrej K. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
• (25) Y.-A. Ma, N. Chatterji, X. Cheng, N. Flammarion, P. L Bartlett, and M. I Jordan. Is there an analog of Nesterov acceleration for MCMC? arXiv preprint arXiv:1902.00996, 2019.
• (26) E. Moulines and F. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.
• (27) R. M Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
• (28) N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends® in Optimization, 1(3):127–239, 2014.
• (29) G. B. Passty. Ergodic convergence to a zero of the sum of monotone operators in Hilbert space. Journal of Mathematical Analysis and Applications, 72(2):383–390, 1979.
• (30) A. Patrascu and I. Necoara. Nonasymptotic convergence of stochastic proximal point algorithms for constrained convex optimization. Journal of Machine Learning Research, 18:1–42, 2018.
• (31) P. Richtárik and M. Takáč. Stochastic reformulations of linear systems: algorithms and convergence theory. arXiv:1706.01108, 2017.
• (32) H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
• (33) R. T. Rockafellar and R. J.-B. Wets. On the interchange of subdifferentiation and conditional expectations for convex functionals. Stochastics, 7(3):173–182, 1982.
• (34) L. Rosasco, S. Villa, and B. C. Vũ. Stochastic inertial primal-dual algorithms. arXiv preprint arXiv:1507.00852, 2015.
• (35) A. Salim, P. Bianchi, and W. Hachem. Snake: a stochastic proximal gradient algorithm for regularized problems over large graphs. IEEE Transactions on Automatic Control, 2019.
• (36) F. Santambrogio. Euclidean, metric, and Wasserstein gradient flows: an overview. Bulletin of Mathematical Sciences, 7(1):87–154, 2017.
• (37) S. Schechtman, A. Salim, and P. Bianchi. Passty Langevin. In CAp, 2019.
• (38) U. ŞimŠekli. Fractional Langevin Monte Carlo: Exploring Lévy driven stochastic differential equations for Markov Chain Monte Carlo. In International Conference on Machine Learning, pages 3200–3209, 2017.
• (39) P. Sollich. Bayesian methods for support vector machines: Evidence and predictive class probabilities. Machine Learning, 46(1-3):21–52, 2002.
• (40) W. Tansey and J. G Scott. A fast and flexible algorithm for the graph-fused lasso. arXiv preprint arXiv:1505.06475, 2015.
• (41) R. J. Tibshirani. Adaptive piecewise polynomial estimation via trend filtering. The Annals of Statistics, 42(1):285–323, 2014.
• (42) P. Toulis, T. Horel, and E. M Airoldi. Stable Robbins-Monro approximations through stochastic proximal updates. arXiv preprint arXiv:1510.00967, 2015.
• (43) T. Van Erven and P. Harremos. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.
• (44) C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
• (45) B. C. Vũ. A splitting algorithm for dual monotone inclusions involving cocoercive operators. Advances in Applied Mathematics, 38(3):667–681, 2013.
• (46) Y.-X. Wang, J. Sharpnack, A. J Smola, and R. J Tibshirani. Trend filtering on graphs. The Journal of Machine Learning Research, 17(1):3651–3691, 2016.
• (47) M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning, pages 681–688, 2011.
• (48) A. Wibisono. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. arXiv preprint arXiv:1802.08089, 2018.
• (49) X. Xu and M. Ghosh. Bayesian variable selection and estimation for group lasso. Bayesian Analysis, 10(4):909–936, 2015.
• (50) A. Yurtsever, B. C. Vu, and V. Cevher. Stochastic three-composite convex minimization. In Advances in Neural Information Processing Systems, pages 4329–4337, 2016.

### Appendix A Lemmas Needed for the Proof of the Main Theorem

In order to prove Theorem 1, we will need to establish several lemmas. In the sequel, we assume that Assumptions 1–5 hold true.

We first recall (17, Lemma 1).

###### Lemma 5.

The target distribution μ⋆ satisfies EU(μ⋆) < ∞ and H(μ⋆) < ∞. Moreover, if μ ≪ μ⋆, then

 KL(μ|μ⋆) = EU(μ) + H(μ) − (EU(μ⋆) + H(μ⋆)) = F(μ) − F(μ⋆),

provided that EU(μ) and H(μ) are finite.

###### Lemma 6.
 2γ[H(μ_{y_k^0}) − H(μ⋆)] ≤ W₂(μ_{z_k}, μ⋆) − W₂(μ_{y_k^0}, μ⋆). (25)
###### Proof.

This is an application of (17, Lemma 4) to the entropy functional H.

The proof is roughly the same as the proof of the convergence rate of the gradient descent algorithm, but in continuous time and in the space of probability measures. For the sake of completeness, we provide the main arguments of a proof different from the one in (17), which makes the connection with Lyapunov techniques used in the analysis of gradient descent/gradient flows in Euclidean spaces easier.

Consider a Brownian motion (B_t) with initial distribution μ_{z_k}, and denote, for every t ≥ 0, ν_t the distribution of B_t. Then (ν_t) is a gradient flow of the entropy H (see (36)). This implies (see (1, Page 280))

 ∀t > 0, 2(H(ν_t) − H(μ⋆)) ≤ −(d/dt) W₂(ν_t, μ⋆). (26)

This also implies (44, Page 711) that H (the objective function) is a Lyapunov function for (ν_t) (the gradient flow):

 ∀t > 0, (d/dt) H(ν_t) ≤ 0. (27)

Now consider the function ℓ(t) := t(H(ν_t) − H(μ⋆)) + (1/2) W₂(ν_t, μ⋆). For every t > 0,

 (d/dt) ℓ(t) ≤ (H(ν_t) − H(μ⋆)) + t (d/dt) H(ν_t) − (H(ν_t) − H(μ⋆)) ≤ 0, (28)

using the inequalities (26) and (27). In other words, ℓ is also a Lyapunov function. Therefore, for every t > 0, ℓ(t) ≤ ℓ(0) (here, one has to prove that ℓ is continuous at 0; we skip this part of the proof), which implies

 γ(H(ν