# A Proximal Algorithm for Sampling from Non-smooth Potentials

Markov chain Monte Carlo (MCMC) is an effective and dominant method to sample from high-dimensional complex distributions. Yet, most existing MCMC methods are only applicable to settings with smooth potentials (log-densities). In this work, we examine sampling problems with non-smooth potentials. We propose a novel MCMC algorithm for sampling from non-smooth potentials. We provide a non-asymptotic analysis of our algorithm and establish a polynomial-time complexity Õ(dε^-1) to obtain ε total variation distance to the target density, better than all existing results under the same assumptions. Our method is based on the proximal bundle method and an alternating sampling framework. This framework requires the so-called restricted Gaussian oracle, which can be viewed as a sampling counterpart of the proximal mapping in convex optimization. One key contribution of this work is a fast algorithm that realizes the restricted Gaussian oracle for any convex non-smooth potential with bounded Lipschitz constant.


## 1 Introduction

Core to many scientific and engineering problems that face uncertainty (either physical or algorithmic) is the task of drawing samples from a given, often unnormalized, probability density. Sampling plays a crucial role in applications such as statistical inference/estimation, operations research, physics, biology, and machine learning [2, 10, 11, 12, 14, 15, 16, 33]. For instance, in Bayesian statistics, we can sample from the posterior distribution to infer its mean, covariance, or other important statistics. Sampling is heavily used in molecular dynamics to discover new structures. Sampling is also closely related to optimization: on one hand, optimization can be viewed as the limit of sampling when the temperature parameter goes to 0; on the other hand, sampling can be viewed as optimization over the manifold of probability distributions [34, 36].

Over the years, many methods and algorithms have been developed for sampling [1, 5, 11, 17, 18, 24, 31]. A very popular framework for sampling from high-dimensional complex distributions is the Markov chain Monte Carlo (MCMC) algorithm [4, 6, 7, 9, 10]. In MCMC, a Markov chain is constructed so that its invariant distribution is the given target distribution we want to sample from. After running the Markov chain for sufficiently many iterations, the state follows this invariant distribution, generating samples from it. Several widely used MCMC methods include Langevin Monte Carlo (LMC) [8, 9], the Metropolis-adjusted Langevin algorithm (MALA) [13], and Hamiltonian Monte Carlo (HMC) [27]. These three algorithms use gradient information of the potential (log-density) to construct the Markov chain. They resemble gradient-based algorithms in optimization and can be viewed as their sampling counterparts. Over the last few years, many theoretical results [5, 8, 9, 28, 29] have been established to understand the computational complexity of these MCMC algorithms.
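As a concrete illustration of such gradient-based chains, the following is a minimal sketch of unadjusted LMC (the Euler-Maruyama discretization of the Langevin diffusion); the step size, iteration count, and Gaussian test potential are our own illustrative choices, not tuned parameters from the references above.

```python
import numpy as np

def lmc(grad_f, x0, step, n_iters, rng):
    """Unadjusted Langevin Monte Carlo for a potential f:
    x_{k+1} = x_k - step * grad_f(x_k) + sqrt(2 * step) * N(0, I)."""
    x = np.asarray(x0, dtype=float)
    chain = np.empty((n_iters, x.size))
    for k in range(n_iters):
        noise = rng.standard_normal(x.shape)
        x = x - step * grad_f(x) + np.sqrt(2.0 * step) * noise
        chain[k] = x
    return chain

# Toy smooth potential f(x) = ||x||^2 / 2 (standard Gaussian target), grad f(x) = x.
rng = np.random.default_rng(0)
chain = lmc(lambda x: x, np.zeros(2), step=0.1, n_iters=5000, rng=rng)
```

For this quadratic potential the iterates stay close to a standard Gaussian up to an O(step) discretization bias. The recursion relies on evaluating a gradient of the potential, which is exactly what breaks down in the non-smooth setting studied here.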

Most existing MCMC methods are only applicable to settings with smooth potentials [8, 17, 34], i.e., potentials whose gradient is Lipschitz continuous. However, non-smooth sampling is also an important problem, as many applications of sampling involve non-smooth potentials. For instance, in Bayesian inference, the prior is naturally non-smooth when a compact support is considered. Many problems in deep learning are also non-smooth, not only because of non-smooth activation functions such as the ReLU used in neural networks, but also because of intrinsic scaling symmetries. Nevertheless, the study of sampling without smoothness is nascent. This is in sharp contrast to optimization, where a plethora of algorithms, e.g., the subgradient method, proximal algorithms, and bundle methods, have been developed for non-smooth problems [20, 21, 22, 23, 25, 30, 35].

The goal of this work is to establish an efficient algorithm to draw samples from a distribution with a non-smooth potential. We focus on the case where the potential is convex and Lipschitz continuous. Our algorithm is based on the recent alternating sampling framework [17], which can be viewed as a sampling counterpart of the proximal point method in optimization [30]. The key step of the alternating sampling framework is the so-called restricted Gaussian oracle (RGO), which draws samples from the potential regularized by a large isotropic quadratic term. To utilize this framework for general non-smooth potentials with bounded Lipschitz constant, we develop an efficient realization of the RGO through rejection sampling with a properly designed proposal. A non-smooth optimization technique known as the bundle method [20, 22, 23] is used to compute the proposal. We establish a polynomial-time complexity Õ(dε^-1) to obtain ε total variation distance to the target density, better than all existing results under the same assumptions.

Over the last few years, several new algorithms and theoretical results for sampling with non-smooth potentials have been established. In [26], sampling from non-smooth composite potentials is considered; the algorithm needs a proximal sampling oracle that samples from the target potential regularized by a large isotropic quadratic term and also computes the corresponding partition function. In [32], an algorithm to sample from non-smooth composite potentials is developed, based on the RGO, which is similar to the proximal sampling oracle but does not require computing the partition function. In [3], the authors developed an algorithm to sample from non-smooth potentials by running LMC on a Gaussian smoothing of the potential. In [19], the author developed the projected LMC algorithm and analyzed its complexity for non-smooth potentials. In [9], the authors presented an optimization approach to analyzing the complexity of sampling and established a complexity result for sampling with non-smooth composite potentials. Compared with [26, 32], our algorithm does not require any sampling oracle. Compared with [19], we consider sampling from a distribution supported on ℝ^d instead of on a convex compact set. Compared with [3, 9], our algorithm has better complexity in terms of total variation when the target error is small. See Table 1 for the detailed complexity bounds.

The rest of this paper is structured as follows. In Section 2 we provide the problem formulation we are interested in. We also briefly review the alternating sampling framework on which our algorithm is based. In Section 3 we present an efficient realization of the RGO for general convex potentials with bounded Lipschitz constants. This is then combined with the alternating sampling framework in Section 4 to establish our main results for sampling without smoothness.

## 2 Problem formulation and alternating sampling framework

The problem of interest is to sample from a distribution on ℝ^d proportional to exp(−f(x)), where the potential f is convex and M-Lipschitz continuous. Note that f does not need to be smooth; this violates the smoothness assumption of most existing MCMC sampling methods [17, 34]. Before describing our approach to designing an efficient algorithm for sampling with non-smooth potentials, we introduce two important algorithmic notions used in this paper.

Our method is based on the alternating sampling framework introduced in [17], which is a generic framework for sampling from a distribution proportional to exp(−f(x)) where f is a μ-strongly convex function. For a given point x and stepsize η > 0, the alternating sampling framework repeats the two steps in Algorithm 1.

In Algorithm 1, sampling y given x in step 1 can be easily done since y | x ∼ N(x, ηI). Sampling x given y in step 2 corresponds to the so-called restricted Gaussian oracle for f introduced in [17], which is the second crucial algorithmic notion used in this paper.

###### Definition 2.1.

Given a point y ∈ ℝ^d and stepsize η > 0, a restricted Gaussian oracle (RGO) for a convex function f is a sampling oracle that returns a random sample from the distribution proportional to $\exp\big(-f(\cdot) - \frac{1}{2\eta}\|\cdot - y\|^2\big)$.

The RGO is an analogue of the proximal mapping in convex optimization, which is heavily used in proximal point methods. The RGO is a key algorithmic ingredient used in [17], together with the alternating sampling framework, to improve the iteration-complexity bounds of various sampling algorithms. Examples of convex functions that admit a computationally efficient RGO are presented in [26, 32], including coordinate-separable regularizers, the ℓ₁-norm, and group Lasso.
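To make the two alternating steps concrete, here is a minimal sketch of the framework for a toy potential whose RGO happens to be available in closed form; the function names and the Gaussian example are our own illustration, not the paper's Algorithm 1 verbatim.

```python
import numpy as np

def alternating_sampler(rgo, x0, eta, n_iters, rng):
    """Alternating sampling framework (sketch).
    Step 1: y ~ N(x, eta * I).
    Step 2: x ~ density proportional to exp(-f(x) - ||x - y||^2 / (2 * eta)),
    which is exactly what the RGO provides."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        y = x + np.sqrt(eta) * rng.standard_normal(x.shape)  # step 1
        x = rgo(y, eta, rng)                                 # step 2 (RGO call)
    return x

# Illustrative closed-form RGO for f(x) = ||x||^2 / 2: the restricted density is
# Gaussian with mean y / (1 + eta) and variance eta / (1 + eta) per coordinate.
def gaussian_rgo(y, eta, rng):
    mean = y / (1.0 + eta)
    std = np.sqrt(eta / (1.0 + eta))
    return mean + std * rng.standard_normal(y.shape)
```

Both conditional draws are exact, so the scheme is a Gibbs sampler on the joint density proportional to exp(−f(x) − ‖x − y‖²/(2η)), whose x-marginal is the target. The difficulty addressed in Section 3 is realizing the RGO when no closed form exists.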

We recall the main result of [17], which gives the complexity of Algorithm 1 in terms of the number of calls to the RGO and is useful in this paper.

###### Theorem 2.2.

(Theorem 1 of [17]) Let π be a distribution on ℝ^d with π ∝ exp(−f(x)) such that f is μ-strongly convex, and let ε ∈ (0, 1) be a target accuracy and η > 0 a stepsize. Then Algorithm 1, initialized at a β-warm start, runs in $\Theta\big(\frac{1}{\eta\mu}\log\frac{\log\beta}{\varepsilon}\big)$ iterations, each querying the RGO for f with parameter η a constant number of times, and obtains ε total variation distance to π.

We are now ready to describe our approach to sample from non-smooth potentials. We first consider the regularized density proportional to $\exp(-f^\mu(x))$, where $f^\mu := f + \frac{\mu}{2}\|\cdot - x_0\|^2$, μ > 0, and $x_0$ is an arbitrary point, preferably close to the minimum set of f. Since $f^\mu$ is μ-strongly convex, the alternating sampling framework is applicable to it. Instead of assuming the existence of an RGO for $f^\mu$, we develop an efficient implementation of the RGO based on the proximal bundle method and rejection sampling for an arbitrary convex f with bounded Lipschitz constant. Finally, we show that a sample generated from $\exp(-f^\mu)$ by Algorithm 1 with the implementable RGO and a proper choice of μ is within ε total variation distance of the target density proportional to exp(−f).

## 3 An implementable restricted Gaussian oracle

The bottleneck of applying the alternating sampling framework (Algorithm 1) to sample from general log-concave distributions is the availability of the RGO. In this section, we focus on designing a computationally efficient and implementable RGO for μ-strongly convex potentials of the form $g = f + \frac{\mu}{2}\|\cdot - x_0\|^2$ with f convex and M-Lipschitz continuous. Subsection 3.1 presents an implementation of the RGO for g by using the proximal mapping of g and rejection sampling. To develop an implementable RGO for g in cases where no efficient optimization oracle for g exists, we resort to the proximal bundle method, a standard method in convex non-smooth optimization. Subsection 3.2 briefly reviews the proximal bundle method and its iteration-complexity bound. Finally, Subsection 3.3 describes the implementation of the RGO for g based on rejection sampling and the proximal bundle method, instead of the proximal mapping of g.

An interesting result in this section that may be of independent interest is a computationally efficient implementation of the RGO for any convex and Lipschitz continuous function f. In both settings, whether the proximal mapping is available or not, the implementations and results for the RGO in this section are also applicable to f itself by replacing g with f (i.e., taking μ = 0).

### 3.1 Sampling with an optimization oracle

Assume that g has an easily computable proximal mapping, and let

$$x^* = \operatorname*{argmin}_{x\in\mathbb{R}^d}\Big\{g_\eta(x) := g(x) + \frac{1}{2\eta}\|x-y\|^2\Big\}. \qquad (1)$$

Note that solving (1) is equivalent to invoking one proximal mapping of g, since $x^* = \operatorname{prox}_{\eta g}(y)$.

The following technical result is useful in the complexity analysis of Algorithm 2. It is a special case of a more general result (Lemma 3.5) presented later, and thus its proof is omitted.

###### Lemma 3.1.

Let $\eta_\mu := \frac{\eta}{1+\eta\mu}$ and define

$$h_1 := \frac{1}{2\eta_\mu}\|\cdot - x^*\|^2 + g_\eta(x^*), \qquad h_2 := \frac{1}{2\eta_\mu}\|\cdot - x^*\|^2 + 2M\|\cdot - x^*\| + g_\eta(x^*).$$

Then, for every $x \in \mathbb{R}^d$, we have $h_1(x) \le g_\eta(x) \le h_2(x)$.

We now describe the implementation of the RGO for g based on the proximal mapping of g.

The next proposition justifies the correctness and develops the complexity of Algorithm 2.

###### Proposition 3.2.

Assume f is convex and M-Lipschitz continuous. Let η > 0 and

$$p(x) = p(x\mid y) \propto \exp\Big(-g(x) - \frac{1}{2\eta}\|x-y\|^2\Big);$$

then the sample X generated by Algorithm 2 satisfies X ∼ p. If $\eta_\mu \le \frac{1}{64M^2 d}$, then the expected number of iterations in Algorithm 2 is at most 2.

Proof: It is a well-known result for rejection sampling that X ∼ p and that the probability that X is accepted is

$$\mathbb{P}\left(U\le\frac{\exp(-g_\eta(X))}{\exp(-h_1(X))}\right) = \frac{\int \exp(-g_\eta(x))\,dx}{\int \exp(-h_1(x))\,dx}.$$

Using the assumption on $\eta_\mu$, Lemma 3.1 and Proposition A.3, we have

$$\int_{\mathbb{R}^d}\exp(-g_\eta(x))\,dx \ge \int_{\mathbb{R}^d}\exp(-h_2(x))\,dx \ge \frac{\exp(-g_\eta(x^*))(2\pi\eta_\mu)^{d/2}}{2}.$$

Moreover, it follows from the definition of $h_1$ and Lemma A.1(a) that

$$\int_{\mathbb{R}^d}\exp(-h_1(x))\,dx = \exp(-g_\eta(x^*))(2\pi\eta_\mu)^{d/2}.$$

The above three relations immediately imply that

$$\mathbb{P}\left(U\le\frac{\exp(-g_\eta(X))}{\exp(-h_1(X))}\right) \ge \frac12,$$

and hence that the expected number of iterations is bounded above by 2.
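The rejection scheme above can be sketched in one dimension for g(x) = M|x| (so μ = 0 and $\eta_\mu = \eta$), where the proximal mapping is soft-thresholding; the helper names and parameter values below are our own illustrative choices.

```python
import numpy as np

def rgo_rejection(M, eta, y, rng):
    """Rejection-sampling RGO (sketch) for g(x) = M * |x| in 1D.
    Proposal: exp(-h1) with h1(x) = (x - x_star)^2 / (2 * eta) + g_eta(x_star),
    i.e. X ~ N(x_star, eta); accept with probability exp(-(g_eta(X) - h1(X))).
    Since g_eta is (1/eta)-strongly convex and minimized at x_star, h1 <= g_eta,
    so the acceptance ratio is a valid probability."""
    g_eta = lambda x: M * abs(x) + (x - y) ** 2 / (2.0 * eta)
    x_star = np.sign(y) * max(abs(y) - eta * M, 0.0)  # prox of eta*M*|.| at y
    while True:
        X = x_star + np.sqrt(eta) * rng.standard_normal()
        h1 = (X - x_star) ** 2 / (2.0 * eta) + g_eta(x_star)
        if rng.uniform() <= np.exp(-(g_eta(X) - h1)):
            return X
```

The accepted draws follow the restricted Gaussian density proportional to exp(−g_η); for example, their empirical mean matches a direct numerical quadrature of that density.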

### 3.2 Review of the proximal bundle method

The proximal bundle method [22, 23] is a standard and efficient algorithm for solving convex non-smooth optimization problems. In this subsection, we briefly review a standard approach to solve the subproblem considered in the proximal bundle method, the properties of the solution to the subproblem, and the iteration-complexity for solving the subproblem.

We consider the subproblem

$$g_\eta^* := g_\eta(x^*) = \min\Big\{g_\eta(x) := g(x) + \frac{1}{2\eta}\|x-y\|^2 : x\in\mathbb{R}^d\Big\}, \qquad (2)$$

and we aim at obtaining a δ-solution to (2), i.e., a point $\tilde x$ such that $g_\eta(\tilde x) - g_\eta^* \le \delta$.

We now state in Algorithm 3 the method for obtaining a δ-solution to subproblem (2).

We make some remarks about Algorithm 3. First, (4) shows the flexibility in the choice of the cutting-plane model: one choice results in the standard cutting-plane model, which lies underneath the objective, while another gives a cutting-plane model with fewer cuts. Second, (3) can be reformulated as a convex quadratic program with affine constraints, whose number of constraints equals the cardinality of the bundle. Since subproblem (3) becomes harder to solve as the bundle grows, we favor keeping the set in (4) as lean as possible.
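As a self-contained illustration of the cutting-plane mechanics (not the exact Algorithm 3, whose bundle-management rule (4) is more refined), the following one-dimensional sketch builds the model from subgradient cuts and stops at a δ-solution of (2); the bracketing interval and the ternary-search solver are illustrative simplifications of the quadratic-programming subproblem (3).

```python
def bundle_prox(g, subgrad, y, eta, delta, x_init, lo=-10.0, hi=10.0, max_iters=50):
    """Cutting-plane loop (sketch) for min_x g_eta(x) = g(x) + (x - y)^2 / (2 * eta).
    Each cut (a, b) is an affine minorant a * x + b <= g(x); the model minimum
    lower-bounds min g_eta, so (best value - model minimum) <= delta certifies
    a delta-solution. The 1D convex model is minimized by ternary search."""
    g_eta = lambda x: g(x) + (x - y) ** 2 / (2.0 * eta)
    cuts = []
    x_cur, x_best = float(x_init), float(x_init)
    for _ in range(max_iters):
        a = subgrad(x_cur)
        cuts.append((a, g(x_cur) - a * x_cur))
        model = lambda x: max(ai * x + bi for ai, bi in cuts) + (x - y) ** 2 / (2.0 * eta)
        l, h = lo, hi
        for _ in range(200):  # ternary search on the convex model over [lo, hi]
            m1, m2 = l + (h - l) / 3.0, h - (h - l) / 3.0
            if model(m1) < model(m2):
                h = m2
            else:
                l = m1
        x_cur = 0.5 * (l + h)
        if g_eta(x_cur) < g_eta(x_best):
            x_best = x_cur
        if g_eta(x_best) - model(x_cur) <= delta:  # gap certificate
            break
    return x_best

# Demo (illustrative): g(x) = |x|, y = 2, eta = 1; the exact proximal point is x = 1.
x_tilde = bundle_prox(abs, lambda x: (x > 0) - (x < 0), y=2.0, eta=1.0, delta=1e-3, x_init=0.0)
```

In this instance the loop terminates after two cuts, since the second model already matches g_η near its minimizer; the gap test mirrors the δ-solution criterion for (2).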

The following lemma contains technical results about Algorithm 3 that are useful in the complexity analysis in Subsection 3.3.

###### Lemma 3.3.

Assume f is convex and M-Lipschitz continuous. Let j denote the last iteration index of Algorithm 3. Then the following statements hold:

• a) $f_j \le f$, $g_j \le g$, and $g_j^\eta(x) \ge g_j^\eta(x_j) + \frac{1}{2\eta_\mu}\|x-x_j\|^2$ for every $x \in \mathbb{R}^d$;

• b) $g_\eta(\tilde x_j) - g_j^\eta(x_j) \le \delta$;

• c) $\big\|\mu(x_j - x_0) + \frac{x_j - y}{\eta}\big\| \le M$;

• d) $\|\tilde x_j - x_j\| \le \sqrt{2\delta\eta_\mu}$.

Proof: a) The first two inequalities follow directly from the convexity of f and the definitions of $f_j$ and $g_j$; the third inequality follows from (3).

b) This statement immediately follows from step 3 of Algorithm 3.

c) It follows from the optimality condition of (3) and the definition of $g_j^\eta$ that

$$-\mu(x_j - x_0) - \frac{x_j - y}{\eta} \in \partial f_j(x_j).$$

This inclusion and the fact that every subgradient of $f_j$ has norm at most M imply that c) holds.

d) The proof can be found in Lemma 5.1(e) of [22].

The following result states the iteration-complexity bound for Algorithm 3 to obtain a δ-solution to subproblem (2). We omit the proof since it is rather technical and beyond the scope of this paper; a complete proof can be found in Section 4 of [23].

###### Proposition 3.4.

Algorithm 3 takes $\tilde O\big(\frac{\eta_\mu M^2}{\delta} + 1\big)$ iterations to terminate, and each iteration solves an affinely constrained convex quadratic programming problem.

### 3.3 Sampling without an optimization oracle

Define

$$h_1 := \frac{1}{2\eta_\mu}\|\cdot - x_j\|^2 + g_\eta(\tilde x_j) - \delta, \qquad (5)$$

$$h_2 := \frac{1}{2\eta_\mu}\|\cdot - \tilde x_j\|^2 + \Big(2M + \frac{\sqrt{2\delta}}{\sqrt{\eta_\mu}}\Big)\|\cdot - \tilde x_j\| + g_\eta(\tilde x_j). \qquad (6)$$

Algorithm 4 describes the implementation of the RGO for g based on Algorithm 3 and rejection sampling. It differs from Algorithm 2 in that it uses Algorithm 3 to compute an approximate solution to (2), instead of calling the proximal mapping as in (1).

The following lemma is a counterpart of Lemma 3.1 in the context of the RGO without an optimization oracle and plays an important role in the proof of Proposition 3.6. It reduces to Lemma 3.1 when δ = 0.

###### Lemma 3.5.

Assume f is convex and M-Lipschitz continuous. Let $g_\eta$ be as in (1). Then, for every $x \in \mathbb{R}^d$, we have

$$h_1(x) \le g_\eta(x) \le h_2(x), \qquad (7)$$

where $h_1$ and $h_2$ are as in (5) and (6), respectively.

Proof: Using Lemma 3.3(a)-(b) and the definition of $g_j^\eta$, we have

$$g(\tilde x_j) - g(x) + \frac{1}{2\eta_\mu}\|x - x_j\|^2 \le g(\tilde x_j) - g_j(x) + \frac{1}{2\eta_\mu}\|x - x_j\|^2 \le g(\tilde x_j) - g_j^\eta(x_j) + \frac{1}{2\eta}\|x - y\|^2 \le \delta - \frac{1}{2\eta}\|\tilde x_j - y\|^2 + \frac{1}{2\eta}\|x - y\|^2.$$

The first inequality in (7) holds in view of the definition of $h_1$ in (5). Using the definition of $g_\eta$ in (1) and the fact that f is M-Lipschitz continuous, we have

$$g_\eta(x) - g_\eta(\tilde x_j) = f(x) - f(\tilde x_j) + \frac{\mu}{2}\|x - x_0\|^2 - \frac{\mu}{2}\|\tilde x_j - x_0\|^2 + \frac{1}{2\eta}\|x - y\|^2 - \frac{1}{2\eta}\|\tilde x_j - y\|^2$$
$$\le M\|x - \tilde x_j\| + \frac{\mu}{2}\|x - \tilde x_j\|^2 + \mu\langle x - \tilde x_j, \tilde x_j - x_0\rangle + \frac{1}{2\eta}\|x - \tilde x_j\|^2 + \frac{1}{\eta}\langle x - \tilde x_j, \tilde x_j - y\rangle$$
$$= M\|x - \tilde x_j\| + \frac{1}{2\eta_\mu}\|x - \tilde x_j\|^2 + \mu\langle x - \tilde x_j, \tilde x_j - x_j + x_j - x_0\rangle + \frac{1}{\eta}\langle x - \tilde x_j, \tilde x_j - x_j + x_j - y\rangle.$$

The above inequality, the Cauchy-Schwarz inequality and Lemma 3.3(c)-(d) imply that

$$g_\eta(x) - g_\eta(\tilde x_j) \le M\|x - \tilde x_j\| + \frac{1}{2\eta_\mu}\|x - \tilde x_j\|^2 + \frac{1}{\eta_\mu}\|x - \tilde x_j\|\,\|\tilde x_j - x_j\| + \|x - \tilde x_j\|\Big\|\mu(x_j - x_0) + \frac{x_j - y}{\eta}\Big\|$$
$$\le M\|x - \tilde x_j\| + \frac{1}{2\eta_\mu}\|x - \tilde x_j\|^2 + \frac{\sqrt{2\delta}}{\sqrt{\eta_\mu}}\|x - \tilde x_j\| + M\|x - \tilde x_j\| = \Big(2M + \frac{\sqrt{2\delta}}{\sqrt{\eta_\mu}}\Big)\|x - \tilde x_j\| + \frac{1}{2\eta_\mu}\|x - \tilde x_j\|^2.$$

It follows from the above inequality and the definition of $h_2$ in (6) that the second inequality in (7) holds.

The next proposition is the main result of this subsection and shows that the number of rejections in Algorithm 4 is small in expectation; hence, the implementation of the RGO for g is computationally efficient.

###### Proposition 3.6.

If

$$\eta_\mu \le \frac{1}{64M^2 d}, \qquad \delta \le \frac{1}{32d}, \qquad (8)$$

then the expected number of iterations in the rejection sampling of Algorithm 4 is at most 3.

Proof: We first observe that assumption (8) implies that

$$2\sqrt{\eta_\mu}\,M + \sqrt{2\delta} \le \frac{1}{2\sqrt d},$$

which satisfies the assumption of Proposition A.3 with

$$\lambda = \eta_\mu, \qquad a = M + \frac{\sqrt{\delta}}{\sqrt{2\eta_\mu}}.$$

Using the definition of $h_2$ in (6) and Proposition A.3, we have

$$\int_{\mathbb{R}^d}\exp(-h_2(x))\,dx \ge \frac12\exp(-g_\eta(\tilde x_j))(2\pi\eta_\mu)^{d/2}.$$

It follows from the definition of $h_1$ in (5) and Lemma A.1(a) with $\lambda = \eta_\mu$ that

$$\int\exp(-h_1(x))\,dx = \exp(-g_\eta(\tilde x_j) + \delta)(2\pi\eta_\mu)^{d/2}.$$

We conclude that

$$\mathbb{P}\left(U \le \frac{\exp(-g_\eta(X))}{\exp(-h_1(X))}\right) = \frac{\int\exp(-g_\eta(x))\,dx}{\int\exp(-h_1(x))\,dx} \ge \frac{\int\exp(-h_2(x))\,dx}{\exp(-g_\eta(\tilde x_j)+\delta)(2\pi\eta_\mu)^{d/2}} \ge \frac12\exp(-\delta),$$

and the expected number of iterations is

$$\frac{1}{\mathbb{P}\big(U \le \frac{\exp(-g_\eta(X))}{\exp(-h_1(X))}\big)} \le 2\exp(\delta) \le 2(1+2\delta) \le 2\Big(1+\frac{1}{16d}\Big) \le 3,$$

where the last two inequalities are due to the second inequality in (8).

## 4 Sampling from non-smooth potentials

In this section we present our main results on sampling from log-concave probability densities with non-smooth potentials. This section contains two subsections. Subsection 4.1 presents one of the main results of the paper, i.e., the iteration-complexity bound for sampling from the regularized density proportional to $\exp(-f^\mu)$ where $f^\mu = f + \frac{\mu}{2}\|\cdot - x_0\|^2$. Based on this, Subsection 4.2 provides the other main result, the iteration-complexity bound for sampling from a density proportional to exp(−f) where f is convex and Lipschitz continuous. Apart from being a transition step to our final result for sampling without smoothness, the results in Subsection 4.1 provide an efficient method to sample from composite potentials of the form $f + \frac{\mu}{2}\|\cdot - x_0\|^2$ and may be of independent interest.

### 4.1 Total complexity for strongly convex potential

Using the efficient implementation of the RGO for $g = f^\mu$ developed in Section 3 and the alternating sampling framework (Algorithm 1), we are now able to sample from the density proportional to $\exp(-f^\mu)$ and establish the complexity of this sampling task.

The following theorem states the iteration-complexity bound for Algorithm 1 using Algorithm 4 as the RGO to sample from $\exp(-f^\mu)$. Note that this iteration-complexity bound is poly-logarithmic in the target accuracy ε in terms of total variation.

###### Theorem 4.1.

Let ε > 0, μ > 0, δ > 0, and η satisfying

$$\frac{\delta}{M^2} \le \eta \le \min\Big\{\frac{1}{64M^2 d}, \frac{1}{\mu}\Big\} \qquad (9)$$

be given. Let π be a distribution on ℝ^d satisfying $\pi \propto \exp(-f^\mu)$ where f is convex and M-Lipschitz continuous on ℝ^d. Consider Algorithm 1, using Algorithm 4 as an RGO and initialized at a β-warm start. Then the iteration-complexity bound for obtaining ε total tolerance to π in terms of total variation is

$$\tilde O\left(\frac{M^2}{\mu\delta}\log\Big(\frac{\log\beta}{\varepsilon}\Big) + 1\right), \qquad (10)$$

and each iteration queries one subgradient oracle of f and solves a quadratic programming problem. Moreover, the number of Gaussian distribution sampling queries in Algorithm 1 is

$$\Theta\left(\frac{1}{\eta\mu}\log\Big(\frac{\log\beta}{\varepsilon}\Big) + 1\right). \qquad (11)$$

Proof: It follows from Theorem 2.2 that the iteration-complexity for Algorithm 1 to obtain ε total tolerance to π is (11), which together with Proposition 3.4 implies that the total iteration-complexity is

$$\tilde O\left(\Big[\frac{\eta_\mu M^2}{\delta} + 1\Big]\Big[\frac{1}{\eta\mu}\log\Big(\frac{\log\beta}{\varepsilon}\Big) + 1\Big]\right). \qquad (12)$$

Let

$$a = \frac{\eta_\mu M^2}{\delta}, \qquad b = \frac{1}{\eta\mu}.$$

In view of the above definitions of a and b, the total iteration-complexity (12) becomes $\tilde O\big((a+1)(b\log\frac{\log\beta}{\varepsilon} + 1)\big)$. It is easy to see from (9) that $a \ge \frac12$ and $b \ge 1$, and hence this quantity is equal to $\tilde O\big(ab\log\frac{\log\beta}{\varepsilon} + 1\big)$. Since $ab \le \frac{M^2}{\mu\delta}$, the total iteration-complexity is (10). Moreover, it follows from (9) that (8) is satisfied, and hence that Proposition 3.6 holds. This conclusion together with the iteration-complexity (11) for Algorithm 1 implies the last conclusion of the theorem.

It is worth noting that if the proximal mapping of g is available, then the implementation of the RGO only requires one call to the proximal mapping plus a small expected number of rejection sampling steps. As a result, the total complexity for Algorithm 1 is the same as the complexity in Theorem 2.2, i.e., (11).

### 4.2 Total complexity for convex potential

This subsection studies the main problem of this paper, i.e., sampling from a density proportional to exp(−f). Building upon Theorem 4.1 for sampling from $\exp(-f^\mu)$ and a proper choice of μ, the following theorem establishes the iteration-complexity bound for Algorithm 1 to sample from exp(−f).

###### Theorem 4.2.

Let π be a distribution on ℝ^d satisfying π ∝ exp(−f) where f is convex and M-Lipschitz continuous on ℝ^d. Let ε > 0 and $x_0 \in \mathbb{R}^d$ be given and

$$\mu = \frac{\varepsilon}{\sqrt{M^4 + \|x_0 - x_{\min}\|^2}} \qquad (13)$$

where $x_{\min} \in \operatorname*{argmin}_{x\in\mathbb{R}^d} f(x)$. Choose η and δ such that (9) holds, and consider Algorithm 1 using Algorithm 4 as an RGO, applied to $f^\mu$, and initialized at a β-warm start. Then the iteration-complexity bound for obtaining ε total tolerance to π is

$$\tilde O\left(\frac{M^2\sqrt{M^4 + \|x_0 - x_{\min}\|^2}}{\varepsilon\delta}\log\Big(\frac{\log\beta}{\varepsilon}\Big) + 1\right). \qquad (14)$$

Proof: Let ρ denote the distribution of the points generated by Algorithm 1 using Algorithm 4 as an RGO, and let $\hat\pi$ denote the distribution proportional to $\exp(-g)$ with $g := f^\mu = f + \frac{\mu}{2}\|\cdot - x_0\|^2$. Following the proof of Corollary 4.1 of [3], we similarly have

$$\|\rho - \pi\|_{TV} \le \|\rho - \hat\pi\|_{TV} + \|\hat\pi - \pi\|_{TV}$$

and

$$\|\hat\pi - \pi\|_{TV} \le \frac12\left(\int_{\mathbb{R}^d}[f(x) - g(x)]^2\,d\pi(x)\right)^{1/2} = \frac12\left(\int_{\mathbb{R}^d}\Big(\frac{\mu}{2}\|x - x_0\|^2\Big)^2 d\pi(x)\right)^{1/2}$$
$$\le \frac12\left(\int_{\mathbb{R}^d}\big(\mu\|x - x_{\min}\| + \mu\|x_{\min} - x_0\|\big)^2 d\pi(x)\right)^{1/2} \le \frac{\mu}{2}\sqrt{M^4 + \|x_0 - x_{\min}\|^2} = \frac{\varepsilon}{2},$$

where the last identity is due to the definition of μ in (13). Hence, it suffices to derive the iteration-complexity bound for Algorithm 1 to obtain $\|\rho - \hat\pi\|_{TV} \le \frac{\varepsilon}{2}$, which is (14) in view of Theorem 4.1 with μ as in (13).

Finally, we remark that (9) implies $\delta \le \eta M^2 \le \frac{1}{64d}$, and if we choose $\delta = \frac{c}{d}$ for some universal constant c, then the total complexity (14) for sampling from non-smooth potentials becomes $\tilde O\big(M^2\sqrt{M^4 + \|x_0 - x_{\min}\|^2}\, d\,\varepsilon^{-1}\big)$.

## 5 Conclusion

This paper presents an algorithm based on the alternating sampling framework for sampling from non-smooth potentials and establishes a complexity bound Õ(dε^-1) to obtain ε total variation distance to the target density. An interesting result of this paper that may be of independent interest is a computationally efficient implementation of the RGO for any convex and Lipschitz continuous potential. A possible extension of our analysis is to consider sampling from composite densities proportional to $\exp(-g_1 - g_2)$ where $g_1$ is strongly convex and smooth and $g_2$ is convex and Lipschitz continuous.

## Acknowledgement

This work was supported by NSF under grants 1942523 and 2008513.

## References

• [1] David Applegate and Ravi Kannan. Sampling and integration of near log-concave functions. In Proceedings of the twenty-third annual ACM symposium on Theory of computing, pages 156–163, 1991.
• [2] Dimitris Bertsimas and Santosh Vempala. Solving convex programs by random walks. Journal of the ACM (JACM), 51(4):540–556, 2004.
• [3] Niladri Chatterji, Jelena Diakonikolas, Michael I Jordan, and Peter Bartlett. Langevin monte carlo without smoothness. In International Conference on Artificial Intelligence and Statistics, pages 1716–1726. PMLR, 2020.
• [4] Yuansi Chen, Raaz Dwivedi, Martin J Wainwright, and Bin Yu. Fast mcmc sampling algorithms on polytopes. The Journal of Machine Learning Research, 19(1):2146–2231, 2018.
• [5] Yuansi Chen, Raaz Dwivedi, Martin J Wainwright, and Bin Yu. Fast mixing of metropolized hamiltonian monte carlo: Benefits of multi-step gradients. J. Mach. Learn. Res., 21:92–1, 2020.
• [6] Xiang Cheng and Peter Bartlett. Convergence of langevin mcmc in kl-divergence. In Algorithmic Learning Theory, pages 186–211. PMLR, 2018.
• [7] Xiang Cheng, Niladri S Chatterji, Peter L Bartlett, and Michael I Jordan. Underdamped langevin mcmc: A non-asymptotic analysis. In Conference on learning theory, pages 300–323. PMLR, 2018.
• [8] Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.
• [9] Alain Durmus, Szymon Majewski, and Błażej Miasojedow. Analysis of langevin monte carlo via convex optimization. The Journal of Machine Learning Research, 20(1):2666–2711, 2019.
• [10] Alain Durmus, Eric Moulines, and Marcelo Pereyra. Efficient bayesian computation by proximal Markov Chain Monte Carlo: when Langevin meets Moreau. SIAM Journal on Imaging Sciences, 11(1):473–506, 2018.
• [11] Martin Dyer, Alan Frieze, and Ravi Kannan. A random polynomial-time algorithm for approximating the volume of convex bodies. Journal of the ACM (JACM), 38(1):1–17, 1991.
• [12] Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian Data Analysis. CRC press, 2013.
• [13] Ulf Grenander and Michael I Miller. Representations of knowledge in complex systems. Journal of the Royal Statistical Society: Series B (Methodological), 56(4):549–581, 1994.
• [14] Adam Tauman Kalai and Santosh Vempala. Simulated annealing for convex optimization. Mathematics of Operations Research, 31(2):253–266, 2006.
• [15] Ravi Kannan, László Lovász, and Miklós Simonovits. Random walks and an O*(n^5) volume algorithm for convex bodies. Random Structures & Algorithms, 11(1):1–50, 1997.
• [16] Werner Krauth. Statistical mechanics: algorithms and computations, volume 13. OUP Oxford, 2006.
• [17] Yin Tat Lee, Ruoqi Shen, and Kevin Tian. Structured logconcave sampling with a restricted gaussian oracle. In Conference on Learning Theory, pages 2993–3050. PMLR, 2021.
• [18] Yin Tat Lee and Santosh S Vempala. Convergence rate of riemannian hamiltonian monte carlo and faster polytope volume computation. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1115–1121, 2018.
• [19] Joseph Lehec. The langevin monte carlo algorithm in the non-smooth log-concave case. arXiv preprint arXiv:2101.10695, 2021.
• [20] C. Lemaréchal. An extension of davidon methods to non differentiable problems. In Nondifferentiable optimization, pages 95–109. Springer, 1975.
• [21] C. Lemaréchal. Nonsmooth optimization and descent methods. 1978.
• [22] J. Liang and R. D. C. Monteiro. A proximal bundle variant with optimal iteration-complexity for a large range of prox stepsizes. Available on arXiv:2003.11457, 2020.
• [23] J. Liang and R. D. C. Monteiro. A unified analysis of a class of proximal bundle methods for hybrid convex composite optimization problems. Available on arXiv:2110.01084, 2021.
• [24] László Lovász and Santosh Vempala. Fast algorithms for logconcave functions: Sampling, rounding, integration and optimization. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), pages 57–68. IEEE, 2006.
• [25] R. Mifflin. A modification and an extension of Lemaréchal’s algorithm for nonsmooth minimization. In Nondifferential and variational techniques in optimization, pages 77–90. Springer, 1982.
• [26] Wenlong Mou, Nicolas Flammarion, Martin J Wainwright, and Peter L Bartlett. An efficient sampling algorithm for non-smooth composite potentials. arXiv preprint arXiv:1910.00551, 2019.
• [27] Radford M Neal et al. Mcmc using hamiltonian dynamics. Handbook of markov chain monte carlo, 2(11):2, 2011.
• [28] Gareth O Roberts and Jeffrey S Rosenthal. Optimal scaling of discrete approximations to langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255–268, 1998.
• [29] Gareth O Roberts and Richard L Tweedie. Exponential convergence of langevin distributions and their discrete approximations. Bernoulli, pages 341–363, 1996.
• [30] R Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM journal on control and optimization, 14(5):877–898, 1976.
• [31] Ruoqi Shen and Yin Tat Lee. The randomized midpoint method for log-concave sampling. arXiv preprint arXiv:1909.05503, 2019.
• [32] Ruoqi Shen, Kevin Tian, and Yin Tat Lee. Composite logconcave sampling with a restricted gaussian oracle. arXiv preprint arXiv:2006.05976, 2020.
• [33] Jack W Sites Jr and Jonathon C Marshall. Delimiting species: a renaissance issue in systematic biology. Trends in Ecology & Evolution, 18(9):462–470, 2003.
• [34] Andre Wibisono. Sampling as optimization in the space of measures: The langevin dynamics as a composite optimization problem. In Conference on Learning Theory, pages 2093–3027. PMLR, 2018.
• [35] P. Wolfe. A method of conjugate subgradients for minimizing nondifferentiable functions. In Nondifferentiable optimization, pages 145–173. Springer, 1975.
• [36] Zhuoran Yang, Yufeng Zhang, Yongxin Chen, and Zhaoran Wang. Variational transport: A convergent particle-based algorithm for distributional optimization. arXiv preprint arXiv:2012.11554, 2020.

## Appendix A Technical results

###### Lemma A.1.

Useful Gaussian integrals:

• (a) for any $\lambda > 0$,

$$\int_{\mathbb{R}^d}\exp\Big(-\frac{1}{2\lambda}\|x\|^2\Big)dx = (2\pi\lambda)^{d/2};$$

• (b) for any $c > 0$ and non-negative integer n,

$$\int_0^\infty \exp(-cx^2)\,x^n\,dx = \begin{cases}\dfrac{(n-1)!!}{2^{n/2+1}c^{n/2}}\sqrt{\dfrac{\pi}{c}}, & \text{for } n \text{ even},\\[6pt] \dfrac{\big(\frac{n-1}{2}\big)!}{2c^{(n+1)/2}}, & \text{for } n \text{ odd}.\end{cases}$$
###### Lemma A.2.

Facts about the Gamma function Γ: for every positive integer k,

$$\Gamma(k) = (k-1)!, \qquad \Gamma\Big(k+\frac12\Big) = \Big(k-\frac12\Big)\Big(k-\frac32\Big)\cdots\frac12\,\sqrt{\pi}, \qquad (15)$$

and

$$\sqrt{k} < \frac{\Gamma(k+1)}{\Gamma(k+\frac12)} < \sqrt{k+\frac12}. \qquad (16)$$

Proof: The identities in (15) are well-known and hence their proofs are omitted. We give an elementary proof of (16). Let

$$I_k = \int_0^{\pi/2}\sin^k x\,dx, \quad \forall k \ge 0;$$

then integration by parts gives the recursive formula

$$I_k = \frac{k-1}{k}I_{k-2}, \quad \forall k \ge 2. \qquad (17)$$

Applying the above identity recursively and using the facts that $I_0 = \frac{\pi}{2}$ and $I_1 = 1$, we have for every $k \ge 1$,

$$I_{2k} = \frac{2k-1}{2k}\cdot\frac{2k-3}{2k-2}\cdots\frac12\cdot\frac{\pi}{2}, \qquad I_{2k+1} = \frac{2k}{2k+1}\cdot\frac{2k-2}{2k-1}\cdots\frac23,$$

and hence,

$$\frac{I_{2k}}{I_{2k+1}} = \frac{2k+1}{2}\cdot\frac{((2k)!)^2}{4^{2k}(k!)^4}\,\pi. \qquad (18)$$

It follows from the fact that $\sin^{k+1}x \le \sin^k x$ for $x \in [0, \frac{\pi}{2}]$ that $I_{k+1} \le I_k$ for every $k \ge 0$. This observation together with (17) implies that

$$1 \le \frac{I_{2k}}{I_{2k+1}} \le \frac{I_{2k-1}}{I_{2k+1}} = \frac{2k+1}{2k}.$$

Using the above inequality and (18), we have

$$\frac{1}{\sqrt{k+\frac12}} < \frac{(2k)!}{4^k(k!)^2}\sqrt{\pi} = \frac{1}{4^k}\binom{2k}{k}\sqrt{\pi} < \frac{1}{\sqrt{k}}. \qquad (19)$$

Finally, it follows from (15) that

$$\frac{\Gamma(k+\frac12)}{\Gamma(k+1)} = \frac{1}{4^k}\binom{2k}{k}\sqrt{\pi},$$

which together with (19) implies (16).

###### Proposition A.3.

For $\lambda > 0$ and $a \ge 0$, if $2a\sqrt{\lambda} \le \frac{1}{2\sqrt d}$, then

$$\int_{\mathbb{R}^d}\exp\Big(-\frac{1}{2\lambda}\|x\|^2 - 2a\|x\|\Big)dx \ge \frac{(2\pi\lambda)^{d/2}}{2}. \qquad (20)$$

Proof: Let $r = \|x\|$ and note that

$$dx = r^{d-1}\,dS_{d-1}\,dr,$$

where $dS_{d-1}$ denotes the surface element of the $(d-1)$-dimensional unit sphere. Define

$$F_{d,\lambda}(a) := \int_0^\infty\exp\Big(-\frac{1}{2\lambda}r^2 - 2ar\Big)r^d\,dr; \qquad (21)$$

then we have

$$\int\exp\Big(-\frac{1}{2\lambda}\|x\|^2 - 2a\|x\|\Big)dx = \int\exp\Big(-\frac{1}{2\lambda}r^2 - 2ar\Big)r^{d-1}\,dr\,dS_{d-1} = \frac{2\pi^{d/2}}{\Gamma(\frac d2)}F_{d-1,\lambda}(a). \qquad (22)$$

In the above, we have used the fact that the total surface area of the $(d-1)$-dimensional unit sphere is $\frac{2\pi^{d/2}}{\Gamma(d/2)}$. It follows from the definition of $F_{d,\lambda}$ in (21) that

$$\frac{dF_{d-1,\lambda}(a)}{da} = \int_0^\infty\exp\Big(-\frac{1}{2\lambda}r^2 - 2ar\Big)(-2r)\,r^{d-1}\,dr = -2F_{d,\lambda}(a) \ge -2F_{d,\lambda}(0),$$

and hence that

$$F_{d-1,\lambda}(a) \ge F_{d-1,\lambda}(0) - 2aF_{d,\lambda}(0). \qquad (23)$$

We now consider two cases: $d-1 = 2k$ and $d-1 = 2k+1$.

Case 1: $d-1 = 2k$. It follows from the definition of $F_{d,\lambda}$ in (21) and Lemma A.1(b) that

$$F_{2k,\lambda}(0) = \int_0^\infty\exp\Big(-\frac{1}{2\lambda}r^2\Big)r^{2k}\,dr = \frac{(2k-1)!!\,\lambda^k}{2}\sqrt{2\lambda\pi}, \qquad (24)$$

$$F_{2k+1,\lambda}(0) = \int_0^\infty\exp\Big(-\frac{1}{2\lambda}r^2\Big)r^{2k+1}\,dr = \frac{k!\,(2\lambda)^{k+1}}{2}.$$

Using the above two identities and (23) with $d-1 = 2k$, we have

$$F_{2k,\lambda}(a) \ge F_{2k,\lambda}(0) - 2aF_{2k+1,\lambda}(0) = \frac{(2k-1)!!\,\lambda^k}{2}\sqrt{2\lambda\pi} - a\,k!\,(2\lambda)^{k+1} = \lambda^k(2k-1)!!\left(\sqrt{\frac{\lambda\pi}{2}} - 2a\lambda\,\frac{(2k)!!}{(2k-1)!!}\right).$$

The above inequality, the fact that

$$\frac{(2k)!!}{(2k-1)!!} = \frac{\prod_{i=1}^k 2i}{\prod_{i=1}^k (2i-1)} = \frac{\prod_{i=1}^k i}{\prod_{i=1}^k (i-\frac12)} = \frac{\Gamma(k+1)}{\Gamma(k+\frac12)}\sqrt{\pi},$$

and (16) imply that

$$F_{2k,\lambda}(a) \ge \lambda^k(2k-1)!!\left(\sqrt{\frac{\lambda\pi}{2}} - 2a\lambda\sqrt{\pi}\sqrt{k+\frac12}\,\right)$$