Composite Logconcave Sampling with a Restricted Gaussian Oracle

06/10/2020 · by Ruoqi Shen, et al.

We consider sampling from composite densities on ℝ^d of the form dπ(x) ∝ exp(−f(x) − g(x))dx for well-conditioned f and convex (but possibly non-smooth) g, a family generalizing restrictions to a convex set, through the abstraction of a restricted Gaussian oracle. For f with condition number κ, our algorithm runs in O(κ²d log²(κd/ε)) iterations, each querying a gradient of f and a restricted Gaussian oracle, to achieve total variation distance ε. The restricted Gaussian oracle, which draws samples from a distribution whose negative log-likelihood sums a quadratic and g, has been previously studied and is a natural extension of the proximal oracle used in composite optimization. Our algorithm is conceptually simple and obtains stronger provable guarantees and greater generality than existing methods for composite sampling. We conduct experiments showing our algorithm vastly improves upon the hit-and-run algorithm for sampling the restriction of a (non-diagonal) Gaussian to the positive orthant.


1 Introduction

We study the problem of approximately sampling from a distribution on ℝ^d, with density

dπ(x) ∝ exp(−f(x) − g(x))dx.  (1)

Here, f is assumed to be “well-behaved” (i.e. has finite condition number), and g is a convex, but possibly non-smooth, function. This problem generalizes the special case of sampling from dπ(x) ∝ exp(−f(x))dx for well-behaved f, simply by setting g to be uniformly zero. The existing (and extensive) literature on logconcave sampling, a natural problem family with roots in Bayesian statistics, machine learning, and theoretical computer science, typically focuses on the case when the log-density is well-behaved and the distribution has support ℝ^d. Indeed, even the specialization of (1) where g indicates a convex set is not well-understood; existing bounds on mixing time for this restricted setting are large polynomials in the dimension [BDMP17, BEL18], and typically weaker than guarantees in the general logconcave setting [LV06b, LV06a], where no assumptions are made at all other than convexity of the negative log-density, and only access to a zeroth-order oracle is assumed. (Throughout, we refer to a first-order oracle for a function h as returning, on query x, the pair (h(x), ∇h(x)), whereas a zeroth-order oracle returns only h(x). Typical methods developed for sampling in the well-conditioned log-density regime are based on interacting with first-order oracles.)

Sampling from logconcave distributions and optimization of convex functions have a close relationship, which has been extensively studied [BV04, LV06a]. However, the toolkit for first-order convex optimization has to date been much more flexible in the types of problems it can handle, beyond optimizing well-conditioned functions. Examples of problem families to which efficient first-order methods for convex optimization readily generalize are constrained problems of the form min_{x ∈ S} f(x) for a convex set S, as well as the generalization

min_{x ∈ ℝ^d} f(x) + g(x).  (2)

The seminal work [BT09] extends accelerated gradient methods to solve (2) via proximal oracles, and has prompted many follow-up studies. Existence of an efficient proximal oracle is a natural measure of the “simplicity” of g in the context of composite optimization, which we now define.

Definition 1 (Proximal oracle).

O_prox(λ, v) is a proximal oracle for convex g : ℝ^d → ℝ if it returns

O_prox(λ, v) = argmin_x { (1/(2λ))‖x − v‖² + g(x) }.

In other words, a proximal oracle minimizes functions which sum a quadratic and g. It is clear from this definition that a proximal oracle can also handle arbitrary sums of linear functions and quadratics, as the resulting function can be rewritten as the sum of a constant and a single quadratic. Definition 1 is desirable as many natural non-smooth composite objectives arising in learning settings, such as the Lasso [Tib96] and elastic net [ZH05], admit efficient proximal oracles.
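To make Definition 1 concrete, here is a minimal sketch (ours, not from the paper) of the proximal oracle for g(x) = ‖x‖₁, the Lasso penalty, whose minimization has the well-known coordinatewise soft-thresholding solution:

```python
import math

def prox_l1(v, lam):
    # argmin_x { (1/(2*lam)) * ||x - v||^2 + ||x||_1 } decomposes
    # coordinatewise into soft-thresholding at level lam.
    return [math.copysign(max(abs(vi) - lam, 0.0), vi) for vi in v]
```

Coordinates within lam of zero are set to zero and larger ones are shrunk toward zero by lam, which is what makes this oracle cheap for ℓ1-type regularizers.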

1.1 Our contribution

Motivated by the success of the proximal oracle framework, we study sampling from the family (1) through the natural extension of Definition 1, which we term a “restricted Gaussian oracle”. Informally, the oracle samples from a Gaussian (with covariance a multiple of the identity) restricted by g, i.e. reweighted by e^{−g}.

Definition 2 (Restricted Gaussian oracle).

O(λ, v) is a restricted Gaussian oracle for convex g : ℝ^d → ℝ if it returns a sample from the distribution with density

dπ_{λ,v}(x) ∝ exp(−(1/(2λ))‖x − v‖² − g(x))dx.

The notion of a restricted Gaussian oracle has appeared previously [CV18, MFWB19], and its efficient implementation was a key subroutine in the fastest (zeroth-order) sampling algorithm for general logconcave distributions [CV18]. It was shown in [MFWB19] that a variety of composite distributions arising in practical applications, including coordinate-separable g, and ℓ1- or group-Lasso-regularized densities, admit such oracles. Our main result is an algorithm efficiently sampling from (1), assuming access to a restricted Gaussian oracle for g and the minimizer of f + g. (This assumption is not restrictive, as efficient algorithms minimize f + g using gradient queries to f and proximal oracle queries to g [BT09], and proximal oracle access is typically a weaker assumption than restricted Gaussian oracle access. We discuss effects of inexactness in this minimization procedure in Appendix D.)
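For coordinate-separable g, Definition 2 likewise admits simple implementations, since the restricted Gaussian factors into one-dimensional problems. A sketch (our own illustration, with names of our choosing) for g the indicator of the positive orthant:

```python
import random

def rgo_orthant(v, lam, rng=random.Random(0)):
    # Samples x with density proportional to
    #   exp(-||x - v||^2 / (2*lam)) * 1{x >= 0 coordinatewise},
    # i.e. one restricted Gaussian oracle query for the orthant indicator.
    # Each coordinate is a 1-d truncated Gaussian; plain rejection
    # suffices for this sketch (inverse-CDF samplers are preferable
    # when a coordinate of v is far below zero).
    out = []
    for vi in v:
        z = rng.gauss(vi, lam ** 0.5)
        while z < 0.0:
            z = rng.gauss(vi, lam ** 0.5)
        out.append(z)
    return out
```

Because the oracle only ever sees a quadratic plus g, separable structure in g is inherited directly by the sampling problem.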

Theorem 1.

Consider a distribution π of the form (1), where f has condition number κ, and convex g admits a restricted Gaussian oracle O. Also, assume we know the minimizer x* of f + g. Algorithm 1, Composite-Sample, samples from a distribution within total variation distance ε of π, in O(κ²d log²(κd/ε)) iterations. Each iteration queries ∇f and O an expected constant number of times.

Recent work [MFWB19] also considered the problem of composite sampling via a restricted Gaussian oracle. However, that work also assumed access to the normalization constant of the restricted Gaussian, as well as Lipschitzness of g, amongst other criteria. Our result, Theorem 1, holds with no additional assumptions other than the relevant oracle access, including in the absence of a warm start. While there remains a gap between our runtime of Õ(κ²d) and recent runtimes of Õ(κd) in the non-composite setting [LST20] (throughout, the Õ notation hides logarithmic factors in κ, d, and 1/ε), our algorithm substantially improves upon prior composite sampling work in both generality and runtime guarantees. We believe this provides evidence that the restricted Gaussian oracle is a useful abstraction in studying logconcave sampling with composite potentials.

Finally, we remark that although our method follows several reductions, each is conceptually lightweight (as discussed in the following section) and easily implementable via either a rejection sampling procedure or oracle calls. To demonstrate this empirically, we evaluate our method for the task of sampling a (non-diagonal) Gaussian restricted to the positive orthant in Section 4.

1.2 Technical overview

We now survey the main components in the development of our algorithm.

Reduction to the shared-minimizer case. We first observe that we can without loss of generality assume that f and g share a minimizer. In particular, by shifting both functions by a linear term, i.e. f̃(x) = f(x) − ⟨∇f(x*), x⟩ and g̃(x) = g(x) + ⟨∇f(x*), x⟩, where x* is the minimizer of f + g, first-order optimality implies both f̃ and g̃ are minimized by x*. Moreover, implementation of a first-order oracle for f̃ and a restricted Gaussian oracle for g̃ are immediate without additional assumptions. This modification becomes crucial for our later developments, and we expect this simple observation, reminiscent of “variance reduction” techniques in stochastic optimization [JZ13], to be broadly applicable to improving algorithms for the sampling problem induced by (1).
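A one-dimensional numerical illustration of this linear shift (a toy instance of our own choosing: f(x) = (x − 1)² and g(x) = |x|, so the composite minimizer is x* = 1/2):

```python
def numerical_grad(h, x, eps=1e-6):
    # Central finite-difference derivative of h at x.
    return (h(x + eps) - h(x - eps)) / (2 * eps)

f = lambda x: (x - 1.0) ** 2    # smooth part
g = lambda x: abs(x)            # non-smooth part
x_star = 0.5                    # minimizer of f + g: 2(x - 1) + 1 = 0

grad_f_star = numerical_grad(f, x_star)      # equals -1 for this f
f_tilde = lambda x: f(x) - grad_f_star * x   # shifted smooth part
g_tilde = lambda x: g(x) + grad_f_star * x   # shifted composite part

# The sum f_tilde + g_tilde is unchanged, f_tilde now has zero gradient
# at x_star, and g_tilde (= |x| - x here) is minimal on [0, inf),
# a set containing x_star.
```

Note that g̃ need not have a unique minimizer; it suffices that x* lies in its minimizing set.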

Beyond Moreau envelopes: expanding the space. A typical approach in convex optimization to handling a non-smooth objective g is to instead optimize its Moreau envelope, defined by

g_η(x) = min_y { g(y) + (1/(2η))‖x − y‖² }.  (3)

Intuitively, the envelope trades off function value with proximity to x; a standard exercise shows that g_η is smooth (has a Lipschitz gradient), with smoothness depending on η, and moreover that computing gradients of g_η is equivalent to calling a proximal oracle (Definition 1). It is natural to extend this idea to the composite sampling setting, e.g. via sampling from the density

dπ_η(x) ∝ exp(−f(x) − g_η(x))dx.

However, a variety of complications prevent such strategies from obtaining rates comparable to their noncomposite, well-conditioned counterparts, including difficulty in bounding closeness of the resulting distribution, as well as bias in the drift of the sampling process due to error in gradients.
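For concreteness, a sketch (our own code; the closed forms are standard, not from the paper) of the envelope (3) for g(x) = |x|, which is the Huber function, with gradient computable from one proximal-oracle call:

```python
def moreau_env_abs(x, eta):
    # g_eta(x) = min_y { |y| + (x - y)^2 / (2*eta) }: the Huber function.
    return x * x / (2 * eta) if abs(x) <= eta else abs(x) - eta / 2

def grad_moreau_abs(x, eta):
    # grad g_eta(x) = (x - prox(x)) / eta, where the prox of |.| at
    # level eta is soft-thresholding (Definition 1).
    prox = 0.0 if abs(x) <= eta else (abs(x) - eta) * (1.0 if x > 0 else -1.0)
    return (x - prox) / eta
```

The gradient is (1/η)-Lipschitz, illustrating the tradeoff in η between smoothness and fidelity to g.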

Our approach departs from this smoothing strategy in a crucial way, inspired by Hamiltonian Monte Carlo (HMC) methods [Kra40, Nea11]. Hamiltonian Monte Carlo can be seen as a discretization of the ubiquitous Langevin dynamics, on an expanded space. In particular, discretizations of Langevin dynamics simulate the stochastic differential equation dx_t = −∇f(x_t)dt + √2 dW_t, where W_t is Brownian motion. HMC methods instead simulate dynamics on an extended space, via an auxiliary “velocity” variable which accumulates gradient information. This is sometimes interpreted as a discretization of the underdamped Langevin dynamics [CCBJ18]. HMC often has desirable stability properties, and the strategy of expanding the dimension via an auxiliary variable has been used in algorithms obtaining the fastest rates in the well-conditioned logconcave sampling regime [SL19, LST20]. Inspired by this phenomenon, we consider the density on ℝ^d × ℝ^d,

dπ̂(x, y) ∝ exp(−f(y) − g(x) − (1/(2η))‖x − y‖²)dx dy.  (4)

For technical reasons, the family of distributions we use in our final algorithms is of slightly different form than (4), but this simplification is useful to build intuition. Note in particular that the form of (4) is directly inspired by (3): rather than maximizing over the auxiliary variable (as the envelope does in density space), we directly expand the space. The idea is that for small enough η, on a set of large measure, smoothness of f guarantees that the x-marginal of (4) concentrates near the density proportional to e^{−f−g}, a fact we make rigorous. To sample from (1), we then show that a rejection filter applied to a sample from the x-marginal of (4) terminates in an expected constant number of steps. Consequently, it suffices to develop a fast sampler for (4).

Alternating sampling with an oracle. The form of the distribution (4) suggests a natural strategy for sampling from it: starting from a current state x, we iterate

  1. Sample y from the conditional density proportional to exp(−f(y) − (1/(2η))‖y − x‖²).

  2. Sample x from the conditional density proportional to exp(−g(x) − (1/(2η))‖x − y‖²), via a restricted Gaussian oracle.

When f and g share a minimizer, taking a first-order approximation in the first step, i.e. sampling y ∼ N(x − η∇f(x), ηI), can be shown to be a generalization of the leapfrog step of Hamiltonian Monte Carlo updates. However, for very small η (as in our setting), we observe that the first step itself reduces to sampling from a distribution with constant condition number, which can be performed in few gradient calls by e.g. Metropolized HMC [DCWY18, CDWY19, LST20]. Moreover, it is not hard to see that this “alternating marginal” sampling strategy preserves the stationary distribution exactly, so no filtering is necessary. Directly bounding the conductance of this random walk, for small enough η, leads to an algorithm whose iteration count is polynomially worse than that of Theorem 1, each iteration calling a restricted Gaussian oracle once, and a gradient oracle for f a small number of times. This latter guarantee is by an appeal to known bounds [CDWY19, LST20] on the mixing time in high dimensions of Metropolized HMC for a well-conditioned distribution, a property satisfied by the conditional sampled in the first step for small η.
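The two alternating steps can be sketched in one dimension (a toy instantiation of our own, not the paper's Algorithm 3: f(y) = y²/2 and g the indicator of [0, ∞), so both conditionals are available exactly):

```python
import random

def alternating_sampler(steps, eta=0.05, rng=random.Random(1)):
    # Gibbs-style alternation for the joint density proportional to
    #   exp(-f(y) - g(x) - (x - y)^2 / (2*eta)),
    # with f(y) = y^2/2 and g the indicator of [0, inf). The x-marginal
    # then approximates the standard Gaussian restricted to [0, inf).
    x, samples = 1.0, []
    for _ in range(steps):
        # Step 1: y | x is exactly N(x/(1+eta), eta/(1+eta)) here;
        # in general this conditional is well-conditioned for small eta.
        y = rng.gauss(x / (1.0 + eta), (eta / (1.0 + eta)) ** 0.5)
        # Step 2: x | y via a restricted Gaussian oracle:
        # N(y, eta) restricted to [0, inf), sampled by rejection.
        x = rng.gauss(y, eta ** 0.5)
        while x < 0.0:
            x = rng.gauss(y, eta ** 0.5)
        samples.append(x)
    return samples
```

As steps grows, the empirical mean of the samples should approach that of the half-normal distribution (about 0.80); smaller η tightens the approximation but slows mixing, matching the tradeoff discussed above.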

Stability of Gaussians under bounded perturbations. To obtain our tightest runtime result, we use that η is chosen much smaller than 1/L to show structural results about distributions of the form (4), yielding tighter concentration for bounded perturbations of a Gaussian (i.e. the Gaussian has covariance ηI, and is restricted by e^{−f} for L-smooth f with η ≪ 1/L). To illustrate, let ρ be the density proportional to exp(−f(y) − (1/(2η))‖y − x‖²), and let its mean and mode be ȳ and y* respectively. It is standard that E_ρ‖y − ȳ‖² ≤ ηd, by η^{−1}-strong logconcavity of ρ. Informally, we show that for sufficiently small η and x not too far from the minimizer of f, we can improve this substantially; see Proposition 8 for a precise statement.

Using our structural results, we sharpen conductance bounds, improve the warmness of a starting distribution, and develop a simple rejection sampling scheme for sampling the y variable in an expected constant number of gradient queries. These improvements lead to our main result, an algorithm running in Õ(κ²d) iterations. Our proofs are continuous in flavor and based on gradually perturbing the Gaussian and solving a differential inequality; we believe they may be of independent interest.

1.3 Related work

The broad problem of sampling from a logconcave distribution (with no assumptions beyond convexity of the negative log-density) has attracted much interest in the theoretical computer science community, as it generalizes uniform sampling from a convex set. General bounds under zeroth-order query access imply logconcave distributions are samplable in polynomial time, even in the absence of a warm start [LV06a]. For densities with more favorable structure, however, the first-order access model is attractive, as it can exploit said structure.

Since the seminal work of [Dal17], an exciting research direction has studied first-order random walks for distributions with well-behaved log-densities, developing guarantees under assumptions such as Lipschitz derivatives of different orders [CCBJ18, DR18, CV19, CDWY19, DCWY18, DM19, DMM19, LSV18, MMW19, SL19, LST20]. To our knowledge, when the log-density has condition number κ (with no other assumptions), the best-known guarantee for obtaining total variation distance ε is Õ(κd) first-order oracle calls [LST20], and for obtaining small 2-Wasserstein distance (measured relative to √(d/μ), the scale-invariant effective diameter of a μ-strongly logconcave distribution) the best-known guarantees are via the randomized midpoint method [SL19]. These results do not typically generalize beyond the case when the support of the distribution is all of ℝ^d, prompting study of a more flexible distribution family.

Towards this goal, recent works studied sampling from densities of the form (1), or its specializations (e.g. restrictions to a convex set). Several [Per16, BDMP17, Ber18] are based on Moreau envelope or proximal regularization strategies, and demonstrate efficiency under more stringent assumptions on the structure of the composite function g, but under minimal assumptions obtain fairly large provable mixing times. Algorithms derived from proximal regularization have also been considered for non-composite sampling [Wib19]. Another discretization strategy based on projections was studied by [BEL18], but its obtained mixing time is also a large polynomial in the problem parameters. Finally, improved algorithms for special constrained sampling problems have been proposed, such as simplex restrictions [HKRC18].

Of particular relevance and inspiration to this work is the algorithm of [MFWB19]. By generalizing and adapting the Metropolized HMC algorithms of [DCWY18, CDWY19], adopting a Moreau envelope strategy, and using (a stronger version of) the restricted Gaussian oracle access model, [MFWB19] obtained a runtime which in the best case scales similarly to our guarantee. However, this result required a variety of additional assumptions, such as access to the normalization factor of restricted Gaussians, Lipschitzness of g, warmness of the start, and various problem parameter tradeoffs. The general problem of sampling from (1) under minimal assumptions more efficiently than general-purpose logconcave algorithms was, to the best of our knowledge, unresolved prior to this work (even under restricted Gaussian oracle access); resolving it is a novel contribution of our method and mixing time bound.

1.4 Roadmap

Section 3 states our algorithm and subroutines, and provides a proof of Theorem 1 assuming various properties of our process. We demonstrate empirical performance of our method in Section 4. We defer proofs of technical ingredients to appendices, but give strategy overviews in the body.

2 Preliminaries

General notation.

For n ∈ ℕ, [n] denotes the set of naturals {1, 2, …, n}. We use ⪯ to denote the Loewner order on symmetric matrices, I to denote the identity matrix of appropriate dimension, and ‖·‖ to mean the Euclidean norm. N(μ, Σ) is the Gaussian density with specified mean and covariance.

Functions.

We call differentiable f : ℝ^d → ℝ L-smooth if it has an L-Lipschitz gradient, i.e. ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for all x, y. If f is twice-differentiable, it is well-known this implies ∇²f(x) ⪯ LI for all x. We say twice-differentiable f is μ-strongly convex if ∇²f(x) ⪰ μI everywhere. When a function is L-smooth and μ-strongly convex, we define its condition number κ := L/μ. Strong convexity and smoothness respectively imply, for all x, y,

(μ/2)‖y − x‖² ≤ f(y) − f(x) − ⟨∇f(x), y − x⟩ ≤ (L/2)‖y − x‖².

Distributions.

We say distribution π is logconcave if dπ(x) ∝ exp(−F(x))dx for some convex function F; it is μ-strongly logconcave if its negative log-density is μ-strongly convex. It is known that μ-strong logconcavity implies sub-Gaussian tails with parameter 1/μ (e.g. [DCWY18], Lemma 1). For A ⊆ ℝ^d, π(A) := ∫_A dπ(x); we denote the complement of A by A^c. We say distribution π is β-warm with respect to π′ if (dπ/dπ′)(x) ≤ β everywhere. The total variation distance between two distributions π and π′ is ‖π − π′‖_TV := sup_A |π(A) − π′(A)|. Finally, for a density π on ℝ^d and function h, E_π[h] := ∫ h(x) dπ(x).

3 Algorithm

In this section, we state the components of our method. Throughout, fix a distribution π with density

dπ(x) ∝ exp(−f(x) − g(x))dx,  (5)

where f is L-smooth and μ-strongly convex and g is convex. Observe that distribution π is μ-strongly logconcave. We assume that we have precomputed the minimizer x* of f + g; see the discussion in Section 1.1. Our algorithm proceeds in stages following the outline in Section 1.2, which are put together in Section 3.4 to prove Theorem 1.

  1. Composite-Sample is reduced to Composite-Sample-Shared-Min, which takes as input a distribution with negative log-density f̃ + g̃, where f̃ and g̃ share a minimizer; this reduction is given in Section 3.1, and the remainder of the paper handles the shared-minimizer case.

  2. The algorithm Composite-Sample-Shared-Min is a rejection sampling scheme built on top of sampling from a joint distribution π̂ on ℝ^d × ℝ^d whose x-marginal approximates π. We give this reduction in Section 3.2.

  3. The bulk of our analysis is for Sample-Joint-Dist, an alternating marginal sampling algorithm for sampling from π̂. To implement marginal sampling, it alternates calls to O and a rejection sampling algorithm Sample-Y. We prove its correctness in Section 3.3.

Input: Distribution π of form (5), x* minimizing the negative log-density of π, ε ∈ (0, 1).
Output: Sample x from a distribution π′ with ‖π′ − π‖_TV ≤ ε.

1: f̃(x) ← f(x) − ⟨∇f(x*), x⟩, g̃(x) ← g(x) + ⟨∇f(x*), x⟩
2: return Composite-Sample-Shared-Min applied to (f̃, g̃), with minimizer x* and accuracy ε
Algorithm 1: Composite-Sample

Input: Distribution π of form (5), where f and g are both minimized by x*, ε ∈ (0, 1).
Output: Sample x from a distribution π′ with ‖π′ − π‖_TV ≤ ε.

1:while true do
2:     Define the set
(6)
3:     
4:     if  then
5:         
6:         
7:         
8:         
9:         if  then
10:              return
11:         end if
12:     end if
13:end while
Algorithm 2: Composite-Sample-Shared-Min

Input: ε ∈ (0, 1), distribution π of form (5) with f and g both minimized by x*, step size η, iteration count T, restricted Gaussian oracle O for g.
Output: Sample x from a distribution π′ with ‖π′ − π̂‖_TV ≤ ε, where we overload π̂ to mean the marginal of (7) on the x variable.

1:
2:Let π̂ be the density on ℝ^d × ℝ^d with
(7)
3:Call O to sample x_0, for
(8)
4: (see Remark 1)
5:for  do
6:     Call Sample-Y to sample y_k (Algorithm 4), for
(9)
7:     Call O to sample x_k, for
(10)
8:end for
9:return
Algorithm 3: Sample-Joint-Dist

3.1 Reduction from Composite-Sample to Composite-Sample-Shared-Min

Correctness of Composite-Sample is via the following properties, whose proofs are in Appendix A.

Proposition 1.

Let f̃ and g̃ be defined as in Composite-Sample.

  1. The density proportional to exp(−f̃(x) − g̃(x)) is the same as the density π.

  2. Assuming first-order (function and gradient evaluation) access to f, and restricted Gaussian oracle access to g, we can implement the same accesses to f̃ and g̃, with constant overhead.

  3. f̃ and g̃ are both minimized by x*.

3.2 Reduction from Composite-Sample-Shared-Min to Sample-Joint-Dist

Composite-Sample-Shared-Min is a rejection sampling scheme, which accepts samples from the subroutine Sample-Joint-Dist lying in the high-probability region Ω defined in (6). We give a general analysis for approximate rejection sampling in Appendix A.3.1, and Appendix A.3.2 bounds relationships between the distributions π and π̂, defined in (5) and (7) respectively (i.e. relative densities and normalization constant ratios). Combining these pieces proves the following main claim.

Proposition 2.

Let ε ∈ (0, 1), and assume Sample-Joint-Dist samples within the required total variation distance of the x-marginal of (7). Composite-Sample-Shared-Min outputs a sample within total variation distance ε of (5) in an expected constant number of calls to Sample-Joint-Dist.

3.3 Implementing Sample-Joint-Dist

Sample-Joint-Dist alternates between sampling marginals of the joint distribution π̂, as seen in definitions (9), (10). In Appendix A.4.1, we give a short proof that marginal sampling attains the correct stationary distribution. We bound the conductance of the induced walk on iterates by combining an isoperimetry bound with a total variation guarantee between transitions of nearby points in Appendix A.4.2. Finally, we give a simple rejection sampling scheme, Sample-Y (Algorithm 4), for implementing the step (9). Since the distribution sampled in (9) is a bounded perturbation of a Gaussian (intuitively, f is L-smooth and η is much smaller than 1/L), we show that in a high-probability region, rejection sampling from the sum of a first-order approximation to f and the Gaussian succeeds in an expected constant number of iterations.
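A one-dimensional sketch of such a rejection scheme (our illustration of the idea, not the paper's exact Sample-Y): propose from the Gaussian obtained by linearizing f at the conditioning point; convexity of f makes the acceptance ratio at most one, and η ≪ 1/L keeps the expected number of proposals constant.

```python
import math, random

def sample_y(f, grad_f, x_bar, eta, rng=random.Random(2)):
    # Target: density proportional to exp(-f(y) - (y - x_bar)^2 / (2*eta)).
    # Proposal: replace f by its linearization at x_bar; completing the
    # square gives the Gaussian N(x_bar - eta * grad_f(x_bar), eta).
    while True:
        y = rng.gauss(x_bar - eta * grad_f(x_bar), math.sqrt(eta))
        # Bregman divergence of f: nonnegative by convexity, and of
        # order L * eta near x_bar, so acceptance is likely.
        bregman = f(y) - f(x_bar) - grad_f(x_bar) * (y - x_bar)
        if rng.random() < math.exp(-bregman):
            return y
```

For instance, with f(y) = y²/2 (so L = 1) and η = 0.01, the target conditional is exactly Gaussian and almost every proposal is accepted.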

Remark 1.

For simplicity of presentation, we were conservative in bounding constants throughout; in practice (cf. Section 4), we found that the constant in Line 4 is orders of magnitude larger than necessary (a small constant sufficed). Several constants were inherited from prior analyses, which we do not rederive to save on redundancy.

We now give a complete guarantee on the complexity of Sample-Joint-Dist.

Proposition 3.

Sample-Joint-Dist outputs a point with distribution within the desired total variation distance from the x-marginal of π̂. The expected number of gradient queries per iteration is constant.

3.4 Putting it all together: Proof of Theorem 1

We show Theorem 1 follows from the guarantees of Propositions 1, 2, and 3. By observing the value of T in Sample-Joint-Dist, we see that the number of total iterations in each call to Sample-Joint-Dist matches the bound of Theorem 1. Proposition 3 also shows that in every iteration, we require an expected constant number of gradient queries and calls to O, the restricted Gaussian oracle for g, and that the resulting distribution is within the desired total variation distance of the x-marginal of π̂. Next, Proposition 2 implies that the number of calls to Sample-Joint-Dist in a run of Composite-Sample-Shared-Min is bounded by a constant in expectation, and the resulting point is within total variation distance ε of the original distribution π. Finally, Proposition 1 shows sampling from a general distribution of the form (1) is reducible to one call of Composite-Sample-Shared-Min, and the requisite oracles are implementable.

4 Experiments

We test our algorithm on the problem of sampling from a Gaussian restricted to an orthant. Formally, for a Gaussian with mean μ and covariance Σ, and a random orthant O (coordinatewise sign restrictions on x; this generalizes the case of the positive orthant by changing signs of coordinates appropriately), we consider sampling from the distribution

dπ(x) ∝ exp(−(1/2)(x − μ)ᵀΣ⁻¹(x − μ)) · 1_{x ∈ O} dx,

where the indicator 1_{x ∈ O} is 1 if x ∈ O and 0 otherwise. This problem is motivated by applications in posterior estimation with side information that the variable of interest has sign constraints, e.g. in physics simulations [NBD18]. For such distributions with nondiagonal covariances, sampling in the high-dimensional regime can be challenging, and to our knowledge no high-accuracy practical samplers exist for this fundamental problem.

We verify the correctness of our algorithm by comparing against the output of naïve rejection sampling (accepting samples lying in O) on Gaussian distributions with random covariance and random mean in low dimensions, where we can meaningfully plot histograms. We defer this test to Appendix E.
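That low-dimensional baseline can be sketched as follows (our code; the mean, covariance factor, and sign pattern in the test are placeholders of our own choosing):

```python
import random

def rejection_orthant_gaussian(n, mean, chol, signs, rng=random.Random(4)):
    # Draw x = mean + chol @ z for standard normal z; accept when every
    # coordinate matches the orthant's sign pattern. Only viable in low
    # dimension: the acceptance probability decays exponentially in d.
    d, out = len(mean), []
    while len(out) < n:
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        x = [mean[i] + sum(chol[i][j] * z[j] for j in range(d))
             for i in range(d)]
        if all(s * xi >= 0.0 for s, xi in zip(signs, x)):
            out.append(x)
    return out
```

The exponentially small acceptance probability in high dimension is exactly why such a baseline only serves as a low-dimensional correctness check.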

In high dimensions, we show our algorithm vastly improves upon the hit-and-run method [LV06b], the most efficient general-purpose logconcave sampler in practice. Hit-and-run has a mixing time that is polynomial in d, both theoretically [LV06b] and empirically. We test our algorithm on randomly generated Gaussian distributions with dense covariance matrices. For fair comparison to hit-and-run (which works on well-rounded distributions), the condition numbers of all randomly generated Gaussian distributions are small constants, with smoothness parameters of unit scale. The main tunable parameter in our algorithm is the step size η, which we chose so that both Sample-Y and Sample-Joint-Dist reject with at most small constant probability.

In Figure 1, we compare the mixing times of our algorithm and hit-and-run, and show the dependence on the dimension d. The mixing criterion used was that the process attains a target effective sample size in all coordinates. To ensure a stable scaling of the mixing time, we use a fixed mean. We show that our algorithm improves upon hit-and-run by a factor growing polynomially in the dimension, which corroborates our theoretical analysis.

In Figure 2, we plot the autocorrelation of the two algorithms’ trajectories in high dimension, projected on a random unit direction. We show that in the very high-dimensional regime, our algorithm can converge significantly faster than hit-and-run. In this experiment, each coordinate of the mean is chosen uniformly at random from a fixed interval, and the step size of our algorithm is tuned accordingly. We include an autocorrelation plot of shorter trajectories illustrating the mixing of our algorithm in Appendix E.

Figure 1: Dimension dependence of the mixing times of Composite-Sample and Hit-and-Run, averaged over several runs.
Figure 2: Autocorrelation plots of the two algorithms in high dimension.

Acknowledgments

We thank Yair Carmon for suggesting the experiment in Section 4.

References

  • [BDMP17] Nicolas Brosse, Alain Durmus, Eric Moulines, and Marcelo Pereyra. Sampling from a log-concave distribution with compact support with proximal langevin monte carlo. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, Amsterdam, The Netherlands, 7-10 July 2017, pages 319–342, 2017.
  • [BEL18] Sébastien Bubeck, Ronen Eldan, and Joseph Lehec. Sampling from a log-concave distribution with projected langevin monte carlo. Discret. Comput. Geom., 59(4):757–783, 2018.
  • [Ber18] Espen Bernton. Langevin monte carlo and JKO splitting. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pages 1777–1798, 2018.
  • [BT09] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2(1):183–202, 2009.
  • [BV04] Dimitris Bertsimas and Santosh S. Vempala. Solving convex programs by random walks. J. ACM, 51(4):540–556, 2004.
  • [CCBJ18] Xiang Cheng, Niladri S. Chatterji, Peter L. Bartlett, and Michael I. Jordan. Underdamped langevin MCMC: A non-asymptotic analysis. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pages 300–323, 2018.
  • [CDWY19] Yuansi Chen, Raaz Dwivedi, Martin J. Wainwright, and Bin Yu. Fast mixing of metropolized hamiltonian monte carlo: Benefits of multi-step gradients. CoRR, abs/1905.12247, 2019.
  • [CV18] Ben Cousins and Santosh S. Vempala. Gaussian cooling and O*(n³) algorithms for volume and gaussian volume. SIAM J. Comput., 47(3):1237–1273, 2018.
  • [CV19] Zongchen Chen and Santosh S. Vempala. Optimal convergence rate of hamiltonian monte carlo for strongly logconcave distributions. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2019, September 20-22, 2019, Massachusetts Institute of Technology, Cambridge, MA, USA, pages 64:1–64:12, 2019.
  • [Dal17] Arnak Dalalyan. Theoretical guarantees for approximate sampling from smooth and logconcave densities. Journal of the Royal Statistical Society, Series B (Methodological), 79(3):651–676, 2017.
  • [DCWY18] Raaz Dwivedi, Yuansi Chen, Martin J. Wainwright, and Bin Yu. Log-concave sampling: Metropolis-hastings algorithms are fast! In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pages 793–797, 2018.
  • [DM19] Alain Durmus and Éric Moulines. High-dimensional bayesian inference via the unadjusted langevin algorithm. Bernoulli, 25(4A):2854–2882, 2019.
  • [DMM19] Alain Durmus, Szymon Majewski, and Blazej Miasojedow. Analysis of langevin monte carlo via convex optimization. J. Mach. Learn. Res., 20:73:1–73:46, 2019.
  • [DR18] Arnak S. Dalalyan and Lionel Riou-Durand. On sampling from a log-concave density using kinetic langevin diffusions. CoRR, abs/1807.09382, 2018.
  • [GMT06] Sharad Goel, Ravi Montenegro, and Prasad Tetali. Mixing time bounds via the spectral profile. Electronic Journal of Probability, 11:1–26, 2006.
  • [Har04] Gilles Hargé. A convex/log-concave correlation inequality for Gaussian measure and an application to abstract Wiener spaces. Probab. Theory Related Fields, 130(3):415–440, 2004.
  • [HKRC18] Ya-Ping Hsieh, Ali Kavis, Paul Rolland, and Volkan Cevher. Mirrored langevin dynamics. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 2883–2892, 2018.
  • [JZ13] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 315–323, 2013.
  • [KLM06] Ravi Kannan, László Lovász, and Ravi Montenegro. Blocking conductance and mixing in random walks. Combinatorics, Probability & Computing, 15(4):541–570, 2006.
  • [Kra40] Hendrik Anthony Kramers. Brownian motion in a field of force and the diffusion model of chemical reactions. Physica, 7(4):284–304, 1940.
  • [LK99] László Lovász and Ravi Kannan. Faster mixing via average conductance. In Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing, May 1-4, 1999, Atlanta, Georgia, USA, pages 282–287, 1999.
  • [LPW09] David Asher Levin, Yuval Peres, and Elizabeth Wilmer. Markov Chains and Mixing Times. American Mathematical Society, 2009.
  • [LST20] Yin Tat Lee, Ruoqi Shen, and Kevin Tian. Logsmooth gradient concentration and tighter runtimes for metropolized hamiltonian monte carlo. CoRR, abs/2002.04121, 2020.
  • [LSV18] Yin Tat Lee, Zhao Song, and Santosh S. Vempala. Algorithmic theory of odes and sampling from well-conditioned logconcave densities. CoRR, abs/1812.06243, 2018.
  • [LV06a] László Lovász and Santosh S. Vempala. Fast algorithms for logconcave functions: Sampling, rounding, integration and optimization. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), 21-24 October 2006, Berkeley, California, USA, Proceedings, pages 57–68, 2006.
  • [LV06b] László Lovász and Santosh S. Vempala. Hit-and-run from a corner. SIAM J. Comput., 35(4):985–1005, 2006.
  • [MFWB19] Wenlong Mou, Nicolas Flammarion, Martin J. Wainwright, and Peter L. Bartlett. An efficient sampling algorithm for non-smooth composite potentials. CoRR, abs/1910.00551, 2019.
  • [MMW19] Wenlong Mou, Yi-An Ma, Martin J. Wainwright, Peter L. Bartlett, and Michael I. Jordan. High-order langevin diffusion yields an accelerated MCMC algorithm. CoRR, abs/1908.10859, 2019.
  • [NBD18] P Norgaard, E Baltz, M Dikovsky, I Langmore, T Madams, J Romero, M Thompson, E Trask, and H Gota. Application of bayesian inference for reconstruction of frc plasma state in c-2w. In APS Meeting Abstracts, 2018.
  • [Nea11] Radford M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
  • [Per16] Marcelo Pereyra. Proximal markov chain monte carlo algorithms. Stat. Comput., 26(4):745–760, 2016.
  • [SL19] Ruoqi Shen and Yin Tat Lee. The randomized midpoint method for log-concave sampling. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 2098–2109, 2019.
  • [Tib96] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267–288, 1996.
  • [Wib19] Andre Wibisono. Proximal langevin algorithm: Rapid convergence under isoperimetry. CoRR, abs/1911.01469, 2019.
  • [ZH05] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B (Methodological), 67(2):301–320, 2005.

Appendix A Deferred proofs from Section 3

a.1 Technical facts

We will repeatedly use the following facts throughout this paper.

Fact 1 (Gaussian integral).

For any and ,

Fact 2 ([Har04], Theorem 1.1).

Let be a -strongly logconcave density. Let be the Gaussian density with covariance matrix . For any convex function ,

Fact 3 ([Dcwy18], Lemma 1).

Let be a -strongly logconcave distribution, and let minimize its negative log-density. Then, for and any , with probability at least ,

(11)
Fact 4 ([Dm19], Theorem 1).

Let be a -strongly logconcave distribution, and let minimize its negative log-density. Then, .

a.2 Deferred proofs from Section 3.1

See 1

Proof.

For and with properties as in (5), with minimizing , define the functions

and observe that everywhere. This proves the first claim. Further, implementation of a first-order oracle for and a restricted Gaussian oracle for are immediate assuming a first-order oracle for and a restricted Gaussian oracle for , showing the second claim; any quadratic shifted by a linear term is the sum of a quadratic and a constant. We now show and have the same minimizer. By strong convexity, has a unique minimizer; first-order optimality shows that

so this unique minimizer is . Moreover, optimality of for implies that for all ,

Here, is a subgradient. This shows first-order optimality of for also, so minimizes . ∎

a.3 Deferred proofs from Section 3.2

a.3.1 Approximate rejection sampling

We first define the rejection sampling framework we will use, and prove various properties.

Definition 3 (Approximate rejection sampling).

Let be a distribution, with . Suppose set has , and distribution with has for some ,

Suppose there is an algorithm which draws samples from a distribution , such that . We call the following scheme approximate rejection sampling: repeat independent runs of the following procedure until a point is outputted.

  1. Draw via until .

  2. With probability , output .

Lemma 1.

Consider an approximate rejection sampling scheme with relevant parameters defined as in Definition 3, with . The algorithm terminates in at most

(12)

calls to in expectation, and outputs a point from a distribution with .

Proof.

Define for notational simplicity normalization constants and . First, we bound the probability any particular call to returns in the scheme:

(13)

The second line followed by the definitions of and , and the third followed by triangle inequality, the assumed lower bound on , and the total variation distance between and . By linearity of expectation and independence, this proves the first claim.

Next, we claim the output distribution is close in total variation distance to the conditional distribution of restricted to . The derivation of (13) implies

(14)

Thus, the total variation of the true output distribution from restricted to is

The first inequality was triangle inequality, and we bounded the second term by (14). To obtain the final equality, we used

We now bound this final term. Observe that the given conditions imply that is bounded by everywhere in . Thus, expanding we have

Finally, combining these guarantees, and the fact that restricting to loses in total variation distance, yields the desired conclusion by triangle inequality. ∎

Corollary 1.

Let

be an unbiased estimator for

, and suppose with probability 1 for all . Then, implementing the procedure of Definition 3 with acceptance probability has the same runtime bound and total variation guarantee as given by Lemma 1.

Proof.

It suffices to take expectations over the randomness of everywhere in the proof of Lemma 1. ∎

a.3.2 Distribution ratio bounds

We next show two bounds relating the densities of distributions and . We first define the normalization constants of (5), (7) for shorthand, and then tightly bound their ratio.

Definition 4 (Normalization constants).

We denote normalization constants of and by

Lemma 2 (Normalization constant bounds).

Let and be as in Definition 4. Then,

Proof.

For each , by convexity we have

(15)

Integrating both sides over yields the upper bound on . Next, for the lower bound we have a similar derivation. For each , by smoothness

Integrating both sides over yields

The last inequality followed from Proposition 7, where we used is -strongly convex. ∎

Lemma 3 (Relative density bounds).

Let . For all , as defined in (6), . Here, denotes the marginal density of . Moreover, for all , .

Proof.

We first show the upper bound. By Lemma 2,

(16)