Sampling from a distribution is a crucial computational step in several domains such as Bayesian inference, prediction in adversarial environments, Monte Carlo approximation and reinforcement learning. In the high dimensional setting, a common approach is to use Markov chain Monte Carlo (MCMC) methods to generate a sample. Among these methods, particularly useful and widely applicable are MCMC algorithms that generate a sample from a probability distribution with a density on a continuous state space robert2013monte. A broad class of such algorithms are first-order (gradient) methods. These methods operate by using the gradient of the log density at different points in the space.
Recently, the focus has been on methods that work with stochasticestimates of the gradient given the emergence of large-scale datasets welling2011bayesian. On these datasets it is computationally challenging to calculate the exact gradient over the entire data. An example of this, of course, is the case of sampling from a posterior distribution when the log density is sum decomposable over a large dataset.
While there is a large body of work studying MCMC methods in this setting (refer to brooks2011handbook for a survey of the results in this area), classically, little was known about the explicit, non-asymptotic dependence of the iteration complexity—the number of steps of the method needed to get a good sample—of these sampling algorithms in terms of the dimension and the target accuracy. The seminal work of dalalyan2017theoretical established a non-asymptotic bound on the iteration complexity for the Unadjusted Langevin algorithm (ULA) ermak1975computer,parisi1981correlation,grenander1994representations,roberts1996exponential in terms of the dimension and the target accuracy when the log density is sufficiently regular. This has kicked off a slew of research in this field and we now have many upper bounds on the non-asymptotic iteration complexity of several popular first-order sampling methods durmus2017nonasymptotic,dalalyan2019user,cheng2018convergence,cheng2017underdamped,dwivedi2018log,mangoubi2017rapid,mangoubi2018dimensionally in terms of relevant problem parameters like dimension, condition number and target accuracy.
In spite of this large and growing body of work, what has remained crucially missing is a characterization of the intrinsic hardness of sampling for a given family of distributions, that is, an information theoretic lower bound on the iteration complexity in terms of problem parameters for first-order sampling methods. This is the question that we address in our work.
We establish a lower bound on the iteration complexity for sampling from continuous distributions over that have smooth and strongly concave (see Section 2 for definitions) log densities of the form:
To establish our lower bounds we borrow the noisy oracle model nemirovsky1983problem used in the allied field of optimization. There, this model was used to establish lower bounds on the iteration complexity of optimizing a function using gradients/function-values [see also,]raginsky2011information,agarwal2012information,ramdas2013optimal. In this oracle model an algorithm (in our case a sampling method) is allowed to query the oracle at a point and the oracle returns the gradient/function value of the log density at that point corrupted by some independent and unbiased noise with bounded variance. The iteration complexity of the method is therefore equal to the number of queries that need to be made to this noisy oracle. The method is allowed to be adaptive, in the sense that it is allowed to choose its next query based on its entire history—past query points and the values returned by the oracle. Many popular first-order and zeroth-order (methods that work with function-values) sampling methods like Hamiltonian Monte Carlo neal2011mcmc, unadjusted Langevin algorithm (ULA) ermak1975computer,parisi1981correlation,grenander1994representations,roberts1996exponential,dalalyan2017theoretical, Metropolis adjusted Langevin algorithm roberts1996exponential,roberts2002langevin,bou2013nonasymptotic, Ball walk lovasz1990mixing,dyer1991random,lovasz1993random,lovasz2007geometry, Metropolized random walk mengersen1996rates,roberts1996geometric, Hit-and-run smith1984efficient,belisle1993hit,kannan1995isoperimetric,lovasz1999hit,lovasz2006hit,lovasz2007geometry, underdamped Langevin MCMC eberle2019couplings,cheng2017underdamped, Riemannian MALA xifara2014langevin, Proximal-MALA pereyra2016proximal and Projected ULA bubeck2018sampling, can be described by this oracle model.
The main result of this paper is to establish a lower bound on the iteration complexity of any algorithm with access to a stochastic gradient oracle. We show that the iteration complexity must grow at least linearly with the noise variance in the gradients.
More concretely, a sampling algorithm operates in a sequential fashion. It begins by querying the stochastic gradient oracle at an initial point (which could be chosen in a randomized fashion). The oracle then returns the gradient of the negative log density at that point with some noise added to it: , where is independent of the query point, has zero mean and has bounded variance [such an assumption on the gradient oracle is quite common in the study of upper bounds of sampling algorithms, see, e.g.,]dalalyan2019user,cheng2017underdamped. Denote the distribution of the noisy gradients as —the conditional distribution of the noisy gradient at a point . On observing this noisy gradient , the algorithm can now choose to query on new point , where this point is dependent on the initial query point and the noisy gradient . This protocol continues for rounds, after which the algorithm is asked to output a point from a distribution close to the target . Let us denote the distribution of the point output by such an algorithm after such queries by .
We show that for every such algorithm, there exists a log smooth and strongly log-concave target density for which the distribution of the sample generated by the algorithm would be at least away from the target in total variation distance if the number of queries is less than . A precise statement of this result is presented as Theorem 1.
This lower bound requires the choice of a different family of distributions for every target accuracy . A result that is easier to interpret is our lower bound for the case when the target accuracy is a constant. In this regime our results are as follows:
There exists constants and such that for all and , implies,
where, the infimum is over all gradient-based sampling algorithms, the supremum over is over unbiased gradient oracles with noise-variance bounded by , and the supremum over is over -log smooth, -strongly log-concave distributions over with the mean .
This corollary asserts that to get to a distribution that is a constant away from a well-conditioned target distribution with constant log-smoothness it must make order queries to the stochastic gradient oracle in the worst case.
We consider target distributions that have their mean in a ball of constant radius around the origin. This ensures that most of the mass of these distributions is concentrated in a ball of radius around the mean (as is -strongly log-concave and hence -sub-Gaussian).
At a high level the approach here is inspired by the work that proved lower bounds in stochastic optimization. There, they reduce the problem of lower bounding the error of any first-order optimization algorithm to that of lower bounding the Bayes risk of an appropriately constructed statistical problem. We find that this reduction is not directly possible in our problem. Instead, we relate the problem of lower bounding the iteration complexity of sampling to that of lower bounding a Le Cam deficiency between two statistical experiments. Le Cam deficiency is a measure that is widely studied in the literature on the comparison of statistical experiments bohnenblust1949reconnaissance,blackwell1953equivalent,le1964sufficiency,torgersen1991comparison and it turns out to be a useful idea in our problem.
The rest of the paper is organized as follows. In Section 2, we introduce and define useful quantities used throughout the paper in the statement of our results and their proofs. In Section 3, we reduce the problem of lower bounding the iteration complexity of sampling algorithms to that of lower bounding a difference of Bayes risk functions. In Section 4, we state and prove our main results concerning gradient oracles. We conclude with a discussion in Section 5. The more technical details of the proofs are deferred to the appendix.
We use bold font to denote a random variable, and unbolded font to denote the particular sample value of that random variable. For example,denotes a random variable and denotes that the realization of is
. We use uppercase letters to denote matrices and lower case letters to denote vectors. Given any vector, let denote its Euclidean norm. Let
denote the identity matrix indimensions, denote the matrix of all zeros and denote a row vector of all zeros. Given any subset of , let denote the Borel -field associated with this set. For any finite set , let denote its cardinality. denotes the indicator of an event. We use to denote positive universal constants that are independent of any problem parameters. We adopt conventional notations and as suppressing constants independent of dimension and problem parameters, and and as suppressing poly-logarithmic factors.
2 Definitions and problem set-up
In this section, we define several quantities that are useful in stating and proving our results.
In our paper, we refer to , the negative log density of the distribution , as the potential function. As stated in the introduction, we consider study a family of distributions with smooth and strongly convex potentials. We define smoothness and strong convexity below.
A function from is -smooth and -strongly convex if there exists constants such that for all ,
If a function is smooth and strongly convex then is smooth and strongly concave. Next let us define a probability kernel111What is the difference between a probability kernel and a conditional distribution? See the blog posts by larryblog, raginskyblog for an entertaining and insightful discussion..
For measurable spaces and , is a probability kernel (conditional distribution) from to probability measures over if
The map is -measurable for each .
There is probability measure associated with each which we denote by .
We will use to represent the random variable drawn from the distribution .
In this paper we will only be using measurable spaces that are product spaces of the reals. So from this point on we avoid explicitly mentioning the associated -field and always work with the appropriate Borel -field.
We use probability kernels to describe the response of the algorithm during its interaction with the gradient oracle.
A gradient-based sampling algorithm over consists of the following elements:
A distribution over from which the initial point is drawn.
Probability kernels for , with the probability kernel being a map from to probability measures over . The kernel is a map from the history of queries and gradients seen up to round to a distribution over the next query point. The final kernel produces the sample.
To understand why this amounts to a valid and interesting class of sampling algorithms, let us understand the protocol to produce a sample given a stochastic gradient oracle (we use or to denote the distribution of the gradients when queried at a point ).
Given the protocol above it is possible to write down the joint probability distribution of the path. Note that the path of the algorithm need not have a density with respect to the Lebesgue measure. But, for the sake of intuition let us assume that the path has a density; then the density of the path would be
Let the distribution of the sample produced produced by the interaction between the algorithm and a gradient oracle be denoted by . Explicitly, the marginal density of is given by:
Restriction to finite parametric classes.
As we are interested in demonstrating a lower bound for sampling algorithms it suffices to work with only a finite indexed class of distributions , where each is -log smooth and -strongly log-concave. Let be the negative log density corresponding to the distribution .
In the discussion that follows we let be a stochastic gradient oracle that outputs , where is a Gaussian random variable with mean zero and covariance when queried at a point . The covariance matrix will be specified subsequently. For such a finite class of distributions, it suffices to lower bound the quantity on the right side below,
where the infimum over is over the class of sampling algorithms defined in Definition 3.
An estimator is a probability kernel from to the decision space .
An estimator , when given a sample from the distribution , yields a distribution over the decision space . We choose the decision space to be which is distinct from the set of parameters as we will be interested in estimates for the mean of .
Further, we define sequential randomized estimators – distributions over the decision space that arise by appropriately combining a randomized estimator with a sampling algorithm.
Definition 5 (Sequential estimators).
Given an estimator from to , a sampling algorithm and a gradient oracle , define a sequential estimator as
where is the marginal distribution of the sample produced by .
Often we will simply use to refer to the sequential estimator that is built using the estimator and sampling algorithm . In the paper, when we write , this indicates an infimum simultaneously over both estimators and algorithms that are used to build .
3 A characterization in terms of Bayes risk functions
In this section we prove a proposition which is crucial to prove our lower bound on the iteration complexity. Operationally, it reduces the problem of lower bounding the total variation distance between and to lower bounding a difference of two Bayes risk functions. This is advantageous as there exists a large set of well-developed techniques to both lower and upper bound Bayes risk functions.
As will be clear in the proof, the choice of total variation distance as our metric is pivotal. The total variation distance lends itself to a variational definition in terms of bounded loss functions which is exploited in the proof of this proposition.
(Randomization criterion) Let be a finite set. Then for any loss function over such that we have,
where, the infimum over is over gradient-based sampling algorithms, the infimum over is over all sequential estimators mapping to and the infimum over is over estimators from to .
As mentioned earlier, the infimum over sequential estimator in the proposition above involves taking an infimum simultaneously over both estimators and sampling algorithms that make up . The proof of the proposition is in Appendix A.
To understand why this proposition is useful let us unpack the right hand side of the randomization criterion which is a difference of two Bayes risk functions. Given a fixed loss function , the first term is the average risk associated with the optimal estimator that is fed a fictitious sample generated by the optimal sampling algorithm after rounds of interaction with the stochastic gradient oracle. This is compared to the second term which is the average risk of an optimal estimator that receives a single real sample from the target distribution .
Thus, to demonstrate a lower bound for sampling we will construct a pair of two distributions and , and a gradient oracle such that:
It extremely easy to estimate the mean of the distribution given a single sample from the real distribution. This is to ensure that the second term is small and is independent of dimension.
But, the Bayes risk associated with estimating the mean when given a fictitious sample from the sampler grows at a rate of .
Such a construction ensures that the difference between these Bayes risk functions is large and proves a lower bound on the iteration complexity of sampling via the randomization criterion.
The term on the left hand side of our randomization criterion is similar in spirit to a quantity called the Le Cam deficiency between two families of distributions. Given two families and , the Le Cam deficiency between these families is defined to be
where, is a probability kernel which takes as input samples from . It is measure of how difficult it is to transform a sample from the distribution to a sample from using a probability kernel (which does not vary with ). In our setting, in some sense we want to transform adaptively chosen samples from the the gradient oracle to a sample from the target distribution .
This proposition above is similar in flavour to the classical Le Cam randomization criterion le1964sufficiency studied extensively in the literature on the comparison of statistical experiments [see, e.g,]torgersen1991comparison,le2012asymptotics. In fact, it is a version of Le Cam’s randomization criterion restricted to a special class of sequential estimators.
Finally, we note that an object similar to first Bayes risk (when the estimator is fed fictitious samples from the sampling algorithm) also arises in the study of lower bounds on the iteration complexity of gradient-based stochastic optimization algorithms. At a high level, there, this Bayes risk lower bounds the worst case gap to the optimum for the optimal stochastic optimization algorithm. We point the interested reader to the work of raginsky2011information, agarwal2012information, and ramdas2013optimal for further details. Interestingly, in the study of lower bounds for stochastic optimization algorithms, no term analogous to second Bayes risk (when the estimator is fed real samples) arises which adds a level of difficulty to our problem.
4 Lower bounds for stochastic first order methods
We now state our main result for stochastic gradient oracles.
For all , , and for all ,
where, the infimum is over all gradient-based sampling algorithms, the supremum over is over unbiased gradient oracles with noise-variance bounded by , and the supremum over is over -log smooth, -strongly log-concave distributions over with the mean .
The interpretation of the theorem given the order of the quantifiers, is as follows: Given any , and , for every gradient-based sampling algorithm, there exists a noise oracle and a target distribution (both of which could depend on the sampling algorithm and on the problem parameters and ) such that the total variation between the distribution of points generated by the algorithm and the target distribution is lower bounded by .
To reiterate, we only consider distributions that have their means in a ball of radius around the origin. This ensures that most of the mass of these distributions is concentrated in a ball of radius around the mean (as is -strongly log-concave and hence sub-Gaussian). This constraint on the family of target distributions ensures that the lower bound does not trivially hold by choosing an appropriate family of distribution with means that are arbitrarily far away from each other.
We note that our lower bound only holds for the case when the gradients are noisy. If we are given exact gradients there exist methods that use a Metropolis-Hastings filter that provably converge exponentially fast (in the target accuracy) when the target distribution is strongly log-concave [see, e.g.,]dwivedi2018log.
It is also fairly straightforward to run through our argument and also arrive at a lower bound in terms of Kullback-Leibler divergence by using the Pinsker-Csiszár-Kullback inequality appropriately.
Proof of Corollary 1 and its consequences.
If we apply Theorem 1 for , then the claim of the corollary follows immediately by choosing .
Corollary 1 dictates that there exists a family of well-conditioned distributions with constant log-smoothness such that it takes at least order queries to the stochastic gradient oracle to obtain a sample whose distribution is a constant () away from the target distribution in total variation distance.
When do we have optimal algorithms?
One candidate would be the Stochastic Gradient Langevin Dynamics (SGLD) algorithm [it is a version of the unadjusted Langevin algorithm that uses stochastic gradients instead of exact ones]welling2011bayesian. Corollary 25, durmus2019analysis characterizes an upper bound for the rate of convergence of Stochastic Gradient Langevin Dynamics (SGLD) in total variation distance. Let denote the distribution of the sample produced by SGLD (with parameters tuned according to requirements of their results) after iterations. Then if and the distribution is well-conditioned their results guarantee,
If and are viewed as constants, this matches our lower bound on the number of iterations needed to achieve constant accuracy.
Proof details for Theorem 1
Before we start our proof of the main theorem, we set up some quantities that will be useful in the proof. Let be a scalar parameter that will be specified in the sequel. We consider a family of two Gaussian distributions,
The last coordinates of the two mean vectors are all . The mean vectors differ only on the first coordinate. Set and , where and refers to the first coordinate of the vectors. Clearly, these vectors are separated in Euclidean norm
The distributions we have defined are Gaussian, therefore, each distribution in this family is -log smooth and -strongly log-concave (in fact, it is -strongly log-concave).
We choose our decision space to be the space of vectors in and define a loss function over this space as,
We will be working with a fixed gradient oracle that adds zero mean Gaussian noise with covariance
to the gradient of when queried at . This noise oracle uses its entire noise budget on the first coordinate and adds Gaussian noise with variance to the gradient along this coordinate. We chose such an oracle because the two distributions and differ only along this single coordinate. Since , this is a valid gradient oracle. We are now ready to begin the proof.
Proof : We want a lower bound on the quantity,
First we replace the supremum over all distributions with our pair of Gaussian distributions parameterized by defined in equation (2) above. Further, as mentioned above we drop the supremum over gradient oracle distributions and consider the gradient oracle that adds Gaussian noise to the first coordinate of the gradient. By relaxing the problem in this manner we must have,
To lower bound we need to establish a lower bound on and an upper bound on .
The term is the Bayes risk of the optimal sequential estimator with respect to the loss function when the true parameter
is drawn from a uniform distribution over the set. To prove a lower bound on this term we turn to Le Cam’s method. Roughly, this entails reducing this mean estimation problem to a hypothesis testing problem via a standard argument (see Lemma 1). Then, we lower bound the error made in this hypothesis testing problem by the optimal test. This is achieved by upper bounding the total variation distance between the distributions of the algorithm’s history over rounds when the true distribution is either or (in Lemma 2). This argument leads to lower bound: . The details for this lower bound are presented in Appendix B.1.
Similar to the first term, is the Bayes risk of the optimal estimator that is given a single sample from the true distribution distribution . A bound of
follows by considering the estimator that outputs the first coordinate of the sample along the first dimension (the standard deviation along with dimension isand ) and along the remaining dimensions. This simple calculation is detailed in Appendix B.2.
By combining these results,
Pick . Recall that the number of queries , which guarantees that . Plugging this value of into the inequality above,
where, the condition on the smoothness ensures that the second term in the inequality above is smaller than the first, which completes the proof.
5 Discussion and open problems
Many interesting questions remain open related to lower bounding the iteration complexity of sampling algorithms. In our lower bounds we ignored the dependence on the condition number , an important parameter of the problem, and established a lower bound for well-conditioned distributions (where ). It would be interesting to study if the lower bounds can be extended to also capture the dependence on . Another related question concerns the iteration complexity of sampling when the potentials are smooth and strongly-convex with respect to general norms for .
We have given a lower bound in terms of the variance, number of variables, and number of samples, for a worst-case smooth distribution. The degree of smoothness in our construction depends (albeit somewhat mildly), on the other parameters. It would be interesting to see how the optimal accuracy behaves when the smoothness and the other parameters are varied independently.
Another direction is to develop techniques to establish lower bounds for sampling methods when we have access to the exact gradients of the potential function. On this front, there has been some recent progress on related problems, which may provide some hints as to how one might proceed. ge2019estimating provide a lower bound on the number of exact gradient queries required to estimate the normalization constant of strongly log-concave distributions. rademacher2008dispersion showed that a quadratic number of queries to a deterministic membership oracle is required to estimate the volume of a convex body in dimensions. talwar2019computational recently exhibited a family of distributions with non-convex potentials for which sampling is NP-Hard.
We are grateful to Aditya Guntuboyina for pointing us towards the literature on Le Cam deficiency. We would also like to thank Jelena Diakonikolas, Sébastien Gerchinovitz, Michael Jordan, Aldo Pacchiano, Aaditya Ramdas and Morris Yau for many helpful conversations. We thank Kush Bhatia for helpful comments that improved the presentation of the results.
We gratefully acknowledge the support of the NSF through grants IIS-1619362 and IIS-1909365. Part of this work was completed while the author NC was interning at Google.
Appendix A Proof of the randomization criterion, Proposition 1
Given any sampling algorithm , for any estimator it is possible to define a sequential estimator by using both and . Then, for all we have,
where, step follows by the definition of and step is by Fubini’s theorem. Define a function . It is easy to verify that this function is also bounded, . Hence, we have,
where the equality is due to the variational definition of the total variation distance. Taking an infimum over all possible sequential estimators (by taking an infimum over all estimators and algorithms ) we find that for all estimators ,
As the above statement holds for all estimators , taking a supremum over on the right side and an infimum over all gradient-based sampling algorithms on the left hand side completes the proof.
Appendix B Additional proof details for lower bounds with gradient oracles
b.1 Lower Bound on
In this subsection we prove a lower bound on . By definition
This is the Bayes risk of sequential estimators with respect to the loss function . We want a lower bound on it; which is a lower bound on the risk of the optimal estimator that attempts to identify the underlying value of the mean vector associated with given adaptive queries to a stochastic gradient oracle.
Let random variables in a sample path of algorithm be , where are the query points of the algorithm, are the stochastic gradients returned by the oracle and is the sample produced. Let be a random variable that is a concatenation of the the path that the algorithm takes. In Lemma 1 below, we prove that
where is the distribution of when and is its distribution . The proof uses a fairly standard argument (called Le Cam’s method) to reduce from mean estimation to testing [see, e.g.,][Chapter 15]wainwright2019high. To complete our lower bound, we need to upper bound the total variation distance between the distributions and . We do this by instead controlling the Kullback-Leibler divergence between these distributions which bounds the total variation distance. See Lemma 2 below for the details of this calculation. This upper bound combined with the inequality in the display above yields,
which is the desired lower bound on .
Let the Bayes risk be as defined in (5), and and be as defined above, then,
Proof By the definition of a sequential estimator in Definition 5,
It is made up of an estimator and a sampling algorithm . Let random variables in a sample path of algorithm be , where are the query points of the algorithm, are the stochastic gradients returned by the oracle and is the sample produced. Let be a random variable that is a concatenation of the variables along this path.
Reduction to testing: We proceed by the standard reduction used in proving lower bounds on Bayes risk functions, that of reducing from an estimation problem to a testing problem [see, e.g.,][Chapter 15]wainwright2019high. Recall the definition of
Consider the event when then
as . Thus, by applying Markov’s inequality with respect to the distribution we infer,
where is the expectation of the indicator function under the distribution over and under the uniform mixture distribution over . Let denote this joint probability over the random variables (defined above), and (which takes values in ) then,
So we have reduced it to lower bounding the quantity on the right hand side. Toward this end, let us consider a test that attempts to identify the value of given ,
where we break ties arbitrarily. Suppose that the conditional value of is : we claim that the event ensures that the test is correct. In order to see this, note that for any other , by triangle inequality we have,
The lower bound follows by our construction of the set of mean vectors, see inequality (3). As these vectors are separated we must also have, where .
Therefore, conditioned on a value of the random variable , the event is contained within the event . This implies,
Taking expectations with respect to the uniform mixture distribution over yields,
We take an infimum over all sequential estimators and an infimum over all tests , coupled with inequality (7) to conclude that,
Le Cam’s Method: The next step is to demonstrate a lower bound on failure probability of any test for this, we use Le Cam’s method [see, e.g.,][Section 15.2.1]wainwright2019high,
where and denote the distribution of when the true underlying target distribution is and respectively. Combining this lower bound with inequality (8) completes the proof.
[2mm] The next lemma establishes an upper bound on the total variation distance between the distributions and .
Let and be as defined above. Then
Proof Recall that the random variable is made up of the query points of the algorithm , the stochastic gradients and the output of the sampling algorithm . The distribution is the law of when and is its law when .
Say, , then the density222Let us temporarily assume that the density exists for the sake of intuition. The density is not required to exist and our arguments hold more generally. of is:
The oracle noise distribution , adds mean-zero Gaussian noise to the gradient of where
Note that, by the choice of the noise oracle the measure is absolutely continuous with respect to 333See, for example, kallenberg, for a review of absolute continuity.. We will proceed to bound the Kullback-Leibler divergence between and which also leads to a bound on the total variation distance.