Let be a positive integer and be a measurable function such that the integral is finite. If we think of as the negative log-likelihood or the negative log-posterior of a statistical model, then the maximum likelihood and the Bayesian estimators, which are perhaps the most popular in statistics, are respectively defined as
These estimators are rarely available in closed-form. Therefore, optimisation techniques are used for computing the maximum-likelihood estimator while the computation of the Bayes estimator often requires sampling from a density proportional to . In most situations, the exact computation of these two estimators is impossible and one has to resort to approximations provided by iterative algorithms. There is a vast variety of such algorithms for solving both tasks, see for example (Boyd and Vandenberghe, 2004) for optimisation and (Atchadé et al., 2011) for approximate sampling. However, a striking fact is that the convergence properties of optimisation algorithms are much better understood than those of the approximate sampling algorithms. The goal of the present work is to partially fill this gap by establishing easy-to-apply theoretical guarantees for some approximate sampling algorithms.
To be more precise, let us consider the case of a strongly convex function having a Lipschitz continuous gradient. That is, there exist two positive constants and such that
where stands for the gradient of and is the Euclidean norm. There is a simple result characterising the convergence of the well-known gradient descent algorithm under the assumption (1).
Theorem 1 (Eq. (9.18) in (Boyd and Vandenberghe, 2004)).
If is continuously differentiable and fulfils (1), then the gradient descent algorithm defined recursively by
This theorem implies that the convergence of the gradient descent is exponential in . More precisely, it results from Eq. (3) that in order to achieve an approximation error upper bounded by in the Euclidean norm it suffices to perform
evaluations of the gradient of . An important feature of this result is the logarithmic dependence of on but also its independence of the dimension . Note also that even though the right-hand side of (4) is a somewhat conservative bound on the number of iterations, all the quantities involved in that expression are easily computable and lead to a simple stopping rule for the iterative algorithm.
The situation for approximate computation of or for approximate sampling from the density proportional to is much more contrasted. While there exist almost as many algorithms for performing these tasks as for the optimisation, the convergence properties of most of them are studied only empirically and, therefore, provide little theoretically grounded guidance for the choice of different tuning parameters or of the stopping rule. Furthermore, it is not clear how the rate of convergence of these algorithms scales with the growing dimension. While it is intuitively understandable that the problem of sampling from a distribution is more difficult than that of maximising its density, this does not necessarily justifies the huge gap that exists between the precision of theoretical guarantees available for the solutions of these two problems. This gap is even more surprising in light of the numerous similarities between the optimisation and approximate sampling algorithms.
Let us describe a particular example of approximate sampling algorithm, the Langevin Monte Carlo (LMC), that will be studied throughout this work. Its definition is similar to the gradient descent algorithm for optimisation but involves an additional step of random perturbation. Starting from an initial point that may be deterministic or random, the subsequent steps of the algorithm are defined by the update rule
where is a tuning parameter, often referred to as the step-size, and
is a sequence of independent centered Gaussian vectors with covariance matrix equal to identity and independent of. It is well known that under some assumptions on , when is small and is large (so that the product is large), the distribution of is close in total variation to the distribution with density proportional to , hereafter referred to as the target distribution. The goal of the present work is to establish a nonasymptotic upper bound, involving only explicit and computable quantities, on the total variation distance between the target distribution and its approximation by the distribution of . We will also analyse a variant of the LMC, termed LMCO, which makes use of the Hessian of .
In order to give the reader a foretaste of the main contributions of the present work, we summarised in Table 1 some guarantees established and described in detail in the next sections. To keep things simple, we translated all the nonasymptotic results into asymptotic ones for large dimension and small precision level (the notation ignores the dependence on constant and logarithmic factors). The complexity of one iteration of the LMC indicated in the table corresponds to the computation of the gradient and generation of a Gaussian
-vector, whereas the complexity of one iteration of the LMCO is the cost of performing a singular values decomposition on the Hessian matrix of, which is of size .
For any we write for the -algebra of Borel sets of . The Euclidean norm of is denoted by while stands for the total variation norm of a signed measure :
. For two probability measuresand defined on a space and such that is absolutely continuous with respect to , the Kullback-Leibler and divergences between and are respectively defined by
All the probability densities on are with respect to the Lebesgue measure, unless otherwise specified. We denote by
the probability density function proportional to, by
the corresponding probability distribution and bythe expectation with respect to . For a probability density and a Markov kernel , we denote by the probability distribution . We say that the density is log-concave (resp. strongly log-concave) if the function satisfies the first inequality of (1) with (resp. ). We refer the interested reader to (Saumard and Wellner, 2014) for a comprehensive survey on log-concave densities.
2 Background on the Langevin Monte Carlo algorithm
The rationale behind the LMC algorithm (5) is simple: the Markov chain is the Euler discretisation of a continuous-time diffusion process , known as Langevin diffusion, that has as invariant density. The Langevin diffusion is defined by the stochastic differential equation
where is a -dimensional Brownian motion. When satisfies condition (1), equation (6) has a unique strong solution which is a Markov process. In what follows, the transition kernel of this process is denoted by , that is for all Borel sets and any initial condition . Furthermore, assumption (1) yields the spectral gap property of the semigroup , which in turn implies that the process is geometrically ergodic in the following sense.
Under assumption (1), for any probability density ,
The proof of this lemma, postponed to Section 8, is based on the bounds on the spectral gap established in (Chen and Wang, 1997, Remark 4.14), see also (Bakry et al., 2014, Corollary 4.8.2). In simple words, inequality (7) shows that for large values of , the distribution of approaches exponentially fast to the target distribution, and the idea behind the LMC is to approximate by for . Note that inequalities of type (7) can be obtained under conditions (such as the curvature-dimension condition, see Bakry et al. (2014, Definition 1.16.1 and Theorem 4.8.4)
) weaker than the strong log-concavity required in the present work. However, we decided to restrict ourselves to the strong log-concavity condition since it is easy to check and is commonly used in machine learning and optimisation.
The first and probably the most influential work providing probabilistic analysis of asymptotic properties of the LMC algorithm is (Roberts and Tweedie, 1996). However, one of the recommendations made by the authors of that paper is to avoid using Langevin algorithm as it is defined in (5), or to use it very cautiously, since the ergodicity of the corresponding Markov chain is very sensitive to the choice of the parameter . Even in the cases where the Langevin diffusion is geometrically ergodic, the inappropriate choice of may result in the transience of the Markov chain . These findings have very strongly influenced the subsequent studies since all the ensuing research focused essentially on the Metropolis adjusted version of the LMC, known as Metropolis adjusted Langevin algorithm (MALA), and its modifications (Roberts and Rosenthal, 1998; Stramer and Tweedie, 1999a, b; Jarner and Hansen, 2000; Roberts and Stramer, 2002; Pillai et al., 2012; Xifara et al., 2014).
In contrast to this, we show here that under the strong convexity assumption imposed on (or, equivalently, on ) coupled with the Lipschitz continuity of the gradient of , one can ensure the non-transience of the Markov chain by simply choosing . In fact, the non-explosion of this chain follows from the following proposition the proof of which is very strongly inspired by the one of Theorem 1.
Let the function be continuously differentiable on and satisfy (1) with . Then, for every , we have
Note that under the condition , the quantity is always nonnegative. Indeed, it follows (see Lemma 4 in Section 8) from the Taylor expansion and the Lipschitz continuity of the gradient that for every , which—in view of (1)—entails that and, therefore, . On the other hand, in view of the strong convexity of , inequality (8) implies that
where stands for the point of (global) minimum of . As a consequence, the sequence produced by the LMC algorithm is bounded in provided that .
A crucial step in analyzing the long-time behaviour of the LMC algorithm is the assessment of the distance between the distribution of the random variableand that of . It is intuitively clear that for a fixed this distance should tend to zero when tends to zero. However, in order to get informative bounds we need to quantify the rate of this convergence. To this end, we follow the ideas presented in (Dalalyan and Tsybakov, 2009, 2012) which consist in performing the following two steps. First, a continuous-time Markov process is introduced such that the distribution of the random vectors and coincide. Second, the distance between the distributions of the variables and is bounded from above by the distance between the distributions of the continuous-time processes and .
To be more precise, we introduce a diffusion-type continuous-time process obeying the following stochastic differential equation:
with the (nonanticipative) drift . By integrating the last equation on the interval , we check that the increments of this process satisfy , where . Since the Brownian motion is a Gaussian process with independent increments, we conclude that is a sequence of iid standard Gaussian random vectors. This readily implies the equality of the distributions of the random vectors and .
Note that the specific form of the drift used in the LMC algorithm has the advantage of meeting the following two conditions. First, is close to , the drift of the Langevin diffusion. Second, it is possible to sample from the distribution , where is the step of discretisation used in the LMC algorithm. Any nonanticipative drift function satisfying these two conditions may be used for defining a version of the LMC algorithm. Such an example, the LMC algorithm with Ozaki discretisation, is considered in Section 5.
To close this section, we state an inequality that will be repeatedly used in this work and the proof of which—based on the Girsanov formula—can be found, for instance, in (Dalalyan and Tsybakov, 2012). If for some the nonanticipative drift function satisfies the inequality for every and every
, then the Kullback-Leibler divergence betweenand , the distributions of the processes and with the initial value , is given by
It is worth emphasising that the last inequality remains valid when the initial values of the processes and are random but have the same distribution.
Note that the idea of discretising the diffusion process in order to approximately sample from its invariant density is not new. It can be traced back at least to (Lamberton and Pagès, 2002), see also the thesis (Lemaire, 2005) for an overview. The results therein are stated for more general discretisation with variable step-sizes but are of asymptotic nature. This point of view has been adopted and extended to the nonasymptotic case in the recent work (Durmus and Moulines, 2015).
3 Nonasymptotic bounds on the error of the LMC algorithm
We are now in a position to establish a nonasymptotic bound with explicit constants on the distance between the target distribution and the one produced by the LMC algorithm. As explained earlier, the bound is obtained by controlling two types of errors: the error of approximating by the distribution of the Langevin diffusion (6) and the error of approximating the Langevin diffusion by its discretised version given by (10). The first error is a decreasing function of : in order to make this error small it is necessary to choose a large . A rather precise quantitative assessment of this error is given by Lemma 1 in the previous section. The second error vanishes when the step-size goes to zero, provided that is fixed. Thus, it is in our interest to choose a small . However, our goal is not only to minimise the error, but also to reduce, as much as possible, the computational cost of the algorithm. For a fixed , if we choose a small value of then a large number of steps is necessary for getting close to the target distribution. Therefore, the computational complexity is a decreasing function of . In order to find a value of leading to a reasonable trade-off between the computational complexity and the approximation error, we need to complement Lemma 1 with a precise bound on the second approximation error. This is done in the following lemma.
Let be a function satisfying the second inequality in (1) and be a stationary point (i.e., ). For any , let and be respectively the distributions of the Langevin diffusion (6) and its approximation (10) on the space of all continuous paths on with values in , with a fixed initial value . Then, if with , it holds that
Let us set . Since it simplifies the mathematical formulae and is possible to achieve in practice in view of Theorem 1
, we assume in the sequel that the initial value of the LMC algorithm is drawn at random from the Gaussian distribution with mean, a stationary point of , and covariance matrix . Then, in view of (12) and the convexity of the Kullback-Leibler divergence, we get (for )
for every and . We can now state the main result of this section, the proof of which is postponed to Section 8.
Let be a function satisfying (1) and be its global minimum point. Assume that for some , we have and . Then, for any time horizon , the total variation distance between the target distribution and the approximation furnished by the LMC algorithm with the initial distribution satisfies
We provide here a simple consequence of the last theorem that furnishes easy-to-apply rules for choosing the time horizon and the step-size .
Let , satisfy (1) and be a target precision level. Let the time horizon and the step-size be defined by
where . Then the output of the -step LMC algorithm, with , satisfies .
Let us first remark that the claim of Corollary 1 can be simplified by taking . However, for this value of the factor equals one, whereas for the slightly more complicated choice recommended by Corollary 1, this factor is close to two. In practice, increasing by a factor results in halving the running time, which represents a non-negligible gain.
Besides providing concrete and easily applicable guidance for choosing the step of discretisation and the stopping rule for the LMC algorithm to achieve a prescribed error rate, the last corollary tells us that in order to get an error smaller than , it is enough to perform evaluations of the gradient of . To the best of our knowledge, this is the first result that establishes polynomial in guarantees for sampling from a log-concave density using the LMC algorithm. We discuss the relation of this and subsequent results to earlier work in Section 7.
4 Possible extensions
In this section, we state some extensions of the previous results that do not require any major change in the proofs, but might lead to improved computational complexity or be valid under relaxed assumptions in some particular cases.
4.1 Improved bounds for a “warm start”
The choice of the distribution of the initial value has a significant impact on the convergence of the LMC algorithm. If is close to , smaller number of iterations might be enough for making the TV-error smaller than . The goal of this section is to present quantitative bounds characterising the influence of on the convergence and, as a consequence, on the computational complexity of the LMC algorithm.
The first observation that can be readily deduced from (12) is that for any ,
Elaborating on this inequality, we get the following result.
Let be a probability density on such that the second-order moment
such that the second-order momentand the divergence are finite. Then, the LMC algorithm having as initial distribution and using the time horizon and step-size defined by
satisfies, for , the inequality .
The proof of this proposition is immediate and, therefore, is left to the reader. What we infer from this result is that the choice of the initial distribution has a strong impact on the convergence of the LMC algorithm. For instance, if for some specific we are able to sample from a density satisfying, for some , the relation as , then the time horizon for approximating the target density within is and the step-size satisfies . Thus, in such a situation, one needs to perform evaluations of the gradient of to get a sampling density within a distance of of the target, which is substantially smaller than obtained in the previous section in the general case.
As it is frequently done in optimisation, one may introduce a preconditioner in the LMC algorithm in order to accelerate its convergence. To some extent, it amounts to choosing a definite positive matrix , called preconditioner, and applying the LMC algorithm to the function . Let be the sequence obtained by the LMC algorithm applied to the function , that is the density of is close to when is large and is small. Then, the sequence is approximately sampled from the density . This follows from the fact that if then . Furthermore, it holds that
i.e., the approximation error of the LMC algorithm with a preconditioner is characterised by Corollary 1. This means that if the function satisfies condition (1) with constants , then the number of steps after which the preconditioned LMC algorithm has an error bounded by is given by . Hence, the preconditioner yielding the best guaranteed computational complexity for the LMC algorithm is the matrix minimising the ratio .
4.3 Nonstrongly log-concave densities
Theoretical guarantees developed in previous sections assume that the logarithm of the target density is strongly concave, cf. assumption (1). However, they can also be used for approximate sampling from a density which is log-concave but not necessarily strongly log-concave; we call these densities nonstrongly log-concave. The idea is then to approximate the target density by a strongly log-concave one and to apply the LMC algorithm to the latter instead of the former one.
More precisely, assume that we wish to approximately sample from a multivariate target density , where the function is twice differentiable with Lipschitz continuous gradient (i.e., satisfies the second inequality in (1)). Assume, in addition, that for every there exists such that for every . Here, is an arbitrarily fixed point in . Note that if , then this assumption implies the first inequality in (1) with . The purpose of this subsection is to deal with the case where equals 0 or is very small. Let be a tuning parameter; we introduce the approximate log-density
This function satisfies both inequalities in (1) with and . Let us denote by the density defined by and by the corresponding probability distribution on
. Heuristically, it is natural to expect that under some mild assumptions the distributionis close to the target when is large and is small. This claim is made rigorous thanks to the following result, which is stated in a broad generality in order to be applicable to approximations that are not necessarily of the form (18).
Let and be two functions such that for all and both and are integrable. Then the Kullback-Leibler divergence between the distribution defined by the density and the target distribution can be bounded as follows:
As a consequence, .
Using the formula for the Kullback-Leibler divergence, we get
Applying successively the inequalities and for every , we upper bound the second term in the right-hand side of (20) as follows:
Combining this inequality with (20), we get the first claim. The last claim of the lemma follows from the Pinsker inequality. ∎
For given by (18), we get . Choosing the parameter sufficiently small and the parameter sufficiently large to ensure that and assuming that has bounded fourth-order moment, we derive from this inequality and Corollary 1 the following convergence result for the approximate LMC algorithm.
Let be a twice differentiable function satisfying for every and for every . Let be a target precision level. Assume that for some known value we have and define , for some . Set the time horizon and the step-size as follows:
Let us comment this result in the case which concerns nonstrongly log-concave densities. Then the previous result implies that . Clearly, the dependence of both on the dimension and on the acceptable error level gets substantially deteriorated as compared to the strongly log-concave case. Some improvements are possible in specific cases. First, we can improve the dependence of on if we are able to simulate from a distribution that is not too far from in the sense of divergence. More precisely, repeating the arguments of Section 4.1 we get the following result: if the initial distribution of the LMC algorithm satisfies for some then one needs at most steps of the LMC algorithm for getting an error bounded by . Second, in some cases the dependence of on can be further improved by using a preconditioner and/or by replacing the penalty in (18) by , where is a properly chosen matrix.
This being said, our intuition is that Corollary 2 is more helpful in the case of convex functions that are strongly convex in a neighbourhood of their minimum point . In such a situation, our recommendation is to set and to choose by maximising the quantity . We showcase this approach in Section 6 on the example of logistic regression.
Note that the convergence of the MCMC methods for sampling from log-concave densities was also studied in (Brooks, 1998), where a strategy for defining the stopping rule is proposed. However, as the computational complexity of that strategy increases exponentially fast in the dimension , its scope of applicability is limited.
5 Ozaki discretisation and guarantees for smooth Hessian matrices
For convex log-densities which are not only continuously differentiable but also have a smooth Hessian matrix , it is possible to take advantage of the Ozaki discretisation (Ozaki, 1992) of the Langevin diffusion which is more accurate than the Euler discretisation analysed in the foregoing sections. It consists in considering the diffusion process defined by (10) with the drift function
where, as previously, is the step-size and is the number of iterations to attain the desired time horizon . This expression leads to a diffusion process having linear drift function on each interval . Such a diffusion admits a closed-form formula. The resulting MCMC algorithm (Stramer and Tweedie, 1999b), hereafter referred to as LMCO algorithm (for Langevin Monte Carlo with Ozaki discretisation), is defined by an initial value and the following update rule. For every , we set , which is an invertible matrix since is strongly convex, and define
where is a sequence of independent random vectors distributed according to the distribution. In what follows, for any matrix , stands for the spectral norm, that is .
Assume that , the function satisfies (1) and, in addition, the Hessian matrix of is Lipschitz continuous with some constant : , for all . Let be the global minimum point of and be the Gaussian distribution . Then, for any step-size and any time horizon , the total variation distance between the target distribution and the approximation furnished by the LMCO algorithm with drawn at random from satisfies
The proof of this theorem is deferred to Section 8. Let us state now a direct consequence of the last theorem, which provides sufficient conditions on the number of steps for the LMCO algorithm to achieve a prescribed precision level . The proof of the corollary is trivial and, therefore, is omitted.
Let satisfy (1) with a Hessian that is Lipschitz-continuous with constant . For every , if the time horizon and the step-size are chosen so that
then the distribution of the outcome of the LMCO algorithm with steps fulfils .
This corollary provides simple recommendation for the choice of the parameters and in the LMCO algorithm. It also ensures that for the recommended choice of the parameters, it is sufficient to perform number of steps of the LMCO algorithm in order to reach the desired precision level . This number is much smaller than that provided earlier by Corollary 1, which was of order . However, one should pay attention to the fact that each iteration of the LMCO requires computing the exponential of the Hessian of at the current state and, therefore, the computational complexity of each iteration is usually much larger for the LMCO as compared to the LMC ( versus ). This implies that the LMCO would most likely be preferable to the LMC only in situations where is not too large, and the required precision level is very small. For instance, the arguments of this paragraph advocate for using the LMCO instead of the LMC when .
This being said, it is worth noting that for some functions the cost of performing a singular values decomposition on the Hessian of , which is the typical way of computing the matrix exponential, might be much smaller than the aforementioned worst-case complexity . This is, in particular, the case for the first example considered in the next section. One can also approximate the matrix exponentials by matrix polynomials. For second-order polynomials, this amounts to replacing the updates (24) by
Establishing guarantees for such a modified LMCO is out of scope of the present work. We will limit ourselves to an empirical assessment of the quality of this approximation on the example of logistic regression considered in Section 6.
To close this section, let us remark that in the case a warm start is available, the number of iterations for the LMCO algorithm to reach the precision may be reduced to . Indeed, if the divergence between the initial distribution and the target is bounded by a quantity independent of , or increasing not faster than a polynomial in , then the time horizon can be chosen as and the choice of provided by Corollary 3 leads to a number of iterations satisfying .
6 Numerical experiments
To illustrate the results established in the previous sections, we carried out some experiments on synthetic data. The experiments were conducted on a HP Elitebook PC with the following configuration: Intel (R) Core (TM) i7-3687U with 2.6 GHz CPU and 16 GB of RAM. The code, written in Matlab, does not use parallelisation. We considered two examples; both satisfy all the assumptions required in previous sections. This implies that Corollaries 1 and 3 apply and guarantee that the choices of and suggested by these corollaries allow us to generate random vectors having a distribution which is within a prescribed distance , in total variation, of the target distribution.
Example 1: Gaussian mixture
The goal of this first experiment is merely to show on a simple example the validity of our theoretical findings. That is, we check below that the LMC and the LMCO algorithms with the values of time horizon and step-size recommended by Corollaries 1 and 3 produce samples distributed approximately as the target distribution within a reasonable running time. To this end, we consider the simple task of sampling from the density defined by
where is a given vector. This density represents the mixture with equal weights of two Gaussian densities and