There has been recently an increasing interest in Bayesian inference applied to high-dimensional models often motivated by machine learning applications. Rather than obtaining a point estimate, Bayesian methods attempt to sample the full posterior distribution over the parameters and possibly latent variables which provides a way to assert uncertainty in the model and prevents from overfitting[neal:1992], [welling:the:2011].
The problem can be formulated as follows. We aim at sampling a posterior distribution on , , with density w.r.t. the Lebesgue measure, where is continuously differentiable. The Langevin stochastic differential equation associated with is defined by:
where is a -dimensional Brownian motion defined on the filtered probability space , satisfying the usual conditions. Under mild technical conditions, the Langevin diffusion admits as its unique invariant distribution.
We study the sampling method based on the Euler-Maruyama discretization of (1). This scheme defines the (possibly) non-homogeneous, discrete-time Markov chain given by
where is an i.i.d. sequence of
-dimensional standard Gaussian random variables andis a sequence of step sizes, which can either be held constant or be chosen to decrease to . This algorithm has been first proposed by [ermak:1975] and [parisi:1981] for molecular dynamics applications. Then it has been popularized in machine learning by [grenander:1983], [grenander:miller:1994] and computational statistics by [neal:1992] and [roberts:tweedie:1996]. Following [roberts:tweedie:1996], in the sequel this method will be refer to as the unadjusted Langevin algorithm (ULA). When the step sizes are held constant, under appropriate conditions on , the homogeneous Markov chain has a unique stationary distribution , which in most of the cases differs from the distribution . It has been proposed in [rossky:doll:friedman:1978] and [roberts:tweedie:1996] to use a Metropolis-Hastings step at each iteration to enforce reversibility w.r.t. . This algorithm is referred to as the Metropolis adjusted Langevin algorithm (MALA).
The ULA algorithm has already been studied in depth for constant step sizes in [talay:tubaro:1991], [roberts:tweedie:1996] and [mattingly:stuart:higham:2002]. In particular, [talay:tubaro:1991, Theorem 4] gives an asymptotic expansion for the weak error between and . For sequence of step sizes such that and , weak convergence of the weighted empirical distribution of the ULA algorithm has been established in [lamberton:pages:2002], [lamberton:pages:2003] and [lemaire:2005].
Contrary to these previously reported works, we focus in this paper on non-asymptotic results. More precisely, we obtain explicit bounds between the distribution of the -th iterate of the Markov chain defined in (2) and the target distribution in Wasserstein and total variation distance for nonincreasing step sizes. When the sequence of step sizes is held constant for all , both fixed horizon (the total computational budget is fixed and the step size is chosen to minimize the upper bound on the Wasserstein or total variation distance) and fixed precision (for a fixed target precision, the number of iterations and the step size are optimized simultaneously to meet this constraint) strategies are studied. In addition, quantitative estimates between and are obtained. When and , we show that the marginal distribution of the non-homogeneous Markov chain converges to the target distribution and provide explicit convergence bounds. A special attention is paid to the dependency of the proposed upper bounds on the dimension of the state space, since we are particularly interested in the application of these methods to sampling in high-dimension.
Compared to [dalalyan:2014] and [durmus:moulines:2016], our contribution is twofold. First, we report sharper convergence bounds for constant and non-increasing step sizes. In particular, it is shown that the number of iterations required to reach the target distribution with a precision in total variation is upper bounded (up to logarithmic factors) by , where is the dimension, under appropriate regularity assumption (in [durmus:moulines:2016], the upper bound was shown to be ). We show that our result is optimal (up to logarithmic factors again) in the case of the standard
-dimensional Gaussian distribution, for which explicit computation can be done. Besides, we establish computable bound for the mean square error associated with ULA for measurable bounded functions, which is an important result in Bayesian statistics to estimate credibility regions. For that purpose, we study the convergence of the Euler-Maruyama discretization towards its stationary distribution in total variation using a discrete time version of reflection coupling introduced in[bubley:dyer:jerrum:1998].
The paper is organized as follows. In Section 2, we study the convergence in the Wasserstein distance of order of the Euler discretization for constant and decreasing step sizes. In Section 3, we give non asymptotic bounds in total variation distance between the Euler discretization and . This study is completed in Section 4 by non-asymptotic bounds of convergence of the weighted empirical measure applied to bounded and measurable functions. Our claims are supported in a Bayesian inference for a binary regression model in Section 5. The proofs are given in Section 6. Finally in Section 8
, some results of independent interest, used in the proofs, on functional autoregressive models are gathered. Some technical proofs and derivations are postponed and carried out in a supplementary paper[durmus:moulines:2015:supplement].
Notations and conventions
Denote by the Borel -field of , the set of all Borel measurable functions on and for , . For a probability measure on and a -integrable function, denote by the integral of w.r.t. . We say that is a transference plan of and if it is a probability measure on such that for all measurable set of , and . We denote by the set of transference plans of and . Furthermore, we say that a couple of -random variables is a coupling of and if there exists such that are distributed according to . For two probability measures and , we define the Wasserstein distance of order as
By [VillaniTransport, Theorem 4.1], for all probability measures on , there exists a transference plan such that for any coupling distributed according to , . This kind of transference plan (respectively coupling) will be called an optimal transference plan (respectively optimal coupling) associated with . We denote by the set of probability measures with finite
-moment: for all, . By [VillaniTransport, Theorem 6.16], equipped with the Wasserstein distance of order is a complete separable metric space.
Let be a Lipschitz function, namely there exists such that for all , . Then we denote
The Monge-Kantorovich theorem (see [VillaniTransport, Theorem 5.9]) implies that for all probabilities measure on ,
Denote by the set of all bounded Borel measurable functions on . For set . For two probability measures and on , the total variation distance distance between and is defined by . By the Monge-Kantorovich theorem the total variation distance between and can be written on the form:
where . For all and , we denote by , the ball centered at of radius . For a subset , denote by the complementary of . Let and be a -matrix, then denote by the transpose of and the operator norm associated with defined by . Define the Frobenius norm associated with by . Let and be a twice continuously differentiable function. Denote by and the Jacobian and the Hessian of respectively. Denote also by
the vector Laplacian ofdefined by: for all , is the vector of such that for all , the -th component of is equals to . In the sequel, we take the convention that and for , .
2 Non-asymptotic bounds in Wasserstein distance of order for ULA
Consider the following assumption on the potential :
The function is continuously differentiable on and gradient Lipschitz: there exists such that for all , .
Under H 1, for all by [karatzas:shreve:1991, Theorem 2.5, Theorem 2.9 Chapter 5] there exists a unique strong solution to (1) with . Denote by the semi-group associated with (1). It is well-known that is its (unique) invariant probability. To get geometric convergence of to in Wasserstein distance of order , we make the following additional assumption on the potential .
is strongly convex, i.e. there exists such that for all ,
We briefly summarize some background material on the stability and the convergence in of the overdamped Langevin diffusion under H 1 and H 2. Most of the statements in Section 2 are known and are recalled here for ease of references; see e.g. [chen:shao:1989].
The proof is given in the supplementary document [durmus:moulines:2015:supplement, LABEL:proof:theo:convergence-WZ-strongly-convex]. ∎
For , consider the Markov kernel given for all and by
The process given in (2) is an inhomogeneous Markov chain with respect to the family of Markov kernels . For , , define
with the convention that for , , is the identity operator.
We first derive a Foster-Lyapunov drift condition for , , .
The proof is postponed to [durmus:moulines:2015:supplement, LABEL:proof:theo:kind_drift]. ∎
We now proceed to establish the contraction property of the sequence in . This result implies the geometric convergence of the sequence to in for all . Note that the convergence rate is again independent of the dimension.
The proof is postponed to [durmus:moulines:2015:supplement, LABEL:sec:proof:convergence_p_Euler] ∎
We now proceed to establish explicit bounds for , with . Since is invariant for for all , it suffices to get some bounds on , with and take
. To do so, we construct a coupling between the diffusion and the linear interpolation of the Euler discretization. An obvious candidate is the synchronous couplingdefined for all and by
with is distributed according to , and is given in (5). Therefore since for all , , taking , we derive an explicit bound on the Wasserstein distance between the sequence of distributions and the stationary measure of the Langevin diffusion (1).
Let , and . Let with distributed according to and defined by (9). By definition of and since for all , is invariant for , . Section 6.1 with , Section 2-2 imply, using a straightforward induction, that for all
where is given by (10), and
We preface the proof by a technical lemma which is established in the supplementary document LABEL:sec:proof-lem_rec_2.
Let be a sequence of nonincreasing real numbers, and . Then for all , and ,
Proof of Section 2.
By Theorem 1, it suffices to show that and , defined by (10) and (11) respectively, goes to as . Using the bound for , and , we have . Since is nonincreasing, note that to show that , it suffices to prove . But since is nonincreasing, there exists such that and by Section 2 applied with the integer part of :
Since , by the Cesáro theorem, . Therefore since , , and the conclusion follows from combining in (15), this limit, , and . ∎
In the case of constant step sizes for all , we can deduce from Theorem 1, a bound between and the stationary distribution of .
We can improve the bound provided by Theorem 1 under additional regularity assumptions on the potential .
The potential is three times continuously differentiable and there exists such that for all , .
In the case of constant step sizes for all , we can deduce from Theorem 2, a sharper bound between and the stationary distribution of .
The proof follows the same line as the proof of Section 2 and is omitted. ∎
Using Section 2-2 and Section 2 or Section 2, given , we determine the smallest number of iterations and an associated step-size satisfying for all . The dependencies on the dimension and the precision of based on Theorem 1 and Theorem 2 are reported in Table 1. Under H 1 and H 2, the complexity we get is the same as the one established in [durmus:moulines:2016] for the total variation distance. However if in addition H 3 holds, we improve this complexity with respect to the precision parameter as well. Finally, in the case where (for example for non-degenerate -dimensional Gaussian distributions), then the dependency on the dimension of the number of iterations given by Theorem 2 turns out to be only of order .
Details and further discussions are included in the supplementary paper [durmus:moulines:2015:supplement, LABEL:sec:disc-crefth-LABEL:sec:disc-crefth-1-LABEL:sec:gener-crefth-1]. In particular, the dependencies of the obtained bounds with respect to the constants and which appear in H 1, H 2 are evidenced.
In a recent work [dalalyan:2017] (based on a previous version of this paper), an improvement of the proof of Theorem 1 has been proposed for constant step size. Whereas the constants are sharper, the dependence with respect to the dimension and the precision parameter is the same (first line of Table 1).
|Theorem 1 and Section 2-2|
|Theorem 2 and Section 2-2|
For simplicity, consider sequences defined for all by , for and . Then for , , and (see [durmus:moulines:2015:supplement, LABEL:sec:gener-crefth-1, LABEL:sec:disc-crefth-1 and LABEL:sec:disc-crefth] for details). Based on these upper bounds, we can obtained the convergence rates provided by Theorem 1 and Theorem 2 for this particular setting, see Table 2.
3 Quantitative bounds in total variation distance
We deal in this section with quantitative bounds in total variation distance. For Bayesian inference application, total variation bounds are of importance for computing highest posterior density (HPD) credible regions and intervals. For computing such bounds we will use the results of Section 2 combined with the regularizing property of the semigroup . Under H 2, define for all by
The first key result consists in upper-bounding the total variation distance for . To that purpose, we use the coupling by reflection; see [lindvall:rogers:1986] and [chen:shao:1989, Example 3.7]. This coupling is defined as the unique solution of the SDE:
with , , for and otherwise. Define the coupling time . By construction for . By Levy’s characterization, is a -dimensional Brownian motion, therefore and are weak solutions to (1) started at and respectively. Then by Lindvall’s inequality, for all we have .
We now bound for , . For , denoting by