High-dimensional Bayesian inference via the Unadjusted Langevin Algorithm

05/05/2016 ∙ by Alain Durmus, et al. ∙ Ecole Polytechnique

We consider in this paper the problem of sampling a high-dimensional probability distribution π having a density with respect to the Lebesgue measure on R^d, known up to a normalisation factor, x ↦ e^{-U(x)}/∫_{R^d} e^{-U(y)} dy. Such a problem naturally occurs for example in Bayesian inference and machine learning. Under the assumption that U is continuously differentiable, ∇U is globally Lipschitz and U is strongly convex, we obtain non-asymptotic bounds for the convergence to stationarity in Wasserstein distance of order 2 and in total variation distance of the sampling method based on the Euler discretization of the Langevin stochastic differential equation, for both constant and decreasing step sizes. The dependence of the obtained bounds on the dimension of the state space is studied to demonstrate the applicability of this method. The convergence of an appropriately weighted empirical measure is also investigated, and bounds for the mean square error and exponential deviation inequalities are reported for functions which are either Lipschitz continuous or measurable and bounded. An illustration is provided for Bayesian inference in a binary regression model.


1 Introduction

There has recently been increasing interest in Bayesian inference applied to high-dimensional models, often motivated by machine learning applications. Rather than obtaining a point estimate, Bayesian methods attempt to sample the full posterior distribution over the parameters and possibly latent variables, which provides a way to assess uncertainty in the model and helps prevent overfitting [neal:1992], [welling:the:2011].

The problem can be formulated as follows. We aim at sampling a posterior distribution π on R^d with density x ↦ e^{-U(x)}/∫_{R^d} e^{-U(y)} dy w.r.t. the Lebesgue measure, where U is continuously differentiable. The Langevin stochastic differential equation associated with π is defined by:

dY_t = -∇U(Y_t) dt + √2 dB_t ,   (1)

where (B_t)_{t ≥ 0} is a d-dimensional Brownian motion defined on the filtered probability space (Ω, F, (F_t)_{t ≥ 0}, P), satisfying the usual conditions. Under mild technical conditions, the Langevin diffusion admits π as its unique invariant distribution.

We study the sampling method based on the Euler-Maruyama discretization of (1). This scheme defines the (possibly) non-homogeneous, discrete-time Markov chain (X_k)_{k ≥ 0} given by

X_{k+1} = X_k − γ_{k+1} ∇U(X_k) + √(2 γ_{k+1}) Z_{k+1} ,   (2)

where (Z_k)_{k ≥ 1} is an i.i.d. sequence of d-dimensional standard Gaussian random variables and (γ_k)_{k ≥ 1} is a sequence of step sizes, which can either be held constant or be chosen to decrease to 0. This algorithm was first proposed by [ermak:1975] and [parisi:1981] for molecular dynamics applications. It was then popularized in machine learning by [grenander:1983], [grenander:miller:1994] and in computational statistics by [neal:1992] and [roberts:tweedie:1996]. Following [roberts:tweedie:1996], in the sequel this method will be referred to as the unadjusted Langevin algorithm (ULA). When the step sizes are held constant, γ_k = γ for all k ≥ 1, under appropriate conditions on U the homogeneous Markov chain (X_k)_{k ≥ 0} has a unique stationary distribution π_γ, which in most cases differs from the target distribution π. It has been proposed in [rossky:doll:friedman:1978] and [roberts:tweedie:1996] to use a Metropolis-Hastings step at each iteration to enforce reversibility w.r.t. π. This algorithm is referred to as the Metropolis adjusted Langevin algorithm (MALA).
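
To make the recursion (2) concrete, the following Python sketch implements ULA with a user-supplied gradient; the names (`ula`, `grad_U`) and the Gaussian example potential are illustrative choices added here, not taken from the paper.

```python
import numpy as np

def ula(grad_U, x0, step_sizes, rng=None):
    """Unadjusted Langevin algorithm, equation (2):
    X_{k+1} = X_k - gamma_{k+1} grad U(X_k) + sqrt(2 gamma_{k+1}) Z_{k+1}."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    path = np.empty((len(step_sizes) + 1, x.size))
    path[0] = x
    for k, gamma in enumerate(step_sizes):
        z = rng.standard_normal(x.size)
        x = x - gamma * grad_U(x) + np.sqrt(2.0 * gamma) * z
        path[k + 1] = x
    return path

# Example: U(x) = ||x||^2 / 2, so grad U(x) = x and pi = N(0, I_d).
d, gamma, n = 10, 0.05, 10_000
chain = ula(lambda x: x, np.zeros(d), [gamma] * n, rng=np.random.default_rng(0))
print(chain[n // 2:].var())  # slightly above 1: the chain targets pi_gamma, not pi
```

The last line illustrates numerically the point made above: for a constant step size the chain has its own stationary distribution π_γ, whose variance here slightly exceeds that of π.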

The ULA algorithm has already been studied in depth for constant step sizes in [talay:tubaro:1991], [roberts:tweedie:1996] and [mattingly:stuart:higham:2002]. In particular, [talay:tubaro:1991, Theorem 4] gives an asymptotic expansion for the weak error between π_γ and π. For sequences of step sizes satisfying lim_{k→∞} γ_k = 0 and Σ_{k ≥ 1} γ_k = ∞, weak convergence of the weighted empirical distribution of the ULA algorithm has been established in [lamberton:pages:2002], [lamberton:pages:2003] and [lemaire:2005].

Contrary to these previously reported works, we focus in this paper on non-asymptotic results. More precisely, we obtain explicit bounds between the distribution of the n-th iterate of the Markov chain defined in (2) and the target distribution π in Wasserstein and total variation distance for nonincreasing step sizes. When the sequence of step sizes is held constant, γ_k = γ for all k ≥ 1, both fixed horizon (the total computational budget is fixed and the step size is chosen to minimize the upper bound on the Wasserstein or total variation distance) and fixed precision (for a fixed target precision, the number of iterations and the step size are optimized simultaneously to meet this constraint) strategies are studied. In addition, quantitative estimates between π_γ and π are obtained. When lim_{k→∞} γ_k = 0 and Σ_{k ≥ 1} γ_k = ∞, we show that the marginal distribution of the non-homogeneous Markov chain (X_k)_{k ≥ 0} converges to the target distribution π and provide explicit convergence bounds. Special attention is paid to the dependency of the proposed upper bounds on the dimension of the state space, since we are particularly interested in the application of these methods to sampling in high dimension.

Compared to [dalalyan:2014] and [durmus:moulines:2016], our contribution is twofold. First, we report sharper convergence bounds for constant and non-increasing step sizes. In particular, it is shown that the number of iterations required to reach the target distribution with a precision ε > 0 in total variation is upper bounded (up to logarithmic factors) by O(d ε^{-2}), where d is the dimension, under appropriate regularity assumptions (improving on the corresponding bound of [durmus:moulines:2016]). We show that our result is optimal (up to logarithmic factors again) in the case of the standard d-dimensional Gaussian distribution, for which explicit computation can be done. Besides, we establish computable bounds for the mean square error associated with ULA for measurable bounded functions, which is an important result in Bayesian statistics to estimate credible regions. For that purpose, we study the convergence of the Euler-Maruyama discretization towards its stationary distribution in total variation using a discrete-time version of the reflection coupling introduced in [bubley:dyer:jerrum:1998].

The paper is organized as follows. In Section 2, we study the convergence in the Wasserstein distance of order 2 of the Euler discretization for constant and decreasing step sizes. In Section 3, we give non-asymptotic bounds in total variation distance between the Euler discretization and π. This study is completed in Section 4 by non-asymptotic convergence bounds for the weighted empirical measure applied to bounded and measurable functions. Our claims are supported by a Bayesian inference experiment for a binary regression model in Section 5. The proofs are given in Section 6. Finally, in Section 8, some results of independent interest on functional autoregressive models, used in the proofs, are gathered. Some technical proofs and derivations are postponed to a supplementary paper [durmus:moulines:2015:supplement].

Notations and conventions

Denote by B(R^d) the Borel σ-field of R^d, F(R^d) the set of all Borel measurable functions on R^d and, for f ∈ F(R^d), ||f||_∞ = sup_{x ∈ R^d} |f(x)|. For μ a probability measure on (R^d, B(R^d)) and f a μ-integrable function, denote by μ(f) the integral of f w.r.t. μ. We say that ζ is a transference plan of μ and ν if it is a probability measure on (R^d × R^d, B(R^d × R^d)) such that for all measurable sets A of R^d, ζ(A × R^d) = μ(A) and ζ(R^d × A) = ν(A). We denote by Π(μ, ν) the set of transference plans of μ and ν. Furthermore, we say that a couple of R^d-valued random variables (X, Y) is a coupling of μ and ν if there exists ζ ∈ Π(μ, ν) such that (X, Y) is distributed according to ζ. For two probability measures μ and ν, we define the Wasserstein distance of order p ≥ 1 as

W_p(μ, ν) = ( inf_{ζ ∈ Π(μ, ν)} ∫_{R^d × R^d} ||x − y||^p dζ(x, y) )^{1/p} .

By [VillaniTransport, Theorem 4.1], for all probability measures μ and ν on R^d, there exists a transference plan ζ* such that for any coupling (X, Y) distributed according to ζ*, W_p(μ, ν) = E[||X − Y||^p]^{1/p}. This kind of transference plan (respectively coupling) will be called an optimal transference plan (respectively optimal coupling) associated with W_p. We denote by P_p(R^d) the set of probability measures with finite p-th moment: for all μ ∈ P_p(R^d), ∫_{R^d} ||x||^p dμ(x) < +∞. By [VillaniTransport, Theorem 6.16], P_p(R^d) equipped with the Wasserstein distance of order p is a complete separable metric space.
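
As a concrete numerical illustration (added here, not part of the paper): in dimension one and for two samples of equal size, an optimal transference plan between the empirical measures pairs order statistics, so the empirical Wasserstein distance reduces to a sort.

```python
import numpy as np

def empirical_wp_1d(xs, ys, p=2):
    """W_p between two equal-size empirical measures on R: pair order statistics."""
    xs, ys = np.sort(xs), np.sort(ys)
    return np.mean(np.abs(xs - ys) ** p) ** (1.0 / p)

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 100_000)  # sample from N(0, 1)
b = rng.normal(1.0, 1.0, 100_000)  # sample from N(1, 1)
print(empirical_wp_1d(a, b))       # close to W_2(N(0,1), N(1,1)) = 1
```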

Let f be a Lipschitz function, namely such that there exists C ≥ 0 with |f(x) − f(y)| ≤ C ||x − y|| for all x, y ∈ R^d. Then we denote

||f||_Lip = inf { C ≥ 0 : |f(x) − f(y)| ≤ C ||x − y|| for all x, y ∈ R^d } .

The Monge-Kantorovich theorem (see [VillaniTransport, Theorem 5.9]) implies that for all probability measures μ and ν on R^d,

W_1(μ, ν) = sup { |μ(f) − ν(f)| : ||f||_Lip ≤ 1 } .

Denote by F_b(R^d) the set of all bounded Borel measurable functions on R^d. For f ∈ F_b(R^d) set osc(f) = sup_{x, y ∈ R^d} |f(x) − f(y)|. For two probability measures μ and ν on R^d, the total variation distance between μ and ν is defined by ||μ − ν||_TV = sup_{A ∈ B(R^d)} |μ(A) − ν(A)|. By the Monge-Kantorovich theorem, the total variation distance between μ and ν can be written in the form

||μ − ν||_TV = inf_{ζ ∈ Π(μ, ν)} ζ(Δ^c) ,

where Δ = {(x, y) ∈ R^d × R^d : x = y}. For all x ∈ R^d and M > 0, we denote by B(x, M) the ball centered at x of radius M. For a subset A ⊂ R^d, denote by A^c the complement of A. Let A be a d × d matrix; denote by A^T the transpose of A and by ||A|| the operator norm associated with A, defined by ||A|| = sup_{||x|| = 1} ||A x||. Define the Frobenius norm of A by ||A||_F^2 = Σ_{i,j=1}^{d} A_{i,j}^2. Let p ≥ 1 and f : R^d → R^p be a twice continuously differentiable function. Denote by Jf and ∇²f the Jacobian and the Hessian of f respectively. Denote also by Δ⃗f the vector Laplacian of f defined by: for all x ∈ R^d, Δ⃗f(x) is the vector of R^p such that for all i ∈ {1, …, p}, the i-th component of Δ⃗f(x) is equal to Σ_{j=1}^{d} (∂² f_i / ∂x_j²)(x). In the sequel, we take the convention that Σ_{k=n}^{p} = 0 and Π_{k=n}^{p} = 1 for n, p ∈ N with n > p.

2 Non-asymptotic bounds in Wasserstein distance of order 2 for ULA

Consider the following assumption on the potential U:

H 1.

The function U is continuously differentiable on R^d and gradient Lipschitz: there exists L ≥ 0 such that for all x, y ∈ R^d, ||∇U(x) − ∇U(y)|| ≤ L ||x − y||.

Under H 1, for all x ∈ R^d, by [karatzas:shreve:1991, Theorem 2.5, Theorem 2.9, Chapter 5] there exists a unique strong solution (Y_t)_{t ≥ 0} to (1) with Y_0 = x. Denote by (P_t)_{t ≥ 0} the semi-group associated with (1). It is well known that π is its (unique) invariant probability measure. To get geometric convergence of (P_t)_{t ≥ 0} to π in Wasserstein distance of order 2, we make the following additional assumption on the potential U.

H 2.

U is strongly convex, i.e. there exists m > 0 such that for all x, y ∈ R^d,

U(y) ≥ U(x) + ⟨∇U(x), y − x⟩ + (m/2) ||x − y||² .

Under H 2, [nesterov:2004, Theorem 2.1.8] shows that U has a unique minimizer x* ∈ R^d. If in addition H 1 holds, then [nesterov:2004, Theorem 2.1.12, Theorem 2.1.9] show that for all x, y ∈ R^d:

⟨∇U(x) − ∇U(y), x − y⟩ ≥ (κ/2) ||x − y||² + (1/(m + L)) ||∇U(x) − ∇U(y)||² ,   (3)

where

κ = 2mL/(m + L) .   (4)
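
For a concrete instance of H 1 and H 2 (an example added here, not from the paper): the Gaussian potential U(x) = x^T Σ^{-1} x / 2 satisfies both assumptions, with L and m the largest and smallest eigenvalues of the precision matrix Σ^{-1}. The sketch below computes m, L and κ from (4).

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
Sigma_inv = A @ A.T + np.eye(5)      # a positive-definite precision matrix
eig = np.linalg.eigvalsh(Sigma_inv)  # eigenvalues in ascending order
m, L = eig[0], eig[-1]               # strong-convexity / gradient-Lipschitz constants
kappa = 2.0 * m * L / (m + L)        # kappa from (4)
print(f"m = {m:.3f}, L = {L:.3f}, kappa = {kappa:.3f}")
```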

We briefly summarize some background material on the stability and the convergence in W_2 of the overdamped Langevin diffusion under H 1 and H 2. Most of the statements in Section 2 are known and are recalled here for ease of reference; see e.g. [chen:shao:1989].

Proposition .

Assume H 1 and H 2.

  1. For all and ,

  2. The stationary distribution satisfies .

  3. For any and , .

  4. For any and , .

Proof.

The proof is given in the supplementary document [durmus:moulines:2015:supplement, LABEL:proof:theo:convergence-WZ-strongly-convex]. ∎

Note that the convergence rate in Section 2-4 is independent of the dimension. Let (γ_k)_{k ≥ 1} be a sequence of positive and non-increasing step sizes and, for n, p ∈ N with n ≤ p, denote by

Γ_{n,p} = Σ_{k=n}^{p} γ_k ,  Γ_n = Γ_{1,n} .   (5)

For γ > 0, consider the Markov kernel R_γ given for all A ∈ B(R^d) and x ∈ R^d by

R_γ(x, A) = ∫_A (4πγ)^{−d/2} exp( −||y − x + γ∇U(x)||² / (4γ) ) dy .   (6)
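
Since R_γ(x, ·) is the Gaussian N(x − γ∇U(x), 2γ I_d), its log-density is available in closed form; this is exactly what the Metropolis-Hastings correction of MALA, mentioned in the introduction, requires. The sketch below is an illustration added here, not code from the paper.

```python
import numpy as np

def log_r_gamma(x, y, gamma, grad_U):
    """log-density at y of the ULA kernel (6): N(x - gamma grad U(x), 2 gamma I)."""
    mean = x - gamma * grad_U(x)
    return (-0.5 * x.size * np.log(4.0 * np.pi * gamma)
            - np.sum((y - mean) ** 2) / (4.0 * gamma))

def mala_step(x, gamma, U, grad_U, rng):
    """One MALA step: a ULA proposal, accepted or rejected so that pi is preserved."""
    prop = x - gamma * grad_U(x) + np.sqrt(2.0 * gamma) * rng.standard_normal(x.size)
    log_acc = (U(x) - U(prop)
               + log_r_gamma(prop, x, gamma, grad_U)
               - log_r_gamma(x, prop, gamma, grad_U))
    return prop if np.log(rng.uniform()) < log_acc else x
```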

The process (X_k)_{k ≥ 0} given in (2) is an inhomogeneous Markov chain with respect to the family of Markov kernels (R_{γ_k})_{k ≥ 1}. For n, p ∈ N*, n ≤ p, define

Q^{n,p} = R_{γ_n} R_{γ_{n+1}} ⋯ R_{γ_p} ,   (7)

with the convention that for n, p ∈ N*, n > p, Q^{n,p} is the identity operator.

We first derive a Foster-Lyapunov drift condition for Q^{n,p}, n, p ∈ N*, n ≤ p.

Proposition .

Assume H 1 and H 2.

  1. Let (γ_k)_{k ≥ 1} be a nonincreasing sequence with γ_1 ≤ 2/(m + L). Let x* be the unique minimizer of U. Then for all n ∈ N and x ∈ R^d,

     where the constant is given by

     (8)

     and κ is defined in (4).

  2. For any γ ∈ (0, 2/(m + L)], R_γ has a unique stationary distribution π_γ and

Proof.

The proof is postponed to [durmus:moulines:2015:supplement, LABEL:proof:theo:kind_drift]. ∎

We now proceed to establish the contraction property of the kernels Q^{n,p} in W_2. This result implies the geometric convergence of the sequence (δ_x R_γ^n)_{n ≥ 0} to π_γ in W_2 for all x ∈ R^d. Note that the convergence rate is again independent of the dimension.

Proposition .

Assume H 1 and H 2. Then,

  1. Let (γ_k)_{k ≥ 1} be a nonincreasing sequence with γ_1 ≤ 2/(m + L). For all n, p ∈ N*, n ≤ p, and x, y ∈ R^d,

  2. For any γ ∈ (0, 2/(m + L)], for all n ∈ N and x, y ∈ R^d,

Proof.

The proof is postponed to [durmus:moulines:2015:supplement, LABEL:sec:proof:convergence_p_Euler]. ∎

Corollary .

Assume H 1 and H 2. Let (γ_k)_{k ≥ 1} be a nonincreasing sequence with γ_1 ≤ 2/(m + L). Then for all Lipschitz functions f and all n ≥ 1, Q^{1,n} f is a Lipschitz function with .

Proof.

The proof follows from Section 2-1. ∎

We now proceed to establish explicit bounds for W_2(μ_0 Q^{1,n}, π), with μ_0 ∈ P_2(R^d). Since π is invariant for P_t for all t ≥ 0, it suffices to get some bounds on W_2(μ_0 Q^{1,n}, μ_0 P_{Γ_n}), with μ_0 ∈ P_2(R^d), and take μ_0 = π. To do so, we construct a coupling between the diffusion and the linear interpolation of the Euler discretization. An obvious candidate is the synchronous coupling (Y_t, Ȳ_t)_{t ≥ 0}, defined for all n ≥ 0 and t ∈ [Γ_n, Γ_{n+1}] by

dY_t = −∇U(Y_t) dt + √2 dB_t
dȲ_t = −∇U(Ȳ_{Γ_n}) dt + √2 dB_t ,   (9)

with Y_0 distributed according to π, Ȳ_0 distributed according to μ_0, and Γ_n given in (5). Therefore, since π P_t = π for all t ≥ 0, taking μ_0 = δ_x, we derive an explicit bound on the Wasserstein distance between the sequence of distributions (δ_x Q^{1,n})_{n ≥ 1} and the stationary measure π of the Langevin diffusion (1).
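
The synchronous coupling (9) is straightforward to simulate approximately: run the diffusion on a much finer Euler grid and drive the ULA chain with the same Brownian increments, aggregated over each coarse step. The sketch below is a construction added here under these assumptions (the fine grid only approximates the continuous path, and both processes are started at a point rather than drawing Y_0 from π).

```python
import numpy as np

def synchronous_coupling(grad_U, x0, gamma, n_steps, n_fine=100, rng=None):
    """Langevin diffusion (fine Euler grid) and ULA (coarse grid) driven by
    the same Brownian motion, in the spirit of the synchronous coupling (9)."""
    rng = np.random.default_rng() if rng is None else rng
    x0 = np.array(x0, dtype=float)
    y, x = x0.copy(), x0.copy()                  # diffusion / ULA chain
    dt = gamma / n_fine
    for _ in range(n_steps):
        db_sum = np.zeros_like(x0)
        for _ in range(n_fine):                  # fine steps approximating the diffusion
            db = np.sqrt(dt) * rng.standard_normal(x0.size)
            y = y - dt * grad_U(y) + np.sqrt(2.0) * db
            db_sum += db
        x = x - gamma * grad_U(x) + np.sqrt(2.0) * db_sum   # one coarse ULA step
    return y, x

# Monte Carlo estimate of E||Y_{Gamma_n} - X_n||^2 for U(x) = ||x||^2 / 2.
rng = np.random.default_rng(2)
sq = [np.sum(np.subtract(*synchronous_coupling(lambda z: z, np.ones(5), 0.05, 200,
                                               rng=rng)) ** 2) for _ in range(100)]
print(np.mean(sq))
```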

Theorem 1.

Assume H 1 and H 2. Let (γ_k)_{k ≥ 1} be a nonincreasing sequence with γ_1 ≤ 2/(m + L). Then for all x ∈ R^d and n ≥ 1,

where

(10)

κ is defined in (4) and

(11)
Proof.

Let x ∈ R^d and n ≥ 1. Let (Y_t, Ȳ_t)_{t ≥ 0} be defined by (9), with Y_0 distributed according to π and Ȳ_0 = x. By definition of the coupling, and since π is invariant for P_t for all t ≥ 0, W_2²(δ_x Q^{1,n}, π) ≤ E[||Ȳ_{Γ_n} − Y_{Γ_n}||²]. Section 6.1 and Section 2-2 imply, using a straightforward induction, that for all n ≥ 1,

(12)

where is given by (10), and

(13)

with

Since Y_0 is distributed according to π, Section 2-2 shows that for all t ≥ 0,

(14)

In addition, since π P_t = π for all t ≥ 0, using Section 2-2, we get . Plugging this result, (14) and (13) into (12) completes the proof. ∎

Corollary .

Assume H 1 and H 2. Let (γ_k)_{k ≥ 1} be a nonincreasing sequence with γ_1 ≤ 2/(m + L). Assume that lim_{k→∞} γ_k = 0 and Σ_{k ≥ 1} γ_k = ∞. Then for all x ∈ R^d,

lim_{n→∞} W_2(δ_x Q^{1,n}, π) = 0 .

We preface the proof with a technical lemma, which is established in the supplementary document [durmus:moulines:2015:supplement, LABEL:sec:proof-lem_rec_2].

Lemma .

Let (γ_k)_{k ≥ 1} be a nonincreasing sequence of real numbers, and . Then for all , and ,

Proof of Section 2.

By Theorem 1, it suffices to show that the two quantities defined by (10) and (11) respectively go to 0 as n → ∞. Using the bound for , and , we have . Since (γ_k)_{k ≥ 1} is nonincreasing, note that to show that , it suffices to prove . But since (γ_k)_{k ≥ 1} is nonincreasing, there exists such that , and by Section 2 applied with the integer part of :

(15)

Since lim_{k→∞} γ_k = 0, by the Cesàro theorem, . Therefore, since Σ_{k ≥ 1} γ_k = ∞, , and the conclusion follows from combining this limit, (15), and . ∎

In the case of constant step sizes, γ_k = γ for all k ≥ 1, we can deduce from Theorem 1 a bound between π and the stationary distribution π_γ of R_γ.

Corollary .

Assume H 1 and H 2. Let (γ_k)_{k ≥ 1} be a constant sequence, γ_k = γ for all k ≥ 1, with γ ≤ 2/(m + L). Then

Proof.

Since, by Section 2, for all x ∈ R^d, δ_x Q^{1,n} converges to π_γ in W_2 as n → ∞, the proof then follows from Theorem 1 and Section 2 applied with γ_k = γ for all k ≥ 1. ∎
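
For the standard Gaussian target, the bias quantified by the corollary above can be computed exactly; the following worked example is added here for concreteness. With U(x) = ||x||²/2, the ULA chain is the AR(1) recursion X_{k+1} = (1 − γ) X_k + √(2γ) Z_{k+1}, so π_γ = N(0, σ_γ² I_d) with σ_γ² = 2γ/(1 − (1 − γ)²) = 2/(2 − γ), and therefore W_2(π_γ, π) = √d |σ_γ − 1|.

```python
import numpy as np

d = 100
for gamma in [0.2, 0.1, 0.05, 0.025]:
    sigma_gamma = np.sqrt(2.0 / (2.0 - gamma))  # stationary std of the ULA chain
    w2 = np.sqrt(d) * abs(sigma_gamma - 1.0)    # exact W_2(pi_gamma, pi)
    print(f"gamma = {gamma:.3f}   W2 = {w2:.4f}")
```

The printed distance is close to √d γ/4 and roughly halves when γ is halved, i.e. it is of order γ rather than √γ, consistent with the sharper rate obtained below under additional smoothness (a Gaussian potential has a constant Hessian).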

We can improve the bound provided by Theorem 1 under an additional regularity assumption on the potential U.

H 3.

The potential U is three times continuously differentiable and there exists L̃ ≥ 0 such that for all x, y ∈ R^d, ||∇²U(x) − ∇²U(y)|| ≤ L̃ ||x − y||.

Note that under H 1 and H 3, we have that for all x, y ∈ R^d,

(16)
Theorem 2.

Assume H 1, H 2 and H 3. Let (γ_k)_{k ≥ 1} be a nonincreasing sequence with γ_1 ≤ 2/(m + L). Then for all x ∈ R^d and n ≥ 1,

where is given by (10), κ in (4) and

(17)
Proof.

The proof is along the same lines as that of Theorem 1, using Section 6.2 in place of Section 6.1. ∎

In the case of constant step sizes, γ_k = γ for all k ≥ 1, we can deduce from Theorem 2 a sharper bound between π and the stationary distribution π_γ of R_γ.

Corollary .

Assume H 1, H 2 and H 3. Let (γ_k)_{k ≥ 1} be a constant sequence, γ_k = γ for all k ≥ 1, with γ ≤ 2/(m + L). Then

Proof.

The proof follows the same lines as the proof of Section 2 and is omitted. ∎

Using Section 2-2 and Section 2 or Section 2, given a target precision ε > 0, we determine the smallest number of iterations n_ε and an associated step size γ_ε satisfying W_2(δ_{x*} Q^{1,n}, π) ≤ ε for all n ≥ n_ε. The dependencies of n_ε on the dimension d and the precision ε based on Theorem 1 and Theorem 2 are reported in Table 1. Under H 1 and H 2, the complexity we get is the same as the one established in [durmus:moulines:2016] for the total variation distance. However, if in addition H 3 holds, we improve this complexity with respect to the precision parameter ε as well. Finally, in the case where L̃ = 0 (for example for non-degenerate d-dimensional Gaussian distributions), the dependency on the dimension of the number of iterations given by Theorem 2 turns out to be only of order .

Details and further discussions are included in the supplementary paper [durmus:moulines:2015:supplement, LABEL:sec:disc-crefth-LABEL:sec:disc-crefth-1-LABEL:sec:gener-crefth-1]. In particular, the dependencies of the obtained bounds on the constants m and L which appear in H 1 and H 2 are made explicit.

In a recent work, [dalalyan:2017] (based on a previous version of this paper) proposed an improvement of the proof of Theorem 1 for constant step sizes. While the constants obtained there are sharper, the dependence on the dimension and the precision parameter is the same (first line of Table 1).

Parameter | Theorem 1 and Section 2-2 | Theorem 2 and Section 2-2
Table 1: Dependencies of the number of iterations n_ε on the dimension d and the precision ε to get W_2(δ_{x*} Q^{1,n_ε}, π) ≤ ε

For simplicity, consider sequences (γ_k)_{k ≥ 1} defined for all k ≥ 1 by γ_k = γ_1 k^{−α} for some α ∈ (0, 1] and γ_1 ≤ 2/(m + L). Then the bounds of Theorem 1 and Theorem 2 can be made explicit as functions of n, α and γ_1 (see [durmus:moulines:2015:supplement, LABEL:sec:gener-crefth-1, LABEL:sec:disc-crefth-1 and LABEL:sec:disc-crefth] for details). Based on these upper bounds, we obtain the convergence rates provided by Theorem 1 and Theorem 2 for this particular setting; see Table 2.

Theorem 1 (for α ∈ (0, 1), see [durmus:moulines:2015:supplement, LABEL:sec:gener-crefth-1]) | Theorem 2
Table 2: Order of convergence of W_2(δ_{x*} Q^{1,n}, π) for γ_k = γ_1 k^{−α}
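
As an illustration of this polynomial schedule (code added here, reusing the `ula` sketch from the introduction): γ_k = γ_1 k^{−α} with α ∈ (0, 1] satisfies lim_k γ_k = 0 and Σ_k γ_k = ∞, the two conditions under which Section 2 yields convergence of the marginal distribution to π.

```python
import numpy as np

gamma_1, alpha, n = 0.1, 0.5, 10_000
steps = gamma_1 * np.arange(1, n + 1) ** (-alpha)  # gamma_k = gamma_1 * k^{-alpha}
print(steps[-1], steps.sum())  # gamma_n is small while Gamma_n keeps growing
# decreasing-step run, e.g.: chain = ula(lambda x: x, np.zeros(10), steps)
```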

3 Quantitative bounds in total variation distance

We deal in this section with quantitative bounds in total variation distance. For Bayesian inference applications, total variation bounds are important for computing highest posterior density (HPD) credible regions and intervals. For computing such bounds we will use the results of Section 2 combined with the regularizing property of the semigroup (P_t)_{t ≥ 0}. Under H 2, define for all by

The first key result consists in upper-bounding the total variation distance ||δ_x P_t − δ_y P_t||_TV for x, y ∈ R^d. To that purpose, we use the coupling by reflection; see [lindvall:rogers:1986] and [chen:shao:1989, Example 3.7]. This coupling is defined as the unique solution (Y_t, Ỹ_t)_{t ≥ 0} of the SDE:

dY_t = −∇U(Y_t) dt + √2 dB_t
dỸ_t = −∇U(Ỹ_t) dt + √2 (Id − 2 e_t e_t^T) dB_t

with Y_0 = x, Ỹ_0 = y, where e_t = (Y_t − Ỹ_t)/||Y_t − Ỹ_t|| for Y_t ≠ Ỹ_t and e_t = 0 otherwise. Define the coupling time T_c = inf{s ≥ 0 : Y_s = Ỹ_s}. By construction, Y_t = Ỹ_t for t ≥ T_c. By Levy's characterization, B̃_t = ∫_0^t (Id − 2 e_s e_s^T) dB_s is a d-dimensional Brownian motion, therefore (Y_t)_{t ≥ 0} and (Ỹ_t)_{t ≥ 0} are weak solutions to (1) started at x and y respectively. Then by Lindvall's inequality, for all t > 0 we have ||δ_x P_t − δ_y P_t||_TV ≤ P(T_c > t).
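
A discrete-time analogue of this coupling, in the spirit of [bubley:dyer:jerrum:1998], drives two Euler chains with a Gaussian increment and its mirror image across the hyperplane orthogonal to e_t. The sketch below is an illustration added here only: the proofs use a maximal coupling of the two Gaussian proposals so that the chains meet exactly, whereas here they are merely merged once numerically close.

```python
import numpy as np

def reflection_coupling(grad_U, y0, y1, gamma, n_steps, tol=1e-8, rng=None):
    """Two Euler-Langevin chains with mirrored Gaussian increments."""
    rng = np.random.default_rng() if rng is None else rng
    y, yt = np.array(y0, dtype=float), np.array(y1, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(y.size)
        if np.linalg.norm(y - yt) < tol:       # already coupled: move together
            y = yt = y - gamma * grad_U(y) + np.sqrt(2.0 * gamma) * z
            continue
        e = (y - yt) / np.linalg.norm(y - yt)  # reflection direction e_t
        z_ref = z - 2.0 * (e @ z) * e          # reflected increment (I - 2 e e^T) z
        y = y - gamma * grad_U(y) + np.sqrt(2.0 * gamma) * z
        yt = yt - gamma * grad_U(yt) + np.sqrt(2.0 * gamma) * z_ref
    return y, yt
```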

Theorem 3.

Assume H 1 and H 2.

  1. For any x, y ∈ R^d and t > 0, it holds

     where Φ is the cumulative distribution function of the standard normal distribution.

  2. For any and ,

  3. For any and ,

Proof.
  1. We now bound ||δ_x P_t − δ_y P_t||_TV for x, y ∈ R^d and t > 0. For , denoting by