Convergence rates for Metropolis-Hastings algorithms in the Wasserstein distance

We develop necessary conditions for geometrically fast convergence in the Wasserstein distance for Metropolis-Hastings algorithms on ℝ^d when the metric used is a norm. This is accomplished through a lower bound which is of independent interest. We show exact convergence expressions in more general Wasserstein distances (e.g. total variation) can be achieved for a large class of distributions by centering an independent Gaussian proposal, that is, matching the optimal points of the proposal and target densities. This approach has applications for sampling posteriors of many popular Bayesian generalized linear models. In the case of Bayesian binary response regression, we show when the sample size n and the dimension d grow in such a way that the ratio d/n →γ∈ (0, +∞), the exact convergence rate can be upper bounded asymptotically.

Authors

• 1 publication
• 8 publications
01/27/2020

Exact rate of convergence of the mean Wasserstein distance between the empirical and true Gaussian distribution

We study the Wasserstein distance W_2 for Gaussian samples. We establish...
08/06/2020

Exact Convergence Rate Analysis of the Independent Metropolis-Hastings Algorithms

A well-known difficult problem regarding Metropolis-Hastings algorithms ...
10/20/2018

Wasserstein-based methods for convergence complexity analysis of MCMC with application to Albert and Chib's algorithm

Over the last 25 years, techniques based on drift and minorization (d&m)...
06/20/2020

Asymptotically Optimal Exact Minibatch Metropolis-Hastings

Metropolis-Hastings (MH) is a commonly-used MCMC algorithm, but it can b...
01/08/2022

Optimal 1-Wasserstein Distance for WGANs

The mathematical forces at work behind Generative Adversarial Networks r...
10/23/2020

The Wasserstein Impact Measure (WIM): a generally applicable, practical tool for quantifying prior impact in Bayesian statistics

The prior distribution is a crucial building block in Bayesian analysis,...
10/19/2020

Reweighting samples under covariate shift using a Wasserstein distance criterion

Considering two random variables with different laws to which we only ha...

Code Repositories

Centered-Metropolis-Hastings

Centered Metropolis-Hastings independence algorithm for Bayesian logistic regression.

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Applications in modern Bayesian statistics often require generating Monte Carlo samples from a posterior distribution defined on

, which may mean using a version of the Metropolis-Hastings algorithm [Brooks2011, Hastings1970, Metropolis1953]. Popular versions of Metropolis-Hastings include, among many others, the random walk Metropolis-Hastings, Metropolis-adjusted Langevin, and Metropolis-Hastings independence (MHI) algorithms, and Hamiltonian Monte Carlo.

Convergence analyses of Metropolis-Hastings Markov chains has traditionally focused on studying their convergence rates in total variation distances

[Jarner2000, Mengersen1996, Roberts1996geo, Tierney1994, Wang2020]

. These convergence rates have received significant attention, at least in part, because they provide a key sufficient condition for the existence of central limit theorems

[Jones2004] and validity of methods for assessing the reliability of the simulation effort [Pierre2019, Pierre2020, robe:etal:viz:2021, Vats2019]. However, there has been significant recent interest in the convergence properties of Monte Carlo Markov chains in high-dimensional settings [Durmus2019, KarlOskar2019, Hairer2014, john:etal:2019, qin2019convergence, raja:spar:2015, yang:etal:2016] and the standard approaches based on drift and minorization conditions [Meyn2009, Rosenthal1995] have shown limitations in this regime [qin2021limitations]. This has led to interest in considering the convergence rates of Monte Carlo Markov chains using alternative Wasserstein distances [Durmus2015, Hairer2014, jin:tan:2020, qin2021bounds, qin2021wasserstein] which can result in many similar benefits [Hairer2014, jin:tan:2020, Joulin2010, Komorowski2011].

We derive necessary conditions for geometrically fast convergence in the Wasserstein distance for general Metropolis-Hastings Markov chains when the metric used to define the Wasserstein distance is a norm. This requires uniform control over the rejection in the Metropolis-Hastings algorithm and essentially matches previous results with total variation distances [Mengersen1996, Roberts1996geo], suggesting a fundamental requirement for fast convergence of these algorithms. We accomplish this analysis through a lower bound in the Wasserstein distance for Metropolis-Hastings algorithms on

which is of independent interest. We then use this insight to motivate a new centered proposal for the MHI algorithm. That is, we match the maximal point of the proposal density with that of the target density. By centering the proposal, we directly imbue the Markov chain with a strong attraction to a set where the target distribution has high probability eliminating the need to develop a drift condition to study the convergence rate of the Markov chain. Using this centered proposal in the MHI algorithm, we derive exact convergence expressions which are universal across more general Wasserstein distances (e.g. total variation) and provide readily verifiable conditions when this is achievable.

We apply our theoretical work on the MHI algorithm with a centered Gaussian proposal to the problem of sampling posteriors of Bayesian generalized linear models and derive exact convergence rates in general Wasserstein distances for many such models. We then consider high-dimensional Bayesian binary response regression with Gaussian priors and derive an asymptotic upper bound for general Wasserstein distances when the sample size and the dimension grow together. To the best of our knowledge, this work is the first to successfully address the setting where both the sample size and the dimension increase in unison. For example, previous results investigated the convergence rates of certain Gibbs samplers in the case where either the dimension or the sample size increase individually [KarlOskar2019, qin2021wasserstein].

2 Metropolis-Hastings and Wasserstein distance

As they will be considered here, Metropolis-Hastings algorithms simulate a Markov chain with invariant distribution on a nonempty closed set using a proposal distribution which, to avoid trivialities, is assumed throughout to be different than . We will assume has Lebesgue density and for each , has Lebesgue density . We will further assume for each , if then for each , . Define

We will consider Metropolis-Hastings algorithms initialized at a point . Metropolis-Hastings proceeds as follows: for , given , draw and so that

 θt={θ′t,if Ut≤a(θt−1,θ′t)θt−1,otherwise.

If denotes the Dirac measure at the point and , the Metropolis-Hastings Markov kernel is defined for and by

 P(θ,B)=∫Ba(θ,θ′)q(θ,θ′)dθ′+δθ(B)(1−A(θ)).

For , define the Markov kernel at iteration time recursively by

 Pt(θ,B)=∫ΘP(θ,dθ′)Pt−1(θ′,B).

Let be the set of all probability measures on with marginals and and be a lower semicontinuous metric. The Wasserstein distance, or transportation distance with metric , which we call simply the Wasserstein distance, is

 Wρ(Pt(θ,⋅),Π)=infξ∈C(Pt(θ,⋅),Π)∫Θ×Θρ(θ,ω)dξ(θ,ω).

Our focus will be on bounding or expressing exactly . In particular, we will sometimes be concerned with settings where the Markov kernel is Wasserstein geometrically ergodic meaning there is an and for every , there is a such that

 Wρ(Pt(θ,⋅),Π)≤Mθϵt.

3 Necessary conditions for Wasserstein geometric ergodicity

Necessary conditions for the geometric ergodicity in total variation of Metropolis-Hastings algorithms rely on controlling the rejection probability [Mengersen1996, Roberts1996geo]. If this rejection probability cannot be bounded below one, that is, if , then a Metropolis-Hastings algorithm fails to be geometrically ergodic [Mengersen1996, Roberts1996geo]. It is generally however an onerous task to uniformly control this rejection probability. Using a lower bound on the Wasserstein distance, we will establish necessary conditions for Metropolis-Hastings algorithm to be Wasserstein geometrically ergodic. As we shall see, this also requires uniform control of the rejection probability. We will denote norms by and the standard -norms by .

Theorem 1.

Suppose there exists a constant such that . If such that and

 C0=C′0(1−11+d)2M1d(1+d)1d,

then

 W∥⋅∥(Pt(θ,⋅),Π)≥C0(1−A(θ))t(1+1d).
Proof.

We will first construct a suitable Lipschitz function. Fix , and fix . Define the function by . We have for every ,

 |φα,θ(ω)−φα,θ(ω′)| ≤α|∥ω−θ∥1−∥∥ω′−θ∥∥1| ≤α∥∥ω−ω′∥∥1.

Therefore, is a bounded Lipschitz function with respect to the distance and the Lipschitz constant is . By assumption there exists such that and . Then, using the fact that , we obtain

 ∫Θφα,θdΠ =∫Θφα,θ(θ′)π(θ′)dθ′ ≤M∫Θexp(−α∥∥θ′−θ∥∥1)dθ′ ≤M∫Rdexp(−α∥∥θ′−θ∥∥1)dθ′ =M2dα−d. (1)

Fix a positive integer . Then, for each and each , we obtain the lower bound

 ∫Θφα,θ(θ′)(1−A(θ′))sP(ω,dθ′) =∫Θφα,θ(θ′)(1−A(θ′))smin{π(θ′)q(θ′,ω)π(ω)q(ω,θ′),1}q(ω,θ′)dθ′+φα,θ(ω)(1−A(ω))s+1 ≥φα,θ(ω)(1−A(ω))s+1.

We now apply this lower bound multiple times for however large is using the Fubini-Tonnelli theorem [Folland2013, Theorem 2.37]:

 ∫Θφα,θ(θt)Pt(θ,dθt) =∫Θ{∫Θφα,θ(θt)P(θt−1,dθt)}Pt−1(θ,dθt−1) ≥∫Θφα,θ(θt−1)(1−A(θt−1))Pt−1(θ,dθt−1) =∫Θ{∫Θφα,θ(θt−1)(1−A(θt−1))P(θt−2,dθt−1)}Pt−2(θ,dθt−2) ≥∫Θφα,θ(θt−2)(1−A(θt−2))2Pt−2(θ,dθt−2) ⋮ ≥φα,θ(θ)(1−A(θ))t =(1−A(θ))t. (2)

The final step follows from the fact that . Combining (1) and (2), we then have the lower bound,

 ∫Θ(1αφα,θ)dPt(θ,⋅)−∫Θ(1αφα,θ)dΠ≥(1−A(θ))t−M2dα−dα. (3)

The case where is trivial so we assume that . Let be the set of bounded, measurable functions on and be the Lipschitz norm measuring the maximum Lipschitz constant for a function of interest. We then have by the Kantovorich-Rubinstein theorem [Villani2003, Theorem 1.14] and the lower bound in (3),

 W∥⋅∥1(Pt(θ,⋅),Π) =supφ∈Mb(Θ)∥φ∥Lip(∥⋅∥1)≤1[∫ΘφdPt(θ,⋅)−∫ΘφdΠ] ≥∫Θ(1αφα,θ)dPt(θ,⋅)−∫Θ(1αφα,θ)dΠ ≥(1−A(θ))t−M2dα−dα.

If , then taking the limit of , completes the proof. Suppose then that . Maximizing this lower bound with respect to yields . We then have

 W∥⋅∥1(Pt(θ,⋅),Π)=(1−A(θ))t(1−11+d)α=(1−11+d)2M1d(1+d)1d(1−A(θ))t(1+1d).

This completes the proof for the norm .

Since all norms on are equivalent, there is a positive real number such that and the proof for any norm on follows from the case we have proved for the norm . ∎

Using Theorem 1, we now establish necessary conditions for the Wasserstein geometric ergodicity of Metropolis-Hastings Markov chains.

Proposition 1.

Under the same conditions as Theorem 1 with the additional assumption that , the Metropolis-Hastings kernel is not Wasserstein geometrically ergodic for any norm .

Proof.

Assume by way of contradiction that the Metropolis-Hastings algorithm is Wasserstein geometrically ergodic for some norm . Then there is an where for every , there is a constant such that

 W∥⋅∥(Pt(θ,⋅),Π)≤Rθ(1−ϵ)t.

From the lower bound in Theorem 1, we have

 C0(1−A(θ))t(1+1d)≤W∥⋅∥(Pt(θ,⋅),Π)≤Rθ(1−ϵ)t

But this implies that

 1−A(θ)≤(1−ϵ)1/(1+1/d).

Write where . We then must have that for every . But this is a contradiction to the assumption that . ∎

There is in fact an elementary example which fails to be geometrically ergodic and also fails to be Wasserstein geometrically ergodic.

Example 1.

Consider the RWM algorithm with target distribution on with density . This algorithm has been shown to not be geometrically ergodic [Roberts1996geo, Proposition 5.2]. The conditions are also met for Proposition 1 in this example and this algorithm also fails to be Wasserstein geometrically ergodic for any norm.

For the MHI algorithm, we have the following necessary condition for Wassertein geometric ergodicity.

Proposition 2.

Assume the proposal distribution has density and is independent of the previous iteration. Under the same conditions as Theorem 1 with the additional assumption that , the MHI kernel is not Wasserstein geometrically ergodic for any norm .

Proof.

Since , choose a monotonically decreasing sequence such that

 limn→+∞q(θn)π(θn)=0.

Since is bounded by , the monotone convergence theorem [Folland2013, Theorem 2.14] ensures

 limn→+∞A(θn)=0.

We then have that

 0≤infθ∈ΘA(θ)≤infnA(θn)≤liminfn→+∞A(θn)=0.

An application of Proposition 1 completes the proof. ∎

The following example for the MHI algorithm fails to be geometrically ergodic [Mengersen1996] and also fails to be Wasserstein geometrically ergodic.

Example 2.

Consider the target distribution in one dimension with independent proposal . It has been shown previously that the ratio of the proposal and target density cannot be uniformly bounded above [Mengersen1996]. This algorithm then satisfies the conditions in Proposition 2 and cannot be Wasserstein geometrically ergodic for any norm.

4 Exact convergence rates in Wasserstein distances with centered proposals

Recently there have been attempts at using centered drift functions to improve convergence analyses of some Monte Carlo Markov chains in high-dimensional settings [KarlOskar2019, qin2019convergence, qin2021wasserstein]. Our approach is to center the proposal distribution, that is, matching the optimal points of the proposal and target densities. We first consider the following assumption which has been considered previously [Wang2020].

Assumption 1.

The proposal is independent of the previous iteration, and there exists a solution .

If , then Assumption 1 implies uniform ergodicity of the MHI algorithm by [Tierney1994, Corollary 4]. Assumption 1 also implies that can be represented as a convex combination of the target distribution and the Dirac measure at the point [Wang2020, Remark 1, Theorem 2], that is,

 Pt(θ∗,⋅) (4)

This representation gives way to an exact convergence rate in total variation if the algorithm is started at the point [Wang2020, Remark 1, Theorem 2]. We now show that it is also possible to derive exact convergence rates for more general Wasserstein distances.

Proposition 3.

Under Assumption 1,

 Wρ(Pt(θ∗,⋅),Π)=(1−ϵθ∗)t∫Θρ(θ,θ∗)dΠ(θ).
Proof.

Let be a function such that . We have the identity,

 ∫ΘψdPt(θ∗,⋅)=(1−(1−ϵθ∗)t)∫ΘψdΠ(B)+(1−ϵθ∗)tψ(θ∗). (5)

It can be shown that . Since is not exactly , then . Denoting by the set of bounded measurable functions on and denoting the Lipschitz norm with the distance , applying the Kantovorich-Rubinstein theorem [Villani2003, Theorem 1.14],

 Wρ(Pt(θ∗,⋅),Π) =supφ∈Mb(Θ)∥φ∥Lip(ρ)≤1∫Θφd(Pt(θ∗,⋅)−Π) =supφ∈Mb(Θ)∥φ∥Lip(ρ)≤1{(1−ϵθ∗)t∫Θφd(δθ∗−Π)} =(1−ϵθ∗)tsupφ∈Mb(Θ)∥φ∥Lip(ρ)≤1∫Θφd(δθ∗−Π) =(1−ϵθ∗)tWρ(δθ∗,Π) =(1−ϵθ∗)t∫Θρ(θ,θ∗)dΠ(θ).

Under Assumption 1, Proposition 3 shows the convergence rate in total variation matches the convergence rate in other Wasserstein distances. This means it cannot be improved using “weaker” Wasserstein distances alternative to the total variation distance. On the other hand, it also says the rate does not worsen when looking at “stronger” Wasserstein distances such as those controlled using drift and minorization conditions [Hairer2011].

We shall see in the next section that by centering a Gaussian proposal, we may satisfy Assumption 1 for a general class of target distributions with being the optimum of the target’s density. While we focus on Gaussian proposals, the technique of centering proposals is in fact more general.

4.1 Centered MHI with Gaussian proposals

We now provide general conditions where Proposition 3 may be applied with centered Gaussian proposals. We will assume the target distribution

is a probability distribution supported on

. With and normalizing constant , we may then define the density by . Let be the unique maximum of , , and be a symmetric, positive-definite matrix. Under suitable conditions, the point may be found, at least approximately, by means of optimization. Let the proposal distribution with density correspond to a

-dimensional Gaussian distribution,

and finally, define .

Proposition 4.

If exists and satisfies for any , then

 Wρ(Pt(θ∗,⋅),Π)=(1−ϵθ∗)t∫Rdρ(θ,θ∗)dΠ(θ).
Proof.

Since the proposal density has been centered at the point , it then satisfies . For every , we have the lower bound

 q(θ)π(θ) =(2π)−d/2αd/2det(C)−1/2ZΠexp(f(θ)−α2(θ−θ∗)TC−1(θ−θ∗)) ≥(2π)−d/2αd/2det(C)−1/2ZΠexp(f(θ∗)) =q(θ∗)π(θ∗).

Since both densities are positive and the proposal is independent of the previous iteration, we have shown that Assumption 1 is satisfied. An application of Proposition 3 with the the proposal and target distribution and as we have defined them in this section completes the proof. ∎

The point is guaranteed to exist if the function satisfies a convexity property. A function is strongly convex with parameter if there is an so that is convex [Urruty2001, Nesterov2018]. The norm in this definition is often taken to be the standard Euclidean norm, but we will use the norm induced by the matrix . We obtain the following general conditions for Proposition 4 to hold.

Proposition 5.

If the function is convex for all points on , then

 Wρ(Pt(θ∗,⋅),Π)=(1−ϵθ∗)t∫Rdρ(θ,θ∗)dΠ(θ).
Proof.

Since the function is convex for all points on , it follows that for any and any ,

 f(λθ+(1−λ)θ′)≤λf(θ)+(1−λ)f(θ′)−α2λ(1−λ)(θ′−θ)TC−1(θ′−θ).

Since is positive-definite, then is nonnegative and this implies that is a convex function. It can also be shown that and since is lower semicontinuous, then attains its minimum . The right directional derivative exists for all points [Nesterov2018, Theorem 3.1.12]. For , we have

 1(1−λ)1λ[f(θ∗+λ(θ−θ∗))−f(θ∗)]−1(1−λ)(f(θ)−f(θ∗))≤−α2(θ−θ∗)TC−1(θ−θ∗).

Taking the limit with , we have that

 f′(θ∗;θ−θ∗)−f(θ)+f(θ∗)≤−α2(θ−θ∗)TC−1(θ−θ∗).

Since is the minimum of , then the right directional derivative satisfies for all . Therefore for all ,

 f(θ)≥f(θ∗)+α2(θ−θ∗)TC−1(θ−θ∗).

This inequality implies the minimum is unique. An application of Proposition 4 completes the proof. ∎

5 Applications to Bayesian generalized linear models

We consider Bayesian Poisson and negative-binomial regression for count response regression and Bayesian logistic and probit regression for binary response regression. Suppose there are discrete-valued responses with features , and a parameter . For Poisson regression, assume the ’s are conditionally independent with

 Yi∣∣Xi,β∼Poisson(exp(βTXi)).

Similarly, for negative-binomial regression, if , assume

For binary response regression, if , assume

 Yi|Xi,β∼Bernoulli(S(βTXi)).

For logistic regression, we will consider

and for probit regression, we will consider

to be the cumulative distribution function of a standard Gaussian random variable. Suppose

where and is a symmetric, positive-definite matrix. Both and

are assumed to be known. Define the vector

and the matrix . Let denote the posterior with density . If denotes the negative log-likelihood for each model, the posterior density is characterized by

 π(β|X,Y)=Z−1Π(⋅|X,Y))exp(−ℓn(β)−α2βTC−1β).

Observe that the function is convex in all four models we consider. Let denote the unique maximum of . For the MHI algorithm, we use a proposal distribution, and Proposition 5 immediately yields the following for each posterior.

Corollary 1.

Let . Then

 Wρ(Pt(β∗,⋅),Π(⋅|X,Y))=(1−ϵβ∗)t∫Rdρ(β,β∗)dΠ(β|X,Y).

5.1 High-dimensional binary response regression

Our goal now is to obtain an upper bound on the rate of convergence established in Corollary 1 in high dimensions for binary response regression. In this context, it is more natural to treat the as stochastic; each time the sample size increases, the additional observation is randomly generated. Specifically, we will assume that are independent with and with . Similar scaling assumptions on the data are used for high-dimensional maximum-likelihood theory in logistic regression [Sur2019]. We will also assume the limit of the trace of the covariance matrix used in our prior is finite, that is, as . This assumption is natural as for the Gaussian prior to exist in any infinite-dimensional Hilbert space, it is a necessary condition that the trace of the covariance is finite. Under this model, if the dimension and sample size grow proportionally, we now provide an explicit high-dimensional asymptotic upper bound on the Wasserstein distance.

Theorem 2.

Suppose that:

1. The negative log-likelihood is a twice continuously differentiable convex function with Hessian .

2. There is a universal constant such that for every .

Let . If in such a way that , then, almost surely

 limsupd,n→∞dn→γWρ(Pt(β∗,⋅),Π(⋅|X,Y))≤(1−exp(−a0))tlimsupd,n→∞dn→γ∫Rdρ(β,β∗)dΠ(β|X,Y).
Proof.

Under our assumption, we may write the matrix where is a matrix with i.i.d Gaussian entries with mean

and variance

. Denote the largest eigenvalue of the matrix

by . Therefore, as in such a way that ,

almost surely [Geman1980, Theorem 1].

Define the function by where

is the negative log-likelihood loss function and define

. We will first lower bound the quantity . We have that for any and any ,

This implies that for any and any , the Hessian of , denoted by , satisfies

 vTHf(β)v≤vT(r0λmax(XTX)Id+αC−1)v.

Since the function is twice continuously differentiable, then is also twice continuously differentiable. Since both the gradient and Hessian are continuous and , we use a Taylor expansion to obtain the upper bound

 f(β) =f(β∗)+∫10∫t0(β−β∗)THf(β∗+s(β−β∗))(β−β∗)dsdt ≤f(β∗)+12(β−β∗)T(r0λmax(XTX)Id+αC−1)(β−β∗).

We then have a lower bound on the normalizing constant of the target posterior

 ZΠ(⋅|X,Y)=∫Rdexp(−f(β))dβ≥exp(−f(β∗))(2π)d/2det(r0λmax(XTX)Id+αC−1)1/2.

This in turn yields a lower bound on the ratio

 ZΠ(⋅|X,Y)ZQexp(f(β∗))≥det(αC−1)1/2det(r0λmax(XTX)Id+αC−1)1/2. (6)

The matrix is symmetric and positive-definite and so its eigenvalues exist and are positive. Let be the eigenvalues of . It is readily verified that the eigenvalues of the matrix exist and are . Then

 det(αC−1)det(r0λmax(XTX)Id+αC−1) =∏di=1αλi(C)∏di=1(r0λmax(XTX)+αλi(C)) =d∏i=1αλi(C)r0λmax(XTX)+αλi(C) =d∏i=11r0αλmax(XTX)λi(C)+1 =exp(−d∑i=1log(r0αλmax(XTX)λi(C)+1)). (7)

We have the basic inequality for any . Since the eigenvalues of are positive and is nonnegative, we have the upper bound

 d∑i=1log(r0αλmax(XTX)λi(C)+1)≤r0αλmax(XTX)d∑i=1λi(C). (8)

This yields a lower bound on (7). Define the doubly-indexed sequence by