Centered Metropolis-Hastings independence algorithm for Bayesian logistic regression.
We develop necessary conditions for geometrically fast convergence in the Wasserstein distance for Metropolis-Hastings algorithms on ℝ^d when the metric used is a norm. This is accomplished through a lower bound which is of independent interest. We show exact convergence expressions in more general Wasserstein distances (e.g. total variation) can be achieved for a large class of distributions by centering an independent Gaussian proposal, that is, matching the optimal points of the proposal and target densities. This approach has applications for sampling posteriors of many popular Bayesian generalized linear models. In the case of Bayesian binary response regression, we show that when the sample size n and the dimension d grow in such a way that the ratio d/n → γ ∈ (0, +∞), the exact convergence rate can be upper bounded asymptotically.
Applications in modern Bayesian statistics often require generating Monte Carlo samples from a posterior distribution defined on ℝ^d, which may mean using a version of the Metropolis-Hastings algorithm [Brooks2011, Hastings1970, Metropolis1953]. Popular versions of Metropolis-Hastings include, among many others, the random walk Metropolis-Hastings, Metropolis-adjusted Langevin, and Metropolis-Hastings independence (MHI) algorithms, as well as Hamiltonian Monte Carlo.
Convergence analyses of Metropolis-Hastings Markov chains have traditionally focused on studying their convergence rates in total variation distances [Jarner2000, Mengersen1996, Roberts1996geo, Tierney1994, Wang2020]. These convergence rates have received significant attention, at least in part, because they provide a key sufficient condition for the existence of central limit theorems [Jones2004] and the validity of methods for assessing the reliability of the simulation effort [Pierre2019, Pierre2020, robe:etal:viz:2021, Vats2019]. However, there has been significant recent interest in the convergence properties of Monte Carlo Markov chains in high-dimensional settings [Durmus2019, KarlOskar2019, Hairer2014, john:etal:2019, qin2019convergence, raja:spar:2015, yang:etal:2016], and the standard approaches based on drift and minorization conditions [Meyn2009, Rosenthal1995] have shown limitations in this regime [qin2021limitations]. This has led to interest in studying the convergence rates of Monte Carlo Markov chains using alternative Wasserstein distances [Durmus2015, Hairer2014, jin:tan:2020, qin2021bounds, qin2021wasserstein], which can yield many similar benefits [Hairer2014, jin:tan:2020, Joulin2010, Komorowski2011].
We derive necessary conditions for geometrically fast convergence in the Wasserstein distance for general Metropolis-Hastings Markov chains when the metric used to define the Wasserstein distance is a norm. This requires uniform control over the rejection probability in the Metropolis-Hastings algorithm and essentially matches previous results for total variation distances [Mengersen1996, Roberts1996geo], suggesting a fundamental requirement for fast convergence of these algorithms. We accomplish this analysis through a lower bound in the Wasserstein distance for Metropolis-Hastings algorithms on ℝ^d which is of independent interest. We then use this insight to motivate a new centered proposal for the MHI algorithm. That is, we match the maximal point of the proposal density with that of the target density. By centering the proposal, we directly imbue the Markov chain with a strong attraction to a set where the target distribution has high probability, eliminating the need to develop a drift condition to study the convergence rate of the Markov chain. Using this centered proposal in the MHI algorithm, we derive exact convergence expressions which are universal across more general Wasserstein distances (e.g. total variation) and provide readily verifiable conditions under which this is achievable.
We apply our theoretical work on the MHI algorithm with a centered Gaussian proposal to the problem of sampling posteriors of Bayesian generalized linear models and derive exact convergence rates in general Wasserstein distances for many such models. We then consider high-dimensional Bayesian binary response regression with Gaussian priors and derive an asymptotic upper bound for general Wasserstein distances when the sample size and the dimension grow together. To the best of our knowledge, this work is the first to successfully address the setting where both the sample size and the dimension increase in unison. For example, previous results investigated the convergence rates of certain Gibbs samplers in the case where either the dimension or the sample size increase individually [KarlOskar2019, qin2021wasserstein].
As they will be considered here, Metropolis-Hastings algorithms simulate a Markov chain with invariant distribution Π on a nonempty closed set X ⊆ ℝ^d using a proposal distribution Q which, to avoid trivialities, is assumed throughout to be different from Π. We will assume Π has Lebesgue density π and, for each x, Q(x, ·) has Lebesgue density q(x, ·). We will further assume, for each x, that if π(x) > 0, then q(x, y) > 0 for each y. Define the acceptance probability

a(x, y) = min{ 1, [π(y) q(y, x)] / [π(x) q(x, y)] }.

We will consider Metropolis-Hastings algorithms initialized at a point X_0 = x_0. Metropolis-Hastings proceeds as follows: for t = 1, 2, …, given X_{t−1} = x_{t−1}, draw Y_t ∼ Q(x_{t−1}, ·) and U_t ∼ Uniform(0, 1), and set

X_t = Y_t if U_t ≤ a(x_{t−1}, Y_t), and X_t = x_{t−1} otherwise.

If δ_x denotes the Dirac measure at the point x and a(x) = ∫ a(x, y) q(x, y) dy denotes the probability of accepting a proposal from x, the Metropolis-Hastings Markov kernel is defined for x and measurable sets B by

P(x, B) = ∫_B a(x, y) q(x, y) dy + (1 − a(x)) δ_x(B).

For n ≥ 1, define the Markov kernel at iteration time n recursively by

P^n(x, B) = ∫ P^{n−1}(y, B) P(x, dy).
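The Metropolis-Hastings recursion described above can be sketched in a few lines of code. The following is a minimal, self-contained illustration, not from the paper: a symmetric random-walk proposal targeting a standard Gaussian density on ℝ, with function names and tuning constants of our own choosing.

```python
import math
import random

def target_density(x):
    # Unnormalized standard Gaussian target: pi(x) proportional to exp(-x^2 / 2).
    return math.exp(-0.5 * x * x)

def metropolis_hastings(n_steps, x0=0.0, step=1.0, rng=random):
    """Random-walk Metropolis-Hastings chain of length n_steps started at x0.

    The Gaussian random-walk proposal is symmetric, so the acceptance
    probability a(x, y) reduces to min(1, pi(y) / pi(x)).
    """
    chain = [x0]
    x = x0
    for _ in range(n_steps):
        y = x + rng.gauss(0.0, step)          # draw proposal Y ~ Q(x, .)
        accept_prob = min(1.0, target_density(y) / target_density(x))
        if rng.uniform(0.0, 1.0) <= accept_prob:
            x = y                             # accept: move to Y
        chain.append(x)                       # record the current state
    return chain

random.seed(1)
chain = metropolis_hastings(50_000)
mean = sum(chain) / len(chain)
var = sum((c - mean) ** 2 for c in chain) / len(chain)
```

For a well-tuned chain the empirical mean and variance should be close to 0 and 1, the moments of the standard Gaussian target.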
Let C(μ, ν) be the set of all probability measures on X × X with marginals μ and ν, and let ρ be a lower semicontinuous metric. The Wasserstein distance, or transportation distance with metric ρ, which we call simply the Wasserstein distance, is

W_ρ(μ, ν) = inf_{ξ ∈ C(μ, ν)} ∫ ρ(x, y) dξ(x, y).
Our focus will be on bounding or expressing exactly W_ρ(P^n(x, ·), Π). In particular, we will sometimes be concerned with settings where the Markov kernel is Wasserstein geometrically ergodic, meaning there is an r ∈ (0, 1) and, for every x, a constant C_x ∈ (0, ∞) such that

W_ρ(P^n(x, ·), Π) ≤ C_x r^n for all n ≥ 1.
Necessary conditions for the geometric ergodicity in total variation of Metropolis-Hastings algorithms rely on controlling the rejection probability [Mengersen1996, Roberts1996geo]. If this rejection probability cannot be bounded below one, that is, if ess sup_x (1 − a(x)) = 1, then a Metropolis-Hastings algorithm fails to be geometrically ergodic [Mengersen1996, Roberts1996geo]. It is, however, generally an onerous task to uniformly control this rejection probability. Using a lower bound on the Wasserstein distance, we will establish necessary conditions for a Metropolis-Hastings algorithm to be Wasserstein geometrically ergodic. As we shall see, this also requires uniform control of the rejection probability. We will denote norms by ‖·‖ and the standard p-norms by ‖·‖_p.
Suppose there exists a constant such that . If such that and
We will first construct a suitable Lipschitz function. Fix , and fix . Define the function by . We have for every ,
Therefore, is a bounded Lipschitz function with respect to the distance and the Lipschitz constant is . By assumption there exists such that and . Then, using the fact that , we obtain
Fix a positive integer . Then, for each and each , we obtain the lower bound
We now apply this lower bound multiple times, however large n may be, using the Fubini–Tonelli theorem [Folland2013, Theorem 2.37]:
The case where is trivial so we assume that . Let be the set of bounded, measurable functions on and be the Lipschitz norm measuring the maximum Lipschitz constant for a function of interest. We then have by the Kantorovich–Rubinstein theorem [Villani2003, Theorem 1.14] and the lower bound in (3),
If , then taking the limit of , completes the proof. Suppose then that . Maximizing this lower bound with respect to yields . We then have
This completes the proof for the norm .
Since all norms on are equivalent, there is a positive real number such that and the proof for any norm on follows from the case we have proved for the norm . ∎
Using Theorem 1, we now establish necessary conditions for the Wasserstein geometric ergodicity of Metropolis-Hastings Markov chains.
Under the same conditions as Theorem 1 with the additional assumption that ess sup_x (1 − a(x)) = 1, the Metropolis-Hastings kernel is not Wasserstein geometrically ergodic for any norm ‖·‖.
Assume by way of contradiction that the Metropolis-Hastings algorithm is Wasserstein geometrically ergodic for some norm ‖·‖. Then there is an r ∈ (0, 1) where, for every x, there is a constant C_x such that
From the lower bound in Theorem 1, we have
But this implies that
Write where . We then must have that for every . But this is a contradiction to the assumption that . ∎
There is in fact an elementary example which fails to be geometrically ergodic and also fails to be Wasserstein geometrically ergodic.
Consider the random-walk Metropolis-Hastings (RWM) algorithm with target distribution on ℝ with density . This algorithm has been shown to not be geometrically ergodic [Roberts1996geo, Proposition 5.2]. The conditions of Proposition 1 are also met in this example, and this algorithm likewise fails to be Wasserstein geometrically ergodic for any norm.
For the MHI algorithm, we have the following necessary condition for Wasserstein geometric ergodicity.
Assume the proposal distribution has density q and is independent of the previous iteration. Under the same conditions as Theorem 1 with the additional assumption that ess sup_x π(x)/q(x) = ∞, the MHI kernel is not Wasserstein geometrically ergodic for any norm ‖·‖.
Since , choose a monotonically decreasing sequence such that
Since is bounded by , the monotone convergence theorem [Folland2013, Theorem 2.14] ensures
We then have that
An application of Proposition 1 completes the proof. ∎
The following example for the MHI algorithm fails to be geometrically ergodic [Mengersen1996] and also fails to be Wasserstein geometrically ergodic.
Consider the target distribution in one dimension with independent proposal . It has been shown previously that the ratio of the proposal and target density cannot be uniformly bounded above [Mengersen1996]. This algorithm then satisfies the conditions in Proposition 2 and cannot be Wasserstein geometrically ergodic for any norm.
Recently there have been attempts at using centered drift functions to improve convergence analyses of some Monte Carlo Markov chains in high-dimensional settings [KarlOskar2019, qin2019convergence, qin2021wasserstein]. Our approach is instead to center the proposal distribution, that is, to match the optimal points of the proposal and target densities. We first consider the following assumption, which has been considered previously [Wang2020].
The proposal is independent of the previous iteration, and there exists a solution .
If , then Assumption 1 implies uniform ergodicity of the MHI algorithm by [Tierney1994, Corollary 4]. Assumption 1 also implies that the kernel at that point can be represented as a convex combination of the target distribution and the Dirac measure at that point [Wang2020, Remark 1, Theorem 2], that is,
This representation yields an exact convergence rate in total variation if the algorithm is started at that point [Wang2020, Remark 1, Theorem 2]. We now show that it is also possible to derive exact convergence rates for more general Wasserstein distances.
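As a numerical illustration of this exact-rate phenomenon (our own sketch, not part of the paper's development): take a standard Gaussian target and an independence proposal N(0, 2) centered at the same mode. The ratio of target to proposal density is then maximized at the common mode x* = 0, and the per-step rejection probability at the mode is 1 − q(0)/π(0) = 1 − 1/√2, which the simulation below recovers by Monte Carlo.

```python
import math
import random

random.seed(7)

SIGMA2 = 2.0  # proposal variance (an assumption of this sketch)

def log_pi(x):
    # Standard Gaussian target density, normalized.
    return -0.5 * x * x - 0.5 * math.log(2 * math.pi)

def log_q(x):
    # Independence proposal N(0, SIGMA2), centered at the target mode 0.
    return -0.5 * x * x / SIGMA2 - 0.5 * math.log(2 * math.pi * SIGMA2)

def accept_prob_from_mode(y):
    # With importance weight w(x) = pi(x)/q(x), the MHI acceptance
    # probability from the mode x* = 0 for a proposed y is min(1, w(y)/w(0)).
    log_w_ratio = (log_pi(y) - log_q(y)) - (log_pi(0.0) - log_q(0.0))
    return min(1.0, math.exp(log_w_ratio))

n = 200_000
acc = sum(accept_prob_from_mode(random.gauss(0.0, math.sqrt(SIGMA2)))
          for _ in range(n)) / n
rejection = 1.0 - acc
exact = 1.0 - 1.0 / math.sqrt(2.0)  # = 1 - q(0)/pi(0)
```

Here `rejection` agrees with `exact` to within Monte Carlo error, and by the convex-combination representation the total variation distance after n steps started from the mode is exactly `exact ** n`.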
Under Assumption 1,
Let be a function such that . We have the identity,
It can be shown that . Since is not exactly , then . Denoting by the set of bounded measurable functions on and denoting the Lipschitz norm with the distance , applying the Kantorovich–Rubinstein theorem [Villani2003, Theorem 1.14],
Under Assumption 1, Proposition 3 shows the convergence rate in total variation matches the convergence rate in other Wasserstein distances. This means the rate cannot be improved by using “weaker” Wasserstein distances as alternatives to the total variation distance. On the other hand, it also says the rate does not worsen for “stronger” Wasserstein distances, such as those controlled using drift and minorization conditions [Hairer2011].
We shall see in the next section that by centering a Gaussian proposal, we may satisfy Assumption 1 for a general class of target distributions, with the centering point being the optimum of the target’s density. While we focus on Gaussian proposals, the technique of centering proposals is in fact more general.
We now provide general conditions under which Proposition 3 may be applied with centered Gaussian proposals. We will assume the target distribution Π is a probability distribution supported on ℝ^d. With and normalizing constant , we may then define the density by . Let x* be the unique maximum of the target density, and let C be a symmetric, positive-definite matrix. Under suitable conditions, the point x* may be found, at least approximately, by means of optimization. Let the proposal distribution with density q correspond to a d-dimensional Gaussian distribution N(x*, C) centered at x*, and finally, define .
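To make the centering step concrete, here is a small illustration of our own (the one-dimensional target π(x) ∝ exp(−x⁴/4 − x² + x) and the unit proposal variance are assumptions of the sketch): the mode is located by Newton's method, a Gaussian proposal is centered there, and a grid search confirms that the ratio π/q attains its maximum at the centering point, as the construction requires.

```python
import math

def V(x):
    # Negative log-density of the (unnormalized) target: pi(x) ∝ exp(-V(x)).
    return x ** 4 / 4.0 + x ** 2 - x

def V_prime(x):
    return x ** 3 + 2.0 * x - 1.0

def V_double_prime(x):
    return 3.0 * x ** 2 + 2.0

# Newton's method for the mode x* of pi (the minimum of V).  V is strongly
# convex here (V'' >= 2), so the iteration converges rapidly.
x_star = 0.0
for _ in range(20):
    x_star -= V_prime(x_star) / V_double_prime(x_star)

# Center a Gaussian proposal q = N(x_star, 1) at the mode, then check on a
# grid that log(pi/q) is maximized at the centering point.
def log_ratio(x):
    return -V(x) + 0.5 * (x - x_star) ** 2  # log pi - log q, up to constants

grid = [-5.0 + 0.001 * i for i in range(10001)]
best = max(grid, key=log_ratio)
```

Because log(π/q) here has second derivative −3x² − 1 < 0, it is strictly concave, so its stationary point at `x_star` is the global maximum and the sup of π/q is attained at the centering point.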
If exists and satisfies for any , then
Since the proposal density has been centered at the point , it then satisfies . For every , we have the lower bound
The point x* is guaranteed to exist if the function satisfies a convexity property. A function f is strongly convex with parameter m if there is an m > 0 such that x ↦ f(x) − (m/2)‖x‖² is convex [Urruty2001, Nesterov2018]. The norm in this definition is often taken to be the standard Euclidean norm, but we will use the norm induced by the matrix C. We obtain the following general conditions for Proposition 4 to hold.
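For instance, a standard sanity check of the definition in the induced norm (taking the convention ‖x‖_C² = xᵀC⁻¹x, which is an assumption of this example):

```latex
\[
  \|x\|_C^2 := x^\top C^{-1} x, \qquad
  f \text{ is strongly convex with parameter } m > 0
  \iff x \mapsto f(x) - \tfrac{m}{2}\|x\|_C^2 \text{ is convex.}
\]
\[
  \text{Example: } f(x) = \tfrac{1}{2}\, x^\top C^{-1} x
  \;\Longrightarrow\;
  f(x) - \tfrac{1}{2}\|x\|_C^2 \equiv 0,
\]
```

which is convex, so the Gaussian negative log-density built from C is strongly convex with parameter m = 1 in the C-induced norm.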
If the function is convex for all points on , then
Since the function is convex for all points on , it follows that for any and any ,
Since is positive-definite, then is nonnegative and this implies that is a convex function. It can also be shown that and since is lower semicontinuous, then attains its minimum . The right directional derivative exists for all points [Nesterov2018, Theorem 3.1.12]. For , we have
Taking the limit with , we have that
Since is the minimum of , then the right directional derivative satisfies for all . Therefore for all ,
This inequality implies the minimum is unique. An application of Proposition 4 completes the proof. ∎
We consider Bayesian Poisson and negative-binomial regression for count response regression, and Bayesian logistic and probit regression for binary response regression. Suppose there are discrete-valued responses y_1, …, y_n with features x_1, …, x_n ∈ ℝ^d, and a parameter β ∈ ℝ^d. For Poisson regression, assume the responses are conditionally independent with
Similarly, for negative-binomial regression, if , assume
For binary response regression, if , assume
For logistic regression, we will consider , and for probit regression, we will consider , where the prior mean and the symmetric, positive-definite prior covariance matrix are both assumed to be known. Define the vector of responses and the design matrix . Let Π denote the posterior distribution with density π. If ℓ denotes the negative log-likelihood for each model, the posterior density is characterized by
Observe that the negative log-likelihood ℓ is convex in all four models we consider. Let the unique maximum of the posterior density be denoted by β*. For the MHI algorithm, we use a Gaussian proposal distribution centered at β*, and Proposition 5 immediately yields the following for each posterior.
Let . Then
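A minimal end-to-end sketch of this setting (our own illustration with made-up data; the zero prior mean, identity prior covariance, and step sizes are assumptions of the sketch): the logistic posterior mode is found by gradient descent on the strongly convex negative log-posterior, and an MHI chain with a Gaussian proposal centered at the mode is then run.

```python
import math
import random

random.seed(3)

# Made-up binary-response data: features x_i in R^2 and responses y_i.
X = [(1.0, 0.5), (-1.0, 1.0), (0.5, -1.0), (-0.5, -0.5)]
Y = [1, 0, 1, 0]

def neg_log_post(beta):
    # Logistic negative log-likelihood plus N(0, I) prior term.
    val = 0.5 * (beta[0] ** 2 + beta[1] ** 2)
    for (x1, x2), y in zip(X, Y):
        t = x1 * beta[0] + x2 * beta[1]
        val += math.log(1.0 + math.exp(t)) - y * t
    return val

def grad(beta):
    g = [beta[0], beta[1]]  # gradient of the prior term
    for (x1, x2), y in zip(X, Y):
        t = x1 * beta[0] + x2 * beta[1]
        p = 1.0 / (1.0 + math.exp(-t))
        g[0] += (p - y) * x1
        g[1] += (p - y) * x2
    return g

# Gradient descent to the posterior mode (objective is strongly convex).
beta = [0.0, 0.0]
for _ in range(500):
    g = grad(beta)
    beta = [beta[0] - 0.2 * g[0], beta[1] - 0.2 * g[1]]
mode = beta

def log_w(b):
    # log of pi(b)/q(b) up to constants, with proposal q = N(mode, I).
    quad = 0.5 * ((b[0] - mode[0]) ** 2 + (b[1] - mode[1]) ** 2)
    return -neg_log_post(b) + quad

# MHI chain started at the mode.
state, accepts, n_steps = mode[:], 0, 5000
for _ in range(n_steps):
    prop = [mode[0] + random.gauss(0, 1), mode[1] + random.gauss(0, 1)]
    if math.log(random.random()) <= log_w(prop) - log_w(state):
        state, accepts = prop, accepts + 1
accept_rate = accepts / n_steps
```

In practice the acceptance rate of the centered chain is the natural diagnostic: the higher it is, the closer the per-step rejection probability, and hence the exact convergence rate, is to zero.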
Our goal now is to obtain an upper bound on the rate of convergence established in Corollary 1 in high dimensions for binary response regression. In this context, it is more natural to treat the features as stochastic; each time the sample size increases, the additional observation is randomly generated. Specifically, we will assume the features are independent Gaussian vectors with mean zero. Similar scaling assumptions on the data are used for high-dimensional maximum-likelihood theory in logistic regression [Sur2019]. We will also assume the limit of the trace of the covariance matrix used in our prior is finite as d → ∞. This assumption is natural since, for a Gaussian prior to exist in an infinite-dimensional Hilbert space, it is necessary that the trace of its covariance be finite. Under this model, if the dimension and sample size grow proportionally, we now provide an explicit high-dimensional asymptotic upper bound on the Wasserstein distance.
1. The negative log-likelihood is a twice continuously differentiable convex function with Hessian .
2. There is a universal constant such that for every .
Let . If n, d → ∞ in such a way that d/n → γ, then, almost surely,
Under our assumption, we may write the design matrix as , where Z is an n × d matrix with i.i.d. Gaussian entries with mean zero. Denote the largest eigenvalue of the matrix ZᵀZ/n by λ_{n,d}. Therefore, as n, d → ∞ in such a way that d/n → γ,

λ_{n,d} → (1 + √γ)²

almost surely [Geman1980, Theorem 1].
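The almost-sure limit cited here is easy to check numerically. The sketch below is our own (the sizes n = 240 and d = 60 are arbitrary): it forms ZᵀZ/n for a Gaussian matrix Z and estimates its largest eigenvalue by power iteration; with γ = d/n = 1/4 the Geman limit is (1 + √γ)² = 2.25.

```python
import math
import random

random.seed(11)
n, d = 240, 60
gamma = d / n  # = 0.25, so the limiting top eigenvalue is (1 + 0.5)^2 = 2.25

# Z: n x d matrix of i.i.d. standard Gaussian entries.
Z = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

# S = Z^T Z / n, a d x d sample covariance matrix.
S = [[sum(Z[k][i] * Z[k][j] for k in range(n)) / n for j in range(d)]
     for i in range(d)]

# Power iteration for the largest eigenvalue of S.
v = [1.0] * d
for _ in range(500):
    w = [sum(S[i][j] * v[j] for j in range(d)) for i in range(d)]
    norm = math.sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]

# Rayleigh quotient v^T S v (v has unit length after normalization).
lam = sum(v[i] * sum(S[i][j] * v[j] for j in range(d)) for i in range(d))
limit = (1.0 + math.sqrt(gamma)) ** 2
```

At these modest sizes the estimate already lands close to the limit; the agreement tightens as n and d grow with d/n fixed.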
Define the function by where
is the negative log-likelihood loss function and define. We will first lower bound the quantity . We have that for any and any ,
This implies that for any and any , the Hessian of , denoted by , satisfies
Since the function is twice continuously differentiable, then is also twice continuously differentiable. Since both the gradient and Hessian are continuous and , we use a Taylor expansion to obtain the upper bound
We then have a lower bound on the normalizing constant of the target posterior
This in turn yields a lower bound on the ratio
The matrix is symmetric and positive-definite and so its eigenvalues exist and are positive. Let be the eigenvalues of . It is readily verified that the eigenvalues of the matrix exist and are . Then
We have the basic inequality 1 + t ≤ e^t for any t ∈ ℝ. Since the eigenvalues of are positive and is nonnegative, we have the upper bound
This yields a lower bound on (7). Define the doubly-indexed sequence by
We have then shown that
By our assumption, the trace of the prior covariance matrix converges as d → ∞. Then, as n, d → ∞ in such a way that d/n → γ, by continuity,