1. Introduction
The Metropolis-Hastings (MH) algorithm and the Langevin diffusion are among the most popular algorithms in the area of Markov chain Monte Carlo (MCMC); see for instance the survey by Roberts and Rosenthal [6] and the references therein. Under a Gaussian proposal with vanishing variance and a Gibbs target distribution, Gelfand and Mitter [5] prove that the MH process converges weakly to the Langevin diffusion, thus highlighting the asymptotic connection between these two classes of Markov processes.

With the above classical result in mind, the aim of this paper is to investigate the scaling limit of an entirely different dynamics that we call the second MH process, introduced recently in Choi [2] and Choi and Huang [3]. We first motivate our study of the second MH process by offering a geometric perspective: both the classical MH and the second MH minimize a certain distance to the set of reversible generators, extending the results of Billera and Diaconis [1]. Perhaps surprisingly, for fixed dimension, we shall prove in our main result, Theorem 3.1 below, that, upon scaling in time and in space, both the classical MH and the second MH converge to a universal Langevin diffusion. On a microscopic level, the classical MH and the second MH exhibit different Markovian dynamics, yet on a macroscopic level, that is, over a large time-scale, both processes and their convex combinations converge to the same rescaled Langevin diffusion. We note that in this paper the dimension is kept fixed, whereas in the optimal scaling literature for MCMC (see for example Roberts et al. [7]), the weak convergence results are obtained by letting the dimension tend to infinity.
2. Preliminaries
2.1. Metropolis-Hastings generators: $M_1$ and $M_2$
In this section, we recall the construction of continuous-time Metropolis-Hastings (MH) Markov processes on a general state space $\mathcal{X}$. There are two inputs to the MH algorithm, namely the target distribution and the proposal chain. We refer readers to Roberts and Rosenthal [6] and the references therein for further pointers on this subject. We denote by $\pi$ our target distribution and by $Q$ the generator of the proposal Markov jump process. We assume that both $\pi$ and the jump kernel of $Q$ are absolutely continuous with respect to a common sigma-finite reference measure $\mu$ on $\mathcal{X}$, and with a slight abuse of notation we still denote their densities by $\pi$ and $Q(x,\cdot)$ respectively. Recall that $Q$ is the generator of a Markov jump process in the sense of [4] if and only if $Q(x,y) \geq 0$ for $x \neq y$, $\sup_{x \in \mathcal{X}} \int_{\mathcal{X}} Q(x,y)\,\mu(dy) < \infty$, and, for bounded $f$,
$$Qf(x) = \int_{\mathcal{X}} \big(f(y) - f(x)\big)\, Q(x,y)\, \mu(dy).$$
With these notations, we can now define the first MH generator as a transformation of $Q$ and $\pi$:
Definition 2.1 (The first MH generator).
Given a target distribution $\pi$ on a general state space $\mathcal{X}$ and a proposal continuous-time Markov jump process with generator $Q$, the first MH Markov process is a $\pi$-reversible Markov jump process with generator $M_1 = M_1(Q,\pi)$, where for bounded $f$,
$$M_1 f(x) = \int_{\mathcal{X}} \big(f(y) - f(x)\big)\, \min\Big\{Q(x,y),\, \frac{\pi(y)}{\pi(x)}\, Q(y,x)\Big\}\, \mu(dy).$$
Note that the off-diagonal rate of $M_1$ can be written as
$$\min\Big\{Q(x,y),\, \frac{\pi(y)}{\pi(x)}\, Q(y,x)\Big\} = Q(x,y)\, \min\Big\{1,\, \frac{\pi(y)\, Q(y,x)}{\pi(x)\, Q(x,y)}\Big\},$$
that is, $M_1$ thins the proposal rate $Q(x,y)$ by the classical MH acceptance probability.
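On a finite state space, where $Q$ is a rate matrix (rows summing to zero) and $\pi$ a probability vector, Definition 2.1 can be checked numerically. The sketch below is ours, not the paper's: it builds the min-rates, assuming the off-diagonal form $\min\{Q(x,y), \pi(y)Q(y,x)/\pi(x)\}$, and verifies $\pi$-reversibility (detailed balance).

```python
import numpy as np

def mh1_generator(Q, pi):
    """First MH generator M1(Q, pi): off-diagonal rates
    min{Q(x,y), pi(y)/pi(x) * Q(y,x)}, diagonal set so each row sums to zero."""
    M = np.minimum(Q, pi[None, :] / pi[:, None] * Q.T)
    np.fill_diagonal(M, 0.0)
    np.fill_diagonal(M, -M.sum(axis=1))
    return M

# A small 3-state proposal generator (rows sum to zero) and a target pi (ours).
Q = np.array([[-1.0, 0.7, 0.3],
              [0.2, -0.5, 0.3],
              [0.5, 0.5, -1.0]])
pi = np.array([0.5, 0.3, 0.2])

M1 = mh1_generator(Q, pi)
# pi-reversibility (detailed balance): pi(x) M1(x,y) = pi(y) M1(y,x).
balance = pi[:, None] * M1
assert np.allclose(balance, balance.T)
```

Detailed balance holds because $\pi(x)\min\{Q(x,y), \pi(y)Q(y,x)/\pi(x)\} = \min\{\pi(x)Q(x,y), \pi(y)Q(y,x)\}$, which is symmetric in $(x,y)$.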
In view of the earlier works by the author Choi [2] and Choi and Huang [3], we would like to study the so-called second MH generator, obtained by replacing the minimum in Definition 2.1 by a maximum. More precisely, we define it as follows.
Definition 2.2 (The second MH generator).
Given a target distribution $\pi$ on a general state space $\mathcal{X}$ and a proposal continuous-time Markov jump process with generator $Q$, define, for $x \neq y$,
$$M_2(x,y) := \max\Big\{Q(x,y),\, \frac{\pi(y)}{\pi(x)}\, Q(y,x)\Big\}.$$
If

$$\sup_{x \in \mathcal{X}} \int_{\mathcal{X}} \max\Big\{Q(x,y),\, \frac{\pi(y)}{\pi(x)}\, Q(y,x)\Big\}\, \mu(dy) < \infty, \tag{2.1}$$

then the second MH Markov process is a $\pi$-reversible Markov jump process with generator $M_2 = M_2(Q,\pi)$, where for bounded $f$,
$$M_2 f(x) = \int_{\mathcal{X}} \big(f(y) - f(x)\big)\, \max\Big\{Q(x,y),\, \frac{\pi(y)}{\pi(x)}\, Q(y,x)\Big\}\, \mu(dy).$$
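On a finite state space condition (2.1) is automatic, and the max-rates of Definition 2.2 can be built analogously. A sketch (names and example matrices ours), which also checks the pointwise identity $\min + \max = Q + Q^*$ with $Q^*(x,y) = \pi(y)Q(y,x)/\pi(x)$, a consequence of $\min\{a,b\} + \max\{a,b\} = a + b$:

```python
import numpy as np

def mh2_generator(Q, pi):
    """Second MH generator M2(Q, pi): off-diagonal rates
    max{Q(x,y), pi(y)/pi(x) * Q(y,x)}. On a finite state space the
    rates are automatically finite, which plays the role of (2.1)."""
    M = np.maximum(Q, pi[None, :] / pi[:, None] * Q.T)
    np.fill_diagonal(M, 0.0)
    np.fill_diagonal(M, -M.sum(axis=1))
    return M

Q = np.array([[-1.0, 0.7, 0.3],
              [0.2, -0.5, 0.3],
              [0.5, 0.5, -1.0]])
pi = np.array([0.5, 0.3, 0.2])

M2 = mh2_generator(Q, pi)
balance = pi[:, None] * M2
assert np.allclose(balance, balance.T)  # pi-reversibility

# Off the diagonal, min-rates plus max-rates recover Q plus its pi-dual Q*.
M1_raw = np.minimum(Q, pi[None, :] / pi[:, None] * Q.T)
Qdual = pi[None, :] / pi[:, None] * Q.T
off = ~np.eye(3, dtype=bool)
assert np.allclose((M1_raw + M2)[off], (Q + Qdual)[off])
```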
Comparing Definitions 2.1 and 2.2, we see that in the former $M_1$ is always a generator of a Markov jump process, while in the latter additional conditions on $Q$ and $\pi$ are required to ensure (2.1). In our main results in Section 3, we will consider the special case where the proposal density $Q(x,\cdot)$ is a normal distribution with mean $x$ and variance $\epsilon$, and $\pi$ is the Gibbs distribution. Under the usual regularity conditions, as in Gelfand and Mitter [5], we will see that $M_2$ as defined is a valid generator of a Markov jump process; see Proposition 3.1 below.

2.2. Geometric interpretation of $M_1$ and $M_2$
In order to motivate the definitions of $M_1$ and $M_2$ as natural transformations of $Q$ and $\pi$, in this section we offer a geometric interpretation of both $M_1$ and $M_2$, extending the results of Billera and Diaconis [1] and Choi and Huang [3]. In our result Theorem 2.1 below, we prove that both $M_1$ and $M_2$, as well as their convex combinations, minimize a certain distance between $Q$ and the set of $\pi$-reversible generators of Markov jump processes on $\mathcal{X}$. In this sense, they are natural transformations that map a given generator of a Markov jump process to the set of $\pi$-reversible generators of Markov jump processes.
We first introduce a few notations and define a metric to quantify the distance between two generators of Markov jump processes. We write $\mathcal{R}(\pi)$ for the set of conservative $\pi$-reversible generators of Markov jump processes and $\mathcal{G}$ for the set of generators of Markov jump processes on $\mathcal{X}$. For $L_1, L_2 \in \mathcal{G}$, similar to [1], we define a metric on $\mathcal{G}$ by
$$d_\pi(L_1, L_2) := \int_{(\mathcal{X}\times\mathcal{X})\setminus \Delta} \pi(x)\, \big| L_1(x,y) - L_2(x,y) \big|\, \mu(dx)\, \mu(dy),$$
where $\Delta := \{(x,x):\ x \in \mathcal{X}\}$ is the set of diagonal elements in $\mathcal{X}\times\mathcal{X}$. The distance between $Q$ and $\mathcal{R}(\pi)$ is then defined to be

$$d_\pi\big(Q, \mathcal{R}(\pi)\big) := \inf_{L \in \mathcal{R}(\pi)} d_\pi(Q, L). \tag{2.2}$$
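On a finite state space the metric reduces to a $\pi$-weighted $\ell_1$ distance over off-diagonal rates, and the minimization property can be probed numerically. A sketch under our finite-state reading of the metric (all names and the example matrices are ours): no randomly drawn $\pi$-reversible competitor beats the min-rate transformation of Definition 2.1.

```python
import numpy as np

rng = np.random.default_rng(0)

def dist(Q, L, pi):
    """d_pi(Q, L): pi-weighted L1 distance over off-diagonal rates."""
    off = ~np.eye(len(pi), dtype=bool)
    return np.sum((pi[:, None] * np.abs(Q - L))[off])

def mh1(Q, pi):
    """Min-rate (first MH) transformation of Q, made conservative."""
    M = np.minimum(Q, pi[None, :] / pi[:, None] * Q.T)
    np.fill_diagonal(M, 0.0)
    np.fill_diagonal(M, -M.sum(axis=1))
    return M

def random_reversible(pi, n):
    """A random pi-reversible generator: symmetric S, rates S(x,y)/pi(x)."""
    S = rng.random((n, n))
    S = S + S.T
    L = S / pi[:, None]
    np.fill_diagonal(L, 0.0)
    np.fill_diagonal(L, -L.sum(axis=1))
    return L

Q = np.array([[-1.0, 0.7, 0.3],
              [0.2, -0.5, 0.3],
              [0.5, 0.5, -1.0]])
pi = np.array([0.5, 0.3, 0.2])
d_star = dist(Q, mh1(Q, pi), pi)
# No pi-reversible competitor sampled here beats M1(Q, pi).
assert all(dist(Q, random_reversible(pi, 3), pi) >= d_star - 1e-12
           for _ in range(200))
```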
With the above notations in mind, we are now ready to state our result in this section:
Theorem 2.1.
Suppose that $Q$ and $\pi$ are such that (2.1) is satisfied, so that $M_2 = M_2(Q,\pi)$ is a generator of a Markov jump process. The convex combinations $\alpha M_1 + (1-\alpha) M_2$ for $\alpha \in [0,1]$ minimize the distance between $Q$ and $\mathcal{R}(\pi)$. That is, for every $\alpha \in [0,1]$,
$$d_\pi\big(Q,\ \alpha M_1 + (1-\alpha) M_2\big) = d_\pi\big(Q, \mathcal{R}(\pi)\big).$$
[Proof. ]The proof is inspired by the proofs of the corresponding results in Billera and Diaconis [1] and Choi and Huang [3]. We first define two helpful half-spaces of ordered pairs:
$$H^+ := \big\{(x,y):\ \pi(x)Q(x,y) \geq \pi(y)Q(y,x)\big\}, \qquad H^- := \big\{(x,y):\ \pi(x)Q(x,y) < \pi(y)Q(y,x)\big\}.$$
We now show that for $L \in \mathcal{R}(\pi)$, $d_\pi(Q,L) \geq d_\pi(Q,M_1)$. First, we note that
$$d_\pi(Q,L) = \int_{H^+\setminus\Delta} \Big( \pi(x)\,\big|Q(x,y) - L(x,y)\big| + \pi(y)\,\big|Q(y,x) - L(y,x)\big| \Big)\, \mu(dx)\,\mu(dy).$$
As $L$ is $\pi$-reversible, setting $s(x,y) := \pi(x)L(x,y) = \pi(y)L(y,x)$ gives
$$\pi(x)\,\big|Q(x,y) - L(x,y)\big| + \pi(y)\,\big|Q(y,x) - L(y,x)\big| = \big|\pi(x)Q(x,y) - s(x,y)\big| + \big|\pi(y)Q(y,x) - s(x,y)\big|.$$
Plugging these expressions back yields
$$d_\pi(Q,L) \geq \int_{H^+\setminus\Delta} \big|\pi(x)Q(x,y) - \pi(y)Q(y,x)\big|\, \mu(dx)\,\mu(dy) = d_\pi(Q,M_1),$$
where we use the reverse triangle inequality in the second inequality. Similarly, we can show $d_\pi(Q,L) \geq d_\pi(Q,M_2)$ via substituting $\min$ by $\max$. To see that $d_\pi(Q,M_1) = d_\pi(Q,M_2)$, we have
$$d_\pi(Q,M_1) = \int_{H^+\setminus\Delta} \big(\pi(x)Q(x,y) - \pi(y)Q(y,x)\big)\, \mu(dx)\,\mu(dy) = d_\pi(Q,M_2),$$
since on $H^+$ the rate of $M_1$ equals $\frac{\pi(y)}{\pi(x)}Q(y,x)$ while that of $M_2$ equals $Q(x,y)$, and symmetrically on $H^-$. As for convex combinations of $M_1$ and $M_2$, we see that for $\alpha \in [0,1]$, by the triangle inequality,
$$d_\pi\big(Q,\ \alpha M_1 + (1-\alpha)M_2\big) \leq \alpha\, d_\pi(Q,M_1) + (1-\alpha)\, d_\pi(Q,M_2) = d_\pi\big(Q,\mathcal{R}(\pi)\big),$$
and the reverse inequality holds since $\alpha M_1 + (1-\alpha)M_2 \in \mathcal{R}(\pi)$.
3. Main results: universality of the Langevin diffusion as scaling limit of random walk $M_1^\epsilon$ and $M_2^\epsilon$
In this section, we specialize to the case $\mathcal{X} = \mathbb{R}^d$ with $d \in \mathbb{N}$, and we take the reference measure $\mu$ to be the Lebesgue measure. Let $U: \mathbb{R}^d \to \mathbb{R}$ be a function satisfying the following regularity assumption:
Assumption 3.1.
$U$ is continuously differentiable, and its gradient $\nabla U$ is bounded and Lipschitz continuous.
Note that the same assumption on $U$ is imposed in Gelfand and Mitter [5] to obtain their weak convergence result, which we will briefly recall later in this section. The target distribution $\pi$ is the Gibbs distribution at temperature $T > 0$ with density given by

$$\pi(x) = \frac{e^{-U(x)/T}}{\int_{\mathbb{R}^d} e^{-U(y)/T}\, dy}. \tag{3.1}$$

Writing $\varphi_{m,\epsilon}$ for the density of the one-dimensional normal distribution with mean $m$ and variance $\epsilon$, for the proposal Markov jump process we take $Q^\epsilon$ to be

$$Q^\epsilon(x,y) := \frac{1}{d} \sum_{i=1}^d \varphi_{x_i,\epsilon}(y_i) \prod_{j \neq i} \delta(y_j - x_j), \tag{3.2}$$

where $\delta$ is the Dirac delta function. In words, we pick one of the $d$ coordinates uniformly at random, say $i$, and propose a new state at coordinate $i$ according to a normal distribution centered at $x_i$ with variance $\epsilon$, while keeping the other coordinates unchanged. Note that $Q^\epsilon(x,y) = Q^\epsilon(y,x)$. If we write $\pi$ for the Gibbs distribution in (3.1), we define $M_1^\epsilon$ and $M_2^\epsilon$ to be respectively
$$M_1^\epsilon := M_1(Q^\epsilon, \pi), \qquad M_2^\epsilon := M_2(Q^\epsilon, \pi).$$
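The random-scan proposal mechanism described in words above can be sketched in a few lines (function names and the variance parameter name `eps` are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def propose(x, eps):
    """Random-scan proposal: pick a coordinate i uniformly at random and
    resample it from a normal centered at x[i] with variance eps,
    keeping all other coordinates unchanged."""
    y = x.copy()
    i = rng.integers(len(x))
    y[i] += np.sqrt(eps) * rng.standard_normal()
    return y, i

x = np.zeros(4)
y, i = propose(x, eps=0.01)
assert np.sum(y != x) <= 1   # at most one coordinate moves
```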
With the above notations, we can describe $M_1^\epsilon$ and $M_2^\epsilon$ in this setting:

Proposition 3.1 ($M_1^\epsilon$ and $M_2^\epsilon$ under Gibbs target $\pi$ and Gaussian proposal $Q^\epsilon$).
For $x \neq y$ that differ only in coordinate $i$, we have
$$M_1^\epsilon(x,y) = \frac{1}{d}\, \varphi_{x_i,\epsilon}(y_i)\, \min\big\{1,\ e^{-(U(y)-U(x))/T}\big\}, \qquad M_2^\epsilon(x,y) = \frac{1}{d}\, \varphi_{x_i,\epsilon}(y_i)\, \max\big\{1,\ e^{-(U(y)-U(x))/T}\big\},$$
and $M_2^\epsilon$ is a valid generator of a Markov jump process.
[Proof. ]We first prove the two formulae for $M_1^\epsilon$ and $M_2^\epsilon$. As $\pi(y)/\pi(x) = e^{-(U(y)-U(x))/T}$ and $Q^\epsilon(x,y) = Q^\epsilon(y,x)$, we have
$$\min\Big\{Q^\epsilon(x,y),\ \frac{\pi(y)}{\pi(x)}\, Q^\epsilon(y,x)\Big\} = Q^\epsilon(x,y)\, \min\big\{1,\ e^{-(U(y)-U(x))/T}\big\},$$
and similarly for the maximum. Next, we show that $M_2^\epsilon$ is a valid generator of a Markov jump process. Let $C$ be an upper bound on $\|\nabla U(x)\|$
for all $x \in \mathbb{R}^d$. By the mean value theorem applied to $U$, we have
$$\max\big\{1,\ e^{-(U(y)-U(x))/T}\big\} \leq e^{C|y_i - x_i|/T}.$$
Writing $Z$
for a normal random variable with mean $0$
and variance $\epsilon$, this leads to
$$\int_{\mathbb{R}^d} M_2^\epsilon(x,y)\, dy \leq \mathbb{E}\big[e^{C|Z|/T}\big] = M_{|Z|}(C/T),$$
where $M_{|Z|}$
is the moment generating function of the half-normal distribution
of $|Z|$, which is independent of $x$. Hence the total jump rate of $M_2^\epsilon$ is uniformly bounded, and $M_2^\epsilon$ is a valid generator.

In our main result of this paper, we are primarily interested in the scaling limit of $M_1^\epsilon$ and $M_2^\epsilon$ upon scaling in time by $\epsilon^{-1}$ as $\epsilon \to 0$. The scaling in space is embedded in the proposal variance $\epsilon$. Let $D([0,\infty), \mathbb{R}^d)$ be the space of $\mathbb{R}^d$-valued functions on $[0,\infty)$ that are right continuous with left limits, equipped with the Skorohod topology. We denote the weak convergence of processes in the Skorohod topology by $\Rightarrow$.
Theorem 3.1 (Universality of the Langevin diffusion as scaling limit of $M_1^\epsilon$ and $M_2^\epsilon$).
Suppose that $U$ satisfies Assumption 3.1, and let $X^{1,\epsilon} = (X^{1,\epsilon}(t))_{t \geq 0}$ and $X^{2,\epsilon} = (X^{2,\epsilon}(t))_{t \geq 0}$ be the Markov jump processes with generators $M_1^\epsilon$ and $M_2^\epsilon$ respectively, both with initial distribution $\nu$ independent of $\epsilon$. Let $X = (X(t))_{t \geq 0}$ be the following time-rescaled Langevin diffusion with $X(0) \sim \nu$ and stochastic differential equation given by

$$dX(t) = -\frac{1}{2dT}\, \nabla U(X(t))\, dt + \frac{1}{\sqrt{d}}\, dB(t), \tag{3.3}$$

where $B = (B(t))_{t \geq 0}$ is the standard $d$-dimensional Brownian motion. Then we have
$$\big(X^{1,\epsilon}(t/\epsilon)\big)_{t \geq 0} \Rightarrow X \qquad \text{and} \qquad \big(X^{2,\epsilon}(t/\epsilon)\big)_{t \geq 0} \Rightarrow X$$
weakly in $D([0,\infty), \mathbb{R}^d)$ as $\epsilon \to 0$.
Remark 3.1.
The weak convergence of the first MH process $X^{1,\epsilon}$ is essentially the classical result of Gelfand and Mitter [5], adapted to the random-scan proposal (3.2), and the arguments for $M_1^\epsilon$ and $M_2^\epsilon$ are entirely analogous; accordingly, in Section 4 we only carry out the proof for the second MH process.
Remark 3.2.
It is perhaps surprising that both $M_1^\epsilon$ and $M_2^\epsilon$ share the same scaling limit, given that they have entirely different dynamics. In fact, any Markov jump process whose generator is a convex combination of $M_1^\epsilon$ and $M_2^\epsilon$ converges to the same Langevin diffusion:
Corollary 3.1.
Suppose that $U$ satisfies Assumption 3.1, and for $\alpha \in [0,1]$ let $X^{\alpha,\epsilon} = (X^{\alpha,\epsilon}(t))_{t \geq 0}$ be the Markov jump process with generator $\alpha M_1^\epsilon + (1-\alpha) M_2^\epsilon$ and initial distribution $\nu$ independent of $\epsilon$. Let $X = (X(t))_{t \geq 0}$ be the time-rescaled Langevin diffusion with $X(0) \sim \nu$ and stochastic differential equation given by (3.3),
where $B$ is the standard $d$-dimensional Brownian motion. Then we have
$$\big(X^{\alpha,\epsilon}(t/\epsilon)\big)_{t \geq 0} \Rightarrow X$$
weakly in $D([0,\infty), \mathbb{R}^d)$ as $\epsilon \to 0$.
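The universality statement can be probed empirically in one dimension. The sketch below is a toy check, not the paper's proof; the quadratic potential $U(x) = x^2/2$, the parameter values, and all names are ours. It runs a random-scan MH chain with small proposal variance alongside an Euler-Maruyama discretization of a Langevin diffusion of the form (3.3) with $d = 1$, and checks that both exhibit the stationary variance $T$ of the Gibbs target.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy setup (ours): d = 1, U(x) = x^2/2, temperature T, so the Gibbs target is N(0, T).
T = 1.0
def U_grad(x):
    return x

eps, n = 0.01, 400_000

# Random-scan MH chain: propose N(x, eps), accept with min{1, exp(-(U(y)-U(x))/T)}.
x = 0.0
xs = np.empty(n)
for k in range(n):
    y = x + np.sqrt(eps) * rng.standard_normal()
    if rng.random() < min(1.0, np.exp(-(y * y - x * x) / (2.0 * T))):
        x = y
    xs[k] = x

# Euler-Maruyama discretization of dX = -U'(X)/(2T) dt + dB (the d = 1 case of (3.3)).
h = 0.01
z = 0.0
zs = np.empty(n)
for k in range(n):
    z += -U_grad(z) / (2.0 * T) * h + np.sqrt(h) * rng.standard_normal()
    zs[k] = z

# Both chains should exhibit the stationary variance T of the Gibbs target.
assert abs(np.var(xs[n // 10:]) - T) < 0.2
assert abs(np.var(zs[n // 10:]) - T) < 0.2
```

The MH chain leaves the Gibbs target exactly invariant for every proposal variance, so the agreement in long-run variance is a consistency check on the common invariant distribution rather than on the pathwise limit itself.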
4. Proofs of main results
4.1. Proof of Theorem 3.1
For notational convenience, we take $T = 1$, replacing $U$ by $U/T$. In view of Remark 3.1, we only prove the weak convergence of the second MH process $X^{2,\epsilon}$. We let $L$ be the generator of $X$ described by (3.3), where for $f \in C_c^\infty(\mathbb{R}^d)$,
$$Lf(x) = \frac{1}{2d} \sum_{i=1}^d \big( \partial_{ii} f(x) - \partial_i U(x)\, \partial_i f(x) \big).$$
Note that since the drift is Lipschitz continuous, by [4] the space $C_c^\infty(\mathbb{R}^d)$ of infinitely differentiable functions with compact support forms a core of $L$. Thus, to prove the desired weak convergence, by [4] it suffices to prove the uniform convergence of the generators on this core; that is, for $f \in C_c^\infty(\mathbb{R}^d)$, as $\epsilon \to 0$ we would like to prove that

$$\Big\| \frac{1}{\epsilon}\, M_2^\epsilon f - Lf \Big\|_\infty \to 0. \tag{4.1}$$

Define, for $x, y \in \mathbb{R}^d$, the two acceptance factors
$$\max\big\{1,\ e^{-(U(y)-U(x))}\big\} \qquad \text{and} \qquad \max\big\{1,\ e^{-\nabla U(x)\cdot(y-x)}\big\}.$$
We now present three lemmas that will aid our proof; their proofs are deferred to Sections 4.1.1, 4.1.2 and 4.1.3 respectively. The first auxiliary lemma bounds the distance between these two acceptance factors:
Lemma 4.1.
There exist positive constants $C_1$ and $C_2$ that only depend on $U$ and $T$ such that, for all $x, y \in \mathbb{R}^d$,
$$e^{-C_1|y-x|^2} \leq \frac{\max\big\{1,\ e^{-(U(y)-U(x))}\big\}}{\max\big\{1,\ e^{-\nabla U(x)\cdot(y-x)}\big\}} \leq e^{C_1|y-x|^2},$$
and the same bounds hold with $\min$ in place of $\max$. Consequently, we have
$$\Big| \max\big\{1,\ e^{-(U(y)-U(x))}\big\} - \max\big\{1,\ e^{-\nabla U(x)\cdot(y-x)}\big\} \Big| \leq \big(e^{C_1|y-x|^2} - 1\big)\, e^{C_2|y-x|}.$$
Our next lemma controls the expectation of the upper bound obtained in Lemma 4.1.

Lemma 4.2.
Recall that $Z$ follows a normal distribution with mean $0$, variance $\epsilon$ and probability density function $\varphi_{0,\epsilon}$. With $C_1$ and $C_2$ as in Lemma 4.1, we have, as $\epsilon \to 0$,
$$\frac{1}{\epsilon}\, \mathbb{E}\Big[ |Z|\, \big(e^{C_1 Z^2} - 1\big)\, e^{C_2 |Z|} \Big] \to 0.$$
Lemma 4.3.
For $i \in \{1, \ldots, d\}$ and $x \in \mathbb{R}^d$, as $\epsilon \to 0$,

$$\frac{1}{\epsilon}\, \mathbb{E}\Big[ Z\, \max\big\{1,\ e^{-\partial_i U(x) Z}\big\} \Big] \to -\frac{1}{2}\, \partial_i U(x), \tag{4.2}$$
$$\frac{1}{\epsilon}\, \mathbb{E}\Big[ Z^2\, \max\big\{1,\ e^{-\partial_i U(x) Z}\big\} \Big] \to 1, \tag{4.3}$$
$$\frac{1}{\epsilon}\, \mathbb{E}\Big[ |Z|^3\, \max\big\{1,\ e^{-\partial_i U(x) Z}\big\} \Big] \to 0, \tag{4.4}$$

where the convergences are all uniform in $x$.
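Under our reading of these moment limits, with a constant $a$ standing in for a fixed value of $\partial_i U(x)$ (and $T = 1$), the three convergences can be sanity-checked by Monte Carlo at a small $\epsilon$; the names and tolerances below are ours.

```python
import numpy as np

rng = np.random.default_rng(4)

a = 2.0       # stands in for a fixed value of the partial derivative of U
eps = 1e-4    # small proposal variance
Z = np.sqrt(eps) * rng.standard_normal(5_000_000)
acc = np.maximum(1.0, np.exp(-a * Z))      # second-MH acceptance factor max{1, e^{-aZ}}

m1 = np.mean(Z * acc) / eps                # drift-type moment, expected near -a/2
m2 = np.mean(Z ** 2 * acc) / eps           # diffusion-type moment, expected near 1
m3 = np.mean(np.abs(Z) ** 3 * acc) / eps   # remainder moment, expected near 0
assert abs(m1 + a / 2.0) < 0.2
assert abs(m2 - 1.0) < 0.1
assert m3 < 0.1
```

The drift limit $-a/2$ rather than $-a$ reflects that the acceptance factor only deviates from $1$ on one half-line, which is where the factor $1/2$ in the limiting Langevin drift comes from.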
We proceed to complete the proof of Theorem 3.1. By a Taylor expansion of $f$, for $y$ differing from $x$ only in coordinate $i$ there exists $\xi$ between $x_i$ and $y_i$ such that
$$f(y) - f(x) = \partial_i f(x)\,(y_i - x_i) + \frac{1}{2}\, \partial_{ii} f(x)\,(y_i - x_i)^2 + \frac{1}{6}\, \partial_{iii} f(x_1, \ldots, \xi, \ldots, x_d)\,(y_i - x_i)^3.$$
Combining this expansion with Lemmas 4.1 and 4.2, which allow us to replace $U(y) - U(x)$ by $\nabla U(x)\cdot(y-x) = \partial_i U(x)(y_i - x_i)$ in the acceptance factor, we obtain
$$\frac{1}{\epsilon}\, M_2^\epsilon f(x) = \frac{1}{d} \sum_{i=1}^d \frac{1}{\epsilon}\, \mathbb{E}\Big[ \Big( \partial_i f(x)\, Z + \frac{1}{2}\, \partial_{ii} f(x)\, Z^2 + O(|Z|^3) \Big)\, \max\big\{1,\ e^{-\partial_i U(x) Z}\big\} \Big] + o(1) \to Lf(x),$$
where the convergence follows from Lemma 4.3 and the fact that $f$ has compact support. Note that the convergence is uniform in $x$, which yields (4.1).
4.1.1. Proof of Lemma 4.1
First, by the Taylor expansion of $U$ and the fact that $\nabla U$ is Lipschitz continuous by Assumption 3.1, there exists a constant $C_1 > 0$ such that
$$\big| \big(U(y) - U(x)\big) - \nabla U(x)\cdot(y-x) \big| \leq C_1 |y-x|^2.$$
We would like to show that the following inequality holds, by considering the possible signs of $U(y)-U(x)$ and $\nabla U(x)\cdot(y-x)$:

$$e^{-C_1|y-x|^2}\, \max\big\{1,\ e^{-\nabla U(x)\cdot(y-x)}\big\} \leq \max\big\{1,\ e^{-(U(y)-U(x))}\big\} \leq e^{C_1|y-x|^2}\, \max\big\{1,\ e^{-\nabla U(x)\cdot(y-x)}\big\}. \tag{4.5}$$
- Case 1: $U(y)-U(x) \geq 0$, $\nabla U(x)\cdot(y-x) < 0$.
In this case, since $-\nabla U(x)\cdot(y-x) \leq -(U(y)-U(x)) + C_1|y-x|^2 \leq C_1|y-x|^2$, upon rearranging we obtain the leftmost inequality of (4.5); the rightmost inequality holds since $e^{C_1|y-x|^2}\, e^{-\nabla U(x)\cdot(y-x)} > 1$.
- Case 2: $U(y)-U(x) < 0$, $\nabla U(x)\cdot(y-x) \geq 0$.
The leftmost inequality of (4.5) holds trivially, as $e^{-(U(y)-U(x))} > 1 \geq e^{-C_1|y-x|^2}$; the rightmost follows from $-(U(y)-U(x)) \leq -\nabla U(x)\cdot(y-x) + C_1|y-x|^2 \leq C_1|y-x|^2$.
- Case 3: $U(y)-U(x) \geq 0$, $\nabla U(x)\cdot(y-x) \geq 0$.
Both maxima equal $1$, so (4.5) is immediate.
- Case 4: $U(y)-U(x) < 0$, $\nabla U(x)\cdot(y-x) < 0$.
(4.5) follows directly upon exponentiating $\big|(U(y)-U(x)) - \nabla U(x)\cdot(y-x)\big| \leq C_1|y-x|^2$.
Similarly, we would like to show that the following inequality holds, again by considering the possible signs of $U(y)-U(x)$ and $\nabla U(x)\cdot(y-x)$:

$$e^{-C_1|y-x|^2}\, \min\big\{1,\ e^{-\nabla U(x)\cdot(y-x)}\big\} \leq \min\big\{1,\ e^{-(U(y)-U(x))}\big\} \leq e^{C_1|y-x|^2}\, \min\big\{1,\ e^{-\nabla U(x)\cdot(y-x)}\big\}. \tag{4.6}$$
- Case 1: $U(y)-U(x) \geq 0$, $\nabla U(x)\cdot(y-x) < 0$.
Since $U(y)-U(x) \leq \nabla U(x)\cdot(y-x) + C_1|y-x|^2 \leq C_1|y-x|^2$, the leftmost inequality of (4.6) follows upon rearranging; the rightmost holds trivially as $e^{-(U(y)-U(x))} \leq 1 \leq e^{C_1|y-x|^2}$.
- Case 2: $U(y)-U(x) < 0$, $\nabla U(x)\cdot(y-x) \geq 0$.
The leftmost inequality of (4.6) holds trivially, as $e^{-C_1|y-x|^2}\, e^{-\nabla U(x)\cdot(y-x)} \leq 1$; the rightmost follows from $\nabla U(x)\cdot(y-x) \leq (U(y)-U(x)) + C_1|y-x|^2 \leq C_1|y-x|^2$.
- Case 3: $U(y)-U(x) \geq 0$, $\nabla U(x)\cdot(y-x) \geq 0$.
(4.6) follows directly upon exponentiating $\big|(U(y)-U(x)) - \nabla U(x)\cdot(y-x)\big| \leq C_1|y-x|^2$.
- Case 4: $U(y)-U(x) < 0$, $\nabla U(x)\cdot(y-x) < 0$.
Both minima equal $1$, so (4.6) is immediate.
As a result, collecting both (4.5) and (4.6), we have
$$\Big| \max\big\{1,\ e^{-(U(y)-U(x))}\big\} - \max\big\{1,\ e^{-\nabla U(x)\cdot(y-x)}\big\} \Big| \leq \big(e^{C_1|y-x|^2} - 1\big)\, \max\big\{1,\ e^{-\nabla U(x)\cdot(y-x)}\big\} \leq \big(e^{C_1|y-x|^2} - 1\big)\, e^{C_2|y-x|},$$
where the inequalities in the two equations above follow from the mean value theorem and the fact that $\nabla U$ is bounded by Assumption 3.1.
4.1.2. Proof of Lemma 4.2
Let $Z \sim N(0, \epsilon)$ and write $\Phi$
for the cumulative distribution function of the standard normal. We also denote by $\Phi^{(n)}$
the $n$-th derivative of $\Phi$. By brute force differentiation and integration we note that