# Universality of the Langevin diffusion as scaling limit of a family of Metropolis-Hastings processes I: fixed dimension

Given a target distribution μ on a general state space X and a proposal Markov jump process with generator Q, the purpose of this paper is to investigate two universal properties enjoyed by two types of Metropolis-Hastings (MH) processes with generators M_1(Q,μ) and M_2(Q,μ) respectively. First, we motivate our study of M_2 by offering a geometric interpretation of M_1, M_2 and their convex combinations as L^1 minimizers between Q and the set of μ-reversible generators of Markov jump processes. Second, specializing to the case of X = R^d along with a Gaussian proposal with vanishing variance and Gibbs target distribution, we prove that, upon appropriate scaling in time, the family of Markov jump processes corresponding to M_1, M_2 or their convex combinations all converge weakly to a universal Langevin diffusion. While M_1 and M_2 are seemingly different stochastic dynamics, it is perhaps surprising that they share these two universal properties. These two results are known for M_1 in Billera and Diaconis (2001) and Gelfand and Mitter (1991), and the counterpart results for M_2 and the convex combinations are new.


## 1. Introduction

The Metropolis-Hastings (MH) algorithm, the Langevin diffusion and their variants are among the most popular algorithms in the area of Markov chain Monte Carlo (MCMC); see for instance the survey of Roberts and Rosenthal [12] and the references therein. Under a Gaussian proposal with vanishing variance and Gibbs target distribution, Gelfand and Mitter [8] prove that the MH process converges weakly to the Langevin diffusion, thus highlighting the asymptotic connection between these two classes of Markov processes. Another interesting property enjoyed by the MH algorithm, first shown in Billera and Diaconis [3], is that the MH transition kernel minimizes a certain distance between the proposal chain and the set of transition kernels that are reversible with respect to the target distribution, thus offering a geometric perspective on the study of the MH algorithm.

With the above classical results in mind, the aim of this paper is to investigate how these two properties extend to an entirely different dynamics that we call the second MH process, introduced recently by the author in Choi [4], Choi and Huang [6]. The first universal property is stated in Theorem 2.1 below: both the classical MH and the second MH minimize a certain distance, extending the results by Billera and Diaconis [3] to a continuous-time and general state space setting. Our main result, Theorem 3.1 below, states the second universal property: upon the same scaling in time and in space, perhaps surprisingly, both the classical MH and the second MH converge to a universal Langevin diffusion. On a microscopic level, the classical MH and the second MH exhibit different Markovian dynamics, yet on a macroscopic level or on a large time-scale, both processes and their convex combinations converge to a universal rescaled Langevin diffusion; thus the dynamics of this family are not that different after all. As emphasized in the title of this paper, we note that the dimension is kept fixed in our weak convergence result. In a related line of work, known as the optimal scaling of MCMC (see for example Roberts et al. [13], Roberts and Rosenthal [11], Bédard [1], Jourdain et al. [9], Mattingly et al. [10], Bierkens and Roberts [2]), the weak convergence results are obtained by letting the dimension go to infinity. In the sequel of this paper, Choi [5], we shall investigate the scaling limit of the second MH process in the Curie-Weiss model in the setting of optimal scaling as the dimension increases, in the hope of obtaining interesting counterparts to the results of Bierkens and Roberts [2].

The rest of this paper is organized as follows. In Section 2, we recall the classical and the second MH process and fix our notation. The geometric interpretation of these processes is proved in Section 2.2. In Section 3, the weak convergence result is stated, and it is proved in Section 4.

## 2. Preliminaries

### 2.1. Metropolis-Hastings generators: M1 and M2

In this section, we recall the construction of continuous-time Metropolis-Hastings (MH) Markov processes on a general state space $\mathcal{X}$. There are two inputs to the MH algorithm, namely the target distribution and the proposal chain. We refer readers to Roberts and Rosenthal [12] and the references therein for further pointers on this subject. We denote by $\mu$ our target distribution and by $Q$ the generator of the proposal Markov jump process. We assume that both $\mu$ and $Q$ are absolutely continuous with respect to a common sigma-finite reference measure $\nu$ on $\mathcal{X}$, and with a slight abuse of notation we still denote their densities by $\mu$ and $Q$ respectively. Recall that $Q$ is the generator of a Markov jump process in the sense of [7, Chapter Section ] if and only if

$$\sup_{x \in \mathcal{X}} \int_{y;\, y \neq x} Q(x,y)\, \nu(dy) < \infty.$$

With this notation, we can now define the first MH generator as a transformation of $Q$ and $\mu$:

###### Definition 2.1 (The first MH generator).

Given a target distribution $\mu$ on a general state space $\mathcal{X}$ and a proposal continuous-time Markov jump process with generator $Q$, the first MH Markov process is a $\mu$-reversible Markov jump process with generator given by $M_1 = M_1(Q,\mu)$, where for bounded $f$

$$M_1 f(x) = \int_{y;\, y \neq x} (f(y) - f(x))\, M_1(x,y)\, \nu(dy), \qquad M_1(x,y) := \min\left\{Q(x,y),\ \frac{\mu(y)}{\mu(x)} Q(y,x)\right\}, \quad x \neq y.$$

Note that

$$\sup_{x \in \mathcal{X}} \int_{y;\, y \neq x} M_1(x,y)\, \nu(dy) \leqslant \sup_{x \in \mathcal{X}} \int_{y;\, y \neq x} Q(x,y)\, \nu(dy) < \infty.$$

In view of the earlier work by the author, Choi [4], Choi and Huang [6], we would like to study the so-called second MH generator, which replaces $\min$ by $\max$ in Definition 2.1. More precisely, we define it as follows.

###### Definition 2.2 (The second MH generator).

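Given a target distribution $\mu$ on a general state space $\mathcal{X}$ and a proposal continuous-time Markov jump process with generator $Q$, define

$$M_2(x,y) := \max\left\{Q(x,y),\ \frac{\mu(y)}{\mu(x)} Q(y,x)\right\}, \quad x \neq y.$$

If

$$\tag{2.1} \sup_{x \in \mathcal{X}} \int_{y;\, y \neq x} M_2(x,y)\, \nu(dy) < \infty,$$

then the second MH Markov process is a $\mu$-reversible Markov jump process with generator given by $M_2 = M_2(Q,\mu)$, where for bounded $f$

$$M_2 f(x) = \int_{y;\, y \neq x} (f(y) - f(x))\, M_2(x,y)\, \nu(dy).$$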

Comparing Definitions 2.1 and 2.2, we see that in the former $M_1$ is always a generator of a Markov jump process, while in the latter additional conditions on $Q$ and $\mu$ are required so as to ensure (2.1). In our main results in Section 3, we will consider the special case when $Q(x,\cdot)$ is a normal distribution centered at the current state with variance $\epsilon$, and $\mu$ is the Gibbs distribution. Under the usual regularity conditions on $U$ as in Gelfand and Mitter [8], we will see that $M_2$ as defined is a valid generator of a Markov jump process, see Proposition 3.1 below.
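To make Definitions 2.1 and 2.2 concrete, the following is a minimal sketch on a finite state space, where the reference measure is taken to be counting measure and generators become rate matrices. The helper names `mh_generators` and `is_reversible`, and the random test inputs, are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def mh_generators(Q, mu):
    """Return (M1, M2) as rate matrices on a finite state space.

    Q  : (n, n) array of proposal jump rates (the diagonal is ignored).
    mu : (n,) array of strictly positive target probabilities.
    """
    Q = np.array(Q, dtype=float)          # copy, so the caller's Q is untouched
    mu = np.asarray(mu, dtype=float)
    np.fill_diagonal(Q, 0.0)
    # flipped[x, y] = mu(y) / mu(x) * Q(y, x)
    flipped = (mu[None, :] / mu[:, None]) * Q.T
    M1 = np.minimum(Q, flipped)           # Definition 2.1
    M2 = np.maximum(Q, flipped)           # Definition 2.2
    for M in (M1, M2):
        # Make the generator conservative: rows sum to zero.
        np.fill_diagonal(M, 0.0)
        np.fill_diagonal(M, -M.sum(axis=1))
    return M1, M2

def is_reversible(M, mu, tol=1e-12):
    """Check detailed balance mu(x) M(x, y) = mu(y) M(y, x)."""
    flow = mu[:, None] * M
    return np.allclose(flow, flow.T, atol=tol)

# Hypothetical inputs for illustration only.
rng = np.random.default_rng(0)
Q = rng.random((4, 4))
mu = rng.random(4) + 0.1
mu /= mu.sum()
M1, M2 = mh_generators(Q, mu)
print(is_reversible(M1, mu), is_reversible(M2, mu))
```

On a finite state space the integrability condition (2.1) holds automatically, which is why no extra check appears in this sketch.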

### 2.2. Geometric interpretation of M1 and M2

In order to motivate the definition of $M_1$ and $M_2$ as natural transformations of $Q$ and $\mu$, in this section we offer a geometric interpretation for both $M_1$ and $M_2$, extending the results by Billera and Diaconis [3], Choi and Huang [6] to a continuous-time and general state space setting. In Theorem 2.1 below, we prove that both $M_1$ and $M_2$, as well as their convex combinations, minimize a certain distance between $Q$ and the set of $\mu$-reversible generators of Markov jump processes on $\mathcal{X}$. In this sense they are natural transformations that map a given generator of a Markov jump process to the set of $\mu$-reversible generators of Markov jump processes.

We first introduce some notation and define a metric to quantify the distance between two generators of Markov jump processes. We write $\mathcal{R}(\mu)$ for the set of conservative $\mu$-reversible generators of Markov jump processes, and we consider the set of all generators of Markov jump processes on $\mathcal{X}$. For any two such generators $Q_1, Q_2$, similar to [3, Section ] we define a metric $d_\mu$ on this set to be

$$d_\mu(Q_1,Q_2) := \int_{\mathcal{X} \times \mathcal{X} \setminus \Delta} \mu(x)\, |Q_1(x,y) - Q_2(x,y)|\, \nu(dx)\, \nu(dy),$$

where $\Delta$ is the diagonal of $\mathcal{X} \times \mathcal{X}$. The distance between $Q$ and $\mathcal{R}(\mu)$ is then defined to be

$$\tag{2.2} d_\mu(Q,\mathcal{R}(\mu)) := \inf_{R \in \mathcal{R}(\mu)} d_\mu(Q,R).$$

With the above notations in mind, we are now ready to state our result in this section:

###### Theorem 2.1.

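Suppose that $Q$ and $\mu$ are such that (2.1) is satisfied and $Q$ is a generator of a Markov jump process. The convex combinations $\alpha M_1 + (1-\alpha) M_2$ for $\alpha \in [0,1]$ minimize the distance $d_\mu$ between $Q$ and $\mathcal{R}(\mu)$. That is,

$$d_\mu(Q,\mathcal{R}(\mu)) = d_\mu(Q, \alpha M_1 + (1-\alpha) M_2).$$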

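Proof. The proof is inspired by the proofs of the corresponding theorems in Billera and Diaconis [3] and Choi and Huang [6]. We first define two helpful half spaces:

$$H^{<} = H^{<}(Q,\mu) := \{(x,y);\ \mu(x)Q(x,y) < \mu(y)Q(y,x)\}, \qquad H^{>} = H^{>}(Q,\mu) := \{(x,y);\ \mu(x)Q(x,y) > \mu(y)Q(y,x)\}.$$

We now show that for $R \in \mathcal{R}(\mu)$, $d_\mu(Q,R) \geqslant d_\mu(Q,M_2)$. First, we note that

$$d_\mu(Q,R) \geqslant \int_{(x,y) \in H^{<}} \Big[\mu(x)\,|Q(x,y) - R(x,y)| + \mu(y)\,|Q(y,x) - R(y,x)|\Big]\, \nu(dx)\, \nu(dy).$$

As $R$ is $\mu$-reversible, setting $R(x,y) = Q(x,y) + \epsilon_{xy}$ gives $R(y,x) = \frac{\mu(x)}{\mu(y)}\big(Q(x,y) + \epsilon_{xy}\big)$. Plugging these expressions back yields

$$\begin{aligned}
d_\mu(Q,R) &\geqslant \int_{(x,y) \in H^{<}} \left[\mu(x)\,|\epsilon_{xy}| + \mu(y)\left|Q(y,x) - \frac{\mu(x)}{\mu(y)}\big(Q(x,y) + \epsilon_{xy}\big)\right|\right] \nu(dx)\, \nu(dy) \\
&= \int_{(x,y) \in H^{<}} \Big[\mu(x)\,|\epsilon_{xy}| + \big|\mu(y)Q(y,x) - \mu(x)Q(x,y) - \mu(x)\epsilon_{xy}\big|\Big]\, \nu(dx)\, \nu(dy) \\
&\geqslant \int_{(x,y) \in H^{<}} |\mu(y)Q(y,x) - \mu(x)Q(x,y)|\, \nu(dx)\, \nu(dy) = d_\mu(Q,M_2),
\end{aligned}$$

where we use the reverse triangle inequality in the second inequality, and the final equality holds because $M_2(x,y) = \frac{\mu(y)}{\mu(x)}Q(y,x)$ on $H^{<}$ while $M_2(x,y) = Q(x,y)$ elsewhere. Similarly, we can show $d_\mu(Q,R) \geqslant d_\mu(Q,M_1)$ by substituting $H^{>}$ for $H^{<}$. To see that $d_\mu(Q,M_1) = d_\mu(Q,M_2)$, we have

$$\begin{aligned}
d_\mu(Q,M_2) &= \int_{(x,y) \in H^{<}} |\mu(y)Q(y,x) - \mu(x)Q(x,y)|\, \nu(dx)\, \nu(dy) \\
&= \int_{(y,x) \in H^{>}} |\mu(y)Q(y,x) - \mu(x)Q(x,y)|\, \nu(dx)\, \nu(dy) = d_\mu(Q,M_1).
\end{aligned}$$

As for convex combinations of $M_1$ and $M_2$, we see that

$$\begin{aligned}
d_\mu(Q, \alpha M_1 + (1-\alpha) M_2) &= (1-\alpha) \int_{(x,y) \in H^{<}} |\mu(y)Q(y,x) - \mu(x)Q(x,y)|\, \nu(dx)\, \nu(dy) \\
&\quad + \alpha \int_{(x,y) \in H^{>}} |\mu(y)Q(y,x) - \mu(x)Q(x,y)|\, \nu(dx)\, \nu(dy) \\
&= (1-\alpha)\, d_\mu(Q,M_2) + \alpha\, d_\mu(Q,M_1) = d_\mu(Q,M_1).
\end{aligned}$$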
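The statement of Theorem 2.1 can be sanity-checked numerically on a finite state space. The sketch below is an illustration under assumptions, not part of the paper: counting measure replaces $\nu$, the proposal and target are random, and competitors in $\mathcal{R}(\mu)$ are sampled as $R(x,y) = S(x,y)/\mu(x)$ with $S$ symmetric and nonnegative, which is one convenient way to generate $\mu$-reversible rates.

```python
import numpy as np

def d_mu(Q1, Q2, mu):
    """Finite-state version of the metric d_mu (off-diagonal entries only)."""
    diff = mu[:, None] * np.abs(Q1 - Q2)
    np.fill_diagonal(diff, 0.0)
    return diff.sum()

rng = np.random.default_rng(1)
n = 5
mu = rng.random(n) + 0.1
mu /= mu.sum()
Q = rng.random((n, n))                       # hypothetical proposal rates

flipped = (mu[None, :] / mu[:, None]) * Q.T  # mu(y)/mu(x) * Q(y, x)
M1 = np.minimum(Q, flipped)
M2 = np.maximum(Q, flipped)

d1 = d_mu(Q, M1, mu)
d2 = d_mu(Q, M2, mu)
d_mix = [d_mu(Q, a * M1 + (1 - a) * M2, mu) for a in (0.25, 0.5, 0.75)]

def random_reversible(rng):
    """Sample a mu-reversible rate matrix: mu(x) R(x, y) is symmetric."""
    A = rng.random((n, n))
    S = A + A.T
    return S / mu[:, None]

d_random = [d_mu(Q, random_reversible(rng), mu) for _ in range(200)]
print(d1, d2, d_mix, min(d_random))
```

The distances for $M_1$, $M_2$ and the mixtures should coincide, and no sampled competitor should fall below them; random sampling of course only probes the infimum in (2.2) and does not prove it.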

## 3. Main results: universality of Langevin diffusion as scaling limit of random walk M1 and M2

In this section, we specialize to the case of $\mathcal{X} = \mathbb{R}^{d^*}$ with $d^* \in \mathbb{N}$, and we take the reference measure $\nu$ to be the Lebesgue measure. Let $U : \mathbb{R}^{d^*} \to \mathbb{R}$ be a function satisfying the following regularity assumption:

###### Assumption 3.1.

$U$ is continuously differentiable, and its gradient $\nabla U$ is bounded and Lipschitz continuous.

Note that the same assumption on $U$ is imposed in Gelfand and Mitter [8] to obtain their weak convergence result, which we briefly recall later in this section. The target distribution $\mu$ is the Gibbs distribution at temperature $T > 0$ with density given by

$$\tag{3.1} \mu(x) = \frac{e^{-U(x)/T}}{\int e^{-U(x)/T}\, dx}, \quad x \in \mathbb{R}^{d^*}.$$

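Writing $\phi_\epsilon$ for the density of the one-dimensional normal distribution with mean $0$ and variance $\epsilon$, for the proposal Markov jump process we take $Q = Q^\epsilon$ to be

$$\tag{3.2} Q^\epsilon(x,y) = \frac{1}{d^*} \sum_{i=1}^{d^*} \phi_\epsilon(y_i - x_i) \prod_{j \neq i} \delta(y_j - x_j),$$

where $\delta$ is the Dirac delta function. In words, we pick one of the $d^*$ coordinates uniformly at random, say $i$, and propose a new state at $y_i$ according to a normal distribution centered at $x_i$ with variance $\epsilon$ while keeping the other coordinates unchanged. Note that $Q^\epsilon(x,y) = Q^\epsilon(y,x)$. If we write $a^+ = \max\{a, 0\}$, we define $s_{M_1}$ and $s_{M_2}$ to be respectively

$$\begin{aligned}
s_{M_1}(x,y) &:= e^{-(U(y)-U(x))^+/T}, & s_{M_1}(i,x,y_i) &:= s_{M_1}\big((x_1,\ldots,x_{d^*}),(x_1,\ldots,x_{i-1},y_i,x_{i+1},\ldots,x_{d^*})\big), \\
s_{M_2}(x,y) &:= e^{(U(x)-U(y))^+/T}, & s_{M_2}(i,x,y_i) &:= s_{M_2}\big((x_1,\ldots,x_{d^*}),(x_1,\ldots,x_{i-1},y_i,x_{i+1},\ldots,x_{d^*})\big).
\end{aligned}$$

With the above notation, we can define $M_1^\epsilon$ and $M_2^\epsilon$ in this setting: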

###### Proposition 3.1 ($M_1$ and $M_2$ under Gibbs $\mu$ and Gaussian proposal $Q^\epsilon$).

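Suppose that $U$ satisfies Assumption 3.1, $\mu$ is the Gibbs distribution (3.1) and $Q^\epsilon$ is the Gaussian proposal (3.2). Then both $M_1^\epsilon = M_1(Q^\epsilon,\mu)$ and $M_2^\epsilon = M_2(Q^\epsilon,\mu)$ are generators of Markov jump processes. Furthermore, for $x \neq y$,

$$\begin{aligned}
M_1^\epsilon(x,y) &= \frac{1}{d^*} \sum_{i=1}^{d^*} s_{M_1}(i,x,y_i)\, \phi_\epsilon(y_i - x_i) \prod_{j \neq i} \delta(y_j - x_j), \\
M_2^\epsilon(x,y) &= \frac{1}{d^*} \sum_{i=1}^{d^*} s_{M_2}(i,x,y_i)\, \phi_\epsilon(y_i - x_i) \prod_{j \neq i} \delta(y_j - x_j).
\end{aligned}$$

We write $X^{M_1^\epsilon}$ and $X^{M_2^\epsilon}$ for the Markov jump processes with generators $M_1^\epsilon$ and $M_2^\epsilon$ respectively.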

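Proof. We first prove the two formulae for $M_1^\epsilon$ and $M_2^\epsilon$. As $Q^\epsilon(x,y) = Q^\epsilon(y,x)$, we have

$$\begin{aligned}
M_1^\epsilon(x,y) &= Q^\epsilon(x,y) \min\left\{\frac{\mu(y)}{\mu(x)}, 1\right\} = Q^\epsilon(x,y)\, s_{M_1}(x,y) = \frac{1}{d^*} \sum_{i=1}^{d^*} s_{M_1}(i,x,y_i)\, \phi_\epsilon(y_i - x_i) \prod_{j \neq i} \delta(y_j - x_j), \\
M_2^\epsilon(x,y) &= Q^\epsilon(x,y) \max\left\{\frac{\mu(y)}{\mu(x)}, 1\right\} = Q^\epsilon(x,y)\, s_{M_2}(x,y) = \frac{1}{d^*} \sum_{i=1}^{d^*} s_{M_2}(i,x,y_i)\, \phi_\epsilon(y_i - x_i) \prod_{j \neq i} \delta(y_j - x_j).
\end{aligned}$$

Next, we show that $M_2^\epsilon$ is a valid generator of a Markov jump process. Let $M$ be a constant such that

$$\sup_{x \in \mathbb{R}^{d^*}} |\partial_{x_i} U(x)| \leqslant M$$

for all $i = 1, \ldots, d^*$. By the mean value theorem applied to $U$, we have

$$s_{M_2}(i,x,y_i) \leqslant e^{M|y_i - x_i|/T}.$$

Writing $Z$ for a normal random variable with mean $0$ and variance $\epsilon$, this leads to

$$\int_{y;\, x \neq y} M_2^\epsilon(x,y)\, dy \leqslant E\big(e^{M|Z|/T}\big) < \infty,$$

where $E\big(e^{M|Z|/T}\big)$ is the moment generating function of the half-normal distribution $|Z|$ evaluated at $M/T$, which is independent of $x$.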
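The finiteness of the half-normal moment generating function can be made explicit. For $Z \sim N(0,\epsilon)$, completing the square gives $E(e^{t|Z|}) = 2e^{\epsilon t^2/2}\,\Phi(\sqrt{\epsilon}\,t)$, where $\Phi$ is the standard normal distribution function. The sketch below compares this closed form with a plain Riemann-sum quadrature; the function names and the chosen values of $t$ and $\epsilon$ are illustrative assumptions.

```python
import math
import numpy as np

def half_normal_mgf(t, eps):
    """Closed form of E[exp(t|Z|)] for Z ~ N(0, eps)."""
    Phi = 0.5 * (1.0 + math.erf(math.sqrt(eps) * t / math.sqrt(2.0)))
    return 2.0 * math.exp(eps * t * t / 2.0) * Phi

def half_normal_mgf_quadrature(t, eps, n=400001):
    """Riemann-sum approximation of the same expectation."""
    sigma = math.sqrt(eps)
    z = np.linspace(-15.0 * sigma, 15.0 * sigma, n)
    phi = np.exp(-z**2 / (2.0 * eps)) / math.sqrt(2.0 * math.pi * eps)
    dz = z[1] - z[0]
    return float(np.sum(np.exp(t * np.abs(z)) * phi) * dz)

# Example: t plays the role of M / T in the bound above.
print(half_normal_mgf(2.0, 0.01), half_normal_mgf_quadrature(2.0, 0.01))
```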

In the main result of this paper, we are primarily interested in the scaling limit of $X^{M_1^\epsilon}$ and $X^{M_2^\epsilon}$ upon scaling in time as $\epsilon \to 0$. The scaling in space is embedded in the proposal $Q^\epsilon$. Let $D([0,\infty), \mathbb{R}^{d^*})$ be the space of $\mathbb{R}^{d^*}$-valued functions on $[0,\infty)$ that are right continuous with left limits, equipped with the Skorohod topology. We denote the weak convergence of processes in the Skorohod topology by $\Rightarrow$.

###### Theorem 3.1 (Universality of the Langevin diffusion as scaling limit of $M_1^\epsilon$ and $M_2^\epsilon$).

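Suppose that $U$ satisfies Assumption 3.1, and let $X^{M_1^\epsilon}$ and $X^{M_2^\epsilon}$ be the Markov jump processes with generators $M_1^\epsilon$ and $M_2^\epsilon$ respectively, both with initial distribution independent of $\epsilon$. Let $X$ be the following time-rescaled Langevin diffusion with the same initial distribution and stochastic differential equation given by

$$\tag{3.3} dX(t) = -\frac{\nabla U(X(t))}{2Td^*}\, dt + \frac{1}{\sqrt{d^*}}\, dW(t),$$

where $W$ is the standard $d^*$-dimensional Brownian motion. Then we have

$$X^{M_1^\epsilon}(\cdot/\epsilon) \Rightarrow X(\cdot), \qquad X^{M_2^\epsilon}(\cdot/\epsilon) \Rightarrow X(\cdot)$$

weakly in $D([0,\infty), \mathbb{R}^{d^*})$ as $\epsilon \to 0$.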

###### Remark 3.1.

As noted in the abstract and in Section 1, the weak convergence of $X^{M_1^\epsilon}$ to the Langevin diffusion was first proved by Gelfand and Mitter [8]. We shall only prove the case of $X^{M_2^\epsilon}$ in Section 4.1.

###### Remark 3.2.

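Denote by $\tilde{X}$ the Langevin diffusion with initial condition $\tilde{X}(0) = X(0)$ and stochastic differential equation

$$d\tilde{X}(t) = -\nabla U(\tilde{X}(t))\, dt + \sqrt{2T}\, dW(t).$$

Denote the clock process by $t \mapsto t/(2Td^*)$; then $\big(\tilde{X}(t/(2Td^*))\big)_{t \geqslant 0}$ has the same law as $X$ satisfying (3.3). In other words, (3.3) is the Langevin diffusion running at time scale $1/(2Td^*)$.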
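The time change in Remark 3.2 can be verified with a short computation. The following is a sketch under the assumption that the clock is the deterministic map $t \mapsto ct$ with $c = 1/(2Td^*)$, and it uses the Brownian scaling property; the symbols $c$ and $B$ are introduced here for illustration only.

```latex
\begin{aligned}
\tilde{X}(ct)
  &= \tilde{X}(0) - \int_0^{ct} \nabla U(\tilde{X}(s))\, ds + \sqrt{2T}\, W(ct) \\
  &= \tilde{X}(0) - c \int_0^{t} \nabla U(\tilde{X}(cu))\, du + \sqrt{2Tc}\, B(t),
  \qquad B(t) := W(ct)/\sqrt{c}, \\
  &= \tilde{X}(0) - \frac{1}{2Td^*} \int_0^{t} \nabla U(\tilde{X}(cu))\, du
     + \frac{1}{\sqrt{d^*}}\, B(t).
\end{aligned}
```

Since $B$ is again a standard Brownian motion, $t \mapsto \tilde{X}(ct)$ solves (3.3) driven by $B$, which gives the claimed equality in law.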

It is perhaps surprising that both $X^{M_1^\epsilon}$ and $X^{M_2^\epsilon}$ share the same scaling limit, given that they have entirely different dynamics. In fact, any Markov jump process whose generator is a convex combination of $M_1^\epsilon$ and $M_2^\epsilon$ converges to the same Langevin diffusion:

###### Corollary 3.1.

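Suppose that $U$ satisfies Assumption 3.1, and let $Y^\epsilon$ be the Markov jump process with generator $\alpha M_1^\epsilon + (1-\alpha) M_2^\epsilon$ for $\alpha \in [0,1]$, and initial distribution independent of $\epsilon$. Let $X$ be the following time-rescaled Langevin diffusion with the same initial distribution and stochastic differential equation given by

$$dX(t) = -\frac{\nabla U(X(t))}{2Td^*}\, dt + \frac{1}{\sqrt{d^*}}\, dW(t),$$

where $W$ is the standard $d^*$-dimensional Brownian motion. Then we have

$$Y^\epsilon(\cdot/\epsilon) \Rightarrow X(\cdot)$$

weakly in $D([0,\infty), \mathbb{R}^{d^*})$ as $\epsilon \to 0$.

In view of Theorem 2.1 and Corollary 3.1, we see that on one hand the convex combinations $\alpha M_1^\epsilon + (1-\alpha) M_2^\epsilon$ may have different dynamics for different $\alpha$, yet interestingly they all minimize the distance $d_\mu$ and converge weakly to the same Langevin diffusion.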
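The universality claim can be illustrated numerically in dimension $d^* = 1$ by comparing the rescaled jump generator $(1/\epsilon)(\alpha M_1^\epsilon + (1-\alpha)M_2^\epsilon)f(x)$ with the generator of (3.3), namely $-U'(x)f'(x)/(2T) + f''(x)/2$. The sketch below is an illustration under assumptions that are not from the paper: the potential $U(x) = \sqrt{1+x^2}$ (which has a bounded Lipschitz derivative), the test function $f = \sin$ (smooth and bounded, though not compactly supported), and a simple Riemann-sum quadrature.

```python
import numpy as np

def U(x):
    return np.sqrt(1.0 + x**2)       # hypothetical potential

def dU(x):
    return x / np.sqrt(1.0 + x**2)

def scaled_generator(f, x, eps, alpha, T=1.0, n=200001):
    """(1/eps) * (alpha*M1 + (1-alpha)*M2) f(x) by quadrature, for d* = 1."""
    sigma = np.sqrt(eps)
    z = np.linspace(-12.0 * sigma, 12.0 * sigma, n)   # z = y - x
    phi = np.exp(-z**2 / (2.0 * eps)) / np.sqrt(2.0 * np.pi * eps)
    dUz = U(x + z) - U(x)
    s1 = np.exp(-np.maximum(dUz, 0.0) / T)            # s_{M_1}(x, y)
    s2 = np.exp(np.maximum(-dUz, 0.0) / T)            # s_{M_2}(x, y)
    rate = (alpha * s1 + (1.0 - alpha) * s2) * phi
    dz = z[1] - z[0]
    return float(np.sum((f(x + z) - f(x)) * rate) * dz / eps)

def langevin_generator(df, d2f, x, T=1.0):
    """Generator of (3.3) with d* = 1 applied to f, given f' and f''."""
    return float(-dU(x) * df(x) / (2.0 * T) + 0.5 * d2f(x))

x0, eps = 0.5, 1e-4
target = langevin_generator(np.cos, lambda x: -np.sin(x), x0)
for alpha in (0.0, 0.5, 1.0):
    print(alpha, scaled_generator(np.sin, x0, eps, alpha), target)
```

For small $\epsilon$ the three values of $\alpha$ should give nearly the same number, with a discrepancy of order $\epsilon^{1/2}$ from the Langevin generator.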

The rest of the paper is devoted to the proofs of Theorem 3.1 and Corollary 3.1, given in Sections 4.1 and 4.2 respectively.

## 4. Proofs of main results

### 4.1. Proof of Theorem 3.1

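For notational convenience, we replace $U/T$ by $U$, so that $T = 1$ in what follows. In view of Remark 3.1, we only prove the weak convergence of $X^{M_2^\epsilon}$. We let $G$ be the generator of $X$ described by (3.3), where for $f \in C_c^\infty(\mathbb{R}^{d^*})$,

$$Gf(x) = -\frac{1}{2d^*} \sum_{i=1}^{d^*} \partial_{x_i} U(x)\, \partial_{x_i} f(x) + \frac{1}{2d^*} \sum_{i=1}^{d^*} \partial^2_{x_i} f(x).$$

Note that since the drift is Lipschitz continuous, by [7, Chapter Theorem ], the space $C_c^\infty(\mathbb{R}^{d^*})$ of infinitely differentiable functions with compact support forms a core of $G$. Thus to prove the desired weak convergence, by [7, Chapter Theorem ] it suffices to prove the uniform convergence of the generator on $C_c^\infty(\mathbb{R}^{d^*})$, that is, for $f \in C_c^\infty(\mathbb{R}^{d^*})$, as $\epsilon \to 0$ we would like to prove that

$$\tag{4.1} \sup_{x \in \mathbb{R}^{d^*}} \left|(1/\epsilon)\, M_2^\epsilon f(x) - Gf(x)\right| \to 0.$$

Define for $x, y \in \mathbb{R}^{d^*}$,

$$\begin{aligned}
\hat{s}_{M_2}(x,y) &:= e^{\langle \nabla U(x),\, x-y \rangle^+}, \qquad \hat{s}_{M_2}(i,x,y_i) := \hat{s}_{M_2}\big((x_1,\ldots,x_{d^*}),(x_1,\ldots,x_{i-1},y_i,x_{i+1},\ldots,x_{d^*})\big), \\
g(x,y) &:= U(x) - U(y) - \langle \nabla U(x),\, x-y \rangle.
\end{aligned}$$

We now present three lemmas that will aid our proof; their proofs are deferred to Sections 4.1.1, 4.1.2 and 4.1.3 respectively. The first auxiliary lemma bounds the distance between $s_{M_2}$ and $\hat{s}_{M_2}$: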

###### Lemma 4.1.

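There exist positive constants $c_1$ and $M$, determined by the bounds in Assumption 3.1, such that

$$|s_{M_2}(x,y) - \hat{s}_{M_2}(x,y)| \leqslant c_1\, e^{M \sum_{i=1}^{d^*} |y_i - x_i|}\, \|y - x\|^2.$$

Consequently, we have

$$|s_{M_2}(i,x,y_i) - \hat{s}_{M_2}(i,x,y_i)| \leqslant c_1\, e^{M |y_i - x_i|}\, |y_i - x_i|^2.$$

Our next lemma controls the upper bound in Lemma 4.1.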

###### Lemma 4.2.

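Recall that $Z$ follows a normal distribution with mean $0$ and variance $\epsilon$. Then for any fixed $t \geqslant 0$, as $\epsilon \to 0$ we have

$$E\big(e^{t|Z|}\, |Z|^3\big) = O(\epsilon^{3/2}), \qquad E\big(e^{t|Z|}\, |Z|^4\big) = O(\epsilon^2).$$

With Lemmas 4.1 and 4.2, we prove the following estimates on the drift and volatility terms of $M_2^\epsilon$ as $\epsilon \to 0$: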

###### Lemma 4.3.

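For $i = 1, \ldots, d^*$, as $\epsilon \to 0$,

$$\tag{4.2} (1/\epsilon) \int (y_i - x_i)\, M_2^\epsilon(x,y)\, dy = -\partial_{x_i} U(x)/(2d^*) + O(\epsilon^{1/2}),$$

$$\tag{4.3} (1/\epsilon) \int (y_i - x_i)^2\, M_2^\epsilon(x,y)\, dy = 1/d^* + O(\epsilon^{1/2}),$$

$$\tag{4.4} (1/\epsilon) \int (y_i - x_i)^3\, M_2^\epsilon(x,y)\, dy = O(\epsilon^{1/2}),$$

where the convergence is uniform in $x$ in all three cases.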
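The three estimates of Lemma 4.3 can be checked numerically in dimension $d^* = 1$ with $T = 1$, in line with the convention of this section. The sketch below is an illustration under assumptions that are not from the paper: the potential $U(x) = \sqrt{1+x^2}$, the evaluation point $x = 0.5$, and a Riemann-sum quadrature over $z = y - x$.

```python
import numpy as np

def U(x):
    return np.sqrt(1.0 + x**2)       # hypothetical potential with bounded U'

def dU(x):
    return x / np.sqrt(1.0 + x**2)

def scaled_moment(k, x, eps, n=200001):
    """(1/eps) * integral of (y - x)^k * M2^eps(x, y) dy, for d* = 1, T = 1."""
    sigma = np.sqrt(eps)
    z = np.linspace(-12.0 * sigma, 12.0 * sigma, n)   # z = y - x
    phi = np.exp(-z**2 / (2.0 * eps)) / np.sqrt(2.0 * np.pi * eps)
    s2 = np.exp(np.maximum(U(x) - U(x + z), 0.0))     # s_{M_2}(x, y)
    dz = z[1] - z[0]
    return float(np.sum(z**k * s2 * phi) * dz / eps)

x0 = 0.5
for eps in (1e-2, 1e-3, 1e-4):
    drift_err = scaled_moment(1, x0, eps) + dU(x0) / 2.0   # compare with (4.2)
    vol_err = scaled_moment(2, x0, eps) - 1.0              # compare with (4.3)
    third = scaled_moment(3, x0, eps)                      # compare with (4.4)
    print(eps, drift_err, vol_err, third)
```

The printed errors should shrink as $\epsilon$ decreases, consistent with the stated $O(\epsilon^{1/2})$ rates.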

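We proceed to complete the proof of Theorem 3.1. By Taylor expansion of $f$, there exists $z$ on the line segment between $x$ and $y$ such that

$$\begin{aligned}
(1/\epsilon)\, M_2^\epsilon f(x) &= (1/\epsilon) \int (f(y) - f(x))\, M_2^\epsilon(x,y)\, dy \\
&= (1/\epsilon) \int \left(\sum_{i=1}^{d^*} \partial_{x_i} f(x)(y_i - x_i) + \frac{1}{2} \sum_{i,j=1}^{d^*} \partial_{x_i} \partial_{x_j} f(x)(y_i - x_i)(y_j - x_j)\right) M_2^\epsilon(x,y)\, dy \\
&\quad + (1/\epsilon) \int \left(\frac{1}{6} \sum_{i=1}^{d^*} \partial^3_{x_i} f(z)(y_i - x_i)^3\right) M_2^\epsilon(x,y)\, dy \\
&= (1/\epsilon) \int \left(\sum_{i=1}^{d^*} \partial_{x_i} f(x)(y_i - x_i) + \frac{1}{2} \sum_{i=1}^{d^*} \partial^2_{x_i} f(x)(y_i - x_i)^2\right) M_2^\epsilon(x,y)\, dy \\
&\quad + (1/\epsilon) \int \left(\frac{1}{6} \sum_{i=1}^{d^*} \partial^3_{x_i} f(z)(y_i - x_i)^3\right) M_2^\epsilon(x,y)\, dy \\
&= -\frac{1}{2d^*} \sum_{i=1}^{d^*} \partial_{x_i} U(x)\, \partial_{x_i} f(x) + \frac{1}{2d^*} \sum_{i=1}^{d^*} \partial^2_{x_i} f(x) + O(\epsilon^{1/2}) \\
&= Gf(x) + O(\epsilon^{1/2}),
\end{aligned}$$

where the third equality holds because $M_2^\epsilon$ changes only one coordinate at a time, so the cross terms with $i \neq j$ vanish, and the fourth equality follows from Lemma 4.3 and the fact that $f$ has compact support. Note that the convergence is uniform in $x$.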

#### 4.1.1. Proof of Lemma 4.1

First, by the Taylor expansion of $U$ and the fact that $\nabla U$ is Lipschitz continuous by Assumption 3.1, there exists a constant $c_1$ such that

$$|g(x,y)| \leqslant c_1 \|y - x\|^2.$$

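We would like to show that the following inequality holds, by considering the possible signs of $\langle \nabla U(x), x-y \rangle$ and $U(x) - U(y)$:

$$\tag{4.5} 1 - e^{\langle \nabla U(x),\, x-y \rangle^+ - (U(x)-U(y))^+} \leqslant 1 - e^{-|g(x,y)|} \leqslant |g(x,y)| \leqslant c_1 \|y - x\|^2.$$

• Case 1: $\langle \nabla U(x), x-y \rangle \geqslant 0$, $U(x) - U(y) \geqslant 0$.
In this case the exponent on the left equals $-g(x,y)$, and since $-g(x,y) \geqslant -|g(x,y)|$, upon rearranging we obtain the leftmost inequality of (4.5).

• Case 2: $\langle \nabla U(x), x-y \rangle < 0$, $U(x) - U(y) < 0$.
The exponent on the left vanishes, so the leftmost inequality of (4.5) holds trivially.

• Case 3: $\langle \nabla U(x), x-y \rangle < 0$, $U(x) - U(y) \geqslant 0$.
Here $g(x,y) > 0$ and $-(U(x) - U(y)) \geqslant -g(x,y)$, so that

$$1 - e^{\langle \nabla U(x),\, x-y \rangle^+ - (U(x)-U(y))^+} = 1 - e^{-(U(x)-U(y))} \leqslant 1 - e^{-g(x,y)} = 1 - e^{-|g(x,y)|}.$$

• Case 4: $\langle \nabla U(x), x-y \rangle \geqslant 0$, $U(x) - U(y) < 0$.

$$1 - e^{\langle \nabla U(x),\, x-y \rangle^+ - (U(x)-U(y))^+} = 1 - e^{\langle \nabla U(x),\, x-y \rangle} \leqslant 0 \leqslant 1 - e^{-|g(x,y)|}.$$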

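Similarly, we would like to show that the following inequality holds, by considering the possible signs of $\langle \nabla U(x), x-y \rangle$ and $U(x) - U(y)$:

$$\tag{4.6} 1 - e^{-\langle \nabla U(x),\, x-y \rangle^+ + (U(x)-U(y))^+} \leqslant 1 - e^{-|g(x,y)|} \leqslant |g(x,y)| \leqslant c_1 \|y - x\|^2.$$

• Case 1: $\langle \nabla U(x), x-y \rangle \geqslant 0$, $U(x) - U(y) \geqslant 0$.

$$1 - e^{-\langle \nabla U(x),\, x-y \rangle^+ + (U(x)-U(y))^+} = 1 - e^{g(x,y)} \leqslant 1 - e^{-|g(x,y)|}.$$

• Case 2: $\langle \nabla U(x), x-y \rangle < 0$, $U(x) - U(y) < 0$.
The exponent on the left vanishes, so the leftmost inequality of (4.6) holds trivially.

• Case 3: $\langle \nabla U(x), x-y \rangle < 0$, $U(x) - U(y) \geqslant 0$.

$$1 - e^{-\langle \nabla U(x),\, x-y \rangle^+ + (U(x)-U(y))^+} = 1 - e^{U(x)-U(y)} \leqslant 0 \leqslant 1 - e^{-|g(x,y)|}.$$

• Case 4: $\langle \nabla U(x), x-y \rangle \geqslant 0$, $U(x) - U(y) < 0$.
Here $g(x,y) < 0$, so that

$$1 - e^{-\langle \nabla U(x),\, x-y \rangle^+ + (U(x)-U(y))^+} = 1 - e^{-\langle \nabla U(x),\, x-y \rangle} \leqslant 1 - e^{U(x)-U(y)-\langle \nabla U(x),\, x-y \rangle} = 1 - e^{g(x,y)} = 1 - e^{-|g(x,y)|}.$$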

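As a result, collecting both (4.5) and (4.6) we have

$$\begin{aligned}
s_{M_2}(x,y) - \hat{s}_{M_2}(x,y) &= e^{(U(x)-U(y))^+} \left(1 - e^{\langle \nabla U(x),\, x-y \rangle^+ - (U(x)-U(y))^+}\right) \leqslant c_1\, e^{M \sum_{i=1}^{d^*} |y_i - x_i|}\, \|y - x\|^2, \\
\hat{s}_{M_2}(x,y) - s_{M_2}(x,y) &= e^{\langle \nabla U(x),\, x-y \rangle^+} \left(1 - e^{-\langle \nabla U(x),\, x-y \rangle^+ + (U(x)-U(y))^+}\right) \leqslant c_1\, e^{M \sum_{i=1}^{d^*} |y_i - x_i|}\, \|y - x\|^2,
\end{aligned}$$

where the inequalities in the two equations above follow from the mean value theorem and the fact that $\nabla U$ is bounded by Assumption 3.1.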
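The bound of Lemma 4.1 can be probed numerically in one dimension. The sketch below is an illustration under assumptions that are not from the paper: the potential $U(x) = \sqrt{1+x^2}$ with $T = 1$, for which $|U'| \leqslant 1$ and $|U''| \leqslant 1$, so that $M = 1$ and $c_1 = 1/2$ are admissible constants, and randomly sampled pairs $(x, y)$.

```python
import numpy as np

def U(x):
    return np.sqrt(1.0 + x**2)       # hypothetical potential

def dU(x):
    return x / np.sqrt(1.0 + x**2)

M = 1.0     # bound on |U'| for this particular U
c1 = 0.5    # half of the Lipschitz constant of U' (sup |U''| = 1)

rng = np.random.default_rng(2)
x = rng.uniform(-5.0, 5.0, 100000)
y = x + rng.normal(0.0, 1.0, 100000)

s = np.exp(np.maximum(U(x) - U(y), 0.0))            # s_{M_2}(x, y) with T = 1
s_hat = np.exp(np.maximum(dU(x) * (x - y), 0.0))    # linearised version
lhs = np.abs(s - s_hat)
rhs = c1 * np.exp(M * np.abs(y - x)) * (y - x) ** 2

# A small slack absorbs floating-point rounding when y is very close to x.
print(bool(np.all(lhs <= rhs + 1e-12)), float(np.max(lhs / (rhs + 1e-300))))
```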

#### 4.1.2. Proof of Lemma 4.2

Let $h(t) := E\big(e^{t|Z|}\big)$, and write $\Phi$ for the cumulative distribution function of the standard normal distribution. We also denote by $\partial_t^n h$ the $n$-th derivative of $h$. By brute force differentiation and integration we note that

 h(t) =2eϵt22(1−Φ(−√ϵt)), ∂th(t) =E(et|Z||