 # Variance reduction for MCMC methods via martingale representations

In this paper we propose an efficient variance reduction approach for MCMC algorithms relying on a novel discrete time martingale representation for Markov chains. Our approach is fully non-asymptotic and does not require any type of ergodicity or special product structure of the underlying density. By rigorously analyzing the convergence of the proposed algorithm, we show that it's complexity is indeed significantly smaller than one of the original MCMC algorithm. The numerical performance of the new method is illustrated in the case of Gaussian mixtures and binary regression.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Monte Carlo integration typically has an error variance of the form where is a sample size and is the variance of integrand. We can make the variance smaller by using a larger value of . Alternatively, we can reduce instead of increasing the sample size . To this end, one can try to construct a new Monte Carlo experiment with the same expectation as the original one but with a lower variance . Methods to achieve this are known as variance reduction techniques. Variance reduction plays an important role in Monte Carlo and Markov Chain Monte Carlo methods. Introduction to many of the variance reduction techniques can be found in , ,  and . Recently one witnessed a revival of interest in efficient variance reduction methods for MCMC algorithms, see for example , ,  and references therein.

Suppose that we wish to compute , where

is a random vector-valued in

with a density and with

. The idea of the so-called control variates variance reduction method is to find a cheaply computable random variable

with and such that the variance of the r.v. is small. The complexity of the problem of constructing classes of control variates satisfying essentially depends on the degree of our knowledge on For example, if

is analytically known and satisfies some regularity conditions, one can apply the well-known technique of polynomial interpolation to construct control variates enjoying some optimality properties, see, for example, Section 3.2 in

. Alternatively, if an orthonormal system in is analytically available, one can build control variates as a linear combination of the corresponding basis functions, see . Furthermore, if

is known only up to a normalizing constant (which is often the case in Bayesian statistics), one can apply the recent approach of control variates depending only on the gradient

using Schrdinger-type Hamiltonian operator in  and so-called Stein operator in . In some situations is not known analytically, but can be represented as a function of simple random variables with known distribution. Such situation arises, for example, in the case of functionals of discretized diffusion processes. In this case a Wiener chaos-type decomposition can be used to construct control variates with nice theoretical properties, see . Note that in order to compare different variance reduction approaches, one has to analyze their complexity, that is, the number of numerical operations required to achieve a prescribed magnitude of the resulting variance.

The situation becomes much more difficult in the case of MCMC algorithms, where one has to work with a Markov chain whose marginal distribution converges to as time grows. One important class of the variance reduction methods in this case is based on the so-called Poisson equation for the corresponding Markov chain. It was observed in Henderson  that if a time-homogeneous Markov chain is stationary with stationary distribution then for any real-valued function defined on the state space of the Markov chain the function has zero mean with respect to . The best choice for the function corresponds to a solution of the so-called Poisson equation

. Moreover, it is also related to the minimal asymptotic variance in the corresponding central limit theorem, see

 and . Although the Poisson equation involves the quantity of interest and can not be solved explicitly in most cases, this idea still can be used to construct some approximations for the optimal zero-variance control variates. For example, Henderson  proposed to compute approximations for the solution of the Poisson equation for specific Markov chains with particular emphasis on models arising in stochastic network theory. In  and  series-type control variates are introduced and studied for reversible Markov chains. It is assumed in  that the one-step conditional expectations can be computed explicitly for a set of basis functions. The authors in  proposed another approach tailored to diffusion setting which doesn’t require the computation of integrals of basis functions and only involves applications of the underlying generator.

In this paper we focus on the Langevin type algorithms which got much attention recently, see [7, 12, 17, 22, 21] and references therein. We propose a generic variance reduction method for these and other types algorithms, which is purely non-asymptotic and does not require that the conditional expectations of the corresponding Markov chain can be computed or that the generator is known analytically. Moreover, we do not need to assume stationarity or/and sampling under the invariant distribution We rigorously analyse the convergence of the method and study its complexity. It is shown that our variance-reduced Langevin algorithm outperforms the standard Langevin algorithms in terms of complexity.

The paper is organized as follows. In Section 2 we set up the problem and introduce some notations. Section 3 contains a novel martingale representation and shows how this representation can be used for variance reduction. In Section 4 we analyze the performance of the proposed variance reduction algorithm in the case of Unadjusted Langevin Algorithm (ULA). Section 6 studies the complexity of the variance reduced ULA. Finally, numerical examples are presented in Section 7.

## 2 Setup

Let be a domain in Our aim is to numerically compute expectations of the form

 π(f)=∫Xf(x)π(dx),

where and

is a probability measure supported on

If the dimension of the space is large and can not be computed analytically, one can apply Monte Carlo methods. However, in many practical situations direct sampling from is impossible and this precludes the use of plain Monte Carlo methods in this case. One popular alternative to Monte Carlo is Markov Chain Monte Carlo (MCMC), where one is looking for a discrete time (possibly non-homogeneous) Markov chain such that is its unique invariant measure. In this paper we study a class of MCMC algorithms with satisfying the the following recurrence relation:

 Xp=Φp(Xp−1,ξp),p=1,2,…,X0=x0, (1)

for some i.i.d. random vectors with distribution and some Borel-measurable functions In fact, this is quite general class of Markov chains (see Theorem 1.3.6 in  ) and many well-known MCMC algorithms can be represented in form (1). Let us consider two popular examples.

###### Example 1 (Unadjusted Langevin Algorithm)

Fix a sequence of positive time steps Given a Borel function , consider a non-homogeneous discrete-time Markov chain defined by

 Xp+1=Xp−γp+1μ(Xp)+√γp+1Zp+1, (2)

where is an i.i.d. sequence of -dimensional standard Gaussian random vectors. If for some continuously differentiable function then Markov chain (2) can be used to approximately sample from the density

 π(x)=conste−U(x),const=1/∫Rde−U(x)dx, (3)

provided that is finite. This method is usually referred to as Unadjusted Langevin Algorithm (ULA). In fact the Markov chain (2) arises as the Euler-Maruyama discretization of the Langevin diffusion

 dYt=−μ(Yt)dt+dWt

with nonnegative time steps , and, under mild technical conditions, the latter Langevin diffusion admits of (3) as its unique invariant distribution; see  and .

###### Example 2 (Metropolis-Adjusted Langevin Algorithm)

The Metropolis-Hastings algorithm associated with a target density requires the choice of a sequence of conditional densities also called proposal or candidate kernels. The transition from the value of the Markov chain at time and its value at time proceeds via the following transition step:

Then, as shown in Metropolis et al. , this transition is reversible with respect to and therefore preserves the stationary density . If have a wide enough support to eventually reach any region of the state space with positive mass under , then this transition is irreducible and is a maximal irreducibility measure . The Metropolis-Adjusted Langevin algorithm (MALA) takes (2) as proposal, that is,

 qp(y|x)=(γp+1)−d/2φ([y−x+γp+1μ(x)]/√γp+1),

where , , denotes the density of a -dimensional standard Gaussian random vector. The MALA algorithms usually provide noticeable speed-ups in convergence for most problems. It is not difficult to see that the MALA chain can be compactly represented in the form

 Xp+1 =Xp+1(Up+1≤α(Xp,Yp))(Yp−Xp), Yp =Xp−γp+1μ(Xp)+√γp+1Zp+1,

where

is an i.i.d. sequence of uniformly distributed on

random variables independent of Thus we recover (1) with and

 Φp(x,(u,z)⊤)=x+1(u≤α(x,x−γpμ(x)+√γpz))(−γpμ(x)+√γpz).
###### Example 3

Let be the unique strong solution to a SDE of the form:

 dXt=b(Xt)dt+σ(Xt)dWt,t≥0, (4)

where is a standard -valued Brownian motion, and are locally Lipschitz continuous functions with at most linear growth. The process is a Markov process and let denote its infinitesimal generator defined by

 Lg=b⊤∇g+12Tr(σ⊤D2gσ)

for any If there exists a continuously twice differentiable Lyapunov function such that

 supx∈RdLV(x)<∞,limsup|x|→∞LV(x)<0,

then there is an invariant probability measure for that is, for all if Invariant measures are crucial in the study of the long term behaviour of stochastic differential systems (4). Under some additional assumptions, the invariant measure is ergodic and this property can be exploited to compute the integrals for by means of ergodic averages. The idea is to replace the diffusion by a (simulable) discretization scheme of the form (see e.g. )

 ¯Xn+1=¯Xn+γn+1b(¯Xn)+σ(¯Xn)(WΓn+1−WΓn),n≥0,¯X0=X0,

where and is a non-increasing sequence of time steps. Then for a function we can approximate via

 πγn(f)=1Γnn∑i=1γif(¯Xi−1).

Due to typically high correlation between variance reduction is of crucial importance here. As a matter of fact, in many cases there is no explicit formula for the invariant measure and this makes the use of gradient based variance reduction techniques (see e.g. 

) impossible in this case. On the contrary, our method can be directly used to reduce variance of the ergodic estimator

without explicit knowledge of

## 3 Martingale representation and variance reduction

In this section we give a general discrete-time martingale representation for the Markov chain (1) which is used below to construct an efficient variance reduction algorithm. Let be a complete orthonormal system in with . In particular, we have

 E[ϕi(ξ)ϕj(ξ)]=δij,i,j∈Z+

with Notice that this implies that the random variables ,

, are centered. As an example, we can take multivariate Hermite polynomials for the ULA algorithm and a tensor product of Shifted Legendre polynomials for ”uniform part” and Hermite polynomials for ”gaussian part” of the random variable

in MALA, as the Shifted Legendre polynomials are orthogonal with respect to the Lebesgue measure on

In the sequel, we denote by the filtration generated by generated by with the convention .

###### Theorem 1

Let be a Borel function such that , with a Markov chain defined in (1). Then, for , the following representation holds in

 f(Xp)=E[f(Xp)|Gj]+∞∑k=1p∑l=j+1ap,l,k(Xl−1)ϕk(ξl), (5)

where

 ap,l,k(x)=E[f(Xp)ϕk(ξl)|Xl−1=x],p≥l,k∈N. (6)

Proof The expansion obviously holds for and . Indeed, due to the orthonormality and completeness of the system , we have

 f(X1)=E[f(X1)]+∑k≥1a1,1,k(X0)ϕk(ξ1)

with

 a1,1,k(x0)=E[f(X1)ϕk(ξ1)|X0=x0],

provided Recall that and . Suppose that (5) holds for , all , and all Borel-measurable functions with . Let us prove it for . Given with , due to the orthonormality and completeness of the system , we get by conditioning on ,

 f(Xp)=E[f(Xp)|Gq]+∑k≥1αp,q+1,kϕk(ξq+1),

where

 αp,q+1,k=E[f(Xp)ϕk(ξq+1)|Gq].

By the Markov property of , we have . Furthermore, a calculation involving intermediate conditioning on and the recurrence relation verifies that

 αp,q+1,k=E[f(Xp)ϕk(ξq+1)|Xq]=ap,q+1,k(Xq)

for suitably chosen Borel-measurable functions . We thus arrive at

 f(Xp)=E[f(Xp)|Xq]+∑k≥1ap,q+1,k(Xq)ϕk(ξq+1), (7)

which is the required statement in the case . Now assume . The random variable is square integrable and has the form , hence the induction hypothesis applies, and we get

 E[f(Xp)|Xq]=E[f(Xp)|Xj]+∑k≥1q∑l=j+1ap,l,k(Xl−1)ϕk(ξl) (8)

with

 ap,l,k(Xl−1) =E[f(Xp)ϕk(ξl)|Xl−1].

Formulas (7) and (8) conclude the proof.

From numerical point of view another representation of the coefficients turns out to be more useful.

###### Proposition 2

The coefficients in (6) can be alternatively represented as

 ap,l,k(x)=E[ϕk(ξ)Qp,l(Φl(x,ξ))]

with The functions can be computed by the backward recurrence: and for

 Qp,l(x)=E[Qp,l+1(Xl+1)|Xl=x]. (9)

Next we show how the representation (5) can be used to efficiently reduce the variance of MCMC algorithms. Let be a sequence of positive and non-increasing step sizes with and, for , , we set

 Γn,l=l∑p=nγp.

Consider a weighted average estimator of the form

 πNn(f)=N+n∑p=N+1ωNp,nf(Xp),ωNp,n=γp+1Γ−1N+2,N+n+1, (10)

where is the length of the burn-in period and the number of effective samples. Given and as above, for , denote

 MNK,n(f)=K∑k=1N+n∑l=N+1¯al,k(Xl−1)ϕk(ξl), (11)

where

 ¯al,k(x)=N+n∑p=lωNp,nap,l,k(x)=E[(N+n∑p=lωNp,nf(Xp))ϕk(ξl)∣∣ ∣∣Xl−1=x]. (12)

Since is independent of and the r.v. has zero mean and can be viewed as a control variate.

## 4 Analysis of variance reduced ULA

The representation (6) suggests that the variance of the variance-reduced estimator

 πNK,n(f)=πNn(f)−MNK,n(f) (13)

should be small for large enough. In this section we provide an analysis of the variance-reduced ULA algorithm (see Example 1). We shall use the notations and . By , , we denote the normalized Hermite polynomial on , that is,

 Hk(x)=(−1)k√k!ex2/2∂k∂xke−x2/2,x∈R.

For a multi-index , denotes the normalized Hermite polynomial on , that is,

 Hk(x)=d∏i=1Hki(xi),x=(xi)∈Rd.

In what follows, we also use the notation for , and we set , , and . Given and as above, for , denote

 MNK,n(f) (14) =∑k:0<∥k∥≤KN+n∑l=N+1¯al,k(Xl−1)Hk(Zl)

with and

 ¯al,k(x) =N+n∑p=lωNp,nap,l,k(x) =E[(N+n∑p=lωNp,nf(Xp))Hk(Zl)∣∣ ∣∣Xl−1=x].

For an estimator of (see (10) and (13)), we are interested in its conditional Mean Squared Error (MSE), which can be decomposed as the sum of the squared conditional bias and the conditional variance:

 MSE[ρ(f)|GN] = E[(ρ(f)−π(f))2|GN] = (E[ρ(f)|GN]−π(f))2+Var[ρ(f)|GN].

The quantities in (4) are conditioned on , as it reflects the way the estimators are used for MCMC: the path of the Markov chain is simulated only once, and we start to use the realized values of the Markov chain to construct our estimate only after the burn-in period of length . We also notice that, due to the Markovian structure, the conditioning on in (4) is equivalent to conditioning on only (this is particularly clear in the case but requires some calculation in the remaining case ).

### 4.1 Squared conditional bias

Due to the martingale transform structure of , we have

 E[MNK,n(f)∣∣GN]=0,

Hence both estimators and have the same conditional bias. Notice that this remains true also when we substitute the coefficients in (14) with some independent approximations For a bounded Borel function , we can estimate the conditional bias similarly to the beginning of  [12, Section 4]:

 (E[πNK,n(f)|GN]−π(f))2 =(E[πNn(f)|GN]−π(f))2 (16) ≤osc(f)2N+n∑p=N+1ωNp,n∥QN+1,pγ(XN,⋅)−π(⋅)∥2TV,

where , denotes the total variation distance between probability measures and , that is,

 ∥μ−ν∥TV=supA∈B(Rd)|μ(A)−ν(A)|,

denotes the probability measure on with density of (3); for , the Markov kernel from into is defined by

 Rγ(x,A)=∫A1(4πγ)d/2 exp{−14γ∥y−x+γμ(x)∥2}dy,

while, for , , the kernel is

 Qk,lγ=Rγl⋯Rγk,

which, finally, provides the (random) measure used in the right-hand side of (16).

Different specific upper bounds for the squared bias can be deduced from (16) using results of Section 3 in  on bounds in the total variation distance.

### 4.2 Conditional variance

An upper bound for the variance of the classical estimator (10) is provided in in [12, Theorem 17]. As for the estimator (13), we introduce lemma of conditional variance decomposition. It follows from (14), Theorem 1 and independence condition for the sequence .

###### Lemma 3

For variance-reduced estimator (13), defined for the Markov Chain (1) the following representation holds

 (17)

Now we present an upper bound for the right-hand sides of (17) in case of ULA algorithm.

###### Theorem 4

Fix . Suppose that and are times continuously differentiable and it holds

 |∂kf(x)| ≤ Bf,x∈Rd, |∂kμ(x)| ≤ Bμ,x∈Rd

for all with ,

 Jμ(x)≥bμI,x∈Rd,

for some positive number where stands for Jacobian of Let be a sequence of positive and non-increasing step sizes with . We assume that and that

 ∞∑r=jγrr∏k=j[1−γkbμ]≤C,for all j∈N, (18)

with some constant . Then it holds

 Var[πNK,n(f)∣∣GN]≲1Γ2N+2,N+n+1N+n∑l=N+1∑I⊆{1,…,d},I≠∅(γl2)|I|K, (19)

where the sum in (19) is taken over all subsets of and stands for inequality up to a constant not depending on and

###### Remark 1

Note that in Theorem (4) we don’t require that the function itself is bounded. Assumption (18) is not restrictive. For instance, a straightforward calculation shows that (18) is satisfied in most interesting case with .

## 5 Nonparametric regression

The functions need to be estimated before one can apply the proposed variance reduction approach. This amounts to solve a nonparametric regression problem. We present a generic regression algorithm. Algorithm starts with estimating the functions for where

 ¯Ql(x)=N+n∑p=lωNp,nQp,l(x)=ωNl,nf(x)+E[N+n∑p=l+1ωNp,nf(Xp)∣∣ ∣∣Xl=x].

We first generate paths conditionally independent of the -algebra generated by the burn-in sequence :

 T={(X(s)N+1,…,X(s)N+n),s=1,…,T}

of the chain (the so-called “training paths”). The regression algorithm proceeds with estimating the functions by solving the least squares optimization problems

 ˆQl=\operatornamewithlimitsargminψ∈ΨT∑s=1∣∣ ∣∣N+n∑p=lωNp,nf(X(s)p)−ψ(X(s)l)∣∣ ∣∣2 (20)

for where is a class of functions on Next we estimate the coefficients using the formula

 ˆal,k(x)=E[ϕk(ξ)ˆQl(Φl(x,ξ))∣∣T], (21)

where is a random variable independent from with law is defined in (1) and the expectation is taken according to known distribution . Conditioning on underlines that estimation of derived from training trajectories. In many cases integration can be done in closed analytical form.

For example, in the case of ULA algorithm we have

is the multivariate normal distribution

and Therefore if we use polynomials to approximate then can be even computed in closed form; implementations details are provided in Section 7.

Upon estimating the coefficients , one can construct an empirical estimate of in the form

 ˆMNK,n(f)=∑1≤k≤KN+n∑l=N+1ˆal,k(Xl−1)ϕk(ξl).

Obviously and is indeed a valid control variate in that it does not introduce any bias.

## 6 Complexity analysis for ULA

The following result quantifies the error of estimating the functions via the algorithm and its proof follows from Theorem 2.2 in .

###### Theorem 5

Suppose that for any

 E⎡⎣(N+n∑p=lωNp,nf(Xp)−¯Ql(Xl))4∣∣ ∣∣Xl−1⎤⎦≤σ4l,

with probability for some finite positive numbers Furthermore, assume that where the functions are uniformly bounded and satisfy

 maxlsupg∈Ψ∖{0}∥g∥2∞/E[g2(Xl)|GN]≤B<∞.

Then for any values of and such that and

 T≳B2[BD+log(2/ε)+B2D2T]

it holds with probability at least

 E[∣∣¯Ql(Xl)−ˆQl(Xl)∣∣2∣∣∣T∨GN]≲σ2lB(BD+log(2/ε)T+B2D2T2)+infψ∈ΨE[∣∣¯Ql(Xl)−ψ(Xl)∣∣2∣∣GN], (22)

for with standing for inequality up to a universal multiplicative constant.

Now we are able to give a bound for the difference between and

###### Proposition 6

Under conditions of Theorem 5, we have with probability at least

 E[∣∣∣ˆMNK,n(f)−MNK,n(f)∣∣∣2∣∣∣T,GN] (23) EˆMNK,n≲K[N+n∑l=N+1σ2l]B(BD+log(2/ε)T+B2D2T2) E|ˆMNK,n+KN+n∑l=n+1infψ∈ΨE[∣∣¯Ql(Xl)−ψ(Xl)∣∣2∣∣GN].

Proof Using the conditional Cauchy-Schwarz inequality and orthonormality of , we derive

By the Jensen inequality and orthonormality of

 E[∣∣∣ˆMNK,n(f)−MNK,n(f)∣∣∣2∣∣∣T∨GN]≤∑1≤k≤KN+n∑l=N+1E[|ˆal,k(Xl−1)−¯al,k(Xl−1)|2∣∣T∨GN].

### 6.1 Complexity analysis of variance reduction for the ULA algorithm

In the case of ULA one can bound the quantities using -type Sobolev inequalities (see ), see Remark 3 in Section 8. In particular, we can derive that under conditions of Theorem 4 with

 σ2l≲1Γ2N+2,N+n+1,l=N+1,…,N+n+1,

where stands for the inequality up to a multiplicative constant not depending on and Using this inequality and combining (23) with Theorem 4, we get for the variance of with probability at least

 ≲ nKBΓ2N+2,N+n+1(BD+log(2/ε)T+