1 Introduction
Monte Carlo integration typically has an error variance of the form $\sigma^2/N$, where $N$ is the sample size and $\sigma^2$ is the variance of the integrand. We can make the variance smaller by using a larger value of $N$. Alternatively, we can reduce $\sigma^2$ instead of increasing the sample size $N$. To this end, one can try to construct a new Monte Carlo experiment with the same expectation as the original one but with a lower variance $\sigma^2$. Methods to achieve this are known as variance reduction techniques. Variance reduction plays an important role in Monte Carlo and Markov Chain Monte Carlo methods. Introductions to many of the variance reduction techniques can be found in [6], [23], [15] and [14]. Recently there has been a revival of interest in efficient variance reduction methods for MCMC algorithms; see, for example, [8], [20], [5] and references therein.
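As a toy illustration of the variance reduction idea discussed above, the following Python sketch builds a simple control variate for a plain Monte Carlo estimator; the integrand $e^x$, the control $x$ and the coefficient choice are our own illustrative assumptions, not taken from the references.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.standard_normal(n)

# Integrand f(x) = exp(x); for X ~ N(0, 1) its mean is exp(1/2).
f = np.exp(x)

# Control variate g(x) = x: known mean 0, strongly correlated with f.
g = x
beta = np.cov(f, g)[0, 1] / np.var(g)  # variance-minimizing coefficient
f_cv = f - beta * g                    # same expectation, smaller variance

# f_cv.mean() estimates exp(1/2) with noticeably smaller variance than f.mean()
```

Subtracting $\beta g$ leaves the expectation unchanged (since $\mathbb{E}[g(X)] = 0$) while the regression coefficient $\beta$ minimizes the variance of the corrected integrand.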
Suppose that we wish to compute $\pi(f) := \mathbb{E}[f(X)]$, where $X$ is a random vector valued in $\mathbb{R}^d$ with a density $\pi$ and with $\mathbb{E}[f^2(X)] < \infty$. The idea of the so-called control variates variance reduction method is to find a cheaply computable random variable $\xi$ with $\mathbb{E}[\xi] = 0$ and such that the variance of the r.v. $f(X) - \xi$ is small. The complexity of the problem of constructing classes of control variates satisfying $\mathbb{E}[\xi] = 0$ essentially depends on the degree of our knowledge of $\pi$. For example, if $\pi$ is analytically known and satisfies some regularity conditions, one can apply the well-known technique of polynomial interpolation to construct control variates enjoying some optimality properties; see, for example, Section 3.2 in
[9]. Alternatively, if an orthonormal system in $L^2(\pi)$ is analytically available, one can build control variates as a linear combination of the corresponding basis functions; see [4]. Furthermore, if $\pi$ is known only up to a normalizing constant (which is often the case in Bayesian statistics), one can apply the recent approach of control variates depending only on the gradient of $\log \pi$, using a Schrödinger-type Hamiltonian operator in [20] and the so-called Stein operator in [5]. In some situations $\pi$ is not known analytically but can be represented as a function of simple random variables with known distributions. Such a situation arises, for example, in the case of functionals of discretized diffusion processes. In this case a Wiener chaos-type decomposition can be used to construct control variates with nice theoretical properties; see [3]. Note that in order to compare different variance reduction approaches, one has to analyze their complexity, that is, the number of numerical operations required to achieve a prescribed magnitude of the resulting variance.

The situation becomes much more difficult in the case of MCMC algorithms, where one has to work with a Markov chain whose marginal distribution converges to $\pi$ as time grows. One important class of variance reduction methods in this case is based on the so-called Poisson equation for the corresponding Markov chain. It was observed in Henderson [16] that if a time-homogeneous Markov chain with one-step transition kernel $P$ is stationary with stationary distribution $\pi$, then for any real-valued function $g$ defined on the state space of the Markov chain, the function $g - Pg$ has zero mean with respect to $\pi$. The best choice of the function $g$ corresponds to a solution of the so-called Poisson equation
$g - Pg = f - \pi(f)$. Moreover, this choice is also related to the minimal asymptotic variance in the corresponding central limit theorem; see [11] and [20]. Although the Poisson equation involves the quantity of interest $\pi(f)$ and cannot be solved explicitly in most cases, this idea still can be used to construct approximations of the optimal zero-variance control variates. For example, Henderson [16] proposed to compute approximations to the solution of the Poisson equation for specific Markov chains, with particular emphasis on models arising in stochastic network theory. In [8] and [5] series-type control variates are introduced and studied for reversible Markov chains. It is assumed in [8] that the one-step conditional expectations can be computed explicitly for a set of basis functions. The authors in [5] proposed another approach, tailored to the diffusion setting, which does not require the computation of integrals of basis functions and only involves applications of the underlying generator.

In this paper we focus on Langevin-type algorithms, which have received much attention recently; see [7, 12, 17, 22, 21] and references therein. We propose a generic variance reduction method for these and other types of algorithms, which is purely nonasymptotic and does not require that the conditional expectations of the corresponding Markov chain can be computed or that the generator is known analytically. Moreover, we do not need to assume stationarity and/or sampling under the invariant distribution $\pi$. We rigorously analyse the convergence of the method and study its complexity. It is shown that our variance-reduced Langevin algorithm outperforms the standard Langevin algorithms in terms of complexity.
The paper is organized as follows. In Section 2 we set up the problem and introduce some notation. Section 3 contains a novel martingale representation and shows how this representation can be used for variance reduction. In Section 4 we analyze the performance of the proposed variance reduction algorithm in the case of the Unadjusted Langevin Algorithm (ULA). Section 5 presents a nonparametric regression algorithm for estimating the coefficients of the representation. Section 6 studies the complexity of the variance-reduced ULA. Finally, numerical examples are presented in Section 7.
2 Setup
Let $G$ be a domain in $\mathbb{R}^d$. Our aim is to numerically compute expectations of the form
$$\pi(f) = \int_G f(x)\, \pi(dx),$$
where $f : G \to \mathbb{R}$ and $\pi$ is a probability measure supported on $G$.
If the dimension $d$ of the space is large and $\pi(f)$ cannot be computed analytically, one can apply Monte Carlo methods. However, in many practical situations direct sampling from $\pi$ is impossible, and this precludes the use of plain Monte Carlo methods. One popular alternative to Monte Carlo is Markov Chain Monte Carlo (MCMC), where one looks for a discrete-time (possibly nonhomogeneous) Markov chain $(X_p)_{p \ge 0}$ such that $\pi$ is its unique invariant measure. In this paper we study a class of MCMC algorithms with $(X_p)_{p \ge 0}$ satisfying the following recurrence relation:
$$X_{p+1} = \Phi_p(X_p, Z_{p+1}), \qquad p = 0, 1, 2, \ldots, \tag{1}$$
for some i.i.d. random vectors $Z_1, Z_2, \ldots$ with distribution $\mu$ and some Borel-measurable functions $\Phi_p$. In fact, this is a quite general class of Markov chains (see Theorem 1.3.6 in [10]), and many well-known MCMC algorithms can be represented in the form (1). Let us consider two popular examples.
Example 1 (Unadjusted Langevin Algorithm)
Fix a sequence of positive time steps $(\gamma_p)_{p \ge 1}$. Given a Borel function $H : \mathbb{R}^d \to \mathbb{R}^d$, consider a nonhomogeneous discrete-time Markov chain $(X_p)_{p \ge 0}$ defined by
$$X_{p+1} = X_p - \gamma_{p+1}\, H(X_p) + \sqrt{2 \gamma_{p+1}}\, \xi_{p+1}, \qquad p = 0, 1, 2, \ldots, \tag{2}$$
where $(\xi_p)_{p \ge 1}$ is an i.i.d. sequence of $d$-dimensional standard Gaussian random vectors. If $H = \nabla U$ for some continuously differentiable function $U : \mathbb{R}^d \to \mathbb{R}$, then the Markov chain (2) can be used to approximately sample from the density
$$\pi(x) \propto e^{-U(x)}, \tag{3}$$
provided that $\int_{\mathbb{R}^d} e^{-U(x)}\, dx$ is finite. This method is usually referred to as the Unadjusted Langevin Algorithm (ULA). In fact, the Markov chain (2) arises as the Euler–Maruyama discretization of the Langevin diffusion
$$dY_t = -\nabla U(Y_t)\, dt + \sqrt{2}\, dW_t$$
with nonnegative time steps $(\gamma_p)_{p \ge 1}$, and, under mild technical conditions, the latter Langevin diffusion admits $\pi$ of (3) as its unique invariant distribution; see [7] and [12].
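The recursion (2) is straightforward to implement. Below is a minimal Python sketch of ULA; the quadratic potential $U(x) = |x|^2/2$ (so that the target is standard Gaussian), the constant step size and the burn-in length are our own illustrative assumptions.

```python
import numpy as np

def ula_chain(grad_u, x0, step_sizes, rng):
    """Run (2): X_{p+1} = X_p - gamma_{p+1} grad U(X_p) + sqrt(2 gamma_{p+1}) xi_{p+1}."""
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for gamma in step_sizes:
        xi = rng.standard_normal(x.shape)
        x = x - gamma * grad_u(x) + np.sqrt(2.0 * gamma) * xi
        path.append(x.copy())
    return np.asarray(path)

# Target pi proportional to exp(-|x|^2 / 2), i.e. U(x) = |x|^2 / 2, grad U(x) = x.
rng = np.random.default_rng(1)
path = ula_chain(lambda x: x, np.zeros(2), [0.1] * 5000, rng)
sample = path[1000:]  # crude burn-in; second moments should be close to 1
```

For a fixed step size the chain samples from a biased approximation of $\pi$; the bias vanishes as the step size decreases.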
Example 2 (Metropolis-Adjusted Langevin Algorithm)
The Metropolis–Hastings algorithm associated with a target density $\pi$ requires the choice of a sequence of conditional densities $q_p(\cdot \mid x)$, also called proposal or candidate kernels. The transition from the value of the Markov chain at time $p$ to its value at time $p+1$ proceeds via the following transition step:

1. Generate $Y_{p+1} \sim q_p(\cdot \mid X_p)$;
2. Put
$$X_{p+1} = \begin{cases} Y_{p+1} & \text{with probability } \alpha_p(X_p, Y_{p+1}), \\ X_p & \text{with probability } 1 - \alpha_p(X_p, Y_{p+1}), \end{cases}$$
where
$$\alpha_p(x, y) = \min\left\{ 1,\ \frac{\pi(y)\, q_p(x \mid y)}{\pi(x)\, q_p(y \mid x)} \right\}.$$
Then, as shown in Metropolis et al. [19], this transition is reversible with respect to $\pi$ and therefore preserves the stationary density $\pi$. If the proposal kernels $q_p$ have a wide enough support to eventually reach any region of the state space with positive mass under $\pi$, then this transition is irreducible and $\pi$ is a maximal irreducibility measure [18]. The Metropolis-Adjusted Langevin Algorithm (MALA) takes (2) as its proposal, that is,
$$q_p(y \mid x) = (2\gamma_{p+1})^{-d/2}\, \varphi\!\left( \frac{y - x + \gamma_{p+1} \nabla U(x)}{\sqrt{2\gamma_{p+1}}} \right),$$
where $\varphi$ denotes the density of a $d$-dimensional standard Gaussian random vector. The MALA algorithm usually provides noticeable speedups in convergence for most problems. It is not difficult to see that the MALA chain can be compactly represented in the form (1), where $(U_p)_{p \ge 1}$ is an i.i.d. sequence of random variables uniformly distributed on $[0,1]$ and independent of $(\xi_p)_{p \ge 1}$ (the uniform variables implement the accept/reject step). Thus we recover (1) with $Z_p = (\xi_p, U_p)$ and suitably chosen Borel-measurable functions $\Phi_p$.

Example 3
Let $(Y_t)_{t \ge 0}$ be the unique strong solution to an SDE of the form
$$dY_t = b(Y_t)\, dt + \sigma(Y_t)\, dW_t, \tag{4}$$
where $(W_t)_{t \ge 0}$ is a standard vector-valued Brownian motion and $b$, $\sigma$ are locally Lipschitz continuous functions with at most linear growth. The process $(Y_t)_{t \ge 0}$ is a Markov process; let $\mathcal{A}$ denote its infinitesimal generator, defined by
$$\mathcal{A} g = b^\top \nabla g + \tfrac{1}{2} \operatorname{tr}\!\left( \sigma \sigma^\top \nabla^2 g \right)$$
for any $g \in C^2$. If there exists a twice continuously differentiable Lyapunov function $V$ such that $\mathcal{A} V \le C - c\, V$ for some constants $c, C > 0$,
then there is an invariant probability measure $\pi^\ast$ for $(Y_t)_{t \ge 0}$, that is, $Y_t \sim \pi^\ast$ for all $t \ge 0$ if $Y_0 \sim \pi^\ast$. Invariant measures are crucial in the study of the long-term behaviour of stochastic differential systems (4). Under some additional assumptions, the invariant measure $\pi^\ast$ is ergodic, and this property can be exploited to compute integrals $\pi^\ast(f)$ by means of ergodic averages. The idea is to replace the diffusion by a (simulable) discretization scheme of the form (see e.g. [22])
$$Y_{p+1} = Y_p + \gamma_{p+1}\, b(Y_p) + \sqrt{\gamma_{p+1}}\, \sigma(Y_p)\, \xi_{p+1},$$
where $(\xi_p)_{p \ge 1}$ is an i.i.d. sequence of standard Gaussian random vectors and $(\gamma_p)_{p \ge 1}$ is a nonincreasing sequence of time steps. Then for a function $f$ we can approximate $\pi^\ast(f)$ via the weighted ergodic average
$$\pi^\ast(f) \approx \frac{\sum_{p=1}^{n} \gamma_p\, f(Y_{p-1})}{\sum_{p=1}^{n} \gamma_p}.$$
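A minimal Python sketch of such a decreasing-step Euler scheme and its weighted ergodic average; the Ornstein–Uhlenbeck coefficients, the step rule $\gamma_p = p^{-1/2}$ and the test function are our own illustrative assumptions, not taken from the paper.

```python
import numpy as np

def ergodic_average(b, sigma, f, y0, n, rng):
    """Decreasing-step Euler scheme for dY = b(Y) dt + sigma(Y) dW and the
    weighted ergodic average sum(gamma_p f(Y_{p-1})) / sum(gamma_p)."""
    gammas = 1.0 / np.sqrt(np.arange(1, n + 1))  # nonincreasing, divergent sum
    y, num = float(y0), 0.0
    for gamma in gammas:
        num += gamma * f(y)
        y += gamma * b(y) + sigma(y) * np.sqrt(gamma) * rng.standard_normal()
    return num / gammas.sum()

# Ornstein-Uhlenbeck: b(y) = -y, sigma = sqrt(2); invariant law is N(0, 1),
# so the ergodic average of f(y) = y^2 should approach 1.
rng = np.random.default_rng(2)
est = ergodic_average(lambda y: -y, lambda y: np.sqrt(2.0),
                      lambda y: y * y, 0.0, 50_000, rng)
```

Since $\sum_p \gamma_p = \infty$ while $\gamma_p \to 0$, the scheme explores the state space indefinitely while its discretization bias vanishes, so the weighted average converges to $\pi^\ast(f)$.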
Due to the typically high correlation between the successive states $Y_p$, variance reduction is of crucial importance here. As a matter of fact, in many cases there is no explicit formula for the invariant measure $\pi^\ast$, and this rules out the use of gradient-based variance reduction techniques (see e.g. [20]). On the contrary, our method can be directly used to reduce the variance of the ergodic estimator
without explicit knowledge of $\pi^\ast$.

3 Martingale representation and variance reduction
In this section we give a general discrete-time martingale representation for the Markov chain (1), which is used below to construct an efficient variance reduction algorithm. Let $(\psi_k)_{k \ge 0}$ be a complete orthonormal system in $L^2(\mu)$ with $\psi_0 \equiv 1$. In particular, we have
$$\mathbb{E}\big[\psi_j(Z)\, \psi_k(Z)\big] = \delta_{jk}$$
with $Z \sim \mu$. Notice that this implies that the random variables $\psi_k(Z)$, $k \ge 1$, are centered. As an example, we can take multivariate Hermite polynomials for the ULA algorithm and, for MALA, a tensor product of shifted Legendre polynomials for the "uniform part" and Hermite polynomials for the "Gaussian part" of the random variable $Z_p = (\xi_p, U_p)$, as the shifted Legendre polynomials are orthogonal with respect to the Lebesgue measure on $[0,1]$. In the sequel, we denote by $(\mathcal{F}_p)_{p \ge 0}$ the filtration generated by $(Z_p)_{p \ge 1}$, with the convention $\mathcal{F}_0 = \{\emptyset, \Omega\}$.
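For the Gaussian innovations of ULA, the normalized (probabilists') Hermite polynomials provide such a system. The following Python sketch checks numerically, via Gauss–Hermite quadrature, that they are orthonormal under $N(0,1)$ and that all polynomials of positive degree are centered; the truncation at degree 4 and the quadrature order are illustrative choices.

```python
import numpy as np
from math import factorial, sqrt
from numpy.polynomial.hermite_e import hermegauss, hermeval

# Gauss-Hermite nodes/weights for the weight exp(-x^2/2); the weights sum to
# sqrt(2 pi), so we renormalize them to the standard Gaussian measure.
nodes, weights = hermegauss(20)
weights = weights / np.sqrt(2.0 * np.pi)

def hermite_normalized(k, x):
    """Normalized probabilists' Hermite polynomial H_k = He_k / sqrt(k!)."""
    coeffs = np.zeros(k + 1)
    coeffs[k] = 1.0 / sqrt(factorial(k))
    return hermeval(x, coeffs)

gram = np.array([[np.sum(weights * hermite_normalized(j, nodes)
                         * hermite_normalized(k, nodes))
                  for k in range(5)]
                 for j in range(5)])
# gram is numerically the 5 x 5 identity matrix: the system is orthonormal,
# and the first row shows that H_k, k >= 1, integrate to zero against N(0, 1).
```

Since $H_0 \equiv 1$, the first row of the Gram matrix being $(1, 0, 0, \ldots)$ is exactly the centering property $\mathbb{E}[H_k(\xi)] = 0$, $k \ge 1$, used above.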
Theorem 1
Let $f$ be a Borel function such that $\mathbb{E}\big[f^2(X_n)\big] < \infty$, with $(X_p)_{p \ge 0}$ a Markov chain defined in (1). Then, for any $n \ge 1$, the following representation holds in $L^2$:
$$f(X_n) = \mathbb{E}\big[f(X_n) \,\big|\, X_0\big] + \sum_{p=1}^{n} \sum_{k \ge 1} a_{p,k}(X_{p-1})\, \psi_k(Z_p), \tag{5}$$
where
$$a_{p,k}(x) = \mathbb{E}\big[f(X_n)\, \psi_k(Z_p) \,\big|\, X_{p-1} = x\big]. \tag{6}$$
Proof. The expansion obviously holds for $n = 1$ and all such $f$. Indeed, due to the orthonormality and completeness of the system $(\psi_k)_{k \ge 0}$, we have
with
provided $\mathbb{E}[f^2] < \infty$. Recall that $\psi_0 \equiv 1$ and $\mathcal{F}_0 = \{\emptyset, \Omega\}$. Suppose that (5) holds for $n - 1$, all Markov chains of the form (1), and all Borel-measurable functions $f$ with $\mathbb{E}[f^2] < \infty$. Let us prove it for $n$. Given $f$ with $\mathbb{E}[f^2(X_n)] < \infty$, due to the orthonormality and completeness of the system $(\psi_k)_{k \ge 0}$, we get by conditioning on $\mathcal{F}_{n-1}$,
where
By the Markov property of $(X_p)_{p \ge 0}$, we have $\mathbb{E}\big[f(X_n) \,\big|\, \mathcal{F}_{n-1}\big] = \mathbb{E}\big[f(X_n) \,\big|\, X_{n-1}\big]$. Furthermore, a calculation involving intermediate conditioning and the recurrence relation (1) verifies that
for suitably chosen Borel-measurable functions. We thus arrive at
(7) 
which is the required statement in the case . Now assume . The random variable is square integrable and has the form , hence the induction hypothesis applies, and we get
(8) 
with
From a numerical point of view, another representation of the coefficients (6) turns out to be more useful.
Proposition 2
The coefficients $a_{p,k}$ in (6) can be alternatively represented as
$$a_{p,k}(x) = \mathbb{E}\big[q_p\big(\Phi_{p-1}(x, Z)\big)\, \psi_k(Z)\big], \qquad Z \sim \mu,$$
with $q_p(x) = \mathbb{E}\big[f(X_n) \,\big|\, X_p = x\big]$. The functions $q_p$ can be computed by the backward recurrence: $q_n = f$ and, for $p = n-1, \ldots, 1$,
$$q_p(x) = \mathbb{E}\big[q_{p+1}\big(\Phi_p(x, Z)\big)\big], \qquad Z \sim \mu. \tag{9}$$
Next we show how the representation (5) can be used to efficiently reduce the variance of MCMC algorithms. Let $(\gamma_p)_{p \ge 1}$ be a sequence of positive and nonincreasing step sizes and, for $N, n \in \mathbb{N}$, we set
Consider a weighted average estimator of the form
(10) 
where $N$ is the length of the burn-in period and $n$ is the number of effective samples. Given $N$ and $n$ as above, for $K \in \mathbb{N}$, denote
(11) 
where
(12) 
Since $X_{p-1}$ is independent of $Z_p$, and since the random variables $\psi_k(Z_p)$, $k \ge 1$, are centered, the r.v. defined in (11) has zero mean and can be viewed as a control variate.
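To see the mechanism at work, consider the one-dimensional ULA chain with $U(x) = x^2/2$ and a constant step, for which the martingale increments of $f(x) = x$ are available in closed form. The following Python sketch is our own toy example, with hand-computed coefficients rather than the regression estimates of Section 5; subtracting the increments yields a zero-variance estimator in this special case.

```python
import numpy as np

# One-dimensional ULA with U(x) = x^2/2 and constant step g:
#   X_p = (1 - g) X_{p-1} + sqrt(2 g) xi_p,
# so X_n = (1-g)^n X_0 + sum_p sqrt(2 g) (1-g)^(n-p) xi_p.  For f(x) = x the
# martingale increments a_p psi_1(xi_p) are exactly sqrt(2 g)(1-g)^(n-p) xi_p.
rng = np.random.default_rng(3)
g, n, n_paths, x0 = 0.1, 50, 2000, 1.0

xi = rng.standard_normal((n_paths, n))
X = np.full(n_paths, x0)
M = np.zeros(n_paths)
for p in range(1, n + 1):
    M += np.sqrt(2.0 * g) * (1.0 - g) ** (n - p) * xi[:, p - 1]
    X = (1.0 - g) * X + np.sqrt(2.0 * g) * xi[:, p - 1]

plain = X        # ordinary Monte Carlo draws of f(X_n) = X_n
reduced = X - M  # same mean; here the control variate removes all the noise
```

In general $f$ is nonlinear and the coefficients must be estimated, so the variance is reduced rather than eliminated; but the zero-mean property of the control variate holds regardless of the quality of those estimates.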
4 Analysis of variance-reduced ULA
The representation (5) suggests that the variance of the variance-reduced estimator
(13) 
should be small for $K$ large enough. In this section we provide an analysis of the variance-reduced ULA algorithm (see Example 1). By $H_k$, $k \ge 0$, we denote the normalized (probabilists') Hermite polynomial on $\mathbb{R}$, that is,
$$H_k(x) = \frac{(-1)^k}{\sqrt{k!}}\, e^{x^2/2} \frac{d^k}{dx^k} e^{-x^2/2}.$$
For a multi-index $k = (k_1, \ldots, k_d)$, $H_k$ denotes the normalized Hermite polynomial on $\mathbb{R}^d$, that is,
$$H_k(x) = \prod_{i=1}^{d} H_{k_i}(x_i).$$
In what follows, we also use the notation for , and we set , , and . Given and as above, for , denote
(14)  
with and
For an estimator $\widehat{\pi}(f)$ of $\pi(f)$ (see (10) and (13)), we are interested in its conditional Mean Squared Error (MSE), which can be decomposed as the sum of the squared conditional bias and the conditional variance:
$$\mathbb{E}\big[\big(\widehat{\pi}(f) - \pi(f)\big)^2 \,\big|\, \mathcal{F}_N\big] = \big(\mathbb{E}\big[\widehat{\pi}(f) \,\big|\, \mathcal{F}_N\big] - \pi(f)\big)^2 + \operatorname{Var}\big[\widehat{\pi}(f) \,\big|\, \mathcal{F}_N\big]. \tag{15}$$
The quantities in (15) are conditioned on $\mathcal{F}_N$, as this reflects the way the estimators are used in MCMC: the path of the Markov chain is simulated only once, and we start to use the realized values of the Markov chain to construct our estimate only after the burn-in period of length $N$. We also notice that, due to the Markovian structure, the conditioning on $\mathcal{F}_N$ in (15) is equivalent to conditioning on $X_N$ only (this is particularly clear in the case of the estimator (10), but requires some calculation in the remaining case (13)).
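The decomposition above is the elementary identity $\mathrm{MSE} = \mathrm{bias}^2 + \mathrm{variance}$, which holds exactly for the empirical (plug-in) versions as well. A quick numerical sanity check in Python (the biased Gaussian estimator is an illustrative assumption):

```python
import numpy as np

# theta_hat has bias 0.1 and variance 0.25 around theta = 2; the empirical
# MSE must equal the empirical squared bias plus the empirical variance
# (the identity is exact for the plug-in versions with ddof = 0).
rng = np.random.default_rng(4)
theta = 2.0
theta_hat = theta + 0.1 + 0.5 * rng.standard_normal(1_000_000)

mse = np.mean((theta_hat - theta) ** 2)
bias_sq = (theta_hat.mean() - theta) ** 2
variance = theta_hat.var()
# mse == bias_sq + variance up to floating-point rounding
```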
4.1 Squared conditional bias
Due to the martingale transform structure of the control variate, its conditional expectation given $\mathcal{F}_N$ vanishes. Hence both estimators (10) and (13) have the same conditional bias. Notice that this remains true also when we substitute the coefficients in (14) with some independent approximations. For a bounded Borel function $f$, we can estimate the conditional bias similarly to the beginning of [12, Section 4]:
(16)  
where , denotes the total variation distance between probability measures and , that is,
denotes the probability measure on with density of (3); for , the Markov kernel from into is defined by
while, for , , the kernel is
which, finally, provides the (random) measure used in the right-hand side of (16).
4.2 Conditional variance
An upper bound for the variance of the classical estimator (10) is provided in [12, Theorem 17]. As for the estimator (13), we present a lemma giving a conditional variance decomposition; it follows from (14), Theorem 1 and the independence of the sequence $(Z_p)_{p \ge 1}$.
Lemma 3
Now we present an upper bound for the right-hand sides of (17) in the case of the ULA algorithm.
Theorem 4
Fix $K \in \mathbb{N}$. Suppose that $f$ and $U$ are sufficiently many times continuously differentiable and that
for all multi-indices with ,
for some positive number , where stands for the Jacobian of . Let $(\gamma_p)_{p \ge 1}$ be a sequence of positive and nonincreasing step sizes with . We assume that and that
(18) 
with some constant . Then it holds
(19) 
where the sum in (19) is taken over all subsets of , and $\lesssim$ stands for inequality up to a constant not depending on and .
5 Nonparametric regression
The functions need to be estimated before one can apply the proposed variance reduction approach. This amounts to solving a nonparametric regression problem. We present a generic regression algorithm. The algorithm starts with estimating the functions for , where
We first generate paths of the chain, conditionally independent of the $\sigma$-algebra generated by the burn-in sequence (the so-called "training paths"). The regression algorithm proceeds with estimating the functions by solving the least squares optimization problems
(20) 
for , where is a class of functions on . Next we estimate the coefficients using the formula
(21) 
where is a random variable independent of , with law defined in (1), and the expectation is taken with respect to the known distribution . The conditioning on underlines that the estimates are derived from the training trajectories. In many cases the integration can be done in closed analytical form.
For example, in the case of the ULA algorithm, is the multivariate normal distribution , and . Therefore, if we use polynomials to approximate , then can even be computed in closed form; implementation details are provided in Section 7.

Upon estimating the coefficients, one can construct an empirical estimate of the control variate in the form
Obviously it has zero mean, and it is indeed a valid control variate in that it does not introduce any bias.
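A minimal Python sketch of one such least-squares regression step: from simulated pairs $(X_{p-1}, Y)$ of consecutive states we fit the conditional-expectation function by ordinary least squares over polynomial features. The ULA transition with quadratic potential, the polynomial function class and its degree are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
g = 0.1
x = rng.standard_normal(5_000)                                     # states X_{p-1}
y = (1.0 - g) * x + np.sqrt(2.0 * g) * rng.standard_normal(5_000)  # next ULA state

# Least squares over the class K of polynomials of degree <= 3.
deg = 3
features = np.vander(x, deg + 1)  # columns: x^3, x^2, x, 1
coef, *_ = np.linalg.lstsq(features, y, rcond=None)

def q_hat(t):
    """Fitted approximation of q(t) = E[Y | X = t]; the truth here is (1 - g) t."""
    return np.vander(np.atleast_1d(t), deg + 1) @ coef
```

For this linear-Gaussian toy transition the true regression function is $(1-g)\,t$, so the fitted linear coefficient should be close to $0.9$ while the remaining coefficients stay near zero.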
6 Complexity analysis for ULA
The following result quantifies the error of estimating the functions via this algorithm; its proof follows from Theorem 2.2 in [2].
Theorem 5
Suppose that for any
with probability for some finite positive numbers Furthermore, assume that where the functions are uniformly bounded and satisfy
Then for any values of and such that and
it holds with probability at least
(22) 
for , with $\lesssim$ standing for inequality up to a universal multiplicative constant.
Now we are able to give a bound for the difference between and
Proposition 6
Under the conditions of Theorem 5, we have, with probability at least
(23)  
Proof. Using the conditional Cauchy–Schwarz inequality and the orthonormality of the system $(\psi_k)_{k \ge 0}$, we derive
By Jensen's inequality and the orthonormality of the system $(\psi_k)_{k \ge 0}$,
6.1 Complexity analysis of variance reduction for the ULA algorithm
In the case of ULA one can bound the quantities using Sobolev-type inequalities (see [1]); see also Remark 3 in Section 8. In particular, we can derive that, under the conditions of Theorem 4,
where $\lesssim$ stands for inequality up to a multiplicative constant not depending on and . Using this inequality and combining (23) with Theorem 4, we get, for the variance of , with probability at least