Monte Carlo integration typically has an error variance of the form $\sigma^2/N$, where $N$ is the sample size and $\sigma^2$ is the variance of the integrand. We can make the variance smaller by using a larger value of $N$. Alternatively, we can reduce $\sigma^2$ instead of increasing the sample size $N$. To this end, one can try to construct a new Monte Carlo experiment with the same expectation as the original one but with a lower variance. Methods to achieve this are known as variance reduction techniques. Variance reduction plays an important role in Monte Carlo and Markov Chain Monte Carlo methods. Introductions to many of the variance reduction techniques can be found in , , and . Recently, there has been a revival of interest in efficient variance reduction methods for MCMC algorithms; see, for example, , , and the references therein.
Suppose that we wish to compute $\mathbb{E}[f(X)]$, where $X$ is a random vector taking values in $\mathbb{R}^d$ with a density $p$ and with $\mathbb{E}|f(X)|^2 < \infty$. The idea of the so-called control variates variance reduction method is to find a cheaply computable random variable $\xi$ with $\mathbb{E}[\xi] = 0$ and such that the variance of the r.v. $f(X) - \xi$ is small. The complexity of the problem of constructing classes of control variates satisfying these requirements essentially depends on the degree of our knowledge of $p$. For example, if $p$ is analytically known and satisfies some regularity conditions, one can apply the well-known technique of polynomial interpolation to construct control variates enjoying some optimality properties; see, for example, Section 3.2 in . Alternatively, if an orthonormal system in $L_2(p)$ is analytically available, one can build control variates as a linear combination of the corresponding basis functions, see . Furthermore, if $p$ is known only up to a normalizing constant (which is often the case in Bayesian statistics), one can apply the recent approaches of control variates depending only on the gradient $\nabla \log p$, using a Schrödinger-type Hamiltonian operator in  and the so-called Stein operator in . In some situations $p$ is not known analytically, but $X$ can be represented as a function of simple random variables with known distributions. Such a situation arises, for example, in the case of functionals of discretized diffusion processes. In this case a Wiener chaos-type decomposition can be used to construct control variates with nice theoretical properties, see . Note that in order to compare different variance reduction approaches, one has to analyze their complexity, that is, the number of numerical operations required to achieve a prescribed magnitude of the resulting variance.
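For concreteness, here is a minimal toy illustration of the control variate principle (the integrand $e^X$, the control variate $g(X) = X$, and the estimated coefficient are our own illustrative choices, not from the paper): we estimate $\mathbb{E}[e^X]$ for $X \sim \mathcal{N}(0,1)$, subtracting a multiple of a zero-mean control variate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.standard_normal(n)

f = np.exp(x)   # integrand; E[f(X)] = e^{1/2} for X ~ N(0, 1)
g = x           # control variate with known mean E[g(X)] = 0

# Estimate the (near-)optimal coefficient c* = Cov(f, g) / Var(g) from the sample.
c = np.cov(f, g)[0, 1] / np.var(g)

plain = f.mean()            # plain Monte Carlo estimate
cv = (f - c * g).mean()     # same expectation, smaller variance

print(plain, cv, np.var(f), np.var(f - c * g))
```

Both estimators are unbiased for $e^{1/2} \approx 1.6487$, but the control-variate version has a markedly smaller sample variance, which is exactly the effect the paper seeks in the MCMC setting.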
The situation becomes much more difficult in the case of MCMC algorithms, where one has to work with a Markov chain whose marginal distribution converges to the target distribution $\pi$ as time grows. One important class of variance reduction methods in this case is based on the so-called Poisson equation for the corresponding Markov chain. It was observed in Henderson  that if a time-homogeneous Markov chain with transition kernel $P$ is stationary with stationary distribution $\pi$, then for any real-valued function $G$ defined on the state space of the Markov chain, the function $G - PG$ has zero mean with respect to $\pi$. The best choice for the function $G$ corresponds to a solution of the so-called Poisson equation
$$
G - PG = f - \pi(f).
$$
Moreover, it is also related to the minimal asymptotic variance in the corresponding central limit theorem, see  and . Although the Poisson equation involves the quantity of interest $\pi(f)$ and cannot be solved explicitly in most cases, this idea can still be used to construct approximations of the optimal zero-variance control variates. For example, Henderson  proposed to compute approximations of the solution of the Poisson equation for specific Markov chains, with particular emphasis on models arising in stochastic network theory. In  and , series-type control variates are introduced and studied for reversible Markov chains. It is assumed in  that the one-step conditional expectations can be computed explicitly for a set of basis functions. The authors in  proposed another approach, tailored to the diffusion setting, which does not require the computation of integrals of basis functions and only involves applications of the underlying generator.
In this paper we focus on Langevin-type algorithms, which have attracted much attention recently; see [7, 12, 17, 22, 21] and the references therein. We propose a generic variance reduction method for these and other types of algorithms, which is purely non-asymptotic and does not require that the conditional expectations of the corresponding Markov chain can be computed or that the generator is known analytically. Moreover, we do not need to assume stationarity and/or sampling under the invariant distribution $\pi$. We rigorously analyse the convergence of the method and study its complexity. It is shown that our variance-reduced Langevin algorithm outperforms the standard Langevin algorithms in terms of complexity.
The paper is organized as follows. In Section 2 we set up the problem and introduce some notation. Section 3 contains a novel martingale representation and shows how this representation can be used for variance reduction. In Section 4 we analyze the performance of the proposed variance reduction algorithm in the case of the Unadjusted Langevin Algorithm (ULA). Section 5 presents a nonparametric regression algorithm for estimating the quantities needed to implement the method. Section 6 studies the complexity of the variance-reduced ULA. Finally, numerical examples are presented in Section 7.
Let $\mathsf{X}$ be a domain in $\mathbb{R}^d$. Our aim is to numerically compute expectations of the form
$$
\pi(f) := \int_{\mathsf{X}} f(x)\,\pi(dx),
$$
where $\pi$ is a probability measure supported on $\mathsf{X}$. If the dimension of the space is large and $\pi(f)$ cannot be computed analytically, one can apply Monte Carlo methods. However, in many practical situations direct sampling from $\pi$ is impossible, and this precludes the use of plain Monte Carlo methods in this case. One popular alternative to Monte Carlo is Markov Chain Monte Carlo (MCMC), where one looks for a discrete-time (possibly non-homogeneous) Markov chain $(X_k)_{k \ge 0}$ such that $\pi$ is its unique invariant measure. In this paper we study a class of MCMC algorithms with $(X_k)_{k \ge 0}$ satisfying the following recurrence relation:
$$
X_{k+1} = \Phi_{k+1}(X_k, \xi_{k+1}), \qquad k = 0, 1, 2, \ldots, \tag{1}
$$
for some i.i.d. random vectors $(\xi_k)_{k \ge 1}$ with common distribution $P_\xi$ and some Borel-measurable functions $\Phi_k$. In fact, this is a quite general class of Markov chains (see Theorem 1.3.6 in ) and many well-known MCMC algorithms can be represented in the form (1). Let us consider two popular examples.
Example 1 (Unadjusted Langevin Algorithm)
Fix a sequence of positive time steps $(\gamma_k)_{k \ge 1}$. Given a Borel function $\mu \colon \mathbb{R}^d \to \mathbb{R}^d$, consider a non-homogeneous discrete-time Markov chain $(X_k)_{k \ge 0}$ defined by
$$
X_{k+1} = X_k + \gamma_{k+1}\,\mu(X_k) + \sqrt{2\gamma_{k+1}}\,Z_{k+1}, \tag{2}
$$
where $(Z_k)_{k \ge 1}$ is an i.i.d. sequence of $d$-dimensional standard Gaussian random vectors. If $\mu = -\nabla U$ for some continuously differentiable function $U$, then the Markov chain (2) can be used to approximately sample from the density
$$
\pi(x) = \frac{e^{-U(x)}}{\int_{\mathbb{R}^d} e^{-U(y)}\,dy}, \tag{3}
$$
provided that $\int_{\mathbb{R}^d} e^{-U(x)}\,dx$ is finite. This method is usually referred to as the Unadjusted Langevin Algorithm (ULA). In fact, the Markov chain (2) arises as the Euler-Maruyama discretization of the Langevin diffusion
$$
dY_t = -\nabla U(Y_t)\,dt + \sqrt{2}\,dW_t.
$$
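A minimal sketch of the ULA recursion for a toy target (the standard Gaussian target, the constant step size, and the chain length are illustrative assumptions, not choices made in the paper):

```python
import numpy as np

def ula(grad_U, x0, gammas, rng):
    """Unadjusted Langevin Algorithm: Euler-Maruyama discretization of
    dY_t = -grad_U(Y_t) dt + sqrt(2) dW_t."""
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for g in gammas:
        x = x - g * grad_U(x) + np.sqrt(2.0 * g) * rng.standard_normal(x.shape)
        path.append(x.copy())
    return np.array(path)

# Toy target: standard Gaussian in 2d, i.e. U(x) = |x|^2 / 2, grad_U(x) = x.
rng = np.random.default_rng(1)
path = ula(lambda x: x, np.zeros(2), 0.1 * np.ones(20_000), rng)
print(path[1000:].mean(axis=0))  # close to (0, 0), up to discretization bias
```

Note that ULA does not sample from $\pi$ exactly: a constant step size leaves a discretization bias, which is the reason the adjusted variant below adds an accept/reject correction.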
Example 2 (Metropolis-Adjusted Langevin Algorithm)
The Metropolis-Hastings algorithm associated with a target density $\pi$ requires the choice of a sequence of conditional densities $q_k(\cdot \mid x)$, also called proposal or candidate kernels. The transition from the value $X_k$ of the Markov chain at time $k$ to its value $X_{k+1}$ at time $k+1$ proceeds via the following step: generate a proposal $Y_{k+1} \sim q_{k+1}(\cdot \mid X_k)$ and accept it, setting $X_{k+1} = Y_{k+1}$, with probability
$$
\alpha(X_k, Y_{k+1}) = \min\!\left\{1,\; \frac{\pi(Y_{k+1})\, q_{k+1}(X_k \mid Y_{k+1})}{\pi(X_k)\, q_{k+1}(Y_{k+1} \mid X_k)}\right\},
$$
otherwise setting $X_{k+1} = X_k$. Then, as shown in Metropolis et al. , this transition is reversible with respect to $\pi$ and therefore preserves the stationary density $\pi$. If the proposal kernels have a wide enough support to eventually reach any region of the state space with positive mass under $\pi$, then this transition is irreducible and $\pi$ is a maximal irreducibility measure . The Metropolis-Adjusted Langevin Algorithm (MALA) takes (2) as proposal, that is,
$$
q_{k+1}(y \mid x) = (2\gamma_{k+1})^{-d/2}\,\varphi\!\left(\frac{y - x - \gamma_{k+1}\,\mu(x)}{\sqrt{2\gamma_{k+1}}}\right),
$$
where $\varphi$ denotes the density of a $d$-dimensional standard Gaussian random vector. The MALA algorithms usually provide noticeable speed-ups in convergence for most problems. It is not difficult to see that the MALA chain can also be compactly represented in the form (1).
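A corresponding sketch of MALA for the same toy Gaussian target (step size and chain length are again illustrative; the log target is assumed known only up to an additive constant, as is typical in Bayesian applications):

```python
import numpy as np

def mala(log_pi, grad_log_pi, x0, gamma, n_steps, rng):
    """Metropolis-Adjusted Langevin: ULA proposal plus accept/reject.
    log_pi needs to be known only up to an additive constant."""
    def log_q(y, x):  # log density (up to a constant) of the proposal y given x
        m = x + gamma * grad_log_pi(x)
        return -np.sum((y - m) ** 2) / (4.0 * gamma)
    x = np.array(x0, dtype=float)
    chain, accepted = [x.copy()], 0
    for _ in range(n_steps):
        y = x + gamma * grad_log_pi(x) + np.sqrt(2.0 * gamma) * rng.standard_normal(x.shape)
        log_alpha = log_pi(y) + log_q(x, y) - log_pi(x) - log_q(y, x)
        if np.log(rng.uniform()) < log_alpha:
            x, accepted = y, accepted + 1
        chain.append(x.copy())
    return np.array(chain), accepted / n_steps

# Toy target: standard Gaussian in 2d.
rng = np.random.default_rng(2)
chain, acc = mala(lambda x: -0.5 * np.sum(x ** 2), lambda x: -x,
                  np.zeros(2), 0.5, 20_000, rng)
print(chain[1000:].mean(axis=0), acc)
```

Unlike ULA, the accept/reject step makes $\pi$ exactly invariant, at the cost of occasionally rejected moves.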
Let $(Y_t)_{t \ge 0}$ be the unique strong solution to an SDE of the form
$$
dY_t = b(Y_t)\,dt + \sigma(Y_t)\,dW_t, \tag{4}
$$
where $(W_t)_{t \ge 0}$ is a standard $\mathbb{R}^d$-valued Brownian motion, and $b$ and $\sigma$ are locally Lipschitz continuous functions with at most linear growth. The process $(Y_t)_{t \ge 0}$ is a Markov process; let $\mathcal{A}$ denote its infinitesimal generator, defined by
$$
\mathcal{A}f(x) = \langle b(x), \nabla f(x)\rangle + \tfrac{1}{2}\,\operatorname{tr}\!\big(\sigma(x)\sigma(x)^{\top}\nabla^{2} f(x)\big)
$$
for any twice continuously differentiable function $f$ with compact support. If there exists a twice continuously differentiable Lyapunov function $V$ satisfying a suitable drift condition, then there is an invariant probability measure $\pi$ for $(Y_t)_{t \ge 0}$, that is, $\int \mathcal{A}f \, d\pi = 0$ for all such $f$. Invariant measures are crucial in the study of the long-term behaviour of the stochastic differential system (4). Under some additional assumptions, the invariant measure $\pi$ is ergodic, and this property can be exploited to compute the integrals $\pi(f)$ by means of ergodic averages. The idea is to replace the diffusion by a (simulable) discretization scheme of the form (see e.g. )
$$
X_{k+1} = X_k + \gamma_{k+1}\, b(X_k) + \sqrt{\gamma_{k+1}}\,\sigma(X_k)\,\xi_{k+1},
$$
where $(\xi_k)_{k \ge 1}$ are i.i.d. standard Gaussian random vectors and $(\gamma_k)_{k \ge 1}$ is a non-increasing sequence of time steps. Then, for a function $f$, we can approximate $\pi(f)$ via the weighted ergodic average
$$
\pi_n(f) = \frac{1}{\Gamma_n}\sum_{k=1}^{n} \gamma_k\, f(X_k), \qquad \Gamma_n = \sum_{k=1}^{n} \gamma_k.
$$
Due to the typically high correlation between the successive values $f(X_k)$, variance reduction is of crucial importance here.
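As a quick numerical illustration (the Ornstein-Uhlenbeck model, the step schedule $\gamma_k = \gamma_0 k^{-1/3}$, and the test function are all illustrative assumptions, not from the paper): an Euler scheme with decreasing steps, combined with a step-weighted ergodic average, recovers an integral against the invariant law.

```python
import numpy as np

# Euler scheme X_{k+1} = X_k + g_{k+1} b(X_k) + sqrt(g_{k+1}) sigma(X_k) xi_{k+1}
# with decreasing steps g_k = g0 * k^{-1/3}, and the step-weighted ergodic
# average sum_k g_k f(X_k) / sum_k g_k as an estimator of pi(f).
rng = np.random.default_rng(3)
b = lambda x: -x                 # drift of an Ornstein-Uhlenbeck toy model
sigma = lambda x: np.sqrt(2.0)   # constant diffusion coefficient
f = lambda x: x ** 2             # pi(f) = 1, since here pi = N(0, 1)

n, g0 = 50_000, 0.5
x, num, den = 0.0, 0.0, 0.0
for k in range(1, n + 1):
    g = g0 * k ** (-1.0 / 3.0)
    x = x + g * b(x) + np.sqrt(g) * sigma(x) * rng.standard_normal()
    num += g * f(x)
    den += g
print(num / den)  # close to pi(f) = 1
```

The slow convergence of such averages, driven by the strong serial correlation of the $f(X_k)$, is precisely what motivates the variance reduction scheme developed below.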
As a matter of fact, in many cases there is no explicit formula for the invariant measure $\pi$, and this makes the use of gradient-based variance reduction techniques (see e.g. ) impossible in this case. On the contrary, our method can be directly used to reduce the variance of the ergodic estimator without explicit knowledge of $\pi$.
3 Martingale representation and variance reduction
In this section we give a general discrete-time martingale representation for the Markov chain (1), which is used below to construct an efficient variance reduction algorithm. Let $(\varphi_j)_{j \ge 0}$ be a complete orthonormal system in $L_2(P_\xi)$, where $P_\xi$ denotes the common distribution of the i.i.d. innovations in (1), with $\varphi_0 \equiv 1$. In particular, we have
$$
\mathbb{E}\big[\varphi_i(\xi)\,\varphi_j(\xi)\big] = \delta_{ij}
$$
with $\xi \sim P_\xi$. Notice that this implies that the random variables $\varphi_j(\xi)$, $j \ge 1$, are centered. As an example, we can take multivariate Hermite polynomials for the ULA algorithm and, for MALA, a tensor product of shifted Legendre polynomials for the "uniform part" and Hermite polynomials for the "Gaussian part" of the random variable $\xi$, as the shifted Legendre polynomials are orthogonal with respect to the Lebesgue measure on $[0,1]$.
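As a sanity check (not part of the paper's development), the orthonormality of the normalized probabilists' Hermite polynomials under the standard Gaussian law, and hence the zero mean of the degrees $\ge 1$, can be verified numerically via Gauss-Hermite quadrature; the node count and maximal degree below are arbitrary choices.

```python
import numpy as np
from math import factorial, sqrt, pi

# Normalized (probabilists') Hermite polynomials H_j = He_j / sqrt(j!) form an
# orthonormal system in L2 of the standard Gaussian measure; check via
# Gauss-HermiteE quadrature (weight e^{-x^2/2}), renormalized to N(0, 1).
deg = 5
nodes, weights = np.polynomial.hermite_e.hermegauss(20)
weights = weights / sqrt(2.0 * pi)

def H(j, x):
    c = np.zeros(j + 1)
    c[j] = 1.0
    return np.polynomial.hermite_e.hermeval(x, c) / sqrt(factorial(j))

gram = np.array([[np.sum(weights * H(i, nodes) * H(j, nodes))
                  for j in range(deg)] for i in range(deg)])
print(np.round(gram, 6))  # the 5x5 identity matrix
```

The first row of the Gram matrix is $(1, 0, 0, \ldots)$, confirming that $\mathbb{E}[\varphi_j(\xi)] = \mathbb{E}[\varphi_0(\xi)\varphi_j(\xi)] = 0$ for $j \ge 1$.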
In the sequel, we denote by $(\mathcal{F}_k)_{k \ge 0}$ the filtration generated by $(\xi_k)_{k \ge 1}$, with the convention that $\mathcal{F}_0$ is the trivial $\sigma$-algebra.
Let be a Borel function such that , with a Markov chain defined in (1). Then, for , the following representation holds in
Proof The expansion obviously holds for and . Indeed, due to the orthonormality and completeness of the system , we have
provided Recall that and . Suppose that (5) holds for , all , and all Borel-measurable functions with . Let us prove it for . Given with , due to the orthonormality and completeness of the system , we get by conditioning on ,
By the Markov property of , we have . Furthermore, a calculation involving intermediate conditioning on and the recurrence relation verifies that
for suitably chosen Borel-measurable functions . We thus arrive at
which is the required statement in the case . Now assume . The random variable is square integrable and has the form , hence the induction hypothesis applies, and we get
From a numerical point of view, another representation of the coefficients turns out to be more useful.
The coefficients in (6) can be alternatively represented as
with The functions can be computed by the backward recurrence: and for
Next we show how the representation (5) can be used to efficiently reduce the variance of MCMC algorithms. Let be a sequence of positive and non-increasing step sizes with and, for , , we set
Consider a weighted average estimator of the form
where  is the length of the burn-in period and  is the number of effective samples. Given  and  as above, for , denote
Since $\xi_{k+1}$ is independent of $\mathcal{F}_k$, the r.v. defined above has zero mean and can be viewed as a control variate.
4 Analysis of variance reduced ULA
The representation (6) suggests that the variance of the variance-reduced estimator
should be small if sufficiently many terms of the representation are used. In this section we provide an analysis of the variance-reduced ULA algorithm (see Example 1). We shall use the notations  and . By $H_k$, $k \in \mathbb{N}_0$, we denote the normalized (probabilists') Hermite polynomial on $\mathbb{R}$, that is,
$$
H_k(x) = \frac{(-1)^k}{\sqrt{k!}}\, e^{x^2/2} \frac{d^k}{dx^k} e^{-x^2/2}.
$$
For a multi-index $\mathbf{k} = (k_1, \ldots, k_d) \in \mathbb{N}_0^d$, $H_{\mathbf{k}}$ denotes the normalized Hermite polynomial on $\mathbb{R}^d$, that is,
$$
H_{\mathbf{k}}(x) = \prod_{i=1}^{d} H_{k_i}(x_i), \qquad x = (x_1, \ldots, x_d).
$$
In what follows, we also use the notation for , and we set , , and . Given and as above, for , denote
For an estimator of (see (10) and (13)), we are interested in its conditional Mean Squared Error (MSE), which can be decomposed as the sum of the squared conditional bias and the conditional variance:
The quantities in (4) are conditioned on , as it reflects the way the estimators are used for MCMC: the path of the Markov chain is simulated only once, and we start to use the realized values of the Markov chain to construct our estimate only after the burn-in period of length . We also notice that, due to the Markovian structure, the conditioning on in (4) is equivalent to conditioning on only (this is particularly clear in the case but requires some calculation in the remaining case ).
4.1 Squared conditional bias
Due to the martingale transform structure of , we have
Hence both estimators have the same conditional bias. Notice that this also remains true when we substitute the coefficients in (14) with some independent approximations. For a bounded Borel function $f$, we can estimate the conditional bias similarly to the beginning of [12, Section 4]:
where $\|\mu - \nu\|_{\mathrm{TV}}$ denotes the total variation distance between probability measures $\mu$ and $\nu$, that is,
$$
\|\mu - \nu\|_{\mathrm{TV}} = \sup_{A \in \mathcal{B}(\mathbb{R}^d)} \big|\mu(A) - \nu(A)\big|;
$$
denotes the probability measure on with density of (3); for , the Markov kernel from into is defined by
while, for , , the kernel is
which, finally, provides the (random) measure used in the right-hand side of (16).
4.2 Conditional variance
An upper bound for the variance of the classical estimator (10) is provided in [12, Theorem 17]. As for the estimator (13), we introduce a lemma giving a conditional variance decomposition. It follows from (14), Theorem 1, and the independence of the sequence .
Now we present an upper bound for the right-hand sides of (17) in the case of the ULA algorithm.
Fix . Suppose that and are times continuously differentiable and it holds
for all with ,
for some positive number, where  stands for the Jacobian of . Let  be a sequence of positive and non-increasing step sizes with . We assume that  and that
with some constant . Then it holds
where the sum in (19) is taken over all subsets of and stands for inequality up to a constant not depending on and
5 Nonparametric regression
The functions  need to be estimated before one can apply the proposed variance reduction approach. This amounts to solving a nonparametric regression problem. We present a generic regression algorithm. The algorithm starts with estimating the functions  for , where
We first generate paths, conditionally independent of the $\sigma$-algebra generated by the burn-in sequence:
of the chain (the so-called “training paths”). The regression algorithm proceeds with estimating the functions by solving the least squares optimization problems
for , where  is a class of functions on . Next, we estimate the coefficients  using the formula
where  is a random variable independent of  with law ,  is defined in (1), and the expectation is taken with respect to the known distribution . Conditioning on  underlines that the estimates are derived from the training trajectories. In many cases, the integration can be performed in closed analytical form.
For example, in the case of the ULA algorithm, the one-step transition distribution is multivariate normal. Therefore, if we use polynomials to approximate the regression functions, the coefficients can even be computed in closed form; implementation details are provided in Section 7.
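A minimal sketch of such a closed-form computation, under illustrative assumptions (a one-dimensional ULA step with drift $\mu$ and step size $\gamma$ of our own choosing): the one-step conditional law is Gaussian with mean $x + \gamma\mu(x)$ and variance $2\gamma$, so expectations of polynomials reduce to Gaussian moments via the binomial theorem.

```python
import numpy as np
from math import factorial

def gaussian_poly_expectation(coeffs, m, s):
    """E[p(m + s Z)] for Z ~ N(0, 1) and a polynomial p given by its monomial
    coefficients (low degree first): binomial expansion plus Gaussian moments."""
    def z_moment(j):  # E[Z^j]: zero for odd j, (j - 1)!! for even j
        return 0.0 if j % 2 else factorial(j) // (2 ** (j // 2) * factorial(j // 2))
    total = 0.0
    for deg, c in enumerate(coeffs):
        for j in range(deg + 1):
            binom = factorial(deg) // (factorial(j) * factorial(deg - j))
            total += c * binom * m ** (deg - j) * s ** j * z_moment(j)
    return total

# One ULA step from state x: X' = x + gamma * mu(x) + sqrt(2 gamma) Z.
gamma, mu, x = 0.1, lambda t: -t, 1.5
m, s = x + gamma * mu(x), np.sqrt(2.0 * gamma)
p = [0.0, 0.0, 1.0]  # p(y) = y^2, so E[p(X')] = m^2 + s^2
print(gaussian_poly_expectation(p, m, s), m ** 2 + s ** 2)
```

No Monte Carlo error enters this step: the conditional expectation of any polynomial of the next state is exact, which is the advantage of polynomial function classes alluded to above.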
Upon estimating the coefficients , one can construct an empirical estimate of in the form
Obviously, the resulting quantity has zero mean, and it is indeed a valid control variate in that it does not introduce any bias.
6 Complexity analysis for ULA
The following result quantifies the error of estimating the functions  via the above algorithm; its proof follows from Theorem 2.2 in .
Suppose that for any
with probability , for some finite positive numbers . Furthermore, assume that , where the functions  are uniformly bounded and satisfy
Then for any values of and such that and
it holds with probability at least
for with standing for inequality up to a universal multiplicative constant.
Now we are able to give a bound for the difference between and
Under conditions of Theorem 5, we have with probability at least
Proof Using the conditional Cauchy-Schwarz inequality and orthonormality of , we derive
By the Jensen inequality and orthonormality of