Variance reduction for MCMC methods via martingale representations

03/18/2019 ∙ by D. Belomestny, et al. ∙ Universität Duisburg-Essen, Higher School of Economics, École Polytechnique

In this paper we propose an efficient variance reduction approach for MCMC algorithms relying on a novel discrete time martingale representation for Markov chains. Our approach is fully non-asymptotic and does not require any type of ergodicity or special product structure of the underlying density. By rigorously analyzing the convergence of the proposed algorithm, we show that its complexity is indeed significantly smaller than that of the original MCMC algorithm. The numerical performance of the new method is illustrated in the case of Gaussian mixtures and binary regression.


1 Introduction

Monte Carlo integration typically has an error variance of the form $\sigma^2/n$, where $n$ is the sample size and $\sigma^2$ is the variance of the integrand. We can make the variance smaller by using a larger value of $n$. Alternatively, we can reduce $\sigma^2$ instead of increasing the sample size $n$. To this end, one can try to construct a new Monte Carlo experiment with the same expectation as the original one but with a lower variance $\sigma^2$. Methods to achieve this are known as variance reduction techniques. Variance reduction plays an important role in Monte Carlo and Markov Chain Monte Carlo methods. An introduction to many of the variance reduction techniques can be found in [6], [23], [15] and [14]. Recently there has been a revival of interest in efficient variance reduction methods for MCMC algorithms; see, for example, [8], [20], [5] and references therein.

Suppose that we wish to compute , where

is a random vector-valued in

with a density and with

. The idea of the so-called control variates variance reduction method is to find a cheaply computable random variable

with and such that the variance of the r.v. is small. The complexity of the problem of constructing classes of control variates satisfying essentially depends on the degree of our knowledge on For example, if

is analytically known and satisfies some regularity conditions, one can apply the well-known technique of polynomial interpolation to construct control variates enjoying some optimality properties, see, for example, Section 3.2 in

[9]. Alternatively, if an orthonormal system in is analytically available, one can build control variates as a linear combination of the corresponding basis functions, see [4]. Furthermore, if

is known only up to a normalizing constant (which is often the case in Bayesian statistics), one can apply the recent approach of control variates depending only on the gradient

using Schrdinger-type Hamiltonian operator in [20] and so-called Stein operator in [5]. In some situations is not known analytically, but can be represented as a function of simple random variables with known distribution. Such situation arises, for example, in the case of functionals of discretized diffusion processes. In this case a Wiener chaos-type decomposition can be used to construct control variates with nice theoretical properties, see [3]. Note that in order to compare different variance reduction approaches, one has to analyze their complexity, that is, the number of numerical operations required to achieve a prescribed magnitude of the resulting variance.
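To make the control variate idea concrete, the following minimal sketch (not taken from the paper) uses a standard Gaussian $\pi$ and the Hermite polynomials $x$ and $x^2 - 1$, both of which have zero mean under $\pi$, to remove the variance of the integrand $f(x) = x + x^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.standard_normal(n)          # X ~ pi = N(0, 1)

f = x + x**2                        # integrand, E[f(X)] = 1

# Control variate built from Hermite polynomials H_1(x) = x and H_2(x) = x^2 - 1,
# both of which have zero mean under the standard Gaussian measure.
xi = x + (x**2 - 1)

plain = f.mean()                    # vanilla Monte Carlo estimate
cv = (f - xi).mean()                # control-variate estimate

print(f"plain MC : {plain:.4f} +/- {f.std() / np.sqrt(n):.4f}")
print(f"with CV  : {cv:.4f} +/- {(f - xi).std() / np.sqrt(n):.4f}")
```

Here the control variate reproduces $f$ up to an additive constant, so the residual variance is exactly zero; in practice $\xi$ only approximates $f(X) - \mathbb{E}[f(X)]$ and the variance is reduced rather than eliminated.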

The situation becomes much more difficult in the case of MCMC algorithms, where one has to work with a Markov chain whose marginal distribution only converges to $\pi$ as time grows. One important class of variance reduction methods in this case is based on the so-called Poisson equation for the corresponding Markov chain. It was observed in Henderson [16] that if a time-homogeneous Markov chain $(X_p)_{p \ge 0}$ is stationary with stationary distribution $\pi$, then for any real-valued function $g$ defined on the state space of the Markov chain, the function $g(x) - \mathbb{E}[g(X_1) \mid X_0 = x]$ has zero mean with respect to $\pi$. The best choice for the function $g$ corresponds to a solution of the so-called Poisson equation $g(x) - \mathbb{E}[g(X_1) \mid X_0 = x] = f(x) - \pi(f)$. Moreover, it is also related to the minimal asymptotic variance in the corresponding central limit theorem, see [11] and [20]. Although the Poisson equation involves the quantity of interest $\pi(f)$ and cannot be solved explicitly in most cases, this idea can still be used to construct approximations of the optimal zero-variance control variates. For example, Henderson [16] proposed to compute approximations for the solution of the Poisson equation for specific Markov chains, with particular emphasis on models arising in stochastic network theory. In [8] and [5], series-type control variates are introduced and studied for reversible Markov chains. It is assumed in [8] that the one-step conditional expectations can be computed explicitly for a set of basis functions. The authors in [5] proposed another approach, tailored to the diffusion setting, which does not require the computation of integrals of basis functions and only involves applications of the underlying generator.

In this paper we focus on Langevin-type algorithms, which have received much attention recently, see [7, 12, 17, 22, 21] and references therein. We propose a generic variance reduction method for these and other types of algorithms, which is purely non-asymptotic and requires neither that the conditional expectations of the corresponding Markov chain can be computed nor that the generator is known analytically. Moreover, we do not need to assume stationarity or sampling under the invariant distribution $\pi$. We rigorously analyze the convergence of the method and study its complexity. It is shown that our variance-reduced Langevin algorithm outperforms the standard Langevin algorithms in terms of complexity.

The paper is organized as follows. In Section 2 we set up the problem and introduce some notations. Section 3 contains a novel martingale representation and shows how this representation can be used for variance reduction. In Section 4 we analyze the performance of the proposed variance reduction algorithm in the case of the Unadjusted Langevin Algorithm (ULA). Section 5 presents a generic nonparametric regression algorithm for estimating the coefficients of the control variates. Section 6 studies the complexity of the variance-reduced ULA. Finally, numerical examples are presented in Section 7.

2 Setup

Let $\mathsf{X}$ be a domain in $\mathbb{R}^d$. Our aim is to numerically compute expectations of the form

$\pi(f) := \int_{\mathsf{X}} f(x)\,\pi(dx),$

where $f\colon \mathsf{X} \to \mathbb{R}$ and $\pi$ is a probability measure supported on $\mathsf{X}$. If the dimension $d$ of the space is large and $\pi(f)$ cannot be computed analytically, one can apply Monte Carlo methods. However, in many practical situations direct sampling from $\pi$ is impossible, and this precludes the use of plain Monte Carlo methods in this case. One popular alternative to Monte Carlo is Markov Chain Monte Carlo (MCMC), where one looks for a discrete time (possibly non-homogeneous) Markov chain $(X_p)_{p \ge 0}$ such that $\pi$ is its unique invariant measure. In this paper we study a class of MCMC algorithms with $(X_p)_{p \ge 0}$ satisfying the following recurrence relation:

$X_{p+1} = \Phi_{p+1}(X_p, \xi_{p+1}), \quad p = 0, 1, 2, \ldots,$   (1)

for some i.i.d. random vectors $(\xi_p)_{p \ge 1}$ with distribution $P_\xi$ and some Borel-measurable functions $(\Phi_p)_{p \ge 1}$. In fact, this is quite a general class of Markov chains (see Theorem 1.3.6 in [10]) and many well-known MCMC algorithms can be represented in the form (1). Let us consider two popular examples.

Example 1 (Unadjusted Langevin Algorithm)

Fix a sequence of positive time steps $(\gamma_p)_{p \ge 1}$. Given a Borel function $\mu\colon \mathbb{R}^d \to \mathbb{R}^d$, consider a non-homogeneous discrete-time Markov chain $(X_p)_{p \ge 0}$ defined by

$X_{p+1} = X_p - \gamma_{p+1}\,\mu(X_p) + \sqrt{2\gamma_{p+1}}\, Z_{p+1}, \quad p = 0, 1, \ldots,$   (2)

where $(Z_p)_{p \ge 1}$ is an i.i.d. sequence of $d$-dimensional standard Gaussian random vectors. If $\mu = \nabla U$ for some continuously differentiable function $U\colon \mathbb{R}^d \to \mathbb{R}$, then the Markov chain (2) can be used to approximately sample from the density

$\pi(x) \propto e^{-U(x)},$   (3)

provided that $\int_{\mathbb{R}^d} e^{-U(x)}\,dx$ is finite. This method is usually referred to as the Unadjusted Langevin Algorithm (ULA). In fact, the Markov chain (2) arises as the Euler-Maruyama discretization of the Langevin diffusion

$dY_t = -\mu(Y_t)\,dt + \sqrt{2}\,dW_t$

with nonnegative time steps $(\gamma_p)$, and, under mild technical conditions, the latter Langevin diffusion admits $\pi$ of (3) as its unique invariant distribution; see [7] and [12].
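As an illustration, here is a minimal Python sketch of the recursion (2), not the paper's code; the quadratic potential and the step-size schedule are illustrative choices:

```python
import numpy as np

def ula_chain(grad_u, x0, n_steps, gamma, rng):
    """Simulate the ULA recursion (2):
    X_{p+1} = X_p - gamma_{p+1} * grad_u(X_p) + sqrt(2 * gamma_{p+1}) * Z_{p+1}."""
    d = len(x0)
    xs = np.empty((n_steps + 1, d))
    xs[0] = x0
    for p in range(n_steps):
        z = rng.standard_normal(d)
        xs[p + 1] = xs[p] - gamma[p] * grad_u(xs[p]) + np.sqrt(2 * gamma[p]) * z
    return xs

# Toy example: U(x) = ||x||^2 / 2, so pi = N(0, I_d) and grad U(x) = x.
rng = np.random.default_rng(1)
d, n_steps = 2, 5000
gamma = 0.1 / np.sqrt(np.arange(1, n_steps + 1))   # decreasing steps (illustrative)
xs = ula_chain(lambda x: x, np.zeros(d), n_steps, gamma, rng)
print("empirical mean after burn-in:", xs[1000:].mean(axis=0))
```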

Example 2 (Metropolis-Adjusted Langevin Algorithm)

The Metropolis-Hastings algorithm associated with a target density $\pi$ requires the choice of a sequence of conditional densities $(q_p)_{p \ge 1}$, also called proposal or candidate kernels. The transition from the value of the Markov chain at time $p$ to its value at time $p+1$ proceeds via the following transition step:

Given $X_p$,
  1. Generate $Y_{p+1} \sim q_{p+1}(\cdot \mid X_p)$;
  2. Put $X_{p+1} = Y_{p+1}$ with probability $\alpha(X_p, Y_{p+1})$ and $X_{p+1} = X_p$ otherwise, where

$\alpha(x, y) = \min\Bigl(1, \frac{\pi(y)\, q(x \mid y)}{\pi(x)\, q(y \mid x)}\Bigr).$

Then, as shown in Metropolis et al. [19], this transition is reversible with respect to $\pi$ and therefore preserves the stationary density $\pi$. If the proposal kernels have a wide enough support to eventually reach any region of the state space with positive mass under $\pi$, then this transition is irreducible and $\pi$ is a maximal irreducibility measure [18]. The Metropolis-Adjusted Langevin Algorithm (MALA) takes (2) as proposal; that is, $q_{p+1}(\cdot \mid x)$ is the density of the Gaussian distribution with mean $x - \gamma_{p+1}\mu(x)$ and covariance matrix $2\gamma_{p+1} I_d$. The MALA algorithms usually provide noticeable speed-ups in convergence for most problems. It is not difficult to see that the MALA chain can be compactly represented in the form

$X_{p+1} = \Phi_{p+1}(X_p, \xi_{p+1}), \quad \xi_{p+1} = (Z_{p+1}, U_{p+1}),$

where $(U_p)_{p \ge 1}$ is an i.i.d. sequence of random variables uniformly distributed on $[0, 1]$ and independent of $(Z_p)_{p \ge 1}$. Thus we recover (1) with $\xi_p = (Z_p, U_p)$ and $\Phi_{p+1}$ encoding both the Gaussian proposal and the accept/reject decision.
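A minimal sketch of one MALA transition, written as a deterministic function of the innovations $(Z, U)$ so that it fits the representation (1); the potential and step size in the usage example are illustrative:

```python
import numpy as np

def mala_step(x, u_pot, grad_u, gamma, rng):
    """One MALA transition: Langevin proposal (2) plus Metropolis-Hastings
    accept/reject, i.e. a deterministic map of (x, z, u) as in (1)."""
    d = len(x)
    z = rng.standard_normal(d)
    u = rng.uniform()
    y = x - gamma * grad_u(x) + np.sqrt(2 * gamma) * z   # proposal from (2)

    def log_q(b, a):  # log density of the Gaussian proposal q(b | a), up to a constant
        m = a - gamma * grad_u(a)
        return -np.sum((b - m) ** 2) / (4 * gamma)

    log_alpha = (-u_pot(y) + u_pot(x)) + log_q(x, y) - log_q(y, x)
    return y if np.log(u) <= min(0.0, log_alpha) else x

rng = np.random.default_rng(2)
x = np.zeros(2)
for _ in range(5000):   # toy target: U(x) = ||x||^2 / 2, i.e. pi = N(0, I_2)
    x = mala_step(x, lambda v: 0.5 * v @ v, lambda v: v, 0.5, rng)
print("final state:", x)
```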

Example 3

Let $(Y_t)_{t \ge 0}$ be the unique strong solution to an SDE of the form

$dY_t = b(Y_t)\,dt + \sigma(Y_t)\,dW_t,$   (4)

where $(W_t)_{t \ge 0}$ is a standard $\mathbb{R}^m$-valued Brownian motion, and $b\colon \mathbb{R}^d \to \mathbb{R}^d$, $\sigma\colon \mathbb{R}^d \to \mathbb{R}^{d \times m}$ are locally Lipschitz continuous functions with at most linear growth. The process $(Y_t)_{t \ge 0}$ is a Markov process; let $A$ denote its infinitesimal generator, defined by

$Af(x) = \langle b(x), \nabla f(x) \rangle + \tfrac{1}{2}\operatorname{tr}\bigl(\sigma(x)\sigma^\top(x)\nabla^2 f(x)\bigr)$

for any $f \in C^2(\mathbb{R}^d)$. If there exists a twice continuously differentiable Lyapunov function $V$ such that $AV \le C_1 - C_2 V$ for some constants $C_1 \ge 0$ and $C_2 > 0$, then there is an invariant probability measure $\pi$ for $(Y_t)_{t \ge 0}$, that is, $\int P_t f\, d\pi = \int f\, d\pi$ for all $t \ge 0$, where $(P_t)$ denotes the transition semigroup. Invariant measures are crucial in the study of the long term behaviour of the stochastic differential systems (4). Under some additional assumptions, the invariant measure $\pi$ is ergodic, and this property can be exploited to compute the integrals $\pi(f)$ by means of ergodic averages. The idea is to replace the diffusion by a (simulable) discretization scheme of the form (see e.g. [22])

$X_{p+1} = X_p + \gamma_{p+1}\, b(X_p) + \sqrt{\gamma_{p+1}}\, \sigma(X_p)\, \xi_{p+1},$

where $(\xi_p)_{p \ge 1}$ are i.i.d. standard Gaussian vectors and $(\gamma_p)_{p \ge 1}$ is a non-increasing sequence of time steps. Then, for a function $f$, we can approximate $\pi(f)$ via the weighted ergodic average

$\pi_n(f) = \frac{\sum_{p=1}^{n} \gamma_p\, f(X_{p-1})}{\sum_{p=1}^{n} \gamma_p}.$

Due to the typically high correlation between the successive states $X_p$, variance reduction is of crucial importance here. As a matter of fact, in many cases there is no explicit formula for the invariant measure $\pi$, and this makes the use of gradient-based variance reduction techniques (see e.g. [20]) impossible in this case. On the contrary, our method can be directly used to reduce the variance of the ergodic estimator without explicit knowledge of $\pi$.
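A minimal sketch of this Euler scheme with decreasing steps and the step-weighted ergodic average; the Ornstein-Uhlenbeck test case and the schedule $\gamma_p = p^{-1/2}$ are illustrative choices, not the paper's setup:

```python
import numpy as np

def ergodic_average(b, sigma, f, x0, n_steps, rng):
    """Euler scheme for the SDE (4) with decreasing steps gamma_p = p^{-1/2}
    and the step-weighted ergodic average of f."""
    x = np.array(x0, dtype=float)
    num, den = 0.0, 0.0
    for p in range(1, n_steps + 1):
        g = p ** -0.5
        num += g * f(x)                # accumulate gamma_p * f(X_{p-1})
        den += g
        z = rng.standard_normal(len(x))
        x = x + g * b(x) + np.sqrt(g) * sigma(x) @ z
    return num / den

# Ornstein-Uhlenbeck toy case: dY = -Y dt + sqrt(2) dW, invariant law N(0, I_2).
rng = np.random.default_rng(3)
est = ergodic_average(lambda x: -x, lambda x: np.sqrt(2) * np.eye(2),
                      lambda x: x @ x, np.zeros(2), 20000, rng)
print("estimate of E|Y|^2 (true value 2):", est)
```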

3 Martingale representation and variance reduction

In this section we give a general discrete-time martingale representation for the Markov chain (1), which is used below to construct an efficient variance reduction algorithm. Let $(G_k)_{k \ge 0}$ be a complete orthonormal system in $L^2(P_\xi)$ with $G_0 \equiv 1$. In particular, we have

$\mathbb{E}[G_k(\xi)\, G_l(\xi)] = \delta_{kl}, \quad \xi \sim P_\xi,$

with $\delta_{kl}$ the Kronecker delta. Notice that this implies that the random variables $G_k(\xi)$, $k \ge 1$, are centered. As an example, we can take multivariate Hermite polynomials for the ULA algorithm, and a tensor product of shifted Legendre polynomials for the "uniform part" and Hermite polynomials for the "Gaussian part" of the random variable $\xi_p = (Z_p, U_p)$ in MALA, as the shifted Legendre polynomials are orthogonal with respect to the Lebesgue measure on $[0, 1]$.

In the sequel, we denote by $(\mathcal{F}_p)_{p \ge 0}$ the filtration generated by $(\xi_p)_{p \ge 1}$, with the convention that $\mathcal{F}_0$ is the trivial $\sigma$-algebra.

Theorem 1

Let $f$ be a Borel function such that $\mathbb{E}[f^2(X_n)] < \infty$, with $(X_p)_{p \ge 0}$ a Markov chain defined in (1). Then, for $0 \le p \le n$, the following representation holds in $L^2$:

$f(X_n) = \mathbb{E}[f(X_n) \mid \mathcal{F}_p] + \sum_{l=p+1}^{n} \sum_{k \ge 1} a_{l,k}(X_{l-1})\, G_k(\xi_l),$   (5)

where

$a_{l,k}(x) = \mathbb{E}\bigl[f(X_n)\, G_k(\xi_l) \mid X_{l-1} = x\bigr].$   (6)

Proof The expansion obviously holds for $p = n$ and $p = n-1$. Indeed, due to the orthonormality and completeness of the system $(G_k)_{k \ge 0}$, we have

$f(X_n) = f(\Phi_n(X_{n-1}, \xi_n)) = \sum_{k \ge 0} a_{n,k}(X_{n-1})\, G_k(\xi_n)$

with

$a_{n,k}(x) = \mathbb{E}[f(\Phi_n(x, \xi))\, G_k(\xi)], \quad \xi \sim P_\xi,$

provided $\mathbb{E}[f^2(X_n)] < \infty$. Recall that $G_0 \equiv 1$ and $a_{n,0}(X_{n-1}) = \mathbb{E}[f(X_n) \mid \mathcal{F}_{n-1}]$. Suppose that (5) holds for $p+1$, all $n \ge p+1$, and all Borel-measurable functions $f$ with $\mathbb{E}[f^2(X_n)] < \infty$. Let us prove it for $p$. Given $f$ with $\mathbb{E}[f^2(X_n)] < \infty$, due to the orthonormality and completeness of the system $(G_k)_{k \ge 0}$, we get by conditioning on $\mathcal{F}_p$,

$\mathbb{E}[f(X_n) \mid \mathcal{F}_{p+1}] = \sum_{k \ge 0} b_k(X_p)\, G_k(\xi_{p+1}),$

where

$b_k(X_p) = \mathbb{E}\bigl[\mathbb{E}[f(X_n) \mid \mathcal{F}_{p+1}]\, G_k(\xi_{p+1}) \mid \mathcal{F}_p\bigr].$

By the Markov property of $(X_p)_{p \ge 0}$, we have $b_0(X_p) = \mathbb{E}[f(X_n) \mid \mathcal{F}_p]$. Furthermore, a calculation involving intermediate conditioning on $\mathcal{F}_{p+1}$ and the recurrence relation (1) verifies that $b_k(X_p) = a_{p+1,k}(X_p)$ for the suitably chosen Borel-measurable functions $a_{p+1,k}$ of (6). We thus arrive at

$\mathbb{E}[f(X_n) \mid \mathcal{F}_{p+1}] = \mathbb{E}[f(X_n) \mid \mathcal{F}_p] + \sum_{k \ge 1} a_{p+1,k}(X_p)\, G_k(\xi_{p+1}),$   (7)

which is the required statement in the case $n = p+1$. Now assume $n > p+1$. The random variable $f(X_n)$ is square integrable, hence the induction hypothesis applies with $p+1$ in place of $p$, and we get

$f(X_n) = \mathbb{E}[f(X_n) \mid \mathcal{F}_{p+1}] + \sum_{l=p+2}^{n} \sum_{k \ge 1} a_{l,k}(X_{l-1})\, G_k(\xi_l).$   (8)

Combining formulas (7) and (8) concludes the proof.


From a numerical point of view, another representation of the coefficients $a_{l,k}$ turns out to be more useful.

Proposition 2

The coefficients $a_{l,k}$ in (6) can be alternatively represented as

$a_{l,k}(x) = \mathbb{E}\bigl[q_l(\Phi_l(x, \xi))\, G_k(\xi)\bigr], \quad \xi \sim P_\xi,$

with $q_l(x) = \mathbb{E}[f(X_n) \mid X_l = x]$. The functions $q_l$ can be computed by the backward recurrence: $q_n = f$ and, for $l = n-1, \ldots, 0$,

$q_l(x) = \mathbb{E}\bigl[q_{l+1}(\Phi_{l+1}(x, \xi))\bigr], \quad \xi \sim P_\xi.$   (9)
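A minimal numerical sketch of the backward recurrence (9), assuming a one-dimensional chain of the form (1) driven by Gaussian innovations; the grid-plus-interpolation discretization and all names are illustrative, not the paper's algorithm:

```python
import numpy as np

def backward_q_grid(f, phi, gammas, xgrid, n_mc=2000, seed=0):
    """Backward recurrence (9) on a 1-d grid: q_n = f and
    q_l(x) = E[ q_{l+1}(Phi_{l+1}(x, Z)) ], Z ~ N(0, 1), approximated by
    Monte Carlo in Z and linear interpolation in x (illustrative only)."""
    rng = np.random.default_rng(seed)
    q_vals = f(xgrid)                        # values of q_n on the grid
    for g in gammas[::-1]:                   # l = n-1, ..., 0
        z = rng.standard_normal(n_mc)
        new = np.empty_like(q_vals)
        for i, x in enumerate(xgrid):
            y = phi(x, z, g)                 # Phi_{l+1}(x, z), vectorized in z
            new[i] = np.interp(y, xgrid, q_vals).mean()
        q_vals = new
    return q_vals                            # approximation of q_0 on the grid

# Toy usage with the ULA map Phi(x, z) = x - g * x + sqrt(2 g) * z and f(x) = x^2.
grid = np.linspace(-5, 5, 101)
q0 = backward_q_grid(lambda x: x**2, lambda x, z, g: x - g * x + np.sqrt(2 * g) * z,
                     [0.1] * 10, grid)
```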

Next we show how the representation (5) can be used to efficiently reduce the variance of MCMC algorithms. Let $(\gamma_p)_{p \ge 1}$ be a sequence of positive and non-increasing step sizes and, for $n_0 < p \le n_0 + n$, set

$\omega_p = \frac{\gamma_p}{\sum_{q=n_0+1}^{n_0+n} \gamma_q}.$

Consider a weighted average estimator of the form

$\pi_{n_0,n}(f) = \sum_{p=n_0+1}^{n_0+n} \omega_p\, f(X_p),$   (10)

where $n_0$ is the length of the burn-in period and $n$ the number of effective samples. Given $n_0$ and $n$ as above, for $K \in \mathbb{N}$, denote

$M_{n_0,n}^{(K)} = \sum_{p=n_0+1}^{n_0+n} \omega_p \sum_{l=n_0+1}^{p} \sum_{k=1}^{K} a_{l,k}^{(p)}(X_{l-1})\, G_k(\xi_l),$   (11)

where

$a_{l,k}^{(p)}(x) = \mathbb{E}\bigl[f(X_p)\, G_k(\xi_l) \mid X_{l-1} = x\bigr].$   (12)

Since $G_k(\xi_l)$, $k \ge 1$, is independent of $\mathcal{F}_{l-1}$ and centered, the r.v. $M_{n_0,n}^{(K)}$ has zero mean and can be viewed as a control variate.
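The following sketch assembles the truncated control variate (11) from user-supplied ingredients; the interfaces `coeff` and `basis` are hypothetical placeholders for the coefficient functions and the orthonormal basis, not the paper's API:

```python
def control_variate(coeff, basis, xs, xis, weights, n0, K):
    """Truncated control variate (11):
    M = sum_p w_p * sum_{l=n0+1}^{p} sum_{k=1}^{K} a^{(p)}_{l,k}(X_{l-1}) * G_k(xi_l),
    where coeff(p, l, k) returns the function a^{(p)}_{l,k} and basis(k, xi)
    evaluates G_k at the innovation xi. The sequences are indexed so that
    xs[l] is X_l and xis[l] is xi_l (xis[0] is unused)."""
    n = len(weights)
    m = 0.0
    for j, p in enumerate(range(n0 + 1, n0 + n + 1)):
        for l in range(n0 + 1, p + 1):
            for k in range(1, K + 1):
                m += weights[j] * coeff(p, l, k)(xs[l - 1]) * basis(k, xis[l])
    return m
```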

4 Analysis of variance-reduced ULA

The representation (6) suggests that the variance of the variance-reduced estimator

$\pi_{n_0,n}^{(K)}(f) = \pi_{n_0,n}(f) - M_{n_0,n}^{(K)}$   (13)

should be small for $K$ large enough. In this section we provide an analysis of the variance-reduced ULA algorithm (see Example 1). By $H_k$, $k \in \mathbb{N}_0$, we denote the normalized Hermite polynomial on $\mathbb{R}$, that is,

$H_k(x) = \frac{(-1)^k}{\sqrt{k!}}\, e^{x^2/2}\, \frac{d^k}{dx^k}\, e^{-x^2/2}.$

For a multi-index $\mathbf{k} = (k_1, \ldots, k_d) \in \mathbb{N}_0^d$, $H_{\mathbf{k}}$ denotes the normalized Hermite polynomial on $\mathbb{R}^d$, that is,

$H_{\mathbf{k}}(x) = \prod_{j=1}^{d} H_{k_j}(x_j).$

In what follows, we also use the notation $|\mathbf{k}| = k_1 + \cdots + k_d$ for $\mathbf{k} \in \mathbb{N}_0^d$. Given $n_0$ and $n$ as above, for $K \in \mathbb{N}$, denote

$M_{n_0,n}^{(K)} = \sum_{p=n_0+1}^{n_0+n} \omega_p \sum_{l=n_0+1}^{p} \sum_{1 \le |\mathbf{k}| \le K} a_{l,\mathbf{k}}^{(p)}(X_{l-1})\, H_{\mathbf{k}}(Z_l),$   (14)

with $a_{l,\mathbf{k}}^{(p)}(x) = \mathbb{E}\bigl[f(X_p)\, H_{\mathbf{k}}(Z_l) \mid X_{l-1} = x\bigr]$.
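A short sketch (assuming the normalization above) of the one-dimensional and tensor-product normalized Hermite polynomials, with a Monte Carlo check of orthonormality:

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def hermite_norm(k, x):
    """Normalized probabilists' Hermite polynomial H_k = He_k / sqrt(k!),
    orthonormal w.r.t. the standard Gaussian measure on R."""
    coeffs = np.zeros(k + 1)
    coeffs[k] = 1.0
    return hermeval(x, coeffs) / math.sqrt(math.factorial(k))

def hermite_multi(k_multi, x):
    """Tensor-product Hermite polynomial H_k(x) = prod_j H_{k_j}(x_j) on R^d."""
    return np.prod([hermite_norm(kj, xj) for kj, xj in zip(k_multi, x)])

# Monte Carlo sanity check of orthonormality: E[H_2(Z)^2] should be close to 1.
z = np.random.default_rng(4).standard_normal(200_000)
print(np.mean(hermite_norm(2, z) ** 2))
```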

For an estimator $\hat{\pi}$ of $\pi(f)$ (see (10) and (13)), we are interested in its conditional Mean Squared Error (MSE), which can be decomposed as the sum of the squared conditional bias and the conditional variance:

$\mathrm{MSE} = \bigl(\mathbb{E}[\hat{\pi} \mid \mathcal{F}_{n_0}] - \pi(f)\bigr)^2 + \operatorname{Var}[\hat{\pi} \mid \mathcal{F}_{n_0}].$   (15)

The quantities in (15) are conditioned on $\mathcal{F}_{n_0}$, as this reflects the way the estimators are used in MCMC: the path of the Markov chain is simulated only once, and we start to use the realized values of the Markov chain to construct our estimate only after the burn-in period of length $n_0$. We also notice that, due to the Markovian structure, the conditioning on $\mathcal{F}_{n_0}$ in (15) is equivalent to conditioning on $X_{n_0}$ only (this is particularly clear in the case of (10) but requires some calculation in the remaining case (13)).

4.1 Squared conditional bias

Due to the martingale transform structure of $M_{n_0,n}^{(K)}$, we have

$\mathbb{E}\bigl[M_{n_0,n}^{(K)} \mid \mathcal{F}_{n_0}\bigr] = 0.$

Hence both estimators (10) and (13) have the same conditional bias. Notice that this remains true also when we substitute the coefficients $a_{l,\mathbf{k}}^{(p)}$ in (14) with some independent approximations $\hat{a}_{l,\mathbf{k}}^{(p)}$. For a bounded Borel function $f$, we can estimate the conditional bias similarly to the beginning of [12, Section 4]:

$\bigl|\mathbb{E}[\pi_{n_0,n}(f) \mid \mathcal{F}_{n_0}] - \pi(f)\bigr| \le \|f\|_\infty \sum_{p=n_0+1}^{n_0+n} \omega_p\, \bigl\|\delta_{X_{n_0}} Q_{n_0,p} - \pi\bigr\|_{TV},$   (16)

where $\|\mu - \nu\|_{TV}$ denotes the total variation distance between probability measures $\mu$ and $\nu$, that is,

$\|\mu - \nu\|_{TV} = 2 \sup_{A} |\mu(A) - \nu(A)|;$

$\pi$ denotes the probability measure on $\mathbb{R}^d$ with the density (3); for $\gamma > 0$, the Markov kernel $R_\gamma$ from $\mathbb{R}^d$ into $\mathbb{R}^d$ is defined by

$R_\gamma(x, A) = \mathbb{P}\bigl(x - \gamma\,\mu(x) + \sqrt{2\gamma}\, Z \in A\bigr), \quad Z \sim \mathcal{N}(0, I_d),$

while, for $0 \le p < q$, the kernel $Q_{p,q}$ is

$Q_{p,q} = R_{\gamma_{p+1}} \cdots R_{\gamma_q},$

which, finally, provides the (random) measure $\delta_{X_{n_0}} Q_{n_0,p}$ used in the right-hand side of (16).

Different specific upper bounds for the squared bias can be deduced from (16) using the results of Section 3 in [12] on bounds in the total variation distance.

4.2 Conditional variance

An upper bound for the variance of the classical estimator (10) is provided in [12, Theorem 17]. As for the estimator (13), we first state a lemma giving a decomposition of its conditional variance. It follows from (14), Theorem 1 and the independence of the sequence $(Z_p)_{p \ge 1}$.

Lemma 3

For the variance-reduced estimator (13), defined for the Markov chain (1), the following representation holds:

$\operatorname{Var}\bigl[\pi_{n_0,n}^{(K)}(f) \mid \mathcal{F}_{n_0}\bigr] = \sum_{l=n_0+1}^{n_0+n} \sum_{|\mathbf{k}| > K} \mathbb{E}\Bigl[\Bigl(\sum_{p=l}^{n_0+n} \omega_p\, a_{l,\mathbf{k}}^{(p)}(X_{l-1})\Bigr)^2 \Bigm| \mathcal{F}_{n_0}\Bigr].$   (17)

Now we present an upper bound for the right-hand side of (17) in the case of the ULA algorithm.

Theorem 4

Fix $K \in \mathbb{N}$. Suppose that $f$ and $U$ are $q$ times continuously differentiable and that

$|\partial^{\alpha} f(x)| \le B$

for all $x \in \mathbb{R}^d$ and all multi-indices $\alpha$ with $|\alpha| \le q$, and that

$\|\nabla \mu(x)\| \le L$

for some positive numbers $B$ and $L$, where $\nabla \mu$ stands for the Jacobian of $\mu$. Let $(\gamma_p)_{p \ge 1}$ be a sequence of positive and non-increasing step sizes with $\gamma_1$ sufficiently small. We assume that $n_0 \ge 1$ and that the step-size condition

(18)

holds with some constant $c > 0$. Then the conditional variance $\operatorname{Var}[\pi_{n_0,n}^{(K)}(f) \mid \mathcal{F}_{n_0}]$ satisfies the bound

(19)

where the sum in (19) is taken over all subsets of the corresponding index set, and $\lesssim$ stands for inequality up to a constant not depending on $n_0$, $n$ and $K$.

Remark 1

Note that in Theorem 4 we do not require that the function $f$ itself be bounded. Assumption (18) is not restrictive. For instance, a straightforward calculation shows that (18) is satisfied in the most interesting case of polynomially decaying step sizes.

5 Nonparametric regression

The functions $a_{l,\mathbf{k}}^{(p)}$ need to be estimated before one can apply the proposed variance reduction approach. This amounts to solving a nonparametric regression problem, for which we present a generic regression algorithm. The algorithm starts with estimating the functions $q_l$ of Proposition 2 for $l = n_0 + 1, \ldots, n_0 + n$. We first generate $N$ paths

$(X_p^{(i)})_{p \le n_0 + n}, \quad i = 1, \ldots, N,$

of the chain (the so-called "training paths"), conditionally independent of the $\sigma$-algebra $\mathcal{F}_{n_0}$ generated by the burn-in sequence. The regression algorithm proceeds with estimating the functions $q_l$ by solving the least squares optimization problems

$\hat{q}_l = \operatorname*{arg\,min}_{\psi \in \Psi} \sum_{i=1}^{N} \bigl(\hat{q}_{l+1}(X_{l+1}^{(i)}) - \psi(X_l^{(i)})\bigr)^2$   (20)

for $l = n_0 + n - 1, \ldots, n_0 + 1$, where $\Psi$ is a class of functions on $\mathbb{R}^d$. Next we estimate the coefficients using the formula

$\hat{a}_{l,\mathbf{k}}(x) = \mathbb{E}\bigl[\hat{q}_l(\Phi_l(x, \xi))\, G_{\mathbf{k}}(\xi)\bigr],$   (21)

where $\xi$ is a random variable independent of the chain with law $P_\xi$, $\Phi_l$ is defined in (1), and the expectation is taken with respect to the known distribution $P_\xi$. Conditioning on $\mathcal{F}_{n_0}$ underlines that the estimates are derived from the training trajectories. In many cases the integration in (21) can be done in closed analytical form.

For example, in the case of the ULA algorithm we have

$\Phi_p(x, z) = x - \gamma_p\,\mu(x) + \sqrt{2\gamma_p}\, z,$

$P_\xi$ is the multivariate standard normal distribution $\mathcal{N}(0, I_d)$, and the basis functions $G_{\mathbf{k}}$ are the normalized Hermite polynomials $H_{\mathbf{k}}$. Therefore, if we use polynomials to approximate $\hat{q}_l$, then $\hat{a}_{l,\mathbf{k}}$ can even be computed in closed form; implementation details are provided in Section 7.

Upon estimating the coefficients $\hat{a}_{l,\mathbf{k}}$, one can construct an empirical estimate of $M_{n_0,n}^{(K)}$ in the form

$\hat{M}_{n_0,n}^{(K)} = \sum_{p=n_0+1}^{n_0+n} \omega_p \sum_{l=n_0+1}^{p} \sum_{1 \le |\mathbf{k}| \le K} \hat{a}_{l,\mathbf{k}}^{(p)}(X_{l-1})\, H_{\mathbf{k}}(Z_l).$

Obviously $\mathbb{E}[\hat{M}_{n_0,n}^{(K)} \mid \mathcal{F}_{n_0}] = 0$, and $\hat{M}_{n_0,n}^{(K)}$ is indeed a valid control variate in that it does not introduce any bias.
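A compact one-dimensional sketch of the two estimation steps above, assuming the ULA map and the normalized Hermite basis; the polynomial fit for (20) and the Gauss-Hermite quadrature for (21) are illustrative choices, not the paper's implementation:

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

def fit_q(x_train, y_train, deg=3):
    """Least-squares step (20): fit q_l by a polynomial in x (1-d sketch)."""
    return np.polynomial.Polynomial.fit(x_train, y_train, deg)

def coeff_a(q_hat, phi, k, x, n_nodes=20):
    """Coefficient step (21): a_{l,k}(x) ~ E[ q_hat(Phi_l(x, Z)) H_k(Z) ],
    Z ~ N(0, 1), computed by Gauss-Hermite quadrature; H_k is the normalized
    Hermite polynomial He_k / sqrt(k!)."""
    nodes, w = hermegauss(n_nodes)       # quadrature for the weight exp(-t^2 / 2)
    w = w / np.sqrt(2 * np.pi)           # renormalize to the N(0, 1) density
    hk = np.zeros(k + 1)
    hk[k] = 1.0
    vals = q_hat(phi(x, nodes)) * hermeval(nodes, hk) / math.sqrt(math.factorial(k))
    return float(np.sum(w * vals))
```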

6 Complexity analysis for ULA

The following result quantifies the error of estimating the functions $q_l$ via the regression algorithm; its proof follows from Theorem 2.2 in [2].

Theorem 5

Suppose that, for any $l$, the functions $q_l$ are bounded with probability one by some finite positive constants. Furthermore, assume that $\Psi$ is spanned by functions which are uniformly bounded and satisfy suitable regularity conditions. Then, for any values of $N$ and $\delta \in (0, 1)$ compatible with these conditions, it holds with probability at least $1 - \delta$ that

(22)

for all $l$, with $\lesssim$ standing for inequality up to a universal multiplicative constant.

Now we are able to give a bound for the difference between $M_{n_0,n}^{(K)}$ and its empirical counterpart $\hat{M}_{n_0,n}^{(K)}$.

Proposition 6

Under the conditions of Theorem 5, we have, with probability at least $1 - \delta$,

(23)

Proof Using the conditional Cauchy-Schwarz inequality and the orthonormality of the system $(H_{\mathbf{k}})$, we bound the conditional $L^2$-distance between $M_{n_0,n}^{(K)}$ and $\hat{M}_{n_0,n}^{(K)}$ by the regression errors controlled in Theorem 5; an application of Jensen's inequality and, again, the orthonormality of $(H_{\mathbf{k}})$ then yields (23).


6.1 Complexity analysis of variance reduction for the ULA algorithm

In the case of ULA one can bound the quantities appearing in (19) using $L^2$-type Sobolev inequalities (see [1]); see also Remark 3 in Section 8. In particular, under the conditions of Theorem 4 one can derive an inequality of this type, where $\lesssim$ stands for inequality up to a multiplicative constant not depending on $n_0$, $n$ and $K$. Using this inequality and combining (23) with Theorem 4, we obtain a bound, holding with probability at least $1 - \delta$, for the variance of the variance-reduced estimator $\pi_{n_0,n}^{(K)}(f)$.