Diffusion approximations and control variates for MCMC

08/05/2018 · Nicolas Brosse et al. · École Polytechnique and University of Florida

A new methodology is presented for the construction of control variates to reduce the variance of additive functionals of Markov Chain Monte Carlo (MCMC) samplers. Our control variates are defined as linear combinations of functions whose coefficients are obtained by minimizing a proxy for the asymptotic variance. The construction is theoretically justified by two new results. We first show that the asymptotic variances of some well-known MCMC algorithms, including the Random Walk Metropolis algorithm and the Unadjusted and Metropolis Adjusted Langevin Algorithms (ULA and MALA), are close to the asymptotic variance of the Langevin diffusion. Second, we provide an explicit representation of the optimal coefficients minimizing the asymptotic variance of the Langevin diffusion. Several examples of Bayesian inference problems demonstrate that the corresponding reduction in variance is significant, and that in some cases it can be dramatic.


1 Introduction

Let $U$ be a measurable function on $\mathbb{R}^d$ such that $\int_{\mathbb{R}^d} \mathrm{e}^{-U(y)}\,\mathrm{d}y < \infty$. This function is associated to a probability measure $\pi$ on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ defined for all $\mathsf{A} \in \mathcal{B}(\mathbb{R}^d)$ by

$$\pi(\mathsf{A}) = \int_{\mathsf{A}} \mathrm{e}^{-U(y)}\,\mathrm{d}y \Big/ \int_{\mathbb{R}^d} \mathrm{e}^{-U(y)}\,\mathrm{d}y\,.$$

We are interested in approximating $\pi(f) = \int_{\mathbb{R}^d} f(y)\,\pi(\mathrm{d}y)$, where $f$ is a $\pi$-integrable function. The classical Monte Carlo solution to this problem is to simulate i.i.d. random variables $(X_k)_{k \ge 0}$ with distribution $\pi$, and then to estimate $\pi(f)$ by the sample mean

$$\hat{\pi}_n(f) = n^{-1} \sum_{k=0}^{n-1} f(X_k)\,. \qquad (1)$$

In most applications, sampling from $\pi$ is not an option. Markov Chain Monte Carlo (MCMC) methods provide samples $(X_k)_{k \ge 0}$ from a Markov chain with unique invariant probability measure $\pi$. Under mild conditions [MT09, Chapter 17], the estimator $\hat{\pi}_n(f)$ defined by (1) satisfies, for any initial distribution, a Central Limit Theorem (CLT)

$$\sqrt{n}\,\big(\hat{\pi}_n(f) - \pi(f)\big) \xrightarrow{\;d\;} \mathcal{N}\big(0, \sigma^2(f)\big)\,, \qquad (2)$$

where $\sigma^2(f)$ is referred to as the asymptotic variance associated to $f$ and the Markov chain $(X_k)_{k\ge0}$, and $\mathcal{N}(m, \varsigma^2)$ denotes a Gaussian distribution with mean $m$ and variance $\varsigma^2$.

The aim of the present paper is to propose a new methodology to reduce the asymptotic variance of a family of MCMC algorithms. This method consists in constructing suitable control variates, i.e. we consider a family $\mathcal{H}$ of $\pi$-integrable functions $h$ satisfying $\pi(h) = 0$, and then choose $h \in \mathcal{H}$ such that $\sigma^2(f + h)$ is as small as possible; since $\pi(f + h) = \pi(f)$, the resulting estimator remains consistent. Reducing the variance of Monte Carlo estimators is a very active research domain: see e.g. [RC04, Chapter 4], [Liu08, Section 2.3], and [RK17, Chapter 5] for an overview of the main methods; see also Section 2.2.
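As a reminder of the mechanism in the simplest setting, here is a toy sketch of the classical i.i.d. control variate construction (the target, $f$ and $h$ below are illustrative choices, not the construction developed in this paper):

```python
import numpy as np

# Classical control variate in the i.i.d. setting (illustrative): estimate
# E[f(X)] for X ~ N(0,1) with f(x) = exp(x), using h(x) = x, which satisfies
# E[h(X)] = 0. The variance-optimal coefficient is -Cov(f, h)/Var(h).
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
f, h = np.exp(x), x
theta = -np.cov(f, h)[0, 1] / h.var()
print("plain   : mean", f.mean(), "variance", f.var())
print("with CV : mean", (f + theta * h).mean(), "variance", (f + theta * h).var())
# Both estimate E[exp(X)] = exp(1/2) ~ 1.6487; the control variate removes
# the part of f correlated with h, here about 58% of the variance.
```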

Our analysis and motivation are based on the Langevin diffusion defined by

$$\mathrm{d}Y_t = -\nabla U(Y_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t\,, \qquad (3)$$

where $(B_t)_{t \ge 0}$ is a $d$-dimensional Brownian motion. In the sequel, we assume that the Stochastic Differential Equation (SDE) (3) has a unique strong solution $(Y_t)_{t \ge 0}$ for every initial condition $y \in \mathbb{R}^d$. Under appropriate conditions (see [Bha82, CCG12]), $\pi$ is invariant for the Markov process $(Y_t)_{t \ge 0}$ and the following CLT holds:

$$\sqrt{t}\,\Big(t^{-1}\int_0^t f(Y_s)\,\mathrm{d}s - \pi(f)\Big) \xrightarrow{\;d\;} \mathcal{N}\big(0, \sigma^2_\infty(f)\big)\,. \qquad (4)$$

The main contribution of this paper is the introduction of a new method to compute control variates based on the expression of the asymptotic variance given in (4). For any twice continuously differentiable function $\phi$, the differential generator $\mathcal{A}$ acting on $\phi$ is denoted by

$$\mathcal{A}\phi = -\langle \nabla U, \nabla\phi \rangle + \Delta\phi\,. \qquad (5)$$

Under appropriate conditions on $U$ and $\phi$, it may be shown that $\pi(\mathcal{A}\phi) = 0$. This property suggests to consider the class of control functionals $\{\mathcal{A}\phi : \phi \in \mathcal{G}\}$ for the Langevin diffusion, where $\mathcal{G}$ is a family of "smooth" functions, and to minimize, over $\phi \in \mathcal{G}$, the criterion

$$\phi \mapsto \sigma^2_\infty(f + \mathcal{A}\phi)\,. \qquad (6)$$

The use of control functionals has already been proposed in [AC99] with applications to quantum Monte Carlo calculations; improved schemes have later been considered in [MSI13, PMG14] with applications to computational Bayesian inference. Although $\{\mathcal{A}\phi : \phi \in \mathcal{G}\}$ is a class of control functionals for the Langevin diffusion, the choice of control variates minimizing the criterion (6) for some MCMC algorithms is motivated by the fact that the asymptotic variance $\sigma^2$, defined in (2) and associated with the Markov chains produced by these methods, is (up to a scaling factor) a good approximation of the asymptotic variance $\sigma^2_\infty$ of the Langevin diffusion defined in (4).
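To make the zero-mean property concrete, consider the standard Gaussian target (an illustrative example, not one of the paper's applications): take $U(y) = \|y\|^2/2$, so that

$$\mathcal{A}\phi(y) = -\langle y, \nabla\phi(y)\rangle + \Delta\phi(y)\,.$$

For $\phi(y) = y_1$ this gives $\mathcal{A}\phi(y) = -y_1$, and hence $\pi(\mathcal{A}\phi) = -\mathbb{E}_\pi[Y_1] = 0$, as claimed. Moreover, for $f(y) = y_1$ the choice $\phi = f$ yields $f + \mathcal{A}\phi \equiv 0$: the control variate is exact and the criterion (6) attains the value $0$.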

The remainder of the paper is organized as follows. In Section 2, we present our methodology to minimize (6) and the construction of control variates for some MCMC algorithms. In Section 3, we state our main result, which guarantees that the asymptotic variance defined in (2) and associated with a given MCMC method is close (up to a scaling factor) to the asymptotic variance of the Langevin diffusion defined in (4). We show that, under appropriate conditions on $U$, the Metropolis Adjusted and Unadjusted Langevin Algorithms (MALA and ULA) and the Random Walk Metropolis (RWM) algorithm fit the framework of our methodology. In Section 4, Monte Carlo experiments illustrating the performance of our method are presented. The proofs are postponed to Sections 5 and 6 and to the Appendix.

Notation

Let $\mathcal{B}(\mathbb{R}^d)$ denote the Borel $\sigma$-field of $\mathbb{R}^d$. Moreover, for a probability measure $\mu$ on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$, let $L^1(\mu)$ be the set of $\mu$-integrable functions $f : \mathbb{R}^d \to \mathbb{R}$; for $f \in L^1(\mu)$, set $\mu(f) = \int_{\mathbb{R}^d} f(y)\,\mathrm{d}\mu(y)$. Given a Markov kernel $Q$ on $\mathbb{R}^d \times \mathcal{B}(\mathbb{R}^d)$, for all $x \in \mathbb{R}^d$ and $f$ integrable under $Q(x, \cdot)$, denote by $Qf(x) = \int_{\mathbb{R}^d} f(y)\,Q(x, \mathrm{d}y)$. Let $V : \mathbb{R}^d \to [1, \infty)$ be a measurable function. The $V$-total variation distance between two probability measures $\mu$ and $\nu$ on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ is defined as $\|\mu - \nu\|_V = \sup_{|f| \le V} |\mu(f) - \nu(f)|$. If $V \equiv 1$, then $\|\cdot\|_V$ is the total variation distance, denoted by $\|\cdot\|_{\mathrm{TV}}$. For a measurable function $f : \mathbb{R}^d \to \mathbb{R}$, define $\|f\|_V = \sup_{x \in \mathbb{R}^d} |f(x)|/V(x)$.

For $x, y \in \mathbb{R}^d$, define the scalar product $\langle x, y \rangle = \sum_{i=1}^d x_i y_i$ and the Euclidean norm $\|x\| = \langle x, x \rangle^{1/2}$. For $a, b \in \mathbb{R}$, denote by $a \wedge b = \min(a, b)$ and $a \vee b = \max(a, b)$. For $a \in \mathbb{R}$, $\lfloor a \rfloor$ and $\lceil a \rceil$ denote respectively the floor and ceiling functions evaluated at $a$. We take the convention that empty sums are equal to $0$ and empty products to $1$. In addition, $\varphi$ stands for the $d$-dimensional standard Gaussian density, i.e. $\varphi(y) = (2\pi)^{-d/2}\,\mathrm{e}^{-\|y\|^2/2}$ for $y \in \mathbb{R}^d$.

For $k \in \mathbb{N}$ and $\Omega_1, \Omega_2$ two open sets of $\mathbb{R}^{d_1}, \mathbb{R}^{d_2}$ respectively, denote by $C^k(\Omega_1, \Omega_2)$ the set of $k$-times continuously differentiable functions from $\Omega_1$ to $\Omega_2$. For $f \in C^2(\mathbb{R}^d, \mathbb{R})$, denote by $\nabla f$ the gradient of $f$ and by $\Delta f$ the Laplacian of $f$. For $k \ge 1$ and $f \in C^k(\mathbb{R}^d, \mathbb{R})$, denote by $D^i f$ the $i$-th order differential of $f$, for $i \in \{1, \dots, k\}$, and by $\|D^i f(x)\|$ its operator norm at $x \in \mathbb{R}^d$. For $f \in C^k(\mathbb{R}^d, \mathbb{R})$ and $m \in \mathbb{N}$, define the semi-norm

$$\|f\|_{k,m} = \sup_{x \in \mathbb{R}^d}\; \max_{i \in \{1, \dots, k\}} \frac{\|D^i f(x)\|}{1 + \|x\|^m}\,.$$

Finally, define $C^k_{\mathrm{poly}}(\mathbb{R}^d, \mathbb{R}) = \{ f \in C^k(\mathbb{R}^d, \mathbb{R}) : \|f\|_{k,m} < \infty \text{ for some } m \in \mathbb{N} \}$.

2 Langevin-based control variates for MCMC methods

2.1 Method

In the following, we introduce our methodology based on control variates for the Langevin diffusion. In order not to obscure the main ideas of this method, we present it informally. Results that rigorously justify the related derivations are postponed to Section 3.

We consider a family of control functionals $\{\mathcal{A}\phi : \phi \in \mathcal{G}\}$. There is great flexibility in the choice of the family $\mathcal{G}$. We illustrate our methodology through a simple example,

$$\mathcal{G} = \Big\{ \theta^{\mathsf{T}}\psi = \sum_{i=1}^p \theta_i\,\psi_i \;:\; \theta \in \mathbb{R}^p \Big\}\,, \qquad (7)$$

with given basis functions $\psi = (\psi_1, \dots, \psi_p)$, but the method developed in this paper is by no means restricted to a linearly parameterized family.

A key property of the Langevin diffusion, which is the basis of our methodology, is the following "carré du champ" identity (see for example [BGL14, Section 1.6.2, formula 1.6.3]): for all $\phi, \varphi \in C^2(\mathbb{R}^d, \mathbb{R})$ satisfying appropriate integrability conditions,

$$\int_{\mathbb{R}^d} \phi\,\mathcal{A}\varphi\,\mathrm{d}\pi = -\int_{\mathbb{R}^d} \langle \nabla\phi, \nabla\varphi \rangle\,\mathrm{d}\pi\,, \qquad (8)$$

which reflects in particular that $\mathcal{A}$ is a self-adjoint operator on a dense subspace of $L^2(\pi)$, the Hilbert space of square-integrable functions w.r.t. $\pi$. A straightforward consequence of (8) (setting $\phi \equiv 1$) is that $\pi(\mathcal{A}\varphi) = 0$ for any such function $\varphi$. This observation implies that $f$ and $f + \mathcal{A}(\theta^{\mathsf{T}}\psi)$ have the same expectation with respect to $\pi$ for any $\theta \in \mathbb{R}^p$ and suitable $\psi$. Therefore, as emphasized in the introduction, if the CLT (4) holds, a relevant choice of control variate for the Langevin diffusion to estimate $\pi(f)$ is $\mathcal{A}(\theta^{\star\mathsf{T}}\psi)$, where $\theta^\star$ is a minimizer of

$$\theta \mapsto \sigma^2_\infty\big(f + \mathcal{A}(\theta^{\mathsf{T}}\psi)\big)\,. \qquad (9)$$

In the following, we explain how this optimization problem can be solved in practice.

It is shown in [Bha82] (see also [GM96] and [CCG12]) that, under appropriate conditions on $U$ and $f$, the solution of the Langevin diffusion (3) satisfies the CLT (4), where the asymptotic variance is given by

$$\sigma^2_\infty(f) = 2\int_{\mathbb{R}^d} \tilde{f}\,\hat{f}\,\mathrm{d}\pi\,, \qquad \tilde{f} = f - \pi(f)\,, \qquad (10)$$

and $\hat{f}$ satisfies Poisson's equation:

$$\mathcal{A}\hat{f} = -\tilde{f}\,. \qquad (11)$$

Another expression for $\sigma^2_\infty(f)$, using (8) and (11), is:

$$\sigma^2_\infty(f) = 2\int_{\mathbb{R}^d} \|\nabla\hat{f}\|^2\,\mathrm{d}\pi\,. \qquad (12)$$

Based on (8), (10) and (12), we now show how the minimization of (9) can be carried out in practice. First, by definition (11), for all $\theta \in \mathbb{R}^p$, $\hat{f} - \theta^{\mathsf{T}}\psi$ is a solution to the Poisson equation

$$\mathcal{A}\big(\hat{f} - \theta^{\mathsf{T}}\psi\big) = -\big(\tilde{f} + \mathcal{A}(\theta^{\mathsf{T}}\psi)\big)\,.$$

Therefore, we get for all $\theta \in \mathbb{R}^p$, using $\pi(\mathcal{A}(\theta^{\mathsf{T}}\psi)) = 0$ and (10),

$$\sigma^2_\infty\big(f + \mathcal{A}(\theta^{\mathsf{T}}\psi)\big) = 2\int_{\mathbb{R}^d} \big(\tilde{f} + \mathcal{A}(\theta^{\mathsf{T}}\psi)\big)\big(\hat{f} - \theta^{\mathsf{T}}\psi\big)\,\mathrm{d}\pi\,.$$

In addition, by (8) and (11), we get that $\int \mathcal{A}(\hat{f} - \theta^{\mathsf{T}}\psi)\,(\hat{f} - \theta^{\mathsf{T}}\psi)\,\mathrm{d}\pi = -\int \|\nabla(\hat{f} - \theta^{\mathsf{T}}\psi)\|^2\,\mathrm{d}\pi$, and we obtain using (12) that

$$\sigma^2_\infty\big(f + \mathcal{A}(\theta^{\mathsf{T}}\psi)\big) = 2\int_{\mathbb{R}^d} \big\|\nabla\hat{f} - \nabla(\theta^{\mathsf{T}}\psi)\big\|^2\,\mathrm{d}\pi\,. \qquad (13)$$

Minimizing the map (9) is thus equivalent to minimizing $\theta \mapsto \int_{\mathbb{R}^d} \|\nabla\hat{f} - \nabla(\theta^{\mathsf{T}}\psi)\|^2\,\mathrm{d}\pi$. Expanding the square, the only term involving $\hat{f}$ and $\theta$ jointly is $\int \langle \nabla\hat{f}, \nabla\psi_i \rangle\,\mathrm{d}\pi = \int \psi_i\,\tilde{f}\,\mathrm{d}\pi$ by (8) and (11), so this function can actually be minimized without computing the solution $\hat{f}$ of the Poisson equation, which is in general a computational bottleneck.

When $\mathcal{G}$ is given by (7), (13) may be rewritten as:

$$\sigma^2_\infty\big(f + \mathcal{A}(\theta^{\mathsf{T}}\psi)\big) = 2\Big( \theta^{\mathsf{T}} H \theta - 2\,\theta^{\mathsf{T}} b + \int_{\mathbb{R}^d} \|\nabla\hat{f}\|^2\,\mathrm{d}\pi \Big)\,,$$

where $H$ and $b$ are given for any $i, j \in \{1, \dots, p\}$ by

$$H_{ij} = \int_{\mathbb{R}^d} \langle \nabla\psi_i, \nabla\psi_j \rangle\,\mathrm{d}\pi\,, \qquad b_i = \int_{\mathbb{R}^d} \langle \nabla\psi_i, \nabla\hat{f} \rangle\,\mathrm{d}\pi = \int_{\mathbb{R}^d} \psi_i\,\tilde{f}\,\mathrm{d}\pi\,,$$

where the last identity follows from (8) and (11), so that $b$ does not require knowledge of $\hat{f}$. Note that $H$ is by definition a symmetric positive semi-definite matrix. If $\nabla\psi_1, \dots, \nabla\psi_p$ are linearly independent in $L^2(\pi)$, then $H$ is full rank and the minimizer of (13) is given by

$$\theta^\star = H^{-1} b\,. \qquad (14)$$
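As a numerical sanity check of (14), here is a toy sketch: the Gaussian target, integrand and basis below are illustrative choices, and the expectations defining $H$ and $b$ are approximated with i.i.d. draws from $\pi$.

```python
import numpy as np

# Toy setting (illustrative): pi = N(0,1), i.e. U(x) = x**2/2,
# integrand f(x) = x + x**2, and basis psi = (x, x**2).
rng = np.random.default_rng(0)
n = 100_000
x = rng.standard_normal(n)                        # i.i.d. samples from pi
f = x + x**2

psi = np.stack([x, x**2])                         # psi_i evaluated at the samples
grad_psi = np.stack([np.ones_like(x), 2.0 * x])   # d/dx psi_i

# H_ij = pi(<grad psi_i, grad psi_j>),  b_i = pi(psi_i * (f - pi(f)))
H = grad_psi @ grad_psi.T / n
b = psi @ (f - f.mean()) / n
theta_star = np.linalg.solve(H, b)                # optimal coefficients (14)

# Control variate A(theta^T psi), with A phi = -U' phi' + phi'' and U'(x) = x:
# A psi_1 = -x,  A psi_2 = 2 - 2 x**2.
A_psi = np.stack([-x, 2.0 - 2.0 * x**2])
f_cv = f + theta_star @ A_psi
print("without CV: mean", f.mean(), "variance", f.var())
print("with CV   : mean", f_cv.mean(), "variance", f_cv.var())
# Here theta_star is close to (1, 1/2), for which f + A(theta^T psi) is
# constant equal to pi(f) = 1, so the variance essentially vanishes.
```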

In conclusion, in addition to its theoretical interest, the Langevin diffusion (3) is an attractive model because optimization of the asymptotic variance is greatly simplified. However, we are not advocating simulation of this diffusion in MCMC applications. The main contribution of this paper is to show that the optimal control variate for the diffusion remains nearly optimal for many standard MCMC algorithms.

One example is the Unadjusted Langevin Algorithm (ULA), the Euler discretization scheme associated with the Langevin SDE (3):

$$X_{k+1} = X_k - \gamma\,\nabla U(X_k) + \sqrt{2\gamma}\,Z_{k+1}\,,$$

where $\gamma > 0$ is the step size and $(Z_k)_{k \ge 1}$ is an i.i.d. sequence of standard Gaussian $d$-dimensional random vectors. The idea of using the Markov chain $(X_k)_{k \ge 0}$ to sample approximately from $\pi$ was first introduced in the physics literature by [Par81] and popularized in the computational statistics community by [Gre83] and [GM94]. As shown below, other examples are the Metropolis Adjusted Langevin Algorithm (MALA), for which an additional Metropolis–Hastings correction step is added, but also MCMC algorithms which do not seem to be "directly" related to the Langevin diffusion, such as the Random Walk Metropolis algorithm (RWM).
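A minimal implementation sketch of the ULA recursion (the target in the usage example is an illustrative standard Gaussian; `grad_U` is assumed to be supplied by the user):

```python
import numpy as np

def ula(grad_U, x0, gamma, n, rng):
    """Unadjusted Langevin Algorithm: Euler discretization of the SDE (3),
    X_{k+1} = X_k - gamma * grad_U(X_k) + sqrt(2 gamma) * Z_{k+1}."""
    x = np.asarray(x0, dtype=float)
    chain = np.empty((n, x.size))
    for k in range(n):
        x = x - gamma * grad_U(x) + np.sqrt(2.0 * gamma) * rng.standard_normal(x.size)
        chain[k] = x
    return chain

# Usage: pi = N(0, I_2), so U(x) = ||x||^2 / 2 and grad_U(x) = x.
rng = np.random.default_rng(1)
chain = ula(lambda x: x, x0=np.zeros(2), gamma=0.1, n=50_000, rng=rng)
print(chain.mean(axis=0))  # close to 0
print(chain.var(axis=0))   # close to 1/(1 - gamma/2): ULA is biased for gamma > 0
```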

To deal with these different algorithms within the same theoretical framework, we consider a family of Markov kernels $(Q_\gamma)_{\gamma \in (0,\bar\gamma]}$, parameterized by a scalar parameter $\gamma$, where $\bar\gamma > 0$. For the ULA and MALA algorithms, $\gamma$ is the step size in the Euler discretization of the diffusion; for the RWM, it is the variance of the random-walk proposal. For any initial probability measure $\mu$ on $\mathcal{B}(\mathbb{R}^d)$ and $\gamma \in (0,\bar\gamma]$, denote by $\mathbb{P}^\gamma_\mu$ and $\mathbb{E}^\gamma_\mu$ the probability and the expectation, respectively, on the canonical space of the Markov chain with initial probability $\mu$ and transition kernel $Q_\gamma$. By convention, we set $\mathbb{P}^\gamma_x = \mathbb{P}^\gamma_{\delta_x}$ for all $x \in \mathbb{R}^d$. We denote by $(X_k)_{k \ge 0}$ the canonical process. It is assumed below that $f$, $\psi$ and $(Q_\gamma)_{\gamma \in (0,\bar\gamma]}$ satisfy the following assumptions. Roughly speaking, these conditions impose that for any $\gamma \in (0,\bar\gamma]$ and $\theta \in \mathbb{R}^p$, the discrete CLT (2) holds for the function $f + \mathcal{A}(\theta^{\mathsf{T}}\psi)$, and that the associated asymptotic variance is sufficiently close, as $\gamma \downarrow 0$, to the variance given by the continuous CLT (4), so that control functionals for the Markov chain can be derived using the methodology we developed above for the Langevin diffusion.

  (i) For each $\gamma \in (0,\bar\gamma]$, $Q_\gamma$ has an invariant probability distribution $\pi_\gamma$ satisfying $\pi_\gamma\big(|f + \mathcal{A}(\theta^{\mathsf{T}}\psi)|\big) < \infty$ for any $\theta \in \mathbb{R}^p$.

  (ii) For any $\gamma \in (0,\bar\gamma]$ and $\theta \in \mathbb{R}^p$,

    $$\sqrt{n}\,\Big(\hat{\pi}_n\big(f + \mathcal{A}(\theta^{\mathsf{T}}\psi)\big) - \pi_\gamma\big(f + \mathcal{A}(\theta^{\mathsf{T}}\psi)\big)\Big) \xrightarrow{\;d\;} \mathcal{N}\Big(0,\; \sigma^2_\gamma\big(f + \mathcal{A}(\theta^{\mathsf{T}}\psi)\big)\Big)\,, \qquad (15)$$

    where $\hat{\pi}_n$ is the sample mean (see (1)) computed along the chain with kernel $Q_\gamma$, and $\sigma^2_\gamma$ is the asymptotic variance (see (2)) relative to $Q_\gamma$.

  (iii) For any $\theta \in \mathbb{R}^p$, as $\gamma \downarrow 0$,

    $$\gamma\,\sigma^2_\gamma\big(f + \mathcal{A}(\theta^{\mathsf{T}}\psi)\big) \longrightarrow \sigma^2_\infty\big(f + \mathcal{A}(\theta^{\mathsf{T}}\psi)\big)\,, \qquad (16)$$
    $$\pi_\gamma\big(f + \mathcal{A}(\theta^{\mathsf{T}}\psi)\big) \longrightarrow \pi(f)\,, \qquad (17)$$

    where $\sigma^2_\infty$ is defined in (10).

The verification that these assumptions are satisfied for the ULA, RWM and MALA algorithms (under appropriate technical conditions on $U$, $f$ and $\psi$) is postponed to Section 3. The standard conditions (i)–(ii) are in particular satisfied if, for any $\gamma \in (0,\bar\gamma]$, $Q_\gamma$ is $V$-uniformly geometrically ergodic for some measurable function $V : \mathbb{R}^d \to [1,\infty)$, i.e. it admits an invariant probability measure $\pi_\gamma$ such that $\pi_\gamma(V) < \infty$ and there exist $C_\gamma \ge 0$ and $\rho_\gamma \in (0,1)$ such that for any probability measure $\mu$ on $\mathcal{B}(\mathbb{R}^d)$ and $n \in \mathbb{N}$,

$$\|\mu Q_\gamma^n - \pi_\gamma\|_V \le C_\gamma\,\rho_\gamma^{\,n}\,\mu(V)$$

(see e.g. [MT09] or [DMPS18]). Condition (iii) requires a specific form of the dependence of $\sigma^2_\gamma$ and $\pi_\gamma$ on $\gamma$.

Based on (i)–(iii) and (14), the estimator of $\pi(f)$ we suggest is given, for $N, n \in \mathbb{N}$, by

$$\frac{1}{n}\sum_{k=N}^{N+n-1} \big(f + \mathcal{A}\hat{\phi}\big)(X_k)\,, \qquad (18)$$

where $N$ is the length of the burn-in period and $\hat{\phi} \in \mathcal{G}$ is a minimizer of the empirical risk associated with (13),

$$\phi \mapsto \frac{1}{n'}\sum_{k=N'}^{N'+n'-1} \Big\{ \|\nabla\phi(X'_k)\|^2 - 2\,\phi(X'_k)\,\big(f(X'_k) - \hat{\pi}'_{n'}(f)\big) \Big\}\,, \qquad (19)$$

where $\hat{\pi}'_{n'}(f) = n'^{-1}\sum_{k=N'}^{N'+n'-1} f(X'_k)$. Here $(X'_k)_{k \ge 0}$ can be an independent copy of (or be identical to) the Markov chain $(X_k)_{k \ge 0}$, $N'$ is the length of its burn-in period, and $n'$ is the length of the sequence used to estimate the control variate. In this article, we do not study to what extent minimizing the empirical risk (19) leads to the minimization of the asymptotic variance of (18) as $n \to \infty$; such a problem has been tackled by [BIZ18] in the i.i.d. case. To control the complexity of the class of functions $\mathcal{G}$, a penalty term may be added in (19); the use of a penalty term to control the excess risk in the estimation of the control variate has been proposed and discussed in [SMD18]. Concerning the choice of $\mathcal{G}$, the simplest case is the linear family defined by (7), corresponding to the parametric case, and it is by far the most popular approach. It is possible to go one step further and adopt fully non-parametric approaches, such as kernel regression methods [OGC16] or neural networks [ZWZ18].

If the control function is a linear combination $\phi = \theta^{\mathsf{T}}\psi$ of $p$ functions, where $\theta \in \mathbb{R}^p$, then the empirical risk (19) may be expressed as

$$\theta \mapsto \theta^{\mathsf{T}} \tilde{H} \theta - 2\,\theta^{\mathsf{T}} \tilde{b}\,,$$

where for $i, j \in \{1, \dots, p\}$,

$$\tilde{H}_{ij} = \frac{1}{n'}\sum_{k=N'}^{N'+n'-1} \big\langle \nabla\psi_i(X'_k), \nabla\psi_j(X'_k) \big\rangle\,, \qquad \tilde{b}_i = \frac{1}{n'}\sum_{k=N'}^{N'+n'-1} \psi_i(X'_k)\,\big(f(X'_k) - \hat{\pi}'_{n'}(f)\big)\,.$$

In this simple case, an optimizer is obtained in closed form:

$$\hat{\theta} = \tilde{H}^{+}\,\tilde{b}\,, \qquad (20)$$

where $\tilde{H}^{+}$ is the Moore–Penrose pseudoinverse of $\tilde{H}$.
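Putting (18)–(20) together, a minimal end-to-end sketch (reusing the `ula` sampler sketched earlier; the target $\pi = \mathcal{N}(0,1)$, the integrand and the basis are the same illustrative choices as before, with $N' = N$ and $n' = n$, i.e. the same chain is used to fit $\hat\theta$ and to estimate $\pi(f)$):

```python
import numpy as np

rng = np.random.default_rng(2)
gamma, N, n = 0.05, 1_000, 200_000
xs = ula(lambda x: x, np.zeros(1), gamma, N + n, rng)[N:, 0]   # chain after burn-in

f = xs + xs**2
psi = np.stack([xs, xs**2])                       # psi_i(X_k)
grad_psi = np.stack([np.ones_like(xs), 2 * xs])   # psi_i'(X_k)
A_psi = np.stack([-xs, 2 - 2 * xs**2])            # A psi_i = -U' psi_i' + psi_i''

# Empirical quantities of (19)-(20).
H_t = grad_psi @ grad_psi.T / n
b_t = psi @ (f - f.mean()) / n
theta_hat = np.linalg.pinv(H_t) @ b_t             # closed form (20)

print("plain estimate   :", f.mean())
print("CV estimate (18) :", (f + theta_hat @ A_psi).mean())
```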

2.2 Comparison with other control variate methods for Monte Carlo simulation

The construction of control variates for MCMC and the related problem of approximating solutions of Poisson equations are very active fields of research; it is impossible to give credit to all the contributions in this area: see [DK12], [PMG14], [OGC16] and the references therein for further background. We survey in this section only the methods which are closely connected to our approach. [Hen97] and [Mey08, Section 11.5] proposed control variates of the form $\theta^{\mathsf{T}}(\psi - P\psi)$, where $\theta \in \mathbb{R}^p$, $P$ is the Markov kernel associated to the Markov chain, and $\psi_1, \dots, \psi_p$ are known $\pi$-integrable functions; note that $\pi(\psi_i - P\psi_i) = 0$ since $\pi$ is invariant for $P$. The parameter $\theta$ is obtained by minimizing the asymptotic variance

$$\theta \mapsto \sigma^2\big(f + \theta^{\mathsf{T}}(\psi - P\psi)\big)\,, \qquad (21)$$

whose expression involves the solution $\hat{f}$ of the discrete Poisson equation $\hat{f} - P\hat{f} = f - \pi(f)$. The method suggested in [Mey08, Section 11.5] to minimize (21) requires estimates of this solution. Temporal Difference learning is a possible candidate, but this method is complex to implement and suffers from high variance.

[DK12] noticed that if $P$ is reversible w.r.t. $\pi$, it is possible to optimize the limiting variance (21) without computing the Poisson solution explicitly. This approach is of course closely related to our proposed method: the reversibility of the Markov kernel is replaced here by the self-adjointness of the generator of the Langevin diffusion, which implies the reversibility of the semigroup.

Each of the algorithms in the aforementioned literature requires computation of $P\psi_i(x)$ for each sample $x$, which is in general difficult except in very specific examples. In [Hen97, Mey08] this is addressed by restricting to kernels $P(x, \cdot)$ with finite support for each $x$. In [DK12] the authors mainly consider Gibbs samplers in their numerical examples.

Our methodology is also related to the Zero-Variance method proposed by [MSI13, PMG14, OGC16, SMD18], which uses $\mathcal{A}(\theta^{\mathsf{T}}\psi)$ as a control variate and chooses $\theta$ by minimizing the stationary variance $\mathrm{Var}_\pi\big(f + \mathcal{A}(\theta^{\mathsf{T}}\psi)\big)$. A drawback of this method stems from the fact that the optimization criterion is theoretically justified only if $(X_k)_{k \ge 0}$ is i.i.d., and $\mathrm{Var}_\pi$ might significantly differ from the asymptotic variance defined in (15). We compare the two approaches in Section 4.
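For comparison, a sketch of a Zero-Variance-style fit, which minimizes the sample variance rather than the proxy (19); it reuses the arrays `f` and `A_psi` from the illustrative example above:

```python
import numpy as np

# Zero-Variance-style coefficients: theta minimizing the *marginal* variance
# Var(f + theta^T A psi), i.e. an ordinary least-squares regression of the
# centered f on the centered columns of A psi, with a sign flip.
G = A_psi - A_psi.mean(axis=1, keepdims=True)
theta_zv = -np.linalg.pinv(G @ G.T) @ (G @ (f - f.mean()))
print("ZV estimate:", (f + theta_zv @ A_psi).mean())
# On a correlated MCMC chain the two criteria generally select different
# coefficients; only the proxy (19) targets the asymptotic variance (13).
```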

3 Asymptotic expansion for the asymptotic variance of MCMC algorithms

In this section, we provide conditions under which the approximations (16)–(17) are satisfied. We first assume that the gradient of the potential $U$ is Lipschitz:

H 1.

$U$ is continuously differentiable and $\nabla U$ is Lipschitz, i.e. there exists $L \ge 0$ such that for all $x, y \in \mathbb{R}^d$,

$$\|\nabla U(x) - \nabla U(y)\| \le L\,\|x - y\|\,.$$

Denote by $(P_t)_{t \ge 0}$ the semigroup associated to the SDE (3), defined by $P_t\phi(x) = \mathbb{E}\big[\phi(Y^x_t)\big]$, where $\phi$ is bounded measurable and $(Y^x_t)_{t \ge 0}$ is a solution of (3) started at $x$. By construction, the target distribution $\pi$ is invariant for $(P_t)_{t \ge 0}$.

The conditions we consider require that $(Q_\gamma)_{\gamma \in (0,\bar\gamma]}$ is a family of Markov kernels such that for any $\gamma \in (0,\bar\gamma]$, $Q_\gamma$ approximates the semigroup $P_t$ at time $t = \gamma$ in a sense specified below. Let $V : \mathbb{R}^d \to [1,\infty)$ be a measurable function.

H 2.
  (i) For any $\gamma \in (0,\bar\gamma]$, $Q_\gamma$ has a unique invariant distribution $\pi_\gamma$.

  (ii) There exists $\varkappa \ge 0$ such that $\pi(V) \le \varkappa$ and $\sup_{\gamma \in (0,\bar\gamma]} \pi_\gamma(V) \le \varkappa$.

  (iii) There exist $A \ge 0$ and $\rho \in (0,1)$ such that for all $\gamma \in (0,\bar\gamma]$, $x \in \mathbb{R}^d$ and $n \in \mathbb{N}$,

    $$\|\delta_x Q_\gamma^n - \pi_\gamma\|_V \le A\,\rho^{n\gamma}\,V(x)\,, \qquad (22)$$
    $$Q_\gamma^n V(x) \le A\,V(x)\,. \qquad (23)$$

These conditions imply that the kernels $(Q_\gamma)_{\gamma \in (0,\bar\gamma]}$ are $V$-uniformly geometrically ergodic "uniformly" with respect to the parameter $\gamma$, with a mixing time (in number of iterations) growing as the inverse of the step size when $\gamma \downarrow 0$. Note that the number of steps needed for the Langevin diffusion sampled at intervals of length $\gamma$ to mix is also inversely proportional to $\gamma$ when $\gamma \downarrow 0$.

Under H 1 and H 2, by [Kop15, Lemma 2.6], there exists a solution $\hat{g}$ to Poisson's equation (11) for any measurable $g$ with $\|g\|_V < \infty$, which is given for any $x \in \mathbb{R}^d$ by

$$\hat{g}(x) = \int_0^{+\infty} P_t\big(g - \pi(g)\big)(x)\,\mathrm{d}t\,. \qquad (24)$$

Moreover, [CCG12, Theorem 3.1] shows that, for any $x \in \mathbb{R}^d$, $\sqrt{t}\,\big(t^{-1}\int_0^t g(Y^x_s)\,\mathrm{d}s - \pi(g)\big)$, where $(Y^x_s)_{s \ge 0}$ is the solution of the Langevin SDE started at $x$, converges weakly to $\mathcal{N}\big(0, \sigma^2_\infty(g)\big)$, where $\sigma^2_\infty(g)$ is given by (10).

Note that assumption H 2 implies that for any $\gamma \in (0,\bar\gamma]$, $n \in \mathbb{N}$ and probability measure $\mu$ on $\mathcal{B}(\mathbb{R}^d)$ with $\mu(V) < \infty$,

$$\|\mu Q_\gamma^n - \pi_\gamma\|_V \le A\,\rho^{n\gamma}\,\mu(V)\,. \qquad (25)$$

We now introduce an assumption guaranteeing that the limit of $\gamma^{-1}(Q_\gamma - \mathrm{Id})$ as $\gamma \downarrow 0$ is equal to the infinitesimal generator of the Langevin diffusion, defined, for a bounded measurable function $\phi$ and $x \in \mathbb{R}^d$, as $\lim_{t \downarrow 0} t^{-1}\big(P_t\phi(x) - \phi(x)\big)$, if the limit exists. This is a natural assumption if the semigroup of the Langevin diffusion evaluated at time $\gamma$, $P_\gamma$, and $Q_\gamma$ are close as $\gamma \downarrow 0$.

H 3.

There exist $k \in \mathbb{N}$ and a family of operators $(\mathcal{E}_\gamma)_{\gamma \in (0,\bar\gamma]}$, with $\mathcal{E}_\gamma : C^k_{\mathrm{poly}}(\mathbb{R}^d, \mathbb{R}) \to C^0(\mathbb{R}^d, \mathbb{R})$, such that for all $\gamma \in (0,\bar\gamma]$ and $\phi \in C^k_{\mathrm{poly}}(\mathbb{R}^d, \mathbb{R})$,

$$Q_\gamma\phi = \phi + \gamma\,\mathcal{A}\phi + \gamma^2\,\mathcal{E}_\gamma\phi\,.$$

In addition, for all $\phi \in C^k_{\mathrm{poly}}(\mathbb{R}^d, \mathbb{R})$ there exist $C_\phi \ge 0$ and $m_\phi \in \mathbb{N}$ (depending only on $\phi$) such that for any $\gamma \in (0,\bar\gamma]$ and $x \in \mathbb{R}^d$,

$$|\mathcal{E}_\gamma\phi(x)| \le C_\phi\,\big(1 + \|x\|^{m_\phi}\big)\,.$$

We show below that these conditions are satisfied by the Metropolis Adjusted and Unadjusted Langevin Algorithms (MALA and ULA), in which case $\gamma$ is the step size in the Euler discretization of the Langevin diffusion, and also by the Random Walk Metropolis algorithm (RWM), in which case $\gamma$ is the variance of the increment distribution. We next give an upper bound on the difference between $\pi_\gamma$ and $\pi$, which implies that (17) holds. The proofs are postponed to Section 5.

Proposition.

Assume H 1, H 2 and H 3, and let $\phi \in C^k_{\mathrm{poly}}(\mathbb{R}^d, \mathbb{R})$ with $\|\phi\|_V < \infty$. Then there exists $C \ge 0$ such that for all $\gamma \in (0,\bar\gamma]$,

$$|\pi_\gamma(\phi) - \pi(\phi)| \le C\,\gamma\,.$$

Proof.

The proof is postponed to Section 5.1. ∎

The next result, which is the main theorem of this section, precisely formalizes (16).

Theorem 1.

Assume H 1, H 2 and H 3. Then, there exists $C \ge 0$ such that for all $\gamma \in (0,\bar\gamma]$ and $\theta \in \mathbb{R}^p$,

$$\Big|\gamma\,\sigma^2_\gamma\big(f + \mathcal{A}(\theta^{\mathsf{T}}\psi)\big) - \sigma^2_\infty\big(f + \mathcal{A}(\theta^{\mathsf{T}}\psi)\big)\Big| \le C\,\big(1 + \|\theta\|^2\big)\,\gamma\,,$$

where $\sigma^2_\infty$ is defined in (10).

Proof.

The proof is postponed to Section 5.2. ∎

We now consider the ULA algorithm. The Markov kernel associated to the ULA algorithm is given for $\gamma > 0$, $x \in \mathbb{R}^d$ and $\mathsf{A} \in \mathcal{B}(\mathbb{R}^d)$ by

$$R_\gamma(x, \mathsf{A}) = \int_{\mathsf{A}} (4\pi\gamma)^{-d/2} \exp\!\Big(-\frac{\|y - x + \gamma\nabla U(x)\|^2}{4\gamma}\Big)\,\mathrm{d}y\,, \qquad (26)$$

i.e. $R_\gamma(x, \cdot)$ is the Gaussian distribution with mean $x - \gamma\nabla U(x)$ and covariance matrix $2\gamma\,\mathrm{I}_d$; equivalently, the integrand in (26) is $(2\gamma)^{-d/2}\varphi\big((y - x + \gamma\nabla U(x))/\sqrt{2\gamma}\big)$, where $\varphi$ is the $d$-dimensional standard Gaussian density. Consider the following additional assumption.

H 4.

There exist $\mathtt{m} > 0$ and $R \ge 0$ such that for any $x, y \in \mathbb{R}^d$ with $\|x - y\| \ge R$,

$$\langle \nabla U(x) - \nabla U(y), x - y \rangle \ge \mathtt{m}\,\|x - y\|^2\,.$$

Moreover, there exists $\bar\gamma > 0$ such that for any $\gamma \in (0,\bar\gamma]$, $\gamma \le (2L)^{-1} \wedge \mathtt{m}/(4L^2)$.

Proposition.

Assume H 1 and H 4. Then there exist $\bar\gamma > 0$ and a measurable function $V : \mathbb{R}^d \to [1,\infty)$ such that H 2 is satisfied for the family of Markov kernels $(R_\gamma)_{\gamma \in (0,\bar\gamma]}$.

Proof.

The proof follows from [DBD19, Theorem 14, Proposition 24]. However, for completeness and since all the tools needed for the proof of this result are used in the study of MALA, the proof is given in Section 6.1. ∎

Remark.

Note that in fact H 2 holds for ULA under milder conditions on $U$, using the results obtained in [Ebe15, EM18, DBD19]. For example, if H 1 holds and there exist $\mathtt{m} > 0$ and $R \ge 0$ such that for all $x \in \mathbb{R}^d$ with $\|x\| \ge R$,

$$\langle \nabla U(x), x \rangle \ge \mathtt{m}\,\|x\|\,, \qquad (27)$$

then [DBD19, Theorem 14, Proposition 24] imply that H 2 holds with $V(x) = \exp(\mathtt{m}\,\|x\|/4)$.

We now establish H 3. Let $\gamma \in (0,\bar\gamma]$, $\phi \in C^4_{\mathrm{poly}}(\mathbb{R}^d, \mathbb{R})$, and $x \in \mathbb{R}^d$. Denote by $X_1 = x - \gamma\nabla U(x) + \sqrt{2\gamma}\,Z_1$, where $Z_1$ is a standard $d$-dimensional Gaussian vector, the first step of ULA started at $x$. A Taylor expansion of $\phi(X_1)$ at $x$ and integration with respect to the distribution of $Z_1$ show that $R_\gamma\phi(x) = \phi(x) + \gamma\,\mathcal{A}\phi(x) + \gamma^2\,\mathcal{E}_\gamma\phi(x)$, where

$$\mathcal{E}_\gamma\phi(x) = \frac{1}{2}\,D^2\phi(x)\big[\nabla U(x)^{\otimes 2}\big] + \gamma^{-2}\,\mathbb{E}\bigg[\int_0^1 \frac{(1-t)^2}{2}\,D^3\phi\big(x + t\,(X_1 - x)\big)\big[(X_1 - x)^{\otimes 3}\big]\,\mathrm{d}t\bigg]\,. \qquad (28)$$
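To make the expansion explicit, here is the computation at second order (a sketch; the control of the $O(\gamma^2)$ remainder uses the polynomial bounds on the derivatives of $\phi$):

$$\begin{aligned}
R_\gamma\phi(x) &= \mathbb{E}\big[\phi\big(x - \gamma\nabla U(x) + \sqrt{2\gamma}\,Z_1\big)\big] \\
&= \phi(x) - \gamma\,\langle \nabla U(x), \nabla\phi(x) \rangle + \tfrac{1}{2}\,\mathbb{E}\big[D^2\phi(x)\big[(X_1 - x)^{\otimes 2}\big]\big] + \dots \\
&= \phi(x) + \gamma\,\big(-\langle \nabla U(x), \nabla\phi(x) \rangle + \Delta\phi(x)\big) + O(\gamma^2)\,,
\end{aligned}$$

since $\mathbb{E}[Z_1] = 0$ and $\mathbb{E}\big[(X_1 - x)^{\otimes 2}\big] = 2\gamma\,\mathrm{I}_d + \gamma^2\,\nabla U(x)\nabla U(x)^{\mathsf{T}}$, so that the second-order term contributes $\gamma\,\Delta\phi(x)$ plus the first term of (28).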

A simple application of the Lebesgue dominated convergence theorem then implies the following result.

Lemma.

Assume H 1. Then the family of ULA kernels $(R_\gamma)_{\gamma \in (0,\bar\gamma]}$ satisfies H 3 with $k = 4$ and $\mathcal{E}_\gamma$ given by (28).

We now examine the MALA algorithm. The Markov kernel of the MALA algorithm, see [RT96], is given for $\gamma > 0$, $x \in \mathbb{R}^d$ and $\mathsf{A} \in \mathcal{B}(\mathbb{R}^d)$ by

$$M_\gamma(x, \mathsf{A}) = \int_{\mathsf{A}} \alpha_\gamma(x, y)\,q_\gamma(x, y)\,\mathrm{d}y + \delta_x(\mathsf{A}) \int_{\mathbb{R}^d} \big(1 - \alpha_\gamma(x, y)\big)\,q_\gamma(x, y)\,\mathrm{d}y\,, \qquad (29)$$

where $q_\gamma(x, y) = (4\pi\gamma)^{-d/2}\exp\big(-\|y - x + \gamma\nabla U(x)\|^2/(4\gamma)\big)$ is the ULA proposal density and

$$\alpha_\gamma(x, y) = 1 \wedge \frac{\mathrm{e}^{-U(y)}\,q_\gamma(y, x)}{\mathrm{e}^{-U(x)}\,q_\gamma(x, y)}$$

is the Metropolis–Hastings acceptance probability.
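A one-step sketch of MALA matching (29); the callables `U` and `grad_U`, and the standard Gaussian usage target, are illustrative assumptions:

```python
import numpy as np

def mala_step(x, U, grad_U, gamma, rng):
    """One MALA transition: propose a ULA move, then accept or reject it
    with the Metropolis-Hastings probability alpha_gamma of (29)."""
    prop = x - gamma * grad_U(x) + np.sqrt(2.0 * gamma) * rng.standard_normal(x.size)

    def log_q(a, b):  # log q_gamma(a, b), up to an additive constant
        return -np.sum((b - a + gamma * grad_U(a)) ** 2) / (4.0 * gamma)

    log_alpha = -U(prop) + U(x) + log_q(prop, x) - log_q(x, prop)
    if np.log(rng.uniform()) < log_alpha:
        return prop
    return x

# Usage on pi = N(0,1): U(x) = ||x||^2 / 2, grad_U(x) = x.
rng = np.random.default_rng(3)
x, samples = np.zeros(1), []
for _ in range(20_000):
    x = mala_step(x, lambda z: 0.5 * float(z @ z), lambda z: z, 0.5, rng)
    samples.append(x[0])
print(np.mean(samples), np.var(samples))  # close to 0 and 1: MALA is unbiased
```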