
Variance Reduction for Metropolis-Hastings Samplers

We introduce a general framework that constructs estimators with reduced variance for random walk Metropolis and Metropolis-adjusted Langevin algorithms. The resulting estimators require negligible computational cost and are derived in a post-processing manner utilising all proposal values of the Metropolis algorithms. Variance reduction is achieved by producing control variates through the approximate solution of the Poisson equation associated with the target density of the Markov chain. The proposed method is based on approximating the target density with a Gaussian and then utilising accurate solutions of the Poisson equation for the Gaussian case. This leads to an estimator that uses two key elements: (i) a control variate from the Poisson equation that contains an intractable expectation under the proposal distribution, and (ii) a second control variate to reduce the variance of a Monte Carlo estimate of this latter intractable expectation. Simulated data examples are used to illustrate the impressive variance reduction achieved in the Gaussian target case and the corresponding effect when the Gaussianity assumption on the target is violated. Real data examples on Bayesian logistic regression and stochastic volatility models verify that considerable variance reduction is achieved with negligible extra computational cost.


1 Introduction

Statistical methods for reducing the bias and the variance of estimators have played a prominent role in Monte Carlo based numerical algorithms. Variance reduction via control variates has a long and well-studied history, introduced as early as the work of kahn1953methods, whereas an early non-parametric estimate of bias, subsequently renamed the jackknife and broadly used for bias reduction, was first presented by quenouille1956notes. However, the corresponding theoretical developments for the more complicated, but extremely popular and practically important, estimators based on MCMC algorithms have been rather limited. The major impediment is the fact that MCMC estimators are based on ergodic averages of dependent samples produced by simulating a Markov chain.

We provide a general methodology to construct control variates for any discrete-time random walk Metropolis (RWM) or Metropolis-adjusted Langevin algorithm (MALA) Markov chain that can achieve, in a post-processing manner and with negligible additional computational cost, impressive variance reduction when compared to the standard MCMC ergodic averages. Our proposed estimators are based on an approximate, but accurate, solution of the Poisson equation for a multivariate Gaussian target density of any dimension.

Suppose that we have a sample of size $n$ from an ergodic Markov chain $(x_i)_{i \geq 0}$ with continuous state space $\mathcal{X} \subseteq \mathbb{R}^d$, transition kernel $P$ and invariant measure $\pi$. A standard estimator of the mean $\pi(F) = \int_{\mathcal{X}} F \, d\pi$ of a real-valued function $F$ defined on $\mathcal{X}$ under $\pi$ is the ergodic mean

$$\mu_n(F) = \frac{1}{n} \sum_{i=1}^{n} F(x_i),$$

which satisfies, for any initial distribution of $x_0$, a central limit theorem of the form

$$\sqrt{n}\,\big(\mu_n(F) - \pi(F)\big) \xrightarrow{d} \mathcal{N}\big(0, \sigma^2(F)\big),$$

with the asymptotic variance given by

$$\sigma^2(F) = \lim_{n \to \infty} n \, \operatorname{Var}\big(\mu_n(F)\big).$$

Notable attempts at variance reduction for Markov chain samplers include the use of antithetic variables (barone1990improving; green1992metropolis; craiu2005multiprocess), Rao-Blackwellization (gelfand1990sampling), Riemann sums (philippe2001riemann) and autocorrelation reduction (mira2000non; van2001art; yu2011center).

Control variates have played an outstanding role in the MCMC variance reduction quiver. A strand of research is based on assaraf1999zero, who noticed that a Hamiltonian operator together with a trial function is sufficient to construct an estimator with zero asymptotic variance. They considered a Hamiltonian operator of Schrödinger type that led to a series of zero-variance estimators studied by valle2010new, mira2013zero and papamarkou2014zero. The estimation of the optimal parameters of the trial function is conducted by ignoring the Markov chain sample dependency, an issue that was dealt with by belomestny2020variance by utilizing spectral methods. The main barrier to the wide applicability of zero-variance estimators is that their computational complexity increases with the dimension $d$; see south2018regularised. Another approach to constructing control variates is a non-parametric version of the methods presented by mira2013zero and papamarkou2014zero, which leads to the construction of control functionals (oates2017control; barp2018riemannian; south2020semi). Although their computational cost with respect to $d$ is low, their general applicability is prohibited by the cubic computational cost with respect to the number of samples $n$ (south2018regularised; oates2019convergence) and the possibility of suffering from the curse of dimensionality that is often met in non-parametric methods (wasserman2006all). Finally, hammer2008control proposed constructing control variates by expanding the state space of the Metropolis-Hastings algorithm.

An approach which is closely related to our proposed methodology attempts to minimise the asymptotic variance $\sigma^2(F)$ directly. This seems a hard problem since a closed-form expression for $\sigma^2(F)$ is not available and therefore a loss function to be minimised is not readily available; see, for example, flegal2010batch. However, there has been recent research activity based on the following observation by andradottir1993variance. If a solution $\hat{F}$ to the Poisson equation for $F$ were available, that is, if for every $x \in \mathcal{X}$

$$\hat{F}(x) - (P\hat{F})(x) = F(x) - \pi(F), \qquad (1)$$

where

$$(P\hat{F})(x) = \int_{\mathcal{X}} \hat{F}(y) \, P(x, dy),$$

then one could construct the function $F + P\hat{F} - \hat{F}$, which is constant and equal to $\pi(F)$. It is then immediate that a zero-variance and zero-bias estimator for $\pi(F)$ is given by

$$\frac{1}{n} \sum_{i=1}^{n} \Big\{ F(x_i) + (P\hat{F})(x_i) - \hat{F}(x_i) \Big\},$$

which can be viewed as an enrichment of the estimator $\mu_n(F)$ with the (optimal) control variate $P\hat{F} - \hat{F}$. Of course, solving (1) is extremely hard for continuous state space Markov chains, even if we assume that $\pi$ is known, because it involves solving a non-standard integral equation. Interestingly, a solution of this equation (also called the fundamental equation) produces the zero-variance estimators suggested by assaraf1999zero for a specific choice of Hamiltonian operator. One of the rare examples in which (1) has been solved exactly for discrete time Markov chains is the random scan Gibbs sampler with a multivariate Gaussian target density; see dellaportas2012control, dellaportas2009notes. They advocated that this solution provides a good approximation to (1) for posterior densities often met in Bayesian statistics that are close to multivariate Gaussian densities. Indeed, since a direct solution of (1) is not available, approximating $\hat{F}$ has also been suggested by andradottir1993variance, atchade2005improving, henderson1997variance and meyn2008control.
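Before proceeding, it is worth making the zero-variance mechanism concrete. The following sketch is our own illustration rather than part of the paper's method: on a finite state space, (1) is a linear system that can be solved exactly, and the corrected estimator then recovers $\pi(F)$ exactly from any single chain realization.

```python
# Zero-variance estimator via the Poisson equation on a finite state space.
import numpy as np

rng = np.random.default_rng(0)

# A small ergodic transition matrix P and its stationary distribution pi.
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi /= pi.sum()

F = np.array([1.0, -2.0, 5.0])          # function whose mean pi(F) we want
piF = pi @ F

# Solve (I - P) Fhat = F - pi(F); the solution is unique up to an additive
# constant, which we pin down by fixing the last coordinate to zero.
A = np.eye(3) - P
A[-1] = [0.0, 0.0, 1.0]
b = F - piF
b[-1] = 0.0
Fhat = np.linalg.solve(A, b)

# Simulate the chain and compare the ergodic average with the corrected one.
n, x = 2000, 0
samples = np.empty(n, dtype=int)
for i in range(n):
    x = rng.choice(3, p=P[x])
    samples[i] = x

ergodic = F[samples].mean()
corrected = (F[samples] + (P @ Fhat)[samples] - Fhat[samples]).mean()
print(piF, ergodic, corrected)   # corrected equals piF up to round-off
```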

tsourti2012variance attempted to extend the work by dellaportas2012control to RWM samplers. The resulting algorithms produced estimators with lower variance, but the computational cost required for the post-processing construction of these estimators counterbalanced the variance reduction gains. We build on the work by tsourti2012variance here, but we differ in that (i) we build new non-linear $d$-dimensional approximations to $\hat{F}$, appropriately chosen to facilitate analytic computations, rather than linear combinations of one-dimensional functions, and (ii) we produce efficient Monte Carlo approximations of the $d$-dimensional integral $(P\hat{F})(x)$ so that no extra computation is required for its evaluation. Finally, mijatovic2018poisson approximate numerically the solution of (1) for one-dimensional RWM samplers, and mijatovic2019asymptotic construct control variates for large $d$ by employing the solution of (1) associated with the Langevin diffusion to which the Markov chain converges as $d \to \infty$ (roberts1997geometric); this requires very expensive Monte Carlo estimation methods, so it is prohibitive for realistic statistical applications.

We follow this route and add to this literature by extending the work of dellaportas2012control and tsourti2012variance to RWM and MALA algorithms, producing estimators for the posterior means of each coordinate of a $d$-dimensional target density with reduced asymptotic variance and negligible extra computational cost. Our Monte Carlo estimator of the expectation $\pi(F)$ makes use of three components:

  1. An approximation $\hat{F}$ to the solution of the Poisson equation associated with the target $\pi$, transition kernel $P$ and function $F$.

  2. A construction of $\hat{F}$ based on firstly approximating $\pi$ with a Gaussian density, and then specifying $\hat{F}$ by an accurate approximation to the solution of the Poisson equation for this approximate Gaussian target.

  3. An additional control variate, referred to as a static control variate, that is based on the same Gaussian approximation and allows us to reduce the variance of a Monte Carlo estimator of the intractable expectation $(P\hat{F})(x)$.

In Section 2 we provide full details of the above steps. We start by discussing, in Section 2.1, how all the above ingredients are put together to eventually arrive at the general form of our proposed estimator in equation (7). In Section 3 we present extensive simulation studies which verify that our methodology performs very well with multi-dimensional Gaussian targets, and that it stops reducing the asymptotic variance when we deal with a multimodal $d$-dimensional target density with distinct, remote modes. Moreover, we apply our methodology to real data examples consisting of a series of logistic regression examples with parameter vectors of increasing dimension and two stochastic volatility examples. In all cases we have produced estimators with considerable variance reduction at negligible extra computational cost.

1.1 Some notation

In the remainder of the paper we use a simplified notation where both $d$-dimensional random variables and their values are denoted by lower case letters, such as $x$, and where $x^{(j)}$ is the $j$th dimension or coordinate of $x$, $j = 1, \ldots, d$; the subscript $i$ refers to the $i$th sample drawn by using an MCMC algorithm, that is, $x_i^{(j)}$ is the $i$th sample for the $j$th coordinate of $x$; the density of the $d$-variate Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$ is denoted by $\mathcal{N}(x; \mu, \Sigma)$; for a function $F$ we set $\pi(F) = \int F \, d\pi$; $I_d$ is the $d \times d$ identity matrix and the superscript $\top$ on a vector or matrix denotes its transpose; $\|\cdot\|$ denotes the Euclidean norm; all vectors are understood as column vectors.

2 Metropolis-Hastings estimators with control variates from the Poisson equation

2.1 The general form of estimators for arbitrary targets

Consider an arbitrary intractable target $\pi$ from which we have obtained a set of correlated samples $x_1, \ldots, x_n$ by simulating a Markov chain with Metropolis-Hastings transition kernel $P$ invariant to $\pi$. To start with, assume a function $\hat{F}$ on $\mathcal{X}$. Following the observation of henderson1997variance, the function $(P\hat{F}) - \hat{F}$ has zero expectation with respect to $\pi$ because the kernel $P$ is invariant to $\pi$. Therefore, given correlated samples from the target, i.e. $x_i \sim \pi$ for $i = 1, \ldots, n$, the following estimator is unbiased:

$$\mu_n^{\hat{F}}(F) = \frac{1}{n} \sum_{i=1}^{n} \Big\{ F(x_i) + (P\hat{F})(x_i) - \hat{F}(x_i) \Big\}. \qquad (2)$$

For general Metropolis-Hastings algorithms the kernel $P$ is such that the expectation $(P\hat{F})(x)$ takes the form

$$(P\hat{F})(x) = \mathbb{E}_{q(y|x)}\Big[ \alpha(x, y)\, \hat{F}(y) + \big(1 - \alpha(x, y)\big)\, \hat{F}(x) \Big], \qquad (3)$$

where

$$\alpha(x, y) = \min\left\{ 1, \frac{\pi(y)\, q(x|y)}{\pi(x)\, q(y|x)} \right\} \qquad (4)$$

and $q(y|x)$ is the proposal distribution. By substituting (3) back into estimator (2) we obtain

$$\mu_n^{\hat{F}}(F) = \frac{1}{n} \sum_{i=1}^{n} \Big\{ F(x_i) + \mathbb{E}_{q(y|x_i)}\big[ \alpha(x_i, y)\big( \hat{F}(y) - \hat{F}(x_i) \big) \big] \Big\}. \qquad (5)$$

To use this estimator we need to overcome two obstacles: (i) we need to specify the function $\hat{F}$, and (ii) we need to deal with the intractable integral associated with the control variate.

Regarding (i), there is a theoretically best choice, which is to set $\hat{F}$ to the function that solves the Poisson equation,

$$\hat{F}(x) - \mathbb{E}_{q(y|x)}\Big[ \alpha(x, y)\, \hat{F}(y) + \big(1 - \alpha(x, y)\big)\, \hat{F}(x) \Big] = F(x) - \pi(F), \qquad (6)$$

where we have substituted the Metropolis-Hastings kernel into the general form of the Poisson equation from (1). For such an optimal choice of $\hat{F}$, the estimator in (5) has zero variance, i.e. it equals the exact expectation $\pi(F)$. Nevertheless, obtaining such an $\hat{F}$ for general high-dimensional intractable targets is not feasible, and hence we need to compromise with an inferior choice of $\hat{F}$ that can only approximate the exact solution. To construct such an $\hat{F}$, we make use of a Gaussian approximation to the intractable target, as indicated by the assumption below.

Assumption 1.

The target $\pi$ is approximated by a multivariate Gaussian $\mathcal{N}(x; \mu, \Sigma)$, and the covariance matrix of the proposal $q(y|x)$ is proportional to $\Sigma$.

The main purpose of the above assumption is to establish the ability to construct an efficient RWM or MALA sampler. Indeed, it is well known that efficient implementation of these Metropolis-Hastings samplers, especially when $d$ is large, requires that the covariance matrix of the proposal resembles as much as possible the shape of $\pi$. In adaptive MCMC (roberts2009examples), such shape matching is achieved during the adaptive phase, where $\Sigma$ is estimated. If $\log \pi(x)$ is a smooth differentiable function, $\Sigma$ could alternatively be estimated by a gradient-based optimisation procedure, and it is then customary to choose a proposal covariance matrix of the form $\delta \Sigma$ for a tuned scalar $\delta > 0$.

We then solve the Poisson equation for the Gaussian approximation $\pi_G(x) = \mathcal{N}(x; \mu, \Sigma)$ by finding the function $\hat{F}_G$ that satisfies

$$\hat{F}_G(x) - (P_G\hat{F}_G)(x) = F(x) - \pi_G(F).$$

It is useful to emphasize the difference between this new Poisson equation and the original Poisson equation in (6). This new equation involves the approximate Gaussian target $\pi_G$ and the corresponding "approximate" Metropolis-Hastings transition kernel $P_G$, which has been modified so that the acceptance ratio is obtained by replacing the exact target $\pi$ with the approximate target $\pi_G$, while the proposal is also modified if needed. (For the standard RWM algorithm the proposal remains exactly the same, while for MALA it needs to be modified by replacing the gradient $\nabla \log \pi$ with $\nabla \log \pi_G$.) Clearly, this modification makes $P_G$ invariant to $\pi_G$. When $\pi_G$ is a good approximation to $\pi$, we expect $\hat{F}_G$ also to closely approximate the ideal function that solves (6). Therefore, in our method we propose to set $\hat{F}$ to $\hat{F}_G$ (actually, to an analytic approximation of $\hat{F}_G$) and then use it in the estimator (5).

Having chosen $\hat{F}$, we now discuss the second challenge (ii), i.e. dealing with the intractable expectation in (5). Given that for any drawn sample $x_i$ of the Markov chain there is also a corresponding proposed sample $y_i$ generated from the proposal, we can unbiasedly approximate the integral with the single-sample Monte Carlo estimate

$$\alpha(x_i, y_i)\big( \hat{F}(y_i) - \hat{F}(x_i) \big).$$

Although this is an unbiased stochastic estimate of the Poisson-type control variate, it can have high variance that needs to be reduced. We therefore introduce a second control variate based on some function $g(x, y)$ that correlates well with the above quantity and has an analytic expectation $\mathbb{E}_{q(y|x)}[g(x, y)]$. We refer to this control variate as static since it involves a standard Monte Carlo problem with exact samples from the tractable proposal density $q(y|x)$. To construct $g$ we rely again on the Gaussian approximation, as we describe in Section 2.3.
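As a self-contained illustration of this second step (our own toy example, not the paper's $g$), the snippet below reduces the variance of a one-sample Monte Carlo estimate of $\mathbb{E}_q[h(y)]$ by subtracting a control variate $g$ with known mean under $q$, together with a regression coefficient of the kind discussed in Section 2.4:

```python
# Static control variate on a toy problem: q = N(0,1), E_q[h] = 3**-0.5.
import numpy as np

rng = np.random.default_rng(1)

h = lambda y: np.exp(-y ** 2)    # stand-in for the intractable integrand
g = lambda y: 1.0 - y ** 2       # correlates with h; known mean E_q[g] = 0

y = rng.standard_normal(100_000)                  # one proposal draw per iteration
plain = h(y)                                      # one-sample estimates
beta = np.cov(plain, g(y))[0, 1] / g(y).var()     # regression coefficient
controlled = plain - beta * (g(y) - 0.0)          # 0.0 is the known E_q[g]

print(plain.mean(), controlled.mean())   # both approx 0.5774
print(plain.var(), controlled.var())     # controlled variance approx 3x smaller
```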

With $\hat{F}$ and $g$ specified, we can finally write down the general form of the proposed estimator, which can be efficiently computed solely from the MCMC output samples $x_i$ and the corresponding proposed samples $y_i$:

$$\mu_n^{\hat{F}}(F) = \frac{1}{n} \sum_{i=1}^{n} \Big\{ F(x_i) + \alpha(x_i, y_i)\big( \hat{F}(y_i) - \hat{F}(x_i) \big) - \big( g(x_i, y_i) - \mathbb{E}_{q(y|x_i)}[g(x_i, y)] \big) \Big\}. \qquad (7)$$

In practice we use a slightly modified version of this estimator by adding a set of adaptive regression coefficients to further reduce the variance, following dellaportas2012control; see Section 2.4.

2.2 Approximation of the Poisson equation for Gaussian targets

2.2.1 Standard Gaussian case

In this section we construct an analytical approximation to the exact solution of the Poisson equation for the standard Gaussian $d$-variate target $\mathcal{N}(x; 0, I_d)$ and for the function $F(x) = x^{(1)}$, where $x^{(1)}$ is the first coordinate of $x$. We use this function in the remainder of the paper, which corresponds to approximating the mean value $\pi(x^{(1)})$, while other choices of $F$ are left for future work. We denote the exact unknown solution by $\hat{F}^*$ and the analytical approximation by $\hat{F}$. Given this target and some choice for $\hat{F}$ we express the expectation in (3) as

$$(P\hat{F})(x) = \hat{F}(x) + E_2(x) - \hat{F}(x)\, E_1(x),$$

where

$$E_1(x) = \mathbb{E}_{q(y|x)}\big[ \alpha(x, y) \big], \qquad (8)$$

$$E_2(x) = \mathbb{E}_{q(y|x)}\big[ \alpha(x, y)\, \hat{F}(y) \big]. \qquad (9)$$

The calculation of $(P\hat{F})(x)$ thus reduces to the calculation of the integrals $E_1(x)$ and $E_2(x)$. In both integrals $x$ is just a constant since the integration is with respect to $y$. Moreover, the MCMC algorithm we consider is either RWM or MALA with proposal

$$q(y|x) = \mathcal{N}\big( y;\; x + \beta \tfrac{\delta}{2} \nabla \log \pi(x),\; \delta I_d \big), \qquad (10)$$

where $\beta = 0$ corresponds to RWM and $\beta = 1$ to MALA, while $\delta > 0$ is the step-size. Both $E_1(x)$ and $E_2(x)$ are expectations under the proposal distribution $q(y|x)$.
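For concreteness, a minimal sketch of one transition under the proposal (10) for the standard Gaussian target is given below (the helper names are ours); it also records the proposed value and the acceptance probability, which the post-processing estimator will reuse:

```python
# One RWM (beta=0) or MALA (beta=1) transition for a standard Gaussian target.
import numpy as np

def log_pi(x):                 # standard Gaussian target, up to a constant
    return -0.5 * x @ x

def grad_log_pi(x):
    return -x

def mh_step(x, delta, beta, rng):
    """Returns (next state, proposed value, acceptance probability)."""
    mean_x = x + beta * (delta / 2.0) * grad_log_pi(x)
    y = mean_x + np.sqrt(delta) * rng.standard_normal(x.size)
    mean_y = y + beta * (delta / 2.0) * grad_log_pi(y)
    # log q(y|x) and log q(x|y); the Gaussian normalising constants cancel
    log_q_xy = -0.5 * ((y - mean_x) @ (y - mean_x)) / delta
    log_q_yx = -0.5 * ((x - mean_y) @ (x - mean_y)) / delta
    alpha = np.exp(min(0.0, log_pi(y) - log_pi(x) + log_q_yx - log_q_xy))
    return (y if rng.random() < alpha else x), y, alpha
```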

One key observation is that, for any dimension $d$, the acceptance probability $\alpha(x, y)$ is just a univariate random variable with law induced by $q(y|x)$, so that the computation of $E_1(x)$ in (8) can be performed analytically. Moreover, if $\hat{F}(y)$ together with $\alpha(x, y)$ induces an overall tractable univariate random variable, the computation of $E_2(x)$ in (9) also becomes tractable; in general this computation is more involved since it depends on the form of $\hat{F}$. Therefore, we propose an approximate $\hat{F}$ by first introducing a parametrised family that leads to tractable and efficient closed-form computation of $E_2(x)$. In particular, we consider the following weighted sum of exponential functions,

$$\hat{F}(x) = \sum_{k} w_k \exp\big\{ a_k \|x + b_k\|^2 + c_k^\top x \big\}, \qquad (11)$$

where $w_k$ and $a_k$ are scalars whereas $b_k$ and $c_k$ are $d$-dimensional vectors. It turns out that using the form in (11) for $\hat{F}$ we can analytically compute the expectation $E_2(x)$, as stated in Proposition 1. The proof of this proposition and the proofs of all remaining propositions and remarks presented throughout Section 2 are given in the Appendix.

Proposition 1.

Let $E_1(x)$ and $E_2(x)$ be given by (8) and (9) respectively, and let $\hat{F}$ in (9) have the form in (11). Then both expectations are available in closed form: they are expressed through the cdf of the non-central chi-squared distribution with $d$ degrees of freedom, whose non-centrality parameter is determined by the squared norm of the proposal mean, equal to $x$ in the case of RWM and to $(1 - \tfrac{\delta}{2})x$ in the case of MALA.

Proposition 1 states that the calculation of $E_1(x)$ and $E_2(x)$ is based on the cdf of the non-central chi-squared distribution and allows, for $d$-variate standard normal targets, the exact computation of the modified estimator given by (2).
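The closed forms behind Proposition 1 rest on standard facts about the non-central chi-squared distribution: under (10), the squared norm of the proposal is, after scaling by $\delta$, non-central chi-squared, and expectations of exponentials of it reduce to its moment generating function and cdf. The following check (our own illustration, not code from the paper) verifies the MGF identity by Monte Carlo:

```python
# E[exp(t*W)] for W ~ ncx2(df=k, nc=lam) has the closed form below for t < 1/2.
import numpy as np
from scipy.stats import ncx2

k, lam, t = 5, 2.3, -0.4

mgf = np.exp(lam * t / (1 - 2 * t)) / (1 - 2 * t) ** (k / 2)   # closed form

rng = np.random.default_rng(2)
w = ncx2.rvs(df=k, nc=lam, size=200_000, random_state=rng)
print(mgf, np.exp(t * w).mean())      # agree up to Monte Carlo error
# Proposition 1 also uses the cdf, available as ncx2.cdf(x, df=k, nc=lam).
```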

Having a family of functions for which we can calculate analytically the expectation $E_2(x)$, we turn to the problem of specifying a particular member of this family to serve as an accurate approximation to the solution of the Poisson equation for the standard Gaussian distribution. We first provide the following proposition, which states that the exact solution $\hat{F}^*$ satisfies certain symmetry properties.

Proposition 2.

Given $F(x) = x^{(1)}$, the exact solution $\hat{F}^*$ is: (i) an odd function in the first dimension $x^{(1)}$; (ii) (for $d \geq 2$) an even function over any remaining dimension $x^{(j)}$, $j \neq 1$; (iii) (for $d \geq 3$) permutation invariant over the remaining dimensions.

To construct an approximation model family that incorporates the symmetry properties of Proposition 2, we constrain the parameters in (11) as follows. We use two pairs of terms, $k = 1, 2$, with weights of opposite sign within each pair so that $\hat{F}$ is odd in $x^{(1)}$, we let the exponents depend on the remaining coordinates only through $\|x\|^2$, and we set the vectors $b_k$ to zero and the vectors $c_k$ to be filled everywhere with zeros except for their first element, which equals $r_k$ and $-r_k$ respectively. We specify thus the function

$$\hat{F}(x) = \sum_{k=1}^{2} w_k \Big( e^{\, a_k \|x\|^2 + r_k x^{(1)}} - e^{\, a_k \|x\|^2 - r_k x^{(1)}} \Big). \qquad (12)$$

To identify optimal parameters for the function in (12), such that $\hat{F} \approx \hat{F}^*$, we first simulate a Markov chain with large sample size from the $d$-variate standard Gaussian distribution by employing the RWM algorithm and the MALA. Then, for each algorithm, we minimize the squared-residual loss of the Poisson equation over the simulated chain,

$$\mathcal{L}(w_1, a_1, r_1, w_2, a_2, r_2) = \sum_{i=1}^{n} \Big\{ \hat{F}(x_i) - (P\hat{F})(x_i) - \big( F(x_i) - \pi(F) \big) \Big\}^2, \qquad (13)$$

with respect to the parameters $w_1$, $a_1$, $r_1$, $w_2$, $a_2$ and $r_2$, by employing the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method. Figure 1 provides an illustration of the achieved approximation to $\hat{F}^*$ in the univariate case, where the model in (12) simplifies to

$$\hat{F}(x) = \sum_{k=1}^{2} w_k \big( e^{\, a_k x^2 + r_k x} - e^{\, a_k x^2 - r_k x} \big).$$

For this case we can visualize our optimised $\hat{F}$ and compare it against the numerical solution from mijatovic2018poisson. Figure 1 shows this comparison, which provides clear evidence that for $d = 1$ our approximation is very accurate.

Figure 1: Numerical solution of the Poisson equation (black solid lines) and its approximation (red dashed lines) in the case of a univariate standard Gaussian target simulated by using the random walk Metropolis (RWM) algorithm and the Metropolis-adjusted Langevin algorithm (MALA).
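A compact sketch of this fitting step for $d = 1$ and RWM is given below. It is our own simplified version: the kernel expectation $(P\hat{F})(x)$ is estimated by Monte Carlo on a grid with common random numbers rather than through the closed forms of Proposition 1, and an odd sum-of-exponentials family stands in for (12):

```python
# Fit an odd parametric approximation to the 1-d Poisson solution (RWM, N(0,1) target).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
delta = 2.4 ** 2                        # RWM step size, roughly tuned for d = 1
grid = np.linspace(-3, 3, 41)           # points where the residual is evaluated
eps = np.sqrt(delta) * rng.standard_normal((4000, 1))   # shared proposal noise

def fhat(x, th):
    a1, b1, a2, b2 = th
    return a1 * np.sinh(b1 * x) + a2 * np.sinh(b2 * x)  # odd, as Prop. 2 requires

def residual(th):
    y = grid + eps                                      # proposals for each grid point
    alpha = np.minimum(1.0, np.exp(0.5 * (grid ** 2 - y ** 2)))  # M-H ratio
    # E[fhat(X_1) | x] ~= mean of alpha*fhat(y) + (1-alpha)*fhat(x)
    p_fhat = (alpha * fhat(y, th) + (1 - alpha) * fhat(grid, th)).mean(axis=0)
    return fhat(grid, th) - p_fhat - grid               # Poisson eq., F(x)=x, pi(F)=0

fit = minimize(lambda th: np.sum(residual(th) ** 2),
               x0=[1.0, 0.5, 0.1, 1.0], method="BFGS")
print(fit.x, np.abs(residual(fit.x)).max())
```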

2.2.2 General Gaussian case

Given the general $d$-variate Gaussian target $\mathcal{N}(x; \mu, \Sigma)$, we denote by $\hat{F}^*_{\mu,\Sigma}$ the exact solution of the Poisson equation and by $\hat{F}_{\mu,\Sigma}$ the approximation that we wish to construct. To approximate $\hat{F}^*_{\mu,\Sigma}$ we apply a change of variables transformation from the standard normal case, as motivated by the following proposition and remark.

Proposition 3.

Suppose the standard normal target $\mathcal{N}(x; 0, I_d)$, the function $F(x) = x^{(1)}$ and the associated solution $\hat{F}^*$ of the Poisson equation for either RWM with proposal $\mathcal{N}(y; x, \delta I_d)$ or MALA with proposal $\mathcal{N}(y; (1 - \tfrac{\delta}{2})x, \delta I_d)$. Then, the solution for the general Gaussian target $\mathcal{N}(x; \mu, \Sigma)$ and Metropolis-Hastings proposal

$$q(y|x) = \mathcal{N}\big( y;\; x + \beta \tfrac{\delta}{2} \Sigma \nabla \log \pi(x),\; \delta \Sigma \big) \qquad (14)$$

is $L_{11}\, \hat{F}^*\big( L^{-1}(x - \mu) \big)$, where $L$ is a lower triangular Cholesky matrix such that $\Sigma = L L^\top$ and $L_{11}$ is its first diagonal element.

Remark 1.

To apply Proposition 3 for the $j$th coordinate, $j = 2, \ldots, d$, the vector $x$ needs to be permuted such that $x^{(j)}$ becomes its first element; the corresponding permutation has also to be applied to the mean $\mu$ and covariance matrix $\Sigma$.

Proposition 3 implies that we can obtain the exact solution of the Poisson equation for any $d$-variate Gaussian target by applying a change of variables transformation to the solution for the standard normal $d$-variate target. Therefore, based on this theoretical result, we propose to obtain an approximation of the solution of the Poisson equation in the general Gaussian case by simply transforming the approximation in (12) from the standard normal case, so that

$$\hat{F}_{\mu,\Sigma}(x) = \hat{F}\big( L^{-1}(x - \mu) \big). \qquad (15)$$

The constant $L_{11}$ is omitted since it can be absorbed by the regression coefficient; see Section 2.4.
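In code, the transformation (15) is a triangular solve (a minimal sketch under our notation; `fhat_std` denotes the fitted approximation (12) from Section 2.2.1):

```python
# Change of variables (15) for a general Gaussian target N(mu, Sigma).
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def fhat_general(x, mu, Sigma, fhat_std):
    """Approximate Poisson solution for the first coordinate; apply the
    permutation of Remark 1 to x, mu, Sigma first for other coordinates."""
    L = cholesky(Sigma, lower=True)               # Sigma = L @ L.T
    z = solve_triangular(L, x - mu, lower=True)   # z = L^{-1} (x - mu)
    return fhat_std(z)    # the constant L[0, 0] is absorbed by the
                          # regression coefficient of Section 2.4
```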

2.3 Construction of the static control variate

Suppose we have constructed a Gaussian approximation $\mathcal{N}(x; \mu, \Sigma)$ to the intractable target $\pi$ and have also obtained the function $\hat{F}$ from (15) needed for the proposed general estimator in (7). What remains is to specify the function $g$, labelled as the static control variate in (7), which should correlate well with

$$\alpha(x, y)\big( \hat{F}(y) - \hat{F}(x) \big).$$

The intractable term in this function is the Metropolis-Hastings probability $\alpha(x, y)$ in (4), whose Metropolis-Hastings ratio contains the intractable target $\pi$. This suggests choosing $g$ as

$$g(x, y) = \alpha_G(x, y)\big( \hat{F}(y) - \hat{F}(x) \big), \qquad (16)$$

where $\alpha_G(x, y)$ is the acceptance ratio of an M-H algorithm that targets the Gaussian approximation $\mathcal{N}(x; \mu, \Sigma)$, that is,

$$\alpha_G(x, y) = \min\left\{ 1, \frac{\mathcal{N}(y; \mu, \Sigma)\, q_G(x|y)}{\mathcal{N}(x; \mu, \Sigma)\, q_G(y|x)} \right\}, \qquad (17)$$

and $q_G(y|x)$ is the proposal distribution that we would use for the Gaussian target, as defined by equation (14). Importantly, by assuming that $\mathcal{N}(x; \mu, \Sigma)$ serves as an accurate approximation to $\pi$, the ratio $\alpha_G$ approximates accurately the exact M-H ratio, and the expectation $\mathbb{E}_{q(y|x)}[g(x, y)]$ can be calculated analytically. In particular, using (15) we have that

This integral can be computed efficiently as follows. We reparametrize the integral according to the new variable $z = L^{-1}(y - \mu)$ and also use the shortcut $\bar{x} = L^{-1}(x - \mu)$, where $x$ is an MCMC sample. After this reparametrization, the above expectation becomes an expectation under the distribution

$$z \sim \mathcal{N}\big( z;\; \bar{x} + \beta \tfrac{\delta}{2} L^\top \nabla \log \pi(x),\; \delta I_d \big), \qquad (18)$$

where we condition on $x$ with a slight abuse of notation, since the term $\nabla \log \pi(x)$ is the exact pre-computed gradient of the intractable target at the sample $x$. Thus, the calculation of $\mathbb{E}_{q(y|x)}[g(x, y)]$ reduces to the evaluation of the integral

$$\mathbb{E}_z\Big[ \alpha_G\big( \hat{F}(z) - \hat{F}(\bar{x}) \big) \Big], \qquad (19)$$

where inside the Metropolis-Hastings ratio $\alpha_G$ the target is the standard normal and the proposal is as in (10). In the case of RWM, by noting that the density in (18) coincides with the density in (10), the calculation of the integral in (19) reduces to the calculation of the integrals in (8) and (9) and, thus, can be conducted by utilizing Proposition 1. The calculation of the integral in (19) for the MALA is slightly different, as highlighted by the following remark.

Remark 2.

In the case of MALA, the mean of the density in (18) differs from the mean of the proposal in (10) due to the presence of the term $L^\top \nabla \log \pi(x)$, and the formulas in Proposition 1 are modified accordingly.

Finally, we note that apart from the tractability of the calculations offered by the particular choice of $g$, there is also the following intuition for its effectiveness. If the Gaussian approximation is exact, then the overall control variate, defined in equation (7) as the sum of a stochastic and a static control variate, becomes the exact "Poisson control variate" that we would compute if the initial target was actually Gaussian. Thus, we expect that the function $g$, as a static control variate for a non-Gaussian target, enables effective variance reduction under the assumption that the target is well approximated by a Gaussian distribution.

2.4 The modified estimator with regression coefficients

As pointed out by dellaportas2012control, the fact that the proposed estimator is based on an approximation of the true solution of the Poisson equation implies that we need to modify the estimator as

$$\mu_n^{\hat{F},\theta}(F) = \frac{1}{n} \sum_{i=1}^{n} \Big\{ F(x_i) + \theta\, G(x_i, y_i) \Big\}, \qquad (20)$$

where $\theta$ estimates the optimal coefficient that further minimizes the variance of the overall estimator; dellaportas2012control show that for reversible MCMC samplers this estimate can be constructed solely from the MCMC output. Here the term

$$G(x_i, y_i) = \alpha(x_i, y_i)\big( \hat{F}(y_i) - \hat{F}(x_i) \big) - \Big( g(x_i, y_i) - \mathbb{E}_{q(y|x_i)}[g(x_i, y)] \Big) \qquad (21)$$

approximates $(P\hat{F})(x_i) - \hat{F}(x_i)$, and we can estimate $\theta$ as

$$\hat{\theta} = -\, \frac{ \sum_{i=1}^{n} \big( F(x_i) - \mu_n(F) \big) \big( G(x_i, y_i) - \bar{G} \big) }{ \sum_{i=1}^{n} \big( G(x_i, y_i) - \bar{G} \big)^2 }, \qquad \bar{G} = \frac{1}{n} \sum_{i=1}^{n} G(x_i, y_i). \qquad (22)$$

The resulting estimator in (20) is evaluated by using solely the output of the MCMC algorithm and, under some regularity conditions, converges to $\pi(F)$ a.s. as $n \to \infty$; see tsourti2012variance.
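A minimal sketch of this step is given below, assuming the least-squares form of (22) written above (the refined reversible-chain estimator of dellaportas2012control can be substituted without changing the interface):

```python
# Combine F-values with control-variate terms G via (20)-(22).
import numpy as np

def corrected_mean(F_vals, G_vals):
    """F_vals[i] = F(x_i); G_vals[i] = G(x_i, y_i) from (21).
    Returns (variance-reduced estimate, fitted coefficient theta)."""
    G_c = G_vals - G_vals.mean()
    theta = -np.dot(F_vals - F_vals.mean(), G_c) / np.dot(G_c, G_c)
    return F_vals.mean() + theta * G_vals.mean(), theta
```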

2.5 Algorithmic summary

In summary, the proposed variance reduction approach can be applied a posteriori to the MCMC output samples obtained from either RWM or MALA with proposal density given by (14). The extra computations needed involve the evaluation of $G(x_i, y_i)$ given by (21). This is efficient since it relies on quantities that are readily available, such as the values $\hat{F}(x_i)$ and $\hat{F}(y_i)$, where $y_i$ is the value generated from the proposal during the main MCMC algorithm, as well as on the acceptance probability $\alpha(x_i, y_i)$, which has also been computed and stored at each MCMC iteration. The evaluation of $G$ also requires the construction of the static control variate $g$ defined by (16). This depends on the ratio $\alpha_G$ given by (17) and on the expectation $\mathbb{E}_{q(y|x_i)}[g(x_i, y)]$. The calculation of the latter expectation is tractable since $\alpha_G$ is the acceptance ratio of a Metropolis-Hastings algorithm that targets the Gaussian approximation $\mathcal{N}(x; \hat{\mu}, \hat{\Sigma})$, where $\hat{\mu}$ and $\hat{\Sigma}$ are estimators of the mean and covariance matrix of the target $\pi$; see Assumption 1. Finally, we compute $\hat{\theta}$ using (22) and evaluate the proposed estimator from (20). Algorithm 1 summarizes the steps of the variance reduction procedure.

Inputs: The samples $x_i$, $i = 1, \ldots, n$, simulated by using RWM or MALA with proposal distribution given by (14); the proposed samples $y_i$ generated from the proposal during the MCMC; the M-H probabilities $\alpha(x_i, y_i)$ calculated during the MCMC; estimators $\hat{\mu}$ and $\hat{\Sigma}$ of the mean and covariance matrix of the target.
Returns: An estimate of the mean of the $j$th coordinate of the target.

1: Set $F(x) = x^{(j)}$.
2: Calculate $g(x_i, y_i)$ given by (16).
3: Calculate $\mathbb{E}_{q(y|x_i)}[g(x_i, y)]$ by utilising Propositions 1 and 3.
4: Calculate $G(x_i, y_i)$ given by (21) for each $i$.
5: Calculate $\hat{\theta}$ given by (22).
6: Return $\mu_n^{\hat{F},\hat{\theta}}(F)$ given by (20).
Algorithm 1: Variance reduction for Metropolis-Hastings samplers
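Putting the pieces together, the following schematic mirrors Algorithm 1 (our own assembly; the function names and vectorised signatures are assumptions, and `fhat` must accept arrays of samples):

```python
# Schematic of Algorithm 1 for one coordinate (after any Remark 1 permutation).
import numpy as np

def variance_reduced_mean(xs, ys, alphas, fhat, g, Eg):
    """xs, ys: (n, d) arrays of chain states and proposed values;
    alphas: (n,) M-H acceptance probabilities stored during the MCMC;
    fhat: vectorised approximate Poisson solution (15);
    g, Eg: static control variate (16) and its proposal expectation (steps 2-3)."""
    F = xs[:, 0]                                   # F(x) = x^(1)  (step 1)
    G = alphas * (fhat(ys) - fhat(xs))             # one-sample estimate of (P fhat) - fhat
    G = G - (g(xs, ys) - Eg(xs))                   # subtract static control variate (step 4)
    G_c = G - G.mean()
    theta = -np.dot(F - F.mean(), G_c) / np.dot(G_c, G_c)   # step 5, cf. (22)
    return F.mean() + theta * G.mean()             # step 6, estimator (20)
```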

3 Application to real and simulated data

We present results from the application of the proposed methodology to real and simulated data examples. First we consider multivariate Gaussian targets, for which we have shown that the function in (12) allows the explicit calculation of the expectation defined by (3). Section 3.1 presents variance reduction factors in the case of $d$-variate standard Gaussian densities, simulated by employing the RWM and MALA, for dimensions up to $d = 100$. In Sections 3.2, 3.3 and 3.4 we examine the efficiency of our proposed methodology for targets that depart from the Gaussian distribution, where the expectation is not analytically available.

To conduct all the experiments we set the parameters of the function in (12) to the values given in Table 1, which were estimated by minimizing the loss function in (13). In practice we observe that such values lead to good performance across all real data experiments, including those with large $d$.

To estimate the variance of $\mu_n(F)$ in each experiment we obtained $K$ different estimates $\mu_n^{(k)}(F)$, $k = 1, \ldots, K$, of $\pi(F)$ based on $K$ independent MCMC runs. Then, the variance of $\mu_n(F)$ has been estimated by

$$\widehat{\operatorname{Var}}\big( \mu_n(F) \big) = \frac{1}{K} \sum_{k=1}^{K} \Big( \mu_n^{(k)}(F) - \bar{\mu} \Big)^2,$$

where $\bar{\mu}$ is the average of the $K$ estimates. We estimated similarly the variance of the proposed estimator $\mu_n^{\hat{F},\hat{\theta}}(F)$.
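The variance-reduction factors reported below are then simply ratios of such empirical variances; a small helper (ours) makes the protocol explicit:

```python
# Variance-reduction factor from K independent replications of each estimator.
import numpy as np

def variance_reduction_factor(est_plain, est_cv):
    """est_plain, est_cv: length-K arrays of estimates from independent runs."""
    v = lambda e: np.mean((e - e.mean()) ** 2)
    return v(np.asarray(est_plain)) / v(np.asarray(est_cv))
```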

RWM 8.7078 0.2916 0.0001 -3.5619 0.1131 3.9162
MALA 7.6639 0.0613 0.0096 -14.8086 0.3431 -0.0647
Table 1: Optimal values for the parameters of the function in (12).

3.1 Simulated data: Gaussian targets

The target distribution is a $d$-variate standard Gaussian distribution and we are interested in estimating the expected value of the first coordinate of the target by setting $F(x) = x^{(1)}$. Samples were drawn by utilising the proposal distribution in (10), with a fixed step-size in the RWM case and with $\delta$ tuned during the burn-in period to achieve a suitable acceptance rate in the MALA case. Table 2 presents the factors by which the variance of $\mu_n(F)$ is greater than the variance of $\mu_n^{\hat{F},\hat{\theta}}(F)$ for the RWM and MALA. Variance reduction is considerable even for $d = 100$. Figure 2 shows typical realizations of the sequences of estimates obtained by the standard estimator and the proposed one for different dimensions of the standard Gaussian target, and Figure 3 provides a visualization of the distributions of the two estimators.

                     RWM                          MALA
            d=2    d=10   d=30   d=100    d=2     d=10   d=30   d=100
n=1,000      93     26     10      5      1,345    64     57     97
n=10,000    278    173    112     27      3,572    81     88    316
n=50,000    541    445    177     94      4,628    92    103    274
n=500,000   531    820    370    263      4,997    83    157    286

Table 2: Estimated factors by which the variance of $\mu_n(F)$ is larger than the variance of the proposed estimator for a standard Gaussian $d$-variate target. We collect $n$ samples after discarding an initial burn-in period of the RWM and the MALA.
Figure 2: Sequences of the standard ergodic averages (black solid lines) and the proposed estimates (blue dashed lines). The red lines indicate the mean of the first coordinate of the $d$-variate standard Gaussian target. The values are based on samples drawn by employing either the RWM (top row) or the MALA (bottom row), with initial iterations discarded as a burn-in period.
Figure 3: Each pair of boxplots consists of values of the estimators $\mu_n(F)$ (left boxplot) and $\mu_n^{\hat{F},\hat{\theta}}(F)$ (right boxplot) for the $d$-variate standard Gaussian target. The estimators have been calculated by using samples drawn by employing either the RWM (top row) or the MALA (bottom row), with the first samples discarded as a burn-in period.

3.2 Simulated data: mixtures of Gaussian distributions

It is important to investigate how our proposed methodology performs when the target density departs from normality. We used as target a mixture of $d$-variate Gaussian distributions with density

$$\pi(x) = \sum_{m} \rho_m\, \mathcal{N}(x; \mu_m, \Sigma_m), \qquad (23)$$

where, following mijatovic2019asymptotic, each mean $\mu_m$ is a fixed $d$-dimensional vector and each $\Sigma_m$ is a covariance matrix randomly drawn from an inverse Wishart distribution, with its largest eigenvalue required to equal a fixed constant.

We drew samples from the target distribution by using the Metropolis-Hastings algorithm with the proposal distribution in (10), with the step-size set to achieve a suitable acceptance ratio; for larger step-sizes the MCMC algorithm struggles to converge. Table 3 presents the factors by which the variance of $\mu_n(F)$ is greater than the variance of the modified estimator for dimensions