Statistical methods for reducing the bias and the variance of estimators have played a prominent role in Monte Carlo based numerical algorithms. Variance reduction via control variates has a long and well studied history introduced as early as the work of kahn1953methods, whereas an early non-parametric estimate of bias, subsequently renamed jackknife and broadly used for bias reduction, was first presented by quenouille1956notes. However, the corresponding theoretical developments in the more complicated, but extremely popular and practically important, estimators based on MCMC algorithms has been rather limited. The major impediment is the fact that the MCMC estimators are based on ergodic averages of dependent samples produced by simulating a Markov chain.
We provide a general methodology to construct control variates for any discrete time random walk Metropolis (RWM) and Metropolis-adjusted Langevin algorithm (MALA) Markov chains that can achieve, in a post-processing manner and with a negligible additional computational cost, impressive variance reduction when compared to the standard MCMC ergodic averages. Our proposed estimators are based on an approximate, but accurate, solution of the Poisson equation for a multivariate Gaussian target density of any dimension.
Suppose that we have a sample of size from an ergodic Markov chain with continuous state space , transition kernel and invariant measure . A standard estimator of the mean of a real-valued function defined on under is the ergodic mean
which satisfies, for any initial distribution of , a central limit theorem of the form
, a central limit theorem of the form
with the asymptotic variance given by
Interesting attempts on variance reduction methods for Markov chain samplers include the use of antithetic variables (barone1990improving; green1992metropolis; craiu2005multiprocess), Rao-Blackwellization (gelfand1990sampling), Riemann sums (philippe2001riemann) or autocorrelation reduction (mira2000non; van2001art; yu2011center).
Control variates have played an outstanding role in the MCMC variance reduction quiver. A strand of research is based on assaraf1999zero who noticed that a Hamiltonian operator together with a trial function are sufficient to construct an estimator with zero asymptotic variance. They considered a Hamiltonian operator of Schrödinger-type that led to a series of zero-variance estimators studied by valle2010new, mira2013zero and papamarkou2014zero. The estimation of the optimal parameters of the trial function is conducted by ignoring the Markov chain sample dependency, an issue that was dealt with by belomestny2020variance by utilizing spectral methods. The main barrier for the wide applicability of zero-variance estimators is that their computational complexity increases with , see south2018regularised. Another approach to construct control variates is a non-parametric version of the methods presented by mira2013zero and papamarkou2014zero which lead to the construction of control functionals (oates2017control; barp2018riemannian; south2020semi).
Although their computational cost with respect to is low, their general applicability is prohibited due to the cubic computational cost with respect to (south2018regularised; oates2019convergence) and the possibility to suffer from the curse of dimensionality that is often met in non-parametric methods
and the possibility to suffer from the curse of dimensionality that is often met in non-parametric methods(wasserman2006all). Finally, hammer2008control proposed constructing control variates by expanding the state space of the Metropolis-Hastings algorithm.
An approach which is closely related to our proposed methodology attempts to minimise the asymptotic variance . This seems a hard problem since a closed form expression of is not available and therefore a loss function to be minimised is not readily available; see, for example,
is not available and therefore a loss function to be minimised is not readily available; see, for example,flegal2010batch. However, there has been a recent research activity based on the following observation by andradottir1993variance. If a solution to the Poisson equation for was available, that is if for every
then one could construct a function equal to which is constant and equal to . It is then immediate that a zero-variance and zero-bias estimator for is given by
which can be viewed as an enrichment of the estimator with the (optimal) control variate .
Of course, solving (1) is extremely hard for continuous state space Markov chains, even if we assume that
is known, because it involves solving a non-standard integral equation.
Interestingly, a solution of this equation (also called the fundamental equation) produces zero-variance estimators suggested by assaraf1999zero for a specific choice of Hamiltonian operator.
One of the rare examples that (1) has been solved exactly
for discrete time Markov chains
is the random scan Gibbs sampler where the target density is a multivariate Gaussian density, see dellaportas2012control, dellaportas2009notes. They advocated that this solution provides a good approximation to (1 ) for posterior densities often met in Bayesian statistics that are close to multivariate Gaussian densities. Indeed, since direct solution of (
) for posterior densities often met in Bayesian statistics that are close to multivariate Gaussian densities. Indeed, since direct solution of (1) is not available, approximating has been also suggested by andradottir1993variance, atchade2005improving, henderson1997variance, meyn2008control.
tsourti2012variance attempted to extend the work by dellaportas2012control to RWM samplers. The resulting algorithms produced estimators with lower variance but the computational cost required for the post-processing construction of these estimators counterbalance the variance reduction gains. We build on the work by tsourti2012variance here but we differ in that (i) we build new, appropriately chosen to facilitate analytic computations, non-linear -dimensional approximations to rather than linear combinations of -dimensional functions and (ii) we produce efficient Monte Carlo approximations of the -dimensional integral so that no extra computation is required for its evaluation. Finally, mijatovic2018poisson approximate numerically the solution of (1) for -dimensional RWM samplers and mijatovic2019asymptotic construct control variates for large by employing the solution of (1) that is associated with the Langevin diffusion in which the Markov chain converges as (roberts1997geometric); this requires very expensive Monte Carlo estimation methods so it is prohibited for realistic statistical applications.
We follow this route and add to this literature by extending the work of dellaportas2012control and tsourti2012variance to RWM and MALA algorithms by producing estimators for the posterior means of each co-ordinate of a -dimensional target density with reduced asymptotic variance and negligible extra computational cost. Our Monte Carlo estimator to compute the expectation makes use of three components:
An approximation to the solution of the Poisson equation associated with the target , transition kernel and function .
A construction of based on firstly approximating with a Gaussian density , and then specifying by an accurate approximation to the solution of the Poisson equation for the approximate target .
An additional control variate, referred to as static control variate, that is based on the same Gaussian approximation and allows to reduce the variance of a Monte Carlo estimator for the intractable expectation .
In Section 2 we provide full details of the above steps. We start by discussing, in Section 2.1, how all the above ingredients are put together to eventually arrive at the general form of our proposed estimator in equation (7).
In Section 3 we present extensive simulation studies that verify that our methodology performs very well with multi-dimensional Gaussian targets and it stops reducing the asymptotic variance when we deal with a multimodal -dimensional target density with distinct, remote modes. Moreover, we apply our methodology to real data examples consisting of a series of logistic regression examples with parameter vectors up to
-dimensional target density with distinct, remote modes. Moreover, we apply our methodology to real data examples consisting of a series of logistic regression examples with parameter vectors up todimensions and two stochastic volatility examples with and parameters. In all cases we have produced estimators with considerable variance reduction with negligible extra computational cost.
1.1 Some notation
In the remainder of the paper we use a simplified notation where both -dimensional random variables and their values are denoted by lower case letters, such as -variate Gaussian distribution with mean
-dimensional random variables and their values are denoted by lower case letters, such asand where is the th dimension or coordinate, ; the subscript refers to the th sample drawn by using an MCMC algorithm, that is is the th sample for the th coordinate of ; the density of the
-variate Gaussian distribution with meanand covariance matrix is denoted by ; for a function we set ; is the identity matrix and the superscript in a vector or matrix denotes its transpose; denotes the Euclidean norm; all the vectors are understood as column vectors.
2 Metrolopis-Hastings estimators with control variates from the Poisson equation
2.1 The general form of estimators for arbitrary targets
Consider an arbitrary intractable target from which we have obtained a set of correlated samples by simulating a Markov chain with transition kernel obtained by a Metropolis-Hastings kernel invariant to . To start with, assume a function . By following the observation of henderson1997variance the function has zero expectation with respect to because the kernel is invariant to . Therefore, given correlated samples from the target, i.e. with , the following estimator is unbiased
For general Metropolis-Hastings algorithms the kernel is such that the expectation takes the form
To use this estimator we need to overcome two obstacles: (i) we need to specify the function and (ii) we need to deal with the intractable integral associated with the control variate.
Regarding (i) there is a theoretical best choice which is to set to the function that solves the Poisson equation,
where we have substituted in the general form of the Poisson equation from (1) the Metropolis-Hastings kernel. For such optimal choice for the estimator in (5) has zero variance, i.e. it equals to the exact expectation . Nevertheless, getting for general high-dimensional intractable targets is not feasible, and hence we need to compromise with an inferior choice for that can only approximate . To get such , we make use of a Gaussian approximation to the intractable target, as indicated by the assumption below.
The target is approximated by a multivariate Gaussian and the covariance matrix of the proposal is proportional to .
The main purpose of the above assumption is to establish the ability to construct an efficient RWM or MALA sampler. Indeed, it is well-known that efficient implementation of these Metropolis-Hastings samplers when requires that the covariance matrix of should resemble as much as possible the shape of . In adaptive MCMC (roberts2009examples), such a shape matching is achieved during the adaptive phase where is estimated. If is a smooth differentiable function, could be alternatively estimated by a gradient-based optimisation procedure and it is then customary to choose a proposal covariance matrix of the form for a tuned scalar .
We then solve the Poisson equation for the Gaussian approximation by finding the function that satisfies,
It is useful to emphasize the difference between this new Poisson equation and the original Poisson equation in (6). This new equation involves the approximate Gaussian target and the corresponding “approximate” Metropolis-Hastings transition kernel , which now has been modified so that the ratio is obtained by replacing the exact target with the approximate target while the proposal is also modified if needed.111For the standard RWM algorithm remains exactly the same, while for MALA it needs to be modified by replacing the gradient with . Clearly, this modification makes invariant to . When is a good approximation to , we expect also to closely approximate the ideal function . Therefore, in our method we propose to set to (actually to an analytic approximation of ) and then use it in the estimator (5).
Having chosen , we now discuss the second challenge (ii), i.e. dealing with the intractable expectation . Given that for any drawn sample of the Markov chain there is also a corresponding proposed sample that is generated from the proposal, we can unbiasedly approximate the integral with a single-sample Monte Carlo estimate,
Although is a unbiased stochastic estimate of the Poisson-type control variate, it can have high variance that needs to be reduced. We introduce a second control variate based on some function , that correlates well with , and it has analytic expectation . We refer to this control variate as static since it involves a standard Monte Carlo problem with exact samples from the tractable proposal density . To construct we rely again on the Gaussian approximation as we describe in Section 2.3.
With and specified, we can finally write down the general form of the proposed estimator that can be efficiently computed only from the MCMC output samples and the corresponding proposed samples :
In practice we use a slightly modified version of this estimator by adding a set of adaptive regression coefficients to further reduce the variance following dellaportas2012control; see Section 2.4.
2.2 Approximation of the Poisson equation for Gaussian targets
2.2.1 Standard Gaussian case
In this section we construct an analytical approximation to the exact solution of the Poisson equation for the standard Gaussian -variate target and for the function where . We use the function in the remainder of the paper which corresponds to approximating the mean value , while other choices of are left for future work. We denote the exact unknown solution by and the analytical approximation by . Given this target and some choice for we express the expectation in (2.1) as
The calculation of reduces thus to the calculation of the integrals and . In both integrals is just a constant since the integration is with respect to . Moreover, the MCMC algorithm we consider is either RWM or MALA with proposal
where corresponds to RWM and to MALA while is the step-size. Both and are expectations under the proposal distribution .
One key observation is that for any dimension , is just an univariate random variable with law induced by . Then, together with can induce an overall tractable univariate random variable so that the computation of in (8) can be performed analytically. The computation of is more involved since it depends on the form of . Therefore, we propose an approximate by first introducing a parametrised family that leads to tractable and efficient closed form computation of . In particular, we consider the following weighted sum of exponential functions
where and are scalars whereas and are -dimensional vectors. It turns out that using the form in (11) for we can analytically compute the expectation as stated in Proposition 1. The proof of this proposition and the proofs of all remaining propositions and remarks presented throughout Section 2 are given in the Appendix.
where follows the non-central chi-squared distribution with degrees of freedom and non-central parameter and and .
Proposition 1 states that the calculation of and is based on the cdf of the non-central chi-squared distribution and allows, for -variate standard normal targets, the exact computation of the modified estimator given by (2).
Having a family of functions for which we can calculate analytically the expectation we turn to the problem of specifying a particular member of this family to serve as an accurate approximation to the solution of the Poisson equation for the standard Gaussian distribution. We first provide the following proposition which states that satisfies certain symmetry properties.
Given , the exact solution is: (i) (holds for ) Odd function in the dimension
) Odd function in the dimension. (ii) (holds for ) Even function over any remaining dimension . (iii) (holds for ) Permutation invariant over the remaining dimensions.
To construct an approximation model family that incorporates the symmetry properties of Proposition 2 we make the following assumptions for the parameters in (11). We set and we assume that and for each whereas we set , , and . Moreover, for the -dimensional vectors and we assume that , and ; we set the vectors and to be filled everywhere with zeros except from their th element which is equal to and respectively. We specify thus the function as
To identify optimal parameters for the function in (12) such that we first simulate a Markov chain with large sample size from the -variate standard Gaussian distribution by employing the RWM algorithm and the MALA. Then, for each algorithm we minimize the loss function
with respect to the parameters , , , , and by employing the Broyden–Fletcher–Goldfarb–Shanno method. Figure 1 provides an illustration of the achieved approximation to in the univariate case where and the model in (12) simplifies as
For such case, we can visualize our optimised and compare it against the numerical solution from mijatovic2018poisson. Figure 1 shows this comparison which provides clear evidence that for our approximation is very accurate.
2.2.2 General Gaussian case
Given the general -variate Gaussian target we denote by the exact solution of the Poisson equation and by the approximation that we wish to construct. To approximate we apply a change of variables transformation from the standard normal, as motivated by the following proposition and remark.
Suppose the standard normal target , the function and the associated solution of the Poisson equation for either RWM with proposal or MALA with proposal . Then, the solution for the general Gaussian target and Metropolis-Hastings proposal
is where is a lower triangular Cholesky matrix such that and is its first diagonal element.
To apply Proposition 3 for , , the vector needs to be permuted such that becomes its first element; the corresponding permutation has also to be applied to the mean and covariance matrix .
Proposition 3 implies that we can obtain the exact solution of the Poisson equation for any -variate Gaussian target by applying a change of variables transformation to the solution of the standard normal -variate target. Therefore, based on this theoretical result we propose to obtain an approximation of the Poisson equation in the general Gaussian case by simply transforming the approximation in (12) from the standard normal case so as
The constant is omitted since it can be absorbed by the regression coefficient ; see Section 2.4.
2.3 Construction of the static control variate
Suppose we have constructed
a Gaussian approximation
, where ,
to the intractable
target and also have obtained the function
from (15) needed for the proposed, general, estimator in (7).
What remains is to specify the function
, labelled as static control variate in (7), which should correlate well
The intractable term in this function is the Metropolis-Hastings probability
The intractable term in this function is the Metropolis-Hastings probabilityin (4) where the Metropolis-Hastings ratio contains the intractable target . This suggests to choose as
where is the acceptance ratio in a M-H algorithm that targets the Gaussian approximation , that is
and is the proposal distribution that we would use for the Gaussian target as defined by equation (14). Importantly, by assuming that serves as an accurate approximation to , the ratio approximates accurately the exact M-H ratio and can be calculated analytically. In particular, using (15) we have that
This integral can be computed efficiently as follows. We reparametrize the integral according to the new variable and also use the shortcut where is an MCMC sample. After this reparametrization, the above expectation becomes under the distribution
where we condition on with a slightly abuse of notation since the term is the exact pre-computed gradient for the sample of the intractable target. Thus, the calculation of reduces to the evaluation of the following integral
Note also that inside the Metropolis-Hastings ratio with as in (10). In the case of RWM and by noting that the density in (18) coincides with the density in (10) we have that the calculation of the integral in (19) reduces to the calculation of the integrals in (8) and (9) and, thus, can be conducted by utilizing Proposition 1. The calculation of the integral in (19) for the MALA is slightly different as highlighted by the following remark.
Finally, we note that except from the tractability in the calculations which offered by the particular choice of , there is also the following intuition for its effectiveness. If the Gaussian approximation is exact, then the overall control variate, defined in equation (7) as the sum of a stochastic and a static control variate, becomes the exact “Poisson control variate” that we would compute if the initial target was actually Gaussian. Thus, we expect that the function , as a static control variate in a non-Gaussian target, enables effective variance reduction under the assumption that the target is well-approximated by a Gaussian distribution.
2.4 The modified estimator with regression coefficients
As pointed out by dellaportas2012control the fact that the proposed estimator is based on an approximation of the true solution of the Poisson equation implies that we need to modify as
where estimates the optimal coefficient that further minimizes the variance of the overall estimator. dellaportas2012control show that for reversible MCMC samplers, the optimal estimator of the true coefficient can be constructed solely from the MCMC output. By re-writing the estimator in (20) as
where the term
approximates , we can estimate as
The resulting estimator in (20) is evaluated by using solely the output of the MCMC algorithm and under some regularity conditions converges to a.s. as , see tsourti2012variance.
2.5 Algorithmic summary
In summary, the proposed variance reduction approach can be applied a posteriori to the MCMC output samples obtained from either RWM or MALA with proposal density given by (14). The extra computations needed involve the evaluation of given by (21). This is efficient since it relies on quantities that are readily available such as the values and , where is the value generated from the proposal during the main MCMC algorithm, as well as on the acceptance probability which has been also computed and stored at each MCMC iteration. The evaluation of requires also the construction of the static control variate defined by (16). This depends on the ratio given by (17) and on the expectation . The calculation of the latter expectation is tractable since is the acceptance ratio of Metropolis-Hastings algorithm that targets the Gaussian target , where and are estimators of the mean and covariance matrix respectively of the target ; see Assumption 1. Finally, we compute using (22) and evaluate the proposed estimator from (20). Algorithm 1 summarizes the steps of the variance reduction procedure.
3 Application on real and simulated data
We present results from the application of the proposed methodology on real and simulated data examples. First we consider multivariate Gaussian targets for which we have shown that the function in (12) allows the explicit calculation of the expectation defined by (2.1). Section 3.1 presents variance reduction factors in the case of -variate standard Gaussian densities, simulated by employing the RWM and MALA, up to dimensions. In Sections 3.2, 3.3 and 3.4 and we examine the efficiency of our proposed methodology in targets that depart from the Gaussian distribution and the expectation is not analytically available.
To conduct all the experiments we set the parameters and of the function in (12) in the values given by Table 1 which were estimated by minimizing the loss function in (13) for . In practice we observe that such values lead to good performance across all real data experiments, including those with .
To estimate the variance of in each experiment we obtained different estimates , , for based on independent MCMC runs. Then, the variance of has been estimated by
where is the average of . We estimated similarly the variance of the proposed estimator .
3.1 Simulated data: Gaussian targets
The target distribution is a -variate standard Gaussian distribution and we are interested in estimating the expected value of the first coordinate of the target by setting . Samples of size were drawn from target densities by utilising the proposal distribution in (10) with for the RWM case and by tuning during the burn-in period to achieve acceptance rate between and in the MALA case. Table 2 presents factors by which the variance of is greater than the variance of in the case of the RWM and MALA. Variance reduction is considerable even for . Figure 2 shows typical realizations of the sequences of estimates obtained by the standard estimators and the proposed for different dimensions of the standard Gaussian target and Figure 3 provides a visualization of the distribution of the estimators and .
3.2 Simulated data: mixtures of Gaussian distributions
It is important to investigate how our proposed methodology performs when the target density departs from normality. We used as a mixture of -variate Gaussian distributions with density
where, following mijatovic2019asymptotic, we set to be the -dimensional vector and is covariance matrix randomly drawn from an inverse Wishart distribution by requiring its largest eigenvalue to be equal to
covariance matrix randomly drawn from an inverse Wishart distribution by requiring its largest eigenvalue to be equal to.
We drew samples from the target distribution by using the Metropolis-Hastings algorithm with proposal distribution where by setting we achieve an acceptance ratio between and . When the MCMC algorithm struggles to converge. Table 3 presents the factors by which the variance of is greater than the variance of the modified estimator for dimensions