Subsampling MCMC - A review for the survey statistician

07/23/2018 ∙ by Matias Quiroz, et al. ∙ 0

The rapid development of computing power and efficient Markov Chain Monte Carlo (MCMC) simulation algorithms have revolutionized Bayesian statistics, making it a highly practical inference method in applied work. However, MCMC algorithms tend to be computationally demanding, and are particularly slow for large datasets. Data subsampling has recently been suggested as a way to make MCMC methods scalable on massively large data, utilizing efficient sampling schemes and estimators from the survey sampling literature. These developments tend to be unknown by many survey statisticians who traditionally work with non-Bayesian methods, and rarely use MCMC. Our article reviews Subsampling MCMC, a so called pseudo-marginal MCMC approach to speeding up MCMC through data subsampling. The review is written for a survey statistician without previous knowledge of MCMC methods since our aim is to motivate survey sampling experts to contribute to the growing Subsampling MCMC literature.

READ FULL TEXT VIEW PDF

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

The key drivers behind the widespread adoption of Bayesian inference in the last three decades have been the rapid improvements in computing power and the availability of powerful user-friendly simulation algorithms. The family of Markov Chain Monte Carlo (MCMC) sampling methods

(Brooks et al., 2011), and in particular the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970), quickly became the method of choice for practitioners for simulating from complex posterior distributions. MCMC opened up the possibility of routine analysis of highly complex models with limited algorithmic tuning. MCMC sampling was also fast enough for most problems, and at first it seemed that the problem of computational intractability that had hindered early Bayesians had been solved once and for all.

Meanwhile, it became apparent that MCMC was too slow in certain specialized areas where particular problems still had practioners waiting for days or even weeks for MCMC to deliver the results. For example, MCMC is too slow for many high-dimensional spatial problems where the INLA approximations (Rue et al., 2009) quickly gained popularity. Massive datasets in technology led to fast Variational Bayes (VB) approximations (Jordan et al., 1999; Blei et al., 2017) and Expectation Propagation (EP) (Minka, 2001)

in the machine learning field. The tension with MCMC for big data problems in the machine learning community is now present in many other scientific discplines in the natural and social sciences, and, with increasing text digitalization, also in the humanities. In the current big data era, MCMC is often too slow and is, as a result, increasingly being replaced by other approximate methods. This is unfortunate since, unlike other methods, MCMC samples are guaranteed to converge to the posterior distribution if the MCMC sampler performs adequately. Although there is exciting new work with flexible simulation based VB methods (see

Blei et al., 2017 for a recent review), it is fair to say that VB is still less accurate than MCMC and does not come with practical error bounds. Moreover, it is often very time consuming to obtain good VB approximations for new complex models.

To deal with the challenges of massive datasets, there has been a recent push to develop scalable MCMC samplers. This work has followed two main paths: i) Distributed MCMC and ii) Subsampling MCMC. Distributed MCMC is inspired by the MapReduce scheme (Dean and Ghemawat, 2008) where the data is partitioned and distributed to different machines. MCMC is then run separately on each machine to obtain a subposterior for each partition in a parallel and distributed manner. The key question is then how to combine these subposteriors into a single posterior for all the data; see Scott et al. (2016), Neiswanger et al. (2013), Minsker et al. (2014), Wang and Dunson (2014) and Nemeth and Sherlock (2018) for some attempts. Subsampling MCMC instead focuses on taking random subsamples of the data in each MCMC iteration. The FireFly Monte Carlo algorithm in Maclaurin and Adams (2014) introduces an auxiliary variable for each observation which determines if it should be included in the evaluation of the posterior; Gibbs sampling (Geman and Geman, 1984) is then used to switch between updates of the parameters and the auxiliary variables. Korattikara et al. (2014) and Bardenet et al. (2014, 2017) use increasingly larger subsets of the data until the accept-reject decision in MCMC can be taken with sufficiently high confidence. We refer to Bardenet et al. (2017) for an excellent review of these and other subsampling approaches. After the publication of Bardenet et al. (2017), there has been interesting new progress on non-reversible MCMC for subsampling applications using continuous time piecewise deterministic Markov processes, see Bierkens et al. (2018) and Bouchard-Côté et al. (2018). Moreover, a different approach using Noisy MCMC (Alquier et al., 2016) and data subsampling is explored in Maire et al. (2018).

We will here focus on so called pseudo-marginal MCMC (PMCMC) methods where the likelihood evaluation is replaced by an unbiased estimate from a data subsample in each MCMC iteration

(Andrieu and Roberts, 2009). Using a small subset to estimate the otherwise computationally costly likelihood in a big data setting can give dramatic speed-ups. As explained here, PMCMC has been shown to give samples from the correct posterior distribution even if the likelihood estimator is very noisy. However, as we demonstrate in this review, controlling the variability of the log of the likelihood estimator is absolutely crucial for the performance of Subsampling MCMC based on pseudo-marginal methods. This makes it important to introduce subsampling MCMC to survey sampling experts. The specific approach presented here has been developed in a series of papers (Quiroz et al., 2018a, b, c; Dang et al., 2017) and should be of particular interest to survey statisticians since the estimation problem in our approach focuses on estimating the log-likelihood. The log-likelihood is usually a sum, and is therefore akin to a population total, the fundamental quantity in survey sampling. We also present a subsampling approach that directly estimates the likelihood unbiasedly (Quiroz et al., 2018c), which is usually a product; this is a less standard problem in survey sampling that may open up new challenges for survey statisticians. Finally, we note that estimating the log-likelihood based on data subsampling has also been explored in subsampling Sequential Monte Carlo (SMC) for static Bayesian models (Gunawan et al., 2018). SMC (Doucet et al., 2001) is a powerful alternative to MCMC which produces an estimate of the marginal likelihood, useful for model selection, as a byproduct. However, for brevity, this review focuses on MCMC.

The paper is organized as follows. The next section introduces the Metropolis-Hastings algorithm, and its extension to pseudo-marginal Metropolis-Hastings which can be used when the likelihood is replaced by an unbiased estimator. Section 3 gives details on estimators for the likelihood and their properties, and discusses several recently proposed variance reduction strategies such as using control variates and dependent subsamples. Section 3 also presents a promising approach for subsampling for Hamiltonian Monte Carlo (HMC) sampling which has recently been at the forefront in high-dimensional problems. The final section concludes. Appendix A summarizes the main algorithms and Appendix B gives some implementation details for our running illustrative example in the text.

2. The Pseudo-Marginal Metropolis-Hastings (PMMH) Algorithm

2.1. The Metropolis-Hastings algorithm

Markov Chain Monte Carlo (MCMC) is a family of algorithms for random variate generation from potentially complicated multivariate distributions. MCMC simulates from a distribution , here taken as a Bayesian posterior distribution, by constructing a Markov Chain on the parameter space of such that its invariant distribution is . Realizations from this Markov chain will therefore converge in distribution to from any starting point of the Markov chain, such that after a burn-in period the path of the Markov chain is a dependent sample from . The celebrated Metropolis-Hastings (MH) algorithm (Metropolis et al., 1953; Hastings, 1970) in Algorithm 1 in Appendix A, is the most widely used MCMC algorithm.

While the MH algorithm is valid for any proposal density , where is the current value of the parameter and is its proposed value, the specific proposal used is crucial for the efficiency of the algorithm. The two most commonly used proposals are the Random Malk Metropolis (RWM) and the Independence sampler (IMH); see Brooks et al. (2011) for an introduction. The most common implementation of RWM uses a random walk proposal , where captures the shape of posterior in an efficient implementation (often

is minus the inverse posterior Hessian or simply the identity matrix) and

is a tuning parameter. A small

is often needed to keep the acceptance probability reasonably large, and the algorithm therefore tends to traverse the parameter space very slowly. This is especially pronounced in high dimensions as the optimal

, where is the number of parameters (Roberts et al., 1997). The IMH sampler generates proposals independent of the current position: . Here it is crucial that is a fairly accurate approximation to the true posterior and that it has heavier tails, otherwise the sampler will generate long sequences of rejected draws, i.e. the sampler gets stuck for long spells. When the IMH proposal is a good approximation of the posterior, the sampler traverses the parameter space very swiftly.

2.2. Estimating a computationally costly likelihood

The MH algorithm in Algorithm 1 is extremely convenient for Bayesian computations since it does not require knowledge of the normalizing constant of the posterior, , which is often intractable. Even so, there are many problems where the required evaluations of the likelihood are also very costly, for example with large datasets or when the underlying probability model is a complex dynamical system, causing MH to be very slow. Moreover, for some models the likelihood can be intractable, e.g. in random effects models. Such situations are increasingly common in many of important applications and the slow execution of MH has prompted users to develop faster posterior approximation methods, for example variational Bayes (Blei et al., 2017) and expectation propagation (Gelman et al., 2017). While such methods are computationally attractive and steadily improving, they usually provide substantially less accurate approximations than MCMC.

A natural way to circumvent the problem of evaluating a costly likelihood is to replace the likelihood by a computationally cheap estimate, . We will here illustrate this idea in two very different settings.

Big data

Consider first the big data case when we run the Metropolis-Hastings algorithm on a dataset with independent observations, with very large. Evaluating the likelihood is generally an operation and can be very costly. A natural solution is to estimate the likelihood from a subsample of size obtained by simple random sampling. We first focus on estimating the log-likelihood instead of the likelihood; the reason for estimating on the log-scale is that the log-likelihood is usually a sum and therefore equivalent to estimating a population total, a long studied problem in survey sampling (Särndal et al., 2003). The log-likelihood for independent observations is

(2.1)

where is the log-likelihood contribution of the th observation. Let

be binary variables such that

if observation is selected in the subsample, and zero otherwise. Assuming simple random sampling (SRS) without replacement, the usual unbiased estimator is of the simple form

(2.2)

While it is convenient from a survey sampling point of view to estimate the log-likelihood, we will see in Section 2.3 that Subsampling MCMC actually requires an unbiased estimate of the likelihood on the original scale. This entails estimating a product, which is a much less studied problem in survey sampling. In order to remain in the realm of survey sampling we can use the unbiased estimator for the log-likelihood in (2.2) with a bias-correction to obtain an estimator for the likelihood of the form (Ceperley and Dewing, 1999; Nicholls et al., 2012)

(2.3)

where . This bias-correction is exact if i)

is normally distributed and ii)

is known. In practice, is replaced by the usual sample estimate. We return to this issue in more detail in Section 3.4.

Note that the log-likelihood can often be written as a sum even when the observations are not fully independent. The most straightforward example is longitudinal data where the time series of observations within a subject are typically dependent temporally, but the different subjects are independent. In this case the log-likelihood is a sum over subjects and we can estimate it from a subsample of subjects, rather than individual observations. Data with a direct Markovian structure can be handled similarly by subsampling an observation jointly with its relevant history, as is done in the block bootstrap for time series.

Random effects and importance sampling

Another common setting where the likelihood is intractable, but can be estimated unbiasedly, are random effects models. As an example, consider a logistic regression with both fixed and random effects

where are observations for the th subject, are random effects of the covariates in , and are covariates with fixed effects. The likelihood for a sample of observations with the random effects integrated out is then

(2.4)

where

The integrals in (2.4) are often intractable, but can be estimated unbiasedly by Monte Carlo integration, or importance sampling. Let denote the number of samples in the importance sampling estimate of each term, and the total number of random numbers used to estimate the likelihood in (2.4). Here importance sampling can be used to construct an unbiased estimate of the likelihood in random effects models. Similarly, for state space models, the particle filter gives an unbiased estimator of the likelihood using random particles, see Del Moral (2004, Proposition 7.4.1) for the original result and Pitt et al. (2012) for an alternative proof.

It is important to highlight the randomness of the estimator so we write , where are the random numbers used to form the estimate. In the large data setting,

is the vector of sample selection indicators discussed above and

is given by the simple random sampling design. More specifically, is given by (2.3) with the log-likelihood estimate in (2.2) showing the explicit dependence of the estimator on the random numbers . In random effects models the would instead be the random numbers used to approximate the intractable random effects integrals by Monte Carlo integration.

2.3. The Pseudo-Marginal Metropolis-Hastings algorithm

Andrieu and Roberts (2009)
prove the remarkable result that replacing the likelihood in the MH algorithm with a noisy estimate still gives a sample from the posterior if the likelihood estimator is positive and unbiased. This is done by defining an augmented target density that includes both and such that its marginal for with integrated out is the posterior of . The MH algorithm is run on this augmented target distribution and the draws are not used for inference. It turns out that this so called pseudo-marginal algorithm is exactly of the same form as the original MH algorithm, with the likelihood evaluation in the acceptance probability in each iteration replaced by its current estimate; see the Pseudo-Marginal Metropolis-Hastings (PMMH) in Algorithm 2 for details. The idea of substituting the likelihood in MH with a noisy estimate appeared initially in physics (Lin et al., 2000) and in genetics (Beaumont, 2003).

Even though samples from PMMH with any unbiased positive likelihood estimator will converge to the posterior distribution, it turns out that having a low estimator variance is absolutely crucial for the efficiency of the standard PMMH sampler, see for example Flury and Shephard (2011) and Section 3.3. An estimator with a large variance can easily lead to an accepted parameter draw with a large over-estimate of the likelihood; subsequent draws will be rejected until they also happen to be associated with another gross over-estimate. This causes the sampler to be stuck for long spells, making the MCMC algorithm very inefficient.

The variance of the likelihood estimate is controlled by , the number of subsamples in the subsampling setting, or the number of draws in importance sampling estimators. An that is too small inflates the variance of the likelihood estimator and gives an inefficient sampler. An that is too large gives an unnecessarily precise estimator at an excessive computational cost. The optimal finds the right balance between MCMC efficiency and computational cost, and is usually derived under the assumption that the cost of a single MCMC iteration is proportional to , see e.g. Pitt et al. (2012) for details. This cost must be balanced against the efficiency of the MCMC (which can be shown to increase with , as we will illustrate later). The usual measure of MCMC sampling inefficiency for a given parameter is given by the Integrated AutoCorrelation Time (IACT)

(2.5)

where is the th autocorrelation of the MCMC chain for . In practice, the IACT is estimated using the spectral density evaluated at zero, see for example Plummer et al. (2006). We define the Computational Time (CT) for producing a sample equivalent to an iid draw from the posterior distribution as

(2.6)

where . We note that the IACT in (2.5) becomes a function of the variance of the log of the likelihood estimator when implementing pseudo-marginal MCMC. Here we follow Pitt et al. (2012) and Doucet et al. (2015) in assuming that the cost of a single iteration is proportional to , which in turn is inversely proportional to . Depending on the assumptions made, and the choice of proposal distribution for , the optimal subsample size which minimizes CT is obtained by targeting a between and (Pitt et al., 2012; Doucet et al., 2015; Sherlock et al., 2015). It is also known that CT is relatively flat over the interval , but increases sharply outside this interval, in particular when is too large. We will illustrate some properties of the CT later in the text.

The definition of CT in (2.6) is the one traditionally used in pseudo-marginal MCMC. In some of the Subsampling MCMC methods the focus is on estimating the log-likelihood, which is subsequently converted into an estimator of the likelihood by bias-correction, see (2.3). The relevant Computational Time is then

(2.7)

where . The two definitions of CT are identical if in (2.3) is known, and typically differ very little when in (2.3) is replaced by a sample estimate . To keep things simple, we will therefore use the same rule to set the subsample size to target when using subsampling based on estimating the log-likelihood.

3. Subsampling for likelihood estimation

The previous section described how an estimated likelihood can be used in a pseudo-marginal algorithm to sample from a posterior distribution. As long as the estimator is unbiased and nonnegative, and some non-onerous regularity conditions apply, the samples will converge in distribution to the target posterior based on the true likelihood function. This section discusses the importance of variance reduction and proposes alternative estimators from the survey literature and adapts them to the Subsampling MCMC context.

Figure 3.1. Optimal subsampling fractions () for SRS without replacement for (left) and (right), where is the population variance in (3.1). The optimal subsample size () is set to target .

3.1. Simple Random Sampling is by itself not useful for Subsampling MCMC

We have already discussed that the optimal subsample size should target a variance of the log-likelihood estimator in the interval . It turns out, however, that it is almost impossible in the subsampling setting to achieve a in that interval with Simple Random Sampling (SRS) without ending up with a sampling fraction very close to unity. To see this, note that the variance of the estimator in (2.2) under the SRS design without replacement is (Särndal et al., 2003)

where is the population variance. Now, in order to target a given variance , the subsample size must be

(3.1)

Figure 3.1 illustrates the optimal sampling fraction as a function of for two different values of when the target is . Note that this is the largest value among the recommended ones in the literature to keep the sampling fraction conservatively low here. The sampling fraction nevertheless quickly approaches unity, showing that SRS with the population total estimator in (2.2) is not useful for Subsampling MCMC. An even more dramatic way of illustrating this is to consider sampling with replacement. SRS with replacement gives and the optimal grows as , which is clearly unacceptable.

The variance of the estimator when sampling without replacement is lower by the factor compared to the with-replacement case. This is a negligible improvement whenever , which is the situation of interest here since otherwise subsampling would not be worthwhile. Since sampling with replacement is simpler to implement, and the implied independence makes the theory much easier to develop, this has been the preferred sampling method in the Subsampling MCMC literature. We will therefore use sampling with replacement throughout the paper. The sampling indicators are now random observation indices such that for and the estimator in (2.2) becomes

(3.2)

3.2. Efficient and scalable Subsampling MCMC using control variates

The difference estimator

Part of the problem with SRS is that the log-likelihood contributions can vary quite dramatically over the observations, hence inflating the variance of the estimator. There are at least three main ways to deal with the heterogeneity of population elements.

The first approach is stratified sampling with a higher sampling inclusion probability in the strata with largest units. This would ensure that most or all of the large enter the sample. However, it turns out that stratified sampling tends to produce a variance that is too large for efficient Subsampling MCMC.

The second approach, proposed for Subsampling MCMC in the first version of Quiroz et al. (2018a) (see Quiroz et al., 2014 for the first version), is to use probability-proportional-to-size (PPS) sampling that assigns higher inclusion probabilities to larger units (Särndal et al., 2003). To implement PPS (or in the case of sampling without replacement) we need to approximate the size of for all observations. In order to gain in computational speed from subsampling, those size measures must clearly be cheaper to compute than the , and such size measures are proposed in (Quiroz et al., 2014). Nevertheless, the computational complexity of the subsampling algorithm remains .

The third approach is proposed in Quiroz et al. (2018a) and amounts to subtracting an approximation from each such for each that the new population is more homogeneous in size than . Formally, we use simple random sampling and the difference estimator (Särndal et al., 2003)

(3.3)

with is a potentially crude approximation to for . It is easy to show that is unbiased for any . The approximation plays the same normalizing role as control variates in importance sampling (Hammersley and Handscomb, 1964) and we will use this term here.

Parameter-expanded control variates

A natural way of constructing control variates is by a Taylor expansion of , , around some central value (Bardenet et al., 2017)

(3.4)

where and are the gradient and Hessian with respect to evaluated at , respectively. As argued in Bardenet et al. (2017), these parameter-expanded

control variates work very well when the posterior is tightly concentrated; asymptotic posterior concentration is guaranteed by the Bernstein von Mises theorem

(Van der Vaart, 1998) and will be practically relevant in big data problems with many observations, but not too many parameters, i.e. so called tall data. As discussed later, it also has good scaling properties with respect to .

A crucial property of parameter-expanded covariates is that the sum in the difference estimator in (3.3) can be reduced from an operation to an operation since both and are evaluated at , and can therefore be pre-computed before starting the MCMC iterations.

Data-expanded control variates

Let be a vector with all observed data for the th item. For example, in a regression setting,

would contain both the response variable

and the covariates . Further, let denote log-likelihood contribution for the th observation. The idea with the data-expanded control variates proposed in Quiroz et al. (2018a) is that the tend to vary slowly across data space, and can therefore be approximated by , where is the nearest centroid in a pre-clustering of the data. Similarly to the parameter-expanded control variates, we can improve on this by using a Taylor expansion of , but this time in data space around the centroid . The data-expanded control variates are of the form

(3.5)

where and are the gradient and Hessian with respect to , both evaluated at .

Quiroz et al. (2018a) show that the complexity of is for data-expanded control variates, where is the number of clusters and typically . Hence, data-expanded control variates also give scalable algorithms since the number of clusters tends to grow very slowly with .

Figure 3.2. The accuracy of the parameter expanded control variates for the Poisson regression model in Eq. (3.6). Each subgraph plots the true against the control variate for that observation. The three columns correspond to , and terms in the Taylor expansion. The three rows correspond to different that are increasingly distant from the Taylor expansion point : i) (top row), ii) (middle row) and iii) (bottom row), where is the Euclidean norm. As a point of comparison, ’s on the % posterior ellipsoid have values for ranging between and . The header of each subgraph displays the optimal subsample () that gives the target variance . The quality of the parameter expanded control variates depends on being small.

Comparing control variates from parameter-expansion and data-expansion

It is crucial to realize that our sampling problem is dynamic, in the sense that we will need estimates of at every iteration of the pseudo-marginal MH algorithm, and typically changes in every iteration. This means that we have sequence of survey sampling problems where the measurements on the population units, , change over time (MH iterations). Such situations also occur in real-world surveys (Steel and McLaren, 2009), but Subsampling MCMC has not as yet used any of the methods proposed in the repeated surveys literature. We return to this dynamic survey sampling perspective when we discuss dependent subsampling in Section 3.6. The fact that changes over the iterations can cause problems for the parameter-expanded control variates, but does not significantly affect the data-expanded control variates. This is illustrated in Figures 3.2 and 3.3 where we plot the true against the two control variates for different number of terms in the Taylor expansions. For illustration purposes the underlying sample of observations comes from a simple Poisson regression

(3.6)

where , but the point we make holds generally. Figure 3.2 clearly shows that parameter-expansion around a static is problematic when the current is far from . Figure 3.3 shows that the data-expanded control variates remains relatively unaffected by movements in .

Figure 3.3. The accuracy of the data expanded control variates with 75 centroids for the Poisson regression model in Eq. (3.6). Each subgraph plots the true against the control variate for that observation. The three columns correspond to , and terms in the Taylor expansion. The three rows correspond to different that are increasingly distant from the Taylor expansion point : i) (top row), ii) (middle row) and iii) (bottom row), where is the Euclidean norm. As a point of comparison, ’s on the % posterior ellipsoid have values for ranging between and . The header of each subgraph displays the optimal subsample () that gives the target variance . The quality of the data expanded control variates is not sensitive to .
Figure 3.4. The data points in (x,y)-space and the centroids.
Figure 3.5. The accuracy of the data expanded control variates with different number of centroids for the Poisson regression model in Eq. (3.6) when is such that , where is the Euclidean distance. The three columns correspond to , and terms in the Taylor expansion. The header of each subgraph displays the optimal subsample () that gives the target variance and the order of the Taylor expansion (Taylor). The quality of the data expanded control variates deteriorates when the number of centroids is small, especially when using a lower order Taylor expansion.

However, data-expanded control variates only give accurate approximations if enough centroids are used in the clustering; see Figures 3.4 and 3.5

for an illustration. The curse of dimensionality makes this a limitation in higher dimensional data spaces since many observations will be quite far from their nearest centroid even when using a larger number of centroids.

In summary, data-expanded control variates perform well for any , but do not scale well with the dimension of the data space. Parameter-expanded control variates scale well with dimension, but perform poorly when is far from the expansion point . Quiroz et al. (2018a) therefore propose the strategy of starting the posterior sampling with data-expanded control variates and then switching over to parameter-expanded control variates when the sampler has reached a more central point in the posterior which can be used as .

Asymptotic behavior with control variates

We have shown that the optimal subsample size needs to grow as when using simple random sampling with replacement in order to keep around unity; control variates can improve on this asymptotic rate. With control variates, the variance of the difference estimator in (3.3) is given by

(3.7)

where is the variance of the finite population of differences. Note that we have made explicit that the accuracy of the control variates depends on . As explained above, to obtain the optimal we need to ensure that is , which requires understanding the behaviour of as . Lemma 2 in Quiroz et al. (2018a) shows that

(3.8)

where

The asymptotic behaviour of depends on the type of control variate, and also on choices within a given control variate such as how the number of centroids grows with in the case of data-expanded control variates. We will focus here on the asymptotic properties of parameter-expanded control variates and refer to Quiroz et al. (2018a) for results on the data-expanded case.

Since the parameter-expanded control variate is based on a Taylor expansion around , the rate at which its accuracy improves with is determined by the rate at which contracts, where we have made explicit that the expansion point typically depends on . Quiroz et al. (2018a) prove the following lemma.

Lemma 3.1.

For the parameter-expanded control variates of second order we have

From the Bernstein-von Mises theorem (Chen, 1985), if is the posterior mode based on all data, we have as . This implies that will be close to unity for large enough . We therefore have that for all . Hence, for such , the optimal subsample size that targets is, by (3.8), , suggesting that Subsampling MCMC with parameter-expanded control variates scales extremely well to large datasets. There are at least three objections to this analysis, however. First, the conditions under which this optimality is derived requires that is large enough for to be approximately normally distributed, so the optimal is not attainable. Second, having control variates that expand around the posterior mode of based on all data is not practical in large data settings. Third, as discussed in Quiroz et al. (2018a), setting gives a PMMH algorithm that samples from a target distribution that deviates from the true posterior by an factor, which is clearly not acceptable. A more practical approach with control variates based on the posterior mode from a small subset of the data is analyzed in Quiroz et al. (2018a) and presented in Section 3.4 below.

Other control variates

We have emphasized parameter- and data-expanded control variates as general and scalable solutions for variance reduction in Subsampling MCMC. However, many other control variates can be used in particular applications. For example, in many models the evaluation of the log-likelihood contributions is very time-consuming because some aspect of the model needs to be solved numerically. The likelihood can then be costly also for smaller . For example, an intractable integral may be approximated by Gaussian quadrature, a differential equation can be solved by the Runge-Kutta method, an optimum found by Newton’s method. Any numerical method depends on tuning parameters which control the accuracy of the solution. A natural control variate can then be obtained from tuning parameters that give cruder, but much faster, evaluations of (a coarse grid in numerical integration and in solving differential equations, a small number of Newton steps for optimization). The log-likelihood contributions for the sampled subset of observations are computed based on tuning parameters that give very accurate evaluations. Note however that for such control variates we need in general to evaluate the control variate for all observations (but may be small), so the algorithm will still run in time, but with a much smaller cost for each MCMC iteration.

Figure 3.6. Sampling the posterior distribution of in the Poisson regression model in (3.6). The figure shows, for different values of on the four rows, the sampling chains (left), the estimates of the log-likelihood (middle) and estimates of the chain’s autocorrelation (right).

3.3. Control variates are crucial for the Integrated AutoCorrelation Time (IACT)

We have argued that control variates provide significant variance reduction for the log-likelihood estimator and that the MCMC sampling efficiency (as measured by the IACT) is a function of the variance. Figure 3.6 illustrates that when targeting a variance of one for (second row) our subsampling MCMC essentially behaves as the full data MCMC (first row). However, Subsampling MCMC with a large estimator variance ( in the third row and in the fourth row) does not efficiently explore the posterior distribution of our Poisson regression example, and has a much greater tendency of getting stuck. This stickiness is also clearly borne out in the autocorrelation function of the MCMC draws in the right panel of Figure 3.6.

3.4. An approximate approach using bias-corrected log-likelihood estimators

We have so far shown the importance of variance reduction of the log-likelihood estimator using control variates. The reason for focusing on estimators of the log-likelihood, rather than the likelihood, is that the log-likelihood is a sum, which is the usual aim in survey sampling, allowing us to exploit century-old experience in that area.

However, pseudo-marginal MCMC will only generate a sample from the correct posterior if the likelihood is estimated by a positive unbiased estimator (Andrieu and Roberts, 2009). The difference estimator in (3.3) is unbiased for the log-likelihood, but biased for the likelihood. As discussed in Section 2.2, we can bias-correct the biased estimator , where is any unbiased estimator of the log-likelihood. In particular, using the difference estimator the bias-corrected estimator is of the form

(3.9)

where is the difference estimator in (3.3) and . The estimator in (3.9) is unbiased if i) is normally distributed and ii)

is known. The assumption of normality can often be defended by a central limit theorem in the large

setting (assuming also that grows). Even when is very small we have observed that is very close to normal since the control variates homogenize the population so that , are usually distributed much more symmetrically and have lighter tails than the population of . Assuming that is known is harder to defend since knowing requires computing for all observations. The approach in Quiroz et al. (2018a) replaces in (3.9) by

(3.10)

where , giving the estimator

(3.11)

Substituting an estimate makes the estimator in (3.11) only approximately unbiased, and raises the question: what do samples from a PMMH algorithm using the estimator converge to, if anything? Quiroz et al. (2018a) note that this PMMH is still a valid MCMC on the joint () space, but targets the density

(3.12)

The marginal density of is

(3.13)

Note that in general because of the (slight) bias in the likelihood estimator . This shows that PMMH based on is still a valid MCMC scheme, but the draws target the perturbed posterior instead of the actual posterior .

Our next result from Quiroz et al. (2018a) gives the rate at which the perturbed target approaches the true target posterior . Note that depends on and depends on both and . We make this dependence explicit by using the relevant subscripts in our asymptotic results.

Theorem 3.1.

Suppose that a PMMH algorithm is implemented with the estimator in (3.11) using the second order parameter expanded control variates where the expansion point is the posterior mode, and assume that the regularity conditions in Assumption 2 in Quiroz et al. (2018a) are satisfied. Then,

  1. Suppose that is a function such that . Then

Theorem 3.1 shows that the perturbation error vanishes rapidly with the subsample size at rate for fixed . The theorem also shows that when for example , the perturburation error is .

To analyze the scalability of the algorithm for practical work, Quiroz et al. (2018a) make the more realistic assumption that control variates are expanded around , the posterior mode based on a small subset of observations, rather the costly posterior mode based on all observations. The following corollary is proved in Quiroz et al. (2018a).

Corollary 3.1.

Suppose that a PMMH algorithm is implemented with the estimator in (3.11) using the second order parameter expanded control variates with expansion point based on a subset of observations. Assume that , and that the regularity conditions in Assumption 2 in Quiroz et al. (2018a) are satisfied. Then,

  1. Suppose that is a function such that . Then

If for some , then achieves the optimal variance of , and the perturbation errors in Corollary 3.1 decreases with if and only if . For example, if we take , then and the posterior perturbation error is .

The asymptotics in Theorem 3.1 and Corollary 3.1 are reassuring for the method, but does not provide a practically useful way to quantify the discrepancy between and . Quiroz et al. (2018a) derive an accurate approximation to the point-wise fractional error in the perturbed posterior distribution

(3.14)

and show that the increases with for large . It is important to note however that it is only the part of that depends on that affects the perturbation error; an additive constant to will give rise to a multiplicative constant to in (3.11) that also appears in and will therefore cancel in (3.13). Hence, a large only implies a large perturbation error if varies with . This can be an advantage for data-expanded control variates since their errors are by construction relatively insensitive to , as demonstrated in Figure 3.3.

The next subsection presents an alternative approach which produces an unbiased estimator of the likelihood. Although exact, this method has two drawbacks compared to the approximate method presented in this subsection. First, the relative computational time of the algorithm is higher than the approximate method above, see Figure S8 in the supplementary material of Quiroz et al. (2018c). Second, the exact approach can only estimate expectations of functions of the parameters, rather than the whole posterior distribution.

3.5. Signed PMMH with the Block-Poisson estimator

The approach in the previous subsection used an unbiased estimator of the log-likelihood, which was subsequently approximately bias-corrected to estimate the likelihood

(3.15)

We now review how to estimate this product unbiasedly using the Block-Poisson estimator proposed in Quiroz et al. (2018c).

The Block-Poisson estimator is defined as

(3.16)

where , with being the control variates in (3.3). The Block-Poisson estimator is essentially a product of Poisson estimators, (Wagner, 1988; Papaspiliopoulos, 2009). Each Poisson estimator in the product is based on a random number of unbiased estimates of , i.e. the second term in the difference estimator (3.3), but from a mini-batch of observations. The scalar is a lower bound of the to ensure that for all .

Quiroz et al. (2018c) show that the Block-Poisson estimator is unbiased for the likelihood for all . The product construction in the estimator is not used for variance reduction, but to induce dependency in the subsamples over the MCMC iterations, see Section 3.6 below; in fact, Quiroz et al. (2018c) prove that the variance of is finite and exactly the same as the variance of the usual Poisson estimator in Papaspiliopoulos (2009).

To ensure that in (3.16) is positive with probability , which is necessary for PMMH, needs to be a lower bound of . Obtaining a lower bound is problematic for two reasons. First, a lower bound requires evaluating for all data points. Second, can be prohibitively large as the most extreme outcome of needs to be covered. This is problematic because Quiroz et al. (2018c) show that is minimized for for any given . Hence, must typically be very large in order for to be a lower bound, and a large means many mini-batches and a high computational cost.

Quiroz et al. (2018c) instead advocate the use of a soft lower bound, which is a lower bound resulting in less than one, but close to it. Since the estimator might not be positive, the target cannot be defined as in (3.12). However, further augmenting the density with the variable , we obtain (cf. Section 3.4)

(3.17)

where is a normalization constant. Note that if , and hence samples from the true posterior are obtained, instead of an approximation as in Section 3.4. We argued above that is too expensive and therefore Quiroz et al. (2018c) follow Lyne et al. (2015), who cleverly note that

(3.18)

We can therefore obtain samples from in (3.17) and estimate (3.18) by

(3.19)

that satisfies as . The approach in Lyne et al. (2015) of running PMMH on the absolute posterior followed by a sign-correction by importance sampling to consistently estimate expectations of functionals is termed Signed PMMH by Quiroz et al. (2018c).

Under the optimal variance condition it remains to choose values for the tuning parameters and . The natural approach is to choose and to minimize a computational time similar to (2.5). Quiroz et al. (2018c) show that the computational time of Signed PMMH with the Block-Poisson estimator is

(3.20)

where is the variance of the log of the absolute value of the Block-Poisson estimator. To minimize we need to compute i) IACT, ii) and iii) . All three quantities are derived in closed form in Quiroz et al. (2018c) where practical strategies for optimally tuning of and to minimize CT are also proposed. The derivations are made under idealized assumptions, but the tuning is demonstrated to be near optimal. Furthermore, the guidelines for selecting and are shown to be conservative in the sense of not giving too low values for and , which is known to be crucial in pseudo-marginal methods.

We end this subsection with a discussion of the possibility of using the Block-Poisson estimator in survey sampling, outside of a Subsampling MCMC context. We are not aware of survey sampling applications where the interest is in estimating a population product. However, the Poisson estimator is a special case of so called debiasing estimators (Rhee and Glynn, 2015). Such estimators are useful for unbiased estimation of a quantity (e.g. the likelihood) which is a non-linear function of a quantity that can easily be estimated unbiasedly (the log-likelihood). The debiasing approach resolves this issue for general functions. It is for example possible to apply this idea to debias calibration estimators (Deville and Särndal, 1992) such as the ratio estimator in survey sampling.

3.6. Dependent subsampling

We have argued that controlling the variance of the log of the likelihood estimator is crucial for the efficiency of PMMH. A closer inspection of Algorithm 2 shows that it is more correct to say it is the variance of the difference in the log likelihood estimates at and that matters for PMMH. Using independent proposals for makes it easy to get a gross over-estimate of the likelihood at some iteration and get stuck, as illustrated in Figure 3.6. Refreshing only parts of the subsample in each iteration reduces the variance of the difference in the log of the estimates of the likelihood between the proposed and current point. This is achieved by making (last accepted draw) and (proposed draw) dependent. We now present two approaches from the Subsampling MCMC literature for generating dependence in over the MCMC iterations, which were developed independently of the literature on repeated survey sampling for estimating changing populations over time (Steel and McLaren, 2009) in the survey sampling field. Much of this literature is focused on problems unrelated to Subsampling MCMC, for example how to avoid responders fatigue in repeated surveys, but this is certainly an area where the knowledge of survey statisticians can advance Subsampling MCMC.

The correlated pseudo-marginal

Deligiannidis et al. (2018) present a general Correlated Pseudo-Marginal (CPM) approach to dependent particles in PMMH. Their focus is on random effects models and particle filters in state-space models where the