Optimal proposals for Approximate Bayesian Computation

08/18/2018
by   Justin Alsing, et al.
Simons Foundation

We derive the optimal proposal density for Approximate Bayesian Computation (ABC) using Sequential Monte Carlo (SMC) (or Population Monte Carlo, PMC). The criterion for optimality is that the SMC/PMC-ABC sampler maximise the effective number of samples per parameter proposal. The optimal proposal density represents the optimal trade-off between favoring high acceptance rate and reducing the variance of the importance weights of accepted samples. We discuss two convenient approximations of this proposal and show that the optimal proposal density gives a significant boost in the expected sampling efficiency compared to standard kernels that are in common use in the ABC literature, especially as the number of parameters increases.



1 Introduction

The naïve approach to Approximate Bayesian Computation (ABC) generates (compressed) data simulations x for parameters θ drawn from the prior π(θ). If the resulting simulated data lies within ε of the observed data x_o, i.e. ρ(x, x_o) ≤ ε under a distance metric ρ, then θ is accepted as a sample from the approximate posterior density p̂(θ) ≡ p(θ | ρ(x, x_o) ≤ ε).
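
To make the procedure concrete, here is a minimal sketch of rejection ABC; the callables simulator, prior_sample and distance are problem-specific placeholders rather than anything defined in the paper.

```python
import numpy as np

def rejection_abc(simulator, prior_sample, x_obs, distance, epsilon, n_proposals):
    """Naive rejection ABC: draw theta from the prior, simulate data, and
    keep theta whenever the simulated data lands within epsilon of the
    observed data under the chosen distance metric."""
    accepted = []
    for _ in range(n_proposals):
        theta = prior_sample()         # theta ~ pi(theta)
        x_sim = simulator(theta)       # x ~ p(x | theta)
        if distance(x_sim, x_obs) <= epsilon:
            accepted.append(theta)
    return np.array(accepted)
```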

If simulation is costly, it is advantageous to increase the fraction of accepted samples by proposing new candidate θ values from a proposal density q(θ) that is large in parameter ranges preferred by the data, and small in less interesting regions of the prior volume. It is natural to base the choice of q on the current accepted samples from the approximate posterior. This is the approach taken in SMC-ABC algorithms (see eg., Marin et al., 2012 for a review): a proposal density is constructed at each population iteration, typically a kernel density estimator (KDE) based on the accepted points from the previous population (ie., a kernel that adapts as the algorithm steps through successive population iterations; Beaumont et al., 2009; McKinley et al., 2009; Toni and Stumpf, 2009; Barnes et al., 2011; Didelot et al., 2011; Jasra et al., 2012; Filippi et al., 2013; Bonassi et al., 2015).

The price to pay for the increased fraction of accepted ABC samples is the necessity to importance weight the accepted samples by w(θ) = π(θ)/q(θ). The variance in these importance weights reduces the effective number of samples. The more concentrated q(θ) is relative to π(θ) on parameters that have a high probability of being accepted, the larger the variance in the importance weights. In practice, a poor choice of proposal density can lead to a proposed sample whose proposal density was low but which is subsequently accepted, producing a large importance weight that overwhelms the rest of the (weighted) samples and leaves a very small effective sample size. The natural question, then, is how to choose the proposal density that represents the optimal trade-off between a high acceptance rate and a low importance-weight variance.
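
The effective sample size referred to here is the standard (Σᵢ wᵢ)²/Σᵢ wᵢ² estimator. A small helper, assuming vectorized log-density callables, makes the sensitivity to a single outlying weight easy to check numerically:

```python
import numpy as np

def effective_sample_size(theta_acc, log_prior, log_proposal):
    """Effective sample size (sum w)^2 / sum w^2 of accepted samples under
    importance weights w = pi(theta)/q(theta); log_prior and log_proposal
    are assumed to be vectorized log-density callables. A single outlying
    weight drives the ESS towards 1."""
    log_w = log_prior(theta_acc) - log_proposal(theta_acc)
    w = np.exp(log_w - log_w.max())    # stabilised; ESS is scale-invariant
    return w.sum() ** 2 / (w ** 2).sum()
```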

Most SMC-ABC implementations propose new parameters for forward simulation via an importance-weighted KDE based on the previous population’s accepted samples, with uniform (Toni et al., 2009; Toni and Stumpf, 2009; McKinley et al., 2009), Student-t (Didelot et al., 2011) and Gaussian kernels (Sisson et al., 2007; Beaumont et al., 2009; Filippi et al., 2013; Bonassi et al., 2015) in common use. A choice has to be made for the kernel bandwidth, and there are various choices in the literature: for example, for Gaussian kernels, Sisson et al. (2007) use the importance-weighted variance of the previous population samples, Beaumont et al. (2009) use twice the importance-weighted variance of the previous population samples, whilst Bonassi et al. (2015) use standard recommendations from West (1993) and Scott and Sain (2005)¹.

¹ie., taking the (component-wise) importance-weighted variance scaled down by the standard KDE bandwidth factor for N samples in d dimensions (of order N^(2/(d+4)) for the variance).
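
For reference, a sketch of a single draw from the Beaumont et al. (2009) scheme described above; the twice-weighted-covariance bandwidth is the only scheme-specific ingredient, the rest is generic bookkeeping:

```python
import numpy as np

def propose_beaumont(theta_prev, w_prev, rng):
    """One draw from the Gaussian-KDE proposal of Beaumont et al. (2009):
    pick a particle from the previous population (shape (N, d)) with
    probability proportional to its weight, then perturb it with a
    Gaussian kernel whose covariance is twice the importance-weighted
    covariance of that population."""
    w = w_prev / w_prev.sum()
    cov = 2.0 * np.atleast_2d(np.cov(theta_prev.T, aweights=w))
    i = rng.choice(len(theta_prev), p=w)
    return rng.multivariate_normal(theta_prev[i], cov)
```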

Previous studies have sought kernels that are optimal in the following sense: they minimize the sum of the Kullback-Leibler divergence between the proposal KDE and the target density, and the negative log acceptance ratio, providing some trade-off between closeness of target and proposal on the one hand and the acceptance ratio on the other (Beaumont et al., 2009; Filippi et al., 2013). Beaumont et al. (2009) showed that for a global Gaussian kernel this optimality criterion leads to a bandwidth equal to twice the importance-weighted variance of the previous population’s accepted samples (with some further refinement and generalization by Filippi et al., 2013). Filippi et al. (2013) also considered the same optimality criterion applied to local rather than global kernels, deriving a locally-optimal kernel covariance. Whilst this optimality criterion has proved powerful, the relative importance and utility of the KL-divergence and acceptance-ratio terms in this approach is ambiguous.

In this paper we take a slightly different approach and derive the proposal density for SMC-ABC sampling that maximizes the effective number of samples per parameter proposal (and hence forward simulation). This provides the optimal trade-off between high acceptance rate and low variance in the importance weights under a straightforward and pragmatic definition of optimality. Rather than restricting to a given class of perturbation kernels (eg., Gaussian kernels), we derive the optimal proposal density (in the asymptotic limit) assuming only that some density estimator for the ABC posterior is available at each population iteration. The result provides a well-motivated guide for adaptive proposal density choice for SMC-ABC sampling.

2 Optimal SMC-ABC proposal densities

We define the sampling efficiency φ[q] as the functional that measures the effective number of samples per parameter proposal. This is composed of two components: 1) the expected fraction of proposed points that will be accepted, and 2) the expected effective number of points after application of the importance weights.

The expected fraction of accepted points is given by

P_acc[q] = ∫ q(θ) P(ρ(x, x_o) ≤ ε | θ) dθ ≈ V_ε ∫ q(θ) p(x_o | θ) dθ = V_ε p(x_o | q),   (2.1)

where p(x_o | q) is the probability density of the simulated data at the observed data when the parameters are proposed from q, V_ε is the volume of the ε-ball for accepted samples, and "≈" becomes accurate in the limit of small ε, assuming p(x | θ) is continuous.
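
Eq. (2.1) can be cross-checked by brute force: the acceptance fraction is just the hit rate of simulated data inside the ε-ball. A sketch, with proposal_sample, simulator and distance as user-supplied placeholders:

```python
def acceptance_fraction(proposal_sample, simulator, x_obs, distance,
                        epsilon, n=10_000):
    """Monte Carlo estimate of the expected acceptance fraction of
    Eq. (2.1): the probability that data simulated at theta ~ q(theta)
    falls inside the epsilon-ball around the observed data."""
    hits = sum(
        distance(simulator(proposal_sample()), x_obs) <= epsilon
        for _ in range(n)
    )
    return hits / n
```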

The expected effective number of points after application of the importance weights to the accepted samples is given by the standard estimator

N_eff = (Σᵢ wᵢ)² / Σᵢ wᵢ²,   (2.2)

with importance weights

wᵢ = π(θᵢ) / q(θᵢ).   (2.3)

In the small-ε limit the accepted samples are distributed as

p_acc(θ) ∝ q(θ) p̂(θ) / π(θ),   (2.4)

where p̂(θ) denotes the ABC posterior, so that the expected effective number of accepted points is

E[N_eff] ≈ N_acc (E_acc[w])² / E_acc[w²].   (2.5)

We therefore find for the sampling efficiency (dropping θ-independent constant pre-factors),

φ[q] = F[q] / G[q],  with  F[q] ≡ ∫ q(θ) p̂(θ)/π(θ) dθ  and  G[q] ≡ ∫ π(θ) p̂(θ)/q(θ) dθ,   (2.6)

where F is proportional to the expected acceptance rate (2.1), G is proportional to the importance-weight variance, and φ defines the sampling efficiency: the number of effective (accepted) samples per parameter proposal (and hence per forward simulation). For naïve ABC, proposing parameters from the prior, F[π] = G[π] = 1, so φ = 1 and the efficiency of any other proposal is measured relative to prior sampling. Maximizing φ[q] with respect to q, under the constraint that q be normalized, gives the optimal proposal

q*(θ) ∝ √[ π(θ) p̂(θ) / ( λ − p̂(θ)/π(θ) ) ].   (2.7)

This is the main result of the paper. This implicit equation for q* (the constant λ depends on F[q*] and G[q*]) can be solved iteratively; alternatively, one can just consider F and G as parameters and fix λ through the normalization condition

∫ q*(θ) dθ = 1.   (2.8)

Further, there are two simple and fast approximations for q* that can be obtained without iteration, or that can provide good starting points for the iteration; these are discussed in §2.1.
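
Given densities tabulated on a grid, the functionals in Eq. (2.6) are one-line quadratures, which makes it easy to score any candidate proposal numerically. A sketch, assuming strictly positive densities on the grid:

```python
import numpy as np
from scipy.stats import norm

def efficiency(theta, pi, phat, q):
    """Sampling efficiency phi = F/G of Eq. (2.6) on a 1-D grid, with
    F = int q * phat/pi (acceptance) and G = int pi * phat/q (weight
    variance). All densities must be strictly positive on the grid."""
    F = np.trapz(q * phat / pi, theta)
    G = np.trapz(pi * phat / q, theta)
    return F / G

# Toy check: broad Gaussian prior, narrow Gaussian posterior.
theta = np.linspace(-10.0, 10.0, 4001)
pi, phat = norm.pdf(theta, 0.0, 3.0), norm.pdf(theta, 1.0, 0.5)
q_gm = np.sqrt(pi * phat)
q_gm /= np.trapz(q_gm, theta)          # geometric mean proposal, Eq. (2.11)
for name, q in [("prior", pi), ("posterior", phat), ("geometric mean", q_gm)]:
    print(name, efficiency(theta, pi, phat, q))   # prior gives phi = 1
```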

Note that we only need to know q* up to a constant to be able to sample from the optimal proposal (2.7), since the prefactor just sets the normalization. Assuming a density estimator for the posterior p̂ is available, and the prior π is easy to evaluate, Markov Chain Monte Carlo (MCMC) methods can efficiently generate draws from q* to serve as parameter proposals for the next SMC iteration.
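
Since only the unnormalized density is needed, a few lines of random-walk Metropolis suffice. In the sketch below, log_qstar is an assumed user-supplied callable returning log q*(θ) up to an additive constant, with -inf wherever the density vanishes; the chain must be started at a point of non-zero density.

```python
import numpy as np

def sample_qstar(log_qstar, theta0, step, n_steps, rng):
    """Random-walk Metropolis draws from the unnormalised optimal
    proposal; log_qstar takes a 1-D parameter array and returns
    log q*(theta) up to an additive constant."""
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    lq = log_qstar(theta)
    out = np.empty((n_steps, theta.size))
    for i in range(n_steps):
        prop = theta + step * rng.standard_normal(theta.size)
        lq_prop = log_qstar(prop)
        if np.log(rng.uniform()) < lq_prop - lq:   # Metropolis accept
            theta, lq = prop, lq_prop
        out[i] = theta
    return out
```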

The optimal proposal Eq. (2.7) can be seen as the geometric mean of posterior and prior (ie., the numerator), with a relative boost where the posterior is larger than the prior: the denominator boosts q* in the region where the posterior peaks. The peaks in q* are nonetheless narrower than the corresponding ones in p̂ in their immediate neighborhoods, to compensate for the heavier tails away from the peaks. Illustrative examples are shown in Figures 1–3 and discussed in §2.2.

2.1 Fast approximations to the optimal proposal density

2.1.1 Geometric mean approximation

Expanding the inverse square-root of the denominator in Eq. (2.7) as a binomial series gives the convergent series

q*(θ) ∝ √(π(θ) p̂(θ)) Σ_{k=0}^∞ C(−1/2, k) ( −p̂(θ) / (λ π(θ)) )^k,   (2.9)

where C(−1/2, k) denotes the binomial coefficient.

Convergence follows from the properties of the binomial coefficient and the lower bound on λ in Eq. (2.12) below. In numerical experiments we find that the k ≥ 1 terms give sub-dominant contributions to q* for a wide variety of choices of π, p̂ and λ. Taking the normalized leading term of the series gives the geometric mean approximation² to the optimal proposal,

q_gm(θ) = √(π(θ) p̂(θ)) / ∫ √(π(θ′) p̂(θ′)) dθ′.   (2.11)

²From Jensen's inequality we have

∫ √(π(θ) p̂(θ)) dθ ≤ 1,   (2.10)

so the normalizing integral in Eq. (2.11) is at most one and q_gm(θ) ≥ √(π(θ) p̂(θ)) for any pdfs π and p̂. While we cannot conclude in general that q_gm outperforms the posterior as a proposal, Figure 4 shows that in the Gaussian example q_gm outperforms the posterior in all practically relevant cases.

Note that for simple cases where the posterior is Gaussian and the prior is uniform, the geometric mean approximation leads to a Gaussian proposal density centered on the posterior mean and with twice the posterior covariance. This is similar to taking a Gaussian KDE proposal with bandwidth equal to the estimated posterior variance, a kernel scheme sometimes adopted in the ABC literature (eg., Sisson et al., 2007; Ishida et al., 2015).
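
The closed-form statement above extends to a Gaussian prior: the renormalized geometric mean of two Gaussians is again Gaussian, with precision equal to the average of the two precisions. A small sketch:

```python
import numpy as np

def geometric_mean_gaussians(m_pi, C_pi, m_post, C_post):
    """Renormalised geometric mean sqrt(pi * phat) of a Gaussian prior
    N(m_pi, C_pi) and Gaussian posterior N(m_post, C_post): a Gaussian
    whose precision is the average of the two precisions. For a very
    broad prior this tends to N(m_post, 2 * C_post)."""
    P_pi, P_post = np.linalg.inv(C_pi), np.linalg.inv(C_post)
    C = np.linalg.inv(0.5 * (P_pi + P_post))
    m = C @ (0.5 * (P_pi @ m_pi + P_post @ m_post))
    return m, C
```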

2.1.2 Bounded approximation

Given the value of λ, we could sample the optimal proposal density Eq. (2.7) directly. Whilst λ is not available a priori, we can bound it from above and below, using Hölder's inequality to obtain the upper bound, and the non-negativity of probability densities (positivity of the denominator in Eq. 2.7) for the lower bound:

λ_min ≤ λ ≤ λ_max,  with  λ_min = sup_θ [ p̂(θ)/π(θ) ].   (2.12)

Choosing a value of λ between these bounds avoids the need for iteration, for the cost of a mild reduction in optimality. Fixing λ to be the average of the upper and the lower bound leads to the bounded approximation for the optimal proposal,

q_b(θ) ∝ √[ π(θ) p̂(θ) / ( λ_b − p̂(θ)/π(θ) ) ],   (2.13)

with

λ_b = (λ_min + λ_max) / 2.   (2.14)

In numerical experiments, we find that the bounded approximation gives close-to-optimal sampling efficiencies (see §2.2 and Table 1). If further optimality is desired, it can be used as a starting point for iterating Eq. (2.7) towards the optimal λ.

All that is required to propose samples from the optimal proposal, or an approximation to it, is a density estimator for the posterior. In practice, this could be a KDE or mixture model fit to the accepted samples from the previous SMC population. With a posterior density estimator in hand, the optimal proposal can be found iteratively using Eq. (2.7), or via one of the convenient approximations Eq. (2.11) or (2.13), and then sampled (using MCMC or otherwise) to generate parameter proposals for the next SMC iteration.
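
Putting the pieces together, one population update might look like the following sketch, which uses a single weighted Gaussian as a deliberately crude posterior density estimator and the geometric mean approximation Eq. (2.11) as the proposal; all function arguments are placeholders, and a KDE or mixture fit would replace the single Gaussian in practice.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def smc_abc_population(theta_prev, w_prev, m_pi, C_pi, simulator, x_obs,
                       distance, eps, n_keep, rng):
    """One SMC-ABC population update under the geometric-mean proposal:
    fit a weighted Gaussian to the previous population (shape (N, d)) as a
    crude posterior density estimator, combine it with the Gaussian prior
    N(m_pi, C_pi), then propose, simulate, accept, and re-weight."""
    w = w_prev / w_prev.sum()
    m_hat = w @ theta_prev                            # weighted posterior mean
    C_hat = np.atleast_2d(np.cov(theta_prev.T, aweights=w))
    # Geometric-mean proposal (Eq. 2.11): average the two precisions.
    P = 0.5 * (np.linalg.inv(C_pi) + np.linalg.inv(C_hat))
    C_q = np.linalg.inv(P)
    m_q = C_q @ (0.5 * (np.linalg.solve(C_pi, m_pi)
                        + np.linalg.solve(C_hat, m_hat)))
    kept, weights = [], []
    while len(kept) < n_keep:                         # propose/simulate/accept
        theta = rng.multivariate_normal(m_q, C_q)
        if distance(simulator(theta), x_obs) <= eps:
            kept.append(theta)
            # Importance weight w = prior / proposal for accepted samples.
            weights.append(mvn.pdf(theta, mean=m_pi, cov=C_pi)
                           / mvn.pdf(theta, mean=m_q, cov=C_q))
    return np.array(kept), np.array(weights)
```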

Figure 1: ABC proposals for a Gaussian posterior (grey), with a Gaussian prior (not shown). From bottom to top at the peak: commonly used KDE proposal with bandwidth of twice the (estimated) posterior variance (blue), geometric mean approximation of the optimal proposal (red-dotted), bounded approximation of the optimal proposal with λ = λ_b (red-dashed), optimal proposal density (red).
Figure 2: ABC proposals for a bimodal posterior (grey), with a Gaussian prior (not shown). From bottom to top at the peak: commonly used KDE proposal with bandwidth of twice the (estimated) posterior variance (blue), geometric mean approximation of the optimal proposal (red-dotted), bounded approximation of the optimal proposal with λ = λ_b (red-dashed), optimal proposal density (red).
Figure 3: ABC proposals for a skewed posterior (grey), with a uniform prior (not shown). From bottom to top at the peak: commonly used KDE proposal with bandwidth of twice the (estimated) posterior variance (blue), geometric mean approximation of the optimal proposal (red-dotted), bounded approximation of the optimal proposal with λ = λ_b (red-dashed), optimal proposal density (red).
Proposal                          F      G      φ = F/G

Case I: Gaussian posterior, Gaussian prior (Figure 1)
Posterior p̂(θ)                   3.57   1.0    3.57
KDE (Beaumont et al., 2009)       2.54   0.41   6.16
Geometric mean q_gm, Eq. (2.11)   2.96   0.38   7.71
Bounded q_b, Eq. (2.13)           3.26   0.40   8.20
Optimal q*, Eq. (2.7)             3.34   0.41   8.22

Case II: bimodal posterior, Gaussian prior (Figure 2)
Posterior p̂(θ)                   3.68   1.0    3.68
KDE (Beaumont et al., 2009)       2.55   0.40   6.36
Geometric mean q_gm, Eq. (2.11)   3.22   0.34   9.47
Bounded q_b, Eq. (2.13)           3.48   0.35   9.93
Optimal q*, Eq. (2.7)             3.52   0.35   9.94

Case III: skewed posterior, uniform prior (Figure 3)
Posterior p̂(θ)                   4.77   1.0    4.77
KDE (Beaumont et al., 2009)       3.80   0.57   6.68
Geometric mean q_gm, Eq. (2.11)   3.56   0.35   10.23
Bounded q_b, Eq. (2.13)           4.05   0.36   11.23
Optimal q*, Eq. (2.7)             4.34   0.38   11.40

Table 1: Comparison of the performance of different SMC proposal schemes for the three cases considered. The functional F is proportional to the expected acceptance rate and G to the importance-weight variance, so that φ = F/G defines the sampling efficiency; see Eq. (2.6) for definitions.

2.2 Examples

2.2.1 Numerical examples of optimal ABC proposals and their approximations

Figures 1–3 and Table 1 illustrate the optimal ABC proposal density and its approximations in three scenarios: a Gaussian posterior (Figure 1), a bimodal double-Gaussian posterior (Figure 2), and a skewed posterior (Figure 3). For reference, we compare the optimal proposals to the commonly used ABC proposal scheme recommended in Beaumont et al. (2009): a Gaussian KDE with bandwidth set to double the (estimated) posterior variance. For illustration, the proposals are compared in the converged limit where the approximate posterior (in practice, a density estimator for the accepted samples) is close to the true posterior. The KDE proposal (Beaumont et al., 2009) is accordingly shown as the convolution of the true posterior with a Gaussian of twice the posterior variance.

The three examples shown in Figures 1–3 and Table 1 display the same essential characteristics. The optimal proposals are boosted in regions of high posterior density (around the peak) to give a high acceptance rate, whilst having slightly broader tails than the posterior to keep the importance-weight variance under control (hence giving an improved effective sample size). In contrast, the KDE proposals are typically much broader than the optimal proposals around the posterior peak, which leads to lower expected acceptance rates. Meanwhile, using a posterior density estimate as the proposal gives a poor expected importance-weight variance, owing to its narrower tails compared to the other proposal schemes. The optimal proposal represents the trade-off between high proposal density in regions of high posterior density and fatter tails to maintain a lower importance-weight variance.

In all three examples shown, the bounded approximation to the optimal proposal, Eq. (2.13), performs nearly as well as the optimal proposal. This is especially clear from Table 1, where the sampling efficiencies for the optimal proposal and the bounded approximation are very similar. The geometric mean approximation also provides a reasonable first approximation to the optimal proposal for the examples shown in Figures 1–3, with improved sampling efficiencies compared to the KDE or posterior-approximation proposals (see Table 1).

These three one-dimensional examples have the virtue of being easy to visualise, and they show the features of the optimal proposal for Gaussian, skewed and multi-modal posteriors. In the following section we demonstrate that even a lower bound on the expected relative improvement from the optimal proposal grows exponentially with the number of parameters, compared to the other kernels considered.

2.2.2 Expected improvement as a function of data informativeness and parameter dimensionality

We can obtain a lower bound on the improvement in sampling efficiency enabled by the optimal proposal by applying the geometric mean approximation to a simple toy problem. In this model, we take the prior and posterior to be d-dimensional multivariate Gaussians, with means in each dimension of zero and μ respectively, and diagonal covariances with per-dimension variances σ_π² and σ². We then exploit the fact that φ can be obtained analytically in this setting to rapidly evaluate the improvement in sampling efficiency over other choices of the proposal, as a function of the dimensionality of the parameter space and the informativeness of the data (as measured by both the reduction in volume and the shift in the mean in going from prior to posterior). As the optimal proposal outperforms the geometric mean approximation, the results in this section can be considered lower bounds on the improvement in performance realized by using the optimal proposal. A numerical cross-check of this comparison is sketched below.
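
The analytic result is not reproduced here, but the comparison can be cross-checked by crude Monte Carlo, estimating F and G of Eq. (2.6) by importance sampling for the Gaussian toy model. In this sketch sigma_post and shift are the per-dimension posterior width and prior-to-posterior mean shift, in units where the prior is standard normal (an assumption of the sketch, not the paper's exact parameterization):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def efficiency_gain(d, sigma_post, shift, n=100_000, seed=0):
    """Monte Carlo estimate of phi[q_gm] / phi[posterior], with
    phi = F/G as in Eq. (2.6), for a standard-normal prior and a
    N(shift, sigma_post^2 I) posterior in d dimensions. Crude: the
    estimator variance grows quickly for extreme shifts."""
    rng = np.random.default_rng(seed)
    pri = mvn(np.zeros(d), np.eye(d))
    post = mvn(np.full(d, shift), sigma_post**2 * np.eye(d))
    v = 2.0 / (1.0 + 1.0 / sigma_post**2)       # geometric-mean variance
    m = 0.5 * v * shift / sigma_post**2         # geometric-mean mean
    qgm = mvn(np.full(d, m), v * np.eye(d))

    def phi(q):
        x = q.rvs(n, random_state=rng)          # F = E_q[phat/pi]
        F = np.exp(post.logpdf(x) - pri.logpdf(x)).mean()
        y = post.rvs(n, random_state=rng)       # G = E_phat[pi/q]
        G = np.exp(pri.logpdf(y) - q.logpdf(y)).mean()
        return F / G

    return phi(qgm) / phi(post)

# e.g. efficiency_gain(3, 0.5, 1.0) is expected to exceed 1
```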

Figure 4 compares the geometric mean approximation proposal to using an estimator of the posterior density as the proposal, in the three-dimensional setting. It is clear that proposing samples from the posterior is highly suboptimal: even the geometric mean approximation to the optimal proposal gives improvements that are large, especially when the data are highly informative or surprising compared to prior expectations. Figure 5 compares the geometric mean approximation proposal to the Gaussian KDE proposal scheme recommended in Beaumont et al. (2009) (with bandwidth set to double the estimated posterior variance), again for the case where the posterior and prior are three-dimensional Gaussians. Again, substantial gains in sampling efficiency are expected using the geometric mean approximation rather than the KDE scheme. Figure 6 shows the same comparison for a 10-parameter rather than three-parameter set-up, showing that the relative improvement from using the optimal kernel quickly becomes larger in higher dimensions.

Inspecting the analytical result for the Gaussian case confirms that the relative improvement in sampling efficiency of the geometric mean approximation over the other proposals studied here scales exponentially with the number of parameters. This in turn implies an exponentially scaling improvement with the number of parameters for the optimal proposal.

Figure 4: Improvement in the sampling efficiency when sampling from the geometric mean approximation to the optimal proposal rather than from the posterior, for the case where both prior and posterior are Gaussian. This plot shows the case d = 3 (three parameters). Even for this modest number of parameters the improvement is large when the data are informative. While the exact result guarantees that the optimal proposal is never worse, the geometric mean approximation gives a ratio below one (shown in blue) for the atypical case when the posterior and prior have nearly equal width but are located far from each other.
Figure 5: Improvement in the sampling efficiency when drawing proposals from the geometric mean approximation to the optimal proposal rather than the KDE scheme recommended in Beaumont et al. (2009) (with bandwidth equal to twice the posterior variance), for the case where both prior and posterior are Gaussian. This plot shows the case d = 3. While the exact result guarantees that the optimal proposal is never worse, the approximation gives a ratio below one (shown in blue) for the atypical case when the posterior and prior have nearly equal width but are located far from each other.
Figure 6: Improvement in the sampling efficiency when drawing proposals from the geometric mean approximation to the optimal proposal rather than the KDE scheme recommended in Beaumont et al. (2009) (with bandwidth equal to twice the posterior variance), for the case where both prior and posterior are Gaussian. This plot shows the case d = 10. While the exact result guarantees that the optimal proposal is never worse, the approximation gives a ratio below one (shown in blue) for the atypical case when the posterior and prior have nearly equal width but are located far from each other.

3 Conclusions

We have derived an optimal proposal scheme for SMC-ABC algorithms by maximizing the sampling efficiency, defined as the effective number of samples per parameter proposal. This represents the optimal trade-off between obtaining a high acceptance rate and reducing the variance of the importance weights of the accepted samples, hence increasing the effective sample size. We derived an implicit form for the optimal proposal that can be solved iteratively, and also two convenient and simple approximations that can be evaluated quickly, provided a density estimator for the posterior is available (based on the accepted samples from the previous SMC population).

We have shown that the optimal proposal scheme gives a substantial boost in the expected sampling efficiency for a range of statistical models, and the expected gain in sampling efficiency increases with the dimensionality of the problem. The derived results hence provide a guide for choosing optimal proposal densities for SMC-ABC applications, for the small cost of constructing a posterior density estimator at each SMC population iteration.

References

  • Barnes et al. (2011) Barnes, C. P., Silk, D., Sheng, X., and Stumpf, M. P. (2011). “Bayesian design of synthetic biological systems.” Proceedings of the National Academy of Sciences.
  • Beaumont et al. (2009) Beaumont, M. A., Cornuet, J.-M., Marin, J.-M., and Robert, C. P. (2009). “Adaptive approximate Bayesian computation.” Biometrika, 96(4): 983–990.
  • Bonassi et al. (2015) Bonassi, F. V., West, M., et al. (2015). “Sequential Monte Carlo with adaptive weights for approximate Bayesian computation.” Bayesian Analysis, 10(1): 171–187.
  • Didelot et al. (2011) Didelot, X., Everitt, R. G., Johansen, A. M., Lawson, D. J., et al. (2011). “Likelihood-free estimation of model evidence.” Bayesian Analysis, 6(1): 49–76.
  • Filippi et al. (2013) Filippi, S., Barnes, C. P., Cornebise, J., and Stumpf, M. P. (2013). “On optimality of kernels for approximate Bayesian computation using sequential Monte Carlo.” Statistical Applications in Genetics and Molecular Biology, 12(1): 87–107.
  • Ishida et al. (2015) Ishida, E., Vitenti, S., Penna-Lima, M., Cisewski, J., de Souza, R., Trindade, A., Cameron, E., Busti, V., collaboration, C., et al. (2015). “cosmoabc: likelihood-free inference via population Monte Carlo approximate Bayesian computation.” Astronomy and Computing, 13: 1–11.
  • Jasra et al. (2012) Jasra, A., Singh, S. S., Martin, J. S., and McCoy, E. (2012). “Filtering via approximate Bayesian computation.” Statistics and Computing, 22(6): 1223–1237.
  • Marin et al. (2012) Marin, J., Pudlo, P., Robert, C., et al. (2012). “Approximate Bayesian computational methods.” Statistics and Computing, 22(6): 1167–1180.
  • McKinley et al. (2009) McKinley, T., Cook, A. R., and Deardon, R. (2009). “Inference in epidemic models without likelihoods.” The International Journal of Biostatistics, 5(1).
  • Scott and Sain (2005) Scott, D. W. and Sain, S. R. (2005). “Multidimensional density estimation.” Handbook of Statistics, 24: 229–261.
  • Sisson et al. (2007) Sisson, S. A., Fan, Y., and Tanaka, M. M. (2007). “Sequential Monte Carlo without likelihoods.” Proceedings of the National Academy of Sciences, 104(6): 1760–1765.
  • Toni and Stumpf (2009) Toni, T. and Stumpf, M. P. (2009). “Simulation-based model selection for dynamical systems in systems and population biology.” Bioinformatics, 26(1): 104–110.
  • Toni et al. (2009) Toni, T., Welch, D., Strelkowa, N., Ipsen, A., and Stumpf, M. P. (2009). “Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems.” Journal of the Royal Society Interface, 6(31): 187–202.
  • West (1993) West, M. (1993). “Mixture models, Monte Carlo, Bayesian updating, and dynamic models.” Computing Science and Statistics, 325–325.