1 Introduction
Approximate Bayesian computation (ABC) (Beaumont et al., 2002; Marin et al., 2012) has become an essential tool for Bayesian inference when the likelihood function of a statistical model of interest is intractable, i.e. when the analytical form of the likelihood is either unavailable or too costly to evaluate, but simulating the model is feasible. The main idea of the ABC rejection sampler (Pritchard et al., 1999) is to draw a parameter from the prior, use it to simulate one pseudo-data set, and finally accept the parameter as a draw from an approximate posterior if the discrepancy between the simulated and observed data sets is small enough. While the computational efficiency of this basic ABC algorithm has been improved in several ways, many models, e.g. in genomics and epidemiology (Numminen et al., 2013; Marttinen et al., 2015), astronomy (Rogers et al., 2019) and climate science (Holden et al., 2018), are very costly to simulate, making sampling-based ABC inference algorithms infeasible. To increase the sample-efficiency of ABC, various methods using surrogate models such as neural networks (Papamakarios and Murray, 2016; Papamakarios et al., 2019; Greenberg et al., 2019) and Gaussian processes (GPs) (Meeds and Welling, 2014; Wilkinson, 2014; Gutmann and Corander, 2016; Järvenpää et al., 2018, 2019a) have been proposed.
In one promising surrogate-based ABC framework, the discrepancy between the observed and simulated data, a key quantity in ABC, is modelled with a GP (Gutmann and Corander, 2016; Järvenpää et al., 2018, 2019a). The GP model is then used to form an estimator for the (approximate) posterior and to adaptively select new evaluation locations. Sequential Bayesian experimental design (also known as active learning) methods to select the simulation locations so as to maximise sample-efficiency were proposed by Järvenpää et al. (2019a). However, their methods allow running only one simulation at a time, although in practice one often has access to multiple cores to run some of the simulations in parallel. In this work, we resolve this limitation by developing batch simulation methods which are shown to considerably decrease the wall-time needed for ABC inference. Our approach (Section 4) is based on a Bayesian decision-theoretic framework recently developed by Järvenpää et al. (2019b) who, however, studied only the case where expensive and potentially noisy likelihood evaluations are available (e.g. via the synthetic likelihood method by Wood (2010)). In this work we instead focus on the ABC scenario where fewer than a thousand simulations can be obtained.
In practice the posterior distribution is often summarised for further decision making using e.g. its expectation and variance. When the computational resources for ABC inference are limited, it is important to assess the accuracy of such summaries, but this has not been done in earlier work. As the second main contribution of this paper, we devise an approximate numerical method to propagate the uncertainty of the discrepancy, represented by the GP model, to the resulting ABC posterior summaries (Section 5). Such uncertainty estimates are useful for assessing the accuracy of the inference and can also guide the termination of the inference algorithm. We call the resulting improved framework Bayesian ABC (we use this name in analogy with the related problems of Bayesian quadrature (BQ) and Bayesian optimisation (BO)). We also provide new insight on the underlying GP modelling assumptions (Appendix A.2) and on the connections between Bayesian ABC, BQ and BO to improve understanding of these conceptually similar GP-surrogate methods (Section 6). Finally, experiments with several toy models and three real-world simulation models are used to demonstrate the ABC posterior uncertainty quantification and to show that the Bayesian ABC framework is well-suited for parallel simulations (Section 7).
2 Brief Background on ABC
Suppose our initial knowledge about the (continuous) parameters $\theta \in \Theta \subset \mathbb{R}^p$ of a statistical model of interest is coded into a prior density $\pi(\theta)$. If the likelihood function $\pi(x_{\mathrm{obs}} \mid \theta)$ is available, the posterior distribution describing our knowledge of $\theta$ given the observed data $x_{\mathrm{obs}}$ can be computed using Bayes' theorem

$$\pi(\theta \mid x_{\mathrm{obs}}) = \frac{\pi(x_{\mathrm{obs}} \mid \theta)\,\pi(\theta)}{\int_\Theta \pi(x_{\mathrm{obs}} \mid \theta')\,\pi(\theta')\,\mathrm{d}\theta'}. \quad (1)$$

If the likelihood function is intractable, evaluating (1) even up to normalisation becomes infeasible. Standard ABC algorithms such as the ABC rejection sampler instead target the approximate posterior

$$\pi_{\mathrm{ABC}}(\theta \mid x_{\mathrm{obs}}) \propto \pi(\theta) \int K_\epsilon(\Delta(x_{\mathrm{obs}}, x))\,\pi(x \mid \theta)\,\mathrm{d}x, \quad (2)$$

where $K_\epsilon(\Delta) = \mathbb{1}_{\Delta \le \epsilon}$. Other choices of the kernel $K_\epsilon$ are also possible (Wilkinson, 2013). Above, $\Delta$ is a discrepancy function used to compare the similarity of the data sets and $\epsilon \ge 0$ is a threshold parameter. Small $\epsilon$ produces good approximations but renders sampling-based ABC methods inefficient. A well-constructed discrepancy function is an important ingredient of accurate ABC inference (Marin et al., 2012). In this paper we assume a suitable discrepancy function is already available (e.g. constructed based on expert opinion, earlier analyses of other similar models, pilot runs, or distance measures between raw data sets (Park et al., 2016; Bernton et al., 2019)) and focus on approximating any given ABC posterior in (2) as well as possible given only a limited budget of simulations.
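The ABC rejection sampler described above can be sketched in a few lines. The toy model, prior, discrepancy and tolerance below are illustrative choices for a 1D Gaussian-mean problem, not from the paper:

```python
import numpy as np

def abc_rejection(prior_sample, simulate, discrepancy, x_obs, eps, n_draws, rng):
    """Basic ABC rejection sampler (Pritchard et al., 1999): draw theta from
    the prior, simulate one pseudo-data set, accept theta if the discrepancy
    between simulated and observed data is at most eps."""
    accepted = []
    while len(accepted) < n_draws:
        theta = prior_sample(rng)
        x_sim = simulate(theta, rng)
        if discrepancy(x_obs, x_sim) <= eps:
            accepted.append(theta)
    return np.array(accepted)

# Toy example (hypothetical): infer the mean theta of a N(theta, 1) model from
# the sample mean of 50 observations, under a N(0, 2^2) prior.
rng = np.random.default_rng(0)
x_obs = rng.normal(1.5, 1.0, size=50)
draws = abc_rejection(
    prior_sample=lambda r: r.normal(0.0, 2.0),
    simulate=lambda th, r: r.normal(th, 1.0, size=50),
    discrepancy=lambda a, b: abs(a.mean() - b.mean()),
    x_obs=x_obs, eps=0.1, n_draws=200, rng=rng,
)
```

Even on this trivial example the accepted draws concentrate near the observed sample mean, while the low acceptance rate illustrates why sampling-based ABC becomes infeasible for costly simulators.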
3 Bayesian ABC Framework
We describe our Bayesian ABC framework here. While it is similar to Järvenpää et al. (2019a), we use a hierarchical GP model and, most importantly, explicitly quantify the uncertainty of the ABC posterior instead of merely using point estimation. The main idea is to explicitly use another layer of Bayesian inference to estimate the ABC posterior in (2). The previously simulated discrepancy-parameter pairs are treated as data for learning a surrogate model, which predicts the discrepancy for a given parameter value. The surrogate model is further used to form an estimator for the ABC posterior in (2) and to adaptively acquire new data using Bayesian experimental design.
We assume each discrepancy evaluation $\Delta_\theta$ at the corresponding parameter $\theta$ is generated as

$$\Delta_\theta = f(\theta) + \nu, \qquad \nu \sim \mathcal{N}(0, \sigma_n^2), \quad (3)$$

where $\sigma_n^2$ is the variance of the discrepancy, assumed here constant over $\Theta$. (While this modelling assumption may seem strong, it has been used successfully before; we give a justification for this modelling choice in Appendix A.2.) To encode the assumptions of smoothness and e.g. the potential quadratic shape of the discrepancy, its unknown mean function $f$ is in this work given a hierarchical GP prior
$$f(\theta) \mid \gamma \sim \mathcal{GP}(h(\theta)^\top \gamma,\; k(\theta, \theta')), \qquad \gamma \sim \mathcal{N}(b, B), \quad (4)$$

where $k(\theta, \theta')$ is a covariance function with hyperparameters $\phi$ and $h_1, \ldots, h_q$ are basis functions (both assumed continuous). We marginalise $\gamma$ in (4), as in O'Hagan and Kingman (1978) and Riihimäki and Vehtari (2014), to obtain the GP prior

$$f(\theta) \sim \mathcal{GP}(h(\theta)^\top b,\; k(\theta, \theta') + h(\theta)^\top B\, h(\theta')), \quad (5)$$

where $h(\theta) = (h_1(\theta), \ldots, h_q(\theta))^\top$ is a column vector consisting of the basis functions evaluated at $\theta$. For now, we assume the GP hyperparameters are fixed and omit them from our notation for brevity. Given training data $D_t = \{(\Delta_i, \theta_i)\}_{i=1}^t$, we obtain $f \mid D_t \sim \mathcal{GP}(m_t(\theta), c_t(\theta, \theta'))$,
$$m_t(\theta) = h(\theta)^\top \bar{\beta} + k_t(\theta)^\top K_t^{-1} (\Delta_{1:t} - H_t^\top \bar{\beta}), \quad (6)$$

$$c_t(\theta, \theta') = k(\theta, \theta') - k_t(\theta)^\top K_t^{-1} k_t(\theta') + R_t(\theta)^\top (B^{-1} + H_t K_t^{-1} H_t^\top)^{-1} R_t(\theta'), \quad (7)$$

where $[K_t]_{ij} = k(\theta_i, \theta_j) + \sigma_n^2 \mathbb{1}_{i=j}$, $[k_t(\theta)]_i = k(\theta, \theta_i)$, $\Delta_{1:t} = (\Delta_1, \ldots, \Delta_t)^\top$, and

$$\bar{\beta} = (B^{-1} + H_t K_t^{-1} H_t^\top)^{-1} (H_t K_t^{-1} \Delta_{1:t} + B^{-1} b), \quad (8)$$

$$R_t(\theta) = h(\theta) - H_t K_t^{-1} k_t(\theta). \quad (9)$$

Above, $\bar{\beta}$ reduces to the generalised least-squares estimate in the vague prior limit $B^{-1} \to 0$, $H_t$ is the $q \times t$ matrix whose columns consist of the basis function values $h(\theta_i)$ evaluated at the training points, $K_t$ is a $t \times t$ matrix, and $k_t(\theta)$ is the corresponding vector for the test point $\theta$. We also define $s_t^2(\theta) = c_t(\theta, \theta)$ and $\bar{s}_t^2(\theta) = s_t^2(\theta) + \sigma_n^2$. For further details of GP regression, see e.g. Rasmussen and Williams (2006).
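The GP posterior equations above follow standard GP regression with explicit basis functions. As a minimal numerical sketch, with an illustrative quadratic basis, a squared exponential kernel, and arbitrary hyperparameter values (not the paper's implementation):

```python
import numpy as np

def gp_post(theta_tr, delta_tr, theta_te, k, h, sigma2_n, b, B):
    """GP posterior mean and variance with explicit basis functions, following
    Rasmussen and Williams (2006), Sec. 2.7. `k` is the covariance function
    and `h` maps a 1D parameter array to a (q x n) basis matrix."""
    t = len(theta_tr)
    K = k(theta_tr[:, None], theta_tr[None, :]) + sigma2_n * np.eye(t)
    ks = k(theta_tr[:, None], theta_te[None, :])             # t x m
    H, Hs = h(theta_tr), h(theta_te)                         # q x t, q x m
    Kinv = np.linalg.inv(K)
    A = np.linalg.inv(B) + H @ Kinv @ H.T                    # q x q
    beta = np.linalg.solve(A, H @ Kinv @ delta_tr + np.linalg.solve(B, b))
    m = Hs.T @ beta + ks.T @ Kinv @ (delta_tr - H.T @ beta)  # posterior mean
    R = Hs - H @ Kinv @ ks
    c = k(theta_te[:, None], theta_te[None, :]) - ks.T @ Kinv @ ks \
        + R.T @ np.linalg.solve(A, R)                        # posterior cov
    return m, np.diag(c)

# Noisy quadratic discrepancy observed at 15 points; predict at theta = 0, 1.
sqexp = lambda u, v: np.exp(-0.5 * (u - v) ** 2 / 0.5 ** 2)
basis = lambda th: np.vstack([np.ones_like(th), th, th ** 2])
rng = np.random.default_rng(1)
theta_tr = np.linspace(-2.0, 2.0, 15)
delta_tr = theta_tr ** 2 + 1.0 + 0.1 * rng.normal(size=15)
m, v = gp_post(theta_tr, delta_tr, np.array([0.0, 1.0]), sqexp, basis,
               0.01, np.zeros(3), 10.0 * np.eye(3))
```

The quadratic basis lets the GP mean extrapolate the bowl shape of a typical discrepancy beyond the training points, which is the motivation for the hierarchical mean function here.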
If the true discrepancy mean function $f$ and the variance $\sigma_n^2$ of the discrepancy were known, the ABC posterior would be

$$\pi_{\mathrm{ABC}}(\theta \mid x_{\mathrm{obs}}) = \frac{\pi(\theta)\,\Phi((\epsilon - f(\theta))/\sigma_n)}{\int_\Theta \pi(\theta')\,\Phi((\epsilon - f(\theta'))/\sigma_n)\,\mathrm{d}\theta'}, \quad (10)$$

where $\Phi$ is the Gaussian cdf. This fact follows from (2) and the Gaussian modelling assumption (3). In practice $f$ is unknown, but our knowledge about it is represented by the posterior $f \mid D_t$. Since the ABC posterior in (10) depends on $f$, it is also a random quantity and its posterior can be obtained by propagating the uncertainty in $f$ through the mapping $f \mapsto \pi_{\mathrm{ABC}}$.
Computing the distribution of $\pi_{\mathrm{ABC}}$ is difficult due to its nonlinear dependence on $f$ and because $f$ is infinite-dimensional. However, the pointwise mean, variance and quantiles of the unnormalised ABC posterior

$$\tilde{\pi}_{\mathrm{ABC}}(\theta) = \pi(\theta)\,\Phi\!\left(\frac{\epsilon - f(\theta)}{\sigma_n}\right), \quad (11)$$

i.e. the numerator of (10), can be computed analytically, as shown by Järvenpää et al. (2019a) in the case of a zero-mean GP prior. It is easy to see that their formulas also hold for our more general GP model in (4). For example, with $a_t(\theta) = (\epsilon - m_t(\theta))/\sqrt{\sigma_n^2 + s_t^2(\theta)}$,

$$\mathbb{E}_t[\tilde{\pi}_{\mathrm{ABC}}(\theta)] = \pi(\theta)\,\Phi(a_t(\theta)), \quad (12)$$

$$\mathbb{V}_t[\tilde{\pi}_{\mathrm{ABC}}(\theta)] = \pi^2(\theta)\left(\Phi(a_t(\theta)) - 2\,T\!\left(a_t(\theta), \frac{1}{\sqrt{1 + 2 s_t^2(\theta)/\sigma_n^2}}\right) - \Phi^2(a_t(\theta))\right), \quad (13)$$

$$\mathrm{med}_t[\tilde{\pi}_{\mathrm{ABC}}(\theta)] = \pi(\theta)\,\Phi\!\left(\frac{\epsilon - m_t(\theta)}{\sigma_n}\right), \quad (14)$$

where $T$ denotes Owen's T function (Owen, 1956) and med is the marginal (i.e. elementwise) median. While these formulas are useful, they do not allow one to assess the uncertainty of e.g. the posterior mean of $\theta$. We resolve this limitation in Section 5.
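Given the GP posterior mean and variance at a point, the pointwise summaries (12)-(14) are cheap to evaluate. A sketch using SciPy's Owen's T implementation; the numerical inputs below are arbitrary illustrative values, and the variance formula relies on the Owen's T identity for the second moment of a Gaussian cdf of a Gaussian variable:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import owens_t

def abc_post_pointwise(prior, eps, m_t, s_t, sigma_n):
    """Pointwise mean, variance and median of the unnormalised ABC posterior
    pi(theta) * Phi((eps - f(theta)) / sigma_n) when f(theta) ~ N(m_t, s_t^2)."""
    a = (eps - m_t) / np.sqrt(sigma_n ** 2 + s_t ** 2)
    mean = prior * norm.cdf(a)
    var = prior ** 2 * (norm.cdf(a)
                        - 2.0 * owens_t(a, 1.0 / np.sqrt(1.0 + 2.0 * s_t ** 2 / sigma_n ** 2))
                        - norm.cdf(a) ** 2)
    med = prior * norm.cdf((eps - m_t) / sigma_n)
    return mean, var, med

mean, var, med = abc_post_pointwise(prior=0.5, eps=0.2, m_t=0.5, s_t=0.4, sigma_n=0.3)
```

A quick Monte Carlo simulation of $f \sim \mathcal{N}(m_t, s_t^2)$ reproduces all three summaries, which is a convenient sanity check for the Owen's T term.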
4 Parallel Simulations
We aim to find the most informative simulation locations for obtaining the best possible estimate of the ABC posterior given the postulated GP model. Principled sequential designs, where one simulation is run at a time, were developed by Järvenpää et al. (2019a). In practice, to decrease the wall-time needed for the inference task, one could run some of the simulations in parallel. In the following, we apply Bayesian experimental design theory (Chaloner and Verdinelli, 1995; Järvenpää et al., 2019b) to the (synchronous) batch setting, where a batch of simulations is selected at each iteration to be computed in parallel.
4.1 Decision-Theoretic Approach
First, consider a loss function $l$ so that $l(\tilde{\pi}, d)$ quantifies the penalty of reporting $d$ as our ABC posterior when the true one is $\tilde{\pi}$. Given $D_t$, the one-batch-ahead Bayes-optimal selection of the next batch of $b$ evaluations is then obtained as

$$\theta^*_{t+1:t+b} = \arg\min_{\theta^* \in \Theta^b} L_t(\theta^*), \quad (15)$$

$$L_t(\theta^*) = \mathbb{E}_{\Delta^* \mid \theta^*, D_t}\left[\min_{d}\, \mathbb{E}\!\left[\, l(\tilde{\pi}_{\mathrm{ABC}}, d) \,\middle|\, D_t \cup D^*\right]\right]. \quad (16)$$
In (16), we calculate an expectation over the future discrepancy evaluations $\Delta^*$ at the locations $\theta^*$, assuming they follow our current GP model. The expectation is taken of the Bayes risk resulting from the nested decision problem of choosing the estimator $d$, assuming the new evaluations are known and merged with the current data via $D^*$. While the main quantity of interest in the Bayesian ABC framework is the normalised ABC posterior in (10), in practice it is desirable to use a loss function based on the unnormalised distribution (11). Such a simplification, also used by Kandasamy et al. (2017), Sinsbeck and Nowak (2017), and Järvenpää et al. (2019a, b), allows efficient computations. Furthermore, evaluations that are optimal for estimating the unnormalised posterior will also be informative about the related normalised quantity.
Consider first the $L^2$ loss between the unnormalised ABC posterior and its estimator (both supposed to be square-integrable over $\Theta$). Then the optimal estimator is the mean in (12) (Sinsbeck and Nowak, 2017; Järvenpää et al., 2019b). If we instead consider the $L^1$ loss (supposing both functions are integrable), then the marginal median in (14) is the optimal estimator. The corresponding Bayes risks can be computed as follows:
Lemma 4.1.
We call $L_t$ in (16) an acquisition function. The expected integrated variance (EIV) and expected integrated MAD (mean absolute deviation around the median; EIMAD) acquisition functions can be computed as follows:
Proposition 4.2.
4.2 Details on Computation
Finding the one-batch-ahead optimal design requires global optimisation over $\Theta^b$ for both EIV and EIMAD. As this becomes infeasible with a large batch size $b$ and/or dimension $p$ of the parameter space, we use greedy optimisation, as is common also in batch BO (see, e.g., Ginsbourger et al., 2010; Snoek et al., 2012; Wilson et al., 2018). That is, the $r$th point in the batch ($r = 2, \ldots, b$) is iteratively chosen by optimising the acquisition function with respect to that point while the other points are kept fixed to their already determined values. This simplifies the $bp$-dimensional optimisation problem to a sequence of easier $p$-dimensional problems. The theory of submodular optimisation has been used to study greedy batch designs (Bach, 2013; Wilson et al., 2018; Järvenpää et al., 2019b). Unfortunately, such analysis hardly extends to our case because the acquisition functions in Proposition 4.2 depend on the batch locations in a rather complex way. Using the facts that the integrands are monotone in the GP variance and that the variance reduction cannot decrease as more points are included in the batch, we nevertheless see that both EIV and EIMAD are non-increasing as set functions of the batch. For this reason, we can expect greedy optimisation to be useful in practice, as is seen empirically in Section 7.
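The greedy batch scheme is a generic loop over candidate locations. A sketch over a finite candidate set with a stand-in acquisition function (the real EIV/EIMAD objectives, and continuous optimisation, would replace it):

```python
import numpy as np

def greedy_batch(acq, candidates, b):
    """Greedy batch design: pick the b batch points one at a time, each
    minimising the acquisition value of the partial batch while the earlier
    choices stay fixed. `acq` maps a list of points to a scalar."""
    batch = []
    for _ in range(b):
        scores = [acq(batch + [c]) for c in candidates]
        batch.append(candidates[int(np.argmin(scores))])
    return batch

# Hypothetical acquisition: negative spread of the batch around a fixed point,
# which rewards diverse choices, just to exercise the loop.
cands = [0.0, 0.25, 0.5, 0.75, 1.0]
chosen = greedy_batch(lambda pts: -np.var(pts + [0.5]), cands, b=2)
# chosen == [0.0, 1.0]
```

Each greedy step re-evaluates the acquisition with the pending points held fixed, which is exactly the decomposition of the joint $bp$-dimensional problem described above.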
Another potential computational difficulty is the integration over $\Theta$ in (19) and (20). Many state-of-the-art BO methods, such as those of Hennig and Schuler (2012), Hernández-Lobato et al. (2014) and Wu and Frazier (2016), also require similar computations. We approximate the integral using numerical integration in low dimensions and, in higher dimensions, self-normalised importance sampling (IS), where the current loss function, interpreted as an unnormalised density, serves as the instrumental distribution. For details, see Appendix B.
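Self-normalised IS needs the target only up to a constant, which is why an unnormalised loss function can serve directly as the instrumental density. A generic sketch on a toy target (the Gaussian target and proposal below are illustrative):

```python
import numpy as np

def snis(integrand, log_target_unnorm, proposal_sample, log_proposal, n, rng):
    """Self-normalised importance sampling: estimate E_pi[g] when the density
    pi is known only up to a constant. Normalising the weights cancels both
    the target and proposal constants."""
    x = proposal_sample(n, rng)
    log_w = log_target_unnorm(x) - log_proposal(x)
    w = np.exp(log_w - log_w.max())   # stabilise before normalising
    w /= w.sum()
    return np.sum(w * integrand(x))

# Check: mean of N(1, 1), given only its unnormalised log-density, estimated
# with a wider N(0, 2^2) proposal.
rng = np.random.default_rng(0)
est = snis(integrand=lambda x: x,
           log_target_unnorm=lambda x: -0.5 * (x - 1.0) ** 2,
           proposal_sample=lambda n, r: r.normal(0.0, 2.0, n),
           log_proposal=lambda x: -0.5 * (x / 2.0) ** 2,
           n=100_000, rng=rng)
```

The max-subtraction before exponentiation is a standard guard against overflow when the log-weights have a large range, as happens with peaked loss surfaces.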
4.3 Heuristic Baseline Batch Methods
We also consider heuristic acquisition functions which evaluate where the pointwise uncertainty of the unnormalised ABC posterior is highest. Such an intuitive strategy is sometimes called uncertainty sampling and is used, e.g., by Gunter et al. (2014), Järvenpää et al. (2019a) and Chai and Garnett (2019). When the variance is used as the measure of uncertainty, we call the method MAXV, and when MAD is used, we obtain an alternative strategy called, analogously, MAXMAD. The resulting acquisition functions can be computed using the integrands of (17) and (18).

Finally, we propose a heuristic approach, also used for BO (Snoek et al., 2012), to parallelise the MAXV and MAXMAD strategies: the first point in the batch is chosen as in the sequential case. The further points are iteratively selected as the locations where the expected variance (or MAD), taken with respect to the discrepancy values of the pending points, that is, points that have already been chosen to the current batch, is highest. The resulting acquisition functions are immediately obtained as the integrands of (19) and (20).
5 Uncertainty Quantification of the ABC Posterior
The pointwise marginal uncertainty of the unnormalised ABC posterior was used in the previous section for selecting the simulation locations. However, knowing the value of the unnormalised posterior and its marginal uncertainty at some individual parameter values is not very helpful for summarising and understanding the accuracy of the final estimate of the ABC posterior. Computing the distribution of the moments and marginals of the normalised ABC posterior in (10) is clearly more intuitive. See Fig. 1 for a 1D demonstration of this approach.
To access the posterior of $\pi_{\mathrm{ABC}}$, one could fix a sample path of $f \mid D_t$, use it to fix a realisation of the ABC posterior via (10), and finally use e.g. MCMC to sample from this realisation. This would be repeated for a number of sample paths, and the resulting collection of MCMC samples approximately describes the posterior of $\pi_{\mathrm{ABC}}$ given $D_t$ (see Fig. 1d). The uncertainty of the GP hyperparameters could also be taken into account by drawing them as the very first step, but here we consider them known for simplicity, although this can cause underestimation of the uncertainty of $\pi_{\mathrm{ABC}}$. The outlined approach involves a major computational challenge, as jointly evaluating a sample path at a set of test points scales cubically in the number of test points. (Approximations such as random Fourier features (RFF) (Rahimi and Recht, 2008) and those by Pleiss et al. (2018) can be used to reduce this cost; e.g. Hernández-Lobato et al. (2014) and Wang and Jegelka (2017) used RFF to approximately optimise GP sample paths. However, this produces a trade-off between an exact GP with few test points v.s. an inexact GP with many, which we do not analyse in this paper.)
We propose the following computationally cheaper yet asymptotically exact approach. In small dimensions, we evaluate each sample path at fixed grid points and compute the required integrations numerically. In higher dimensions, self-normalised importance sampling is used: we draw samples from an instrumental density, chosen as an upper quantile of the unnormalised ABC posterior interpreted as a pdf and evaluated using (A.23) of the Appendix. The samples are thinned and the resulting representative samples are used to compute the normalised importance weights for each sampled posterior realisation. The output is a set of weighted sample sets from which moments and marginal densities can be computed using standard Monte Carlo estimators for each realisation. This approach requires only one MCMC sampling, from the instrumental density, and its cost grows only linearly in the number of sample paths, so that many paths can be used.
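The grid-based variant of this procedure can be sketched as follows. The GP posterior below is a stand-in (a known mean function with zero posterior covariance, so all sampled posteriors coincide), not a fitted model:

```python
import numpy as np
from scipy.stats import norm

def abc_posterior_uncertainty(grid, prior, m_t, cov_t, eps, sigma_n, n_paths, rng):
    """Propagate GP uncertainty to the normalised ABC posterior on a 1D grid:
    draw sample paths f ~ N(m_t, cov_t), map each through the Phi form of
    Eq. (10), normalise numerically, and return one posterior-expectation
    sample per path."""
    d = grid[1] - grid[0]
    jitter = 1e-9 * np.eye(len(grid))               # numerical stabilisation
    f = rng.multivariate_normal(m_t, cov_t + jitter, size=n_paths)
    unnorm = prior * norm.cdf((eps - f) / sigma_n)  # n_paths x n_grid
    post = unnorm / (unnorm.sum(axis=1, keepdims=True) * d)
    return (post * grid).sum(axis=1) * d            # samples of E[theta]

# Stand-in GP: quadratic mean, zero posterior covariance, so every sampled
# posterior is the same and symmetric around zero.
grid = np.linspace(-3.0, 3.0, 201)
rng = np.random.default_rng(0)
means = abc_posterior_uncertainty(grid, 1.0, grid ** 2, np.zeros((201, 201)),
                                  1.0, 0.5, 5, rng)
```

With a fitted GP, `cov_t` would be the posterior covariance on the grid, and the spread of the returned expectation samples is exactly the uncertainty estimate discussed above.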
This approach nevertheless has some limitations. The computations themselves are only approximate because the numbers of sample paths and evaluation points are finite. Also, if the uncertainty of $\pi_{\mathrm{ABC}}$ is very high, choosing a good instrumental density can be difficult, because some of the sampled posteriors are then necessarily quite different from any single instrumental density, possibly producing a poor approximation. In our experiments this happened only in early iterations, and it can be detected e.g. by monitoring the distribution of effective sample sizes over the sampled posteriors. In Section 7 we demonstrate that the uncertainty quantification is feasible and beneficial in low-dimensional cases. The proposed approach also works in other GP modelling situations such as that of Järvenpää et al. (2019b).
6 On Related GP-Based Methods
In this section we briefly discuss the relation between Bayesian ABC, BQ and BO to facilitate better understanding of these conceptually similar inference methods.
6.1 Relation to Bayesian Quadrature
In Bayesian quadrature one aims to compute an integral $\int f(\theta)\,\pi(\theta)\,\mathrm{d}\theta$, where $f$ is an expensive black-box function and $\pi$ is a known density, e.g. Gaussian. If a GP prior is placed on $f$, then given some evaluations of $f$, the posterior of the integral, describing one's knowledge of its value, is Gaussian, and its mean and variance can be computed analytically for some choices of covariance function and $\pi$ (for details, see O'Hagan (1991); Briol et al. (2019)). Also, BQ methods for computing integrals of the form $\int g(f(\theta))\,\pi(\theta)\,\mathrm{d}\theta$ with some known (non-negative) function $g$, such as marginal likelihoods, have been developed by Osborne et al. (2012); Gunter et al. (2014); Chai and Garnett (2019).
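For intuition, vanilla BQ computes the posterior moments of the integral from the kernel mean embedding. A sketch where the embedding is computed by simple numerical quadrature rather than in closed form (the kernel, lengthscale and test integrand are illustrative choices):

```python
import numpy as np

def bq(theta_tr, f_tr, k, density, lo, hi, n_grid=2000):
    """Vanilla Bayesian quadrature for I = int f(theta) pi(theta) dtheta with
    a zero-mean GP prior on f. The kernel mean embedding
    z_i = int k(theta, theta_i) pi(theta) dtheta is computed by Riemann
    quadrature here for generality; closed forms exist e.g. for Gaussian k
    and pi (O'Hagan, 1991)."""
    grid = np.linspace(lo, hi, n_grid)
    w = density(grid) * (grid[1] - grid[0])          # quadrature weights
    K = k(theta_tr[:, None], theta_tr[None, :]) + 1e-10 * np.eye(len(theta_tr))
    z = (w[:, None] * k(grid[:, None], theta_tr[None, :])).sum(axis=0)
    mean = z @ np.linalg.solve(K, f_tr)              # posterior mean of I
    kk = (w[:, None] * w[None, :] * k(grid[:, None], grid[None, :])).sum()
    var = kk - z @ np.linalg.solve(K, z)             # posterior variance of I
    return mean, var

# Check against int_0^1 theta^2 dtheta = 1/3 with a uniform density on [0, 1].
sqexp = lambda u, v: np.exp(-0.5 * (u - v) ** 2 / 0.3 ** 2)
theta_tr = np.linspace(0.0, 1.0, 9)
mean, var = bq(theta_tr, theta_tr ** 2, sqexp, lambda x: np.ones_like(x), 0.0, 1.0)
```

The posterior variance shrinks as evaluations are added, which mirrors how the ABC posterior uncertainty of Section 5 contracts with more simulations.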
Our approach in Section 5 is instead developed for quantifying the uncertainty in either the whole function

$$\pi_g(\theta) = \frac{\pi(\theta)\,g(f(\theta))}{\int_\Theta \pi(\theta')\,g(f(\theta'))\,\mathrm{d}\theta'}, \quad (22)$$

or some corresponding moments, such as its expectation. To our knowledge, computing these quantities probabilistically has not been considered before. In particular, we used the "0-1 kernel" in (2), corresponding to $g(f(\theta)) = \Phi((\epsilon - f(\theta))/\sigma_n)$ in (22). Osborne et al. (2012), Gutmann and Corander (2016), Acerbi (2018) and Järvenpää et al. (2019b) instead modelled the log-likelihood with a GP, to account for the non-negativity of the likelihood and the high dynamic range of the log-likelihood. This would correspond to $g(f(\theta)) = \exp(f(\theta))$ in (22).
Osborne et al. (2012), Gunter et al. (2014) and Chai and Garnett (2019) used linearisation approximations in their algorithms for estimating integrals of the form $\int g(f(\theta))\,\pi(\theta)\,\mathrm{d}\theta$. Similarly, if both terms in (10) were linear in $f$, then the numerator and denominator in (10) would have a joint Gaussian density, leading to tractable computations. However, we observed that the resulting densities can be highly non-Gaussian, so that any linearisation approach can result in poor-quality approximations. For this reason we adopted the simulation-based approach of Section 5.
6.2 Relation to Bayesian Optimisation
Suppose now that $f$ is an expensive black-box function to be minimised. In BO, a GP prior is placed on $f$ and the future locations for obtaining (possibly noisy) evaluations of $f$ are chosen adaptively by optimising an acquisition function that, in some sense, measures the potential improvement in the knowledge of the minimum point or the corresponding function value brought by the extra evaluation. For example, entropy search and predictive entropy search (Hennig and Schuler, 2012; Hernández-Lobato et al., 2014) use an acquisition function that measures the expected reduction in the differential entropy of the posterior of the minimiser. Wang and Jegelka (2017) similarly considered the posterior of the minimum value. The important difference between these methods (or BO in general) and Bayesian ABC is that the quantity of interest in Bayesian ABC is not the minimiser of $f$ but the full ABC posterior density in (10) (or (11)). Also, BO is rarely introduced this way in the literature: simple acquisition functions such as expected improvement and the lower confidence bound (LCB) are often used, and the posterior of the minimiser or the minimum value is rarely explicitly computed.
In the BOLFI framework (Gutmann and Corander, 2016), the function $f$ was however taken to be the ABC discrepancy, and the LCB acquisition function (Srinivas et al., 2010) was used for illustrating their approach of learning the ABC posterior. This is reasonable because, to learn the ABC posterior, one needs to evaluate in the regions with small discrepancy. We have the following new result that relates LCB to the Bayesian ABC framework:
Proposition 6.1.
If the prior is uniform over $\Theta$ (and may be improper), then the point chosen by the LCB acquisition function $\theta \mapsto m_t(\theta) - \sqrt{\beta_t}\, s_t(\theta)$ with trade-off parameter $\beta_t$ is the same as the point maximising the $\Phi(\sqrt{\beta_t})$-quantile of the unnormalised ABC posterior, for any $\epsilon > 0$.
This result gives an interpretation for the LCB trade-off parameter $\beta_t$ in the ABC setting. However, instead of using LCB for Bayesian ABC, it is clearly more reasonable to evaluate where the variance (or some other measure of uncertainty) is large, as already discussed e.g. by Kandasamy et al. (2017) and Järvenpää et al. (2019a). Järvenpää et al. (2019a) showed empirically that EIV consistently works better than LCB in their sequential scenario when the goal is to learn the ABC posterior. For this reason, we do not use (batch) BO methods in this article.
7 Experiments
We first consider four 2D toy problems to gain understanding of how the proposed method performs with a well-specified GP model. We then focus on more typical scenarios where the GP modelling assumptions do not hold exactly, using three real-world simulation models. We compare the performance of the sequential and synchronous batch versions of the acquisition methods outlined in Section 4. As a simple baseline, we consider random points drawn from the prior (abbreviated RAND). We also briefly demonstrate the uncertainty quantification of the ABC posterior. We do not consider the synthetic likelihood method (as e.g. in Järvenpää et al. (2019b)) because it requires hundreds of evaluations for each proposed parameter and is thus not applicable here.
Locations for fitting the initial GP model are sampled from the uniform prior in all cases, with more initial points used in the 3D and 4D cases than in the 2D ones. We include polynomial basis functions so that the GP mean can capture the potential quadratic shape of the discrepancy. The discrepancy is assumed smooth and we use the squared exponential covariance function. GP hyperparameters are given weakly informative priors and their values are obtained using MAP estimation at each iteration. Owen's T function values are computed using a C implementation of the algorithm by Patefield and Tandy (2000).
ABC-MCMC (Marjoram et al., 2003) with extensive simulations is used to compute the ground-truth ABC posterior for the real-world models. For simplicity, and to ensure meaningful comparisons to the ground truth, we fix $\epsilon$ to certain small predefined values although, in practice, its value would be set adaptively (Järvenpää et al., 2019a) or based on pilot runs. We compute the estimate of the unnormalised ABC posterior using (12) for MAXV, EIV and RAND, and (14) for MAXMAD and EIMAD. Adaptive MCMC (Haario et al., 2006) is used to sample from the resulting ABC posterior estimates and from the instrumental densities needed for the IS approximations. TV denotes the median total variation distance between the estimated ABC posterior and the true one (2D), or the average over their marginal TV distances (3D, 4D), computed numerically over repeated runs. Iteration (i.e. the number of batches chosen) serves as a proxy for wall-time. The total number of simulations is fixed in all experiments, and the batch methods thus finish earlier.
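On a discretised grid, the TV distance used in the comparisons reduces to half an $L^1$ norm. A sketch with two Gaussian densities whose TV distance, $2\Phi(1/2) - 1 \approx 0.383$, is known in closed form:

```python
import numpy as np

def tv_distance(p, q, dtheta):
    """Total variation distance between two densities discretised on a common
    grid with spacing dtheta: TV = 0.5 * int |p - q|."""
    return 0.5 * np.sum(np.abs(p - q)) * dtheta

# Example: N(0, 1) v.s. N(1, 1) on a 1D grid, normalised numerically.
grid = np.linspace(-5.0, 5.0, 1001)
d = grid[1] - grid[0]
p = np.exp(-0.5 * grid ** 2); p /= p.sum() * d
q = np.exp(-0.5 * (grid - 1.0) ** 2); q /= q.sum() * d
tv = tv_distance(p, q, d)
```

For the 3D and 4D models, the same function would be applied to each pair of estimated and ground-truth marginals and the results averaged.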
7.1 Toy Simulation Models
Fig. 2 shows the results for the sequential methods (batch size one) and the corresponding batch methods for four synthetically constructed toy models. These were taken from Järvenpää et al. (2019a) and are illustrated in Appendix B. In Fig. 3 the effect of the batch size is studied for the two best-performing methods.
7.2 RealWorld Simulation Models
Lorenz model. This modified version of the well-known Lorenz weather prediction model describes the dynamics of slow weather variables and their dependence on unobserved fast weather variables over a certain period of time. The model is represented by a coupled stochastic differential equation which can only be solved numerically, resulting in an intractable likelihood function. The model has two parameters which we estimate from time-series data generated using the true parameter value. See Thomas et al. (2018) for full details of the model and the experimental setup, which we also use here, with the exception that we use a wider uniform prior. The discrepancy is formed as a Mahalanobis distance from the six summary statistics proposed by Hakkarainen et al. (2012). The results are shown in Fig. 4(a). Furthermore, Fig. 4(b,c) demonstrates the uncertainty quantification of the expectation of the model-based ABC posterior. See Appendix B.1 for the details of the numerical computations used. The effect of the batch size is shown in Fig. 5(c).
Bacterial infections model. This model describes the transmission dynamics of bacterial infections in day care centers and features an intractable likelihood function. The model was developed by Numminen et al. (2013) and has been used previously by Gutmann and Corander (2016) and Järvenpää et al. (2019a) as an ABC benchmark problem. We estimate the internal, external and co-infection parameters using the true data (Numminen et al., 2013) and uniform priors. The discrepancy is formed as in Gutmann and Corander (2016); see Appendix B.3 for details. The results with all methods are shown in Fig. 5(a), and Fig. 5(b) shows the effect of the batch size for the two best-performing methods.
7.3 Discussion on the Results
In general, we obtain reasonable posterior approximations considering the very limited budget of simulations. EIV and EIMAD tend to produce more stable and accurate, but also more conservative, estimates than MAXV and MAXMAD. The difference in approximation quality between EIV and EIMAD, both based on the same Bayesian decision-theoretic framework but different loss functions, was small. While RAND worked well in the 2D cases and is fully parallelisable, it unsurprisingly produced poor posterior approximations in higher dimensions. In all cases, our batch strategies produced similar evaluation locations as the corresponding sequential methods, leading to substantial improvements in wall-time when the simulations are costly. Unlike in the related problem of BO, batch points need not always be diverse, because the simulations are stochastic and simulating multiple times at nearby points can be useful. On the other hand, already a single simulation can be enough to effectively rule out large tail regions. The proposed methods automatically balance between these two situations.
Fig. 4(b,c) shows the evolution of the uncertainty in the ABC posterior expectation of the Lorenz model over the iterations in the case of sequential EIV. The convergence is only approximately towards the true ABC posterior expectation due to a slight GP misspecification. Similarly, the ABC posterior marginals of the bacterial infections model in Appendix C contain some uncertainty after the final iteration, which our approach allows us to rigorously quantify. Due to the approximations involved, and because this approach is not designed to account for the error of approximating the intractable ground-truth posterior with the ABC posterior in the first place, we however suggest interpreting the uncertainty estimates with care. Developing more effective (analytical) methods for computing these uncertainty estimates is an interesting avenue for future work. The connection to BQ methods outlined in Section 6.1 may be helpful for achieving this goal.
8 Conclusions
We considered ABC inference with a limited number of simulations (fewer than a thousand). We outlined a GP surrogate modelling framework, called Bayesian ABC, in which the uncertainty of the ABC posterior distribution due to the limited computational resources is approximately quantified. We also developed batch-sequential Bayesian experimental design strategies to efficiently parallelise the expensive simulations. Experiments suggest that substantial gains in wall-time over previous related work can be obtained.
Acknowledgements
This work was funded by the Academy of Finland (grants no. 286607 and 294015 to PM). We acknowledge the computational resources provided by Aalto ScienceIT project.
References
 Acerbi (2018) L. Acerbi. Variational Bayesian Monte Carlo. In Advances in Neural Information Processing Systems 31, pages 8223–8233. 2018.
 Bach (2013) F. Bach. Learning with Submodular Functions: A Convex Optimization Perspective. 2013.
 Beaumont et al. (2002) M. A. Beaumont, W. Zhang, and D. J. Balding. Approximate Bayesian computation in population genetics. Genetics, 162(4):2025–2035, 2002.
 Bernton et al. (2019) E. Bernton, P. E. Jacob, M. Gerber, and C. P. Robert. Approximate Bayesian computation with the Wasserstein distance. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(2):235–269, 2019.
 Briol et al. (2019) F.X. Briol, C. J. Oates, M. Girolami, M. A. Osborne, and D. Sejdinovic. Probabilistic integration: A role in statistical computation? Statistical Science, 34(1):1–22, 2019.

 Chai and Garnett (2019) H. R. Chai and R. Garnett. Improving quadrature for constrained integrands. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2751–2759, 2019.
 Chaloner and Verdinelli (1995) K. Chaloner and I. Verdinelli. Bayesian experimental design: A review. Statistical Science, 10(3):273–304, 1995.
 Ginsbourger et al. (2010) D. Ginsbourger, R. Le Riche, and L. Carraro. Kriging is well-suited to parallelize optimization. Pages 131–162. Springer Berlin Heidelberg, 2010.

 Greenberg et al. (2019) D. Greenberg, M. Nonnenmacher, and J. Macke. Automatic posterior transformation for likelihood-free inference. In Proceedings of the 36th International Conference on Machine Learning, 2019.
 Gunter et al. (2014) T. Gunter, M. A. Osborne, R. Garnett, P. Hennig, and S. J. Roberts. Sampling for inference in probabilistic models with fast Bayesian quadrature. In Advances in Neural Information Processing Systems 27, 2014.
 Gutmann and Corander (2016) M. U. Gutmann and J. Corander. Bayesian optimization for likelihoodfree inference of simulatorbased statistical models. Journal of Machine Learning Research, 17(125):1–47, 2016.
 Haario et al. (2006) H. Haario, M. Laine, A. Mira, and E. Saksman. DRAM: Efficient adaptive MCMC. Statistics and Computing, 16(4):339–354, 2006.
 Hakkarainen et al. (2012) J. Hakkarainen, A. Ilin, A. Solonen, M. Laine, H. Haario, J. Tamminen, E. Oja, and H. Järvinen. On closure parameter estimation in chaotic systems. Nonlinear Processes in Geophysics, 19:127–143, 2012.
 Hennig and Schuler (2012) P. Hennig and C. J. Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13:1809–1837, 2012.
 Hernández-Lobato et al. (2014) J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems 27, 2014.
 Holden et al. (2018) P. Holden, N. Edwards, and R. Wilkinson. ABC for climate: dealing with expensive simulators. In The Handbook of ABC, pages 569–595. 2018.
 Järvenpää et al. (2018) M. Järvenpää, M. U. Gutmann, A. Vehtari, and P. Marttinen. Gaussian process modelling in approximate Bayesian computation to estimate horizontal gene transfer in bacteria. The Annals of Applied Statistics, 12(4):2228–2251, 2018.
 Järvenpää et al. (2019a) M. Järvenpää, M. U. Gutmann, A. Pleska, A. Vehtari, and P. Marttinen. Efficient acquisition rules for modelbased approximate Bayesian computation. Bayesian Analysis, 14(2):595–622, 2019a.
 Järvenpää et al. (2019b) M. Järvenpää, M. U. Gutmann, A. Vehtari, and P. Marttinen. Parallel Gaussian process surrogate Bayesian inference with noisy likelihood evaluations. arXiv:1905.01252, 2019b.
 Kandasamy et al. (2017) K. Kandasamy, J. Schneider, and B. Póczos. Query efficient posterior estimation in scientific experiments via Bayesian active learning. Artificial Intelligence, 243:45–56, 2017.
 Marin et al. (2012) J. M. Marin, P. Pudlo, C. P. Robert, and R. J. Ryder. Approximate Bayesian computational methods. Statistics and Computing, 22(6):1167–1180, 2012.
 Marjoram et al. (2003) P. Marjoram, J. Molitor, V. Plagnol, and S. Tavare. Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences of the United States of America, 100(26):15324–15328, 2003.
 Marttinen et al. (2015) P. Marttinen, M. U. Gutmann, N. J. Croucher, W. P. Hanage, and J. Corander. Recombination produces coherent bacterial species clusters in both core and accessory genomes. Microbial Genomics, 1(5), 2015.
 Meeds and Welling (2014) E. Meeds and M. Welling. GPS-ABC: Gaussian Process Surrogate Approximate Bayesian Computation. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, 2014.
 Numminen et al. (2013) E. Numminen, L. Cheng, M. Gyllenberg, and J. Corander. Estimating the transmission dynamics of Streptococcus pneumoniae from strain prevalence data. Biometrics, 69(3):748–757, 2013.
 O’Hagan (1991) A. O’Hagan. Bayes-Hermite quadrature. Journal of Statistical Planning and Inference, 1991.
 O’Hagan and Kingman (1978) A. O’Hagan and J. F. C. Kingman. Curve fitting and optimal design for prediction. Journal of the Royal Statistical Society. Series B (Methodological), 40(1):1–42, 1978.
 Osborne et al. (2012) M. A. Osborne, D. Duvenaud, R. Garnett, C. E. Rasmussen, S. J. Roberts, and Z. Ghahramani. Active Learning of Model Evidence Using Bayesian Quadrature. Advances in Neural Information Processing Systems 26, pages 1–9, 2012.

 Owen (1956) D. B. Owen. Tables for computing bivariate normal probabilities. The Annals of Mathematical Statistics, 27(4):1075–1090, 1956.
 Papamakarios and Murray (2016) G. Papamakarios and I. Murray. Fast ε-free inference of simulation models with Bayesian conditional density estimation. In Advances in Neural Information Processing Systems 29, 2016.
 Papamakarios et al. (2019) G. Papamakarios, D. Sterratt, and I. Murray. Sequential neural likelihood: Fast likelihoodfree inference with autoregressive flows. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pages 837–848, 2019.
 Park et al. (2016) M. Park, W. Jitkrittum, and D. Sejdinovic. K2-ABC: Approximate Bayesian computation with kernel embeddings. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 398–407, 2016.
 Patefield and Tandy (2000) M. Patefield and D. Tandy. Fast and accurate calculation of Owen’s T function. Journal of Statistical Software, 5(5):1–25, 2000.
 Pleiss et al. (2018) G. Pleiss, J. Gardner, K. Weinberger, and A. G. Wilson. Constant-time predictive distributions for Gaussian processes. In Proceedings of the 35th International Conference on Machine Learning, pages 4114–4123, 2018.
 Prangle (2017) D. Prangle. Adapting the ABC distance function. Bayesian Analysis, 12(1):289–309, 2017.
 Price et al. (2018) L. F. Price, C. C. Drovandi, A. Lee, and D. J. Nott. Bayesian synthetic likelihood. Journal of Computational and Graphical Statistics, 27(1):1–11, 2018.
 Pritchard et al. (1999) J. K. Pritchard, M. T. Seielstad, A. Perez-Lezaun, and M. W. Feldman. Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Molecular Biology and Evolution, 16(12):1791–1798, 1999.
 Rahimi and Recht (2008) A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, pages 1177–1184, 2008.
 Rasmussen and Williams (2006) C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
 Riihimäki and Vehtari (2014) J. Riihimäki and A. Vehtari. Laplace approximation for logistic Gaussian process density estimation and regression. Bayesian Analysis, 9(2):425–448, 2014.
 Rogers et al. (2019) K. K. Rogers, H. V. Peiris, A. Pontzen, S. Bird, L. Verde, and A. Font-Ribera. Bayesian emulator optimisation for cosmology: application to the Lyman-alpha forest. Journal of Cosmology and Astroparticle Physics, 2019.
 Sinsbeck and Nowak (2017) M. Sinsbeck and W. Nowak. Sequential design of computer experiments for the solution of Bayesian inverse problems. SIAM/ASA Journal on Uncertainty Quantification, 5(1):640–664, 2017.
 Snoek et al. (2012) J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pages 1–9, 2012.
 Srinivas et al. (2010) N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, pages 1015–1022, 2010.
 Thomas et al. (2018) O. Thomas, R. Dutta, J. Corander, S. Kaski, and M. U. Gutmann. Likelihood-free inference by ratio estimation. arXiv:1611.10242, 2018.
 Wang and Jegelka (2017) Z. Wang and S. Jegelka. Max-value entropy search for efficient Bayesian optimization. In Proceedings of the 34th International Conference on Machine Learning, pages 3627–3635, 2017.
 Wilkinson (2013) R. D. Wilkinson. Approximate Bayesian computation (ABC) gives exact results under the assumption of model error. Statistical Applications in Genetics and Molecular Biology, 12(2):129–141, 2013.
 Wilkinson (2014) R. D. Wilkinson. Accelerating ABC methods using Gaussian processes. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, 2014.
 Wilson et al. (2018) J. Wilson, F. Hutter, and M. Deisenroth. Maximizing acquisition functions for Bayesian optimization. In Advances in Neural Information Processing Systems 31, pages 9906–9917. 2018.
 Wood (2010) S. N. Wood. Statistical inference for noisy nonlinear ecological dynamic systems. Nature, 466:1102–1104, 2010.
 Wu and Frazier (2016) J. Wu and P. Frazier. The parallel knowledge gradient method for batch Bayesian optimization. In Advances in Neural Information Processing Systems 29, pages 3126–3134. 2016.
Appendix A Proofs and additional analysis
A.1 Proofs
Proof of Lemma 4.1.
We consider first the case of integrated variance. A result corresponding to (17), but with a zero-mean GP prior, is shown as Lemma 3.1 by Järvenpää et al. (2019a). Its proof, however, carries over unchanged to our GP model of Section 3, and (17) follows immediately.
Let us now consider integrated MAD in (18). To simplify notation, we use for , for and for . We then see that
(A.1)  
(A.2) 
For the inner integral with fixed we obtain
(A.3)  
(A.4)  
(A.5)  
(A.6) 
where on the last line we have used the fact
(A.7) 
shown by Järvenpää et al. (2019a). We further see that
(A.8)  
(A.9)  
(A.10)  
(A.11)  
(A.12)  
(A.13)  
(A.14) 
where denotes the bivariate normal cdf and denotes the zero-mean bivariate normal cdf with unit variances and correlation coefficient evaluated at . Finally, using a connection between the bivariate Gaussian cdf and Owen’s T function (Owen, 1956), we obtain
(A.15) 
When we combine the equations, we see that the terms cancel out and we obtain (20). ∎
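The connection between the bivariate Gaussian cdf and Owen's T function used in the proof above is a classical identity (Owen, 1956) and is easy to verify numerically. The sketch below is ours, not taken from the paper: the function name `bvn_cdf` and the test values are illustrative assumptions, and the implementation assumes nonzero arguments (the limits at zero require separate care).

```python
import numpy as np
from scipy.special import owens_t
from scipy.stats import norm, multivariate_normal

def bvn_cdf(h, k, rho):
    """Standard bivariate normal cdf Phi_2(h, k; rho) via Owen's T (Owen, 1956):
    Phi_2(h, k; rho) = (Phi(h) + Phi(k))/2 - T(h, a_h) - T(k, a_k) - beta,
    with a_h = (k - rho*h)/(h*sqrt(1-rho^2)), a_k symmetrically, and
    beta = 1/2 when h and k lie on opposite sides of zero. Assumes h, k != 0."""
    s = np.sqrt(1.0 - rho ** 2)
    a_h = (k - rho * h) / (h * s)
    a_k = (h - rho * k) / (k * s)
    beta = 0.5 if (h * k < 0 or (h * k == 0 and h + k < 0)) else 0.0
    return (0.5 * (norm.cdf(h) + norm.cdf(k))
            - owens_t(h, a_h) - owens_t(k, a_k) - beta)

# Check against a direct bivariate normal cdf evaluation.
rho = 0.4
mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
print(abs(bvn_cdf(0.5, -0.3, rho) - mvn.cdf([0.5, -0.3])) < 1e-4)  # True
```

Evaluating the cdf this way is attractive in practice because fast and accurate algorithms for Owen's T function are available (Patefield and Tandy, 2000).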
Proof of Proposition 4.2.
The formula for the EIV can be derived in a straightforward manner by applying the GP lookahead formulas of Lemma 5.1 by Järvenpää et al. (2019b) within the proof of Proposition 3.2 by Järvenpää et al. (2019a).
The case of EIMAD requires some extra work. First, using an equation from the proof of Lemma 4.1, we obtain
(A.16)  
(A.17)  
(A.18) 
Note that in and is used to emphasise that these quantities depend on and/or . Since , i.e. the reduction of the GP variance function at due to extra evaluations is deterministic and depends only on (and not on ), we obtain for each that
(A.19)  
(A.20)  
(A.21) 
where we have used Lemma 5.1 by Järvenpää et al. (2019b) and (A.7). Similarly, we see that
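The key fact invoked above, that the reduction of the GP posterior variance caused by new evaluations is deterministic, i.e. depends only on the evaluation locations and not on the simulation outcomes, can be illustrated with a minimal sketch. The squared-exponential kernel, hyperparameter values and test points below are illustrative assumptions, not quantities from the paper; note that the observed values never enter the variance computation at all.

```python
import numpy as np

def sqexp(a, b, ls=1.0, sf=1.0):
    """Squared-exponential kernel matrix for 1-D inputs a and b."""
    return sf ** 2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_post_var(X, x_star, noise=1e-6):
    """Zero-mean GP posterior variance at x_star given design points X.
    The observations y are not an argument: the posterior variance
    depends on the input locations only."""
    K = sqexp(X, X) + noise * np.eye(len(X))
    ks = sqexp(X, np.array([x_star]))
    kss = sqexp(np.array([x_star]), np.array([x_star]))
    return (kss - ks.T @ np.linalg.solve(K, ks)).item()

X = np.array([-1.0, 0.0, 1.0])
v_before = gp_post_var(X, 0.4)
v_after = gp_post_var(np.append(X, 0.5), 0.4)  # one extra (batch) location
print(v_after < v_before)  # the variance reduction is positive and
                           # known before the new simulation is run
```

This is what makes the batch design tractable: the lookahead variance after a candidate batch can be computed in closed form before any of the costly simulations are actually performed.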