# Adaptive MCMC for synthetic likelihoods and correlated synthetic likelihoods

Approximate Bayesian computation (ABC) and synthetic likelihood (SL) are strategies for parameter inference when the likelihood function is analytically or computationally intractable. In SL, the likelihood function of the data is replaced by a multivariate Gaussian density for summary statistics compressing the observed data. While SL is conceptually simpler to implement compared with ABC, it requires simulation of many replicate datasets at every parameter value considered by a sampling algorithm, such as MCMC, making the method very computationally intensive. We propose two strategies to alleviate the computational burden imposed by SL algorithms. We first introduce a novel adaptive MCMC algorithm for SL where the proposal distribution is sequentially tuned. Second, we exploit existing strategies from the correlated particle filters literature to improve the MCMC mixing in a SL framework. Additionally, we show how to use Bayesian optimization to rapidly generate promising starting values for SL inference. Our combined goal is to provide ways to make the best out of each expensive MCMC iteration when using synthetic likelihood algorithms, which will broaden the scope of these methods for complex modeling problems with costly simulators. To illustrate the advantages stemming from our framework we consider three benchmarking examples, including estimation of parameters for a cosmological model.


## 1 Introduction

Synthetic likelihood (SL) is a methodology for parameter inference in stochastic models that do not admit a computationally tractable likelihood function. That is, similarly to approximate Bayesian computation (ABC, sisson2018handbook), SL only requires the ability to generate synthetic datasets from a model simulator, together with statistically relevant summary statistics that capture parameter-dependent variation in the data in an adequate manner. The price to pay for this flexibility is that SL can be computationally very intensive, since it is typically embedded into a Markov chain Monte Carlo (MCMC) framework, requiring the simulation of multiple (often hundreds or thousands of) synthetic datasets at each proposed parameter. The goal of our work is to propose strategies for reducing the computational cost of Bayesian inference via SL. Since each iteration of MCMC using SL can have a non-negligible cost on the overall computational budget, we construct an adaptive proposal distribution specific to SL, and we adapt methods recently proposed in the correlated particle filters literature to improve chain mixing in MCMC. In addition, we show that for challenging problems where it is difficult to locate appropriate starting parameters, Bayesian optimization (gutmann2016bayesian) can be efficiently used to kickstart SL-based posterior sampling, which is facilitated by the open-source ELFI software (lintusaari2018elfi).

SL is described in detail in Section 2, but here we first review some of its features with relevant references to the literature. SL was first proposed in wood2010statistical to produce inference for parameters of simulator-based models with intractable likelihood. It replaces the analytically intractable likelihood of the observed data $y$ with the joint density of a set of summary statistics $s = S(y)$ of the data. Here $S(\cdot)$ is a function of the data that has to be specified by the analyst and that can be evaluated for input $y$, or for simulated data $y^*$. The main assumption characterizing the SL approach is that $s$ has a multivariate normal distribution $\mathcal{N}(\mu_\theta,\Sigma_\theta)$, with unknown mean $\mu_\theta$ and covariance matrix $\Sigma_\theta$. The unknown moments can be estimated via Monte Carlo simulations of size $M$ to obtain estimators $\hat\mu_{M,\theta}$, $\hat\Sigma_{M,\theta}$. The resulting multivariate Gaussian likelihood can then be numerically maximised with respect to $\theta$, to return an approximate maximum likelihood estimator or, as recommended in wood2010statistical, it can be plugged into a Metropolis-Hastings algorithm to sample from the posterior $\pi(\theta|s)$. The introduction of data summaries in the inference has been shown to cope well with chaotic models, where the likelihood would otherwise be difficult to optimize and the corresponding posterior surface may be difficult to explore. More generally, SL is a tool for likelihood-free inference, just like the ABC framework (see the reviews sisson2011likelihood; karabatsos2018approximate), where the latter can be seen as a nonparametric methodology, while SL uses a parametric distributional assumption on $s$. SL has found applications in e.g. ecology (wood2010statistical), epidemiology (engblom2019bayesian; dehideniya2019synthetic) and mixed-effects modeling of tumor growth (picchini2019bayesian). For a recent generalization of the SL family of inference methods using statistical classifiers to directly target estimation of the posterior density, see thomas2016likelihood and kokko2019pylfire.

While ABC is more general than SL, it is also more difficult to tune, and it typically suffers from the “curse of dimensionality” when the dimension $d_s$ of $s$ increases, due to its nonparametric nature. Specifically, it was shown in fearnhead2012constructing and li2018asymptotic that in ABC taking $d_s = \dim(\theta)$ is optimal. On the other hand, the Gaussianity assumption concerning the summary statistics is the main limitation of SL. At the same time, due to its parametric nature, SL has been shown to perform satisfactorily on problems where $d_s$ is large relative to the dimension of $\theta$ (ong2018likelihood). price2018bayesian framed SL within a pseudo-marginal algorithm for Bayesian inference (andrieu2009pseudo) and named the method Bayesian SL (BSL). They showed that when $s$ is truly Gaussian, BSL produces MCMC samples from a posterior not depending on $M$, meaning that draws from the posterior obtained via BSL do not depend on the specific choice of $M$. However, in practice, the inference algorithm does depend on the specific choice of $M$, since this value affects the mixing of the MCMC.

As mentioned above, the main downside of SL is that it is computationally intensive, since for each considered value of $\theta$, a large number $M$ of synthetic datasets must be produced and the corresponding summary statistics calculated. Unless the underlying computer model is trivial, producing the $M$ datasets for each $\theta$ represents a serious computational bottleneck. In this work we design a strategy that exploits the Gaussian assumption for the summary statistics in SL and sequentially builds an ad hoc proposal density for possible parameter moves. This strategy can be used with both standard SL and BSL. Our idea is inspired by the “sequential neural likelihood” approach in papamakarios2018sequential. We find that our adaptive proposal for SL (named ASL) is easy to construct and adds essentially no overhead, since it exploits quantities that are computed in SL anyway. Secondly, we correlate log synthetic likelihoods using a “blockwise” strategy, borrowed from the particle filters literature. This is shown to considerably improve mixing of the chains generated via SL, while not introducing correlation can lead to unsatisfactory simulations when starting parameter values reside relatively far from the representative ones. Finally, we show how to deal with the problem of initializing the SL simulations when a good starting parameter is not known, which corresponds to the typical situation in applications. In fact, when the starting parameter value is far in the tails of the posterior, this can lead to (i) non-computable synthetic likelihoods due to non-positive definite covariance matrices, and/or (ii) poorly mixing chains, when the Gaussianity assumption on the summaries breaks down for the unlikely parameter values (even though it may hold for parameters representing the bulk of the posterior). To solve this problem we use the BOLFI method (gutmann2016bayesian) available in ELFI (lintusaari2018elfi), the open-source engine for likelihood-free inference. We show that the Gaussian process surrogate models employed by BOLFI can efficiently learn a good starting parameter value for ASL.

Our paper is structured as follows: in Section 2 we introduce the synthetic likelihood approach. In Section 3 we construct the adaptive proposal distribution via ASL, and in Section 4 we construct correlated synthetic likelihoods. In Section 5 we discuss using BOLFI and ELFI to accelerate convergence in ASL. In Section 6 we discuss three benchmarking simulation studies: a simple g-and-k model, then a cosmological model and finally the recruitment, boom and bust model, which has markedly non-Gaussian summary statistics. Further results are given in the Supplementary Material. Code can be found at https://github.com/umbertopicchini/ASL.

## 2 Synthetic likelihood

We briefly summarize the synthetic likelihood (SL) method as proposed in wood2010statistical. The main goal is to produce Bayesian inference for $\theta$, by sampling from (an approximation to) the posterior $\pi(\theta|s)\propto p(s|\theta)\pi(\theta)$ using MCMC, where $p(s|\theta)$ is the density underlying the true (unknown) distribution of $s$. wood2010statistical proposes a parametric approximation to $p(s|\theta)$, placing the rather strong assumption that $s|\theta\sim\mathcal{N}(\mu_\theta,\Sigma_\theta)$. The reason for this assumption is that estimators for the unknown mean and covariance of the summaries, $\mu_\theta$ and $\Sigma_\theta$ respectively, can be obtained straightforwardly via simulation, as described below. As is obvious from the notation used, $\mu_\theta$ and $\Sigma_\theta$ depend on the unknown finite-dimensional vector parameter $\theta$, and these are estimated by simulating $M$ datasets independently from the assumed data-generating model, conditionally on $\theta$. We denote the synthetic datasets simulated from the assumed model run at a given $\theta^*$ with $y^*_1,\ldots,y^*_M$. These are such that $y^*_m\sim p(y|\theta^*)$ ($m=1,\ldots,M$), with $y$ denoting the observed data. For each dataset $y^*_m$ it is possible to construct the corresponding (possibly vector valued) summary $s^*_m=S(y^*_m)$, with $m=1,\ldots,M$. These simulated summaries are used to construct the following estimators:

$$\hat\mu_{M,\theta^*}=\frac{1}{M}\sum_{m=1}^{M}s^*_m,\qquad \hat\Sigma_{M,\theta^*}=\frac{1}{M-1}\sum_{m=1}^{M}(s^*_m-\hat\mu_{\theta^*})(s^*_m-\hat\mu_{\theta^*})',\tag{1}$$

with $'$ denoting transposition. By defining $p(s|\theta^*):=\mathcal{N}(s;\hat\mu_{M,\theta^*},\hat\Sigma_{M,\theta^*})$, the SL procedure in wood2010statistical samples from the posterior $\pi(\theta|s)$, see Algorithm 1. A slight modification of the original approach in wood2010statistical leads to the “Bayesian synthetic likelihood” (BSL) algorithm of price2018bayesian, which samples from $\pi(\theta|s)$ when $s$ is truly Gaussian, by introducing an unbiased approximation to a Gaussian likelihood. Besides this, BSL is the same as Algorithm 1. See the Supplementary Material for details about BSL. All our numerical experiments use the BSL formulation of the inference problem.
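To make the estimators in (1) concrete, the following is a minimal sketch (not the paper's code) of one synthetic log-likelihood evaluation, for a hypothetical toy simulator where the data are Gaussian and the summaries are the sample mean and variance:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Hypothetical toy simulator: y ~ N(theta, 1); summaries = (sample mean, sample variance)
n_obs = 100
y_obs = rng.normal(2.0, 1.0, n_obs)
s_obs = np.array([y_obs.mean(), y_obs.var(ddof=1)])

def simulate_summary(theta):
    y = rng.normal(theta, 1.0, n_obs)
    return np.array([y.mean(), y.var(ddof=1)])

def synthetic_loglik(theta, M=200):
    """Estimate the moments as in (1) and evaluate the Gaussian synthetic log-likelihood."""
    S = np.stack([simulate_summary(theta) for _ in range(M)])  # M x d_s
    mu_hat = S.mean(axis=0)
    Sigma_hat = np.cov(S, rowvar=False)  # 1/(M-1) normalization, matching (1)
    return multivariate_normal.logpdf(s_obs, mu_hat, Sigma_hat)

# The synthetic likelihood should clearly favour values near the one generating y_obs
ll_near, ll_far = synthetic_loglik(2.0), synthetic_loglik(5.0)
```

Note that each evaluation of `synthetic_loglik` costs $M$ simulator calls, which is exactly the computational bottleneck discussed next.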

When the simulator generating the synthetic datasets is computationally demanding, Algorithm 1 is computer intensive, as it generally needs to be run for a number of iterations in the order of thousands. The problem is exacerbated by the possibly poor mixing of the resulting chain. As is well known in the literature on pseudo-marginal methods (e.g. doucet2015efficient, pitt2012some), when a likelihood is approximated using Monte Carlo simulations, an occasional acceptance of an overestimated likelihood may occur, causing subsequent proposals to be rejected for many iterations. This produces a “sticky chain”. The most obvious way to alleviate the problem is to reduce the variance of the estimated likelihoods by increasing $M$, but of course this makes the algorithm computationally more intensive. A further problem occurs when the initial $\theta$ lies far away in the tails of the posterior. This may cause numerical problems when the initial $\hat\Sigma_{M,\theta}$ is ill-conditioned, possibly requiring a very large $M$ to get the MCMC started; hence it is desirable to have the chains approach the bulk of the posterior as rapidly as possible.

In the following we propose two strategies aiming to keep the mixing rate of an MCMC chain produced either by SL or BSL at acceptable levels, and also to ease convergence of the chains to the regions of high posterior density. The first strategy designs a specific proposal distribution for use in MCMC via synthetic likelihood: this is denoted “adaptive proposal for synthetic likelihoods” (ASL for short) and is described in Section 3. The second strategy reduces the variability in the Metropolis-Hastings ratio by correlating successive pairs of synthetic likelihoods: this results in “correlated synthetic likelihoods” (CSL), described in Section 4.

## 3 Adaptive proposals for synthetic likelihoods

In Section 3.1 we illustrate the main ideas of our ASL method. Later, in Section 3.2, we specialize ASL so that we instead obtain a sequence of proposal distributions $g_t(\theta|s)$. What we now introduce in Section 3.1 will also initialize the ASL method, i.e. provide an initial proposal $g_0(\theta|s)$.

### 3.1 Main idea and initialization

Suppose we have at our disposal $N$ pairs $(\theta_n,s_n)$, where the $\theta_n$ are posterior draws generated by some SL procedure (i.e. the standard method from wood2010statistical or the BSL one from price2018bayesian), e.g. $\theta_n\sim\pi(\theta|s)$, while $s_n$ is one of the $M$ summaries that have been produced to compute the synthetic likelihood corresponding to $\theta_n$. That is, $s_n$ need not be simulated anew conditionally on $\theta_n$, as it is already available. We set $d_\theta=\dim(\theta)$ and $d_s=\dim(s)$; then $x_n=(\theta_n,s_n)$ is a vector having length $d=d_\theta+d_s$. Assume for a moment that the joint vector $x=(\theta,s)$ is $d$-dimensional Gaussian, with $x\sim\mathcal{N}(m,S)$. We stress that this assumption is made merely to construct a proposal sampler, and does not extend to the actual distribution of $(\theta,s)$. We set a $d$-dimensional mean vector $m=(m_\theta,m_s)$ and the covariance matrix

$$S\equiv\begin{bmatrix}S_\theta & S_{\theta s}\\ S_{s\theta} & S_s\end{bmatrix},$$

where $S_\theta$ is $d_\theta\times d_\theta$, $S_{\theta s}$ is $d_\theta\times d_s$, $S_{s\theta}$ is $d_s\times d_\theta$ and of course $S_s$ is $d_s\times d_s$. We estimate $m$ and $S$ using the available draws. That is, we define $x_n=(\theta_n,s_n)$; then, same as in (1), we have

$$\hat m=\frac{1}{N}\sum_{n=1}^{N}x_n,\qquad \hat S=\frac{1}{N-1}\sum_{n=1}^{N}(x_n-\hat m)(x_n-\hat m)'.\tag{2}$$

Once $\hat m$ and $\hat S$ are obtained, it is possible to extract the corresponding entries $\hat m_\theta$, $\hat m_s$, $\hat S_\theta$, $\hat S_{\theta s}$, $\hat S_{s\theta}$, $\hat S_s$. We can now use well-known formulae for conditionals of a multivariate Gaussian distribution, to obtain a proposal distribution (with a slight abuse of notation) $\theta|s\sim\mathcal{N}(\hat m_{\theta|s},\hat S_{\theta|s})$, with

$$\hat m_{\theta|s}=\hat m_\theta+\hat S_{\theta s}(\hat S_s)^{-1}(s-\hat m_s),\tag{3}$$
$$\hat S_{\theta|s}=\hat S_\theta-\hat S_{\theta s}(\hat S_s)^{-1}\hat S_{s\theta}.\tag{4}$$

Hence a new proposal can be generated as $\theta^*\sim\mathcal{N}(\hat m_{\theta|s},\hat S_{\theta|s})$, thus exploiting the information provided by the observed summaries $s$, and the proposal can then be updated as new posterior draws become available, as further described below. Therefore, $\mathcal{N}(\hat m_{\theta|s},\hat S_{\theta|s})$ can be employed as the proposal in Algorithm 1, but we recommend using it only for a few iterations, as explained later. Clearly the proposal function is independent of the last accepted value of $\theta$; hence it is an “independence sampler” (robert2004monte), except that its mean and covariance matrix are not constant but change as new draws accumulate. If our approach is used as just introduced, it might produce an “overconfident” chain, with a very high acceptance probability (e.g. an acceptance rate of more than 0.50 or even more than 0.80). This implies that the proposed moves are too local, and we recommend proposing instead from $\mathcal{N}(\hat m_{\theta|s},\beta^2\hat S_{\theta|s})$, where $\beta$ is an “expansion factor” larger than one; however, any real scalar may be considered. This way the proposed parameters will explore larger regions (at the price of a reduced acceptance rate), rather than being too constrained by the range of the already accepted draws. Tuning $\beta$ is however not really necessary, as we plan to use ASL only for a small number of iterations. Moreover, the next section also illustrates a sampler based on the multivariate Student’s distribution.
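As an illustration, the conditional moments (3)–(4) amount to a few lines of linear algebra. The sketch below (with made-up dimensions and data, not the paper's code) partitions the joint sample moments of $x_n=(\theta_n,s_n)$ and draws one proposal:

```python
import numpy as np

rng = np.random.default_rng(0)
d_theta, d_s, N = 2, 3, 500

# Hypothetical stored pairs x_n = (theta_n, s_n), drawn here from a correlated Gaussian
A = rng.normal(size=(d_theta + d_s, d_theta + d_s))
X = rng.multivariate_normal(np.zeros(d_theta + d_s), A @ A.T + np.eye(d_theta + d_s), N)

# Joint sample moments, as in (2)
m_hat = X.mean(axis=0)
S_hat = np.cov(X, rowvar=False)

# Partition with the theta block first and the s block last
m_th, m_s = m_hat[:d_theta], m_hat[d_theta:]
S_th = S_hat[:d_theta, :d_theta]
S_ths = S_hat[:d_theta, d_theta:]
S_s = S_hat[d_theta:, d_theta:]

s_obs = rng.normal(size=d_s)  # stands in for the observed summaries

# Conditional moments (3)-(4)
m_cond = m_th + S_ths @ np.linalg.solve(S_s, s_obs - m_s)
S_cond = S_th - S_ths @ np.linalg.solve(S_s, S_ths.T)

# Proposal with an "expansion factor" beta (assumed value)
beta = 1.5
theta_prop = rng.multivariate_normal(m_cond, beta**2 * S_cond)
```

Note that conditioning can only shrink the proposal spread: the subtracted term in (4) is positive semi-definite, so `S_cond` never has larger trace than `S_th`.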

### 3.2 Sequential approach

The construction outlined above contains the key ideas underlying our adaptive MCMC for synthetic likelihoods (ASL) methodology; however, it can be further detailed to ease the actual implementation in a sequential way. In fact, the above is based on an available batch of draws, but we may want to update our sampler sequentially, and we define a sequence of $T$ “rounds” over which kernels $g_t$ are sequentially constructed. In the first round ($t=0$), we construct $g_0$ using the output of $N_0$ MCMC iterations, obtained using e.g. a Gaussian random walk. We may consider these $N_0$ as burnin iterations. Once (2)–(3)–(4) are computed using the output of the burnin iterations, we obtain the first adaptive distribution, denoted $g_0$, as already illustrated in Section 3.1. We store the burnin draws and then employ $g_0$ as a proposal density in $N$ further MCMC iterations, after which we perform the following steps: (i) we collect the newly obtained batch of $N$ pairs $(\theta_n,s_n)$ (where, again, $\theta_n\sim\pi(\theta|s)$ and $s_n$ is one of the already accepted simulated summaries generated conditionally on $\theta_n$) and add it to the previously obtained ones. Then (ii), similarly to (2), we compute the sample mean and corresponding covariance, except that here we use all the pairs collected so far. (iii) We update (3)–(4) accordingly and obtain $g_1$. (iv) We use $g_1$ for $N$ further MCMC moves, stack the new draws with the previous ones, and using all collected pairs proceed as before to obtain $g_2$, and so on until the last batch of $N$ iterations generated using $g_T$ is obtained.

From the procedure we have just illustrated, the sequence of Gaussian kernels has $g_t(\theta|s)=\mathcal{N}(\hat m^{0:t}_{\theta|s},\hat S^{0:t}_{\theta|s})$, with the conditional mean and covariance matrix given by

$$\hat m^{0:t}_{\theta|s}=\hat m^{0:t}_\theta+\hat S^{0:t}_{\theta s}(\hat S^{0:t}_s)^{-1}(s-\hat m^{0:t}_s),\tag{5}$$
$$\hat S^{0:t}_{\theta|s}=\hat S^{0:t}_\theta-\hat S^{0:t}_{\theta s}(\hat S^{0:t}_s)^{-1}\hat S^{0:t}_{s\theta}.\tag{6}$$

The proposal function uses all available present and past information, as the moments are obtained using the most recent collection of pairs, which contains information from the previous rounds in addition to the latest batch of draws. Compared to a standard Metropolis random walk, the additional computational effort to implement our method is negligible, as it uses trivial matrix algebra applied to quantities obtained as a by-product of the SL procedure, namely the several pairs $(\theta_n,s_n)$. Regarding our specification that $s_n$ is one of the $M$ summaries produced conditionally on $\theta_n$, in our experiments we choose $s_n$ according to the strategy detailed in Section 3.3. An alternative to Gaussian proposals is to use multivariate Student’s proposals. We build on the result found in ding2016conditional, which allows us to write the conditional $\theta|s$ as a multivariate Student’s distribution with $\nu+d_s$ degrees of freedom, and in this case $\theta^*_n$ can be simulated using

$$\theta^*_n=\hat m^{0:t}_{\theta|s}+\left(\sqrt{\frac{\nu+\delta_n}{\nu+d_s}}\,\bigl(\hat S^{0:t}_{\theta|s}\bigr)^{1/2}\right)\left(Z_n\Big/\sqrt{\frac{\chi^2_{\nu+d_s}}{\nu+d_s}}\right),\tag{7}$$

with $\chi^2_{\nu+d_s}$ an independent draw from a Chi-squared distribution with $\nu+d_s$ degrees of freedom, and $Z_n$ a $d_\theta$-dimensional standard multivariate Gaussian vector independent of $\chi^2_{\nu+d_s}$. For simplicity, in the following we do not make a distinction between the Gaussian and the Student’s proposals, and the user can choose either of the two, as they are anyway obtained from the same building blocks (2)–(6).
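A draw from the Student's proposal (7) can be sketched as follows; the conditional moments, $\nu$ and the scalar $\delta_n$ below are hypothetical placeholder values, not quantities from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_t_proposal(m_cond, S_cond, nu, d_s, delta, rng):
    """One draw of theta* from (7): a conditional multivariate Student's t."""
    L = np.linalg.cholesky(S_cond)        # square root of the conditional covariance
    Z = rng.standard_normal(len(m_cond))  # standard Gaussian vector Z_n
    chi2 = rng.chisquare(nu + d_s)        # independent Chi-squared draw
    scale = np.sqrt((nu + delta) / (nu + d_s))
    return m_cond + scale * (L @ Z) / np.sqrt(chi2 / (nu + d_s))

# Hypothetical values standing in for the quantities in (5)-(6)
m_cond = np.array([0.5, -1.0])
S_cond = np.array([[1.0, 0.3], [0.3, 0.8]])
nu, d_s, delta = 5, 3, 1.7

theta_star = sample_t_proposal(m_cond, S_cond, nu, d_s, delta, rng)
```

With $\nu+d_s$ degrees of freedom the proposal has heavier tails than its Gaussian counterpart, which can help escape overconfident adaptation.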

A pseudo-algorithm embedding the construction of the sequence of kernels $g_t$ into a SL procedure is given in Algorithm 2 and constitutes our ASL approach. Since from each $g_t$ we draw $N$ proposals, the total MCMC effort consists of $N_0+T\cdot N$ iterations (the $N_0$ iterations are used as burnin).
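The whole ASL loop can be sketched end-to-end on a toy model. The code below is a simplified sketch, not Algorithm 2 itself: it uses a scalar Gaussian model with the sample mean as summary, $N=1$ (the proposal is refreshed at every post-burnin iteration), the single nearest summary in place of the KNN step of Section 3.3, and a flat prior:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(3)

# Toy model (hypothetical): y ~ N(theta, 1); summary s = sample mean (d_s = 1)
n_obs, M = 40, 30
y_obs = rng.normal(1.5, 1.0, n_obs)
s_obs = np.array([y_obs.mean()])

def sl_step(theta):
    """Synthetic log-likelihood at theta, plus the simulated summary closest to s_obs."""
    S = rng.normal(theta, 1.0, (M, n_obs)).mean(axis=1, keepdims=True)
    ll = mvn.logpdf(s_obs, S.mean(0), np.atleast_2d(np.cov(S, rowvar=False)))
    return ll, S[np.argmin(np.abs(S - s_obs).sum(axis=1))]

# Burnin: N0 Gaussian random-walk iterations, storing pairs x_n = (theta_n, s_n)
N0, T = 200, 300
theta, (ll, s_n) = 0.0, sl_step(0.0)
pairs = []
for _ in range(N0):
    th_p = theta + 0.5 * rng.standard_normal()
    ll_p, s_p = sl_step(th_p)
    if np.log(rng.uniform()) < ll_p - ll:
        theta, ll, s_n = th_p, ll_p, s_p
    pairs.append(np.concatenate([[theta], s_n]))

# ASL rounds with N = 1: refresh the conditional-Gaussian proposal every iteration
chain = []
for _ in range(T):
    X = np.asarray(pairs)
    m = X.mean(0)
    S = np.cov(X, rowvar=False) + 1e-9 * np.eye(2)  # tiny jitter for stability
    m_c = m[0] + S[0, 1:] @ np.linalg.solve(S[1:, 1:], s_obs - m[1:])
    v_c = S[0, 0] - S[0, 1:] @ np.linalg.solve(S[1:, 1:], S[1:, 0])
    th_p = m_c + np.sqrt(v_c) * rng.standard_normal()
    ll_p, s_p = sl_step(th_p)
    # independence-sampler correction: + log g(theta_c) - log g(theta_p)
    corr = (-0.5 * (theta - m_c) ** 2 / v_c) - (-0.5 * (th_p - m_c) ** 2 / v_c)
    if np.log(rng.uniform()) < ll_p - ll + corr:
        theta, ll, s_n = th_p, ll_p, s_p
    pairs.append(np.concatenate([[theta], s_n]))
    chain.append(theta)
```

In this toy setting the post-burnin chain should concentrate around the observed summary, here close to the data-generating value 1.5.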

An advantage of ASL is that it is self-adapting. A disadvantage is that, since the adaptation results in an independence sampler, it does not explore a neighbourhood of the last accepted draw, and newly accepted draws obtained at stage $t$ might not necessarily produce a rapid change in the mean and covariance of the proposal function (should a rapid change actually be required for optimal exploration of the parameter space). That is, the sampler could react slowly to local changes in the posterior surface, as this only happens once the mean and covariance change substantially. This is why in our applications we obtained the best results when setting $N=1$. That is, the proposal distribution is updated at each iteration by immediately incorporating the information provided by the last accepted draw.

Our adaptive strategy is inspired by the sequential neural likelihood approach found in papamakarios2018sequential. There, the MCMC draws obtained in each of $T$ stages sequentially approximate the likelihood function for models having an intractable likelihood, whose approximation at stage $t$ is obtained by training a neural network (NN) on the MCMC output obtained at stage $t-1$. Their approach is more general (and is aimed at approximating the likelihood, not the MCMC proposal), but has the disadvantage of requiring the construction of a NN, whose hyperparameters must then be tuned at every stage, which of course requires domain knowledge and computational resources. Our approach is framed specifically for inference via synthetic likelihoods, which is a limitation per se, but it is completely self-tuning, with the possible exception of the burnin iterations, where an initial covariance matrix must be provided by the user, though this is a minor issue when the number of parameters is limited.

However, we provide no claim on the ergodicity of the generated chain. That is, while ASL is of help in “pushing” the chain to regions of high posterior density, we cannot ensure that the resulting draws are distributed according to the target posterior $\pi(\theta|s)$. In fact, generally the chain is able to explore the mode of the posterior, not necessarily the tails, and this is why in practice we run ASL for a relatively small number of iterations, which we use to initialize an adaptive MCMC algorithm with proved ergodicity properties. In our studies, once the $N_0+T\cdot N$ iterations of ASL are completed (with $T$ relatively small), we produce further iterations with the algorithm of haario2001adaptive, hereafter denoted “Haario”. Basically, we initialize Haario at the last draw returned by ASL, with an initial covariance matrix obtained from ASL (e.g. $\hat S^{0:T}_{\theta|s}$, or the one implied by (7)). In practice, the final inference results use only the draws produced by running the Haario method.

### 3.3 Using KNN to select a plausible summary statistic

In Section 3.2 we mentioned that we collect batches of pairs $(\theta_n,s_n)$, where $s_n$ is one of the $M$ simulated summaries corresponding to the accepted $\theta_n$. However, in practice it would be naive to randomly pick any of the $M$ summaries, as even for an accepted parameter some of the corresponding simulated summaries could lie very far away from the observed $s$, which is especially true during the burnin iterations. What we do instead is to employ the $K$-nearest-neighbour (KNN) procedure of Friedman77analgorithm, which we use to select the “best” $K$ summaries ($K<M$) among the $M$ simulated conditionally on $\theta_n$, where “best” here means the closest to the observed $s$ under some distance. From these $K$ summaries we randomly pick one, which becomes the $s_n$. Within KNN we use a Mahalanobis distance employing the covariance matrix $\hat\Sigma_{M,\theta_n}$; our experiments used a fixed value of $K$, but a different $K$ could be used. This procedure has very little computational cost for our algorithm, therefore we use it as a default (in Matlab it is implemented as knnsearch).
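The selection step can be sketched as follows (synthetic stand-in data; in the paper this role is played by Matlab's knnsearch):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch of M simulated summaries (M x d_s) at an accepted theta_n
M, d_s, K = 100, 3, 10
S = rng.multivariate_normal(np.zeros(d_s), np.diag([1.0, 4.0, 0.25]), M)
s_obs = np.array([0.2, -0.5, 0.1])  # observed summaries (made up)

# Squared Mahalanobis distances under the estimated summary covariance
Sigma_inv = np.linalg.inv(np.cov(S, rowvar=False))
diff = S - s_obs
d2 = np.einsum('ij,jk,ik->i', diff, Sigma_inv, diff)

# Keep the K nearest summaries and pick one at random to serve as s_n
nearest = np.argsort(d2)[:K]
idx = rng.choice(nearest)
s_n = S[idx]
```

Using the Mahalanobis rather than the Euclidean distance accounts for summaries living on very different scales, as with the second and third components above.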

### 3.4 On the explicit conditioning on the summaries

A legitimate question that may arise is why use (5)–(6) at all, that is, why condition on $s$, given that the unconditional $\hat m_\theta$ and $\hat S_\theta$ are the mean and covariance of draws from the posterior $\pi(\theta|s)$, and hence are by definition already conditioned on $s$. However, not using (5)–(6), i.e. proposing from a Gaussian having mean $\hat m_\theta$ and covariance matrix $\hat S_\theta$, would be detrimental in the first MCMC iterations immediately after burnin. In fact, in such a case the proposal distribution would again be an independence sampler for a chain that could possibly be very far from stationarity, and hence would be self-calibrated on accepted values far from the target. This would cause proposals to be produced by taking too large “jumps”, resulting in many rejections. Instead, we show in the Supplementary Material that applying an explicit conditioning via (5)–(6) (in addition to the implicit conditioning due to using moments obtained from posterior draws) eases the chain mixing. Notice in fact that the mean in (5) reduces to $\hat m^{0:t}_\theta$ as soon as $\hat m^{0:t}_s\approx s$. The latter condition means that the chain is close to the bulk of the posterior and accepted parameters simulate summaries distributed around the observed $s$. Therefore, when the chain is far from its target, the additional term in (5) can help guide the proposals thanks to an explicit conditioning on the data.

## 4 Correlated synthetic likelihood

Following the success of the pseudo-marginal method (PM), which returns exact Bayesian inference whenever an unbiased estimate of some intractable likelihood is available (beaumont2003estimation, andrieu2009pseudo, andrieu2010particle), strategies aiming at increasing the efficiency of particle filters (or sequential Monte Carlo) for Bayesian inference in state-space models have been studied extensively; see schon2018probabilistic for an approachable review. A recent important addition to PM methodology, improving the acceptance rate in Metropolis-Hastings algorithms when particle filters are used to unbiasedly approximate the likelihood function, considers inducing some correlation between the likelihoods appearing in the Metropolis-Hastings ratio. The idea underlying correlated pseudo-marginal methods (CPM), as initially proposed in dahlin2015accelerating and deligiannidis2015correlated, is that having correlated likelihoods will reduce the stochastic variability in the estimated acceptance ratio. This reduces the stickiness in the MCMC chain, which is typically due to excessively varying likelihood approximations, when these are obtained using a too-small number of particles. In fact, while the variability of these estimates can be mitigated by increasing the number of particles, this has of course negative consequences on the computational budget. Instead, CPM strategies allow for a considerably smaller number of particles when trying to alleviate the stickiness problem; see for example golightly2018correlated for applications to stochastic kinetic models. For example, pitt2012some show that to obtain a good tradeoff between computational complexity and MCMC mixing in PM algorithms, the number of particles used in the particle filter should be such that the variance of the log of the estimated likelihood is around one, hence the number of required particles is $O(T)$, for data of size $T$. deligiannidis2015correlated show that the number of particles required by CPM in each MCMC iteration can be reduced to $O(T^{1/2})$. The interesting fact is that implementing CPM approaches is trivial. deligiannidis2015correlated and dahlin2015accelerating correlate the estimated likelihoods at the proposed and current values of the model parameters by correlating the underlying standard normal random numbers used to construct the estimates of the likelihood, via a Crank-Nicolson proposal. We found particular benefit with the “blocked” PM approach (BPM) of tran2016block (see also choppala2016bayesian for inference in state-space models), which we now describe in full generality, i.e. regardless of our synthetic likelihoods approach.

Denote with $U$ the vector of random variates (typically standard Gaussian or uniform) necessary to produce a non-negative unbiased likelihood approximation at a given parameter $\theta$ for data $y$. In tran2016block the set $U$ is divided into $G$ blocks, and one of these blocks is updated jointly with $\theta$ in each MCMC iteration. Let $\hat p(y|\theta,U^{(i)})$ be the estimated unbiased likelihood obtained using the $i$th block of random variates $U^{(i)}$, $i=1,\ldots,G$. Define the joint posterior of $\theta$ and $U$ as

$$\pi(\theta,U|y)\propto\hat p(y|\theta,U)\,\pi(\theta)\prod_{i=1}^{G}p_U(U^{(i)}),\tag{8}$$

where $\theta$ and $U$ are a-priori independent and

$$\hat p(y|\theta,U):=\frac{1}{G}\sum_{i=1}^{G}\hat p(y|\theta,U^{(i)})\tag{9}$$

is the average of the $G$ unbiased likelihood estimates, and hence is also unbiased. We then update the parameters $\theta$ jointly with a randomly selected block $U^{(k)}$ in each MCMC iteration, with $\Pr(k=i)=1/G$ for any $i=1,\ldots,G$. Using this scheme, the acceptance probability for a joint move from the current set $(\theta_c,U_c)$ to a proposed set $(\theta_p,U^{(k)}_p)$, generated using some proposal function $g(\theta_p|\theta_c)$, is

$$\alpha=\min\left\{1,\frac{\hat p(y|\theta_p,U^{(1)}_c,\ldots,U^{(k-1)}_c,U^{(k)}_p,U^{(k+1)}_c,\ldots,U^{(G)}_c)\,\pi(\theta_p)}{\hat p(y|\theta_c,U^{(1)}_c,\ldots,U^{(k-1)}_c,U^{(k)}_c,U^{(k+1)}_c,\ldots,U^{(G)}_c)\,\pi(\theta_c)}\,\frac{g(\theta_c|\theta_p)}{g(\theta_p|\theta_c)}\right\}.\tag{10}$$

Hence, in case of proposal acceptance, we update the joint vector $(\theta,U)$ and move to the next iteration. The resulting chain targets (8) (tran2016block). Notice that the random variates used to compute the likelihood at the numerator of (10) are the same as those for the likelihood at the denominator, except for the $k$-th block; hence $G-1$ blocks from the current set are reused at the numerator. This induces beneficial correlation between successive pairs of likelihood estimates. Also, the proposed block is drawn from its prior $p_U$, so the corresponding terms cancel, giving the simplified expression (10). The correlation between successive likelihood estimates is approximately $1-1/G$ (choppala2016bayesian), so the larger the number of simulations involved when computing $\hat p$, the more groups that can be formed and the higher the correlation. Also, note that the $G$ approximations in (9) can be run in parallel on multiple processors when these likelihoods are approximated using particle filters. However, in our synthetic likelihood approach we do not make use of (9), and we take instead the joint synthetic likelihood $p(s|\theta,U)$ without decomposing it into a sum of contributions. We do not in fact compute the $\hat p(s|\theta,U^{(i)})$ separately, since we found that, in order for each term to behave in a numerically stable way, a not-too-small number of simulations should be devoted to each sub-likelihood term, as otherwise the corresponding covariance estimate is singular, causing instability. Therefore, in practice, we just compute the joint $p(s|\theta,U)$, and (10) becomes

$$\alpha=\min\left\{1,\frac{p(s|\theta_p,U^{(1)}_c,\ldots,U^{(k-1)}_c,U^{(k)}_p,U^{(k+1)}_c,\ldots,U^{(G)}_c)\,\pi(\theta_p)}{p(s|\theta_c,U^{(1)}_c,\ldots,U^{(k-1)}_c,U^{(k)}_c,U^{(k+1)}_c,\ldots,U^{(G)}_c)\,\pi(\theta_c)}\,\frac{g(\theta_c|\theta_p)}{g(\theta_p|\theta_c)}\right\},\tag{11}$$

which we therefore call the “correlated synthetic likelihood” (CSL) approach. From the analytic point of view, our correlated likelihood is the same unbiased approximation given in price2018bayesian (see also the Supplementary Material); hence CSL uses the BSL approach, the only difference being the recycling of blocks from the set of pseudo-random draws $U$, as described above.

In our experiments we show that using the acceptance criterion (11) in Algorithm 1 (regardless of the use of our ASL proposal kernel) is of great benefit for easing convergence, and it comes with no computational overhead compared to not using correlated synthetic likelihoods.
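A minimal sketch of CSL on a toy model (an assumed Gaussian model with a sample-mean summary; not the paper's code) illustrates the key mechanics: the simulator is written as a deterministic map of $(\theta,U)$, the variates $U$ are kept in $G$ blocks, and one randomly chosen block is refreshed per iteration, as in (11):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(7)

# Toy model (hypothetical): y ~ N(theta, 1); summary = sample mean
n_obs, M, G = 30, 60, 6  # M simulations split into G blocks of M // G
y_obs = rng.normal(0.8, 1.0, n_obs)
s_obs = np.array([y_obs.mean()])

def sl_loglik(theta, U):
    """Synthetic log-likelihood where the summaries are a deterministic map of U."""
    y = theta + U  # common random numbers: U has shape (M, n_obs)
    S = y.mean(axis=1, keepdims=True)
    return mvn.logpdf(s_obs, S.mean(0), np.atleast_2d(np.cov(S, rowvar=False)))

theta_c = 0.0
U_c = rng.standard_normal((G, M // G, n_obs))  # G blocks of random variates
ll_c = sl_loglik(theta_c, U_c.reshape(M, n_obs))

chain = []
for _ in range(500):
    k = rng.integers(G)  # refresh a single, randomly chosen block
    U_p = U_c.copy()
    U_p[k] = rng.standard_normal((M // G, n_obs))
    theta_p = theta_c + 0.3 * rng.standard_normal()
    ll_p = sl_loglik(theta_p, U_p.reshape(M, n_obs))
    if np.log(rng.uniform()) < ll_p - ll_c:  # flat prior, symmetric proposal
        theta_c, U_c, ll_c = theta_p, U_p, ll_p
    chain.append(theta_c)
```

Because $G-1$ blocks of variates are shared between numerator and denominator, the noise in successive synthetic log-likelihoods largely cancels in the acceptance ratio.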

## 5 Algorithmic initialization using BOLFI and ELFI

We consider the problem of initializing an MCMC algorithm using the synthetic likelihoods (SL) approach, for experiments where obtaining a reasonable starting value for $\theta$ by trial-and-error is not feasible, due to the computational cost of evaluating the SL density at many candidate values of $\theta$. At a minimum, we need to find a value of $\theta$ such that the corresponding SL density (the biased or the unbiased one in the sense of price2018bayesian) has a positive definite covariance matrix $\hat\Sigma_{M,\theta}$. This is not ensured when the summaries are simulated from highly non-representative values of $\theta$, which would result in an MCMC algorithm that halts. The issue is critical, as testing many values of $\theta$ can be prohibitively expensive, both because the dimension of $\theta$ can be large and because the model itself might be slow to simulate from. This is exacerbated by the very nature of the SL procedure, which is intrinsically expensive. An alternative would be to use a different type of inference method for the initialization, e.g. some version of ABC such as ABC-MCMC (marjoram2003markov; sisson2011likelihood), in order to locate an approximate posterior mode and set the starting $\theta$ to this value. However, ABC algorithms are notoriously not easy to calibrate, and their application would be counter-intuitive in the context of SL inference in the first place, as an SL is easier to construct, at least when approximately Gaussian summaries are available.

An approach developed in gutmann2016bayesian uses Bayesian optimization to locate "optimal" parameter values when the likelihood function is intractable but realizations from a stochastic model simulator are available, which is exactly the framework that applies to ABC and SL. The resulting method, named BOLFI (Bayesian optimization for likelihood-free inference), locates a parameter value that either minimizes the expected value of some discrepancy between simulated and observed summary statistics, or alternatively can be used to minimize the negative SL expression. For example, the discrepancy could be the Euclidean distance between simulated and observed summaries, or a Mahalanobis distance induced by some square matrix weighting the individual contributions of the summaries' entries (see prangle2017adapting). The appeal of BOLFI is that (i) it is able to rapidly focus the exploration on those regions of the parameter space where either the discrepancy is smaller, or the SL is larger, and (ii) it is implemented in ELFI (lintusaari2018elfi), the Python-based open-source engine for likelihood-free inference.
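For concreteness, the two discrepancies just mentioned can be coded as follows (the weighting matrix passed to the Mahalanobis version is a user-supplied choice, not one prescribed by the paper):

```python
import numpy as np

def euclidean_disc(s_sim, s_obs):
    """Unweighted Euclidean distance between summary vectors."""
    return np.linalg.norm(s_sim - s_obs)

def mahalanobis_disc(s_sim, s_obs, A):
    """Mahalanobis-type distance with a (positive definite) weighting matrix A,
    giving individual summaries different influence on the discrepancy."""
    d = s_sim - s_obs
    return np.sqrt(d @ A @ d)
```

With `A` equal to the identity matrix the two discrepancies coincide; a common choice for `A` is the inverse covariance of the summaries.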

Hence, in the case of expensive simulators, BOLFI is ideally positioned to minimize the number of attempts needed to obtain a reasonable parameter value with which to initialize the synthetic likelihoods approach. BOLFI replaces the expensive realizations from the model simulator with a "surrogate simulator" defined by a Gaussian process (GP, rasmussen2004gaussian). Using simulations from the actual (expensive) simulator to form a collection of (parameter, discrepancy) pairs, the GP is trained on the generated pairs and the actual optimization in BOLFI only uses the computationally cheap GP surrogate. This means that the optimum returned by BOLFI does not necessarily reflect the parameter value best reproducing the observed data. It is possible to use the BOLFI optimum to initialize some other procedure within ELFI, such as Hamiltonian Monte Carlo MCMC via the NUTS algorithm of hoffman2014no. However, the ELFI version of NUTS uses, again, the GP surrogate of the likelihood function. In our current work, once the BOLFI optimum is obtained, we revert instead to the SL MCMC, which still uses simulations from the true model; these may be expensive, but at least the chain is initialized at a starting value that should be "good enough" to avoid a long and expensive burnin. We exemplify this approach in Section 6.1.2.
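To convey the idea, the following toy sketch mimics the BOLFI initialization logic in plain NumPy: a GP with a fixed RBF kernel is fit to (parameter, log-discrepancy) pairs from a cheap illustrative simulator, and the minimizer of the surrogate is taken as the MCMC starting value. All names, the simulator, and the settings are illustrative assumptions; the actual ELFI implementation additionally uses an acquisition rule to select new evaluation points:

```python
import numpy as np

def rbf_kernel(X1, X2, ell=1.0, sf2=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    d = X1[:, None] - X2[None, :]
    return sf2 * np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior_mean(X, y, Xstar, ell=1.0, sf2=1.0, noise=1e-2):
    """GP regression posterior mean at Xstar, given training pairs (X, y)."""
    K = rbf_kernel(X, X, ell, sf2) + noise * np.eye(len(X))
    Ks = rbf_kernel(Xstar, X, ell, sf2)
    return Ks @ np.linalg.solve(K, y - y.mean()) + y.mean()

rng = np.random.default_rng(0)
theta_true = 2.0                                        # hypothetical scalar parameter
simulate = lambda th: th + 0.3 * rng.standard_normal()  # cheap toy simulator

# training pairs: parameters drawn from a uniform "prior",
# log-discrepancies of the simulated output to the "observed" value
thetas = rng.uniform(0.0, 5.0, size=30)
logd = np.log((np.array([simulate(t) for t in thetas]) - theta_true) ** 2 + 1e-8)

# surrogate optimization: minimize the cheap GP mean over a fine grid,
# instead of re-running the (in general expensive) simulator
grid = np.linspace(0.0, 5.0, 500)
theta_start = grid[np.argmin(gp_posterior_mean(thetas, logd, grid))]
# theta_start can now initialize the (expensive) SL MCMC
```

The surrogate minimizer is only a rough locator of the high-density region, which is exactly what is needed for initialization purposes.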

## 6 Simulation studies

### 6.1 g-and-k distribution

The g-and-k distribution is a standard toy model for case studies having intractable likelihoods (e.g. allingham2009bayesian; fearnhead2012constructing; picchini2018likelihood), in that its simulation is straightforward, but it does not have a closed-form probability density function (pdf). Therefore the likelihood is analytically intractable. However, it has been noted in rayner2002numerical that one can still numerically compute the pdf, by 1) numerically inverting the quantile function to get the cumulative distribution function (cdf), and 2) numerically differentiating the cdf, using finite differences, for instance. Therefore "exact" Bayesian inference (exact up to numerical discretization) is possible. This approach is implemented in the gk R package (gk).

The g-and-k distribution is a flexibly shaped distribution that is used to model non-standard data through a small number of parameters. It is defined by its quantile function; see gk for an overview. Essentially, it is possible to generate a draw from the distribution using the following scheme

 Q = A + B[1 + c·(1 − exp(−g·u))/(1 + exp(−g·u))]·(1 + u²)^k·u (12)

where u is a standard Gaussian draw, A and B are location and scale parameters, and g and k are related to skewness and kurtosis. Parameter restrictions are B > 0 and k > −1/2. We take (A, B, g, k) as the parameters of interest, noting that it is customary to keep c fixed to c = 0.8 (drovandi2011likelihood; rayner2002numerical). We use the summaries suggested in drovandi2011likelihood, where the subscript w indexes either the observed or the simulated dataset:

 s_{A,w} = P_{50,w}
 s_{B,w} = P_{75,w} − P_{25,w}
 s_{g,w} = (P_{75,w} + P_{25,w} − 2·s_{A,w}) / s_{B,w}
 s_{k,w} = (P_{87.5,w} − P_{62.5,w} + P_{37.5,w} − P_{12.5,w}) / s_{B,w}

where P_{q,w} is the q-th empirical percentile of dataset w. That is, s_{A,w} and s_{B,w} are the median and the inter-quartile range of dataset w respectively. We use the simulation strategy outlined above to generate observed data consisting of independent samples from the g-and-k distribution at the ground-truth parameter values. We place uniform priors on the four parameters.

We now proceed to run Algorithm 2, starting at parameter values set relatively far from the ground truth. We consider three sets of starting parameter values:

• set 1: ;

• set 2: ;

• set 3:

where set 1 should be considered the most difficult starting scenario, while set 3 is the easiest of the three. For the moment we verify the ability of the adaptive proposal ASL to produce a chain that rapidly converges to the bulk of the posterior, rather than performing accurate posterior inference. For this reason, for each set of starting parameter values, we run a total of only 500 MCMC iterations of the BSL algorithm, where the first 200 iterations constitute the burnin. Five hundred iterations are enough to show that ASL is indeed effective for this case study. We call this strategy BSL-ASL. We use a fixed number of model simulations at each iteration, and during the burnin we advance the chain by proposing parameters using a Gaussian random walk acting on log-scale, i.e. on the log-parameters, with a fixed diagonal covariance matrix having deliberately small standard deviations on the log-scale. Given the short burnin, in the first 200 iterations we implement a Markov-chain-within-Metropolis strategy (MCWM, andrieu2009pseudo) to increase the mixing of the algorithm before our adaptive strategy starts; MCWM is not used after the burnin. At iteration 201, our ASL Algorithm 2 is ready to start, and the proposal kernel is then updated at every iteration. We use an inflation factor to "inflate" the covariance of the independence sampler and induce larger steps. The three MCMC attempts at different starting parameters are in Figure 1. All attempts manage to approach the ground-truth parameter values. However, the most interesting detail is given by the traces corresponding to set 1, the ones starting furthest away from the ground truth. For some of the parameters we notice that during the burnin the chains are still quite far from the ground truth, not surprisingly so given that we deliberately chose small standard deviations for the random walk proposal. However, as soon as ASL kicks in (iteration 201), we notice a large jump towards the true parameters. This happens more slowly for some of the chains; however, all four chains reach the bulk of the posterior within a fairly small number of iterations.
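To illustrate the mechanics described above, here is a generic sketch of an adaptive Gaussian independence sampler whose proposal mean and covariance are refit from the chain history after the burnin, with an inflation factor on the covariance. It targets a cheap toy log-density rather than a synthetic likelihood, and it does not reproduce the exact updates of Algorithm 2; names and settings are illustrative:

```python
import numpy as np

def adaptive_indep_sampler(logpost, x0, n_iter, burnin, inflate=1.5, rng=None):
    """Metropolis-Hastings with a Gaussian independence proposal whose
    mean/covariance are re-estimated from the chain history at every
    iteration after the burnin (random walk with fixed scale during burnin)."""
    rng = np.random.default_rng() if rng is None else rng
    d = len(x0)
    chain = np.empty((n_iter, d))
    chain[0] = x0
    lp = logpost(x0)
    for t in range(1, n_iter):
        if t <= burnin:
            prop = chain[t - 1] + 0.5 * rng.standard_normal(d)  # random walk
            log_q_ratio = 0.0                                   # symmetric kernel
        else:
            mu = chain[:t].mean(axis=0)
            cov = inflate * np.cov(chain[:t].T) + 1e-9 * np.eye(d)
            prop = rng.multivariate_normal(mu, cov)
            # independence sampler: the proposal density does not cancel
            inv = np.linalg.inv(cov)
            lq = lambda x: -0.5 * (x - mu) @ inv @ (x - mu)
            log_q_ratio = lq(chain[t - 1]) - lq(prop)
        lp_prop = logpost(prop)
        if np.log(rng.uniform()) < lp_prop - lp + log_q_ratio:
            chain[t], lp = prop, lp_prop
        else:
            chain[t] = chain[t - 1]
    return chain
```

Once the fitted proposal roughly matches the target, the independence sampler can produce nearly uncorrelated draws, which is the mechanism behind the large post-burnin jumps discussed above.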

The above is not enough to show whether the chains are correctly exploring the target. Therefore we now report the results of a longer simulation, where we append the chains produced by a standard adaptive MCMC strategy to the previous results. That is, after iteration 500, we employ the adaptive algorithm of haario2001adaptive for a further 1,500 iterations, initialized at the last draw produced by BSL-ASL and using a Gaussian random walk with covariance matrix initialized at the covariance matrix returned by BSL-ASL at iteration 500. The covariance matrix is then updated every 30 iterations following haario2001adaptive. We call "BSL-Haario" the execution of BSL when using haario2001adaptive (and excluding the initial tuning provided by ASL). In Figure 2 we plot the traces corresponding to set 1 only, removing the burnin iterations for ease of display. As we can see, in this case the exploration of the posterior when using BSL-ASL is comparable with BSL-Haario. As a further experiment, we now consider the effective sample size (ESS) for both BSL-ASL and BSL-Haario when using starting values from set 1. For both methods we report the minimal ESS, that is the ESS corresponding to the "worst chain" among the four parameters to infer: the two algorithms are only as good as their worst mixing chain. To obtain accurate results, this time we run both BSL-ASL and BSL-Haario for 5,200 iterations. We report the runtimes for the 5,200 iterations in Table 1; however, posterior inference and ESS are computed on the last 4,000 draws (i.e. we consider a burnin of 1,200 for both methods, for fairness of comparison). Table 1 shows that the two methods return very similar inferences. However, BSL-ASL tends to underestimate the posterior variance for some of the parameters. This is probably due to the use of the independence sampler that characterizes ASL, implying moves that are not large enough. Recall also that the inflation factor is not tuned.
However, rather surprisingly, the ESS for BSL-ASL is double the ESS for BSL-Haario. The runtime for BSL-ASL is slightly larger than for BSL-Haario; however, we did not put much effort into optimizing calculations. For example, we did not use on-line formulae for updating (5)-(6). Hence, while at the moment the relative ESS is essentially the same between the two methods (last column of the table), it is possible to produce improvements.
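As an aside, such on-line formulae amount to standard Welford-style running updates of a mean vector and covariance matrix; the sketch below shows a generic recursion of this kind, not the paper's exact updates (5)-(6):

```python
import numpy as np

class RunningMoments:
    """Welford-style online mean and covariance of a stream of vectors,
    avoiding a full recomputation at every MCMC iteration."""
    def __init__(self, d):
        self.n = 0
        self.mean = np.zeros(d)
        self.M2 = np.zeros((d, d))   # running sum of outer products of deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.M2 += np.outer(delta, x - self.mean)

    def cov(self):
        return self.M2 / (self.n - 1)   # unbiased estimate, requires n >= 2
```

Each `update` costs O(d²) regardless of how many draws have been processed, so the proposal covariance can be refreshed at every iteration at negligible cost.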

#### 6.1.1 Using correlated synthetic likelihood without ASL

Here we consider the correlated synthetic likelihood (CSL) approach outlined in Section 4, without the use of our ASL approach for proposing parameters, to better appreciate the individual effect of using correlated likelihoods. In our experiments, CSL is essentially BSL with an embedded blocking strategy. Notice that (12) immediately suggests how to implement CSL, since the u appearing in (12) can be thought of as a scalar realization of the variates in Section 4. We rerun the experiments with g-and-k data and compare CSL with the standard BSL of price2018bayesian. For CSL we always use the same number of blocks, which implies a fixed theoretical correlation between successive estimated synthetic loglikelihoods. For both CSL and BSL we always propose parameters using the adaptive MCMC from haario2001adaptive, where the covariance matrix for the proposal function is updated every 30 iterations. Notice that the results produced here via BSL will be different from those previously discussed and reported in Table 1 as "BSL-Haario", as in that case BSL was initiated at the already "tuned" chain provided by a preliminary run of BSL-ASL; in this section BSL is not "pre-tuned". For our comparisons, we consider the previously introduced starting parameter values from sets 1 and 2. For all experiments we run 2,000 MCMC iterations.
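The blockwise recycling just described can be sketched as follows; function names, block counts and parameter values are hypothetical, and the g-and-k simulator is written to accept its standard-normal draws explicitly, as (12) allows:

```python
import numpy as np

def gk_quantile(theta, u, c=0.8):
    """g-and-k draws from (12) for explicitly supplied standard-normal draws u."""
    A, B, g, k = theta
    return A + B * (1 + c * (1 - np.exp(-g * u)) / (1 + np.exp(-g * u))) \
             * (1 + u**2) ** k * u

rng = np.random.default_rng(0)
n_blocks, block_size = 50, 40           # illustrative sizes
U = rng.standard_normal((n_blocks, block_size))

def refresh_one_block(U, rng):
    """Correlated-SL update: redraw a single randomly chosen block of
    pseudo-random numbers, keeping the remaining blocks fixed, so that
    successive loglikelihood estimates are positively correlated."""
    U_new = U.copy()
    j = rng.integers(U.shape[0])
    U_new[j] = rng.standard_normal(U.shape[1])
    return U_new

theta = (3.0, 1.0, 2.0, 0.5)            # illustrative parameter value
y_current = gk_quantile(theta, U)       # datasets for the current iteration
U_prop = refresh_one_block(U, rng)      # recycled draws for the proposal
y_prop = gk_quantile(theta, U_prop)     # differs from y_current in one block only
```

Because only one block of draws is refreshed, most of the Monte Carlo noise cancels in the acceptance ratio, which is what eases convergence from remote starting values.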

A first experiment initializes BSL at the parameters in set 1. BSL is essentially unable to move away from the starting values (results not reported). With the same setup we then run CSL, and the results in Figure 3 clearly show the benefits of the correlated approach, since the chain moves towards the high density region. The main conclusion is that, for relatively remote starting parameter values (such as set 1), BSL requires a larger number of model simulations per iteration in order to safely approach convergence: in fact we have verified (results not reported) that with moderately larger values the BSL chains still get stuck, while a considerably larger number of simulations finally produces convergent chains (albeit with a low acceptance rate).

We now move to set 2, with the same setup. Results are found in the Supplementary Material as Figure 13 and Figure 14 for BSL and CSL respectively, showing that BSL finally converges. In fact, BSL and CSL are in this case essentially equivalent: for BSL the minimum ESS on the last 1,000 iterations is minESS = 13 and for CSL we have minESS = 16. Runtime for the entire 2,000 iterations was 152 seconds for BSL and 133 seconds for CSL (these timings were averaged over multiple attempts), showing that recycling pseudo-random draws can produce some time savings, even though this is not a goal of the CSL methodology. This implies minESS/runtime = 0.086 (BSL) and minESS/runtime = 0.12 (CSL).

We now study the influence of the number of blocks on algorithmic efficiency, as measured in terms of ESS. We consider several block counts (among them 5, 10, 20, 50 and 100) when CSL is initialized at the ground-truth parameters and haario2001adaptive is used to propose parameters. We run each experiment for 1,500 iterations, see Table 2. What we deduce is that, provided the chain is initialized in regions of high posterior probability, using a correlated synthetic likelihood does not seem to bring any benefit (nor disadvantage), as the performance seems unaffected by the specific number of blocks. However, using correlated likelihoods has been shown to be important when the starting value is far from the bulk of the posterior mass. Therefore, it seems wise to always adopt a correlated synthetic likelihood approach.

#### 6.1.2 Initialization using ELFI and BOLFI

Here we show results from the BOLFI optimizer discussed in Section 5, obtained using the ELFI framework. BOLFI uses a Gaussian process (GP) to learn the possibly complex and nonlinear relationship between discrepancies (or log-discrepancies) and the corresponding parameters. In order to obtain training pairs, BOLFI generates parameters independently simulated from the prior, and corresponding summaries are then generated from the model simulator. Notice that no synthetic likelihood is involved here; the relevant density is instead the unknown one underlying the true distribution of the summaries. That is, an artificial dataset is first generated from the model simulator, and the corresponding summaries are then obtained.

We found that for this specific example, where we originally set very wide and vague priors, we could not infer the parameters using BOLFI regardless of the size of the training sample. This is because, while in previous inference attempts we used MCMC methods to explore the posterior and having very vague priors was still feasible, here having initial samples provided by very uninformative priors is not manageable. In this section we therefore use uniform priors with narrower supports. These priors are narrower than in previous attempts but are still wide and uninformative enough to make this experiment interesting and challenging.

Once the training samples are obtained, BOLFI starts optimizing by iteratively fitting a GP and proposing new points, each attempting to reduce the expected discrepancy. We consider both a smaller and a larger number of initial training samples, see Table 3. Notice, however, that BOLFI is a stochastic algorithm, hence different runs will return slightly different results. The clouds of points in Figure 4 represent all values of the log-discrepancies (for both training sizes) and the corresponding parameter values. It is evident that the smallest discrepancies cluster around the ground-truth parameters. The values of the optimized discrepancies are in Table 3. Even with a very small training sample the obtained results appear very promising. Also, even though the estimates for one of the parameters seem to be bounded by the lower limit we set for its prior, we can clearly notice a trend, in that smaller values of that parameter return smaller discrepancies. As mentioned in Section 5, we are not considering using BOLFI to report final inference results, but it can be an effective tool to initialize our synthetic likelihood MCMC. The time required to obtain the optimum in the smaller setup was 255 seconds using an Intel Core i7-7700 CPU with 3.60 GHz and 32 GB RAM. For comparison, the corresponding time in the larger setup was 407 seconds. These times show that BOLFI is best suited for expensive simulators, rather than the simple g-and-k case study.

Table 3 arranges results according to whether the discrepancy was computed by giving the same weight to all summaries (weighted=no) or whether the summaries instead received different weights (weighted=yes). The way weights are obtained is described in the Supplementary Material. The only times we happened to obtain a positive estimate of one of the parameters were in two instances using weighted summaries, see Table 3. The weighting of summary statistics is only performed when using BOLFI, not when using the SL approach (in SL, summaries are naturally weighted via the estimated covariance matrix).

A characteristic of Bayesian optimization, which we can clearly notice in Figure 4, is the tendency to over-explore the boundaries of the parameter space. This problem has recently been addressed in siivola2017correcting, but a solution has not been implemented in ELFI yet.

### 6.2 Supernova cosmological parameters estimation

We present an astronomical example taken from jennings2017astroabc. There, the "adaptive ABC" algorithm by BeaumontEtAl2009 was used for likelihood-free inference. The algorithm in BeaumontEtAl2009 is a sequential Monte Carlo (SMC) sampler, hereafter denoted ABC-SMC, which propagates many parameter values ("particles") through a sequence of approximations of the posterior distribution of the parameters. Our goal is to show how synthetic likelihoods may be used as well to tackle this inferential problem, and a comparison with jennings2017astroabc is presented. In jennings2017astroabc the analysis relied on the SNANA light curve analysis package (kessler2009snana) and its corresponding implementation of the SALT-II light curve fitter presented in guy2010supernova. A sample of supernovae over a given redshift range is simulated and then binned into redshift bins. However, for this example, we did not use SNANA and data is instead simulated following the procedure in Section 6.2.1. The model describing the distance modulus μ_i as a function of redshift z_i, known in the astronomical literature as the Friedmann–Robertson–Walker model (condon2018lambdacdm), is:

 μ_i(z_i; Ω_m, Ω_Λ, Ω_k, w_0, h_0) ∝ 5·log10( (c(1 + z_i)/h_0) ∫_0^{z_i} dz′/E(z′) ), (13)

where E(z) = H(z)/H_0 denotes the dimensionless Hubble rate.

The cosmological parameters involved in (13) are five. The first three are the matter density of the universe, Ω_m, the dark energy density of the universe, Ω_Λ, and the density of radiation and relic neutrinos, Ω_k. A constraint is involved when dealing with these three parameters, namely Ω_m + Ω_Λ + Ω_k = 1 (genovese2009inference; tripathi2017dark; usmani2008dark). The final two parameters are, respectively, the present value of the dark energy equation of state, w_0, and the Hubble constant, h_0. A common assumption involves a flat universe, leading to Ω_k = 0, as shown in tripathi2017dark; usmani2008dark. As a result, (13) simplifies and in particular E(z) can be written in terms of Ω_m alone, since Ω_Λ = 1 − Ω_m. As in jennings2017astroabc, we work under the flat universe assumption. Concerning the dark energy equation of state (EoS), w(z), we use the parametrization proposed in chevallier2001accelerating and in linder2003exploring:

 w(z) = w_0 + w_a(1 − a) = w_0 + w_a·z/(1 + z). (14)

According to (14), w is assumed linear in the scale factor a = 1/(1 + z). Another common assumption is that w is constant; in this case w_a = 0. We note that several parametrizations have been proposed for the EoS (see for example huterer2001probing, wetterich2004phenomenological and usmani2008dark). For the present example, the ground-truth parameters are set as follows:
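Under the flat-universe assumption and the parametrization (14), a standard form for the dimensionless Hubble rate is E(z) = [Ω_m(1+z)^3 + (1−Ω_m)(1+z)^{3(1+w_0+w_a)} exp(−3 w_a z/(1+z))]^{1/2}. The sketch below uses this form, which is an assumption on our part since the paper's explicit expression for E(z) is not reproduced here, to evaluate (13) up to its additive constant by simple quadrature:

```python
import numpy as np

def E(z, omega_m, w0, wa):
    """Dimensionless Hubble rate under a flat universe with CPL dark energy."""
    de = (1 - omega_m) * (1 + z) ** (3 * (1 + w0 + wa)) \
         * np.exp(-3 * wa * z / (1 + z))
    return np.sqrt(omega_m * (1 + z) ** 3 + de)

def distance_modulus(z, omega_m, w0, wa, h0=0.7, c=2.998e5, n_grid=2000):
    """Distance modulus mu(z) of (13), up to an additive constant,
    with the comoving integral computed by the trapezoidal rule."""
    zp = np.linspace(0.0, z, n_grid)
    f = 1.0 / E(zp, omega_m, w0, wa)
    dz = zp[1] - zp[0]
    integral = dz * (f.sum() - 0.5 * (f[0] + f[-1]))
    return 5 * np.log10(c * (1 + z) / h0 * integral)

# illustrative evaluation at a few redshifts for a LambdaCDM-like setup
mus = [distance_modulus(z, 0.3, -1.0, 0.0) for z in (0.1, 0.5, 1.0)]
```

As a sanity check, setting w_0 = −1 and w_a = 0 recovers the cosmological-constant case, where the dark-energy term in E(z) reduces to the constant 1 − Ω_m.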