We consider the problem of Bayesian inference from cosmological data, in the common scenario where we can generate synthetic data through forward simulations, but where the exact likelihood function is intractable. The generative process can be extremely general: it may be a noisy non-linear dynamical system involving an unrestricted number of latent variables. Likelihood-free inference methods, also known as approximate Bayesian computation (ABC, see Marin et al., 2012; Lintusaari et al., 2017a, for reviews) replace likelihood calculations with data model evaluations. In recent years, they have emerged as a viable alternative to likelihood-based techniques, when the simulator is sufficiently cheap. Applications in cosmology include measuring cosmological parameters from type Ia supernovae (Weyant, Schafer & Wood-Vasey, 2013) and weak lensing peak counts (Lin & Kilbinger, 2015), analysing the galaxy halo connection (Hahn et al., 2017), inferring the photometric and size evolution of galaxies (Carassou et al., 2017), measuring cosmological redshift distributions (Kacprzak et al., 2018)
, estimating the ionising background from the Lyman-and Lyman- forests (Davies et al., 2018).
In its simplest form, ABC takes the form of likelihood-free rejection sampling and involves forward simulating data from parameters drawn from the prior, then accepting parameters when the discrepancy (by some measure) between simulated data and observed data is smaller than a user-specified threshold . Such an approach tends to be extremely expensive since many simulated data sets get rejected, due to the lack of knowledge about the relation between the model parameters and the corresponding discrepancy. Variants of likelihood-free rejection sampling such as Population (or Sequential) Monte Carlo ABC (pmc-abc or smc-abc, see Akeret et al., 2015; Ishida et al., 2015; Jennings & Madigan, 2017, for implementations aimed at astrophysical applications) improve upon this scheme by making the proposal adaptive; however, they do not use a probabilistic model for the relation between parameters and discrepancies (also known as a surrogate surface), so that their practical use usually necessitates evaluations of the simulator.
In this paper, we address the challenging problem where the number of simulations is extremely limited, e.g. to a few thousand, rendering the use of sampling-based ABC methods impossible. To this end, we use Bayesian optimisation for likelihood-free inference (bolfi, Gutmann & Corander, 2016), an algorithm which combines probabilistic modelling of the discrepancy with optimisation to facilitate likelihood-free inference. Since it was introduced, bolfi has been applied to various statistical problems in science, including inference of the Ricker model (Gutmann & Corander, 2016), the Lotka-Volterra predator-prey model and population genetic models (Järvenpää et al., 2018), pathogen spread models (Lintusaari et al., 2017a), atomistic structure models in materials (Todorović et al., 2017), and cognitive models in human-computer interaction (Kangasrääsiö et al., 2017). This work aims at introducing bolfi in cosmological data analysis and at presenting its first cosmological application. We focus on computable parametric approximations to the true likelihood (also known as synthetic likelihoods), rendering the approach completely -free. Recently, Järvenpää et al. (2017)
introduced an acquisition function for Bayesian optimisation (the expected integrated variance), specifically tailored to perform efficient and accurate ABC. We extend their work by deriving the expression of the expected integrated variance in the parametric approach. This acquisition function measures the expected uncertainty in the estimate of thebolfi posterior density, which is due to the limited number of simulations, over the future evaluation of the simulation model. The next simulation location is proposed so that this expected uncertainty is minimised. As a result, high-fidelity posterior inferences can be obtained with orders of magnitude fewer simulations than with likelihood-free rejection sampling. As examples, we demonstrate the use of bolfi on the problems of summarising Gaussian signals and inferring cosmological parameters from the Joint Lightcurve Analysis (JLA) supernovae data set (Betoule et al., 2014).
The structure of this paper is as follows. In section II, we provide a review of the formalism for the inference of simulator-based statistical models. In section III, we describe bolfi and discuss the regression and optimisation strategies. In particular, we provide the optimal acquisition rule for ABC in the parametric approach to likelihood approximation. Applications are given in section IV. The developed method is discussed in section V in the context of cosmological data analysis. Section VI concludes the paper. Mathematical details and descriptions of the case studies are presented in the appendices.
Ii Inference of simulator-based statistical models
ii.1 Simulator-based statistical models
Simulator-based statistical models (also known as generative models) can be written in a hierarchical form (figure 1), where are the parameters of interest, and d the simulated data.
is the prior probability distribution ofand is the sampling distribution of d given .
The simplest case (figure 1
, left) is when the simulator is a deterministic function of its input and does not use any random variable, i.e.
where is a Dirac delta distribution and a deterministic function of .
In a more generic scenario (figure 1, right), the simulator is stochastic, in the sense that the data are drawn from an overall (but often unknown analytically) probability distribution function (pdf) . Equation (1) does not hold in this case. The scatter between different realisations of d given the same can have various origins. In the simplest case, it only reflects the intrinsic uncertainty, which is of interest. More generically, additional nuisance parameters can be at play to produce the data d and will contribute to the uncertainty. This “latent space” can often be hundred-to-multi-million dimensional. Simulator-based cosmological models are typically of this kind: although the physical and observational processes simulated are repeatable features about which inferences can be made, the particular realisation of Fourier phases of the data is entirely noise-driven. Ideally, phase-dependent quantities should not contribute to any measure of match or mismatch between model and data.
ii.2 The exact Bayesian problem
The inference problem is to evaluate the probability of given d,
for the observed data , i.e.
where the exact likelihood for the problem is defined as
It is generally of unknown analytical form. The normalisation constant is , where is the marginal distribution of d.
ii.3 Approximate Bayesian computation
Inference of simulator-based statistical models is usually based on a finite set of simulated data , generated with parameter value , and on a measurement of the discrepancy between simulated data and observed data . This discrepancy is used to define an approximation to the exact likelihood . The approximation happens on multiple levels.
On a physical and statistical level, the approximation consists of compressing the full data to a set of summary statistics before performing inference. Similarly, simulated data are compressed to simulated summary statistics . This can be seen as adding a layer to the Bayesian hierarchical model (figure 2). The purpose of this operation is to filter out the information in d that is not deemed relevant to the inference of , so as to reduce the dimensionality of the problem. Ideally, should be sufficient for parameters , i.e. formally or equivalently , which happens when the compression is lossless. However, sufficient summary statistics are generally unknown or even impossible to design; therefore the compression from d to will usually be lossy. The approximate inference problem to be solved is now for the observed summary statistics , i.e.
In other words, is replaced by
and by . Inference of model 2 gives
with, after marginalisation over d,
Therefore, the approximate likelihood must satisfy
In many cases, the compression from d to is deterministic, i.e.
On a practical level, is still of unknown analytical form (which is a property of inherited from in model 2). Therefore, it has to be approximated using the simulator. We denote by an estimate of computed using realisations of the simulator. The limiting approximation, in the case where infinite computer resources were available, is denoted by , such that
Note that can be different from , depending on the assumptions made to construct . These are discussed in section II.4.
ii.4 Computable approximations of the likelihood
ii.4.1 Deterministic simulators
The simplest possible case is when the simulator does not use any random variable, i.e. is an entirely deterministic function of (see figure 1, left). Equivalently, all the conditional probabilities appearing in equation (7) reduce to Dirac delta distributions given by equations (1) and (10). In this case, one can directly use the approximate likelihood given by equation (6), complemented by an assumption on the functional shape of .
ii.4.2 Parametric approximations and the synthetic likelihood
When the simulator is not deterministic, the pdf is unknown analytically. Nonetheless, in some situations, it may be reasonably assumed to follow specific parametric forms.
For example, if is obtained through averaging a sufficient number of independent and identically distributed variables contained in dwith
where the mean and covariance matrix,
can depend on . This is an approximation of , unless the summary statistics are indeed Gaussian-distributed. and are generally unknown, but can be estimated using the simulator: given a set of simulations , drawn independently from , one can define
where stands for the empirical average over the set of simulations. A computable approximation of the likelihood is therefore , where
Due to the approximation of the expectation with an empirical average , both and become random objects. The approximation of the likelihood is therefore a random function with some intrinsic uncertainty itself, and its computation is a stochastic process. This is further discussed using a simple example in section IV.1.
The approximation given in equation (15), known as the synthetic likelihood (Wood, 2010; Price et al., 2017), has already been applied successfully to perform approximate inference in several scientific fields. However, as pointed out by Sellentin & Heavens (2016), for inference from Gaussian-distributed summaries with an estimated covariance matrix , a different parametric form, namely a multivariate -distribution, should rather be used. The investigation of a synthetic -likelihood is left to future investigations.
ii.4.3 Non-parametric approximations and likelihood-free rejection sampling
An alternative to assuming a parametric form for
is to replace it by a kernel density estimate of the distribution of a discrepancy between simulated and observed summary statistics, i.e.
where is a non-negative function of and (usually of ) which can also possibly depend on and any variable used internally by the simulator, and the kernel is a non-negative, univariate function independent of (usually with a maximum at zero). A computable approximation of the likelihood is then given by
For likelihood-free inference, is often chosen as the uniform kernel on the interval , i.e. , where is called the threshold and the indicator function equals one if and zero otherwise. This yields
where is the empirical probability that the discrepancy is below the threshold. can be straightforwardly evaluated by running simulations, computing and using as a criterion for acceptance or rejection of proposed samples. Such an approach is often simply (or mistakenly) referred to as approximate Bayesian computation (ABC) in the astrophysics literature, although the more appropriate and explicit denomination is likelihood-free rejection sampling (see e.g. Marin et al., 2012).
It is interesting to note that the parametric approximate likelihood approach of section II.4.2 can be embedded into the non-parametric approach. Indeed, can be defined as
for some positive semidefinite matrix . The second term is the square of the Mahalanobis distance, which includes the Euclidean distance as a special case, when
is the identity matrix. Using an exponential kerneland gives and with
are respectively the expectation and the empirical average of the discrepancy , for .
Iii Regression and Optimisation for likelihood-free inference
iii.1 Computational difficulties with likelihood-free rejection sampling
We have seen in section II.4 that computable approximations of the likelihood are stochastic processes, due to the use of simulations to approximate intractable expectations. In the most popular ABC approach, i.e. likelihood-free rejection sampling (see section II.4.3), the expectations are approximated by empirical probabilities that the discrepancy is below the threshold . While this approach allows inference of simulator-based statistical models with minimal assumptions, it suffers from several limitations that can make its use impossible in practice.
It rejects most of the proposed samples when is small, leading to a computationally inefficient algorithm.
It does not make assumptions about the shape or smoothness of the target function , hence accepted samples cannot “share” information in parameter space.
It uses a fixed proposal distribution (typically the prior ) and does not make use of already accepted samples to update the proposal of new points.
It aims at equal accuracy for all regions in parameter space, regardless of the values of the likelihood.
To overcome these issues, the proposed approach follows closely Gutmann & Corander (2016), who combine regression of the discrepancy (addressing issues 1 and 2) with Bayesian optimisation (addressing issues 3 and 4) in order to improve the computational efficiency of inference of simulator-based models. In this work, we focus on parametric approximations of the likelihood; we refer to Gutmann & Corander (2016) for a treatment of the non-parametric approach.
iii.2 Regression of the discrepancy
The standard approach to obtain a computable approximate likelihood relies on empirical averages (equations (14) and (24)). However, such sample averages are not the only way to approximate intractable expectations. Equations (21) and (23) show that, up to constants and the sign, can be interpreted as a regression function with the model parameters (the “predictors”) as the independent input variables and the discrepancy
as the response variable. Therefore, in the present approach, we consider an approximation of the intractable expectation definingin equation (23
) based on a regression analysis of, instead of sample averages. Explicitly, we consider
where the superscript stands for “training” and the expectation is taken under the probabilistic model defined in the following.
Inferring via regression requires a training data set where the discrepancies are computed from the simulated summary statistics . Building this training set requires to run simulations, but does not involve an accept/reject criterion as does likelihood-free rejection sampling (thus addressing issue 1, see section III.1). A regression-based approach also allows incorporating a smoothness assumption about . In this way, samples of the training set can “share” the information of the computed in the neighbourhood of (thus addressing issue 2). This suggests that fewer simulated data are needed to reach a certain level of accuracy when learning the target function .
In this work, we rely on Gaussian process (GP) regression in order to construct a prediction for . There are several reasons why this choice is advantageous for likelihood-free inference. First, GPs are a general-purpose regressor, able to deal with a large variety of functional shapes for , including potentially complex non-linear, or multi-modal features. Second, GPs provide not only a prediction (the mean of the regressed function), but also the uncertainty of the regression. This is useful for actively constructing the training data via Bayesian optimisation, as we show in section III.5. Finally, GPs allow extrapolating the prediction into regions of the parameter space where no training points are available. These three properties are shown in figure 3 for a multi-modal test function subject to observation noise.
We now briefly review Gaussian process regression. Suppose that we have a set of training points, , of the function that we want to regress. We assume that is a Gaussian process with prior mean function and covariance function also known as the kernel (see Rasmussen & Williams, 2006). The joint probability distribution of the training set is therefore , where the exponent is
The mean function and the kernel
define the functional shape and smoothness allowed for the prediction. Standard choices are respectively a constant and a squared exponential (the radial basis function, RBF), subject to additive Gaussian observation noise with variance. Explicitly, and
The and are the components of and , respectively. In the last term, is one if and only if
and zero otherwise. The hyperparameters are, the (the length scales controlling the amount of correlation between points, and hence the allowed wiggliness of ), (the signal variance, i.e. the marginal variance of at a point if the observation noise was zero), and (the observation noise). For the results of this paper, GP hyperparameters were learned from the training set using L-BFGS (Byrd et al., 1995), a popular optimiser for machine learning, and updated every time the training set was augmented with ten samples.
The predicted value at a new point can be obtained from the fact that form jointly a random realisation of the Gaussian process . Thus, the target pdf can be obtained from conditioning the joint pdf to the values of the training set f. The result is (see Rasmussen & Williams, 2006, section 2.7)
where we use the definitions
iii.3 Bayesian optimisation
The second major ingredient of the proposed approach is Bayesian optimisation, which allows the inference of the regression function while avoiding unnecessary computations. It allows active construction of the training data set , updating the proposal of new points using the regressed (thus addressing issue 3 with likelihood-free rejection sampling, see section III.1). Further, since we are mostly interested in the regions of the parameter space where the variance of the approximate posterior is large (due to its stochasticity), the acquisition rules can prioritise these regions, so as to obtain a better approximation of there (thus addressing issue 4).
Bayesian optimisation is a decision-making framework under uncertainty, for the automatic learning of unknown functions. It aims at gathering training data in such a manner as to evaluate the regression model the least number of times while revealing as much information as possible about the target function and, in particular, the location of the optimum or optima. The method proceeds by iteratively picking predictors to be probed (i.e. simulations to be run) in a manner that trades off exploration (parameters for which the outcome is most uncertain) and exploitation (parameters which are expected to have a good outcome for the targeted application). In many contexts, Bayesian optimisation has been shown to obtain better results with fewer simulations than grid search or random search, due to its ability to reason about the interest of simulations before they are run (see Brochu, Cora & de Freitas, 2010, for a review). Figure 4 illustrates Bayesian optimisation in combination with Gaussian process regression, applied to finding the minimum of the test function of figure 3.
In the following, we give a brief overview of the elements of Bayesian optimisation used in this paper. In order to add a new point to the training data set , Bayesian optimisation uses an acquisition function that estimates how useful the evaluation of the simulator at
will be in order to learn the target function. The acquisition function is constructed from the posterior predictive distribution ofgiven the training set , i.e. from the mean prediction and the uncertainty of the regression analysis (equations (29) and (30)). The optimum of the acquisition function in parameter space determines the next point to be evaluated by the simulator ( or depending on how the acquisition function is defined), so that the training set can be augmented with . The acquisition function is a scalar function whose evaluation should be reasonably expensive, so that its optimum can be found by simple search methods such as gradient descent.
The algorithm needs to be initialised with an initial training set. In numerical experiments, we found that building this initial set by drawing from the prior (as would typically be done in likelihood-free rejection sampling) can result in difficulties with the first iterations of Gaussian process regression. Uniformly-distributed points within the boundaries of the GP are also a poor choice, as they will result in an uneven initial sampling of the parameter space. To circumvent this issue, we build the initial training set using a low-discrepancy quasi-random Sobol sequence(Sobol, 1967), which covers the parameter space more evenly.
iii.4 Expressions for the approximate posterior
In the parametric approach to likelihood approximation, this is equivalent to an approximation of (see equation (21)). The expectation of the (unnormalised) approximate posterior is therefore directly given as (see equation (5))
The estimate of the variance of can also be propagated to the approximate posterior, giving
Details of the computations can be found in appendix A.1.
Expressions for the bolfi posterior in the non-parametric approach with the uniform kernel can also be derived (Järvenpää et al., 2017, lemma 3.1). As this paper focuses on the parametric approach, we refer to the literature for the former case.
iii.5 Acquisition rules
iii.5.1 Expected improvement
Standard Bayesian optimisation uses acquisition functions that estimate how useful the next evaluation of the simulator will be in order to find the minimum or minima of the target function. While several other choices are possible (see e.g. Brochu, Cora & de Freitas, 2010), in this work we discuss the acquisition function known as expected improvement (EI). The improvement is defined by , and the expected improvement is , where the expectation is taken with respect to the random observation assuming decision . For a Gaussian process regressor, this evaluates to (see Brochu, Cora & de Freitas, 2010, section 2.3)
or if , where and
denote respectively the pdf and the cumulative distribution function (cdf) of the unit-variance zero-mean Gaussian. The decision rule is to select the locationthat maximises .
The EI criterion can be interpreted as follows: since the goal is to find the minimum of , a reward equal to the improvement is received if is smaller than all the values observed so far, otherwise no reward is received. The first term appearing in equation (38) is maximised when evaluating at points with high uncertainty (exploration); and, at fixed variance, the second term is maximised by evaluating at points with low mean (exploitation). The expected improvement therefore automatically captures the exploration-exploitation trade-off as a result of the Bayesian decision-theoretic treatment.
iii.5.2 Expected integrated variance
As pointed out by Järvenpää et al. (2017), in Bayesian optimisation for approximate Bayesian computation, the goal should not be to find the minimum of , but rather to minimise the expected uncertainty in the estimate of the approximate posterior over the future evaluation of the simulator at . Consequently, they propose an acquisition function, known as the expected integrated variance (ExpIntVar or EIV in the following) that selects the next evaluation location to minimise the expected variance of the future posterior density over the parameter space. The framework used is Bayesian decision theory. Formally, the loss due to our uncertain knowledge of the approximate posterior density can be defined as
and the acquisition rule is to select the location that minimises
Järvenpää et al. (2017, proposition 3.2) derive the expressions for the expected integrated variance for a GP model in the non-parametric approach. In appendix A, we extend this work and derive the ExpIntVar acquisition function and its gradient in the parametric approach. The result is the following: under the GP model, the expected integrated variance after running the simulation model with parameter is given by
where is the GP posterior predicted covariance between the evaluation point in the integral and the candidate location for the next evaluation . Note that in addition to the notations given by equations (31)–(34
), we have introduced the vector
It is of interest to examine when the integrand in equation (41) is small. As for the EI (equation (38)), optimal values are found when the mean of the discrepancy is small or the variance is large. This effect is what yields the trade-off between exploitation and exploration for the ExpIntVar acquisition rule. However, unlike in standard Bayesian optimisation strategies such as the EI, the trade-off is a non-local process (due to the integration over the parameter space), and also depends on the prior, so as to minimise the uncertainty in the posterior (and not likelihood) approximation.
Computing the expected integrated variance requires integration over the parameter space. In this work, the integration is performed on a regular grid of points per dimension within the GP boundaries. In high dimension, the integral can become prohibitively expensive to compute on a grid. As discussed by Järvenpää et al. (2017), it can then be evaluated with Monte Carlo or quasi-Monte Carlo methods such as importance sampling.
In numerical experiments, we have found that the ExpIntVar criterion (as any acquisition function for Bayesian optimisation) has some sensitivity to the initial training set. In particular, the initial set (built from a Sobol sequence or otherwise) shall sample sufficiently well the GP domain, which shall encompass the prior. This ensures that the prior volume is never wider than the training data. Under this condition, as Järvenpää et al. (2017), we have found that ExpIntVar is stable, in the sense that it produces consistent bolfi posteriors over different realisations of the initial training data set and simulator outputs.
iii.5.3 Stochastic versus deterministic acquisition rules
The above rules do not guarantee that the selected is different from a previously acquired . Gutmann & Corander (2016, see in particular appendix C) found that this can result in a poor exploration of the parameter space, and propose to add a stochastic element to the decision rule in order to avoid getting stuck at one point. In some experiments, we followed this prescription by adding an “acquisition noise” of strength to each component of the optimiser of the acquisition function. More precisely, is sampled from the Gaussian distribution , where and D is the diagonal covariance matrix of components . The are chosen to be of order .
For a more extensive discussion and comparison of various stochastic and deterministic acquisition rules, the reader is referred to Järvenpää et al. (2017).
In this section, we show the application of bolfi to several application studies. In particular, we discuss the simulator and the computable approximation of the likelihood to be used, and compare bolfi to likelihood-free rejection sampling in terms of computational efficiency. In all cases, we show that bolfi reduces the amount of required simulations by several orders of magnitude.
In section IV.1, we discuss the toy problem of summarising Gaussian signals (i.e. inferring the unknown mean and/or variance of Gaussian-distributed data). In section IV.2, we show the first application of bolfi to a real cosmological problem using actual observational data: the inference of cosmological parameters from supernovae data. For each test case, we refer to the corresponding section in the appendices for the details of the data model and inference assumptions.
iv.1 Summarising Gaussian signals
A simple toy model can be constructed from the general problem of summarising Gaussian signals with unknown mean, or with unknown mean and variance. This example allows for the comparison of bolfi and likelihood-free rejection sampling to the true posterior conditional on the full data, which is known analytically. All the details of this model are given in appendix B.
iv.1.1 Unknown mean, known variance
We first consider the problem, already discussed by Gutmann & Corander (2016), where the data d are a vector of components drawn from a Gaussian with unknown mean and known variance . The empirical mean is a sufficient summary statistic for the problem of inferring . The distribution of simulated takes a simple form, . Using here the true variance, the discrepancy and synthetic likelihood are
where is an average of realisations of . In figure 5 (lower panel), the black dots show simulations of for different values of . We have , therefore the stochastic process defining the discrepancy can be written
where . Each realisation of gives a different mapping . In figure 5, we show one such realisation in the lower panel, and the corresponding approximate posterior in the upper panel. Using the percent point function (inverse of the cdf) of the Gaussian , we also show in red the mean and credible interval of the true stochastic process.
The GP regression using the simulations shown as the training set is represented in blue in the lower panel of figure 5. The corresponding bolfi posterior and its variance, defined by equations (36) and (37), are shown in purple in the upper panel. The uncertainty in the estimate of the posterior (shaded purple region) is due to the limited number of available simulations (and not to the noisiness of individual training points). It is the expectation of this uncertainty under the next evaluation of the simulator which is minimised in parameter space by the ExpIntVar acquisition rule.
iv.1.2 Unknown mean and variance
We now consider the problem where the full data set d is a vector of components drawn from a Gaussian with unknown mean and unknown variance . The aim is the two-dimensional inference of . Evidently, the true likelihood for this problem is the Gaussian characterised by
. The Gaussian-inverse-Gamma distribution is the conjugate prior for this likelihood. It is described by four parameters. Adopting a Gaussian-inverse-Gamma prior characterised byyields a Gaussian-inverse-Gamma posterior characterised by given by equations (69)–(72). This is the analytic solution to which we compare our approximate results.
For the numerical approach, we forward model the problem using a simulator that draws from the prior, simulates realisations of the Gaussian signal, and compresses them to two summary statistics, the empirical mean and variance, respectively and . The graphical probabilistic model is given in figure B.1. It is a noise-free simulator without latent variables (of the type given by figure 1, right) completed by a deterministic compression of the full data. Note that the vector is a sufficient statistic for the inference of . To perform likelihood-free inference, we also need a computable approximation of the true likelihood. We derive such an approximation in section B.3 using a parametric approach, under the assumptions (exactly verified in this example) that is Gaussian-distributed and is Gamma-distributed. We name it the Gaussian-Gamma synthetic likelihood.
The posterior obtained from likelihood-free rejection sampling is shown in green in figure 6 (left) in comparison to the prior (in blue) and the analytic posterior (in orange). It was obtained from accepted samples using a threshold of on . The entire run required forward simulations in total, the vast majority of which have been rejected. The rejection-sampling posterior is a fair approximation to the true posterior, unbiased but broader, as expected from a rejection-sampling method.
For comparison, the posterior obtained via bolfi is shown in red in figure 6 (right). bolfi was initialised using a Sobol sequence of members to compute the original surrogate surface, and Bayesian optimisation with the ExpIntVar acquisition function and acquisition noise was run to acquire more samples. As can be observed, bolfi allows very precise likelihood-free inference; in particular, the , and contours (the latter corresponding to the least likely events) of the analytic posterior are reconstructed almost perfectly. The overall cost to get these results is only simulations with bolfi versus with rejection sampling (for a poorer approximation of the analytic posterior), which corresponds to a reduction by orders of magnitude.
iv.2 Supernova cosmology
In this section, we present the first application of bolfi to a cosmological inference problem. Specifically, we perform an analysis of the Joint Lightcurve Analysis (JLA) data set, consisting of the B-band peak apparent magnitudes of type Ia supernovae (SN Ia) with redshift between and (Betoule et al., 2014): for . The details of the data model and inference assumptions are given in appendix C. For the purpose of validating bolfi, we assume a Gaussian synthetic likelihood (see section C.4), allowing us to demonstrate the fidelity of the bolfi
posterior against the exact likelihood-based solution obtained via Markov Chain Monte Carlo (MCMC). This analysis can also be compared to the proof of concept for another likelihood-free method,delfi (Density Estimation for Likelihood-Free Inference, Papamakarios & Murray, 2016; Alsing, Wandelt & Feeney, 2018), as the assumptions are very similar.
As described in appendix C, the full problem is six dimensional; however, in this work, we focus on the inference of the two physically relevant quantities, namely (the matter density of the Universe) and (the equation of state of dark energy, assumed constant), and marginalise over the other four (nuisance) parameters (, , , ). We assume a Gaussian prior,
which is roughly aligned with the direction of the well-known degeneracy. We generated samples (out of data model evaluations) of the posterior for the exact six-dimensional Bayesian problem via MCMC (performed using the emcee code, Foreman-Mackey et al., 2013), ensuring sufficient convergence to characterise the contours of the distribution.111The final Gelman-Rubin statistic (Gelman & Rubin, 1992) was for each of the six parameters. The prior and the exact posterior are shown in blue and orange, respectively, in figure 7.
For likelihood-free inference, the simulator takes as input and and simulates realisations of the magnitudes of the 740 supernovae at their redshifts. Consistently with the Gaussian likelihood used in the MCMC analysis, we assume a Gaussian synthetic likelihood with a fixed covariance matrix C. The observed data and the covariance matrix C are shown in figure C.1.
The approximate posterior obtained from likelihood-free rejection sampling is shown in green in figure 7. It was obtained from accepted samples using a (conservative) threshold of on , chosen so that the acceptance ratio was not below . The entire run required simulations in total. The approximate posterior obtained via bolfi is shown in red in figure 7. bolfi was initialised with a Sobol sequence of samples, and acquisitions were performed according to the ExpIntVar criterion, without acquisition noise. The bolfi posterior is a much finer approximation to the true posterior than the one obtained from likelihood-free rejection sampling. It is remarkable that only acquisitions are enough to learn the non-trivial banana shape of the posterior. Only the contour (which is usually not shown in cosmology papers, e.g. Betoule et al., 2014) notably deviates from the MCMC posterior. This is due to the fact that we used one realisation of the stochastic process defining and only realisations per ; the marginalisation over the four nuisance parameters is therefore partial, yielding slightly smaller credible contours. However, a better approximation could be obtained straightfowardly, if desired, by investing more computational resources (increasing ), without requiring more acquisitions.
As we used , the total cost for bolfi is simulations. This is a reduction by orders of magnitude with respect to likelihood-free rejection sampling ( simulations) and orders of magnitude with respect to MCMC sampling of the exact posterior ( simulations). It is also interesting to note that our bolfi analysis required a factor of fewer simulations than the recently introduced delfi procedure (Alsing, Wandelt & Feeney, 2018), which used simulations drawn from the prior for the analysis of the JLA.222A notable difference is that delfi allowed the authors to perform the joint inference of the six parameters of the problem, whereas we only get the distribution of and . However, since these are the only two physically interesting parameters, inference of the nuisance parameters is not deemed crucial for this example.
v.1 Benefits and limitations of the proposed approach for cosmological inferences
As noted in the introduction, likelihood-free rejection sampling, when at all viable, is extremely costly in terms of the number of required simulations. In contrast, the bolfi approach relies on a GP probabilistic model for the discrepancy, and therefore allows the incorporation of a smoothness assumption about the approximate likelihood . The smoothness assumption allows simulations in the training set to “share” information about their value of in the neighbourhood of , which suggests that fewer simulations are needed to reach a certain level of accuracy. Indeed, the number of simulations required is typically reduced by to orders of magnitude, for a better final approximation of the posterior, as demonstrated by our tests in section IV and in the statistical literature (see Gutmann & Corander, 2016).
A second benefit of bolfi is that it actively acquires training data through Bayesian optimisation. The trade-off between computational cost and statistical performance is still present, but in a modified form: the trade-off parameter is the size of the training set used in the regression. Within the training set, the user is free to choose which areas of the parameter space should be prioritised, so as to approximate the regression function more accurately there. In contrast, in ABC strategies that rely on drawing from a fixed proposal distribution (often the prior), or variants such as pmc-abc, a fixed computational cost needs to be paid per value of regardless of the value of .
Finally, by focusing on parametric approximations to the exact likelihood, the approach proposed in this work is totally “-free”, meaning that no threshold (which is often regarded as an unappealing ad hock element) is required. As likelihood-based techniques, the parametric version of bolfi has the drawback that assuming a wrong form for the synthetic likelihood or miscalculating values of its parameters (such as the covariance matrix) can potentially bias the approximate posterior and/or lead to an underestimation of credible regions. Nevertheless, massive data compression procedures can make the assumptions going into the choice of a Gaussian synthetic likelihood (almost) true by construction (see section V.2.4).
Of course, regressing the discrepancy and optimising the acquisition function are not free of computational cost. However, the run-time for realistic cosmological simulation models can be hours or days. In comparison, the computational overhead introduced by bolfi is negligible.
Likelihood-free inference should also be compared to existing likelihood-based techniques for cosmology such as Gibbs sampling or Hamiltonian Monte Carlo (e.g. Wandelt, Larson & Lakshminarayanan, 2004; Eriksen et al., 2004 for the cosmic microwave background; Jasche et al., 2010; Jasche & Lavaux, 2015; Jasche, Leclercq & Wandelt, 2015 for galaxy clustering; Alsing et al., 2016 for weak lensing). The principal difference between these techniques and bolfi lies in its likelihood-free nature. Likelihood-free inference has particular appeal for cosmological data analysis, since encoding complex physical phenomena and realistic observational effects into forward simulations is much easier than designing an approximate likelihood which incorporates these effects and solving the inverse problem. While the numerical complexity of likelihood-based techniques typically requires to approximate complex data models in order to access required products (conditionals or gradients of the pdfs) and to allow for sufficiently fast execution speeds, bolfi performs inference from full-scale black-box data models. In the future, such an approach is expected to allow previously infeasible analyses, relying on a much more precise modelling of cosmological data, including in particular the complicated systematics they experience. However, while the physics and instruments will be more accurately modelled, the statistical approximation introduced with respect to likelihood-based techniques should be kept in mind.
Other key aspects of bolfi for cosmological data analysis are the arbitrary choice of the statistical summaries and the easy joint treatment of different data sets. Indeed, as the data compression from d to is included in the simulator (see section II.3), summary statistics do not need to be quantities that can be physically modelled (such as the power spectrum) and can be chosen robustly to model misspecification. For example, for the microwave sky, the summaries could be the cross-spectra between different frequency maps; and for imaging surveys, the cross-correlation between different bands. Furthermore, joint analyses of correlated data sets, which is usually challenging in likelihood-based approaches (as they require a good model for the joint likelihood) can be performed straightforwardly in a likelihood-free approach.
Importantly, as a general inference technique, bolfi can be embedded into larger probabilistic schemes such as Gibbs or Hamiltonian-within-Gibbs samplers. Indeed, as posterior predictive distributions for conditionals and gradients of GPs are analytically tractable, it is easy to obtain samples of the bolfi approximate posterior for use in larger models. bolfi can therefore allow parts of a larger Bayesian hierarchical model to be treated as black boxes, without compromising the tractability of the entire model.
v.2 Possible extensions
v.2.1 High-dimensional inference
In this proof-of-concept paper, we focused on two-dimensional problems. Likelihood-free inference is in general very difficult when the dimensionality of the parameter space is large, due to the curse of dimensionality, which makes the volume exponentially larger with. In bolfi, this difficulty manifests itself in the form of a hard regression problem which needs to be solved. The areas in the parameter space where the discrepancy is small tend to be narrow in high dimension, therefore discovering these areas becomes more challenging as the dimension increases. The optimisation of GP kernel parameters, which control the shapes of allowed features, also becomes more difficult. Furthermore, finding the global optimum of the acquisition function becomes more demanding (especially with the ones designed for ABC such as ExpIntVar, which have a high degree of structure – see figure C.3, bottom right panel).
Nevertheless, Järvenpää et al. (2017) showed on a toy simulation model (a Gaussian) that up to ten-dimensional inference is possible with bolfi. As usual cosmological models do not include more than ten free physical parameters, we do not expect this limitation to be a hindrance. Any additional nuisance parameter or latent variable used internally by the simulator (such as , , , in supernova cosmology, see section IV.2) can be automatically marginalised over, by using realisations per . Recent advances in high-dimensional implementation of the synthetic likelihood (Ong et al., 2017) and high-dimensional Bayesian optimisation (e.g. Wang et al., 2013; Kandasamy, Schneider & Póczos, 2015) could also be exploited. In future work, we will address the problem of high-dimensional likelihood-free inference in a cosmological context.
v.2.2 Scalability with the number of acquisitions and probabilistic model for the discrepancy
In addition to the fundamental issues with high-dimensional likelihood-free inference described in the previous section, practical difficulties can be met.
Gaussian process regression requires the inversion of a matrix K of size , where is the size of the training set. The complexity is , which limits the size of the training set to a few thousand. Improving GPs with respect to this inversion is still subject to research (see Rasmussen & Williams, 2006, chapter 8). For example, “sparse” Gaussian process regression reduces the complexity by introducing auxiliary “inducing variables”. Techniques inspired by the solution to the Wiener filtering problem in cosmology, such as preconditioned conjugate gradient or messenger field algorithms could also be used (Elsner & Wandelt, 2013; Kodi Ramanah, Lavaux & Wandelt, 2017; Papez, Grigori & Stompor, 2018). Another strategy would be to divide the regression problem spatially into several patches with a lower number of training points (Park & Apley, 2017). Such approaches are possible extensions of the presented method.
In the GP probabilistic model employed to model the discrepancy, the variance depends only on the training locations, not on the obtained values (see equation (30
)). Furthermore, a stationary kernel is assumed. However, depending on the simulator, the discrepancy can show heteroscedasticity (i.e. its variance can depend on– see e.g. figure 5, bottom panel). Such cases could be handled by non-stationary GP kernels or different probabilistic models for the discrepancy, allowing a heteroscedastic regression.
v.2.3 Acquisition rules
As shown in our examples, attention should be given to the selection of an efficient acquisition rule. Although standard Bayesian optimisation strategies such as the EI are reasonably effective, they are usually too greedy, focusing nearly all the sampling effort near the estimated minimum of the discrepancy and gathering too little information about other regions in the domain (see figure C.3, bottom left panel). This implies that, unless the acquisition noise is high, the tails of the posterior will not be as well approximated as the modal areas. In contrast, the ExpIntVar acquisition rule, derived in this work for the parametric approach, addresses the inefficient use of resources in likelihood-free rejection sampling by directly targeting the regions of the parameter space where improvement in the estimation accuracy of the approximate posterior is needed most. In our experiments, ExpIntVar seems to correct – at least partially – for the well-known effect in Bayesian optimisation of overexploration of the domain boundaries, which becomes more problematic in high dimension.
Acquisition strategies examined so far in the literature (see Järvenpää et al., 2017, for a comparative study) have focused on single acquisitions and are all “myopic”, in the sense that they reason only about the expected utility of the next acquisition, and the number of simulations left in a limited budget is not taken into account. Improvement of acquisition rules enabling batch acquisitions and non-myopic reasoning are left to future extensions of bolfi.
v.2.4 Data compression
In addition to the problem of the curse of dimensionality in parameter space, discussed in section V.2.1, likelihood-free inference usually suffers from difficulties in the measuring the (mis)match between simulations and observations if the data space also has high dimension. As discussed in section II.3, simulator-based models include a data compression step. The comparison in data space can be made more easily if is reduced. In future work, we will therefore aim at combining bolfi with massive and (close to) optimal data compression strategies. These include moped (Heavens, Jimenez & Lahav, 2000), the score function (Alsing & Wandelt, 2018)
, or information-maximising neural networks(Charnock, Lavaux & Wandelt, 2018). Using such efficient data compression techniques, the number of simulations required for inference with bolfi will be reduced even more, and the number of parameters treated could be increased.
Parametric approximations to the exact likelihood depend on quantities that have to be estimated using the simulator (typically for the Gaussian synthetic likelihood, the inverse covariance matrix of the summaries). Unlike supernova cosmology where the covariance matrix is easily obtained, in many cases it is prohibitively expensive to run enough simulations to estimate the required quantities, especially when they vary with the model parameters. In this context, massive data compression offers a way forward, reducing enormously the number of required simulations and making the analysis feasible when otherwise it might be essentially impossible (Heavens et al., 2017; Gualdi et al., 2018).
An additional advantage of several data compression strategies is that they support the choice of a Gaussian synthetic likelihood. Indeed, the central limit theorem (for moped
) or the form of the network’s reward function (for information-maximising neural networks) assist in giving the compressed data a near-Gaussian distribution. Furthermore, testing the Gaussian assumption for the synthetic likelihood will be far easier in a smaller number of dimensions than in the original high-dimensional data space.
v.3 Parallelisation and computational efficiency
While MCMC sampling has to be done sequentially, bolfi lends itself to more parallelisation. In an efficient strategy, a master process performs the regression and decides on acquisition locations, then dispatches simulations to be run by different workers. In this way, many simulations can be run simultaneously in parallel, or even on different machines. This allows fast application of the method and makes it particularly suitable for grid computing. Extensions of the probabilistic model and of the acquisition rules, discussed in section V.2.2 and V.2.3, would open the possibility of doing asynchronous acquisitions. Different workers would then work completely independently and decide on their acquisitions locally, while just sharing a pool of simulations to update their beliefs given all the evidence available.
While the construction of the training set depends on the observed data (through the acquisition function), simulations can nevertheless be reused as long as summaries are saved. This means that if one acquires new data , the existing (or a subset of them) can be used to compute the new discrepancy . Building an initial training set in this fashion can massively speed up the inference of , whereas likelihood-based techniques would require a new MCMC.
v.4 Comparison to previous work
As discussed in the introduction, likelihood-free rejection sampling is not a viable strategy for various problems that bolfi can tackle. In recent work, an other algorithm for scalable likelihood-free inference in cosmology (delfi, Papamakarios & Murray, 2016; Alsing, Wandelt & Feeney, 2018) was introduced. The approach relies on estimating the joint probability via density estimation. This idea also relates to the work of Hahn et al. (2018), who fit the sampling distribution of summaries
using Gaussian mixture density estimation or independent component analysis, before using it for parameter estimation. This section discusses the principal similarities and differences.
The main difference between bolfi and delfi is the data acquisition. Training data are actively acquired in bolfi, contrary to delfi which, in the simplest scheme, draws from the prior. The reduction in the number of simulations for the inference of cosmological parameters (see section IV.2) can be interpreted as the effect of the Bayesian optimisation procedure in combination with the ExpIntVar acquisition function. Using a purposefully constructed surrogate surface instead of a fixed proposal distribution, bolfi focuses the simulation effort to reveal as much information as possible about the target posterior. In particular, its ability to reason about the quality of simulations before they are run is an essential element. Acquisition via Bayesian optimisation almost certainly remains more efficient than even the pmc version of delfi, which learns a better proposal distribution but still chooses parameters randomly. In future cosmological applications with simulators that are expensive and/or have a large latent space, an active data acquisition procedure could be crucial in order to provide a good model for the noisy approximate likelihood in the interesting regions of parameter space, and to reduce the computational cost. This comes at the expense of a reduction of the parallelisation potential: with a fixed proposal distribution (like in delfi and unlike in bolfi), the entire set of simulations can be run at the same time.
The second comment is related to the dimensionality of problems which can be addressed. Like delfi, bolfi relies on a probabilistic model to make ABC more efficient. However, the quantities employed differ, since in delfi the relation between the parameters and the summary statistics is modelled (via density estimation), while bolfi focuses on the relation between the parameters and the discrepancy (via regression). Summary statistics are multi-dimensional while the discrepancy is a univariate scalar quantity. Thus, delfi requires to solve a density estimation problem in (which equals if the compression from Alsing & Wandelt, 2018 is used), while bolfi requires to solve a regression problem in . Both tasks are expected to become more difficult as increases (a symptom of the curse of dimensionality, see section V.2.1), but the upper limits on for practical applications may differ. Further investigations are required to compare the respective maximal dimensions of problems that can be addressed by bolfi and delfi.
Finally, as argued by Alsing, Wandelt & Feeney (2018), delfi readily provides an estimate of the approximate evidence. In contrast, as in likelihood-based techniques, integration over parameter space is required with bolfi to get
However, due to the GP model, the integral can be more easily computed, using the same strategies as for the integral appearing in ExpIntVar (see section III.5.2): only the GP predicted values are required at discrete locations on a grid (in low dimension) or at the positions of importance samples. A potential caveat is that delfi has only been demonstrated to work in combination with the score function (Alsing & Wandelt, 2018), which is necessary to reduce the dimensionality of before estimating the density.333In contrast, section IV.2 showed, for the same supernovae problem, that bolfi can still operate if the comparison is done in the full -dimensional data space. The score function produces summaries that are only sufficient up to linear order in the log-likelihood. However, in ABC, care is required to perform model selection if the summary statistics are insufficient. Indeed, Robert et al. (2011, equation 1)
show that, in such a case, the approximate Bayes factor can be arbitrarily biased and that the approximation error is unrelated to the computational effort invested in running the ABC algorithm. Moreover, sufficiency for modelsand alone, or even for both of them – even if approximately realised via Alsing & Wandelt’s procedure – does not guarantee sufficiency to compare the two different models and (Didelot et al., 2011). As the assumptions behind bolfi do not necessarily necessitate to reduce ( is always a univariate scalar quantity, see above), these difficulties could be alleviated with bolfi by carefully designing sufficient summary statistics for model comparison within the black-box simulator, if they exist.
Likelihood-free inference methods allow Bayesian inference of the parameters of simulator-based statistical models with no reference to the likelihood function. This is of particular interest for data analysis in cosmology, where complex physical and observational processes can usually be simulated forward but not handled in the inverse problem.
In this paper, we considered the demanding problem of performing Bayesian inference when simulating data from the model is extremely costly. We have seen that likelihood-free rejection sampling suffers from a vanishingly small acceptance rate when the threshold goes to zero, leading to the need for a prohibitively large number of simulations. This high cost is largely due to the lack of knowledge about the functional relation between the model parameters and the discrepancy. As a response, we have described a new approach to likelihood-free inference, bolfi, that uses regression to infer this relation, and optimisation to actively build the training data set. A crucial ingredient is the acquisition function derived in this work, with which training data are acquired such that the expected uncertainty in the final estimate of the posterior is minimised.
In case studies, we have shown that bolfi is able to precisely recover the true posterior, even far in its tails, with as few as simulations, in contrast to likelihood-free rejection sampling or likelihood-based MCMC techniques which require orders of magnitude more simulations. The reduction in the number of required simulations accelerated the inference massively.
This study opens up a wide range of possible extensions, discussed in section V.2. It also allows for novel analyses of cosmological data from fully non-linear simulator-based models, as required e.g. for the cosmic web (see the discussions in Leclercq, Jasche & Wandelt, 2015; Leclercq et al., 2016, 2017). Other applications may include the cosmic microwave background, weak gravitational lensing or intensity mapping experiments. We therefore anticipate that bolfi will be a major ingredient in principled, simulator-based inference for the coming era of massive cosmological data.
Appendix A Derivations of the mathematical results
a.1 Expressions for the approximate posterior
If we knew the target function , the bolfi posterior would be given as
However, due to the limited computational resources we only have a finite training set , which implies that there is uncertainty in the values of , and therefore that the approximate posterior is itself a stochastic process. To get its expectation under the model, the log-likelihood is replaced by its expectation under the model, i.e. (up to constants, see equations (21) and (35)), giving equation (36).
Similarly, if the function was known, the variance of the approximate posterior could be computed by standard propagation of uncertainties,
The argument of the exponential is ; it should be replaced by its expectation under the model, . The variance of under the model is, by definition, . The result for