1 Simulation-based inference
Statistical inference is performed within the context of a statistical model, and in simulation-based inference the simulator itself defines the statistical model. For the purpose of this paper, a simulator is a computer program that takes as input a vector of parameters, samples a series of internal states or latent variables , and finally produces a data vector as output. Programs that involve random samplings and are interpreted as statistical models are known as probabilistic programs, and simulators are an example. Within this general formulation, real-life simulators can vary substantially:
The parameters describe the underlying mechanistic model and thus affect the transition probabilities . Typically the mechanistic model is interpretable by a scientist and has relatively few components and a fixed dimensionality. Examples include coefficients found in the Hamiltonian of a physical system, the virulence and incubation rate of a pathogen, or fundamental constants of Nature.
The latent variables that appear in the data-generating process may directly or indirectly correspond to a physically meaningful state of a system, but typically this state is unobservable in practice. The structure of the latent space varies substantially between simulators. The latent variables may be continuous or discrete and the dimensionality of the latent space may be fixed or may vary depending on the control flow of the simulator. The simulation can freely combine deterministic and stochastic steps. The deterministic components of the simulator may be differentiable or may involve discontinuous control flow elements. In practice, some simulators may provide convenient access to the latent variables, while others are effectively black boxes. Any given simulator may combine these different aspects in almost any way.
Finally, the output data correspond to the observations. They can range from a few unstructured numbers to high-dimensional and highly structured data, such as images or geospatial information.
Consider for instance the systems shown in Fig. 1. Particle physics processes often only depend on a small number of parameters of interest such as particle masses or coupling strengths. The latent process combines a high-energy interaction, rigorously described by a quantum field theory, with the passage of the resulting particles through an incredibly complex detector, most accurately modeled with stochastic simulations with billions of latent variables; this second part often does not depend on the parameters of interest. The output data consist, in their raw form, of millions of sensor read-outs, though there is an established pipeline that compresses this raw data to tens to hundreds of observables. Epidemiological simulations can be based on a network structure with geospatial properties, and the latent process consists of many repeated structurally identical stochastic time steps. In contrast, cosmological simulations of the evolution of the Universe may consist of a highly structured stochastic initial state followed by a smooth, deterministic time evolution.
These differences mean that there is no one-size-fits-all inference method. In this review we aim to clarify the considerations needed to choose the most appropriate approach for a given problem.
Scientific inference tasks differ by what is being inferred: given observed data , is the goal to infer the input parameters , or the latent variables , or both? Sometimes only a subset of the parameters (or latent variables) are of interest, while the rest are nuisance parameters (i. e. parameters that we are not directly interested in but must account for because they influence the distributions of the data). We will focus on the common problem of inferring and comment on when methods also allow inference on .
Inference may be performed either in a frequentist or a Bayesian approach and may be limited to point estimates
or extended to include a probabilistic notion of uncertainty. In the frequentist case, confidence sets are formed from inverting hypothesis tests, often based on the likelihood ratio as test statistic. In Bayesian inference, the goal is typically to calculate the posterior
for observed data and a given prior . In both cases the likelihood function is a key ingredient.
The fundamental challenge for simulation-based inference problems is that the likelihood function implicitly defined by the simulator is typically not tractable, as it corresponds to an integral over all possible trajectories through the latent space, i. e. all possible execution traces of the simulator. That is,
where is the joint probability density of data and latent variables . For a simple sequential data generation procedure, the joint likelihood can be written as . For real-life simulators with large latent spaces, it is clearly impossible to compute this integral explicitly. Since the likelihood function is the central ingredient to both frequentist and Bayesian inference, this is a major challenge for inference in many fields. This paper reviews simulation-based or likelihood-free inference techniques that enable frequentist or Bayesian inference despite this intractability.
There is a second, more widely appreciated source of intractability. In the case of Bayesian inference, the evidence—the denominator of (1)—involves an integral over the parameters
. In problems with high-dimensional parameters this becomes intractable, independently of the intractability of the likelihood function. This challenge is commonly addressed with Markov Chain Monte Carlo (MCMC) methods(9, 10) or variational inference (VI) (11).
In practice, an important distinction is that between inference based on a single observation, and that based on multiple independent and identically distributed (i.i.d.) observations. In the second case, the likelihood factorizes into individual likelihood terms for each i.i.d. observation, as . For example, time-series data is typically non-i.i.d. and must be treated as a single high-dimensional observation, whereas the analysis of collision data in the search for the Higgs boson constitutes a data set with many i.i.d. measurements. This distinction is important when it comes to the computational cost of an inference technique, as inference in the i.i.d. case will necessitate many repeated evaluations of the individual likelihood .
1.3 Traditional methods
The problem of inference without tractable likelihoods is not a new one, and two major approaches have been developed to address it. Arguably the most well-known is Approximate Bayesian Computation (ABC) (2, 3). Until recently, it was so established that the terms “likelihood-free inference” and “ABC” were often used interchangeably. In the simplest form of rejection ABC, the parameters are drawn from the prior, the simulator is run with those values to sample , and is retained as posterior sample if the simulated data is sufficiently close to the observed data. In essence, the likelihood is approximated by the probability that the condition is satisfied, where is some distance measure and is a tolerance. The accepted samples then follow an approximate version of the posterior. We show a schematic workflow of this algorithm in Fig. 3a (for a more elaborate Markov Chain Monte Carlo algorithm with a proposal function).
In the limit , inference with ABC becomes exact, but for continuous data the acceptance probability vanishes. In practice, small values of require unfeasibly many simulations. For large , sample efficiency is increased at the expense of inference quality. Similarly, the sample efficiency of ABC
scales poorly to high-dimensional data. Since the data immediately affect the rejection process (and in more advanced ABC algorithms the proposal distribution), inference for new observations requires repeating the entire inference algorithm. ABC is thus best-suited for the case of a single observation or at most a few i.i.d. data points.
The second classical approach to simulation-based inference is based on creating a model for the likelihood by estimating the distribution of simulated data with histograms or kernel density estimation(1). Frequentist and Bayesian inference then proceeds as if the likelihood were tractable. We sketch this algorithm in Fig. 3e (replacing the green learning step with a classical density estimation method). This approach has enough similarities to ABC to be dubbed “Approximate Frequentist Computation” by the authors of Ref. (15). One advantage over ABC is that it is amortized: after the simulation and density estimation stage, new data points can be evaluated efficiently. In Fig. 3e this manifests itself as the blue “data” box only entering at the inference stage and not affecting the expensive simulation step. This property makes density estimation–based inference particularly well-suited for problems with many i.i.d. observations, a key reason for its wide-spread use in particle physics measurements.
Both of the traditional approaches suffer from the curse of dimensionality: the required number of simulations increases exponentially with the dimension of the data. Therefore both approaches rely on low-dimensional summary statistics and the quality of inference is tied to how well those summaries retain information about the parameters . Traditionally, the development of powerful summary statistics has been the task of a domain expert, and the summary statistics have been prescribed prior to inference.
2 Frontiers of simulation-based inference
These traditional simulation-based inference techniques have played a key role in several fields for years. However, they suffer from shortcomings in three crucial aspects:
Sample efficiency: Both ABC and classical density estimation techniques suffer from the curse of dimensionality. The poor scaling means that the number of simulated samples needed to provide a good estimate of the likelihood or posterior can be prohibitively expensive.
Quality of inference: The reduction of the data to low-dimensional summary statistics invariably discards some of the information in the data about , which results in a loss in statistical power. Large values of the parameter in ABC or bandwidth parameter for kernel density estimation lead to poor approximations of the true likelihood. Both reduce the overall quality of inference.
Amortization: Performing inference with ABC for a new set of observed data requires repeating most steps of the inference chain, in particular if the proposal distribution depends on the observed data. The method scales poorly when applied to large numbers of observations. On the other hand, inference based on density estimation is amortized: the computationally expensive steps do not have to repeated for new observations. This is particularly desirable for the case with i.i.d. observations.
In recent years, new capabilities have become available that let us improve all three of these aspects. We loosely group them into three main directions of progress:
Active learning methods can systematically improve sample efficiency, letting us tackle more computationally expensive simulators.
The deep integration of automatic differentiation and probabilistic programming into the simulation code, as well as the augmentation of training data with additional information that can be extracted from the simulator, are changing the way the simulator is treated in inference: it is no longer a black box, but exposed to the inference workflow.
We sketch these trends in Fig. 2, broadly categorizing the landscape of inference tasks in a two-dimensional plane of the dimensionality of the data (vertical axis) and the complexity of the simulator (horizontal axis).
2.1 A revolution in machine learning
Over the last decade, machine learning techniques, in particular deep neural networks, have turned into versatile, powerful, and popular tools for a variety of problems (16)
. Neural networks initially demonstrated breakthroughs in supervised learning tasks such as classification and regression. They can easily be composed to solve higher-level tasks, lending themselves to problems with a hierarchical or compositional structure. Architectures tailored to various data structures have been developed, including dense or fully connected networks aimed at unstructured data, convolutional neural networks that leverage spatial structure for instance in image data, recurrent neural networks for variable-length sequences, and graph neural networks for graph-structured data. Choosing an architecture well suited for a specific data structure is an example ofinductive bias, which more generally refers to the assumptions inherent in a learning algorithm independent of the data. Inductive bias is one of the key ingredients behind most successful applications of deep learning, though it is difficult to characterize its role precisely.
One area where neural networks are being actively developed is density estimation in high dimensions: given a set of points , the goal is to estimate the probability density
. As there are no explicit labels, this is usually considered an unsupervised learning task. We have already discussed that classical methods based for instance on histograms or kernel density estimation do not scale well to high-dimensional data. In this regime, density estimation techniques based on neural networks are becoming more and more popular. One class of these neural density estimation techniques arenormalizing flows (17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32), in which variables described by a simple base distribution such as a multivariate Gaussian are transformed through a parameterized invertible transformation that has a tractable Jacobian. The target density is then given by the change-of-variables formula as a product of the base density and the determinant of the transformation’s Jacobian. Several such steps can be stacked, with the probability density “flowing” through the successive variable transformations. The parameters of the transformations are trained by maximizing the likelihood of the observed data under the model , resulting in a model density that approximates the true, unknown density . In addition to having a tractable density, it is possible to generate data from the model by drawing the hidden variables from the base distribution and applying the flow transformations. Neural density estimators have been generalized to model the dependency on additional inputs, i. e. to model a conditional density such as the likelihood or posterior .
Another class of approaches use autoregressive models, in which the probability distribution of a high-dimensional variable is factorized into successive conditional densities of the individual components(20, 21, 22, 23, 24, 25, 26, 27, 28). These models are expressive, have a tractable (conditional) density, and can be used to generate synthetic data. While autoregressive models are somewhat disfavored in industrial applications because generating samples from them can be slow, the sequential nature is more closely aligned with the way simulators are written and offers an opportunity to align the neural networks latent variables with the semantically meaningful latent variables of simulators.
Generative Adversarial Networks (GANs) are an alternative type of generative model based on neural networks. Unlike in normalizing flows and autoregressive models, the transformation implemented by the generator is not restricted to be invertible. While this allows for more expressiveness, the density defined by the generator is intractable. Since maximum likelihood is not a possible training objective, the generator is pitted against an adversary, whose role is to distinguish the generated data from the target distribution. We will later discuss how the same idea can be used for simulation-based inference, using an idea known as the “likelihood ratio trick”.
2.2 Active learning
A simple, but very impactful idea is to run the simulator at parameter points that are expected to increase our knowledge the most. This can be done iteratively such that after each simulation the knowledge resulting from all previous runs is used to guide which parameter point should be used next. There are multiple technical realizations of this idea of active learning. It is commonly applied in a Bayesian setting, where the posterior can be continuously updated and used to steer the proposal distribution of simulator parameters (33, 34, 35, 36, 37, 38, 39). But it applies equally well to a frequentist case, for instance optimizing confidence sets (40, 41, 42). Even simple implementations can lead to a substantial improvement in sample efficiency.
Similar ideas are discussed in the context of decision making, experimental design, and reinforcement learning, and we expect further improvements in inference algorithms from the cross-pollination between these fields. For instance, a question that is occasionally discussed in the context of reinforcement learning(43) or Bayesian optimization (44), but has not yet been applied to the likelihood-free setting, is how to make use of multi-fidelity simulators offering multiple levels of precision or approximations.
2.3 Integration and augmentation
Both machine learning and active learning can substantially improve quality of inference and sample efficiency compared to classical methods. However, overall they do not change the basic approach to simulation-based inference dramatically: they still treat the simulator as a generative black box that takes parameters as input and provides data as output, with a clear separation between the simulator and the inference engine. A third direction of research is changing this perspective, opening this black box to access more information and integrating inference and simulation more tightly.
One example of this shift is the probabilistic programming paradigm. Gordon et al. (45) describe probabilistic programs as the usual functional or imperative programs with two added constructs: (1) the ability to draw values at random from distributions, and (2) the ability to condition values of variables in a program via observations. We have already described simulators as probabilistic programs focusing on the first construct, which does not require opening the black box. However, conditioning on the observations requires a deeper integration as it involves controlling the randomness in the generative process. Previously this required writing the program in a special purpose language; however, recent work allows these capabilities to be added to existing simulators with minimal changes to their codebase (46). Ultimately, probabilistic programming aims at providing the tools to infer the incredibly complex space of all execution traces of the simulator conditioned on the observation.
A complementary development is the observation that additional information that characterizes the latent data-generating process can be extracted from the simulator and used to augment the data used to train surrogates. This augmented training data can be exploited in supervised learning objectives, and can dramatically increase the sample efficiency for surrogate training. Those developing inference algorithms and those familiar with the details of the simulator should consider whether, in addition to the sole ability to sample , the following properties in which the corresponding quantities in the simulator are well-defined and tractable.
: the probability density of the output data given the latent variables.
: the joint score is the gradient of the joint log probability density of output data and latent variables with respect to the parameters.
: the gradient of the same quantity, but with respect to the latent variables.
: the ratio of the joint probability density of output data and latent variables for two parameter points and .
: the derivative of output data and latent variables with respect to the parameters.
: the gradient of the output data with respect to the latent variables.
Many of the quantities above involve derivatives, which can now be efficiently calculated using automatic differentiation (often referred to simply as autodiff) (50)
. Autodiff is a family of techniques similar to but more general than the backpropagation algorithm that is ubiquitous in deep learning. Automatic differentiation, like probabilistic programming, involves non-standard interpretations of the simulation code and has been developed by a small but established field of computer science. In recent years several have advocated that deep learning would be better described asdifferential programming. With this view, integrating autodiff into existing simulation codes is a more direct way to exploit the advances in deep learning than trying to incorporate domain knowledge into an entirely foreign substrate such as a deep neural network.
Extracting the necessary information from the simulator again requires integration deep in the code. While technologies to incorporate probabilistic programming paradigm into existing code bases are just emerging, the development of tools to enable autodiff in the most commonly used scientific programming languages is well advanced. We highlight that two of the quantities above (II and III) involve both autodiff and probabilistic programming. The integration of inference and simulation as well as the idea of augmenting the training data with additional quantities have the potential to change the way we think about simulation-based inference. In particular, this perspective can influence the way simulation codes are developed in order to provide these new capabilities.
3 Workflows for simulation-based Inference
This wide array of capabilities can be combined in different inference workflows. Some of these are structurally identical to the traditional ABC and density estimation–based methods, while others are fundamentally different. As a guideline through this array of different workflows, let us first discuss common building blocks, and the different approaches that can be taken in each of these components. In Fig. 3 and the following sections we will then piece these blocks together into different inference algorithms.
An integral part of all inference methods is running the simulator, visualized as a yellow pentagon in Fig. 3. The parameters at which the simulator is run are drawn from some proposal distribution, which may or may not depend on the prior in a Bayesian setting, and can either be chosen statically, or iteratively with an active learning method. Next, the potentially high-dimensional output from the simulator may be used directly as input to the inference method, or reduced to low-dimensional summary statistics, which may be prescribed or learned from data.
The inference techniques can be broadly separated into those which, like ABC, use the simulator itself during inference, and methods which construct a surrogate model and use that for inference. In the first case, the output of the simulator is directly compared to data, see the top panels of Fig. 3. In the latter case, the output of the simulator is used as training data for an estimation or machine learning stage, shown as the green boxes in the bottom panels of Fig. 3. The resulting surrogate models, shown as red hexagons, are then used for inference.
The algorithms address the intractability of the true likelihood in different ways: some methods construct a tractable surrogate for the likelihood function, others for the likelihood ratio function, both of which make frequentist inference straightforward. In other methods, the likelihood function never appears explicitly, for instance when it is implicitly replaced by rejection probability (an approach that does not lend itself to frequentist inference).
The final target for Bayesian inference is the posterior. Methods differ in whether they provide access to samples of parameter points sampled from the posterior, for instance from MCMC or ABC, or a tractable function that approximates the posterior function. Similarly, some methods require specifying which quantities are to be inferred early on in the workflow, while others allow this decision to be postponed.
3.1 Using the simulator directly during inference
Let us now discuss how these blocks and computational capabilities can be combined into inference techniques, beginning with those which, like ABC, use the simulator directly during inference. We sketch some of these algorithms in the top panels of Fig. 3.
One of the major shortcomings of ABC
is its reliance on low-dimensional summary statistics. ClassifierABC (51) removes the requirement of compressing the data into summary statistics by instead training a classifier to estimate the discrepancy between observed and simulated data.
A reason for the poor sample efficiency of the original rejection ABC algorithm is that the simulator is run at parameter points drawn from the prior, which may have a large mass in regions that are in strong disagreement with the observed data. Different algorithms have been proposed that instead run the simulator at parameter points that are expected to improve the knowledge on the posterior the most (33, 34, 35, 36, 37). Compared to vanilla ABC, these techniques improve sample efficiency, though they still require the choice of summary statistics, distance measure , and tolerance .
In the case where the final stage of the simulator is tractable or the simulator is differentiable (respectively, properties I and VI from the list in Sec. 2.2.3), asymptotically exact Bayesian inference is possible (47) without relying on a distance tolerance or summary statistics, removing ABC’s main limitations in terms of quality of inference.
The probabilistic programming paradigm presents a more fundamental change to how inference is performed. First, it requires the simulator to be written in a probabilistic programming language, though recent work allows these capabilities to be added to existing simulators with minimal changes to their codebase (46). In addition, probabilistic programming either requires a tractable likelihood for the final step (quantity I) or the introduction of an ABC-like comparison. When these criteria are satisfied, several inference algorithms exist that can draw samples from the posterior of the input parameters and the latent variables given some observed data . These techniques are either based on MCMC, see Fig. 3c, or on training a neural network to provide proposal distributions (52) as shown in Fig. 3d. The key difference to ABC is that the inference engine controls all steps in the program execution and can bias each draw of random latent variables to make the simulation more likely to match the observed data, improving sample efficiency.
A strength of these algorithms is that they allow to infer not only the input parameters into the simulator, but the entire latent process leading to a particular observation. This allows us to answer entirely different questions about scientific processes, adding a particular kind of physical interpretability that methods based on surrogates do not possess. While standard ABC algorithms in principle allow for inference on , probabilistic programming solves this task more efficiently.
3.2 Surrogate models
A key disadvantage of using the simulator directly during inference is the lack of amortization. When new observed data becomes available, the whole inference chain has to be repeated. By training a tractable surrogate or emulator for the simulator, inference is amortized: after a (computationally expensive) upfront simulation and training phase, new data can be evaluated very efficiently. This approach scales particularly well to data consisting of many i.i.d. observations. As discussed in Sec. 11.3, this is not a new idea, and well-established methods use classical density estimation techniques to create a surrogate model for the likelihood function. But the new computational capabilities discussed in Sec. 2.1 have given new momentum to this class of inference techniques. They can be incorporated in various ways, providing approximations of the true parameters given data, defining suitable summary statistics, learning the likelihood, the likelihood ratio, or the posterior; we will briefly go through these different options to organize the inference. Selected algorithms are visualized in the bottom panels of Fig. 3.
Perhaps the most obvious of these approaches is to directly invert the parameter-to-data process implemented by the simulator and to train a model to estimate the true parameters as a function of observed data (53, 54, 55, 56). However, point estimates are not always useful and a probabilistic interpretation in terms of the likelihood or posterior of these methods is not obvious, so will not focus on this approach.
A powerful probabilistic approach is to train neural conditional density estimators such as normalizing flows as a surrogate for the simulator. The conditional density can be defined in two directions: the network can either learn the posterior (57, 58, 38, 59, 60, 61) or the likelihood (39, 62, 63, 64). We show these three techniques in Figs. 3e and 3f; note that the likelihood surrogate algorithm is structurally identical to the classical density estimation–based approach, but uses more powerful density estimation techniques.
Relatedly, neural networks can be trained to learn the likelihood ratio function or , where in the latter case the denominator is given by a marginal model integrated over a proposal or the prior (65, 66, 67, 68, 69, 51, 70, 71). We sketch this approach in Fig. 3g. The key idea is closely related to the discriminator network in GANs mentioned above: a classifier is trained using supervised learning to discriminate two sets of data, though in this case both sets come from the simulator and are generated for different parameter points and . The classifier output function can be converted into an approximation of the likelihood ratio between and ! This relation is often called the likelihood ratio trick.
These three surrogate-based approaches are all amortized: after an upfront simulation and training phase, the surrogates can be evaluated efficiently for arbitrary data and parameter points. They require an upfront specification of the parameters of interest, the network then implicitly marginalizes over all other (latent) variables in the simulator. All three classes of algorithms can employ active learning elements such as an iteratively updated proposal distribution to guide the simulator parameters towards relevant parameter region, improving sample efficiency. Using neural networks eliminates the requirement of low-dimensional summary statistics, leaving it to the employed model to learn the structures in high-dimensional data and potentially improving quality of inference.
Despite these fundamental similarities, there are some differences between emulating the likelihood, the likelihood ratio, and the posterior. Learning the posterior directly provides the main target quantity in Bayesian inference, but induces a prior dependence at every stage of the inference method. Learning the likelihood or the likelihood ratio enables frequentist inference or model comparisons, though for Bayesian inference an additional MCMC or VI step is necessary to generate samples from the posterior. The prior independence of likelihood or likelihood ratio estimators also leads to extra flexibility to change the prior during inference. An advantage of training a generative model to approximate the likelihood or posterior over learning the likelihood ratio function is the added functionality of being able to sample from the surrogate model. On the other hand, learning the likelihood or posterior is an unsupervised learning problem, whereas estimating the likelihood ratio through a classifier is an example of supervised learning and often a simpler task. Since for the higher-level inference goal the likelihood and the likelihood ratio can be used interchangeably, learning a surrogate for the likelihood ratio function may often be more efficient.
Another strategy that allows us to leverage supervised learning is based on extracting additional quantities from the simulator that characterize the likelihood of the latent process (e. g. II and IV from the list in Sec. 2.2.3). This additional information can be used to augment the training data for surrogate models. The resulting supervised learning task can often be solved more efficiently, ultimately improving the sample efficiency in the inference task (48, 72, 15, 49).
Surrogate-based approaches benefit from imposing suitable inductive bias for a given problem. It is widely acknowledged that the network architecture of a neural surrogate should be chosen according to the data structure (e. g. images, sequences, or graphs). Another, potentially more consequential, way of imposing inductive bias is to have the surrogate model reflect the causal structure of the simulator. Manually identifying the relevant structures and designing appropriate surrogate architectures is very domain-specific, though has been shown to improve the performance on some problems (73, 74, 75). Recently attempts are being made to automate the process of creating surrogates that mimic the simulation (76)
. Looking further ahead, one would like to learn surrogates that reflect the causal structure of a coarse grained system. If this is possible, it would allow the surrogate to model only the relevant degrees of freedom for the phenomena that emerge from the underlying mechanistic model.
3.3 Preprocessing and postprocessing
There are a number of additional steps that can surround these core inference algorithms, either in the form of preprocessing steps that precede the main inference stage, or as an afterburner following the main inference step.
One preprocessing step is to learn powerful summaries . Because of the curse of dimensionality, both ABC and classical density estimation–based inference methods require a compression of the data into low-dimensional summary statistics. They are usually prescribed, i. e. hand-chosen by domain scientists based on their intuition and knowledge of the problem at hand, but the resulting summaries will generally lose some information compared to the original data. A minimally invasive extension of these algorithms is to first learn summary statistics that have certain optimality properties, before running a standard inference algorithm such as ABC. We sketch this approach in Fig. 3b for ABC, but it applies equally to inference based on density estimation.
The score , the gradient of the log (marginal) likelihood with respect to the parameters of interest, defines such a vector of optimal summary statistics: in a neighborhood of , the score components are sufficient statistics, and they can be used for inference without loss of information. Just like the likelihood, the score itself is generally intractable, but it can be estimated based on quantity V and an exponential family approximation (77, 78). If quantity II is available, augmented data extracted from the simulator can instead be used to train a neural network to estimate the score (48) without requiring such an approximation. Learned summaries can also be made robust with respect to nuisance parameters (79, 80).
Inference compilation (52) is a preprocessing step for probabilistic programming algorithms, shown in Fig. 3d. Initial runs of the simulator are used to train a neural network to propose values of both the parameters and the latent variables.
After the completion of the core inference workflow, an important question is whether the results are reliable: can the outcome be trusted in the presence of imperfections such as limited sample size, insufficient network capacity, or inefficient optimization?
One solution is to calibrate the inference results. Using the ability of the simulator to generate data for any parameter point, we can use a parametric bootstrap approach to calculate the distribution of any quantity involved in the inference workflow. These distributions can be used to calibrate the inference procedure to provide confidence sets and posteriors with proper coverage and credibility (67, 15). While possible in principle, such procedures may require a large number of simulations.
Other diagnostic tools that can be applied at the end of the inference stage involve training classifiers to distinguish data from the surrogate model and the true simulator (67), checking known expectation values of estimators of the likelihood, likelihood ratio, or score (15); varying reference parameters that should leave the inference result invariant (67); ensemble methods; and comparing distributions of network output against known asymptotic properties (81, 82, 83). Passing these sanity checks does not guarantee that an estimator is correct, but failing them is an indication of problems.
None of these diagnostics address the issues encountered if the model is misspecified and the simulator is not an accurate description of the system being studied. Model misspecification is a problem that plagues inference with both prescribed and implicit models equally. Usually this is addressed by expanding the model to have more flexibility and introducing additional nuisance parameters.
The considerations needed to choose which of the approaches described above is best for a given problem will include the inference goals, the dimensionality of model parameters, the latent variables, and the data; whether good summary statistics are available; the internal structure of the simulator; the computational cost of the simulator; the level of control over how the simulator is run; and on whether the simulator is a black box or whether any of the quantities discussed in Sec. 2.2.3 can be extracted from it. Nevertheless, we believe that the existing body of research lets us provide a few general guidelines.
First, if any of the quantities discussed in Sec. 2 are available, they should be leveraged. Powerful algorithms are available for the case with differentiable simulators (47), for simulators for which the joint likelihood of data and latent variables is accessible (48), and for simulators explicitly written as a probabilistic model in a probabilistic programming framework (84). Probabilistic programming is also the most versatile approach when the goal is not only inference on the parameters , but also the latent variables .
If powerful low-dimensional summary statistics are established, traditional techniques can still offer a reasonable performance. In most cases, however, we recommend trying methods based on training a neural network surrogate for the likelihood (39, 63) or the likelihood ratio (67, 51, 71). If generating synthetic data from the surrogate is not important, learning the likelihood ratio rather than the likelihood allows us to leverage powerful supervised learning methods.
Finally, active learning techniques can improve the sample efficiency for all inference techniques. There is a tradeoff between active learning, which tailors the efficiency to a particular observed data set, and amortization, which benefits from surrogates that are agnostic about the observed data. A good compromise here will depend on the number of observations and the sharpness of the posterior compared to the prior.
Until recently, scientists confronted with inverse problems and a complex simulator as a forward model had little recourse than to choose ABC
or methods based on classical density estimation techniques. While these approaches have served some domains of science quite well, they have relied heavily on experts providing powerful summary statistics. As a result, the techniques are labor intensive and do not lend themselves well to high-dimensional data where powerful summary statistics are not obvious. While not explicit, there is a frontier where these traditional methods are no longer useful, beyond which scientists must resort to other heuristics not framed in a statistical statement tied to the underlying mechanistic model.
The term likelihood-free inference has served as a point of convergence for what were previously disparate communities, and a new lingua franca has emerged. This has catalyzed significant cross-pollination and led to a renaissance in simulation-based inference. The advent of powerful machine learning methods is enabling practitioners to work directly with high dimensional data and to reduce the reliance on expert-crafted summary statistics. New programming paradigms such as probabilistic programming and differentiable programming provide new capabilities that enable entirely new approaches to simulation-based inference. Finally, taking a more systems-level view of simulation-based inference that brings together the statistical and computational considerations has taken root. Here active learning is leading the way, but we expect more advances like this as simulation-based inference matures.
The rapidly advancing frontier means that several domains of science should expect either a significant improvement in inference quality or the transition from heuristic approaches to those grounded in statistical terms tied to the underlying mechanistic model. It is not unreasonable to expect that this transition may have a profound impact on science.
KC and JB are supported by the National Science Foundation under the awards ACI-1450310, OAC-1836650, and OAC-1841471 and by the Moore-Sloan data science environment at NYU.
- (1) Diggle PJ, Gratton RJ (1984) Monte Carlo Methods of Inference for Implicit Statistical Models.
- (2) Rubin DB (1984) Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician. The Annals of Statistics 12(4):1151–1172.
- (3) Beaumont MA, Zhang W, Balding DJ (2002) Approximate Bayesian computation in population genetics. Genetics 162(4):2025–2035.
- (4) CMS collaboration (2015) CMS collisions at 13 TeV (http://cds.cern.ch/record/2021490).
- (5) Lee WCA, et al. (2005) Dynamic remodeling of dendritic arbors in gabaergic interneurons of adult visual cortex. PLOS Biology 4(2).
- (6) Research I (2019) Public Health Research - The Spatiotemporal Epidemiological Modeler (STEM) (https://researcher.watson.ibm.com/researcher/view_group.php?id=883).
- (7) NASA/ESA (2011) Based on observations made with the nasa/esa hubble space telescope (https://apod.nasa.gov/apod/image/1112/lensshoe_hubble_3235.jpg).
- (8) ESA, the Planck Collaboration (2018) Planck image gallery (https://www.cosmos.esa.int/web/planck/picture-gallery).
- (9) Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953) Equation of state calculations by fast computing machines. The journal of chemical physics 21(6):1087–1092.
- (10) Hastings WK (1970) Monte carlo sampling methods using Markov chains and their applications. Biometrika 57(1):97–109.
- (11) Anderson JR, Peterson C (1987) A mean field theory learning algorithm for neural networks. Complex Systems 1:995–1019.
- (12) Sisson SA, Fan Y, Beaumont M (2018) Handbook of approximate Bayesian computation. (Chapman and Hall/CRC).
- (13) Marjoram P, Molitor J, Plagnol V, Tavaré S (2003) Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences of the United States of America 100(26):15324–15328.
- (14) Sisson SA, Fan Y, Tanaka MM (2007) Sequential Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences of the United States of America 104(6):1760–1765.
- (15) Brehmer J, Cranmer K, Louppe G, Pavez J (2018) A Guide to Constraining Effective Field Theories with Machine Learning. Physical Review D 98(5):052004.
- (16) LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444.
- (17) Dinh L, Krueger D, Bengio Y (2014) NICE: Non-linear Independent Components Estimation. arXiv:1410.8516.
- (18) Dinh L, Sohl-Dickstein J, Bengio S (2017) Density estimation using real NVP in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
- (19) Kingma DP, Dhariwal P (2018) Glow: Generative flow with invertible 1×1 convolutions. Advances in Neural Information Processing Systems 2018-Decem:10215–10224.
Germain M, Gregor K, Murray I, Larochelle H (2015) MADE: Masked autoencoder for distribution estimation.32nd International Conference on Machine Learning, ICML 2015 2:881–889.
- (21) Uria B, Côté MA, Gregor K, Murray I, Larochelle H (2016) Neural autoregressive distribution estimation. J. Mach. Learn. Res. 17(1):7184–7220.
- (22) Van Den Oord A, Kalchbrenner N, Kavukcuoglu K (2016) Pixel recurrent neural networks. 33rd International Conference on Machine Learning, ICML 2016 4:2611–2620.
- (23) Van Den Oord A, et al. (2016) Conditional image generation with PixelCNN decoders. Advances in Neural Information Processing Systems pp. 4797–4805.
- (24) van den Oord A, et al. (2016) WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499.
- (25) Kingma DP, et al. (2016) Improving Variational Inference with Inverse Autoregressive Flow. arXiv:1606.04934.
- (26) Papamakarios G, Pavlakou T, Murray I (2017) Masked autoregressive flow for density estimation. Advances in Neural Information Processing Systems 2017-Decem:2339–2348.
- (27) Huang CW, Krueger D, Lacoste A, Courville A (2018) Neural autoregressive flows. 35th International Conference on Machine Learning, ICML 2018 5:3309–3324.
- (28) Wehenkel A, Louppe G (2019) Unconstrained Monotonic Neural Networks. arXiv:1908.05164.
- (29) Durkan C, Bekasov A, Murray I, Papamakarios G (2019) Cubic-Spline Flows. arXiv:1906.02145.
- (30) Durkan C, Bekasov A, Murray I, Papamakarios G (2019) Neural Spline Flows. arXiv:1906.04032.
Hjortsø MA, Wolenski P (2018) Neural Ordinary Differential Equations.Linear Mathematical Models in Chemical Engineering abs/1806.0:123–145.
- (32) Grathwohl W, Chen RTQ, Bettencourt J, Duvenaud D (2019) Scalable reversible generative models with free-form continuous dynamics in International Conference on Learning Representations.
Meeds E, Welling M (2914) Gps-abc: Gaussian process surrogate approximate
bayesian computation in
Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI’14. (AUAI Press, Arlington, Virginia, United States), pp. 593–602.
- (34) Gutmann MU, Corander J (2016) Bayesian optimization for likelihood-free inference of simulator-based statistical models. J. Mach. Learn. Res. 17(1):4256–4302.
- (35) Meeds E, Welling M (2015) Optimization monte carlo: Efficient and embarrassingly parallel likelihood-free inference in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15. (MIT Press, Cambridge, MA, USA), pp. 2080–2088.
- (36) Järvenpää M, Gutmann MU, Pleska A, Vehtari A, Marttinen P (2017) Efficient acquisition rules for model-based approximate Bayesian computation. arXiv:1704.00520.
- (37) Wang H, Li J (2017) Adaptive Gaussian process approximation for Bayesian inference with expensive likelihood functions. arXiv:1703.09930.
- (38) Lueckmann JM, et al. (2017) Flexible statistical inference for mechanistic models of neural dynamics. Advances in Neural Information Processing Systems 2017-Decem:1290–1300.
- (39) Papamakarios G, Sterratt DC, Murray I (2019) Sequential neural likelihood: Fast likelihood-free inference with autoregressive flows. International Conference on Artificial Intelligence and Statistics.
- (40) Ranjan P, Bingham D, Michailidis G (2008) Sequential experiment design for contour estimation from complex computer codes. Technometrics 50(4):527–541.
- (41) Bect J, Ginsbourger D, Li L, Picheny V, Vazquez E (2012) Sequential design of computer experiments for the estimation of a probability of failure. Statistics and Computing 22(3):773.
- (42) Heinrich L, Louppe G, Cranmer K (2018) excursion (doi:10.5281/zenodo.1634428).
- (43) Cutler M, Walsh TJ, How JP (2014) Reinforcement learning with multi-fidelity simulators in 2014 IEEE International Conference on Robotics and Automation (ICRA). pp. 3888–3895.
- (44) Kandasamy K, Dasarathy G, Schneider J, Poczos B (2017) Multi-fidelity bayesian optimisation with continuous approximations in Proceedings of the 34th International Conference on Machine Learning-Volume 70. (JMLR. org), pp. 1799–1808.
- (45) Gordon AD, Henzinger TA, Nori AV, Rajamani SK (2014) Probabilistic programming in Future of Software Engineering. ACM.
- (46) Baydin AG, et al. (2019) Etalumis: Bringing Probabilistic Programming to Scientific Simulators at Scale. arXiv:1907.03382.
- (47) Graham MM, Storkey AJ (2017) Asymptotically exact inference in differentiable generative models. Electronic Journal of Statistics 11(2):5105–5164.
- (48) Brehmer J, Louppe G, Pavez J, Cranmer K (2018) Mining gold from implicit models to improve likelihood-free inference. arXiv:1805.12244.
- (49) Stoye M, Brehmer J, Louppe G, Pavez J, Cranmer K (2018) Likelihood-free inference with an improved cross-entropy estimator. arXiv:1808.00973.
- (50) Baydin AG, Pearlmutter BA, Radul AA, Siskind JM (2018) Automatic differentiation in machine learning: a survey. Journal of machine learning research 18(153).
- (51) Gutmann MU, Dutta R, Kaski S, Corander J (2018) Likelihood-free inference via classification. Statistics and Computing 28(2):411–425.
- (52) Le TA, Baydin AG, Wood F (2017) Inference compilation and universal probabilistic programming in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, Proceedings of Machine Learning Research. (PMLR, Fort Lauderdale, FL, USA), Vol. 54, pp. 1338–1348.
- (53) Tarantola A (2005) Inverse problem theory and methods for model parameter estimation. (siam) Vol. 89.
- (54) Putzky P, Welling M (2017) Recurrent Inference Machines for Solving Inverse Problems. arXiv:1706.04008.
- (55) Pesah A, Wehenkel A, Louppe G (2018) Recurrent machines for likelihood-free inference.
- (56) Louppe G, Hermans J, Cranmer K (2019) Adversarial variational optimization of non-differentiable simulators in Proceedings of Machine Learning Research, Proceedings of Machine Learning Research, eds. Chaudhuri K, Sugiyama M. (PMLR), Vol. 89, pp. 1438–1447.
- (57) Rezende DJ, Mohamed S (2015) Variational inference with normalizing flows. 32nd International Conference on Machine Learning, ICML 2015 2:1530–1538.
- (58) Papamakarios G, Murray I (2016) Fast e-free inference of simulation models with Bayesian conditional density estimation in Advances in Neural Information Processing Systems. pp. 1036–1044.
- (59) Izbicki R, Lee AB, Pospisil T (2019) Abc–cde: Toward approximate bayesian computation with complex high-dimensional data and limited simulations. Journal of Computational and Graphical Statistics 28(3):481–492.
- (60) Paige B, Wood F (2016) Inference networks for sequential monte carlo in graphical models. 33rd International Conference on Machine Learning, ICML 2016 6:4434–4444.
- (61) Tran D, Ranganath R, Blei DM (2017) Hierarchical implicit models and likelihood-free variational inference in Advances in Neural Information Processing Systems, eds. Guyon I, et al. Vol. 2017-Decem, pp. 5524–5534.
- (62) Durkan C, Papamakarios G, Murray I (2018) Sequential Neural Methods for Likelihood-free Inference. arXiv:1811.08723.
- (63) Lueckmann JM, Bassetto G, Karaletsos T, Macke JH (2019) Likelihood-free inference with emulator networks in Proceedings of The 1st Symposium on Advances in Approximate Bayesian Inference, Proceedings of Machine Learning Research, eds. Ruiz F, Zhang C, Liang D, Bui T. (PMLR), Vol. 96, pp. 32–53.
- (64) Alsing J, Charnock T, Feeney S, Wandelt B (2019) Fast likelihood-free cosmology with neural density estimators and active learning. Monthly Notices of the Royal Astronomical Society 488(3):4440–4458.
- (65) Neal RM (2007) Computing likelihood functions for high-energy physics experiments when distributions are defined by simulators with nuisance parameters in Statistical issues for LHC physics. Proceedings, PHYSTAT-LHC 2007. pp. 111–118.
- (66) Fan Y, Nott DJ, Sisson SA (2013) Approximate Bayesian computation via regression density estimation. Stat 2(1):34–48.
- (67) Cranmer K, Pavez J, Louppe G (2015) Approximating Likelihood Ratios with Calibrated Discriminative Classifiers. arXiv:1506.02169.
- (68) Mohamed S, Lakshminarayanan B (2016) Learning in Implicit Generative Models. arXiv:1610.03483.
- (69) Thomas O, Dutta R, Corander J, Kaski S, Gutmann MU (2016) Likelihood-free inference by ratio estimation. arXiv:1611.10242.
- (70) Dinev T, Gutmann MU (2018) Dynamic Likelihood-free Inference via Ratio Estimation (DIRE). arXiv:1810.09899.
- (71) Hermans J, Begy V, Louppe G (2019) Likelihood-free MCMC with Approximate Likelihood Ratios. arXiv:1903.04057.
- (72) Brehmer J, Cranmer K, Louppe G, Pavez J (2018) Constraining Effective Field Theories with Machine Learning. Physical Review Letters 121(11):111801.
- (73) Louppe G, Cho K, Becot C, Cranmer K (2019) QCD-Aware Recursive Neural Networks for Jet Physics. JHEP 01:57.
- (74) Andreassen A, Feige I, Frye C, Schwartz MD (2019) JUNIPR: a Framework for Unsupervised Machine Learning in Particle Physics. Eur. Phys. J. C79(2):102.
- (75) Carleo G, et al. (2019) Machine learning and the physical sciences. arXiv:1903.10563.
- (76) Munk A, et al. (2019) Deep probabilistic surrogate networks for universal simulator approximation. arXiv:1910.11950.
- (77) Alsing J, Wandelt B (2018) Generalized massive optimal data compression. Monthly Notices of the Royal Astronomical Society: Letters 476(1):L60–L64.
- (78) Alsing J, Wandelt B, Feeney S (2018) Massive optimal data compression and density estimation for scalable, likelihood-free inference in cosmology. Monthly Notices of the Royal Astronomical Society 477(3):2874–2885.
- (79) De Castro P, Dorigo T (2019) INFERNO: Inference-Aware Neural Optimisation. Comput. Phys. Commun. 244:170–179.
- (80) Alsing J, Wandelt B (2019) Nuisance hardened data compression for fast likelihood-free inference. Monthly Notices of the Royal Astronomical Society 488(4):5093–5103.
- (81) Wilks SS (1938) The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses. The Annals of Mathematical Statistics 9(1):60–62.
- (82) Wald A (1943) Tests of Statistical Hypotheses Concerning Several Parameters When the Number of Observations is Large. Transactions of the American Mathematical Society 54(3):426.
- (83) Cowan G, Cranmer K, Gross E, Vitells O (2011) Asymptotic formulae for likelihood-based tests of new physics. European Physical Journal C 71(2):1554.
- (84) Wood F, Van De Meent JW, Mansinghka V (2014) A new approach to probabilistic programming inference in Journal of Machine Learning Research. Vol. 33, pp. 1024–1032.