1 Introduction
Mechanistic simulators have been developed in a wide range of scientific domains (Cranmer et al., 2020). Often these simulators act as a black box: given a set of parameters, the simulator can be sampled, but the distribution over the outputs (the likelihood) cannot be evaluated, rendering typical inference algorithms inapplicable. Simulation-based inference (SBI) methods provide a way to perform inference with these models (Beaumont, 2019; Cranmer et al., 2020). Given a prior p(θ) over parameters and a simulator for the likelihood p(x|θ), the goal of SBI is to approximate the posterior p(θ|x₁,…,xₙ) for any set of i.i.d. observations x₁,…,xₙ. Most SBI methods work by running the simulator to generate samples x for different parameters θ, and using the resulting samples to build an approximation of the posterior. Since many domains involve expensive simulators, recent work has focused on developing algorithms that yield good approximations using a limited budget of simulator calls.
Recent work introduced Neural Posterior Estimation (NPE) methods (Papamakarios and Murray, 2016; Lueckmann et al., 2017; Greenberg et al., 2019; Chan et al., 2018), which use samples (θ, x₁,…,xₙ) to train a conditional neural density estimator q_ψ(θ|x₁,…,xₙ) with parameters ψ, often a normalizing flow (Rezende and Mohamed, 2015; Tabak and Turner, 2013; Winkler et al., 2019), via maximum likelihood. After training, the estimator provides an amortized approximation to p(θ|x₁,…,xₙ) for any set of observations of size n.¹ The drawback of NPE methods is that each training sample requires n simulator calls, so building a training set of size M requires running the simulator nM times. This may be problematic in scenarios where n is large and calls to the simulator are expensive.

¹NPE methods can also handle sets of observations with varying cardinality (Radev et al., 2020), by training the density estimator on samples of varying size n and conditioning it not only on the samples but also on the cardinality n of the set. We give details in Appendix B.
Neural Likelihood Estimation (NLE) methods (Papamakarios et al., 2019; Wood, 2010; Lueckmann et al., 2019) are a natural alternative for such settings. These methods learn a surrogate likelihood q_ψ(x|θ) (or a likelihood ratio; Pham et al., 2014; Cranmer et al., 2015; Hermans et al., 2020) using samples (θ, x) ~ π(θ)p(x|θ), where π(θ) is a proposal distribution, which in the simplest case can default to the prior p(θ). Then, given a set of observations x₁,…,xₙ, inference is carried out on the approximate unnormalized target p(θ) ∏ⱼ q_ψ(xⱼ|θ) by standard methods, typically MCMC (Papamakarios et al., 2019; Lueckmann et al., 2019) or variational inference (Wiqvist et al., 2021; Glöckler et al., 2022). While these methods can handle arbitrary sets of observations at inference time without retraining the surrogate, they do not approximate the posterior directly, and thus require the inference step to be repeated for each set of observations of interest. Moreover, their performance depends on that of the underlying generic inference methods, which tend to struggle, e.g., with multimodal distributions.
Our goal is to develop a method that enjoys the benefits of both types of approaches while avoiding their drawbacks: a method that approximates the posterior directly, is trained on samples (θ, x) ~ p(θ)p(x|θ) requiring a single simulator call each, and naturally handles a varying number of observations at test time. We propose such an approach that relies on score-based modeling (Sohl-Dickstein et al., 2015; Song and Ermon, 2019; Ho et al., 2020; Song et al., 2020). Simply put, we train a single conditional score network to approximate the score of (diffused versions of) p(θ|x) for any x, and propose an algorithm that uses the trained network to approximately sample the posterior p(θ|x₁,…,xₙ) for any set of observations. Our method satisfies the three desiderata outlined above: it directly approximates the posterior, learns from samples produced with a single call to the simulator, and provides a sampling algorithm that can handle arbitrary sets of observations without retraining.
1.1 Conditional score-based generative modeling
The goal of conditional generative modeling is to learn an approximation of a target distribution p(y|x) for some conditioning variable x given samples (y, x) ~ p(y, x), which is exactly the problem SBI methods need to solve. Methods based on score modeling have shown impressive performance for this task (Song and Ermon, 2019; Dhariwal and Nichol, 2021; Ho and Salimans, 2022; Ramesh et al., 2022; Saharia et al., 2022; Shi et al., 2022). They define a sequence of conditional densities by diffusing the target with Gaussian kernels of increasing levels of noise, learn the scores of each density in the sequence using denoising score matching (Hyvärinen and Dayan, 2005; Vincent, 2011), and use Langevin dynamics (Roberts and Tweedie, 1996; Welling and Teh, 2011) with the learned scores to approximately sample from the target distribution.
Specifically, for the noise levels σ₁ < σ₂ < … < σ_T and the corresponding Gaussian diffusion kernels q_t(θ′|θ) = N(θ′; θ, σ_t² I), the sequence of densities is defined as

p_t(θ|x) = ∫ p(θ′|x) N(θ; θ′, σ_t² I) dθ′,  t = 1, …, T.  (1)
Since σ₁ ≈ 0 and σ_T is large, this sequence gradually bridges between the initial tractable reference p_T(θ|x) ≈ N(θ; 0, σ_T² I) and the target p₁(θ|x) ≈ p(θ|x). Score-based methods train a score network s_ψ(θ, x, t) parameterized by ψ to approximate the scores of these densities, s_ψ(θ, x, t) ≈ ∇_θ log p_t(θ|x). As only samples from the target are available, this is done via denoising score matching (Hyvärinen and Dayan, 2005; Vincent, 2011), minimizing
Σ_{t=1}^{T} λ(t) E_{(θ, x)} E_{θ_t ~ q_t(θ_t|θ)} [ ‖ s_ψ(θ_t, x, t) − ∇_{θ_t} log q_t(θ_t|θ) ‖² ].  (2)
Finally, the score network is used to approximately sample the target using annealed Langevin dynamics, as shown in Algorithm 1.
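To make this concrete, here is a minimal NumPy sketch of annealed Langevin dynamics in the spirit of Algorithm 1. The function names, the toy analytic score, and the step-size rule ε_t ∝ σ_t² are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def annealed_langevin(score_fn, sigmas, theta0, n_steps=100, eps=0.1, seed=0):
    """Sketch of annealed Langevin dynamics (cf. Algorithm 1).

    score_fn(theta, sigma) approximates the score of the target diffused
    at noise level sigma; sigmas go from largest to smallest noise.
    The step-size rule eps * sigma**2 is a common (assumed) choice.
    """
    rng = np.random.default_rng(seed)
    theta = float(theta0)
    for sigma in sigmas:                      # anneal from high to low noise
        step = eps * sigma ** 2
        for _ in range(n_steps):
            z = rng.standard_normal()
            theta = theta + 0.5 * step * score_fn(theta, sigma) + np.sqrt(step) * z
    return theta

# Toy check with target N(mu, 1): diffusing with N(0, sigma^2) gives
# N(mu, 1 + sigma^2), whose score is (mu - theta) / (1 + sigma^2).
mu = 2.0
score = lambda th, s: (mu - th) / (1.0 + s ** 2)
sigmas = np.geomspace(10.0, 0.01, 20)
samples = np.array([annealed_langevin(score, sigmas, 0.0, seed=i) for i in range(200)])
```

With the analytic score above, the resulting samples concentrate around the target mean with roughly unit spread, as expected.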
2 Score-based Neural Posterior Estimation
This section presents our approach for SBI using score modeling. Our goal is to develop a method that can be trained using parameter/single-observation pairs (θ, x) ~ p(θ)p(x|θ), and that can be used at test time to approximate p(θ|x₁,…,xₙ) for arbitrary sets of observations with any cardinality n. As we explain in Appendix A, a naive application of conditional score-based modeling fails to satisfy our desiderata. Instead, we propose an alternative approach based on score modeling, involving different choices for the bridging densities and the reference distribution.
Our method is based on the observation that p(θ|x₁,…,xₙ) ∝ p(θ)^{1−n} ∏_{j=1}^{n} p(θ|xⱼ) (see Appendix A). Using this factorization, we propose the sequence of densities
p_t(θ|x₁,…,xₙ) ∝ p(θ)^{1−n} ∏_{j=1}^{n} p_t(θ|xⱼ),  t = 1, …, T,  (3)
where p_t(θ|x) is defined in Eq. 1, taking σ₁ = 0. This construction has four key properties. First, the distribution for t = 1 recovers the target p(θ|x₁,…,xₙ). Second, the distribution for t = T is approximately the tractable Gaussian N(θ; 0, (σ_T²/n) I) (since for large σ_T the prior term vanishes and p_T(θ|xⱼ) ≈ N(θ; 0, σ_T² I) for all xⱼ, and the product of these n Gaussians is proportional to N(θ; 0, (σ_T²/n) I)), and can thus be used as a reference for the process. Third, the score of the resulting densities can be decomposed in terms of the score of the prior (available exactly) and the scores of p_t(θ|xⱼ) as
∇_θ log p_t(θ|x₁,…,xₙ) = (1 − n) ∇_θ log p(θ) + Σ_{j=1}^{n} ∇_θ log p_t(θ|xⱼ).  (4)
And fourth, the scores ∇_θ log p_t(θ|xⱼ) can all be approximated using a single score network s_ψ(θ, x, t) trained via denoising score matching using samples (θ, x) ~ p(θ)p(x|θ), as explained in Section 1.1.
After training, given an arbitrary set of observations x₁,…,xₙ we can approximately sample the target p(θ|x₁,…,xₙ) by running Algorithm 1 with the reference distribution N(θ; 0, (σ_T²/n) I), conditioning variable x₁,…,xₙ, and the approximate score given by
s_ψ(θ, x₁,…,xₙ, t) = (1 − n) ∇_θ log p(θ) + Σ_{j=1}^{n} s_ψ(θ, xⱼ, t).  (5)
It is straightforward to verify that our approach satisfies our original desiderata: the score network is trained using samples (θ, x) obtained with a single simulator call each, and the sampling algorithm can be used with any set of observations x₁,…,xₙ, as it relies on Langevin dynamics with the composed score in Eq. 5.
2.1 Alternative sampling approach
The sampling process described in Algorithm 1 requires choosing step sizes ε_t and the number of Langevin steps L per noise level, and has complexity O(LT). This section introduces a different method to approximately sample the target p(θ|x₁,…,xₙ), which does not use Langevin dynamics and runs in T steps. The approach is based on the formulation by Sohl-Dickstein et al. (2015) to use diffusion models to approximately sample from a product of distributions. The final method involves sampling Gaussian transitions with means and variances computed using the learned score network. We describe the method in Algorithm 2, and give its derivation in Appendix C.

3 Empirical Evaluation
This section presents an empirical evaluation on two problems commonly used to evaluate SBI methods (Lueckmann et al., 2021). One involves a “simulator” consisting of a Gaussian prior and a Gaussian likelihood, with the likelihood covariance set to a diagonal matrix with linearly increasing elements, while the other uses a Gaussian prior and a mixture-of-Gaussians likelihood.
We compare our approach, called Score NPE, to NPE using a normalizing flow with four Real NVP layers (Dinh et al., 2016) (details in Appendix B). We compare the methods' performance when trained on datasets of different sizes, corresponding to different budgets of simulator calls. In all cases we use 20% of the training data as a validation set for early stopping, and train for a maximum of 20k epochs. We train all methods using Adam (Kingma and Ba, 2014) with a fixed learning rate. After training, we generate a set of observations x₁,…,xₙ by drawing θ* ~ p(θ) and xᵢ ~ p(x|θ*), and report the squared MMD (Gretton et al., 2012) between the true posterior and the approximation returned by each method for subsets of the observations of different size. Figure 1 shows average results over 40 random seeds. We observe that both methods perform similarly for the simpler Gaussian-Gaussian model, but that Score NPE outperforms the flow baseline for the model with the mixture-of-Gaussians likelihood.
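For reference, the squared MMD between two sets of samples can be estimated with a Gaussian kernel as sketched below (Gretton et al., 2012). The bandwidth and the biased (V-statistic) form are illustrative choices, not necessarily those used in the experiments.

```python
import numpy as np

def mmd2(X, Y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between sample sets
    X and Y of shape [m, d], using a Gaussian kernel."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
        return np.exp(-d2 / (2.0 * bandwidth ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
# Identical distributions -> near zero; shifted distributions -> clearly positive.
same = mmd2(rng.normal(size=(500, 2)), rng.normal(size=(500, 2)))
diff = mmd2(rng.normal(size=(500, 2)), rng.normal(loc=1.0, size=(500, 2)))
```

The biased estimator is always non-negative, and grows as the two sample distributions separate.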
Multimodal posterior.
We also consider a two-dimensional example with a multimodal posterior, with the prior and likelihood chosen so that the posterior has two modes. We train each method using a fixed budget of simulator calls. After training we sample θ* ~ p(θ) and xᵢ ~ p(x|θ*), and use each method to generate samples from the approximate posterior obtained by conditioning on subsets of the observations of different size. Results are shown in Fig. 2, where it can be observed that our method captures both modes well for all subset sizes of observations, despite only being trained on parameter/single-observation pairs.
References
M. A. Beaumont (2019). Approximate Bayesian computation. Annual Review of Statistics and Its Application 6, pp. 379–403.

J. Chan, V. Perrone, J. Spence, P. Jenkins, S. Mathieson, and Y. Song (2018). A likelihood-free inference framework for population genetic data using exchangeable neural networks. Advances in Neural Information Processing Systems 31.

K. Cranmer, J. Brehmer, and G. Louppe (2020). The frontier of simulation-based inference. Proceedings of the National Academy of Sciences 117 (48), pp. 30055–30062.

K. Cranmer, J. Pavez, and G. Louppe (2015). Approximating likelihood ratios with calibrated discriminative classifiers. arXiv preprint arXiv:1506.02169.

P. Dhariwal and A. Nichol (2021). Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems 34, pp. 8780–8794.

L. Dinh, J. Sohl-Dickstein, and S. Bengio (2016). Density estimation using Real NVP. arXiv preprint arXiv:1605.08803.

M. Glöckler, M. Deistler, and J. H. Macke (2022). Variational methods for simulation-based inference. arXiv preprint arXiv:2203.04176.

D. Greenberg, M. Nonnenmacher, and J. Macke (2019). Automatic posterior transformation for likelihood-free inference. In International Conference on Machine Learning, pp. 2404–2414.

A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola (2012). A kernel two-sample test. The Journal of Machine Learning Research 13 (1), pp. 723–773.

J. Hermans, V. Begy, and G. Louppe (2020). Likelihood-free MCMC with amortized approximate ratio estimators. In International Conference on Machine Learning, pp. 4239–4248.

J. Ho, A. Jain, and P. Abbeel (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, pp. 6840–6851.

J. Ho and T. Salimans (2022). Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.

A. Hyvärinen and P. Dayan (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6 (4).

D. P. Kingma and J. Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.

J.-M. Lueckmann, G. Bassetto, T. Karaletsos, and J. H. Macke (2019). Likelihood-free inference with emulator networks. In Symposium on Advances in Approximate Bayesian Inference, pp. 32–53.

J.-M. Lueckmann, J. Boelts, D. Greenberg, P. Goncalves, and J. Macke (2021). Benchmarking simulation-based inference. In International Conference on Artificial Intelligence and Statistics, pp. 343–351.

J.-M. Lueckmann, P. J. Goncalves, G. Bassetto, K. Öcal, M. Nonnenmacher, and J. H. Macke (2017). Flexible statistical inference for mechanistic models of neural dynamics. Advances in Neural Information Processing Systems 30.

C. Luo (2022). Understanding diffusion models: a unified perspective. arXiv preprint arXiv:2208.11970.

R. M. Neal (2011). MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo 2 (11), pp. 2.

G. Papamakarios and I. Murray (2016). Fast ε-free inference of simulation models with Bayesian conditional density estimation. Advances in Neural Information Processing Systems 29.

G. Papamakarios, D. Sterratt, and I. Murray (2019). Sequential neural likelihood: fast likelihood-free inference with autoregressive flows. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 837–848.

K. C. Pham, D. J. Nott, and S. Chaudhuri (2014). A note on approximating ABC-MCMC using flexible classifiers. Stat 3 (1), pp. 218–227.

S. T. Radev, U. K. Mertens, A. Voss, L. Ardizzone, and U. Köthe (2020). BayesFlow: learning complex stochastic models with invertible neural networks. IEEE Transactions on Neural Networks and Learning Systems.

A. Ramdas, S. J. Reddi, B. Poczos, A. Singh, and L. Wasserman (2015). On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 29.

A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen (2022). Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125.

D. Rezende and S. Mohamed (2015). Variational inference with normalizing flows. In International Conference on Machine Learning, pp. 1530–1538.

G. O. Roberts and R. L. Tweedie (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2 (4), pp. 341–363.

C. Saharia et al. (2022). Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487.

Y. Shi, V. De Bortoli, G. Deligiannidis, and A. Doucet (2022). Conditional simulation using diffusion Schrödinger bridges. arXiv preprint arXiv:2202.13460.

J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli (2015). Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265.

Y. Song and S. Ermon (2019). Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems 32.

Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole (2020). Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.

E. G. Tabak and C. V. Turner (2013). A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics 66 (2), pp. 145–164.

A. Vaswani et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems 30.

P. Vincent (2011). A connection between score matching and denoising autoencoders. Neural Computation 23 (7), pp. 1661–1674.

M. Welling and Y. W. Teh (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, pp. 681–688.

C. Winkler, D. Worrall, E. Hoogeboom, and M. Welling (2019). Learning likelihoods with conditional normalizing flows. arXiv preprint arXiv:1912.00042.

S. Wiqvist, J. Frellsen, and U. Picchini (2021). Sequential neural posterior and likelihood approximation. arXiv preprint arXiv:2102.06522.

S. N. Wood (2010). Statistical inference for noisy nonlinear ecological dynamic systems. Nature 466 (7310), pp. 1102–1104.
Appendix A Failure of direct application of conditional score modeling
The target distribution is given by p(θ|x₁,…,xₙ) ∝ p(θ) ∏_{j=1}^{n} p(xⱼ|θ). A direct application of conditional score modeling yields the sequence of densities
p_t(θ|x₁,…,xₙ) = ∫ p(θ′|x₁,…,xₙ) N(θ; θ′, σ_t² I) dθ′,  t = 1, …, T.  (6)
It can be seen from the equation above that the score ∇_θ log p_t(θ|x₁,…,xₙ) does not factorize in terms of the single-observation scores ∇_θ log p_t(θ|xⱼ), meaning that the corresponding score network would have to be trained using samples (θ, x₁,…,xₙ), obtained by calling the simulator n times for a single parameter sample θ. As mentioned in Section 1, this is one of the drawbacks of NPE methods that we seek to avoid.
A.1 Derivation of posterior factorization
The factorization for the posterior distribution is obtained by applying Bayes' rule twice:

p(θ|x₁,…,xₙ) ∝ p(θ) p(x₁,…,xₙ|θ)   (Bayes' rule)  (7)
            = p(θ) ∏_{j=1}^{n} p(xⱼ|θ)   (8)
            ∝ p(θ) ∏_{j=1}^{n} p(θ|xⱼ) / p(θ)   (Bayes' rule)  (9)
            = p(θ)^{1−n} ∏_{j=1}^{n} p(θ|xⱼ).   (10)
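This identity can be checked numerically in a conjugate Gaussian toy model (the model choice and helper names are illustrative): both sides agree as log-densities up to an additive constant.

```python
import numpy as np

def log_norm(z, mean, var):
    """Log-density of N(mean, var) evaluated at z."""
    return -0.5 * np.log(2 * np.pi * var) - (z - mean) ** 2 / (2 * var)

theta = np.linspace(-3, 3, 201)
xs = np.array([0.4, -0.9, 1.3])
n = len(xs)

# Left-hand side: log p(theta) + sum_j log p(x_j | theta), with
# prior N(0, 1) and likelihood N(x; theta, 1).
lhs = log_norm(theta, 0.0, 1.0) + sum(log_norm(x, theta, 1.0) for x in xs)

# Right-hand side: (1-n) log p(theta) + sum_j log p(theta | x_j), using
# the conjugate single-observation posterior p(theta | x) = N(x/2, 1/2).
rhs = (1 - n) * log_norm(theta, 0.0, 1.0) + sum(log_norm(theta, x / 2.0, 0.5) for x in xs)

gap = lhs - rhs  # constant in theta if the factorization holds
```

The constant gap corresponds to the normalizing terms dropped in the proportionality steps.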
Appendix B Details for empirical evaluation
B.1 Score NPE
Our implementation of the score network has three blocks:

- An MLP with 3 hidden layers that takes the parameter θ as input and outputs an embedding,
- An MLP with 3 hidden layers that takes the observation x as input and outputs an embedding.

All MLPs use residual connections throughout.
Running Algorithm 1 to generate samples using the learned score network requires choosing step sizes ε_t and the number of Langevin steps L for each noise level σ_t; we keep these settings fixed across all our experiments.
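One standard way to set these quantities (an assumption shown here for illustration, not the paper's exact values) is to use geometrically spaced noise levels with step sizes proportional to σ_t², as in Song and Ermon (2019):

```python
import numpy as np

# Geometrically spaced noise levels (largest to smallest) and the common
# step-size rule eps_t = eps * sigma_t^2 / sigma_min^2. The constants
# below are placeholders, not values from the paper.
T, sigma_max, sigma_min, eps = 10, 10.0, 0.01, 2e-5
sigmas = np.geomspace(sigma_max, sigma_min, T)
step_sizes = eps * (sigmas / sigmas[-1]) ** 2
```

This makes the Langevin step size shrink with the noise level, so the chain takes large moves early and fine moves near the target.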
B.2 Flow NPE
We use an implementation of NPE methods based on flows able to handle sets of observations of any size n. The flow can be expressed as q_ψ(θ|x₁,…,xₙ). Following Chan et al. [2] and Radev et al. [23, §2.4], we use an exchangeable neural network to process the observations. Specifically, we use an MLP with 3 hidden layers to generate an embedding eᵢ for each observation xᵢ. We then compute the mean embedding across observations, ē = (1/n) Σᵢ eᵢ, which we use as input for the conditional flow. Finally, we condition the flow not only on ē but also on an untrained embedding for the number of observations n. For the flow we use 4 Real NVP layers [6], each consisting of MLPs with three hidden layers. As for the Score NPE method, we use residual connections throughout.
We train the flow via maximum likelihood using samples (θ, x₁,…,xₙ) of varying cardinality. At test time, this architecture can handle sets of observations of any size n.
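The exchangeable conditioning can be sketched as follows, with toy weights standing in for the per-observation embedding MLP (the real implementation uses 3-hidden-layer MLPs with residual connections): mean pooling makes the summary invariant to the order of the observations.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy single-layer "MLP" weights for 2-d observations -> 16-d embeddings.
W = rng.normal(size=(2, 16))
b = np.zeros(16)

def embed_set(xs):
    """Mean-pool per-observation embeddings into a permutation-invariant
    summary of the whole observation set."""
    h = np.tanh(xs @ W + b)   # shape [n, 16]: one embedding per observation
    return h.mean(axis=0)     # shape [16]: order-invariant summary

xs = rng.normal(size=(5, 2))
perm = rng.permutation(5)     # any reordering of the observations
```

Because the pooled summary ignores ordering, the downstream flow sees the same conditioning input for any permutation of the set.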
Appendix C Alternative sampling method without unadjusted Langevin dynamics
This section gives the derivation for the sampling method shown in Algorithm 2. In short, the derivation uses the formulation of score-based methods as diffusions, and has 3 main steps: (1) using the scores of p_t(θ|x) to compute the Gaussian transition kernels of the corresponding diffusion process [18]; (2) composing the Gaussian transitions corresponding to the diffusions of p_t(θ|x₁), …, p_t(θ|xₙ) (this is based on Sohl-Dickstein et al. [30]); and (3) adding a correction for the prior term (also based on Sohl-Dickstein et al. [30]). We note that steps 2 and 3 require some approximations. Despite this, our empirical evaluation indicates that the method works well in practice. We believe a thorough analysis of these approximations would be useful in understanding when the sampling method from Appendix C can be expected to work. For clarity, we use [A] to indicate when the approximations are introduced/used.
Connection between score-based methods and diffusion models
We begin by noting that score-based methods can be equivalently formulated as diffusion models, where the means of the Gaussian transitions that act as denoising steps are learned instead of the scores. Specifically, letting β_t ∈ (0, 1), α_t = 1 − β_t, and ᾱ_t = ∏_{s=1}^{t} α_s, for t = 1, …, T, the learned model is given by a sequence of Gaussian transitions p_ψ(θ_{t−1}|θ_t, x) = N(θ_{t−1}; μ_ψ(θ_t, x, t), β_t I) trained to invert a sequence of noising steps given by q(θ_t|θ_{t−1}) = N(θ_t; √α_t θ_{t−1}, β_t I). The connection between diffusion models and score-based methods comes from the fact that the optimal means and scores are linearly related [18]:

μ_ψ(θ_t, x, t) = (1/√α_t) (θ_t + β_t s_ψ(θ_t, x, t)).  (11)
Approximately composing diffusions
To simplify notation, we use a superscript x to indicate distributions that are conditioned on x (e.g. p_t^x(θ) = p_t(θ|x)). Assume we have transition kernels r^x(θ_{t−1}|θ_t) that exactly reverse the forward kernels [A1], meaning that ∫ r^x(θ_{t−1}|θ_t) p_t^x(θ_t) dθ_t = p_{t−1}^x(θ_{t−1}), or equivalently r^x(θ_{t−1}|θ_t) p_t^x(θ_t) = q(θ_t|θ_{t−1}) p_{t−1}^x(θ_{t−1}). Our goal is to find a transition kernel r̃(θ_{t−1}|θ_t) that satisfies

∫ r̃(θ_{t−1}|θ_t) π_t(θ_t) dθ_t = π_{t−1}(θ_{t−1}),  (12)

where π_t(θ) ∝ ∏_{j=1}^{n} p_t^{xⱼ}(θ). (This is closely related to our formulation in Section 2, since our definition of the bridging densities involves the product ∏_{j=1}^{n} p_t(θ|xⱼ).) It is straightforward to verify that the condition from Eq. 12 can be rewritten as
r̃(θ_{t−1}|θ_t) ∏_{j=1}^{n} p_t^{xⱼ}(θ_t) ∝ q(θ_t|θ_{t−1}) ∏_{j=1}^{n} p_{t−1}^{xⱼ}(θ_{t−1})  (13)

 = q(θ_t|θ_{t−1})^{1−n} ∏_{j=1}^{n} r^{xⱼ}(θ_{t−1}|θ_t) p_t^{xⱼ}(θ_t).  (14)
Then, one way to satisfy Eq. 14 is to choose r̃ so that, after cancelling the common factor ∏_{j=1}^{n} p_t^{xⱼ}(θ_t) from both sides, the remaining terms match. That is,
r̃(θ_{t−1}|θ_t) = q(θ_t|θ_{t−1})^{1−n} ∏_{j=1}^{n} r^{xⱼ}(θ_{t−1}|θ_t).  (15)
However, the resulting r̃ may not be a normalized distribution over θ_{t−1} [30]. Following Sohl-Dickstein et al. [30], we propose to use the corresponding normalized distribution, defined as r̃(θ_{t−1}|θ_t)/Z(θ_t) [A2]. Given that Eq. 15 corresponds to a product of Gaussian densities (each r^{xⱼ} is Gaussian with mean μⱼ and variance β_t, and q is Gaussian), the resulting normalized transition is also Gaussian, with mean and variance given by

μ̃ = ( Σ_{j=1}^{n} μⱼ − (n−1) √α_t θ_t ) / ( n − (n−1) α_t ),   σ̃_t² = β_t / ( n − (n−1) α_t ),  (16)
where each μⱼ = μ_ψ(θ_t, xⱼ, t) is obtained from the learned score network using Eq. 11.
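The completing-the-square step behind this composed transition can be verified numerically. The sketch below (a toy 1-d check under our stated assumptions about the kernel forms, with illustrative constants) confirms that a product of n Gaussian reverse kernels divided by the (n−1)-th power of the forward kernel is itself Gaussian, with the mean and variance given above.

```python
import numpy as np

def log_norm(z, mean, var):
    """Log-density of N(mean, var) evaluated at z."""
    return -0.5 * np.log(2 * np.pi * var) - (z - mean) ** 2 / (2 * var)

# Toy 1-d setting: reverse kernels N(theta'; mu_j, beta), forward kernel
# N(theta_t; sqrt(alpha) theta', beta). Constants are placeholders.
beta, alpha, theta_t = 0.1, 0.9, 0.3
mus = np.array([0.2, -0.5, 0.8])
n = len(mus)

# Claimed composed transition: Gaussian with the mean/variance below.
prec = n - (n - 1) * alpha
var_tilde = beta / prec
mean_tilde = (mus.sum() - (n - 1) * np.sqrt(alpha) * theta_t) / prec

grid = np.linspace(-2, 2, 101)
log_unnorm = (sum(log_norm(grid, m, beta) for m in mus)
              - (n - 1) * log_norm(theta_t, np.sqrt(alpha) * grid, beta))
gap = log_unnorm - log_norm(grid, mean_tilde, var_tilde)  # constant if correct
```

A constant gap over the grid means the unnormalized product and the claimed Gaussian differ only by the normalizing constant Z(θ_t).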
Prior correction term
The formulation above ignores the fact that the bridging densities defined in Eq. 3 involve the prior term p(θ)^{1−n}. We use the method proposed by Sohl-Dickstein et al. [30] to correct for this, which involves adding the prior-score term (1 − n) σ̃_t² ∇_θ log p(θ_t) to the mean from Eq. 16. (The derivation for this is similar to the one above, and also requires setting the resulting transition kernel to the normalized version of an unnormalized distribution [30].)
As mentioned previously, this derivation uses two assumptions/approximations. [A1] assumes that the learned reverse kernels exactly reverse the noising process, which is reasonable if the forward kernels add small amounts of noise per step (equivalently, if the noise levels increase slowly). [A2] assumes that the normalized version of r̃, given by r̃(θ_{t−1}|θ_t)/Z(θ_t), approximately satisfies Eq. 14.