
Score Modeling for Simulation-based Inference

by   Tomas Geffner, et al.

Neural Posterior Estimation methods for simulation-based inference can be ill-suited for dealing with posterior distributions obtained by conditioning on multiple observations, as they may require a large number of simulator calls to yield accurate approximations. Neural Likelihood Estimation methods can naturally handle multiple observations, but require a separate inference step, which may affect their efficiency and performance. We introduce a new method for simulation-based inference that enjoys the benefits of both approaches. We propose to model the scores for the posterior distributions induced by individual observations, and introduce a sampling algorithm that combines the learned scores to approximately sample from the target efficiently.





1 Introduction

Mechanistic simulators have been developed in a wide range of scientific domains Cranmer et al. (2020). Often these simulators act as a black box: given a set of parameters, the simulator can be sampled, but the distribution over the outputs (the likelihood) cannot be evaluated, rendering typical inference algorithms inapplicable. Simulation-based inference (SBI) methods provide a way to perform inference with these models Beaumont (2019); Cranmer et al. (2020). Given a prior over parameters p(θ) and a simulator for the likelihood p(x | θ), the goal of SBI is to approximate the posterior p(θ | x_1, …, x_n) for any set of i.i.d. observations x_1, …, x_n. Most SBI methods work by running the simulator to generate samples x ∼ p(x | θ) for different parameters θ, and using the resulting samples to build an approximation of the posterior. Since many domains involve expensive simulators, recent work has focused on developing algorithms that yield good approximations using a limited budget of simulator calls.

Recent work introduced Neural Posterior Estimation (NPE) methods Papamakarios and Murray (2016); Lueckmann et al. (2017); Greenberg et al. (2019); Chan et al. (2018), which use samples (θ, x_1, …, x_n) ∼ p(θ) ∏_i p(x_i | θ) to train a conditional neural density estimator q_ψ(θ | x_1, …, x_n) with parameters ψ, often a normalizing flow Rezende and Mohamed (2015); Tabak and Turner (2013); Winkler et al. (2019), via maximum likelihood. After training, the estimator provides an amortized approximation to p(θ | x_1, …, x_n) for any set of observations of size n.¹ The drawback of NPE methods is that each training sample requires n simulator calls, so building a training set of size M requires running the simulator nM times. This may be problematic in scenarios where n is large and calls to the simulator are expensive.

¹ NPE methods can handle sets of observations with varying cardinality as well Radev et al. (2020), by training the density estimator using samples of varying size n, and conditioning it not only on the samples but also on the cardinality n of the set. We give details in Appendix B.

Neural Likelihood Estimation (NLE) methods Papamakarios et al. (2019); Wood (2010); Lueckmann et al. (2019) are a natural alternative for cases where n > 1. These methods learn a surrogate likelihood q_ψ(x | θ) (or a likelihood ratio Pham et al. (2014); Cranmer et al. (2015); Hermans et al. (2020)) using samples (θ, x) ∼ p̃(θ) p(x | θ), where p̃(θ) is a proposal distribution, which in the simplest case can default to the prior p(θ). Then, given a set of observations x_1, …, x_n, inference is carried out on the approximate unnormalized target p(θ) ∏_i q_ψ(x_i | θ) by standard methods, typically MCMC Papamakarios et al. (2019); Lueckmann et al. (2019) or variational inference Wiqvist et al. (2021); Glöckler et al. (2022). While these methods can handle arbitrary sets of observations at inference time without re-training the surrogate, they do not approximate the posterior directly, and thus require the inference step to be repeated for each set of observations of interest. Moreover, their performance depends on that of the underlying generic inference methods, which tend to struggle, e.g., with multimodal distributions.

Our goal is to develop a method that enjoys the benefits of both types of approaches while avoiding their drawbacks: a method that approximates the posterior directly, is trained on samples (θ, x) ∼ p(θ) p(x | θ), and is able to naturally handle a varying number of observations at test time. We propose such an approach that relies on score-based modeling Sohl-Dickstein et al. (2015); Song and Ermon (2019); Ho et al. (2020); Song et al. (2020). Simply put, we train a single conditional score network to approximate the score of (diffused versions of) p(θ | x) for any x, and propose an algorithm that uses the trained network to approximately sample the posterior p(θ | x_1, …, x_n) for any set of observations x_1, …, x_n. Our method satisfies the three desiderata outlined above: it directly approximates the posterior, learns from samples produced with a single call to the simulator, and provides a sampling algorithm that can handle arbitrary sets of observations without re-training.

1.1 Conditional score-based generative modeling

The goal of conditional generative modeling is to learn an approximation of a target distribution p(θ | x) for some conditioning variable x given samples (θ, x) ∼ p(θ) p(x | θ), which is exactly the problem SBI methods need to solve. Methods based on score modeling have shown impressive performance for this task Song and Ermon (2019); Dhariwal and Nichol (2021); Ho and Salimans (2022); Ramesh et al. (2022); Saharia et al. (2022); Shi et al. (2022). They define a sequence of conditional densities by diffusing the target with Gaussian kernels of increasing levels of noise, learn the scores of each density in the sequence using denoising score matching Hyvärinen and Dayan (2005); Vincent (2011), and use Langevin dynamics Roberts and Tweedie (1996); Welling and Teh (2011) with the learned scores to approximately sample from the target distribution.

Specifically, for the noise levels σ_1 < σ_2 < … < σ_T and the corresponding Gaussian diffusion kernels q_t(θ̃ | θ) = N(θ̃; θ, σ_t² I), the sequence of densities is defined as

p_t(θ̃ | x) = ∫ p(θ | x) q_t(θ̃ | θ) dθ.    (1)
Since σ_T is large, this sequence gradually bridges between the initial tractable reference p_T(θ | x) ≈ N(θ; 0, σ_T² I) and the target p(θ | x). Score-based methods train a score network s_ψ(θ, x, t) parameterized by ψ to approximate the scores of these densities, s_ψ(θ, x, t) ≈ ∇_θ log p_t(θ | x). As only samples from the target are available, this is done via denoising score matching Hyvärinen and Dayan (2005); Vincent (2011), minimizing

L(ψ) = E_t E_{(θ, x)} E_{θ̃ ∼ q_t(θ̃ | θ)} ‖ s_ψ(θ̃, x, t) − ∇_θ̃ log q_t(θ̃ | θ) ‖².    (2)
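For Gaussian kernels the denoising score matching target has a simple closed form, since ∇_θ̃ log q_t(θ̃ | θ) = −(θ̃ − θ)/σ_t². The sketch below (a hypothetical `score_net` interface, NumPy for clarity, not the paper's implementation) estimates the loss for a batch of (θ, x) pairs:

```python
import numpy as np

rng = np.random.default_rng(0)

def dsm_loss(score_net, theta, x, sigmas):
    """Monte Carlo estimate of the denoising score matching loss
    for Gaussian kernels q_t(theta_tilde | theta) = N(theta, sigma_t^2 I).

    score_net(theta_tilde, x, t) returns the estimated score of p_t(theta | x).
    """
    total = 0.0
    for t, sigma in enumerate(sigmas):
        eps = rng.standard_normal(theta.shape)
        theta_tilde = theta + sigma * eps            # sample from q_t(. | theta)
        target = -(theta_tilde - theta) / sigma**2   # score of the Gaussian kernel
        diff = score_net(theta_tilde, x, t) - target
        total += np.mean(np.sum(diff**2, axis=-1))
    return total / len(sigmas)
```

For a standard normal target the diffused density at level t is N(0, (1 + σ_t²) I), whose exact score achieves a lower loss than any other estimate, which gives a simple way to sanity-check the objective.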
Finally, the score network is used to approximately sample the target using annealed Langevin dynamics, as shown in Algorithm 1.

Input: Score network s_ψ, reference distribution p_ref
Input: Conditioning variable x, number of Langevin steps L, Langevin step sizes δ_1, …, δ_T

Sample θ ∼ p_ref
for t = T, …, 1 do
     for l = 1, …, L do
          θ ← θ + (δ_t / 2) s_ψ(θ, x, t) + √δ_t ε, with ε ∼ N(0, I)   ▷ Unadjusted Langevin step
Algorithm 1 Approximate sampling with learned scores
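As a concrete toy illustration of this annealing loop, the sketch below runs unadjusted Langevin dynamics with a caller-supplied score function; the step-size rule δ_t ∝ σ_t² is one common choice (Song and Ermon, 2019), not necessarily the paper's exact schedule:

```python
import numpy as np

rng = np.random.default_rng(1)

def annealed_langevin(score_fn, theta0, sigmas, n_steps=100, eps=0.1):
    """Annealed unadjusted Langevin dynamics in the style of Algorithm 1.

    score_fn(theta, t) -> score of the bridging density at noise level t.
    Anneals from the highest noise level (t = T) down to the lowest.
    """
    theta = theta0.copy()
    for t in reversed(range(len(sigmas))):            # t = T, ..., 1
        step = eps * sigmas[t]**2 / sigmas[-1]**2     # delta_t, scaled with the noise level
        for _ in range(n_steps):                      # L Langevin steps per level
            noise = rng.standard_normal(theta.shape)
            theta = theta + 0.5 * step * score_fn(theta, t) + np.sqrt(step) * noise
    return theta
```

With the analytic score of a diffused standard normal, the chain initialized at the widest level ends up approximately distributed as the target N(0, 1).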

2 Score-based Neural Posterior Estimation

This section presents our approach for SBI using score modeling. Our goal is to develop a method that can be trained using parameter/single-observation pairs (θ, x) ∼ p(θ) p(x | θ), and that can be used at test time to approximate p(θ | x_1, …, x_n) for arbitrary sets of observations with any cardinality n. As we explain in Appendix A, a naive application of conditional score-based modeling fails to satisfy our desiderata. Instead, we propose an alternative approach based on score modeling, involving different choices for the bridging densities and the reference distribution.

Our method is based on the observation that p(θ | x_1, …, x_n) ∝ p(θ)^(1−n) ∏_{i=1}^n p(θ | x_i) (see Appendix A). Using this factorization, we propose the sequence of densities

p_t(θ | x_1, …, x_n) ∝ p(θ)^(1−n) ∏_{i=1}^n p_t(θ | x_i),    (3)

where p_t(θ | x_i) is defined in Eq. 1, taking x = x_i. This construction has four key properties. First, the distribution for t → 0 recovers the target p(θ | x_1, …, x_n). Second, the distribution for t = T is a tractable Gaussian, approximately N(θ; 0, (σ_T² / n) I),² and thus can be used as a reference for the process. Third, the score of the resulting densities can be decomposed in terms of the score of the prior (available exactly) and the scores of p_t(θ | x_i) as

∇_θ log p_t(θ | x_1, …, x_n) = (1 − n) ∇_θ log p(θ) + Σ_{i=1}^n ∇_θ log p_t(θ | x_i).    (4)

And fourth, the scores ∇_θ log p_t(θ | x_i) can all be approximated using a single score network s_ψ(θ, x, t) trained via denoising score matching using samples (θ, x) ∼ p(θ) p(x | θ), as explained in Section 1.1.

² Since for large σ_T the prior term vanishes and p_T(θ | x_i) ≈ N(θ; 0, σ_T² I) for all i.

After training, given an arbitrary set of observations x_1, …, x_n, we can approximately sample the target by running Algorithm 1 with the reference distribution N(0, (σ_T² / n) I), conditioning variable (x_1, …, x_n), and the approximate score given by

s(θ, x_1, …, x_n, t) = (1 − n) ∇_θ log p(θ) + Σ_{i=1}^n s_ψ(θ, x_i, t).    (5)
It is straightforward to verify that our approach satisfies our original desiderata: the score network is trained using samples (θ, x) obtained with single simulator calls, and the sampling algorithm can be used with any set of observations x_1, …, x_n, as it relies on Langevin dynamics with Eq. 5.
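The score composition in Eq. 5 is a one-liner, and can be checked against a tractable case: for a N(0, 1) prior and N(θ, 1) likelihood, the single-observation posterior is N(x/2, 1/2), so its (noise-free) score is x − 2θ, and the composed score must match the score of the analytic n-observation posterior N(Σ_i x_i / (n+1), 1/(n+1)). The `score_net` below is this exact score, standing in for a trained network:

```python
import numpy as np

def composed_score(score_net, prior_score, theta, xs, t):
    """Approximate score of the n-observation bridging density:
    a prior correction plus a sum of single-observation scores."""
    n = len(xs)
    return (1 - n) * prior_score(theta) + sum(score_net(theta, x, t) for x in xs)

# Tractable check: N(0, 1) prior, N(theta, 1) likelihood, noise level t = 0.
prior_score = lambda th: -th
score_net = lambda th, x, t: x - 2.0 * th      # exact score of N(x / 2, 1 / 2)
```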

2.1 Alternative sampling approach

The sampling process described in Algorithm 1 requires choosing step sizes δ_t and the number of Langevin steps L per noise level, and has complexity O(T L). This section introduces a different method to approximately sample the target, which does not use Langevin dynamics and runs in O(T) steps. The approach is based on the formulation by Sohl-Dickstein et al. (2015) that uses diffusion models to approximately sample from a product of distributions. The final method involves sampling Gaussian transitions with means and variances computed using the learned score network. We describe the method in Algorithm 2, and give its derivation in Appendix C.

Input: Score network s_ψ, reference distribution p_ref
Input: Conditioning variables x_1, …, x_n, noise levels σ_1, …, σ_T

Define β_t, α_t = 1 − β_t, and ᾱ_t = ∏_{s ≤ t} α_s, for t = 1, …, T
Sample θ_T ∼ p_ref
for t = T, …, 1 do
      μ_i ← (θ_t + β_t s_ψ(θ_t, x_i, t)) / √α_t, for i = 1, …, n   ▷ Compute auxiliary variables (Eq. 11)
      Compute μ and Σ using Eq. 16, with the prior correction term   ▷ Set transition mean and variance
      Sample θ_{t−1} ∼ N(μ, Σ)   ▷ Sample transition
Algorithm 2 Approximately sampling without unadjusted Langevin dynamics

3 Empirical Evaluation

This section presents an empirical evaluation on two problems commonly used to evaluate SBI methods Lueckmann et al. (2021). One involves a “simulator” consisting of a Gaussian prior and a Gaussian likelihood (with the likelihood covariance set to a diagonal matrix with linearly increasing elements), while the other uses a Gaussian prior and a mixture-of-Gaussians likelihood. Both problems use the same fixed dimensionality for θ.

We compare our approach, called Score NPE, to NPE using a normalizing flow with four Real NVP layers Dinh et al. (2016) (details in Appendix B). We compare the methods’ performance when trained on datasets of different sizes, corresponding to different budgets of simulator calls. In all cases we use 20% of the training data as a validation set for early stopping, and train for a maximum of 20k epochs. We train all methods using Adam Kingma and Ba (2014) with a fixed learning rate.

After training we generate a set of observations x_1, …, x_n by drawing θ* ∼ p(θ) and x_i ∼ p(x | θ*), and report the squared MMD Gretton et al. (2012) between the true posterior and the approximation returned by each method for subsets of x_1, …, x_n of different size. Figure 1 shows average results over 40 random seeds. We observe that both methods perform similarly for the simpler Gaussian-Gaussian model, but that Score NPE outperforms the flow baseline for the model with the mixture-of-Gaussians likelihood.
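For reference, the evaluation metric can be computed as below: a biased estimator of the squared MMD with a Gaussian kernel, with the bandwidth chosen by the median heuristic (a generic sketch, not the paper's exact implementation):

```python
import numpy as np

def squared_mmd(X, Y, scale=None):
    """Biased estimate of the squared MMD between sample sets X and Y using
    a Gaussian kernel; the scale defaults to the median heuristic."""
    Z = np.vstack([X, Y])
    d2 = np.sum((Z[:, None, :] - Z[None, :, :])**2, axis=-1)  # pairwise sq. distances
    if scale is None:
        scale = np.sqrt(np.median(d2[d2 > 0]))                # median heuristic
    K = np.exp(-d2 / (2.0 * scale**2))
    n = len(X)
    Kxx, Kyy, Kxy = K[:n, :n], K[n:, n:], K[:n, n:]
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()
```

The estimate is zero for identical sample sets and grows as the two distributions move apart.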

Figure 1: Squared MMD between the true posterior and the approximation returned by different methods. (A1) and (A2) indicate which of Algorithms 1 and 2 was used for sampling (details for step sizes and other parameters are in Appendix B). The MMD is computed using a Gaussian kernel with scale determined by the median heuristic Ramdas et al. (2015). Samples from the true posterior were obtained with HMC Neal (2011).

Multimodal posterior.

We also consider a two-dimensional example with a multimodal posterior, with the prior and likelihood chosen so that the posterior has two modes. We train each method using a fixed budget of simulator calls. After training we sample θ* ∼ p(θ) and x_i ∼ p(x | θ*), and use each method to generate samples from the approximate posterior obtained by conditioning on subsets of x_1, …, x_n of different size. Results are shown in Fig. 2, where it can be observed that our method is able to capture both modes well for all subset sizes despite only being trained on parameter/single-observation pairs.

Figure 2: Posteriors for the multimodal example. True parameters θ* used to generate the observations are shown in black in the first row. Score NPE samples were obtained using Algorithm 2.


References

  • M. A. Beaumont (2019) Approximate Bayesian computation. Annual Review of Statistics and Its Application 6, pp. 379–403.
  • J. Chan, V. Perrone, J. Spence, P. Jenkins, S. Mathieson, and Y. Song (2018) A likelihood-free inference framework for population genetic data using exchangeable neural networks. Advances in Neural Information Processing Systems 31.
  • K. Cranmer, J. Brehmer, and G. Louppe (2020) The frontier of simulation-based inference. Proceedings of the National Academy of Sciences 117 (48), pp. 30055–30062.
  • K. Cranmer, J. Pavez, and G. Louppe (2015) Approximating likelihood ratios with calibrated discriminative classifiers. arXiv preprint arXiv:1506.02169.
  • P. Dhariwal and A. Nichol (2021) Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems 34, pp. 8780–8794.
  • L. Dinh, J. Sohl-Dickstein, and S. Bengio (2016) Density estimation using Real NVP. arXiv preprint arXiv:1605.08803.
  • M. Glöckler, M. Deistler, and J. H. Macke (2022) Variational methods for simulation-based inference. arXiv preprint arXiv:2203.04176.
  • D. Greenberg, M. Nonnenmacher, and J. Macke (2019) Automatic posterior transformation for likelihood-free inference. In International Conference on Machine Learning, pp. 2404–2414.
  • A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola (2012) A kernel two-sample test. The Journal of Machine Learning Research 13 (1), pp. 723–773.
  • J. Hermans, V. Begy, and G. Louppe (2020) Likelihood-free MCMC with amortized approximate ratio estimators. In International Conference on Machine Learning, pp. 4239–4248.
  • J. Ho, A. Jain, and P. Abbeel (2020) Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, pp. 6840–6851.
  • J. Ho and T. Salimans (2022) Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.
  • A. Hyvärinen and P. Dayan (2005) Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6 (4).
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • J. Lueckmann, G. Bassetto, T. Karaletsos, and J. H. Macke (2019) Likelihood-free inference with emulator networks. In Symposium on Advances in Approximate Bayesian Inference, pp. 32–53.
  • J. Lueckmann, J. Boelts, D. Greenberg, P. Goncalves, and J. Macke (2021) Benchmarking simulation-based inference. In International Conference on Artificial Intelligence and Statistics, pp. 343–351.
  • J. Lueckmann, P. J. Goncalves, G. Bassetto, K. Öcal, M. Nonnenmacher, and J. H. Macke (2017) Flexible statistical inference for mechanistic models of neural dynamics. Advances in Neural Information Processing Systems 30.
  • C. Luo (2022) Understanding diffusion models: a unified perspective. arXiv preprint arXiv:2208.11970.
  • R. M. Neal (2011) MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo 2 (11), pp. 2.
  • G. Papamakarios and I. Murray (2016) Fast ε-free inference of simulation models with Bayesian conditional density estimation. Advances in Neural Information Processing Systems 29.
  • G. Papamakarios, D. Sterratt, and I. Murray (2019) Sequential neural likelihood: fast likelihood-free inference with autoregressive flows. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 837–848.
  • K. C. Pham, D. J. Nott, and S. Chaudhuri (2014) A note on approximating ABC-MCMC using flexible classifiers. Stat 3 (1), pp. 218–227.
  • S. T. Radev, U. K. Mertens, A. Voss, L. Ardizzone, and U. Köthe (2020) BayesFlow: learning complex stochastic models with invertible neural networks. IEEE Transactions on Neural Networks and Learning Systems.
  • A. Ramdas, S. J. Reddi, B. Póczos, A. Singh, and L. Wasserman (2015) On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 29.
  • A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen (2022) Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125.
  • D. Rezende and S. Mohamed (2015) Variational inference with normalizing flows. In International Conference on Machine Learning, pp. 1530–1538.
  • G. O. Roberts and R. L. Tweedie (1996) Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2 (4), pp. 341–363.
  • C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, et al. (2022) Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487.
  • Y. Shi, V. De Bortoli, G. Deligiannidis, and A. Doucet (2022) Conditional simulation using diffusion Schrödinger bridges. arXiv preprint arXiv:2202.13460.
  • J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli (2015) Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265.
  • Y. Song and S. Ermon (2019) Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems 32.
  • Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole (2020) Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.
  • E. G. Tabak and C. V. Turner (2013) A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics 66 (2), pp. 145–164.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in Neural Information Processing Systems 30.
  • P. Vincent (2011) A connection between score matching and denoising autoencoders. Neural Computation 23 (7), pp. 1661–1674.
  • M. Welling and Y. W. Teh (2011) Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, pp. 681–688.
  • C. Winkler, D. Worrall, E. Hoogeboom, and M. Welling (2019) Learning likelihoods with conditional normalizing flows. arXiv preprint arXiv:1912.00042.
  • S. Wiqvist, J. Frellsen, and U. Picchini (2021) Sequential neural posterior and likelihood approximation. arXiv preprint arXiv:2102.06522.
  • S. N. Wood (2010) Statistical inference for noisy nonlinear ecological dynamic systems. Nature 466 (7310), pp. 1102–1104.

Appendix A Failure of direct application of conditional score modeling

The target distribution is given by p(θ | x_1, …, x_n) ∝ p(θ) ∏_{i=1}^n p(x_i | θ). A direct application of conditional score modeling yields the sequence of densities

p_t(θ̃ | x_1, …, x_n) = ∫ p(θ | x_1, …, x_n) q_t(θ̃ | θ) dθ.    (6)

It can be seen from the equation above that the score ∇_θ log p_t(θ | x_1, …, x_n) does not factorize in terms of the single-observation scores ∇_θ log p_t(θ | x_i), meaning that the corresponding score network would have to be trained using samples (θ, x_1, …, x_n), obtained by calling the simulator n times for a single sample θ ∼ p(θ). As mentioned in Section 1, this is one of the drawbacks of NPE methods that we seek to avoid.

A.1 Derivation of posterior factorization

The factorization for the posterior distribution is obtained by applying Bayes rule twice:

p(θ | x_1, …, x_n) ∝ p(θ) ∏_{i=1}^n p(x_i | θ)    (Bayes rule) (7)
= p(θ) ∏_{i=1}^n [ p(θ | x_i) p(x_i) / p(θ) ]    (Bayes rule) (8)
∝ p(θ)^(1−n) ∏_{i=1}^n p(θ | x_i).    (9)
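The factorization can also be sanity-checked numerically on a grid; the particular prior and likelihood below are arbitrary illustrative choices:

```python
import numpy as np

# Check p(theta | x_1..n) ∝ p(theta)^(1-n) * prod_i p(theta | x_i) on a 1D grid.
def normalize(p):
    return p / p.sum()

theta = np.linspace(-5.0, 5.0, 201)
prior = normalize(np.exp(-0.5 * theta**2))              # p(theta), discretized
xs = [0.7, -1.2, 2.0]                                   # illustrative observations
liks = [np.exp(-0.5 * (x - theta)**2) for x in xs]      # p(x_i | theta) on the grid

# Direct posterior: p(theta) * prod_i p(x_i | theta), normalized.
direct = normalize(prior * np.prod(liks, axis=0))

# Factorized form: p(theta)^(1-n) * prod_i p(theta | x_i), normalized.
n = len(xs)
single_posts = [normalize(prior * lik) for lik in liks] # p(theta | x_i)
factored = normalize(prior**(1 - n) * np.prod(single_posts, axis=0))
```

Both constructions agree up to floating-point error, since the normalization constants cancel.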

Appendix B Details for empirical evaluation

B.1 Score NPE

Our implementation of the score network has three blocks:

  • An MLP with 3 hidden layers that takes θ as input and outputs an embedding e_θ,

  • An MLP with 3 hidden layers that takes x as input and outputs an embedding e_x,

  • An MLP with 3 hidden layers that takes (e_θ, e_x, e_t) as input, where e_t is a positional embedding of the noise level t obtained as described by Vaswani et al. [34], and outputs the estimate for the score. (We parameterize the score in terms of the noise variables [18].)

All MLPs use residual connections throughout.

Running Algorithm 1 to generate samples using the learned score network requires choosing step sizes δ_t and the number of Langevin steps L for each noise level σ_t. We use step sizes that scale with the noise level, δ_t ∝ σ_t², and a fixed number of Langevin steps L per noise level. The same settings are used across all our experiments.
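The generic shape of such a schedule can be sketched as below: a geometric ladder of noise levels with step sizes scaling as σ_t². The specific constants (σ_min, σ_max, T, ε) are illustrative assumptions, not the values used in the experiments:

```python
import numpy as np

def noise_schedule(sigma_min=0.01, sigma_max=10.0, T=100, eps=1e-4):
    """Geometric ladder of noise levels and per-level Langevin step sizes,
    with delta_t proportional to sigma_t^2 (illustrative values only)."""
    sigmas = np.geomspace(sigma_min, sigma_max, T)
    deltas = eps * sigmas**2 / sigmas[-1]**2
    return sigmas, deltas
```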

B.2 Flow NPE

We use an implementation of NPE based on flows able to handle sets of observations of any size n, expressed as q_ψ(θ | x_1, …, x_n). Following Chan et al. [2] and Radev et al. [23, §2.4], we use an exchangeable neural network to process the observations x_1, …, x_n. Specifically, we use an MLP with 3 hidden layers to generate an embedding e_i for each observation x_i. We then compute the mean embedding across observations, ē = (1/n) Σ_i e_i, which we use as input for the conditional flow. Finally, we model the flow q_ψ(θ | ē, u_n), where u_n is an untrained embedding for the number of observations n. For the flow we use 4 Real NVP layers [6], each one consisting of MLPs with three hidden layers. As for the Score NPE method, we use residual connections throughout.

We train the flow via maximum likelihood using samples (θ, x_1, …, x_n) with varying set sizes n. At test time, this architecture can handle sets of observations of any size n.
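The exchangeable embedding step can be sketched as follows (untrained random weights and illustrative layer sizes, not the paper's architecture): embedding each observation with a shared MLP and mean-pooling makes the conditioning invariant to the ordering of the observations and defined for any set size.

```python
import numpy as np

rng = np.random.default_rng(5)

def mlp(sizes):
    """A tiny MLP with fixed random weights (illustration only; a real
    implementation would train these parameters)."""
    Ws = [rng.standard_normal((a, b)) / np.sqrt(a) for a, b in zip(sizes, sizes[1:])]
    def f(h):
        for W in Ws[:-1]:
            h = np.tanh(h @ W)
        return h @ Ws[-1]
    return f

embed = mlp([2, 64, 64, 16])   # shared per-observation embedding network

def set_embedding(xs):
    """Embed each observation and mean-pool: permutation-invariant by
    construction, and defined for any number of observations n."""
    return np.mean([embed(x) for x in xs], axis=0)
```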

Appendix C Alternative sampling method without unadjusted Langevin dynamics

This section gives the derivation for the sampling method shown in Algorithm 2. In short, the derivation uses the formulation of score-based methods as diffusions, and has 3 main steps: (1) using the scores of p_t(θ | x_i) to compute the Gaussian transition kernels of the corresponding diffusion process [18]; (2) composing the Gaussian transitions corresponding to the diffusions of p_t(θ | x_1), …, p_t(θ | x_n) (this is based on Sohl-Dickstein et al. [30]); and (3) adding a correction for the prior term (also based on Sohl-Dickstein et al. [30]). We note that steps 2 and 3 require some approximations. Despite this, our empirical evaluation indicates that the method works well in practice. We believe a thorough analysis of these approximations would be useful in understanding when this sampling method can be expected to work. For clarity, we use [A] to indicate when the approximations are introduced/used.

Connection between score-based methods and diffusion models

We begin by noting that score-based methods can be equivalently formulated as diffusion models, where the means of the Gaussian transitions that act as denoising steps are learned instead of the scores. Specifically, letting β_t be a noise schedule, α_t = 1 − β_t, and ᾱ_t = ∏_{s=1}^t α_s, for t = 1, …, T, the learned model is given by a sequence of Gaussian transitions p_ψ(θ_{t−1} | θ_t, x) = N(θ_{t−1}; μ_ψ(θ_t, x, t), β_t I) trained to invert a sequence of noising steps q(θ_t | θ_{t−1}) = N(θ_t; √α_t θ_{t−1}, β_t I). The connection between diffusion models and score-based methods comes from the fact that the optimal means and scores are linearly related [18]:

μ_ψ(θ_t, x, t) = (1 / √α_t) (θ_t + β_t s_ψ(θ_t, x, t)).    (11)
Approximately composing diffusions

To simplify notation, we use a superscript i to indicate distributions that are conditioned on x_i (e.g. q_t^i(θ) = p_t(θ | x_i)). Assume we have transition kernels p^i(θ_{t−1} | θ_t) that exactly reverse the forward kernels [A1], meaning that p^i(θ_{t−1} | θ_t) q_t^i(θ_t) = q(θ_t | θ_{t−1}) q_{t−1}^i(θ_{t−1}), or equivalently p^i(θ_{t−1} | θ_t) = q(θ_t | θ_{t−1}) q_{t−1}^i(θ_{t−1}) / q_t^i(θ_t). Our goal is to find a transition kernel p^{1:n}(θ_{t−1} | θ_t) that satisfies

∫ p^{1:n}(θ_{t−1} | θ_t) ∏_{i=1}^n q_t^i(θ_t) dθ_t = ∏_{i=1}^n q_{t−1}^i(θ_{t−1}),    (12)

where the products run over the n observations.³ It is straightforward to verify that the condition from Eq. 12 can be re-written as

∏_{i=1}^n q_{t−1}^i(θ_{t−1}) = ∫ [ q(θ_t | θ_{t−1})^(1−n) ∏_{i=1}^n p^i(θ_{t−1} | θ_t) ] ∏_{i=1}^n q_t^i(θ_t) dθ_t.    (14)

Then, one way to satisfy Eq. 14 is by setting p^{1:n} to be equal to the bracketed term above. That is,

p^{1:n}(θ_{t−1} | θ_t) = q(θ_t | θ_{t−1})^(1−n) ∏_{i=1}^n p^i(θ_{t−1} | θ_t).    (15)

However, the resulting p^{1:n} may not be a normalized distribution [30]. Following Sohl-Dickstein et al. [30], we propose to use the corresponding normalized distribution, defined as p̂^{1:n}(θ_{t−1} | θ_t) ∝ p^{1:n}(θ_{t−1} | θ_t) [A2]. Given that Eq. 15 corresponds to a product of Gaussian densities, the resulting normalized transition is also Gaussian, with mean and variance given by

μ = (Σ_{i=1}^n μ_ψ(θ_t, x_i, t) + (1 − n) √α_t θ_t) / (n + (1 − n) α_t),   Σ = (β_t / (n + (1 − n) α_t)) I,    (16)

where each μ_ψ(θ_t, x_i, t) is obtained using Eq. 11.

³ This is closely related to our formulation in Section 2, since our definition for the bridging densities involves the product ∏_i p_t(θ | x_i).
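Under Gaussian reverse kernels N(μ_i, β_t I) and forward kernels N(√α_t θ_{t−1}, β_t I) (a reconstruction; the exact parameterization may differ), completing the square gives the composed transition's mean and variance:

```python
import numpy as np

def composed_transition(mus, theta_t, alpha_t, beta_t):
    """Mean and (scalar) variance of the normalized composed transition:
    the product of n Gaussian reverse kernels N(mu_i, beta_t I) with the
    (1 - n) power of the forward kernel, viewed as a density over theta_{t-1}."""
    n = len(mus)
    denom = n + (1 - n) * alpha_t
    var = beta_t / denom
    mean = (np.sum(mus, axis=0) + (1 - n) * np.sqrt(alpha_t) * theta_t) / denom
    return mean, var
```

For n = 1 the composition reduces to a single reverse kernel, and in general the returned mean is the mode of the underlying product of Gaussian factors.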

Prior correction term

The formulation above ignores the fact that the bridging densities defined in Eq. 3 involve the prior term p(θ)^(1−n). We use the method proposed by Sohl-Dickstein et al. [30] to correct for this, which involves adding the term (1 − n) Σ ∇_θ log p(θ_t) to the mean from Eq. 16. (The derivation for this is similar to the one above, and also requires setting the resulting transition kernel to the normalized version of an unnormalized distribution [30].)

As mentioned previously, this derivation uses two assumptions/approximations. [A1] assumes that the learned score function/reverse diffusion approximately reverses the noising process, which is reasonable if the forward kernels add small amounts of noise per step (equivalently, if the noise levels increase slowly). [A2] assumes that the normalized version of p^{1:n}, given by p̂^{1:n}, approximately satisfies Eq. 14.