1 Introduction
The goal of Bayesian analysis is to infer the underlying characteristics of some natural process of interest given observable manifestations . In a Bayesian setting, we assume that we already posses sufficient understanding of the forward problem, that is, a suitable model of the mechanism that generates observations from a given configuration of the hidden parameters . This knowledge can be available in two forms. In likelihoodbased approaches, we are able to evaluate the likelihood function analytically or numerically for any pair . In contrast, likelihoodfree approaches only require the ability to sample from the likelihood. This can be done either directly or, equivalently, by simulating data from some forward model, which generates instances of the observables via a deterministic function of parameters and independent noise :
(1) 
In this case, the likelihood function is only defined implicitly, and we cannot calculate its actual numeric value for the resulting instances , which, in turn, prohibits standard statistical inference. Likelihoodfree problems arise, for example, when is not available in closedform, or when the forward process is defined by a stochastic differential equation, a computer simulation, or a complicated algorithm [klinger2018pyabc, usher2001time, turner2014generalized, wood2010statistical]. In this paper, we propose a new Bayesian solution to the likelihoodfree setting in terms of invertible neural networks.
Bayesian modeling leverages the available knowledge about the forward problem to get the best possible estimate of the posterior distribution of the inverse problem
In Bayesian inference, the posterior encodes all information about obtainable from a set of observations . The observations are assumed to arise from runs of the forward process with fixed, but unknown, true parameters . Bayesian inverse modeling is challenging for three reasons:

The RHS of Bayes’ formula above is always intractable in the likelihoodfree case and must be approximated.

The forward process is usually nondeterministic, so that there is intrinsic uncertainty about the true value of .

The forward process is typically not informationpreserving, so that there is ambiguity among possible values of .
The standard solution to these problems is offered by approximate Bayesian computation (ABC) methods [sunnaaker2013approximate, csillery2010approximate, park2016k2, turner2014generalized]. ABC methods approximate the posterior by repeatedly sampling parameters from a proposal (prior) distribution and then simulating multiple data sets by running the forward model for . If the resulting data set is sufficiently similar to the actually observed data set , the corresponding is retained as a sample from the desired posterior, otherwise rejected. Stricter similarity criteria lead to more accurate approximations of the desired posterior at the price of higher and oftentimes prohibitive rejection rates.
More efficient methods for approximate inference, such as sequential Monte Carlo (ABCSMC), MarkovChain Monte Carlo variants
[sisson2011likelihood], or neural density estimation methods [greenberg2019automatic, papamakarios2018sequential, lueckmann2017flexible], optimize sampling from a proposal distribution in order to balance the speedaccuracy tradeoff of vanilla ABC methods. More details about these methods can be found in the section Related Work.All sampling methods described above operate on the level of individual data sets: For each observation sequence , the entire estimation procedure for the posterior must be run again from scratch. Therefore, we refer to this approach as casebased inference. Running estimation for each individual data set separately stands in contrast to amortized inference, where estimation is split into a potentially expensive upfront training phase, followed by a much cheaper inference phase. The goal of the upfront training phase is to learn an approximate posterior that works well for any observation sequence . Evaluating this model for specific observations is then very fast, so that the training effort amortizes over repeated evaluations. The breakeven between amortized and casebased inference depends on the application and model types, and we will report comparisons in the experimental section. Our main aim in this paper, however, is the introduction of a general approach to amortized Bayesian inference and the demonstration of its excellent accuracy in posterior estimation for a variety of popular forward models.
To make amortized inference feasible in practice, it must work well for arbitrary data set sizes . Depending on data acquisition circumstances, the number of available observations for a fixed model parameter setting may vary considerably, ranging from to several hundreds and beyond. This has not only consequences for the required architecture of our density approximators, but also for their behavior: They must exhibit correct posterior contraction. Accordingly, the estimated posterior should get sharper (i.e., more peaked) as the number
of available observations increases. In the simplest case, the posterior variance should decrease at rate
, but more complex behavior can occur for difficult (e.g. multimodal) true posteriors .We incorporate these considerations into our method by integrating two separate deep neural networks modules (detailed in the Methods section; see also Figure 1), which are trained jointly on simulated data from the forward model: a summary network and an invertible network.
The summary network is responsible for reducing a set of observations
of variable size to a fixedsize vector of
learned summary statistics. In traditional likelihoodfree approaches, the method designer is responsible for selecting suitable statistics for each application a priori [mestdagh2019prepaid, mertens2018abrox, raynal2018abc, sunnaaker2013approximate]. In contrast, our summary networks learn the most informative statistics directly from data, and we will show experimentally (see Experiment 3.7) that these statistics are superior to manually constructed ones. Summary networks differ from standard feedforward networks because they should be independent of the input size and respect the inherent functional and probabilistic symmetries of the data. For example, permutation invariant networks are required for i.i.d. observations [bloem2019probabilistic], and recurrent networks [gers1999learning] or convolutional networks [long2015fully] for data with temporal or spatial dependencies.The invertible network is responsible for learning the true posterior of model parameters given the summary statistics of the observed data. Since it sees the data only through the lens of the summary network, all symmetries captured by the latter are automatically inherited by the posterior. Invertible neural networks are based on the recent theory and applications of normalizing flows [ardizzone2019guided, kingma2018glow, grover2018flow, dinh2016density, kingma2017improving]. Flowbased methods can perform asymptotically exact inference and scale favourably from simple lowdimensional problems to highdimensional distributions with complex dependencies, for instance, the pixels of an image [kingma2018glow]. For each application/model of interest, we train an invertible network jointly with a corresponding summary network using simulated data from the respective known forward process with reasonable priors. After convergence of this forward training, the network’s invertibility ensures that a model for the inverse process is obtained for free, simply by running inference through the model backwards. Thus, our networks can perform fast amortized Bayesian inference on arbitrary many data sets from a given application domain without expensive casebased optimization. We call our method BayesFlow, as it combines ideas from Bayesian inference and flowbased deep learning.
BayesFlow draws on major advances in modern deep probabilistic modeling, also referred to as deep generative modeling [bloem2019probabilistic, kingma2018glow, ardizzone2018analyzing, kingma2014auto]. A hallmark idea in deep probabilistic modeling is to represent a complicated target distribution as a nonlinear bijective transformation of some simpler latent distribution (e.g., Gaussian or uniform), a so called pushforward
. Density estimation of the target distribution, a very complex problem, is thus reduced to learning a nonlinear transformation, a task that is ideally suited for gradientbased neural network training via standard backpropagation (see
(a)). During the inference phase, samples from the target distribution are obtained by sampling from the simpler latent distribution and applying the inverse transformation learned during the training phase (see (b)). Using this approach, recent applications of deep probabilistic models have achieved unprecedented performance on hitherto intractable highdimensional problems [bloem2019probabilistic, kingma2018glow, grover2018flow].In the context of Bayesian inference, the target distribution is the posterior
of model parameters given observed data. We leverage the fact that we can simulate arbitrarily large amounts of training data from the forward model in order to ensure that the summary and invertible networks approximate the true posterior as well as possible. During the inference phase, our model can either numerically evaluate the posterior probability of any candidate parameter
, or can generate a posterior sample of likely parameters for the observed data . In the Methods section, we show that our networks indeed sample from the correct posterior under perfect convergence. In summary, the contributions of our BayesFlow method are the following:
Globally amortized approximate Bayesian inference with reusable invertible neural networks;

Learning maximally informative summary statistics from raw data sets with variable number of observations instead of relying on restrictive handcrafted summary statistics;

Asymptotic theoretical guarantee for sampling from the true posterior distribution without assumptions on the shapes of the posterior or prior;

Parallel computations applicable to both forward simulations and neural network optimization;
To illustrate the utility of BayesFlow, we first apply it to two toy models with analytically tractable posteriors. The first is a multivariate Gaussian with a full covariance matrix and a unimodal posterior. The second is a Gaussian mixture model with a multimodal posterior. Then, we present applications to challenging models with intractable likelihoods from population dynamics, cognitive science, epidemiology, and ecology and demonstrate the utility of BayesFlow in terms of speed, accuracy of recovery, and probabilistic calibration. Across the examples, we introduce multiple tools to validate the performance of our method.
1.1 Related Work
BayesFlow incorporates ideas from previous machine learning and deep learning approaches to likelihoodfree inference
[mertens2019deep, radev2019towards, mestdagh2019prepaid, raynal2018abc, jiang2017learning]. The most common approach has been to cast the problem of parameter estimation as a supervised learning task. In this setting, a large data set of the form
is created by repeatedly sampling from and simulating an artificial datasets by running with the sampled parameters. Usually, the dimensionality of the simulated data is reduced by computing summary statistics with a fixed summary function. Then, a supervised learning algorithm (e.g., random forest
[raynal2018abc], or a neural network [radev2019towards]) is trained on the summary statistics of the simulated data to output an estimate of the true data generating parameters. Thus, an attempt is made to learn the intractable inverse model. A main shortcoming of supervised approaches is that they provide only limited information about the posterior (e.g., pointestimates, quantiles or variance estimates) or impose restrictive distributional assumptions on the shape of the posterior (e.g., normality).
Our ideas are also closely related to the concept of optimal transport maps and its application in Bayesian inference [detommaso2019hint, parno2018transport, chen2018fast, bigoni2019greedy]. A transport map defines a transformation between (probability) measures which can be constructed in a way to warp
a simple probability distribution into a more complex one. In the context of Bayesian inference, transport maps have been applied to accelerate MCMC sampling
[parno2018transport], to perform sequential inference [detommaso2019hint], and to solve inference problems via direct optimization [bigoni2019greedy]. In fact, BayesFlow can be viewed as a parameterization of invertible transport maps via invertible neural networks. An important distinction is that BayesFlow does not require an explicit likelihood function for approximating the target posteriors and is capable of amortized inference.Similar ideas for likelihoodfree inference are incorporated in the recent automatic posterior transformation (APT) [greenberg2019automatic], and the sequential neural likelihood (SNL) [papamakarios2018sequential]
methods. APT iteratively refines a proposal distribution via masked autoregressive flow (MAF) networks to generate parameter samples which closely match a particular observed data set. SNL, in turn, trains a masked autoencoder density estimator (MADE) neural network within an MCMC loop to speedup convergence to the true posterior. A shared feature of these methods is that the neural density estimators involved need to be retrained for inferring the parameters on each individual observed data set. In contrast, we propose to learn the posterior globally over the entire range of plausible parameter values by employing a conditional invertible neural network (cINN) estimator. Previously, INNs have been successfully employed to model data from astrophysics and medicine
[ardizzone2018analyzing]. We adapt the model to suit the task of parameter estimation in the context of mathematical modeling (see Figure 1 for a full graphical illustration of BayesFlow) and develop a reusable probabilistic architecture for fully Bayesian and globally amortized inference on complex mathematical models.2 Methods
2.1 Notation
In the following, the number of parameters of a mathematical model will be denoted as , and the number of observations in a data set as . We denote data simulated from the mathematical model of interest as , where each individual can represent a scalar or a vector. Observed or test data will be marked with a superscript (i.e., ). . The parameters of a mathematical model are represented as a vector , and all trainable parameters of the invertible and summary neural networks as and , respectively. When a data set consists of observations over a period of time, the number of observations will be denotes as .
2.2 Learning the Posterior
Assume that we have an invertible function , parameterized by a vector of parameters , for which the inverse exists. For now, consider the case when raw simulated data of size is entered directly into the invertible network without using a summary network. Our goal is to train an invertible neural network which approximates the true posterior as accurately as possible
(2) 
for all and . We reparameterize the approximate posterior in terms of a conditional invertible neural network (cINN) which implements a normalizing flow between and a Gaussian latent variable :
(3) 
Accordingly, we need to ensure that outputs of follow the target posterior . Thus, we seek neural network parameters which minimize the KullbackLeibler (KL) divergence between the true and the modelinduced posterior for all possible data sets . Therefore, our objective becomes:
(4)  
(5)  
(6)  
(7) 
Note, that the log posterior density can be dropped from the optimization objective in Eq.6, as it does not depend on the neural network parameters . In other words, we seek neural network parameters which maximize the posterior probability of datagenerating parameters given observed data for all and . Since by design, the change of variable rule of probability yields:
(8) 
Thus, we can rewrite our objective as:
(9)  
(10) 
where we have abbreviated (the Jacobian of evaluated at and ) as . Due to the architecture of our cINN, the is easy to compute (see next section for details).
Utilizing simulations from the forward process (Eq.1), we can approximate the expectations by minimizing the MonteCarlo estimate of the negative of Eq.10. Accordingly, for a batch of simulated data sets and datagenerating parameters we have:
(11)  
(12)  
(13) 
We treat Eq.13
as a loss function
which can be minimized with any stochastic gradient descent method. The first term follows from Eq.
12due to the fact that we have prescribed a unit Gaussian distribution to
and represents the negative log of . The second term controls the rate of volume change induced by the learned nonlinear transformation from to achieved by . Thus, minimizing Eq.13 ensures that follows the prescribed unit Gaussian.Our cINN is perfectly converged when for all possible . This, in turn, implies equality in distributions . Following a similar line of reasoning as in [ardizzone2018analyzing], we can prove that samples from a perfectly converged cINN follow the true posterior .
Proposition 1.
Assume that we have a perfectly converged cINN and an arbitrary but fixed observation . If transforms a probability density into , then the inverse transforms back into . Therefore, repeatedly sampling and applying to each yields samples from .
Proof.
Denote the conditional probability density of obtained by inverse sampling as . We need to show that . Applying the change of variable formula to the inverse transformation, we obtain:
(14) 
Correspondingly, for the forward transformation we have:
(15) 
Substituting for into Eq.14 leads to:
(16)  
(17)  
(18) 
which is true for all possible due to the assumption of perfect convergence, that is, . This completes the proof. ∎
We now generalize our formulation to data sets with arbitrary numbers of observations. If we let the number of observations vary and train a summary network together with the cINN, our main objective changes to:
(19) 
and its Monte Carlo estimate to:
(20) 
In order to make estimation of tractable, we assume that there exists a vector of sufficient statistics that captures all information about contained in in a fixedsize representation. For to be a useful estimator for , both should convey the same information about , as measured by the mutual information:
(21) 
where the mutual information between and (analogously, between and ) is defined as:
(22) 
Since we do not know , we can enforce this requirement only indirectly by minimizing the Monte Carlo estimate of Eq.19. The following proposition states that, under perfect convergence, samples from a cINN still follow the true posterior given the outputs of a summary networks.
Proposition 2.
Assume that we have a perfectly converged cINN and a perfectly converged summary network . Assume also, that there exists a vector of sufficient summary statistics for . Then, repeatedly sampling and applying to each yields samples from .
Proof.
Perfect convergence of the networks under Eq.18 implies . This, in turn, implies that , because a perfect match of the densities would be impossible if contained less information about than . Therefore, the proof reduces to that of Proposition 1. Note, that whenever the KL divergence is driven to a minimum, is a maximally informative statistic [deans2002maximally]. ∎
In summary, the approximate posteriors obtained by the BayesFlow method are correct if the summary and invertible networks are perfectly converged. In practice, however, perfect convergence is unrealistic, and there are three sources of error which can lead to incorrect posteriors. The first is the Monte Carlo error introduced by using simulations from to approximate the expectation in Eq.19. The second is due to a summary network which may not fully capture the relevant structure of the data or when sufficient summary statistics do not exist. The third is due to an invertible network which does not accurately transform the true posterior into the prescribed Gaussian latent space. Even though we can mitigate the Monte Carlo error by running the simulator for a longer time, the latter two can be harder to detect and alleviate in a principled way. Nevertheless, recent work on probabilistic symmetry [bloem2019probabilistic] and algorithmic alignment [xu2019what] can provide some guidelines on how to choose the right summary network for a particular problem. Additionally, the number as well as the structure of ACBs in an invertible chain can be tuned to increase the expressiveness of the learned transformation from space to space. The benefits of neural network depth has been confirmed both in theory [lin2018resnet] and practice [goodfellow2016deep], so we expect better performance in complex settings with increasing number of ACBs.
2.3 Composing Invertible Networks
The basic building block of our cINN is the affine coupling block (ACB) [dinh2016density]. Each ACB consists of four separate fully connected neural networks denoted as . An ACB performs an invertible nonlinear transformation, which means that in addition to a parametric mapping it also learns the inverse mapping for free. Denoting the input vector of as and the output vector as , it follows that and . Invertibility is achieved by splitting the input vector into two parts with and and performing the following operations on the split input:
(23)  
(24) 
where denotes elementwise multiplication. The outputs are then concatenated again and passed to the next ACB. The inverse operation is given by:
(25)  
(26) 
This formulation ensures that the Jacobian of the affine transformation is a strictly upper or a lower triangular matrix and therefore its determinant is very cheap to compute. Furthermore, the internal functions can be represented by arbitrarily complex neural networks, which themselves need not be invertible, since they are only ever evaluated in the forward direction during both the forward and the inverse pass through the ACBs. In our applications, we parameterize the internal functions as fully connected neural networks with exponential linear units (ELU).
In order to ensure that the neural network architecture is expressive enough to represent complex distributions, we chain multiple ACBs, so that the output of each ACB becomes the input to the next one. In this way, the whole chain remains invertible from the first input to the last output and can be viewed as a single function parameterized by trainable parameters .
In our applications, the input to the first ACB is the parameter vector , and the output of the final ACB is a dimensional vector representing the nonlinear transformation of the parameters. As described in the previous section, we ensure that follows a unit Gaussian distribution via optimization, that is, . Fixed permutation matrices are used before each ACB to ensure that each axis of the transformed parameter space encodes information from all components of .
In order to account for the observed data, we feed the learned summary vectors into each internal network of each ACB (explained shortly). Intuitively, in this way we realize the following process: the forward pass maps datagenerating parameters to space using conditional information from the data , while the inverse pass maps data points from space to the datagenerating parameters of interest using the same conditional information.
2.4 Summary Network
Since the number of observations usually varies in practical scenarios (e.g., different number of measurements or time points) and since data sets might exhibit various redundancies, the cINN can profit from some form of dimensionality reduction. As previously mentioned, we want to avoid information loss through restrictive handcrafted summary statistics and, instead, learn the most informative summary statistics directly from data. Therefore, instead of feeding the raw simulated or observed data to each ACB, we pass the data through an additional summary network to obtain a fixedsized vector of learned summary statistics .
The architecture of the summary network should be aligned with the structure of the observed data. An obvious choice for time seriesdata is an LSTMnetwork [gers1999learning], since recurrent networks can naturally deal with long sequences of variable size. Another choice might be a 1D fully convolutional network [long2015fully], which has already been applied in the context of likelihoodfree inference [radev2019towards]. A different architecture is needed when dealing with i.i.d. samples of variable size. Such data are often referred to as exchangeable, or permutation invariant, since changing the order of individual elements does not change the associated likelihood or posterior. In other words, if is an arbitrary permutation of elements, the following should hold for the posterior:
(27) 
Following [bloem2019probabilistic], we encode probabilistic permutation invariance by implementing a permutation invariant function through an equivariant nonlinear transformation followed by a pooling operator (e.g., sum or mean) and another nonlinear transformation:
(28) 
where and are two different fully connected neural networks. In practice, we stack multiple equivariant and invariant functions into an invariant network in order to achieve higher expressiveness [bloem2019probabilistic]. Additionally, we use an attention mechanism which computes learnable weights to each individual sample:
(29)  
(30)  
(31) 
where Eq.30 performs an elementwise softmax operation to ensure that the weight vectors sum to 1 across each dimension. Note, that the function remains permutation invariant. The computation of attention weights is optional and introduces some computational overhead, but we observe much faster convergence and better performance when the summary network implements it.
We optimize the parameters of the summary network jointly with those of the cINN chain via backpropagation. Thus, the method remains completely endtoend and is capable of generalizing to data of variable size (within a given domain, by applying the same network) and structure (between different domains, by changing the architecture of the summary network).
To incorporate the observed or simulated data , each of the internal networks of each ACB is augmented to take the learned summary vector of the data as an additional input. The output of each ACB then becomes:
(32)  
(33) 
Thus, a complete pass through the entire conditional invertible chain can be expressed as together with the inverse operation .
2.5 Putting It All Together
Algorithm 1 describes the essential steps of the BayesFlow method using an arbitrary summary network and employing an online learning approach.
The backpropagation algorithm works by computing the gradients of the loss function with respect to the parameters of the neural networks and then adjusting the parameters, so as to drive the loss function to a minimum. We experienced no instability or convergence issues during training with the loss function given by Eq.19. Note, that steps 210 and 1317 of Algorithm 1 can be executed in parallel with GPU support in order to dramatically accelerate convergence and inference. Moreover, steps 1317 can be applied in parallel to an arbitrary number of observed data sets after convergence of the networks.
In what follows, we apply BayesFlow to two toy models with a unimodal and multimodal posteriors, respectively, and then use it to perform Bayesian inference on challenging models from population dynamics, cognitive science, epidemiology, and ecology.^{1}^{1}1Code and simulation scripts for all current applications are available at https://github.com/stefanradev93/cINN. We deem these models suitable for an initial validation, since they differ widely in the generative mechanisms they implement and the observed data they model. Therefore, good performance on these disparate examples underlines the broad empirical utility of the BayesFlow method. Details for models’ setup can be found in Appendix B.
3 Experiments
3.1 Training the Networks
We train all invertible and summary networks described in this paper jointly via backpropagation. For all following experiments, we use the Adam optimizer with a starter learning rate of and an exponential decay rate of . We perform to
iterations (i.e., minibatch update steps) for each experiment, and report the results obtained by the converged networks. Note, that we did not perform an extensive search for optimal values of network hyperparameters, but use a default BayesFlow with 5 to 10 ACBs and a summary vector of size
for all examples in this paper (see Appendix B for more details on summary network architectures). All networks were implemented in Python using the TensorFlow library [abadi2016tensorflow] and trained on a singleGPU machine equipped with NVIDIA^{®} GTX1060 graphics card. Regarding the data generation step, we take an approach which incorporates ideas from online learning [mnih2015human] where data are stimulated by Eq.1 on demand.Correspondingly, a data set , or a batch of data sets , is generated on the fly and then passed through the neural network. This training approach has the advantage that the network never experiences the same input data twice. Moreover, training can continue as long as the network keeps improving (i.e., the loss keeps decreasing), since overfitting in the classical sense is nearly impossible. However, if simulations are computationally expensive and researchers need to experiment with different networks or training hyperparameters, it might be beneficial to store and reuse simulations, since simulation and training in online learning are tightly intertwined.
Once the networks have converged, we store the trained networks and reuse them to perform amortized inference on a separate validation set of data sets. The pretrained networks can also be shared among a research community so that multiple researchers/labs can benefit from the amortization of inference.
3.2 Performance Validation
To evaluate the performance of BayesFlow in the following application examples, we consider a number of different metrics:

Normalized root mean squared error (NRMSE)  to asses accuracy of pointestimates in recovering groundtruth parameter values;

Coefficient of determination ()  to asses the proportion of variance in groundtruth parameters that is captured by the point estimates;

Resimulation error (
)  to asses the predictive mismatch between the true data distribution and the data distribution generated with the estimated parameters (i.e., posterior predictive check);

Calibration error (, [ardizzone2018analyzing]
)  to asses the coverage of the approximate posteriors (i.e., whether credibility intervals are indeed credible);

Simulationbased calibration (SBC, [talts2018validating])  to visually detect systematic biases in the approximate posteriors;
Details for computing all metrics can be found in Appendix A.
3.3 Proof of Concept: Multivariate Normal Distribution
As a proofofconcept, we apply the BayesFlow method to recover the posterior mean vector of a toy multivariate normal (MVN) example. For a dimensional MVN vector, the forward model is given by:
(34)  
(35) 
If the covariance matrix is known, the posterior of the mean vector has a closedform which is also a MVN with posterior precision matrix given by and posterior mean given by [bolstad2016introduction]. We can thus generate multiple batches of the form and pass them directly through an invertible network. Since the groundtruth posterior is Gaussian, we can compute the KL divergence as a measure of mismatch between the true and approximate posteriors in closed form:
(36)  
(37) 
where and denote the covariance matrix and mean vector computed from samples obtained by BayesFlow. We run three experiments with where the size of the ACB blocks was doubled for each successive . To asses results, we compute the and NRMSE between approximate and true means as well as the KL divergence between approximate and true distributions on 100 test datasets. To compute the approximate covariance matrix, we draw 5000 samples from the approximate posteriors for and and 50000 samples for .
The KL divergence for the 5D and 50D MVNs reached 0 after 23 epochs of 1000 iterations indicating perfect recovery of the true posteriors. The results on the 5D MVN are illustrated in
(a). We observe that BayesFlow is able to capture the true covariance of the posterior. The same applies for the 50D model. The KL divergence for the 500D MVN model reached 0.37 after 50 epochs, indicating very good approximation with respect to the high dimensionality of the problem. (b) depicts the recovery of the true posterior means. We observe low NRMSE scores and high across all 500 mean parameters (xaxes).3.4 Multimodal Posterior  Gaussian Mixture Model
In order to test whether the BayesFlow method can recover multimodal posteriors, we apply it to a generative Gaussian mixture model (GMM). Multimodal posteriors arise in practice, for example, when forward models are defined as mixtures between different processes, or when models exhibit large multivariate tradeoffs in their parameter space (e.g., there are multiple separate regions of posterior density with plausible parameter values). Therefore, it is important to show that our method is able to capture such behavior and does not suffer from mode collapse.
Following [ardizzone2018analyzing], we construct a scenario in which the observed data
is a onehot encoded vector representing one of the labels
red, green, blue, or yellow (hard assignment of labels). The parameters are the 2D coordinates ^{2}^{2}2Note that this is not the typical GMM setup, as we construct the example such that the mixture assignments (labels) are observed and the data coordinates are the latent parameters. of points drawn from a mixture of eight Gaussian clusters with centers distributed around the origin in a clockwise manner and unit variance (Figure 3). The first four clusters are assigned the label red, the next two the label green, and the remaining two the labels blue and yellow. The posterior is composed of the clusters indexed by the corresponding label. We perform the experiment multiple times by increasing the depth of the BayesFlow starting from 1 ACB block up to 5 ACB blocks. In this way, we can investigate the effects of cINN depth on the quality of the approximate multimodal posteriors. We train each BayesFlow for 50 epochs and draw 8000 samples from the approximate posteriors obtained by the trained models.Results for all BayesFlows are depicted in Figure 3. We observe that approximations profit from having a deeper cINN chain, with cluster separation becoming clearer when using more ACBs. This confirms that our method is capable of recovering multimodal posteriors.
3.5 Stochastic TimeSeries Model  The Ricker Model
In the following, we estimate the parameters of a wellknown discrete stochastic population dynamics model [wood2010statistical]. With this example, we are pursuing several goals: First, we want to demonstrate that the BayesFlow method is able to accurately recover the parameters of an actual model with intractable likelihood by learning summary statistics from raw data. Second, we show that BayesFlow can deal adequately with parameters that are completely unrelated to the data by reducing estimates to the corresponding parameters’ prior. Third, we compare the global performance of the BayesFlow method to that of related methods capable of amortized likelihoodfree inference. Finally, we demonstrate the desired posterior contraction and improvement in estimation with increasing number of observations.
Discrete population dynamics models describe how the number of individuals in a population changes over discrete units of time [wood2010statistical]. In particular, the Ricker model describes the number of individuals in generation as a function of the expected number of individuals in the previous generation by the following nonlinear equations:
(38)  
(39)  
(40) 
for where is the expected number of individuals at time , is the growth rate, is a scaling parameter and is random Gaussian noise. The likelihood function for the Ricker model is not available in closed form, and the model is known to exhibit chaotic behavior [mestdagh2019prepaid]. Thus, it is a suitable candidate for likelihoodfree inference. The parameter estimation task consists of recovering from the observed onedimensional timeseries data where each .
What if the data does not contain any information about a particular parameter? In this case, any good estimation method should detect this, and return the prior of the particular parameter. To test this, we append a random uniform variable to the parameter vector and train BayesFlow with this additional dummy parameter. We expect that the networks ignore this dummy parameter, that is, we assume that the estimated posterior of resembles the uniform prior.
We compare the performance of BayesFlow to the following recent methods capable of amortized likelihoodfree inference: conditional variational autoencoder (cVAE) [mishra2018generative], cVAE with autoregressive flow (cVAEIAF) [kingma2017improving], DeepInference
with heteroscedastic loss
[radev2019towards], approximate Bayesian computation with an LSTM neural network for learning informative summary statistics (ABCNN) [jiang2017learning] and quantile random forest (ABCRF) [raynal2018abc]. For training the models, we simulate timeseries from the Ricker model with varying lengths. The number of time pointsis drawn from a uniform distribution
at each training iteration. All neural network methods were trained for epochs with iterations each on simulated data from the Ricker model. The ABCRF method was fitted on a reference table with datasets, since the method does not allow for online learning and increasing the reference table did not seem to improve performance. In order to avoid using handcrafted summary statistics for the ABCRF method, we input summary vectors obtained by applying the summary network trained jointly with the cINN. Thus, the ABCRF method has the advantage of using maximally informative statistics as input. We validate the performance of all methods on an independent test set of 500 datasets generated with . We report performance metrics for each method and each parameter in Table 1.

Posterior contraction in terms of posterior standard deviation for each parameter across increasing number of available generations (shaded regions indicate bootstrap 95% CIs).
Parameters and seem to be well recoverable by all methods considered here. The parameter turns out to be harder to estimate, with BayesFlow and the ABCNN method performing best. Further, BayesFlow performs very well across all parameters and metrics. Importantly, the calibration error obtained by BayesFlow is always low, indicating that the shape of the approximate posterior closely matches that of the true posteriors. Variational methods (cVAE, cVAEIAF) experience some problems recovering the posterior of . The ABCNN and ABCRF methods seem to recover point estimates with high accuracy but the approximate posteriors of the former exhibit relatively high calibration error. The ABCRF method can only estimate posterior quantiles, so no comparable calibration metric could be computed.
Further results are depicted in Figure 4. Inspecting the full posteriors obtained by all methods on an example test dataset, we note that only BayesFlow and the ABCNN methods are able to recover the uninformative posterior distribution of the dummy noise variable ((a)). Moreover, the importance of a Bayesian treatment of the Ricker model becomes clear when looking at the posteriors of . On most test datasets, the posterior density spreads over the entire prior range (high posterior variance) indicating large uncertainty in the obtained estimates. Some datasets even resulted in bimodal posteriors for , which highlight the importance of imposing no ad hoc restrictions on the posteriors. We also observe that parameter estimation with BayesFlow becomes increasingly accurate when more time points are available ((b)). Parameter recovery is especially good with the maximum number of time points (see (c)). Finally, ((d)) reveals a notable posterior contraction across increasing number of time points available to the summary network.
BayesFlow  cVAE  cVAEIAF  DeepInference  ABCNN  ABCRF  

0.017 0.007  0.014 0.007  0.058 0.017  0.122 0.016  0.164 0.015    
0.013 0.007  0.419 0.011  0.382 0.013  0.184 0.021  0.119 0.014    
0.084 0.018  0.121 0.017  0.188 0.018  0.111 0.019  0.283 0.012    
0.041 0.002  0.047 0.004  0.047 0.006  0.052 0.003  0.053 0.003  0.044 0.004  
0.077 0.005  0.137 0.004  0.124 0.006  0.108 0.004  0.077 0.004  0.081 0.005  
0.018 0.001  0.016 0.002  0.019 0.002  0.019 0.002  0.033 0.002  0.021 0.001  
0.980 0.003  0.973 0.005  0.973 0.007  0.968 0.005  0.966 0.004  0.977 0.004  
0.919 0.011  0.745 0.020  0.792 0.020  0.841 0.014  0.919 0.010  0.912 0.011  
0.996 0.001  0.997 0.001  0.996 0.001  0.996 0.001  0.986 0.002  0.994 0.001  
  0.038 0.001  0.041 0.001  0.042 0.001  0.041 0.001  0.048 0.002  0.041 0.002 

Note: For each parameter, bootstrapped means (standard error) of different performance metrics are displayed for all tested methods. For each metric and each parameter, the best performance across methods is printed in bold font.
3.6 A Model of Perceptual Decision Making  The LévyFlight Model
In the following, we estimate the parameters of a stochastic differential equation model of human decision making. We perform the first Bayesian treatment of the recently proposed LévyFlight Model (LFM), as its intractability has so far rendered traditional nonamortized Bayesian inference methods prohibitively slow [voss2019sequential].
With this example, we first want to show empirically that BayesFlow is able to deal with data sets of variable size arising from independent runs of a complex stochastic simulator. For this, we inspect global performance of BayesFlow over a wide range of data set sizes. Additionally, we want to show the advantage of amortized inference compared to casebased inference in terms of efficiency and recovery. For this, we apply BayesFlow along with four other recent methods for likelihoodfree inference to a single data set and show that in some cases the speed advantage of amortized inference becomes noticeable even after as few as 5 data sets. Crucially, researchers often fit the same models to different datasets, so if a pretrained model exists, it would present a huge advantage in terms of efficiency and productivity.
We focus on the family of evidence accumulator models (EAMs) which describe human decision making by a set of neurocognitively motivated parameters [ratcliff2008diffusion]
. EAMs are most often applied to choice reaction times (RT) data to obtain an estimate of the underlying processes governing categorization and (perceptual) decision making. In its most general formulation, the forward model of EAMs takes the form of a stochastic ordinary differential equation (ODE) given by
[usher2001time]:(41) 
where denotes a change in activation of an accumulator, denotes the average speed of information accumulation (often termed the drift rate), and represents a stochastic additive component, usually modeled as coming from a Gaussian distribution centered around : .
EAMs are particularly amenable for likelihoodfree inference, since the likelihood of most interesting members of this model family turn out to be intractable [miletic2017parameter]. This intractability has precluded many interesting applications and empirically driven model refinements. Here, we apply BayesFlow to estimate the parameters of the recently proposed LévyFlight Model (LFM) [voss2019sequential]. The LFM assumes an stable noise distribution of the evidence accumulation process which allows to model discontinuities in the decision process. However, the inclusion of stable noise (instead of the typically assumed Gaussian noise) leads to a model with intractable likelihood:
(42)  
(43) 
where
controls the probability of outliers in the noise distribution. The LFM has three additional parameters: the threshold
determining the amount of evidence needed for the termination of a decision process; a relative starting point, , determining the amount of starting evidence available to the accumulator before the actual decision alternatives are presented; and an additive nondecision time .During training of the networks, we simulate response times data from two experimental conditions with two different drift rates, since such a design is often encountered in psychological research. The parameter estimation task is thus to recover the parameters from twodimensional i.i.d. RT data where each represents RTs obtained in the two conditions. The number of trials is drawn from a uniform distribution at each training iteration. Training the networks took a little less than a day with the online learning approach. Inference on datasets with posterior samples per parameter took approximately seconds.
In order to investigate whether amortized inference is advantageous for this model, we additionally apply a version of the SMCABC algorithm available in the pyABC package [klinger2018pyabc] to a single data set with . Since no sufficient summary statistics are available for EAM data, we apply the maximum mean discrepancy (MMD) metric as a distance between the full raw empirical RT distributions, in order to prevent information loss [park2016k2]. Since the MMD is expensive to compute, we use a GPU implementation to ensure that computation of MMD is not a bottleneck for the comparison. In order to achieve good approximation with 2000 samples from the SMCMMD approximate posterior, we run the algorithm for 20 populations with a final rejection threshold . We also sample[draw] samples from the approximate posterior obtained by applying our pretrained BayesFlow networks to the same data set.
Along SMCMMD, we apply three recent methods for neural density estimation, SNPEA [papamakarios2016fast], SNPEB [lueckmann2017flexible], and SNPEC ([greenberg2019automatic]
, also dubbed APT). Since these methods all depend on summary statistics of the data, we compute the first 6 moments of each empirical response time distribution as well as the fractions of correct/wrong responses. We train each method for a single round with
epochs and simulated datasets, in order to keep running time at a minimum. Also, we did not observe improvement in performance when training for more than one round. For each model, we sample samples from the approximate joint posterior to align the number of samples with those obtained via SMCMMD.The comparison results are depicted in Figure 5. We first focus on the comparison with SMCMMD on the single data set. (a) depicts marginal and bivariate posteriors obtained by BayesFlow and SMCMMD. The approximate posteriors of BayesFlow appear noticeably sharper. Observing the SCB plots (LABEL:fig:Fig.6c), we can conclude that the approximate posteriors of BayesFlow mirror the sharpness of the true posterior, since otherwise the SCB plots would show marked deviations from uniformity. Further, (b) depicts the marginal posteriors obtained from the application of each method. Noticeably, performance and sharpness varies across the methods and parameters, with all methods yielding good pointestimate recovery via posterior means in terms of the NRMSE and metrics.
Importantly, Table 2 lists the advantage of amortized inference for the LFM model. For instance, compared to SMCMMD, the extra effort of learning a global BayesFlow model upfront is worthwhile even after as few as 5 datasets, as inference with SMCMMD would have taken more than a day to finish. On the other hand, the breakeven for SNPEC/APT occurs after 75 datasets, so in cases where only a few dozens of data sets are considered, casebased inference might be preferable. However, the difficulties in manually finding meaningful and efficiently computable summary statistics may eat up possible savings even in this situation. We acknowledge that our choices in this respect might be suboptimal, so performance comparisons should be treated with some caution.
We note, that after a day of training, the pretrained networks of BayesFlow take less than seconds to perform inference on datasets even with maximum number of trials . Using the casebased SMCMMD algorithm, inference runs would have taken more than half a year to complete. We also note, that parallelizing separate inference threads across multiple cores or across nodes of a (GPU) computing cluster can dramatically increase the wallclock speed of the casebased methods considered here. However, the same applies to BayesFlow training, since its most expensive part, the simulation from the forward model, would profit the most from parallel computing.
Upfront Training  Inference (1 data set)  Inference (500 data sets)  Breakeven after  

BayesFlow  23.2 h  60 ms  3.7 s   
SMCMMD    5.5 h  2700 h  5 data sets 
SNPEA    0.65 h  325 h  37 data sets 
SNPEB    0.65 h  325 h  37 data sets 
SNPEC    0.35 h  175 h  75 data sets 

Note: Inference times for 500 datasets as well as the number of data sets for breakeven with BayesFlow for the SMCMMD, SNPEA, SNPEB, and SNPEC methods are extrapolated from the wallclock running time on a single data set, so these are approximate quantities.
The global performance of BayesFlow over all validation data sets and all trial sizes is depicted in (Figure 6). First, we observe excellent recovery of all LFM parameters with NRMSEs ranging between and and between for the maximum number of trials. Importantly, estimation remains very good across all trial numbers, and improves as more trials become available ((a)). The parameter appears to be most challenging to estimate, requiring more data for good estimation, whereas the nondecision time parameter can be recovered almost perfectly for all trial sizes. Last, the SCB histograms indicate no systematic deviations across the marginal posteriors ((b)).
3.7 Stochastic Differential Equations  The SIR Epidemiology Model
With this example, we want to further corroborate the excellent global performance and probabilistic calibration observed for the LFM model on a noni.i.d. stochastic ODE model. Even though the generative mechanism of the LFM is formulated as a stochastic ODE, its output consists of i.i.d. data sets comprising multiple runs of the simulator. Here, we study a compartmental model from epidemiology, whose output comprises variablesized multidimensional and interdependent timeseries. It is therefore of interest to investigate whether our method is applicable to data which is the direct output of an ODE simulator.
Compartmental models in epidemiology describe the stochastic dynamics of infectious diseases as they spread over a population of individuals [sahneh2017gemfsim, keeling2011modeling, hethcote2000mathematics]. The parameters of compartmental models encode important characteristics of diseases, such as the rates of infection or recovery from the disease. The stochastic SIR model describes the transition dynamics of individuals between three discrete states: susceptible (), infected (), and recovered (). The transition dynamics are given by the following equations:
(44)  
(45)  
(46)  
(47)  
(48) 
where give the number of susceptible, infected, and recovered individuals, respectively. The parameter controls the transition rate from being susceptible to infected, and controls the transition rate from being infected to recovered (see (a)). The number of individuals moving from to , given by , and the number of people moving from to , given by , over a time interval
are modeled as binomial random variables. The above listed stochastic system has no analytic solution and thus requires numerical simulation methods for recovering parameter values from data. Cast as a parameter estimation task, the challenge is to recover
from three dimensional timeseries data where each is a triple containing the number of susceptible (), number of infected (), and recovered () individuals at time .During training of the networks, we simulate timeseries from the stochastic SIR model with varying lengths. The number of time points is drawn from a uniform distribution at each training iteration. Usually, at lower s, the system has not converged to an equilibrium (i.e., not all individuals have transitioned from being to ). Thus, it is especially interesting to see if BayesFlow can recover the rate parameters, even if the process dynamics are still unfolding over time. Training the networks took approximately two hours with the online learning approach. Inference on datasets with posterior samples per parameter took approximately seconds.
The results on the SIR model are depicted in Figure 7. In line with the previous examples, we observe very good recovery of the true parameters, with NRMSE at around , and s around . Further, we observe decent performance even at smaller s, with better recovery performance as increases. We also observe that the posterior variance shrinks, as increases, Finally, the SCB plots indicate that the approximate posteriors are well calibrated, with the approximate posterior mean of slightly overestimating the true parameter values in the lower range.
3.8 Learned vs. HandCrafted Summaries: The LotkaVolterra Population Model
With this final examples, we want to compare the performance of our method with an LSTM summary network vs. performance obtained with a standard set of handcrafted summary statistics. For this, we focus on the wellstudied LotkaVolterra (LV) model. The LV model describes the dynamics of biological systems in which a population of predators interacts with a population of prey [wilkinson2011stochastic]. It involves a pair of first order, nonlinear, differential equations given by:
(49)  
(50) 
where denotes the number of preys, denotes the number of predators, and the parameter vector controlling the interaction between the species is .
During training of the networks, we set the initial conditions as and and consider an interval of discrete time units with time steps (samples) in between. Each sample in each LV timeseries is thus a 2dimensional vector containing the number of prey and predators in the population at time unit .
We train two invertible neural networks. The first is trained jointly with an LSTM summary network which outputs a 9dimensional learned summary statistic . The second uses a set of 9 typically used, handcrafted summary statistics [papamakarios2016fast, papamakarios2018sequential], which include: the mean of the time series; the log variance of the timeseries; the autocorrelation of each timeseries at lags 0.2 and 0.4 time units; the crosscorrelation between the two time series. The same cINN architecture with 5 ACBs is used for both training scenarios. For each scenario, we perform the same number of iterations and epochs. Online learning for each training scenario took approximately 4 hours in total wallclock time.
The results obtained on the LV model are depicted in Figure 8. We observe notably better recovery of the true parameter estimates when performing inference with the learned summary statistics. The approximate posteriors are also better calibrated when conditioned on the set of 9 learned summary statistics. These results highlight the advantages of using a summary networks when no sufficient summary statistics are available. Finally, (e) and (f) depict the posteriors obtained by the two different INNs on a single dataset with groundtruth parameters . Evidently, learning the summary statistics leads to much sharper posteriors and better pointestimate recovery.
4 Discussion
In the current work, we proposed and explored a novel method which uses invertible neural networks to perform globally amortized approximate Bayesian inference. The method, which we named BayesFlow, requires only simulations from a forward model to learn a reusable probabilistic mapping between data and parameters. We demonstrated the utility of BayesFlow by applying it to models and data from various research domains. Further, we explored an online learning approach with variable number of observations per iteration. We demonstrated that this approach leads to excellent parameter estimation throughout the examples considered in the current work. In theory, BayesFlow is applicable to any mathematical process model which can be implemented as a computer simulation
BayesFlow combines the universal approximation capacity of deep learning methods [goodfellow2016deep] with the crucial uncertainty quantification assets of Bayesian inference [kendall2017uncertainties, gelman2013bayesian]. Besides being capable of performing fully Bayesian inference on intractable mathematical models, BayesFlow provides a general framework for designing reusable Bayesian parameter estimation machines for various research domains. Moreover, it could also prove as a viable alternative in modeling contexts where standard inference methods are available, but inference is nevertheless prohibitively slow. In the following, we highlight the main advantages of BayesFlow.
First, the introduction of separate summary and invertible networks renders the method independent of the shape or the size of the observed data. The summary network learns a fixedsize vector representation of the data in an automatic, datadriven manner. Since the summary network is optimized jointly with the invertible network, the learned data representation is encouraged to be maximally informative for inferring the parameters’ posterior. This is particularly useful in settings where appropriate summary statistics are not known and, as a consequence, relevant information is lost through the choice of suboptimal summary functions. However, if sufficient statistics are available in a given domain, one might omit the summary network altogether and feed these statistics directly to the invertible network.
Second, we showed that BayesFlow generates samples from the correct posterior under perfect convergence without distributional assumptions on the shape of the posterior. This is in contrast to variational methods which optimize a lowerbound on the posterior [kingma2017improving, kingma2014auto], and oftentimes assume Gaussian approximate posteriors. Additionally, we also showed throughout all examples that the posterior means generated by the BayesFlow method are mostly excellent estimates for the true values. Beyond this, the fact that the BayesFlow method recovers the full posterior over parameters does not necessitate the usage of point estimates or summary statistics of the posterior. Further, we observe the desired posterior contraction (posterior variance decreases with increasing number of observations) and better recovery with increasing number of observations. These are indispensable properties of any Bayesian parameter estimation method, as it mirrors the decrease in epistemic uncertainty and simultaneous increase in information due to availability of more data.
Third, the largest computational cost of BayesFlow is paid during training. Once trained, the networks can be used and reused to perform inference on large numbers of data sets within seconds and across a research domain given the same priors. Indeed, there are many instances of research domains where a single model is extensively explored and independently fitted by multiple researches to test scientific hypotheses [voss2019sequential, zappia2017splatter, ratcliff2008diffusion, de2002fitting]. These research domains are expected to benefit the most from learning the model universe once and then inverting the model multiple times for fast inference on multiple data sets. In this regard, BayesFlow is similar to the recently introduced prepaid method [mestdagh2019prepaid] which uses a database of precomputed summary statistics and nearestneighbors for inference. Note, however, that BayesFlow does not need to store training data, since the knowledge about the relationship between data and parameters is compressed into the networks’ weights. This not only makes the global sharing of pretrained parameter estimation networks across researchers extremely easy, but also makes their local storage very efficient. Finally, all computations involved in the BayesFlow method benefit from a high degree of parallelism and can thus utilize the advantages of modern GPU acceleration.
These advantages notwithstanding, limitations of the BayesFlow method should also be mentioned. Although we could provide an asymptotic theoretical guarantee that BayesFlow recovers the true posteriors under perfect convergence, this might not be achieved in practice. Therefore, is it essential that proper calibration of point estimates and estimated posteriors is performed for each application of the method. Below, we discuss potential challenges and limitations of the method.
A first challenge is the Monte Carlo error introduced by approximating the mathematical expectations in our optimization objective (Eq.19) with a finite sum. The fact that we can loop through the training phase for a potentially unlimited amount of data should mitigate the Monte Carlo error. However, this would only apply to the online learning approach which is preferable when simulations are computationally cheap to perform. Approximation error remains a potential issue for the offline learning approach with a fixed grid or reference table of simulated data. Note, that this is a potential shortcoming of all methods for approximate inference using the reference table approach.
Second, the design of the summary network and invertible networks is a crucial choice for achieving optimal performance of the method. As already mentioned, the summary network should be able to represent the observed data without losing essential information and the invertible network should be powerful enough to encode the relevant datagenerating process. Nevertheless, in some realworld scenarios there might be little guidance on how to actually construct suitable summary networks. Recent work on probabilistic symmetry [bloem2019probabilistic] and algorithmic alignment [xu2019what] as well as our current experiments do, however, provide some insights about the design of summary networks. For instance, i.i.d. data induce a permutation invariant distribution which is well modeled with a deep invariant network [bloem2019probabilistic]. Data with temporal or spatial dependencies are best modeled with recurrent [hwang2018conditional], or convolutional [radev2019towards] networks. When pairwise or multiway relationships are particularly informative, attention [vaswani2017attention] or graph networks [xu2019what] appear as reasonable choices. On the other hand, the depth of the invertible network should be tailored to the complexity of the mathematical model of interest. More ACBs will enable the network to encode more complex distributions but will increase training time. Very highdimensional problems might also require very large networks with millions of parameters, up to a point where estimation becomes practically unfeasible. However, most mathematical models in the life sciences prioritize parsimony and interpretability, so they do not contain hundreds or thousands of latent parameters. In any case, future applications might require novel network architectures and solutions which go beyond our initial recommendations.
A third potential issue is the large number of neural network and optimization hyperparameters that might require finetuning by the user for optimal performance on a given task. However, we observe that many of the default hyperparameter values are sufficient to achieve excellent performance, and starting with a relatively large network consisting of 5 to 10 ACBs does not appear to hurt performance or destabilize training, even if the model to be learned is relatively simple. Based on our results, we expect that a single architecture should be able to perform well on models from a given domain. Future research should investigate this question of generality by applying the method to different or even competing models within different research domains. Future research should investigate the impact of modern hyperparameter optimization methods such as Bayesian optimization [eggensperger2013towards].
Finally, even though modern libraries such as TensorFlow [abadi2016tensorflow] or Torch [collobert2002torch] allow for rapid and relatively straightforward development of various neural network architectures, the implementational burden associated with the current method is still reasonably high. In order to ease the understanding and independent application of the method, we provide fully functioning code to reproduce and study all of the examples tackled in this paper. In addition, we provide implementation of all metrics and visualization tools for performance validation used throughout the paper. Moreover, we are currently developing a general userfriendly software, which should abstract away most intricacies from the user of the method.
We hope that the new BayesFlow method will enable researchers from a variety of fields to accelerate modelbased inference and will further prove its utility beyond the examples considered in this paper.
Acknowledgements
We thank Paul Bürkner, Manuel Haussmann, Jeffrey Rouder, Raphael Hartmann, David Izydorczyk, Hannes Wendler, Chris Wendler, and Karin Prillinger for their invaluable comments and suggestions that greatly improved the manuscript. We also thank Francis Tuerlinckx and Stijn Verdonck for their support and thoughtprovoking ideas.
References
Appendix
A Computation of Validation Metrics
Normalized Root Mean Squared Error
The normalized root mean squared error (NRMSE) between a sample of true parameters and a sample of estimated parameters is given by:
(1) 
Due to the normalization factor , the NRMSE is scaleindependent, and thus suitable for comparing the recovery across parameters with different numerical ranges. The NRMSE equals zero when the estimates are exactly equal to the true values.
Coefficient of Determination
The coefficient of determination measures the proportion of variance in a sample of true parameters that is explained by a sample of estimated parameters . It is computed as:
(2) 
where denotes the mean of the true parameter samples. When equals , the estimates are perfect reconstructions of the true parameters.
Resimulation Error
To compute the resimulation error , we first obtain an estimate of the true parameter value given an observed (validations) data set by computing the mean of the approximate posterior . Then, we run the mathematical model to obtain a simulated dataset . Finally, we compute the maximum mean discrepancy (MMD, [gretton2012kernel]) between the observed and the simulated dataset . The MMD is a kernelbased metric which estimates the mismatch between two distributions given samples from the distributions by comparing all of their moments. It equals zero when the two distributions are equal almost everywhere [gretton2012kernel]). Thus, a low MMD indicates that the distribution of is close to the distribution of . Conversely, a high MMD indicates that the distribution of is far from the distribution of . We report the median MMD computed over all validation data sets.
Calibration Error
The calibration error quantifies how well the coverage of an approximate posterior matches the coverage of an unknown true posterior. Let be the fraction of true parameter values lying in a corresponding credible interval of the approximate posterior. Thus, for a perfectly calibrated approximate posterior, should equal for all . We compute the calibration error for each marginal posterior as the median absolute deviation for 100 equally spaced values of . Therefore, the calibration error ranges between 0 and 1 with 0 indicating perfect calibration and 1 indicating complete miscalibration of the approximate posterior.
KullbackLeibler Divergence
The KullbackLeibler divergence (
) quantifies the increase in entropy incurred by approximating a target probability distribution with a distribution . Its general form for absolutely continuous distributions is given by(3) 
where and denote the pdfs of and . In the case where and are both multivariate Gaussian distributions, the KL divergence can be computed in closed form [hershey2007approximating]:
(4) 
where and denote the covariance matrices of and , and the respective mean vectors, and the number of dimensions of the Gaussian. In the case of diagonal Gaussian distributions, Eq.4 reduces to:
Comments
There are no comments yet.