 # Coping With Simulators That Don't Always Return

Deterministic models are approximations of reality that are easy to interpret and often easier to build than stochastic alternatives. Unfortunately, as nature is capricious, observational data can never be fully explained by deterministic models in practice. Observation and process noise need to be added to adapt deterministic models to behave stochastically, such that they are capable of explaining and extrapolating from noisy data. We investigate and address computational inefficiencies that arise from adding process noise to deterministic simulators that fail to return for certain inputs; a property we describe as "brittle." We show how to train a conditional normalizing flow to propose perturbations such that the simulator succeeds with high probability, increasing computational efficiency.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

In order to compensate for epistemic uncertainty due to modelling approximations and unmodelled aleatoric uncertainty, deterministic simulators are often “converted” to “stochastic” simulators by perturbing the state at each time step. In practice this allows the simulator to explain the variability observed in real data without requiring excessive observation noise. Such models are more resilient to misspecification, are capable of providing uncertainty estimates, and provide better inferences in general

[Møller et al., 2011; Saarinen et al., 2008; Lv et al., 2008; Pimblott and LaVerne, 1990; Renard et al., 2013].

Often, state-independent Gaussian noise with heuristically tuned variance is used to perturb the state

[Adhikari and Agrawal, 2013; Brockwell and Davis, 2016; Fox, 1997; Reddy and Clinton, 2016; Du and Sam, 2006; Allen, 2017; Mbalawata et al., 2013]. However, naively adding noise to the state will, in many applications, render the perturbed input state “invalid,” where invalid states cause the simulator to raise an exception and not return a value [Razavi et al., 2019; Lucas et al., 2013; Sheikholeslami et al., 2019]. We formally define failure by extending the possible output of the simulator to include (read as “bottom”) denoting simulator failure. The principal contribution of this paper is a technique for avoiding invalid states by choosing perturbations that minimize the failure rate. The technique we develop results in a reduction in simulator failures, while maintaining the original model.

Examples of failure modes include ordinary differential equation (ODE) solvers not converging to the required tolerance in the allocated time, or, the perturbed state entering into an unhandled configuration, such as solid bodies intersecting. Establishing the state-perturbation pairs that cause failure is non-trivial. Hence, the simulation artifact can be sensitive to seemingly inconsequential alterations to the state – a property we describe as “brittle.” Failures waste computational resources and reduce the diversity of simulations for a finite sample budget, for instance, when used as the proposal in sequential Monte Carlo. As such, we wish to learn a proposal over perturbations such that the simulator exits with high probability, but renders the joint distribution unchanged.

We proceed by framing sampling from brittle simulators as rejection samplers, then seek to eliminate rejections by estimating the state-dependent density over perturbations that do not induce failure. We then demonstrate that using this learned proposal yields lower variance results when used in posterior inference with a fixed sample budget, such as pseudo-marginal evidence estimates produced by sequential Monte Carlo sweeps. Source code for reproduction of figures and results in this paper is available at https://github.com/plai-group/stdr.

## 2 Background

### 2.1 Smoothing Deterministic Models

Deterministic simulators are often stochastically perturbed to increase the diversity of the achievable simulations and to fit data more effectively. The most widespread example of this is perturbing linear dynamical systems with Gaussian noise at each timestep. The design of the system is such that the distribution over state at each point in time is Gaussian distributed. However, the simplistic dynamics of such a system may be insufficient for simulating more complex systems. Examples of such systems are: stochastic models of neural dynamics

[Fox, 1997; Coutin et al., 2018; Goldwyn and Shea-Brown, 2011; Saarinen et al., 2008], econometrics [Lopes and Tsay, 2011], epidemiology [Allen, 2017] and mobile robotics [Thrun et al., 2001; Fallon et al., 2012]. In these examples, the simulator state is perturbed with noise drawn from a distribution and is iterated using the simulator to create discrete approximations of the distribution over state as a function of time.

### 2.2 Simulator Failure

As simulators become more complex, guaranteeing the simulator will not fail for perturbed inputs becomes more difficult, and individual function evaluations become more expensive. Lucas et al.  and Edwards et al. 

establish the sensitivity of earth science models to global parameter values by building a discriminative classifier for parameters that induce failure.

Sheikholeslami et al. 

take an alternative approach instead treating simulator failure as an imputation problem, fitting a function regressor to predict the outcome of the failed experiment given the neighboring experiments that successfully terminated. However these methods are limited by the lack of clear probabilistic interpretation in terms of the originally specified joint distribution in time series models, their ability to scale to high dimensions, and their applicability to state-space models.

### 2.3 State-space Inference and Model Selection

Probabilistic models are ultimately deployed to make inferences about the world. Hence the goal is to be able to recover distributions over unobserved states, predict future states and learn unknown parameters of the model from data. Posterior state-space inference refers to the task of recovering the distribution , where are the latent states, are the observed data, and denotes the model if multiple different models are available. Inference in Gaussian perturbed linear dynamical systems can be performed using techniques such as Kalman smoothing [Kalman, 1960], however, the restrictions on such techniques limit their applicability to complex simulators, and so numerical methods are often used in practice.

A common method for performing inference in complex, simulation based models is sequential Monte Carlo (SMC) [Doucet et al., 2001]. The basic algorithm for SMC is shown in Algorithm 1, where is the prior over initial state, is the dynamics model, or simulator, is the likelihood, defining the relationship between latent states and observed data, and is the number of particles used. On a high level, SMC produces a discrete approximation of the target distribution by iterating particles through the simulator, and then preferentially continuing those simulations that “explain” the observed data well. While a detailed understanding of particle filtering is not required, the core observation required for this work is that the likelihood of failed simulations is defined as zero: , and hence are rejected with certainty.

Posterior inference pipelines often also provide estimates of the model evidence, . SMC provides such an estimate, referred to as a pseudo-marginal evidence, denoted in Algorithm 1 as . This pseudo-marginal evidence is calculated (in log space) as the sum of the expected value of the unnormalized importance weights (Algorithm 1, Lines 8 and 14

). This evidence can be combined with the prior probability of each model via Bayes rule to estimate the posterior probability of the model (up to a normalizing constant)

[MacKay, 2003]. These posteriors can be compared to perform Bayesian model selection, where the model with the highest posterior is selected and used to perform inference. This is often referred to as marginal maximum a posteriori parameter estimation (or model selection) [Doucet et al., 2002; Kantas et al., 2015]. Recent work investigates model selection using approximate, likelihood-free inference techniques [Papamakarios et al., 2019; Lueckmann et al., 2019], however, we do not consider these methods here, instead focusing on mitigating computational inefficiencies arising directly from simulator failure.

## 3 Methodology

We consider deterministic models, expressed as simulators, describing the time-evolution of a state , where we denote application of the simulator iterating the state as . A stochastic, additive perturbation to state, denoted , is applied to induce a distribution over states. The distribution from which this perturbation is sampled is denoted , although, in practice, this distribution is often state independent. The iterated state is then calculated as .

However, we consider the case where the simulator can fail for “invalid” inputs, denoted by a return value of . Hence the complete definition of is . The region of valid inputs is denoted as , and the region of invalid inputs as , such that , where the boundary between these regions is unknown. Over the whole support, defines a many-to-one function, as maps to . However, the algorithm we derive only requires that

is one-to-one in the accepted region. This is not uncommon in real simulators, and is satisfied by, for example, ODE models. We define the random variable

to denote whether the state-perturbation pair does not yield simulator failure and is “accepted.”

We define the iteration of perturbed deterministic simulator as a rejection sampler, with a well-defined target distribution (§3.1). We use this definition and the assumptions on to show that we can target the same distribution by learning the state-conditional density of perturbations, conditioned on acceptance (§3.2). We train an autoregressive flow to fit this density (§3.3), and describe how this can be used in inference, highlighting the ease with which it can be inserted into a particle filter (§3.4). We empirically show that using this learned proposal distribution in place of the original proposal improves the performance of particle-based state-space inference methods (§4).

### 3.1 Brittle Simulators as Rejection Samplers

The naive approach to sampling from the perturbed system, shown in Algorithm 2, is to repeatedly sample from the proposal distribution and evaluate until the simulator successfully exits. This procedure defines , i.e. successfully iterated samples are accepted with certainty. This incurs significant wasted computation as the simulator must be called repeatedly, with failed iterations being discarded. The objective of this work is to derive a more efficient sampling mechanism.

We begin by establishing Algorithm 2 as a rejection sampler, targeting the distribution over successfully iterated states. This reasoning is illustrated in Figure 1. The behavior of and the distribution implicitly define a distribution over successfully iterated states. We denote this “target” distribution as , where the bar indicates that the sample was accepted, and hence places no probability mass on failures. Note there is no bar on , indicating that it is defined before the accept/reject behaviors of and hence probability mass may be placed on regions that yield failure. The functional form of is unavailable, and the density cannot be evaluated for any input value. Figure 1: Graphical representation of how a brittle deterministic simulator acts as a rejection sampler, targeting ¯¯¯p(zt|xt−1). We set xt=0 for clarity. The simulator, f(zt), returns ⊥ for unknown input regions, shown in green. The proposal over zt is shown in blue. The target distribution, ¯¯¯p(zt), shown in orange, is implicitly defined as ¯¯¯p(zt)=1Mpp(zt)I[f(zt)≠⊥], where Mp is the normalizing constant from ¯¯¯p, equal to the acceptance rate. Accordingly, the proposal distribution, scaled by Mp, is exactly equal to ¯¯¯p(zt) in the accepted region. Algorithm 2 therefore implicitly constructs a rejection sampler, where the acceptance criterion reduces to I[f(zt)≠⊥], without needing to specify any additional scaling constants.

The existence of implies the existence of a second distribution: the distribution over accepted perturbations, denoted . Note that this distribution is also conditioned on acceptance under the chosen simulator, indicated by the presence of a bar. We assume is one-to-one in the accepted region, and so the change of variables rule can be applied to directly relate this to . Under our initial algorithm for sampling from a brittle simulator we can therefore write the following identity:

 ¯¯¯p(zt|xt−1)={1Mpp(zt|xt−1),if f(xt−1+zt)≠⊥0,otherwise (1)

where the normalizing constant is the acceptance rate under . (1) indicates accepting with certainty perturbations that exit successfully can be seen as proportionally shifting mass from regions of where the simulator fails to regions where it does not. We exploit this definition to learn an efficient proposal.

### 3.2 Change of Variable in Brittle Simulator

We now derive how we can learn the proposal distribution, denoted and parameterized by , to replace , such that the acceptance rate under (denoted ) tends towards unity, minimizing wasted computation. We denote as the proposal we train, which, coupled with the simulator, implicitly defines a proposal over accepted samples, denoted .

Expressing this mathematically, we wish to minimize the distance between joint distribution implicitly specified over accepted iterated states using the a priori specified proposal distribution, , and :

 ϕ∗=argminϕEp(xt−1)[DKL[¯¯¯p(xt|xt−1)||¯¯¯qϕ(xt|xt−1)]], (2)

where we select the Kullback-Leibler (KL) divergence as the metric of distance between distributions. The outer expectation defines this objective as amortized across state space, where we can generate the samples by directly sampling trajectories from the model [Le et al., 2017; Gershman and Goodman, 2014]. We use the forward KL as opposed to the reverse KL, , as high-variance REINFORCE estimators must be used to obtain the the differential with respect to of the reverse KL.

Expanding the KL term yields:

 ϕ∗ =argminϕEp(xt−1)E¯¯p(xt|xt−1)[log(w)], (3) w =¯¯¯p(xt|xt−1)¯¯¯qϕ(xt|xt−1). (4)

Noting that and are defined only on accepted samples, where is one-to-one, we can apply a change of variables defined for as:

 ¯¯¯qϕ(xt|xt−1)=¯¯¯qϕ(f−1(xt)|xt−1)∣∣∣df−1(xt)dxt∣∣∣, (5)

and likewise for . This transforms the distribution over into a distribution over and a Jacobian term:

 w=¯¯¯p(f−1(xt)|xt−1)∣∣∣df−1(xt)dxt∣∣∣¯¯¯qϕ(f−1(xt)|xt−1)∣∣∣% df−1(xt)dxt∣∣∣. (6)

taking care to also apply the change of variables in the distribution we are sampling from in (3). Noting that the same Jacobian terms appear in the numerator and denominator we are able to cancel these:

 w=¯¯¯p(f−1(xt)|xt−1)¯¯¯qϕ(f−1(xt)|xt−1). (7)

We can now discard the term as it is independent of . Noting we can write (2) as:

 ϕ∗ (8)

However, this distribution is defined after rejection sampling, and can only be defined as in (1):

 ¯¯¯qϕ(zt|xt−1) =qϕ(zt|xt−1,At=1) (9) ={1Mqϕqϕ(zt|xt−1)if f(xt−1+zt)≠⊥,0otherwise,

denoting as the acceptance rate under .

However, there is an infinite family of proposals that yield , each a maximizer of (8) but with different rejection rates. Noting however that there is only a single that has a rejection rate of zero and renders , and that this distribution also renders , we can instead optimize :

 ϕ∗=argmaxϕEp(xt−1)E¯¯p(zt|xt−1)[logqϕ(zt|xt−1)], (10)

with no consideration of rejection behavior under .

One might alternatively try to achieve low rejection rates by adding a regularization term to (8) penalizing high . However, differentiation of is intractable, meaning direct optimization of (8) is intractable.

The objective stated in (10) implicitly rewards distributions that place minimal mass on rejections by placing as much mass on accepted samples as possible. This expressionn is differentiable with respect to and so we can maximize this quantity through gradient ascent with minibatches drawn from . This expression shows that we can learn the distribution over accepted values by learning the distribution over the accepted , without needing to calculate the Jacobian or inverse of . Doing so minimizes wasted computation, targets the same overall joint distribution, and retains interpretability by utilizing the simulator.

### 3.3 Training qϕ

To train we first define a method for sampling state-perturbation pairs. We initialize the simulator to a state sampled from a distribution over initial value, and then iterate the perturbed simulator forward for some finite period. All state-perturbation pairs sampled during the trajectory are treated as a training example, and, in total, represent a discrete approximation to the prior over state for all time, and accepted state-conditional perturbation, i.e. and .

We train the conditional density estimator using these samples by maximizing the conditional density of the sampled perturbation under the true distribution, as in (10), as this minimizes the desired KL divergence originally stated in (2). Our conditional density estimator is fully differentiable and can be trained through stochastic gradient ascent as shown in Algorithm 3. The details of the chosen architecture is explained in §3.5. The result of this procedure is an artifact that approximates the density over valid perturbations conditioned on state, . Figure 2: Diagram visualizing how qϕ is structured and used. The previous state is input to the hypernetwork, a series of Lsingle layer neural networks, denoted hl. Each network outputs parameters, denoted ϕl, for each of the L layers in the flow conditioned on the state. The flow samples a perturbation as zt∼qϕ(zt|xt−1), with the internal states of the flow denoted by ϵl. This perturbation is summed with the previous state and passed through the simulator, f, outputting the iterated state, xt.

### 3.4 Using qϕ

Once has been trained, it can be deployed to enhance posterior inference, by replacing samples from with . We highlight here the ease with which it can be introduced into an SMC sweep. The state is iterated by sampling from on Line 7 of Algorithm 1, where this sampling procedure is defined in Algorithm 2. Instead of sampling from in Algorithm 2, the sample is drawn from , and as such the sample is more likely to be accepted. This modification requires only changing a single function call made inside the implementation of Algorithm 2.

### 3.5 Implementation

We parameterize the density using an autoregressive flow (AF) [Larochelle and Murray, 2011]

. Flows define a parameterized density estimator that can be trained using stochastic gradient descent, and variants have been used in image generation

[Kingma and Dhariwal, 2018]

, as priors for variational autoencoders

[Kingma et al., 2016], and in likelihood-free inference [Papamakarios et al., 2019; Lueckmann et al., 2019].

Specifically, we structure using a masked autoregressive flow [Papamakarios et al., 2017], with single-layer MADE blocks [Germain et al., 2015]

, and batch normalization at the input to each intermediate MADE block. The dimensionality of the flow is the number of states perturbed in the original model. We implement conditioning through the use of a hypernetwork

[Ha et al., 2016], which outputs the parameters of the flow layers given as input, as shown in Figure 2. The hypernetworks are single-layer neural networks defined per flow layer. Together, the flow and hypernetwork define

, and can be jointly trained using stochastic gradient descent. The networks are implemented in PyTorch

[Paszke et al., 2017] and are optimized using ADAM [Kingma and Ba, 2014].

## 4 Experiments

### 4.1 Toy Problem – Annulus

We first demonstrate our approach on a toy problem. The true generative model of the observed data is a constant speed circular orbit around the origin in the - plane, such that . To analyze this data we use a misspecified model that only simulates linear forward motion. To overcome the model mismatch and fit the observed data, we add Gaussian noise to position and velocity. We impose a failure constraint limiting the change in the distance of the point from the origin to a fixed threshold. This condition mirrors our observation that states in brittle simulators have large allowable perturbations in particular directions, but very narrow permissible perturbations in other directions. The true radius is unknown and so we must amortize over possible radii.

The results of this experiment are shown in Figure 3. The interior of the black dashed lines in Figure 2(a) indicates the permissible - perturbation, for the given position and zero velocity, where we have centered each distribution on the current position for ease of visual inspection. Red contours indicate the original density , and blue contours indicate the learned density . The fraction of the probability mass outside the black dashed region is the expected rejection rate. Figure 2(b) shows the rejection rate drops from approximately under the original model to approximately using a trained .

We then use the learned as the perturbation proposal in an SMC sweep, where we condition on noisy observations of the - coordinates. As we focus on the sample efficiency of the sweep, we fix the number of calls to the simulator in Algorithm 2 to a single call, instead of proposing and rejecting until acceptance. Failed particles are then not resampled (with certainty) during the resampling. This means that each iteration of the SMC makes a fixed number of calls to the simulator, and hence we can compare algorithms under a fixed sample budget. Figure 2(c) shows that we recover lower variance evidence approximations for a fixed sample budget by using instead of

. A paired t-test evaluating the difference in variance returns a p-value of less than

, indicating a strong statistical difference between the performance under and , confirming that using increases the fidelity of inference for a fixed sample budget.

### 4.2 Bouncing Balls

Our second example uses a simulator of balls bouncing elastically, as shown in Figure 3(a)

. We model the position and velocity of each ball, such that the dimensionality of the state vector,

, is four times the number of balls. We add a small amount of Gaussian noise at each iteration to the position and velocity of each ball. This perturbation induces the possibility that two balls overlap, or, a ball intersects with the wall, representing an invalid physical configuration and results in simulator failure. We note that here, we are conditioning on the state of all balls simultaneously, and proposing the perturbation to the state jointly.

Figure 3(b) shows the distribution over position perturbation of a single ball, conditioned on the other ball being stationary. Blue contours show the estimated distribution over accepted perturbations learned by autoregressive flow. Figure 3(c) shows the rejection rate under and as a function of the position of the first ball, with the second ball fixed in the position shown, showing that rejection has been all but eliminated. We again see a reduction in the variance of the evidence approximation computed by a particle filter when using instead of (figure in the supplementary materials).

### 4.3 MuJoCo

We now apply our method to the popular robotics simulator MuJoCo [Todorov et al., 2012], specifically using the built-in example “tosser,” where a capsule is “tossed” by an actuator into a bucket, shown in Figure 4(a). Tosser displays “choatic” aspects, as minor changes in the position of the object results in large changes in the trajectories achieved by the simulator.

MuJoCo allows some overlap between the objects to simulate contact dynamics. This is an example of model misspecification borne out of the requirements of reasonably writing a simulator. We therefore place a hard limit on the amount objects are allowed to overlap. This is an example of a user-specified constraint that requires the simulator to be run to evaluate. We add Gaussian distributed noise to the position and velocity of the capsule.

Figure 5 shows the results of this experiment. The capsule is mostly in free space resulting in an average rejection rate under of . Figure 4(b) shows that the autoregressive flow learns a proposal with a lower rejection rate, reaching rejection. However these rejections are concentrated in the critical regions of state-space, where chaotic behavior occurs, and so this reduction yields an large reduction in the variance of the evidence approximation, as shown in Figure 4(c).

We conclude this example by evaluating our method on hypothesis testing using pseudo-marginal evidence estimates. The results for this are shown in Figure 4(d). We test different hypothesis of the mass of the capsule. Using results in higher variance evidence approximations than when is used. Additionally, under the wrong model is selected ( instead of ), although with low significance (), while using selects the correct hypothesis with . For this experiment we note that was trained on a single value of mass, and that this “training mass” was different to the “testing mass.” We believe this contributes to the increased variance in hypothesis , which is very light compared to the training mass. Training a with a further level of amortization over different mass values would further increase the fidelity of the model selection. This is intimately linked with the larger project of jointly learning the model, and so we defer investigation to future works.

### 4.4 Neuroscience Simulator

We conclude by applying our algorithm to a simulator for the widely studied Caenorhabditis elegans roundworm. WormSim, presented by Boyle et al. , is a simulator of the locomotion of the worm, using a dimensional state representation. We apply perturbations to a dimensional subspace defining the physical position of the worm, while conditioning on the full dimensional state vector. The expected rate of failure increases sharply as a function of the scale of the perturbation applied, as shown in Figure 5(a), as the integrator used in WormSim is unable to integrate highly perturbed states.

The rejection rate during training is shown in Figure 5(b). We are able to learn an autoregressive flow with lower rejection rates, reaching approximately rejection, when has approximately rejection. Although the rejection rate is higher than ultimately desired, we include this example as a demonstration of how rejections occur in simulators through integrator failure. We believe larger flows with regularized parameters can reduce the rejection rate further.

## 5 Conclusion

In this paper we have tackled reducing simulator failures caused by naively perturbing the input state. We achieve this by showing that stochastically perturbed simulators define a rejection sampler with a well defined target distribution and learning a conditional autoregressive flow to estimate the state-dependent proposal distribution conditioned on acceptance. We then show that using this learned proposal reduces the variance of inference results, with applications for Bayesian model selection. We believe this work has readily transferable practical contributions to not just the machine learning community, but the wider scientific community where such naively modified simulation platforms are widely deployed. As part of the experiments we present, we identify an extension: introducing an additional level of amortization over static simulation parameters. This extension builds towards our larger research vision of building toolchains for efficient inference and learning in brittle simulators. Further development will facilitate efficient gradient-based model learning in these brittle simulators.

## 6 Acknowledgements

Andrew Warrington is supported under the Shilston Scholarship, awarded by Keble College and the Department of Engineering Science, University of Oxford. Saeid Naderiparizi and Frank Wood are supported under the Natural Sciences and Engineering Research Council of Canada (NSERC), the Canada CIFAR AI Chairs Program, Compute Canada, Intel, Composites Research Network (CRN) and DARPA under its D3M and LWLL programs.

## References

• R. Adhikari and R. K. Agrawal (2013) An introductory study on time series modeling and forecasting. arXiv preprint arXiv:1302.6613. Cited by: §1.
• L. J. Allen (2017) A primer on stochastic epidemic models: formulation, numerical simulation, and analysis. Infectious Disease Modelling 2 (2), pp. 128–142. Cited by: §1, §2.1.
• J. H. Boyle, S. Berri, and N. Cohen (2012) Gait modulation in c. elegans: an integrated neuromechanical model. Frontiers in computational neuroscience 6, pp. 10. Cited by: §4.4.
• P. J. Brockwell and R. A. Davis (2016) Introduction to time series and forecasting. springer. Cited by: §1.
• L. Coutin, J. Guglielmi, and N. Marie (2018) On a fractional stochastic hodgkin–huxley model. International Journal of Biomathematics 11 (05), pp. 1850061. Cited by: §2.1.
• A. Doucet, N. De Freitas, and N. Gordon (2001) An introduction to sequential monte carlo methods. In Sequential Monte Carlo methods in practice, pp. 3–14. Cited by: §2.3.
• A. Doucet, S. J. Godsill, and C. P. Robert (2002)

Marginal maximum a posteriori estimation using markov chain monte carlo

.
Statistics and Computing 12 (1), pp. 77–84. Cited by: §2.3.
• N. H. Du and V. H. Sam (2006)

Dynamics of a stochastic lotka–volterra model perturbed by white noise

.
Journal of mathematical analysis and applications 324 (1), pp. 82–97. Cited by: §1.
• N. R. Edwards, D. Cameron, and J. Rougier (2011) Precalibrating an intermediate complexity climate model. Climate dynamics 37 (7-8), pp. 1469–1482. Cited by: §2.2.
• M. F. Fallon, H. Johannsson, and J. J. Leonard (2012) Efficient scene simulation for robust monte carlo localization using an rgb-d camera. In 2012 IEEE international conference on robotics and automation, pp. 1663–1670. Cited by: §2.1.
• R. F. Fox (1997) Stochastic versions of the hodgkin-huxley equations. Biophysical journal 72 (5), pp. 2068–2074. Cited by: §1, §2.1.
• M. Germain, K. Gregor, I. Murray, and H. Larochelle (2015) Made: masked autoencoder for distribution estimation. In International Conference on Machine Learning, pp. 881–889. Cited by: §3.5.
• S. Gershman and N. Goodman (2014) Amortized inference in probabilistic reasoning. In Proceedings of the annual meeting of the cognitive science society, Vol. 36. Cited by: §3.2.
• J. H. Goldwyn and E. Shea-Brown (2011) The what and where of adding channel noise to the hodgkin-huxley equations. PLOS Computational Biology 7 (11), pp. 1–9. Cited by: §2.1.
• D. Ha, A. Dai, and Q. V. Le (2016) Hypernetworks. arXiv preprint arXiv:1609.09106. Cited by: §3.5.
• R. Kalman (1960) A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering 82 (Series D), pp. 35–45. Cited by: §2.3.
• N. Kantas, A. Doucet, S. S. Singh, J. Maciejowski, N. Chopin, et al. (2015) On particle methods for parameter estimation in state-space models. Statistical science 30 (3), pp. 328–351. Cited by: §2.3.
• D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.5.
• D. P. Kingma and P. Dhariwal (2018) Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224. Cited by: §3.5.
• D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling (2016) Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pp. 4743–4751. Cited by: §3.5.
• H. Larochelle and I. Murray (2011) The neural autoregressive distribution estimator. In

Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics

,
pp. 29–37. Cited by: §3.5.
• T. A. Le, A. G. Baydin, and F. Wood (2017) Inference compilation and universal probabilistic programming. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), Proceedings of Machine Learning Research, Vol. 54, Fort Lauderdale, FL, USA, pp. 1338–1348. Cited by: §3.2.
• H. F. Lopes and R. S. Tsay (2011)

Particle filters and bayesian inference in financial econometrics

.
Journal of Forecasting 30 (1), pp. 168–209. External Links: Document Cited by: §2.1.
• D. Lucas, R. Klein, J. Tannahill, D. Ivanova, S. Brandon, D. Domyancic, and Y. Zhang (2013) Failure analysis of parameter-induced simulation crashes in climate models. Geoscientific Model Development 6 (4), pp. 1157–1171. Cited by: §1, §2.2.
• J. Lueckmann, G. Bassetto, T. Karaletsos, and J. H. Macke (2019) Likelihood-free inference with emulator networks. In Symposium on Advances in Approximate Bayesian Inference, pp. 32–53. Cited by: §2.3, §3.5.
• Q. Lv, M. K. Schneider, and J. W. Pitchford (2008) Individualism in plant populations: using stochastic differential equations to model individual neighbourhood-dependent plant growth. Theoretical Population Biology 74 (1), pp. 74 – 83. External Links: ISSN 0040-5809, Document Cited by: §1.
• D. J. MacKay (2003) Information theory, inference and learning algorithms. Cambridge university press. Cited by: §2.3.
• I. S. Mbalawata, S. Särkkä, and H. Haario (2013)

Parameter estimation in stochastic differential equations with markov chain monte carlo and non-linear kalman filtering

.
Computational Statistics 28 (3), pp. 1195–1223. Cited by: §1.
• J. K. Møller, H. Madsen, and J. Carstensen (2011) Parameter estimation in a simple stochastic differential equation for phytoplankton modelling. Ecological modelling 222 (11), pp. 1793–1799. Cited by: §1.
• G. Papamakarios, T. Pavlakou, and I. Murray (2017) Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pp. 2338–2347. Cited by: §3.5.
• G. Papamakarios, D. Sterratt, and I. Murray (2019) Sequential neural likelihood: fast likelihood-free inference with autoregressive flows. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 837–848. Cited by: §2.3, §3.5.
• A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §3.5.
• S. M. Pimblott and J. A. LaVerne (1990) Comparison of stochastic and deterministic methods for modeling spur kinetics. Radiation Research 122 (1), pp. 12–23. External Links: ISSN 00337587, 19385404 Cited by: §1.
• S. Razavi, R. Sheikholeslami, H. V. Gupta, and A. Haghnegahdar (2019) VARS-tool: a toolbox for comprehensive, efficient, and robust sensitivity and uncertainty analysis. Environmental Modelling & Software 112, pp. 95 – 107. External Links: ISSN 1364-8152 Cited by: §1.
• K. Reddy and V. Clinton (2016) Simulating stock prices using geometric brownian motion: evidence from australian companies. Australasian Accounting, Business and Finance Journal 10 (3), pp. 23–47. Cited by: §1.
• P. Renard, A. Alcolea, and D. Gingsbourger (2013) Stochastic versus deterministic approaches. In Environmental Modelling: Finding Simplicity in Complexity, Second Edition (eds J. Wainwright and M. Mulligan), pp. 133–149. Cited by: §1.
• A. Saarinen, M. Linne, and O. Yli-Harja (2008) Stochastic differential equation model for cerebellar granule cell excitability. PLOS Computational Biology 4 (2), pp. 1–11. External Links: Document Cited by: §1, §2.1.
• R. Sheikholeslami, S. Razavi, and A. Haghnegahdar (2019) What do we do with model simulation crashes? recommendations for global sensitivity analysis of earth and environmental systems models. Geoscientific Model Development Discussions 2019, pp. 1–32. External Links: Document Cited by: §1, §2.2.
• S. Thrun, D. Fox, W. Burgard, and F. Dellaert (2001) Robust monte carlo localization for mobile robots. Artificial intelligence 128 (1-2), pp. 99–141. Cited by: §2.1.
• E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §4.3.

## References

• R. Adhikari and R. K. Agrawal (2013) An introductory study on time series modeling and forecasting. arXiv preprint arXiv:1302.6613. Cited by: §1.
• L. J. Allen (2017) A primer on stochastic epidemic models: formulation, numerical simulation, and analysis. Infectious Disease Modelling 2 (2), pp. 128–142. Cited by: §1, §2.1.
• J. H. Boyle, S. Berri, and N. Cohen (2012) Gait modulation in c. elegans: an integrated neuromechanical model. Frontiers in computational neuroscience 6, pp. 10. Cited by: §4.4.
• P. J. Brockwell and R. A. Davis (2016) Introduction to time series and forecasting. springer. Cited by: §1.
• L. Coutin, J. Guglielmi, and N. Marie (2018) On a fractional stochastic hodgkin–huxley model. International Journal of Biomathematics 11 (05), pp. 1850061. Cited by: §2.1.
• A. Doucet, N. De Freitas, and N. Gordon (2001) An introduction to sequential monte carlo methods. In Sequential Monte Carlo methods in practice, pp. 3–14. Cited by: §2.3.
• A. Doucet, S. J. Godsill, and C. P. Robert (2002)

Marginal maximum a posteriori estimation using markov chain monte carlo

.
Statistics and Computing 12 (1), pp. 77–84. Cited by: §2.3.
• N. H. Du and V. H. Sam (2006)

Dynamics of a stochastic lotka–volterra model perturbed by white noise

.
Journal of mathematical analysis and applications 324 (1), pp. 82–97. Cited by: §1.
• N. R. Edwards, D. Cameron, and J. Rougier (2011) Precalibrating an intermediate complexity climate model. Climate dynamics 37 (7-8), pp. 1469–1482. Cited by: §2.2.
• M. F. Fallon, H. Johannsson, and J. J. Leonard (2012) Efficient scene simulation for robust monte carlo localization using an rgb-d camera. In 2012 IEEE international conference on robotics and automation, pp. 1663–1670. Cited by: §2.1.
• R. F. Fox (1997) Stochastic versions of the hodgkin-huxley equations. Biophysical journal 72 (5), pp. 2068–2074. Cited by: §1, §2.1.
• M. Germain, K. Gregor, I. Murray, and H. Larochelle (2015) Made: masked autoencoder for distribution estimation. In International Conference on Machine Learning, pp. 881–889. Cited by: §3.5.
• S. Gershman and N. Goodman (2014) Amortized inference in probabilistic reasoning. In Proceedings of the annual meeting of the cognitive science society, Vol. 36. Cited by: §3.2.
• J. H. Goldwyn and E. Shea-Brown (2011) The what and where of adding channel noise to the hodgkin-huxley equations. PLOS Computational Biology 7 (11), pp. 1–9. Cited by: §2.1.
• D. Ha, A. Dai, and Q. V. Le (2016) Hypernetworks. arXiv preprint arXiv:1609.09106. Cited by: §3.5.
• R. Kalman (1960) A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering 82 (Series D), pp. 35–45. Cited by: §2.3.
• N. Kantas, A. Doucet, S. S. Singh, J. Maciejowski, N. Chopin, et al. (2015) On particle methods for parameter estimation in state-space models. Statistical science 30 (3), pp. 328–351. Cited by: §2.3.
• D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.5.
• D. P. Kingma and P. Dhariwal (2018) Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224. Cited by: §3.5.
• D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling (2016) Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pp. 4743–4751. Cited by: §3.5.
• H. Larochelle and I. Murray (2011) The neural autoregressive distribution estimator. In

Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics

,
pp. 29–37. Cited by: §3.5.
• T. A. Le, A. G. Baydin, and F. Wood (2017) Inference compilation and universal probabilistic programming. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), Proceedings of Machine Learning Research, Vol. 54, Fort Lauderdale, FL, USA, pp. 1338–1348. Cited by: §3.2.
• H. F. Lopes and R. S. Tsay (2011)

Particle filters and bayesian inference in financial econometrics

.
Journal of Forecasting 30 (1), pp. 168–209. External Links: Document Cited by: §2.1.
• D. Lucas, R. Klein, J. Tannahill, D. Ivanova, S. Brandon, D. Domyancic, and Y. Zhang (2013) Failure analysis of parameter-induced simulation crashes in climate models. Geoscientific Model Development 6 (4), pp. 1157–1171. Cited by: §1, §2.2.
• J. Lueckmann, G. Bassetto, T. Karaletsos, and J. H. Macke (2019) Likelihood-free inference with emulator networks. In Symposium on Advances in Approximate Bayesian Inference, pp. 32–53. Cited by: §2.3, §3.5.
• Q. Lv, M. K. Schneider, and J. W. Pitchford (2008) Individualism in plant populations: using stochastic differential equations to model individual neighbourhood-dependent plant growth. Theoretical Population Biology 74 (1), pp. 74 – 83. External Links: ISSN 0040-5809, Document Cited by: §1.
• D. J. MacKay (2003) Information theory, inference and learning algorithms. Cambridge university press. Cited by: §2.3.
• I. S. Mbalawata, S. Särkkä, and H. Haario (2013)

Parameter estimation in stochastic differential equations with markov chain monte carlo and non-linear kalman filtering

.
Computational Statistics 28 (3), pp. 1195–1223. Cited by: §1.
• J. K. Møller, H. Madsen, and J. Carstensen (2011) Parameter estimation in a simple stochastic differential equation for phytoplankton modelling. Ecological modelling 222 (11), pp. 1793–1799. Cited by: §1.
• G. Papamakarios, T. Pavlakou, and I. Murray (2017) Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pp. 2338–2347. Cited by: §3.5.
• G. Papamakarios, D. Sterratt, and I. Murray (2019) Sequential neural likelihood: fast likelihood-free inference with autoregressive flows. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 837–848. Cited by: §2.3, §3.5.
• A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §3.5.
• S. M. Pimblott and J. A. LaVerne (1990) Comparison of stochastic and deterministic methods for modeling spur kinetics. Radiation Research 122 (1), pp. 12–23. External Links: ISSN 00337587, 19385404 Cited by: §1.
• S. Razavi, R. Sheikholeslami, H. V. Gupta, and A. Haghnegahdar (2019) VARS-tool: a toolbox for comprehensive, efficient, and robust sensitivity and uncertainty analysis. Environmental Modelling & Software 112, pp. 95 – 107. External Links: ISSN 1364-8152 Cited by: §1.
• K. Reddy and V. Clinton (2016) Simulating stock prices using geometric brownian motion: evidence from australian companies. Australasian Accounting, Business and Finance Journal 10 (3), pp. 23–47. Cited by: §1.
• P. Renard, A. Alcolea, and D. Gingsbourger (2013) Stochastic versus deterministic approaches. In Environmental Modelling: Finding Simplicity in Complexity, Second Edition (eds J. Wainwright and M. Mulligan), pp. 133–149. Cited by: §1.
• A. Saarinen, M. Linne, and O. Yli-Harja (2008) Stochastic differential equation model for cerebellar granule cell excitability. PLOS Computational Biology 4 (2), pp. 1–11. External Links: Document Cited by: §1, §2.1.
• R. Sheikholeslami, S. Razavi, and A. Haghnegahdar (2019) What do we do with model simulation crashes? recommendations for global sensitivity analysis of earth and environmental systems models. Geoscientific Model Development Discussions 2019, pp. 1–32. External Links: Document Cited by: §1, §2.2.
• S. Thrun, D. Fox, W. Burgard, and F. Dellaert (2001) Robust monte carlo localization for mobile robots. Artificial intelligence 128 (1-2), pp. 99–141. Cited by: §2.1.
• E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §4.3.