1 Introduction
A simulatorbased model is a datagenerating process described by a computer program, usually with some free parameters we need to learn from data. Simulatorbased modelling lends itself naturally to scientific domains such as evolutionary biology [Beaumont et al., 2002], ecology [Wood, 2010], disease epidemics [Gutmann and Corander, 2015], economics [Gouriéroux et al., 1993] and cosmology [Schafer and Freeman, 2012], where observations are best understood as products of underlying physical processes. Inference in these models amounts to discovering plausible parameter settings that could have generated our observed data. The application domains mentioned can require properly calibrated distributions that express uncertainty over plausible parameters, rather than just point estimates, in order to reach scientific conclusions or make decisions.
As an analytical expression for the likelihood of parameters given observations is typically not available for simulatorbased models, conventional likelihoodbased Bayesian inference is not applicable. An alternative family of algorithms for likelihoodfree inference has been developed, referred to as Approximate Bayesian Computation (ABC). These algorithms simulate the model repeatedly and only accept parameter settings which generate synthetic data similar to the observed data, typically gathered in a realworld experiment.
Rejection ABC [Pritchard et al., 1999]
, the most basic ABC algorithm, simulates the model for each setting of proposed parameters, and rejects parameters if the generated data is not within a certain distance from the observations. The accepted parameters form a set of independent samples from an approximate posterior. Markov Chain Monte Carlo ABC (MCMCABC)
[Marjoram et al., 2003] is an improvement over rejection ABC which, instead of independently proposing parameters, explores the parameter space by perturbing the most recently accepted parameters. Sequential Monte Carlo ABC (SMCABC) [Beaumont et al., 2009, Bonassi and West, 2015] uses importance sampling to simulate a sequence of slowlychanging distributions, the last of which is an approximation to the parameter posterior.Conventional ABC algorithms such as the above suffer from three drawbacks. First, they only represent the parameter posterior as a set of (possibly weighted or correlated) samples. A samplebased representation easily gives estimates and error bars of individual parameters, and model predictions. However these computations are noisy, and it is not obvious how to perform some other computations using samples, such as combining posteriors from two separate analyses. Second, the parameter samples do not come from the correct Bayesian posterior, but from an approximation based on assuming a pseudoobservation that the data is within an ball centred on the data actually observed. Third, as the tolerance is reduced, it can become impractical to simulate the model enough times to match the observed data even once. When simulations are expensive to perform, good quality inference becomes impractical.
We propose a parametric approach to likelihoodfree inference, which unlike conventional ABC does not suffer from the above three issues. Instead of returning samples from an
approximation to the posterior, our approach learns a parametric approximation to the exact posterior, which can be made as accurate as required. Preliminary fits to the posterior are used to guide future simulations, which can reduce the number of simulations required to learn an accurate approximation by orders of magnitude. Our approach uses conditional density estimation with Bayesian neural networks, and draws upon advances in parametric density estimation, stochastic variational inference, and recognition networks, as discussed in the related work section.
2 Bayesian conditional density estimation for likelihoodfree inference
2.1 Simulatorbased models and ABC
Let
be a vector of parameters controlling a simulatorbased model, and let
be a data vector generated by the model. The model may be provided as a probabilistic program that can be easily simulated, and implicitly defines a likelihood , which we assume we cannot evaluate. Let encode our prior beliefs about the parameters. Given an observation , we are interested in the parameter posterior .As the likelihood is unavailable, conventional Bayesian inference cannot be carried out. The principle behind ABC is to approximate by for a sufficiently small value of , and then estimate the latter—e.g. by Monte Carlo—using simulations from the model. Hence, ABC approximates the posterior by , which is typically broader and more uncertain. ABC can trade off computation for accuracy by decreasing , which improves the approximation to the posterior but requires more simulations from the model. However, the approximation becomes exact only when , in which case simulations never match the observations, , and existing methods break down. In this paper, we refer to as the exact posterior, as it corresponds to setting in .
In most practical applications of ABC, is taken to be a fixedlength vector of summary statistics that is calculated from data generated by the simulator, rather than the raw data itself. Extracting statistics is often necessary in practice, to reduce the dimensionality of the data and maintain to practically acceptable levels. For the purposes of this paper, we will make no distinction between raw data and summary statistics, and we will regard the calculation of summary statistics as part of the data generating process.
2.2 Learning the posterior
Rather than using simulations from the model in order to estimate an approximate likelihood, , we will use the simulations to directly estimate . We will run simulations for parameters drawn from a distribution, , which we shall refer to as the proposal prior. The proposition below indicates how we can then form a consistent estimate of the exact posterior, using a flexible family of conditional densities, , parameterized by a vector .
Proposition 1.
We assume that each of a set of pairs was independently generated by
(1) 
In the limit
, the probability of the parameter vectors
is maximized w.r.t. if and only if(2) 
provided a setting of that makes proportional to exists.
Intuition: if we simulated enough parameters from the prior, the density estimator would learn a conditional of the joint prior model over parameters and data, which is the posterior . If we simulate parameters drawn from another distribution, we need to “importance reweight” the result. A more detailed proof can be found in Appendix A.
The proposition above suggests the following procedure for learning the posterior: (a) propose a set of parameter vectors from the proposal prior; (b) for each run the simulator to obtain a corresponding data vector ; (c) train with maximum likelihood on ; and (d) estimate the posterior by
(3) 
This procedure is summarized in Algorithm 2.
2.3 Choice of conditional density estimator and proposal prior
In choosing the types of density estimator and proposal prior , we need to meet the following criteria: (a) should be flexible enough to represent the posterior but easy to train with maximum likelihood; (b) should be easy to evaluate and sample from; and (c) the righthand side expression in Equation (3) should be easily evaluated and normalized.
We draw upon work on conditional neural density estimation and take to be a Mixture Density Network (MDN) [Bishop, 1994] with fully parameterized covariance matrices. That is, takes the form of a mixture of Gaussian components , whose mixing coefficients , means and covariance matrices are computed by a feedforward neural network parameterized by , taking as input. Such an architecture is capable of representing any conditional distribution arbitrarily accurately—provided the number of components
and number of hidden units in the neural network are sufficiently large—while remaining trainable by backpropagation. The parameterization of the MDN is detailed in Appendix
B.We take the proposal prior to be a single Gaussian , with mean and full covariance matrix . Assuming the prior is a simple distribution (uniform or Gaussian, as is typically the case in practice), then this choice allows us to calculate in Equation (3) analytically. That is, will be a mixture of Gaussians, whose parameters will be a function of evaluated at (as detailed in Appendix C).
2.4 Learning the proposal prior
Simple rejection ABC is inefficient because the posterior is typically much narrower than the prior . A parameter vector sampled from will rarely be plausible under and will most likely be rejected. Practical ABC algorithms attempt to reduce the number of rejections by modifying the way they propose parameters; for instance, MCMCABC and SMCABC propose new parameters by perturbing parameters they already consider plausible, in the hope that nearby parameters remain plausible.
In our framework, the key to efficient use of simulations lies in the choice of proposal prior. If we take to be the actual prior, then will learn the posterior for all , as can be seen from Equation (2). Such a strategy however is grossly inefficient if we are only interested in the posterior for . Conversely, if closely matches , then most simulations will produce samples that are highly informative in learning for . In other words, if we already knew the true posterior, we could use it to construct an efficient proposal prior for learning it.
We exploit this idea to set up a fixedpoint system. Our strategy becomes to learn an efficient proposal prior that closely approximates the posterior as follows: (a) initially take to be the prior ; (b) propose samples from and corresponding samples from the simulator, and train on them; (c) approximate the posterior using Equation (3) and set to it; (d) repeat until has converged. This procedure is summarized in Algorithm 1.
In the procedure above, as long as has only one Gaussian component () then remains a single Gaussian throughout. Moreover, in each iteration we initialize with the density estimator learnt in the iteration before, thus we keep training throughout. This initialization allows us to use a small sample size in each iteration, thus making efficient use of simulations.
As we shall demonstrate in Section 3, the procedure above learns Gaussian approximations to the true posterior fast: in our experiments typically – iterations of – samples each were sufficient. This Gaussian approximation can be used as a rough but cheap approximation to the true posterior, or it can serve as a good proposal prior in Algorithm 2 for efficiently finetuning a nonGaussian multicomponent posterior. If the second strategy is adopted, then we can reuse the singlecomponent neural density estimator learnt in Algorithm 1 to initialize in Algorithm 2. The weights in the final layer of the MDN are replicated times, with small random perturbations to break symmetry.
initialize with one component repeat for do sample sample end for retrain on until has converged;  initialize with K components // if available by Algorithm 1 // initialize by replicating its // one component times for do sample sample end for train on 
2.5 Use of Bayesian neural density estimators
To make Algorithm 1 as efficient as possible, the number of simulations per iteration should be kept small, while at the same time it should provide a sufficient training signal for . With a conventional MDN, if is made too small, there is a danger of overfitting, especially in early iterations, leading to overconfident proposal priors and an unstable procedure. Early stopping could be used to avoid overfitting; however a significant fraction of the samples would have to be used as a validation set, leading to inefficient use of simulations.
As a better alternative, we developed a Bayesian version of the MDN using Stochastic Variational Inference (SVI) for neural networks [Kingma and Welling, 2013]. We shall refer to this Bayesian version of the MDN as MDNSVI. An MDNSVI has two sets of adjustable parameters of the same size, the means
and the log variances
. The means correspond to the parameters of a conventional MDN. During training, Gaussian noise of variance is added to the means independently for each training example . The Bayesian interpretation of this procedure is that it optimizes a variational Gaussian posterior with a diagonal covariance matrix over parameters . At prediction time, the noise is switched off and the MDNSVI behaves like a conventional MDN with . Appendix D details the implementation and training of MDNSVI. We found that using an MDNSVI instead of an MDN improves the robustness and efficiency of Algorithm 1 because (a) MDNSVI is resistant to overfitting, allowing us to use a smaller number of simulations ; (b) no validation set is needed, so all samples can be used for training; and (c) since overfitting is not an issue, no careful tuning of training time is necessary.3 Experiments
We showcase three versions of our approach: (a) learning the posterior with Algorithm 2 where is a conventional MDN and the proposal prior is taken to be the actual prior , which we refer to as MDN with prior; (b) training a proposal prior with Algorithm 1 where is an MDNSVI, which we refer to as proposal prior; and (c) learning the posterior with Algorithm 2 where is an MDNSVI and the proposal prior is taken to be the one learnt in (b), which we refer to as MDN with proposal. All MDNs were trained using Adam [Kingma and Ba, 2014] with its default parameters.
We compare to three ABC baselines: (a) rejection ABC [Pritchard et al., 1999], where parameters are proposed from the prior and are accepted if ; (b) MCMCABC [Marjoram et al., 2003] with a spherical Gaussian proposal, whose variance we manually tuned separately in each case for best performance; and (c) SMCABC [Beaumont et al., 2009], where the sequence of ’s was exponentially decayed, with a decay rate manually tuned separately in each case for best performance. MCMCABC was given the unrealistic advantage of being initialized with a sample from rejection ABC, removing the need for an otherwise necessary burnin period. Code for reproducing the experiments is provided at https://github.com/gpapamak/epsilon_free_inference.
3.1 Mixture of two Gaussians
The first experiment is a toy problem where the goal is to infer the common mean of a mixture of two 1D Gaussians, given a single datapoint . The setup is
(4) 
where , , , , and . The posterior can be calculated analytically, and is proportional to an equal mixture of two Gaussians centred at with variances and , restricted to . This problem is often used in the SMCABC literature to illustrate the difficulty of MCMCABC in representing long tails. Here we use it to demonstrate the correctness of our approach and its ability to accurately represent nonGaussian longtailed posteriors.
Figure 1 shows the results of neural density estimation using each strategy. All MDNs have one hidden layer with tanh units and Gaussian components, except for the proposal prior MDN which has a single component. Both MDN with prior and MDN with proposal learn good parametric approximations to the true posterior, and the proposal prior is a good Gaussian approximation to it. We used K simulations to train the MDN with prior, whereas the prior proposal took iterations of simulations each to train, and the MDN with proposal took simulations on top of the previous . The MDN with prior learns the posterior distributions for a large range of possible observations (middle plot of Figure 1
), whereas the MDN with proposal gives accurate posterior probabilities only near the value actually observed (right plot of Figure
1).3.2 Bayesian linear regression
In Bayesian linear regression, the goal is to infer the parameters
of a linear map from noisy observations of outputs at known inputs. The setup is(5) 
where we took , , , randomly generated inputs from a standard Gaussian, and randomly generated observations from the model. In our setup, and have and dimensions respectively. The posterior is analytically tractable, and is a single Gaussian.
All MDNs have one hidden layer of tanh units and one Gaussian component. ABC methods were run for a sequence of decreasing ’s, up to their failing points. To measure the approximation quality to the posterior, we analytically calculated the KL divergence from the true posterior to the learnt posterior (which for ABC was taken to be a Gaussian fit to the set of returned posterior samples). The left of Figure 2 shows the approximation quality vs ; MDN methods are shown as horizontal lines. As is decreased, ABC methods sample from an increasingly better approximation to the true posterior, however they eventually reach their failing point, or take prohibitively long. The best approximations are achieved by MDN with proposal and a very long run of SMCABC.
The middle of Figure 2 shows the increase in number of simulations needed to improve approximation quality (as decreases). We quote the total number of simulations for MDN training, and the number of simulations per effective sample for ABC. Appendix E describes how the number of effective samples is calculated. The number of simulations per effective sample should be multiplied by the number of effective samples needed in practice. Moreover, SMCABC will not work well with only one particle, so many times the quoted cost will always be needed. Here, MDNs make more efficient use of simulations than Monte Carlo ABC methods. Sequentially fitting a prior proposal was more than ten times cheaper than training with prior samples, and more accurate.
3.3 Lotka–Volterra predatorprey population model
The Lotka–Volterra model is a stochastic Markov jump process that describes the continuous time evolution of a population of predators interacting with a population of prey. There are four possible reactions: (a) a predator being born, (b) a predator dying, (c) a prey being born, and (d) a prey being eaten by a predator. Positive parameters control the rate of each reaction. Given a set of statistics calculated from an observed population time series, the objective is to infer . We used a flat prior over , and calculated a set of statistics . The full setup is detailed in Appendix F. The Lotka–Volterra model is commonly used in the ABC literature as a realistic model which can be simulated, but whose likelihood is intractable. One of the properties of Lotka–Volterra is that typical naturelike observations only occur for very specific parameter settings, resulting in narrow, Gaussianlike posteriors that are hard to recover.
The MDN trained with prior has two hidden layers of tanh units each, whereas the MDNSVI used to train the proposal prior and the MDNSVI trained with proposal have one hidden layer of tanh units. All three have one Gaussian component. We found that using more than one components made no difference to the results; in all cases the MDNs chose to use only one component and switch the rest off, which is consistent with our observation about the nearGaussianity of the posterior.
We measure how well each method retrieves the true parameter values that were used to generate by calculating their log probability under each learnt posterior; for ABC a Gaussian fit to the posterior samples was used. The left panel of Figure 3 shows how this log probability varies with , demonstrating the superiority of MDN methods over ABC. In the middle panel we can see that MDN training with proposal makes efficient use of simulations compared to training with prior and ABC; note that for ABC the number of simulations is only for one effective sample. In the right panel, we can see that the estimates returned by MDN methods are more confident around the true parameters compared to ABC, because the MDNs learn the exact posterior rather than an inflated version of it like ABC does (plots for the other three parameters look similar).
We found that when training an MDN with a welltuned proposal that focuses on the plausible region, an MDN with fewer parameters is needed compared to training with the prior. This is because the MDN trained with proposal needs to learn only the local relationship between and near , as opposed to in the entire domain of the prior. Hence, not only are savings achieved in number of simulations, but also training the MDN itself becomes more efficient.
3.4 M/G/1 queue model
The M/G/1 queue model describes the processing of a queue of continuously arriving jobs by a single server. In this model, the time the server takes to process each job is independently and uniformly distributed in the interval
. The time interval between arrival of two consecutive jobs is independently and exponentially distributed with rate
. The server observes only the time intervals between departure of two consecutive jobs. Given a set of equallyspaced percentiles of interdeparture times, the task is to infer parameters . This model is easy to simulate but its likelihood is intractable, and it has often been used as an ABC benchmark [Blum and François, 2010, Meeds and Welling, 2015]. Unlike Lotka–Volterra, data is weakly informative about , and hence the posterior over tends to be broad and nonGaussian. In our setup, we placed flat independent priors over , and , and we took to be equally spaced percentiles, as detailed in Appendix G.The MDN trained with prior has two hidden layers of tanh units each, whereas the MDNSVI used to train the proposal prior and the one trained with proposal have one hidden layer of tanh units. As observed in the Lotka–Volterra demo, less capacity is required when training with proposal, as the relationship to be learned is local and hence simpler, which saves compute time and gives a more accurate final posterior. All MDNs have Gaussian components (except the MDNSVI used to train the proposal prior, which always has one), which, after experimentation, we determined are enough for the MDNs to represent the nonGaussian nature of the posterior.
Figure 4 reports the log probability of the true parameters under each posterior learnt—for ABC, the log probability was calculated by fitting a mixture of
Gaussians to posterior samples using ExpectationMaximization—and the number of simulations needed to achieve it. As before, MDN methods are more confident compared to ABC around the true parameters, which is due to ABC computing a broader posterior than the true one. MDN methods make more efficient use of simulations, since they use all of them for training and, unlike ABC, do not throw a proportion of them away.
4 Related work
Regression adjustment. An early parametric approach to ABC is regression adjustment, where a parametric regressor is trained on simulation data in order to learn a mapping from to . The learnt mapping is then used to correct for using a large , by adjusting the location of posterior samples gathered by e.g. rejection ABC. Beaumont et al. [2002] used linear regressors, and later Blum and François [2010] used neural networks with one hidden layer that separately predicted the mean and variance of . Both can be viewed as rudimentary density estimators and as such they are a predecessor to our work. However, they were not flexible enough to accurately estimate the posterior, and they were only used within some other ABC method to allow for a larger . In our work, we make conditional density estimation flexible enough to approximate the posterior accurately.
Synthetic likelihood. Another parametric approach is synthetic likelihood
, where parametric models are used to estimate the likelihood
. Wood [2010] used a single Gaussian, and later Fan et al. [2013] used a mixture Gaussian model. Both of them learnt a separate density model of for each by repeatedly simulating the model for fixed . More recently, Meeds and Welling [2014]used a Gaussian process model to interpolate Gaussian likelihood approximations between different
’s. Compared to learning the posterior, synthetic likelihood has the advantage of not depending on the choice of proposal prior. Its main disadvantage is the need of further approximate inference on top of it in order to obtain the posterior. In our work we directly learn the posterior, eliminating the need for further inference, and we address the problem of correcting for the proposal prior.Efficient Monte Carlo ABC. Recent work on ABC has focused on reducing the simulation cost of samplebased ABC methods. Hamiltonian ABC [Meeds et al., 2015] improves upon MCMCABC by using stochastically estimated gradients in order to explore the parameter space more efficiently. Optimization Monte Carlo ABC [Meeds and Welling, 2015] explicitly optimizes the location of ABC samples, which greatly reduces rejection rate. Bayesian optimization ABC [Gutmann and Corander, 2015] models as a Gaussian process and then uses Bayesian optimization to guide simulations towards the region of small distances . In our work we show how a significant reduction in simulation cost can also be achieved with parametric methods, which target the posterior directly.
Recognition networks
. Our use of neural density estimators for learning posteriors is reminiscent of recognition networks in machine learning. A recognition network is a neural network that is trained to invert a generative model. The Helmholtz machine
[Dayan et al., 1995], the variational autoencoder [Kingma and Welling, 2013] and stochastic backpropagation [Rezende et al., 2014] are examples where a recognition network is trained jointly with the generative network it is designed to invert. Feedforward neural networks have been used to invert blackbox generative models [Nair et al., 2008]and binaryvalued Bayesian networks
Morris [2001], and convolutional neural networks have been used to invert a physics engine
[Wu et al., 2015]. Our work illustrates the potential of recognition networks in the field of likelihoodfree inference, where the generative model is fixed, and inference of its parameters is the goal.Learning proposals. Neural density estimators have been employed in learning proposal distributions for importance sampling [Papamakarios and Murray, 2015] and Sequential Monte Carlo [Gu et al., 2015, Paige and Wood, 2016]. Although not our focus here, our fit to the posterior could also be used within Monte Carlo inference methods. In this work we see how far we can get purely by fitting a series of conditional density estimators.
5 Conclusions
Bayesian conditional density estimation improves likelihoodfree inference in three main ways: (a) it represents the posterior parametrically, as opposed to as a set of samples, allowing for probabilistic evaluations later on in the pipeline; (b) it targets the exact posterior, rather than an approximation of it; and (c) it makes efficient use of simulations by not rejecting samples, by interpolating between samples, and by gradually focusing on the plausible parameter region. Our belief is that neural density estimation is a tool with great potential in likelihoodfree inference, and our hope is that this work helps in establishing its usefulness in the field.
Acknowledgments
We thank Amos Storkey for useful comments. George Papamakarios is supported by the Centre for Doctoral Training in Data Science, funded by EPSRC (grant EP/L016427/1) and the University of Edinburgh, and by Microsoft Research through its PhD Scholarship Programme.
Appendix A Proof of Proposition 1
Maximizing w.r.t. is equivalent to maximizing the average log probability
(6) 
Since
, due to the strong law of large numbers, as
the average log probability converges almost surely to the following expectation(7) 
Let be a distribution over . Maximizing the above expectation w.r.t. is equivalent to minimizing
(8) 
The above KL divergence is minimized (and becomes ) if and only if
(9) 
almost everywhere. It is easy to see that this can only happen for , since
(10) 
Thus, taking as above, and assuming a setting of that makes the KL equal to exists, the KL is minimized if and only if we have almost everywhere that
(11) 
A corollary of the above is that
(12) 
in other words, becomes what the posterior would be if the prior were .
Appendix B Parameterization and training of Mixture Density Networks
A Mixture Density Network (MDN) [Bishop, 1994] is a conditional density estimator , which takes the form of a mixture of Gaussian components, as follows
(13) 
The mixing coefficients , means and covariance matrices are computed by a feedforward neural network , which has input , weights and biases . In particular, let the output of the neural network be
(14) 
Then, the mixing coefficients are given by
(15) 
The softmax ensures that the mixing coefficients are strictly positive and sum to one. Similarly, the means are given by
(16) 
As for the covariance matrices, we need to ensure that they are symmetric and positive definite. For this reason, instead of parameterizing the covariance matrices directly, we parameterize the Cholesky factorization of their inverses. That is, we rewrite
(17) 
where is parameterized to be an upper triangular matrix with strictly positive elements in the diagonal, as follows
(18)  
(19)  
(20) 
In the above, picks out the diagonal elements, whereas and pick out the elements above and below the diagonal respectively. We chose to parameterize the factorization of rather than that of , since it is the inverse covariance that directly appears in the calculation of . Apart from ensuring the symmetry and positive definiteness of , the above parameterization also allows for efficiently calculating the log determinant of as follows
(21) 
The above parameterization of the covariance matrix was introduced by Williams [1996] for learning conditional Gaussians.
Given a set of training data , training the MDN with maximum likelihood amounts to maximizing the average log probability
(22) 
with respect to the MDN parameters
(23) 
Because the reparameterization
described above is unconstrained, any offtheshelf gradientbased stochastic optimizer can be used. Gradients of the average log probability can be easily computed with backpropagation. In our experiments, we implemented MDNs using Theano
[Theano Development Team, 2016], which automatically backpropagates gradients, and we maximized the average log likelihood using Adam [Kingma and Ba, 2014], which is currently the state of the art in minibatchbased stochastic optimization.Appendix C Analytical calculation of parameter posterior
According to Proposition 1, after training , the posterior at is approximated by
(24) 
Typically, the prior is a simple distribution like a uniform or a Gaussian. Here we will consider the uniform case, while the Gaussian case is treated analogously. Let be uniform everywhere (improper). Then the posterior estimate becomes
(25) 
In practice, we also used this estimate for uniform priors with broad but finite support. Since is a mixture of Gaussians and is a single Gaussian, that is
(26) 
their ratio can be calculated and normalized analytically. In particular, after some algebra it can be shown that the posterior estimate is also a mixture of Gaussians
(27) 
whose parameters are
(28)  
(29)  
(30) 
where quantities are given by
(31) 
For the above mixture to be well defined, the covariance matrices must be positive definite. This will not be the case if the proposal prior is narrower than some component of along some dimension. However, in both Algorithms 1 and 2, is trained on parameters sampled from , hence, if trained properly, it tends to be narrower than . Our experience with Algorithms 1 and 2 is that not being positive definite rarely happens, whereas it happening is an indication that the algorithm’s parameters have not been set up properly.
Appendix D Stochastic Variational Inference for Mixture Density Networks
In this section we describe our adaptation of Stochastic Variational Inference (SVI) for neural networks [Kingma and Welling, 2013], in order to develop a Bayesian version of MDN. The first step is to express beliefs about the MDN parameters
as independent Gaussian random variables with means
and log variances . Under this interpretation we can rewrite the parameters as(32) 
where the symbol denotes elementwise multiplication and is an unknown vector drawn from a standard normal,
(33) 
The above parameterization induces the following variational distribution over
(34) 
where denotes a diagonal covariance matrix whose diagonal is the vector . Moreover, we place the following Bayesian prior over
(35) 
Under this prior, before seeing any data we set the parameter means all to zero, and the parameter log variances all equal to . In our experiments, we used a default value of .
Given training data , the objective of SVI is to optimize and so as to make the variational distribution be as close as possible (in KL) to the true Bayesian posterior over . This objective is equivalent to maximizing a variational lower bound,
(36) 
with respect to and . The expectations in the first term of the above can be stochastically approximated by randomly drawing ’s from a standard normal. The KL term can be calculated analytically, which yields
(37) 
The above optimization problem has been parameterized in such a way that and are unconstrained. Moreover, the derivatives of the variational lower bound with respect to and can be easily calculated with backpropagation after stochastic approximations to the expectations have been made. This allows the use of any offtheshelf gradientbased stochastic optimizer. In our experiments, we implemented MDNSVI in Theano [Theano Development Team, 2016], which automatically calculates derivatives with backpropagation, and used Adam [Kingma and Ba, 2014] for stochastic maximization of the variational lower bound.
An important practical detail for stochastically approximating the expectation terms is the local reparameterization trick [Kingma et al., 2015], which leverages the layered feedforward structure of the MDN. Consider any hidden or output unit in the MDN; if is its activation and is the vector of its inputs, then the relationship between and is always of the form
(38) 
where and are the weights and bias respectively associated with this unit. As we have seen, in the SVI framework these weights and biases are Gaussian random variables with means and , and log variances and
. It is easy to see that this induces a Gaussian distribution over activation
, whose mean and variance is given by(39) 
where denotes elementwise multiplication. Therefore, randomly sampling and in order to estimate the expectations and their gradients in the variational lower bound is equivalent to directly sampling from a Gaussian with the above mean and variance. This trick saves computation by reducing calls to the random number generator, but more importantly it reduces the variance of the stochastic approximation of the expectations (intuitively this is because less randomness is involved) and hence it makes stochastic optimization more stable and faster to converge.
Appendix E Effective sample size of ABC methods
Rejection ABC returns a set of independent samples, MCMCABC returns a set of correlated samples, and SMCABC returns a set of independent but weighted samples. To make a fair comparison between them in terms of simulation cost, we quote the number of simulations per effective sample, that is, the total number of simulations divided by the effective sample size of the returned set of samples.
Let be a set of samples, not necessarily independent. The effective sample size is defined to be the number of equivalent independent samples that would give an estimator of equal variance. For rejection ABC , since all returned samples are independent.
Suppose that each sample is a vector of components. For MCMCABC, where samples come in the form of autocorrelated sequences, we estimated the effective sample size for component as
(40) 
where is the autocorrelation coefficient of component at lag , estimated from the samples. We calculated the summation up to lag , which corresponds to the first autocorrelation coefficient that is equal to . Then we took the effective sample size to be the minimum across components. For a more general discussion on estimating autocorrelation time (which is equal to and thus equivalent to effective sample size) see Thompson [2010].
For SMCABC, each sample is independent but comes with a corresponding nonnegative weight . The weights have to sum to one, that is . We estimated the effective sample size by
(41) 
It is easy to see that if for all then , and if all weights but one are then . For a discussion regarding the above estimate see Nowozin [2015].
Appendix F Setup for the Lotka–Volterra experiment
The Lotka–Volterra model [Wilkinson, 2011] is a stochastic model that was developed to describe the time evolution of a population of predators interacting with a population of prey. Let be the number of predators and be the number of prey. The model asserts that the following four reactions can take place, with corresponding rates:

[label=()]

A predator may be born, with rate , increasing by one.

A predator may die, with rate , decreasing by one.

A prey may be born, with rate , increasing by one.

A prey may be eaten by a predator, with rate , decreasing by one.
Given initial populations and , the above model can be simulated using Gillespie’s algorithm [Gillespie, 1977], as follows:

[label=()]

Draw the time to next reaction from an exponential distribution with rate equal to the total rate .

Select a reaction at random, with probability proportional to its rate.

Simulate the reaction, and go to step 1.
In our experiments, each simulation started with initial populations and , and took place for a total of time units. We recorded the values of and after every time units, resulting in two time series of values each.
Data was taken to be the following set of statistics calculated from the time series:

[label=()]

The mean of each time series.

The log variance of each time series.

The autocorrelation coefficient of each time series at lag and lag .

The crosscorrelation coefficient between the two time series.
Since the above statistics have potentially very different scales, we normalized them on the basis of a pilot run. That is, we performed a pilot run of simulations, calculated and stored the mean and standard deviation of each statistic across pilot simulations, and used them in all subsequent simulations to normalize each statistic by subtracting the pilot mean and dividing by the pilot standard deviation. This choice of statistics and normalization process was taken from Wilkinson [2013].
From our experience with the model we observed that typical evolutions of the predator/prey populations for randomly selected parameters include (a) the predators quickly eating all the prey and then slowly decaying exponentially, or (b) the predators quickly dying out and then the prey growing exponentially. However, for certain carefully tuned values of , the two populations exhibit an oscillatory behaviour, typical of natural ecological systems. In order to generate observations for our experimental setup, we set the parameters to
(42) 
and simulated the model to generate . We carefully chose parameter values that give rise to oscillatory behaviour (see Figure 5 for typical examples of population evolution corresponding to the above parameters). Since only a small subset of parameters give rise to such oscillatory behaviour, the posterior is expected to be tightly peaked around the true parameter values. We tested our algorithms by evaluating how well (in terms of assigned log probability) each algorithm retrieves the true parameters.
Finally, we took the prior over to be uniform in the log domain. That is, the prior was taken to be
(43) 
where and , which of course includes the true parameters. All our inferences where done in the log domain.
Appendix G Setup for the M/G/1 experiment
The M/G/1 queue model [Shestopaloff and Neal, 2014] is a statistical model that describes how a single server processes a queue formed by a set of continuously arriving jobs. Let be the total number of jobs to be processed, be the time the server takes to process job , be the time that job entered the queue, and be the time that job left the queue (i.e. the time when the server finished processing it). The M/G/1 queue model asserts that for each job we have
(44)  
(45)  
(46) 
In the above equations, denotes a uniform distribution in the range , denotes an exponential distribution with rate , and . In our experiments we used a total of jobs.
The goal is to infer parameters if the only knowledge is a set of percentiles of the empirical distribution of the interdeparture times for . In our experiments we used equally spaced percentiles. That is, given a set of interdeparture times , we took to be the th, th, th, th and th percentiles of the set of interdeparture times. Note that the th and th percentiles correspond to the minimum and maximum element in the set.
Since different percentiles can have different scales and strong correlations between them, we whitened the data on the basis of a pilot run. That is, we performed K pilot simulations, and recorded the mean vector and covariance matrix of the resulting percentiles. For each subsequent simulation, we calculated from resulting percentiles by subtracting the mean vector and decorrelating and normalizing with the covariance matrix.
To generate observed data , we set the parameters to the following values
(47) 
and simulated the model to get
. We evaluated inference algorithms by how well the true parameter values were retrieved, as measured by log probability under computed posteriors. Finally, the prior probability of the parameters was taken to be
(48)  
(49)  
(50) 
which is uniform, albeit not axisaligned, and of course includes the true parameters.
References
 Beaumont et al. [2002] M. A. Beaumont, W. Zhang, and D. J. Balding. Approximate Bayesian Computation in population genetics. Genetics, 162:2025–2035, Dec. 2002.
 Beaumont et al. [2009] M. A. Beaumont, J.M. Cornuet, J.M. Marin, and C. P. Robert. Adaptive Approximate Bayesian Computation. Biometrika, 96(4):983–990, 2009.
 Bishop [1994] C. M. Bishop. Mixture density networks. Technical Report NCRG/94/004, Aston University, 1994.
 Blum and François [2010] M. G. B. Blum and O. François. Nonlinear regression models for Approximate Bayesian Computation. Statistics and Computing, 20(1):63–73, 2010.
 Bonassi and West [2015] F. V. Bonassi and M. West. Sequential Monte Carlo with adaptive weights for Approximate Bayesian Computation. Bayesian Analysis, 10(1):171–187, Mar. 2015.
 Dayan et al. [1995] P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel. The Helmholtz machine. Neural Computation, 7:889–904, 1995.
 Fan et al. [2013] Y. Fan, D. J. Nott, and S. A. Sisson. Approximate Bayesian Computation via regression density estimation. Stat, 2(1):34–48, 2013.
 Gillespie [1977] D. T. Gillespie. Exact stochastic simulation of coupled chemical reactions. The Journal of Physical Chemistry, 81(25):2340–2361, 1977.
 Gouriéroux et al. [1993] C. Gouriéroux, A. Monfort, and E. Renault. Indirect inference. Journal of Applied Econometrics, 8(S1):S85–S118, 1993.
 Gu et al. [2015] S. Gu, Z. Ghahramani, and R. E. Turner. Neural adaptive Sequential Monte Carlo. Advances in Neural Information Processing Systems 28, pages 2629–2637, 2015.
 Gutmann and Corander [2015] M. U. Gutmann and J. Corander. Bayesian optimization for likelihoodfree inference of simulatorbased statistical models. arXiv eprints, abs/1501.03291v3, 2015.
 Kingma and Ba [2014] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations, 2014.
 Kingma and Welling [2013] D. P. Kingma and M. Welling. Autoencoding variational Bayes. Proceedings of the 2nd International Conference on Learning Representations, 2013.
 Kingma et al. [2015] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. Advances in Neural Information Processing Systems 28, pages 2575–2583, 2015.
 Marjoram et al. [2003] P. Marjoram, J. Molitor, V. Plagnol, and S. Tavaré. Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 100(26):15324–15328, Dec. 2003.

Meeds and Welling [2014]
E. Meeds and M. Welling.
GPSABC: Gaussian Process Surrogate Approximate
Bayesian Computation.
Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence
, 30, 2014.  Meeds et al. [2015] E. Meeds, R. Leenders, and M. Welling. Hamiltonian ABC. Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, pages 582–591, 2015.
 Meeds and Welling [2015] T. Meeds and M. Welling. Optimization Monte Carlo: Efficient and embarrassingly parallel likelihoodfree inference. Advances in Neural Information Processing Systems 28, pages 2071–2079, 2015.
 Morris [2001] Q. Morris. Recognition networks for approximate inference in BN20 networks. Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, pages 370–377, 2001.
 Nair et al. [2008] V. Nair, J. Susskind, and G. E. Hinton. Analysisbysynthesis by learning to invert generative black boxes. Proceedings of the 18th International Conference on Artificial Neural Networks, 5163:971–981, 2008.
 Nowozin [2015] S. Nowozin. Effective sample size in importance sampling, Aug. 2015. URL http://www.nowozin.net/sebastian/blog/effectivesamplesizeinimportancesampling.html. Accessed on 18 May 2016.
 Paige and Wood [2016] B. Paige and F. Wood. Inference networks for Sequential Monte Carlo in graphical models. Proceedings of the 33rd International Conference on Machine Learning, 2016.
 Papamakarios and Murray [2015] G. Papamakarios and I. Murray. Distilling intractable generative models. Probabilistic Integration Workshop at Neural Information Processing Systems, 2015.
 Pritchard et al. [1999] J. K. Pritchard, M. T. Seielstad, A. PerezLezaun, and M. W. Feldman. Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Molecular Biology and Evolution, 16(12):1791–1798, 1999.
 Rezende et al. [2014] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. Proceedings of the 31st International Conference on Machine Learning, pages 1278–1286, 2014.
 Schafer and Freeman [2012] C. M. Schafer and P. E. Freeman. Likelihoodfree inference in cosmology: Potential for the estimation of luminosity functions. Statistical Challenges in Modern Astronomy V, pages 3–19, 2012.
 Shestopaloff and Neal [2014] A. Y. Shestopaloff and R. M. Neal. On Bayesian inference for the M/G/1 queue with efficient MCMC sampling. arXiv eprints, abs/1401.5548v1, 2014.
 Theano Development Team [2016] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv eprints, abs/1605.02688v1, May 2016.
 Thompson [2010] M. B. Thompson. A comparison of methods for computing autocorrelation time. arXiv eprints, abs/1011.0175v1, Oct. 2010.
 Wilkinson [2011] D. J. Wilkinson. Stochastic Modelling for Systems Biology, Second Edition. Chapman & Hall/CRC Mathematical and Computational Biology. Taylor & Francis, 2011.
 Wilkinson [2013] D. J. Wilkinson. Summary stats for ABC, Sept. 2013. URL https://darrenjw.wordpress.com/2013/09/01/summarystatsforabc/. Accessed on 16 May 2016.
 Williams [1996] P. M. Williams. Using neural networks to model conditional multivariate densities. Neural Computation, 8(4):843–854, May 1996.
 Wood [2010] S. N. Wood. Statistical inference for noisy nonlinear ecological dynamic systems. Nature, 466(7310):1102–1104, 2010.

Wu et al. [2015]
J. Wu, I. Yildirim, J. J. Lim, B. Freeman, and J. Tenenbaum.
Galileo: Perceiving physical object properties by integrating a physics engine with deep learning.
Advances in Neural Information Processing Systems 28, pages 127–135, 2015.
Comments
There are no comments yet.