1 Introduction
Bayesian inference has had great success in recent decades (Green et al., 2015), but remains challenging in models with a complex posterior dependence structure, e.g. those involving latent variables. Monte Carlo methods are one state-of-the-art approach to inference. These produce samples from the posterior distribution. However, in many settings it remains challenging to design good mechanisms to propose plausible samples, despite many advances (Cappé et al., 2004; Cornuet et al., 2012; Graham and Storkey, 2017; Whitaker et al., 2017).
We focus on one simple Monte Carlo method: importance sampling (IS). This weights draws from a proposal distribution so that the weighted sample can be viewed as representing a target distribution, such as the posterior. IS can be used in almost any setting, including in the presence of strong posterior dependence or discrete random variables. However, it only achieves a representative weighted sample at a feasible cost if the proposal is a reasonable approximation to the target distribution.
An alternative to Monte Carlo is to use optimisation to find the best approximation to the posterior from a family of distributions. Typically this is done in the framework of variational inference (VI). VI is computationally efficient but has the drawback that it often produces poor approximations to the posterior distribution e.g. through overconcentration (Turner et al., 2008; Yao et al., 2018).
A recent improvement in VI is due to the development of a range of flexible and computationally tractable distributional families using normalising flows (Dinh et al., 2016; Papamakarios, 2019). These transform a simple base random distribution to a complex distribution, using a sequence of learnable transformations.
We propose an alternative to variational inference for training the parameters of an approximate posterior density, typically a normalising flow, which we call the distilled density. Our method alternates two steps. The first is importance sampling, using the current distilled density as the proposal. The target distribution is an approximate posterior, based on tempering, which is an improvement on the proposal. The second step is to use the resulting weighted sample to train the distilled density further. Following Li et al. (2017), we refer to this as distilling the importance sampling results. By iteratively distilling IS results, we can target increasingly accurate posterior approximations, i.e. reduce the amount of tempering.
Each step of our distilled importance sampling (DIS) method aims to reduce the Kullback-Leibler (KL) divergence from the distilled density to the current tempered posterior. This is known as the inclusive KL divergence, as minimising it tends to produce a density which is overdispersed compared to the tempered posterior. Such a distribution is well suited to being an IS proposal distribution.
In the remainder of the paper, Section 2 presents background on Bayesian inference and normalising flows. Sections 3 and 4 describe our method. Section 5 illustrates it on a simple two dimensional inference task. Section 6 gives a more challenging queueing model example. Section 7 concludes with a discussion, including limitations of the approach and opportunities for future improvements.
A link to code for the examples will be included in the final version of this paper. All examples were run on a 6-core desktop PC.
Related work
Several recent papers (Müller et al., 2018; Cotter et al., 2019; Duan, 2019) learn a density defined via a transformation to use as an importance sampling proposal. A novelty of our approach is using a sequential approach based on tempering.
A related approach is to distill Markov chain Monte Carlo output, but this turns out to be more difficult than for IS. One reason is that optimising the KL divergence typically requires unbiased estimates of it or related quantities (e.g. its gradient), but MCMC only provides unbiased estimates asymptotically.
Li et al. (2017) and Parno and Marzouk (2018) proceed by using biased estimates, while Ruiz and Titsias (2019) introduce an alternative, more tractable divergence. However IS, as we shall see, can produce unbiased estimates of the required KL gradient.

Approximate Bayesian computation (ABC) methods (Marin et al., 2012; Del Moral et al., 2012) involve simulating datasets under various parameters to find those which produce close matches to the observations. However, close matches are rare unless the observations are low dimensional. Hence ABC typically uses dimension reduction of the observations through summary statistics, which reduces inference accuracy. Our method can instead learn a joint proposal distribution for the parameters and all the random variables used in simulating a dataset (see Section 6 for details). Hence it can control the simulation process to frequently output data similar to the full observations.
Conditional density estimation methods (see e.g. Le et al., 2017; Papamakarios et al., 2018; Grazian and Fan, 2019) fit a joint distribution to parameters and data from simulations. One can then condition on the observed data to approximate its posterior distribution. These methods also sometimes require dimension reduction, and can perform poorly when the observed data is unlike the simulations. Our approach avoids these difficulties by directly finding parameters which can reproduce the full observations.
More broadly, DIS has connections to several inference methods. Concentrating on its IS component, it is closely related to adaptive importance sampling (Cornuet et al., 2012) and sequential Monte Carlo (SMC) (Del Moral et al., 2006). Concentrating on training an approximate density, it can be seen as a version of the cross-entropy method (Rubinstein, 1999).
2 Background
This section presents background on Bayesian inference and normalising flows.
2.1 Bayesian framework
We observe data $y_0$, assumed to be the output of a probability model $p(y \mid \theta)$ under some parameters $\theta$. Given a prior density $\pi(\theta)$, we aim to find the corresponding posterior $\pi(\theta \mid y_0)$.

Many probability models involve latent variables $x$, so that $p(y \mid \theta) = \int p(y, x \mid \theta) \, dx$. To avoid computing this integral we will attempt to infer the joint posterior $\pi(\theta, x \mid y_0)$, and then marginalise to get $\pi(\theta \mid y_0)$. For convenience we introduce $\xi$ to represent the collection of parameters and latent variables. (For models without latent variables, $\xi = \theta$.)

We now wish to infer $p(\xi \mid y_0)$. Typically we can only evaluate an unnormalised version $\tilde{p}(\xi)$, e.g. the prior density times the complete-data likelihood. Then $p(\xi \mid y_0) = \tilde{p}(\xi)/Z$, where $Z$ is an intractable normalising constant.
2.2 Tempering

Rather than targeting the posterior directly, we use a family of unnormalised tempered targets $\tilde{p}_\epsilon(\xi)$ indexed by a tempering parameter $\epsilon \geq 0$, where small $\epsilon$ gives a target close to the posterior and larger $\epsilon$ gives an easier inference problem. Sections 5 and 6 give concrete choices of tempering.
2.3 Importance sampling
Let $p(\xi) = \tilde{p}(\xi)/Z$ be a target density, such as a tempered posterior, where only $\tilde{p}$ can be evaluated. Importance sampling (IS) is a Monte Carlo method to estimate expectations of the form $I = \mathrm{E}_{\xi \sim p}[f(\xi)]$ for some function $f$. Here we give an overview of relevant aspects. For full details see e.g. Robert and Casella (2013) and Rubinstein and Kroese (2016).
IS requires a proposal density $q(\xi)$ which can easily be sampled from, and which must satisfy

(1) $\operatorname{supp}(p) \subseteq \operatorname{supp}(q)$,

where $\operatorname{supp}$ denotes support. Then

(2) $I = \mathrm{E}_{\xi \sim q}\left[ \frac{p(\xi)}{q(\xi)} f(\xi) \right]$.

So an unbiased Monte Carlo estimate of $I$ is

(3) $\hat{I}_1 = \frac{1}{NZ} \sum_{i=1}^{N} w_i f(\xi_i)$,

where $\xi_1, \ldots, \xi_N$ are independent samples from $q$, and $w_i = \tilde{p}(\xi_i)/q(\xi_i)$ gives a corresponding importance weight.
Typically $Z$ is estimated as $\hat{Z} = \frac{1}{N} \sum_{i=1}^{N} w_i$, giving

(4) $\hat{I}_2 = \sum_{i=1}^{N} w_i f(\xi_i) \Big/ \sum_{i=1}^{N} w_i$,

a biased, but consistent, estimate of $I$. Equivalently, $\hat{I}_2 = \sum_{i=1}^{N} \bar{w}_i f(\xi_i)$ for normalised importance weights $\bar{w}_i = w_i / \sum_{j=1}^{N} w_j$.
A drawback of importance sampling is that the variance of its estimates can be large, or infinite, if $q$ is a poor approximation to $p$. Hence diagnostics for the quality of the results are useful. A popular diagnostic is the effective sample size (ESS),

(5) $N_{\mathrm{ESS}} = \left( \sum_{i=1}^{N} w_i \right)^2 \Big/ \sum_{i=1}^{N} w_i^2$.

For most functions $f$, the variance of $\hat{I}_2$ roughly equals the variance of an idealised Monte Carlo estimate based on $N_{\mathrm{ESS}}$ independent samples from $p$ (Liu, 1996).
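To make the self-normalised estimate (4) and the ESS diagnostic (5) concrete, here is a small NumPy sketch on a toy one dimensional problem; the target, proposal, and all constants are illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem (illustrative): unnormalised target p~(x) = exp(-x^2 / 2),
# i.e. N(0, 1) up to a constant, with proposal q = N(0, 2^2).
def log_p_tilde(x):
    return -0.5 * x**2

def log_q(x):
    return -0.5 * (x / 2.0)**2 - np.log(2.0) - 0.5 * np.log(2.0 * np.pi)

N = 100_000
x = rng.normal(0.0, 2.0, size=N)       # independent draws from the proposal
log_w = log_p_tilde(x) - log_q(x)      # log importance weights
w = np.exp(log_w - log_w.max())        # subtract max for numerical stability

# Self-normalised estimate (4) of E_p[f(x)] with f(x) = x^2 (true value 1).
w_bar = w / w.sum()                    # normalised weights
I_hat = np.sum(w_bar * x**2)

# Effective sample size (5); unaffected by the rescaling of the weights above.
ess = w.sum()**2 / np.sum(w**2)
```

Working with log weights and subtracting the maximum before exponentiating keeps both (4) and (5) well defined even when the unnormalised weights would overflow or underflow.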
2.4 Normalising flows
A normalising flow represents a random vector $x$ with a complicated distribution as an invertible transformation of a random vector $z$ with a simple base distribution, typically $N(0, I)$.

Recent research has developed flexible learnable families of normalising flows. See Papamakarios (2019) for a review. We focus on real NVP ("non-volume preserving") flows (Dinh et al., 2016). These compose several transformations of $z$. One type is a coupling layer, which transforms input vector $u$ to output vector $v$, both of dimension $D$, by

$v_{1:d} = u_{1:d}, \qquad v_{d+1:D} = u_{d+1:D} \odot \exp[s(u_{1:d})] + t(u_{1:d})$,

where $\odot$ denotes elementwise multiplication and the exponential is applied elementwise. Here the first $d$ elements of $u$ are copied unchanged. We typically take $d = \lfloor D/2 \rfloor$. The other elements are scaled by the vector $\exp[s(u_{1:d})]$ then shifted by the vector $t(u_{1:d})$, where $s$ and $t$ are functions of $u_{1:d}$. This transformation is invertible, and allows quick computation of the density of $v$ from that of $u$, as the Jacobian determinant is simply $\exp\left[\sum_j s_j(u_{1:d})\right]$. Coupling layers are alternated with permutations so that different variables are copied in successive coupling layers. Real NVP typically uses order-reversing or random permutations.
The functions $s$ and $t$ are neural network outputs. Each coupling layer has its own neural network. The collection of all weights and biases, $\phi$, can be trained for particular tasks, e.g. density estimation of images. Permutations are fixed in advance and not learnable.

Real NVP produces a flexible family of densities $q(x; \phi)$ with two useful properties for this paper. Firstly, samples can be drawn rapidly. Secondly, it is reasonably fast to compute $q(x; \phi)$ for any $x$.
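The coupling layer update, its cheap inverse, and its log-Jacobian can be sketched as follows; the scale and shift functions here are fixed toy stand-ins for the neural networks real NVP would learn.

```python
import numpy as np

# One real NVP coupling layer: copy u[:d], affinely transform u[d:].
# The scale s and shift t below are toy stand-ins for neural networks.
def s(u1):
    return 0.5 * np.tanh(u1)

def t(u1):
    return 0.1 * u1

def coupling_forward(u, d):
    u1, u2 = u[:d], u[d:]
    v = np.concatenate([u1, u2 * np.exp(s(u1)) + t(u1)])
    log_det = np.sum(s(u1))        # log |det Jacobian| = sum of scale outputs
    return v, log_det

def coupling_inverse(v, d):
    v1, v2 = v[:d], v[d:]
    return np.concatenate([v1, (v2 - t(v1)) * np.exp(-s(v1))])

u = np.array([0.3, -1.2, 0.7, 2.0])
v, log_det = coupling_forward(u, d=2)
u_back = coupling_inverse(v, d=2)      # recovers u up to float error
```

Because the copied half of the input is available unchanged in the output, the inverse never needs to invert $s$ or $t$ themselves, which is why the networks can be arbitrary.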
3 Objective and gradient
Given an approximate family of densities $q(\xi; \phi)$, such as normalising flows, this section introduces objective functions to judge how well $q(\xi; \phi)$ approximates a tempered posterior $p_\epsilon$. It then discusses how to estimate the gradient of this objective with respect to $\phi$. Section 4 presents our algorithm, which uses these gradients to sequentially update $\phi$ while also reducing $\epsilon$.
3.1 Objective
Given $\epsilon$, we aim to minimise the inclusive Kullback-Leibler (KL) divergence,

$\mathrm{KL}(p_\epsilon \,\|\, q) = \mathrm{E}_{\xi \sim p_\epsilon} \left[ \log \frac{p_\epsilon(\xi)}{q(\xi; \phi)} \right]$.

This is equivalent to maximising a scaled negative cross-entropy, which we use as our objective,

$\mathcal{J}(\phi) = \int \tilde{p}_\epsilon(\xi) \log q(\xi; \phi) \, d\xi = Z_\epsilon \, \mathrm{E}_{\xi \sim p_\epsilon}[\log q(\xi; \phi)]$.

(We scale by the normalising constant $Z_\epsilon$ to avoid this intractable constant appearing in our gradient estimates below.)

The inclusive KL divergence penalises $\phi$ values which produce small $q(\xi; \phi)$ when $p_\epsilon(\xi)$ is large. Hence the optimal $q$ tends to be non-negligible wherever $p_\epsilon$ is non-negligible, known as the zero-avoiding property. This is an intuitively attractive feature for importance sampling proposal distributions. Indeed, recent theoretical work shows that, under some conditions, the sample size required in importance sampling scales exponentially with the inclusive KL divergence (Agapiou et al., 2017; Chatterjee and Diaconis, 2018).
3.2 Basic gradient estimate
Assuming standard regularity conditions (Mohamed et al., 2019, Section 4.1), the objective has gradient

$\nabla_\phi \mathcal{J}(\phi) = \int \tilde{p}_\epsilon(\xi) \, \nabla_\phi \log q(\xi; \phi) \, d\xi$.

Using (2), an importance sampling form is

$\nabla_\phi \mathcal{J}(\phi) = \mathrm{E}_{\xi \sim \tilde{q}} \left[ \frac{\tilde{p}_\epsilon(\xi)}{\tilde{q}(\xi)} \nabla_\phi \log q(\xi; \phi) \right]$,

where $\tilde{q}$ is a proposal density. We will take $\tilde{q}(\xi) = q(\xi; \phi')$ for some $\phi'$. (In our main algorithm, $\phi'$ will be the output of a previous optimisation step.) Note we use choices of $\tilde{q}$ with full support, so (1) is satisfied.

An unbiased Monte Carlo gradient estimate is

(6) $\hat{g}_1 = \frac{1}{N} \sum_{i=1}^{N} w_i \, \nabla_\phi \log q(\xi_i; \phi)$,

where $\xi_1, \ldots, \xi_N$ are independent samples from $\tilde{q}$ and $w_i = \tilde{p}_\epsilon(\xi_i)/\tilde{q}(\xi_i)$ are importance sampling weights.
We calculate $\nabla_\phi \log q(\xi_i; \phi)$ by backpropagation. Note that we backpropagate with respect to $\phi$, but not $\phi'$, which is treated as a constant. Hence the $w_i$ values are themselves constant.

3.3 Improved gradient estimates
Here we discuss reducing the variance and cost of $\hat{g}_1$.
Clipping weights
To avoid high variance gradient estimates we apply truncated importance sampling (Ionides, 2008). This clips the weights at a maximum value $\omega$, producing truncated importance weights $\tilde{w}_i = \min(w_i, \omega)$. The resulting gradient estimate is

$\hat{g}_2 = \frac{1}{N} \sum_{i=1}^{N} \tilde{w}_i \, \nabla_\phi \log q(\xi_i; \phi)$.

This gradient estimate typically has lower variance than $\hat{g}_1$, but some bias is introduced. See Appendix A for more details and discussion, including how we choose $\omega$ automatically.
Resampling
The gradient estimate $\hat{g}_2$ requires calculating $\nabla_\phi \log q(\xi_i; \phi)$ for $i = 1, \ldots, N$. Each of these has an associated computational cost, but often many receive small weights and so contribute little to $\hat{g}_2$.

To reduce this cost we can discard many low weight samples by using importance resampling (Smith and Gelfand, 1992), as follows. We sample $M$ times, with replacement, from the $\xi_i$s with probabilities $\tilde{w}_i / S$, where $S = \sum_{i=1}^{N} \tilde{w}_i$. Denote the resulting samples as $\xi'_1, \ldots, \xi'_M$. The following is then an unbiased estimate of $\hat{g}_2$,

(7) $\hat{g}_3 = \frac{S}{NM} \sum_{j=1}^{M} \nabla_\phi \log q(\xi'_j; \phi)$.
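A toy NumPy sketch of the clipped and resampled gradient estimate (7): here the distilled density is replaced by a Gaussian family $q(\xi; \phi) = N(\phi, 1)$, whose score $\nabla_\phi \log q(\xi; \phi) = \xi - \phi$ is available in closed form, and the target is an unnormalised $N(1, 1)$. All names and constants are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

phi = 0.0                            # current parameter of q(xi; phi) = N(phi, 1)

def log_p_tilde(xi):                 # unnormalised N(1, 1) target
    return -0.5 * (xi - 1.0)**2

def log_q(xi, phi):
    return -0.5 * (xi - phi)**2 - 0.5 * np.log(2.0 * np.pi)

N, M, omega = 50_000, 5_000, 50.0
xi = rng.normal(phi, 1.0, size=N)    # proposal draws (the proposal is q itself)
w = np.exp(log_p_tilde(xi) - log_q(xi, phi))
w_trunc = np.minimum(w, omega)       # truncated importance weights, clip at omega
S = w_trunc.sum()

# Importance resampling: keep M samples with probability proportional to weight.
idx = rng.choice(N, size=M, p=w_trunc / S)
xi_r = xi[idx]

# Estimate (7), using the closed form grad_phi log q(xi; phi) = xi - phi.
g = (S / (N * M)) * np.sum(xi_r - phi)
```

The estimate comes out positive, pushing $\phi$ towards the target mean of 1, and the score only needs to be evaluated at the $M$ resampled points rather than all $N$ draws.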
4 Algorithm
Our approach to inference is as follows. Given a current approximation to the posterior, $q(\xi; \phi')$, we use this as $\tilde{q}$ in (7) to produce a gradient estimate. We then update $\phi$ by stochastic gradient ascent, aiming to increase $\mathcal{J}(\phi)$. Over multiple iterations, we improve our target density by reducing $\epsilon$, slowly enough to avoid high variance gradient estimates.
Algorithm 1 gives our implementation of this approach. The remainder of the section discusses several details of the algorithm.
We implement Algorithm 1 using TensorFlow for steps which sample from, backpropagate through, or update neural networks. Other steps are implemented in NumPy where this is more convenient, e.g. updating $\epsilon$ in step 5, or sampling from particularly complex models during step 6.

Note that whenever step 5 reduces $\epsilon$, the optimisation objective is altered. However, the change in objective is typically small, and we did not observe the optimiser struggling to adapt to this change.
We generally make conservative tuning choices, to avoid overfitting when importance sampling only produces a small effective sample size.
We investigate other tuning choices for the algorithm in Section 6. For now, note that $N$ must be reasonably large, since our method to update $\epsilon$ relies on making an accurate ESS estimate, as detailed in Section 4.2.
4.1 Initialisation
The initial $q(\xi; \phi)$ should be similar to the initial target $p_{\epsilon_0}$. Otherwise the first gradient estimates produced by importance sampling are likely to have high variance.

We can often achieve this by initialising $\phi$ appropriately. See Appendix B for details, and Sections 5 and 6 for examples.
Alternatively we can pretrain $q(\xi; \phi)$ using straightforward density estimation, by iterating the following steps:

1. Sample $\xi_1, \ldots, \xi_N$ from the initial target, which is chosen to be easy to sample from directly.

2. Update $\phi$ using the gradient $\frac{1}{N} \sum_{i=1}^{N} \nabla_\phi \log q(\xi_i; \phi)$.

This aims to maximise the negative cross-entropy $\mathrm{E}_{\xi \sim p_{\epsilon_0}}[\log q(\xi; \phi)]$. We terminate once $q$ achieves a reasonable effective sample size when targeting $p_{\epsilon_0}$ in importance sampling.
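The pretraining loop above can be sketched with the flow replaced by a Gaussian family, whose log-density gradients are available in closed form; the initial target $N(3, 2^2)$ and all tuning constants are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)

# q(xi; phi) = N(mu, exp(rho)^2) with phi = (mu, rho). Its mean-log-density
# gradients are (xi - mu)/sd^2 w.r.t. mu and z^2 - 1 w.r.t. rho, z = (xi - mu)/sd.
mu, rho = 0.0, 0.0
lr, N = 0.05, 500
for _ in range(600):
    xi = rng.normal(3.0, 2.0, size=N)     # step 1: sample the initial target
    sd = np.exp(rho)
    z = (xi - mu) / sd
    mu += lr * np.mean(z / sd)            # step 2: gradient ascent on mean log q
    rho += lr * np.mean(z**2 - 1.0)
```

After a few hundred iterations $(\mu, e^\rho)$ settles near the initial target's mean and standard deviation, which is exactly the state in which $q$ makes a usable first IS proposal.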
4.2 Selecting $\epsilon$

We select $\epsilon_t$ using the effective sample size, as in Del Moral et al. (2012). Given $\xi_1, \ldots, \xi_N$ sampled from $q(\cdot; \phi_{t-1})$, the resulting ESS value for target $\tilde{p}_\epsilon$ is

$N_{\mathrm{ESS}}(\epsilon) = \left( \sum_{i=1}^{N} w_i(\epsilon) \right)^2 \Big/ \sum_{i=1}^{N} w_i(\epsilon)^2$, where $w_i(\epsilon) = \tilde{p}_\epsilon(\xi_i) / q(\xi_i; \phi_{t-1})$.

In step 5 of Algorithm 1 we first check whether $N_{\mathrm{ESS}}(\epsilon_{t-1}) < M_{\mathrm{target}}$, where $M_{\mathrm{target}}$ is the target ESS value. If so we take $\epsilon_t = \epsilon_{t-1}$. Otherwise we let $\epsilon_t$ be an estimate of the minimal $\epsilon$ such that $N_{\mathrm{ESS}}(\epsilon) \geq M_{\mathrm{target}}$, computed by a bisection algorithm.
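A sketch of this ESS-based bisection. The weight model here is the ABC-style Gaussian kernel of Section 6, and the simulated distances and tolerance values are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated stand-ins for the distances d(y_i, y_0) of the current samples.
d = np.abs(rng.normal(0.0, 1.0, size=2000))

def ess(eps):
    """ESS when the weights follow a Gaussian kernel in the distances."""
    w = np.exp(-d**2 / (2.0 * eps**2))
    return w.sum()**2 / np.sum(w**2)

def select_eps(eps_prev, target_ess, iters=50):
    if ess(eps_prev) < target_ess:
        return eps_prev                  # cannot reduce eps yet
    lo, hi = 1e-6, eps_prev              # invariant: ess(hi) >= target_ess
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ess(mid) >= target_ess:
            hi = mid                     # still acceptable: try a smaller eps
        else:
            lo = mid
    return hi

eps_new = select_eps(10.0, target_ess=500.0)
```

Since the ESS increases monotonically in $\epsilon$ for this kernel, the bisection converges to the smallest tolerance whose ESS still meets the target.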
5 Example: sinusoidal distribution
As a simple illustration, we target a two dimensional distribution whose unnormalised density concentrates probability mass around a sinusoidal curve, restricted to a bounded region by an indicator function. (Note that earlier sections infer $\xi = (\theta, x)$, where $x$ are latent variables. This example has no latent variables, so we simply infer $\theta$.)
As the initial distribution we take $\theta_1$ and $\theta_2$ to have independent distributions, with scales chosen to give a reasonable match to the standard deviations under the target. We use the unnormalised tempered target

(8)
We use real NVP for $q(\theta; \phi)$, with 4 coupling layers alternated with permutation layers swapping $\theta_1$ and $\theta_2$. Each coupling layer uses a neural network with 3 hidden layers, each of 10 hidden units with ELU activation. We initialise $q$ close to its base distribution, as described in Appendix B, then use pretraining so that $q$ approximates the initial distribution.
The main algorithm's tuning choices (number of training samples and target ESS) are chosen to produce a clear visual illustration; we investigate efficient tuning choices later.

Figure 1 shows our results. The distilled density quickly adapts to match the importance sampling results, and the final target is reached by 90 iterations. This took roughly 1.5 minutes.
6 Example: M/G/1 queue
This section describes an application to likelihood-free inference (Marin et al., 2012; Papamakarios and Murray, 2016). Here a generative model or simulator is specified, typically by computer code. This maps parameters $\theta$ and pseudo-random draws $x$ to data $y$.

Given observations $y_0$, we aim to infer the joint posterior of $\theta$ and $x$. In this section we approximate this with a black-box choice of $q(\xi; \phi)$, i.e. a generic normalising flow. This approach could be applied to any simulator model without modifying its computer code, instead overriding the random number generator to use values proposed by $q$, as in Baydin et al. (2018). However, for higher dimensional $x$ a black-box approach becomes impractical. An alternative would be to use knowledge of the simulator to inform the choice of $q$.
6.1 Model
We consider an M/G/1 queueing model. This is based on a single queue of customers. Times between arrivals at the back of the queue are exponentially distributed. Upon reaching the front of the queue, a customer's service time is uniformly distributed. All these random variables are independent.

We consider a setting where only inter-departure times are observed: times between departures from the queue. This model is a common benchmark for likelihood-free inference (see e.g. Papamakarios and Murray, 2016). Near-exact posterior inference is also possible using a sophisticated MCMC scheme (Shestopaloff and Neal, 2014).

We sample a synthetic dataset of observations from known parameter values $\theta = (\theta_1, \theta_2, \theta_3)$, and attempt to infer these parameters under independent priors.
6.2 DIS implementation
Latent variables and simulator
We introduce two vectors of latent variables, one of length 3 (used to generate $\theta$) and one used to generate the queue's randomness, and take $\xi$ to be the collection of both. Our simulator transforms these inputs to $\theta$ and $y$. We do this in such a way that when the elements of $\xi$ have independent standard distributions, $(\theta, y)$ is a sample from the prior and the model. See Appendix C for details.
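The simulator can be sketched as a deterministic map from parameters and uniform draws to inter-departure times via the single-server queue recursion. This assumes the standard benchmark setup with uniform service times and exponential inter-arrival times, and uses illustrative parameter values; the paper's exact reparameterisation is given in its Appendix C.

```python
import numpy as np

# Map theta and uniform draws deterministically to inter-departure times.
# Assumed (illustrative) setup: service times U(theta1, theta2),
# inter-arrival times Exp(theta3).
def simulate_mg1(theta, u_arrive, u_serve):
    theta1, theta2, theta3 = theta
    inter_arrival = -np.log(u_arrive) / theta3        # Exp(theta3) via inverse CDF
    service = theta1 + (theta2 - theta1) * u_serve    # U(theta1, theta2)
    arrival = np.cumsum(inter_arrival)
    depart = np.empty_like(arrival)
    prev_depart = 0.0
    for i in range(len(arrival)):
        # Service starts when both the customer and the server are free.
        start = max(arrival[i], prev_depart)
        depart[i] = start + service[i]
        prev_depart = depart[i]
    return np.diff(depart, prepend=0.0)               # inter-departure times

rng = np.random.default_rng(4)
n = 20
theta = (1.0, 5.0, 0.2)                               # illustrative values
y = simulate_mg1(theta, rng.uniform(size=n), rng.uniform(size=n))
```

Writing the simulator as a fixed function of its random draws is what lets $q$ propose those draws jointly with $\theta$, steering the simulation towards the observations.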
Tempered target
We use the unnormalised tempered target

(9) $\tilde{p}_\epsilon(\xi) = \pi(\xi) \exp\left( -\frac{d(\xi)^2}{2\epsilon^2} \right)$,

where $\pi(\xi)$ is the density of $\xi$ described above and $d(\xi)$ is the Euclidean distance between the simulated and observed data. This target is often used in ABC, and corresponds to the posterior under the assumption that the data is observed with $N(0, \epsilon^2)$ error (Wilkinson, 2013). We use an initial value $\epsilon_0$ which is large relative to the scale of our observations. Note that DIS cannot reach the exact target here. Instead, like ABC, it will produce increasingly good posterior approximations as $\epsilon \to 0$.
Tuning choices
We use a real NVP architecture for $q(\xi; \phi)$, made up of 16 coupling layers alternated with random permutation layers. Each coupling layer uses a neural network with 3 hidden layers and ELU activation. We initialise $q$ close to its base distribution, as described in Appendix B. This produces samples sufficiently close to the initial target that pretraining was not needed.
6.3 Results
Figure 2 compares different choices of $N$ (number of importance samples) and the target ESS value. It shows that $\epsilon$ reduces more quickly for larger $N$ or a smaller target ESS. Our choice of $N$ was restricted by memory requirements. Our choice of target ESS was restricted by numerical stability: values below a few hundred often produced numerical overflow errors in the normalising flow.

This tuning study suggests using large $N$ and a small target ESS, subject to these restrictions. However, note that in this example the cost of a single evaluation of the target density is low. For more expensive models, efficient tuning choices may differ.

Figure 3 shows that the DIS results are a close match to near-exact MCMC output using the algorithm of Shestopaloff and Neal (2014). The main difference is that DIS lacks a sharp truncation present in the posterior. Its tails also appear to be less accurate than those of MCMC.
Papamakarios et al. (2018) make a detailed comparison of likelihoodfree methods on the M/G/1 model. An advantage of DIS over these methods is that it performs inference on the full data, without losing information by using summary statistics. However a tradeoff is that some competing methods give good approximate results from far fewer simulator runs than DIS.
7 Conclusion
We have presented distilled importance sampling, and shown its application as an approximate Bayesian inference method on a likelihood-free example.
Limitations
Variational inference scales well to large datasets, as it requires unbiased log-likelihood estimates, which can be calculated from small subsamples of data. One limitation of DIS is that it does not share this property: the cost of each importance weight in DIS typically scales linearly with the size of the data, e.g. by making a likelihood evaluation.

Also, we have focused on an example where evaluating the target is quick, and used many evaluations by taking $N$ large. For more expensive models, this would be infeasible.
We hope both limitations can be addressed in future using ideas from adaptive importance sampling, SMC, the crossentropy method, normalising flows etc.
Extensions
There are interesting opportunities for extending DIS. It is not required that the target density is differentiable, so discrete parameter inference is plausible. Also, we can use random-weight importance sampling (Fearnhead et al., 2010) in DIS, replacing the unnormalised target density with an unbiased estimate.
Acknowledgements
Thanks to Alex Shestopaloff for providing MCMC code for the M/G/1 model.
References
 Agapiou et al. (2017) Agapiou, S., Papaspiliopoulos, O., SanzAlonso, D., and Stuart, A. M. (2017). Importance sampling: Intrinsic dimension and computational cost. Statistical Science, 32(3):405–431.
 Baydin et al. (2018) Baydin, A. G., Heinrich, L., Bhimji, W., GramHansen, B., Louppe, G., Shao, L., Cranmer, K., and Wood, F. (2018). Efficient probabilistic inference in the quest for physics beyond the standard model. arXiv preprint arXiv:1807.07706.
 Cappé et al. (2004) Cappé, O., Guillin, A., Marin, J.M., and Robert, C. P. (2004). Population Monte Carlo. Journal of Computational and Graphical Statistics, 13(4):907–929.
 Chatterjee and Diaconis (2018) Chatterjee, S. and Diaconis, P. (2018). The sample size required in importance sampling. The Annals of Applied Probability, 28(2):1099–1135.
 Cornuet et al. (2012) Cornuet, J.M., Marin, J.M., Mira, A., and Robert, C. P. (2012). Adaptive multiple importance sampling. Scandinavian Journal of Statistics, 39(4):798–812.
 Cotter et al. (2019) Cotter, S. L., Kevrekidis, I. G., and Russell, P. (2019). Transport map accelerated adaptive importance sampling, and application to inverse problems arising from multiscale stochastic reaction networks. arXiv preprint arXiv:1901.11269.
 Del Moral et al. (2006) Del Moral, P., Doucet, A., and Jasra, A. (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B, 68(3):411–436.
 Del Moral et al. (2012) Del Moral, P., Doucet, A., and Jasra, A. (2012). An adaptive sequential Monte Carlo method for approximate Bayesian computation. Statistics and Computing, 22(5):1009–1020.
 Dieng et al. (2017) Dieng, A. B., Tran, D., Ranganath, R., Paisley, J., and Blei, D. (2017). Variational inference via upper bound minimization. In Advances in Neural Information Processing Systems, pages 2732–2741.
 Dinh et al. (2016) Dinh, L., SohlDickstein, J., and Bengio, S. (2016). Density estimation using real NVP. arXiv preprint arXiv:1605.08803.
 Duan (2019) Duan, L. L. (2019). Transport Monte Carlo. arXiv preprint arXiv:1907.10448.
 Fearnhead et al. (2010) Fearnhead, P., Papaspiliopoulos, O., Roberts, G. O., and Stuart, A. (2010). Randomweight particle filtering of continuous time processes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):497–512.
 Graham and Storkey (2017) Graham, M. M. and Storkey, A. J. (2017). Asymptotically exact inference in differentiable generative models. Electronic Journal of Statistics, 11(2):5105–5164.
 Grazian and Fan (2019) Grazian, C. and Fan, Y. (2019). A review of approximate Bayesian computation methods via density estimation: inference for simulatormodels. arXiv preprint arXiv:1909.02736.
 Green et al. (2015) Green, P. J., Łatuszyński, K., Pereyra, M., and Robert, C. P. (2015). Bayesian computation: a summary of the current state, and samples backwards and forwards. Statistics and Computing, 25(4):835–862.
 Ionides (2008) Ionides, E. L. (2008). Truncated importance sampling. Journal of Computational and Graphical Statistics, 17(2):295–311.
 Le et al. (2017) Le, T. A., Baydin, A. G., and Wood, F. (2017). Inference compilation and universal probabilistic programming. In Artificial Intelligence and Statistics, pages 1338–1348.
 Li et al. (2017) Li, Y., Turner, R. E., and Liu, Q. (2017). Approximate inference with amortised MCMC. arXiv preprint arXiv:1702.08343.
 Lindley (1952) Lindley, D. V. (1952). The theory of queues with a single server. Mathematical Proceedings of the Cambridge Philosophical Society, 48(2):277–289.
 Liu (1996) Liu, J. S. (1996). Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statistics and Computing, 6(2):113–119.
 Marin et al. (2012) Marin, J.M., Pudlo, P., Robert, C. P., and Ryder, R. J. (2012). Approximate Bayesian computational methods. Statistics and Computing, 22(6):1167–1180.
 Mohamed et al. (2019) Mohamed, S., Rosca, M., Figurnov, M., and Mnih, A. (2019). Monte Carlo gradient estimation in machine learning. arXiv preprint arXiv:1906.10652.
 Müller et al. (2018) Müller, T., McWilliams, B., Rousselle, F., Gross, M., and Novák, J. (2018). Neural importance sampling. arXiv preprint arXiv:1808.03856.
 Papamakarios (2019) Papamakarios, G. (2019). Neural density estimation and likelihoodfree inference. PhD thesis, University of Edinburgh.
 Papamakarios and Murray (2016) Papamakarios, G. and Murray, I. (2016). Fast free inference of simulation models with Bayesian conditional density estimation. In Advances in Neural Information Processing Systems, pages 1028–1036.
 Papamakarios et al. (2018) Papamakarios, G., Sterratt, D. C., and Murray, I. (2018). Sequential neural likelihood: Fast likelihoodfree inference with autoregressive flows. arXiv preprint arXiv:1805.07226.
 Parno and Marzouk (2018) Parno, M. D. and Marzouk, Y. M. (2018). Transport map accelerated Markov chain Monte Carlo. SIAM/ASA Journal on Uncertainty Quantification, 6(2):645–682.

 Pascanu et al. (2013) Pascanu, R., Mikolov, T., and Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318.
 Robert and Casella (2013) Robert, C. P. and Casella, G. (2013). Monte Carlo statistical methods. Springer.
 Rubinstein (1999) Rubinstein, R. (1999). The crossentropy method for combinatorial and continuous optimization. Methodology and computing in applied probability, 1(2):127–190.
 Rubinstein and Kroese (2016) Rubinstein, R. Y. and Kroese, D. P. (2016). Simulation and the Monte Carlo method. John Wiley & Sons.
 Ruiz and Titsias (2019) Ruiz, F. J. R. and Titsias, M. K. (2019). A contrastive divergence for combining variational inference and MCMC. arXiv preprint arXiv:1905.04062.
 Shestopaloff and Neal (2014) Shestopaloff, A. Y. and Neal, R. M. (2014). On Bayesian inference for the M/G/1 queue with efficient MCMC sampling. arXiv preprint arXiv:1401.5548.
 Smith and Gelfand (1992) Smith, A. F. M. and Gelfand, A. E. (1992). Bayesian statistics without tears: a sampling–resampling perspective. The American Statistician, 46(2):84–88.
 Turner et al. (2008) Turner, R. E., Berkes, P., Sahani, M., and MacKay, D. J. C. (2008). Counterexamples to variational free energy compactness folk theorems. Technical report, University College London.
 Whitaker et al. (2017) Whitaker, G. A., Golightly, A., Boys, R. J., and Sherlock, C. (2017). Improved bridge constructs for stochastic differential equations. Statistics and Computing, 27(4):885–900.
 Wilkinson (2013) Wilkinson, R. D. (2013). Approximate Bayesian computation (ABC) gives exact results under the assumption of model error. Statistical applications in genetics and molecular biology, 12(2):129–141.
 Yao et al. (2018) Yao, Y., Vehtari, A., Simpson, D., and Gelman, A. (2018). Yes, but did it work?: Evaluating variational inference. In International Conference on Machine Learning, pages 5577–5586.
References
 Agapiou et al. (2017) Agapiou, S., Papaspiliopoulos, O., SanzAlonso, D., and Stuart, A. M. (2017). Importance sampling: Intrinsic dimension and computational cost. Statistical Science, 32(3):405–431.
 Baydin et al. (2018) Baydin, A. G., Heinrich, L., Bhimji, W., GramHansen, B., Louppe, G., Shao, L., Cranmer, K., and Wood, F. (2018). Efficient probabilistic inference in the quest for physics beyond the standard model. arXiv preprint arXiv:1807.07706.
 Cappé et al. (2004) Cappé, O., Guillin, A., Marin, J.M., and Robert, C. P. (2004). Population Monte Carlo. Journal of Computational and Graphical Statistics, 13(4):907–929.
 Chatterjee and Diaconis (2018) Chatterjee, S. and Diaconis, P. (2018). The sample size required in importance sampling. The Annals of Applied Probability, 28(2):1099–1135.
 Cornuet et al. (2012) Cornuet, J.M., Marin, J.M., Mira, A., and Robert, C. P. (2012). Adaptive multiple importance sampling. Scandinavian Journal of Statistics, 39(4):798–812.
 Cotter et al. (2019) Cotter, S. L., Kevrekidis, I. G., and Russell, P. (2019). Transport map accelerated adaptive importance sampling, and application to inverse problems arising from multiscale stochastic reaction networks. arXiv preprint arXiv:1901.11269.
 Del Moral et al. (2006) Del Moral, P., Doucet, A., and Jasra, A. (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B, 68(3):411–436.
 Del Moral et al. (2012) Del Moral, P., Doucet, A., and Jasra, A. (2012). An adaptive sequential Monte Carlo method for approximate Bayesian computation. Statistics and Computing, 22(5):1009–1020.
 Dieng et al. (2017) Dieng, A. B., Tran, D., Ranganath, R., Paisley, J., and Blei, D. (2017). Variational inference via upper bound minimization. In Advances in Neural Information Processing Systems, pages 2732–2741.
 Dinh et al. (2016) Dinh, L., SohlDickstein, J., and Bengio, S. (2016). Density estimation using real NVP. arXiv preprint arXiv:1605.08803.
 Duan (2019) Duan, L. L. (2019). Transport Monte Carlo. arXiv preprint arXiv:1907.10448.
 Fearnhead et al. (2010) Fearnhead, P., Papaspiliopoulos, O., Roberts, G. O., and Stuart, A. (2010). Randomweight particle filtering of continuous time processes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):497–512.
 Graham and Storkey (2017) Graham, M. M. and Storkey, A. J. (2017). Asymptotically exact inference in differentiable generative models. Electronic Journal of Statistics, 11(2):5105–5164.
 Grazian and Fan (2019) Grazian, C. and Fan, Y. (2019). A review of approximate Bayesian computation methods via density estimation: inference for simulatormodels. arXiv preprint arXiv:1909.02736.
 Green et al. (2015) Green, P. J., Łatuszyński, K., Pereyra, M., and Robert, C. P. (2015). Bayesian computation: a summary of the current state, and samples backwards and forwards. Statistics and Computing, 25(4):835–862.
 Ionides (2008) Ionides, E. L. (2008). Truncated importance sampling. Journal of Computational and Graphical Statistics, 17(2):295–311.
 Le et al. (2017) Le, T. A., Baydin, A. G., and Wood, F. (2017). Inference compilation and universal probabilistic programming. In Artificial Intelligence and Statistics, pages 1338–1348.
 Li et al. (2017) Li, Y., Turner, R. E., and Liu, Q. (2017). Approximate inference with amortised MCMC. arXiv preprint arXiv:1702.08343.
 Lindley (1952) Lindley, D. V. (1952). The theory of queues with a single server. Mathematical Proceedings of the Cambridge Philosophical Society, 48(2):277–289.
 Liu (1996) Liu, J. S. (1996). Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statistics and Computing, 6(2):113–119.
 Marin et al. (2012) Marin, J.M., Pudlo, P., Robert, C. P., and Ryder, R. J. (2012). Approximate Bayesian computational methods. Statistics and Computing, 22(6):1167–1180.
 Mohamed et al. (2019) Mohamed, S., Rosca, M., Figurnov, M., and Mnih, A. (2019). Monte Carlo gradient estimation in machine learning. arXiv preprint arXiv:1906.10652.
 Müller et al. (2018) Müller, T., McWilliams, B., Rousselle, F., Gross, M., and Novák, J. (2018). Neural importance sampling. arXiv preprint arXiv:1808.03856.
 Papamakarios (2019) Papamakarios, G. (2019). Neural density estimation and likelihoodfree inference. PhD thesis, University of Edinburgh.
 Papamakarios and Murray (2016) Papamakarios, G. and Murray, I. (2016). Fast free inference of simulation models with Bayesian conditional density estimation. In Advances in Neural Information Processing Systems, pages 1028–1036.
 Papamakarios et al. (2018) Papamakarios, G., Sterratt, D. C., and Murray, I. (2018). Sequential neural likelihood: Fast likelihoodfree inference with autoregressive flows. arXiv preprint arXiv:1805.07226.
 Parno and Marzouk (2018) Parno, M. D. and Marzouk, Y. M. (2018). Transport map accelerated Markov chain Monte Carlo. SIAM/ASA Journal on Uncertainty Quantification, 6(2):645–682.

 Pascanu et al. (2013) Pascanu, R., Mikolov, T., and Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318.
 Robert and Casella (2013) Robert, C. P. and Casella, G. (2013). Monte Carlo statistical methods. Springer.
 Rubinstein (1999) Rubinstein, R. (1999). The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability, 1(2):127–190.
 Rubinstein and Kroese (2016) Rubinstein, R. Y. and Kroese, D. P. (2016). Simulation and the Monte Carlo method. John Wiley & Sons.
 Ruiz and Titsias (2019) Ruiz, F. J. R. and Titsias, M. K. (2019). A contrastive divergence for combining variational inference and MCMC. arXiv preprint arXiv:1905.04062.
 Shestopaloff and Neal (2014) Shestopaloff, A. Y. and Neal, R. M. (2014). On Bayesian inference for the M/G/1 queue with efficient MCMC sampling. arXiv preprint arXiv:1401.5548.
 Smith and Gelfand (1992) Smith, A. F. M. and Gelfand, A. E. (1992). Bayesian statistics without tears: a sampling–resampling perspective. The American Statistician, 46(2):84–88.
 Turner et al. (2008) Turner, R. E., Berkes, P., Sahani, M., and MacKay, D. J. C. (2008). Counterexamples to variational free energy compactness folk theorems. Technical report, University College London.
 Whitaker et al. (2017) Whitaker, G. A., Golightly, A., Boys, R. J., and Sherlock, C. (2017). Improved bridge constructs for stochastic differential equations. Statistics and Computing, 27(4):885–900.
 Wilkinson (2013) Wilkinson, R. D. (2013). Approximate Bayesian computation (ABC) gives exact results under the assumption of model error. Statistical applications in genetics and molecular biology, 12(2):129–141.
 Yao et al. (2018) Yao, Y., Vehtari, A., Simpson, D., and Gelman, A. (2018). Yes, but did it work?: Evaluating variational inference. In International Conference on Machine Learning, pages 5577–5586.
Appendix A Truncating importance weights
Importance sampling estimates can have large or infinite variance if the proposal $q$ is a poor approximation to the target $p$, often because a small number of importance weights are very large relative to the others. To reduce the variance, Ionides (2008) introduced truncated importance sampling. This replaces each importance weight $w_i$ with $\min(w_i, M)$ for some maximum value $M$. The truncated weights are then used in estimates (3) or (4). Truncating in this way typically reduces variance, at the price of increasing bias.
Truncating importance weights in Algorithm 1 is similar to clipping gradients to prevent occasional large gradient estimates from destabilising stochastic gradient optimisation (Pascanu et al., 2013). A potential drawback of both methods is that gradient estimates lose the property of unbiasedness, which is theoretically required for convergence to an optimal $\phi$. A heuristic argument similar to that for gradient clipping suggests convergence is still likely with truncated weights. Firstly, even after truncation, the gradient is likely to point in a direction increasing the objective: one that increases the $q(x; \phi)$ density at $x$ values where the weight $w(x)$ is large. Secondly, we expect there is a region near the optimum where no truncation is necessary, and therefore gradient estimates are unbiased once this region is reached.

We prefer truncating importance weights to clipping gradients as there is an automated way to choose the truncation threshold $M$, as follows. We select $M$ to reduce the maximum normalised importance weight, $\max_i w_i / \sum_j w_j$, to a prespecified value $\omega$; the same choice of $\omega$ is used throughout. The required $M$ can easily be calculated e.g. by bisection. (Occasionally no such $M$ exists, i.e. if most weights equal zero. In this case we set $M$ to the smallest positive weight.)
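As a minimal numpy sketch, the truncation rule and the bisection search for $M$ could look as follows; the function name and the default target $\omega = 0.1$ are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def truncate_weights(w, omega=0.1):
    """Truncate importance weights so the largest normalised weight is omega.

    Finds M by bisection such that, after replacing each w_i by min(w_i, M),
    max_i w_i / sum_j w_j is (approximately) omega, when such an M exists.
    Note: omega=0.1 is an illustrative default.
    """
    w = np.asarray(w, dtype=float)
    positive = w[w > 0]
    if positive.size == 0:
        return w  # all weights zero: nothing to truncate

    def max_norm_weight(M):
        t = np.minimum(w, M)
        return t.max() / t.sum()

    lo, hi = positive.min(), w.max()
    if max_norm_weight(hi) <= omega:
        return w  # untruncated weights already satisfy the bound
    if max_norm_weight(lo) > omega:
        # no suitable M exists (e.g. most weights are zero):
        # fall back to the smallest positive weight
        return np.minimum(w, lo)
    for _ in range(100):  # bisection on M; max_norm_weight is non-decreasing in M
        mid = 0.5 * (lo + hi)
        if max_norm_weight(mid) > omega:
            hi = mid
        else:
            lo = mid
    return np.minimum(w, lo)
```

Bisection is valid here because the maximum normalised weight is a non-decreasing function of $M$: lowering $M$ shrinks the largest truncated weight at least as fast as it shrinks the total.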
Appendix B Approximate density initialisation
As discussed in Section 4.1, we wish to initialise the distilled density $q(\cdot; \phi)$ close to its initial target distribution. This avoids importance sampling initially producing high-variance gradient estimates.
This can sometimes be achieved by initialising $\phi$ to give a particular distribution, and designing our tempering scheme to have a similar initial target. In particular, $\phi$ represents neural network weights and biases, and we often initialise them all close to zero. We do this by setting biases to zero and sampling weights from distributions truncated to two standard deviations from the mean.
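This initialisation could be sketched as follows in numpy; the function name and the standard deviation used here are illustrative assumptions:

```python
import numpy as np

def init_near_zero(shape, std=0.001, rng=None):
    """Sample weights from N(0, std^2) truncated to two standard deviations.

    Draws outside [-2*std, 2*std] are resampled (rejection sampling), so all
    initial weights lie close to zero. Biases are simply set to zero.
    The default std is an illustrative choice.
    """
    rng = rng or np.random.default_rng()
    w = rng.normal(0.0, std, size=shape)
    bad = np.abs(w) > 2 * std
    while bad.any():
        w[bad] = rng.normal(0.0, std, size=int(bad.sum()))
        bad = np.abs(w) > 2 * std
    return w
```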
A particular case where this is useful, and which we use in our examples, is when $\phi$ includes the parameters of real NVP. Initialising them close to zero means the neural network outputs $\mu$ and $s$ are also approximately zero, provided we use an activation function which maps zero to zero (as we do throughout the paper). Thus each coupling layer has shift vector $\mu \approx 0$ and scale vector $\sigma = \exp(s) \approx 1$. So real NVP produces samples similar to its base distribution, $N(0, I)$. We can often design our initial target to equal this, or to be sufficiently similar that only a few iterations of pretraining are needed.

Appendix C M/G/1 model
This section describes the M/G/1 queueing model, in particular how to simulate from it.
Recall that the parameters are $\theta_1, \theta_2, \theta_3$, with independent prior distributions $\theta_1 \sim U(0,10)$, $\theta_2 - \theta_1 \sim U(0,10)$ and $\theta_3 \sim U(0, 1/3)$. We introduce a reparameterised version of our parameters: $\psi_1, \psi_2, \psi_3$ with independent $N(0,1)$ priors. Then we can take $\theta_1 = 10\Phi(\psi_1)$, $\theta_2 = \theta_1 + 10\Phi(\psi_2)$ and $\theta_3 = \Phi(\psi_3)/3$, where $\Phi$ is the $N(0,1)$ cumulative distribution function.
The model involves latent variables $u_i$ for $1 \le i \le 2m$, where $m$ is the number of observations. These latent variables have independent $N(0,1)$ distributions. They are used to generate

$v_i = -\theta_3^{-1} \log \Phi(u_{2i-1})$  (interarrival times)

$s_i = \theta_1 + (\theta_2 - \theta_1) \Phi(u_{2i})$  (service times)

We calculate interdeparture times $y_i$ through the following recursion (Lindley, 1952)

$y_i = s_i + \max(0, A_i - D_{i-1})$

where $A_i = \sum_{j=1}^{i} v_j$ (arrival times) and $D_i = \sum_{j=1}^{i} y_j$ (departure times), with $D_0 = 0$.
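The simulator above can be sketched as follows; this is a minimal numpy/scipy sketch under the reparameterisation described in this appendix, with an illustrative function name:

```python
import numpy as np
from scipy.stats import norm

def simulate_mg1(psi, u):
    """Simulate m interdeparture times from the M/G/1 queue.

    psi: array of 3 reparameterised parameters (standard normal scale).
    u: array of 2m standard normal latent variables.
    """
    Phi = norm.cdf
    theta1 = 10.0 * Phi(psi[0])                      # min service time
    theta2 = theta1 + 10.0 * Phi(psi[1])             # max service time
    theta3 = Phi(psi[2]) / 3.0                       # arrival rate
    m = len(u) // 2
    v = -np.log(Phi(u[0::2])) / theta3               # interarrival times, Exp(theta3)
    s = theta1 + (theta2 - theta1) * Phi(u[1::2])    # service times, U(theta1, theta2)
    A = np.cumsum(v)                                 # arrival times
    y = np.empty(m)
    D = 0.0                                          # previous departure time
    for i in range(m):                               # Lindley recursion
        y[i] = s[i] + max(0.0, A[i] - D)
        D += y[i]
    return y
```

Mapping the standard normal latents through $\Phi$ gives uniform draws, which are then transformed by inverse CDFs into the exponential interarrival and uniform service times.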