    # Distilling importance sampling

The two main approaches to Bayesian inference are sampling and optimisation methods. However many complicated posteriors are difficult to approximate by either. Therefore we propose a novel approach combining features of both. We use a flexible parameterised family of densities, such as a normalising flow. Given a density from this family approximating the posterior we use importance sampling to produce a weighted sample from a more accurate posterior approximation. This sample is then used in optimisation to update the parameters of the approximate density, a process we refer to as "distilling" the importance sampling results. We illustrate our method in a queueing model example.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Bayesian inference has had great success in in recent decades (Green et al., 2015), but remains challenging in models with a complex posterior dependence structure e.g. those involving latent variables. Monte Carlo methods are one state of the art approach to inference. These produce samples from the posterior distribution. However in many settings it remains challenging to design good mechanisms to propose plausible samples, despite many advances (Cappé et al., 2004; Cornuet et al., 2012; Graham and Storkey, 2017; Whitaker et al., 2017).

We focus on one simple Monte Carlo method: importance sampling (IS). This weights draws from a proposal distribution so that the weighted sample can be viewed as representing a target distribution

, such as the posterior. IS can be used in almost any setting, including in the presence of strong posterior dependence or discrete random variables. However it only achieves a representative weighted sample at a feasible cost if the proposal is a reasonable approximation to the target distribution.

An alternative to Monte Carlo is to use optimisation to find the best approximation to the posterior from a family of distributions. Typically this is done in the framework of variational inference (VI). VI is computationally efficient but has the drawback that it often produces poor approximations to the posterior distribution e.g. through over-concentration (Turner et al., 2008; Yao et al., 2018).

A recent improvement in VI is due to the development of a range of flexible and computationally tractable distributional families using normalising flows (Dinh et al., 2016; Papamakarios, 2019). These transform a simple base random distribution to a complex distribution, using a sequence of learnable transformations.

We propose an alternative to variational inference for training the parameters of an approximate posterior density, typically a normalising flow, which we call the distilled density. This alternates two steps. The first is importance sampling, using the current distilled density as the proposal. The target distribution is an approximate posterior, based on tempering, which is an improvement on the proposal. The second step is to use the resulting weighted sampled to train the distilled density further. Following Li et al. (2017), we refer to this as distilling the importance sampling results. By iteratively distilling IS results, we can target increasingly accurate posterior approximations i.e. reduce the amount of tempering.

Each step of our distilled importance sampling (DIS) method aims to reduce the Kullback Leibler (KL) divergence from the distilled density to the current tempered posterior. This is known as the inclusive KL divergence, as minimising it tends to produce a density which is over-dispersed compared to the tempered posterior. Such a distribution is well suited to be an IS proposal distribution.

In the remainder of the paper, Section 2 presents background on Bayesian inference and normalising flows. Sections 3 and 4 describe our method. Section 5 illustrates it on a simple two dimensional inference task. Section 6 gives a more challenging queueing model example. Section 7 concludes with a discussion, including limitations of the approach and opportunities for future improvements.

A link to code for the examples will be included in the final version of this paper. All examples were run on a 6-core desktop PC.

#### Related work

Several recent papers (Müller et al., 2018; Cotter et al., 2019; Duan, 2019) learn a density defined via a transformation to use as an importance sampling proposal. A novelty of our approach is using a sequential approach based on tempering.

A related approach is to distill Markov chain Monte Carlo output, but this turns out to be more difficult than for IS. One reason is that optimising the KL divergence typically requires unbiased estimates of it or related quantities (e.g. its gradient), but MCMC only provides unbiased estimates asymptotically.

Li et al. (2017) and Parno and Marzouk (2018) proceed by using biased estimates, while Ruiz and Titsias (2019) introduce an alternative more tractable divergence. However IS, as we shall see, can produce unbiased estimates of the required KL gradient.

Approximate Bayesian computation (ABC) methods (Marin et al., 2012; Del Moral et al., 2012) involve simulating datasets under various parameters to find those which produce close matches to the observations. However close matches are rare unless the observations are low dimensional. Hence ABC typically uses dimension reduction of the observations through summary statistics, which reduces inference accuracy. Our method can instead learn a joint proposal distribution for the parameters and all the random variables used in simulating a dataset (see Section 6 for details). Hence it can control the simulation process to frequently output data similar to the full observations.

Conditional density estimation methods (see e.g. Le et al., 2017, Papamakarios et al., 2018, Grazian and Fan, 2019

) fit a joint distribution to parameters and data from simulations. Then one can condition on the observed data to approximate its posterior distribution. These methods also sometimes require dimension reduction, and can perform poorly when the observed data is unlike the simulations. Our approach avoids these difficulties by directly finding parameters which can reproduce the full observations.

More broadly, DIS has connections to several inference methods. Concentrating on its IS component, it is closely related to adaptive importance sampling (Cornuet et al., 2012) and sequential Monte Carlo (SMC) (Del Moral et al., 2006). Concentrating on training an approximate density, it can be seen as a version of the cross-entropy method (Rubinstein, 1999).

## 2 Background

This section presents background on Bayesian inference and normalising flows.

### 2.1 Bayesian framework

We observe data

assumed to be the output of a probability model

under some parameters . Given prior density we aim to find corresponding posterior .

Many probability models involve latent variables , so that . To avoid computing this integral we will attempt to infer the joint posterior , and then marginalise to get . For convenience we introduce to represent the collection of parameters and latent variables. (For models without latent variables .)

We now wish to infer . Typically we can only evaluate an unnormalised version,

 ~p(ξ|y)=p(y|θ,x)p(x|θ)π(θ).

Then where is an intractable normalising constant.

### 2.2 Tempering

We will use a tempered target density such that is the posterior and gives an approximation. As for the posterior, we can often only evaluate an unnormalised version . Then where . We use various tempering schemes later in the paper: see (8) and (9).

### 2.3 Importance sampling

Let be a target density, such as a tempered posterior, where and only can be evaluated. Importance sampling (IS) is a Monte Carlo method to estimate expectations of the form

 I=Eξ∼p[h(ξ)],

for some function . Here we give an overview of relevant aspects. For full details see e.g. Robert and Casella (2013) and Rubinstein and Kroese (2016).

IS requires a proposal density which can easily be sampled from, and must satisfy

 supp(p)⊆supp(λ), (1)

where denotes support. Then

 I=Eξ∼λ[p(ξ)λ(ξ)h(ξ)]. (2)

So an unbiased Monte Carlo estimate of is

 ^I1=1NZN∑i=1wih(ξ(i)), (3)

where are independent samples from , and gives a corresponding importance weight.

Typically is estimated as giving

 ^I2=N∑i=1wih(ξ(i))/N∑i=1wi, (4)

a biased, but consistent, estimate of . Equivalently

 ^I2=N∑i=1sih(ξ(i)),

for normalised importance weights .

A drawback of importance sampling is that the variance of its estimates can be large, or infinite, if

is a poor approximation to . Hence diagnostics for the quality of the results are useful.

A popular diagnostic is effective sample size (ESS),

 NESS=(N∑i=1wi)2/ N∑i=1w2i. (5)

For most functions , roughly equals the variance of an idealised Monte Carlo estimate based on independent samples from (Liu, 1996).

### 2.4 Normalising flows

A normalising flow

represents a random vector

with a complicated distribution as an invertible transformation of a random vector with a simple base distribution, typically .

Recent research has developed flexible learnable families of normalising flows. See Papamakarios (2019) for a review. We focus on real NVP (“non-volume preserving”) flows (Dinh et al., 2016). These compose several transformations of . One type is a coupling layer which transforms input vector to output vector , both of dimension , by

 v1:d =u1:d, vd+1:D =μ+exp(σ)⊙ud+1:D, μ =fμ(u1:d), σ =fσ(u1:d),

where and are elementwise multiplication and exponentiation. Here the first elements of are copied unchanged. We typically take . The other elements are scaled by vector then shifted by vector , where and are functions of . This transformation is invertible, and allows quick computation of the density of from that of , as the Jacobian is simply . Coupling layers are alternated with permutations so that different variables are copied in successive coupling layers. Real NVP typically uses order-reversing or random permutations.

The functions and

are neural network outputs. Each coupling layer has its own neural network. The collection of all weights and biases,

, can be trained for particular tasks e.g. density estimation of images. Permutations are fixed in advance and not learnable.

Real NVP produces a flexible family of densities with two useful properties for this paper. Firstly, samples can be drawn rapidly. Secondly, it is reasonably fast to compute for any .

Given an approximate family of densities , such as normalising flows, this section introduces objective functions to judge how well approximates a tempered posterior . It then discusses how to estimate the gradient of this objective with respect to . Section 4 presents our algorithm using these gradients to sequentially update while also reducing .

### 3.1 Objective

Given , we aim to minimise the inclusive Kullback-Leibler (KL) divergence,

 KL(pϵ||q)=Eξ∼pϵ[logpϵ(ξ)−logq(ξ;ϕ)].

This is equivalent to maximising a scaled negative cross-entropy, which we use as our objective,

 Jϵ(ϕ)=ZϵEξ∼pϵ[logq(ξ;ϕ)].

(We scale by to avoid this intractable constant appearing in our gradient estimates below.)

The inclusive KL divergence penalises values which produce small when is large. Hence the optimal tends to make non-negligible where is non-negligible, known as the zero-avoiding property. This is an intuitively attractive feature for importance sampling proposal distributions. Indeed recent theoretical work shows that, under some conditions, the sample size required in importance sampling scales exponentially with the inclusive KL divergence (Agapiou et al., 2017; Chatterjee and Diaconis, 2018).

(Our work could be adapted to use the divergence (Dieng et al., 2017; Müller et al., 2018), which is closely related to the variance of the importance weights.)

Assuming standard regularity conditions (Mohamed et al., 2019, Section 4.1), the objective has gradient

 ∇Jϵ(ϕ)=ZϵEξ∼pϵ[∇logq(ξ;ϕ)].

Using (2), an importance sampling form is

 ∇Jϵ(ϕ)=Eξ∼λ[~pϵ(ξ)λ(ξ)∇logq(ξ;ϕ)],

where is a proposal density. We will take for some . (In our main algorithm, will be the output of a previous optimisation step.) Note we use choices of with full support, so (1) is satisfied.

An unbiased Monte Carlo gradient estimate is

 g1=1NN∑i=1wi∇logq(ξ(i);ϕ), (6)

where are independent samples and are importance sampling weights.

We calculate

by backpropagation. Note that we backpropagate with respect to

, but not which is treated as a constant. Hence the values are themselves constant.

Here we discuss reducing the variance and cost of .

#### Clipping weights

To avoid high variance gradient estimates we apply truncated importance sampling (Ionides, 2008). This clips the weights at a maximum value , producing truncated importance weights . The resulting gradient estimate is

 g2=1NN∑i=1~wi∇logq(ξ(i);ϕ).

This gradient estimate typically has lower variance than , but some bias is introduced. See Appendix A for more details and discussion, including how we choose automatically.

#### Resampling

The gradient estimate requires calculating for . Each of these has an associated computational cost, but often many receive small weights and so contribute little to .

To reduce this cost we can discard many low weight samples, by using importance resampling (Smith and Gelfand, 1992) as follows. We sample times, with replacement, from the s with probabilities where . Denote the resulting samples as . The following is then an unbiased estimate of ,

 g3=Snn∑j=1∇logq(~ξ(j);ϕ). (7)

## 4 Algorithm

Our approach to inference is as follows. Given a current approximation to the posterior, , we use this as in (7) to produce a gradient estimate. We then update to by stochastic gradient ascent, aiming to increase . Over multiple iterations, we improve our target density by reducing , slowly enough to avoid high variance gradient estimates.

Algorithm 1 gives our implementation of this approach. The remainder of the section discusses several details of the algorithm.

We implement Algorithm 1

using Tensorflow for steps which sample from, backpropagate through, or update neural networks. Other steps are implemented in NumPy where this is more convenient e.g. updating

in step 5, or sampling from particularly complex models during step 6.

Note that whenever step 5 reduces , the optimisation objective is altered. However the change in objective is typically small, and we did not observe the optimiser struggling to adapt to this change.

We generally take and , a conservative choice to avoid overfitting when importance sampling only produces a small effective sample size.

We investigate other tuning choices for the algorithm in Section 6. For now note that must be reasonably large since our method to update relies on making an accurate ESS estimate, as detailed in Section 4.2.

### 4.1 Initialisation

The initial should be similar to the initial target . Otherwise the first gradient estimates produced by importance sampling are likely to be high variance.

We can often achieve this by initialising appropriately. See Appendix B for details, and Sections 5 and 6 for examples.

Alternatively we can pretrain using straightforward density estimation by iterating the following steps:

1. Sample from .

This aims to maximise the negative cross-entropy . We use , and terminate once achieves a reasonable effective sampling size when targeting in importance sampling.

### 4.2 Selecting ϵt

We select using effective sample size, as in Del Moral et al. (2012). Given sampled from , the resulting ESS value for target is

 NESS(ϵ)=[∑Ni=1w(ξi,ϵ)]2∑Ni=1w(ξi,ϵ)2, where w(ξ,ϵ)=~pϵ(ξ)q(ξ;ϕt−1).

In step 5 of Algorithm 1 we first check whether , where is the target ESS value. If so we take . Otherwise we let be an estimate of the minimal such that , computed by a bisection algorithm.

## 5 Example: sinusoidal distribution

As a simple illustration, we target a distribution with and . This has unnormalised density

 ~p(θ)=exp{−100[θ2−sin(θ1)2]}1[|θ1|<π],

where is an indicator function. (Note earlier sections infer , where are latent variables. This example has no latent variables, so we simply infer .)

As initial distribution we take and to have independent distributions. We take

to give a reasonable match to the standard deviation of

under the target. Hence

 p1(θ)=12πσ20exp[−12σ20(θ21+θ22)].

We use the unnormalised tempered target

 ~pϵ(θ)=p1(θ)ϵ~p(θ)1−ϵ. (8)

We use real NVP for , with 4 coupling layers, alternated with permutation layers swapping and . Each coupling layer uses a neural network with 3 hidden layers each with 10 hidden units and ELU activation. We initialise close to a distribution, as described in Appendix B, then use pretraining so that approximates .

The main algorithm uses training samples with a target ESS of . These values are chosen to produce a clear visual illustration: we investigate efficient tuning choices later.

Figure 1 shows our results. The distilled density quickly adapts to meet the importance sampling results, and is reached by 90 iterations. This took roughly 1.5 minutes. Figure 1: Training output for sinusoidal example. Each frame shows 300 samples from the current importance density. A subsample of 150 targeting pϵ(θ) for the current ϵ value is selected by resampling (see Section 3.3) and shown as red crosses. The remaining points are shown as blue dots.

## 6 Example: M/G/1 queue

This section describes an application to likelihood-free inference (Marin et al., 2012; Papamakarios and Murray, 2016). Here a generative model or simulator is specified, typically by computer code. This maps parameters and pseudo-random draws to data .

Given observations , we aim to infer the joint posterior of and , . In this section we approximate this with a black-box choice of i.e. a generic normalising flow. This approach could be applied to any simulator model without modifying its computer code, instead overriding the random number generator to use values proposed by , as in Baydin et al. (2018). However, for higher dimensional a black-box approach becomes impractical. An alternative would be to use knowledge of the simulator to inform the choice of .

### 6.1 Model

We consider a M/G/1 queuing model. This is based on a single queue of customers. Times between arrivals at the back of the queue are . Upon reaching the front of the queue, a customer’s service time is . All these random variables are independent.

We consider a setting where only inter-departure times are observed: times between departures from the queue. This model is a common benchmark for likelihood-free inference (see e.g. Papamakarios and Murray, 2016). Near-exact posterior inference is also possible using a sophisticated MCMC scheme (Shestopaloff and Neal, 2014).

We sample a synthetic dataset of observations from parameter values . We attempt to infer these parameters under the prior (all independent).

### 6.2 DIS implementation

#### Latent variables and simulator

We introduce and , vectors of length 3 and , and take as the collection . Our simulator transforms these inputs to and . We do this in a such a way that when the elements have independent distributions, then is a sample from the prior and the model. See Appendix C for details.

#### Tempered target

We use the unnormalised tempered target

 ~pϵ(ξ)=π(θ)exp[−12ϵ2d(ξ)2],for ϵ>0 (9)

where is the Euclidean distance between the simulated and observed data. This target is often used in ABC and corresponds to the posterior under the assumption that the data is observed with error (Wilkinson, 2013). We use initial value : a large scale value relative to our observations. Note that DIS cannot reach the exact target here. Instead, like ABC, it will produce increasingly good posterior approximations as .

#### Tuning choices

We use a real NVP architecture for , made up of 16 coupling layers, alternated with random permutation layers. Each coupling layer uses a neural network with 3 hidden layers of units and ELU activation. We initialise close to a distribution, as described in Appendix B. This produces samples sufficiently close to the initial target that pretraining was not needed.

### 6.3 Results

Figure 2 compares different choices of (number of importance samples) and (target ESS value). It shows that reduces more quickly for larger or smaller . Our choice of was restricted by memory requirements: the largest value we used was . Our choice of was restricted by numerical stability: values below a few hundred often produced numerical overflow errors in the normalising flow.

This tuning study suggests using large and small subject to these restrictions. However, note that in this example, the cost of a single evaluation of the target density is low. For more expensive models efficient tuning choices may differ.

Figure 3 shows that DIS results with and are a close match to near-exact MCMC output using the algorithm of Shestopaloff and Neal (2014). The main difference is that DIS lacks the sharp truncation at . Its tails also appear to be less accurate than those of MCMC.

Papamakarios et al. (2018) make a detailed comparison of likelihood-free methods on the M/G/1 model. An advantage of DIS over these methods is that it performs inference on the full data, without losing information by using summary statistics. However a trade-off is that some competing methods give good approximate results from far fewer simulator runs than DIS. Figure 2: The ϵ value reached by DIS on the M/G/1 example against computation time, for various choices of N (number of importance samples) and M/N (ratio of target effective sampling size to N). Figure 3: Marginal posterior histograms for M/G/1 example. Top: MCMC output. Bottom: DIS output after 180 minutes with ϵ=0.283.

## 7 Conclusion

We’ve presented distilled importance sampling, and shown its application as an approximate Bayesian inference method for a likelihood-free example.

#### Limitations

Variational inference scales well to large datasets, as it requires unbiased log-likelihood estimates, which can be calculated from small subsamples of data. One limitation of DIS is that it does not share this property, as the cost of each importance weight in DIS typically scales linearly with the size of the data e.g. by making a likelihood evaluation.

Also, we’ve focused an example where evaluating is quick, and used many evaluations by taking large. For more expensive models, this would be infeasible.

We hope both limitations can be addressed in future using ideas from adaptive importance sampling, SMC, the cross-entropy method, normalising flows etc.

#### Extensions

There are interesting opportunities for extending DIS. It’s not required that is differentiable, so discrete parameter inference is plausible. Also, we can use random-weight importance sampling (Fearnhead et al., 2010) in DIS, and replace with an unbiased estimate.

#### Acknowledgements

Thanks to Alex Shestopaloff for providing MCMC code for the M/G/1 model.

## Appendix A Truncating importance weights

Importance sampling estimates can have large or infinite variance if is a poor approximation to , often due to a small number of importance weights being very large relative to the others. To reduce the variance, Ionides (2008) introduced truncated importance sampling. This replaces each importance weight with given some maximum value . The truncated weights are then used in estimates (3) or (4). Truncating in this way typically reduces variance, at the price of increasing bias.

Truncating importance weights in Algorithm 1 is similar to clipping gradients to prevent occasional large gradient estimates from destabilising stochastic gradient optimisation (Pascanu et al., 2013). A potential drawback of both methods is that gradients lose the property of unbiasedness, which is theoretically required for convergence to an optimal

. A similar heuristic argument to gradient clipping can be used for why convergence is still likely with truncated weights. Firstly, even after truncation, the gradient is likely to point in a direction increasing the objective: one that increases the

density at values where is large. Secondly, we expect there is a region near the optimum where no truncation is necessary, and therefore gradient estimates are unbiased once this region is reached.

We prefer truncating importance weights to clipping gradients as there is an automated way to choose the clipping threshold , as follows. We select to reduce the maximum normalised importance weight, , to a prespecified value: throughout we use . The required value can easily be calculated e.g. by bisection. (Occasionally no such exists i.e. if most weights equal zero. In this case we set to the smallest positive weight.)

## Appendix B Approximate density initialisation

As discussed in Section 4.1, we wish to initialise close to its initial target distribution. This avoid importance sampling initially producing high variance gradient estimates.

This can sometimes be achieved by initialising to give a particular distribution, and designing our tempering scheme to have a similar initial target. In particular, represents neural network weights and biases, and we often initialise them all close to zero. We do this by setting biases to zero and sampling weights from distributions truncated to two standard deviations from the mean.

A particular case where this is useful, and which we use in our examples, is when includes the parameters of real NVP. Initialising them close to zero means the neural network outputs and

are also approximately zero – provided we use an activation function which maps zero to zero (as we do throughout the paper). Thus each coupling layer has shift vector

and scale vector . So real NVP produces samples similar to its base distribution . We can often design our initial target to equal this, or to be sufficiently similar that only a few iterations of pretraining are needed.

## Appendix C M/G/1 model

This section describes the M/G/1 queueing model, in particular how to simulate from it.

Recall that the parameters are , with independent prior distributions . We introduce a reparameterised version of our parameters: with independent priors. Then we can take . , , where is the cumulative distribution function.

The model involves latent variables for , where is the number of observations. These latent variables have independent distributions. They are used to generate

 ai =−1θ3logΦ−1(xi), (inter-arrival times) si =θ2+(θ3−θ2)Φ−1(xi+m). (service times)

We calculate inter-departure times through the following recursion (Lindley, 1952)

 di =si+max(Ai−Di−1), (inter-departure times)

where (arrival times) and (departure times).