# Increasing the efficiency of Sequential Monte Carlo samplers through the use of approximately optimal L-kernels



## Abstract

By facilitating the generation of samples from arbitrary probability distributions, Markov Chain Monte Carlo (MCMC) is, arguably, the tool for the evaluation of Bayesian inference problems that yield non-standard posterior distributions. In recent years, however, it has become apparent that Sequential Monte Carlo (SMC) samplers have the potential to outperform MCMC in a number of ways. SMC samplers are better suited to highly parallel computing architectures and also feature various tuning parameters that are not available to MCMC. One such parameter - the ‘L-kernel’ - is a user-defined probability distribution that can be used to influence the efficiency of the sampler. In the current paper, the authors explain how to derive an expression for the L-kernel that minimises the variance of the estimates realised by an SMC sampler. Various approximation methods are then proposed to aid implementation of the proposed L-kernel. The improved performance of the resulting algorithm is demonstrated in multiple scenarios. For the examples shown in the current paper, the use of an approximately optimum L-kernel has reduced the variance of the SMC estimates by up to 99 % while also reducing the number of times that resampling was required by between 65 % and 70 %. Python code and code tests accompanying this manuscript are available through the GitHub repository https://github.com/plgreenLIRU/SMC_approx_optL.

Key words: Sequential Monte Carlo, Markov chain Monte Carlo, Bayesian inference

## 1 Introduction

Markov Chain Monte Carlo (MCMC) is an approach through which, theoretically, it is possible to generate samples from an arbitrary probability distribution. MCMC has therefore become a key tool in the analysis of Bayesian inference problems, as it can be used to generate samples from posterior probability distributions for which Bayes’ theorem does not yield closed-form solutions. Beginning with the fundamental paper by Metropolis et al. [27], later generalised by Hastings [16], the popularity of MCMC has grown considerably. Myriad different MCMC approaches have now been proposed; algorithms such as Hamiltonian Monte Carlo (also known as Hybrid Monte Carlo) [10] and the No U-Turn Sampler (NUTS) [17] now lie at the heart of MCMC software packages, including Stan [4] and PyMC3 [31].

MCMC, however, does have certain disadvantages. By fundamentally relying on the convergence of a single Markov chain to its stationary distribution, MCMC approaches are poorly suited to parallel processing (the authors note that some variants of MCMC do attempt to exploit parallel processing - see TMCMC [6] for example - but, to the best of our knowledge, none of these are able to fully exploit the potential provided by modern computing architectures). Moreover, samples generated before convergence of the Markov chain(s) cannot be treated as samples from the distribution of interest (herein referred to as the ‘target distribution’) and, as such, must be discarded (a process typically referred to as ‘burn-in’ or ‘convergence to the typical set’). The current paper considers, instead, an alternative to MCMC known as Sequential Monte Carlo (SMC) samplers.

SMC samplers (not to be confused with SMC methods, which we define to be a collection of approaches that include, for example, the particle filter [1]) were first proposed in [8]. They can fulfil the same role as MCMC in that, over successive iterations, they can be used to realise Monte-Carlo estimates of statistical moments associated with an arbitrary probability distribution. SMC samplers have been used to aid rare event estimation [5], approximate Bayesian (a.k.a. ‘likelihood free’) computation [11, 9, 30, 3, 34], inference from growing sets of data [7, 14] and Bayesian model selection [39, 21], while the papers [22, 25] have explored the applications of SMC samplers on parallel computing architectures.

SMC samplers differ from MCMC in that they are fundamentally based on an importance sampling approach, whereby a population of weighted samples evolves over a number of iterations. SMC samplers have additional ‘settings’, not available to MCMC, that can potentially be used to improve the performance of the resulting sampler. One of these settings is the so-called L-kernel - a user-defined conditional probability density function that appears in the general SMC implementation [8].

The current paper explains the optimal choice for this L-kernel. The novelty of the paper lies in the description of approximation schemes for implementing approximately optimal L-kernels for cases where the target distribution is uni-modal or multi-modal.

The paper is organised as follows. Section 2 introduces SMC samplers and the role of the L-kernel before, in Section 3, the optimal L-kernel is defined. Methods for implementing an approximately optimal L-kernel are then described in Section 4 before the approach is illustrated on a number of case studies in Section 5. Future work and main conclusions are then described in Sections 6 and 7 respectively.

## 2 SMC samplers

### 2.1 General algorithm

This section provides an introduction to SMC samplers and establishes the role of the L-kernel in a general SMC sampler implementation.

Ultimately, the goal of the algorithm described in the current section is to estimate the expectations of functions $f(x)$ with respect to a target distribution, $\pi(x)$, where $x \in \mathbb{R}^D$. We note that the approach is applicable to scenarios where the target distribution is only known up to a constant of proportionality (which is often the case in Bayesian inference problems; for clarity, generating samples from the unnormalised distribution $\pi^*(x)$ cannot, in general, be approached in the same manner as generating samples from $\pi(x)$, as $\pi^*(x)$ is proportional, not equal, to $\pi(x)$). Consequently, in the following, we use $\pi^*(x)$ to represent an unnormalised target distribution, such that

$$\pi(x) = \frac{\pi^*(x)}{\int \pi^*(x)\,\mathrm{d}x} \tag{1}$$

The first iteration of an SMC sampler is identical to a standard importance sampling approach. Specifically, $N$ samples of $x$ are generated from an initial proposal distribution, $q(x)$, such that

$$\{X_1^1, \ldots, X_1^N\} \sim q(x) \tag{2}$$

For the case where the normalising constant of the target distribution is unknown, the importance weight associated with each sample is

$$w^*(X_1^i) = \frac{\pi^*(X_1^i)}{q(X_1^i)} \tag{3}$$

These weighted samples can be used to estimate statistical quantities relating to the target distribution, as

$$\int f(x)\pi(x)\,\mathrm{d}x = \frac{\int f(x)\pi^*(x)\,\mathrm{d}x}{\int \pi^*(x)\,\mathrm{d}x} \approx \frac{\sum_{i=1}^{N} w^*(X_1^i)\, f(X_1^i)}{\sum_{j=1}^{N} w^*(X_1^j)} \tag{4}$$

(where $\int f(x)\pi(x)\,\mathrm{d}x$ is the quantity of interest).

As with any importance sampling scheme, the so-called ‘effective sample size’ [24] can be used to analyse sample coverage (such that, for example, a low effective sample size implies that only a small number of samples are associated with relatively large importance weights). If the effective sample size is found to be below a pre-defined threshold, a procedure known as resampling is undertaken. Resampling involves generating a new set of samples by sampling, with replacement, from the current set of samples. The probability that a sample from the current set, $X_1^i$, will become part of the new set is proportional to that sample’s importance weight ($w^*(X_1^i)$). Consequently, resampling tends to produce replicas of samples that are in regions where the probability density of $\pi(x)$ is high, while also removing samples that are in regions of low probability density of $\pi(x)$. In all of the following, resampling is employed if the ratio of effective sample size to the number of samples per iteration falls below 0.5.
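The effective-sample-size test and resampling step described above can be sketched as follows (a minimal illustration rather than the repository's implementation; the samples and weights shown are hypothetical):

```python
# Sketch (not the repository code) of the ESS test and resampling step.
import numpy as np

def effective_sample_size(w):
    """ESS for unnormalised weights: (sum w)^2 / sum(w^2)."""
    return np.sum(w) ** 2 / np.sum(w ** 2)

def resample(x, w, rng):
    """Draw N indices with replacement, with probability proportional to w."""
    idx = rng.choice(len(x), size=len(x), p=w / np.sum(w))
    return x[idx], np.full(len(x), 1.0 / len(x))  # replicas get uniform weights

rng = np.random.default_rng(0)
x = rng.normal(size=1000)                      # hypothetical samples
w = np.exp(-0.5 * (x - 1.0) ** 2)              # hypothetical unnormalised weights
if effective_sample_size(w) / len(x) < 0.5:    # threshold used in the text
    x, w = resample(x, w, rng)
```

With uniform weights the ESS equals the number of samples, and it shrinks towards 1 as the weights degenerate onto a single sample.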

Up until this point, the SMC sampler has essentially acted as a standard importance sampler. In the next iteration, however, a new set of samples are proposed conditional on the current set. Specifically:

$$X_2^i \sim q_{\rightarrow}(x_2 \mid X_1^i), \quad i = 1, \ldots, N \tag{5}$$

where, here, $q_{\rightarrow}(x_2 \mid x_1)$ is referred to as the ‘general proposal distribution’. The main advantage of this iterative approach is that, if resampling has occurred in the previous iteration, the algorithm allows new samples to be generated in regions of interest (i.e., regions where samples from the previous iteration were associated with relatively high importance weights). The importance weights associated with the SMC sampler’s second iteration are

$$w^*(X_1^i, X_2^i) = \frac{\pi^*_{1:2}(X_1^i, X_2^i)}{q_{\rightarrow}(X_2^i \mid X_1^i)\, q(X_1^i)} \tag{6}$$

At this point it is important to note that the weighted samples now target $\pi^*_{1:2}(x_1, x_2)$. Practically speaking, it is only necessary for the most recent set of samples, $\{X_2^i\}_{i=1}^N$, to target $\pi(x_2)$. Consequently, the user has significant flexibility in the choice of $\pi^*_{1:2}(x_1, x_2)$, subject to the condition that, when normalised, it is a valid probability density function (PDF) where

$$\int \pi^*_{1:2}(x_1, x_2)\,\mathrm{d}x_1 \propto \pi(x_2) \tag{7}$$

As described in [8], the user may choose to set

$$\pi^*_{1:2}(x_1, x_2) = \pi^*(x_2)\, L_1(x_1 \mid x_2) \tag{8}$$

where $L_1(x_1 \mid x_2)$, the L-kernel, is any valid PDF. The choice of L-kernel gives a degree of flexibility that, potentially, can be used to significantly influence the performance of the SMC sampler.

From equation (8) it is now possible to write the weights of each sample pair, $(X_1^i, X_2^i)$, as

$$w^*(X_1^i, X_2^i) = \frac{\pi^*(X_2^i)\, L_1(X_1^i \mid X_2^i)}{q_{\rightarrow}(X_2^i \mid X_1^i)\, q(X_1^i)} = \frac{\pi^*(X_2^i)}{\pi^*(X_1^i)}\, \frac{L_1(X_1^i \mid X_2^i)}{q_{\rightarrow}(X_2^i \mid X_1^i)}\, w^*(X_1^i) \tag{9}$$

In the general case then, when the SMC sampler has run for $k$ iterations, the distribution

$$\pi^*_{1:k}(x_{1:k}) = \pi^*(x_k) \prod_{k'=2}^{k} L_{k'-1}(x_{k'-1} \mid x_{k'}) \tag{10}$$

is being targeted by the weighted samples whose importance weights can be found iteratively from

$$w^*(X_{1:k}^i) = \frac{\pi^*(X_k^i)}{\pi^*(X_{k-1}^i)}\, \frac{L_{k-1}(X_{k-1}^i \mid X_k^i)}{q_{\rightarrow}(X_k^i \mid X_{k-1}^i)}\, w^*(X_{1:k-1}^i) \tag{11}$$

For notational purposes we note that, if the normalising constant of the target distribution is known, then the importance weights associated with the first and $k$th iterations of the SMC sampler would be calculated according to

$$w(X_1^i) = \frac{\pi(X_1^i)}{q(X_1^i)} \tag{12}$$

and

$$w(X_{1:k}^i) = \frac{\pi(X_k^i)}{\pi(X_{k-1}^i)}\, \frac{L_{k-1}(X_{k-1}^i \mid X_k^i)}{q_{\rightarrow}(X_k^i \mid X_{k-1}^i)}\, w(X_{1:k-1}^i) \tag{13}$$

respectively.
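As a concrete sketch of the recursion in equations (12)-(13) (a minimal illustration, not the repository's implementation; the target, proposal and settings are assumptions for the example), consider a one-dimensional standard-Gaussian target with a symmetric random-walk proposal and the L-kernel set equal to the forward proposal, so that the L-kernel ratio in equation (13) cancels (a choice discussed, and shown to be suboptimal, in Section 3):

```python
# Minimal SMC weight recursion (equations (12)-(14)); illustrative only.
import numpy as np

def pi(x):
    """Target: standard Gaussian, normalising constant known."""
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

rng = np.random.default_rng(1)
N, K, sd0 = 5000, 10, 2.0
x = rng.normal(0.0, sd0, size=N)                        # q(x_1) = N(0, sd0^2)
q1 = np.exp(-0.5 * (x / sd0) ** 2) / (sd0 * np.sqrt(2.0 * np.pi))
w = pi(x) / q1                                          # equation (12)

for k in range(2, K + 1):
    x_new = x + rng.normal(0.0, 0.5, size=N)            # symmetric q(x_k | x_{k-1})
    w *= pi(x_new) / pi(x)                              # equation (13); L/q ratio = 1
    x = x_new

est_mean = np.sum(w * x) / np.sum(w)                    # equation (14) with f(x) = x
```

With the L-kernel equal to a symmetric forward proposal, the weight recursion telescopes, so the final weights depend only on the initial proposal density and the final sample positions.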

### 2.2 Recycling

At the $k$th iteration of an SMC sampler, the quantity of interest is estimated according to

$$\hat{f}_k = \frac{\sum_{i=1}^{N} w^*(X_{1:k}^i)\, f(X_k^i)}{\sum_{j=1}^{N} w^*(X_{1:k}^j)} \tag{14}$$

Recycling schemes involve combining the estimates $\hat{f}_1, \ldots, \hat{f}_k$ to realise an overall estimate, denoted $\hat{f}$. In the following, $\hat{f}$ is realised using a linear combination of $\hat{f}_1, \ldots, \hat{f}_k$ such that

$$\hat{f} = \sum_{k'=1}^{k} c_{k'}\, \hat{f}_{k'} = \sum_{k'=1}^{k} \sum_{i=1}^{N} \left( \frac{c_{k'}\, w^*(X_{1:k'}^i)}{\sum_{j=1}^{N} w^*(X_{1:k'}^j)} \right) f(X_{k'}^i) \tag{15}$$

where

$$\sum_{k'=1}^{k} c_{k'} = 1 \quad \text{and} \quad 0 \le c_{k'} \le 1, \quad k' = 1, \ldots, k \tag{16}$$

In the following, the optimal choice of the constants $c_{1:k}$ is defined as that which maximises the effective sample size:

$$\left[ \sum_{k'=1}^{k} \sum_{i=1}^{N} \left( \frac{c_{k'}\, w^*(X_{1:k'}^i)}{\sum_{j=1}^{N} w^*(X_{1:k'}^j)} \right)^{\!2} \right]^{-1} \tag{17}$$

subject to the constraints described by equation (16). As described in [28], this leads to

$$c_k^{\text{opt}} = \frac{l_k}{\sum_{k'=1}^{k} l_{k'}} \quad \text{where} \quad l_k = \frac{\left( \sum_{j=1}^{N} w^*(X_{1:k}^j) \right)^2}{\sum_{i=1}^{N} w^*(X_{1:k}^i)^2} \tag{18}$$

where $c_k^{\text{opt}}$ represents the optimal choice of $c_k$.
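Equation (18) translates into a short helper (an assumed implementation, not taken from the paper's repository): each $l_k$ is the effective sample size of iteration $k$'s weights, and the constants are these values normalised to sum to one.

```python
# Sketch of the optimal recycling constants in equation (18).
import numpy as np

def recycling_constants(weight_sets):
    """weight_sets: list of 1D arrays of unnormalised weights, one per iteration."""
    l = np.array([np.sum(w) ** 2 / np.sum(w ** 2) for w in weight_sets])
    return l / np.sum(l)                      # c_k^opt = l_k / sum_k' l_k'

# Hypothetical example: iteration 1 has uniform weights (l_1 = 4) and
# iteration 2 is degenerate (l_2 = 1), so c = [0.8, 0.2].
c = recycling_constants([np.ones(4), np.array([1.0, 0.0, 0.0, 0.0])])
```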

## 3 Optimal L-kernel

The L-kernel can be set such that

$$L_{k-1}(x_{k-1} \mid x_k) = q_{\rightarrow}(x_{k-1} \mid x_k), \quad \forall\, x_{k-1}, x_k \tag{19}$$

(this strategy, referred to here as the ‘forward proposal L-kernel’, was undertaken in [14]). Using a forward proposal L-kernel, the importance weights of the SMC algorithm are updated according to

$$w(X_{1:k}^i) = \frac{\pi(X_k^i)}{\pi(X_{k-1}^i)}\, \frac{q_{\rightarrow}(X_{k-1}^i \mid X_k^i)}{q_{\rightarrow}(X_k^i \mid X_{k-1}^i)}\, w(X_{1:k-1}^i) \tag{20}$$

which bears similarities to the Metropolis-Hastings acceptance rule (for clarity, in the notation of the current paper, the Metropolis-Hastings algorithm accepts $X_k^i$ as the new state of the Markov chain with probability $\min\!\left(1,\, \frac{\pi(X_k^i)\, q_{\rightarrow}(X_{k-1}^i \mid X_k^i)}{\pi(X_{k-1}^i)\, q_{\rightarrow}(X_k^i \mid X_{k-1}^i)}\right)$). While use of the forward proposal L-kernel appears mathematically convenient, we stress that it is, in fact, generally suboptimal for the sampler itself.

The current section describes a mathematical definition of the optimal L-kernel. Specifically, in Section 3.1, we begin by discussing the optimal L-kernel for a simplified SMC sampler that runs over 2 iterations only. This result is then demonstrated on a simple illustrative example in Section 3.2 before a more general derivation of the optimal L-kernel is outlined in Section 3.3. The authors note that the optimal L-kernel was originally shown in [8] and that the novelty of the current paper lies in the implementation strategy described in Section 4.

### 3.1 Simplified case

Here we restrict our discussion to the case where the normalising constant of the target distribution is known and consider an SMC sampler that is being used to estimate $\mathbb{E}_\pi[f(x)]$ (discussion of the case where the normalising constant is unknown is postponed until the end of Section 3.3). Furthermore, for the sake of simplicity, we initially consider a sampler that is run over 2 iterations only and where no resampling takes place (resampling is also addressed in Section 3.3).

At iteration 2, the sampler’s estimate of $\mathbb{E}_\pi[f(x)]$ is

$$\tilde{f}_2 = \frac{1}{N} \sum_{i=1}^{N} \frac{\pi(X_2^i)\, L_1(X_1^i \mid X_2^i)}{q_{\rightarrow}(X_2^i \mid X_1^i)\, q(X_1^i)}\, f(X_2^i), \quad X_{1:2}^i \sim q_{\rightarrow}(x_2 \mid x_1)\, q(x_1) \tag{21}$$

The variance of the estimate shown in equation (21) is

$$\mathrm{Var}[\tilde{f}_2] = \mathbb{E}[\tilde{f}_2^2] + \text{const} \tag{22}$$

where

$$\mathbb{E}[\tilde{f}_2^2] = \frac{1}{N} \sum_{i=1}^{N} \int \left( \frac{\pi(X_2^i)\, L_1(X_1^i \mid X_2^i)}{q_{\rightarrow}(X_2^i \mid X_1^i)\, q(X_1^i)}\, f(X_2^i) \right)^{\!2} q_{\rightarrow}(X_2^i \mid X_1^i)\, q(X_1^i)\, \mathrm{d}X_{1:2}^i \tag{23}$$

$$= \int \frac{\pi(x_2)^2\, L_1(x_1 \mid x_2)^2}{q_{\rightarrow}(x_2 \mid x_1)\, q(x_1)}\, f(x_2)^2\, \mathrm{d}x_{1:2} \tag{24}$$

We now aim to find the L-kernel, $L_1(x_1 \mid x_2)$, that minimises the variance of the estimator (i.e., minimises equation (24)) subject to the constraint that

$$\int L_1(x_1 \mid x_2)\, \mathrm{d}x_1 = 1 \quad \forall\, x_2 \tag{25}$$

(therefore ensuring that the L-kernel is a valid PDF). This constrained optimisation problem can be addressed using a Lagrange multiplier approach where, to ensure that the constraint in equation (25) is enforced, the following Lagrangian function is defined:

$$\mathbb{E}[\tilde{f}_2^2] - \int \lambda(x_2') \left( \int L_1(x_1 \mid x_2')\, \mathrm{d}x_1 - 1 \right) \mathrm{d}x_2' \tag{26}$$

where the $\lambda(x_2')$s are Lagrange multipliers. The optimal L-kernel therefore satisfies the following:

$$\frac{\partial}{\partial L_1} \left[ \mathbb{E}[\tilde{f}_2^2] - \int \lambda(x_2') \left( \int L_1(x_1 \mid x_2')\, \mathrm{d}x_1 - 1 \right) \mathrm{d}x_2' \right] = 0 \tag{27}$$

(where $\partial / \partial L_1$ denotes a functional derivative). Writing the optimal L-kernel as $L_1^{\text{opt}}(x_1 \mid x_2)$ then, from equation (27), it follows that

$$2 \int \frac{\pi(x_2)^2\, L_1^{\text{opt}}(x_1 \mid x_2)}{q_{\rightarrow}(x_2 \mid x_1)\, q(x_1)}\, f(x_2)^2\, \mathrm{d}x_{1:2} - \int \lambda(x_2')\, \mathrm{d}x_2' = 0 \tag{28}$$

which is satisfied when

$$2\, \frac{\pi(x_2)^2\, L_1^{\text{opt}}(x_1 \mid x_2)}{q_{\rightarrow}(x_2 \mid x_1)\, q(x_1)}\, f(x_2)^2 - \lambda(x_2) = 0 \tag{29}$$

therefore, noting that $\lambda(x_2) / \left(2 f(x_2)^2\, \pi(x_2)^2\right)$ is constant for fixed $x_2$, it follows that

$$L_1^{\text{opt}}(x_1 \mid x_2) \propto \frac{q_{\rightarrow}(x_2 \mid x_1)\, q(x_1)}{f(x_2)^2\, \pi(x_2)^2} \propto q_{\rightarrow}(x_2 \mid x_1)\, q(x_1) \tag{30}$$

Equation (30) shows the optimal L-kernel (i.e., the L-kernel that would minimise the sample variance of the estimator) for an SMC sampler that has been applied over 2 iterations. To realise an analytical expression for $L_1^{\text{opt}}(x_1 \mid x_2)$, one must rearrange $q_{\rightarrow}(x_2 \mid x_1)\, q(x_1)$ into the form $q_{\leftarrow}(x_1 \mid x_2)\, q(x_2)$, where $q_{\leftarrow}(x_1 \mid x_2)$ describes the probability that, given $x_2$, there had been a sample in the interval $[x_1, x_1 + \mathrm{d}x_1]$ at the first iteration of the SMC sampler. To emphasise this point, an illustrative example is given in the following section.

### 3.2 Illustrative Example

Here we consider a case where $x \in \mathbb{R}$ and where the aim is to estimate the mean of the target distribution $\pi(x)$. The SMC sampler is run for 2 iterations, whereby samples are proposed according to

$$q(x_1) = \mathcal{N}(x_1;\, 0,\, 1), \quad q_{\rightarrow}(x_2 \mid x_1) = \mathcal{N}(x_2;\, x_1,\, 1) \tag{31}$$

In this case, using standard expressions for Gaussian distributions, a closed-form expression for $q_{\leftarrow}(x_1 \mid x_2)$ can be shown to be

$$q_{\leftarrow}(x_1 \mid x_2) = \mathcal{N}\!\left(x_1;\, \frac{x_2}{2},\, \frac{1}{2}\right) \tag{32}$$

Note that, in this notation, $q_{\rightarrow}(x_2 \mid x_1)$ represents the probability of proposing a sample in the region $[x_2, x_2 + \mathrm{d}x_2]$ given that the algorithm’s state is currently $x_1$. Conversely, if the algorithm’s state is currently $x_2$, $q_{\leftarrow}(x_1 \mid x_2)$ represents the probability that the algorithm’s previous state was in the region $[x_1, x_1 + \mathrm{d}x_1]$. Recalling equation (30), the optimal L-kernel for this simple case study is therefore

$$L_1^{\text{opt}}(x_1 \mid x_2) = \mathcal{N}\!\left(x_1;\, \frac{x_2}{2},\, \frac{1}{2}\right) \tag{33}$$

The filled circles in Figure 1 (a) show 200 samples that were generated by an SMC sampler over two iterations. The contours shown in Figure 1 (a) illustrate the joint target distribution, $\pi^*_{1:2}(x_1, x_2)$, for the cases where the forward proposal L-kernel is used ($L_1(x_1 \mid x_2) = q_{\rightarrow}(x_1 \mid x_2)$) and where the optimal L-kernel (equation (33)) is used. Figure 1 (b) shows the marginals of the two contours over $x_1$. We note that, as a result of the constraint outlined by equation (25), the marginals over $x_1$ of the contours in Figure 1 (a) are both equal to $\pi(x_2)$ (i.e., $\int \pi^*_{1:2}(x_1, x_2)\, \mathrm{d}x_1 \propto \pi(x_2)$). Figure 1 (c), however, shows that the marginals of the contours over $x_2$ are not necessarily the same.

Figure 1 helps to give an intuitive impression of how the optimal L-kernel improves sampler performance. The optimal-L contour in Figure 1 (a) appears to be positioned such that it encloses a relatively large number of samples, thus maintaining a high effective sample size whilst still having the property that the marginal of $\pi^*_{1:2}(x_1, x_2)$, over $x_1$, is equal to $\pi(x_2)$. In this way, it is possible to envisage how using the optimal L-kernel can help to prevent relatively low values of effective sample size from occurring.
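The Gaussian-conditioning algebra behind equations (32) and (33) can be checked numerically (a quick sanity check, not part of the paper): with $q(x_1) = \mathcal{N}(0, 1)$ and $q_{\rightarrow}(x_2 \mid x_1) = \mathcal{N}(x_1, 1)$, the joint of $(x_1, x_2)$ has covariance $[[1, 1], [1, 2]]$, so $x_1 \mid x_2 \sim \mathcal{N}(x_2/2,\, 1/2)$.

```python
# Numerical check of equation (32) via Monte Carlo (not part of the paper).
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, size=500_000)            # q(x_1)
x2 = x1 + rng.normal(0.0, 1.0, size=500_000)       # q(x_2 | x_1)

cov12 = np.cov(x1, x2)[0, 1]
slope = cov12 / np.var(x2)              # conditional mean of x1 given x2 is slope * x2
cond_var = np.var(x1) - slope * cov12   # conditional variance of x1 given x2
```

Both estimates should recover the analytic values $1/2$ up to Monte-Carlo error.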

In the following section, a generalised expression for the optimal L-kernel is derived and a more mathematically rigorous explanation as to its performance enhancements is outlined.

### 3.3 General Case

Consider now a general SMC sampler that has run over $k$ iterations (where, as in Section 3.1, the normalising constant of the target distribution is assumed known). If no resampling has taken place before iteration $k$ then the estimate, $\tilde{f}_k$, is realised according to

$$\tilde{f}_k = \frac{1}{N} \sum_{i=1}^{N} \frac{\pi(X_k^i) \prod_{k'=2}^{k} L_{k'-1}(X_{k'-1}^i \mid X_{k'}^i)}{q(X_1^i) \prod_{k'=2}^{k} q_{\rightarrow}(X_{k'}^i \mid X_{k'-1}^i)}\, f(X_k^i), \tag{34}$$

where

$$X_{1:k}^i \sim q(x_1) \prod_{k'=2}^{k} q_{\rightarrow}(x_{k'} \mid x_{k'-1}) \tag{35}$$

In general, the optimal L-kernel is

$$L_{k-1}^{\text{opt}}(x_{k-1} \mid x_k) = q_{\leftarrow}(x_{k-1} \mid x_k) \tag{36}$$

As noted in [8], setting each of the L-kernels equal to their optimal value, the importance weights associated with a single ‘trajectory’ ($x_{1:k}$) of a generalised SMC sampler are therefore

$$w(x_{1:k}) = \frac{\pi_{1:k}(x_{1:k})}{q(x_{1:k})} = \frac{\pi(x_k) \prod_{k'=2}^{k} L_{k'-1}^{\text{opt}}(x_{k'-1} \mid x_{k'})}{q(x_k) \prod_{k'=2}^{k} q_{\leftarrow}(x_{k'-1} \mid x_{k'})} = \frac{\pi(x_k)}{q(x_k)} \tag{37}$$

Equation (37) highlights how, by using the optimal L-kernel, the samples can now be viewed as inhabiting a lower-dimensional space. Specifically, the importance weights associated with each trajectory ($x_{1:k}$) can now be calculated purely with regard to the marginal proposal distribution $q(x_k)$.

If resampling last occurred at iteration $l$ then the samples $\{X_l^i\}_{i=1}^N$ can be considered as (approximate) samples from $\pi(x_l)$. As a result, if we consider samples only after iteration $l$, we have

$$\tilde{f}_k \approx \frac{1}{N} \sum_{i=1}^{N} \frac{\pi(X_k^i) \prod_{k'=l+1}^{k} L_{k'-1}(X_{k'-1}^i \mid X_{k'}^i)}{\pi(X_l^i) \prod_{k'=l+1}^{k} q_{\rightarrow}(X_{k'}^i \mid X_{k'-1}^i)}\, f(X_k^i), \tag{38}$$

where, now, we have

$$q(x_k) = \int \pi(x_l) \prod_{k'=l+1}^{k} q_{\rightarrow}(x_{k'} \mid x_{k'-1})\, \mathrm{d}x_{l:k-1} \tag{39}$$

The optimal L-kernel, however, is still as described by equation (36) (the authors note that the notation used in the current paper doesn’t necessarily capture the fact that the samples $\{X_l^i\}_{i=1}^N$ are approximate samples from the target, $\pi(x_l)$; the aim, at this point of the manuscript, is to illustrate that resampling does not alter the novel optimal L-kernel approximation schemes that are proposed in Section 4 - for a more complete notation, the paper [8] is recommended).

Finally, for the case where the normalising constant of the target distribution is unknown, at iteration $k$, estimates of the quantity of interest are realised from

$$\tilde{f}_k = \frac{\sum_{i=1}^{N} w^*(X_{1:k}^i)\, f(X_k^i)}{\sum_{j=1}^{N} w^*(X_{1:k}^j)} \tag{40}$$

In this case, the optimal L-kernel is still equal to that shown in equation (36) as this approach will minimise the variance of the estimates of both the numerator and the denominator of equation (40).

## 4 Estimating the optimal L-kernel

General closed-form expressions for the optimal L-kernel are usually intractable. As a result, in the current section, we present an approach that allows the implementation of an approximately optimal L-kernel. More specifically, Section 4.1 describes an approximation that is suitable for situations where $q(x_{k-1}, x_k)$ can be approximated as a Gaussian distribution, while Section 4.2 describes a more general approximation that can be applied when, for example, the target distribution has multiple modes.

### 4.1 Gaussian Approximation

If it is assumed that

$$q(x_{k-1}, x_k) \approx \hat{q}(x_{k-1}, x_k) = \mathcal{N}\!\left( \begin{pmatrix} x_{k-1} \\ x_k \end{pmatrix};\, \begin{pmatrix} \mu_{k-1} \\ \mu_k \end{pmatrix},\, \begin{bmatrix} \Sigma_{k-1,k-1} & \Sigma_{k-1,k} \\ \Sigma_{k,k-1} & \Sigma_{k,k} \end{bmatrix} \right) \tag{41}$$

then, using established properties of Gaussian distributions, it is possible to show that

$$\hat{q}_{\leftarrow}(x_{k-1} \mid x_k) = \mathcal{N}\!\left(x_{k-1};\, \mu_{k-1|k},\, \Sigma_{k-1|k}\right) \tag{42}$$

where

$$\mu_{k-1|k} = \mu_{k-1} + \Sigma_{k-1,k}\, \Sigma_{k,k}^{-1} (x_k - \mu_k) \tag{43}$$

$$\Sigma_{k-1|k} = \Sigma_{k-1,k-1} - \Sigma_{k-1,k}\, \Sigma_{k,k}^{-1}\, \Sigma_{k,k-1} \tag{44}$$

Consequently, by using existing samples of $(X_{k-1}, X_k)$ to estimate the parameters of the Gaussian approximation described in equation (41), an approximately optimal L-kernel can be realised by setting $L_{k-1}(x_{k-1} \mid x_k)$ equal to $\hat{q}_{\leftarrow}(x_{k-1} \mid x_k)$ (where the latter is defined in equation (42)).
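Equations (41)-(44) translate directly into code. The following is a sketch (an assumed implementation, not the repository's) that estimates the joint Gaussian from paired samples and returns the conditional mean and covariance:

```python
# Sketch of the Gaussian approximation of the optimal L-kernel, eqs. (41)-(44).
import numpy as np

def approx_optimal_L(x_prev, x_curr, x_k):
    """x_prev, x_curr: (N, D) paired samples of (X_{k-1}, X_k); x_k: (D,) point.
    Returns the mean and covariance of the Gaussian q_hat(x_{k-1} | x_k)."""
    d = x_prev.shape[1]
    joint = np.hstack([x_prev, x_curr])           # samples of (x_{k-1}, x_k)
    mu = joint.mean(axis=0)                       # (mu_{k-1}, mu_k) in eq. (41)
    S = np.cov(joint, rowvar=False)               # block covariance in eq. (41)
    gain = S[:d, d:] @ np.linalg.inv(S[d:, d:])   # Sigma_{k-1,k} Sigma_{k,k}^{-1}
    mean = mu[:d] + gain @ (x_k - mu[d:])         # equation (43)
    cov = S[:d, :d] - gain @ S[d:, :d]            # equation (44)
    return mean, cov
```

In the notation of this section, `mean` and `cov` are $\mu_{k-1|k}$ and $\Sigma_{k-1|k}$; sampling the approximate L-kernel then amounts to drawing from $\mathcal{N}(\mu_{k-1|k}, \Sigma_{k-1|k})$.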

### 4.2 Mixture-of-Gaussian Approximation

For situations where a Gaussian approximation of $q(x_{k-1}, x_k)$ is poor, we can instead attempt to use a Mixture-of-Gaussians approximation. Specifically, this approach involves using samples of $(X_{k-1}, X_k)$ to realise the approximation

$$\hat{q}(x_{k-1}, x_k) = \sum_{m} \Pr(m)\, \mathcal{N}\!\left( \begin{pmatrix} x_{k-1} \\ x_k \end{pmatrix};\, \begin{pmatrix} \mu_{k-1}^{(m)} \\ \mu_k^{(m)} \end{pmatrix},\, \begin{bmatrix} \Sigma_{k-1,k-1}^{(m)} & \Sigma_{k-1,k}^{(m)} \\ \Sigma_{k,k-1}^{(m)} & \Sigma_{k,k}^{(m)} \end{bmatrix} \right) \tag{45}$$

where $m$ indexes each mixture component. Such an approach has the advantage that, once a Gaussian Mixture Model has been fitted to the samples, closed-form expressions for $\hat{q}_{\leftarrow}(x_{k-1} \mid x_k)$ can be realised. Specifically, it can be shown that

$$\hat{q}_{\leftarrow}(x_{k-1} \mid x_k) = \sum_{m} \Pr(m \mid x_k)\, \mathcal{N}\!\left(x_{k-1};\, \mu_{k-1|k}^{(m)},\, \Sigma_{k-1|k}^{(m)}\right) \tag{46}$$

where

$$\mu_{k-1|k}^{(m)} = \mu_{k-1}^{(m)} + \Sigma_{k-1,k}^{(m)} \left(\Sigma_{k,k}^{(m)}\right)^{-1} \left(x_k - \mu_k^{(m)}\right) \tag{47}$$

$$\Sigma_{k-1|k}^{(m)} = \Sigma_{k-1,k-1}^{(m)} - \Sigma_{k-1,k}^{(m)} \left(\Sigma_{k,k}^{(m)}\right)^{-1} \Sigma_{k,k-1}^{(m)} \tag{48}$$

and

$$\Pr(m \mid x_k) = \frac{p(x_k \mid m)\, \Pr(m)}{\sum_{m'} p(x_k \mid m')\, \Pr(m')} = \frac{\mathcal{N}\!\left(x_k;\, \mu_k^{(m)},\, \Sigma_{k,k}^{(m)}\right) \Pr(m)}{\sum_{m'} \mathcal{N}\!\left(x_k;\, \mu_k^{(m')},\, \Sigma_{k,k}^{(m')}\right) \Pr(m')} \tag{49}$$

In the following examples, when employed, all Gaussian Mixture Models are trained using the Expectation Maximisation algorithm (implemented in Python’s ‘scikit-learn’ package [29]).
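Given the parameters of an already-fitted two-component mixture (assumed known here for the 1D case, rather than estimated via Expectation Maximisation), equations (46)-(49) can be sketched as:

```python
# Sketch of the Mixture-of-Gaussians conditional, equations (46)-(49), in 1D.
import numpy as np

def gauss_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def gmm_conditional(x_k, pr, mu_prev, mu_curr, s_pp, s_pc, s_cc):
    """Per-component responsibilities Pr(m | x_k), conditional means (eq. 47)
    and variances (eq. 48) of x_{k-1} | x_k; all arguments are arrays over m."""
    resp = pr * gauss_pdf(x_k, mu_curr, s_cc)           # numerator of eq. (49)
    resp = resp / np.sum(resp)                          # equation (49)
    mean = mu_prev + (s_pc / s_cc) * (x_k - mu_curr)    # equation (47)
    var = s_pp - s_pc ** 2 / s_cc                       # equation (48)
    return resp, mean, var

# Hypothetical bi-modal example: modes near -5 and +5; conditioning near +5
# should assign nearly all responsibility to the second component.
resp, mean, var = gmm_conditional(
    5.0,
    pr=np.array([0.5, 0.5]),
    mu_prev=np.array([-5.0, 5.0]), mu_curr=np.array([-5.0, 5.0]),
    s_pp=np.array([1.0, 1.0]), s_pc=np.array([0.5, 0.5]), s_cc=np.array([1.0, 1.0]),
)
```

The overall conditional density, equation (46), is then the responsibility-weighted sum of the per-component Gaussians.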

## 5 Case Studies

This section details 3 case studies that demonstrate the proposed approach. Section 5.1 describes a problem with a 2-dimensional target distribution before, in Section 5.2, we study a scenario with a 1-dimensional bi-modal target distribution. Finally, in Section 5.3, we apply the approach to a real case study where the aim is to analyse the hyperparameter uncertainty in a Gaussian Process rotorcraft simulation model.

### 5.1 2D Toy Problem

To begin with, we study a 2D problem with target distribution

$$\pi(x) = \mathcal{N}(x;\, \mu,\, \Sigma) \tag{50}$$

where

 (51)

The initial proposal distribution was chosen to be while the general proposal distribution was chosen to be . As mentioned in Section 3, it is often the case that the L-kernel is chosen (sub-optimally) such that

$$L_{k-1}(x_{k-1} \mid x_k) = q_{\rightarrow}(x_{k-1} \mid x_k) \tag{52}$$

(in the current example, this would be achieved by setting ). We note again that, in all of the following case studies, the choice of L-kernel shown in equation (52) is referred to as the ‘forward proposal L-kernel’, while the approximately optimal L-kernels that are described in Section 4 are referred to simply as the ‘optimal L-kernel’.

SMC samplers using both the forward proposal L-kernel and the single-Gaussian approximately optimal L-kernel were run over 100 iterations with $N$ samples per iteration. The code used to obtain the following results can be run from the file ‘2D_toy_problem.py’ in the GitHub repository https://github.com/plgreenLIRU/SMC_approx_optL.

Figures 2 and 3 show the estimates of the target mean and elements of the target covariance matrix, respectively, using the recycling scheme described in Section 2.2. It is clear that, using the approximately optimal L-kernel, the SMC algorithm has converged to the true solutions faster than the SMC implementation with the forward proposal L-kernel. Figure 4 shows the effective sample sizes of the two approaches. It is interesting to note that, as well as aiding convergence, the use of an approximately optimal L-kernel has significantly reduced the number of times that resampling was required; the forward proposal L-kernel implementation performed resampling at every iteration whilst the approximately optimal L-kernel implementation performed resampling 35 times out of the 100 iterations. Finally, Table 1 shows the variance of the estimates realised using the two approaches, quantifying how, for this example, the use of an approximately optimal L-kernel reduces the variance of the algorithm’s estimates.

### 5.2 Bi-modal Toy Problem

For the second example, the aim is to target the bi-modal distribution:

$$\pi(x) = \frac{1}{2}\mathcal{N}(x;\, \mu_1,\, \sigma_1^2) + \frac{1}{2}\mathcal{N}(x;\, \mu_2,\, \sigma_2^2) \tag{53}$$

where , and . The mean and variance of equation (53) are

$$\mathbb{E}[X] = \frac{1}{2}\sum_{c=1}^{2} \mu_c = 0 \quad \text{and} \quad \mathrm{Var}[X] = \frac{1}{2}\sum_{c=1}^{2} \left(\sigma_c^2 + \mu_c^2\right) = 10 \tag{54}$$

respectively. The initial proposal was chosen to be while the general proposal was chosen to be . For the results shown in the remainder of this section, all SMC samplers were run for 1000 iterations with $N$ samples per iteration.
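The moments in equation (54) can be checked directly with hypothetical component values (the paper's exact $\mu_c$ and $\sigma_c$ are not reproduced here, but, for example, $\mu_{1,2} = \mp 3$ with $\sigma_c^2 = 1$ is consistent with a mean of 0 and a variance of 10):

```python
# Check of equation (54) using hypothetical component values (not the paper's).
mus = [-3.0, 3.0]          # assumed component means
vars_ = [1.0, 1.0]         # assumed component variances
mean = 0.5 * sum(mus)                                        # E[X] in eq. (54)
second_moment = 0.5 * sum(v + m ** 2 for v, m in zip(vars_, mus))
variance = second_moment - mean ** 2                         # equals 10 when E[X] = 0
```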

Given the bi-modal nature of the target distribution, it is reasonable to expect that the joint distribution of samples $(X_{k-1}, X_k)$ will also be bi-modal. Consequently, the current example illustrates a case where we would expect a bi-modal approximation of the optimal L-kernel (implemented as described in Section 4.2) to outperform the approach that approximates $q(x_{k-1}, x_k)$ with a single Gaussian (Section 4.1). In the following, ‘optimal L-kernel (1 component)’ and ‘optimal L-kernel (2 components)’ are used to represent the uni-modal and bi-modal approximations of the optimal L-kernel respectively. As before, ‘forward proposal L-kernel’ is used to represent the case where $L_{k-1}(x_{k-1} \mid x_k) = q_{\rightarrow}(x_{k-1} \mid x_k)$. The code used to obtain the following results can be run from the file ‘bimodal_toy_problem.py’ in the GitHub repository https://github.com/plgreenLIRU/SMC_approx_optL.

Figure 5 shows the resulting estimates of the target mean and variance, using the recycling scheme described in Section 2.2. The use of a bi-modal approximately optimal L-kernel has led to faster convergence relative to the case where the uni-modal approximately optimal L-kernel is used (although both approaches outperform the forward proposal L-kernel). Figure 6 illustrates that, by using the bi-modal approximately optimal L-kernel, the rate at which the effective sample size drops is greatly reduced; out of the 1000 iterations the forward proposal L-kernel, uni-modal approximately optimal L-kernel and bi-modal approximately optimal L-kernel implementations performed resampling 116, 126 and 36 times respectively. This is particularly important for the case where the target is bi-modal as, whenever resampling is required, it is possible (if improbable) that all of the samples in one particular mode of the target will be removed. The sample variances of the estimates realised using the 3 approaches are given in Table 2, where it can be seen that the bi-modal approximately optimal L-kernel achieved lower sample variances than the uni-modal approximately optimal L-kernel (and both L-kernel approximation methods achieve lower sample variance than the forward proposal L-kernel).

### 5.3 Gaussian Process Predictions of Rotorcraft Dynamics

Flight simulators are a vital part of any aircraft life cycle. However, to run in real-time, simplifications to the models that underpin flight simulators often have to be made. Such simplifications can cause the behaviour of the model to differ significantly from that of the real aircraft - this is particularly true for rotorcraft where complex nonlinear phenomena can occur. Data-based approaches have the potential to provide higher fidelity flight simulation models relative to current, physical-law based approaches.

The paper [20] details the development of a data-based flight simulator that is trained on real flight data from a Bo105 rotorcraft (the AGARD database was delivered to the University of Liverpool as part of GARTEUR HC(AG-16) Rotorcraft Pilot Couplings [13]; unfortunately, we do not currently have permission to publish this data alongside the current paper). In the current work, an SMC sampler is used to help quantify uncertainties associated with parameters of this machine-learnt flight simulator model.

Figure 7 shows the (normalised) pilot inputs during a particular maneuver of a Bo105 helicopter while Figure 8 shows the corresponding (normalised) pitch rate of the rotorcraft. We note here that, in the current example, we are investigating a model that is designed to predict on-axis responses only - the main pilot input during this maneuver is longitudinal stick position and, as such, the model is designed to predict the rotorcraft pitch rate.

Defining $y_n$ as the rotorcraft’s pitch rate at the $n$th time step then, following on from the work in [20], predictions of $y_n$ are made according to a model of the form:

$$y_n = f\!\left(y_{n-1},\, \delta x_n,\, \delta y_n,\, \delta p_n,\, \delta o_n\right) \tag{55}$$

where $\delta x_n$ is the longitudinal stick position, $\delta y_n$ is the lateral stick position, $\delta p_n$ is the tail rotor position and $\delta o_n$ is the collective lever position. In the current example, a zero-mean Gaussian Process (GP) is used to model the relationship in equation (55). The GP has a squared exponential kernel and employs a Gaussian likelihood, where it is assumed that differences between the model’s predictions and the measured values can be modelled as uncorrelated Gaussian noise with variance $\sigma^2$. The GP hyperparameters that require tuning are the length-scale of the kernel function, $l$, and the standard deviation, $\sigma$, of the Gaussian noise. The training data are shown as blue points on Figures 7 and 8. Here, the model is used to interpolate between the training points. We note that the model being utilised can be classified as ‘GP-NARX’ and, as such, its predictions have to be calculated in a Monte-Carlo fashion. This need arises from the fact that the model’s own (uncertain) predictions are, in subsequent iterations, used as inputs (see [37] for a more detailed discussion).

In the following analysis, the aim is to establish whether uncertainties associated with the GP hyperparameters ($l$ and $\sigma$) have significant influence on the overall uncertainty associated with the model’s predictions of pitch rate. To establish this, a Bayesian approach to the GP hyperparameter estimation is adopted where the following priors are used:

$$p(l) = \text{Gamma}(l;\, a = 1,\, s = 1), \quad p(\sigma) = \text{Gamma}(\sigma;\, a = 1,\, s = 0.01) \tag{56}$$

where a and s are, respectively, the shape and scale parameters (as described in Scipy’s [35] implementation of the Gamma distribution). To establish a ‘gold standard’ against which we can compare our proposed SMC approach, MCMC was first used to sample from the GP hyperparameter posterior distribution. To this end, the Metropolis-Hastings algorithm was used to generate posterior samples (where the first 2,000 were then considered ‘burn-in’ and removed accordingly). To provide a visual confirmation of MCMC convergence, 5 ‘short runs’ (of 10,000 samples each) were conducted - the resulting Markov chains are shown in Figure 9. A single MCMC ‘long run’, of 100,000 samples, was then conducted. The histograms associated with the MCMC long run are shown in Figure 10. In the following, results realised from Figure 10 are used to validate the outputs of the proposed SMC sampler. Each SMC sampler was run for iterations using samples per iteration.
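The priors in equation (56) can be written down directly using Scipy's parameterisation of the Gamma distribution (shape `a` and scale `scale`), as cited in [35]. The joint log-prior below assumes the two hyperparameters are a priori independent, which is consistent with the factorised form of equation (56):

```python
from scipy.stats import gamma

# Priors from equation (56), in Scipy's (shape, scale) parameterisation.
prior_l = gamma(a=1, scale=1)         # p(l)
prior_sigma = gamma(a=1, scale=0.01)  # p(sigma)

def log_prior(l, sigma):
    """Joint log-prior over the GP hyperparameters (assumed independent)."""
    return prior_l.logpdf(l) + prior_sigma.logpdf(sigma)
```

A Gamma distribution with shape a = 1 is simply an exponential distribution with the given scale, so both priors place most of their mass near zero while remaining proper.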

Figures 11 and 12 show the estimated posterior hyperparameter means and covariances, using the recycling scheme described in Section 2.2. It can be seen that, as with the previous examples, utilising the approximately optimal L-kernel has aided algorithm convergence. Figure 13 shows that using the approximately optimal L-kernel has, again, helped the SMC sampler to maintain a higher effective sample size; the forward proposal L-kernel and approximately optimal L-kernel implementations performed resampling 219 and 68 times respectively. The sample variances achieved using both approaches are reported in Table 3. We note here that 2-component approximations to the optimal L-kernel were also investigated but that no significant improvement in performance was observed (indicating the uni-modality of the target distribution). The 2-component results are not reported here for visual clarity.
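For readers unfamiliar with the effective sample size (ESS) monitored above: it is conventionally computed from the normalised importance weights as 1/Σw², and resampling is triggered when it falls below a threshold (commonly N/2). A minimal sketch of this standard mechanism, using multinomial resampling, is:

```python
import numpy as np

def effective_sample_size(log_w):
    """ESS from unnormalised log-weights: 1 / sum of squared normalised weights."""
    w = np.exp(log_w - np.max(log_w))  # subtract max for numerical stability
    w /= w.sum()
    return 1.0 / np.sum(w ** 2)

def maybe_resample(samples, log_w, rng, threshold=None):
    """Multinomial resampling when the ESS falls below threshold (default N/2)."""
    n = len(samples)
    threshold = n / 2 if threshold is None else threshold
    w = np.exp(log_w - np.max(log_w))
    w /= w.sum()
    if 1.0 / np.sum(w ** 2) < threshold:
        idx = rng.choice(n, size=n, p=w)   # draw indices proportional to weight
        samples = samples[idx]
        log_w = np.full(n, -np.log(n))     # reset to uniform weights
    return samples, log_w
```

Since resampling discards low-weight particles, an L-kernel that keeps the weights balanced (and hence the ESS high) directly reduces how often this step is needed.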

For the sake of completeness, Figure 14 shows Monte Carlo simulations of the model’s predictions of pitch rate using a maximum posterior estimate of the GP hyperparameters (red lines) and accounting for hyperparameter uncertainty (blue lines). We note that the hyperparameter uncertainty does not appear to contribute significantly to the overall uncertainty of the modelling approach (a similar conclusion was reached, using MCMC, in [19]).

## 6 Discussion

The investigation detailed in the current paper highlights three considerations that must be made in the application of the proposed approximation approach for the optimal L-kernel.

Firstly, the approach relies on the approximation of the joint proposal distribution, , as a Gaussian mixture. The quality of this approximation will, of course, influence the success of the proposed approach. In particular, it has to be ensured that a sufficient number of samples are used at each iteration of the algorithm. The authors aim to investigate the influence of this requirement in higher-dimensional problems as a topic of future work.
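In the single-component case, the Gaussian approximation of the joint proposal amounts to fitting a joint Gaussian to paired samples from consecutive iterations and then conditioning. The sketch below (a minimal illustration, not the paper's full mixture implementation) fits a joint Gaussian over (x_prev, x_curr) empirically and returns the standard conditional-Gaussian mean and covariance, which is the form the approximately optimal L-kernel takes in this case:

```python
import numpy as np

def conditional_gaussian(x_prev, x_curr, x_new):
    """Fit a joint Gaussian to paired samples (x_prev, x_curr), each of shape
    (n, d), and return the mean and covariance of x_prev conditioned on
    x_curr = x_new, via the standard Gaussian conditioning identities."""
    d = x_prev.shape[1]
    joint = np.hstack([x_prev, x_curr])
    mu = joint.mean(axis=0)
    cov = np.cov(joint, rowvar=False)
    mu1, mu2 = mu[:d], mu[d:]
    S11, S12 = cov[:d, :d], cov[:d, d:]
    S21, S22 = cov[d:, :d], cov[d:, d:]
    gain = S12 @ np.linalg.inv(S22)
    mean_cond = mu1 + gain @ (x_new - mu2)
    cov_cond = S11 - gain @ S21
    return mean_cond, cov_cond
```

The reliance on empirical means and covariances is what makes the sample budget per iteration matter: with too few samples the fitted joint Gaussian, and hence the conditional, becomes unreliable.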

Secondly, the current work discusses an approximation for the optimal L-kernel, but it does not discuss the optimal choice of proposal distribution. The proposal distribution will, in a similar manner to MCMC, have a significant influence over the efficiency of the algorithm. In short, if a poor proposal is selected, the algorithm will perform badly regardless of the choice of L-kernel. For future work, the authors aim to couple the proposed approach with a strategy for selecting good proposal distributions - this will likely utilise an ‘annealing-based’ algorithm, of the kind common to many MCMC [23, 33, 18, 26, 12, 6, 15] and SMC [38, 36, 2, 28, 32] implementations. This work may exploit the fact that SMC samplers are not restricted by the condition known as ‘detailed balance’ and can therefore potentially utilise a broader range of proposal distributions relative to MCMC. These proposal distributions may, for example, incorporate memory, i.e., they may be formed using information realised from multiple iterations of previously generated samples.

Thirdly, we note that, throughout the current paper, the use of an approximately optimal L-kernel prevented relatively large drops in effective sample size and helped reduce the number of times resampling was required. We emphasise that this feature of the approach is particularly important regarding problems with multiple modes as, in such cases, it is possible (if improbable) that resampling will lead to the removal of samples from a particular mode.

## 7 Conclusions

Markov chain Monte Carlo (MCMC) has become a key tool in the treatment of intractable Bayesian inference problems. Sequential Monte Carlo (SMC) samplers, however, provide an alternative to MCMC that is easier to parallelise and which has several ‘tuning parameters’ that are not available to MCMC. The current paper investigates an implementation strategy for one of these tuning parameters - the SMC sampler L-kernel. Through the proposed strategy, a number of case studies are used to illustrate how the efficiency of the sampler can be improved. Specifically, for the examples shown in the current paper, the use of an approximately optimum L-kernel has reduced the sample variance of the SMC estimates by up to 99 % while also reducing the number of times that resampling was required by between 65 % and 70 %.

## Acknowledgements

The authors gratefully acknowledge the Engineering and Physical Sciences Research Council, who funded this work through the grant ‘Big Hypotheses: a Fully Parallelised Bayesian Inference Solution’ (EP/R018537/1).

## References

• [1] M.S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp (2002) A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing 50 (2), pp. 174–188. Cited by: footnote 2.
• [2] J.M. Bernardo, M.J. Bayarri, J.O. Berger, A.P. Dawid, D. Heckerman, A.F.M. Smith, and M. West (2011) Free energy sequential Monte Carlo, application to mixture modelling. Bayesian Statistics 9, pp. 91. Cited by: §6.
• [3] F.V. Bonassi, M. West, et al. (2015) Sequential Monte Carlo with adaptive weights for approximate Bayesian computation. Bayesian Analysis 10 (1), pp. 171–187. Cited by: §1.
• [4] B. Carpenter, A. Gelman, M.D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell (2017) Stan: A probabilistic programming language. Journal of statistical software 76 (1). Cited by: §1.
• [5] F. Cérou, P. Del Moral, T. Furon, and A. Guyader (2012) Sequential Monte Carlo for rare event estimation. Statistics and computing 22 (3), pp. 795–808. Cited by: §1.
• [6] J. Ching and Y.C. Chen (2007) Transitional Markov chain Monte Carlo method for Bayesian model updating, model class selection, and model averaging. Journal of engineering mechanics 133 (7), pp. 816–832. Cited by: §6, footnote 1.
• [7] N. Chopin (2002) A sequential particle filter method for static models.