1 Introduction
In Bayesian statistics, it is often necessary to draw random samples from a probability distribution. Markov Chain Monte Carlo (MCMC) methods are commonly used to address this problem since direct sampling is often hard or impossible. Sequential Monte Carlo (SMC) Samplers are a member of the broader class of SMC methods (which also includes Particle Filters) and can be used in the same application domains as MCMC
[1]. While many papers on Particle Filters or MCMC methods exist, SMC Samplers remain relatively unexplored as a replacement for MCMC. Research has focused on improving the runtime and accuracy of SMC and MCMC methods to meet the constraints of modern applications. While accuracy has been improved by several approaches, ranging from better proposal distributions [2] to better resampling and better recycling [3], to improve the runtime these algorithms need to employ parallel computing.
Generic MCMC methods are not parallelisable by nature as it is hard for a single Markov chain to be processed simultaneously by multiple processing elements. In [4], an approach which aims to parallelise a single chain is presented, but it quickly becomes problem-specific because the efficiency of the parallelisation is not guaranteed, especially for computationally cheap proposal distributions. We acknowledge that one could implement multiple instances of MCMC in parallel as in [5], but argue that we could also apply the same idea to multiple instances of SMC Samplers. However, all chains also need to burn-in concurrently, making it difficult to use this approach to reduce the runtime. In this paper, we seek to develop a parallel implementation of a single instance of a sampling algorithm which outperforms a single MCMC algorithm both in terms of runtime and accuracy. Therefore, we leave comparisons with multiple-chain MCMC to future work, along with comparisons with parallel instances of SMC Samplers.
Particle Filters offer inherent parallelism, although an efficient parallelisation is not trivially achievable. The resampling step, which is necessary to respond to particle degeneracy [6], is indeed a challenging task to parallelise. This is due to the problems encountered in parallelising the constituent redistribute step. Initial approaches to performing resampling are explained in [6], [7] and achieve time complexity. In [8], it has been proven that redistribute can be parallelised by using a divide-and-conquer approach with time complexity equal to . This algorithm has been optimised and implemented on MapReduce in [9] and then ported to Message Passing Interface (MPI) in [10]. Although the time complexity is improved to , it has been shown that at least parallel cores are required to outperform the redistribute version when all other steps are parallelised using MPI.
No parallel implementation of the SMC Sampler on MPI is currently available, despite its similarities with the Particle Filter. Hence, the first goal of this paper is to show that an MPI implementation of the SMC Sampler can be translated from the MPI Particle Filter in [10] by porting its key components. An optimisation of the redistribute in [10] will also be discussed and included in the proposed algorithm. This paper also compares, both in terms of runtime and accuracy, a basic implementation of the SMC Sampler on MPI with an equally simple MCMC method, Metropolis-Hastings [11]. By proving that the SMC Sampler can outperform at least one instance of MCMC, the goal is to clear the way for future research (which space constraints prohibit exploring extensively herein). That future research can then optimise the SMC Sampler and compare it with better-performing MCMC methods, such as TMCMC [12] or HMC [13], in the context of both single and multiple chains (see above). In doing so, optimisations of SMC Samplers may include improved L-kernels, proposal distributions and a full comparison of resampling implementations (akin to that done in [14] in the context of a single core).
The rest of the paper is organised as follows: in Section 2 we give some information about distributed memory architectures and MPI. In Section 3, we describe Metropolis-Hastings and SMC methods with a focus on similarities and differences between Particle Filters and SMC Samplers. In Section 4, we introduce our novel implementation strategy. In Section 5, we describe and show the results of several exemplary case studies with a view to showing worst-case performance, maximum speedup and space complexity of our MPI implementation of the SMC Sampler and its performance versus Metropolis-Hastings. In Section 6, we draw our conclusions and give suggestions for future improvements.
2 Distributed Memory Model
Distributed memory architectures are a type of parallel system which are inherently different from shared memory architectures. In this environment, the memory is distributed over the cores and each core can only directly access its own private memory. Exchange of information stored in the memory of the other cores is achieved by sending/receiving explicit messages through a common communication network.
The main advantages relative to shared memory architectures include scalable memory and computation capability with the number of cores and a guarantee of there being no interference when a core accesses its own memory. The main disadvantage is the cost of communication and consequent data movement. This may affect the speedup relative to a single core.
In order to implement the algorithms we discuss in this paper, we use Message Passing Interface (MPI) which is one of the most common programming models for distributed memory environments. In this model, the cores are uniquely identified by a rank, connected via communicators and they use Send/Receive communication routines to exchange messages.
3 SMC and MCMC methods
In this section, we provide details about MCMC and SMC methods with a view to showing similarities and differences between Particle Filters and SMC Samplers. The reader is referred to [1], [6] and [11] for further details.
3.1 Sequential Monte Carlo methods
SMC methods apply the Importance Sampling principle to make Bayesian inferences. The main idea consists of generating
statistically independent hypotheses called particles (or samples) at every given iteration. The population of particles is sampled from a user-defined proposal distribution such that it represents the pdf of the state of a dynamic model (in Particle Filters) or samples from a static target posterior distribution (in SMC Samplers; while it is not discussed extensively here, SMC Samplers can also be configured to offer improved performance in contexts where a Particle Filter struggles [15]). Each particle is then assigned an unnormalised importance weight, such that the array of weights provides information on which particle best describes the real state of interest. The particles are, however, subject to a phenomenon called degeneracy which (within a few iterations) makes all weights but one decrease towards zero. This is because the variance of the weights is proven to increase at every iteration
[6]. There exist different strategies to tackle degeneracy. The most common is to perform a resampling step which repopulates the particles by eliminating the most negligible ones and duplicating the most important ones. Different variants of resampling exist [14] and the chosen methodology is described in detail in Section 3.3. Resampling is only triggered when it is needed, more precisely when the (approximate) effective sample size

$N_{eff} = 1 \big/ \sum_{i=1}^{N} (\tilde{w}^i)^2$ (1)

decreases below a certain threshold. $\tilde{\mathbf{w}}$ represents the array of the normalised weights, each of them calculated as follows:

$\tilde{w}^i = w^i \big/ \sum_{j=1}^{N} w^j$ (2)
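As a concrete illustration, the normalisation of (2) and the effective-sample-size test of (1) can be sketched in a few lines of Python (a minimal sketch; the variable names are our own and the N/2 threshold is the common convention rather than a value taken from this excerpt):

```python
def normalise_and_ess(weights):
    """Normalise unnormalised importance weights (eq. 2) and return
    the approximate effective sample size (eq. 1)."""
    total = sum(weights)
    norm = [w / total for w in weights]
    # N_eff = 1 / sum of squared normalised weights
    ess = 1.0 / sum(w * w for w in norm)
    return norm, ess

# Resampling is triggered only when the ESS drops below a threshold,
# commonly N/2.
weights = [0.5, 0.1, 0.1, 0.1]  # unnormalised
norm, ess = normalise_and_ess(weights)
assert abs(sum(norm) - 1.0) < 1e-12
needs_resampling = ess < len(weights) / 2
```

Note that uniform weights give the maximum ESS of N, while a single dominant weight drives the ESS towards 1, which is exactly the degeneracy symptom described above.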
At every iteration, estimates are produced as a weighted sum of the particles, weighted using the normalised weights.

3.1.1 Particle Filters
A range of different Particle Filter methods exist. This section provides a brief description of Sequential Importance Resampling (SIR), described by Algorithm 1 in the appendix.
Let be the current state of the dynamic system that we want to estimate. At every time step a new measurement is collected. In the SIR Filter, the weighted particles are initially drawn from the prior distribution and then drawn from the proposal distribution as follows:
(3) 
The weights are initially set to 1/N and then computed as
(4) 
The weights are then normalised and used to calculate the effective sample size as in (1). Then resampling is performed if needed. In the last step, the estimate of the state is given by the weighted mean of the particles.
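The SIR steps above can be sketched as follows for a hypothetical 1-D random-walk model (the model, the noise variances and the multinomial resampling used here are illustrative simplifications of our own; the paper itself uses MVR, described in Section 3.3):

```python
import math
import random

def sir_step(particles, weights, y, q=1.0, r=1.0):
    """One iteration of an SIR filter for a toy 1-D random-walk model
    (x_t = x_{t-1} + process noise, y_t = x_t + measurement noise).
    The dynamic model is used as the proposal, so the incremental
    weight reduces to the likelihood p(y_t | x_t^i), cf. (3)-(4)."""
    n = len(particles)
    # Importance sampling: propagate each particle through the proposal.
    particles = [x + random.gauss(0.0, math.sqrt(q)) for x in particles]
    # Weight update: multiply by the likelihood of the new measurement.
    weights = [w * math.exp(-(y - x) ** 2 / (2 * r))
               for x, w in zip(particles, weights)]
    total = sum(weights)
    weights = [w / total for w in weights]
    ess = 1.0 / sum(w * w for w in weights)
    if ess < n / 2:  # resample (multinomial here for brevity)
        particles = random.choices(particles, weights=weights, k=n)
        weights = [1.0 / n] * n
    # Estimate: weighted mean of the particles.
    estimate = sum(w * x for w, x in zip(weights, particles))
    return particles, weights, estimate
```

Each call consumes one measurement `y` and returns the updated particle population, weights and state estimate.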
3.1.2 SMC Samplers with recycling
Like MCMC methods, the goal of the SMC Sampler is to draw samples from a static target distribution of interest. The algorithm begins by drawing samples from the initial proposal and giving the th sample the weight .
After the first iteration, the samples are drawn from the forward Markov kernel, , while the weights require backward Markov kernels as follows:
(5) 
As is the case for Particle Filters, after the importance weights evaluation and normalisation, the resampling step may be triggered depending on the value of .
In the vanilla SMC Sampler, estimates are performed according to the particles in the final iteration. The expected value is computed by multiplication of the particles at the final iteration with the corresponding weights. In [3], a novel recycling method is proposed. Instead of considering the particles from the last iteration as providing the outputs, estimates are computed using all particles from all iterations. Using the notation of this paper, estimates are performed as follows:
(6) 
where is calculated as
(7) 
and the normalisation constant (which, in (8), is equivalent to (14) in [3], albeit with simplified notation here) is
(8) 
Algorithm 2 in the appendix describes the SMC Sampler with the recycling method.
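The recycling idea can be illustrated schematically. In this sketch, each iteration's estimate is weighted by its effective sample size and the weights are normalised to sum to one, mirroring the roles of (7) and (8); the exact constants used in the paper are those equations, so treat the weighting below as an assumption for illustration only:

```python
def recycled_estimate(per_iter_estimates, per_iter_ess):
    """Combine the estimates from all T iterations instead of keeping
    only the last one. Each iteration is weighted by its effective
    sample size; the sum of these weights plays the role of the
    normalisation constant, cf. (8)."""
    z = sum(per_iter_ess)  # normalisation constant
    return sum(ess / z * est
               for ess, est in zip(per_iter_ess, per_iter_estimates))
```

With equal per-iteration ESS values this reduces to a plain average of the per-iteration estimates, which is the intuition behind recycling: information from early iterations is not discarded.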
3.2 Metropolis-Hastings Algorithm
The Metropolis-Hastings (MH) algorithm (see Algorithm 3 in the appendix) simulates a Markov chain where, at each iteration, a new sample is drawn from a proposal distribution. The new sample is accepted or rejected using the Rejection Sampling principle with a certain acceptance probability. To reduce the dependency on the initial sample, the first (user-defined number of) samples are discarded (burn-in).
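A minimal random-walk sketch of MH, assuming a symmetric Gaussian proposal (so the Hastings correction cancels and the acceptance probability reduces to the target ratio) and working in log space for numerical stability; this is a generic textbook formulation, not the paper's Algorithm 3 verbatim:

```python
import math
import random

def metropolis_hastings(log_target, x0, n_samples, burn_in, step=1.0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian
    proposal. The first `burn_in` samples are discarded."""
    x, samples = x0, []
    for i in range(n_samples + burn_in):
        x_new = x + random.gauss(0.0, step)            # propose
        log_alpha = log_target(x_new) - log_target(x)  # log acceptance ratio
        if math.log(random.random()) < log_alpha:      # accept/reject
            x = x_new
        if i >= burn_in:                               # keep post-burn-in only
            samples.append(x)
    return samples
```

For example, passing `lambda x: -x * x / 2` as `log_target` samples (up to a constant) from a standard normal distribution.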
3.3 Key components of Particle Filters and SMC Samplers
Algorithms 1 and 2 in the appendix show that the SIR Particle Filter and the SMC Sampler with recycling share the same key components.
Importance Sampling is trivially parallelisable as (3), (4) and (5) are element-wise operations. Hence, Importance Sampling achieves time complexity for cores.
Expressions (1), (2), (7) and the weighted mean of the particles require Sum and can then be easily parallelised by using Reduction. The time complexity of any Reduction operation scales logarithmically with the number of cores.
Both algorithms invoke resampling if . Several alternative resampling steps have been proposed and a comparison between them is discussed in [14]. These algorithms solve the problem in operations. The key idea of these algorithms is to process the normalised weights to generate an array of integers whose th element indicates how many times the th particle has to be duplicated. It is easy to infer that this array has the following property:
(9) 
In previous work to parallelise Particle Filters described in [6], [8], [9] and [10], Minimum Variance Resampling (MVR), a variant of Systematic Resampling in [14], has always been the preferred resampling scheme. Since this paper builds on the results in [10], MVR will be the only variant of resampling we consider. This algorithm uses Cumulative Sum to calculate the CDF and then generates the array of copy counts such that the new population of particles has minimum ergodic variance. After that, it is necessary to perform a task called redistribute which duplicates each particle as many times as its copy count. This task has already been identified as a bottleneck (see [6], [8], [9]) and it will be discussed in detail in the next section. We note that the reset step (which sets all the weights to 1/N) after redistribute is trivially parallelised.
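A single-core sketch of how MVR might generate the copy counts from the CDF of the normalised weights. The name `ncopies` and the stratified-point formulation below are our notational choices, not code from the paper; the invariant it preserves is exactly property (9):

```python
import math
import random

def mvr_ncopies(norm_weights):
    """Minimum Variance Resampling sketch: take the cumulative sum of
    the normalised weights (the CDF) and count how many of the N evenly
    spaced points (k + u)/N, with a single shared draw u ~ U[0, 1),
    fall into each particle's CDF interval. The counts sum to N."""
    n = len(norm_weights)
    u = random.random()
    cdf, ncopies, prev = 0.0, [], 0
    for w in norm_weights:
        cdf += w
        cur = math.ceil(n * cdf - u)  # points covered so far
        ncopies.append(cur - prev)
        prev = cur
    return ncopies

ncopies = mvr_ncopies([0.5, 0.25, 0.125, 0.125])
assert sum(ncopies) == 4  # property (9)
```

Because one shared offset `u` is reused for every interval, the counts vary by at most one from their expected values, which is the "minimum variance" behaviour named above.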
3.3.1 Redistribute
The redistribute step is necessary to regenerate the population of particles and is a task which all resampling variants have in common. A naive and mature implementation can be found in [6], [7]. The same is described by Algorithm 5 in the appendix and referred to as Sequential Redistribute (SR) in the rest of the paper. This routine simply iterates over the copy counts and copies the th particle as many times as its count. Considering that (9) holds, it is easy to infer that SR achieves time complexity with a very low constant time. However, this algorithm is not trivial to parallelise because the workload among the processors cannot be readily distributed deterministically. This is because each count could be equal to any value between 0 and N. Parallelisation is even more complicated on distributed memory architectures since one core cannot directly access the memory of the other cores [10].
In [8], it has been shown that, by using a top-down divide-and-conquer approach, redistribute can be parallelised. Starting from the root node, the key idea consists of sorting and moving the particles at every stage of a binary tree. This can be achieved by searching for a particular index which perfectly divides the node into two balanced leaves. Once this index is identified, the node is split. In order to find it, Cumulative Sum (whose parallel implementation runs in steps [16]) is performed and then the index is the first position where the Cumulative Sum is equal to or greater than . This routine is repeated recursively times. Since Bitonic Sort is the chosen parallel sorting algorithm and it is known that its time complexity is equal to with cores, we can infer that this redistribute achieves time complexity for the same level of parallelism. Sorting the particles is required to make sure that the splitting phase can be performed deterministically in .
In [9], the redistribute algorithm in [8] was improved by making a subtle observation: the workload can still be divided deterministically if we perform Bitonic Sort only once. After this single sort, the algorithm moves on to another top-down routine in which rotational shifts move all particles on the left side of the splitting index up to the left side of the node. This way the parent node gets split into two balanced leaves. This routine is performed recursively times until the workload is equally distributed across the cores; then SR is called. Algorithm 7 in the appendix summarises this routine, which is referred to in this paper as Bitonic Sort Based Redistribute (BR). Rotational shifts are faster than Bitonic Sort as the achieved time complexity is equal to and, therefore, the overall time complexity is improved to . In [9], BR has been implemented on MapReduce and, although it was significantly better than the algorithm in [8], its runtime for cores was up to times worse than a single-core SR.
In [10], BR has been ported to distributed memory architectures by using MPI and compared to a deterministic MPI implementation of SR, in which one core gathers all particles from the other cores, performs SR locally and scatters back the resulting array. To avoid misunderstanding, we refer to the MPI implementation of SR in [10] as Centralised Redistribute (CR), which is described by Algorithm 6 in the appendix. The results indicate that the scalability of the MPI implementation is improved relative to the scalability achieved using MapReduce because BR on MPI could outperform CR for at least cores. Possible ways to improve BR are discussed in the next section.
4 Novel Implementation
In this section, we consider ways to improve redistribute and how an MPI SMC Sampler could be an alternative to MetropolisHastings.
4.1 Improving singlecore Bitonic Sort
As is observed here and has been discussed elsewhere in the literature [9], [10], the redistribute step is the bottleneck that complicates parallel implementation of Particle Filters. To make this clear, we repeat an experiment from [10] and report the results of the same SIR Particle Filter with BR within, in Figure 1. The runtimes vs the number of cores are given for the entire Particle Filter, the constituent redistribute step and the subset of redistribute that is taken up by the Bitonic Sort step. As we can see, redistribute always accounts for at least of the total runtime and this proportion increases with . We can also observe that Bitonic Sort is by far the most computationally intensive task within redistribute and hence is the true bottleneck of the Particle Filter.
Bitonic Sort is a very fast comparison-based parallel sorting algorithm [17]. This algorithm uses a divide-and-conquer approach to first divide the input sequence into a series of Bitonic sequences (a Bitonic sequence is a sequence of keys in which the first keys are sorted in increasing order, while the remaining keys are sorted in decreasing order). Then the Bitonic sequences are recursively merged together until the algorithm returns a single monotonic sorted sequence. A possible sorting network is illustrated in Figure 2. Each horizontal wire represents a key, the vertical arrows connect the input keys for a comparison and the direction represents the order of the keys after the comparison has occurred. The coloured blocks represent applications of Bitonic Merge (blue or red if Merge is called in increasing or decreasing order respectively).
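For reference, the recursive structure of Bitonic Sort can be sketched on a single core as follows (a standard textbook formulation, not the paper's MPI implementation; the input length must be a power of two):

```python
def bitonic_sort(keys, ascending=True):
    """Sort a power-of-two-length list: build two half-length sequences
    sorted in opposite directions (together a Bitonic sequence), then
    bitonic-merge them into one monotonic sequence."""
    n = len(keys)
    if n <= 1:
        return keys
    first = bitonic_sort(keys[:n // 2], True)
    second = bitonic_sort(keys[n // 2:], False)
    return bitonic_merge(first + second, ascending)

def bitonic_merge(keys, ascending):
    """Merge a Bitonic sequence into a monotonic one by comparing and
    swapping elements n/2 apart, then recursing on both halves."""
    n = len(keys)
    if n <= 1:
        return keys
    for i in range(n // 2):
        if (keys[i] > keys[i + n // 2]) == ascending:
            keys[i], keys[i + n // 2] = keys[i + n // 2], keys[i]
    return (bitonic_merge(keys[:n // 2], ascending)
            + bitonic_merge(keys[n // 2:], ascending))
```

In a parallel deployment, each recursion level of `bitonic_merge` corresponds to one column of compare-and-swap operations in the Figure 2 network, which is why the data-independent comparison pattern maps so naturally onto cores.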
It has been proven that, given a generic array of elements, Bitonic Sort solves the problem in comparisons [17]. Bitonic Sort is, however, suitable to run in parallel by making cores work on different chunks of the input array. In this case, each wire (or groups of wires) in Figure 2 may also represent a core (or the elements that each core operates on). When , it is easy to infer that the achieved time complexity is equal to . More generically, we can say that for any number of cores the time complexity is equal to
(10) 
is the time complexity of performing Bitonic Sort locally before the cores start interacting with each other. This term is dominant, especially for low values of . One possible way to improve Bitonic Sort (and, by extension, redistribute) is to substitute the serial Bitonic Sort algorithm with a better single-core sorting algorithm, as was suggested in [10].
In the literature, there are plenty of alternatives to Bitonic Sort. Algorithms such as Quicksort, Mergesort and Heapsort, for example, achieve time complexity. Quicksort is on average faster than Mergesort and Heapsort. However, the choice of Quicksort's pivot can severely influence its performance: it is known, in fact, that Quicksort's worst-case time complexity is . This occurs when the pivot chosen at every iteration is equal to either the minimum or the maximum of the available keys. Although this case is statistically very rare in several modern applications, in the case of SMC methods the worst-case scenario is often encountered: the array of copy counts has to be sorted and, since (9) holds, there is a high probability that 0 is picked as Quicksort's pivot, i.e. a high probability that the pivot is the minimum element.
Heapsort achieves time complexity in all cases except when all keys are equal. In this special, although rather unlikely, case the time complexity is . Mergesort, however, is perfectly deterministic and data-independent and represents a valid alternative to Bitonic Sort, which we consider in the experiment in Section 5.1. A Bitonic Sorter with Mergesort performed locally achieves the following time complexity:
(11) 
We also observe that the array to sort is an array of integers. Hence, one could locally use linear-time sorting algorithms such as Counting Sort or Radix Sort (which are both only applicable to arrays of integers). Although Counting Sort has deterministic and data-independent time complexity, its space complexity is data-dependent. This is because Counting Sort allocates a temporary array with as many elements as . In the worst case , and since could be very high, the temporary array may not fit within the local memory of a single machine. This problem is shared with CR and the impact of this issue will be discussed in Section 5.4. On the other hand, Radix Sort is a feasible deterministic solution. However, Radix Sort is data-dependent because its time complexity is actually where the constant is equal to the number of digits of the maximum element (which can be in the worst case). Therefore, Radix Sort may be too slow when is high and its runtime may fluctuate too much as a function of the input.
In summary, we are looking for a parallel sorting algorithm that works with integer numbers and is deterministic and data-independent with respect to both time and space complexity. While a combination of Bitonic Sort and Mergesort within a core achieves these aims, in the next two sections we go on to develop an improved strategy that is sufficient for our needs and does not require sorting at all.
4.2 Nearly Sort: an alternative to singlecore sorting
In [6], sorting was used extensively. In BR, rotational shifts are used times while Bitonic Sort is used only once. This replacement of sort with rotational shifts has improved the time complexity. However, it has also led to a more subtle consideration: by observing the input of the rotational shifts, we can infer that we do not actually need to perfectly sort the particles to divide the workload deterministically. This condition is satisfied as long as, stage by stage, the particles that have to be duplicated are separated from those that do not. To make things clearer, we first provide the following definition.
Definition 1
Let be a sequence of elements. It is called a Nearly-Sorted sequence when it has the following shape: where and . It is an ascending Nearly-Sorted sequence if the first elements are and a descending Nearly-Sorted sequence if the final elements are .
We can infer that the workload can be divided deterministically if the array of copy counts is a Nearly-Sorted sequence. In BR, this condition is ensured by sorting before the subsequent parts of the redistribute step. While there are single-core sorting algorithms that achieve time complexity, these algorithms do not satisfy our need for deterministic runtime and storage. However, it is possible to nearly sort an array of integers on a single core with a deterministic and data-independent approach with time complexity.
Algorithm 8 in the appendix, which we have called Sequential Nearly Sort (SNS), declares two iterators which respectively point at the first and the last element of the output. Step by step, the th element is considered and, if its value is higher than zero, the particle is copied to the right end of the output array. If not, it gets copied to the left end. The output will then be an ascending Nearly-Sorted sequence. SNS requires iterations of the for loop, which means that it achieves time complexity, or if we consider that each core owns elements.
SNS is, therefore, a very good alternative to serial Bitonic Sort, Mergesort, Heapsort and Radix Sort. This is because it achieves low time complexity with a deterministic and data-independent runtime and space complexity.
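A possible single-core transcription of SNS (a sketch of Algorithm 8; we permute the particles together with their copy counts, and the variable names are ours):

```python
def sequential_nearly_sort(particles, ncopies):
    """Sequential Nearly Sort (SNS): a single O(N) pass with two write
    positions. Particles with a positive copy count are written from
    the right end of the output and particles with a zero count from
    the left end, yielding an ascending Nearly-Sorted sequence (zeros
    confined to the front, positives to the back)."""
    n = len(ncopies)
    out_c = [0] * n
    out_x = [None] * n
    left, right = 0, n - 1
    for i in range(n):
        if ncopies[i] > 0:
            out_c[right], out_x[right] = ncopies[i], particles[i]
            right -= 1
        else:
            out_c[left], out_x[left] = ncopies[i], particles[i]
            left += 1
    return out_x, out_c
```

Every iteration performs exactly one write, so both the runtime and the memory footprint are independent of the data, which is the property that full sorting algorithms such as Quicksort fail to provide here.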
4.3 Parallel Nearly Sort
We want SNS to be used as part of a parallel algorithm which generates a Nearly-Sorted sequence from a random input sequence. We now discuss how to achieve this.
Definition 2
Let be a sequence of elements. It is called a Nearly-Bitonic sequence when it is possible to find an index which splits into two monotonic Nearly-Sorted sequences.
One could use SNS and the same sorting network as Bitonic Sort to first divide the input into a series of Nearly-Bitonic sequences and then recursively merge the sequences together until we generate a monotonic Nearly-Sorted sequence at the last step.
We need to adapt Bitonic Merge such that it processes a Nearly-Bitonic sequence and returns a monotonic Nearly-Sorted sequence. We call this algorithm Nearly Merge. Stage by stage, one core with MPI rank is coupled with another core with MPI rank . The assumption is that each core owns a Nearly-Sorted sequence of keys such that the combination of both is necessarily a Nearly-Bitonic sequence. Stage by stage, the cores call MPI_Sendrecv to exchange their local data. Then they consume a complementary subset of elements. Depending on the direction of the arrow in the sorting network (see again Figure 2), one core will start consuming the zeros first and then the positive elements, while the other core will do the opposite. This way, the zeros will be confined to one end of the output array, separated from the positive elements.
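The net effect of one Nearly Merge step can be sketched on a single process as follows. This sketch reproduces only the output invariant (zeros confined to one end, each "core" keeping the same number of keys), not the pairwise MPI_Sendrecv exchange and complementary consumption order themselves:

```python
def nearly_merge(a, b):
    """Single-process sketch of Nearly Merge: given two Nearly-Sorted
    halves whose concatenation is Nearly-Bitonic, produce one ascending
    Nearly-Sorted sequence of the same total length, split back into
    the two halves that each core would keep."""
    n = len(a)
    both = a + b
    zeros = [c for c in both if c == 0]
    positives = [c for c in both if c > 0]
    merged = zeros + positives   # ascending Nearly-Sorted output
    return merged[:n], merged[n:]
```

In the MPI version, no core ever holds more than its own keys: each pair exchanges data once and each core writes exactly its share of the output, which is why the step costs only linear local work per stage.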
Figure 3 illustrates a possible example of Nearly Merge where each core owns keys; the positive elements are padded with Xs for brevity. By extension, each core owns exactly particles and performs the same amount of writes to memory. Therefore, Nearly Merge achieves time complexity just as SNS does. We can infer that the overall time complexity for Nearly Sort is equal to

(12)
This algorithm has asymptotically the same time complexity as Bitonic Sort when , but the time complexity of the serial part is improved by a factor of . Therefore, we expect this algorithm to outperform both Bitonic Sort and a Bitonic Sorter with Mergesort performed locally. By extension, if we exchange Bitonic Sort with Nearly Sort in BR, we expect better performance. Algorithm 9 in the appendix describes the new routine. A possible example for and is shown in Figure 4. From now on, we refer to this algorithm as Nearly Sort Based Redistribute (NR).
In SMC methods, the array of copy counts is the array to (nearly) sort and each key is necessarily coupled to its particle. (12) must then be extended to:
(13) 
4.4 Single SMC Sampler vs Single Metropolis-Hastings
Metropolis-Hastings and the SMC Sampler both perform sampling from a target distribution and both provide an accurate result for a sufficiently large number of iterations. However, the details of the two approaches differ substantially. As we have discussed in Section 3, Metropolis-Hastings uses a Markov chain to generate each sample one by one based on the history of the previous samples. This approach makes a single instance of Metropolis-Hastings hard to parallelise in a problem-agnostic way. On the other hand, the SMC Sampler is a population-based algorithm where all samples are processed independently and concurrently during each iteration.
Now let the total simulation time for each algorithm be fixed to seconds. After seconds, Metropolis-Hastings and the SMC Sampler will have performed and iterations respectively and provide a solution with a certain root mean squared error. Since a single Metropolis-Hastings is hard to parallelise, we cannot increase its accuracy without running the simulation for longer than seconds. However, a single SMC Sampler can improve its throughput or accuracy by taking advantage of its inherent parallelism. In an ideal world, cores can perform iterations in seconds, but they can also, and most importantly, run iterations in seconds. This means they can achieve better accuracy with the same runtime. By extension, cores can ideally run iterations in seconds but will achieve much better accuracy than a single core is capable of.
The main goal of this paper is to prove that a multi-core MPI SMC Sampler can be more accurate over the same runtime than Metropolis-Hastings when they sample from the same target distribution and use the same proposal. A more exhaustive explanation with experimental results is provided in Section 5.5.1.
5 Case Studies and results
In this section, we briefly describe the experiments we make and we analyse the results. Table 1 provides details about Barkla and Chadwick, the two platforms we use for the described experiments. Barkla is the preferred cluster for the majority of the experiments as it can provide more resources.
5.1 Bottleneck
To evaluate the improvements in the bottleneck, we first focus on the sorting phase. Then we compare NR, BR and CR in this first experiment for brevity.
5.1.1 Sorting
As we outlined in Section 4, Bitonic Sort is the slowest task in BR. In this experiment, we compare three different deterministic sorting/nearly sorting algorithms: Bitonic Sort (BS), Bitonic Sort with Mergesort performed locally (BS+MS) and Nearly Sort (NS). These algorithms are compared by passing the same two arrays as input: the copy counts and the particles. The first represents the array of the numbers of copies and hence is an array of integers; it is generated randomly according to (9) by using a Gaussian random generator followed by MVR. The second represents an array of single-dimension particles and hence is an array of floating-point numbers generated by a Gaussian random generator.
The experiments have been run on Barkla for particles and increasing numbers of cores . Both and must necessarily be powers of two, due to the constraints of Bitonic Sort and Nearly Sort. Each experiment has been run times and we report the median of the sampled runtimes vs the number of cores.
As we can see in Figure 6, NS does not scale for up to cores. This trend might seem odd but it can be explained by analysing the time complexity of NS described by (12). Figure 4(a) describes the qualitative trend of (10), (11) and (12). When , the quasi-linear term in (12) is equal to . However, for , the quasi-linear term offsets the improvement associated with the linear term. Theoretically, the runtime should have positive speedups for . In the measured values for , this happens when or cores, depending on . We associate this slight discrepancy with the additional communication cost associated with larger numbers of particles. However, the most important result is that NS is significantly faster than the other algorithms, especially BS, for a low number of cores. Then, when increases, the performance of the two algorithms becomes closer because the time complexity of both is asymptotically equal to , as underlined in Section 4. These results suggest that using Bitonic Sort or Nearly Sort results in similar runtimes for but using Nearly Sort may lead to significant improvements for . This means that the crossing point with respect to the runtime of CR may be shifted to the left side of the graph, relative to the results in [10].
The results for keys show that BS and BS+MS stop scaling for a very low value of . The reasons behind this result required further investigation. For a very low number of keys, the granularity of the pipeline is already fine. In other words, the computation cost is already comparable with the communication cost and using more cores does not provide any scalability. NS is also affected by the same problem. In addition, the time complexity of Nearly Sort is necessarily higher than for cores. For these two reasons, NS always returns negative speedups.
5.1.2 BR vs NR vs CR
In this experiment, we use exactly the same strategy described in the previous section, since the required input for NR, BR and CR is the same. The results are shown in Figure 6. The results for redistribute with BS+MS are left out for brevity.
As we expected from theory, for NR is better than BR overall and much faster for a small number of cores. However, the most important result is that NR outperforms CR at the theoretical minimum (which is ) for some values of the dataset size . This suggests that, as long as we use a parallel redistribute whose time complexity is equal to , we cannot outperform CR for nor have positive speedup for the same values of . In order to achieve this goal, a new algorithm with time complexity is needed. Sorting networks which achieve the theoretical lower bound have been proposed in [18], [19], which improve the original AKS sorting network presented in [20]. These networks can also be rearranged to perform redistribute by substituting the comparators with balancers. However, they cannot be practically used because each atomic step requires a huge constant time . The exact value of this constant is unknown, as it also depends on the network parameters, but it seems to be in the order of thousands (e.g. in the best configuration in [18]). It can then be inferred that they cannot outperform sorting networks such as Bitonic Sort for any practical . In [21], it has been estimated that a hypothetical would be required to make AKS-like sorting networks faster than Bitonic Sort. Therefore, the infeasibility of this class of algorithms makes redistribute on distributed memory systems a practical lower bound (although it cannot yet be considered a theoretical minimum).
For , NR does not scale and does not outperform CR either. As we outlined in the previous section, for low values of the granularity is already too fine to observe any speedup.
5.2 Worst-case speedup
In this section, we analyse three case studies, two for the Particle Filter and one example of an SMC Sampler. Depending on the chosen redistribute (NR, BR or CR), we use the acronyms NPF/BPF/CPF for the Particle Filter, and NSMCS/BSMCS/CSMCS for the SMC Sampler. We consider a worst-case scenario which occurs when resampling is needed at every iteration and the time taken to perform Importance Sampling is small relative to the time taken to perform resampling. This section aims to achieve two goals. The first one is to demonstrate that the historic progress made in developing parallel implementations of Particle Filters can be translated to develop a parallel implementation of an SMC Sampler. The second one is to estimate the worst-case speedup of our improved algorithm.
All experiments in this section are run for the same values of as were considered in the previous section. Each runtime is once again the median of runs, each representing a simulation of iterations in the worst-case scenario for an increasing number of cores .
5.2.1 Particle Filter on Econometrics
In this experiment, we use Barkla and compare NPF, BPF and CPF. We apply these algorithms to a stochastic volatility model which describes the evolution of the pound-to-dollar exchange rate between the 1st of October 1980 and the 28th of June 1985. This model has been used in [22] to demonstrate the utility of advanced SMC methods, such as Block Sampling Particle Filters, over SIR Particle Filters.
(14a)  
(14b) 
(14a) and (14b) represent the model where the coefficients , , (as selected in [22]) and the noise terms for the state and the measurement are and . The initial state is sampled as . The particles are initially drawn from the prior distribution. The importance weights are simply equal to the likelihood since the dynamic model is used as the proposal.
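A model of this kind can be sketched in Python as below, assuming the standard stochastic-volatility form x_t = a*x_{t-1} + sigma*v_t, y_t = beta*exp(x_t/2)*w_t; the coefficient values used here are illustrative placeholders, not necessarily those selected in [22], and the function names are ours.

```python
import numpy as np

def simulate_sv(T, a=0.91, sigma=1.0, beta=0.5, rng=None):
    """Simulate T steps of a standard stochastic volatility model:
    x_t = a*x_{t-1} + sigma*v_t,  y_t = beta*exp(x_t/2)*w_t,
    with v_t, w_t ~ N(0, 1).  Coefficients are illustrative only."""
    rng = np.random.default_rng(rng)
    x = np.empty(T)
    y = np.empty(T)
    x_prev = rng.normal(0.0, sigma / np.sqrt(1 - a**2))  # stationary prior
    for t in range(T):
        x_prev = a * x_prev + sigma * rng.normal()
        x[t] = x_prev
        y[t] = beta * np.exp(x_prev / 2) * rng.normal()
    return x, y

def log_likelihood(y_t, x, beta=0.5):
    """log p(y_t | x) for each particle x.  When the dynamic model is
    used as the proposal, this is exactly the importance weight."""
    var = (beta * np.exp(x / 2)) ** 2
    return -0.5 * (np.log(2 * np.pi * var) + y_t**2 / var)

x_true, y_obs = simulate_sv(100, rng=0)
ll = log_likelihood(y_obs[0], x_true)
```

Because the proposal equals the dynamic model, the proposal density cancels in the weight update and only the likelihood term above remains, as stated in the text.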
5.2.2 Particle Filter on Bearing-Only Tracking
This experiment is focused on showing the performance of NPF, BPF and CPF applied to a non-scalar model. The chosen example is the popular four-dimensional state model for a Bearing-Only Tracking problem, where the state is represented by the position and velocity of the tracked object. Both position and velocity are two-dimensional physical quantities. This model was previously presented in several publications, such as in [7], and used in [22] to test the Block Sampling SIR Filter. In accordance with [22], we consider the state to be composed of four elements denoted such that where , are position and , are velocity.
The model is defined as follows:
(15a)  
(15b) 
where the state transition matrix and the covariance are
The noise terms are and . The initial state
has the identity matrix as covariance and mean equal to the true initial simulated point of the system. The parameter
represents the sampling period and is set to s.
5.2.3 Sampling using SMC Samplers
We apply NSMCS, BSMCS and CSMCS to sample from a static single-dimensional () Student's t posterior distribution calculated as:
(16) 
where and correspond to the degrees of freedom and the mean value respectively.
In the experiment, the particles, as samples of , at the th iteration are generated using a random walk as the proposal distribution, . The backward kernel is naively selected to emulate MCMC such that . We then anticipate that a better choice of can positively impact our estimate. The recycling described in Section 3.1.2 is also used.
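A minimal sketch of this sampler is given below, under our assumptions: a Student's t target (degrees of freedom and mean are illustrative placeholders for the elided values), a symmetric Gaussian random-walk proposal, and the backward kernel chosen equal to the proposal, so the incremental weight reduces to the target ratio pi(x_new)/pi(x_old). The initial weights are taken uniform for simplicity, and recycling is omitted.

```python
import numpy as np

def log_target(x, nu=5.0, mu=0.0):
    """Log Student's-t density up to a constant; nu (degrees of freedom)
    and mu (mean) are illustrative placeholders."""
    return -0.5 * (nu + 1) * np.log1p((x - mu) ** 2 / nu)

def smc_sampler(N=500, T=50, step=1.0, rng=None):
    """Basic SMC sampler: symmetric random-walk proposal with the
    backward kernel L equal to the proposal, so each incremental
    weight is pi(x_new)/pi(x_old)."""
    rng = np.random.default_rng(rng)
    x = rng.normal(0.0, 5.0, N)        # initial samples (treated as target draws)
    logw = np.full(N, -np.log(N))
    for _ in range(T):
        x_new = x + step * rng.normal(size=N)
        logw += log_target(x_new) - log_target(x)
        x = x_new
        # resample when the effective sample size degenerates
        w = np.exp(logw - logw.max()); w /= w.sum()
        if 1.0 / np.sum(w**2) < N / 2:
            x = x[rng.choice(N, size=N, p=w)]
            logw = np.full(N, -np.log(N))
    w = np.exp(logw - logw.max()); w /= w.sum()
    return np.sum(w * x)               # weighted estimate of the target mean

estimate = smc_sampler(rng=1)
```

In the worst-case setting studied below, the resampling branch fires at every iteration rather than only when the effective sample size collapses.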
5.2.4 Worst-case results
In this section we provide the results of the experiments described in Sections 5.2.1, 5.2.2 and 5.2.3 which are shown in Figure 7 for respectively. Since NR, BR and CR have the same baseline, we show the speedups instead of the runtimes as, in this case, these speedups can also prove which algorithm is faster.
As we can see, for NPF/NSMCS does not outperform CPF/CSMCS. The reasons are the granularity and the theoretical time complexity of NR as we explained in Section 5.1. In contrast, CPF/CSMCS can keep scaling for a relatively low , until redistribute emerges as the bottleneck. However, since modern applications need a large number of particles ( is just ), we are not discouraged by these limitations.
For , computation becomes dominant over communication and both NPF/NSMCS and BPF/BSMCS can scale for much larger values of and can both outperform CPF/CSMCS. For this range of , using Nearly Sort instead of Bitonic Sort makes the SMC method faster for any number of cores and up to twice as fast as using Bitonic Sort for low values of . The gap between these two approaches also increases with . NPF/NSMCS runs faster than the solution with CR in the range and it can be as much as approximately times faster for cores.
Overall, we can say that for a fixed , changing the model in the Particle Filter (i.e. whether we consider stochastic volatility or BearingOnly Tracking) or switching from Particle Filters to SMC Samplers gives roughly the same trend. This proves that the improvements that have been demonstrated in the context of Particle Filters can directly be translated to the context of SMC Samplers.
As we can see, the minimum worst-case speedup is about and occurs in the context of the Bearing-Only Tracking. On the other hand, the maximum worst-case speedup can be up to : this occurs in the context of the SMC Sampler. The efficiency of NSMCS with respect to the maximum speedup is indeed significantly higher than for the Particle Filter. This is due to the different way particles and weights are calculated in the SMC Sampler. For example, (16) is more computationally intensive than (14a), and the likelihood calculation in the SMC Sampler is more computationally demanding than the likelihood in the Particle Filters (this is because of the need to compute the kernel). Therefore, resampling accounts for a lower percentage of the entire workload in the SMC Sampler than it does in the Particle Filter. The resampling step is no longer such a significant bottleneck for a low number of cores. This is discussed in more detail in the next section.
5.3 Maximum speedup
All the previous experiments use relatively simple proposal distributions and likelihoods. However, in real problems, these two tasks are likely to be much more complicated (e.g. they may involve nonlinear systems or Partial Differential Equations). In the next experiment, we investigate the impact that a more computationally intensive Importance Sampling step has on the maximum speedup.
In order to simulate this scenario, we adjust the experiment described in Section 5.2.2 by using sensors spread over the Cartesian plane. This practice is also common in real applications to make the estimate more accurate (since the triangulation observability criterion is satisfied [23]). The measurement is now a multidimensional vector with one bearing per sensor:
(17) 
where is the position of the th sensor with respect to the target. The state equation remains the same as described in Section 5.2.2, such that is unchanged. We consider . The maximum speedup efficiency for each is estimated as the speedup for vs the ideal speedup for the same . We increase until we achieve at least efficiency. For each we also report the percentage of the total workload that Importance Sampling accounts for when the runtime of redistribute is at its peak, i.e. for .
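A sketch of such a multi-sensor measurement model follows; the sensor layout, the noise level and the assumption that the first and third state elements hold the position (the paper's exact ordering is elided) are all ours.

```python
import numpy as np

def bearings(state, sensors, noise_std=0.005, rng=None):
    """One noisy bearing per sensor, in the spirit of (17).
    state is the four-dimensional vector of Section 5.2.2; here we
    assume elements 0 and 2 are the position (an assumption)."""
    rng = np.random.default_rng(rng)
    px, py = state[0], state[2]
    z = np.arctan2(py - sensors[:, 1], px - sensors[:, 0])
    return z + noise_std * rng.normal(size=len(sensors))

# Illustrative layout: 4 sensors spread over the Cartesian plane
sensors = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
z = bearings(np.array([5.0, 1.0, 5.0, 1.0]), sensors, rng=0)
```

Each extra sensor adds one likelihood evaluation per particle, which is precisely how this experiment inflates the cost of the Importance Sampling step.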
As we can observe in Figure 10, a more computationally intensive Importance Sampling step leads to higher speedups. The speedup for is indeed about times the speedup for (which corresponds to the experiment in Section 5.2.2). Therefore, in these problems, the bottleneck for a low number of cores is likely to be the Importance Sampling step and not resampling. However, when , Importance Sampling has time complexity while resampling has complexity of . In other words, since resampling always emerges as the bottleneck for a sufficiently high level of parallelism, it is crucial to use a parallelisable redistribute such that we can achieve nearlinear speedups for higher .
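This trade-off can be illustrated with a toy Amdahl-style cost model (ours, not the paper's measured costs): Importance Sampling parallelises perfectly, while the parallel resampling/redistribute pays an extra polylogarithmic factor in the number of cores.

```python
import math

def speedup(P, N, c_is, c_res):
    """Illustrative speedup model for P cores and N particles:
    Importance Sampling costs c_is per particle and is embarrassingly
    parallel; resampling costs c_res per particle and, in parallel,
    pays an additional (1 + log2(P)^2) polylog factor (an assumption)."""
    serial = c_is * N + c_res * N
    parallel = c_is * N / P + c_res * (N / P) * (1 + math.log2(P) ** 2)
    return serial / parallel

# A heavier Importance Sampling step (larger c_is) yields a higher speedup
s_light = speedup(64, 2**20, c_is=1.0, c_res=1.0)
s_heavy = speedup(64, 2**20, c_is=100.0, c_res=1.0)
```

As in the text, once P grows large the polylog resampling term dominates, so a parallelisable redistribute remains essential even when Importance Sampling dominates at low core counts.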
5.4 Space Complexity
NR and BR both have scalable space complexity equal to . However, CR has constant space complexity equal to [10]: one core is in charge of collecting the particles, performing the routine locally and then distributing the new population back to the other cores.
The main side effect is that when the available memory in each node is insufficient to store all the necessary data for , we cannot run CR for any . In contrast, even for very large values of , we can always run NR or BR as long as each node has enough memory for its data. To demonstrate this problem, we repeat the experiment described in Section 5.2.2 on Chadwick (which has GB of memory in each node, i.e. less than Barkla's GB per node). Figure 10 shows the measured runtimes for (speedups are not provided since the baseline is impossible to run due to space complexity limitations). The results for are left out for brevity since they resemble the results in Figure 7. The total absence of a curve for CPF, or the missing points for NPF, occurs because of an mpirun abort (which happens when we request more memory than the node has). As we can see, we need at least cores to run NPF, while it is never possible to run CPF with .
We can conclude that NPF/NSMCS outperforms BPF/BSMCS for any and outperforms CPF/CSMCS for cores. Furthermore, it is always possible to run NPF/NSMCS while CPF/CSMCS may be impossible to run for high .
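The memory argument can be made concrete with a back-of-the-envelope sketch (the helper name and byte sizes are illustrative assumptions, not the paper's figures): CR must gather all N particles on one node, so N is capped by a single node's RAM regardless of P, while NR/BR hold only N/P particles per node.

```python
def max_particles(ram_bytes, bytes_per_particle, P, centralised):
    """Largest particle count that fits in memory.  CR (centralised)
    is capped by one node's RAM; NR/BR scale with the number of nodes.
    Real runs also need working buffers, so treat these as upper bounds."""
    per_node = ram_bytes // bytes_per_particle
    return per_node if centralised else per_node * P

# e.g. 64 GB nodes and 32-byte particles (a four-dimensional double state)
cap_cr = max_particles(64 * 2**30, 32, P=16, centralised=True)
cap_nr = max_particles(64 * 2**30, 32, P=16, centralised=False)
```

Under these assumptions the decentralised schemes admit a population P times larger than CR, matching the mpirun aborts observed for CPF above.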
5.5 SMC Sampler vs Metropolis-Hastings
5.5.1 Description
As we have anticipated in Section 4.4, this experiment aims to achieve two goals. We first want to prove that a core implementation of the SMC Sampler can achieve a lower runtime than a single-chain Metropolis-Hastings when both algorithms draw the same number of samples in total (see below). Then we want to prove that the extra speedup that cores provide can make an SMC Sampler more accurate than Metropolis-Hastings, since as increases an SMC Sampler can perform more iterations over the same time span.
The first part of the experiment is done by comparing the runtime of both algorithms for the same workload such that:
(18) 
To investigate the second issue, we primarily need to know the inter-task speedup which cores can provide, keeping fixed. We estimate from the first part of the experiment. Then we compare the algorithms in terms of accuracy over the same computational time, which happens when:
(19) 
In other words, the SMC Sampler will run for times more iterations (or less if ) over the same runtime.
The number of samples, , and the number of cores, , are the same as in the previous experiments and we will use . is constant, independent of , and always picked using (19) for . The SMC Sampler is once again assessed in the worst-case setting where resampling is needed at every iteration.
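Under our reading of (18) and (19) (with N samples per iteration, T SMC iterations and inter-task speedup s; the function name is hypothetical), the two matching rules can be sketched as:

```python
def matched_iterations(N, T_smcs, inter_task_speedup):
    """(18): equal workload means Metropolis-Hastings draws N * T
    samples in total.  (19): equal wall time means the SMC Sampler can
    afford roughly `inter_task_speedup` times more iterations than in
    the matched-workload setting."""
    mh_samples = N * T_smcs                            # equal workload
    T_equal_time = round(inter_task_speedup * T_smcs)  # equal runtime
    return mh_samples, T_equal_time

mh_samples, T_equal_time = matched_iterations(1024, 100, 3.0)
```

This is why a speedup below one (a slower single-core SMC Sampler) translates directly into fewer, not more, iterations in the equal-time comparison.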
5.5.2 Results
Figure 10 shows the inter-task speedup between the SMC Sampler and Metropolis-Hastings for the same workload (see (18)), after having set . We calculate that a single-core implementation of the SMC Sampler is slightly slower (typically by ) than Metropolis-Hastings. This means that an SMC Sampler running on a cluster of nodes could be much faster than Metropolis-Hastings. As we can see, for high values of (which, as we have seen, lead to larger speedups) the SMC Sampler can be up to times faster than Metropolis-Hastings.
In the second part of the experiment we compare the accuracy (expressed as Root Mean Squared Error (RMSE)) for high speedups when both algorithms run for the same time span (see (19)). To make the runtimes equivalent we set to the values shown in Figure 10. Figure 11 shows the accuracy ratios between the algorithms vs for increasing or . The baseline (and numerator of the accuracy ratio) is the accuracy of Metropolis-Hastings (see Table 2). As we can see, for low values of , the SMC Sampler does not outperform Metropolis-Hastings as the gap for is initially too big. However, while the SMC Sampler is less accurate, the relative benefit of using Metropolis-Hastings reduces as increases. Combinations of bigger values for or lead to comparable gains in accuracy when is maximised (see pairs: , ; , ; , ). In the end, for even higher values of or , the initial gap at the baseline is lower such that, when increases sufficiently, the SMC Sampler finally outperforms Metropolis-Hastings. Figure 11c proves that the RMSE of the SMC Sampler can be up to approximately times lower. If we consider that the standard deviation for Metropolis-Hastings scales as , we can infer that the ideal improvement in accuracy for cores would be:
(20)
This would occur only if we could trivially parallelise a single-chain Metropolis-Hastings and observe linear speedup, which means that the proposed MPI SMC Sampler already achieves approximately efficiency with respect to the ideal scenario. This is an encouraging finding, especially considering that we have used a simple SMC Sampler. We anticipate that further improvements in accuracy would result from using a more sophisticated L-kernel, better recycling, novel proposal distributions or alternative resampling implementations [24]. The results for are not shown because, for high values of , the runtimes of Metropolis-Hastings and the SMC Sampler for low values of exceed the simulation time limit on both clusters (which is set to days).
Name | Barkla | Chadwick
OS | CentOS Linux 7 | RHEL 6.10 (Santiago)
Number of nodes | 16 | 8
Cores per node | 40 | 16
CPU | 2x Xeon Gold 6138 | 2x Xeon E5-2660
RAM per node | 384 GB | 64 GB
MPI version | OpenMPI 1.10.1 | OpenMPI 1.5.3
Max time per job | 72 hours | 72 hours
6 Conclusions
In this paper, we have shown that a parallel implementation of the SMC Sampler on distributed memory architectures is an advantageous alternative to Metropolis-Hastings as it can be up to times faster over the same workload, and up to times more accurate over the same runtime for cores.
To get to this position, we have made several advances. An MPI implementation of the SMC Sampler has previously been unavailable but we have proven that it can be produced by porting the key components of the Particle Filter. There exist several alternative algorithms to perform the common bottleneck, redistribute, including a state-of-the-art parallel algorithm and a textbook non-parallelisable implementation. In this paper, we have optimised the parallel algorithm and proven it can outperform the current approach for any number of cores and be up to times faster than the textbook implementation for a sufficiently high degree of parallelism. In addition, we have demonstrated the infeasibility of the non-parallelisable algorithm for large numbers of particles.
The proposed algorithm for cores is times as fast as its serial configuration in the worst-case scenario, which occurs when resampling (and redistribute) is needed at every step and, most importantly, when the model is unrealistically simple and hence Importance Sampling has a very fast constant time. More realistic models have highly computationally intensive proposal distributions or likelihoods. Under these realistic conditions, we have shown that the overall speedup increases with the workload of Importance Sampling and the maximum recorded speedup is about for cores.
A key observation we can make is that the SMC Sampler version we have used is a basic reference version, as the L-kernel is equal to the proposal distribution, which is Gaussian; better recycling and resampling are yet to be explored. This means that there is still significant scope for future improvements. A combination of intelligent recycling, a more sophisticated L-kernel, an improved proposal distribution and better resampling may have major impacts on the accuracy.
Another improvement avenue is to speed up the runtime which, as we have seen, can indirectly improve the accuracy too. One possible way to achieve this goal is to investigate the benefits of mixing shared memory and distributed memory architectures. OpenMP is the most common programming model for shared memory architectures, and including OpenMP algorithms within MPI is a routine approach in the high performance computing domain. Data locality may also lead to alternative and more efficient ways of implementing redistribute. A second avenue that may lead to further speedup is to exploit the additional computational power of the GPU within each machine.
Future work will focus on implementing all these improvements and comparing the resulting SMC Sampler with better MCMC methods than Metropolis-Hastings. These comparisons must necessarily be made in both the single-chain and multiple-chain contexts.
Acknowledgments
This work was supported by the UK EPSRC Doctoral Training Award and by Schlumberger. The following algorithms summarise the routines which are described in detail in Sections 3 and 4 and compared in Section 5 of the main paper. In these algorithms, we use the same notation as the main paper: arrays and matrices are in bold while scalars, such as some input parameters and single elements of one-dimensional arrays, are written in italic font.
SMC methods and MetropolisHastings
Resampling and Redistribute
Algorithm 4 depicts the chosen resampling step for our implementation of the Particle Filter and the SMC Sampler. The three considered MPI implementations of the constituent redistribute step are described by Algorithms 6, 7 and 9. These routines make use of Algorithm 5 to redistribute within each core once the workload is balanced (or centralised, as in Algorithm 6). Algorithm 8 explains the single-core Nearly Sort that is used for the MPI Nearly Sort in Algorithm 9, which replaces the MPI Bitonic Sort in Algorithm 7.
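A common sequential choice for the resampling step is systematic (minimum-variance) resampling; the sketch below is ours and is not necessarily identical to Algorithm 4, but it shows the copy counts that the redistribute step then consumes.

```python
import numpy as np

def systematic_resample(w, rng=None):
    """Systematic resampling: draw N stratified positions from one
    uniform offset and count how many land in each particle's slice of
    the weight CDF.  Returns integer copy counts summing to N."""
    rng = np.random.default_rng(rng)
    N = len(w)
    positions = (rng.random() + np.arange(N)) / N
    cdf = np.cumsum(w / np.sum(w))
    idx = np.searchsorted(cdf, positions)
    return np.bincount(idx, minlength=N)

w = np.array([0.1, 0.2, 0.3, 0.4])
ncopies = systematic_resample(w, rng=0)
```

Only one random number is drawn for the whole population, which keeps the variance low and, importantly for the MPI setting, makes the copy counts cheap to compute before the expensive redistribute begins.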