Acceleration of expensive computations in Bayesian statistics using vector operations

02/25/2019
by   David J. Warne, et al.

Many applications in Bayesian statistics are extremely computationally intensive. However, they are also often inherently parallel, making them prime targets for modern massively parallel central processing unit (CPU) architectures. While the use of multi-core and distributed computing is widely applied in the Bayesian community, very little attention has been given to fine-grain parallelisation using single instruction multiple data (SIMD) operations that are available on most modern commodity CPUs. Rather, most fine-grain tuning in the literature has centred around general purpose graphics processing units (GPGPUs). Since the effective utilisation of GPGPUs typically requires specialised programming languages, such technologies are not ideal for the wider Bayesian community. In this work, we practically demonstrate, using standard programming libraries, the utility of the SIMD approach for several topical Bayesian applications. In particular, we consider sampling of the prior predictive distribution for approximate Bayesian computation (ABC), the computation of Bayesian p-values for testing prior weak informativeness, and inference on a computationally challenging econometrics model. Through minor code alterations, we show that SIMD operations can improve the floating point arithmetic performance, resulting in up to 6× improvement in the overall serial algorithm performance. Furthermore, 4-way parallel versions can lead to almost 19× improvement over a naïve serial implementation. We illustrate the potential of SIMD operations for accelerating Bayesian computations and provide the reader with essential implementation techniques required to exploit modern massively parallel processing environments using standard software development tools.


1 Introduction

The study of the posterior distribution is essential in applied Bayesian statistics. However, most practical applications are computationally challenging since the posterior probability density function (PDF) is only known up to a normalising constant. That is, given data, y, a theoretical stochastic model with parameter vector, θ, from some parameter space, Θ, a prior PDF, p(θ), and likelihood function, p(y | θ), the posterior PDF is

    p(θ | y) = p(y | θ) p(θ) / p(y),    (1)

where the evidence, p(y) = ∫_Θ p(y | θ) p(θ) dθ, is typically analytically intractable. Therefore, advanced Monte Carlo schemes are often required. The difficulty is compounded for applications with computationally expensive or intractable likelihoods, which are increasingly commonplace (Sisson et al., 2018). Of course, sampling of challenging posterior distributions is not the only computationally expensive problem in Bayesian statistics. For example, in some analyses that require a level of objectivity, it may be important for priors to be weakly informative and selected from a family of priors {p(θ | λ) : λ ∈ Λ} with hyperparameter, λ (Evans and Jang, 2011; Gelman, 2006; Nott et al., 2018).

While computational methods in Bayesian statistics continue to improve and evolve (Green et al., 2015), the computational performance of modern computer hardware is also increasing. Most computer processor improvements over the last decade are the result of increased capacity of parallel processing. Consideration of the implementation details for Monte Carlo algorithms is therefore essential if one is to fully exploit modern computational resources. The optimisation of codes for efficient use of hardware acceleration has become a standard technique to improve the performance of typical high performance computing (HPC) workloads.

Many Monte Carlo schemes lend themselves to parallel computing architectures. For example, likelihood-free methods, such as approximate Bayesian computation (ABC; Sisson et al. (2018)), typically require generating a large number of independent simulations from the prior predictive distribution. However, while direct Monte Carlo schemes may trivially be executed in parallel, other more sophisticated samplers, such as Markov chain Monte Carlo (MCMC) (e.g. Green et al., 2015; Marjoram et al., 2003), sequential Monte Carlo (SMC) (e.g. Del Moral et al., 2006; Sisson et al., 2007) and multilevel Monte Carlo (MLMC) (e.g. Beskos et al., 2017; Dodwell et al., 2015; Jasra et al., 2019; Warne et al., 2018), require more effort to parallelise efficiently (Mahani and Sharabiani, 2015; Murray et al., 2016) since there are synchronisation and communication requirements. In many cases, these overheads can inhibit the scalability of these algorithms across many central processing unit (CPU) cores.

General purpose graphics processing units (GPGPUs) have been demonstrated to be highly effective at accelerating advanced Monte Carlo schemes (Klingbeil et al., 2011; Lee et al., 2010). GPGPUs make heavy use of single instruction multiple data (SIMD) operations (van der Pas et al., 2017), that is, numerical operations that act on a vector of data in an element-wise fashion. Such SIMD operations are also widely available in modern CPUs, which now contain multiple vector processing units (VPUs). By exploiting the VPUs, CPU-based implementations can achieve performance boosts comparable with GPGPU implementations (Hurn et al., 2016; Lee et al., 2010; Mudalige et al., 2016). A practical understanding of how CPU vector operations function and how to access them in software is, therefore, highly relevant to applied Bayesian analysis. In this article, we focus on the code structures and algorithmic techniques that applied computational statisticians may utilise to take advantage of VPU technology that is available in the majority of commodity CPUs today. We also provide example code written in the C programming language as supplementary material (example code is available from GitHub: https://github.com/davidwarne/Bayesian_SIMD_examples).

1.1 Parallel computing paradigms

For more than a decade, the fundamental speed of CPUs, in terms of clock frequency, has not changed significantly (Ross, 2008); modern CPUs are typically clocked at only a few gigahertz. Therefore, to increase computational throughput, CPU architectures are now designed to perform multiple tasks in parallel. For computationally intensive tasks, such as those frequently encountered in Bayesian statistics, understanding how to exploit parallel computing architectures is essential.

There are several paradigms of parallelism, all of which may be relevant to the same computational task. The most fundamental paradigms are distributed computing, multithreading, vectorisation, and pipelining (Trobec et al., 2018; van der Pas et al., 2017). Though the combination of multithreading and vectorisation is the focus of this work, we briefly introduce all these paradigms for context.

Distributed computing: Computational tasks are shared across multiple computers, each with their own CPU and memory, that are connected via a high-speed network. This form of parallelism is the basis of HPC and high throughput computing (HTC).

Multithreading: Programs are divided into several processing streams, called threads, which may be executed independently. While multithreading is an important software design concept even for single core CPUs, modern multicore CPUs allow threads to truly execute concurrently. Since threads are part of the same program, they share the same memory and variables.

Vectorisation: In many modern CPUs, each core may perform one or more vector instructions using the VPU. These allow a single operation to be applied to multiple input operands in the same clock cycle.

Pipelining: Many CPU instructions involve multiple stages and take multiple clock cycles to complete. Often, it is possible for processing of the next instruction to commence before the current one is complete. This is implemented in the CPU hardware and occurs automatically. However, it should be noted that some programming constructs may negatively affect this instruction pipeline, for example, a conditional statement. Some specialised devices, such as field programmable gate arrays (FPGAs) can also implement algorithm specific pipelining architectures.

Each of these parallel computing paradigms has a granularity, which refers to the communication-to-computation ratio of the parallel workload it supports. In this context, we refer to distributed computing and multithreading as coarse-grained, whereas vectorisation and pipelining are considered fine-grained.

The technology to achieve fine-grain parallelism has typically been accelerator co-processors, that is, specialised processors, typically packaged on an expansion card, to which the CPU offloads fine-grain parallel workloads. Unsurprisingly, most exemplars of fine-grain parallelism in the computational statistics literature are based on accelerators, such as GPGPUs (Lee et al., 2010; Terenin et al., 2018), Intel Xeon Phis (Hurn et al., 2016; Mahani and Sharabiani, 2015), and FPGAs (Fernandes et al., 2016; Zierke and Bakos, 2010).

The usage of accelerators is not practical for all practitioners of computational Bayesian statistics. As a result, we will not discuss them in any detail. Instead we highlight that the parallelisation strategies required to get the most out of these accelerators are also key to getting the most out of standard CPU hardware. In particular, the effective utilisation of vectorisation can boost the performance of CPU-based software.

1.2 Contribution

The aim of this article is to present the utility of modern CPU-based SIMD operations to accelerate computational Bayesian inference through practical demonstration. We present several topical case studies involving compute intensive tasks. Specifically, we investigate: (i) sampling of the prior predictive distributions for approximate Bayesian computation (ABC) (Sisson et al., 2018; Sunnåker et al., 2013); and (ii) computation of Bayesian p-values for prior weak informativity tests (Evans and Jang, 2011; Gelman, 2006; Nott et al., 2018). As we progress through each case study, we highlight the key algorithmic features that are appropriate for fine-grain and coarse-grain parallelism and demonstrate modifications to provide guidance to the practitioner. Given the scarcity of exemplars in statistics using SIMD operations, we believe a clear demonstration of additional performance benefits is a significant contribution to the computational statistics community.

We specifically focus on the additional performance boost available through SIMD operations on modern commodity CPU hardware. Therefore, we do not provide comparisons with dedicated accelerator co-processors, such as GPGPU, Intel Xeon Phi, and FPGA accelerators, as this has been covered in some detail in the HPC literature (Dongarra et al., 2014; Halyo et al., 2014; Teodoro et al., 2014; Wang et al., 2014; Warne et al., 2014; Zohouri et al., 2016). However, it should be noted that our methods are directly applicable to these technologies and, furthermore, our code will run on the in-socket variants of the Intel Xeon Phi without any modifications.

All of our code is written in the C programming language. We utilise the OpenMP standard to implement multithreading and vectorisation. Random number generators (RNGs) are provided by the Intel Math Kernel Library. Compiled binary applications are built using the Intel compiler, icc. All case study codes, compiler options, and inputs are provided within the supplementary material. While we provide stand-alone C implementations, it is important to note that our algorithm design and implementation guidelines are also completely relevant to the MATLAB C MEX and Rcpp interfaces.

The remainder of this article is organised as follows. In Section 2, we introduce the fundamental programming constructs for exploiting SIMD operations; where appropriate, example code is provided. In Sections 3 and 4, we present our case studies on prior predictive sampling for ABC and weak informativity tests, respectively. In each case study, we briefly present any essential background theory, highlight the opportunities for parallelisation and vectorisation, and demonstrate the steps taken to implement and optimise the computational problem. Performance improvement results are presented for two families of CPU architecture, namely Haswell and Skylake. Section 5 provides a discussion of the results and the strengths and weaknesses of SIMD operations from the perspective of applied Bayesian statisticians.

2 A guide to vectorisation

In this section, we provide some necessary CPU computing concepts that we draw upon in our case studies. We avoid discussing low-level technological details in favour of providing simple programming tools that may be directly applied to other Bayesian applications.

2.1 Parallel random numbers

Regardless of the computing architecture, a particularly important topic to consider when applying parallel computing techniques to statistical applications is parallel random number generation. Dealing with this poorly can lead to results that are statistically biased. We do not focus on specifics of this problem, as the topic is well documented (Bradley et al., 2011; Lee et al., 2010). However, we simply state the three main approaches:

  1. Generate all required randomness serially on the host processor and distribute among parallel processes;

  2. Use a random number generator that can be split into independent sub-streams, for example a generator that supports Skip Ahead or Leapfrog methods;

  3. Use a sequence of random number generators that are statistically independent for the same seed value.

While it may be common to simply use different seed values for the same RNG, caution is advised as this may not always be statistically valid. For example, it has been shown that sequences of linear congruential generators can become correlated if a linear sequence of seeds is used (Davies and Brooks, 2007).

In all of our work, we utilise the VSL_BRNG_MT2203 generator family, which is available in the Intel Math Kernel Library (MKL) and provides 6,024 statistically independent Mersenne Twister generators.
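As a minimal sketch of approach 3 above (assuming the MKL VSL interface and OpenMP; the seed value, block size, and output layout are arbitrary choices for illustration), each thread can be given its own member of the MT2203 family:

    #include <mkl_vsl.h>
    #include <omp.h>

    /* out must have room for (number of threads) * n_per_thread values. */
    void generate_per_thread_uniforms(double *out, int n_per_thread)
    {
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            VSLStreamStatePtr stream;
            /* Each thread uses a distinct member of the MT2203 family, so the
               streams are statistically independent for the same seed. */
            vslNewStream(&stream, VSL_BRNG_MT2203 + tid, 1337);
            /* Fill this thread's slice with uniform(0,1) variates in one call. */
            vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, n_per_thread,
                         out + (size_t)tid * n_per_thread, 0.0, 1.0);
            vslDeleteStream(&stream);
        }
    }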

2.2 SIMD operations

The traditional view of a CPU core is a purely sequential processor, performing exactly one operation each clock cycle. An operation might involve reading/writing data or performing an arithmetic calculation. With this assumption, the loop in Figure 1(a) will require four floating point addition operations within the loop body. See Tian et al. (2013) for a more technical discussion on SIMD operations.

Figure 1: VPUs execute SIMD operations. The body of the for loop in (a) can be implemented with four scalar additions, or alternatively as one vector addition as shown in (b). If the elements in arrays a, b, and c are single precision floating point numbers, then 128-bit vector registers are required; for double precision floating point numbers, 256-bit vector registers are required.
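The figure itself is not reproduced here; the following sketch (with the array names from the caption) shows the kind of loop that Figure 1(a) depicts, which a VPU can instead execute as a single vector addition as in Figure 1(b):

    void add4(const float *a, const float *b, float *c)
    {
        /* Scalar view (Figure 1(a)): four separate additions, one per iteration. */
        for (int i = 0; i < 4; i++)
            c[i] = a[i] + b[i];
        /* Vector view (Figure 1(b)): load a[0..3] and b[0..3] into 128-bit
           registers and produce c[0..3] with one vector addition. */
    }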

While modern CPU cores are still sequential (although they do not necessarily execute operations in the order defined by the user's code; CPUs re-order operations as needed to optimise pipelining, a feature called out-of-order execution that is not readily available on accelerator architectures), arithmetic operations may be scalar or vector. While Figure 1(a) would take four scalar additions, only a single vector addition may be required, representing a speedup factor of four.

Most modern CPUs have VPUs. The inputs to these units are wider than those of the scalar arithmetic logic units (ALUs). For example, a VPU may accept 256-bit vector inputs; this could store eight single precision (32-bit) floating point numbers or four double precision (64-bit) floating point numbers. The VPU performs element-wise arithmetic on the input vectors as shown in Figure 1(b).

Speedups attained through the use of vector operations are in addition to performance gains through multithreading. Using four threads and 256-bit vectors could perform up to sixteen times as many double precision operations per second as sequential, scalar operations.

2.3 Vectorisation and multithreading with OpenMP

Just as with multithreaded parallelism, there are many ways to access the VPUs in algorithm implementations. In this work, we focus on the use of OpenMP, an open standard for automated parallel computing. OpenMP allows developers to write standard C/C++ or FORTRAN code, and supports parallel computing through the insertion of directives (van der Pas et al., 2017), that is, statements that hint to the compiler that a section of code may benefit from threading or vectorisation. All of the lower-level, technical details are dealt with by the compiler.

Figure 2: Example loop vectorisation, which the compiler will attempt to transform into vector operations.
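A loop of the kind shown in Figure 2 would look something like the following sketch (the array names and loop bound are illustrative):

    void vec_add(double * restrict c, const double * restrict a,
                 const double * restrict b, int n)
    {
        /* Hint to the compiler that the iterations are independent and
           safe to re-formulate as vector operations. */
        #pragma omp simd
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }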

For example, placing the directive #pragma omp simd before a loop, as in Figure 2, tells the compiler to attempt to re-formulate the loop using vector operations. Similarly, the directive #pragma omp parallel tells the compiler to create a worker pool of threads for the next code block. From within a parallel block, using the directive #pragma omp for before a loop will cause the iterations to be distributed across the threads in this worker pool. For example, Figure 3(a) shows the same loop as in Figure 2, but here the parallelisation comes from using multiple cores. An equivalent shorthand directive is shown in Figure 3(b).

Figure 3: (a) Example loop parallelisation with multithreading; loop iterations are distributed across cores. (b) Shorthand syntax for the same multithreaded loop parallelisation.
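In code, the two forms of Figure 3 correspond roughly to the following (again with illustrative array names):

    /* (a) Explicit parallel region with a work-shared loop. */
    void par_add_a(double *c, const double *a, const double *b, int n)
    {
        #pragma omp parallel
        {
            #pragma omp for
            for (int i = 0; i < n; i++)
                c[i] = a[i] + b[i];
        }
    }

    /* (b) Equivalent shorthand using the combined directive. */
    void par_add_b(double *c, const double *a, const double *b, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }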

Figures 2 and 3 are demonstrations of the OpenMP directives only. In practice, real codes will have multiple nested loops, and determining when to use simd (Figure 2), parallel and for (Figure 3(a)), and parallel for (Figure 3(b)) is key to our case studies. The following guidelines are useful to keep in mind:

  1. Use simd (Figure 2) when each loop iteration is completely independent with no branch conditions and data arrays are small. The loop body must also be simple, involving only arithmetic and standard mathematical functions (e.g. sin, cos, exp or pow).

  2. Use parallel for (Figure 3(b)) when loop iterations are independent but branch conditions may cause each iteration to perform dramatically different calculations. The loop body may be arbitrarily complex and may call user defined functions.

  3. A simd (Figure 2) loop may be used within the body of a parallel for (Figure 3(b)) loop, but the converse is not true.

  4. The full parallel and for (Figure 3(a)) construct can be useful to more explicitly control the behaviour of individual threads through the use of OpenMP functions.

These points are not comprehensive, but they are good starting points for many applications. We refer the reader to the OpenMP specifications (see https://www.openmp.org/specifications/ for full specifications and reference guides for C/C++ and FORTRAN) for further details. The monographs by Trobec et al. (2018), Levesque and Wagenbreth (2010) and van der Pas et al. (2017) are also valuable sources of practical information on parallel computing for HPC.
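As an illustration of guideline 3 above, the following sketch (the update expression, array layout and sizes are hypothetical, not taken from the case studies) distributes independent samples across threads while vectorising the inner loop over state elements:

    /* Coarse grain: one thread per group of samples.
       Fine grain: element-wise updates within each sample are vectorised. */
    void update_all(double *x, int nsamples, int m, double dt)
    {
        #pragma omp parallel for
        for (int k = 0; k < nsamples; k++) {
            double *xk = x + (size_t)k * m;  /* state vector of sample k */
            #pragma omp simd
            for (int j = 0; j < m; j++)
                xk[j] += dt * (1.0 - xk[j]); /* hypothetical update rule */
        }
    }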

2.4 Memory access and alignment

To take full advantage of vector operations, it is crucial to understand that memory access comes at a cost. For example, commodity random access memory (RAM) bandwidth is far lower than the peak rate at which the VPUs can consume floating point data. Therefore, the VPU cannot be kept anywhere near fully busy if the data to be processed resides in RAM.

CPUs work around this bandwidth bottleneck with a hierarchy of memory caches, which try to keep data as close to the compute resources for as long as possible. Typically the cache has three levels, denoted L1, L2, and L3. The lower the cache number, the higher the bandwidth, but the smaller the available memory. For example, the L1 cache is the only cache that can keep the VPU close to fully utilised, but it is typically only tens of kilobytes in size.

When data is required for computation, the caches are checked in order for that data. If the data is in the L1 cache, then there is no memory access penalty. However, if the data is not in the L1 cache, then L2 is checked, and so on. Finally, RAM is accessed if the data is not in any level of cache. Every failed cache lookup is called a cache miss.

The goal is to access data arrays in such a way that the number of cache misses is minimised. Key guidelines are:

  • Access memory in a regular pattern. For example, access a[i], then a[i+n], then a[i+2*n] and so on. Ideally, the stride n is small, and n = 1 is preferred. Matrices should be accessed according to the matrix storage format: in C/C++, the row-major format is used, that is, for a matrix double A[N][M], the element A[i][j] is equivalent to the linear index A[i*M + j], so access should proceed row-by-row (see the sketch below this list).

  • Aim to reuse data soon after it was last accessed. If a partial calculation is pushed out of cache and into RAM, then it might be faster to re-compute every time rather than access RAM again.
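As a short illustration of the first point above (the matrix dimensions and array name are illustrative), a unit-stride, row-by-row traversal of a row-major matrix in C looks like the following:

    /* Row-major traversal: the inner loop has stride 1, so consecutive
       iterations touch consecutive addresses and stay within cache lines. */
    double sum_matrix(int N, int M, const double *A)
    {
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++)
                sum += A[i * M + j];   /* equivalent to A[i][j] */
        return sum;
    }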

Another important memory consideration for vector operations is memory alignment. Memory is partitioned into blocks called cache lines; for the CPUs we consider, these are 64 bytes in length. When the CPU loads data at a particular memory address, the entire cache line is loaded. Therefore, when doing vector operations, it is important that the first byte of a data array is aligned with the 64-byte cache line boundaries, otherwise some vectors cross over cache line boundaries. Trying to perform vector operations on unaligned data incurs a performance penalty. In practice this means:

  • For dynamic memory, the default malloc or new commands cannot be used (they often default to a smaller alignment boundary); instead, use either Intel's _mm_malloc or GNU's memalign.

  • For static arrays, add __declspec(align(64)) to the array declaration. For example, __declspec(align(64)) double A[100] is a 64-byte aligned array.

This is a somewhat minor code alteration. It is important to note that, with modern processors, vector operations can still be applied to unaligned data, but every time a vector crosses a cache line boundary there is an extra read operation penalty.
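For example, a 64-byte aligned allocation with the Intel toolchain described above might look like the following sketch (the array lengths are arbitrary; with GCC, __attribute__((aligned(64))) plays the role of __declspec(align(64))):

    #include <mm_malloc.h>   /* _mm_malloc and _mm_free */

    /* Static array aligned to a 64-byte cache line boundary (icc syntax). */
    __declspec(align(64)) double b[100];

    void aligned_example(int n)
    {
        /* Dynamic allocation aligned to a 64-byte cache line boundary. */
        double *a = (double *)_mm_malloc(n * sizeof(double), 64);
        for (int i = 0; i < n; i++)
            a[i] = 0.0;
        _mm_free(a);
    }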

Efficient use of cache and memory access patterns is a large topic, and we only aim to highlight common features that we will utilise in our case studies. See, for example, Crago et al. (2018); Edwards et al. (2014); Eichenberger et al. (2004); Geng et al. (2018); Pennycook et al. (2013) and the monograph by Levesque and Wagenbreth (2010) for further discussion on memory access considerations.

2.5 Performance analysis

Many optimisation decisions and hardware comparison benchmarks are made without proper analysis of the original unoptimised implementation of an algorithm (Hurn et al., 2016; Lee et al., 2010). This can be costly in terms of development time or new hardware purchases. Here, we provide some important guidelines that should assist in assessing how efficient a statistical application is, before code optimisation or acceleration is considered.

  1. Use a profiler and static code analysis software. A profiler will monitor the execution of running software and report the time spent in each line of code. Information such as memory latency, cache miss frequency and peak floating point performance can also be reported. Static code analysers do not run the code, but look for particular patterns and provide recommendations. In this article we utilise the Intel VTune Amplifier and Intel Advisor tools.

  2. Estimate the theoretical peak performance. How many floating point operations does the code, on average, perform? What is the theoretical peak of the available CPU?

  3. The maximum performance improvement from multithreading and vectorisation is limited by the operations that cannot be parallelised or vectorised. Most algorithms require some sequential-only operations, and not all floating point calculations can be efficiently mapped to vector operations. Amdahl's law (Amdahl, 1967) provides a useful expression to evaluate the speedup opportunity for a given program with profile data. That is, the speedup factor, S, is bounded by

    S ≤ (T_p + T_s) / (T_p/N + T_s),    (2)

    where T_p is the sequential runtime of code that can be parallelised/vectorised, T_s is the runtime of code that must remain sequential, and N is the number of cores/vector length. Hence, as N → ∞ we have S → 1 + T_p/T_s as a measure of the extent to which the algorithm may ideally utilise parallel architectures.
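    As a hypothetical illustration of Equation (2), if profiling shows that 90% of the sequential runtime is vectorisable (T_p = 0.9 and T_s = 0.1 in normalised units), then with N = 8 vector lanes the bound gives S ≤ 1/(0.9/8 + 0.1) ≈ 4.7, and even as N → ∞ the speedup cannot exceed 1 + T_p/T_s = 10.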

2.6 Summary

In this section, we have provided generic guidelines in the areas of parallel RNGs, vectorisation, multithreading, memory usage and code analysis. Other than parallel RNGs, most of these guidelines are not specific to Bayesian statistics; however, they are practical and easy to implement in any statistical codes that are based on fundamental languages like C/C++ and FORTRAN (we leave it to the discussion in Section 5 to highlight issues with higher-level languages in these respects). In the remainder of this article, we demonstrate these guidelines practically through topical Bayesian applications.

3 Prior predictive sampling

Approximate Bayesian computation (ABC) techniques are one class of likelihood-free methods that provide a powerful framework for the approximate sampling of posterior distributions when the likelihood function is computationally intractable (Sisson et al., 2018; Sunnåker et al., 2013). The most accessible algorithm for applied practitioners is the ABC rejection method, which generates artificial data from the prior predictive distribution,

    p(y) = ∫_Θ p(y | θ) p(θ) dθ,    (3)

and accepts a small proportion of them as samples from an approximation to the posterior based on a discrepancy measure against the observational data.

Given observed data, y_obs, the parameter prior distribution, p(θ), a discrepancy function, ρ(·,·), a vector of sufficient (or informative) summary statistics, S(·), and an acceptance threshold, ε, a simple version of the ABC rejection method generates posterior samples as given in Algorithm 1.

1:  repeat
2:     Generate prior sample θ* ~ p(θ);
3:     Use stochastic simulation to generate prior predictive data y* ~ p(y | θ*);
4:  until ρ(S(y*), S(y_obs)) ≤ ε;
5:  Accept θ* as an approximate posterior sample;
Algorithm 1 ABC rejection sampling

In practice, ABC rejection sampling is rarely implemented in this manner. A more practical and effective means of performing ABC rejection, especially in parallel, is to generate a fixed number of prior predictive joint samples (Algorithm 1 is serial and produces a random number of samples). The acceptance threshold, ε, is then selected a posteriori based on the empirical distribution of the discrepancy metric, ρ, over these samples. See Fan and Sisson (2018) and Warne et al. (2019) for detailed reviews of ABC Monte Carlo algorithms.

ABC methods are routinely used in the study of complex stochastic models (Beaumont et al., 2002; Browning et al., 2018; Drovandi and Pettitt, 2011; Johnston et al., 2014; Ross et al., 2017; Tanaka et al., 2006; Vo et al., 2015; Sisson et al., 2018). However, Monte Carlo estimators based on ABC converge slowly in mean-square compared with exact Monte Carlo sampling algorithms (Barber et al., 2015; Fearnhead and Prangle, 2012). The ABC rejection method naturally lends itself to multithreading, and distributed parallelism. However, the opportunities for SIMD level parallelism depend on the model. In this section, we demonstrate these opportunities through two illustrative inference problems with intractable likelihoods.

3.1 Model examples

We provide two example models, selected to demonstrate the advantages and limitations of SIMD parallelism compared with thread parallelism under ABC rejection sampling. Both models are Markov processes; however, one is discrete in time and the other is continuous in time. This continuity of time turns out to be the main factor affecting the efficacy of SIMD operations for the task of generating a large number of prior predictive samples.

3.1.1 Genetic toggle switch

We use the genetic toggle switch model of Bonassi et al. (2011), also considered by Vo et al. (2019). We let u_t and v_t represent the expression levels of genes u and v at time t for each of the cells i = 1, …, C. The state of each cell evolves according to two coupled stochastic differential equations (SDEs), which are discretised according to the Euler-Maruyama scheme with timestep Δt, so that

    (4)

where the model parameters define the repressive responses between the two genes, and the driving noise terms are independent normal random variables.

For inference, only gene u is measured with error for each cell at a fixed time according to

    (5)

where the measurement model parameters control the error rate and the measurement errors are standard normal random variables. The set of measurements across all cells constitutes a dataset. We consider sampling of the prior predictive distribution for the purposes of estimating the model parameters.

3.1.2 Tuberculosis transmission dynamics

The second model we consider was developed by Tanaka et al. (2006) for the study of tuberculosis transmission dynamics, and has been used as a reference model for many subsequent ABC analyses (e.g. Sisson et al. (2007); Fearnhead and Prangle (2012)). The model describes the evolution of the number of tuberculosis cases over time through a state vector whose ith element is the number of cases caused by the ith genotype of the tuberculosis bacterium, with the number of unique genotypes varying over time. The time-varying evolution of the genotype counts is driven by a linear birth-death process with given birth and death rates. Additionally, mutation events occur at a given rate, in which new genotypes are created.

Given values for the birth, death and mutation rates, realisations of this process can be generated using the Gillespie direct method (Gillespie, 1977), with propensity functions corresponding to the birth, death and mutation events respectively. Note that in the event of a mutation, the dimension of the state vector increases (equivalently, the state may be treated as an element of a sequence space).

Tanaka et al. (2006) demonstrate the application of ABC methods to infer the three rate parameters for a real dataset of tuberculosis cases in San Francisco during the early 1990s. The summary statistics used were the number of distinct genotypes and the genetic diversity (see Tanaka et al. (2006) for details).

3.2 Parallelisation and vectorisation opportunities

Given multiple CPU cores, the process of i.i.d. sampling from the joint prior predictive distribution can be trivially divided between the cores. For the toggle switch model (Equation (4)) it is reasonable to divide the workload evenly across cores since the number of floating point operations per model realisation is independent of the parameter values. Such a workload division is called a static schedule in the OpenMP standard. In a static schedule, each core generates an equal share of the prior predictive samples; for simplicity, we assume that the total number of samples divides evenly among the cores.

For the tuberculosis model, generation of the prior predictive samples involves a discrete-state, continuous-time Markov process that is simulated with the Gillespie algorithm. As a result, the simulation times can vary wildly and a static schedule could result in a load imbalance between cores. Consequently, the guided or dynamic scheduling schemes could perform better. In practice, we found that the overhead associated with load imbalance becomes relatively low for larger numbers of samples. Furthermore, an early termination rule, as used in the lazy ABC approach of Prangle (2016), would also further reduce this imbalance. We conclude that a standard static distribution of prior predictive samples is appropriate as a general strategy for ABC applications.
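In OpenMP terms, the two scheduling choices discussed above can be expressed roughly as follows; the per-sample simulation routines here are hypothetical stubs, not the actual case study code.

    #include <omp.h>

    /* Hypothetical per-sample simulation routines (stubs for illustration). */
    static void simulate_toggle_switch(int i) { (void)i; /* near-constant cost  */ }
    static void simulate_tuberculosis(int i)  { (void)i; /* highly variable cost */ }

    void sample_prior_predictive(int nsamples)
    {
        /* Near-constant per-sample cost: a static schedule splits samples evenly. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < nsamples; i++)
            simulate_toggle_switch(i);

        /* Highly variable per-sample cost: a guided schedule mitigates imbalance. */
        #pragma omp parallel for schedule(guided)
        for (int i = 0; i < nsamples; i++)
            simulate_tuberculosis(i);
    }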

Unfortunately, no such general strategy exists for vectorisation. The simulation algorithms for both ABC applications are very different. The toggle switch model utilises an Euler-Maruyama scheme, whereas the tuberculosis model utilises the Gillespie Direct method. Therefore we expect the performance improvement opportunity to differ.

A naïve, scalar implementation of the toggle switch model simulation would compute each cell's realisation in sequence, as shown in Algorithm 2.

1:  Initialise the state of each cell i = 1, …, C;
2:  Set t = 0;
3:  for each timestep do
4:     for each cell i = 1, …, C do
5:        Generate the Gaussian increments for cell i;
6:        Update the expression level of gene u for cell i with a scalar Euler-Maruyama step;
7:        Update the expression level of gene v for cell i with a scalar Euler-Maruyama step;
8:     end for
9:  end for
Algorithm 2 Scalar implementation of the Euler-Maruyama scheme for the toggle switch model

Since each Euler-Maruyama simulation performs identical operations for each cell, this is ideal for vectorisation. That is, we can evolve cells together in blocks whose length is matched to the width of the available VPUs. Aside from the generation of Gaussian random variates, every operation in Algorithm 2 may be replaced with an element-wise vector operation. However, we generate the Gaussian variates using the high performing Intel MKL RNG prior to processing each block. This is efficient provided the block is small enough to fit into the L1 cache, but large enough to take advantage of the Intel routines (the Intel routines typically need variates to be generated in reasonably large batches to attain peak performance).

The resulting vectorised Euler-Maruyama scheme is shown in Algorithm 3.

1:  Initialise the state of each cell i = 1, …, C;
2:  Set t = 0;
3:  Set the block length to match the VPU width and partition the C cells into blocks;
4:  for each block of cells do
5:     Generate the Gaussian increments required by the block;
6:     for each timestep do
7:        Compute the drift terms for gene u as element-wise vector operations;
8:        Update the expression levels of gene u for the block;
9:        Compute the drift terms for gene v as element-wise vector operations;
10:        Update the expression levels of gene v for the block;
11:     end for
12:  end for
Algorithm 3 Vectorised implementation of the Euler-Maruyama scheme for the toggle switch model

Note that vector notation is used in Algorithm 3 to collect the states of all cells in a block, for example gathering the u and v values of a block into vectors, and that Hadamard (element-wise) notation is used for division, multiplication, and exponentiation of such vectors, including the case where one operand is a scalar.
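As a rough sketch of how one such block update maps to C with OpenMP (the drift and diffusion terms below are placeholders rather than the actual toggle switch equations, and the block length of 8 is only illustrative):

    #define BLOCK 8   /* illustrative block length, matched to the VPU width */

    /* One Euler-Maruyama step for a block of cells. u holds the block's states
       and dw holds pre-generated Gaussian increments (e.g. from MKL). */
    void em_step_block(double * restrict u, const double * restrict dw,
                       double dt, double theta)
    {
        #pragma omp simd aligned(u, dw : 64)
        for (int j = 0; j < BLOCK; j++)
            u[j] += theta * (1.0 - u[j]) * dt + dw[j];  /* placeholder SDE terms */
    }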

Unfortunately, there are fewer opportunities for vectorisation in the Gillespie method used in the tuberculosis model. The only suitable vectorisable step is the calculation of genotype weights for the lookup method used to select the genotype that will experience the next event. We can also vectorise the discrepancy measure components, namely the number of distinct genotypes and the genetic diversity. While this is not as significant as in the toggle switch model, the length of the state vector in the tuberculosis model is large enough that some performance boost from the vectorisation is still noticeable. The computation of the genetic diversity is particularly efficient here due to the sum of squares operation, which can exploit the fused multiply-add (FMA) VPU operations that perform a multiplication and an addition in a single instruction.
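For instance, the sum of squared genotype counts that enters the genetic diversity statistic can be written as a vectorisable reduction whose multiply-add maps naturally onto FMA instructions (the variable names here are illustrative):

    /* Sum of squared genotype counts n[0..G-1]; the multiply-add in the loop
       body maps onto FMA instructions and vectorises as a reduction. */
    double sum_sq_counts(const double *n, int G)
    {
        double s = 0.0;
        #pragma omp simd reduction(+ : s)
        for (int i = 0; i < G; i++)
            s += n[i] * n[i];
        return s;
    }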

3.3 Performance

We now report the performance improvements obtained through vectorisation and multithreading. We test on two CPU architectures, the Xeon E5-2680v3 (Haswell; https://ark.intel.com/products/81908/Intel-Xeon-Processor-E5-2680-v3-30M-Cache-2-50-GHz-) and the Xeon Gold 6140 (Skylake; https://ark.intel.com/products/120485/Intel-Xeon-Gold-6140-Processor-24-75M-Cache-2-30-GHz-). The Xeon E5-2680v3 supports Intel's AVX2 instruction set with 256-bit vector operations, and the Xeon Gold 6140 supports Intel's AVX-512 instruction set with 512-bit vector operations. The Xeon Gold series is part of Intel's latest Scalable processor family, which inherits much of its VPU technology from the Intel Xeon Phi family.

For the toggle switch model, prior predictive samples are generated for a range of different cell counts and observation times. We present the results in Tables 1 (Xeon E5-2680v3) and 2 (Xeon Gold 6140). The tables compare the scalar sequential runtimes against vectorised sequential, scalar parallel and vectorised parallel runtimes. Note that for 256-bit VPUs the vectorised sequential code out-performs the scalar parallel code using four cores. Furthermore, the vectorisation also further boosts the performance of the parallel code. This is a clear example of the benefits of designing algorithms for SIMD along with multithreading.

Configuration Runtime (Seconds) [Speedup factor]
sequential sequential+SIMD parallel parallel+SIMD
Table 1: Computational performance of computing prior predictive samples under the toggle switch model using an Intel Xeon E5-2680v3 (Haswell) processor. Parallel times are based on 4-way multithreading and SIMD times on 256-bit vector operations.
Configuration Runtime (Seconds) [Speedup factor]
sequential sequential+SIMD parallel parallel+SIMD
Table 2: Computational performance of computing prior predictive samples under the toggle switch model using an Intel Xeon Gold 6140 (Skylake) processor. Parallel times are based on 4-way multithreading and SIMD times on 512-bit vector operations.

We achieve a large speedup over scalar sequential code using only four cores and 256-bit vector operations on the Xeon E5-2680v3, and a further speedup when we move to 512-bit vector operations on the Xeon Gold 6140. At a glance, the step from 256-bit to 512-bit vectors yields less improvement than expected given that the vector units have doubled in width. Based on the numbers from the vectorised sequential versions, the Xeon E5-2680v3 attains only a fraction of the perfect speedup, which is based on two 256-bit VPUs and would be 8× if everything vectorised perfectly; an unrealistic scenario. For the Xeon Gold 6140, the attained fraction of the perfect 16× speedup is smaller still. This is consistent with Amdahl's law (Equation (2)): both observed speedups correspond to a non-trivial sequential fraction, which implies that even if all vectorised operations were computed instantaneously, the maximum performance improvement from vectorisation alone would remain bounded.

Tables 3 (Xeon E5-2680v3) and 4 (Xeon Gold 6140) show the performance results for the tuberculosis model for a range of prior predictive sample counts. The speedup factor from vectorisation is, unsurprisingly, not as significant as for the toggle switch model. This is an important demonstration of the limitations of SIMD operations (including GPGPU-based accelerators). However, there is still a worthwhile improvement from vectorisation, which is good in the context of the HPC literature (Dongarra et al., 2014; Kozlov et al., 2014; Mudalige et al., 2016).

Number of samples Runtime (Seconds) [Speedup factor]
sequential sequential+SIMD parallel parallel+SIMD
Table 3: Computational performance of computing prior predictive samples under the tuberculosis model using an Intel Xeon E5-2680v3 (Haswell) processor. Parallel times are based on 4-way multithreading and SIMD times on 256-bit vector operations.
Number of samples Runtime (Seconds) [Speedup factor]
sequential sequential+SIMD parallel parallel+SIMD
Table 4: Computational performance of computing prior predictive samples under the tuberculosis model using an Intel Xeon Gold 6140 (Skylake) processor. Parallel times are based on 4-way multithreading and SIMD times on 512-bit vector operations.

Interestingly, the speedup factor due to vectorisation is larger for the Xeon E5-2680v3 than for the Xeon Gold 6140. However, the Xeon Gold 6140 results demonstrate a substantial improvement in scalar performance over the Xeon E5-2680v3, which was not observed in the toggle switch case. Therefore, we conclude that the tuberculosis model simulation accesses the L3 cache and main memory more often, and thus takes advantage of the higher memory bandwidth and extra memory channels of the Xeon Gold 6140. If the tuberculosis model is not utilising the L1 and L2 caches as intensively, this would also explain the reduced speedup from moving to wider vectors, since the VPUs cannot be kept busy.

The distinct difference in performance between the toggle switch model and the tuberculosis model suggests that care must be taken when choosing the stochastic simulation algorithm. For example, introducing approximations (Gillespie, 2000, 2001) for the tuberculosis model could significantly improve performance.

3.4 Summary

We have demonstrated that the utilisation of SIMD operations can substantially improve the performance of prior predictive sampling for ABC compared with standard scalar implementations. In particular, we have highlighted that specific features of the stochastic model and simulation algorithm have a great impact on the potential improvements from SIMD operations. Simulation schemes such as Euler-Maruyama are ideal candidates for vectorisation; however, exact stochastic simulation methods like the Gillespie direct method are far more challenging. Regardless, a worthwhile speedup was still achieved through vectorisation in this case.

4 Weakly informative priors

Our second application is the selection of weakly informative priors in Bayesian inference. We specifically focus on the computational approach of Nott et al. (2018), which is based on the theoretical framework of Evans and Jang (2011) and Gelman (2006).

4.1 Testing weak prior informativity

Following the work of Nott et al. (2018), we consider the prior predictive p-value,

    p(y_obs) = Pr( p(y) ≤ p(y_obs) ),    (6)

as a measure of Bayesian model criticism, where y_obs are the observational data, p(·) is the prior predictive distribution (Equation (3)), and the probability is taken with respect to y distributed according to the prior predictive distribution. Equation (6) provides a p-value for prior-data conflict (Evans and Jang, 2011; Nott et al., 2018), which occurs when the model parameters that result in a good fit with the data have low prior probability density. If γ is some pre-defined small cut-off, then p(y_obs) ≤ γ signifies prior-data conflict.

Suppose we have a family of priors, {p(θ | λ) : λ ∈ Λ}, parameterised by the hyperparameter λ. Let the base prior, denoted by p(θ | λ_0) for some λ_0 ∈ Λ, represent the current best knowledge of the parameter θ. Let

    p(y | λ) = ∫_Θ p(y | θ) p(θ | λ) dθ    (7)

be the prior predictive distribution for a given prior p(θ | λ), and let p_λ(y_obs) be the prior predictive p-value under the prior p(θ | λ) (equivalent to Equation (6)),

    p_λ(y_obs) = Pr( p(y | λ) ≤ p(y_obs | λ) ).    (8)

Assuming data generated under the base prior predictive distribution, that is, y_obs ~ p(· | λ_0), the task is to find values of λ such that the probability of prior-data conflict under the prior p(θ | λ) is lower than under the base prior,

    (9)

Priors that satisfy Equation (9) are called weakly informative with respect to the base prior. That is, weak informativity indicates less prior-data conflict than the base. Equivalently, weak informativity at a given level can be expressed in terms of the prior-data conflict cut-offs of the two priors, defined as appropriate quantiles of their respective p-value distributions.

Therefore, given a set of hyperparameter values, with one of them corresponding to the base prior, the task is to compute the prior-data conflict cut-off for each prior in the family. The high-level process proceeds as in Algorithm 4. The output of Algorithm 4 is a set of level cutoffs, from which we can derive a set of prior distributions that are weakly informative with respect to the base prior p(θ | λ_0).

1:  Initialise hyperparameters λ_1, …, λ_K, with base hyperparameter λ_0;
2:  for each hyperparameter λ_k do
3:     for each dataset replicate do
4:        Generate data from the base prior predictive distribution;
5:        Generate data from the prior predictive distribution under λ_k;
6:        Evaluate the prior predictive densities of the generated datasets;
7:     end for
8:     for each dataset replicate do
9:        Estimate the p-value samples;
10:     end for
11:     Compute the level cutoff as the appropriate quantile of the p-value samples;
12:  end for
Algorithm 4 Weak informativity test

This is an extremely computationally intensive task since the normalising constant of the posterior density must be evaluated for many different generated datasets. Here we use an adaptive sequential Monte Carlo (SMC) sampler using likelihood annealing and a Markov chain Monte Carlo (MCMC) proposal kernel (Beskos et al., 2016; Chopin, 2002; Drovandi and Pettitt, 2011) to estimate the normalising constant. This SMC sampler is given in Algorithm 5. Algorithm 4 requires many executions of the adaptive SMC sampler (Algorithm 5) with a fixed number of particles and appropriate tuning parameters related to the number of MCMC iterations and to the covariance of the Gaussian random walk MCMC proposals.

1:  Initialise the particles, weights, annealing temperature and normalising constant estimate, and specify user-defined values for the tuning parameters;
2:  repeat
3:     Increment the temperature index;
4:     Find the next annealing temperature such that the effective sample size (ESS) equals the target value;
5:     Update the particle weights;
6:     Update the normalising constant estimate;
7:     Resample the particles with replacement according to the weights;
8:     Reset the weights to be uniform;
9:     Construct a tuned Gaussian random walk proposal kernel using the sample covariance matrix of the particles;
10:     Set the number of MCMC iterations using the estimated acceptance probability determined from initial trial MCMC iterations;
11:     for each MCMC iteration do
12:        for each particle do
13:           Generate a proposal;
14:           Compute the acceptance ratio;
15:           Accept the proposal with probability given by the acceptance ratio;
16:        end for
17:     end for
18:  until the annealing temperature reaches one
Algorithm 5 SMC sampler for estimating the normalising constant

Just as in Evans and Jang (2011) and Nott et al. (2018), we consider weakly informative priors for the analysis of an acute toxicity test in which groups of animals are given different dosages of some toxin, and the number of deaths in each group is recorded. A logistic regression model is applied for the number of deaths y_i in group i,

    y_i ~ Binomial(n_i, p_i),  with  logit(p_i) = θ_1 + θ_2 x_i,    (10)

where n_i and x_i are respectively the number of animals and the toxin dose level in the ith group, and the model parameters are θ = (θ_1, θ_2). Given data y = (y_1, …, y_N), the likelihood function is

    p(y | θ) = ∏_{i=1}^{N} C(n_i, y_i) p_i^{y_i} (1 − p_i)^{n_i − y_i},    (11)

where C(n_i, y_i) denotes the binomial coefficient. We consider a family of bivariate Gaussian priors on θ with hyperparameter vector λ collecting the prior means and covariance parameters, and with the base prior hyperparameters representing the current best knowledge of θ.

4.2 Parallelisation and vectorisation opportunities

There are many parallel implementations of SMC samplers using multithreading; however, the resampling and annealing steps both require thread synchronisation (Hurn et al., 2016; Lee et al., 2010; Murray et al., 2016). Consequently, it is more beneficial in our case to distribute the hyperparameter values in Algorithm 4 across cores, with each thread computing the cut-offs for a subset of the hyperparameters, since these may be computed completely independently. For every hyperparameter, we sequentially process the p-value computations. We focus on acceleration of the SMC sampler (Algorithm 5) through vectorisation.

In many ways, SMC samplers are well suited to vectorisation since synchronisation is effectively maintained automatically. While there are many aspects of Algorithm 5 that we vectorise in the code examples, we focus here on the vectorisation of the MCMC proposal kernel for diversification of particles (Steps 11–17 of Algorithm 5) as it has the greatest effect on performance improvements.

The strategy we employ is similar to the vectorisation of the Euler-Maruyama scheme (Algorithm 3) presented in Section 3 for the toggle switch model. At each SMC step, each particle is perturbed via a number of MCMC steps. The same operations are performed at each MCMC iteration, with the exception of a single branch operation that arises from the accept/reject step. Therefore, we process the particle updates in blocks whose length is matched to the vector width, and evolve the MCMC steps for each block together using vector operations. Just as with the Euler-Maruyama scheme, all of the Gaussian and uniform random variates required for a block are generated together before the block is processed. We use the same vector notation as in Section 3, together with notation for the element-wise application of a scalar function over a vector. The vectorised MCMC proposals are performed as in Algorithm 6.

1:  for each block of particles do
2:     Generate the Gaussian increments required for the block;
3:     Generate the uniform variates required for the block;
4:     for each MCMC iteration do
5:        Generate the proposals for the block using vector operations;
6:        Evaluate the prior and likelihood terms for the block using vector operations;
7:        Compute the acceptance ratios for the block;
8:        for each particle in the block do
9:           if the uniform variate is below the acceptance ratio then
10:              Accept the proposed particle;
11:           end if
12:        end for
13:     end for
14:  end for
Algorithm 6 Vectorised implementation of MCMC proposals for SMC

In practice, Algorithm 6 works with the likelihood and prior on the log scale to avoid numerical underflow; however, we present the direct implementation here for clarity. Also, while we represent the accept/reject step as a loop using scalar operations, in reality the conditional statement is also vectorised in our example codes. However, the accept/reject step is not a significant computational portion of the algorithm. For optimal performance, it is essential that we have vectorised forms of the likelihood, the prior density and the Gaussian proposal density. This is quite straightforward for the logistic regression model and for the Gaussian priors and proposals used in this example.
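A sketch of such a vectorised likelihood on the log scale follows, assuming the logit link logit(p_i) = θ_1 + θ_2 x_i of Equation (10) and omitting the binomial coefficients, which do not depend on θ (the function and variable names are illustrative):

    #include <math.h>

    /* Log-likelihood of the bioassay logistic regression for one parameter pair.
       x[i] is the dose, y[i] the number of deaths, and n[i] the group size. */
    double log_likelihood(const double *x, const double *y, const double *n,
                          int N, double theta1, double theta2)
    {
        double ll = 0.0;
        #pragma omp simd reduction(+ : ll)
        for (int i = 0; i < N; i++) {
            double eta = theta1 + theta2 * x[i];
            /* log(p_i) = -log(1 + exp(-eta)), log(1 - p_i) = -log(1 + exp(eta)) */
            ll += -y[i] * log1p(exp(-eta)) - (n[i] - y[i]) * log1p(exp(eta));
        }
        return ll;
    }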

Other aspects of Algorithm 5 that could be vectorised include the likelihood evaluations and the computation of the ESS and weight updates. The main scalar bottleneck is the resampling step, which involves sampling from a multinomial distribution via the look-up method. Parallel approximations for resampling have been proposed (Murray et al., 2016) and these techniques have been demonstrated to be very effective for large scale SMC samplers. However, since we focus here on a SIMD implementation of SMC, we have not implemented this approach. We consider extending the ideas of Murray et al. (2016) to the SIMD paradigm an important piece of future work.

4.3 Performance

We test the performance improvement obtained through vectorisation and multithreading using the Xeon E5-2680v3 and Xeon Gold 6140 processors. The evaluation of the weak informativity test in Algorithm 4 is performed for a range of values of the number of hyperparameters and the number of datasets. In all simulations, the hyperparameter values are generated using a bivariate uniform distribution. The SMC sampling is performed with a relatively small number of particles, with the lower particle count enabling computation to remain largely within the L1 and L2 caches. The tuning parameter values are fixed and provided in the supplementary material. Results are provided in Tables 5 and 6.

Across all configurations, a consistent improvement due to vectorisation is achieved using the Xeon E5-2680v3, and likewise when using the Xeon Gold 6140. Once again, diminishing returns are observed when stepping from 256-bit vector operations to 512-bit vector operations.

It is worth noting that, in our bioassay example, the MCMC proposal kernel performs only a small number of MCMC steps and the number of particles in the SMC sampler is modest. In Tables 1 and 2, the overall performance boost increases with larger problem dimensions. Therefore, we expect the overall performance improvement to increase in cases where longer MCMC runs are required, since the MCMC step will dominate each SMC iteration and more efficient reuse of L1 cache data will be achieved.

Configuration Runtime (Seconds) [Speedup factor]
sequential sequential+SIMD parallel parallel+SIMD