Corrfunc: Blazing fast correlation functions with AVX512F SIMD Intrinsics

Correlation functions are widely used in extra-galactic astrophysics to extract insights into how galaxies occupy dark matter halos and in cosmology to place stringent constraints on cosmological parameters. A correlation function fundamentally requires computing pair-wise separations between two sets of points and then computing a histogram of the separations. Corrfunc is an existing open-source, high-performance software package for efficiently computing a multitude of correlation functions. In this paper, we will discuss the SIMD AVX512F kernels within Corrfunc, capable of processing 16 floats or 8 doubles at a time. The latest manually implemented Corrfunc AVX512F kernels show a speedup of up to ∼ 4× relative to compiler-generated code for double-precision calculations. The AVX512F kernels show ∼ 1.6× speedup relative to the AVX kernels and compare favorably to a theoretical maximum of 2×. In addition, by pruning pairs with too large of a minimum possible separation, we achieve a ∼ 5-10% speedup across all the SIMD kernels. Such speedups highlight the importance of programming explicitly with SIMD vector intrinsics for complex calculations that can not be efficiently vectorized by compilers. Corrfunc is publicly available at



There are no comments yet.


page 1

page 2

page 3

page 4


CGAlgebra: a Mathematica package for conformal geometric algebra

A tutorial of the Mathematica package CGAlgebra, for conformal geometric...

Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Systems

Several manufacturers have already started to commercialize near-bank Pr...

MPLAPACK version 1.0.0 user manual

The MPLAPACK (formerly MPACK) is a multiple-precision version of LAPACK ...

Report: Performance comparison between C2075 and P100 GPU cards using cosmological correlation functions

In this report, some cosmological correlation functions are used to eval...

corr2D - Implementation of Two-Dimensional Correlation Analysis in R

In the package corr2D two-dimensional correlation analysis is implemente...

FLASH 1.0: A Software Framework for Rapid Parallel Deployment and Enhancing Host Code Portability in Heterogeneous Computing

In this paper, we present FLASH 1.0, a C++-based software framework for ...

An overview of block Gram-Schmidt methods and their stability properties

Block Gram-Schmidt algorithms comprise essential kernels in many scienti...

Code Repositories


⚡️⚡️⚡️Blazing fast correlation functions on the CPU.

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Dark matter halos are spatially distributed in the Universe based on the values of the cosmological parameters in the cosmological model. Galaxies live in dark matter halos, but how these galaxies populate halos depends on a complex interplay between various astrophysical processes. We constrain this ‘galaxy-halo connection’ by statistically comparing the spatial distribution of observed and modeled galaxies. One such commonly used statistical measure is the correlation function.

A correlation function is the measure of the excess probability of finding a pair of galaxies at a certain separation, compared to that of an Poisson process. The simplest correlation function is a 3-D spatial one —



where is the excess probability of finding a pair of galaxies, is the number density of galaxies, and is the correlation function. In practice, to calculate the correlation function, we need to count the number of galaxy-pairs found at a different separations. The separations themselves are typically discretized as a histogram; as such, calculating a correlation function amounts to calculating pairwise separations followed by a spatial distance histogram. We then need to compare these pair counts with the the number expected from randomly distributed points for the same histogram bins. is frequently calculated with the following[peebles1980]:


where and are respectively the number of “data-data” and “random-random” pairs in the histogram bin of separation .

In additional to the full 3-D separation , the spatial separations can be split into two components — typically a line-of-sight () and a projected separation (). The line-of-sight direction is arbitrary but is usually chosen to be a coordinate axis. When the separations are split into two components, the correlation function is computed as a 2D histogram of pair-counts, referred to as . The two-point projected correlation function, , is simply the integral of along the line-of-sight and defined as:


For the remainder of the paper we will focus on these two correlation functions — and .

A correlation function can be used to statistically compare any theoretically generated set of mock galaxies to the observed galaxy clustering. Such a comparison is frequently done within a Monte Carlo Markov Chain process


. For any reasonable MCMC estimates of the posterior, we would need a large number of evaluations of the correlation function. Hence a faster correlation function code is a key component for cutting edge astrophysical research.

In [corrfunc_paper] we showed that Corrfunc is at least faster than all existing bespoke correlation function codes. Corrfunc achieves such high-performance by refining the entire domain into cells, and then handling cell pairs with bespoke SIMD kernels targeting various instruction set architectures. In [corrfunc_paper], we presented three different kernels targeting three different instruction sets – AVX, SSE and the Fallback kernels. In this work, we will present AVX512F kernels and additional optimizations.

1.1 Correlation Functions

The simplest correlation function code can be written as: Naive code for a correlation function C for(int i=0;i¡N1;i++) for(int j=0;j¡N2;j++) double dist = distance_metric(i, j); if(dist ¡ mindist —— dist ¿= maxdist) continue;

int ibin = dist_to_bin_index(dist); numpairs[ibin]++; weight[ibin] += weight_func(i, j);

The only two components that are required to fully specify a correlation function are:

  • distance_metric: This specifies how to calculate the separation between the two points. For , the separation is simply the Euclidean distance between the points

  • dist_to_bin_index: This specifies how to map the separation into the histogram bin. Typically, there are bins logarithmically spaced between and , which span 2–3 orders of magnitude.

In this paper, we will be concentrating on two spatial correlation functions — and . Consider a pair of distinct points, denoted by the subscripts and , with Cartesian positions and . The separations and the associated constraints (i.e., distance_metric) for and are:


Thus, only pairs with add to the histogram for ; while for , pairs need to satisfy both conditions and before the histogram is updated. Note that the histogram for is still only a 1-D histogram based off ; the constraint simply filters out pairs with too large separation.

So far all we have is the histogram of pair-wise separations for the galaxies. To fully evaluate the correlation function, we also need to evaluate the histogram of pairs for randomly distributed points (see Eqn. 2). For simple domain geometry, like a cubic volume with periodic boundary conditions as is common with cosmological simulation data, we can analytically compute the number of random pairs for any histogram bin. Thus, we only need to compute the term in Eqn. 2 to calculate . A similar technique can be applied to as well. Consequently, we only need to compute the (i.e., an auto-correlation) term to calculate both and .

In real galaxy surveys, the domain geometry is not so simple. In angular extent, the domain is limited by foreground contamination and the sky area the telescope can see from its latitude; in radial extent, it is limited by the faintness of distant galaxies and other complex selection effects. Thus, the and terms must often be computed by pair-counting; this is a major computational expense and is a motivating factor for our development of a fast pair-counting code.

1.2 Partitioning the space into cells

For a given set of points, a naive implementation of a correlation function would require evaluating all pair-wise separations and hence scale as . However, most correlation functions only require calculating pair-separation up to some maximum separation . If we are only interested in pairs within , then computing the pairwise separation to all possible pairs only to discard majority of the computed separations is extremely computationally inefficient. We can create an initial list of potential neighbors and significantly reduce the total number of computations by first partitioning the particles into cells of size at least . This idea of cell lists[Quentrec_cell_linked_list73], or chaining meshes [HOCKNEY1974148], is used both in molecular dynamics simulations[gromacs_3_2001] and smoothed particle hydrodynamics simulations[swift_simd_verlet_2018].

After depositing the particles into cells of side at least , all possible pairs within must lie within the neighboring cells. In Fig 1, we show the lattice of side

imposed on the Poisson distributed red points. For any given query point (in blue), we first locate the target cell and then we immediately know all potential cells that may contain pairs.

Figure 1: Partitioning the space to speed up the search for any potential pair within . The distribution of red points is gridded up on a lattice with cell-size at least . For any blue query point, all possible pairs involving the red points must lie within the 9 neighboring cells (dark gray shaded cells). With a similar lattice structure in 3 dimensions, we approximate a sphere of volume with a cube of volume . Figure adapted from [corrfunc_paper].

However, this lattice implementation approximates the volume of a sphere of radius by that of a cube with side . Thus, if we compute all possible pair-separations, then only 16% of the separations will be within ; the remaining 84% will be spurious[cell_refinements_gonnet_07]. In Fig. 2, we show that sub-dividing the cells further reduces the effective search volume. Sub-dividing the cells into size , the spurious calculations drop to [cell_refinements_gonnet_07]. Continuing to sub-divide further reduces the spurious calculations even more, but the overhead of searching over many more neighboring cells starts to dominate. In our experience with Corrfunc, we have found that bin sizes in the vicinity of produce the fastest run-times for a broad range of use-cases.

Figure 2: Refining the cell-size reduces the search volume. Particles in the shaded regions are separated by more than and can not have a pair. Compared to the top panel, the middle and lower panels need to inspect a smaller search volume. Figure adapted from [corrfunc_paper].

2 Overview of Corrfunc

2.1 Optimization Design

Corrfunc provides a variety of high-performance, OpenMP parallel, user-friendly correlation function codes. We have presented Corrfunc in [corrfunc_paper] but to make this paper self-contained, we will briefly describe the design and optimizations implemented in Corrfunc.

2.1.1 Creating 3D cells to reduce the search volume

As we discussed in Sec 1.2, the computational cost of searching for potential neighbors can be reduced by dividing the space into 3D cells. Given two sets of positions, the correlation function routines in Corrfunc first computes a bounding box for the two datasets. The second step is to sub-divide the space into 3D cells such that all possible pairs are located within the neighboring cells (see Fig. 1). However, for an optimal grid size, we need to account for the separation in the specific correlation function. As we showed in Eqn. 4, the calculation only needs pairs that satisfy and correspondingly the grid-size is some fraction of . For , we need two different separations – and , where is the separation between a pair of points in the - plane, and is the separation along the axis. Therefore, the optimal grid-size is some fraction of in both and axes, and along the axes. We combine the two cases into , with the understanding that for calculations.

2.1.2 Improving cache utilization

Within each cell, the particle positions are stored as a Structure-of-Array (SoA) variable. The individual positions are copied from the input arrays and duplicated within dedicated pointers in the SoA cell-array. Since we can load in the positions from the SoA cell-array directly into vector registers without any shuffle operations, the SoA operation is very conducive to vectorization.

2.1.3 Reducing the search volume by sorting the positions

With some appropriate sub-divisions, we can now locate all possible pairs that satisfy the separation constraints among the neighboring cell pairs. For calculations, the spherical search volume of is then approximated with . The particle positions stored in the SoA are always sorted along the axis. With such a sorting, we only need to seek from any particle position to find all possible pairs. Thus, with the -sorting, we reduce the search volume along the axis from to .

2.1.4 Reducing the number of cell pairs

Once both the datasets have been gridded up and the particles assigned to cells, we associate all possible pairs of cells that may contain a pair of points within . The fundamental unit of work within Corrfunc involve such pairs of cells.

For cases where the two datasets are distinct (cross-correlations), there are no symmetry conditions to exploit. However, when the two datasets are identical (auto-correlations), we can take advantage of symmetry. Only unique pairs of cells need to calculated, and as long as we double the total number of pairs found at the end, we will still have the correct number of pairs. Both and are auto-correlations and benefit from this optimization.

2.2 Computing the minimum possible separation

After culling for cell-pairs based on symmetry, we are left with cell-pairs that are within along any one dimension, but might still be too far apart when the full 3D separation is considered. To remove such cell-pairs, we need to compute the minimum possible separation between these two cells. Based on the linear offset between the cell-pairs along each dimension, we know the minimum possible separations, and along each axis. However, these quantities only capture the extent of the individual cells and do not reflect the actual positions of the particles within the cell. Therefore, we also store the bounding box info for each cell. With the bounding box, we can then compute the minimum possible separation between the two cells by simply taking the difference of the bounding axes along each dimension and then using Eqn. 4. If the minimum possible separation is larger than , then there can not be any valid pairs between the particles in the two cells and we reject that cell-pair.

If a cell-pair passes this check, then there may exist at least one valid pair between the cells. So far the and related quantities only reflect the minimum possible separation between the cell-edges. could really reflect the minimum separation between any pair of particles. Since we have the bounding box info for the secondary cell, we increase each one of the three quantities by the separation between the secondary bounding box and the nearest secondary cell-edge. If most of the secondary particles are concentrated towards the center of the secondary cell, then we would increase by a large amount and correspondingly prune a larger chunk of the secondary particles.

2.2.1 Late-entry and early-exit from the -loop

For any pair of cells, we know the minimum possible particle separation along each of the , and axes – and respectively. We also store the positions for the X, Y and Z edges of the primary cell nearest to the secondary cell – , and respectively. Since the minimum possible separation (between any particle pairs) along and axes is and , and the maximum total separation is , the maximum possible that can satisfy :


This only makes sense for ; for calculations equals .

In addition, we can also compute the minimum possible separation between a given primary particle and any secondary particle. We can make an improved estimate for the minimum separation between the primary particle and any secondary particle by using the and positions of the primary particle. For every cell-pair we can then compute two conditions Therefore, we can compute the minimum possible


Recall that the values in are sorted in increasing order, i.e., , as well as . If we define , then the values are also in increasing order for a fixed value of . Therefore, if any particle in the second dataset has , all future particles from the second data also must have . When we encounter such a particle, we terminate the -loop and continue on with the next iteration of the outer -loop.

Since the values are effectively sorted in increasing order, for a given , the smallest value of (note flipped subscripts) will occur for the final -th particle. Since the positions in the first dataset are also sorted in increasing order, therefore Thus, is also the smallest possible value for all remaining pairs between the two datasets. Therefore, if exceeds , no further pairs are possible and we can exit the outer -loop altogether.

For any -th particle to have a pair in the second dataset, the condition must be met. Therefore if , there can not be any possible pair between this -th particle and any particles. However, a different particle from the first dataset might still be a valid pair with the remaining particles in the second dataset. Therefore, we simply continue on with the next iteration of the -loop.

Since the positions are sorted, we continue to loop over the particles, until we find a such that . This marks the beginning particle for calculating pairwise separations. mLate Entry and early exit condition [escapeinside=——]C const —DOUBLE— *zstart = z2; const —DOUBLE— *zend = z2 + N2; const —DOUBLE— dz—— = ——; for(int64_t i=0;i¡N1;i++) const —DOUBLE— xpos = *x1++; const —DOUBLE— ypos = *y1++; const —DOUBLE— zpos = *z1++;

DOUBLE— this_dz = *z2 - zpos; if(this_dz ¿= dz——) continue;

const —DOUBLE— dx = ——:0; const —DOUBLE— dy = ——:0; const —DOUBLE— dz = ——:0; const —DOUBLE— sqr_sep = dx—— + dy—— + dz——; if(sqr_sep ¿= ——) continue; const —DOUBLE— dz—— = ——;

while( (z2 ¡ zend) && (*z2 - zpos) ¡= -dz——) z2++; if(z2 == zend) break;

int64_t j = z2 - zstart;

2.2.2 Vector intrinsics in existing Simd kernels

In [corrfunc_paper], we presented the dedicated AVX and SSE kernels. These kernels operate on cell pairs, finding all possible pairs between the two cells and updating the histogram appropriately. Both the AVX and SSE kernels have an associated header file each that maps C-macros to the correct underlying intrinsic for both double and single precision floats. With such an approach, the same lines of (pseudo-)intrinsics in the SIMD kernels can be seamlessly used for both single and double precision floats.

We vectorize the -loop over the second set of points and process particles in chunks of SIMDLEN; where SIMDLEN is 8 and 4 for single-precision AVX and SSE respectively. For double precision calculations, SIMDLEN is halved and equals 4 and 2 for AVX and SSE respectively.

Immediately following from Code 2.2.1, we can start processing pairs of particles with SIMD intrinsics. The loop over secondary particles in SIMD kernels [escapeinside=——]C int64_t j = z2 - zstart; —DOUBLE— *localz2 = z2; —DOUBLE— *localx2 = x2 + j; —DOUBLE— *localy2 = y2 + j; const —SIMD_DOUBLE— simd_xpos = —SIMD_SPLAT—(xpos); const —SIMD_DOUBLE— simd_ypos = —SIMD_SPLAT—(ypos); const —SIMD_DOUBLE— simd_zpos = —SIMD_SPLAT—(zpos);

const —SIMD_DOUBLE— simd_sqr_rmax = —SIMD_SPLAT—(——); const —SIMD_DOUBLE— simd_sqr_rmin = —SIMD_SPLAT—(——);

for(;j¡=(N2 - —SIMDLEN—);j+=—SIMDLEN—) const —SIMD_DOUBLE— simd_x2 = —SIMD_LOAD—(localx2); const —SIMD_DOUBLE— simd_y2 = —SIMD_LOAD—(localy2); const —SIMD_DOUBLE— simd_z2 = —SIMD_LOAD—(localz2);

localx2 += —SIMDLEN—; localy2 += —SIMDLEN—; localz2 += —SIMDLEN—;

const —SIMD_DOUBLE— dx = —SIMD_SUB—(simd_x2 - simd_xpos); const —SIMD_DOUBLE— dy = —SIMD_SUB—(simd_y2 - simd_ypos); const —SIMD_DOUBLE— dz = —SIMD_SUB—(simd_z2 - simd_zpos);

if( —ALL—(dz ¡= -dz——) ) continue; if( —ANY—(dz ¿= dz——) ) j = N2;

const —SIMD_DOUBLE— rp_sqr = —SIMD_ADD—(—SIMD_MUL—(dx, dx), —SIMD_MUL—(dy, dy)); const —SIMD_DOUBLE— r_sqr = —SIMD_ADD—(rp_sqr, —SIMD_MUL—(dz, dz));

const —SIMD_MASK— rmax_pairs = —SIMD_CMP_LT—(r_sqr, sqr_rmax); const —SIMD_MASK— rmin_pairs = —SIMD_CMP_GE—(r_sqr, sqr_rmin); —SIMD_MASK— pairs_left = —SIMD_AND—(rmax_pairs,rmin_pairs); if( —NONE—(pairs_left) ) continue;

/* histogram update here */

for(;j¡N2; j++) /* remainder loop */ … Since we can only process multiples of SIMDLEN with the SIMD intrinsics, we need an additional remainder loop to process any remaining particles in the second dataset.

Once we have a set of SIMD vectors, calculating the separations is trivial. Note that we avoid the expensive sqrt operation in Eqn. 4 and always perform comparisons with squared separations. Once we have the (squared) separations, we we can create vector masks for separations that satisfy the conditions in Eqn. 4. If no particles satisfy the distance constraints, then we continue to process the next SIMDLEN chunk. If there are particles that do satisfy all distance criteria, then the histogram needs to be updated.

2.2.3 Updating the histogram of pair-counts

Within the SIMD kernels, we have the array containing the values of the lower and upper edges of the histogram bins. To update the histogram, we need to first ascertain which bin any given falls into. The simplest way to locate the bin is to loop through the bins, and stop when is within the bin-edges. Recall that we avoid computing , so the comparison is . However, we can take advantage of how the typical bins are spaced and perform the histogram update faster. Histogram Update in SIMD kernels [escapeinside=——]C for(int kbin=nbin-1;kbin¿=1;kbin–) const —SIMD_DOUBLE— m1 = —SIMD_CMP_GE—(r2, rupp_sqr[kbin-1]); const —SIMD_MASK— bin_mask = —SIMD_AND—(m1, pairs_left); npairs[kbin] += —POPCNT—(—SIMD_TEST—(bin_mask)); pairs_left = —SIMD_CMP_LT—(r2, rupp_sqr[kbin-1]); if( —NONE—(pairs_left) ) break;

Typically the histogram bins are logarithmically spaced, and consequently, the outer bins encompass a much larger volume than the inner bins. Therefore, many more pairs of points are likely to fall in the outer bins than the inner ones. Following [Chhugani2012], we loop backwards through the histogram bins (see Code 2.2.3). Within the SIMD kernels, we create a mask that evaluates to ‘true’ separations that fell within a bin. We then run a hardware popcnt operation to count the number of bits set and update that particular histogram bin. This iteration over the histogram stops once we have accounted for all valid separations.

2.3 Overview of Avx512f

AVX512 is the latest generation of instruction set architecture supported on both the Intel SkyLake-SP, Skylake-X and the Xeon Phi x2000 (Knights Landing) processors. AVX512 expands the vector length to 512-bytes (compared to the 256-bytes in AVX) and introduces an additional mask variable type. Every instruction now comes with a masked variety where some elements can be masked out and alternate values specified for those masked-out lanes. The dedicated mask variable type can be directly cast into an uint16_t.111For double precision calculations, the upper 8 bits of the mask are identically set to 0

Since AVX512 is composed of several distinct instruction sets, both current and upcoming, we have only targeted the specific subset – AVX512-Foundation (AVX512F). AVX512F is meant to be supported by all current and future processors with AVX512 support222 AVX512CD is meant to allow vectorization of histogram updates but our attempts at automatic vectorization have proved futile so far.

2.4 Avx512f kernel implementation

Within the AVX512F kernel, we employ the same late-entry and early-exit conditions discussed in Sec 2.2.1. In this subsection, we will describe the AVX512F kernel once the first possible particle in the -loop is identified.

The masked operations are quite handy when coding up the AVX512F kernel. For instance, typically the array lengths are not an exact multiple of the SIMD vector length. Therefore, at the end of each vectorized loop, there is always a ‘remainder’ loop (see Code 2.2.2) to process the elements that were left over. With the masked operations in AVX512F

, we can pad out the ‘remainder’ points to be an exact

SIMD vector length; and set the mask for these padded points as ‘false’ (see Code 2.4). All of the subsequent processing, including the ‘load’ from memory, then uses this mask as one of the operands. Since there is no longer any ‘remainder-loop’, the Corrfunc AVX512F kernels are automatically more compact. With such a masked load, we can completely avoid the ‘remainder’ loop. We simply continue to update the mask variable – mask_left – while processing the SIMDLEN separations per iteration of the -loop. In other words, we used masked comparisons, where the input mask already contains the mask_left.

Most of the remaining sections of the AVX512F kernel follows similarly to the previous kernels. The two significant updates in the AVX512F kernels are that we have used the FMA operations where possible, and instead of the hardware popcnt instruction, we have used a pre-computed array containing the number of bits set. The -loop in AVX512F kernels [escapeinside=——]C const uint16_t pairs_left_float[] = 0xFFFF, 0x0001, 0x0003, 0x0007, 0x000F, 0x001F, 0x003F, 0x007F, 0x00FF, 0x01FF, 0x03FF, 0x07FF, 0x0FFF, 0x1FFF, 0x3FFF, 0x7FFF;

const uint8_t pairs_left_double[] = 0xFF, 0x01,0x03,0x07,0x0F, 0x1F,0x3F,0x7F;

const int64_t n_off = z2 - zstart; const int64_t Nleft = N2 - n_off; —DOUBLE— *localz2 = z2; —DOUBLE— *localx2 = x2 + n_off; —DOUBLE— *localy2 = y2 + n_off; for(int64_t j=0;j¡Nleft;j+=—AVX512_SIMDLEN—) —AVX512_MASK— pairs_left = (Nleft - j) ¿= —AVX512_SIMDLEN— ? pairs_left_DOUBLE[0]:pairs_left_DOUBLE[Nleft-j]; …

3 Results

In this section we will show the performance results from the newest AVX512F kernel and compare to the previous AVX, SSE and Fallback kernels within Corrfunc. To run the benchmarks in Sec 3.1 and Sec 3.2, we have used the Intel C compiler suite 2018 (icc/2018.1.163) on a dedicated node with the Intel Skylake 6140 cpus at the OzSTAR supercomputing Centre ( Corrfunc was compiled with the compilation flag -O3 -xhost -axCORE-AVX512 -qopenmp. We used the git commit hash 7b698823af216f39331ffdf46288de57e554ad06 to run these benchmarks.

To run the benchmarks in Sec 3.3, we used the Intel C compiler (2017) with the same compiler flags. The hardware setup was a dual socket machine with two Intel Xeon Gold 6132 @ 2.60GHz, for 28 threads total.

3.1 Comparing the performance with sub-divided cells

We showed in Section 1.2 that if we keep the cell sizes at , then only 16% of the pairs computed are likely to be necessary. Within Corrfunc, we have the runtime option of refining the cell-size further. There are 3 bin_refine_factors corresponding to each of the , and axes. These bin_refine_factors dictate how many further sub-divisions of the initial cell are made. Based on past experience, refining along the axis only degrades performance; therefore, we have fixed the refinement at 1 and only allowed the refinements to vary along the and axes. In Fig 3, we show how reducing the cell-size impacts the total runtime for a calculation, with =25 in a periodic box of side 420. We let the and bin refinement factors to vary between 1 and 3. Every combination of represents cell-sizes of .

Figure 3: How sub-dividing the cells into sizes smaller than improves runtime performance. In this figure, the colors represent different combinations of the and cell sizes ( and ). The individual in the figure represent For example, is the case where , is where and so on. The AVX512F kernels are the fastest, followed by the AVX, the SSE and the Fallback kernels. There is variation of up to 50% in runtime (for the Fallback case) for different bin_refine_factors; the variation in runtime is less pronounced () in the SIMD kernels. Thus, choosing an optimal cell-size is an important aspect of performance.

In Fig. 3 we see that AVX512F kernel is by far the fastest, followed by the AVX, the SSE and the Fallback kernels. This relative ordering of the kernels is in keeping with the relative ordering of the vector register sizes.

Within each kernel group, the coarsest sub-division — — is the slowest with the performance improving with higher bin refinement factors. However, all kernels are slower for bin refinements of indicating that the overhead of looping over neighbor cells dominates over the reduction in the search volume from the bin refinements (see Fig 2). The improvement in the runtime for the Fallback kernel is drastic — faster in moving from to . The SIMD kernels also show variations in the runtime with different bin_refine_factors but the scatter seems to reduce with wider vector registers. For instance, the AVX512F kernels show the least variation in runtime with the bin_refine_factors while the SSE kernels show the largest variation.

Within Corrfunc, we simply set the default bin_refine_factor to and recommend that the user experiment with various refinement factors to find out what produces the fastest code for their use-case. With the new AVX512F kernel, the same default bin refinement factors continue to be the fastest option. As we showed in Fig 2, since the search volume reduces with higher bin refinement factors, we do expect a reduction in runtime. However, the exact speedup obtained is likely to be strongly dependent on and the number density of the particles. Exploring this dependence of the speedup on and the particle load and automatically setting the bin_refine_factors to close-to-optimal values would be a great improvement to the Corrfunc package.

3.2 Comparing the performance with bounding box optimizations

In Section 2.2.1, we discussed how we maintain a bounding box for every cell. After computing the minimum possible separation based on the bounding box, we can then reject any cell-pair that can not contain a particle pair within . In addition, for every primary particle, we can compute the minimum possible separation to any secondary particle. If this minimum separation is larger than , then we simply continue with the next primary particle. In Fig. 4 we show the performance improvement based on the minimum separation calculations. We can see that that the performance improvement is typically in the 5-10% range, with the latest instruction sets showing the smallest improvement. Since the minimum separation calculation is done in scalar mode, that optimization means more time is spent in the scalar section of the SIMD kernel and consequently the kernels with the widest vector registers show the smallest improvement. In addition, we also see that the improvement generally reduces as the number of sub-divisions increases. This is most likely a combination of two effects – i) increased runtime overhead for processing larger number of neighbor cells, and ii) overall lower amount of computation per cell pair (due to smaller number of particles per cell) means the potential work avoided with the minimum separation calculation is lower.

The dataset we have used for the benchmark contains a modest million galaxies. It is quite likely that the minimum separation optimization will have a larger impact for larger dataset (or equivalently, a larger ).

Figure 4: How the runtime changes for each SIMD kernel by calculating the minimum separation between cell-pairs, as well as the minimum separation for a primary particle and all remaining particles in the secondary cell (see Section 2.2). The colors show the different refinements (see Section 1.2 and Section 3.1). The improvement ranges from few% to 10% across all the SIMD kernels; however, the performance improvement seems to reduce when there are a larger number of sub-divisions.

3.3 Comparing the performance of the Simd kernels

In the preceding section we saw that the AVX512F kernels in are the fastest for a for a fixed value of . We also wish to validate that AVX512F performs well for a broad range of . Typical values of range from few percent to 25% of the periodic box size, so we explore this range in our tests. In the following, the box size of the test dataset is 420; we show performance for values between 10 and 100.

In this section, we will compare the performance of the four SIMD kernels for a range of values, with the calculations being done in double precision. For , increasing means increasing the search volume as . For , we have fixed ; hence, increasing also increases the search volume by . Thus, in both cases we expect an asymptotic runtime scaling.

In Fig. 5, we show how the various kernel run-times scale. We see the expected behavior at large , with the AVX512F kernels being the fastest for reasonably large . At the lowest values, each cell contains only a small number of particles and it is likely that there is not sufficient computational work to keep the wider vector registers full.333A low is potentially a case where the bin_refine_factors need to be set to to boost the particle occupancy in the cells

Figure 5: Correlation function runtime versus . For the calculations (left panel), we have set . The effective search volume is then either that of a sphere, for or a cylinder , since for this test. Thus, in both the and cases, the search volume scales as and correspondingly both the correlation functions scale accordingly. As we summarized in Table 1, we find that the AVX512F kernels are faster by relative to the Fallback kernel, and relative to the AVX kernel.

Now that we have seen the raw performance, let us compare the speedups obtained by the vectorized kernels relative to the Fallback kernel which contains no explicit SIMD instructions. Since the calculations are done in double precision, a priori we expect a theoretical maximum of speedup for the AVX512F, AVX and SSE kernels, assuming the compiler is not able to automatically generate any vectorized instructions. Even in the case that the compiler is able to do so, this test will measure how much more efficient our vector implementation is than the compiler’s.

In Table 1 we show the speedup obtained with the various SIMD kernels relative to the Fallback kernel for a range of values for . We see that for , there is essentially no performance boost from the vectorized kernels. Once is , the speedup seems to stabilize at for the AVX512F, AVX and the SSE kernels respectively. These achieved speedups are within a factor of 2 of the theoretical maximum speedup. More interestingly, the AVX512F is faster than the AVX kernels, compared to the theoretical maximum of speedup from the wider vector registers. We also use the FMA operations in AVX512F kernels, which also adds to the potential speedup.

max width= AVX512F AVX SSE Fallback AVX512F AVX SSE Fallback 10.0 1.1 1.0 1.0 1.0 1.0 1.0 0.9 1.0 20.0 2.7 1.8 1.3 1.0 2.2 1.8 1.1 1.0 40.0 3.0 1.8 1.2 1.0 2.4 1.9 1.2 1.0 80.0 3.9 2.3 1.3 1.0 3.6 2.3 1.4 1.0 100.0 3.8 2.4 1.4 1.0 3.8 2.4 1.5 1.0

Table 1: Speedup from the SIMD kernels relative to the Fallback kernel as a function of . For calculations, we have set . All of these calculations are done with a simulation box of periodic size .

4 Conclusions

In this paper, we have presented AVX512F kernels for calculating correlation functions within the open-source package Corrfunc. These AVX512F kernels have been manually coded with vector intrinsics and make extensive use of masked operations to compute the separations and then update the histogram of pair-counts. The AVX512F kernels show a typical speedup for relative to the compiler-generated code within the Fallback kernel. The speedup is relative to the AVX kernel and compares very well to the theoretical maximum of . In addition, by efficiently pruning pairs that have separations larger than , we gained up to a 10% speedup. This paper and [corrfunc_paper] highlight the importance of combining domain knowledge, efficient algorithms, and dedicated vector intrinsics for complex calculations. Such combinations are particularly powerful when the underlying problem is difficult for the compiler to efficiently vectorize, as is the case for Corrfunc.

5 Acknowledgements

MS was primarily supported by NSF Career Award (AST-1151650) during main Corrfunc design and development. MS was also supported by the Australian Research Council Laureate Fellowship (FL110100072) awarded to Stuart Wyithe and by funds for the Theoretical Astrophysical Observatory (TAO). TAO is part of the All-Sky Virtual Observatory and is funded and supported by Astronomy Australia Limited, Swinburne University of Technology, and the Australian Government. The latter is provided though the Commonwealth’s Education Investment Fund and National Collaborative Research Infrastructure Strategy (NCRIS), particularly the National eResearch Collaboration Tools and Resources (NeCTAR) project. Parts of this research were conducted by the Australian Research Council Centre of Excellence for All Sky Astrophysics in 3 Dimensions (ASTRO 3D), through project number CE170100013.