A Guide to Particle Advection Performance

by   Abhishek Yenpure, et al.

The performance of particle advection-based flow visualization techniques is complex, since computational work can vary based on many factors, including number of particles, duration, and mesh type. Further, while many approaches have been introduced to optimize performance, the efficacy of a given approach can be similarly complex. In this work, we seek to establish a guide for particle advection performance by conducting a comprehensive survey of the area. We begin by identifying the building blocks for particle advection and establishing a simple cost model incorporating these building blocks. We then survey existing optimizations for particle advection, using two high-level categories: algorithmic optimizations and hardware efficiency. The sub-categories of algorithmic optimizations include solvers, cell locators, I/O efficiency, and precomputation, while the sub-categories of hardware efficiency all involve parallelism: shared-memory, distributed-memory, and hybrid. Finally, we conclude the survey by identifying current gaps in particle advection performance, and in particular on achieving a workflow for predicting performance under various optimizations.


page 3

page 15

page 19


OpenFPM: A scalable open framework for particle and particle-mesh codes on parallel computers

Scalable and efficient numerical simulations continue to gain importance...

Middleware Building Blocks for Workflow Systems

This paper describes a building blocks approach to the design of scienti...

Scratchpad Sharing in GPUs

GPGPU applications exploit on-chip scratchpad memory available in the Gr...

Enabling particle applications for exascale computing platforms

The Exascale Computing Project (ECP) is invested in co-design to assure ...

VPIC 2.0: Next Generation Particle-in-Cell Simulations

VPIC is a general purpose Particle-in-Cell simulation code for modeling ...

Algorithmic Building Blocks for Asymmetric Memories

The future of main memory appears to lie in the direction of new non-vol...

A Survey of High Level Frameworks in Block-Structured Adaptive Mesh Refinement Packages

Over the last decade block-structured adaptive mesh refinement (SAMR) ha...

1 Introduction

Flow visualization techniques are used to understand flow patterns and movement of fluids in many fields, including oceanography, aerodynamics, and electromagnetics. Many flow visualization techniques operate by placing massless particles at seed locations, displacing those particles according to a vector field to form trajectories, and then using those trajectories to create a renderable output. Each of the trajectories are calculated via a series of “advection steps,” where each step advances a particle a short distance by solving an ordinary differential equation.

Particle advection workloads can be quite diverse across different flow visualization algorithms and grid types. These workloads consist of many factors, including the number of particles, duration of advection, velocity field evaluation, and analysis needed for each advection step. One particularly important aspect with respect to performance is the number of advection steps, which derive from both the number of particles and their durations. Many flow visualization techniques have numerous particles that go for short durations, while many others have few particles that go for long durations. Some cases, like when analyzing flow in the ocean [61]

, require numerous particles for long durations and thus billions of advection steps (or more). With respect to velocity field evaluation, uniform grids require only a few operations, while unstructured grids require many more (for cell location and interpolation). In all, the diverse nature of particle advection workloads makes it difficult to reason about both the execution time for a given workload and the potential improvement from a given optimization.

The main goal of this survey is to provide a guide for understanding particle advection performance, including possible optimizations. It does this in four parts. First, Section 2 provides background on the building blocks for particle advection. Second, Section 3 introduces a cost model for particle advection performance, to assist with reasoning about overall execution time and inform which aspects dominate runtime. Third, Section 4 surveys algorithmic optimizations. Fourth, Section 5 surveys approaches for utilizing hardware more efficiently, with nearly all of these works utilizing parallelism. Contrasting the latter two sections, Section 4 is about reducing the amount of work to perform, while Section 5 is about executing a fixed amount of work more quickly. In terms of how to read this survey, Sections 4 and 5 can be read in any order. In particular, readers interested in a particular algorithmic optimization or technique for hardware efficiency can skip to the corresponding subsection. That said, readers new to this topic should start with Sections 2 and 3 to gain a basic understanding of particle advection performance.

In terms of placing this survey into context with previously published literature, we feel this is the first effort to provide a guide to particle advection performance. The closest work to our own is the survey on distributed-memory parallel particle advection by Zhang and Yuan [86]. Our survey is differentiated in two main ways. First, our survey considers a broader context overall, i.e., it considers algorithmic optimizations and additional types of parallelism. Second, our discussion of distributed-memory techniques does a new summarization of workloads and parallel characteristics (specifically Table V

), and also has been updated to include works appearing since their publication. There also have been many other excellent surveys involving flow visualization and particle advection: feature extraction and tracking 

[65], dense and texture-based techniques [48], topology-based flow techniques [49] and a subsequent survey focusing on topology for unsteady flow [63], integration-based, geometric flow visualization [54], and seed placement and streamline selection [72]. Our survey complements these existing surveys — while some of these works consider aspects of performance within their individual focal point, none of the surveys endeavor to provide a guide to particle advection performance.

2 Particle Advection Background

Fig. 1: Organization of the components for a particle advection-based flow visualization algorithm. The components are arranged in three rows in decreasing levels of granularity from top to bottom. In other words, the components at the bottom are building blocks for the components at higher levels. The top row shows components that define the movement and analysis of a particle. The loop in the top row indicates its components are executed repeatedly until the particle is terminated. The middle row shows components that define a single step of advection. The arrows with the ellipsis from ODE solver to velocity field evaluation are meant to indicate that an ODE solver needs to evaluate the velocity field multiple times. Each velocity field evaluation takes as input a spatial location and possibly a time, and returns the velocity at the corresponding location (and time). The frequently-used Runge-Kutta 4 ODE solver requires four such velocity field evaluations. Finally, as depicted in the bottom row, each velocity field evaluation requires first locating which cell in the mesh contains the desired spatial location and then interpolating the velocity field to the desired location.

Flow visualization algorithms perform three general operations:

  • Seed Particles: defines the initial placement of particles.

  • Advance Particles: defines how the particles are displaced and analyzed.

  • Construct Output: constructs the final output of the flow visualization algorithm, which may be a renderable form, something quantitative in nature, etc.

These three operations often happen in sequence, but in some forms they happen in an overlapping fashion (i.e., seed, advance, seed more, advance more, etc.)

Our organization, which is illustrated in Figure 1, focuses on the “advance particles” portion of flow visualization algorithms. It divides the components into three levels of granularity.

The “top” level of our organization considers the process of advancing a single particle. It is divided into three components:

  • Advection Step: advances a particle to its next position.

  • Analyze Step: analyzes the advection step that was just taken.

  • Check for Termination: determines whether a particle should be terminated.

The process of advancing a particle involves three phases that are applied repeatedly. The first phase is to displace a particle from its current location to a new location. Such displacements are referred to as particle advection steps. The second phase is to analyze a step. The specifics of the analysis vary by flow visualization algorithm, and could be as simple as storing the particle’s new location in memory or could involve more computation. The third phase is to check if the particle meets the termination criteria. Similar to the second phase, flow visualization algorithms define specific criteria for when to terminate a particle. Finally, if the particle is not terminated, then these three phases are repeated in a loop until the termination criteria are reached.

The “middle” level of our organization considers the process of completing a single step for a particle. This level has two components:

  • ODE Solver: calculates a particle’s displacement to a new position by solving an ordinary diffential equation (ODE).

  • Velocity Field Evaluation: calculates the velocity value at a specific location by interpolating within the located cell.

Thus, to calculate the velocity at a point P, cell location is first used to identify the cell C that contains P, and then velocity field interpolation is performed to calculate the velocity at P using information at the vertices of C.

The “bottom” level of our organization considers the process of velocity field evaluation. This level also has two components:

  • Cell Location: locates the cell that contains some location.

  • Field Interpolation: calculates velocity field at a specific location via interpolation of surrounding velocity values.

In terms of a relationship, to calculate velocity at some point , first cell location is used to identify the cell that contains , and then velocity field interpolation is used to calculate the velocity at using ’s information.

Different flow visualization algorithms use these components in different ways, resulting in different performance across the algorithms. One way of comparing the different algorithms’ perforamance can be the workload required by the algorithms, which roughly translates to the total computation required by the algorithm. This workload can be defined as the total number of advection steps completed by the algorithm, which is the product of the total number of particles required by the algorithm and the number of steps expected to be completed by each particle. Figure 2 shows examples of four different flow visualization algorithms that demonstrate significant differences in their workloads and behaviours. Table I highlights the differences between the workloads for the example algorithms.

(a) streamlines
(b) streamsurface
(c) FTLE
(d) Poincaré
Fig. 2: Example flow visualizations from four representative algorithms. Subfigure (a) shows streamlines rendered over a slice of jet plume data created using the Gerris Flow Solver [64], subfigure (b) shows a streamsurface which is split by turbulence and vortices that can be observed towards the end [30], subfigure (c) shows attracting (blue) and repelling (red) Lagrangian structures extracted as FTLE ridges from a simulation of a von Korman vortex street [43], and subfigure (d) shows a Poincaré plot of a species being dissolved in water, where the color of the dots represent the level of dissolution [50].
Seeding Strategy Sparse Packed Seeding Curves
Number of Seeds Small Medium Large
1/1K cells ~1/100 cells 1/cell
Number of Steps Small Medium Large
100 ~1K 10K

(a) Each parameter is classified in three catagories.

Algorithm Seeding # Seeds # Steps Streamlines Sparse/Packed Small Large Streamsurface Seeding Curves Medium Large FTLE Packed Large Small Poincarè Packed Medium Large
(b) Typical parameter configurations for different flow visualization algorithms.

TABLE I: Parameters for seeding strategy, the number of seeds, and the number of steps for four representative flow visualization algorithms. (a) describes the parameters and their classifications, and (b) presents the typical values for the four algorithms.

3 Cost Model for Particle Advection Performance

This section considers costs from the perspective of the building blocks used to carry out particle advection. Its purpose is to build a general framework for reasoning about costs and also to inform which aspects contribute most to overall cost. That said, the simplicity of the cost model precludes directly evaluating many of the optimizations described in later sections (I/O, parallelism, precomputation, and adaptive step sizing); this topic is revisited in Section 6. Finally, Appendix A

goes into more depth on the cost model, including estimating costs for each term in the model, notional examples, and validation of the model.

Let the costs for particle advection be denoted by . Then a coarse formulation for is:


where represents the total number of particles used for the flow visualization, represents the total number of steps taken by the particle, and represents the amount of work required by particle at step in the process of advancing the particle.

To better illuminate the overall costs, the remainder of this section considers how the coarse formulation in Equation 1 can be further decomposed. We first consider the tasks within . In particular, each step that advances a particle contains three components — taking an advection step, analyzing the step in a way specific to the individual flow visualization algorithm, and checking if the particle should be terminated. Hence, Equation 1 can be written as:


where is the cost for advecting, is the cost for analyzing, and is the cost for checking the termination criteria for the particle at the step.

The cost can be further broken down by exploring the cost for a single advection step, . Particle advection uses an ODE solver to determine the next position of a particle, and this solver requires the velocity of the particle at the current location. Further, depending on the ODE solver, additional velocity evaluations in the proximity of the particle may be required. An Euler solver requires only one velocity evaluation, while an RK4 solver requires four velocity evaluations. Generalizing, the cost of a single particle advection step can be written as:


where is the cost for the ODE solver to determine the next position, is the number of velocity evaluations required by the ODE, and is the cost for velocity evaluation for the particle for the step at the location.

The cost for velocity evaluations, , can be further broken down into two components. Each evaluation involves two operations: locating the current cell for the current evaluation, and interpolating the velocity values for the current position using velocities at the vertices of the current cell. In all, the cost of velocity evaluations can be written as:


where is the cost for locating the cell, and is the cost for interpolating the velocities at the location.

Further, we can substitute 4 in 3 to yield:


Finally, we can substitute 5 in 2 to obtain our final formulation:


4 Algorithmic Optimizations

This section surveys algorithmic optimizations for particle advection building blocks, i.e., techniques for executing a given workload using fewer operations. Some of the building blocks do not particularly lend themselves to algorithmic optimizations. For example, a RK4 solver requires a fixed number of FLOPS, and the only possible “optimization” would be to use a different solver or adaptive step sizes. That said, cell location allows room for possible optimizations. Further, the efficiency of vector field evaluation can be improved by considering underlying I/O operations. This section discusses four optimizations that address the algorithmic challenges Section 4.1 discusses optimizations to ODE solvers, Section 4.2 discusses optiizations for cell location, Section 4.3 discusses strategies to improve I/O efficiency, and finally Section 4.4 discusses strategies that involve precomputation.

4.1 ODE Solvers

The fundamental problem underlying particle advection is solving of ODEs. Many methods are available for this, with different trade-offs, and a comprehensive review is beyond the scope of this work, and we refer the reader to the excellent book by Hairer et al. [37] for a more thorough overview. Due to the generally (numerically) benign nature of vector fields used in visualization, a set of standard schemes is used in many visualization implementations.

Beyond the Euler and the fourth-order Runge-Kutta (RK4) methods, techniques with adaptive step size control have proven useful. The primary objective of such methods is to allow precise control over the error of the approximation of the solution, which is achieved by automated selection of the step size (which in turn controls the approximation error) in each step. Often used methods in this context are the Runge-Kutta Fehlberg (RFK) method [37], the Runge-Kutta Cash-Karp (RKCK) method [37], and the Dormand-Prince fifth-order scheme (DOPRI5) [66]. As an additional benefit relevant in this context, due to the typically low error tolerances required for visualization purposes, significant performance benefits can be obtained if the adaptive step sizing results in fewer large steps taken compared to fixed-step size methods. While the magnitude of such benefits depends on a variety of factors that are hard to quantify, using an adaptive step sizing method is generally recommended. Corresponding implementations are widely available, e.g. in the VTK framework [33, 38].

For specialized applications, substantial performance benefits may be obtainable by relying on domain-specific integration schemes that generally exhibit higher accuracy orders and thus allow larger step sizes than general-purpose schemes. For example, Sanderson et al. [71] report substantial speedup from employing an Adams-type scheme for visualizing high-order fusion simulation data. However, general guidance on the selection of optimal schemes for domain-specific vector field data remains elusive.

4.2 Cell Locators

Cell locators facilitate interpolation queries over a grid and rely on auxiliary data structures that partition candidate cells spatially. These are typically constructed in a pre-processing step and induce linear memory overhead in the number of cells , while accelerating queries to . Many cell location schemes allow trading off memory overhead for improved performance. A variety of schemes have been developed for different scenarios. For example, limited available memory, e.g. on GPUs, can be addressed through multi-level data structures. According to Lohner and Ambrosiano [52] the process of cell location can follow one of the following three approaches.

Using a Cartesian background grid:

Cell are spatially subdivided using a superimposed Cartesian grid, storing a list of overlapping cells of the original grid per superimposed cell. The superimposed cell can be found in constant time, and cell location then requires traversing all overlapping cells to find the actual containing cell for the query point. While conceptually simple, this approach is not ideal if the background grid exhibits large variances in cell sizes, either incurring excessive storage overhead or decreased performance, depending on the resolution of the superimposed grid.

Using tree structures: A basic approach to hierarchical cell location is the use of octrees for cell location [76, 81]. Each leaf of an octree stores cells whose bounding box overlaps with the leaf extents. Leaves are subdivided until either a maximum depth is reached, or the number of overlapping cells falls below an upper bound. Cell location proceeds by traversing the octree from the root and descending through nodes until a leaf is reached, which then contains all the candidate cells. Due to the regular nature of octree subdivision, this approach does not work well with non-uniform vertex distributions, requiring either too many levels of subdivision and thus a considerable memory overhead, or does not shrink the candidate cell range down to acceptable levels.

Using kd-trees instead of octrees facilitates non-uniform subdivision, at the cost of generally deeper trees and a storage overhead. An innovative approach was given by Langbein et al.[47], based on a kd-tree storing just the vertices of an unstructured grid. This allows to quick location of a grid vertex close to the query point; using cell adjacency, ray marching is used to traverse the grid towards the query point using cell walking. Through clever storage of the cell-vertex incidence information, storage overhead can be kept reasonable.

Garth and Joy described the cell tree [31], which employs a kd-tree-like bounding interval hierarchy based on cell bounding boxes to quickly identify candidate cells. This allows a flexible trade-off between performance and storage overhead and allows rapid cell location even for very large unstructured grids with hundreds of millions of cells on commodity hardware and on memory-limited GPU architectures.

Addressing storage overhead directly, Andrysco and Tricoche [3] presented an efficient storage scheme for kd-trees and octrees, based on compressed sparse row (CSR) storage of tree levels, termed Matrix *Trees. The tree data structure is encoded as a sparse matrix in CSR representation. This alleviates most of the memory overhead of kd-trees, and they are able to perform cell location with reduced time and space complexity when compared with typical tree data structures.

Overall, non-uniform hierarchical subdivision can accommodates large meshes with significant variations in cell shapes and sizes well. While Lohner and Ambrosiano note that vectorization of this approach is challenging as tree-based schemes introduce additional indirect addressing, vectorization is still possible on modern CPU and GPU architectures with good performance [31].

Using successive neighbor searches: For the case of particle integration, successive interpolation queries exhibit strong coherence and are typically spatially close. This enables a form of locality caching: For each interpolation query except the first, the cell that contained the previous query point is checked first. If it does not contain the interpolation point, its immediate neighbors are likely to contain it, potentially reducing the number of cells to check. The initial interpolation point can be located using a separate scheme, e.g. as discussed above.

Lohner and Ambrosiano, as well as Ueng et al. [80], adopted a corresponding successive neighbor search method to cell location in particle advection for efficient streamline, streamribbon, and streamtube construction. They restricted their work to linear tetrahedral cells for simplification of certain formulations, requiring a pre-decomposition for general unstructured grids. Note that when applied to tetrahedral meshes, the successive neighbor search approach is sometimes also referred to as tetrahedral walk [75, 17].

Kenwright and Lane [45] extended the work by Ueng et al. by improving the technique to identify the particle’s containing tetrahedron. Their approach uses fewer floating point operations for cell location compared to Ueng et al.

Successive neighbor search is also naturally incorporated in the method of Langbein et al. [47]; ray casting with adjacency walking towards begins at the previous interpolation point in this case.

Algorithm Application Intent / Data Time Seed Performance
Evaluation Size Steps Count
Lohner and Ambrosiano[52] Streamlines Fast cell location and efficient vectorization - 10K
Ueng et al.[80] Streamlines Streamline computation and cell location in canonical coordinate space - 100
Chen et al.[24] Streamlines Improving data layout for better I/O performance 134M - 4K
200M -
537M -
Chen et al.[22] Pathlines Improving data layout for better I/O performance 25M 48 4K
65M 29
80M 25
Chen et al.[23] FTLE Improving data layout for better I/O performance 25M 48 -
65M 29
80M 25
TABLE II: Summary of studies considering algorithmic optimizations to particle advection. Studies that do not report quantitative performance improvements are not mentioned in the table. The asterisk for entries in the data size column represent unstructured grids.

4.3 I/O Efficiency

Simulations with very large numbers of cells often output their vector fields in a block-decomposed fashion, such that each block is small enough to fit in the memory of a compute node. Flow visualization algorithms that process block-decomposed data vary in strategy, although many operate by storing a few of these blocks in memory at a time, and loading/purging blocks as necessary. This method of computation is known as out-of-core computation. One of the significant bottlenecks for flow visualization algorithms while performing out-of-core computations is the cost of I/O. Particle advection is a data-dependent operation and efficient prefetching to ensure sequential access to data can be very beneficial in minimizing these I/O costs. This section discusses the works that aim to improve particle advection performance by improving the the efficiency of I/O operations.

Chen et al. [24]

presented an approach to improve the I/O efficiency of particle advection for out-of-core computation. Their approach relies on constructing an access dependency graph (ADG) based on the flow data. The graph’s nodes represent the data blocks, and the edges are weighted based on the probability that a particle travels from one block to another. The information from the graph is used during runtime to minimize data block misses. Their method demonstrated speed-ups over the Hilbert curve layout 


Chen et al. [22] extended the previous work to out-of-core computation of pathlines. Their results show a performance improvement in the range of 10%-40% compared to the Z-curve layout [87, 83].

Chen et al.[23] expanded the work further to introduce a seed scheduling strategy to be used along with the graph-based data layout. They demonstrated an efficient out-of-core approach for calculating FTLE. The performance improvements observed against the Z-curve layout were in the range of 8%-32%.

4.4 Precomputation

Besides optimizing individual particle advection building blocks, optimization of certain flow visualization workloads can benefit from a two-stage approach. During the first stage, based on current literature, a set of particle trajectories can be computed to inform data access patterns, or serve as a basis for interpolating new trajectories. Depending on the objectives, the number of trajectories computed during the first stage varies. The resulting set of trajectories can be referred to as the precomputed trajectories.

Precomputed trajectories can inform data access patterns to provide a strategy to improve I/O efficiency, as mentioned in the context of the study by Chen et al. [24] in the previous section. A similar approach was studied by Nouansengsy et al. [60] to improve load balancing in a distributed memory setting. In these cases, the first stage is a preprocessing step and a small number of particle might be advected to form the set of precomputed trajectories.

For a computationally expensive particle advection workload, a strategy to accelerate the computation or improve interactivity of time-varying vector field visualization is to divide the workload into two sets. The first set includes particle trajectories computed using high-order numerical integration. The second set includes particle trajectories that are derived by interpolating the precomputed trajectories. If new particle trajectories can be derived from the precomputed set faster than numerical integration, while remaining accurate and satisfying particle trajectory requirements for the specific flow visualization use case, then the total computational cost of the workload can be reduced compared to the numerical intergration of every trajectory.

Hlawatsch et al. [41] introduced a hierarchical scheme to construct integral curves, streamlines or pathlines, using sets of precomputed short flow maps. They demonstrated the approach for the computation of the finite-time Lyapunov exponent and the line integral convolution. Although the method introduces a trade-off of reduced accuracy, they demonstrate their approach can result in an order of magnitude speed up for long integration times.

To accelerate the computation of streamline workloads, Bleile et al. [11] employed block exterior flow maps (BEFMs) produced using precomputed trajectories. BEFMs, i.e., a mapping of block-specific particle entry to exit locations, are generated to map the transport of particles across entire blocks in a single interpolation step. Thus, when a new particle enters a block, instead of performing an unknown number of numerical integration steps to traverse the region within the block, based on the mapping information provided by precomputed trajectories, the location of the particle exiting (or terminating within) the block can be directly interpolated as a single step. Depending on the nature of the workload, large speedups can be observed using this strategy. For example, Bleile et al. [11] observed up to 20X speed up for a small loss of accuracy due to interpolation error.

To support exploratory visualization of time-varying vector fields, Agranovsky et al. [1] proposed usage of in situ processing to extract accurate Lagrangian representations. In the context of large-scale vector field data, and subsequent temporally sparse settings during post hoc analysis, reduced Lagrangian representations offer improved accuracy-storage propositions compared to traditional Eulerian approaches, as well as, can support acceleration of trajectory computation during post hoc analysis. By seeding the precomputed trajectories along a uniform grid, structured (uniform or rectilinear) grid interpolation performance can be achieved during post hoc analysis. To further optimize the accuracy of reconstructed pathlines in settings of temporal sparsity, research has considered how varying the set of precomputed trajectories can improve accuracy-storage propositions. For example, Sane et al. [73] studied the use of longer trajectories to reduce error propagation and improve accuracy, and Rapp et al. [69] proposed a statistical sampling technique to determine where seeds should be placed. Unstructured sampling strategies, however, can increase the cost of post hoc interpolation and diminish computational performance benefits.

4.5 Summary

Table II summarizes studies that address algorithmic optimizations and report performance improvements against a baseline implementation. The studies mentioned in the table either target optimizations for cell location or perform better I/O operations. For ODE solvers and precomputation, reporting performance improvements is difficult because of an associated accuracy trade-off for better performance. Optimizations to cell locators for unstructured grid enable significant speed-ups for the workloads. With a combination of efficient cell location and vectorization, Lohner and Ambrosiano [52] achieved the speed of . However, the other study [80] demonstrated a speed-up of around . The works by Chen et al. [24, 22, 23] for efficient I/O for particle advection all demonstrated speed-ups up to .

5 Using Hardware Efficiently

Flow visualization algorithms often share resources with large simulation codes, or require large amounts of computational resources of their own depending on the needs of the analysis task. This means flow visualization algorithms are often required to execute on supercomputers. Executing codes on supercomputers is expensive and it is necessary that all analysis and visualization tasks execute with utmost efficiency. Modern supercomputers have multiple ways to make algorithms execute fast. Typically, supercomputers have thousands of nodes over which computation can be distributed, and each node has multi-core CPUs on along with multiple accelerators (e.g., GPUs) for parallelization. As a result, algorithms are expected to make efficient use of this billion-way concurrency. This section discusses research for particle advection that addresses efficient usage of available hardware. Section 5.1 discusses research that aims to improve shared-memory (on-node) parallelism. Section 5.2 discusses research that aims to improve distributed memory parallelism. Section 5.3 discusses research that uses both shared and distributed memory parallelism.

5.1 Shared Memory Parallelism for Particle Advection

Shared memory parallelism refers to using parallel resources on a single node. The devices that enable shared memory parallelism are multi- and many-core CPUs and other accelerators, such as GPUs. In the case of shared memory parallelism, multiple threads of a program running on different cores of a processor (CPU or a GPU) share memory, hence the nomenclature. One of the primary reasons for the increase in supercomputers’ compute power can be attributed to the advancements of CPUs and accelerator hardware. In all, for applications to make cost-effective use of resources, it has become exceedingly important to use shared memory resources efficiently. However, making efficient use creates many challenges for the programmers and users. Two important factors to consider are 1) efficient use of shared memory concurrency, and 2) performance portability.

Algorithm Application Intent / Data Time Seed Performance
Evaluation Size Steps Count
Krüger et al. [46] source-dest Interactive flow visualization (steady) using GPUs - - - 60-80
Bürger et al.[16] various Interactive flow visualization (unsteady)
Bürger et al.[15] various Interactive flow visualization using importance metrics 7M -
4M 22
1M 30
Bürger et al.[14] streak surface Interactive streak surface visualization 589K 102 400
4.1M 22
Schirski et al.[75] pathlines, source-dest Efficient cell location on GPUs 5 1M
Garth et al.[31] source-dest Efficient cell location on GPUs for unstructured grids / Comparison against CPUs - 250K
Bußler et al.[17] source-dest Efficient cell location on GPUs using improved tetrahedral walk 5 1M
Pugmire et al.[68] source-dest Performance Portability / Comparison with specialized comparators for CPUs and GPUs 134M - 10M (GPUs)
134M (GPUs)
134M (GPUs)
TABLE III: Summary of shared memory particle advection. The asterisk for entries in the data size column represent unstructured grids.
Fig. 3: Scatter plots for CPU and GPU speed-ups for particle advections workloads. The X axis represents the magnitude of the workload in terms of total number of steps for each of the sub plots and the Y axis represents the speed-up. The size of the glyphs corresponds to the number of particles used in the experiment. The data for the plots on the top is collected from numbers reported by Pugmire et al.[68] They used 28 CPU cores and a Nvidia P100 GPU. The data from the plots on the bottom is collected from new experiments using VTK-m. This study used 12 CPU cores and a Nvidia K80 GPU.

GPUs have become a popular accelerator choice in the past decade, with most leading supercomputers using GPUs as accelerators [79]. Part of this has been the availability of specialized toolkits, including early efforts like Brook-GPU [13] and popular efforts like Nvidia’s CUDA [58], that enable GPUs to be used as general purpose computing devices [5]. However, programming applications for efficient execution on a GPU remains challenging for three main reasons. First, unlike CPUs which are built for low latency, GPUs are built for high throughput. CPUs have fewer than a hundred cores, while GPUs have a few thousand. However, each CPU core is significantly more powerful than a single GPU core. Second, efficient use of the GPU requires applications to have sufficiently large parallel workloads. Third, executing a workload on a GPU also has an implicit cost of data movement between the host and the device, where a host is the CPU and the DRAM of the system, and the device is the GPU and its dedicated memory. This cost makes GPUs inefficient for smaller workloads.

This sub-section discusses particle advection using shared memory parallelism in two parts. Section 5.1.1 discuss works to optimize the performance particle advection on GPUs and Section 5.1.1 describes optimizations for cell locators. Section 5.1.2 discusses works that use CPUs for improving the performance of particle advection.

5.1.1 Shared memory GPUs

Most of the solutions that focused on shared memory optimization focused on improving the performance on GPUs. This is because particle advection can benefit from using GPUs when there are many particles to advect. As particles can be advected independently from one another, each particle can be scheduled with a separate thread of the GPU, making the most of the available concurrency. Many works have tried to address performance issues of particle advection using GPUs, however, with different goals.

Krüger et al. [46] presented an approach for interactive visualization of particles in a steady flow field using a GPU. They exploited the GPU’s ability to simultaneously perform advection and render results without moving the data between the CPU and the CPU. This was done by accessing the texture maps in the GPU’s vertex units and writing the advection result. Their approach on the GPU provided the interactive rendering at 41 fps (frames per second) compared to 0.5 fps on the CPU.

Bürger et al. [16] extended the particle advection framework described by Krüger et al. for unsteady flow fields. With their method, unsteady data is streamed to the GPU using a ring-buffer. While the particles are being advected in some time interval , another host thread is responsible for moving from host memory to device memory. At any time, up to three timesteps of data are stored on the device. By decoupling the visualization and data management tasks, particle advection and visualization can occur without delays due to data loading. Bürger et al. [15] further demonstrated the efficacy of their particle tracing framework for visualizing an array of flow features. These features were gathered using some metric of importance, e.g., FTLE, vorticity, helicity, etc.

Bürger et al. [14] also provided a way for interactively rendering streak surfaces. Using GPUs, the streak surfaces can be adaptively refined/coarsened while still maintaining interactivity.

Cell Locators for GPUs

Bußler et al. [17] presented a GPU-based tetrahedral walk for particle advection. Their approach for cell location borrowed heavily from the work by Schiriski et al. [75] discussed in Section 4.2. However, they could execute the cell location strategy entirely on the GPU and do not require the CPU for the initial search. Additionally, they evaluated different Kd-tree traversal strategies to evaluate the impact of these strategies on the tetrahedral walk Their results concluded that the single-pass method, which performs only one pass through the kd-tree to find the nearest cell vertex (without the guarantee of it being the nearest) performs the best. The other strategies evaluated in the study were random restart and backtracking.

Garth and Joy [31] presented an approach for cell location based on bounding interval hierarchies. Their search structure, called celltree

, improves construction times via a heuristic to determine good spatial partitions. The authors presented a use case of advecting a million particles on a GPU in an unstructured grid with roughly 23 million hexahedral elements. The celltree data structure was able to obtain good performance on GPUs despite no GPU-specific optimizations.

5.1.2 Shared Memory Parallelism on CPUs

Hentschel et al. [39] presented a solution that focused on optimizing particle advection on CPUs. Their solution studied the performance benefits of using SIMD extensions on CPUs to acheive better performance. This paper addresses the general tendency of particles to move around in the flow field. This decreases memory locality of that data that are required to perform the advection computation. This work demonstrated the advantage of packaging particles into spatiaily local groups where SIMD extensions are able to be more efficient. Their approach resulted in performance improvements of up to 5.6X over the baseline implementation.

Finally, Pugmire et al. [68] provided a platform portable solution for particle advection using the VTK-m library. The solution builds on data parallel primitives provided by VTK-m. Their results demonstrated very good platform portability, providing comparable performance to platform specific solutions on many-core CPUs and Nvidia GPUs.

5.1.3 Summary

In terms of published research, Table III presents a summary of shared memory particle advection. These studies either presented approaches for interactive flow visualization or optimizations for particle advection of GPUs using cell locators, with one exception that demonstrated platform portability.

Figure 3 shows a preliminary result for understanding the characteristics of shared memory parallel particle advection. Some of the key observations are:

  1. Workloads with more number of particles scale better with added parallelism for the same amount of total work.

  2. The studies that used the RK4 integrator generally scaled better than the ones that used the Euler integrator.

  3. The experiments with unstructured data scaled better on the CPU than on the GPU. This could be because of the nature of memory accesses required by cell locators and justifies more research into GPU based cell locators.

Additionally, the plots for the CPUs demonstrate consistency in terms of scalability when the workload is increased. The plot for the P100 GPUs (top right) suggests that it is not able to scale larger workloads with the same efficiency as the smaller workloads considered by Pugmire et al. [68] There is also a tremendous variation in the speed-ups achieved by two considered GPUs, where the P100 GPU is able to achieve speed-ups of over and the K80 GPU achieves speed-ups of less than . The performance difference of particle advection between two generations of GPUs can be significant. Existing studies fail to capture this relation and makes it harder to estimate to speed-up that can be realized. Understanding the performance characteristics of particle advection across different GPUs is planned as future work.

5.2 Distributed Memory Parallelism for Particle Advection

Fluid simulations are capable of producing large volumes of data. Analyzing and visualizing volumes of data to extract useful information demands resources equivalent to that of the simulation. In most cases, this means access to many nodes of a supercomputer to handle the computational and memory needs of the analysis. Particle advection based flow visualization algorithms often execute in a distributed memory setting. The objective of the distribution of work is to perform efficient computation, memory and I/O operations, and communication. There are multiple strategies for distributing particle advection workloads in a distributed memory setting to achieve these objectives. These can be categorized under two main classes:

Parallelize over particles: Particles are distributed among parallel processes. Each process only advances particles assigned to it. Data is typically loaded as required for each of the particles.

Parallelize over data: Blocks of partitioned data are distributed among parallel processes. Each process only advances particles that occur within the data blocks assigned to it. Particles are communicated between processes based on their data requirement.

Most distributed particle advection solutions are either an optimization of these two classes or a combination of them. The decision to choose between these two classes depends on multiple factors, of which Camp et al. [20] identify the most prominent to be:

The volume of data set: If the data set can fit in memory, it can be easily replicated across nodes and particles can be distributed among nodes, i.e., the work can be parallelized over particles. However, for large partitioned data sets, work parallelized over data can be more efficient.

The number of particles: Some flow visualization algorithms require small number of particles integrated over a long duration, while others require a large number of particles advanced for a short duration. In the case where fewer particles are needed, parallelization over data is a better approach as it could potentially reduce I/O costs. In the case where more particles are needed, parallelization over particles can help better distribute computational costs.

Distribution of particles: The placement of particles for advection can potentially cause performance problems. When using parallelization over data, if particles are concentrated within a small region of the data set, the processes owning the associated data blocks will be responsible for a lot of computation while most other processes remain idle. Parallelization over particles can lead to better work distribution in such cases.

Data set complexity: The characteristics of the vector field have a significant influence on the work for the processes, e.g., if a process owning a data block that contains a sink, most particles will advect towards it, causing the process to do more work than the others. In such a case, parallelize over particles will enable better load balance. On the other hand, when particles switch data blocks often (e.g., a circular vector field), parallelize over data is better since it reduces the costs of I/O to load required blocks.

This section describes distributed particle advection works in two parts. Section 5.2.1 describes the optimization for parallelizing distributed particle advection in more depth. Section 5.2.2 summarizes findings from the survey of distributed particle advection studies.

5.2.1 Parallelization Methods

This section presents distributed particle advection works in three parts. Section 5.2.1 presents works that optimize parallelization over data. Section 5.2.1 presents works that optimize parallelization over particles. Section 5.2.1 presents works that use a combination of parallelization over data and particles.

Parallelization over data

“Parallelize over data” is a paradigm for work distribution in flow visualization where data blocks are distributed among processors. Each process is responsible for performing computations for active particles within the data blocks assigned to them. This method aims to reduce the cost of I/O operations, which is more expensive than the cost of performing computations.

Sujudi and Haimes [78] elicited the problems introduced by decomposing data into smaller blocks that can be used within the working memory of a single node. They presented important work in generating streamlines in a distributed memory setting using the parallelize over data scheme. They used a typical client-server model where clients perform the work, and the server coordinates the work. Clients are responsible for the computation of streamlines within their sub-domain; if a particle hits the boundary of the sub-domain, it requests the server to transfer the streamline to the process that owns the next sub-domain. The server is responsible for keeping track of client request and sending streamlines across to the clients with the correct sub-domain. No details of the method used to decompose the data in sub-domains are provided.

Camp et al. [20] compared the MPI-only implementation to the MPI-hybrid implementation of parallelizing over data. They noticed that the MPI-hybrid version benefits from reduced communication of streamlines across processes and increased throughput when using multiple cores to advance streamlines within data blocks. Their results demonstrated performance improvements between 1.5x-6x in the overall times for the MPI-hybrid version over the MPI-only version. The parallelize over data scheme is sensitive to the distribution of particles and complexity of vector field. The presence of critical points in certain blocks of data can potentially lead to load imbalances. Several techniques have been developed to deal with such cases and can be classified into two categories 1) works that require knowledge of vector field, and 2) works that do not require knowledge of vector field.

Knowledge of vector field required The works classified in this category acquire knowledge of vector fields by performing a pre-processing step. Pre-processing allows for either data or particles to be distributed such that all processes perform the same amount of computation.

Chen et al. presented a method that employs repartitioning of the data based on flow direction, flow features, and the number of particles [25]. They performed pre-processing of the vector field using various statistical and topological methods to enable effective partitioning. The objective of their work is to produce partitions such that the streamlines produced would seldom have to travel between different data blocks. This enabled them to speed up the computation of streamlines due to the reduced communication between processes.

Yu et al. [82]

presented another method that relies on pre-processing the vector field. They treated their spatiotemporal data as 4D data instead of considering the space and time dimensions as separate. They performed adaptive refinement of the 4D data using a higher resolution for regions with flow features and a lower resolution for others. Later, cells in this adaptive grid were clustered hierarchically using a binary cluster tree based on the similarity of cells in a neighborhood. This hierarchical clustering helped them to partition data that ensure workload balance. It also enabled them to render pathlines at different levels of abstraction.

Nouanesengsy et al. [60] used pre-processing to estimate the workload for each data block by advecting the initial set of particles. The estimates calculated from this step are used to distribute the work among processes. Their proposed solution maintained load balance and improved performance. While the solutions in this category are better at load balancing, they introduce an additional step of pre-processing which has its costs. This cost may be expensive and undesirable if the volume of data is significant.

Knowledge of vector field not required The works classified in this category aim to balance load dynamically without any pre-processing.

Peterka et al. [62] performed a study to analyze the effects of data partitioning on the performance of particle tracing. Their study compared static round-robin (also known as block-cyclic) partitioning to dynamic geometric repartitioning. The study concluded that while static round-robin assignment provided good load balancing for random dense distribution of particles, it fails to provide load balancing when data blocks contain critical points. They also noticed that dynamic repartitioning based on workload could improve the execution time between 5% to 25%. However, the costs to perform the repartitioning are restrictive. They suggest more research needs to focus on using less synchronous communication and improvements in computational load balancing.

Nouanesengsy et al. [59] extended the work by Perterka et al. to develop a solution for calculating Finite-Time Lyapunov Exponents (FTLE) for large time-varying data. The major cost in performing FTLE calculations is incurred due to particle tracing. Along with parallelize over data, they also used parallelize over time, which enabled them to create a pipeline that could advect particles in multiple time intervals in parallel. Although their work did not focus on load-balancing among processes, it presented a novel way to optimize time-varying particle tracing. Their work solidifies the conclusions about static data partitioning of the study by Peterka et aL [62].

Zhang et al. [84] proposed a method that is better at achieving dynamic load balancing. Their approach used a new method for domain decomposition, which they term as the constrained K-d tree. Initially, they decompose the data using the K-d tree approach such that there is no overlap in the partitioned data. The partitioned data is then expanded to include ghost regions to the extent that it still fits in memory. Later, the overlapping areas between data blocks become regions to place the splitting plane to repartition data such that each block gets an equal number of particles. Their results demonstrated better load balance was achieved among processes without additional costs of pre-processing and expensive communication. Their results also demonstrate higher parallel efficiency. However, their work made two crucial assumptions 1) an equal number of particles in data blocks might translate to equal work, and 2) the constrained K-d tree decomposition leads to an even distribution of particles. These assumptions do not always hold practically.

In conclusion, pre-processing works can achieve load balance with an additional cost for parallelize over data. This cost goes up with large volumes of data. The overall time for completing particle advection might not benefit from the additional cost of pre-processing, especially when the workload is not compute-intensive. Most solutions that rely on dynamic load balancing suffer from increased communication costs or are affected by the distribution of particles and the complexity of the vector field. The work proposed by Zhang et al. [84] is promising but still does not guarantee optimal load balancing.

Parallelize over particles

“Parallelize over particles” is a paradigm for work distribution in flow visualization where particles are distributed among processors. Most commonly, the particle distribution is done such that each process is responsible for computing the trajectories of particles. Each process is responsible for the computation of streamlines for particles assigned to it. This is done by loading the data blocks required by the process in order to advect the particles. Particles are advected until they can no longer continue within the current data block, in which case another data block is requested and loaded.

Previous works have explored different approaches to optimize the scheme described above. Since the blocks of data are loaded whenever requested, the cost of I/O is a dominant factor in the total time. Prefetching of data involves predicting the next needed data block while continuing to advect particles in the current block to hide the I/O cost. Most commonly, predictions are made by observing the I/O access patterns. Rhodes et al. [70] used these access patterns as a priori knowledge for caching and prefetching to improve I/O performance dynamically. Akande et al. [2] extended their work to unstructured grids. The performance of these methods depends on making correct predictions of the required blocks. One way to improve the prediction accuracy is by using a graph-based approach to model the dependencies between data blocks. Some works used a preprocessing step to construct these graphs  [24, 22, 23]. Guo et al. [36] used the access dependencies to produce fine-grained partitions that could be loaded at runtime for better efficiency of data accesses.

Zhang et al. [85] presented an idea of higher-order access transitions, which produce a more accurate prediction of data accesses. They incorporated historical data access information to calculate access dependencies.

Since particles assigned to a single process might require access to different blocks of data, most of the works using parallelization over particles use a cache to hold multiple data blocks. The process advects all the particles that occur within the blocks of data currently present in the cache. When it is no longer possible to continue computation with the data in the cache, blocks of data are purged, and new blocks are loaded into the cache. Different purging schemes are employed by these methods, among which “Least-Recently Used,” or LRU is most common. Lu et al. [53] demonstrated the benefits of using a cache in their work for generating stream surfaces. They also performed a cache-performance trade-off study to determine the optimal size of the cache.

Camp et al. [20] presented work comparing the MPI only and MPI-hybrid implementations of parallelizing over particles. Their objective was to prove the efficacy of using shared memory parallelism with distributed memory to reduce communication and I/O costs. They observed 2x-10x improvement in the overall time for calculation of streamlines while using the MPI-hybrid version.

Along with caching, Camp et al. [18] also presented work that leveraged different memory hierarchies available on modern supercomputers to improve the performance of particle advection. The objective of the work is to reduce the cost of I/O operations. Their work used Solid State Drives (SSDs) and local disks to store data blocks, where SSDs are used as a cache. Since the cache can only hold limited amounts of data compared to local disks, blocks are purged using the LRU method. When required blocks are not in the cache, the required data is searched in local disks before accessing the file system. The extended hierarchy allows for a larger than usual cache, reducing the need to perform expensive I/O operations.

One trait that makes the parallel computation of integral curves challenging is the dynamic data dependency. The data required to compute the curve cannot be determined in advance unless there is a priori knowledge of the data. However, this information is crucial for optimal load-balanced parallel scheduling. One solution to this problem is to opt for dynamic scheduling. Two well-studied techniques for dynamic scheduling are work-stealing and work-requesting. In both approaches, an idle process acquires work from a busy process. Popularly, idle processes are referred to as thieves, and busy processes are referred to as victims. The major distinction between work-stealing and work requesting is how the thief acquires work from the victim. In work-requesting, the thief requests work items, and the victim voluntarily shares it. In work-stealing, the thief directly accesses the victim’s queue for work items without the victim knowing.

A large body of works addresses work-stealing in task-based parallel systems in general [12, 29, 77]. In the case of integral curve calculation, task-based parallelism inspires the parallelize over particles scheme. Dinan et al. [29] demonstrated the scalability of the work-stealing approach. Lu et al. [53] presented a technique for calculating stream surface efficiently using work-stealing. Their algorithm aimed for the efficient generation of stream surfaces. The seeding curve for streamlines was divided into segments, and these segments were assigned to processes as tasks. In their implementation, each process maintains a queue of segments. When advancing the streamline segment using the front advancing algorithm proposed by Garth et al. [32], if a segment starts to diverge, it is split into two and placed back in the queue. When a processor requires additional data to advance a segment, it requests the data from the processes that own the data block. Their solution demonstrated good load balancing and scalability.

Work stealing has been proven to be efficient in theory and practice. However, Dinan et al. reported its implementation is complicated.

Muller et al. [57] presented an approach that used work requesting for tracing particle trajectories. Their algorithm started by equally distributing all work items (particles) among processes. However, they started by assigning all particles to a single process for performing the load balancing study. Every time an active particle from the work queue is unable to continue in the currently cached data, it is placed at the end of the queue. Whenever a thief tries to request work, the particles from the end of the queue are provided, reducing the current processes’ need to load the data block for the particle. The results reported performance improvements between 30% to 60%.

Binyahib et al. [7] compared the parallelize over particle strategy to parallelize over data for its in-situ applicability. Their findings suggest that for workloads where particles are densely seeded in a certain region of the data, parallelize over partilces is a much better strategy and can result in speedups upto .

According to Childs et al. [27], the dominant factor affecting the performance of parallelizing over particles is I/O. The solution to solve the I/O problem during runtime is to perform prefetching of data. However, works that propose prefetching incur additional costs of making predictions of which blocks to read. Leveraging the memory hierarchy similar to Camp et al. is a good strategy, provided proper considerations for vector field size and complexity are made. Apart from I/O costs, load balancing remains another factor affecting performance adversely. Previous work stealing and work requesting strategies have demonstrated good load balance with additional costs of communicating work items. These costs could potentially be restrictive in the case of workloads with a large number of particles.

Problem Classification Parallelization Strategy
Over Data Over Particles
Dataset size Large Small
Number of particles Small Large
Seed Distribution Sparse Dense
Vector Field Complexity No critical No circular
points field
TABLE IV: Recommendation of parallelization strategy for particle advection workloads based on features of the problem. This table appears in the survey by Binyahib [10].
Algorithm Architecture Procs. Data set Time Seed Application Seeding Intent /
Size steps Count Strategy Evaluation
Yu et al. Intel Xeon 32 644M - 1M streamlines, - hierarchical
[82] (8x4) pathlines representation,
AMD Optron 256 644M 100 1M - strong scaling
Chen et al. Intel Xeon 32 162M - 700 streamlines - data partitioning /
[25] (48x2) strong scaling
Pugmire et al. Cray XT5 (ORNL) 512 512M - 4k, 22K streamlines uniform data loading,
[67] (149K) 512 512M - 10K uniform data partitioning /
512 512M - 20K uniform weak scaling
Peterka et al. PowerPC-450 16k 8B - 128k streamlines, random domain decomposition,
[62] (40960x4) 32K 1.2B 32 16k pathlines random dynamic repartitioning /
strong and weak scaling
Camp et al. Intel Xeon - 512M - 2.5K, 10K streamlines dense, Effects of
[18] Dash (SDSC) - 512M - 2.5K, 10K uniform storage hierarchy
- 512M - 2.5K, 10K
Camp et al. Cray XT4 (NERSC) 128 512M - 2.5K, 10K streamlines dense, MPI-hybrid
[20] 9572x4 128 512M - 2.5K, 10K uniform parallelism
128 512M - 1.5K, 6K
Nouanesengsy PowerPC-450 4K 2B - 256K streamlines random workload aware
et al. [60] (1024x4) 4K 1.2B - 128K random domain decomposition /
strong and weak scaling
Nouanesengsy PowerPC-450 1k 8M 29 186M FTLE uniform pipelined temporal
et al. [59] (40960x4) 1K 25M 48 65.2M uniform advection, caching /
16k 345M 36 288M uniform strong and weak scaling
16K 43.5M 50 62M uniform
Camp et al. Cray XT4 (NERSC) 128 512M - 128 stream surface rake Comparison of
[19] (9572x4) 128 512M - 361 rake parallelization algorithms
128 512M - 128 rake for stream surfaces
Muller et al. AMD Magny-Cours 1K 32M 735 1M streamlines, uniform work requesting /
[57] (6384x24) pathlines load balancing,
strong scaling
Childs et al. Nvidia Kepler 8 1B - 8M source-dest uniform Distriburted particle
[26] (1 GPU / Proc) advection over different
Intel Xeon 192 1B - 8M hardware architectures /
comparison, strong scaling
Guo et al. Intel Xeon 64 755M 100 - streak surface, seed line sparse data
[36] (8x8) 64 3.75M 24 200 pathlines, uniform management /
Intel Xeon 512 25M 48 - FTLE uniform strong scaling
Lu et al. PowerPC A2 1K 25M - 32K stream surface rakes caching,
[53] (2048x16) 4K 80M - 32K rakes performance /
8K 500M - 32K rakes strong scaling
8K 2B - 64K rakes
Zhang et al. Intel Xeon 64 3.75M 24 6250 pathlines uniform data prefetching /
[85] (8x8) 64 25M 48 4096 - strong scaling
Zhang et al. PowerPC A2 8K 1B - 128M streamlines, - domain decomposition,
[84] (2048x16) 8K 3.8M 24 8M source-dest, - using K-d trees /
8K 25M 48 24M FTLE uniform strong and weak scaling
Binyahib et al. Intel Xeon 512 67M - 1M source-dest dense, In situ parallelization
[7] (2388x32) uniform over particles
Binyahib et al. Intel Xeon 1K 34B - 343M source-dest dense, Comparison of
[6] (2388x32) (8K cores) - uniform parallelization algorithms
Binyahib et al. Intel Xeon 1K 34B - 343M source-dest dense, novel hybrid
[8] (2388x32) (8K cores) - uniform parallelization algorithm
TABLE V: Summary of large scale distributed particle advection worklets. The numbers in parenthesis in the Architecture column represent the total number of cores available on the execution platform.
Application Particles /1k Cells
Souce-destination 72222.20
FTLE 5013.02
Streamlines 9.89
Pathlines 6.93
Stream surface 0.25
TABLE VI: Number of particles used per one thousand cells of data for different applications from works described in Table V.
Fig. 4: Weak scaling plots for distributed memory particle advection based on the comparison of four parallelization algorithms by Binyahib et al. [6]. The plots present performance comparison of the algorithms for two different workloads. The large workload used 1 particle per 100 cells where each particle advanced 10K steps. The small workload used 1 particle per 10K cells where each particle advanced 1K steps. The plots on the top present the performance of the algorithms in terms of throughput along the Y axis, while the plots on the bottom present the parallel efficiency while using weak scaling along the Y axis. In all cases the X axis represents the number of MPI ranks used to perform the experiments.
Hybrid Particle Advection

The works described in this section combine parallelize over data and parallelize over particles schemes to achieve optimal load balance. Pugmire et al. [67] introduced an algorithm that uses a master-worker model. The processes were divided into groups, and each group had a master process. The master is responsible for maintaining the load balance between processes as it coordinates the assignment of work. The algorithm begins with statically partitioning the data. All processes load data on demand. Whenever a process needs to load data for advancing its particles, it coordinates with the master. The master decides whether it is more efficient for the process to load data or to send its particles to another process. The method proved to be more efficient in I/O and communication than the traditional parallelization approaches.

Kendall et al. [44] provided a hybrid solution which they call DStep and works like the MapReduce framework [28]. Their algorithm used groups for processes as well and has a master to coordinate work among different groups. A static round-robin partitioning strategy is used to assign data blocks to processes, similar to Peterka et al. [62]. The work of tracking particles is split among groups where the master process maintains a work queue and assigns work to processes in its group. Processors within a group can communicate particles among them. However, particles across groups can only be communicated by the master processes. The algorithm provided an efficient and scalable solution for particle tracing and has been used by other works [35, 34, 51].

Binyahib et al [8] proposed a new ‘HyLiPoD’ algorithm for particle advection. Their work was inspired from the finding of the previous bake-off study comparing different distributed particle advection strategies [6]. HyLiPoD is short for Hybrid Lifeline and Parallelize over Data, and the algorithm aims to choose the best strategy between the Lifeline algorithm [9] and parallelize over data for distributed particle advection given a certain workload.

5.2.2 Summary

This section summarizes distributed particle advection in two parts. First, general take-aways are discussed based on the various factors discussed in the introduction of this section. Second, observations from the studies in terms of their particle advection workloads are presented.

Table IV provides a simple lookup for a parallelization strategy based on various workload factors discussed earlier in the section. These strategies were presented in a survey by Binyahib [10]. Parallelize over data is best suited when the data set volume is large. However, in the presence of flow features like critical points and vortices, parallelize over data can suffer from load imbalance. While several methods have been proposed for data repartitioning for load-balanced computation, these works incur the cost of pre-processing and redistributing data. Parallelize over particles is best suited when the number of particles is large. It can suffer from load imbalance due to inconsistencies in the computational work for different particles. Some works aim to address the problem of load imbalance but have added costs of pre-processing, communication, and I/O. Hybrid solutions demonstrate better scalability and efficiency compared to the traditional methods. However, implementing these methods is very complicated and typically has some added cost of communication and I/O.

Figure 4 shows a comparison of scaling behaviors of four parallelization algorithms, extracted from the study presented by Binyahib et al. [6]. These algorithms include parallelize over particles, parallelize over data, Lifeline Scheduling Method (LSM, an extension of parallelize over particles) [9], and master-worker (a hybrid parallel algorithm). The Figure presents a weak scaling of these algorithms. The top row plots show the throughput of these algorithms in terms of number of steps completed by each MPI rank per second. The bottom row plots show the efficiency of weak scaling achieved by the different algorithms. The efficiency of the algorithms drop significantly as the concurrency and workload are increased. The drop is more significant in smaller workloads than in larger workloads. The only study which compared the scaling behaviors of the most widely used parallelization algorithm used weak scaling. In order to be able to quantify the speed-ups resulting from added distributed parallelism for a given workload, a strong scaling study is necessary. The strong scaling study for these algorithms is a potential avenue for future research.

Table V summarizes large-scale parallel particle advection-based flow visualization studies in terms of the distributed executions and the magnitudes of the workloads. The platforms used by the considered studies in this section span from desktop computers to large supercomputers. The work with the least amount of processes and workload in this survey is by Chen et al. [25], which used only 32 processes to produce 700 streamlines. The work with the largest number of processes was by Nouanesengsy et al. [59], which used 16 thousand processes for FTLE calculation. However, the work with the most workload was by Binyahib et al. [6], which used 343 million particles for advection in data with 34 billion cells.

Table VI summarizes the number of particles used in proportion to the size of the data used in the works included in Table V. Stream surface generation is the application that required the least amount of particles. A significant part of the cost of generating stream surfaces comes from triangulating the surfaces from the advected streamlines. These streamlines cannot be numerous as they may lead to issues like occlusion. Source-destination queries use the most particles in proportion to the data size. All other applications need to store a lot of information in addition to the final location of the particle — streamlines and pathlines need to save intermediate locations for representing the trajectories, stream surfaces need the triangulated surface for rendering, and FTLE analysis needs to generate an additional scalar field. Source-destination analysis has no such costs and can instead use the savings in storage and computation to incorporate more particles.

5.3 Hybrid Parallelism for Particle Advection

Hybrid parallelism refers to a combination of using shared- and distributed-memory parallel techniques. For these works, the distributed-memory elements managed dividing work among nodes, and the shared-memory parallelism approach was providing a “pool” of cores that could advect particles quickly. Camp et al.[20] presented two approaches that used multi-core processors to parallelize particle advection 1) parallelization over particles, and 2) parallelize over data blocks. In both cases, the authors aimed to use the N allocated cores. For parallelization over particles, N worker threads were used along with I/O threads. The worker threads are responsible for performing particle advection. The I/O threads manage the cache of data blocks to support the worker threads. For parallelization over data blocks, worker threads are used, which access the cache of data blocks directly, and an additional thread was used for communicating results with other processes.

Camp et al. [21] also extended their previous work to GPUs. One of their objectives was to compare particle advection performance on the GPU against CPU under different workloads. They varied the datasets, the number of particles, and the duration of advection for their experiments. Their findings suggest that in the case where the workloads have fewer particles or longer durations, the CPU performed better. However, in most other cases, the GPU was able to outperform the CPU.

Childs et al. [26] explored particle advection performance across various GPUs (counts and device) and CPUs (processors and concurrency). Their objective was to explore the relationship between parallel device choice and the execution time for particle advection. Two of their key findings were: 1) For CPUs, adding more cores benefited workloads that execute for medium to longer duration, 2) CPUs with many cores were as performant as GPUs and often outperformed GPUs for small workloads with short execution times. 3) With higher particle densities ( or more) GPUs can be saturated and result in performance imporvements proportpional to their FLOP rates, faster GPUs can provide better speedups.

Jiang et al. [42] studied shared memory multi-threaded generation of streamlines with a locally attached NVRAM. Their particular area of interest was in understanding data movement strategies that will keep the threads busy performing particle advection. They used two data management strategies. The first used explicit I/O to access data. The second was a kernel-managed implicit I/O method that used memory-mapping to provide access to data. Their study indicated that thread oversubscription of streamline tasks is an effective method for hiding I/O latency, which is bottleneck for particle advection.

6 Conclusion and Future Work

Fig. 5: A flow chart to determine the potential optimizations to be applied to a flow visualization algorithm.

This survey has provided a guide to particle advection performance. The first two parts focused on high-level concepts: particle advection building blocks and a cost model. The last two parts surveyed existing approaches for algorithmic optimizations and parallelism. While the guide has summarized research findings to date, additional research can make the guide more complete. Looking ahead to future work, we feel this survey has illuminated three types of gaps in the area of particle advection performance.

The first type of gap involves the lack of holistic studies to inform behavior across diverse workloads. Adaptive step sizing, since its focus is more on accuracy than performance, can lead to highly varying speedups. Understanding when speedups occur and their magnitude would be very helpful for practitioners when deciding whether to include this approach. Similarly, the expected speedup for a GPU is highly varied based on workload and GPU architecture. While this survey was able to synthesize results from a recent study [68], significantly more detail would be useful.

The second type of gap covers possible optimizations that have not yet been pursued. All of the hardware efficiency works in this survey involved parallelism, yet there are still additional hardware optimizations available. In the ray tracing community — similar to particle advection in that rays move through a volume in a data-dependent manner — packet tracing, where rays on similar trajectories are traced together, has led to significant speedups. Further, there can be significant improvement from complex schemes. For example, Benthin et al. [4] employed a hybrid approach that generates and traces rays in packets and then automatically switches to tracing rays individually when they diverge. This hybrid algorithm outperforms conventional packet tracers by up to 2X. Finally, there are additional types of optimizations. Taking another example from ray tracing, Morrical et al. [56] presented a method that improved the performance of direct unstructured mesh point location [74] by using Nvidia RTX GPU. Their approach re-implemented the point location problem as a ray tracing problem, which enabled tracing the points using the hardware. Their results showed equal or better performance compared to state-of-the-art solutions and could provide inspiration for improved cell locators on GPUs for particle advection.

The third type of gap is in cost modeling. While our cost model is useful in the context of providing a guide for particle advection performance, future work could make the model more powerful. One possible extension relates to the first type of gap (lack of holistic studies). In particular, holistic studies on expected speedups over diverse workloads would enable adaptive step sizing and GPU acceleration to fit within the model. Another possible optimization is to broaden the model. In particular, I/O is not part of our cost model, so optimizations for I/O efficiency cannot be included at this time. Further, precomputation likely requires a different type of model altogether, i.e., using a form of the current model for regular particle advection and a different model for precomputation, and selecting the best approach from the two. Finally, distributed-memory parallelism is often applied to very large data sets that cannot fit into memory of a single node, which significantly increases modeling complexity. That said, these extensions could have significant benefit. One benefit would be using prediction to adapt workloads to fit available runtime. A second (perhaps more powerful) benefit would be to enable a workflow for decision-making. This workflow is shown via a flow chart in Figure 5, and would operate in three steps. In the first step, the desired workload would be analyzed to see how many operations need to be performed. In the second step, the analysis from the first step would be used to estimate the execution time costs to execute the algorithm. In the third step, the estimated costs from the second step would be compared to user requirements. If the estimated costs are within the user’s budget, then no optimizations are necessary and the workload can be executed as is. If not, then candidate optimizations should be considered and the workflow should be repeated with candidate optimizations until the desired runtime is predicted.



  • [1] A. Agranovsky, D. Camp, C. Garth, E. W. Bethel, K. I. Joy, and H. Childs (2014) Improved post hoc flow analysis via lagrangian representations. In 2014 IEEE 4th Symposium on Large Data Analysis and Visualization (LDAV), pp. 67–75. Cited by: §4.4.
  • [2] O. O. Akande and P. J. Rhodes (2013) Iteration aware prefetching for unstructured grids. In 2013 IEEE International Conference on Big Data, pp. 219–227. Cited by: §5.2.1.
  • [3] N. Andrysco and X. Tricoche (2010) Matrix trees. Comput. Graph. Forum 29 (3), pp. 963–972. External Links: Link, Document Cited by: §4.2.
  • [4] C. Benthin, I. Wald, S. Woop, M. Ernst, and W. R. Mark (2012) Combining single and packet-ray tracing for arbitrary ray distributions on the intel mic architecture. IEEE Transactions on Visualization and Computer Graphics 18 (9), pp. 1438–1448. External Links: Document Cited by: §6.
  • [5] K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, K. Hill, J. Hiller, et al. (2008) Exascale computing study: technology challenges in achieving exascale systems. Defense Advanced Research Projects Agency Information Processing Techniques Office (DARPA IPTO), Tech. Rep 15. Cited by: §5.1.
  • [6] R. Binyahib, D. Pugmire, A. Yenpure, and H. Childs (2020) Parallel particle advection bake-off for scientific visualization workloads. In 2020 IEEE International Conference on Cluster Computing (CLUSTER), Vol. , pp. 381–391. Cited by: Fig. 4, §5.2.1, §5.2.2, §5.2.2, TABLE V.
  • [7] R. Binyahib, D. Pugmire, and H. Childs (2019) In situ particle advection via parallelizing over particles. In Proceedings of the Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization, pp. 29–33. Cited by: §5.2.1, TABLE V.
  • [8] R. Binyahib, D. Pugmire, and H. Childs (2021) HyLiPoD: parallel particle advection via a hybrid of lifeline scheduling and parallelization-over-data. Cited by: §5.2.1, TABLE V.
  • [9] R. Binyahib, D. Pugmire, B. Norris, and H. Childs (2019) A lifeline-based approach for work requesting and parallel particle advection. In 2019 IEEE 9th Symposium on Large Data Analysis and Visualization (LDAV), pp. 52–61. Cited by: §5.2.1, §5.2.2.
  • [10] R. Binyahib (2019) Scientific visualization on supercomputers: a survey. University of Oregon, Computer and Information Sciences Department, Eugene OR. Note: Area ExamAvailable at http://www.cs.uoregon.edu/Reports/AREA-201903-Binyahib.pdf (2019/12/12) Cited by: §5.2.2, TABLE IV.
  • [11] R. Bleile, L. Sugiyama, C. Garth, and H. Childs (2017) Accelerating advection via approximate block exterior flow maps. Electronic Imaging 2017 (1), pp. 140–148. Cited by: §4.4.
  • [12] R. D. Blumofe and C. E. Leiserson (1999) Scheduling multithreaded computations by work stealing. Journal of the ACM (JACM) 46 (5), pp. 720–748. Cited by: §5.2.1.
  • [13] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan (2004) Brook for gpus: stream computing on graphics hardware. ACM transactions on graphics (TOG) 23 (3), pp. 777–786. Cited by: §5.1.
  • [14] K. Buerger, F. Ferstl, H. Theisel, and R. Westermann (2009) Interactive streak surface visualization on the gpu. IEEE Transactions on Visualization and Computer Graphics 15 (6), pp. 1259–1266. Cited by: §5.1.1, TABLE III.
  • [15] K. Burger, P. Kondratieva, J. Kruger, and R. Westermann (2008) Importance-driven particle techniques for flow visualization. In 2008 IEEE Pacific Visualization Symposium, pp. 71–78. Cited by: §5.1.1, TABLE III.
  • [16] K. Bürger, J. Schneider, P. Kondratieva, J. H. Krüger, and R. Westermann (2007) Interactive visual exploration of unsteady 3d flows.. In EuroVis, pp. 251–258. Cited by: §5.1.1, TABLE III.
  • [17] M. Bußler, T. Rick, A. Kelle-Emden, B. Hentschel, and T. W. Kuhlen (2011) Interactive particle tracing in time-varying tetrahedral grids.. In EGPGV, pp. 71–80. Cited by: §4.2, §5.1.1, TABLE III.
  • [18] D. Camp, H. Childs, A. Chourasia, C. Garth, and K. I. Joy (2011) Evaluating the benefits of an extended memory hierarchy for parallel streamline algorithms. In 2011 IEEE Symposium on Large Data Analysis and Visualization, pp. 57–64. Cited by: §5.2.1, TABLE V.
  • [19] D. Camp, H. Childs, C. Garth, D. Pugmire, and K. I. Joy (2012) Parallel stream surface computation for large data sets. In Ieee symposium on large data analysis and visualization (ldav), pp. 39–47. Cited by: TABLE V.
  • [20] D. Camp, C. Garth, H. Childs, D. Pugmire, and K. I. Joy (2011-11) Streamline Integration Using MPI-Hybrid Parallelism on a Large Multicore Architecture. IEEE Transactions on Visualization and Computer Graphics (TVCG) 17, pp. 1702–1713. External Links: ISSN 1077-2626 Cited by: §5.2.1, §5.2.1, §5.2, §5.3, TABLE V.
  • [21] D. Camp, H. Krishnan, D. Pugmire, C. Garth, I. Johnson, E. W. Bethel, K. I. Joy, and H. Childs (2013) GPU Acceleration of Particle Advection Workloads in a Parallel, Distributed Memory Setting. In Eurographics Symposium on Parallel Graphics and Visualization, External Links: ISSN 1727-348X, ISBN 978-3-905674-45-3 Cited by: §5.3.
  • [22] C. Chen, B. Nouanesengsy, T. Lee, and H. Shen (2012) Flow-guided file layout for out-of-core pathline computation. In IEEE Symposium on Large Data Analysis and Visualization (LDAV), pp. 109–112. Cited by: §4.3, §4.5, TABLE II, §5.2.1.
  • [23] C. Chen and H. Shen (2013) Graph-based seed scheduling for out-of-core ftle and pathline computation. In 2013 IEEE symposium on large-scale data analysis and visualization (LDAV), pp. 15–23. Cited by: §4.3, §4.5, TABLE II, §5.2.1.
  • [24] C. Chen, L. Xu, T. Lee, and H. Shen (2011) A flow-guided file layout for out-of-core streamline computation. In 2011 IEEE Symposium on Large Data Analysis and Visualization, pp. 115–116. Cited by: §4.3, §4.4, §4.5, TABLE II, §5.2.1.
  • [25] L. Chen and I. Fujishiro (2008) Optimizing parallel performance of streamline visualization for large distributed flow datasets. In 2008 IEEE Pacific Visualization Symposium, pp. 87–94. Cited by: §5.2.1, §5.2.2, TABLE V.
  • [26] H. Childs, S. Biersdorff, D. Poliakoff, D. Camp, and A. D. Malony (2014) Particle advection performance over varied architectures and workloads. In 2014 21st International Conference on High Performance Computing (HiPC), pp. 1–10. Cited by: §5.3, TABLE V.
  • [27] H. Childs, D. Pugmire, S. Ahern, B. Whitlock, M. Howison, G. H. Weber, E. W. Bethel, et al. (2010) Extreme scaling of production visualization software on diverse architectures. IEEE Computer Graphics and Applications (3), pp. 22–31. Cited by: §5.2.1.
  • [28] J. Dean and S. Ghemawat (2008) MapReduce: simplified data processing on large clusters. Communications of the ACM 51 (1), pp. 107–113. Cited by: §5.2.1.
  • [29] J. Dinan, D. B. Larkins, P. Sadayappan, S. Krishnamoorthy, and J. Nieplocha (2009) Scalable work stealing. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pp. 1–11. Cited by: §5.2.1.
  • [30] M. Fiser (2013)(Website) External Links: Link Cited by: Fig. 2.
  • [31] C. Garth and K. I. Joy (2010) Fast, memory-efficient cell location in unstructured grids for visualization. IEEE Transactions on Visualization and Computer Graphics 16 (6), pp. 1541–1550. Cited by: §4.2, §4.2, §5.1.1, TABLE III.
  • [32] C. Garth, X. Tricoche, T. Salzbrunn, T. Bobach, and G. Scheuermann (2004) Surface techniques for vortex visualization.. In VisSym, Vol. 4, pp. 155–164. Cited by: §5.2.1.
  • [33] B. Geveci, W. Schroeder, A. Brown, and G. Wilson (2012) VTK.

    The Architecture of Open Source Applications

    1, pp. 387–402.
    Cited by: §4.1.
  • [34] H. Guo, F. Hong, Q. Shu, J. Zhang, J. Huang, and X. Yuan (2014) Scalable lagrangian-based attribute space projection for multivariate unsteady flow data. In 2014 IEEE Pacific Visualization Symposium, pp. 33–40. Cited by: §5.2.1.
  • [35] H. Guo, X. Yuan, J. Huang, and X. Zhu (2013) Coupled ensemble flow line advection and analysis. IEEE Transactions on Visualization and Computer Graphics 19 (12), pp. 2733–2742. Cited by: §5.2.1.
  • [36] H. Guo, J. Zhang, R. Liu, L. Liu, X. Yuan, J. Huang, X. Meng, and J. Pan (2014) Advection-based sparse data management for visualizing unsteady flow. IEEE transactions on visualization and computer graphics 20 (12), pp. 2555–2564. Cited by: §5.2.1, TABLE V.
  • [37] E. Hairer, S.P. Nørsett, and G. Wanner (2000) Solving ordinary differential equations I nonstiff problems. Second edition, Springer, Berlin. Cited by: §4.1, §4.1.
  • [38] M. D. Hanwell, K. M. Martin, A. Chaudhary, and L. S. Avila (2015) The Visualization Toolkit (VTK): Rewriting the rendering code for modern graphics cards. SoftwareX 1, pp. 9–12. Cited by: §4.1.
  • [39] B. Hentschel, J. H. Göbbert, M. Klemm, P. Springer, A. Schnorr, and T. W. Kuhlen (2015) Packet-Oriented Streamline Tracing on Modern SIMD Architectures. In Eurographics Symposium on Parallel Graphics and Visualization, C. Dachsbacher and P. Navrátil (Eds.), External Links: Document Cited by: §5.1.2.
  • [40] D. Hilbert (1891) Über die stetige abbildung einer linie aufein flächenstück. Mathematische Annalen 38, pp. 459–460. Cited by: §4.3.
  • [41] M. Hlawatsch, F. Sadlo, and D. Weiskopf (2010) Hierarchical line integration. IEEE transactions on visualization and computer graphics 17 (8), pp. 1148–1163. Cited by: §4.4.
  • [42] M. Jiang, B. V. Essen, C. Harrison, and M. B. Gokhale (2014) Multi-threaded streamline tracing for data-intensive architectures. In 4th IEEE Symposium on Large Data Analysis and Visualization, LDAV 2014, Paris, France, November 9-10, 2014, H. Childs, R. Pajarola, and V. Vishwanath (Eds.), pp. 11–18. External Links: Link, Document Cited by: §5.3.
  • [43] J. Kasten, C. Petz, I. Hotz, H. Hege, B. R. Noack, and G. Tadmor (2010) Lagrangian feature extraction of the cylinder wake. Physics of fluids 22 (9), pp. 091108. Cited by: Fig. 2.
  • [44] W. Kendall, J. Wang, M. Allen, T. Peterka, J. Huang, and D. Erickson (2011) Simplified parallel domain traversal. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 10. Cited by: §5.2.1.
  • [45] D. N. Kenwright and D. A. Lane (1996) Interactive time-dependent particle tracing using tetrahedral decomposition. IEEE Transactions on Visualization and Computer Graphics 2 (2), pp. 120–129. Cited by: §4.2.
  • [46] J. Kruger, P. Kipfer, P. Konclratieva, and R. Westermann (2005) A particle system for interactive visualization of 3d flows. IEEE Transactions on visualization and computer graphics 11 (6), pp. 744–756. Cited by: §5.1.1, TABLE III.
  • [47] M. Langbein, G. Scheuermann, and X. Tricoche (2003) An efficient point location method for visualization in large unstructured grids. In Proceedings of Vision, Modeling, Visualization, Cited by: §4.2, §4.2.
  • [48] R. S. Laramee, H. Hauser, H. Doleisch, B. Vrolijk, F. H. Post, and D. Weiskopf (2004) The state of the art in flow visualization: dense and texture-based techniques. In Computer Graphics Forum, Vol. 23, pp. 203–221. Cited by: §1.
  • [49] R. S. Laramee, H. Hauser, L. Zhao, and F. H. Post (2007) Topology-based flow visualization, the state of the art. In Topology-based methods in visualization, pp. 1–19. Cited by: §1.
  • [50] C. Lexi (2015)(Website) External Links: Link Cited by: Fig. 2.
  • [51] R. Liu, H. Guo, J. Zhang, and X. Yuan (2016) Comparative visualization of vector field ensembles based on longest common subsequence. In 2016 IEEE Pacific Visualization Symposium (PacificVis), pp. 96–103. Cited by: §5.2.1.
  • [52] R. Löhner and J. Ambrosiano (1990) A vectorized particle tracer for unstructured grids. Journal of Computational Physics 91 (1), pp. 22–31. Cited by: §4.2, §4.5, TABLE II.
  • [53] K. Lu, H. Shen, and T. Peterka (2014) Scalable computation of stream surfaces on large scale vector fields. In SC’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1008–1019. Cited by: §5.2.1, §5.2.1, TABLE V.
  • [54] T. McLoughlin, R. S. Laramee, R. Peikert, F. H. Post, and M. Chen (2010) Over two decades of integration-based, geometric flow visualization. In Computer Graphics Forum, Vol. 29, pp. 1807–1829. Cited by: §1.
  • [55] K. Moreland, C. Sewell, W. Usher, L. Lo, J. Meredith, D. Pugmire, J. Kress, H. Schroots, K. Ma, H. Childs, et al. (2016) Vtk-m: accelerating the visualization toolkit for massively threaded architectures. IEEE computer graphics and applications 36 (3), pp. 48–58. Cited by: §A.3.
  • [56] N. Morrical, I. Wald, W. Usher, and V. Pascucci (2020) Accelerating Unstructured Mesh Point Location with RT Cores. IEEE Transactions on Visualization and Computer Graphics. External Links: Document Cited by: §6.
  • [57] C. Müller, D. Camp, B. Hentschel, and C. Garth (2013) Distributed parallel particle advection using work requesting. In 2013 IEEE Symposium on Large-Scale Data Analysis and Visualization (LDAV), pp. 1–6. Cited by: §5.2.1, TABLE V.
  • [58] J. Nickolls (2007) GPU parallel computing architecture and cuda programming model. In 2007 IEEE Hot Chips 19 Symposium (HCS), pp. 1–12. Cited by: §5.1.
  • [59] B. Nouanesengsy, T. Lee, K. Lu, H. Shen, and T. Peterka (2012) Parallel particle advection and FTLE computation for time-varying flow fields. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 61. Cited by: §5.2.1, §5.2.2, TABLE V.
  • [60] B. Nouanesengsy, T. Lee, and H. Shen (2011) Load-balanced parallel streamline generation on large scale vector fields. IEEE Transactions on Visualization and Computer Graphics 17 (12), pp. 1785–1794. Cited by: §4.4, §5.2.1, TABLE V.
  • [61] T. M. Özgökmen, A. C. Poje, P. F. Fischer, H. Childs, H. Krishnan, C. Garth, A. C. Haza, and E. Ryan (2012-10) On Multi-Scale Dispersion Under the Influence of Surface Mixed Layer Instabilities. Ocean Modelling 56, pp. 16–30. Cited by: §1.
  • [62] T. Peterka, R. Ross, B. Nouanesengsy, T. Lee, H. Shen, W. Kendall, and J. Huang (2011) A study of parallel particle tracing for steady-state and time-varying flow fields. In 2011 IEEE International Parallel & Distributed Processing Symposium, pp. 580–591. Cited by: §5.2.1, §5.2.1, §5.2.1, TABLE V.
  • [63] A. Pobitzer, R. Peikert, R. Fuchs, B. Schindler, A. Kuhn, H. Theisel, K. Matković, and H. Hauser (2011) The state of the art in topology-based visualization of unsteady flow. In Computer Graphics Forum, Vol. 30, pp. 1789–1811. Cited by: §1.
  • [64] S. Popinet (2003) Gerris: a tree-based adaptive solver for the incompressible euler equations in complex geometries. Journal of Computational Physics 190 (2), pp. 572–600. Cited by: Fig. 2.
  • [65] F. H. Post, B. Vrolijk, H. Hauser, R. S. Laramee, and H. Doleisch (2003) The state of the art in flow visualisation: feature extraction and tracking. In Computer Graphics Forum, Vol. 22, pp. 775–792. Cited by: §1.
  • [66] P. J. Prince and J. R. Dormand (1981) High order embedded runge-kutta formulae. Journal of Computational and Applied Mathematics 7 (1). Cited by: §4.1.
  • [67] D. Pugmire, H. Childs, C. Garth, S. Ahern, and G. H. Weber (2009) Scalable computation of streamlines on very large datasets. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pp. 16. Cited by: §5.2.1, TABLE V.
  • [68] D. Pugmire, A. Yenpure, M. Kim, J. Kress, R. Maynard, H. Childs, and B. Hentschel (2018) Performance-Portable Particle Advection with VTK-m. In Eurographics Symposium on Parallel Graphics and Visualization, H. Childs and F. Cucchietti (Eds.), External Links: ISSN 1727-348X, ISBN 978-3-03868-054-3, Document Cited by: Fig. 3, §5.1.2, §5.1.3, TABLE III, §6.
  • [69] T. Rapp, C. Peters, and C. Dachsbacher (2019) Void-and-cluster sampling of large scattered data and trajectories. IEEE transactions on visualization and computer graphics 26 (1), pp. 780–789. Cited by: §4.4.
  • [70] P. J. Rhodes, X. Tang, R. D. Bergeron, and T. M. Sparr (2005) Iteration aware prefetching for large multidimensional scientific datasets. In Proc. of the 17th international conference on Scientific and statistical database management (SSDBM), pp. 45–54. Cited by: §5.2.1.
  • [71] A. R. Sanderson, G. Chen, X. Tricoche, D. Pugmire, S. Kruger, and J. A. Breslau (2010) Analysis of recurrent patterns in toroidal magnetic fields. IEEE Trans. Vis. Comput. Graph. 16 (6), pp. 1431–1440. External Links: Link, Document Cited by: §4.1.
  • [72] S. Sane, R. Bujack, C. Garth, and H. Childs (2020) A survey of seed placement and streamline selection techniques. In Computer Graphics Forum, Vol. 39, pp. 785–809. Cited by: §1.
  • [73] S. Sane, H. Childs, and R. Bujack (2019) An interpolation scheme for vdvp lagrangian basis flows.. In EGPGV@ EuroVis, pp. 109–119. Cited by: §4.4.
  • [74] R. Sawhney and K. Crane (2020-07) Monte carlo geometry processing: a grid-free approach to pde-based methods on volumetric domains. ACM Trans. Graph. 39 (4). External Links: ISSN 0730-0301, Link, Document Cited by: §6.
  • [75] M. Schirski, C. Bischof, and T. Kuhlen (2006) Interactive particle tracing on tetrahedral grids using the gpu. In Proceedings of Vision, Modeling, and Visualization (VMV), pp. 153–160. Cited by: §4.2, §5.1.1, TABLE III.
  • [76] W. Schroeder, K. Martin, and B. Lorensen (1997) The visualization toolkit: an object-oriented approach to 3d graphics. Prentice Hall. Cited by: §4.2.
  • [77] S. Shiina and K. Taura (2019) Almost deterministic work stealing. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’19, New York, NY, USA, pp. 47:1–47:16. External Links: ISBN 978-1-4503-6229-0 Cited by: §5.2.1.
  • [78] D. Sujudi and R. Haimes (1996) Integration of particles and streamlines in a spatially-decomposed computation. Proceedings of Parallel Computational Fluid Dynamics, Los Alamitos, CA. Cited by: §5.2.1.
  • [79] (2020)(Website) External Links: Link Cited by: §5.1.
  • [80] S. Ueng, C. Sikorski, and K. Ma (1996) Efficient streamline, streamribbon, and streamtube constructions on unstructured grids. IEEE Transactions on Visualization and Computer Graphics 2 (2), pp. 100–110. Cited by: §4.2, §4.5, TABLE II.
  • [81] J. Wilhelms and A. van Gelder (1992) Octrees for faster isosurface generation. ACM Transactions of Graphics 11 (3), pp. 201–227. Cited by: §4.2.
  • [82] H. Yu, C. Wang, and K. Ma (2007) Parallel hierarchical visualization of large time-varying 3d vector fields. In Proceedings of the 2007 ACM/IEEE conference on Supercomputing, pp. 24. Cited by: §5.2.1, TABLE V.
  • [83] C. Zhang, R. Zhang, and H. Ou (2003) The z curve database: a graphic representation of genome sequences. Bioinformatics 19 (5), pp. 593–599. Cited by: §4.3.
  • [84] J. Zhang, H. Guo, F. Hong, X. Yuan, and T. Peterka (2017) Dynamic load balancing based on constrained kd tree decomposition for parallel particle tracing. IEEE transactions on visualization and computer graphics 24 (1), pp. 954–963. Cited by: §5.2.1, §5.2.1, TABLE V.
  • [85] J. Zhang, H. Guo, and X. Yuan (2016) Efficient unsteady flow visualization with high-order access dependencies. In 2016 IEEE Pacific Visualization Symposium (PacificVis), pp. 80–87. Cited by: §5.2.1, TABLE V.
  • [86] J. Zhang and X. Yuan (2018) A survey of parallel particle tracing algorithms in flow visualization. Journal of Visualization 21 (3), pp. 351–368. Cited by: §1.
  • [87] R. Zhang and C. Zhang (1994) Z curves, an intutive tool for visualizing and analyzing the dna sequences. Journal of Biomolecular Structure and Dynamics 11 (4), pp. 767–782. Cited by: §4.3.

Appendix A Evaluating Particle Advection Performance Cost Model

This appendix further evaluates our cost model for particle advection performance in three parts. First, it considers the actual costs for terms in the cost model, measured in floating point operations. Second, it considers a notional example of how the cost model can be used to predict the number of floating point operations. Finally, it provides an evaluation of the cost model’s accuracy on various workloads, and also provides an approach for converting the output of the cost model to execution time.

a.1 Evaluating Cost Model Terms in Floating-Point Operations

Solver Data set Total
Euler Uniform 6 15 15 5 41
Rectilinear 6 17 15 5 43
Unstructured 6 918 35 5 964
RK4 Uniform 37 15 15 5 162
Rectilinear 37 17 15 5 170
Unstructured 37 918 35 5 3854
TABLE VII: Analytical cost calculation for particle advection. The costs in the table are by reference of a grid. Hence, for the next two equations, . Rectilinear location costs : . Unstructured location costs : . The costs for unstructured grid are highlighted in red as these are estimates based on a tree structure. This study assumes FLOPs for checking each level of the tree. An additional FLOPs are required to check if the point indeed belongs inside the identified containing cell which is estimated using the Newton’s method. Each iteration of the Newton’s method requires 374 FLOPs (code reviewed from VisIt), and based on our experimental validation each check required 2 iterations to converge.
Fig. 6: Comparing analytical cost and actual execution cost for particle advection. The first four columns describe the workload. The X axis represents the estimated number of FLOPS for a workload using information from Table VII and our cost formula (Equation 6). The Y axis represents the execution time of running this workload on a single core using VTK-m. All experimental points being collinear would build confidence that an analytic approach can be used to estimate runtimes.

Equation 6 does not specify the unit of measurement for each term and for the overall cost. While time would be a common choice, we choose our unit of measurement to be number of floating-point operations. We make this choice since floating-point operations are consistent over architecture and caching effects, and, further, are easy to quantify (either by studying code or through profilers). Finally, looking ahead to validation, counting floating-point operations does track closely with actual execution.

Table VII presents the floating-point costs for the terms in Equation 6 for a variety of instantiations: RK4 and Euler solvers and three commonly used types of meshes. The terms is this table were determined by studying code and counting floating-point operations. As noted in the table caption, some operations vary based on mesh size and the table entries correspond to a typical size. Finally, cell location with unstructured meshes incorporate Newton’s method, and we performed experiments to find the average number of iterations to converge (2 iterations).

a.2 Notional Usage of the Cost Model

This section demonstrates cost estimation for a hypothetical workload based on Equation 6 and Table VII. The hypothetical workloads consists of a flow visualization algorithm advancing a million particles in a 3D uniform grid for a maximum of 1000 steps using a RK4 solver. Then, the cost of each component can be estimated as follows:

  • Locate: The locate operation in a uniform grid uses FLOP to find the cell and the particle’s location within the cell for interpolation.

  • Interpolate: The interpolate operation in a uniform grid requires trilinear interpolation which uses FLOP to evaluate the velocity at the given location.

  • Solve: The solve operation for the RK4 integration scheme uses FLOP to calculate the determine the next position of the particle.

  • Analyze: The cost of analysis of the step varies based on the visualization technique being used. In case only deals with advancing particles and hence the analysis cost is .

  • Terminate: The terminate operation requires FLOP to determine if the particle is outside the spatio-temporal bounds or if the particles completes the maximum number of steps.

These costs can be substituted in Equation 6 to get the final cost of a single step of a particle in the presented situation.


a.3 Empirical Validation

This section performs experimental validation of our cost model (Equation 6) incorporating our measurements for the number of floating-point operations per term (Table VII). It presents a set of experiments performed where the calculated cost is translated into an execution time for a workload and them compared against its actual execution time. These experiments were performed on an Intel Xeon E5-1650 CPU with a clock rate of 3.80 GHz using a particle advection implementation from the VTK-m visualization library [55] with a single CPU core. The data used for the experiments was of the resolution , which matches the assumptions in Table VII.

The results of the experiments are presented in Figure 6. This figure shows a very strong fit between the predicted cost in FLOPS and the actual execution time; statistical analysis shows a correlation coefficient of X, and a best fit line of Y=mx+b. In effect, the best fit line provides the actual time prediction. For example, a workload with FLOPS would take seconds.

In the context of our workflow, a visualization application developer may choose to do more work. First, they could run several experiments on their intended architecture to calculate their own best fit line. Second, they could recalculate Table VII using their own implementation and/or anticipated data sizes. That said, performing such additional work is likely unnecessary. Our model and workflow are intended to infer coarse trends, and repeating our analysis with new implementations, architectures, or data sets would likely not yield a significantly different cost model.

Abhishek Yenpure is a Ph.D. candidate at the University of Oregon and CDUX research group led by Hank Childs. He received his Bachelor of Engineering degree from the University of Pune. His research interests include high performance computing and scientific visualization.

Sudhanshu Sane completed a Ph.D. in Computer Science at the University of Oregon in June 2020. Upon graduation, Sudhanshu joined the Scientific Computing and Imaging Institute at the University of Utah as a postdoctoral research fellow. As of 2022, Sudhanshu is a software engineer at Luminary Cloud. Sudhanshu’s research interests include computational fluid dynamics, uncertain and multivariate data visualization, and high performance computing.

Roba Binyahib is a graphics software engineer in the Advanced Rendering and Visualization Team at Intel. She received her Ph.D. in computer science from the University of Oregon in 2020. Her research is in the area of scientific visualization and high-performance computing, focusing on developing visualization for large-scale HPC platforms.

David Pugmire is a distinguished scientist at Oak Ridge National Laboratory and a Joint Faculty Professor in the Electrical Engineering and Computer Science Department at the University of Tennessee. He received his Ph.D. in computer science from the University of Utah in 2000. His research interests are in large-scale parallelism for analysis and visualization of scientific data.

Christoph Garth received the PhD degree in computer science from Technische Universität (TU) Kaiserslautern in 2007. After four years as a postdoctoral researcher with the University of California, Davis, he rejoined TU Kaiserslautern where he is currently a full professor of computer science. His research interests include large-scale data analysis and visualization, in situ visualization, topology-based methods in visualization, and interdisciplinary applications of visualization.

Hank Childs is a professor of computer and information science at the University of Oregon. He received a Ph.D. in computer science from the University of California at Davis in 2006. His research focuses on scientific visualization, high performance computing, and the intersection of the two.