Scalable In Situ Lagrangian Flow Map Extraction: Demonstrating the Viability of a Communication-Free Model

04/04/2020 · Sudhanshu Sane, et al. · University of Oregon

We introduce and evaluate a new algorithm for the in situ extraction of Lagrangian flow maps, which we call Boundary Termination Optimization (BTO). Our approach uses a communication-free model, requiring no message passing or synchronization between processes; this improves scalability, reduces overall execution time, and alleviates the encumbrance that in situ processing places on simulation codes. We terminate particle integration at node boundaries and store only a subset of the flow map that would have been extracted by communicating particles across nodes, thus introducing an accuracy-performance tradeoff. We run experiments with as many as 2048 GPUs and with multiple simulation data sets. For the experiment configurations we consider, our findings demonstrate that our communication-free technique reduces in situ execution time by as much as 2x to 4x, while staying nearly as accurate, quantitatively and qualitatively, as previous work. Most significantly, this study establishes the viability of approaching in situ Lagrangian flow map extraction with communication-free models in the future.


1 Background and Related Work

A key operation for calculating a Lagrangian flow map in situ is computing pathlines in a distributed memory environment. We first discuss relevant Lagrangian analysis works, followed by a brief categorization of distributed memory techniques and the constraints preventing existing approaches from being adopted in situ. We end with a description of the steps involved in calculating a Lagrangian flow map in situ.

1.1 Lagrangian Analysis

In the Lagrangian specification, flow field information is stored using integral curves, where each integral curve encodes the trajectory of a single massless particle and describes its properties as it travels through space. Each integral curve provides insight regarding flow behavior in the vicinity of the particle’s trajectory [bujack2015lagrangian]. Collectively, a large number of integral curves spanning the spatial domain can be defined in terms of a flow map, i.e., a Lagrangian representation of the flow field. The flow map $F_{t_0}^{t}(x_0)$ describes where a particle starting at position $x_0$ and time $t_0$ moves to over the time interval $[t_0, t]$ [garth2007efficient].
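Stated as a worked definition (standard notation, consistent with the references above), the flow map is the solution operator of the particle advection ODE:

$$F_{t_0}^{t}(x_0) = x(t), \quad \text{where} \quad \frac{dx}{d\tau}(\tau) = v(x(\tau), \tau), \quad x(t_0) = x_0,$$

with $v$ the time-dependent velocity field; a pathline is the curve $\tau \mapsto F_{t_0}^{\tau}(x_0)$.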

Lagrangian methods have been increasingly used for flow visualization in the past decade. Lagrangian coherent structures (LCS), introduced by Haller et al. [haller2001distinguished, haller2000lagrangian, haller2000finding], are a popular means of visualizing attracting and repelling surfaces. Given the high computational cost of calculating LCS, several efforts have aimed to accelerate its computation and visualization [garth2007efficient, garth2009visualization, sadlo2007efficient, sadlo2011time]. Further, LCS have been visualized in varied application domains, such as large eddy simulations used to assess the transport properties of multiscale ocean flows [ozgokmen2012multi] and the design of energy-efficient buildings with respect to air circulation [schindler2012lagrangian].

More recently, in situ Lagrangian analysis has been used to extract time-varying flow field information. Most relevant to our work, Agranovsky et al. [agranovsky2014improved] extracted flow maps over a vector field in situ to represent the behavior of the flow for intervals of time. We describe this work in more detail in Section 1.3. Sane et al. [sane2018revisiting] extended the evaluation of Lagrangian in situ analysis by considering spatiotemporal tradeoffs across a range of configurations. Working with an SPH [GM77] simulation, Chandler et al. [chandler2015interpolation] stored the calculated flow maps in a modified k-d tree and queried it for pathline interpolation. However, none of these works considered the scalability of extracting flow maps in situ.

Several works have presented Lagrangian-based advection methods for post hoc flow reconstruction using Lagrangian flow maps as input. Hlawatsch et al. [hlawatsch2011hierarchical] employed a hierarchical scheme post hoc to decrease the number of integration steps by constructing longer pathlines using previously computed Lagrangian-based trajectories. Agranovsky et al. [agranovsky2011extracting] optimized pathline interpolation using scattered particles and considered the use of Moving Least Squares and Barycentric coordinate interpolation. Chandler et al. [chandler2016analysis] presented an analysis showing Lagrangian interpolation error correlates with divergence in flow fields. Bujack et al. [bujack2015lagrangian] identified local truncation error propagation as a source of error in Lagrangian-based advection methods. Extending this work, Hummel et al. [hummel2016error] provided theoretical upper error bounds and pathline uncertainty visualizations. Aiming to reduce the error propagation, Sane et al. [sane2019interpolation] proposed a pathline interpolation scheme that is capable of using longer particle trajectories varying in seed placement and duration. In this work, we use Barycentric coordinate interpolation as our Lagrangian-based advection method for flow reconstruction [agranovsky2011extracting, agranovsky2014improved].

While these studies have broadened and improved in situ Lagrangian analysis from the perspectives of aesthetics, reconstruction accuracy, and storage, very little work has been devoted to performance considerations and viability.

1.2 Distributed Integral Curve Computation

Efficient computation of integral curves in a distributed memory environment is an extensively researched field. However, the majority of works are limited to steady state vector fields in a post hoc environment. The primary challenge addressed by these works is improving the scalability, load balance, and overall efficiency of distributed memory integral curve computation. Typical parallelization strategies adopted to improve performance are parallelize-over-data [sujudi1996integration, peterka2011study, chen2008optimizing, yu2007parallel, nouanesengsy2011load], parallelize-over-particles [dinan2009scalable, camp2011evaluating, muller2013distributed, guo2014advection], or a hybrid approach [pugmire2009scalable, kendall2011simplified, lu2014scalable]. Parallelize-over-data techniques determine a domain decomposition and assign a subset of the total domain to each node. While calculating integral curves, these methods communicate particles between processors when required. Parallelize-over-particles techniques assign a set of particles to each node and load data from disk on demand. For a complete review of the algorithms, we refer the reader to a recent survey by Zhang et al. [zhang2018survey].

Although involving similar keywords as our own work, the following related works are only applicable to steady state vector fields in a post hoc setting. Bleile et al. [bleile2017accelerating] accelerated streamline calculation by swapping traditional Eulerian and Lagrangian-based advection at node boundaries. In this case, after a particle is communicated across a boundary, a previously computed mapping is used to transport the particle across the entire node. Liao et al. [Liao2019ScalablePF] presented a communication-free 3D LIC technique. They limit communication by using a preprocessing step to regroup unstructured grid cells and restricting particle advection to within the confines of a single cell.

Existing algorithms are difficult to adopt in the context of in situ calculation of pathlines. Operating in an in situ environment introduces new constraints with regard to which process can access what domain subset in space and time. First, domain decomposition and distribution are simulation-determined. Second, the time-varying nature of simulation data in conjunction with in situ memory constraints means only a single time step can be accessed at any given time, and communicating this information across processors each step is prohibitively expensive. These constraints complicate techniques like data prefetching, rearrangement, or completing particle advection within a node before particle exchange. (In situ methods are currently limited to advancing a particle by a single step before the vector field changes.) Further, the problem of poor scalability remains when considering a large number of processors and communication between them every cycle.

1.3 In Situ Extraction of Lagrangian Flow Maps

Agranovsky et al. [agranovsky2014improved] were the first to use a Lagrangian representation for in situ reduction and post hoc exploration (ISR-PHE) of a time-varying flow field. The method involves two distinct phases: the first is in situ extraction of sets of basis flows (i.e., pathlines), and the second is flow reconstruction using a Lagrangian-based advection method.

In the in situ phase, particles are seeded along a uniform grid and advected for a predetermined interval of time. In this context, an interval is defined as the number of cycles particles are advected before their end positions are saved and they are reset along a uniform grid. Thus, sets of basis flows are calculated in batches for nonoverlapping intervals of time. The number of seeds placed is configurable and depends on the desired data reduction. The data reduction is expressed relative to the vector field grid size, i.e., a 1:1 configuration uses as many particles as grid points, and a 1:27 configuration uses one particle for every twenty-seven grid points. Importantly, in the Agranovsky method, each process communicates particle exchange information after each simulation step, which contributes significantly to the total execution time. On “write cycles,” i.e., the cycles where data is stored to disk, each compute node returns its particles to their originating nodes.
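As an illustrative sketch (in Python rather than the C++/VTK-m of the actual filters; names here are hypothetical), seeding on a uniform grid at a chosen reduction ratio might look like the following, where a stride of 2 per axis in 3D yields a 1:8 configuration:

```python
import numpy as np

def seed_uniform(bounds_min, bounds_max, dims, stride):
    """Seed particles on a uniform grid, subsampled by `stride` per axis.
    stride=(1,1,1) -> 1:1; stride=(2,2,2) -> 1:8; stride=(3,3,3) -> 1:27."""
    axes = [np.linspace(lo, hi, n)[::s]
            for lo, hi, n, s in zip(bounds_min, bounds_max, dims, stride)]
    grid = np.meshgrid(*axes, indexing="ij")
    return np.stack([g.ravel() for g in grid], axis=1)  # (num_seeds, 3)

# Example: a 64^3 block seeded at a 1:8 data reduction.
seeds = seed_uniform((0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (64, 64, 64), (2, 2, 2))
```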

In the post hoc phase, new particle trajectories for a desired flow analysis can be generated by interpolating the extracted basis flows. Tracing a new pathline begins with identifying the right neighborhood of basis flows to “follow” for an interval of time. Successive sets of basis flows can then be used to stitch together a complete pathline. This Lagrangian-based advection scheme enables scientists to explore the flow field, considering particle trajectories that were not saved as basis flows.

2 Boundary Termination Optimization

In this section, we first define requirements for extracting flow field information as a set of pathlines in situ. Next, we describe Boundary Termination Optimization (BTO), which results in the communication-free technique Lagrangian-BTO; demonstrating its viability is the main contribution of this study. Further, we discuss the accuracy-performance tradeoff of a communication-based versus a communication-free model. Finally, we provide the details of the Lagrangian-based advection scheme we use for reconstruction and a theoretical error analysis.

The requirements surrounding the calculation of integral curves in distributed memory vary depending on whether the computation is for the purpose of post hoc visualization and analysis or for in situ extraction of a flow map. Whereas in a post hoc setting an integral curve must continue particle integration across node boundaries, this is not necessary to calculate a Lagrangian representation of the flow field. We identify different requirements when extracting a Lagrangian flow map in situ and define them as follows:

  1. Extraction of a flow map or set of pathlines in situ should demonstrate good scalability.

  2. Flow field reconstruction using extracted pathlines should be accurate.

The method described in Section 1.3 demonstrated only the second requirement. We implement the method by Agranovsky et al. [agranovsky2014improved] using MPI [gropp1996high] for communication and refer to our implementation of this technique as Lagrangian-MPI. In this paper, we use Lagrangian-MPI as a baseline for comparison.

2.1 In Situ Extraction Using BTO

Our contribution with this work is a simple communication-free algorithm for extracting a Lagrangian flow map in situ. The benefits of this approach are reduced execution time and improved scalability, which lessen the burden on the simulation code. To achieve this, our approach makes a small modification to Lagrangian-MPI: eliminating information exchange and synchronization.

Similar to Lagrangian-MPI, we use an initially uniform seed placement and advect particles for predetermined nonoverlapping intervals of time. However, as opposed to continuing particle integration across node boundaries, our approach terminates and discards these particle trajectories. Figures 1(a) and 1(b) illustrate notional examples of basis flows calculated by Lagrangian-MPI and Lagrangian-BTO, respectively. Thus, we store only those particle trajectories that remain within the domain until the end of the interval. Terminating particles that would require communication across node boundaries to continue trajectory integration allows the approach to remain communication-free. Since processors do not exchange particles, the Lagrangian analysis operator on each processor can operate independently and asynchronously.
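A minimal sketch of one extraction interval under BTO, assuming hypothetical `advance` (one integration step through the current vector field) and `in_bounds` (test against this rank's block bounds) callables; note the loop contains no MPI calls:

```python
import numpy as np

def extract_basis_flows_bto(seeds, advance, in_bounds, interval):
    """One communication-free extraction interval on a single rank.
    Particles that leave the local block are terminated and discarded;
    the (start, end) pairs of surviving particles become basis flows."""
    positions = seeds.copy()
    active = np.ones(len(seeds), dtype=bool)
    for _ in range(interval):
        positions[active] = advance(positions[active])  # one cycle of advection
        active &= in_bounds(positions)                  # terminate boundary crossers
    return seeds[active], positions[active]             # saved to disk at the write cycle
```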

We build both Lagrangian analysis techniques, i.e., Lagrangian-MPI and Lagrangian-BTO, as in situ analysis filters using the VTK-m [moreland2016vtk], VTK-h [larsen2017alpine], and Ascent [larsen2017alpine] libraries. VTK-m is a platform portable scientific visualization library for shared memory parallel environments. VTK-h is a distributed memory wrapper around VTK-m. Ascent is an in situ visualization infrastructure that we use to both integrate with simulations and create a workflow when loading data sets from disk.

The Lagrangian-BTO filter performs two operations: particle advection and particle management. Particle advection is performed using a fourth-order Runge-Kutta (RK4) integrator implemented as a VTK-m worklet [pugmire2018performance]. Particle management involves tracking particle trajectories, evaluating their validity, and managing memory to prevent invalid particles from being launched on GPU threads during advection. The Lagrangian-MPI filter performs three operations: particle advection, particle management, and communication. Consistent with the Agranovsky approach, communication of particles and particle information between ranks is performed using asynchronous, non-blocking, buffered MPI communication. Additionally, its particle management further includes tracking of internal and external particles in order to return particles to originating nodes at write cycles.
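For illustration, a standard RK4 advection step is sketched below. This is not the VTK-m worklet itself; and because in situ access is limited to the current time step, `velocity` here samples a single, momentarily frozen vector field:

```python
def rk4_step(pos, velocity, dt):
    """One classic fourth-order Runge-Kutta particle advection step.
    `velocity(p)` returns the interpolated vector field at position(s) p."""
    k1 = velocity(pos)
    k2 = velocity(pos + 0.5 * dt * k1)
    k3 = velocity(pos + 0.5 * dt * k2)
    k4 = velocity(pos + dt * k3)
    return pos + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
```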

(a) Lagrangian-MPI
(b) Lagrangian-BTO
Figure 1: Notional examples of calculating basis flows. Only green trajectories are stored to disk.

Accuracy-Performance Tradeoff — The loss of information in the form of terminated and discarded particle trajectories reduces the quality of flow reconstruction. Flow information that will typically be lost near boundaries can, however, be interpolated using additional information from adjacent processes post hoc. With respect to simulation overhead, the communication-free model offers a reduced cost and improved scalability. We believe evaluating the accuracy-performance tradeoff and determining the viability of a communication-free model as an alternative low-cost choice for a scientist or researcher is valuable. Our hypothesis regarding this method is that the execution time will be substantially improved (since there is no communication required), but the accuracy will be only modestly affected.

Finally, the Agranovsky work demonstrated that the Lagrangian method can be much more accurate than the Eulerian approach, including some cases where the accuracy improved by over 10x. Even though our practice of terminating particles at the boundary will reduce accuracy compared to the Agranovsky approach, we still would be much more accurate than the Eulerian approach. If we also can demonstrate significantly faster execution times, then we believe our proposed method would be appealing to future researchers and domain scientists in many settings.

(a) Neighborhoods using basis flows from Figure 1(a)
(b) Neighborhoods using basis flows from Figure 1(b)
Figure 2: Notional examples of neighborhoods produced using triangulation over basis flows shown in Figure 1. The blue circles represent starting locations of basis flows saved to disk.

2.2 Post Hoc Reconstruction

To reconstruct the complete flow map or to trace new pathlines, we set up a parallelize-over-data distributed memory Lagrangian-based advection scheme that is conceptually similar to previous work [agranovsky2014improved, chandler2015interpolation, sane2018revisiting]. For a given interval, each process loads the basis flows generated from its own and adjacent processes for that interval. For any new trajectory to be calculated using basis flows, the particle must identify a neighborhood (convex hull) of basis flows to follow for the duration of an interval. We perform a Delaunay triangulation over the start locations of loaded basis flows to identify neighborhoods for tracing new particle trajectories. Figure 2 illustrates a notional example of triangulation using basis flows from the two nodes shown in Figure 1. Using basis flow data from adjacent neighboring processes, i.e., flow information from all directions, allows approximation of regions where information has been lost. Once a particle’s neighborhood is identified, the location of the particle at the end of the interval can be calculated by following the neighborhood of basis flows. In our work, we use Barycentric coordinate interpolation to calculate the particle’s end position. To calculate a pathline for a duration greater than the interval of basis flows, a trajectory can be stitched together using basis flows of successive nonoverlapping intervals.
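The sketch below mirrors this step using SciPy's Delaunay triangulation in place of CGAL and vtkProbeFilter (the function is hypothetical; `scipy.spatial.Delaunay`, `find_simplex`, and the `transform` attribute are real SciPy API):

```python
import numpy as np
from scipy.spatial import Delaunay

def follow_basis_flows(starts, ends, queries):
    """Interpolate end positions for query particles: locate each query in the
    Delaunay triangulation of basis-flow start positions, then apply that
    simplex's barycentric weights to the basis-flow end positions."""
    queries = np.asarray(queries, dtype=float)
    tri = Delaunay(starts)                        # triangulate start locations
    simplex = tri.find_simplex(queries)           # containing simplex (-1 if outside)
    result = np.full(queries.shape, np.nan)
    inside = simplex >= 0
    T = tri.transform[simplex[inside]]            # affine maps, (n, ndim+1, ndim)
    r = queries[inside] - T[:, -1]                # offset from reference vertex
    bary = np.einsum("nij,nj->ni", T[:, :-1], r)  # first ndim barycentric coords
    w = np.hstack([bary, 1.0 - bary.sum(axis=1, keepdims=True)])
    verts = tri.simplices[simplex[inside]]        # vertex indices per simplex
    result[inside] = np.einsum("nk,nkd->nd", w, ends[verts])
    return result
```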

We implemented our reconstruction workflow using CGAL for 3D parallel Delaunay triangulation [fabri2011cgal] and the vtkProbeFilter from the VTK library [schroeder2004visualization] for Barycentric coordinate interpolation. Further, we performed all our reconstruction and accuracy measurements on our in-house research cluster.

2.3 Theoretical Error Analysis

We use a one-dimensional linear interpolation of a function $f$ for $x \in [x_0, x_1]$:

$$f(x) \approx \frac{x_1 - x}{x_1 - x_0} f(x_0) + \frac{x - x_0}{x_1 - x_0} f(x_1). \qquad (1)$$

The higher dimensional result satisfies Equation (1) for each component.

In our approach, a particle starting at $x_1$ that reaches the node boundary during the interval of advection is terminated. Thus, its function value (the flow map) $f(x_1)$ is not known. However, we can reconstruct it from its known neighbors $x_0$ and $x_2$ as

$$\tilde{f}(x_1) = \frac{x_2 - x_1}{x_2 - x_0} f(x_0) + \frac{x_1 - x_0}{x_2 - x_0} f(x_2).$$

Consider a particle at $x \in [x_0, x_1]$ whose path is interpolated post hoc using this reconstructed value. We get the same result as if we had used the closest existing neighbors directly, because

$$\frac{x_1 - x}{x_1 - x_0} f(x_0) + \frac{x - x_0}{x_1 - x_0} \tilde{f}(x_1) = \frac{x_2 - x}{x_2 - x_0} f(x_0) + \frac{x - x_0}{x_2 - x_0} f(x_2). \qquad (2)$$
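A quick numerical check of this identity, using arbitrary illustrative values:

```python
# Check Equation (2): interpolating with the reconstructed value at x1
# equals direct linear interpolation between the known neighbors x0 and x2.
x0, x1, x2 = 0.0, 1.0, 3.0
f0, f2 = 2.0, 8.0                                      # known flow map values
f1 = (x2 - x1)/(x2 - x0)*f0 + (x1 - x0)/(x2 - x0)*f2   # reconstructed value at x1
x = 0.4                                                # query point in [x0, x1]
lhs = (x1 - x)/(x1 - x0)*f0 + (x - x0)/(x1 - x0)*f1
rhs = (x2 - x)/(x2 - x0)*f0 + (x - x0)/(x2 - x0)*f2
assert abs(lhs - rhs) < 1e-12                          # both equal 2.8 here
```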

Bujack et al. [bujack2015lagrangian] previously established that the post hoc interpolation of pathlines is a numerical one-step integration method [GH10]. Its accuracy is bounded by its global truncation error at stitching step $n$ of

$$\|e_n\| \leq \sqrt{d}\, \frac{\Delta x^2}{8} \max\|H\| \, \frac{e^{L(t_n - t_0)} - 1}{L}, \qquad (3)$$

with dimension $d$, start time $t_0$, end time $t_n$, spatial Lipschitz constant $L$, Hessian of the temporal derivative of the flow map $H$, and spatial distance $\Delta x$ between the basis flows. As a result, the interpolation error is $O(\Delta x^2)$ if the flow map has bounded second derivatives in space and first derivatives in time, which is a reasonable assumption for a differentiable vector field, because the solutions of an initial value problem depend smoothly on the initial conditions and time [hartman1973ordinary].

Equation (2) shows that the error bound still holds, but with a larger $\Delta x$. Its size is determined by the extent of the missing information, which is limited by the maximum distance particles can move in one time interval. If a particle is seeded further than $\max\|v\|\,\Delta t$ away from the node boundary, it cannot reach it and must therefore have the correct flow map information. In the worst case, we can have missing information on both sides of the boundary, and therefore we get

$$\Delta x' \leq \Delta x + 2 \max\|v\|\,\Delta t, \qquad (4)$$

with the underlying velocity field $v$ and the temporal step size $\Delta t$, which is the interval time between storing data to disk and is usually around one-thousandth of the total integration time. Note that the increase in the global truncation error of a particle that traverses this region can continue even after it has left the region, but for all particles that never enter the region close to the boundary, the original error bound holds.

3 Study Overview

In this section, we describe our experiments to evaluate performance and reconstruction accuracy, along with the metrics we use to measure each.

3.1 Experiment Setup

To evaluate the viability of the Lagrangian-BTO analysis filters, we set up two workflows.

  • WF1: In situ weak scaling study to evaluate speed-up.

  • WF2: Evaluation of reconstruction accuracy by varying parameters used in flow map generation.

For the WF1 workflow, Ascent is directly connected to a simulation code. Lagrangian analysis parameters are specified as input to the Ascent pipeline. The simulation code generates vector field data and pauses when it invokes Ascent calls in order to perform in situ analysis. We scaled the number of processors used and proportionately increased the size of the simulation grid. Our WF1 experiments provide insight into the performance of both Lagrangian analysis filters at scale.

For the WF2 workflow, we load data set files from disk to create a theoretical in situ setup, and then call Ascent to initiate the Lagrangian analysis filters. We consider a fixed number of MPI tasks and nodes for WF2, i.e., we use a fixed domain decomposition for all tests. Our WF2 experiment configurations are designed to understand the range of specifications under which Lagrangian-BTO extracts accurate flow maps and to identify the potential limitations of using this method. We vary the value of the interval parameter (i.e., the number of cycles we wait before writing to disk) to understand the effect of longer advection intervals on flow map generation using the Lagrangian-BTO approach. In general, the longer the interval, the greater the probability of particles reaching the boundary and being terminated. Further, we consider the impact of various data reduction options. When using a very sparse number of particles to capture the behavior of the flow, the effect of losing particles to boundary termination could be greater. As mentioned in Section 1.3, data reduction values are described in the form of 1:X, i.e., one particle for every X grid points.

3.2 Performance Metric

We measure performance in terms of the execution time of the Lagrangian analysis filters. All of our timings are measured using the timing functionality in Ascent. We measure the average time per cycle, which includes time to perform particle advection, particle management, and also communication in the case of Lagrangian-MPI. To simplify understanding the scalability of both Lagrangian analysis filters with respect to communication costs, we exclude cycles at the end of an interval. These cycles include a communication cost incurred to return all particles to the respective origin nodes for Lagrangian-MPI and an I/O cost to write information to disk for both methods. In this paper, we do not analyze the parallel I/O times, and we consider I/O optimization methods to be beyond the scope of this work.

3.3 Accuracy Metric

We measure the accuracy of flow field reconstruction by the Lagrangian-BTO analysis filter relative to the accuracy achieved by the corresponding Lagrangian-MPI analysis filter. Comparisons of Lagrangian-MPI to the traditional Eulerian method can be found in previous works [agranovsky2014improved, sane2018revisiting]. We measure total flow volume error using the average L2-norm over all samples considered. The total average L2-norm is calculated as

$$L = \frac{1}{N} \sum_{i=1}^{N} \| p_i(t) - q_i(t) \|_2, \qquad (5)$$

where $N$ is the total number of particles, $p_i(t)$ is the location of a Lagrangian-BTO interpolated particle at time $t$, and $q_i(t)$ is the location of the Lagrangian-MPI particle at time $t$.

For a given total average L2-norm value $L$, the reconstruction accuracy percentage is relative to the length of the cell side $C$ for that specific configuration, and is calculated as

$$\text{Accuracy} = \left(1 - \frac{L}{C}\right) \times 100\%. \qquad (6)$$
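As a minimal sketch following the two equations as reconstructed above (array shapes are assumptions; `cell_side` is the cell side length $C$ of the configuration):

```python
import numpy as np

def accuracy_metrics(bto_pos, mpi_pos, cell_side):
    """Total average L2-norm (Eq. 5) and reconstruction accuracy % (Eq. 6).
    bto_pos, mpi_pos: (N, 3) arrays of interpolated particle positions."""
    l2 = np.linalg.norm(bto_pos - mpi_pos, axis=1).mean()
    return l2, (1.0 - l2 / cell_side) * 100.0
```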

We note that we use the total average L2-norm in two contexts. First, to measure error when reconstructing the complete flow map as generated by the Lagrangian-MPI method. Second, to measure error of new pathlines traced using basis flows generated by both methods when compared to a ground truth.

In addition to the above total flow volume error measure, we report the maximum L2-norm in two forms — the greatest maximum L2-norm across all interval reconstructions, i.e., the error of the least accurately interpolated single basis flow, and the average maximum L2-norm across all intervals.

3.4 Data Sets

We consider four data sets, namely, the Cloverleaf3D simulation, Arnold-Beltrami-Childress flow, Jet flow simulation, and Nyx cosmology simulation.

3.5 Runtime Environment

We tested the Lagrangian analysis techniques by running our experiments on Summit, a supercomputer at ORNL. Each node of Summit has two IBM Power9 CPUs, each with 22 cores running at 3.8 GHz, and 512 GBytes of DDR4 memory. Further, nodes on Summit have enhanced on-chip acceleration, with each CPU connected via NVLink to 3 GPUs, for a total of 6 GPUs per node. Each GPU is an NVIDIA Tesla V100 with 5120 CUDA cores, 6.1 TeraFLOPS of double precision performance, and 16 GBytes of HBM2 memory.

4 Results

We organize our results into four subsections, 4.1 to 4.4, each focused on one data set. Specifically, in subsection 4.1, we consider the Cloverleaf3D simulation with workflow WF1 for our weak scaling study and workflow WF2 for our strong scaling study. In subsections 4.2, 4.3, and 4.4, we consider the ABC, Jet flow, and Nyx simulations, respectively, and run experiments with workflow WF2.

For all data sets, we measured the accuracy of the basis flows generated for every interval across all nodes. Tables 1, 2, 3, and 4 show reconstruction accuracy averaged over all the intervals, i.e., the average reconstruction accuracy for the entire simulation duration. Figure 8 shows the change in reconstruction accuracy over every interval and its correlation with the number of particles terminated during that interval for a single configuration of each data set. Figures 8(a), 8(b), 8(c), and 8(d) use bubble size to represent the number of particles terminated, and the corresponding Figures 8(e), 8(f), 8(g), and 8(h) show average L2-norm curves and reconstruction accuracy as a percentage.

To supplement our quantitative evaluation, Figures 3(c), 3(d), 5, 6, and 7 provide a qualitative comparison using colormapped visualizations of surfaces of subvolumes of FTLE scalar fields generated post hoc using basis flows calculated by Lagrangian-MPI and Lagrangian-BTO. Well-defined ridges in the FTLE field (identified by high scalar values), used to visualize Lagrangian coherent structures, are of particular interest in these figures.

Nodes | MPI Ranks | GPUs/Node | Step Size | L-BTO (s) | L-MPI (s) | Speed-up | Discarded % | Greatest Max L2-norm | Average Max L2-norm | Total Average L2-norm | Accuracy %
1 | 2 | 2 | 0.038 | 0.0050 | 0.0190 | 3.8x | 1.9 | 2.52 | 7.58 | 4.53 | 99.6
1 | 4 | 4 | 0.029 | 0.0106 | 0.0301 | 2.8x | 2 | 2.44 | 7.68 | 4.19 | 99.5
1 | 6 | 6 | 0.025 | 0.0158 | 0.0380 | 2.4x | 2.8 | 3.43 | 1.20 | 4.89 | 99.4
2 | 8 | 4 | 0.023 | 0.0109 | 0.0405 | 3.7x | 1.8 | 1.61 | 6.49 | 2.60 | 99.6
2 | 12 | 6 | 0.019 | 0.0173 | 0.0398 | 2.3x | 2.4 | 2.84 | 1 | 3.09 | 99.5
4 | 16 | 4 | 0.017 | 0.0107 | 0.0338 | 3.1x | 2.9 | 1.11 | 1.97 | 4.38 | 99.2
4 | 24 | 6 | 0.015 | 0.0178 | 0.0410 | 2.3x | 4.6 | 5.83 | 2.35 | 6.28 | 98.8
8 | 32 | 4 | 0.013 | 0.0140 | 0.0506 | 3.6x | 4.1 | 3.11 | 1.44 | 3.83 | 99.2
8 | 48 | 6 | 0.011 | 0.0201 | 0.0449 | 2.2x | 4.9 | 5.70 | 3.15 | 4.45 | 98.9
16 | 64 | 4 | 0.010 | 0.0140 | 0.0504 | 3.6x | 4.5 | 8.39 | 3.58 | 3.62 | 99.1
16 | 96 | 6 | 0.009 | 0.0200 | 0.0510 | 2.5x | | | | |
32 | 128 | 4 | 0.008 | 0.0180 | 0.0750 | 4.1x | | | | |
32 | 192 | 6 | 0.007 | 0.0301 | 0.0620 | 2.0x | | | | |
64 | 256 | 4 | 0.006 | 0.0230 | 0.0808 | 3.5x | | | | |
128 | 512 | 4 | 0.005 | 0.0303 | 0.1001 | 3.3x | | | | |
256 | 1024 | 4 | 0.004 | 0.0380 | 0.1390 | 3.6x | | | | |
512 | 2048 | 4 | 0.003 | 0.0475 | 0.1544 | 3.2x | | | | |
Table 1: Weak scaling configurations and timing results for the Cloverleaf3D data set WF1 experiments. The L-BTO and L-MPI columns show the average time per step in seconds. We post-processed and measured accuracy for only a subset of the WF1 experiments (10 of 17), since calculating the reconstruction accuracy takes prohibitively long on our local cluster. Each node operates on a grid of approximately fixed size. Reported results are measured across all intervals.

4.1 Cloverleaf3D Simulation

Cloverleaf3D is a three-dimensional version of the Lagrangian-Eulerian explicit hydrodynamics mini-application Cloverleaf [mallinson2013cloverleaf]. It has been developed and used to evaluate techniques targeting Exascale applications.

(a)
(b)
(c) Lagrangian-MPI
(d) Lagrangian-BTO
Figure 3: Weak scaling study results for the Cloverleaf3D data set. In (a), each curve plots the average time required for a single step for an increasing number of MPI tasks. Lagrangian-BTO is on average 3x faster than Lagrangian-MPI. (b) shows the approximate time required for MPI communication by the Lagrangian-MPI analysis filter averaged across all processes. All measurements are in seconds. (c) and (d) compare FTLE visualizations generated post hoc using basis flows from Lagrangian-MPI and Lagrangian-BTO, respectively.

4.1.1 Weak Scaling

We performed WF1 experiments, i.e., an in situ weak scaling study to evaluate performance, using the Cloverleaf3D simulation on Summit. For each run, we terminated the simulation after 500 cycles. Given that our study varies the resolution of the simulation with the number of processes, the total number of simulation steps varies: 4 of the 17 experiment configurations completed before 500 cycles, and all simulations terminated at 500 cycles reach different stages of the simulation (variable step size). In general, the greater the spatial resolution of the data set, the greater the number of cycles required to reach completion. Our experiment configurations span from 2 MPI tasks to 2048 MPI tasks, with each MPI rank operating on a grid of approximately fixed size. In each case, a single MPI task is allocated a single GPU. Thus, our smallest configuration used 2 GPUs on a single compute node, and the largest used 2048 GPUs across 512 compute nodes. Additionally, given that each node on Summit has 6 GPUs, we varied the number of GPUs utilized on a single node to gauge on-node MPI communication optimizations and the performance of particle advection using multiple GPUs on the same node. For each experiment, we set the maximum step size to 0.1. However, Cloverleaf3D uses an adaptive step size based on the simulation grid resolution. We report the average step size taken by the simulation, configuration information (number of nodes, MPI ranks, and GPUs per node), and scalability performance results in Table 1. Further, each weak scaling test extracted basis flows at an interval of every 25 cycles and used a data reduction of 1:8, i.e., one particle for every eight grid points.

Performance — Figure 3(a) compares the average time required per step by each technique. Lagrangian-BTO scales better than Lagrangian-MPI because it is a communication-free analysis filter. For the extraction of Lagrangian flow maps in situ, Lagrangian-BTO demonstrates an average 3x speed-up over Lagrangian-MPI. We observe that the cost of particle advection (for both techniques) increases as the scale of the simulation increases. However, each process operates on a grid of approximately fixed size irrespective of the total number of MPI tasks, and each Lagrangian-BTO process operates without any knowledge of other processes. In addition to the number of particles being advected, multiple variables influence the cost of the particle advection worklet, namely, cell size, step size, and the vector field itself. Performing a fourth-order Runge-Kutta step can require interpolating velocity information from multiple cells (determined by cell size, step size, and the vector field). Although the increased cost of particle advection affects the time required by both Lagrangian analysis techniques, exact comparisons between different configurations (i.e., rows in Table 1) would not account for the above parameters. We believe the performance of the particle advection worklet with respect to these parameters is worth future investigation. Further, use of a faster particle advection kernel would result in greater speed-ups for the Lagrangian-BTO technique.

Varying number of GPUs/node — The “sawtooth” nature of the plots in Figures 3(a) and 3(b) is a result of alternating the number of GPUs utilized between 4 and 6. Particle advection performs better with 4 GPUs per node than with 6; the use of shared memory by multiple GPUs on a single node and saturation of the NVLink by the VTK-m particle advection worklet cause this effect. Figure 3(b) captures the difference in time between the two curves, i.e., it approximates the time spent on communication, and shows a reduction in MPI communication costs when using 6 ranks per node versus 4, albeit scaling poorly as the number of nodes increases. On-node MPI communication optimizations contribute to better performance when grouping a larger number of MPI tasks on each node. However, as the number of nodes increases, the cost of inter-node communication remains high in comparison to on-node communication.

Accuracy — Higher resolution simulation configurations use a proportionately larger number of particles and thus generate more basis flows. We measured accuracy for only a subset of the WF1 experiments (10 of 17), since calculating the reconstruction accuracy takes prohibitively long periods of time on our local cluster. Lagrangian-BTO terminated between 2-5% of basis flows on average across all experiments, i.e., 2-5% less data was saved to disk. However, after we interpolate the missing trajectories using the saved basis flows, we approximate the complete flow map (as generated by Lagrangian-MPI) with over 99% accuracy on average using the L2-norm metric.

Figures 3(c) and 3(d) compare FTLE visualizations generated using approximately 2M basis flows saved by each technique (configuration in row 10 of Table 1). We observe no significant differences in the visualizations, with both methods capturing the FTLE ridges at the same quality.

Overall, our weak scaling study showed that for the Cloverleaf3D data set, Lagrangian-BTO was on average 3x faster, while generating flow maps that were on average 99% as accurate as those of Lagrangian-MPI, with qualitatively comparable post hoc FTLE visualizations. In situ analysis using Lagrangian-MPI contributed between 5-12% of the total simulation time, and in most cases, Lagrangian-BTO required 50%-80% less time than the corresponding Lagrangian-MPI configuration.

Figure 4: Strong scaling pathline reconstruction error plot for the Cloverleaf3D data set WF2 experiments. We use warm-colored bars for Lagrangian-MPI and cool-colored bars for Lagrangian-BTO. Bars are grouped from left to right by the number of MPI tasks used, i.e., in increasing order of the degree of domain decomposition. Within each group of bars, configurations are subgrouped by data reduction options, i.e., 1:1, 1:8, and 1:27. Within a subgroup, pairs of matching Lagrangian-MPI and Lagrangian-BTO tests are ordered from left to right by interval value. In these experiments, a data set of fixed dimensions was used, and the total number of simulation cycles was set to 800. We traced pathlines for 800 cycles using the basis flows generated by each Lagrangian analysis technique and compared them to the ground truth.

4.1.2 Strong Scaling

Whereas the weak scaling study demonstrated the viability of using Lagrangian-BTO, the strong scaling study helps us understand its limitations. We considered a single grid and decomposed it into smaller subgrids based on the number of MPI tasks. Further, for these WF2 experiments, we considered interval values of 25, 50, and 100, and data reduction values of 1:1, 1:8, and 1:27. The interval determines how long particles advect before their end locations are saved. The longer the interval, the greater the chance of particles exiting the node domain, and vice versa. Further, the number of particles that exit the node domain and terminate directly impacts the accuracy of the approximation. Figure 4 shows the accuracy results when decomposing the domain across 1, 8, and 32 MPI tasks.

Interval | Reduction | L-BTO (s) | L-MPI (s) | Speed-up | Discarded % | Greatest Max L2-norm | Average Max L2-norm | Total Average L2-norm | Accuracy %
25 | 1:1 | 0.0059 | 0.0143 | 2.3x | 2.7 | 6.85 | 4.97 | 6.15 | 99.7
50 | 1:1 | 0.0066 | 0.0145 | 2.1x | 5.5 | 3.03 | 1.85 | 2.31 | 99.0
100 | 1:1 | 0.0067 | 0.0144 | 2.1x | 10.9 | 7.87 | 6.81 | 1.09 | 95.5
25 | 1:8 | 0.0026 | 0.0065 | 2.4x | 2.6 | 1.40 | 3.41 | 3.93 | 99.8
50 | 1:8 | 0.0033 | 0.0086 | 2.6x | 5.4 | 3.71 | 1.23 | 1.32 | 99.4
100 | 1:8 | 0.0026 | 0.0075 | 2.8x | 10.8 | 6.15 | 2.63 | 3.65 | 98.5
25 | 1:27 | 0.0024 | 0.0065 | 2.7x | 1.6 | 5.24 | 4.87 | 1.61 | 99.9
50 | 1:27 | 0.0029 | 0.0048 | 1.6x | 5.1 | 2.62 | 1.28 | 1.23 | 99.5
100 | 1:27 | 0.0027 | 0.0053 | 1.9x | 10.4 | 1.25 | 6.53 | 5.01 | 97.9
Table 2: Speed-up and reconstruction accuracy results for the ABC data set WF2 experiments. The L-BTO and L-MPI columns show average time per step in seconds. Reported results are measured across all intervals.

For these experiments, we measured accuracy by calculating pathlines using the basis flows generated by each technique and evaluating their accuracy against a ground truth at interpolated locations. The ground truth is calculated by performing RK4 advection using the full spatial and temporal resolution of the data set. We placed 1,000 particles in the domain to generate pathlines for this evaluation. We compared the similarity between the ground truth and pathlines generated by interpolating Lagrangian basis flows using the average L2-norm metric; the metric compares the accuracy of interpolated points along the trajectory [sane2019interpolation]. Lastly, for these experiments, we loaded Cloverleaf3D vector field data from disk, set the maximum simulation step size to 0.01, and used 800 simulation steps. Therefore, we generated pathlines for a duration of 800 simulation steps by stitching together results using successive batches of basis flows [agranovsky2014improved, bujack2015lagrangian, sane2018revisiting].

Figure 4 groups configurations by the number of processes used, i.e., the degree of decomposition. We note that Lagrangian-MPI accuracy is not affected by the degree of decomposition and hence remains constant irrespective of the number of processors. When considering only a single processor, both methods lose particles that exit the entire domain during the interval of advection; thus, behavior on a single MPI task is identical for both methods. When the domain is decomposed across 8 tasks, we see the error of Lagrangian-BTO increase slightly for configurations that use a 1:27 data reduction. Specifically, accuracy drops from 100% as accurate as Lagrangian-MPI to 96% as accurate when the interval increases from 25 to 100. For domain decomposition across 32 tasks, i.e., the highest degree of decomposition we consider, Lagrangian-BTO error increases as the interval increases and fewer particles are used. These experiments are valuable in demonstrating the limitations of the Lagrangian-BTO method. However, we believe our strong scaling study tests an extreme case, because we expect our target applications will have higher resolution grids and use a larger number of particles to capture the flow field behavior.

4.2 Arnold-Beltrami-Childress (ABC)

We used a 3D time-dependent variant of the ABC flow analytic vector field [brummell2001linear]. We used one complete period of the flow for a total of 1000 time steps at a fixed grid resolution. For the WF2 ABC data set experiments, we considered three options for the interval (25, 50, 100) and three options for data reduction (1:1, 1:8, 1:27). All tests use 16 nodes and 64 MPI tasks, with 4 MPI tasks and 4 GPUs on each node. Table 2 contains configuration details, the percentage of discarded particles, the speed-up achieved, and the reconstruction accuracy percentages for the ABC data set.

(a) Lagrangian-MPI
(b) Lagrangian-BTO
Figure 5: Qualitative comparison of post hoc FTLE visualizations generated using basis flows for the ABC data set.
Interval | Reduction | L-BTO (s) | L-MPI (s) | Speed-up | Discarded % | Greatest Max L2-norm | Average Max L2-norm | Total Average L2-norm | Accuracy %
5 | 1:1 | 0.0062 | 0.0089 | 1.4x | 0.7 | 5.66 | 1.73 | 1.89 | 99.7
10 | 1:1 | 0.0035 | 0.0089 | 2.5x | 2.7 | 3.12 | 1.20 | 1.45 | 98.1
5 | 1:8 | 0.0024 | 0.0044 | 1.8x | 0.8 | 8.64 | 2.69 | 3.77 | 99.5
10 | 1:8 | 0.0022 | 0.0059 | 2.6x | 2.1 | 4.41 | 1.45 | 2.06 | 97.3
5 | 1:27 | 0.0031 | 0.0038 | 1.2x | 0.7 | 8.47 | 3 | 4.32 | 99.4
10 | 1:27 | 0.0024 | 0.0094 | 3.9x | 1.9 | 4.27 | 1.72 | 2.48 | 96.8
5 | 1:64 | 0.0025 | 0.0037 | 1.4x | 0.5 | 9.19 | 3.30 | 4.34 | 99.4
10 | 1:64 | 0.0029 | 0.0039 | 1.3x | 1.5 | 4.89 | 1.96 | 2.80 | 96.4
Table 3: Speed-up and reconstruction accuracy results for the Jet flow data set WF2 experiments. The L-BTO and L-MPI columns show average time per step in seconds. Reported results are measured across all intervals.
Interval | Reduction | L-BTO (s) | L-MPI (s) | Speed-up | Discarded % | Greatest Max L2-norm | Average Max L2-norm | Total Average L2-norm | Accuracy %
10 | 1:1 | 0.0027 | 0.0067 | 2.4x | 4.1 | 5.06 | 1.52 | 3.93 | 99.5
20 | 1:1 | 0.0026 | 0.0067 | 2.5x | 8.4 | 8.19 | 4.83 | 1.52 | 98.0
40 | 1:1 | 0.0026 | 0.0139 | 5.2x | 15.9 | 3.97 | 2.25 | 9.02 | 88.5
50 | 1:1 | 0.0025 | 0.0079 | 3.1x | 19.3 | 6.04 | 3.59 | 1.64 | 79.1
10 | 1:8 | 0.0024 | 0.0045 | 1.8x | 3.2 | 9.68 | 1.67 | 3.96 | 99.4
20 | 1:8 | 0.0023 | 0.0107 | 4.6x | 7.9 | 1.74 | 7.05 | 2.19 | 97.2
40 | 1:8 | 0.0027 | 0.0046 | 1.7x | 15.5 | 6.19 | 2.92 | 1.18 | 85.6
50 | 1:8 | 0.0030 | 0.0045 | 1.4x | 18.9 | 7.74 | 4.32 | 2.02 | 74.3
Table 4: Speed-up and reconstruction accuracy results for the Nyx simulation data set WF2 experiments. The L-BTO and L-MPI columns show average time per step in seconds. Reported results are measured across all intervals.

Lagrangian-BTO demonstrates an average 2.3x speed-up compared to Lagrangian-MPI. For the ABC data set, reconstruction accuracy does not significantly deteriorate when using a smaller number of particles. However, for configurations with an interval of 100, a large number of particles exit the node boundaries and are terminated. For example, when the interval is set to 100 and a data reduction of 1:1 is used, over 1.7M particles are terminated every interval (approximately 10% of all seeds initially placed), which reduces the reconstruction accuracy to 95.5%. For the ABC data set, the number of particles terminated per interval approximately doubles every time the interval doubles. We observe that all configurations with an interval of 25 or 50 achieve a reconstruction accuracy greater than 99%. When considering the average L2-norm error of individual intervals of the ABC data set in Figures 8(b) and 8(f), we observe a sinusoidal behavior in the error that is due to the sinusoidal ABC analytical function.

Figures 5(a) and 5(b) compare FTLE visualizations generated using approximately 2M basis flows computed over 100 time steps and saved by each technique (configuration in row 6 of Table 2). Although we observe small discontinuities in the ridges of the FTLE field generated using Lagrangian-BTO basis flows, the overall quality of the FTLE visualization is similar.

4.3 Jet Flow Simulation

The Jet data set is a simulation of a jet of high-velocity fluid entering a medium at rest. It was created using the Gerris Flow Solver [popinet2003gerris]. The vector field is defined over a uniform grid, and we use a total of 500 time steps (previously subsampled). These WF2 experiments tested the performance of Lagrangian-BTO when reconstructing this turbulent data set, considering two intervals (5, 10) and four data reduction values (1:1, 1:8, 1:27, 1:64). All tests used 16 nodes and 64 MPI tasks, with 4 MPI tasks and 4 GPUs on each node. Table 3 contains configuration information and results for the Jet data set.

Lagrangian-BTO demonstrates an average 2x speed-up compared to Lagrangian-MPI for this data set. The domain contains regions of high velocity; as a result, the reconstruction accuracy of the Lagrangian-BTO analysis filter is adversely affected as the interval increases. For configurations with the interval set to 5, the data reduction is less consequential, and all configurations achieve greater than 99% accuracy. However, for configurations with the interval set to 10, the reconstruction accuracy decreases from 98.1% to 96.4% as the data reduction ranges from 1:1 to 1:64. Thus, similar to the result observed in the strong scaling study in Section 4.1, the combination of a longer interval and a reduced number of particles results in less accurate reconstruction.

Figures 6(a) and 6(b) compare visualizations produced using 4.2M basis flows saved by each method (configuration in row 2 of Table 3). The Lagrangian-BTO FTLE visualization has some instances of less pronounced structure in the FTLE ridges. However, the differences are localized and relatively small compared to the overall information conveyed by the images.

(a) Lagrangian-MPI
(b) Lagrangian-BTO
Figure 6: Qualitative comparison of post hoc FTLE visualizations generated using basis flows for the Jet flow data set.
(a) Lagrangian-MPI
(b) Lagrangian-BTO
Figure 7: Qualitative comparison of post hoc FTLE visualizations generated using basis flows for the Nyx data set.

4.4 Nyx Cosmology Simulation

The Nyx simulation is an N-body and gas dynamics code for large-scale cosmological simulations [almgren2013nyx]. We use a Nyx test executable named TurbForce to generate 500 cycles of a data set. The generated flow field grows in turbulence over the duration of the simulation. These WF2 experiments with the Nyx data set test the reconstruction accuracy of Lagrangian-BTO when considering four intervals (10, 20, 40, 50) and two data reduction values (1:1, 1:8). Similar to previous setups, we use 64 MPI tasks evenly distributed across 16 nodes, with each MPI task using a single GPU. Table 4 details test configurations and results for the Nyx data set.

Lagrangian-BTO demonstrates an average 2.8x speed-up over Lagrangian-MPI. Configurations with an interval of 10 or 20 have high reconstruction accuracies, greater than 97%. However, as the interval increases, the reconstruction accuracy is adversely affected by the large number of particles terminated and the significant turbulence in this data set. Further, Lagrangian-BTO reconstructs the field more accurately when using a 1:1 data reduction than a 1:8 configuration. Figures 7(a) and 7(b) compare FTLE visualizations generated using approximately 2M basis flows saved by each technique (configuration in row 1 of Table 4). We observe qualitatively comparable FTLE ridge structures.

(a) Cloverleaf3D
(b) ABC
(c) Jet Flow
(d) Nyx
(e) Cloverleaf3D
(f) ABC
(g) Jet flow
(h) Nyx
Figure 8: The plots show the relation between the number of particles terminated and the average L2-norm for all intervals of a single configuration of each data set. (a) and (e) show row 10 of Table 1 for the Cloverleaf3D data set. (b) and (f) show row 7 of Table 2 for the ABC data set. (c) and (g) show row 4 of Table 3 for the Jet flow data set. (d) and (h) show row 6 of Table 4 for the Nyx data set.

5 Conclusion and Future Work

Performing accurate exploratory analysis and visualization of time-varying vector fields is particularly challenging in sparse temporal settings. Our proposed algorithm represents an important step in expanding the Lagrangian in situ reduction and post hoc exploration paradigm to be viable for large-scale simulations. Predecessor work demonstrated that its accuracy-storage tradeoffs were clearly superior to those of the traditional approach [agranovsky2014improved]. However, that work did not place a significant emphasis on minimizing in situ execution times and ensuring scalability. Our work addresses this point, and the corresponding reduced execution times (2x to 4x) and improved scalability remove another barrier to adoption.

We feel the most surprising result from our work is the rate at which we achieve high reconstruction accuracy. Clearly, terminating particles at block boundaries makes for less useful basis flows and can create cases where post hoc exploration accuracy suffers. However, our results empirically show that this feared case happens relatively infrequently in practice and is limited to instances of long intervals, i.e., large integration times. Our proposed approach works particularly well for practical configurations with a short or medium interval, while delivering the same performance benefits: 31 of our 35 tests gave accuracy greater than 96%. In addition to our quantitative evaluation, we qualitatively showed that comparable FTLE visualizations can be generated using basis flows extracted at a fraction of the in situ cost. These evaluations are relative to the Agranovsky et al. approach, which incurred significantly more cost in execution time and scaled poorly. Further, while we are slightly less accurate than Agranovsky et al., our approach would still be much more accurate than the traditional Eulerian approach for time-varying flow visualization in sparse temporal settings.

In terms of future work, we aim to store particle trajectory termination locations on the boundary, develop Lagrangian-based advection schemes that can consume flow maps with trajectories stopping (and starting) at arbitrary times, integrate adaptive variable duration and variable placement (VDVP) techniques [sane2019interpolation], and consider the use of flow field characteristics to guide flow map extraction. Further research is required to address edge cases and the limitations of a communication-free model demonstrated by our experiments. We believe our work is foundational because it evaluated the use of a simple communication-free model and demonstrated improved scalability with only a modest loss of accuracy. Remaining communication-free is particularly essential for Lagrangian analysis techniques that aim to scale well and remain within the in situ constraints of large-scale simulations.

Acknowledgements.
This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration.

References