1 Background and Related Work
A key operation for calculating a Lagrangian flow map in situ is computing pathlines in a distributed memory environment. We first discuss relevant Lagrangian analysis works, followed by a brief categorization of distributed memory techniques and the constraints preventing existing approaches from being adopted in situ. We end with a description of the steps involved in calculating a Lagrangian flow map in situ.
1.1 Lagrangian Analysis
In the Lagrangian specification, flow field information is stored using integral curves, where each integral curve encodes the trajectory of a single massless particle and describes its properties as it travels through space. Each integral curve provides insight regarding flow behavior in the vicinity of the particle's trajectory [bujack2015lagrangian]. Collectively, a large number of integral curves spanning the spatial domain can be defined in terms of a flow map, i.e., a Lagrangian representation of the flow field. The flow map $F_{t_0}^{t}(x_0)$ describes where a particle starting at position $x_0$ and time $t_0$ moves to in the time interval $[t_0, t]$ [garth2007efficient].
Lagrangian methods have been increasingly used for flow visualization in the past decade. Lagrangian coherent structures (LCS), introduced by Haller et al. [haller2001distinguished, haller2000lagrangian, haller2000finding], are a popular technique to visualize attracting and repelling surfaces. Given the high computational cost of calculating LCS, several efforts have been aimed at accelerating its computation and visualization [garth2007efficient, garth2009visualization, sadlo2007efficient, sadlo2011time]. Further, LCS have been visualized in varied application domains, such as large eddy simulations used to assess the transport properties of multiscale ocean flows [ozgokmen2012multi] and the design of energy-efficient buildings with respect to air circulation [schindler2012lagrangian].
More recently, in situ Lagrangian analysis has been used to extract time-varying flow field information. Most relevant to our work, Agranovsky et al. [agranovsky2014improved] extracted flow maps over a vector field in situ to represent the behavior of the flow for intervals of time. We describe this work in more detail in Section 1.3. Sane et al. [sane2018revisiting] extended the evaluation of Lagrangian in situ analysis by considering spatiotemporal tradeoffs across a range of configurations. Working with an SPH [GM77] simulation, Chandler et al. [chandler2015interpolation] stored the calculated flow maps in a modified k-d tree and queried it for pathline interpolation. However, none of these works considered the scalability of extracting flow maps in situ.
Several works have presented Lagrangian-based advection methods for post hoc flow reconstruction using Lagrangian flow maps as input. Hlawatsch et al. [hlawatsch2011hierarchical] employed a hierarchical scheme post hoc to decrease the number of integration steps by constructing longer pathlines from previously computed Lagrangian-based trajectories. Agranovsky et al. [agranovsky2011extracting] optimized pathline interpolation using scattered particles and considered the use of Moving Least Squares and Barycentric coordinate interpolation. Chandler et al. [chandler2016analysis] presented an analysis showing that Lagrangian interpolation error correlates with divergence in flow fields. Bujack et al. [bujack2015lagrangian] identified local truncation error propagation as a source of error in Lagrangian-based advection methods. Extending this work, Hummel et al. [hummel2016error] provided theoretical upper error bounds and pathline uncertainty visualizations. Aiming to reduce the error propagation, Sane et al. [sane2019interpolation] proposed a pathline interpolation scheme capable of using longer particle trajectories varying in seed placement and duration. In this work, we use Barycentric coordinate interpolation as our Lagrangian-based advection method for flow reconstruction [agranovsky2011extracting, agranovsky2014improved].
While these studies have broadened and improved in situ Lagrangian analysis from the perspectives of aesthetics, reconstruction accuracy, and storage, very little work has been devoted to performance considerations and viability.
1.2 Distributed Integral Curve Computation
Efficient computation of integral curves in a distributed memory environment is an extensively researched field. However, the majority of works are limited to steady state vector fields in a post hoc environment. The primary challenge addressed by these works is improving the scalability, load balance, and overall efficiency of distributed memory integral curve computation. Typical parallelization strategies adopted to improve performance are parallelize-over-data [sujudi1996integration, peterka2011study, chen2008optimizing, yu2007parallel, nouanesengsy2011load], parallelize-over-particles [dinan2009scalable, camp2011evaluating, muller2013distributed, guo2014advection], or a hybrid approach [pugmire2009scalable, kendall2011simplified, lu2014scalable]. Parallelize-over-data techniques determine a domain decomposition and assign a subset of the total domain to each node. While calculating integral curves, these methods communicate particles between processors when required. Parallelize-over-particles techniques assign a set of particles to each node and load data from disk on demand. For a complete review of the algorithms, we refer the reader to a recent survey by Zhang et al. [zhang2018survey].
Although they share terminology with our own work, the following related works are only applicable to steady state vector fields in a post hoc setting. Bleile et al. [bleile2017accelerating] accelerated streamline calculation by swapping between traditional Eulerian and Lagrangian-based advection at node boundaries. In this case, after a particle is communicated across a boundary, a previously computed mapping is used to transport the particle across the entire node. Liao et al. [Liao2019ScalablePF] presented a communication-free 3D LIC technique. They limit communication by using a preprocessing step to regroup unstructured grid cells and restricting particle advection to within the confines of a single cell.
Existing algorithms are difficult to adopt in the context of in situ calculation of pathlines. Operating in an in situ environment introduces new constraints with regard to which process can access which domain subset in space and time. First, domain decomposition and distribution are simulation-determined. Second, the time-varying nature of simulation data in conjunction with in situ memory constraints means only a single time step can be accessed at any given time, and communicating this information across processors each step is prohibitively expensive. These constraints complicate techniques like data prefetching, rearrangement, or completing particle advection within a node before particle exchange. (In situ methods are currently limited to advancing a particle by a single step before the vector field changes.) Further, the problem of poor scalability remains when considering a large number of processors and communication between them every cycle.
1.3 In Situ Extraction of Lagrangian Flow Maps
Agranovsky et al. [agranovsky2014improved] were the first to use a Lagrangian representation for in situ reduction and post hoc exploration (ISRPHE) of a time-varying flow field. The method involves two distinct phases. The first phase involves in situ extraction of sets of basis flows (i.e., pathlines), and the second phase involves flow reconstruction using a Lagrangian-based advection method.
In the in situ phase, particles are seeded along a uniform grid and advected for a predetermined interval of time. In this context, an interval is defined as the number of cycles particles are advected for before saving their end positions and resetting them along a uniform grid. Thus, sets of basis flows are calculated in batches for non-overlapping intervals of time. The number of seeds placed is configurable and depends on the desired data reduction. The data reduction is based on the vector field grid size, i.e., a 1:1 configuration uses as many particles as grid points, and a 1:27 configuration uses one particle for every twenty-seven grid points. Importantly, in the Agranovsky method, each process communicates particle exchange information after each simulation step, which contributes significantly to the total execution time. On "write cycles," i.e., the cycles where data is stored to disk, each compute node returns its particles to their originating nodes.
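The in situ phase can be summarized with a short sketch. The following Python code is purely illustrative (the helper names `seed_uniform_grid`, `extract_basis_flows`, and the `advect_step` callback are our own, not part of the actual VTK-m implementation); it shows the seed, advect, save, and reset cycle with a configurable interval and data reduction via seed spacing:

```python
# Illustrative sketch of the in situ extraction phase (hypothetical helper
# names; the actual implementation is a VTK-m/Ascent filter, not Python).

def seed_uniform_grid(bounds, spacing):
    """Place one seed every `spacing` grid points along each axis."""
    (x0, x1), (y0, y1), (z0, z1) = bounds
    return [(float(x), float(y), float(z))
            for x in range(x0, x1, spacing)
            for y in range(y0, y1, spacing)
            for z in range(z0, z1, spacing)]

def extract_basis_flows(advect_step, bounds, spacing, interval, total_cycles):
    """Advect seeds one step per cycle; at the end of each interval, save
    (start, end) pairs as basis flows and reset seeds to the uniform grid."""
    batches = []
    starts = seed_uniform_grid(bounds, spacing)
    particles = list(starts)
    for cycle in range(1, total_cycles + 1):
        particles = [advect_step(p, cycle) for p in particles]
        if cycle % interval == 0:  # "write cycle": store and reseed
            batches.append(list(zip(starts, particles)))
            starts = seed_uniform_grid(bounds, spacing)
            particles = list(starts)
    return batches
```

Each batch corresponds to one non-overlapping interval; a coarser `spacing` corresponds to a stronger data reduction (e.g., spacing 3 in each dimension approximates a 1:27 configuration).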
In the post hoc phase, new particle trajectories for a desired flow analysis can be generated by interpolating the extracted basis flows. Tracing a new pathline begins with identifying the right neighborhood of basis flows to "follow" for an interval of time. Successive sets of basis flows can then be used to stitch together a complete pathline. This Lagrangian-based advection scheme enables scientists to explore the flow field, considering particle trajectories that were not saved as basis flows.
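As a minimal sketch of this stitching process (hypothetical names; each interpolator stands in for interpolation over one interval's basis flows):

```python
def trace_pathline(p_start, interval_interpolators):
    """Stitch a pathline across successive non-overlapping intervals: the
    position at the end of one interval becomes the start position for the
    next. Each `interp` maps a start position to an end-of-interval position
    by following a neighborhood of basis flows."""
    pathline = [p_start]
    p = p_start
    for interp in interval_interpolators:
        p = interp(p)
        pathline.append(p)
    return pathline
```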
2 Boundary Termination Optimization
In this section, we first define requirements for extracting flow field information as a set of pathlines in situ. Next, we describe Boundary Termination Optimization (BTO), which results in the communication-free technique Lagrangian-BTO; demonstrating its viability is the main contribution of this study. Further, we discuss the accuracy-performance tradeoff when comparing a communication-based versus a communication-free model. Finally, we provide the details of the Lagrangian-based advection scheme we use for reconstruction and a theoretical error analysis.
The requirements surrounding the calculation of integral curves in distributed memory vary depending on whether the computation is for the purpose of post hoc visualization and analysis or in situ extraction of a flow map. Whereas in a post hoc setting an integral curve must continue particle integration across node boundaries, this is not necessary to calculate a Lagrangian representation of the flow field. We identify different requirements when extracting a Lagrangian flow map in situ and define them as follows:

1. Extraction of a flow map or set of pathlines in situ should demonstrate good scalability.

2. Flow field reconstruction using extracted pathlines should be accurate.
The method described in Section 1.3 demonstrated only the second requirement. We implement the method by Agranovsky et al. [agranovsky2014improved] using MPI [gropp1996high] for communication and refer to our implementation of this technique as Lagrangian-MPI. In this paper, we use Lagrangian-MPI as a baseline for comparison.
2.1 In Situ Extraction Using BTO
Our contribution with this work is a simple communication-free algorithm for extracting a Lagrangian flow map in situ. This approach reduces execution time and improves scalability, thereby reducing the burden on the simulation code. To achieve this, our approach makes a small modification to Lagrangian-MPI: it eliminates information exchange and synchronization.
Similar to Lagrangian-MPI, we use an initially uniform seed placement and advect particles for predetermined non-overlapping intervals of time. However, as opposed to continuing particle integration across node boundaries, our approach terminates and discards these particle trajectories. Figures 0(a) and 0(b) illustrate notional examples of basis flows calculated by Lagrangian-MPI and Lagrangian-BTO, respectively. Thus, we store only those particle trajectories that remain within the domain until the end of the interval. Terminating particles that would require communication across node boundaries to continue trajectory integration allows the approach to remain communication-free. Since processors do not exchange particles, the Lagrangian analysis operator on each processor can operate independently and asynchronously.
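A sketch of the termination logic for one extraction interval, assuming a hypothetical `advect_step` function and an `in_domain` predicate supplied by the rank:

```python
def advect_interval_bto(seeds, advect_step, in_domain, interval):
    """Communication-free (BTO) variant of one extraction interval: a
    particle that leaves this rank's domain would need an MPI exchange to
    continue, so it is terminated and discarded instead. Only trajectories
    that remain inside the domain for the full interval become basis flows."""
    basis_flows = []
    for start in seeds:
        p, valid = start, True
        for _ in range(interval):
            p = advect_step(p)
            if not in_domain(p):
                valid = False  # terminate: no particle exchange
                break
        if valid:
            basis_flows.append((start, p))
    return basis_flows
```

Because no information ever crosses rank boundaries, each rank can execute this loop independently and asynchronously.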
We build both Lagrangian analysis techniques, i.e., Lagrangian-MPI and Lagrangian-BTO, as in situ analysis filters using the VTK-m [moreland2016vtk], VTK-h [larsen2017alpine], and Ascent [larsen2017alpine] libraries. VTK-m is a platform-portable scientific visualization library for shared memory parallel environments. VTK-h is a distributed memory wrapper around VTK-m. Ascent is an in situ visualization infrastructure that we use both to integrate with simulations and to create a workflow when loading data sets from disk.
The Lagrangian-BTO filter performs two operations: particle advection and particle management. Particle advection is performed using fourth-order Runge-Kutta (RK4) integration implemented as a VTK-m worklet [pugmire2018performance]. Particle management involves tracking particle trajectories, evaluating their validity, and managing memory to prevent invalid particles from being launched on GPU threads during advection. The Lagrangian-MPI filter performs three operations: particle advection, particle management, and communication. Consistent with the Agranovsky approach, communication of particles and particle information between ranks is performed using asynchronous, non-blocking, buffered MPI communication. Additionally, particle management further includes tracking of internal and external particles in order to return particles to originating nodes at write cycles.
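For illustration, a self-contained RK4 advection step might look as follows (a plain Python sketch, not the VTK-m worklet):

```python
def rk4_step(v, p, t, h):
    """One fourth-order Runge-Kutta (RK4) advection step of a particle at
    position p (a tuple) through a velocity field v(p, t) with step size h."""
    def offset(a, k):  # p + a * k, componentwise
        return tuple(pi + a * ki for pi, ki in zip(p, k))
    k1 = v(p, t)
    k2 = v(offset(h / 2, k1), t + h / 2)
    k3 = v(offset(h / 2, k2), t + h / 2)
    k4 = v(offset(h, k3), t + h)
    return tuple(pi + (h / 6) * (a + 2 * b + 2 * c + d)
                 for pi, a, b, c, d in zip(p, k1, k2, k3, k4))
```

Note that the intermediate evaluations of `v` at offset positions are what can require velocity interpolation from multiple cells, a cost factor discussed in Section 4.1.1.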
Accuracy-Performance Tradeoff — The loss of information in the form of terminated and discarded particle trajectories reduces the quality of flow reconstruction. Flow information that is lost near boundaries can, however, be interpolated post hoc using additional information from adjacent processes. With respect to simulation overhead, the communication-free model offers a reduced cost and improved scalability. We believe evaluating the accuracy-performance tradeoff and determining the viability of a communication-free model as an alternative low-cost choice for a scientist or researcher is valuable. Our hypothesis is that the execution time will be substantially improved (since no communication is required), while the accuracy will be only modestly affected.
Finally, the Agranovsky work demonstrated that the Lagrangian method can be much more accurate than the Eulerian approach, including some cases where the accuracy improved by over 10x. Even though our practice of terminating particles at the boundary will reduce accuracy compared to the Agranovsky approach, we would still be much more accurate than the Eulerian approach. If we can also demonstrate significantly faster execution times, then we believe our proposed method would be appealing to future researchers and domain scientists in many settings.
2.2 Post Hoc Reconstruction
To reconstruct the complete flow map or to trace new pathlines, we set up a parallelize-over-data distributed memory Lagrangian-based advection scheme that is conceptually similar to previous work [agranovsky2014improved, chandler2015interpolation, sane2018revisiting]. For a given interval, each process loads basis flows generated from its own and adjacent processes for that interval. For any new trajectory to be calculated using basis flows, the particle must identify a neighborhood (convex hull) of basis flows to follow for the duration of an interval. We perform a Delaunay triangulation over the start locations of loaded basis flows to identify neighborhoods for tracing new particle trajectories. Figure 2 illustrates a notional example of triangulation using basis flows from the two nodes demonstrated in Figure 1. Using basis flow data from adjacent neighboring processes, i.e., flow information from all directions, allows approximation of regions where information has been lost. Once a particle neighborhood is identified, the location of the particle at the end of the interval can be calculated by following the neighborhood of basis flows. In our work, we use Barycentric coordinate interpolation to calculate the particle's end position. To calculate a pathline for a duration greater than the interval of basis flows, a trajectory can be stitched together by using basis flows of successive non-overlapping intervals.
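To make the interpolation step concrete, the following sketch applies Barycentric coordinate interpolation in 2D for a single triangle of basis flows (the actual implementation works on 3D Delaunay tetrahedra via CGAL and vtkProbeFilter; all names here are illustrative):

```python
def barycentric_coords(p, tri):
    """Barycentric coordinates of 2D point p w.r.t. triangle tri = (a, b, c)."""
    (ax, ay), (bx, by), (cx, cy) = tri
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    w0 = ((by - cy) * (p[0] - cx) + (cx - bx) * (p[1] - cy)) / det
    w1 = ((cy - ay) * (p[0] - cx) + (ax - cx) * (p[1] - cy)) / det
    return (w0, w1, 1.0 - w0 - w1)

def interpolate_end_position(p_start, basis_flows):
    """Follow a neighborhood of basis flows: compute barycentric weights of
    the new particle among the basis flow *start* points, then apply the
    same weights to the *end* points to obtain the particle's position at
    the end of the interval."""
    starts, ends = zip(*basis_flows)
    w = barycentric_coords(p_start, starts)
    return tuple(sum(wi * e[k] for wi, e in zip(w, ends)) for k in range(2))
```

The key property is that the weights are computed once, in the space of start locations, and then reused on the end locations; in 3D the triangle simply becomes a Delaunay tetrahedron with four weights.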
We implemented our reconstruction workflow using CGAL for 3D parallel Delaunay triangulation [fabri2011cgal] and the vtkProbeFilter from the VTK library [schroeder2004visualization] for Barycentric coordinate interpolation. Further, we performed all our reconstruction and accuracy measurements on our in-house research cluster.
2.3 Theoretical Error Analysis
We use a one-dimensional linear interpolation of a function $f$ for $x_0 \le x \le x_1$:

$$f(x) \approx f(x_0) + \frac{x - x_0}{x_1 - x_0}\left(f(x_1) - f(x_0)\right). \quad (1)$$

The higher-dimensional result satisfies Equation (1) for each component.
In our approach, a particle starting at $x_b$ that reaches the node boundary during the interval of advection is terminated. Thus, its function value $f(x_b)$ (the flow map) is not known. However, we can reconstruct it from its known neighbors $x_0$ and $x_1$ with $x_0 \le x_b \le x_1$ via Equation (1). Consider a particle whose path is interpolated post hoc using this reconstructed value. We get the same result as if we had used the closest existing neighbors directly, because if we let $f(x_b)$ denote the reconstruction $f(x_0) + \frac{x_b - x_0}{x_1 - x_0}(f(x_1) - f(x_0))$, then for $x_0 \le x \le x_b$

$$f(x) \approx f(x_0) + \frac{x - x_0}{x_b - x_0}\left(f(x_b) - f(x_0)\right) = f(x_0) + \frac{x - x_0}{x_1 - x_0}\left(f(x_1) - f(x_0)\right). \quad (2)$$
Bujack et al. [bujack2015lagrangian] previously established that the post hoc interpolation of pathlines is a numerical one-step integration method [GH10]. Its accuracy is bounded by its global truncation error at stitching step $n \in \mathbb{N}$ of

$$E_n \le \frac{d\,\Delta x^2}{8} \max\lVert H \rVert \, \frac{e^{K(t_n - t_0)} - 1}{K} \quad (3)$$

with dimension $d$, start time $t_0$, end time $t_n$, spatial Lipschitz constant $K$, Hessian $H$ of the temporal derivative of the flow map, and spatial distance $\Delta x$ between the basis flows. As a result, the interpolation error is $O(\Delta x^2)$ if the flow map has bounded second derivatives in space and first derivatives in time, which is a reasonable assumption for a differentiable vector field, because the solutions of an initial value problem depend smoothly on the initial conditions and time [hartman1973ordinary].
Equation (2) shows that the error bound still holds but with a larger $\Delta x$. Its size is determined by the size of the region of missing information, which is limited by the maximum distance particles can move in one time interval. If a particle is seeded further than $\max\lVert v \rVert \, \Delta t$ away from the node boundary, it cannot reach it and must therefore have the correct flow map information. In the worst case, we can have missing information on both sides of the boundary, and therefore we get

$$\Delta x' = \Delta x + 2 \max\lVert v \rVert \, \Delta t \quad (4)$$

with the underlying velocity field $v$ and the temporal step size $\Delta t$, which is the interval time between storing data to disk and is usually around one-thousandth of the total integration time. Please note that the global truncation error of a particle that traverses this region can continue to grow even after it has left the region, but for all particles that never enter the region close to the boundary, the original error bound holds.
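A small numeric check of the equivalence stated in Equation (2), using 1D linear interpolation with illustrative values:

```python
def lerp(x, x0, f0, x1, f1):
    """1D linear interpolation of f between (x0, f0) and (x1, f1)."""
    return f0 + (x - x0) / (x1 - x0) * (f1 - f0)

# Known flow map samples at x0 and x1; the sample at xb in between belongs
# to a terminated particle and is missing, so it is reconstructed.
x0, f0 = 0.0, 10.0
x1, f1 = 1.0, 14.0
xb = 0.75
fb = lerp(xb, x0, f0, x1, f1)  # reconstructed flow map value at xb

x = 0.3  # post hoc query point in [x0, xb]
via_reconstruction = lerp(x, x0, f0, xb, fb)
direct = lerp(x, x0, f0, x1, f1)
# via_reconstruction == direct: interpolating through the reconstructed
# value is equivalent to interpolating directly between the surviving
# neighbors, which is what Equation (2) states.
```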
3 Study Overview
In this section, we describe our experiments to evaluate performance and reconstruction accuracy, as well as the metrics we use to measure them.
3.1 Experiment Setup
To evaluate the viability of the Lagrangian-BTO analysis filter, we set up two workflows:

WF1: an in situ weak scaling study to evaluate speedup.

WF2: an evaluation of reconstruction accuracy when varying the parameters used in flow map generation.
For the WF1 workflow, Ascent is directly connected to a simulation code. Lagrangian analysis parameters are specified as input to the Ascent pipeline. The simulation code generates vector field data and pauses when it invokes Ascent calls in order to perform in situ analysis. We scaled the number of processors used and proportionately increased the size of the simulation grid. Our WF1 experiments provide insight into the performance of both Lagrangian analysis filters at scale.
For the WF2 workflow, we load data set files from disk to create a theoretical in situ setup, and then call Ascent to initiate the Lagrangian analysis filters. We consider a fixed number of MPI tasks and nodes for WF2, i.e., we use a fixed domain decomposition for all tests. Our WF2 experiment configurations are designed to understand the range of specifications under which Lagrangian-BTO extracts accurate flow maps and to identify the potential limitations of using this method. We vary the value of the interval parameter (i.e., the number of cycles we wait before writing to disk) to understand the effect of longer advection intervals on flow map generation using the Lagrangian-BTO approach. In general, the longer the interval, the greater the probability of particles reaching the boundary and being terminated. Further, we consider the impact of various data reduction options. When using a very sparse number of particles to capture the behavior of the flow, the effect of losing particles to boundary termination could be greater. As mentioned in Section 1.3, data reduction values are described in the form of 1:X, i.e., one particle for every X grid points.

3.2 Performance Metric
We measure performance in terms of the execution time of the Lagrangian analysis filters. All of our timings are measured using the timing functionality in Ascent. We measure the average time per cycle, which includes time to perform particle advection, particle management, and, in the case of Lagrangian-MPI, communication. To simplify understanding the scalability of both Lagrangian analysis filters with respect to communication costs, we exclude cycles at the end of an interval. These cycles include a communication cost incurred to return all particles to their respective origin nodes for Lagrangian-MPI and an I/O cost to write information to disk for both methods. In this paper, we do not analyze the parallel I/O times, and we consider I/O optimization methods to be beyond the scope of this work.
3.3 Accuracy Metric
We measure the accuracy of flow field reconstruction by the Lagrangian-BTO analysis filter relative to the accuracy achieved by the corresponding Lagrangian-MPI analysis filter. Comparisons of Lagrangian-MPI to the traditional Eulerian method can be found in previous works [agranovsky2014improved, sane2018revisiting]. We measure total flow volume error using the average L2-norm over all samples considered. The total average L2-norm is calculated as

$$L = \frac{1}{n} \sum_{i=1}^{n} \lVert p_i(t) - \hat{p}_i(t) \rVert_2 \quad (5)$$

where $n$ is the total number of particles, $p_i(t)$ is the location of a Lagrangian-BTO interpolated particle at time $t$, and $\hat{p}_i(t)$ is the location of the corresponding Lagrangian-MPI particle at time $t$.
For a given total average L2-norm value $L$, the reconstruction accuracy percentage is measured relative to the length of the cell side $C$ for that specific configuration, and is calculated as

$$\text{accuracy} = \left(1 - \frac{L}{C}\right) \times 100\%. \quad (6)$$
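A direct transcription of these two metrics (assuming, as described above, that the accuracy percentage is the error expressed relative to the cell side length C):

```python
from math import sqrt

def total_average_l2norm(bto_positions, mpi_positions):
    """Average Euclidean distance between corresponding particle positions
    produced by the two methods at a fixed time (the total average L2-norm)."""
    n = len(bto_positions)
    return sum(sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
               for p, q in zip(bto_positions, mpi_positions)) / n

def reconstruction_accuracy_percent(l2, cell_side):
    """Accuracy percentage relative to the cell side length C: an error
    equal to one full cell side corresponds to 0% accuracy."""
    return (1.0 - l2 / cell_side) * 100.0
```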
We note that we use the total average L2-norm in two contexts: first, to measure error when reconstructing the complete flow map as generated by the Lagrangian-MPI method; and second, to measure error of new pathlines traced using basis flows generated by both methods when compared to a ground truth.
In addition to the above total flow volume error measure, we report the maximum L2-norm in two forms: the greatest maximum L2-norm across all interval reconstructions, i.e., the error of the least accurately interpolated single basis flow, and the average maximum L2-norm across all intervals.
3.4 Data Sets
We consider four data sets, namely, the Cloverleaf3D simulation, the Arnold-Beltrami-Childress (ABC) flow, the Jet flow simulation, and the Nyx cosmology simulation.
3.5 Runtime Environment
We tested the Lagrangian analysis techniques by running our experiments on Summit (a supercomputer at ORNL). Each node of Summit has two IBM Power9 CPUs, each with 22 cores running at 3.8 GHz, and 512 GBytes of DDR4 memory. Further, nodes on Summit have enhanced on-chip acceleration, with each CPU connected via NVLink to 3 GPUs, for a total of 6 GPUs per node. Each GPU is an NVIDIA Tesla V100 with 5120 CUDA cores, 6.1 TeraFLOPS of double precision performance, and 16 GBytes of HBM2 memory.
4 Results
We organize our results into four subsections, 4.1 to 4.4. Each subsection is focused on one data set. Specifically, in subsection 4.1, we consider the Cloverleaf3D simulation with workflow WF1 for our weak scaling study and workflow WF2 for our strong scaling study. In subsections 4.2, 4.3, and 4.4, we consider the ABC, Nyx, and Jet flow simulations, respectively, and run experiments with workflow WF2.
For all data sets, we measured the accuracy of basis flows generated for every interval across all nodes. Tables 1, 2, 3, and 4 show reconstruction accuracy averaged over all the intervals, i.e., average reconstruction accuracy for the entire simulation duration. Figure 8 shows the change in reconstruction accuracy over every interval and the correlation to the number of particles terminated during that interval for a single configuration of each data set. Figures 7(a), 7(b), 7(c), and 7(d) use bubble size to represent the number of particles terminated, and the corresponding Figures 7(e), 7(f), 7(g), and 7(h) show average L2-norm curves and reconstruction accuracy as a percentage.
To supplement our quantitative evaluation, Figures 2(c), 2(d), 5, 6, and 7 provide a qualitative comparison using color-mapped visualizations of surfaces of subvolumes of finite-time Lyapunov exponent (FTLE) scalar fields generated post hoc using basis flows calculated by Lagrangian-MPI and Lagrangian-BTO. Well-defined ridges in the FTLE field (identified by high scalar values), used to visualize Lagrangian coherent structures, are of particular interest in these figures.
Nodes  MPI Ranks  GPUs/Node  Dims  Step Size  L-BTO (s)  L-MPI (s)  Speedup  Discarded %  Greatest Max L2-norm  Average Max L2-norm  Total Average L2-norm  Accuracy %
1  2  2  0.038  0.0050  0.0190  3.8x  1.9  2.52  7.58  4.53  99.6  
1  4  4  0.029  0.0106  0.0301  2.8x  2  2.44  7.68  4.19  99.5  
1  6  6  0.025  0.0158  0.0380  2.4x  2.8  3.43  1.20  4.89  99.4  
2  8  4  0.023  0.0109  0.0405  3.7x  1.8  1.61  6.49  2.60  99.6  
2  12  6  0.019  0.0173  0.0398  2.3x  2.4  2.84  1  3.09  99.5  
4  16  4  0.017  0.0107  0.0338  3.1x  2.9  1.11  1.97  4.38  99.2  
4  24  6  0.015  0.0178  0.0410  2.3x  4.6  5.83  2.35  6.28  98.8  
8  32  4  0.013  0.0140  0.0506  3.6x  4.1  3.11  1.44  3.83  99.2  
8  48  6  0.011  0.0201  0.0449  2.2x  4.9  5.70  3.15  4.45  98.9  
16  64  4  0.010  0.0140  0.0504  3.6x  4.5  8.39  3.58  3.62  99.1  
16  96  6  0.009  0.0200  0.0510  2.5x  —  —  —  —  —  
32  128  4  0.008  0.0180  0.0750  4.1x  —  —  —  —  —  
32  192  6  0.007  0.0301  0.0620  2.0x  —  —  —  —  —  
64  256  4  0.006  0.0230  0.0808  3.5x  —  —  —  —  —  
128  512  4  0.005  0.0303  0.1001  3.3x  —  —  —  —  —  
256  1024  4  0.004  0.0380  0.1390  3.6x  —  —  —  —  —  
512  2048  4  0.003  0.0475  0.1544  3.2x  —  —  —  —  — 
4.1 Cloverleaf3D Simulation
Cloverleaf3D is a three-dimensional version of the Lagrangian-Eulerian explicit hydrodynamics mini-application Cloverleaf [mallinson2013cloverleaf]. It has been developed and used to evaluate techniques targeting Exascale applications.
4.1.1 Weak Scaling
We performed WF1 experiments, i.e., an in situ weak scaling study to evaluate performance, using the Cloverleaf3D simulation on Summit. For each run, we terminated the simulation after 500 cycles. Given that our study varies the resolution of the simulation with the number of processes, the total number of simulation steps varies; 4 of 17 experiment configurations completed before 500 cycles, and all the simulations terminated at 500 cycles reached different stages of the simulation (variable step size). In general, the greater the spatial resolution of the data set, the greater the number of cycles required to reach completion. Our experiment configurations span from 2 MPI tasks to 2048 MPI tasks, with each MPI rank operating on a grid of approximately the same size. In each case, a single MPI task is allocated a single GPU. Thus, our smallest configuration used 2 GPUs on a single compute node, and the largest used 2048 GPUs across 512 compute nodes. Additionally, given that each node on Summit has 6 GPUs, we varied the number of GPUs utilized on a single node to gauge on-node MPI communication optimizations and the performance of particle advection using multiple GPUs on the same node. For each experiment, we set the maximum step size to 0.1. However, Cloverleaf3D uses an adaptive step size based on the simulation grid resolution. We report the average step size taken by the simulation, configuration information (number of nodes, MPI ranks, GPUs per node, and dimensions), and scalability performance results in Table 1. Further, each weak scaling test extracted basis flows at an interval of every 25 cycles and used a data reduction of 1:8, i.e., one particle for every eight grid points.
Performance — Figure 2(a) compares the average time required per step by each technique. Lagrangian-BTO scales better than Lagrangian-MPI because it is a communication-free analysis filter. For the extraction of Lagrangian flow maps in situ, Lagrangian-BTO demonstrates an average 3x speedup over Lagrangian-MPI. We observe that the cost of particle advection (for both techniques) increases as the scale of the simulation increases. However, each process operates on a grid of approximately the same size irrespective of the total number of MPI tasks, and each Lagrangian-BTO process operates without any knowledge of other processes. In addition to the number of particles being advected, multiple variables influence the cost of the particle advection worklet, namely, cell size, step size, and the vector field. Performing a fourth-order Runge-Kutta integration step could require interpolating velocity information from multiple cells (determined by cell size, step size, and the vector field). Although the increased cost of particle advection affects the time required by both Lagrangian analysis techniques, exact comparisons between different configurations (i.e., rows in Table 1) would not account for the above parameters. We believe the performance of the particle advection worklet with respect to these parameters is worth future investigation. Further, use of a faster particle advection kernel would result in greater speedups for the Lagrangian-BTO technique.
Varying number of GPUs/node — The "sawtooth" nature of the plots in Figures 2(a) and 2(b) is a result of alternating the number of GPUs utilized between 4 and 6. Particle advection performs better with 4 GPUs per node than with 6. The use of shared memory by multiple GPUs on a single node and saturation of the NVLink by the VTK-m particle advection worklet cause this effect. Figure 2(b) captures the difference in time between both curves, i.e., approximates the time spent on communication, and shows a reduction in MPI communication costs when using 6 ranks per node versus 4, albeit scaling poorly as the number of nodes increases. On-node MPI communication optimizations contribute to better performance when grouping a larger number of MPI tasks on each node. However, as the number of nodes increases, the cost of internode communication remains high in comparison to on-node communication.
Accuracy — Higher resolution simulation configurations use a proportionately larger number of particles and thus generate more basis flows. We measured accuracy for only a subset of the WF1 experiments (10 of 17), since calculating the reconstruction accuracy takes prohibitively long on our local cluster. Lagrangian-BTO terminated between 2% and 5% of basis flows on average across all experiments, i.e., 2%–5% less data was saved to disk. However, after we interpolate the missing trajectories using the saved basis flows, we approximate the complete flow map (as generated by Lagrangian-MPI) with over 99% accuracy on average using the L2-norm metric.
Figures 2(c) and 2(d) compare FTLE visualizations generated using approximately 2M basis flows saved by each technique (configuration in row 10 of Table 1). We observe no significant differences in the visualizations with both methods capturing the FTLE ridges with the same quality.
Overall, our weak scaling study showed that for the Cloverleaf3D data set, Lagrangian-BTO was on average 3x faster than Lagrangian-MPI, while generating flow maps that were on average 99% as accurate and producing qualitatively comparable post hoc FTLE visualizations. In situ analysis using Lagrangian-MPI contributed between 5% and 12% of the total simulation time, and in most cases, Lagrangian-BTO required 50%–80% less time than the corresponding Lagrangian-MPI configuration.
4.1.2 Strong Scaling
Whereas the weak scaling study demonstrated the viability of using Lagrangian-BTO, the strong scaling study helps us understand its limitations. We considered a fixed-resolution grid and decomposed it into smaller grids based on the number of MPI tasks. Further, for these WF2 experiments, we considered interval values of 25, 50, and 100, and data reduction values of 1:1, 1:8, and 1:27. The interval determines how long particles advect before their end locations are saved. The longer the interval, the greater the chance of particles exiting the node domain, and vice versa. Further, the number of particles that exit the node domain and terminate directly impacts the accuracy of the approximation. Figure 4 shows the accuracy results when decomposing the domain across 1, 8, and 32 MPI tasks.
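A uniform block decomposition of this kind can be sketched as follows. This is a hypothetical helper for illustration only; `block_extents` and its arguments are our own names, and the actual implementation relies on the simulation's existing domain decomposition.

```python
import itertools

def block_extents(global_dims, tasks_per_axis):
    """Split a structured grid into per-rank blocks.

    global_dims: (nx, ny, nz) global grid dimensions.
    tasks_per_axis: (px, py, pz), whose product is the number of MPI tasks.
    Returns, for every block, a list of (start, end) index ranges per axis.
    """
    blocks = []
    for bx, by, bz in itertools.product(*(range(p) for p in tasks_per_axis)):
        extent = []
        for dim, p, b in zip(global_dims, tasks_per_axis, (bx, by, bz)):
            # Integer split that covers the full axis even when p does
            # not divide dim evenly.
            lo = b * dim // p
            hi = (b + 1) * dim // p
            extent.append((lo, hi))
        blocks.append(extent)
    return blocks
```

For example, a 64^3 grid split across 8 tasks (2 per axis) yields eight 32^3 blocks; in the communication-free model, particles are advected only within their block's extent.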
Interval | Reduction | L-BTO (s) | L-MPI (s) | Speedup | Discarded % | Greatest Max L2-norm | Average Max L2-norm | Total Average L2-norm | Accuracy %
25  | 1:1  | 0.0059 | 0.0143 | 2.3x | 2.7  | 6.85 | 4.97 | 6.15 | 99.7
50  | 1:1  | 0.0066 | 0.0145 | 2.1x | 5.5  | 3.03 | 1.85 | 2.31 | 99.0
100 | 1:1  | 0.0067 | 0.0144 | 2.1x | 10.9 | 7.87 | 6.81 | 1.09 | 95.5
25  | 1:8  | 0.0026 | 0.0065 | 2.4x | 2.6  | 1.40 | 3.41 | 3.93 | 99.8
50  | 1:8  | 0.0033 | 0.0086 | 2.6x | 5.4  | 3.71 | 1.23 | 1.32 | 99.4
100 | 1:8  | 0.0026 | 0.0075 | 2.8x | 10.8 | 6.15 | 2.63 | 3.65 | 98.5
25  | 1:27 | 0.0024 | 0.0065 | 2.7x | 1.6  | 5.24 | 4.87 | 1.61 | 99.9
50  | 1:27 | 0.0029 | 0.0048 | 1.6x | 5.1  | 2.62 | 1.28 | 1.23 | 99.5
100 | 1:27 | 0.0027 | 0.0053 | 1.9x | 10.4 | 1.25 | 6.53 | 5.01 | 97.9
For these experiments, we measured accuracy by calculating pathlines using the basis flows generated by each technique and evaluating them against a ground truth at interpolated locations. The ground truth is calculated by performing RK4 advection using the full spatial and temporal resolution of the data set. We placed 1,000 particles in the domain to generate pathlines for this evaluation. We compared the similarity between the ground truth and pathlines generated by interpolating Lagrangian basis flows using the average L2-norm metric, which compares the accuracy of interpolated points along the trajectory [sane2019interpolation]. Lastly, for these experiments, we loaded Cloverleaf3D vector field data from disk, set the maximum simulation step size to 0.01, and used 800 simulation steps. Therefore, we generated pathlines for a duration of 800 simulation steps by stitching together results using successive batches of basis flows [agranovsky2014improved, bujack2015lagrangian, sane2018revisiting].
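The error statistics reported in our tables (Greatest Max, Average Max, and Total Average L2-norm) can be computed from matched samples along each trajectory. A minimal sketch, assuming both pathline sets are sampled at the same points; `error_summary` is a hypothetical helper for illustration, not the paper's code.

```python
import numpy as np

def error_summary(truth_paths, interp_paths):
    """Summarize L2-norm error between ground-truth and interpolated
    pathlines, both arrays of shape (n_particles, n_steps, 3)."""
    # Euclidean distance at each sampled point along every trajectory.
    err = np.linalg.norm(truth_paths - interp_paths, axis=-1)
    per_particle_max = err.max(axis=1)  # worst point on each pathline
    return {
        "greatest_max": per_particle_max.max(),   # worst error overall
        "average_max": per_particle_max.mean(),   # mean of per-pathline maxima
        "total_average": err.mean(),              # mean over all points
    }
```

A reconstruction accuracy percentage can then be derived by normalizing such errors against the ground-truth trajectories.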
Figure 4 groups configurations by the number of MPI tasks used, i.e., the degree of decomposition. We note that Lagrangian-MPI accuracy is not affected by the degree of decomposition and hence remains constant irrespective of the number of tasks. When considering only a single task, both methods lose particles that exit the entire domain during the interval of advection; thus, performance on a single MPI task is identical for both methods. When decomposed across 8 tasks, we see the error of Lagrangian-BTO increase slightly for configurations that use a 1:27 data reduction. Specifically, accuracy drops from 100% of Lagrangian-MPI accuracy to 96% as the interval increases from 25 to 100. For domain decomposition across 32 tasks, i.e., the highest degree of decomposition we consider, Lagrangian-BTO error increases as the interval increases and fewer particles are used. These experiments are valuable in demonstrating the limitations of the Lagrangian-BTO method. However, we believe our strong scaling study tests an extreme case, because we expect our target applications will have higher resolution grids and use a larger number of particles to capture the flow field behavior.
4.2 Arnold-Beltrami-Childress (ABC)
We used a 3D time-dependent variant of the ABC flow analytic vector field [brummell2001linear]. We used one complete period of the flow for a total of 1000 time steps at a grid resolution of . For the WF2 ABC data set experiments, we considered three options for the interval (25, 50, 100) and three options for data reduction (1:1, 1:8, 1:27). All tests used 16 nodes and 64 MPI tasks, with 4 MPI tasks using 4 GPUs on each node. Table 2 contains configuration details, the percentage of discarded particles, the speedup achieved, and the reconstruction accuracy percentages for the ABC data set.
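For reference, the steady ABC velocity field is u = A sin z + C cos y, v = B sin x + A cos z, w = C sin y + B cos x. The sketch below shows one common time-dependent variant in which a coefficient is modulated sinusoidally; the modulation form, `eps`, and `omega` are illustrative assumptions, and the exact variant used in [brummell2001linear] may differ.

```python
import numpy as np

# Classical ABC coefficients.
A0, B0, C0 = np.sqrt(3.0), np.sqrt(2.0), 1.0

def abc_velocity(x, y, z, t, eps=0.5, omega=2.0 * np.pi):
    """Velocity of a time-dependent ABC flow at position (x, y, z), time t."""
    # Periodically modulated coefficient makes the field unsteady;
    # at t = 0 it reduces to the steady ABC flow.
    A = A0 + eps * np.sin(omega * t)
    u = A * np.sin(z) + C0 * np.cos(y)
    v = B0 * np.sin(x) + A * np.cos(z)
    w = C0 * np.sin(y) + B0 * np.cos(x)
    return np.array([u, v, w])
```

Because the field is analytic, ground-truth trajectories can be integrated at arbitrary temporal resolution, which is what makes this data set convenient for accuracy evaluation.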
Interval | Reduction | L-BTO (s) | L-MPI (s) | Speedup | Discarded % | Greatest Max L2-norm | Average Max L2-norm | Total Average L2-norm | Accuracy %
5  | 1:1  | 0.0062 | 0.0089 | 1.4x | 0.7 | 5.66 | 1.73 | 1.89 | 99.7
10 | 1:1  | 0.0035 | 0.0089 | 2.5x | 2.7 | 3.12 | 1.20 | 1.45 | 98.1
5  | 1:8  | 0.0024 | 0.0044 | 1.8x | 0.8 | 8.64 | 2.69 | 3.77 | 99.5
10 | 1:8  | 0.0022 | 0.0059 | 2.6x | 2.1 | 4.41 | 1.45 | 2.06 | 97.3
5  | 1:27 | 0.0031 | 0.0038 | 1.2x | 0.7 | 8.47 | 3    | 4.32 | 99.4
10 | 1:27 | 0.0024 | 0.0094 | 3.9x | 1.9 | 4.27 | 1.72 | 2.48 | 96.8
5  | 1:64 | 0.0025 | 0.0037 | 1.4x | 0.5 | 9.19 | 3.30 | 4.34 | 99.4
10 | 1:64 | 0.0029 | 0.0039 | 1.3x | 1.5 | 4.89 | 1.96 | 2.80 | 96.4
Interval | Reduction | L-BTO (s) | L-MPI (s) | Speedup | Discarded % | Greatest Max L2-norm | Average Max L2-norm | Total Average L2-norm | Accuracy %
10 | 1:1 | 0.0027 | 0.0067 | 2.4x | 4.1  | 5.06 | 1.52 | 3.93 | 99.5
20 | 1:1 | 0.0026 | 0.0067 | 2.5x | 8.4  | 8.19 | 4.83 | 1.52 | 98.0
40 | 1:1 | 0.0026 | 0.0139 | 5.2x | 15.9 | 3.97 | 2.25 | 9.02 | 88.5
50 | 1:1 | 0.0025 | 0.0079 | 3.1x | 19.3 | 6.04 | 3.59 | 1.64 | 79.1
10 | 1:8 | 0.0024 | 0.0045 | 1.8x | 3.2  | 9.68 | 1.67 | 3.96 | 99.4
20 | 1:8 | 0.0023 | 0.0107 | 4.6x | 7.9  | 1.74 | 7.05 | 2.19 | 97.2
40 | 1:8 | 0.0027 | 0.0046 | 1.7x | 15.5 | 6.19 | 2.92 | 1.18 | 85.6
50 | 1:8 | 0.0030 | 0.0045 | 1.4x | 18.9 | 7.74 | 4.32 | 2.02 | 74.3
Lagrangian-BTO demonstrates an average 2.3x speedup when compared to Lagrangian-MPI. For the ABC data set, reconstruction accuracy does not significantly deteriorate when using a smaller number of particles. However, for configurations with an interval equal to 100, a large number of particles exit the node boundaries and are terminated. For example, when the interval is set to 100 and a data reduction of 1:1 is used, over 1.7M particles are terminated every interval (approximately 10% of all seeds initially placed), which reduces the reconstruction accuracy to 95.5%. For the ABC data set, the number of particles terminated per interval approximately doubles every time the interval doubles. We observe that all configurations with an interval of 25 or 50 achieve a reconstruction accuracy greater than 99%. When considering the average L2-norm error of individual intervals of the ABC data set in Figures 7(b) and 7(f), we observe a sinusoidal behavior in the error that is due to the sinusoidal ABC analytical function.
Figures 4(a) and 4(b) compare FTLE visualizations generated using approximately 2M basis flows computed over 100 time steps and saved by each technique (configuration in row 6 of Table 2). Although we observe small discontinuities in the ridges of the FTLE field generated using Lagrangian-BTO basis flows, the overall quality of the FTLE visualization is similar.
4.3 Jet Flow Simulation
The Jet data set is a simulation of a jet of high-velocity fluid entering a medium at rest. It was created using the Gerris Flow Solver [popinet2003gerris]. The vector field is defined over a uniform grid, and we use a total of 500 time steps (previously subsampled). These WF2 experiments tested the performance of Lagrangian-BTO when reconstructing this turbulent data set by considering two intervals (5, 10) and four data reduction values (1:1, 1:8, 1:27, 1:64). All tests used 16 nodes and 64 MPI tasks, with 4 MPI tasks using 4 GPUs on each node. Table 3 contains configuration information and results for the Jet data set.
Lagrangian-BTO demonstrates an average speedup of 2x when compared to Lagrangian-MPI for this data set. The domain contains regions of high velocity, and consequently the reconstruction accuracy of the Lagrangian-BTO analysis filter is adversely affected as the interval increases. For configurations with the interval set to 5, the data reduction is less consequential, and all configurations achieve greater than 99% accuracy. However, for configurations with the interval set to 10, the reconstruction accuracy decreases from 98.1% to 96.4% as data reduction ranges from 1:1 to 1:64. Thus, similar to the result observed in the strong scaling study in Section 4.1, the combination of a larger interval and a reduced number of particles results in less accurate reconstruction.
Figures 5(a) and 5(b) compare visualizations produced using 4.2M basis flows saved by each method (configuration in row 2 of Table 3). The Lagrangian-BTO FTLE visualization has some instances of less pronounced structure in the FTLE ridges. However, the differences are localized and relatively small compared to the overall information conveyed by the images.
4.4 Nyx Cosmology Simulation
Nyx is an N-body and gas dynamics code for large-scale cosmological simulations [almgren2013nyx]. We use a Nyx test executable named TurbForce to generate 500 cycles of a data set with an average time step of . The generated flow field grows in turbulence over the duration of the simulation. These WF2 experiments with the Nyx data set test the reconstruction accuracy of Lagrangian-BTO when considering four intervals (10, 20, 40, 50) and two data reduction values (1:1, 1:8). Similar to previous setups, we use 64 MPI tasks evenly distributed across 16 nodes, with each MPI task using a single GPU. Table 4 details test configurations and results for the Nyx data set.
Lagrangian-BTO demonstrates an average speedup of 2.8x over Lagrangian-MPI. Configurations with an interval of 10 or 20 have high reconstruction accuracies, greater than 97%. However, as the interval increases, the reconstruction accuracy is adversely affected by the large number of terminated particles and the significant turbulence in this data set. Further, Lagrangian-BTO reconstructs the field more accurately when using a 1:1 data reduction than with a 1:8 configuration. Figures 6(a) and 6(b) compare FTLE visualizations generated using approximately 2M basis flows saved by each technique (configuration in row 1 of Table 4). We observe qualitatively comparable FTLE ridge structures.
5 Conclusion and Future Work
Performing accurate exploratory analysis and visualization of time-varying vector fields is particularly challenging in sparse temporal settings. Our proposed algorithm represents an important step in making the Lagrangian in situ reduction and post hoc exploration paradigm viable for large-scale simulations. Predecessor work demonstrated that its accuracy-storage trade-offs were clearly superior to those of the traditional approach [agranovsky2014improved]. However, that work did not place significant emphasis on minimizing in situ execution times or ensuring scalability. Our work addresses this point, and the corresponding reduced execution times (2x to 4x) and improved scalability remove another barrier to adoption.
We feel the most surprising result from our work is the rate at which we achieve high reconstruction accuracy. Clearly, terminating particles at block boundaries makes for less useful basis flows and can create issues where post hoc exploration accuracy suffers. However, our results empirically show that this feared case happens relatively infrequently in practice and is limited to instances of long intervals, i.e., large integration times. Our proposed approach works particularly well for practical configurations with a short or medium interval, while delivering the same performance benefits: 31 of our 35 tests achieved accuracy greater than 96%. In addition to our quantitative evaluation, we qualitatively showed that comparable FTLE visualizations can be generated using basis flows extracted at a fraction of the in situ cost. These evaluations are relative to the approach of Agranovsky et al., which incurred significantly higher execution time costs and scaled poorly. Further, while we are slightly less accurate than Agranovsky et al., our approach would still be much more accurate than the traditional Eulerian approach for time-varying flow visualization in sparse temporal settings.
In terms of future work, we aim to store particle trajectory termination locations on the boundary, develop Lagrangian-based advection schemes that can consume flow maps with trajectories stopping (and starting) at arbitrary times, integrate adaptive variable duration variable placement (VDVP) techniques [sane2019interpolation], and consider the use of flow field characteristics to guide flow map extraction. Further research is required to address the edge cases and the limitations of a communication-free model demonstrated by our experiments. We believe our work is foundational because it evaluated the use of a simple communication-free model and demonstrated improved scalability with only a reasonable loss of accuracy. Remaining communication-free is particularly essential for Lagrangian analysis techniques that aim to scale well while remaining within the in situ constraints of large-scale simulations.