I Introduction
With the dawn of the big data era, managing the massive volume of data generated by dataintensive applications becomes extremely challenging, particularly for scientific simulations running on leadershipclass highperformance computing (HPC) systems and experiments running on federated instruments and sensor platforms. For instance, the XGC dynamic fusion simulation code [19, 8] from the Department of Energy (DoE)’s Princeton Plasma Physics Laboratory can generate 1 PB of data every 24 hours when running on the DoE’s fastest supercomputers. In the future, these physicists expect that their simulation codes will generate 10 PB of data per day. A more extreme example is the Square Kilometer Array (SKA) radio astronomy project [31], which plans to be generating data at a velocity of 1 PB/s in the next 10 to 20 years. Few storage systems can keep up with data generated at that velocity, including those built for the DoE’s leadershipclass supercomputers. Moreover, even if fullaccuracy data could be stored at these sizes and rates, processing the data with standard multipass data analysis routines may consume more computational resources and be more expensive to run than the original generation, leading to significant degradation in overall scientific productivity.
Current solutions for managing this overwhelming amount of complex and heterogeneous scientific data are usually passive and based on “rules of thumb.” For instance, to mitigate the pressure on I/O bandwidth, domain scientists often decide to decimate in time by reducing the frequency of outputting data based on some arbitrary factor (e.g., writing data every 1,000 simulation steps). Although this approach can effectively reduce the amount of data written to the storage system, it increases the potential risk of missing novel scientific discovery, as the data thrown away might contain some important features. Moreover, due to limited capacity, the data cannot reside in fast storage tiers such as parallel file systems forever. Eventually the data must be moved to much slower storage tiers, such as archival storage systems, and purged from fast storage. For example, on the DoE’s Summit supercomputer, data can only be kept on the parallel file system for 90 days before it is either moved to archival storage systems such as HPSS [15] or permanently deleted. Once data is moved to archival storage, it can take weeks or even months for scientists to retrieve the data before they could run their analysis procedures.
There are no universal solutions for addressing all of the technical and domainspecific requirements at such large data sizes. In fact, from the domain scientist’s perspective, having the capability to store a huge amount of data does not necessarily lead to more scientific discoveries. It is common that the most valuable scientific insights come from a small portion of the original data, while the remaining data is less useful. What domain scientists expect is the capability to reduce and reorganize the data intelligently based on their intentions so that: 1) they will not lose any valuable data that contains important features; and 2) they can quickly fetch the needed data when they run their analysis routines. We call such capability intelligent scientific data refactoring. To enable this capability, new algorithms such as multigridbased hierarchical data refactoring [1, 3, 2, 4] have recently been developed by the applied mathematics community.
In this work, we explore how one can apply the capability to refactor datasets into more reasonable sizes while maintaining the fidelity of domaindriven analytics. In particular, multigridbased hierarchical data refactoring is a class of data refactoring that decomposes the original data into series of prioritization classes, where each class contains a number of coefficients, called a coefficient class. Approximations to the original data can be reconstructed by selecting a number of coefficient classes based on accuracy requirements. Thus, hierarchical data refactoring gives both the data producers (e.g., scientific simulations) and consumers (e.g., data analysis routines) the flexibility to store, transport, and access data to satisfy space and/or accuracy requirements.
For example, data sharing can be optimized by intelligently moving coefficient classes through multitieredstorage systems (e.g., storage systems containing nonvolatile memory, magnetic disks, and tapes) and/or networks based on available capacity and bandwidth. Figure 1
shows original simulation data being refactored into five coefficient classes and then shared with data analysis routines via multitieredstorage systems and networks. When the accuracy can be estimated based on the number of selected coefficient classes, users can control the accuracy of the reconstructed data while storing and reading the data. For example, based on userdefined accuracy requirements, information encoded in the first four coefficient classes are enough for data analyses later on, so the fifth coefficient class is ignored. Then, the four coefficient classes can be intelligently shared over the storage systems and network based on their size, available bandwidth/capacities, and accuracy requirements from data analysis routines. In the figure, data analysis routine 1 only needs two coefficient classes to achieve the desired accuracy, while routine 2 needs four. The flexibility of choosing a reduced number of coefficient classes gives users the potential to greatly reduce data movement costs.
As great as the benefits of reduction in data movement and management costs may be, if the decomposition and recomposition routines used by data refactoring are expensive then the total process will not show an advantage. However, Graphics Processing Units (GPUs) have shown great potential to speed up scientific computations that fit their streaming execution model, as they offer high parallel computational power with high memory throughput. As the algorithms involved in multigridbased hierarchical data refactoring are both parallelizable and memory bound, accelerating the routines in data refactoring processes using GPUs is attractive. Furthermore, we have found that GPUbased refactoring also have other benefits for both CPU and GPUbased scientific applications.
For CPUbased scientific applications, even though the data are originally generated or processed on CPUs, we find that it can be costeffective to offload the data refactoring workloads to GPUs when they are available, especially given that fast CPUGPU interconnections such as PCIe and NVLinks are available in modern computing systems. For GPUbased scientific applications, GPUbased refactoring enables those operations to be performed in place on the GPUs, reducing the data movement cost between GPUs and CPUs. Together with emerging technologies such as GPUDirect Storage or GPUDirect RDMA, refactored data can be shared directly among remote GPUs or to/from the storage systems without the added costs of transferring data back to the host memory.
In this work, we focus on accelerating the two major routines, decomposition and recomposition, in multigridbased data refactoring on GPUs and evaluating the benefit for producer and consumer applications. Although the multigridbased algorithms are naturally parallelizable, achieving good performance faces the usual dual challenges on GPUs of minimizing the memory footprint and simultaneously improving the memory access efficiency. Specifically, our contributions are as follows:

We document in §III the first multigridbased data refactoring routines for modern GPU architectures that can potentially help build systems to reduce I/O pressure for a variety of scientific applications and workflows;

We design a series of systematic optimizations for multigridbased data refactoring in three levels: instruction level, kernel level, and programstructure level. Our optimization can balance both minimizing the memory footprint and improving the memory access efficiency, as seen in §III;

In §IV, we evaluate our designs on both a consumerclass desktop and the Summit supercomputer at Oak Ridge National Laboratory (ORNL) and achieve 110 and 330 speedups compared with the stateoftheart CPU designs. We achieve 45.42 TB/s throughput using 4096 NVIDIA Tesla V100 GPUs on Summit;

Finally, in §V, we showcase our work using two common scenarios in scientific computing: 1) reducing data movement costs between scientific simulations and in situ visualization applications; and 2) speeding up lossy compression for scientific data.
Ii Background: Multigridbased hierarchical data refactoring
We use the algorithms of Ainsworth et al. [1, 3, 2, 4] to demonstrate our optimizations for multigridbased data refactoring on GPUs. These algorithms support nonuniformlyspaced structured multidimensional data, which are commonly found in scientific computations.
These multigrid methods use hierarchical representations to approximate data. Specifically, they decompose data from fine grid representation to coarse grid representation in an iterative fashion, with a global correction to account for the impact of missing grid nodes in each iteration. If we use functions to represent the discrete values continuously, the decomposition from a fine grid level to a coarser grid level can be formulated with notations in Table I as follows,
(1) 
where is the piecewise linear function that takes the same values as original data for each node, and are the function approximations of at levels and , respectively, is the difference between the values of the fine grid nodes at level and their corresponding piecewise linear approximation, and is the global correction. According to Eq. (1), two major steps are involved at each level of the multigrid decomposition: 1) compute coefficients for the current multigrid level ; and 2) compute global correction and add it to the nodes in the next coarse grid (level ). In what follows, we introduce how to compute coefficients and corrections in details.
Symbol  Description 

Function represented by the original data.  
Nodes at grid level .  
Coefficients at grid level .  
Function space with respect to .  
The projection onto .  
The piecewise linear interpolant in space . 
Ii1 Compute coefficients
Mathematically, the coefficients store the difference between the data approximated by the nodes at level (i.e., ) and (i.e., ) before corrections are added. Since is also contained in and they have the same value for nodes in for both levels, the nonzero differences only occur on nodes in . Figure 2 shows how coefficients are calculated along one dimension through linear interpolation. It can be generalized to multidimensional cases easily by using multilinear interpolations for approximation.
Ii2 Compute correction
As proven in [1], the correction is the orthogonal projection of the calculated coefficients at grid level onto , so adding correction to the next coarse grid can better approximate data in current grid. To explain, we first define as the correction for grid at level . According to Eq. (1) we have the observation that:
(2) 
If we apply projection at grid level (i.e., ) to both sides of Eq. 2, it leads to a zero function since it belongs to . Also, since is in , . So, we can see that is the orthogonal projection of the coefficients onto . Namely, . So, the correction can be computed by solving a variational problem: find such that for all . Then, can be found by solving linear systems where
is the result of a tensor product of mass matrices of each dimension
^{1}^{1}1Definition of mass matrix in finite element method: https://demonstrations.wolfram.com/MassMatrixComputationInTheFiniteElementMethod, i.e., , where is the total number of dimensions, andis the load vector, which can be calculated using:
where is a transfer matrix that coverts basis functions from to and is the coefficient matrix at level , which consists of computed coefficients at and zeros at . In terms of dimensions, we take a 2D data set for an example. Assume the grid at level has nodes and grid at level has nodes. Then, has dimensions of . So, vectorized has dimensions of . has dimensions of . has dimensions of . has dimensions of . When solving , has dimensions of and has dimensions of . After devectorizing , the correction has dimensions of , which is added to the nodes at level .Overall process of decomposition and recomposition
We demonstrate this process of decomposing and recomposing using a 2D data in Figure 3. The lefthandside represents the full, original data set, while the righthandside is the refactored representation. The decomposition process moves from left to right along the top in the figure (i.e., finest gird to coarsest grid). There are four steps to be carried out, the coefficient and correction operations (II.1 and II.2) for each of the two levels. For multidimensional data, the computation of correction is done by working on each dimension in a prescribed order. For the 2D data in our example, the calculation proceeds along the rows first and then the columns. We will future explain this in Algorithm 3. The recomposition process moves from the right to the left along the bottom in the figure (i.e., coarsest grid to finest gird). Similar to decomposition, there are four total stages, but they occur in the reverse order. First, the corrections are calculated using coefficients and used to undo the corrections (i.e., subtract values) on the next coarser grid to restore their nodal values. Then, the coefficients are restored back to their original values by adding the interpolated values from the next coarser grid.
Iii Designing GPUaccelerated data refactoring
We now discuss the design details of our GPUaccelerated multigridbased hierarchical data refactoring. We first focus on the optimizations for each computing kernels involved in both the decomposition and recomposition processes. We classify the computing kernels into two categories and propose two optimized frameworks that can help optimize the two kinds of kernels on GPUs. Following the kernellevel optimization, we discuss optimizations to help each of the kernels efficiently work together so that their performance can be maximized. Finally, we discuss design details about how to efficiently use these kernels to build data refactoring tools for multidimensional data.
Iiia Designing optimized GPU multigrid kernels
There are five major computing kernels involved in the decomposition and recomposition process: 1) computing coefficients; 2) restore from coefficients; 3) mass matrix multiplication; 4) transfer matrix multiplication; and 5) correction solver. Based on their computation pattern, we can classify them into two categories: grid processing kernels and linear processing kernels. So, instead of designing optimizations for each kernel individually, we choose to optimize them by their type. The goal is to define optimizations for each processing type that generally work for each kernel belonging to that type. To accomplish our goal, we design optimized frameworks for the two processing types. A framework in our context is a design template of the computing kernel for a specific type. To actually design a computing kernel, we can follow the template to outline the general structure of the kernel and incorporate computing logics (e.g., call device functions) that are specific to the operations that kernel focuses on. One major benefit to designing each kernel this way is that optimizations for the common overall structure of the kernel can be separated from the specific computing logic of each kernel. We first focus on designing kernel frameworks. The major challenge when designing optimized multigrid kernels is balancing efficient parallelism and memory footprint, as sometimes optimal designs for each metric are not compatible. For example, outofplace processing usually bring better parallelism, but can cause a larger memory footprint. We develop balanced designs that can help achieve high performance with low memory footprint.
IiiA1 Grid processing framework
Two operations directly related to coefficients are computing coefficients and restoring the original nodal values using coefficients. The former is used in decomposition and the latter is used in recompositions. Although they have opposite effects, they share similar computational patterns. We focus on the design of a grid processing kernel framework that can be applied to both kernels. Grid processing has the following characteristics:

The computation is usually applied to the whole grid at a time;

A subset of nodes are used to update the nodal value of a disjoint subset of nodes in the grid;

Each update operation (e.g., interpolation) can be performed independently without dependencies, and operations can be different from each other;

The results need to be stored inplace.
The main calculation is to compute the interpolation at nodes in using nodal values in . The parallelization can favor either interpolation operations (i.e., parallelism ) or accessing nodal values (i.e., parallelism ). The former can lead to less thread divergence, while the latter can achieve higher memory access efficiency. The computation of coefficients is a memory bound operation as it needs to access nodes with a total of calculations (the total number of operations involving each node in is ). Therefore, it is essential to optimize in favor of memory access efficiency instead of computation. Thus, we choose to parallelize against the total number of memory accesses instead of the number of operations. Figure 4 shows the grid processing framework. As nodes in need to be shared among several neighbors during interpolation operations, we let each thread block coordinate work on a 3D block of data (or 2D block for 2D data) and use shared memory as a scratch space. We organize threads such that threads in the same warp load values that are consecutive in memory to achieve efficient coalesced memory access patterns. For computing, it is easy to have thread divergence if we assign threads to work on interpolation the way we load nodal values, since interpolation operations that are close to each other are likely to be different. Although choosing large block sizes can reduce thread divergence, it may cause the total number of threads to exceed the maximum allowed on a streaming multiprocessor (SM) or make the SM underutilized. Therefore, we apply a thread reassignment strategy that assigns threads in the same warp to compute the same interpolation type in the same direction to achieve divergencefree execution. Algorithm 1 shows how we calculate the threadinterpolation operation assignments that can eliminate thread divergence.
IiiA2 Linear processing kernel framework
Three steps directly related to the calculation of corrections require linear processing: mass matrix multiplication, transfer matrix multiplication, and correction solver. Linear processing has the following characteristics:

When applied to multidimensional data, this kind of operation can be sequentially applied one dimension at a time until all dimensions have been applied;

When applying to each dimension, the operation needs to be individually applied to each vector along that dimension that consists of nodes in the current grid level;

When applying to a vector, the calculation on each node depends on neighboring nodes so that they have to be processed in a certain order;

The results need to be stored inplace.
Similar to the calculation of coefficients, we consider optimizing both the parallelism and the memory footprint in a balanced way. There are two straightforward ways to parallelize the task at two different granularities: vectorwise and elementwise. Vectorwise parallelism is achieved by assigning the workload related to different vectors to different threads. Elementwise parallelism is achieved by assigning calculations related to different vector elements to different threads. However, neither of them satisfies our need for a balanced design. Vectorwise parallelism brings no extra memory footprint but it would suffer from low parallelism for small grids. Elementwise parallelism can exploit maximum parallelism but it brings a 100% extra memory footprint due to data dependencies.
Hence, we develop a novel linear processing algorithm that balances both minimizing extra memory footprint and maximizing parallel efficiency. We first parallelize the vectors by assigning a batch to a thread block. Since the update of each node depends on its neighboring elements, we use shared memory as scratch space to avoid polluting the unprocessed nodes. Specifically, we let each thread block iteratively work on a segment of coefficient vectors at a time until the whole coefficient vector is updated. Thus, as shown in Figure 6, during the computation we divide the coefficients in the vectors into six regions: 1) the processed region stores updated coefficients (gray dots); 2) the main region consists of coefficients that the current iteration is working on (green dots); 3) and 4) due to dependence on the neighboring coefficients, the original value of coefficients in the two ghost regions (red and cyan dots) are needed to update the coefficients in the main region; 5) for better streaming processor utilization, we prefetch coefficients needed for the next iteration (purple dots); and 6) we mark the unprocessed region as block dots. The regions move forward as the computation proceeds. One challenge to design the algorithm is to simultaneously consider maximizing coalesced global memory access patterns, minimize bank conflict in accessing shared memory, and minimize thread divergence. We use a dynamic coefficientthread assignment strategy [9, 26] so that both the accessing and computation of coefficients are optimized. Figure 5 shows the structure of the algorithms we use to implement our linear processing framework kernels based on the above requirements.
IiiB Designing kernels based on frameworks
To show how to design optimized kernels using grid and linear processing framework, we use mass matrix multiplication kernel as an example (other kernels have similar structure so they are omitted due to page space limit). Algorithms 2 shows the mass matrix multiplication plugin device function for the linear processing framework. We can see by using our optimized processing frameworks, we can decouple the design of logic of specific operations from handling memory accesses, while at the same time maintaining high parallelism and memory access efficiency. Also, by using the framework, we minimize the extra memory footprint needed for each kernel. First of all, there is no extra memory footprint for computing coefficients and restoring kernel that uses grid processing framework. Also, there is no extra memory footprint for two of the three kernels that use linear processing framework: mass matrix multiplication and transfer matrix multiplication. This is because our linear processing framework naturally avoids adding extra memory and for the mass and transfer matrices we store the matrix elements implicitly by computing in demand. The solve correction kernel needs an extra memory space per dimension, as the elements in updated main diagonal cannot be efficiently computed during the backward substitution process. This kernel still requires minimum extra memory. For example, with a 100100100 grid, the extra memory footprint is only about 0.3%.
IiiC Handling strided memory accesses in grids
For GPU memory, strided memory accesses decrease the memory access efficiency by a factor of the stride length. Since multigridbased data refactoring routines are memory bound, maintaining high memory access efficiency is especially important for overall performance. There are two kinds of strided memory accesses in the decomposition and recomposition routines. The first is caused by the nature of the linear storage pattern in memory. Only consecutive nodes in the leading dimension are stored in consecutive memory addresses. When working on consecutive nodes in other dimension, it would lead to strided memory accesses. The second is caused by the nature of the multigrid structure. As mentioned before there are
nodes in the grid at level during decomposition or recomposition. Therefore, each node is steps away for its neighboring nodes in each dimension. When working on the grid with a level smaller than (finest grid), accessing consecutive nodes in the grid would lead to strided memory accesses. The stride length increases exponentially as it equals to .To minimize strided memory accesses, we apply two optimization techniques. The first kind of strided memory access can impact both grid and linear processing kernels. We pack the nodes specific to a level of the grid in the working memory space to make the stride always equal to one. For the correction calculation, since there is working memory space for storing intermediate results (will be explained in detail in the next section), we can pack nodes in the working memory space. For coefficient related calculations, we can reuse the working memory space for the correction calculation for node packing. The memory space needed for packing nodes will be no larger than memory used for storing intermediate results for the correction calculation; it brings no extra memory footprint. The second kind of strided memory access can mainly impact linear processing kernels, as we can easily adjust the thread block size of the grid processing kernels to minimize the impact. In the case of 3D grids, strided memory accesses occur only when processing vectors along the second or third dimension. Due to our computeaccess decoupled design, our linear processing framework can be easily adapted to avoid strided memory accesses when processing on the second or third dimension. The key idea is to always batch vector processing on the xy plane for the second dimension and the xz plane for the third dimension. Due to space limitation, the detailed design of this framework is omitted.
IiiD Overall algorithms
Algorithm 3 shows how we use our optimized kernels to build data refactoring routines for multidimensional data on GPUs. For each level, the computed coefficients are also used for calculate corrections. This process involves altering the values of coefficients. So, to preserve the value of previously computed coefficients, the correction is computed in a working memory space. Specifically, during decomposition the computed coefficients is copied from working memory space to input/output memory space (line 5) before they are used for computing global corrections (line 610). Similarly, during recomposition the coefficients are first loaded to working memory space (line 16) for correction calculation (line 1721), so that they will not be corrupted when used to restore current level nodal values (line 25). The size of working memory space is equal to the original input size, which is the same for original CPU design [21] and our GPU design. In this work, we propose three kinds of optimization specifically for our GPU design: 1) As mentioned in Section III.B, we use the working memory space to pack nodes in a more condensed way to achieve better memory access efficiency for correction calculation without bringing extra memory footprint compared with the CPU design. The operation of packing and unpacking nodes are fused with node copy, applying corrections, undo corrections operations as marked in Algorithm 3. 2) When the working memory space is not used by correction calculation, we use the space to improve the memory access efficiency of coefficient calculation. Since coefficient calculation and correction calculation have to be done in order, they will not cause conflicts when sharing the same working memory space. 3) For linear processing kernels in correction calculations, we use the 2D design to build both 2D and 3D data refactoring routines since correction calculations can be done by calculating along each dimension at a time as mentioned in Section II. When working on 3D input, limited by the GPU memory space, each 2D slice of the data is usually much smaller compared with 2D input, so it is anticipated those 2D linear processing kernel would cause GPU underutilization. As processing different 2D slices for 3D input can be performed independently, we use CUDA streams to improve GPU utilization. This allows multiple 2D slices to be processed at the same time when there are enough GPU resources.
Iv Experimental Evaluation
We evaluate our work on two GPUenabled platforms. One is a GPUaccelerated desktop with an NVIDIA RTX 2080 Ti GPU with 11 GB of memory and one 8core Intel i79700K CPU with 32 GB of memory. The other is the Summit supercomputer at ORNL. Each node on Summit is equipped with 6 NVIDIA Tesla V100 GPUs with 16 GB memory on each GPU and two 22core^{2}^{2}2Only 21 cores/socket are accessible for computation. IBM POWER9 CPUs with 512 GB memory.
In our evaluation, we use datasets generated from the GrayScott ReactionDiffusion simulation [23, 13]. Each node in the input grid data is represented as double precision floating point numbers. Note that our data refactoring algorithms have deterministic computation time complexity regardless of the values in the chosen dataset, so it will yield the same performance for any datasets at have the same dimensions and size. We configure the simulation codes to generate 3D data so that each dimension is in the form of , where is an positive integer and is not necessary the same for all dimensions. If at least of one dimension is not in this form, one extra preprocessing step and the corresponding postprocessing step are necessary, which consist of one iteration of special decomposition and recomposition. Since the extra one iteration for preprocessing and postprocessing contributes a small portion of the total execution time, in order to better show the results of our optimizations on the main decomposition/recomposition loop we avoid those step in our tests by generating data with dimensions that follows the form of . To simplify, we let each dimension to have the same length. The 2D datasets are obtained from taking 2D slices of the 3D datasets.
In our experiments, we use the CPUbased multigridbased data refactoring implementations integrated in the MGARD lossy compression software [21] as our baseline.
Iva Evaluation at kernel level
We first show the performance improvement we achieve from accelerating several major kernels in multigridbased data refactoring on GPUs. Table III and II show the maximum, minimum, and average performance improvement on the Summit supercomputer and the GPUaccelerated desktop. Note that since computing coefficients and restoring from coefficients are almost exactly the same computation pattern except minor operations, we only show the performance of computing coefficients in our evaluation. We can see that since the computing coefficients, mass matrix multiplication, and transfer matrix multiplication tend to be easily parallelizable, they achieve considerable performance improvement compared with serial CPU implementations. Solving corrections is naturally less parallelizable, so it is expected to have less improvements. Comparing 2D and 3D, since the three kernels used for calculating the correction in 2D data are reused for 3D, the only kernel unique to 3D is the computation of coefficients. It is anticipated to see the speedup for the 3D case to be lower than for the 2D case because each thread block for calculating the 3D coefficients has much longer latency due to the interpolation on one extra dimension and fewer concurrent thread blocks due to the high thread block resource usage. Another factor limiting the maximum speedup for the 3D case is that available GPU memory prohibits us from testing on larger data sizes that can leads to similar parallelism compared with 2D.
As our linear processing framework involves complicated designs, we study the mass matrix multiplication kernel designed following the framework to show how our optimization impacts performance. Figure 7 shows the memory throughput of mass matrix multiplication as different optimizations are applied. Specifically, we show the performance of mass matrix multiplication kernel when decomposing a 40974097 grid of input data, which needs 12 levels of decomposition. We can see the original serial CPU version suffers from degraded performance due to inefficient memory access when grid spacing is large (level is small). This is also the case for the naive GPU design. The naive GPU design parallelizes the workload vectorwise [5] without applying memory access efficiency optimizations proposed in this work. When mass multiplication kernel are designed following our linear processing framework we can see it can achieve much better performance and can sustain similar performance when grid spacing is large (i.e., is small) and only degrades on small grids, which bring minor impacts to the overall performance.
Grid Size  Kernel  Max  Min  Avg. 

5–513  Comp. Coefficients  158.88x  7.67x  64.00x 
5–8193  Comp. Coefficients  774.97x  47.07x  316.76x 
Mass Matrix Mult.  2405.83x  219.42x  1154.65x  
Trans. Matrix Mult.  791.32x  51.15x  406.82x  
Solve Correction  506.46x  77.52x  317.01x 
Grid Size  Kernel  Max  Min  Avg. 

  Comp. Coefficients  440.04x  9.95x  189.04x 
  Comp. Coefficients  2919.47x  61.08x  1045.16x 
Mass Matrix Mult.  2141.62x  203.89x  1139.05x  
Trans. Matrix Mult.  1949.72x  166.81x  950.46x  
Solve Correction  329.72x  154.20x  250.04x 
IvB Evaluation on overall data refactoring routines
IvB1 Overall time breakdown
To show the absolute time taken by each kernel and how much each of them contributes to the overall time, we profiled the endtoend time breakdown for both decomposition and recomposition on both 2D and 3D data on CPU and GPU as shown in Table IV. We can see that besides the major kernels we have focused on, other operations such as memory copies also take considerable time. For example, memory copies take about 23.2%40.1% of the total time on CPU. The memory copy is part of the algorithm used to copy computed coefficients and the derived correction in and out from the scratch memory space and they cannot be avoided. This is also true for GPU implementation, but we are able to take advantage of the scratch memory to optimize memory access efficiency for each kernel. For example, by packing nodes in a more condensed way onto the scratch memory, it enables much greater performance gain for computing kernels.
IvB2 Optimizing resource utilization for 3D data
Next, we evaluate the optimizations specialized for 3D data. We parallelize the 2D kernels against the third dimension through the use of CUDA streams that range from one stream (baseline) to 64 CUDA streams. We show the speedups of using different numbers of streams compared with using one stream with the largest possible data size 513513513 on one NVIDIA Tesla V100 GPU on a Summit computing node. As shown in Figure 8, up to 2.6 and 3.2 speedups are achieved by using eight CUDA streams for data decomposition and recomposition.
Dims.  2D ()  3D ()  
Ops.  Decomposition  Recomposition  Decomposition  Recomposition  
Dev.  Serial CPU  GPU  Serial CPU  GPU  Serial CPU  GPU  Serial CPU  GPU  
Time(s)  %Total  Time(s)  %Total  Time(s)  %Total  Time(s)  %Total  Time(s)  %Total  Time(s)  %Total  Time(s)  %Total  Time(s)  %Total  
CC  2.50  16.6%  3.57e3  7.4%  2.08  11.6%  3.85e3  7.1%  4.36  17.0%  2.98e2  4.7%  4.65  15.1%  2.30e2  3.2% 
MM  3.16  20.9%  5.44e3  11.2%  3.17  17.6%  5.47e3  10.1%  5.63  21.9%  5.99e2  9.5%  5.70  18.6%  5.97e2  8.4% 
TM  2.87  19.1%  4.95e3  10.2%  2.87  16.0%  4.99e3  9.2%  5.18  20.1%  7.02e2  11.1%  5.23  17.0%  7.20e2  10.1% 
SC  2.64  17.5%  1.63e2  33.8%  2.64  14.7%  1.66e2  30.7%  4.55  17.7%  3.18e1  50.4%  4.61  15.0%  3.22e1  45.0% 
MC  3.90  25.9%  5.55e3  11.5%  6.68  40.1%  4.83e3  19.8%  5.98  23.3%  6.75e3  1.1%  8.37  34.3%  6.21e3  1.9% 
PN      1.25e2  25.9%      1.25e2  23.1%      1.47e1  23.2%      2.25e1  31.4% 
CC: Calculation of coefficients; MM: Mass matrix multiplication; TM: Transfer matrix multiplication; SC: Solve for corrections; MC: Memory copy; PN: Packing nodes. 
IvB3 Single node performance
Table V shows the overall performance improvement achieved using one GPU compared with one CPU core on our GPUaccelerated desktop and the Summit supercomputer. On the GPUaccelerated desktop, our designs achieve up to 102.31 and 116.52 performance improvement for decomposition and recomposition on 2D data. For 3D data, our designs achieve up to 64.03 and 91.13 performance improvement for decomposition and recomposition. On Summit, our designs achieve up to 311.18 and 330.76 speedup for decomposition and recomposition on 2D data, respectively. For 3D data, our designs achieve up to 103.41 and 135.92 performance improvement for decomposition and recomposition. We also show the extra memory footprint used by our GPU optimization compared with the original CPU designs. We can see our designs only bring equal or less than 6% extra memory footprint for smallsized inputs and less than 1% extra memory footprint for large inputs. Table VI also shows the overall performance improvement achieved by parallelizing data refactoring routines among all available GPUs compared with all CPU cores on the desktop and a single computing node on Summit.
Dims.  Input  GPUacc. desktop  Summit@ORNL  Extra Mem. Footprint*  

Decomp.  Recomp.  Decomp.  Recomp.  
2D  0.40x  0.39x  0.30x  0.30x  6.06%  
1.00x  0.92x  0.79x  0.87x  3.08%  
2.30x  2.54x  2.29x  2.29x  1.55%  
6.39x  6.41x  6.80x  6.83x  0.78%  
16.00x  15.22x  19.46x  20.13x  0.39%  
31.26x  33.21x  50.19x  51.05x  0.20%  
57.94x  64.92x  108.77x  110.92x  0.10%  
85.40x  97.29x  217.97x  219.77x  0.05%  
102.31x  116.52x  311.18x  330.76x  0.02%  
3D  2.08x  2.61x  1.14x  1.33x  0.28%  
7.59x  9.57x  4.03x  4.83x  0.07%  
20.11x  24.41x  16.20x  19.13x  0.02%  
36.95x  54.07x  52.06x  65.22x  0.05‰  
64.03x  91.13x  103.41x  135.92x  0.01‰  
*Extra memory footprint compared with the original CPU designs. 
GPUaccelerated desktop  Summit@ORNL  
2D:  2D:  
3D:  3D:  
Decomp.  Recomp.  Decomp.  Recomp.  
2D:  12.79x  14.57x  44.45x  47.25x 
3D:  8.00x  11.39x  14.77x  19.42x 
IvB4 Multinode performance
To show the potential of using GPUaccelerated data refactoring in largescale scientific applications, we conduct weak scaling test on Summit. For largescale parallelization, we choose to parallelize the workload by assigning each GPU an equal sized data partition and do decomposition and recomposition independently. Due to the nature of multigridbased data refactoring, parallelizing the workload this way brings great largescale performance with negligible impact on decomposition and recomposition results. We assign each GPU to one MPI process and do GPUbased data refactoring on 1 GB of simulation data. We scale the number of processes (GPUs) up to 4096 in our tests with 4 GPUs per computing node on Summit. As shown in Figure 9, with 4096 GPUs we achieve 45.42 TB/s and 40.45 TB/s aggregated throughput for decomposition and recomposition for 2D, respectively; for 3D data, the corresponding numbers are 17.78 TB/s and 19.86 TB/s. These numbers show great potential in speeding up data refactoringbased I/O operations, especially when used in combine with emerging new technologies that help move data in or out of the GPU efficiently, such as GPUCPU NVLink and GPUDirect Storage.
V Showcase
Data refactoring algorithms were designed to offer much greater flexibility when managing large scientific data than the traditional methods. With welldesigned data management, data can be shared between scientific applications more intelligently with a large reduction in I/O costs. However, inefficient data refactoring routines can diminish the benefits brought by data refactoring itself. Here we use two examples to show the benefits of GPUbased data refactoring over the CPU designs.
Va Visualization workflow
First we show how our GPU optimizations can make data refactoring effective when used for I/O cost reduction in scientific workflows that rely on filebased data sharing. Figure 10 shows the cost of writing and reading a 4 TB simulation data file using 4096 and 512 processes using the stateoftheart ADIOS I/O library [20] on Summit with GPUaccelerated data refactoring enabled. By writing or reading fewer coefficient classes, we can see immediate cost reduction in file write and read. When our efficient GPUaccelerated data refactoring is used, we can see this reduction in the cost of file write and read can be effectively translated into a reduction in the total I/O cost. Although multigridbased data refactoring allows us to encode the most important information in the data with a few coefficient classes, it would not reduce the total I/O cost unless those coefficient classes can be efficiently computed or used for data recovery. For example, in our experiments we achieve 95 accuracy for a chosen feature in the visualization result (i.e., the total area of the isosurfaces) with only three out of ten coefficient classes. This can be effectively translated into 66 I/O cost reduction.
VB Lossy compression
Multigridbased hierarchical data refactoring can also be used as a preconditioner in scientific lossy compression software. As one of the key components in lossy compression workflows, it is important to have efficient data refactoring in order to make fast lossy compression possible. We showcase how our GPUaccelerated data refactoring can help improve the performance of lossy compression workflows in the MGARD lossy compression software. MGARD is a CPUbased lossy compressor with three components in its workflow: multigridbased data refactoring, quantization, and entropy encoding. Figure 11 show the time breakdown of the each component in MGARD [21] when data refactoring remains on the CPU (left bars) or is offloaded to the GPU (right bars). In our test, besides the data refactoring process, we also offload the quantization and dequantization processes to the GPUs, since it can help reduce the GPUCPU data transfer cost. The entropy encoding stage (ZLib lossless compression) is kept on the CPU. We can see our GPUaccelerated data refractoring can greatly reduce the overall execution time of the lossy compression workflows.
Vi Related Works
In this section, we present a brief survey of existing literature that focuses on optimizing multigrid methods for GPU architectures.
Multigrid methods are effective solvers commonly used for a wide range of problems in numerical analysis. In order to improve the computing efficiency of multigrid methods, many existing studies attempt to design new parallel algorithms for multigrid methods to take advantage of GPUs’ capability to do massively parallel operations. Particularly, these algorithms need to expose sufficient finegrained parallelism that can be properly mapped to the GPU architectures. For example, a parallel algebraic multigrid method is developed in [6], which targets GPU devices and exposes substantial finegrained parallelism in both the construction of the multigrid hierarchy and the cycling or solve stage. Similarly, [11] shows that it is possible to significantly accelerate algebraic multigrid methods on GPUs by carefully selecting algorithms with sufficient finegrained parallelism. By combining graph coloring and greedy maximal independent set computations, [27] significantly improves the performance of a 3D unstructured geometric multigrid solver. In [25], a multiGPU implementation of Krylov subspace methods with algebraic multigrid preconditioners is proposed, which preserves the effects of finegrained parallelism with shared memory on the GPU, while distributing data across multiple GPUs with minimal communication. To accelerate lattice QCD multigrid on GPUs, [10] leverages all sources of parallelism that the underlying stencil problem possesses to achieve high efficiency even for the coarsest of grids.
There are some studies that focus on evaluating and optimizing the implementation of parallel multigrid algorithms for GPU architectures from a software engineering perspective. For instance, [29] analyzes the performance of an optimized GPUbased implementation of the geometric multigrid method on different stateoftheart NVIDIA GPUs. Similar work also includes [24, 7, 12, 30, 18]. [14] takes portability into account and provides a unified user interface so that optimized multigrid solvers can run on diverse platforms including multicore CPUs and GPUs. Similarly, [32] studies a variety of optimization techniques for geometric multigrid on different multi and manycore processors. In [5], a compilerbased autotuning framework is studied which delivers performance portability across CPU and GPUaccelerated platforms for the geometric multigrid linear solvers found in many scientific applications. A library called AmgX is designed and implemented in [22], which provides dropin GPU acceleration of distributed algebraic multigrid and preconditioned iterative methods. In [16], a classical algebraic multigrid solver and a smoothed aggregation algebraic multigrid solver are implemented on NVIDIA GPUs. [17] presents a CUDA implementation of the parallel multigrid solver for linear complementary problems. Besides CUDA, multigrid solvers implemented using OpenCL are also proposed, such as the implementation for a multigrid gradient vector flow computation presented by [28].
Comparing with existing multigrid solvers, the data refactoring approach that we aim to optimize in this work share similar settings such as multiple interlocking grids structure. However, the multigridbased data refactoring is designed and optimized targeting a different goal compared with multigrid solvers. Multigrid solvers aim to accelerate the processing of solving liner systems while multigridbased data refactoring approach aims to reconstruct the scientific data progressively with hierarchical representations. This leads to two fundamental differences that make existing optimizations hard to be applied to our work: 1) although some operations in data refactoring are similar to the ones in multigrid solvers, the operations in data refactoring are composed in a unique way; 2) the correction in this work is designed specifically for the orthogonal projection in data refactoring, while the correction in multigrid solvers is used to generate the fine grid solution. In addition, our optimizations are proposed specifically for handling largevolume scientific data, which means we need to consider not only limited GPU memory but also cases where refactoring process can share resources with original scientific computations on GPUs. So, it is essential to optimize for low memory footprint. As performance is also important, we need optimizations that balance both parallelism and memory footprint. Although part of the kernels used in data refactoring share similar computation patterns as the ones used in multigrid solvers, it is challenging to directly leverage existing works to achieve good parallelism and memory footprint balance when used in data refactoring. For example, [5] only uses vectorwise parallelism without inplace update, which can cause lower performance with large memory footprint for our data refactoring. We observe this performance degradation in our test as shown in Figure 7.
Vii Conclusion
As I/O becomes one of the major performance bottleneck for scientific computing, data refactoring shows great potential to reduce I/O costs. In this work, we used GPUs to accelerate multigridbased hierarchical data refactoring for scientific data using highly optimized data refactoring kernels specialized for scientific data refactoring. We evaluated our designs on two platforms including the world leadershipclass supercomputer Summit at ORNL. Compared with the startoftheart CPU design, our GPU version can speed up the data refactoring process by up to 330.76 with 45.42 TB/s aggregated throughput on 4096 GPUs. Finally, we showcased our work using a largescale scientific visualization workflow and the MGARD lossy compression technique. Together, these results demonstrate that scientists have another opportunity for dealing with their high data throughput requirements. Inline refactoring of scientific data can offer performance improvements and temporal fidelity that can benefit a number of science scenarios.
References
 [1] (2018) Multilevel techniques for compression and reduction of scientific data—the univariate case. Computing and Visualization in Science 19 (56), pp. 65–76. Cited by: 3rd item, §I, §II2, §II.
 [2] (2019) Multilevel techniques for compression and reduction of scientific data—quantitative control of accuracy in derived quantities. SIAM Journal on Scientific Computing 41 (4), pp. A2146–A2171. Cited by: 3rd item, §I, §II.
 [3] (2019) Multilevel techniques for compression and reduction of scientific data—the multivariate case. SIAM Journal on Scientific Computing 41 (2), pp. A1278–A1303. Cited by: 3rd item, §I, §II.
 [4] (202001) Multilevel techniques for compression and reduction of scientific data—the unstructured case. in preparation. Cited by: 3rd item, §I, §II.
 [5] (2017) Compilerbased code generation and autotuning for geometric multigrid on gpuaccelerated supercomputers. Parallel Computing 64 (C). Cited by: §IVA, §VI, §VI.
 [6] (2012) Exposing finegrained parallelism in algebraic multigrid methods. SIAM Journal on Scientific Computing 34 (4), pp. 123–152. Cited by: §VI.

[7]
(2013)
Parallel unsmoothed aggregation algebraic multigrid algorithms on gpus.
In
Numerical Solution of Partial Differential Equations: Theory, Algorithms, and Their Applications
, pp. 81–102. Cited by: §VI.  [8] (2004) Numerical study of neoclassical plasma pedestal in a tokamak geometry. Physics of Plasmas 11 (5), pp. 2649–2667. Cited by: §I.
 [9] (2019) TSM2: optimizing tallandskinny matrixmatrix multiplication on GPUs. In Proceedings of the ACM International Conference on Supercomputing, pp. 106–116. Cited by: §IIIA2.
 [10] (2016) Accelerating lattice qcd multigrid on gpus using finegrained parallelization. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’16, pp. 68:1–68:12. Cited by: §VI.
 [11] (2012) GAMPACK (gpu accelerated algebraic multigrid package). In 13th European Conference on the Mathematics of Oil Recovery, Cited by: §VI.
 [12] (2014) Numerical study of geometric multigrid methods on cpugpu heterogeneous computers. Advances in Applied Mathematics and Mechanics 6 (1), pp. 1–23. Cited by: §VI.
 [13] GrayScott Simulation Code. Note: https://github.com/pnorbert/adiosvm/tree/master/Tutorial/grayscott[Online; accessed 2019] Cited by: §IV.
 [14] (2012) Parallel smoothers for matrixbased geometric multigrid methods on locally refined meshes using multicore cpus and gpus. In Facing the Multicore  Challenge II: Aspects of New Paradigms and Technologies in Parallel Computing, R. Keller, D. Kramer, and J. Weiss (Eds.), pp. 158–171. Cited by: §VI.
 [15] (2020 (accessed April 21, 2020)) HPSS. External Links: Link Cited by: §I.
 [16] (2015) Accelerating algebraic multigrid solvers on nvidia gpus. Computers and Mathematics with Applications 70 (5). Cited by: §VI.
 [17] (2015) Multigrid method for linear complementarity problem and its implementation on gpu. IAENG International Journal of Applied Mathematics 45 (3), pp. 193–197. Cited by: §VI.
 [18] (2013) A geometric multigrid solver on gpu clusters. In GPU Solutions to Multiscale Problems in Science and Engineering, pp. 407–422. Cited by: §VI.
 [19] (2009) Fullf gyrokinetic particle simulation of centrally heated global itg turbulence from magnetic axis to edge pedestal top in a realistic tokamak geometry. Nuclear Fusion 49 (11), pp. 115021. Cited by: §I.
 [20] (2014) Hello adios: the challenges and lessons of developing leadership class i/o frameworks. Concurrency and Computation: Practice and Experience 26 (7), pp. 1453–1473. Cited by: §VA.
 [21] (2020 (accessed April 21, 2020)) MGARD lossy compression software. External Links: Link Cited by: §IIID, §IV, §VB.
 [22] (2014) AmgX: a library for gpu accelerated algebraic multigrid and preconditioned iterative methods. SIAM Journal on Scientific Computing 37 (5), pp. 602–626. Cited by: §VI.
 [23] (1993) Complex patterns in a simple system. Science 261 (5118), pp. 189–192. Cited by: §IV.
 [24] (2014) GPU acceleration of algebraic multigrid preconditioners for discrete elliptic field problems. IEEE Transactions on Magnetics 50 (2). Cited by: §VI.
 [25] (2015) Multigpu acceleration of algebraic multigrid preconditioners for elliptic field problems. IEEE Transactions on Magnetics 51 (3). Cited by: §VI.
 [26] (2020) ISM2: optimizing irregularshaped matrixmatrix multiplication on gpus. arXiv preprint arXiv:2002.03258. Cited by: §IIIA2.
 [27] (2014) GPU accelerated three dimensional unstructured geometric multigrid solver. In International Conference on High Performance Computing and Simulation, HPCS ’14. Cited by: §VI.
 [28] (2016) Multigrid gradient vector flow computation on the gpu. Journal of RealTime Image Processing 12, pp. 593–601. Cited by: §VI.
 [29] (2015) GPU accelerated geometric multigrid method: performance comparison on recent nvidia architectures. In 19th International Conference on System Theory, Control and Computing, ICSTCC ’15. Cited by: §VI.
 [30] (2013) GPUaccelerated algebraic multigrid for applied cfd. Procedia Engineering 61 (), pp. 1–23. Cited by: §VI.
 [31] (2004) Science with the square kilometer array: motivation, key science projects, standards and assumptions. arXiv preprint astroph/0409274. Cited by: §I.
 [32] (2012) Optimization of geometric multigrid for emerging multi and manycore processors. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’12. Cited by: §VI.