With the ever-increasing complexity of modern scientific research, today's high-performance computing (HPC) applications and advanced instruments are producing extremely large volumes of data in their simulations or experiments. The Hardware/Hybrid Accelerated Cosmology Code (HACC) , for example, can produce 20TB of simulation data in a single run involving hundreds of simulation iterations and trillions of particles.
During the past five years, quite a few outstanding error-bounded lossy compressors have been developed to address the big-data issue; however, their compression/decompression throughput is still far lower than the target performance demanded by many use-cases, such as instrument data compression and in-memory compression. The Linac Coherent Light Source (LCLS-II)  can generate instrument data at a rate of 250GB/s , and these data need to be stored and transferred to a parallel file system (PFS) in a timely manner for post hoc analysis. By comparison, the single-core CPU throughput of existing lossy compressors is generally only 200–400MB/s [25, 16] and the GPU throughput is only 10–66GB/s [28, 5], which has also been verified in our experiments. Another typical example is exascale parallel quantum computing (QC) simulation, which requires a fairly large memory capacity for each run in practice (e.g., 256PB when simulating 50 qubits in double precision). In order to reduce the memory requirement significantly, QC simulation researchers  have developed a method that stores lossy-compressed data in memory and decompresses them whenever needed in the course of the simulation, while suffering considerable overhead in simulation time (even up to 20× in the worst case), which is highly undesirable to users.
In this paper, we focus on how to significantly accelerate both compression and decompression for error-bounded lossy compression while keeping a reasonably high compression ratio, which poses two grand challenges. On the one hand, in order to pursue ultra-high lossy compression/decompression performance, we have to restrict the whole design to super-lightweight operations such as addition, subtraction, and bitwise operations, which is a serious challenge. Specifically, relatively expensive operations such as multiplication and division should be avoided because of their significantly higher cost. All of the existing efficient error-bounded lossy compressors, however, depend on such expensive operations. For instance, SZ2.1 
relies on linear-regression-based prediction, which involves a large number of multiplications to compute the coefficients. Moreover, SZ2.1 relies on a linear-scale quantization to control the user-specified error bound, which involves a division by the error bound on each data point to compute its quantization bin. ZFP  is another state-of-the-art error-bounded lossy compressor, which is designed based on a data transform that also involves a large number of matrix-multiplication operations. On the other hand, how to optimize the performance for different device architectures is very challenging, because it requires fairly detailed and non-trivial performance tuning validated by practical experiments with numerous real-world scientific datasets across different domains.
To address the above serious challenges, we propose a novel, Ultra-Fast error-bounded lossy Compression framework – namely UFZ, which can also be extended/implemented efficiently for different devices such as CPU and GPU. The key contributions are summarized as follows:
We develop UFZ, which comprises only lightweight operations such as bitwise operations, additions, and subtractions. UFZ also strictly controls the compression error within the user-specified error bound, thanks to the elaborate design of its error-control mechanism.
We implement UFZ for CPUs and GPU accelerators, and carefully optimize its performance for each of these device architectures, respectively.
We comprehensively evaluate UFZ by running it with 6 real-world scientific datasets on heterogeneous compute nodes offered by different supercomputers including ORNL Summit and ANL ThetaGPU. We rigorously compare UFZ to two state-of-the-art lossy compressors SZ and ZFP, as well as their GPU versions – cuSZ, cuZFP.
Experiments show that UFZ is 2–5× as fast as the second-best existing error-bounded lossy compressor on CPU and 5–10× as fast as the second-best on GPU, with respect to both compression and decompression. At such a high performance, UFZ still obtains a very good compression ratio (3–12 for the overall compression ratio of each application, and up to 124 for specific fields) with good reconstructed data quality.
The rest of the paper is organized as follows. In Section II, we discuss related work. In Section III, we formulate the research problem. In Section IV, we present the design overview of our ultra-fast error-bounded lossy compression framework. In Section V, we propose optimization strategies for improving the performance on different devices including CPU and GPU. In Section VI, we present and discuss the performance evaluation results. Finally, we conclude the paper with a vision of future work in Section VII.
II. Related Work
Basically, high-speed scientific data compression can be split into two categories – lossless compression and lossy compression – which are discussed in the following text, especially with regard to performance/speed.
High-speed lossless compressors have been developed in particular because of the strong demand for compression performance in many use-cases. Facebook Zstd , for example, was developed particularly for the sake of high performance, with compression ratios very similar to other state-of-the-art lossless compressors such as Zlib  and Gzip . In general, Zstd can be 5–6× faster than Zlib as shown in , and it has been integrated into 80+ production-level software packages, libraries, and platforms. Unfortunately, Zstd supports only lossless compression, which suffers very low compression ratios (1.2–2 in most cases) when compressing scientific datasets that are mainly composed of floating-point values (to be shown later).
High-speed lossy compression has also gained significant attention from compressor developers and scientific researchers. SZ [7, 25, 16] is a typical fast error-bounded lossy compressor, which can reach 200–300MB/s in compression and decompression speed [7, 25, 16]. However, it is still not as fast as expected by quantum computing simulations , so a faster lossy compression method (namely QCZ in the following text) was customized with comparable compression ratios (especially for high-precision compression with a relative error bound of 1E-4 or 1E-5). ZFP  is another fast error-bounded lossy compressor, which is well known for its relatively high compression ratios and fairly high compression speed on both CPU and GPU. Based on our experiments (to be shown later), ZFP and QCZ have comparable compression speed, and they are generally 1.5–2× as fast as SZ. In fact, SZ already has fairly high performance compared with other compressors as demonstrated in the literature: it has performance comparable to FPZIP  and SZauto , and about one to two orders of magnitude higher performance than ISABELA , MGARD , and TTHRESH .
Because of the high demand for ultra-fast error-bounded lossy compressors, a few specific error-controlled lossy compression algorithms have been developed for GPU accelerators in particular, and cuSZ  and cuZFP  are two leading ones. cuSZ is the only GPU-based lossy compressor supporting an absolute error bound for scientific data compression. It was designed based on the classic prediction-based compression model of SZ and significantly optimized for GPU performance by leveraging a dual-quantization strategy  to deal with the Lorenzo prediction. Since ZFP's core stage performs a customized orthogonal data transform that can be executed in the form of matrix multiplication, cuZFP can leverage high-performance CUDA libraries to reach a very high throughput. cuZFP, however, does not support error-bounded compression but only fixed-rate compression, which suffers from very low compression ratios, as verified in .
In comparison with all the above related works, our proposed UFZ is about 2–5× as fast as the second-fastest lossy compressor (ZFP) on CPU and 2–10× as fast as the second-fastest (cuSZ) on GPU, also with relatively high compression ratios (3–12 depending on the user's error bound).
III. Problem Formulation
In this section, we formulate the research problem we focus on in this paper: optimization of the error-bounded lossy compression/decompression performance with as high a compression ratio as possible. Specifically, we are given a scientific dataset (denoted by D) composed of N data values, each denoted by d_i, where i = 1, 2, 3, …, N. The objective of our work is to develop an error-bounded lossy compressor with ultra-high performance in both compression and decompression on both CPU and GPU, while strictly respecting the user-required error bound, which can be represented as the following formula:

maximize CT and DT, subject to |d_i − d'_i| ≤ e, for i = 1, 2, …, N,  (1)

where CT and DT represent the compression throughput and decompression throughput, respectively; d_i and d'_i denote the original data value and the decompressed data value in the dataset, respectively; e is the user-specified absolute error bound; and CR means the compression ratio, which is defined as the ratio of the original data size to the lossy compressed data size. In order to obtain as high a performance as possible, the compression ratio (CR) will certainly not be optimal. However, we still hope to obtain a relatively high CR (expected to be over 5 or 10), which is still much higher than lossless compression ratios (generally 1.2–2 for scientific data). The two throughputs are defined as CT = N·b/t_c and DT = N·b/t_d, where N is the number of data points in the dataset D, b represents the number of bytes per value in D (e.g., b = 4 when the original data precision is single-precision floating point), and t_c and t_d denote the time cost of compressing the dataset and the time cost of decompressing the corresponding compressed data, respectively.
In addition to the maximum compression error (i.e., the error bound shown in Formula (1)), we also evaluate the reconstructed data quality by commonly used statistical data distortion metrics such as Peak Signal to Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) , which have been widely used in the lossy compression and visualization communities. In general, the higher the PSNR or the SSIM, the better the reconstructed data quality.
IV. Ultra-Fast Error-Bounded Lossy Compression Framework – UFZ
In this section, we present the design overview of our ultra-fast error-bounded lossy compression framework – UFZ. Detailed performance optimization strategies for CPU and GPU are discussed in the next section.
Our design is motivated by the fact that most scientific datasets are fairly smooth in space, such that all values in a small block (e.g., 16 or 32 consecutive data points) are likely very close to each other; thus the mean of the minimal and maximal values in the block can represent the whole block under a certain error bound. Figure 1 shows the visualization of four typical fields from four different real-world simulation datasets (Miranda large-eddy simulation , Nyx cosmology simulation , QMCPack quantum chemistry , and Hurricane climate simulation ), clearly demonstrating the high smoothness of the data in local spatial regions. Furthermore, Figure 2 shows the cumulative distribution function (CDF) of each block's relative value range, defined as the ratio of the block's value range to the dataset's global value range. The reason we check the block's relative value range is that error-bounded lossy compression is often performed via a value-range-based relative error bound , where the absolute error bound is calculated based on the dataset's global value range. The figure verifies that the four scientific datasets all exhibit fairly high smoothness of the local data without loss of generality. Specifically, for the Miranda and QMCPack datasets, more than 80% of blocks have very small relative value ranges (below 0.01) when the block size is 8.
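The block-wise statistic above is easy to compute; below is a minimal sketch (the function name, the 1D flattening, and the block handling are our own illustrative choices, not the paper's code):

```python
def block_relative_ranges(data, block_size=8):
    """Relative value range of each 1D block: (block max - block min)
    divided by the dataset's global value range."""
    global_range = max(data) - min(data)
    ranges = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        ranges.append((max(block) - min(block)) / global_range)
    return ranges
```

A CDF like the one in Figure 2 can then be obtained by sorting these per-block ratios.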
We design our compressor UFZ around this local smoothness feature, as illustrated in Figure 3. The fundamental idea is to organize the whole dataset as many small 1D blocks (or segments) and check whether the mean of min and max (denoted by μ) in each block can represent all values in this block with deviations respecting the user-specified error bound. If yes, we call this block a 'constant' block, so we just need to store μ for this block of data; otherwise, we compress all the data points in this block by analyzing their IEEE 754 representations in terms of the user-required error bound.
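The constant-block test can be sketched as follows (a simplified serial version with hypothetical names; the real UFZ code operates on raw buffers):

```python
def classify_block(block, err_bound):
    """Return (is_constant, mu): mu is the mean of the block's min and max,
    and the block is 'constant' when every value is within err_bound of mu,
    i.e., when the variation radius (max - min) / 2 <= err_bound."""
    lo, hi = min(block), max(block)
    mu = (lo + hi) / 2.0
    return (hi - lo) / 2.0 <= err_bound, mu
```

For a constant block only mu is stored; nonconstant blocks fall through to the IEEE 754 analysis described next.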
|D|The dataset given for compression|
|e|user-specified error bound|
|d_i|the data points in the original raw dataset|
|B_j|the j-th block in the dataset|
|μ_j|mean of min and max in block B_j|
|r_j|variation radius of block B_j|
|m|the required mantissa bits calculated via e and r_j for B_j|
|v_i|normalized values based on μ_j in each block|
|ℓ_i|identical leading bits of v_i compared with v_{i−1}|
We describe Algorithm 1 as follows. As mentioned previously, the whole dataset is split into many small fixed-size 1D blocks and the compression is executed block by block (line 2). Because of the high smoothness of data in locality, quite a few data blocks may already have all of their values respecting the error bound relative to the mean of min and max, denoted by μ (lines 4–6); such blocks are called 'constant' blocks, which are compressed by simply storing the corresponding μ value.
For each of the non-constant blocks, we first normalize the data by subtracting the block's μ (i.e., the mean of min and max of the data in the block), so that each normalized value's magnitude is bounded by the block's variation radius, and then compress each such normalized value by IEEE 754 binary representation analysis according to the following steps:
Line 7: We compute the required number of mantissa bits (denoted as m) based on the user-specified error bound, by the following formula:

m = E(r) − E(e),

where E(x) denotes the binary exponent of the number x, r denotes the variation radius of the data in the block, e is the error bound, and m is capped by sizeof(type), the length of the data type (e.g., 32 bits for the single-precision floating-point type). The idea is that normalizing the data values by subtracting the mean of min and max makes the maximum exponent of each normalized value foreseeable, so the required mantissa bits can be estimated by combining it with the exponent of the error bound value.
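For illustration, the exponent arithmetic can be sketched with Python's frexp (the cap at the type width and the exact offset are our assumptions; a precise implementation would also account for the sign and exponent fields of the IEEE 754 layout):

```python
import math

def required_mantissa_bits(radius, err_bound, type_bits=32):
    """Estimate the mantissa bits m needed so that truncating the rest keeps
    the error within err_bound; E(x) is the binary exponent from math.frexp
    (x = f * 2**E with 0.5 <= |f| < 1)."""
    E = lambda x: math.frexp(x)[1]
    m = E(radius) - E(err_bound)
    return max(0, min(m, type_bits))
```

For example, a block radius of 1.0 with an error bound of 1e-3 needs about 10 mantissa bits, since the useful signal spans roughly ten binary orders of magnitude above the error bound.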
Line 9: We compute the identical leading bytes by an XOR operation between the normalized data value v_i and its preceding data value v_{i−1}. The number of leading zero bytes after the XOR operation indicates the number of identical leading bytes between the two data points.
Line 10: We encode the number of identical leading bytes for each data point by a 2-bit code: 00, 01, 10, and 11 correspond to 0, 1, 2, and 3 identical leading bytes, respectively. We use a 2-bit-per-value array (called xor_leadingzero_array) to carry these 2-bit codes, as illustrated in Figure 4.
Line 11: We commit the necessary mantissa bits, i.e., the error-bound-based required bits excluding the identical leading bytes (a.k.a. the mid-bytes), to a particular mantissa-bit array (denoted as mb_array), as shown in Figure 4.
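The leading-byte detection and 2-bit encoding of lines 9–10 can be sketched as follows; `float_bits`, the byte order, and the MSB-first packing layout are illustrative assumptions, not the paper's exact encoding:

```python
import struct

def float_bits(x):
    """IEEE 754 single-precision bit pattern of x as an unsigned 32-bit int."""
    return struct.unpack('!I', struct.pack('!f', x))[0]

def leading_identical_bytes(cur, prev):
    """Number of identical leading bytes between two floats, capped at 3
    so it fits the 2-bit code (00..11 -> 0..3 identical bytes)."""
    x = float_bits(cur) ^ float_bits(prev)
    n = 0
    for shift in (24, 16, 8):   # most-significant byte first
        if (x >> shift) & 0xFF:
            break
        n += 1
    return n

def pack_2bit(codes):
    """Pack the 2-bit codes into xor_leadingzero_array, 4 codes per byte."""
    out = bytearray((len(codes) + 3) // 4)
    for i, c in enumerate(codes):
        out[i // 4] |= (c & 0x3) << (6 - 2 * (i % 4))
    return bytes(out)
```

Close values such as 1.5 and 1.25 share their sign/exponent byte, so only the differing mantissa bytes need to be committed to mb_array.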
V. Performance Optimization for Various Devices
In this section, we describe our specific performance optimization strategies for CPU and GPU, respectively.
V-A Performance Optimization for CPU
In this subsection, we describe how we accelerate UFZ in our CPU code by an efficient bitwise right-shifting operation, which mainly involves lines 7–12 in Algorithm 1. This is a fundamental optimization strategy that can also be applied on other devices/accelerators such as GPUs. In what follows, we first describe a potential performance issue in the UFZ design, followed by our optimization solution.
As illustrated in Figure 5, the mantissa bits that need to be stored for the normalized value should exclude the identical bytes and the insignificant bits which are calculated based on the user-specified error bound and variation radius of the corresponding block. The number of such necessary mantissa bits is generally not a multiple of 8 (to be verified later), so that committing/storing these bits in the compressed data requires specific bitwise operation strategies.
Storing a short bit-array with an arbitrary number of bits is a very common operation in lossy compression. The most straightforward solution (Solution A as shown in Figure 5) is treating the given bit-array as a particular integer and populating the target bit-stream pool (i.e., the mb_array in the figure) by applying a couple of bitwise operations (such as bit-shift, bit-and, and bit-or) on the integer number. Many lossy compressors store arbitrary bits in this way, such as Pastri . An alternative solution (Solution B as shown in Figure 5) is splitting the necessary bits into two parts – a number of whole bytes plus a few residual bits – which was adopted by SZ [7, 8]. In this solution, the residual bits, whose count varies, still need to be gathered into a target array by a set of bitwise operations.
By comparison, we develop an ultra-fast method (Solution C as shown in Figure 5) to deal with the necessary bits very efficiently. The basic idea is bitwise right-shifting the normalized value by k bits, where k is given in Formula (5), such that the number of necessary bits to store is always a multiple of 8. As such, the necessary mantissa bits can be represented by an integer number of bytes, eliminating residual bits. In this situation, we just need a memory-copy operation to commit the necessary bits to a byte-array, which is fairly fast.
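The byte-alignment trick can be illustrated as below (the rounding convention for k is our assumption, standing in for the paper's Formula (5)):

```python
def byte_aligned_storage(req_bits):
    """Solution C: right-shift by k bits so that the stored bit count becomes
    a multiple of 8, letting the mid-bytes be committed with a plain memcpy.
    Returns (stored_bytes, k)."""
    k = (8 - req_bits % 8) % 8   # extra low-order bits added by alignment
    return (req_bits + k) // 8, k
```

For example, 10 necessary bits become 2 whole bytes with k = 6; those extra aligned bits are exactly the space overhead analyzed next.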
V-A1 Investigating Space Overhead of Bitwise Right-Shifting
Note that the bitwise right-shifting operation may increase the total number of required bits to store, thus reducing the compression ratio in turn. In the following text, we show that the increase in bits per value caused by the bitwise right-shifting operation is very limited compared with the compressed data size, thanks to the design of identical leading bytes. That is, the space overhead introduced to the compressed data size is negligible in most cases. In fact, although the bitwise right-shifting operation may increase the required number of bits, it may also increase the number of identical leading bytes, such that some necessary bits can be 'recorded' by the leading-byte array instead. In other words, after the bitwise right-shifting operation, the necessary bits tend to increase on the right end but decrease on the left end, which forms a counteraction to a certain extent.
We use Figure 6 (based on two real-world simulation datasets with different value-range-based error bounds) to show the specific space overhead of our bitwise right-shifting solution, as compared with the compressed data size. Specifically, the space overhead is defined as the ratio of the increased storage space introduced by the bitwise right-shifting method to the compressed data size, as presented in Formula (6):

overhead = (B_C − B_{A/B}) / (S / CR),  (6)

where CR is the compression ratio, S refers to the original data size (thus S/CR is the compressed data size), B_C refers to the total number of bytes that need to be stored under Solution C (our solution), and B_{A/B} refers to the total number of bytes that need to be stored by Solution A or B.
According to Figure 6, which involves a total of about 100 different fields across these two applications, it is clearly observed that the space overhead is always lower than 12% for all fields, and the average overhead for each case (with a specific block size) is always around or lower than 5% of the compressed data size. We give an example to further illustrate how small the overhead is. Specifically, for the field 'density' in the Miranda simulation dataset, the original data size is 256×384×384×4 bytes = 144MB, and the compression ratio of UFZ is 9.923, so the compressed data size is about 15.2MB. Our characterization shows that Solution B and Solution C lead to 81,340,334 necessary bits (i.e., 10,167,542 bytes) and 83,054,120 necessary bits (i.e., 10,381,765 bytes), respectively, which means the overhead is only (10,381,765 − 10,167,542) / 15.2MB ≈ 1.4% for this field.
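The overhead arithmetic from the 'density' example can be reproduced directly (the helper below simply evaluates Formula (6) given byte counts and the compressed size):

```python
def space_overhead(bytes_solution_c, bytes_solution_ab, compressed_bytes):
    """Formula (6): extra bytes introduced by byte alignment (Solution C)
    relative to the compressed data size."""
    return (bytes_solution_c - bytes_solution_ab) / compressed_bytes
```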
V-A2 Exploring The Optimal Block Size
Different block sizes may affect the compressed data size (i.e., compression ratio) significantly, so it is necessary to investigate the most appropriate block-size setting for UFZ. As described previously, there are two types of blocks in the design, called 'constant' blocks (lines 4–5 in Algorithm 1) and 'nonconstant' blocks (lines 6–12 in Algorithm 1), respectively. Before exploring the optimal block size, we need to understand how the two types of blocks contribute to the compressed data size (or compression ratio), which we analyze as follows (with three impact factors summarized).
Analysis of Constant Blocks Constant blocks refer to the blocks each of which can be approximated by one data value μ (i.e., the mean of min and max). As such, the smaller the block size, the more data points are included in constant blocks, because of the finer-grained block-wise processing, as illustrated in Figure 7 (a). As shown in the figure, the first set of 8 data points can form a constant block because of the relatively small block size. In this sense, the compression ratio tends to increase with decreasing block size because all values within a constant block can be approximated by one value (i.e., μ); we call this the first impact factor. However, since each constant block needs to store a constant value in the compressed data, the smaller the block size, the larger the number of μ values to be stored, which may also decrease the compression ratio in turn, as illustrated in Figure 7 (b); we call this phenomenon the second impact factor. Specifically, for the relatively smooth regions in the dataset, the algorithm still needs to store multiple μ values even though a large number of adjacent data points could be approximated by only one uniform value instead. This may introduce significant overhead because of the extra unnecessary μ values to store, thus lowering the compression ratio.
Fig. 7: Constant block’s pros and cons when block size is small
Analysis of Nonconstant Blocks On the one hand, the second impact factor (per-block μ storage) also applies to nonconstant blocks, as they also need to store μ for data denormalization during decompression. On the other hand, a smaller block size may tend to yield higher compression ratios for the following reason (we call it the third impact factor): the smaller the block size, the smaller the variation in the block (i.e., a smaller radius r), and thus the fewer necessary bits to store. Specifically, as shown in Figure 7 (a), the first 8-point block has a much smaller data variation than the other one, so the corresponding exponent is smaller, leading to fewer required mantissa bits (according to Formula (4)).
Based on the above analysis, different block sizes have distinct pros and cons for the compression quality with regard to the two different types of blocks. It is not obvious which block size yields the best compression quality. In what follows, we explore the best block-size setting by characterizing the compression ratios and Peak Signal to Noise Ratio (PSNR) under different block sizes, as presented in Figure 8. PSNR is a critical data-quality assessment metric for lossy compression, which has been widely used in the lossy compression and visualization communities [26, 7, 25, 32, 18]. PSNR is defined in Formula (7):

PSNR = 20·log10(d_max − d_min) − 10·log10(MSE),  (7)

where d_min and d_max are the minimal and maximal values in the dataset D, and MSE refers to the mean squared error between the original dataset and the reconstructed dataset. The higher the PSNR, the higher the precision of the reconstructed data.
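Formula (7) translates directly into code; this sketch assumes plain Python lists of original and reconstructed values:

```python
import math

def psnr(orig, recon):
    """Peak Signal to Noise Ratio (dB): value range of the original data
    over the root-mean-squared error of the reconstruction."""
    mse = sum((a - b) ** 2 for a, b in zip(orig, recon)) / len(orig)
    vrange = max(orig) - min(orig)
    return 20.0 * math.log10(vrange) - 10.0 * math.log10(mse)
```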
In this exploration, we checked many different error bounds from 1E-3 through 1E-6. Because of the space limit, we present in Figure 8 only the results for the value-range-based error bounds of 1E-3 and 1E-4, compressing the 7 fields of the Miranda simulation dataset with UFZ. Other error bounds and datasets exhibit very similar results.
From Figure 8, we observe that the compression ratio increases with block size in most cases, while the PSNR always stays at the same level across different block-size settings. This suggests that 128 is the best block size, based on such a comprehensive characterization covering multiple applications and error bounds. This characterization also indicates that the second impact factor (per-block μ storage) dominates the overall compression ratio, because it is the only factor that can enhance the compression ratio with increasing block size.
V-B Performance Optimization for GPU
In this section, we describe our design and implementation of cuUFZ – the CUDA GPU version of UFZ. UFZ is an irregular application for GPU platforms due to the data dependencies in both compression and decompression. In order to maximize the utilization of the GPU computing capacity, algorithmic adjustments and architecture-specific optimizations are needed for cuUFZ.
Compression: The basic design of cuUFZ's compression is that each CUDA thread-block handles one data-block. A thread-block is configured in two dimensions. The data-block size is chosen as a multiple of the warp size to optimize performance. There are two phases in cuUFZ compression. In the first phase, the compressor distinguishes constant from non-constant data-blocks by calculating their μ values and variation radii. The entire set of data-blocks is processed by the thread-blocks iteratively. A thread-block with a non-constant data-block enters the second phase, whereas one with a constant data-block records the data-block index and immediately proceeds to the next data-block. In this fashion, the workload imbalance can be significantly mitigated, because the number of data-blocks is considerably larger than the number of thread-blocks. We also note that the calculations of the min and max values are the main computing workload in the first phase, and they can be efficiently parallelized by leveraging CUDA warp-level operations.
Only the thread-blocks with non-constant data-blocks go through the second phase. In the second phase, the leading-number-based compression is performed. Performing the computation of xor_leadingzero_array and mb_array (a.k.a., mid-bytes) on GPU is straightforward. However, writing the mid-bytes back to global memory in a compact format is challenging. Unlike the serial implementation, in cuUFZ, the starting address to write mid-bytes of each data-block remains unknown until the total number of mid-bytes of all its preceding data-blocks is computed. Therefore, a prefix scan should be performed before writing the mid-bytes to the memory. Prefix-scan on GPU has been well studied . We leverage the classical design and implement it using 2-level in-warp shuffles.
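The role of the prefix scan can be seen in a serial stand-in (the GPU version uses warp shuffles, but the computed offsets are the same):

```python
def midbyte_offsets(midbyte_counts):
    """Exclusive prefix sum over per-block mid-byte counts: each block's
    starting write offset in the compacted output, plus the total size."""
    offsets, total = [], 0
    for c in midbyte_counts:
        offsets.append(total)
        total += c
    return offsets, total
```

Each thread-block can then write its mid-bytes to global memory at its own offset without any coordination with other blocks.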
Decompression: The basic design of cuUFZ's decompression is similar to its compression, in which each thread-block handles one data-block. Since the decompression of constant data-blocks is very lightweight, we only decompress the non-constant data-blocks on the GPU. The decompression consists of two components: leading-byte retrieval and mid-byte retrieval. Implementing the latter is relatively straightforward, as it just needs to read the bytes from the compressed data. We note that, for the same reason as in compression, a parallel prefix-scan is applied before retrieving the mid-bytes. However, implementing the leading-byte retrieval is challenging, since retrieving the bytes from the preceding adjacent element no longer works in a parallel environment.
In Figure 9, we illustrate the challenge by an example (see first row in particular), which displays an eight-element data-block with the mid-bytes highlighted in yellow. In the serial code, we can retrieve the third leading-byte of the third element (B33) by reading the mid-byte of the second element (B32) and then subsequently retrieve B34 by reading B33. However, in the parallel context, B33 and B34 will be simultaneously retrieved, which will cause the read-after-write (RAW) hazard. The same issue occurs when B27 and B28 are retrieved simultaneously.
To solve the above issue, the decompression needs to predetermine the mid-byte position each leading-byte should read from. For example, in this case, both B33 and B34 should read from B32. To this end, we propose an index-propagation algorithm. It assigns each byte an initial reading position: leading-bytes are assigned the first element's index (1 in the example), while mid-bytes are assigned their own elements' indices. Then a parallel interleaved-addressing propagation is performed to propagate the position values. If the source value is greater than the destination value, the source value is used as the new reading position for the current byte. With the interleaved addressing scheme, the propagation complexity is reduced to O(log n). In the running example, we only need three rounds of shuffles with strides 1, 2, and 4 to finish the propagation. Notice that we omit the shuffles with stride=4 in Figure 9 to save space; they do not change the final position values in this case. After the index-propagation, each leading-byte knows which mid-byte it should read from to retrieve its value, as shown in the last row of Figure 9.
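The index-propagation amounts to a parallel prefix-max with stride doubling; this serial simulation uses 0-based indices and a boolean mask marking which bytes are mid-bytes (both our own conventions, not the paper's exact layout):

```python
def propagate_indices(is_mid_byte):
    """Each mid-byte starts with its own index, each leading-byte with the
    first element's index (0 here). log2(n) rounds of stride-doubling
    max-propagation give every byte the index of the nearest preceding
    mid-byte it should read from."""
    n = len(is_mid_byte)
    pos = [i if is_mid_byte[i] else 0 for i in range(n)]
    stride = 1
    while stride < n:
        nxt = pos[:]                      # simulate one synchronized round
        for i in range(stride, n):
            if pos[i - stride] > nxt[i]:
                nxt[i] = pos[i - stride]
        pos = nxt
        stride *= 2
    return pos
```

On the GPU, each round maps to a warp shuffle, so all leading-bytes learn their source positions without any serial chain of reads.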
VI. Performance Evaluation
In this section, we analyze the evaluation results, which are performed using 6 real-world scientific datasets on heterogeneous devices offered by two different supercomputers.
VI-A Experimental Setup
|Application|# fields|Size per field|Description|
|CESM-ATM (CE.) |77|1800×3600|Atmosphere simulation of the Community Earth System Model|
|Hurricane (Hu.) |13|100×500×500|Simulation of Hurricane ISABEL|
|Miranda (Mi.) |7|256×384×384|Large-eddy simulation of multicomponent flows with turbulent mixing|
|Nyx (Ny.) |6|512×512×512|Adaptive-mesh, massively parallel cosmological simulation|
|QMCPack (QM.) |2|288/816×115×69×69|Simulation of the electronic structure of atoms, molecules, and solids|
|SCALE-LetKF (SL.) |12|98×1200×1200|SCALE-RM weather simulation based on the LETKF filter|
We perform our GPU experiments on both A100 GPU (offered by ANL ThetaGPU ) and V100 GPU (offered by ORNL Summit ). We perform our CPU experiments on ANL ThetaGPU’s compute nodes. We compare our developed ultra-fast compressor UFZ with two outstanding lossy compressors – SZ [7, 25] and ZFP , since they are arguably the fastest existing error-bounded compressors based on prior studies [32, 7] and they both have GPU versions that can be compared with our solution UFZ in the experiments.
VI-B Evaluation Results
First of all, we check the data reconstruction quality under UFZ for all the simulation datasets involved in our experiments. We conclude that the overall visual quality looks great when the value-range-based error bound is set in the range 1E-2 to 1E-4 for UFZ. Due to the space limit, we demonstrate the visual quality, PSNR, and Structural Similarity Index (SSIM) using only the Hurricane-ISABEL simulation dataset (CLOUDf48), as shown in Figure 10 (compression ratios are 14.6, 18, and 20.64, respectively). We can observe that the reconstructed data's visual quality is quite high, even when zooming into the top-left corner by 50×, though a few artifacts can be seen in the dark blue area of Figure 10 (d). How to further mitigate or remove artifacts will be our future work.
We present the compression ratios (CR) of the three compressors in Table III, showing the minimal, overall (i.e., harmonic mean), and maximal CR, respectively, over all the fields in each application. The table shows that UFZ can reach very high compression ratios (e.g., 124 for CESM) when REL=1E-2. Its overall compression ratio is 3–12 in all cases, which is 0.5–3× lower than ZFP and 3–30× lower than SZ. This is reasonable because SZ and ZFP adopt advanced multidimensional data analysis and sophisticated encoding methods, which obtain fairly high compression ratios but suffer lower execution performance on both CPU and GPU in turn (to be shown later). By comparison, the overall compression ratio of the lossless compressor Zstd is only 1.12–1.49, which is lower than that of UFZ by about 200–400%.
We present the single-core CPU compression and decompression throughputs in Table IV and Table V, respectively. The numbers shown in the tables are the overall performance across all the fields of each application. From the tables, we observe that our developed UFZ compressor significantly outperforms the other two error-bounded lossy compressors in both compression and decompression speed. In absolute terms, UFZ is about 2.5~5× as fast as ZFP and about 5~7× as fast as SZ in compression, and about 2~4× as fast as both ZFP and SZ in decompression. Such high performance in UFZ is mainly attributed to two factors: (1) the super-lightweight skeleton design (Algorithm 1), and (2) the bitwise right-shifting strategy proposed in Section V-A.
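The exact bitwise right-shifting strategy is defined in Section V-A; purely as a hedged illustration of why shift-based truncation keeps the error bounded while using only the lightweight operations UFZ restricts itself to, here is a generic sketch on non-negative fixed-point integers (the helper names and the fixed-point mapping are our own assumptions, not UFZ's actual design).

```python
# Sketch of error-bounded truncation using only shifts and adds,
# assuming values have been mapped to non-negative fixed-point
# integers beforehand (an illustrative setup, not UFZ's code).
def truncate(v: int, k: int) -> int:
    """Compression side: drop the k lowest bits."""
    return v >> k

def reconstruct(q: int, k: int) -> int:
    """Decompression side: restore magnitude and add half a step
    to center the reconstruction error."""
    return (q << k) + (1 << (k - 1)) if k > 0 else q

k = 4                       # discard 4 bits -> max error 2**(k-1) = 8
values = [0, 7, 100, 1023, 65535]
recons = [reconstruct(truncate(v, k), k) for v in values]
max_err = max(abs(v - r) for v, r in zip(values, recons))

assert max_err <= 1 << (k - 1)   # error bounded by half the step size
print(max_err)  # → 8
```

Because both directions involve only a shift and an addition, such a step costs far less than the multiplications and divisions used by prediction-based quantizers, which is consistent with the performance gap observed above.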
We evaluate the GPU performance of cuUFZ, cuZFP, and cuSZ on two cutting-edge supercomputers – ANL ThetaGPU (A100) and ORNL Summit (V100), respectively. The compression and decompression performance results across all the fields of each application are presented in Figure 11 and Figure 12, respectively.
According to Figure 11, the peak compression performance of UFZ reaches up to 264GB/s (see Hurricane's result in Figure 11 (a)). The overall compression performance of UFZ is 150~216GB/s on ThetaGPU and 140~188GB/s on Summit. By comparison, both cuSZ and cuZFP suffer from much lower GPU performance (9.8~86GB/s on ThetaGPU and 12~52GB/s on Summit).
According to Figure 12, the peak decompression performance of UFZ reaches up to 446GB/s (see Miranda's result in Figure 12 (a)). The overall decompression performance of UFZ is 150~291GB/s on ThetaGPU and 120~243GB/s on Summit. By comparison, both cuSZ and cuZFP suffer from much lower decompression performance on GPU (9.7~67GB/s on ThetaGPU and 13.7~48GB/s on Summit).
We also evaluate the overall data dumping/loading performance on ANL ThetaGPU nodes at different execution scales. Specifically, for the data dumping experiment, we use an MPI code to launch 64~1024 ranks/cores, each performing a lossy compression on the NYX dataset and writing the compressed data onto a parallel file system (PFS). For the data loading experiment, each MPI rank reads the compressed data from the PFS and then performs decompression. We present the performance breakdown under different value-range-based error bounds in Figure 13.
From the figure, we can clearly observe that UFZ obtains the highest overall performance in both data dumping and data loading on ThetaGPU. In absolute terms, the solution with UFZ takes only 1/3~1/2 of the time to dump or load the data compared with the other solutions in most cases. That is, the I/O performance is improved by 100%~200% under UFZ. The key reason is that ThetaGPU has a relatively fast I/O speed, so the compression/decompression stage becomes the key bottleneck at the execution scales in our experiments.
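As a quick sanity check of the arithmetic (illustrative only, with no measured numbers involved), "I/O performance improved by X%" translates into a time fraction of 1/(1 + X/100) relative to the baseline:

```python
# Converting "performance improved by X%" into the fraction of the
# baseline time needed to move the same data (pure arithmetic).
def time_fraction(improvement_pct: float) -> float:
    return 1.0 / (1.0 + improvement_pct / 100.0)

assert time_fraction(100) == 0.5              # 2x faster -> half the time
assert abs(time_fraction(200) - 1 / 3) < 1e-12  # 3x faster -> a third

print(time_fraction(100), round(time_fraction(200), 3))  # → 0.5 0.333
```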
VII Conclusion and Future Work
In this paper, we propose an ultra-fast error-bounded lossy compression framework – UFZ. We perform comprehensive evaluations using 6 real-world scientific datasets and two cutting-edge supercomputers’ heterogeneous resources. The key insights are summarized as follows.
With the same error bound, UFZ has reasonably lower compression ratios than SZ and ZFP do (0.3~3× lower than ZFP and 3~30× lower than SZ) because it has no sophisticated data prediction/transform step and no expensive encoding algorithm such as Huffman encoding.
On CPU: with the same error bound, UFZ is 2.5~5× as fast as ZFP and 5~7× as fast as SZ in compression; UFZ is 2~4× as fast as both SZ and ZFP in decompression.
On GPU: with the same error bound, UFZ's peak compression and decompression performance on a single GPU reaches up to 264GB/s and 446GB/s, respectively, which is 2~16× as fast as SZ and ZFP on GPU.
When compressing & writing the compressed data to a parallel file system (PFS), or reading & decompressing the compressed data from the PFS, on ANL ThetaGPU with 64~1024 cores, the overall data dumping/loading performance under UFZ is higher than that with SZ or ZFP by 100%~200%, because of UFZ's ultra-fast compression/decompression and the relatively fast I/O speed on ThetaGPU.
In future work, we plan to explore how to further improve the compression ratios of UFZ.
This research was supported by the Exascale Computing Project (ECP), Project Number: 17-SC-20-SC, a collaborative effort of two DOE organizations – the Office of Science and the National Nuclear Security Administration – responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, to support the nation's exascale computing imperative. The material was supported by the U.S. Department of Energy, Office of Science, and by DOE's Advanced Scientific Computing Research (ASCR) program under contract DE-AC02-06CH11357, and supported by the National Science Foundation under Grant No. 2003709 and OAC-2104023. We acknowledge the computing resources provided on ThetaGPU, which is operated by the Argonne Leadership Computing Facility at Argonne National Laboratory.
- (2018) Multilevel techniques for compression and reduction of scientific data—the univariate case. Computing and Visualization in Science 19(5), pp. 65–76.
- TTHRESH: tensor compression for multidimensional visual data. CoRR abs/1806.05952.
- (2019) Use cases of lossy compression for floating-point data in scientific datasets. International Journal of High Performance Computing Applications (IJHPCA) 33, pp. 1201–1220.
- (2015) Zstandard – real-time data compression algorithm. http://facebook.github.io/zstd/
- (2020) https://github.com/LLNL/zfp/tree/develop/src/cuda_zfp
- (1996) GZIP file format specification version 4.3.
- (2016) Fast error-bounded lossy HPC data compression with SZ. In IEEE International Parallel and Distributed Processing Symposium, pp. 730–739.
- (2019) Efficient lossy compression for scientific data based on pointwise relative error bound. IEEE Transactions on Parallel and Distributed Systems 30(2), pp. 331–345.
- (2018) PaSTRI: error-bounded lossy compression for two-electron integrals in quantum chemistry. In 2018 IEEE International Conference on Cluster Computing (CLUSTER), pp. 1–11.
- (2016) HACC: extreme scaling and performance across diverse architectures. Communications of the ACM 60(1), pp. 97–104.
- Hurricane ISABEL simulation dataset in IEEE Visualization 2004 Contest. http://vis.computer.org/vis2004contest/data.html
- (2015) The Community Earth System Model (CESM) Large Ensemble Project: a community resource for studying climate change in the presence of internal climate variability. Bulletin of the American Meteorological Society 96(8), pp. 1333–1349.
- (2018) QMCPACK: an open source ab initio quantum Monte Carlo package for the electronic structure of atoms, molecules and solids. Journal of Physics: Condensed Matter 30(19), pp. 195901.
- (2011) Compressing the incompressible with ISABELA: in-situ reduction of spatio-temporal data. In Euro-Par 2011 Parallel Processing, pp. 366–379.
- (2021) ThetaGPU machine overview. https://www.alcf.anl.gov/support-center/theta/theta-thetagpu-overview
- (2018) Error-controlled lossy compression optimized for high compression ratios of scientific datasets. In 2018 IEEE International Conference on Big Data.
- (2006) Fast and efficient compression of floating-point data. IEEE Transactions on Visualization and Computer Graphics 12(5), pp. 1245–1250.
- (2014) Fixed-rate compressed floating-point arrays. IEEE Transactions on Visualization and Computer Graphics 20(12), pp. 2674–2683.
- Parallel prefix sum (scan) with CUDA.
- Miranda turbulence simulation. https://wci.llnl.gov/simulation/computer-codes/miranda
- NYX simulation. https://amrex-astro.github.io/Nyx
- Scientific data reduction benchmark. https://sdrbench.github.io/
- (2017) Linac Coherent Light Source (LCLS-II). https://lcls.slac.stanford.edu/
- Summit. https://www.olcf.ornl.gov/summit/
- (2017) Significantly improving lossy compression for scientific data sets based on multidimensional prediction and error-controlled quantization. In 2017 IEEE International Parallel and Distributed Processing Symposium, pp. 1129–1139.
- (2019) Z-checker: a framework for assessing lossy compression of scientific data. The International Journal of High Performance Computing Applications 33(2), pp. 285–303.
- The local ensemble transform Kalman filter (LETKF) data assimilation package for the SCALE-RM weather model. https://github.com/gylien/scale-letkf
- (2020) cuSZ: an efficient GPU-based error-bounded lossy compression framework for scientific data. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques (PACT '20), pp. 3–15.
- (2020) FRaZ: a generic high-fidelity fixed-ratio lossy compression framework for scientific floating-point data. https://arxiv.org/abs/2001.06139
- (2011) (Website).
- (2019) Full-state quantum circuit simulation by using data compression. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '19), New York, NY, USA.
- (2020) Significantly improving lossy compression for HPC datasets with second-order prediction and parameter optimization. In Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '20), pp. 89–100.
- zlib. https://www.zlib.net/