I Introduction
With the ever-increasing complexity of modern scientific research, today's high-performance computing (HPC) applications and advanced instruments are producing extremely large volumes of data in their simulations or experiments. The Hardware/Hybrid Accelerated Cosmology Code (HACC) [10], for example, can produce 20 TB of simulation data in only one run, involving hundreds of simulation iterations and trillions of particles.
During the past five years, quite a few outstanding error-bounded lossy compressors have been developed to address the big data issue; however, their compression/decompression throughput is still far lower than the target performance demanded by many use cases, such as instrument data compression and in-memory compression. The Linear Coherent Light Source (LCLS-II) [23], for example, can generate instrument data at a rate of 250 GB/s [3], and these data need to be stored and transferred to a parallel file system (PFS) in a timely manner for post hoc analysis. By comparison, the single-core CPU performance of existing lossy compressors is generally only 200∼400 MB/s [25, 16] and the GPU performance is only 10∼66 GB/s [28, 5], which has also been verified in our experiments. Another typical example is exascale parallel quantum computing (QC) simulation, which requires a fairly large memory capacity (e.g.,
256 PB when simulating 50 qubits, each with double precision) for each run in practice
[31]. In order to reduce the memory requirement significantly, the QC simulation researchers [31] developed a method to store the lossy-compressed data in memory and decompress them whenever needed in the course of the simulation, while suffering considerable overhead in simulation time (even up to 20× in the worst case), which is totally undesired by users.

In this paper, we focus on how to significantly accelerate both the compression and decompression performance of error-bounded lossy compression while keeping a fairly high compression ratio, which involves two grand challenges. On the one hand, in order to pursue ultra-high lossy compression/decompression performance, we have to restrict the whole design to only super-lightweight operations, including addition/subtraction/bitwise operations, which is a serious challenge. Specifically, relatively expensive operations such as multiplication and division must be suppressed because of their significantly higher cost. All of the existing efficient error-bounded lossy compressors, however, depend on such expensive operations. For instance, SZ2.1 [16]
relies on the linear regression prediction, which involves masses of multiplications to compute the coefficients. Moreover, SZ2.1 relies on a linear-scale quantization to control the user-specified error bound, which involves a division operation (quantization_bin = round((d_i − pred_i)/(2e)) [8]) on each data point. ZFP [18] is another state-of-the-art error-bounded lossy compressor, which is designed based on a data transform that also involves masses of matrix-multiplication operations. On the other hand, how to optimize the performance to adapt to different device architectures is very challenging, because it requires fairly detailed and nontrivial performance tuning based on practical experiments with numerous real-world scientific datasets across different domains.

To address the above serious challenges, we propose a novel, UltraFast error-bounded lossy Compression framework, namely UFZ, which can also be extended/implemented efficiently for different devices such as CPU and GPU. The key contributions are summarized as follows:

We develop UFZ, which is composed of only lightweight operations such as bitwise operations, addition, and subtraction. UFZ also supports strict control of the compression errors within the user-specified error bound, thanks to our elaborate design of its error-control mechanism.

We implement UFZ for CPUs and GPU accelerators, and also carefully optimize its performance based on these device architectures, respectively.

We comprehensively evaluate UFZ by running it with 6 real-world scientific datasets on heterogeneous compute nodes offered by different supercomputers, including ORNL Summit and ANL ThetaGPU. We rigorously compare UFZ to two state-of-the-art lossy compressors, SZ and ZFP, as well as their GPU versions, cuSZ and cuZFP.

Experiments show that UFZ is 2∼5× as fast as the second-best existing error-bounded lossy compressor on CPU and 5∼10× as fast as the second-best on GPU, with respect to both compression and decompression. At such high performance, UFZ can still obtain a very nice compression ratio (3∼12 for the overall compression ratio of each application, and up to 124 for the compression ratio of a specific field) with good reconstructed data quality.
The rest of the paper is organized as follows. In Section II, we discuss related work. In Section III, we formulate the research problem. In Section IV, we present the design overview of our ultrafast error-bounded lossy compression framework. In Section V, we propose optimization strategies for improving the performance on different devices, including CPU and GPU. In Section VI, we present and discuss the performance evaluation results. Finally, we conclude the paper with a vision of future work in Section VII.
II Related Work
Basically, high-speed scientific data compression can be split into two categories, lossless compression and lossy compression, which are discussed in the following text, especially with regard to performance/speed.
High-speed lossless compressors have been developed in particular because of the strong demand for compression performance in many use cases. Facebook Zstd [4], for example, was developed particularly for the sake of high performance, with very similar compression ratios compared with other state-of-the-art lossless compressors such as Zlib [33] and Gzip [6]. In general, Zstd can be 5∼6× faster than Zlib, as shown in [4], such that it has been widely integrated/used in 80+ production-level software packages/libraries/platforms. Unfortunately, Zstd supports only lossless compression, which suffers very low compression ratios (1.2∼2 in most cases) when compressing scientific datasets that are mainly composed of floating-point values (to be shown later).
High-speed lossy compression has also gained significant attention from compressor developers and scientific application researchers. SZ [7, 25, 16] is a typical fast error-bounded lossy compressor, which can reach 200∼300 MB/s in compression and decompression speed [7, 25, 16]. However, it is still not as fast as expected by the quantum computing simulations [31], so a faster lossy compression method (namely QCZ in the following text) was customized with comparable compression ratios (especially for high-precision compression with a relative error bound of 1E-4 or 1E-5). ZFP [18] is another fast error-bounded lossy compressor, which is well known for its relatively high compression ratios and fairly high compression speed on both CPU and GPU. Based on our experiments (to be shown later), ZFP and QCZ have comparable compression speed, and they are generally 1.5∼2× as fast as SZ. In fact, SZ already has fairly high performance compared with other compressors, as demonstrated in the literature: it has performance comparable to FPZIP [17] and SZauto [32], and about one to two orders of magnitude higher performance than ISABELA [14], MGARD [1], and TTHRESH [2].
Because of the high demand for ultrafast error-bounded lossy compressors, a few specific error-controlled lossy compression algorithms have been developed for GPU accelerators in particular; cuSZ [28] and cuZFP [5] are two leading ones. cuSZ is the only GPU-based lossy compressor supporting an absolute error bound for scientific data compression. It was designed based on the classic prediction-based compression model of SZ and optimized significantly for GPU performance by leveraging a dual-quantization strategy [28] to deal with the Lorenzo prediction. Since ZFP's core stage performs a customized orthogonal data transform that can be executed in the form of matrix multiplication, cuZFP can leverage high-performance CUDA libraries to reach a very high throughput. cuZFP, however, does not support error-bounded compression but only fixed-rate compression, which suffers from very low compression ratios, as verified in [29].
In comparison with all the above related works, our proposed UFZ is about 2∼5× as fast as the second-fastest lossy compressor (ZFP) on CPU and 2∼10× as fast as the second-fastest (cuSZ) on GPU, also with relatively high compression ratios (3∼12 depending on the user's error bound).
III Problem Formulation
In this section, we formulate the research problem we focus on in this paper: optimizing the error-bounded lossy compression/decompression performance with as high compression ratios as possible. Specifically, we are given a scientific dataset (denoted by $D$) composed of $N$ data values, each denoted by $d_i$, where $i = 1, 2, 3, \ldots, N$. The objective of our work is to develop an error-bounded lossy compressor with ultra-high performance in both compression and decompression on both CPU and GPU, while strictly respecting the user-required error bound, which can be represented by the following formula.
$\max\{CT, DT\}$, s.t. $|d_i - d_i'| \leq e$, $\forall i \in \{1, 2, \ldots, N\}$   (1)
where CT and DT represent the compression throughput and decompression throughput, respectively; $d_i$ and $d_i'$ denote the original and decompressed data values in the dataset, respectively; $e$ is referred to as the user-specified absolute error bound; and CR means the compression ratio, which is defined as the ratio of the original data size to the lossy-compressed data size. In order to obtain as high performance as possible, the compression ratio (CR) would definitely not be optimal. However, we still hope to get a relatively high CR (expected to be over 5 or 10), which is still much higher than lossless compression ratios (generally 1.2∼2 for scientific data).
The compression throughput and decompression throughput are defined in Formula (2) and Formula (3), respectively.
$CT = (N \times S) / T_c$   (2)
$DT = (N \times S) / T_d$   (3)
where $N$ is the number of data points in the dataset $D$; $S$ represents the number of bytes per value in $D$ (e.g., $S = 4$ when the original data precision is single-precision floating point); and $T_c$ and $T_d$ denote the time cost of compressing the dataset and of decompressing the corresponding compressed data, respectively.
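As a quick illustration of Formulas (2) and (3), the throughput computation can be sketched as follows (the function name and the GB/s unit choice are ours):

```python
def throughput_gb_per_s(num_values, bytes_per_value, seconds):
    """Formulas (2)/(3): (N * S) / time, i.e., original bytes processed
    per second, reported here in GB/s."""
    return num_values * bytes_per_value / seconds / 1e9

# e.g., 512^3 single-precision values compressed in 0.5 s
ct = throughput_gb_per_s(512**3, 4, 0.5)   # about 1.07 GB/s
```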
In addition to the maximum compression error (i.e., the error bound $e$ as shown in Formula (1)), we will also evaluate the reconstructed data quality by commonly used statistical data distortion metrics such as the Peak Signal-to-Noise Ratio (PSNR) [26] and the Structural Similarity Index Measure (SSIM) [30], which have been commonly used in the lossy compression and visualization communities. In general, the higher the PSNR or the SSIM, the better the reconstructed data quality.

IV Ultrafast Error-bounded Lossy Compression Framework – UFZ
In this section, we present the design overview of our ultrafast error-bounded lossy compression framework, UFZ. Detailed performance optimization strategies for CPU and GPU are discussed in the next section.
Our design is motivated by the fact that most scientific datasets are fairly smooth in space, such that all the values in a small block (e.g., 16 or 32 consecutive data points) are likely very close to each other; thus the mean of the minimal and maximal values in the block can be used to represent the whole block under a certain error bound. Figure 1 shows the visualization of four typical fields from four different real-world simulation datasets (Miranda large-eddy simulation [20], Nyx cosmology simulation [21], QMCPack quantum chemistry [13], and Hurricane climate simulation [11]), clearly demonstrating the high smoothness of the data in local spatial regions. Furthermore, Figure 2
shows the cumulative distribution function (CDF) of the blocks' relative value range.¹ It verifies that the four scientific datasets all exhibit fairly high smoothness of the local data, without loss of generality. Specifically, for the Miranda dataset and the QMCPack dataset, 80+% of the blocks have very small relative value ranges (<0.01) when the block size is 8.

¹A block's relative value range is defined as the ratio of the block's value range to the dataset's global value range. The reason we check the block's relative value range is that error-bounded lossy compression is often performed via a value-range-based relative error bound [26], where the absolute error bound is calculated based on the dataset's global value range.

We design our compressor UFZ around this local smoothness feature, as illustrated in Figure 3. The fundamental idea is to organize the whole dataset as many small 1D blocks (or segments) and to check whether the mean of the min and max (denoted by $\mu$) in each block can represent all values in the block with deviations respecting the user-specified error bound. If yes, we call this block a 'constant' block, and we just need to store $\mu$ for this block of data; otherwise, we compress all the data points in the block by analyzing their IEEE 754 representations in terms of the user-required error bound.
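The constant-block idea above can be sketched in a few lines (a minimal illustration under our own naming, not the authors' implementation; blocks are classified by comparing the variation radius against the absolute error bound):

```python
import numpy as np

def constant_block_mask(data, abs_err, block_size=8):
    """Mark each fixed-size 1D block as 'constant' if the mean of its min
    and max can represent every value in the block within abs_err, i.e.,
    if the block's variation radius (max - min) / 2 is at most abs_err."""
    n = len(data) // block_size * block_size
    blocks = np.asarray(data[:n], dtype=np.float32).reshape(-1, block_size)
    radius = (blocks.max(axis=1) - blocks.min(axis=1)) / 2
    return radius <= abs_err   # True => store only the mean for this block

# A smooth signal: nearly every block collapses to a single representative
smooth = np.linspace(0.0, 1.0, 1024)
mask = constant_block_mask(smooth, abs_err=0.01)
```

For the smooth ramp above, every 8-point block spans only about 0.007 of the value range, so all blocks are constant; highly oscillating data would yield no constant blocks.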
We present the pseudocode of the skeleton design in Algorithm 1 to further describe the details. Table I summarizes the key notations to assist understanding of the algorithm.
Notation  Description
$D$  the dataset given for compression
$e$  user-specified error bound
$d_i$  the data points in the original raw dataset
$B_i$  the $i$-th block in the dataset
$\mu_i$  mean of min and max in block $B_i$
$r_i$  variation radius of block $B_i$
$m$  the required mantissa bits calculated via $r_i$ and $e$ for $B_i$
$n_i$  normalized values based on $\mu_i$ in each block
$\ell_i$  identical leading bits of $n_i$ compared with $n_{i-1}$
We describe Algorithm 1 as follows. As mentioned previously, the whole dataset is split into many small fixed-size 1D blocks and the compression is executed block by block (line 2). Because of the high smoothness of data in locality, quite a few data blocks may have values that already respect the error bound based on the mean of min and max (denoted by $\mu$) (lines 4-6); such blocks are called 'constant' blocks, and they are compressed by simply storing the corresponding $\mu$ value.
For each non-constant block, we first normalize the data by subtracting the mean of min and max of the block (i.e., $\mu$), and then compress each such normalized value by IEEE 754 binary representation analysis, according to the following three steps:

Line 7: We compute the required number of mantissa bits (denoted as $m$) based on the user-specified error bound, by the following formula:

$m = \min\{expo(r) - expo(e),\ \mathrm{sizeof}(type)\}$   (4)

where $expo(\cdot)$ denotes getting the exponent of a number, $r$ denotes the variation radius of the data in the block, and sizeof(type) refers to the length of the data type (e.g., 32 bits for the single-precision floating-point type). The idea is that normalizing the data values by subtracting the mean of min and max makes the maximum exponent of each normalized value foreseeable, so the required mantissa bits can be estimated by combining it with the exponent of the error bound value $e$.
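A minimal sketch of this step, assuming the required mantissa bits amount to expo(r) − expo(e) capped by the data-type length; the function name and the frexp-based exponent extraction are our own:

```python
import math

def required_mantissa_bits(radius, err_bound, type_bits=32):
    """Sketch of Formula (4): how many mantissa bits must be kept so that
    a normalized value (magnitude at most `radius`) truncated to that many
    mantissa bits still deviates by no more than roughly `err_bound`.
    math.frexp(x) returns (frac, exp) with x = frac * 2**exp."""
    expo_r = math.frexp(radius)[1]     # exponent of the variation radius
    expo_e = math.frexp(err_bound)[1]  # exponent of the error bound
    m = expo_r - expo_e
    return max(0, min(m, type_bits))   # capped by the data-type length
```

For example, a block with radius 1.0 and error bound 0.01 needs 7 mantissa bits; an error bound larger than the radius needs none.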
Line 9: Compute the identical leading bytes by an XOR operation between the normalized data value $n_i$ and its preceding data value $n_{i-1}$. The number of leading zero bytes after the XOR operation indicates the number of identical leading bytes between the two data points.

Line 10: We encode the number of identical leading bytes for each data point by a 2-bit code: 00, 01, 10, and 11 correspond to 0, 1, 2, and 3 identical leading bytes, respectively. We use a 2-bit-per-value array (called xor_leadingzero_array) to carry these 2-bit codes, as illustrated in Figure 4.

Line 11: We commit the necessary mantissa bits, i.e., the error-bound-based required bits (denoted as $m$) excluding the identical leading bytes (the remaining bytes are a.k.a. mid-bytes), to a particular mantissa bit array (denoted as mb_array), as shown in Figure 4.
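The leading-byte computation of Lines 9-10 can be sketched as follows (a simplified single-pair illustration; the function name and byte-order handling are our assumptions):

```python
import struct

def leading_identical_bytes(curr, prev):
    """Count how many leading bytes (0-3) of two single-precision IEEE 754
    bit patterns are identical, via an XOR of their 32-bit representations,
    as in Line 9 of the algorithm."""
    a = struct.unpack('<I', struct.pack('<f', curr))[0]
    b = struct.unpack('<I', struct.pack('<f', prev))[0]
    x = a ^ b
    n = 0
    for shift in (24, 16, 8, 0):       # most-significant byte first
        if (x >> shift) & 0xFF:
            break
        n += 1
    return min(n, 3)                   # a 2-bit code records at most 3 bytes
```

Identical values yield code 3 (all recordable leading bytes match); a sign flip yields 0 because the very first bit already differs.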
V Performance Optimization for Various Devices
In this section, we describe our specific performance optimization strategies for CPU and GPU, respectively.
V-A Performance Optimization for CPU
In this subsection, we describe how to accelerate UFZ in our CPU code by an efficient bitwise right-shifting operation, which mainly involves lines 7-12 in Algorithm 1. This is a fundamental optimization strategy that can also be applied on other devices/accelerators such as GPUs. In what follows, we first describe a potential performance issue in the UFZ design, followed by our optimization solution.
As illustrated in Figure 5, the mantissa bits that need to be stored for a normalized value should exclude the identical leading bytes and the insignificant bits, which are determined based on the user-specified error bound and the variation radius of the corresponding block. The number of such necessary mantissa bits is generally not a multiple of 8 (to be verified later), so committing/storing these bits in the compressed data requires specific bitwise operation strategies.
Storing a short bit-array with an arbitrary number of bits is a very common operation in lossy compression. The most straightforward solution (Solution A in Figure 5) is treating the given bit-array as a particular integer and populating the target bitstream pool (i.e., the mb_array in the figure) by applying a couple of bitwise operations (such as bit-shift, bit-and, and bit-or) on that integer. Many lossy compressors store arbitrary bits in this way, such as Pastri [9]. An alternative solution (Solution B in Figure 5) is splitting the necessary bits into two parts, a number of whole bytes plus a few residual bits, which was adopted by SZ [7, 8]. In this solution, the residual bits, whose number varies, still need to be gathered into a target array by a set of bitwise operations.
By comparison, we develop an ultrafast method (Solution C in Figure 5) to deal with the necessary bits very efficiently. The basic idea is bitwise right-shifting the normalized value by $k$ bits, where $k$ is given in Formula (5), such that the number of necessary bits to be stored is always a multiple of 8. As such, the necessary mantissa bits can be represented by an integer number of bytes, with residual bits eliminated. In this situation, we just need a memory-copy operation to commit the necessary bits to a byte array, which is fairly fast.
$k = \lceil m/8 \rceil \times 8 - m$   (5)
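A sketch of Solution C under the above description (our own naming; we assume the $m$ necessary bits are the most significant bits of the 32-bit pattern):

```python
def byte_aligned_store(value_bits, m):
    """Sketch of Solution C: right-shift the 32-bit pattern by k
    (Formula (5)) so that the m necessary bits become m + k bits, a whole
    number of bytes, which can then be committed with a plain byte copy
    (no residual-bit packing needed)."""
    k = (-m) % 8                         # Formula (5): ceil(m/8)*8 - m
    shifted = (value_bits >> k) & 0xFFFFFFFF
    nbytes = (m + k) // 8                # always a whole number of bytes
    return shifted.to_bytes(4, 'big')[:nbytes]
```

For m = 13, k = 3, so 16 bits (2 whole bytes) are stored; the 3 extra padding bits are the space overhead analyzed in the next subsection.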
V-A1 Investigating the Space Overhead of Bitwise Right-Shifting
Note that the bitwise right-shifting operation may increase the total number of required bits to store, thus reducing the compression ratio in turn. In the following text, we show that the number of bits per value added by the bitwise right-shifting operation is very limited compared with the compressed data size, thanks to the design of identical leading bytes. That is, the space overhead introduced to the compressed data size is negligible in most cases. In fact, although the bitwise right-shifting operation may increase the required number of bits, it may also increase the number of identical leading bytes, such that some necessary bits can be "recorded" by the leading-byte array instead. In other words, after the bitwise right-shifting operation, the necessary bits tend to increase on the right end but tend to decrease on the left end, which forms a counteraction to a certain extent.
We use Figure 6 (based on two real-world simulation datasets with different value-range-based error bounds [26]) to show the specific space overhead of our bitwise right-shifting solution, as compared with the compressed data size. Specifically, the space overhead is defined as the ratio of the extra storage space introduced by the bitwise right-shifting method to the compressed data size, as presented in Formula (6).
$\mathit{overhead} = (N_C - N_{AB}) / (S / CR)$   (6)
where CR is the compression ratio, $S$ refers to the original data size (thus $S/CR$ means the compressed data size), $N_C$ refers to the total number of necessary bytes stored under Solution C (our solution), and $N_{AB}$ refers to the total number of necessary bytes stored by Solution A or B.
According to Figure 6, which involves a total of about 100 different fields across these two applications, it is clearly observed that the space overhead is always lower than 12% for all fields, and the average overhead for each case (with a specific block size) is always around or lower than 5% of the compressed data size. We give an example to further illustrate how small the overhead is. For the field 'density' in the Miranda simulation dataset, the original data size is 256×384×384×4 bytes = 144 MB, and the compression ratio of UFZ is 9.923, so the compressed data size is about 15.2 MB. Our characterization shows that Solution B and Solution C lead to 81,340,334 necessary bits (i.e., 10,167,542 bytes) and 83,054,120 necessary bits (i.e., 10,381,765 bytes), respectively, which means the overhead is only (10,381,765 - 10,167,542) bytes / 15.2 MB ≈ 1.4% for this field.
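The arithmetic of this example can be reproduced directly from Formula (6) using the numbers quoted above:

```python
# Worked example from the text: the Miranda 'density' field.
orig_bytes = 256 * 384 * 384 * 4             # 144 MB of single-precision data
cr = 9.923                                    # UFZ compression ratio
compressed_bytes = orig_bytes / cr            # about 15.2 MB
n_c = 10_381_765                              # necessary bytes, Solution C
n_ab = 10_167_542                             # necessary bytes, Solution A/B
overhead = (n_c - n_ab) / compressed_bytes    # Formula (6): about 1.4%
```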
V-A2 Exploring the Optimal Block Size
Different block sizes may affect the compressed data size (i.e., the compression ratio) significantly, so it is necessary to investigate the most appropriate block-size setting for UFZ. As described previously, there are two types of blocks in the design, called 'constant' blocks (lines 4-5 in Algorithm 1) and 'non-constant' blocks (lines 6-12 in Algorithm 1), respectively. Before exploring the optimal block size, we need to understand how the two types of blocks contribute to the compressed data size (or compression ratio), which we analyze as follows (with three impact factors summarized).

Analysis of Constant Blocks: Constant blocks refer to the blocks each of which can be approximated by one data value (i.e., the mean of min and max). As such, the smaller the block size, the more data points are included in constant blocks, because of the finer-grained blockwise processing, as illustrated in Figure 7 (a). As shown in the figure, the first set of 8 data points can form a constant block because of the relatively small block size. In this sense, the compression ratio tends to increase with decreasing block size, because all the values within a constant block can be approximated by one value (i.e., $\mu$); we call this the first impact factor in the following text. However, since each constant block needs to store one constant value in the compressed data, the smaller the block size, the larger the number of $\mu$ values to be stored, which may also decrease the compression ratio in turn, as illustrated in Figure 7 (b). We call this phenomenon the second impact factor. Specifically, for relatively smooth regions in the dataset, the algorithm still needs to store multiple $\mu$ values even though a large number of adjacent data points could be approximated by only one uniform value; this may introduce significant overhead because of the extra unnecessary $\mu$ values to store, leading to lower compression ratios.
Fig. 7: Constant block’s pros and cons when block size is small 
Analysis of Non-constant Blocks: On the one hand, the second impact factor (storing one $\mu$ per block) also applies to non-constant blocks, as they also need to store $\mu$ for data denormalization during decompression. On the other hand, a smaller block size may tend to yield higher compression ratios, for the following reason (we call it the third impact factor): the smaller the block size, the smaller the variation in the block (i.e., a smaller radius $r$), and thus the fewer necessary bits to store. Specifically, as shown in Figure 7 (a), the first 8-point block has much smaller data variation than the other one, so the corresponding required exponent is smaller, leading to fewer required mantissa bits (according to Formula (4)).
Based on the above analysis, different block sizes have distinct pros and cons for the compression quality with regard to the two different types of blocks, and it is not obvious which block size yields the best compression quality. In what follows, we explore the best block-size setting by characterizing the compression ratios and the Peak Signal-to-Noise Ratio (PSNR) under different block sizes, as presented in Figure 8. PSNR is a critical data quality assessment metric for lossy compression, which has been widely used in the lossy compression and visualization communities [26, 7, 25, 32, 18]. PSNR is defined in Formula (7).
$PSNR = 20\log_{10}(d_{max} - d_{min}) - 10\log_{10}(MSE)$   (7)
where $d_{min}$ and $d_{max}$ are the min and max values in the dataset $D$, and MSE refers to the mean squared error between the original dataset and the reconstructed dataset. The higher the PSNR, the higher the precision of the reconstructed data.
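A direct transcription of Formula (7) (the function name is ours):

```python
import numpy as np

def psnr(original, reconstructed):
    """Formula (7): PSNR = 20*log10(d_max - d_min) - 10*log10(MSE)."""
    original = np.asarray(original, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    mse = np.mean((original - reconstructed) ** 2)
    value_range = original.max() - original.min()
    return 20 * np.log10(value_range) - 10 * np.log10(mse)
```

For instance, a uniform error of 0.001 on data with value range 1 gives an MSE of 1e-6 and thus a PSNR of 60 dB.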
In this exploration, we checked many different error bounds from 1E-3 through 1E-6. Because of the space limit, we present in Figure 8 only the results for the value-range-based error bounds of 1E-3 and 1E-4, which compress 7 fields of the Miranda simulation dataset by UFZ. Other error bounds and datasets exhibit very similar results.
From Figure 8, we observe that the compression ratio increases with the block size in most cases, while the PSNR always stays at the same level across different block-size settings. This suggests that the best block size is 128, based on such a comprehensive characterization covering multiple applications and error bounds. The characterization also indicates that the second impact factor (the storage of one $\mu$ per block) dominates the overall compression ratio, because it is the only factor that can enhance the compression ratio with an increasing block size.
V-B Performance Optimization for GPU
In this section, we describe our design and implementation of cuUFZ, the CUDA GPU version of UFZ. UFZ is an irregular application for GPU platforms due to the data dependencies in both compression and decompression. In order to maximize the utilization of the GPU computing capacity, algorithmic adjustments and architecture-specific optimizations are needed for cuUFZ.
Compression: The basic design of cuUFZ's compression is that each CUDA thread-block handles one data-block. A thread-block is configured in two dimensions, and the data-block size is chosen as a multiple of the warp size to optimize performance. There are two phases in cuUFZ's compression. In the first phase, the compressor distinguishes the constant and non-constant data-blocks by calculating their $\mu$ values and variation radii. The entire data-block set is processed by the thread-blocks iteratively. A thread-block with a non-constant data-block enters the second phase, whereas one with a constant data-block records the data-block index and then immediately moves on to the next data-block. In this fashion, the workload imbalance can be significantly mitigated, since the number of data-blocks is considerably larger than the number of thread-blocks. We also note that the min and max calculations are the main computing workload in the first phase, and they can be efficiently parallelized by leveraging CUDA warp-level operations.
Only the thread-blocks with non-constant data-blocks go through the second phase, where the leading-byte-based compression is performed. Computing the xor_leadingzero_array and the mb_array (a.k.a. mid-bytes) on the GPU is straightforward. However, writing the mid-bytes back to global memory in a compact format is challenging. Unlike in the serial implementation, in cuUFZ the starting address for writing the mid-bytes of each data-block remains unknown until the total number of mid-bytes of all its preceding data-blocks has been computed. Therefore, a prefix scan must be performed before writing the mid-bytes to memory. Prefix scan on GPUs has been well studied [19]; we leverage the classical design and implement it using two-level in-warp shuffles.
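The offset computation that the prefix scan provides can be illustrated serially (our own naming; the GPU version performs this scan in parallel with warp shuffles):

```python
from itertools import accumulate

def midbyte_offsets(counts):
    """Exclusive prefix sum over per-data-block mid-byte counts: each
    block's starting write position in the compact mid-byte stream equals
    the total mid-byte count of all preceding blocks."""
    return [0] + list(accumulate(counts))[:-1]

# blocks producing 3, 0, 5, 2 mid-bytes write at offsets 0, 3, 3, 8
```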
Decompression: The basic design of cuUFZ's decompression is similar to that of its compression, in which each thread-block handles one data-block. Since the decompression of constant data-blocks is very lightweight, we only decompress the non-constant data-blocks on the GPU. The decompression consists of two components: the leading-byte retrieval and the mid-byte retrieval. Implementing the latter is relatively straightforward, as it just needs to read the bytes from the compressed data. We note that, for the same reason as in the compression, a parallel prefix scan is applied before retrieving the mid-bytes. However, implementing the leading-byte retrieval is challenging, since retrieving bytes from the preceding adjacent element no longer works in a parallel environment.
In Figure 9, we illustrate the challenge by an example (see the first row in particular), which displays an eight-element data-block with the mid-bytes highlighted in yellow. In the serial code, we can retrieve the third leading byte of the third element (B33) by reading the mid-byte of the second element (B32), and then subsequently retrieve B34 by reading B33. In the parallel context, however, B33 and B34 are retrieved simultaneously, which causes a read-after-write (RAW) hazard. The same issue occurs when B27 and B28 are retrieved simultaneously.
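One way to pre-resolve such hazards, as described next, is to propagate reading positions with a prefix-max computation; below is a serial sketch using hypothetical 0-based byte indices (the real implementation uses warp shuffles):

```python
def propagate_reading_positions(pos):
    """Interleaved-addressing propagation (a parallel prefix max, shown
    serially): in each round every byte looks back by `stride` and adopts
    the source position if it is greater, so after O(log n) rounds each
    leading byte holds the index of the mid-byte it must read from."""
    pos = list(pos)
    stride = 1
    while stride < len(pos):
        pos = [max(pos[i], pos[i - stride]) if i >= stride else pos[i]
               for i in range(len(pos))]
        stride *= 2
    return pos

# Hypothetical 8-byte example: mid-bytes at positions 2 and 5 start with
# their own indices; all other bytes start with the first element's index 0.
final = propagate_reading_positions([0, 0, 2, 0, 0, 5, 0, 0])
# final == [0, 0, 2, 2, 2, 5, 5, 5]
```

After propagation, every leading byte between two mid-bytes points at the nearest preceding mid-byte, so all retrievals can proceed simultaneously.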
To solve the above issue, the decompression needs to predetermine the mid-byte position each leading byte should read from. For example, in this case, both B33 and B34 should read from B32. To this end, we propose an index-propagation algorithm. It assigns each byte an initial reading position: the positions of leading bytes are the first element's index (1 in the example), while the positions of mid-bytes are their own elements' indices. Then a parallel interleaved-addressing propagation is performed to propagate the position values: if the source value is greater than the destination value, the source value becomes the new reading position of the current byte. With the interleaved-addressing scheme, the propagation complexity is reduced to O(log n). In the running example, we only need three rounds of shuffles with strides 1, 2, and 4 to finish the propagation. Notice that we omit the shuffles with stride = 4 in Figure 9 to save space; they do not change the final position values in this case. After the index-propagation, each leading byte knows which mid-byte it should read from to retrieve its value, as shown in the last row of Figure 9.

VI Performance Evaluation
In this section, we analyze the evaluation results, which were obtained using 6 real-world scientific datasets on heterogeneous devices offered by two different supercomputers.
VI-A Experimental Setup
Table II describes all the application datasets used in our experiments. All the datasets were downloaded from the well-known scientific data reduction benchmark website [22].
Application  # fields  Size per field  Description
CESM-ATM (CE.) [12]  77  1800×3600  atmosphere simulation of the Community Earth System Model
Hurricane (Hu.) [11]  13  100×500×500  simulation of Hurricane ISABEL
Miranda (Mi.) [20]  7  256×384×384  large-eddy simulation of multicomponent flows with turbulent mixing
Nyx (Ny.) [21]  6  512×512×512  adaptive-mesh, massively-parallel cosmological simulation
QMCPack (QM.) [13]  2  288/816×115×69×69  simulation of the electronic structure of atoms, molecules, and solids
SCALE-LetKF (SL.) [27]  12  98×1200×1200  SCALE-RM weather simulation based on the LETKF filter
We perform our GPU experiments on both A100 GPUs (offered by ANL ThetaGPU [15]) and V100 GPUs (offered by ORNL Summit [24]). We perform our CPU experiments on ANL ThetaGPU's compute nodes. We compare our ultrafast compressor UFZ with two outstanding lossy compressors, SZ [7, 25] and ZFP [18], since they are arguably the fastest existing error-bounded compressors based on prior studies [32, 7], and both have GPU versions that can be compared with our solution in the experiments.
VI-B Evaluation Results
First of all, we check the data reconstruction quality under UFZ for all the simulation datasets involved in our experiments. We conclude that the overall visual quality looks great when the value-range-based error bound (denoted by REL) is set to 1E-2∼1E-4 for UFZ. Due to the space limit, we demonstrate the visual quality, PSNR, and Structural Similarity Index Measure (SSIM) using only the Hurricane-ISABEL simulation dataset (CLOUDf48), as shown in Figure 10 (compression ratios are 14.6, 18, and 20.64, respectively). We can observe that the reconstructed data's visual quality is quite high, even when zooming into the top-left corner by 50×, though a few artifacts can be seen in the dark blue area of Figure 10 (d). How to further mitigate or remove the artifacts will be our future work.
We present the compression ratios (CR) of the three compressors in Table III, showing the minimum, overall (i.e., harmonic mean), and maximum CR, respectively, across all the fields in each application. The table shows that UFZ can reach very high compression ratios (e.g., 124 for CESM) when REL=1E-2. Its overall compression ratio is 3–12 in all the cases, which is 0.5–3× lower than that of ZFP and 3–30× lower than that of SZ. This is reasonable because SZ and ZFP adopt advanced multidimensional data analysis and sophisticated encoding methods, which can reach fairly high compression ratios but may in turn suffer from lower execution performance on both CPU and GPU (to be shown later). By comparison, the overall compression ratio of the lossless compressor Zstd is only 1.12–1.49, which is lower than that of UFZ by about 200%–400%.

We present the single-core CPU compression throughput and decompression throughput in Table IV and Table V, respectively. The numbers shown in the tables are the overall performance across all the fields of each application. From the tables, we can observe that our developed UFZ compressor significantly outperforms the other two error-bounded lossy compressors in both compression speed and decompression speed. In absolute terms, for compression, UFZ is about 2.5–5× as fast as ZFP and about 5–7× as fast as SZ. For decompression, UFZ is about 2–4× as fast as both ZFP and SZ. Such high performance in UFZ is mainly attributed to two factors: (1) the super-lightweight skeleton design (Algorithm 1), and (2) the bitwise right-shifting strategy proposed in Section V-A.
TABLE IV: Single-core CPU compression throughput (MB/s)

Compressor | REL  | CE.  | Hu.  | Mi.  | Ny.  | QM.  | SL.
UFZ        | 1E-2 | 1034 | 796  | 959  | 1087 | 969  | 1032
           | 1E-3 | 822  | 750  | 833  | 877  | 902  | 703
           | 1E-4 | 752  | 662  | 807  | 722  | 813  | 663
ZFP        | 1E-2 | 392  | 256  | 249  | 418  | 323  | 258
           | 1E-3 | 288  | 213  | 211  | 284  | 275  | 208
           | 1E-4 | 234  | 181  | 280  | 226  | 208  | 174
SZ         | 1E-2 | 236  | 193  | 186  | 258  | 205  | 217
           | 1E-3 | 170  | 153  | 161  | 229  | 216  | 156
           | 1E-4 | 143  | 130  | 139  | 164  | 147  | 124
TABLE V: Single-core CPU decompression throughput (MB/s)

Compressor | REL  | CE.  | Hu.  | Mi.  | Ny.  | QM.  | SL.
UFZ        | 1E-2 | 1221 | 1085 | 1950 | 1450 | 1292 | 1408
           | 1E-3 | 1022 | 1006 | 1546 | 1218 | 1083 | 975
           | 1E-4 | 925  | 864  | 1319 | 956  | 928  | 886
ZFP        | 1E-2 | 485  | 476  | 498  | 732  | 685  | 360
           | 1E-3 | 327  | 371  | 401  | 455  | 524  | 395
           | 1E-4 | 246  | 297  | 327  | 333  | 376  | 299
SZ         | 1E-2 | 559  | 451  | 549  | 635  | 588  | 519
           | 1E-3 | 381  | 291  | 444  | 534  | 462  | 334
           | 1E-4 | 269  | 229  | 392  | 359  | 282  | 236
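The "overall" numbers reported above aggregate per-field results; for compression ratios the paper states a harmonic mean. One way to compute such an aggregate (the size-weighted variant, which equals total original bytes over total compressed bytes, is our assumption; the paper only names the harmonic mean) is sketched below:

```python
def overall_cr(ratios, sizes=None):
    """Harmonic-mean aggregation of per-field compression ratios.

    With per-field original sizes supplied, the result equals
    total original bytes / total compressed bytes; without sizes,
    every field is weighted equally.
    """
    if sizes is None:
        sizes = [1.0] * len(ratios)
    original = sum(sizes)
    compressed = sum(s / r for s, r in zip(sizes, ratios))
    return original / compressed
```

The harmonic mean is the natural choice here because compression ratios are rates: fields with ratio 2 and ratio 4 of equal size compress to 0.5 + 0.25 = 0.75 of one field's size, giving an overall ratio of 8/3, not the arithmetic mean 3.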
We evaluate the GPU performance of cuUFZ, cuZFP, and cuSZ on two cutting-edge supercomputers – ANL ThetaGPU (A100) and ORNL Summit (V100), respectively. The compression and decompression performance results covering all the fields of each application are presented in Figure 11 and Figure 12, respectively.
According to Figure 11, the peak compression performance of UFZ reaches up to 264 GB/s (see Hurricane's result in Figure 11 (a)). The overall compression performance of UFZ is 150–216 GB/s on ThetaGPU and 140–188 GB/s on Summit. By comparison, both cuSZ and cuZFP suffer from much lower GPU performance (9.8–86 GB/s on ThetaGPU and 12–52 GB/s on Summit).
According to Figure 12, the peak decompression performance of UFZ reaches up to 446 GB/s (see Miranda's result in Figure 12 (a)). The overall decompression performance of UFZ is 150–291 GB/s on ThetaGPU and 120–243 GB/s on Summit. By comparison, both cuSZ and cuZFP exhibit much lower decompression performance on GPU (9.7–67 GB/s on ThetaGPU and 13.7–48 GB/s on Summit).
We also evaluate the overall data dumping/loading performance on ANL ThetaGPU nodes at different execution scales. Specifically, for the data dumping experiment, we use an MPI code to launch 64–1024 ranks/cores, each performing a lossy compression on the Nyx dataset and writing the compressed data onto the parallel file system (PFS). For the data loading experiment, each MPI rank reads the compressed data from the PFS and then performs decompression. We present the performance breakdown in Figure 13 in terms of different value-range-based error bounds.
From the figure, we can clearly observe that UFZ obtains the highest overall performance in both data dumping and data loading on ThetaGPU. In absolute terms, the solution with UFZ takes substantially less time to dump or load data than the other solutions in most cases: the I/O performance is improved by 100%–200% under UFZ. The key reason is that ThetaGPU has a relatively fast I/O speed, so the compression/decompression stage becomes the key bottleneck at the execution scales in our experiments.
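This bottleneck argument can be made concrete with a back-of-the-envelope model: total dump time is compression time plus the time to write the (much smaller) compressed stream, so when the file system is fast, the compressor's throughput dominates. All numbers in the example are illustrative assumptions, not measured values from the experiments:

```python
def dump_time_seconds(raw_gb, compression_ratio, compress_gbps, io_gbps):
    """End-to-end data dumping time under a simple sequential model:
    compress the raw data, then write the compressed stream to the PFS.

    raw_gb: raw data size in GB; rates in GB/s.
    """
    compress_t = raw_gb / compress_gbps          # compression stage
    write_t = (raw_gb / compression_ratio) / io_gbps  # write compressed bytes
    return compress_t + write_t
```

For instance, dumping 100 GB at compression ratio 10 through a 50 GB/s file system takes 0.7 s with a 200 GB/s compressor but 2.2 s with a 50 GB/s one: the write stage (0.2 s) is identical, so the compression stage decides the overall time.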
VII Conclusion and Future Work
In this paper, we propose an ultra-fast error-bounded lossy compression framework – UFZ. We perform comprehensive evaluations using six real-world scientific datasets and the heterogeneous resources of two cutting-edge supercomputers. The key insights are summarized as follows.

With the same error bound, UFZ has reasonably lower compression ratios than SZ and ZFP do (0.3–3× lower than ZFP and 3–30× lower than SZ), because it has no sophisticated data prediction/transform step and no expensive encoding algorithms such as Huffman encoding.

On CPU: with the same error bound, UFZ is 2.5–5× as fast as ZFP and 5–7× as fast as SZ in compression; UFZ is 2–4× as fast as both SZ and ZFP in decompression.

On GPU: with the same error bound, UFZ's peak compression and decompression performance on a single GPU reaches up to 264 GB/s and 446 GB/s, respectively, which is 2–16× as fast as SZ and ZFP on GPU.

When compressing & writing the compressed data to the parallel file system (PFS), or reading & decompressing the compressed data from the PFS, on ANL ThetaGPU with 64–1024 cores, the overall data dumping/loading performance under UFZ is higher than that with SZ or ZFP by 100%–200%, thanks to UFZ's ultra-fast compression/decompression and the relatively fast I/O speed on ThetaGPU.
In future work, we plan to explore how to further improve the compression ratios of UFZ.
Acknowledgements
This research was supported by the Exascale Computing Project (ECP), Project Number 17-SC-20-SC, a collaborative effort of two DOE organizations – the Office of Science and the National Nuclear Security Administration – responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, to support the nation's exascale computing imperative. The material was supported by the U.S. Department of Energy, Office of Science, and by DOE's Advanced Scientific Computing Research (ASCR) program under contract DE-AC02-06CH11357, and supported by the National Science Foundation under Grant No. 2003709 and OAC-2104023. We acknowledge the computing resources provided on ThetaGPU, which is operated by the Argonne Leadership Computing Facility at Argonne National Laboratory.
References
[1] (2018) Multilevel techniques for compression and reduction of scientific data—the univariate case. Computing and Visualization in Science 19(5), pp. 65–76.
[2] (2018) TTHRESH: tensor compression for multidimensional visual data. CoRR abs/1806.05952.
[3] (2019) Use cases of lossy compression for floating-point data in scientific datasets. International Journal of High Performance Computing Applications (IJHPCA) 33, pp. 1201–1220.
[4] (2015) Zstandard – real-time data compression algorithm. http://facebook.github.io/zstd/
[5] (2020) https://github.com/LLNL/zfp/tree/develop/src/cuda_zfp
[6] (1996) GZIP file format specification version 4.3.
[7] (2016) Fast error-bounded lossy HPC data compression with SZ. In IEEE International Parallel and Distributed Processing Symposium, pp. 730–739.
[8] (2019) Efficient lossy compression for scientific data based on pointwise relative error bound. IEEE Transactions on Parallel and Distributed Systems 30(2), pp. 331–345.
[9] (2018) PaSTRI: error-bounded lossy compression for two-electron integrals in quantum chemistry. In 2018 IEEE International Conference on Cluster Computing (CLUSTER), pp. 1–11.
[10] (2016) HACC: extreme scaling and performance across diverse architectures. Communications of the ACM 60(1), pp. 97–104.
[11] Hurricane ISABEL simulation dataset in the IEEE Visualization 2004 contest. http://vis.computer.org/vis2004contest/data.html
[12] (2015) The Community Earth System Model (CESM) Large Ensemble project: a community resource for studying climate change in the presence of internal climate variability. Bulletin of the American Meteorological Society 96(8), pp. 1333–1349.
[13] (2018) QMCPACK: an open source ab initio quantum Monte Carlo package for the electronic structure of atoms, molecules and solids. 30(19), 195901.
[14] (2011) Compressing the incompressible with ISABELA: in-situ reduction of spatio-temporal data. In Euro-Par 2011 Parallel Processing, pp. 366–379.
[15] (2021) ThetaGPU machine overview. https://www.alcf.anl.gov/supportcenter/theta/thetathetagpuoverview
[16] (2018) Error-controlled lossy compression optimized for high compression ratios of scientific datasets. In 2018 IEEE International Conference on Big Data.
[17] (2006) Fast and efficient compression of floating-point data. IEEE Transactions on Visualization and Computer Graphics 12(5), pp. 1245–1250.
[18] (2014) Fixed-rate compressed floating-point arrays. IEEE Transactions on Visualization and Computer Graphics 20(12), pp. 2674–2683.
[19] Parallel prefix sum (scan) with CUDA.
[20] Miranda turbulence simulation. https://wci.llnl.gov/simulation/computercodes/miranda
[21] Nyx simulation. https://amrexastro.github.io/Nyx
[22] Scientific data reduction benchmark. https://sdrbench.github.io/
[23] (2017) Linac Coherent Light Source (LCLS-II). https://lcls.slac.stanford.edu/
[24] https://www.olcf.ornl.gov/summit/
[25] (2017) Significantly improving lossy compression for scientific data sets based on multidimensional prediction and error-controlled quantization. In 2017 IEEE International Parallel and Distributed Processing Symposium, pp. 1129–1139.
[26] (2019) Z-checker: a framework for assessing lossy compression of scientific data. The International Journal of High Performance Computing Applications 33(2), pp. 285–303.
[27] The local ensemble transform Kalman filter (LETKF) data assimilation package for the SCALE-RM weather model. https://github.com/gylien/scaleletkf
[28] (2020) cuSZ: an efficient GPU-based error-bounded lossy compression framework for scientific data. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques (PACT '20), pp. 3–15.
[29] (2020) FRaZ: a generic high-fidelity fixed-ratio lossy compression framework for scientific floating-point data. https://arxiv.org/abs/2001.06139
[30] (2011) (Website).
[31] (2019) Full-state quantum circuit simulation by using data compression. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '19), New York, NY, USA.
[32] (2020) Significantly improving lossy compression for HPC datasets with second-order prediction and parameter optimization. In Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '20), pp. 89–100.
[33] https://www.zlib.net/