
SZx: an Ultra-fast Error-bounded Lossy Compressor for Scientific Datasets

Today's scientific high performance computing (HPC) applications and advanced instruments are producing vast volumes of data across a wide range of domains, which introduces a serious burden on data transfer and storage. Error-bounded lossy compression has been developed and widely used in the scientific community, because not only can it significantly reduce data volumes but it can also strictly control the data distortion based on the user-specified error bound. Existing lossy compressors, however, cannot offer ultra-fast compression speed, which is highly demanded by quite a few applications or use-cases (such as in-memory compression and online instrument data compression). In this paper, we propose a novel ultra-fast error-bounded lossy compressor, which can obtain fairly high compression performance on both CPU and GPU, also with reasonably high compression ratios. The key contributions are three-fold: (1) We propose a novel, generic ultra-fast error-bounded lossy compression framework – called UFZ – by confining our design to only super-lightweight operations such as bitwise and addition/subtraction operations, while still keeping a fairly high compression ratio. (2) We implement UFZ on both CPU and GPU and carefully optimize the performance according to their architectures. (3) We perform a comprehensive evaluation with 6 real-world production-level scientific datasets on both CPU and GPU. Experiments show that UFZ is 2~16X as fast as the second-fastest existing error-bounded lossy compressor (either SZ or ZFP) on CPU and GPU, with respect to both compression and decompression.



I Introduction

With the ever-increasing complexity of modern scientific research, today's high performance computing (HPC) applications and advanced instruments are producing extremely large volumes of data in their simulations or experiments. The Hardware/Hybrid Accelerated Cosmology Code (HACC) [10], for example, can produce 20TB of simulation data in a single run involving hundreds of simulation iterations and trillions of particles.

During the past five years, quite a few outstanding error-bounded lossy compressors have been developed to resolve the big-data issue; however, their compression/decompression throughput is still far lower than the performance demanded by many use-cases, such as instrument data compression and in-memory compression. The Linac Coherent Light Source (LCLS-II) [23] can generate instrument data at a rate of 250GB/s [3], and these data need to be stored and transferred to a parallel file system (PFS) in a timely fashion for post hoc analysis. By comparison, the single-core CPU throughput of existing lossy compressors is generally only 200~400MB/s [25, 16] and the GPU throughput is only 10~66GB/s [28, 5], which has also been verified in our experiments. Another typical example is exa-scale parallel quantum computing (QC) simulation, which requires a fairly large memory capacity in practice (e.g., 256PB when simulating 50 qubits with double precision) for each run [31]. In order to reduce the memory requirement significantly, the QC simulation researchers [31] developed a method that stores lossy-compressed data in memory and decompresses it whenever needed in the course of the simulation, while suffering considerable overhead in simulation time (up to 20X in the worst case), which is undesirable for users.

In this paper, we focus on how to significantly accelerate both compression and decompression for error-bounded lossy compression while keeping a fairly high compression ratio, which involves two grand challenges. On the one hand, in order to pursue ultra-high compression/decompression performance, we have to restrict the whole design to only super-lightweight operations such as addition, subtraction, and bitwise operations, which is a serious challenge. Specifically, relatively expensive operations such as multiplication and division must be avoided because of their significantly higher cost. All of the existing efficient error-bounded lossy compressors, however, depend on such expensive operations. For instance, SZ2.1 [16] relies on linear-regression-based prediction, which involves masses of multiplications to compute the coefficients. Moreover, SZ2.1 relies on a linear-scale quantization to control the user-specified error bound, which involves a division operation on each data point to compute its quantization bin [8]. ZFP [18] is another state-of-the-art error-bounded lossy compressor; it is designed around a data transform, which also involves masses of matrix-multiplication operations. On the other hand, optimizing the performance to adapt to different device architectures is very challenging, because it requires fairly detailed and non-trivial performance tuning guided by practical experiments with numerous real-world scientific datasets across different domains.
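For clarity, the per-point division that makes linear-scale quantization relatively costly can be sketched as follows. This is a reconstruction of the standard approach (each value's distance from its prediction is mapped to an integer bin of width twice the error bound), not SZ 2.1's actual code; the function name is ours.

```python
# Standard linear-scale quantization sketch (not SZ 2.1's actual code):
# the division by 2*err_bound is performed once per data point.

def quantization_bin(value: float, predicted: float, err_bound: float) -> int:
    """Map the prediction residual to an integer bin of width 2*err_bound."""
    return round((value - predicted) / (2 * err_bound))

b = quantization_bin(1.06, 1.0, 0.01)   # residual 0.06 -> bin 3
```

Reconstructing `value` as `predicted + 2 * err_bound * b` keeps the error within `err_bound`, which is why the division cannot simply be dropped from such designs.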

To address the above serious challenges, we propose a novel, Ultra-Fast error-bounded lossy Compression framework – namely UFZ, which can also be extended/implemented efficiently for different devices such as CPU and GPU. The key contributions are summarized as follows:

  • We develop UFZ, which comprises only lightweight operations such as bitwise operations, addition, and subtraction. UFZ also supports strict control of the compression errors within the user-specified error bound, thanks to our elaborate error-control mechanism.

  • We implement UFZ for both CPU and GPU accelerators, and carefully optimize its performance for each device architecture.

  • We comprehensively evaluate UFZ by running it with 6 real-world scientific datasets on heterogeneous compute nodes offered by different supercomputers, including ORNL Summit and ANL ThetaGPU. We rigorously compare UFZ to two state-of-the-art lossy compressors, SZ and ZFP, as well as their GPU versions – cuSZ and cuZFP.

  • Experiments show that UFZ is 2~5X as fast as the second-best existing error-bounded lossy compressor on CPU and 5~10X as fast as the second-best on GPU, with respect to both compression and decompression. At such a high performance, UFZ still obtains a very nice compression ratio (3~12 for the overall compression ratio of each application, and up to 124 for the compression ratio of specific fields) with good reconstructed data quality.

The rest of the paper is organized as follows. In Section II, we discuss related work. In Section III, we formulate the research problem. In Section IV, we present the design overview of our ultra-fast error-bounded lossy compression framework. In Section V, we propose optimization strategies for improving the performance on different devices, including CPU and GPU. In Section VI, we present and discuss the performance evaluation results. Finally, we conclude the paper with a vision of future work in Section VII.

II Related Work

Basically, high-speed scientific data compression can be split into two categories – lossless compression and lossy compression – which we discuss in the following text, particularly with regard to performance/speed.

High-speed lossless compressors have been developed in particular because of the strong demand for compression performance in many use-cases. Facebook Zstd [4], for example, was developed particularly for the sake of high performance, with very similar compression ratios compared with other state-of-the-art lossless compressors such as Zlib [33] and Gzip [6]. In general, Zstd can be 5~6X faster than Zlib, as shown in [4], such that it has been widely integrated/used in 80+ production-level software packages, libraries, and platforms. Unfortunately, Zstd supports only lossless compression, which suffers very low compression ratios (1.2~2 in most cases) when compressing scientific datasets that are mainly composed of floating-point values (to be shown later).

High-speed lossy compression has also gained significant attention from compressor developers and scientific applications/researchers. SZ [7, 25, 16] is a typical fast error-bounded lossy compressor, which can reach 200~300MB/s in compression and decompression speed [7, 25, 16]. However, it is still not as fast as expected by quantum computing simulations [31], so a faster lossy compression method (namely QCZ in the following text) was customized for them with comparable compression ratios (especially for high-precision compression with a relative error bound of 1E-4 or 1E-5). ZFP [18] is another fast error-bounded lossy compressor, which is well known for its relatively high compression ratios and fairly high compression speed on both CPU and GPU. Based on our experiments (to be shown later), ZFP and QCZ have comparable compression speeds, and they are generally 1.5~2X as fast as SZ. In fact, SZ already has fairly high performance compared with other compressors, as demonstrated in the literature: it has performance comparable with FPZIP [17] and SZauto [32] and about one to two orders of magnitude higher performance than ISABELA [14], MGARD [1], and TTHRESH [2].

Because of the high demand for ultra-fast error-bounded lossy compressors, a few specific error-controlled lossy compression algorithms have been developed for GPU accelerators in particular; cuSZ [28] and cuZFP [5] are two leading ones. cuSZ is the only GPU-based lossy compressor supporting an absolute error bound for scientific data compression. It was designed based on the classic prediction-based compression model of SZ and optimized significantly for GPU performance by leveraging a dual-quantization strategy [28] to deal with the Lorenzo prediction. Since ZFP's core stage performs a customized orthogonal data transform that can be executed in the form of matrix multiplication, cuZFP can leverage high-performance CUDA libraries to reach very high throughput. cuZFP, however, does not support error-bounded compression but only fixed-rate compression, which suffers from very low compression ratios, as verified in [29].

In comparison with all of the above related work, our proposed UFZ is about 2~5X as fast as the second-fastest lossy compressor, ZFP, on CPU and 2~10X as fast as the second-fastest (cuSZ) on GPU, also with relatively high compression ratios (3~12 depending on the user's error bound).

III Problem Formulation

In this section, we formulate the research problem we focus on in this paper: optimizing the error-bounded lossy compression/decompression performance with compression ratios as high as possible. Specifically, we are given a scientific dataset (denoted by D) composed of N data values, each denoted by d_i, where i = 1, 2, ..., N. The objective of our work is to develop an error-bounded lossy compressor with ultra-high performance in both compression and decompression on both CPU and GPU, while strictly respecting the user-required error bound, which can be represented by the following formula:

  maximize CT, DT, and CR, subject to |d_i − d_i′| ≤ e, for i = 1, 2, ..., N   (1)

where CT and DT represent the compression throughput and decompression throughput, respectively; d_i and d_i′ denote the original data value and decompressed data value in the dataset, respectively; e is the user-specified absolute error bound; and CR means the compression ratio, which is defined as the ratio of the original data size to the lossy compressed data size. In order to obtain performance as high as possible, the compression ratio (CR) will definitely not be optimal. However, we still hope to get a relatively high CR (expected to be over 5 or 10), which is still much higher than lossless compression ratios (generally 1.2~2 for scientific data).

The compression throughput and decompression throughput are defined in Formula (2) and Formula (3), respectively:

  CT = N·B / T_c   (2)
  DT = N·B / T_d   (3)

where N is the number of data points in the dataset D, B represents the number of bytes per value in D (e.g., B = 4 when the original data precision is single-precision floating point), and T_c and T_d denote the time cost of compressing the dataset and of decompressing the corresponding compressed data, respectively.
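The throughput and compression-ratio definitions above can be expressed directly as code. This is an illustrative sketch; the numbers in the example are made up, not measured results.

```python
# Formulas (2), (3), and the compression-ratio definition as code.

def compression_throughput(n_points: int, bytes_per_value: int, t_compress: float) -> float:
    """CT = N * B / T_c, in bytes per second."""
    return n_points * bytes_per_value / t_compress

def decompression_throughput(n_points: int, bytes_per_value: int, t_decompress: float) -> float:
    """DT = N * B / T_d, in bytes per second."""
    return n_points * bytes_per_value / t_decompress

def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    """CR = original size / compressed size."""
    return original_bytes / compressed_bytes

# Example: 100M single-precision values (400MB) compressed in 1 second
ct = compression_throughput(100_000_000, 4, 1.0)   # 400 MB/s
cr = compression_ratio(400_000_000, 50_000_000)    # 8.0
```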

In addition to the maximum compression error (i.e., the error bound e shown in Formula (1)), we will also evaluate the reconstructed data quality by commonly used statistical data distortion metrics such as the Peak Signal to Noise Ratio (PSNR) [26] and the Structural Similarity Index Measure (SSIM) [30], which are widely used in the lossy compression and visualization communities. In general, the higher the PSNR or the SSIM, the better the reconstructed data quality.

IV Ultra-fast Error-bounded Lossy Compression Framework – UFZ

In this section, we present the design overview of our ultra-fast error-bounded lossy compression framework – UFZ. Detailed performance optimization strategies for CPU and GPU will be discussed in the next section.

Our design is motivated by the fact that most scientific datasets are quite smooth in space, such that all the values in a small block (e.g., 16 or 32 consecutive data points) are likely very close to each other; thus the mean of the minimal and maximal values in the block can be used to represent the whole block under a certain error bound. Figure 1 shows visualizations of four typical fields from four different real-world simulation datasets (Miranda large-eddy simulation [20], Nyx cosmology simulation [21], QMCPack quantum chemistry [13], and Hurricane climate simulation [11]), clearly demonstrating the high smoothness of the data in local spatial regions. Furthermore, Figure 2 shows the cumulative distribution function (CDF) of each block's relative value range.¹ It verifies that the four scientific datasets all exhibit fairly high smoothness of the local data without loss of generality. Specifically, for the Miranda dataset and the QMCPack dataset, more than 80% of blocks have very small relative value ranges (below 0.01) when the block size is 8.

¹A block's relative value range is defined as the ratio of the block's value range to the dataset's global value range. The reason we check the block's relative value range is that error-bounded lossy compression is often performed via a value-range-based relative error bound [26], where the absolute error bound is calculated based on the dataset's global value range.

Fig. 1: Demonstrating High Smoothness of Scientific Datasets. Panels: (a) Miranda (pressure:slice128); (b) Nyx cosmology (temperature); (c) QMCPack (slice500); (d) Hurricane (U:slice60)

Fig. 2: Cumulative Distribution Function (CDF) of Block's Value Range. Panels: (a) Miranda (pressure:slice128); (b) Nyx cosmology (temperature); (c) QMCPack (slice500); (d) Hurricane (U:slice60)

We design our compressor UFZ around this local-smoothness feature, as illustrated in Figure 3. The fundamental idea is to organize the whole dataset as many small 1D blocks (or segments) and check whether the mean of the min and max (denoted by μ) in each block can be used to represent all values in the block with deviations respecting the user-specified error bound. If yes, we call this block a 'constant' block, and we just need to store μ for this block of data; otherwise, we compress all the data points in the block by analyzing their IEEE 754 representations in terms of the user-required error bound.
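A minimal sketch of the constant-block test just described, assuming the criterion is |v − μ| ≤ e for every value in the block (the function name is ours):

```python
# A block is 'constant' if mu = (min + max) / 2 represents every value in it
# within the error bound e; equivalently, (max - min) / 2 <= e.

def is_constant_block(block, err_bound):
    mu = (min(block) + max(block)) / 2.0
    return all(abs(v - mu) <= err_bound for v in block), mu

ok, mu = is_constant_block([1.0, 1.5, 2.0], 0.5)   # max deviation exactly 0.5
bad, _ = is_constant_block([0.0, 2.0], 0.5)        # deviation 1.0 > 0.5
```

Note that since μ is the midpoint of min and max, the check reduces to a single comparison of the block's half-range against e, which is why only min/max need to be computed per block.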

Fig. 3: Design Architecture/Workflow of UFZ

We present the pseudo-code of the skeleton design in Algorithm 1 to describe further details. Table I summarizes all key notations to assist understanding of the algorithm.

Notation Description
D        the dataset given for compression
e        user-specified error bound
d_i      the data points in the original raw dataset
B_i      the i-th block in the dataset
μ_i      mean of min and max in block B_i
r_i      variation radius of block B_i
m        the required mantissa bits calculated via e and r_i for B_i
v_j      normalized values based on μ_i in each block
L_j      identical leading bits of v_j compared with v_{j−1}
TABLE I: Key Notations Involved in The UFZ Algorithm

We describe Algorithm 1 as follows. As mentioned previously, the whole dataset is split into many small fixed-size 1D blocks, and the compression is executed block by block (line 2). Because of the high local smoothness of the data, quite a few data blocks may already have all values respecting the error bound relative to the mean of min and max (denoted by μ_i) (lines 4~6); such blocks are called 'constant' blocks, and they are compressed by simply storing the corresponding μ_i value.

Input: dataset D, user-specified error bound e, block size (denoted β)

Output: compressed data stream in form of bytes

1:   Initialize all counters to 0;
2:  for each block B_i with block size β do
3:      Compute μ_i for B_i; /*Compute mean of min and max*/
4:      if (∀ d_j ∈ B_i: |d_j − μ_i| ≤ e) then
5:          μ-array ← μ_i; /*Collect μ_i for 'constant' blocks*/
6:      else
7:          Compute required # mantissa bits (denoted as m);
8:          for each normalized value v_j in B_i do
9:              Compute identical_leading_bytes for v_j and v_{j−1};
10:              Encode identical_leading_bytes into xor_leadingzero_array;
11:              mb_array ← required m bits of v_j excluding identical leading bytes; /*Commit required bits*/
12:          end for
13:      end if
14:      Aggregate output: μ-array, xor_leadingzero_array, mb_array;
15:  end for
Algorithm 1 Skeleton Design of UFZ

For each non-constant block, we first normalize the data by subtracting μ_i (i.e., the mean of min and max of the data in the block), and then compress each such normalized value by IEEE 754 binary-representation analysis according to the following steps:

  • Line 7: We compute the required number of mantissa bits (denoted as m) based on the user-specified error bound, by the following formula:

      m = min( E(r) − E(e), sizeof(type) )   (4)

    where E(·) denotes taking the exponent of a number, r denotes the variation radius of the data in the block, and sizeof(type) refers to the length of the data type (e.g., 32 bits for single-precision floating point). The idea is that normalizing the data values by subtracting the mean of min and max makes the maximum exponent of each normalized value foreseeable, so the required mantissa bits can be estimated by combining it with the exponent of the error bound value.

  • Line 9: Compute the identical leading bytes by an XOR operation between the normalized data value v_j and its preceding data value v_{j−1}. The number of leading zero bytes after the XOR operation indicates the number of identical leading bytes between the two data points.

  • Line 10: We encode the number of identical leading bytes for each data point by a 2-bit code: 00, 01, 10, and 11 correspond to 0, 1, 2, and 3 identical leading bytes, respectively. We use a 2-bit-per-value array (called xor_leadingzero_array) to carry these 2-bit codes, as illustrated in Figure 4.

  • Line 11: We commit the necessary mantissa bits, i.e., the error-bound-based required bits (denoted as m) excluding the identical leading bytes (a.k.a. mid-bytes), to a particular mantissa-bit array (denoted as mb_array), as shown in Figure 4.

Fig. 4: Compressing non-constant block by binary representation analysis: suppose three adjacent normalized values in a non-constant block are 0.1234, 0.1235, and 0.1211, respectively.
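The leading-byte step of Lines 9~10 can be sketched on IEEE 754 single-precision values as follows. This is our own Python illustration (the real compressor presumably operates on raw bit patterns natively); the function names are ours.

```python
import struct

# XOR a value's IEEE 754 binary32 pattern with its predecessor's and count
# identical leading bytes, capped at 3 so the count fits the 2-bit code used
# in xor_leadingzero_array.

def float_bits(x: float) -> int:
    """Raw 32-bit pattern of a single-precision float."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

def identical_leading_bytes(curr: float, prev: float) -> int:
    xored = float_bits(curr) ^ float_bits(prev)
    count = 0
    for shift in (24, 16, 8):           # inspect bytes from most significant
        if (xored >> shift) & 0xFF == 0:
            count += 1
        else:
            break
    return count                         # 0..3, encodable in 2 bits

# Adjacent close values share sign, exponent, and high mantissa bytes
n = identical_leading_bytes(0.1235, 0.1234)
```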

V Performance Optimization for Various Devices

In this section, we describe our specific performance optimization strategies for CPU and GPU, respectively.

V-a Performance Optimization for CPU

In this subsection, we describe how we accelerate UFZ in our CPU code by an efficient bitwise right-shifting operation, which mainly involves lines 7~12 in Algorithm 1. This is a fundamental optimization strategy which can also be applied on other devices/accelerators such as GPUs. In what follows, we first describe a potential performance issue in the UFZ design, followed by our optimization solution.

As illustrated in Figure 5, the mantissa bits that need to be stored for a normalized value should exclude the identical leading bytes and the insignificant bits, which are calculated based on the user-specified error bound and the variation radius of the corresponding block. The number of such necessary mantissa bits is generally not a multiple of 8 (to be verified later), so committing/storing these bits in the compressed data requires specific bitwise-operation strategies.

Fig. 5: Illustration of 3 Ways to Store Necessary Mantissa Bits (Solution C is our performance optimization strategy)

Storing a short bit-array with an arbitrary number of bits is a very common operation in lossy compression. The most straightforward solution (Solution A in Figure 5) is to treat the given bit-array as a particular integer and populate the target bit-stream pool (i.e., mb_array in the figure) by applying a couple of bitwise operations (such as bit-shift, bit-and, and bit-or) on the integer. Many lossy compressors store arbitrary bits this way, such as Pastri [9]. An alternative solution (Solution B in Figure 5) is to split the necessary bits into two parts – an integral number of necessary bytes plus a few residual bits – which was adopted by SZ [7, 8]. In this solution, the residual bits, whose count varies, still need to be gathered into a target array by a set of bitwise operations.

By comparison, we develop an ultra-fast method (Solution C in Figure 5) to deal with the necessary bits very efficiently. The basic idea is to bitwise right-shift the normalized value by s bits, where s is given in Formula (5), such that the number of necessary bits to store is always a multiple of 8. As such, the necessary mantissa bits can be represented by an integral number of bytes, with residual bits eliminated. In this situation, we just need a memory-copy operation to commit the necessary bits to a byte array, which is fairly fast.

  s = ⌈m/8⌉ × 8 − m   (5)
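A sketch of the byte-alignment arithmetic behind Solution C, assuming s is chosen as the distance from m to the next multiple of 8 (which matches the stated "always a multiple of 8" property; the function names are ours):

```python
# Pad the m required bits up to the next byte boundary: after shifting,
# the stored bits form whole bytes and can be committed with a plain memcpy.

def shift_for_byte_alignment(m: int) -> int:
    """Shift amount s such that m + s is the next multiple of 8."""
    return (8 - m % 8) % 8

def aligned_bits(m: int) -> int:
    """Total bits actually stored under Solution C."""
    return m + shift_for_byte_alignment(m)

s = shift_for_byte_alignment(13)   # 13 required bits -> pad by 3 to 16 bits
```

The `aligned_bits(m) - m` term is exactly the per-value space overhead analyzed in Section V-A1.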
V-A1 Investigating Space Overhead of Bitwise Right-Shifting

Note that the bitwise right-shifting operation may increase the total number of bits to store, thus reducing the compression ratio in turn. In the following text, we will show that the increase in bits per value caused by the right-shifting operation is very limited compared with the compressed data size, thanks to the design of identical leading bytes. That is, the space overhead introduced to the compressed data size is negligible in most cases. In fact, although the bitwise right-shifting operation may increase the required number of bits, it may also increase the number of identical leading bytes, such that some necessary bits can be "recorded" by the leading-byte array instead. In other words, after the right-shifting operation, the necessary bits tend to increase at the right end but decrease at the left end, which forms a counteraction to a certain extent.

We use Figure 6 (based on two real-world simulation datasets with different value-range-based error bounds [26]) to show the specific space overhead of our bitwise right-shifting solution, as compared with the compressed data size. Specifically, the space overhead is defined as the ratio of the increased storage space introduced by the right-shifting method to the compressed data size, as presented in Formula (6):

  overhead = (N_C − N_AB) / (S / CR)   (6)

where CR is the compression ratio, S refers to the original data size (thus S/CR is the compressed data size), N_C refers to the total number of necessary bytes stored under Solution C (our solution), and N_AB refers to the total number of necessary bytes stored by Solution A or B.
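Formula (6) can be evaluated directly; the numbers below reproduce the Miranda 'density' example discussed later in this subsection (variable names are ours).

```python
# Space overhead of Solution C: extra bytes stored, relative to the
# compressed data size S/CR.

def space_overhead(bytes_solution_c: int, bytes_solution_ab: int,
                   original_bytes: float, cr: float) -> float:
    compressed_bytes = original_bytes / cr
    return (bytes_solution_c - bytes_solution_ab) / compressed_bytes

# Miranda 'density': 144MB original, CR = 9.923, byte counts from the text
overhead = space_overhead(10_381_765, 10_167_542, 144 * 1024 * 1024, 9.923)
# overhead is about 0.014, i.e., ~1.4%
```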

Fig. 6: Space overhead of bitwise right-shifting used in UFZ, showing the min, 2nd-min, avg, 2nd-max, and max overhead for two application datasets, each with multiple fields. Panels: (a) Hurricane-ISABEL (e=1E-3); (b) Miranda (e=1E-3); (c) Hurricane-ISABEL (e=1E-4); (d) Miranda (e=1E-4); (e) Hurricane-ISABEL (e=1E-5); (f) Miranda (e=1E-5)

According to Figure 6, which covers a total of about 100 different fields across these two applications, it is clearly observed that the space overhead is always lower than 12% for all fields, and the average overhead for each case (with a specific block size) is always around or lower than 5% of the compressed data size. We give an example to further illustrate how small the overhead is. For the field 'density' in the Miranda simulation dataset, the original data size is 256×384×384×4 bytes = 144MB, and the compression ratio of UFZ is 9.923, so the compressed data size is about 15.2MB. Our characterization shows that Solution B and Solution C lead to 81,340,334 necessary bits (i.e., 10,167,542 bytes) and 83,054,120 necessary bits (i.e., 10,381,765 bytes), respectively, which means the overhead is only (10,381,765 − 10,167,542) bytes / 15.2MB ≈ 1.4% for this field.

V-A2 Exploring The Optimal Block Size

Different block sizes may affect the compressed data sizes (i.e., compression ratios) significantly, so it is necessary to investigate the most appropriate block-size setting for UFZ. As described previously, there are two types of blocks in the design, called 'constant' blocks (lines 4~5 in Algorithm 1) and 'nonconstant' blocks (lines 6~12 in Algorithm 1), respectively. Before exploring the optimal block size, we need to understand how the two types of blocks contribute to the compressed data size (or compression ratios), which we analyze as follows (with three impact factors summarized).

  • Analysis of Constant Blocks: Constant blocks refer to blocks each of which can be approximated by one data value (i.e., the mean of min and max, μ). As such, the smaller the block size, the more data points are included in constant blocks because of the finer-grained block-wise processing, as illustrated in Figure 7 (a): the first set of 8 data points can form a constant block because of the relatively small block size. In this sense, the compression ratio tends to increase with decreasing block size, because all values within a constant block can be approximated by one value (i.e., μ); we call this impact factor A in the following text. However, since each constant block needs to store one constant value in the compressed data, the smaller the block size, the larger the number of μ values to store, which may also decrease the compression ratio in turn, as illustrated in Figure 7 (b); we call this phenomenon impact factor B. Specifically, for relatively smooth regions of the dataset, the algorithm still needs to store multiple μ's even though a large number of adjacent data points could be approximated by one uniform value instead. This may introduce significant overhead because of the extra unnecessary μ's to store, thus lowering the compression ratio.

    Fig. 7: Constant block's pros and cons when block size is small – (a) Pros of Small Block Size; (b) Cons of Small Block Size
  • Analysis of Nonconstant Blocks: On the one hand, impact factor B also applies to nonconstant blocks, as they too need to store μ for data denormalization during decompression. On the other hand, a smaller block size may tend to yield higher compression ratios for the following reason (we call it impact factor C): the smaller the block size, the smaller the variation within the block (i.e., a smaller radius r), and thus the fewer necessary bits to store. Specifically, as shown in Figure 7 (a), the first 8-point block has much smaller data variation than the other one, so its corresponding exponent would be smaller, leading to fewer required mantissa bits (according to Formula (4)).

Based on the above analysis, different block sizes have distinct pros and cons for compression quality with regard to the two different types of blocks, and it is not obvious which block size obtains the best compression quality. In what follows, we explore the best block-size setting by characterizing the compression ratios and the Peak Signal to Noise Ratio (PSNR) under different block sizes, as presented in Figure 8. PSNR is a critical data-quality assessment metric for lossy compression, widely used in the lossy compression and visualization communities [26, 7, 25, 32, 18]. PSNR is defined in Formula (7).


  PSNR = 20·log10( (d_max − d_min) / √MSE )   (7)

where d_min and d_max are the minimal and maximal values in the dataset D, and MSE refers to the mean squared error between the original dataset and the reconstructed dataset. The higher the PSNR, the higher the precision of the reconstructed data.
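The value-range-based PSNR of Formula (7) in code (a straightforward sketch; the function name is ours):

```python
import math

# PSNR = 20*log10(value range) - 10*log10(MSE), the value-range-based form
# commonly used for lossy-compression quality assessment.

def psnr(original, reconstructed):
    n = len(original)
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / n
    vrange = max(original) - min(original)
    return 20 * math.log10(vrange) - 10 * math.log10(mse)

# Uniform error of 0.01 on data with range 3 -> roughly 49.5 dB
p = psnr([0.0, 1.0, 2.0, 3.0], [0.01, 1.01, 2.01, 3.01])
```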

Fig. 8: Compression quality of Miranda data with various block sizes – (a) CR (e=1E-3); (b) CR (e=1E-4); (c) PSNR (e=1E-3); (d) PSNR (e=1E-4)

In this exploration, we checked many different error bounds from 1E-3 through 1E-6. Because of the space limit, we present in Figure 8 only the results for the value-range-based error bounds of 1E-3 and 1E-4, compressing 7 fields of the Miranda simulation dataset with UFZ. Other error bounds and datasets exhibit very similar results.

From Figure 8, we observe that the compression ratio increases with block size in most cases, while the PSNR always stays at the same level across different block-size settings. This suggests that the best block size is 128, based on a comprehensive characterization covering multiple applications and error bounds. The characterization also indicates that impact factor B dominates the overall compression ratio, because it is the only factor that can enhance the compression ratio with increasing block size.

V-B Performance Optimization for GPU

In this section, we describe our design and implementation of cuUFZ – the CUDA GPU version of UFZ. UFZ is an irregular application for GPU platforms due to the data dependencies in both compression and decompression. In order to maximize utilization of the GPU computing capacity, algorithmic adjustments and architecture-specific optimizations are needed for cuUFZ.

Compression: The basic design of cuUFZ's compression is that each CUDA thread-block handles one data-block. A thread-block is configured in two dimensions, and the data-block size is chosen as a multiple of the warp size to optimize performance. There are two phases in cuUFZ compression. In the first phase, the compressor distinguishes constant from non-constant data-blocks by calculating their μ values and variation radii. The entire set of data-blocks is processed by the thread-blocks iteratively. A thread-block with a non-constant data-block enters the second phase, whereas one with a constant data-block records the data-block index and immediately proceeds to the next data-block. In this fashion, workload imbalance is significantly mitigated, because the number of data-blocks is considerably larger than the number of thread-blocks. We also note that the min and max calculations are the main computing workload in the first phase, and they can be efficiently parallelized by leveraging CUDA warp-level operations.

Only the thread-blocks with non-constant data-blocks go through the second phase. In the second phase, the leading-number-based compression is performed. Performing the computation of xor_leadingzero_array and mb_array (a.k.a., mid-bytes) on GPU is straightforward. However, writing the mid-bytes back to global memory in a compact format is challenging. Unlike the serial implementation, in cuUFZ, the starting address to write mid-bytes of each data-block remains unknown until the total number of mid-bytes of all its preceding data-blocks is computed. Therefore, a prefix scan should be performed before writing the mid-bytes to the memory. Prefix-scan on GPU has been well studied [19]. We leverage the classical design and implement it using 2-level in-warp shuffles.
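The role of the prefix scan can be emulated serially as follows: the write offset of each data-block's mid-bytes is the sum of all preceding blocks' mid-byte counts, i.e., an exclusive scan. This is an illustration of the idea, not the CUDA kernel.

```python
from itertools import accumulate

# Exclusive prefix scan: offsets[i] = sum of midbyte_counts[0..i-1].
# On GPU this is the warp-shuffle-based scan described in the text.

def midbyte_offsets(midbyte_counts):
    return [0] + list(accumulate(midbyte_counts))[:-1]

# Four data-blocks producing 3, 0, 5, and 2 mid-bytes respectively
offsets = midbyte_offsets([3, 0, 5, 2])   # write positions 0, 3, 3, 8
```

Once the offsets are known, every thread-block can write its mid-bytes into the compact output buffer independently.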

Decompression: The basic design of cuUFZ’s decompression is similar to its compression, in which each thread-block handles one data-block. Since decompressing a constant data-block is very lightweight, we only decompress the non-constant data-blocks on GPU. The decompression consists of two components: the leading-byte retrieval and the mid-byte retrieval. Implementing the latter is relatively straightforward, as it just needs to read the bytes from the compressed data. We note that, for the same reason as in compression, a parallel prefix-scan is applied before retrieving the mid-bytes. However, implementing the leading-byte retrieval is challenging, since retrieving the bytes from the preceding adjacent element no longer works in a parallel environment.

Fig. 9: Schematic graph showing the index propagation for parallel leading-byte retrievals (assume the number of required bytes for this block is 3).

In Figure 9, we illustrate the challenge by an example (see the first row in particular), which displays an eight-element data-block with the mid-bytes highlighted in yellow. In the serial code, we can retrieve the third leading-byte of the third element (B33) by reading the mid-byte of the second element (B32), and then subsequently retrieve B34 by reading B33. In the parallel context, however, B33 and B34 are retrieved simultaneously, which causes a read-after-write (RAW) hazard. The same issue occurs when B27 and B28 are retrieved simultaneously.

To solve this issue, the decompression must predetermine the mid-byte position from which each leading-byte should read. For example, in this case, both B33 and B34 should read from B32. To this end, we propose an index-propagation algorithm. It assigns each byte an initial reading position: leading-bytes start with the first element’s index (1 in the example), while mid-bytes start with their own elements’ indices. Then a parallel interleaved-addressing propagation spreads the position values: if the source value is greater than the destination value, the source value becomes the new reading position for the current byte. With the interleaved addressing scheme, the propagation complexity is reduced to O(log n). In the running example, we only need three rounds of shuffles with strides 1, 2, and 4 to finish the propagation. Note that we omit the stride-4 shuffles in Figure 9 to save space; they do not change the final position values in this case. After the index-propagation, each leading-byte knows which mid-byte it should read from to retrieve its value, as shown in the last row of Figure 9.
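Functionally, the index-propagation behaves like an inclusive max-scan over the initial reading positions. The Python sketch below mimics it with interleaved-addressing rounds of strides 1, 2, 4, ... (the real implementation uses CUDA warp shuffles; the mid-byte pattern chosen here is illustrative, not exactly Figure 9's):

```python
def index_propagation(is_mid):
    """For each element, compute the 1-based index of the nearest element
    at or before it that carries a mid-byte, via interleaved-addressing
    max propagation in O(log n) rounds. Leading-bytes start with the
    first element's index (1); mid-bytes start with their own index."""
    n = len(is_mid)
    pos = [(i + 1) if mid else 1 for i, mid in enumerate(is_mid)]
    stride = 1
    while stride < n:                      # rounds with strides 1, 2, 4, ...
        nxt = pos[:]
        for i in range(stride, n):
            if pos[i - stride] > nxt[i]:   # source greater than destination:
                nxt[i] = pos[i - stride]   # it becomes the new reading position
        pos = nxt
        stride *= 2
    return pos

# 8-element block where elements 1, 2, and 6 carry mid-bytes for this byte plane
reading_pos = index_propagation([True, True, False, False, False, True, False, False])
# elements 3-5 read from element 2's mid-byte; elements 7-8 from element 6's
```

Because every round only reads positions written in the previous round, all lanes can update in parallel without the RAW hazard described above.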

VI Performance Evaluation

In this section, we analyze the evaluation results of experiments performed with 6 real-world scientific datasets on heterogeneous devices offered by two different supercomputers.

VI-A Experimental Setup

Table II describes all the application datasets used in our experiments. All the datasets are downloaded from the well-known scientific data reduction benchmark website [22].

Application # fields Size per field Description
CESM-ATM (CE.) [12] 77 1800×3600 Atmosphere simulation of the Community Earth System Model
Hurricane (Hu.) [11] 13 100×500×500 Simulation of Hurricane ISABEL
Miranda (Mi.) [20] 7 256×384×384 Large-eddy simulation of multicomponent flows with turbulent mixing
Nyx (Ny.) [21] 6 512×512×512 Adaptive-mesh, massively parallel cosmological simulation
QMCPack (QM.) [13] 2 288/816×115×69×69 Simulation of the electronic structure of atoms, molecules and solids
SCALE-LetKF (SL.) [27] 12 98×1200×1200 SCALE-RM weather simulation based on the LETKF filter
TABLE II: Applications (all datasets here are originally stored in single-precision floating-point)

We perform our GPU experiments on both A100 GPU (offered by ANL ThetaGPU [15]) and V100 GPU (offered by ORNL Summit [24]). We perform our CPU experiments on ANL ThetaGPU’s compute nodes. We compare our developed ultra-fast compressor UFZ with two outstanding lossy compressors – SZ [7, 25] and ZFP [18], since they are arguably the fastest existing error-bounded compressors based on prior studies [32, 7] and they both have GPU versions that can be compared with our solution UFZ in the experiments.

VI-B Evaluation Results

First of all, we check the data reconstruction quality under UFZ for all the simulation datasets involved in our experiments. We conclude that the overall visual quality looks great when the value-range-based error bound (denoted REL) is set to 1E-2∼1E-4 for UFZ. Due to the space limit, we demonstrate the visual quality, PSNR, and Structural Similarity Index (SSIM) using only the Hurricane-ISABEL simulation dataset (CLOUDf48), as shown in Figure 10 (the compression ratios are 14.6, 18, and 20.64, respectively). We observe that the reconstructed data’s visual quality is quite high, even when zooming into the top-left corner by 50×, though a few artifacts can be seen in the dark blue area of Figure 10 (d). How to further mitigate or remove such artifacts will be our future work.

(a) original data

(b) REL=1E-3, PSNR=74.4, SSIM=0.93

(c) REL=4E-3, PSNR=62, SSIM=0.89

(d) REL=1E-2, PSNR=54.6, SSIM=0.865
Fig. 10: Visual Quality of UFZ on Hurricane-ISABEL Simulation (the compression ratios are 14.6, 18, 20.64, respectively)

We present the compression ratios (CR) of the three compressors in Table III, showing the minimal, overall (i.e., harmonic mean), and maximum CR, respectively, for all the fields in each application. The table shows that UFZ can reach very high compression ratios (e.g., 124 for CESM) when REL=1E-2. Its overall compression ratio is 3∼12 in all the cases, which is 0.5∼3× lower than ZFP’s and 3∼30× lower than SZ’s. This is reasonable because SZ and ZFP adopt advanced multidimensional data analysis and sophisticated encoding methods, which yield fairly high compression ratios but may in turn suffer lower execution performance on both CPU and GPU (to be shown later). By comparison, the overall compression ratio of the lossless compressor Zstd is only 1.12∼1.49, which is lower than that of UFZ by about 200∼400%.
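As a side note on the “overall” column: the harmonic mean of per-field compression ratios equals the aggregate ratio (total original size divided by total compressed size) when all fields have the same original size, which is why it is used rather than the arithmetic mean. A tiny illustration (the ratios below are made up, not taken from Table III):

```python
def harmonic_mean(ratios):
    """Harmonic mean of compression ratios: n / sum(1/r_i)."""
    return len(ratios) / sum(1.0 / r for r in ratios)

# three equally sized fields compressed at 4x, 8x, and 16x
hm = harmonic_mean([4.0, 8.0, 16.0])
# equals (3 * S) / (S/4 + S/8 + S/16) for any common field size S
```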

           REL    CESM              Hurricane         Miranda           Nyx               QMCPack           Scale-LetKF
                  min  avg   max    min  avg  max     min  avg  max     min  avg   max    min  avg  max     min  avg  max
UFZ        1E-2   4    9.1   124    4    6.6  21.1    8.2  11.8 16.2    4.8  11.34 124    9.2  9.4  9.7     7.6  10.6 25.2
           1E-3   2.84 4.61  19.3   2.9  4    17.6    4.5  7.2  12.5    3.2  5.9   119    4.3  4.4  4.4     3.65 4.7  7.8
           1E-4   2.14 3.3   17     2.1  3    16.2    2.7  4.5  9.5     2.4  3.7   75     2.9  2.9  2.9     2.7  3.14 5.6
ZFP        1E-2   8    13.6  46     6.4  11.3 25.8    30.5 46.6 74.6    22.5 38.8  1.1k   39.1 39.2 39.4    9.4  14.5 23.8
           1E-3   4.3  7.9   30     4.3  6.7  13.2    20.6 25.6 38.5    8.2  13.1  150    21   21.1 21.2    6.4  7.8  13.4
           1E-4   3    5.1   18.8   2.9  4.32 10.4    11   14.5 22.9    4.1  6.2   74     10.3 10.3 10.4    3.9  4.6  7.7
SZ         1E-2   34.4 151   3k     20.4 49.8 339     92.8 126  234     263  507   21k    201  213  227     26.3 84   746
           1E-3   15.6 151   840    9.24 17.5 58.8    49.6 59.5 75.2    36.7 79    3.6k   52   54.3 56.8    18.9 26.5 140
           1E-4   6.4  38.3  104    5.6  9.8  31      25.1 29.6 35      10.3 18.2  621    18.9 19.2 19.6    10   13.9 23.1
zstd       -      1.03 1.44  17.1   1.08 1.49 19.56   1.6  1.21 4.86    1.08 1.12  1.14   1.18 1.19 1.2     1.08 1.37 2.95

TABLE III: Compression Ratios (Original Data Size / Compressed Data Size)

We present the single-CPU compression and decompression throughputs in Table IV and Table V, respectively. The numbers shown are the overall performance across all the fields of each application. From the tables, we observe that our UFZ compressor significantly outperforms the other two error-bounded lossy compressors in both compression and decompression speed. Specifically, for compression, UFZ is about 2.5∼5× as fast as ZFP and about 5∼7× as fast as SZ. For decompression, UFZ is about 2∼4× as fast as both ZFP and SZ. Such high performance of UFZ is mainly attributed to two factors: (1) the super-lightweight skeleton design (Algorithm 1), and (2) the bitwise right-shifting strategy proposed in Section V-A.

REL CE. Hu. Mi. Ny. QM. SL.
1E-2 1034 796 959 1087 969 1032
UFZ 1E-3 822 750 833 877 902 703
1E-4 752 662 807 722 813 663
1E-2 392 256 249 418 323 258
ZFP 1E-3 288 213 211 284 275 208
1E-4 234 181 280 226 208 174
1E-2 236 193 186 258 205 217
SZ 1E-3 170 153 161 229 216 156
1E-4 143 130 139 164 147 124
TABLE IV: Overall Compression Throughput on CPU (MB/s)
REL CE. Hu. Mi. Ny. QM. SL.
1E-2 1221 1085 1950 1450 1292 1408
UFZ 1E-3 1022 1006 1546 1218 1083 975
1E-4 925 864 1319 956 928 886
1E-2 485 476 498 732 685 360
ZFP 1E-3 327 371 401 455 524 395
1E-4 246 297 327 333 376 299
1E-2 559 451 549 635 588 519
SZ 1E-3 381 291 444 534 462 334
1E-4 269 229 392 359 282 236
TABLE V: Overall Decompression Throughput on CPU (MB/s)

We evaluate the GPU performance for cuUFZ, cuZFP, and cuSZ on two cutting-edge supercomputers – ANL thetaGPU (A100) and ORNL Summit (V100), respectively. The compression and decompression performance results regarding all the fields of each application are presented in Figure 11 and Figure 12, respectively.

According to Figure 11, the peak compression throughput of UFZ reaches up to 264 GB/s (see Hurricane’s result in Figure 11 (a)). The overall compression throughput of UFZ is 150∼216 GB/s on ThetaGPU and 140∼188 GB/s on Summit. By comparison, both cuSZ and cuZFP exhibit much lower GPU throughput (9.8∼86 GB/s on ThetaGPU and 12∼52 GB/s on Summit).

(a) thetaGPU (A100)

(b) Summit (V100)
Fig. 11: Overall Compression Throughput Per GPU (GB/s)

According to Figure 12, the peak decompression throughput of UFZ reaches up to 446 GB/s (see Miranda’s result in Figure 12 (a)). The overall decompression throughput of UFZ is 150∼291 GB/s on ThetaGPU and 120∼243 GB/s on Summit. By comparison, both cuSZ and cuZFP exhibit much lower decompression throughput on GPU (9.7∼67 GB/s on ThetaGPU and 13.7∼48 GB/s on Summit).

(a) thetaGPU (A100)

(b) Summit (V100)
Fig. 12: Overall Decompression Throughput Per GPU (GB/s)

We also evaluate the overall data dumping/loading performance on ANL ThetaGPU nodes at different execution scales. Specifically, for the data-dumping experiment, we use an MPI code to launch 64∼1024 ranks/cores, each performing lossy compression on the NYX dataset and writing the compressed data to the parallel file system (PFS). For the data-loading experiment, each MPI rank reads the compressed data from the PFS and then performs decompression. We present the performance breakdown in Figure 13 for different value-range-based error bounds.

(a) dumping performance (1E-2)

(b) loading performance (1E-2)

(c) dumping performance (1E-3)

(d) loading performance (1E-3)

(e) dumping performance (1E-4)

(f) loading performance (1E-4)
Fig. 13: Data Dumping/Loading Performance on thetaGPU (NYX dataset)

From the figure, we can clearly observe that UFZ obtains the highest overall performance in both data dumping and data loading on ThetaGPU. In absolute terms, the solution with UFZ takes only one-third to one-half of the time to dump or load data compared with the other solutions in most cases; that is, the I/O performance is improved by 100%∼200% under UFZ. The key reason is that ThetaGPU has relatively fast I/O, so the compression/decompression stage becomes the key bottleneck at the execution scales in our experiments.

VII Conclusion and Future Work

In this paper, we propose an ultra-fast error-bounded lossy compression framework – UFZ. We perform comprehensive evaluations using 6 real-world scientific datasets and two cutting-edge supercomputers’ heterogeneous resources. The key insights are summarized as follows.

  • With the same error bound, UFZ has reasonably lower compression ratios than SZ and ZFP do (0.3∼3× lower than ZFP and 3∼30× lower than SZ), because it has no sophisticated data prediction/transform step and no expensive encoding algorithm such as Huffman encoding.

  • On CPU: with the same error bound, UFZ is 2.5∼5× as fast as ZFP and 5∼7× as fast as SZ in compression; UFZ is 2∼4× as fast as both SZ and ZFP in decompression.

  • On GPU: with the same error bound, UFZ’s peak compression and decompression throughputs on a single GPU reach up to 264 GB/s and 446 GB/s, respectively, which is 2∼16× as fast as SZ and ZFP on GPU.

  • When compressing and writing compressed data to the parallel file system (PFS), or reading and decompressing compressed data from the PFS, on ANL ThetaGPU with 64∼1024 cores, the overall data dumping/loading performance under UFZ is higher than that with SZ or ZFP by 100%∼200%, because of UFZ’s ultra-fast compression/decompression and ThetaGPU’s relatively fast I/O speed.

In future work, we plan to explore how to further improve the compression ratios of UFZ.


This research was supported by the Exascale Computing Project (ECP), Project Number: 17-SC-20-SC, a collaborative effort of two DOE organizations – the Office of Science and the National Nuclear Security Administration, responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, to support the nation’s exascale computing imperative. The material was supported by the U.S. Department of Energy, Office of Science, and by DOE’s Advanced Scientific Research Computing Office (ASCR) under contract DE-AC02-06CH11357, and supported by the National Science Foundation under Grant No. 2003709 and OAC-2104023. We acknowledge the computing resources provided on ThetaGPU, which is operated by the Argonne Leadership Computing Facility at Argonne National Laboratory.


  • [1] M. Ainsworth, O. Tugluk, B. Whitney, and S. Klasky (2018) Multilevel techniques for compression and reduction of scientific data—the univariate case. Computing and Visualization in Science 19(5), pp. 65–76.
  • [2] R. Ballester-Ripoll, P. Lindstrom, and R. Pajarola (2018) TTHRESH: tensor compression for multidimensional visual data. CoRR abs/1806.05952.
  • [3] F. Cappello, S. Di, S. Li, X. Liang, G. M. Ali, D. Tao, C. Yoon Hong, X. Wu, Y. Alexeev, and T. F. Chong (2019) Use cases of lossy compression for floating-point data in scientific datasets. International Journal of High Performance Computing Applications (IJHPCA) 33, pp. 1201–1220.
  • [4] Y. Collet (2015) Zstandard – real-time data compression algorithm.
  • [5] cuZFP (2020).
  • [6] L. P. Deutsch (1996) GZIP file format specification version 4.3.
  • [7] S. Di and F. Cappello (2016) Fast error-bounded lossy HPC data compression with SZ. In IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 730–739.
  • [8] S. Di, D. Tao, X. Liang, and F. Cappello (2019) Efficient lossy compression for scientific data based on pointwise relative error bound. IEEE Transactions on Parallel and Distributed Systems 30(2), pp. 331–345.
  • [9] A. M. Gok, S. Di, Y. Alexeev, D. Tao, V. Mironov, X. Liang, and F. Cappello (2018) PaSTRI: error-bounded lossy compression for two-electron integrals in quantum chemistry. In 2018 IEEE International Conference on Cluster Computing (CLUSTER), pp. 1–11.
  • [10] S. Habib, V. Morozov, N. Frontiere, H. Finkel, A. Pope, K. Heitmann, K. Kumaran, V. Vishwanath, T. Peterka, J. Insley, et al. (2016) HACC: extreme scaling and performance across diverse architectures. Communications of the ACM 60(1), pp. 97–104.
  • [11] Hurricane ISABEL simulation dataset, IEEE Visualization 2004 Test.
  • [12] J. Kay, C. Deser, A. Phillips, A. Mai, C. Hannay, G. Strand, J. Arblaster, S. Bates, G. Danabasoglu, J. Edwards, et al. (2015) The Community Earth System Model (CESM) large ensemble project: a community resource for studying climate change in the presence of internal climate variability. Bulletin of the American Meteorological Society 96(8), pp. 1333–1349.
  • [13] J. Kim et al. (2018) QMCPACK: an open source ab initio quantum Monte Carlo package for the electronic structure of atoms, molecules and solids. 30(19), p. 195901.
  • [14] S. Lakshminarasimhan, N. Shah, S. Ethier, S. Klasky, R. Latham, R. Ross, and N. F. Samatova (2011) Compressing the incompressible with ISABELA: in-situ reduction of spatio-temporal data. In Euro-Par 2011 Parallel Processing, Berlin, Heidelberg, pp. 366–379.
  • [15] LCRC (2021) ThetaGPU machine overview.
  • [16] X. Liang, S. Di, D. Tao, S. Li, S. Li, H. Guo, Z. Chen, and F. Cappello (2018) Error-controlled lossy compression optimized for high compression ratios of scientific datasets. In 2018 IEEE International Conference on Big Data.
  • [17] P. Lindstrom and M. Isenburg (2006) Fast and efficient compression of floating-point data. IEEE Transactions on Visualization and Computer Graphics 12(5), pp. 1245–1250.
  • [18] P. Lindstrom (2014) Fixed-rate compressed floating-point arrays. IEEE Transactions on Visualization and Computer Graphics 20(12), pp. 2674–2683.
  • [19] M. Harris, S. Sengupta, and J. D. Owens. Parallel prefix sum (scan) with CUDA.
  • [20] Miranda turbulence simulation.
  • [21] NYX simulation.
  • [22] Scientific data reduction benchmark.
  • [23] SLAC National Accelerator Laboratory (2017) Linac Coherent Light Source (LCLS-II).
  • [24] Summit.
  • [25] D. Tao, S. Di, Z. Chen, and F. Cappello (2017) Significantly improving lossy compression for scientific data sets based on multidimensional prediction and error-controlled quantization. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1129–1139.
  • [26] D. Tao, S. Di, H. Guo, Z. Chen, and F. Cappello (2019) Z-checker: a framework for assessing lossy compression of scientific data. The International Journal of High Performance Computing Applications 33(2), pp. 285–303.
  • [27] The local ensemble transform Kalman filter (LETKF) data assimilation package for the SCALE-RM weather model.
  • [28] J. Tian et al. (2020) cuSZ: an efficient GPU-based error-bounded lossy compression framework for scientific data. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques (PACT ’20), pp. 3–15.
  • [29] R. Underwood, S. Di, J. C. Calhoun, and F. Cappello (2020) FRaZ: a generic high-fidelity fixed-ratio lossy compression framework for scientific floating-point data.
  • [30] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2011).
  • [31] X. Wu, S. Di, E. M. Dasgupta, F. Cappello, H. Finkel, Y. Alexeev, and F. T. Chong (2019) Full-state quantum circuit simulation by using data compression. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’19), New York, NY, USA.
  • [32] K. Zhao, S. Di, X. Liang, S. Li, D. Tao, Z. Chen, and F. Cappello (2020) Significantly improving lossy compression for HPC datasets with second-order prediction and parameter optimization. In Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing (HPDC ’20), pp. 89–100.
  • [33] Zlib.