I Introduction
Large-scale high-performance computing (HPC) applications can generate extremely large volumes of scientific data. For instance, the Hardware/Hybrid Accelerated Cosmology Code (HACC) [hacc] can simulate 1 to 10 trillion particles in one simulation and produce up to 220 TB of data per snapshot, amounting to a total of 22 PB of data during a simulation [miraio] with only one hundred timesteps/snapshots. Such a large volume of data imposes an unprecedented burden on supercomputer storage and interconnects [liang2018error], both for storing data to persistent storage and for loading data for post-analysis and visualization. Therefore, data reduction has attracted great attention from HPC application users as a way to reduce the volume of data moved to and from storage systems. A common approach is simply decimating snapshots periodically and reconstructing the data by interpolation; however, such approaches result in a significant loss of valuable information for post-analysis [liangdecimationdrbsd4]. Traditional data deduplication and lossless compression have also been used to shrink data size but suffer from very limited reduction ratios on HPC floating-point datasets. Specifically, deduplication generally reduces the size of scientific datasets by only 20% to 30% [meister2012study], and lossless compression generally achieves a reduction ratio of up to about 2:1 [son2014data]. This is far from scientists' desired compression ratios, which are around 10:1 or higher (e.g., for the Community Earth System Model (CESM) [baker2014methodology]).
Error-bounded lossy compression has been proposed to significantly reduce data size while ensuring acceptable data distortion for users [sz17]. SZ [sz16, sz17] is a state-of-the-art error-bounded lossy compression framework for scientific data (detailed in §II), which often offers higher compression quality (i.e., better rate-distortion) than other state-of-the-art techniques [liang2018error]. However, as illustrated in prior work [sz16, sz17], SZ suffers from low compression and decompression throughput, only tens to hundreds of megabytes per second on a single CPU core. This throughput is far from sufficient for extreme-scale applications or advanced instruments with extremely high data acquisition rates, which is a major concern for the corresponding users. The LCLS-II laser [lcls], for instance, may produce data at a rate of 250 GB/s [usecaseFranck], such that researchers require an extremely fast compression solution that still achieves relatively high compression ratios (for example, 10:1) with preserved data accuracy. To match such a high data production rate, leveraging graphics processing units (GPUs) is a fairly attractive solution because of their massive single-instruction multiple-thread (SIMT) parallelism and their high programmability as opposed to FPGAs or ASICs [tian2020wavesz].
Moreover, the SZ algorithm has linear time complexity and performs a large number of memory read and write operations; hence, its performance is ultimately bounded by memory bandwidth. State-of-the-art GPUs can provide not only high computation capability but also high memory bandwidth. For example, an NVIDIA V100 GPU provides at least an order of magnitude higher memory bandwidth than state-of-the-art CPUs [nv100].
SZ, however, cannot run on GPUs efficiently because of the lack of parallelism in its design. The main challenges are twofold: first, the tight dependency in the prediction-quantization step of the SZ algorithm incurs expensive synchronizations across iterations in a GPU implementation; second, during the customized Huffman coding step of the SZ algorithm, encoding and decoding each symbol based on the constructed Huffman tree involve many different branches (see §II for more details). This causes serious warp divergence and random memory access issues, which inevitably lead to low GPU memory bandwidth utilization and low performance.
To solve these issues, this paper presents an optimized GPU version of the SZ algorithm, called cuSZ (the code is available at https://github.com/hipdaclab/cuSZ), and proposes a series of optimization techniques for cuSZ to achieve high compression and decompression throughput on GPUs. Specifically, we focus on the main performance bottlenecks (Lorenzo prediction [ibarria2003out] and customized Huffman coding [sz17]) and improve their performance on GPUs. We propose a novel technique called dual-quantization that can be applied to any prediction-based compression algorithm to eliminate the tight dependency in its prediction step. Moreover, according to prior work [usecaseFranck], a strict error-controlling scheme of lossy compression is needed by many HPC applications for their scientific explorations and post-analyses; however, state-of-the-art GPU-based lossy compressors such as cuZFP [cuZFP] are not error-bounded. To the best of our knowledge, cuSZ is the first strictly error-bounded lossy compressor on GPUs for scientific data. Our contributions are summarized as follows.


We propose a generic dual-quantization scheme that entirely removes the data dependencies in the prediction step of prediction-based lossy compression, and we apply it to the Lorenzo predictor in SZ.

We develop an efficient customized Huffman coding for SZ on GPUs with both fine-grained and coarse-grained parallelism.

We carefully implement cuSZ and optimize its performance on the CUDA architecture. In particular, we fine-tune the chunk size in Huffman coding and develop an adaptive method that dynamically selects a 32-bit or 64-bit representation for Huffman codes, which can significantly improve the utilization of GPU memory bandwidth.

We evaluate our proposed cuSZ on five real-world HPC application datasets from a public repository, the Scientific Data Reduction Benchmarks [sdrbench], and compare it with other state-of-the-art methods on both CPUs and GPUs. Experiments show that cuSZ improves the compression throughput by up to 370.1× and 13.1× over the production version of SZ running on a single CPU core and on multiple CPU cores, respectively. cuSZ also delivers up to 3.48× higher compression ratio than another advanced GPU-supported lossy compressor with reasonable data distortion.
The rest of the paper is organized as follows. In §II, we discuss SZ lossy compression in detail. In §III, we propose our novel optimizations for the GPU version of SZ and implement it using CUDA. In §IV, we present evaluation results based on five real-world simulation datasets from the Scientific Data Reduction Benchmarks and compare cuSZ with other state-of-the-art compressors on both CPUs and GPUs. In §V, we discuss related work. In §VI, we present our conclusions and discuss future work.
II SZ Background
Many scientific applications require strict error-bounded control when using lossy compression, as well as a high compression ratio, in order to achieve accurate post-analysis and visualization for scientific discovery. SZ [sz16, sz17] is a prediction-based lossy compression framework for scientific data that strictly controls the global upper bound of the compression error. Given a user-set error bound $eb$, SZ guarantees $|d_i - d_i'| \le eb$ for every data point, where $d_i$ and $d_i'$ are the original value and the decompressed value, respectively. SZ’s algorithm involves five key steps: preprocessing, data prediction, linear-scaling quantization, customized variable-length encoding, and optional lossless compression, e.g., gzip [gzip] and Zstd [zstd].


Preprocessing: SZ performs a preprocessing step, such as data linearization in version 1.0 or a logarithmic transform for the point-wise relative error bound in version 2.0 [xincluster18].

Data Prediction: SZ predicts the value of each data point with a data-fitting predictor, e.g., the Lorenzo predictor [ibarria2003out], based on its neighboring values. In order to guarantee that the compression error is always within the user-set error bound, the predicted values must be exactly the same in the compression and decompression procedures. To this end, the neighboring values used in the prediction have to be the decompressed values instead of the original values.

Linear-Scaling Quantization: SZ computes the difference between the predicted value and the original value for each data point and performs a linear-scaling quantization [sz17] to convert the difference to an integer based on the user-set error bound.

Customized Variable-Length Coding: SZ adopts a customized Huffman coding algorithm to reduce the data size significantly, because the integer codes generated by the linear-scaling quantization are likely to be unevenly distributed, especially when most of the data are predicted accurately.

Lossless Compression: The last step of SZ optionally compresses the encoded data further with a lossless compressor such as Zstd [zstd], which may significantly reduce the data size thanks to potential repeated patterns in the bitstream.
In this work, we focus mainly on the SZ compressor, because much prior work [foster2017computing, di2018efficient, liang2018error, pastri, understandcompressionipdps18, jin2019deepsz, liang2019significantly, zhao2020significantly] has verified that SZ yields the best compression quality among prediction-based compressors. However, it is nontrivial to port SZ to GPUs because of the strict constraints in its compression design. For instance, the data used in the prediction must be updated with decompressed values, so the data prediction in the SZ compressor [sz16, sz17] has to be performed point by point in sequential order. This requirement introduces a loop-carried read-after-write (RAW) dependency during compression (discussed in §III-A2), making SZ hard to parallelize.
We mainly focus on SZ-1.4 instead of SZ-2.0 because the 2.0 model is designed particularly for low-precision use cases with visualization goals, in which the compression ratio can reach several hundred while the reconstructed data often have large distortions. Recent studies [usecaseFranck], however, demonstrate that scientists often require relatively high precision (i.e., a low error bound) for sophisticated post-analyses beyond visualization purposes. In this situation (with relatively low error bounds), SZ-2.0 has compression qualities very similar (or even slightly worse, in some cases) to those of SZ-1.4, as demonstrated in [liang2018error]. Accordingly, our design for the GPU-accelerated SZ lossy compression is based on SZ-1.4 and takes advantage of both algorithmic and GPU hardware characteristics. Moreover, the current CPU version of SZ does not support SIMD vectorization and has no specific arithmetic-performance optimizations. Therefore, the CPU baseline used in our evaluation is the non-vectorized single-core and multicore implementation.
III Design Methodology of cuSZ
In this section, we propose our novel lossy compression design, cuSZ, for CUDA architectures based on the SZ model. A system overview of our proposed cuSZ is shown in Fig. 1. We develop different coarse- and fine-grained parallelization techniques for each subprocedure in compression and decompression. Specifically, we first employ a data-chunking technique to exploit coarse-grained data parallelism. The chunking technique is used throughout the whole cuSZ design, including the lossless (steps 2 and 3) and lossy (steps 1, 4, and 5) procedures in both compression and decompression. We then deeply analyze the RAW data dependency in SZ and propose a novel two-phase prediction-quantization approach, namely dual-quantization, which entirely eliminates the data dependency in the prediction-quantization step. Furthermore, we provide an in-depth breakdown analysis of Huffman coding and develop an efficient Huffman coding on GPUs with multiple optimizations. A summary of our parallelization techniques is shown in TABLE I.
TABLE I: Summary of parallelization techniques used in cuSZ.

  subprocedure                          sequential  coarse-grained  fine-grained  atomic
  compression:
    dual-quantization                       -            yes             yes         -
    histogram                               -            -               yes         yes
    build Huffman tree                      yes          -               -           -
    canonize codebook                       yes          -               yes         yes
    Huffman encode (fixed-length)           -            -               yes         -
    deflate (fixed- to variable-length)     -            yes             -           -
  decompression:
    inflate (Huffman decode)                -            yes             -           -
    reversed dual-quantization              -            yes             -           -
III-A Parallelizing Prediction-Quantization in Compression
In this section, we discuss our proposed optimization techniques to parallelize SZ’s prediction-quantization procedure on GPU architectures. We first chunk the original data to gain coarse-grained parallelism, and then we assign a thread to each data point for fine-grained in-chunk parallel computation.
III-A1 Chunking and Padding
Fig. 2 illustrates our chunking and padding technique. For each chunked data block, we assign a thread to each data point (i.e., fine-grained parallelism). To avoid complex modifications to the prediction function after chunking, we add a padding layer to each block in the prediction-quantization step. We set all the values in the padding layer to 0 such that they do not affect the predicted values of the points neighboring the padding layer, as shown in Fig. 2. We note that in the original SZ, the uppermost and leftmost points (denoted by the "outer layer", shaded in Fig. 2) are saved directly as unpredictable data. In our chunking version, however, directly storing these points for each block would significantly degrade the compression ratio. Therefore, we apply prediction to the outer layer instead, such that every point in the block is processed consistently by the predictor, avoiding thread/warp divergence. Moreover, because we initialize the padding layer with 0s, the prediction for each outer-layer point falls back to the 1D first-order Lorenzo predictor, as shown in Fig. 2. Based on our empirical results, we adopt a block size of 16×16 for 2D data and 8×8×8 for 3D data (1D data use a one-dimensional chunk of empirically chosen length).

III-A2 Dual-Quantization Scheme
In the following discussion, we use a circle and a bullet to denote quantities in the compression and decompression procedures, respectively, and a star (*) to denote all values related to data reconstruction in compression. The subscript $i$ represents the $i$-th iteration.
Read-After-Write in SZ
In the original SZ algorithm, all data points must go through prediction, quantization, and in situ reconstruction iteratively, which causes intrinsic read-after-write (RAW) dependencies (as illustrated in Fig. 3).
We describe the loop-carried RAW dependency issue in detail below. For any data point $d_i$ at the $i$-th iteration in SZ compression, given a predicted value $p_i$, the prediction error $e_i = d_i - p_i$ is converted to an integer and a corresponding quantization code $q_i$ based on the user-set error bound $eb$. Then, the reconstructed prediction error $e_i^*$ and the reconstructed value $d_i^*$ are generated using $q_i$, $eb$, and $p_i$. After that, $d_i^*$ is written back to replace $d_i$. This procedure ensures that $d_i^*$ in compression is equivalent to the reconstructed value during decompression (as shown in Fig. 3); however, the $(i{+}1)$-th iteration must wait until the update of $d_i^*$ completes at the end of the $i$-th iteration, which incurs a loop-carried data dependency. Note also that $d_i^*$ is written back in the last step of the current iteration and its written value is read at the beginning of the next iteration; therefore, two consecutive iterations cannot overlap. Hence, under the original design of the prediction-quantization in SZ, it is infeasible to effectively exploit fine-grained parallelism and efficiently utilize SIMT on GPUs. We present the original SZ prediction-quantization procedure in detail in Algorithm 1.
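For illustration, the sequential loop above can be sketched as follows (a minimal 1D NumPy version under our own variable names, not the production SZ code); note how each iteration reads the reconstructed value that the previous iteration wrote back:

```python
import numpy as np

def sz_predict_quantize(data, eb):
    """Original SZ prediction-quantization (1D first-order Lorenzo sketch).
    The write-back of the reconstructed value creates the loop-carried RAW
    dependency: iteration i reads what iteration i-1 wrote."""
    quant = np.empty(len(data), dtype=np.int64)
    recon_prev = 0.0                           # reconstructed neighbor d*_{i-1}
    for i in range(len(data)):
        pred = recon_prev                      # predict from the *reconstructed* value
        err = data[i] - pred                   # prediction error e_i
        quant[i] = int(np.rint(err / (2 * eb)))       # linear-scaling quantization
        recon_prev = pred + quant[i] * 2 * eb         # in situ reconstruction (RAW)
    return quant

def sz_reconstruct(quant, eb):
    out, prev = np.empty(len(quant)), 0.0
    for i, q in enumerate(quant):
        prev += q * 2 * eb                     # same sequential dependency in decompression
        out[i] = prev
    return out

data = np.array([1.00, 1.01, 1.05, 2.00])
q = sz_predict_quantize(data, eb=0.01)
assert np.all(np.abs(sz_reconstruct(q, 0.01) - data) <= 0.01 + 1e-9)
```

The final assertion checks the error-bound guarantee: every decompressed value is within $eb$ of the original.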
Proposed Dual-Quantization Approach
To eliminate the RAW dependency, we propose a dual-quantization scheme that modifies the data representation during the prediction-quantization procedure. Our dual-quantization (abbreviated as dual-quant) consists of two steps: prequantization and postquantization.
Given a dataset $\{d_i\}$ of arbitrary dimension, we first perform a quantization based on the user-set error bound $eb$ and convert it to a new dataset:
$$\tilde d_i = 2\,eb \cdot \mathrm{round}\!\left(\frac{d_i}{2\,eb}\right),$$
where any $\tilde d_i$ is strictly a multiple of $2\,eb$. We call this step prequantization (abbreviated as prequant). In order to avoid overflow, $\tilde d_i$ is stored using a floating-point data type. We note that the error introduced by prequant is strictly bounded by the user-set error bound, that is, $|d_i - \tilde d_i| \le eb$.
After the prequantization, we can calculate each predicted value based on the surrounding prequantized values (denoted by $\tilde d_{\mathcal N(i)}$) and the predictor (denoted by $f$) as
$$p_i = f\big(\tilde d_{\mathcal N(i)}\big).$$
The second step, called postquantization (abbreviated as postquant), is the counterpart of the linear-scaling quantization in the original SZ. Postquant computes the difference between the predicted value and the prequantized value,
$$\delta_i = \tilde d_i - p_i,$$
and encodes it as the integer quantization code $q_i = \delta_i / (2\,eb)$. Note that $q_i$ is quantitatively equivalent to $\delta_i$, but they use different representations: $\delta_i$ is a floating-point value to avoid overflow, while $q_i$ is an integer, which is used in the subsequent lossless coding (e.g., Huffman coding).
In the following text, we explain in detail why the dual-quant method effectively eliminates the RAW dependency. Conceptually, similar to the original SZ, we can construct $\delta_i^*$ and $\tilde d_i^*$ during compression, as shown in Fig. 3. In fact, for the $i$-th iteration, $\delta_i^*$ is strictly equal to $\delta_i$, because casting the quantization code $q_i$ back to $\delta_i^*$ is an exact reverse of casting $\delta_i$ to $q_i$. Similarly, $\tilde d_i^*$ and $\tilde d_i$ are also strictly equivalent. Consequently, unlike the original SZ, which must write back $d_i^*$ to update $d_i$ before the $(i{+}1)$-th iteration, $\tilde d_i^* \equiv \tilde d_i$ always holds in our proposed dual-quant approach. As illustrated in Fig. 3, after prequant, all $\tilde d_i$ are dependency-free for postquant. By eliminating the loop-carried RAW dependency (marked by arrows in Fig. 3), we can effectively parallelize the dual-quant procedure with fine-grained (per-point) parallel computation, as is common in image processing [zhang2010image]. We illustrate the detailed dual-quant procedure in Algorithm 2.
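A minimal sketch of the dual-quant scheme (1D first-order Lorenzo, NumPy; variable names are illustrative, not cuSZ’s) makes the dependency structure visible: both prequant and postquant are per-point operations with no write-back of reconstructed values:

```python
import numpy as np

def dual_quant(data, eb):
    """Dual-quantization sketch (1D first-order Lorenzo).
    prequant: snap every value onto the grid of multiples of 2*eb, a purely
    per-point operation; postquant: the prediction error on prequantized
    values is then an exact integer, so no reconstructed value is needed."""
    pre = np.rint(data / (2 * eb))        # prequant, in units of 2*eb (kept as float)
    pred = np.empty_like(pre)
    pred[0] = 0.0
    pred[1:] = pre[:-1]                   # 1D first-order Lorenzo on prequant values
    return (pre - pred).astype(np.int64)  # postquant: exact integers, fully parallel

def reverse_dual_quant(quant, eb):
    # decompression: prefix sum of the deltas, scaled back to data units
    return np.cumsum(quant) * (2 * eb)

data = np.array([1.00, 1.01, 1.05, 2.00])
q = dual_quant(data, eb=0.01)
assert np.all(np.abs(reverse_dual_quant(q, 0.01) - data) <= 0.01 + 1e-9)
```

Note that reconstruction is a prefix sum over the deltas, which is why decompression remains sequential within a chunk (cf. §III-C), while compression is embarrassingly parallel per point.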
Lorenzo Predictor with Binomial Coefficients
According to the prior work by Tao et al. [sz17], the generalized Lorenzo predictor for $m$-dimensional data is given by
$$f = \sum_{\substack{0 \le k_1,\dots,k_m \le n \\ (k_1,\dots,k_m) \ne (0,\dots,0)}} (-1)^{k_1+\cdots+k_m+1} \binom{n}{k_1}\cdots\binom{n}{k_m}\, d_{i_1-k_1,\,\dots,\,i_m-k_m},$$
where $m$ is the dimensionality and $n$ is the order of the predictor. For example, the 1D first-order predictor is $f = d_{i-1}$, and the 2D first-order predictor is $f = d_{i-1,j} + d_{i,j-1} - d_{i-1,j-1}$, as illustrated in Fig. 1. We note that all the coefficients in the formula of the predictor are integers; thus, the prediction computation consists of mathematically integer-based operations (additions and multiplications). This ensures that no division is needed and that the data reconstruction based on dual-quant is precise and robust with respect to machine epsilon, whereas the original SZ, which uses floating-point operations throughout, can suffer from underflow. Moreover, the integer-valued prediction errors are completely corrected by the saved quantization codes in decompression, so the final compression error is still bounded by $eb$.
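The integer-coefficient property can be checked numerically: on prequantized data (exact multiples of $2\,eb$, here tracked in units of $2\,eb$), a first-order 2D Lorenzo prediction always yields exact integer deltas. This is a sketch under our own notation, not the cuSZ kernel:

```python
import numpy as np

# First-order 2D Lorenzo on prequantized data: the coefficients are the
# integers +1, +1, -1, so predictions of values lying on the 2*eb grid stay
# on the grid and the postquant deltas are exact integers.
eb = 0.5
raw = np.random.default_rng(0).uniform(0.0, 100.0, size=(8, 8))
pre = np.rint(raw / (2 * eb))            # prequant, in units of 2*eb

padded = np.zeros((9, 9))                # zero padding layer (cf. chunking, III-A1)
padded[1:, 1:] = pre
pred = padded[:-1, 1:] + padded[1:, :-1] - padded[:-1, :-1]   # top + left - top-left
delta = pre - pred                       # postquant deltas
assert np.all(delta == np.rint(delta))   # exact integers: no rounding error
```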
III-B Efficient Customized Huffman Coding
To efficiently compress the quantization codes generated by dual-quant, we develop an efficient customized Huffman coding for SZ on GPUs. Specifically, Huffman coding consists of the following subprocedures: (1) calculating the statistical frequency of each quantization bin (as a symbol); (2) building the Huffman tree from the frequencies and generating a base codebook along with each code's bitwidth; (3) transforming the base codebook into the canonical Huffman codebook (called canonization); and (4) encoding in parallel according to the codebook and concatenating the Huffman codes into a bitstream (called deflating). Huffman decoding consists of retrieving the reverse codebook and decoding accordingly.
Note that the fourth subprocedure (encoding) can be further decomposed into two steps for fine-grained optimization. Codebook-based encoding is essentially a memory copy and can be fully parallelized at fine granularity, whereas deflating can be performed only sequentially within each chunk (apart from the block-wise parallelization discussed in §III-A1) because of its atomic operations. We discuss Huffman coding on GPUs step by step as follows.
III-B1 Histogram for Quantization Bins
The first step of Huffman coding is to build a histogram representing the frequency of each quantization bin produced by the data prediction step. The GPU histogramming algorithm that we use is derived from the algorithm proposed by Gómez-Luna et al. [GmezLuna2012AnOA]. This algorithm minimizes conflicts in updating the histogram bin locations by replicating the histogram for each thread block and storing the replicas in shared memory. Where possible, conflicts are further reduced by replicating the histogram such that each block has access to multiple copies. All threads inside a block read a specified partition of the quantization codes and use atomic operations to update a specific replicated histogram. As each block finishes its portion of the predicted data, the replicated histograms are combined via a parallel reduction into a single global histogram, which is used to construct the final codebook in Huffman coding.
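The privatized-histogram idea can be sketched as follows (a NumPy emulation of the GPU scheme; the per-copy atomic updates and the final reduction are modeled on the host, not the actual CUDA kernel):

```python
import numpy as np

def gpu_style_histogram(codes, num_bins, num_blocks=4):
    """Privatized-histogram sketch: each 'thread block' owns a replicated
    histogram (one row here) and updates it for its partition of the
    quantization codes; on the GPU those updates are atomic. The replicas
    are finally merged by a reduction."""
    local = np.zeros((num_blocks, num_bins), dtype=np.int64)
    for b, part in enumerate(np.array_split(np.asarray(codes), num_blocks)):
        np.add.at(local[b], part, 1)      # emulates per-copy atomic updates
    return local.sum(axis=0)              # parallel reduction on the GPU

codes = np.random.default_rng(1).integers(0, 16, size=10_000)
assert np.array_equal(gpu_style_histogram(codes, 16), np.bincount(codes, minlength=16))
```

Replication trades shared-memory space for fewer atomic collisions; the reduction cost is negligible because the number of bins is small.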
III-B2 Constructing the Huffman Codebook
In order to build the optimal Huffman tree, the local symbol frequencies need to be aggregated into the global symbol frequencies for the whole dataset. Using the aggregated frequencies, we build a codebook according to the Huffman tree for encoding. Note that the number of symbols, namely the number of quantization bins, is limited (generally no greater than 65,536) and much smaller than the data size (generally millions of data points or more). This leads to a much lower number of nodes in the Huffman tree compared with the data size, so the time cost of building the Huffman tree is low. We note that building the Huffman tree sequentially on a CPU would benefit from high CPU frequency and low memory-access latency; however, it would require transferring the frequencies from GPU to CPU before building the tree and the codebook from CPU to GPU afterward, and communicating these two small messages incurs non-negligible overhead. Therefore, we use one GPU thread to build the Huffman tree sequentially and avoid such overheads.
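For reference, the sequential tree construction can be sketched on the host with a standard heap-based algorithm (this mirrors what the single GPU thread computes, but is an illustrative sketch rather than the cuSZ source):

```python
import heapq
from itertools import count

def build_huffman_codebook(freqs):
    """Sequential Huffman-tree construction over quantization-bin frequencies.
    Returns {symbol: bitstring}. The counter breaks ties so tuples with equal
    frequencies never compare their tree-node payloads."""
    tiebreak = count()
    heap = [(f, next(tiebreak), sym) for sym, f in enumerate(freqs) if f > 0]
    heapq.heapify(heap)
    if len(heap) == 1:                           # degenerate single-symbol case
        return {heap[0][2]: "0"}
    while len(heap) > 1:                         # merge two lightest subtrees
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (fa + fb, next(tiebreak), (a, b)))
    codebook = {}
    def walk(node, code):                        # traverse tree to emit codes
        if isinstance(node, tuple):
            walk(node[0], code + "0")
            walk(node[1], code + "1")
        else:
            codebook[node] = code
    walk(heap[0][2], "")
    return codebook

cb = build_huffman_codebook([45, 13, 12, 16, 9, 5])
assert len(cb[0]) == 1          # the most frequent symbol gets the shortest code
```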
We propose an adaptive codeword representation to enhance the utilization of memory bandwidth, which in turn improves Huffman encoding performance. We illustrate the organization of the codebook in Fig. 4. The codebook is organized in units of unsigned integers; each unit contains a variable-length Huffman codeword starting from the LSB (the rightmost, least significant bits) and its bitwidth starting from the MSB (the leftmost, most significant bits). According to the pessimistic estimation of the maximum bitwidth of an optimal Huffman codeword [abu2000maximal], one is supposed to use uint64_t to hold each bitwidth-codeword representation; for example, the maximum bitwidth could be 33 bits for the CLDHGH field of the CESM-ATM dataset in the worst case. However, we note that using 32-bit unsigned integers (i.e., uint32_t) to represent a bitwidth-codeword tuple can significantly improve Huffman encoding and decoding performance compared with 64-bit unsigned integers (i.e., uint64_t), because of higher GPU memory bandwidth utilization. Thus, we propose to dynamically select the uint32_t or uint64_t representation for the Huffman bitwidth-codeword based on the practical maximum bitwidth instead of the pessimistic estimation. We show the performance evaluation with different representations in §IV.
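A sketch of the bitwidth-codeword packing follows; the 8-bit MSB width field and the resulting 24-bit codeword threshold are illustrative assumptions for this sketch, not cuSZ’s exact layout:

```python
def pack_codeword(code_bits, bitwidth, unit_bits=32):
    """Pack a Huffman codeword into one unsigned-integer 'unit': the codeword
    occupies the LSBs and its bitwidth the MSBs (field sizes are assumed)."""
    width_field = 8                                    # assumed MSB field size
    assert bitwidth <= unit_bits - width_field
    return (bitwidth << (unit_bits - width_field)) | code_bits

def unpack_codeword(unit, unit_bits=32):
    width_field = 8
    bitwidth = unit >> (unit_bits - width_field)
    code_bits = unit & ((1 << (unit_bits - width_field)) - 1)
    return code_bits, bitwidth

def choose_unit(max_bitwidth):
    # adaptive representation: fall back to 64-bit units only when the
    # longest codeword actually observed does not fit in a 32-bit unit
    return 32 if max_bitwidth <= 24 else 64

u = pack_codeword(0b1011, 4)
assert unpack_codeword(u) == (0b1011, 4)
assert choose_unit(max_bitwidth=18) == 32
assert choose_unit(max_bitwidth=33) == 64
```

Halving the unit size halves the bytes moved per symbol during encoding and decoding, which is where the bandwidth gain comes from.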
The theoretical time complexity is $O(n \log n)$ for building the Huffman tree and $O(n)$ for traversing it, where $n$ is the number of symbols (quantization bins). Our experiments show that the real execution time of building the Huffman tree is consistent with this theoretical analysis (see TABLE III). The number of symbols, in turn, is determined by the smoothness of the dataset and the user-desired error bound (1,024 bins by default). For example, with a relatively large error bound, such as a loose value-range-based relative error bound (i.e., an error bound relative to the value range of the dataset, denoted by valrel), we observe that most of the symbols are concentrated near the center of the codebook. As the error bound decreases, the symbols become more evenly distributed. Thus, determining a suitable number of quantization bins is important for achieving high performance in constructing the codebook.
III-B3 Canonizing the Codebook
A canonical Huffman codebook [barnett2003canonical] keeps the same bitwidth for each codeword as the original (base) Huffman codebook, while its bijective mapping between quantization codes and Huffman codewords and its ordered codewords make the memory layout more efficient. The time complexity of sequentially building a canonical codebook from the base codebook is linear in the number of symbols $n$ (i.e., the number of quantization bins), which is sufficiently small compared with the data size. By using a canonical codebook, we can (i) decode without the Huffman tree, (ii) efficiently cache the reverse codebook for high decoding throughput, and (iii) maintain exactly the same compression ratio as the base Huffman codebook.
The process of building the canonical codebook can be decomposed into the following subprocedures: (i) linearly scanning the base codebook (sequentially $O(n)$), which we parallelize at fine granularity with atomic operations; (ii) loosely radix-sorting the codewords by bitwidth (sequentially $O(n)$), which cannot be parallelized because of an intrinsic RAW dependency; and (iii) building the reverse codebook (sequentially $O(n)$), which we parallelize at fine granularity.
It is intuitive to separate the functionalities of the aforementioned subprocedures and implement them into independent CUDA kernels with different configurations (i.e., threads per thread block, thread blocks per grid).
Based on our profiling results on an NVIDIA V100 GPU, however, launching a CUDA kernel usually takes about 60 microseconds (µs), or about 200 µs for three kernels, as measured over 11 kernel launches in total.
Moreover, any two consecutive subprocedures would require an additional expensive synchronization (i.e., cudaDeviceSynchronize).
However, our experiment indicates that the kernel time of building a canonical codebook is sufficiently short (about 200 µs); thus, we integrate all three subprocedures into one single kernel.
We note that this single kernel must be launched with multiple thread blocks for two reasons. On the one hand, multiple thread blocks provide high scalability for the parallel reads/writes in subprocedures (i) and (iii). On the other hand, unlike histogramming, which keeps only the frequencies in shared memory, this kernel must hold both the codebook and its footprint, which may exceed the maximum shared-memory capacity of a single thread block (e.g., 96 KB on a V100). Since shared memory is visible only to its own thread block, synchronizing codewords across different thread blocks through shared memory is impossible; thus, we use global memory instead of shared memory to store the codebook. As a result, we have to place a lightweight in-grid barrier between consecutive subprocedures in the kernel. Specifically, we employ the CUDA Cooperative Groups API [harris2017cooperative] to achieve in-grid synchronization. We note that launching a cooperative kernel takes only about 30 µs on a V100, which significantly reduces the overhead compared with launching multiple kernels.
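The canonization itself follows the standard canonical Huffman assignment and can be sketched sequentially (symbol names and the bitstring representation here are illustrative):

```python
def canonize(bitwidths):
    """Build a canonical Huffman codebook from per-symbol bitwidths: symbols
    are sorted by (bitwidth, symbol) and assigned consecutive code values,
    shifting left each time the bitwidth grows. Decoding then needs only the
    bitwidths, not the original tree."""
    order = sorted((w, s) for s, w in bitwidths.items())
    codes, code, prev_w = {}, 0, order[0][0]
    for w, s in order:
        code <<= (w - prev_w)          # widen to the next bitwidth
        codes[s] = format(code, "0{}b".format(w))
        code += 1
        prev_w = w
    return codes

# bitwidths as produced by a base Huffman codebook (illustrative values)
codes = canonize({"a": 1, "b": 3, "c": 3, "d": 2})
assert codes == {"a": "0", "d": "10", "b": "110", "c": "111"}
```

Because consecutive codes of equal length are numerically adjacent, the reverse codebook reduces to a few per-length offsets, which is what makes it cheap to cache during decoding.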
III-B4 Encoding and Deflating
We design an efficient solution to perform the encoding with GPU threads in parallel. Encoding involves looking up a symbol in the codebook and performing a memory copy. After we adaptively select a 32- or 64-bit unsigned integer to represent a Huffman code with its bitwidth, the encoding step is fully parallelized. To generate the dense bitstream of Huffman codes within each data block, we conduct deflating, which concatenates the Huffman codes and removes the unnecessary zero bits according to the saved bitwidths.
Since the deflated code is organized sequentially, we apply the coarse-grained chunk-wise parallelization technique discussed in §III-A1. In particular, each data chunk for compression and decompression is mapped to one GPU thread. Note that the chunk size for deflating is not necessarily the same as the global chunk size and does not depend on the dimensionality; we optimize the deflating chunk size by evaluating the performance with different sizes (shown in §IV-B1). We also employ a memory-reuse technique to reduce the GPU memory footprint in deflating. Specifically, we reuse the memory space of the fixed-length Huffman codes for the deflated bitstream, because the latter uses significantly less memory and there is no conflict when writing the deflated bitstream to its designated location.
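Deflating one chunk can be sketched as follows (an illustrative byte-level Python version; the real implementation packs into 32-/64-bit units on the GPU):

```python
def deflate_chunk(symbols, codebook):
    """Concatenate the variable-length Huffman codes of one chunk into a
    dense bitstream (one GPU thread per chunk). Returns the packed bytes plus
    the total bit count so trailing pad bits can be ignored when inflating."""
    bits = "".join(codebook[s] for s in symbols)
    nbits = len(bits)
    bits += "0" * (-nbits % 8)                       # pad to a byte boundary
    stream = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return stream, nbits

# illustrative canonical codebook over four symbols
codebook = {"a": "0", "d": "10", "b": "110", "c": "111"}
stream, nbits = deflate_chunk("abac", codebook)
assert nbits == 1 + 3 + 1 + 3
assert stream == bytes([0b01100111])
```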
III-C Decompression
cuSZ’s decompression consists of two steps: Huffman decoding (i.e., inflating the densely concatenated Huffman bitstream) and reversed dual-quant. In inflating, we first use the previously built reverse codebook to retrieve the quantization codes from the deflated Huffman bitstream. Then, based on the retrieved quantization codes, we reconstruct the floating-point data values. Note that only coarse-grained chunking can be applied to decompression, and the chunk size is determined during compression. The reason is that both steps have a RAW dependency: retrieving the variable-length codes follows the same loop-carried RAW pattern, and in the reversed dual-quantization procedure each data point cannot be decompressed until its preceding values are fully reconstructed.
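Inflating can be sketched with the reverse codebook (illustrative Python; the bit-by-bit lookup shows why each codeword's start position depends on where the previous one ended):

```python
def inflate_chunk(stream, nbits, reverse_codebook):
    """Sequentially decode one chunk's bitstream with the reverse codebook.
    Each codeword's start depends on where the previous one ended, which is
    why decoding is parallelized only at chunk granularity."""
    bits = "".join(format(byte, "08b") for byte in stream)[:nbits]
    symbols, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in reverse_codebook:       # codes are prefix-free: first hit wins
            symbols.append(reverse_codebook[buf])
            buf = ""
    return symbols

# illustrative reverse codebook (Huffman codeword -> symbol)
reverse_codebook = {"0": "a", "10": "d", "110": "b", "111": "c"}
assert inflate_chunk(bytes([0b01100111]), 8, reverse_codebook) == ["a", "b", "a", "c"]
```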
IV Experimental Evaluation
In this section, we present our experimental setup (including platform, baselines, and datasets) and our evaluation results.
IV-A Experimental Setup
Evaluation Platform
We conduct our experimental evaluation using the PantaRhei cluster [pantarhei]. We perform the experiments on an NVIDIA V100 GPU [nv100] from the cluster and compare with lossy compressors running on two 20-core Intel Xeon Gold 6148 CPUs from the same cluster. The GPU is connected to the host via a 16-lane PCIe 3.0 interconnect. We use NVIDIA CUDA 9.2 and its default profiler to measure kernel time.
TABLE II: Tested datasets from the Scientific Data Reduction Benchmarks [sdrbench].

  dataset            type   dimension   total size
  HACC               fp32   1D          6.3 GB
  CESM-ATM           fp32   2D          2.0 GB
  Hurricane ISABEL   fp32   3D          1.9 GB
  Nyx                fp32   3D          3.0 GB
  QMCPACK            fp32   4D          1.2 GB

Comparison Baselines
Comparison Baselines
We compare our cuSZ with two baselines: SZ-1.4.13.5 and cuZFP [cuZFP]. For SZ-1.4, we adopt the default setting: 1,024 quantization bins for linear-scaling quantization, best_compression mode, and best_speed mode for gzip, which together yield a good tradeoff between compression ratio and performance.
TABLE III: Time (in ms) to build the Huffman tree and construct the codebook for different numbers of quantization bins.

  #quant. bins    128    256    512    1024   2048   4096   8192
  build tree      0.48   0.77   1.80   2.13   6.46   12.68  25.06
  get codebook    0.20   1.14   2.36   2.69   7.09   14.43  25.65
  total           0.68   2.16   4.16   4.81   13.55  27.10  50.71
Test Datasets
We conduct our evaluation and comparison based on five typical real-world HPC simulation datasets of different dimensionalities from the Scientific Data Reduction Benchmarks suite [sdrbench]: 1D HACC cosmology particle simulation [hacc], 2D CESM-ATM climate simulation [cesmatm], 3D Hurricane ISABEL simulation [hurricane], 3D Nyx cosmology simulation [nyx], and 4D QMCPACK quantum Monte Carlo simulation [qmcpack]. They have been widely used in prior work [tao2018optimizing, liang2018error, xincluster18, usecaseFranck, liang2019improving] and are good representatives of production-level simulation datasets. TABLE II lists all 112 fields across these datasets (the QMCPACK dataset includes only one field but with two representations: raw data and preconditioned data). The data sizes of the five datasets are 6.3 GB, 2.0 GB, 1.9 GB, 3.0 GB, and 1.2 GB, respectively. Note that our evaluated HACC dataset is consistent with real-world scenarios that generate petabytes of data. For example, according to [hacc], a typical large-scale HACC simulation for cosmological surveys runs on 16,384 nodes, each with 128 million particles, and generates 5 PB over the whole simulation, which contains 100 individual snapshots of roughly 3 GB per node. We evaluate a single snapshot of each dataset instead of all snapshots, because the compressibility of most snapshots is usually very similar. Moreover, when a field is too large to fit in a single GPU’s memory, cuSZ divides it into blocks and compresses them block by block.
IV-B Evaluation Results and Analysis
In this section, we evaluate the compression performance and quality of cuSZ and compare it with CPU-SZ and cuZFP on the tested datasets.
IV-B1 Compression Performance
We first evaluate the performance of the dual-quant step of cuSZ. The average throughput of the dual-quant step on each tested dataset is shown in TABLE VII. Compared with the original serial CPU-SZ, the prediction-quantization throughput is improved by more than 1,000× via our proposed dual-quant on the GPU. This improvement is because dual-quant entirely eliminates the RAW dependency and enables fine-grained (per-point) parallel computation, which is significantly accelerated on the GPU.
We then evaluate the performance of our implemented Huffman coding step by step. First, we conduct the experiment of Huffman histogram computation and show its throughput performance.^4 (^4 All throughputs shown in the paper are measured based on the original data size and time.) Efficiently computing a histogram on a GPU is an open challenge, because multiple threads need to write to the same memory locations simultaneously. Here, we present a method that, while a bottleneck in the Huffman process, is a 2× improvement over a serial implementation.
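A common GPU strategy for reducing this write contention is to accumulate into per-block private histograms and merge them afterward. The following is only a serial Python sketch of that two-phase idea, not the actual CUDA kernel; function and parameter names are illustrative:

```python
def privatized_histogram(data, num_bins, num_blocks=4):
    """Two-phase histogram: each 'thread block' accumulates into its own
    private copy (reducing contention on a single shared array), then the
    private copies are reduced into the final histogram."""
    # Phase 1: per-block private histograms. On a GPU these would live in
    # shared memory and be updated with atomicAdd by the block's threads.
    chunk = (len(data) + num_blocks - 1) // num_blocks
    private = [[0] * num_bins for _ in range(num_blocks)]
    for b in range(num_blocks):
        for v in data[b * chunk:(b + 1) * chunk]:
            private[b][v] += 1
    # Phase 2: merge the private copies into the global histogram.
    return [sum(p[i] for p in private) for i in range(num_bins)]
```

Privatization trades a final merge pass for far fewer conflicting atomic updates on the global histogram.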
Next, we perform the experiment of constructing the codebook with different numbers of quantization bins, as shown in TABLE III. We note that the execution times of building a Huffman tree and creating a codebook are consistent with our time-complexity analyses in §III-B2. We use 1,024 quantization bins by default. Since the time overhead of constructing a codebook depends only on the number of quantization bins, it is almost fixed (e.g., 4.81 ms) for the remaining experiments. We also note that a larger data size lowers the relative overhead of constructing a codebook, thus leading to higher overall performance.
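The tree-building and codebook steps operate only on the bin frequencies, which is why their cost depends on the bin count rather than the data size. A minimal heap-based sketch of deriving codeword bit-lengths from bin frequencies (illustrative only, not the GPU implementation):

```python
import heapq
from itertools import count

def huffman_code_lengths(freq):
    """Return {bin_index: codeword bit-length} from bin frequencies,
    using a binary heap; O(k log k) in the number k of nonzero bins."""
    tiebreak = count()  # unique counter so tuple comparison never reaches the tree
    heap = [(f, next(tiebreak), i) for i, f in enumerate(freq) if f > 0]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate: a single symbol
        return {heap[0][2]: 1}
    while len(heap) > 1:                     # merge the two lightest subtrees
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (t1, t2)))
    lengths, stack = {}, [(heap[0][2], 0)]   # walk the tree; leaf depth = bit-length
    while stack:
        node, depth = stack.pop()
        if isinstance(node, tuple):
            stack.append((node[0], depth + 1))
            stack.append((node[1], depth + 1))
        else:
            lengths[node] = depth
    return lengths
```

Since the number of bins is small (1,024 by default), this step's runtime is essentially constant with respect to the field size.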







TABLE IV: Huffman encoding time (μs) and throughput (GB/s) with uint64_t (enc.64) and uint32_t (enc.32) codeword representations.

                 hacc     cesm-atm  hurricane  nyx      qmcpack
enc.64  μs       4,274.3  97.1      385.8      2,044.7  2,401.4
        GB/s     250.9    255.1     251.7      251.1    251.1
enc.32  μs       2,839.3  64.1      255.8      1,358.6  1,595.6
        GB/s     377.7    386.6     379.6      377.9    377.9
We also evaluate the performance of encoding and decoding based on the canonical codebook. To increase the memory-bandwidth utilization, we adopt online selection of the Huffman codeword representation between a uint32_t and a uint64_t. TABLE IV illustrates that our encoding achieves about 250 GB/s for uint64_t and about 380 GB/s for uint32_t,^5 based on the test with all 111 fields under the error bound of 1e−4. (^5 The NVIDIA V100 GPU has a theoretical peak memory bandwidth of 900 GB/s.) Hence, we conclude that using a uint32_t enables significantly higher performance than using a uint64_t.
Because of the coarsegrained chunkwise parallelization, the performance of deflating is about 60 GB/s, which is lower than the encoding throughput of 380 GB/s.
Consequently, the Huffman coding performance is bounded mainly by the deflating throughput.
To improve the deflating and inflating performance, we further evaluate different chunk sizes and identify the appropriate sizes for both deflating and inflating on the tested datasets, as shown in TABLE VI. Specifically, we evaluate a range of power-of-two chunk sizes, whose bounds differ with the field sizes. We observe that using a total of around 2e4 concurrent threads consistently achieves the optimal throughput. Note that inflating must follow exactly the same data-chunking strategy as deflating; thus we need to select the same chunk size. Even under this constraint, our selected chunk sizes still achieve throughputs close to the peak ones, as illustrated in TABLE VI. Therefore, we conclude that the overall optimal performance can be achieved by setting up a total of 2e4 concurrent threads in practice.
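Chunkwise deflating concatenates each chunk's variable-length codewords into an independent bitstream, so chunks can be processed by separate threads. A serial Python sketch of the idea (the byte-level packing and names are illustrative, not cuSZ's exact layout):

```python
def deflate_chunk(codewords):
    """Concatenate variable-length codewords (value, bit_length) of one
    chunk into a byte stream; returns (bytes, total bit count)."""
    acc, nbits, total, out = 0, 0, 0, bytearray()
    for value, length in codewords:
        acc = (acc << length) | value        # append codeword bits
        nbits += length
        total += length
        while nbits >= 8:                    # flush full bytes
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
            acc &= (1 << nbits) - 1
    if nbits:                                # zero-pad the last partial byte
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out), total

def deflate(codewords, chunk_size):
    """Chunk the codeword stream; each chunk deflates independently (one
    GPU thread per chunk) and records its bit length so that inflating
    can use the identical chunking."""
    chunks = [codewords[i:i + chunk_size]
              for i in range(0, len(codewords), chunk_size)]
    return [deflate_chunk(c) for c in chunks]
```

The per-chunk bit counts are what allow inflating to locate each chunk's bitstream, which is why decompression must reuse the same chunk size.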
TABLE V: Compression quality of cuSZ and cuZFP at similar distortion.

                      cuSZ                         cuZFP
            bitrate    CR    PSNR        bitrate   CR   PSNR
cesm-atm    3.08 bits  10.4  85.3 dB     12 bits   2.7  88.7 dB
hurricane   3.45 bits  9.3   87.0 dB     12 bits   2.7  81.9 dB
nyx         2.49 bits  12.8  86.0 dB     6 bits    5.3  85.1 dB
qmcpack     3.38 bits  9.5   85.0 dB     8 bits    4.0  84.0 dB
TABLE VI: Deflating and inflating throughputs (GB/s) with different numbers of concurrent threads.

hacc (1071.8 MB, 280,953,867 f32)
#threads  deflate  inflate
1.4e5     4.6      2.8
6.9e4     5.1      5.1
3.4e4     13.6     12.1
1.7e4     63.1     35.0
8.6e3     65.8     28.1
4.3e3     45.9     14.3

cesm (24.7 MB, 6,480,000 f32)
#threads  deflate  inflate
1.0e5     11.3     25.0
5.1e4     15.5     37.8
2.5e4     67.1     41.6
1.3e4     55.6     30.7
6.3e3     48.2     19.6

hurricane (95.4 MB, 25,000,000 f32)
#threads  deflate  inflate
9.8e4     5.1      11.0
4.9e4     10.2     9.4
2.4e4     64.6     34.2
1.2e4     57.3     27.7
6.1e3     50.7     17.8

nyx (512 MB, 134,217,728 f32)
#threads  deflate  inflate
1.3e5     4.7      5.9
6.6e4     5.7      6.3
3.3e4     25.1     16.1
1.6e4     69.7     52.4
8.2e3     72.4     42.6
4.1e3     50.0     23.1

qmcpack (601.5 MB, 157,684,320 f32)
#threads  deflate  inflate
1.5e5     4.7      5.1
7.7e4     5.2      6.2
3.8e4     12.9     11.1
1.9e4     72.7     40.3
9.6e3     75.9     29.0
4.8e3     56.0     16.1







TABLE VII: Compression throughputs of CPU-SZ, cuSZ, and cuZFP.

CPU-SZ:
            MB/s   MB/s          MB/s
hacc        137.7  328.6         94.1
cesm-atm    105.0  459.1         85.5
hurricane   93.8   504.0         78.5
nyx         98.5   648.7         84.7
qmcpack     97.5   396.2         80.8

cuSZ:
            histogram  codebook  coding
            GB/s   GB/s   ms    GB/s  GB/s  GB/s   GB/s
hacc        207.7  602.8  5.16  54.1  40.0  53.2   22.8
cesm-atm    252.1  345.3  4.33  57.2  41.1  81.9   27.4
hurricane   175.8  418.0  4.81  55.2  38.2  40.8   19.7
nyx         200.2  427.6  3.84  58.8  41.1  134.1  31.6
qmcpack     189.6  346.1  4.09  61.0  40.7  99.2   28.9

cuZFP:
hacc        n/a    n/a    n/a
cesm-atm    47.6   27.7   17.5
hurricane   83.7   27.7   20.8
nyx         71.3   56.3   31.7
qmcpack     72.6   42.5   26.8





TABLE VII (continued): Decompression throughputs of CPU-SZ, cuSZ, and cuZFP.

CPU-SZ:
            MB/s   MB/s   MB/s
hacc        196.0  659.3  151.1
cesm-atm    502.2  451.9  237.9
hurricane   524.5  306.8  185.0
nyx         670.4  300.5  201.8
qmcpack     660.3  313.4  211.1

cuSZ:
            canonical
            dec. GB/s  GB/s  GB/s
hacc        35.0       16.8  11.4
cesm-atm    41.6       58.5  24.3
hurricane   34.2       43.9  19.2
nyx         52.4       29.7  19.0
qmcpack     40.3       22.4  14.4

cuZFP:
hacc        n/a
cesm-atm    113.1
hurricane   102.2
nyx         103.1
qmcpack     115.5
Next, we evaluate the overall compression and decompression performance of cuSZ, as shown in TABLE VII. We compare cuSZ with cuZFP in terms of both the kernel performance and the overall performance including the GPU-to-CPU communication cost. Note that the performance of cuZFP is highly related to its user-set fixed bitrate according to a previous study [jin2020understanding], whereas the performance of cuSZ is hardly affected by the user-set error bound. Therefore, we choose an acceptable fixed bitrate for cuZFP that generates data distortion (i.e., a PSNR of about 85 dB) similar to that of cuSZ, as shown in TABLE V. Also note that we exclude cuZFP for HACC in TABLE VII, because cuZFP generates fairly low compression quality on the 1D HACC dataset: even when the bitrate is as high as 16, the PSNR is only about 20 dB, which is not usable. The throughput in TABLE VII is calculated based on the original data size rather than the size of the data transferred between the GPU and CPU. TABLE VII shows that cuZFP has a higher kernel throughput but a lower overall (GPU-to-CPU) throughput than does cuSZ. The reason is that cuSZ provides a much higher compression ratio than does cuZFP at the same data distortion.
We note that the overall throughputs of cuSZ and cuZFP are close to each other because both are constrained by the CPU-GPU interconnect (16-lane PCIe 3.0) bandwidth in our evaluation. Generally speaking, many applications on GPU-based HPC systems generate the data on GPUs, so the compression needs to be performed directly on the data in GPU memory, and the compressed data currently must be transferred from GPUs to disks through CPUs. Current state-of-the-art CPU-GPU interconnect technologies such as NVLink [foley2017ultra] can typically provide a theoretical transfer rate of 50 GB/s over two links, while cuSZ's compression kernel provides a comparable throughput of about 40 GB/s. Although cuZFP's compression kernel achieves about 70 GB/s, its overall throughput is limited by the CPU-GPU bandwidth of 50 GB/s. Thus, the data transfer between CPU and GPU remains the bottleneck for high-throughput compression kernels (i.e., those exceeding about 50 GB/s). Moreover, the decompression throughput of cuSZ is lower than both its compression throughput and cuZFP's decompression throughput, because only coarse-grained chunking can be applied to decompression, as mentioned in §III-C. Here we argue that compression throughput is more important than decompression throughput, because after the compressed data are transferred and stored on parallel file systems, users mainly use CPU-SZ rather than the GPU to decompress the data for post-analysis and visualization [usecaseFranck, jin2020understanding].
We note that cuSZ on the CESM-ATM dataset exhibits much lower performance than on the other datasets. The reason is that each field of the CESM-ATM dataset is fairly small (about 25 MB), such that the codebook-construction cost is relatively high compared with the other steps on this dataset. In fact, codebook construction is not a bottleneck for relatively large datasets (e.g., hundreds of MBs per field), which are more common in practice (e.g., HACC, Nyx, QMCPACK).
We also compare the performance of cuSZ with that of the production version of SZ running on a single CPU core and on multiple CPU cores. The parallelization of OpenMP-SZ is achieved by simply chunking the whole data without any further algorithmic optimization (such as our proposed dual-quant). In particular, each thread is assigned a fixed-size block and runs the original sequential CPU-SZ code. The points on the border are handled similarly to cuSZ (as shown in Fig. 2). The main differences between OpenMP-SZ and cuSZ are fourfold: (1) In the proposed dual-quant, each point in cuSZ is assigned to a GPU thread, whereas OpenMP-SZ uses a CPU thread to handle a block of data points. (2) After postquant, the data are transformed into integers (in units of the error bound), and all the following arithmetic operations are performed on these integers; hence cuSZ does not need to handle the errors introduced by floating-point operations (e.g., underflow). (3) OpenMP-SZ does not fully parallelize Huffman coding, whereas cuSZ provides an efficient parallel implementation of Huffman coding on the GPU. (4) OpenMP-SZ supports only 3D datasets, so in our comparison we use 3D Hurricane Isabel and Nyx and mark n/a for non-3D datasets in Fig. 5. Fig. 5 illustrates the compression and decompression throughput of cuSZ (considering the CPU-GPU communication overhead) and CPU-SZ. Compared with the serial SZ, the overall compression performance is improved by 242.9× to 370.1×. cuSZ also improves the overall performance by 11.0× to 13.1× over SZ running with OpenMP on 32 cores.
IV-B2 Compression Quality
We then present the compression quality of cuSZ compared with another advanced GPU-supported lossy compressor, cuZFP, based on the compression ratios and data distortions on the tested datasets. We use the peak signal-to-noise ratio (PSNR)^6 to evaluate the quality of the reconstructed data. (^6 PSNR is calculated as 20·log10((d_max − d_min)/RMSE), where d_max/d_min is the maximal/minimal value. The root mean squared error (RMSE) is obtained by sqrt((1/N)·Σ_{i=1..N}(d_i − d'_i)²), where N is the number of data points and d_i and d'_i refer to the original and decompressed values, respectively.) The larger the PSNR, the lower the reconstructed distortion, and hence the more accurate the post-analysis. We compare cuSZ and cuZFP only on two 3D datasets, Hurricane Isabel and Nyx, because the compression quality of cuZFP on the 1D/2D datasets is much lower than that on the 3D datasets. For a fair comparison, we plot the rate-distortion curves for both cuSZ and cuZFP on all the fields of the two datasets and compare their compression quality in PSNR at the same compression ratio.
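For reference, the PSNR metric in the footnote can be computed directly as follows (a plain Python sketch; the function name is ours):

```python
import math

def psnr(original, decompressed):
    """PSNR (dB) of decompressed data against the original:
    20 * log10((d_max - d_min) / RMSE)."""
    n = len(original)
    rmse = math.sqrt(sum((d - r) ** 2
                         for d, r in zip(original, decompressed)) / n)
    value_range = max(original) - min(original)
    return 20 * math.log10(value_range / rmse)
```

Because the value range appears in the numerator, PSNR is comparable across fields with very different absolute scales.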
Fig. 6 shows the rate-distortion curves of cuSZ and cuZFP on the Nyx dataset. We observe that cuSZ generally has a higher PSNR than does cuZFP at the same compression ratio on the Nyx dataset; in other words, cuSZ provides a much higher compression ratio than does cuZFP at the same compression quality. The main reason is twofold: ZFP has better compression quality with the absolute-error-bound (fix-accuracy) mode than with the fixed-rate mode (as indicated by the ZFP developer [fixaccuracybetterthanfixrate]); and the predictor of cuSZ has higher decorrelation efficiency than does the block transform of cuZFP, especially on fields with a large value range and a concentrated distribution, such as baryon_density.
Similar results for cuSZ and cuZFP are observed on the Hurricane Isabel dataset, as shown in Fig. 7. We note that the ratedistortion curves for cuSZ—namely, QCLOUD, QICE, CLOUD—notably increase when the compression ratio decreases. This is because there are areas full of zeros, causing the compression ratio to change very slowly when the error bound is smaller than a certain value. In other words, most of the nonzeros are unpredictable, and the zeros are always predictable.
We also illustrate the overall rate-distortion curves of cuSZ and cuZFP on the Hurricane and Nyx datasets, as shown in Fig. 8. For example, cuSZ provides a 2.41× (2.49 vs. 6 bits) lower bitrate than cuZFP on the Nyx dataset and a 3.48× (3.45 vs. 12 bits) lower bitrate than cuZFP on the Hurricane Isabel dataset, with reasonable PSNRs, as shown in TABLE V.
TABLE VIII: PSNR (in dB) of SZ-1.4 and cuSZ.

field  SZ-1.4  cuSZ  field  SZ-1.4  cuSZ

CLOUDf48  84.99  94.18  QSNOWf48  84.31  93.36 
CLOUDf48.log10  84.51  87.17  QSNOWf48.log10  84.87  84.93 
Pf48  84.79  84.79  QVAPORf48  84.79  84.80 
PRECIPf48  85.35  92.86  TCf48  84.79  84.79 
PRECIPf48.log10  84.82  84.77  Uf48  84.79  84.79 
QCLOUDf48  85.03  98.91  Vf48  84.79  84.79 
QCLOUDf48.log10  85.22  95.21  Wf48  84.79  84.79 
QGRAUPf48  88.21  97.02  baryon_density  89.71  98.25 
QGRAUPf48.log10  84.90  84.82  dark_matter_density  86.57  87.77 
QICEf48  84.61  95.51  temperature  84.77  84.77 
QICEf48.log10  85.56  85.77  velocity_x  84.77  84.77 
QRAINf48  85.36  97.37  velocity_y  84.77  84.77 
QRAINf48.log10  84.93  84.56  velocity_z  84.77  84.77 
Hurricane avg.  85.01  86.96  Nyx avg.  85.58  85.98 
TABLE IX: Value distributions of representative fields.

CLOUDf48
min      1%       25%      50%      75%      99%      max      range
0.00e+0  0.00e+0  0.00e+0  0.00e+0  0.00e+0  2.53e-4  2.05e-3  2.05e-3
2.05e-7: 89.20% in , and 89.20% in
2.05e-8: 88.50% in , and 88.50% in

QSNOWf48
min      1%       25%       50%      75%      99%      max      range
0.00e+0  0.00e+0  1.11e-10  1.96e-9  6.34e-9  6.01e-5  8.56e-4  8.56e-4
8.56e-8: 88.90% in , and 88.90% in
8.56e-9: 80.90% in , and 80.90% in

baryon_density
min      1%       25%      50%      75%      99%      max      range
5.80e-2  1.37e-1  3.22e-1  5.06e-1  8.75e-1  7.42e+0  1.16e+5  1.16e+5
1.16e+1: 99.50% in , and 99.50% in
1.16e+0: 83.30% in , and 84.40% in
The reason is that, according to §III-A1, cuSZ sets all the values in the padding layer to 0 and uses these zeros to predict the top-left data points, resulting in better prediction on the tested datasets, especially for fields with large value ranges and a large majority of values close to zero (such as CLOUDf48, QSNOWf48, and baryon_density, as shown in TABLE IX). However, SZ-1.4's prediction highly depends on the first data point's value, so it may cause low prediction accuracy when the first data point deviates significantly from most of the other points. Accordingly, cuSZ and SZ-1.4 have similar PSNRs on the datasets represented in logarithmic scale (as shown in TABLE VIII).
V Related Work
V-A GPU-Accelerated Scientific Lossy Compression
Scientific data compression has been studied for many years to reduce storage and I/O overhead. It includes two main categories: lossless compression and lossy compression. Lossless compressors for scientific datasets, such as FPC [fpc] and FPZIP [lindstrom2006fast], ensure that the decompressed data are unchanged, but they provide only a limited compression ratio because of the significant randomness of the trailing mantissa bits of HPC floating-point data. According to a recent study [son2014data], the compression ratio of lossless compressors for scientific datasets is generally up to 2:1, which is much lower than the user-desired ratio for HPC applications.
Error-bounded lossy compression significantly reduces the size of scientific data while maintaining desired data characteristics. Traditional lossy compressors (such as JPEG [jpeg]) are designed for image and visualization purposes; however, they are difficult to apply to scientific datasets because of scientists' specific data-fidelity requirements. Recently, error-bounded lossy compressors (such as SZ [sz17] and ZFP [zfp]) have been developed for scientific datasets. Such compressors provide strict error controls according to user requirements. Both SZ and ZFP, for example, provide an absolute error bound in their CPU versions.
Different from SZ's prediction-based compression algorithm, ZFP's algorithm is based on a block transform. It first splits the whole dataset into many small blocks. It then compresses the data in each block separately in four main steps: exponent alignment, a customized near-orthogonal transform, fixed-point integer conversion, and bitplane-based embedded coding. A truncation is performed based on the user-set bitrate. Recently, the ZFP team released their GPU-supported CUDA version, called cuZFP [cuZFP]. cuZFP provides much higher throughputs for compression and decompression than the CPU version does [jin2020understanding]. However, the current cuZFP version supports only the fixed-rate mode, which significantly limits its adoption in practice.
V-B Huffman Coding on GPU
During the Huffman coding process, a specific method is used to determine the bit representation for each symbol, resulting in variable-length prefix codes. The set of these prefix codes makes up the codebook, with each prefix code determined by the symbol's frequency in the data. This codebook is then used to replace each input symbol with its corresponding prefix code. Previous studies have shown that Huffman coding achieves better performance in parallel on a GPU than serially on a CPU. In general, parallel Huffman coding obtains each codeword from a lookup table (generated by a Huffman tree) and concatenates the codewords together. However, a severe performance issue arises when different threads write codewords with different lengths, which results in warp divergence on GPUs [xiang2014warp]. Existing methods differ most in how they concatenate codewords.
Fuentes-Alventosa et al. [fuentes2014cuvle] proposed a GPU implementation of Huffman coding using CUDA with a given table of variable-length codes, which improves the performance by more than 20× compared with a serial CPU implementation. Rahmani et al. [Rahmani_Topal_Akinlar_2014] proposed a CUDA implementation of Huffman coding based on serially constructing the Huffman codeword tree and generating the byte stream in parallel, which can achieve up to 22× speedup compared with a serial CPU implementation, without any constraint on the maximum codeword length or data entropy. Lal et al. [Lal_Lucas_Juurlink_2017] proposed a Huffman-coding-based memory compression for GPUs (called E²MC) based on a probability estimation of symbols. It uses an intermediate buffer to reduce the required memory bandwidth. In order to place a codeword into the correct memory location, E²MC extends the codeword to the size of the buffer length and uses a barrel shifter to write the codeword to the correct location. Once shifted, the codeword is bitwise ORed with the intermediate buffer, and the write location is advanced by the codeword length.

VI Conclusion and Future Work
In this work, we propose cuSZ, a high-performance GPU-based lossy compressor for NVIDIA GPU architectures that effectively improves the compression throughput of SZ compared with the production version on CPUs. We propose a dual-quantization scheme to completely remove the strong data dependency in SZ's prediction-quantization step and implement an efficient customized Huffman coding. We also propose a series of techniques to optimize the performance of cuSZ, including fine-tuning the chunk size, adaptively selecting the Huffman code representation, and reusing memory. Experiments on five real-world HPC simulation datasets show that our proposed cuSZ improves the compression throughput by 242.9× to 370.1× over the serial CPU version and by 11.0× to 13.1× over the parallel CPU version. Compared with another state-of-the-art GPU-supported lossy compressor, cuSZ improves the compression ratio by 2.41× to 3.48×
with reasonable data distortion on the tested datasets. We plan to further optimize the performance of decompression, implement other data-prediction methods such as a linear-regression-based predictor, and evaluate the performance improvements of parallel I/O with cuSZ.

Acknowledgments
This research was supported by the Exascale Computing Project (ECP), Project Number 17-SC-20-SC, a collaborative effort of two DOE organizations (the Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, to support the nation's exascale computing imperative. The material was supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357. This work was also supported by the National Science Foundation under Grants CCF-1619253, OAC-1948447, OAC-2034169, and OAC-2003624. We acknowledge the computing resources provided by the University of Alabama.
II SZ Background
Many scientific applications require strict error-bounded control when using lossy compression, in order to achieve accurate post-analysis and visualization for scientific discovery, as well as a high compression ratio. SZ [sz16, sz17] is a prediction-based lossy compression framework designed for scientific data that strictly controls the global upper bound of the compression error. Given a user-set error bound eb, SZ guarantees |d_i − d'_i| ≤ eb, where d_i and d'_i are the original value and the decompressed value, respectively. SZ's algorithm involves five key steps: preprocessing, data prediction, linear-scaling quantization, customized variable-length encoding, and optional lossless compression, e.g., gzip [gzip] and Zstd [zstd].


1) Preprocessing: SZ performs a preprocessing step, such as linearization in version 1.0 or a logarithmic transform for the pointwise relative error bound in version 2.0 [xincluster18].

2) Data Prediction: SZ predicts the value of each data point by a data-fitting predictor, e.g., a Lorenzo predictor [ibarria2003out], based on its neighboring values. In order to guarantee that the compression error is always within the user-set error bound, the predicted values must be exactly the same between the compression and decompression procedures. To this end, the neighbor values used in the prediction have to be the decompressed values instead of the original values.

3) Linear-Scaling Quantization: SZ computes the difference between the predicted value and the original value for each data point and performs a linear-scaling quantization [sz17] to convert the difference to an integer based on the user-set error bound.

4) Customized Variable-Length Encoding: SZ adopts a customized Huffman coding algorithm to reduce the data size significantly, because the integer codes generated by the linear-scaling quantization are likely distributed unevenly, especially when the data are mostly predicted accurately.

5) Lossless Compression: The last step optionally further compresses the encoded data by a lossless compressor such as Zstd [zstd], which may significantly reduce the data size due to potential repeated patterns in the bitstream.
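The linear-scaling quantization in step 3) and its decompression-side reconstruction can be sketched as follows (a minimal Python sketch; the bin radius of 512 and the unpredictable-value handling are illustrative, not SZ's exact configuration):

```python
def quantize(pred_err, eb, radius=512):
    """Map a prediction error to an integer code so that the reconstruction
    error stays within eb (bin width 2*eb); codes outside the radius are
    flagged as unpredictable (None) and would be stored verbatim."""
    code = round(pred_err / (2 * eb))
    return code if abs(code) < radius else None

def reconstruct(pred, code, eb):
    """Decompression side: recover the value from the predicted value and
    the saved quantization code."""
    return pred + code * (2 * eb)
```

Because the bin width is 2·eb and the error is rounded to the nearest bin center, the reconstructed value deviates from the original by at most eb.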
In this work, we focus mainly on the SZ compressor, because much prior work [foster2017computing, di2018efficient, liang2018error, pastri, understandcompressionipdps18, jin2019deepsz, liang2019significantly, zhao2020significantly] has verified that SZ yields the best compression quality among prediction-based compressors. However, it is nontrivial to port SZ to GPUs because of the strict constraints in its compression design. For instance, the data used in the prediction must be updated with decompressed values, such that the data prediction in the SZ compressor [sz16, sz17] needs to be performed point by point in sequential order. This requirement introduces a loop-carried read-after-write (RAW) dependency during compression (discussed in §III-A2), making SZ hard to parallelize.
We mainly focus on SZ-1.4 instead of SZ-2.0, because the 2.0 model is particularly designed for low-precision use cases with visualization goals, in which the compression ratio can reach several hundred while the reconstructed data often have large distortions. Recent studies [usecaseFranck], however, demonstrate that scientists often require relatively high precision (i.e., a low error bound) for their sophisticated post-analysis beyond visualization purposes. In this situation (with relatively low error bounds), SZ-2.0 has compression qualities very similar (or, in some cases, even slightly worse) to those of SZ-1.4, as demonstrated in [liang2018error]. Accordingly, our design for the GPU-accelerated SZ lossy compression is based on SZ-1.4 and takes advantage of both algorithmic and GPU hardware characteristics. Moreover, the current CPU version of SZ does not support SIMD vectorization and has no specific optimization of arithmetic performance. Therefore, the CPU baseline used in our evaluation is based on the non-vectorized single-core and multicore implementations.
III Design Methodology of cuSZ
In this section, we present our novel lossy compression design, cuSZ, for CUDA architectures based on the SZ model. A system overview of cuSZ is shown in Fig. 1. We apply different coarse- and fine-grained parallelization techniques to each sub-procedure in compression and decompression. Specifically, we first employ a data-chunking technique to exploit coarse-grained data parallelism. The chunking technique is used throughout the whole cuSZ design, including the lossless (steps 2 and 3) and lossy (steps 1, 4, and 5) procedures in both compression and decompression. We then deeply analyze the RAW data dependency in SZ and propose a novel two-phase prediction-quantization approach, namely dual-quantization, which totally eliminates the data dependency in the prediction-quantization step. Furthermore, we provide an in-depth breakdown analysis of Huffman coding and develop an efficient Huffman coding on GPUs with multiple optimizations. A summary of our parallelization techniques is shown in TABLE I.
TABLE I: Summary of our parallelization techniques for each sub-procedure.

                                       sequential  coarse-grained  fine-grained  atomic
compression:
  dual-quantization
  histogram
  build Huffman tree
  canonize codebook
  Huffman encode (fixed-length)
  deflate (fixed- to variable-length)
decompression:
  inflate (Huffman decode)
  reversed dual-quantization
III-A Parallelizing Prediction-Quantization in Compression
In this section, we discuss our proposed optimization techniques to parallelize SZ's prediction-quantization procedure on GPU architectures. We first chunk the original data to gain coarse-grained parallelization, and then we assign a thread to each data point for fine-grained in-chunk parallel computation.
III-A1 Chunking and Padding
Fig. 2 illustrates our chunking and padding technique. For each chunked data block, we assign a thread to each data point (i.e., fine-grained parallelism). To avoid complex modifications to the prediction function after chunking, we add a padding layer to each block in the prediction-quantization step. We set all the values in the padding layer to 0 such that they do not affect the predicted values of the points neighboring the padding layer, as shown in Fig. 2. We note that in the original SZ, the uppermost and leftmost points (denoted as the "outer layer," shaded in Fig. 2) are saved directly as unpredictable data. In our chunking version, however, directly storing these points for each block would significantly degrade the compression ratio. Therefore, we apply prediction to the outer layer instead, such that every point in the block is consistently processed by the predictor, avoiding thread/warp divergence. Moreover, since we initialize the padding layer with 0s, the prediction for each outer-layer point falls back to the 1D first-order Lorenzo predictor, as shown in Fig. 2. Based on our empirical results, we adopt blocks of 16×16 for 2D data and 8×8×8 for 3D data, with the 1D block size chosen analogously.

III-A2 Dual-Quantization Scheme
In the following discussion, we use superscript circle (∘) and bullet (•) to denote quantities in the compression and decompression procedures, respectively, and star (*) to denote all the values related to data reconstruction during compression. The subscript i represents the i-th iteration.
Read-After-Write in SZ
In the original SZ algorithm, all the data points need to go through prediction, quantization, and in situ reconstruction iteratively, which causes an intrinsic read-after-write (RAW) dependency (as illustrated in Fig. 3).
We describe the loop-carried RAW dependency issue in detail below. For any data point d_i at the i-th iteration in SZ compression, given a predicted value p_i, the prediction error (i.e., e_i = d_i − p_i) is converted to an integer and a corresponding quantization code q_i based on the user-set error bound eb. Then, the reconstructed prediction error e*_i and the reconstructed value d*_i are generated by using q_i, eb, and p_i. After that, d*_i is written back to replace d_i. This procedure ensures that d*_i is equivalent to the reconstructed d•_i during decompression (as shown in Fig. 3); however, the (i+1)-th iteration must wait until the update of d*_i completes at the end of the i-th iteration, which incurs a loop-carried data dependency. Also note that d*_i is written back in the last step of the current iteration and its written value is used at the beginning of the next iteration; therefore, two consecutive iterations cannot overlap. Hence, under the original design of the prediction-quantization in SZ, it is infeasible to effectively exploit fine-grained parallelism and efficiently utilize SIMT on GPUs. We present the original SZ's prediction-quantization procedure in Algorithm 1 in detail.
Proposed Dual-Quantization Approach
To eliminate the RAW dependency, we propose a dual-quantization scheme by modifying the data representation during the prediction-quantization procedure. Our dual-quantization (abbreviated as dual-quant) consists of two steps: prequantization and postquantization.
Given a dataset D = {d_1, d_2, …} of arbitrary dimension, we first perform a quantization based on the user-set error bound eb and convert it to a new dataset
  D~ = {d~_1, d~_2, …},  with d~_i = 2·eb · round(d_i / (2·eb)),
where any d~_i is strictly a multiple of 2·eb. We call this step prequantization (abbreviated as prequant). In order to avoid overflow, d~_i is stored by using a floating-point data type. We note that the error introduced by prequant (defined as posterror) is strictly bounded by the user-set error bound; that is, |d~_i − d_i| ≤ eb.
After the prequantization, we can calculate each predicted value based on the surrounding prequantized values (denoted by N(d~_i)) and the predictor (denoted by f) as p~_i = f(N(d~_i)).
The second step, called postquantization (abbreviated as postquant), is the counterpart of the linear-scaling quantization in the original SZ. postquant computes the difference between the predicted value and the prequantized value, δ_i = d~_i − p~_i, and encodes it into the quantization code q_i = δ_i / (2·eb). Note that δ_i is quantitatively equivalent to q_i, but they are in different representations: δ_i is a floating-point value to avoid overflow, while q_i is an integer, which is used in the subsequent lossless coding (e.g., Huffman coding).
In the following text, we explain in detail why the dual-quant method effectively eliminates the RAW dependency. Conceptually, similar to the original SZ, we can construct d~*_i and p~*_i during the compression, as shown in Fig. 3. In fact, for the i-th iteration, d~*_i is strictly equal to d~_i, because casting the quantization code q_i back to δ_i is the exact reverse of casting δ_i to q_i. Similarly, p~*_i and p~_i are also strictly equivalent. Consequently, unlike the original SZ, which must write back d*_i to update d_i before the (i+1)-th iteration, d~*_i = d~_i always holds in our proposed dual-quant approach. As illustrated in Fig. 3, after prequant, all d~_i are dependency-free for postquant. By eliminating the loop-carried RAW dependency (marked as arrows in Fig. 3), we can effectively parallelize the dual-quant procedure by performing fine-grained (per-point) parallel computation, which is commonly seen in image processing [zhang2010image]. We illustrate the detailed dual-quant procedure in Algorithm 2.
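The dual-quant procedure can be sketched serially as follows (a minimal Python sketch with a 1D first-order Lorenzo predictor, not cuSZ's CUDA implementation; on the GPU each loop iteration maps to one thread, which is valid precisely because the deltas depend only on prequantized inputs):

```python
def prequant(data, eb):
    """Prequantization: snap every value to a multiple of 2*eb, so the
    introduced error is already bounded by eb; returns the integer
    multiples (d~ = code * 2*eb)."""
    return [round(d / (2 * eb)) for d in data]

def postquant_1d(q):
    """Postquantization with a 1D first-order Lorenzo predictor on the
    prequantized integers: each delta depends only on prequantized
    inputs, so all points can be processed in parallel (no RAW)."""
    return [q[i] - (q[i - 1] if i > 0 else 0) for i in range(len(q))]

def reconstruct_1d(deltas, eb):
    """Decompression side: prefix-sum the deltas (exact integer
    arithmetic), then scale back to floating point."""
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc * 2 * eb)
    return out
```

Note that Python's `round` uses round-half-to-even, which differs in tie cases from a round-half-away rule but still keeps the prequant error within eb.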
Lorenzo Predictor with Binomial Coefficients
According to the prior work by Tao et al. [sz17], the generalized Lorenzo predictor computes each predicted value as an integer-weighted sum of neighboring prequantized values, where the weights are signed binomial coefficients. For example, the 1D first-order predictor is $p_i = \tilde{d}_{i-1}$, and the 2D first-order predictor is $p_{i,j} = \tilde{d}_{i-1,j} + \tilde{d}_{i,j-1} - \tilde{d}_{i-1,j-1}$, as illustrated in Fig. 1. We note that all the coefficients in the formula of the predictor are integers; thus, the prediction consists of mathematically integer-based operations (additions and multiplications with unit weights). This ensures that no division is needed and that the data reconstruction based on dual-quant is precise and robust with respect to machine epsilon; the original SZ, by contrast, uses floating-point operations and can suffer from underflow. Moreover, the predicted values, which are exact multiples of $2\cdot eb$, are completely corrected by the saved quantization codes in decompression, so the final compression error is still bounded by $eb$.
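As a concrete illustration, the 2D first-order Lorenzo prediction over prequantized data uses only the integer coefficients (+1, +1, -1). The following is a hedged sketch with our own function names, assuming out-of-range neighbors are treated as zero:

```python
import numpy as np

def lorenzo_codes_2d(pq, eb):
    """First-order 2D Lorenzo prediction on pre-quantized data pq.

    p[i,j] = pq[i-1,j] + pq[i,j-1] - pq[i-1,j-1]; the coefficients
    (+1, +1, -1) are integers, so prediction needs only additions.
    Out-of-bounds neighbors are treated as 0 via zero padding.
    """
    two_eb = 2.0 * eb
    padded = np.zeros((pq.shape[0] + 1, pq.shape[1] + 1))
    padded[1:, 1:] = pq
    pred = padded[:-1, 1:] + padded[1:, :-1] - padded[:-1, :-1]
    return np.round((pq - pred) / two_eb).astype(np.int64)

def lorenzo_decode_2d(codes, eb):
    """Sequential reverse: each point needs its reconstructed neighbors,
    which is the RAW dependency that limits decompression parallelism."""
    two_eb = 2.0 * eb
    h, w = codes.shape
    out = np.zeros((h + 1, w + 1))
    for i in range(1, h + 1):
        for j in range(1, w + 1):
            pred = out[i - 1, j] + out[i, j - 1] - out[i - 1, j - 1]
            out[i, j] = pred + codes[i - 1, j - 1] * two_eb
    return out[1:, 1:]
```

Since the input is already a grid of multiples of $2\cdot eb$, the round trip through the codes reproduces the prequantized values exactly, matching the robustness claim above.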
III-B Efficient Customized Huffman Coding
To efficiently compress the quantization codes generated by dual-quant, we develop an efficient customized Huffman coding for SZ on GPUs. Specifically, Huffman coding consists of the following subprocedures: (1) calculate the statistical frequency for each quantization bin (as a symbol); (2) build the Huffman tree based on the frequencies and generate a base codebook along with each code's bitwidth; (3) transform the base codebook to the canonical Huffman codebook (called canonization); and (4) encode in parallel according to the codebook and concatenate the Huffman codes into a bitstream (called deflating). Huffman decoding is composed of retrieving the reverse codebook and decoding accordingly.
Note that the fourth subprocedure (encoding) can be further decomposed into two steps for fine-grained optimization. Codebook-based encoding is basically a memory copy and can be fully parallelized at fine granularity, whereas deflating can be performed only sequentially (except for the blockwise parallelization discussed in §III-A1) because of its atomic operations. We discuss Huffman coding on GPUs step by step as follows.
III-B1 Histogram for Quantization Bins
The first step of Huffman coding is to build a histogram representing the frequency of each quantization bin from the data prediction step. The GPU histogramming algorithm that we use is derived from the algorithm proposed by Gómez-Luna et al. [GmezLuna2012AnOA]. This algorithm minimizes conflicts in updating the histogram bins by replicating the histogram for each thread block and storing the replicas in shared memory. Where possible, conflicts are further reduced by replicating the histogram such that each block has access to multiple copies. All threads inside a block read a specified partition of the quantization codes and use atomic operations to update a specific replicated histogram. As each block finishes its portion of the predicted data, the replicated histograms are combined via a parallel reduction into a single global histogram, which is used to construct the final codebook in Huffman coding.
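The privatized-histogram idea can be mimicked on the host as follows (a simplified sketch in which each simulated "thread block" owns a private copy that is reduced at the end; the function name and block count are ours):

```python
import numpy as np

def blocked_histogram(codes, num_bins, num_blocks=4):
    """Sketch of the privatized-histogram scheme: each 'thread block'
    accumulates into its own private copy (mimicking shared memory),
    and the copies are reduced into one global histogram at the end."""
    private = np.zeros((num_blocks, num_bins), dtype=np.int64)
    chunks = np.array_split(codes, num_blocks)
    for b, chunk in enumerate(chunks):   # blocks run concurrently on a GPU
        np.add.at(private[b], chunk, 1)  # atomic adds within one block's copy
    return private.sum(axis=0)           # the final parallel reduction
```

The point of the replication is that atomics only contend within a block's private copy rather than on one global array, which is why the reduction step at the end is cheap by comparison.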
III-B2 Constructing Huffman Codebook
In order to build the optimal Huffman tree, the local symbol frequencies need to be aggregated to generate the global symbol frequencies for the whole dataset. By utilizing the aggregated frequencies, we build a codebook according to the Huffman tree for encoding. Note that the number of symbols, namely, the number of quantization bins, is limited (generally no greater than 65,536) and is much smaller than the data size (generally, millions of data points or more). This leads to a much lower number of nodes in the Huffman tree compared with the data size, such that the time complexity of building a Huffman tree is considered low. We note that building the Huffman tree sequentially on the CPU would benefit from high CPU frequency and low memory-access latency. However, it requires a CPU-to-GPU/GPU-to-CPU transfer of the frequencies/codebook before/after building the tree, and communicating these two small messages would incur nonnegligible overheads. Therefore, we use one GPU thread to build the Huffman tree sequentially to avoid such overheads.
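For reference, the sequential tree build performed by that single GPU thread is logically equivalent to the classic heap-based construction, sketched below (our own simplified version, which returns only the per-symbol bitwidths, since that is all the canonical stage needs):

```python
import heapq
from itertools import count

def huffman_bitwidths(freqs):
    """Sequential Huffman-tree build (as one GPU thread would do it):
    returns a dict mapping symbol -> codeword bitwidth. Symbols with
    zero frequency receive no code."""
    tiebreak = count()  # unique counter so heap never compares dicts
    heap = [(f, next(tiebreak), {s: 0}) for s, f in enumerate(freqs) if f > 0]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate single-symbol case: one 1-bit code
        (_, _, depths), = heap
        return {s: 1 for s in depths}
    while len(heap) > 1:
        fa, _, da = heapq.heappop(heap)   # two least-frequent subtrees
        fb, _, db = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**da, **db}.items()}
        heapq.heappush(heap, (fa + fb, next(tiebreak), merged))
    return heap[0][2]
```

Because the heap holds at most one node per quantization bin, the cost depends only on the (small) number of bins, never on the data size, which is the argument made above for keeping this step sequential.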
We propose an adaptive codeword representation to enhance the utilization of memory bandwidth, which in turn improves the Huffman encoding performance. We illustrate the organization of the codebook in Fig. 4. The codebook is organized in units of unsigned integers, each containing a variable-length Huffman codeword starting from the LSB (the rightmost or least significant bits) and its bitwidth starting from the MSB (the leftmost or most significant bits). According to the pessimistic estimation of the maximum bitwidth of an optimal Huffman codeword [abu2000maximal], one is supposed to use uint64_t to hold each bitwidth-codeword representation. For example, the maximum bitwidth could be 33 bits for the CLDHGH field from the CESM-ATM dataset in the worst case. However, we note that using 32-bit unsigned integers (i.e., uint32_t) to represent a bitwidth-codeword tuple can significantly improve the Huffman encoding and decoding performance compared with using 64-bit unsigned integers (i.e., uint64_t), because of higher GPU memory bandwidth utilization. Thus, we propose to dynamically select the uint32_t or uint64_t representation for the Huffman bitwidth-codeword tuple based on the practical maximum bitwidth instead of the pessimistic estimation. We show the performance evaluation with different representations in §IV.
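The bitwidth-from-MSB, codeword-from-LSB layout and the dynamic 32/64-bit selection can be sketched as follows (the 8-bit width field and the selection threshold are assumed layouts for illustration, not necessarily cuSZ's exact choices):

```python
def pack_codeword(code, width, unit_bits=32):
    """Pack a (bitwidth, codeword) tuple into one unsigned integer:
    the codeword occupies the LSBs and the bitwidth the MSBs.
    (An 8-bit width field is an assumed layout for this sketch.)"""
    assert width <= unit_bits - 8  # codeword must fit below the width field
    return (width << (unit_bits - 8)) | code

def unpack_codeword(packed, unit_bits=32):
    """Recover (bitwidth, codeword) from the packed representation."""
    width = packed >> (unit_bits - 8)
    code = packed & ((1 << (unit_bits - 8)) - 1)
    return width, code

def pick_unit_bits(max_width):
    """Dynamically select the uint32 vs. uint64 representation based on
    the practical maximum bitwidth (threshold follows the 8-bit field)."""
    return 32 if max_width <= 24 else 64
```

With this scheme, a dataset whose longest codeword is short enough is encoded with 32-bit units, halving the bytes moved per symbol relative to the pessimistic 64-bit layout.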
The theoretical time complexity is $O(n\log n)$ for building a Huffman tree and $O(n)$ for traversing the tree, where $n$ is the number of symbols (quantization bins). Our experiments show that the real execution time of building a Huffman tree is consistent with the theoretical time complexity analysis (see TABLE III). On the other hand, the number of symbols is determined by the smoothness of the dataset and the user-desired error bound (1,024 bins by default). For example, with a relatively large error bound such as a value-range-based relative error bound (i.e., the error bound relative to the value range of the dataset, denoted by valrel), we observe that most of the symbols are concentrated near the center of the codebook. As the error bound decreases, the symbols become more evenly distributed. Thus, determining a suitable number of quantization bins is important for high performance in constructing a codebook.
III-B3 Canonizing Codebook
A canonical Huffman codebook [barnett2003canonical] holds the same bitwidth for each codeword as the original Huffman codebook (i.e., the base codebook), while its bijective mapping (between quantization code and Huffman codeword) and variable codewords make the memory layout more efficiently organized. The time complexity of sequentially building a canonical codebook from the base codebook is $O(n)$, where $n$ is the number of symbols (i.e., the number of quantization bins) and is sufficiently small compared with the data size. By using a canonical codebook, we can (i) decode without the Huffman tree, (ii) efficiently cache the reverse codebook for high decoding throughput, and (iii) maintain exactly the same compression ratio as with the base Huffman codebook.
The process of building the canonical codebook can be decomposed into the following subprocedures: (1) linearly scanning the base codebook (sequentially $O(n)$), which is parallelized at fine granularity with atomic operations; (2) loosely radix-sorting the codewords by bitwidth (sequentially $O(n)$), which cannot be parallelized because of its intrinsic RAW dependency; and (3) building the reverse codebook (sequentially $O(n)$), which is parallelized with fine-grained parallelism.
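A canonical codebook can be derived from the per-symbol bitwidths alone; the standard construction (a sketch, not cuSZ's GPU kernel) assigns consecutive codewords in (bitwidth, symbol) order:

```python
def canonical_codebook(bitwidths):
    """Build a canonical Huffman codebook from per-symbol bitwidths:
    sort symbols by (bitwidth, symbol), then assign consecutive
    codewords, left-shifting whenever the bitwidth grows. Bitwidths
    are preserved, so the compression ratio is unchanged."""
    order = sorted(bitwidths, key=lambda s: (bitwidths[s], s))
    book, code, prev_w = {}, 0, 0
    for s in order:
        w = bitwidths[s]
        code <<= (w - prev_w)  # grow the running codeword to the new width
        book[s] = (code, w)    # symbol -> (codeword, bitwidth)
        code += 1
        prev_w = w
    return book
```

Because only bitwidths matter, a decoder can rebuild the identical mapping from a compact description, which is what enables decoding without the Huffman tree and the efficient caching of the reverse codebook.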
It is intuitive to separate the functionalities of the aforementioned subprocedures and implement them as independent CUDA kernels with different configurations (i.e., threads per thread block, thread blocks per grid). Based on our profiling results on an NVIDIA V100 GPU, however, launching a CUDA kernel usually takes about 60 microseconds (μs), or about 200 μs for three kernels, as measured over 11 kernel launches in total. Moreover, any two consecutive subprocedures require an additional expensive synchronization (i.e., cudaDeviceSynchronize). Our experiment indicates that the kernel time of building a canonical codebook is itself short (about 200 μs); thus, we integrate all three subprocedures into one single kernel.
We note that this single kernel must be launched with multiple thread blocks for two reasons. On the one hand, multiple thread blocks provide high scalability for the parallel reads/writes in subprocedures (1) and (3). On the other hand, unlike histogramming, which saves only the frequencies in shared memory, this kernel requires saving both the codebook and its footprint, which may exceed the maximum allowable capacity of shared memory in a single thread block (e.g., 96 KB on a V100). Since shared memory is visible only to its own thread block, however, synchronizing codewords across different thread blocks would be impossible; thus, we use global memory instead of shared memory to save the codebook. As a result, we have to set a lightweight in-grid barrier between any two subprocedures in the kernel. Specifically, we employ the CUDA Cooperative Groups API [harris2017cooperative] to achieve in-grid synchronization. We note that launching a kernel with Cooperative Groups takes only about 30 μs on a V100, which significantly reduces the overhead compared with launching multiple kernels.
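The role of the in-grid barrier can be illustrated with a host-side analogy, with Python threads standing in for thread blocks (this mimics only the synchronization pattern, not CUDA Cooperative Groups itself; all names are ours):

```python
import threading

def run_blocks(num_blocks, phase1, phase2):
    """Mimic the in-grid barrier: every 'thread block' must finish
    phase 1 (e.g., the scan) before any block starts phase 2 (e.g.,
    building the reverse codebook), since phase 2 reads what phase 1
    wrote to globally visible memory."""
    barrier = threading.Barrier(num_blocks)
    results = []
    def worker(b):
        phase1(b)
        barrier.wait()  # the grid-wide synchronization point
        results.append(phase2(b))
    threads = [threading.Thread(target=worker, args=(b,))
               for b in range(num_blocks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Fusing the phases into one launch with an internal barrier trades several kernel-launch and device-synchronization costs for one cheap in-grid wait, which is exactly the trade-off described above.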
III-B4 Encoding and Deflating
We design an efficient solution to perform the encoding by GPU threads in parallel. Encoding involves looking up a symbol in the codebook and performing a memory copy. After we adaptively select a 32-/64-bit unsigned integer to represent a Huffman codeword with its bitwidth, the encoding step is sufficiently parallelized. To generate the dense bitstream of Huffman codes within each data block, we conduct deflating to concatenate the Huffman codes and remove the unnecessary zero bits according to the saved bitwidths.
Since the deflated code is organized sequentially, we apply the coarse-grained chunk-wise parallelization technique discussed in §III-A1. In particular, each data chunk for compression and decompression is mapped to one GPU thread. Note that the chunk size for deflating is not necessarily the same as the global chunk size, and it does not depend on the dimensionality. We optimize the deflating chunk size by evaluating the performance with different sizes (to be shown in §IV-B1). We also employ a memory-reuse technique to reduce the GPU memory footprint in deflating. Specifically, we reuse the memory space of the Huffman codes for the deflated bitstream, because the latter uses significantly less memory space and there is no conflict when writing the deflated bitstream to the designated location.
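Per-chunk deflating and its sequential counterpart (inflating) can be sketched at the bit level as follows (our own helper names; cuSZ operates on the packed bitwidth-codeword units rather than Python integers):

```python
def deflate_chunk(codes, book):
    """Concatenate variable-length codewords for one chunk into a dense
    byte stream (the per-chunk 'deflating' step); each GPU thread would
    handle one such chunk independently. book: symbol -> (codeword, width)."""
    buf, nbits = 0, 0
    for q in codes:
        cw, w = book[q]
        buf = (buf << w) | cw  # append codeword, dropping unused zero bits
        nbits += w
    pad = (-nbits) % 8
    buf <<= pad                # pad the tail to a byte boundary
    return buf.to_bytes((nbits + pad) // 8, "big"), nbits

def inflate_chunk(stream, nbits, reverse_book):
    """Sequential decode: a codeword's boundary is only known after the
    previous symbol is decoded, the RAW-like dependency noted in the text."""
    bits = int.from_bytes(stream, "big") >> (8 * len(stream) - nbits)
    out, acc, w = [], 0, 0
    for i in range(nbits - 1, -1, -1):
        acc = (acc << 1) | ((bits >> i) & 1)
        w += 1
        if (acc, w) in reverse_book:   # prefix-free: first match is the symbol
            out.append(reverse_book[(acc, w)])
            acc, w = 0, 0
    return out
```

Encoding each symbol is an independent table lookup, but the concatenation above is inherently serial within a chunk, which is why deflating (and likewise inflating) is parallelized only across chunks.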
III-C Decompression
cuSZ's decompression consists of two steps: Huffman decoding (i.e., inflating the densely concatenated Huffman bitstream) and reversed dual-quant. In inflating, we first use the previously built reverse codebook to retrieve the quantization codes from the deflated Huffman bitstream. Then, based on the retrieved quantization codes, we reconstruct the floating-point data values. Note that only coarse-grained chunking can be applied to decompression, and its chunk size is determined during compression. The reason is that both steps have a RAW dependency. In fact, retrieving the variable-length codes follows the same pattern as a loop-carried RAW dependency, and in the reversed dual-quant procedure, each data point cannot be decompressed until its preceding values are fully reconstructed.
IV Experimental Evaluation
In this section, we present our experimental setup (including platform, baselines, and datasets) and our evaluation results.
IV-A Experimental Setup
Evaluation Platform
We conduct our experimental evaluation on the PantaRhei cluster [pantarhei]. We perform the experiments on an NVIDIA V100 GPU [nv100] from the cluster and compare with lossy compressors on two 20-core Intel Xeon Gold 6148 CPUs from the cluster. The GPU is connected to the host via a 16-lane PCIe 3.0 interconnect. We use NVIDIA CUDA 9.2 and its default profiler to measure the kernel time.
TABLE II. Tested datasets; all evaluated fields are of type fp32.
Comparison Baselines
We compare our cuSZ with two baselines: SZ-1.4.13.5 and cuZFP [cuZFP]. For SZ-1.4, we adopt the default setting: 16 bits for linear-scaling quantization (i.e., 1,024 quantization bins), best_compression mode, and best_speed mode for gzip, which leads to a good tradeoff between compression ratio and performance.
TABLE III. Time (in ms) to construct the codebook with different numbers of quantization bins.

#quant. bins  | 128  | 256  | 512  | 1024 | 2048  | 4096  | 8192
build tree    | 0.48 | 0.77 | 1.80 | 2.13 | 6.46  | 12.68 | 25.06
get codebook  | 0.20 | 1.14 | 2.36 | 2.69 | 7.09  | 14.43 | 25.65
total         | 0.68 | 2.16 | 4.16 | 4.81 | 13.55 | 27.10 | 50.71
Test Datasets
We conduct our evaluation and comparison based on five typical real-world HPC simulation datasets of each dimensionality from the Scientific Data Reduction Benchmarks suite [sdrbench]: 1D HACC cosmology particle simulation [hacc], 2D CESM-ATM climate simulation [cesmatm], 3D Hurricane ISABEL simulation [hurricane], 3D Nyx cosmology simulation [nyx], and 4D QMCPACK quantum Monte Carlo simulation [qmcpack]. They have been widely used in prior works [tao2018optimizing, liang2018error, xincluster18, usecaseFranck, liang2019improving] and are good representatives of production-level simulation datasets. TABLE II shows all 112 fields across these datasets (the QMCPACK dataset includes only one field but with two representations: raw data and preconditioned data). The data sizes for the five datasets are 6.3 GB, 2.0 GB, 1.9 GB, 3.0 GB, and 1.2 GB, respectively. Note that our evaluated HACC dataset is consistent with real-world scenarios that generate petabytes of data. For example, according to [hacc], a typical large-scale HACC simulation for cosmological surveys runs on 16,384 nodes, each with 128 million particles, and generates 5 PB over the whole simulation; the simulation contains 100 individual snapshots of roughly 3 GB per node. We evaluate a single snapshot for each dataset instead of all the snapshots, because the compressibility of the snapshots is usually strongly similar. Moreover, when a field is too large to fit in a single GPU's memory, cuSZ divides it into blocks and then compresses them block by block.
IV-B Evaluation Results and Analysis
In this section, we evaluate the compression performance and quality of cuSZ and compare it with CPU-SZ and cuZFP on the tested datasets.
IV-B1 Compression Performance
We first evaluate the performance of the dual-quant step of cuSZ. The average throughput of the dual-quant step on each tested dataset is shown in TABLE VII. Compared with the original serial CPU-SZ, the prediction-quantization throughput is improved by more than 1000× via our proposed dual-quant on the GPU. This improvement is because dual-quant entirely eliminates the RAW dependency and leads to fine-grained (per-point) parallel computation, which is significantly accelerated on the GPU.
We then evaluate the performance of our implemented Huffman coding step by step. First, we conduct the experiment of Huffman histogram computation and show its throughput performance. (All throughputs shown in the paper are measured based on the original data size and time.) Efficiently computing a histogram on a GPU is a challenging open problem because multiple threads need to write to the same memory locations simultaneously. Here, we present a method that, while still a bottleneck in the Huffman process, is a 2× improvement over a serial implementation.
Next, we perform the experiment of constructing the codebook with different numbers of quantization bins, as shown in TABLE III. We note that the execution times of building a Huffman tree and creating a codebook are consistent with our time complexity analyses in §III-B2. We use 1,024 quantization bins by default. Since the time overhead of constructing a codebook depends only on the number of quantization bins, it is almost fixed (for example, 4.81 ms) for the remaining experiments. We also note that a larger data size lowers the relative performance overhead of constructing a codebook, thus leading to higher overall performance.







TABLE IV. Encoding time and throughput with the 64-bit vs. 32-bit codeword representations (columns correspond to the HACC, CESM-ATM, Hurricane, Nyx, and QMCPACK datasets).

       |      | hacc    | cesm  | hurricane | nyx     | qmcpack
enc.64 | μs   | 4,274.3 | 97.1  | 385.8     | 2,044.7 | 2,401.4
       | GB/s | 250.9   | 255.1 | 251.7     | 251.1   | 251.1
enc.32 | μs   | 2,839.3 | 64.1  | 255.8     | 1,358.6 | 1,595.6
       | GB/s | 377.7   | 386.6 | 379.6     | 377.9   | 377.9
We also evaluate the performance of encoding and decoding based on the canonical codebook. To increase the memory bandwidth utilization, we adopt online selection of the Huffman codeword representation between uint32_t and uint64_t. TABLE IV illustrates that our encoding achieves about 250 GB/s with uint64_t and about 380 GB/s with uint32_t (the NVIDIA V100 GPU has a theoretical peak memory bandwidth of 900 GB/s), based on the test with all 111 fields under the error bound of 1e-4. Hence, we conclude that using uint32_t enables significantly higher performance than using uint64_t.
Because of the coarse-grained chunk-wise parallelization, the deflating performance is about 60 GB/s, which is lower than the encoding throughput of 380 GB/s. Consequently, the Huffman coding performance is bounded mainly by the deflating throughput.
To improve the deflating and inflating performance, we further evaluate different chunk sizes and identify the appropriate sizes for both deflating and inflating on the tested datasets, as shown in TABLE VI. Specifically, we evaluate a range of chunk sizes for each field (the exact range varies with the field size). We observe that using a total of around 2e4 concurrent threads consistently achieves the optimal throughput. Note that inflating must follow exactly the same data chunking strategy as deflating; thus, we need to select the same chunk size. Even under this constraint, our selected chunk sizes still achieve throughputs close to the peak, as illustrated in TABLE VI. Therefore, we conclude that near-optimal overall performance can be achieved by using a total of about 2e4 concurrent threads in practice.
Rate-distortion comparison between cuSZ and cuZFP:

          |              cuSZ             |            cuZFP
          | bitrate   | CR   | PSNR      | bitrate | CR  | PSNR
cesm-atm  | 3.08 bits | 10.4 | 85.3 dB   | 12 bits | 2.7 | 88.7 dB
hurricane | 3.45 bits | 9.3  | 87.0 dB   | 12 bits | 2.7 | 81.9 dB
nyx       | 2.49 bits | 12.8 | 86.0 dB   | 6 bits  | 5.3 | 85.1 dB
qmcpack   | 3.38 bits | 9.5  | 85.0 dB   | 8 bits  | 4.0 | 84.0 dB
TABLE VI. Deflating/inflating throughput (in GB/s) with different numbers of concurrent threads (each thread processes one chunk; chunk sizes are powers of two).

hacc (1071.8 MB, 280,953,867 fp32 values)
#threads | deflate | inflate
1.4e5    | 4.6     | 2.8
6.9e4    | 5.1     | 5.1
3.4e4    | 13.6    | 12.1
1.7e4    | 63.1    | 35.0
8.6e3    | 65.8    | 28.1
4.3e3    | 45.9    | 14.3

cesm (24.7 MB, 6,480,000 fp32 values)
#threads | deflate | inflate
1.0e5    | 11.3    | 25.0
5.1e4    | 15.5    | 37.8
2.5e4    | 67.1    | 41.6
1.3e4    | 55.6    | 30.7
6.3e3    | 48.2    | 19.6

hurricane (95.4 MB, 25,000,000 fp32 values)
#threads | deflate | inflate
9.8e4    | 5.1     | 11.0
4.9e4    | 10.2    | 9.4
2.4e4    | 64.6    | 34.2
1.2e4    | 57.3    | 27.7
6.1e3    | 50.7    | 17.8

nyx (512 MB, 134,217,728 fp32 values)
#threads | deflate | inflate
1.3e5    | 4.7     | 5.9
6.6e4    | 5.7     | 6.3
3.3e4    | 25.1    | 16.1
1.6e4    | 69.7    | 52.4
8.2e3    | 72.4    | 42.6
4.1e3    | 50.0    | 23.1

qmcpack (601.5 MB, 157,684,320 fp32 values)
#threads | deflate | inflate
1.5e5    | 4.7     | 5.1
7.7e4    | 5.2     | 6.2
3.8e4    | 12.9    | 11.1
1.9e4    | 72.7    | 40.3
9.6e3    | 75.9    | 29.0
4.8e3    | 56.0    | 16.1







TABLE VII (excerpt; throughputs in MB/s):

CPU-SZ | hacc      | 137.7 | 328.6 | … | 94.1
       | cesm-atm  | 105.0 | 459.1 | … | 85.5
       | hurricane | 93.8  | 504.0 | …