1 Introduction
An efficient scientific data compressor is increasingly critical to the success of today's scientific research because of the extremely large volumes of data produced by today's high-performance computing (HPC) applications. The Community Earth System Model (CESM) [1, 2], for instance, produces terabytes of data every day. In the Hardware/Hybrid Accelerated Cosmology Code (HACC) [3] (a well-known cosmology simulation code), the number of particles to simulate could reach up to 3.5 trillion, which may produce 60 petabytes of data to store. On the one hand, such large volumes of data cannot be stored even in the parallel file system (PFS) of a supercomputer, such as the Mira [4] supercomputer at Argonne, because it has only 20 petabytes of storage space. On the other hand, the I/O bandwidth may also become a serious bottleneck. The memory of extreme-scale systems continues to grow, with a factor of 5 or more expected for the next generation of systems compared with the current one (e.g., the Aurora supercomputer [5] has over 5 PB of total memory); however, although Burst Buffer technology [6] can relieve the I/O burden to some extent, PFS bandwidth is still developing relatively slowly compared with memory capacity and peak performance. Hence, storing application data to file systems for post-analysis will take much longer than in current systems.
Error-controlled lossy compressors for scientific data sets have been studied for years, because they not only significantly reduce data size but also keep the decompressed data valid to users. The existing lossy compressors, however, exhibit largely different compression qualities across data sets because of their different algorithms and the diverse features of scientific data. The atmosphere simulation (called ATM) in the CESM model, for example, has over 100 fields (i.e., variables), each of which may have largely different features. We note that different variables or data sets work better with different compression techniques. For instance, SZ [7, 8, 9] exhibits better compression quality than does ZFP [10, 11]
on some data sets, whereas ZFP is better on others. With the same level of data distortion—peak signal-to-noise ratio (PSNR), for instance—SZ exhibits a better compression ratio than does ZFP on 72.8% of the fields in the ATM simulation data, while ZFP wins on the remaining 27.2% of the fields. One key question is: Can we develop a lightweight online selection method that can estimate on the fly the best-fit compression technique for any data set, such that the overall compression quality can be improved significantly for that application?
In this work, we propose a novel online selection method for optimizing the error-controlled lossy compression quality of structured HPC scientific data in terms of rate-distortion; to our knowledge, this is the first such attempt. Lossy compression quality is assessed mainly by rate-distortion in the scientific data compression community [8, 10, 12]. However, designing an effective method that can select the best-fit compressor based on rate-distortion is challenging, because rate-distortion is not a single metric for a given data set; rather, it involves a series of compression cases with different data distortions and compression ratios. Hence, selecting the best-fit compressor based on rate-distortion by simply running SZ and ZFP once on sampled data points is impossible. Unlike Lu et al.'s work [11], which selects the best compressor based on a specific error bound and sampled data points, we have to model accurately the principles of the two state-of-the-art lossy compressors both theoretically and empirically. This is nontrivial because of the diverse data features and the multiple complicated compression stages involved in the two compressors. Moreover, we must ensure that our estimation algorithm has little computation cost, in order to keep a high overall execution performance for the in situ compression.
The contributions of this paper are as follows.

We conduct an in-depth analysis of the existing lossy compression techniques and divide the procedure of lossy compression into three stages: lossless transformation for energy compaction, lossy compression for data reduction, and lossless entropy encoding. This decomposition is a fundamental step for the compression-quality estimation.

We explore a series of efficient strategies to accurately predict the compression quality (such as compression ratio and data distortion) of the two leading lossy compressors (i.e., SZ and ZFP). Specifically, we derive formulas and approaches for accurate prediction of the PSNR and of the average number of bits used to represent a data value (i.e., the bit rate), based on an in-depth analysis of their compression principles across the three compression stages.

Based on our compression-quality estimation, we develop a novel online method to select the best-fit compressor between SZ and ZFP for each data set, leading to the best lossy compression results. We adopt rate-distortion as the selection criterion because it involves both compression ratio and data distortion and has been broadly used to assess compression quality in many domains [2, 10, 8].

We evaluate the performance and compression quality of our proposed solution on a parallel system with 1,024 cores. Experiments on structured data sets from real-world HPC simulations show that our solution can significantly improve the compression ratio with the same level of data distortion and comparable performance. The compression ratio can be improved by up to 70% because of the very high accuracy (around 99%) of our method in selecting the best compressor. With our solution, the overall performance in loading and storing data can be improved by 79% and 68%, respectively, on 1,024 cores, compared with the second-best evaluated approach.
The remainder of this paper is organized as follows. In Section 2, we discuss the related work in scientific data compression. In Section 3, we introduce the overall architecture of our proposed automatic online selection method. In Section 4, we analyze the three critical stages in detail based on existing lossy compressors. In Section 5, we discuss how to predict the compression quality for SZ and ZFP accurately. In Section 6, we present and analyze the experimental results. In Section 7, we provide concluding remarks and a brief discussion of future work.
2 Related Work
The issue of scientific data compression has been studied for years. Data compressors can be split into two categories: lossless compressors and lossy compressors.
Lossless compressors make sure that the reconstructed data set after decompression is exactly the same as the original data set. Such a constraint significantly limits their compression ratio on scientific data, whether for generic byte-stream compressors (such as Gzip [13] and bzip2 [14]) or for floating-point data compressors (such as FPC [15] and FPZIP [16]). The reason is that scientific data is composed mainly of floating-point values whose trailing mantissa bits can be too random to compress effectively [17].
Lossy compression techniques for scientific data sets generated by HPC applications also have been studied for years, and the existing state-of-the-art compressors include SZ [7, 8, 9], ZFP [10], ISABELA [18], FPZIP [16], SSEM [19], VAPOR [20], and NUMARCK [21]. Basically, their compression models can be summarized into two categories: the prediction-based model [8, 18, 16, 21, 22] and the transform-based model [10, 19]. A prediction-based compressor predicts data values for each data point and encodes the difference between every predicted value and its corresponding real value based on a quantization method. Typical examples are SZ [7, 8, 9], ISABELA [18], and FPZIP [16]. A block-transform-based compressor transforms the original data to another space where the majority of the generated data are very small (close to zero), such that they can be stored with a certain loss in terms of the user's required error bound. For instance, JPEG [23], SSEM [19] and VAPOR [20], and ZFP [10] adopt the discrete cosine transform, the discrete wavelet transform, and a customized orthogonal transform, respectively.
Recently, many research studies [11, 24, 7, 8, 9, 10, 25, 26, 27, 28, 29] have shown that SZ and ZFP are two leading lossy compressors for HPC scientific data. Specifically, SZ predicts each data point's value by its preceding neighbors in the multidimensional space and then performs an error-controlled quantization and customized Huffman coding to shrink the data size significantly. ZFP splits the whole data set into many small blocks with an edge size of 4 along each dimension and compresses the data in each block separately by a series of carefully designed steps, including exponent alignment, orthogonal transform, fixed-point integer conversion, and bit-plane-based embedded coding. For more details, we refer readers to [8] and [10] for SZ and ZFP, respectively. Fu et al. [30] proposed an on-the-fly lossy compression method for a high-performance earthquake simulation. Their lossy compression can reduce the memory cost by 50% and improve the overall performance by 24% on the Sunway TaihuLight supercomputer [31]. This on-the-fly lossy compression scheme reduces 32-bit floating-point data to 16-bit by using an adaptive binary representation. We exclude this approach in our work because its compression ratio is limited to 2.
How to integrate different compression techniques into one framework and exploit their distinct advantages to optimize the compression quality is a challenging topic. Blosc [32] is a successful lossless compressor built on multiple lossless compression methods, including FastLZ [33], LZ4/LZ4HC [34], Snappy [35], Zlib [36], and Zstandard [37]. However, no such compressor has been designed on top of different lossy compression techniques for optimizing the rate-distortion, leaving a large gap in meeting the demand for efficient error-controlled compression of scientific data sets. Although Lu et al. [11] model the compression performance of SZ and ZFP to select the better of the two, they focus only on the compression ratio given a specific maximum error. Their method does not model and select the best compressor based on statistical metrics such as the RMSE (root mean squared error) or PSNR (a simple derivation from the RMSE). Such average error metrics are more important for visualization of scientific data than is the maximum error [2, 38]. Although some more complex metrics (such as the Structural Similarity Index) can also evaluate compression schemes in terms of visual quality, for generality and simplicity [39] our research focuses on maximizing the PSNR for a given compression ratio, by modeling and selecting online the best compressor between SZ and ZFP with low performance overhead.
3 Architecture of Proposed Online Automatic Selection Method for Lossy Compression
Lossy compression can be divided into three stages, as shown in Figure 1. In Stage I, the original data is transformed to another data domain (e.g., the frequency domain) by lossless transformations. Here lossless transformation means that the reconstructed data will be lossless if one transforms the original data and performs the corresponding inverse transformation right away. The transformed data is easier to compress because of the efficient energy compaction [40]. Energy compaction means that the energy is more concentrated in some elements of the transformed data than in the original data. Stage II reduces the data size but also introduces errors. The most commonly used techniques for Stage II are vector quantization [41] (static quantization) and embedded coding [42] (dynamic quantization). Stage III performs entropy coding for further lossless data reduction, and it is sometimes optional. Figure 1 shows four state-of-the-art lossy compressors for HPC scientific data and their corresponding techniques in each stage.
We design our online selection method based on the analysis of the three critical compression stages. Specifically, we propose a novel optimization strategy comprising four steps, as shown in Figure 2. The first step takes the input scientific data sets and performs Stage I's transformation on the sampled data points. The second step uses the transformed data points from Step 1 to estimate the compression quality (including compression ratio and distortion of data) based on our proposed estimation model. The third step selects the best-fit lossy compression strategy based on SZ and ZFP. The fourth step constructs a lossy compressor and uses it for compressing the data set. As confirmed by recent research [11, 8, 10, 25, 22, 43, 24], SZ and ZFP are two leading lossy compressors for HPC scientific data and can well represent prediction-based and transformation-based lossy compressors, respectively. Accordingly, our online selection method is based mainly on these two state-of-the-art lossy compressors, without loss of generality.
In this paper, we adopt a practical in situ model [44], as many state-of-the-art lossy compressors (such as ISABELA, SZ, and ZFP) do as well. Here in situ means that data analysis, visualization, and compression happen without first writing data to persistent storage. Thus, in the in situ compression model, the compression is conducted after the entire computation in each simulation time step, such that the entire analysis data is already kept in memory. Overall, our online method can select the best-fit compressor on the fly during the in situ compression, aiming at reducing data storage and I/O time overheads.
4 Analysis of Lossless Transformations for Energy Compaction in Stage I
In this section, we provide an in-depth analysis of the impact of Stage I (i.e., prediction-based transformation (PBT) and block orthogonal transformation (BOT)) on the overall distortion of data. As presented in Figure 1, Stage I is lossless. This does not mean, however, that if the data in the transformed domain is changed, the overall distortion level (such as the mean squared error (MSE)) of the finally reconstructed data necessarily stays the same as that of the transformed data, because the data in the transformed domain can be largely different from the original data. Based on an in-depth analysis of the two transformation methods in Stage I, we prove that the norm-based error value (e.g., MSE) remains unchanged after the inverse transformation of PBT and BOT. This fundamental analysis implies that we can predict the overall distortion of the finally decompressed data for SZ and ZFP by estimating the data distortion in Stage II.
4.1 Prediction-Based Transformation (PBT)
In this subsection, we introduce prediction-based lossy compression, and then we infer that the point-wise compression error (i.e., the difference between any original data value and its decompressed value) is equal to the error introduced by vector quantization or embedded encoding in Stage II.
In the compression phase of prediction-based lossy compression (as shown in the top subfigure of Figure 3), the first step is to predict the value of each data point and calculate the prediction errors. We define PBT as the process of generating the set of prediction errors (denoted by $E = \{e_i\}$) from the original data set (denoted by $X = \{x_i\}$), data point by data point, during the compression. The prediction errors will be quantized or encoded in Stage II.
During the decompression, one needs to reconstruct the prediction errors based on the quantization method or embedded encoding and then reconstruct the overall data set by an inverse PBT (as presented in the bottom subfigure of Figure 3). We define the inverse PBT (denoted iPBT) as the procedure of constructing the decompressed data set (denoted $X' = \{x_i'\}$), data point by data point, based on the reconstructed prediction errors (denoted $E' = \{e_i'\}$) during the decompression.
In what follows, we infer that the following equation must hold for PBT:
\[ x_i - x_i' = e_i - e_i'. \quad (1) \]
During the compression, the prediction method generally predicts the value of each data point based on the data points nearby in space, because of the potential high consecutiveness of the data set. The Lorenzo predictor [45], for example, approximates each data point by the values of its preceding adjacent data points (1 neighbor per data point for 1D data, 3 neighbors per data point for 2D data, and 7 neighbors per data point for 3D data). Since the neighboring data values used to reconstruct each data point during the decompression are actually the decompressed values instead of the original values, in practice one has to ensure that the compression and decompression stages perform exactly the same prediction procedure (including the data values used in the prediction method); otherwise, the data loss would be propagated during the decompression. Hence, the predicted values during the compression (denoted $p_i$) must be equal to the predicted values during the decompression (denoted $p_i'$). That is, we have $p_i = p_i'$. Then we can derive Equation (1) from the two defining equations $e_i = x_i - p_i$ and $e_i' = x_i' - p_i'$.
Based on Equation (1), we can easily derive the following theorem.
Theorem 1.
The point-wise compression error in the original data space is the same as the point-wise compression error in the PBT-transformed data space.
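As a concrete illustration, the following toy sketch (a hypothetical 1D predictor that uses the previous decompressed value, combined with linear quantization; this is an assumption for illustration, not the actual SZ implementation) verifies Equation (1) and Theorem 1 numerically:

```python
# Minimal 1D prediction-based transform (PBT) sketch: each value is
# predicted by the previous *decompressed* value, and the prediction
# error is quantized with a linear quantizer of bin size 2*eb.
def pbt_roundtrip(data, eb):
    bin_size = 2 * eb
    recon, pairs = [], []
    pred = 0.0                                   # first value predicted as 0 (toy convention)
    for x in data:
        e = x - pred                             # prediction error (PBT output)
        e_rec = round(e / bin_size) * bin_size   # quantized error (Stage II)
        x_rec = pred + e_rec                     # decompressed value (iPBT)
        pairs.append((x - x_rec, e - e_rec))     # errors in both spaces
        recon.append(x_rec)
        pred = x_rec                             # prediction uses decompressed neighbors
    return recon, pairs

data = [1.00, 1.21, 1.13, 0.94, 1.04, 1.30]
recon, pairs = pbt_roundtrip(data, eb=0.05)
# Equation (1): x - x' = e - e' holds point-wise,
# and the point-wise error stays within the error bound.
assert all(abs(a - b) < 1e-12 for a, b in pairs)
assert all(abs(a) <= 0.05 + 1e-12 for a, _ in pairs)
```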
4.2 Block Orthogonal Transformation (BOT)
In the following discussion, we first introduce the principle of the block orthogonal transformation. We then prove a critical feature: the $\ell_2$-norm-based compression error (such as MSE) in the original data space is the same as the compression error in the BOT-transformed data space.
Let us first describe the element-wise tensor (matrix) norms that we will use in the following discussion. One can treat a tensor as a vector and calculate its element-wise norm based on a specific vector norm. For example, by using the vector $\ell_p$ norm, we can define the element-wise $\ell_p$ norm of a tensor $A$ as follows:
\[ \|A\|_p = \Big(\sum_i |a_i|^p\Big)^{1/p}, \quad (2) \]
where the sum runs over all elements $a_i$ of $A$. Further, if $A$ is an $n \times n$ matrix and we choose $p = 2$, Equation (2) can be simplified to
\[ \|A\|_2 = \sqrt{\mathrm{tr}(A^T A)}, \quad (3) \]
where $\mathrm{tr}(\cdot)$ returns the sum of the diagonal entries of a square matrix. Equation (3) defines the element-wise $\ell_2$ norm (a.k.a. Frobenius norm) of a square matrix.
Block-transform-based lossy compressors divide the entire data set into multiple data blocks and perform a blockwise transformation in Stage I. Unlike the prediction-based transformation, the block transformations have no dependency on one another and can be performed independently. Each block transformation is composed of several 1D linear transformations performed along each axis within the block. For example, in a 2D data array, a 1D linear transformation is applied to each row ($x$ axis) and each column ($y$ axis). Each 1D linear transformation can be calculated as a multiplication of the transformation matrix and a 1D vector.
Many lossy compressors adopt orthogonal matrices in their transformations. For example, SSEM uses the Haar wavelet transform, and ZFP uses a self-optimized orthogonal matrix. Here an orthogonal matrix $T$ means its columns and rows are orthogonal unit vectors, i.e., $T T^T = T^T T = I$, where $I$ is the identity matrix. The most significant advantage of using an orthogonal transformation is the property of $\ell_2$ norm invariance after transformation, that is,
\[ \|TA\|_2 = \|A\|_2. \quad (4) \]
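As a quick numerical check of Equation (4), the sketch below uses the orthonormal 4-point Haar matrix as an example orthogonal transform (an illustrative choice, not ZFP's actual matrix) and confirms that multiplying by it preserves the element-wise $\ell_2$ norm:

```python
import math

# Orthonormal 4-point Haar matrix: rows are orthogonal unit vectors.
s = 1 / math.sqrt(2)
T = [[0.5, 0.5, 0.5, 0.5],
     [0.5, 0.5, -0.5, -0.5],
     [s, -s, 0.0, 0.0],
     [0.0, 0.0, s, -s]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def fro(A):
    # Element-wise l2 (Frobenius) norm, Equation (3).
    return math.sqrt(sum(x * x for row in A for x in row))

A = [[3.1, -2.0], [0.5, 4.2], [1.0, 0.0], [-1.5, 2.2]]  # arbitrary 4x2 data
# Equation (4): ||T A||_2 == ||A||_2 for an orthogonal T.
assert abs(fro(matmul(T, A)) - fro(A)) < 1e-12
```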
Based on this property, we can prove that the $\ell_2$-norm-based compression error in the original data space is the same as the compression error in the BOT-transformed data space; we prove this later in this subsection.
The block size in a BOT-based lossy compressor is usually set to a power of 2. ZFP and SSEM, for example, set the block size to $4^d$, where $d$ is the number of dimensions. JPEG uses $8 \times 8$ as the block size for 2D image data. In our work, without loss of generality, we consider the block size in BOT to be $4^d$ and do not limit the dimension of the data set. Note that here the "dimension" represents the dimensionality of each data point rather than the number of fields in the data sets.
Based on prior research [10], the transformation matrices of most existing well-known BOTs can be expressed in a uniform parametric form governed by a single parameter. Particular values of this parameter yield the discrete HWT and the DCT-II, the two most common transforms; other values represent the slant transform, the high-correlation transform, and the Walsh-Hadamard transform.
In what follows, we discuss the unified formulas of BOT for data of any dimension. We use $\mathcal{T}$ to denote the BOT and $B$ to denote the data block, represented by an order-$d$ tensor of size $4 \times 4 \times \cdots \times 4$. Since the orthogonal transformation is performed on 1D vectors, we need to rearrange $B$'s elements to form a $4 \times 4^{d-1}$ matrix and perform a matrix-matrix multiplication. The $d$-dimensional tensor can be unfolded along each of the $d$ directions by index mapping. Specifically, the unfolding along the $k$th direction ($1 \le k \le d$) maps the tensor element $B_{i_1 \cdots i_d}$ to the matrix element $M_{i_k, j}$, where $j$ is the linear index of the remaining subscripts $(i_1, \ldots, i_{k-1}, i_{k+1}, \ldots, i_d)$. We use $u_1, u_2, \ldots, u_d$ to denote the unfolding operations along the $d$ axes, respectively. Accordingly, we can fold the tensor back from the unfolded matrix by the inverse index mappings, denoted by $u_1^{-1}, u_2^{-1}, \ldots, u_d^{-1}$. Thus, $\mathcal{T}$ can be expressed as the following composition of operations:
\[ \mathcal{T}(B) = u_d^{-1}\big(T \cdot u_d( \cdots u_1^{-1}(T \cdot u_1(B)) \cdots )\big). \]
Lemma 2.
Block orthogonal transformation (BOT) preserves the $\ell_2$ norm on data sets of any dimension.
Proof.
We still denote the orthogonal transformation matrix by $T$. Because $u_k$ and $u_k^{-1}$ are both index-mapping operations, they leave the element values, and hence the element-wise $\ell_2$ norm, unchanged. Thus, we can write $\|u_k(B)\|_2 = \|B\|_2$.
Then, based on Equation (4), we can get $\|T \cdot u_k(B)\|_2 = \|u_k(B)\|_2 = \|B\|_2$.
So $\|u_k^{-1}(T \cdot u_k(B))\|_2 = \|B\|_2$ holds for the $k$th operation ($1 \le k \le d$), which demonstrates that every operation in the BOT keeps the $\ell_2$ norm unchanged. Therefore, we have proved this lemma. ∎
We use $B'$ to denote the decompressed block data, $C$ to denote the transformed block data in the compression, and $C'$ to denote the reconstructed transformed block data in the decompression. We have $C = \mathcal{T}(B)$ and $B' = \mathcal{T}^{-1}(C')$.
Thus, due to the linearity of $\mathcal{T}$, we have $B - B' = \mathcal{T}^{-1}(C) - \mathcal{T}^{-1}(C') = \mathcal{T}^{-1}(C - C')$.
Based on Lemma 2, we have $\|B - B'\|_2 = \|C - C'\|_2$.
This equation also holds when the data set is composed of multiple data blocks. That is, we have already proved the following theorem.
Theorem 3.
The $\ell_2$-norm-based compression error in the original data space is the same as the compression error in the BOT-transformed data space on data sets of any dimension.
Note that the reasons that Theorem 3 focuses on the $\ell_2$ norm include two aspects: on the one hand, the "norm invariance" property of BOT (as shown in Equation (4)) holds only for the $\ell_2$ norm among the element-wise norms, because of Equation (3); on the other hand, $\ell_2$-norm-based error (such as MSE or PSNR) has been considered one of the most critical indicators for assessing the overall data distortion in the literature, because it is closely related to the visual quality [46], unlike the maximum compression error (i.e., $\ell_\infty$-norm-based error).
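The following sketch checks Theorem 3 on a single 2D $4 \times 4$ block, again using the orthonormal Haar matrix as a stand-in orthogonal transform (an illustrative assumption): a perturbation introduced in the transformed domain produces exactly the same $\ell_2$-norm error in the original domain after the inverse transform.

```python
import math

s = 1 / math.sqrt(2)
T = [[0.5, 0.5, 0.5, 0.5],
     [0.5, 0.5, -0.5, -0.5],
     [s, -s, 0.0, 0.0],
     [0.0, 0.0, s, -s]]          # orthonormal Haar matrix (illustrative BOT)

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transpose(A):
    return [list(row) for row in zip(*A)]

def fro_diff(A, B):
    # ||A - B||_2, the element-wise l2 (Frobenius) norm of the difference.
    return math.sqrt(sum((a - b) ** 2
                         for ra, rb in zip(A, B) for a, b in zip(ra, rb)))

B = [[float(4 * i + j) for j in range(4)] for i in range(4)]  # toy 4x4 block
Tt = transpose(T)
C = matmul(matmul(T, B), Tt)     # forward 2D BOT: transform rows, then columns
# Introduce a small perturbation in the transformed domain (Stage II loss).
Cp = [[c + 0.01 * ((i + j) % 3 - 1) for j, c in enumerate(row)]
      for i, row in enumerate(C)]
Bp = matmul(matmul(Tt, Cp), T)   # inverse 2D BOT
# Theorem 3: the l2-norm error is identical in both domains.
assert abs(fro_diff(B, Bp) - fro_diff(C, Cp)) < 1e-10
```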
4.3 Data Sampling for Compression-Quality Estimation
In our proposed automatic online selection method, we first sample the data points (i.e., Step 1) and then perform the transformations on them (i.e., Step 2) in order to estimate the overall compression quality, as shown in Figure 2. The distance between two nearby sampled data blocks is fixed within the same dimension and may differ across dimensions, such that all sampled blocks are distributed uniformly throughout the entire data set. We use the term sampling rate to denote the fraction of data points sampled. In the evaluation section, we present the accuracy of our estimation model with respect to different sampling rates. Based on our experiments (to be shown later), a small sampling rate can already provide good accuracy with low performance overhead, and we adopt it as the default setting in our implementation. Note that for PBT, the prediction over the sampled data points is actually based on their original real neighbors instead of their neighbors among the sampled points. Thus, the sampling process for PBT does not introduce additional errors.
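A possible implementation of such uniform block sampling is sketched below (the stride formula, the 1% rate, and the helper name `sampled_block_origins` are illustrative assumptions, not the paper's exact procedure):

```python
import itertools

def sampled_block_origins(dims, rate, block=4):
    """Pick origins of block^d blocks on a uniform grid so that roughly
    a `rate` fraction of all blocks is sampled (illustrative sketch)."""
    ndim = len(dims)
    # Stride in blocks per axis so that 1/stride^ndim ~= rate.
    stride_blocks = max(1, round((1.0 / rate) ** (1.0 / ndim)))
    step = block * stride_blocks   # fixed distance between sampled blocks per dimension
    axes = [range(0, d - block + 1, step) for d in dims]
    return list(itertools.product(*axes))

origins = sampled_block_origins((100, 100), rate=0.01)
# stride of 10 blocks per axis -> origins at 0, 40, 80 along each axis
assert origins[0] == (0, 0) and len(origins) == 9
```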
5 Compression Quality Estimation of Lossy Data Reduction in Stage II
In this section, we provide an in-depth analysis of the lossy data reduction in Stage II. We then propose a general estimation model to accurately predict the compression quality (including compression ratio and compression error) for lossy compressors that use vector quantization or embedded coding in Stage II, based on the theorems derived in Section 4. After that, we apply our estimation model specifically to SZ and ZFP to predict their compression quality (i.e., Step 2 as shown in Figure 2). We then discuss the implementation details of our proposed online automatic selection algorithm.
For notation, we use $D$ to denote the transformed data produced from the original data $X$; that is, $D$ is the input data of Stage II and the output data of Stage I.
5.1 Estimation Based on Static Quantization
Unlike datadependent quantization approach (such as embedded coding that will be discussed later), static quantization is determined before performing quantization. Vector quantization [41] is one of the most popular static quantization methods. It converts (i.e., prediction errors in PBT or transformed data in BOT) to another set of integer values, which are easier to compress. Specifically, the value range is split into multiple intervals (i.e., quantization bins) based on some method, such as equalsize quantization or logscale quantization (discussed later). Then, the compressor needs to go through all the transformed data () to determine in which bins they are located, and represents their values by the corresponding bin indexes, which are integer values. During the decompression, the midpoint of each quantization bin will be used to reconstruct the data that are located in the bin; it is called the estimated value (or quantized value) in the following discussion. The effectiveness of the data reduction in vector quantization depends on the distribution of the transformed data . Moreover, the quantization step introduces errors to , and such errors will be added to the decompressed data.
We build a model to estimate the data reduction level (e.g., compression ratio) and the data distortion level (e.g., mean squared error), based on a vector quantization method with a specific distribution of . In the following, we define some important notations to be used. We denote
as the probability density function (PDF) of
, that is,. Based on our observation, the probability distribution of
is symmetric in a large majority of cases. The blue area in Figure 4 exemplifies the typical probability distribution of the prediction errors generated by the SZ lossy compressor using the ATM data set. All other tested data sets show the same symmetry. Therefore, we assume to be symmetric without loss of generality (i.e., is equal to as illustrated in Figure 4), and the number of vector quantization bins is represented by . We denote the length of the th quantization bin, where because of the symmetry property.5.1.1 Estimation of bitrate
Bit rate is defined as the average number of bits used per value in the compressed data. As discussed previously, a large number of the bin indexes generated by the vector quantization are expected to gather in a few quantization bins; that is, most values are represented by a few integer bin indexes (Stage II), such that the data size can be reduced significantly by entropy encoding (Stage III). We combine our discussion of Stage II and Stage III in order to estimate the overall size reduction achieved by the quantization.
Given a number of symbols, an entropy encoding method (such as Huffman coding [47] or arithmetic coding [48]) assigns a number of bits to represent these symbols based on their frequencies. Since Shannon entropy theory [49] gives the expected number of bits needed to represent the symbols, we can use the entropy value of the bins to estimate the expected bit rate used to represent all quantized values. The estimation equation is shown as follows:
\[ \widehat{BR} = -\sum_{i=-n}^{n} P_i \log_2 P_i, \quad (5) \]
where $P_i$ is the probability of the $i$th quantization bin.
The probability of each bin can be calculated as the integral of the probability density function over the bin. Specifically, $P_i = \int_{I_i} p(x)\,dx$, where $I_i$ is the $i$th quantization bin and $-n \le i \le n$. Since the integral is relatively expensive to compute, we use $p(m_i)\,\ell_i$ to approximate $P_i$, where $m_i$ denotes the midpoint of the $i$th bin. Therefore, the estimation of the bit rate based on $D$'s PDF is
\[ \widehat{BR} = -\sum_{i=-n}^{n} p(m_i)\,\ell_i \log_2\big(p(m_i)\,\ell_i\big). \quad (6) \]
Note that the midpoint of the $-i$th bin is $-m_i$ (i.e., $m_{-i} = -m_i$) according to the symmetry property.
Therefore, we can estimate the bit rate by Equation (6) given the probability density function of $D$. (We discuss our method to estimate $D$'s PDF in detail later.) Note that the compression ratio can be calculated by dividing the number of bits per floating-point value by the bit rate: for example, 32/bit-rate for single-precision data and 64/bit-rate for double-precision data.
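As an empirical check of Equation (5), one can compare the entropy of the quantization-bin probabilities with the exact entropy for a known distribution (a toy sketch; the uniform data, bin size, and helper name are illustrative assumptions):

```python
import math
import random

def bin_entropy(samples, bin_size):
    """Equation (5): estimated bit rate = entropy of bin probabilities."""
    counts = {}
    for x in samples:
        k = round(x / bin_size)        # linear quantization bin index
        counts[k] = counts.get(k, 0) + 1
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

random.seed(0)
samples = [random.random() for _ in range(200_000)]   # uniform on [0, 1)
rate = bin_entropy(samples, bin_size=1 / 8)
# For uniform data and bin size 1/8, the rounded indices 0..8 occur with
# probabilities 1/16, 1/8 (x7), 1/16, so the exact entropy is 3.125
# bits/value; the empirical estimate should be close. The corresponding
# compression ratio for single-precision floats would be about 32/rate.
assert 3.0 < rate < 3.25
```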
5.1.2 Estimation of compression error
As proved in Theorems 1 and 3, PBT and BOT are both norm-preserving transformations. Thus, the $\ell_2$-norm-based error, such as the mean squared error (MSE), introduced by Stage II stays unchanged after decompression. Therefore, we can estimate the norm-based compression error by estimating the error introduced in Stage II.
We denote $\hat{D}$ as the quantized values of $D$. The MSE between $D$ and $\hat{D}$ can be calculated by
\[ mse = E\big[(D - \hat{D})^2\big] = \int \big(x - q(x)\big)^2\, p(x)\, dx, \quad (7) \]
where $E[\cdot]$ represents the expectation and $q(x)$ is the quantization function. Note that $q(x)$ is a step function, since the values in each bin are quantized to the same value. Lossy compressors such as NUMARCK [21], SSEM [19], and SZ [8] often use the midpoint of the quantization bin to approximate the values located in it. Therefore, $q(x) = m_i$ when $x \in I_i$. We can further estimate the MSE based on the probability density function and the step function as follows:
\[ mse \approx \sum_{i=-n}^{n} p(m_i) \int_{I_i} (x - m_i)^2\, dx = \sum_{i=-n}^{n} p(m_i)\, \frac{\ell_i^3}{12}. \]
After that, we can calculate the root mean squared error (RMSE), the normalized root mean squared error (NRMSE), and the peak signal-to-noise ratio (PSNR) as follows:
\[ rmse = \sqrt{mse}, \qquad nrmse = \frac{rmse}{R}, \qquad psnr = -20 \log_{10}(nrmse), \quad (8) \]
where $R$ represents the value range of the original data $X$. Thus far, we have established the estimation equations for bit rate and $\ell_2$-norm-based compression error. We are now ready to derive the estimation of the most significant metric: rate-distortion.
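To illustrate Equations (7) and (8), the sketch below quantizes toy uniform data to bin midpoints, compares the empirical MSE with the $\ell_i^3/12$-style estimate (which reduces to $\ell^2/12$ for equal-size bins), and converts it to PSNR (all parameters are illustrative):

```python
import math
import random

random.seed(1)
R = 1.0                                  # value range of the toy data
data = [random.random() for _ in range(100_000)]
ell = 0.01                               # uniform bin size

# Midpoint quantizer: q(x) = m_i for x in the i-th bin.
quantized = [(math.floor(x / ell) + 0.5) * ell for x in data]
mse = sum((x - q) ** 2 for x, q in zip(data, quantized)) / len(data)

# For equal-size bins, the PDF-based estimate reduces to ell^2 / 12.
mse_est = ell ** 2 / 12
psnr = -20 * math.log10(math.sqrt(mse) / R)    # Equation (8)
assert abs(mse - mse_est) / mse_est < 0.05     # estimate matches within 5%
assert 45 < psnr < 55                          # about 50.8 dB for ell = 0.01
```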
5.1.3 Estimation of rate-distortion
Rate-distortion is an important metric for comparing different lossy compressors, such as fixed-rate lossy compressors (e.g., ZFP) and fixed-accuracy lossy compressors (e.g., SZ). For a fair comparison, people usually plot the rate-distortion curves of the different lossy compressors and compare the distortion quality at the same rate. Generally, the higher the rate-distortion curve, the better the lossy compression quality. Here the term "rate" means the bit rate in bits/value, and "distortion" usually adopts the PSNR.
Based on the estimations of bit rate and PSNR proposed above (i.e., Equations (6) and (8)), the rate-distortion depends only on the bin sizes $\ell_i$, given the probability distribution of $D$. However, it is difficult to optimize the bin sizes for the rate-distortion during the preparation stage, even if the probability distribution is a classic one, such as the Gaussian distribution. In the following, we analyze three common, effective vector quantization approaches; the analysis can be extended to more vector quantization methods.
5.1.4 Detailed analysis of three vector quantization cases


Linear quantization: This is the simplest yet effective vector quantization approach, which is adopted by the SZ lossy compressor. Under this approach, all quantization bins have the same length $\ell$ (i.e., $\ell_i = \ell$). Moreover, the quantization bins can cover all the prediction errors as long as the number of bins is large enough; hence, $\sum_{i=-n}^{n} p(m_i)\,\ell \approx 1$. So Equations (6) and (8) can be simplified as follows:
\[ \widehat{BR} = -\sum_{i=-n}^{n} p(m_i)\,\ell \log_2\big(p(m_i)\,\ell\big), \quad (9) \]
\[ psnr = 20 \log_{10} R - 20 \log_{10} \ell + 10 \log_{10} 12. \quad (10) \]
Equation (10) tells us that the PSNR depends only on the unified quantization bin size $\ell$, regardless of the distribution of the transformed data from Stage I. For example, the SZ lossy compressor sets the bin size to twice the absolute error bound (i.e., $\ell = 2 e_{abs}$) to make sure the maximum point-wise compression error is within $e_{abs}$. So, based on Equation (10), our PSNR estimation for the SZ lossy compressor becomes
\[ psnr = -20 \log_{10} (e_{abs} / R) + 10 \log_{10} 3. \quad (11) \]
Note that $e_{abs}/R$ is the value-range-based relative error bound [8] (denoted by $\epsilon$) defined by SZ. Unlike the point-wise relative error, which is compared with each data value, the value-range-based relative error is compared with the value range of each data field. Thus our model can estimate SZ's PSNR precisely based on the value-range-based relative error bound.

Log-scale quantization: Log-scale quantization is an alternative to linear quantization; its bin sizes follow a logarithmic distribution. The bin boundaries halve from the maximum absolute value toward zero: the top bin covers the values from half the maximum absolute value up to the maximum, the next bin covers the values from a quarter to a half, and so on, so the bin sizes shrink geometrically near zero. Compared with linear quantization, log-scale quantization usually has a higher PSNR but a lower compression ratio. The reason is that log-scale quantization assigns a larger number of finer bins to the high-frequency (central) regions; thus, according to Equation (7), its PSNR can be higher than linear quantization's. On the other hand, the distribution of the interval frequencies with log-scale quantization is more even than with linear quantization, leading to a poorer entropy encoding result. Hence, for a given data set, it is hard to tell directly which quantization method is better in terms of rate-distortion; the most effective way is to compare their rate-distortion estimations.
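A minimal sketch of the bin-assignment rule, assuming boundaries that halve from the maximum absolute value toward zero (a hypothetical helper, not taken from any compressor's code):

```python
import math

def logscale_bin_index(v, v_max, num_bins):
    """Map |v| to a log-scale bin: the top bin covers [v_max/2, v_max], the
    next covers [v_max/4, v_max/2), and so on; values below the smallest
    boundary (including zero) land in bin 0."""
    if v == 0.0:
        return 0
    halvings = int(math.floor(-math.log2(abs(v) / v_max)))
    return max(0, num_bins - 1 - halvings)
```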

Equal-probability quantization: This vector quantization approach is employed by the NUMARCK lossy compressor. It generates equal probability for each quantization interval. In this case, the estimated bit rate equals log2 of the number of bins, which shows that entropy encoding has no effect on intervals with the same frequency. The PSNR can then be estimated from the resulting (unequal) bin sizes. The bin boundaries can be estimated by the clustering-based approximation approach proposed by Chen et al. [21], such as the K-means clustering algorithm, whose time overhead is expensive.
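Equal-probability bins can also be approximated cheaply from empirical quantiles; this sketch uses quantiles instead of the K-means clustering that NUMARCK actually employs:

```python
import numpy as np

def equal_probability_bins(data, num_bins):
    """Bin edges such that each bin holds roughly the same number of values.
    With equal frequencies, entropy coding gives no gain, so the bit rate is
    simply log2(num_bins)."""
    quantiles = np.linspace(0.0, 1.0, num_bins + 1)
    return np.quantile(np.asarray(data), quantiles)
```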
5.2 Estimation Based on Dynamic Quantization
Dynamic quantization is data dependent and encodes the data progressively. For example, embedded coding (EC) [42] is the most commonly used dynamic quantization approach. It is the most important part of BOT-based lossy compressors, such as JPEG2000 and ZFP. It generates a stream of bits ordered by their impact on the error. Many variants of EC have been proposed in previous work [50, 51, 52]. As we proved in Section 4.2, the norm-based compression error in the original data space is equal to the compression error in the transformed space, so the bits in the same bitplane (shown as the blue dashed line in Figure 5) should be encoded at a time; that is the case for most EC variants. Figure 5 shows an example with 16 transformed coefficients (a 4×4 data block) to be encoded. Each datum is represented in its binary format. EC starts from the leftmost bitplane and ends at the maximum bitplane, shown as the purple dashed line in the figure. The maximum bitplane is determined by the bit budget or the error bound set by users. For each value, EC encodes only its significant bits (i.e., the bits starting from the first nonzero bit). We use a red dashed line to indicate the significant bits of the 16 values. The BOT-transformed data are roughly in order (i.e., large values appear ahead of small values), so the red dashed line in Figure 5 exhibits a staircase shape. We can use this feature to estimate the bit rate and compression error for embedded coding.
5.2.1 Estimation of bit rate
To estimate the bit rate after embedded coding, we need to estimate the number of significant bits for each value. We again use a sampling approach. Specifically, we first sample some data points (marked in green in the figure) and count their significant bits. After that, we use the sampled data points and their significant-bit counts to interpolate the counts for the remaining data points. We then calculate the average over all values and use it as the approximate bit rate. The key reason we can adopt sampling and interpolation is that the significant bits of BOT-transformed data exhibit a staircase shape, as discussed previously.
5.2.2 Estimation of compression error
Similar to the estimation of the bit rate, we can estimate the compression error, for example, the MSE, by calculating the MSE of the sampled data points. Note that after the first step of exponent alignment, different blocks may have different exponent offsets and maximum bitplanes. Hence, in order to calculate the overall MSE of the sampled data points, each sampled data point's error is calculated by multiplying its truncated error in binary representation by its block's exponent offset value. Finally, we estimate the overall PSNR by the PSNR of the sampled data points.
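The two sampling-based estimates above can be sketched as follows; both functions are illustrative stand-ins (e.g., the PSNR sketch omits the per-block exponent-offset accounting described above):

```python
import math
import numpy as np

def estimate_bitrate_ec(coeffs, sample_idx):
    """Bit-rate estimate for embedded coding: count significant bits at a few
    sampled positions and linearly interpolate to the rest, exploiting the
    staircase shape of BOT-transformed coefficients."""
    coeffs = np.asarray(coeffs)
    bits = [int(abs(int(c))).bit_length() for c in coeffs[sample_idx]]
    interp = np.interp(np.arange(len(coeffs)), sample_idx, bits)
    return float(interp.mean())

def estimate_psnr_sampled(orig_sample, decomp_sample, value_range):
    """PSNR estimate computed from the sampled points only."""
    err = np.asarray(orig_sample) - np.asarray(decomp_sample)
    mse = float(np.mean(err * err))
    return 10.0 * math.log10(value_range ** 2 / mse)
```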
We use two sampling rates: one for Stage I and one within the data blocks for embedded coding. We observe that a low Stage I sampling rate may significantly affect the estimation accuracy, but the accuracy is not that sensitive to the sampling rate used in embedded coding. Thus, as the default setting of our solution, we sample only 3 data points per 1D data block, 9 data points per 2D data block, and 16 data points per 3D data block; this low per-block sampling rate lets our estimation model achieve both high estimation accuracy and low overhead (illustrated later).
5.3 Implementation of Proposed Online Selection Method
We develop the automatic online selection method based on our proposed estimation model for two leading error-controlled lossy compressors. As discussed above, we apply the model to both SZ and ZFP to predict their compression quality accurately. Specifically, SZ adopts a multidimensional prediction model for its PBT in Stage I and linear quantization for its vector quantization in Stage II; we use Equations (9) and (11) to predict its bit rate and PSNR, respectively. ZFP uses an optimized orthogonal transformation for its BOT in Stage I and group-testing-based EC [10] in Stage II; we use the sampling-based estimations of Sections 5.2.1 and 5.2.2 to predict its bit rate and PSNR, respectively.
Our online selection method adopts the rate-distortion as the criterion to select the best-fit compression technique between SZ and ZFP. Specifically, for each field/variable, our solution first estimates ZFP's bit rate and PSNR based on the error bound set by the user. Next, it estimates SZ's bit rate at the PSNR estimated for ZFP, which is reliable because of the high PSNR estimation accuracy of our model. Then, it selects the compressor with the smaller estimated bit rate and performs the corresponding lossy compression, as shown in Algorithm 1. Note that the compression errors of ZFP follow a Gaussian-like distribution, while those of SZ follow a uniform-like distribution [26]. Thus, to reach the same PSNR, ZFP needs a larger input error bound than SZ does. Accordingly, the absolute error bound calculated for SZ (line 7 of Algorithm 1) is smaller than the user-set absolute error bound, which guarantees that the compression errors are still pointwise bounded after decompression.
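The selection logic can be sketched in a few lines; the two estimator callables are hypothetical placeholders for the estimations described above, not the paper's exact Algorithm 1:

```python
def select_compressor(estimate_zfp, estimate_sz, error_bound):
    """Pick the compressor with the smaller estimated bit rate at equal PSNR:
    estimate ZFP's (bit rate, PSNR) at the user's error bound, then estimate
    SZ's bit rate at that same PSNR."""
    zfp_bitrate, zfp_psnr = estimate_zfp(error_bound)
    sz_bitrate = estimate_sz(zfp_psnr)
    return "SZ" if sz_bitrate < zfp_bitrate else "ZFP"
```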
6 Evaluation Results and Analysis
In this section, we first describe the experimental platform and the HPC scientific data sets used for evaluation. We then evaluate the accuracy of our estimation model and analyze the time and memory overheads of our online selection method. Finally, we present the experimental results in a parallel environment with up to 1,024 cores.
6.1 Experimental Setting and Scientific Simulation Data
We conduct our experimental evaluations on the Blues cluster [53] at Argonne's Laboratory Computing Resource Center using 1,024 cores (i.e., 64 nodes, each with two Intel Xeon E5-2670 processors and 64 GB of DDR3 memory; each processor has 16 cores). The storage system uses the General Parallel File System (GPFS); the file systems are located on a RAID array and served by multiple file servers. The I/O and storage systems are typical of high-end supercomputer facilities. We use the file-per-process mode with POSIX I/O [54] on each process for reading/writing data in parallel. (POSIX I/O performance is close to that of other parallel I/O methods such as MPI-IO [55] when thousands of files are written/read simultaneously on GPFS, as indicated by a recent study [56].) We perform our evaluations on various single-precision floating-point data sets, including 2D ATM data sets from climate simulations, 3D Hurricane data sets from the simulation of Hurricane Isabel, and 3D NYX data sets from a cosmology simulation. The details of the data sets are described in Table I. We use SZ 1.4.11 with the default mode and ZFP 0.5.0 with the fixed-accuracy mode for the following evaluations.
6.2 Accuracy of Compression-Quality Estimation
We evaluate our model based on three criteria: the average error of estimating the bit rate, the average error of estimating the PSNR, and the accuracy of selecting the best-fit compression technique under different sampling rates. Note that we use PSNR instead of MSE because previous work [10, 8] usually adopts PSNR for rate-distortion evaluation.
Tables II and III show the average errors of the bit-rate and PSNR estimations under different sampling rates (i.e., 1%, 5%, and 10%). They show that our estimation model has relatively high accuracy in estimating PSNR even with a low sampling rate. For example, for both SZ and ZFP with a 5% sampling rate, the average PSNR estimation errors are within 2% for the ATM data sets and within 4% for the Hurricane data sets.
As for the bit-rate estimation, the experiments on the ATM and Hurricane data sets show that the bit-rate values estimated for SZ are always lower than the real bit-rate values after compression, and the estimation error can be up to 19% in some cases. The reason is that our model adopts Shannon entropy (i.e., Equation (6)) to approximate the bit rate for SZ. The entropy value is the theoretical optimum, while a practical entropy encoding algorithm (such as Huffman encoding) may not reach it. This situation typically happens when the data set has many similar values and is therefore easy to compress with a high compression ratio.
To address this issue, we improve the estimation accuracy for SZ by introducing a positive offset, which is set to 0.5 bits/value based on our experiments with real-world simulation data. Tables II, III, IV, and V present the accuracy (average and standard deviation of the relative estimation error) of the offset-based estimation for SZ and the original estimation approach for ZFP. Our final estimation model predicts both the bit rate and PSNR accurately for the two compressors. In relative terms, the bit-rate estimation errors fall into the interval [-8.5%, 7.5%] for SZ; negative values mean that the estimated values are lower than the real values. As for ZFP, the bit-rate estimation errors are within 5.7% on the ATM data sets and 0.9% on the Hurricane data sets when the sampling rate is 5% or higher. The PSNR estimation errors vary from -3.5% to -0.6% when the sampling rate is set to 5%. Hence, we suggest setting the sampling rate to 5% in practice, which also has little time cost (presented later).
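The offset-corrected bit-rate estimate for SZ amounts to a Shannon entropy computation over the quantization codes plus the constant offset; a minimal sketch with an interface of our own choosing:

```python
import math
from collections import Counter

def estimate_bitrate_entropy(quant_codes, offset=0.5):
    """Shannon entropy of the quantization codes in bits/value, plus a fixed
    positive offset (0.5 bits/value, the empirical choice above) to account
    for Huffman coding not reaching the entropy bound exactly."""
    n = len(quant_codes)
    counts = Counter(quant_codes)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy + offset
```

Two equally frequent codes give an entropy of 1 bit/value and hence an estimate of 1.5 bits/value.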
As illustrated in Tables II and III, our bit-rate estimation is more accurate for ZFP than for SZ in most instances. The reason is that we estimate SZ's bit rate by calculating the entropy value (i.e., Equation (6)) because of the entropy encoding step (Huffman coding) adopted in SZ; as mentioned above, the entropy value is the theoretical optimal bit rate (a lower bound), which leads to a certain estimation error. The tables also show that our PSNR estimation is more accurate for SZ than for ZFP under all tested sampling rates. The key reason is that the symmetric distribution of prediction errors in SZ is not related to its prediction accuracy, whereas the staircase shape of transformed coefficients in ZFP is highly dependent on its transformation efficiency. Therefore, our PSNR modeling based on SZ's quantization errors is more accurate than that based on ZFP's truncation errors. We also note that the standard deviation of the bit-rate error is much higher for ZFP than for SZ on the ATM data sets, as shown in Table IV. This is because ZFP's block orthogonal transformation may have low decorrelation efficiency on certain fields in the ATM data sets; the transformed coefficients can then still be highly correlated, and the staircase shape (shown in Figure 5) does not always hold on these fields, resulting in large bit-rate error fluctuations and a relatively high standard deviation. In addition, since our PSNR estimation is based on controlling the maximum norm-based compression error, the estimated PSNRs are always lower than the real PSNRs, leading to the negative PSNR errors shown in Tables II and III. Finally, it is worth noting that the Hurricane data sets have more high-compression-ratio variables than the ATM data sets; in other words, the Hurricane data sets are relatively easier to compress.
Hence, using the entropy value (the theoretical optimum) to estimate the bit rate is more accurate for the Hurricane data sets than for the ATM data sets. Consequently, with the 0.5 bits/value offset for SZ, the bit-rate errors are always negative on the ATM data sets (as shown in Table II) and positive on the Hurricane data sets (as shown in Table III). In future work, we can further improve our estimation method by adapting the offset to the compressibility of the data set, for example, using a smaller offset when the data set is easy to compress.
We next evaluate the selection accuracy, which is calculated as the ratio of the number of correct selections to the total number of variables or data sets; a correct selection means our model selects the best-fit compression technique. The selection accuracy is 98.7% on the Hurricane data sets and 88.3% on the ATM data sets. In fact, the 1.3% of wrong selections brings only 0.08% compression-ratio degradation on the Hurricane data sets, and the 11.7% of wrong selections leads to only 3.3% compression-ratio degradation on the ATM data sets, as shown in Figure 7. The reason the wrong selections lead to so little degradation is that almost all of them happen when the two compressors exhibit close bit rates at the same PSNR, so selecting either one barely affects the overall compression quality.
Table II. Average relative estimation errors on the ATM data sets (sampling rates 1%, 5%, 10%).

             1%            5%            10%
           SZ    ZFP     SZ    ZFP     SZ    ZFP
Bit rate   7.5%  5.7%    7.4%  5.7%    7.3%  5.6%
PSNR       2.5%  4.1%    1.1%  2.0%    0.6%  1.6%
Table III. Average relative estimation errors on the Hurricane data sets (sampling rates 1%, 5%, 10%).

             1%            5%            10%
           SZ    ZFP     SZ    ZFP     SZ    ZFP
Bit rate   4.5%  8.0%    8.5%  0.9%    4.6%  0.9%
PSNR       2.6%  6.3%    1.1%  3.5%    0.8%  3.1%
Table IV. Standard deviation of relative estimation errors on the ATM data sets (sampling rates 1%, 5%, 10%).

             1%             5%             10%
           SZ    ZFP      SZ    ZFP      SZ    ZFP
Bit rate   8.9%  23.9%    8.8%  23.6%    8.8%  23.5%
PSNR       5.6%  6.0%     3.1%  4.0%     1.5%  3.8%
Table V. Standard deviation of relative estimation errors on the Hurricane data sets (sampling rates 1%, 5%, 10%).

              1%              5%             10%
           SZ     ZFP      SZ     ZFP     SZ     ZFP
Bit rate   10.4%  11.9%    16.0%  2.0%    10.8%  3.1%
PSNR       2.2%   5.1%     1.2%   3.3%    2.0%   1.0%
Table VI. Time overhead of our selection method relative to the compression time of SZ and ZFP (sampling rates 1%, 5%, 10%).

              1%            5%            10%
            SZ    ZFP     SZ    ZFP     SZ    ZFP
NYX         1.4%  1.2%    5.6%  4.7%    9.8%  8.4%
ATM         1.5%  1.9%    4.9%  6.3%    9.2%  11.9%
Hurricane   1.3%  1.7%    5.4%  7.2%    9.2%  12.5%
6.3 Overhead Analysis
Next, we analyze the overhead of our automatic online selection method with respect to both time and memory.
6.3.1 Time overhead
The time overhead comes from two parts: the transformation of sampled data points in Step 1 (as shown in Figure 2) and the estimation of compression quality in Step 2 (also shown in Figure 2). For the first part, the overhead of the sampled data transformation scales linearly with the sampling rate (denoted by s). Hence, if Stage I takes a fraction (denoted by p) of the total compression time, this overhead is p·s of the compression time. For example, p is up to 60% in our experiments, so the time overhead of the sampled data transformation is up to 0.6s of SZ's compression time. For the second part, the time complexity of estimating the compression quality with vector quantization, based on Equations (6) and (8), is proportional to the number of quantization bins, which is very small in general compared with the data size N; with embedded coding, the complexity is O(sN). Therefore, the overall time overhead can be expressed as O(sN) with a low constant coefficient.
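The dominant term of this overhead model is easy to check numerically; the fractions below are illustrative values, not measurements:

```python
def sampling_transform_overhead(stage1_fraction, sampling_rate):
    """Transformation overhead relative to the total compression time: the
    sampled transform costs sampling_rate of Stage I, which itself takes
    stage1_fraction of the whole compression time."""
    return stage1_fraction * sampling_rate
```

For example, with Stage I taking 60% of the compression time and a 5% sampling rate, the transformation overhead is about 3% of the compression time.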
Table VI shows the time overhead on the NYX, ATM, and Hurricane data sets compared with the compression time of SZ and ZFP. It illustrates that the time overhead scales linearly with the sampling rate, consistent with our analysis.
6.3.2 Memory overhead
The memory overhead results from storing the approximate probability density function (PDF) and is proportional to the number of bins used to represent the PDF. Although this number is larger than the number of quantization bins, it is still very small compared with the data size. Given the dimensions of the fields in the ATM and Hurricane data sets, the memory overheads are about 1.0% and 0.3%, respectively.
6.4 Analysis of Adaptability between Selection Methods Based on Fixed-PSNR vs. Fixed-Maximum-Error
We compare our proposed selection method based on fixed PSNR with the solution based on fixed maximum error (proposed by Lu et al. [11]) using the NYX, ATM, and Hurricane data sets, as shown in Figure 6. Their solution simply selects the compressor with the highest compression ratio under a fixed error bound (called selection based on error bound). Unlike their work, which adopted a pointwise relative error bound [11], we improve their selection method by using the absolute error bound instead, since both SZ and ZFP have better rate-distortion in absolute error bound mode than in pointwise relative error bound mode, as confirmed in previous studies [25, 43]. Specifically, for each data field, we set the absolute error bound to a fixed fraction of its value range. Figure 6(a) shows that the selection method based on error bound always chooses SZ as the best-fit compressor for all tested fields, because SZ always yields higher compression ratios than ZFP on these fields for a given error bound. We note, however, that ZFP over-preserves the compression error with respect to the user-set error bound; thus, ZFP may have a higher PSNR than SZ even when its compression ratio is lower. Our proposed method is designed to select the compressor that has the lower bit rate (i.e., the higher compression ratio) at the same PSNR (called selection based on rate-distortion), leading to a better overall rate-distortion result. Figure 6(b) shows that our method selects different best-fit compressors for the different fields in the tested data sets.
6.5 Empirical Performance Evaluation
We evaluate the overall performance of our proposed solution in parallel. Let us first consider the compression-ratio improvement achieved by our compressor. Figure 7 shows the compression ratios of SZ, ZFP, and our solution on the NYX, ATM, and Hurricane data sets with different error bounds. Our solution outperforms both SZ and ZFP because our online selection method attempts to select the better compression approach for each field in the data sets. Note that the optimum bar represents the compression ratios in an ideal case, assuming that the best-fit compressor is always selected for every field. Specifically, under the three tested value-range-based relative error bounds, the compression ratio of our solution outperforms that of the worst solution by 62%, 36%, and 19% on the Hurricane data sets, by 28%, 38%, and 20% on the ATM data sets, and by 70%, 17%, and 12% on the NYX data sets, respectively. We compare our solution with the worst solution because our proposed selection method can almost always select the best-fit compressor, whereas a user is likely to keep using the same, possibly worst, compressor for all data sets.
In Figures 8 and 9, we present the throughputs (in GB/s) of storing and loading data to/from GPFS with different solutions. We increase the scale from 1 to 1,024 processes. We set the value-range-based relative error bound to a reasonable value [8]. We run each experiment five times and use the average time to calculate the throughputs. The storing and loading throughputs are calculated from the compression/decompression time and the I/O time. We compare our solution with two solutions based on the SZ and ZFP compressors and with a baseline solution, which stores and loads the uncompressed data directly without any compression. Figures 8 and 9 illustrate that our optimized compressor achieves the highest storing and loading throughputs compared with SZ and ZFP; it outperforms the second-best solution by 68% in storing throughput and by 79% in loading throughput with 1,024 processes. Our solution has higher throughputs because it achieves higher compression ratios than both SZ and ZFP with little extra overhead, so the time for writing and reading data is reduced significantly. Similarly, SZ has higher overall throughputs than ZFP because it achieves a higher overall compression ratio on the tested data sets at the same PSNR, although the compression/decompression rates of SZ are lower than those of ZFP [8]. We note that the compression/decompression rates show a linear speedup with the number of processes (as illustrated in Figure 10 in [8]) and that more processes lead to higher I/O contention and data management cost in GPFS when writing/reading data simultaneously. Hence, we expect the performance gains of our solution over SZ and ZFP to further increase with scale because of the inevitable I/O bandwidth bottleneck.
7 Conclusion
In this paper, we propose a novel online, low-overhead selection method that can select the best-fit lossy compressor between two leading compressors, SZ and ZFP, optimizing the rate-distortion for HPC data sets. To the best of our knowledge, this is the first attempt to derive such an approach. We develop a generic open-source toolkit/library under a BSD license. We evaluate our solution on real-world production HPC scientific data sets across multiple domains in parallel with up to 1,024 cores. The key findings are as follows.

The average error of our bit-rate (i.e., compression ratio) estimation with the default sampling rate is within 8.5% and 5.7% for SZ and ZFP, respectively.

The average error of our PSNR (i.e., data distortion) estimation with the default sampling rate is within 1.1% and 3.5% for SZ and ZFP, respectively.

The accuracy of selecting the best-fit compressor with the default sampling rate is 88.3% to 98.7%, with little analysis/estimation time overhead (within 5.4% and 7.3% of the compression time of SZ and ZFP, respectively).

Our solution improves the compression ratio by 12% to 70% compared with that of SZ and ZFP, at the same distortion (PSNR) of the data.

The overall performance of loading and storing data on 1,024 cores can be improved by 79% and 68%, respectively, with our solution compared with the second-best solution.
We plan to extend our optimization solution (such as the estimation model) to more error-controlled lossy compression techniques, including more quantization approaches and block-based transformations, to further improve the compression quality for more HPC scientific data sets.
Acknowledgments
This research was supported by the Exascale Computing Project (ECP), Project Number 17-SC-20-SC, a collaborative effort of two DOE organizations (the Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, to support the nation's exascale computing imperative. The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (Argonne). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. We acknowledge the computing resources provided by LCRC at Argonne National Laboratory.
References
 [1] Community Earth Simulation Model (CESM), http://www.cesm.ucar.edu/, 2018, online.
 [2] A. H. Baker, H. Xu, J. M. Dennis, M. N. Levy, D. Nychka, S. A. Mickelson, J. Edwards, M. Vertenstein, and A. Wegener, “A methodology for evaluating the impact of data compression on climate simulation data,” in Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing. ACM, 2014, pp. 203–214.
 [3] S. Habib, A. Pope, H. Finkel, N. Frontiere, K. Heitmann, D. Daniel, P. Fasel, V. Morozov, G. Zagaris, T. Peterka et al., “HACC: Simulating sky surveys on state-of-the-art supercomputing architectures,” New Astronomy, vol. 42, pp. 49–65, 2016.
 [4] Mira supercomputer, https://www.alcf.anl.gov/mira/, 2018, online.
 [5] Aurora Supercomputer, https://www.alcf.anl.gov/programs/auroraesp, 2018, online.
 [6] N. Liu, J. Cope, P. Carns, C. Carothers, R. Ross, G. Grider, A. Crume, and C. Maltzahn, “On the role of burst buffers in leadershipclass storage systems,” in 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies. IEEE, 2012, pp. 1–11.
 [7] S. Di and F. Cappello, “Fast error-bounded lossy HPC data compression with SZ,” in 2016 IEEE International Parallel and Distributed Processing Symposium. IEEE, 2016, pp. 730–739.
 [8] D. Tao, S. Di, Z. Chen, and F. Cappello, “Significantly improving lossy compression for scientific data sets based on multidimensional prediction and error-controlled quantization,” in 2017 IEEE International Parallel and Distributed Processing Symposium. IEEE, 2017, pp. 1129–1139.
 [9] X. Liang, S. Di, D. Tao, S. Li, S. Li, H. Guo, Z. Chen, and F. Cappello, “Error-controlled lossy compression optimized for high compression ratios of scientific datasets,” in 2018 IEEE International Conference on Big Data (Big Data). IEEE, 2018, to appear.
 [10] P. Lindstrom, “Fixed-rate compressed floating-point arrays,” IEEE Transactions on Visualization and Computer Graphics, vol. 20, no. 12, pp. 2674–2683, 2014.
 [11] T. Lu, Q. Liu, X. He, H. Luo, E. Suchyta, J. Choi, N. Podhorszki, S. Klasky, M. Wolf, T. Liu et al., “Understanding and modeling lossy compression schemes on hpc scientific data,” in 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2018, pp. 348–357.
 [12] D. Tao, S. Di, H. Guo, Z. Chen, and F. Cappello, “Zchecker: A framework for assessing lossy compression of scientific data,” The International Journal of High Performance Computing Applications, p. 1094342017737147, 2017.
 [13] P. Deutsch, “GZIP file format specification version 4.3,” RFC, vol. 1952, pp. 1–12, 1996. [Online]. Available: https://doi.org/10.17487/RFC1952
 [14] Bzip2, http://www.bzip.org/, 2018, online.
 [15] M. Burtscher and P. Ratanaworabhan, “FPC: A high-speed compressor for double-precision floating-point data,” IEEE Transactions on Computers, vol. 58, no. 1, pp. 18–31, 2009.
 [16] P. Lindstrom and M. Isenburg, “Fast and efficient compression of floating-point data,” IEEE Transactions on Visualization and Computer Graphics, vol. 12, no. 5, pp. 1245–1250, 2006.
 [17] L. A. B. Gomez and F. Cappello, “Improving floating point compression through binary masks,” in 2013 IEEE International Conference on Big Data. IEEE, 2013, pp. 326–331.
 [18] S. Lakshminarasimhan, N. Shah, S. Ethier, S. Klasky, R. Latham, R. Ross, and N. F. Samatova, “Compressing the incompressible with ISABELA: In-situ reduction of spatio-temporal data,” in European Conference on Parallel Processing. Springer, 2011, pp. 366–379.
 [19] N. Sasaki, K. Sato, T. Endo, and S. Matsuoka, “Exploration of lossy compression for application-level checkpoint/restart,” in 2015 IEEE International Parallel and Distributed Processing Symposium. IEEE, 2015, pp. 914–922.
 [20] VAPOR, https://www.vapor.ucar.edu/, 2018, online.

 [21] Z. Chen, S. W. Son, W. Hendrix, A. Agrawal, W.-k. Liao, and A. Choudhary, “NUMARCK: Machine learning algorithm for resiliency and checkpointing,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press, 2014, pp. 733–744.
 [22] D. Tao, S. Di, Z. Chen, and F. Cappello, “Exploration of pattern-matching techniques for lossy compression on cosmology simulation data sets,” in High Performance Computing - ISC High Performance 2017 International Workshops, DRBSD, ExaComm, HCPM, HPC-IODC, IWOPH, IXPUG, P^3MA, VHPC, Visualization at Scale, WOPSSS, Frankfurt, Germany, June 18-22, 2017, Revised Selected Papers, 2017, pp. 43–54.
 [23] G. K. Wallace, “The JPEG still picture compression standard,” IEEE Transactions on Consumer Electronics, vol. 38, no. 1, pp. xviii–xxxiv, 1992.
 [24] A. Huebl, R. Widera, F. Schmitt, A. Matthes, N. Podhorszki, J. Y. Choi, S. Klasky, and M. Bussmann, “On the scalability of data reduction techniques in current and upcoming hpc systems from an application perspective,” in International Conference on High Performance Computing. Springer, 2017, pp. 15–29.
 [25] P. Lindstrom, “Zfp compression ratio and quality,” https://computation.llnl.gov/projects/floatingpointcompression/zfpcompressionratioandquality/, 2018, online.
 [26] P. Lindstrom, “Error distributions of lossy floating-point compressors,” in Joint Statistical Meetings 2017, 2017, pp. 2574–2589.
 [27] X. Liang, S. Di, D. Tao, Z. Chen, and F. Cappello, “An efficient transformation scheme for lossy data compression with pointwise relative error bound,” in 2018 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 2018, pp. 179–189.
 [28] A. M. Gok, S. Di, Y. Alexeev, D. Tao, V. Mironov, X. Liang, and F. Cappello, “Pastri: Errorbounded lossy compression for twoelectron integrals in quantum chemistry,” in 2018 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 2018, pp. 1–11.
 [29] D. Tao, S. Di, Z. Chen, and F. Cappello, “Indepth exploration of singlesnapshot lossy compression techniques for nbody simulations,” in Big Data (Big Data), 2017 IEEE International Conference on. IEEE, 2017, pp. 486–493.
 [30] H. Fu, C. He, B. Chen, Z. Yin, Z. Zhang, W. Zhang, T. Zhang, W. Xue, W. Liu, W. Yin et al., “18.9pflops nonlinear earthquake simulation on sunway taihulight: enabling depiction of 18hz and 8meter scenarios,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2017, p. 2.
 [31] H. Fu, J. Liao, J. Yang, L. Wang, Z. Song, X. Huang, C. Yang, W. Xue, F. Liu, F. Qiao et al., “The sunway taihulight supercomputer: system and applications,” Science China Information Sciences, vol. 59, no. 7, p. 072001, 2016.
 [32] Blosc compressor, http://blosc.org/, 2018, online.
 [33] FastLZ, http://fastlz.org/, 2018, online.
 [34] LZ4, http://lz4.github.io/lz4/, 2018, online.
 [35] Snappy, https://google.github.io/snappy/, 2018, online.
 [36] Zlib, http://www.zlib.net/, 2018, online.
 [37] Zstandard, http://facebook.github.io/zstd/, 2018, online.
 [38] J. Woodring, S. Mniszewski, C. Brislawn, D. DeMarle, and J. Ahrens, “Revisiting wavelet compression for large-scale climate data using JPEG 2000 and ensuring data precision,” in Large Data Analysis and Visualization (LDAV), 2011 IEEE Symposium on. IEEE, 2011, pp. 31–38.
 [39] A. Hore and D. Ziou, “Image quality metrics: Psnr vs. ssim,” in Pattern recognition (icpr), 2010 20th international conference on. IEEE, 2010, pp. 2366–2369.
 [40] H. S. Malvar and D. H. Staelin, “The lot: Transform coding without blocking effects,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 4, pp. 553–559, 1989.
 [41] A. Gersho and R. M. Gray, Vector quantization and signal compression. Springer Science & Business Media, 2012, vol. 159.
 [42] E. Ordentlich, M. Weinberger, and G. Seroussi, “A lowcomplexity modeling approach for embedded coding of wavelet coefficients,” in Data Compression Conference, 1998. DCC’98. Proceedings. IEEE, 1998, pp. 408–417.
 [43] S. Di, D. Tao, and F. Cappello, “An efficient approach to lossy compression with pointwise relative error bound,” https://web.njit.edu/~qliu/assets/sc17drbsd2shengdi.pdf, 2018, online.
 [44] A. C. Bauer, H. Abbasi, J. Ahrens, H. Childs, B. Geveci, S. Klasky, K. Moreland, P. O’Leary, V. Vishwanath, B. Whitlock et al., “In situ methods, infrastructures, and applications on high performance computing platforms,” in Computer Graphics Forum, vol. 35, no. 3. Wiley Online Library, 2016, pp. 577–597.
 [45] L. Ibarria, P. Lindstrom, J. Rossignac, and A. Szymczak, “Out-of-core compression and decompression of large n-dimensional scalar fields,” in Computer Graphics Forum, vol. 22, no. 3. Wiley Online Library, 2003, pp. 343–348.
 [46] S. Guthe and W. Straßer, “Real-time decompression and visualization of animated volume data,” in Proceedings of IEEE Visualization 2001 (VIS ’01). IEEE, 2001, pp. 349–572.
 [47] D. A. Huffman, “A method for the construction of minimum-redundancy codes,” Proceedings of the IRE, vol. 40, no. 9, pp. 1098–1101, 1952.
 [48] I. H. Witten, R. M. Neal, and J. G. Cleary, “Arithmetic coding for data compression,” Communications of the ACM, vol. 30, no. 6, pp. 520–540, 1987.
 [49] C. E. Shannon, “A mathematical theory of communication,” ACM SIGMOBILE Mobile Computing and Communications Review, vol. 5, no. 1, pp. 3–55, 2001.
 [50] C.-J. Lian, K.-F. Chen, H.-H. Chen, and L.-G. Chen, “Analysis and architecture design of block-coding engine for EBCOT in JPEG 2000,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 3, pp. 219–230, 2003.
 [51] H.-C. Fang, Y.-W. Chang, T.-C. Wang, C.-J. Lian, and L.-G. Chen, “Parallel embedded block coding architecture for JPEG 2000,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 9, pp. 1086–1097, 2005.
 [52] J.-S. Chiang, C.-H. Chang, Y.-S. Lin, C.-Y. Hsieh, and C.-H. Hsia, “High-speed EBCOT with dual context-modeling coding architecture for JPEG 2000,” in Proceedings of the 2004 International Symposium on Circuits and Systems (ISCAS ’04), vol. 3. IEEE, 2004, pp. III–865.
 [53] Blues cluster, https://www.lcrc.anl.gov/systems/resources/blues/, 2018, online.
 [54] B. Welch, “POSIX I/O extensions for HPC,” in Proceedings of the 4th USENIX Conference on File and Storage Technologies (FAST), 2005.
 [55] R. Thakur, W. Gropp, and E. Lusk, “On implementing MPI-IO portably and with high performance,” in Proceedings of the Sixth Workshop on I/O in Parallel and Distributed Systems. ACM, 1999, pp. 23–32.
 [56] A. Turner, “Parallel I/O Performance,” https://www.archer.ac.uk/training/virtual/20170208ParallelIO/2017_02_ParallelIO_ARCHERWebinar.pdf, 2017, online.