Error-bounded lossy compressors (zfp; fpzip; sz; sz2; sz3) have been effective in significantly reducing large volumes of data produced by scientific simulations (ssem; hacc; impact-hpdc14) or instruments/devices (APSU; LCLSII), while controlling the data distortion based on user’s requirement. Accordingly, error-bounded lossy compression has been thought of as one of the best ways to resolve today’s big science data issue.
Silent data corruptions (SDC), however, are nonnegligible when running lossy compressors, as discussed below.
If a lossy compressor is employed by a high performance computing (HPC) application, it will likely need to deal with a vast volume of data produced by extreme-scale simulations, so various possible failures/errors have to be taken into account. Many existing solutions, such as multi-level checkpoint/restart (CR) mechanisms, focus only on fail-stop issues that are perceived by hardware or operating systems. By contrast, soft errors, a.k.a. silent data corruptions (SDCs), may silently change data in memory, cache, or even registers because of unexpected malfunctions in system components. Such errors are more dangerous than fail-stop issues, because they may silently bias the results at the end of the simulation.
Remote sensor technology continues to increase in fidelity for space systems, so large amounts of data are being collected by orbiting satellites or space vehicles and transmitted to other stations (e.g., ground stations, other satellites). However, the devices (such as interplanetary space probe) deployed in space would be more error-prone than the regular devices on the earth. To address this issue, some fault tolerance techniques (Lin-ft-space; Jacobs-ft-space) have been proposed for specific algorithms such as matrix multiplication and FFT. However, when lossy compressors are used by the space systems to compress image data, the whole compression procedure has to be protected against soft errors. Otherwise, the corrupted data may let scientists miss important findings or draw a misleading conclusion.
There are no lossy compressors designed specifically with possible SDCs in mind. Mat Noor and Vladimirova (lossless-ft-space) made a parallel fault-tolerant Integer KLT implementation for lossless hyperspectral image compression on board satellites. Unlike lossless compression, designing an SDC detection/correction method for lossy compression is more challenging, since the decompressed data deviate from the original data even when there is no SDC.
In this paper, we propose the SDC resilient lossy compression based on SZ (sz3) - one of the best generic error-bounded lossy compressors for large-scale scientific datasets verified by many studies (sz3; understand-compression-ipdps18). Not only can our solution detect the SDCs during the compression/decompression but it can also automatically correct the SDCs in some cases.
In general, SDC errors can be classified into two categories, memory errors and computation errors, and our solution can protect SZ against both. A memory error is introduced by a soft error that silently changes a data value in memory to a different value. A computation error is introduced by a soft error in a logic unit, which yields a wrong computation result.
The main idea of this paper is analyzing each subroutine in the SZ framework elaborately and designing a series of fault tolerance strategies carefully, such that the lossy compressor can be protected against SDCs effectively with little overhead. We summarize the detailed contributions as follows.
We comprehensively analyze each subroutine of SZ with respect to possible memory/computation errors. The analysis unveils that some parts of SZ are naturally error resilient, while other parts are fragile to SDCs. The SDC errors striking these parts may cause wrong decompressed data. Thus, it is critical to protect those parts by specific fault tolerance strategies.
We propose an efficient SDC resilient lossy compression solution based on the SZ compression framework. We reorganize the SZ compression model by dividing each dataset into small blocks and making the compression work totally independent across blocks. Such a design is able to control the impact of SDCs on the decompressed data. On the other hand, we design a series of SDC resilient strategies based on SZ’s principle, which can not only detect SDCs in most cases but also correct SDCs in some cases.
We implement our SDC resilient compressor based on our elaborate design. We evaluate its fault tolerance ability in the presence of SDCs and the corresponding overhead in the fault-free situation, as well as the possible impact on the compression quality. We perform the experiments with real-world simulation data across multiple science domains and image data taken by the New Horizons space probe (new-horizons) in space. Experiments show that our designed independent-block based compression model has very limited execution overheads (10% in most cases). On the other hand, the experiments also confirm that our fault tolerance solution yields little overhead (7.3% at 2048 cores) and correct decompression results in the presence of soft errors. When injecting one and two errors, respectively, during the compression at runtime, our solution can significantly improve the resilience of SZ (92% running cases with correct decompressed data compared to only 71.2% and 47% of the original SZ).
We organize our paper as follows. In Section 2, we discuss related work. In Section 3, we formulate the research problem in terms of the SZ compression framework. In Section 4, we provide an in-depth analysis of the fault tolerance ability of SZ. In Section 5, we present our fault tolerance methodology. Then we evaluate our methods in Section 6. Lastly, we conclude the paper and discuss future work.
2. Related Work
We discuss the related work in two facets: the fault tolerance ability of existing lossy compressors and the existing solutions designed to protect other applications against SDCs.
So far, there have been many lossy compressors (isabela; fpzip; zfp; sz; sz2; sz3; numarck; ssem) developed to significantly reduce the large volume of data produced by scientific simulations. All the lossy compressors, basically, could be classified into two categories - transform-based compression (zfp; ssem) and prediction-based compression (fpzip; sz; sz2; sz3). None of the transform-based compressors are immune to the SDCs. In fact, if the data in the transformed domain are corrupted because of memory or computation error, multiple data values in the original data domain could be affected. As for the prediction-based model, the SDC issue could be also fatal to the reconstruction of data. In SZ, for example, if the data prediction on some data point is corrupted silently during the compression, the predicted value on that data point would be inconsistent during the compression and decompression, leading to uncontrolled decompression errors.
Much work has been done to fight against memory errors and computation errors, respectively. From the perspective of hardware, error correcting code (ECC) has been implemented to detect and correct bit flips in memory. ECC can correct single-bit memory errors but cannot detect or correct any computation errors. Hardware redundancy adopts redundant hardware to execute the same application with the same input and compares the outputs from the different hardware units. Software redundancy means running the same program on the same hardware multiple times and comparing the outputs from different runs. Thus, double modular redundancy (DMR) is needed for error detection and triple modular redundancy (TMR) is needed for error correction, and both incur the overhead of the repeated executions.
Such high overhead of modular redundancy to handle SDCs has motivated algorithm based fault tolerance (ABFT) (abft-huang1984), which aims to exploit the special characteristics of an application or algorithm to detect and correct soft errors. Despite the fact that ABFT requires a significant algorithm integration effort, the tiny overhead of ABFT makes it very attractive. Most of the existing ABFT methods, however, focus on popular arithmetic algorithms such as matrix operations (abft-huang1984). To the best of our knowledge, no ABFT work has been done for lossy compression algorithms, which is a significant gap in the context of scientific data compression.
3. Background and Problem Formulation
3.1. SZ Lossy Compression Framework
SZ (sz3) is an error-bounded lossy compressor designed for scientific data. According to the recent studies (sz2; sz3; understand-compression-ipdps18), it can effectively reduce the data size for many scientific simulations, such as climate simulation, cosmological simulation, quantum simulation, and chemical simulation.
Basically, SZ includes four critical stages during the compression: (1) data prediction, (2) linear-scaling quantization, (3) variable-length encoding, and (4) lossless compression such as Zstd (zstd). In the data prediction step, SZ (sz16; sz17; sz3) splits the whole dataset into multiple blocks of size 6x6x6 and then performs the compression in each block based on two alternative prediction methods - an improved Lorenzo predictor (lorenzo) or linear regression. The second step, linear-scaling quantization, converts each raw data value (such as a floating-point value) to an integer index (or quantization bin) based on the user-set error bound and the difference between the predicted value and the original value. The remaining two steps reduce the data size by performing Huffman encoding on the quantization bin index array and adopting lossless compression. This may significantly reduce the data size because the distribution of quantization bin indices is likely to be fairly non-uniform, especially when the data are relatively smooth in space.
3.2. Algorithm based fault tolerance (ABFT)
ABFT achieves SDC detection and correction by leveraging the characteristics of the algorithms. At a high level, ABFT detects SDCs by checking whether some invariant relationship still holds and corrects the errors by an additional set of computations. Each ABFT technique has to be developed for a particular approach composed of one or more algorithms. We give an example to illustrate how ABFT detects/corrects soft errors in general. Given an array $a$ at timestamp $t_1$, suppose that at a later timestamp $t_2$ one attempts to detect whether a memory error corrupted a value in $a$ during the period $[t_1, t_2]$. In order to detect the error, we can leverage a checksum ($sum = \sum_i a[i]$). Specifically, we can calculate the sum of $a$ at $t_1$ and $t_2$, respectively. Suppose the two calculated sums are denoted by $sum_1$ and $sum_2$, respectively. If $sum_1 \neq sum_2$, we can conclude that an SDC error must have happened to $a$ during the period $[t_1, t_2]$. In order to locate where the SDC error is in the array $a$, we can leverage an extra computation: $isum = \sum_i i \cdot a[i]$. Specifically, assuming the value at index $j$ is corrupted during the time period $[t_1, t_2]$, according to $sum_1$, $sum_2$, $isum_1$, and $isum_2$, one can derive the SDC location index $j = (isum_2 - isum_1)/(sum_2 - sum_1)$, and the original value can then be restored by subtracting $sum_2 - sum_1$ from $a[j]$. This example illustrates that it is viable to detect and even correct a single-data-point error just by introducing a few more light-weight computations.
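The checksum example above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the array holds integers (so the sums are exact), at most one element is corrupted, and the function names `checksums` and `detect_and_correct` are hypothetical, not part of SZ.

```python
def checksums(a):
    """Return (sum, isum): the plain checksum and the index-weighted checksum."""
    s = sum(a)
    isum = sum(i * x for i, x in enumerate(a))
    return s, isum


def detect_and_correct(a, reference):
    """Compare current checksums with the reference pair taken earlier.

    Returns None if no corruption is detected; otherwise repairs the single
    corrupted element in place and returns its index.
    Assumption: at most one element of `a` changed since `reference` was taken.
    """
    sum1, isum1 = reference
    sum2, isum2 = checksums(a)
    if sum2 == sum1:
        return None                      # checksums agree: nothing detected
    delta = sum2 - sum1                  # corrupted value minus original value
    j = (isum2 - isum1) // delta         # location index of the corrupted element
    a[j] -= delta                        # restore the original value
    return j


a = [5, 8, 13, 21, 34]
ref = checksums(a)                       # taken at the earlier timestamp
a[3] ^= 1 << 4                           # simulate a bit flip in memory
located = detect_and_correct(a, ref)     # run at the later timestamp
```

With floating-point data the two sums are subject to round-off, which is why Section 5.4 computes the checksums on integer interpretations of the bits instead.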
3.3. Error model and assumptions
We identify the error model in this subsection. In our study, we focus on both memory errors and computation errors. As for the memory error model, errors could randomly happen anywhere in the whole memory at any time during the lifetime of a process in the form of bit-flips. As for computation errors, their impact could appear in the form of bit-flips in the computation results. Similar to other ABFT research, the flow control error (FCE) is beyond the scope of our work because the general solutions are designed at the compiler/instruction/hardware level (cfc-not). Moreover, it is too difficult to comprehensively detect FCEs even for professional FCE detection tools, according to recent studies (cfc-not). Without loss of generality, we assume that the occurrence probability of multiple computation errors or memory errors is extremely low for one block of data during one compression, since one block is generally very small (such as 10x10x10 in size). Similar to other ABFTs (xinfft; panruosvd; jieyangipdps), we assume the checksum itself is error free because of its tiny computation time compared with the compression time.
3.4. Formulation of SDC Detection Evaluation in SZ
As mentioned previously, SZ has four stages in the whole course of compression, and we mainly focus on the single-data-point SDC error (either computation error or memory error) happening at each stage, without loss of generality. In addition, we mainly focus on the dominant data structures (i.e., all the data structures whose size is linear in the number of data points), which take the majority of the memory footprint in SZ and are therefore the major objects affected by SDCs, if any. The remaining parts (called negligible space in the following text) can be considered error free. Which parts take negligible space will be discussed later in this paper.
The objective of our work is to detect and correct both computational errors and memory errors in each stage of SZ compression as much as possible. There are three important metrics to evaluate our designed SDC resilient lossy compressor, as listed below.
SDC detection/correction ability. What kinds of SDCs could be detected or corrected? What is the accuracy and coverage rate of SDC error detection?
Computational Overhead. It is defined as the ratio of the extra computation time to the total original execution time in an error-free situation.
Impact on Compression Result. Can the SDC resilient lossy compressor still respect the user-specified error bound for the decompressed data? What is the compression overhead, i.e., how much is the compression ratio degraded under the SDC resilient compressor compared with the original compressor?
All three evaluation metrics can be applied to any lossy compressor. To the best of our knowledge, this is the first resilience formulation in the context of lossy compression.
4. Resilience Analysis of SZ 2.1
In this section, we analyze the resilience (SDC detection/correction ability and impact) of SZ 2.1 based on its principle.
4.1. SDC Resiliency – Computation error
We analyze SZ’s natural resilience based on when/where the computation error could happen, including the calculation of regression coefficients, the selection of the bestfit predictor by sampling, data prediction and calculation of decompressed data, Huffman encoding, and lossless compression. We call the first two stages ‘prediction preparation’.
4.1.1. SDC resilience in the prediction preparation
A computation error in prediction preparation stage may only lower compression ratio to a certain extent but it would not affect the correctness of decompressed data (i.e., still strictly respecting error bound). That is, the decompressed data is still the golden result in spite of the computation error in prediction preparation. In fact, although the computation error may lead to inaccurate regression coefficients or incorrect bestfit predictor selection, exactly the same coefficients/selection will be used for both compression and decompression. The compression ratio could be affected because the data prediction may be less accurate due to the inaccurate coefficients or incorrect predictor selection.
4.1.2. SDC Resilience in the data prediction and calculation of decompressed data
Data prediction is the most critical step in SZ. In order to guarantee the error bound, the neighboring data values used to predict each data point during the compression have to be exactly the same values to be used during the decompression. That is, SZ needs to obtain the decompressed data values during compression. We demonstrate the key compression procedure in Figure 1 (a), which is conducted in a loop of scanning all data blocks. It involves 5 key steps.
Calculate predicted value (line 2).
Compute the difference between the real value and the predicted value (line 3).
Calculate error quantization bins (line 4).
Calculate the decompressed data (line 6) which will be used to predict the following data points in compression.
Double-check the correctness of the compression based on the given error bound against possible machine epsilon error (line 7-8): specifically, the decompressed value would be reconstructed based on the quantization bin and compared with the true value.
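The five steps above can be sketched as follows. This is a simplified 1D illustration, not the code of Figure 1 (a): it assumes a previous-value predictor in place of the Lorenzo/regression predictors, stores unpredictable values verbatim, and uses hypothetical names (`compress_block`, `decompress_block`, `bin_max`). The point it shows is that the compressor feeds its own reconstructed values back into the prediction, so that decompression reproduces exactly the same values.

```python
def compress_block(data, eb, bin_max=32768):
    """Return (bins, unpredictable, decompressed) for one block.

    bins[i] == 0 marks an unpredictable point; otherwise the signed bin
    index is shifted by bin_max so that stored bins are positive.
    """
    bins, unpredictable, decompressed = [], [], []
    prev = 0.0
    for x in data:
        pred = prev                              # step 1: predicted value
        diff = x - pred                          # step 2: real minus predicted
        q = round(diff / (2 * eb))               # step 3: quantization bin
        if abs(q) < bin_max:
            recon = pred + 2 * eb * q            # step 4: decompressed value
            if abs(recon - x) <= eb:             # step 5: double-check the bound
                bins.append(q + bin_max)
                decompressed.append(recon)
                prev = recon                     # reused to predict the next point
                continue
        bins.append(0)                           # unpredictable data handling
        unpredictable.append(x)
        decompressed.append(x)
        prev = x
    return bins, unpredictable, decompressed


def decompress_block(bins, unpredictable, eb, bin_max=32768):
    """Rebuild the block using the same predictor and the stored bins."""
    out, prev, stored = [], 0.0, iter(unpredictable)
    for b in bins:
        value = next(stored) if b == 0 else prev + 2 * eb * (b - bin_max)
        out.append(value)
        prev = value
    return out


data = [1.0, 1.0004, 1.0011, 5e6, 1.002]
bins, unpred, dec = compress_block(data, eb=1e-3)
restored = decompress_block(bins, unpred, eb=1e-3)
```

If the predicted value or the reconstructed value computed during compression were silently wrong but still passed the double-check, `restored` would no longer match `dec`, and every later point predicted from it would drift as well.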
In the following, we analyze the fault tolerance ability of the key compression procedure upon a computation error occurring in the code segment presented in Figure 1 (a), based on the following cases. We note that the necessary condition to obtain correct decompressed output is that a correct decompressed value must be calculated (type-1) or the unpredictable data handling must be called (type-2) during compression, and that the same data must be reconstructed during decompression (type-3). We refer to these three behaviors in the analysis below.
Case 1 - a computation error happens to line 2. In this case, we need to take into account two possible situations in terms of the deviation of the predicted value affected by the error.
Situation 1: the predicted value is changed by the error so significantly that the quantization bin calculated later on falls outside the maximum quantization range (i.e., the check of bin against bin_max fails). In this situation (zone A in Figure 1 (b)), the decompressed data will still respect the error bound because of the type-2 behavior.
Situation 2: the impact of the SDC error on the predicted value is relatively small, such that the quantization bin is within the maximum quantization range (i.e., the check of bin against bin_max still passes). This may cause a significant error in the decompressed data (zone B, C in Figure 1 (b)) because of a violation of type-3 behavior. The reason is as follows. On the one hand, the double-checking step (line 7) cannot detect such an error, because it would decompress the data point based on the “wrong” predicted value, such that the reconstructed value will still respect the error bound. On the other hand, it is unlikely that the same SDC error would happen again during the decompression, so SZ would get a different predicted value for the current data point in the course of decompression and thus a wrong decompressed value on this data point (violation of type-3 behavior). Even worse, this decompressed value would also be used to predict other data points in the decompression, such that the errors would be propagated throughout the whole dataset.
Case 2 - A computation error happens to line 3 or 4. These two lines are naturally resilient due to the type-2 behavior: no matter how much the calculated quantization bin deviates (zone B or zone A), the unpredictable data compression is always called (line 10 for zone A and line 8 for zone B).
Case 3 - A computation error happens to line 6. This may affect correctness of the decompression data, which will be analyzed based on two possible situations.
Situation 1: the decompressed data value deviates significantly because of the SDC, such that the following double-checking (i.e., line 7-8) falls back to unpredictable compression. So it is resilient because of type-2 behavior.
Situation 2: the decompressed data value is changed only slightly, such that it passes the double-checking step. In this situation, the skewed (wrong) decompressed data value would be used in the prediction of the succeeding data points, and this would lead to inconsistent prediction results between the compression and decompression. Thus it is not resilient in this situation because of a violation of type-3 behavior.
Case 4 - A computation error happens to line 7. Line 7 has very good, but not perfect, resilience. Obviously, if the error turns a false result of line 7 into true, it is resilient because of the unpredictable data solution (type-2 behavior). If the error turns a true result of line 7 into false, it is not resilient because of the impact of machine epsilon. However, in our fault tolerant design, we do not protect this part because the likelihood of this situation is extremely small. This situation happens only when the original real value is located right on the edge of a quantization bin. To be more specific, a test shows that only 24 of the data points (NYX dataset, relative error bound 1E-3) make line 7 true.
4.1.3. SDC resilience in lossless compression
We will show our solutions are able to detect SDCs that occur in lossless compression in Section 5.3.
All in all, in terms of the SZ lossy compression framework, the only concern regarding fault tolerance during the compression procedure is on the correctness of the predicted value (i.e., line 2 in Figure 1 (a)) and the correctness of data decompression during the compression (i.e., line 6). To address this issue, we developed an efficient selective instruction duplication method, to be described in Section 5 in detail.
4.2. SDC Resilience – Memory error
Now, we analyze the resilience against the memory errors occurring in different places, such as input data, regression coefficients and quantization bin index array, respectively.
4.2.1. SDC resilience against memory error in inputs
Since the input data (i.e., original data) occupies a significant portion of the memory footprint, we have to protect it against potential SDC errors. The input data is used in the following steps: 1. computing the regression coefficients; 2. sampling and estimating the compression error of both regression and Lorenzo predictor; 3. data prediction, calculation of the difference between predicted data and original data, and handling of unpredictable data. We find that, for the first two steps, similar to the analysis in Section 4.1.1, a memory error in the input data only impacts the compression ratio and does not affect the correctness of the decompressed data. However, step 3 must use genuine, uncorrupted input data since that is where the compression happens. With a corrupted input in step 3, the decompressed data will be calculated based on that corrupted value, which is obviously SDC prone.
We will leverage the above finding to reduce the overhead of checksum calculations since it discloses the fact that the corrupted values may not affect the correctness of decompressed data in the first 2 steps (i.e., error detection/correction for those parts are not necessary).
4.2.2. SDC resilience against the memory error in regression coefficients
The memory usage of the regression coefficients is found to be very small compared to the overall memory usage, such that this part does not need particular protection. Each data block maintains at most 4 coefficients (for a 3D dataset). Thus, the coefficients take only a fraction of at most 4 divided by the number of data points per block of the memory occupied by the input data. For a 3D example, the block size is usually 8x8x8, which means the coefficients take only about 4/512, i.e., roughly 0.8%, of that memory.
4.2.3. SDC resilience against the memory error in quantization bin index array
In SZ, the quantization bin index array (to be called bin array for simplicity) is an array used to record how much the predicted value deviates from the original value for each data point. The element in the array is a positive integer if the data is predictable; otherwise, the element is 0, indicating that the data needs to be compressed/decompressed by the unpredictable compression method. Obviously, if the bin array is corrupted by a memory error, the decompressed data will not be correct. So, the array is not resilient to memory errors. Also, since the prediction is a critical stage that accounts for a major portion of the overall execution time, the likelihood of an error happening during this stage is higher than in other stages, and thus we have to protect the bin array in this stage. Specifically, we compute two different checksums on each block right after all the data inside the block are processed, such that we are able to detect and correct possible corrupted data by double-checking the checksum values later on (e.g., during the Huffman encoding stage).
5. Error Tolerance Methodology
Our SDC resilient SZ design is done in three aspects. First, we eliminate the data dependency between adjacent blocks; second, we use selective instruction duplication to ensure correct computation; third, we use checksums to detect and correct corrupted values caused by memory errors.
5.1. Blockwise independent design
In the following, we discuss how to eliminate the dependency between blocks, such that any SDC error can be confined within a small block, improving the robustness significantly.
The key difference between the original SZ and our independent-block based compression is that we now treat each block of data separately from the others. Specifically, we apply the prediction and quantization inside each block individually and make sure the compressed data of one block is totally independent of the others’. This requires many changes to the original SZ implementation. For instance, we need to record the compressed size of each block after we finish the compression for that block. Both recording the bin array and Huffman encoding need to be done individually per block.
Another significant advantage in the independent-block based compression design is that one can perform random-access decompression efficiently by specifying a specific region in space. To this end, we implement random-access support in our implementation, such that the decompression speed can be improved significantly if the user just wants to decompress a small region in the whole dataset. The corresponding experimental results will be presented in Section 6. Moreover, such an independent-block based compression also makes the parallelism of SZ much easier to port on many-core architectures, such as GPU.
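A minimal sketch of why recording the per-block compressed sizes enables random-access decompression is given below. It is an assumption-laden illustration rather than the SZ implementation: Python's `zlib` stands in for the whole per-block compression pipeline, and the helper names `compress_blocks` and `decompress_one` are hypothetical.

```python
import zlib
from itertools import accumulate


def compress_blocks(blocks):
    """Compress every block independently and record each compressed size."""
    chunks = [zlib.compress(block) for block in blocks]
    sizes = [len(chunk) for chunk in chunks]
    return b"".join(chunks), sizes


def decompress_one(stream, sizes, k):
    """Decompress only block k, locating it through the recorded sizes."""
    offsets = [0] + list(accumulate(sizes))      # start offset of every block
    return zlib.decompress(stream[offsets[k]:offsets[k + 1]])


blocks = [bytes([i]) * 100 for i in range(4)]    # four toy "data blocks"
stream, sizes = compress_blocks(blocks)
third = decompress_one(stream, sizes, 2)         # touches no other block
```

Because no block depends on its neighbors, a corrupted block can also be decompressed again in isolation, which is what the fault tolerant decompression relies on.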
5.2. Fault tolerant compression
We present our SDC resilient compression method in Algorithm 1. We highlight the lines related to our fault tolerance design in blue font. Lines 3 and 4 calculate checksums for the input data, in order to detect possible SDC errors striking the input data later on. As we discussed in Section 4.2.1, we do not need to detect memory errors in the input data during the computations for regression coefficients and compression error estimation. We only check whether the input data has encountered memory errors before the data prediction gets started (line 11). If a data corruption is detected, it can be located and recovered by the pair of checksums applied on the input data. Then, we protect the quantization bin array against memory errors (lines 24 and 35). Lines 29 and 40 are designed for detecting possible SDC errors occurring in the decompression stage, to be detailed later. For the computation errors, instruction duplication can be used. According to our analysis in Section 4.1, only data prediction (line 18) and calculating decompressed data (line 25) need to be protected by instruction duplication.
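The idea of selective instruction duplication can be illustrated with the sketch below. It is not Algorithm 1: it assumes a 2D Lorenzo prediction as the protected computation, and the names `lorenzo_2d` and `protected_predict` as well as the third "tie-breaking" evaluation are hypothetical choices made for this example.

```python
def lorenzo_2d(left, up, diag):
    """2D Lorenzo prediction from three already-decompressed neighbors."""
    return left + up - diag


def protected_predict(left, up, diag, compute=lorenzo_2d):
    """Evaluate the prediction twice and accept it only if both results agree.

    On a mismatch, a third evaluation acts as a tie-breaker. Assumption: a
    transient computation error affects at most one of the evaluations.
    """
    first = compute(left, up, diag)
    second = compute(left, up, diag)             # duplicated instruction(s)
    if first == second:
        return first
    third = compute(left, up, diag)              # mismatch: recompute and vote
    if third == first or third == second:
        return third
    raise RuntimeError("prediction could not be validated")


prediction = protected_predict(1.0, 2.0, 0.5)
```

Only the two fragile computations identified in Section 4.1 are wrapped this way, so the rest of the compressor runs without duplication overhead.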
5.3. Fault tolerant decompression
The SDC resilient SZ decompression is presented in Algorithm 2. Lines 1-9 refer to the regular block-wise data decompression of SZ. Our resilience design starts from line 10. During the data compression, we constructed the checksums of the decompressed data for each block and compressed the resulting checksum array by lossless compression (Zstd). Accordingly, we need to decompress the checksum array (line 10) before the error detection. Our idea is to leverage these checksums, constructed during the compression, to detect possible errors that happen during the decompression. Specifically, after performing the data decompression for each block (lines 1-9), our algorithm calculates the corresponding checksums for each block of decompressed data and compares them to the stored checksums (lines 12-13). If they are not consistent, some error must have happened during the decompression. So, the algorithm decompresses this block again by random-access decompression (line 14), meaning the compressed bytes are reloaded. If the checksum is consistent this time, we know that a memory or computation error struck the first attempt and has been detected and corrected (line 17). If it is inconsistent the second time, we can conclude that the SDC error likely happened during the lossless compression, which will be reported to users (line 19).
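The verify-and-retry logic described above can be sketched as follows. This is an illustration of the control flow only, not Algorithm 2: the function name `verified_decompress`, its callback parameters, and the returned status strings are all hypothetical.

```python
def verified_decompress(block_id, load_compressed, decompress, checksum, expected):
    """Decompress one block and validate it against its stored checksum.

    load_compressed(block_id) -> compressed bytes of that block
    decompress(buf)           -> decompressed block
    checksum(block)           -> checksum comparable with `expected`
    Returns (block, status) with status "ok", "corrected", or "unrecoverable".
    """
    block = decompress(load_compressed(block_id))
    if checksum(block) == expected:
        return block, "ok"
    # Mismatch: reload the compressed bytes and decompress this block again.
    block = decompress(load_compressed(block_id))
    if checksum(block) == expected:
        return block, "corrected"                # a transient error hit the first attempt
    # Still inconsistent: the compressed bytes themselves are likely corrupted.
    return block, "unrecoverable"
```

A caller would typically report the "unrecoverable" status to the user rather than silently returning the block.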
5.4. Avoiding round off errors in checksums
Since the input data and the decompressed data are both floating point numbers, round off errors in the checksums may introduce inaccurate memory error corrections. To avoid the impact of round off error, we treat the floating point numbers as unsigned 32-bit integers and then calculate checksums based on these integers. We first describe how the checksum is performed on the 32-bit single-precision floating point data as an example and then discuss how to extend it to 64-bit double-precision floating point values.
Given a data block of 32-bit floating point values, for each element, we put all its 32 bits in a temporary variable and treat the bits in that variable as a 64-bit unsigned integer with the first 32 bits being flushed to 0. We then add that integer to the checksum, which is also a 64-bit unsigned integer. Finally, we get the checksum represented by a 64-bit unsigned integer for this data block. Notice that the “checksum” here is not equal or approximate to the real sum of the data block because it is calculated based on the integer interpretation of the bits instead of floating point. Thus, it is immune to NaN/Inf issues that happen only to floating point numbers. Using the 64-bit unsigned integer representation, the checksum can hold the sum of up to $2^{32}+1$ 32-bit unsigned integers without overflow, because the maximum 64-bit unsigned integer divided by the maximum 32-bit unsigned integer is equal to $2^{32}+1$. That is more than enough to totally avoid the overflow, since each data block in SZ generally has only about 1000 data points (such as a 10x10x10 block). With all these techniques, we can provide bit-level error detection and correction. The main difference from Demmel’s work (reproduciblesum) is that we perform integer-based summation instead of a sum based on floating point numbers.
To extend to 64-bit double precision numbers, we just need to treat each double value as two 32-bit unsigned integers. So it is reduced to the above case.
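The integer-based checksum can be sketched in Python with the standard `struct` module. The function names (`float_bits`, `checksum32`, `checksum64`) are hypothetical, and the sketch assumes little-endian packing purely for definiteness; it only illustrates the bit reinterpretation, not the SZ code.

```python
import struct

MASK64 = (1 << 64) - 1                           # emulate 64-bit unsigned arithmetic


def float_bits(x):
    """Reinterpret a single-precision value as a 32-bit unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def checksum32(values):
    """Checksum of single-precision values, summed as unsigned integers."""
    total = 0
    for x in values:
        total = (total + float_bits(x)) & MASK64
    return total


def checksum64(values):
    """Checksum of double-precision values: each double is two 32-bit words."""
    total = 0
    for x in values:
        low, high = struct.unpack("<II", struct.pack("<d", x))
        total = (total + low + high) & MASK64
    return total


block = [1.5, float("nan"), float("inf"), -2.25]
reference = checksum32(block)                    # well defined despite NaN/Inf
```

A floating-point sum of `block` would be NaN and could never be compared meaningfully, whereas the integer checksum changes by exactly the weight of any flipped bit.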
5.5. Impact to compression ratio without protecting regression and sampling
As mentioned previously, we do not protect the computation in regression and sampling, because errors during this period do not affect the correctness of the decompressed data and have only a tiny impact on the compression ratio. In what follows, we derive theoretically the upper bound of the compression ratio decrease caused by a computation error happening during the regression or sampling. We denote the compression ratio of SZ in an error-free run by $CR$ and the number of data blocks by $N_b$. For simplicity, we assume that the compression ratio is identical for all blocks. In the worst case, the error in regression or sampling reduces the compression ratio of the affected block to 1, which means that the size of that block of data is not reduced at all. Consequently, we can derive the maximum compression ratio decrease as $CR_{decrease} = \left(1 - \frac{N_b}{N_b + CR - 1}\right) \times 100\%$. Based on the above equation, the upper bound of the compression ratio decrease depends on the error-free compression ratio and the block size. For example, if the block size is set to 6x6x6, the compression ratio is 10, and the input data is around 864 MB (assuming single-precision values), then there will be about $10^6$ data blocks. The compression ratio decrease would then be bounded within about 0.0009%, which is negligible relative to the overall compression ratio.
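The worst-case bound can be checked numerically with the short sketch below. It follows the stated assumptions (identical per-block compression ratio, one affected block degraded to ratio 1) and additionally assumes 4-byte values and 1 MB = 10^6 bytes for the example; the function name `cr_decrease_bound` is hypothetical.

```python
def cr_decrease_bound(cr, num_blocks):
    """Relative compression-ratio decrease when one block degrades to ratio 1.

    With blocks of equal original size s, the compressed size grows from
    num_blocks * s / cr to (num_blocks - 1) * s / cr + s.
    """
    degraded_cr = num_blocks * cr / (num_blocks - 1 + cr)
    return (cr - degraded_cr) / cr


# Example from the text: 6x6x6 blocks, compression ratio 10, about 864 MB of input.
num_points = 864_000_000 // 4                    # assumption: 4-byte values
num_blocks = num_points // 6**3
bound = cr_decrease_bound(10.0, num_blocks)
print(f"{num_blocks} blocks, decrease bounded by {bound * 100:.4f}%")
```

The bound shrinks roughly as the compression ratio divided by the number of blocks, so larger datasets are even less sensitive to a single mispredicted block.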
6. Experimental Evaluation
6.1. Experimental Setup
In this subsection, we describe how we set the experiments in our evaluation, including applications, error injections, and experimental environment.
We evaluate our SDC resilient error-bounded SZ compressor on three real scientific datasets: NYX, Hurricane, and SCALE-LETKF (SL for short). We also evaluate our fault tolerant compressor using 20 Pluto images provided by the Planetary Data System (PDS) (pds). Those images were taken by the New Horizons space probe (new-horizons) in space, which is an error-prone environment because of the potential impact of cosmic rays. The description of these datasets is presented in Table 1. For the Pluto image data, we perform the error-bounded lossy compression such that the visual quality can be maintained very well, as illustrated in Figure 2.
6.1.2. Error injections with two modes
Evaluation mode A - source-code level error injection.
Like most ABFT work (xinfft; jieyangipdps), we inject errors at the source code level and only into the main data structures. Specifically, for memory errors in the input data and the quantization bin array, we randomly choose an index in the array and randomly flip one bit of the selected value during the compression. Thus, we simulate the randomness of memory errors in both time and location. We inject them after the checksums are applied on input data (i.e.,  and ). To simulate computation errors when calculating regression coefficients, sampling, and estimating the compression error of Lorenzo and regression, we randomly select a data point in a random block and then change its value by injecting a random bitflip error. We exclude computation errors in prediction from the evaluation, as prediction is already protected by instruction redundancy.
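The source-code level injection described above can be illustrated with a short sketch. It assumes the input values are stored as IEEE-754 single-precision floats and models the array as a Python list. The helper names `flip_bit_float32` and `inject_memory_error` are hypothetical and do not correspond to functions in SZ.

```python
import random
import struct


def flip_bit_float32(value: float, bit: int) -> float:
    """Return `value`, interpreted as an IEEE-754 float32, with one bit flipped."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 32)")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR toggles exactly the selected bit, then reinterpret as float32 again.
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


def inject_memory_error(data: list, rng: random.Random) -> tuple:
    """Flip one random bit of one random element in place.

    Returns the (index, bit) pair that was chosen, so a run can be reproduced.
    """
    index = rng.randrange(len(data))
    bit = rng.randrange(32)
    data[index] = flip_bit_float32(data[index], bit)
    return index, bit
```

For an integer array such as the quantization bins, the same idea applies without the float reinterpretation: XOR the selected element with `1 << bit`.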
Evaluation mode B - system level error injection.
Besides evaluation mode A (in which memory errors happen only to the data we protect), we also follow a Checkpoint-based Fault Injection (CFI) (cfi) model to inject random error(s) into the whole memory consumed during the compression. We adopt a system-level checkpointing toolkit, Berkeley Lab Checkpoint/Restart (BLCR) (blcr), which can dump the whole memory of a running process to disk as a checkpoint and then restart its execution from that checkpoint. In our experiment, we select a random time stamp during the whole compression period. Then, we set a checkpoint by saving the whole memory at that time stamp using BLCR and kill the process. We then inject a random bitflip error into the checkpoint file and restart the process from the bit-flipped checkpoint. We inject 1, 2, or 3 errors and perform 500 runs per test for both the fault-tolerant SZ and the original unprotected SZ.
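The bit-flipping step of this procedure can be sketched as follows. The sketch only modifies an existing file in place; taking the checkpoint and restarting the process with BLCR are outside its scope. The function name `flip_random_bits` is hypothetical, and choosing distinct bit positions is an assumption made here so that two injections never cancel each other.

```python
import os
import random


def flip_random_bits(path: str, num_errors: int, rng: random.Random) -> list:
    """Flip `num_errors` distinct random bits of the file at `path` in place.

    Returns the list of (byte_offset, bit_index) positions that were flipped.
    """
    total_bits = os.path.getsize(path) * 8
    if num_errors > total_bits:
        raise ValueError("file is too small for the requested number of errors")
    # Sampling without replacement guarantees distinct bit positions.
    positions = rng.sample(range(total_bits), num_errors)
    flipped = []
    with open(path, "r+b") as f:
        for position in positions:
            offset, bit = divmod(position, 8)
            f.seek(offset)
            byte = f.read(1)[0]
            f.seek(offset)
            f.write(bytes([byte ^ (1 << bit)]))
            flipped.append((offset, bit))
    return flipped
```

A run with one, two, or three injected errors would call this function on the checkpoint file with `num_errors` set accordingly before restarting the process.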
6.1.3. Experimental Environment
We run experiments on a supercomputer (bebop). Each computing node has two Intel Xeon E5-2695 v4 processors with 36 cores in total. POSIX I/O (posixio) in file-per-process mode is used for parallel data reading and writing. We implement our solution in SZ’s source code and call it ftrsz (or FT-SZ) in the following text. We alter the order of value additions in the duplicated computation of data prediction, which effectively prevents the compiler from optimizing away the duplicated computation, so that the execution time overhead can be measured correctly.
6.2. Evaluation of Independent-block Compression
We first evaluate our designed independent-block based SZ compression (a.k.a., random-access based compression).
6.2.1. Exploration of The Best Block Size
It is important to determine an appropriate block size in our independent-block based compression framework. Because the optimal block size is hard to derive theoretically for different datasets, we determine the best block size through a comprehensive rate-distortion analysis based on extensive experiments with different block sizes.
We evaluate the compression results using block sizes from 4x4x4 through 20x20x20. We exemplify the rate-distortion with cosmological NYX simulation data (velocity_x field) and climate Hurricane simulation data (TCf48 field) with five different block sizes in Figure 3. As shown in the figure, small block sizes (such as 4x4x4 and 6x6x6) may lead to high PSNR in the cases with low bit-rates (such as 2), while large block sizes (such as 8x8x8 to 12x12x12) are clearly better than the small block sizes at high bit-rates. The reason is as follows. For overly small block sizes such as 4x4x4, the overhead of storing the regression coefficients is relatively high compared to the overall compressed size. For overly large block sizes such as 20x20x20, the linear-regression based predictor cannot fit the data well. Based on our experiments with multiple simulation datasets, we set the block size to 10x10x10 in our implementation because it has much better compression ratios (i.e., lower bit-rates) in the hard-to-compress cases than other block sizes, while it exhibits comparable compression ratios to other block sizes in the cases with relatively low bit-rates.
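To make the coefficient overhead concrete, the sketch below estimates how many bits per data value are spent on regression coefficients for several block sizes. It assumes, for illustration only, that each 3D block stores four linear-regression coefficients as uncompressed 32-bit floats. The actual SZ implementation may store or compress the coefficients differently, so the numbers indicate only the trend. The function name `coefficient_bits_per_value` is hypothetical.

```python
def coefficient_bits_per_value(block_edge: int,
                               num_coefficients: int = 4,
                               bits_per_coefficient: int = 32) -> float:
    """Bits per data value spent on storing one block's regression coefficients.

    The defaults (4 coefficients, 32 bits each) are illustrative assumptions.
    """
    values_per_block = block_edge ** 3
    return num_coefficients * bits_per_coefficient / values_per_block


if __name__ == "__main__":
    for edge in (4, 6, 8, 10, 12, 20):
        bits = coefficient_bits_per_value(edge)
        print(f"{edge}x{edge}x{edge}: {bits:.3f} bits per value")
```

Under these assumptions, a 4x4x4 block spends 2 bits per value on coefficients alone, whereas a 10x10x10 block spends 0.128 bits per value, which illustrates why very small blocks become expensive when the total compressed size is small.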
6.2.2. Evaluating independent-block decompression
The biggest advantage of the independent-block based implementation is its very fast decompression when users want to extract only a small sub-block of the data. Moreover, as we discussed in Section 5.3, this design also helps correct errors very quickly once problematic blocks are detected by checksums. In Figure 4, we present the decompression times for different decompressed data sizes relative to the whole dataset. The x-axis indicates the ratio of the decompressed data size to the whole data size. In the figure, we observe that the decompression time decreases approximately linearly as the decompressed data size decreases, which confirms the high efficiency of random-access decompression.
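The idea behind random-access decompression can be sketched with a simplified stand-in. In the example below, each block of a 1D array is compressed independently with lossless `zlib` instead of SZ's error-bounded predictor and quantizer, so it shows only the access pattern: a request for a sub-range touches just the blocks that overlap it. The function names are hypothetical.

```python
import struct
import zlib


def compress_blocks(values: list, block_len: int) -> list:
    """Compress each block of `block_len` float32 values independently."""
    blocks = []
    for start in range(0, len(values), block_len):
        chunk = values[start:start + block_len]
        raw = struct.pack(f"<{len(chunk)}f", *chunk)
        blocks.append(zlib.compress(raw))
    return blocks


def decompress_range(blocks: list, block_len: int, first: int, last: int) -> list:
    """Return values[first:last], decompressing only the overlapping blocks."""
    if first >= last:
        return []
    first_block = first // block_len
    last_block = (last - 1) // block_len
    out = []
    for b in range(first_block, last_block + 1):
        raw = zlib.decompress(blocks[b])
        out.extend(struct.unpack(f"<{len(raw) // 4}f", raw))
    # Trim the leading values of the first block that precede `first`.
    offset = first - first_block * block_len
    return out[offset:offset + (last - first)]
```

Since the number of blocks decoded grows with the size of the requested range, the decoding work in this sketch scales roughly linearly with the amount of data extracted.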
6.3. Error free experimental results
One key indicator is how much overhead (including compression ratio overhead and execution time overhead) would be introduced by the SDC detection in the compressor.
6.3.1. Compression ratio overhead
Since we store the checksum during the compression in order to verify the correctness of the decompressed data, the compression ratio may be degraded to some extent. Table 2 presents the compression ratios of the original SZ (denoted as sz) and the relative decreases of compression ratios under the independent-block based SZ (or random-access SZ, abbreviated as rsz) and the fault-tolerant random-access SZ (denoted as ftrsz), respectively. We observe that our proposed solution incurs only 0-10.7% degradation in compression ratio for the NYX, Hurricane, and Pluto data, and the degradation level decreases with decreasing error bounds. The SL dataset exhibits 9.4-24.9% compression ratio degradation, which mainly comes from the overhead introduced by the random-access design.
6.3.2. Execution time overhead
We evaluate the time overheads introduced by the fault tolerance code added to SZ when there are no errors. We show the results for both compression and decompression in Figure 5. In most cases, rsz and ftrsz incur about 5-20% overhead in compression time and 2-30% overhead in decompression time. Such time overheads are negligible compared to the total I/O time on a PFS because of the potential I/O bottleneck, as demonstrated at the end of this section.
6.4. Error injected experimental results
6.4.1. Resilience against memory errors in input and quantization bin array (evaluation mode A)
We first inject memory errors into the input array and bin array to verify that our proposed solution still keeps the decompressed data within the user-defined error bounds.
In this experiment, we observe that various fields exhibit similar results. As such, we present the results based on the dark matter density field in the NYX dataset as an example. For every error bound, we run sz and ftrsz 100 times each, with randomly injected memory errors in the input and quantization bin arrays.
As shown in Table 3, our proposed fault tolerance solution always yields correct decompressed results when memory errors are injected into the input data or the quantization bin array. The 100% correctness of the decompressed data under ftrsz also means that our solution is immune to round-off errors. In comparison, for the original SZ without our techniques, only 48-60% of runs yield error-bounded decompressed data when the input data experiences memory errors. When a memory error corrupts a value in the bin array, the situation gets worse because some memory errors cause a core-dump segmentation fault, which happens when the corrupted value turns out to be a new value beyond the range of the constructed Huffman tree. As shown in the right side of Table 3, under the original SZ compression, only 34-54% of runs complete without segmentation faults, and only 0-3% of runs complete with correct decompressed data.
As for the extra time overhead introduced by the detection/correction of errors in our fault tolerance method, we conduct error-injected experiments for all three datasets. The extra overheads compared to ftrsz in the error-free case are all less than 1% for any error bound. This is because the case with injected errors incurs only one more block of checksum calculation, which is negligible relative to the overall execution time.
6.4.2. Resilience against memory errors happening anywhere (evaluation mode B)
Figure 6 presents the experimental results of our solution (ftrsz) against the original SZ in evaluation mode B (i.e., injecting errors into the whole memory during the compression). We observe that our solution improves the percentage of successful non-crash runs by 10%-20% and the percentage of runs with correct decompression results by 30%-170%. Our solution substantially reduces the number of crashed runs because we protect the bin arrays, which may otherwise run into core-dump segmentation faults when errors are injected into them, as shown in Table 3. In addition, as shown in Figure 6 (b), when injecting one and two memory errors respectively, about 92% of runs lead to correct decompressed data (with guaranteed error bound) under our solution, while the original SZ achieves much lower percentages (71.2% and 47%, respectively). For our solution, the 8% of failed cases with incorrect decompressed data are likely due to errors injected before the checksum is computed at the beginning of the run, which means the checksum is calculated from already corrupted input data. Thus, the corruption cannot be detected afterwards.
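The timing issue behind the remaining failed cases can be reproduced with a small sketch. It uses CRC-32 from Python's `zlib` module as a stand-in checksum, which is not the checksum scheme of ftrsz, and models a memory error as a sign flip of one value. The function names `checksum` and `error_is_detected` are hypothetical.

```python
import struct
import zlib


def checksum(block: list) -> int:
    """Stand-in checksum: CRC-32 over the float32 bytes of the block."""
    return zlib.crc32(struct.pack(f"<{len(block)}f", *block))


def error_is_detected(corrupt_before_checksum: bool) -> bool:
    """Inject one error either before or after the reference checksum is taken."""
    data = [1.0, 2.0, 3.0, 4.0]
    if corrupt_before_checksum:
        data[2] = -data[2]       # error strikes first ...
    reference = checksum(data)   # ... so the reference already covers corrupted data
    if not corrupt_before_checksum:
        data[2] = -data[2]       # error strikes after the reference was recorded
    # Verification step: recompute and compare against the reference.
    return checksum(data) != reference


if __name__ == "__main__":
    print("error after checksum detected: ", error_is_detected(False))
    print("error before checksum detected:", error_is_detected(True))
```

An error that lands after the reference checksum is recorded changes the recomputed value and is flagged, while an error that lands before it is silently absorbed into the reference.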
6.4.3. Resilience against computation errors during compression
As discussed in Section 4.1.1, the computations of regression coefficients, sampling, and compression error estimation are error resilient, though computation errors affect the compression ratio. Figure 7 shows our experimental results on the impact on compression ratios. Computation errors are randomly injected and each experiment is repeated 50 times. The compression ratio decrease is calculated from the lowest compression ratio among the 50 trials. As can be seen, the compression ratio decrease is within 2% for up to 10 injected computation errors under the error bound of 1E-6 or 1E-3. The compression ratios in the error-free case are 4.8023 and 1.8112 for these two error bounds, respectively.
6.4.4. Resilience against errors injected during decompression
For each run of decompression, we inject one computation error into a random block and observe that all errors are detected by the checksum and corrected by re-executing decompression for that block. Again, the extra overheads compared to the fault-tolerant random-access SZ in error-free cases are all less than 1% for all datasets and all error bounds.
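The detect-and-re-execute pattern for a single block can be sketched as follows. This is a simplified illustration and not the ftrsz implementation: `zlib` stands in for the block decoder, CRC-32 stands in for the stored checksum, and the `faulty_attempts` parameter is a hypothetical test hook that simulates a transient computation error during the first attempts.

```python
import struct
import zlib


def checksum(values: list) -> int:
    """Stand-in checksum: CRC-32 over the float32 bytes of the values."""
    return zlib.crc32(struct.pack(f"<{len(values)}f", *values))


def decode_block(compressed: bytes, corrupt: bool = False) -> list:
    """Decode one block; `corrupt=True` simulates a transient computation error."""
    raw = zlib.decompress(compressed)
    values = list(struct.unpack(f"<{len(raw) // 4}f", raw))
    if corrupt:
        values[0] += 1.0  # hypothetical soft error hitting one decoded value
    return values


def decode_with_recovery(compressed: bytes, expected_checksum: int,
                         faulty_attempts: int = 0, max_attempts: int = 3) -> tuple:
    """Decode a block and re-execute the decode while the checksum mismatches.

    Returns (values, attempts_used); raises if no attempt passes verification.
    """
    for attempt in range(1, max_attempts + 1):
        values = decode_block(compressed, corrupt=attempt <= faulty_attempts)
        if checksum(values) == expected_checksum:
            return values, attempt
    raise RuntimeError("block could not be recovered")
```

Because blocks are independent, only the block whose checksum mismatches needs to be decoded again, which is consistent with the small extra overhead reported above.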
6.5. Parallel experimental results
We evaluate the I/O performance with a breakdown of the execution times (compression/decompression time + data writing/reading time) by processing the NYX dataset under the error bound of 1E-4 in parallel on the PFS of the cluster. The experiment follows a weak-scaling style: we run the tests at different execution scales (256-2,048 cores), in which each rank keeps the same data size (3GB) to process. Results are shown in Figure 8. As for the total data dumping time, we observe that our error-resilient SZ incurs only 7.3% overhead at the scale of 2,048 cores. For data loading, our error-resilient SZ has only 6.2% overhead when using 2,048 cores to read and decompress data. The key reason for the very limited overall overhead is that the total I/O performance is dominated by the compression ratio because of the I/O bottleneck of the PFS.
In this paper, we propose a novel SDC-resilient strategy for the SZ lossy compressor. We develop an independent-block based compression model for SZ to improve its robustness. We carefully analyze each subroutine of the SZ framework and then design a series of fault tolerance strategies for the fragile code segments. We perform the evaluation by processing three well-known scientific datasets on a cluster with up to 2,048 cores. Our solution keeps the time overhead to about 10%, with the degradation of compression ratio limited to about 5%. When injecting one and two SDC errors respectively during the compression, about 92% of runs under our solution produce correct decompressed data (with guaranteed error bound), which is significantly higher than under the original SZ (71.2% and 47%, respectively).