1. Introduction
Largescale scientific simulations running with leadership supercomputers are essential in many science and engineering domains such as cosmology studies. Modern cosmological simulations are used by researchers and scientists to investigate new fundamental astrophysics ideas, develop and evaluate new cosmological probes, assist largescale cosmological surveys, and investigate systematic uncertainties (heitmann2019hacc; friesen2016situ). Historically such studies have required large simulations that are highly computation and storage intensive, which are run on leadership supercomputers. Today’s supercomputers have evolved to heterogeneity with acceleratorbased architectures, in particular GPUbased highperformance computing (HPC) systems, such as the Summit system (summit) at Oak Ridge National Laboratory. To adapt to this evolution, cosmological simulation codes such as Nyx (almgren2013nyx) (an adaptive mesh cosmological simulation code) have been designed to take advantage of GPUbased HPC systems and can be efficiently scaled to simulate trillions of particles on millions of cores (almgren2013nyx). These simulations often run on a static number of ranks, usually for the same number of compute partitions, and periodically huge amounts dump raw simulation data to the storage for future post hoc analysis.
With the increase in scale of such simulations, saving all the raw data generated to disk becomes impractical due to: 1) limited storage capacity, and 2) the I/O bandwidth required to save this data to disk can create bottlenecks in the simulation (wan2017comprehensive; wan2017analysis; cappello2019use) . For example, one Nyx simulation with a resolution of cells can generate up to 2.8 TB of data for a single snapshot; a total of 2.8 PB of disk storage is needed assuming running the simulation 5 times with 200 snapshots dumped per simulation. One way to avoid this issue is to limit the volume of data that needs to be written to disk. This can be done by decimation, e.g. storing one snapshot at every other timestep during the simulation. However, even with decimation, we can still be left with a massive number of timesteps to store and the amount of data to be stored for one timestep can still overwhelm the storage capacity and I/O bandwidth of a supercomputer.
A better way to address this issue is to use data compression. While lossless compression would have been ideal, it typically only achieves a compression ratio (son2014data) for scientific data. On the other hand, using the new generation of errorbounded lossy compression techniques, such as SZ (tao2017significantly; di2016fast; liangerror) and ZFP (zfp), we can achieve much higher compression ratios with minimal distortion of the data as demonstrated in many prior studies (di2016fast; tao2017significantly; zfp; liangerror; lu2018understanding; luo2019identifying; tao2018optimizing; cappello2019use; jin2020understanding; grosset2020foresight). However, previous approaches of utilizing lossy compression for scientific datasets have always applied the same compression configuration to the entire dataset (jin2020understanding; tao2017exploration). Yet, if we look at a visualization of baryon density in a Nyx simulation, shown in Figure 1, we can see that not all partitions (regions) have the same amount of information. Cosmologists are typically interested in the dense regions as these would contain halos (clusters of particles) where galaxies would be formed. So, this means that the sparse regions (top row of Figure 1) could be compressed more aggressively than the dense ones (bottom row of Figure 1) and this would not impact the analysis done by cosmologists.
Apart from finegrained adaptive compression, we must also be able to precisely control the compression error for domainspecific posthoc analysis. Research has shown that generalpurpose data distortion metrics, such as
(PSNR), normalized rootmeansquare error, mean relative error (MRE), and mean square error (MSE), on their own cannot satisfy the demand of quality for cosmological simulation posthoc analysis (jin2020understanding; grosset2020foresight). For example, PSNR does not tell us how the mass of a halo would be impacted after compression. One approach to finding the optimal compression configuration needed is to run a broadspectrum analysis (try many different compression configurations and analyze the result of each) as is done by Foresight (grosset2020foresight). However, such an approach is entirely empirical and requires a long processing time. A smarter way would be to be able to characterize each region based on a number of metrics (e.g. running an FFT analysis) which would then be used to decide which compression configuration parameters to be applied.In this paper, we show that adaptively compressing different regions of a simulation, based on the amount of information that they contain, allows us to maximize the compression ratio while not impacting the quality needed for posthoc analysis. In order to determine the compression parameters to use for each region, we develop: (1) a theoretical error estimation model for posthoc analysis, including both power spectrum and halo finder for cosmological simulation; (2) an estimation model of compression ratio for lossy compression; and (3) a regionwise optimization approach for errorbound combination based on the proposed models. To demonstrate the effectiveness of our approach, we compare the power spectrum and halo generated from adaptive compression to traditional static compression method and show that we get similar posthoc analysis while getting compression ratio improvement by up to 73%. To the best of our knowledge, this paper is the first work on systematically study the possibility and efficiency of dividing scientific simulation data in compute partitions (i.e., regions) and apply adaptive lossy compression configurations.
Regarding the overhead for in situ implementing our approach to cosmological simulation, it only requires collecting several parameters (such as mean value and number of cells weighted in a certain value range) from every region for identifying their feature density and compressibility. This introduces very little overhead of only 1% compared to compression itself since we efficiently reduce the information required for optimization. Moreover, since the data is already partitioned among MPI ranks for cosmological simulation, our approach can be perfectly integrated in situ to optimize compression configuration individually for each partition. Our work can also be adopted to other largescale scientific simulations that require compression but are facing the challenge to understand the impacts of lossy compression on their domainspecific metrics. The contributions of this work are summarized as follows:

We propose a novel adaptive approach to select feasible errorbound combinations for different partitions of cosmological simulation data, and show the possibility and efficiency of adaptively configuring lossy compression for each partition, instead of statically setting empirical error bound for whole data set in the beginning of the simulation.

We build theoretical models to efficiently estimate (1) the overall loss of cosmological postanalysis result caused by lossy compression and (2) the compression ratio, all based on the property of each partition by collecting several representative parameters.

We develop an efficient optimization guideline to determine the bestfit configuration of errorbound combinations that maximizes compression efficiency under userdefined acceptable posthoc analysis quality distortion.

Experiments demonstrate that our approach can minimize the overhead for feature extraction and errorbound optimization for each partition while providing in situ postanalysisaware lossy compression for cosmological simulation. Our approach improves the compression ratio by up to 73% with only about 1% performance overhead compared to the original compression.
The rest of this paper is organized as follows. In Section 2, we discuss the background and motivation of our research. In Section 3, we describe our proposed modeling for cosmological simulation data postanalysis error impact and modeling for compression ratio, as well as our optimization strategy for finegrained lossy compression. In Section 4, we present the evaluation results of our proposed approach to Nyx cosmological simulation data and compare it with previous approaches. In Section 5, we conclude our work and discuss our future work.
2. Background and Motivation
In this section, we present the background information about cosmology simulation and some widely used post hoc analysis methods, as well as advanced lossy compression for scientific data.
2.1. Cosmological Simulation and Analysis
Nyx is an adaptive mesh, hydrodynamics code designed to model astrophysical reacting flows on HPC systems (almgren2013nyx; nyx). This code models dark matter as discrete particles moving under the influence of gravity. The fluid in gasdynamics is modeled using a finitevolume methodology on an adaptive set of 3D Eulerian grids/mesh. The mesh structure is used to evolve both the fluid quantities and the particles via a particlemesh method. For parallelization, Nyx uses MPI for the longrange force calculation and architecturespecific programming language for the shortrange force algorithms, such as OpenMP and CUDA. Nyx data uses multiple 3D arrays to represent field information in grid structure. According to prior studies (nyx; habib2016hacc), it can run up to millions of cores in the leadership supercomputers in the United States, such as Summit (summit). In this paper, we use Nyx simulation data that contains 6 fields: Baryon Density, Dark matter density, Temperature, Velocity x, y, and z.
As mentioned earlier, traditional evaluation metrics (such as MSE, PSNR) cannot inform us on the data quality needed for post hoc analysis
(jin2020understanding). So, we compute cosmologyspecific evaluation metrics, such as Power Spectrum and halo characteristics, to determine the data quality needed for post hoc analysis.Power Spectrum
Matter distribution in the Universe has evolved to form astrophysical structures on different physical scales, from planets to larger structures, such as superclusters, and galaxy filaments. The twopoint correlation function
, which gives the excess probability of finding a galaxy at a certain distance
from another galaxy, statistically describes the amount of the Universe at each physical scale. The Fourier transform of
is called the matter power spectrum , where is the comoving wavenumber. Therefore, the matter power spectrum describes how much structure exists at the different physical scales. Observational data from ongoing sky surveys have measured the power spectrum of matter density fluctuations across several scales. These sky surveys, along with largescale simulations, are used to investigate problems such as determining cosmological parameters (eke2001power). In general, we compared the of decompressed data to the original and target for acceptable distortion ratio within for all .Dark Matter Halos
Dark matter halos play an important role in the formation and evolution of galaxies and consequently cosmological simulations. Halos are overdensities in the dark matter distribution and can be identified using different algorithms; in this instance, we use the FriendsofFriends algorithm (Davis1985). That is, we connect each particle to all “friends” within a distance, with a group of particles in one chain considered as one halo. Another concept of halo, such as Most Connected Particle, is defined as the particle within a halo with the most friends. Then, there is the Most Bound Particle, which is defined as the particle within a halo with the lowest potential. For Nyx simulation, which is an Eulerian simulation instead of Lagrangian simulation, the Halo Finding algorithm uses density data to identify halos (friesen2016situ). For decompressed data, some of the information can be distorted from the original. Information such as the density of one cell can affect the halo number detected, particularly for smaller halos. We use three matrices to reflex the Halo Finder quality of decompressed data: (1) the position of halos; (2) the halo number detected; and (3) the halo mass change of each halo. Furthermore, we preferred to preserve that information for middle and large halos over for small halos.
2.2. Lossy Compression for Scientific Data
Floatingpoint data compression has been studied for decades. There are two main categories: lossless compression and lossy compression. Lossless compressors such as FPZIP (lindstrom2006fast) and FPC (FPC) can only provide limited compression ratios (typically up to 2:1 for most scientific data) due to the significant randomness of the ending mantissa bits (son2014data).
Lossy compression, on the other hand, can compress data with little information loss in the reconstructed data. Compared to lossless compression, lossy compression can provide a much higher compression ratio while still maintaining useful information for scientific discoveries. Different lossy compressors can provide different compression modes, such as errorbounded mode and fixedrate mode. Errorbounded mode requires users to set an error bound, such as absolute error bound or pointwise relative error bound. The compressor ensures the differences between the original data and the reconstructed data do not exceed the userset error bound. Fixedrate mode means that users can set a target bitrate, and the compressor guarantees the actual bitrate of the compressed data to be lower than the userset value.
In recent years, a new generation of lossy compressors for scientific data have been proposed and developed, such as SZ (di2016fast; tao2017significantly; liangerror; tian2020cusz) and ZFP (zfp). SZ and ZFP were first developed for CPU architectures, and both started rolling out their GPUbased lossy compression recently. Both SZ and ZFP teams have released the CUDA implementation of their compression (tian2020cusz; cuZFP). Compared to lossy compression on CPUs, GPUbased lossy compression can provide much higher throughput for both compression and decompression (jin2020understanding). Unlike traditional lossy compressors such as JPEG (wallace1992jpeg) which are designed for images (in integers), SZ and ZFP are designed to compress floatingpoint data and can provide a strict errorcontrolling scheme based on user’s requirements. In this work, we chose to use SZ instead of ZFP because the GPU version of SZ—cuSZ (tian2020cusz)—provides a higher compression ratio than ZFP and offers the absolute errorbound mode that ZFP does not support (but necessary for our error control). Specifically, SZ is a predictionbased errorbounded lossy compressor for scientific data. SZ has three main steps: (1) predict each data point’s value by its neighboring data points in a multidimensional space with an adaptive predictor (using either a Lorenzo predictor (ibarria2003out)
(liangerror)); (2) perform an errorcontrolled linearscaling quantize the difference between the real value and predicted value based on the userset error bound, convert all floatingpoint values to an array of integer numbers; and (3) apply a customized Huffman coding and lossless compression to achieve a higher ratio.Today’s lossy compression techniques have been used in many HPC scientific applications for saving storage space and reducing the I/O cost of saving data (gok2018pastri; wu2019full; poppick2020statistical). In this paper we focus on utilizing SZ lossy compression for cosmological simulation with consideration of specified analysis error control. We will build a model for SZ lossy compression error and provide a theoretical support for error propagation in post hoc analysis. Note that our study can be also applied to other lossy compressors with modifications on compression error modeling (more will be detailed in Section 3).
3. Design Methodology
In this section, we introduce the concept of optimizing compression configuration for different partitions in a given dataset and the necessary theoretical analysis as well as models needed. We describe in detail our theoretical analysis based on our hypothesis and provide an experimental evaluation to support our models on post hoc analysis error impact and compression ratio. Lastly, we propose an in situ approach for cosmological simulation with an adaptive compression configuration for each data partition based on our analysis and model.
3.1. Adaptive Compression Configuration
Our main goal is to provide a higher compression ratio while maintaining the same post hoc analysis quality or provide a higher post hoc analysis quality while maintaining the same compression ratio. We do so by applying our optimized compression configuration individually to each partition of a dataset, compared to traditionally one configuration for the entire dataset.
To achieve this, we need to determine the relationship among compression configurations, post hoc analysis quality, and compression ratio. Thus, we introduce two types of model: (1) errorimpact modeling for cosmological post hoc analysis based on compression configurations of different partitions, and (2) compression ratio modeling based on compression configurations of different partitions. Based on these two models, we can then create an optimization strategy. As shown in Figure 2, we target to balance between post hoc analysis quality and compression ratio, when providing corresponding compression configuration for each partition. This allows us to apply different compression configurations to different partitions whereby we can improve the compression efficiency of the entire dataset by both exchanging feature preservation (e.g., preserve more feature for dense information areas) and balancing compressibility characteristic (e.g., significantly improve compression ratio by sacrificing little analysis quality for low compressibility areas). Note that we can also significantly reduce the complexity of finding the optimized solution compared to the errorandtrail baseline method when different partitions vary significantly in terms of either information density or compressibility.
As shown in Figure 1, different partitions in Nyx data have completely different feature density and compressibility. Moreover, we can observe that a given partition can vary dramatically through different snapshots. To clarify, in Figure 1, we set a relative low maxvalue threshold to illuminate areas with low values, partitions with lower features (e.g., early timestep of sample partition upper in Figure 1) can be almost blank in comparison to regular visualization. This means the previous compression solution of compress all dataset with the same compression configuration is far from optimized in terms of the balance between post hoc analysis quality and compression ratio. Note that Nyx data is naturally partitioned due to its simulation with multiple MPI ranks, which provides a suitable condition to applying an individual configuration to each data partition. Based on our proposed modeling and optimization, we adaptively adjust the compression configuration for each partition in every snapshot.
3.2. Modeling Error Impact on Cosmology post hoc Analysis
Power spectra and halo finder are two main post hoc analysis metrics for Nyx cosmology simulation. We model the error impact in terms of both metrics. Note that our theoretical analysis in this section can be adopted to other post hoc analysis with little effort. For example, our analysis for power spectra can be adopted to other FFTbased analysis.
As discussed in Section 2, SZ lossy compression is a predictionbased lossy compressor using quantization for strict error control.
When Lorenzo predictor is used, SZ provides predicted values with an originprediction error in units of userdefined error bound. Such quantization causes evenly distributed error in both ABS
mode (i.e., absolute error bound) and PW_REL
mode (power relative error bound). More discussion in Section 3.5.
Figure 4 shows the error distribution of a sample Nyx dataset on the temperature
field with the absolute error bound of 10.0. Note that for both CPUSZ (quantization is performed after Lorenzo prediction) and GPUSZ (quantization is performed before Lorenzo prediction), their error distributions are the same as of uniform distribution. We also note that for some extreme cases where a high error bound is used, SZ lossy compression will introduce errors that are slightly different from the uniform distribution. We will discuss how we revise the analysis accordingly in the following sections. We use the uniform distribution as well as our revised uniform distribution to model SZ’s error. Based on our evaluation, this approximation is sufficiently accurate for our optimization.
3.3. FFTbased Power Spectrum
As mentioned in Section 2, power spectrum analysis for Nyx cosmological simulation is based on 3D Fast Fourier Transform (FFT). Thus, we mainly build our errorimpact model for power spectrum based on FFT algorithm. We provide theoretical analysis on error propagation from compressor introduced error in dataset to FFT result in terms of error distribution. We also provide an experimental evaluation to support our model.
FFT algorithm, such as CooleyTukey FFT algorithm (cooley1965algorithm), utilize recursive addition and multiplication of discrete Fourier transform (DFT) matrix decomposed sparse matrix divisor to accelerate the process compared to DFT (heideman1985gauss). Moreover, FFT provides exactly the same result as DFT with lower time complexity ( compared to ). Since we introduce error to the data from lossy compression before post hoc analysis, we can use the DFT equation instead of the more complex recursive FFT workflow to model the error. We first start with 1D FFT error modeling, and DFT is defined by:
(1) 
where is discrete input data, is input frequency, and is the number of elements. And it can be further simplified as:
(2) 
As of the error distribution from SZ lossy compression, as mentioned in Section 3.1, is modeled by uniform distribution as follows:
(3) 
where is the userdefined error bound, is the error distribution,
is probability density function. We consider this error as injected error to dataset and thus we can have the error model of DFT result as:
(4) 
Note here is not a value but a distribution function. We can get the error distribution of DFT as:
(5) 
In our use case, is a relatively large number (no smaller than
in our experiments), we can use Central Limit Theorem and know the above distribution should form into normal distribution. Now we need to find corresponding
and to define our normal error distribution. For real axis of DFT result, we can simplify Equation 5 to:(6) 
Then, we can get the average individual variance
by:(7) 
Note we transformed the equation with the fact that is a large number. Since the distribution of is central symmetric, the individual expected value is zero. Also, for situations where introduced error by lossy compression is not evenly distributed, we can still provide corresponding accordingly. Based on Central Limit Theorem, we can get the variance and expected value of DFT error distribution are:
(8) 
where is the number of elements in given 1D data. Similarly, we can further expand our equation to 2D DFT results by central limit theorem, since each row in the new dimension would further perform another 1D DFT on values with error distribution shown in Equation 9. And so on, we can get 3D DFT error distribution from SZ lossy compression is:
(9) 
where is the data dimension. The same error distribution goes to with similar analysis. As of applying various error bound to different partitions, since the element number in each partition is also considered a large number (e.g., when cutting data into 512 partitions), we can further derive our equation to:
(10) 
where is the number of partitions. We observe that (1) the absolute error impact on FFT analysis highly relies on data size: cosmological simulations running with higher resolution are less errortolerant regarding FFTbased post hoc analysis; and (2) the FFT error distribution does not rely on high “feature” density areas, in other words, every value in original data has the same importance and will impact on all FFT results from error introduced by lossy compression.
As shown in Figure 4, we evaluate the accuracy of our model on the temperature field in Nyx dataset by compress the data with various compression perpartition error bound in and estimate the final impact on FFT result. The average error bound here is , and the reason for such large error in values is because we present the error and estimation before FFT normalization, which is easy to convert by simply multiply the normalization factor. Note here x axis is normalized with error bound. Our model provides a highly reliable estimation to error impact on FFT by given compression configurations for all partitions. We also evaluate the precision of our model along with different range of error bound by applying various error bounds to partitions and collect real variance and compared to our estimation, Shown in Figure 5. Note we no longer provide errorbounded estimation on FFT error impact, but errorbounded probability. For example, we provide a possibility of that one FFT value is within .
3.4. Halo Finder Analysis
Halo finder is another post hoc analysis for cosmological simulation such as Nyx to find halos and identify their locations and masses. Similar to our proposed model for power spectrum, we also propose a model for halo finder in order to estimate the error impact based on given compressor error in dataset, in terms of both error distribution of halo locations and halo masses.
Different from power spectrum analysis, halo finder only applied to density data field (more specifically, “Baryon Density”) instead of all 6 fields in Nyx dataset. The principle of halo finder algorithm is to find the areas with density higher than a given threshold and to build a tree structure across the candidates. Then, those groups that have the highest maximum larger than are identified as “halos”. Lastly, halo position (centroid of all grid points belong to this halo) and halo mass (cell weighted sum of all grid points belong to this halo) are recorded for each halo.
Figure 6 presents the halo candidates before and after lossy (de)compression in a given partition. We can observe that after lossy compression, cell candidacy changes slightly on edge areas, where some cells of existing halos are added or dropped. For demonstration, we applied a relatively high error bound of (normally we use error bound ), thus, more boundary cells are falsedetected, and the hollow background is illuminated with reconstructed inerrorbound values visualized in blue.
After compression, we observe almost no number of found halos changes, even with very high distortion from high error bound, as is shown in Figure 7. Note that halo is further categorized into many small halos and few large halos, only the former can be falsedetected under high error bound. In general, we intend to concentrate and preserve information for large halos based on cosmological analysis (friesen2016situ). As a matter of fact, we also observe almost no halo position change under even extreme compression during our experiment. This is reasonable since the only change we made to existing halos is minor edge distortion from introduced error by bringing edge cells up/down through the threshold. This is a very minor change in weight, compared to the central density of halos.
Compared to almost nonchanging position and count after lossy compression, the mass change of halos is more suitable for error control in Nyx simulation. As shown in Table 1, we analyze the mass change characteristic based on a large halo. The number of cells found for this halo increases along with the increment of the introduced error to the data as expected. However, if we target to mass difference per changed cell, we observe they fall into similar values. More specifically, it is around , which is the set threshold for halo finder. It is because those edge cells are easier to switch between in and outofhalos given lossy compression introduced error, and they cause the mass value change of the entire cell, which is significantly larger than value change merely by lossy compression. This means most post hoc analysis error (from lossy compression) can be attributed to faultdetection of edge cells for halo finder.
Error Bound  Cells  Mass  Mass Diff  Diff per cell 

original  6023  3.13E+6     
1E2  6023  3.13E+6  0   
1E1  6011  3.13E+6  9.8E+2  81.7 
1E+0  6038  3.13E+6  1.21E+3  80.7 
1E+1  6041  3.13E+6  1.66E+3  92.2 
To further provide an estimation of mass changes given perpartition error bound, we conduct fault cell detection estimation. To start with, we divide our problem into finding fault mass detection estimation of each partition and provide overall estimation with their sum:
(11) 
where is sum of individual halo absolute mass changes, is the threshold for halo finder, is the number of partitions, is the error in each partition. The reason is that for existing halos cell changes are addable even for halos across partitions, and for faultdetected halos, they tend to be small halos and only appear under rarely used very high error bound.
Different from FFT, which is based post hoc analysis present in the previous section, halo finder is highly related to density as a feature of each partition for post hoc analysis error estimation. In this case, we define the feature needs to preserve for halo finder in the given data been edge cell count around the threshold. By analyzing the value histogram of density data from Nyx simulation, we find that although histogram is not evenly distributed among the entire data range, the histogram of local values can be considered as evenly distributed. For example, if we consider error bound as and halo boundary threshold as , in which case only values within can affect number of cells detected for halos, and the data histogram within the small value range is considered as evenly distributed. Thus, we can conclude the possibility of fault detection of given cell as:
(12) 
Note here similar to what we discussed for FFTbased post hoc analysis, we can provide the corresponding based on error distribution from lossy compression other than uniform distribution. Then we can provide the number of fault detected cells in the given partition by:
(13) 
where is number of partitions for Equation 11, is boundary cells with a value range of in the given partition. Note here we discuss the number of cells estimated been fault detected due to error introduced by lossy compression. Fault detected cells can switch between both in and outofhalos and result in expected total number of cells being the same as the original, while error forming into normal distribution similar to what we discussed in Section 3.3. However, since we focus on cell changes of individual halos and most are small halos with little edge cells, the number of cell difference can be simplified to Equation 14. For large halos, depending on their size, the estimated error distribution of cell count can be given by central limit theorem forms into normal distribution:
(14) 
We evaluate our halo finder modeling with baryon density data from Nyx simulation. Figure 8 shows our estimated cell difference count based on Equation 11 compared to the result from applying multiple error bounds to different partitions. Our modeling provides high accuracy compared to an experimental result. Note here some of the larger differences between estimation and experiment can be due to loss/increase of small halos as well as cell difference count reduced for extremely large halos from Equation 14.
3.5. Modeling Compression Ratio
In this section, we build a model to estimate the overall compression ratio of the dataset based on given compression configurations of different partitions. As discussed in Section 3.2, error introduced by SZ lossy compression can be modeled by uniform distribution. However, in situations of much larger error bound, the error distribution of SZ lossy compression forms into a combination of uniform distribution and normal distribution. This is because under high error bound settings, Lorenzo predictor in SZ lossy compressor predicts more datapoint within the error bound against the original, even without quantization. As mentioned in the previous theoretical analysis for post hoc analysis, we can easily adopt our models based on revised of noneeven distributed error distribution. Because of this distortion of error distribution, it is extremely hard to provide a model for estimate compression ratio from errorbound combination purely with theoretical analysis. Thus, in this section, we mainly contribute to the compressionratio modeling based on empirical analysis and provide a universal equation for all fields of data from all snapshots.
Figure 9 shows the bitrate to errorbound curve by SZ lossy compression. Here bit rate represents how many bits are needed to represent a value on average, consider original data is singleprecision floatpoint numbers. Two areas can be distinguished for each curve of given partition: both form into power function but areas with bit rate higher than 2 features power values closer to zero (flatter). This is because once bit rate is lower than 2, high error bound causes Lorenzo predictor to predict values within error bound (i.e. no quantization) and improves the Huffman coding efficiency, thus encourages compressibility: curves converge faster when bit rate across 2 on log scale.
Based on our empirical studies and previous work (jin2020understanding), we consider the case where bit rate is always lower than 2, or compression ratio higher than for Nyx dataset (in fp32
). Also, consider our optimization strategy is only gently adjusting the error bound of different partitions, we assume our adjustment is also located within the same curve area. For partitions beyond this assumption, their impact is negligible due to model consistency on edge situation and their low percentage. Evaluation from Section 4 verified our assumption. Bitrate to errorbound equation can be given by:
(15) 
where is overall bit rate, is the number of partitions, is the estimated bit rate of each partition, and are parameters to define position and shape of this curve. Furthermore, we can determine perpartition the basic bitrate to errorbound power function method based on trial and error. However, we avoid this timeconsuming process by performing a twostep procedure: (1) we identify that different partitions across fields and snapshots share the same power parameter , thus we can select and keep using it; and (2) we use a representative parameter from a given data partition to determine of its bitrate to errorbound curve, or relative compressibility of the partition. We find the entropy of partition is one of the parameters that are highly related to , with higher entropy mapped to higher compressibility thus lower curve in Figure 9. However, for reducing the overhead to compute parameter, we instead choose mean value as our key parameter to determine the relative compressibility of a partition. Figure 10(a) shows that the estimated based on a logarithmic fitting and partitions’ mean values are highly precise compared to the real . Lastly, unlike transformbased compressors, SZ provides consistent bitrate to errorbound curves, as shown in Figure 10(b). Therefore, we have high confidence to use the estimated bit rate.
3.6. Proposed Optimization Strategy
Until now we have built the models for error impact of post hoc analyses as well as the model for compression ratio. Then, we optimize the compression configuration of different partitions to feature different error bound so that we can maximize the compression ratio while maintaining the estimated post hoc analysis quality or maximizing the post hoc analysis quality with the same estimated compression ratio.
First, we consider power spectrum post hoc analysis quality to compression ratio optimization. Since FFTbased post hoc analysis does not have a correlation with individual local partition feature, meaning every value shares the same importance. The optimization strategy relies on the difference in compressibility among data partitions. Based on Equation 10, we expect a similar error impact on FFTbased analysis under the condition when the average error bound of all partitions remains the same. In this case, to improve the overall compression ratio while guaranteeing the same FFTbased post hoc analysis quality, we utilize Equation 15 from the previous section by estimating parameters based on the mean value of a given partition.
Additionally, we also extract the overall mean value of the entire dataset by MPI_Allreduce
after each partition computes their own.
We then optimize the perpartition error bound such that their derivatives of bitrate to errorbound curve are the same, which are the minima for Equation 15, while keeping the average error bound within the threshold request from Equation 10. These minima can be given by:
(16) 
where is the optimized error bound of partition , is the average acceptable error bound based on Equation 10, is the parameter based on the average of the mean values from Equation 15, and are each partition’s parameters in Equation 15.
For baryon density, we can apply both power spectrum and halo finder post hoc analyses. In this case, instead of building an optimization strategy of high complexity with three models in mind: power spectrum, halo finder, and compression ratio, we only optimize based on two of the three. Those are powerspectrumtoratio and halofindertoratio, from which we choose one as acceptable condition for both. For example, once we determine the optimized compression configuration of all partitions for power spectrum, we then evaluate if the condition is also acceptable for halo finder. If acceptable, this combination of compression configurations is set; if unacceptable, we further apply the optimization of halo finder and ratio and then set the combination result as a boundary condition. The optimization between halo finder and compression ratio is similar to the above FFTbased post hoc analysis optimization, but it has to change the boundary condition with Equation 11. In this case, we need each partition’s mean value for bitrate curve estimation as well as its number of cells with a weighted value range of for halo finder estimation. However, both parameters and optimizations can be calculated with little effort compared to the computationally intensive post hoc analysis and cosmological simulation, hence introducing a very low overhead to the system with higher compression efficiency.
For both power spectrum and halo finder post hoc analysis optimizations, we also introduce thresholds for errorbound control. This is because there may be a few partitions that may not fit our models well and produce highly unreasonable errorbound options, which can be harmful to the efficiency of our strategy. Thus, we set the highest and lowest errorbound thresholds to and , respectively, where is the average error bound over all partitions.
4. Experimental Evaluation
We present the evaluation results of our framework for a finegrained adaptive lossy compression based on our ratequality modeling. We compare our results based on the de facto static configuration that is normally applied to datasets at the start of simulations in terms of both snapshots and multiple simulation scales. Finally, we evaluate and discuss the performance of our solution for improved compression quality.
4.1. Experimental Setup and Dataset
Dimension  Size  Field  Value Range 

6.6 GB 52 GB 352 GB  Baryon Density  
Dark Matter Density  
Temperature  
Velocity 
We conduct our evaluation with Foresight (foresightgit)
, an opensource toolkit used to evaluate, analyze, and visualize lossy compressors for extremescale cosmological simulations
(grosset2020foresight). We modified the toolkit so that we can gather necessary parameters for our framework to deploy adaptive lossy compression configuration to various data partitions in Nyx cosmological simulation. The details of the Nyx datasets used for evaluation are shown in Table 2. The Nyx dataset is provided by the Nyx development team at Lawrence Berkeley National Laboratory (k8gbvq7819). It is a singlelevel grid structure without adaptive mesh refinement (AMR) and does not include particle data. It contains six 3D arrays, include baryon density , dark matter density , temperature , and velocity in three directions . The dataset is in HDF5 file format (folk1999hdf5). Our experiment platforms include the Cori system (cori) at NERSC and the Frontera system (frontera) and its subsystem Longhorn system (longhorn) at TACC, of which each GPU node is equipped with 4 Nvidia Tesla V100 GPUs (nv100) per node.4.2. QualityRatio Evaluation
We first experimentally demonstrate the effectiveness of our proposed solution. Figure 11 shows the visualization of errorbound of all partitions by adaptive optimization to the tested dataset. data partitions have been assigned with various errorbound, instead of the traditional method that compresses the entire dataset with a single errorbound configuration. Moreover, the presented temperature data only serve for power spectrum as post hoc analysis, thus qualityratio of partitions are effectively traded based on their compressibility since no features are expected to be preserved for power spectrum based on our previous analysis in Section 3.
In Figure 12, we present the optimization efficiency of our optimization solution, where the bitquality ratio is the derivative of the bitquality curve. Traditionally, we compress every data partition with the same errorbound which results in a disorganized bitquality ratio. This indicates the potential compression efficiency improvement by exchanging compression ratio and post hoc analysis data quality between partitions for a higher overall compression ratio and better overall data quality. By doing so, we can significantly improve the bitquality ratio difference of every partition, at which point every partition shares a similar balance between compression ratio and data quality as optimized.
When considering power spectrum as post hoc analysis, which is the case for all data fields other than baryon density in our tested dataset, we generally require ratio of reconstructed data to original data to keep within a for , so that the simulation results can be compared and verified by our cosmological observation capability. Figure 13 demonstrates a power spectrum analysis with baryon density from a Nyx dataset. Note that one data point from data compressed by the traditional method has exceeded the acceptable error range. Our strategy, however, successfully bounds the analysis error within the acceptable error range without expensive trialanderror. Note that in our experiment we choose from Equation 10 mapped to an acceptable error range. Based on our model, this can provide a 95.4% of confidence for no escaping error. In practice, this decision can be changed by potentially lower overall compression ratio for higher post hoc analysis error control.
For the baryon density field, not only does it use power spectrum as post hoc analysis, halo finder is also used in this field as an important workflow. From our proposed model in Section 3, we extract the number of cells with a value range between in each partition as features, since their values need to be preserved for halo finder. Figure 14 shows the number of such effective cells varies greatly for given partition, consider xaxis is log scaled for information demonstration. A dispersed histogram means a high potential of preserving features for some partitions while sacrificing features from others to achieve high compression efficiency. Note that here we do not need to redetermine this number every time we select a new error bound for the partition, because in small ranges the value distribution can be considered as evenly distributed as discussed in Section 3. We only need to extract this number once with (a fairly high errorbound setting) and get ebcell function , where is the number of effective cells found with . We also note that such effective cells also have a correlation to mean value, usually get more from partitions with higher mean values. This illustrates a competition between selecting error bound based on compressibility or based on feature density. However, in our evaluation, when optimizing based on the qualityratio modeling for FFTbased analysis, halo finder result can also be satisfied so that the RMSE of alternated halo mass and original halo mass remains . Unless a higher accuracy is needed by halo finder, we use the optimization strategy discussed in Section 3, which can provide 29.8% higher accuracy in terms of halo mass error, compared to the traditional method.
Figure 15 shows the compression ratio improvement of our finegrained adaptive lossy compression solution over the traditional method on the tested dataset while satisfying the data quality requirement for post hoc analysis. Considering all 6 fields in the tested Nyx dataset, our solution provides an improvement of compression ratio by 56.0% on average. The main advantage of our proposed technique is a more precise errorbound control for every partitions. On the other hand, our solution also requires no trialanderror effort thanks to our specifically designed post hoc analysis error modeling, which can provide a nearoptimal solution with little optimization overhead imposed. The traditional method traverses all possible error bounds for a given dataset and can only provide a suboptimal solution to maintain the high post hoc analysis quality. For example, velocity data in Nyx simulation are highly random (tao2017exploration; jin2020understanding), thus simply deploying different error bounds to different partitions will not provide a very high compression efficiency. The improvement on velocity data shown in Figure 15 mainly because of the highly accurate errorbound estimation. In fact, in order to guarantee the unpredictable post hoc analysis error within acceptable for multiple snapshots, simulation users usually choose a relatively lower errorbound for lossy compressor based on empirical studies (jin2020understanding) compared to the optimized solution.
In Figure 16, we show the performance of our solution with various redshifts evolved to reduce over time (i.e., from 54 to 42 in our tested data). The static implementation of our solution (in yellow) is to optimize the error bounds for all partitions once at the early stage of simulation when the redshift is lower and keep using them for all following snapshots. We observe that the static implementation impacts the compression ratio to some extent, since the simulation data evolve to a different state and require an adjustment of errorbound combination for the highest compression efficiency.
Figure 17 illustrates the difference of errorbound configurations optimized by the earlystage dataset and by the redshift42 dataset. Most of the data partitions in the early stage with larger redshift are smooth and close to each other, resulting in similar optimized error bounds for all partitions. Compared to the traditional method, our solution can provide a consistent improvement on multiple snapshots. Note the improvement increases slightly as the redshift decreases. This is because cosmological simulation evolves to sparser formation, which increases the feature density or compressibility difference between partitions and thus can be beneficial from our proposed solution.
We also evaluate our solution with different partition sizes and show the result in Figure 18. We can observe that the overall compression ratio improvement increases as the partition size decreases, from 27.1% to 56.0% for partition dimension from 512 to 64, respectively. This is because larger partition size averages out the qualityratio difference between partitions, which leads to a lower gain of compression efficiency from separately configuring error bounds to partitions, but it relies on a more accurate estimation of optimized error bounds. We suggest lowering the size of partition for higher utilization of our solution. Lastly, we evaluate our solution on the Nyx dataset with different simulation scales, shown in Figure 19. It illustrates that our solution provides a consistent improvement over the traditional method on different simulation scales. More specifically, it provides average compression ratio improvement compared to the traditional method by 56.0% and 51.9% for simulation scale of 512 and 1024, respectively.
4.3. Performance Evaluation and Discussion
Traditional methods find desirable error bounds for a given dataset by traversing all possible candidates. With a smooth curve of ratio to errorbound for SZ lossy compression, a statically optimized error bound for the entire dataset can be always found by the trialanderror process if there is no time limit. However, this is challenging for scientific datasets where quicktocompute general distortion metrics cannot represent data quality for post hoc analysis purposes (jin2020understanding), and each trail requires compression, decompression, and post hoc analysis. Thus, this method is not scalable and cannot be applied to extremescale simulations. Not to mention if pursuing higher compression efficiency by varying configuration for individual partition, the traditional trialanderror method is simply infeasible considering the combination possibility of different partitions (e.g., ). In comparison, our solution only requires a very low overhead for the errorbound combination optimization with our theoretical models. As a result, our solution can be adopted to all snapshots regardless of the simulation scale. For FFTbased distortion metrics used for other scientific simulations, we have provided solutions in Section 3. For other types of post hoc analysis, we propose to study the information necessary and build a specific model based on it for feature preserving, similar to our modeling methodology for halo finder.
In terms of the in situ overhead of our proposed solution, it only requires the mean value of each partition and the overall mean value of the dataset as parameters for all data fields.
To compute the mean value of each partition, it only takes about of the compression time overhead using our tested CPUs, while it takes almost no overhead on the tested GPUs.
The overall mean value can be gathered by MPI_Allreduce
after each partition generates its own mean value, which is almost negligible compared to compression throughput (e.g., 31.6 GB/s on a V100 GPU with cuSZ (tian2020cusz)), not to mention both baryon density and dark matter density that require no MPI_Allreduce
operation due to their fixed overall mean value set by the simulation.
Moreover, we also need the number of effective cells as an extra parameter for baryon density. This process takes an extra time overhead of up to 5% of compression time in our experiment.
Overall, all the overheads introduced by our optimization are negligible comparing to the time of cosmological simulation or compression.
Therefore, our solution introduces very little overhead and provides a high compression efficiency improvement.
5. Conclusion and Future Work
In this paper, we propose to adaptively configure lossy compression in data partitions for cosmological simulation with newly designed ratequality models for posthoc analysis. We first propose an accurate model to estimate the postanalysis result considering lossy compression error, as well as a compressionratio model based on theoretical analysis and experimental evaluation. We then further develop an adaptive optimization solution based on our models to dynamically determine the optimal compression parameters for different partitions instead of static selection, maximizing compression efficiency while offering an error control to posthoc analysis. Our proposed finegrained approach further improves the compression ratio by up to 73% on the tested dataset compared to existing methods, making the compression ratio reach to , with a very low time overhead of only 1%. Our future work is to further apply our approach to other HPC applications and posthoc analysis metrics such as climate simulation with SSIM.
Acknowledgments
This work has been authored by employees of Triad National Security, LLC which operates Los Alamos National Laboratory under Contract No. 89233218CNA000001 with the U.S. Department of Energy/National Nuclear Security Administration. This research was supported by the Exasky Exascale Computing Project (17SC20SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. This research was supported by the U.S. National Science Foundation under Grants OAC2034169 and OAC2042084. We would like to thank NERSC for providing access to the Cori supercomputer and TACC for providing access to the Frontera supercomputer. We would like to thank the NYX team at Lawrence Berkeley National Laboratory for granting us access to cosmology datasets.
Comments
There are no comments yet.