1. Introduction
Deep neural networks (DNNs) have rapidly evolved into the state-of-the-art technique in many science and technology areas. For instance, the CANDLE Project (Wozniak et al., 2018) launched by the U.S. Department of Energy Exascale Computing Project aims to exploit DNNs for cancer research on distributed learning environments in top-tier supercomputers, such as Summit (Summit Supercomputer, 2019) at the Oak Ridge Leadership Computing Facility, Theta (Theta Supercomputer, 2019) at the Argonne Leadership Computing Facility, and Cori (Cori Supercomputer, 2019) at the National Energy Research Scientific Computing Center. DNNs contain millions of parameters in an unparalleled representation, which is efficient for modeling complex nonlinearities. Thus, using either deeper or larger DNNs can be an effective way to improve data analysis. As pointed out by Wang et al. (2018), the deep learning community has acknowledged that increasing the scale of DNNs can improve the inference accuracy of image recognition tasks. A 9-layer AlexNet (Krizhevsky et al., 2012), for example, won the 2012 ILSVRC (ImageNet Large-Scale Visual Recognition Challenge) (Large Scale Visual Recognition Challenge, 2019) with a top-5 accuracy of 83%. In the 2014 ILSVRC, a 22-layer GoogLeNet (Szegedy et al., 2015) further improved the record of top-5 accuracy to 93.3%. He et al. proposed a 152-layer ResNet (He et al., 2016), which raised the record to 96.43% in the 2015 ILSVRC. This trend suggests that networks will grow larger in the future.
The ever-increasing growth of networks is bringing more and more challenges to systems with limited resources. For instance, one practical challenge is to deliver multiple latest DNN models (i.e., on the order of tens to hundreds of megabytes for each model) from cloud to mobile devices through bandwidth-limited networks. Such an issue also arises in high-performance computing (HPC) systems because of their limited I/O bandwidths; as indicated by many recent studies (Tao et al., 2017b; Liang et al., 2018; Tao et al., 2019), the I/O bottleneck has become one of the most serious issues for the overall performance of extreme-scale executions.
For example, the CANDLE workflow targets the problem of hyperparameter exploration of DNNs (Wozniak et al., 2018). A series of tasks would be launched across multiple nodes to optimize different models. Upon completion of the hyperparameter optimization and training process of one DNN model, the model needs to be stored in a parallel file system. Considering multiple optimization tasks in parallel, each with multiple large-scale DNN models, the performance of running a deep learning workflow (such as CANDLE) would be hindered significantly by the inevitable I/O bottleneck on an exascale HPC system. Compressing neural networks provides an effective way to reduce the burden of these problems. Most approaches, however, have focused on simplification methods, such as network pruning (Han et al., 2015b) and quantization (Han et al., 2015a), which suffer from limited compression quality. A straightforward idea is to leverage existing lossy compression and encoding techniques (Reagen et al., 2017) to significantly improve the ratio for compressing DNNs. The existing compression strategies applied to DNNs, however, do not have error-bounded features and thus may greatly distort the data, leading to expensive fine-tuning overhead (i.e., an extra, costly retraining process).

In this paper, we propose DeepSZ: a lossy compression framework for DNNs. DeepSZ is composed of four key steps: network pruning, error bound assessment, optimization of error bound configuration, and compressed model generation. Unlike traditional compression methods used on DNNs, we perform error-bounded lossy compression on the pruned weights, an approach that can significantly reduce the data size while restricting the loss of inference accuracy. (Loss of inference accuracy is defined as the difference between the real accuracy and the target accuracy.) Specifically, we adapt our developed SZ lossy compression framework (Di and Cappello, 2016; Tao et al., 2017b; Liang et al., 2018) to fit the context of DNN compression. In this compression framework, each data point's value is predicted from its neighboring data points by an adaptive, best-fit prediction method (either a Lorenzo predictor or a linear-regression-based predictor (Liang et al., 2018)). Then, each floating-point weight value is converted to an integer number by a linear-scaling quantization based on the difference between the real value and the predicted value and on a specific error bound. Huffman encoding or other lossless compression such as Zstd (Zstandard, 2018) and Blosc (Blosc compressor, 2018) is applied thereafter to significantly reduce the data size. SZ can get a much higher compression ratio on the nonzero weights than other state-of-the-art compressors such as ZFP (Lindstrom, 2014) can, especially because of its efficient linear-scaling quantization, which contrasts with the simple vector quantization applied to the original weights in other related work (Gong et al., 2014; Han et al., 2015a). Moreover, our SZ compressor can control errors in more sophisticated ways, such as relative error bound and peak signal-to-noise ratio (PSNR).

Designing an efficient lossy compression framework for DNNs raises two important challenges. (1) How can we determine an appropriate error bound for each layer in the neural network? Specifically, we have to explore a feasible range of error bounds for each layer, under which the lossy compression should still yield a high inference accuracy for users. (2) How can we maximize the overall compression ratio across the different layers in the DNN under a user-specified loss of inference accuracy? Considering the heterogeneous and diverse data features across multiple layers, we have to explore the best-fit error bounds for the compression of different layers. A straightforward idea is to traverse all the possible error-bound combinations on different layers, which would lead to an extremely high time complexity. To address this issue, we develop a dynamic strategy to efficiently determine the best-fit error bound for each layer.
The contributions of this work are summarized as follows.

We propose a novel, inference-accuracy-loss-bounded framework, called DeepSZ, by using error-bounded lossy compression to compress DNNs. To the best of our knowledge, this is the first attempt to do so.

We propose an adaptive method to select the feasible range of error bounds for each layer. We also develop an effective model to estimate the overall inference accuracy degradation or loss based on the forward-propagation results with individual layers reconstructed from the error-bounded lossy compressor.

We develop a dynamic algorithm to optimize the combined configuration of the different layers' error bounds, significantly reducing the overall size of the neural network. In addition, because of our careful design of the accurate error control, our solution also effectively eliminates the costly retraining overhead that was generally introduced by other DNN compression methods (Han et al., 2015a; Reagen et al., 2017).

We compare DeepSZ with two other state-of-the-art works (i.e., Deep Compression and Weightless) based on four well-known neural networks. Evaluation results demonstrate that the compression ratio of DeepSZ is 1.21× to 1.43× higher than that of the other two approaches. Experiments show that DeepSZ can obtain 1.8× to 4.0× encoding performance improvement on four Nvidia Tesla V100 GPUs and 4.5× to 6.2× decoding performance improvement on an Intel Xeon Gold 6148 CPU over the second-best approach.
The rest of the paper is organized as follows. In Section 2, we discuss the background and motivation of our research. In Section 3, we describe the design methodologies of the DeepSZ framework in detail. In Section 4, we provide a detailed analysis and comparison of DeepSZ and two other state-of-the-art approaches. In Section 5, we present the evaluation results on four well-known DNNs with multiple GPUs. In Section 6, we discuss related work. In Section 7, we summarize our conclusions and briefly present ideas for future work.
2. Background and Motivation
In this section, we present some background information about neural networks and lossy compression for floating-point data and discuss our research motivation.
2.1. Neural Networks
Neural networks have been widely studied and used in recent years and have produced dramatic improvements in many scientific and engineering areas, including computer vision (Rowley et al., 1998), speech recognition (Graves et al., 2013), and natural language processing (Collobert and Weston, 2008). Each neural network is composed of multiple processing layers. Among the various kinds of layers, convolutional layers and fully connected layers (denoted by fc-layers) have contributed the most to the recent progress in the deep learning community, especially in vision-related tasks such as image classification and object detection. One convolutional layer consists of a set of filters that slide over the input data and perform convolution with the signal in the sliding window. The fc-layers are connected by a dense weight matrix and forward the signals by a matrix-matrix multiplication. The filters in the convolutional layers and the weight matrices in the fc-layers dominate the storage space of the neural networks, which will become larger as the networks become deeper or wider. In neural networks, the forward pass refers to the calculation process, which traverses from the first layer to the last layer. The backward pass refers to the process of updating the weights by stochastic gradient descent, which traverses from the last layer backward to the first layer. During the training period, both forward and backward passes are performed, whereas only the forward pass is performed for testing. In the following discussion, test refers to the forward pass process on the test dataset for generating the inference accuracy of the neural network.

According to prior studies (Alwani et al., 2016), although convolutional layers occupy most of the computation time (95%) because of the expensive convolution operations, they take up little storage space (5%). On the other hand, fc-layers require large storage space (95%) because of the large dense matrices, while consuming little computation time (5%). This phenomenon is also verified in our experiments. For demonstration purposes, we present the breakdown of storage and computational overhead for four well-known networks. As shown in Table 1, the fc-layers take the majority of the networks' storage space (i.e., 89.4% to 96.1%) in the three networks that contain convolutional layers; however, they have much lower computational cost (i.e., about 1% to 2% for VGG-16 and AlexNet and 20% for LeNet-5) than the convolutional layers do. Hence, we are motivated to leverage lossy compression techniques in order to trade computation time for storage space on the fc-layers in resource-constrained scenarios; our aim is to significantly reduce the storage requirements of neural networks while introducing little computation overhead.


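                    | LeNet-300-100 | LeNet-5 | AlexNet  | VGG-16
--------------------+---------------+---------+----------+---------
conv layers         | 0             | 3       | 5        | 13
fc-layers           | 3             | 2       | 3        | 3
ip1/fc6             |               |         |          |
ip2/fc7             |               |         |          |
ip3/fc8             |               |         |          |
conv fwd time       | 0 ms          | 0.5 ms  | 116.5 ms | 149.8 ms
fc fwd time         | 0.30 ms       | 0.12 ms | 2.5 ms   | 1.7 ms
total size          | 1.1 MB        | 1.7 MB  | 243.9 MB | 553.4 MB
fc-layers' size (%) | 100%          | 95.3%   | 96.1%    | 89.4%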
2.2. Lossy Compression for Floating-Point Data
Floating-point data compression has been studied for decades. The data compressors can be split into two categories: lossless and lossy. Lossless compressors such as GZIP (Deutsch, 1996), FPC (Burtscher and Ratanaworabhan, 2007), FPZIP (Lindstrom and Isenburg, 2006), Blosc (Alted, 2017), and decimation-based lossless compressors (Ainsworth et al., 2017) cannot significantly reduce the floating-point data size because of the significant randomness of the ending mantissa bits. The compression ratios of lossless compression are generally limited to 2:1, according to recent studies (Son et al., 2014; Lindstrom, 2017).
Many lossy compressors supporting floating-point data were proposed originally for visualization. Hence, many lossy compressors employ techniques directly inherited from lossy compression of images, such as variations of wavelet transforms, coefficient prioritization, and vector quantization (Goldschneider, 1997; Li et al., 2018). Lossy compressors for image processing, such as JPEG (Wallace, 1992) and JPEG2000 (Taubman and Marcellin, 2012), are designed and optimized with human perception in mind. While such compressors may be adequate for scientific visualization, they do not provide pointwise error controls on demand. For example, most lossy compressors designed for visualization do not provide control of a global upper bound on the compression error (the maximum compression error, or $L_\infty$ norm of the compression error).
A new generation of lossy compression techniques for floating-point data has been developed recently. SZ, ZFP, and MGARD are three typical error-bounded compressors. (Some compressors such as ISABELA (Lakshminarasimhan et al., 2011) were designed with pointwise error control, but tests (Di and Cappello, 2016) have shown that the maximum error could be much larger than the user-set error bound.) SZ (Di and Cappello, 2016; Tao et al., 2017b; Liang et al., 2018) predicts each data point's value from its neighboring data points in a multidimensional space with an adaptive predictor (using either a Lorenzo predictor (Ibarria et al., 2003) or linear regression (Liang et al., 2018)). Next, it performs an error-controlled linear-scaling quantization to convert all floating-point values to an array of integer numbers. Finally, it performs a customized Huffman coding and lossless compression to shrink the data size significantly. ZFP (Lindstrom, 2014) splits the whole dataset into many small blocks and compresses the data in each block separately in four steps: alignment of exponent, orthogonal transform, fixed-point integer conversion, and bit-plane-based embedded coding. MGARD uses multigrid methods to compress multidimensional floating-point data (Ainsworth et al., 2018). Many independent studies (Di and Cappello, 2016; Tao et al., 2017b; Liang et al., 2018; Lu et al., 2018; Tao et al., 2017a) have shown that SZ outperforms the other two compressors in terms of compression ratio, especially on 1D floating-point datasets; note that the datasets to compress in our case are 1D floating-point arrays after conversion.
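The prediction and quantization stages described above can be illustrated with a minimal sketch. It is not the SZ implementation: it assumes a 1D array, uses the previous reconstructed value as a simple Lorenzo-style predictor, and omits the regression predictor, the handling of unpredictable values, and the Huffman/lossless stages. The function names `sz_like_quantize` and `sz_like_dequantize` are hypothetical.

```python
def sz_like_quantize(data, error_bound):
    """Illustrative prediction + linear-scaling quantization on a 1D array.

    Each value is predicted by the previous *reconstructed* value, and the
    prediction residual is mapped to an integer code with bin width
    2 * error_bound, so the pointwise error stays within error_bound.
    """
    step = 2.0 * error_bound
    codes, recon = [], []
    prev = 0.0  # predictor state: last reconstructed value
    for x in data:
        code = round((x - prev) / step)  # integer quantization code
        prev = prev + code * step        # value the decompressor will see
        codes.append(code)
        recon.append(prev)
    return codes, recon


def sz_like_dequantize(codes, error_bound):
    """Rebuild the values from the integer codes alone."""
    step = 2.0 * error_bound
    out, prev = [], 0.0
    for code in codes:
        prev = prev + code * step
        out.append(prev)
    return out


weights = [0.012, 0.015, -0.031, 0.044, 0.0405, -0.002]
codes, recon = sz_like_quantize(weights, 1e-3)
max_err = max(abs(w - r) for w, r in zip(weights, recon))
print(codes, max_err)
```

Because the predictor uses reconstructed rather than original values, the decompressor can replay the same predictions, and the integer codes are what an entropy coder would subsequently shrink.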
Today's lossy compression techniques have been used in HPC scientific applications for saving storage space and reducing the I/O cost of saving data. However, how to effectively and efficiently utilize error-bounded lossy compressors to significantly reduce the neural network size and encoding time, while still maintaining a high inference accuracy, remains an open question.
3. Design Methodologies
In this section, we describe in detail DeepSZ, our proposed lossy compression framework for neural networks.
3.1. Overview of DeepSZ Framework
The general workflow of the DeepSZ framework is presented in Figure 1. As illustrated in the figure, DeepSZ consists of four key steps: network pruning, error bound assessment, optimization of the error bound configuration, and generation of the compressed model. The first step is to adopt network pruning in order to reduce the network complexity and mitigate the overfitting problem caused by the large number of parameters in the network. The second step is to apply the error-bounded lossy compression to the pruned fc-layers and assess the impacts of different error bounds on the inference accuracy for different fc-layers. Based on the inference accuracy degradation, DeepSZ identifies the feasible range of error bounds for each fc-layer and collects the results of inference accuracy degradation and compressed layer size under these bounds. This step can effectively narrow the range of the best-fit error bounds for each layer. Note that we focus only on the fc-layers in this work because they dominate the storage space, as discussed in Section 2. The third step is to determine the best-fit error bound for each fc-layer based on the narrowed feasible range generated by the second step. DeepSZ compresses the network as much as possible while satisfying the user-set inference accuracy requirement. The fourth step is to generate the compressed network based on the optimized error bounds and the best-fit lossless compressor. In the remainder of this section, we discuss each step of DeepSZ in detail.
3.2. Network Pruning
An fc-layer in DNNs can be represented by a floating-point matrix. Each nonzero element in the matrix represents the weight of one connection between the previous layer and the current layer. Previous studies on modern neural network models (LeCun et al., 1990; Hassibi and Stork, 1993) have shown that most of the weights in fc-layers are redundant and can be pruned without any impact on the inference accuracy. Moreover, network pruning is an effective way to prevent the DNN model from overfitting.
We build our pruning method on top of prior state-of-the-art techniques (Han et al., 2015a), which can prune DNNs without loss of inference accuracy. We first set up a threshold for each fc-layer and prune its weights based on this threshold: every weight whose magnitude is below the threshold is removed. The thresholds are set based on the predefined pruning ratios suggested by previous studies (Han et al., 2015b). Then, we retrain the network with masks on the fc-layers (i.e., pruned, zero-valued weights are marked so that they remain unchanged) such that the weights that have been pruned are kept at zero. This pruning method is called magnitude threshold plus retraining (Magnitude). Note that this process can start from well-trained networks; more details can be found in (Han et al., 2015b). We note that Reagen et al. (2017) presented another pruning method, dynamic network surgery (DNS), and evaluated its performance on several networks. The time overhead of DNS applied to large networks (such as VGG-16) is very high, however, because DNS needs to iteratively prune the weights and retrain the network based on increasing thresholds. In contrast, Magnitude has a relatively low time overhead and thus can be readily applied to large neural networks. Therefore, we focus on the Magnitude method in this paper.
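The magnitude-threshold pruning and masked retraining step can be sketched as follows. This is a minimal illustration on a flat list of weights, not the training code used in the paper; `magnitude_prune`, `masked_sgd_step`, and `keep_ratio` are hypothetical names, and the sketch assumes that the pruning ratio denotes the fraction of weights that is kept.

```python
def magnitude_prune(weights, keep_ratio):
    """Zero every weight whose magnitude is below a layer-wide threshold.

    The threshold is chosen so that roughly `keep_ratio` of the weights
    survive (ties at the threshold may keep a few more). Returns the pruned
    weights and a boolean mask marking the surviving positions.
    """
    n_keep = max(1, round(len(weights) * keep_ratio))
    threshold = sorted((abs(w) for w in weights), reverse=True)[n_keep - 1]
    mask = [abs(w) >= threshold for w in weights]
    pruned = [w if keep else 0.0 for w, keep in zip(weights, mask)]
    return pruned, mask


def masked_sgd_step(weights, grads, mask, lr):
    """One retraining update that leaves pruned positions at exactly zero."""
    return [w - lr * g if keep else 0.0
            for w, g, keep in zip(weights, grads, mask)]


layer = [0.5, -0.01, 0.02, -0.8, 0.03, 0.001, 0.3, -0.04, 0.05, 0.9]
pruned, mask = magnitude_prune(layer, keep_ratio=0.3)
updated = masked_sgd_step(pruned, [1.0] * len(layer), mask, lr=0.1)
print(pruned)
print(updated)
```

In a real framework the mask would be applied to the gradient or weight tensors of each fc-layer after every optimizer step; the list-based version here only shows the logic.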
After the network pruning, the weight matrix becomes sparse, so it can be represented by a sparse matrix format, such as the compressed sparse row (CSR) or compressed sparse column (CSC) format. Unlike the traditional format that uses three 1D arrays (e.g., arrays for nonzero values, the extents of rows, and column indices in CSR), we use only two 1D arrays to represent one fc-layer after the pruning. One array, named dataArray, stores the floating-point weights (32 bits per value); the other, named indexArray, stores the index differences between two consecutive nonzero weights (8 bits per value). We note that the nonzero floating-point weights will be condensed into a 1D array or linked list regardless of the sparse matrix representation format. Similar to (Han et al., 2015a), if the index difference exceeds 256 (i.e., $2^8$), we additionally save a zero padding to dataArray and a corresponding filler entry to indexArray. Here we use the sparse matrix representation because, based on our experiments, the inference accuracy can drop sharply (i.e., to 20% on the tested networks) if the lossy compression is applied to the matrices of pruned weights (i.e., 2D arrays). Note that the real compression ratio after the pruning step (i.e., original size divided by the CSR size) is always lower than the compression ratio that we set for the pruning (i.e., one divided by the pruning ratio), because after the pruning every nonzero weight is represented by 40 bits (8 for the index and 32 for the data), which is slightly larger than the original 32 bits. Based on our evaluation results, the pruning step can typically reduce the size of the fc-layers by about 8× to 20× if the pruning ratio is set to around 4% to 10%.
3.3. Error Bound Assessment
SZ usually has a much higher compression ratio on 1D datasets than other lossy compression methods do (Di and Cappello, 2016; Tao et al., 2017a). Our floating-point datasets are 1D dataArrays, as described in Section 3.2. For demonstration purposes, we evaluated SZ and ZFP on the 1D dataArray of each fc-layer in AlexNet and VGG-16; many recent studies, such as (Tao et al., 2019), have demonstrated that SZ and ZFP are two leading lossy compressors for floating-point data. The compression ratios are presented in Figure 2. The figure shows that SZ consistently outperforms ZFP in terms of compression ratio on the tested fc-layers under each of the three absolute error bounds evaluated. Although SZ has higher compression and decompression times than ZFP does (Tao et al., 2017b), they are still much lower than the time overheads of the forward or backward pass in neural networks. Taking these facts into consideration, we propose to use SZ lossy compression in our DeepSZ framework.
Error-bounded lossy compression can provide a high compression ratio but can also introduce bounded errors into neural networks, leading to a possible loss of inference accuracy. Thus, before adopting SZ to compress the fc-layers, we need to find the best-fit error bound for each layer. Our idea is to narrow the search for the best-fit error bounds by identifying a feasible error bound range for each fc-layer with a high compression ratio and a bounded loss of inference accuracy (detailed in this subsection) and then to fine-tune the best-fit error bound within the range (detailed in the next subsection). To this end, we need to understand the impact of different error bounds of each fc-layer on the overall inference accuracy. Specifically, we use SZ to compress each fc-layer's dataArray with different error bounds and use its decompressed array to reconstruct the fc-layer while leaving the other fc-layers uncompressed and unchanged. Based on the reconstructed network (in which only one fc-layer is modified), we can perform the forward pass on the test data to generate the inference accuracy. Thus, we can get a series of inference accuracies based on different error bounds for each fc-layer in the network. For example, Figure 3 presents the accuracies over a range of absolute error bounds for the three fc-layers in AlexNet. Note that in order to reduce the time overhead, we choose only a group of error bounds for compression, decompression, and checking inference accuracy rather than all the error bounds shown in Figure 3. In the following text we describe how to determine these error bounds.
In our solution, we test the inference accuracy with only one compressed layer in every test, instead of using a brute-force method to search all possible test cases (i.e., all combinations of error bounds across all layers), for the following two reasons. (1) We observe that the fc-layers in neural networks usually have independent characteristics in the context of SZ lossy compression. That is, two reconstructed fc-layers based on SZ affect the overall accuracy independently. Thus, the overall loss of inference accuracy can be composed of (and thus estimated by) the losses of inference accuracy introduced by individual layers. We will discuss more details in Section 3.4. (2) Checking the inference accuracy in a test with only one reconstructed layer using multiple error bounds has much less computational cost than does a brute-force method involving every possible combination of error bounds across multiple layers. Specifically, our solution has a linear time complexity, whereas the brute-force method has an exponential time complexity. VGG-16, for instance, has three fc-layers. Assume we have 10 candidate error bounds for each layer. Then the brute-force method needs to check 1,000 test cases, each involving one compression, one decompression, and one forward-pass test. By comparison, our solution has only 30 test cases to check, thus reducing the testing time to 3% of that of the brute-force method.
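The test counts in the VGG-16 example follow directly from the two search strategies; the short calculation below reproduces them. The function name `num_test_cases` is hypothetical and is used only for this illustration.

```python
def num_test_cases(candidates_per_layer, num_layers):
    """Count accuracy tests for the two search strategies.

    Brute force tries every combination of error bounds across layers
    (exponential in the number of layers); the layer-wise assessment
    varies one layer at a time (linear in the number of layers).
    """
    brute_force = candidates_per_layer ** num_layers
    layer_wise = candidates_per_layer * num_layers
    return brute_force, layer_wise


brute_force, layer_wise = num_test_cases(candidates_per_layer=10, num_layers=3)
print(brute_force, layer_wise, layer_wise / brute_force)  # 1000 30 0.03
```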
We propose an algorithm to identify the feasible range of error bounds and to collect the inference accuracy degradation and compressed layer size under these bounds for each fc-layer. We present the pseudocode in Algorithm 1. The inputs of the algorithm include the architecture of the neural network, the pruned weights of the network, and the user-set loss of inference accuracy. Lines 12–21 show the main loop of this algorithm. Specifically, when the loss of inference accuracy exceeds a criterion of 0.1% (called the distortion criterion) in terms of absolute percentage, we treat the reconstructed network as distorted. We search for the feasible range of error bounds by checking the accuracy under multiple error bounds. The error bound to check starts from a default initial value (the reason for this choice will be discussed in Section 5.1) and is increased by an order of magnitude (10×) each time. Once the accuracy loss exceeds the distortion criterion (i.e., 0.1%) at a certain error bound, the starting point of the range is determined from that error bound. Note that the default initial value can be further decreased for different neural networks.
In Lines 1–10, we determine the ending point of the range based on the accuracy of the reconstructed network. The ending point is the first error bound at which the accuracy drops below the user's expected accuracy. Once the feasible range is generated (as shown in Figure 3), we conduct the tests with the error bounds in the feasible range and collect the sizes of the compressed layer and the accuracy degradation results. Lines 6–8 describe how we choose the error bounds within the range; for a given range, for example, five error bounds within it are tested.
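The range search described above can be sketched as follows. This is a hedged illustration rather than Algorithm 1 itself: `accuracy_loss` is a hypothetical callback that compresses one fc-layer at a given error bound, runs the forward pass on the test data, and returns the accuracy loss in percentage points. The sketch also assumes that the range starts at the first bound whose loss exceeds the distortion criterion, and it does not reproduce how the bounds inside the range are chosen.

```python
import math


def feasible_range(accuracy_loss, expected_loss, start_eb,
                   distortion=0.1, factor=10.0, max_steps=8):
    """Search a feasible error-bound range for one fc-layer.

    The bound grows by `factor` each step. The range starts at the first
    bound whose loss exceeds `distortion` (assumption of this sketch) and
    ends at the first bound whose loss exceeds `expected_loss`.
    Returns (range_start, range_end); range_end is None if never reached.
    """
    eb = start_eb
    range_start = None
    for _ in range(max_steps):
        loss = accuracy_loss(eb)  # one compress/decompress/test cycle
        if range_start is None and loss > distortion:
            range_start = eb
        if loss > expected_loss:
            if range_start is None:
                range_start = eb
            return range_start, eb
        eb *= factor
    return range_start, None


# Toy stand-in for the real test: loss grows linearly with the bound.
start, end = feasible_range(lambda eb: 50.0 * eb, expected_loss=2.0,
                            start_eb=1e-3)
print(start, end)
```

In practice each call to `accuracy_loss` is the expensive part (a full forward pass over the test set), which is why the search advances by orders of magnitude instead of scanning densely.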
3.4. Optimization of Error Bound Configuration
Depending on the number of input and output neurons, the sizes of different fc-layers vary dramatically. For example, the largest fc-layer of VGG-16 is fc6 (i.e., 25,088 × 4,096), which is 25× larger than the smallest layer fc8 (i.e., 1,000 × 4,096). Compressing larger fc-layers with a higher error bound can lead to a higher compression ratio, which can benefit the overall ratio. However, the higher error bound also introduces more errors into the network, which may in turn degrade the overall accuracy. Therefore, how to determine the optimal error bound for each fc-layer is an important problem.

From our experiments we discovered that, in the context of SZ lossy compression, the overall accuracy loss of the neural network is approximately linear in the accuracy degradation of the individual fc-layers when the targeted loss of inference accuracy is lower than 2% on our tested neural networks. In other words, the overall accuracy loss $\Delta_{\text{total}}$ is equal to the sum of each layer's accuracy degradation $\Delta_i(eb_i)$, as shown in Equation (1), where $eb_i$ is a given arbitrary error bound for each fc-layer $i$:

$$\Delta_{\text{total}} = \sum_{i} \Delta_i(eb_i) \qquad (1)$$
We propose Algorithm 2 to determine the best-fit error bound for each layer. The inputs include the accuracy degradation and compressed size of each fc-layer under our tested error bounds, which are output by the preceding error bound assessment (Section 3.3), and an expected loss of accuracy set by users. Algorithm 2 minimizes the total size of the compressed fc-layers while ensuring that the sum of the accuracy degradation of each layer stays within the expected accuracy loss. Specifically, the first part of Algorithm 2 (Lines 2–14) finds the minimum total size of the compressed fc-layers for different combinations of error bounds up to a certain layer by using a variation of the Knapsack algorithm. Then, the algorithm traces back to determine the error bound for each fc-layer (Lines 15–19). More specifically, for each layer and each accumulated accuracy loss, we save the minimal total size of all the fc-layers up to that layer in a table. After we find the minimum compressed size of all the fc-layers under the constraint of the overall accuracy loss (Lines 13–14), we trace back from this minimum to identify the error bound combination for each layer (Lines 15–19).
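The knapsack-style optimization and trace-back can be sketched as follows. This is an illustrative dynamic program under stated assumptions, not a transcription of Algorithm 2: the accuracy degradation is discretized into units of `unit` percentage points, each layer contributes exactly one candidate, and the names `optimize_error_bounds`, `layers`, and `unit` are hypothetical.

```python
import math


def optimize_error_bounds(layers, expected_loss, unit=0.01):
    """Pick one (error_bound, degradation, size) candidate per fc-layer.

    Minimizes the total compressed size subject to the summed accuracy
    degradation staying within `expected_loss`. Returns
    (error bounds per layer, total size), or None if no combination fits.
    """
    budget = round(expected_loss / unit)
    inf = float("inf")
    best = [inf] * (budget + 1)  # best[j]: min size using exactly j units
    best[0] = 0
    choices = []                 # per layer: (candidate index, previous j)
    for cands in layers:
        nxt = [inf] * (budget + 1)
        pick = [None] * (budget + 1)
        for j, size_so_far in enumerate(best):
            if size_so_far == inf:
                continue
            for c, (_, degradation, size) in enumerate(cands):
                # Round the loss up to whole units (epsilon absorbs float noise).
                cost = max(0, math.ceil(degradation / unit - 1e-9))
                jj = j + cost
                if jj <= budget and size_so_far + size < nxt[jj]:
                    nxt[jj] = size_so_far + size
                    pick[jj] = (c, j)
        best = nxt
        choices.append(pick)
    j = min(range(budget + 1), key=lambda t: best[t])
    if best[j] == inf:
        return None
    total = best[j]
    selected = []
    for k in range(len(layers) - 1, -1, -1):  # trace back layer by layer
        c, j = choices[k][j]
        selected.append(layers[k][c][0])
    selected.reverse()
    return selected, total


# Toy inputs: (error bound, accuracy degradation in %, compressed size).
layers = [
    [(1e-3, 0.02, 100), (1e-2, 0.10, 60), (2e-2, 0.30, 40)],
    [(1e-3, 0.01, 50), (1e-2, 0.05, 30), (2e-2, 0.25, 20)],
]
print(optimize_error_bounds(layers, expected_loss=0.2))  # ([0.01, 0.01], 90)
```

The table has one row per layer and one column per discretized loss level, so its size depends only on the small number of candidates collected during the error bound assessment, not on the size of the network.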
Besides this optimization of the compression ratio under an expected accuracy loss (i.e., expected-accuracy mode), DeepSZ can optimize the overall accuracy under an expected compression ratio (i.e., expected-ratio mode); the two modes are similar to the fixed-accuracy and fixed-rate modes in ZFP (Lindstrom, 2014). The algorithm for the expected-ratio mode is similar to Algorithm 2 but swaps the roles of the compressed size and the accuracy degradation. Based on these two modes, we can fine-tune the balance between accuracy loss and compression ratio for a neural network, which is much more flexible than other state-of-the-art methods.
3.5. Generation of Compressed Model
The last step in our framework is to generate the compressed model by using SZ lossy compression on the dataArray with the error bounds obtained in Step 3 and the best-fit lossless compression on the indexArray. The indexArray represents the locations of nonzero weights, which need to be compressed losslessly. DeepSZ provides three state-of-the-art lossless compressors: Gzip (Deutsch, 1996), Zstandard (Zstandard, 2018), and Blosc (Blosc compressor, 2018). More lossless compressors can be integrated into the framework in the future. In our experiments, we found that Zstandard always leads to the highest compression ratio compared with the other two compressors, as shown in Figure 4. After these four steps of DeepSZ, the compressed neural network model is generated. In this paper, we use encoding to refer to this whole process of generating compressed DNNs and decoding to refer to the process of reconstructing DNNs.
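Selecting the best-fit lossless compressor for an indexArray amounts to trying each candidate and keeping the smallest output. The sketch below illustrates this with Python's standard-library codecs (`zlib`, `bz2`, `lzma`) as stand-ins; Zstandard and Blosc are not assumed to be installed, and the function name `best_fit_lossless` is hypothetical.

```python
import bz2
import lzma
import zlib


def best_fit_lossless(index_bytes, compressors=None):
    """Compress `index_bytes` with every candidate and report the smallest.

    `compressors` maps a name to a bytes -> bytes function. The default
    candidates are standard-library stand-ins, not the compressors
    evaluated in the paper.
    """
    if compressors is None:
        compressors = {
            "zlib": zlib.compress,
            "bz2": bz2.compress,
            "lzma": lzma.compress,
        }
    sizes = {name: len(fn(index_bytes)) for name, fn in compressors.items()}
    best = min(sizes, key=sizes.get)
    return best, sizes


# Toy indexArray: 8-bit index differences with a repeating pattern.
index_array = bytes([3, 1, 4, 1, 5, 9, 2, 6] * 500)
best, sizes = best_fit_lossless(index_array)
print(best, sizes)
```

Additional codecs can be plugged in by passing a custom `compressors` dictionary, which mirrors the extensibility mentioned above.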
Once the network is needed for a forward pass, it must be decoded. During the decoding, DeepSZ decompresses the dataArray using SZ and the indexArray using the best-fit lossless compressor (e.g., Zstandard). Then, the sparse matrix can be reconstructed from the decompressed dataArray and indexArray for each fc-layer. Finally, the whole neural network can be decoded. Note that the computational cost of the decoding in DeepSZ is relatively low compared with that of the forward pass with a batch of images. We will analyze the performance overhead of our decoding in detail and compare it with other state-of-the-art methods in the next section.
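The two-array sparse representation and its reconstruction during decoding can be sketched as follows. This is a minimal illustration on a flattened (row-major) layer: it assumes that one 8-bit indexArray entry stores a gap of at most 255 and that larger gaps are bridged by filler entries with a zero weight. The names `encode_sparse`, `decode_sparse`, and `MAX_STEP` are hypothetical.

```python
MAX_STEP = 255  # largest gap one 8-bit entry can hold (assumption of this sketch)


def encode_sparse(dense):
    """Turn a flattened pruned layer into (dataArray, indexArray).

    dataArray holds the nonzero weights; indexArray holds the distance
    from the previously stored entry. Gaps larger than MAX_STEP are
    bridged by filler entries whose weight is zero.
    """
    data_array, index_array = [], []
    last = -1  # position of the previously stored entry
    for pos, w in enumerate(dense):
        if w == 0.0:
            continue
        gap = pos - last
        while gap > MAX_STEP:
            data_array.append(0.0)        # zero padding
            index_array.append(MAX_STEP)  # filler index step
            gap -= MAX_STEP
        data_array.append(w)
        index_array.append(gap)
        last = pos
    return data_array, bytes(index_array)


def decode_sparse(data_array, index_array, length):
    """Rebuild the flattened dense layer from the two arrays."""
    dense = [0.0] * length
    pos = -1
    for w, step in zip(data_array, index_array):
        pos += step
        dense[pos] = w
    return dense


layer = [0.0] * 1000
layer[2], layer[3], layer[700] = 0.5, -0.25, 0.125
data_array, index_array = encode_sparse(layer)
print(data_array, list(index_array))
assert decode_sparse(data_array, index_array, len(layer)) == layer
```

In the full pipeline the dataArray would be the SZ-decompressed weights and the indexArray the losslessly decompressed indices; the conversion itself is a single linear pass.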
4. Detailed Analysis and Comparison
In this section, we analyze DeepSZ in detail and compare it with two other state-of-the-art solutions: Weightless (Reagen et al., 2017) and Deep Compression (Han et al., 2015a). Our analyses focus on both performance and storage.
4.1. Performance Analysis of DeepSZ
For Algorithm 1 in deepsz, the computational cost lies mostly in performing the tests with different error bounds to check the corresponding accuracies, while compression and decompression both incur negligible time overhead. Let us take AlexNet as an example. Compressing and decompressing one dataArray (about tens of megabytes per fc-layer’s dataArray, as shown in Table 2(c)) and reconstructing the network based on the decompressed layer typically take no more than one second on an Nvidia Tesla V100 GPU,^{7} whereas testing the reconstructed network with 50,000 images in the ImageNet dataset takes about 55 seconds (10 seconds for data transfer, 5 seconds for initialization, and 40 seconds for forward computations). In this case, deepsz needs to perform 12 tests on each fc-layer, bringing the total to 36 tests. Altogether, the 36 tests require forward passes of 1.8 million images, given 50,000 images in the ImageNet test data. Based on our experiments, the execution time of one epoch^{8} is about 42 times higher than that of one test with an Nvidia V100 GPU on AlexNet. Thus, the workload of 1.8 million images in the tests is equivalent to training about 36/42 ≈ 0.86 epochs of data (i.e., one test is about 1/42 of an epoch). Therefore, the time complexity of Algorithm 1 is governed by the number of tests per layer (e.g., 12 for AlexNet), the number of fc-layers (e.g., 3 for AlexNet), and the time complexity of training one epoch of data. For AlexNet, for example, 12 tests per layer over 3 fc-layers cost less than one training epoch.
^{7} Based on a recent study (Tao et al., 2017b), SZ’s compression and decompression rates are about 80 MB/sec and 150 MB/sec, respectively, for the error bound used in that study on a 2.3 GHz Intel Core i7 processor.
^{8} One epoch contains one forward pass and one backward pass of the 1.28 million images in the training dataset and takes about 15.7 minutes for AlexNet on a single Nvidia Tesla V100 GPU, based on our experiments.
For Algorithm 2 in deepsz, because of our optimization in Algorithm 1, the input dimension of the algorithm is very small (for example, 36 pairs of inference accuracy degradation and compressed size for the 3-fc-layer AlexNet). Its time complexity follows that of the knapsack algorithm on this small input, so the computational cost of Algorithm 2 is relatively small compared with the multiple tests of inference accuracy in Step 2. Overall, we conclude that the time complexity of deepsz’s encoding is dominated by these tests, which is much less than that of traditional methods with retraining on the ImageNet dataset. It is worth noting that the scalability of testing (i.e., embarrassing parallelism) is higher than that of parallel training; thus, the time cost of deepsz relative to training will be further reduced with increasing scale.
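The AlexNet workload estimate above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text (12 tests per layer, 3 fc-layers, 50,000 test images, and one epoch costing about 42 tests); the constant and function names are illustrative.

```python
TESTS_PER_LAYER = 12        # error bounds tested per fc-layer (from the text)
NUM_FC_LAYERS = 3           # fc6, fc7, fc8 in AlexNet
IMAGES_PER_TEST = 50_000    # ImageNet test images used by one test
TESTS_PER_EPOCH = 42        # one training epoch costs about 42 tests

def encoding_workload():
    """Return (total tests, images forwarded, equivalent training epochs)."""
    total_tests = TESTS_PER_LAYER * NUM_FC_LAYERS
    total_images = total_tests * IMAGES_PER_TEST
    epoch_equivalent = total_tests / TESTS_PER_EPOCH
    return total_tests, total_images, epoch_equivalent

tests, images, epochs = encoding_workload()
print(f"{tests} tests, {images:,} images, about {epochs:.2f} epochs")
```

Under these figures the whole assessment step costs less than one training epoch.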
For deepsz’s decoding, the computational cost is also comparatively low because it consists of a lossy decompression with SZ, a lossless decompression with the best-fit lossless compressor (e.g., Zstandard), and a sparse-to-dense matrix conversion. The time complexity of each step, and hence of deepsz’s decoding overall, is expressed in terms of the number of pruned weights and the number of original weights.
4.2. Comparison with Weightless
deepsz has four major advantages over Weightless. (1) Weightless has a higher time overhead than deepsz for encoding. After Weightless reconstructs the layer based on the Bloomier filter, the inference accuracy can drop dramatically. For example, the inference accuracy drops about 3% when compressing fc6 in VGG16 using Weightless. Thus, Weightless requires retraining the other layers to recover the overall inference accuracy, whereas deepsz does not require any retraining. (2) Weightless has a higher time overhead than deepsz for decoding. To decode one element, Weightless has to calculate four hash functions based on all the values (including zero values) in the pruned matrix and check the hash table to determine the value of this element, leading to a much higher time overhead compared with deepsz. (3) Weightless can compress only one layer (usually the largest layer). By contrast, deepsz can compress all fc-layers, leading to a higher overall compression ratio. (4) deepsz provides two modes to users. Even in the fixed-accuracy mode, users can set an expected loss of inference accuracy in deepsz and get as high a compression ratio as possible, whereas Weightless is unable to provide such flexibility.
4.3. Comparison with Deep Compression
Similar to Weightless, Deep Compression also requires retraining the whole network to mitigate the inference accuracy loss caused by its quantization. Deep Compression adopts a simple quantization technique on the pruned weights. It quantizes all the nonzero weights to a group of floating-point values based on a codebook. The number of these values in the codebook is always a power of two determined by the number of bits used to represent one weight. Using 5 bits per weight, for example, maps every nonzero weight to a 32-value codebook. Unlike Deep Compression, which applies a simple quantization to the weights, deepsz applies an error-bounded linear-scaling quantization to the difference between the predicted weight and the real weight based on a best-fit prediction method, leading to higher compression ratios and fine-granularity error controls. Similar to Weightless, Deep Compression has lower flexibility than deepsz in terms of the balance between the compression ratio and inference accuracy. Since the number of floating-point values the codebook can represent is fixed by the bit width, the inference accuracy under Deep Compression may drop significantly (shown in Section 5.2) with increasing compression ratios (or lower bit rates), leading to an unbounded loss of inference accuracy.
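To make the contrast with a fixed codebook concrete, the following is a simplified sketch of error-bounded linear-scaling quantization. It is not SZ’s actual code: it assumes a one-dimensional previous-value predictor, places no limit on the number of quantization bins, and omits the entropy coding stage. The function names and the error bound of 1e-3 are illustrative.

```python
import numpy as np

def quantize(values, error_bound):
    """Quantize prediction residuals into integer codes with bin width 2*error_bound."""
    codes = np.empty(len(values), dtype=np.int64)
    recon = np.empty(len(values), dtype=np.float64)
    prev = 0.0  # assumed predictor: the previously reconstructed value
    for i, value in enumerate(values):
        code = int(np.round((value - prev) / (2.0 * error_bound)))
        codes[i] = code
        recon[i] = prev + 2.0 * error_bound * code
        prev = recon[i]  # predict from reconstructed data so the decoder can follow
    return codes, recon

def dequantize(codes, error_bound):
    """Replay the same predictor to recover the reconstructed values from codes."""
    recon = np.empty(len(codes), dtype=np.float64)
    prev = 0.0
    for i, code in enumerate(codes):
        recon[i] = prev + 2.0 * error_bound * code
        prev = recon[i]
    return recon

rng = np.random.default_rng(0)
weights = rng.uniform(-0.3, 0.3, size=1000)  # synthetic stand-in for pruned weights
eb = 1e-3                                    # illustrative absolute error bound
codes, recon = quantize(weights, eb)
max_error = float(np.max(np.abs(weights - recon)))
```

Because each residual is rounded to the nearest multiple of twice the error bound, every reconstructed value stays within the bound of its original, which is the pointwise guarantee a fixed-size codebook cannot give.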
5. Experimental Evaluation
In this section, we evaluate our proposed deepsz framework by comparing it with state-of-the-art methods.
5.1. Experimental Setting
We conduct our evaluation on a single core of a MacBook Pro with an Intel Core i7-8750H processor (with 32 GB of memory) and run parallel experiments using four Nvidia Tesla V100 GPUs (each with 16 GB of memory) on a node of the Pantarhei cluster at the University of Alabama. The four GPUs are connected via NVLink (Foley and Danskin, 2017). We implement deepsz based on the Caffe deep learning framework (Jia et al., 2014) (v1.0) and the SZ lossy compression library (v2.0) (Liang et al., 2018). We evaluate deepsz on four well-known neural networks: LeNet-300-100, LeNet-5, AlexNet, and VGG16. We train/test LeNet-300-100 and LeNet-5 on the MNIST dataset and AlexNet and VGG16 on the ImageNet dataset. These neural networks and datasets are commonly used in evaluation studies (Han et al., 2015a; Reagen et al., 2017). We present the details of their architectures in Table 1. Note that the fc-layers occupy most of the storage space (i.e., 89.4% to 96.1%). We use the default solver (i.e., stochastic gradient descent (SGD)) in Caffe for all training. We set the expected loss of inference accuracy to 0.2% for the two LeNets and 0.4% for AlexNet and VGG16, without loss of generality. We also set the expected loss of inference accuracy to zero to demonstrate the flexibility of deepsz.
We note that in an fc-layer of a neural network, weights are floating-point numbers between -1.0 and 1.0; more specifically, for a trained network such as AlexNet or VGG16, the weights typically range between -0.3 and +0.3. Thus, absolute error bounds of a comparable order of magnitude are relatively large compared with the weight values, and using them significantly affects the overall inference accuracy (i.e., it drops to less than 20%), as illustrated in Figure 5. We also note that a sufficiently small absolute error bound can maintain the inference accuracy without any loss for these networks, which guides our default initial point for the error bound to be checked.
5.2. Evaluation Results




5.2.1. Compression Ratio
We first present the experimental results of deepsz in terms of compression ratio and compare with Deep Compression and Weightless.
LeNet-300-100 and LeNet-5 on MNIST
First, we evaluate deepsz on LeNet-300-100 and LeNet-5 with the MNIST dataset (LeCun et al., 2010). LeNet-300-100 contains only three fc-layers (i.e., ip1, ip2, and ip3). LeNet-5 contains three convolutional layers and two fc-layers (i.e., ip1 and ip2). The fc-layers account for 100% and 95.3% of the overall sizes of LeNet-300-100 and LeNet-5, respectively, as shown in Table 1. deepsz first prunes the network with the pruning ratios suggested by Han et al. (2015a) (as shown in Table 2(a) and 2(b)) and stores the pruned weights in the dataArray and indexArray. After the pruning step, LeNet-300-100 and LeNet-5 are reduced by 9.7× and 9.8×, respectively. Note that the compression ratio is slightly different from the pruning ratio because every nonzero pruned weight requires 40 bits instead of 32 bits, as discussed in Section 3.2. Then, deepsz applies the error bound assessment step to the pruned network and obtains the feasible ranges of error bounds for the fc-layers, as shown in Figure 4(a) for ip1, ip2, and ip3 of LeNet-300-100 and in Figure 4(b) for ip1 and ip2 of LeNet-5. deepsz then optimizes the configuration of the error bounds based on Algorithm 2, which yields the final error bound for each of these fc-layers. deepsz then adopts the best-fit lossless compressor, Zstandard, to compress the indexArray. As shown in Table 2(a) and 2(b), deepsz can compress the fc-layers of LeNet-300-100 by 55.8× and the fc-layers of LeNet-5 by 57.3× with no loss of inference accuracy. We note that each fc-layer has a threshold of error bound, after which the inference accuracy begins to drop sharply. This phenomenon also holds for other networks, such as AlexNet and VGG16. It demonstrates that determining the proper error bound for each fc-layer is a critical problem, for which deepsz provides an efficient, fine-tuning solution (see Section 3.3 and Section 3.4).
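The per-layer selection problem can be illustrated with a generic multiple-choice knapsack. The sketch below is not the paper’s Algorithm 2: it assumes that accuracy degradations add up across layers, discretizes them in steps of 0.01%, and uses made-up candidate values. The function name `choose_error_bounds` and all numbers are hypothetical.

```python
import math

def choose_error_bounds(candidates, budget, step=0.01):
    """Pick one (error_bound, degradation, size) per layer to minimize total size.

    candidates: one list per layer of (error_bound, degradation_percent, size).
    budget: allowed total accuracy degradation in percent (assumed additive).
    """
    slots = int(round(budget / step))
    inf = float("inf")
    # best[u] = (smallest total size, chosen bounds) using exactly u budget units
    best = [(inf, None)] * (slots + 1)
    best[0] = (0.0, [])
    for layer in candidates:
        new_best = [(inf, None)] * (slots + 1)
        for used, (size, picks) in enumerate(best):
            if picks is None:
                continue
            for error_bound, degradation, comp_size in layer:
                # Accuracy gains are treated as zero cost; losses are rounded up.
                cost = max(0, math.ceil(degradation / step - 1e-9))
                total = used + cost
                if total <= slots and size + comp_size < new_best[total][0]:
                    new_best[total] = (size + comp_size, picks + [error_bound])
        best = new_best
    feasible = [entry for entry in best if entry[1] is not None]
    if not feasible:
        raise ValueError("no combination of error bounds meets the budget")
    return min(feasible, key=lambda entry: entry[0])

# Hypothetical candidates for two layers: (error bound, degradation %, size in KB).
layer_a = [(0.001, 0.00, 100.0), (0.01, 0.10, 60.0), (0.02, 0.30, 40.0)]
layer_b = [(0.001, 0.00, 50.0), (0.01, 0.15, 25.0), (0.02, 0.40, 20.0)]
total_size, bounds = choose_error_bounds([layer_a, layer_b], budget=0.30)
```

In this toy input, spending the budget on moderate bounds in both layers (85 KB in total) beats spending all of it on the largest bound of one layer (90 KB), which is the kind of trade-off the optimization step resolves.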
AlexNet on ImageNet
We next evaluate deepsz on a much larger network, AlexNet, with the ImageNet dataset (Krizhevsky et al., 2012). AlexNet contains five convolutional layers and three fc-layers (i.e., fc6, fc7, and fc8). The fc-layers take up 96.1% of the overall storage space, as shown in Table 1. After the pruning, the network can be reduced to 10.1%, as shown in Table 2(c). After the second (assessment) and third (optimization) steps, deepsz selects an error bound for each of fc6, fc7, and fc8, as shown in Figure 4(d). deepsz can compress AlexNet by 45.5× with only 0.13% loss of top-1 accuracy, as shown in Table 3. Note that the top-5 accuracy is not decreased but, rather, is increased by 0.18%. We can further set the expected inference accuracy loss to zero. deepsz can then compress AlexNet by 36.5× with no loss of inference accuracy (using the same error bound for fc6, fc7, and fc8).
VGG16 on ImageNet
We now apply deepsz to VGG16, which contains one large fc-layer (i.e., fc6) and two relatively small fc-layers (i.e., fc7 and fc8). The pruning ratios are set to relatively low values, leading to a much higher compression ratio after the pruning (i.e., 20.9×), as shown in Table 2(d). deepsz then selects an error bound for each of fc6, fc7, and fc8. By leveraging deepsz, we can achieve a compression ratio of 115.6× with only 0.25% loss of inference accuracy on VGG16, as shown in Table 3. Similar to AlexNet, we can also set the expected inference accuracy loss to zero. deepsz can then compress VGG16 by 92.7× with no loss of inference accuracy (with one error bound for fc6 and fc7 and another for fc8).
Table 3
Neural Network           Top-1 Accuracy   Top-5 Accuracy   Size       Compression Ratio
LeNet-300-100 original   98.35%           -                1056 KB    -
LeNet-300-100 deepsz     98.31%           -                19.1 KB    55.8
LeNet-5 original         99.13%           -                1620 KB    -
LeNet-5 deepsz           99.16%           -                28.3 KB    57.3
AlexNet original         57.41%           80.40%           234.5 MB   -
AlexNet deepsz           57.28%           80.58%           5.15 MB    45.5
VGG16 original           68.05%           88.34%           494.5 MB   -
VGG16 deepsz             67.80%           88.20%           4.277 MB   115.6
In summary, deepsz can compress the fc-layers in the tested neural networks with compression ratios of 46× to 116× while maintaining a loss of inference accuracy of less than 0.3% (within the user-set expected loss of 0.4%), as shown in Table 3. Note that the top-5 accuracy is usually not reported for LeNet-5 because its top-1 accuracy (i.e., > 99%) is relatively high. deepsz can improve the overall compression ratio by 21% to 43% compared with the second-best solution, as shown in Table 4. This table also illustrates that deepsz can deliver a high compression ratio for each fc-layer. Even compared with the Weightless method, which can compress only one layer, deepsz can still achieve a comparable compression ratio. We note that the compression ratio is not available for some layers in Weightless, because (1) the source code of Weightless (Reagen et al., 2017) is not publicly available and (2) the Weightless paper (Reagen et al., 2017) showed evaluation results only for the largest two layers in LeNet-5 and VGG16, without any results for AlexNet. We also note that Deep Compression uses 5 bits per pruned weight, whereas deepsz can compress the networks to 2.0 to 3.3 bits per pruned weight. If we set a similar bit width for Deep Compression’s quantization (i.e., the number of bits of the deepsz compressed layers), the inference accuracy will drop sharply by 1.56% for AlexNet and 2.81% for VGG16, as shown in Table 5. Note that the inference accuracy degradation is not available for Weightless for LeNet-5 and AlexNet, because Weightless does not provide these results in (Reagen et al., 2017) (the paper does show the inference accuracy degradation and encoding time overhead for VGG16).
Table 4
Neural Network   Layer     Compression Ratio                           Improvement
                           Deep Compression   Weightless   deepsz
LeNet-300-100    ip1       43.1               60.1         61.81       1.43
                 ip2       32.9               64.3         37.97       1.15
                 ip3       7.9                -            5.6         0.71
                 overall   41.0               7.6          55.77       1.36
LeNet-5          ip1       40.8               74.2         58.5        1.43
                 ip2       16.3               -            21.5        1.32
                 overall   40.1               39.0         57.3        1.43
AlexNet          fc6       41.8               -            54.4        1.30
                 fc7       40.7               -            46.5        1.14
                 fc8       17.1               -            17.5        1.02
                 overall   37.7               -            45.5        1.21
VGG16            fc6       119.0              157.0        152.1       1.28
                 fc7       80.0               85.8         90.0        1.13
                 fc8       19.1               -            19.8        1.04
                 overall   95.8               5.9          115.6       1.21
Table 5
Model            Inference Accuracy Degradation
                 Deep Compression   Weightless   deepsz
LeNet-300-100    0.22%              -            0.12%
LeNet-5          0.30%              -            0.03%^{9}
AlexNet          1.56%              -            0.13%
VGG16            2.81%              3.0%         0.25%
^{9} The accuracy is slightly increased by 0.03% in this case.
5.2.2. Performance Evaluation
As discussed in Section 4, deepsz is theoretically faster than the other methods in terms of both encoding and decoding. We now present the time overhead of deepsz on the four neural networks, as shown in Figure 6. The figure illustrates that deepsz has lower encoding and decoding time overheads than Deep Compression and Weightless. We note that the time results of LeNet-300-100 are almost identical to those of LeNet-5; hence, because of space limitations, we present the time overheads only for LeNet-5.
We investigated the times of the last three steps (i.e., time spent mainly on compression, decompression, and tests) for deepsz’s encoding on GPUs. We do not include the pruning time because all three methods have the same pruning process and the time overheads are the same. Figure 5(a) shows the encoding time with the three solutions. We normalize the times of the other two compression methods to that of deepsz in Figure 5(a), because LeNet-5 has a much smaller encoding time than AlexNet and VGG16. Specifically, deepsz takes <1 min, 8 min, and 16 min to encode LeNet-5, AlexNet, and VGG16, respectively. Deep Compression takes 4 min, 14 min, and 38 min to encode LeNet-5, AlexNet, and VGG16, respectively. Because the source code is not available, we estimate the encoding time of Weightless based on the number of epochs (for retraining) shown in the paper and the time of one epoch on our experimental platform. Weightless takes about 113 min to encode VGG16; again, Weightless does not present the encoding time (i.e., the number of epochs) of LeNet-5 or AlexNet in the paper. deepsz can improve the encoding performance by 1.8× to 4.0× compared with the second-best solution. We note that for the Deep Compression and Weightless methods, it is difficult to determine the initial parameters of the solver in order to retrain the network. Retraining could take much longer than the optimal time overhead if users are not familiar with the characteristics of the network.
We also investigated the times of lossless decompression, SZ lossy decompression, and sparse matrix reconstruction for deepsz’s decoding on CPU.^{10} As we can see in Figure 5(a), deepsz outperforms the second-best solution by 4.5× to 6.2× for decoding. Specifically, deepsz takes 2.7 ms, 296 ms, and 341 ms to decode LeNet-5, AlexNet, and VGG16, respectively; Deep Compression takes 13.9 ms, 1,832 ms, and 1,565 ms to decode LeNet-5, AlexNet, and VGG16, respectively; and Weightless takes 520 ms, 1,300 ms, and 22,800 ms to decode LeNet-5, AlexNet, and VGG16, respectively, as shown in the paper.^{11} More specifically, for example, deepsz spends 26 ms in lossless decompression, 108 ms in SZ lossy decompression, and 162 ms in reconstructing the sparse matrix on AlexNet. As a comparison, one forward pass with 50 images per batch takes 1,100 ms on AlexNet. This demonstrates that the time overhead of deepsz’s decoding is low compared with that of a typical forward pass. Therefore, once the network is needed for inference, deepsz can quickly decompress the compressed data and reconstruct the network without much delay. Note that the decoding time of Weightless relies on the number of original (non-pruned) weights, whereas the decoding times of deepsz and Deep Compression depend on the number of pruned weights. This difference can explain the following two observations. (1) deepsz and Deep Compression each have similar decoding times on AlexNet and VGG16 because the two networks have similar numbers of pruned weights (i.e., 6.5 million for AlexNet and 5.8 million for VGG16). (2) Weightless spends more time on VGG16 than on AlexNet for decoding because the largest fc-layer of VGG16 (i.e., fc6 of 25,088×4,096) is much larger than that of AlexNet (i.e., fc6 of 9,216×4,096).
^{10} We follow the previous studies (He et al., 2016; Reagen et al., 2017) to evaluate the decoding/inference on CPU.
^{11} The paper evaluated its decoding time on an Intel Core i7-6700K processor, which has similar processing power to our processor.
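The decoding figures quoted above can be cross-checked with simple arithmetic. The sketch below uses only the times stated in the text; the dictionary layout and variable names are illustrative, and the ratios are computed against Deep Compression specifically.

```python
# Decoding times in milliseconds, as quoted in the text.
decode_ms = {
    "LeNet-5": {"deepsz": 2.7, "deep_compression": 13.9},
    "AlexNet": {"deepsz": 296.0, "deep_compression": 1832.0},
    "VGG16": {"deepsz": 341.0, "deep_compression": 1565.0},
}

# Ratio of Deep Compression's decoding time to that of deepsz, per network.
ratios = {
    net: times["deep_compression"] / times["deepsz"]
    for net, times in decode_ms.items()
}

# AlexNet decoding breakdown for deepsz (ms): lossless, SZ lossy, reconstruction.
alexnet_breakdown = [26, 108, 162]
alexnet_total = sum(alexnet_breakdown)

# Decoding cost relative to one forward pass of 50 images (1,100 ms).
fraction_of_forward_pass = alexnet_total / 1100

for net, ratio in ratios.items():
    print(f"{net}: {ratio:.2f}x faster than Deep Compression")
print(f"AlexNet decode: {alexnet_total} ms, "
      f"{fraction_of_forward_pass:.0%} of one forward pass")
```

The three stages add up to the reported 296 ms for AlexNet, and the per-network ratios fall within the reported 4.5× to 6.2× range.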
6. Related Work
Most neural networks have significant redundancy in their parameters, according to a well-known research study (Denil et al., 2013). Such redundant information may cause significant waste of computation, memory, and storage resources. In general, two types of methods have been proposed to resolve this issue: (1) modifying the structures of networks to reduce the complexity of parameters and (2) compressing a well-trained network by removing redundant information.
Modifying the structures of networks by adopting a specialized structure or loss function can reduce the memory footprint while training larger-scale networks with the same resources. For example, Vanhoucke et al. (Vanhoucke et al., 2011) exploited a fixed-point representation of activations with 8-bit integers rather than 32-bit floating point. Denton et al. (Denton et al., 2014) proposed using low-rank tensor approximations to reduce the number of parameters by up to a factor of 13 for a single layer while keeping an inference accuracy loss of 1% compared with the original network. Arora et al. (Arora et al., 2014) theoretically studied random-like sparse networks with +1/0/-1 weights, which have interesting properties. Chen et al. (Chen et al., 2015) proposed a network architecture, named HashedNets, that uses a low-cost hash function to randomly group connection weights into hash buckets, such that all connections within the same hash bucket share a single parameter value.
Compressing neural networks is an alternative strategy to reduce the model size. For example, Gong et al. (Gong et al., 2014) compressed fc-layers by using vector quantization, which achieved a compression ratio of 24× with 1% inference accuracy loss. Recently, two state-of-the-art approaches (Han et al., 2015a; Reagen et al., 2017) have been designed for compressing networks with high compression ratio and inference accuracy. Han et al. proposed a three-step approach, named Deep Compression, that contains pruning, quantization, and encoding. Deep Compression, however, may degrade the inference accuracy significantly in the course of each forward propagation because of its vector quantization design, such that the network has to be retrained over and over again in order to reach the target inference accuracy, thus resulting in a high execution time overhead. Reagen et al. proposed a lossy compression method, named Weightless, that adopts a Bloomier filter to compress the weights lossily. For encoding, the Bloomier filter needs to construct a hash table; for decoding, in order to decompress one value, the Bloomier filter typically needs to calculate four hash functions. The resulting time complexity is a function of the number of values to be encoded or decoded and differs between the best case and the worst case. Therefore, the Weightless method suffers from a relatively high time overhead because of the expensive Bloomier filter. Moreover, it was applied to only one fc-layer instead of the whole neural network. In this paper, we compare our proposed deepsz with both the Deep Compression and Weightless approaches comprehensively.
Unlike the first type of method, which requires modification of the network structure and full retraining, the second type of method is more general and efficient. Therefore, we focus on compressing well-trained neural networks without modifying the network structure for high reduction ratio and inference accuracy.
7. Conclusion and Future Work
In this paper, we propose a novel lossy compression framework, called deepsz, for effectively compressing sparse weights in deep neural networks. Unlike traditional methods, deepsz can avoid the costly retraining process after compression, leading to a significant performance improvement in encoding DNNs. We develop a series of approaches to efficiently determine the best-fit error bound for each layer in the network, maximizing the overall compression ratio with a user-acceptable loss of inference accuracy. Experimental results based on the tested neural networks show that deepsz can achieve compression ratios of up to 116× and can outperform the second-best approach by up to 1.43×. Our experiments with four Nvidia Tesla V100 GPUs demonstrate that deepsz can obtain a 1.8× to 4.0× performance improvement in encoding compared with the previous state-of-the-art. deepsz can improve the decoding performance by 4.5× to 6.2× compared with the second-best solution. deepsz also provides high flexibility to balance the compression ratio and inference accuracy.
We plan to first evaluate our proposed deepsz on more neural network architectures. We also will further improve the compression algorithm to achieve a higher reduction ratio. Moreover, we hope to use deepsz for improving GPU memory utilization.
Acknowledgments
This research was supported by the Exascale Computing Project (ECP), Project Number: 17-SC-20-SC, a collaborative effort of two DOE organizations – the Office of Science and the National Nuclear Security Administration, responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, to support the nation’s exascale computing imperative. This material was based upon work supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357. We gratefully acknowledge the support from Alabama Water Institute.
References
 Ainsworth et al. (2017) Mark Ainsworth, Scott Klasky, and Ben Whitney. 2017. Compression using lossless decimation: analysis and application. SIAM Journal on Scientific Computing 39, 4 (2017), B732–B757.
 Ainsworth et al. (2018) Mark Ainsworth, Ozan Tugluk, Ben Whitney, and Scott Klasky. 2018. Multilevel techniques for compression and reduction of scientific data—the univariate case. Computing and Visualization in Science 19, 5–6 (2018), 65–76.
 Alted (2017) F Alted. 2017. Blosc, an extremely fast, multithreaded, metacompressor library.
 Alwani et al. (2016) Manoj Alwani, Han Chen, Michael Ferdman, and Peter Milder. 2016. Fused-layer CNN accelerators. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, 22.

 Arora et al. (2014) Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma. 2014. Provable bounds for learning some deep representations. In International Conference on Machine Learning. 584–592.
 Blosc compressor (2018) Blosc compressor. 2018. http://blosc.org/. Online.
 Burtscher and Ratanaworabhan (2007) Martin Burtscher and Paruj Ratanaworabhan. 2007. High throughput compression of double-precision floating-point data. In Data Compression Conference, 2007. DCC’07. IEEE, 293–302.
 Chen et al. (2015) Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. 2015. Compressing neural networks with the hashing trick. In International Conference on Machine Learning. 2285–2294.
 Collobert and Weston (2008) Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning. ACM, 160–167.
 Cori Supercomputer (2019) Cori Supercomputer. 2019. http://www.nersc.gov/users/computationalsystems/cori/. Online.
 Denil et al. (2013) Misha Denil, Babak Shakibi, Laurent Dinh, Nando De Freitas, et al. 2013. Predicting parameters in deep learning. In Advances in neural information processing systems. 2148–2156.
 Denton et al. (2014) Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. 2014. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in neural information processing systems. 1269–1277.
 Deutsch (1996) Peter Deutsch. 1996. GZIP file format specification version 4.3. Technical Report.
 Di and Cappello (2016) Sheng Di and Franck Cappello. 2016. Fast error-bounded lossy HPC data compression with SZ. In 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 730–739.
 Foley and Danskin (2017) Denis Foley and John Danskin. 2017. Ultra-performance Pascal GPU and NVLink interconnect. IEEE Micro 37, 2 (2017), 7–17.
 Goldschneider (1997) Jill R. Goldschneider. 1997. Lossy Compression of Scientific Data via Wavelets and Vector Quantization.
 Gong et al. (2014) Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. 2014. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115 (2014).

 Graves et al. (2013) Alex Graves, Abdelrahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6645–6649.
 Han et al. (2015a) Song Han, Huizi Mao, and William J Dally. 2015a. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 (2015).
 Han et al. (2015b) Song Han, Jeff Pool, John Tran, and William Dally. 2015b. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems. 1135–1143.
 Hassibi and Stork (1993) Babak Hassibi and David G Stork. 1993. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in neural information processing systems. 164–171.

 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 770–778.
 Ibarria et al. (2003) Lawrence Ibarria, Peter Lindstrom, Jarek Rossignac, and Andrzej Szymczak. 2003. Out-of-core compression and decompression of large n-dimensional scalar fields. In Computer Graphics Forum, Vol. 22. Wiley Online Library, 343–348.
 Jia et al. (2014) Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross B Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe  Convolutional Architecture for Fast Feature Embedding. ACM Multimedia (2014).

 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
 Lakshminarasimhan et al. (2011) Sriram Lakshminarasimhan, Neil Shah, Stephane Ethier, Scott Klasky, Rob Latham, Rob Ross, and Nagiza F Samatova. 2011. Compressing the incompressible with ISABELA: In-situ reduction of spatio-temporal data. In European Conference on Parallel Processing. Springer, 366–379.
 Large Scale Visual Recognition Challenge (2019) Large Scale Visual Recognition Challenge. 2019. http://www.imagenet.org/challenges/LSVRC/. Online.
 LeCun et al. (2010) Yann LeCun, Corinna Cortes, and CJ Burges. 2010. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist 2 (2010).
 LeCun et al. (1990) Yann LeCun, John S Denker, and Sara A Solla. 1990. Optimal brain damage. In Advances in neural information processing systems. 598–605.
 Li et al. (2018) Shaomeng Li, Nicole Marsaglia, Christoph Garth, Jonathan Woodring, John Clyne, and Hank Childs. 2018. Data reduction techniques for simulation, visualization and data analysis. In Computer Graphics Forum, Vol. 37. Wiley Online Library, 422–447.
 Liang et al. (2018) Xin Liang, Sheng Di, Dingwen Tao, Sihuan Li, Shaomeng Li, Hanqi Guo, Zizhong Chen, and Franck Cappello. 2018. Error-Controlled Lossy Compression Optimized for High Compression Ratios of Scientific Datasets. (2018).
 Lindstrom (2014) Peter Lindstrom. 2014. Fixed-rate compressed floating-point arrays. IEEE Transactions on Visualization and Computer Graphics 20, 12 (2014), 2674–2683.
 Lindstrom (2017) Peter Lindstrom. 2017. Error Distributions of Lossy Floating-Point Compressors. Joint Statistical Meetings (2017), 2574–2589.
 Lindstrom and Isenburg (2006) Peter Lindstrom and Martin Isenburg. 2006. Fast and efficient compression of floating-point data. IEEE Transactions on Visualization and Computer Graphics 12, 5 (2006), 1245–1250.
 Lu et al. (2018) Tao Lu, Qing Liu, Xubin He, Huizhang Luo, Eric Suchyta, Jong Choi, Norbert Podhorszki, Scott Klasky, Mathew Wolf, Tong Liu, et al. 2018. Understanding and modeling lossy compression schemes on HPC scientific data. In 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 348–357.
 Reagen et al. (2017) Brandon Reagen, Udit Gupta, Robert Adolf, Michael M Mitzenmacher, Alexander M Rush, Gu-Yeon Wei, and David Brooks. 2017. Weightless: Lossy Weight Encoding For Deep Neural Network Compression. arXiv preprint arXiv:1711.04686 (2017).

 Rowley et al. (1998) Henry A Rowley, Shumeet Baluja, and Takeo Kanade. 1998. Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 1 (1998), 23–38.
 Son et al. (2014) Seung Woo Son, Zhengzhang Chen, William Hendrix, Ankit Agrawal, Wei-keng Liao, and Alok Choudhary. 2014. Data compression for the exascale computing era-survey. Supercomputing Frontiers and Innovations 1, 2 (2014), 76–88.
 Summit Supercomputer (2019) Summit Supercomputer. 2019. https://www.olcf.ornl.gov/summit/. Online.
 Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 1–9.
 Tao et al. (2017a) Dingwen Tao, Sheng Di, Zizhong Chen, and Franck Cappello. 2017a. In-depth exploration of single-snapshot lossy compression techniques for N-body simulations. In 2017 IEEE International Conference on Big Data. IEEE, 486–493.
 Tao et al. (2017b) Dingwen Tao, Sheng Di, Zizhong Chen, and Franck Cappello. 2017b. Significantly improving lossy compression for scientific data sets based on multidimensional prediction and error-controlled quantization. In Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International. IEEE, 1129–1139.
 Tao et al. (2019) Dingwen Tao, Sheng Di, Xin Liang, Zizhong Chen, and Franck Cappello. 2019. Optimizing Lossy Compression Rate-Distortion from Automatic Online Selection between SZ and ZFP. IEEE Transactions on Parallel and Distributed Systems (2019).
 Taubman and Marcellin (2012) David Taubman and Michael Marcellin. 2012. JPEG2000 image compression fundamentals, standards and practice: image compression fundamentals, standards and practice. Vol. 642. Springer Science & Business Media.
 Theta Supercomputer (2019) Theta Supercomputer. 2019. https://www.alcf.anl.gov/theta. Online.
 Vanhoucke et al. (2011) Vincent Vanhoucke, Andrew Senior, and Mark Z Mao. 2011. Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, Vol. 1. Citeseer, 4.
 Wallace (1992) Gregory K Wallace. 1992. The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics 38, 1 (1992), xviii–xxxiv.
 Wang et al. (2018) Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon Song, Zenglin Xu, and Tim Kraska. 2018. Superneurons: dynamic GPU memory management for training deep neural networks. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 41–53.
 Wozniak et al. (2018) Justin M Wozniak, Rajeev Jain, Prasanna Balaprakash, Jonathan Ozik, Nicholson T Collier, John Bauer, Fangfang Xia, Thomas Brettin, Rick Stevens, Jamaludin Mohd-Yusof, et al. 2018. CANDLE/Supervisor: a workflow framework for machine learning applied to cancer research. BMC bioinformatics 19, 18 (2018), 491.
 Zstandard (2018) Zstandard. 2018. http://facebook.github.io/zstd/. Online.