# Compression strategies and space-conscious representations for deep neural networks

Recent advances in deep learning have made available large, powerful convolutional neural networks (CNN) with state-of-the-art performance in several real-world applications. Unfortunately, these large-sized models have millions of parameters, thus they are not deployable on resource-limited platforms (e.g. where RAM is limited). Compression of CNNs thereby becomes a critical problem to achieve memory-efficient and possibly computationally faster model representations. In this paper, we investigate the impact of lossy compression of CNNs by weight pruning and quantization, and lossless weight matrix representations based on source coding. We tested several combinations of these techniques on four benchmark datasets for classification and regression problems, achieving compression rates up to 165 times, while preserving or improving the model performance.


## I Introduction

Although the main structural results behind deep neural networks (NN) date back more than forty years, the field has constantly evolved, giving rise to original models as well as to novel techniques for controlling generalization error. The availability of powerful computing facilities and of massive amounts of data nowadays makes it possible to train extremely effective neural predictors, setting the state of the art in various fields, such as image processing or financial forecasting. Convolutional neural networks (CNN) play a prominent role: indeed, several pre-trained models, such as AlexNet [krizhevsky2012imagenet] and VGG16 [Simonyan15], are made available as a starting point for transfer learning techniques [transfer-learning]. These models, however, are characterized by a large memory footprint: for instance, VGG16 requires no less than 500 MB to be stored in main memory. As a consequence, querying such models also becomes demanding in terms of energy consumption. This clashes with the limitations of mobile phones, smartwatches, and IoT-enabled devices in general. Leaving aside the line of research on compact model production, which aims to directly induce succinct CNNs [sandler2018mobilenetv2], in this paper we focus on network compression. Indeed, knowledge in a neural network is distributed over millions, or even billions, of connection weights. This knowledge can be extracted by transforming a learnt network into a smaller one with comparable (or even better) performance. The main approaches proposed in the literature can be cast into the following four categories:

• matrix decomposition, aiming at writing over-informative weight matrices as the product of more compact matrices;

• data quantization, focused on limiting the bitwidth of data encoding the mathematical objects behind a NN, such as weights, activations, errors, gradients, and so on;

• network sparsification, aimed at reducing the number of free parameters of a NN, notably the connection weights;

• knowledge distillation, consisting in subsuming a large network by a smaller one, trained with the target of mimicking the function learnt by the large network.

Readers interested in a thorough review of these methods can refer for instance to [deng2020model, cheng2017survey]. Note that we do not consider special techniques for convolutional layers, such as those tweaking the corresponding filters [zhai2016doubly]. This is because, in most CNNs, the memory required by these layers is negligible w.r.t. that of fully-connected layers.

This work aims at investigating the joint effect of: (i) lossy compression of NN weights, and (ii) entropy coding and lossless representation techniques aware of the limited amount of memory (aka space-conscious techniques [Ferragina:2020pgm]). In particular, concerning matrix compression, we analyse the effectiveness of some existing pruning and quantization methods, and introduce a probabilistic quantization technique borrowed from federated learning. From the matrix storage standpoint, we propose a novel representation, named sparse Huffman Address Map compression (sHAM), combining entropy coding, address maps, and compressed sparse column (CSC) representations. sHAM is specifically designed to exploit the sparseness and the quantization of the original weight matrix. The proposed methods have been evaluated on two publicly available CNNs, and on four benchmarks for image classification and for the regression problem of drug-target affinity prediction, confirming that CNN compression can even improve the performance of uncompressed models, whereas the sHAM representation can achieve compression rates up to around 165 times. The work is organized as follows: Sects. II and III describe the considered compression and representation techniques, while Sect. IV illustrates the above-mentioned experimental comparison, reported in terms of performance gain/degradation, achieved compression rate, and execution times. Some concluding remarks end the paper.

## II Compression techniques

This section describes three methodologies used to transform the matrix $\mathbf{W}^o$, which organizes the connection weights of one layer in a neural network, into a new matrix $\mathbf{W}$ which approximates $\mathbf{W}^o$ while having a structure exploitable by specific compression schemes (cfr. Sect. III). These techniques can be applied separately to the dense layers of any NN. In the rest of the paper, $w^o_{ij}$ and $w_{ij}$ will denote generic entries of $\mathbf{W}^o$ and $\mathbf{W}$, respectively. Boldface and italic boldface will be used for matrices (e.g. $\mathbf{W}$) and vectors (e.g. $\boldsymbol{x}$), respectively, while $|\cdot|$ will be an abstract cardinality operator returning the length of a string or the number of elements of a vector. Finally, we define the sparsity coefficient $s$ of $\mathbf{W}$ as the ratio between the number of its nonzero elements and its total number of entries.

### II-A Pruning

Neural networks have several analogies with the human central nervous system. In particular, storing knowledge in a distributed fashion implies robustness as a side effect: performance degrades gracefully when damage occurs in the network components, i.e., when connections change their weight, or even get discarded. In turn, robustness can be exploited to compress a NN by removing connections which do not significantly affect the overall behaviour. This is referred to as pruning a learnt neural network. An originally oversized NN might even be outperformed by its pruned version.

Pruning is typically done by considering connections whose weights are small in absolute value. Indeed, the signal processed by an activation function is computed as a weighted sum of the inputs to the corresponding neuron, precisely relying on connection weights. Thus, nullifying all negligible (positive or negative) weights should not appreciably change this signal, nor the network output. This is why we performed pruning by fixing an empirical percentile $w_p$ of the absolute values of the entries of the original matrix $\mathbf{W}^o$, and subsequently defined the entries of the pruned matrix $\mathbf{W}$ by setting $w_{ij} = 0$ if $|w^o_{ij}| \le w_p$, and $w_{ij} = w^o_{ij}$ otherwise. This procedure has a time complexity of $O(N \log N)$, where $N$ is the number of entries of the matrix (due to sorting). As pruning modifies the structure of the neural network, a post-processing phase retrains the network on the same data, now only updating the non-null weights in $\mathbf{W}$. The only parameter is the percentile level $p$, which in turn is directly related to the sparsity coefficient (see Sect. IV for a description of how $p$, as well as the parameters introduced in the next sections, have been selected).
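The pruning step described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `prune_by_percentile` and `masked_update`, the learning rate, and the random test matrix are hypothetical, and the retraining step is reduced to a single masked gradient update rather than a full training loop.

```python
import numpy as np

def prune_by_percentile(w_orig, p):
    """Zero out the entries whose magnitude is at or below the p-th percentile of |w_orig|."""
    threshold = np.percentile(np.abs(w_orig), p)
    mask = np.abs(w_orig) > threshold          # True for the surviving connections
    return w_orig * mask, mask

def masked_update(w, grad, mask, lr=0.01):
    """One retraining step that only updates the non-null weights (assumed plain SGD)."""
    return w - lr * grad * mask

rng = np.random.default_rng(0)
w_orig = rng.normal(size=(6, 4))               # hypothetical dense-layer weights
w, mask = prune_by_percentile(w_orig, p=60)

# Fraction of nonzero entries after pruning
nonzero_ratio = np.count_nonzero(w) / w.size
```

Keeping the boolean mask around is a simple way to guarantee that pruned connections stay at zero during retraining.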

### II-B Weight sharing

When the weights in $\mathbf{W}$ assume a small number of distinct values, applying the flyweight pattern [gang-of-4] results in a technique called weight sharing (WS) [Han15]. Distinct values are stored in a table, whose indices are used as matrix entries. As (integer) indices require fewer bits than (float) weights, the index matrix is significantly more compact and largely compensates for the additional table (note that, in the original formulation, the representation of this matrix still scales with the number of matrix entries, while in Sect. IV we use a more efficient encoding). This comes at the price of requiring two memory accesses in order to retrieve a weight.

Although the original weight matrix is not expected to initially enjoy this property, robustness is helpful in this case, too. Close enough weights can be set to a common value without significantly affecting network performance, thus making WS applicable. For instance, [Han15] clusters all weight values, setting each $w_{ij}$ to the centroid of the corresponding cluster. Assuming the $k$-means algorithm is used [mcqueen], the time complexity depends on the number of different weight values. A second retraining phase is advisable also here, though updating weights is trickier, because they must take values in the centroid set $\{c_1, \dots, c_k\}$. This is ensured using the cumulative gradient

$$\frac{\partial \mathcal{L}}{\partial c_l} = \sum_{i,j} \frac{\partial \mathcal{L}}{\partial w_{ij}} \, \mathbb{1}(I_{ij} = l),$$

where $l \in \{1, \dots, k\}$, $I_{ij}$ is the cluster index of $w_{ij}$, and $\mathbb{1}(\cdot)$ is the indicator function. Applying the cumulative gradient might end up using fewer than $k$ distinct weights, if some representatives converge to the same value during retraining. Pruning and weight sharing can be applied in chain, with weight sharing only considering the non-null weights identified by pruning.
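A minimal sketch of weight sharing and of the cumulative gradient follows. It assumes a plain one-dimensional Lloyd iteration with linearly spaced initial centroids; the helper names (`kmeans_1d`, `share_weights`, `cumulative_gradient`), the learning rate and the random data are hypothetical and not taken from [Han15].

```python
import numpy as np

def kmeans_1d(values, k, iters=20):
    """Lloyd's algorithm on scalars; centroids start evenly spaced over the value range."""
    centroids = np.linspace(values.min(), values.max(), k)
    for _ in range(iters):
        idx = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for l in range(k):
            if np.any(idx == l):               # leave empty clusters untouched
                centroids[l] = values[idx == l].mean()
    idx = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return centroids, idx

def share_weights(w, k):
    """Return the centroid table and the matrix of cluster indices I_ij."""
    centroids, idx = kmeans_1d(w.ravel(), k)
    return centroids, idx.reshape(w.shape)

def cumulative_gradient(grad_w, index_map, k):
    """dL/dc_l = sum over (i, j) of dL/dw_ij * 1(I_ij = l)."""
    return np.bincount(index_map.ravel(), weights=grad_w.ravel(), minlength=k)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 5))                    # hypothetical layer weights
centroids, index_map = share_weights(w, k=4)
w_shared = centroids[index_map]                # each weight replaced by its centroid

grad_w = rng.normal(size=w.shape)              # stand-in for a backpropagated gradient
grad_c = cumulative_gradient(grad_w, index_map, k=4)
centroids_updated = centroids - 0.01 * grad_c  # weights stay tied to the centroid table
```

Updating the centroid table, instead of the individual weights, is what keeps every weight inside the centroid set during retraining.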

### II-C Probabilistic quantization

A recent trend in quantization relies on probabilistic projections of weights onto special binary [NIPS2015_5647] or ternary values [deng2018gxnor]. Here we present an alternative approach, named Probabilistic Quantization (PQ), borrowed and extended from the federated learning context [federated-learning], and never used before for NN compression. PQ is based on the following probabilistic rationale. Let $\underline{w}$ and $\overline{w}$ denote the minimum and maximum weight in $\mathbf{W}^o$, respectively, and suppose that each learnt weight $w^o_{ij}$ is the specification of a random variable $W_{ij}$ distributed according to a fixed, yet unknown, distribution having $[\underline{w}, \overline{w}]$ as support. Let $Q_{ij}$ be the two-valued random variable defined by $\mathrm{P}(Q_{ij} = \underline{w} \mid W_{ij}) = \frac{\overline{w} - W_{ij}}{\overline{w} - \underline{w}}$ and $\mathrm{P}(Q_{ij} = \overline{w} \mid W_{ij}) = \frac{W_{ij} - \underline{w}}{\overline{w} - \underline{w}}$. The observations of $Q_{ij}$ approximate a weight $w^o_{ij}$ through an extreme form of WS (cfr. previous section), using an approach different from $k$-means for finding representative weights. Now, $\mathrm{E}(Q_{ij} \mid W_{ij}) = W_{ij}$, and, in turn, $\mathrm{E}(Q_{ij}) = \mathrm{E}(W_{ij})$ regardless of how $W_{ij}$ is distributed. Thus, simulating $Q_{ij}$ for each entry $w^o_{ij}$ we obtain an approximation $\mathbf{W}$ of $\mathbf{W}^o$ having the desirable unbiasedness property that the two corresponding random matrices have the same expected value. This method has been heuristically extended by partitioning $[\underline{w}, \overline{w}]$ into sub-intervals. A generic $w^o_{ij}$ is compressed precisely as in the two-value case, but $\underline{w}$ and $\overline{w}$ now denote the extremes of the interval containing $w^o_{ij}$. We remark here that sub-intervals, and therefore representative weights, can be chosen in order to preserve the above-mentioned unbiasedness property. Indeed, this happens when the intervals' extremes are quantiles of the weights (this requires the weights to follow a common probability distribution; however, no additional hypotheses are needed). The time complexity of the overall operation is dominated by quantile computation. Note that the same considerations pointed out for WS at the end of the previous section, namely retraining via the cumulative gradient formula and combined use with pruning, also apply to PQ.
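The following NumPy sketch illustrates probabilistic quantization with $k$ representative values. It is written under explicit assumptions: the representatives are taken as quantiles at evenly spaced levels (the exact levels used by the authors are not recoverable from the text), and the function name `probabilistic_quantization` and the test matrix are hypothetical.

```python
import numpy as np

def probabilistic_quantization(w, k, rng):
    """Randomly round each weight to one extreme of the interval containing it.

    The k representatives are quantiles at evenly spaced levels (assumption).
    The upper extreme is drawn with probability (w - lo) / (hi - lo), so the
    expected value of the quantized weight equals the original weight.
    """
    edges = np.quantile(w, np.linspace(0.0, 1.0, k))
    flat = w.ravel()
    # Index i such that edges[i] <= weight <= edges[i + 1]
    i = np.clip(np.searchsorted(edges, flat, side="right") - 1, 0, k - 2)
    lo, hi = edges[i], edges[i + 1]
    width = np.where(hi > lo, hi - lo, 1.0)    # guard against degenerate intervals
    p_hi = (flat - lo) / width
    take_hi = rng.random(flat.shape) < p_hi
    return np.where(take_hi, hi, lo).reshape(w.shape), edges

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 5))                    # hypothetical layer weights
w_q, edges = probabilistic_quantization(w, k=4, rng=rng)

# Empirical check of unbiasedness: the average of many draws approaches w
mean_q = np.mean([probabilistic_quantization(w, 4, rng)[0] for _ in range(4000)], axis=0)
```

With `k=2` the sketch reduces to the two-valued case, where the only representatives are the minimum and maximum weight.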

## III Compressed Matrix Representation

The matrix $\mathbf{W}$ obtained using any of the techniques described in Sect. II has as many elements as the original matrix. However, it exhibits properties exploitable by a clever encoding, so that it can be stored using fewer memory locations than required by the classical row-order method. In this section, two existing compressed representations of $\mathbf{W}$ are first described, then a novel method is proposed, overcoming their limitations and explicitly profiting from sparsity and the presence of repeated values. Moreover, the method does not require assumptions on the matrix sparsity, on the distribution of nonzero elements, or on the presence of repeated values. A dedicated procedure is also described to efficiently compute the dot product between an input vector and the compressed matrix, which is necessary for the forward computation in a NN.

### III-A Compressed sparse column

The compressed sparse column (CSC) format [CSC] is a common general storage format for sparse matrices. It is composed of three arrays:

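• $\boldsymbol{nz}$, containing the nonzero values, listed by columns;

• $\boldsymbol{ri}$, containing the row indices of the elements in $\boldsymbol{nz}$;

• $\boldsymbol{cb}$, where the difference $cb_{j+1} - cb_j$ provides the number of nonzero elements in column $j$; thus, $\boldsymbol{cb}$ has dimension $m + 1$, where $m$ is the number of columns of the matrix.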

As an example, consider the matrix

$$\mathbf{W} = \begin{pmatrix} 1 & 0 & 4 & 0 & 0 \\ 0 & 10 & 0 & 0 & 0 \\ 2 & 3 & 0 & 0 & 5 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 6 \end{pmatrix}, \tag{1}$$

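whose corresponding CSC representation, with indices starting from 0, is given by the nonzero values $(1, 2, 10, 3, 4, 5, 6)$, the row indices $(0, 2, 1, 2, 0, 2, 4)$, and the column boundaries $(0, 2, 4, 5, 5, 7)$. Let $n \times m$ be the size of $\mathbf{W}$, $q$ the number of its nonzero elements, and $b$ the number of bits used to represent every element of the matrix (one memory word), so that we need $nmb$ bits to store $\mathbf{W}$, and $(2q + m + 1)b$ bits to store its CSC representation. Note that we assumed $b$ bits are needed also for the row indices, although they can be represented using only $\lceil \log_2 n \rceil$ bits, which might be lower than $b$. Thus the occupancy proportion is given by $\psi_{CSC} = \frac{2q + m + 1}{nm}$.

Denoting by $s$ the sparsity coefficient of $\mathbf{W}$ (the fraction of its nonzero elements), we have $q = snm$, thus $s < \frac{1}{2} - \frac{m + 1}{2nm}$ implies $\psi_{CSC} < 1$. The matrix dot product will be computed through the typical dot product for the CSC format, whose cost is linear in the number of stored elements [CSC], and which can be sped up through parallel computing. The main limitation of CSC is the use of $b$ bits for every stored element of the matrix, whereas variable-length coding can provide more compact representations and higher bit-memory efficiency.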
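As an illustration, the CSC arrays of the example matrix (1) and the corresponding vector-matrix product can be computed as follows. This is a sketch, not the authors' implementation: the array names `nz`, `ri` and `cb`, the 0-based indexing, and the occupancy formula (one memory word per stored element, as assumed in the text) are illustrative choices.

```python
import numpy as np

def to_csc(w):
    """Build the three CSC arrays: nonzero values, their row indices, column boundaries."""
    nz, ri, cb = [], [], [0]
    for j in range(w.shape[1]):
        rows = np.nonzero(w[:, j])[0]
        nz.extend(w[rows, j].tolist())
        ri.extend(rows.tolist())
        cb.append(len(nz))                     # cb[j + 1] - cb[j] = nonzeros in column j
    return nz, ri, cb

def csc_vecmat(x, nz, ri, cb):
    """Compute x^T W touching only the stored nonzero elements."""
    out = np.zeros(len(cb) - 1)
    for j in range(len(cb) - 1):
        for t in range(cb[j], cb[j + 1]):
            out[j] += x[ri[t]] * nz[t]
    return out

W = np.array([[1, 0, 4, 0, 0],
              [0, 10, 0, 0, 0],
              [2, 3, 0, 0, 5],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 6]])
nz, ri, cb = to_csc(W)

n, m = W.shape
q = len(nz)
occupancy = (2 * q + m + 1) / (n * m)          # stored words over words of the dense matrix

x = np.arange(1.0, 6.0)
y = csc_vecmat(x, nz, ri, cb)
```

For this small matrix the occupancy is 20/25 = 0.8, so CSC already saves space even though 28% of the entries are nonzero.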

### III-B Huffman address map compression

The idea of using Huffman coding after network pruning and quantization is introduced in [Han15], although with little detail. Here, in addition to providing an exhaustive description, we also point out a main limitation of this approach: it does not directly profit from the matrix sparsity. Like CSC, this is a lossless compression technique, based on Huffman coding and the address map logic [Pooch73], which we named Huffman Address Map compression (HAM). In address maps, matrix elements are treated as a sequence of bits, concatenated by rows or by columns, where null entries correspond to the bit $0$, and each nonzero element $w$ is substituted by a binary string encoding its address $a(w)$, concatenated to the rest of the stream. For instance, the column-wise bit stream for the matrix defined in (1) is

 a(1)0a(2)000a(10)a(3)00a(4)00000000000a(5)0a(6) .

To be efficient, this storage needs a compact representation of addresses. The Huffman coding of nonzero values is a uniquely decodable and instantaneous code ensuring a near-optimal compression rate [Huffman52]. Indeed, given a source whose symbols have probabilities $p_1, \dots, p_k$, the average number of bits per symbol $\bar{\ell}$ is almost equal to the optimal value, corresponding to the entropy $H = -\sum_{i} p_i \log_2 p_i$ of the source (when the symbols are independent and identically distributed). More precisely, $H \le \bar{\ell} < H + 1$, and $H$ corresponds to the minimal average number of bits per symbol, according to Shannon's source coding theorem [Shannon48].
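The relation between entropy and average Huffman codeword length can be checked numerically on the entries of the example matrix (1), with the symbol 0 included in the code. The sketch below uses only the Python standard library; the helper `huffman_lengths` is a hypothetical name, and empirical frequencies are used as symbol probabilities.

```python
import heapq
import math
from collections import Counter

def huffman_lengths(freqs):
    """Return {symbol: codeword length} for a Huffman code built on the given counts."""
    if len(freqs) == 1:
        return {s: 1 for s in freqs}
    # Heap items: (weight, tie-breaker, symbols in the subtree)
    heap = [(f, i, [s]) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)
        f2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:                      # every merge adds one bit to these symbols
            lengths[s] += 1
        heapq.heappush(heap, (f1 + f2, tie, s1 + s2))
        tie += 1
    return lengths

# Entries of the matrix in (1), listed by columns
entries = [1, 0, 2, 0, 0,  0, 10, 3, 0, 0,  4, 0, 0, 0, 0,
           0, 0, 0, 0, 0,  0, 0, 5, 0, 6]
freqs = Counter(entries)
total = sum(freqs.values())

entropy = -sum(f / total * math.log2(f / total) for f in freqs.values())
lengths = huffman_lengths(freqs)
total_bits = sum(freqs[s] * lengths[s] for s in freqs)
avg_len = total_bits / total
```

Here the 18 zeros receive a 1-bit codeword and the whole matrix takes 45 bits, i.e. 1.8 bits per entry against an entropy of about 1.64 bits.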

Once the Huffman code has been built, we replace each address $a(w)$ in the bit stream with the corresponding Huffman codeword. In order to have a uniquely decodable string, zeros are also included in the Huffman code, which thus has one codeword for each distinct value in the matrix, zero included. The resulting bit stream is then split into memory words of $b$ bits, represented as an array of unsigned integers. If the length of the stream is not a multiple of $b$, the last word is padded.

To estimate the size of the stream for an $n \times m$ matrix, we can assume the worst case for the entropy value, that is when all symbols are distinct ($nm$ symbols appearing exactly once in the matrix): in this case $H = \log_2(nm)$, and the Huffman code has an average codeword length upper-bounded by $\log_2(nm) + 1$, which implies at most $nm(\log_2(nm) + 1)$ bits are needed. To store the Huffman code, and its inverse used to decode, we need extra space: although there are methods storing a Huffman code more compactly [Sultana12], to ensure optimal search time, a 'classical' B-tree representation is used for both the code and its inverse, being aware that space occupancy can be improved. Assuming each value is represented through 1 word ($b$ bits), each dictionary requires $2nmb$ bits to store each pair of symbol and codeword, and $nmb$ bits to store a pointer in the B-tree structure (overestimated, since we have fewer pointers than keys in a B-tree). Overall, HAM requirements are upper-bounded by $nm(\log_2(nm) + 1) + 6nmb$ bits, which is more than the $nmb$ bits required by an uncompressed matrix. For this reason, in the experiments only using pruning, the CSC representation is adopted. Conversely, the space occupancy of HAM decreases when only $k$ distinct weights are present in $\mathbf{W}$, like in the output of WS and PQ (see Sect. II). Indeed, in the worst case (all symbols are equally probable, entropy $\log_2 k$), the occupancy proportion is at most $\frac{1 + \log_2 k}{b} + \frac{6k}{nm}$, where as expected, for small $k$ the first term is more relevant, while the second term grows faster with $k$.

Dot product. The procedure Dot (Fig. 1) executes the dot product between an input vector and the matrix $\mathbf{W}$, when $\mathbf{W}$ is represented through the HAM format. It processes one compressed word at a time and obtains its binary representation, which is scanned to detect code words. The procedure NCW gets the next code word, starting at the current bitstring offset, possibly prepending the bits remaining from the previous word. If, starting from the current offset, no code word can be detected, it means that the next code word has been split across two adjacent memory words; accordingly, the remaining bits are kept and the next word will be read in the next iteration. The procedure also accounts for the padding. Then, the weight corresponding to the detected code word is computed and multiplied by the corresponding element of the input vector, to update the cumulative sum, thus requiring only one weight to be kept in memory at a time.

In summary, the iterations take time for line , for lines -, and for line , leading to an overall time complexity .
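The HAM encoding and a sequential dot product in the spirit of the procedure described above can be sketched as follows. Several simplifications are assumed and should not be read as the authors' implementation: the Huffman dictionaries are plain Python dicts instead of B-trees, the code word detection scans one bit at a time instead of calling a separate NCW procedure, and the names `huffman_code`, `ham_encode` and `ham_dot` are hypothetical.

```python
import heapq
from itertools import count
import numpy as np

def huffman_code(symbols):
    """Return {symbol: bit string} for a Huffman code on the given symbol sequence."""
    freq = {}
    for s in symbols:
        freq[s] = freq.get(s, 0) + 1
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    tie = count()                              # tie-breaker so dicts are never compared
    heap = [(f, next(tie), {s: ""}) for s, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

def ham_encode(w, word_bits=32):
    """Encode all entries (zeros included) column by column and pack them into words."""
    symbols = w.flatten(order="F").tolist()
    code = huffman_code(symbols)
    bits = "".join(code[s] for s in symbols)
    bits += "0" * ((-len(bits)) % word_bits)   # pad the last word
    words = [int(bits[i:i + word_bits], 2) for i in range(0, len(bits), word_bits)]
    return words, code

def ham_dot(x, words, code, n, m, word_bits=32):
    """Compute x^T W from the packed stream, decoding one weight at a time."""
    decode = {c: s for s, c in code.items()}
    out = np.zeros(m)
    buffer, pos = "", 0                        # pos: index of the next entry, column order
    for word in words:
        for bit in format(word, f"0{word_bits}b"):
            if pos == n * m:                   # only padding bits are left
                break
            buffer += bit                      # bits carry over across word boundaries
            if buffer in decode:
                out[pos // n] += x[pos % n] * decode[buffer]
                buffer, pos = "", pos + 1
    return out

W = np.array([[1, 0, 4, 0, 0],
              [0, 10, 0, 0, 0],
              [2, 3, 0, 0, 5],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 6]])
words, code = ham_encode(W)
x = np.arange(1.0, 6.0)
y = ham_dot(x, words, code, *W.shape)
```

Because a Huffman code is prefix-free, the first match found while accumulating bits is always the correct code word, which is what makes the greedy scan valid.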

### III-C Sparse Huffman address map compression

One drawback of HAM representation is that it does not directly exploit the sparsity of the matrix, which only indirectly induces a reduction in the space occupancy, due to the more compact resulting Huffman code (symbol has high frequency). When the matrix is large and very sparse, even using only bit to represent the symbol , much memory would be required (e.g. GB for a matrix). To address this issue, the novel sparse Huffman Address Map compression (sHAM) is proposed, extending the HAM format as follows. The symbol is excluded from the bit stream and from the Huffman code, and a bitwise CSC representation of the matrix is adopted, producing the vectors (cfr. Sect. III-A), but storing using the format. Namely, the Huffman code for nonzero elements is built, and the corresponding bit stream is obtained by concatenating their Huffman coding. is stored in the array of memory words. Considering the worst case, in which all symbols are distinct, is composed of bits, in addition to the for the dictionaries, and to the bits required for and . The resulting occupancy ratio is .

On the other side, when only distinct values are present in , and again assuming the worst case for the Huffman coding, the occupancy ratio becomes , where first term on the right is scaled by w.r.t. , emphasizing the gain of increasing the sparsity of , whereas last term is constant w.r.t.  and . Thus, when is such that , it follows .

Dot product. Figure 2 describes the procedure executing the dot product when is represented through the sHAM format. It extracts in sequence the compressed words of , computes the corresponding binary representation (line ), and executes lines - to detect code words. NCW is the same procedure used for Dot, whereas the cycle at lines - possibly skips empty columns. The variable contains the position in of the current element . Line finds the weight relative to the code word detected, multiplies it by the corresponding element of , and updates the column cumulative sum stored in variable . The iterations require steps for line , for lines -, for cycle -, and for line . The overall time complexity is thereby .

The procedure Dot can be adapted to parallel computation by substituting with the vector containing the beginning of each column in the bitstream , and using the current position in the bitstream () to detect the end of columns. In this way, each column product can be run in parallel (e.g., through GPU); we considered this as a future development, since in the empirical evaluation the sequential version was very close to the full matrix parallel dot product testing time, on sufficiently sparse and quantized matrices. Analogously, also Dot can be parallelized.
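A sequential sketch of the sHAM idea is given below: only the nonzero values are Huffman-coded, while row indices and column boundaries are kept as in CSC. As with the other examples, this is an assumption-laden illustration and not the authors' code: dictionaries are plain Python dicts, code words are detected bit by bit, and the names `huffman_code`, `sham_encode` and `sham_dot` are hypothetical.

```python
import heapq
from itertools import count
import numpy as np

def huffman_code(symbols):
    """Return {symbol: bit string} for a Huffman code on the given symbol sequence."""
    freq = {}
    for s in symbols:
        freq[s] = freq.get(s, 0) + 1
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    tie = count()
    heap = [(f, next(tie), {s: ""}) for s, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

def sham_encode(w, word_bits=32):
    """CSC-like layout where the nonzero values are stored as a packed Huffman stream."""
    nz, ri, cb = [], [], [0]
    for j in range(w.shape[1]):
        rows = np.nonzero(w[:, j])[0]
        nz.extend(w[rows, j].tolist())
        ri.extend(rows.tolist())
        cb.append(len(nz))
    code = huffman_code(nz)                    # the symbol 0 is excluded from the code
    bits = "".join(code[v] for v in nz)
    bits += "0" * ((-len(bits)) % word_bits)
    words = [int(bits[i:i + word_bits], 2) for i in range(0, len(bits), word_bits)]
    return words, code, ri, cb

def sham_dot(x, words, code, ri, cb, word_bits=32):
    """Compute x^T W by decoding one nonzero weight at a time."""
    decode = {c: s for s, c in code.items()}
    out = np.zeros(len(cb) - 1)
    buffer, pos, col = "", 0, 0                # pos: index of the next nonzero element
    for word in words:
        for bit in format(word, f"0{word_bits}b"):
            if pos == cb[-1]:                  # only padding bits are left
                break
            buffer += bit
            if buffer in decode:
                while cb[col + 1] <= pos:      # skip finished or empty columns
                    col += 1
                out[col] += x[ri[pos]] * decode[buffer]
                buffer, pos = "", pos + 1
    return out

W = np.array([[1, 0, 4, 0, 0],
              [0, 10, 0, 0, 0],
              [2, 3, 0, 0, 5],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 6]])
words, code, ri, cb = sham_encode(W)
x = np.arange(1.0, 6.0)
y = sham_dot(x, words, code, ri, cb)
```

The inner `while` loop plays the role of the cycle that skips empty columns; in matrix (1) it jumps over the fourth column, which contains no nonzero element.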

## IV Experiments and Results

To assess the quality of the proposed techniques, an empirical evaluation has been carried out on four datasets and two uncompressed models, as explained here below.

### IV-A Data

• Classification. The MNIST database [MNIST] is a classical large database of handwritten digits, containing 60K+10K 28x28 grayscale images (train and test set, respectively) from 10 classes (digits 0-9). The CIFAR-10 dataset [Krizhevsky09learningmultiple] consists of 50K+10K 32x32 color images belonging to 10 different classes. Both datasets are balanced w.r.t. labels.

• Regression. We predicted the affinity between drugs (ligands) and targets (proteins) [DeepDTA], using the DAVIS [Davis11] and KIBA [KIBA14] datasets. Proteins and ligands are both represented through strings, respectively using the amino acid sequence and the SMILES (Simplified Molecular Input Line Entry System) representation. DAVIS and KIBA contain, respectively, 442 and 229 proteins, 68 and 2111 ligands, and 30056 and 118254 total interactions.

### IV-B Benchmark models

To have a fair comparison of the various compression techniques, we selected top-performing, publicly available CNN models: (i) VGG19 [Simonyan15], made up of 16 convolutional layers and a fully-connected block (two hidden layers and a softmax output layer; source code: https://github.com/BIGBALLON/cifar-10-cnn), trained on the CIFAR-10 and MNIST datasets; and (ii) DeepDTA [DeepDTA], with distinct convolutional blocks for proteins and ligands (each composed of 3 convolutional layers and a MaxPool layer), combined in a fully connected block consisting of 3 hidden layers of 1024, 1024, and 512 units, and a single-neuron output layer (source code: https://github.com/hkmztrk/DeepDTA).

The original work using DeepDTA performed a 5-fold cross validation (CV) for model selection, thus training on 4/5 of the available data. We retained the best hyperparameter configuration and trained the CNN on the entire training set, leaving the original settings unchanged.

### IV-C Evaluation metrics

We considered the difference between the performance of compressed and uncompressed models, the ratio of testing time of the uncompressed model w.r.t. the compressed one (the time ratio), and the space occupancy ratio (cfr. Sect. III-A). As in the original works, we computed performance using accuracy and mean squared error (MSE) for classification and regression, respectively. Time and space performance account only for the actually compressed weights, that is, those in fully-connected layers. Moreover, the implementation of the dot product for uncompressed models exploits parallel computations implemented in Python, thus penalizing the HAM and sHAM dot procedures, which are implemented sequentially (their parallelization is planned as an extension). As shown in the next section, even with this penalty, our method approaches the uncompressed time when the matrix is sufficiently sparse and quantized.

### IV-D Compression techniques setup

We tested all combinations of Pruning (Pr), WS, PQ, Pr-WS, Pr-PQ, selecting hyper-parameters as follows.

• Pruning. We tested percentiles with ; values and (for which CSC does not achieve any compression) are included, as potentially useful in Pr-WS and Pr-PQ.

• WS. For VGG19, was tested in the first two hidden layers , as well as in the third one (which is smaller). DeepDTA is smaller than VGG19, so all combinations of have been tested in the hidden layers, and of for the output layer, due to its dimension.

• PQ. In order to have a fair comparison, took the same values as in the WS procedure;

• Pr-X. The combined application of pruning followed by the quantization X was tested in two variants: a) the best pruning level is selected first, and the parameters for X are subsequently tuned as in the previous points, and b) vice versa.

Fine tuning of compressed weights. The same configuration of original training procedure has been kept for the retraining after compression. Data-based tuning was applied only to learning rate after retraining ( for pruning, and

for PQ, WS, and combined schemes), and the maximum number of epochs, set to

.

### IV-E Software implementation

The source code retrieved for the baseline NNs was implemented in Python, using the TensorFlow and Keras libraries. Compression techniques and retraining procedures have been implemented in Python as well, also exploiting GPUs.

### IV-F Results

As baseline comparison, Table I reports the testing results of the uncompressed models. The top performing results for each compression technique, along with the corresponding configuration, are shown in Table II. To also evaluate compression capability, Table III contains the least occupying configuration for each compression methods having performance greater or equal to the original model (when available). Weight quantization is more accurate than pruning for classification, with PQ and WS on having the top absolute performance on MNIST and CIFAR-10, respectively. Overall, all techniques outperform the baseline, while exhibiting remarkable compression rates. Similar trends raise for regression, where however pruning top-performs, and where PQ never improves the baseline MSE on KIBA dataset. Performance improvements are particularly remarkable on DAVIS data (till around of baseline). As expectable, the largest compression (while preserving accuracy) is achieved on the bigger net, VGG19, with a compression rate of more than times on CIFAR-10, with Pr-PQ and sHAM representation. Anyway, Pr-PQ method improves the baseline MSE of , while compressing around times, also for DeepDTA.

To better unveil the behavior of the proposed compression methods and storage formats, in Fig. 3 we summarize their testing performance, space occupancy and time ratio, for all tested hyper-parameter configurations. The sHAM storage format is used, except for techniques producing dense matrices, where HAM is more convenient. Reminding that WS and PQ combinations are reported in increasing order (before all combinations with in the first layer, denoted by label , then those with in the first layer, label , and so on), on CIFAR-10 and DAVIS most compression techniques outperform the baseline, and this is likely due to overfitting, since on training data they show similar results. Conversely, on MNIST and KIBA only some compression configurations improve the baseline results, which is however important, since at the same time the compressed model uses much less parameters, confirming results obtained in [Han15]. When using binary quantization () clearly we get lower but at the same time worse performance, whereas already with the baseline performance is improved on almost all datasets. sHAM occupancy, as expected, gets lower when increases (and consequently decreases), along with the time ratio, approaching in turn to (same testing time). The high time ratios, for some configurations, reflect the fact that the compress dot procedure is slower than the numpy.dot used by baseline and leveraging parallel computation. As mentioned in Sect. IV-C, the former is still sequential, and we plan to produce a parallel version (cfr. Sect. III-C).

Overall, no compression technique emerges from these results as better than the others. Nevertheless, a sound result is that the methods providing the lowest occupancy, i.e., those combining weight pruning and quantization, still achieve high performance; moreover, it seems that pruning is preferable for regression problems, whereas quantization performs better for classification. However, we believe further studies are necessary to consider this trend as consolidated. Finally, our proposed storage representations, HAM and sHAM, produce the expected behaviors, being suitable for dense (HAM) and sparse (sHAM) compressed matrices, respectively, and remarkably improving on the CSC format for sparse matrices.

## V Conclusions

This work investigated both classical CNN compression techniques (like weight pruning and quantization) and a novel probabilistic compression algorithm, combined with a new lossless entropy coding storage of the network. Our results confirmed that model simplification can improve the overall generalization abilities, e.g., due to limited overfitting, and showed that our compressed representation, when suitably preceded by pruning and quantization, can reduce the space occupancy of the input network by a large factor (up to 165 times in our experiments). As a meaningful extension of this work, it would be worthwhile to perform the quantization so as to minimize the entropy of the quantized weights (known as entropy-coded scalar/vector quantization), which in turn would lead to shorter entropy coding [Choi20]. In this study we considered them separately, since the aim was to compare the effectiveness, in terms of prediction accuracy, of different compression techniques, and to detect possible performance trends related to the type of problem. Moreover, other source coding methodologies (known as universal lossless source coding, e.g., Lempel–Ziv source coding), less sensitive to source statistics, could be applied instead of Huffman coding. They are more convenient in practice, since they do not require knowledge of the source statistics, and have a smaller overhead, since the codebook (i.e., dictionary) is built from source symbols while encoding and decoding.