# Compact and Computationally Efficient Representation of Deep Neural Networks

Dot product operations between matrices are at the heart of almost any field in science and technology. In many cases, they are the component that requires the highest computational resources during execution. For instance, deep neural networks such as VGG-16 require up to 15 giga-operations in order to perform the dot products present in a single forward pass, which results in significant energy consumption and thus limits their use in resource-limited environments, e.g., on embedded devices or smartphones. One common approach to reduce the complexity of the inference is to prune and quantize the weight matrices of the neural network and to efficiently represent them using sparse matrix data structures. However, since there is no guarantee that the weight matrices exhibit significant sparsity after quantization, the sparse format may be suboptimal. In this paper we present new efficient data structures for representing matrices with low entropy statistics and show that these formats are especially suitable for representing neural networks. Like sparse matrix data structures, these formats exploit the statistical properties of the data in order to reduce the size and execution complexity. Moreover, we show that the proposed data structures can not only be regarded as a generalization of sparse formats, but are also more energy and time efficient under practically relevant assumptions. Finally, we test the storage requirements and execution performance of the proposed formats on compressed neural networks and compare them to dense and sparse representations. We experimentally show that we are able to attain up to x15 compression ratios, x1.7 speed ups and x20 energy savings when we losslessly convert state-of-the-art networks such as AlexNet, VGG-16, ResNet152 and DenseNet into the new data structures.


## I Introduction

The dot product operation between matrices constitutes one of the core operations in almost any field in science. Examples are the computation of approximate solutions of complex system behaviors in physics [1], iterative solvers in mathematics [2] and features in computer vision applications [3]. Deep neural networks also rely heavily on dot product operations in their inference [4, 5]; e.g., networks such as VGG-16 require up to 16 dot product operations, which results in 15 giga-operations for a single forward pass. Hence, lowering the algorithmic complexity of these operations and thus increasing their efficiency is of major interest for many modern applications. Since the complexity depends on the data structure used for representing the elements of the matrices, a great amount of research has focused on designing data structures and respective algorithms that can perform efficient dot product operations [6, 7, 8].

Of particular interest are the so-called sparse matrices, a special type of matrices that have the property that many of their elements are zero valued. In principle, one can design efficient representations of sparse matrices by leveraging the prior assumption that most of their element values are zero and therefore only storing the non-zero entries of the matrix. Consequently, their storage requirements become of the order of the number of non-zero values. However, having an efficient representation with regard to storage requirements does not imply that the dot product algorithm associated with that data structure will also be efficient. Hence, a great part of the research has focused on the design of data structures that also have low-complexity dot product algorithms [8, 9, 10, 11]. However, by assuming sparsity alone we are implicitly imposing a spike-and-slab prior (that is, a delta function at 0 and a uniform distribution over the non-zero elements) over the probability mass distribution of the elements of the matrix. If the actual distribution of the elements greatly differs from this assumption, then the data structures devised for sparse matrices become inefficient. Hence, sparsity can be too constrained an assumption for some applications of current interest, e.g., the representation of quantized neural networks.

In this work, we alleviate the shortcomings of sparse representations by considering a more relaxed prior over the distribution of the matrix elements. More precisely, we assume that the empirical probability mass distribution of the matrix elements has a low entropy value as defined by Shannon [12]. Mathematically, sparsity can be considered a subclass of the general family of low entropic distributions. In fact, sparsity measures the min-entropy of the element distribution, which is related to Shannon’s entropy measure through Renyi’s generalized entropy definition [13]. With this goal in mind, we ask the question:

“Can we devise efficient data structures under the implicit assumption that the entropy of the distribution of the matrix elements is low?”

We want to stress once more that by efficiency we regard two related but distinct aspects

1. efficiency with regard to storage requirements

2. efficiency with regard to algorithmic complexity of the dot product associated to the representation

For the latter, we focus on the number of elementary operations required in the algorithm, since they are related to the energy and time complexity of the algorithm. It is well known that the minimal bit-length of a data representation is bounded by the entropy of its distribution [12]. Hence, matrices with low entropic distributions automatically imply that we can design data structures that do not require high storage resources. In addition, as we will discuss in the next sections, low entropic distributions also attain gains in efficiency if these data structures implicitly encode the distributive law of multiplications. By doing so, a great part of the algorithmic complexity of the dot product is reduced to the order of the number of shared weights per row in a matrix. This number is related to the entropy, such that it is small as long as the entropy of the matrix is low. Therefore, these data structures not only attain higher compression gains, but also require fewer operations in total when performing the dot product.

Our contributions can be summarized as follows:

• We propose new highly efficient data structures that exploit the prior that the matrix has a low number of shared weights per row (i.e., low entropy).

• We provide a detailed analysis of the storage requirements and algorithmic complexity of performing the dot product associated to these data structures.

• We establish a relation between the known sparse and the proposed data structures. Namely, sparse matrices belong to the same family of low entropic distributions, however, they can be considered a more constrained subclass of them.

• We show through experiments that indeed, these data structures attain gains in efficiency on simulated as well as real-world data. In particular, we show that up to x42 compression ratios, x5 speed ups and x90 energy savings can be achieved when we benchmark the compressed weight matrices of state-of-the-art neural networks relative to the matrix-vector multiplication.

In the following Section II we introduce the problem of efficient representation of neural networks and briefly review related literature. In Section III the proposed data structures are given. We demonstrate through a simple example that these data structures are able to: 1) achieve higher compression ratios than their respective dense and sparse counterparts and 2) reduce the algorithmic complexity of performing the dot product. Section IV analyses the storage and energy complexity of these novel data structures. Experimental evaluation is performed in Section V using simulations as well as state-of-the-art neural networks such as AlexNet, VGG-16, ResNet152 and DenseNet. Section VI concludes the paper with a discussion.

## II Efficient Inference in Neural Networks

Deep neural networks [14, 15] became the state-of-the-art in many fields of machine learning, such as in computer vision, speech recognition and natural language processing [16, 17, 18, 19], and have been progressively also used in the sciences, e.g. physics [20], neuroscience [21], chemistry [22, 23]. In their most basic form, they constitute a chain of affine transformations concatenated with a non-linear function which is applied element-wise to the output. Hence, the goal is to learn the values of those transformation or weight matrices (i.e., parameters) such that the neural network performs its task particularly well. The procedure of calculating the output prediction of the network for a particular input is called inference. The computational cost of performing inference is dominated by computing the affine transformations (thus, the dot products between matrices). Since today's neural networks perform many dot product operations between large matrices, this greatly complicates their deployment onto resource-constrained devices.

However, it has been extensively shown that most neural networks are overparameterized, meaning that there are many more parameters than actually needed for the tasks of interest [24, 25, 26, 27]. This implies that these networks are highly inefficient with regard to the resources they require when performing inference. This fact motivated an entire research field of model compression [28]. One of the suggested approaches is to: 1) compress the weight elements of the neural network without (considerably) affecting their prediction accuracy and 2) convert the resulting weights into a representation that achieves high compression ratios and is able to execute the dot product operation efficiently. Whilst there has been a plethora of work focusing on the first step [29, 30, 31, 27, 32, 33, 34, 35, 36, 26, 37, 38, 39], previous literature has not focused as much on the second part. As a consequence, most of the research has focused on developing techniques that either sparsify the network's weights [29, 30, 31, 27] or reduce the cardinality of the weight elements [32, 33, 34], since then sparse matrix representations or dense matrices with compressed numerical representations can be employed in order to efficiently perform inference.

However, this greatly reduces the possible efficiency gains that can be achieved. In fact, the highest reported compression gains are attained with techniques that either implicitly [26, 38] or explicitly [35, 36, 37, 39] attempt to reduce the entropy of the weight matrices of the network. To recall, throughout this work we consider the entropy of the empirical probability mass distribution of the weight elements. That is, we first identify the set of unique elements that appear in the matrix, denoted as $\Omega = \{\omega_0, \dots, \omega_{K-1}\}$. Then, for each element $\omega_k$ in $\Omega$, we count its frequency of appearance and divide it by the total number of elements in the matrix, resulting in the probability mass value $p_k = \#(\omega_k)/N$, where $\#(\cdot)$ is the counting operator and $N$ the total number of elements in the matrix. Finally, we calculate Shannon's entropy $H = -\sum_{k=0}^{K-1} p_k \log_2 p_k$.
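The entropy computation described above can be sketched in a few lines of Python. This is only an illustration, not code from the paper: the function name `empirical_entropy` and the toy matrix are assumptions made for the example.

```python
import math
from collections import Counter


def empirical_entropy(matrix):
    """Shannon entropy (in bits) of the empirical distribution of the entries."""
    values = [v for row in matrix for v in row]
    total = len(values)                      # N, total number of elements
    counts = Counter(values)                 # frequency of each unique element
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical 2 x 4 matrix with probabilities 1/2, 1/4, 1/8, 1/8.
toy = [[0, 0, 0, 0],
       [1, 1, 2, 3]]
print(empirical_entropy(toy))  # 1.75
```

A matrix whose entries are all identical has entropy 0, and the entropy grows as the values become more evenly spread over many distinct elements.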

However, with no other means for representing the resulting compressed weight matrices, the achievable efficiency gains are bounded by the limitations of the sparse or dense representations.

For instance, figure 1 demonstrates the discrepancy between the sparsity assumption and the real distribution of weight elements. It plots the distribution of the weight elements of the last classification layer of VGG-16 [40] (a fully connected layer with 4096 inputs and 1000 outputs), after having applied uniform quantization on the weight elements. We stress that the prediction accuracy and generalization of the network was not affected by this operation. On the one hand, as we can see, the distribution of the compressed layer does not satisfy the sparsity assumption, i.e., there is not one particular element (such as 0) that appears especially frequently in the matrix. The most frequent value is -0.008 and its frequency of appearance does not dominate over the others (about 4.2%). On the other hand, naively compressing the numerical values of the matrix elements down to a trivial 7-bit representation would also result in an inefficient representation. Since the activation values are still represented as single precision floating point values (in this case, compressing the activation values down to a 7-bit representation would have significantly harmed the prediction accuracy of the network), the respective dot product algorithm would require multiple, mostly expensive decoding operations in order to convert each element of the weight matrix back into its original 32-bit floating point value.

Hence, neither sparse matrix representations nor the (compressed) dense representations can efficiently exploit the statistical properties of the weight matrix.

In this work, we overcome these limitations and present new matrix representations that become more efficient as the entropy of the weight matrices is reduced. In particular, their complexity depends partially on the number of shared weights present in the matrix, which is reduced as the entropy of the matrix is reduced. Indeed, we notice that for the matrix in figure 1 most of the entries are dominated by only 15 distinct values, which is 1.5% of the number of columns of the matrix. In the next section we will describe with a simple example how these new representations leverage this property in order to achieve both high compression ratios and efficient dot products.

## III Data structures for matrices with low entropy statistics

In this section we introduce the proposed data structures and show that they implicitly encode the distributive law. Consider the following matrix

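$$
M = \begin{pmatrix}
0 & 3 & 0 & 2 & 4 & 0 & 0 & 2 & 3 & 4 & 0 & 4 \\
4 & 4 & 0 & 0 & 0 & 4 & 0 & 0 & 4 & 4 & 0 & 4 \\
4 & 0 & 3 & 4 & 0 & 0 & 0 & 4 & 0 & 2 & 0 & 0 \\
0 & 0 & 0 & 4 & 4 & 4 & 0 & 3 & 4 & 4 & 0 & 0 \\
0 & 4 & 4 & 0 & 0 & 4 & 0 & 4 & 0 & 0 & 0 & 0
\end{pmatrix}
$$

Now assume that we want to: 1) store this matrix with the minimum amount of bits and 2) perform the dot product with a vector $a \in \mathbb{R}^{12}$ with the minimum complexity.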

### III-A Minimum storage

We first comment on the storage requirements of the dense and sparse formats and then introduce two new formats which store matrix $M$ more effectively.
Dense format: Arguably the simplest way to store the matrix $M$ is in its so-called dense representation. That is, we store its elements in a $5 \times 12$ long array (in addition to its dimensions $m = 5$ and $n = 12$).
Sparse format: However, notice that more than 50% of the entries are 0. Hence, we may be able to attain a more compressed representation of this matrix if we store it in one of the well-known sparse data structures, for instance, in the Compressed Sparse Row (or CSR in short) format. This particular format stores the values of the matrix in the following way:

• Scans the non-zero elements in row-major order (that is, from left to right, top to bottom) and stores them in an array (which we denote as $W$).

• Simultaneously, it stores the respective column indices in another array (which we call $colI$).

• Finally, it stores pointers that signal when a new row starts (we denote this array as $rowPtr$).

Hence, our matrix would take the form

$$
\begin{aligned}
W &: [3,2,4,2,3,4,4,4,4,4,4,4,4,4,3,4,4,2,4,4,4,3,4,4,4,4,4,4] \\
colI &: [1,3,4,7,8,9,11,0,1,5,8,9,11,0,2,3,7,9,3,4,5,7,8,9,1,2,5,7] \\
rowPtr &: [0,7,13,18,24,28]
\end{aligned}
$$

If we assume the same bit-size per element for all arrays, then the CSR data structure does not attain higher compression gains in spite of not saving the zero valued elements (62 entries vs. 60 that are being required by the dense data structure).
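The CSR construction and the entry count quoted above can be checked with a short Python sketch. The helper name `to_csr` is an assumption made for this illustration; the matrix is the example $M$ of this section.

```python
# Example matrix M from this section (5 rows, 12 columns).
M = [
    [0, 3, 0, 2, 4, 0, 0, 2, 3, 4, 0, 4],
    [4, 4, 0, 0, 0, 4, 0, 0, 4, 4, 0, 4],
    [4, 0, 3, 4, 0, 0, 0, 4, 0, 2, 0, 0],
    [0, 0, 0, 4, 4, 4, 0, 3, 4, 4, 0, 0],
    [0, 4, 4, 0, 0, 4, 0, 4, 0, 0, 0, 0],
]


def to_csr(matrix):
    """Return the CSR arrays (W, colI, rowPtr) of a list-of-lists matrix."""
    W, colI, rowPtr = [], [], [0]
    for row in matrix:
        for j, value in enumerate(row):
            if value != 0:
                W.append(value)      # non-zero value, in row-major order
                colI.append(j)       # its column index
        rowPtr.append(len(W))        # position where the next row starts
    return W, colI, rowPtr


W, colI, rowPtr = to_csr(M)
csr_entries = len(W) + len(colI) + len(rowPtr)
dense_entries = sum(len(row) for row in M)
print(csr_entries, dense_entries)  # 62 60
```

The 28 non-zero values are stored twice over (once as values, once as column indices), which is why CSR does not pay off at a sparsity of only about 53%.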
We can improve this by exploiting the low-entropy property of matrix $M$. In the following, we propose two new formats which realize this.
Compressed Entropy Row (CER) format: Firstly, notice that many elements in $M$ share the same value. In fact, only the four values $\{0, 4, 3, 2\}$ appear in the entire matrix. Hence, it appears reasonable to assume that data structures that repeatedly store these values (such as the dense or CSR structures) induce high redundancies in their representation. Therefore, we propose a data structure where we store those values only once. Secondly, notice that some elements appear more frequently than others, and their relative order does not change throughout the rows of the matrix. Concretely, we have a set of unique elements $\Omega = \{0, 4, 3, 2\}$ which appear $\{32, 21, 4, 3\}$ times respectively in the matrix, and we obtain the same relative order from most to least frequent value throughout the rows of the matrix. Hence, we can design an efficient data structure which leverages both properties in the following way:

1. Store the unique elements present in the matrix in an array in frequency-major order (that is, from most to least frequent). We name this array $\Omega$.

2. Store respectively the column indices in row-major order, excluding the first element (thus excluding the most frequent element). We denote it as $colI$.

3. Store pointers that signal when the positions of the next new element in $\Omega$ start. We name it $\Omega Ptr$. If a particular pointer in $\Omega Ptr$ is the same as the previous one, this means that the current element is not present in the current row and we jump to the next element.

4. Store pointers that signal when a new row starts. We name it $rowPtr$. Here, $rowPtr$ points to entries in $\Omega Ptr$.

Hence, this new data structure represents matrix as

$$
\begin{aligned}
\Omega &: [0,4,3,2] \\
colI &: [4,9,11,1,8,3,7,0,1,5,8,9,11,0,3,7,2,9,3,4,5,8,9,7,1,2,5,7] \\
\Omega Ptr &: [0,3,5,7,13,16,17,18,23,24,28] \\
rowPtr &: [0,3,4,7,9,10]
\end{aligned}
$$

Notice that we can uniquely reconstruct $M$ from this data structure. We refer to this data structure as the Compressed Entropy Row (or CER in short) data structure. One can verify that indeed, the CER data structure only requires 49 entries (instead of 60 or 62), thus attaining a compressed representation of the matrix $M$.
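The four construction steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `to_cer` and `from_cer` are hypothetical, and padding pointers are only emitted when a less frequent element follows an absent one in the same row, which is what reproduces the arrays listed above.

```python
from collections import Counter

M = [
    [0, 3, 0, 2, 4, 0, 0, 2, 3, 4, 0, 4],
    [4, 4, 0, 0, 0, 4, 0, 0, 4, 4, 0, 4],
    [4, 0, 3, 4, 0, 0, 0, 4, 0, 2, 0, 0],
    [0, 0, 0, 4, 4, 4, 0, 3, 4, 4, 0, 0],
    [0, 4, 4, 0, 0, 4, 0, 4, 0, 0, 0, 0],
]


def to_cer(matrix):
    """Build (omega, colI, omegaPtr, rowPtr) for a list-of-lists matrix."""
    counts = Counter(v for row in matrix for v in row)
    omega = sorted(counts, key=lambda v: -counts[v])   # frequency-major order
    colI, omegaPtr, rowPtr = [], [0], [0]
    for row in matrix:
        present = [k for k in range(1, len(omega)) if omega[k] in row]
        last = present[-1] if present else 0
        for k in range(1, last + 1):
            colI.extend(j for j, v in enumerate(row) if v == omega[k])
            omegaPtr.append(len(colI))   # a repeated pointer marks an absent element
        rowPtr.append(len(omegaPtr) - 1)
    return omega, colI, omegaPtr, rowPtr


def from_cer(omega, colI, omegaPtr, rowPtr, n_cols):
    """Reconstruct the dense matrix; omega[0] fills all unlisted positions."""
    matrix = []
    for r in range(len(rowPtr) - 1):
        row = [omega[0]] * n_cols
        for k, s in enumerate(range(rowPtr[r], rowPtr[r + 1]), start=1):
            for j in colI[omegaPtr[s]:omegaPtr[s + 1]]:
                row[j] = omega[k]
        matrix.append(row)
    return matrix


omega, colI, omegaPtr, rowPtr = to_cer(M)
cer_entries = len(omega) + len(colI) + len(omegaPtr) + len(rowPtr)
print(cer_entries)                                         # 49
print(from_cer(omega, colI, omegaPtr, rowPtr, 12) == M)    # True
```

The element value of each group is never stored per row: it is implied by the position of the group within the row, which is why the format relies on a stable frequency order across rows.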

To summarize, the CER representation is able to attain higher compression gains because it leverages on the following two properties: 1) many matrix elements share the same value and 2) the empirical probability mass distribution of the shared weight elements does not change significantly across rows.
Compressed Shared Elements Row (CSER) format: In some cases, it may well be that the probability distributions across the rows of a matrix are not similar to each other. Hence, the second assumption in the CER data structure would not apply and we would only be left with the first one. That is, we only know that not many distinct elements appear per row in the matrix or, in other words, that many elements share the same value. The compressed shared elements row (or CSER in short) data structure is a slight extension to the previous CER representation. Here, we add an element pointer array, which signals which element in $\Omega$ the indices refer to. We call it $\Omega I$. Thus, $rowPtr$ points to entries in $\Omega Ptr$, $\Omega Ptr$ to entries in $colI$ and $\Omega I$ to entries in $\Omega$. Hence, the above matrix would then be represented as follows

$$
\begin{aligned}
\Omega &: [0,2,3,4] \\
colI &: [4,9,11,1,8,3,7,0,1,5,8,9,11,0,3,7,2,9,3,4,5,8,9,7,1,2,5,7] \\
\Omega I &: [3,2,1,3,3,2,1,3,2,3] \\
\Omega Ptr &: [0,3,5,7,13,16,17,18,23,24,28] \\
rowPtr &: [0,3,4,7,9,10]
\end{aligned}
$$

Thus, for storing matrix $M$ we require 59 entries, which is still a gain but not a significant one. Notice that now the ordering of the elements in $\Omega$ is not important anymore, as long as the array $\Omega I$ is accordingly adjusted. Similarly, the ordering of $\Omega I$ at each row can also be arbitrary, as long as the $\Omega Ptr$ and $colI$ arrays are accordingly adjusted.
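A corresponding sketch for CSER is given below. It is an illustration only: the helper names `to_cser` and `from_cser` are hypothetical, the per-row group order (most frequent value first) is chosen merely to reproduce the arrays above, and passing the implicit most frequent value as an explicit `fill` argument to the decoder is an assumption of this example.

```python
from collections import Counter

M = [
    [0, 3, 0, 2, 4, 0, 0, 2, 3, 4, 0, 4],
    [4, 4, 0, 0, 0, 4, 0, 0, 4, 4, 0, 4],
    [4, 0, 3, 4, 0, 0, 0, 4, 0, 2, 0, 0],
    [0, 0, 0, 4, 4, 4, 0, 3, 4, 4, 0, 0],
    [0, 4, 4, 0, 0, 4, 0, 4, 0, 0, 0, 0],
]


def to_cser(matrix):
    """Build (omega, colI, omegaI, omegaPtr, rowPtr) for a list-of-lists matrix."""
    counts = Counter(v for row in matrix for v in row)
    omega = sorted(counts)                              # any fixed order works
    by_frequency = sorted(counts, key=lambda v: -counts[v])
    most_frequent = by_frequency[0]                     # stored implicitly
    colI, omegaI, omegaPtr, rowPtr = [], [], [0], [0]
    for row in matrix:
        for value in by_frequency[1:]:
            if value not in row:
                continue                                # no padding needed in CSER
            colI.extend(j for j, v in enumerate(row) if v == value)
            omegaI.append(omega.index(value))           # which element this group uses
            omegaPtr.append(len(colI))
        rowPtr.append(len(omegaI))
    return omega, colI, omegaI, omegaPtr, rowPtr, most_frequent


def from_cser(omega, colI, omegaI, omegaPtr, rowPtr, n_cols, fill):
    """Reconstruct the dense matrix; `fill` is the implicit most frequent value."""
    matrix = []
    for r in range(len(rowPtr) - 1):
        row = [fill] * n_cols
        for s in range(rowPtr[r], rowPtr[r + 1]):
            for j in colI[omegaPtr[s]:omegaPtr[s + 1]]:
                row[j] = omega[omegaI[s]]
        matrix.append(row)
    return matrix


omega, colI, omegaI, omegaPtr, rowPtr, fill = to_cser(M)
cser_entries = sum(len(x) for x in (omega, colI, omegaI, omegaPtr, rowPtr))
print(cser_entries)  # 59
```

Compared with CER, the extra array `omegaI` costs one entry per group, but rows no longer need to share a common frequency order.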

The relationship between CSER, CER and CSR data structures is described in Section IV.

### III-B Dot product complexity

We just saw that we can attain gains with regard to compression if we represent the matrix in the CER and CSER data structures. However, we can also devise corresponding dot product algorithms that are more efficient than their dense and sparse counterparts. As an example, consider only the scalar product between the second row of matrix $M$ and an arbitrary input vector $a = (a_1, \dots, a_{12})$. In principle, the difference in the algorithmic complexity arises because each data structure implicitly encodes a different expression of the scalar product, namely

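$$
\begin{aligned}
\text{dense} &: 4a_1 + 4a_2 + 0a_3 + 0a_4 + 0a_5 + 4a_6 + 0a_7 + 0a_8 + 4a_9 + 4a_{10} + 0a_{11} + 4a_{12} \\
\text{CSR} &: 4a_1 + 4a_2 + 4a_6 + 4a_9 + 4a_{10} + 4a_{12} \\
\text{CER/CSER} &: 4(a_1 + a_2 + a_6 + a_9 + a_{10} + a_{12})
\end{aligned}
$$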

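For instance, the dot product algorithm associated to the dense format would calculate the above scalar product by

1. loading the elements of the second row of $M$ and of the input vector $a$, and

2. calculating $4a_1 + 4a_2 + 0a_3 + 0a_4 + 0a_5 + 4a_6 + 0a_7 + 0a_8 + 4a_9 + 4a_{10} + 0a_{11} + 4a_{12}$.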

This requires 24 load (12 for the matrix elements and 12 for the input vector elements), 12 multiply, 11 add and 1 write operations (for writing the result into memory). We purposely omitted the accumulate operation which stores the intermediate values of the multiply-sum operations, since their cost can effectively be associated to the sum operation. Moreover, we only considered read/write operations from and into memory. Hence, this makes 48 operations in total.
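The operation count for the dense format can be reproduced with a small instrumented Python sketch. The function name `dense_row_dot` and the concrete input vector are assumptions made for illustration; the counting convention follows the text (loads from memory, multiplications, additions and one final write).

```python
def dense_row_dot(row, a):
    """Scalar product of one dense row with `a`, counting elementary operations."""
    ops = {"load": 0, "mul": 0, "add": 0, "write": 0}
    total = 0
    for i, (w, x) in enumerate(zip(row, a)):
        ops["load"] += 2            # one matrix element and one vector element
        product = w * x
        ops["mul"] += 1
        if i == 0:
            total = product         # first term needs no addition
        else:
            total += product
            ops["add"] += 1
    ops["write"] += 1               # store the result in memory
    return total, ops


row2 = [4, 4, 0, 0, 0, 4, 0, 0, 4, 4, 0, 4]   # second row of M
a = list(range(1, 13))                         # hypothetical input: a_i = i
value, ops = dense_row_dot(row2, a)
print(value, ops, sum(ops.values()))
# 160 {'load': 24, 'mul': 12, 'add': 11, 'write': 1} 48
```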

In contrast, the dot product algorithm associated with the CSR representation would only multiply-add the non-zero entries. It does so by performing the following steps

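1. Load the subset of $rowPtr$ respective to row 2. Thus, $rowPtr \to [7, 13]$.

2. Then, load the respective subset of non-zero elements and column indices. Thus, $W \to [4,4,4,4,4,4]$ and $colI \to [0,1,5,8,9,11]$.

3. Finally, load the subset of elements of $a$ respective to the loaded subset of column indices and subsequently multiply-add them to the loaded subset of $W$. Thus, $a \to [a_1, a_2, a_6, a_9, a_{10}, a_{12}]$ and calculate $4a_1 + 4a_2 + 4a_6 + 4a_9 + 4a_{10} + 4a_{12}$.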

By executing this algorithm we would require 20 load operations (2 from $rowPtr$ and 6 each for $W$, $colI$ and the input vector), 6 multiplications, 5 additions and 1 write. In total this dot product algorithm requires 32 operations.
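The three CSR steps and the resulting count can be sketched in the same instrumented style. The function name `csr_row_dot` and the input vector are assumptions made for this example; the arrays are the CSR arrays of $M$ given in Section III-A.

```python
W = [3,2,4,2,3,4,4,4,4,4,4,4,4,4,3,4,4,2,4,4,4,3,4,4,4,4,4,4]
colI = [1,3,4,7,8,9,11,0,1,5,8,9,11,0,2,3,7,9,3,4,5,7,8,9,1,2,5,7]
rowPtr = [0, 7, 13, 18, 24, 28]


def csr_row_dot(W, colI, rowPtr, a, r):
    """Scalar product of CSR row `r` (0-based) with `a`, counting operations."""
    ops = {"load": 0, "mul": 0, "add": 0, "write": 0}
    start, end = rowPtr[r], rowPtr[r + 1]
    ops["load"] += 2                     # the two row pointers
    total = 0
    for p in range(start, end):
        w, x = W[p], a[colI[p]]
        ops["load"] += 3                 # W[p], colI[p] and the vector element
        product = w * x
        ops["mul"] += 1
        if p == start:
            total = product              # first term needs no addition
        else:
            total += product
            ops["add"] += 1
    ops["write"] += 1
    return total, ops


a = list(range(1, 13))                   # hypothetical input: a_i = i
value, ops = csr_row_dot(W, colI, rowPtr, a, 1)   # second row has index 1
print(value, ops, sum(ops.values()))
# 160 {'load': 20, 'mul': 6, 'add': 5, 'write': 1} 32
```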

However, we can still see that the above dot product algorithm is inefficient in this case since we constantly multiply by the same element 4. Instead, the dot product algorithm associated to, e.g., the CER data structure, would perform the following steps

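1. Load the subset of $rowPtr$ respective to row 2. Thus, $rowPtr \to [3, 4]$.

2. Load the corresponding subset in $\Omega Ptr$. Thus, $\Omega Ptr \to [7, 13]$.

3. For each pair of elements in $\Omega Ptr$, load the respective subset in $colI$ and the element in $\Omega$. Thus, $\Omega \to [4]$ and $colI \to [0,1,5,8,9,11]$.

4. For each loaded subset of $colI$, perform the sum of the elements of $a$ respective to the loaded $colI$. Thus, $a \to [a_1, a_2, a_6, a_9, a_{10}, a_{12}]$ and do $z = a_1 + a_2 + a_6 + a_9 + a_{10} + a_{12}$.

5. Subsequently, multiply the sum with the respective element in $\Omega$. Thus, compute $4z$.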

A similar algorithm can be devised for the CSER data structure. One can find both pseudocodes in the appendix. The operations required by this algorithm are 17 load operations (2 from $rowPtr$, 2 from $\Omega Ptr$, 1 from $\Omega$, 6 from $colI$ and 6 from $a$), 1 multiplication, 5 additions and 1 write. In total these are 24 operations.
Hence, we have observed that for the matrix $M$, the CER (and CSER) data structure not only achieves higher compression rates, but also attains gains in efficiency with respect to the dot product operation.
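The CER steps can be sketched in Python as well. This is not the pseudocode from the appendix but an illustrative version under two assumptions: the most frequent element `omega[0]` is 0 (so it contributes nothing to the product), and the function name `cer_row_dot` and the input vector are made up for the example.

```python
omega = [0, 4, 3, 2]
colI = [4,9,11,1,8,3,7,0,1,5,8,9,11,0,3,7,2,9,3,4,5,8,9,7,1,2,5,7]
omegaPtr = [0, 3, 5, 7, 13, 16, 17, 18, 23, 24, 28]
rowPtr = [0, 3, 4, 7, 9, 10]


def cer_row_dot(omega, colI, omegaPtr, rowPtr, a, r):
    """Scalar product of CER row `r` (0-based) with `a`, counting operations."""
    ops = {"load": 0, "mul": 0, "add": 0, "write": 0}
    first, last = rowPtr[r], rowPtr[r + 1]
    ops["load"] += 2                          # the two row pointers
    result, have_result = 0, False
    lo = 0
    if first < last:
        lo = omegaPtr[first]
        ops["load"] += 1
    for k, s in enumerate(range(first, last), start=1):
        hi = omegaPtr[s + 1]
        ops["load"] += 1
        if hi > lo:                           # equal pointers: element absent
            w = omega[k]
            ops["load"] += 1
            group_sum = 0
            for p in range(lo, hi):
                x = a[colI[p]]
                ops["load"] += 2              # column index and vector element
                if p == lo:
                    group_sum = x
                else:
                    group_sum += x
                    ops["add"] += 1
            product = w * group_sum           # one multiplication per group
            ops["mul"] += 1
            if have_result:
                result += product
                ops["add"] += 1
            else:
                result, have_result = product, True
        lo = hi
    ops["write"] += 1
    return result, ops


a = list(range(1, 13))                        # hypothetical input: a_i = i
value, ops = cer_row_dot(omega, colI, omegaPtr, rowPtr, a, 1)
print(value, ops, sum(ops.values()))
# 160 {'load': 17, 'mul': 1, 'add': 5, 'write': 1} 24
```

The saving over CSR comes from applying the distributive law: the six multiplications by 4 collapse into a single one, and the value 4 is loaded once instead of six times.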
In the next section we give a detailed analysis about the storage requirements needed by the data structures and also the efficiency of the dot product algorithm associated to them. This will help us identify when one type of data structure will attain higher gains than the others.

## IV An analysis of the storage and energy complexity of data structures

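Without loss of generality, in the following we assume that we aim to encode a particular matrix $M \in \Omega^{m \times n}$, where its elements take values from a finite set of elements $\Omega = \{\omega_0, \omega_1, \dots, \omega_{K-1}\}$. Moreover, we assign to each element $\omega_k$ a probability mass value $p_k = \#(\omega_k)/N$, where $\#(\omega_k)$ counts the number of times the element $\omega_k$ appears in the matrix $M$. We denote the respective set of probability mass values $P_\Omega = \{p_0, p_1, \dots, p_{K-1}\}$. In addition, we assume that each element in $\Omega$ appears at least once in the matrix (thus, $p_k > 0$ for all $k = 0, \dots, K-1$) and that $\omega_0 = 0$ is the most frequent value in the matrix. Finally, we order the elements in $\Omega$ and $P_\Omega$ in probability-major order, that is, $p_0 \ge p_1 \ge \dots \ge p_{K-1}$.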

### IV-A Measuring the energy efficiency of the dot product

This work proposes representations that are efficient with regard to storage requirements as well as their dot product algorithmic complexity. For the latter, we focus on the energy requirements, since we consider them the most relevant measure for neural network compression. However, exactly measuring the energy of an algorithm is unreliable since it depends on the software implementation and on the hardware the program is running on. Therefore, we will model the energy costs in a way that can easily be adapted across different software implementations as well as hardware architectures.

In the following we model a dot product algorithm by a computational graph, whose nodes can be labeled with one of four elementary operations, namely: 1) a mul or multiply operation which takes two numbers as input and outputs their multiplied value, 2) a sum or summation operation which takes two values as input and outputs their sum, 3) a read operation which reads a particular number from memory and 4) a write operation which writes a value into memory. Note, that we do not consider read/write operations from/into low level memory (like caches and registers) that store temporary runtime values, e.g., outputs from summation and/or multiplications, since their cost can be associated to those operations. Now, each of these nodes can be associated with an energy cost. Then, the total energy required for a particular dot product algorithm simply equals the total cost of the nodes in the graph.

However, the energy cost of each node depends on the hardware architecture and on the bit-size of the values involved in the operation. Hence, in order to make our model flexible with regard to different hardware architectures, we introduce four cost functions $\sigma$, $\mu$, $\gamma$ and $\delta$, which take as input a bit-size and output the energy cost of performing the operation associated to them (the sum and mul operations take two numbers as input, which may have different bit-sizes; in this case, we take the maximum of those as the reference bit-size of the operation); $\sigma$ is associated to the sum operation, $\mu$ to the mul, $\gamma$ to the read and $\delta$ to the write operation.

Figure 2 shows the computational graph of a simple dot product algorithm for two 2-dimensional input vectors. This algorithm requires 4 read operations, 2 mul, 1 sum and 1 write. Assuming that the bit-size of all numbers is $b$, we can state that the energy cost of this dot product algorithm would be $E = 1\sigma(b) + 2\mu(b) + 4\gamma(b) + 1\delta(b)$. Note that similar energy models have been previously proposed [41, 42]. In the experimental section we validate the model by comparing it to real energy results measured by previous authors.
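The energy model amounts to a weighted count of graph nodes, which can be sketched as follows. The cost functions below are placeholders chosen only so that the example runs (linear in the bit-size, arbitrary units); they are not measured values and are not taken from the paper. The function name `energy` is likewise an assumption.

```python
def energy(op_counts, bits, cost):
    """Total energy: number of nodes of each type times its per-node cost."""
    return sum(n_ops * cost[op](bits) for op, n_ops in op_counts.items())


# Hypothetical cost functions (arbitrary units); replace with values
# appropriate for the target hardware.
cost = {
    "sum": lambda b: 1 * b,
    "mul": lambda b: 3 * b,
    "read": lambda b: 5 * b,
    "write": lambda b: 5 * b,
}

# Node counts of the 2-dimensional dot product described for Figure 2.
figure2_ops = {"read": 4, "mul": 2, "sum": 1, "write": 1}
print(energy(figure2_ops, 32, cost))  # 1024
```

Swapping in a different `cost` dictionary is all that is needed to re-evaluate the same algorithm for another hardware profile or bit-size.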

Considering this energy model we can now provide a detailed analysis of complexity of the CER and CSER data structure. However, we start with a brief analysis of the storage and energy requirements of the dense and sparse data structure in order to facilitate the comparison between them.

### IV-B Efficiency analysis of the dense and CSR formats

The dense data structure stores the matrix in an $N$-long array (where $N = m \times n$) using a constant bit-size $b_\Omega$ for each element. Therefore, its effective per-element storage requirement is

$$S_{dense} = b_\Omega \tag{1}$$

bits. The associated standard scalar product algorithm then has the following per-element energy cost

$$E_{dense} = \sigma(b_o) + \mu(b_o) + \gamma(b_a) + \gamma(b_\Omega) + \frac{1}{n}\delta(b_o) \tag{2}$$

where $b_a$ denotes the bit-size of the elements of the input vector and $b_o$ the bit-size of the elements of the output vector. The cost (2) is derived from considering 1) loading the elements of the input vector [$\gamma(b_a)$], 2) loading the elements of the matrix [$\gamma(b_\Omega)$], 3) multiplying them [$\mu(b_o)$], 4) summing the multiplications [$\sigma(b_o)$], and 5) writing the result [$\delta(b_o)/n$]. We can see that both the storage and the dot product efficiency have a constant cost attached to them, regardless of the distribution of the elements of the matrix.

In contrast, the CSR data structure requires only

$$S_{CSR} = (1 - p_0)(b_\Omega + b_I) + \frac{1}{n} b_I \tag{3}$$

effective bits per element in order to represent the matrix, where $b_I$ denotes the bit-size of the column indices. This comes from the fact that we need in total $N(1 - p_0) b_\Omega$ bits for representing the non-zero elements of the matrix, $N(1 - p_0) b_I$ bits for their respective column indices and $m b_I$ bits for the row pointers. Moreover, it requires

$$E_{CSR} = (1 - p_0)\big(\sigma(b_o) + \mu(b_o) + \gamma(b_a) + \gamma(b_\Omega) + \gamma(b_I)\big) + \frac{1}{n}\gamma(b_I) + \frac{1}{n}\delta(b_o) \tag{4}$$

units of energy per matrix element in order to perform the dot product. The expression (4) was derived from 1) loading the non-zero element values [$(1 - p_0)\gamma(b_\Omega)$], their respective indices [$(1 - p_0)\gamma(b_I) + \frac{1}{n}\gamma(b_I)$] and the respective elements of the input vector [$(1 - p_0)\gamma(b_a)$], 2) multiplying and summing those elements [$(1 - p_0)(\mu(b_o) + \sigma(b_o))$] and then 3) writing the result into memory [$\frac{1}{n}\delta(b_o)$].

Unlike the dense format, the efficiency of the CSR data structure increases as $p_0 \to 1$, that is, as the number of zero elements increases. Moreover, if the matrix size is large enough, the storage requirement and the cost of performing a dot product become effectively 0 as $p_0 \to 1$.
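As a sanity check, the storage expressions (1) and (3) can be evaluated for the example matrix of Section III ($p_0 = 32/60$, $n = 12$, $N = 60$). The sketch below measures storage in entries by setting every bit-size to 1; the function names are assumptions made for illustration.

```python
from fractions import Fraction


def storage_dense(b_omega):
    """Per-element storage of the dense format, eq. (1)."""
    return b_omega


def storage_csr(p0, n, b_omega, b_index):
    """Per-element storage of the CSR format, eq. (3)."""
    return (1 - p0) * (b_omega + b_index) + Fraction(b_index, n)


p0, n, N = Fraction(32, 60), 12, 60      # statistics of the example matrix M
per_element = storage_csr(p0, n, 1, 1)   # unit bit-sizes: count entries
print(per_element, per_element * N)      # 61/60 61
print(storage_dense(1) * N)              # 60
```

The formula yields 61 entries rather than the 62 counted in Section III-A because it accounts for $m$ row pointers, whereas the $rowPtr$ array of the example stores $m + 1 = 6$ of them; the difference vanishes per element as the matrix grows.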

For ease of analysis, we introduce the big $\mathcal{O}$ notation for capturing terms that depend on the shape of the matrix. In addition, we denote the following sets of operations

$$c_a = \sigma(b_a) + \gamma(b_a) + \gamma(b_I) \tag{5}$$

$$c_\Omega = \gamma(b_I) + \gamma(b_\Omega) + \mu(b_o) + \sigma(b_o) - \sigma(b_a) \tag{6}$$

$c_a$ can be interpreted as the total effective cost of involving an element of the input vector in the dot product operation. Analogously, $c_\Omega$ can be interpreted with regard to the elements of the matrix. Hence, we can rewrite the above equations (2) and (4) as follows

$$E_{dense} = c_a + c_\Omega - 2\gamma(b_I) + \mathcal{O}(1/n) \tag{7}$$

$$E_{CSR} = (1 - p_0)\big(c_a + c_\Omega - \gamma(b_I)\big) + \mathcal{O}(1/n) \tag{8}$$

### IV-C Efficiency analysis of the CER and CSER formats

Following a similar reasoning as above, we can state the following theorem

###### Theorem 1

Let $M \in \mathbb{R}^{m \times n}$ be a matrix. Let further $p_0$ be the empirical probability mass of the zero element, and let $b_I$ be the bit-size of the numerical representation of a column or row index in the matrix. Then, the CER representation of $M$ requires

$$S_{CER} = (1 - p_0) b_I + \frac{\bar{k} + \tilde{k}}{n} b_I + \mathcal{O}(1/n) + \mathcal{O}(1/N) \tag{9}$$

effective bits per matrix element, where $\bar{k}$ denotes the average number of shared elements that appear per row (excluding the most frequent value), $\tilde{k}$ the average number of padded indices per row and $N$ the total number of elements of the matrix. Moreover, the effective cost associated to the dot product with an input vector $a$ is

$$E_{CER} = (1 - p_0) c_a + \frac{\bar{k}}{n} c_\Omega + \frac{\tilde{k}}{n} \gamma(b_I) + \mathcal{O}(1/n) \tag{10}$$

per matrix element, where $c_a$ and $c_\Omega$ are as in (5) and (6).

Analogously, we can state

###### Theorem 2

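Let $M$, $p_0$, $b_I$, $\bar{k}$, $n$, $N$ be as in theorem 1. Then, the CSER representation of $M$ requires

$$S_{CSER} = (1 - p_0) b_I + \frac{2\bar{k}}{n} b_I + \mathcal{O}(1/n) + \mathcal{O}(1/N) \tag{11}$$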

effective bits per matrix element, and the per element cost associated to the dot product with an input vector is

$$E_{CSER} = (1 - p_0) c_a + \frac{\bar{k}}{n} c_\Omega + \frac{\bar{k}}{n} \gamma(b_I) + \mathcal{O}(1/n) \tag{12}$$
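To make the quantities in the two theorems concrete, the following sketch computes $p_0$, $\bar{k}$ and $\tilde{k}$ for a given matrix and evaluates the leading terms of (9), (10) and (12), i.e. everything except the $\mathcal{O}(\cdot)$ remainders. The function names and the unit costs used in the example ($c_a = c_\Omega = 3$, $\gamma(b_I) = 1$) are assumptions for illustration, and a padded index is counted here whenever an element is absent from a row while a less frequent one is present.

```python
from collections import Counter
from fractions import Fraction

M = [
    [0, 3, 0, 2, 4, 0, 0, 2, 3, 4, 0, 4],
    [4, 4, 0, 0, 0, 4, 0, 0, 4, 4, 0, 4],
    [4, 0, 3, 4, 0, 0, 0, 4, 0, 2, 0, 0],
    [0, 0, 0, 4, 4, 4, 0, 3, 4, 4, 0, 0],
    [0, 4, 4, 0, 0, 4, 0, 4, 0, 0, 0, 0],
]


def row_statistics(matrix):
    """Return (p0, k_bar, k_tilde) of a list-of-lists matrix."""
    counts = Counter(v for row in matrix for v in row)
    omega = sorted(counts, key=lambda v: -counts[v])    # frequency-major order
    m, n = len(matrix), len(matrix[0])
    shared = padded = 0
    for row in matrix:
        present = [k for k in range(1, len(omega)) if omega[k] in row]
        shared += len(present)
        if present:
            padded += present[-1] - len(present)        # gaps before the last one
    return Fraction(counts[omega[0]], m * n), Fraction(shared, m), Fraction(padded, m)


def storage_cer(p0, k_bar, k_tilde, n, b_index):
    """Leading terms of eq. (9)."""
    return (1 - p0) * b_index + (k_bar + k_tilde) / n * b_index


def energy_cer(p0, k_bar, k_tilde, n, c_a, c_omega, gamma_index):
    """Leading terms of eq. (10)."""
    return (1 - p0) * c_a + k_bar / n * c_omega + k_tilde / n * gamma_index


def energy_cser(p0, k_bar, n, c_a, c_omega, gamma_index):
    """Leading terms of eq. (12)."""
    return (1 - p0) * c_a + k_bar / n * c_omega + k_bar / n * gamma_index


p0, k_bar, k_tilde = row_statistics(M)
print(p0, k_bar, k_tilde)                        # 8/15 2 0
print(storage_cer(p0, k_bar, k_tilde, 12, 1))    # 19/30
print(energy_cer(p0, k_bar, k_tilde, 12, 3, 3, 1))   # 19/10
print(energy_cser(p0, k_bar, 12, 3, 3, 1))           # 31/15
```

For the example matrix the leading storage term corresponds to $60 \cdot 19/30 = 38$ entries; the remaining entries of the 49 counted in Section III-A ($rowPtr$, $\Omega$ and the initial pointers) fall into the $\mathcal{O}(1/n)$ and $\mathcal{O}(1/N)$ terms.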

The proofs of theorems 1 and 2 are in the appendix. These theorems state that the efficiency of the data structures depends on the $(\bar{k}, p_0)$ (average number of distinct elements per row, sparsity) values of the empirical distribution of the elements of the matrix. That is, these data structures are increasingly efficient for distributions that have high $p_0$ and low $\bar{k}$ values. However, since the entropy measures the effective average number of distinct values that a random variable outputs (from Shannon's source coding theorem [12] we know that the entropy $H$ of a random variable gives the effective average number of bits that it outputs; therefore, we may interpret $2^H$ as the effective average number of distinct elements that a particular random variable outputs), both values are intrinsically related to it. In fact, from Rényi's generalized entropy definition [13] we know that $p_0 \ge 2^{-H}$. Moreover, the following properties are satisfied

• , as or , and

• , as or .

Consequently, we can state the following corollary

###### Corollary 2.1

For a fixed set size $K$ of unique elements and constant index bit-size $b_I$, the storage requirements as well as the cost of the dot product operation of the CER and CSER representations satisfy

$$S, E \leq \mathcal{O}(1 - 2^{-H}) + \mathcal{O}(K/n) + \mathcal{O}(1/N) = \mathcal{O}(1 - 2^{-H}) + \mathcal{O}(1/n)$$

where $S$ and $E$ stand for the storage and dot product cost of either format, the remaining quantities are as in theorems 1 and 2, and $H$ denotes the entropy of the matrix element distribution.

Thus, the efficiency of the CER and CSER data structures increases as the column size $n$ increases, or as the entropy $H$ decreases. Interestingly, as $n \to \infty$ both representations converge to the same values and thus become equivalent. In addition, there will always exist a column size $n$ at which both formats are more efficient than the original dense and sparse representations (see Fig. 5, where this trend is demonstrated experimentally).
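The $\mathcal{O}(1 - 2^{-H})$ term in the corollary rests on the fact that the most frequent element of a distribution with entropy $H$ has probability at least $2^{-H}$ (the min-entropy never exceeds the Shannon entropy). The short Python sketch below checks this numerically for a few hand-picked probability mass functions; the example distributions and the helper name `entropy_bits` are illustrative assumptions, not taken from the experiments.

```python
import math

def entropy_bits(pmf):
    """Shannon entropy in bits of a probability mass function given as a list."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# Hand-picked example distributions; the first entry plays the role of p_0.
examples = {
    "spike-and-slab": [0.7, 0.1, 0.1, 0.1],
    "skewed":         [0.6, 0.3, 0.05, 0.05],
    "uniform":        [0.25, 0.25, 0.25, 0.25],
}

for name, pmf in examples.items():
    H = entropy_bits(pmf)
    p_max = max(pmf)
    # p_max >= 2^(-H), hence 1 - p_max <= 1 - 2^(-H).
    print(f"{name:15s} H={H:.3f}  p_max={p_max:.3f}  2^-H={2 ** -H:.3f}")
```

For the uniform case the bound is tight, while for the two distributions with the same or similar sparsity the lower-entropy one leaves more slack, which is the regime the CER and CSER formats target.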

### IV-D Connection between CSR, CER and CSER

The CSR format is considered to be one of the most general sparse matrix representations, since it makes no further assumptions regarding the empirical distribution of the matrix elements. Consequently, it implicitly assumes a spike-and-slab distribution on them (that is, a spike at zero with probability $p_0$ and a uniform distribution over the non-zero elements). However, spike-and-slab distributions are a particular class of low-entropy distributions (for sufficiently high sparsity levels $p_0$). In fact, spike-and-slab distributions have the highest entropy among all distributions that have the same sparsity level. In contrast, as a consequence of corollary 2.1, the CER and CSER data structures relax this prior and can therefore efficiently represent the entire set of low-entropy distributions. Hence, the CSR data structure can be interpreted as a more specialized version of the CER and CSER representations.

This may be more evident via the following example: consider the 1st row of the matrix example from section III

$$\begin{pmatrix} 0 & 3 & 0 & 2 & 4 & 0 & 0 & 2 & 3 & 4 & 0 & 4 \end{pmatrix}$$

The CSER data structure would represent the above row in the following manner

$$\begin{aligned} \Omega &: [0, 4, 3, 2] \\ colI &: [4, 9, 11, 1, 8, 3, 7] \\ \Omega I &: [1, 2, 3] \\ \Omega Ptr &: [0, 3, 5, 7] \\ rowPtr &: [0, 3] \end{aligned}$$

In comparison, the CER representation assumes that the ordering of the elements in $\Omega I$ is similar for all rows and therefore, it directly omits this array and implicitly encodes this information in the $\Omega$ array. Therefore, the CER representation can be interpreted as a more explicit/specialized version of the CSER. The representation would then be

$$\begin{aligned} \Omega &: [0, 4, 3, 2] \\ colI &: [4, 9, 11, 1, 8, 3, 7] \\ \Omega Ptr &: [0, 3, 5, 7] \\ rowPtr &: [0, 3] \end{aligned}$$

Similarly, the CSR representation omits the $\Omega Ptr$ array since it assumes a uniform distribution over the non-zero elements (thus, over the $\Omega$ array), and in such case all the entries in $\Omega Ptr$ would redundantly be equal to 1. Therefore, the respective representation would be

$$\begin{aligned} \Omega &: [3, 2, 4, 2, 3, 4, 4] \\ colI &: [1, 3, 4, 7, 8, 9, 11] \\ rowPtr &: [0, 7] \end{aligned}$$

Consequently, the CER and CSER representations will have superior performance for all those distributions that are not similar to the spike-and-slab distributions. Figure 3 displays a sketch of the regions on the entropy-sparsity plane where we expect the different data structures to be more efficient. The sketch shows that the efficiency of sparse data structures is high on the subset of distributions that are close to the right border line of the $(H, p_0)$-plane, that is, close to the family of spike-and-slab distributions. In contrast, dense representations are increasingly efficient for high-entropy distributions, hence, in the upper-left region. The CER and CSER data structures would then cover the rest of them. Figure 4 confirms this trend experimentally.
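Returning to the single-row example of this section, the three encodings can be reproduced with a few lines of Python. This is only a sketch under simplifying assumptions: it handles one row, takes the shared value array $\Omega$ (with the most frequent value first) as given, and the CER variant pads every absent value with a repeated pointer. The function names are hypothetical and this is not the conversion routine used in the experiments.

```python
def row_to_cser(row, omega):
    """Encode one row as (colI, OmegaI, OmegaPtr); omega[0] is stored implicitly."""
    col_i, omega_i, omega_ptr = [], [], [0]
    for idx in range(1, len(omega)):
        positions = [j for j, value in enumerate(row) if value == omega[idx]]
        if positions:                      # only values present in the row are listed
            col_i.extend(positions)
            omega_i.append(idx)            # which entry of omega this segment uses
            omega_ptr.append(len(col_i))   # end of the segment in colI
    return col_i, omega_i, omega_ptr

def row_to_cer(row, omega):
    """Encode one row as (colI, OmegaPtr); segment k implicitly uses omega[k + 1]."""
    col_i, omega_ptr = [], [0]
    for value in omega[1:]:
        col_i.extend(j for j, x in enumerate(row) if x == value)
        omega_ptr.append(len(col_i))       # a repeated pointer acts as padding
    return col_i, omega_ptr

def row_to_csr(row):
    """Encode one row as (values, colI) of its non-zero entries."""
    col_i = [j for j, x in enumerate(row) if x != 0]
    return [row[j] for j in col_i], col_i

row = [0, 3, 0, 2, 4, 0, 0, 2, 3, 4, 0, 4]
omega = [0, 4, 3, 2]   # shared values, most frequent first

print("CSER:", row_to_cser(row, omega))
print("CER: ", row_to_cer(row, omega))
print("CSR: ", row_to_csr(row))
```

Note how the CSR encoding stores seven weight values for this row, whereas CER and CSER refer to the three distinct non-zero values through the shared array $\Omega$.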

## V Experiments

We applied the dense, CSR, CER and CSER representations on simulated matrices as well as on quantized neural network weight matrices, and benchmarked their efficiency with regard to the following four criteria:

1. Storage requirements: We calculated the storage requirements according to equations (1), (3), (9) and (11).

2. Number of operations: We implemented the dot product algorithms associated to the four above data structures (pseudocodes of the CER and CSER formats can be seen in the appendix) and counted the number of elementary operations they require to perform a matrix-vector multiplication.

3. Time complexity: We timed each respective elementary operation and calculated the total time from the sum of those values.

4. Energy complexity: We estimated the respective energy cost by weighting each operation according to Table I. The total energy results consequently from the sum of those values. As for the case of the IO operations (read/write operations), their energy cost depends on the size of the memory the values reside on. Therefore, we calculated the total size of the array in which a particular number is contained and chose the respective maximum energy value. For instance, if a particular column index is stored using a 16 bit representation and the total size of the column index array is 30KB, then the respective read/write energy cost would be 5.0 pJ.

In addition, we used single precision floating point representations for the matrix elements and unsigned integer representations for the index and pointer arrays. For the latter, we compressed the index-element-values to their minimum required bit-sizes, where we restricted them to be either 8, 16 or 32 bits.
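One plausible way to realize this restriction is sketched below in Python: given the largest value an index or pointer array has to hold, pick the smallest of the three allowed unsigned widths. The function name `min_index_bits` is hypothetical and the snippet is an illustration of the rule stated above, not the code used in the experiments.

```python
def min_index_bits(max_value):
    """Smallest allowed unsigned width (8, 16 or 32 bits) that can hold max_value."""
    if max_value < 0:
        raise ValueError("index values must be non-negative")
    for bits in (8, 16, 32):
        if max_value < 2 ** bits:
            return bits
    raise ValueError("value does not fit into 32 bits")

# Column indices of a matrix with n columns range from 0 to n - 1.
for n_columns in (100, 256, 257, 70000):
    print(n_columns, "columns ->", min_index_bits(n_columns - 1), "bit indices")
```

Because the width jumps in discrete steps, storage and energy per element change abruptly whenever a matrix dimension crosses one of these thresholds.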

Notice that we do not consider the complexity of converting the dense representation into the different formats in our experiments. This is justified in the context of neural network compression since we can apply this step a priori to the inference procedure. That is, in most real world scenarios one first converts the weight matrices, possibly with the help of a capable computer, and then deploys the converted neural network onto a resource-constrained device. We are mostly interested in the resource consumption that will take place on the device. Nevertheless, as an additional side note we would like to mention that the algorithmic complexity of conversion into the CSR, CER and CSER representations is of $\mathcal{O}(N)$, that is, of the order of the number of elements in the matrix.

### V-A Experiments on simulated matrices

As a first set of experiments, we aimed to confirm the theoretical trends described in Section IV.

#### V-A1 Efficiency on different regions of the entropy-sparsity plane

Firstly, we argued that each distribution has a particular entropy-sparsity value, and that the superiority of the different data structures is manifested in different regions on that plane. Concretely, we expected the dense representation to be increasingly more efficient in the upper-left corner, the CSR on the bottom-right (and along the right border) and the CER and CSER on the rest.

Figure 4 shows the result of performing one such experiment. In particular, we randomly selected a point-distribution on the $(H, p_0)$-plane and sampled 10 different matrices from that distribution. Subsequently, we converted each matrix into the respective dense, CSR, CER and CSER representation, and benchmarked the performance with regard to the 4 different measures described above. We then averaged the results over these 10 different matrices. Finally, we compared the performances with each other and color-coded each point according to the best-performing format. That is, blue corresponds to points where the dense representation was the most efficient, green to the CSR and red to either the CER or CSER. As one can see, the result closely matches the expected behavior.

#### V-A2 Efficiency as a function of the column size

As a second experiment, we study the asymptotic behavior of the data structures as we increase the column size of the matrices. From corollary 2.1 we expect that the CER and CSER data structures increase their efficiency as the number of columns in the matrix grows (thus, as $n \to \infty$), until they converge to the same point, outperforming the dense and sparse data structures. Figure 5 confirms this trend experimentally with regard to all four benchmarks. Here we chose a particular point-distribution on the $(H, p_0)$-plane and fixed the number of rows. Concretely, we fixed the entropy $H$, the sparsity $p_0$ and the row dimension $m$, and measured the average complexity of the data structures as we increased the number of columns $n$.

As a side note, the sharp changes in the plots are due to the sharp discontinuities in the values of Table I. For instance, the sharp drops in storage ratios come from changes in the index bit-sizes, which are restricted to 8, 16 or 32 bits.

### V-B Compressed Neural Networks without Retraining

As a second set of experiments, we tested the efficiency of the proposed data structures on compressed deep neural networks. In particular, we benchmarked their weight matrices relative to the matrix-vector operation, after they had been compressed using two different types of quantization techniques: one where retraining of the network is required (section V-C) and one where it is not (section V-B). We treat them separately, since the statistics of the resulting compressed weight matrices are conditioned by the quantization applied to them.

We start by analyzing the latter case. This scenario is of particular interest since it applies to cases where one does not have access to the training data (e.g., in a federated learning scenario) or where retraining the model is prohibitive (e.g., because of limited access to computational resources). Moreover, common matrix representations, such as the dense or CSR, may fail to efficiently exploit the statistics present in these compressed weight matrices (see figure 1 and discussion in section II).

In our experiments we first quantized the elements of the weight matrices of the networks in a lossy manner, while ensuring that we only negligibly impacted their prediction accuracy. Similarly to [35, 36], we applied a uniform quantizer over the range of weight values at each layer and subsequently rounded the values to their nearest quantization point. That is, for each weight matrix $W$, we calculated the range of values $[w_{min}, w_{max}]$ (with $w_{min}$ being the lowest weight element value and $w_{max}$ analogously the highest) and inserted $K$ equidistant points inside that range, whose values were stored in the array $\Omega$. Then, we quantized each weight element in $W$ to its closest neighbor relative to $\Omega$ and measured the validation accuracy of the quantized network. In our experiments, we did not see any significant impact on the accuracy (table II). We chose the uniform quantizer because of its simplicity and high performance relative to other, more sophisticated quantizers such as entropy-constrained k-means algorithms [35, 36]. Finally, we losslessly converted the quantized weight matrices into the different data structures and tested their efficiency with regard to the four above mentioned benchmark criteria.
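The quantization step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it works on a flat list of weights, assumes that the equidistant grid includes both end points of the range, and the function name `uniform_quantize` and the parameter `num_points` are hypothetical.

```python
def uniform_quantize(weights, num_points):
    """Round each weight to the nearest of `num_points` equidistant grid values.

    Returns (omega, quantized), where omega is the list of quantization points.
    Assumption: the grid spans [min(weights), max(weights)] including both ends.
    """
    if num_points < 2:
        raise ValueError("need at least two quantization points")
    w_min, w_max = min(weights), max(weights)
    if w_min == w_max:                      # degenerate layer: a single value
        return [w_min], list(weights)
    step = (w_max - w_min) / (num_points - 1)
    omega = [w_min + i * step for i in range(num_points)]
    quantized = []
    for w in weights:
        i = round((w - w_min) / step)       # index of the nearest grid point
        i = min(max(i, 0), num_points - 1)  # guard against floating point drift
        quantized.append(omega[i])
    return omega, quantized

weights = [-1.0, -0.4, 0.0, 0.26, 1.0]
omega, quantized = uniform_quantize(weights, num_points=5)
print("grid:     ", omega)
print("quantized:", quantized)
```

After this step every weight takes one of at most `num_points` values, which is what allows the subsequent conversion into the CER/CSER formats to be lossless.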

#### V-B1 Storage requirements

Table II shows the gains in storage requirements of different state-of-the-art neural networks. Gains can be attained when storing the networks in CER or CSER formats. In particular, we achieve more than x2.5 savings on the DenseNet architecture, whereas in contrast the CSR data structure attains negligible gains. This is mainly attributed to the fact that the dense and sparse representations store the weight element values of these networks very inefficiently. This is also reflected in Fig. 6, where one can see that most of the storage requirements for the dense and CSR representations are spent on storing the elements of the weight matrices $\Omega$. In contrast, most of the storage cost for the CER and CSER data structures comes from storing the column indices $colI$, which is much lower than the cost of storing the actual weight values.

#### V-B2 Number of operations

Table III shows the savings attained with regard to the number of elementary operations needed to perform a matrix-vector multiplication. As one can see, we can save up to 40% of the number of operations if we use the CER/CSER data structures on the DenseNet architecture. This is mainly due to the fact that the dot product algorithm of the CER/CSER formats implicitly encodes the distributive law of multiplication and consequently requires far fewer multiplications. This is also reflected in Fig. 7, where one can see that the CER/CSER dot product algorithms are mainly performing input load, column index load and addition (add) operations. Here, others refers to any other operation involved in the dot product, such as multiplications, weight loading, writing, etc. In contrast, the dense and CSR dot product algorithms require an additional equal number of weight element load and multiplication (mul) operations.
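The use of the distributive law can be illustrated on the single-row example of section IV-D. The Python sketch below first sums all inputs that share the same weight value and only then multiplies once per shared value. It is a simplified single-row illustration with a made-up input vector `x`, not the full algorithm given in the appendix, and the function name is hypothetical.

```python
def cser_row_dot(omega, col_i, omega_i, omega_ptr, x):
    """Dot product of one CSER-encoded row with the input vector x."""
    total = 0.0
    for k, w_idx in enumerate(omega_i):
        partial = 0.0
        for j in col_i[omega_ptr[k]:omega_ptr[k + 1]]:
            partial += x[j]                   # additions only inside a segment
        total += omega[w_idx] * partial       # one multiplication per shared value
    return total

# CSER encoding of the row (0 3 0 2 4 0 0 2 3 4 0 4) from section IV-D.
omega = [0, 4, 3, 2]
col_i = [4, 9, 11, 1, 8, 3, 7]
omega_i = [1, 2, 3]
omega_ptr = [0, 3, 5, 7]

row = [0, 3, 0, 2, 4, 0, 0, 2, 3, 4, 0, 4]
x = [float(j + 1) for j in range(12)]         # made-up input vector 1, 2, ..., 12

dense_result = sum(w * v for w, v in zip(row, x))
cser_result = cser_row_dot(omega, col_i, omega_i, omega_ptr, x)
print(dense_result, cser_result)
```

For this row the CSER variant performs three multiplications (one per distinct non-zero value), whereas a CSR dot product would perform seven (one per non-zero entry).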

#### V-B3 Time cost

In addition, Table III also shows that we attain speedups when performing the dot product in the new representation. Interestingly, Fig. 8 shows that most of the time is being consumed by IO operations (that is, by load operations). Consequently, the CER and CSER data structures attain speedups since they do not have to load as many weight elements. In addition, 20% and 16% of the time is spent on performing multiplications in the dense and sparse representations, respectively. In contrast, this time cost is negligible for the CER and CSER representations.

#### V-B4 Energy cost

Similarly, we see that most of the energy consumption is due to IO operations (Fig. 9). Here the cost of loading an element may be up to 3 orders of magnitude higher than that of any other operation (see Table I) and therefore, we obtain up to x6 energy savings when using the CER/CSER representations (see Table III).

Finally, Table IV and Fig. 10 further justify the observed gains. Namely, Table IV shows that the effective number of shared elements per row of the network is small relative to the network's effective column dimension. To clarify, we calculated the effective number of shared elements by: 1) calculating the number of shared weights for every row, 2) aggregating these numbers and 3) dividing the result by the total number of rows that appear in the network. Similarly, the effective number of columns indicates the average number of columns in the network, and the effective sparsity level as well as the effective entropy values indicate the result averaged over the total number of weights. Fig. 10 shows the distributions of the different layers of the networks on the entropy-sparsity plane, where we see that most of them lie in the regions where we expect the CER/CSER formats to be more efficient.
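The three-step calculation of the effective number of shared elements can be sketched as follows in Python. The sketch assumes that a network is given as a list of layer matrices (nested lists) and that, as in theorem 1, the most frequent value of each matrix is excluded from the count; the function name and the toy matrices are hypothetical.

```python
from collections import Counter

def effective_shared_elements(layers):
    """Average number of distinct values per row, excluding each layer's
    most frequent value, taken over all rows of all layers."""
    total_shared, total_rows = 0, 0
    for matrix in layers:
        counts = Counter(value for row in matrix for value in row)
        most_frequent = counts.most_common(1)[0][0]
        for row in matrix:
            total_shared += len(set(row) - {most_frequent})  # step 1, aggregated in step 2
            total_rows += 1
    return total_shared / total_rows                         # step 3

# Two toy "layers" (made-up values).
layers = [
    [[0, 1, 1], [0, 0, 2]],
    [[0, 0, 3, 4]],
]
print(effective_shared_elements(layers))
```

A small value of this quantity relative to the column dimension is exactly the regime in which the $\bar{k}/n$ terms of theorems 1 and 2 become negligible.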

On a last side note we would like to comment on the alternative, compressed representations of the dense format. For instance, after quantization, we could trivially compress the weight element values down to a 7-bit representation, or apply more sophisticated entropy coders [35, 36]. Although these representations of the dense format are able to attain relatively high compression ratios, they are inefficient with regard to the dot product algorithm, since additional decoding steps are required in order to convert the weight values back into their original floating point representations. Recall that in this case the activation values would still be represented by single precision floating point values, and quantizing them down to 7 bits would significantly harm the prediction accuracy of the network. As an example, the matrix-vector product operation of the VGG-16 architecture slowed down by about 47% compared to the original dense representation, after we converted each weight element into its 7-bit representation.

### V-C Compressed Neural Networks with Retraining

In this section we benchmark the CER/CSER matrix representations on networks whose weight matrices have been compressed using quantization techniques where retraining was required in the process. This case is also of particular interest since the highest compression gains can only be achieved if one applies such quantization techniques to the network [26, 27, 37, 38, 39].

For instance, Deep Compression [26] is a technique for compressing neural networks that is able to attain high compression rates without incurring significant loss of accuracy. It does so by applying a three-stage pipeline: 1) prune unimportant connections by employing the algorithm of [31], 2) cluster the non-pruned weight values and refine the cluster centers to the loss surface and 3) employ an entropy coder for storing the final weights. Notice that the first two stages aim to implicitly minimize the entropy of the weight matrices without incurring significant loss of accuracy, whereas the third stage losslessly converts the weight matrices into a low-bit representation. However, the proposed representation is based on the CSR format and, consequently, the complexity of the respective dot product algorithm remains of the same order. Concretely, the total number of operations that need to be performed is greater than or equal to that of the original CSR format. In fact, one requires specialized hardware in order to efficiently exploit this final neural network representation during inference [46]. Therefore, many authors benchmark the inference efficiency of highly compressed deep neural networks with regard to the standard CSR representation when tested on standard hardware such as CPUs and/or GPUs [26, 38, 41]. However, this comes at the cost of adding redundancies, since one then does not exploit step 2 of the compression pipeline.

In contrast, the CER/CSER representations become increasingly efficient as the entropy of the network is reduced, even if the sparsity level is maintained (see figures 3 and 4). Hence, it is of high interest to benchmark their efficiency on highly compressed networks and compare them to their sparse (and dense) counterparts.

As a first experimental setup we chose the AlexNet architecture trained and quantized by the authors of [45], who were able to reduce the overall entropy of the network down to 0.89 without incurring any loss of accuracy. Figure 11 shows the gains in efficiency when the network layers are converted into the different data structures. We see that the proposed data structures are able to surpass the dense and sparse data structures for all four benchmark criteria. Therefore, the CER/CSER data structures are much less redundant and more efficient representations of highly compressed neural network models. Interestingly, the CER/CSER data structures attain up to x14 storage and x20 energy savings, which is considerably higher than the sparse counterpart. Nevertheless, we do not attain significant time gains. This is due to the fact that, in our implementations, the time cost of loading the input elements was significantly higher than any other component in the algorithm (see figure 14 in the appendix). This also explains why the CSR format shows speedups similar to those of the CER and CSER. However, this effect can be mitigated if one applies further optimizations on the input vector, such as data reuse techniques and/or better storage management of its values during the dot product procedure. We would then also expect significant gains in time performance relative to the CSR format. We will consider this in future work.

Lastly, we trained and compressed additional architectures while following a similar compression pipeline as described in [26]. Concretely, we: 1) pretrained the architectures until we reached state-of-the-art accuracies, 2) sparsified the architectures using the technique proposed in [27], 3) applied a uniform quantizer to the non-zero values in order to reduce their effective bit-size and, finally, 4) converted the weight matrices into the different representations and benchmarked their efficiency relative to their matrix-vector product operation. In step 2) we chose [27] since it is the current state-of-the-art sparsification technique. In our experiments we chose to benchmark the same architectures as reported in [27, 38]. That is, an adapted version of the VGG network for the CIFAR-10 image classification task and the fully connected and convolutional LeNet architectures for the MNIST classification task. The respective accuracies and compression gains can be seen in table V and the gains relative to the dot product complexity in table VI. As we can see, we attain significantly higher gains in all four benchmarks when we convert their weight matrices into the CER/CSER representations. In particular, we are able to attain up to x42 compression gains, x5 speedups and x90 energy gains on the VGG model.

As a last side note we want to mention again that further compressing the CSR representation by, for instance, replacing the non-zero values by their respective quantization indices (as proposed by [26]), does not necessarily result in higher gains with regard to the dot product, since it requires an additional decoding step per non-zero element in the process. For instance, we got only x2.89 speedups on our compressed CIFAR10-VGG model, which is less than the speedups attained by the original CSR format (x3.63). Moreover, the CER/CSER representations still attained higher gains in all other complexity measures. Concretely, this further-compressed CSR variant attained x33.62, x3.10 and x62.32 gains in storage, number of operations and energy respectively, which is still lower than the gains attained by the CER/CSER representations (tables V and VI).

## VI Conclusion

We presented two new matrix representations, Compressed Entropy Row (CER) and Compressed Shared Elements Row (CSER), that are able to attain high compression ratios and energy savings if the distribution of the matrix elements has low entropy. We showed in an extensive set of experiments that the CER/CSER data structures are more compact and computationally efficient representations of compressed state-of-the-art neural networks than dense and sparse formats. In particular, we attained up to x42 compression ratios and x90 energy savings by representing the weight matrices of a highly compressed VGG model in their CER/CSER forms and benchmarking them against the matrix-vector product operation.

By demonstrating the advantages of entropy-optimized data formats for representing neural networks, our work opens up new directions for future research, e.g., the exploration of entropy constrained regularization and quantization techniques for compressing deep neural networks. The combination of entropy constrained regularization and quantization and entropy-optimized data formats may push the limits of neural network compression even further and also be beneficial for applications such as federated or distributed learning [47, 48].

Future work will also study lossy compression schemes, especially in combination with their analysis by means of explanation methods [49, 50].

## Appendix A Details on neural network experiments

### A-A Matrix preprocessing and convolutional layers

Before benchmarking the quantized weight matrices we applied the following preprocessing steps:

#### A-A1 Matrix decomposition

After the quantization step it may well be that the 0 value is not included in the set of values and/or that it is not the most frequent value in the matrix. Therefore, we applied the following simple preprocessing step: assume a particular quantized matrix $W$, where each element belongs to a discrete set. Then, we decompose the matrix via the identity $W = (W - w^{*} \mathbb{1}) + w^{*} \mathbb{1} = \hat{W} + w^{*} \mathbb{1}$, where $\mathbb{1}$ is the unit matrix whose elements are all equal to 1 and $w^{*}$ is the element that appears most frequently in the matrix. Consequently, $\hat{W}$ is a matrix with 0 as its most frequent element. Moreover, when performing the dot product with an input vector $x$, we only incur the additional cost of adding the constant value $w^{*} \sum_i x_i$ to all the elements of the output vector. The cost of this additional operation is effectively of the order of $n$ additions and 1 multiplication for the entire dot product operation, which is negligible as long as the number of rows is sufficiently large.
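The decomposition can be checked on a toy example with the following Python sketch. It uses plain nested lists and hypothetical helper names (`shift_by_most_frequent`, `shifted_matvec`), and the symbols `w_star` and `W_hat` are illustrative stand-ins for the most frequent value and the shifted matrix; it is meant only to show that the shifted product plus a single shared offset reproduces the original matrix-vector product.

```python
from collections import Counter

def shift_by_most_frequent(W):
    """Return (W_hat, w_star) such that W = W_hat + w_star * ones and 0 is the
    most frequent entry of W_hat."""
    w_star = Counter(v for row in W for v in row).most_common(1)[0][0]
    W_hat = [[v - w_star for v in row] for row in W]
    return W_hat, w_star

def matvec(M, x):
    """Plain dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def shifted_matvec(W_hat, w_star, x):
    """Compute W x from the shifted matrix plus one offset shared by all rows."""
    offset = w_star * sum(x)          # about n additions and 1 multiplication in total
    return [y + offset for y in matvec(W_hat, x)]

W = [[2, 2, 5],
     [2, 7, 2]]                       # toy quantized matrix; most frequent value is 2
x = [1, 2, 3]

W_hat, w_star = shift_by_most_frequent(W)
print(W_hat, w_star)
print(matvec(W, x), shifted_matvec(W_hat, w_star, x))
```

The offset is computed once per input vector, so its cost is amortized over all rows of the matrix.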

#### A-A2 Convolution layers

A convolution operation can essentially be performed by a matrix-matrix dot product operation. The weight tensor containing the filter values would be represented as a $(F \times C h w)$-dimensional matrix, where $F$ is the number of filters of the layer, $C$ the number of channels, and $(h, w)$ the height/width of the filters. Hence, the convolution matrix would perform a dot product operation with a $(C h w \times P)$-dimensional matrix that contains all the $P$ patches of the input image as column vectors.

Hence, in our experiments, we reshaped the weight tensors of the convolutional layers into their respective matrix forms and tested their storage requirements and dot product complexity by performing a simple matrix-vector dot product, but weighted the results by the respective number of patches that would have been used at each layer.
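The reshaping of a convolution into a matrix-matrix product can be sketched in plain Python as follows. The sketch assumes stride 1, no padding and no filter flipping (that is, the cross-correlation convention common in deep learning); the helper names `im2col`, `flatten_filters` and `matmul` as well as the toy image and filter are illustrative assumptions.

```python
def im2col(image, h, w):
    """image: C channels, each an H x W nested list.
    Returns a (C*h*w) x P matrix whose columns are the flattened patches."""
    C, H, W = len(image), len(image[0]), len(image[0][0])
    patches = []
    for top in range(H - h + 1):
        for left in range(W - w + 1):
            patches.append([image[c][top + i][left + j]
                            for c in range(C) for i in range(h) for j in range(w)])
    return [list(column) for column in zip(*patches)]   # transpose: patches as columns

def flatten_filters(filters):
    """filters: F x C x h x w nested lists -> F x (C*h*w) weight matrix."""
    return [[v for channel in f for row in channel for v in row] for f in filters]

def matmul(A, B):
    """Plain dense matrix-matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

image = [[[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]]                 # one channel, 3 x 3 (toy input)
filters = [[[[1, 0],
             [0, 1]]]]                # one filter, one channel, 2 x 2

weight_matrix = flatten_filters(filters)   # 1 x 4
patch_matrix = im2col(image, 2, 2)         # 4 x 4, one column per patch
output = matmul(weight_matrix, patch_matrix)
print(output)
```

Each row of `weight_matrix` is then a candidate for the row-wise formats discussed in the paper, and the number of columns of `patch_matrix` is the number of patches by which the per-vector cost is weighted.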

### A-B More results from experiments

Figures 12, 13 and 14 show our results for compressed ResNet152, VGG16 and AlexNet, respectively.

### A-C Dot product pseudocodes

Algorithms 2, 3 and 4 show the pseudocodes of the dot product algorithm of the CSR, CER and CSER data structures.

For the dense algorithm, we implemented the standard three-loop nest (algorithm 1).

We used the programming language Python in all our experiments.

## Appendix B Proof of theorems

#### B-1 Theorem 1

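The CER data structure represents any matrix via 4 arrays, which respectively contain $K$, $N - \#(0)$, $\sum_{r=0}^{m} (\bar{k}_r + \tilde{k}_r)$ and $m$ entries, where $K$ denotes the number of unique elements appearing in the matrix, $N$ the total number of elements, $\#(0)$ the total number of zero elements, $m$ the row dimension and finally, $\bar{k}_r$ the number of shared elements that appeared at row $r$ (excluding the 0) and $\tilde{k}_r$ the number of redundant padding entries needed to communicate at row $r$.
Hence, by multiplying each array with the respective element bit-size and dividing by the total number of elements we get

$$\frac{K\, b_\Omega}{N} + \left(1 - \frac{\#(0)}{N}\right) b_I + \frac{1}{N} \left( \sum_{r=0}^{m} \bar{k}_r + \tilde{k}_r \right) b_I + \frac{1}{n}\, b_I$$

where $b_\Omega$ and $b_I$ are the bit-sizes of the matrix elements and the indices respectively. With $p_0 = \#(0)/N$, $\bar{k} = \frac{1}{m} \sum_{r} \bar{k}_r$ and $\tilde{k} = \frac{1}{m} \sum_{r} \tilde{k}_r$ we get equation (9).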

The cost of the respective dot product algorithm can be estimated by calculating the cost of each line of algorithm 3. To recall, we denoted with $\sigma(b)$ the cost of performing a summation operation involving $b$ bits, with $\mu(b)$ the cost of a multiplication, with $\gamma(b)$ the cost of a read operation, and analogously the cost of a write operation into memory. We further denoted separately the cost of performing other types of operations. Moreover, assume a single input vector, since the result can be trivially extended to input matrices of arbitrary size. Thus, algorithm 3 incurs a cost at each of its lines 2) to 22), expressed in terms of these operations, where $b_\Omega$, $b_I$ and $b_o$ are the bit-sizes of the matrix elements, the indices and the output vector elements respectively. Hence, adding up all the above costs and substituting $c_a$ and $c_\Omega$ as in equations (5) and (6), we get the total cost of the algorithm. It is fair to assume that the cost of the other types of operations is negligible compared to the rest for highly optimized algorithms. Indeed, figures 7, 8 and 9 show that the cost of these operations contributes very little to the total cost of the algorithm. Hence, we can assume the ideal cost of the algorithm to be equal to the above expression with this cost set to zero (which corresponds to equation (10)).

#### B-2 Theorem 2

Analogously, we can follow the same line of arguments. Namely, the arrays in the CSER data structure contain $K$, $N - \#(0)$, $\sum_{r} \bar{k}_r$, $\sum_{r} \bar{k}_r$ and $m$ entries, respectively. Consequently, by adding those terms, multiplying by their bit-size and dividing by the total number of elements we recover (11).

Each line of algorithm 4 induces a cost of: from line 2) - 8) we assume a cost of , 9) , 10) , 11) , 12) , 13) , 14) , 15) , 16) , 17) , 18) , 19) , 20) , 21) , 22) .
Again, adding up all terms and replacing with and then we get the total cost of