I Introduction
It has been well established that deep neural networks excel at solving many complex machine learning tasks [1, 2]. Their relatively recent success can be attributed to three phenomena: 1) access to large amounts of data, 2) novel optimization algorithms and model architectures that allow very deep neural networks to be trained, 3) the increasing availability of compute resources [1]. In particular, the latter two allowed machine learning practitioners to equip neural networks with an ever-growing number of layers and, consequently, to consistently attain state-of-the-art results on a wide spectrum of complex machine learning tasks.
However, this has triggered an exponential growth in the number of parameters of these models over the past years [3]. Trivially, this implies that the models are becoming more and more complex in terms of memory. This can become very problematic, since it does not only imply higher memory requirements, but also slower runtimes and high energy consumption [4]. In fact, I/O operations can be up to three orders of magnitude more expensive than arithmetic operations. Moreover, the authors of [3] show that the memory-energy efficiency trends of most common hardware platforms are not able to keep up with the exponential growth of the neural networks' sizes, and thus these networks are expected to become more and more power-hungry over time.
In addition, there has also been an increasing demand for deploying deep models on resource-constrained devices such as mobile or wearable devices [5, 6, 7], as well as for training deep neural networks in a distributed setting such as federated learning [8, 9, 10], since these approaches have direct advantages with regard to privacy, latency and efficiency. High memory complexity greatly complicates the applicability of neural networks to those use cases, in particular federated learning, since the parameters of the networks are transmitted through communication channels with limited capacity.
Model compression is one possible paradigm for solving this problem. Namely, by attempting to maximally compress the information contained in the network's parameters, we automatically leave only the bits that are necessary for solving the task. Thus, in principle, the memory complexity of deep models should only increase with the complexity of the learning task and not with the number of parameters (more precisely, the memory complexity of the model should only increase sublinearly with its number of parameters [11]). In addition, model compression has direct practical advantages such as reduced communication and compute cost [12, 13, 14]. In fact, the Moving Picture Experts Group (MPEG) of the International Organization for Standardization (ISO) has recently issued a call on neural network compression [15], which stresses the relevance of the problem and the broad interest of industry in finding practical solutions.
I-A Entropy coding in video compression
The topic of signal compression has been studied for a long time, and highly practical and efficient algorithms have been designed. State-of-the-art video compression schemes like H.265/HEVC [16] employ efficient entropy coding techniques that can also be used for compressing deep neural networks. Namely, the Context-based Adaptive Binary Arithmetic Coding (CABAC) engine [17] provides a very flexible interface for entropy coding that can be adapted to a wide range of applications. It is optimized to allow high throughput and a high compression ratio at the same time. In particular, the transform coefficient coding part of H.265/HEVC contains many interesting aspects that might be suitable for compressing deep neural networks. Hence, it appears only natural to try to adapt current state-of-the-art compression techniques such as CABAC to deep neural networks and accordingly compress them.
I-B Contributions
Our contributions can be summarized as follows:

We adapt CABAC to the task of neural network compression. To the best of our knowledge, we are the first to apply state-of-the-art coding techniques from video compression to deep neural networks.

We quantize the parameters of the networks by minimizing a generalized form of a rate-distortion function which takes the impact of quantization on the accuracy of the network into account.

In our experiments we show that DeepCABAC is able to attain very high compression ratios and that it consistently attains a higher compression performance than previously proposed coders.
I-C Outline
In section II we start by reviewing some basic concepts from information theory, in particular from source coding theory. We also highlight the main difference between the classical source coding and model compression paradigms in subsection II-D. Subsequently, we proceed by explaining DeepCABAC in section III. In section IV we provide a comprehensive review of the related work on neural network compression. Finally, we provide experimental results and a respective discussion in section V.
II Source coding
Source coding is a subfield of information theory that studies the properties of so-called codes. These are mappings that assign a binary representation and a reconstruction value to a given input element. Figure 1 depicts their most common structure. They are comprised of two parts, an encoder and a decoder. The encoder is a mapping that assigns a binary string $b$ of finite length to an input element $w$. In contrast, the decoder assigns a reconstruction value $\bar{w}$ to the corresponding binary representation. We will also sometimes refer to $\bar{w}$ as a quantization point. Furthermore, it is assumed that the output elements $b$ and $\bar{w}$ of the code are elements of finite countable sets, and that there is a one-to-one correspondence between them. Therefore, without loss of generality, we can decompose the encoder into a quantizer $Q$ and a binarizer $B$, where the former maps the input to an integer value $q = Q(w)$, and the latter maps the integer to its corresponding binary representation $b = B(q)$. Analogously for the decoder. Naturally, it follows that the binarizer is always a bijective map, thus $q = B^{-1}(b)$.
We also distinguish between two types of codes, the so-called lossless codes and lossy codes. They respectively correspond to the cases where the quantizer is either bijective or not; the latter implies that information is lost in the coding process. Therefore, we stress that the decoder does not necessarily have to be the inverse of the encoder!
After establishing the basic definition of codes, we will now formalize the source coding problem. In simple terms, source coding studies the problem of
finding the code that maximally compresses a set of input samples, while maintaining the error between the input and reconstruction values under an error tolerance constraint.
Notice that the problem is probabilistic in its nature, since it implicitly assumes that the decoder has no access to the element values being encoded. Moreover, the input values themselves may come from an unknown source distribution. Hence, we denote with $p$ the encoder's probability model of the input and with $\tilde{p}$ the decoder's probability model. It is important to stress that both models do not have to coincide, thus possibly $p \neq \tilde{p}$. Furthermore, we will assume that the encoder's probability model follows the true underlying distribution of the input source, and therefore we will simply denote it by $p$.
Thus, the source coding problem can be formulated more precisely as follows: let $\mathcal{W}$ be a given input set and let $p(w)$ be the probability of an element $w \in \mathcal{W}$ being sampled. Then, find a code that attains

$\min_{Q,\,B,\,\bar{\Omega}}\; \mathbb{E}_{w \sim p}\big[\, d(w, \bar{w}) + \lambda\, \ell(b) \,\big]$   (1)

where $d(\cdot\,,\cdot)$ is some distance measure and $\ell(b)$ the length of the binary representation $b$. We will sometimes refer to $\ell(b)$ as the codelength of a sample, and to $d(w, \bar{w})$ as the distortion between $w$ and $\bar{w}$. $\mathbb{E}_{w \sim p}$ denotes the expectation taken with respect to the probability distribution $p$, and $\lambda$ is the Lagrange multiplier that controls the trade-off between the compression strength and the error incurred by it. Minimization objectives of the form (1) are called rate-distortion objectives in the source coding literature. However, solving the rate-distortion objective for a given input source is most often NP-hard, since it involves finding optimal quantizers $Q$, binarizers $B$ and reconstruction values $\bar{\Omega}$ from the space of all possible maps. However, concrete solutions can be found for special cases, in particular in the lossless case. In the following we will review some of the fundamental theorems of source coding theory and introduce state-of-the-art coding algorithms that produce binary representations with minimal redundancy.
II-A Lossless coding
Lossless coding implies that the quantizer is bijective. Thus, the distortion term in (1) vanishes and the rate-distortion objective simplifies into finding a binarizer $B$ that maximally compresses the input samples. Hence, throughout this section we will equate the general code with the binarizer and refer to it accordingly. Moreover, we will also assume that the decoder's probability model equals the encoder's, thus $\tilde{p} = p$. In the next subsection we discuss the case when the latter property does not apply.
Information theory already makes concrete statements regarding the minimum information contained in a probability source. Namely, Shannon in his influential work [18] stated that the minimum information required to fully represent a sample $w$ that has probability $p(w)$ is of $-\log_2 p(w)$ bits. Consequently, the entropy $H(p) = \mathbb{E}_{w \sim p}[-\log_2 p(w)]$ states the minimum average number of bits required to represent any element $w \sim p$. This implies that
$\bar{\ell} \geq H(p)$   (2)

where $\bar{\ell}$ is the average codelength that a code assigns to the elements $w$. Eq. (2) is also referred to as the fundamental theorem of lossless coding.
Fortunately, from the source coding literature [19] we know of the existence of codes that are able to reach average codelengths with only up to 1 bit of redundancy relative to the theoretical minimum. That is,

$H(p) \leq \bar{\ell} < H(p) + 1$   (3)
Moreover, we even know how to build them.
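As a quick numerical illustration (ours, not from the cited literature), the bound (3) can be checked for a dyadic source, where assigning each symbol $\lceil -\log_2 p \rceil$ bits is achievable by a prefix-free code and stays within one bit of the entropy:

```python
import math

def entropy(p):
    """Shannon entropy H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p.values() if pi > 0)

def shannon_code_lengths(p):
    """Assign each symbol ceil(-log2 p) bits; a prefix-free code with
    these lengths exists by Kraft's inequality."""
    return {s: math.ceil(-math.log2(pi)) for s, pi in p.items()}

# a dyadic toy distribution, for which the entropy bound is tight
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
H = entropy(p)
lengths = shannon_code_lengths(p)
avg_len = sum(p[s] * lengths[s] for s in p)
# H <= avg_len < H + 1, as stated in (3)
```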
Before we start discussing some of these codes in more detail, we want to recall an important property of joint probability distributions. Namely, due to their sequential decomposition property, we can express the minimal information entailed in the output sample of a joint probability distribution sequentially as

$-\log_2 p(x_1, \dots, x_N) = \sum_{n=1}^{N} -\log_2 p(x_n \mid x_1, \dots, x_{n-1})$

That is, we can always interpret a given input vector as an $N$-long random process and encode its outputs sequentially. As long as we know the respective conditional probability distributions, we can optimally encode the entire sequence. Respectively, we denote with $x_n$ the scalar value of the $n$-th dimension of the input (or equivalently the $n$-th output of the random process). Also, we denote with $\mathcal{X}$ the set of possible scalar inputs, $x_n \in \mathcal{X}$.

II-A1 (scalar) Huffman coding
One optimal code is the wellknown Huffman code [20]. It consists of building a binary tree such that each input sample is associated with one of the leaves of the tree. Thus, each can be associated with the sequence of binary decisions that traverse the tree from its root point. The main idea is then to build the tree in such a manner that shorter paths are associated to more probable samples . Huffman successfully proved that this code satisfies (3). We provide a pseudocode of the encoding and decoding process in the appendix (see algorithms 3, 1, and 2).
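The tree construction can be sketched in a few lines (a minimal illustration, not the appendix's pseudocode): repeatedly merge the two least probable nodes and prepend a bit to the codewords on each side.

```python
import heapq

def huffman_code(probs):
    """Build a Huffman code as a dict symbol -> bitstring.
    More probable symbols end up with shorter codewords."""
    # (probability, tiebreaker, symbols-in-subtree)
    heap = [(p, i, (s,)) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    codes = {s: "" for s in probs}
    counter = len(heap)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)  # two least probable subtrees
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1:                     # prepend the branch decision
            codes[s] = "0" + codes[s]
        for s in syms2:
            codes[s] = "1" + codes[s]
        heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
        counter += 1
    return codes

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
codes = huffman_code(probs)
avg_len = sum(probs[s] * len(codes[s]) for s in probs)
```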
However, Huffman codes can be very inefficient in practice since the Huffman tree grows very quickly for large input dimensions $N$. Therefore, most often scalar Huffman codes are used instead. Scalar Huffman codes only consider 1-dimensional inputs, and accordingly encode each sample from the $N$-long random process separately. However, these codes are suboptimal in that they produce redundant binary representations and therefore do not satisfy (3). Concretely, they produce average codelengths of

$H(p_x) \leq \bar{\ell}_x < H(p_x) + 1$ per encoded sample,

where now $p_x$ is the probability distribution of a scalar output $x$ and $\bar{\ell}_x$ is the average codelength produced by the scalar Huffman code; the redundancy of up to 1 bit thus accrues per sample rather than per sequence. Moreover, they are limited to stationary processes, since they do not take conditional dependencies into account, which could further reduce the average codelength.
II-A2 Arithmetic coding
A concept that approaches the entropy bound of eq. (3) in a practical and efficient manner is arithmetic coding. It consists of expressing a particular sequence of samples of an $N$-long random process as a so-called coding interval. An overview of the idea is given in the following.
Let $I_{n-1} = [L_{n-1}, L_{n-1} + W_{n-1})$ be the coding interval before encoding symbol $x_n$, and let $L_0 = 0$ and $W_0 = 1$. Encoding of a symbol $x_n$ corresponds to deriving a coding interval $I_n$ from the previous interval $I_{n-1}$ as follows. Subdivide $I_{n-1}$ into one subinterval for each element of $\mathcal{X}$ so that the interval width is given as

$W_n = W_{n-1} \cdot p(x_n \mid x_1, \dots, x_{n-1})$

for a given sequence of (already sampled) values $x_1, \dots, x_{n-1}$, and arrange the subintervals so that they are non-overlapping and adjacent. The subinterval associated with the sample $x_n$ to be encoded becomes the new coding interval $I_n$. Encoding of $N$ symbols yields the coding interval $I_N$, and the sequence of symbols can be reconstructed (in the decoder) when an arbitrary value inside of this coding interval is known. Figure 2 exemplifies this procedure for a binary random process. Interestingly, the width $W_N$ of the coding interval equals the probability $p(x_1, \dots, x_N)$ of the sequence. As the minimum achievable code length for encoding the $N$ symbols is known to be $-\log_2 W_N$, the location of interval $I_N$ needs to be signaled to the decoder in a way so that the number of written bits gets as close to $-\log_2 W_N$ as possible. The basic encoding principle is as follows. Derive an integer $z$ so that

$2^{-z+1} \leq W_N$   (4)

holds. Subdivide the unit interval into $2^z$ (adjacent and non-overlapping) subintervals of width $2^{-z}$. Equation (4) guarantees that one of these intervals is fully contained in the coding interval (regardless of the exact location of interval $I_N$), and if the decoder knows this interval, it can unambiguously identify $I_N$. Consequently, the index identifying this interval is written to the bitstream using $z$ bits. For the smallest such integer, equation (4) can be rewritten as

$z < -\log_2 W_N + 2$   (5)

which shows that the ideal arithmetic coder only requires up to two bits more than the minimum possible code length for a sequence of length $N$.
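The interval arithmetic above can be sketched as follows (a toy illustration with a hypothetical i.i.d. binary source; `cond_prob` stands in for the conditional model of the sequence):

```python
import math

def code_interval(seq, cond_prob):
    """Track the arithmetic-coding interval [low, low + width):
    its final width equals the probability of the whole sequence."""
    low, width = 0.0, 1.0
    alphabet = ("0", "1")
    for i, sym in enumerate(seq):
        offset = 0.0
        for s in alphabet:  # subintervals are adjacent and non-overlapping
            p = cond_prob(s, seq[:i])
            if s == sym:
                low, width = low + width * offset, width * p
                break
            offset += p
    return low, width

# hypothetical i.i.d. binary source with p("1") = 0.25
prob = lambda s, hist: 0.25 if s == "1" else 0.75
low, width = code_interval("0100", prob)
bits_needed = math.ceil(-math.log2(width)) + 1  # an integer z satisfying (4)
```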
II-B Universal coding
In the previous subsection we learned that there exist codes that are able to produce binary representations of (almost) minimal redundancy (e.g. arithmetic codes). However, recall that the decoder has to know the joint probability distribution of the input source in order to build the most compact binary representation, whereas in most practical situations it has no prior knowledge about it. Hence, in such cases, we have to rely on so-called universal codes. They basically apply the following principle: 1) start with a general, data-independent probability model, 2) update the model upon seeing incoming samples, 3) encode the input samples with regard to the updated probability model.
Thus, the theoretical minimum of universal codes is bounded by the quality of the decoder's probability estimate. Concretely, let $\tilde{p}$ be the decoder's estimate of the input's probability model; then the minimum average codelength that can be achieved is

$\bar{\ell}_{\min} = H(p, \tilde{p}) = H(p) + D_{\mathrm{KL}}(p \,\|\, \tilde{p})$

with $H(p, \tilde{p})$ being the cross-entropy and $D_{\mathrm{KL}}(p \,\|\, \tilde{p})$ the Kullback–Leibler divergence. Hence, a lossless code can only create binary representations with minimal redundancy iff its decoder's probability model is the same as the input source's. In other words, the better its estimate is, the better it can encode the input samples.
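The redundancy caused by a mismatched decoder model can be checked numerically (a small sketch with generic two-symbol distributions of our choosing):

```python
import math

def entropy(p):
    """Shannon entropy H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p.values())

def cross_entropy(p, q):
    """Average bits per symbol when the source follows p
    but the codelengths are derived from the model q."""
    return -sum(p[s] * math.log2(q[s]) for s in p)

p = {"a": 0.5, "b": 0.5}   # true source distribution
q = {"a": 0.8, "b": 0.2}   # decoder's (wrong) estimate
extra_bits = cross_entropy(p, q) - entropy(p)  # equals KL(p || q) >= 0
```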
An example of a universal lossless code is the so-called two-part Huffman code. Given a set of samples to be encoded, it firstly makes an estimate of their empirical probability mass distribution (EPMD) and, subsequently, it encodes the samples with regard to it. However, it has the natural caveat that the estimate needs to be encoded as well, which may add a significant number of bits in many practical situations. Moreover, as we already discussed in the previous subsection, Huffman codes also come with a series of undesired properties that make them very inefficient for cases where fast adaptability and coding efficiency are required [19].
In general, a universal lossless code should have the following desiderata:

Universality: The code should have a mechanism that allows it to adapt its probability model to a wide range of different types of input distributions, in a sample-efficient manner.

Minimal redundancy: The code should produce binary representations of minimal redundancy with regards to its probability estimate.

High efficiency: The code should have high coding efficiency, meaning, that encoding/decoding should have high throughput.
II-B1 CABAC
Context-based Adaptive Binary Arithmetic Coding is a form of universal lossless coding that fulfils all of the above properties, in that it offers a high degree of adaptation, near-optimal codelengths, and a highly efficient implementation. It was originally designed for the video compression standard H.264/AVC [17], but it is also an integral part of its successor H.265/HEVC. It is well known to attain higher compression performance as well as higher throughput than other entropy coding methods [21]. In short, it encodes each input sample by applying the following three stages:

Binarization: Firstly, it predefines a series of binary decisions (also called bins) under which each unique input sample element (or symbol) will be uniquely identified. In other words, it builds a predefined binary decision tree where each leaf identifies a unique input value.

Context modeling: Then, it assigns a binary probability model to each bin (also called a context model), which is updated on-the-fly by the local statistics of the data. This enables CABAC to model a wide range of different source distributions.

Arithmetic coding: Finally, it employs an arithmetic coder in order to optimally and efficiently code each bin, based on the respective context model.
Notice that, in contrast to two-part Huffman codes, CABAC's encoder does not need to encode its probability estimates, since the decoder is able to analogously update its context models upon sequentially decoding the input samples. Codes that have this property are called backward-adaptive codes. Moreover, CABAC is able to take local correlations into account, since the context models are updated by the local statistics of the incoming input data.
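The backward-adaptive idea can be sketched as follows (a toy model of ours: the update rule is a generic exponential-decay estimator, not CABAC's exact finite-state machine, and the cost is the ideal arithmetic-coding cost of each bin):

```python
import math

def encode_cost_bits(bins, prior=0.5, window=16):
    """Bits needed to code a bin sequence with a backward-adaptive binary
    probability model: each bin costs -log2 of its current estimated
    probability, and the estimate is updated after coding the bin
    (the decoder can mirror exactly the same updates)."""
    p_one = prior
    total_bits = 0.0
    for b in bins:
        p = p_one if b == 1 else 1.0 - p_one
        total_bits += -math.log2(max(p, 1e-12))
        p_one += (b - p_one) / window  # decay toward the observed bin
    return total_bits

# a skewed bin sequence: adaptation beats the 1 bit/bin of a fixed 50:50 model
skewed_cost = encode_cost_bits([0] * 90 + [1] * 10)
```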
II-C Lossy coding
In contrast to lossless coding, information is lost in the lossy coding process. This implies that the quantizer is non-invertible, and therefore the reconstruction generally differs from the input. An example of a distortion measure may be the mean-squared error $d(w, \bar{w}) = (w - \bar{w})^2$, but we stress that other measures can be considered as well (which will become apparent in section III).
The infimum of the rate-distortion objective (1) is referred to as the rate-distortion function in the source coding literature [19], and it represents the fundamental bound on the performance of lossy source coding algorithms. However, as we have already discussed above, finding an optimal code that attains the rate-distortion function is most often NP-hard, and it can be calculated only for very few types and/or special cases of input sources. Therefore, in practice, we relax the problem until we arrive at an objective that we can solve in a feasible manner.
Firstly, we fix the binarization map by selecting a particular (universal) lossless code and condition the minimization of (1) on it. That is, we now only ask for the quantizer $Q$, along with its reconstruction values, that minimizes the respective rate-distortion objective. Secondly, we always assume that we encode an $N$-long 1-dimensional random process. Then, objective (1) simplifies to: given a lossless code $B$, find the quantizer that attains

$\min_{Q,\,\bar{\Omega}}\; \sum_{n=1}^{N} d(w_n, \bar{w}_n) + \lambda\, \ell(b_n)$   (6)

where $\bar{w}_n$ denotes the reconstruction value of the sample $w_n$, $b_n = B(Q(w_n))$ its binary representation, and $\bar{\Omega}$ the set of quantization points.
For instance, if we choose $B$ such that it assigns a binary representation of fixed length to all quantization points, then the minimizer of (6) can be found by applying the K-Means algorithm.
The minimizers of (6) are called scalar quantizers, since they measure the distortion independently for each input sample. In contrast, vector quantizers are those that result from minimizing (6) when grouping a sequence of input samples together and measuring the distortion in the respective vector space. It is well known that scalar quantizers are fundamentally more redundant than vector quantizers. Nevertheless, due to the associated complexity of vector quantizers, it is more common to apply scalar quantizers in practice. Moreover, the inherent redundancy of scalar quantizers is negligible for most practical applications [19].
We also want to stress that although the distortion in (6) is measured independently for each sample, the binarization (and consequently the respective codelength) of each sample can still depend on the other samples by taking correlations into account.
II-C1 Scalar Lloyd algorithm
An example of an algorithm that finds a local optimum is the Lloyd algorithm. It approximates the average codelength of the quantized samples with the entropy of their empirical probability mass distribution (EPMD). Thus, it substitutes the codelength in (6) by the respective self-information $-\log_2 P(\bar{w}_n)$ under the EPMD $P$, and applies a greedy algorithm in order to find the quantizer and quantization points that minimize the respective objective. Pseudocode can be found in the appendix (see algorithm 4).
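A minimal sketch of such an entropy-constrained Lloyd iteration (our simplified variant, not algorithm 4 from the appendix): the assignment step trades squared error against the self-information of each cluster under the EPMD, and the update step re-estimates centroids and probabilities.

```python
import math

def entropy_constrained_lloyd(samples, centers, lam=0.1, iters=20):
    """Greedy alternation between assignment and update steps for the
    entropy-approximated rate-distortion objective."""
    k = len(centers)
    probs = [1.0 / k] * k
    assign = [0] * len(samples)
    for _ in range(iters):
        # assignment: squared distortion + lam * self-information
        for i, x in enumerate(samples):
            costs = [(x - centers[j]) ** 2
                     + lam * (-math.log2(max(probs[j], 1e-12)))
                     for j in range(k)]
            assign[i] = min(range(k), key=lambda j: costs[j])
        # update: re-estimate centroids and empirical probabilities
        for j in range(k):
            members = [samples[i] for i in range(len(samples)) if assign[i] == j]
            probs[j] = len(members) / len(samples)
            if members:
                centers[j] = sum(members) / len(members)
    return centers, assign, probs
```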
II-C2 CABAC-based RD-quantization
If we are given a set of quantization points and select CABAC as our posterior universal lossless code, then we can trivially minimize (6) by sequentially quantizing the input samples. In the video coding standards, the set of quantization points is predefined by the particular choice of quantization strength [16]. However, in the context of neural network compression we do not know of a good relationship between the quantization strength and the set of quantization points. In section III we describe how we tackle this problem.
II-D Model compression vs. source coding
So far, we have reviewed some fundamental results of source coding theory. However, in this work we are rather interested in the general topic of model compression. There is a fundamental difference between the two paradigms. Namely, we are now more interested in the predictive performance of the resulting quantized model than in the distance between the quantized and original parameters. Figure 4 highlights this distinction. We will now formalize the general model compression paradigm for the supervised learning setting; the problem can be analogously formulated for other learning tasks.
Firstly, we assume that we are given only one model sample with real-valued weight parameters $W \in \mathbb{R}^N$ (thus, the weights play the role of the input samples discussed above). In addition, we assume a universal coding setting, where the decoder has no prior knowledge regarding the distribution of the parameter values. We argue that this simulates most real-world scenarios.
Let now $\mathcal{D} = \{(x_i, y_i)\}$ be a set of data samples. Let further $p(y \mid x; W)$ denote the approximate posterior of the data, parameterized by $W$. For instance, $p(y \mid x; W)$ may be a trained neural network model with parameters $W$. Finally, let $B$ be a chosen and fixed universal lossless code. Then, we aim to find a quantizer $Q$ that minimizes

$\min_{Q}\; \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[{-\log p(y \mid x; \bar{W})}\big] + \lambda\, \ell(\bar{W})$   (7)

with $p(y \mid x; \bar{W})$ being the outputs of the quantized model $\bar{W}$ and $\ell(\bar{W})$ the codelength of the quantized parameters under $B$.
The first term in (7) expresses the usual learning task of interest, whereas the second term explicitly expresses the codelength of the model. This minimization objective is well motivated by the Minimum Description Length (MDL) principle [11]. However, finding the minimum of (7) is also most often NP-hard. This motivates further approximations where, as a result, one can directly apply techniques from the source coding literature in order to minimize the desired objective.
II-D1 Relaxation of the model compression problem into a source coding problem
We may further assume that the given unquantized model has been pretrained on the desired task and that it reaches satisfactory accuracies. Then, in such cases, it is reasonable to replace the first term in (7) by the KL-divergence between the unquantized model and the respective quantized model. That is, we now want to quantize our model such that its output distribution does not differ too much from that of its original version.
Furthermore, if we now assume that the output distributions do not differ too much from each other, then we can approximate the KL-divergence by a second-order expansion involving the Fisher Information Matrix (FIM). Concretely,

$D_{\mathrm{KL}}\big(p(y \mid x; W) \,\big\|\, p(y \mid x; \bar{W})\big) \approx \tfrac{1}{2}\, \delta W^{\top} F\, \delta W$   (8)

with $\delta W = W - \bar{W}$ and $F = \mathbb{E}\big[\nabla_W \log p(y \mid x; W)\, \nabla_W \log p(y \mid x; W)^{\top}\big]$ the FIM.
Then, by substituting (8) in (7), we get the following minimization objective

$\min_{Q}\; \tfrac{1}{2}\, \delta W^{\top} F\, \delta W + \lambda\, \ell(\bar{W})$   (9)

Objective (9) now follows the same paradigm as the usual source coding problem, however with the peculiarity that the distortion term now (approximately) measures the distance between $W$ and $\bar{W}$ in the space of output distributions instead of the Euclidean space. The advantage of the rate-distortion objective (9) is that, after the FIM has been calculated, it can be solved by applying common techniques from the source coding literature, such as the scalar Lloyd algorithm.
However, minimizing (9) as well as estimating the FIM for deep neural networks usually requires considerable computational resources, and is most often infeasible for practical applications. Therefore, we further approximate the FIM by only its diagonal elements $F_{nn}$ (FIM-diagonals), which can be efficiently estimated (see appendix). As a result, (9) simplifies into

$\min_{Q}\; \sum_{n=1}^{N} \tfrac{1}{2} F_{nn}\, (w_n - \bar{w}_n)^2 + \lambda\, \ell(b_n)$   (10)

which can be feasibly solved.
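The FIM-diagonals can be estimated with the standard empirical-Fisher recipe, i.e. averaging squared per-sample gradients of the log-likelihood (a generic sketch; the estimator described in the appendix may differ in its details):

```python
def fim_diagonal(per_sample_grads):
    """Empirical FIM diagonal: the average squared per-sample gradient of
    the log-likelihood w.r.t. each parameter (off-diagonals are dropped)."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[j] ** 2 for g in per_sample_grads) / n for j in range(dim)]

# two hypothetical per-sample gradient vectors for a 2-parameter model:
# parameter 0 is "sensitive" to perturbations, parameter 1 is not
grads = [[1.0, 0.0], [3.0, 0.0]]
fim_diag = fim_diagonal(grads)
```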
In the next section we will give a thorough description of our proposed coder. Its design complies with all desired properties that a coder for neural network compression should have.
III DeepCABAC
In light of the discussion in the previous section, we can highlight a set of desiderata that a coder for neural network compression should have.

Minimal redundancy: State-of-the-art deep neural networks usually contain millions of parameters. Thus, any type of redundancy in the weight parameters may imply several additional MB being stored. Hence, the code should output a binary representation with minimal redundancy per weight element.

Universality: The code should be applicable to any type of incoming neural networks, without having to know their distribution a priori. Hence, the code should entail a mechanism that allows it to adapt to a rich number of possible parameter distributions.

High coding efficiency: The computational complexity of encoding/decoding should be minimal. In particular, the throughput of the decoder should be very high if performing inference on the compressed representation is desired.

Configurable error vs. compression strength: The coder should have a hyperparameter that controls the tradeoff between the compression strength and the incurred prediction error.

High data efficiency: Minimizing (7) implies access to data. Hence, it is desirable that the coder finds a (local) solution with the least amount of data samples possible.
III-A DeepCABAC's coding procedure
We propose a coding algorithm that satisfies all the above properties. We named it DeepCABAC, since it is based on applying CABAC to the network's quantized weight parameters. Figure 5 shows the respective compression scheme. It performs the following steps:

it extracts the weight parameters of the neural network layer-by-layer in row-major order (thus, it assumes a matrix form where the parameters are scanned from left-to-right, top-to-bottom).

Then, it selects a particular step-size $\Delta$, which defines the set of quantization points.

Subsequently, it quantizes the weight values by minimizing a weighted rate-distortion function, which implicitly takes the impact of quantization on the accuracy of the network into account.

Then, it compresses them by applying our adapted version of CABAC.

Finally, it reconstructs the network and evaluates the prediction performance of the quantized model.

The process is repeated for a set of hyperparameter candidates until the desired accuracy-vs-size trade-off is achieved.
This approach has several advantageous properties. Firstly, it applies CABAC to the quantized parameters, and therefore we ensure that the code satisfies desiderata 1–3. Secondly, by conducting the compression for a set of hyperparameters for the quantizer, we can select the desired Pareto-optimal solutions in the accuracy vs. bit-size plane, thus satisfying property 4. Finally, since only one evaluation of the model is required in the process, a significantly lower amount of data samples is required for the compression process than is usually employed for training.
In the following we will explain in more detail the different components of DeepCABAC.
III-B Lossless coder of DeepCABAC
Consider the weight distribution of the last fully-connected layer of the trained VGG16 model displayed in figure 6. As we can see, there is one peak near 0 and the distribution is asymmetric and monotonically decreasing on both sides. In our experience, all layers we have studied so far have weight distributions with similar properties. Hence, in order to accommodate this type of distribution, we adopted the following binarization procedure.
Given a quantized weight tensor in its matrix form (for fully-connected layers this is trivial; convolutional layers are converted into their respective matrix form according to [22]), DeepCABAC scans the weight elements in row-major order and binarizes them as follows:
The first bit, sigFlag, determines if the weight element is significant or not; that is, it indicates whether the weight value is 0. This bit is then encoded using a binary arithmetic coder, according to its respective context model (color-coded in grey). The context model is initially set to 0.5 (thus, 50% probability of a weight element being 0), but is automatically updated to the local statistics of the weight parameters as DeepCABAC encodes more elements.

Then, if the element is not 0, the sign bit or signFlag is analogously encoded, according to its respective context model.

Subsequently, a series of bits are analogously encoded, which determine if the element is greater than $i$ (hence AbsGr($i$)Flags). The number of such flags becomes a hyperparameter for the encoder.

Finally, the remainder is encoded using an Exponential-Golomb code [23] (to recall, the Exponential-Golomb code encodes a positive integer by firstly encoding the exponent using a unary code and subsequently the remainder in fixed-point representation), where each bit of the unary part is also encoded relative to its context model. Only the fixed-length part of the code is not encoded using a context model (color-coded in blue).
For instance, assume that , then the integer would be represented as 111101, or the 7 as 10111010. Figure 7 depicts an example scheme of the binarization procedure.
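A hypothetical sketch of this binarization (the default number of AbsGr flags and the exact Exp-Golomb variant are our assumptions, so the produced bitstrings need not match the example above):

```python
def binarize(value, n_flags=1):
    """Sketch of a DeepCABAC-style binarization: sigFlag, signFlag,
    a run of AbsGr(i)Flags, then an Exp-Golomb-style remainder."""
    if value == 0:
        return "0"                        # sigFlag = 0: not significant
    bins = "1"                            # sigFlag = 1
    bins += "1" if value < 0 else "0"     # signFlag
    mag = abs(value)
    for i in range(1, n_flags + 1):       # AbsGr(i)Flags
        if mag > i:
            bins += "1"
        else:
            return bins + "0"             # magnitude fully identified
    # remainder: unary-coded exponent followed by fixed-length bits
    rem = mag - n_flags - 1
    k = (rem + 1).bit_length() - 1        # exponent
    tail = "1" * k + "0"
    if k > 0:
        tail += format(rem + 1 - (1 << k), "0{}b".format(k))
    return bins + tail
```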
The first three parts of the binarization scheme empower CABAC to adapt its probability estimates to any distribution shape around the value 0 and, therefore, to encode the most frequent values with minimal redundancy. For the remainder values, we opted for the Exponential-Golomb code since it automatically assigns smaller codelengths to smaller integer values. However, in order to further enhance its adaptability, we also encode its unary part with the help of context models. We left the fixed-length part of the Golomb code without context models, meaning that we approximate the distribution of those values by a uniform distribution (see figure 6). We argue that this is reasonable since the distribution of large values usually becomes flatter and flatter, and it comes with the direct benefit of increasing the efficiency of the coder.

III-C Lossy coder of DeepCABAC
After establishing CABAC as our choice of universal lossless code, we now aim to find the optimal quantizer that minimizes the objective stated in (7) (section II-D). To recall, this involves the optimization of two components (see figure 4):

Assignment: finding the quantizer that assigns the optimal integer to each weight parameter.

Quant. points: finding the optimal quantization points.
Since neural networks usually rely on scalable, gradient-based minimization techniques in order to optimize their loss function, finding the quantizers that solve (7) becomes infeasible in most cases, since quantization is a non-differentiable operation. Therefore, we opted for a simpler approach. Firstly, we decouple the assignment map and the quantization points from each other and optimize them independently. The quantization points then become hyperparameters of the quantizer, and their values are selected such that they minimize the loss function directly. This separation was empirically motivated, since we discovered that the network's performance is significantly more sensitive to the choice of the quantization points than to the assignment. We discuss this in more detail in the experimental section.
III-C1 The quantization points
Since finding the correct set of quantization points can be very complex for a large number of points, we constrain them to be equidistant to each other with a specific stepsize Δ. That is, each point can be rewritten as q_k = k·Δ with k ∈ ℤ. This does not only considerably simplify the problem, but also encourages fixed-point representations, which can be exploited in order to perform inference with lower complexity [24, 25].
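As a minimal sketch (function and variable names are ours, not from the paper), quantization with equidistant points amounts to rounding each weight to the nearest multiple of the stepsize:

```python
import numpy as np

def quantize_uniform(w, delta):
    """Map each weight to its nearest integer multiple of the stepsize delta.

    Returns the integer assignments k and the reconstructed values k * delta."""
    k = np.round(np.asarray(w) / delta).astype(int)
    return k, k * delta

# Example: with delta = 0.25, a weight of 0.6 snaps to 0.5 (k = 2).
k, w_hat = quantize_uniform([0.6, -0.13, 0.0], 0.25)
```

The integer indices are what the lossless coder later encodes; the reconstruction is simply index times stepsize, which is what makes fixed-point inference possible.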
III-C2 The assignment
Hence, the quantizer has two configurable hyperparameters (Δ, λ), the former defining the set of quantization points and the latter the quantization strength. Once a particular tuple (Δ, λ) is given, the quantizer assigns each weight parameter to its corresponding quantization point by minimizing the weighted rate-distortion function
$\bar{w}_i = \operatorname*{arg\,min}_{q_k \in Q}\; F_{ii}\,(w_i - q_k)^2 + \lambda\, R_{i,k}$   (11)
where F_ii denotes the i-th diagonal element of the FIM and R_{i,k} is the code length of the quantization point q_k at the weight w_i, as estimated by CABAC.
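The assignment step can be sketched as follows. The per-point code lengths used here are a hypothetical stand-in for CABAC's adaptive estimates, and the FIM weighting is reduced to a single scalar for brevity:

```python
import numpy as np

def rd_assign(w, points, rate_bits, fisher=1.0, lam=0.0):
    """Rate-distortion assignment: pick, per weight, the quantization point
    minimizing fisher * (w - q_k)^2 + lam * R_k.

    `rate_bits` is a stand-in for CABAC's code-length estimates; in DeepCABAC
    these come from the coder's adaptive probability models."""
    w = np.asarray(w, dtype=float)[:, None]                    # shape (n, 1)
    cost = fisher * (w - points[None, :]) ** 2 + lam * rate_bits[None, :]
    return np.argmin(cost, axis=1)

points = np.array([-0.5, 0.0, 0.5])
rates = np.array([3.0, 1.0, 3.0])     # cheaper to code zero (hypothetical)
nn = rd_assign([0.3, -0.3], points, rates, lam=0.0)
rd = rd_assign([0.3, -0.3], points, rates, lam=0.1)
```

With λ = 0 the assignment degenerates to nearest-neighbor quantization; increasing λ pulls weights toward cheap-to-code points such as zero.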
As previously mentioned, we perform a grid-search over the hyperparameters (Δ, λ) in order to find the quantizer configuration that achieves the desired accuracy vs. bit-size trade-off. However, for that we need to define a predefined set of hyperparameter candidates to search over, in particular for the stepsizes Δ. In this work we considered two approaches for finding the set of stepsizes, which we denote DeepCABAC-version-1 (DCv1) and DeepCABAC-version-2 (DCv2).
III-C3 DeepCABAC-version-1 (DCv1)
In DCv1 we firstly estimate the diagonals of the FIM by applying scalable Bayesian techniques. Concretely, we parametrize the network with a fully-factorized Gaussian posterior and minimize the variational objective proposed in [26]. As a result, we obtain a mean μ_i and a standard deviation σ_i for each parameter, where the former can be interpreted as its (new) value (thus w_i = μ_i) and the latter as a measure of its "robustness" against perturbations. Therefore, we simply replaced the FIM diagonals in (11) by the inverse variances σ_i^{-2}. This is also well motivated theoretically, since [27] showed that the inverse variances of the parameters approximate the diagonals of the FIM for a similar variational objective. We also provide a more thorough discussion and a precise connection between them in the appendix.
After the FIM diagonals have been estimated, we define the set of considered stepsizes as follows:
$\Delta_S = S\,\sigma_{\min}, \qquad q_k = k\,\Delta_S \;\text{ with }\; |q_k| \le |w_{\max}|$   (12)
where σ_min is the smallest standard deviation and w_max the parameter with the highest magnitude value. S is then the quantizer's hyperparameter, which controls the "coarseness" of the quantization points. By selecting Δ in such a manner we ensure that the quantization points lie within the range of the standard deviation of each weight parameter, in particular for values S ≤ 1. Hence, we restricted S accordingly.
One advantage of this approach is that we can use one global hyperparameter S for the entire network, while each layer still attains a different value for its stepsize if we select one σ_min per layer. Thus, with this approach we can adapt the stepsize to each layer's sensitivity to perturbations. Moreover, the quantization also takes the sensitivity of each single parameter into account.
III-C4 DeepCABAC-version-2 (DCv2)
Estimating the diagonals of the FIM can still be computationally expensive, since it requires applying the backpropagation algorithm for several iterations in order to minimize the variational objective. Moreover, it only offers an approximation of the robustness of each parameter, and can therefore sometimes be misleading and limit the attainable compression gains. Therefore, for reasons of simplicity and computational complexity, we also considered directly searching for a good set of stepsize candidates. We do so by applying a first round of the grid-search algorithm with a nearest-neighbor quantization scheme (that is, with λ = 0). This allows us to identify the range of stepsizes that do not considerably harm the network's accuracy when the simplest quantization procedure is applied. Then, we quantize the parameters as in eq. (11), but without the diagonals of the FIM (thus, F_ii = 1). Under a limited computational budget, this approach has the advantage that we can directly search for a better set of stepsizes, since we spare the computational cost of estimating the FIM diagonals. However, since DCv2 considers only one global stepsize for the entire network, it cannot adapt to the different sensitivities of the layers.
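The two rounds of DCv2 can be sketched as follows (all helper names are hypothetical; the accuracy evaluator stands in for a validation pass):

```python
import numpy as np

def dcv2_search(w, deltas, eval_accuracy, acc_floor, lams):
    """Sketch of DCv2's two-round grid search.

    Round 1: nearest-neighbor quantization (lam = 0) over candidate stepsizes;
    keep only the deltas whose accuracy stays above `acc_floor`.
    Round 2: for the surviving deltas, sweep the rate weight lam (with the
    FIM diagonals set to 1), returning all (delta, lam) candidates."""
    survivors = []
    for d in deltas:
        w_hat = np.round(w / d) * d           # nearest-neighbor quantization
        if eval_accuracy(w_hat) >= acc_floor:
            survivors.append(d)
    return [(d, lam) for d in survivors for lam in lams]

# Toy check: "accuracy" is just negative distortion on a fixed weight vector.
w = np.array([0.11, -0.42, 0.25])
acc = lambda w_hat: -np.mean((w - w_hat) ** 2)
cands = dcv2_search(w, [0.05, 0.5], acc, acc_floor=-0.001, lams=[0.0, 0.1])
```

The coarse stepsize 0.5 is discarded in round 1, so only (Δ, λ) pairs built on the fine stepsize survive to round 2.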
IV Related work
There has been a plethora of work on the topic of neural network compression [12, 13]. In some way or another, all of it tries to partially solve the general model compression objective (7) (section II-D). In the following we examine the currently proposed approaches and discuss some of their advantages and disadvantages.
IV-A Lossy neural network compression
Some of the approaches proposed so far include:
IV-A1 Trained scalar quantization
These methods aim to minimize the model compression objective (7) by applying training algorithms that learn the optimal quantization map and reconstruction values. Inter alia, this includes sparsification methods [28, 29, 30, 31, 26, 32, 33], which try to minimize a sparsity-inducing norm of the network's parameters. Others attempted to find optimal binary or ternary weighted networks [34, 35, 36, 37], or a more general set of (locally) optimal quantizers [38, 39, 40, 41, 42].
Although these methods are able to attain very high compression ratios, they are also computationally expensive, in that several training iterations over a large training set have to be performed in order to find the optimal quantized network. In contrast, DeepCABAC requires neither retraining nor access to the full training dataset in order to be applicable.
IV-A2 Non-trained scalar quantization methods
Another line of work has focused on implicitly minimizing (7). These methods also rely on distance measures for quantizing the network's parameters [43, 44, 45, 46]. In fact, they can be seen as special cases of (10), in that they either use different approximations of the FIM diagonals or apply other minimization algorithms. To the best of our knowledge, mainly two quantizers are widely applied by the community: scalar uniform quantization and the weighted scalar Lloyd algorithm. The former basically consists of uniformly spreading quantization points over the range of parameter values and then applying nearest-neighbor quantization onto them [45, 38, 40]. The latter consists of applying the scalar Lloyd algorithm in order to find the quantizer that minimizes a weighted rate-distortion objective (10). In particular, [45] considers the diagonal elements of the empirical average of the Hessian of the loss function, which has a close connection to the FIM diagonals (see appendix for a comprehensive discussion).
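For illustration, a one-dimensional weighted Lloyd iteration may look as follows (a sketch under our own naming, not the implementation used in [43, 45]):

```python
import numpy as np

def weighted_lloyd(w, importance, centers, iters=20):
    """1-D weighted Lloyd/k-means: alternate nearest-center assignment and
    importance-weighted center updates. `importance` plays the role of the
    per-parameter Hessian/FIM estimates used by the cited methods."""
    w = np.asarray(w, float)
    g = np.asarray(importance, float)
    centers = np.asarray(centers, float).copy()
    for _ in range(iters):
        assign = np.argmin(np.abs(w[:, None] - centers[None, :]), axis=1)
        for k in range(len(centers)):
            mask = assign == k
            if g[mask].sum() > 0:                 # keep empty clusters fixed
                centers[k] = (g[mask] * w[mask]).sum() / g[mask].sum()
    return assign, centers

# An "important" weight (0.9, importance 9) drags its center toward itself.
assign, centers = weighted_lloyd(
    [0.0, 0.1, 0.8, 0.9], importance=[1, 1, 1, 9], centers=[0.0, 1.0])
```

The importance weights shift the reconstruction values toward sensitive parameters, which is the essential difference from plain k-means.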
Quantization methods that do not rely on retraining are significantly less computationally expensive. But their compression gains are heavily limited by the network's unquantized parameter distribution, since they rely on a distance measure for quantization. Moreover, as already mentioned, most of these methods only implicitly take the impact on the network's accuracy into account. In contrast, DeepCABAC explicitly takes the accuracy of the network into account, since its hyperparameters are optimized with regards to it.
IV-B Lossless neural network compression
In the field of lossless network compression, we are given an already quantized model and want to apply a universal lossless code to its parameters in order to maximally compress it. Hence, this setting is entirely equivalent to the usual lossless source coding setting discussed in sections II-A and II-B, and all of its theorems and results apply in a straightforward manner. Nevertheless, most previous work did not apply state-of-the-art universal lossless compression algorithms. Instead, the following are some of the most commonly used approaches:
IV-B1 Fixed-length numerical representations
These methods reduce the bit-length representation of the parameter values after quantization [34, 35, 36, 37, 47, 48, 49, 50, 51, 52]. They usually have the advantage of immediately reducing the complexity of inference, however, at the expense of a highly redundant network representation.
IV-B2 Scalar Huffman code
Others applied the scalar Huffman code onto quantized neural networks [45, 46]. However, as discussed in section II-A, this code has several disadvantages compared to other state-of-the-art lossless codes such as arithmetic codes. Probably the most prominent one is that it is suboptimal, in that it incurs up to 1 bit of redundancy per encoded parameter. This can be quite significant for large networks with millions of parameters. For instance, VGG16 [53] contains 138 million parameters, meaning that the binary representation of any quantized version of it may carry about 17 MB of redundancy if encoded with the scalar Huffman code.
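The 17 MB figure follows directly from one redundant bit per parameter:

```python
# Up to 1 bit of Huffman redundancy per encoded parameter: for VGG16's
# 138 million parameters, that alone is ~17 MB of overhead.
params = 138_000_000
redundancy_mb = params * 1 / 8 / 1_000_000   # bits -> bytes -> MB
```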
IV-B3 Compressed matrix representations
Most of the literature that sparsifies the network's parameters aims to convert the resulting networks into a compressed sparse matrix representation, e.g., the Compressed Sparse Row (CSR) representation. These matrix data structures do not only offer compression gains, but also an efficient execution of the associated dot product algorithm [54]. Similarly, [14] proposed two novel matrix representations, the Compressed Entropy Row (CER) and Compressed Shared Elements Row (CSER) representations, which are provably superior to CSR with regards to both compression and execution efficiency when the network's parameters have low-entropy statistics.
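As a minimal sketch of the CSR layout itself (not of [14]'s CER/CSER variants), a dense matrix is stored as three arrays:

```python
def dense_to_csr(mat):
    """Build the three CSR arrays (values, column indices, row pointers)
    from a dense row-major matrix given as a list of lists."""
    values, col_idx, row_ptr = [], [], [0]
    for row in mat:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))   # cumulative non-zero count per row
    return values, col_idx, row_ptr

vals, cols, ptrs = dense_to_csr([[0, 2, 0],
                                 [0, 0, 0],
                                 [3, 0, 1]])
```

Only the non-zero values and their positions are stored, which is why sparsified networks map naturally onto this representation.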
However, these matrix representations are also redundant, in that they do not approach the reachable entropy limit (3) (section II-A). [38] attempted to extract some of the redundancies entailed in the CSR representation by applying a scalar Huffman code to its numerical arrays. However, this suffers again from the same limitations of the scalar Huffman code.
IV-C Compression pipelines/frameworks
Among all proposed approaches for deep neural network compression, one paradigm stands out in that very high compression gains can be achieved with it [38, 26, 40, 55, 42]. Namely, it consists of applying four different compression stages:

Sparsification: Firstly, the networks are maximally sparsified by applying a trained sparsification technique.

Quantization: Then, the non-zero elements are quantized by applying one of the non-trained quantization techniques.

Fine-tuning: Subsequently, the quantization points are fine-tuned in order to recover the accuracy loss incurred by the quantization procedure.

Lossless compression: Finally, the quantized values are encoded using a lossless coding algorithm.
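Schematically, the four stages compose as follows, with each stage passed in as a callable stand-in for the cited techniques (all names are hypothetical):

```python
def compress_pipeline(model, sparsify, quantize, finetune, encode):
    """Four-stage compression paradigm: each stage is a callable so the
    sketch stays agnostic to the concrete method used at that stage."""
    model = sparsify(model)        # 1) trained sparsification
    model = quantize(model)        # 2) quantization of non-zero elements
    model = finetune(model)        # 3) fine-tune quantization points
    return encode(model)           # 4) lossless coding

# Toy instantiation on a list of "weights":
out = compress_pipeline(
    [0.01, 0.5, -0.49],
    sparsify=lambda m: [w if abs(w) > 0.1 else 0.0 for w in m],
    quantize=lambda m: [round(w, 1) for w in m],
    finetune=lambda m: m,
    encode=lambda m: m,
)
```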
Hence, DeepCABAC is designed to enhance stages 2 and 4. As we will see in the next section, it considerably boosts the attainable compression gains, surpassing previously proposed methods for these two steps.
V Experiments
In this section we benchmark DeepCABAC and compare it to other compression algorithms. We also design further experiments with the purpose of shedding light on the effectiveness of its different components.
V-A General compression benchmark
Table I: Compressed model size (as % of the original size) and top-1 accuracy (in parentheses) attained by the different quantization schemes.
Models (Spars. [%])  Org. acc. (top-1 [%])  Org. size  DCv1 [%]  DCv2 [%]  Lloyd [%]  Uniform [%]
NON-SPARSE
VGG16  69.94  553.43 MB  5.84 (69.44)  3.96 (69.54)  7.74 (69.50)  17.37 (69.90)
ResNet50  74.98  102.23 MB  10.14 (74.40)  10.14 (74.51)  13.04 (74.74)  15.58 (74.64)
MobileNet v1  70.69  17.02 MB  21.40 (70.21)  22.08 (70.21)  15.00^5 (68.10)  24.23 (70.10)
Small-VGG16  91.54  60.01 MB  6.35 (91.11)  5.88 (91.13)  9.98 (91.59)  16.18 (91.53)
LeNet5  99.46  1722 KB  3.77 (99.23)  2.52 (99.12)  3.96 (98.96)  20.60 (99.45)
LeNet-300-100  98.32  1066 KB  8.61 (98.04)  5.87 (98.00)  8.07 (97.92)  15.01 (98.30)
SPARSE
VGG16 (9.85)  69.43  553.43 MB  1.58 (69.43)  1.67 (69.04)  1.72 (69.01)  2.77 (69.42)
ResNet50 (25.40)  74.09  102.23 MB  5.45 (73.73)  5.14 (73.65)  5.61 (73.73)  6.68 (73.98)
MobileNet v1 (50.73)  66.18  17.02 MB  13.29 (66.01)  12.89 (66.02)  11.16 (65.63)  14.78 (65.71)
Small-VGG16 (7.57)  91.35  60.01 MB  1.90 (91.03)  1.95 (91.06)  2.08 (91.10)  2.84 (91.20)
LeNet5 (1.90)  99.22  1722 KB  0.88 (99.14)  0.87 (99.02)  1.09 (99.25)  3.01 (99.22)
LeNet-300-100 (9.05)  98.29  1066 KB  2.26 (98.00)  2.20 (98.00)  1.69 (97.76)  4.17 (98.36)
^5 Although a better compression ratio was attained, we were not able to reach an accuracy within the percentage-point range of the original accuracy. Therefore, this result shall not be considered the best result.
Here we benchmark the maximum compression gains attained by applying DeepCABAC. In order to assess its universality, we applied it to a wide set of pretrained network architectures, trained on different datasets. Concretely, we used the VGG16, ResNet50 and MobileNet-v1 architectures trained on the ImageNet dataset, a smaller version of the VGG16 architecture trained on the CIFAR10 dataset^6, which we denote Small-VGG16, and the LeNet-300-100 and LeNet5 architectures trained on MNIST.
^6 http://torch.ch/blog/2015/07/30/cifar.html
In addition, we also applied DeepCABAC to pre-sparsified versions of these networks. For that, we employed the variational sparsification algorithm [26] on all networks, except for VGG16 and ResNet50 due to the high computational complexity demanded by the method. The advantage of employing [26] is that we obtain the variance of each weight parameter as a byproduct of the method's output, thus being able to directly apply DCv1 after the sparsification process has finished. In the cases of the VGG16 and ResNet50 networks, we firstly applied the iterative sparsification algorithm [30] and subsequently estimated their FIM diagonals by minimizing the same variational objective proposed in [26] (see appendix for a more comprehensive explanation).
We compare the two versions of DeepCABAC, DCv1 and DCv2, against two previously proposed quantization schemes. Namely, similarly to [45, 46, 38], we applied the nearest-neighbor quantization scheme to the networks. In addition, we also applied the weighted Lloyd algorithm as proposed by [43, 45, 46]. As lossless compression candidates, we considered the scalar Huffman code, the code proposed by [38], which we denote CSR-Huffman, and the bzip2 [56] algorithm. See appendix for a more detailed explanation of the respective implementations.
Table I shows the results. As one can see, DeepCABAC attains higher compression gains on most networks compared to the previously proposed coders. It is able to compress the pretrained models by x18.9 and the sparsified models by x50.6 on average. In contrast, the Lloyd algorithm compresses the models by x13.6 and x47.3 on average, whereas uniform quantization only achieves x5.7 and x25.0 compression gains.
V-B Assignment vs. quantization points
Table II: Average bit-sizes per parameter (in bits) attained by the different quantizers for three given stepsizes on the Small-VGG16 model.
stepsizes (top-1 acc.)  DCv1  DCv2  Lloyd  Uniform
NON-SPARSE
0.032 (90.35)  1.48  1.48  1.79  1.60
0.016 (91.13)  2.21  2.20  2.29  2.40
0.001 (91.55)  4.27  4.80  2.34  5.61
SPARSE (7.57%)
0.032 (90.22)  0.47  0.47  0.52  0.48
0.016 (91.06)  0.59  0.58  0.62  0.60
0.001 (91.17)  0.91  1.00  0.74  1.00
To recall, lossy quantization involves two types of mappings: the quantization map, where input values are assigned to integers, and the reconstruction map, which assigns a quantization point to each integer. Hence, the following experiment aims to assess the effectiveness of these components individually.
For that, we selected a predefined set of stepsizes and subsequently quantized the parameters according to the different quantization schemes. In this way we gain insight into the compression gains attributable solely to the quantization map.
Table II shows the average bit-sizes per parameter attained by applying the different quantizers with the three given stepsizes to the Small-VGG16 model. In order to decouple the lossless part from the quantization, the bit-sizes for the Lloyd and uniform algorithms are calculated with regards to the entropy of the empirical probability mass distribution (EPMD) of the quantized models, since it marks the theoretical minimum for lossless codes that do not take correlations between the parameters into account. In contrast, since DeepCABAC's quantizer is optimized explicitly under CABAC's lossless coder, we calculated its average bit-size with regards to the total bit-size of the model as outputted by CABAC. We also want to stress that we chose networks with equal accuracies, i.e., within the percentage-point range of the accuracy attained after applying a uniform quantizer.
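The entropy bound used for the Lloyd and uniform columns can be computed directly from the quantized weights; a small sketch:

```python
import math
from collections import Counter

def empirical_entropy(symbols):
    """Entropy (bits/symbol) of the empirical probability mass distribution:
    the lower bound for lossless codes that treat the symbols as i.i.d."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# A heavily zero-dominated quantized layer needs well under 1 bit/parameter.
h = empirical_entropy([0] * 90 + [1] * 5 + [-1] * 5)
```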
Table II provides several insights. Firstly, notice how DeepCABAC's performance is very sensitive to the particular choice of the stepsize. This is due to the fact that, usually, the best compression performance is attained for small compression strengths at high accuracies. Thus, DeepCABAC's quantization map behaves similarly to uniform quantization in those cases, and therefore becomes sensitive to the particular choice of the stepsize. Nevertheless, notice how DeepCABAC still always attains better compression performance than uniform quantization.
Secondly, notice how for small stepsizes DCv1 outperforms DCv2 and thus makes better rate-distortion decisions. We attribute this to the fact that DCv1 takes the "robustness" of each element to perturbations into account during quantization. As we have already discussed in sections II-D and III (and in more detail in the appendix), the latter can be interpreted as minimizing an approximation of the desired MDL loss function. However, since it is only an approximation, this only applies for small stepsizes and becomes more inaccurate for larger ones. Indeed, table II shows how DCv2 attains similar results to DCv1 as the stepsize is increased, implying that the particular expression of the RD function becomes more and more irrelevant.
These insights motivated the design of DCv2 in the first place, since it is able to explore a larger set of stepsizes for the best accuracy vs. bit-size trade-offs. Indeed, as table I from the previous experiment shows, DCv2 attains similar or even higher compression gains than DCv1, in particular in the case of pretrained networks.
V-C Lossless coding
Table III: Average bit-sizes per parameter (in bits) attained by applying different lossless codes to the quantized Small-VGG16 models; H denotes the entropy of the empirical probability mass distribution.
Quantizers  Uniform  Lloyd  DCv2
Lossless codes
NON-SPARSE
scalar Huffman  5.18  3.19  2.33
bzip2  5.22  3.22  2.42
CABAC  4.77  2.74  2.07
Entropy H  5.09  2.91  2.20
SPARSE (7.57%)
scalar Huffman  1.35  1.71  1.33
CSR-Huffman  0.91  0.67  0.65
bzip2  0.73  0.72  0.71
CABAC  0.63  0.63  0.61
Entropy H  0.84  0.60  0.58
In our last experiment we aimed to assess the efficiency of different universal lossless coders. For that, we quantized the Small-VGG16 network using three different quantizers and subsequently compressed each version using different universal lossless coders. More concretely, we quantized the model by applying DCv2, the weighted Lloyd algorithm and the nearest-neighbor quantizer. We then applied the scalar Huffman code, the CSR-Huffman code [38], the bzip2 algorithm, and CABAC. Moreover, we also calculated the entropy of the quantized networks, as measured with regards to their empirical probability mass distribution (EPMD).
The resulting bit-sizes are shown in table III. As one can see, CABAC attains higher compression gains across all quantized versions of the network. Moreover, in some cases it attains lower code lengths than the entropy of the EPMD. We attribute this to CABAC's inherent capability to take correlations between the network's parameters into account. This property highlights its superiority over the previously proposed universal lossless coders, e.g., scalar Huffman and CSR-Huffman, since their average code lengths are lower-bounded by this entropy and it is therefore impossible for them to attain lower code lengths than CABAC in those cases.
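A toy illustration of why conditioning can beat the i.i.d. entropy bound: on a correlated bit sequence, the entropy conditioned on the previous bit, which is what a context model effectively estimates, is far below the marginal entropy:

```python
import math
from collections import Counter

def entropy(seq):
    """Entropy in bits/symbol of the empirical distribution of `seq`."""
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in Counter(seq).values())

# A runs-heavy (correlated) bit sequence: the i.i.d. entropy ignores the
# structure, while conditioning on the previous bit exposes it.
bits = [0] * 50 + [1] * 50
pairs = list(zip(bits[:-1], bits[1:]))
h_iid = entropy(bits)                          # marginal entropy H(X_t)
h_cond = entropy(pairs) - entropy(bits[:-1])   # H(X_t | X_{t-1})
```

Codes bounded by the marginal entropy cannot exploit this gap; context-adaptive coders such as CABAC can.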
VI Conclusion
In this work we proposed DeepCABAC, a novel compression algorithm for deep neural networks that is based on applying Context-based Adaptive Binary Arithmetic Coding (CABAC) to the network's parameters; CABAC is the state-of-the-art universal lossless coder employed in the H.264/AVC and H.265/HEVC video coding standards. DeepCABAC also incorporates a novel quantization scheme that explicitly minimizes the accuracy vs. bit-size trade-off, without relying on expensive retraining or access to large amounts of data. Experiments showed that it can compress pretrained neural networks by x18.9 on average, and their sparsified versions by x50.6, consistently attaining higher compression performance than previously proposed coding techniques with similar characteristics. Moreover, DeepCABAC is able to capture correlations between the network's parameters, and is thus able to compress them beyond the entropy limit of codes that assume a stationary distribution.
As future work, we will investigate the impact of compression on the neural network's problem-solving strategies [57] and apply DeepCABAC in distributed training scenarios, where the communication overhead of the network's parameter updates is critical for the overall training efficiency.
References

[1] Y. LeCun, Y. Bengio, and G. E. Hinton, "Deep learning," Nature, vol. 521, pp. 436–444, 2015.
 [2] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85–117, 2015.
 [3] X. Xu, Y. Ding, S. X. Hu, M. Niemier, J. Cong, Y. Hu, and Y. Shi, "Scaling for edge inference of deep neural networks," Nature Electronics, vol. 1, pp. 216–222, Apr 2018.
 [4] M. Horowitz, "1.1 computing's energy problem (and what we can do about it)," in IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), Feb 2014, pp. 10–14.
 [5] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv:1704.04861, 2017.
 [6] K. Ota, M. S. Dao, V. Mezaris, and F. G. B. D. Natale, “Deep learning for mobile multimedia: A survey,” ACM Trans. Multimedia Comput. Commun. Appl., vol. 13, no. 3s, pp. 34:1–34:22, Jun. 2017.
 [7] C. Zhang, P. Patras, and H. Haddadi, “Deep learning in mobile and wireless networking: A survey,” IEEE Communications Surveys Tutorials, pp. 1–1, 2019.
 [8] H. B. McMahan, E. Moore, D. Ramage, and B. A. y Arcas, “Federated learning of deep networks using model averaging,” arXiv:1602.05629, 2016.
 [9] F. Sattler, S. Wiedemann, K.-R. Müller, and W. Samek, "Sparse binary compression: Towards distributed deep learning with minimal communication," in IEEE International Joint Conference on Neural Networks (IJCNN), 2019.
 [10] F. Sattler, S. Wiedemann, K.-R. Müller, and W. Samek, "Robust and communication-efficient federated learning from non-iid data," arXiv:1903.02891, 2019.
 [11] P. Grünwald and J. Rissanen, The Minimum Description Length Principle, ser. Adaptive computation and machine learning. MIT Press, 2007.
 [12] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, “A survey of model compression and acceleration for deep neural networks,” arXiv:1710.09282, 2017.
 [13] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, “Model compression and acceleration for deep neural networks: The principles, progress, and challenges,” IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 126–136, Jan 2018.
 [14] S. Wiedemann, K.R. Müller, and W. Samek, “Compact and computationally efficient representation of deep neural networks,” IEEE Transactions on Neural Networks and Learning Systems, 2019.
 [15] MPEG Requirements, “Updated call for proposals on neural network compression,” Moving Picture Experts Group (MPEG), Marrakech, MA, CfP, Jan. 2019.
 [16] V. Sze, M. Budagavi, and G. J. Sullivan, High Efficiency Video Coding: Algorithms and Architectures. Springer, 2014.
 [17] D. Marpe, H. Schwarz, and T. Wiegand, "Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 620–636, July 2003.
 [18] C. E. Shannon, “A mathematical theory of communication,” SIGMOBILE Mobile Computing and Communications Review, vol. 5, no. 1, pp. 3–55, 2001.
 [19] T. Wiegand and H. Schwarz, “Source coding: Part 1 of fundamentals of source and video coding,” Found. Trends Signal Process., vol. 4, no. 1–2, pp. 1–222, 2011.
 [20] D. A. Huffman, "A method for the construction of minimum-redundancy codes," Proceedings of the IRE, vol. 40, no. 9, pp. 1098–1101, Sep. 1952.
 [21] D. Marpe and T. Wiegand, "A highly efficient multiplication-free binary arithmetic coder and its application in video coding," in Proceedings 2003 International Conference on Image Processing (Cat. No.03CH37429), vol. 2, Sep. 2003, pp. 263–266.
 [22] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cuDNN: Efficient primitives for deep learning," arXiv:1410.0759, 2014.
 [23] J. Teuhola, "A compression method for clustered bit-vectors," Information Processing Letters, vol. 7, no. 6, pp. 308–311, 1978.
 [24] “QNNPACK open source library for optimized mobile deep learning,” https://github.com/pytorch/QNNPACK, accessed: 28.02.2019.

[25] "TensorFlow Lite," https://www.tensorflow.org/lite, accessed: 28.02.2019.
 [26] D. Molchanov, A. Ashukha, and D. Vetrov, "Variational dropout sparsifies deep neural networks," in International Conference on Machine Learning (ICML), 2017, pp. 2498–2507.
 [27] A. Achille, M. Rovere, and S. Soatto, “Critical learning periods in deep neural networks,” arXiv:1711.08856, 2017.
 [28] B. Hassibi, D. G. Stork, and G. J. Wolff, “Optimal brain surgeon and general network pruning,” in IEEE International Conference on Neural Networks, 1993, pp. 293–299.
 [29] Y. L. Cun, J. S. Denker, and S. A. Solla, “Optimal brain damage,” in Advances in Neural Information Processing Systems (NIPS), 1990, pp. 598–605.
 [30] S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and connections for efficient neural networks,” in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 1135–1143.
 [31] Y. Guo, A. Yao, and Y. Chen, “Dynamic network surgery for efficient dnns,” arXiv:1608.04493, 2016.
 [32] C. Louizos, M. Welling, and D. P. Kingma, "Learning sparse neural networks through L_0 regularization," arXiv:1712.01312, 2017.
 [33] E. Tartaglione, S. Lepsøy, A. Fiandrotti, and G. Francini, "Learning sparse neural networks via sensitivity-driven regularization," in Advances in Neural Information Processing Systems (NIPS), 2018, pp. 3878–3888.
 [34] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," arXiv:1603.05279, 2016.
 [35] M. Courbariaux and Y. Bengio, "BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1," arXiv:1602.02830, 2016.
 [36] C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained ternary quantization,” arXiv:1612.01064, 2016.
 [37] F. Li, B. Zhang, and B. Liu, “Ternary weight networks,” arXiv:1605.04711, 2016.
 [38] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv:1510.00149, 2015.
 [39] K. Ullrich, E. Meeds, and M. Welling, “Soft WeightSharing for Neural Network Compression,” arXiv:1702.04008, 2017.
 [40] C. Louizos, K. Ullrich, and M. Welling, “Bayesian Compression for Deep Learning,” in Advances in Neural Information Processing Systems (NIPS), 2017, pp. 3288–3298.
 [41] J. Achterhold, J. M. Köhler, A. Schmeink, and T. Genewein, “Variational network quantization,” in International Conference on Representation Learning (ICLR), 2018.
 [42] S. Wiedemann, A. Marbán, K.R. Müller, and W. Samek, “Entropyconstrained training of deep neural networks,” in IEEE International Joint Conference on Neural Networks (IJCNN), 2019.

[43] E. Park, J. Ahn, and S. Yoo, "Weighted-entropy-based quantization for deep neural networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 7197–7205.
 [44] M. Tu, V. Berisha, Y. Cao, and J. S. Seo, "Reducing the model order of deep neural networks using information theory," in IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2016, pp. 93–98.
 [45] Y. Choi, M. El-Khamy, and J. Lee, "Towards the limit of network quantization," arXiv:1612.01543, 2016.
 [46] Y. Choi, M. El-Khamy, and J. Lee, "Universal deep neural network compression," arXiv:1802.02271, 2018.
 [47] N. Mellempudi, A. Kundu, D. Das, D. Mudigere, and B. Kaul, "Mixed low-precision deep learning inference using dynamic fixed point," arXiv:1701.08978, 2017.
 [48] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” arXiv:1502.02551, 2015.

[49] X. Chen, X. Hu, N. Xu, and H. Zhou, "FxpNet: Training deep convolutional neural network in fixed-point representation," in IEEE International Joint Conference on Neural Networks (IJCNN), 2017.
 [50] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, "Incremental network quantization: Towards lossless CNNs with low-precision weights," arXiv:1702.03044, 2017.
 [51] R. Banner, I. Hubara, E. Hoffer, and D. Soudry, “Scalable methods for 8bit training of neural networks,” in Advances in Neural Information Processing Systems (NIPS), 2018, pp. 5145–5153.
 [52] N. Wang, J. Choi, D. Brand, C. Chen, and K. Gopalakrishnan, “Training deep neural networks with 8bit floating point numbers,” arXiv:1812.08011, 2018.
 [53] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv:1409.1556, 2014.
 [54] I. S. Duff, “A survey of sparse matrix research,” Proceedings of the IEEE, vol. 65, no. 4, pp. 500–535, 1977.
 [55] M. Federici, K. Ullrich, and M. Welling, “Improved Bayesian Compression,” arXiv:1711.06494, 2017.
 [56] J. Seward, “bzip2,” http://www.bzip.org, 1998, accessed: 09.06.2019.
 [57] S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K.-R. Müller, "Unmasking clever hans predictors and assessing what machines really learn," Nature Communications, vol. 10, p. 1096, 2019.
 [58] D. P. Kingma, T. Salimans, and M. Welling, “Variational dropout and the local reparameterization trick,” in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 2575–2583.
 [59] J. Martens, “New perspectives on the natural gradient method,” arXiv:1412.1193, 2014.
Appendix A Experiment details
A-A Uniform quantization
Uniform quantization is essentially one step of the weighted Lloyd algorithm with no importance measure and no cluster center update. One major difference between uniform quantization and the weighted Lloyd algorithm is that in the weighted Lloyd algorithm the neural network is quantized as a whole, while in uniform quantization the neural network is quantized layer-wise (see algorithm 5).
For the experiment described in section V-A, the number of clusters was determined by first starting out with 256 clusters for unsparsified networks and 32 clusters for sparsified networks. The networks were then quantized and evaluated. If the accuracy was not within the percentage-point range of the original accuracy, the number of clusters was doubled until the accuracy was within that range.
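The doubling procedure can be sketched as follows (the function names and the toy accuracy model are hypothetical):

```python
def find_cluster_count(quantize_and_eval, orig_acc, tol, start):
    """Double the number of clusters until the quantized accuracy is
    within `tol` of the original accuracy."""
    clusters = start
    while orig_acc - quantize_and_eval(clusters) > tol:
        clusters *= 2
    return clusters

# Toy accuracy model: more clusters, less quantization-induced accuracy drop.
n = find_cluster_count(lambda c: 99.0 - 64.0 / c, orig_acc=99.0,
                       tol=0.1, start=32)
```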
Model  Clusters
LeNet-300-100  256
LeNet-300-100 Sparse  32
LeNet5  256
LeNet5 Sparse  32
Small-VGG16  256
Small-VGG16 Sparse  128
ResNet50  256
ResNet50 Sparse  256
VGG16  256
VGG16 Sparse  256
MobileNet v1  1,024
MobileNet v1 Sparse  1,024
The quantized models were then compressed using scalar Huffman coding and bzip2. Additionally, the sparse models were compressed using CSR-Huffman coding. Since additional parameters such as biases were not quantized, their original size was added to the compressed size for all compression methods.
A-B Lloyd's hyperparameter selection
To determine the optimal number of clusters and the optimal value of λ, the number of clusters was initially fixed to 256. The ranges for λ were determined iteratively under the assumption that the accuracy decreases roughly monotonically with increasing λ. At first, 20 experiments with λ between 0.0 and 1.0 were run and evaluated in order to establish a rough range of λ in which the accuracies stay within a given range of percentage points of the original accuracy. In one case (LeNet-300-100 Sparse) all experiments yielded accuracies within that range; there, another 20 experiments were conducted with λ between 1.0 and 2.0. In the case of MobileNet v1 and its sparse counterpart, even λ = 0.0 produced accuracies below the percentage-point threshold. In both cases the number of clusters was doubled and 20 experiments with λ in the range 0.0 to 1.0 were conducted. This process was repeated until the accuracies were within the threshold.^7 Then, for each network, two adjacent values of λ were selected for which the accuracy lies within the range and which produced the smallest entropies. Again, 20 experiments with λ values between these two were conducted. This process was repeated until there were no further gains in the entropies; typically, two rounds sufficed.
^7 Please note that, although the number of clusters was drastically increased for MobileNet v1, we were not able to achieve accuracies within the target threshold.
Model                    λ        Clusters
LeNet-300-100            0.0144   256
LeNet-300-100 Sparse     1.1053   256
LeNet-5                  0.1222   256
LeNet-5 Sparse           0.4      256
Small-VGG16              0.2105   256
Small-VGG16 Sparse       0.2368   256
ResNet-50                0.05     256
ResNet-50 Sparse         0.0105   256
VGG16                    0.0063   256
VGG16 Sparse             0.0474   256
MobileNet v1             0.9474   10,240
MobileNet v1 Sparse      0.9474   3,072
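The iterative range refinement described above can be sketched as follows. The `evaluate` callback is a hypothetical stand-in for quantizing the network with a given trade-off parameter λ and measuring accuracy and entropy; the sample and round counts follow the text:

```python
import numpy as np

def refine_lambda(evaluate, lam_lo, lam_hi, original_acc, tol=1.0,
                  num_rounds=2, num_samples=20):
    """Iteratively narrow the search range for the trade-off parameter.
    `evaluate(lam)` is assumed to return (accuracy, entropy) of the
    network quantized with parameter lam."""
    for _ in range(num_rounds):
        lams = np.linspace(lam_lo, lam_hi, num_samples)
        results = [(lam, *evaluate(lam)) for lam in lams]
        # Keep only settings whose accuracy stays within `tol` points.
        ok = [(lam, ent) for lam, acc, ent in results
              if original_acc - acc <= tol]
        if not ok:
            break
        # Take the admissible lambda with the smallest entropy and
        # search again between its immediate neighbours.
        best = min(ok, key=lambda t: t[1])[0]
        step = (lam_hi - lam_lo) / (num_samples - 1)
        lam_lo, lam_hi = max(lam_lo, best - step), min(lam_hi, best + step)
    return (lam_lo + lam_hi) / 2
```

With a simulated accuracy curve `98 - 5λ` and entropy curve `2 - λ`, the admissible region is λ ≤ 0.2 and the search converges toward that boundary, since larger λ gives smaller entropy.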
A-C DeepCABAC's hyperparameter selection
In all experiments we set the AbsGr(n)Flag to 10.
A-D DC-v1
For the experiment in Section V-A, we searched through the following set of hyperparameters:
A-E DC-v2
For the experiment in Section V-A, we searched through the following set of hyperparameters:
Appendix B Approximations to the Fisher Information Matrix
[26] proposed a sparsification algorithm for neural networks that is based on the minimization of a variational objective. Concretely, they assume the improper log-scale uniform prior $p(w_i) \propto 1/|w_i|$ and a fully factorized Gaussian posterior $q_{\mu,\sigma}(w) = \prod_i \mathcal{N}(w_i \,|\, \mu_i, \sigma_i^2)$ over the weight parameters $w$, and minimize the corresponding variational upper bound

$\mathcal{L}(\mu, \sigma) = -\mathbb{E}_{q_{\mu,\sigma}(w)}[\log p(y \,|\, x, w)] + \lambda \, D_{\mathrm{KL}}(q_{\mu,\sigma}(w) \,\|\, p(w))$  (13)

with $y$ being the output samples of the neural network, $x$ the data samples, $\mu$ and $\sigma$ the means and standard deviations of all the network's parameters, and $\lambda$ the Lagrange multiplier. As the KL divergence cannot be calculated analytically, they proposed to approximate it by

$D_{\mathrm{KL}}(q(w_i) \,\|\, p(w_i)) \approx -k_1 \sigma(k_2 + k_3 \log \alpha_i) + \tfrac{1}{2}\log(1 + \alpha_i^{-1}) + k_1$  (14)

with $\sigma(\cdot)$ being the sigmoid function, $\alpha_i = \sigma_i^2/\mu_i^2$ the inverse of the (squared) signal-to-noise ratio of the parameters, and $k_1 = 0.63576$, $k_2 = 1.87320$, $k_3 = 1.48695$. Then, they minimize (13) by applying the scalable sampling techniques proposed by [58].
B-A Connection between pruning and quantization
As a result of minimizing (13) we obtain a mean and a standard deviation for each parameter of the network. In our work, we interpreted the former as the parameter's (new) value and the latter as a measure of its “robustness” against perturbations. Indeed, the authors suggested to prune away (set to 0) parameters with a signal-to-noise ratio under a given threshold. Concretely, they suggested the following pruning scheme

$w_i = \begin{cases} 0 & \text{if } |\mu_i|/\sigma_i < t \\ \mu_i & \text{else} \end{cases}$

where $\mu_i$ now represents the mean value and thus $w_i = \mu_i$ for the surviving parameters, and $t$ is a given threshold.
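A minimal sketch of this signal-to-noise-ratio pruning rule; the threshold value and the example means and standard deviations below are assumptions for illustration:

```python
import numpy as np

def snr_prune(mu, sigma, threshold=1.0):
    """Prune parameters whose signal-to-noise ratio |mu|/sigma falls
    below `threshold`; surviving parameters keep their mean value."""
    snr = np.abs(mu) / sigma
    return np.where(snr < threshold, 0.0, mu)

mu = np.array([0.5, -0.02, 1.3, 0.001])
sigma = np.array([0.1, 0.2, 0.4, 0.05])
pruned = snr_prune(mu, sigma)  # low-SNR weights are set to zero
```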
We can see that the scalar rate-distortion objective (10) is a generalization of the above sparsification scheme. Namely, if we assume that the set of quantization points contains the same elements as the input set (thus, an identity map), and consider a decoder that assumes a spike-and-slab distribution over the quantization points, then the above objective can be solved by applying the Lloyd algorithm. After convergence, it results in a thresholding solution that sets a parameter to 0 whenever its weighted distortion $I_i \mu_i^2$ falls below a constant determined by $\lambda$, $p_0$, and $b$, with $p_0$ being the empirical probability of the 0 value and $b$ the bit-precision for representing the non-zero values. Hence, if we now choose the importance weights accordingly (e.g., $I_i = 1/\sigma_i^2$, so that the threshold condition becomes a condition on the signal-to-noise ratio) and an adequate $\lambda$, we get the suggested criterion as a special case solution. This insight motivated our choice of the FIM diagonals in our experimental section.
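A hedged sketch of such a rate-penalized assignment step: each weight is mapped to the quantization point minimizing squared distortion plus λ times the point's ideal code length $-\log_2 p_k$. The centers and probabilities below are illustrative (a spike-and-slab-like distribution with high mass at 0), not values from the paper:

```python
import numpy as np

def rd_assign(weights, centers, probs, lam):
    """Rate-distortion assignment: map each weight to the quantization
    point minimizing squared distortion plus lam times the point's
    ideal code length -log2(p)."""
    dist = (weights[:, None] - centers[None, :]) ** 2
    rate = -np.log2(probs)[None, :]
    return centers[np.argmin(dist + lam * rate, axis=1)]

# Spike at 0 with high probability; nonzero points are more expensive.
centers = np.array([0.0, -0.5, 0.5])
probs = np.array([0.8, 0.1, 0.1])
w = np.array([0.3, -0.45, 0.05])

q_low = rd_assign(w, centers, probs, lam=0.0)    # pure nearest neighbour
q_high = rd_assign(w, centers, probs, lam=0.05)  # rate term pulls toward 0
```

With λ = 0 the assignment is plain nearest-neighbour quantization; with λ > 0 the cheap code length of the 0-spike snaps borderline weights to zero, which is exactly the pruning behaviour described above.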
B-B Connection between variances, Hessian, and FIM diagonals
Firstly, as thoroughly discussed in [59] and mentioned in [27], it is important to recall that the FIM is a positive semi-definite approximation of the Hessian of the loss function.
Hence, similar to [27], we can derive a more rigorous connection between the estimated variances from minimizing (13), the FIM diagonals, and the Hessian. Namely, assuming that the variational loss function can be approximated by its second-order expansion around the weight configuration $\mu$, we get the expression

$\mathbb{E}_{q_{\mu,\sigma}(w)}[L(w)] \approx L(\mu) + \tfrac{1}{2}\mathrm{Tr}(H \, \mathrm{diag}(\sigma^2))$

with $L(\mu)$ being the loss value at $\mu$, $H$ the Hessian, and $\mathrm{Tr}(\cdot)$ the trace. Hence, if we substitute this expression into (13) and take the derivative with respect to $\sigma_i^2$, we attain

$H_{ii} = -2\lambda \, \frac{\partial D_{\mathrm{KL}}}{\partial \sigma_i^2}$

with $-\partial D_{\mathrm{KL}}/\partial \sigma_i^2$ being (approximately) a monotonically increasing function of the signal-to-noise ratio of the parameter. Hence, there is a direct connection between the variances, the signal-to-noise ratio, and the Hessian of the loss function and, consequently, the FIM diagonals.
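The second-order relation above can be checked numerically for a quadratic loss, where the expansion is exact: the Monte-Carlo estimate of the expected loss under the factorized Gaussian posterior matches $L(\mu) + \frac{1}{2}\mathrm{Tr}(H\,\mathrm{diag}(\sigma^2))$. This is a sanity-check sketch, not part of the paper's pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Quadratic "loss" L(w) = 0.5 * w^T H w with a PSD Hessian H,
# so the second-order expansion around mu is exact.
A = rng.standard_normal((3, 3))
H = A @ A.T
mu = rng.standard_normal(3)
sigma = np.array([0.1, 0.2, 0.3])

def loss(w):
    return 0.5 * w @ H @ w

# Monte-Carlo estimate of E_q[L(w)] under w ~ N(mu, diag(sigma^2)).
samples = mu + sigma * rng.standard_normal((200_000, 3))
mc = 0.5 * np.einsum('ni,ij,nj->n', samples, H, samples).mean()

# Second-order formula: L(mu) + 0.5 * Tr(H diag(sigma^2)).
approx = loss(mu) + 0.5 * np.sum(np.diag(H) * sigma ** 2)
```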
B-C Hessian-weighted vs. variance-weighted quantization
[45] suggested a Hessian-weighted Lloyd algorithm for quantizing the neural network's parameters, where the diagonals of the empirical Hessian are taken as importance weights in the algorithm. As we have already discussed above, these coefficients are closely connected to the FIM diagonals and are thus also theoretically well motivated. However, we found the algorithm to be less stable in practice when using the Hessian diagonals instead of the variances. Figure 8 shows the rate-accuracy curves obtained when quantizing the LeNet-5 model with both alternatives. As we can see, the curves of the variance-weighted variant are more stable and even achieve better compression results than the Hessian-weighted variant.
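For reference, a minimal 1-D importance-weighted Lloyd iteration: the Hessian-weighted and variance-weighted variants discussed above differ only in the choice of the importance vector (e.g., the Hessian diagonals vs. $1/\sigma_i^2$). This sketch is a bare-bones illustration, not the paper's full algorithm:

```python
import numpy as np

def weighted_lloyd(weights, importance, num_clusters, num_iters=20):
    """Importance-weighted Lloyd (1-D k-means): cluster centers are
    updated as importance-weighted means of their assigned weights."""
    centers = np.linspace(weights.min(), weights.max(), num_clusters)
    for _ in range(num_iters):
        # Assignment step: nearest center for each weight.
        assign = np.abs(weights[:, None] - centers[None, :]).argmin(axis=1)
        # Update step: importance-weighted mean per cluster.
        for k in range(num_clusters):
            mask = assign == k
            if importance[mask].sum() > 0:
                centers[k] = np.average(weights[mask],
                                        weights=importance[mask])
    return centers, centers[assign]
```

With uniform importance this reduces to ordinary 1-D k-means; a non-uniform importance vector pulls the centers toward the weights deemed most sensitive.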