1 Introduction
Artificial Neural Networks (ANNs) achieve state-of-the-art performance in several tasks via complex architectures with millions of parameters. Deploying such architectures on resource-constrained devices such as mobiles or autonomous vehicles entails tackling a number of practical issues. Such issues include tight bandwidth and storage caps for delivering and storing the trained network, as well as limited memory for its deployment.
Let us assume a neural network has to be deployed to a device such as a smartphone or an autonomous car over a wireless link. Downloading the network may exhaust the subscriber’s traffic plan, and the downloaded network will take up storage on the device that will be unavailable to other applications. In an autonomous driving context, safety-critical updates may be delayed due to the limited bandwidth available over the wireless channel Samarakoon et al. (2019). Such examples show the importance of efficiently compressing neural networks for transmission and storage purposes.
Multiple approaches have been proposed to compress neural networks.
A first approach consists in designing the network topology from the ground up to encompass fewer parameters Iandola et al. (2016); Sandler et al. (2018).
Needless to say, this approach requires designing novel topologies from scratch.
A second approach consists in pruning some parameters from the network, i.e. removing some connections between neurons Molchanov et al. (2017); Tartaglione et al. (2018); Louizos et al. (2018), yielding a sparse topology.
Pruning might reduce the memory footprint Courbariaux et al. (2015); Zhou et al. (2016); Mishra et al. (2018); however, it does not necessarily minimize storage or bandwidth requirements.
A third approach consists in quantizing the network parameters Kim et al. (2016); Xu et al. (2018); Wiedemann et al. (2020), possibly followed by entropy-coding the quantized parameters.
Such approaches achieve promising results; however, most quantization schemes merely aim at learning a compressible representation of the parameters Wiedemann et al. (2020); Oktay et al. (2019); Zhou et al. (2016) rather than explicitly minimizing the entropy of the compressed parameters.
Indeed, the entropy of the quantized parameters is not differentiable and cannot be easily minimized in standard gradient descent-based frameworks.
In this work, we tackle the problem of compressing a neural network by minimizing the entropy of the compressed parameters at learning time. By enhancing the model’s compressibility, we reduce both the bandwidth required to stream its bits and the storage required to keep it. Deep models are redundant Cheng et al. (2015); Lee et al. (2021): hence, there is an overhead in the deep model’s representation, which can hereby be compressed.
This work introduces HEMP, a method that relies on high-order entropy minimization to efficiently compress the parameters of a neural network.
The proposed method is illustrated in Fig. 1. The main contribution of HEMP is the differentiable formulation of the quantized parameters’ entropy, which can be extended beyond the first order with finite computational and memory complexity. Namely, HEMP relies on a twin parametrization of the neural network: continuous parameters and the corresponding quantized parameters, where the entropy of the latter is estimated from the entropy of the former. We design a regularization term around our entropy formulation that can be plugged into gradient descent frameworks to train a network to minimize the entropy of the quantized parameters. No assumptions are made on the quantization scheme (including non-uniform quantization) nor on the entropy coding scheme: both are external to the proposed method, which is totally agnostic towards them.
Other techniques, like norm- or rank-based ones, aim at removing parameters: this has a different effect on the distribution of the parameters. Indeed, while these approaches maximize the frequency of the pruned parameters, which are still encoded as zeros, HEMP is more general, as it is able to enhance the compressibility of any quantized representation of the values.
We experiment with different quantization and entropy coding schemes, showing that training a network to minimize the second-order entropy of the quantized parameters is already sufficient to outperform state-of-the-art competing schemes.
The rest of the paper is organized as follows. Sec. 2 reviews state-of-the-art approaches in network compression, Sec. 3 introduces the proposed high-order entropy regularizer, and the overall training procedure is described in Sec. 4. Experimental results are discussed in Sec. 5 and, finally, conclusions are drawn in Sec. 6.
2 Related works
A large body of work addresses neural network size reduction. In general, we can group these approaches into three broad categories, according to their primary goal.
Minimizing the architecture
From the architectural point of view, it is possible to design memory-efficient deep networks, typically relying on strategies like channel shuffling, point-wise convolutional filters, weight sharing, or a combination of them. Some examples of networks customized towards memory footprint reduction are SqueezeNet Iandola et al. (2016), ShuffleNet Zhang et al. (2018) and MobileNetv2 Sandler et al. (2018). Recently, automatically reducing the size of deep networks has gained considerable interest, with works on neural network sparsification Molchanov et al. (2017); Tartaglione et al. (2018); Louizos et al. (2018); Ullrich et al. (2017) boosted by the lottery ticket hypothesis by Frankle and Carbin Frankle and Carbin (2018). These approaches address the problem of improving inference efficiency with limited memory footprint, but do not directly tackle the problem of reducing the stored model size.
Minimizing the computation
This topic has recently been collecting ever-increasing interest. With roots in statistical physics, some works exploit low-precision training of artificial neural networks Courbariaux et al. (2015); Rastegari et al. (2016); Baldassi et al. (2018). A large number of works also attempt to use low-precision back-propagation signals and low-precision activations, as this leads to lower power consumption at inference time Lin et al. (2017); Mishra et al. (2018); Zhou et al. (2016). These techniques, however, do not explicitly address the problem of minimizing the storage size of the entire model.
Minimizing the stored model’s memory
Here the main goal is not to modify the architecture of a deep model, but merely to compress it in order to reduce its stored size: while the two previous families of approaches change the architecture of the deep model to simplify it and/or reduce its memory footprint, here the objective is to compress a stored model with no architectural change. Towards this end, many approaches have been proposed: context-adaptive binary arithmetic coding Wiedemann et al. (2020), learning the quantized parameters using the local reparametrization trick Shayer et al. (2018), clustering similar parameters across different layers Xu et al. (2018), using matrix factorization followed by Tucker decomposition Kim et al. (2016), training adversarial neural networks towards compression Belagiannis et al. (2018), or employing a Huffman coding scheme Han et al. (2016), to name just a few.
Recently, Oktay et al. proposed an entropy-penalized reparametrization of the parameters of a deep model, which leads to competitive compression at the cost of a small loss in the deep model’s performance Oktay et al. (2019). However, their approach carries some training overhead: it requires training a decoder, and the formulation is made differentiable through straight-through estimators (STE). The main advantage of their approach lies in the reparametrization leading to the quantization strategy, but the compressibility of their quantized parameters is limited to arithmetic coding.
Deep learning-based compression schemes proposing a direct high-order entropy regularizer are difficult to design because of the non-differentiability of the entropy and its computational cost. The methods discussed above do not explicitly minimize the final compressed file size: they are either limited to rigid quantization and compression schemes Wiedemann et al. (2020) or build purpose-made dictionaries Han et al. (2016), losing generality.
In the next section, we introduce our efficient and differentiable n-th order entropy proxy, to be used in the HEMP framework: it can be freely associated with any quantization strategy and any entropic compression algorithm. Differently from the work by Wiedemann et al. Wiedemann et al. (2020), HEMP is not bound to a particular quantization scheme, and provides a direct, scalable and differentiable entropy estimator on the continuous parameters.
3 Entropy-based regularization
In this section, we describe our entropy-based framework for quantization. We introduce a regularization formulation that uses a differentiable entropy proxy, evaluated on the continuous parameters of the model, to indirectly reduce the compressed size of the quantized network. We will show that this term easily scales up to any entropy order, thus improving the compression efficiency of practical algorithms such as dictionary-based compression.
3.1 Preliminaries
Table 1 (recurring symbols): the i-th (continuous) parameter in the l-th layer; the i-th (quantized) parameter in the l-th layer; the quantization index corresponding to the i-th parameter in the l-th layer; the number of quantization levels; a generic quantization index in the admissible range; the probability that the quantized representation of a parameter takes a given quantization index; the n-th order entropy on the quantization indices; the differentiable proxy of this entropy proposed within HEMP.
Here, we introduce preliminaries and notation. Let a feed-forward, multi-layer artificial neural network be composed of layers, and let us consider the i-th parameter of the l-th layer. Let us assume all ANN parameters are quantized onto discrete levels, with:

a quantization index for every parameter;

reconstruction (or representation) levels; as shown in the following, every layer of the ANN model gets its own optimized set of reconstruction levels.
From these, we get the quantized parameters according to
(1) 
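To make this twin parametrization concrete, the sketch below builds a per-layer codebook with a Lloyd-Max-style (1-D k-means) procedure and derives the quantization indices and quantized parameters from the continuous ones. It is only an illustrative sketch: the function names (fit_codebook, quantize_layer) and the quantile-based initialization are ours, not part of HEMP.

```python
import torch

def fit_codebook(w: torch.Tensor, num_levels: int, iters: int = 20) -> torch.Tensor:
    # Lloyd-Max-style estimate of the reconstruction levels for one layer:
    # alternate nearest-level assignment and centroid update (1-D k-means).
    levels = torch.quantile(w.flatten(), torch.linspace(0.0, 1.0, num_levels))
    for _ in range(iters):
        idx = torch.argmin((w.flatten()[:, None] - levels[None, :]).abs(), dim=1)
        for j in range(num_levels):
            members = w.flatten()[idx == j]
            if members.numel() > 0:
                levels[j] = members.mean()
    return levels

def quantize_layer(w: torch.Tensor, levels: torch.Tensor):
    # Quantization index of every parameter and its quantized reconstruction, cf. Eq. (1).
    idx = torch.argmin((w.flatten()[:, None] - levels[None, :]).abs(), dim=1)
    return idx.view(w.shape), levels[idx].view(w.shape)
```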
Table 1 collects the most recurring symbols of this section. Please notice that multidimensional versions of the symbols are in bold.
Now, let us consider the n-uples of the quantized parameters. In general, the n-th order entropy on the quantized model is
(2) 
where (the L0-norm) is the total number of parameters,
and, using the chain rule, we can express
as
(3) 
In (3), the “probability” of the event is
(4) 
where the probability is computed by counting occurrences through an indicator function. Minimizing (2) results in maximizing the final compression of the quantized model when using an entropic compression algorithm (Witten et al. (1987); Seroussi and Lempel (1993)). Unfortunately, the problem with minimizing (2) within a gradient descent-based optimization framework lies in the non-differentiability of (4). In the next section we introduce a differentiable proxy for (4) which directly optimizes the continuous parameters such that their quantization is highly compressible.
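For illustration, a rough sketch of how the n-th order entropy of Eq. (2) could be measured empirically on the quantization indices is given below; the counting (indicator-function) step is exactly what makes it non-differentiable with respect to the parameters. The grouping into non-overlapping consecutive n-uples and the function name are our assumptions.

```python
from collections import Counter
import math

def empirical_entropy(indices, n=2):
    # n-th order entropy (in bits per n-uple) of a sequence of quantization indices.
    # Occurrences are counted with hard (indicator-like) assignments, hence this
    # quantity cannot be back-propagated through.
    tuples = [tuple(indices[i:i + n]) for i in range(0, len(indices) - n + 1, n)]
    counts = Counter(tuples)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# e.g. empirical_entropy([0, 1, 0, 1, 2, 2, 0, 1], n=2)
```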
3.2 Differentiable n-th order entropy regularization
In the previous section we stated the impossibility of directly optimizing (2) using gradient descent-based techniques because of the non-differentiability of (3). We overcome this obstacle by providing a formulation of (3) based on the distance between each continuous parameter and its quantized reconstruction. From here on, we will drop the layer subscript, but in general every layer has its own reconstruction levels.
Let us first define the distance between a parameter and a reconstruction level:
(5) 
From (5), we can estimate the probability of binning a parameter to a given level using the softmax function:
(6) 
Such a general formulation is computationally expensive, so we propose an efficient approximation that exploits a “bin locality” principle: a parameter can be binned to its two closest bins only. Indeed, under the assumption of a quasi-static process, the probability of binning the continuous parameter into bins other than the two closest ones between two iteration steps can locally be neglected. In this case, we can design a probability that scales linearly with the relative distance to these two bins:
(7) 
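A minimal sketch of this construction is shown below, assuming the two-closest-bin probability of Eq. (7) and a first-order proxy obtained by accumulating soft bin occupancies; the helper names and the small numerical-stability constants are ours.

```python
import torch

def soft_binning_probs(w: torch.Tensor, levels: torch.Tensor) -> torch.Tensor:
    # Differentiable binning probabilities under the "bin locality" principle:
    # each parameter gets non-zero probability only for its two closest levels,
    # scaling linearly with the relative distance (cf. Eq. (7)).
    d = (w.flatten()[:, None] - levels[None, :]).abs()
    two_d, two_idx = torch.topk(d, k=2, dim=1, largest=False)   # two closest bins
    p_two = 1.0 - two_d / two_d.sum(dim=1, keepdim=True).clamp_min(1e-12)
    probs = torch.zeros(w.numel(), levels.numel(), device=w.device)
    return probs.scatter(1, two_idx, p_two)                     # each row sums to 1

def entropy_proxy_h1(w: torch.Tensor, levels: torch.Tensor) -> torch.Tensor:
    # First-order differentiable entropy proxy: average the soft binning
    # probabilities over the parameters to obtain the bin occupancies, then
    # take their Shannon entropy.
    occupancy = soft_binning_probs(w, levels).mean(dim=0).clamp_min(1e-12)
    return -(occupancy * occupancy.log2()).sum()
```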
3.3 Study of the entropy regularization term
In this section we detail the derivations of the proposed entropy regularization. Obtaining an explicit formulation for the update terms allows us to implement the update rule efficiently (when using gradient-based optimizers, without relying on automatic differentiation packages) and to study both the stationary points of the regularization term and the bounds of its gradient.
3.3.1 Explicit derivation of the entropy regularization term’s gradient
Let us consider here the first-order entropy proxy:
(10) 
with
(11) 
Let us differentiate (10) with respect to :
(12) 
According to (7), we can write
(13) 
and considering that
(14) 
where
(15) 
we have
(16) 
Using a similar approach, we can explicitly write the gradient for the n-th order entropy term in (3.2):
(17) 
where we indicate the set of quantization indices whose binning probability is non-zero.
Having (17) in explicit form enables efficient gradient computation: indeed, given the design choice in (7), every n-uple of parameters has only 2^n possible quantization-index n-uples, which is independent of the number of quantization levels. On the contrary, using (6) would result in a number of possible index n-uples equal to the n-th power of the number of quantization levels. Hence, our proposed approach allows us to save memory at computation time.
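This memory argument can be illustrated with a short sketch (ours): with the bin-locality rule each parameter contributes only its two closest bins, so the candidate index n-uples to enumerate grow as 2^n rather than with the number of quantization levels.

```python
from itertools import product

def candidate_index_tuples(closest_bins_per_param):
    # closest_bins_per_param: for each of the n parameters in the n-uple,
    # the pair of its two closest bin indices. Only 2**n combinations exist.
    return list(product(*closest_bins_per_param))

tuples = candidate_index_tuples([(4, 5), (0, 1), (7, 8)])   # a 3-uple of parameters
print(len(tuples))   # 8 == 2**3, independent of the total number of levels
```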
For the sake of simplicity, the following analysis of stationary points and bounds is performed on the first-order entropy, but similar conclusions can equivalently be drawn for any n-th order.
3.3.2 Stationary points for H1
In this section we look for stationary points (in other words, points where the gradient vanishes). From (16) we observe that
(18) 
assuming finite, positive quantities. We can make condition (18) explicit:
(19) 
where
According to (7), we can rewrite (19) as
(20) 
because this holds by definition. As we expect, if the two bins are evenly populated, the stationary point is
which results in
(21) 
exactly halfway between the centres of the two bins. From the entropy point of view, this is essentially what we expect, since we have two equally populated bins; however, this is not what we want when quantizing a deep network, considering that it leads to a high quantization error. For this reason, favoring solutions where each parameter lies close to a reconstruction level is a good strategy, and this is also the reason we include a reconstruction error term in the overall regularization function.
3.3.3 Bound for H1’s derivative
In this section we look for an upper bound on the gradient magnitude and study the cases in which this quantity explodes, in order to derive conditions that avoid gradient explosion. We can bound the gradient magnitude as:
(22) 
Considering that both are finite, real-valued quantities, we are interested in guaranteeing
(23) 
where the bound is a positive, finite real-valued number. Let us study the cases in which the gradient magnitude could explode.

Case . In this case we have and . According to (16) ; so at least one parameter lies in the considered interval and the condition is impossible by construction.

Case . In this case we have and . Similarly to the previous case, ; so at least one parameter lies in the considered interval and the condition is impossible by construction.

Case . By construction, so this case is impossible.
In the next section we will describe the overall HEMP framework.
4 Training scheme
The overall training scheme is summarized in Fig. 5 and includes a quantizer and an entropy encoder. The quantizer generates the discrete-valued representation of the network parameters at training time. The encoder produces the final compressed file embedding the deep model once the training is over. Our scheme does not make any assumption about the quantization or entropy coding scheme, contrary to other strategies tailored for, e.g., specific quantization schemes (Han et al. (2016); Wiedemann et al. (2020)). Therefore, in the following we will assume a very general, non-uniform Lloyd-Max quantizer, while we will not make any assumption about the entropy encoder for the moment, as it is external to the training process. Our learning problem can be formulated as follows: given a dataset and a network architecture, we want to compress the network parameters while preserving the network performance as measured by some loss function. Towards this end, we introduce the following regularization function:
(24) 
where the two coefficients are positive hyperparameters, and
(25) 
is a reconstruction error estimator. Minimizing it drives the continuous parameters towards their quantized reconstructions, so that the loss evaluated on the continuous-parameter network approaches the loss estimated on the quantized network. Overall, we minimize the objective function:
(26) 
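A sketch of how the regularization of Eq. (24) and the overall objective of Eq. (26) can be assembled is given below; it reuses entropy_proxy_h1 and quantize_layer from the sketches above, instantiates the reconstruction term of Eq. (25) as a mean squared error (our assumption), and uses placeholder names lam_h, lam_r for the two positive hyperparameters.

```python
import torch

def reconstruction_error(w: torch.Tensor, w_hat: torch.Tensor) -> torch.Tensor:
    # One possible instance of Eq. (25): mean squared distance between the
    # continuous parameters and their quantized reconstruction.
    return (w - w_hat).pow(2).mean()

def hemp_objective(task_loss: torch.Tensor, w: torch.Tensor, levels: torch.Tensor,
                   lam_h: float = 1.0, lam_r: float = 1.0) -> torch.Tensor:
    # Eq. (26): task loss plus the regularization of Eq. (24),
    # i.e. entropy proxy + reconstruction error, each with its own weight.
    _, w_hat = quantize_layer(w, levels)          # from the sketch above
    reg = lam_h * entropy_proxy_h1(w, levels) + lam_r * reconstruction_error(w, w_hat.detach())
    return task_loss + reg
```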
Minimizing (26) requires finding the right balance between the loss and the regularization terms: towards this end, we propose to dynamically reweight the regularization according to the insensitivity of each parameter (Tartaglione et al. (2018)). The key idea is to reweight the regularization gradient at every parameter update depending on the sensitivity of the loss with respect to each parameter: the larger the magnitude of the gradient of the loss with respect to a parameter, the smaller the perturbation we want the regularization to induce on it. Hence, in the update of each parameter, we reweight the gradient of the regularization by the insensitivity:
(27) 
The HEMP framework allows solving the learning problem using standard optimization strategies, by descending the gradient of (26).
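As a rough sketch of how Eq. (27) can be applied in practice (not the authors' implementation), the step below rescales only the regularization gradient of each parameter by a simplified insensitivity, here taken as one minus the normalized magnitude of the task-loss gradient; the exact insensitivity definition follows Tartaglione et al. (2018).

```python
import torch

def hemp_step(model: torch.nn.Module, task_loss: torch.Tensor,
              reg: torch.Tensor, optimizer: torch.optim.Optimizer) -> None:
    # Reweight the regularization gradient parameter-wise: parameters to which
    # the task loss is very sensitive receive a smaller regularization push.
    params = [p for p in model.parameters() if p.requires_grad]
    g_loss = torch.autograd.grad(task_loss, params, retain_graph=True)
    g_reg = torch.autograd.grad(reg, params, allow_unused=True)
    optimizer.zero_grad()
    for p, gl, gr in zip(params, g_loss, g_reg):
        insensitivity = 1.0 - gl.abs() / gl.abs().max().clamp_min(1e-12)
        p.grad = gl if gr is None else gl + insensitivity * gr
    optimizer.step()
```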
5 Experiments
In this section we evaluate the effectiveness of HEMP. Towards this end, we propose experiments on several widely used datasets with different architectures.
Datasets and architectures
We experiment with LeNet5 on MNIST, ResNet32 and MobileNetv2 on CIFAR10, and ResNet18 and ResNet50 on ImageNet. We always train from scratch, except for the ImageNet experiments, where we rely on pretrained models (https://pytorch.org/docs/stable/torchvision/models.html).
Setup
We experiment on an Nvidia RTX 2080 Ti GPU. Our algorithm is implemented using PyTorch 1.5.
The source code will be made available on GitHub upon acceptance of the work. For all our simulations we use SGD optimization with momentum. Learning rate and batch size depend on the dataset and the architecture; for ResNet50 trained on ImageNet the batch size is 32. The file containing the quantized parameters is entropy-coded using LZMA Pavlov (2007), a popular dictionary-based compression algorithm well-suited to exploit high-order entropy.
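For concreteness, entropy coding the quantization indices with LZMA can be done with Python's standard lzma module, roughly as sketched below (our sketch, assuming at most 256 quantization levels so that one byte per index suffices).

```python
import lzma
import numpy as np

def compress_indices(indices: np.ndarray, path: str) -> int:
    # Entropy-code the quantization indices with LZMA, a dictionary-based coder
    # able to exploit repeated index sequences (i.e. high-order statistics).
    # Returns the compressed size in bytes.
    payload = indices.astype(np.uint8).tobytes()
    blob = lzma.compress(payload, preset=9)
    with open(path, "wb") as f:
        f.write(blob)
    return len(blob)
```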
Metrics
The goal of the present work is to compress a neural network without jeopardizing its accuracy, so we rely on two distinct, widely used performance metrics:

the compressed model size, i.e. the size of the file containing the entropy-encoded network,

the classification accuracy of the compressed network (indicated as Top1 in the following).
5.1 Preliminary experiments
As preliminary experiments, we evaluate whether the regularizer function (3.2) is a good estimator of (2). Towards this end, we train the LeNet5 architecture on MNIST minimizing the proxy while logging the entropy of the quantized parameters.
Fig. (a) shows the normalized entropy and its approximation: three findings are noteworthy.
First, the proxy accurately estimates the entropy, i.e. minimizing the former leads to minimizing the latter. Under the assumption that the quantized parameters are entropy-coded, minimizing the proxy shall minimize the size of the file where the encoded parameters are stored.
Second, for first-order minimization the training converges to a higher entropy, while minimizing higher entropy orders enables access to lower-entropy embeddings. Higher entropy reflects on the final size of the model: while for first-order minimization we could get a final network size of 61kB, for second-order minimization the final size drops to approximately 27.5kB, with a top-1 accuracy of 99.27%. This better performance can be explained by the fact that higher-order entropy can capture repeated sequences of parameter binnings, which leads to a significant compression boost.
Third, the higher the entropy order, the fewer the epochs required to converge to low entropy values. However, in terms of actual training time, the available GPU memory limits the degree of parallelism for computing the derivative term in (17). In the following, we will stick to the second order, as it enables both reasonably low-entropy embeddings and reasonable training times.
As a further verification, we ran the same experiments on the ResNet32 architecture trained on the CIFAR10 dataset: also here, we minimize the proxy while logging the entropy of the quantized parameters at different orders. Fig. (b) shows the normalized entropy. Similarly to what we observed above, second-order entropy minimization turns out to be a good trade-off between complexity and final performance, considering that the entropic rate it reaches is comparable to that of higher orders. Please also notice that the entropy estimated on the quantized model, reported in Fig. (b), is proportional to the final file sizes.
As a further analysis of HEMP’s effect on the parameters, in Fig. 7 we show the distribution of the optimized parameters of the second convolutional layer in LeNet5 trained on MNIST (the other layers follow a similar distribution). In this case we optimize the model with 3 quantization levels. As we observe, the continuous values are distributed tightly around their quantized representations; as a consequence, the accuracy of the quantized representation of the model approaches the accuracy of the continuous model. Additionally, as observed in Fig. (a), the entropy of the quantized model is also minimized, achieving both a trained quantized model with high accuracy and a highly compressible representation.
5.2 Comparison with the state-of-the-art
Model | Method | Top1 [%] | Size
LeNet5 | Baseline | 99.30 | 1.7MB
 | LOBSTER Tartaglione et al. (2020) | 99.10 | 19kB
 | Han et al. Han et al. (2016) | 99.26 | 44kB
 | Wiedemann et al. Wiedemann et al. (2020) | 99.12 | 43.4kB
 | HEMP | 99.27 | 27.5kB
 | Wiedemann et al. (+pruning) Wiedemann et al. (2020) | 99.02 | 11.9kB
 | HEMP+LOBSTER Tartaglione et al. (2020) | 99.05 | 2.00kB
Model | Method | Top1 [%] | Size
ResNet32 | Baseline | 93.10 | 1.9MB
 | LOBSTER Tartaglione et al. (2020) | 92.97 | 439.4kB
 | HEMP | 91.57 | 168.3kB
 | HEMP+LOBSTER Tartaglione et al. (2020) | 92.55 | 86.2kB
MobileNetv2 | Baseline | 93.67 | 9.4MB
 | HEMP | 92.80 | 872kB
Model | Method | Top1 [%] | Size
ResNet18 | Baseline | 69.76 | 46.8MB
 | LOBSTER Tartaglione et al. (2020) | 70.12 | 17.2MB
 | Lin et al. Lin et al. (2017) | 68.30 | 5.6MB
 | Shayer et al. Shayer et al. (2018) | 63.50 | 2.9MB
 | HEMP | 68.80 | 3.6MB
 | HEMP+LOBSTER Tartaglione et al. (2020) | 69.70 | 2.5MB
ResNet50 | Baseline | 76.13 | 102.5MB
 | Wang et al. Wang et al. (2019) | 70.63 | 6.3MB
 | Han et al. Han et al. (2016) | 68.95 | 6.3MB
 | Wiedemann et al. Wiedemann et al. (2020) | 74.51 | 10.4MB
 | Tung et al. Tung and Mori (2020) | 73.7 | 6.7MB
 | HEMP (high acc.) | 74.52 | 9.1MB
 | HEMP | 71.33 | 5.5MB
MobileNetv2 | Baseline | 72.1 | 13.5MB
 | Tu et al. Tu et al. (2020) | 7.25 | 10.1MB
 | He et al. He et al. (2019) | 9.8 | 4.95MB
 | Tung et al. Tung and Mori (2020) | 70.3 | 2.2MB
 | HEMP | 71.3 | 1.7MB
Custom (latency 6.11ms, energy 9.14mJ) | APQ Wang et al. (2020) | 72.8 | 20.8MB
 | APQ Wang et al. (2020) + HEMP | 72.5 | 3.04MB
HEMP
We now compare our method with state-of-the-art methods for network compression. Our main goal is to minimize the size of the final compressed file while keeping the top-1 performance as close as possible to the baseline network’s. Therefore, our approach can be compared only with works that report the real final file size. To the best of our knowledge, only the methods reported in Tables 2, 3 and 4 can be included in this compression benchmark. Indeed, most of the pruning-based methods (Molchanov et al. (2017); Tartaglione et al. (2018)) typically report pruning rates only, which cannot be directly mapped to file size: encoding sparse structures requires additional memory to store the coordinates of the unpruned parameters. We implemented one state-of-the-art pruning method (LOBSTER Tartaglione et al. (2020)) to report the storage achieved by a pruning baseline. Concerning quantization methods, existing approaches either focus on quantization to reduce inference computation (Courbariaux et al. (2015); Rastegari et al. (2016); Lin et al. (2017); Mishra et al. (2018); Zhou et al. (2016)) or do not report the final file size (Ullrich et al. (2017); Kim et al. (2016); Belagiannis et al. (2018); Xu et al. (2018)).
We also tried to directly compress the baseline file and we did not observe any compression gain. Therefore, to make reading easier, we did not report these numbers.
As a first experiment, we train LeNet5 on MNIST (Table 2): despite the simplicity of the task, the reference LeNet5 is notoriously over-parametrized. Indeed, as expected, most of the state-of-the-art techniques are able to compress the model to approximately 40kB. In this context, HEMP performs best, lowering the size of the compressed model to 27.5kB.
Then, we experiment with ResNet32 and MobileNetv2 on CIFAR10 (Table 3), achieving also in this case significant compression: the ResNet32 size drops from 1.9MB to 168kB and MobileNetv2 from 9.4MB to 822kB. Note that other methods in the literature do not report experiments on CIFAR10 with these architectures. Nevertheless, HEMP reduces the network size by a factor of approximately 11 for both architectures.
We also compress pretrained ResNet18, ResNet50 and MobileNetv2 models trained on ImageNet (Table 4). Also in this case, HEMP reaches a competitive final file size, compressing ResNet18 from 46.8MB to 3.6MB with minimal performance loss and ResNet50 from 102.5MB to 5.5MB. For the ResNet50 experiment, we also report a partial result in the high-accuracy band, indicated as “high acc.”, to compare to Wiedemann et al. Wiedemann et al. (2020): for the same accuracy, HEMP drives the model to a higher compression. In the case of ResNet18, Shayer et al. (2018) achieves a 0.5MB smaller compressed model, which is however set off by a 4.3% worse top-1 accuracy. Also in the case of very efficient architectures like MobileNetv2, HEMP is able to significantly reduce the storage occupation, moving from 13.5MB to 1.7MB only. Furthermore, the accuracy drop is in this case very limited (0.8%) when compared to other techniques, like Tung et al., which avoid a large drop only on less optimized architectures like ResNet50. While concurrent techniques rely on typical pruning+quantization strategies aiming at indirectly eliminating the redundancy in the models, HEMP directly optimizes over the existing redundancy.
Finally, we also tested HEMP with a different quantize-and-prune scheme. In particular, APQ Wang et al. (2020) proves to be a perfect framework for our purpose, since it jointly performs network architecture search, pruning and quantization. We used HEMP in the most challenging scenario proposed by Wang et al., with the lowest latency constraint (6.11ms) and the lowest energy consumption (9.14mJ) at inference time. In this case, we fine-tuned the model provided by APQ for 5 epochs. Even in this case, HEMP is able to reduce the model’s size, from 20.8MB to 3.04MB only, proving in the field its deployability as a companion to other quantization/pruning schemes, without exploiting any prior on the network’s architecture.
Overall, these experiments show that HEMP strikes a competitive tradeoff between compression ratio and performance on several architectures and datasets.
HEMP+LOBSTER
It has been observed that combining pruning and compression techniques further reduces the model’s final file size with little performance loss Wiedemann et al. (2020). In our context, this translates into including two constraints in the learning:

force the quantizer to include a reconstruction level equal to “0” (in simpler words, a quantization level corresponding to “0”);

include a pruning mechanism (parameters permanently set to “0”).
Both constraints work independently of HEMP: indeed, HEMP is not a quantization technique, but is designed to complement any learning strategy whose aim is to quantize the model’s parameters (in this context, pruning “quantizes to zero” as many parameters as possible). Hence, we paired HEMP with LOBSTER Tartaglione et al. (2020), a state-of-the-art differentiable pruning strategy (hence compatible with HEMP’s framework).
The results are also reported in Tables 2, 3 and 4: evidently, including a prior on the optimal distribution of the parameters (removing all those unnecessary for the learning problem) helps HEMP compress more. We tested the HEMP + LOBSTER setup on one architecture per dataset: LeNet5 (MNIST), ResNet32 (CIFAR10) and ResNet18 (ImageNet). While LOBSTER alone is able to achieve highly compressed models on toy datasets (like MNIST), it cannot achieve high compression alone on more complex datasets. Still, pairing a technique like LOBSTER with HEMP boosts the compression by roughly 10x on MNIST and ImageNet and 4x on CIFAR10.
HEMP minimizes the n-th order entropy (the second order in these experiments), or in other words maximizes the occurrence of certain sequences of quantization indices. The mapping of these quantization indices to quantization levels has to be determined outside HEMP: when we run experiments with HEMP alone, the loss minimization (in our case, the cross-entropy) automatically determines these levels together with the general-purpose Lloyd-Max quantizer. However, pruning strategies include a prior on one of the quantization levels (the one corresponding to “0”), and this helps towards a stronger entropy minimization.
5.3 Ablation study
Here, we evaluate the impact of the reconstruction error term (25) and the overall insensitivity reweighting (27) for the regularization function. Towards this end, we perform an ablation study on the ResNet32 architecture trained on CIFAR10.
Reconstruction error regularization
Fig. (a) (left) shows the ResNet32 loss for the continuous and the quantized models when the reconstruction error is included or excluded from the regularization function (24). We observe that both continuous models (solid lines) obtain similar performance on the test set. However, the quantized models (dashed lines) perform very differently. When the reconstruction error is not included in the training procedure (red lines), the quantized model reaches a plateau at a high loss value, showing that the network performs poorly on the test set. Conversely, when the reconstruction error is included (blue lines), the quantized model reaches a final loss closer to that of the continuous model. Indeed, regularizing also on (25) drives the continuous parameters towards their quantized reconstructions, hence the two losses get closer. This experiment verifies the contribution of the reconstruction error regularization term to the good performance of the quantized model.
Insensitivity-based reweighting
Fig. (b) (right) shows the performance of the ResNet32 model when including or excluding the insensitivity reweighting of the regularization function (27). Here, we report the test-set losses obtained by the continuous models (solid lines) and the value of the overall regularization function (dashed lines). We observe a very unstable test loss without insensitivity rescaling (magenta line). Hence, minimization with a constant overall rescaling of the regularization is also shown (in cyan): in this case, the test loss on the continuous model remains low, but the regularization is minimized extremely slowly. Using the insensitivity reweighting (in blue) proves to be a good trade-off between keeping the test-set loss low and minimizing the regularization. This behavior is what we expected: the insensitivity reweighting, acting parameter-wise (i.e. there is a different value for each parameter), dynamically tunes the reweighting of the overall regularization function, allowing faster minimization with minimal or no performance loss. This is also why we could use the same hyperparameter values for all the simulations, despite optimizing different architectures on different datasets. Such robustness of the hyperparameters across datasets is a major practical strength of our approach.
6 Conclusion
We presented HEMP, an entropy coding-based framework for compressing neural network parameters. Our formulation efficiently estimates entropy beyond the first order and can be employed as a regularizer to minimize the quantized parameters’ entropy in gradient-based learning, directly on the continuous parameters. The experiments show that HEMP is not only an accurate proxy for minimizing the entropy of the quantized parameters, but is also pivotal in modeling the quantized parameters’ statistics and improving the efficiency of entropy coding schemes. We also paired HEMP with LOBSTER, a state-of-the-art pruning strategy which introduces a prior on the weights’ distribution and gives a further boost to the final model’s compression. Future work includes the integration into HEMP of a quantization technique designed specifically for deep models.
References
Role of synaptic stochasticity in training low-precision neural networks. Physical Review Letters 120 (26), pp. 268103. Cited by: §2.

Adversarial network compression. In Proceedings of the European Conference on Computer Vision (ECCV). Cited by: §2, §5.2.
An exploration of parameter redundancy in deep networks with circulant projections. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2857–2865. Cited by: §1.
BinaryConnect: training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131. Cited by: §1, §2, §5.2.
 The lottery ticket hypothesis: finding sparse, trainable neural networks. International Conference on Learning Representation (ICLR). Cited by: §2.
Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. International Conference on Learning Representation (ICLR). Cited by: §2, §4, Table 2, Table 4.
Real-time vehicle detection from short-range aerial image with compressed MobileNet. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8339–8345. Cited by: Table 4.
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360. Cited by: §1, §2.

Compression of deep convolutional neural networks for fast and low power mobile applications. International Conference on Learning Representation (ICLR). Cited by: §1, §2, §5.2.
On the redundancy in the rank of neural network parameters and its controllability. Applied Sciences 11 (2), pp. 725. Cited by: §1.
 Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, pp. 345–353. Cited by: §2, §5.2, Table 4.
 Learning sparse neural networks through regularization. International Conference on Learning Representation (ICLR). Cited by: §1, §2.
WRPN: wide reduced-precision networks. International Conference on Learning Representation (ICLR). Cited by: §1, §2, §5.2.

Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2498–2507. Cited by: §1, §2, §5.2.
Scalable model compression by entropy penalized reparameterization. arXiv preprint arXiv:1906.06624. Cited by: §1, §2.
LZMA SDK (software development kit). Cited by: §5.
XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Cited by: §2, §5.2.
Distributed federated learning for ultra-reliable low-latency vehicular communications. IEEE Transactions on Communications. Cited by: §1.

MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: §1, §2.
Lempel-Ziv compression scheme with enhanced adaptation. Google Patents. Note: US Patent 5,243,341. Cited by: §3.1.
 Learning discrete weights using the local reparameterization trick. International Conference on Learning Representation (ICLR). Cited by: §2, §5.2, Table 4.
LOss-based sensitivity regularization: towards deep sparse neural networks. arXiv preprint arXiv:2011.09905. Cited by: §5.2, §5.2, Table 2, Table 3, Table 4.
Learning sparse neural networks via sensitivity-driven regularization. In Advances in Neural Information Processing Systems, pp. 3878–3888. Cited by: §1, §2, §4, §5.2.
Pruning depthwise separable convolutions for MobileNet compression. In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: Table 4.
Deep neural network compression by in-parallel pruning-quantization. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (3), pp. 568–579. Cited by: Table 4.
Soft weight-sharing for neural network compression. International Conference on Learning Representation (ICLR). Cited by: §2, §5.2.
HAQ: hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8612–8620. Cited by: Table 4.
APQ: joint search for network architecture, pruning and quantization policy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Cited by: §5.2, Table 4.
 DeepCABAC: a universal compression algorithm for deep neural networks. IEEE Journal of Selected Topics in Signal Processing. Cited by: §1, §2, §4, §5.2, §5.2, Table 2, Table 4.
 Arithmetic coding for data compression. Communications of the ACM 30 (6), pp. 520–540. Cited by: §3.1.

Deep neural network compression with single and multiple level quantization. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §1, §2, §5.2.
ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856. Cited by: §2.
DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160. Cited by: §1, §2, §5.2.