HEMP: High-order Entropy Minimization for neural network comPression

07/12/2021 ∙ by Enzo Tartaglione, et al. ∙ Università di Torino

We formulate the entropy of a quantized artificial neural network as a differentiable function that can be plugged as a regularization term into the cost function minimized by gradient descent. Our formulation scales efficiently beyond the first order and is agnostic of the quantization scheme. The network can then be trained to minimize the entropy of the quantized parameters, so that they can be optimally compressed via entropy coding. We experiment with our entropy formulation at quantizing and compressing well-known network architectures over multiple datasets. Our approach compares favorably with similar methods, enjoying the benefits of a higher-order entropy estimate, showing flexibility towards non-uniform quantization (we use Lloyd-max quantization), scalability towards any entropy order to be minimized and efficiency in terms of compression. We show that HEMP is able to work in synergy with other approaches aiming at pruning or quantizing the model itself, delivering significant benefits in terms of storage size compressibility without harming the model's performance.




1 Introduction

Artificial Neural Networks (ANNs) achieve state-of-the-art performance in several tasks via complex architectures with millions of parameters. Deploying such architectures over resource-constrained devices such as mobiles or autonomous vehicles entails tackling a number of practical issues. Such issues include tight bandwidth and storage caps for delivering and storing the trained networks, and limited memory for their deployment.

Figure 1: Proposed approach for neural network compression. At training time, we employ two parametrizations of the same neural network: continuous parameters are used for loss minimization, while quantized parameters are used for high-order entropy estimation. A regularization term enforces consistency between the neuron activations of the networks and a low entropy of the quantized network. The final model is obtained using any entropy-based encoder.

Let us assume a neural network has to be deployed to a device such as a smartphone or an autonomous car over a wireless link. Downloading the network may exhaust the subscriber’s traffic plan, plus the downloaded network will take storage on the device that will be unavailable to other applications. In an autonomous driving context, safety-critical updates may be delayed due to the limited bandwidth available over the wireless channel Samarakoon et al. (2019). Such examples show the importance of efficiently compressing neural networks for transmission and storage purposes.

Multiple approaches have been proposed to compress neural networks. A first approach consists in designing the network topology from the ground up to encompass fewer parameters Iandola et al. (2016); Sandler et al. (2018). Needless to say, this approach requires designing novel topologies from scratch. A second approach consists in pruning some parameters from the network, i.e. removing some connections between neurons Molchanov et al. (2017); Tartaglione et al. (2018); Louizos et al. (2018), yielding a sparse topology. Pruning might reduce the memory footprint Courbariaux et al. (2015); Zhou et al. (2016); Mishra et al. (2018), however it does not necessarily minimize storage or bandwidth requirements. A third approach consists in quantizing the network parameters Kim et al. (2016); Xu et al. (2018); Wiedemann et al. (2020), possibly followed by entropy-coding the quantized parameters. Similar approaches achieve promising results, however most quantization schemes just aim at learning a compressible representation of the parameters Wiedemann et al. (2020); Oktay et al. (2019); Zhou et al. (2016) rather than properly minimizing the entropy of the compressed parameters. Indeed, the entropy of the quantized parameters is not differentiable and cannot be easily minimized in standard gradient descent-based frameworks. In this work, we tackle the problem of compressing a neural network by minimizing the entropy of the compressed parameters at learning time. By enhancing the model’s compressibility, we are able to reduce both the bandwidth required for streaming and the storage required. Deep models are redundant Cheng et al. (2015); Lee et al. (2021): hence, there is an overhead in the deep model’s representation, which can be compressed.
This work introduces HEMP, a method that relies on high-order entropy minimization to allow for efficient compression of the parameters of a neural network. The proposed method is illustrated in Fig. 1. The main contribution of HEMP is the differentiable formulation of the quantized parameters’ entropy, which can be extended beyond first-order with finite computational and memory complexity. Namely, HEMP relies on a twin parametrization of the neural network: continuous parameters and the corresponding quantized parameters, where the entropy of the latter is estimated from the former. We design a regularization term around our entropy formulation that can be plugged into gradient descent frameworks to train a network to minimize the entropy of the quantized parameters. No assumptions are made on the quantization scheme (including non-uniform quantization) or on the entropy coding scheme: neither is part of the proposed method, which is totally agnostic towards both.
Other techniques like norm or rank-based ones aim at removing parameters: this has a different effect on the distribution of the parameters. Indeed, while these other approaches maximize the frequency of the pruned parameters, which are still encoded with zeros, HEMP is more general as it is able to enhance the compressibility of any quantized representation for the values.
We experiment with different quantization and entropy coding schemes, showing that training a network to minimize the 2nd-order entropy of the quantized parameters is already sufficient to outperform state-of-the-art competing schemes.

The rest of the paper is organized as follows. Sec. 2 reviews state-of-the-art approaches in network compression, Sec. 3 introduces the proposed high-order entropy regularizer, and the overall training procedure is described in Sec. 4. Experimental results are discussed in Sec. 5 and finally, in Sec. 6 conclusions are drawn.

2 Related works

Much work has addressed neural network size reduction. In general, these works can be grouped into three broad categories, according to their primary goal.

Minimizing the architecture

From the architectural point of view, it is possible to design memory-efficient deep networks, typically relying on strategies like channel shuffling, point-wise convolutional filters, weight sharing or a combination of them. Some examples of deep networks customized towards memory footprint reduction are SqueezeNet Iandola et al. (2016), ShuffleNet Zhang et al. (2018) and MobileNet-v2 Sandler et al. (2018). Recently, automatically reducing the size of deep networks has attracted great interest, with works on neural network sparsification Molchanov et al. (2017); Tartaglione et al. (2018); Louizos et al. (2018); Ullrich et al. (2017) boosted by the recent lottery ticket hypothesis by Frankle and Carbin (2018). These approaches address the problem of improving inference efficiency with limited memory footprint, but do not directly tackle the problem of reducing the stored model size.

Minimizing the computation

Recently, this topic has been attracting ever-increasing interest. With roots in statistical physics, some works exploited low-precision training in artificial neural networks Courbariaux et al. (2015); Rastegari et al. (2016); Baldassi et al. (2018). A large number of works also attempt to use low-precision back-propagation signals and low-precision activations, as this leads to lower power consumption at inference time Lin et al. (2017); Mishra et al. (2018); Zhou et al. (2016). These techniques, however, do not explicitly address the problem of minimizing the storage size of the entire model.

Minimizing the stored model’s memory

Here the main goal is not to modify the architecture of a deep model, but merely to compress it, to reduce its stored size: while the other two approaches focus on changing the architecture of the deep model to simplify it and/or to reduce its memory footprint, here the objective is to compress a stored model with no architectural change. Towards this end, many approaches have been proposed: context-adaptive binary arithmetic coding Wiedemann et al. (2020), learning the quantized parameters using the local reparametrization trick Shayer et al. (2018), clustering similar parameters between different layers Xu et al. (2018), using matrix factorization followed by Tucker decomposition Kim et al. (2016), training adversarial neural networks towards compression Belagiannis et al. (2018) and employing a Huffman encoding scheme Han et al. (2016) are just some of them. Recently, Oktay et al. (2019) proposed an entropy-penalized reparametrization of the parameters of a deep model, which leads to competitive compression at the cost of a small loss in the model’s performance. However, their approach has some training overhead: it requires training a decoder, and the formulation is made differentiable through straight-through estimators (STE). The main advantage of their approach lies in the reparametrization leading to the quantization strategy, but the compressibility of their quantized parameters is limited to arithmetic coding.
Deep learning based compression schemes proposing a direct high-order entropy-based regularizer are difficult to design because of the non-differentiability of the entropy and its computational cost. None of the discussed methods explicitly minimizes the final compressed file size: they are either limited to rigid quantization and compression schemes Wiedemann et al. (2020) or build dictionaries on purpose Han et al. (2016), losing generality. In the next section, we introduce our efficient and differentiable n-th order entropy proxy, to be used in the HEMP framework: it can be freely associated with any quantization strategy and any entropic compression algorithm. Differently from the work by Wiedemann et al. (2020), HEMP is not bound to a particular quantization scheme, and provides a direct, scalable and differentiable entropy estimator on the continuous parameters.

3 Entropy-based regularization

In this section, we describe our entropy-based framework for quantization. We introduce a regularization formulation that uses a differentiable entropy proxy, evaluated on the continuous parameters of the model, to indirectly reduce the compressed size of the quantized network. We will show that this term easily scales up to any entropy order, thus improving the compression efficiency of actual algorithms such as dictionary-based compression.

3.1 Preliminaries

Symbol Meaning
-th (continuous) parameter in the -th layer
-th (quantized) parameter in the -th layer
quantization index corresponding to the -th parameter in the -th layer
quantization levels
generic quantization index in range
probability that the quantized representation of will have as quantization index
n-th order entropy on the quantization indices
differentiable proxy of proposed within HEMP
Table 1: Overview on the notation used in this work.

Here, we introduce preliminaries and notations. Let a feed-forward, multi-layer artificial neural network be composed of layers. Let be the -th parameter of the -th layer. Let us assume all ANN parameters are quantized onto discrete levels, with:

  • quantization index for every parameter ;

  • reconstruction (or representation) levels ; as shown in the following, every layer of the ANN model gets its own optimized set of reconstruction levels.

From these, we get the quantized parameters according to


Table 1 collects the most recurring symbols of this section. Please notice that multi-dimensional versions of the symbols are in bold.
Now, let us consider as the -uples of the quantized parameters, where is the -th -uple of quantized parameters. In general, the -th order entropy on the quantized model is


where (L0-norm) is the total number of parameters,

and, using the chain rule, we can express



In (3), the “probability” of the event is


where is the indicator function. Minimizing (2) results in maximizing the final compression for the quantized model when using an entropic compression algorithm (Witten et al. (1987); Seroussi and Lempel (1993)). Unfortunately, the problem in minimizing (2) within a gradient descent based optimization framework lies in the non-differentiability of (4). In the next section we introduce a differentiable proxy for (4) which directly optimizes the continuous parameters such that their quantization is highly compressible.
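As an illustration of why the entropy order matters for compressibility, the sketch below estimates an empirical n-th order entropy from a sequence of quantization indices. It assumes one common definition (the entropy of non-overlapping n-tuples, normalized per index); the exact normalization of (2) and the chain-rule form of (3) are not reproduced here, and the name `block_entropy` is hypothetical.

```python
from collections import Counter
from math import log2


def block_entropy(indices, n):
    """Empirical entropy, in bits per index, of non-overlapping n-tuples.

    Assumed definition for illustration only; not the exact form of (2).
    """
    usable = len(indices) - len(indices) % n  # drop an incomplete trailing tuple
    tuples = [tuple(indices[i:i + n]) for i in range(0, usable, n)]
    counts = Counter(tuples)
    total = len(tuples)
    bits_per_tuple = -sum(c / total * log2(c / total) for c in counts.values())
    return bits_per_tuple / n


# A perfectly regular toy sequence of quantization indices.
indices = [0, 1] * 8
h1 = block_entropy(indices, 1)  # two equally likely symbols
h2 = block_entropy(indices, 2)  # a single repeated pair
```

On this toy sequence the first-order estimate sees two equally likely symbols, while the second-order estimate sees one repeated pair, which is the kind of regularity a dictionary-based coder can exploit.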

3.2 Differentiable n-th order entropy regularization

In the previous section we have stated the impossibility of directly optimizing (2) using gradient descent-based techniques because of the non-differentiability of (3). We are going to overcome this obstacle by providing a formulation of (3) based on the distance between the continuous parameter and its quantized reconstruction . From here on, we will drop the subscript , but in general all the layers have different reconstruction levels.
Let us define first the distance between a parameter and the reconstruction level :


From (5), we can estimate the probability of binning to using the softmax function:


Such a general formulation is computationally expensive, so we propose an efficient approximation thereof exploiting a “bin locality” principle, according to which a parameter can be binned to the two closest bins only. Under the assumption of a quasi-static process, indeed, the probability of binning the continuous parameter in bins other than the two closest between two iteration steps can locally be neglected. We refer to these bins as and . In this case, we know . Here we can design a probability that scales linearly with the relative distance:


where . Figure 2 displays the behavior of (7).

Figure 2: Visual representation of (7).
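The two binning rules discussed above can be contrasted with a minimal sketch. The softmax variant assumes a unit temperature and the absolute difference as distance; the two-bin variant assigns probability only to the two closest reconstruction levels, linearly in the relative distance. Function names and the clamping of out-of-range parameters are assumptions made for illustration, not the paper's exact (6) and (7).

```python
from bisect import bisect_right
from math import exp


def softmax_binning(w, levels):
    """Probability of every level via a softmax over negative distances."""
    scores = [exp(-abs(w - level)) for level in levels]
    total = sum(scores)
    return [s / total for s in scores]


def two_bin_binning(w, levels):
    """Probability mass on the two closest levels only.

    `levels` must be sorted in increasing order and hold at least two values;
    w is clamped to their range. Returns {level_index: probability}.
    """
    w = min(max(w, levels[0]), levels[-1])
    right = min(max(bisect_right(levels, w), 1), len(levels) - 1)
    left = right - 1
    delta = levels[right] - levels[left]
    p_right = (w - levels[left]) / delta  # grows linearly towards the right level
    return {left: 1.0 - p_right, right: p_right}


levels = [-1.0, 0.0, 1.0]
dense = softmax_binning(0.25, levels)   # one probability per level
sparse = two_bin_binning(0.25, levels)  # only two non-zero entries
```

The softmax rule yields one probability per quantization level, whereas the two-bin rule keeps exactly two entries per parameter regardless of how many levels exist.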

Hence, the binning probability in (7) scales as the relative distance from the center of the bin. If we combine (7) with (4) and, finally, with (2), we obtain




3.3 Study of the entropy regularization term

In this section we detail the derivations of the proposed entropy regularization. Obtaining an explicit formulation for the update terms allows us to implement the update rule efficiently and explicitly (when using gradient-based optimizers, without relying on automatic differentiation packages) and to study both the stationary points of the regularization term and the bounds of its gradient.

3.3.1 Explicit derivation of the entropy regularization term’s gradient

Let us consider here the first order entropy proxy:




Let us differentiate (10) with respect to :


According to (7), we can write


and considering that




we have


Using a similar approach, we can explicitly write the gradient for the n-th order entropy term in (3.2):


where indicates the set of whose binning probability for is non-zero.
Having (17) explicit enables efficient gradient computation: indeed, given the design choice in (7), every n-uple of parameters has possible quantization indices n-uples only, which is independent of the number of quantization levels . On the contrary, using (6) would result in possible quantization indices n-uples . Hence, our proposed approach allows us to save memory at computation time.
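The memory argument can be made concrete with a small enumeration. Assuming each parameter contributes exactly two candidate bins under the two-bin rule and every bin under the softmax rule, the number of candidate index tuples is the size of a Cartesian product. The helper name and the toy values (three parameters, sixteen levels) are illustrative assumptions.

```python
from itertools import product


def candidate_tuples(neighbour_bins):
    """All index tuples reachable when each parameter may fall in its listed bins."""
    return list(product(*neighbour_bins))


n, num_levels = 3, 16
# Two-bin rule: each of the n parameters keeps only its (left, right) neighbours.
local = candidate_tuples([(4, 5), (9, 10), (0, 1)])
# Softmax rule: every parameter may fall in any of the num_levels bins.
full = candidate_tuples([range(num_levels)] * n)
```

In this toy setting the local rule enumerates 8 tuples against 4096 for the full rule, and only the latter grows with the number of quantization levels.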
For the sake of simplicity, the following analysis on stationary points and bounds will be performed on the first order entropy, but similar conclusions can be equivalently drawn for any n-th order.

3.3.2 Stationary points for H1

Figure 3: Gradient vanishing condition for both (in cyan) and (in violet).

In this section we look for stationary points (in other words, points where the gradient vanishes). From (16) we observe that


assuming finite positive numbers. We can make explicit in the condition (18):



According to (7), we can rewrite (19) as


because by definition. As we expect, if and are evenly populated, and the stationary point of is

which results in


exactly between the centres of the two bins. From the entropy point of view, this is essentially what we expect, since we have two equi-populated bins; however, this is not what we would like when we quantize a deep network, considering that it leads to a high quantization error. For this reason, favoring solutions in which is a good strategy and this is also the reason we included a reconstruction error in the overall regularization function.
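The midpoint result can be checked numerically on a toy case. The sketch assumes a first-order proxy in which bin populations are soft counts, with one free parameter shared linearly between two bins that already hold `n_left` and `n_right` parameters; this form and all names are assumptions for illustration, not the exact expression of (10).

```python
from math import log2


def proxy_entropy(t, n_left, n_right):
    """First-order entropy of two bins with one extra parameter at relative
    position t in [0, 1] (t = 0: left centre, t = 1: right centre)."""
    total = n_left + n_right + 1
    p_left = (n_left + 1.0 - t) / total  # soft count: parameter shared linearly
    p_right = (n_right + t) / total
    return -(p_left * log2(p_left) + p_right * log2(p_right))


def slope(t, n_left, n_right, eps=1e-6):
    """Central finite-difference derivative of the proxy with respect to t."""
    return (proxy_entropy(t + eps, n_left, n_right)
            - proxy_entropy(t - eps, n_left, n_right)) / (2 * eps)


balanced = slope(0.5, n_left=10, n_right=10)    # evenly populated bins
unbalanced = slope(0.5, n_left=30, n_right=10)  # left bin more populated
```

With evenly populated bins the slope at the midpoint is numerically zero, while with unequal populations the entropy decreases as the parameter moves towards the more populated bin.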

3.3.3 Bound for H1’s derivative

Figure 4: Plot of the absolute upper bound (22) for as a function of and . In black we represent regions out of the considered domain.

In this section we are looking for an upper bound of and we study the cases in which such quantity explodes, in order to assess conditions to avoid gradient explosion. We can set the bound for the gradient magnitude as:


Considering that and that (so, both are finite, real-valued quantities), we are interested in guaranteeing


where is a positive, finite real number. Given , let us study the cases in which such quantity explodes.

  • Case . In this case we have and . According to (16) ; so at least one parameter lies in the considered interval and the condition is impossible by construction.

  • Case . In this case we have and . Similarly to the previous case, ; so at least one parameter lies in the considered interval and the condition is impossible by construction.

  • Case . By construction, so this case is impossible.

In the next section we will describe the overall HEMP framework.

4 Training scheme

Figure 5: Schematic representation of HEMP.

The overall training scheme is summarized in Fig. 5 and includes a quantizer and an entropy encoder. The quantizer generates the discrete-valued representation of the network parameters at training time. The encoder produces the final compressed file embedding the deep model once the training is over. Our scheme does not make any assumption about the quantization or entropy coding scheme, contrary to other strategies tailored for, e.g., specific quantization schemes (Han et al. (2016); Wiedemann et al. (2020)). Therefore, in the following we will assume a very general, non-uniform Lloyd-max quantizer, while we will not make any assumption about the entropy encoder for the moment, as it is external to the training process. Our learning problem can be formulated as follows: given a dataset and a network architecture, we want to compress the network parameters while preserving the network performance as measured by some loss function. Towards this end, we introduce the following regularization function:


where and are two positive hyper-parameters and


is a reconstruction error estimator. Minimizing makes and, for instance, loss evaluation on the continuous parameters network approaches the loss estimated on the quantized network. Overall, we minimize the objective function:


Minimizing requires finding the right balance between and : towards this end, we propose to dynamically re-weight according to the insensitivity of each parameter (Tartaglione et al. (2018)). The key idea here is to re-weight the regularization gradient at every parameter update depending on the sensitivity of the loss with respect to every parameter. We say that the larger the magnitude of the gradient of the loss with respect to , the smaller the perturbation induced by the minimization of that we desire. Hence, in the update of the parameter , we re-weight the gradient of by the insensitivity:


The HEMP framework allows the learning problem to be solved using standard optimization strategies, where the gradient of (26) is descended.
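The re-weighting idea can be sketched on a single scalar parameter. The form of the insensitivity used below (one minus the magnitude of the loss gradient, clipped at zero) is an assumption, since the exact expression of (27) is not reproduced here; the function names and the learning rate are likewise hypothetical.

```python
def insensitivity(loss_grad):
    """Assumed form: 1 - |dL/dw|, clipped at zero."""
    return max(0.0, 1.0 - abs(loss_grad))


def update(w, loss_grad, reg_grad, lr=0.1):
    """One gradient-descent step with the regularizer's gradient re-weighted."""
    return w - lr * (loss_grad + insensitivity(loss_grad) * reg_grad)


# A parameter the loss is very sensitive to ignores the regularizer ...
sensitive = update(w=0.5, loss_grad=2.0, reg_grad=1.0)
# ... while a parameter with zero loss gradient follows it fully.
insensitive = update(w=0.5, loss_grad=0.0, reg_grad=1.0)
```

Under this assumed form, the regularizer only perturbs parameters whose loss gradient is small, which matches the stated intent of protecting the parameters the loss depends on most.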

5 Experiments

In this section we evaluate the effectiveness of HEMP. Towards this end, we propose experiments on several widely used datasets with different architectures.

Datasets and architectures

We experiment with LeNet-5 on MNIST, ResNet-32 and MobileNet-v2 on CIFAR-10, ResNet-18 and ResNet-50 on ImageNet. We always train from scratch except for ImageNet experiments where we rely on pre-trained models.



We experiment on an Nvidia RTX 2080 Ti GPU. Our algorithm is implemented using PyTorch 1.5 (the source code will be made available on GitHub upon acceptance of the work). For all our simulations we use SGD optimization with momentum , and . Learning rate and batch-size depend on the dataset and the architecture: for all the datasets except for ImageNet the learning rate used is and batch-size , for ResNet-18 trained on ImageNet the learning rate is with batch-size while for ResNet-50 learning rate is with batch-size 32. The file containing the quantized parameters is entropy-coded using LZMA Pavlov (2007), a popular dictionary-based compression algorithm well-suited to exploit high-order entropy.
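To give a feel for how a dictionary-based coder reacts to repeated sequences of quantization indices, the sketch below compresses two toy index streams with Python's standard `lzma` module. The one-byte-per-index layout, the stream length and the repeated pattern are assumptions for illustration and do not reflect the file format used in the experiments.

```python
import lzma
import random


def compressed_size(indices):
    """Size in bytes of the LZMA stream for one-byte-per-index storage."""
    return len(lzma.compress(bytes(indices)))


rng = random.Random(0)  # fixed seed so the toy streams are reproducible
num_params, num_levels = 20000, 3

# Indices drawn independently: no repeated structure to exploit.
scattered = [rng.randrange(num_levels) for _ in range(num_params)]
# Indices following a short repeating pattern: strong high-order regularity.
patterned = ([0, 0, 1, 0, 0, 2] * (num_params // 6 + 1))[:num_params]

size_scattered = compressed_size(scattered)
size_patterned = compressed_size(patterned)
```

Both streams use the same three quantization levels, yet the patterned one compresses to a far smaller file because its regularity spans several consecutive indices.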


The goal of the present work is to compress a neural network without jeopardizing its accuracy, so we rely on two distinct, largely used, performance metrics:

  • the compressed model size as the size of the file containing the entropy-encoded network,

  • the classification accuracy of the compressed network (indicated as Top-1 in the following).

5.1 Preliminary experiments

As preliminary experiments, we evaluate if the regularizer function (3.2) is a good estimator of (2). Towards this end, we train the LeNet-5 architecture on MNIST minimizing while logging the entropy on the quantized parameters .

(a) LeNet-5 trained on MNIST
(b) ResNet-32 trained on CIFAR-10
Figure 6: Different entropy order minimization for LeNet-5 trained on MNIST (a) and on ResNet-32 trained on CIFAR-10 (b): in red first order is minimized, in blue the second one and in black the fourth (a) / third (b). Continuous lines represent the differentiable quantity introduced in (3.2) while dashed lines are the actual entropies directly computed on the quantized architecture (2).

Fig. 6(a) shows the normalized and its approximation : three findings are noteworthy.
First, accurately estimates , i.e. minimizing leads to minimizing . Under the assumption that the quantized parameters are entropy-coded, minimizing shall minimize the size of the file where the encoded parameters are stored.
Second, when , the training converges to a higher entropy, while minimizing higher entropy orders enables access to lower entropy embeddings. Higher entropy reflects on the final size of the model: while for we could get a final network size of 61kB, for the final size drops to approximately 27.5kB, having a top-1 accuracy of 99.27%. This better performance can be explained by the fact that higher order entropy can catch repeated sequences of parameters’ binnings which can lead to a significant compression boost.
Third, the higher n, the fewer the epochs required to converge to low entropy values. However, in terms of actual training time, the available GPU memory limits the degree of parallelism for computing the derivative term in (17). In the following, we will stick to n = 2, as it enables both reasonably low entropy embeddings and reasonable training times.
As a further verification, we have run the same experiments on the ResNet-32 architecture trained on the CIFAR-10 dataset: also here, we minimize while logging the entropy on the quantized parameters at different values of n. Fig. 6(b) shows the normalized . Similarly to what was observed for LeNet-5, second-order entropy minimization proves to be a good trade-off between complexity and final performance, considering that the reached entropic rate of is comparable to . Please note also that the entropy estimated on the quantized model, reported in Fig. 6(b), is proportional to the final file sizes.

Figure 7: Typical distribution of values during HEMP optimization for LeNet-5 trained on MNIST (second convolutional layer), with .

As a further analysis of HEMP’s effect on the parameters, in Fig. 7 we show the distribution of the optimized parameters on the second convolutional layer in LeNet-5 trained on MNIST (the other layers follow a similar distribution). In this case we optimize the model having 3 quantized values. As we observe, the continuous values are distributed tightly around their quantized representations : as ; in the end the accuracy of the quantized representation of the model approaches the accuracy of the continuous model. Additionally, as observed in Fig. 6(a), the entropy of the quantized model is also minimized, achieving a quantized trained model with both high accuracy and high compressibility of its representation.
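The non-uniform Lloyd-max quantizer assumed in Sec. 4 can be sketched for scalar parameters as an alternation between nearest-level assignment and centroid update. The initialization with evenly spaced levels, the fixed iteration count and the toy weights below are assumptions for illustration; the sketch uses three levels, as in the experiment of Fig. 7.

```python
def lloyd_max(values, num_levels, iterations=50):
    """Alternate nearest-level assignment and centroid update on scalar data.

    Requires num_levels >= 2.
    """
    lo, hi = min(values), max(values)
    # Start from evenly spaced levels spanning the data range.
    levels = [lo + (hi - lo) * k / (num_levels - 1) for k in range(num_levels)]
    for _ in range(iterations):
        buckets = [[] for _ in levels]
        for v in values:
            nearest = min(range(num_levels), key=lambda k: abs(v - levels[k]))
            buckets[nearest].append(v)
        # Each level moves to the mean of its bucket; empty buckets stay put.
        levels = [sum(b) / len(b) if b else levels[k]
                  for k, b in enumerate(buckets)]
    return levels


def quantize(values, levels):
    """Return the index of the nearest reconstruction level for every value."""
    return [min(range(len(levels)), key=lambda k: abs(v - levels[k]))
            for v in values]


weights = [-1.1, -1.0, -0.9, -0.1, 0.0, 0.1, 0.9, 1.0, 1.1]
levels = lloyd_max(weights, num_levels=3)
indices = quantize(weights, levels)
```

Here `levels` plays the role of the reconstruction levels and `indices` that of the quantization indices whose entropy the regularizer targets.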

5.2 Comparison with the state-of-the-art

Model | Method | Top-1 [%] | Size
LeNet-5 | Baseline | 99.30 | 1.7MB
 | LOBSTER, Tartaglione et al. (2020) | 99.10 | 19kB
 | Han et al. (2016) | 99.26 | 44kB
 | Wiedemann et al. (2020) | 99.12 | 43.4kB
 | HEMP | 99.27 | 27.5kB
 | Wiedemann et al. (2020) (+pruning) | 99.02 | 11.9kB
 | HEMP + LOBSTER, Tartaglione et al. (2020) | 99.05 | 2.00kB
Table 2: Results on the MNIST dataset using LeNet-5 architecture.
Model | Method | Top-1 [%] | Size
ResNet-32 | Baseline | 93.10 | 1.9MB
 | LOBSTER, Tartaglione et al. (2020) | 92.97 | 439.4kB
 | HEMP | 91.57 | 168.3kB
 | HEMP + LOBSTER, Tartaglione et al. (2020) | 92.55 | 86.2kB
MobileNet-v2 | Baseline | 93.67 | 9.4MB
 | HEMP | 92.80 | 872kB
Table 3: Results on CIFAR-10 using different architectures.
Model | Method | Top-1 [%] | Size
ResNet-18 | Baseline | 69.76 | 46.8MB
 | LOBSTER, Tartaglione et al. (2020) | 70.12 | 17.2MB
 | Lin et al. (2017) | 68.30 | 5.6MB
 | Shayer et al. (2018) | 63.50 | 2.9MB
 | HEMP | 68.80 | 3.6MB
 | HEMP + LOBSTER, Tartaglione et al. (2020) | 69.70 | 2.5MB
ResNet-50 | Baseline | 76.13 | 102.5MB
 | Wang et al. (2019) | 70.63 | 6.3MB
 | Han et al. (2016) | 68.95 | 6.3MB
 | Wiedemann et al. (2020) | 74.51 | 10.4MB
 | Tung and Mori (2020) | 73.7 | 6.7MB
 | HEMP (high acc.) | 74.52 | 9.1MB
 | HEMP | 71.33 | 5.5MB
MobileNet-v2 | Baseline | 72.1 | 13.5MB
 | Tu et al. (2020) | 7.25 | 10.1MB
 | He et al. (2019) | 9.8 | 4.95MB
 | Tung and Mori (2020) | 70.3 | 2.2MB
 | HEMP | 71.3 | 1.7MB
Custom (latency 6.11ms, energy 9.14mJ) | APQ, Wang et al. (2020) | 72.8 | 20.8MB
 | APQ, Wang et al. (2020) + HEMP | 72.5 | 3.04MB
Table 4: Results on ImageNet using different architectures.

We now compare our method with state-of-the-art methods for network compression. Our main goal is to minimize the size of the final compressed file while keeping the top-1 performance as close as possible to that of the baseline network. Therefore, our approach can be compared only with works that report the real final file size. To the best of our knowledge, only the methods reported in Tables 2, 3 and 4 can be included in this compression benchmark. Indeed, most of the pruning-based methods (Molchanov et al. (2017); Tartaglione et al. (2018)) typically report pruning rates only, which cannot be directly mapped to file size: encoding sparse structures requires additional memory to store the coordinates of the un-pruned parameters. We implemented one state-of-the-art pruning method (LOBSTER Tartaglione et al. (2020)) to report the storage size achieved by a pruning baseline. Concerning quantization methods, existing approaches either focus on quantization to reduce inference computation (Courbariaux et al. (2015); Rastegari et al. (2016); Lin et al. (2017); Mishra et al. (2018); Zhou et al. (2016)) or do not report the final file size (Ullrich et al. (2017); Kim et al. (2016); Belagiannis et al. (2018); Xu et al. (2018)). We also tried to directly compress the baseline file and did not observe any compression gain. Therefore, to make reading easier, we do not report these numbers.
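The remark on pruning rates versus file size can be illustrated with simple arithmetic. The sketch assumes 32-bit values and a coordinate-style layout storing one 32-bit flat index per kept parameter; real sparse formats differ, so the byte counts are indicative only and the function names are hypothetical.

```python
def dense_bytes(num_params, value_bytes=4):
    """Storage for a dense tensor: one value per parameter."""
    return num_params * value_bytes


def sparse_bytes(num_params, kept_fraction, value_bytes=4, index_bytes=4):
    """Coordinate-style storage: one value plus one flat index per kept parameter."""
    kept = round(num_params * kept_fraction)
    return kept * (value_bytes + index_bytes)


n = 1_000_000
half_pruned = sparse_bytes(n, kept_fraction=0.5)    # 50% of parameters removed
mostly_pruned = sparse_bytes(n, kept_fraction=0.1)  # 90% of parameters removed
```

Under these assumptions, pruning half of the parameters saves no storage at all, and removing 90% of them shrinks the file by a factor of 5 rather than 10.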
As a first experiment, we train LeNet-5 on MNIST (Table 2): despite the simplicity of the task, the reference LeNet-5 is notoriously over-parametrized for the learning task. Indeed, as expected, most of the state-of-the-art techniques are able to compress the model to approximately 40kB. In such a context, HEMP performs best, lowering the size of the compressed model to 27.5kB.
Then, we experiment with ResNet-32 and MobileNet-v2 on CIFAR-10 (as reported in Table 3), achieving also in this case significant compression: ResNet-32 size drops from 1.9MB to 168kB and MobileNet-v2 from 9.4MB to 822kB. Note that other methods in the literature do not report experiments on CIFAR-10 with these architectures. Nevertheless, HEMP reduces the network size by a factor of approximately 11 for both architectures.
We also compress ResNet-18, ResNet-50 and MobileNet-v2 pre-trained on ImageNet (Table 4). Also in this case, HEMP reaches a competitive final file size, compressing ResNet-18 from 46.8MB to 3.6MB with minimal performance loss and ResNet-50 from 102.5MB to 5.5MB. For the ResNet-50 experiment, we also report a partial result in the high-accuracy band, indicated as “high acc.”, to compare with Wiedemann et al. (2020): for the same accuracy, HEMP drives the model to a higher compression. In the case of ResNet-18, Shayer et al. (2018) achieves a 0.5MB smaller compressed model, which is however offset by a 4.3% worse Top-1 error. Also in the case of very efficient architectures like MobileNet-v2, HEMP significantly reduces the storage occupation, from 13.5MB to 1.7MB only. Furthermore, the error drop is in this case very limited (0.8%) when compared to other techniques, like Tung et al. which, in the case of less optimized architectures like ResNet-50, do not have a large drop. While concurrent techniques rely on typical pruning+quantization strategies, aiming at indirectly eliminating the redundancy in the models, HEMP directly optimizes over the existing redundancy.
Finally, we also tried to combine HEMP with a different quantization and pruning scheme. In particular, APQ Wang et al. (2020) proves to be a perfect framework for our purpose, since it is a strategy performing network architecture search, pruning and quantization together. We have used HEMP in the most challenging scenario proposed by Wang et al., with the lowest latency constraint (6.11ms) and the lowest energy consumption (9.14mJ) at inference time. In this case, we have fine-tuned the model provided by APQ for 5 epochs. Even in this case, HEMP is able to reduce the model’s size, from 20.8MB to 3.04MB only, proving in the field its deployability as a companion to other quantization/pruning schemes, without exploiting any prior on the network’s architecture.
Overall, these experiments show that HEMP strikes a competitive trade-off between compression ratio and performance on several architectures and datasets.


It has been observed that combining pruning and compression techniques reduces the model’s final file size with little performance loss Wiedemann et al. (2020). In our context, this translates into including two constraints in the learning:

  • force the quantizer to have, for some , the representation (or in simpler words, a quantization level corresponding to “0”);

  • include a pruning mechanism (permanent parameter set to “0”).

Both constraints work independently of HEMP: indeed, HEMP is not a quantization technique, but is designed to work alongside any other learning strategy whose aim is to quantize the model’s parameters (in such a context, pruning “quantizes to zero” as many parameters as possible). Hence, we paired HEMP with LOBSTER Tartaglione et al. (2020), a state-of-the-art differentiable pruning strategy (hence compatible with HEMP’s framework).
The results are as well reported in Tables 23 and 4: it is evident that, including a prior on the optimal distribution of the parameters (removing all the un-necessary ones for the learning problem) helps HEMP to compress more. We have tested the setup HEMP + LOBSTER on one architecture per dataset: LeNet-5 (MNIST), ResNet32 (CIFAR-10) and ResNet-18 (ImageNet). While LOBSTER alone is able to achieve highly compressed models for toy datasets (like MNIST), it can not achieve high compression alone on more complex datasets. Still, siding a technique like LOBSTER to HEMP, boosts the compression of 10x for MNIST and ImageNet dataset and 4x for CIFAR-10.
HEMP minimizes the -th entropy order (in these experiments, ), or in other words, maximizes the occurrence of certain sequences of quantization indices. The mapping of these quantization indices to quantization levels has to be determined outside HEMP: when we run experiments with “HEMP” alone, the loss minimization (in our case, of the cross-entropy) automatically determines these levels, with the general-purpose Lloyd-max quantizer. However, pruning strategies include a prior on one of the quantization levels (the one corresponding to “0”), and this helps towards a stronger entropy minimization.
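The link between entropy order and recurring sequences of quantization indices can be illustrated with a plain empirical estimate. The sketch below is not HEMP’s differentiable formulation: it assumes, for illustration only, that the entropy of order n is measured as the entropy of overlapping blocks of n consecutive indices, normalized per symbol, which may differ from the definition used in the paper. The function name is hypothetical.

```python
from collections import Counter
import math


def block_entropy_per_symbol(indices, n):
    """Empirical entropy (bits per symbol) of overlapping n-grams of indices."""
    if n < 1 or len(indices) < n:
        raise ValueError("n must be >= 1 and no longer than the sequence")
    # Collect all overlapping blocks of n consecutive quantization indices.
    ngrams = [tuple(indices[i:i + n]) for i in range(len(indices) - n + 1)]
    counts = Counter(ngrams)
    total = len(ngrams)
    # Shannon entropy of the empirical n-gram distribution.
    entropy_bits = -sum(
        (c / total) * math.log2(c / total) for c in counts.values()
    )
    return entropy_bits / n


# A strictly alternating index sequence: both symbols are equally frequent,
# but each symbol is fully predictable from the previous one.
periodic = [0, 1] * 500

h1 = block_entropy_per_symbol(periodic, 1)
h2 = block_entropy_per_symbol(periodic, 2)
print(f"order 1: {h1:.3f} bits/symbol, order 2: {h2:.3f} bits/symbol")
```

On this sequence the first-order estimate sees two equiprobable symbols, while the second-order estimate is lower because only two of the four possible pairs ever occur. This is the kind of structure that a first-order entropy estimate cannot capture.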

5.3 Ablation study

Figure 8: Test set losses for different trainings of ResNet-32 on CIFAR-10. (a) Reconstruction error regularization. (b) Insensitivity re-weighting: effect of the insensitivity as re-weighting factor for the regularization function. Horizontal axis: epochs; the legend distinguishes the test loss from the value of the minimized function. Please notice that the blue line in both (a) and (b) refers to the same simulation, the standard HEMP training.

Here, we evaluate the impact of the reconstruction error term (25) and the overall insensitivity re-weighting (27) for the regularization function. Towards this end, we perform an ablation study on the ResNet-32 architecture trained on CIFAR-10.

Reconstruction error regularization

Fig. 8(a) (left) shows the ResNet-32 loss for the continuous and the quantized models when the reconstruction error is included in or excluded from the regularization function (24). We observe that both continuous models (solid lines) obtain similar performance on the test set. However, the quantized models (dashed lines) perform very differently. When the reconstruction error is not included in the training procedure (red lines), the quantized model reaches a plateau with a high loss value, showing that the network performs poorly on the test set. Conversely, when the reconstruction error is included (blue lines), the quantized model reaches a final loss closer to that of the continuous models. Indeed, regularizing also on (25) keeps the quantized model close to the continuous one, and hence their losses close. This experiment verifies the contribution of the reconstruction error regularization term to the good performance of the quantized model.

Insensitivity-based re-weighting

Fig. 8(b) (right) shows the performance of the ResNet-32 model including or excluding the insensitivity re-weighting for the regularization function (27). Here, we report the test set losses obtained by the continuous models (continuous lines) and the value of the overall regularization function (dashed lines). Without insensitivity re-scaling of the regularization function (magenta line), we observe a very unstable test loss. Hence, minimization with an overall re-scaling of the regularization function is also shown (in cyan): in this case, the test loss of the continuous model remains low, but the regularization function is minimized extremely slowly. Using the insensitivity re-weighting (in blue) proves to be a good trade-off between keeping the test set loss low and minimizing the regularization function. This behavior is what we expected: the insensitivity re-weighting, acting parameter-wise (i.e. there is a different value for each parameter), dynamically tunes the re-weighting of the overall regularization function, allowing faster minimization with minimal or no performance loss. This is why we could use the same hyper-parameter values for all the simulations, despite optimizing different architectures on different datasets. Such robustness of the hyper-parameters across different datasets is a major practical strength of our approach.

6 Conclusion

We presented HEMP, an entropy coding-based framework for compressing neural network parameters. Our formulation efficiently estimates entropy beyond the first order and can be employed as a regularizer to minimize the quantized parameters’ entropy in gradient-based learning, directly on the continuous parameters. The experiments show that HEMP is not only an accurate proxy for minimizing the entropy of the quantized parameters, but is also pivotal to modeling the statistics of the quantized parameters and improving the efficiency of entropy coding schemes. We also combined HEMP with LOBSTER, a state-of-the-art pruning strategy that introduces a prior on the weight distribution, which gives a further boost to the final model’s compression. Future work includes the integration into HEMP of a quantization technique designed specifically for deep models.


  • C. Baldassi, F. Gerace, H. J. Kappen, C. Lucibello, L. Saglietti, E. Tartaglione, and R. Zecchina (2018) Role of synaptic stochasticity in training low-precision neural networks. Physical review letters 120 (26), pp. 268103. Cited by: §2.
  • V. Belagiannis, A. Farshad, and F. Galasso (2018) Adversarial network compression. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 0–0. Cited by: §2, §5.2.
  • Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. Choudhary, and S. Chang (2015) An exploration of parameter redundancy in deep networks with circulant projections. In Proceedings of the IEEE international conference on computer vision, pp. 2857–2865. Cited by: §1.
  • M. Courbariaux, Y. Bengio, and J. David (2015) Binaryconnect: training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pp. 3123–3131. Cited by: §1, §2, §5.2.
  • J. Frankle and M. Carbin (2018) The lottery ticket hypothesis: finding sparse, trainable neural networks. International Conference on Learning Representation (ICLR). Cited by: §2.
  • S. Han, H. Mao, and W. J. Dally (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. International Conference on Learning Representation (ICLR). Cited by: §2, §4, Table 2, Table 4.
  • Y. He, Z. Pan, L. Li, Y. Shan, D. Cao, and L. Chen (2019) Real-time vehicle detection from short-range aerial image with compressed mobilenet. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8339–8345. Cited by: Table 4.
  • F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer (2016) SqueezeNet: alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360. Cited by: §1, §2.
  • Y. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin (2016) Compression of deep convolutional neural networks for fast and low power mobile applications. International Conference on Learning Representation (ICLR). Cited by: §1, §2, §5.2.
  • C. Lee, Y. Kim, H. Ji, Y. Lee, Y. Hur, and H. Lim (2021) On the redundancy in the rank of neural network parameters and its controllability. Applied Sciences 11 (2), pp. 725. Cited by: §1.
  • X. Lin, C. Zhao, and W. Pan (2017) Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, pp. 345–353. Cited by: §2, §5.2, Table 4.
  • C. Louizos, M. Welling, and D. P. Kingma (2018) Learning sparse neural networks through regularization. International Conference on Learning Representation (ICLR). Cited by: §1, §2.
  • A. Mishra, E. Nurvitadhi, J. J. Cook, and D. Marr (2018) WRPN: wide reduced-precision networks. International Conference on Learning Representation (ICLR). Cited by: §1, §2, §5.2.
  • D. Molchanov, A. Ashukha, and D. Vetrov (2017) Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2498–2507. Cited by: §1, §2, §5.2.
  • D. Oktay, J. Ballé, S. Singh, and A. Shrivastava (2019) Scalable model compression by entropy penalized reparameterization. arXiv preprint arXiv:1906.06624. Cited by: §1, §2.
  • I. Pavlov (2007) Lzma sdk (software development kit). Cited by: §5.
  • M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In European conference on computer vision, pp. 525–542. Cited by: §2, §5.2.
  • S. Samarakoon, M. Bennis, W. Saad, and M. Debbah (2019) Distributed federated learning for ultra-reliable low-latency vehicular communications. IEEE Transactions on Communications. Cited by: §1.
  • M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520. Cited by: §1, §2.
  • G. Seroussi and A. Lempel (1993) Lempel-ziv compression scheme with enhanced adaptation. Google Patents. Note: US Patent 5,243,341 Cited by: §3.1.
  • O. Shayer, D. Levi, and E. Fetaya (2018) Learning discrete weights using the local reparameterization trick. International Conference on Learning Representation (ICLR). Cited by: §2, §5.2, Table 4.
  • E. Tartaglione, A. Bragagnolo, A. Fiandrotti, and M. Grangetto (2020) LOss-based sensitivity regularization: towards deep sparse neural networks. arXiv preprint arXiv:2011.09905. Cited by: §5.2, §5.2, Table 2, Table 3, Table 4.
  • E. Tartaglione, S. Lepsøy, A. Fiandrotti, and G. Francini (2018) Learning sparse neural networks via sensitivity-driven regularization. In Advances in neural information processing systems, pp. 3878–3888. Cited by: §1, §2, §4, §5.2.
  • C. Tu, J. Lee, Y. Chan, and C. Chen (2020) Pruning depthwise separable convolutions for mobilenet compression. In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: Table 4.
  • F. Tung and G. Mori (2020) Deep neural network compression by in-parallel pruning-quantization. IEEE transactions on pattern analysis and machine intelligence 42 (3), pp. 568–579. Cited by: Table 4.
  • K. Ullrich, E. Meeds, and M. Welling (2017) Soft weight-sharing for neural network compression. International Conference on Learning Representation (ICLR). Cited by: §2, §5.2.
  • K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han (2019) Haq: hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8612–8620. Cited by: Table 4.
  • T. Wang, K. Wang, H. Cai, J. Lin, Z. Liu, and S. Han (2020) APQ: joint search for network architecture, pruning and quantization policy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Cited by: §5.2, Table 4.
  • S. Wiedemann, H. Kirchhoffer, S. Matlage, P. Haase, A. Marban, T. Marinc, D. Neumann, T. Nguyen, H. Schwarz, T. Wiegand, D. Marpe, and W. Samek (2020) DeepCABAC: a universal compression algorithm for deep neural networks. IEEE Journal of Selected Topics in Signal Processing. Cited by: §1, §2, §4, §5.2, §5.2, Table 2, Table 4.
  • I. H. Witten, R. M. Neal, and J. G. Cleary (1987) Arithmetic coding for data compression. Communications of the ACM 30 (6), pp. 520–540. Cited by: §3.1.
  • Y. Xu, Y. Wang, A. Zhou, W. Lin, and H. Xiong (2018) Deep neural network compression with single and multiple level quantization. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §1, §2, §5.2.
  • X. Zhang, X. Zhou, M. Lin, and J. Sun (2018) Shufflenet: an extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6848–6856. Cited by: §2.
  • S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou (2016) Dorefa-net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160. Cited by: §1, §2, §5.2.