1 Introduction
Artificial neural networks (ANNs) are nowadays among the most studied algorithms used to solve a huge variety of tasks. Their success comes from their ability to learn from examples, not requiring any specific expertise and using very general learning strategies. The use of GPUs (and, recently, TPUs) for training ANNs gave a decisive kick to their large-scale deployment.
However, many deep models share a common obstacle: the large number of parameters, which allows their successful training [1, 4], determines in turn a large number of operations at inference time, preventing their efficient deployment on mobile and cheap embedded devices.
In order to address this problem, a number of approaches have been proposed, like defining new, more efficient models [10]. Recently, a race to shrink the size of these ANN models has begun: the so-called pruning strategies are indeed able to remove (or prune) non-relevant parameters from pre-trained models, reducing the size of the ANN model while keeping a high generalization capability. On this topic, a very large number of strategies has been proposed [6, 14, 22, 20], from which we can identify two main classes:

one-shot strategies, which are able to prune parameters using very fast, greedy approaches;

gradual strategies, which are much slower than one-shot approaches but can potentially achieve higher compression rates (in other words, they promise to prune more parameters at the cost of a higher computational complexity).
In such a rush, however, the effort towards a deeper understanding of the properties of such sparse architectures has been mostly set aside: is there a specific reason why we are able to prune many parameters with minimal or no generalization loss? Are one-shot strategies enough to match gradual pruning approaches? Is there any hidden property behind these sparse architectures?
In this work, we first compare one-shot pruning strategies to their gradual counterparts, investigating the eventual benefits of a much more computationally-intensive sparsifying strategy. Then, we shine a light on some local properties of the minima achieved using the two different pruning strategies and, finally, we propose PSP-entropy, a measure on the state of ReLU-activated neurons, to be used as an analysis tool to get a better understanding of the obtained sparse ANN models.
The rest of this paper is organized as follows. Sec. 2 reviews the importance of network pruning and the most relevant literature. Next, in Sec. 3 we discuss the relevant literature around local properties of minima for ANN models. Then, in Sec. 4 we propose PSP-entropy, a metric measuring how much a neuron specializes in identifying features belonging to a subset of the classes learned at training time. Sec. 5 provides our findings on the properties of sparse architectures and, finally, in Sec. 6 we draw the conclusions and identify further directions for future research.
2 State-of-the-art pruning techniques
In the literature it is possible to find a large number of pruning approaches, some old-fashioned [12] and others more recent [9, 13, 17]. Among the latter, many sub-categories can be identified. Ullrich et al. introduce what they call soft weight sharing, through which it is possible to introduce redundancy in the network and reduce the amount of stored parameters [23]. Other approaches are based on parameter regularization and pruning: for example, Louizos et al. use an $L_0$ proxy regularization [15]; Tartaglione et al., instead, define the importance of a parameter via a sensitivity measure used as regularization [20]. Other approaches are dropout-based, like sparse variational dropout, proposed by Molchanov et al., which leverages a Bayesian interpretation of Gaussian dropout to promote sparsity in the ANN model [17].
Overall, most of the proposed pruning techniques can be divided into two macro-classes. The first is defined by approaches based on gradual pruning [15, 22, 25], in which the network is trained and pruned at the same time, following some heuristic approach spanning a large number of pruning iterations. One of these, showing among the best performances, is LOBSTER, where parameters are gradually pruned according to their local contribution to the loss [22]. The second class, instead, includes all the techniques based on one-shot pruning [6, 9, 16]; here the pruning procedure consists of three stages (a short sketch of this pipeline is given after the list):
a large, over-parametrized network is normally trained to completion;

the network is then pruned using some kind of heuristic (e.g. magnitude thresholding) until the desired sparsity is reached (the percentage of remaining parameters is typically a hyper-parameter);

the pruned model is further fine-tuned to recover the accuracy lost due to the pruning stage.
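As a reference, here is a minimal sketch of the second stage (global magnitude thresholding) applied to a PyTorch model; the target sparsity, the global threshold and the mask bookkeeping are our own illustrative choices, not the exact procedure of any of the cited works.

```python
import torch

def magnitude_prune(model, target_sparsity=0.9):
    """One-shot magnitude pruning sketch: zero out the globally smallest
    weights so that `target_sparsity` of them are removed, returning binary
    masks that keep them at zero during the subsequent fine-tuning."""
    # Rank all prunable weights (matrices / conv kernels) globally by magnitude.
    all_weights = torch.cat([p.detach().abs().flatten()
                             for p in model.parameters() if p.dim() > 1])
    threshold = torch.quantile(all_weights, target_sparsity)

    masks = {}
    for name, p in model.named_parameters():
        if p.dim() > 1:
            mask = (p.detach().abs() > threshold).float()
            p.data.mul_(mask)      # prune: set removed weights to zero
            masks[name] = mask     # re-apply after every optimizer step
    return masks
```

Fine-tuning then proceeds as usual, with the masks re-applied after each optimizer step; in the lottery-ticket setting [6], the surviving weights would instead be rewound to their stored initial values before retraining.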
A recent work by Frankle and Carbin [6] introduced the so-called lottery ticket hypothesis, which is having a large impact on the research community.
They claim that, early in training, it is possible to extract from an ANN a sparse sub-network in a one-shot fashion: such a sparse network, when trained, can match the accuracy of the original model. In a follow-up, Renda et al. propose a retraining approach that replaces the fine-tuning step with weight rewinding: after pruning, the remaining parameters are reset to their initial values and the pruned network is trained again. They also argue that re-using the initial weight values is fundamental to achieve competitive performance, which degrades when starting from a random initialization [18].
On the other hand, Liu et al. show that, even when retraining a pruned sub-network from a new random initialization, they are able to reach an accuracy level comparable to that of its dense counterpart, challenging one of the conjectures proposed alongside the lottery ticket hypothesis [14].
In our work we try to shed some light on this discussion, comparing state-of-the-art one-shot pruning to gradual pruning.
3 Local properties of minima
In the previous section we have explored some of the most relevant pruning strategies. All of them rely on state-of-the-art optimization strategies: by applying very simple optimization heuristics to minimize the loss function, such as SGD [2, 26], it is nowadays possible to successfully train ANNs on huge datasets. Theoretically speaking, this is the “miracle” of deep learning: the dimensionality of the problem is huge (indeed, these problems are typically over-parametrized, and their dimensionality can be efficiently reduced [20]), and minimizing a non-convex objective function is typically expected to leave the trained architecture stuck in poor local minima. However, the empirical evidence shows that something else is happening under the hood, and understanding it is in general critical.
Goodfellow et al. pioneered the problem of understanding why deep learning works. In particular, they observed that there is essentially no loss barrier between a generic random initialization of the ANN model and the final configuration [8]. This phenomenon has also been observed on larger architectures by Draxler et al. [5]. These works lay the basis for the “lottery ticket hypothesis” papers. However, a secondary yet relevant observation in [8] stated that there is a loss barrier between different ANN configurations showing similar generalization capabilities. Later, it was shown that a low-loss path between well-generalizing solutions to the same learning problem can typically be found [19]. From this brief discussion it is evident that a general approach to better characterize such minima has yet to be found.
Keskar et al. showed why we should prefer small-batch methods to large-batch ones: they correlate the stochasticity introduced by small-batch methods to the sharpness of the reached minimum [11]. In general, they observe that the larger the batch, the sharper the reached minimum. Even more interestingly, they observe that the sharper the minimum, the worse the generalization of the ANN model. There are many works supporting the hypothesis that flat minima generalize well, and this has also driven a significant part of the current research [3, 11]. However, this does not necessarily mean that no sharp minimum generalizes well, as we will see in Sec. 5.2.
4 Towards a deeper understanding: an entropy-based approach
In this section we propose PSP-entropy, a metric to evaluate the dependence of the output of a given neuron in the ANN model on the target classification task. The proposed measure will allow us to better understand the effect of pruning.
4.1 Post-synaptic potential
Let us define the output of the $i$-th neuron at the $l$-th layer as
$$y_{l,i} = \varphi\left[ f\left( \mathbf{y}_{l-1}, \boldsymbol{\theta}_{l,i} \right) \right], \qquad (1)$$
where $\mathbf{y}_{l-1}$ is the input of such a neuron, $\boldsymbol{\theta}_{l,i}$ are the parameters associated to it, $f(\cdot)$ is some affine function and $\varphi(\cdot)$ is the activation function. We can then define its post-synaptic potential (PSP) [21] as
$$z_{l,i} = f\left( \mathbf{y}_{l-1}, \boldsymbol{\theta}_{l,i} \right). \qquad (2)$$
Typically, deep models are ReLU-activated: from here on, let us consider the activation function for all the neurons in hidden layers to be $\varphi(z) = \max(0, z)$. Under such an assumption it is straightforward to identify two distinct regions for the neuron activation:

$z_{l,i} \le 0$: the output of the neuron will be exactly zero ($y_{l,i} = 0$);

$z_{l,i} > 0$: there is a linear dependence of the output on $z_{l,i}$.
Hence, let us define the binary state of the neuron as
$$s(z_{l,i}) = \begin{cases} 0 & z_{l,i} \le 0 \\ 1 & z_{l,i} > 0 \end{cases} \qquad (3)$$
(a short sketch computing these quantities is given after the list below). Intuitively, we understand that if two neurons belonging to the same layer, for the same input, share the same $s(\cdot)$, then they are linearly-mappable to one equivalent neuron:

$s(\cdot) = 0$ for all inputs: one of them can be simply removed;

$s(\cdot) = 1$ for all inputs: they are equivalent to a linear combination of them.
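To make the notation concrete, the short sketch below extracts the PSP $z_{l,i}$ of Eq. (2) and the binary state of Eq. (3) for a fully-connected ReLU layer; the function name and interface are illustrative assumptions.

```python
import torch

def psp_state(layer: torch.nn.Linear, y_prev: torch.Tensor):
    """Post-synaptic potentials z = f(y_prev, theta) of a fully-connected
    layer (Eq. (2)) and the corresponding binary states of Eq. (3)."""
    z = torch.nn.functional.linear(y_prev, layer.weight, layer.bias)  # PSP
    s = (z > 0).int()  # 1 in the linear region of the ReLU, 0 otherwise
    return z, s
```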
In this work we are not interested in using this approach for structured pruning: there are many works in the literature that tackle this issue using efficient proxies. In the next section we formulate a metric to evaluate the degree of disorder in the post-synaptic potentials. The aim of such a measure is to provide an analytical tool giving us a broader understanding of the behavior of the neurons in sparse architectures.
4.2 PSP-entropy for ReLU-activated neurons
In the previous section we have recalled the concept of post-synaptic potential. Some interesting properties have also been introduced for ReLU-activated networks: we can use the PSP value to bin the state of a neuron according to (3). Hence, we can construct a binary random process that we can rank according to its entropy. To this end, let us assume we set as input of our ANN model two different patterns, $x_1$ and $x_2$, belonging to the same class (for those inputs, we aim at having the same target at the output of the ANN model). Let us consider the PSP $z_{l,i}$ (where $l$ is a hidden layer):

if $s(z_{l,i} \mid x_1) = s(z_{l,i} \mid x_2)$ we can say there is low PSP-entropy;

if $s(z_{l,i} \mid x_1) \neq s(z_{l,i} \mid x_2)$ we can say there is high PSP-entropy.
We can model an entropy measure for the PSP:
$$H_{l,i \mid c} = -\sum_{t \in \{0,1\}} p\left( s(z_{l,i}) = t \mid c \right) \log_2 p\left( s(z_{l,i}) = t \mid c \right), \qquad (4)$$
where $p\left( s(z_{l,i}) = t \mid c \right)$ is the probability that $s(z_{l,i}) = t$ when the network is presented an input belonging to the $c$-th class. Since we typically aim at solving a multi-class problem, we can model an overall entropy for the neuron as
$$H_{l,i} = \sum_{c=1}^{C} H_{l,i \mid c}, \qquad (5)$$
where $C$ is the number of classes.
It is very important to separate the contributions to the entropy according to the $c$-th target class, since we expect the neurons to catch relevant features that are highly-correlated to the target classes. Eq. (5) provides very important information towards this end: the lower its value, the more the neuron specializes for some specific classes.
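As a concrete illustration, the sketch below estimates the first-order PSP-entropy of Eqs. (4)-(5) from a batch of recorded binary states; the estimation of the probabilities by per-class frequencies and the plain sum over classes reflect our reading of the formulation above, and all names are illustrative.

```python
import torch

def psp_entropy(states, labels, num_classes):
    """First-order PSP-entropy, Eqs. (4)-(5).

    states: (N, num_neurons) binary states s(z) of Eq. (3) over N inputs
    labels: (N,) target class of each input
    returns: (num_neurons,) entropy H_{l,i}, summed over the classes
    """
    eps = 1e-12
    H = torch.zeros(states.shape[1])
    for c in range(num_classes):
        s_c = states[labels == c].float()
        p = s_c.mean(dim=0)                        # p(s = 1 | class c)
        H_c = -(p * (p + eps).log2()
                + (1 - p) * (1 - p + eps).log2())  # Eq. (4)
        H += H_c                                   # accumulate over classes, Eq. (5)
    return H
```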
The formulation in (5) is very general and can be easily extended to higher-order entropy, i.e. the entropy of sets of neurons whose states correlate for the same classes. We are now ready to use this metric to shed further light on the findings in Sec. 5.
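For instance, a second-order estimate could treat the joint state of a pair of neurons as a four-valued symbol and apply the same per-class entropy; the pairing and encoding below are only one possible choice, given as a sketch under our assumptions.

```python
import torch

def psp_entropy_pair(states, labels, num_classes, i, j):
    """Second-order PSP-entropy for the pair of neurons (i, j):
    the joint binary state (s_i, s_j) is encoded as a symbol in {0, 1, 2, 3}."""
    eps = 1e-12
    joint = (2 * states[:, i] + states[:, j]).long()
    H = 0.0
    for c in range(num_classes):
        counts = torch.bincount(joint[labels == c], minlength=4).float()
        p = counts / counts.sum().clamp(min=1)           # joint state frequencies
        H += -(p * (p + eps).log2()).sum().item()        # per-class entropy, summed
    return H
```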
5 Experiments
For our tests, we have decided to compare the state-of-the-art one-shot pruning proposed by Frankle and Carbin [6] to one of the top-performing gradual pruning strategies, LOBSTER [22]. Towards this end, we first obtain a sparse network model using LOBSTER; the non-pruned parameters are then re-initialized to their original values, according to the lottery ticket hypothesis [6]. Our purpose here is to determine whether the lottery ticket hypothesis applies also to the sparse models obtained using high-performing gradual pruning strategies.
As a second experiment, we want to test the effect of different random initializations while keeping the achieved sparse architecture. According to Liu et al., this should lead to results similar to those obtained with the original initialization [14]. Towards this end, we tried different new starting configurations. As a last experiment, we want to assess how important the structure originating from the pruning algorithm is in reaching competitive performance after re-initialization: for this purpose, we randomly define a new pruned architecture with the same number of pruned parameters as found via LOBSTER. Also in this case, different structures have been tested.
We decided to experiment with different architectures and datasets commonly employed in the relevant literature: LeNet-300 and LeNet-5-caffe trained on MNIST, LeNet-5-caffe trained on Fashion-MNIST [24], and ResNet-32 trained on CIFAR-10 (https://github.com/akamaster/pytorch_resnet_cifar10). For all our trainings we used the SGD optimization method with standard hyper-parameters and data augmentation, as defined in the papers of the different compared techniques [6, 14, 22].

5.1 One-shot vs. gradual pruning
In Fig. 1 we show, for different percentages of pruned parameters, a comparison between the test accuracy of models pruned using the LOBSTER technique and that of the models retrained following the approaches we previously defined.
We can clearly identify a low compression rate regime in which the re-initialized model is able to recover the original accuracy, validating the lottery ticket hypothesis. On the other hand, when the compression rate rises (for example, when we remove more than 95% of the LeNet-300 model's parameters, as observed in Fig. 1(a)), the retraining approach struggles to achieve low classification errors.
As one might expect, other combinations of dataset and model react differently. For example, LeNet-300 is no longer able to reproduce the original performance when composed of less than 5% of the original parameters. On the other hand, LeNet-5, when applied to MNIST, is able to achieve an accuracy of around 99.20% even when 98% of its parameters are pruned away (Fig. 1(b)). This does not happen when it is applied to a more complex dataset like Fashion-MNIST, where removing 80% of the parameters already leads to performance degradation (Fig. 1(c)). Such a gap becomes extremely evident when we re-initialize an even more complex architecture like ResNet-32 trained on CIFAR-10 (Fig. 1(d)).
From the reported results, we observe that the original initialization is not always important: the error gap between a randomly-initialized model and a model using the original weight values is minor, with the latter being only slightly better. Furthermore, both fail to recover the performance at high compression rates.
5.2 Sharp minima can also generalize well
Figure 2: results of LeNet-5 trained on MNIST at the highest compression rate (99.57% of pruned parameters): (a) loss on the training set and (b) top-5 largest Hessian eigenvalues. G is the solution found with gradual pruning, while 1S is the best one-shot solution (Frankle and Carbin).
In order to study the sharpness of local minima, let us focus, for example, on the results obtained with LeNet-5 trained on MNIST. We choose to focus our attention on this particular ANN model since, according to the state-of-the-art and consistently with our findings, it shows the lowest performance gap between gradual and one-shot pruning (as depicted in Fig. 1(b)); hence, it is the more challenging scenario in which to observe qualitative differences between the two approaches. However, we remark that all the observations for this case also apply to the other architectures/datasets explored in Sec. 5.1.
In order to obtain the maps in Fig. 2, we follow the approach proposed in [8] and plot the loss for the ANN configurations lying between two reference ones: in our case, we compare a solution found with gradual pruning (G) and one found with one-shot pruning (1S). Then, we take a random orthogonal direction to generate a 2D map. Fig. 2(a) shows the loss on the training set between iterative and one-shot pruning for the highest compression rate (99.57% of pruned parameters, as shown in Fig. 1(b)). In agreement with our previous findings, we see that iterative pruning lies in a lower-loss region. In Fig. 2(b) we also plot the top-5 Hessian eigenvalues (all positive), computed using the efficient approach proposed in [7]. Very interestingly, we observe that the solution found by iterative pruning lies in a narrower minimum than the one found using the one-shot strategy, despite generalizing slightly better. With this, we do not claim that narrower minima generalize well in general: gradual pruning strategies enable access to a subset of well-generalizing narrow minima, showing that not all narrow minima generalize worse than wide ones. This finding raises a warning against second-order optimization, which might favor the search for flatter, wider minima and ignore well-generalizing narrow minima. These non-trivial solutions are naturally found by gradual pruning but not by one-shot approaches, which on the contrary settle in wider minima. In general, the sharpness of these minima explains why, for high compression rates, retraining strategies fail to recover the performance, considering that it is in general harder to access this class of minima.
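For clarity, below is a minimal sketch of the 1-D slice of such a map: a linear interpolation between the configuration reached by gradual pruning (G) and the one reached by one-shot pruning (1S), in the spirit of [8]; the 2-D version adds a second, random direction orthogonal to the segment G-1S. Function and variable names (`loss_along_path`, `theta_G`, `theta_1S`) are our own assumptions, not taken from any cited code.

```python
import copy
import torch

def loss_along_path(model, theta_G, theta_1S, loss_fn, loader, steps=25):
    """Loss at the configurations theta(alpha) = (1 - alpha) * G + alpha * 1S.
    theta_G / theta_1S are state_dicts holding only floating-point tensors
    (true for LeNet-style models without integer buffers)."""
    probe = copy.deepcopy(model)
    losses = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        # Interpolate every parameter tensor between the two reference solutions.
        interpolated = {k: (1 - alpha) * theta_G[k] + alpha * theta_1S[k]
                        for k in theta_G}
        probe.load_state_dict(interpolated)
        probe.eval()
        total, n = 0.0, 0
        with torch.no_grad():
            for x, y in loader:
                total += loss_fn(probe(x), y).item() * x.shape[0]
                n += x.shape[0]
        losses.append(total / n)  # average loss at this point of the path
    return losses
```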
5.3 Study on the post-synaptic potential
In Sec. 5.2 we have observed that iterative strategies end up in well-generalizing sharp minima. Is there something else yet to be said about them?
Let us inspect the average magnitude of the PSPs for the different solutions found: towards this end, we can plot the average of their L2 norms, $\|z_{l,i}\|_2$. As a first finding, gradually-pruned architectures naturally exhibit lower PSP L2-norm values, as we observe in Fig. 3. Neither of the used pruning strategies explicitly minimizes a PSP-norm term in the loss: they naturally drive the learning towards such regions. Moreover, the solution showing better generalization capability shows lower $\|z_{l,i}\|_2$ values. Of course, there are regions with even lower $\|z_{l,i}\|_2$ values; however, according to Fig. 2(a), they should be excluded since they correspond to high-loss configurations (not all the low-norm regions are low-loss).
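As an indication of how such a quantity can be measured, a small helper computing the average PSP L2 norm over a set of inputs is sketched below (names and interface are illustrative assumptions).

```python
import torch

def mean_psp_norm(psp_batches):
    """Average L2 norm of the post-synaptic potentials z of Eq. (2),
    computed over an iterable of (batch_size, num_neurons) PSP tensors."""
    norms = [z.norm(p=2, dim=1) for z in psp_batches]
    return torch.cat(norms).mean().item()
```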
If we look at the PSP-entropy formulated in (5), we observe something interesting: gradual and one-shot pruning show comparable first-order entropies, as shown in Fig. 4(a) (the source code for PSP-entropy is available at https://github.com/EIDOSlab/PSPentropy.git). It is interesting to see that there are also lower-entropy regions, which however correspond to higher loss values, according to Fig. 2(a). When we move to higher-order entropies, something even more interesting happens: gradual pruning shows higher entropy than one-shot pruning, as depicted in Fig. 4(b) (displaying the second-order entropy). In this setting, having a lower entropy means having more groups of neurons specializing to specific patterns that correlate with the target class; on the contrary, having higher entropy while showing better generalization performance means extracting more general features, more agnostic towards a specific class, which still allow a correct classification by the output layer. This counter-intuitive finding has potentially huge applications in transfer learning and domain adaptation, where it is critical to extract general features that are not overly specific to the originally-trained problem.
6 Conclusion
In this work we have compared one-shot and gradual pruning on different state-of-the-art architectures and datasets. In particular, we have focused our attention on understanding the potential differences and limits of both approaches towards achieving sparsity in ANN models.
We have observed that one-shot strategies are very efficient at achieving moderate sparsity at a lower computational cost. However, there is a limit to the maximum achievable sparsity, which can be overcome using gradual pruning. Interestingly, the resulting highly-sparse architectures converge to a subset of sharp minima that are able to generalize well, which raises some questions on the potential sub-optimality of second-order optimization in such scenarios. This explains why we observe that one-shot strategies fail to recover the performance at high compression rates. More importantly, we have observed, contrary to what might be expected, that highly-sparse gradually-pruned architectures are able to extract general features that are not strictly correlated to the trained classes, unexpectedly making them, potentially, a good match for transfer-learning scenarios.
Future works include a quantitative study on transfer learning for sparse architectures and PSP-entropy-maximization-based learning.
References
[1] (2014) Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pp. 2654-2662.
[2] (2010) Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pp. 177-186.
[3] (2016) Entropy-SGD: biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838.
[4] (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pp. 1269-1277.
[5] (2018) Essentially no barriers in neural network energy landscape. arXiv preprint arXiv:1803.00885.
[6] (2019) The lottery ticket hypothesis: finding sparse, trainable neural networks.
[7] (2018) pytorch-hessian-eigenthings: efficient PyTorch Hessian eigendecomposition.
[8] (2014) Qualitatively characterizing neural network optimization problems. arXiv preprint arXiv:1412.6544.
[9] (2015) Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135-1143.
[10] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
[11] (2016) On large-batch training for deep learning: generalization gap and sharp minima. arXiv preprint arXiv:1609.04836.
[12] (1990) Optimal brain damage. In Advances in Neural Information Processing Systems, pp. 598-605.
[13] (2016) Pruning filters for efficient ConvNets. arXiv preprint arXiv:1608.08710.
[14] (2018) Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270.
[15] (2017) Learning sparse neural networks through $L_0$ regularization. arXiv preprint arXiv:1712.01312.
[16] (2017) ThiNet: a filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5058-5066.
[17] (2017) Variational dropout sparsifies deep neural networks. In 34th International Conference on Machine Learning, ICML 2017, Vol. 5, pp. 3854-3863.
[18] (2020) Comparing rewinding and fine-tuning in neural network pruning. arXiv preprint arXiv:2003.02389.
[19] (2019) Take a ramble into solution spaces for classification problems in neural networks. In International Conference on Image Analysis and Processing, pp. 345-355.
[20] (2018) Learning sparse neural networks via sensitivity-driven regularization. In Advances in Neural Information Processing Systems, pp. 3878-3888.
[21] (2019) Post-synaptic potential regularization has potential. In International Conference on Artificial Neural Networks, pp. 187-200.
[22] (2020) LOss-Based SensiTivity rEgulaRization: towards deep sparse neural networks. https://iris.unito.it/retrieve/handle/2318/1737767/608158/ICML20.pdf
[23] (2019) Soft weight-sharing for neural network compression. In 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings.
[24] (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR abs/1708.07747.
[25] (2017) To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878.
[26] (2010) Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 2595-2603.