1 Introduction and Related Work
Neural network pruning techniques demonstrate that the function learned by a neural network can often be represented with significantly fewer parameters (sometimes fewer than 10% of the original), without compromising performance, by selecting a small subnetwork of the initial model.
The Lottery Ticket Hypothesis, introduced by Frankle & Carbin (2019), postulates that these successful subnetworks are determined not only by their connection structure but by their initial weights as well. Once the network is initialised, some subnetworks hold a winning ticket, i.e., they can be trained in isolation and achieve performance similar to that of the fully trained network.
Winning tickets have been successfully identified in a variety of learning scenarios, e.g. Frankle & Carbin (2019), Frankle et al. (2019), Mehta (2019). Morcos et al. (2019) even showed that, to some extent, winning tickets can be transferred across different tasks and optimizers.
Frankle & Carbin (2019) perform pruning based on the magnitude of the weights of a trained network, using one or more iterations of training, pruning, and reverting to the initial weights. We refer to this algorithm as LTH. Several alternative pruning criteria based on the magnitudes of trained and initial weights are explored in Zhou et al. (2019).
Lee et al. (2019) introduced two fundamental differences to network pruning. First, pruning is performed on an initialised but untrained network. Second, the pruning criterion incorporates the gradient of the loss function with respect to the weights. We call this algorithm SNIP.

Our paper argues that the more complex pruning criterion of SNIP can be successfully combined with the more complex training-based setup of LTH to yield a pruning algorithm that is superior to both. We argue that the key benefit of the SNIP criterion is that it is data-dependent.
2 Pruning Variants
The difference between LTH and SNIP can be factored into two components, which can be recombined into four different pruning algorithms.
The first factor is when pruning happens:

Training-based pruning makes decisions based on trained weights. We use the iterative version of LTH: in each iteration, we 1) fully train the network, 2) delete a fixed fraction of the weights, then 3) revert the remaining weights to their initial values.

Initialisation-based pruning makes decisions based on the untrained weights.
The second factor is the pruning criterion:

Magnitude pruning orders weights according to their absolute value and deletes the bottom fraction.

Gradient-sensitive pruning makes use of some training data D. For each weight w_i, we compute the average absolute gradient of the loss with respect to the weight: g_i = (1/|D|) Σ_{(x,y)∈D} |∂L(x,y)/∂w_i|. Next, we order weights based on |w_i| · g_i and delete the bottom fraction.
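The two criteria above can be viewed as scoring functions over the weights, followed by a single thresholding step. The following is a minimal NumPy sketch under that reading; the helper names and example values are ours, for illustration only:

```python
import numpy as np

def magnitude_scores(weights):
    """Magnitude criterion: score each weight by its absolute value."""
    return np.abs(weights)

def gradient_sensitive_scores(weights, avg_abs_grads):
    """Gradient-sensitive criterion: |w_i| times the average absolute
    gradient g_i of the loss with respect to w_i over the data."""
    return np.abs(weights) * avg_abs_grads

def prune_mask(scores, fraction):
    """Delete the bottom `fraction` of weights by score (fraction < 1)."""
    k = int(fraction * scores.size)          # number of weights to delete
    threshold = np.sort(scores.ravel())[k]   # smallest surviving score
    return scores >= threshold

# Toy example: four weights and their average absolute gradients.
w = np.array([0.5, -0.01, 0.3, -0.2])
g = np.array([0.0,  2.0,  0.1,  0.5])

mask_mag = prune_mask(magnitude_scores(w), 0.5)
mask_grad = prune_mask(gradient_sensitive_scores(w, g), 0.5)
print(mask_mag)   # [ True False  True False]
print(mask_grad)  # [False False  True  True]
```

Note how the large weight w[0] survives magnitude pruning but is deleted by the gradient-sensitive criterion, because its gradient is zero.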
The possible combinations of these two factors are:

Train-w: training-based, magnitude pruning: LTH.

Train-wg: training-based, gradient-sensitive pruning: a novel approach that we argue works best.

Init-w: initialisation-based, magnitude pruning: novel.

Init-wg: initialisation-based, gradient-sensitive pruning: SNIP.
3 Analysis
The ordering criterion of weights for magnitude pruning is |w_i|, while for gradient-sensitive pruning this is multiplied by the average absolute gradient of the loss: |w_i| · g_i. We aim to better understand the role of the gradient component.
Gradient-sensitive pruning is more complex: weights can be deleted either because they are small or because their contribution to the final loss is small. As we shall see in Figure 3 and Figure 4, more small weights indeed survive pruning under the gradient-sensitive criterion. Conversely, more large weights are deleted because their gradients are small.
The extreme case is when a weight has a zero corresponding gradient, i.e., the weight simply does not contribute to the loss. The weights of a ReLU neuron that never activates are an example of this. Gradient-sensitive pruning very reasonably deletes such weights, regardless of their magnitude.
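The dead-ReLU case can be checked directly. In the sketch below (our own toy construction, not the paper's setup), a unit with a strongly negative bias never activates, so the gradient of any downstream loss with respect to its input weights is identically zero, however large those weights are:

```python
import numpy as np

# A single ReLU unit y = relu(x @ w + b). With b very negative, the unit
# never activates on the data, so dL/dw = 0 for every input: the weights
# contribute nothing to the loss and gradient-sensitive pruning removes them.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w = np.array([0.9, -0.4, 0.7])   # large weights, but attached to a dead unit
b = -100.0                        # pre-activation is always negative

pre = X @ w + b
active = (pre > 0).astype(float)  # ReLU derivative: 1 where pre > 0, else 0
# The gradient of a downstream loss wrt w is (upstream grad) * active * x;
# with a dead unit, `active` is all zeros, so the average |gradient| is zero.
avg_abs_grad = np.mean(np.abs(active[:, None] * X), axis=0)
print(avg_abs_grad)   # [0. 0. 0.] despite |w| being large
```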
The crucial property of gradient-sensitive pruning is that it takes the dataset into consideration. The magnitude of a weight might be a good data-independent heuristic for assessing the usefulness of a connection, while the gradient provides a data-dependent alternative. A good combination of the two likely yields an ordering criterion that surpasses both; the product |w_i| · g_i is one such combination.¹ We shall see in Figure 1 that gradient-sensitive pruning indeed yields higher test accuracy.

¹ One could envision other, more refined combinations, which we leave for future work.

4 Experiments
We run experiments in the OpenLTH framework (Frankle, 2020), using the CIFAR-10 (Krizhevsky et al.) dataset and the standard VGG11, VGG16 (Simonyan & Zisserman, 2015) and ResNet-20 (He et al., 2016) architectures. We use the default hyperparameters of the OpenLTH framework: SGD optimizer with learning rate 0.1, momentum 0.1 and weight decay 0.0001. We train the VGG networks for 60 epochs and ResNet-20 for 80 epochs; the learning rate drops to 0.01 at the 40th and 60th epochs, respectively.

For training-based pruning, we perform 7 iterations, each removing a fixed fraction of the weights. For gradient-sensitive pruning, we compute the average gradients using the whole training set.²

² Note that this is different from Lee et al. (2019), which computes gradients using a single batch of data.
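Since each iteration of training-based pruning removes a fixed fraction of the surviving weights, the overall sparsity compounds geometrically over the 7 iterations. A one-line sketch of this arithmetic, with a purely hypothetical per-round rate p = 0.5 (the rate used in the experiments is not restated here):

```python
# Iterative pruning removes a fixed fraction p of the surviving weights in
# each round, so after k rounds a (1 - p)**k fraction of the original
# weights remains. p = 0.5 is a hypothetical value for illustration only.
def remaining_fraction(p: float, k: int) -> float:
    return (1.0 - p) ** k

print(remaining_fraction(0.5, 7))  # 0.0078125, i.e. under 1% survives
```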
Our charts mark the mean of 5 runs, and the transparent regions mark the standard deviation.
Figure 1 and Figure 2 show the differences in accuracy for the four strategies introduced in Section 2 for VGG11 and VGG16. Training-based pruning methods clearly outperform initialisation-based methods by a large margin, and gradient-sensitive methods outperform magnitude methods by a smaller margin. At lower pruning levels, the difference between the two pruning criteria is larger.
Gradient-sensitive pruning results in higher test accuracy. To get a deeper insight into this result, we look at layerwise differences in the training-based setting. Figure 5 shows that gradient-sensitive pruning eliminates slightly more weights from the first layers and significantly more from the last layers, keeping more parameters in the middle layers. At extreme sparsity, we can see from Figure 6 that gradient-sensitive pruning keeps very few weights in the last layers: at the highest pruning levels, fewer than 1000 weights remain in the last convolutional layer, and the classifying dense layer undergoes even more significant pruning.
Diving deeper, we focus on the last convolutional layer in the training-based setups and examine histograms of weights, gradients, and weights times gradients for magnitude pruning (Figure 3) and gradient-sensitive pruning (Figure 4).
The left columns show the distribution of weights. Notice that pruning leaves a “hole” in the histogram of subsequent iterations around zero. This means that after we remove small weights, reverting and retraining barely yields new small weights. As a result, subsequent iterations prune increasingly larger weights.
The hole is less pronounced for gradient-sensitive pruning, which is expected, since some of the small weights survive pruning due to having larger gradients. On the other hand, for gradient-sensitive pruning, holes appear in the product of the weights and the gradients (right side of Figure 4), since this is the pruning criterion. Note, however, that these holes are less rigid than their magnitude-based counterparts and disappear within fewer iterations. This is because, while the trained weights do not seem to be greatly affected by pruning, their gradients adapt more easily.
Finally, the middle columns of Figures 3 and 4 reveal the distribution of the average absolute values of the gradients. At the end of the first training, we see a large number of weights with zero gradients. Gradient-sensitive pruning removes all of them and yields rather healthy gradient histograms in later iterations. Magnitude-based pruning, on the other hand, keeps these weights, even though they are completely useless. This shows that the magnitude of a weight is not a good proxy for the magnitude of its gradient.
ResNet-20

Figure 7 shows the training accuracies on ResNet-20. Interestingly, the network is much less resistant to pruning, which we conjecture is due to its higher parameter efficiency. Furthermore, there is no significant difference between magnitude and gradient-sensitive pruning. The residual connections allow for a healthier gradient flow, as we can see from Figures 8 and 9: there are far fewer parameters with extremely small gradients. A deeper understanding of the pruning behaviour of residual networks is left for future work.

5 Conclusion and Future Work
Our work compares four network pruning algorithms that are combinations of a simple (initialisation-based) and a complex (training-based) pruning setup on the one hand, and a simple (magnitude-based) and a complex (gradient-sensitive) pruning criterion on the other. We show that training-based approaches surpass initialisation-based approaches and that gradient-sensitive approaches surpass magnitude-based approaches.
A key takeaway from our work is the benefit of a data-dependent pruning criterion, as manifested in the gradient-sensitive pruning scenarios.
As future work, we note that the gradient-sensitive formula used in Lee et al. (2019) is one solution, but not necessarily the best. For example, we could introduce an extra exponent parameter α and order by |w_i| · g_i^α. Another promising direction is to find a good interpolation between initialisation-based and training-based pruning. A lot of evidence suggests that training allows for better pruning; however, it may not be necessary to fully train the network to make good pruning decisions. This promises to match the performance of training-based variants without the large computational overhead. Furthermore, we are interested to see whether there are systematic differences between the subnetworks selected by different pruning strategies.
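The exponent-parameterised criterion suggested above can be sketched directly; the function below is a hypothetical illustration of ours, not an implementation from the paper:

```python
import numpy as np

# Hypothetical generalised criterion: score_i = |w_i| * g_i**alpha.
# alpha = 0 recovers pure magnitude pruning (g**0 == 1), while alpha = 1
# recovers the product criterion |w_i| * g_i; intermediate values
# interpolate between the two.
def scores(w, g, alpha):
    return np.abs(w) * g ** alpha

w = np.array([0.5, -0.01, 0.3])
g = np.array([0.0,  2.0,  0.1])   # average absolute gradients
print(scores(w, g, 0.0))  # magnitude only: [0.5, 0.01, 0.3]
print(scores(w, g, 1.0))  # product criterion: [0., 0.02, 0.03]
```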
Figure 6 already reveals differences in the layerwise number of remaining weights, and we are intrigued to find more complex patterns.

6 Acknowledgments
This work has been supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.3-VEKOP-16-2017-00002), as well as by the Hungarian National Excellence Grant 2018-1.2.1-NKP-00008.
References
 Frankle (2020) Frankle, J. OpenLTH: A Framework for Lottery Tickets and Beyond, 2020. URL https://github.com/facebookresearch/open_lth.
 Frankle & Carbin (2019) Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJlb3RcF7.
 Frankle et al. (2019) Frankle, J., Dziugaite, G. K., Roy, D. M., and Carbin, M. The lottery ticket hypothesis at scale. CoRR, abs/1903.01611, 2019. URL http://arxiv.org/abs/1903.01611.

 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778. IEEE Computer Society, 2016. doi: 10.1109/CVPR.2016.90. URL https://doi.org/10.1109/CVPR.2016.90.
 Krizhevsky et al. Krizhevsky, A., Nair, V., and Hinton, G. CIFAR-10 (Canadian Institute for Advanced Research). URL http://www.cs.toronto.edu/~kriz/cifar.html.
 Lee et al. (2019) Lee, N., Ajanthan, T., and Torr, P. SNIP: Single-shot Network Pruning based on Connection Sensitivity. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1VZqjAcYX.
 Mehta (2019) Mehta, R. Sparse transfer learning via winning lottery tickets. ArXiv, abs/1905.07785, 2019.
 Morcos et al. (2019) Morcos, A., Yu, H., Paganini, M., and Tian, Y. One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 4932–4942. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/8739-one-ticket-to-win-them-all-generalizing-lottery-ticket-initializations-across-datasets-and-optimizers.pdf.
 Simonyan & Zisserman (2015) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Bengio, Y. and LeCun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1409.1556.
 Zhou et al. (2019) Zhou, H., Lan, J., Liu, R., and Yosinski, J. Deconstructing lottery tickets: Zeros, signs, and the supermask. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 3597–3607. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/8618-deconstructing-lottery-tickets-zeros-signs-and-the-supermask.pdf.