1 Introduction and Related Work
Neural network pruning techniques demonstrate that the function learned by a neural network can often be represented with significantly fewer parameters (even fewer than 10% of them), without compromising performance, by selecting a small subnetwork of the initial model.
The Lottery Ticket Hypothesis, introduced in Frankle & Carbin (2019), postulates that these successful subnetworks are determined not only by their connection structure, but also by their initial weights. Once the network is initialised, some subnetworks receive a winning ticket, i.e., they can be trained in isolation and achieve performance similar to that of the fully trained network.
Winning tickets have been successfully identified in a variety of learning scenarios, e.g. Frankle & Carbin (2019), Frankle et al. (2019), Mehta (2019). Morcos et al. (2019) even showed that to some extent winning tickets can be transferred across different tasks and optimizers.
Frankle & Carbin (2019) perform pruning based on the magnitude of the weights of a trained network, using one or more iterations of training, pruning and reverting to the initial weights. We refer to this algorithm as LTH. Several alternative pruning criteria based on the magnitude of trained and initial weights are explored in Zhou et al. (2019).
Lee et al. (2019) introduced two fundamental changes to network pruning. First, pruning is performed on an initialised, but untrained network. Second, the pruning criterion incorporates the gradient of the loss function with respect to the weights. We refer to this algorithm as SNIP.
Our paper argues that the more complex pruning criterion of SNIP can be successfully combined with the more complex training-based setup of LTH to yield a pruning algorithm that is superior to both. We argue that the key benefit of the SNIP criterion is that it is data dependent.
2 Pruning Variants
The difference between LTH and SNIP can be factored into two components, which can be recombined into four different pruning algorithms.
The first factor is when pruning happens:
Training-based pruning makes decisions based on trained weights. We use the iterative version of LTH: in each iteration, we 1) fully train the network, 2) delete a fixed fraction of the weights, then 3) revert the remaining weights to their initial values.
Initialisation-based pruning makes decisions based on the untrained weights.
The second factor is the pruning criterion:
Magnitude pruning orders weights according to their absolute value $|w|$ and deletes the bottom fraction of them.
Gradient-sensitive pruning makes use of some training data $D$. For each weight $w$, we compute the average absolute gradient of the loss $L$ with respect to the weight: $g_w = \frac{1}{|D|} \sum_{x \in D} \left| \frac{\partial L(x)}{\partial w} \right|$. Next, we order weights based on $|w| \cdot g_w$ and delete the bottom fraction of them.
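The two criteria can be illustrated with a minimal sketch. It assumes a toy linear model with squared loss, so that the gradient has a closed form; the function names, the data and the pruning fraction are all hypothetical and are not taken from our experimental code.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], float]  # (input vector, target); toy data only


def avg_abs_gradients(weights: Sequence[float], data: Sequence[Sample]) -> List[float]:
    """Mean over the data of |dL/dw| for the toy loss L = 0.5 * (w.x - y)^2."""
    totals = [0.0] * len(weights)
    for x, y in data:
        error = sum(w * xi for w, xi in zip(weights, x)) - y
        for i, xi in enumerate(x):
            totals[i] += abs(error * xi)  # dL/dw_i = error * x_i
    return [t / len(data) for t in totals]


def magnitude_scores(weights: Sequence[float]) -> List[float]:
    """Ordering criterion of magnitude pruning: |w|."""
    return [abs(w) for w in weights]


def gradient_sensitive_scores(weights: Sequence[float], data: Sequence[Sample]) -> List[float]:
    """Ordering criterion of gradient-sensitive pruning: |w| times the average |gradient|."""
    grads = avg_abs_gradients(weights, data)
    return [abs(w) * g for w, g in zip(weights, grads)]


def keep_mask(scores: Sequence[float], fraction: float) -> List[bool]:
    """Keep-mask that deletes the `fraction` of entries with the lowest scores."""
    n_drop = int(len(scores) * fraction)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    dropped = set(order[:n_drop])
    return [i not in dropped for i in range(len(scores))]


# Hypothetical weights and data: the first input feature is always small,
# so the large first weight receives only a small gradient.
weights = [2.0, 0.1, -1.0, 0.5]
data = [((0.1, 1.0, 1.0, 2.0), 1.0), ((0.1, 2.0, -1.0, 1.0), 0.0)]

mask_w = keep_mask(magnitude_scores(weights), 0.5)
mask_wg = keep_mask(gradient_sensitive_scores(weights, data), 0.5)
print(mask_w)   # magnitude pruning keeps the two largest weights
print(mask_wg)  # gradient-sensitive pruning drops the large but insensitive weight
```

In this toy setting the two criteria disagree: the largest weight is deleted by the gradient-sensitive criterion because its gradient is small, while a smaller weight with a large gradient survives.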
The possible combinations of these two factors:
Train-w: Training-based, magnitude pruning: LTH.
Train-wg: Training-based, gradient-sensitive pruning: a novel approach that we argue works best.
Init-w: Initialisation-based, magnitude pruning: novel.
Init-wg: Initialisation-based, gradient-sensitive pruning: SNIP.
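The four combinations can be sketched as a single routine with two switches. This is only an illustration under stated assumptions: `find_mask`, `toy_train` and `toy_avg_abs_grad` are hypothetical names, the training and gradient functions are stand-ins rather than a real network, and initialisation-based pruning is assumed here to prune once.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# variant name -> (training-based?, gradient-sensitive?)
VARIANTS: Dict[str, Tuple[bool, bool]] = {
    "Train-w": (True, False),
    "Train-wg": (True, True),
    "Init-w": (False, False),
    "Init-wg": (False, True),
}


def find_mask(
    variant: str,
    init_weights: Sequence[float],
    train: Callable[[List[float], List[bool]], List[float]],
    avg_abs_grad: Callable[[List[float]], List[float]],
    fraction: float,
    iterations: int = 1,
) -> List[bool]:
    """Return a keep-mask for one of the four pruning variants."""
    training_based, use_gradient = VARIANTS[variant]
    mask = [True] * len(init_weights)
    # Assumption: initialisation-based variants prune in a single step.
    rounds = iterations if training_based else 1
    for _ in range(rounds):
        # Surviving weights always restart from their initial values.
        weights = [w if keep else 0.0 for w, keep in zip(init_weights, mask)]
        if training_based:
            weights = train(weights, mask)
        scores = [abs(w) for w in weights]
        if use_gradient:
            scores = [s * g for s, g in zip(scores, avg_abs_grad(weights))]
        alive = [i for i, keep in enumerate(mask) if keep]
        n_drop = int(len(alive) * fraction)
        for i in sorted(alive, key=lambda j: scores[j])[:n_drop]:
            mask[i] = False
    return mask


# Hypothetical stand-ins: "training" rescales each weight by a fixed factor,
# and the average absolute gradient is a constant vector.
FACTORS = [1.0, 1.0, 10.0, 10.0]


def toy_train(weights: List[float], mask: List[bool]) -> List[float]:
    return [w * f if keep else 0.0 for w, f, keep in zip(weights, FACTORS, mask)]


def toy_avg_abs_grad(weights: List[float]) -> List[float]:
    return [1.0, 0.0, 1.0, 0.5]


init = [0.4, -0.3, 0.2, 0.1]
masks = {name: find_mask(name, init, toy_train, toy_avg_abs_grad, fraction=0.5)
         for name in VARIANTS}
for name, mask in masks.items():
    print(name, mask)
```

Because every round restarts from `init_weights`, the revert step of the iterative procedure is implicit: only the mask is carried over between rounds.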
The ordering criterion for magnitude pruning is $|w|$, while for gradient-sensitive pruning this is multiplied by the average absolute gradient of the loss with respect to the weight, $g_w$, giving $|w| \cdot g_w$. We aim to better understand the role of the gradient component.
Gradient-sensitive pruning is more complex: weights can be deleted either because they are small or because their contribution to the final loss is small. As we shall see in Figure 3 and Figure 4, there are indeed more small weights that survive the pruning under the gradient-sensitive criterion. Conversely, more large weights are deleted because their gradients are small.
The extreme case is when a weight has zero corresponding gradient, i.e., the weight simply does not contribute to the loss. The weights of a ReLU neuron that never activates are examples of this. Gradient-sensitive pruning very reasonably deletes such weights, regardless of their magnitude.
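The extreme case can be made concrete with a small sketch of a single ReLU neuron. The function name, the inputs and the bias values are hypothetical, and the upstream gradient is assumed to be a constant.

```python
from typing import List, Sequence


def relu_weight_avg_abs_grad(
    weights: Sequence[float],
    bias: float,
    inputs: Sequence[Sequence[float]],
    upstream_grad: float = 1.0,
) -> List[float]:
    """Average |dL/dw_i| for out = relu(w.x + b), given a constant upstream gradient."""
    totals = [0.0] * len(weights)
    for x in inputs:
        pre_activation = sum(w * xi for w, xi in zip(weights, x)) + bias
        if pre_activation > 0:  # ReLU passes gradient only when the neuron is active
            for i, xi in enumerate(x):
                totals[i] += abs(upstream_grad * xi)
    return [t / len(inputs) for t in totals]


weights = [3.0, -2.5]  # large in magnitude
inputs = [(1.0, 0.5), (0.2, 0.1), (2.0, 1.0)]

# A strongly negative bias keeps the neuron inactive on every input.
dead_grads = relu_weight_avg_abs_grad(weights, bias=-10.0, inputs=inputs)
alive_grads = relu_weight_avg_abs_grad(weights, bias=0.0, inputs=inputs)

dead_scores = [abs(w) * g for w, g in zip(weights, dead_grads)]
print(dead_grads, dead_scores)  # all zeros, despite the large weights
print(alive_grads)
```

The weights of the inactive neuron rank high under magnitude pruning, yet their gradient-sensitive score is zero, so they are among the first to be deleted.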
The crucial property of gradient-sensitive pruning is that it takes the dataset into consideration. The magnitude of a weight might be a good data-independent heuristic for assessing the usefulness of a connection, while the gradient provides a data-dependent alternative heuristic. A good combination of the two likely yields an ordering criterion that surpasses both. The product of the two is one such combination. (One could envision other, more refined combinations, which we leave for future work.) We shall see in Figure 1 that gradient-sensitive pruning indeed yields higher test accuracy.
We run experiments in the OpenLTH framework (Frankle, 2020), using the CIFAR-10 dataset (Krizhevsky et al.) and the standard VGG-11, VGG-16 (Simonyan & Zisserman, 2015) and ResNet20 (He et al., 2016) architectures. We use the default hyperparameters of the OpenLTH framework: SGD optimizer with learning rate 0.1, momentum 0.1 and weight decay 0.0001. We train the VGG networks for 60 epochs and ResNet20 for 80 epochs; the learning rate drops to 0.01 at the 40th and 60th epoch, respectively.
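The learning rate schedule described above can be written down as a small sketch. The dictionary layout and the function name are hypothetical, and we assume zero-based epoch counting with the drop taking effect from the stated epoch onwards.

```python
# Training lengths and drop points as stated in the text; the layout is illustrative.
SCHEDULES = {
    "vgg": {"epochs": 60, "drop_epoch": 40},
    "resnet20": {"epochs": 80, "drop_epoch": 60},
}


def learning_rate(epoch: int, drop_epoch: int,
                  base_lr: float = 0.1, dropped_lr: float = 0.01) -> float:
    """Step schedule: base_lr before drop_epoch, dropped_lr from drop_epoch onwards."""
    return base_lr if epoch < drop_epoch else dropped_lr


for name, cfg in SCHEDULES.items():
    rates = [learning_rate(e, cfg["drop_epoch"]) for e in range(cfg["epochs"])]
    print(name, rates[0], rates[-1], rates.count(0.1), rates.count(0.01))
```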
For training-based pruning, we perform 7 iterations, each removing a fixed fraction of the weights. For gradient-sensitive pruning, we compute the average gradients using the whole training set. (Note that this differs from Lee et al. (2019), who compute gradients using a single batch of data.)
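To relate the number of iterations to the resulting sparsity, the following sketch computes the share of weights that remains when the same fraction of the surviving weights is removed in every iteration. The fraction values in the loop are illustrative assumptions, not the setting used in our experiments.

```python
def remaining_fraction(per_iteration_fraction: float, iterations: int) -> float:
    """Share of weights left after repeatedly removing a fixed fraction of the survivors."""
    return (1.0 - per_iteration_fraction) ** iterations


# Illustrative per-iteration fractions only (assumed values).
for fraction in (0.2, 0.5):
    for iteration in range(1, 8):
        print(fraction, iteration, remaining_fraction(fraction, iteration))
```

The remaining share decays geometrically, so a handful of iterations already leads to a small subnetwork.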
Our charts mark the mean of 5 runs, and the transparent regions mark the standard deviation.
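A minimal sketch of how such a mean curve and its deviation band can be computed from repeated runs follows. The accuracy values are made up for illustration, and the use of the population standard deviation is an assumption (the sample standard deviation, `statistics.stdev`, would be an equally plausible choice).

```python
import statistics
from typing import List, Sequence, Tuple


def summarise_runs(runs: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-pruning-level mean and standard deviation across runs.

    `runs[r][k]` is the accuracy of run r at pruning level k.
    """
    per_level = list(zip(*runs))  # regroup: one tuple of run results per level
    means = [statistics.mean(level) for level in per_level]
    stds = [statistics.pstdev(level) for level in per_level]
    return means, stds


# Hypothetical accuracies: 2 runs, 2 pruning levels.
means, stds = summarise_runs([[1.0, 2.0], [3.0, 2.0]])
print(means, stds)
```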
Figure 1 and Figure 2 show the differences in accuracy for the four strategies introduced in Section 2 for VGG-11 and VGG-16. Training-based pruning methods clearly outperform initialisation-based methods by a large margin, and gradient-sensitive methods outperform magnitude methods by a smaller margin. At lower pruning levels (), the difference of the two pruning criteria is larger.
Gradient-sensitive pruning results in higher test accuracy. To get a deeper insight into this result, we look at layerwise differences in the training-based setting. Figure 5 shows that gradient-sensitive pruning eliminates slightly more weights from the first layers and significantly more from the last layers, keeping more parameters in the middle layers. At extreme sparsity, we can see from Figure 6 that gradient-sensitive pruning keeps very few weights from the last layers: above
pruning, fewer than 1000 weights remain in the last convolutional layer, and the classifying dense layer undergoes even more significant pruning.
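Layerwise statistics of this kind can be read directly off the pruning masks. The sketch below assumes that masks are stored as one boolean list per layer; the layer names and mask values are hypothetical.

```python
from typing import Dict, List


def remaining_per_layer(masks: Dict[str, List[bool]]) -> Dict[str, int]:
    """Number of surviving weights in each layer."""
    return {name: sum(mask) for name, mask in masks.items()}


def kept_ratio_per_layer(masks: Dict[str, List[bool]]) -> Dict[str, float]:
    """Share of each layer's weights that survives pruning."""
    return {name: sum(mask) / len(mask) for name, mask in masks.items()}


# Hypothetical masks for three tiny layers.
masks = {
    "first_conv": [True, True, False, True],
    "last_conv": [False, False, False, True],
    "dense": [False, True],
}
counts = remaining_per_layer(masks)
ratios = kept_ratio_per_layer(masks)
print(counts)
print(ratios)
```

Comparing these dictionaries for masks obtained with the two criteria shows which layers each criterion prunes more heavily.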
Diving deeper, we focus on the last convolutional layer in the training-based setups and examine histograms of weights, gradients and weights times gradients for magnitude pruning (Figure 3) and gradient-sensitive pruning (Figure 4).
The left columns show the distribution of weights. Notice that pruning leaves a “hole” in the histogram of subsequent iterations around zero. This means that after we remove small weights, reverting and retraining barely yields new small weights. As a result, subsequent iterations prune increasingly larger weights.
The hole is less accentuated for gradient-sensitive pruning, which is expected, since some of the small weights survive pruning due to having larger gradients. On the other hand, holes appear for gradient-sensitive pruning in the product of the weights and the gradients (right side of Figure 4), since this is the pruning criterion. Note, however, that these holes are less rigid than their magnitude-based counterparts and disappear in fewer iterations. This is because, while the trained weights do not seem to be greatly affected by pruning, their gradients adapt more easily.
Finally, the middle columns of Figures 3 and 4 reveal the distribution of the average absolute values of the gradients. At the end of the first training, we see a large number of weights with zero gradients. Gradient-sensitive pruning removes all of them and yields rather healthy gradient histograms in later iterations. Magnitude-based pruning, on the other hand, keeps these weights, even though they are completely useless. This shows that the magnitude of a weight is not a good proxy for the magnitude of the gradient.
Figure 7 shows the training accuracies on ResNet20. Interestingly, the network is much less resistant to pruning, which we conjecture is due to its higher parameter efficiency. Furthermore, there is no significant difference between magnitude and gradient-sensitive pruning. The residual connections allow for healthier gradient flow, which we can see from Figures 8 and 9: there are far fewer parameters with extremely small gradients. Deeper understanding of the pruning behaviour of residual networks is left for future work.
5 Conclusion and Future Work
Our work compares four network pruning algorithms that combine a simple (initialisation-based) or a complex (training-based) pruning setup with a simple (magnitude-based) or a complex (gradient-sensitive) pruning criterion. We show that training-based approaches surpass initialisation-based approaches and that gradient-sensitive approaches surpass magnitude-based approaches.
A key takeaway message from our work is the benefit of a data dependent pruning criterion as manifested in the gradient-sensitive pruning scenarios.
As future work, we note that the gradient-sensitive formula used in Lee et al. (2019) is one, but not necessarily the best, solution. For example, we could introduce an extra exponent parameter into the ordering criterion. Another promising direction is to find a good interpolation between initialisation-based and training-based pruning. A lot of evidence suggests that training allows for better pruning; however, it may not be necessary to fully train the network to make good pruning decisions. This promises to match the performance of training-based variants without the large computational overhead. Furthermore, we are interested to see if there are systematic differences between the subnetworks selected by different pruning strategies. Figure 6 already reveals differences in the layerwise number of remaining weights, and we are intrigued to find more complex patterns.
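One possible form of the exponent idea mentioned above is sketched below. Placing the exponent on the gradient term is our assumption for illustration only; the function name and the parameter `alpha` are hypothetical, and other placements are equally conceivable.

```python
from typing import List, Sequence


def exponent_scores(weights: Sequence[float], avg_abs_grads: Sequence[float],
                    alpha: float) -> List[float]:
    """Assumed generalised criterion |w| * g**alpha, intended for alpha >= 0.

    alpha = 0 recovers magnitude pruning (Python evaluates 0.0 ** 0.0 as 1.0),
    alpha = 1 recovers the product of weight magnitude and gradient.
    """
    return [abs(w) * g ** alpha for w, g in zip(weights, avg_abs_grads)]


weights = [2.0, -0.5]
grads = [0.0, 4.0]  # hypothetical average absolute gradients
for alpha in (0.0, 0.5, 1.0):
    print(alpha, exponent_scores(weights, grads, alpha))
```

Intermediate values of `alpha` interpolate between the two criteria compared in this paper, which makes the exponent a natural hyperparameter to explore.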
This work has been supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.3-VEKOP-16-2017-00002), as well as by the Hungarian National Excellence Grant 2018-1.2.1-NKP-00008.
- Frankle (2020) Frankle, J. OpenLTH: A Framework for Lottery Tickets and Beyond, 2020. URL https://github.com/facebookresearch/open_lth.
- Frankle & Carbin (2019) Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7.
- Frankle et al. (2019) Frankle, J., Dziugaite, G. K., Roy, D. M., and Carbin, M. The lottery ticket hypothesis at scale. CoRR, abs/1903.01611, 2019. URL http://arxiv.org/abs/1903.01611.
- He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016. doi: 10.1109/CVPR.2016.90. URL https://doi.org/10.1109/CVPR.2016.90.
- Krizhevsky et al. Krizhevsky, A., Nair, V., and Hinton, G. CIFAR-10 (Canadian Institute for Advanced Research). URL http://www.cs.toronto.edu/~kriz/cifar.html.
- Lee et al. (2019) Lee, N., Ajanthan, T., and Torr, P. SNIP: Single-shot Network Pruning based on Connection Sensitivity. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1VZqjAcYX.
- Mehta (2019) Mehta, R. Sparse transfer learning via winning lottery tickets. ArXiv, abs/1905.07785, 2019.
- Morcos et al. (2019) Morcos, A., Yu, H., Paganini, M., and Tian, Y. One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 4932–4942. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/8739-one-ticket-to-win-them-all-generalizing-lottery-ticket-initializations-across-datasets-and-optimizers.pdf.
- Simonyan & Zisserman (2015) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Bengio, Y. and LeCun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1409.1556.
- Zhou et al. (2019) Zhou, H., Lan, J., Liu, R., and Yosinski, J. Deconstructing lottery tickets: Zeros, signs, and the supermask. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 3597–3607. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/8618-deconstructing-lottery-tickets-zeros-signs-and-the-supermask.pdf.