Data-dependent Pruning to find the Winning Lottery Ticket

06/25/2020 · Dániel Lévai et al.

The Lottery Ticket Hypothesis postulates that a freshly initialized neural network contains a small subnetwork that can be trained in isolation to achieve similar performance as the full network. Our paper examines several alternatives to search for such subnetworks. We conclude that incorporating a data dependent component into the pruning criterion in the form of the gradient of the training loss – as done in the SNIP method – consistently improves the performance of existing pruning algorithms.


1 Introduction and Related Work

Neural network pruning techniques demonstrate that the function learned by a neural network can often be represented with significantly fewer parameters (even less than 10% of the original), without compromising performance, by selecting a small subnetwork of the initial model.

The Lottery Ticket Hypothesis, introduced in Frankle & Carbin (2019), postulates that these successful subnetworks are determined not only by their connection structure but by their initial weights as well. Once the network is initialised, some subnetworks receive a winning ticket, i.e., they can be trained in isolation and achieve performance similar to that of training the whole network.

Winning tickets have been successfully identified in a variety of learning scenarios, e.g. Frankle & Carbin (2019), Frankle et al. (2019), Mehta (2019). Morcos et al. (2019) even showed that to some extent winning tickets can be transferred across different tasks and optimizers.

Frankle & Carbin (2019) perform pruning based on the magnitude of the weights of a trained network, using one or more iterations of training, pruning, and resetting the remaining weights to their initial values. We refer to this algorithm as LTH. Several alternative pruning criteria based on the magnitudes of trained and initial weights are explored in Zhou et al. (2019).

Lee et al. (2019) introduced two fundamental differences to network pruning. First, pruning is performed on an initialised but untrained network. Second, the pruning criterion incorporates the gradient of the loss function with respect to the weights. We refer to this algorithm as SNIP.

Our paper argues that the more complex pruning criterion of SNIP can be successfully combined with the more complex training-based setup of LTH to yield a pruning algorithm that is superior to both. We argue that the key benefit of the SNIP criterion is that it is data dependent.

2 Pruning Variants

The difference between LTH and SNIP can be factored into two components, which can be recombined into four different pruning algorithms.

The first factor is when pruning happens:

  • Training-based pruning makes decisions based on trained weights. We use the iterative version of LTH: in each iteration, we 1) fully train the network, 2) delete a fixed fraction of the weights, then 3) revert the remaining weights to their initial values.

  • Initialisation-based pruning makes decisions based on the untrained weights.

The second factor is the pruning criterion:

  • Magnitude pruning orders weights according to their absolute value $|w|$ and deletes those at the bottom of this ordering, up to the target pruning ratio.

  • Gradient-sensitive pruning makes use of some training data $D$. For each weight $w$, we compute the average absolute gradient of the loss with respect to the weight, $\bar{g}_w = \frac{1}{|D|} \sum_{(x,y) \in D} \left| \frac{\partial L(x,y)}{\partial w} \right|$. Next, we order weights based on $|w| \cdot \bar{g}_w$ and delete those at the bottom of this ordering, up to the target pruning ratio (a code sketch of both criteria follows the list of combinations below).

The possible combinations of these two factors:

  • Train-w: Training-based, magnitude pruning: LTH.

  • Train-wg: Training-based, gradient-sensitive pruning: a novel approach that we argue works best.

  • Init-w: Initialisation-based, magnitude pruning: novel.

  • Init-wg: Initialisation-based, gradient-sensitive pruning: SNIP.
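
To make the two criteria concrete, the following PyTorch sketch (our own illustration built from the definitions above, not the authors' code) computes both scores and turns them into pruning masks. `model`, `data_loader` and `loss_fn` are assumed to be supplied by the surrounding training code, and `pruning_scores`/`prune_mask` are hypothetical helper names.

```python
import torch


def pruning_scores(model, data_loader, loss_fn, device="cpu"):
    """Per-parameter scores for the two pruning criteria.

    Magnitude:          |w|
    Gradient-sensitive: |w| * (average absolute gradient of the loss w.r.t. w),
    where the average is taken over mini-batches as an approximation of the
    per-example average described above.
    """
    model = model.to(device)
    params = [p for p in model.parameters() if p.requires_grad]
    grad_abs_sum = [torch.zeros_like(p) for p in params]
    num_batches = 0

    for x, y in data_loader:
        model.zero_grad()
        loss = loss_fn(model(x.to(device)), y.to(device))
        loss.backward()
        for acc, p in zip(grad_abs_sum, params):
            if p.grad is not None:
                acc += p.grad.detach().abs()
        num_batches += 1

    magnitude = [p.detach().abs() for p in params]
    avg_grad = [g / num_batches for g in grad_abs_sum]
    grad_sensitive = [w * g for w, g in zip(magnitude, avg_grad)]
    return magnitude, avg_grad, grad_sensitive


def prune_mask(scores, fraction):
    """Binary masks that delete the `fraction` of weights with the lowest scores."""
    flat = torch.cat([s.flatten() for s in scores])
    k = int(fraction * flat.numel())
    if k == 0:
        return [torch.ones_like(s) for s in scores]
    threshold = torch.kthvalue(flat, k).values
    return [(s > threshold).float() for s in scores]
```

In the Init-w and Init-wg variants these scores are computed right after initialisation; in the Train-w and Train-wg variants they are computed on the trained weights before rewinding.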

3 Analysis

The ordering criterion for magnitude pruning is $|w|$, while for gradient-sensitive pruning this is multiplied by the average absolute gradient of the loss: $|w| \cdot \bar{g}_w$. We aim to better understand the role of the gradient component.

Gradient-sensitive pruning is more complex: weights can be deleted either because they are small or because their contribution to the final loss is small. As we shall see in Figure 3 and Figure 4, there are indeed more small weights that survive the pruning under the gradient-sensitive criterion. Conversely, more large weights are deleted because their gradients are small.

The extreme case is when a weight has zero corresponding gradient, i.e., the weight simply does not contribute to the loss. The weights of a ReLU neuron that never activates are examples of this. Gradient-sensitive pruning very reasonably deletes such weights, regardless of their magnitude.
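
As a toy illustration (ours, not from the paper), the snippet below constructs a ReLU neuron that never activates on the given inputs: its incoming weights have large magnitude yet receive exactly zero gradient, so the gradient-sensitive score vanishes while the magnitude score stays large.

```python
import torch

torch.manual_seed(0)
x = torch.rand(128, 4)                                 # inputs in [0, 1)
w = torch.full((4, 1), -2.0, requires_grad=True)       # large-magnitude incoming weights
b = torch.tensor([-1.0], requires_grad=True)

pre_activation = x @ w + b                             # always negative for these inputs
loss = torch.relu(pre_activation).mean()               # the neuron never activates
loss.backward()

print(w.abs().mean())        # large: survives magnitude pruning
print(w.grad.abs().mean())   # exactly zero: deleted by gradient-sensitive pruning
```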

The crucial property of gradient-sensitive pruning is that it takes the dataset into consideration. The magnitude of a weight might be a good data independent heuristic for assessing the usefulness of a connection, while the gradient provides a data dependent alternative heuristic. A good combination of the two likely yields an ordering criterion that surpasses both. The product $|w| \cdot \bar{g}_w$ is one such combination. (One could envision other, more refined combinations, which we leave for future work.) We shall see in Figure 1 that gradient-sensitive pruning indeed yields higher test accuracy.

4 Experiments

We run experiments in the openLTH framework (Frankle, 2020), using the CIFAR-10 dataset (Krizhevsky et al.) and the standard VGG-11, VGG-16 (Simonyan & Zisserman, 2015) and ResNet20 (He et al., 2016) architectures. We use the default hyperparameters of the openLTH framework: SGD optimizer with learning rate 0.1, momentum 0.1 and weight decay 0.0001. We train the VGG networks for 60 epochs and ResNet20 for 80 epochs; the learning rate drops to 0.01 at the 40th and 60th epochs, respectively.
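
For reference, this training configuration corresponds roughly to the following PyTorch setup; it is a sketch assembled from the stated hyperparameters, not the openLTH source, and the stand-in model is ours.

```python
import torch
import torch.nn as nn

# Stand-in model; replace with VGG-11, VGG-16 or ResNet20 in the actual experiments.
model = nn.Linear(3 * 32 * 32, 10)

# Hyperparameters as stated above (openLTH defaults).
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.1, weight_decay=1e-4)

# VGG: 60 epochs with a learning-rate drop to 0.01 at epoch 40;
# ResNet20: 80 epochs with the drop at epoch 60 (use milestones=[60] instead).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[40], gamma=0.1)
```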

For training-based pruning, we perform 7 iterations, each removing a fixed fraction of the weights. For gradient-sensitive pruning, we compute the average gradients using the whole training set. (Note that this differs from Lee et al. (2019), which computes gradients using a single batch of data.)
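
The training-based procedure can be summarised by the following sketch (our own illustration of the loop, not the openLTH implementation), reusing `pruning_scores` and `prune_mask` from the sketch in Section 2; `train_fn` is an assumed helper and `fraction_per_iter` a placeholder, since the per-iteration fraction is not specified here.

```python
import copy

import torch


def iterative_pruning(model, data_loader, loss_fn, train_fn,
                      criterion="wg", iterations=7, fraction_per_iter=0.2):
    """Training-based pruning: train, score, prune, rewind.

    criterion: "w" for magnitude (Train-w / LTH), "wg" for gradient-sensitive (Train-wg).
    train_fn:  caller-supplied routine that fully trains `model` while keeping
               masked-out weights at zero.
    """
    initial_state = copy.deepcopy(model.state_dict())      # weights to rewind to
    params = [p for p in model.parameters() if p.requires_grad]
    masks = [torch.ones_like(p) for p in params]
    kept = 1.0                                             # fraction of weights still alive

    for _ in range(iterations):
        train_fn(model, masks)                             # 1) fully train the masked network
        magnitude, _, grad_sensitive = pruning_scores(model, data_loader, loss_fn)
        scores = magnitude if criterion == "w" else grad_sensitive
        # Already-pruned weights get score -inf so they stay at the bottom of the ordering.
        scores = [torch.where(m.bool(), s, torch.full_like(s, float("-inf")))
                  for s, m in zip(scores, masks)]
        kept *= 1.0 - fraction_per_iter                    # 2) delete a fraction of the survivors
        masks = prune_mask(scores, 1.0 - kept)
        model.load_state_dict(initial_state)               # 3) rewind survivors to initial values
    return masks
```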

Our charts mark the mean of 5 runs, and the transparent regions mark the standard deviation.

Figure 1 and Figure 2 show the differences in accuracy between the four strategies introduced in Section 2 on VGG-11 and VGG-16. Training-based pruning methods clearly outperform initialisation-based methods by a large margin, and gradient-sensitive methods outperform magnitude methods by a smaller margin. At lower pruning levels, the difference between the two pruning criteria is larger.

Figure 1: Accuracy of different pruning strategies on VGG-11. Note the logarithmic scale on the x axis.
Figure 2: Accuracy of different pruning strategies on VGG-16. Note the logarithmic scale on the x axis.
Figure 3: Weights, gradients and weights times gradients in the final conv layer, before pruning happens. Magnitude pruning.
Figure 4: Weights, gradients and weights times gradients in the final conv layer, before pruning happens. Gradient-sensitive pruning.

Gradient-sensitive pruning results in higher test accuracy. To get a deeper insight into this result, we look at layerwise differences in the training-based setting. Figure 5 shows that gradient-sensitive pruning eliminates slightly more weights from the first layers and significantly more from the last layers, keeping more parameters in the middle layers. At extreme sparsity, we can see from Figure 6 that gradient-sensitive pruning keeps very few weights in the last layers: at the highest pruning levels, fewer than 1000 weights remain in the last convolutional layer, and the classifying dense layer undergoes even more significant pruning.

Figure 5: Number of remaining weights under gradient-sensitive pruning divided by the number of remaining weights under magnitude pruning for each layer. Training-based setup, VGG-11.
Figure 6: Number of remaining weights by layer for the two training-based pruning strategies on VGG-11.

Diving deeper, we focus on the last convolutional layer in the training-based setups and examine histograms of weights, gradients and weights times gradients for magnitude pruning (Figure 3) and gradient-sensitive pruning (Figure 4).

The left columns show the distribution of weights. Notice that pruning leaves a “hole” around zero in the histograms of subsequent iterations: after we remove the small weights, reverting and retraining barely yields new small weights. As a result, subsequent iterations prune increasingly larger weights.

The hole is less accentuated for gradient-sensitive pruning, which is expected, since some of the small weights survive pruning due to having larger gradients. On the other hand, for gradient-sensitive pruning the holes appear in the product of the weights and the gradients (right side of Figure 4), since this is the pruning criterion. Note, however, that these holes are less rigid than their magnitude-based counterparts and disappear in fewer iterations. This is because, while the trained weights do not seem to be greatly affected by pruning, their gradients adapt more easily.

Finally, the middle columns of Figures 3 and 4 reveal the distribution of the average absolute values of the gradients. At the end of the first training, we see a large number of weights with zero gradients. Gradient-sensitive pruning removes all of them and yields rather healthy gradient histograms in later iterations. Magnitude-based pruning, on the other hand, keeps these weights, even though they are completely useless. This shows that the magnitude of a weight is not a good proxy for the magnitude of the gradient.
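
Histograms of this kind can be produced with a short matplotlib helper; this is our own sketch (bin count and log scale are arbitrary choices), taking one layer's weight tensor and its average absolute gradient (the second output of `pruning_scores` above).

```python
import matplotlib.pyplot as plt


def plot_layer_histograms(weight, avg_grad, title=""):
    """Histograms of |w|, the average absolute gradient, and their product for one layer."""
    w = weight.detach().abs().flatten().cpu().numpy()
    g = avg_grad.detach().abs().flatten().cpu().numpy()
    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    for ax, values, name in zip(axes, [w, g, w * g],
                                ["|w|", "avg |dL/dw|", "|w| * avg |dL/dw|"]):
        ax.hist(values, bins=100)
        ax.set_title(f"{title} {name}".strip())
        ax.set_yscale("log")
    fig.tight_layout()
    plt.show()
```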

ResNet20

Figure 7 shows the accuracies on ResNet20. Interestingly, the network is much less resistant to pruning, which we conjecture is due to its higher parameter efficiency. Furthermore, there is no significant difference between magnitude and gradient-sensitive pruning. The residual connections allow for a healthier gradient flow, which we can see in Figures 8 and 9: there are far fewer parameters with extremely small gradients. A deeper understanding of the pruning behaviour of residual networks is left for future work.

Figure 7: Accuracy of different pruning strategies on ResNet20. Note the logarithmic scale on the x axis.
Figure 8: Histogram of weights, gradients and weights times gradients in the final network layer of ResNet20, before pruning happens. Magnitude pruning.
Figure 9: Histogram of weights, gradients and weights times gradients in the final network layer of ResNet20, before pruning happens. Gradient-sensitive pruning.

5 Conclusion and Future Work

Our work compares four network pruning algorithms that are the combinations of a simple (initialisation-based) and a complex (training-based) pruning setup on the one hand and a simple (magnitude-based) and a complex (gradient-sensitive) pruning criterion on the other. We show that training-based approaches surpass initialisation-based approaches and that gradient-sensitive approaches surpass magnitude-based approaches.

A key takeaway message from our work is the benefit of a data dependent pruning criterion as manifested in the gradient-sensitive pruning scenarios.

As future work, we note that the gradient-sensitive formula used in Lee et al. (2019) is one possible solution, but not necessarily the best one. For example, we could introduce an extra exponent parameter $\alpha$ and order weights by $|w| \cdot \bar{g}_w^{\alpha}$. Another promising direction is to find a good interpolation between initialisation-based and training-based pruning. A lot of evidence suggests that training allows for better pruning; however, it may not be necessary to fully train the network to make good pruning decisions. This promises to match the performance of the training-based variants without the large computational overhead. Furthermore, we are interested to see whether there are systematic differences between the subnetworks selected by different pruning strategies.

Figure 6 already reveals differences in the layerwise number of remaining weights, and we are intrigued to find more complex patterns.

6 Acknowledgments

This work has been supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.3-VEKOP-16-2017-00002), as well as by the Hungarian National Excellence Grant 2018-1.2.1-NKP-00008.

References