Recovering from Random Pruning: On the Plasticity of Deep Convolutional Neural Networks

Recently there has been a lot of work on pruning filters from deep convolutional neural networks (CNNs) with the intention of reducing computations. The key idea is to rank the filters based on a certain criterion (say, l_1-norm, average percentage of zeros, etc) and retain only the top ranked filters. Once the low scoring filters are pruned away the remainder of the network is fine tuned and is shown to give performance comparable to the original unpruned network. In this work, we report experiments which suggest that the comparable performance of the pruned network is not due to the specific criterion chosen but due to the inherent plasticity of deep neural networks which allows them to recover from the loss of pruned filters once the rest of the filters are fine-tuned. Specifically, we show counter-intuitive results wherein by randomly pruning 25-50% filters from deep CNNs we are able to obtain the same performance as obtained by using state of the art pruning methods. We empirically validate our claims by doing an exhaustive evaluation with VGG-16 and ResNet-50. Further, we also evaluate a real world scenario where a CNN trained on all 1000 ImageNet classes needs to be tested on only a small set of classes at test time (say, only animals). We create a new benchmark dataset from ImageNet to evaluate such class specific pruning and show that even here a random pruning strategy gives close to state of the art performance. Lastly, unlike existing approaches which mainly focus on the task of image classification, in this work we also report results on object detection. We show that using a simple random pruning strategy we can achieve significant speed up in object detection (74% improvement in fps) while retaining the same accuracy as that of the original Faster RCNN model.





1 Introduction

Over the past few years, deep convolutional neural networks (CNNs) have been very successful in a wide range of computer vision tasks such as image classification [37, 4, 24], object detection [10, 9, 33, 32, 27] and image segmentation [3, 28]. In general, with each passing year, these networks are becoming deeper and deeper with a corresponding increase in the performance [13, 18, 36]. However, this increase in performance is accompanied by an increase in the number of parameters and computations. This makes it difficult to port these models on embedded and mobile devices where storage, computation and power are limited. In such cases, it is crucial to have small, computationally efficient models which can achieve performance at par or close to large networks. This practical requirement has led to an increasing interest in model compression where the aim is to either (i) design efficient small networks [20, 16] or (ii) efficiently prune weights from existing deep networks [12, 39, 11] or (iii) efficiently prune filters from deep convolutional networks [26, 23, 30, 17] or (iv) replace expensive floating point weights by binary or quantized weights [5, 31, 11, 41] or (v) guide the training of a smaller network using a larger (teacher) network [2, 15].

In this work, we focus on pruning filters from deep convolutional neural networks. The filters in the convolution layers typically account for fewer parameters than the fully connected layers (the ratio is 10:90 for VGG-16 [26]), but they account for most of the floating point operations done by the model (99% for VGG-16 [26]). Hence reducing the number of filters effectively reduces the computation (and thus power) requirements of the model. All existing works on filter pruning [26, 23, 30, 17] follow a very similar recipe. The filters are first ranked based on a specific criterion such as the $\ell_1$-norm [26] or the percentage of zeros in the filter's output [17]. The scoring criterion essentially determines the importance of the filter for the end task, typically image classification [24]. Only the top-m ranked filters are retained and the resulting pruned network is then fine-tuned. It is observed that when pruning up to 50% of the filters using different proposed criteria, the pruned network almost recovers the original performance after fine-tuning. The claim is that this recovery is due to the soundness of the criterion chosen for pruning. However, in this work we argue that this recovery is not due to the specific pruning criterion but due to the inherent plasticity of deep CNNs. Specifically, we show that even if we prune filters randomly we can match the performance of state-of-the-art pruning methods.

To effectively prove our point, it is crucial that we look at factors/measures other than the final performance of the pruned model. To do so we draw an analogy with the human brain and observe that the process of pruning filters from a deep CNN is akin to causing damage to certain portions of the brain. It is known that the human brain has a high plasticity and over time can recover from such damage with appropriate treatment [19]. In our case, the process of fine-tuning would be akin to such post-damage (post-pruning) treatment. If the injury damages only redundant or unimportant portions of the brain then the recovery should be complete, quick and require minimal treatment. Similarly, we could argue that if the pruning criterion is indeed good and prunes away only unimportant filters then (i) the performance of the model should not drop much, (ii) the model should be able to regain its full performance after fine-tuning, (iii) this recovery should be fast (i.e., with fewer iterations of fine-tuning) and (iv) the quantum of data used for fine-tuning should be less. None of the existing works on filter pruning do a thorough comparison w.r.t. these factors. We not only consider these factors but also present counter-intuitive results which show that a random pruning criterion is comparable to state-of-the-art pruning methods on all these factors. Note that we are not claiming that we can always recover the full performance of the unpruned network. For example, it should be obvious that in the degenerate case if 90% of the filters are pruned then it would be almost impossible to recover. The claim being made is that, at different pruning levels (25%, 50%, 75%), a random pruning strategy is not much worse than state-of-the-art pruning strategies.

To further prove our point, we wanted to check if such recovery from pruning is task agnostic. In other words, in addition to showing that a network trained for image classification (task 1) can be pruned efficiently, we also show that the same can be done with a network trained for object detection (task 2). Here again, we show that a random pruning strategy works at par with state-of-the-art pruning methods. Stretching this idea further and continuing the above analogy, we note that once the brain recovers from such damage, it is desirable that in addition to recovering its performance on the tasks that it was good at before the injury, it should also be able to do well on newer tasks. In our case, the corresponding situation would be to take a network pruned and fine-tuned for image classification (old task) and plug it into a model for object detection (new task). Specifically, we show that when we plug a randomly pruned and fine-tuned VGG-16 network into a Faster RCNN model we can get the same performance on object detection as obtained by plugging in (i) the original unpruned network or (ii) a network pruned using a state-of-the-art pruning method. This once again hints at the inherent plasticity of deep CNNs which allows them to recover (up to a certain level) irrespective of the pruning strategy.

Finally, we consider the case of class specific pruning which has not been studied in the literature. We note that in many real world scenarios, it is possible that while we have trained an image classification network on a large dataset containing many classes, at test time we may be interested in only a few classes. A case in point is the task of object detection using the Pascal VOC dataset [8]. RCNN and its variants [10, 9, 33] use as a sub-component an image classification model trained on all the 1000 ImageNet classes. We hypothesize that this is overkill and instead create a class specific benchmark dataset from ImageNet which contains only those 52 classes which correspond to the 20 classes in Pascal VOC. Ideally, one would expect that a network trained, pruned and fine-tuned only for these 52 classes, when plugged into Faster RCNN, should do better than a network trained, pruned and fine-tuned on a random set of 52 classes (which are very different from the classes in Pascal VOC). However, we observe that irrespective of which of these networks is plugged into Faster RCNN the final performance after fine-tuning is the same, once again showing the ability to recover from unfavorable situations.

To the best of our knowledge, this is a first of its kind work on pruning filters which:

  1. Proposes that while assessing the performance of a pruning method, we should consider factors such as amount of damage (drop in performance before fine-tuning), amount of recovery (performance after fine-tuning), speed of recovery and quantum of data required for recovery.

  2. Performs extensive evaluation using two image classification networks (VGG-16 and ResNet) and shows that a random pruning strategy gives comparable performance to that of state of the art pruning strategies w.r.t. all the above factors.

  3. Shows that such behavior is task agnostic and a random pruning strategy works well even for the task of object detection. Specifically, we show that by randomly pruning filters from an object detection model we can get a 74% improvement in fps while maintaining almost the same accuracy (1% drop) as the original unpruned network.

  4. Shows that pruned networks can adapt with ease to newer tasks

  5. Proposes a new benchmark for evaluating class specific pruning

2 Related Work

In this section, we review existing work on making deep convolutional neural networks efficient w.r.t. their memory and computation requirements while not compromising much on the accuracy. These approaches can be broadly classified into the following categories: (i) pruning unimportant weights, (ii) low rank factorization, (iii) knowledge distillation, (iv) designing compact networks from scratch, (v) using binary or quantized weights and (vi) pruning unimportant filters. Below, we first quickly review the related work for the first five categories listed above and then discuss approaches for pruning filters, which is the main focus of our work.

Optimal Brain Damage [25] and Optimal Brain Surgeon [7] are two examples of approaches which prune the unimportant weights in the network. A weight is considered unimportant if the output is not very sensitive to this weight. They show that pruning such weights leads to minimal drop in the overall performance of the network. However, these methods are computationally expensive as they require the computation of the Hessian (second order derivative). Another approach is to use low rank factorization of the weight tensors/matrices to reduce the computations [21, 38, 22, 6, 40]. For example, instead of directly multiplying a high dimensional weight tensor with the input tensor, we could first compute a low rank approximation of the weight tensor as a product of factors whose dimensions are much smaller than the dimensions of the original tensor. This essentially boils down to decomposing the larger matrix multiplication operation into smaller operations. Also, the low rank approximation ensures that only the important information in the weight matrix is retained. Alternately, researchers have also explored designing compact networks from scratch which have fewer layers and/or parameters and/or computations [20]. There are also some approaches which quantize [11] or binarize [31, 5] the weights of a network to reduce both memory footprint and computation time. Another line of work focuses on transferring the knowledge from a bigger trained network (or an ensemble of networks) to a smaller (thin) network [2, 15].

The main focus of our work is on pruning filters from deep CNNs with the intention of reducing computations. As mentioned earlier, while the convolution filters do not account for a large number of parameters, they account for almost all the computations that happen in the network. Here, the idea is to rank the filters using a scoring function and then retain only the top scoring filters. For example, in [26], the authors use the $\ell_1$-norm of the filters to rank their importance. The argument is that filters having a lower $\ell_1$-norm will produce smaller activation values which will contribute minimally to the output of that layer. Alternately, in [29] the authors propose entropy as a measure of the importance of a filter. If a filter has high entropy then the filter is more informative and hence more important. On the other hand, [17] calculate the average percentage of zeros in the activation maps of each filter and hypothesize that filters having a higher average percentage of zeros in their activations are less important. In [30] the authors use a Taylor series expansion that approximates the change in the cost function caused by pruning filters. Unlike [25], this method uses information from the first derivative only. Another work on pruning filters [23] proposes that instead of pruning filters based on the current layer's statistics they should be pruned based on the next layer's statistics. Essentially the idea of [23] is to look at the activation map of the next layer, find the input channel whose removal gives the minimum change in that output, and prune this channel along with its corresponding filter in the current layer. In [14] the authors propose a similar idea to [23] but instead of removing the filters one by one they propose to use LASSO regression. Lastly, in [1] the authors use particle filtering to prune filters.

3 Methodology

In this section, we first formally define the problem of pruning filters and give a generic algorithm for pruning filters using any appropriate scoring function. We then discuss existing scoring functions along with some new variants that we propose.

3.1 Problem Statement

Suppose there are $L$ convolutional layers in a CNN and suppose layer $l$ contains $n_l$ filters. We use $F_{l,i}$ to denote the $i$-th filter in the $l$-th layer. Each such filter is a three dimensional tensor $F_{l,i} \in \mathbb{R}^{c_l \times w_l \times h_l}$, where $c_l$ is the number of input channels for layer $l$ and $w_l, h_l$ are the width and height of the $i$-th filter in the $l$-th layer. Our goal is to rank all the filters in layer $l$ and then retain the top-$m$ filters, where $m$ is a hyperparameter which indicates the desired level of pruning. For example, based on available computation resources, if we want to reduce the number of computations in this layer by half then we can set $m = n_l / 2$. Let the original output of layer $l$ be denoted by $O_l \in \mathbb{R}^{n_l \times W_l \times H_l}$, where $W_l, H_l$ are the width and height and $n_l$ is the number of channels, which is the same as the number of filters. After pruning and retaining only the top-$m$ filters the size of the output will be reduced to $m \times W_l \times H_l$. Thus, pruning filters not only reduces the number of computations in this layer but also reduces the size of the input to the next layer (which is the same as the output of this layer). The same process of pruning can then be repeated across all layers of the CNN. The main task here is to find the right scoring function for ranking the filters.
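The effect of retaining only the top-scoring filters on the output size and on the cost of both the pruned layer and the layer after it can be illustrated with a small sketch. The helper names and the layer sizes below are assumptions chosen for illustration; they are not taken from our experiments. The cost model is the usual multiply-accumulate count of a convolution, ignoring biases.

```python
def conv_output_shape(num_filters, out_width, out_height):
    """A conv layer produces one output channel per filter."""
    return (num_filters, out_width, out_height)


def conv_multiply_adds(in_channels, num_filters, kernel_w, kernel_h,
                       out_width, out_height):
    """Multiply-accumulate count of a standard convolution (bias ignored)."""
    return (in_channels * kernel_w * kernel_h
            * num_filters * out_width * out_height)


# Hypothetical layer: 256 input channels, 512 filters of size 3x3, 28x28 output.
n_filters, m_retained = 512, 256   # retain the top half of the filters

before = conv_multiply_adds(256, n_filters, 3, 3, 28, 28)
after = conv_multiply_adds(256, m_retained, 3, 3, 28, 28)

print(conv_output_shape(n_filters, 28, 28))    # (512, 28, 28)
print(conv_output_shape(m_retained, 28, 28))   # (256, 28, 28)
print(after / before)                          # 0.5

# The next layer now sees only m_retained input channels, so its cost drops too.
next_before = conv_multiply_adds(n_filters, 512, 3, 3, 28, 28)
next_after = conv_multiply_adds(m_retained, 512, 3, 3, 28, 28)
print(next_after / next_before)                # 0.5
```

When every layer is pruned by the same fraction, both factors apply to each interior layer, which is why the savings compound across the network.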

3.2 A Generic Algorithm for Pruning

Algorithm 1 summarizes the generic recipe used by different approaches for pruning filters. As shown in the algorithm, pruning typically starts from the outermost layer. Once the low scoring filters from this layer are pruned, the network is then fine-tuned and the same process is then repeated for the layers before it. Once all the layers are pruned and fine-tuned, the entire network is then tuned for a few epochs.

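Algorithm 1: Prune(CNN)
  L = number of layers in the network
  n_l = number of filters in layer l
  for each layer l, from the last layer to the first, do
      for each filter i in layer l do
          score[i] = scoring_function(filter i of layer l)
      end for
      rank the filters of layer l by score
      retain the top-m filters and prune the rest
      fine-tune the network for a few epochs
  end for
  fine-tune the final pruned network for a few more epochs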

Existing methods for pruning filters differ in the scoring function that they use for ranking the filters. We alternately refer to this scoring function as the pruning criterion, as discussed in the next subsection.
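The generic recipe can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: the network is a toy list of layers, `layers[l][i][c]` is the flattened kernel of filter `i` for input channel `c`, `finetune` is a placeholder callback, and the function names are hypothetical. The default epoch counts mirror the schedule described later for VGG-16 (1 epoch per layer, 12 at the end).

```python
import random


def l1_score(filt):
    """l1-norm of one filter (a list of per-channel kernels)."""
    return sum(abs(w) for kernel in filt for w in kernel)


def make_random_score(seed=0):
    """Random pruning: the score ignores the filter entirely."""
    rng = random.Random(seed)
    return lambda filt: rng.random()


def prune(layers, keep_fraction, score_fn,
          finetune=lambda layers, epochs: None,
          layer_epochs=1, final_epochs=12):
    layers = list(layers)                        # do not modify the caller's list
    for l in reversed(range(len(layers))):       # start from the last layer
        scores = [score_fn(f) for f in layers[l]]
        m = max(1, int(len(layers[l]) * keep_fraction))
        ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
        keep = sorted(ranked[:m])                # indices of the top-m filters
        layers[l] = [layers[l][i] for i in keep]
        if l + 1 < len(layers):                  # drop the matching input channels
            layers[l + 1] = [[f[c] for c in keep] for f in layers[l + 1]]
        finetune(layers, layer_epochs)
    finetune(layers, final_epochs)
    return layers


def make_layer(rng, n_filters, in_channels, kernel_size=9):
    return [[[rng.gauss(0, 1) for _ in range(kernel_size)]
             for _ in range(in_channels)] for _ in range(n_filters)]


rng = random.Random(0)
net = [make_layer(rng, 8, 3), make_layer(rng, 16, 8), make_layer(rng, 16, 16)]

for score_fn in (l1_score, make_random_score()):
    pruned = prune(net, 0.5, score_fn)
    print([len(layer) for layer in pruned])      # filters per layer: [4, 8, 8]
    print([len(layer[0]) for layer in pruned])   # input channels:    [3, 4, 8]
```

Swapping the scoring function is the only change needed to move between criteria, which is what makes a like-for-like comparison with random pruning straightforward.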

3.3 Pruning Criteria

We now describe various pruning criteria which are used by existing approaches and also introduce some new variants of existing pruning criteria. These criteria are essentially used as the scoring function in Algorithm 1.

  1. Mean Activation [30]: Most deep CNNs for image classification use ReLU as the activation function which results in very sparse activations (as all negative outputs are set to 0). We could compute the mean activation of the feature map corresponding to a filter across all images in the training data. If this mean activation is very low (because most of the activations are 0) then this feature map and hence the corresponding filter is not going to contribute much to the discriminatory power of the network (since the filter rarely fires for any input). Hence, [30] uses the mean activation as a scoring function for ranking filters.

  2. $\ell_1$-Norm [26]: The authors of [26] suggest that the $\ell_1$-norm of a filter can also be used as an indicator of the importance of the filter. The argument is that if the $\ell_1$-norm of a filter is small then on average the weights in the filter will be small and hence produce very small activations. These small activations will not influence the output of the network and hence the corresponding filters can be pruned away. One important benefit of this method is that apart from computing the $\ell_1$-norm, it does not need any extra computation during pruning and fine-tuning.

  3. Entropy [29]: If the feature map corresponding to a filter produces the same output for every input (image) then this feature map and hence the corresponding filter may not be very important (because it does not play any discriminatory role). In other words, we are interested in feature maps (and hence filters) which are more informative or have a high entropy. If we divide the possible range of the average output of a feature map into $b$ bins then we could compute the entropy of the $i$-th feature map (or filter) [29] as:

    $$E_i = -\sum_{j=1}^{b} p_{ij} \log p_{ij}$$

    where $p_{ij}$ is the probability that the output of the $i$-th feature map lies in the $j$-th bin. This probability can be computed as the fraction of input images for which the average output of the feature map lies in this bin.

  4. Average Percentage of Zeros (APoZ) [17]: As mentioned earlier, when ReLU is used as the activation function, the output activations are very sparse. If most of the neurons in a feature map are zero then this feature map is not likely to contribute much to the output of the network. The Average Percentage of Zeros in the output of each filter can thus be used to compute the importance of the filter (the lower the better).

  5. Sensitivity: We could compute the gradient of the loss function (i.e., cross entropy) w.r.t. a filter. If a filter has a high influence on the loss function then the value of this gradient would be high. The norm of this gradient averaged over all images can thus be used to compute the importance of a filter.

  6. Scaled Entropy: We propose a new variant of the entropy based criterion. We observe that a filter may have a high entropy but if all its activations are very low (belonging to lower bins) then this filter is not likely to contribute much to the output. We thus propose to use a combination of entropy and mean activation by scaling the entropy by the mean activation of the filter. This scaled entropy of the $i$-th filter can be computed as:

    $$SE_i = -\mu_i \sum_{j=1}^{b} p_{ij} \log p_{ij}$$

    where $\mu_i$ is the average activation of the $i$-th filter over all input images and $p_{ij}$ is, as in the entropy criterion, the probability that the average output of the $i$-th feature map lies in the $j$-th of $b$ bins.

  7. Class Specific Importance: In this work, we are also interested in a more practical scenario, where a network trained for detecting all the 1000 classes from ImageNet is required to detect only a small subset of these classes at test time (say, only animals). Intuitively, we should then devise a scoring function which retains only those filters which are important for these classes. To do so we once again compute the gradient of the loss function w.r.t. the filter. However, now instead of averaging the norm of this gradient over all images in the training data, we compute the average over only those images in the training data which correspond to the classes of interest. This class-specific average is then used to rank the filters.

  8. Random Pruning : One of the main contributions of this work is to show that even if we randomly prune the filters from a CNN, its performance after fine-tuning is not much worse than any of the above approaches.
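The activation-based and gradient-based criteria above can be sketched as plain Python functions. This is an illustrative sketch rather than our implementation: it assumes the post-ReLU feature maps of one filter are already available as one flat list per image, that per-image gradient norms have been precomputed, and that the bin range `[lo, hi]` and the number of bins are free parameters. All function names and toy values are hypothetical.

```python
import math


def _flatten(feature_maps):
    return [v for fmap in feature_maps for v in fmap]


def mean_activation(feature_maps):
    """Mean post-ReLU activation of one filter over all images."""
    values = _flatten(feature_maps)
    return sum(values) / len(values)


def apoz(feature_maps):
    """Average Percentage of Zeros; a lower value means a more important filter."""
    values = _flatten(feature_maps)
    return sum(1 for v in values if v == 0) / len(values)


def entropy(feature_maps, bins=4, lo=0.0, hi=1.0):
    """Entropy of the per-image average output, binned over [lo, hi]."""
    counts = [0] * bins
    for fmap in feature_maps:
        avg = sum(fmap) / len(fmap)
        j = int((avg - lo) / (hi - lo) * bins)
        counts[min(bins - 1, max(0, j))] += 1
    n = len(feature_maps)
    return sum(-(c / n) * math.log(c / n) for c in counts if c)


def scaled_entropy(feature_maps, **kwargs):
    """Entropy scaled by the mean activation of the filter."""
    return mean_activation(feature_maps) * entropy(feature_maps, **kwargs)


def mean_gradient_norm(grad_norms, labels=None, classes_of_interest=None):
    """Sensitivity: average per-image gradient norm.

    If classes_of_interest is given, average only over images whose label
    is in that set (class specific importance).
    """
    if classes_of_interest is not None:
        grad_norms = [g for g, y in zip(grad_norms, labels)
                      if y in classes_of_interest]
    return sum(grad_norms) / len(grad_norms)


# Toy feature maps (one inner list per image); the values are made up.
dead = [[0.0, 0.0]] * 4                      # never fires
constant = [[0.5, 0.5]] * 4                  # same output for every image
varied = [[0.1, 0.1], [0.3, 0.3], [0.6, 0.6], [0.9, 0.9]]

for name, fmaps in (("dead", dead), ("constant", constant), ("varied", varied)):
    print(name, mean_activation(fmaps), apoz(fmaps), entropy(fmaps),
          scaled_entropy(fmaps))

print(mean_gradient_norm([1.0, 3.0, 5.0], ["cat", "dog", "car"], {"cat", "dog"}))
```

The toy filters show how the criteria can disagree: the constant filter has a healthy mean activation but zero entropy, while the dead filter scores poorly on every criterion. Because APoZ is better when lower, a ranking based on it would sort in the opposite direction to the other scores.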

4 Experiments: Image Classification

In this section, we focus on the task of image classification using the ImageNet [34] dataset. The dataset is split into three sets: training (1.3M images), validation (50K images), and testing (100K images with held-out class labels). We experiment with two popular networks, viz., VGG-16 and ResNet-50. We first train these networks using the full ImageNet training data and then prune them using Algorithm 1. We compare the performance of the different scoring functions listed in Section 3.3.

4.1 Comparison of different pruning methods on VGG-16

VGG-16 [35] has 13 convolutional (CONV) and three fully connected (FC) layers. The number of filters in each CONV layer in the standard VGG-16 network [35] is {64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512}. We first train this network as it is (i.e., with the standard number of filters in each layer) using the ImageNet training data. When evaluated on the standard ImageNet test set, this trained model gives us a top-1 accuracy of 69% which is comparable to the accuracy reported elsewhere in the literature. We now prune this network, one layer at a time starting from the last convolution layer. We prune away m% of the filters from each layer where we chose the value of m to be one of {25, 50, 75}. We use one of the scoring functions described in Section 3.3 to select the top (100 - m)% of the filters. We drop the remaining m% of the filters from this layer and then fine-tune the pruned network for 1 epoch. We then repeat the same process for the lower layers and use the same value of m across all layers. Once the network is pruned till layer 1, we then fine-tune the entire pruned network for 12 epochs using 1/10-th of the training data picked randomly. The only reason for not using the entire training data is that it is quite computationally expensive. We did not see any improvement in the performance on the validation set by fine-tuning beyond 12 epochs. We then evaluate this pruned and fine-tuned network on the test set. Below, we discuss the performance of the final pruned and fine-tuned network obtained using different pruning strategies.
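To give a rough sense of what each pruning level means computationally, the following sketch counts the filters retained per layer and the resulting convolutional multiply-accumulates. It is an illustration, not a measurement from our experiments: it assumes the standard VGG-16 layout with 3x3 kernels, a 224x224 RGB input and 2x2 max-pooling after CONV layers 2, 4, 7, 10 and 13, and it ignores the FC layers. The helper names are hypothetical.

```python
VGG16_FILTERS = [64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512]
# Spatial size of each CONV output for a 224x224 input (assumed standard setup).
VGG16_OUT_SIZE = [224, 224, 112, 112, 56, 56, 56, 28, 28, 28, 14, 14, 14]


def retained_filters(filters, pruned_percent):
    """Number of filters kept in each layer after pruning pruned_percent %."""
    return [n * (100 - pruned_percent) // 100 for n in filters]


def conv_multiply_adds(filters, in_channels=3, kernel=3):
    """Total multiply-accumulates of all CONV layers (bias ignored)."""
    total = 0
    for n, size in zip(filters, VGG16_OUT_SIZE):
        total += in_channels * kernel * kernel * n * size * size
        in_channels = n          # this layer's filters feed the next layer
    return total


base = conv_multiply_adds(VGG16_FILTERS)
for m in (25, 50, 75):
    kept = retained_filters(VGG16_FILTERS, m)
    ratio = conv_multiply_adds(kept) / base
    print(f"prune {m}%: first layers keep {kept[:3]}, "
          f"conv cost ratio {ratio:.3f}")
```

Since every layer except the first loses both input channels and filters, the convolutional cost falls to roughly (1 - m/100)^2 of the original, i.e. close to a quarter at 50% pruning.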

Performance of pruned network after fine-tuning: In Table 1, we report the performance of the final pruned network after fine-tuning. We observe that random pruning works better than most of the other pruning methods described earlier. The $\ell_1$-norm is the only scoring function which does consistently better than random, and that too by a small margin. In fact, if we fine-tune the final pruned network using the entire training data then we observe that there is hardly any difference between random and $\ell_1$-norm (see Table 2). This provides empirical evidence for our claim that the amount of recovery (i.e., final performance after fine-tuning) is not due to the soundness of the pruning criterion. Even with random pruning, the performance of the pruned network is comparable. Of course, as the percentage of pruning increases (as m increases) it becomes harder for the pruned network to recover the full performance of the original network (but the point is that it is equally hard irrespective of the pruning method used). Thus, w.r.t. the amount of recovery after damage (pruning), a random pruning strategy is as good as any other pruning strategy. We further drive this point home in Figure 1(a) where we show that after pruning and fine-tuning for every layer, the amount of recovery after fine-tuning is comparable across different pruning strategies.

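| Heuristic | 25% | 50% | 75% |
|---|---|---|---|
| Random | 0.650 | 0.569 | 0.415 |
| Mean Activation | 0.652 | 0.570 | 0.409 |
| Entropy | 0.641 | 0.549 | 0.405 |
| Scaled Entropy | 0.637 | 0.550 | 0.401 |
| $\ell_1$-norm | 0.667 | 0.593 | 0.436 |
| APoZ | 0.647 | 0.564 | 0.422 |
| Sensitivity | 0.636 | 0.543 | 0.379 |

Table 1: Comparison of different filter pruning strategies on VGG-16 (columns give the percentage of filters pruned).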
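The size of the margin between random pruning and the best criterion can be read off Table 1 directly; the short sketch below simply does that arithmetic. The dictionary transcribes the table, and the function name is a hypothetical helper.

```python
# Accuracies transcribed from Table 1 (pruning levels 25%, 50%, 75%).
TABLE_1 = {
    "Random": (0.650, 0.569, 0.415),
    "Mean Activation": (0.652, 0.570, 0.409),
    "Entropy": (0.641, 0.549, 0.405),
    "Scaled Entropy": (0.637, 0.550, 0.401),
    "l1-norm": (0.667, 0.593, 0.436),
    "APoZ": (0.647, 0.564, 0.422),
    "Sensitivity": (0.636, 0.543, 0.379),
}
LEVELS = ("25%", "50%", "75%")


def gap_to_best(table, reference="Random"):
    """For each pruning level, the best heuristic and its lead over `reference`."""
    gaps = {}
    for k, level in enumerate(LEVELS):
        best = max(table, key=lambda name: table[name][k])
        gaps[level] = (best, round(table[best][k] - table[reference][k], 3))
    return gaps


for level, (best, gap) in gap_to_best(TABLE_1).items():
    print(level, best, gap)
# 25% l1-norm 0.017
# 50% l1-norm 0.024
# 75% l1-norm 0.021
```

On these numbers random pruning trails the best criterion by roughly two points of accuracy at every pruning level.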

As a side note we would like to mention that we do not include the performance of ThiNet [23] in Table 1. This is because it uses a slightly different methodology. In particular there are two major differences. First, in ThiNet pruning is done only till layer 10 and not up to layer 11 as is the case for all numbers reported in Table 1. Secondly, in ThiNet, if a CONV layer appears before a max-pooling layer then it is fine-tuned for an extra epoch to compensate for the downsampling in the max-pooling layer. For a fair comparison, we followed this exact same strategy as ThiNet but using a random pruning criterion. In this setup, a randomly pruned network was able to achieve 68% top-1 accuracy after 50% pruning which is comparable to the performance of the corresponding ThiNet (69%).

| Heuristic | 50% |
|---|---|
| Random | 0.6701 |
| Mean Activation | 0.6662 |
| Entropy | 0.6635 |
| Scaled Entropy | 0.6625 |
| $\ell_1$-norm | 0.6759 |
| APoZ | 0.6706 |
| Sensitivity | 0.6659 |

Table 2: Performance after fine-tuning with full data

Amount of initial damage caused by different pruning strategies: One might argue that while a random pruning strategy is equivalent to other pruning strategies w.r.t. final performance after fine-tuning, it is possible that the amount of initial damage caused by a careful pruning strategy may be less than that caused by random pruning. This could be important in cases where enough time or resources are not available for fine-tuning after pruning. To evaluate this, we compute the accuracy of the network just after pruning (and before fine-tuning) at each layer. Figure 1(b) compares this performance for different pruning strategies. Here again we observe that the damage caused by a random pruning strategy is not worse than other pruning strategies. The only exception is when we prune the first 4 layers, in which case the damage caused by $\ell_1$-norm based pruning is less than that caused by random pruning. We hypothesize that this is because the first 4 layers have very few filters and hence one needs to be careful while pruning filters from these layers. In fact, in hindsight we would recommend not pruning any filters from these 4 layers because the computation savings are small compared to the drop in accuracy.

(a) Performance after fine-tuning at each layer in VGG-16
(b) Performance drop after pruning but before fine-tuning
(c) Finetuning the final pruned network with 1/10-th of the data
(d) Finetuning the final pruned network with full ImageNet data
Figure 1: Pruning and Fine-tuning VGG-16

Speed of recovery and quantum of data for fine-tuning: Another important criterion is the speed of recovery, i.e., the number of iterations for which the network needs to be fine-tuned after pruning. It is conceivable that a carefully pruned network may be able to recover and reach its best performance faster than a randomly pruned network. However, as shown in Figure 1(c), almost all the pruning strategies (including random) reach their peak after 2 epochs when fine-tuned with one-tenth of the data. Even if we increase the quantum of data, this behavior does not change, as shown in Figure 1(d) (for $\ell_1$-norm based pruning and random pruning). Of course, as we increase the quantum of data the amount of recovery increases, i.e., the peak performance of the pruned network increases. However, the important point is that a random strategy is no worse than a careful pruning strategy w.r.t. speed of recovery and quantum of data required.
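The factors discussed in this section (initial damage, amount of recovery and speed of recovery) can all be derived from an accuracy log. The sketch below shows one way to do so; the function name, the tolerance used to decide when the peak has been reached, and the numbers in the example are hypothetical placeholders and not results from our experiments.

```python
def recovery_summary(unpruned_acc, acc_after_pruning, finetune_curve,
                     tolerance=0.005):
    """Summarize damage and recovery from per-epoch fine-tuning accuracies.

    finetune_curve[k] is the accuracy after k + 1 epochs of fine-tuning.
    """
    peak = max(finetune_curve)
    # First epoch whose accuracy is within `tolerance` of the peak.
    epochs_to_peak = next(
        epoch for epoch, acc in enumerate(finetune_curve, start=1)
        if acc >= peak - tolerance
    )
    return {
        "damage": unpruned_acc - acc_after_pruning,   # drop before fine-tuning
        "recovery": peak,                             # best accuracy reached
        "shortfall": unpruned_acc - peak,             # gap that remains
        "epochs_to_peak": epochs_to_peak,             # speed of recovery
    }


# Made-up accuracies purely to exercise the function.
summary = recovery_summary(
    unpruned_acc=0.70,
    acc_after_pruning=0.30,
    finetune_curve=[0.52, 0.60, 0.61, 0.61],
)
print(summary["epochs_to_peak"])   # 3
```

Running the same summary on logs obtained with different fractions of the fine-tuning data gives the fourth factor, the quantum of data required.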

4.2 Pruning ResNet-50 using $\ell_1$-Norm and Random

While the above set of experiments focused on VGG-16, we now turn our attention to ResNet-50 [13] which gives state-of-the-art results on ImageNet. We took a trained ResNet-50 model which gave 74.5% top-1 accuracy on the ImageNet test set, which is again comparable to the accuracy reported elsewhere in the literature. ResNet-50 contains 16 residual blocks wherein each block contains 3 layers with a skip connection from the first layer to the third layer. The standard practice is to either prune the first layer of each block or the first two layers of each block. In the first case, out of the total 48 convolution layers (16 * 3) we will end up pruning 16 and in the second case we will end up pruning 32. As before, for each pruned layer we vary the percentage of pruning over 25%, 50% and 75%. Here, we only compare the performance of the $\ell_1$-norm with random pruning as these were the top performing strategies on VGG-16. This was just to save time and resources as, given the deep structure of ResNet, it would have been very expensive to run all pruning strategies. Once again from Table 3, we observe that random pruning performs at par with (in fact, slightly better than) $\ell_1$-norm based pruning. Note that in this case the pruned models were fine-tuned with only one-tenth of the data. The performance of both methods is likely to improve further if we were to fine-tune the pruned network on the entire training data.
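The layer counts for the two pruning configurations follow from simple enumeration, as the sketch below illustrates. The constants restate the block structure described above; the function name is a hypothetical helper.

```python
NUM_BLOCKS = 16         # residual blocks in ResNet-50, as described above
LAYERS_PER_BLOCK = 3    # convolution layers inside each block


def layers_to_prune(per_block):
    """(block, layer) pairs when the first `per_block` layers of each block are pruned."""
    return [(block, layer)
            for block in range(NUM_BLOCKS)
            for layer in range(per_block)]


total = NUM_BLOCKS * LAYERS_PER_BLOCK
for per_block in (1, 2):
    print(len(layers_to_prune(per_block)), "of", total)
# 16 of 48
# 32 of 48
```

In both configurations the third layer of every block is left untouched, so the number of channels arriving at the point where the skip connection is added stays unchanged.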

| Heuristic | #Layers Pruned | 25% | 50% | 75% |
|---|---|---|---|---|
| Random | 16 | 0.722 | 0.683 | 0.617 |
| $\ell_1$-norm | 16 | 0.714 | 0.677 | 0.610 |
| Random | 32 | 0.696 | 0.637 | 0.518 |
| $\ell_1$-norm | 32 | 0.691 | 0.633 | 0.514 |

Table 3: Comparison of different filter pruning strategies on ResNet (Top-1 accuracy of unpruned network is 0.745)

5 Experiments: Class specific pruning

Existing work on pruning filters (or model compression, in general) focuses on the scenario where we have a network trained for detecting all the 1000 classes in ImageNet and at test time it is again evaluated using data belonging to all of these 1000 classes. However, in many real world scenarios, at test time we may be interested in fewer classes. A case in point is the Pascal VOC dataset which contains only 20 classes. Intuitively, if we are interested in only a few classes at test time then we should be able to prune the network to cater to only these classes. Alternately, we could train the original network itself using data corresponding to these classes only. To enable these experiments, we first create a new benchmark from ImageNet which contains only those 52 classes which correspond to the 20 classes in Pascal VOC. Note that the 52-to-20 mapping arises because ImageNet has more fine-grained classes. For example, there is only one class for 'dog' in Pascal VOC but ImageNet contains many sub-classes of 'dog' (different breeds of dogs). We manually went over all the classes in ImageNet and picked out the classes which correspond to the 20 classes in Pascal VOC. In some cases, we ignored ImageNet classes which were too fine-grained and only considered those classes which were immediate hyponyms of a class in Pascal VOC. We then extracted the training, test and validation images for these classes from the original ImageNet dataset. We refer to this subset of ImageNet as ImageNet-52P (where P stands for Pascal VOC). We refer to the original ImageNet dataset as ImageNet-1000. Note that the training, test and validation splits of ImageNet-52P are subsets of the corresponding splits of ImageNet-1000. In particular, the training split of ImageNet-1000 does not overlap with the test or validation splits of ImageNet-52P.
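Constructing such a class-specific subset amounts to filtering every split through a fine-to-coarse class mapping. The sketch below illustrates the idea; the mapping entries, file names and function name are hypothetical examples and are not the actual 52-class list used for ImageNet-52P.

```python
# Hypothetical excerpt of a mapping from ImageNet classes to Pascal VOC classes.
IMAGENET_TO_VOC = {
    "beagle": "dog",
    "collie": "dog",
    "tabby": "cat",
    "minivan": "car",
}


def build_subset(samples, fine_to_coarse):
    """Keep only (image, label) pairs whose ImageNet label maps to a VOC class."""
    return [(image, label) for image, label in samples
            if label in fine_to_coarse]


# Toy split with made-up file names.
samples = [
    ("img_0.jpg", "beagle"),
    ("img_1.jpg", "volcano"),
    ("img_2.jpg", "tabby"),
]

subset = build_subset(samples, IMAGENET_TO_VOC)
voc_classes = sorted({IMAGENET_TO_VOC[label] for _, label in subset})

print(subset)        # [('img_0.jpg', 'beagle'), ('img_2.jpg', 'tabby')]
print(voc_classes)   # ['cat', 'dog']
```

The subset keeps the fine-grained ImageNet labels, so a classifier trained on it still distinguishes the individual classes; the coarse names are only used to decide which classes belong in the subset. Applying the same filter to each split separately preserves the property that the subset's splits are contained in the corresponding original splits.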

We first compare the performance in the following two setups: (i) a model trained on ImageNet-1000 and evaluated on the test split of ImageNet-52P, and (ii) a model trained on ImageNet-52P and evaluated on the test split of ImageNet-52P. We observe that while the first setup gives a top-1 accuracy of 74%, the second gives an accuracy of 87%. This suggests that the model trained on ImageNet-1000 is burdened with extra information about the remaining 948 classes and hence performs worse on the 52 classes of interest. We should thus be able to prune the network effectively to cater to only the 52 classes of interest. Note that in practice it is desirable to have just one network trained on ImageNet-1000 and then prune it for the different subsets of classes that we are interested in, instead of training a separate network from scratch for each of these subsets. We again compare the different pruning strategies listed earlier, except that now, when fine-tuning (after each layer and at the end of all layers), we use only ImageNet-52P. In other words, we fine-tune using only data corresponding to the 52 classes. Once again, we observe that there is not much difference between random pruning and the other pruning strategies (Table 4). Also, with 25% pruning we are able to almost match the performance of a network trained only on these 52 classes (i.e., 87%).
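For concreteness, the random strategy can be sketched as below. This is a minimal sketch and not the authors' implementation: it assumes a convolutional layer is stored as a list of filters (one entry per output channel) and the following layer as a list of rows indexed by input channel; the function names and the fixed seed are assumptions. Fine-tuning after pruning is not shown.

```python
import random
from typing import List, Sequence, Tuple


def random_keep_indices(num_filters: int, prune_fraction: float, seed: int = 0) -> List[int]:
    """Pick, uniformly at random, which filters survive pruning.

    Returns the sorted indices of the filters that are kept.
    """
    if not 0.0 <= prune_fraction < 1.0:
        raise ValueError("prune_fraction must lie in [0, 1)")
    num_pruned = int(round(num_filters * prune_fraction))
    rng = random.Random(seed)  # seeded so the selection is reproducible
    pruned = set(rng.sample(range(num_filters), num_pruned))
    return [i for i in range(num_filters) if i not in pruned]


def prune_layer_pair(
    filters: Sequence, next_layer: Sequence[Sequence], keep: Sequence[int]
) -> Tuple[list, list]:
    """Drop pruned filters and the matching input channels of the next layer.

    filters[i] is the i-th output filter of the current layer;
    next_layer[j][i] is the kernel of the next layer's filter j on input channel i.
    """
    pruned_filters = [filters[i] for i in keep]
    pruned_next = [[row[i] for i in keep] for row in next_layer]
    return pruned_filters, pruned_next


# Toy usage: a layer with 64 filters feeding a layer with 8 filters, pruned by 25%.
layer = [f"filter_{i}" for i in range(64)]
following = [[f"kernel_{j}_{i}" for i in range(64)] for j in range(8)]
keep = random_keep_indices(num_filters=64, prune_fraction=0.25)
small_layer, small_following = prune_layer_pair(layer, following, keep)
```

Removing a filter also removes one input channel from the next layer, which is where the reduction in computation comes from. In the class-specific setting described above, the pruned network would then be fine-tuned on ImageNet-52P only.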

Heuristics          25%     50%     75%
Random              0.859   0.820   0.692
Mean Activation     0.866   0.816   0.698
Entropy             0.860   0.802   0.684
Scaled Entropy      0.863   0.813   0.691
l_1-norm            0.867   0.823   0.729
APoZ                0.858   0.811   0.700
Important Classes   0.857   0.795   0.655
Sensitivity         0.849   0.793   0.634
Table 4: Comparison of different filter pruning strategies when fine-tuned and evaluated with ImageNet-52P.

6 Experiments: Faster Object Detection

The above experiments have shown that with reasonable levels of pruning (25-50%) and enough fine-tuning (using the entire data), the pruned network is able to recover and almost match the performance of the unpruned network on the original task (image classification), even with a random pruning strategy. However, it is possible that if such a pruned network is used for a new task, say object detection, then a randomly pruned network may not give the same performance as a carefully pruned network. To check this, we perform experiments using the Faster-RCNN model for object detection. Note that the Faster-RCNN model uses a VGG-16 model as a base component and then adds other components which are specific to object detection. We experiment with the PASCAL-VOC 2007 dataset [8], which consists of 9,963 images containing 24,640 annotated objects. We first plug a standard trained VGG-16 network into Faster-RCNN and then train Faster-RCNN for 70K iterations (as is the standard practice). This model gives a mean Average Precision (mAP) of 0.66 (the baseline reported in Table 6). The idea is to now plug a pruned VGG-16 model into Faster-RCNN instead of the original unpruned model and check the performance. Table 5 again shows that the specific choice of pruning strategy does not have much impact on the final performance on object detection. Of course, as earlier, the performance drops as the level of pruning increases (but the drop is consistent across all pruning strategies). We now report some more interesting experiments on pruning Faster-RCNN.

Directly pruning Faster-RCNN: Instead of plugging a pruned VGG-16 model into Faster-RCNN, we could alternatively take a trained Faster-RCNN model and prune it directly. Here again, we use a simple random pruning strategy and observe that the performance of the pruned model comes very close to that of the unpruned model (Table 6). In particular, with 50% pruning we are able to achieve a mAP of 0.648 with a 74% speedup in terms of frames per second.
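The speed and accuracy trade-off at 50% pruning can be checked directly from the figures in Table 6. The short sketch below uses only those table values; the function names are assumptions made here. With the rounded fps values printed in the table, the relative speedup comes out at roughly 73%, close to the 74% quoted in the conclusion; the small gap is presumably due to rounding of the reported fps.

```python
def relative_speedup(baseline_fps: float, pruned_fps: float) -> float:
    """Relative gain in frames per second, expressed as a percentage."""
    return (pruned_fps - baseline_fps) / baseline_fps * 100.0


def absolute_drop(baseline_map: float, pruned_map: float) -> float:
    """Absolute drop in mAP (in mAP units, not as a percentage)."""
    return baseline_map - pruned_map


# Values copied from Table 6 (baseline vs. 50% random pruning).
BASELINE_FPS, PRUNED_FPS = 7.5, 13.0
BASELINE_MAP, PRUNED_MAP = 0.66, 0.648

speedup_percent = relative_speedup(BASELINE_FPS, PRUNED_FPS)
map_drop = absolute_drop(BASELINE_MAP, PRUNED_MAP)
print(f"speedup: {speedup_percent:.1f}%  mAP drop: {map_drop:.3f}")
```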

Plugging in a VGG-16 model trained using ImageNet-52P: Since we are only interested in the 52 classes corresponding to Pascal VOC, we wanted to check what happens if we plug in a VGG-16 model trained, pruned and fine-tuned only on ImageNet-52P. As shown in Table 7, we do not get much benefit from plugging this specialized model into Faster-RCNN. In fact, in a separate experiment we observed that even if we train a VGG-16 model on a completely random set of 52 classes (different from the 52 classes corresponding to Pascal VOC) and then plug this model into Faster-RCNN, the final performance of the Faster-RCNN model remains the same. This is indeed surprising and further demonstrates the ability of these networks to recover from unfavorable situations.

Heuristics          25%     50%     75%
Random              0.647   0.600   0.505
Mean Activation     0.647   0.601   0.489
Entropy             0.635   0.584   0.501
Scaled Entropy      0.640   0.593   0.507
l_1-norm            0.628   0.608   0.520
APoZ                0.646   0.598   0.514
Sensitivity         0.636   0.592   0.485
Table 5: Object detection results obtained by plugging-in different pruned VGG-16 models into Faster-RCNN.

Faster-RCNN   Baseline   25%     50%     75%
mAP           0.66       0.655   0.648   0.530
fps           7.5        10      13      16
Table 6: Object detection results when directly pruning (random) a fully trained Faster-RCNN model.

Heuristics          25%     50%     75%
Random              0.647   0.580   0.469
Mean Activation     0.644   0.583   0.454
Entropy             0.642   0.578   0.47
Scaled Entropy      0.645   0.580   0.443
l_1-norm            0.648   0.601   0.487
APoZ                0.641   0.585   0.466
Important Classes   0.631   0.568   0.432
Sensitivity         0.637   0.576   0.4345
Table 7: Object detection results obtained by plugging-in different pruned VGG-16 models fine-tuned with ImageNet-52P as opposed to ImageNet-1000.

7 Conclusion and Future Work

We evaluated the performance of various pruning strategies based on (i) the drop in performance after pruning, (ii) the amount of recovery after pruning, (iii) the speed of recovery, and (iv) the amount of data required. We performed extensive evaluations with two networks (VGG-16 and ResNet-50) and presented counter-intuitive results which show that, w.r.t. all these factors, a random pruning strategy performs on par with principled pruning strategies. We also showed that even when such a randomly pruned network is used for a completely new task, it performs well. Finally, we presented results for pruning Faster-RCNN and showed that even a random pruning strategy can give a 74% speed-up w.r.t. frames per second while giving only a 1% drop in performance.