1 Introduction
As deep neural networks are widely deployed on mobile devices, there has been an increasing demand for reducing model size and runtime latency. Network pruning [7, 12, 19] techniques are proposed to achieve model compression and inference acceleration by removing redundant structures and parameters. In addition to the early non-structured pruning methods [16, 7], the structured pruning methods represented by channel pruning [17, 23, 12, 19] have been widely adopted in recent years because they are easy to deploy on general-purpose GPUs. Traditional network pruning methods adopt a three-stage pipeline, namely pre-training, pruning, and fine-tuning [20], as shown in Figure 1(a). The pre-training and pruning steps can also be performed alternately over multiple cycles [10]. However, a recent study [20] has shown that a pruned model can be trained from scratch to achieve comparable prediction performance without fine-tuning the weights inherited from the full model (as shown in Figure 1(b)). This observation implies that the pruned architecture itself is what matters most for the pruned model's performance. Specifically, for channel pruning methods, more attention should be paid to searching the channel number configuration of each layer.
Although it has been confirmed that the weights of the pruned model do not need to be fine-tuned from the pre-trained weights, the structure of the pruned model still needs to be learned and extracted from a well-trained model according to different criteria. This step usually involves a cumbersome and time-consuming weight optimization process. We therefore naturally ask a question: is it really necessary to learn the pruned model structure from pre-trained weights?
In this paper, we explore this question through extensive experiments and find that the answer is quite surprising: an effective pruned structure does not have to be learned from pre-trained weights. We empirically show that the pruned structures discovered from pre-trained weights tend to be homogeneous, which limits the possibility of searching for better structures. In fact, more diverse and effective pruned structures can be discovered by pruning directly from randomly initialized weights, including potential models with better performance.
Based on the above observations, we propose a novel network pruning pipeline in which a pruned network structure is learned directly from the randomly initialized weights (as shown in Figure 1(c)). Specifically, we utilize a technique similar to Network Slimming [19] to learn the channel importance by associating scalar gate values with the channels of each layer. The channel importance values are optimized to improve model performance under a sparsity regularization. What differs from previous works is that we do not update the random weights during this process. After the channel importance has been learned, we use a simple binary search strategy to determine the channel number configuration of the pruned model under a given resource constraint (e.g., FLOPS). Since we do not need to update the model weights during optimization, we can discover the pruned structure extremely fast. Extensive experiments on CIFAR10 [15] and ImageNet [24] show that our method yields a significant searching speedup (up to two orders of magnitude) while achieving comparable or even better model accuracy than traditional pruning methods that use complicated strategies. Our method can free researchers from the time-consuming training process and provide competitive pruning results in future work.
2 Related Work
Network pruning techniques aim to achieve inference acceleration of deep neural networks by removing redundant parameters and structures in the model. Early works [16, 7, 8] proposed to remove individual weight values, resulting in non-structured sparsity in the network. Such runtime acceleration cannot be easily achieved on a general-purpose GPU without a custom inference engine [6]. Recent works focus more on structured model pruning [17, 12, 19], especially pruning weight channels. The L1-norm based criterion [17] prunes the model according to the L1 norm of the weight channels. Channel Pruning [12] learns to obtain sparse weights by minimizing the local layer output reconstruction error. Network Slimming [19] uses LASSO regularization to learn the importance of all channels and prunes the model based on a global threshold. Automatic Model Compression (AMC) [11] explores the pruning strategy by automatically learning the compression ratio of each layer through reinforcement learning (RL). Pruned models often require further fine-tuning to achieve higher prediction performance. However, recent works [20, 4] have challenged this paradigm and show that the compressed model can be trained from scratch to achieve comparable performance without relying on the fine-tuning process.

Recently, neural architecture search (NAS) has provided another perspective on the discovery of compressed model structures. Recent works [18, 3] follow a top-down pruning process by trimming a small network out of a supernet. The one-shot architecture search methods [2, 1] further develop this idea and conduct the architecture search only once after learning the importance of internal cell connections. However, these methods require a large amount of training time to search for an efficient structure.
3 Rethinking Pruning with Pre-Training
Network pruning aims to reduce the redundant parameters or structures of an over-parameterized model to obtain an efficient pruned network. Representative network pruning methods [19, 5] utilize channel importance to evaluate whether a specific weight channel should be retained. Specifically, given a pre-trained model, a set of channel gates is associated with each layer to learn the channel importance. The channel importance values are optimized with an L1-norm based sparsity regularization. With the learned channel importance values, a global threshold is then set to determine which channels are preserved under a predefined resource constraint. The final pruned model weights can either be fine-tuned from the original full model weights or retrained from scratch. The overall pipeline is depicted in Figure 1(a) and (b).
In what follows, we show that in this common network pruning pipeline, the role of pre-training is quite different from what we used to think. Based on this observation, we present a new pipeline in the next section that allows pruning networks from scratch, i.e., from randomly initialized weights.
3.1 Effects of Pre-Training on Pruning
The traditional pruning pipeline seems to assume by default that a network must be fully trained before it can be used for pruning. Here we empirically explore the effect of the pre-trained weights on the final pruned structure. Specifically, when training the baseline network, we save checkpoints after different numbers of training epochs. We then use the weights of the different checkpoints as the network initialization and learn the channel importance of each layer with the pipeline described above. We want to explore whether the pre-trained weights at different training stages have a crucial impact on the final pruned structure.
3.2 Pruned Structure Similarity
First, we compare the structure similarity between different pruned models. For each pruned model, we calculate the pruning ratio of each layer, i.e., the number of remaining channels divided by the number of original channels. The vector formed by concatenating the pruning ratios of all layers is then regarded as the feature representation of the pruned structure. We then calculate the correlation coefficient between each pair of pruned model features as the similarity of their structures. To ensure validity, we run the experiments with five different random seeds on the CIFAR10 dataset with the VGG16 [26] network. We include more visualization results for ResNet20 and ResNet56 [9] in the supplementary material.

Figure 2 shows the correlation coefficient matrices for all pruned models. From this figure, we can observe three phenomena. First, the pruned structures learned from random weights are not similar to any of the structures obtained from pre-trained weights (see the top-left plots in Figure 2(a)-(b)). Second, the pruned structures learned directly from random weights are more diverse, with a wide range of correlation coefficients; in contrast, after only ten epochs of weight updates in the pre-training stage, the resulting pruned structures become almost homogeneous (see Figure 2(c)). Third, the pruned structures based on checkpoints from nearby epochs are more similar to each other, with high correlation coefficients, within the same experiment run (see the right plots in Figure 2(a)-(b)).
The structure similarity results indicate that the space of potential pruned structures is progressively reduced by the weight updates of the pre-training phase, which may limit the achievable performance accordingly. On the other hand, randomly initialized weights allow the pruning algorithm to explore more diverse pruned structures.
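The similarity metric described above can be made concrete with a short sketch. The snippet below builds the per-layer pruning-ratio feature of each pruned model and computes the pairwise correlation-coefficient matrix; the channel counts shown are hypothetical placeholders, not the actual experimental data.

```python
import numpy as np

def structure_feature(remaining_channels, original_channels):
    """Per-layer pruning ratios: remaining channels divided by original channels."""
    return np.asarray(remaining_channels, dtype=np.float64) / np.asarray(original_channels, dtype=np.float64)

def similarity_matrix(pruned_models, original_channels):
    """Pairwise Pearson correlation between structure features of several pruned models."""
    feats = np.stack([structure_feature(m, original_channels) for m in pruned_models])
    return np.corrcoef(feats)  # entry (i, j) = structural similarity of model i and model j

# Hypothetical example: three pruned variants of a 5-layer network.
original = [64, 128, 256, 256, 512]
pruned = [
    [32, 80, 200, 120, 300],   # e.g., pruned from random weights
    [40, 64, 180, 140, 280],   # e.g., pruned from a 40-epoch checkpoint
    [42, 66, 176, 144, 276],   # e.g., pruned from an 80-epoch checkpoint
]
print(similarity_matrix(pruned, original))
```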
3.3 Performance of Pruned Structures
We further train each pruned structure from scratch to compare the final accuracy. Table 1 summarizes the prediction accuracy of all pruned structures on the CIFAR10 test set. It can be observed that the pruned models obtained from random weights always achieve performance comparable to the pruned structures based on pre-trained weights. Moreover, in some cases (such as ResNet20), the pruned structures learned directly from random weights achieve even higher prediction accuracy. These results demonstrate not only that the pruned structures learned directly from random weights are more diverse, but also that these structures are valid and can be trained to reach competitive performance.
Table 1: Test accuracy (%) on the CIFAR10 test set of pruned structures learned from random weights (Rand) and from checkpoints after different numbers of pre-training epochs.

Model | Rand  | 10    | 20    | 30    | 40    | 80    | 120   | 160
VGG16 | 93.68 | 93.60 | 93.83 | 93.71 | 93.69 | 93.64 | 93.69 | 93.58
RN20  | 90.57 | 90.48 | 90.50 | 90.49 | 90.33 | 90.42 | 90.34 | 90.23
RN56  | 92.95 | 92.96 | 92.90 | 92.98 | 93.04 | 93.03 | 92.99 | 93.05
The pruned model accuracy results also demonstrate that pruned structures based on pre-trained weights have little advantage in final prediction performance. Considering that the pre-training phase often requires a cumbersome and time-consuming computation process, we argue that network pruning can start directly from randomly initialized weights.
4 Our Solution: Pruning from Scratch
Based on the above analysis, we propose a new pipeline named pruning from scratch. Different from existing pipelines, it enables researchers to obtain a pruned structure directly from randomly initialized weights.
Specifically, we denote a deep neural network as $f(x; W, A)$, where $x$ is an input sample, $W$ denotes all trainable parameters, and $A$ is the model structure. In general, $A$ includes the operator types, the data flow topology, and the layer hyper-parameters, as modeled in NAS research. In network pruning, we mainly focus on the micro-level layer settings, especially the channel number of each layer in channel pruning strategies.
To efficiently learn the channel importance of each layer, a set of scalar gate values $\lambda^{(l)} \in \mathbb{R}^{C_l}$ is associated with the $l$-th layer along the channel dimension. The gate values are multiplied onto the layer's output to perform channel-wise modulation; a near-zero gate value therefore suppresses the corresponding channel output, resulting in a pruning effect. We denote the scalar gate values across all $L$ layers as $\Lambda = \{\lambda^{(1)}, \ldots, \lambda^{(L)}\}$. The optimization objective for $\Lambda$ is
$$\min_{\Lambda}\ \frac{1}{N}\sum_{i=1}^{N}\mathcal{L}\big(f(x_i; W, \Lambda),\, y_i\big) + \mu\,\Omega(\Lambda) \quad (1)$$
$$\text{s.t.}\quad \mathbf{0} \preceq \lambda^{(l)} \preceq \mathbf{1},\ \ l = 1,\ldots,L \quad (2)$$
where $x_i$ is an input sample with corresponding label $y_i$, $\mathcal{L}$ is the cross-entropy loss function, $\Omega(\Lambda)$ is the sparsity regularization term, and $\mu$ is a balance factor. Here the difference from previous works is twofold. First, we do not update the weights $W$ during channel importance learning; second, we use randomly initialized weights without relying on pre-training.

Following the same approach as Network Slimming, we adopt sub-gradient descent to optimize $\Lambda$ for the non-smooth regularization term. However, a naive L1 norm encourages the gates to be pushed toward zero without constraint, which does not lead to a good pruned structure. Different from the original formulation in Network Slimming, we use the element-wise mean of all the gates to approximate the overall sparsity ratio, and use the squared L2 norm to push the sparsity toward a predefined ratio [22]. Therefore, given a target sparsity ratio $r$, the regularization term is
$$\Omega(\Lambda) = \left(\frac{\sum_{l=1}^{L}\lVert\lambda^{(l)}\rVert_1}{\sum_{l=1}^{L}C_l} - r\right)^{2} \quad (3)$$
where $C_l$ is the channel number of the $l$-th layer. Empirically, we find that this modification yields a more reasonable pruned structure. During the optimization, there can be multiple possible sets of gates for pruning. We select the final gates whose sparsity is below the target ratio while achieving the maximum validation accuracy.
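To make the gate-learning step concrete, the PyTorch-style sketch below optimizes only the channel gates of a randomly initialized network with the cross-entropy loss plus the squared sparsity penalty of Eq. (3), while all weights stay frozen. The names (learn_channel_importance, the default value of mu, the assumption that the model multiplies each layer's output by its gates) are our illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def sparsity_penalty(gates, target_ratio):
    """Squared penalty pushing the element-wise mean of all gates toward the target ratio r (Eq. 3)."""
    all_gates = torch.cat([g.view(-1) for g in gates])
    return (all_gates.mean() - target_ratio) ** 2

def learn_channel_importance(model, gates, loader, target_ratio=0.5, mu=1.0, epochs=10, lr=0.01):
    # Freeze the randomly initialized weights; only the gate parameters are trainable.
    for p in model.parameters():
        p.requires_grad_(False)
    optimizer = torch.optim.Adam(gates, lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            logits = model(x)  # the model is assumed to multiply each layer's output by its gate vector
            loss = F.cross_entropy(logits, y) + mu * sparsity_penalty(gates, target_ratio)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            with torch.no_grad():  # keep the gates inside [0, 1], as in Eq. (2)
                for g in gates:
                    g.clamp_(0.0, 1.0)
    return gates
```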
After obtaining a set of optimized gate values $\Lambda^{*}$, we set a threshold to decide which channels are pruned. In the original Network Slimming method, the global pruning threshold is determined according to a predefined reduction ratio of the target structure's parameter size. However, a more practical approach is to find the pruned structure based on a FLOPS constraint of the target structure. A global threshold can then be determined by binary search until the pruned structure satisfies the constraint.
Algorithm 1 summarizes the searching strategy. Notice that a model architecture generator is required to generate a model structure given a set of channel number configurations. Here we only decide the channel number of each convolutional layer and do not change the original layer connection topology.
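The threshold search can be sketched as a simple bisection over the global gate threshold. In the sketch below, build_structure (the architecture generator that maps channel numbers to a model structure) and flops_of (a FLOPS counter) are assumed helper functions, not routines defined in the paper.

```python
def search_threshold(gates, build_structure, flops_of, flops_budget, tol=1e-4, max_iters=50):
    """Binary-search a global gate threshold so that the pruned structure fits the FLOPS budget."""
    lo, hi = 0.0, 1.0
    best = None
    for _ in range(max_iters):
        mid = (lo + hi) / 2.0
        # Keep, in every layer, the channels whose gate value exceeds the threshold.
        channels = [int((g > mid).sum()) for g in gates]
        structure = build_structure(channels)
        if flops_of(structure) <= flops_budget:
            best = structure      # feasible: try a smaller (less aggressive) threshold
            hi = mid
        else:
            lo = mid              # too expensive: raise the threshold to prune more channels
        if hi - lo < tol:
            break
    return best
```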
4.1 Implementations
4.1.1 Channel Expansion
The new pruning pipeline allows us to explore a larger model search space at almost no cost. We can change the full model size and then obtain the target pruned structure by network slimming. The simplest way to change model capacity is uniform channel expansion, which uniformly enlarges or shrinks the channel numbers of all layers with a common width multiplier. For networks with skip connections such as ResNet [9], the number of output channels of each block and the number of channels at the block input are expanded by the same multiplier simultaneously, so that the tensor dimensions still match.
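A minimal sketch of uniform channel expansion: every layer's channel count is scaled by the same width multiplier. The rounding rule below is our assumption.

```python
def expand_channels(base_channels, width_multiplier):
    """Uniformly scale every layer's channel count by a common width multiplier."""
    return [max(1, int(round(c * width_multiplier))) for c in base_channels]

# Hypothetical example: enlarging a small stack of layers by 1.5x.
print(expand_channels([16, 32, 64], 1.5))  # -> [24, 48, 96]
```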
4.1.2 Budget Training
A significant finding in [20] is that a pruned network can achieve performance similar to the full model as long as it is trained for a sufficient period. Therefore, the authors of [20] proposed the "Scratch-B" training scheme, which trains the pruned model for the same amount of computation budget as the full model. For example, if the pruned model halves the FLOPS, we double the number of basic training epochs, which amounts to a similar computation budget. Empirically, this training scheme is crucial for improving the pruned model's performance.
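Following this Scratch-B idea, the epoch count of the pruned model can be scaled so that its total training compute roughly matches the full model's budget. A minimal sketch (the rounding is our assumption):

```python
def budget_epochs(base_epochs, full_flops, pruned_flops):
    """Scale the epoch count so pruned-model training uses roughly the full model's compute budget."""
    return int(round(base_epochs * full_flops / pruned_flops))

# A model with half the FLOPS is trained for twice as many epochs.
print(budget_epochs(160, 4.1e9, 2.05e9))  # -> 320
```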
4.1.3 Channel Gates Location
Following the same practice as Network Slimming [19], we place the channel gates at the end of the BatchNorm layer [14] that follows each convolutional layer, since the affine transformation parameters in BatchNorm can scale the channel outputs. For residual blocks, we only associate gates with the middle layers of each block. For the depth-wise convolution block in MobileNetV1 [13], we associate gates at the end of the second BatchNorm layer. For the inverted residual block in MobileNetV2 [25], we associate gates at the end of the first BatchNorm layer.
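The gate placement described above can be sketched as a small PyTorch module: a learnable per-channel scalar multiplied onto the BatchNorm output of a convolutional layer. This is an illustrative sketch under our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GatedConvBN(nn.Module):
    """Conv -> BatchNorm -> per-channel gate -> ReLU; the gates are the trainable part during structure search."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.gate = nn.Parameter(torch.ones(out_ch))  # one scalar gate per output channel

    def forward(self, x):
        out = self.bn(self.conv(x))
        out = out * self.gate.view(1, -1, 1, 1)  # channel-wise modulation; near-zero gates suppress channels
        return torch.relu(out)
```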
5 Experiments
5.1 Settings
We conduct all the experiments on CIFAR10 and ImageNet datasets. For each dataset, we allocate a separate validation set for evaluation while learning the channel gates. Specifically, we randomly select 5,000 images from the original CIFAR10 training set for validation. For ImageNet, we randomly select 50,000 images (50 images for each category) from the original training set for validation. We adopt conventional training and testing data augmentation pipelines [9].
When learning channel importance for the models on the CIFAR10 dataset, we use the Adam optimizer with an initial learning rate of 0.01 and a batch size of 128. The balance factor is held constant, and the total number of epochs is 10. All models are expanded with a fixed width multiplier, and the predefined sparsity ratio is set equal to the ratio of the pruned model's FLOPS to the full model's. After searching for the pruned network architecture, we train the pruned model from scratch following the same parameter settings and training schedule as [10].
When learning channel importance for the models on the ImageNet dataset, we use the Adam optimizer with an initial learning rate of 0.01 and a batch size of 100. The balance factor is held constant, and the total number of epochs is 1. During training, we evaluate the model performance on the validation set multiple times. After finishing the architecture search, we train the pruned model from scratch using the SGD optimizer. For MobileNets, we use a cosine learning rate scheduler [21] with an initial learning rate of 0.05, a momentum of 0.9, and weight decay. The models are trained for 300 epochs with a batch size of 256. For ResNet50 models, we follow the same hyper-parameter settings as [9]. To further improve performance, we add label smoothing [27] regularization to the total loss.
Table 2: Comparison of pruned models on CIFAR10. "Ratio" is the FLOPS reduction ratio; "ΔAcc" is the accuracy change of the pruned model relative to its baseline (negative values denote a drop).

Method   | Ratio | Baseline (%) | Pruned (%)   | ΔAcc (%)
ResNet20
SFP      | 40%   | 92.20 | 90.83 ± 0.31 | -1.37
Rethink  | 40%   | 92.41 | 91.07 ± 0.23 | -1.34
Ours     | 40%   | 91.75 | 91.14 ± 0.32 | -0.61
uniform  | 50%   | 90.50 | 89.70        | -0.80
AMC      | 50%   | 90.50 | 90.20        | -0.30
Ours     | 50%   | 91.75 | 90.55 ± 0.14 | -1.20
ResNet56
uniform  | 50%   | 92.80 | 89.80        | -3.00
ThiNet   | 50%   | 93.80 | 92.98        | -0.82
CP       | 50%   | 93.80 | 92.80        | -1.00
DCP      | 50%   | 93.80 | 93.49        | -0.31
AMC      | 50%   | 92.80 | 91.90        | -0.90
SFP      | 50%   | 93.59 | 93.35 ± 0.31 | -0.24
Rethink  | 50%   | 93.80 | 93.07 ± 0.25 | -0.73
Ours     | 50%   | 93.23 | 93.05 ± 0.19 | -0.18
ResNet110
L1-norm  | 40%   | 93.53 | 93.30        | -0.23
SFP      | 40%   | 93.68 | 93.86 ± 0.30 | +0.18
Rethink  | 40%   | 93.77 | 93.92 ± 0.13 | +0.15
Ours     | 40%   | 93.49 | 93.69 ± 0.28 | +0.20
VGG16
L1-norm  | 34%   | 93.25 | 93.40        | +0.15
NS       | 51%   | 93.99 | 93.80        | -0.19
ThiNet   | 50%   | 93.99 | 93.85        | -0.14
CP       | 50%   | 93.99 | 93.67        | -0.32
DCP      | 50%   | 93.99 | 94.16        | +0.17
Ours     | 50%   | 93.44 | 93.63 ± 0.06 | +0.19
VGG19
NS       | 52%   | 93.53 | 93.60 ± 0.16 | +0.07
Rethink  | 52%   | 93.53 | 93.81 ± 0.14 | +0.28
Ours     | 52%   | 93.40 | 93.71 ± 0.08 | +0.31
Table 3: Pruning results on ImageNet. Latency is the CPU latency measured with batch size 1; the numbers after "Uniform", "Baseline", and "Ours" denote width multipliers.

Model         | Params | Latency | FLOPS | Top-1 Acc (%)
MobileNetV1
Uniform 0.5   | 1.3M   | 20ms    | 150M  | 63.3
Uniform 0.75  | 3.5M   | 23ms    | 325M  | 68.4
Baseline 1.0  | 4.2M   | 30ms    | 569M  | 70.9
NetAdapt      | –      | –       | 285M  | 70.1
AMC           | 2.4M   | 25ms    | 294M  | 70.5
Ours 0.5      | 1.0M   | 20ms    | 150M  | 65.5
Ours 0.75     | 1.9M   | 21ms    | 286M  | 70.7
Ours 1.0      | 4.0M   | 23ms    | 567M  | 71.6
MobileNetV2
Uniform 0.75  | 2.6M   | 39ms    | 209M  | 69.8
Baseline 1.0  | 3.5M   | 42ms    | 300M  | 71.8
Uniform 1.3   | 5.3M   | 43ms    | 509M  | 74.4
AMC           | 2.3M   | 41ms    | 211M  | 70.8
Ours 0.75     | 2.6M   | 37ms    | 210M  | 70.9
Ours 1.0      | 3.5M   | 41ms    | 300M  | 72.1
Ours 1.3      | 4.5M   | 42ms    | 511M  | 74.1
ResNet50
Uniform 0.5   | 6.8M   | 50ms    | 1.1G  | 72.1
Uniform 0.75  | 14.7M  | 61ms    | 2.3G  | 74.9
Uniform 0.85  | 18.9M  | 62ms    | 3.0G  | 75.9
Baseline 1.0  | 25.5M  | 76ms    | 4.1G  | 76.1
ThiNet-30     | –      | –       | 1.2G  | 72.1
ThiNet-50     | –      | –       | 2.1G  | 74.7
ThiNet-70     | –      | –       | 2.9G  | 75.8
SFP           | –      | –       | 2.9G  | 75.1
CP            | –      | –       | 2.0G  | 73.3
Ours 0.5      | 4.6M   | 44ms    | 1.0G  | 72.8
Ours 0.75     | 9.2M   | 52ms    | 2.0G  | 75.6
Ours 0.85     | 17.9M  | 60ms    | 3.0G  | 76.7
Ours 1.0      | 21.5M  | 67ms    | 4.1G  | 77.2
5.2 CIFAR10 Results
We run each experiment five times and report the mean ± std. We compare our method with other pruning methods, including naive uniform channel number shrinkage (uniform), ThiNet [23], Channel Pruning (CP) [12], L1-norm pruning [17], Network Slimming (NS) [19], Discrimination-aware Channel Pruning (DCP) [29], Soft Filter Pruning (SFP) [10], Rethinking the Value of Network Pruning (Rethink) [20], and Automatic Model Compression (AMC) [11]. We compare the accuracy drop of each method under the same FLOPS reduction ratio. A smaller accuracy drop indicates a better pruning method.
Table 2 summarizes the results. Our method achieves a smaller performance drop across different model architectures compared to the state-of-the-art methods. For large models like ResNet110 and the VGG networks, our pruned models even outperform their baseline models. Notably, our method consistently outperforms the Rethink method, which also uses the same budget training scheme. This validates that our method discovers a more efficient and powerful pruned model architecture.
5.3 ImageNet Results
In this section, we evaluate our method on the ImageNet dataset. We mainly prune three types of models: MobileNetV1 [13], MobileNetV2 [25], and ResNet50 [9]. We compare our method with uniform channel expansion, ThiNet, SFP, CP, AMC, and NetAdapt [28]. We report the top-1 accuracy of each method under the same FLOPS constraint.
Table 3 summarizes the results. When compressing the models, our method outperforms both the uniformly expanded models and other complicated pruning strategies across all three architectures. Since our method allows base channel expansion, we can effectively perform neural architecture search by pruning the model from an enlarged supernet. Our method achieves comparable or even better performance than the original full model design. We also measure the model CPU latency with batch size 1 on a server with two 2.40 GHz Intel(R) Xeon(R) E5-2680 v4 CPUs. The results show that our models achieve similar or even faster inference speed than other pruned models. These results validate that pruning a model directly from a randomly initialized network is both effective and scalable.
5.4 Comparison with Lottery Ticket Hypothesis
Table 4: Comparison with the Lottery Ticket Hypothesis [4] on CIFAR10 (top-1 accuracy, %). "PR" is the pruning ratio.

Model     | PR  | Random (Ours) | Lottery (Frankle '19)
ResNet20  | 40% | 91.14 ± 0.32  | 90.94 ± 0.26
ResNet20  | 50% | 90.44 ± 0.14  | 90.34 ± 0.36
ResNet56  | 50% | 93.05 ± 0.19  | 92.85 ± 0.14
ResNet110 | 40% | 93.69 ± 0.28  | 93.55 ± 0.37
VGG16     | 50% | 93.63 ± 0.06  | 92.95 ± 0.22
VGG19     | 52% | 93.71 ± 0.08  | 93.51 ± 0.21
Figure 3 displays the channel numbers of the pruned models on the CIFAR10 and ImageNet datasets. For each network architecture, we learn the channel importance and prune 50% of the FLOPS relative to the full model under five different random seeds. Although there are some apparent differences in the channel numbers of the intermediate layers, the resulting pruned models perform similarly. This demonstrates that our method is robust and stable under different random initializations.
According to the Lottery Ticket Hypothesis (LTH) [4], a pruned model can only be trained to a competitive performance level if it is re-initialized to the original full model's initialization weights (the "winning ticket"). In our pipeline, we do not require the pruned model to be re-initialized to its original state for retraining the weights. Therefore, we conduct comparison experiments to test whether LTH applies in our scenario. Table 4 summarizes the results. We train all the models for five runs on the CIFAR10 dataset. From the results, we conclude that our method achieves higher pruned model accuracy in all cases, and we do not observe the necessity of the Lottery Ticket initialization. Similar phenomena are also observed in [20]. There are several potential explanations. First, our method focuses on structured pruning, while LTH draws its conclusions from unstructured pruning, which can be highly sparse and irregular, so a specific initialization may be necessary for successful training. Second, as pointed out by [20], LTH uses the Adam optimizer with a small learning rate, which differs from the conventional SGD optimization scheme; different optimization settings can substantially influence pruned model training. In conclusion, our method is valid under mild pruning ratios in the structured pruning setting.
5.5 Computational Costs for Pruning
Since our pruning pipeline does not require updating the weights during structure learning, we can significantly reduce the pruned model search cost. We compare our approach with the traditional Network Slimming (NS) and the RL-based AMC pruning strategies. We measure all model search times on a single NVIDIA GeForce GTX TITAN Xp GPU.
When pruning ResNet56 on the CIFAR10 dataset, NS and AMC take 2.3 hours and 1.0 hour, respectively, while our pipeline takes only 0.12 hours. When pruning ResNet50 on the ImageNet dataset, NS takes approximately 310 hours to complete the entire pruning process. For AMC, although the pruning phase takes about 3.1 hours, a pre-trained full model is required, which corresponds to roughly 300 hours of pre-training. Our pipeline takes only 2.8 hours to obtain the pruned structure from a randomly initialized network. These results illustrate the superior pruning speed of our method.
5.6 Visualizing Pruned Structures
We also compare the pruned structures with those identified by AMC [11], which utilizes a more complicated RL-based strategy to determine layer-wise pruning ratios. Figure 4 summarizes the differences. On MobileNetV1, our method deliberately removes more channels between the eighth and eleventh layers and increases the channels in the early stages and the final two layers. A similar trend appears in the last ten layers of MobileNetV2. This demonstrates that our method can discover more diverse and efficient structures.
6 Ablation Study
In the following sections, we explore the performance of our method under different channel expansion rates, pruning ratios, and sparsity levels.
6.1 Channel Expansion Rate
In the previous section, we proposed to use a width multiplier to uniformly enlarge the channels of each layer (channel expansion). We further investigate the effect of different expansion rates on the final pruned model accuracy. Figure 5 displays the results. All pruned models are required to reduce the FLOPS by 50% compared to the full models. From the figure, we find the general trend that when the expansion rate is too large, the pruned model performance deteriorates. We also notice, somewhat surprisingly, that channel shrinkage (a 0.75x expansion rate) can achieve even higher pruned model performance in some situations. This is because the reduced model capacity limits the search space, which makes it easier for the pruning algorithm to find efficient structures.
6.2 Pruning Ratio
In this section, we explore the performance of the pruned models under different pruning ratios. Figure 6 displays the results. For each pruned model, the channel importance is learned with the predefined sparsity ratio set to match the target FLOPS ratio. All models are trained under the same hyper-parameter settings with the budget training scheme. From the figure, we conclude that our method is robust under different pruning ratios. Even in the extreme situation where a large portion of the FLOPS is removed, our method still achieves comparable prediction performance.
6.3 Sparsity Ratio
In this section, we explore the effect of different sparsity ratios on the performance of the pruned model. The predefined sparsity ratio restricts the overall sparsity of the channel importance values. Figure 7 summarizes the results. All models are required to reduce 50% of the FLOPS of the original full models. From the figure, we observe that the final pruned model accuracy is not very sensitive to the sparsity ratio, although a small sparsity level may have a negative impact on performance. This demonstrates that our method is stable over a range of sparsity ratios and does not require careful hyper-parameter tuning.
7 Discussion and Conclusions
In this paper, we demonstrate through extensive experiments on various models and datasets that the novel pipeline of pruning from scratch is efficient and effective. In addition to high accuracy, pruning from scratch has the following benefits: 1) we can eliminate the cumbersome pre-training process and search for the pruned structure directly on randomly initialized weights at an extremely fast speed; 2) the pruned network structure is no longer limited by the original network size but can be explored in a larger structure space, which helps to find better pruned model structures.
Another important observation is that pre-trained weights reduce the search space for the pruned structure. Meanwhile, we also observe that even after a short period of weight pre-training, the set of possible pruned structures becomes stable and limited. This perhaps implies that structure learning may converge faster than weight learning. Although our pruning pipeline fixes the randomly initialized weights, it still needs to learn the channel importance. This is equivalent to treating each weight channel as a single variable and optimizing the weighting coefficients. The pruned structure learning may become easier with this reduced number of variables.
References

[1] G. Bender, P.-J. Kindermans, B. Zoph, V. Vasudevan, and Q. Le. Understanding and simplifying one-shot architecture search. In Proceedings of the International Conference on Machine Learning (ICML), pages 549–558, 2018.
 [2] A. Brock, T. Lim, J. M. Ritchie, and N. Weston. SMASH: One-shot model architecture search through hypernetworks. In International Conference on Learning Representations (ICLR), 2018.
 [3] H. Cai, L. Zhu, and S. Han. Proxylessnas: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations (ICLR), 2019.
 [4] J. Frankle and M. Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR), 2019.

[5] A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T.-J. Yang, and E. Choi. MorphNet: Fast & simple resource-constrained structure learning of deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1586–1595, 2018.
 [6] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: Efficient inference engine on compressed deep neural network. In ACM/IEEE Annual International Symposium on Computer Architecture (ISCA), pages 243–254, 2016.
 [7] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations (ICLR), 2016.
 [8] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems (NeurIPS), pages 1135–1143, 2015.
 [9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[10] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang. Soft filter pruning for accelerating deep convolutional neural networks. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 2234–2240, 2018.
 [11] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han. AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pages 784–800, 2018.
 [12] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1389–1397, 2017.
 [13] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning (ICML), pages 448–456, 2015.
 [15] A. Krizhevsky et al. Learning multiple layers of features from tiny images. Technical report, 2009.
 [16] Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. In Advances in neural information processing systems (NeurIPS), pages 598–605, 1990.
 [17] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. In International Conference on Learning Representations (ICLR), 2016.
 [18] H. Liu, K. Simonyan, and Y. Yang. Darts: Differentiable architecture search. In International Conference on Learning Representations (ICLR), 2018.
 [19] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2736–2744, 2017.
 [20] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations (ICLR), 2019.

[21] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR), 2016.
 [22] J.-H. Luo and J. Wu. AutoPruner: An end-to-end trainable filter pruning method for efficient deep model inference. arXiv preprint arXiv:1805.08941, 2018.
 [23] J.-H. Luo, J. Wu, and W. Lin. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 5058–5066, 2017.
 [24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
 [25] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4510–4520, 2018.
 [26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [27] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016.
 [28] T.-J. Yang, A. Howard, B. Chen, X. Zhang, A. Go, M. Sandler, V. Sze, and H. Adam. NetAdapt: Platform-aware neural network adaptation for mobile applications. In Proceedings of the European Conference on Computer Vision (ECCV), pages 285–300, 2018.
 [29] Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu. Discrimination-aware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 875–886, 2018.
Appendix A Effects of Pre-Training on Pruning
In the main text, we explore the effects of pre-trained weights on pruned structures by visualizing the structure similarity matrices. Here we present similar results for the ResNet20 and ResNet56 models on the CIFAR10 dataset.
Figures 8 and 9 show the results. All pruned models are required to reduce 50% of the FLOPS of their original models on the CIFAR10 dataset. In each figure, (a) we display the correlation coefficient matrix of the pruned models learned directly from randomly initialized weights ("random") and of the pruned models based on different checkpoints during pre-training ("Epochs") (top-left); we display the correlation coefficient matrix of the pruned structures from pre-trained weights at a finer scale (right); and we show the channel numbers of each layer for the different pruned structures (bottom-left), where the red line denotes the structure from random weights. (b) shows similar results from an experiment with a different random seed. (c) displays the correlation coefficient matrices of all pruned structures from five different random seeds; the names of the initialization weights used to obtain each pruned structure are marked below.
For ResNet20 and ResNet56, we observe the same phenomena as with VGG16. First, the pruned structures learned from random weights are not similar to any of the structures obtained from pre-trained weights. Second, the pruned structures learned directly from random weights are more diverse, with a wide range of correlation coefficients. Third, the pruned structures based on checkpoints from nearby epochs are more similar, with high correlation coefficients, within the same experiment run.
The only difference between the ResNet models and VGG16 is that the similarities among the pruned structures based on pre-trained weights from different random seeds are not as high as those of VGG16. This is mainly because we only prune the layers on the residual branch in ResNet. Since the channel numbers of the backbone layers are fixed, the channel numbers of the pruned layers have greater freedom of choice, so they do not converge to the same structure. However, the similarity between pruned structures based on pre-trained weights is still higher than that obtained from random weights. These results further validate our analysis in the main text.