Network pruning is an important research field aiming at reducing the computational costs of neural networks. Conventional approaches follow a fixed paradigm which first trains a large and redundant network, and then determines which units (e.g., channels) are less important and thus can be removed. In this work, we find that pre-training an over-parameterized model is not necessary for obtaining the target pruned structure. In fact, a fully-trained over-parameterized model will reduce the search space for the pruned structure. We empirically show that more diverse pruned structures can be directly pruned from randomly initialized weights, including potential models with better performance. Therefore, we propose a novel network pruning pipeline which allows pruning from scratch. In the experiments for compressing classification models on CIFAR10 and ImageNet datasets, our approach not only greatly reduces the pre-training burden of traditional pruning methods, but also achieves similar or even higher accuracy under the same computation budgets. Our results encourage the community to rethink the effectiveness of existing techniques used for network pruning.
As deep neural networks are widely deployed on mobile devices, there has been an increasing demand for reducing model size and run-time latency. Network pruning [7, 12, 19] techniques are proposed to achieve model compression and inference acceleration by removing redundant structures and parameters. In addition to the early non-structured pruning methods [16, 7], structured pruning methods, represented by channel pruning [17, 23, 12, 19], have been widely adopted in recent years because of their easy deployment on general-purpose GPUs. Traditional network pruning methods adopt a three-stage pipeline, namely pre-training, pruning, and fine-tuning, as shown in Figure 1(a). The pre-training and pruning steps can also be performed alternately over multiple cycles. However, a recent study has shown that the pruned model can be trained from scratch to achieve comparable prediction performance without fine-tuning the weights inherited from the full model (as shown in Figure 1(b)). This observation implies that the pruned architecture matters more for the pruned model's performance than the inherited weights. Specifically, in channel pruning methods, more attention should be paid to searching the channel number configuration of each layer.
Although it has been confirmed that the weights of the pruned model do not need to be fine-tuned from the pre-trained weights, the structure of the pruned model still needs to be learned and extracted from a well-trained model according to different criteria. This step usually involves a cumbersome and time-consuming weight optimization process. We therefore naturally ask: is it necessary to learn the pruned model structure from pre-trained weights?
In this paper, we explore this question through extensive experiments and find that the answer is quite surprising: an effective pruned structure does not have to be learned from pre-trained weights. We empirically show that the pruned structures discovered from pre-trained weights tend to be homogeneous, which limits the possibility of finding better structures. In fact, more diverse and effective pruned structures can be discovered by directly pruning from randomly initialized weights, including potential models with better performance.
Based on the above observations, we propose a novel network pruning pipeline in which a pruned network structure is learned directly from randomly initialized weights (as shown in Figure 1(c)). Specifically, we utilize a technique similar to Network Slimming to learn the channel importance by associating scalar gate values with each layer. The channel importance is optimized to improve the model performance under sparsity regularization. Unlike previous works, we do not update the random weights during this process. After learning the channel importance, we utilize a simple binary search strategy to determine the channel number configuration of the pruned model given resource constraints (e.g., FLOPS). Since we do not need to update the model weights during optimization, we can discover the pruned structure at an extremely fast speed. Extensive experiments on CIFAR10 and ImageNet show that our method yields a substantial searching speedup on both datasets while achieving comparable or even better model accuracy than traditional pruning methods that use complicated strategies. Our method can free researchers from the time-consuming training process and provide competitive pruning results in future work.
Network pruning techniques aim to accelerate the inference of deep neural networks by removing redundant parameters and structures in the model. Early works [16, 7, 8] proposed to remove individual weight values, resulting in non-structured sparsity in the network. Such sparsity does not easily translate into runtime acceleration on a general-purpose GPU unless a custom inference engine is used. Recent works focus more on structured model pruning [17, 12, 19], especially pruning weight channels. The $\ell_1$-norm based criterion prunes the model according to the $\ell_1$-norm of weight channels. Channel Pruning learns sparse weights by minimizing the local layer output reconstruction error. Network Slimming uses LASSO regularization to learn the importance of all channels and prunes the model based on a global threshold. Automatic Model Compression (AMC) explores the pruning strategy by automatically learning the compression ratio of each layer through reinforcement learning (RL). Pruned models often require further fine-tuning to achieve higher prediction performance. However, recent works [20, 4] have challenged this paradigm and shown that the compressed model can be trained from scratch to achieve comparable performance without relying on the fine-tuning process.
Recently, neural architecture search (NAS) has provided another perspective on the discovery of compressed model structures. Recent works [18, 3] follow the top-down pruning process by trimming a small network out of a supernet. The one-shot architecture search methods [2, 1] further develop this idea and conduct architecture search only once after learning the importance of internal cell connections. However, these methods require a large amount of training time to search for an efficient structure.
Network pruning aims to reduce the redundant parameters or structures in an over-parameterized model to obtain an efficient pruned network. Representative network pruning methods [19, 5] utilize channel importance to evaluate whether a specific weight channel should be retained. Specifically, given a pre-trained model, a set of channel gates is associated with each layer to learn the channel importance. The channel importance values are optimized with $\ell_1$-norm based sparsity regularization. Then, with the learned channel importance values, a global threshold is set to determine which channels are preserved given a predefined resource constraint. The final pruned model weights can either be fine-tuned from the original full model weights or re-trained from scratch. The overall pipeline is depicted in Figure 1(a) and (b).
In what follows, we show that in the common pipeline of network pruning, the role of pre-training is quite different from what we used to think. Based on this observation, we present a new pipeline which allows pruning networks from scratch, i.e., randomly initialized weights, in the next section.
The traditional pruning pipeline implicitly assumes that a network must be fully trained before it can be pruned. Here we empirically explore the effect of the pre-trained weights on the final pruned structure. Specifically, we save checkpoints after different numbers of training epochs while training the baseline network. We then use the weights of each checkpoint as the network initialization and learn the channel importance of each layer with the pipeline described above. We want to explore whether the pre-trained weights at different training stages have a crucial impact on the final pruned structure.
First, we compare the structure similarity between different pruned models. For each pruned model, we calculate the pruning ratio of each layer, i.e., the number of remaining channels divided by the number of original channels. The vector formed by concatenating the pruning ratios of all layers is then considered to be the feature representation of the pruned structure. Then we calculate the correlation coefficient between each pair of pruned model features as the similarity of their structures. To ensure validity, we repeat the experiments with five different random seeds on the CIFAR10 dataset with the VGG16 network. We include more visualization results for ResNet20 and ResNet56 in the supplementary material.
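The similarity measure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names `keep_ratios` and `pearson` and the channel counts are hypothetical, and the correlation coefficient is assumed to be the Pearson coefficient.

```python
from math import sqrt

def keep_ratios(pruned_channels, original_channels):
    """Per-layer ratio of remaining channels to original channels
    (what the text calls the pruning ratio of each layer)."""
    return [p / o for p, o in zip(pruned_channels, original_channels)]

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / sqrt(var_u * var_v)

# Hypothetical channel counts for a 4-layer network (illustrative only).
original = [64, 128, 256, 512]
model_a = [48, 96, 128, 128]   # prunes late layers heavily
model_b = [50, 90, 120, 140]   # similar shape to model_a
model_c = [16, 64, 224, 480]   # prunes early layers heavily

feat_a = keep_ratios(model_a, original)
feat_b = keep_ratios(model_b, original)
feat_c = keep_ratios(model_c, original)

sim_ab = pearson(feat_a, feat_b)  # similar structures: close to 1
sim_ac = pearson(feat_a, feat_c)  # opposite pruning pattern: negative
print(sim_ab, sim_ac)
```

Computing this value for every pair of pruned models gives a matrix of the kind visualized in Figure 2.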
Figure 2 shows the correlation coefficient matrices for all pruned models. From this figure, we can observe three phenomena. First, the pruned structures learned from random weights are dissimilar to all of the network structures obtained from pre-trained weights (see the top-left panels in Figure 2(a)(b)). Second, the pruned model structures learned directly from random weights are more diverse, with a wide range of correlation coefficients. In contrast, after only ten epochs of weight updates in the pre-training stage, the resulting pruned network structures become almost homogeneous (see Figure 2(c)). Third, within the same experiment run, the pruned structures based on checkpoints from nearby epochs are more similar, with high correlation coefficients (see the right panels in Figure 2(a)(b)).
The structure similarity results indicate that the potential pruned structure space is progressively reduced during the weights update in the pre-training phase, which may limit the potential performance accordingly. On the other hand, the randomly initialized weights allow the pruning algorithm to explore more diverse pruned structures.
We further train each pruned structure from scratch to compare the final accuracy. Table 1 summarizes the prediction accuracy of all pruned structures on the CIFAR10 test set. It can be observed that the pruned models obtained from random weights always achieve performance comparable to that of the pruned structures based on pre-trained weights. Also, in some cases (such as ResNet20), the pruned structures directly learned from random weights achieve even higher prediction accuracy. These results demonstrate not only that the pruned structures learned directly from random weights are more diverse, but also that these structures are valid and can be trained to reach competitive performance.
The pruned model accuracy results also demonstrate that the pruned structures based on pre-trained weights have little advantage in final prediction performance. Considering that the pre-training phase often requires a cumbersome and time-consuming computation process, we argue that network pruning can start directly from randomly initialized weights.
Based on the above analysis, we propose a new pipeline named pruning from scratch. Unlike existing pipelines, it enables researchers to obtain a pruned structure directly from randomly initialized weights.
Specifically, we denote a deep neural network as $f(x; W, \alpha)$, where $x$ is an input sample, $W$ is all trainable parameters, and $\alpha$ is the model structure. In general, $\alpha$ includes operator types, data flow topology, and layer hyper-parameters as modeled in NAS research. In network pruning, we mainly focus on the micro-level layer settings, especially the channel number of each layer in channel pruning strategies.
To efficiently learn the channel importance for each layer, a set of scalar gate values $\lambda_j$ is associated with the $j$-th layer along the channel dimension. The gate values are multiplied onto the layer’s output to perform channel-wise modulation. Therefore, a near-zero gate value will suppress the corresponding channel output, resulting in a pruning effect. We denote the scalar gate values across all the layers as $\Lambda$. The optimization objective for $\Lambda$ is
$$\min_{\Lambda} \sum_{i} \mathcal{L}\big(f(x_i; W, \Lambda), y_i\big) + \gamma\, \Omega(\Lambda),$$
where $y_i$ is the corresponding label, $\mathcal{L}$ is the cross-entropy loss function, $\Omega$ is the sparsity regularization term, and $\gamma$ is a balance factor. Here, the difference from previous works is two-fold. First, we do not update the weights during channel importance learning. Second, we use randomly initialized weights without relying on pre-training.
Following the same approach as Network Slimming, we adopt sub-gradient descent to optimize $\Lambda$ for the non-smooth regularization term. However, the naive $\ell_1$-norm will push the gates towards zero without constraint, which does not lead to a good pruned structure. Different from the original formulation in Network Slimming, we use the element-wise mean of all the gates to approximate the overall sparsity ratio, and use the square norm to push the sparsity to a predefined ratio $r$. Therefore, given a target sparsity ratio $r$, the regularization term is
$$\Omega(\Lambda) = \left( \frac{\sum_{j} \|\lambda_j\|_1}{\sum_{j} C_j} - r \right)^2,$$
where $C_j$ is the channel number of the $j$-th layer. Empirically, we find that this modification yields more reasonable pruned structures. During the optimization, multiple sets of gates may be candidates for pruning. We select the final gates whose sparsity is below the target ratio while achieving the maximum validation accuracy.
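To make the regularizer concrete, the sketch below computes the squared difference between the element-wise mean of all gate magnitudes and a target ratio, together with one valid sub-gradient. It is an illustration under assumptions: the function names, the list-of-lists gate layout, and the choice sign(0) = 0 are mine and are not taken from the paper.

```python
def sparsity_regularizer(gates, target_ratio):
    """Squared gap between the mean |gate| over all channels and the target ratio."""
    flat = [abs(g) for layer in gates for g in layer]
    mean_gate = sum(flat) / len(flat)
    return (mean_gate - target_ratio) ** 2

def regularizer_subgradient(gates, target_ratio):
    """One sub-gradient of the regularizer w.r.t. each gate (using sign(0) = 0)."""
    total = sum(len(layer) for layer in gates)
    mean_gate = sum(abs(g) for layer in gates for g in layer) / total
    scale = 2.0 * (mean_gate - target_ratio) / total

    def sign(g):
        return (g > 0) - (g < 0)

    return [[scale * sign(g) for g in layer] for layer in gates]

# Hypothetical gates for two layers with 2 and 4 channels; their mean is 0.5.
gates = [[0.5, 0.5], [0.25, 0.75, 0.5, 0.5]]

omega_at_target = sparsity_regularizer(gates, target_ratio=0.5)
omega_off_target = sparsity_regularizer(gates, target_ratio=0.25)
grads = regularizer_subgradient(gates, target_ratio=0.25)
print(omega_at_target, omega_off_target)
```

Because the penalty vanishes once the mean gate value reaches the target, the gates are not driven all the way to zero, which is the behaviour the text contrasts with the plain norm penalty.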
After obtaining a set of optimized gate values, we set a threshold to decide which channels are pruned. In the original Network Slimming method, the global pruning threshold is determined according to a predefined reduction ratio of the target structure’s parameter size. However, a more practical approach is to find the pruned structure based on the FLOPS constraint of the target structure. A global threshold can be determined by binary search until the pruned structure satisfies the constraint.
Algorithm 1 summarizes the searching strategy. Notice that a model architecture generator is required to generate a model structure given a set of channel number configurations. Here we only decide the channel number of each convolutional layer and do not change the original layer connection topology.
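The threshold search described above can be sketched as follows. This is an illustrative sketch and not the authors' Algorithm 1: the cost model `conv_chain_flops` (a plain chain of 3x3 convolutions on a 32x32 feature map), the rule of keeping at least one channel per layer, and the gate values are all assumptions made for the example.

```python
def conv_chain_flops(channels, in_channels=3, kernel=3, spatial=32):
    """FLOPs proxy for a plain chain of conv layers (hypothetical cost model)."""
    flops, prev = 0, in_channels
    for c in channels:
        flops += prev * c * kernel * kernel * spatial * spatial
        prev = c
    return flops

def channels_above(gates, threshold):
    """Channels kept per layer; at least one channel is kept in every layer."""
    return [max(1, sum(1 for g in layer if g > threshold)) for layer in gates]

def search_threshold(gates, flops_budget, steps=30):
    """Binary-search a global gate threshold so that FLOPs <= flops_budget.

    Assumes the budget is at least the cost of the one-channel-per-layer model.
    """
    lo, hi = 0.0, max(g for layer in gates for g in layer)
    for _ in range(steps):
        mid = (lo + hi) / 2
        if conv_chain_flops(channels_above(gates, mid)) > flops_budget:
            lo = mid  # still too expensive: raise the threshold, prune more
        else:
            hi = mid  # within budget: lower the threshold, keep more channels
    return hi, channels_above(gates, hi)

# Hypothetical learned gates for three layers with 8, 16 and 32 channels.
gates = [[(k + 1) / n for k in range(n)] for n in (8, 16, 32)]
full_flops = conv_chain_flops([len(layer) for layer in gates])
budget = 0.5 * full_flops

threshold, pruned_channels = search_threshold(gates, budget)
print(threshold, pruned_channels, conv_chain_flops(pruned_channels) / full_flops)
```

The search relies on FLOPs decreasing monotonically as the threshold rises, so `hi` always refers to a configuration that satisfies the budget. In practice the model architecture generator mentioned above would replace the toy cost model.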
The new pruning pipeline allows us to explore a larger model search space at almost no extra cost. We can change the full model size and then obtain the target pruned structure by slimming the network. The easiest way to change model capacity is to use uniform channel expansion, which uniformly enlarges or shrinks the channel numbers of all layers with a common width multiplier. For networks with skip connections such as ResNet, the number of final output channels of each block and the number of channels at the block input are expanded simultaneously by the same multiplier to ensure that the tensor dimensions match.
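Uniform channel expansion amounts to scaling every layer's channel count by one shared factor. The snippet below is a minimal sketch: the function name, the rounding rule, and the base channel counts are assumptions for illustration, and the multiplier 1.25 is an arbitrary example value.

```python
def expand_channels(channels, multiplier):
    """Scale every layer's channel count by one shared width multiplier."""
    return [max(1, round(c * multiplier)) for c in channels]

base = [16, 32, 64]                     # hypothetical per-stage channel counts
wider = expand_channels(base, 1.25)     # enlarge the full model
narrower = expand_channels(base, 0.75)  # shrink the full model
print(wider, narrower)
```

Because the same multiplier is applied everywhere, the input and output widths of a residual block stay equal to each other, so the skip connection remains dimensionally valid.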
A significant finding of prior work is that a pruned network can achieve performance similar to that of the full model as long as it is trained for a sufficiently long period. Accordingly, that work proposed the “Scratch-B” training scheme, which trains the pruned model with the same computation budget as the full model. For example, if the pruned model halves the FLOPS, we double the number of basic training epochs, which amounts to a similar computation budget. Empirically, this training scheme is crucial for improving the pruned model performance.
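The budget-matching rule can be written as a one-line calculation. The sketch below assumes that training cost is proportional to FLOPs per sample times the number of epochs; the function name and the base schedule of 160 epochs are hypothetical.

```python
def scratch_b_epochs(base_epochs, full_flops, pruned_flops):
    """Epochs for the pruned model so its total training compute matches the full model."""
    return round(base_epochs * full_flops / pruned_flops)

# A pruned model with half the FLOPs trains for twice as many epochs.
epochs_half_flops = scratch_b_epochs(160, full_flops=2.0e8, pruned_flops=1.0e8)
print(epochs_half_flops)
```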
Following the same practice as Network Slimming, we attach the channel gates at the end of the BatchNorm layer after each convolutional layer, since we can use the affine transformation parameters in BatchNorm to scale the channel output. For the residual block, we only attach gates in the middle layers of each block. For the depth-wise convolution block in MobileNetV1, we attach gates at the end of the second BatchNorm layer. For the inverted residual block in MobileNetV2, we attach gates at the end of the first BatchNorm layer.
We conduct all the experiments on the CIFAR10 and ImageNet datasets. For each dataset, we allocate a separate validation set for evaluation while learning the channel gates. Specifically, we randomly select 5,000 images from the original CIFAR10 training set for validation. For ImageNet, we randomly select 50,000 images (50 images for each category) from the original training set for validation. We adopt conventional training and testing data augmentation pipelines.
When learning channel importance for the models on CIFAR10 dataset, we use Adam optimizer with an initial learning rate of 0.01 with a batch-size of 128. The balance factor and total epoch is 10. All the models are expanded by , and the predefined sparsity ratio equals the percentage of the pruned model’s FLOPS to the full model. After searching for the pruned network architecture, we train the pruned model from scratch following the same parameter settings and training schedule in .
When learning channel importance for the models on ImageNet dataset, we use Adam optimizer with an initial learning rate of 0.01 and a batch-size of 100. The balance factor and total epoch is 1. During training, we evaluate the model performance on the validation set multiple times. After finishing the architecture search, we train the pruned model from scratch using SGD optimizer. For MobileNets, we use cosine learning rate scheduler  with an initial learning rate of 0.05, momentum of 0.9, weight-decay of . The model is trained for 300 epochs with a batch size of 256. For ResNet50 models, we follow the same hyper-parameter settings in . To further improve the performance, we add label smoothing  regularization in the total loss.
Table 2 columns: Method | Ratio | Baseline (%) | Pruned (%) | Acc (%)
Table 3 columns: Model | Params | Latency | FLOPS | Top-1 Acc (%)
We run each experiment five times and report the mean ± std. We compare our method with other pruning methods, including naive uniform channel number shrinkage (uniform), ThiNet, Channel Pruning (CP), L1-norm pruning, Network Slimming (NS), Discrimination-aware Channel Pruning (DCP), Soft Filter Pruning (SFP), rethinking the value of network pruning (Rethink), and Automatic Model Compression (AMC). We compare the performance drop of each method under the same FLOPS reduction ratio. A smaller accuracy drop indicates a better pruning method.
Table 2 summarizes the results. Our method achieves a smaller performance drop across different model architectures compared to the state-of-the-art methods. For large models like ResNet110 and VGGNets, our pruned model achieves even better performance than the baseline model. Notably, our method consistently outperforms the Rethink method, which also utilizes the same budget training scheme. This validates that our method discovers a more efficient and powerful pruned model architecture.
In this section, we test our method on the ImageNet dataset. We mainly prune three types of models: MobileNet-V1, MobileNet-V2, and ResNet50. We compare our method with uniform channel expansion, ThiNet, SFP, CP, AMC, and NetAdapt. We report the top-1 accuracy of each method under the same FLOPS constraint.
Table 3 summarizes the results. When compressing the models, our method outperforms both uniform expansion models and other complicated pruning strategies across all three architectures. Since our method allows base channel expansion, we can realize neural architecture search by pruning the model from an enlarged supernet. Our method achieves comparable or even better performance than the original full model design. We also measure the model CPU latency with batch size 1 on a server with two 2.40GHz Intel(R) Xeon(R) CPU E5-2680 v4 processors. Results show that our model achieves similar or even faster inference speed than other pruned models. These results validate that it is both effective and scalable to prune a model directly from a randomly initialized network.
Table 4 columns: Model | PR | Random (Ours) | Lottery (Frank’19)
Figure 3 displays the channel numbers of the pruned models on the CIFAR10 and ImageNet datasets. For each network architecture, we learn the channel importance and prune 50% of the FLOPS of the full model under five different random seeds. Though there are some apparent differences in the channel numbers of the intermediate layers, the resulting pruned model performance remains similar. This demonstrates that our method is robust and stable under different random initializations.
According to the Lottery Ticket Hypothesis (LTH), a pruned model can only be trained to a competitive performance level if it is re-initialized to the original full model initialization weights (“winning tickets”). In our pipeline, we do not require the pruned model to be re-initialized to its original state before re-training the weights. Therefore, we conduct comparison experiments to test whether LTH applies in our scenario. Table 4 summarizes the results. We train all the models for five runs on the CIFAR10 dataset. From the results, we conclude that our method achieves higher accuracy for the pruned models in all cases, and we do not observe that the LTH initialization is necessary. Similar phenomena have also been observed in prior work. There are several potential explanations. First, our method focuses on structured pruning, while LTH draws its conclusions from unstructured pruning, where the network can be highly sparse and irregular and a specific initialization is necessary for successful training. Second, as pointed out in prior work, LTH uses the Adam optimizer with a small learning rate, which differs from the conventional SGD optimization scheme. Different optimization settings can substantially influence the training of the pruned model. In conclusion, our method is valid under mild pruning ratios in the structured pruning setting.
Since our pruning pipeline does not require updating weights during structure learning, we can significantly reduce the pruned model search cost. We compare our approach to traditional Network Slimming and RL-based AMC pruning strategies. We measure all model search time on a single NVIDIA GeForce GTX TITAN Xp GPU.
When pruning ResNet56 on the CIFAR10 dataset, NS and AMC take 2.3 hours and 1.0 hours, respectively, and our pipeline only takes 0.12 hours. When pruning ResNet50 on ImageNet dataset, NS takes approximately 310 hours to complete the entire pruning process. For AMC, although the pruning phase takes about 3.1 hours, a pre-trained full model is required, which is equivalent to about 300 hours of pre-training. Our pipeline takes only 2.8 hours to obtain the pruned structure from a randomly initialized network. These results illustrate the superior pruning speed of our method.
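The reported search times translate into speedup factors by simple division. The sketch below only redoes that arithmetic on the hours quoted above; adding AMC's roughly 300 hours of pre-training to its 3.1-hour pruning phase follows the accounting suggested in the text, and the dictionary layout and key names are illustrative.

```python
# Search time in hours, as quoted in the text.
search_hours = {
    "cifar10_resnet56": {"NS": 2.3, "AMC": 1.0, "ours": 0.12},
    "imagenet_resnet50": {"NS": 310.0, "AMC": 3.1 + 300.0, "ours": 2.8},
}

# Speedup of our pipeline over each baseline = baseline hours / our hours.
speedups = {
    task: {m: h / hours["ours"] for m, h in hours.items() if m != "ours"}
    for task, hours in search_hours.items()
}

for task, per_method in speedups.items():
    for method, factor in per_method.items():
        print(f"{task}: {factor:.1f}x faster than {method}")
```

On these numbers the search is roughly 8 to 19 times faster on CIFAR10 and more than 100 times faster on ImageNet.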
We also compare the pruned structures with those identified by AMC, which utilizes a more complicated RL-based strategy to determine layer-wise pruning ratios. Figure 4 summarizes the difference. On MobileNet-V1, our method removes more channels between the eighth and eleventh layers, and keeps more channels in the early stage and the final two layers. A similar trend persists in the last ten layers of MobileNet-V2. This demonstrates that our method can discover more diverse and efficient structures.
In the following sections, we explore the performance of our method under different channel expansion rates, pruning ratios, and sparsity levels.
In the previous section, we proposed using a width multiplier to uniformly enlarge the channels of each layer (channel expansion). We further investigate the effect of different expansion rates on the final pruned model accuracy. Figure 5 displays the results. All the pruned models are required to reduce FLOPS by 50% compared to the full models. From the figure, we find that the general trend is that when the expansion rate is too large, the pruned model performance deteriorates. Surprisingly, we also notice that channel shrinkage (0.75 expansion) can achieve even higher pruned model performance in some situations. This is because the preset reduced model capacity limits the search space, which makes it easier for the pruning algorithm to find efficient structures.
In this section, we explore the performance of the pruned model under different pruning ratios. Figure 6 displays the results. For each pruned model, the channel importance is learned under a predefined sparsity ratio. Also, all the models are trained under the same hyper-parameter settings with the budget training scheme. From the figure, we conclude that our method is robust under different pruning ratios. Even in the extreme situation where a large portion of the FLOPS is removed, our method still achieves comparable prediction performance.
In this section, we explore the effect of different sparsity ratios on the performance of the pruned model. The predefined sparsity ratio is used to restrict the overall sparsity of the channel importance values. Figure 7 summarizes the results. All the models are required to reduce the FLOPS of the original full models by 50%. From the figure, we observe that the final pruned model accuracy is not very sensitive to the sparsity ratio, though a small sparsity level may have a negative impact on performance. This demonstrates that our method is stable over a range of sparsity ratios and does not require careful hyper-parameter tuning.
In this paper, we demonstrate through extensive experiments on various models and datasets that the novel pipeline of pruning from scratch is efficient and effective. In addition to high accuracy, pruning from scratch has the following benefits: 1) we can eliminate the cumbersome pre-training process and search for the pruned structure directly on randomly initialized weights at an extremely fast speed; 2) the pruned network structure is no longer limited by the original network size and can explore a larger structure space, which helps in finding better pruned model structures.
Another important observation is that pre-trained weights reduce the search space for the pruned structure. We also observe that even after a short period of pre-training, the possible pruned structures become stable and limited. This perhaps implies that structure learning may converge faster than weight learning. Although our pruning pipeline fixes the randomly initialized weights, it still needs to learn the channel importance. This is equivalent to treating each weight channel as a single variable and optimizing the weighting coefficients. Learning the pruned structure may become easier with fewer free variables.
Proceedings of the International Conference on Machine Learning (ICML), pages 549–558, 2018.
Soft filter pruning for accelerating deep convolutional neural networks. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 2234–2240, 2018.
SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR), 2016.
In the main text, we explore the effects of pre-trained weights on pruned structures by visualizing the structure similarity matrices. Here we present additional results for the ResNet20 and ResNet56 models on the CIFAR10 dataset.
Figures 8 and 9 show the results. All the pruned models are required to reduce the FLOPS of their original models by 50% on the CIFAR10 dataset. In each figure, (a) we display the correlation coefficient matrix of the pruned models directly learned from randomly initialized weights (“random”) and other pruned models based on different checkpoints during pre-training (“Epochs”) (top-left). We display the correlation coefficient matrix of pruned structures from pre-trained weights at a finer scale (right). We show the channel numbers of each layer of different pruned structures (bottom-left). The red line denotes the structure from random weights; (b) similar results from the experiment with a different random seed; (c) we display the correlation coefficient matrices of all the pruned structures from five different random seeds. We mark the names of the initialized weights used to get each pruned structure below.
For ResNet20 and ResNet56, we observe the same phenomena as for VGG16. First, the pruned structures learned from random weights are dissimilar to all of the network structures obtained from pre-trained weights. Second, the pruned model structures learned directly from random weights are more diverse, with a wide range of correlation coefficients. Third, within the same experiment run, the pruned structures based on checkpoints from nearby epochs are more similar, with high correlation coefficients.
The only difference between the ResNet models and VGG16 is that the similarities of the pruned structures based on the pre-trained weights of different random seeds are not as high as those of VGG16. This is mainly because we only prune the layers on the residual branch in ResNet. When the channel numbers of the backbone layers are fixed, the channel numbers of these pruned layers have greater freedom of choice, so they do not converge to the same structure. However, the similarity between pruned structures based on pre-trained weights is still higher than that obtained from random weights. These results further validate our analysis in the main text.