Convolutional neural networks (CNNs) have achieved remarkable performance on a wide range of vision and learning tasks [31, 11, 15, 39, 54, 60, 4, 41, 13, 56, 63]. Despite the impressive performance, CNNs are notably over-parameterized and thus lead to high computational overhead and memory footprint in inference. Therefore, network compression techniques are developed to assist the deployment of CNNs in real-world applications.
Filter pruning is an efficient way to reduce the computational cost of CNNs with negligible performance degradation. As shown in Fig. 1, a typical filter pruning pipeline works as follows: 1) train an over-parameterized model with a sparsity-inducing regularization; 2) estimate the importance of each filter and prune the unimportant filters; 3) finetune the compressed model to recover the accuracy. Among these steps, identifying unimportant filters is the key to efficient filter pruning. Prior work [16, 33, 37, 57] prunes filters according to the magnitude of the corresponding model parameters. For example, Li et al. prune convolutional filters of smaller norms, as they are considered to have less impact on the functionality of the network. Network Slimming then proposes to prune channels (i.e., filters) based on the corresponding scaling factors. Specifically, the scaling factors of the batch normalization (BN) layer serve as an indicator of channel importance, and an $\ell_1$ regularization is imposed on them to promote sparsity. As a result, Liu et al. derive an automatically searched network architecture of the compressed model.
However, existing methods select unimportant filters based only on the parameter magnitude of a single layer [16, 33, 37, 24, 23, 61], while neglecting the dependency between consecutive layers. For example, a specific channel with a small BN scaling factor may be followed by a convolution with a large weight magnitude at that channel, making the channel still important to the output. Besides, in the “smaller BN factor, less importance” strategy, BN factors from different layers are gathered together to rank and determine the filters to be pruned. We argue and empirically verify that this strategy is sub-optimal and may lead to unstable network architectures, as it neglects the intrinsic statistical variation among the BN factors of different layers. Empirically, we observe that the pruned architectures of Network Slimming are sometimes unbalanced and lead to severely degraded performance, especially when the pruning ratio is relatively high.
In this paper, we propose a dependency-aware filter pruning strategy, which takes the relationship between adjacent layers into consideration and thus measures the filter importance in a more principled manner. Along this line, we introduce a novel criterion that determines the filters to be pruned by the local importance of two consecutive layers. That is, if one layer is sparse, then more filters will be pruned and vice versa, regardless of the statistics of other layers. Finally, we propose an automatic-regularization-control mechanism in which the coefficient of the sparsity-inducing regularization is dynamically adjusted to meet the desired sparsity. Our contributions are summarized below:
We propose a principled criterion for measuring the filter importance that takes the dependency between adjacent layers into consideration.
Given the dependency-aware filter importance, we prune filters based on the local statistics of each layer, instead of ranking the filter importance across the entire network.
We propose to dynamically control the coefficient of the sparsity-inducing regularization to achieve the desired model sparsity.
Comprehensive experimental results demonstrate that the improved filter pruning strategy performs favorably against the existing strong baseline  on the CIFAR, SVHN, and ImageNet datasets. We also validate our design choices with several ablation studies and verify that the proposed algorithm reaches more stable and well-performing architectures.
II Related Work
II-A Network Pruning
Network pruning is a prevalent technique to reduce redundancy in deep neural networks by removing unimportant neurons. Specifically, weight pruning approaches [3, 12, 17, 18, 19, 32, 52] remove network parameters without structural constraints, thus leading to unstructured architectures that are not well supported by the BLAS libraries. On the other hand, filter pruning methods [33, 37, 25, 26, 43, 62] remove entire filters (i.e., channels) from each layer, thus resulting in compact networks that can be conveniently incorporated into modern BLAS libraries. Depending on how the unimportant filters are identified, existing filter pruning methods can be further divided into two categories: data-dependent filter pruning and data-independent filter pruning.
Data-dependent filter pruning utilizes the training data to determine the filters to be pruned. Polyak et al.  remove filters that produce activations of smaller norms. He et al.  perform a channel selection by minimizing the reconstruction error. Zheng et al.  and Anwar et al.  both evaluate the filter importance via the loss of the validation accuracy without each filter. Molchanov et al.  approximate the exact contribution of each filter with the Taylor expansion. A recent work  proposes a layer-wise recursive Bayesian pruning method with a dropout-based metric of redundancy.
Data-independent filter pruning identifies less important filters based merely on the model itself (i.e., model structure and model parameters). Li et al. discard filters according to the norm of the corresponding parameters, as filters with smaller weights are considered to contribute less to the output. Network Slimming imposes a sparsity-inducing regularization on the scaling factors of the BN layer and then prunes filters with smaller scaling factors. Zhou et al. use an evolutionary algorithm to search for redundant filters during training. He et al. propose to dynamically prune filters during training. In another work, He et al. propose to prune filters that are close to the geometric median. They argue that filters near the geometric median are more likely to be represented by others, thus leading to redundancy.
Our method belongs to data-independent filter pruning, which is generally more efficient because involving the training data brings extra computation. For example, Zheng et al. and Anwar et al. measure the importance of each filter by removing the filter and re-evaluating the compressed model on the validation set. This procedure is extremely time-consuming. Essentially, we take the dependency between consecutive layers into consideration, while previous data-independent methods [33, 37, 25] merely focus on the parameters (either the convolutional weights [33, 25] or the BN scaling factors) of a single layer. Besides, we propose a novel mechanism to dynamically control the coefficient of the sparsity-inducing regularization, instead of pre-defining it based on human heuristics. Incorporating these components, our principled approach can better estimate the filter importance (Sec. V-A) and achieve more balanced pruned architectures (Sec. V-D).
II-B Neural Architecture Search
While most state-of-the-art CNNs [21, 27, 50] are manually designed by human experts, there is also a line of research that explores automatic network architecture learning [2, 9, 36, 46, 55, 58, 68], called neural architecture search (NAS). In particular, automatically tuning the channel width is also studied in NAS. For example, ChamNet builds an accuracy predictor on the Gaussian Process with Bayesian optimization to predict the network accuracy with various channel widths in each layer. FBNet adopts a gradient-based method to optimize the CNN architecture and search for the optimal channel width. The proposed pruning method can be regarded as a particular case of channel width selection as well, except that we impose resource constraints on the selected architecture. However, our method learns the architecture through a single training process, while typical NAS methods may train hundreds of models with different architectures to determine the best-performing one [9, 68]. We highlight that our efficiency is in line with the goal of neural architecture search.
II-C Other Alternatives for Network Compression
There is a line of research [10, 51, 64, 28] that aims to approximate the weight matrices of neural networks with several low-rank matrices using techniques like the Singular Value Decomposition (SVD). However, these methods cannot be applied to the convolutional weights, and thus the acceleration in inference is limited.
Weight quantization [5, 8, 48, 59, 6] reduces the model size by using low bit-width representations of the weights and hidden activations. For example, Courbariaux et al. and Rastegari et al. quantize the real-valued weights into binary or ternary ones, i.e., the weight values are restricted to $\{-1, +1\}$ or $\{-1, 0, +1\}$. Cheng et al. quantize CNNs with a predefined codebook. Despite the significant model-size reduction and inference acceleration, these methods often come with a mild accuracy drop due to the low precision.
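As a rough illustration of what binary and ternary quantization mean, the following minimal sketch maps real-valued weights to $\{-1, +1\}$ by sign and to $\{-1, 0, +1\}$ with a magnitude threshold. The function names and the threshold value are hypothetical; the cited methods choose scales and thresholds in their own ways, which this sketch does not reproduce.

```python
def binarize(weights):
    """Map each weight to +1 or -1 by its sign (zero is mapped to +1)."""
    return [1 if w >= 0 else -1 for w in weights]


def ternarize(weights, threshold):
    """Map small-magnitude weights to 0 and the rest to +1 or -1.

    `threshold` is a hypothetical hyper-parameter chosen for illustration.
    """
    return [0 if abs(w) < threshold else (1 if w > 0 else -1)
            for w in weights]


binary = binarize([0.3, -0.2, 0.0])
ternary = ternarize([0.3, -0.2, 0.05], threshold=0.1)
```

Both mappings discard the weight magnitude, which is one way to see why such methods trade accuracy for model size.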
III Dependency-Aware Filter Pruning
III-A Dependency Analysis
Generally, we assume a typical CNN involves multiple convolution operators (Conv layers), batch normalizations (BN layers) , and non-linearities, which are applied to the input signals sequentially as in Fig. 2. Practically, each channel is transformed independently in the BN layers and non-linearities, while inter-channel information is fused in the Conv layers. To prune filters (i.e., channels) with minimal impact on the network output, we analyze the role each channel plays in the Conv layers as follows.
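Let $x^{l}$ be the hidden activations after normalization and before scaling in the $l$-th BN layer. The scaled activations $y^{l}$ can be formulated as (footnote 1: for simplicity, we omit the shifting parameters in a typical BN layer, and the bias term in Eq. (3))
$$y^{l}_{i} = \gamma^{l}_{i} \cdot x^{l}_{i}, \tag{1}$$
where $\gamma^{l}$ denotes the scaling factor of the $l$-th BN layer and $x^{l}_{i}$ (resp. $y^{l}_{i}$) is the $i$-th channel of $x^{l}$ (resp. $y^{l}$). Then, a Lipschitz-continuous non-linearity $\sigma$ is applied to $y^{l}$, namely,
$$z^{l} = \sigma(y^{l}). \tag{2}$$
Afterward, all channels of $z^{l}$ are fused into $f^{l+1}$ via a convolution operation, and different channels contribute to the fused activation differently. Formally, let $W^{l+1} \in \mathbb{R}^{C_{l+1} \times C_{l} \times k \times k}$ be the convolution filter, where $k$ denotes the kernel size. We have
$$f^{l+1} = W^{l+1} * z^{l}, \tag{3}$$
where $*$ denotes the convolution operator. As convolution is an affine transformation, we re-formulate the linearity of Eq. (3) explicitly:
$$\hat{f}^{l+1} = \hat{W}^{l+1} \hat{z}^{l}, \tag{4}$$
where $\hat{f}^{l+1}$, $\hat{W}^{l+1}$, and $\hat{z}^{l}$ are the unfolded versions of $f^{l+1}$, $W^{l+1}$, and $z^{l}$, respectively. Factorize along the channel axis, and we have
$$\hat{f}^{l+1} = \sum_{i=1}^{C_{l}} \hat{W}^{l+1}_{i} \hat{z}^{l}_{i}, \tag{5}$$
where $\hat{W}^{l+1}_{i}$ and $\hat{z}^{l}_{i}$ are the blocks of $\hat{W}^{l+1}$ and $\hat{z}^{l}$ that correspond to the $i$-th channel.
Then, we analyze the contribution of each channel as follows (footnote 2: here, we assume the non-linearity provides zero activations given zero inputs, and most widely-used non-linearities, such as ReLU and its variants [42, 7, 20], satisfy this property):
$$\|\hat{W}^{l+1}_{i} \hat{z}^{l}_{i}\| \le \|\hat{W}^{l+1}_{i}\| \cdot \|\sigma(\hat{y}^{l}_{i})\| \le C \, \|\hat{W}^{l+1}_{i}\| \cdot \|\hat{y}^{l}_{i}\| = C \, |\gamma^{l}_{i}| \cdot \|\hat{W}^{l+1}_{i}\| \cdot \|\hat{x}^{l}_{i}\|, \tag{6}$$
where $C$ denotes the Lipschitz constant of function $\sigma$, and $\hat{x}^{l}_{i}$ and $\hat{y}^{l}_{i}$ are the unfolded versions of $x^{l}_{i}$ and $y^{l}_{i}$, respectively. Since the normalization operation in the BN layer uniformizes the activations (i.e., $x^{l}_{i}$) across channels, we quantify the contribution of the $i$-th channel by
$$S^{l}_{i} = |\gamma^{l}_{i}| \cdot \|\hat{W}^{l+1}_{i}\|, \tag{7}$$
which serves as our metric for network pruning.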
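The metric can be illustrated with a minimal pure-Python sketch that scores each input channel by the product of the absolute BN scaling factor and the norm of the following convolution's weights that read from that channel. The function name `channel_importance`, the nested-list weight layout, and the use of the Frobenius norm are assumptions made for this illustration; the toy values are hypothetical.

```python
import math


def channel_importance(gamma, next_conv_weight):
    """Score channel i as |gamma[i]| * ||W[:, i, :, :]||.

    gamma: list of C_in BN scaling factors.
    next_conv_weight: nested list shaped [C_out][C_in][k][k], the weights of
    the convolution that consumes the BN output.
    The Frobenius norm is an assumed choice for this sketch.
    """
    scores = []
    for i, g in enumerate(gamma):
        # Sum of squares over every weight that reads from input channel i.
        sq = sum(w * w
                 for out_filter in next_conv_weight
                 for row in out_filter[i]
                 for w in row)
        scores.append(abs(g) * math.sqrt(sq))
    return scores


# Toy case: channel 0 has a small BN factor but large downstream weights,
# channel 1 has a larger BN factor but small downstream weights.
gamma = [0.1, 0.5]
weight = [[[[10.0]], [[0.1]]]]  # C_out = 1, C_in = 2, k = 1
scores = channel_importance(gamma, weight)
```

In this toy case a ranking by BN factor alone would discard channel 0 first, whereas the product ranks it above channel 1.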
III-B Filter Selection
Let be the pruning ratio, and be the number of filters in the convolutional layer. Generally, previous works can be divided into two groups according to the target network.
Pruning with Pre-defined Target Network
Many previous works [24, 23, 25] prune a fixed ratio of filters in each layer. In other words, the number of filters pruned from a layer is its filter count multiplied by the pruning ratio. The architecture of the target network is thus known even before pruning. However, recent work [16, 34] reveals that this strategy cannot find the optimal distribution of the neuron numbers of each convolutional layer across the network, as some layers will be over-parameterized while others are under-parameterized.
Pruning as Architecture Search
Network Slimming  treats pruning as a special form of architecture search, i.e., search for the optimal channel width of each layer. It compares the importance of each convolutional filter across the entire network and prunes filters of less importance. This approach provides more flexibility of the compressed architecture as a higher pruning ratio can be achieved if a specific layer is sparse and vice versa.
However, according to our practice, we find that sometimes too many filters of a layer (or occasionally all filters of a layer) are pruned in this strategy, leading to severely degraded performance. This is because it does not take the intrinsic statistical variation among different layers into consideration. Suppose there are two layers and the corresponding scaling factors are and , respectively. Our target is to prune half of the filters, i.e., . Apparently, the second and third channels should be pruned from the first layer, and the first and third channels should be pruned from the second layer. However, if we rank the scaling factors globally, all filters of the first layer will be pruned, which is obviously unreasonable.
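The failure mode of global ranking can be reproduced with a small sketch. The function name `global_prune` and the scaling factors below are hypothetical and chosen only so that the two layers live on different scales; they are not the values of the example above.

```python
def global_prune(layers, ratio):
    """Rank all scaling factors together and prune the smallest share.

    layers: list of lists of BN scaling factors, one list per layer.
    Returns, per layer, the sorted indices of the pruned channels.
    """
    flat = sorted((abs(g), l, i)
                  for l, gs in enumerate(layers)
                  for i, g in enumerate(gs))
    n_prune = int(len(flat) * ratio)
    pruned = [[] for _ in layers]
    for _, l, i in flat[:n_prune]:
        pruned[l].append(i)
    return [sorted(p) for p in pruned]


# Hypothetical factors: layer 0 is on a much smaller scale than layer 1.
layers = [[0.10, 0.01, 0.02, 0.20], [0.5, 20.0, 0.6, 25.0]]
pruned = global_prune(layers, ratio=0.5)
```

With these values every channel of the first layer falls into the globally smallest half, so the first layer is removed entirely while the second layer is left untouched.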
To alleviate this issue, we instead select the unimportant filters based on the intra-layer statistics. Let be the importance of the channel in the layer. Then, filters with importance factor will be pruned, where the threshold is a hyper-parameter. Formally, the set of filters to be pruned in the layer is:
In our solution, the choice of the filters to be pruned in one layer is made independent of the statistics of other layers, so that the intrinsic statistical differences among layers will not result in dramatically unbalanced neural architecture.
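A minimal sketch of intra-layer selection is given below. It assumes that the threshold is applied relative to the largest importance within the same layer; this relative form, the function name `local_prune`, and the threshold value are assumptions for illustration rather than a restatement of Eq. (8).

```python
def local_prune(importance, threshold):
    """Select channels whose importance is below `threshold` times the
    largest importance in the same layer (assumed form of the criterion)."""
    cutoff = threshold * max(importance)
    return [i for i, s in enumerate(importance) if s < cutoff]


# Hypothetical per-layer importance values on very different scales.
layers = [[0.10, 0.01, 0.02, 0.20], [0.5, 20.0, 0.6, 25.0]]
pruned = [local_prune(s, threshold=0.15) for s in layers]
```

Because each layer is compared only against its own statistics, both layers keep their two dominant channels even though their scales differ by two orders of magnitude.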
III-C Automatic Control of Sparsity Regularization
Network Slimming imposes an $\ell_1$ regularization on the model parameters to promote model sparsity. However, choosing a proper regularization coefficient is non-trivial and mostly requires manual tuning based on human heuristics. For example, Network Slimming performs a grid search in a set of candidate coefficients for each dataset and network architecture. Moreover, different pruning ratios require different levels of model sparsity, and thus different coefficients. It is extremely inefficient to tune the coefficient for each experimental setting.
To avoid manually choosing the coefficient and to meet the required model sparsity at the same time, we propose to automatically control the regularization coefficient. Following the practice of Network Slimming, an $\ell_1$ regularization is imposed on the scaling factors of the batch normalization layers. As shown in Alg. 1, at the end of each epoch, we calculate the overall sparsity of the model (Eq. (9)).
Given the total number of epochs, we compute the expected sparsity gain per epoch. If the sparsity gain within an epoch does not meet this requirement, the regularization coefficient is increased; if the model is over-sparse, the coefficient is decreased.
This strategy guarantees that the model meets the desired model sparsity, and that the pruned filters contribute negligibly to the outputs.
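The control loop can be sketched as follows. This is a simplified illustration under stated assumptions, not a transcription of Alg. 1: the expected gain is taken to be the remaining sparsity gap divided by the remaining epochs, the step size `delta` is a hypothetical constant, the coefficient is clamped at zero, and sparsity is measured with a threshold relative to each layer's maximum importance.

```python
def model_sparsity(layers, threshold):
    """Fraction of filters whose importance is below `threshold` times the
    largest importance of their own layer (assumed sparsity measure)."""
    pruned = sum(sum(1 for s in imp if s < threshold * max(imp))
                 for imp in layers)
    total = sum(len(imp) for imp in layers)
    return pruned / total


def update_coefficient(lam, delta, sparsity, prev_sparsity,
                       target, epoch, total_epochs):
    """Return the regularization coefficient for the next epoch.

    sparsity / prev_sparsity: model sparsity after this / the previous epoch.
    target: desired sparsity; epoch is 1-based.
    """
    if sparsity > target:
        # Over-sparse: relax the regularization (clamping is an assumption).
        return max(lam - delta, 0.0)
    remaining = total_epochs - epoch + 1
    expected_gain = (target - prev_sparsity) / remaining
    if sparsity - prev_sparsity < expected_gain:
        # Sparsity grows too slowly: strengthen the regularization.
        return lam + delta
    return lam


layers = [[0.10, 0.01, 0.02, 0.20], [0.5, 20.0, 0.6, 25.0]]
current = model_sparsity(layers, threshold=0.15)
next_lam = update_coefficient(1e-4, 1e-5, sparsity=0.10, prev_sparsity=0.10,
                              target=0.5, epoch=1, total_epochs=10)
```

In a training loop, `model_sparsity` would be evaluated once per epoch and the returned coefficient used to weight the $\ell_1$ term in the following epoch.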
IV Experimental Results
In this section, we first describe the details of our implementation in Sec. IV-A, and report the experimental results on the CIFAR  datasets in Sec. IV-B and the ImageNet  dataset in Sec. IV-D.
IV-A Implementation Details
Our implementation is based on the official training sources of Network Slimming in the PyTorch library (https://github.com/Eric-mingjie/rethinking-network-pruning). We follow the “train, prune, and finetune” pipeline as depicted in Fig. 1.
Datasets and Data Augmentation
For the CIFAR and SVHN datasets, we follow the common practice of data augmentation: zero-padding of 4 pixels on each side of the image and random cropping of a patch. On the ImageNet dataset, we adopt the standard data augmentation strategy as in prior work [20, 21, 22, 50]: resize images so that the shortest edge is 256 pixels and then randomly crop a patch. Besides, we adopt random horizontal flipping on the cropped image for the CIFAR and ImageNet datasets. The input data is normalized by subtracting the channel-wise means and dividing by the channel-wise standard deviations before being fed to the network.
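The padding, cropping, and normalization steps can be sketched in plain Python as below. The function names, the nested-list image layout `[C][H][W]`, and the toy sizes are assumptions for illustration; an actual pipeline would typically rely on a data-loading library instead.

```python
import random


def pad_and_random_crop(image, pad, size, rng=random):
    """Zero-pad every channel by `pad` pixels per side, then take one random
    `size` x `size` crop at the same location in all channels."""
    h, w = len(image[0]), len(image[0][0])
    top = rng.randint(0, h + 2 * pad - size)
    left = rng.randint(0, w + 2 * pad - size)
    out = []
    for channel in image:
        blank = [0.0] * (w + 2 * pad)
        padded = ([list(blank) for _ in range(pad)]
                  + [[0.0] * pad + list(row) + [0.0] * pad for row in channel]
                  + [list(blank) for _ in range(pad)])
        out.append([row[left:left + size] for row in padded[top:top + size]])
    return out


def normalize(image, mean, std):
    """Subtract the per-channel mean and divide by the per-channel std."""
    return [[[(px - m) / s for px in row] for row in channel]
            for channel, m, s in zip(image, mean, std)]


image = [[[1.0, 2.0], [3.0, 4.0]]]  # one channel, 2 x 2 pixels
crop = pad_and_random_crop(image, pad=1, size=2, rng=random.Random(0))
normed = normalize([[[1.0, 3.0]]], mean=[2.0], std=[0.5])
```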
The threshold in Eq. (8) is set to unless otherwise specified, and in all experiments. We use the SGD optimizer with a momentum of and a weight decay of . The initial learning rate is and divided by a factor of at the specified epochs. We train for epochs on the CIFAR datasets and epochs on the SVHN dataset. The learning rate decays at and of the total training epochs. On the ImageNet dataset, we train for epochs and decay the learning rate every epochs.
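A step learning-rate schedule of the kind described above can be sketched as follows. The milestone epochs, base learning rate, and decay factor in the usage lines are hypothetical placeholders, since the concrete values are not reproduced here.

```python
def lr_at(epoch, base_lr, milestones, factor):
    """Learning rate for a 0-based `epoch` under a step schedule: the rate
    is divided by `factor` once for every milestone already reached."""
    n_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr / (factor ** n_decays)


# Hypothetical setting: two decay points in a 160-epoch run.
schedule = [lr_at(e, base_lr=0.1, milestones=[80, 120], factor=10)
            for e in range(160)]
```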
Half-precision Training on ImageNet
We train models on the ImageNet dataset with half-precision (FP16), using the Apex library (https://github.com/NVIDIA/apex), where parameters of batch normalization are represented in FP32 while others in FP16. This allows us to train the ResNet-50 model within hours on RTX Ti GPUs. Despite training with FP16, we do not observe obvious performance degradation in our experiments. For example, as shown in Tab. IV, we achieve a top-1 accuracy of with the Pre-ResNet-50 architecture on the ImageNet dataset, which is very close to that in the original paper  or reported in .
Train, Prune, and Finetune
We adopt the three-stage pipeline, i.e., train, prune, and finetune, as in many previous pruning methods [24, 23, 25, 37, 40, 62] (see Fig. 1). In the experiments, we found that in the first stage, the model sparsity grows rapidly when the learning rate is large. After the learning rate decays, the model sparsity hardly increases unless an extremely large regularization coefficient is reached. Therefore, to effectively promote model sparsity, we keep the learning rate fixed in the first stage, and decay the learning rate normally in the third stage. On the CIFAR datasets, we train for 160 epochs in the first stage, and on the ImageNet dataset, we train for only 40 epochs in the first stage. On both the CIFAR and ImageNet datasets, we finetune for a full episode.
TABLE I: Results on the CIFAR10 dataset (accuracy, %).
| Model | Method | Ratio | Baseline | Finetuned |
|---|---|---|---|---|
| VGG11 | SLM | 0.5 | 92.13 | 91.91 |
| | Ours | | 92.02 | 92.17 |
| VGG16 | SLM | 0.6 | 93.73 | 93.65 |
| | Ours | | 93.57 | 93.70 |
| VGG19 | SLM (from ) | 0.7 | 93.53 | 93.60 |
| | Ours | | 93.66 | 93.53 |
| Res56 | SFP | 0.4 | 93.59 | 92.26 |
| | ASFP | | 93.59 | 92.44 |
| | FPGM | | 93.59 | 92.93 |
| | SLM | | 93.56 | 93.33 |
| | Ours | | 93.73 | 93.86 |
| | SLM | 0.5 | 93.56 | 92.90 |
| | Ours | | 93.73 | 93.62 |
| | SLM | 0.6 | 93.56 | 91.94 |
| | Ours | | 93.73 | 92.68 |
| Res110 | SFP | 0.4 | 93.68 | 93.38 |
| | ASFP | | 93.68 | 93.20 |
| | FPGM | | 93.68 | 93.73 |
| | SLM | | 94.61 | 94.49 |
| | Ours | | 94.43 | 94.75 |
| | SLM | 0.5 | 94.61 | 94.24 |
| | Ours | | 94.43 | 94.52 |
| | SLM | 0.6 | 94.61 | 93.47 |
| | Ours | | 94.43 | 94.57 |
| Res164 | SLM (from ) | 0.4 | 95.04 | 94.77 |
| | Ours | | 94.86 | 95.01 |
| | SLM | 0.5 | 95.04 | 94.52 |
| | Ours | | 94.86 | 94.83 |
| | SLM (from ) | 0.6 | 95.04 | 94.23 |
| | Ours | | 94.86 | 94.53 |
Prune with Short Connections
In the Pre-Act-ResNet architecture, operators are arranged in the “BN, ReLU, and Conv” order. As depicted in Fig. 3, given the input feature maps, we perform a “feature selection” right after the first batch normalization layer (BN1) to filter out less important channels according to the dependency-aware channel importance (Eq. (7)). For the first and second convolutional layers (Conv1 and Conv2), we prune both the input and output dimensions of their kernels. (The pruned channels are represented as the dotted planes in Fig. 3.) For the last convolutional layer (Conv3), we prune only the input dimension of Conv3 to preserve the structure of the residual path. After pruning, the number of channels in the residual path remains unchanged. Note that when computing the model sparsity (Eq. (9)), the “feature selection” is not taken into account because it does not actually prune any filters. For example, in the case of Fig. 3, there are only 2 filters pruned, i.e., the second filter of Conv2 and the first filter of Conv3.
IV-B Results on CIFAR
We first evaluate our method on the CIFAR10 and CIFAR100 datasets. Experiments on the CIFAR datasets are conducted using VGGNets and ResNets of various depths. On the CIFAR datasets, we record the mean and standard deviation over a 10-fold validation. It is worth noting that, as described in Sec. III-B, Network Slimming often results in unstable architectures, whose performance is greatly degraded (see Sec. V-D for details). Therefore, for Network Slimming, we skip the outliers and restart the pipeline if the accuracy is lower than the mean accuracy. Quantitative results on the CIFAR10 and CIFAR100 datasets are summarized in Tab. I and Tab. II, respectively. Additionally, a curve of the classification accuracy vs. the pruning ratio is shown in Fig. 4.
TABLE II: Results on the CIFAR100 dataset (accuracy, %).
| Model | Method | Ratio | Baseline | Finetuned |
|---|---|---|---|---|
| VGG11 | SLM | 0.3 | 69.33 | 66.54 |
| | Ours | | 68.24 | 67.84 |
| VGG16 | SLM | 0.3 | 73.50 | 73.36 |
| | Ours | | 72.16 | 73.59 |
| VGG19 | SLM (from ) | 0.5 | 72.63 | 72.32 |
| | Ours | | 71.19 | 72.48 |
| Res164 | SLM (from ) | 0.4 | 76.80 | 76.22 |
| | Ours | | 76.43 | 77.74 |
| | SLM | 0.6 | 76.80 | 74.17 |
| | Ours | | 76.43 | 76.28 |
We start with the simpler architecture, VGGNet, which is a sequential architecture without skip connections. We find that pruning a large number of filters causes only a minor performance drop. Take VGGNet-19 as an example. On the CIFAR10 dataset, with 70% of the filters pruned, both Network Slimming and our method even bring a slight performance gain. Interestingly, increasing model depth does not always enhance performance. On both the CIFAR10 and CIFAR100 datasets, VGGNet-16 achieves better (or comparable) performance than VGGNet-19. These observations demonstrate that VGGNet is heavily over-parameterized for the CIFAR datasets, and that pruning a proportion of filters has a negligible influence on the performance.
Pruning the ResNet architectures is more complicated because of the residual paths. As described in Sec. IV-A and Fig. 3, we preserve the number of channels in the residual path and only prune filters inside the bottleneck architecture. By pruning the same proportion of filters, our method consistently achieves better results compared with the Network Slimming  baseline.
IV-C Results on SVHN
We then apply the proposed pruning algorithm to the ResNet family on the SVHN dataset, following the same evaluation protocol as in Sec. IV-B. It can be seen from Tab. III that our approach outperforms the state-of-the-art baseline method under various model depths and pruning ratios. Also, Network Slimming often collapses when the pruning ratio is high, e.g., , while our approach is more tolerant of high pruning ratios and still maintains a competitive accuracy. For example, only an accuracy of is sacrificed for of filters being pruned from the ResNet-56 backbone. Furthermore, similar to the circumstances on the CIFAR datasets, pruning a proportion of filters may even bring a performance gain (e.g., when or of filters are pruned), indicating that a moderate pruning ratio can alleviate the over-fitting problem on relatively small datasets, such as CIFAR and SVHN.
TABLE III: Results on the SVHN dataset (accuracy, %).
| Model | Method | Ratio | Baseline | Finetuned |
|---|---|---|---|---|
| Res20 | SLM | 0.2 | 95.85 | 95.82 |
| | Ours | | 95.85 | 96.18 |
| | SLM | 0.4 | 95.85 | 95.77 |
| | Ours | | 95.85 | 96.20 |
| | SLM | 0.6 | 95.85 | 95.66 |
| | Ours | | 95.85 | 96.15 |
| | Ours | | 95.85 | 95.49 |
| Res56 | SLM | 0.2 | 96.87 | 96.62 |
| | Ours | | 96.87 | 97.04 |
| | SLM | 0.4 | 96.87 | 96.56 |
| | Ours | | 96.87 | 97.00 |
| | Ours | | 96.87 | 97.03 |
| | Ours | | 96.87 | 96.77 |
IV-D Results on ImageNet
Here, we evaluate the proposed method on the large-scale and challenging ImageNet benchmark. The results of Network Slimming and our method are obtained from our implementation, while other results come from the original papers. We compare against several recently proposed pruning methods with various criteria, including the weight norm, the norm of batch-norm factors [37, 61], and a data-dependent pruning method. As summarized in Tab. IV, under the same pruning ratios, our method consistently outperforms the Network Slimming baseline, and retains a comparable number of parameters and complexity (FLOPs). Even compared with the data-dependent pruning method, our method still achieves competitive performance.
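For reference, the parameter count and complexity of a single convolutional layer can be estimated with the short sketch below. It counts multiply-accumulate operations and ignores biases; conventions for reporting FLOPs differ between papers, so the function name and the counting convention are assumptions of this illustration, and the layer sizes are hypothetical.

```python
def conv_cost(c_in, c_out, k, h_out, w_out):
    """Parameters and multiply-accumulates of a k x k convolution
    (bias ignored) that produces a c_out x h_out x w_out output."""
    params = c_out * c_in * k * k
    macs = params * h_out * w_out
    return params, macs


# Hypothetical layer before and after removing half of its input and
# output channels.
full_params, full_macs = conv_cost(64, 128, 3, 32, 32)
half_params, half_macs = conv_cost(32, 64, 3, 32, 32)
```

Since both the input and the output width shrink, halving the channels of consecutive layers reduces the cost of the layer in between to roughly a quarter.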
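| Method | Ratio | Top-1 (%) | Params ($\times 10^{7}$) | FLOPs ($\times 10^{9}$) |
|---|---|---|---|---|
| Li et al. | N/A | 72.04 | 1.93 | 2.76 |
| Ye et al. -v1 | N/A | 74.56 | 1.73 | 3.69 |
| Ye et al. -v2 | N/A | 75.27 | 2.36 | 4.47 |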
V Ablation Study
In this section, we conduct several ablation studies to justify our design choice. All the experiments in this section are conducted on the CIFAR100 dataset.
V-A The Effectiveness of Dependency-aware Importance Estimation
In the first ablation study, we verify that our method can more accurately identify less important filters, thus leading to a better compressed architecture. This is evidenced by 1) a smaller performance drop after pruning, and 2) a better final performance after finetuning.
With the same pruning ratio, e.g., , we assume that the importance estimation is more accurate if the pruned model (without finetuning) achieves higher performance on the validation set. Thus, the accuracy of importance estimation can be measured by the performance of pruned networks under the same pruning ratio. In this experiment, we compare the following three strategies: (a) Network Slimming  which measures filter importance by the batch-norm scaling factors only; (b) the dependency-aware importance estimation in Eq. (7); and (c) the dependency-aware importance estimation + automatic regularization control.
Firstly, we conduct an illustrative experiment on the VGGNet-16 backbone with a pruning ratio of . As shown in Fig. 5, the strategy (c) obtains a compressed model with the desired sparsity and achieves the best accuracy after finetuning. Then, we quantitatively compare these three strategies on the VGGNet-16 and ResNet-56 backbones. The statistics over a 10-fold validation are reported in Tab. V.
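TABLE V: Accuracy (%) before and after finetuning on the CIFAR100 dataset.
| Model | Method | Ratio | Pruned | Finetuned |
|---|---|---|---|---|
| VGG16 | SLM | 0.3 | 52.19 | 73.36 |
| | | 0.3 | 61.19 | 73.57 |
| | | 0.3 | 72.83 | 73.59 |
| Res56 | SLM | 0.5 | 1.41 | 71.13 |
| | | 0.5 | 5.29 | 73.62 |
| | | 0.5 | 55.29 | 74.53 |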
The results in Tab. V reveal that 1) the dependency-aware importance estimation measures the filter importance more accurately, as it achieves a much higher performance before finetuning than Network Slimming, and 2) the automatic regularization control helps derive a model with the desired sparsity and search for a better architecture, as evidenced by the favorable performance after finetuning.
V-B Fixed vs. Adjustable Regularization Coefficient
There are two alternative approaches that can help achieve the desired model sparsity: (a) fix the threshold and adjust the regularization coefficient during training; and (b) fix the regularization coefficient and search for a suitable threshold after training.
We compare these two alternatives on the ResNet-56 backbone with a pruning ratio of , which means of the filters will be pruned. For strategy (b), the regularization coefficient is fixed to , as suggested by .
As shown in Tab. VI, under the same pruning ratio, strategy (a) performs favorably against strategy (b) in terms of the performance before and after finetuning. This justifies our design of dynamically adjusting the regularization coefficient during training.
V-C Pruning as Architecture Search
As pointed out in Sec. III-B, Network Slimming  may lead to unreasonable compressed architectures as too many filters can be pruned in a single layer. In this experiment, we verify that our method can derive better compressed architectures. To test the difference of the pruned architectures, we re-initialize the parameters of pruned models, and then train the pruned models for a full episode as in the standard pipeline. Note that we are essentially training the compressed architecture from scratch under the “scratch-E” setting in . The results in Tab. VII indicate that our method derives better compressed architectures, as evidenced by the superior performance when training from scratch.
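TABLE VII: Accuracy (%) of the pruned architectures when finetuned and when trained from scratch.
| Model | Method | Baseline | Finetuned | Scratch |
|---|---|---|---|---|
| Res164 | SLM | 76.80 | 74.17 | 75.05 |
| | Ours | 76.43 | 76.43 | 76.41 |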
V-D Pruning Stability
Network Slimming selects filters to be pruned by ranking channel importance of different layers across the entire network, leading to unstable architectures. We empirically verify the claim that with a large pruning ratio, our method can still achieve promising results, while Network Slimming leads to collapsed models with a high probability.
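| Dataset | Method | Ratio | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
|---|---|---|---|---|---|---|---|
| CIFAR10 | SLM | 0.7 | 10.00 / 0 | 10.00 / 0 | 10.00 / 0 | 10.00 / 0 | 10.00 / 0 |
| | Ours | | 93.93 / 24 | 93.66 / 25 | 93.94 / 27 | 93.70 / 23 | 93.89 / 27 |
| CIFAR100 | SLM | 0.4 | 1.00 / 0 | 1.00 / 1 | 1.00 / 0 | 1.00 / 0 | 1.00 / 0 |
| | Ours | | 73.24 / 29 | 73.60 / 37 | 73.92 / 35 | 73.47 / 37 | 73.71 / 37 |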
Here, we design two experiments. In the first experiment, we give an intuitive illustration of the compressed network architectures induced by Network Slimming and our method. We use the VGGNet-16 backbone with a pruning ratio of . The filter distributions of compressed architectures are shown in Fig. 6.
In the second experiment, we conduct a 5-fold validation on the CIFAR10 and CIFAR100 datasets, again using the VGGNet-16 backbone. The results in Tab. VIII indicate that under a relatively high pruning ratio, our method can still achieve high performance while Network Slimming collapses in all runs.
In this paper, we propose a principled criterion to identify the unimportant filters with consideration of the inter-layer dependency. Based on this, we prune filters according to the local channel importance, and introduce an automatic-regularization-control mechanism to dynamically adjust the coefficient of sparsity regularization. In the end, our method is able to compress state-of-the-art neural networks with a minimal accuracy drop. Comprehensive experimental results on the CIFAR, SVHN, and ImageNet datasets demonstrate that our approach performs favorably against the Network Slimming baseline and achieves competitive performance among the concurrent data-dependent and data-independent pruning approaches, indicating the essential role of the inter-layer dependency in principled filter pruning algorithms.
This research was supported by Major Project for New Generation of AI under Grant No. 2018AAA0100400, NSFC (61922046), the national youth talent support program, and Tianjin Natural Science Foundation (18ZXZNGX00110).
[1] (2017) Structured pruning of deep convolutional neural networks. ACM Journal on Emerging Technologies in Computing Systems (JETC) 13 (3), pp. 32.
[2] (2019) ProxylessNAS: direct neural architecture search on target task and hardware. In International Conference on Learning Representations.
[3] (2018) “Learning-compression” algorithms for neural net pruning. pp. 8532–8541.
[4] (2016) Bi-level semantic representation analysis for multimedia event detection. IEEE Transactions on Cybernetics 47 (5), pp. 1180–1197.
[5] (2015) Compressing neural networks with the hashing trick. In International Conference on Machine Learning, pp. 2285–2294.
[6] (2017) Quantized CNN: a unified approach to accelerate and compress convolutional networks. IEEE Transactions on Neural Networks and Learning Systems 29 (10), pp. 4730–4743.
[7] (2016) Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations.
[8] (2016) Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830.
[9] (2019) ChamNet: towards efficient network design through platform-aware model adaptation. In International Conference on Computer Vision and Pattern Recognition, pp. 11398–11407.
[10] (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In Neural Information Processing Systems, pp. 1269–1277.
[11] (2015) Fully connected cascade artificial neural network architecture for attention deficit hyperactivity disorder classification from functional magnetic resonance imaging data. IEEE Transactions on Cybernetics 45 (12), pp. 2668–2679.
[12] (2017) Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Neural Information Processing Systems, pp. 4857–4867.
[13] (2016) Stacked convolutional denoising auto-encoders for feature representation. IEEE Transactions on Cybernetics 47 (4), pp. 1017–1027.
[14] (2008) Robust statistics on Riemannian manifolds via the geometric median. In International Conference on Computer Vision and Pattern Recognition, pp. 1–8.
[15] (2015) Fast R-CNN. In International Conference on Computer Vision and Pattern Recognition, pp. 1440–1448.
[16] (2018) MorphNet: fast & simple resource-constrained structure learning of deep networks. In International Conference on Computer Vision and Pattern Recognition, pp. 1586–1595.
[17] (2016) Dynamic network surgery for efficient DNNs. In Neural Information Processing Systems, pp. 1379–1387.
[18] (2015) Learning both weights and connections for efficient neural network. In Neural Information Processing Systems, pp. 1135–1143.
[19] (1993) Second order derivatives for network pruning: optimal brain surgeon. In Neural Information Processing Systems, pp. 164–171.
[20] (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In International Conference on Computer Vision, pp. 1026–1034.
[21] (2016) Deep residual learning for image recognition. In International Conference on Computer Vision and Pattern Recognition, pp. 770–778.
-  (2016) Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Cited by: §IV-A, §IV-A, §IV-A.
-  (2019) Asymptotic soft filter pruning for deep convolutional neural networks. IEEE transactions on cybernetics. Cited by: §I, §II-A, §III-B, §IV-A, TABLE I.
-  Soft filter pruning for accelerating deep convolutional neural networks. In International Joint Conference on Artificial Intelligence (IJCAI), pp. 2234–2240. Cited by: §I, §III-B, §IV-A.
-  (2019) Filter pruning via geometric median for deep convolutional neural networks acceleration. In International Conference on Computer Vision and Pattern Recognition, pp. 4340–4349. Cited by: §II-A, §II-A, §II-A, §III-B, §IV-A, TABLE I.
-  (2017) Channel pruning for accelerating very deep neural networks. In International Conference on Computer Vision, pp. 1398–1406. Cited by: §II-A, §II-A.
-  (2017) Densely connected convolutional networks. In International Conference on Computer Vision and Pattern Recognition, pp. 4700–4708. Cited by: §II-B.
-  Ltnn: a layerwise tensorized compression of multilayer neural network. IEEE transactions on neural networks and learning systems 30 (5), pp. 1497–1511. Cited by: §II-C.
-  (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, Cited by: §I, §III-A.
-  (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §IV-A, §IV.
-  (2012) Imagenet classification with deep convolutional neural networks. In Neural Information Processing Systems, pp. 1097–1105. Cited by: §I.
-  (1990) Optimal brain damage. In Neural Information Processing Systems, pp. 598–605. Cited by: §II-A.
-  (2017) Pruning filters for efficient convnets. In International Conference on Learning Representations, Cited by: §I, §I, §II-A, §II-A, §II-A, §IV-D, TABLE IV.
-  (2019) Data-driven neuron allocation for scale aggregation networks. In International Conference on Computer Vision and Pattern Recognition, pp. 11526–11534. Cited by: §III-B.
-  (2019) Toward compact convnets via structure-sparsity regularized filter pruning. IEEE transactions on neural networks and learning systems. Cited by: TABLE IV.
-  (2018) Progressive neural architecture search. In European Conference on Computer Vision, pp. 19–34. Cited by: §II-B.
-  (2017) Learning efficient convolutional networks through network slimming. In International Conference on Computer Vision, pp. 2736–2744. Cited by: §I, §I, §I, §II-A, §II-A, §II-A, §III-B, §III-C, §III-C, §IV-A, §IV-A, §IV-B, §IV-B, §IV-C, §IV-D, TABLE I, TABLE II, TABLE III, TABLE IV, Fig. 6, Fig. 6, §V-A, §V-B, §V-C, §V-D, TABLE VII, §VI.
-  (2019) Rethinking the value of network pruning. In International Conference on Learning Representations, Cited by: TABLE I, TABLE II, §V-C.
-  (2015) Fully convolutional networks for semantic segmentation. In International Conference on Computer Vision and Pattern Recognition, pp. 3431–3440. Cited by: §I.
-  (2017) Thinet: a filter level pruning method for deep neural network compression. In International Conference on Computer Vision, pp. 5058–5066. Cited by: §IV-A, TABLE IV.
-  (2017) An adaptive semisupervised feature analysis for video semantic recognition. IEEE transactions on cybernetics 48 (2), pp. 648–660. Cited by: §I.
-  (2013) Rectifier nonlinearities improve neural network acoustic models. In International Conference on Machine Learning, Cited by: footnote 2.
-  (2019) Importance estimation for neural network pruning. In International Conference on Computer Vision and Pattern Recognition, pp. 11264–11272. Cited by: §II-A, §II-A, §IV-A, §IV-D, TABLE IV.
-  (2010) Rectified linear units improve restricted boltzmann machines. In International Conference on Machine Learning, pp. 807–814. Cited by: footnote 2.
-  (2011) Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, Cited by: §IV-A.
-  (2018) Efficient neural architecture search via parameter sharing. In International Conference on Machine Learning, Cited by: §II-B.
-  (2015) Channel-level acceleration of deep face representations. IEEE Access 3, pp. 2163–2175. Cited by: §II-A.
-  (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Cited by: §II-C.
-  (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §IV-A, §IV-D, §IV.
-  (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, Cited by: §II-B, §IV-A, §IV-A.
-  (2015) Structured transforms for small-footprint deep learning. In Neural Information Processing Systems, pp. 3088–3096. Cited by: §II-C.
-  (2017) Training sparse neural networks. In CVPRW, pp. 138–145. Cited by: §II-A.
-  (2019) PyTorch: an imperative style, high-performance deep learning library. In Neural Information Processing Systems, Cited by: §IV-A.
-  (2014) Deep learning face representation by joint identification-verification. In Advances in neural information processing systems, pp. 1988–1996. Cited by: §I.
-  (2019) Mnasnet: platform-aware neural architecture search for mobile. In International Conference on Computer Vision and Pattern Recognition, pp. 2820–2828. Cited by: §II-B.
-  (2016) Cross-modal retrieval with cnn visual features: a new baseline. IEEE transactions on cybernetics 47 (2), pp. 449–460. Cited by: §I.
-  (2016) Learning structured sparsity in deep neural networks. In Neural Information Processing Systems, pp. 2074–2082. Cited by: §I.
-  (2019) FBNet: hardware-aware efficient convnet design via differentiable neural architecture search. In International Conference on Computer Vision and Pattern Recognition, pp. 10734–10742. Cited by: §II-B.
-  (2016) Quantized convolutional neural networks for mobile devices. In International Conference on Computer Vision and Pattern Recognition, pp. 4820–4828. Cited by: §II-C.
-  Multitask spectral clustering by exploring intertask correlation. IEEE transactions on cybernetics 45 (5), pp. 1083–1094. Cited by: §I.
-  (2018) Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. In International Conference on Learning Representations, Cited by: §I, §IV-D, TABLE IV.
-  (2018) Accelerating convolutional neural networks by removing interspatial and interkernel redundancies. IEEE transactions on cybernetics 50 (2), pp. 452–464. Cited by: §II-A, §IV-A.
-  Visual tracking with convolutional random vector functional link network. IEEE transactions on cybernetics 47 (10), pp. 3243–3253. Cited by: §I.
-  (2015) Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (10), pp. 1943–1955. Cited by: §II-C.
-  (2015) Compact deep neural networks for device based image classification. In 2015 IEEE International Conference on Multimedia & Expo Workshops, ICME Workshops 2015, Turin, Italy, June 29 - July 3, 2015, pp. 1–6. Cited by: §II-A, §II-A.
-  (2019) A knee-guided evolutionary algorithm for compressing deep neural networks. IEEE transactions on cybernetics. Cited by: §II-A.
-  (2019) Accelerate cnn via recursive bayesian pruning. In International Conference on Computer Vision, Cited by: §II-A.
-  Neural architecture search with reinforcement learning. In International Conference on Learning Representations, Cited by: §II-B.