I Introduction
Convolutional neural networks (CNNs) have been widely applied in a variety of computer vision applications, including classification
[25, 50, 13], object detection [5, 44] and semantic segmentation [3, 37]. Scaling up the size of models has been a main driver of the success of deep learning. For instance, the depth of ImageNet Classification Challenge
[45] winner models has evolved from 8 layers in AlexNet [25] to over 100 layers in ResNet [13]. Empirically, larger networks can exhibit better performance, but they are also known to be heavily over-parameterized [57]. However, large CNNs can be incompatible with deployment in real-world applications, as they suffer severely from massive computational and storage overhead. Therefore, it is necessary to obtain compact networks with efficient inference.
Pruning [10] is a common approach to slim neural networks via removing the redundant weights, filters and layers. Weight pruning can achieve a higher compression ratio but leads to unstructured sparsity of CNNs, which makes it hard to leverage the efficient Basic Linear Algebra Subprograms (BLAS) libraries [39]. Therefore, structured sparsity pruning becomes more attractive since it can reduce the model size as well as accelerate the inference speed.
Among all structured sparsity pruning approaches, sparsity learning (SL) [59, 53, 36, 52, 38, 46, 7, 21, 54, 23, 56, 30, 32, 51, 28, 1, 34], also called sparsity regularization, is a popular and powerful direction these days. These works introduce sparsity regularization on structures during the training phase. Training with structured regularization can transfer significant features to a small number of filters and automatically yields a structure-sparse model [53]. However, in these SL approaches, the sparsity regularization imposed on all filters is the same, without considering the specific characteristics of different parts of the model. Theoretically and empirically, two main problems arise from such indiscriminate regularization. First, the filters that are important for prediction are also affected by the equal regularization, so the final prediction precision after SL may drop by a large margin, and sometimes it is difficult to restore the performance after pruning. Second, different filters consume different computational resources. According to our analysis and experimental observations, traditional SL tends to zero out many light filters while retaining the computation-heavy ones. Therefore, we can hardly obtain the optimal structure-sparse networks using indiscriminate SL.
In this paper, we propose a novel SL form, namely Saliency-Adaptive Sparsity Learning (SASL), to learn better compact neural networks. The saliency of a convolutional filter is considered from two aspects: 1) the importance for prediction performance, which is defined as the change in the loss function induced by removing this filter from the neural network; 2) the consumed resources, especially the computational cost. During SL, the regularization for every filter is adjusted adaptively according to its saliency, and the calculation of saliency incurs only minimal overhead. In the pruning phase, saliency also proves to be a better criterion than existing ones, and the proposed hard sample mining strategy can further improve its effectiveness and efficiency. In brief, the main contributions of this paper are:

We propose a novel form of sparsity learning, Saliency-Adaptive Sparsity Learning (SASL). Compared with traditional SL, our optimized form can better preserve the performance of models and reduce more computation for inference without much overhead.

In the pruning stage, we observe and analyze that saliency is a better criterion than previous ones. Since saliency is data-dependent, a hard sample mining strategy is proposed to further enhance its effectiveness and efficiency.

Extensive experiments on two benchmark datasets for various CNN architectures demonstrate the effectiveness and efficiency of the proposed approach. Typically, on the ILSVRC-2012 dataset, 49.7% of the FLOPs of ResNet-50 are reduced with only 0.05% top-5 accuracy degradation, which significantly outperforms state-of-the-art methods.
The remainder of this paper is organized as follows: we review work related to network pruning and sparsity learning in Section II. Section III introduces the motivation of this paper and detail of proposed SASL approach. The experimental results and corresponding analyses are presented in Section IV. Finally, we conclude our work in Section V. The code of SASL will be shared on our website: http://staff.ustc.edu.cn/~chenzhibo/resources.html for public research.
II Related Work
Network pruning has been a long-studied topic since the very early stages of neural networks. In this section, we review the significant developments of network pruning.
Weight Pruning. In the 1990s, Optimal Brain Damage [26] and Optimal Brain Surgeon [11] were proposed, in which unimportant weights were removed based on the Hessian of the loss function. In recent work, Han et al. [10] brought back this idea by pruning the weights whose absolute values are smaller than a preset threshold. Molchanov et al. [40] proposed Variational Dropout to prune redundant weights. Moreover, [2] formulates weight pruning as an optimization problem: finding the weights that minimize the loss while satisfying a pruning cost condition. Kang [22] proposed a weight pruning scheme that takes the accelerator into account. Frankle et al. [9] developed the lottery ticket hypothesis, based on which using only the weights of winning tickets can deliver comparable or even better performance than the original model. But finding the winning ticket initialization is complex and computationally expensive. To address this, Morcos et al. [42] proposed a generalization method that allows reusing the same winning tickets across various datasets. Ding et al. [8] focused on the optimizer and designed a novel momentum SGD, which shows a superior capability to find better winning tickets. However, the unstructured nature of the resulting sparsity means that weight pruning only yields effective compression and cannot speed up inference without dedicated hardware or libraries.
Structured Pruning. Therefore, much attention has been focused on structured pruning to accelerate the inference of neural networks. Filter pruning, also called channel pruning, is the most common and flexible form of structured pruning, since the filter is the smallest structural unit.
Many heuristic methods have been proposed to prune filters based on hand-crafted features. For example, based on the smaller-norm-less-important belief, Li et al. [29] proposed to prune filters according to the filter weight norm. The average percentage of zeros (APoZ) in the output was used in [18] to measure the importance of filters. He et al. [17] pruned filters to minimize the feature reconstruction error of the next layer. Similarly, [39] pruned redundant filters by estimating the statistics of the next layer. Yu et al. [55] implemented feature ranking to obtain importance scores and propagated them throughout the network to find the filters to prune. Chin et al. [4] considered pruning as a ranking problem and compensated the layer-wise approximation error to improve the performance of previous heuristic metrics. [16] used reinforcement learning to decide the pruning ratio of each layer. He et al. [14] proposed the concept of soft pruning, which allows pruned filters to recover during training. Furthermore, Huang et al. [20] trained pruning agents to remove structures in a data-driven way. [41] used a Taylor expansion to estimate the importance of each filter and then iteratively pruned the least important filters. He et al. [15] proposed to prune redundant filters via the geometric median. Zhao et al. [58] proposed a variational Bayesian pruning scheme based on the distribution of channel saliency. Liu et al. [35] proposed MetaPruning, which combines meta-learning with an evolutionary algorithm to provide an efficient automatic channel pruning approach. These methods directly prune filters of unsparsified models, which can erroneously abandon useful features and result in large accuracy degradation.
Sparsity Learning. Recent approaches [59, 53, 36, 52, 38, 46, 7, 21, 54, 23, 56, 30, 32, 51, 28, 1, 34] have therefore adopted sparsity learning to introduce structured sparsity in the training phase. Zhang et al. [59] incorporated sparsity constraints into the loss function to reduce the number of filters. Similarly, Wen et al. [53] utilized Group Lasso to automatically obtain structured sparsity during training. Moreover, Liu et al. [36] proposed network slimming, which applies an $\ell_1$-norm penalty on the channel scaling factors; it can reuse the parameters in Batch Normalization layers, so there is little training overhead. After SL, filters with small scaling factors are pruned. [21] extended this idea and utilized scaling factors for coarser structures beyond filters, such as layers and branches. Lin et al. [34] further used generative adversarial learning to pick the structures for pruning. In [28], a neuron-budget-aware sparsity learning method is proposed to avoid trial-and-error. Compared with direct pruning, these methods obtain structure-sparse neural networks in the training stage. As a consequence, redundant filters can be removed with less accuracy decline. However, the above methods impose the same sparsity regularization on all filters indiscriminately, which is the critical problem addressed in this paper. In the most recent work, Yun et al. [56] also recognized and tried to tackle this problem with the proposed Trimmed $\ell_1$ regularizer, which leaves the filters with the largest norms free from regularization. But this optimization method is still quite simple, and as discussed later, the Trimmed $\ell_1$ regularizer can be seen as a special case of our work.

III Proposed Approach
III-A Motivation
We start by showing and analyzing the flaws of previous indiscriminate sparsity learning. Typically, we can formulate the optimization objective of sparsity learning as:
$\min_{W} \; l(\mathcal{D}, W) + \lambda \sum_{f} \mathcal{R}(w_f) \qquad (1)$

Here $W$ denotes the overall trainable weights of the current CNN, and $\mathcal{D}$ is the training set. $l(\cdot)$ is the original loss function of the neural network, such as the cross entropy, and $\mathcal{R}(w_f)$ is the sparsity regularization on filter $f$. Generally, the form of $\mathcal{R}(\cdot)$ can be the Group Lasso or an $\ell_1$-norm penalty on the structure-corresponding scaling factors, which is widely used to achieve sparsity. $\lambda$ is the hyper-parameter that balances the original loss and the sparsity regularization. Note that all filters share the same parameter $\lambda$ indiscriminately.
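As a concrete illustration, the following is a minimal sketch of how this indiscriminate objective is typically realized under the scaling-factor scheme of network slimming [36], with a single $\lambda$ applied to the $\ell_1$ norm of every BN scaling factor; the function name and default value are illustrative only, not the exact settings used in this paper.

import torch
import torch.nn as nn

def indiscriminate_sl_loss(model, criterion, inputs, targets, lam=1e-4):
    # Task loss plus lam * sum_f |gamma_f| over all BN scaling factors (cf. Equation (1)).
    task_loss = criterion(model(inputs), targets)
    reg = sum(bn.weight.abs().sum()
              for bn in model.modules() if isinstance(bn, nn.BatchNorm2d))
    return task_loss + lam * reg  # the same lam for every filter, indiscriminately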
This is a coarse form of sparsity learning that may lead to critical problems. The sparsity is achieved without guidance, and we show the drawbacks from two aspects in Sections III-A1 and III-A2. Actually, we can integrate some prior information into the sparsity learning, so that better structure-sparse neural networks can be obtained.
III-A1 Importance of Filters
Different filters in a neural network are of different importance. As in [41], in this paper we define the importance of filter $f$ as the error induced by its removal. Under an i.i.d. assumption, this error can be measured as the squared difference of prediction losses with and without filter $f$:
$\mathcal{I}_f = \big( l(\mathcal{D}, W) - l(\mathcal{D}, W \mid w_f = 0) \big)^2 \qquad (2)$
For a given model $\mathcal{M}$, we denote the corresponding optimal pruned model as $\mathcal{M}^{*}$. In the transition from $\mathcal{M}$ to $\mathcal{M}^{*}$, the probability of filter $f$ being pruned is denoted as $p_f$. Intuitively, the less important a filter is, the more likely it is to be pruned. So we can assume that the relationship between $\mathcal{I}_f$ and $p_f$ conforms to an inverse correlation function, as Fig. 1 shows.

Here we simply divide all filters of a model into three categories. For the least important filters, pruning them directly is not risky. On the contrary, the most significant filters are essential for the prediction precision, and any impact on them will incur a performance decline of the model. The role of sparsity learning is to help identify the filters in the middle part of Fig. 1.
One of the critical issues of indiscriminate sparsity learning is that the difference in importance among filters is disregarded. Regularization on the important filters can lead to a massive accuracy drop, and sometimes this drop cannot be recovered. Moreover, the regularization imposed on important filters deteriorates the representational capacity of the current model, so the sparsity learning fails to maximally identify the redundant filters.
III-A2 Computational Resources
Most previous approaches equate the goal of network pruning, i.e., reducing the consumed resources, with removing more structures, such as convolutional filters. However, without considering the computational resources (or memory footprint) consumed by different filters, these two objectives point in different directions, and sometimes the gap cannot be ignored. Specifically, convolutional filters in different layers of one model cost different resources. Basically, the cost depends on three factors: 1. the input feature map size; 2. the number of input feature maps; 3. the filter size. Without guidance, traditional sparsity learning such as the $\ell_1$-norm scheme can only zero out more filters rather than reduce more computational complexity.
Here we conduct an experiment to show the inconsistency of indiscriminate sparsity learning. We implement network slimming [36] for VGGNet on the CIFAR-10 dataset [24]. When planning to prune 70% of the filters, we record the distribution of the filters to be pruned, i.e., those with the smallest scaling factors. All these scaling factors are below a very small threshold, so we can assume that these filters have already been sparsified. Then we display the normalized computational complexity and the sparsity ratio of the filters of all layers in Fig. 2.
We can observe that the traditional sparsity learning algorithm tends to zero out more light filters, while most of the computation-heavy filters (see Fig. 2) are retained. Obviously, indiscriminate sparsity learning cannot obtain the optimal structure-sparse networks in terms of complexity reduction.
III-B Saliency Estimation
Therefore, to solve the critical problems arising from traditional sparsity learning, we propose to impose adaptive regularization on different filters discriminatively. A new attribute, saliency, is introduced to evaluate filters and then guide the regularization distribution. It is considered from two aspects: the importance for the final prediction and the consumed computational resources.
First, we need to estimate the importance of all filters during sparsity learning. The most precise evaluation of importance is given by Equation 2. However, it is extremely computationally expensive, since it requires evaluating one version of the neural network for each pruned filter. A way to avoid this is to approximate $\mathcal{I}_f$ in the vicinity of $W$ using a second-order Taylor expansion, as Optimal Brain Damage [26] does:
$\mathcal{I}_f \approx \big( g_f^{\top} w_f - \tfrac{1}{2}\, w_f^{\top} H_f\, w_f \big)^2 \qquad (3)$
Here $g_f$ and $H_f$ are the gradient and the Hessian with respect to filter $f$, respectively. However, computing Hessian matrices is also computationally expensive, especially for large networks. So we adopt a more compact approximation, i.e., the first-order expansion, as [41] does. The importance can then be calculated as:
$\mathcal{I}_f \approx \big( g_f^{\top} w_f \big)^2 \qquad (4)$
In this form, calculating the importance does not bring much computational overhead, since $g_f$ is already known from the back-propagation during training.
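As a minimal sketch (assuming the BN scaling-factor parameterization introduced in Section III-C), Equation (4) can be evaluated per filter from the gradients that back-propagation has already produced; the function name is ours.

import torch
import torch.nn as nn

@torch.no_grad()
def first_order_importance(model):
    # I_f ≈ (g_f * w_f)^2 per filter (Equation (4)), evaluated on BN scaling factors.
    # Assumes loss.backward() has just been called, so .grad holds the current gradients.
    scores = {}
    for name, bn in model.named_modules():
        if isinstance(bn, nn.BatchNorm2d) and bn.weight.grad is not None:
            scores[name] = (bn.weight.grad * bn.weight).pow(2)  # one score per filter
    return scores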
Then we need to estimate the computational resources $\mathcal{C}_f$ consumed by different filters. We denote the three influential factors, namely the input feature map size (considering the padding pattern and stride), the number of input feature maps and the filter size, as $S_f$, $N_f$ and $K_f$, respectively. The normalized computational resources of filter $f$ can then be calculated as:

$\mathcal{C}_f = \dfrac{S_f \cdot N_f \cdot K_f}{\sum_{f'} S_{f'} \cdot N_{f'} \cdot K_{f'}} \qquad (5)$
Note that the calculation of $\mathcal{C}_f$ should be dynamic in both sparsity learning and pruning, since the number of valid feature maps can decrease. During sparsity learning, the sparsified filters, whose scaling factors have fallen below the sparsity threshold, are excluded. In the pruning phase, the pruned filters are not counted.
Finally, we can calculate the saliency of filter $f$ as:
$\mathcal{S}_f = \dfrac{\mathcal{I}_f}{\mathcal{C}_f} \qquad (6)$
In this definition, saliency can be understood as the average prediction gain per unit of computational cost. We will show in the following sections that saliency is very effective in both sparsity learning and pruning.
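The two quantities above can be combined in a few lines; the sketch below assumes per-filter bookkeeping of the output spatial size, input channels and kernel size, and all helper names are ours rather than the paper's.

def filter_cost(out_h, out_w, in_channels, k_h, k_w):
    # Raw computational cost of one filter, built from the three factors of Equation (5).
    return out_h * out_w * in_channels * k_h * k_w

def filter_saliency(importance, raw_costs):
    # Saliency (Equation (6)): importance divided by the normalized cost.
    # `importance` and `raw_costs` are dicts keyed by a filter identifier.
    total = float(sum(raw_costs.values()))
    return {f: importance[f] / (raw_costs[f] / total) for f in importance}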
III-C Adaptive Sparsity Regularization
Based on the saliency estimation, we can adaptively set the regularization strength according to the characteristics of each filter. The indiscriminate form of Equation 1 now becomes:
$\min_{W} \; l(\mathcal{D}, W) + \sum_{f} \lambda_f\, \mathcal{R}(w_f) \qquad (7)$
Note that the value of the regularization factor $\lambda_f$ is now dependent on the filter $f$. In this paper, we implement this idea based on the scaling-factor scheme as network slimming [36] does, but it can easily be generalized to other kinds of sparsity learning. In the scaling-factor scheme, for each filter (including those in convolutional layers and fully-connected layers), a scaling factor is introduced and multiplied to the output of the corresponding filter. Then, during sparsity learning, the regularization term (i.e., $\mathcal{R}(\cdot)$ in Equation (7)) is imposed on these scaling factors. These scaling factors can be seen as agents that identify the filters.
Since the Batch Normalization (BN) layer has been adopted by most modern CNNs, we can reuse the parameters in BN as the scaling factors. Typically, a BN layer performs the following transformation:
$\hat{z} = \dfrac{z_{\text{in}} - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, \qquad z_{\text{out}} = \gamma\, \hat{z} + \beta \qquad (8)$
Here $z_{\text{in}}$ and $z_{\text{out}}$ are the input and output features of the BN layer, while $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}$ are the mean and standard deviation of the input features over the current mini-batch $\mathcal{B}$. $\gamma$ and $\beta$ are the trainable affine transformation parameters, i.e., scale and shift. Therefore, we can directly leverage the parameters $\gamma$ in BN as the scaling factors, since they perform the same function. In this way, we do not introduce any additional parameters.

Due to the distinct saliency distributions of different models, directly using the magnitude of saliency to guide the regularization distribution is not general enough. So we propose to use saliency to sort all filters, and then, based on the ranking of filters, a hierarchy scheme is adopted to adaptively set the sparsity regularization. Typically, for the filters of the most significant class, no regularization is imposed, and for the least significant filters, we distribute the strongest regularization penalty. A dedicated design of the hierarchy classification and of the regularization strengths could lead to better results, but it would also require time-consuming manual work. So we simply adopt a five-class hierarchy scheme, with the corresponding regularization multiplying factors increasing monotonically from zero for the most significant class to the largest value for the least significant class. We will show in the experiments that even such a simple design can lead to excellent results.
Note that the saliency estimation and ranking proceed along with the sparsity learning, which means that filters rated as significant at the beginning might be rated as useless later during training. Therefore, the regularization distribution is always dynamic: it can precisely detect the current state and take the appropriate action. The detailed algorithm of SASL is summarized in Algorithm 1.
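To make the procedure of Algorithm 1 concrete, the following is a condensed, hypothetical sketch of one SASL training step under the BN scaling-factor scheme; the five-class multipliers, the base value and all helper names are illustrative choices, not the exact settings of the paper.

import torch
import torch.nn as nn

def sasl_step(model, criterion, optimizer, inputs, targets, saliency_fn,
              base_lam=1e-4, multipliers=(2.0, 1.5, 1.0, 0.5, 0.0)):
    # One training step: task gradients, then saliency-adaptive L1 sub-gradients on BN gammas.
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()  # also provides the gradients needed for importance estimation

    bns = [m for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    scores = saliency_fn(model)          # list of per-filter saliency tensors, one per BN layer
    flat = torch.cat(scores)
    ranks = flat.argsort().argsort().float() / max(len(flat) - 1, 1)  # 0 = least salient

    offset = 0
    for bn in bns:
        n = bn.weight.numel()
        cls = (ranks[offset:offset + n] * 5).long().clamp(max=4)      # five-class hierarchy
        lam = base_lam * torch.tensor(multipliers, device=bn.weight.device)[cls]
        bn.weight.grad.add_(lam * bn.weight.sign())                   # adaptive L1 sub-gradient
        offset += n
    optimizer.step()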
III-D Iterative Pruning with Hard Sample Mining
After sparsity learning, an effective criterion is needed to discard filters. Most previous sparsity learning approaches, such as [36], prune filters based on the energy term, i.e., the $\ell_1$ norm of the scaling factor, following the "smaller-norm-less-important" belief. However, this criterion is not an excellent one, since energy cannot fully represent importance (see Equation 4: importance also takes the gradient into consideration) and the consumed resources are not considered at all. So we propose to use saliency as the criterion for pruning; several of its advantages are listed as follows:

Saliency is effective for evaluating filters, from both the importance and the resource aspects.

It is globally consistent throughout the whole network, so a sensitivity analysis for each layer is not needed.

This method can be applied to any layer in the network, including traditional convolutions, skip connections and fully-connected layers.

Although saliency evaluates the fine-grained structure, i.e., the filter, pruning by saliency can also automatically remove coarser structures, such as a residual block or a branch of a multi-branch module, if the whole structure is deemed redundant.
The overall proposed pruning procedure is illustrated in Fig. 3. Compared with single-pass pruning, an iterative pruning and fine-tuning strategy is adopted to achieve better results, since the estimates of importance and resources keep changing during the pruning process. As a data-dependent metric, saliency is sensitive to the input data used. One main drawback is the potentially intensive computation, because all training data would be used for saliency estimation during pruning, especially when the training set is huge, such as ILSVRC-2012; this is even more pronounced for our multi-pass pruning scheme. However, we do not need to use the full training set for saliency estimation. Inspired by OHEM [47], we propose a hard sample mining approach to address this. In detail, before pruning, we calculate the training loss of each sample. Then we pick the samples with the top 30% losses, which are defined as the hard samples. In the pruning phase, we only use the hard samples for saliency estimation. The extra computation of saliency estimation is dramatically reduced, while the pruning effect can be even better than with the original scheme that uses the whole training set. We analyze this in the experiment section.
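A minimal sketch of the hard-sample selection is given below, assuming a standard PyTorch classification setup with the model already on the target device; the 30% ratio follows the text, while everything else (names, batch size) is illustrative.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

@torch.no_grad()
def mine_hard_samples(model, dataset, ratio=0.3, batch_size=256, device="cuda"):
    # Return the indices of the `ratio` fraction of training samples with the largest loss;
    # only these hard samples are then used for saliency estimation during pruning.
    model.eval()
    criterion = nn.CrossEntropyLoss(reduction="none")  # keep one loss value per sample
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    losses = []
    for x, y in loader:
        losses.append(criterion(model(x.to(device)), y.to(device)).cpu())
    losses = torch.cat(losses)
    k = int(ratio * len(losses))
    return losses.topk(k).indices  # indices into `dataset`

The returned indices can be wrapped with torch.utils.data.Subset(dataset, indices) to build the reduced set used in each pruning pass.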
IV Experimental Results and Analyses
In this section, we empirically demonstrate the effectiveness of SASL on two benchmark datasets. We implement our method based on the publicly available deep learning framework PyTorch
[43]. We introduce the datasets and the pruned neural networks in IV-A, and present the training details in IV-B. In IV-C and IV-D, we show the experimental results on the two datasets. Finally, we conduct a series of ablation experiments in IV-E to further reveal the superiority of the proposed framework.


TABLE I: Comparison of pruning results on CIFAR-10.

Model    Approach    Baseline Acc. (%)  Pruned Acc. (%)  Acc. ↓ (%)  Params ↓ (%)  FLOPs ↓ (%)
VGGNet   L1 [29]     93.25              93.40            -0.15       64.0          34.2
         SSS [21]    93.96              93.63            0.33        66.7          36.3
         FPGM [15]   93.58              93.54            0.04        -             34.2
         GAL [34]    93.96              93.42            0.54        82.2          45.2
         VCP [58]    93.25              93.18            0.07        73.3          39.1
         SASL        93.69              93.89            -0.20       86.9          49.5
Res56    L1 [29]     93.04              93.06            -0.02       13.7          27.6
         SFP [14]    93.59              92.26            1.33        -             52.6
         CP [17]     92.80              91.80            1.00        -             50.0
         NISP [55]   -                  -                0.03        42.6          43.6
         FPGM [15]   93.59              93.49            0.10        -             52.6
         GAL [34]    93.26              93.38            -0.12       11.8          37.6
         VCP [58]    93.04              92.26            0.78        20.5          20.3
         SASL        93.63              93.88            -0.25       18.9          35.9
         SASL*       93.63              93.58            0.05        36.6          57.1
Res110   L1 [29]     93.53              93.30            0.23        32.4          38.6
         SFP [14]    93.68              93.38            0.30        -             40.8
         NISP [55]   -                  -                0.18        43.8          43.3
         FPGM [15]   93.68              93.74            -0.06       -             52.3
         GAL [34]    93.50              93.59            -0.09       4.1           18.7
         VCP [58]    93.21              92.96            0.25        41.3          36.4
         SASL        93.83              93.99            -0.16       31.9          51.7
         SASL*       93.83              93.80            0.03        54.3          70.2

SASL and SASL* are the conservative and aggressive schemes, respectively. Acc. ↓ is the prediction performance drop between the pruned and baseline models; the smaller, the better. A negative value of Acc. ↓ means a performance improvement after pruning.
IV-A Datasets and Network Models
IV-A1 Datasets
Two classical classification datasets, CIFAR-10 [24] and ILSVRC-2012 [45], are adopted in this paper. The CIFAR-10 dataset consists of images with a resolution of 32×32, classified into 10 classes. The training and test sets contain 50,000 and 10,000 images, respectively. A standard data augmentation scheme [19, 31], including shifting and mirroring, is adopted. All input data are normalized with the channel means and standard deviations.

As for ILSVRC-2012, it is a large dataset with 1.2 million training images and 50,000 validation images drawn from 1,000 classes. We adopt the same data augmentation scheme as the PyTorch official examples [43]. In the test stage, we report the single-center-crop validation error of the model as its performance.
IV-A2 Network Models
On the CIFAR-10 dataset, we evaluate our framework on two popular network architectures: VGGNet [48] and ResNet [13]. VGGNet was originally designed for the ILSVRC-2012 classification task; in our experiment, a variation of VGGNet for CIFAR-10 is taken from [29]. For ResNet, two networks with 56 and 110 layers are used. On the ILSVRC-2012 dataset, we adopt the deep ResNet-50 for pruning. Batch Normalization layers are adopted in all models to achieve better performance.
IV-B Training Details
IV-B1 Normal Training
In normal training, we train all the CNNs from scratch as baselines. All models are trained with the stochastic gradient descent (SGD) optimizer. On the CIFAR-10 dataset, we train VGGNet and ResNet with a mini-batch size of 64 for 160 and 240 epochs, respectively. The initial learning rate is set to 0.1 and is divided by 10 at 50% and 75% of the total number of training epochs. On the ILSVRC-2012 dataset, we train ResNet-50 for 90 epochs with a batch size of 256. The initial learning rate is 0.1, and we divide it by 10 after 30 and 60 epochs. Weight decay and a Nesterov momentum [49] of 0.9 without dampening are used in our experiments to improve the performance. We also adopt the weight initialization introduced by [12].









TABLE II: Comparison of pruning ResNet-50 on ILSVRC-2012.

Approach        Baseline Top-1 (%)  Pruned Top-1 (%)  Baseline Top-5 (%)  Pruned Top-5 (%)  Top-1 Acc. ↓ (%)  Top-5 Acc. ↓ (%)  FLOPs ↓ (%)
SFP [14]        76.15               74.61             92.87               92.06             1.54              0.81              41.8
CP [17]         -                   -                 92.20               90.80             -                 1.40              50.0
GDP [33]        75.13               71.89             92.30               90.71             3.24              1.59              51.3
DCP [60]        76.01               74.95             92.93               92.32             1.06              0.61              55.8
ThiNet [39]     72.88               71.01             91.14               90.02             1.87              1.12              55.8
SSS [21]        76.12               74.18             92.86               91.91             1.94              0.95              31.3
GAL [34]        76.15               71.80             92.87               90.82             4.35              2.05              55.0
Taylor-FO [41]  76.18               74.50             -                   -                 1.68              -                 45.0
FPGM [15]       76.15               75.59             92.87               92.63             0.56              0.24              42.2
C-SGD [6]       75.33               74.93             92.56               92.27             0.40              0.29              46.2
SASL            76.15               75.76             92.87               92.82             0.39              0.05              49.7
SASL*           76.15               75.15             92.87               92.47             1.00              0.40              56.1

SASL and SASL* are the conservative and aggressive schemes, respectively. Acc. ↓ is the prediction performance drop between the pruned and baseline models; the smaller, the better.
IV-B2 Sparsity Learning and Pruning
Although our framework can adaptively distribute the sparsity regularization, a base regularization value should be determined in advance, which controls the trade-off between prediction performance and structure sparsity. Empirically, we use a relatively larger base value for the simple VGGNet and a smaller one for the more complicated ResNet. The other settings are the same as in normal training.
When we prune filters from the structure-sparse models, saliency is used to decide which filters to discard. In our experiments, the pruning procedure is implemented by building a new compact model and then copying the retained weights from the original model.
IV-B3 Fine-Tuning
After pruning, we obtain more compact models, which we then fine-tune to restore the performance. A small constant learning rate is used for fine-tuning all models. On the CIFAR-10 dataset, we fine-tune the pruned models for 20 epochs, while on the ILSVRC-2012 dataset we only fine-tune the pruned ResNet-50 for 10 epochs.
IV-C Results on CIFAR-10
For the CIFAR-10 dataset, we test our SASL on VGGNet, ResNet-56 and ResNet-110. As shown in TABLE I, SASL outperforms state-of-the-art methods on all three networks. For VGGNet, SASL reduces 49.5% of the FLOPs with even a 0.2% accuracy improvement, while previous works [29, 21, 15, 34, 58] are worse in both aspects. For example, GAL [34] only prunes 45.2% of the FLOPs and incurs a 0.54% accuracy degradation.
For ResNet-56 and ResNet-110, we prune different ratios of filters to achieve different trade-offs between accuracy and complexity. In TABLE I, SASL denotes the conservative scheme that tries to preserve the accuracy, while SASL* denotes the aggressive scheme. Compared with other works, our framework also achieves state-of-the-art performance for ResNet. For pruning ResNet-56, SASL* reduces more FLOPs than FPGM [15] (57.1% vs. 52.6%) and better preserves the accuracy (degradation: 0.05% vs. 0.10%). On ResNet-110, SASL achieves a higher FLOPs reduction than VCP [58] (51.7% vs. 36.4%) with a 0.16% accuracy increase, while VCP harms the prediction performance. These results demonstrate the effectiveness of SASL, which strongly aligns with our previous analysis.
IV-D Results on ILSVRC-2012
SASL is also evaluated on the ILSVRC-2012 dataset by pruning ResNet-50. Similarly, we adopt both the conservative and aggressive schemes. TABLE II shows the superior performance of SASL. Under various pruned FLOPs ratios, our approach consistently achieves state-of-the-art performance compared with other methods [14, 17, 33, 60, 39, 21, 34, 15, 6]. Specifically, the conservative scheme SASL reduces 49.7% of the FLOPs with a negligible 0.39% top-1 and 0.05% top-5 accuracy degradation, while SSS [21] incurs a much larger performance deterioration (1.94% top-1 and 0.95% top-5 accuracy drops) and only prunes 31.3% of the FLOPs. Our SASL* also performs well: it reduces more FLOPs than DCP [60] (56.1% vs. 55.8%) while better maintaining the performance. Compared with previous methods, SASL estimates the saliency of different filters and intelligently distributes the regularization to obtain better structure-sparse networks, which is the main cause of its superior performance.
IV-E Ablation Study
In this part, we conduct a series of ablation experiments to validate the effectiveness of the proposed schemes. For simplicity and reliability, all the following experiments are conducted on CIFAR-10 with ResNet-56. Unless otherwise specified, the hyper-parameter settings are the same as stated in IV-B.
IV-E1 Different Sparsity Regularization
First, we analyze the effectiveness of the proposed saliency-adaptive sparsity learning. For comparison, we run traditional indiscriminate sparsity learning as the baseline. To gain further insight, we also replace the regularization guide, i.e., saliency, with each of its two factors, importance and resource. After each kind of sparsity learning, we prune different ratios of filters so as to reach a similar complexity reduction. We then fine-tune all models and report the classification accuracy change in TABLE III. SASL works much better than the indiscriminate baseline, with a 0.47% accuracy improvement. The importance-guided and resource-guided versions also improve over the baseline, but both are worse than the integrated version. Saliency accounts for both aspects, so it can better guide the sparsity learning.
TABLE III: Different sparsity regularization guides for ResNet-56 on CIFAR-10.

           Tradition  Importance  Resource  Saliency
FLOPs ↓    57.0%      57.0%       57.1%     57.1%
Acc. ↓     0.52%      0.36%       0.33%     0.05%
IV-E2 Hierarchy Scheme Extension
Based on saliency, we classify the filters into five classes and then adaptively impose the regularization. Here we vary the number of classes in the hierarchy scheme to explore the influence of this parameter. The one-class scheme equals traditional sparsity learning, and the five-class scheme is the proposed one. We also change the base regularization value according to the number of classes so as to impose the same total amount of regularization. The results after pruning and fine-tuning are shown in Fig. 4. We can find that the final accuracy grows with the number of classes, and the increase slows down when the number of classes is already large. Note that fine-tuning this parameter may lead to even better results.
IV-E3 Saliency as Criterion
After sparsity learning, we need a criterion to discard filters. In this paper, we claim that saliency is also an excellent metric, since it takes both importance and resources into consideration. Here we compare saliency with other pruning criteria to show its superiority. The most common criterion in previous sparsity learning approaches is based on the energy term, i.e., the $\ell_1$ norm of the scaling factors or the mean value of the filters. We also prune filters based on importance alone and on resources alone. The "resource" criterion differs from saliency in that the importance factor of saliency ($\mathcal{I}_f$ in Equation 6) is replaced with the energy term. TABLE IV shows the accuracy results when reducing the same ratio of FLOPs. Not surprisingly, saliency is better than the other criteria.
TABLE IV: Different pruning criteria for ResNet-56 on CIFAR-10.

           Energy  Importance  Resource  Saliency
FLOPs ↓    57.0%   56.9%       57.2%     57.1%
Acc. ↓     0.25%   0.22%       0.18%     0.05%
IV-E4 Input Data for Saliency Estimation
The proposed pruning criterion, saliency, is data-dependent, which means that saliency estimation can be sensitive to the input data used. Directly using all training data would bring a huge complexity overhead, especially for the multi-pass pruning scheme. In this paper, we propose a hard sample mining strategy for efficient and effective saliency estimation. We compare it with using all the training data for saliency estimation when pruning filters. Surprisingly, the hard sample mining strategy not only reduces the complexity overhead of saliency estimation, but also improves the overall performance (93.58% vs. 93.51%). We attribute this to the correlation between hard samples and the test set. Easy samples do not provide much information for guidance, and sometimes such information can even be regarded as noise that disturbs accurate pruning decisions. Preserving the performance on hard samples makes the model work better on the test set.
IV-F Discussions
Based on the experimental results and comparisons with other approaches, here we present several discussions to better analyze our special designs.
IV-F1 CE Loss as Compensation for Regularization
As pointed out earlier, in traditional indiscriminate sparsity learning the regularization term, i.e., $\mathcal{R}(\cdot)$ in Equation 1, is applied without guidance, while the original objective function $l(\cdot)$, such as the cross entropy (CE) loss in classification, can compensate for the regularization effect to some extent: the gradient of the CE loss will weaken the effect of the regularizer if the filter is truly important for the final performance. However, this compensation effect is still weak, due to the non-convex optimization procedure of current deep learning frameworks based on back-propagation. As the previous method [36] shows, one way to avoid removing important filters is to reduce the regularization strength. The drawback is that it requires a multi-pass "SL-pruning" iteration to obtain enough sparsified filters, which is inefficient and computationally expensive. In comparison, our design makes use of more prior information to provide a hierarchy scheme, which allows larger regularization. Therefore, efficient single-pass SL that preserves the prediction performance becomes possible.
In addition, the above guidance from the CE loss only considers the importance term. In our work, the consumed-resources term is also integrated into the estimation metric (saliency), which can better guide the sparsification toward FLOPs reduction.
IV-F2 Performance Improvement after Pruning
As shown in TABLE I, several pruned models achieve higher prediction precision after SASL. For example, when reducing 35.9% of the FLOPs of ResNet-56 on CIFAR-10, we improve the performance by 0.25% (from 93.63% to 93.88%). We hypothesize that this is due to the regularization effect of sparsity learning, which naturally selects important features in the intermediate layers of a neural network. This effect removes redundancy as well as noisy information. The phenomenon is obvious on a simple task such as CIFAR-10 classification, while for pruning networks on the complex ILSVRC-2012 dataset the performance improvement is not very evident, since the redundancy of the original models is much smaller.
IV-F3 Comparison with the Trimmed ℓ1 Regularizer
The most recent work [56] also recognized the critical problem of traditional indiscriminate sparsity learning and proposed the Trimmed $\ell_1$ regularizer, in which the filters with the largest norms are not penalized by the regularization; [56] only implemented this on the simple MNIST dataset with LeNet-5 [27].
Compared with the Trimmed $\ell_1$ regularizer, our work optimizes sparsity learning and works better in two respects. First, as pointed out earlier, the smaller-norm-less-important belief underlying [56] does not always hold, so distinguishing filters only by their norm cannot be very precise. In comparison, our proposed metric, saliency, integrates the Taylor-expansion-approximated importance with the consumed computational resources, and can better represent the significance of filters. Second, [56] only leaves several filters penalty-free, which can be viewed as a two-class hierarchy scheme, a special case of our work. As seen in Fig. 4, the simple two-class scheme cannot effectively find the optimal structure-sparse networks.
V Conclusion
Current deep convolutional neural networks are effective but suffer from high inference complexity. In this paper, we first analyze the critical problems of previous indiscriminate sparsity learning approaches and then propose a novel structured regularization form, namely SASL, which adaptively distributes the regularization strength to each filter according to its saliency. SASL can better preserve the performance and zero out more computation-heavy filters. We also propose to use saliency as the criterion for pruning. To further improve the effectiveness and efficiency of this data-dependent criterion, we utilize a hard sample mining strategy, which shows better performance and also saves computational overhead. Experiments demonstrate the superiority of SASL over state-of-the-art methods. In future work, we plan to investigate how to combine SASL with other acceleration algorithms that are orthogonal to our scheme, such as matrix decomposition, to obtain better performance.
References

[1] (2019) Regularizing deep neural networks by enhancing diversity in feature extraction. IEEE Transactions on Neural Networks and Learning Systems 30(9), pp. 2650–2661.
[2] (2016) Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167.
[3] (2014) Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint arXiv:1412.7062.
[4] (2018) Layer-compensated pruning for resource-constrained convolutional neural networks. arXiv preprint arXiv:1810.00518.
[5] (2016) R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pp. 379–387.
[6] (2019) Centripetal SGD for pruning very deep convolutional networks with complicated structure. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4943–4953.
[7] (2018) Auto-balanced filter pruning for efficient convolutional neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence.
[8] (2019) Global sparse momentum SGD for pruning very deep neural networks. In Advances in Neural Information Processing Systems, pp. 6379–6391.
[9] (2018) The lottery ticket hypothesis: finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635.
[10] (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
[11] (1993) Second order derivatives for network pruning: Optimal Brain Surgeon. In Advances in Neural Information Processing Systems, pp. 164–171.
[12] (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034.
[13] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[14] (2018) Soft filter pruning for accelerating deep convolutional neural networks. arXiv preprint arXiv:1808.06866.
[15] (2019) Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4340–4349.
[16] (2018) AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800.
[17] (2017) Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397.
[18] (2016) Network trimming: a data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250.
[19] (2016) Deep networks with stochastic depth. In European Conference on Computer Vision, pp. 646–661.
[20] (2018) Learning to prune filters in convolutional neural networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 709–718.
[21] (2018) Data-driven sparse structure selection for deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 304–320.
[22] (2019) Accelerator-aware pruning for convolutional neural networks. IEEE Transactions on Circuits and Systems for Video Technology.
[23] (2019) Sparse artificial neural networks using a novel smoothed LASSO penalization. IEEE Transactions on Circuits and Systems II: Express Briefs 66(5), pp. 848–852.
[24] (2009) Learning multiple layers of features from tiny images. Technical report.
[25] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
[26] (1990) Optimal brain damage. In Advances in Neural Information Processing Systems, pp. 598–605.
[27] (2015) LeNet-5, convolutional neural networks. URL: http://yann.lecun.com/exdb/lenet.
[28] (2019) Structured pruning of neural networks with budget-aware regularization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9108–9116.
[29] (2016) Pruning filters for efficient ConvNets. arXiv preprint arXiv:1608.08710.
[30] (2019) OICSR: Out-in-channel sparsity regularization for compact deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7046–7055.
[31] (2013) Network in network. arXiv preprint arXiv:1312.4400.
[32] (2019) Toward compact ConvNets via structure-sparsity regularized filter pruning. IEEE Transactions on Neural Networks and Learning Systems.
[33] (2018) Accelerating convolutional networks via global & dynamic filter pruning. In IJCAI, pp. 2425–2432.
[34] (2019) Towards optimal structured CNN pruning via generative adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2790–2799.
[35] (2019) MetaPruning: meta learning for automatic neural network channel pruning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3296–3305.
[36] (2017) Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744.
[37] (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
[38] (2017) Learning sparse neural networks through L0 regularization. arXiv preprint arXiv:1712.01312.
[39] (2017) ThiNet: a filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5058–5066.
[40] (2017) Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 2498–2507.
[41] (2019) Importance estimation for neural network pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11264–11272.
[42] (2019) One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. In Advances in Neural Information Processing Systems, pp. 4933–4943.
[43] (2017) Automatic differentiation in PyTorch.
[44] (2015) Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
[45] (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3), pp. 211–252.
[46] (2018) Feature selection with ℓ2,1-2 regularization. IEEE Transactions on Neural Networks and Learning Systems 29(10), pp. 4967–4982.
[47] (2016) Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 761–769.
[48] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[49] (2013) On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pp. 1139–1147.
[50] (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
[51] (2019) Structured pruning for efficient ConvNets via incremental regularization. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
[52] (2017) A novel pruning algorithm for smoothing feedforward neural networks based on group lasso method. IEEE Transactions on Neural Networks and Learning Systems 29(5), pp. 2012–2024.
[53] (2016) Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082.
[54] (2019) Learning optimized structure of neural networks by hidden node pruning with L1 regularization. IEEE Transactions on Cybernetics.
[55] (2018) NISP: Pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9194–9203.
[56] (2019) Trimming the ℓ1 regularizer: statistical analysis, optimization, and applications to deep learning. In International Conference on Machine Learning, pp. 7242–7251.
[57] (2016) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.
[58] (2019) Variational convolutional neural network pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2780–2789.
[59] (2016) Less is more: Towards compact CNNs. In European Conference on Computer Vision, pp. 662–677.
[60] (2018) Discrimination-aware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems, pp. 875–886.