1 Introduction
Ever since the renaissance of convolutional neural networks (CNNs) a decade ago, the general trend has been to push them deeper and deeper to achieve better performance [alexnet; vgg; resnet], along with a drastic increase in their computation and storage requirements. However, it is no easy task to deploy these unprecedentedly large models effectively in practical applications. Additionally, recent works [denil2013predicting; ba2014do] have further confirmed the over-parameterization problem in deep CNNs, revealing that these powerful models do not necessarily have to be so cumbersome.
To bridge the gap between limited computational resources and the superior performance of deep CNNs, various model acceleration algorithms have been proposed, including network pruning [liu2017learning; he2017channel], low-rank factorization [jaderberg2014speeding; denton2014exploiting], network quantization [courbariaux2016binarized; rastegari2016xnor], and knowledge distillation [hinton2015distilling; romero2014fitnets], among which network pruning is of active research interest.
Despite their favorable performance in certain cases, current magnitude-based approaches to network pruning have some inherent limitations and cannot guarantee stable behavior in general situations. As pointed out in [he2019pruning], magnitude-based methods rely on two preconditions to achieve satisfactory performance, i.e. large magnitude deviation and small minimum magnitude. Without the guarantee of these two prerequisites, such methods are prone to mistakenly removing weights that are crucial to the network's performance. Unfortunately, these requirements are not always met, and we empirically observe that the performance of magnitude-based pruning algorithms deteriorates rapidly under those adverse circumstances.
In order to address this important issue and derive a more general approach to accelerating deep CNNs, we propose to discover parameter redundancy among feature channels from a novel perspective. To be precise, we argue that channels revealing similar feature representations have functional overlap and that most channels within each such similarity group can be discarded without compromising the network's representational power. The proposed similarity-based approach is more general in that it remains applicable beyond the restricted scenarios of its magnitude-based counterpart, as will be demonstrated through our experiments. Figure 1 shows a graphical illustration of our motivation.
Our contributions can be summarized as follows:


- Propose a novel perspective, i.e. a similarity-based approach, for channel-level pruning.

- Introduce an effective metric for evaluating channel similarity via probabilistic modelling.

- Develop an efficient channel pruning algorithm via hierarchical clustering based on this metric.
2 Related work
Non-structured pruning. Earlier works on network pruning [lecun1990optimal; hassibi1993second] remove redundant parameters by analyzing the Hessian matrix of the loss function. Several magnitude-based methods [han2015learning; han2015deep; guo2016dynamic] drop network weights with insignificant values to perform model compression. [dai2018compressing; molchanov2017] leverage Bayesian theory to achieve more interpretable pruning. Similar to our approach, [srinivas2015data; mariet2015diversity] also exploit the notion of similarity for reducing parameter redundancy; however, they only consider fully-connected layers and obtain limited acceleration. [son2018clustering] introduces k-means clustering to extract the centroids of kernels and achieves computation reduction via kernel sharing.
Structured pruning. To avoid the need for specialized hardware and libraries for sparse matrix operations, various channel-level pruning algorithms have been proposed. [wen2016learning; lebedev2016fast; huang2018data; zhou2016less] impose carefully designed sparsity constraints on network weights during training to facilitate the removal of redundant channels. These algorithms rely on training networks from scratch with sparsity and cannot be directly applied to accelerating pre-trained ones. [he2017channel; luo2017thinet; zhuang2018discrimination] transform network pruning into an optimization problem. Analogous to our work, [roychowdhury2017reducing] investigates the presence of duplicate filters in neural networks and finds that some of them can be removed without impairing the network's performance.
Neural architecture search. While compact CNN models [zhang2018shufflenet; howard2017mobilenets] were mostly designed in a hand-crafted manner, neural architecture search has also shown its potential in discovering efficient models. Some methods [baker2017designing; barret2017neural; real2017large; liu2018hierarchical] leverage reinforcement learning to conduct architecture search in discrete spaces, while others [luo2018neural; liu2018darts] perform optimization in continuous ones.

3 Network pruning via channel similarity
3.1 Channel similarity
Let us denote the activations of an arbitrary convolutional layer by $A \in \mathbb{R}^{N \times C \times H \times W}$, where $H$ and $W$ represent the height and width of feature maps, $C$ the number of feature channels, and $N$ the batch size, respectively. Each channel can thus be represented by a 3-dimensional tensor $A_i \in \mathbb{R}^{N \times H \times W}$, with $i$ its corresponding channel index. Intuitively, if we want to quantify the similarity between two feature channels, the mean squared error (MSE) provides a simple yet effective metric, commonly adopted to evaluate to what extent one tensor deviates from another. Therefore, using the notations introduced above, the distance, or dissimilarity, between two feature channels $A_i$ and $A_j$ can be defined as follows:
(1)  $d(A_i, A_j) = \frac{1}{NHW} \sum_{n=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} \left( A_i^{(n,h,w)} - A_j^{(n,h,w)} \right)^2$
However, naively computing channel distance in this manner has two obvious limitations. First, this definition is not consistent across different data batches, as the activation values depend on the input samples fed to the network. Second, computing the distance between all possible pairs of feature channels via this formulation is computationally inefficient in view of its $O(C^2 NHW)$ time complexity, not to mention the fact that modern CNNs generally contain millions of activations.
To address these issues, we propose a probabilistic approach to estimating channel distance. Precisely, each activation $A_i^{(n,h,w)}$ is represented by a random variable $X_i$ with mean $\mu_i$ and variance $\sigma_i^2$, where $h$ and $w$ index the height and width of feature maps and $n$ the position in the current batch, respectively.

Proposition 1.
Assume that the activations belonging to the same channel are i.i.d. and that any two activations from two different channels are mutually independent. For any two channels in the same convolutional layer, the distance between them converges in probability to the following value:
(2)  $d(A_i, A_j) \xrightarrow{P} (\mu_i - \mu_j)^2 + \sigma_i^2 + \sigma_j^2 \quad \text{as } NHW \to \infty$
Proof.
Detailed derivation can be found in Appendix B of the supplementary material. ∎
The proposition above provides a nice approximation to the distance between two feature channels under our probabilistic settings, since $NHW$ generally reaches tens of thousands in modern CNNs. While the assumption of mutual independence may appear rather strong at first glance, we empirically validate its soundness in Section 4.3 by showing that the proposed probabilistic approach induces no bias in channel distance estimation compared to its naive activation-value-based counterpart.
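The agreement claimed above is easy to check numerically. The following is a minimal sketch (ours, sampling Gaussian activations purely for illustration; the variable names are not from the paper) comparing the activation-based distance of Equation 1 against the closed form of Equation 2:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_i, sigma_i = 0.3, 1.2   # moments of channel i
mu_j, sigma_j = -0.5, 0.8  # moments of channel j
N, H, W = 64, 32, 32       # batch size and feature-map size

# Sample two i.i.d. channels and compute the naive MSE distance (Equation 1).
A_i = rng.normal(mu_i, sigma_i, size=(N, H, W))
A_j = rng.normal(mu_j, sigma_j, size=(N, H, W))
mse_distance = np.mean((A_i - A_j) ** 2)

# Closed-form estimate from Proposition 1 (Equation 2), no activations needed.
prob_distance = (mu_i - mu_j) ** 2 + sigma_i ** 2 + sigma_j ** 2

print(f"activation-based: {mse_distance:.4f}")
print(f"probabilistic:    {prob_distance:.4f}")  # the two agree closely for NHW = 65536
```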
If we now re-examine the time complexity of this novel formulation for computing channel distance, it drops down to $O(C^2)$, independent of the feature-map and batch sizes. In real-life applications, e.g. a batch of 64 high-resolution images, the saving in computing time can be considerable, especially for lower layers with large feature maps. In short, this probabilistic approach is scalable to all kinds of application scenarios, given that the number of channels rarely exceeds 1000 in existing CNN architectures. Now that we have instantiated the notion of channel similarity via the proposed channel distance (if the distance between channels $A_i$ and $A_j$ is smaller than that between $A_i$ and a third channel $A_k$, then $A_i$ is considered more similar to $A_j$ than to $A_k$), the next step is to find an efficient way to derive the statistical information of the activations in each feature channel.
3.2 Channel similarity via batch normalization
Batch normalization (BN) [batchnorm] has been introduced to enable faster and more stable training of deep CNNs and is now an indispensable component in deep learning. The composite of three consecutive operations, namely linear convolution, batch normalization, and rectified linear unit (ReLU) [relu], is widely adopted as a building block in state-of-the-art deep CNN architectures [resnet; densenet]. The way BN normalizes the activations within a feature channel motivates us to base our probabilistic settings on the statistical information in the outputs of BN layers. In particular, BN normalizes the activations using mini-batch statistics, which perfectly matches our definition of channel distance.
Let us denote the $i$-th input and output feature maps of a BN layer by $x_i$ and $y_i$, respectively. For an arbitrary mini-batch $\mathcal{B}$, this BN layer performs the following transformation:
(3)  $\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, \qquad y_i = \gamma_i \hat{x}_i + \beta_i$
where $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}^2$ denote the mini-batch mean and variance, and $\gamma_i$ and $\beta_i$ are two trainable parameters of an affine transformation that helps to restore the representational power of the network. $\epsilon$ is a small positive number added to the mini-batch variance for numerical stability.
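As a quick check of Equation 3 and of where these statistics live in practice, the following sketch (ours, not from the paper's code) reproduces a PyTorch BN layer in training mode, where mini-batch statistics are used:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 3, 16, 16)     # a mini-batch of shape (N, C, H, W)
bn = nn.BatchNorm2d(3).train()    # training mode: normalize with batch statistics
y_ref = bn(x)

# Manual re-implementation of Equation 3, channel by channel.
mu = x.mean(dim=(0, 2, 3), keepdim=True)                  # mini-batch mean
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)  # mini-batch variance
x_hat = (x - mu) / torch.sqrt(var + bn.eps)
y = bn.weight.view(1, -1, 1, 1) * x_hat + bn.bias.view(1, -1, 1, 1)

print(torch.allclose(y, y_ref, atol=1e-6))  # True
```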
Given the fact that the overwhelming majority of modern CNN architectures adopt the convention of inserting a BN layer after each convolutional layer, it is very convenient to directly leverage the statistical information provided by BN layers for channel distance estimation. Under the probabilistic settings established in Section 3.1, the batch-normalized activations of the $i$-th feature channel are i.i.d. random variables with mean $\beta_i$ and variance $\gamma_i^2$, i.e. $Y_i \sim (\beta_i, \gamma_i^2)$. Consequently, we can straightforwardly compute the distance between two batch-normalized feature channels in the same convolutional layer as follows:
(4)  $d(Y_i, Y_j) = (\beta_i - \beta_j)^2 + \gamma_i^2 + \gamma_j^2$
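In code, Equation 4 reduces the whole distance computation to a few vectorized operations on a BN layer's parameters. The sketch below (ours; the function name is illustrative) builds the full $C \times C$ distance matrix:

```python
import torch

def bn_channel_distances(bn: torch.nn.BatchNorm2d) -> torch.Tensor:
    """Return the C x C channel distance matrix of Equation 4."""
    gamma = bn.weight.detach()  # per-channel scales (the gammas)
    beta = bn.bias.detach()     # per-channel shifts (the betas)
    dist = (beta[:, None] - beta[None, :]) ** 2 \
         + gamma[:, None] ** 2 + gamma[None, :] ** 2
    # Equation 4 assumes two *different* channels (mutual independence); a
    # channel's distance to itself is zero by definition, so fix the diagonal.
    dist.fill_diagonal_(0.0)
    return dist

dist = bn_channel_distances(torch.nn.BatchNorm2d(64))
print(dist.shape)  # torch.Size([64, 64])
```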
3.3 From similarity to redundancy
Existing magnitude-based pruning algorithms rely on the argument that removing parameters with relatively insignificant values will incur little impact on the network's performance. However, as pointed out in [lecun1990optimal; hassibi1993second], this intuitive idea does not seem to be theoretically well-founded. In contrast, we derive theoretical support to justify our similarity-based pruning approach. In particular, we show that the removal of an arbitrary feature channel will not impair the network's representational power in a dramatic way, as long as there exists another channel that is sufficiently similar to the removed one and can be exploited as a substitution.
Let us consider two consecutive convolutional layers $l$ and $l+1$; using similar notations as before, we have $A^l \in \mathbb{R}^{N \times C_l \times H_l \times W_l}$ and $A^{l+1} \in \mathbb{R}^{N \times C_{l+1} \times H_{l+1} \times W_{l+1}}$. Suppose that batch normalization and non-linear activation are applied after each linear convolution; then we have:
(5)  $A_k^{l+1} = \sigma\!\left(\mathrm{BN}\!\left(\sum_{i=1}^{C_l} K_{k,i} * A_i^{l}\right)\right), \quad k = 1, \dots, C_{l+1}$
where $K_{k,i}$ represents the $i$-th kernel matrix in the $k$-th convolutional filter of layer $l+1$, $*$ denotes the convolution operation, and $\sigma(\cdot)$ the non-linear activation. Note that BN nullifies the effect of bias vectors in convolutional layers, hence they are omitted in the formulation above.
Inspired by [srinivas2015data], we explore and analyze to what extent the activations of layer $l+1$ will be shifted if we remove a feature channel from layer $l$ and compensate the consequent loss of representational power by exploiting the channels in layer $l$ that are similar to the removed one. Suppose that $A_i^l$ and $A_j^l$ are two similar feature channels in layer $l$. If we remove the former and properly update the kernel matrix corresponding to the latter in an attempt to minimize the resulting performance decay, then for each feature channel $\hat{A}_k^{l+1}$ after pruning, we have:
(6)  $\hat{A}_k^{l+1} = \sigma\!\left(\mathrm{BN}\!\left(\sum_{m \neq i, j} K_{k,m} * A_m^{l} + (K_{k,i} + K_{k,j}) * A_j^{l}\right)\right)$
Note that we replace the kernel matrix $K_{k,j}$ by $K_{k,i} + K_{k,j}$, which is a simple and intuitive way to compensate for the loss of representational power caused by pruning $A_i^l$. Computing the distance between $A_k^{l+1}$ and $\hat{A}_k^{l+1}$ using Equation 1 gives:
(7)  $d\!\left(A_k^{l+1}, \hat{A}_k^{l+1}\right) = \frac{1}{N H_{l+1} W_{l+1}} \sum_{n,h,w} \left( A_k^{l+1,(n,h,w)} - \hat{A}_k^{l+1,(n,h,w)} \right)^2$

where $H_{l+1}$ and $W_{l+1}$ denote the height and width of the feature maps in layer $l+1$.
Proposition 2.
For each feature channel $A_k^{l+1}$ in the $(l+1)$-th convolutional layer, the distance shift caused by removing the feature channel $A_i^l$ from the $l$-th convolutional layer, as defined in Equation 7, admits the following upper bound:
(8)  $d\!\left(A_k^{l+1}, \hat{A}_k^{l+1}\right) \le \alpha_k \, s^2 \, \lVert K_{k,i} \rVert_F^2 \, \min_{j \neq i} d\!\left(A_i^{l}, A_j^{l}\right)$

where $\alpha_k$ is a coefficient determined by the BN statistics of channel $k$ and the sizes of the feature maps in layers $l$ and $l+1$ (made explicit in Appendix B), and $s \times s$ corresponds to the size of each kernel matrix $K_{k,i}$.
Proof.
Detailed derivation can be found in Appendix B of the supplementary material. ∎
Through the conclusion of Proposition 2, we can notice that the upper bound depends on the size of the feature maps and on the size and norm of the kernel matrix. In practice, the coefficient $\alpha_k \, s^2 \, \lVert K_{k,i} \rVert_F^2$ is typically small, which means that the removal of a feature channel results in a rather limited shift in the next layer's activations, and hence has little impact on the network's representational power, as long as there exists an adequately similar channel to replace its function. Note that the above one-to-one substitution strategy is mainly introduced for theoretical analysis; in practical applications, it is clearly a suboptimal solution from an overall perspective. For instance, we could exploit more channels, i.e. a linear combination, to substitute for the pruned one. In our implementation, the retained kernel matrices after pruning are adaptively updated by gradient-descent-based optimization algorithms rather than manually computed in a fixed way, as this automatic approach normally yields a superior updating strategy. Experimental results reflect the effectiveness of this choice.
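The one-to-one substitution analyzed above can be written directly as an edit of the next layer's weights. A minimal sketch (ours; it assumes a plain convolution without bias, consistent with the formulation above):

```python
import torch

def substitute_and_prune(conv_next: torch.nn.Conv2d, i: int, j: int) -> torch.nn.Conv2d:
    """Fold K_{k,i} into K_{k,j} (Equation 6), then drop input channel i."""
    W = conv_next.weight.data.clone()      # shape (C_out, C_in, s, s)
    W[:, j] += W[:, i]                     # compensate with the similar channel j
    keep = [c for c in range(W.size(1)) if c != i]
    pruned = torch.nn.Conv2d(len(keep), W.size(0), conv_next.kernel_size,
                             stride=conv_next.stride, padding=conv_next.padding,
                             bias=False)   # bias omitted: BN nullifies it
    pruned.weight.data = W[:, keep]
    return pruned
```

In our pipeline this manual update only serves as a starting point: the retained kernels are subsequently refined by gradient descent during fine-tuning.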
3.4 Similarity-based pruning via hierarchical clustering
Now that we have properly defined our metric of channel similarity and theoretically demonstrated the feasibility of our similarity-based pruning approach, it is very natural to resort to clustering algorithms for the subsequent pruning process. Precisely, we want to group the channels within each convolutional layer into similarity clusters based on its channel distance matrix, i.e. a square matrix that collects the distances between all possible pairs of channels within the layer. Given the results of clustering, we only need to retain one representative channel per cluster, as the representation information provided by the others is highly similar and hence redundant. Here we select the channel with the largest scaling factor $\gamma$ to enable a faster and easier fine-tuning process after the pruning operation.
To this end, hierarchical clustering (HC) is introduced as our agglomerative method. Compared with other popular clustering algorithms, HC is better adapted to our needs as it requires only one hyper-parameter, i.e. the threshold distance. In particular, this threshold allows us to simultaneously control the clustering result, and hence the pruning ratio, of all layers with a single global parameter. This property is crucial to discovering efficient and compact CNN models, as the target architecture is automatically determined by the pruning algorithm [liu2018rethinking]. In order to render the distance values comparable across all layers and make our one-shot pruning more stable, we further normalize the values of each distance matrix to $[0, 1]$ before feeding them to the clustering algorithm. Empirical results in Section 4.4 showcase the importance of this step.
The overall workflow of our similarity-based pruning algorithm can be summarized as follows:

1. Construct a channel distance matrix $D$ for each convolutional layer using Equation 4.

2. Normalize each channel distance matrix: $\tilde{D} = (D - \min D) / (\max D - \min D)$.

3. Perform hierarchical clustering of channels on each convolutional layer with a global threshold, using the normalized channel distance matrix obtained in step 2.

4. For each cluster, retain the channel with the largest scaling factor $\gamma$ and remove the others.

5. Fine-tune the resulting pruned model for a few epochs to restore performance.
Corresponding pseudo code can be found in Appendix C of the supplementary material.
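A compact sketch of steps 2 and 3 for a single layer is given below (ours; the average-linkage choice is an assumption, as only the distance threshold is specified above):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_channels(dist: np.ndarray, threshold: float) -> np.ndarray:
    """Steps 2-3: normalize a channel distance matrix, then cluster with a global threshold."""
    d = (dist - dist.min()) / (dist.max() - dist.min() + 1e-12)  # step 2: map to [0, 1]
    np.fill_diagonal(d, 0.0)
    Z = linkage(squareform(d, checks=False), method="average")   # step 3: hierarchical clustering
    return fcluster(Z, t=threshold, criterion="distance")        # one cluster label per channel

r = np.random.rand(64, 64)
labels = cluster_channels((r + r.T) / 2, threshold=0.3)  # symmetric toy distance matrix
print(labels)  # channels sharing a label form one similarity group
```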
4 Experiments and analysis
We empirically demonstrate the effectiveness of our similarity-based channel pruning algorithm on two representative CNN architectures, VGGNet [vgg] and ResNet [resnet]. Results are reported on the benchmark datasets CIFAR [cifar] and ImageNet [imagenet]. Several state-of-the-art channel pruning algorithms are introduced for performance comparison, including SSL [wen2016learning], RDF [roychowdhury2017reducing], PFGM [he2019pruning], TN [luo2017thinet], CP [he2017channel], NS [liu2017learning], and SFP [he2018soft].
Implementation details. For all experiments on the CIFAR-10 and CIFAR-100 datasets, we fine-tune the pruned models using the SGD optimizer with batch size 64 for 60 epochs. The learning rate begins at 0.01 and decays by a factor of 10 at epochs 25 and 50 for ResNet and at epochs 30 and 50 for VGGNet, respectively. On the ImageNet dataset, all training settings are kept the same except for a batch size of 256. Weight decay and a Nesterov momentum of 0.9 are utilized. More details can be found in Appendix A.
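In PyTorch, this schedule corresponds roughly to the following configuration (a sketch; the network is a stand-in and the weight-decay value is a placeholder, since the exact figure is not stated here):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())  # stand-in network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
                            nesterov=True, weight_decay=1e-4)  # placeholder weight decay
# Decay the learning rate by 10x at epochs 25 and 50 (ResNet on CIFAR).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[25, 50], gamma=0.1)

for epoch in range(60):
    ...  # one fine-tuning epoch with batch size 64
    scheduler.step()
```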
Acceleration metric. We adopt FLOPs as the acceleration metric to evaluate all pruning algorithms in our experiments, as it plays a decisive role in the network's inference speed. Different from most previous works, when computing overall FLOPs we count all floating-point operations that take place during the inference phase, not only those related to convolution operations, since non-tensor layers (e.g. BN, ReLU, and pooling layers) also occupy a considerable part of inference time on GPU [luo2017thinet]. Additionally, each multiply-add counts as two FLOPs. Reported results of all baseline methods are computed using their publicly available configuration files under our settings.
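To make the accounting explicit, here is one plausible FLOPs counter for a conv-BN-ReLU block under the convention above (a sketch under our assumptions; at inference BN reduces to one multiply and one add per activation):

```python
def conv_bn_relu_flops(h_out: int, w_out: int, c_in: int, c_out: int, k: int) -> int:
    """FLOPs for one conv + BN + ReLU block; each multiply-add counts as two FLOPs."""
    conv = 2 * h_out * w_out * c_out * c_in * k * k  # k*k*c_in multiply-adds per output value
    bn = 2 * h_out * w_out * c_out                   # one multiply and one add per activation
    relu = h_out * w_out * c_out                     # one comparison per activation
    return conv + bn + relu

# First block of VGG16 on a 32x32 CIFAR image:
print(conv_bn_relu_flops(32, 32, 3, 64, 3))  # 3,735,552
```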
4.1 VGGNet and ResNet on CIFAR
We first evaluate the acceleration performance of our algorithm on the CIFAR dataset. The results are summarized in Table 1. Compared with other algorithms, ours consistently achieves better accuracy while reducing more FLOPs. To showcase the superior performance of our approach over the others under different pruning ratios, we further plot the test accuracy as a function of the proportion of pruned FLOPs in Figure 2. While the other methods rival ours in accuracy under relatively low pruning ratios, they begin to experience severe performance decay as the ratio goes up. In contrast, our algorithm exhibits very stable behavior across all architecture-dataset settings and maintains decent performance even under extremely aggressive pruning ratios.
4.2 ResNet on ImageNet
We then apply our algorithm to pruning ResNet-50 on ImageNet-2012 to validate its effectiveness on large-scale datasets. As shown in Table 1, our algorithm outperforms all 5 competitors by a notable margin in terms of Top-1 and Top-5 accuracy drop, achieving a 45.90% reduction of FLOPs at the cost of only a 0.34% drop in Top-5 accuracy. We further investigate how our algorithm extends to varying pruning ratios and how this theoretical reduction of computation translates into realistic acceleration on modern GPUs. The results presented in Table 1(c) are obtained using PyTorch [pytorch] and cuDNN v7.1 on a TITAN X Pascal GPU. The gap between FLOPs saving and time saving can be attributed to the fact that non-computing operations, e.g. IO queries and buffer switches, also influence inference speed.

4.3 Channel distance estimation
As discussed in Section 3.1, calculating channel distance via the activation-value-based approach, i.e. using Equation 1, is data-dependent and inconsistent across different data batches. To elaborate on this point, we study the performance of pruned VGG16 models obtained using randomly sampled batches. As illustrated in Figure 3, while the activation-based approach matches the probabilistic approach in mean accuracy, its performance fluctuates significantly across different input samples and exhibits a highly unstable pattern. Moreover, this instability worsens with higher pruning ratios and smaller batch sizes. In contrast, the probabilistic approach does not rely on input data and consistently achieves satisfactory performance. This result strongly confirms the effectiveness and necessity of the proposed probabilistic approach to channel distance estimation.
In order to validate that our probabilistic approach provides a good approximation to the mathematical expectation of its activation-based counterpart, and that the assumptions of Proposition 1 cause no bias in the estimation process, we further compare the distance matrix computed using Equation 1 (averaged over 20 trials) with that computed using Equation 4. As we can see from Figure 3, there is only minor discrepancy between the mean distance matrix obtained via the activation-based approach and the distance matrix estimated via the probabilistic approach, which effectively supports our claim. This observation also explains why the two approaches share similar mean accuracy. More visualization results can be found in Appendix D.
4.4 Normalized channel distance matrix
To explore the effect of normalizing the channel distance matrix before performing HC, we visualize the distribution of values within both the normalized matrix and the unnormalized one in Figure 4. From the visualization results, we observe that, without distance normalization, the value distribution is rather lopsided across different layers. This phenomenon results in an imbalanced target architecture when all layers are pruned using a global threshold, i.e. some upper layers are reduced to a single channel. As we can see from Figure 4, the normalization operation effectively alleviates this problem, enabling automatic search of the target architecture while ensuring efficient propagation of feature information through the network. The comparison of accuracy curves between the two settings corroborates this point.
4.5 Performance analysis
We further look into the reason for the superior performance of our algorithm over other automatic pruning algorithms in finding efficient and compact architectures. As shown in Figure 5, NS and SSL prune very aggressively in upper layers, making it extremely hard to propagate feature information extracted in lower layers up to the classification layer. Moreover, upper layers account for very limited FLOPs, as the feature maps there have already been down-sampled several times and are thus of smaller size, which explains why our algorithm retains more channels there while pruning the same amount of FLOPs.
5 Conclusion
We propose a novel perspective on channel-level pruning for accelerating deep CNNs, showing that channels revealing similar feature information have functional overlap and that most channels within each such similarity group can be removed with minor impact on the network's representational power. Experimental results well support this intuition. In the future, we will explore extending our approach to more general neural network models, such as RNNs and GNNs.
A. Implementation
Implementation details. On the CIFAR dataset, we use a variant of VGG16 [liu2017learning] and a 3-stage pre-activation ResNet [he2016identity]. On the ImageNet dataset, a 4-stage ResNet-50 with bottleneck structure [resnet] is adopted. Our algorithm is implemented in PyTorch [pytorch]. For NS, we use its original PyTorch implementation. For SSL and RDF, we re-implement them in PyTorch following their original papers. For TN, PFGM, CP, and SFP, we directly take their acceleration results from the literature.
Pruning details. While channel-level pruning is straightforward to realize for single-branch CNN models like AlexNet [alexnet] and VGGNet, special care is required for more sophisticated architectures with cross-layer connections, such as ResNet and DenseNet [densenet], to ensure the consistency of forward and backward propagation. To achieve this while not compromising the flexibility of pruning over certain layers, we exploit the strategy of inserting a channel selection layer in each residual block, as proposed in [liu2017learning].
B. Proof
Proposition 3.
Assume that the activations belonging to the same channel are i.i.d. and that any two activations from two different channels are mutually independent. For any two channels in the same convolutional layer, the distance between them converges in probability to the following value:
(9)  $d(A_i, A_j) \xrightarrow{P} (\mu_i - \mu_j)^2 + \sigma_i^2 + \sigma_j^2 \quad \text{as } NHW \to \infty$
Proof.
The proof of this proposition is a direct application of the weak law of large numbers. Since the activations within the same feature channel are i.i.d., we can denote those of channel $i$ by random variables $X_i^{(n,h,w)}$, each distributed as $X_i$ with mean $\mu_i$ and variance $\sigma_i^2$. In addition, if we define $Z^{(n,h,w)} = \left( X_i^{(n,h,w)} - X_j^{(n,h,w)} \right)^2$, where $X_i$ and $X_j$ are mutually independent, then we have:

(10)  $d(A_i, A_j) = \frac{1}{NHW} \sum_{n=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} Z^{(n,h,w)}$
The random variables $Z^{(n,h,w)}$ are i.i.d. with finite expectation $\mathbb{E}[Z]$. We can thus apply the weak law of large numbers to Equation 10:

(11)  $\frac{1}{NHW} \sum_{n,h,w} Z^{(n,h,w)} \xrightarrow{P} \mathbb{E}[Z] \quad \text{as } NHW \to \infty$
Note that $X_i$ and $X_j$ are mutually independent; therefore the r.h.s. of Equation 9 can be derived using the first two moments (mean and variance) of $X_i$ and $X_j$ as follows:

(12)  $\mathbb{E}[Z] = \mathbb{E}\left[ (X_i - X_j)^2 \right] = \mathrm{Var}(X_i - X_j) + \left( \mathbb{E}[X_i - X_j] \right)^2 = (\mu_i - \mu_j)^2 + \sigma_i^2 + \sigma_j^2$
This concludes the proof of Proposition 3. ∎
Proposition 4.
For each feature channel $A_k^{l+1}$ in the $(l+1)$-th convolutional layer, the distance shift caused by removing the feature channel $A_i^l$ from the $l$-th convolutional layer, as defined in Equation 7, admits the following upper bound:
(13)  $d\!\left(A_k^{l+1}, \hat{A}_k^{l+1}\right) \le \alpha_k \, s^2 \, \lVert K_{k,i} \rVert_F^2 \, \min_{j \neq i} d\!\left(A_i^{l}, A_j^{l}\right)$

where $\alpha_k$ is made explicit in Equation 18 below and $s \times s$ corresponds to the size of each kernel matrix $K_{k,i}$.
Proof.
For an arbitrary feature channel $A_k^{l+1}$ in the $(l+1)$-th convolutional layer, let $P_i^{(n,h,w)}$ and $P_j^{(n,h,w)}$ denote the image patches of $A_i^l$ and $A_j^l$ centered at position $(h, w)$ of the $n$-th sample. Note that $P_i^{(n,h,w)}$ is convolved with $K_{k,i}$ to generate the activation in the $h$-th row and $w$-th column of the $k$-th output feature channel. We transform the l.h.s. of Equation 13 as follows:

(14)  $d\!\left(A_k^{l+1}, \hat{A}_k^{l+1}\right) = \frac{1}{N H_{l+1} W_{l+1}} \sum_{n,h,w} \left( \sigma\!\left(\mathrm{BN}\!\left(u^{(n,h,w)}\right)\right) - \sigma\!\left(\mathrm{BN}\!\left(\hat{u}^{(n,h,w)}\right)\right) \right)^2, \qquad u^{(n,h,w)} - \hat{u}^{(n,h,w)} = \left\langle K_{k,i},\; P_i^{(n,h,w)} - P_j^{(n,h,w)} \right\rangle$

where $u^{(n,h,w)}$ and $\hat{u}^{(n,h,w)}$ denote the pre-BN activations before and after pruning; by Equations 5 and 6, their difference reduces to the single term involving $K_{k,i}$.
Note that the linear convolution between $K_{k,i}$ and $P_i^{(n,h,w)} - P_j^{(n,h,w)}$, both of size $s \times s$, is equivalent to their inner product. Applying the Cauchy-Schwarz inequality to Equation 14 gives:

(15)  $\frac{1}{N H_{l+1} W_{l+1}} \sum_{n,h,w} \left( u^{(n,h,w)} - \hat{u}^{(n,h,w)} \right)^2 \le \lVert K_{k,i} \rVert_F^2 \cdot \frac{1}{N H_{l+1} W_{l+1}} \sum_{n,h,w} \left\lVert P_i^{(n,h,w)} - P_j^{(n,h,w)} \right\rVert_F^2 \le s^2 \, \lVert K_{k,i} \rVert_F^2 \cdot \frac{H_l W_l}{H_{l+1} W_{l+1}} \, d\!\left(A_i^l, A_j^l\right)$
The last inequality stems from the fact that each activation of $A_i^l - A_j^l$ appears at most $s^2$ times in the convolution operation. Actually, most activations participate exactly $s^2$ times, except for those lying near the border of the feature map, which appear less often.
For the two most popular choices of non-linear activation functions in modern CNN architectures, sigmoid and ReLU, we have the extra property that $0 \le \sigma'(x) \le 1$ and hence $|\sigma(x) - \sigma(y)| \le |x - y|$, based on which we can easily derive the following inequality:

(16)  $\left( \sigma\!\left(\mathrm{BN}(x)\right) - \sigma\!\left(\mathrm{BN}(y)\right) \right)^2 \le \left( \mathrm{BN}(x) - \mathrm{BN}(y) \right)^2 = \frac{\gamma_k^2}{\sigma_{\mathcal{B}}^2 + \epsilon} \, (x - y)^2$
Based on the conclusion of Equation 16, we transform the inequality in Equation 15 as follows:

(17)  $d\!\left(A_k^{l+1}, \hat{A}_k^{l+1}\right) \le \frac{\gamma_k^2}{\sigma_{\mathcal{B}}^2 + \epsilon} \cdot s^2 \, \lVert K_{k,i} \rVert_F^2 \cdot \frac{H_l W_l}{H_{l+1} W_{l+1}} \, d\!\left(A_i^l, A_j^l\right)$
Since $A_j^l$ can be whichever feature channel in the $l$-th convolutional layer other than $A_i^l$, we can further narrow the upper bound by taking the minimum over all $j \neq i$:

(18)  $d\!\left(A_k^{l+1}, \hat{A}_k^{l+1}\right) \le \alpha_k \, s^2 \, \lVert K_{k,i} \rVert_F^2 \, \min_{j \neq i} d\!\left(A_i^l, A_j^l\right)$

where $\alpha_k = \frac{\gamma_k^2}{\sigma_{\mathcal{B}}^2 + \epsilon} \cdot \frac{H_l W_l}{H_{l+1} W_{l+1}}$ and $H_l, W_l$ (resp. $H_{l+1}, W_{l+1}$) denote the spatial dimensions of the feature maps in layer $l$ (resp. layer $l+1$).
This concludes the proof of Proposition 4. ∎
C. Pseudo code
Pseudo code of the proposed channel-similarity-based pruning algorithm.
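A minimal Python sketch of the full procedure, following the five-step workflow of Section 3.4 (ours; the average-linkage choice is an assumption):

```python
import torch
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def similarity_pruning_plan(model: torch.nn.Module, threshold: float) -> dict:
    """For each BN layer, cluster channels by Equation 4 and keep one per cluster."""
    plan = {}
    for name, m in model.named_modules():
        if not isinstance(m, torch.nn.BatchNorm2d):
            continue
        gamma, beta = m.weight.detach(), m.bias.detach()
        # Step 1: channel distance matrix (Equation 4), zero diagonal.
        d = (beta[:, None] - beta[None, :]) ** 2 + gamma[:, None] ** 2 + gamma[None, :] ** 2
        d.fill_diagonal_(0.0)
        # Step 2: normalize to [0, 1] so one global threshold works across layers.
        d = (d - d.min()) / (d.max() - d.min() + 1e-12)
        # Step 3: hierarchical clustering with the global threshold.
        labels = fcluster(linkage(squareform(d.numpy(), checks=False), method="average"),
                          t=threshold, criterion="distance")
        # Step 4: per cluster, keep the channel with the largest scaling factor.
        keep = []
        for c in set(labels):
            members = [i for i, lab in enumerate(labels) if lab == c]
            keep.append(max(members, key=lambda i: gamma[i].abs().item()))
        plan[name] = sorted(keep)
    return plan
# Step 5: physically remove the other channels, then fine-tune the pruned model.
```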
D. Visualization
We visualize the channel distance matrices computed via the activation-value-based approach (averaged over 20 trials using randomly sampled image batches) and those estimated via the proposed probabilistic approach for a number of convolutional layers of VGG16 on CIFAR-10 in Figure 6. Their absolute difference is also shown for a more visually accessible illustration.