Exploiting Channel Similarity for Accelerating Deep Convolutional Neural Networks

08/06/2019
by   Yunxiang Zhang, et al.
Shanghai Jiao Tong University

To address the limitations of existing magnitude-based pruning algorithms in cases where model weights or activations are of large and similar magnitude, we propose a novel perspective to discover parameter redundancy among channels and accelerate deep CNNs via channel pruning. Precisely, we argue that channels revealing similar feature information have functional overlap and that most channels within each such similarity group can be removed without compromising the model's representational power. After deriving an effective metric for evaluating channel similarity through probabilistic modeling, we introduce a pruning algorithm via hierarchical clustering of channels. In particular, the proposed algorithm does not rely on sparsity training techniques or complex data-driven optimization and can be directly applied to pre-trained models. Extensive experiments on benchmark datasets strongly demonstrate the superior acceleration performance of our approach over prior art. On ImageNet, our pruned ResNet-50 with 30% of FLOPs removed matches the accuracy of the original model.


1 Introduction

Ever since the renaissance of convolutional neural networks (CNNs) a decade ago, the general trend has been to push them deeper and deeper to achieve better performance [alexnet; vgg; resnet], along with a drastic increase in their computation and storage requirements. However, it is no easy task to deploy these unprecedentedly large models in practical applications. Additionally, recent works [denil2013predicting; ba2014do] have further confirmed the over-parameterization problem in deep CNNs, revealing that these powerful models do not necessarily have to be so cumbersome.

To bridge the gap between the limited computational resources and the superior performance of deep CNNs, various model acceleration algorithms have been proposed, including network pruning [liu2017learning; he2017channel], low-rank factorization [jaderberg2014speeding; denton2014exploiting], network quantization [courbariaux2016binarized; rastegari2016xnor], and knowledge distillation [hinton2015distilling; romero2014fitnets], among which network pruning is of active research interest.

Despite their favorable performance in certain cases, current magnitude-based approaches to network pruning have some inherent limitations and cannot guarantee stable behavior in general situations. As pointed out in [he2019pruning], magnitude-based methods rely on two preconditions to achieve satisfactory performance, i.e. large magnitude deviation and small minimum magnitude. Without the guarantee of these two prerequisites, such methods are prone to mistakenly removing weights that are crucial to the network's performance. Unfortunately, these requirements are not always met, and we empirically observe that the performance of magnitude-based pruning algorithms deteriorates rapidly under those adverse circumstances.

In order to address this important issue and derive a more general approach to accelerating deep CNNs, we propose to discover parameter redundancy among feature channels from a novel perspective. To be precise, we argue that channels revealing similar feature representations have functional overlap and that most channels within each such similarity group can be discarded without compromising the network's representational power. The proposed similarity-based approach is more general in that it remains applicable beyond the restricted scenarios of its magnitude-based counterpart, as will be demonstrated through our experiments. Figure 1 shows a graphical illustration of our motivation.

Our contributions can be summarized as follows:


  • Propose a novel perspective, i.e. similarity-based approach, for channel-level pruning.

  • Introduce an effective metric for evaluating channel similarity via probabilistic modelling.

  • Develop an efficient channel pruning algorithm via hierarchical clustering based on this metric.

2 Related work

Non-structured pruning. Earlier works on network pruning [lecun1990optimal; hassibi1993second] remove redundant parameters by analyzing the Hessian matrix of the loss function. Several magnitude-based methods [han2015learning; han2015deep; guo2016dynamic] drop network weights with insignificant values to perform model compression. [dai2018compressing; molchanov2017] leverage Bayesian theory to achieve more interpretable pruning. Similar to our approach, [srinivas2015data; mariet2015diversity] also exploit the notion of similarity for reducing parameter redundancy. However, they only consider fully-connected layers and obtain limited acceleration. [son2018clustering] introduces k-means clustering to extract the centroids of kernels and achieves computation reduction via kernel sharing.

Structured pruning. To avoid the requirement of specialized hardware and libraries for sparse matrix operations, various channel-level pruning algorithms have been proposed. [wen2016learning; lebedev2016fast; huang2018data; zhou2016less] impose carefully designed sparsity constraints on network weights during training to facilitate the removal of redundant channels. These algorithms rely on training networks from scratch with sparsity and cannot be directly applied to accelerating pre-trained ones. [he2017channel; luo2017thinet; zhuang2018discrimination] transform network pruning into an optimization problem. Analogous to our work, [roychowdhury2017reducing] investigates the presence of duplicate filters in neural networks and finds that some of them can be reduced without impairing the network's performance.

Neural architecture search. While compact CNN models [zhang2018shufflenet; howard2017mobilenets] were mostly designed in a hand-crafted manner, neural architecture search has also shown its potential in discovering efficient models. Some methods leverage reinforcement learning [baker2017designing; barret2017neural] or evolutionary algorithms [real2017large; liu2018hierarchical] to conduct architecture search in discrete spaces, while others [luo2018neural; liu2018darts] perform the optimization in continuous ones.

Figure 1: Magnitude-based approach vs similarity-based approach to channel pruning.

3 Network pruning via channel similarity

3.1 Channel similarity

Let us denote the activations of an arbitrary convolutional layer by $A \in \mathbb{R}^{N \times C \times H \times W}$, where $H$, $W$, $C$, and $N$ represent the height & width of feature maps, the number of feature channels, and the batch size, respectively. Each channel can thus be represented by a 3-dimensional tensor $A_i \in \mathbb{R}^{N \times H \times W}$, with $i$ its corresponding channel index.

Intuitively, if we want to quantify the similarity between two feature channels, the mean squared error (MSE) provides a simple yet effective metric, which is commonly adopted to evaluate to what extent a tensor deviates from another one. Therefore, using the previously introduced notations, the distance, or dissimilarity, between two feature channels $A_i$ and $A_j$ can be defined as follows:

$$d\big(A_i, A_j\big) \;=\; \frac{1}{NHW} \sum_{n=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} \Big(A_i^{(n,h,w)} - A_j^{(n,h,w)}\Big)^2 \qquad (1)$$

However, naively computing the channel distance in this manner has two obvious limitations. First, this definition is not consistent across different data batches, as the activation values depend on the input samples fed to the network. Second, computing the distance between all possible pairs of feature channels via this formulation is computationally inefficient in view of its $O(C^2 NHW)$ time complexity, not to mention the fact that modern CNNs generally contain millions of activations.

To address these issues, we propose a probabilistic approach to estimating the channel distance. Precisely, each activation $A_i^{(n,h,w)}$ is represented by a random variable $X_i$ with mean $\mu_i$ and variance $\sigma_i^2$, where $(h, w)$ and $n$ index the height & width of feature maps and the position in the current batch, respectively.

Proposition 1.

Assume that the activations belonging to the same channel are i.i.d. and that any two activations from two different channels are mutually independent. For any two channels $A_i$ and $A_j$ in the same convolutional layer, the distance between them converges in probability to the following value:

$$d\big(A_i, A_j\big) \;\xrightarrow{\;P\;}\; \big(\mu_i - \mu_j\big)^2 + \sigma_i^2 + \sigma_j^2 \qquad (2)$$
Proof.

Detailed derivation can be found in Appendix B of the supplementary material. ∎

The proposition above provides a nice approximation to the distance between two feature channels under our probabilistic setting, since $NHW$ generally reaches up to tens of thousands in modern CNNs. While the assumption of mutual independence may appear rather strong at first glance, we empirically validate its reasonableness in Section 4.3 by showing that the proposed probabilistic approach induces no bias to channel distance estimation compared to its naive activation-value-based counterpart.

Now if we re-examine the time complexity of this novel formulation for computing the channel distance, it drops down to $O(C^2)$ per layer, since only the per-channel means and variances are involved. In real-life applications, for a batch of 64 typical high-resolution images (e.g. smartphone photos), the saving in computing time can be considerable, especially for lower layers. In short, this probabilistic approach is scalable to all kinds of application scenarios, given the fact that the number of channels rarely exceeds 1000 in existing CNN architectures. Now that we have instantiated the notion of channel similarity via the proposed channel distance (if the distance between two channels $A_i$ and $A_j$ is smaller than that between $A_i$ and another channel $A_k$, then $A_i$ is considered more similar to $A_j$ than to $A_k$), the next step is to find an efficient way to derive the statistical information of the activations in each feature channel.
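To make Proposition 1 concrete, the short sketch below (synthetic Gaussian channels in PyTorch; all shapes, means, and variances are illustrative choices of ours, not taken from the paper) compares the naive activation-value-based distance of Equation 1 with the moment-based estimate of Equation 2:

```python
import torch

torch.manual_seed(0)

# Two synthetic channels with known statistics (batch N = 64, feature maps 32 x 32).
N, H, W = 64, 32, 32
mu_i, sigma_i = 0.5, 1.2
mu_j, sigma_j = -0.3, 0.8
A_i = mu_i + sigma_i * torch.randn(N, H, W)
A_j = mu_j + sigma_j * torch.randn(N, H, W)

# Equation 1: naive activation-value-based distance (mean squared error over the batch).
d_naive = ((A_i - A_j) ** 2).mean().item()

# Equation 2: probabilistic estimate that only needs the first two moments of each channel.
d_prob = (mu_i - mu_j) ** 2 + sigma_i ** 2 + sigma_j ** 2

print(f"naive: {d_naive:.3f}   probabilistic: {d_prob:.3f}")  # the two values should nearly coincide
```

With $NHW = 65{,}536$ activations per channel, the empirical MSE should land very close to the closed-form value $(0.5+0.3)^2 + 1.2^2 + 0.8^2 = 2.72$, which is exactly the convergence behavior the proposition describes.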

3.2 Channel similarity via batch normalization

Batch normalization (BN) [batchnorm] has been introduced to enable faster and more stable training of deep CNNs and is now an indispensable component in deep learning. The composite of three consecutive operations, linear convolution, batch normalization, and rectified linear unit (ReLU) [relu], is widely adopted as the building block in state-of-the-art deep CNN architectures [resnet; densenet].

The way BN normalizes the activations within a feature channel motivates us to base our probabilistic settings on the statistical information in the outputs of BN layers. In particular, BN normalizes the activations using mini-batch statistics, which perfectly matches our definition of channel distance.

Let us denote the $i$-th input and output feature maps of a BN layer by $x_i$ and $y_i$, respectively. For an arbitrary mini-batch $\mathcal{B}$, this BN layer performs the following transformation:

$$\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, \qquad y_i = \gamma_i\, \hat{x}_i + \beta_i \qquad (3)$$

where $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}^2$ denote the mini-batch mean and variance, and $\gamma_i$ and $\beta_i$ are the two trainable parameters of an affine transformation that helps to restore the representational power of the network. $\epsilon$ is a small positive number added to the mini-batch variance for numerical stability.

Given the fact that the overwhelming majority of modern CNN architectures adopt the convention of inserting a BN layer after each convolutional layer, it is very convenient for us to directly leverage the statistical information provided by BN layers for channel distance estimation. Under the probabilistic setting established in Section 3.1, the batch-normalized activations of the $i$-th feature channel are i.i.d. random variables with mean $\beta_i$ and variance $\gamma_i^2$, i.e. $Y_i \sim (\beta_i, \gamma_i^2)$. Consequently, we can straightforwardly compute the distance between two batch-normalized feature channels in the same convolutional layer as follows:

$$d\big(Y_i, Y_j\big) \;=\; \big(\beta_i - \beta_j\big)^2 + \gamma_i^2 + \gamma_j^2 \qquad (4)$$
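As a quick numerical illustration with hypothetical BN parameters, take two channels with $(\beta_i, \gamma_i) = (0.5, 1.0)$ and $(\beta_j, \gamma_j) = (0.1, 0.8)$; Equation 4 then gives

$$d(Y_i, Y_j) = (0.5 - 0.1)^2 + 1.0^2 + 0.8^2 = 0.16 + 1.00 + 0.64 = 1.80,$$

so no forward pass over data is needed: the distance is read off directly from the layer's affine parameters.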

3.3 From similarity to redundancy

Existing magnitude-based pruning algorithms rely on the argument that removing parameters with relatively insignificant values incurs little impact on the network's performance. However, as pointed out in [lecun1990optimal; hassibi1993second], this intuitive idea does not seem to be theoretically well-founded. In contrast, we derive theoretical support to justify the reasonableness of our similarity-based pruning approach. In particular, we show that the removal of an arbitrary feature channel will not impair the network's representational power in a dramatic way, as long as there exists another channel that is sufficiently similar to the removed one and can be exploited as a substitute.

Let us consider two consecutive convolutional layers $l$ and $l+1$, with activations $A^{l} \in \mathbb{R}^{N \times C_l \times H_l \times W_l}$ and $A^{l+1} \in \mathbb{R}^{N \times C_{l+1} \times H_{l+1} \times W_{l+1}}$ using similar notations as before. Suppose that batch normalization and a non-linear activation $f$ are applied after each linear convolution; then, for every output channel $j$, we have:

$$A^{l+1}_j \;=\; f\Big(\mathrm{BN}\Big(\sum_{i=1}^{C_l} W^{l+1}_{j,i} * A^{l}_i\Big)\Big) \qquad (5)$$

where $W^{l+1}_{j,i}$ represents the $i$-th kernel matrix in the $j$-th convolutional filter and $*$ denotes the convolution operation. Note that BN nullifies the effect of bias vectors in convolutional layers, hence they are omitted in the formulation above.

Inspired by [srinivas2015data], we explore and analyze to what extent the activations of layer $l+1$ will be shifted if we remove a feature channel from layer $l$ and compensate the consequent loss of representational power by exploiting the channels in layer $l$ that are similar to the removed one. Suppose that $A^{l}_p$ and $A^{l}_q$ are two similar feature channels in layer $l$; now if we remove the former and properly update the kernel matrices corresponding to the latter in an attempt to minimize the resulting performance decay, then for each feature channel $\tilde{A}^{l+1}_j$ after pruning, we have:

$$\tilde{A}^{l+1}_j \;=\; f\Big(\mathrm{BN}\Big(\sum_{i \neq p} \tilde{W}^{l+1}_{j,i} * A^{l}_i\Big)\Big), \qquad \tilde{W}^{l+1}_{j,q} = W^{l+1}_{j,q} + W^{l+1}_{j,p}, \quad \tilde{W}^{l+1}_{j,i} = W^{l+1}_{j,i} \;\; (i \neq q) \qquad (6)$$

Note that we replace the kernel matrix $W^{l+1}_{j,q}$ by $W^{l+1}_{j,q} + W^{l+1}_{j,p}$, which is a simple and intuitive way to compensate the loss of representational power resulting from pruning $A^{l}_p$. Computing the distance between $A^{l+1}_j$ and $\tilde{A}^{l+1}_j$ using Equation 1 gives:

$$\Delta_j \;=\; d\big(A^{l+1}_j, \tilde{A}^{l+1}_j\big) \;=\; \frac{1}{N H_{l+1} W_{l+1}} \Big\| f\big(\mathrm{BN}(Z_j)\big) - f\big(\mathrm{BN}(Z_j - E_j)\big) \Big\|_2^2 \qquad (7)$$

where $Z_j = \sum_{i=1}^{C_l} W^{l+1}_{j,i} * A^{l}_i$ and $E_j = W^{l+1}_{j,p} * \big(A^{l}_p - A^{l}_q\big)$.

Proposition 2.

For each feature channel $A^{l+1}_j$ in the $(l+1)$-th convolutional layer, the distance shift $\Delta_j$ caused by removing the feature channel $A^{l}_p$ from the $l$-th convolutional layer, as defined in Equation 7, admits the following upper bound:

$$\Delta_j \;\le\; C_j \cdot \min_{q \neq p}\, d\big(A^{l}_p, A^{l}_q\big) \qquad (8)$$

where $C_j = \dfrac{\gamma_j^2}{\sigma_{\mathcal{B}}^2 + \epsilon} \cdot \dfrac{H_l W_l}{H_{l+1} W_{l+1}} \cdot k^2\, \big\|W^{l+1}_{j,p}\big\|_2^2$ and $k \times k$ corresponds to the size of each kernel matrix $W^{l+1}_{j,i}$.

Proof.

Detailed derivation can be found in Appendix B of the supplementary material. ∎

Through the conclusion of Proposition 2, we can notice that $C_j$ depends on the size of the feature channels as well as the size and norm of the kernel matrix. In practice, this coefficient is typically small, which means that the removal of a feature channel results in a rather limited shift of the next layer's activations, and hence has little impact on the network's representational power, as long as there exist adequately similar channels to replace its function. Note that the above-mentioned one-to-one substitution strategy is mainly introduced for theoretical purposes. In practical applications, it is clearly a sub-optimal solution from an overall perspective; for instance, we could exploit several channels, i.e. a linear combination, to substitute for the pruned one. In our implementation, the retained kernel matrices after pruning are adaptively updated by gradient-descent-based optimization rather than manually computed in a fixed way, as this automatic approach normally yields a superior updating strategy. Experimental results reflect the effectiveness of this choice.
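To make the one-to-one substitution of Equation 6 concrete, here is a minimal sketch in PyTorch (the function name is ours and the snippet is illustrative, not the authors' released code); it only touches the next layer's weights, assumes a standard ungrouped convolution, and omits the corresponding surgery on the preceding convolution and BN layers:

```python
import torch

@torch.no_grad()
def merge_and_drop_input_channel(next_conv: torch.nn.Conv2d, p: int, q: int) -> torch.nn.Conv2d:
    """Fold W_{j,p} into W_{j,q} for every output channel j, then drop input channel p."""
    # Assumes groups == 1, so the weight has shape (C_out, C_in, k, k).
    w = next_conv.weight.data.clone()
    w[:, q] += w[:, p]                                  # Equation 6: W_{j,q} <- W_{j,q} + W_{j,p}
    keep = [i for i in range(w.shape[1]) if i != p]     # drop the pruned input channel p
    next_conv.weight = torch.nn.Parameter(w[:, keep])
    next_conv.in_channels = len(keep)
    return next_conv
```

As noted above, such a manual merge mainly serves the theoretical argument; in practice the retained kernels are simply fine-tuned after pruning.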

3.4 Similarity-based pruning via hierarchical clustering

Now that we have properly defined our metric of channel similarity and have theoretically demonstrated the feasibility of our similarity-based pruning approach, it is natural to resort to clustering algorithms for the subsequent pruning process. Precisely, we want to group the channels within each convolutional layer into similarity clusters based on its channel distance matrix, i.e. the square matrix that gathers the distance between all possible pairs of channels within the layer. Given the results of clustering, we only need to retain one representative channel per cluster, as the representation information provided by the others is highly similar and hence redundant. Here we select the channel with the largest BN scaling factor $\gamma$ to enable a faster and easier fine-tuning process after the pruning operation.

To this end, hierarchical clustering (HC) is adopted as our agglomerative method. Compared with other popular clustering algorithms, HC better suits our needs as it requires only one hyper-parameter, i.e. the threshold distance $t$. In particular, this threshold allows us to simultaneously control the clustering results, and hence the pruning ratios, of all layers with a single global parameter. This property is crucial to discovering efficient and compact CNN models, as the target architecture is automatically determined by the pruning algorithm [liu2018rethinking]. In order to render the distance values comparable across all layers and make our one-shot pruning more stable, we further normalize the values of each distance matrix to $[0, 1]$ before feeding them to the clustering algorithm. Empirical results in Section 4.4 showcase the importance of this step.

The overall workflow of our similarity-based pruning algorithm can be summarized as follows:

  • Construct a channel distance matrix for each convolutional layer using Equation 4.

  • Normalize each channel distance matrix: $D \leftarrow \big(D - \min(D)\big) / \big(\max(D) - \min(D)\big)$, so that its values lie in $[0, 1]$.

  • Perform hierarchical clustering of channels with the global threshold $t$ on each convolutional layer, using its normalized channel distance matrix obtained in step 2.

  • For each cluster, retain the channel with the largest BN scaling factor $\gamma$ and remove the others.

  • Fine-tune the resulting pruned model for a few epochs to restore performance.

Corresponding pseudo code can be found in Appendix C of the supplementary material.
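As a concrete illustration of steps 1-4 of the workflow above, the following sketch derives a per-layer keep-list from the BN parameters of a pre-trained model. It is not the authors' released code: the function name, the use of SciPy's hierarchical clustering, and the choice of "average" linkage are our own assumptions (in torch.nn.BatchNorm2d, .weight holds $\gamma$ and .bias holds $\beta$).

```python
import torch
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def channels_to_keep(bn_layer: torch.nn.BatchNorm2d, threshold: float) -> list:
    """Minimal sketch of steps 1-4 of the workflow above for one conv+BN block."""
    beta = bn_layer.bias.detach()     # beta_i: per-channel mean after BN
    gamma = bn_layer.weight.detach()  # gamma_i: per-channel scale after BN
    # Step 1 -- Equation 4: d_ij = (beta_i - beta_j)^2 + gamma_i^2 + gamma_j^2
    dist = (beta[:, None] - beta[None, :]) ** 2 + gamma[:, None] ** 2 + gamma[None, :] ** 2
    dist.fill_diagonal_(0.0)          # a channel has zero distance to itself
    # Step 2 -- min-max normalization of the distance matrix to [0, 1]
    dist = (dist - dist.min()) / (dist.max() - dist.min() + 1e-12)
    # Step 3 -- hierarchical clustering with the global threshold t
    condensed = squareform(dist.cpu().numpy(), checks=False)
    clusters = fcluster(linkage(condensed, method="average"),
                        t=threshold, criterion="distance")
    # Step 4 -- keep one representative channel per cluster (largest BN scale here)
    keep = []
    for c in set(clusters.tolist()):
        members = [i for i, ci in enumerate(clusters) if ci == c]
        keep.append(max(members, key=lambda i: gamma[i].abs().item()))
    return sorted(keep)
```

A smaller threshold yields finer clusters and hence a lower pruning ratio; after the keep-lists of all layers are collected, the pruned model is fine-tuned as in step 5.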

(a) Comparison of acceleration performance on CIFAR.

Dataset    Model       Method                           Accuracy   FLOPs     Pruned
CIFAR-10   VGG-16      Base                             93.39%     627.36M   –
                       Ours ()                          92.71%     182.31M   70.94%
                       NS [liu2017learning]             92.33%     182.47M   70.92%
                       SSL [wen2016learning]            91.65%     248.94M   60.32%
                       RDF [roychowdhury2017reducing]   91.84%     314.33M   49.90%
           ResNet-110  Base                             94.65%     341.23M   –
                       Ours ()                          94.25%     141.27M   58.60%
                       NS [liu2017learning]             93.57%     141.77M   58.45%
                       PFGM [he2019pruning]             93.73%     177.38M   48.02%
                       SFP [he2018soft]                 93.38%     217.38M   36.30%
CIFAR-100  VGG-16      Base                             72.10%     627.45M   –
                       Ours ()                          71.22%     334.16M   46.74%
                       NS [liu2017learning]             68.82%     441.89M   29.57%
                       SSL [wen2016learning]            68.80%     374.10M   40.38%
                       RDF [roychowdhury2017reducing]   69.63%     388.53M   38.08%
           ResNet-110  Base                             75.62%     341.28M   –
                       Ours ()                          74.56%     153.73M   54.96%
                       NS [liu2017learning]             73.67%     153.96M   54.89%
                       SSL [wen2016learning]            71.28%     161.95M   52.55%
                       RDF [roychowdhury2017reducing]   71.52%     190.03M   44.32%

(b) Comparison of acceleration performance on ImageNet for ResNet-50.

Dataset    Method                  Top-1    Top-5    FLOPs   Pruned
ImageNet   Ours ()                 -0.80%   -0.34%   4.44B   45.90%
           TN [luo2017thinet]      -3.26%   -1.53%   5.23B   36.27%
           NS [liu2017learning]    -2.16%   -1.17%   4.49B   45.37%
           CP [he2017channel]      -3.25%   –        5.49B   33.13%
           PFGM [he2019pruning]    -1.32%   -0.55%   4.46B   45.74%
           SFP [he2018soft]        -1.54%   -0.81%   5.23B   36.27%

(c) Wall-clock time saving of pruned ResNet-50 on ImageNet.

Dataset    Method    Top-1    Top-5    FLOPs   FLOPs Pruned   Time     Time Pruned
ImageNet   Ours ()   0.08%    0.07%    1472B   29.97%         0.429s   24.74%
           Ours ()   -0.48%   -0.19%   1262B   39.99%         0.391s   31.40%
           Ours ()   -1.15%   -0.57%   1052B   49.99%         0.362s   36.67%
           Ours ()   -2.66%   -1.43%   839B    60.03%         0.331s   41.93%

Table 1: Acceleration results on the CIFAR and ImageNet datasets. "Base" represents the uncompressed baseline model obtained via normal training, the value in parentheses following "Ours" denotes the threshold used for controlling the pruning ratio, and "–" means that the result is meaningless or not available. In the "Automatic" column, "✓" and "✗" indicate whether the target compact architecture is automatically discovered by the pruning algorithm or manually pre-defined. In the "Pre-train" column, "✓" and "✗" indicate whether the algorithm can be directly applied to pruning pre-trained models or not. On CIFAR, we report the accuracy on the test set. On ImageNet, single-view evaluation of the Top-1 and Top-5 accuracy drop is reported to accommodate the discrepancy among different deep learning frameworks. The overall FLOPs and wall-clock time in (c) correspond to a data batch of size 256.

4 Experiments and analysis

We empirically demonstrate the effectiveness of our similarity-based channel pruning algorithm on two representative CNN architectures, VGGNet [vgg] and ResNet [resnet]. Results are reported on the benchmark datasets CIFAR [cifar] and ImageNet [imagenet]. Several state-of-the-art channel pruning algorithms are introduced for performance comparison, including SSL [wen2016learning], RDF [roychowdhury2017reducing], PFGM [he2019pruning], TN [luo2017thinet], CP [he2017channel], NS [liu2017learning], and SFP [he2018soft].

Implementation details. For all experiments on the CIFAR-10 and CIFAR-100 datasets, we fine-tune the pruned models with the SGD optimizer and a batch size of 64 for 60 epochs. The learning rate starts at 0.01 and is divided by 10 at epochs 25 and 50 for ResNet and at epochs 30 and 50 for VGGNet. On ImageNet, all training settings are kept the same except for a batch size of 256. Weight decay and a Nesterov momentum of 0.9 are utilized. More details can be found in Appendix A.

Acceleration metric. We adopt FLOPs as the acceleration metric to evaluate all pruning algorithms in our experiments, as it plays a decisive role in the network's inference speed. Different from most previous works, when computing overall FLOPs we count all floating point operations that take place during the inference phase, not only those related to convolution operations, since non-tensor layers (e.g. BN, ReLU, and pooling layers) also occupy a considerable part of inference time on GPU [luo2017thinet]. Additionally, each tensor multiply-add counts as two FLOPs. Reported results of all baseline methods are computed using their publicly available configuration files under our settings.
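For reference, under this convention (multiply-adds counted as two FLOPs, biases nullified by BN) the contribution of a single standard, ungrouped convolutional layer per input sample can be written as

$$\mathrm{FLOPs}_{\mathrm{conv}} \;=\; 2 \cdot C_{\mathrm{out}} \cdot C_{\mathrm{in}} \cdot k^2 \cdot H_{\mathrm{out}} \cdot W_{\mathrm{out}},$$

with the element-wise operations of non-tensor layers such as BN, ReLU, and pooling added on top of this total; this formula is only an illustration of the counting convention, not the exact accounting script used for the tables.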

Figure 2: Accuracy-acceleration curve on CIFAR dataset. Best viewed in color.
Figure 3: Left: Performance comparison between activation-value-based approach and its probabilistic counterpart. For each batch size, 20 trials are conducted using randomly sampled batches. Blue line and shaded region represent the mean accuracy and accuracy range of the former. The accuracy of the latter is added for comparison (dotted red line). Right: Channel distance matrix obtained using Equation 1 and averaged over 20 trials of batch size 256 (leftmost), using Equation 4 (middle), and their absolute difference (rightmost) for VGG-16 on CIFAR-10. Best viewed in color.

4.1 VGGNet and ResNet on CIFAR

We first evaluate the acceleration performance of our algorithm on the CIFAR datasets. The results are summarized in Table 1. Compared with the other algorithms, ours consistently achieves better accuracy while reducing more FLOPs. To showcase the superior performance of our approach under different pruning ratios, we further plot the test accuracy as a function of the proportion of pruned FLOPs in Figure 2. While the other methods rival ours in accuracy under relatively low pruning ratios, they begin to experience severe performance decay as the ratio goes up. In contrast, our algorithm exhibits very stable behavior in all architecture-dataset settings and maintains decent performance even under extremely aggressive pruning ratios.

4.2 ResNet on ImageNet

We then apply our algorithm to pruning ResNet-50 on ImageNet-2012 to validate its effectiveness on large-scale datasets. As shown in Table 1, our algorithm outperforms all 5 competitors by a notable margin in terms of Top-1 and Top-5 accuracy drop, achieving a 45.90% reduction of FLOPs at the cost of only a 0.34% drop in Top-5 accuracy. We further investigate how our algorithm extends to varying pruning ratios and how this theoretical reduction of computation translates into realistic acceleration on modern GPUs. The results presented in Table 1 (c) are obtained using PyTorch [pytorch] and cuDNN v7.1 on a TITAN X Pascal GPU. The gap between FLOPs saving and time saving can be attributed to the fact that non-computing operations, e.g. IO queries and buffer switches, also influence inference speed.

4.3 Channel distance estimation

As discussed in Section 3.1, calculating the channel distance via the activation-value-based approach, i.e. using Equation 1, is data-dependent and inconsistent across different data batches. To elaborate on this point, we study the performance of pruned VGG-16 models obtained using randomly sampled batches. As illustrated in Figure 3, while the activation-based approach matches the probabilistic approach in mean accuracy, its performance fluctuates significantly across different input samples and exhibits a highly unstable pattern. In addition, this instability gets worse with higher pruning ratios and smaller batch sizes. In contrast, the probabilistic approach does not rely on input data and consistently achieves satisfactory performance. This result strongly confirms the effectiveness and necessity of the proposed probabilistic approach to channel distance estimation.

In order to validate that our probabilistic approach provides a good approximation to the mathematical expectation of its activation-based counterpart and that the assumption of Proposition 1 introduces no bias into the estimation process, we further compare the distance matrix computed using Equation 1 (averaged over 20 trials) with that computed using Equation 4. As we can see from Figure 3, there exists only a minor discrepancy between the mean distance matrix obtained via the activation-based approach and the distance matrix estimated via the probabilistic approach, which effectively supports our claim. This observation also explains why the two approaches achieve similar mean accuracy. More visualization results can be found in Appendix D.

4.4 Normalized channel distance matrix

To explore the effect of normalizing the channel distance matrix before performing HC, we visualize the distribution of values within both the normalized matrix and the unnormalized one in Figure 4. From the visualization results, we observe that, without distance normalization, the value distribution is rather lopsided across different layers. This phenomenon results in an imbalanced target architecture when all layers are pruned using a global threshold, i.e. some upper layers are reduced to a single channel. As we can see from Figure 4, the normalization operation effectively alleviates this dilemma, enabling automatic search of the target architecture while ensuring efficient propagation of feature information through the network. The comparison of accuracy curves between the two cases corroborates this point.

4.5 Performance analysis

We further look into the reasons for the superior performance of our algorithm over the other automatic pruning algorithms in finding efficient and compact architectures. As shown in Figure 5, NS and SSL prune very aggressively in upper layers, making it extremely hard to propagate the feature information extracted in lower layers up to the classification layer. Moreover, upper layers account for very limited FLOPs, as the feature maps there have already been down-sampled several times and are thus of smaller size, which explains why our algorithm retains more channels while pruning the same amount of FLOPs.

Figure 4: Left: Distribution of values in both the unnormalized distance matrix and the normalized one for VGG-16 on CIFAR-10. Gray area shows the distance values below the pruning threshold, i.e. those pairs of channels that should be grouped into clusters. Right: Performance comparison between the two cases on CIFAR-10. Best viewed in color.
Figure 5: Architecture of pruned VGG-16 model with 80% reduction of FLOPs on CIFAR-10 dataset. Two automatic pruning algorithms, SSL and NS, are introduced for comparison. Best viewed in color.

5 Conclusion

We propose a novel perspective to perform channel-level pruning and accelerate deep CNNs by showing that channels revealing similar feature information have functional overlap and that most channels within each such similarity group can be removed with minor impact on the network's representational power. Experimental results strongly support this intuition. In the future, we plan to extend our approach to more general neural network models, such as RNNs and GNNs.


A.   Implementation

Implementation details. On the CIFAR datasets, we make use of a variant of VGG-16 [liu2017learning] and a 3-stage pre-activation ResNet [he2016identity]. On ImageNet, a 4-stage ResNet-50 with bottleneck structure [resnet] is adopted. Our algorithm is implemented in PyTorch [pytorch]. For NS, we use its original PyTorch implementation. For SSL and RDF, we re-implement them in PyTorch by following their original papers. For TN, PFGM, CP, and SFP, we directly take their acceleration results from the literature.

Pruning details. While channel-level pruning is straightforward to realize for single-branch CNN models like AlexNet [alexnet] and VGGNet, special care is required for more sophisticated architectures with cross-layer connections, such as ResNet and DenseNet [densenet], to ensure the consistency of forward and backward propagation. To achieve this while not compromising the flexibility of the pruning operation over certain layers, we adopt the strategy of inserting a channel selection layer in each residual block, as proposed in [liu2017learning].

B.   Proof

Proposition 3.

Assume that the activations belonging to the same channel are i.i.d. and that any two activations from two different channels are mutually independent. For any two channels $A_i$ and $A_j$ in the same convolutional layer, the distance between them converges in probability to the following value:

$$d\big(A_i, A_j\big) \;\xrightarrow{\;P\;}\; \big(\mu_i - \mu_j\big)^2 + \sigma_i^2 + \sigma_j^2 \qquad (9)$$
Proof.

The proof of this proposition is a direct application of the weak law of large numbers. Since the activations within the same feature channel are i.i.d., we can denote by $X_i \sim (\mu_i, \sigma_i^2)$ the common distribution of the activations in channel $A_i$. In addition, if we define $Z^{(n,h,w)} = \big(A_i^{(n,h,w)} - A_j^{(n,h,w)}\big)^2$, where $A_i^{(n,h,w)}$ and $A_j^{(n,h,w)}$ are mutually independent and distributed as $X_i$ and $X_j$ respectively, then we have:

$$d\big(A_i, A_j\big) \;=\; \frac{1}{NHW} \sum_{n=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} Z^{(n,h,w)} \qquad (10)$$

The random variables $Z^{(n,h,w)}$ are i.i.d. and $\mathbb{E}\big[Z^{(n,h,w)}\big] = \mathbb{E}\big[(X_i - X_j)^2\big] < \infty$. We can thus apply the weak law of large numbers to Equation 10:

$$\frac{1}{NHW} \sum_{n=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} Z^{(n,h,w)} \;\xrightarrow{\;P\;}\; \mathbb{E}\big[(X_i - X_j)^2\big] \qquad (11)$$

Note that $X_i$ and $X_j$ are mutually independent; therefore the r.h.s. of Equation 9 can be derived by using the first two moments (mean and variance) of $X_i$ and $X_j$ as follows:

$$\mathbb{E}\big[(X_i - X_j)^2\big] \;=\; \mathrm{Var}\big(X_i - X_j\big) + \big(\mathbb{E}[X_i - X_j]\big)^2 \;=\; \sigma_i^2 + \sigma_j^2 + \big(\mu_i - \mu_j\big)^2 \qquad (12)$$

This concludes the proof of Proposition 3. ∎


Proposition 4.

For each feature channel $A^{l+1}_j$ in the $(l+1)$-th convolutional layer, the distance shift $\Delta_j$ caused by removing the feature channel $A^{l}_p$ from the $l$-th convolutional layer, as defined in Equation 7, admits the following upper bound:

$$\Delta_j \;\le\; C_j \cdot \min_{q \neq p}\, d\big(A^{l}_p, A^{l}_q\big) \qquad (13)$$

where $C_j = \dfrac{\gamma_j^2}{\sigma_{\mathcal{B}}^2 + \epsilon} \cdot \dfrac{H_l W_l}{H_{l+1} W_{l+1}} \cdot k^2\, \big\|W^{l+1}_{j,p}\big\|_2^2$ and $k \times k$ corresponds to the size of each kernel matrix $W^{l+1}_{j,i}$.

Proof.

For an arbitrary feature channel $A^{l+1}_j$ in the $(l+1)$-th convolutional layer, let $P_p^{(n,h,w)}$ and $P_q^{(n,h,w)}$ denote the image patches of channels $A^{l}_p$ and $A^{l}_q$ centered at position $(h, w)$ of the $n$-th sample. Note that $P_p^{(n,h,w)}$ is convolved with $W^{l+1}_{j,p}$ to contribute to the pre-normalization activation $z_j^{(n,h,w)}$ in the $h$-th row and $w$-th column of the $j$-th output feature channel. We transform the l.h.s. of Equation 13 as follows:

$$\Delta_j \;=\; \frac{1}{N H_{l+1} W_{l+1}} \sum_{n,h,w} \Big( f\big(\mathrm{BN}(z_j^{(n,h,w)})\big) - f\big(\mathrm{BN}\big(z_j^{(n,h,w)} - \big\langle W^{l+1}_{j,p},\, P_p^{(n,h,w)} - P_q^{(n,h,w)} \big\rangle\big)\big) \Big)^2 \qquad (14)$$

Note that the linear convolution between $W^{l+1}_{j,p}$ and $P_p^{(n,h,w)} - P_q^{(n,h,w)}$, both of size $k \times k$, is equivalent to their inner product. Applying the Cauchy-Schwarz inequality to these inner-product terms in Equation 14 gives:

$$\sum_{n,h,w} \big\langle W^{l+1}_{j,p},\, P_p^{(n,h,w)} - P_q^{(n,h,w)} \big\rangle^2 \;\le\; \big\|W^{l+1}_{j,p}\big\|_2^2 \sum_{n,h,w} \big\|P_p^{(n,h,w)} - P_q^{(n,h,w)}\big\|_2^2 \;\le\; k^2\, \big\|W^{l+1}_{j,p}\big\|_2^2\, \big\|A^{l}_p - A^{l}_q\big\|_2^2 \qquad (15)$$

The last inequality stems from the fact that each activation of $A^{l}_p - A^{l}_q$ appears at most $k^2$ times in the convolution operation. Actually, most activations of $A^{l}_p - A^{l}_q$ participate exactly $k^2$ times in the convolution operation, except for those lying near the border of the feature maps, which appear less often.

For the two most popular choices of non-linear activation functions in modern CNN architectures, sigmoid and ReLU, we have the extra property that $f$ is monotonically non-decreasing and that $|f(a) - f(b)| \le |a - b|$. Since BN (with fixed mini-batch statistics) is an affine map with scale $\gamma_j / \sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}$ on channel $j$, we can easily derive the following inequality:

$$\Big( f\big(\mathrm{BN}(a)\big) - f\big(\mathrm{BN}(b)\big) \Big)^2 \;\le\; \big(\mathrm{BN}(a) - \mathrm{BN}(b)\big)^2 \;=\; \frac{\gamma_j^2}{\sigma_{\mathcal{B}}^2 + \epsilon}\, (a - b)^2 \qquad (16)$$

Based on the conclusion of Equation 16, we can combine it with the inequality in Equation 15 to bound Equation 14 as follows:

$$\Delta_j \;\le\; \frac{\gamma_j^2}{\sigma_{\mathcal{B}}^2 + \epsilon} \cdot \frac{k^2\, \big\|W^{l+1}_{j,p}\big\|_2^2}{N H_{l+1} W_{l+1}}\, \big\|A^{l}_p - A^{l}_q\big\|_2^2 \;=\; \frac{\gamma_j^2}{\sigma_{\mathcal{B}}^2 + \epsilon} \cdot \frac{H_l W_l}{H_{l+1} W_{l+1}}\, k^2\, \big\|W^{l+1}_{j,p}\big\|_2^2\; d\big(A^{l}_p, A^{l}_q\big) \qquad (17)$$

Since $A^{l}_q$ can be whichever feature channel in the $l$-th convolutional layer, we can further narrow the upper bound by taking the minimum over all $q \neq p$:

$$\Delta_j \;\le\; C_j \cdot \min_{q \neq p}\, d\big(A^{l}_p, A^{l}_q\big) \qquad (18)$$

where $C_j = \dfrac{\gamma_j^2}{\sigma_{\mathcal{B}}^2 + \epsilon} \cdot \dfrac{H_l W_l}{H_{l+1} W_{l+1}} \cdot k^2\, \big\|W^{l+1}_{j,p}\big\|_2^2$ and $d(\cdot, \cdot)$ denotes the channel distance of Equation 1 (resp. Equation 9 under the probabilistic setting).

This concludes the proof of Proposition 4. ∎

C.   Pseudo code

Pseudo code of the proposed channel-similarity-based pruning algorithm.

  Input: Threshold distance $t$ and the feature channels $\{A^{l}_1, \dots, A^{l}_{C_l}\}$ of every convolutional layer $l = 1, \dots, L$ before pruning
  Output: Retained channel sets $S^{l}$, $l = 1, \dots, L$
  for $l = 1$ to $L$ do
     Initialize $D^{l} \in \mathbb{R}^{C_l \times C_l}$ and $S^{l} \leftarrow \emptyset$
     for $i = 1$ to $C_l$ do
        for $j = 1$ to $C_l$ do
           $D^{l}_{ij} \leftarrow (\beta_i - \beta_j)^2 + \gamma_i^2 + \gamma_j^2$                   # Equation 4 of the main paper
        end for
     end for
     $d^{l}_{\min} \leftarrow \min(D^{l})$ and $d^{l}_{\max} \leftarrow \max(D^{l})$
     Normalization: $D^{l} \leftarrow (D^{l} - d^{l}_{\min}) / (d^{l}_{\max} - d^{l}_{\min})$
     Perform hierarchical clustering with threshold $t$ on layer $l$ using $D^{l}$ to obtain clusters $G^{l}_1, \dots, G^{l}_{m_l}$
     for $k = 1$ to $m_l$ do
        $i^{*} \leftarrow \arg\max_{i \in G^{l}_k} \gamma_i$
        $S^{l} \leftarrow S^{l} \cup \{i^{*}\}$
     end for
  end for
Algorithm 1 Channel-similarity-based pruning via hierarchical clustering of feature channels

D.   Visualization

We visualize the channel distance matrix computed via the activation-value-based approach (averaged over 20 trials using randomly sampled image batches) and that estimated via the proposed probabilistic approach for a number of convolutional layers of VGG-16 on CIFAR-10 in Figure 6. Their absolute difference is added for more visually accessible illustration.

Figure 6: Visualization of the channel distance matrix computed using different approaches. The leftmost figure of each triplet shows the matrix obtained via activation-value-based approach (averaged over 20 trials of batch size 256), the middle shows that estimated via probabilistic approach, while the rightmost shows their absolute difference. Best viewed in color.