In recent years, Deep Neural Networks (DNNs) have achieved significant progress in most computer vision tasks, e.g., classification and object detection. However, along with the boost in accuracy, network architectures have become deeper and wider, which leads to much higher computational requirements for inference. For example, the popular ResNet-152 has 60.2 million parameters with a 232 MB model size, and it needs 11.3 billion floating-point operations for one forward pass. Such a huge resource cost makes it hard to deploy a typical deep model on resource-constrained devices, e.g., mobile phones or embedded gadgets. Driven by the demand for edge deployment, the study of deep neural network compression has attracted considerable attention from both academia and industry.
In the 1990s, LeCun et al.  first observed that if some weights have little influence on the final decision, they can be pruned with only a slight loss of accuracy. Pruning is one of the most popular methods for model compression, since it does not alter the original network structure. Recently, many pruning-based works have been published. Han et al.  pruned unimportant connections for model compression (weight pruning). He et al.  and Luo et al.  both utilized filter pruning: the former applied a two-step algorithm to realize channel pruning for deep neural network acceleration, and the latter proposed a novel pruning mechanism in which whether a filter can be pruned depends on the outputs of its next layer.
Intuitively, the filters of a convolutional neural network are not independent. Although different filters try to learn features from different perspectives, some of them may share potential similarities or connections. Therefore, we believe the filters in a network can be divided into several groups, i.e., clustered. In contrast to the filter pruning method in , which considered each filter in isolation, we propose a novel spectral clustering filter pruning method for CNN compression with soft self-adaption manners. First, we cluster filters into groups with spectral clustering; then we rank the groups by their contributions (weights) to the final decision. Filters in a group are pruned if they have low influence. Furthermore, inspired by , we combine our proposed method with soft filter pruning, where the pruned filters are updated iteratively during training to retain accuracy.
Contribution. The contributions of this paper are as follows. (1) Considering the connections among filters, we propose a novel spectral clustering filter pruning method for CNN compression, which prunes a group of related but redundant filters instead of individual filters. (2) Our method preserves the correlations between filters through spectral clustering; therefore, it can train a model from scratch with fast convergence, and the model achieves comparable performance with much fewer parameters. (3) Experiments on two datasets (i.e., MNIST and CIFAR-10) demonstrate the effectiveness of our approach.
2 Related Works
2.1 Spectral Clustering
Spectral clustering algorithms attempt to partition data points into groups such that points within a group are similar to each other and dissimilar to points outside the group. In essence, spectral clustering converts the clustering problem into a graph partitioning problem. Compared with traditional clustering algorithms (e.g., k-means and mixture models), it often yields better results and converges faster. Given some data points, the basic spectral clustering algorithm can be formulated as below:
Calculate the degree matrix $D$ and the adjacency matrix $W$;
Calculate the Laplacian matrix $L = D - W$;
Cluster the rows of the eigenvector matrix of $L$ using k-means.
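The three steps above can be sketched in plain NumPy. This is a minimal illustration, not the implementation used in the paper; the deterministic farthest-first seeding for k-means is our own simplification.

```python
import numpy as np

def spectral_clustering(adjacency, n_clusters, n_iter=100):
    """Minimal spectral clustering following the three steps above."""
    # Step 1: degree matrix D from the row sums of the adjacency matrix W.
    degrees = adjacency.sum(axis=1)
    # Step 2: unnormalized Laplacian L = D - W.
    laplacian = np.diag(degrees) - adjacency
    # Step 3: eigendecompose L and cluster the rows of the eigenvector matrix.
    _, eigvecs = np.linalg.eigh(laplacian)       # eigenvalues in ascending order
    embedding = eigvecs[:, :n_clusters]          # smallest-eigenvalue eigenvectors
    # Plain k-means on the rows, with deterministic farthest-first seeding.
    centers = [embedding[0]]
    for _ in range(1, n_clusters):
        dist = np.min([((embedding - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(embedding[int(np.argmax(dist))])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((embedding[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(n_clusters):
            if (labels == k).any():
                centers[k] = embedding[labels == k].mean(axis=0)
    return labels
```

On a graph with two disconnected cliques, the two zero-eigenvalue eigenvectors are constant within each component, so the row embedding separates the components exactly.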
Shi et al.  first applied the spectral clustering algorithm to the image segmentation task. They formulated the problem from a graph-theoretic perspective and introduced the normalized cut to segment the image. After that, many works [8, 10, 21] based on spectral clustering started to show promising ability in computer vision. Yang et al.  presented a novel kernel fuzzy similarity measure for image segmentation, which uses the membership distribution in the partition matrix obtained by KFCM to reduce the sensitivity of the scaling parameter. To overcome the common scalability challenge in image segmentation, Tung et al.  proposed a method that first performs an over-segmentation of the image at the pixel level using spectral clustering, and then merges the segments using a combination of stochastic ensemble consensus and a second round of spectral clustering at the segment level.
2.2 Model Compression and Acceleration
Previous approaches to model compression and acceleration can be roughly divided into three categories: parameter pruning, low-rank factorization, and knowledge distillation. The parameters of the convolutional or fully-connected layers in a typical CNN may have large redundancy. Based on this observation, some works [22, 4, 11, 27, 23, 1, 19] aimed to compress models by training a new compact neural network that could be on par with a large model. For example,  followed a student-teacher network to realize compact network training, i.e., knowledge distillation. As the starting point of our work, parameter pruning [3, 12, 29, 25, 5] focuses on removing weights or filters that are redundant or unimportant to the final performance. Molchanov et al.  introduced a Taylor expansion to approximate the change in the cost function induced by removing unnecessary network parameters. Li et al.  presented an acceleration method that prunes filters, together with their connected feature maps, which make a small contribution to the output. Yu et al.  applied feature ranking techniques to measure the importance of responses in the second-to-last layer and propagated the importance of neurons backwards to earlier layers; the nodes with low propagated importance are then pruned. Different from hard filter pruning, He et al.  proposed soft filter pruning, which allows the pruned filters to be updated during the training procedure. Liu et al.  imposed L1 regularization on the scaling factors in batch normalization (BN) layers to identify insignificant channels (or neurons), which are pruned in a following step.
In contrast to these existing works, we apply spectral clustering to model compression. The intuition is that filters in a network are not independent; they can be divided into several groups (e.g., by spectral clustering) based on their potential similarities or connections.
The schematic overview of our framework is summarized in Alg. 1 and Alg. 2. We introduce the notations and assumptions before going into the details of how to perform spectral clustering on network filters and how to train the model with soft self-adaption manners.
Deep neural networks can be parameterized by trainable variables. We denote the ConvNet model as $\{W^{(i)} \in \mathbb{R}^{H_i \times W_i \times C_i \times N_i}, 1 \le i \le L\}$, where $W^{(i)}$ is the weight tensor of the $i$-th convolutional layer, connecting layer $i$ and layer $i+1$. Here $H_i$, $W_i$, $C_i$, and $N_i$ are the kernel height, kernel width, the number of input channels, and the number of output channels (filters) of the layer, respectively, and $L$ is the depth of the network. This formulation can easily be adapted to fully connected layers, so we omit the details. In order to perform spectral clustering on the filters, we reshape them into a new weight matrix of dimension $N_i \times (H_i W_i C_i)$, where $N_i$ is the number of filters and $H_i W_i C_i$ is the number of weights (attributes) of a single filter.
For the parameters used in our strategy, we denote the bandwidth parameter of the radial basis function as $\sigma$, the pruning rate of the $i$-th layer as $P_i$, and the number of clustering groups as $k_i$. The reason we can cluster filters into groups and prune them is the assumption that every filter learns a different aspect of the target, while some filters are trivial for the target, because the “black box” nature of neural networks inadvertently introduces some repeated effort.
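As a concrete illustration of the reshape step, the following sketch uses a hypothetical kernel in TensorFlow layout (height, width, in-channels, out-channels); the shapes are examples, not the paper's actual networks.

```python
import numpy as np

# Hypothetical 3x3 conv kernel with C_i = 16 input and N_i = 32 output channels,
# in TensorFlow layout (H_i, W_i, C_i, N_i).
kernel = np.random.randn(3, 3, 16, 32)

# Reshape into the N_i x (H_i * W_i * C_i) filter matrix: one row per filter,
# so each filter becomes a single data point for spectral clustering.
n_filters = kernel.shape[-1]
filter_matrix = kernel.reshape(-1, n_filters).T

assert filter_matrix.shape == (32, 3 * 3 * 16)
# Row n is exactly the flattened n-th filter.
assert np.allclose(filter_matrix[5], kernel[:, :, :, 5].ravel())
```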
3.2 Spectral Clustering Filter Pruning with Soft Self-adaption manners
In previous filter pruning methods for model compression, once filters are pruned they are never updated again, which means the model capacity drops dramatically. In our framework, by contrast, we update the pruned filters iteratively to retain accuracy at the same time. Meanwhile, other model compression frameworks often reformulate the objective function by combining the L1 norm and L2 norm to obtain a compact model. In , a novel objective function yields exclusive sparsity and group sparsity, but its definition of the clustering groups of filters is too heuristic. To this end, we propose a reasonable and reliable manner to obtain the groups of filters.
3.2.1 Spectral Clustering for Filter Pruning.
Spectral clustering has become one of the most popular modern clustering algorithms in recent years. Compared with traditional algorithms such as k-means and single linkage, it has many fundamental advantages. When it comes to filter pruning, however, it is necessary to modify some of its default settings.
Cosine similarity matrix.
When analyzing high-dimensional data, especially with hundreds of attributes, Euclidean distance suffers from the ‘curse of dimensionality’. Specifically, Euclidean distance can produce a large distance between two vectors that are very similar in all attributes except one. Hence, cosine distance is recommended for deep neural network filter pruning. As summarized in Algorithm 1, we first compute the pairwise cosine similarities of filters from the same layer to form a similarity matrix. To ensure correctness, we eliminate zero filters and apply a small trick to avoid division by zero. Without ambiguity, we still use $F^{(i)}$ to denote the filter matrix participating in the pruning process. Specifically,
$$d^{(i)}_{jk} = 1 - \frac{F^{(i)}_j \cdot F^{(i)}_k}{\|F^{(i)}_j\|_2 \, \|F^{(i)}_k\|_2},$$
where $d^{(i)}_{jk}$ is the cosine distance between two filters $j$ and $k$ in the $i$-th layer, and $F^{(i)}_k$ is the $k$-th row (filter) of the $i$-th layer's reshaped filter matrix $F^{(i)}$. Subtracting the cosine similarity from 1 fits the natural definition of a distance: identical filters have distance 0, and more different filters have larger values. We then apply the standard radial basis transformation to the cosine distance, $A^{(i)}_{jk} = \exp\!\big(-(d^{(i)}_{jk})^2 / (2\sigma^2)\big)$, and call the transformed matrix $A^{(i)}$ the adjacency matrix.
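A sketch of this radial-basis-on-cosine-distance affinity. The function name and the `eps` guard against zero norms are our own; `sigma` is the bandwidth parameter of the radial basis function.

```python
import numpy as np

def rbf_cosine_affinity(filters, sigma=1.0, eps=1e-12):
    """Affinity matrix from pairwise cosine distances of filters (rows).

    Zero filters should be eliminated beforehand; `eps` is the small trick
    that avoids division by zero mentioned in the text.
    """
    norms = np.linalg.norm(filters, axis=1, keepdims=True)
    unit = filters / np.maximum(norms, eps)
    cos_sim = unit @ unit.T                  # cosine similarity in [-1, 1]
    cos_dist = 1.0 - cos_sim                 # identical filters -> distance 0
    # Radial basis transformation of the cosine distance into an affinity.
    return np.exp(-cos_dist ** 2 / (2.0 * sigma ** 2))
```

Parallel filters get affinity 1 regardless of magnitude, which is exactly why cosine distance is preferred over Euclidean distance here.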
K-means on the feature matrix. We then compose the normalized Laplacian matrix based on $A^{(i)}$. Specifically, we compute a diagonal degree matrix $D^{(i)}$ whose diagonal elements are the column sums of $A^{(i)}$, and obtain the unnormalized Laplacian $L^{(i)} = D^{(i)} - A^{(i)}$. Finally, we exploit the eigendecomposition of the normalized Laplacian matrix to obtain a feature matrix $E^{(i)}$ for clustering. To reduce the dimension of $E^{(i)}$, we can choose the $k$ eigenvectors corresponding to the $k$ largest eigenvalues to compose a new feature matrix; without ambiguity, we still use $E^{(i)}$ to denote the result. In practice, we often skip the dimension-reduction step to retain all the information of the filters; whether to perform it depends on the task. Specifically,
We then apply the k-means algorithm to the feature matrix $E^{(i)}$ to obtain the filter group labels, denoted as $G^{(i)}$, where each element of $G^{(i)}$ belongs to $[1, 2, \ldots, k_i]$. We minimize the following objective function:
$$\min \sum_{j} \min_{k} \big\|E^{(i)}_j - \mu_k\big\|_2^2,$$
where $E^{(i)}_j$ is the $j$-th row of $E^{(i)}$ and $\mu_k$ is a cluster center, $k \in [1, 2, \ldots, k_i]$.
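The Laplacian construction and k-means step can be sketched as follows. Note that this sketch keeps the standard choice of the eigenvectors with the smallest eigenvalues of the normalized Laplacian (keeping the full eigenvector matrix would make all rows equidistant); the farthest-first seeding is our own simplification.

```python
import numpy as np

def cluster_filters(affinity, n_groups, n_iter=50):
    """Group labels for filters from the normalized Laplacian of `affinity`."""
    d = affinity.sum(axis=0)                       # column sums -> degree matrix D
    lap = np.diag(d) - affinity                    # unnormalized Laplacian L = D - A
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    lap_sym = d_inv_sqrt @ lap @ d_inv_sqrt        # normalized Laplacian D^{-1/2} L D^{-1/2}
    _, eigvecs = np.linalg.eigh(lap_sym)           # eigenvalues in ascending order
    E = eigvecs[:, :n_groups]                      # feature matrix for k-means
    # k-means on the rows of E with deterministic farthest-first seeding.
    centers = [E[0]]
    for _ in range(1, n_groups):
        dist = np.min([((E - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(E[int(np.argmax(dist))])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((E[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(n_groups):
            if (labels == k).any():
                centers[k] = E[labels == k].mean(axis=0)
    return labels
```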
3.2.2 Filter Pruning with Soft Self-adaption Manners
As mentioned before, filter pruning with soft manners preserves the capacity of the model without significantly increasing the computational complexity, and speeds up training. In this section, we go into the details of how to apply soft manners to our filter pruning.
Filter selection. After the spectral clustering operation on $F^{(i)}$, we obtain the group label of each filter in the $i$-th layer, naturally separating the filters into $k_i$ groups. According to $G^{(i)}$, we obtain a group set $\{g_1, \ldots, g_{k_i}\}$, where $g_k$ contains the filters with group label $k$. We evaluate the importance of each group by its norm: intuitively, the convolutional output of a filter with a larger norm leads to relatively larger activation values, and thus has more numerical impact on the target. To this end, we define the group effect in the $i$-th layer, denoted $GE^{(i)}_k$, to indicate the total contribution of a group to the final target. After comparing the group effects of the different groups, we choose $\lceil P_i k_i \rceil$ groups to prune, where $P_i$ is the pruning rate in the $i$-th layer relative to the number of groups. Empirically, we use the L2 norm:
$$GE^{(i)}_k = \sum_{F^{(i)}_j \in g_k} \big\|F^{(i)}_j\big\|_2 = \sum_{F^{(i)}_j \in g_k} \sqrt{\sum_{m} \big(F^{(i)}_{j,m}\big)^2},$$
where $F^{(i)}_{j,m}$ is the $m$-th weight of the $j$-th filter in layer $i$.
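A sketch of the selection step. The sum-of-norms reading of the group effect and the ceiling of the pruned-group count are our interpretation of the description above.

```python
import numpy as np

def select_groups_to_prune(filter_matrix, labels, prune_rate):
    """Rank groups by their L2-norm group effect and return the weakest ones."""
    groups = np.unique(labels)
    # Group effect: sum of the L2 norms of the filters (rows) in each group.
    effects = np.array([np.linalg.norm(filter_matrix[labels == g], axis=1).sum()
                        for g in groups])
    n_prune = int(np.ceil(prune_rate * len(groups)))   # prune a P_i fraction of groups
    order = np.argsort(effects)                        # ascending: weakest groups first
    return groups[order[:n_prune]]
```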
Filter pruning. In the filter selection above, we rank the group effects and choose some groups to be pruned together. In practice, we use the same pruning rate for all layers, yet for different layers we still prune different numbers of filters. The reason we choose different numbers of cluster groups for different layers is that layers differ greatly in their filter counts, which means different numbers of features will be learned. Our heuristic is to choose the group number proportional to the number of classes in the classification task. In addition, we treat convolutional layers and fully connected layers differently; for example, in some classification tasks we do not prune the last layer, where each filter represents all learned features for one class. The choice of the number of cluster groups and the treatment of different layers depend on the task.
Why can we simply prune several filters at the same time? Intuitively, because the model has large capacity and the pruned filters will be updated in the next epoch. As a result, the model becomes more flexible, generalizes better, and retains the same accuracy.
Reconstruction. After pruning filters, we allow them to recover in the following epochs. However, how should we choose the frequency of pruning? Specifically, how many epochs are appropriate between every two pruning operations, which we call the ‘pruning gap’? If the pruning gap is small, the model is pruned frequently but the filters do not have enough time to recover; consequently, almost the same filters are pruned in the following pruning steps. With a large pruning gap, the filters have enough time to recover, but the model is not as compact as expected. Hence, there is a balance between accuracy and compactness. Empirically, a pruning gap of one or two epochs is enough; meanwhile, one can reserve two epochs for recovery before ending the training process.
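The soft manner and the pruning gap can be sketched in a minimal training skeleton. The `update` function is a stand-in for one epoch of SGD, and zeroing the single weakest filter is a stub for the full clustering-and-ranking pipeline; none of this is the paper's actual training code.

```python
import numpy as np

def soft_prune(weights, pruned_ids):
    """Soft pruning: zero the selected filters (rows) instead of removing them,
    so later gradient updates can still recover them."""
    out = weights.copy()
    out[pruned_ids] = 0.0
    return out

def train(weights, epochs=6, pruning_gap=2, update=lambda w: w + 0.01):
    """Skeleton with a pruning step every `pruning_gap` epochs and the last
    two epochs reserved for recovery, as suggested in the text."""
    w = weights
    for epoch in range(epochs):
        w = update(w)                              # stand-in for one epoch of SGD
        last_prune_epoch = epochs - 2              # leave two epochs to recover
        if (epoch + 1) % pruning_gap == 0 and epoch < last_prune_epoch:
            weakest = int(np.argmin(np.linalg.norm(w, axis=1)))
            w = soft_prune(w, [weakest])           # stub for pruning weak groups
    return w
```

Because pruning only zeroes weights, the filter shapes never change, and every filter ends training with non-zero weights after the recovery epochs.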
In this section, we evaluate our proposed approach on several classification datasets with some popular base networks.
4.1 Datasets and Settings
MNIST. A classical handwritten digit dataset containing grayscale images, with a training set of 60,000 examples and a test set of 10,000 examples. The 5-layer convolutional neural network (LeNet-5) for handwritten digit classification opened a new gate for deep learning.
CIFAR-10. A widely used object classification dataset in computer vision. It consists of 60,000 32x32 color images in 10 classes (e.g., cat, automobile, airplane), with 6,000 images per class. The dataset is divided into five training batches and one test batch, each with 10,000 images. The test batch contains exactly 1,000 randomly selected images from each class; the training batches contain the remaining images in random order, with about 5,000 images from each class.
Our models are implemented in the TensorFlow framework. For MNIST, we adopt two convolutional layers followed by two fully connected layers, named LeNet-4, as the base network, and train it for 20 epochs with a constant learning rate of 0.07. We evaluate our method with five different pruning ratios. Besides, the number of spectral clustering groups is related to the number of filters participating in the filter pruning operation. Note that we do not prune the last layer in this paper. For CIFAR-10, we use LeNet-5 and ResNet-32 as the base network structures. All models are trained for 200 epochs with a multi-step learning rate policy (0.003 for the first 75 epochs, 0.0005 for the following 75 epochs, and 0.00025 for the rest). Because of its compact and simple structure, we only prune the second and third layers of LeNet-5, whereas we prune all convolutional layers of ResNet-32.
4.2 Evaluation Protocol
FLOPs. Floating-point operations (FLOPs) are a widely used measure of model computational complexity. We follow the criterion in ; the formulation is described below.
For convolutional layers:
$$\mathrm{FLOPs} = H \times W \times k^2 \times C_{in} \times C_{out},$$
where $H$ and $W$ are the height and width of the input feature map, $k$ is the size of the kernel, and $C_{in}$ and $C_{out}$ are the numbers of input and output channels, respectively.
For fully connected layers:
$$\mathrm{FLOPs} = C_{in} \times C_{out},$$
where $C_{in}$ and $C_{out}$ are the numbers of input and output channels, respectively.
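These counting rules translate directly into code. The example values below (a 5x5 conv on a 28x28 map, a 120-to-84 fully connected layer) are illustrative LeNet-style shapes, not figures from the paper.

```python
def conv_flops(h, w, k, c_in, c_out):
    """FLOPs of a convolutional layer: H * W * k^2 * C_in * C_out,
    for an H x W feature map and k x k kernels."""
    return h * w * k * k * c_in * c_out

def fc_flops(c_in, c_out):
    """FLOPs of a fully connected layer: C_in * C_out."""
    return c_in * c_out

# E.g., a first conv layer on 28x28 single-channel inputs with 6 filters.
print(conv_flops(28, 28, 5, 1, 6))   # 117600
print(fc_flops(120, 84))             # 10080
```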
Parameter sparsity. Parameter sparsity is the percentage of parameters actually used over all theoretical parameters, as in the following formulation:
$$\mathrm{Sparsity} = \frac{\sum_{i} \|W^{(i)}\|_0}{\sum_{i} |W^{(i)}|},$$
where $|W^{(i)}|$ denotes the number of all parameters in layer $i$ and $\|W^{(i)}\|_0$ the number of non-zero parameters.
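A sketch of this metric over a list of per-layer weight arrays (reading "used" as "non-zero", consistent with soft pruning that zeroes filters):

```python
import numpy as np

def parameter_sparsity(layer_weights):
    """Percentage of parameters actually used (non-zero) over all parameters."""
    used = sum(int(np.count_nonzero(w)) for w in layer_weights)
    total = sum(int(w.size) for w in layer_weights)
    return 100.0 * used / total
```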
4.3 Results on MNIST
As shown in Table 1, we evaluate our approach with five different pruning ratios and compare it with other state-of-the-art methods, e.g., CGES. First, as the pruning ratio increases, the filters become sparser and fewer parameters are available for the final decision; nevertheless, our approach still achieves considerable performance at every pruning ratio. Second, interestingly, we notice that at certain pruning ratios, where in each training step a fraction of all filter groups is pruned, our model with the remaining parameters even achieves better accuracy than the baseline. This confirms our intuition that, despite the abundance of parameters in networks, they can be grouped by relevance and complementarity; pruning some redundant groups (filters) encourages the others to focus on their responsibility and learn more discriminative features. Third, compared with CGES, our model has better performance and lower FLOPs. Although our parameter sparsity is not as good as that of CGES, we strike a better balance between accuracy and computational complexity.
4.4 Results on CIFAR-10
LeNet-5 base network. We first evaluate our method on the CIFAR-10 dataset with LeNet-5, setting the pruning ratio to 0.4. Table 2 compares our performance with CGES. As shown, our approach achieves state-of-the-art performance: after training, we prune a large fraction of the network's parameters and reduce FLOPs accordingly, yet our model loses no capacity and maintains its final accuracy. CGES, by contrast, sacrifices accuracy merely to decrease parameters. This is thanks to the effectiveness of spectral clustering filter pruning.
ResNet-32 base network. We also apply our pruning method to a deeper network to show its strength. As a deep and powerful network, ResNet is our primary choice; we use ResNet-32 as the base network with the pruning ratio set to 0.4. Results are shown in Table 3. Compared with the baseline, our approach prunes a substantial fraction of the parameters and reduces the computational complexity in FLOPs, while the model loses precision only by a small margin.
Table 1 (columns): Pruned ratio | Baseline Acc (%) | Pruned Acc (%) | Acc Loss (%) | Pruned FLOPs | Param Sp (%)
Table 2 (columns): Layer | Weights (K) | FLOPs (K) | Pruned Weights (%) | Pruned FLOPs (%)
4.5 Ablation Study
How do FLOPs and parameter sparsity change during training? To understand the pruning process, we show the curves of FLOPs and parameter sparsity for five different pruning ratios in Figure 2. Intuitively, both FLOPs and parameter sparsity decrease progressively during training because unimportant filters are removed. However, there are fluctuations in the curves, especially in the right one, for two reasons: (1) we cluster all filters before pruning and remove groups of filters if they are unimportant, so the number of pruned filters depends on how the weights have been updated; (2) we prune filters with soft self-adaption manners, which means the pruned filters are updated iteratively during training. We argue that, because of these two reasons, our approach achieves considerable performance with fewer parameters.
Which filters are more important in each layer? As explained above, if a group of filters is insensitive to the output, it will be pruned, i.e., those filters are unimportant. To analyze the importance of filters in each layer, we run experiments on MNIST with LeNet-4, as shown in Figure 4. Interestingly, we find that the closer a layer is to the output (the last fully-connected layer), the fewer filters are pruned. The shallow layers (e.g., conv1) learn only simple features such as colors and edges, whereas the deeper layers (e.g., fc1) learn more abstract features such as profiles and parts. It is therefore intuitive that some parameters in fc1 are more important than those in conv1. Again, this suggests that applying spectral clustering to filter pruning is effective and interpretable.
In this paper, we introduce a novel filter pruning method, spectral clustering filter pruning with soft self-adaption manners (SCSP), to compress convolutional neural networks. For the first time, we employ spectral clustering on filters layer by layer to explore their intrinsic connections. Experiments show the efficacy of the proposed method.
-  Anoop Korattikara Balan, Vivek Rathod, Kevin P Murphy, and Max Welling. Bayesian dark knowledge. In Advances in Neural Information Processing Systems, pages 3438–3446, 2015.
-  Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282, 2017.
-  Chunhui Jiang, Guiying Li, Chao Qian, and Ke Tang. Efficient DNN neuron pruning by minimizing layer-wise nonlinear reconstruction error. In IJCAI 2018, 2018.
-  Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in neural information processing systems, pages 1269–1277, 2014.
-  Guiying Li, Chao Qian, Chunhui Jiang, Xiaofen Lu, and Ke Tang. Optimization based layer-wise magnitude-based pruning for DNN compression. In IJCAI 2018, 2018.
-  S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In NIPS, 2015.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2015.
-  John R Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe. Deep clustering: Discriminative embeddings for segmentation and separation. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 31–35. IEEE, 2016.
-  Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
-  Jian Hou, Weixue Liu, E Xu, and Hongxia Cui. Towards parameter-independent data clustering and image segmentation. Pattern Recognition, 60:25–36, 2016.
-  Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
-  Jing Zhong, Guiguang Ding, Yuchen Guo, Jungong Han, and Bin Wang. Where to prune: Using LSTM to guide end-to-end pruning. In IJCAI 2018, 2018.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
-  Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in neural information processing systems, pages 598–605, 1990.
-  Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
-  Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2755–2763. IEEE, 2017.
-  Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression. In ICCV, 2017.
-  Ping Luo, Zhenyao Zhu, Ziwei Liu, Xiaogang Wang, Xiaoou Tang, et al. Face model compression by distilling knowledge from neurons. In AAAI, pages 3560–3566, 2016.
-  Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient transfer learning. arXiv preprint arXiv:1611.06440, 2016.
-  Umut Ozertem, Deniz Erdogmus, and Robert Jenssen. Mean shift spectral clustering. Pattern Recognition, 41(6):1924–1938, 2008.
-  Roberto Rigamonti, Amos Sironi, Vincent Lepetit, and Pascal Fua. Learning separable filters. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 2754–2761. IEEE, 2013.
-  Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
-  Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
-  Shaohui Lin, Rongrong Ji, Yuchao Li, Yongjian Wu, Feiyue Huang, and Baochang Zhang. Accelerating convolutional networks via global and dynamic filter pruning. In IJCAI 2018, 2018.
-  Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence, 22(8):888–905, 2000.
-  Cheng Tai, Tong Xiao, Yi Zhang, Xiaogang Wang, et al. Convolutional neural networks with low-rank regularization. arXiv preprint arXiv:1511.06067, 2015.
-  A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE TPAMI, 2008.
-  Frederick Tung and Greg Mori. Clip-q: Deep network compression learning by in-parallel pruning-quantization. In CVPR, 2018.
-  Frederick Tung, Alexander Wong, and David A Clausi. Enabling scalable spectral clustering for image segmentation. Pattern Recognition, 43(12):4069–4076, 2010.
-  Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter pruning for accelerating deep convolutional neural networks. In IJCAI 2018, 2018.
-  Yifang Yang, Yuping Wang, and Xingsi Xue. A novel spectral clustering method with superpixels for image segmentation. In Optik-International Journal for Light and Electron Optics, 2016.
-  Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In ICCV, 2017.
-  Jaehong Yoon and Sung Ju Hwang. Combined group and exclusive sparsity for deep neural networks. In International Conference on Machine Learning, pages 3958–3966, 2017.
-  Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S Davis. Nisp: Pruning networks using neuron importance score propagation. arXiv preprint arXiv:1711.05908, 2017.