SCSP: Spectral Clustering Filter Pruning with Soft Self-adaption Manners

06/14/2018
by   Huiyuan Zhuo, et al.

Deep Convolutional Neural Networks (CNNs) have achieved significant success in the computer vision field. However, the high computational cost of deep, complex models prevents their deployment on edge devices with limited memory and computational resources. In this paper, we propose a novel filter pruning method for convolutional neural network compression, namely spectral clustering filter pruning with soft self-adaption manners (SCSP). We first apply spectral clustering on filters layer by layer to explore their intrinsic connections and only count on the efficient groups. With the self-adaption manners, the pruning operations can be done in a few epochs, letting the network gradually choose meaningful groups. With this strategy, we not only achieve model compression while keeping considerable performance, but also find a novel angle from which to interpret the model compression process.


1 Introduction

In recent years, Deep Neural Networks (DNNs) have achieved significant progress in most computer vision tasks, e.g., classification [13] and object detection [24]. However, with the boost in accuracy, network architectures have become deeper and wider, which leads to much higher computational requirements for inference. For example, the popular network ResNet-152 [7] has 60.2 million parameters with a 232 MB model size, and it needs 11.3 billion floating-point operations for one forward pass. Such a huge resource cost makes it hard to deploy a typical deep model on resource-constrained devices, e.g., mobile phones or embedded gadgets. With the demand for edge deployment, the study of deep neural network compression has received considerable attention from both academia and industry.

In the 1990s, LeCun et al. [15] first observed that if some weights have little influence on the final decision, then we can prune them with slight accuracy loss. Pruning is one of the most popular methods for model compression, since it does not hurt the original network structure. Recently, many works based on pruning have been published. Han et al. [6] pruned non-important connections for model compression (weight pruning). He et al. [33] and Luo et al. [18] both utilized filter pruning: the former applied a two-step algorithm to realize channel pruning for deep neural network acceleration, and the latter proposed a novel pruning mechanism in which whether a filter can be pruned depends on the outputs of its next layer.

Intuitively, the filters of a convolutional neural network are not independent. Although different filters try to learn features from different perspectives, some of these perspectives may have potential similarity or connections. Therefore, we believe that the filters in a network can be divided into several groups, i.e., clustered. Contrary to the filter pruning method in [34], which considered each filter as a group, we propose a novel spectral clustering filter pruning for CNN compression with soft self-adaption manners. First, we cluster filters into groups with spectral clustering, then we rank the groups by their contributions (weights) to the final decision. Some filters in a group will be pruned if they have low influence. Furthermore, inspired by [31], we combine our proposed method with soft filter pruning, where the pruned filters are updated iteratively during training to retain accuracy.

Contribution. The contributions of this paper are as follows. (1) Considering the connections among filters, we propose a novel spectral clustering filter pruning method for CNN compression, which prunes groups of related but redundant filters instead of individual filters independently. (2) Our proposed method provides maximum protection of the correlations between filters through spectral clustering; therefore, our approach can train a model from scratch with fast convergence, while the model achieves comparable performance with much fewer parameters. (3) Experiments on two datasets (i.e., MNIST and CIFAR-10) demonstrate the high effectiveness of our proposed approach.

2 Related Works

2.1 Spectral Clustering

Spectral clustering algorithms attempt to partition data points into groups such that points within a group are similar to each other and dissimilar to points outside the group. In essence, it is a point-to-point clustering method that converts the clustering problem into a graph partitioning problem. Compared with other traditional clustering algorithms (e.g., k-means and mixture models), it often yields better results and converges faster. Given a set of data points, the basic spectral clustering algorithm can be formulated as below:

  1. Calculate the degree matrix $D$ and the adjacency matrix $A$;

  2. Calculate the Laplacian matrix $L = D - A$;

  3. Calculate the first $k$ eigenvectors (corresponding to the first $k$ eigenvalues) of the Laplacian matrix with eigenvalue decomposition, and stack them to form the eigenvector matrix $U$;

  4. Cluster the rows of the eigenvector matrix $U$ using k-means.
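
As an illustration (not the paper's implementation), a minimal NumPy sketch of these four steps is shown below; here the adjacency matrix is built with a Gaussian (RBF) kernel on Euclidean distances, a common default, whereas Section 3 replaces this with a cosine-based similarity for filters:

import numpy as np
from sklearn.cluster import KMeans

def basic_spectral_clustering(X, k, sigma=1.0):
    """Cluster the rows of X into k groups with basic spectral clustering."""
    # Step 1: adjacency matrix from an RBF kernel on pairwise distances,
    # plus the diagonal degree matrix.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    A = np.exp(-dists ** 2 / (2.0 * sigma ** 2))
    D = np.diag(A.sum(axis=1))

    # Step 2: (unnormalized) Laplacian matrix.
    L = D - A

    # Step 3: first k eigenvectors of the Laplacian form the feature matrix.
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    U = eigvecs[:, :k]                     # rows correspond to data points

    # Step 4: k-means on the rows of the feature matrix.
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)

# Example: 20 five-dimensional points clustered into 3 groups.
labels = basic_spectral_clustering(np.random.randn(20, 5), k=3)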

Shi et al. [26] first applied the spectral clustering algorithm to the image segmentation task. They formulated the problem from a graph-theoretic perspective and introduced the normalized cut to segment images. After that, many works [8, 10, 21] based on spectral clustering showed its promising ability in computer vision. Yang et al. [32] presented a novel kernel fuzzy similarity measure for image segmentation, which uses the membership distribution in the partition matrix obtained by KFCM to reduce the sensitivity to the scaling parameter. To overcome the common challenge of scalability in image segmentation, Tung et al. [30] proposed a method which first performs an over-segmentation of the image at the pixel level using spectral clustering, and then merges the segments using a combination of stochastic ensemble consensus and a second round of spectral clustering at the segment level.

2.2 Model Compression and Acceleration

The previous approaches to model compression and acceleration can be roughly divided into three categories [2]: parameter pruning, low-rank factorization, and knowledge distillation. The parameters of the convolutional or fully-connected layers in a typical CNN may contain large redundancy. Based on this observation, some works [22, 4, 11, 27] concentrated on decomposing matrices/tensors to estimate the informative parameters with low-rank factorization. Other studies [23, 1, 19] aimed to compress models by training a new compact neural network that can be on par with a large model; for example, [9] followed a student-teacher framework to train the compact network, i.e., knowledge distillation. As the starting point of our work, parameter pruning [3, 12, 29, 25, 5] focuses on removing weights or filters that are redundant or unimportant to the final performance. Molchanov et al. [20] introduced a Taylor expansion to approximate the change in the cost function induced by removing unnecessary network parameters. Li et al. [16] presented an acceleration method that prunes filters, together with their connected feature maps, which have a small contribution to the output. Yu et al. [35] applied feature ranking techniques to measure the importance of responses in the second-to-last layer and propagated the importance of neurons backwards to earlier layers; the nodes with low propagated importance are then pruned. Different from hard filter pruning, He et al. [31] proposed soft filter pruning, which allows the pruned filters to be updated during the training procedure. Liu et al. [17] imposed L1 regularization on the scaling factors in batch normalization (BN) layers to identify insignificant channels (or neurons), which are pruned in a following step.

In contrast to these existing works, we apply spectral clustering algorithms to model compression. The intuitive idea is that filters in a network are not independent; they can be divided into several groups (via spectral clustering) based on their potential similarity or connections.

Figure 1: SCSP and the training process. In a pruning epoch, we perform SCSP: first, spectral clustering groups the filters (into 3 groups in the picture above), then filter pruning is applied. In the next epoch, we follow the normal update process, and then repeat SCSP in the next pruning epoch.

3 Methodology

The schematic overview of our framework is summarized in Alg. 1 and Alg. 2. We will introduce the notations and assumptions before going into more detail about how to do spectral clustering on network filters and how to train the model with soft self-adaption manners.

3.1 Preliminary

Deep neural networks can be parameterized with trainable variables. We denote the ConvNet model as $\{W^i \in \mathbb{R}^{h_i \times w_i \times c_i \times c_{i+1}},\ 1 \le i \le L\}$, where $W^i$ is the weight tensor of the $i$-th convolutional layer, which connects layer $i$ and layer $i+1$; $h_i$, $w_i$, $c_i$ and $c_{i+1}$ are the kernel height, kernel width, number of input channels and number of output channels of the layer, respectively, and $L$ is the depth of the convolutional network. This formulation can be easily adapted to fully connected layers, so we skip the details. In order to do spectral clustering on the filters, we reshape the filters into a new weight matrix with dimension $c_{i+1} \times (h_i w_i c_i)$, where $c_{i+1}$ is the number of filters and $h_i w_i c_i$ is the number of weights (attributes) of a single filter.
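
As an illustration of this reshaping step, a minimal NumPy sketch is shown below; the (h, w, c_in, c_out) kernel layout is an assumption (TensorFlow's convention), and other frameworks order the axes differently:

import numpy as np

def reshape_filters(kernel):
    """Reshape a conv kernel of shape (h, w, c_in, c_out) into a
    (c_out, h * w * c_in) matrix: one row per filter."""
    h, w, c_in, c_out = kernel.shape
    # Move the filter (out-channel) axis to the front, then flatten each filter.
    return np.transpose(kernel, (3, 0, 1, 2)).reshape(c_out, h * w * c_in)

W = reshape_filters(np.random.randn(3, 3, 16, 32))   # -> shape (32, 144)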

For the parameters used in our strategy, we denote the bandwidth parameter of the radial basis function as $\gamma$, the pruning rate used in the $i$-th layer as $P_i$, and the number of clustering groups in the $i$-th layer as $G_i$. The reason we can cluster the filters into groups and prune them is the assumption that every filter learns a different aspect of the target, while some filters are trivial for the target, because the "black box" nature of neural networks inadvertently brings some repeated effort.

3.2 Spectral Clustering Filter Pruning with Soft Self-adaption manners

In previous filter pruning methods for model compression, once a filter is pruned, it is never updated again, which means the model capacity drops dramatically. In contrast, in our framework, we update the pruned filters iteratively to retain accuracy at the same time. Meanwhile, other model compression frameworks often reformulate the objective function by combining the L1 norm and L2 norm to obtain a compact model. In [34], a novel objective function yields both exclusive sparsity and group sparsity, but the definition of the clustering groups of filters is too intuitive. To this end, we propose a reasonable and reliable manner to obtain the groups of filters.

3.2.1 Spectral Clustering for Filter Pruning.

Spectral clustering has become one of the most popular modern clustering algorithms in recent years. Compared to traditional algorithms, such as k-means and single linkage, spectral clustering has many fundamental advantages. When it comes to filter pruning, it is necessary to modify some of its default settings.

Cosine similarity matrix.

When analyzing big data, especially with hundreds of attributes, Euclidean distance suffers from the "curse of dimensionality". Specifically, Euclidean distance can produce a large distance between two vectors that are very similar in all attributes except one. Hence, cosine distance is recommended for deep neural network filter pruning. As summarized in Algorithm 1, we first compute the pair-wise cosine similarities of the filters from the same layer to form a similarity matrix $S^i$. To ensure correctness, we need to eliminate zero filters and apply a small trick to avoid division by zero. Without confusion, we still use $W^i$ to denote the filter matrix that participates in the pruning process. Specifically,

$S^i_{jk} = 1 - \frac{\langle W^i_j,\, W^i_k \rangle}{\|W^i_j\|_2 \, \|W^i_k\|_2}$    (1)

$A^i_{jk} = \exp\!\left(-\frac{(S^i_{jk})^2}{2\gamma^2}\right)$    (2)

where $S^i_{jk}$ measures the dissimilarity between filters $j$ and $k$ in the $i$-th layer; subtracting the cosine similarity from 1 fits the natural definition of a distance, so that identical filters have value 0 and more different filters have larger values. $W^i_k$ is the $k$-th filter (row) of the $i$-th layer's reshaped filter matrix $W^i$. We then apply the conventional radial basis transformation to the cosine distances; the transformed matrix $A^i$ serves as the adjacency matrix.
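
A small sketch of Eqs. (1)-(2) as reconstructed above, assuming the filters have already been reshaped to one row per filter and that all-zero filters have been removed:

import numpy as np

def cosine_rbf_adjacency(W, gamma, eps=1e-12):
    """Build the adjacency matrix A from the reshaped filter matrix W
    (one filter per row) via cosine distance (Eq. 1) and a radial
    basis transformation (Eq. 2)."""
    # Normalize rows; eps guards against division by zero for near-zero filters.
    norms = np.linalg.norm(W, axis=1, keepdims=True) + eps
    W_unit = W / norms
    # Eq. (1): S_jk = 1 - cosine similarity, so identical filters get 0.
    S = 1.0 - W_unit @ W_unit.T
    # Eq. (2): radial basis transformation of the cosine distances.
    return np.exp(-S ** 2 / (2.0 * gamma ** 2))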

K-Means on the feature matrix. Then, we compose the normalized Laplacian matrix based on $A^i$. Specifically, we compute the diagonal degree matrix $D^i$, whose diagonal elements are the column sums of $A^i$, and subtract $A^i$ from $D^i$ to obtain the unnormalized Laplacian matrix. Finally, we exploit the eigendecomposition of the normalized Laplacian matrix to get the feature matrix $U^i$ for clustering. To reduce the dimension of $U^i$, we can choose the $k$ eigenvectors corresponding to the $k$ largest eigenvalues to compose a new feature matrix (without confusion, we still use $U^i$ to denote it); in practice, we often skip this dimension reduction to retain all the information of the filters, and whether to perform it depends on the task. Specifically,

$L^i = D^i - A^i, \quad D^i_{jj} = \textstyle\sum_k A^i_{jk}$    (3)

$L^i_{\mathrm{norm}} = (D^i)^{-1/2}\, L^i\, (D^i)^{-1/2}$    (4)

We then apply the K-Means algorithm to the feature matrix $U^i$ to obtain the filter group labels, denoted as $\pi^i$, where each element of $\pi^i$ belongs to $\{1, 2, \dots, G_i\}$. We need to minimize the following objective function:

$\min \sum_{k=1}^{G_i} \sum_{j:\, \pi^i_j = k} \|u_j - \mu_k\|_2^2$    (5)

where $u_j$ is the $j$-th row of $U^i$ and $\mu_k$ is a cluster center, $k \in \{1, 2, \dots, G_i\}$.

function Spectral-Clustering($W^i$, $\gamma$, $G_i$) returns the clustering labels $\pi^i$

Inputs: filters $W^i$, bandwidth $\gamma$, number of cluster groups $G_i$

  Reshape $W^i$ to a matrix of dimension $c_{i+1} \times (h_i w_i c_i)$
  Drop zero-weight filters to get the new filter matrix $W^i$
  For each pair of filters $(W^i_j, W^i_k)$ in $W^i$, compute $S^i_{jk}$ and $A^i_{jk}$ by Eqs. (1)-(2)
  Compute the normalized Laplacian matrix $L^i_{\mathrm{norm}}$ by Eqs. (3)-(4)
  Do eigendecomposition on $L^i_{\mathrm{norm}}$ to obtain the feature matrix $U^i$
  If reduce-dimension = True, keep the $k$ eigenvectors corresponding to the $k$ largest eigenvalues
  Do K-Means($U^i$) according to the number of cluster groups $G_i$
  Return clustering labels $\pi^i$

Algorithm 1 Spectral Clustering for Filter Pruning
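
Putting these pieces together, one possible NumPy sketch of Algorithm 1 is given below. It reuses the reshape_filters and cosine_rbf_adjacency helpers sketched earlier; the handling of the dimension-reduction branch and the exact form of the normalization are our assumptions where the pseudocode is terse.

import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering_filters(kernel, gamma, n_groups, reduce_dim=False):
    """Algorithm 1 sketch: return a group label for each filter (-1 = dropped)."""
    W = reshape_filters(kernel)                      # (n_filters, n_attributes)
    nonzero = np.linalg.norm(W, axis=1) > 0          # drop all-zero (pruned) filters
    W_active = W[nonzero]

    A = cosine_rbf_adjacency(W_active, gamma)        # Eqs. (1)-(2)
    d = A.sum(axis=1)                                # degrees
    L = np.diag(d) - A                               # Eq. (3): unnormalized Laplacian
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_norm = D_inv_sqrt @ L @ D_inv_sqrt             # Eq. (4): normalized Laplacian

    eigvals, eigvecs = np.linalg.eigh(L_norm)        # eigenvalues ascending
    U = eigvecs[:, -n_groups:] if reduce_dim else eigvecs

    labels = KMeans(n_clusters=n_groups, n_init=10).fit_predict(U)

    # Map labels back to the original filter indices (-1 marks dropped filters).
    full_labels = -np.ones(W.shape[0], dtype=int)
    full_labels[nonzero] = labels
    return full_labels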

3.2.2 Filter Pruning with Soft Self-adaption Manners

As mentioned before, the filter pruning operations with soft manners preserve the capacity of the model without significantly increasing the computational complexity, and speed up the training. In this section, we go into the details of how to apply soft manners to our filter pruning.

Filter selection. After the spectral clustering operation on $W^i$, we obtain the group label of each filter in layer $i$, which naturally separates the filters into $G_i$ groups. According to the labels $\pi^i$, we get the group set $\{\mathcal{G}^i_1, \dots, \mathcal{G}^i_{G_i}\}$, where $\mathcal{G}^i_k$ contains the filters with group label $k$. We evaluate the importance of each group by its norm: intuitively, the convolutional results of filters with larger norms lead to relatively larger activation values and thus have more numerical impact on the target. To this end, we define the group effect $E^i_k$ in layer $i$ to indicate the total effort of a group towards the final target. After comparing the group effects among the different groups, we choose $P_i \cdot G_i$ groups (rounded to an integer) to prune, where $P_i$ is the pruning rate of the $i$-th layer, defined relative to the number of groups. Empirically, we use the L2 norm:

$E^i_k = \sum_{W^i_j \in \mathcal{G}^i_k} \|W^i_j\|_2 = \sum_{W^i_j \in \mathcal{G}^i_k} \sqrt{\textstyle\sum_m \big(w^i_{j,m}\big)^2}$    (6)

where $w^i_{j,m}$ is the $m$-th weight of the $j$-th filter in layer $i$.
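
The group-effect computation of Eq. (6) and the subsequent group selection could be sketched as follows (labels as returned by the Algorithm 1 sketch; rounding the number of pruned groups is an assumption):

import numpy as np

def select_groups_to_prune(W, labels, prune_rate):
    """Rank groups by their L2-norm group effect (Eq. 6) and return the
    labels of the weakest prune_rate fraction of groups."""
    groups = [g for g in np.unique(labels) if g >= 0]          # ignore dropped filters
    # Eq. (6): group effect = sum of L2 norms of the filters in the group.
    effects = {g: np.linalg.norm(W[labels == g], axis=1).sum() for g in groups}
    n_prune = int(round(prune_rate * len(groups)))
    ranked = sorted(groups, key=lambda g: effects[g])          # ascending effect
    return ranked[:n_prune]

def zeroize_groups(W, labels, groups_to_prune):
    """Soft pruning: zero out every filter whose group was selected."""
    mask = np.isin(labels, groups_to_prune)
    W[mask] = 0.0
    return W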

Filter Pruning. In the filter selection above, we rank the group effects and choose some groups to be pruned together. In practice, we use the same pruning rate for all layers; however, for different layers we still prune different numbers of filters. The reason we choose different numbers of cluster groups for different layers is that layers differ greatly in their filter counts, which means different numbers of features will be learned. The heuristic we adopt is to choose the group number proportional to the number of classes in the classification task. On the other hand, we treat convolutional layers and fully connected layers differently; for example, in some classification tasks we do not prune the last layer, where each filter represents all learned features for one class. The choice of the number of cluster groups and the treatment of different layers depend on the task.

Why can we simply prune many filters at the same time? Intuitively, this is because of the large model capacity and because the pruned filters will be updated again in the next epoch. As a result, the model becomes more flexible, generalizes better, and retains the same accuracy.

Reconstruction. After pruning filters, we allow them to recover in the next epoch. However, how should we choose the frequency of pruning? Specifically, how many epochs are appropriate between every two pruning operations, which we call the "pruning gap"? If we choose a small pruning gap, the model is updated frequently but the filters do not have enough time to recover; consequently, almost the same filters are pruned in the following pruning steps. With a large pruning gap, although the filters have enough time to recover, the model is not as compact as expected. Hence, there is a balance between accuracy and compactness. Empirically, setting the pruning gap to one or two epochs is enough; meanwhile, we can leave two epochs for recovery before ending the training process.

Input: model parameters $\{W^i, 1 \le i \le L\}$, bandwidth $\gamma$, cluster groups $\{G_i\}$, pruning rates $\{P_i\}$, pruning gap $g$, maximum number of training epochs $T$

Initialize the model parameters and set $t = 0$

while not converged and $t \le T$ do

  for each layer $i$, $1 \le i \le L$, do
    $\pi^i \leftarrow$ Spectral-Clustering($W^i$, $\gamma$, $G_i$)  (clustering labels)
    compute the group effects $E^i_k$ using the L2 norm (Eq. 6)
    rank the group effects
    zeroize the $P_i \cdot G_i$ groups of filters with the smallest group effects
  end for

  update the filters by normal training for $g$ epochs (the pruning gap); $t \leftarrow t + g$

end while

Algorithm 2 Filter Pruning with Soft Self-adaption Manners
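
A high-level sketch of the whole training loop of Algorithm 2 is given below, reusing the helper sketches from earlier in this section. The functions get_kernel, set_kernel and train_one_epoch are framework-dependent placeholders (the paper's implementation uses TensorFlow), not part of the original code.

def scsp_training(model, layers, gamma, n_groups, prune_rates, gap, max_epochs):
    """Algorithm 2 sketch: soft self-adaptive spectral clustering filter pruning."""
    epoch = 0
    while epoch < max_epochs:
        # SCSP step: cluster and softly prune each prunable layer.
        for i in layers:
            kernel = get_kernel(model, i)                    # (h, w, c_in, c_out)
            labels = spectral_clustering_filters(kernel, gamma, n_groups[i])
            W = reshape_filters(kernel)                      # (c_out, h*w*c_in)
            weak = select_groups_to_prune(W, labels, prune_rates[i])
            W = zeroize_groups(W, labels, weak)              # soft pruning: zeroize only
            new_kernel = (W.reshape(kernel.shape[3], *kernel.shape[:3])
                           .transpose(1, 2, 3, 0))           # back to (h, w, c_in, c_out)
            set_kernel(model, i, new_kernel)                 # weights remain trainable
        # Normal training for `gap` epochs; pruned filters may recover here.
        for _ in range(gap):
            train_one_epoch(model)
            epoch += 1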

4 Experiments

In this section, we evaluate our proposed approach on several classification datasets with some popular base networks.

4.1 Datasets and Settings

MNIST [14]. It is a classical handwritten digit dataset of grayscale images, with a training set of 60,000 examples and a test set of 10,000 examples. The 5-layer convolutional neural network (LeNet-5) for handwritten digit classification opened a new gate for deep learning.

CIFAR-10 [28]. It is a widely-used object classification dataset in computer vision. The dataset consists of 60,000 32x32 colour images in 10 classes (e.g., cat, automobile, airplane, etc.), with 6,000 images per class. The dataset is divided into five training batches and one test batch, each with 10,000 images. The test batch contains exactly 1,000 randomly-selected images from each class. The training batches contain the remaining images in random order, with about 5,000 images from each class.

Implementation details.

Our models are implemented in the TensorFlow framework. In particular, for the MNIST dataset, we adopt two convolutional layers followed by two fully connected layers, named LeNet-4, as the base network, and train the network for 20 epochs with a constant learning rate of 0.07. We evaluate our proposed method with five different pruning ratios, i.e., 0.1, 0.2, 0.3, 0.4, and 0.5. Besides, the number of spectral clustering groups is related to the number of filters that participate in the filter pruning operation. Note that we do not prune the last layer in this paper. For the CIFAR-10 setting, we follow LeNet-5 and ResNet-32 as the base network structures. All models are trained for 200 epochs with a multi-step learning rate policy (0.003 for the first 75 epochs, 0.0005 for the following 75 epochs, and 0.00025 for the remaining epochs). Because of its compact and simple network structure, we only prune the second and third layers of LeNet-5; however, we do filter pruning for all convolutional layers in ResNet-32.
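
For reference, the CIFAR-10 multi-step learning-rate policy described above can be written as a simple schedule function (a sketch of the stated values only, not the authors' training script):

def cifar10_learning_rate(epoch):
    """Multi-step policy used for the CIFAR-10 experiments (200 epochs total)."""
    if epoch < 75:
        return 0.003       # first 75 epochs
    elif epoch < 150:
        return 0.0005      # following 75 epochs
    else:
        return 0.00025     # remaining epochs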

4.2 Evaluation Protocol

FLOPs. Floating-point operations (FLOPs) are a widely-used measure of model computational complexity. We follow the criterion in [20]; the formulation is described below.

For convolutional layers:

$\text{FLOPs}_{\mathrm{conv}} = 2 \times H \times W \times K^2 \times C_{in} \times C_{out}$    (7)

where $H$ and $W$ are the height and width of the input feature map, $K$ is the kernel size, and $C_{in}$ and $C_{out}$ are the numbers of input and output channels, respectively.

For fully connected layers:

$\text{FLOPs}_{\mathrm{fc}} = 2 \times C_{in} \times C_{out}$    (8)

where $C_{in}$ and $C_{out}$ are the numbers of input and output neurons, respectively.
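
The two formulas, as reconstructed above, can be sketched as small helpers. The factor of 2 (counting multiplications and additions) is our reading of the criterion in [20] and is consistent with the per-layer numbers in Table 4, so treat it as an assumption:

def conv_flops(H, W, K, c_in, c_out):
    """Eq. (7): FLOPs of a convolutional layer on an H x W feature map."""
    return 2 * H * W * (K ** 2) * c_in * c_out

def fc_flops(c_in, c_out):
    """Eq. (8): FLOPs of a fully connected layer."""
    return 2 * c_in * c_out

# Example: a 500 x 800 fully connected layer costs fc_flops(500, 800) = 800,000 FLOPs.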

Parameter sparsity. Parameter sparsity is the percentage of parameters that are realistically used over all theoretical parameters, as in the following formulation:

$\text{Param Sp} = \frac{\sum_i \|W^i\|_0}{\sum_i N^i} \times 100\%$    (9)

where $\|W^i\|_0$ counts the non-zero parameters and $N^i$ denotes the number of all parameters in layer $i$.
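
A direct reading of this definition (non-zero, i.e. realistically used, parameters over all parameters) can be sketched as follows; note that this convention is an assumption, since the tables in Section 4 appear to report the complementary pruned percentage:

import numpy as np

def parameter_sparsity(weight_matrices):
    """Eq. (9) under the stated reading: percentage of non-zero parameters
    over all theoretical parameters, aggregated over the given layers."""
    used = sum(int(np.count_nonzero(W)) for W in weight_matrices)
    total = sum(W.size for W in weight_matrices)
    return 100.0 * used / total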

4.3 Results on MNIST

As shown in Table 1, we evaluate our approach with five different pruning ratios and also compare it with other state-of-the-art methods, e.g., CGES [34]. First, as the pruning ratio increases, the filters become sparser and fewer and fewer parameters are available for the final decision; however, our approach still achieves considerable performance: 98.54%, 98.63%, 98.4%, 98.46%, and 98.42% accuracy with pruning ratios of 0.1, 0.2, 0.3, 0.4, and 0.5, respectively. Second, interestingly, we notice that when the pruning ratio is 0.1 (or 0.2), meaning that in each pruning step we prune 10% (or 20%) of all groups of filters, our model, with fewer parameters remaining, even achieves better accuracy than the baseline, by a margin of 0.04% (0.13%). This confirms our intuition that, despite the large number of parameters in networks, they can be grouped based on relevance and complementarity; pruning some redundant groups (filters) allows the others to focus on their responsibilities and learn more discriminative features. Third, compared with CGES, our model has better performance and lower FLOPs. Although our parameter sparsity is not as good as that of CGES, we achieve a better balance between accuracy and computational complexity.

4.4 Results on CIFAR-10

LeNet-5 base network. We first evaluate our method on the CIFAR-10 dataset with LeNet-5, setting the pruning ratio to 0.4. Table 2 shows the comparison between our method and CGES. As we can see, our approach achieves state-of-the-art performance: after training, we prune about 18.13% of the network parameters and reduce FLOPs by 6.42%, yet our model does not lose any capacity and still maintains the final accuracy of 71.09%. Looking at the performance of CGES, it sacrifices 1.04% accuracy while decreasing parameters by only 3.60%. This is all thanks to the effectiveness of spectral clustering filter pruning.

ResNet-32 base network. We also apply our proposed pruning method to a deeper network to show its strength. As a deep and powerful network, ResNet is our primary choice; we utilize ResNet-32 as the base network and set the pruning ratio to 0.4. The results are shown in Table 3. Compared with the baseline, 12.10% of the parameters are pruned by our approach, and we reduce the computational complexity by 11.3% in FLOPs. Meanwhile, the model only loses 0.94% accuracy, a small margin.

Pruned ratio Baseline Acc(%) Pruned Acc(%) Acc Loss(%) Pruned FLOPs(%) Param Sp(%)
0.1 98.50 98.54±0.04 -0.04 10.00 4.51
0.2 98.50 98.63±0.08 -0.13 15.70 7.57
0.3 98.50 98.4±0.11 0.10 38.10 18.56
0.4 98.50 98.46±0.18 0.04 32.20 20.34
0.5 98.50 98.42±0.12 0.08 42.20 21.68
CGES 98.47 97.12 1.35 29.00 67.92
Table 1: Results for different pruning rates on MNIST. In the first and second rows, the pruned network has better performance because of more efficient learning. The row with pruning rate 0.4 shows a good balance between accuracy and FLOPs.
Full model CGES Ours
Accuracy(%) 71.09 70.05 71.09
Weight pruned(%) - 3.60 18.13
FLOPs pruned(%) - - 6.42
Table 2: On LeNet-5, pruned weights and accuracy with different methods.
Full model Ours
Accuracy(%) 91.74 90.80
Weight pruned(%) - 12.10
FLOPs pruned(%) - 11.3
Table 3: On ResNet-32, pruned weights and accuracy with different methods.
Figure 2: FLOPs and parameters sparsity change with training epochs.
Layer Weights(K) FLOPs(K) Pruned Weights(%) Pruned FLOPs(%)
conv1 0.38 587.94 53.13 53.13
conv2 37.07 14,528.96 27.60 27.60
fc1 2,483.91 4,967.83 22.65 22.65
total 2,599.52 18,812.69 20.34 32.20
Table 4: Layer-wise pruning results on MNIST with LeNet-4 (two convolutional layers followed by two fully connected layers); the last fully connected layer is not pruned.

4.5 Ablation Study

How do the FLOPs and parameter sparsity change during training? In order to understand the pruning process, we show the curves of FLOPs and parameter sparsity for five different pruning ratios in Figure 2. It is intuitive that both FLOPs and parameter sparsity decrease progressively during training because some unimportant filters are removed. However, there are some fluctuations in the curves, especially in the right one. This is because (1) we cluster all filters before pruning and remove groups of filters if they are unimportant, hence the number of pruned filters depends on how the weights have been updated; and (2) we do filter pruning with soft self-adaption manners, which means the pruned filters are updated iteratively during training. We argue that, because of these two reasons, our proposed approach can achieve considerable performance with fewer parameters.

Which filters are more important in each layer? As explained above, if a group of filters is insensitive to the output, it will be pruned, i.e., these filters are not important. In order to analyze the importance of filters in each layer, we run experiments on MNIST with LeNet-4, as shown in Figure 4. Interestingly, we find that the closer a layer is to the output (the last fully-connected layer), the fewer filters are pruned. The shallow layers (e.g., conv1) only learn simple features, such as colors and edges, whereas the deeper layers (e.g., fc1) learn more abstract features, such as profiles and parts. Therefore, it is intuitive that some parameters in fc1 are more important than those in conv1. Again, we argue that our proposed approach, which applies spectral clustering to filter pruning, is effective and explainable.

5 Conclusion

In this paper, we introduce a novel filter pruning method, spectral clustering filter pruning with soft self-adaption manners (SCSP), to compress convolutional neural networks. For the first time, we employ spectral clustering on filters layer by layer to explore their intrinsic connections. The experiments show the efficacy of the proposed method.

References