Exploring Linear Relationship in Feature Map Subspace for ConvNets Compression

03/15/2018, by Dong Wang et al.

While the research on convolutional neural networks (CNNs) is progressing quickly, the real-world deployment of these models is often limited by computing resources and memory constraints. In this paper, we address this issue by proposing a novel filter pruning method to compress and accelerate CNNs. Our work is based on the linear relationship identified in different feature map subspaces via visualization of feature maps. Such a linear relationship implies that the information in CNNs is redundant. Our method eliminates the redundancy in convolutional filters by applying subspace clustering to feature maps. In this way, most of the representative information in the network can be retained in each cluster. Our method therefore provides an effective alternative to existing filter pruning approaches, most of which directly remove filters based on simple heuristics. The proposed method is independent of the network structure and can thus be adopted by any off-the-shelf deep learning library. Experiments on different networks and tasks show that our method outperforms existing techniques before fine-tuning, and achieves state-of-the-art results after fine-tuning.


1 Introduction

With the availability of huge volumes of labeled data, the tremendous power of graphics processing units (GPUs), and parallel computation, convolutional neural networks (CNNs) have achieved state-of-the-art performance in a wide variety of computer vision tasks, such as image classification [14], object detection [30], image segmentation [13], and human pose estimation [29]. As flexible function approximators that scale to millions of parameters, CNNs can extract high-level and more discriminative features compared with traditional elaborately hand-crafted ones.

However, modern CNNs rely heavily on intensive computing and memory resources despite their overwhelming success. For instance, ResNet-50 [14] has more than 50 convolutional layers, requiring over 95MB of storage and over 3.8 billion floating-point multiplications to process a single image. The VGG-16 model [32] has 138.34 million parameters, takes up more than 500MB of storage space, and needs 30.94 billion floating point operations (FLOPs) to classify a single image. It is very difficult to deploy these complex CNN models in scenarios where computing resources are constrained, i.e., where a task must be completed with limited computing time, storage space, and battery power.
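To see where such FLOP counts come from, here is a back-of-the-envelope helper (ours, not from the paper) that estimates the parameters and floating point operations of a single convolutional layer. It uses the standard output-channels × input-channels × kernel-area formulas and counts a multiply-add as two operations.

```python
def conv_layer_cost(c_in, c_out, k, h_out, w_out):
    """Rough parameter and FLOP count of one convolutional layer.

    c_in, c_out  : number of input / output channels
    k            : square kernel size
    h_out, w_out : spatial size of the output feature map
    """
    params = c_out * (c_in * k * k + 1)               # weights + biases
    flops = 2 * h_out * w_out * c_out * c_in * k * k  # multiply-adds counted as 2 ops
    return params, flops

# Example: the first 3x3 convolution of VGG-16 on a 224x224 input
print(conv_layer_cost(3, 64, 3, 224, 224))  # -> (1792, 173408256)
```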

Both academia and industry have developed methods to reduce the number of parameters in CNNs. Ba et al. [2] used class probabilities produced by a pre-trained model as “soft targets” to train a small network, successfully transferring the cumbersome model into a compact one while maintaining its generalization capability. The student-teacher paradigm in [2] has proven effective in compressing CNNs; however, devising a new small network is not a trivial task, and it remains an open problem how to define the inherent “knowledge” in the teacher model. Tensor factorization based methods [23, 20, 18, 10] factorize an over-parameterized convolutional layer into several lighter ones. However, decomposing the convolutions favoured by modern CNNs (e.g., GoogleNet [34], ResNet [14], and Xception [6]) remains an intractable problem. Moreover, tensor decomposition techniques make the target network deeper, incurring more convolution operations.

Filter pruning has been proposed to address this issue. Since the network architecture remains unchanged after filter pruning, the obtained model is compatible with any off-the-shelf deep learning framework. In addition, since the volumes of both the convolutional kernels and the intermediate activations are shrunk, the required memory is reduced remarkably. This strategy also allows complementary compression methods to be applied to obtain an even more compact model. These advantages have led to increasing research attention on filter pruning. He et al. [15] learned a sparse weight vector to measure the importance of filters by applying LASSO regression. Luo et al. [27] used statistical information computed from the next layer to guide the filter pruning of the current layer in a greedy way. Both methods directly prune some filters, so the information contained in the pruned filters can no longer be utilized once they have been removed.

Figure 1: Visualization of output feature maps produced by the first convolutional layer of VGG-16 [32]. The example images are randomly chosen from ILSVRC-12 [31]. We show the linearly dependent feature maps in different subspaces and the feature maps produced by 8 clustered convolutional filters. It is evident that redundant feature maps can be eliminated by a subspace clustering algorithm.

In this paper, we propose a novel filter pruning method by leveraging the linear relationship in feature map subspaces. As shown in Figure 1, since the different feature maps of a convolutional layer originate from the same image through different convolutional filters, each a linear operation, the outputs are linearly dependent within different subspaces. Among various subspace analysis approaches, subspace clustering is an excellent method for clustering linearly distributed data in different subspaces. Motivated by this, instead of measuring the importance of filters [25, 16] or feature maps [15, 27, 1] and subsequently removing the trivial kernels, we seek the most representative information by casting the filter selection problem as a subspace clustering problem on intermediate activations. We first cluster the feature maps into subspaces. This allows us to cluster the corresponding filters in the next layer, which take these feature maps as input. The filters in the upper layer that produce these feature maps can also be clustered. We then iterate this process to prune the whole network layer by layer.

In summary, this paper makes the following contributions:

  • We propose a novel filter pruning method based on the linear relationship in feature map subspaces to compress and accelerate CNN models. To the best of our knowledge, this is the first work to investigate a clustering method for CNN model compression, and also the first to employ subspace clustering to accelerate CNNs. We can prune the redundant information in feature maps while simultaneously retaining the most representative information.

  • We devise a flexible filter pruning framework that is independent of the network structure. Thus our method is well supported by any off-the-shelf deep learning library.

  • Experiments demonstrate that, compared to the original heavy network, the learned portable network achieves comparable accuracy with significantly lower memory usage and computational cost. We achieve consistent improvements on various tasks, exceed other filter pruning works, and obtain state-of-the-art results.

2 Related Work

Recently, several works on CNN acceleration have focused on network pruning thanks to its apparent benefits. Along this line, various strategies have been adopted, e.g., fine-grained pruning [12], group-level pruning [36, 24], filter-level pruning [16, 25], layer-level pruning [38], and feature map pruning [9, 1, 15, 27]. Han et al. [12] introduced a sparsity regularization approach to identify and remove connections with small weights. The major drawback of this fine-grained pruning is the loss of universality and flexibility due to the unstructured pruned parameters, which heavily hinders the pruned models from being transferred to real applications.

Group-level pruning approaches alleviate the above problem by learning structured sparsity patterns. Lebedev et al. [24] used a group-wise brain damage process to sparsify convolution kernels, generating one sparsity pattern per group (2D kernel) in convolutional layers; the entire group with small weights can then be removed. Similarly, Wen et al. [36] proposed the Structured Sparsity Learning (SSL) method to regularize filter, channel, filter shape and depth structures. Zagoruyko et al. [38] demonstrated a layer-level pruning technique: for a network consisting of multiple homogeneous stages (each stage being a set of convolutional blocks), some stages are removed by merging attention maps into a specific cost function, which provides a way to combine network pruning with knowledge distillation. Figurnov et al. [9] explored feature map pruning which keeps only a subset of rows in the patch matrix according to a predefined perforation mask [9], which can have a grid or pooling structure, and interpolates the missing output values. However, this method only shortens the inference time and does not compress the model. Similar to [24], it is only supported by deep learning frameworks that reduce generalized convolution to matrix multiplication.

Compared with the aforementioned pruning strategies, filter-level pruning is more efficient for accelerating very deep neural networks. Consider two consecutive convolutional blocks, which are indispensable in all CNNs: after filter pruning of the former block, the number of input channels of the latter block is also reduced, while the shape of the output produced by the latter block remains unchanged. By minimizing the reconstruction errors of the feature maps, the outputs at the CNN endpoints are retained. It is therefore vital to determine which filters should be eliminated. Some methods use kernel importance; an intuitive choice is the magnitude of the weights. Li et al. [25] calculated the absolute weight sum of each filter as its importance score. Denseness in filters is an alternative: Hu et al. [16] measured the significance of each filter by calculating the Average Percentage of Zeros (APoZ) in its activations.

Another type of approach tackles the filter selection challenge by converting it into channel selection on feature maps. Anwar et al. [1] performed over 100 random trials of channel selection. However, each trial is time consuming and laborious, which makes this approach impractical for very deep models and large datasets. ThiNet [27] used statistical information computed from the next layer to guide the filter pruning of the current layer; the pruned convolutional layer is forced to mimic the original one by minimizing the reconstruction errors of the feature maps. Although the works of He et al. [15] and Luo et al. [27] have similar workflows, their channel selection strategies are different. The main idea of [15] is to learn a weight vector for the feature maps: the weight vector is optimized for channel selection with the convolutional filters fixed, and the filters are then optimized to reconstruct the output with the weight vector fixed. In practice, since this two-step iteration is time consuming, the weight vector is optimized multiple times and the filters only once to obtain the final result.

In addition, there are various other techniques for compressing convolutional filters. A representative approach is low-rank approximation [23, 20, 18, 10], which breaks a convolutional layer into several smaller pieces by applying tensor decomposition strategies, e.g., CP decomposition [23] and Tucker decomposition [20]. Other methods include parameter quantization [10, 4, 37, 11] and structured matrix design [35, 5, 33].

3 Proposed Method

In this section, we describe a novel filter pruning method based on the linear relationship in feature map subspace. We first introduce the overall framework, then present the details of each step. Finally, we describe our pruning strategy which takes both efficiency and effectiveness into consideration.

3.1 Overall Framework

Figure 2: Illustration of our filter pruning method. First, we cluster the input feature maps of layer $i+1$ (i.e., the output feature maps of layer $i$) with a subspace clustering algorithm. Then the filters in layer $i$ and the channels of the filters in layer $i+1$ are pruned by averaging, respectively, the corresponding filters and channels within each feature map cluster.

Filter pruning is an effective method for reducing the complexity of neural networks. There are two key points in filter pruning. The first is filter selection: we need to find the most representative filters so as to retain as much information as possible. The second is reconstruction: the subsequent feature maps must be reconstructed using the clustered filters. The main difference between our method and previous works lies in the filter selection strategy. Most existing methods directly prune filters that make a weak contribution to the network; the drawback is that the information in the pruned filters can no longer be utilized, which degrades the feature map reconstruction. Our method, on the other hand, applies a subspace clustering algorithm to reduce the number of feature maps, which simultaneously eliminates the redundant feature maps and retains the most representative information. To prune the input feature maps down to the desired number of channels, we group the feature maps into that number of clusters and then average the corresponding filters within each cluster. A pre-trained model is pruned layer by layer with a predefined compression rate.

We summarize our filter pruning method for a single convolutional layer in Figure 2. We aim to prune the filters in layers $i$ and $i+1$. Once the feature maps produced by layer $i$ are clustered, we can cluster the corresponding filters in layer $i$ and the corresponding channels of the filters in layer $i+1$. The method consists of the following key steps:

  1. Feature map clustering. Since the different feature maps of a convolutional layer are generated from the same image by different convolutional filters, each a linear operation, the output feature maps are linearly dependent within different subspaces. We therefore apply a subspace clustering algorithm to cluster the feature maps.

  2. Filter clustering and reconstruction. After subspace clustering, we cluster the corresponding input channels of the filters in the next layer, which take these feature maps as input. The filters in the upper layer that produce these feature maps are also clustered. We then reconstruct the subsequent feature maps using the pruned filters. Note that the pruned network has exactly the same structure as the original but with fewer filters; in other words, the original thick network becomes a much thinner model.

  3. Fine-tuning.

    Fine-tuning is a necessary step to recover the generalization ability affected by filter pruning, but it is time consuming on large datasets and complex models. For efficiency, we fine-tune for only a limited number of epochs after all pruned feature maps have been reconstructed.

  4. Return to step 1 to prune the next layer.
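The sketch below shows, purely schematically, how these steps compose into a layer-by-layer pruning loop. The data layout and the injected helpers (cluster_fn, average_fn, reconstruct_fn) are our own placeholders for Algorithm 1, Eqs. (2)-(3), and Eq. (4) described in Section 3.2; they are not an interface defined by the paper.

```python
def prune_network(conv_weights, probe_activations, compression_rate,
                  cluster_fn, average_fn, reconstruct_fn):
    """Schematic layer-by-layer pruning loop following steps 1-4 above.

    conv_weights      : list of filter tensors, one (n, c, k, k) array per layer
    probe_activations : list of (c, h, w) feature maps of each layer on probe images
    compression_rate  : fraction of feature maps to keep per layer (our convention)
    cluster_fn        : step 1, e.g. subspace clustering (Algorithm 1)
    average_fn        : step 2, averages filters/channels per cluster (Eqs. (2)-(3))
    reconstruct_fn    : step 2, least-squares reconstruction (Eq. (4))
    """
    for i in range(len(conv_weights) - 1):
        maps = probe_activations[i]                       # outputs of layer i
        n_clusters = max(1, int(maps.shape[0] * compression_rate))
        labels = cluster_fn(maps, n_clusters)             # step 1
        conv_weights[i], conv_weights[i + 1] = average_fn(
            conv_weights[i], conv_weights[i + 1], labels, n_clusters)
        conv_weights[i + 1] = reconstruct_fn(conv_weights[i + 1])  # Eq. (4)
        # step 3 (brief fine-tuning) happens here, then step 4 moves to layer i+1
    return conv_weights
```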

3.2 Filter Pruning

We propose a two-step algorithm for filter pruning. In the first step, we seek the most representative filters: since there is a linear relationship in the different feature map subspaces, we apply a subspace clustering algorithm to group the feature maps so that each cluster retains as much representative information as possible, and the filters corresponding to each feature map cluster are then averaged. In the second step, we reconstruct the subsequent feature maps from the averaged filters with linear least squares.

Formally, we use $\mathcal{Y} = \mathcal{W} * \mathcal{X}$ to denote the convolution in layer $i+1$, where $\mathcal{X} \in \mathbb{R}^{c \times h \times w}$ is the input tensor with $c$ feature maps of size $h \times w$, and $\mathcal{W} = \{W_1, W_2, \dots, W_n\}$ is a set of $n$ filters with kernels of dimension $c \times k \times k$, which generates a new tensor $\mathcal{Y}$ with $n$ feature maps.

3.2.1 Subspace clustering.

To prune the channels of the feature maps from $c$ to the desired $c'$, we cluster the feature maps into $c'$ clusters. Since the different feature maps are generated from the same image and convolution is a linear operation, the output feature maps of an image are linearly dependent within different subspaces, i.e., they satisfy a subspace distribution: a feature map can be expressed as a linear combination of the other feature maps in the same subspace. This property is called self-expressiveness. We leverage the subspace clustering algorithm of [7] to cluster the feature maps. Mathematically, this idea is formalized as the optimization problem

$$\min_{Z}\ \|Z\|_{1} \quad \text{s.t.} \quad A = AZ,\ \operatorname{diag}(Z) = 0, \qquad (1)$$

where $A \in \mathbb{R}^{hw \times c}$ is the reshaped $\mathcal{X}$ (each column is a vectorized feature map) and $Z \in \mathbb{R}^{c \times c}$ is the self-expressiveness matrix. The subspace clustering algorithm is summarized in Algorithm 1.

Input: The input feature maps $\mathcal{X}$ and the desired number of input channels of the filters, $c'$.
Steps:
1. Reshape $\mathcal{X}$ to $A \in \mathbb{R}^{hw \times c}$.
2. Learn the self-expressiveness matrix $Z$ from Eq. (1).
3. Construct an affinity matrix $S = |Z| + |Z|^{\top}$.
4. Calculate the Laplacian matrix $L$ of $S$.
5. Calculate the eigenvector matrix $P$ of $L$ corresponding to its $c'$ smallest nonzero eigenvalues.
6. Perform the k-means clustering algorithm on the rows of $P$.
Output: The clustering result of $\mathcal{X}$ with $c'$ clusters.
Algorithm 1 Subspace Clustering.
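For concreteness, here is a minimal NumPy/scikit-learn sketch of Algorithm 1. To keep it short it replaces the sparse self-expressiveness step of Eq. (1)/[7] with a ridge-regularised least-squares surrogate and uses the unnormalised graph Laplacian; treat it as an illustrative approximation rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def subspace_cluster_feature_maps(feature_maps, n_clusters, lam=1e-2):
    """Cluster the c feature maps of one layer into n_clusters subspaces."""
    c = feature_maps.shape[0]
    A = feature_maps.reshape(c, -1).T            # (h*w, c): columns are vectorized maps
    # Self-expressiveness A ~ A Z with zero diagonal (ridge surrogate of Eq. (1))
    G = A.T @ A
    Z = np.linalg.solve(G + lam * np.eye(c), G)
    np.fill_diagonal(Z, 0.0)
    S = np.abs(Z) + np.abs(Z).T                  # affinity matrix
    L = np.diag(S.sum(axis=1)) - S               # unnormalised graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)
    P = eigvecs[:, 1:n_clusters + 1]             # skip the (near-)zero eigenvector
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(P)
```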

3.2.2 Filter clustering.

After clustering the feature maps into $c'$ clusters, we denote the index set of the $l$-th cluster by $\Omega_l$, $l = 1, 2, \dots, c'$. We can then prune the channels of the filters in layer $i+1$ by calculating the average channel of each cluster. For the $j$-th filter $W_j$, the average channel of cluster $l$ is calculated through the clustering index as

$$\hat{W}_{j}^{(l)} = \frac{1}{|\Omega_l|} \sum_{t \in \Omega_l} W_{j}^{(t)}, \qquad (2)$$

where $W_{j}^{(t)}$ is the $t$-th channel of filter $W_j$ and $|\Omega_l|$ is the number of elements in $\Omega_l$. The pruned $j$-th filter $\hat{W}_j \in \mathbb{R}^{c' \times k \times k}$ is then the concatenation of $\hat{W}_{j}^{(1)}, \dots, \hat{W}_{j}^{(c')}$. For $j = 1, 2, \dots, n$, we obtain the pruned filters using Eq. (2), so the pruned filter set for layer $i+1$ is $\hat{\mathcal{W}} = \{\hat{W}_1, \dots, \hat{W}_n\}$.

Naturally, the filters of the upper layer $i$ that produce the feature maps, denoted $\mathcal{V} = \{V_1, \dots, V_c\}$, can also be clustered:

$$\hat{V}_{l} = \frac{1}{|\Omega_l|} \sum_{t \in \Omega_l} V_{t}, \qquad (3)$$

where $V_t$ is the $t$-th filter of $\mathcal{V}$ and $|\Omega_l|$ is again the number of elements in $\Omega_l$. The result is $\hat{\mathcal{V}} = \{\hat{V}_1, \dots, \hat{V}_{c'}\}$; the number of filters in layer $i$ is thus reduced from $c$ to $c'$.
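A compact NumPy sketch of Eqs. (2) and (3) is shown below. The array layout (filters stored as (out_channels, in_channels, k, k)) and the function name are our own conventions, not from the paper.

```python
import numpy as np

def prune_filters(W_cur, W_next, labels, n_clusters):
    """Average filters / channels over each feature-map cluster.

    W_cur  : (c, c_prev, k, k)  filters of layer i  (they produce the c clustered maps)
    W_next : (n, c, k, k)       filters of layer i+1 (they consume the c clustered maps)
    labels : (c,)               cluster index of every feature map, values in [0, n_clusters)
    """
    # Eq. (3): one averaged filter per cluster in layer i  -> (c', c_prev, k, k)
    W_cur_pruned = np.stack(
        [W_cur[labels == l].mean(axis=0) for l in range(n_clusters)])
    # Eq. (2): average the input channels of every layer-(i+1) filter per cluster
    # -> (n, c', k, k)
    W_next_pruned = np.stack(
        [W_next[:, labels == l].mean(axis=1) for l in range(n_clusters)], axis=1)
    return W_cur_pruned, W_next_pruned
```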

3.2.3 Reconstruction error minimization.

We reconstruct the output feature maps with the pruned filters by linear least squares. This problem can be formulated as

$$\min_{\hat{\mathcal{W}}}\ \big\| \mathcal{Y} - \hat{\mathcal{W}} * \hat{\mathcal{X}} \big\|_{F}^{2}, \qquad (4)$$

where $\|\cdot\|_F$ is the Frobenius norm, $*$ is the convolution operation, and $\hat{\mathcal{X}}$ denotes the feature maps produced by the pruned layer $i$. The complete filter pruning process is summarized in Algorithm 2.

Input: The original convolutional filters $\mathcal{V}$ (layer $i$) and $\mathcal{W}$ (layer $i+1$), and the clustering index sets $\Omega_1, \dots, \Omega_{c'}$.
Steps:
1. For layer $i+1$, calculate the aggregated channel of each filter through the clustering indices using Eq. (2).
2. For layer $i$, calculate the aggregated filter of each cluster through the clustering indices using Eq. (3).
3. Minimize the reconstruction error between the original output and the pruned output of layer $i+1$ by Eq. (4).
Output: The pruned convolutional filters $\hat{\mathcal{V}}$ and $\hat{\mathcal{W}}$.
Algorithm 2 Filter Pruning.
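The reconstruction step of Algorithm 2 (Eq. (4)) reduces to ordinary least squares once the convolution is unrolled into a matrix product. The sketch below assumes the im2col patches of the pruned input maps have already been extracted; the exact sampling of spatial positions is our simplification, not something specified here.

```python
import numpy as np

def reconstruct_filters(X_cols, Y, filter_shape):
    """Solve Eq. (4) by linear least squares on unrolled convolutions.

    X_cols       : (m, c'*k*k)   im2col patches from the pruned input feature maps
    Y            : (m, n)        responses of the original (unpruned) layer i+1
                                 sampled at the same m spatial positions
    filter_shape : (n, c', k, k) target shape of the reconstructed filters
    """
    # Least-squares fit: minimise ||X_cols @ W_flat - Y||_F^2
    W_flat, residuals, rank, _ = np.linalg.lstsq(X_cols, Y, rcond=None)  # (c'*k*k, n)
    return W_flat.T.reshape(filter_shape)
```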

3.3 Pruning Strategy

Network architectures can be divided into two types: traditional single-path architectures and multi-path convolutional architectures. AlexNet [22] and VGGNet [32] are representatives of the former, while the latter mainly includes recent networks equipped with novel structures such as the Inception modules in GoogLeNet [34] or the residual blocks in ResNet [14].

We use different strategies to prune these two types of networks. For VGG-16, we apply the single-layer pruning strategy to the convolutional layers one by one. For ResNet, some restrictions arise from its special structure. For example, the channel numbers of the residual learning branch and the identity mapping branch in the same block need to be consistent in order to perform the element-wise sum, so it is hard to directly prune the last convolutional layer in the residual learning branch. Since most parameters of a block appear in its first two layers, pruning the first two layers is a feasible option, as illustrated in Figure 3.

Figure 3: Illustration of the ResNet pruning strategy. For each residual block, we only prune the first two convolutional layers, keeping the block output dimension unchanged.
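As a small illustration of this strategy, the helper below selects the prunable convolutions of one residual block. The block layout (three residual-branch convolutions plus an optional projection shortcut) matches a ResNet-50 bottleneck, but the attribute name `convs` is hypothetical.

```python
def prunable_layers(block):
    """Return the convolutions of a residual block that this strategy prunes.

    block.convs is assumed to list the residual-branch convolutions in order
    (1x1, 3x3, 1x1 for a ResNet-50 bottleneck). The last convolution and any
    projection shortcut are left untouched so the block output dimension and
    the element-wise sum stay valid.
    """
    return block.convs[:2]  # only the first two convolutions are pruned
```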

4 Experiment

Our method is tested on combinations of three popular CNN models and three benchmark datasets: VGG-16 [32] on ILSVRC-12 [31] and PASCAL VOC 2007 [8], ResNet-50 [14] on ILSVRC-12 [31], and CMU-pose [3] on MSCOCO-14 [26].

First, we compare several filter selection strategies, including ours, by pruning a single layer of VGG-16 [32] on ILSVRC-12 to demonstrate the efficiency of our algorithm, followed by whole-model pruning of VGG-16 [32]. Second, we show the performance of pruning a network with residual architecture, for which ResNet-50 [14] is selected. Finally, we apply our method to Faster R-CNN [30] and CMU-pose [3] to evaluate the generalization capability of our algorithm on the challenging visual tasks of object detection and human pose estimation. All experiments were implemented within Caffe [19].

The performance of ConvNet compression is evaluated under different speed-up ratios. Let $n$ be the number of filters in the original layer and $n'$ the number of filters in the pruned layer; the speed-up ratio of that layer is then defined as

$$r = \frac{n}{n'}. \qquad (5)$$

4.1 Experiments on VGG-16

VGG-16 [32] is a classic single-path CNN with 13 convolutional layers, which has been widely used in vision tasks as a powerful feature extractor. We use single-layer pruning and whole-model pruning to evaluate the efficiency of our method. The effectiveness is measured by the decrease of top-5 accuracy of VGG-16 [32] on the ILSVRC-12 [31] validation dataset relative to the unpruned baseline.

4.1.1 Single layer pruning.

Figure 4: Single-layer performance analysis under different speed-up ratios (without fine-tuning), measured by the decrease of top-5 accuracy on the ILSVRC-12 validation dataset.

We first evaluate the single-layer acceleration performance of our method, comparing our approach with several existing filter selection strategies. sparse vector [15] preserves filters according to their importance scores learned by a sparsity regularization method. max response [25] selects channels based on the corresponding filters having a high absolute sum of weights. To differentiate our approach from common clustering algorithms, we also include kmeans as a baseline. In addition, to validate the necessity of an elaborate hand-crafted filter selection algorithm, we consider two naive criteria: first k selects the first k channels, and random selects a fixed number of filters at random. After filter pruning, feature map reconstruction is performed without the fine-tuning step. The effectiveness of the methods is measured by the reduction of top-5 accuracy on the validation dataset after the reconstruction procedure is completed.
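To make the compared criteria concrete, here is a minimal NumPy sketch of the three simplest baselines (max response, first k, and random); the sparse vector criterion of [15] requires solving a LASSO problem and is omitted. The function names and the (n, c, k, k) weight layout are our own conventions.

```python
import numpy as np

def max_response_select(W, keep):
    """Keep the `keep` filters with the largest absolute weight sum [25]."""
    scores = np.abs(W).sum(axis=(1, 2, 3))        # W: (n, c, k, k)
    return np.sort(np.argsort(scores)[-keep:])

def first_k_select(W, keep):
    """Naive baseline: keep the first `keep` filters."""
    return np.arange(keep)

def random_select(W, keep, seed=0):
    """Naive baseline: keep `keep` filters chosen uniformly at random."""
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(W.shape[0], size=keep, replace=False))
```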

Similar to [15], we extracted 10 samples per class to prune channels and minimize the reconstruction errors. Images were resized so that the shorter side is 256 pixels, random crops were taken and fed into the network, and testing was performed on a center crop of each image. The self-expressiveness matrices of the convolutional layers were learned with a mini-batch size of 128, gradually decreasing the learning rate over the training epochs. After pruning the filters, we used a batch size of 64 and gradually decreased the learning rate to minimize the reconstruction error until the loss stopped decreasing. All parameters were optimized with Adam [21]. We pruned three convolutional layers, i.e., conv3_1, conv4_1 and conv4_2, with the aforementioned methods, including ours, under several speed-up ratios. The results are shown in Figure 4.

As expected, the accuracy loss grows with the speed-up ratio, i.e., the error increases as the speed-up ratio increases. Under the same speed-up ratio, our approach consistently outperforms the other methods across the different convolutional layers. This shows that our subspace clustering based pruning method retains more representative information, which allows the feature maps to be reconstructed more accurately. Although the key idea of the kmeans option is also clustering, it cannot exploit the linear relationship between feature maps and thus produces a coarse clustering result. Nevertheless, kmeans is consistently better than the two naive approaches, indicating that a clustering based pruning strategy is feasible in practice. max response incurs a high accuracy loss, sometimes even worse than first k, probably because it ignores correlations between different filters. The random selection option shows good performance, even better than the heuristic methods in some cases; however, it is not robust for feature map reconstruction, making it inapplicable in practice. In summary, the weaknesses of the naive pruning strategies imply that proper filter selection is vital for filter pruning.

It is also noticeable that pruning gradually becomes more difficult from shallow to deep layers. This indicates that, whereas shallow layers contain much more redundancy, deeper layers contribute more to the final performance, which is consistent with the observations in [39] and [15]. It is therefore preferable to prune more parameters in shallow layers than in deep layers when accelerating the model. Moreover, Figure 4 shows that our filter pruning method leads to a smaller increase of error than the other strategies when the deeper layers are compressed.

4.1.2 Whole model pruning.

Method | decrease of top-5 accuracy (%) at the three speed-up ratios (low / mid / high)
Jaderberg et al. [18] | – / 9.7 / 29.7
Asym. [39] | 0.28 / 3.84 / –
Filter pruning [25] (fine-tuned) | 0.8 / 8.6 / 14.6
He et al. [15] (without fine-tuning) | 2.7 / 7.9 / 22.0
Ours (without fine-tuning) | 2.6 / 3.7 / 8.7
He et al. [15] (fine-tuned) | 0 / 1.0 / 1.7
Ours (fine-tuned) | 0 / 0.5 / 1.1
Table 1: Accelerating the VGG-16 model under three increasing speed-up ratios. The results show the decrease of top-5 validation accuracy (%, 1-view) relative to the unpruned baseline.

The whole-model acceleration results under the three speed-up ratios are shown in Table 1. First, we applied our approach layer by layer sequentially. Then, the pruned model was fine-tuned for 10 epochs with a fixed learning rate to regain accuracy. We augmented the data by random cropping and mirroring of the cropped patches; other parameters were the same as in the single-layer pruning experiment. Since the last group of convolutional layers affects the classification most significantly, these layers were not pruned. After filter pruning and reconstruction, our approach outperforms the sparse vector method [15] by a large margin, which is consistent with the results of the single-layer analysis. In addition, our approach produces more compact models since we do not impose constraints on the ratios of remaining channels for shallow layers (conv1_x to conv3_x) and deep layers (conv4_x) as required in [15]. After fine-tuning, our method incurs no accuracy drop at the lowest speed-up ratio, and at the two higher ratios the accuracy drops by only 0.5% and 1.1%, respectively. Our approach thus outperforms the state-of-the-art filter-level pruning approaches [25] and [15]. This is because our method retains as much representative information as possible by exploring the linear relationship between feature maps via subspace clustering, thus recovering a better approximation of the original data in the subsequent output volume.

4.2 Experiments on ResNet

We also tested our method on the multi-path network ResNet [14], selecting ResNet-50 as a representative of the ResNet family. During the implementation, we merged batch normalization [17] into the convolutional weights; this does not affect the outputs of the network, and afterwards each convolutional layer is directly followed by a ReLU [28]. Since ResNet-50 consists of residual blocks, we pruned each block step by step, i.e., we pruned ResNet-50 from block 2a to 5c sequentially. In this experiment, for each block we only pruned the convolutional layers that learn the residual mapping. Therefore, for simplicity we only pruned the first two layers of each block, leaving the block output and the projection shortcuts unchanged. Pruning these parts may lead to further compression but can be quite difficult, if not entirely impossible; we leave this exploration as future work. After each block had been pruned, we used Adam [21] with a mini-batch size of 64 and gradually decreased the learning rate to minimize the reconstruction error until the loss stopped decreasing. The model was then fine-tuned for 20 epochs with a fixed learning rate to regain accuracy.

Method | increased top-5 err. (%)
He et al. [15] | 8.0
Ours (without fine-tuning) | 5.2
He et al. [15] (enhanced) | 4.0
He et al. [15] (enhanced, fine-tuned) | 1.4
Ours (fine-tuned) | 1.0
Table 2: Acceleration of ResNet-50 on ILSVRC-12. The results show the increase of top-5 error (one view) relative to the baseline network.

The results of accelerating ResNet-50 are presented in Table 2. Our approach outperforms the state-of-the-art method [15] both before and after fine-tuning. In addition, while pruning ResNet-50, He et al. [15] kept 70% and 30% of the channels for sensitive residual blocks and for the other blocks, respectively; our approach, without these constraints, is simpler and more efficient. Our pruning strategy obtains more representative filters by eliminating redundancy in the feature map subspaces, enabling the reconstruction error to be better minimized.

4.3 Generalization Capability of the Pruned Model

To explore the generalization capability of our method, we ran experiments on two challenging vision tasks: object detection and human pose estimation. We used Faster R-CNN [30] on PASCAL VOC 2007 for the former task and CMU-pose [3] on MSCOCO-14 [26] for the latter. Both networks were accelerated by our approach under two speed-up ratios. The performance is evaluated in terms of mean Average Precision (mAP).

4.3.1 Acceleration for object detection.

Method | mAP | mAP drop
Baseline | 68.7 | –
He et al. [15] (lower speed-up) | 68.3 | 0.4
Ours (lower speed-up) | 68.5 | 0.2
He et al. [15] (higher speed-up) | 66.9 | 1.8
Ours (higher speed-up) | 67.7 | 1.0
Table 3: Acceleration of Faster R-CNN detection under two speed-up ratios.

For convenience, we compressed the Faster R-CNN model with VGG-16 as its backbone. Since all the prunable convolutional layers are those of VGG-16, we used the same parameters as in our VGG-16 experiment. To compare with the alternative approach fairly, we followed the settings in [15]: we first performed channel pruning on VGG-16 on ImageNet and then used the pruned model as the pre-trained model for Faster R-CNN. The acceleration is evaluated on the PASCAL VOC 2007 object detection benchmark [8], which contains 5k training images and 5k testing images. From Table 3, we observe mAP drops of only 0.2 and 1.0 with our models under the two speed-up ratios, outperforming the method of He et al. [15]. Such small mAP drops have no significant negative effect in real applications, while bringing substantial benefits in efficiency and model complexity reduction.

4.3.2 Acceleration for human pose estimation.

Speed-up | mAP | mAP drop
Baseline | 57.6 | –
Lower ratio | 56.8 | 0.8
Higher ratio | 55.7 | 1.9
Table 4: Acceleration of CMU-pose human pose estimation under two speed-up ratios.

CMU-pose [3] is a bottom-up approach for multi-person 2D pose estimation. It simultaneously predicts heat maps and part affinity fields (PAFs) for body parts and body limbs, respectively, and then joins corresponding detections into the same person group using the associated PAFs. The architecture consists of two parts. The first 10 layers of VGG-19 [32] are used as the first part of the network to extract features. In the second part, the network is split into two branches: one branch predicts the confidence maps, and the other predicts the affinity fields. Each branch is an iterative prediction architecture which refines the predictions over successive stages with intermediate supervision at each stage. Since there is no redundancy in the last convolutional layer of each stage (i.e., conv5_5_CPM_Lx and Mconv7_stagex_Lx), we pruned the remaining convolutional layers in the same manner as in the single-layer pruning strategy. Similar to our VGG-16 experiment, we randomly selected a set of samples for filter pruning and reconstruction. After pruning and reconstruction, the model was fine-tuned for several epochs with a fixed learning rate; other parameters were the same as in our VGG-16 pruning experiment. The CMU-pose compression results under the two speed-up ratios are shown in Table 4. The model accelerated at the lower ratio loses only 0.8% mAP, which showcases the effectiveness of our method.

5 Conclusion

Current deep CNNs are accurate but come with high inference costs. In this paper, we have presented a novel filter pruning method for deep neural networks. Based on the observation that there is a linear relationship in different feature map subspaces, we eliminate the redundancy in convolutional filters by applying subspace clustering to feature maps. Unlike existing filter pruning methods which directly remove filters based on their importance, our approach retrieves the representative information according to the linear relationship in feature map subspaces, so that the most important information can be retained by the mean of each cluster. Our method only requires off-the-shelf libraries. The reduced CNNs are inference-efficient networks that maintain accuracy. Compelling speed-ups and accuracy are demonstrated for both VGG-Net and ResNet on ILSVRC-12. Moreover, experiments on other computer vision tasks also show the practical feasibility of our compression method.

References

  • [1] Anwar, S., Sung, W.: Compact deep convolutional neural networks with coarse pruning. arXiv preprint arXiv:1610.09639 (2016)
  • [2] Ba, J., Caruana, R.: Do deep nets really need to be deep? In: Advances in Neural Information Processing Systems. pp. 2654–2662 (2014)
  • [3] Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2d pose estimation using part affinity fields. In: Computer Vision and Pattern Recognition. pp. 1–9 (2017)

  • [4] Chen, W., Wilson, J., Tyree, S., Weinberger, K., Chen, Y.: Compressing neural networks with the hashing trick. In: International Conference on Machine Learning. pp. 2285–2294 (2015)

  • [5] Cheng, Y., Yu, F.X., Feris, R.S., Kumar, S., Choudhary, A., Chang, S.F.: An exploration of parameter redundancy in deep networks with circulant projections. In: International Conference on Computer Vision. pp. 2857–2865 (2015)
  • [6] Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In: Computer Vision and Pattern Recognition. pp. 1–9 (2017)
  • [7] Elhamifar, E., Vidal, R.: Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(11), 2765–2781 (2012)
  • [8] Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The PASCAL visual object classes challenge 2007 (VOC 2007) results (2007)
  • [9] Figurnov, M., Ibraimova, A., Vetrov, D.P., Kohli, P.: Perforatedcnns: Acceleration through elimination of redundant convolutions. In: Advances in Neural Information Processing Systems. pp. 947–955 (2016)
  • [10] Gong, Y., Liu, L., Yang, M., Bourdev, L.: Compressing deep convolutional networks using vector quantization. Computer Science (2014)
  • [11] Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In: International Conference on Learning Representations. pp. 1–13 (2016)
  • [12] Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. In: Advances in Neural Information Processing Systems. pp. 1135–1143 (2015)
  • [13] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: International Conference on Computer Vision. pp. 2980–2988 (2017)
  • [14] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition. pp. 770–778 (2016)
  • [15] He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural networks. In: International Conference on Computer Vision. pp. 1–9 (2017)
  • [16] Hu, H., Peng, R., Tai, Y.W., Tang, C.K.: Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250 (2016)
  • [17] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning. pp. 448–456 (2015)
  • [18] Jaderberg, M., Vedaldi, A., Zisserman, A.: Speeding up convolutional neural networks with low rank expansions. In: British Machine Vision Conference. pp. 1–12 (2014)
  • [19] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: International Conference on Multimedia. pp. 675–678 (2014)
  • [20] Kim, Y.D., Park, E., Yoo, S., Choi, T., Yang, L., Shin, D.: Compression of deep convolutional neural networks for fast and low power mobile applications. In: International Conference on Learning Representations. pp. 1–13 (2016)
  • [21] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations. pp. 1–14 (2015)
  • [22] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 1097–1105 (2012)
  • [23] Lebedev, V., Ganin, Y., Rakhuba, M., Oseledets, I., Lempitsky, V.: Speeding-up convolutional neural networks using fine-tuned cp-decomposition. Computer Science (2014)
  • [24] Lebedev, V., Lempitsky, V.: Fast convnets using group-wise brain damage. In: Computer Vision and Pattern Recognition. pp. 2554–2564 (2016)
  • [25] Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient convnets. In: International Conference on Learning Representations. pp. 1–13 (2017)
  • [26] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision. pp. 740–755 (2014)
  • [27] Luo, J.H., Wu, J., Lin, W.: Thinet: A filter level pruning method for deep neural network compression. In: International Conference on Computer Vision. pp. 1–9 (2017)
  • [28] Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In: International Conference on Machine Learning. pp. 807–814 (2010)

  • [29] Newell, A., Huang, Z., Deng, J.: Associative embedding: End-to-end learning for joint detection and grouping. In: Advances in Neural Information Processing Systems. pp. 2274–2284 (2017)
  • [30] Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems. pp. 91–99 (2015)
  • [31] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115(3), 211–252 (2015)
  • [32] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations. pp. 1–13 (2015)
  • [33] Sindhwani, V., Sainath, T., Kumar, S.: Structured transforms for small-footprint deep learning. In: Advances in Neural Information Processing Systems. pp. 3088–3096 (2015)
  • [34] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., et al.: Going deeper with convolutions. In: Computer Vision and Pattern Recognition. pp. 1–9 (2015)
  • [35] Wang, Y., Xu, C., Xu, C., Tao, D.: Beyond filters: Compact feature map for portable deep model. In: International Conference on Machine Learning. pp. 3703–3711 (2017)
  • [36] Wen, W., Wu, C., Wang, Y., Chen, Y., Li, H.: Learning structured sparsity in deep neural networks. In: Advances in Neural Information Processing Systems. pp. 2074–2082 (2016)
  • [37] Wu, J., Leng, C., Wang, Y., Hu, Q., Cheng, J.: Quantized convolutional neural networks for mobile devices. In: Computer Vision and Pattern Recognition. pp. 4820–4828 (2016)
  • [38] Zagoruyko, S., Komodakis, N.: Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In: International Conference on Learning Representations. pp. 1–14 (2017)
  • [39] Zhang, X., Zou, J., He, K., Sun, J.: Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(10), 1943–1955 (2016)