1 Introduction
With the availability of huge volumes of labeled data and the tremendous power of graphical processing units (GPUs) and parallel computation, convolutional neural networks (CNNs) have achieved state-of-the-art performance in a wide variety of computer vision tasks, such as image classification
[14], object detection [30], image segmentation [13], and human pose estimation
[29]. As flexible function approximators scaling to millions of parameters, CNNs can extract high-level and more discriminative features compared with traditional elaborately handcrafted ones. However, modern CNNs rely heavily on intensive computing and memory resources despite their overwhelming success. For instance, ResNet50 [14] has more than 50 convolutional layers, requiring over 95 MB of storage and over 3.8 billion floating-point multiplications to process an image. The VGG16 model [32]
has 138.34 million parameters, taking up more than 500 MB of storage space, and needs 30.94 billion floating point operations (FLOPs) to classify a single image. It is very difficult to deploy these complex CNN models in scenarios where computing resources are constrained, i.e., where a task must be completed with limited computing time, storage space, and battery power.
Both academia and industry have developed methods to reduce the number of parameters in CNNs. Ba et al. [2] used class probabilities produced by a pre-trained model as “soft targets” to train a tiny network, successfully transferring the cumbersome model to a compact one while maintaining its generalization capability. The student-teacher paradigm in [2] has shown its effectiveness in compressing CNNs; however, devising a new tiny network is not a trivial task. Moreover, how to define the inherent “knowledge” in the teacher model remains an open problem. Tensor factorization based methods
[23, 20, 18, 10] factorize an over-parameterized convolutional layer into several light ones. However, decomposing the convolutions favoured by modern CNNs (e.g., GoogLeNet [34], ResNet [14], and Xception [6]) is still an intractable problem. Moreover, tensor decomposition techniques make the target network deeper, incurring more convolution operations. Filter pruning has been proposed to address this issue. Since the network architecture is unchanged after filter pruning, the obtained model is compatible with any off-the-shelf deep learning framework. In addition, since the volumes of both the convolutional kernels and the intermediate activations shrink, the required memory is reduced remarkably. This strategy also allows complementary compression methods to be employed to gain an even more compact model. These advantages have led to increasing research attention on filter pruning. He et al. [15]
learned a sparse weight vector to measure the importance of filters by applying LASSO regression. Luo et al. [27] used statistical information computed from the next layer to guide the filter pruning of the current layer in a greedy way. Both methods directly prune some filters; clearly, the information contained in pruned filters can no longer be utilized once they are removed. In this paper, we propose a novel filter pruning method that leverages the linear relationship in feature map subspaces. As shown in Figure 1, since the different feature maps of a convolutional layer originate from the same image through different convolutional filters, which are linear operators, the outputs are linearly dependent within different subspaces. Among various subspace analysis approaches, subspace clustering is an excellent method for clustering linearly distributed data in different subspaces. Motivated by this, instead of measuring the importance of filters [25, 16] or feature maps [15, 27, 1] and subsequently removing the trivial kernels, we seek the most representative information by casting the filter selection problem as a subspace clustering problem on intermediate activations. We first cluster the feature maps into subspaces. This allows us to cluster the corresponding filter channels in the next layer, which take these feature maps as input, as well as the filters in the upper layer that produce these feature maps. We then iterate this process to prune the whole network layer by layer.
In summary, this paper makes the following contributions:

- We propose a novel filter pruning method based on the linear relationship in feature map subspaces to compress and accelerate CNN models. To the best of our knowledge, this is the first work to investigate clustering methods for CNN model compression, and also the first to employ subspace clustering to accelerate CNNs. We can prune the redundant information in feature maps while simultaneously retaining the most representative information.

- We devise a flexible filter pruning framework that is independent of the network structure. Thus our method can be well supported by any off-the-shelf deep learning library.

- Experiments demonstrate that, compared to the original heavy network, the learned portable network achieves comparable accuracy with significantly lower memory usage and computational cost. We achieve consistent improvements on various tasks, exceed other filter pruning works, and obtain state-of-the-art results.
2 Related Work
Recently, several works on CNN acceleration have focused on network pruning thanks to its apparent benefits. Along this line, various strategies have been adopted, e.g., fine-grained pruning [12], group-level pruning [36, 24], filter-level pruning [16, 25], layer-level pruning [38] and feature map pruning [9, 1, 15, 27]. Han et al. [12] introduced a sparsity regularization approach to identify and remove connections with small weights. The major drawback of this fine-grained pruning is the loss of universality and flexibility due to the unstructured pruned parameters, which heavily hinders the pruned models from being transferred to real applications.
Group-level pruning approaches alleviate the above problem by learning structured sparse patterns. Lebedev et al. [24] used a group-wise brain damage process to sparsify convolution kernels, generating one sparsity pattern per group (2D kernels) in convolutional layers; entire groups with small weights can then be removed. Similarly, Wen et al. [36] proposed the Structured Sparsity Learning (SSL) method to regularize filter, channel, filter shape and depth structures. Zagoruyko et al. [38] demonstrated a layer-level pruning technique: for a network consisting of multiple homogeneous stages (each stage being a set of convolutional blocks), some stages are removed by merging attention maps into a specific cost function. This provides an approach to combining network pruning with knowledge distillation. Figurnov et al. [9] explored feature map pruning, which only kept a subset of rows in the patch matrix by using structured sparsity patterns, i.e., a perforation mask [9]
, and interpolated the missing output values. The perforation mask was pre-defined and could have a grid or pooling structure. However, this method only shortens the inference time and does not compress the model. Similar to [24], it is only supported by deep learning frameworks that reduce generalized convolution to matrix multiplication. Compared with the aforementioned pruning strategies, filter-level pruning is more efficient in accelerating very deep neural networks. Consider two consecutive convolutional blocks, which are indispensable in all CNNs: after filter pruning of the former block, the number of input channels of the latter block is also reduced, while the shape of the output produced by the latter block remains unchanged. By minimizing the reconstruction errors of feature maps, the outputs of the CNN endpoints are retained. It is thus vital to determine which filters should be eliminated. Some methodologies rely on kernel importance; an intuitive choice is the magnitude of weights. Li et al. [25] calculated the absolute weight sum of each filter as its importance score. Denseness in filters is an alternative: Hu et al. [16] depicted the significance of each filter by calculating the Average Percentage of Zeros (APoZ) in it.
Another type of approach tackles the filter selection challenge by converting it into channel selection on feature maps. Anwar et al. [1] performed over 100 random trials of channel selection. However, each trial is time-consuming and laborious, so inflexibility becomes a common problem on very deep models and large datasets. ThiNet [27] used statistical information computed from the next layer to guide the filter pruning of the current layer; the pruned convolutional layer was forced to mimic the original one by minimizing the reconstruction errors of feature maps. Although the works of He et al. [15] and Luo et al. [27] have similar workflows, their channel selection strategies differ. The main insight of [15] was to learn a weight vector for the feature maps: the weight vector is optimized for channel selection with the convolutional filters fixed, and then the convolutional filters are used to reconstruct the error with the weight vector fixed. In practice, since this two-step iteration is time-consuming, the weight vector is optimized multiple times and the filters only once to obtain the final result.
In addition, there are a variety of techniques for compressing convolutional filters. A representative approach is low-rank approximation [23, 20, 18, 10], which breaks a convolutional layer into several small pieces by applying tensor decomposition strategies, e.g., CP decomposition [23] and Tucker decomposition [20]. Other methods include parameter quantization [10, 4, 37, 11] and structural matrix design [35, 5, 33].
3 Proposed Method
In this section, we describe a novel filter pruning method based on the linear relationship in feature map subspace. We first introduce the overall framework, then present the details of each step. Finally, we describe our pruning strategy which takes both efficiency and effectiveness into consideration.
3.1 Overall Framework
Filter pruning is an effective method for reducing the complexity of neural networks. There are two key points in filter pruning. The first is filter selection, i.e., we need to seek the most representative filters so as to retain as much information as possible. The second is reconstruction, i.e., the following feature maps shall be reconstructed using the clustered filters. The main difference between our method and previous works is the filter selection strategy. Most existing methods directly prune filters that make weak contributions to the network; the drawback is that the information of the pruned filters cannot be further utilized, which degrades the feature map reconstruction. Our method, on the other hand, utilizes a subspace clustering algorithm to reduce the number of feature maps, which simultaneously eliminates the redundant feature maps and retains the most representative information. To prune the input feature maps from c channels to a desired number c′, we group the feature maps into c′ clusters. Then we calculate the average of the corresponding filters of each feature map cluster. A pre-trained model is pruned layer by layer with a pre-defined compression rate.
We summarize our filter pruning method for a single convolutional layer in Figure 2. We aim to prune the filters in layer i and layer i+1. Once the feature maps of layer i are clustered, we can cluster the corresponding filters in layer i and the corresponding channels of the filters in layer i+1. The method has the following key steps:

1. Feature map clustering. Since the different feature maps of a convolutional layer are generated from the same image through different convolutional filters, which are linear operations, the output feature maps are linearly dependent within different subspaces. Therefore, we leverage a subspace clustering algorithm to cluster the feature maps.

2. Filter clustering and reconstruction. After subspace clustering, we can cluster the corresponding input channels of the filters in the next layer, which take these feature maps as input. The filters in the upper layer that produce these feature maps can also be clustered. Then we reconstruct the following feature maps using the pruned filters. Note that the pruned network has exactly the same structure but with fewer filters; in other words, the original thick network becomes a much thinner model.

3. Fine-tuning. Fine-tuning is a necessary step to recover the generalization ability affected by filter pruning, but it is time-consuming on large datasets and complex models. For efficiency, we fine-tune for only part of the epochs after all pruned feature maps have been reconstructed.

4. Return to step 1 to prune the next layer.
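The steps above can be sketched end-to-end on a toy layer of 1×1 convolutions, where convolution reduces to a matrix product. This is a minimal NumPy illustration, not the paper's implementation: the correlation-threshold grouping below is a deliberately crude stand-in for the subspace clustering of Section 3.2, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy layer: c = 4 feature maps over n pixels, where maps 0,1 and maps 2,3
# are proportional pairs, i.e. they lie in the same 1-D subspaces.
n = 200
base = rng.standard_normal((2, n))
X = np.stack([base[0], 2.0 * base[0], base[1], -3.0 * base[1]])  # (4, n)

# Step 1 (stand-in for subspace clustering): group maps whose absolute
# correlation is ~1, i.e. maps that are linearly dependent.
corr = np.abs(np.corrcoef(X))
clusters, seen = [], set()
for i in range(X.shape[0]):
    if i in seen:
        continue
    members = [j for j in range(X.shape[0]) if corr[i, j] > 0.99]
    seen.update(members)
    clusters.append(members)

# Step 2: average the clustered feature maps, then recover pruned 1x1
# filters for the next layer by least squares (reconstruction step).
n_out = 3
W = rng.standard_normal((n_out, 4))
Y = W @ X                                                   # original outputs
X_pruned = np.stack([X[m].mean(axis=0) for m in clusters])  # (c', n)
W_pruned, *_ = np.linalg.lstsq(X_pruned.T, Y.T, rcond=None)
Y_rec = W_pruned.T @ X_pruned

print(len(clusters), np.allclose(Y, Y_rec))  # → 2 True
```

Because the pruned maps span the same subspaces as the original ones, the reconstruction here is exact; on real feature maps the dependence is only approximate, which is why the fine-tuning step follows.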
3.2 Filter Pruning
We propose a two-step algorithm for filter pruning. In the first step, we seek the most representative filters: since there is a linear relationship among feature maps in different subspaces, we utilize a subspace clustering algorithm to estimate average feature maps that contain as much representative information as possible, and the corresponding filters of each feature map cluster are then averaged. In the second step, we reconstruct the following feature maps from the averaged filters with linear least squares.
Formally, we use Y = W * X to denote the convolution process in layer i, where X is the input tensor with c feature maps of size h × w, and W is a set of n filters with kernels of dimension c × k × k, which generates a new tensor Y with n feature maps.
3.2.1 Subspace clustering.
To prune the channels of the feature maps from c to a desired number c′, we cluster the feature maps into c′ clusters. Since the different feature maps are generated from the same image, and convolution is a linear operation, the output feature maps of the image are linearly dependent within different subspaces, i.e., one feature map can be expressed as a linear combination of the other feature maps in the same subspace. This property is called self-expressiveness. We leverage a subspace clustering algorithm [7] to cluster the feature maps. Mathematically, this idea is formalized as the optimization problem
min_C ‖C‖_1   s.t.   X = CX,  diag(C) = 0,        (1)
where X ∈ R^{c×hw} is the reshaped input tensor (one feature map per row), and C ∈ R^{c×c} is the self-expressiveness matrix. The subspace clustering algorithm is summarized in Algorithm 1.
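As a minimal sketch of this step (NumPy assumed; the paper's Algorithm 1 follows [7]), the self-expressive coefficients can be approximated with ridge-regularized least squares in place of the ℓ1 objective of Eq. (1), and the spectral step simplified to connected components of the affinity graph — both simplifications of, not substitutes for, sparse subspace clustering:

```python
import numpy as np

def self_expressive_coefficients(X, lam=1e-2):
    """For each feature map (row of X), express it as a linear combination
    of the other maps via ridge-regularized least squares (an l2 stand-in
    for the l1 objective of Eq. (1)); the diagonal of C stays zero."""
    c = X.shape[0]
    C = np.zeros((c, c))
    for i in range(c):
        idx = [j for j in range(c) if j != i]
        A = X[idx].T                        # (d, c-1) columns = other maps
        # ridge solution: w = (A^T A + lam I)^{-1} A^T x_i
        w = np.linalg.solve(A.T @ A + lam * np.eye(c - 1), A.T @ X[i])
        C[i, idx] = w
    return C

def cluster_from_affinity(C, thresh=1e-3):
    """Group feature maps by connected components of the symmetrized
    affinity |C| + |C|^T (a simplification of the spectral step)."""
    A = (np.abs(C) + np.abs(C).T) > thresh
    labels = -np.ones(A.shape[0], dtype=int)
    cur = 0
    for s in range(A.shape[0]):
        if labels[s] >= 0:
            continue
        stack, labels[s] = [s], cur
        while stack:
            u = stack.pop()
            for v in np.nonzero(A[u])[0]:
                if labels[v] < 0:
                    labels[v] = cur
                    stack.append(v)
        cur += 1
    return labels
```

On maps drawn from orthogonal subspaces, the cross-subspace coefficients vanish, so the affinity graph decomposes into one component per subspace.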
3.2.2 Filter clustering.
After clustering the feature maps into c′ clusters, we denote the index sets of the clusters by S_1, …, S_{c′}. Then, we can prune the channels of the filters in layer i+1 by calculating the average channel of each cluster. For the j-th filter W_j, the m-th average channel is calculated through the clustering index S_m:
W̄_j^m = (1/|S_m|) Σ_{t∈S_m} W_j^t,        (2)
where W_j^t is the t-th channel of filter W_j, and |S_m| is the number of elements in S_m. The pruned j-th filter is then the concatenation of W̄_j^1, …, W̄_j^{c′}. For j = 1, …, n, we obtain the pruned filters using Eq. (2); the pruned filters for layer i+1 are then W̄_1, …, W̄_n.
Naturally, the filters F_1, …, F_c of the upper layer i that produce the feature maps can also be clustered:
F̄_m = (1/|S_m|) Σ_{t∈S_m} F_t,        (3)
where F_t is the t-th filter of layer i, and |S_m| is the number of elements in S_m. The result is F̄_1, …, F̄_{c′}, where c′ is the number of filters remaining in layer i.
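Eqs. (2) and (3) amount to channel-wise and filter-wise averaging over each cluster; a NumPy sketch (shapes and names are illustrative, with filters stored as (output channels, input channels, k, k)):

```python
import numpy as np

def prune_next_layer(W_next, clusters):
    """Eq. (2) sketch: average the input channels of every next-layer
    filter over each feature-map cluster, (n, c, k, k) -> (n, c', k, k)."""
    return np.stack([W_next[:, idx, :, :].mean(axis=1) for idx in clusters],
                    axis=1)

def prune_upper_layer(W_up, clusters):
    """Eq. (3) sketch: average whole filters of the layer that produced
    the clustered feature maps, (c, c_in, k, k) -> (c', c_in, k, k)."""
    return np.stack([W_up[idx].mean(axis=0) for idx in clusters], axis=0)
```

The pruned next-layer filter keeps one averaged channel per cluster, and the upper layer keeps one averaged filter per cluster, so both layers shrink consistently from c to c′ channels.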
3.2.3 Reconstruction error minimization.
We reconstruct the output feature maps from the pruned filters by linear least squares. This problem can be formulated as:
min_{W̄} ‖Y − W̄ * X̄‖_F²,        (4)
where ‖·‖_F is the Frobenius norm and * is the convolution operation. X̄ denotes the feature maps produced by the pruned layer i. The complete filter pruning process is summarized in Algorithm 2.
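For 1×1 convolutions the convolution in Eq. (4) reduces to a matrix product, so the minimization becomes an ordinary least-squares problem; a sketch under that simplifying assumption (names are illustrative):

```python
import numpy as np

def reconstruct_filters(X_pruned, Y):
    """Solve min_W ||Y - X_pruned @ W||_F^2, i.e. Eq. (4) specialized to
    1x1 convolutions. X_pruned: (n_pixels, c') pruned feature maps,
    Y: (n_pixels, n) original outputs; returns the (c', n) filter matrix."""
    W, *_ = np.linalg.lstsq(X_pruned, Y, rcond=None)
    return W

# Usage: filters are recovered exactly when Y truly lies in the span
# of the pruned feature maps.
rng = np.random.default_rng(0)
X_pruned = rng.standard_normal((100, 3))
W_true = rng.standard_normal((3, 2))
Y = X_pruned @ W_true
W_rec = reconstruct_filters(X_pruned, Y)
```

For k×k kernels the same solve applies after unrolling input patches into rows (im2col), which is how Caffe-style frameworks lower convolution to matrix multiplication.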
3.3 Pruning Strategy
Network architectures can be divided into two types: traditional single-path and multi-path convolutional architectures. AlexNet [22] and VGGNet [32] are representatives of the former, while the latter mainly includes recent networks equipped with novel structures, such as the Inception modules in GoogLeNet [34] or the residual blocks in ResNet [14].
We use different strategies to prune these two types of networks. For VGG16, we apply the single layer pruning strategy to the convolutional layers step by step. For ResNet, some restrictions arise from its special structure. For example, the channel numbers of the residual learning branch and the identity mapping branch in the same block need to be consistent in order to perform the sum operation, so it is hard to directly prune the last convolutional layer in the residual learning branch. Since most parameters appear in the first two layers of each block, pruning the first two layers is a feasible option, as illustrated in Figure 3.
4 Experiment
Our method is tested on combinations of three popular CNN models with three benchmark datasets: VGG16 [32] on ILSVRC-12 [31] and PASCAL VOC 2007 [8], ResNet50 [14] on ILSVRC-12 [31], and CMU-pose [3] on MS-COCO14 [26].
First, we compare several filter selection strategies, including ours, by pruning a single layer of VGG16 [32] on ILSVRC-12 to demonstrate the efficiency of our algorithm, followed by whole model pruning of VGG16 [32]. Second, we show the performance of pruning a network with residual architecture, for which ResNet50 [14] is selected. Finally, we apply our method to Faster R-CNN [30] and CMU-pose [3] to evaluate the generalization capability of our algorithm on the challenging vision tasks of object detection and human pose estimation. All experiments were implemented within Caffe
[19]. The performance of ConvNet compression is evaluated under different speedup ratios. Assume that n is the number of filters in the original layer and n′ is that of the pruned layer; then the speedup ratio is defined as
r = n / n′.        (5)
4.1 Experiments on VGG16
VGG16 [32] is a classic single-path CNN with 13 convolutional layers, which has been widely used in vision tasks as a powerful feature extractor. We use single layer pruning and whole model pruning to evaluate the efficiency of our method. The effectiveness is measured by the decrease in top-5 accuracy on the ILSVRC-12 [31] validation dataset relative to the original VGG16 [32] model.
4.1.1 Single layer pruning.
We first evaluate the single layer acceleration performance of our method, comparing our approach with several existing filter selection strategies. sparse vector [15] preserves filters according to importance scores learned by a sparsity regularization method. max response [25] selects channels based on corresponding filters having a high absolute sum of weights. To differentiate our approach from common clustering algorithms, we also select k-means as a baseline. In addition, to validate the necessity of an elaborate hand-crafted filter selection algorithm, we take two naive criteria into consideration: first k selects the first k channels, and random randomly selects a fixed number of filters. After filter pruning, feature map reconstruction is performed without the fine-tuning step. The effectiveness of the methods is measured by the reduction in top-5 accuracy on the validation dataset after the reconstruction procedure is accomplished.
Similar to [15], we extracted 10 samples per class, i.e., a total of 10,000 images, to prune channels and minimize reconstruction errors. Images were resized such that the shorter dimension is 256; random cropping was then applied and the resulting image patches were fed into the network. Testing was performed on a crop at the center of the image. The self-expressiveness matrices for the convolutional layers were learned with a mini-batch size of 128 and a gradually decayed learning rate. After pruning filters, we used a batch size of 64 and decayed the learning rate to minimize the reconstruction error until the loss stopped dropping. All parameters were optimized with Adam [21]. We pruned three convolutional layers, i.e., conv3_1, conv4_1 and conv4_2, with the aforementioned methods, including ours, under several speedup ratios. The results are shown in Figure 4.
As expected, the accuracy loss increases with the speedup ratio. With the same speedup ratio, our approach consistently outperforms the other methods across different convolutional layers. This shows that our subspace clustering based pruning method retains more representative information, enabling the feature maps to be reconstructed more effectively. Although the key idea of the k-means option is also clustering, it cannot exploit the linear relationship between feature maps and thus obtains a coarse clustering result. Nevertheless, the performance of k-means is consistently better than the two naive approaches, indicating that a clustering based pruning strategy is feasible in practice. max response incurs a high accuracy loss, sometimes even worse than first k, probably because it ignores correlations between different filters. The random selection option shows good performance, even better than the heuristic methods in some cases; however, it is not robust in feature map reconstruction, making it inapplicable in practice. In summary, the naive pruning strategies show clear weaknesses, which implies that proper filter selection is vital for filter pruning.
It is also noticeable that pruning gradually becomes more difficult from shallow to deep layers. This indicates that whereas shallow layers have much more redundancy, deeper layers contribute more to the final performance, which is consistent with the observations in [39] and [15]. It is therefore preferable to prune more parameters in shallow layers rather than deep layers when accelerating a model. Moreover, Figure 4 shows that our filter pruning method leads to a smaller increase in error compared with the other strategies when the deeper layers are compressed.
4.1.2 Whole model pruning.
Table 1: Increase of top-5 error (%) for whole model acceleration of VGG16 on ILSVRC-12.

Method                                  2×     4×     5×
Jaderberg et al. [18]                   -      9.7    29.7
Asym. [39]                              0.28   3.84   -
Filter pruning [25] (fine-tuned)        0.8    8.6    14.6
He et al. [15] (without fine-tuning)    2.7    7.9    22.0
Ours (without fine-tuning)              2.6    3.7    8.7
He et al. [15] (fine-tuned)             0      1.0    1.7
Ours (fine-tuned)                       0      0.5    1.1
The whole model acceleration results under 2×, 4×, and 5× speedup are presented in Table 1. First, we applied our approach layer by layer sequentially. Then, the pruned model was fine-tuned for 10 epochs with a fixed learning rate to regain accuracy. We augmented the data by random cropping and mirroring of the cropped patches. Other parameters were the same as in our single layer pruning experiment. Since the last group of convolutional layers (i.e., conv5_x) affects the classification more significantly, these layers were not pruned. After filter pruning and reconstruction, our approach outperforms the sparse vector method [15] by a large margin, which is consistent with the results of the single layer analysis. In addition, our approach produces more compact models, since we do not impose the constraints on the ratios of remaining channels for shallow layers (conv1_x to conv3_x) and deep layers (conv4_x) required in [15]. After fine-tuning, our method achieves 2× speedup without any decrease in accuracy; under 4× and 5×, the accuracy drops by only 0.5% and 1.1% respectively. Our approach outperforms the state-of-the-art filter-level pruning approaches ([25] and [15]). This is because our method retains as much representative information as possible by exploring the linear relationship between feature maps via subspace clustering, thus recovering a better approximation to the original data in the subsequent output volume.
4.2 Experiments on ResNet
We also tested our method on the recently proposed multi-path network ResNet [14], selecting ResNet50 as a representative of the ResNet family. During the implementation, we merged batch normalization [17] into the convolutional weights. This does not affect the outputs of the network and ensures that each convolutional layer is followed by a ReLU [28]. Since ResNet50 consists of residual blocks, we pruned each block step by step, i.e., we pruned ResNet50 from block 2a to 5c sequentially. In this experiment, for each block, we only pruned the convolutional layers that learn the residual mapping. Therefore, we only pruned the first two layers of each block in ResNet50 for simplicity, leaving the block output and projection shortcuts unchanged. Pruning these parts may lead to further compression, but can be quite difficult if not entirely impossible; we leave this exploration as future work. After each block had been pruned, we used Adam [21] with a mini-batch size of 64 and a decayed learning rate to minimize the reconstruction error until the loss stopped dropping. The model was fine-tuned for 20 epochs with a fixed learning rate to regain accuracy.

Table 2: Increase of top-5 error (%) for acceleration of ResNet50 on ILSVRC-12.

Method                                  Increased err.
He et al. [15]                          8.0
Ours (without fine-tuning)              5.2
He et al. [15] (enhanced)               4.0
He et al. [15] (enhanced, fine-tuned)   1.4
Ours (fine-tuned)                       1.0
The acceleration results on ResNet50 are presented in Table 2. Our approach outperforms the state-of-the-art method [15] both before and after fine-tuning. In addition, while pruning ResNet50, He et al. [15] kept 70% and 30% of the channels for sensitive residual blocks and other blocks respectively. Our approach, without these constraints, is simpler and more efficient. Our pruning strategy obtains more representative filters by eliminating redundancy in feature map subspaces, enabling the reconstruction error to be better minimized.
4.3 Generalization Capability of the Pruned Model
To explore the generalization capability of our method, we ran experiments on two challenging vision tasks: object detection and human pose estimation. We used Faster R-CNN [30] on PASCAL VOC 2007 for the former task and CMU-pose [3] on MS-COCO14 [26] for the latter. Both networks were accelerated by our approach under 2× and 4× speedup ratios. The performance is evaluated in terms of mean Average Precision (mAP).
4.3.1 Acceleration for object detection.
Table 3: Acceleration of Faster R-CNN on PASCAL VOC 2007.

Speedup              mAP    mAP drop
Baseline             68.7   -
He et al. [15] (2×)  68.3   0.4
Ours (2×)            68.5   0.2
He et al. [15] (4×)  66.9   1.8
Ours (4×)            67.7   1.0
For convenience, we compressed the Faster R-CNN model with VGG16 as its backbone. Since there is no redundancy in the convolutional layers other than those in VGG16, we used the same parameters as in our VGG16 experiment. To compare with the alternative approach fairly, we followed the setting in [15]: we first performed channel pruning on VGG16 on ImageNet, and then used the pruned model as the pre-trained model for Faster R-CNN. The model acceleration is demonstrated on the PASCAL VOC 2007 object detection benchmark [8], which contains 5k training images and 5k testing images. From Table 3, we observe only a small mAP drop with our model, which outperforms the method of He et al. [15]. Such a small mAP drop has no significant negative effect in real applications, but brings much benefit in efficiency and model complexity reduction.
4.3.2 Acceleration for human pose estimation.
Table 4: Acceleration of CMU-pose on MS-COCO14.

Speedup      mAP    mAP drop
Baseline     57.6   -
Ours (2×)    56.8   0.8
Ours (4×)    55.7   1.9
CMU-pose [3] is a bottom-up approach for multi-person 2D pose estimation. It simultaneously predicts heat maps and part affinity fields (PAFs) for body parts and body limbs respectively, and then joins corresponding detection results into the same group by using the associated PAFs. The architecture of the network consists of two parts. The first 10 layers of VGG19 [32] are used as the first part of the network to extract features. In the second part, the network is split into two branches: one branch predicts the confidence maps, and the other predicts the affinity fields. Each branch is an iterative prediction architecture, which refines the predictions over successive stages with intermediate supervision at each stage. Since there is no redundancy in the last convolutional layer of each stage (i.e., conv5_5_CPM_Lx and Mconv7_stagex_Lx), we pruned the remaining convolutional layers in the same manner as in the single layer pruning strategy. Similar to our VGG16 experiment, we randomly selected samples for filter pruning and reconstruction. After pruning and reconstruction, the model was fine-tuned for several epochs with a fixed learning rate; other parameters were the same as in our VGG16 pruning experiment. The CMU-pose model compression results under 2× and 4× are presented in Table 4. The results show only a 0.8% mAP drop for our 2× model, which showcases the effectiveness of our method.
5 Conclusion
Current deep CNNs are accurate but have high inference costs. In this paper, we have presented a novel filter pruning method for deep neural networks. Since there is an observable linear relationship in different feature map subspaces, we can eliminate the redundancy in convolutional filters by applying subspace clustering to the feature maps. Different from existing filter pruning methods that directly remove filters based on their importance, our approach better retrieves the representative information according to the linear relationship in feature map subspaces, so the most important information can be retained by the mean of each cluster. Our method only requires off-the-shelf libraries. The reduced CNNs are inference-efficient networks that maintain accuracy. Compelling speedup and accuracy are demonstrated on both VGGNet and ResNet on ILSVRC-12. Moreover, experiments on other computer vision tasks also show the practical feasibility of our compression method.
References
 [1] Anwar, S., Sung, W.: Compact deep convolutional neural networks with coarse pruning. arXiv preprint arXiv:1610.09639 (2016)
 [2] Ba, J., Caruana, R.: Do deep nets really need to be deep? In: Advances in Neural Information Processing Systems. pp. 2654–2662 (2014)

 [3] Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Real-time multi-person 2D pose estimation using part affinity fields. In: Computer Vision and Pattern Recognition. pp. 1–9 (2017)

 [4] Chen, W., Wilson, J., Tyree, S., Weinberger, K., Chen, Y.: Compressing neural networks with the hashing trick. In: International Conference on Machine Learning. pp. 2285–2294 (2015)
 [5] Cheng, Y., Yu, F.X., Feris, R.S., Kumar, S., Choudhary, A., Chang, S.F.: An exploration of parameter redundancy in deep networks with circulant projections. In: International Conference on Computer Vision. pp. 2857–2865 (2015)
 [6] Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In: Computer Vision and Pattern Recognition. pp. 1–9 (2017)
 [7] Elhamifar, E., Vidal, R.: Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(11), 2765–2781 (2012)
 [8] Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The PASCAL visual object classes challenge 2007 (VOC 2007) results (2007)
 [9] Figurnov, M., Ibraimova, A., Vetrov, D.P., Kohli, P.: PerforatedCNNs: Acceleration through elimination of redundant convolutions. In: Advances in Neural Information Processing Systems. pp. 947–955 (2016)
 [10] Gong, Y., Liu, L., Yang, M., Bourdev, L.: Compressing deep convolutional networks using vector quantization. Computer Science (2014)
 [11] Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In: International Conference on Learning Representations. pp. 1–13 (2016)
 [12] Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. In: Advances in Neural Information Processing Systems. pp. 1135–1143 (2015)
 [13] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: International Conference on Computer Vision. pp. 2980–2988 (2017)
 [14] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition. pp. 770–778 (2016)
 [15] He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural networks. In: International Conference on Computer Vision. pp. 1–9 (2017)
 [16] Hu, H., Peng, R., Tai, Y.W., Tang, C.K.: Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250 (2016)
 [17] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning. pp. 448–456 (2015)
 [18] Jaderberg, M., Vedaldi, A., Zisserman, A.: Speeding up convolutional neural networks with low rank expansions. In: British Machine Vision Conference. pp. 1–12 (2014)
 [19] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: International Conference on Multimedia. pp. 675–678 (2014)
 [20] Kim, Y.D., Park, E., Yoo, S., Choi, T., Yang, L., Shin, D.: Compression of deep convolutional neural networks for fast and low power mobile applications. In: International Conference on Learning Representations. pp. 1–13 (2016)
 [21] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations. pp. 1–14 (2015)
 [22] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 1097–1105 (2012)
 [23] Lebedev, V., Ganin, Y., Rakhuba, M., Oseledets, I., Lempitsky, V.: Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553 (2014)
 [24] Lebedev, V., Lempitsky, V.: Fast ConvNets using group-wise brain damage. In: Computer Vision and Pattern Recognition. pp. 2554–2564 (2016)
 [25] Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient convnets. In: International Conference on Learning Representations. pp. 1–13 (2017)
 [26] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision. pp. 740–755 (2014)
 [27] Luo, J.H., Wu, J., Lin, W.: ThiNet: A filter level pruning method for deep neural network compression. In: International Conference on Computer Vision. pp. 1–9 (2017)
 [28] Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: International Conference on Machine Learning. pp. 807–814 (2010)
 [29] Newell, A., Huang, Z., Deng, J.: Associative embedding: End-to-end learning for joint detection and grouping. In: Advances in Neural Information Processing Systems. pp. 2274–2284 (2017)
 [30] Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems. pp. 91–99 (2015)
 [31] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115(3), 211–252 (2015)
 [32] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations. pp. 1–13 (2015)
 [33] Sindhwani, V., Sainath, T., Kumar, S.: Structured transforms for small-footprint deep learning. In: Advances in Neural Information Processing Systems. pp. 3088–3096 (2015)
 [34] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., et al.: Going deeper with convolutions. In: Computer Vision and Pattern Recognition. pp. 1–9 (2015)
 [35] Wang, Y., Xu, C., Xu, C., Tao, D.: Beyond filters: Compact feature map for portable deep model. In: International Conference on Machine Learning. pp. 3703–3711 (2017)
 [36] Wen, W., Wu, C., Wang, Y., Chen, Y., Li, H.: Learning structured sparsity in deep neural networks. In: Advances in Neural Information Processing Systems. pp. 2074–2082 (2016)
 [37] Wu, J., Leng, C., Wang, Y., Hu, Q., Cheng, J.: Quantized convolutional neural networks for mobile devices. In: Computer Vision and Pattern Recognition. pp. 4820–4828 (2016)
 [38] Zagoruyko, S., Komodakis, N.: Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In: International Conference on Learning Representations. pp. 1–14 (2017)
 [39] Zhang, X., Zou, J., He, K., Sun, J.: Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(10), 1943–1955 (2016)