With the availability of huge volumes of labeled data and the tremendous power of graphics processing units (GPUs) for parallel computation, convolutional neural networks (CNNs) have achieved state-of-the-art performance in a wide variety of computer vision tasks, such as image classification, object detection, image segmentation, and human pose estimation. As flexible function approximators that scale to millions of parameters, CNNs extract high-level, more discriminative features than traditional elaborately hand-crafted ones.
However, despite their overwhelming success, modern CNNs rely heavily on intensive computing and memory resources. For instance, ResNet-50 has more than 50 convolutional layers, requiring over 95MB of storage and over 3.8 billion floating-point multiplications to process an image. The VGG-16 model has 138.34 million parameters, takes up more than 500MB of storage space, and needs 30.94 billion floating-point operations (FLOPs) to classify a single image. Deploying such complex CNN models is very difficult in scenarios where computing resources are constrained, i.e., where a task must be completed with limited computing time, storage space, and battery power.
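Numbers of this kind follow directly from the layer shapes. As a sanity check, the multiply-accumulate count of a single convolutional layer can be computed as below (a minimal sketch; the function name is ours and the example uses the standard VGG-16 first-layer shapes):

```python
def conv_macs(h_out, w_out, c_in, c_out, k):
    """Multiply-accumulate count of one conv layer:
    every output pixel of every output channel needs c_in*k*k MACs."""
    return h_out * w_out * c_out * c_in * k * k

# VGG-16's first layer: 224x224 output, 3 input channels, 64 filters of 3x3.
print(conv_macs(224, 224, 3, 64, 3))  # 86,704,128 MACs for this layer alone
```

Summing this quantity over all convolutional and fully connected layers gives totals on the order reported above.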
Both academia and industry have developed methods to reduce the number of parameters in CNNs. Ba et al. used class probabilities produced by a pre-trained model as "soft targets" to train a tiny network, successfully transferring knowledge from the cumbersome model to a compact one while maintaining its generalization capability. This student-teacher paradigm has proven effective in compressing CNNs; however, devising a new tiny network is not a trivial task, and how to define the inherent "knowledge" in the teacher model remains an open problem. Tensor factorization based methods [23, 20, 18, 10] factorize an over-parameterized convolutional layer into several lighter ones. However, decomposing the convolutions favored by modern CNNs (e.g., GoogLeNet, ResNet, and Xception) is still intractable. Moreover, tensor decomposition techniques make the target network deeper, incurring more convolution operations.
Filter pruning has been proposed to address these issues. Since the network architecture is unchanged after filter pruning, the obtained model is compatible with any off-the-shelf deep learning framework. In addition, since the volumes of both the convolutional kernels and the intermediate activations shrink, the required memory is reduced remarkably. This strategy also allows complementary compression methods to be employed to obtain an even more compact model. These advantages have attracted increasing research attention. He et al. [27] used statistical information computed from the next layer to guide the filter pruning of the current layer in a greedy way. Such methods directly prune filters, so the information contained in pruned filters can no longer be utilized once they are removed.
In this paper, we propose a novel filter pruning method that leverages the linear relationships in feature map subspaces. As shown in Figure 1, since the different feature maps of a convolutional layer are generated from the same image by different convolutional filters, which are linear operations, the outputs are linearly dependent within different subspaces. Among various subspace analysis approaches, subspace clustering excels at clustering linearly distributed data lying in different subspaces. Motivated by this, instead of measuring the importance of filters [25, 16] or feature maps [15, 27, 1] and removing the trivial kernels, we seek the most representative information by casting filter selection as a subspace clustering problem on intermediate activations. We first cluster the feature maps into subspaces. This allows us to cluster the corresponding filters in the next layer, which take these feature maps as input, as well as the filters in the upper layer that produce these feature maps. We then iterate this process to prune the whole network layer by layer.
In summary, this paper makes the following contributions:
We propose a novel filter pruning method based on the linear relationships in feature map subspaces to compress and accelerate CNN models. To the best of our knowledge, this is the first work to investigate clustering methods for CNN model compression, and the first to employ subspace clustering to accelerate CNNs. Our method prunes the redundant information in feature maps while simultaneously retaining the most representative information.
We devise a flexible filter pruning framework that is independent of the network structure. Thus our method can be well supported by any off-the-shelf deep learning libraries.
Experiments demonstrate that, compared to the original heavy network, the learned portable network achieves comparable accuracy with significantly lower memory usage and computational cost. We achieve consistent improvements on various tasks, outperforming other filter pruning works and obtaining state-of-the-art results.
2 Related Work
Recently, several works on CNN acceleration have focused on network pruning thanks to its apparent benefits. Along this line, various strategies have been adopted, e.g., fine-grained pruning, group-level pruning [36, 24], filter-level pruning [16, 25], layer-level pruning, and feature map pruning [9, 1, 15, 27]. Han et al. introduced a sparsity regularization approach to identify and remove connections with small weights. The major drawback of this fine-grained pruning is the loss of universality and flexibility due to the unstructured pruned parameters, which heavily hinders the transfer of pruned models to real applications.
Group-level pruning approaches alleviate the above problem by learning solid sparsity patterns. Lebedev et al. used a group-wise brain damage process to sparsify convolution kernels, generating one sparsity pattern per group (2D kernels) in convolutional layers so that entire groups with small weights can be removed. Similarly, Wen et al. proposed the Structured Sparsity Learning (SSL) method to regularize filter, channel, filter shape, and depth structures. Zagoruyko et al. demonstrated a layer-level pruning technique: for a network consisting of multiple homogeneous stages (each stage being a set of convolutional blocks), some stages are removed by merging attention maps into a specific cost function, providing a way to combine network pruning with knowledge distillation. Figurnov et al. explored feature map pruning, which keeps only a subset of rows in the patch matrix using solid sparsity patterns, i.e., a perforation mask, and interpolates the missing output values. The perforation mask is predefined and can have a grid or pooling structure. However, this method only shortens the inference time and does not compress the model. Moreover, it is only supported by deep learning frameworks that reduce generalized convolution to matrix multiplication.
Compared with the aforementioned pruning strategies, filter-level pruning is more efficient for accelerating very deep neural networks. For two consecutive convolutional blocks, which are indispensable in all CNNs, pruning the filters of the former block also reduces the number of input channels of the latter block, while the output shape of the latter block remains unchanged. By minimizing the reconstruction errors of feature maps, the outputs at the CNN endpoints are retained. It is thus vital to determine which filters to eliminate. Some methodologies use kernel importance; an intuitive choice is the magnitude of the weights. Li et al. calculated the absolute weight sum of each filter as its importance score. Denseness in filters is an alternative: Hu et al. measured the significance of each filter by its Average Percentage of Zeros (APoZ).
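The two importance criteria above can be sketched in a few lines of NumPy (the function names and the (out, in, k, k) weight layout are our assumptions):

```python
import numpy as np

def l1_importance(filters):
    """Magnitude criterion (Li et al. style): absolute weight sum per filter.
    filters: array of shape (n_out, n_in, k, k)."""
    return np.abs(filters).reshape(filters.shape[0], -1).sum(axis=1)

def apoz(activations):
    """Average Percentage of Zeros (Hu et al. style) per output channel.
    activations: post-ReLU array of shape (batch, channels, h, w)."""
    return (activations == 0).mean(axis=(0, 2, 3))
```

Under either criterion, the filters with the lowest scores are the pruning candidates.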
Another type of approach tackles filter selection by converting it into channel selection on feature maps. Anwar et al. performed over 100 random trials for channel selection; however, each trial is time consuming and laborious, making this approach inflexible on very deep models and large datasets. ThiNet used statistical information computed from the next layer to guide the filter pruning of the current layer, forcing the pruned convolutional layer to mimic the original one by minimizing the reconstruction errors of feature maps. Although the works of He et al. and Luo et al. have similar workflows, their channel selection strategies differ. The main insight of the former is to learn a weight vector for the feature maps: the weight vector is optimized for channel selection with the convolutional filters fixed, and the filters are then optimized to reconstruct the error with the weight vector fixed. In practice, since this two-step iteration is time consuming, the weight vector is optimized multiple times and the filters only once to obtain the final result.
In addition, a variety of techniques compress the convolution filters themselves. A representative approach is low-rank approximation [23, 20, 18, 10], which breaks a convolutional layer into several small pieces by applying tensor decomposition strategies, e.g., CP decomposition and Tucker decomposition. Other methods include parameter quantization [10, 4, 37, 11] and structural matrix design [35, 5, 33].
3 Proposed Method
In this section, we describe a novel filter pruning method based on the linear relationship in feature map subspace. We first introduce the overall framework, then present the details of each step. Finally, we describe our pruning strategy which takes both efficiency and effectiveness into consideration.
3.1 Overall Framework
Filter pruning is an effective method for reducing the complexity of neural networks. There are two key points in filter pruning. The first is filter selection, i.e., we need to seek the most representative filters to retain as much information as possible. The second is reconstruction, i.e., the following feature maps shall be reconstructed using the clustered filters. The main difference between our method and previous works lies in the filter selection strategy. Most existing methods directly prune filters that make weak contributions to the network. The drawback is that the information in the pruned filters cannot be further utilized, which degrades the feature map reconstruction. Our method, in contrast, uses a subspace clustering algorithm to reduce the number of feature maps, which simultaneously eliminates redundant feature maps and retains the most representative information. To prune the input feature maps from $n$ channels down to a desired $n'$, we group the feature maps into $n'$ clusters. Then we calculate the average of the corresponding filters for each feature map cluster. A pre-trained model is pruned layer by layer with a predefined compression rate.
We summarize our filter pruning method for a single convolutional layer in Figure 2. We aim to prune the filters in layer $i$ and layer $i+1$. Once the feature maps of layer $i$ are clustered, we can cluster the corresponding filters in layer $i$ and the corresponding channels of the filters in layer $i+1$. The method has the following key steps:
Feature map clustering. Since the different feature maps of a convolutional layer are generated from the same image by different convolutional filters, which are linear operations, the output feature maps are linearly dependent within different subspaces. We therefore leverage a subspace clustering algorithm to cluster the feature maps.
Filter clustering and reconstruction. After subspace clustering, we can cluster the corresponding input channels of the filters in the next layer, which take these feature maps as input. The filters in the upper layer that produce these feature maps can also be clustered. Then we reconstruct the following feature maps using the pruned filters. Note that the pruned network has exactly the same structure, only with fewer filters; in other words, the original thick network becomes a much thinner model.
Fine-tuning. Fine-tuning is necessary to recover the generalization ability affected by filter pruning, but it is time consuming on large datasets and complex models. For efficiency, we fine-tune for only part of the epochs after all pruned feature maps have been reconstructed.
Iteration. Return to step 1 to prune the next layer.
3.2 Filter Pruning
We propose a two-step algorithm for filter pruning. In the first step, we seek the most representative filters: since linear relationships hold within the different feature map subspaces, we utilize a subspace clustering algorithm to estimate average feature maps that contain as much representative information as possible, and the corresponding filters of each feature map cluster are then clustered. In the second step, we reconstruct the following feature maps from the averaged filters with linear least squares.
Formally, we use $\mathcal{Y} = \mathcal{W} * \mathcal{X}$ to denote the convolution process in layer $i$, where $\mathcal{X} \in \mathbb{R}^{n \times h \times w}$ is the input tensor with $n$ feature maps of size $h \times w$, and $\mathcal{W}$ is a set of $m$ filters with kernels of dimension $n \times k \times k$, which generates a new tensor $\mathcal{Y}$ with $m$ feature maps.
3.2.1 Subspace clustering.
To prune the channels of feature maps from $n$ to a desired $n'$, we cluster the feature maps into $n'$ clusters. Since the different feature maps are generated from the same image, and convolution is a linear operation, the output feature maps of the image are linearly dependent and follow a subspace distribution, i.e., one feature map can be expressed in a subspace as a linear combination of other feature maps in the same subspace. This property is called self-expressiveness. We leverage a sparse subspace clustering algorithm to cluster the feature maps. Mathematically, this idea is formalized as the optimization problem

$$\min_{C} \|C\|_1 \quad \text{s.t.} \quad Z = ZC, \ \mathrm{diag}(C) = 0, \qquad (1)$$

where $Z \in \mathbb{R}^{hw \times n}$ contains the reshaped feature maps (each column is one flattened feature map) and $C \in \mathbb{R}^{n \times n}$ is the self-expressiveness matrix. The subspace clustering algorithm is summarized in Algorithm 1.
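The feature map clustering step can be sketched with NumPy. Note this is a simplified stand-in, not the paper's exact algorithm: it replaces the sparse self-expressiveness program with a ridge-regularized least-squares variant that has a closed form, then applies spectral clustering with a tiny k-means; all names are ours.

```python
import numpy as np

def subspace_cluster(Z, k, lam=1e-2, iters=50):
    """Cluster the n columns of Z (each column = one flattened feature map)
    into k subspaces via self-expressiveness + spectral clustering."""
    n = Z.shape[1]
    G = Z.T @ Z
    # C = argmin ||Z - ZC||_F^2 + lam*||C||_F^2  (closed-form ridge solution)
    C = np.linalg.solve(G + lam * np.eye(n), G)
    np.fill_diagonal(C, 0.0)                      # forbid self-representation
    A = np.abs(C) + np.abs(C).T                   # symmetric affinity
    d = A.sum(axis=1) + 1e-12
    L = np.eye(n) - A / np.sqrt(np.outer(d, d))   # normalized Laplacian
    U = np.linalg.eigh(L)[1][:, :k]               # k smallest eigenvectors
    U /= np.linalg.norm(U, axis=1, keepdims=True) + 1e-12
    # Tiny k-means with deterministic farthest-point initialization.
    centers = [U[0]]
    for _ in range(1, k):
        dist = np.min([((U - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(U[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(iters):
        labels = ((U[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = U[labels == j].mean(axis=0)
    return labels
```

Columns lying in the same subspace receive large mutual affinities, so the spectral step groups them together.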
3.2.2 Filter clustering.
After clustering the feature maps into $n'$ clusters, we denote the indices of the $l$-th cluster by $S_l$. We can then prune the channels of the filters in layer $i+1$ by calculating the average channel of each cluster. For the $j$-th filter $\mathcal{W}_j$, the average channel is calculated from the clustering index as

$$\bar{\mathcal{W}}_j^{(l)} = \frac{1}{|S_l|} \sum_{s \in S_l} \mathcal{W}_j^{(s)}, \qquad (2)$$

where $\mathcal{W}_j^{(s)}$ is the $s$-th channel of filter $\mathcal{W}_j$ and $|S_l|$ is the number of elements in $S_l$. The pruned $j$-th filter is then the concatenation of $\bar{\mathcal{W}}_j^{(1)}, \ldots, \bar{\mathcal{W}}_j^{(n')}$. Applying Eq. (2) to every filter yields the pruned filter set for layer $i+1$.
Naturally, the filters of the upper layer $i$ that produce the feature maps can also be clustered:

$$\bar{\mathcal{F}}^{(l)} = \frac{1}{|S_l|} \sum_{s \in S_l} \mathcal{F}_s, \qquad (3)$$

where $\mathcal{F}_s$ is the $s$-th filter of layer $i$ and $|S_l|$ is the number of elements in $S_l$. The result is a set of $n'$ averaged filters, where $n$ is the number of filters in the original layer $i$.
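The channel and filter averaging described above can be sketched as follows (our function names; the (out, in, k, k) weight layout is an assumption):

```python
import numpy as np

def average_input_channels(W_next, clusters):
    """Prune the next layer's filters by averaging the input channels that
    belong to each feature-map cluster.
    W_next: (n_out, n_in, k, k); clusters: list of index arrays over n_in.
    Returns an array of shape (n_out, len(clusters), k, k)."""
    return np.stack([W_next[:, idx].mean(axis=1) for idx in clusters], axis=1)

def average_filters(W_cur, clusters):
    """Prune the current layer by averaging whole filters within each cluster.
    W_cur: (n_out, n_in, k, k); clusters: list of index arrays over n_out.
    Returns an array of shape (len(clusters), n_in, k, k)."""
    return np.stack([W_cur[idx].mean(axis=0) for idx in clusters], axis=0)
```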
3.2.3 Reconstruction error minimization.
We reconstruct the output feature maps with the pruned filters by linear least squares. This problem can be formulated as

$$\min_{\bar{\mathcal{W}}} \left\| \mathcal{Y} - \bar{\mathcal{W}} * \bar{\mathcal{X}} \right\|_F^2, \qquad (4)$$

where $\|\cdot\|_F$ is the Frobenius norm, $*$ is the convolution operation, and $\bar{\mathcal{X}}$ are the feature maps produced by the pruned layer $i$. The complete filter pruning process is summarized in Algorithm 2.
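In the special case of 1x1 kernels, convolution reduces to a matrix product, so this reconstruction is an ordinary least-squares problem; a minimal sketch (names ours):

```python
import numpy as np

def refit_1x1(X_pruned, Y):
    """Least-squares refit of a 1x1 convolution after pruning.
    X_pruned: (h*w, n_pruned) pruned/averaged input maps, flattened spatially;
    Y: (h*w, m) the original layer's output maps.
    Returns weights (n_pruned, m) minimizing ||Y - X_pruned @ W||_F."""
    return np.linalg.lstsq(X_pruned, Y, rcond=None)[0]
```

For k x k kernels the same refit applies after an im2col unrolling of the pruned inputs.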
3.3 Pruning Strategy
Network architectures can be divided into two types: traditional single-path and multi-path convolutional architectures. AlexNet and VGGNet are representatives of the former, while the latter mainly includes recent networks equipped with novel structures, such as the Inception modules in GoogLeNet or the residual blocks in ResNet.
We use different strategies to prune these two types of networks. For VGG-16, we apply the single-layer pruning strategy to the convolutional layers step by step. For ResNet, its special structure imposes some restrictions. For example, the channel numbers of the residual learning branch and the identity mapping branch in the same block need to be consistent in order to perform the sum operation, so it is hard to directly prune the last convolutional layer in the residual learning branch. Since most of a block's parameters appear in its first two layers, pruning those two layers is a feasible option, as illustrated in Figure 3.
4 Experiments
Our method is tested on combinations of three popular CNN models and three benchmark datasets: VGG-16 on ILSVRC-12 and PASCAL VOC 2007, ResNet-50 on ILSVRC-12, and CMU-pose on MSCOCO-14.
First, we compare several filter selection strategies, including ours, by pruning a single layer of VGG-16 on ILSVRC-12 to demonstrate the efficiency of our algorithm, followed by whole-model pruning of VGG-16. Second, we show the performance of pruning a network with residual architecture, for which ResNet-50 is selected. Finally, we apply our method to Faster R-CNN and CMU-pose to evaluate the generalization capability of our algorithm on the challenging visual tasks of object detection and human pose estimation. All experiments were implemented within Caffe.
The performance of ConvNet compression is evaluated at different speed-up ratios. Assume that $n$ is the number of filters in the original layer and $n'$ is that of the pruned layer; the speed-up ratio is then defined as $r = n / n'$.
4.1 Experiments on VGG-16
VGG-16 is a classic single-path CNN with 13 convolutional layers and has been widely used in vision tasks as a powerful feature extractor. We use single-layer pruning and whole-model pruning to evaluate the efficiency of our method. Effectiveness is measured by the decrease of top-5 accuracy on the validation dataset. The top-5 accuracy of VGG-16 on the ILSVRC-12 validation set is .
4.1.1 Single layer pruning.
We first evaluate the single-layer acceleration performance of our method, comparing our approach with several existing filter selection strategies. sparse vector preserves filters according to importance scores learned by a sparsity regularization method. max response selects channels whose corresponding filters have a high absolute sum of weights. To differentiate our approach from common clustering algorithms, we also include kmeans as a baseline. In addition, to validate the necessity of an elaborate hand-crafted filter selection algorithm, we consider two naive criteria: first k selects the first k channels, and random selects a fixed number of filters at random. After filter pruning, feature map reconstruction is performed without the fine-tuning step. The effectiveness of each method is measured by the reduction of top-5 accuracy on the validation dataset after the reconstruction procedure.
Similar to , we extracted 10 samples per class, i.e., a total of images, to prune channels and minimize reconstruction errors. Images were resized such that the shorter side is 256 pixels. Random cropping was then applied and the resulting image patches were fed into the network. Testing was performed on a crop of pixels at the center of the image. The self-expressiveness matrices for the convolutional layers were learned with a mini-batch size of 128 and a learning rate varied from to over epochs. After pruning filters, we used a batch size of 64 and varied the learning rate from to to minimize the reconstruction error until the loss no longer dropped. All parameters were optimized with Adam. We pruned three convolutional layers, i.e., conv3_1, conv4_1, and conv4_2, with the aforementioned methods, including ours, under several speed-up ratios. The results are shown in Figure 4.
As expected, the loss in accuracy grows with the speed-up ratio: error increases as the speed-up ratio increases. At the same speed-up ratio, our approach consistently outperforms the other methods across different convolutional layers, showing that our subspace clustering based pruning retains more representative information and thus enables more effective feature map reconstruction. Although kmeans is also a clustering method, it cannot exploit the linear relationships between feature maps and obtains only a coarse clustering. Nevertheless, kmeans consistently outperforms the two naive approaches, indicating that clustering-based pruning is feasible in practice. max response suffers a high loss of accuracy, sometimes even worse than first k, probably because it ignores correlations between different filters. Random selection shows good performance, in some cases even better than the heuristic methods; however, it is not robust for feature map reconstruction, making it inapplicable in practice. In summary, the weaknesses of the naive pruning strategies imply that proper filter selection is vital for filter pruning.
It is also noticeable that pruning gradually becomes more difficult from shallow to deep layers. It indicates that whereas shallow layers have much more redundancy, deeper layers make more contribution to the final performance, which is consistent with the observation in  and . This means it is preferable to prune more parameters in shallow layers rather than deep layers to accelerate the model. Moreover, Figure 4 shows that our filter pruning method leads to smaller increase of error compared with other strategies when the deeper layers are compressed.
4.1.2 Whole model pruning.
Table 1. Increase of top-5 error (%) for whole-model acceleration of VGG-16 under three speed-up ratios.

| Method | | | |
|---|---|---|---|
| Jaderberg et al. | - | 9.7 | 29.7 |
| Filter pruning (fine-tuned) | 0.8 | 8.6 | 14.6 |
| He et al. (without fine-tuning) | 2.7 | 7.9 | 22.0 |
| Ours (without fine-tuning) | 2.6 | 3.7 | 8.7 |
| He et al. (fine-tuned) | 0 | 1.0 | 1.7 |
The whole-model acceleration results under , , are demonstrated in Table 1. First, we applied our approach layer by layer sequentially. Then, the pruned model was fine-tuned for 10 epochs with a fixed learning rate to gain higher accuracy. We augmented the data by randomly cropping patches of pixels and mirroring them. Other parameters were the same as in the single-layer pruning experiment. Since the last group of convolutional layers (i.e., convx) affects the classification more significantly, these layers were not pruned. After filter pruning and reconstruction, our approach outperforms the sparse vector method by a large margin, consistent with the single-layer analysis. In addition, our approach produces more compact models since we do not constrain the ratios of remaining channels for shallow layers (conv1_x to conv3_x) and deep layers (conv4_x) as required in . After fine-tuning, our method achieves speed-up without any decrease in accuracy. Under and , the accuracy of our method drops by only and, respectively. Our approach outperforms the state-of-the-art filter-level pruning approaches because it retains as much representative information as possible by exploring the linear relationships between feature maps via subspace clustering, thus recovering a better approximation of the original data in the subsequent output volume.
4.2 Experiments on ResNet
We also tested our method on the recently proposed multi-path network ResNet. We selected ResNet-50 as a representative of the ResNet family. During the implementation, we merged batch normalization into the convolutional weights. This does not affect the outputs of the network and ensures that each convolutional layer is followed by a ReLU. Since ResNet-50 consists of residual blocks, we pruned each block step by step, i.e., from block 2a to 5c sequentially. In this experiment, for each block we only pruned the convolutional layers that learn the residual mapping; that is, we pruned only the first two layers of each block for simplicity, leaving the block output and the projection shortcuts unchanged. Pruning these parts could lead to further compression, but would be quite difficult if not entirely impossible; we leave this exploration as future work. After each block was pruned, we used Adam with a mini-batch size of 64 and varied the learning rate from to to minimize the reconstruction error until the loss no longer dropped. The model was then fine-tuned for 20 epochs with a fixed learning rate to gain higher accuracy.
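Folding batch normalization into the preceding convolution is a standard transformation; a minimal sketch with our own names, assuming per-output-channel BN parameters:

```python
import numpy as np

def fold_bn(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm into the preceding convolution.
    W: (n_out, n_in, k, k) conv weights, b: (n_out,) conv bias;
    gamma, beta, mean, var: per-channel BN parameters.
    Returns (W_folded, b_folded) such that conv followed by BN equals
    a single convolution with the folded parameters."""
    scale = gamma / np.sqrt(var + eps)
    return W * scale[:, None, None, None], beta + scale * (b - mean)
```

Because BN is an affine per-channel map at inference time, the folded network is mathematically identical to the original.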
Table 2. Increase of top-5 error (%) for accelerated ResNet-50.

| Method | Increase of top-5 error (%) |
|---|---|
| He et al. | 8.0 |
| He et al. (enhanced) | 4.0 |
| He et al. (enhanced, fine-tuned) | 1.4 |
The results of accelerating ResNet-50 are presented in Table 2. Our approach outperforms the state-of-the-art method both before and after fine-tuning. In addition, while pruning ResNet-50, He et al. kept 70% and 30% of the channels for sensitive residual blocks and other blocks, respectively. Our approach, without these constraints, is simpler and more efficient. Our pruning strategy obtains more representative filters by eliminating redundancy in the feature map subspaces, enabling the reconstruction error to be better minimized.
4.3 Generalization Capability of the Pruned Model
To explore the generalization capability of our method, we ran experiments on two challenging vision tasks: object detection and human pose estimation. We used Faster R-CNN  on PASCAL VOC 2007 for the former task and CMU-pose  on MSCOCO14  for the latter one. Both networks were accelerated by our approach under and speed-up ratios. The performance is evaluated in terms of mean Average Precision (mAP).
4.3.1 Acceleration for object detection.
Table 3. Acceleration of Faster R-CNN on PASCAL VOC 2007.

| Method | mAP (%) | mAP drop (%) |
|---|---|---|
| He et al. () | 68.3 | 0.4 |
| He et al. () | 66.9 | 1.8 |
For convenience, we compressed the Faster R-CNN model with VGG-16 as its backbone. Since there is no redundancy in the convolutional layers other than those of VGG-16, we used the same parameters as in our VGG-16 experiment. For a fair comparison with the alternative approach, we followed the setting of : we first performed channel pruning on VGG-16 on ImageNet, and then used the pruned model as the pre-trained model for Faster R-CNN. Model acceleration is demonstrated on the PASCAL VOC 2007 object detection benchmark, which contains 5k training images and 5k testing images. From Table 3, we observe a mAP drop with our model, which outperforms the method of He et al. Such a small mAP drop will not have a significant negative effect in real applications, but brings substantial benefits in efficiency and model complexity reduction.
4.3.2 Acceleration for human pose estimation.
CMU-pose is a bottom-up approach for multi-person 2D pose estimation. It simultaneously predicts heat maps and part affinity fields (PAFs) for body parts and body limbs, respectively, and then joins corresponding detections into the same group using the associated PAFs. The network architecture consists of two parts. The first 10 layers of VGG-19 are used as the first part to extract features. In the second part, the network splits into two branches: one predicts the confidence maps and the other predicts the affinity fields. Each branch is an iterative prediction architecture that refines the predictions over successive stages with intermediate supervision at each stage. Since there is no redundancy in the last convolutional layer of each stage (i.e., conv5_5_CPM_Lx and Mconv7_stagex_Lx), we pruned the remaining convolutional layers in the same manner as in the single-layer pruning strategy. As in our VGG-16 experiment, we randomly selected samples for filter pruning and reconstruction. After pruning and reconstruction, the model was fine-tuned for epochs with a fixed learning rate . The other parameters were the same as in our VGG-16 pruning experiment. CMU-pose model compression results under and are demonstrated in Table 4. The results show a 0.8% mAP drop for our model, which showcases the effectiveness of our method.
5 Conclusion
Current deep CNNs are accurate but have high inference costs. In this paper, we have presented a novel filter pruning method for deep neural networks. Observing the linear relationships in different feature map subspaces, we eliminate the redundancy in convolutional filters by applying subspace clustering to feature maps. Unlike existing filter pruning methods that directly remove filters based on their importance, our approach retrieves the representative information according to the linear relationships in feature map subspaces, so the most important information is retained by the mean of each cluster. Our method only requires off-the-shelf libraries. The reduced CNNs are inference-efficient while maintaining accuracy. Compelling speed-ups and accuracy are demonstrated on both VGG-Net and ResNet with ILSVRC-12. Moreover, experiments on other computer vision tasks also show the practical feasibility of our compression method.
-  Anwar, S., Sung, W.: Compact deep convolutional neural networks with coarse pruning. arXiv preprint arXiv:1610.09639 (2016)
-  Ba, J., Caruana, R.: Do deep nets really need to be deep? In: Advances in Neural Information Processing Systems. pp. 2654–2662 (2014)
-  Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2d pose estimation using part affinity fields. In: Computer Vision and Pattern Recognition. pp. 1–9 (2017)
-  Chen, W., Wilson, J., Tyree, S., Weinberger, K., Chen, Y.: Compressing neural networks with the hashing trick. In: International Conference on Machine Learning. pp. 2285–2294 (2015)
-  Cheng, Y., Yu, F.X., Feris, R.S., Kumar, S., Choudhary, A., Chang, S.F.: An exploration of parameter redundancy in deep networks with circulant projections. In: International Conference on Computer Vision. pp. 2857–2865 (2015)
-  Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In: Computer Vision and Pattern Recognition. pp. 1–9 (2017)
-  Elhamifar, E., Vidal, R.: Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(11), 2765–2781 (2012)
-  Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The PASCAL visual object classes challenge 2007 (VOC2007) results (2007)
-  Figurnov, M., Ibraimova, A., Vetrov, D.P., Kohli, P.: Perforatedcnns: Acceleration through elimination of redundant convolutions. In: Advances in Neural Information Processing Systems. pp. 947–955 (2016)
-  Gong, Y., Liu, L., Yang, M., Bourdev, L.: Compressing deep convolutional networks using vector quantization. Computer Science (2014)
-  Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In: International Conference on Learning Representations. pp. 1–13 (2016)
-  Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. In: Advances in Neural Information Processing Systems. pp. 1135–1143 (2015)
-  He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: International Conference on Computer Vision. pp. 2980–2988 (2017)
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition. pp. 770–778 (2016)
-  He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural networks. In: International Conference on Computer Vision. pp. 1–9 (2017)
-  Hu, H., Peng, R., Tai, Y.W., Tang, C.K.: Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250 (2016)
-  Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning. pp. 448–456 (2015)
-  Jaderberg, M., Vedaldi, A., Zisserman, A.: Speeding up convolutional neural networks with low rank expansions. In: British Machine Vision Conference. pp. 1–12 (2014)
-  Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: International Conference on Multimedia. pp. 675–678 (2014)
-  Kim, Y.D., Park, E., Yoo, S., Choi, T., Yang, L., Shin, D.: Compression of deep convolutional neural networks for fast and low power mobile applications. In: International Conference on Learning Representations. pp. 1–13 (2016)
-  Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations. pp. 1–14 (2015)
-  Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 1097–1105 (2012)
-  Lebedev, V., Ganin, Y., Rakhuba, M., Oseledets, I., Lempitsky, V.: Speeding-up convolutional neural networks using fine-tuned cp-decomposition. Computer Science (2014)
-  Lebedev, V., Lempitsky, V.: Fast convnets using group-wise brain damage. In: Computer Vision and Pattern Recognition. pp. 2554–2564 (2016)
-  Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient convnets. In: International Conference on Learning Representations. pp. 1–13 (2017)
-  Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision. pp. 740–755 (2014)
-  Luo, J.H., Wu, J., Lin, W.: Thinet: A filter level pruning method for deep neural network compression. In: International Conference on Computer Vision. pp. 1–9 (2017)
-  Newell, A., Huang, Z., Deng, J.: Associative embedding: End-to-end learning for joint detection and grouping. In: Advances in Neural Information Processing Systems. pp. 2274–2284 (2017)
-  Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems. pp. 91–99 (2015)
-  Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115(3), 211–252 (2015)
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations. pp. 1–13 (2015)
-  Sindhwani, V., Sainath, T., Kumar, S.: Structured transforms for small-footprint deep learning. In: Advances in Neural Information Processing Systems. pp. 3088–3096 (2015)
-  Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., et al.: Going deeper with convolutions. In: Computer Vision and Pattern Recognition. pp. 1–9 (2015)
-  Wang, Y., Xu, C., Xu, C., Tao, D.: Beyond filters: Compact feature map for portable deep model. In: International Conference on Machine Learning. pp. 3703–3711 (2017)
-  Wen, W., Wu, C., Wang, Y., Chen, Y., Li, H.: Learning structured sparsity in deep neural networks. In: Advances in Neural Information Processing Systems. pp. 2074–2082 (2016)
-  Wu, J., Leng, C., Wang, Y., Hu, Q., Cheng, J.: Quantized convolutional neural networks for mobile devices. In: Computer Vision and Pattern Recognition. pp. 4820–4828 (2016)
-  Zagoruyko, S., Komodakis, N.: Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In: International Conference on Learning Representations. pp. 1–14 (2017)
-  Zhang, X., Zou, J., He, K., Sun, J.: Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(10), 1943–1955 (2016)